(WO2007008871) METHOD AND APPARATUS FOR REPRESENTATION OF UNSTRUCTURED DATA
Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

WHAT IS CLAIMED IS:
1. A system for representing and searching a document including unstructured data, the system comprising:
a data store storing the document;
a processor executing program instructions, the program instructions including generating a binary representation of the unstructured data in the document and searching the binary representation in response to a search request, the processor generating an output based on the search; and
a memory storing the binary representation of the unstructured data in a plurality of data structures, the data structures including:
a first binary bit vector identifying each unstructured data included in the document; and
a plurality of second binary bit vectors, wherein for each unstructured data identified in the first binary bit vector, a corresponding second binary bit vector provides one or more position identifiers for the associated unstructured data.
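
Illustrative sketch (not part of the claims): the two data structures recited in claim 1 could be held in memory roughly as follows. The claims do not prescribe an encoding, so this sketch assumes Python integers as arbitrary-length bit vectors; the class name BinaryRepresentation, the method name record, and the sample identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BinaryRepresentation:
    # First bit vector: bit i is 1 when the unit of unstructured data
    # with identifier i occurs in the document.
    first_vector: int = 0
    # Second bit vectors: one per identifier set in first_vector;
    # bit p is 1 when that unit occurs at position identifier p.
    second_vectors: Dict[int, int] = field(default_factory=dict)

    def record(self, unit_id: int, position: int) -> None:
        """Mark one occurrence of unit_id at the given position."""
        self.first_vector |= 1 << unit_id
        current = self.second_vectors.get(unit_id, 0)
        self.second_vectors[unit_id] = current | (1 << position)


# Assumed input: identifiers of the units found at positions 0, 1, 2, 3.
rep = BinaryRepresentation()
for position, unit_id in enumerate([3, 2, 1, 3]):
    rep.record(unit_id, position)
```

In this sketch only identifiers that actually occur receive a second bit vector, so identifier 0 has no entry in second_vectors.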

2. The system of claim 1, wherein the unstructured data is a word.

3. The system of claim 2 further comprising a dictionary of words, the dictionary providing a unique word identifier for each word in the dictionary, each position of the first binary bit vector being associated with a particular word identifier provided by the dictionary.

4. The system of claim 3, wherein the search request includes a search word, and the processor retrieves a word identifier for the search word.

5. The system of claim 4, wherein the program instructions further include:
determining whether a first bit value has been set at a position in the first binary bit vector identified by the word identifier for the search word;
retrieving a corresponding second binary bit vector from the plurality of binary bit vectors based on the determination; and
obtaining one or more document positions based on one or more position identifiers provided by the retrieved second binary bit vector.
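
Illustrative sketch (not part of the claims): the single-word lookup of claims 3 to 5 might proceed as below. The dictionary contents, the pre-built vectors, and the function names set_bit_positions and find_word are assumptions made for the example, and Python integers stand in for the bit vectors.

```python
# Hypothetical dictionary mapping each word to a unique word identifier.
dictionary = {"data": 0, "search": 1, "vector": 2, "bit": 3}

# Assumed pre-built representation of a four-word document.
first_vector = 0b1110                      # identifiers 1, 2 and 3 occur
second_vectors = {1: 0b0100, 2: 0b0010, 3: 0b1001}


def set_bit_positions(vector):
    """Return the indices of the 1-bits in an int used as a bit vector."""
    positions = []
    index = 0
    while vector:
        if vector & 1:
            positions.append(index)
        vector >>= 1
        index += 1
    return positions


def find_word(word):
    """Return the position identifiers at which the search word occurs."""
    word_id = dictionary.get(word)          # retrieve the word identifier
    if word_id is None:
        return []                           # word is not in the dictionary
    if not (first_vector >> word_id) & 1:   # test the first bit vector
        return []                           # word is absent from the document
    return set_bit_positions(second_vectors[word_id])


hits = find_word("bit")
```

The test against the first bit vector lets the lookup stop before any second bit vector is touched when the word does not occur.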

6. The system of claim 5, wherein the program instructions further include:
obtaining a range of position identifiers associated with the document;
storing 1-bit values in a temporary vector for the range of position identifiers; and
performing a logical AND operation based on the temporary vector and the retrieved second binary bit vector.
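
Illustrative sketch (not part of the claims): restricting a word's positions to one document, as in claim 6, can be expressed as a mask-and-AND. The half-open range convention (start inclusive, end exclusive) and the function names range_mask and positions_in_range are assumptions of this example.

```python
def range_mask(start, end):
    """Temporary vector with 1-bits at positions start .. end - 1."""
    return ((1 << (end - start)) - 1) << start


def positions_in_range(position_vector, start, end):
    """Keep only the occurrences that fall inside [start, end)."""
    return position_vector & range_mask(start, end)


# Assumed data: a word occurring at position identifiers 1, 4 and 9,
# and a document covering position identifiers 3 through 7.
word_vector = (1 << 1) | (1 << 4) | (1 << 9)
restricted = positions_in_range(word_vector, 3, 8)
```

Only the occurrence at position 4 survives the AND, because positions 1 and 9 lie outside the document's range.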

7. The system of claim 1, wherein the search request is for a phrase including a plurality of search words.
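
Illustrative sketch (not part of the claims): claim 7 does not state how a phrase is matched. One common technique with per-word position vectors, assumed here rather than taken from the claims, is to shift each word's vector by its offset in the phrase and AND the results. The function name phrase_starts and the sample vectors are hypothetical.

```python
def phrase_starts(position_vectors):
    """Return positions where the phrase begins.

    position_vectors[k] is the position bit vector (a Python int) of the
    k-th word of the phrase; the list must contain at least one vector.
    """
    matches = position_vectors[0]
    for offset, vector in enumerate(position_vectors[1:], start=1):
        # A phrase starting at p needs word k at position p + k.
        matches &= vector >> offset
    return [i for i in range(matches.bit_length()) if (matches >> i) & 1]


# Assumed text: "bit vector search bit vector" at positions 0..4.
bit_vec = (1 << 0) | (1 << 3)
vector_vec = (1 << 1) | (1 << 4)
search_vec = 1 << 2

two_word = phrase_starts([bit_vec, vector_vec])
three_word = phrase_starts([bit_vec, vector_vec, search_vec])
```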

8. The system of claim 1, wherein the data structures further include:
a third binary bit vector indicating a first position identifier of unstructured data at the beginning of the document, and a second position identifier of unstructured data at the beginning of the next document.
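
Illustrative sketch (not part of the claims): one possible reading of the third bit vector in claim 8 is a vector with a 1-bit at the position identifier where each document begins, so that consecutive 1-bits delimit a document. That reading, the function name document_range, and the total_positions parameter are assumptions of this example.

```python
def document_range(boundary_vector, doc_index, total_positions):
    """Return (start, end) position identifiers of the doc_index-th document.

    start is the first position of the document; end is the first position
    of the next document, or total_positions for the last document.
    """
    starts = [i for i in range(total_positions) if (boundary_vector >> i) & 1]
    start = starts[doc_index]
    if doc_index + 1 < len(starts):
        end = starts[doc_index + 1]
    else:
        end = total_positions
    return start, end


# Assumed collection: three documents beginning at position identifiers
# 0, 5 and 8, with 12 position identifiers in total.
boundary_vector = (1 << 0) | (1 << 5) | (1 << 8)
second_doc = document_range(boundary_vector, 1, 12)
```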

9. The system of claim 1, wherein the data structures further include:
a fourth binary vector indicating a document position of each unstructured data in the document.

10. A computer-implemented method for representing and searching a document including unstructured data, the method comprising:
generating, under control of the computer, a binary representation of the unstructured data in the document;
storing the binary representation of the unstructured data in a plurality of data structures, the data structures including:
a first binary bit vector identifying each unstructured data stored in the document; and
a plurality of second binary bit vectors, wherein for each unstructured data identified in the first binary bit vector, a corresponding second binary bit vector provides one or more position identifiers for the associated unstructured data;
receiving a search request;
searching, under control of the computer, the binary representation in response to the search request; and
generating, under control of the computer, an output based on the search.

11. The method of claim 10, wherein the unstructured data is a word.

12. The method of claim 11, wherein a dictionary of words provides a unique word identifier for each word in the dictionary, and each position of the 1-bits in the first binary bit vector is associated with a particular word identifier provided by the dictionary.

13. The method of claim 12, wherein the search request includes a search word, and the processor retrieves a word identifier for the search word.

14. The method of claim 13 further comprising:
determining whether a first bit value has been set at a position in the first binary bit vector identified by the word identifier for the search word;
retrieving a corresponding second binary bit vector from the plurality of binary bit vectors based on the determination; and
obtaining one or more document positions based on one or more position identifiers provided by the retrieved second binary bit vector.

15. The method of claim 14 further comprising:
obtaining a range of position identifiers associated with the document;
storing 1-bit values in a temporary vector for the range of position identifiers; and
performing a logical AND operation based on the temporary vector and the retrieved second binary bit vector.

16. The method of claim 10, wherein the search request is for a phrase including a plurality of search words.

17. The method of claim 10, wherein the data structures further include:
a third binary bit vector indicating a first position identifier of unstructured data at the beginning of the document, and a second position identifier of unstructured data at the beginning of the next document.

18. The method of claim 10, wherein the data structures further include:
a fourth binary vector indicating a document position of each unstructured data in the document.

19. A method for representing unstructured data included in a document, the method comprising:
parsing the document;
obtaining a unique identifier for each unstructured data included in the document;
storing a first bit-value at each position of a first binary bit vector identified by each obtained unique identifier;
assigning a unique position identifier for each unstructured data included in the document;
retrieving a second binary bit vector for each unique identifier for which the first bit-value is set in the first binary bit vector; and
storing a second bit-value at a position of a particular second binary bit vector identified by the position identifier assigned to the unstructured data associated with a particular unique identifier associated with the particular second binary bit vector.
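
Illustrative sketch (not part of the claims): the steps of claim 19 could be carried out in a single pass over the document, as below. The regular-expression tokenizer, the dictionary that assigns a new identifier on first sight of a word, and the function name represent are assumptions of this example; Python integers stand in for the bit vectors.

```python
import re


def represent(text, dictionary):
    """Build the first bit vector and the per-word position vectors."""
    first_vector = 0
    second_vectors = {}
    # Parse the document (assumed tokenizer: lowercase alphanumeric runs).
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assign a unique position identifier to each word occurrence.
    for position_id, word in enumerate(words):
        # Obtain the unique identifier, adding the word if it is new.
        word_id = dictionary.setdefault(word, len(dictionary))
        # Store the first bit-value in the first bit vector.
        first_vector |= 1 << word_id
        # Retrieve the word's second bit vector and set the position bit.
        vector = second_vectors.get(word_id, 0)
        second_vectors[word_id] = vector | (1 << position_id)
    return first_vector, second_vectors


dictionary = {}
first_vector, second_vectors = represent("Bit vector search, bit vector.", dictionary)
```

With this input the dictionary ends up assigning identifiers 0, 1 and 2 to "bit", "vector" and "search" in order of first appearance.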

20. The method of claim 19, wherein the unstructured data is a word.