1. (WO2017040663) CREATING A TRAINING DATA SET BASED ON UNLABELED TEXTUAL DATA
Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

CLAIMS

1. A method comprising:

obtaining, using one or more processors, a plurality of unlabeled text documents;

obtaining, using the one or more processors, an initial concept;

obtaining, using the one or more processors, keywords from a knowledge source based on the initial concept;

scoring, using the one or more processors, the plurality of unlabeled documents based at least in part on the initial keywords;

determining, using the one or more processors, a categorization of the documents based on the scores;

performing, using the one or more processors, a first feature selection and creating a first vector space representation of each document in a first category and a second category associated with the first feature selection, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and

generating, using the one or more processors, the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category based on the scores and documents belonging to the second category based on the scores, the training set including the first vector space representations for the subset of the obtained unlabeled documents belonging to the first category based on the scores and the second category based on the scores, the first vector space representations serving as one or more labels of the subset of the obtained unlabeled documents belonging to the first category and the second category.
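
The steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (tokenize, keyword_score, build_training_set), the keyword-fraction score, the threshold value, and the feature selection by document-frequency difference are all hypothetical choices made for brevity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens, keywords):
    """Assumed score: fraction of tokens that appear in the keyword set."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)


def build_training_set(docs, keywords, threshold=0.1, num_features=5):
    tokenized = [tokenize(d) for d in docs]
    scores = [keyword_score(t, keywords) for t in tokenized]

    # Categorize each document from its score (first vs. second category).
    categories = ["first" if s >= threshold else "second" for s in scores]

    # Assumed feature selection: keep the terms whose document frequency
    # differs most between the two categories (ties broken alphabetically).
    doc_freq = {"first": Counter(), "second": Counter()}
    for tokens, category in zip(tokenized, categories):
        doc_freq[category].update(set(tokens))
    vocabulary = set(doc_freq["first"]) | set(doc_freq["second"])
    features = sorted(
        vocabulary,
        key=lambda w: (-abs(doc_freq["first"][w] - doc_freq["second"][w]), w),
    )[:num_features]

    # Vector space representation: term counts over the selected features.
    vectors = [[tokens.count(f) for f in features] for tokens in tokenized]
    return features, list(zip(docs, categories, vectors))


docs = [
    "The cat sat on the mat",
    "Stock markets fell sharply",
    "A cat and a kitten play",
]
features, training_set = build_training_set(docs, {"cat", "kitten"})
```

Each entry of training_set pairs a document with its score-based category and its vector over the selected features, which is the form a downstream supervised learner would consume.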

2. The method of claim 1 comprising:

using the vector space representation of each document in the one or more categories as labels for the unlabeled textual documents; and

generating, using the vector space representation of each document in the first and second categories as labels for the unlabeled textual data, a model using a supervised machine learning method.

3. The method of claim 2, wherein the model using the supervised machine learning method is a classifier.

4. The method of claim 2, wherein generating the model using the supervised machine learning method includes training one or more binary classifiers.
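
Claims 2 to 4 leave the supervised method open. As one possible illustration, the sketch below trains a single binary classifier on vectors and score-derived labels. The perceptron is a stand-in chosen here because it needs no external library; the claims do not name it, and train_perceptron and predict are hypothetical names.

```python
def train_perceptron(vectors, labels, epochs=20, learning_rate=1.0):
    """Train a binary perceptron; labels are 1 (positive) or 0 (negative)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = y - predicted
            if error:
                # Move the decision boundary toward the misclassified example.
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias


def predict(model, x):
    """Return 1 if the vector falls on the positive side of the boundary."""
    weights, bias = model
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy vectors: the first feature indicates the positive category.
vectors = [[1, 0], [2, 0], [0, 1], [0, 2]]
labels = [1, 1, 0, 0]
model = train_perceptron(vectors, labels)
```

Training several such classifiers, one per category, would correspond to the "one or more binary classifiers" of claim 4.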

5. The method of claim 1 comprising:

performing a second feature selection and creating a second vector space representation of each document in a third category and a fourth category associated with the second feature selection, the third and fourth categories based on the scores, the second vector space representation serving as one or more additional labels for an associated unlabeled textual document;

using the first and second vector space representations of each document in the one or more categories as labels for the unlabeled textual documents; and

generating, using the vector space representation of each document in the first, second, third and fourth categories as labels for the unlabeled textual data, a model using a multiclass classifier on a union of feature sets from the first and second feature selections.
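
The "union of feature sets" in claim 5 can be illustrated with a short sketch. It covers only the merging of two selected feature lists and the re-vectorization of a document over the merged list; the multiclass classifier itself is not shown. The names union_features and vectorize, and the use of raw term counts, are assumptions.

```python
def union_features(first_features, second_features):
    """Merge two feature lists, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for feature in list(first_features) + list(second_features):
        if feature not in seen:
            seen.add(feature)
            merged.append(feature)
    return merged


def vectorize(tokens, features):
    """Represent a tokenized document as term counts over the given features."""
    return [tokens.count(f) for f in features]


merged = union_features(["cat", "kitten"], ["kitten", "stock"])
vector = vectorize(["cat", "cat", "stock"], merged)
```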

6. The method of claim 1 comprising:

determining the knowledge source based on the initial concept.

7. The method of claim 1, wherein the categorization of documents based on score categorizes a document with a score satisfying a first threshold as positive and categorizes the document as negative when the score of the document does not satisfy the first threshold.

8. The method of claim 1, wherein the categorization of documents based on score categorizes a document with a score satisfying a first threshold as positive and categorizes the document as negative when the score of the document satisfies a second threshold.
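
The difference between claims 7 and 8 can be shown in a few lines. The sketch assumes that "satisfying" the first threshold means a score at or above it and that "satisfying" the second threshold means a score at or below it; the claims do not define either comparison, and categorize is a hypothetical name.

```python
def categorize(score, first_threshold, second_threshold=None):
    """Map a score to 'positive', 'negative', or None (left uncategorized)."""
    if score >= first_threshold:
        return "positive"
    if second_threshold is None:
        # Claim 7: anything that misses the first threshold is negative.
        return "negative"
    if score <= second_threshold:
        # Claim 8: negative only when the second threshold is also satisfied.
        return "negative"
    # Scores between the two thresholds get no category in this sketch.
    return None
```

With two thresholds, documents whose scores fall between them are left out, which is consistent with the training set of claim 1 containing only a subset of the obtained documents.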

9. The method of claim 1, wherein the scores are based in part on the knowledge source from which a first, initial keyword was obtained and based on weights associated with the initial keywords.
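
One way to read claim 9 is that each keyword occurrence contributes its own weight scaled by a weight for the knowledge source it came from. The sketch below assumes that multiplicative combination; the function name, the dictionary layout, and the example source names ("ontology", "thesaurus") are hypothetical.

```python
def weighted_score(tokens, keyword_weights, keyword_sources, source_weights):
    """Sum keyword weights over a document, scaled by each keyword's source weight."""
    total = 0.0
    for token in tokens:
        if token in keyword_weights:
            source = keyword_sources[token]
            # Sources without an explicit weight default to 1.0 (an assumption).
            total += keyword_weights[token] * source_weights.get(source, 1.0)
    return total


score = weighted_score(
    ["cat", "dog", "cat"],
    keyword_weights={"cat": 2.0, "dog": 1.0},
    keyword_sources={"cat": "ontology", "dog": "thesaurus"},
    source_weights={"ontology": 1.5, "thesaurus": 0.5},
)
```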

10. A system comprising:

one or more processors; and

a memory including instructions that, when executed by the one or more processors, cause the system to:

obtain a plurality of unlabeled text documents;

obtain an initial concept;

obtain keywords from a knowledge source based on the initial concept;

score the plurality of unlabeled documents based at least in part on the initial keywords;

determine a categorization of the documents based on the scores;

perform a first feature selection and create a first vector space representation of each document in a first category and a second category associated with the first feature selection, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and

generate the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category based on the scores and documents belonging to the second category based on the scores, the training set including the first vector space representations for the subset of the obtained unlabeled documents belonging to the first category based on the scores and the second category based on the scores, the first vector space representations serving as one or more labels of the subset of the obtained unlabeled documents belonging to the first category and the second category.

11. The system of claim 10, wherein the instructions, when executed by the one or more processors, further cause the system to:

use the vector space representation of each document in the one or more categories as labels for the unlabeled textual documents; and

generate, using the vector space representation of each document in the first and second categories as labels for the unlabeled textual data, a model using a supervised machine learning method.

12. The system of claim 11, wherein the model using the supervised machine learning method is a classifier.

13. The system of claim 11, wherein generating the model using the supervised machine learning method includes training one or more binary classifiers.

14. The system of claim 10, wherein the instructions, when executed by the one or more processors, further cause the system to:

perform a second feature selection and create a second vector space representation of each document in a third category and a fourth category associated with the second feature selection, the third and fourth categories based on the scores, the second vector space representation serving as one or more additional labels for an associated unlabeled textual document;

use the first and second vector space representations of each document in the one or more categories as labels for the unlabeled textual documents; and

generate, using the vector space representation of each document in the first, second, third and fourth categories as labels for the unlabeled textual data, a model using a multiclass classifier on a union of feature sets from the first and second feature selections.

15. The system of claim 10, wherein the instructions, when executed by the one or more processors, further cause the system to:

determine the knowledge source based on the initial concept.

16. The system of claim 10, wherein the categorization of documents based on score categorizes a document with a score satisfying a first threshold as positive and categorizes the document as negative when the score of the document does not satisfy the first threshold.

17. The system of claim 10, wherein the categorization of documents based on score categorizes a document with a score satisfying a first threshold as positive and categorizes the document as negative when the score of the document satisfies a second threshold.

18. The system of claim 10, wherein the scores are based in part on the knowledge source from which a first, initial keyword was obtained and based on weights associated with the initial keywords.

19. A computer-program product comprising a non-transitory computer usable medium including a computer readable program, wherein the computer readable program, when executed on a computer, causes the computer to perform operations comprising:

obtaining a plurality of unlabeled text documents;

obtaining an initial concept;

obtaining keywords from a knowledge source based on the initial concept;

scoring the plurality of unlabeled documents based at least in part on the initial keywords;

determining a categorization of the documents based on the scores;

performing a first feature selection and creating a first vector space representation of each document in a first category and a second category associated with the first feature selection, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and

generating the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category based on the scores and documents belonging to the second category based on the scores, the training set including the first vector space representations for the subset of the obtained unlabeled documents belonging to the first category based on the scores and the second category based on the scores, the first vector space representations serving as one or more labels of the subset of the obtained unlabeled documents belonging to the first category and the second category.

20. The computer-program product of claim 19, wherein the computer readable program, when executed on a computer, causes the computer to perform operations comprising:

using the vector space representation of each document in the one or more categories as labels for the unlabeled textual documents; and

generating, using the vector space representation of each document in the first and second categories as labels for the unlabeled textual data, a model using a supervised machine learning method.