WO2020109277 - METHOD AND SYSTEM FOR CREATING A DOMAIN-SPECIFIC TRAINING CORPUS FROM GENERIC DOMAIN CORPORA

Note: Text based on automatic optical character recognition processes. Only the PDF version has legal value.

Claims

What is claimed is:

1. A method (100) for generating a domain-specific training set, comprising:

generating (130) a generic corpus comprising a plurality of tokenized documents obtained from one or more sources, comprising: (i) parsing (132) a document retrieved from the generic corpus or from another source of documents; (ii) preprocessing (134) the parsed document; (iii) tokenizing (136) the preprocessed document; and (iv) storing (138) the tokenized document in the generic corpus;

generating (140) an ontology database of tokenized entries, comprising: (i) parsing (142) an ontology entry retrieved from an ontology; (ii) preprocessing (144) the parsed entry; (iii) tokenizing (146) the preprocessed entry; and (iv) storing (148) the tokenized entry in the ontology database;

querying (150), using one or more domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus;

identifying (160), based on the query, a plurality of tokenized documents specific to the domain; and

storing (170), in a training set database, the identified plurality of tokenized documents as a training set specific to the domain.
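The steps of claim 1 can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the claims leave parsing, preprocessing, tokenizing, and matching unspecified, so tag stripping, lowercasing, regex tokenization, and contiguous phrase matching are all assumptions, as are every function name, the in-memory dictionaries standing in for the databases, and the toy documents.

```python
import re

def parse(raw: str) -> str:
    """Parsing step (assumed): strip markup-like tags from the raw text."""
    return re.sub(r"<[^>]+>", " ", raw)

def preprocess(text: str) -> str:
    """Preprocessing step (assumed): lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text: str) -> list[str]:
    """Tokenizing step (assumed): split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text)

def build_store(raw_items: dict[str, str]) -> dict[str, list[str]]:
    """Parse, preprocess, tokenize, and store each item under its key."""
    return {key: tokenize(preprocess(parse(raw))) for key, raw in raw_items.items()}

def contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def create_training_set(corpus, ontology, domain_entry_ids):
    """Query the corpus with the chosen ontology entries; keep matching documents."""
    queries = [ontology[entry_id] for entry_id in domain_entry_ids]
    return {doc_id: toks for doc_id, toks in corpus.items()
            if any(contains_phrase(toks, q) for q in queries)}

# Toy inputs (hypothetical).
raw_docs = {
    "doc1": "<p>The patient presented with Atrial Fibrillation.</p>",
    "doc2": "<p>Quarterly revenue grew across all regions.</p>",
    "doc3": "Echo showed mitral valve prolapse and no atrial fibrillation.",
}
raw_ontology = {"C001": "Atrial fibrillation", "C002": "Mitral Valve Prolapse"}

generic_corpus = build_store(raw_docs)        # generating (130)
ontology_db = build_store(raw_ontology)       # generating (140)
training_set_db = {                           # querying, identifying, storing (150-170)
    "cardiology": create_training_set(generic_corpus, ontology_db, ["C001", "C002"]),
}
print(sorted(training_set_db["cardiology"]))  # ['doc1', 'doc3']
```

Applying the same parse, preprocess, and tokenize functions to both documents and ontology entries keeps the two token streams comparable, which is what makes the query step a simple sequence comparison in this sketch.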

2. The method of claim 1, further comprising:

retrieving (180) a domain-specific training set from the training set database; and training (190) a machine learning algorithm directed to the domain of the retrieved domain-specific training set, thus generating a domain-specific trained algorithm.
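As a rough illustration of claim 2, the sketch below retrieves a stored training set and fits a very small model to it. The claim does not name any particular machine learning algorithm, so the add-one-smoothed unigram model here is purely a stand-in; the dictionary used as the training set database, the function names, and the sample tokens are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set database: domain name -> list of tokenized documents.
training_set_db = {
    "cardiology": [
        ["patient", "with", "atrial", "fibrillation"],
        ["mitral", "valve", "prolapse", "with", "atrial", "fibrillation"],
    ],
}

def retrieve_training_set(db, domain):
    """Retrieving step (180): fetch the stored training set for one domain."""
    return db[domain]

def train_unigram_model(documents):
    """Training step (190), assumed model: add-one-smoothed unigram log-probabilities."""
    counts = Counter(tok for doc in documents for tok in doc)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens

    def log_prob(token):
        # Counter returns 0 for tokens never seen in training.
        return math.log((counts[token] + 1) / (total + vocab_size))

    return log_prob

log_prob = train_unigram_model(retrieve_training_set(training_set_db, "cardiology"))
print(log_prob("atrial") > log_prob("revenue"))  # True
```

The resulting model scores in-domain vocabulary higher than out-of-domain vocabulary, which is the sense in which it is "directed to the domain" in this toy setting.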

3. The method of claim 2, wherein a user identifies a domain-specific training set to retrieve from the training set database.

4. The method of claim 1, further comprising the steps of:

retrieving (120) one or more documents from a plurality of sources; and storing (122) content from the one or more documents in a corpus database.

5. The method of claim 1, wherein the one or more domain-specific tokenized entries used to query the tokenized documents in the generic corpus are selected by a user.

6. The method of claim 1, wherein the domain-specific tokenized entry used to query the tokenized documents in the generic corpus further comprises one or more synonyms for that entry.
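One way to read claim 6 is as query expansion: the entry's synonyms are treated as alternative token sequences, and a document matches if it contains any of them. The sketch below assumes that reading. The entry layout (`label` plus `synonyms`), the function names, and the example terms are hypothetical and not taken from the patent.

```python
# Hypothetical tokenized ontology entry together with its tokenized synonyms.
ontology_entry = {
    "label": ["myocardial", "infarction"],
    "synonyms": [["heart", "attack"], ["mi"]],
}

def expand_query(entry):
    """Return the entry label plus every synonym as alternative token sequences."""
    return [entry["label"], *entry["synonyms"]]

def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def matches_entry(doc_tokens, entry):
    """A document matches if it contains the label or any synonym."""
    return any(contains_phrase(doc_tokens, variant) for variant in expand_query(entry))

docs = {
    "a": ["he", "survived", "a", "heart", "attack"],
    "b": ["the", "heart", "of", "the", "matter"],
}
hits = [doc_id for doc_id, toks in docs.items() if matches_entry(toks, ontology_entry)]
print(hits)  # ['a']
```

Document "a" is found only through the synonym "heart attack", while "b" is rejected because it shares a single token but not a full variant.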

7. The method of claim 1, wherein identifying a plurality of tokenized documents specific to the domain comprises matching using rule-based or feature-based matching.
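The claims do not define "rule-based" or "feature-based" matching, so the following contrast is only one plausible interpretation. Here the rule is assumed to be "every token of the entry appears in the document", and the feature-based variant is assumed to be cosine similarity over bag-of-words counts with an arbitrary threshold of 0.25. All names and values are illustrative.

```python
import math
from collections import Counter

def rule_based_match(doc_tokens, entry_tokens):
    """Rule (assumed): every token of the entry must appear in the document."""
    return set(entry_tokens) <= set(doc_tokens)

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def feature_based_match(doc_tokens, entry_tokens, threshold=0.25):
    """Feature-based (assumed): similarity of token-count features above a threshold."""
    return cosine_similarity(doc_tokens, entry_tokens) >= threshold

entry = ["atrial", "fibrillation"]
doc = ["atrial", "flutter", "was", "ruled", "out"]

print(rule_based_match(doc, entry))     # False: "fibrillation" is missing
print(feature_based_match(doc, entry))  # True: similarity is about 0.32
```

The example document is related to the domain but does not contain the exact entry, which shows how a similarity-based matcher can admit documents that a strict rule would reject, at the cost of needing a tuned threshold.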

8. The method of claim 1, wherein the training set database comprises a plurality of stored domain-specific training sets.
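A training set database holding several domain-specific sets, from which one is retrieved by name (as in claim 3), might be sketched like this. The patent does not specify a storage technology; the in-memory SQLite database, the `training_sets` table schema, the JSON encoding of tokens, and the domain names are all assumptions made for illustration.

```python
import json
import sqlite3

# In-memory database standing in for the training set database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_sets ("
    "domain TEXT, doc_id TEXT, tokens TEXT, PRIMARY KEY (domain, doc_id))"
)

def store_training_set(conn, domain, documents):
    """Store tokenized documents under a domain; tokens are serialized as JSON."""
    rows = [(domain, doc_id, json.dumps(tokens)) for doc_id, tokens in documents.items()]
    conn.executemany("INSERT OR REPLACE INTO training_sets VALUES (?, ?, ?)", rows)
    conn.commit()

def retrieve_training_set(conn, domain):
    """Return the training set for the domain identified by the caller."""
    cursor = conn.execute(
        "SELECT doc_id, tokens FROM training_sets WHERE domain = ? ORDER BY doc_id",
        (domain,),
    )
    return {doc_id: json.loads(tokens) for doc_id, tokens in cursor}

store_training_set(conn, "cardiology", {"doc1": ["atrial", "fibrillation"]})
store_training_set(conn, "oncology", {"doc7": ["tumor", "margin"], "doc9": ["biopsy"]})

print(retrieve_training_set(conn, "oncology"))
# {'doc7': ['tumor', 'margin'], 'doc9': ['biopsy']}
```

Keying rows by domain and document identifier lets the same document appear in more than one training set without duplication conflicts.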

9. A system (400) for generating a domain-specific training set, comprising:

a corpus database (462) comprising a plurality of documents obtained from one or more sources;

an ontology database (463) comprising an ontology; and

a processor (420) configured to:

(i) generate a generic corpus comprising a plurality of tokenized documents obtained from one or more sources, comprising: parsing documents from the corpus database; preprocessing the parsed document; tokenizing the preprocessed document; and storing the tokenized document in the generic corpus;

(ii) generate an ontology database of tokenized entries, comprising: parsing an ontology entry retrieved from an ontology; preprocessing the parsed entry; tokenizing the preprocessed entry; and storing the tokenized entry in the ontology database;

(iii) query the tokenized documents in the generic corpus using one or more domain-specific tokenized entries from the ontology database;

(iv) identify, based on the query, a plurality of tokenized documents specific to the domain; and

(v) store the identified plurality of tokenized documents as a training set specific to the domain.

10. The system of claim 9, further comprising a processor configured to: (i) retrieve a domain-specific training set from the training set database; and (ii) train a machine learning algorithm directed to the domain of the retrieved domain-specific training set, thus generating a domain-specific trained algorithm.

11. The system of claim 10, further comprising a user interface (440), and wherein the user interface is configured to receive an identification of a domain-specific training set to retrieve from the training set database.

12. The system of claim 9, wherein the one or more domain-specific tokenized entries used to query the tokenized documents in the generic corpus are selected by a user via a user interface (440).

13. The system of claim 9, wherein the domain-specific tokenized entry used to query the tokenized documents in the generic corpus further comprises one or more synonyms for that entry.

14. The system of claim 9, wherein identifying a plurality of tokenized documents specific to the domain comprises matching using rule-based or feature-based matching.

15. The system of claim 9, wherein the training set database comprises a plurality of stored domain-specific training sets.