1. WO2020113225 - SYSTEMS AND METHODS FOR IDENTIFYING AN EVENT IN DATA

Note: Text based on automatic optical character recognition processes. Only the PDF version has legal value.

[ EN ]

CLAIMS

1. A method for identifying an event in data, the method comprising:

receiving data at a receiver from a data source, the data comprising one or more documents each comprising text;

performing natural language processing on the received data to generate processed data, the processed data indicating one or more sentences;

generating, based on the data and a first keyword set, a second keyword set having a greater number of keywords than the first keyword set; and

for each keyword set of the first keyword set and each keyword in the second keyword set:

detecting one or more keywords and one or more entities included in the processed data based on the keyword set and an entity set;

determining one or more matched pairs based on the detected one or more keywords and the detected one or more entities;

extracting a sentence from a document based on the one or more sentences indicated by the processed data, where the sentence corresponds to at least one matched pair of the one or more matched pairs and comprises a single sentence or multiple sentences; and

outputting the extracted sentence.

2. The method of claim 1, wherein performing the natural language processing comprises performing, on the data, tokenization, lemmatization, sentencization, or a combination thereof.
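
The three preprocessing operations named in claim 2 can be illustrated with a minimal standard-library sketch. The function names, the regex-based splitting rules, and the tiny `LEMMAS` table are assumptions made for illustration only; the claims do not specify how each step is implemented, and a production system would more likely rely on a trained NLP library.

```python
import re

# Hypothetical toy lemma table; a real system would use a trained lemmatizer.
LEMMAS = {"breached": "breach", "attacks": "attack", "was": "be"}

def sentencize(text):
    """Naive sentencization: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Tokenization: word tokens plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lemmatize(tokens):
    """Lemmatization: lowercase each token and map it through the toy table."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def process(text):
    """Return processed data that indicates the sentences of the input text."""
    return [{"text": s, "lemmas": lemmatize(tokenize(s))} for s in sentencize(text)]

processed = process("Acme Corp was breached. Attacks continued on Monday!")
```

Here `processed` holds one record per detected sentence, which is the property later steps depend on when they extract sentences around a matched pair.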

3. The method of claim 1, further comprising, for each matched pair based on the first keyword set and each matched pair based on the second keyword set:

determining whether the keyword and the entity of the matched pair are included in the same sentence; and

wherein extracting the sentence from the document comprises:

extracting a single sentence based on a determination that the keyword and the entity are included in the same sentence; and

extracting multiple sentences based on a determination that the keyword and the entity are not included in the same sentence.
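
The branch in claim 3 can be sketched as follows. The helper name `extract_for_pair` and the choice to return every sentence from the keyword's sentence through the entity's sentence are assumptions; the claim only requires that multiple sentences be extracted when the two terms fall in different sentences.

```python
def extract_for_pair(sentences, keyword_idx, entity_idx):
    """Extract text for one matched pair.

    sentences: list of sentence strings from the processed data.
    keyword_idx / entity_idx: index of the sentence containing each term.
    """
    if keyword_idx == entity_idx:
        # Keyword and entity share a sentence: extract that single sentence.
        return [sentences[keyword_idx]]
    # Otherwise extract multiple sentences. Assumed here: the inclusive span
    # between the two sentences, in document order.
    lo, hi = sorted((keyword_idx, entity_idx))
    return sentences[lo:hi + 1]
```

For example, with four sentences, a pair located in sentences 3 and 1 would yield sentences 1 through 3, while a pair contained entirely in sentence 1 would yield that sentence alone.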

4. The method of claim 1, further comprising:

initiating a pipe() operation on the data to perform the natural language processing; and

wherein performing natural language processing further comprises:

using a dependency based sentencizer to generate the processed data indicating one or more sentences;

converting the data to a lemmatized format;

using a tokenizer to generate one or more tokens;

using a part-of-speech tagger to generate part-of-speech data;

using a named entity recognizer to identify one or more entities;

or a combination thereof; and

wherein the processed data is in a format that is compatible with Python.

5. The method of claim 1, wherein generating the second keyword set comprises:

generating one or more semantic vectors;

for each keyword of the first keyword set:

determining a semantic vector having a highest similarity score to the keyword;

identifying one or more terms of the determined semantic vector as a candidate term; and

selecting at least one candidate term to be added to the first keyword set to generate the second keyword set.

6. The method of claim 5, wherein:

generating the one or more semantic vectors comprises, for each document of one or more documents corresponding to the data, generating a corresponding semantic vector based on a skipgram model that utilizes words and subwords from the document; and

generating the second keyword set further comprises, for each keyword of the first keyword set:

comparing a similarity score of the determined semantic vector having a highest similarity score to a threshold; and

wherein the semantic vector is used to identify the candidate term based on a determination that the similarity score of the determined semantic vector is greater than or equal to the threshold.
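
The keyword expansion of claims 5 and 6 can be approximated with a small sketch. It simplifies the claims in two labeled ways: the vectors are hand-written toy values keyed by term rather than vectors learned by a skipgram model with subwords, and cosine similarity is assumed as the similarity score. The names `cosine`, `expand_keywords`, and `VECTORS` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(first_set, vectors, threshold):
    """Build a second keyword set from the first set plus accepted candidates."""
    expanded = set(first_set)
    for kw in first_set:
        if kw not in vectors:
            continue
        # Score every term that is not already a seed keyword.
        candidates = [(cosine(vectors[kw], vec), term)
                      for term, vec in vectors.items() if term not in first_set]
        if not candidates:
            continue
        score, term = max(candidates)  # highest similarity score
        if score >= threshold:         # threshold comparison of claim 6
            expanded.add(term)
    return expanded

# Toy vectors, invented for illustration only.
VECTORS = {
    "breach": [0.9, 0.1, 0.0],
    "hack": [0.8, 0.2, 0.1],
    "picnic": [0.0, 0.1, 0.9],
}
second_set = expand_keywords({"breach"}, VECTORS, threshold=0.8)
```

With these toy values the nearest term to "breach" is "hack", which clears the 0.8 threshold and is added, so the second set is larger than the first. Raising the threshold high enough leaves the set unchanged.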

7. The method of claim 1, wherein the extracted sentence is output to an electronic device associated with an analyst, and further comprising:

receiving an input from the analyst responsive to the extracted sentence; and

storing an indication of the input; and

sending a notification corresponding to the extracted sentence, the input, or both; and

wherein the notification includes a link to a data source, a text extraction from a document, the matched pair corresponding to the extracted sentence, the input, an identifier of the analyst, or a combination thereof.

8. A system comprising:

a data ingestor configured to:

receive data at a receiver from a data source, the data comprising one or more documents each comprising text; and

perform natural language processing on the received data to generate processed data, the processed data indicating one or more sentences;

a taxonomy expander configured to:

generate, based on the data and a first keyword set, a second keyword set having a greater number of keywords than the first keyword set;

a term detector configured to:

detect, for each keyword set of the first keyword set and each keyword in the second keyword set, one or more keywords and one or more entities included in the processed data based on the keyword set and an entity set; and

an output generator configured to, for each keyword set of the first keyword set and the second keyword set:

determine one or more matched pairs based on the detected one or more keywords and the detected one or more entities;

extract a sentence from a document based on the one or more sentences indicated by the processed data, where the sentence corresponds to at least one matched pair of the one or more matched pairs and comprises a single sentence or multiple sentences; and

output the extracted sentence.

9. The system of claim 8, further comprising:

a database coupled to the data ingestor, the taxonomy expander, the term detector, the output generator, or a combination thereof.

10. The system of claim 9, wherein the database is configured to store the first keyword set, the second keyword set, the entity set, the processed data, one or more thresholds, one or more extracted sentences, a plurality of matched pairs, or a combination thereof.

11. The system of claim 8, further comprising:

a processor; and

a memory storing instructions executable by the processor to cause the processor to perform one or more operations of the data ingestor, the taxonomy expander, the term detector, the output generator, or a combination thereof.

12. The system of claim 8, further comprising:

an interface configured to enable communication with the data source, an electronic device, or a combination thereof.

13. The system of claim 8, further comprising:

a filter configured to:

for each of the one or more matched pairs based on the first keyword set and each of the one or more matched pairs based on the second keyword set:

determine a distance between the keyword and the entity of the matched pair;

perform a comparison between the determined distance and a threshold; and

determine to retain the matched pair or discard the matched pair based on whether or not the comparison indicates the determined distance is greater than or equal to the threshold.
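
The filter of claim 13 might look like the sketch below. Two points are assumptions not fixed by the claim: distance is measured as the absolute difference between token offsets, and a pair whose distance is greater than or equal to the threshold is the one discarded. The name `filter_pairs` is hypothetical.

```python
def filter_pairs(pairs, threshold):
    """Retain or discard matched pairs based on keyword-entity distance.

    pairs: iterable of (keyword_pos, entity_pos) token offsets.
    """
    kept = []
    for kw_pos, ent_pos in pairs:
        distance = abs(kw_pos - ent_pos)
        if distance >= threshold:
            continue  # assumed direction: far-apart pairs are discarded
        kept.append((kw_pos, ent_pos))
    return kept
```

Under these assumptions, a threshold of 10 tokens keeps a pair at offsets 3 and 5 and drops a pair at offsets 2 and 40.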

14. The system of claim 8, wherein, to determine the one or more matched pairs, the output generator is further configured to:

for each keyword of the detected one or more keywords, identify a corresponding entity of the detected one or more entities that is positioned closest to the corresponding keyword to determine a matched pair for the keyword.
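
The closest-entity pairing of claim 14 can be sketched as a nearest-neighbor lookup over positions. The hit format (term, token position) and the name `match_pairs` are assumptions for illustration; ties go to the entity listed first.

```python
def match_pairs(keyword_hits, entity_hits):
    """Pair each detected keyword with the entity positioned closest to it.

    keyword_hits / entity_hits: lists of (term, token_position) tuples.
    """
    pairs = []
    if not entity_hits:
        return pairs  # no entities detected, so no matched pairs
    for kw, kw_pos in keyword_hits:
        ent, _ = min(entity_hits, key=lambda hit: abs(hit[1] - kw_pos))
        pairs.append((kw, ent))
    return pairs
```

For instance, a keyword at token 4 with entities at tokens 1 and 18 would be paired with the entity at token 1.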

15. A computer-based tool including non-transitory computer readable media having stored thereon computer code which, when executed by a processor, causes a computing device to perform operations comprising:

receiving data at a receiver from a data source, the data comprising one or more documents each comprising text;

performing natural language processing on the received data to generate processed data, the processed data indicating one or more sentences;

generating, based on the data and a first keyword set, a second keyword set having a greater number of keywords than the first keyword set;

for each keyword of the first keyword set and each keyword in the second keyword set:

detecting one or more keywords and one or more entities included in the processed data based on the keyword set and an entity set;

determining one or more matched pairs based on the detected one or more keywords and the detected one or more entities;

extracting a sentence from a document based on the one or more sentences indicated by the processed data, where the sentence corresponds to at least one matched pair of the one or more matched pairs and comprises a single sentence or multiple sentences; and

outputting the extracted sentence.

16. The computer-based tool of claim 15, wherein the operations further comprise:

receiving a selection of a first event category of multiple event categories; and

retrieving the first keyword set based on the selection of the first event category.

17. The computer-based tool of claim 16, wherein the multiple event categories comprise cybersecurity, terrorism, legal/non-compliance, or a combination thereof.

18. The computer-based tool of claim 16, wherein the operations further comprise:

receiving a selection of a second event category of the multiple event categories;

retrieving a third keyword set based on the selection of the second event category;

generating, based on the third keyword set, a fourth keyword set having a greater number of keywords than the third keyword set;

for each keyword set of the third keyword set and each keyword in the fourth keyword set:

detecting one or more keywords and one or more entities included in the processed data based on the keyword set and the entity set; and

determining one or more matched pairs based on the detected one or more keywords and the detected one or more entities.

19. The computer-based tool of claim 15, wherein:

the sentence comprises the multiple sentences; and

the multiple sentences comprise a sentence that includes the at least one matched pair, a sentence that includes the keyword of the at least one matched pair, a sentence preceding the sentence that includes the keyword of the at least one matched pair, a sentence following the sentence with the keyword of the at least one matched pair, a sentence that includes the entity of the at least one matched pair, a sentence preceding the sentence that includes the entity of the at least one matched pair, a sentence following the sentence with the entity of the at least one matched pair, or a combination thereof.
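
One of the combinations listed in claim 19 (the sentence containing a term together with the sentence before it and the sentence after it) can be sketched as a clamped window over the sentence list. The name `context_window` and the default window of one sentence on each side are assumptions; the claim permits other combinations.

```python
def context_window(sentences, idx, before=1, after=1):
    """Return the sentence at idx plus neighboring sentences.

    The window is clamped so it never runs past either end of the document.
    """
    start = max(0, idx - before)
    end = min(len(sentences), idx + after + 1)
    return sentences[start:end]
```

Calling it once with the keyword's sentence index and once with the entity's sentence index would cover the preceding and following sentences for both terms of a matched pair.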

20. The computer-based tool of claim 15, wherein:

the data source comprises a streaming data source, news data, a database, or a combination thereof; and

the entity set indicates an individual, a company, a government, an organization, or a combination thereof.