WO2020115580 - SYSTEM AND METHOD FOR PROMOTER PREDICTION IN HUMAN GENOME

Note: Text based on automatic optical character recognition processes. Only the PDF version has legal value.

[ EN ]

WHAT IS CLAIMED IS:

1. A method for training a deep neural network model (100) based on a known genome sequence (500), the method comprising:

receiving (1100) the known genome sequence (500);

training (1102) the deep neural network model (100) with a current negative set (502) obtained from the known genome sequence (500);

applying (1104) the deep neural network model (100) to the known genome sequence (500) and recording false positive sets;

selecting (1106) a subset of the new false positive sets (508);

updating (1108) the current negative set (502) with the new false positive sets (508); and

repeating (1110) the steps of training, applying, selecting and updating until a number of the new false positive sets is smaller than a given threshold.
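The loop recited in Claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `refine_negatives`, `toy_train` and `toy_score`, the 0.5 decision cut-off, the `top_k` selection size and the `max_rounds` guard are all assumptions introduced here, and the toy "model" merely memorises its negatives so that the example runs without any deep learning library. "New" false positives are taken to mean non-promoter windows that the model calls promoters and that are not yet in the negative set.

```python
import random


def refine_negatives(positives, background, train, score,
                     initial_size=2, top_k=3, threshold=1,
                     max_rounds=50, seed=0):
    """Iteratively grow the negative set with high-scoring false positives.

    positives  : windows known to contain a promoter
    background : unique windows known NOT to contain a promoter
    train      : callable(positives, negatives) -> model   (hypothetical)
    score      : callable(model, window) -> float in [0, 1] (hypothetical)
    """
    rng = random.Random(seed)
    # The starting negative set is drawn at random (compare Claim 8).
    negatives = set(rng.sample(sorted(background), initial_size))
    history = []  # number of new false positives found in each round
    model = None
    for _ in range(max_rounds):
        model = train(positives, negatives)
        # Apply the model to the known sequence and record false positives.
        false_pos = [w for w in background
                     if w not in negatives and score(model, w) >= 0.5]
        history.append(len(false_pos))
        if len(false_pos) < threshold:
            break  # stopping rule: too few new false positives remain
        # Select the highest-scoring subset (compare Claim 3) and update.
        false_pos.sort(key=lambda w: score(model, w), reverse=True)
        negatives.update(false_pos[:top_k])
    return model, negatives, history


def toy_train(positives, negatives):
    # Stand-in for network training: just remember the negatives.
    return frozenset(negatives)


def toy_score(model, window):
    # Stand-in for inference: anything not memorised looks like a promoter.
    return 0.0 if window in model else 1.0


windows = [f"W{i}" for i in range(10)]
model, negatives, history = refine_negatives(["P0"], windows,
                                             toy_train, toy_score)
print(history)  # false-positive count per round
```

With a real network, `train` and `score` would wrap model fitting and inference; the control flow around them stays the same.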

2. The method of Claim 1, wherein the negative set includes a region of the known genome sequence that does not include a promoter, a positive set includes a region of the known genome sequence that includes a promoter, and a false positive set includes a region of the known genome sequence that does not include a promoter but is found by the deep neural network model to correspond to a promoter.

3. The method of Claim 1, further comprising:

calculating a score for plural sets of the known genome sequence based on plural convolutional neural networks layers of the deep neural network model; and

selecting the subset of the new false positive sets based on a highest score of the plural sets.

4. The method of Claim 3, wherein the score is calculated by a softmax layer, the softmax layer has two neurons, a first neuron which represents an input sequence being a promoter (pp) and a second neuron which represents an input sequence being a non-promoter (pnp).

5. The method of Claim 4, wherein the score is given by a difference of pp and pnp, to which the unity is added, and a result is divided by 2.
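The score of Claims 4 and 5 can be written out directly. The sketch below is illustrative only: the function names `softmax2` and `promoter_score` and the example logits are assumptions, not taken from the claims. It also shows a consequence of the formula: because the two softmax outputs sum to 1, (pp - pnp + 1) / 2 reduces to pp itself.

```python
import math


def softmax2(logit_promoter, logit_non_promoter):
    """Two-neuron softmax; returns (pp, pnp), which sum to 1."""
    m = max(logit_promoter, logit_non_promoter)  # subtract max for stability
    e_p = math.exp(logit_promoter - m)
    e_np = math.exp(logit_non_promoter - m)
    total = e_p + e_np
    return e_p / total, e_np / total


def promoter_score(pp, pnp):
    """Difference of pp and pnp, plus unity, divided by 2 (Claim 5)."""
    return (pp - pnp + 1.0) / 2.0


pp, pnp = softmax2(2.0, -1.0)  # example logits, chosen arbitrarily
score = promoter_score(pp, pnp)
# Since pnp = 1 - pp, the score equals pp up to rounding error.
print(abs(score - pp) < 1e-12)
```

The score therefore lies in [0, 1], with 0.5 marking the point where the two neurons agree.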

6. The method of Claim 1, wherein the step of updating further comprises: removing a subset of the current negative set.

7. The method of Claim 1, further comprising:

training the deep neural network model for promoters having a TATA box to obtain a TATA+ trained model;

training the deep neural network model for promoters not having a TATA box to obtain a TATA- trained model.

8. The method of Claim 1, wherein the current negative set is originally randomly obtained from the known genome sequence.

9. A method for determining a transcription start site of a promoter in a genome sequence, the method comprising:

receiving (1200) a genome sequence (500);

training (1202) a deep neural network model (100) based on an interactive and adaptive approach that updates a current negative set (502) based on determined false positives (508);

applying (1204) the genome sequence (500) to the deep neural network model (100); and

determining (1206) the transcription start site of the promoter in the genome sequence (500) based on the updated current negative set (502).

10. The method of Claim 9, wherein the training step comprises:

training the deep neural network model with a current negative set obtained from a known genome sequence;

applying the deep neural network model to the known genome sequence and recording false positive sets;

selecting a subset of the new false positive sets;

updating the current negative set with the new false positive sets; and

repeating the steps of training, applying, selecting and updating until a number of the new false positive sets is smaller than a given threshold.

11. The method of Claim 10, wherein the negative set includes a region of the known genome sequence that does not include a promoter, a positive set includes a region of the known genome sequence that includes a promoter, and a false positive set includes a region of the known genome sequence that does not include a promoter but is found by the deep neural network model to correspond to a promoter.

12. The method of Claim 10, further comprising:

calculating a score for plural sets of the known genome sequence based on plural convolutional neural networks layers of the deep neural network model; and

selecting the subset of the new false positive sets based on a highest score of the plural sets.

13. The method of Claim 12, wherein the score is calculated by a softmax layer, the softmax layer has two neurons, a first neuron which represents an input sequence being a promoter (pp) and a second neuron which represents an input sequence being a non-promoter (pnp).

14. The method of Claim 13, wherein the score is given by a difference of pp and pnp, to which the unity is added, and a result is divided by 2.

15. The method of Claim 10, wherein the step of updating further comprises: removing a subset of the current negative set.

16. The method of Claim 10, further comprising:

training the deep neural network model for promoters having a TATA box to obtain a TATA+ trained model;

predicting promoters having the TATA box by using the TATA+ trained model;

training the deep neural network model for promoters not having a TATA box to obtain a TATA- trained model;

predicting promoters not having the TATA box by using the TATA- trained model; and

combining the promoters having the TATA box with the promoters not having the TATA box to determine the transcription start site of the promoter in the genome sequence.
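Claim 16 does not state how the two prediction sets are combined. One possible reading, shown here purely as an assumption, is a union of the predicted positions in which a position reported by both models keeps the higher score. The function name `combine_predictions` and the (position, score) tuple layout are hypothetical.

```python
def combine_predictions(tata_plus, tata_minus):
    """Merge (position, score) predictions from the TATA+ and TATA- models.

    Assumed rule: take the union of positions; when both models report
    the same position, keep the larger score.
    """
    best = {}
    for position, score in list(tata_plus) + list(tata_minus):
        if position not in best or score > best[position]:
            best[position] = score
    # Return candidate transcription start sites ordered by position.
    return sorted(best.items())


# Made-up positions and scores, for illustration only.
merged = combine_predictions([(100, 0.9), (250, 0.6)],
                             [(250, 0.8), (400, 0.7)])
print(merged)  # [(100, 0.9), (250, 0.8), (400, 0.7)]
```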

17. A computing device that implements a deep neural network model (100), which comprises:

a processor (1302) having an input layer (104) configured to receive (1100) a known genome sequence (500) and plural convolutional neural networks, CNN, layers (110-1, 112), each connected to the input layer (104), and configured to train with a current negative set (502) obtained from the known genome sequence (500);

a memory (1304, 1306) connected to the processor (1302) and configured to record false positive sets when the deep neural network model (100) is applied to the known genome sequence (500); and

the processor (1302) being configured to select (1106) a subset of the new false positive sets (508), to update (1108) the current negative set (502) with the new false positive sets (508), and to repeat (1110) the steps of training, applying, selecting and updating until a number of the new false positive sets is smaller than a given threshold.

18. The computing device of Claim 17, wherein the negative set includes a region of the known genome sequence that does not include a promoter, a positive set includes a region of the known genome sequence that includes a promoter, and a false positive set includes a region of the known genome sequence that does not include a promoter but is found by the deep neural network model to correspond to a promoter.

19. The computing device of Claim 17, wherein the processor further has a softmax layer connected to the CNN layers and the softmax layer is configured to calculate a score for plural sets of the known genome sequence.

20. The computing device of Claim 17, wherein each of the CNN layers has a filter and no two filters have the same size.
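The arrangement of Claims 17 and 20, several CNN layers each fed directly by the input layer and each using a filter of a different size, can be illustrated in plain Python. This is a sketch under stated assumptions: the one-hot encoding over A, C, G, T, the filter widths (3, 5, 7), the random weights, the ReLU activation and the names `one_hot`, `conv1d_valid` and `parallel_branches` are all chosen here for illustration and do not come from the claims.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv1d_valid(encoded, kernel):
    """Slide one filter (k rows x 4 weights) over the sequence, then ReLU.

    Uses 'valid' padding, so the output has len(encoded) - k + 1 values.
    """
    k = len(kernel)
    out = []
    for start in range(len(encoded) - k + 1):
        total = sum(encoded[start + i][c] * kernel[i][c]
                    for i in range(k) for c in range(len(BASES)))
        out.append(max(0.0, total))
    return out


def parallel_branches(seq, filter_sizes=(3, 5, 7), seed=0):
    """Apply one filter per branch; every branch reads the same input."""
    if len(set(filter_sizes)) != len(filter_sizes):
        raise ValueError("no two filters may have the same size")
    rng = random.Random(seed)
    encoded = one_hot(seq)
    outputs = {}
    for k in filter_sizes:
        # Untrained random weights stand in for learned filters.
        kernel = [[rng.uniform(-1.0, 1.0) for _ in BASES] for _ in range(k)]
        outputs[k] = conv1d_valid(encoded, kernel)
    return outputs


branches = parallel_branches("TATAAAGGCGCGTATA")
print({k: len(v) for k, v in branches.items()})  # {3: 14, 5: 12, 7: 10}
```

Different filter widths let each branch respond to motifs of a different length in the same input window.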