Traitement en cours

Veuillez attendre...

Paramétrages

Paramétrages

Aller à Demande

1. WO2020117359 - SYSTÈME ET PROCÉDÉ POUR OBTENIR UNE HAUTE RÉSOLUTION DE DONNÉES GÉNÉTIQUES EN UTILISANT DES ENSEMBLES D'ENTRAÎNEMENT

Note: Texte fondé sur des processus automatiques de reconnaissance optique de caractères. Seule la version PDF a une valeur juridique

[ EN ]

What is claimed is:

1. A method of generating an enhanced set of sequences for taxonomical classification, the method comprising:

receiving a plurality of reference sequences, wherein each of the plurality of reference sequences corresponds to a taxonomical classification;

assigning to each of a plurality of supplemental sequences a label corresponding to at least one of the reference sequences;

truncating each of the plurality of supplemental sequences and each of the plurality of reference sequences to a region of interest to thereby generate a truncated set of sequences;

measuring similarity between pairs of truncated sequences in the truncated set of sequences to determine whether the similarity is above a predetermined threshold; and assigning an intermediate taxonomical label to the pair of truncated sequences in the truncated set of sequences when the similarity is above the predetermined threshold to thereby generate an enhanced set of sequences.

2. The method of claim 1, wherein the plurality of reference sequences comprises RNA.

3. The method of claim 1, wherein the plurality of reference sequences comprises DNA.

4. The method of claim 3, wherein the plurality of reference sequences comprises a genome.

5. The method of claim 1, wherein each taxonomical classification comprises a species.

6. The method of claim 1, wherein the plurality of reference sequences comprises sequences of microbial organisms.

7. The method of claim 1, wherein the plurality of supplemental sequences comprises

sequences of microbial organisms.

8. The method of claim 1, wherein the plurality of reference sequences comprises sequences of eukaryotes.

9. The method of claim 1, wherein the plurality of supplemental sequences comprises

sequences of eukaryotes.

10. The method of claim 1, wherein the plurality of supplemental sequences comprises

sequences from a next generation sequencer.

11. The method of claim 1, wherein the supplemental data comprise data from published studies.

12. The method of claim 1, wherein the supplemental data comprise data from isolated

colonies.

13. The method of claim 1, wherein the supplemental data comprise data from a clone

library.

14. The method of claim 1, further comprising:

collecting microbial organisms from a predetermined site;

performing amplification of at least one sequence of the microbial organisms; sequencing the at least one amplified sequence.

15. The method of claim 14, wherein the predetermined site is a part of a human body.

16. The method of claim 15, wherein the part of the human body is an aerodigestive tract.

17. The method of claim 14, wherein the supplemental data comprises the sequenced

amplified sequence of the collected microbial organisms.

18. The method of claim 1, wherein the region of interest comprises a V1-V3 region.

19. The method of claim 1, further comprising removing duplicates in the truncated set of sequences.

20. The method of claim 1, wherein the assigned label comprises the respective taxonomical classification of each of the reference sequences.

21. The method of claim 1, wherein the intermediate taxonomical classification is

hierarchically inferior to a genus.

22. The method of claim 21, wherein the intermediate taxonomical classification is

hierarchically superior to a species.

23. The method of claim 1, wherein the intermediate taxonomical classification is between a genus and a species.

24. The method of claim 1, wherein the predetermined threshold is greater than or equal to 90%.

25. The method of claim 1, wherein the predetermined threshold is greater than or equal to 98.5%.

26. The method of claim 1, wherein the predetermined threshold is determined based on a breadth of taxonomical classification.

27. The method of claim 1, wherein measuring similarity comprises comparing nucleotides between pairs of truncated sequences in the truncated set of sequences.

28. The method of claim 1, further comprising training a machine learning classifier on the enhanced set of data.

29. The method of claim 28, wherein applying the trained machine learning classifier results in an error rate of less than or equal to 1.5%.

30. The method of claim 28, wherein applying the trained machine learning classifier results in an error rate of less than or equal to 0.5%.

31. The method of claim 28, wherein applying the trained machine learning classifier results in a no-call rate of less than or equal to 40%.

32. The method of claim 28, wherein applying the trained machine learning classifier results in a no-call rate of less than or equal to 10%.

33. A computer program product for generating an enhanced set of sequences for

taxonomical classification, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:

receiving a plurality of reference sequences, wherein each of the plurality of reference sequences corresponds to a taxonomical classification;

assigning to each of a plurality of supplemental sequences a label corresponding to at least one of the reference sequences;

truncating each of the plurality of supplemental sequences and each of the plurality of reference sequences to a region of interest to thereby generate a truncated set of sequences;

measuring similarity between pairs of truncated sequences in the truncated set of sequences to determine whether the similarity is above a predetermined threshold; and assigning an intermediate taxonomical label to the pair of truncated sequences in the truncated set of sequences when the similarity is above the predetermined threshold to thereby generate an enhanced set of sequences.

34. The computer program product of claim 33, wherein the plurality of reference sequences comprises RNA.

35. The computer program product of claim 33, wherein the plurality of reference sequences comprises DNA.

36. The computer program product of claim 33, wherein the plurality of reference sequences comprises a genome.

37. The computer program product of claim 33, wherein each taxonomical classification comprises a species.

38. The computer program product of claim 33, wherein the plurality of reference sequences comprises sequences of a microbial organism.

39. The computer program product of claim 33, wherein the plurality of supplemental

sequences comprise sequences of a microbial organism.

40. The computer program product of claim 33, wherein the plurality of reference sequences comprises sequences of eukaryotes.

41. The computer program product of claim 33, wherein the plurality of supplemental

sequences comprises sequences of eukaryotes.

42. The computer program product of claim 33, wherein the plurality of supplemental

sequences comprises sequences from a next generation sequencer.

43. The computer program product of claim 33, wherein the supplemental data comprise data from published studies.

44. The computer program product of claim 33, wherein the supplemental data comprise data from isolated colonies.

45. The computer program product of claim 33, wherein the supplemental data comprise data from a clone library.

46. The computer program product of claim 33, wherein the region of interest comprises a VI -V3 region.

47. The computer program product of claim 33, further comprising removing duplicate in the truncated set of sequences.

48. The computer program product of claim 33, wherein the assigned label comprises the respective taxonomical classification of each of the reference sequences.

49. The computer program product of claim 33, wherein the intermediate taxonomical

classification is hierarchically inferior to a genus.

50. The computer program product of claim 49, wherein the intermediate taxonomical

classification is hierarchically superior to a species.

51. The computer program product of claim 33, wherein the intermediate taxonomical

classification is between a genus and a species.

52. The computer program product of claim 33, wherein the predetermined threshold is greater than or equal to 90%.

53. The computer program product of claim 33, wherein the predetermined threshold is greater than or equal to 98.5%.

54. The computer program product of claim 33, wherein the predetermined threshold is determined based on a breadth of taxonomical classification.

55. The computer program product of claim 33, wherein measuring similarity comprises comparing nucleotides between pairs of truncated sequences in the truncated set of sequences.

56. The computer program product of claim 33, further comprising training a machine

learning classifier on the enhanced set of data.

57. The computer program product of claim 56, wherein applying the trained machine learning classifier results in an error rate of less than or equal to 1.5%.

58. The computer program product of claim 56, wherein applying the trained machine

learning classifier results in an error rate of less than or equal to 0.5%.

59. The computer program product of claim 56, wherein applying the trained machine

learning classifier results in a no-call rate of less than or equal to 40%.

60. The computer program product of claim 56, wherein applying the trained machine

learning classifier results in a no-call rate of less than or equal to 10%.

61. A method for generating a species-labelled set of sequences, the method comprising:

isolating nucleic acids from a microbial source;

amplifying a predetermined region of the nucleic acids to generate a sequence library;

sequencing the sequence library to generate a plurality of sequences; and determining a species for each of the plurality of sequences to thereby generate a species-labelled set of sequences.

62. The method of claim 61, wherein determining a species for each of the plurality of

sequences comprises:

receiving a plurality of reference sequences, wherein each of the plurality of reference sequences corresponds to a taxonomical classification;

assigning to each of a plurality of supplemental sequences a label corresponding to at least one of the reference sequences;

truncating each of the plurality of supplemental sequences and each of the plurality of reference sequences to a region of interest to thereby generate a truncated set of sequences;

measuring similarity between pairs of truncated sequences in the truncated set of sequences to determine whether the similarity is above a predetermined threshold; and assigning an intermediate taxonomical label to the pair of truncated sequences in the truncated set of sequences when the similarity is above the predetermined threshold to thereby generate an enhanced set of sequences.

63. The method of claim 61 or 62, wherein the predetermined region is a 16S rRNA region.

64. The method of claim 63, wherein the 16S rRNA region is a VI -V3 region.

65. The method of any one of claims 61-64, wherein sequencing is performed without

overlapping reads.

66. The method of any one of claims 61-65, wherein amplification comprises applying a one or more primers to the nucleic acids.

67. The method of any one of claims 61-66, wherein amplification is performed for a

predetermined number of cycles.

68. The method of any one of claims 61-67, wherein sequencing comprises sequencing a reversed V3 region first and then sequencing a VI region.

69. The method of any one of claims 61-68, wherein amplification comprises applying an adaptor-ligated library.

70. The method of claim 69, wherein as the adaptor-ligated library comprises a PhiX

genome.

71. The method of any one of claims 61-70, wherein sequencing comprises asymmetric sequencing.

72. The method of claim 71, wherein asymmetric sequencing comprises reading alternating sides of the nucleic acids.

73. The method of claim 71, wherein asymmetric sequencing comprises alternating read lengths of 100 base pairs and 400 base pairs.

74. The method of any one of claims 61-73, wherein determining a species for each of the plurality of sequences comprises applying a RDP algorithm to the plurality of sequences.