1. (WO2019032918) STRUCTURAL PREDICTION OF PROTEINS
Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

CLAIMS

WHAT IS CLAIMED IS:

1. A computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for determining a tolerance or intolerance of one or more amino acids of a protein to a variation, the method or steps comprising,

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution on the selective pressure using step (a);

c) determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS); and

d) determining the tolerance of one or more amino acids of a protein to a variation based on the 3DTS.

2. A computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for determining druggability of a protein, the method or steps comprising,

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution using step (a); and

c) determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS); and

d) determining the protein as being druggable if one or more amino acids in the protein is determined as being intolerant to the variation based on the 3DTS.

3. A computer readable medium comprising computer-executable instructions, which, when executed by a processor, cause the processor to carry out a method or a set of steps for determining drug resistance potential of a variant protein, the method or steps comprising,

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution using step (a); and

c) determining a second selective pressure on 3D features of the variant protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS); and

d) determining the variant protein as being potentially drug resistant if one or more amino acids in the protein is determined as being tolerant to the variation based on the 3DTS.

4. The computer readable media of any one of claims 1-3, wherein the likelihood function comprises (a) a background mutation rate, (b) a fraction of single nucleotide changes that result in an amino acid change, (c) a number of individuals or samples with called nucleotides, and (d) an adjustment factor, whose estimate serves as a tolerance score.

5. The computer readable media of claim 4, wherein the background mutation rate is estimated using genetic data and a reference genome.

6. The computer readable media of claim 4, wherein the background mutation rate is estimated using the synonymous variation rate.

7. The computer readable media of claim 4, wherein the background mutation rate may be estimated using the intergenic variation rate.

8. The computer readable media of claim 7, wherein the intergenic variation rate may be estimated genome-wide.

9. The computer readable media of claim 8, wherein the intergenic variation rate may be estimated specific to a chromosome.

10. The computer readable media of any one of claims 4 to 9, wherein the background mutation rate may vary on a per nucleotide basis dependent upon the nucleotide's context.

11. The computer readable media of claim 8, wherein the nucleotide context comprises a heptamer representing 3 nucleotides up and downstream of a reference nucleotide.
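For illustration only, the heptamer context recited above can be sketched in Python as follows. The function names `heptamer_context` and `context_mutation_rate`, the `rates` lookup table, and the `default_rate` fallback are hypothetical and are not taken from the application; the sketch simply extracts the reference nucleotide together with its 3 upstream and 3 downstream neighbours.

```python
def heptamer_context(sequence, index, flank=3):
    """Return the 7-mer centred on `index`, or None near the sequence ends.

    `sequence` is a reference nucleotide string; `flank` is the number of
    neighbours taken on each side (3 gives a heptamer).
    """
    if index - flank < 0 or index + flank >= len(sequence):
        return None  # not enough flanking nucleotides for a full context
    return sequence[index - flank:index + flank + 1].upper()


def context_mutation_rate(sequence, index, rates, default_rate):
    """Look up a per-nucleotide background rate keyed by heptamer context.

    `rates` is a hypothetical dict mapping heptamers to mutation rates;
    `default_rate` is used when the context is missing or incomplete.
    """
    context = heptamer_context(sequence, index)
    if context is None:
        return default_rate
    return rates.get(context, default_rate)
```

Positions closer than 3 nucleotides to either end have no full heptamer, so the sketch falls back to a default rate there; how such positions are actually handled is not stated in the claims.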

12. The computer readable media of claim 5, wherein the fraction of single nucleotide changes that result in an amino acid change include amino acid changes that result in significant physiochemical changes.

13. The computer readable media of claim 4, where the background mutation rates are estimated by maximizing the likelihood fixing the s parameter to 1.

14. The computer readable media of claim 4, wherein the likelihood function is evaluated as the sum of Bernoulli trials over the loci corresponding to the 3D feature.

15. The computer readable media of claim 14, wherein each Bernoulli trial represents an individual's variation information at a given locus/nucleotide.

16. The computer readable media of claim 15, wherein the sum of Bernoulli trials results in a binomial distribution comprising a Poisson approximation.

17. The computer readable media of claim 16, wherein the Poisson approximation estimates the probability of observing at least one missense mutation in the 3D feature using Le Cam's approximation.
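A minimal sketch of how the likelihood terms of claims 4 and 14 to 17 might fit together is given below. The claims do not state the formula, so the parameterisation is an assumption: the expected number of missense observations at a locus is taken as lambda = s * mu * b * n, where s is the adjustment factor, mu the background mutation rate, b the fraction of single nucleotide changes that alter the amino acid, and n the number of individuals with called nucleotides. Under the Poisson approximation to the sum of n Bernoulli trials, the probability of observing at least one missense variant is then 1 - exp(-lambda). All function names are hypothetical.

```python
import math


def prob_at_least_one_missense(s, mu, b, n):
    """P(at least one missense variant at a locus) under a Poisson approximation.

    Assumed parameterisation (not taken from the claims): lambda = s * mu * b * n.
    """
    lam = s * mu * b * n
    return -math.expm1(-lam)  # equals 1 - exp(-lam), stable for small lam


def log_likelihood(s, loci):
    """Log-likelihood of s over the loci of one 3D feature.

    `loci` is an iterable of (observed, mu, b, n) tuples, where `observed`
    is True if at least one missense variant was seen at that locus.
    """
    total = 0.0
    for observed, mu, b, n in loci:
        lam = s * mu * b * n
        if observed:
            if lam <= 0.0:
                return float("-inf")  # a variant was seen but the model forbids it
            total += math.log(-math.expm1(-lam))
        else:
            total -= lam  # log(exp(-lam)): no variant observed
    return total
```

With no missense variants observed across a feature, this log-likelihood decreases as s grows, so smaller values of s (stronger intolerance) are favoured.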

18. The computer readable media of claim 1, wherein the likelihood function is combined with a prior distribution to produce a posterior distribution representing the probabilities of a selective pressure on a 3D locus.

19. The computer readable media of claim 18, wherein the mean of the posterior distribution represents a 3D Tolerance Score (3DTS).

20. The computer readable media of claim 1, wherein the protein structure or model is representative of an X-ray crystal structure, an NMR structure, or a CRYOEM structure.

21. The computer readable media of claim 1, wherein the protein structure or model is representative of a similarity model, a homology model, or an ab initio model.

22. The computer readable media of any one of claims 1 to 21, wherein an intolerant feature is defined as a 3DTS value between the 0th and the 20th percentile of all 3DTS scores for the proteome; or wherein a tolerant feature is defined as a 3DTS value between the 50th and the 100th percentile of all 3DTS scores for the proteome.

23. The computer readable media of claim 22, wherein the proteome comprises at least 1000 proteins, particularly at least 5000 proteins, more particularly at least 10000 proteins, especially at least 20000 proteins, and specifically all the proteins of the proteome of a subject which encodes the protein.
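The proteome-wide percentile definition of claim 22 could be applied roughly as sketched below. The function name and the "intermediate" label for scores between the 20th and 50th percentiles are assumptions made for illustration; the claims only define the intolerant and tolerant bands.

```python
import numpy as np


def classify_by_proteome_percentile(scores, intolerant_max_pct=20.0,
                                    tolerant_min_pct=50.0):
    """Label each 3DTS value relative to all scores in the proteome.

    Scores at or below the 20th percentile are labelled "intolerant",
    scores at or above the 50th percentile "tolerant", and everything
    in between "intermediate" (a label assumed here for illustration).
    """
    scores = np.asarray(scores, dtype=float)
    low = np.percentile(scores, intolerant_max_pct)
    high = np.percentile(scores, tolerant_min_pct)
    labels = np.full(scores.shape, "intermediate", dtype=object)
    labels[scores <= low] = "intolerant"
    labels[scores >= high] = "tolerant"
    return labels
```

In practice `scores` would hold the 3DTS values of every feature across the proteome, so the thresholds are fixed once and then reused for any individual protein.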

24. The computer readable media of any one of claims 1 to 19, wherein an intolerant feature is defined as the lowest ranked 3DTS values within a protein; or wherein a tolerant feature is defined as the highest ranked 3DTS values within a protein.

25. The computer readable media of claim 24, wherein the lowest rank 3DTS values include the bottom 25%, particularly bottom 10%, more particularly bottom 5% and especially bottom 2% of all ranked 3DTS values within a protein.
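The within-protein ranking of claims 24 and 25 can be illustrated with the short sketch below. The function name, the dictionary input keyed by residue position, and the choice to round the cut-off up so that at least one position is always returned are assumptions, not details from the application.

```python
import math


def lowest_ranked_positions(scores_by_position, fraction=0.10):
    """Return the positions whose 3DTS falls in the bottom `fraction` of a protein.

    `scores_by_position` maps a residue position to its 3DTS value.
    The cut-off is rounded up, so a non-empty protein yields at least one position.
    """
    ranked = sorted(scores_by_position, key=scores_by_position.get)
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:k]
```

For example, calling the function with `fraction=0.25`, `0.10`, `0.05` or `0.02` corresponds to the bottom 25%, 10%, 5% and 2% bands named in claim 25.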

26. The computer readable media of any one of claims 1-3, wherein the posterior distribution on the selective pressure is determined using a likelihood function and/or assuming a uniform prior function.

27. A system for determining a tolerance or intolerance of one or more amino acids of a protein to a variation, comprising,

a) a background module for determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) a distribution module for determining a posterior distribution using step (a); and

c) a scoring module for determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS) and wherein the 3DTS is used to determine the tolerance or intolerance of one or more amino acids of a protein to a variation.

28. A system for determining druggability of a protein, comprising,

a) a background module for determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) a distribution module for determining a posterior distribution using step (a); and

c) a scoring module for determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS) and wherein the 3DTS is used to determine the druggability of the protein.

29. A system for determining drug resistance potential of a variant protein, comprising,

a) a background module for determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) a distribution module for determining a posterior distribution using step (a); and

c) a scoring module for determining a second selective pressure on 3D features of the variant protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS) and wherein the 3DTS is used to determine the drug resistance potential of the variant protein.

30. The system of any one of claims 27-29, wherein the posterior distribution on the selective pressure is determined using a likelihood function and/or assuming a uniform prior function.

31. A method of determining a three-dimensional tolerance score (3DTS) for one or more amino acids of a protein comprising:

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution using step (a);

c) determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine the 3DTS; and

d) determining the tolerance of one or more amino acids of a protein to a variation based on the 3DTS.

32. A method of determining druggability of a protein, comprising,

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution using step (a); and

c) determining a second selective pressure on 3D features of the protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS); and

d) determining the protein as being druggable if one or more amino acids in the protein is determined as being intolerant to the variation based on the 3DTS.

33. A method of determining drug resistance potential of a variant protein, comprising,

a) determining a likelihood of observing missense variation, given a first selective pressure, from a genetic variation data of a plurality of individuals and features from 3D protein structures and/or models;

b) determining a posterior distribution using step (a); and

c) determining a second selective pressure on 3D features of the variant protein by determining the mean of the posterior distribution, wherein the mean is used to determine a three-dimensional tolerance score (3DTS); and

d) determining the variant protein as being potentially drug resistant if one or more amino acids in the protein is determined as being tolerant to the variation based on the 3DTS.

34. The method of any one of claims 31-33, wherein the posterior distribution on the selective pressure is determined using a likelihood function and/or assuming a uniform prior function.

35. The method of any one of claims 31-33, wherein the likelihood function contains terms defining a background mutation rate, a fraction of single nucleotide changes that result in an amino acid change, a number of individuals with called nucleotides, and an adjustment factor, whose estimate serves as a tolerance score.

36. The method of claim 35, wherein the background mutation rate is estimated using genetic data and a reference genome.

37. The method of claim 35, wherein the background mutation rate is estimated using the synonymous variation rate.

38. The method of claim 35, wherein the background mutation rate is estimated using the intergenic variation rate.

39. The method of claim 38, wherein the intergenic variation rate may be estimated genome-wide.

40. The method of claim 38, wherein the intergenic variation rate may be estimated specific to a chromosome.

41. The method of claim 35, wherein the background mutation rate may vary on a per nucleotide basis dependent upon the nucleotide's context.

42. The method of claim 41, wherein the nucleotide context can be a heptamer representing 3 nucleotides up and downstream of a reference nucleotide.

43. The method of claim 35, wherein the fraction of single nucleotide changes that may result in an amino acid change may be modulated dependent on those amino acid changes that may result in significant physiochemical changes.

44. The method of claim 35, wherein the background mutation rates are estimated by maximizing the likelihood fixing the s parameter to 1.

45. The method of claim 44, wherein the likelihood function is evaluated as the sum of Bernoulli trials over the loci corresponding to the 3D feature.

46. The method of claim 45, wherein each Bernoulli trial represents an individual's variation information at a given locus/nucleotide.

47. The method of claim 45 or 46, wherein the sum of Bernoulli trials results in a binomial distribution comprising a Poisson approximation.

48. The method of claim 47, wherein the Poisson approximation estimates the probability of observing at least one missense mutation in the 3D feature using Le Cam's approximation.

49. The method of claim 48, wherein the likelihood function may be combined with a prior distribution to produce a posterior distribution representing the probabilities of a selective pressure on a 3D locus.

50. The method of claim 49, wherein the mean of the posterior distribution may represent a 3D Tolerance Score (3DTS).

51. The method of any one of claims 31-33, wherein the protein structure or model represents an X-ray crystal structure, an NMR structure, a CRYOEM structure or a combination thereof.

52. The method of any one of claims 31-33, wherein a model represents a homology model, an ab initio model or a combination thereof.

53. The method of any one of claims 31 to 52, wherein an intolerant feature is defined as a 3DTS value between the 0th and the 20th percentile of all 3DTS scores for the proteome.

54. The method of any one of claims 31 to 52, wherein an intolerant feature is defined as the lowest 3DTS values within a protein.

55. The method of any one of claims 31 to 33, wherein step (a) comprises determining a synonymous global mutation rate, defined as parameter p, which is the expected number of mutations at a locus assuming all mutations at a locus are neutral.

56. The method of any one of claims 31 to 33, wherein step (a) comprises determining a synonymous local mutation rate, which estimates heterogeneity across the genome, but is only evaluated on a single amino acid chain of the protein.

57. The method of any one of claims 31 to 33, further comprising determining an intergenic variation rate.

58. The method of claim 57, wherein the intergenic variation rate comprises a global intergenic variation rate or a chromosome-specific intergenic variation rate.

59. The method of any one of claims 31 to 33, wherein step (b) comprises determining a propensity towards missense variation.

60. The method of claim 59, wherein the propensity towards missense variation for a nucleotide is determined as a statistical probability of a single nucleotide variation leading to a missense variant of the protein (parameter b), based on the protein isoform for the 3D structure, the transcript encoding the protein isoform and the reference genome encoding the transcript for the locus.
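As a rough illustration of parameter b, the sketch below computes, for one nucleotide of one codon, the fraction of the three possible single nucleotide substitutions that change the encoded amino acid. It is a simplified, assumption-laden example: it uses the standard genetic code on an isolated codon rather than a full isoform, transcript and reference genome, and it excludes substitutions that create a stop codon from the missense count. The names `CODON_TABLE` and `missense_propensity` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def missense_propensity(codon, position):
    """Fraction of single nucleotide changes at `position` (0, 1 or 2) that are missense.

    A change counts as missense when the mutated codon encodes a different
    amino acid; changes to a stop codon are not counted (an assumption).
    """
    codon = codon.upper()
    reference = CODON_TABLE[codon]
    missense = 0
    for base in BASES:
        if base == codon[position]:
            continue  # not a substitution
        mutated = codon[:position] + base + codon[position + 1:]
        encoded = CODON_TABLE[mutated]
        if encoded != reference and encoded != "*":
            missense += 1
    return missense / 3.0
```

The third position of a glycine codon such as GGT gives 0.0 because every substitution is synonymous, whereas the first position of ATG gives 1.0.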

61. The method of any one of claims 31 to 33, wherein step (c) comprises determining tolerance to missense variation, which is defined by the mean of a posterior distribution, calculated through numerical integration using the Gauss-Legendre quadrature or estimated by importance sampling.

62. The method of claim 61, wherein step (c) comprises determining a mean of the posterior distribution by combining a prior distribution which assumes all missense variants are tolerant and is set as a uniform distribution and likelihood function which is defined as the sum of a series of Bernoulli trials.
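The posterior mean of claims 61 and 62 could be computed by Gauss-Legendre quadrature along the lines of the sketch below. Several details are assumptions rather than statements from the application: the uniform prior is taken over the interval [0, 1], the quadrature order is set to 64, and `log_lik` stands for any user-supplied log-likelihood of the selective pressure s. The function name is hypothetical; `numpy.polynomial.legendre.leggauss` supplies the nodes and weights.

```python
import numpy as np


def posterior_mean_uniform_prior(log_lik, lower=0.0, upper=1.0, order=64):
    """Posterior mean of s under a uniform prior on [lower, upper].

    Computes the ratio of integrals  (integral of s * L(s) ds) / (integral of L(s) ds)
    with Gauss-Legendre quadrature. `log_lik` maps a float s to log L(s).
    """
    nodes, weights = np.polynomial.legendre.leggauss(order)
    # Map the nodes and weights from [-1, 1] to [lower, upper].
    s = 0.5 * (upper - lower) * nodes + 0.5 * (upper + lower)
    w = 0.5 * (upper - lower) * weights
    ll = np.array([log_lik(value) for value in s], dtype=float)
    lik = np.exp(ll - ll.max())  # rescale to avoid underflow; cancels in the ratio
    return float(np.sum(w * s * lik) / np.sum(w * lik))
```

Because the uniform prior is constant, it cancels in the ratio and only the likelihood needs to be evaluated at the quadrature nodes. Importance sampling, the alternative named in claim 61, is not shown.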

63. The method of any one of claims 31 to 33, wherein step (c) comprises implementing a machine learning algorithm.

64. A method of identifying druggability of a protein comprising:

a) determining a global mutation rate, wherein the global mutation rate is an expected probability of any given nucleotide of the protein to vary;

b) determining a variant specific mutation rate for a missense mutation of a nucleic acid encoding the one or more amino acids of the protein, wherein the variant specific mutation rate is an observed probability of the missense mutation to occur in a sample nucleotide data set; and

c) determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is less than the global mutation rate; and

d) identifying the protein as being druggable if one or more amino acids in the protein is determined as being intolerant to the variation.

65. The method of claim 64, wherein the amino acid is determined as being intolerant based on a rank based metric or a percentile metric.

66. The method of claim 65, wherein the rank based metric or percentile metric is determined in relation to a proteome comprising at least 5K proteins, at least 10K proteins, at least 15K proteins or the entire proteome of a subject.

67. The method of claim 64, wherein the one or more amino acids of the protein that are intolerant to variation comprises a binding pocket.

68. The method of claim 64, wherein the binding pocket comprises an active site, an allosteric site, an epitope, a cofactor binding site, or a prosthetic group binding site, or a combination thereof.

69. The method of claim 64, wherein the drug includes a small molecule or a large molecule.

70. The method of claim 69, wherein the small molecule is a compound having a molecular weight less than 5 kDa selected from an amino acid, a nucleic acid, an LNA, a PNA, a carbohydrate, a sugar, a lipid, a steroid, a biometal, a vitamin, a terpene, or a polymer thereof.

71. The method of claim 69, wherein the large molecule is a compound having a molecular weight greater than 5 kDa selected from an antibody, a hormone, a growth factor, a cytokine, or a combination thereof.

72. The method of claim 64, wherein the druggable protein is an enzyme, an antigen, or a receptor.

73. The method of claim 64, wherein the drug is an enzyme activator or inhibitor; an allosteric modulator; an agonist, a partial agonist or an antagonist; or an antibody.

74. A method of identifying a drug resistance potential of a variant protein comprising:

a) determining a global mutation rate, wherein the global mutation rate is an expected probability of any given nucleotide of the protein to vary;

b) determining a variant specific mutation rate for a missense mutation of a nucleic acid encoding the one or more amino acids of the protein, wherein the variant specific mutation rate is an observed probability of the missense mutation to occur in a sample nucleotide data set; and

c) determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is less than the global mutation rate; and

d) identifying the protein as being drug resistant if one or more amino acids in the variant protein is determined as being tolerant to the variation compared to the one or more amino acids in a wild-type protein.

75. The method of claim 74, wherein the drug includes a small molecule or a large molecule.

76. The method of claim 74, wherein the small molecule is a compound of less than 5 kDa selected from an amino acid, a nucleic acid, an LNA, a PNA, a carbohydrate, a sugar, a lipid, a steroid, a biometal, a vitamin, a terpene, or a polymer thereof.

77. The method of claim 74, wherein the large molecule is a compound having a molecular weight greater than 5 kDa selected from an antibody, a hormone, a growth factor, a cytokine, or a combination thereof.

78. The method of claim 74, wherein the variant protein is potentially resistant to an antibiotic, an anticancer agent, a xenobiotic, an antagonist, an agonist or an allosteric modulator.

79. The method of claim 74, wherein the variant protein is potentially resistant to binding of an antibody or a ligand.

80. A method of determining a three-dimensional tolerance score (3DTS) for one or more amino acids of a protein comprising:

a) determining a global mutation rate, wherein the global mutation rate is an expected probability of any given nucleotide of the protein to vary;

b) determining a variant specific mutation rate for a missense mutation of a nucleic acid encoding the one or more amino acids of the protein, wherein the variant specific mutation rate is an observed probability of the missense mutation to occur in a sample nucleotide data set; and

c) determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is less than the global mutation rate.

81. The method of claim 80, wherein the one or more amino acids of the protein comprise a plurality of amino acids.

82. The method of claim 81, wherein the plurality of amino acids comprises a protein feature or domain.

83. The method of claim 80, wherein the protein feature is selected from the list consisting of: an active site, a metal binding site, a chemical binding site, a DNA binding site, a nucleotide binding site, a zinc finger, a calcium binding site, a transmembrane domain, an intra membrane domain, a lipidation site, a glycosylation site, a phosphorylation site, a coiled-coil, an alpha helix, and a beta strand.

84. The method of any one of claims 80-83, wherein the global mutation rate is the mutation rate of the nucleotides encoding the protein, an intronic sequence of the protein, a 3' untranslated region of the protein, a 5' untranslated region of the protein, or any combination thereof.

85. The method of any one of claims 80-84, wherein the global mutation rate is the mutation rate for an entire human genome.

86. The method of any one of claims 80-85, wherein the global mutation rate is between about 1×10⁻⁶ and 5×10⁻⁶.

87. The method of any one of claims 80-86, wherein the global mutation rate is about 2.5×10⁻⁶.

88. The method of any one of claims 80-87, wherein the sample nucleotide data set comprises at least 1,000 different nucleic acid sequences from at least 1,000 different individuals encoding the protein.

89. The method of any one of claims 80-88, wherein the sample nucleotide data set comprises at least 10,000 different nucleic acid sequences from at least 10,000 different individuals encoding the protein.

90. The method of any one of claims 80-89, wherein the nucleotide data set comprises DNA.

91. The method of any one of claims 80-90, comprising determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is 2 times less than the global mutation rate.

92. The method of any one of claims 80-91, comprising determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is 5 times less than the global mutation rate.
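The rate comparison of claims 80, 91 and 92 can be sketched as below. The reading of "k times less than" as "observed rate multiplied by k is still below the global rate" is an interpretation made here, and the function names, the `fold` parameter and the way the observed rate is derived from a count are hypothetical.

```python
def observed_variant_rate(carrier_count, sample_count):
    """Observed probability of a missense variant in a sample nucleotide data set."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return carrier_count / sample_count


def is_intolerant(variant_rate, global_rate, fold=1.0):
    """True if the variant-specific rate is `fold` times less than the global rate.

    fold=1 reproduces the plain comparison of claim 80; fold=2 and fold=5
    correspond to the stricter thresholds of claims 91 and 92 (as interpreted here).
    """
    if fold < 1.0:
        raise ValueError("fold must be at least 1")
    return variant_rate * fold < global_rate
```

For a hypothetical variant with an observed rate of 1e-6 and a global rate of 2.5e-6, the residue would be flagged at `fold=1` and `fold=2` but not at `fold=5`.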

93. The method of any one of claims 80-92, wherein the missense mutation is a hypothetical mutation.

94. The method of any one of claims 80-93, further comprising rendering a graphic representation of the protein with a visual indication of amino acids of the protein that are intolerant to variation.

95. The method of claim 94, wherein the graphic representation of the protein is three-dimensional.

96. The method of claim 95, wherein the graphic representation of the protein is rotatable around an x, y, or z axis.

97. The method of claim 96, wherein the graphic representation of the protein is reflectable across an x, y, or z axis.

98. A modulator that binds to any of the one or more amino acids of the protein that are intolerant to variation according to the method of claims 31-97.

99. The modulator of claim 98, wherein the modulator is an antibody or antigen binding fragment thereof.

100. The modulator of claim 98, wherein the modulator binds at a non-active or an allosteric site.

101. A computer-implemented system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application comprising:

a) a software module determining a global mutation rate, wherein the global mutation rate is an expected probability of any given nucleotide of the protein to vary;

b) a software module determining a variant specific mutation rate for a missense mutation of a nucleic acid encoding the one or more amino acids of the protein, wherein the variant specific mutation rate is an observed probability of the missense mutation to occur in a sample nucleotide data set; and

c) a software module determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is less than the global mutation rate.

102. The system of claim 101, wherein the one or more amino acids of the protein comprise a plurality of amino acids.

103. The system of claim 102, wherein the plurality of amino acids comprises a protein feature or domain.

104. The system of claim 103, wherein the protein feature or domain is selected from the list consisting of: an active site, a metal binding site, a chemical binding site, a DNA binding site, a nucleotide binding site, a zinc finger, a calcium binding site, a transmembrane domain, an intra membrane domain, a lipidation site, a glycosylation site, a phosphorylation site, a coiled-coil, an alpha helix, and a beta strand.

105. The system of any one of claims 101-104, wherein the global mutation rate is the mutation rate of the nucleotides encoding the protein, an intronic sequence of the protein, a 3' untranslated region of the protein, a 5' untranslated region of the protein, or any combination thereof.

106. The system of any one of claims 101-105, wherein the global mutation rate is the mutation rate for an entire human genome or for a protein-encoding portion of a human genome.

107. The system of any one of claims 101-106, wherein the global mutation rate is between about 1×10⁻⁶ and 5×10⁻⁶.

108. The system of any one of claims 101-107, wherein the global mutation rate is about 2.5×10⁻⁶.

109. The system of any one of claims 101-108, wherein the sample nucleotide data set comprises at least 1,000 different nucleic acid sequences from at least 1,000 different individuals encoding the protein.

110. The system of any one of claims 101-109, wherein the sample nucleotide data set comprises at least 10,000 different nucleic acid sequences from at least 10,000 different individuals encoding the protein.

111. The system of any one of claims 101-110, wherein the nucleotide data set comprises DNA.

112. The system of any one of claims 101-111, comprising determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is 2 times less than the global mutation rate.

113. The system of any one of claims 101-112, comprising determining the one or more amino acids of the protein as intolerant to variation if the variant specific mutation rate is 5 times less than the global mutation rate.

114. The system of any one of claims 101-113, wherein the missense mutation is a hypothetical mutation.

115. The system of any one of claims 101-114, further comprising rendering a graphic representation of the protein with a visual indication of amino acids of the protein that are intolerant to variation.

116. The system of claim 115, wherein the graphic representation of the protein is three-dimensional.

117. The system of claim 116, wherein the graphic representation of the protein is rotatable around an x, y, or z axis.

118. The system of claim 117, wherein the graphic representation of the protein is reflectable across an x, y, or z axis.

119. An antagonist that binds to any of the one or more amino acids of the protein that are intolerant to variation according to the system of any one of claims 101 to 118.

120. The antagonist of claim 119, wherein the antagonist is an antibody or antigen binding fragment thereof.

121. The antagonist of claim 119, wherein the antagonist binds at a non-active or an allosteric site.