WO2020183428 - METHOD AND SYSTEM FOR MAPPING READ SEQUENCES USING A PANGENOME REFERENCE

Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

CLAIMS

1. A processor-implemented method for mapping read sequences with a genome variation graph, the method comprising:

obtaining, a plurality of read sequences (r) and a genome variation graph (G), wherein the genome variation graph is a representation of pangenome collection which includes a vertex set (V) and an edge set (E);

generating, an embedding (l) for the genome variation graph (G) utilizing a graph embedding technique;

generating, a graph index (IG) for the genome variation graph (G) based on the embedding (l) and the genome variation graph (G) utilizing a graph winnowing technique;

iteratively mapping, each read sequence (r) to the genome variation graph by constructing a subgraph for each read sequence (r) using the variation map index, wherein the variation map index comprises a list of the graph index (IG) and the genome variation graph (G); and

computing, an alignment score for each read sequence (r) with its corresponding subgraph.

2. The method as claimed in claim 1, wherein generating the embedding (l) for the genome variation graph (G) utilizing the graph embedding technique comprises:

extracting, each entity (v) from the vertex set (V), the edge set (E), a vertex label ℓ(v) for each entity (v) and a graph coordinate (v, i), where i is the offset for vertex label ℓ(v) from the genome variation graph (G);

creating, an auxiliary undirected weighted graph (G') using the genome variation graph (G) by,

including, a pair of vertices for each entity (v) of the vertex set (V) in the genome variation graph (G), wherein each pair of vertices includes a first element and a second element of the entity (v);

connecting, the first element and the second element of each entity (v) with an undirected edge length of weight l decremented with a predefined value, wherein l is the length of the vertex label ℓ(v) in the genome variation graph (G);

fastening, for every entity (e) of the edge set (E), the second element of source vertex with the first element of destination vertex of the genome variation graph (G) with a weight;

computing, the shortest path from source vertex to the available vertices of the auxiliary undirected weighted graph (G'), wherein the source vertex is identified based on the first element having zero incoming edges in the genome variation graph (G); and

generating, an embedding (l) for the graph coordinates where the distances between the graph coordinates are based on the shortest path distances of the graph coordinates where the shortest path distance for each graph coordinate (v, i) of the genome variation graph (G) is based on the minimized shortest path distance to the first element and the second element of the vertex set (V) in the auxiliary undirected weighted graph (G').
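
The embedding steps of claim 2 can be sketched in Python as follows. This is only an illustrative reading of the claim, not the applicant's implementation. The function names, the value 1 for the "predefined value", the weight 1 for edges taken from (E), and the use of every zero in-degree vertex as a source are all assumptions. Labels are assumed non-empty and every vertex is assumed reachable from a source.

```python
import heapq
from collections import defaultdict
from itertools import count


def shortest_distances(labels, edges, edge_weight=1):
    """Dijkstra over the auxiliary undirected weighted graph G'.

    labels: dict vertex -> non-empty label string
    edges:  iterable of directed (source, destination) pairs of G
    Each vertex v becomes two nodes: (v, 0) is its first element and
    (v, 1) is its second element.
    """
    adj = defaultdict(list)
    for v, label in labels.items():
        inner = len(label) - 1  # label length decremented by an assumed value of 1
        adj[(v, 0)].append(((v, 1), inner))
        adj[(v, 1)].append(((v, 0), inner))
    has_incoming = set()
    for u, v in edges:
        # Fasten the second element of the source to the first element of the destination.
        adj[(u, 1)].append(((v, 0), edge_weight))
        adj[(v, 0)].append(((u, 1), edge_weight))
        has_incoming.add(v)

    tie = count()  # tie-breaker so heap entries never compare node tuples
    heap = [(0, next(tie), (v, 0)) for v in labels if v not in has_incoming]
    heapq.heapify(heap)
    dist = {}
    while heap:
        d, _, node = heapq.heappop(heap)
        if node in dist:
            continue
        dist[node] = d
        for nxt, weight in adj[node]:
            if nxt not in dist:
                heapq.heappush(heap, (d + weight, next(tie), nxt))
    return dist


def embedding(dist, labels, v, i):
    """Embed graph coordinate (v, i) as the smaller of the two distances
    obtained by going through the first or the second element of v."""
    last = len(labels[v]) - 1
    return min(dist[(v, 0)] + i, dist[(v, 1)] + (last - i))
```

Under these assumptions, a "bubble" with two alternative single-base vertices places both alternatives at the same embedded position, so coordinates on either branch stay close to each other on the line.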

3. The method as claimed in claim 1, wherein generating the graph index (IG) for the genome variation graph (G) based on the embedding (l) and the genome variation graph (G) utilizing the graph winnowing technique comprises:

obtaining, a plurality of sequence winnow (Sw) paths for the genome variation graph (G) using a winnow length (w);

extracting, a path label for each path of the sequence winnow (Sw) by concatenating the vertex labels along the path based on the winnow length (w);

computing, sequence winnowing (hw) for each path label based on the elements of the minimizer graph coordinates;

generating, a set of distinct minimizer graph coordinates (Hw) by performing sequence winnowing (hw) for each path; and

generating, the graph index (IG) using the elements of the minimizer graph coordinates in (Hw), the embedding (l) and a dictionary key (k).

4. The method as claimed in claim 3, wherein the dictionary key (k) is the k-mer at the minimizer graph coordinates.
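
A minimal Python sketch of the graph winnowing in claims 3 and 4 is given below, under stated assumptions. Each winnow path is taken to spell w + k - 1 characters, the minimizer of a path is taken to be its leftmost lexicographically smallest k-mer, and the embedding is passed in as a callable `embedding(v, i)`. None of these choices, nor the function names, come from the claims; they are hypothetical and chosen only to make the example concrete.

```python
from collections import defaultdict


def walks_from(labels, succ, start_v, start_i, length):
    """Yield every walk of `length` characters starting at (start_v, start_i),
    as a list of graph coordinates (vertex, offset)."""
    stack = [(start_v, start_i, [])]
    while stack:
        v, i, coords = stack.pop()
        take = min(len(labels[v]) - i, length - len(coords))
        coords = coords + [(v, i + j) for j in range(take)]
        if len(coords) == length:
            yield coords
        else:
            for nxt in succ.get(v, ()):
                stack.append((nxt, 0, coords))


def build_graph_index(labels, edges, embedding, w, k):
    """Return a dict mapping k-mer -> list of (graph coordinate, embedding value)."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    window_len = w + k - 1
    minimizers = set()  # distinct (coordinate, k-mer) pairs, playing the role of Hw
    for v, label in labels.items():
        for i in range(len(label)):
            for coords in walks_from(labels, succ, v, i, window_len):
                # Path label: concatenate the characters along the walk.
                text = "".join(labels[x][o] for x, o in coords)
                # Leftmost smallest k-mer among the w k-mers of this window.
                j = min(range(w), key=lambda s: text[s:s + k])
                minimizers.add((coords[j], text[j:j + k]))

    index = defaultdict(list)
    for coord, kmer in sorted(minimizers):
        index[kmer].append((coord, embedding(*coord)))
    return dict(index)
```

Keying the dictionary by k-mer, as claim 4 describes, lets a read minimizer be looked up directly; each hit then carries both its graph coordinate and its embedded position.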

5. The method as claimed in claim 1, wherein the variation map index is constructed by combining the graph index (IG) and the sorted genome variation graph (G).

6. The method as claimed in claim 1, wherein constructing the subgraph for each read sequence (r) using the variation map index comprises:

computing, a minimizer set (R) for each read sequence (r) by applying sequence winnowing (hw);

identifying, the presence of the minimizer set (R) in the graph index (IG);

determining, a hitlist (H) based on the presence of the minimizer set (R) identified in the graph index (IG) and then clustering the hitlist (H) based on their embedding;

identifying, a maximum density cluster based on the embedding and the hitlist (H); and

constructing, the subgraph based on the vertices whose embedding (l) lies in a bounded region around the maximum density cluster after applying a correction factor.

7. The method as claimed in claim 6, wherein the bounded region is obtained from the maximum density cluster by applying a correction factor based on the length of the read sequence (r).
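
The subgraph construction of claims 6 and 7 might look like the following Python sketch. The claims do not fix a clustering method or a correction factor, so this example assumes a one-dimensional sliding window whose width equals the read length, and pads the cluster by the read length on both sides. The index layout (k-mer mapped to a list of coordinate and embedding pairs) and all function names are likewise hypothetical.

```python
def read_minimizers(read, w, k):
    """Minimizer set (R): the smallest k-mer of every window of w consecutive k-mers."""
    out = set()
    for start in range(len(read) - (w + k - 1) + 1):
        out.add(min(read[s:s + k] for s in range(start, start + w)))
    return out


def densest_cluster(values, width):
    """Largest group of values whose spread is at most `width` (sliding window)."""
    values = sorted(values)
    best, left = [], 0
    for right in range(len(values)):
        while values[right] - values[left] > width:
            left += 1
        if right - left + 1 > len(best):
            best = values[left:right + 1]
    return best


def candidate_region(read, index, w, k):
    """Bounded region (lo, hi) on the embedding line, or None if there are no hits."""
    # Hitlist (H): embedding values of every index entry matching a read minimizer.
    hits = [emb
            for kmer in read_minimizers(read, w, k)
            for _coord, emb in index.get(kmer, ())]
    cluster = densest_cluster(hits, width=len(read))
    if not cluster:
        return None
    pad = len(read)  # assumed correction factor based on the read length
    return cluster[0] - pad, cluster[-1] + pad


def subgraph_vertices(vertex_span, region):
    """Vertices whose embedded span (first, last) overlaps the bounded region."""
    lo, hi = region
    return {v for v, (first, last) in vertex_span.items()
            if first <= hi and last >= lo}
```

Working on embedded positions rather than on the graph itself keeps the clustering step a simple sort and scan; a spurious hit far away on the line is excluded from the densest window and therefore from the subgraph.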

8. The method as claimed in claim 1, wherein the alignment score for read sequence (r) with its corresponding subgraph is computed using any graph-based gapped aligner method.
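
Claim 8 leaves the aligner open. As one possible illustration, the Python sketch below scores a read against an acyclic subgraph with a simple dynamic program over the graph in topological order. The scoring values (match +1, mismatch -1, linear gap -1), the free start and end positions in the graph, and the function name are assumptions for this example; labels are assumed non-empty and the subgraph is assumed to be a DAG. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def align_read_to_dag(read, labels, edges, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole read to some path of the DAG.

    labels: dict vertex -> non-empty label string
    edges:  iterable of directed (source, destination) pairs
    The alignment may begin and end at any base of the graph.
    """
    preds = {v: [] for v in labels}
    for u, v in edges:
        preds[v].append(u)

    m = len(read)
    # Virtual start row: read prefix of length j placed before any graph base.
    start = [gap * j for j in range(m + 1)]
    rows = {}
    best = start[m]
    for v in TopologicalSorter(preds).static_order():
        for i, base in enumerate(labels[v]):
            if i > 0:
                prev_rows = [rows[(v, i - 1)]]
            else:
                prev_rows = [rows[(u, len(labels[u]) - 1)] for u in preds[v]]
            prev_rows.append(start)  # the alignment may also start at this base

            row = [max(p[0] for p in prev_rows) + gap]
            for j in range(1, m + 1):
                sub = match if base == read[j - 1] else mismatch
                diag = max(p[j - 1] for p in prev_rows) + sub  # base vs read char
                up = max(p[j] for p in prev_rows) + gap        # graph base skipped
                left = row[j - 1] + gap                        # read char inserted
                row.append(max(diag, up, left))
            rows[(v, i)] = row
            best = max(best, row[m])
    return best
```

Because the subgraph from the previous step is small compared with the full genome variation graph, a quadratic dynamic program of this kind stays affordable per read; production aligners use more refined scoring and banding.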

9. A system (100) for mapping read sequences with a genome variation graph, the system (100) comprising:

a memory (102) storing instructions;

one or more Input/Output (I/O) interfaces (106); and

one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:

obtain, a plurality of read sequences (r) and a genome variation graph (G), wherein the genome variation graph is a representation of pangenome collection which includes a vertex set (V) and an edge set (E);

generate, an embedding (l) for the genome variation graph (G) utilizing a graph embedding technique;

generate, a graph index (IG) for the genome variation graph (G) based on the embedding (l) and the genome variation graph (G) utilizing a graph winnowing technique;

iteratively map, each read sequence (r) to the genome variation graph by constructing a subgraph for each read sequence (r) using the variation map index, wherein the variation map index comprises a list of the graph index (IG) and the genome variation graph (G); and

compute, an alignment score for each read sequence (r) with its corresponding subgraph.

10. The system (100) as claimed in claim 9, wherein generating the embedding (l) for the genome variation graph (G) utilizing the graph embedding technique comprises:

extracting, the vertex set (V), the edge set (E), a vertex label ℓ(v) for every entity (v) of the vertex set (V) and a graph coordinate (v, i), where i is the offset for vertex label ℓ(v) from the genome variation graph (G);

creating, an auxiliary undirected weighted graph (G') using the genome variation graph (G) by,

including, a pair of vertices for each entity (v) of the vertex set (V) in the genome variation graph (G), wherein each pair of vertices includes a first element and a second element of the entity (v);

connecting, the first element of the entity (v) and the second element of the entity (v) with an undirected edge length of weight (l) decremented with a predefined value, wherein l is the length of the vertex label ℓ(v) in the genome variation graph (G);

fastening, for every entity (e) of the edge set (E), the second element of source vertex with the first element of destination vertex of the genome variation graph (G) with a weight;

computing, the shortest path from source vertex to the available vertices of the auxiliary undirected weighted graph (G'), wherein the source vertex is identified based on the first element having zero incoming edges in the genome variation graph (G); and

generating, an embedding for the graph coordinates where the distances between the graph coordinates are based on the shortest path distances of the graph coordinates where the shortest path distance for each graph coordinate (v, i) of the genome variation graph (G) is based on the minimized shortest path distance to the first element and the second element of the vertex set (V) in the auxiliary undirected weighted graph (G').

11. The system (100) as claimed in claim 9, wherein generating the graph index (IG) for the genome variation graph (G) based on the embedding (l) and the genome variation graph (G) utilizing the graph winnowing technique comprises:

obtaining, a plurality of sequence winnow (Sw) paths for the genome variation graph (G) using a winnow length (w);

extracting, a path label for each path of the sequence winnow (Sw) by concatenating the vertex labels along the path based on the winnow length (w);

computing, sequence winnowing (hw) for each path label based on the elements of the minimizer graph coordinates;

generating, a set of distinct minimizer graph coordinates (Hw) by performing sequence winnowing (hw) for each path; and

generating, the graph index (IG) using the elements of the minimizer graph coordinates in (Hw), the embedding (l) and a dictionary key (k), wherein the dictionary key (k) is the k-mer at the minimizer graph coordinates.

12. The system (100) as claimed in claim 9, wherein the variation map index is constructed by combining the graph index (IG) and the sorted genome variation graph (G).

13. The system (100) as claimed in claim 9, wherein constructing the subgraph for each read sequence (r) using the variation map index comprises:

computing, a minimizer set (R) for each read sequence (r) by applying sequence winnowing (hw);

identifying, the presence of the minimizer set (R) in the graph index (IG);

determining, a hitlist (H) based on the presence of the minimizer set (R) identified in the graph index (IG) and then clustering the hitlist (H) based on their embedding;

identifying, a maximum density cluster based on the embedding and the hitlist (H); and

constructing, the subgraph based on the vertices whose embedding (l) lies in a bounded region around the maximum density cluster after applying a correction factor, wherein the bounded region is obtained from the maximum density cluster by applying a correction factor based on the length of the read sequence (r).

14. The system (100) as claimed in claim 9, wherein the alignment score for read sequence (r) with its corresponding subgraph is computed using any graph-based gapped aligner method.

15. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors perform actions comprising:

obtaining, a plurality of read sequences (r) and a genome variation graph (G), wherein the genome variation graph is a representation of pangenome collection which includes a vertex set (V) and an edge set (E);

generating, an embedding (l) for the genome variation graph (G) utilizing a graph embedding technique;

generating, a graph index (IG) for the genome variation graph (G) based on the embedding (l) and the genome variation graph (G) utilizing a graph winnowing technique;

iteratively mapping, each read sequence (r) to the genome variation graph by constructing a subgraph for each read sequence (r) using the variation map index, wherein the variation map index comprises a list of the graph index (IG) and the genome variation graph (G); and

computing, an alignment score for each read sequence (r) with its corresponding subgraph.