CLAIMS

1. A non-transitory computer readable medium including executable instructions, the instructions being executable by a processor to perform a method, the method comprising: receiving a network of a plurality of nodes and a plurality of edges, each of the nodes of the plurality of nodes comprising members representative of at least one subset of initial data points, each of the edges of the plurality of edges connecting nodes that share at least one data point of the initial data points, the initial data points including rows and columns, each row defining a data point of an initial data set and each column defining a feature, the initial data set including an initial number of columns, each column including values associated with a feature of a plurality of features;

selecting a subset of the data points to create a set of selected data points, the selection being based on each node of the plurality of nodes, whereby if there is only one data point that is a member of a particular node, then the one data point is selected to be a member of the set of selected data points and whereby if there are two or more data points that are members of the particular node, then a proportional number of data points relative to all data points that are members of that particular node are selected to be members of the set of selected data points;

for each selected data point of the set of selected data points, determining a predetermined number of other data points of the set of selected data points that are closest in distance to that particular selected data point, the distance being determined based on a metric function between a vector of each data point;

grouping the selected data points into a plurality of groups based, at least in part, on the predetermined number of other data points of the set of selected data points that are closest in distance, each group of the plurality of groups including a different subset of data points; and

providing a list of selected data points and the plurality of groups.
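
By way of non-limiting illustration, the selecting, determining, and grouping steps recited above might be sketched as follows. The function names, the `fraction` parameter, the use of Euclidean distance as the metric function, and the rule of forming groups as connected components of the nearest-neighbor graph are all assumptions made for this sketch; the claim does not fix any of them.

```python
import math
import random


def select_points(nodes, fraction=0.5, seed=0):
    """Select points node by node.

    A node with a single member contributes that member; a node with two or
    more members contributes a proportional share (assumed: at least one).
    """
    rng = random.Random(seed)
    selected = set()
    for members in nodes.values():
        members = sorted(members)
        if len(members) == 1:
            selected.add(members[0])
        else:
            count = max(1, round(fraction * len(members)))
            selected.update(rng.sample(members, count))
    return sorted(selected)


def nearest_neighbors(points, selected, k):
    """For each selected point, list the k other selected points closest to it."""
    result = {}
    for p in selected:
        others = [q for q in selected if q != p]
        # Euclidean distance is an assumed metric; ties are broken by point id.
        others.sort(key=lambda q: (math.dist(points[p], points[q]), q))
        result[p] = others[:k]
    return result


def group_points(neighbors):
    """Assumed grouping rule: connected components of the neighbor graph."""
    adjacency = {p: set(ns) for p, ns in neighbors.items()}
    for p, ns in neighbors.items():
        for q in ns:
            adjacency[q].add(p)
    groups, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical data: six two-dimensional points and four overlapping nodes.
points = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (10, 10), 4: (10, 11), 5: (11, 10)}
nodes = {"a": {0, 1}, "b": {1, 2}, "c": {3}, "d": {3, 4, 5}}

selected = select_points(nodes, fraction=1.0)
groups = group_points(nearest_neighbors(points, selected, k=2))
```

Other grouping rules, for example one group per selected point together with its nearest neighbors, would be equally consistent with the claim language.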

2. The non-transitory computer readable medium of claim 1, the method further comprising:

creating a first transformation data set, the first transformation data set including the selected data points as well as a plurality of feature subsets, each of the plurality of feature subsets being associated with at least one group of the plurality of groups, values of a particular data point for a particular feature subset for a particular group being based on values of the particular data point in the selected data points if the particular data point is a member of the particular group; and

applying a machine learning model to the first transformation data set to generate a prediction model.
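
By way of non-limiting illustration, the first transformation data set might be built as sketched below: each data point keeps its original feature values and gains one copy of the feature block per group, populated only where the point is a member of that group. The function name `build_transformation` and the `fill` parameter are hypothetical; the zero default is an assumption of this sketch, and a null fill could be passed instead.

```python
def build_transformation(rows, groups, fill=0.0):
    """Append one feature block per group to each data point.

    rows   -- dict mapping a point id to its list of feature values
    groups -- list of groups, each a collection of point ids
    fill   -- value used in a group's block when the point is not a member
    """
    transformed = {}
    for point_id, values in rows.items():
        out = list(values)  # original columns come first
        for members in groups:
            if point_id in members:
                out.extend(values)  # member: copy the point's own values
            else:
                out.extend([fill] * len(values))  # non-member: fill the block
        transformed[point_id] = out
    return transformed


# Hypothetical data: two points with two features, and two groups.
rows = {0: [1.0, 2.0], 1: [3.0, 4.0]}
groups = [[0], [0, 1]]
table = build_transformation(rows, groups)
```

The resulting rows could then be passed to any regression or classification routine to generate the prediction model.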

3. The non-transitory computer readable medium of claim 2, the method further comprising:

creating a second transformation data set, the second transformation data set including the analysis data set as well as the plurality of feature subsets, each of the plurality of feature subsets being associated with the at least one group of the plurality of groups, values of a particular data point of the analysis data set for a particular feature subset for a particular group being based on values of the particular data point in the analysis data set if the particular data point is a member of the particular group;

applying the prediction model to the second transformation data set to generate predicted outcomes; and

generating a report indicating one or more of the predicted outcomes.

4. The non-transitory computer readable medium of claim 3, the method further comprising comparing the predicted outcomes to known outcomes to assess the quality of the prediction model.
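
By way of non-limiting illustration, comparing predicted outcomes to known outcomes might use a mean squared error for numeric outcomes or a fraction of exact matches for categorical outcomes. Both measures and both function names are assumptions of this sketch; the claim does not name a quality measure.

```python
def mean_squared_error(predicted, known):
    """Average squared difference between predicted and known numeric outcomes."""
    if len(predicted) != len(known) or not known:
        raise ValueError("outcome lists must be non-empty and of equal length")
    return sum((p - k) ** 2 for p, k in zip(predicted, known)) / len(known)


def accuracy(predicted, known):
    """Fraction of predicted categorical outcomes that match the known ones."""
    if len(predicted) != len(known) or not known:
        raise ValueError("outcome lists must be non-empty and of equal length")
    return sum(p == k for p, k in zip(predicted, known)) / len(known)
```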

5. The non-transitory computer readable medium of claim 1, wherein the network of the plurality of nodes and the plurality of edges are a result of topological data analysis applied to the initial data set.

6. The non-transitory computer readable medium of claim 1, wherein the network of the plurality of nodes and the plurality of edges are generated by:

generating a reference space;

mapping the data points of the training data into the reference space using at least one filter;

generating a cover based on a resolution;

clustering data in the cover based on a metric and data points of the training data set;

identifying nodes based on the clustered data; and

identifying edges between nodes if nodes share member data points from the training data set.
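
By way of non-limiting illustration, the network generation steps recited above might be sketched as follows for a one-dimensional reference space. The function names, the `overlap` and `radius` parameters, the equal-width overlapping intervals used as the cover, and the single-linkage clustering rule are assumptions of this sketch rather than requirements of the claim.

```python
import math


def cluster(points, ids, radius):
    """Assumed clustering: link points whose distance is at most `radius`."""
    clusters, remaining = [], set(ids)
    while remaining:
        seed = min(remaining)
        remaining.discard(seed)
        component, stack = {seed}, [seed]
        while stack:
            current = stack.pop()
            near = {q for q in remaining
                    if math.dist(points[current], points[q]) <= radius}
            remaining -= near
            component |= near
            stack.extend(near)
        clusters.append(component)
    return clusters


def build_network(points, filter_fn, resolution, overlap, radius):
    """Map points with a filter, cover the range, cluster, and link nodes."""
    values = {pid: filter_fn(vec) for pid, vec in points.items()}
    low, high = min(values.values()), max(values.values())
    width = (high - low) / resolution or 1.0  # guard against a zero-width range
    nodes = []
    for i in range(resolution):
        # Each cover interval is widened by a fraction of its width.
        start = low + i * width - overlap * width
        end = low + (i + 1) * width + overlap * width
        bucket = [pid for pid, v in values.items() if start <= v <= end]
        nodes.extend(cluster(points, bucket, radius))
    # Two nodes are joined by an edge when they share a member data point.
    edges = [(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges


# Hypothetical data: seven points on a line, filtered by their coordinate.
points = {i: (float(i),) for i in range(7)}
nodes, edges = build_network(points, lambda v: v[0],
                             resolution=2, overlap=0.25, radius=1.0)
```

In this example the two cover intervals overlap around the point at coordinate 3, so that point is a member of both resulting nodes and the nodes are connected by an edge.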

7. The non-transitory computer readable medium of claim 2, wherein values of a particular data point for a particular feature subset for a particular group are zero if the particular data point of the training data set is not a member of the particular group.

8. The non-transitory computer readable medium of claim 2, wherein values of a particular data point for a particular feature subset for a particular group are null if the particular data point of the training data set is not a member of the particular group.

9. The non-transitory computer readable medium of claim 2, wherein the values of a particular data point for a particular feature subset for a particular group of which the particular data point is a member are weighted.

10. The non-transitory computer readable medium of claim 9, wherein weighting of the values for the particular data point at least partially depends on how many of the plurality of groups the particular data point is a member of.
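
By way of non-limiting illustration, one possible weighting that depends on group membership count divides each value by the number of groups of which the data point is a member. The reciprocal rule and the function names below are assumptions of this sketch; the claims do not specify a weighting formula.

```python
def membership_weights(point_ids, groups):
    """Assumed rule: weight is 1 / (number of groups containing the point)."""
    counts = {pid: sum(pid in members for members in groups)
              for pid in point_ids}
    # A point in no group gets weight 0.0 so that its blocks stay empty.
    return {pid: (1.0 / count if count else 0.0)
            for pid, count in counts.items()}


def weighted_block(values, weight):
    """Scale a point's feature values by its membership weight."""
    return [weight * v for v in values]


# Hypothetical data: point 1 is a member of both groups, point 2 of neither.
weights = membership_weights([0, 1, 2], [[0, 1], [1]])
block = weighted_block([2.0, 4.0], weights[1])
```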

11. The non-transitory computer readable medium of claim 2, wherein the machine learning model is selected from a group consisting of a linear regression machine learning model, a polynomial regression machine learning model, a logistic regression machine learning model, and a random forest machine learning model.

12. The non-transitory computer readable medium of claim 1, further comprising:

generating a reference space;

mapping the selected data points into the reference space using at least one filter function;

generating a cover based on a resolution;

clustering data in the cover based on a metric function and selected data points;

identifying new nodes based on the clustered data;

identifying new edges between new nodes if nodes share member selected data points; and

providing a display of the selected data points and node membership.

13. A method comprising:

receiving a network of a plurality of nodes and a plurality of edges, each of the nodes of the plurality of nodes comprising members representative of at least one subset of initial data points, each of the edges of the plurality of edges connecting nodes that share at least one data point of the initial data points, the initial data points including rows and columns, each row defining a data point of an initial data set and each column defining a feature, the initial data set including an initial number of columns, each column including values associated with a feature of a plurality of features;

selecting a subset of the data points to create a set of selected data points, the selection being based on each node of the plurality of nodes, whereby if there is only one data point that is a member of a particular node, then the one data point is selected to be a member of the set of selected data points and whereby if there are two or more data points that are members of the particular node, then a proportional number of data points relative to all data points that are members of that particular node are selected to be members of the set of selected data points;

for each selected data point of the set of selected data points, determining a predetermined number of other data points of the set of selected data points that are closest in distance to that particular selected data point, the distance being determined based on a metric function between a vector of each data point;

grouping the selected data points into a plurality of groups based, at least in part, on the predetermined number of other data points of the set of selected data points that are closest in distance, each group of the plurality of groups including a different subset of data points; and

providing a list of selected data points and the plurality of groups.

14. The method of claim 13, further comprising:

creating a first transformation data set, the first transformation data set including the selected data points as well as a plurality of feature subsets, each of the plurality of feature subsets being associated with at least one group of the plurality of groups, values of a particular data point for a particular feature subset for a particular group being based on values of the particular data point in the selected data points if the particular data point is a member of the particular group; and

applying a machine learning model to the first transformation data set to generate a prediction model.

15. The method of claim 13, further comprising:

creating a first transformation data set, the first transformation data set including the selected data points as well as a plurality of feature subsets, each of the plurality of feature subsets being associated with at least one group of the plurality of groups, values of a particular data point for a particular feature subset for a particular group being based on values of the particular data point in the selected data points if the particular data point is a member of the particular group; and

applying a machine learning model to the first transformation data set to generate a prediction model.

16. The method of claim 15, further comprising:

creating a second transformation data set, the second transformation data set including the analysis data set as well as the plurality of feature subsets, each of the plurality of feature subsets being associated with the at least one group of the plurality of groups, values of a particular data point of the analysis data set for a particular feature subset for a particular group being based on values of the particular data point in the analysis data set if the particular data point is a member of the particular group;

applying the prediction model to the second transformation data set to generate predicted outcomes; and

generating a report indicating one or more of the predicted outcomes.

17. The method of claim 14, further comprising comparing the predicted outcomes to known outcomes to assess the quality of the prediction model.

18. The method of claim 14, wherein the network of the plurality of nodes and the plurality of edges are a result of topological data analysis applied to the initial data set.

19. The method of claim 14, wherein the network of the plurality of nodes and the plurality of edges are generated by:

receiving the training data set;

generating a reference space;

mapping the data points of the training data into the reference space using at least one filter;

generating a cover based on a resolution;

clustering data in the cover based on a metric and data points of the training data set;

identifying nodes based on the clustered data; and

identifying edges between nodes if nodes share member data points from the training data set.

20. A system comprising:

a processor; and

a memory, the memory comprising instructions executable by the processor to perform the steps of:

receiving a network of a plurality of nodes and a plurality of edges, each of the nodes of the plurality of nodes comprising members representative of at least one subset of initial data points, each of the edges of the plurality of edges connecting nodes that share at least one data point of the initial data points, the initial data points including rows and columns, each row defining a data point of an initial data set and each column defining a feature, the initial data set including an initial number of columns, each column including values associated with a feature of a plurality of features;

selecting a subset of the data points to create a set of selected data points, the selection being based on each node of the plurality of nodes, whereby if there is only one data point that is a member of a particular node, then the one data point is selected to be a member of the set of selected data points and whereby if there are two or more data points that are members of the particular node, then a proportional number of data points relative to all data points that are members of that particular node are selected to be members of the set of selected data points;

for each selected data point of the set of selected data points, determining a predetermined number of other data points of the set of selected data points that are closest in distance to that particular selected data point, the distance being determined based on a metric function between a vector of each data point;

grouping the selected data points into a plurality of groups based, at least in part, on the predetermined number of other data points of the set of selected data points that are closest in distance, each group of the plurality of groups including a different subset of data points; and

providing a list of selected data points and the plurality of groups.