1. (WO2019064016) METHODS AND APPARATUSES RELATING TO PROCESSING HETEROGENEOUS DATA FOR CLASSIFICATION
Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

Methods and Apparatuses Relating to Processing Heterogeneous Data for Classification

Technical Field

The present invention relates to processing of heterogeneous data for classification.

Background

Machine learning can be used to analyse data and arrange it into classes. This classification has a variety of applications, for example classifying pictures, text records, customer behaviour, among others, to enable computers to take decisions, automate processes and make recommendations based on fresh input data. Supervised classification involves a human training machine learning algorithms, for example by providing classes to a sample of records. The algorithms learn from this training set and build classification models which can then be used to classify fresh data. This is commonly called predictive analytics. Active Learning is a form of supervised classification machine learning whereby the algorithms interact with a source of true classification (such as a human, for example) to point them to the data to classify next in order to reduce the required input from the true classification source (human), whilst improving the performance of the machine learning models.

Both Active Learning and indeed all supervised classification machine learning remain largely the preserve of highly-skilled data scientists due to the steps that need to be taken to convert raw input data into a form suitable to be input into a machine learning environment. It is therefore difficult for non-data scientists to make use of classification machine learning systems.

Summary of the Invention

According to an aspect of the present invention, there is provided a computer-implemented method of processing heterogeneous data for classification, comprising receiving a plurality of heterogeneous data records to be classified and the structure of the received heterogeneous data records, applying a data record definition, by mapping the structure of the received heterogeneous data records to a common data format, and classifying the set of data records in the common data format using a machine learning process.

According to another aspect of the present invention, there is provided apparatus comprising at least one processor and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform the above method.

According to another aspect of the present invention, there is provided a computer program which, when executed by a computing apparatus, is arranged to perform the above method.

The machine learning process may be an Active Learning process.

The plurality of heterogeneous data records may comprise at least one of unstructured text and different data fields.

The method may comprise providing an indication that a received data record comprises data that does not correspond to a class of the classification performed by the machine learning process, and presenting said data record to a user to be labelled.

The method may comprise processing the data records by generating a text document suitable for input into a machine learning process.

The method may comprise receiving the data record definition from a user via a user interface.

The method may comprise receiving a user input via a user interface to manipulate classes and/or records.

Brief Description of the Drawings

For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:

Figure 1 illustrates a system according to embodiments of the invention;

Figure 2 is a flow chart illustrating an example of operations which may be performed according to embodiments of the invention.

Detailed Description

In the following detailed description, only certain exemplary embodiments of the present invention have been shown and described, simply by way of illustration. As those skilled in the art would realise, the described embodiments may be modified in various different ways, all without departing from the scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.

Text and other data used to train machine learning models, and which is classified, may vary enormously from one situation to the next, and for different organisations. It can vary in any and all ways: length, language, style, acronyms, which fields (or partial-fields) are used and the subject domain. Such data is referred to herein as heterogeneous data. In addition, data structures may change over time for any one particular situation. There may also be variety in the end-resulting classes (i.e. 'labels') themselves. This is dependent on what the classification is being used for; for example, the classes for automatic routing of customer enquiries to particular departments might be different from those for classifying whether the customer is happy, even for the same dataset. As will be described in more detail below, embodiments of the present invention are able to handle simultaneously such variety in data structure and labelling.

For example, heterogeneous input data may be provided in a number of different formats comprising different structured and unstructured fields. Embodiments of the present invention provide for input data provided in any format to be received by the system and converted into a common format in order to be processed by a machine learning system for classification. In addition, the user may predefine which data is to be used in the machine learning process.

An example of a data structure including structured and unstructured fields may be a hotel review. The review includes structured fields in which the user can give a rating in relation to specific questions (location, service, etc.), and an additional field is an unstructured comments field in which users can comment on the hotel.

For the classification process of embodiments of the present invention, the desired classes of the data record to be classified might be (i) the topics that people are talking about within each response, (ii) the opinions of those topics and (iii) the emotional state of the responder. These resultant classes, or labels, can all then later be analysed, with or without the numerical responses, to identify which topics are important to which customers. Each field may have its own pre-processing rules in terms of acronyms, and other cleansing, Natural Language Processing ("NLP") rules or data transformations (e.g. synonym replacement, normalization of date formats). In a conventional system, even the ostensibly simple addition of an additional field to the data structure described above will not work without starting again and retraining the machine learning process. Embodiments of the present invention enable a specification of a mapping of the structure of the received data to a common format for classification such that it can be known whether or not the additional field is to be used in the classification.

Another example may relate to classifying maintenance records of some machinery to try to analyse why certain failures are happening. This might require the concatenation of several written reports (e.g. a customer and then one or more technicians) to be used in the classification. A change in the data structure such as new or changing fields (e.g. due to new recording processes or IT changes) would mean that the entire analysis would need to be restarted in a conventional system. A generic system that can easily be configured and used by a non-data scientist, for either of the described examples relating to the maintenance records or the hotel reviews has not been achieved before now. Such a solution has presented many difficulties: in conventional systems each classification workflow is crafted by a data scientist for a particular static dataset, and a significant amount of time is also spent validating and curating these workflows in case of changes.

Embodiments of the present invention provide a method and system for data classification, useable by a non-data scientist, which is able to deal with data provided in a variety of formats, including heterogeneous data formats which may change over time. Embodiments of the present invention may therefore provide a solution for dealing with any format of incoming data, in contrast to prior art systems, which may be configured only to receive data in a single predetermined format, or a small group of separately-defined formats. Embodiments of the present invention provide a technical solution for data records of different structures to be processed in such a way as to be suitable for input into the machine learning system. Therefore, the present invention does not require that the inputs are provided in exactly the same format in order to be processed by the machine learning system, as may be the case for many existing machine learning systems.

Machine learning classification may comprise the steps of text pre-processing, data modelling, and model generation. Text pre-processing relates to transforming the text from the data record fields in order to remove certain features which may add "noise" to the classification process. The text pre-processing may also include steps for transforming the text in such a way as to allow the machine learning classifier to achieve a better classification. Examples of features which may be addressed in the pre-processing stage may include, but are not limited to, typos which may be corrected or removed, abbreviations which may be replaced, and data transformations such as date format normalization. Data modelling relates to transforming the records into a matrix. For example, a row of the matrix may represent a data record, and a column may represent a field of the data record corresponding to a specific feature. Model generation relates to using one or more machine learning algorithms to create a model based on the output matrix from the data modelling step.
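By way of illustration only, the text pre-processing step might be sketched in Python as follows. The abbreviation table, the function name preprocess_text and the assumption that input dates are written as DD/MM/YYYY are hypothetical choices made for this example and are not specified in the description.

```python
import re

# Hypothetical abbreviation table; a real deployment would supply its own.
ABBREVIATIONS = {"rm": "room", "mgr": "manager"}

# Matches dates written as DD/MM/YYYY (an assumed input format).
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def preprocess_text(text: str) -> str:
    """Lower-case the text, normalise dates to YYYY-MM-DD and expand abbreviations."""
    text = text.lower()
    # Reorder the captured day, month and year groups into ISO order.
    text = DATE_PATTERN.sub(r"\3-\2-\1", text)
    # Replace each whitespace-separated word that appears in the table.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


cleaned = preprocess_text("Mgr moved us to a new rm on 05/03/2018")
```

In this sketch each transformation is a separate, ordered step, so that different fields could in principle be given different rule sets.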

The data structure of the input data records may have an impact on each of these steps. Therefore, embodiments of the present invention transform the input data records into a common format in order for the machine learning process to deal with heterogeneous input data records provided in different data structures and formats.

Therefore, the embodiments of the present invention provide a flexible system which may be used for different situations and by different users. They may handle data structures which change over time. Indeed, the data records may be provided to the system in any format, and the embodiments of the present invention provide for each of the data records to be processed into a common format.

Figure 1 illustrates a system arranged to perform a method embodying the present invention. The system includes a component referred to herein as the "data record abstractor" 204. As described in more detail below, the data record abstractor abstracts the disparities in the different data structures in order to allow the data to be input from a data structure of any format. This enables the system to ingest data of any structure, while still being able to run all of the processes and algorithms that are required for prediction, model validation, and retraining.

The machine learning process used by the system of Figure 1 may make use of Active Learning in which the system may be configured to interact with a user, wherein the user may provide labels to some data in order to improve predictive models. Active Learning may provide the user with an invitation to classify certain records in a certain order to reduce the labeller's time and to improve the performance of predictive models. The active learning system may select records for the user to classify. In some examples, if a certain syntax has changed its semantic meaning over time, the system may invite a user to classify certain records and, if necessary, generate new classes. Therefore, the system may specify whether data must be classified by a user. The system may be used with or without Active Learning. That is, other types of machine learning may be used together with embodiments of the present invention. However, combining the method alongside Active Learning may enable machine learning classification to be simplified and may improve the automation of the process.
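The description does not state how the Active Learning process chooses which records to present to the user. As one possible illustration, the following Python sketch uses least-confidence uncertainty sampling, a commonly used selection strategy, which is an assumption of this example. The function name select_records_to_label and the sample probabilities are hypothetical.

```python
from typing import Dict, List


def select_records_to_label(
    class_probabilities: Dict[str, List[float]], batch_size: int
) -> List[str]:
    """Return the ids of the records whose top predicted class is least certain."""
    # Confidence of a record is the probability of its most likely class.
    confidence = {rid: max(probs) for rid, probs in class_probabilities.items()}
    # Sort ascending so that the least confident records come first.
    ranked = sorted(confidence, key=lambda rid: confidence[rid])
    return ranked[:batch_size]


# Hypothetical model outputs: record id -> probability per class.
predictions = {
    "r1": [0.95, 0.05],
    "r2": [0.55, 0.45],
    "r3": [0.70, 0.30],
}
to_label = select_records_to_label(predictions, batch_size=2)
```

Here the two records about which the model is least sure would be offered to the labeller first, which is one way of reducing the labeller's time.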

The system components and their functions will now be described. The system may comprise a number of inputs, 201, 202, 203. These inputs may comprise, for example, data record pre-processing instructions 201, data record definition 202, and/or a data record view 203.

The data record pre-processing instructions 201 may comprise instructions for pre-processing data records defined in a particular way, i.e. data records having a particular structure. Data record pre-processing instructions 201 may be provided to the system in any suitable way, such as by being input by a user, for example, or extracted from a computer memory. The system may select an appropriate pre-processing instruction for a corresponding data field.

The data record definition 202 includes a definition of data fields to be used in the machine learning process. The data record definition 202 may have a corresponding data record pre-processing instruction 201 for different fields to be classified, in order that the system can identify the relevant information from the data records and obtain the relevant instructions for performing pre-processing on the data records. The data record definitions may be provided to the system in any suitable way, such as by being input by a user, for example, or extracted from a computer memory.

The data record view 203 may be associated with the data record definition and may comprise information on the fields of the data records which may be required to be presented to a user in order for a user to validate or label data in the relevant fields. Therefore, if a data record is presented to a user, the data record view contains the information required to graphically present specific fields of the data records to a user.

The system may comprise a user interface to enable a user to input information and/or instructions to the system. For example, the data record pre-processing instructions may be input by a user through the user interface. The data record definition may be input by a user through the user interface. The data record view may be provided to the user via the user interface.

The data record abstractor 204 receives the data record pre-processing instructions 201, data record definition 202, and data record view 203 corresponding to input data records. The data record definition 202 is used by the data record abstractor 204 to map the structure of a plurality of heterogeneous data records into a format which can be considered as an abstraction or common format. Data in the abstracted format can be processed by a machine learning process. Due to the N:1 mapping of data records to the abstracted format input, it is not necessary to configure the machine learning process to handle data records in any unrestricted format, and so it is not necessary to define a machine learning process for any format of input data, but only for abstracted formats. The modification of the abstracted format by addition or removal of predetermined fields, thus changing the received data record definition, represents a predetermined restricted set of variations which provides full flexibility to the machine learning process without requiring undue complexity.
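A minimal Python sketch of such an N:1 mapping is given below, assuming that a data record definition can be represented as a dictionary from source field names to common field names. The definition names, field names and the function to_common_format are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Dict

# Hypothetical data record definitions: each maps the field names of one
# source structure onto the field names of the common (abstracted) format.
DEFINITIONS: Dict[str, Dict[str, str]] = {
    "hotel_review_v1": {"comments": "text", "service": "rating"},
    "hotel_review_v2": {"free_text": "text", "service_score": "rating"},
}


def to_common_format(record: Dict[str, Any], definition_name: str) -> Dict[str, Any]:
    """Map one source record onto the common format, dropping unmapped fields."""
    mapping = DEFINITIONS[definition_name]
    return {
        common_field: record[source_field]
        for source_field, common_field in mapping.items()
        if source_field in record
    }


old_style = to_common_format({"comments": "Great stay", "service": 5}, "hotel_review_v1")
new_style = to_common_format(
    {"free_text": "Noisy room", "service_score": 2, "wifi": "slow"}, "hotel_review_v2"
)
```

Both source structures yield records with the same keys, so downstream components need only handle the common format. The unmapped "wifi" field is ignored until a definition explicitly includes it.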

Features are extracted from the input data records and arranged into the abstracted format and the data record abstractor 204 therefore determines the relevant information required for further processing of the data records, in order that the data records may be processed in the machine learning system. The data record abstractor 204 uses the received information in order to drive the subsequent components of the system in the correct manner, and is thus a dynamic component interfacing between an unrestricted input and a set of user-definable outputs, i.e. selected data fields. As the data record definition is abstracted, the corresponding processing instructions can be mapped on to the corresponding features of the input data records, regardless of the format in which the data record definitions are input.

The dynamic data record pre-processor 206 receives data records 207 which may be accompanied by corresponding labels. The dynamic data record pre-processor 206 may be configured to apply the relevant pre-processing steps for received data records 207. The pre-processing for each data record is performed in accordance with the pre-processing instructions 201 associated with the data record definition 202. The data records 207 may be provided in a plurality of different formats and/or may relate to different subject matters.

If required, data from the data record may be presented to the user in order for the user to interpret and validate or label the data. The dynamic data record renderer 205 displays the data to the user in accordance with the data record view for the given data record.

The dynamic data record pre-processor 206 provides the pre-processed data to the dynamic text document generator 208. This converts the pre-processed data into a common format required for input into the data modeller 209. For example, the dynamic text document generator 208 generates a text document from the pre-processed data in a format suitable for the machine learning analysis to be applied. The dynamic text document generator 208 may, for example, be configured to merge and concatenate certain fields of the data record, or corpuses of text. The pre-processed data records are converted into the common format ready for input into the data modeller 209. The common format is determined according to the data record definition 202.
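The merging and concatenation of fields might, for example, be sketched in Python as follows. The function generate_text_document, the field names and the sample record are hypothetical, and the list of fields to concatenate is assumed to come from the data record definition.

```python
from typing import Dict, List


def generate_text_document(record: Dict[str, str], fields: List[str]) -> str:
    """Concatenate the selected fields, in order, into one text document."""
    # Skip fields that are missing or empty; trim stray whitespace.
    parts = [record[name].strip() for name in fields if record.get(name)]
    return " ".join(parts)


# Hypothetical maintenance record with a customer report and a technician report.
maintenance_record = {
    "customer_report": "Pump makes a grinding noise. ",
    "technician_report": "Bearing worn; replaced.",
    "site_code": "A17",
}
document = generate_text_document(
    maintenance_record, ["customer_report", "technician_report"]
)
```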

The data modeller 209 transforms the processed data records output from the dynamic text document generator into a matrix. Machine learning is applied to the matrix. The matrix comprises a number of rows and columns, where each row represents a data record and the columns represent features of the data records set out in the text document output from the dynamic text document generator 208.
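The description does not specify how features are derived from the text documents. As a simple illustration, the following Python sketch assumes a bag-of-words representation in which each column counts the occurrences of one word. The function name build_matrix and the sample documents are hypothetical.

```python
from typing import List, Tuple


def build_matrix(documents: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix) with one row per document and one column per word."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    column = {word: index for index, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            row[column[word]] += 1  # count each occurrence of the word
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = build_matrix(["clean room", "room service slow", "clean clean"])
```

Each row of the resulting matrix corresponds to one data record, and each column to one feature, as described above.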

The matrix output from the data modeller 209 is input to a classification model generator 210. The classification model generator applies one or more machine learning algorithms to create at least one model based on the matrix output from the data modeller 209. The classification model generator 210 outputs labelled data 211. In addition, the classification model generator may be configured to output a recommendation 212 of a data record for a human operator to label in order to improve the model, in accordance with an active learning process. The classification model generator 210 may perform processes such as cross-validation of the generated model.

The data record abstractor 204, dynamic data record pre-processor 206 and dynamic text document generator 208, when used together with Active Learning, provide for a practical Active Learning system which allows the system 1 to classify data records having different formats by keeping an association between each data record and its definition. In this way, the system 1 is able to combine all data records even when the data records are in different formats, for example if they include different fields, during the classification process. Also, as each data record definition 202 has specific data record pre-processing instructions 201 associated with it, the system 1 is able to determine how to process each data record based on the data record definition 202. In addition, in order to allow a user to provide true labels for each data record, even data records containing different fields, each data record definition 202 has an associated data record view 203 which may be viewed by a user on a data record view page.

The system may be fully adaptable for commercial use. For example, the system may be adapted for big-data capabilities. In some embodiments, the system may comprise a distributed system to allow different labellers in different locations to receive different records to classify from the system.

In some embodiments, the system may be configured to target specific classes for Active Learning to operate on and improve.

The system may also be configured to target data records corresponding to specific time periods. For example, the system may be configured to generate models based on data records corresponding to the present and exclude data records that are older than a given date.
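Excluding older records before model generation might be sketched in Python as follows, assuming that each record carries a date. The "date" key, the function name records_since and the sample records are hypothetical.

```python
from datetime import date
from typing import Dict, List


def records_since(records: List[Dict], cutoff: date) -> List[Dict]:
    """Keep only the records dated on or after the cutoff."""
    return [record for record in records if record["date"] >= cutoff]


# Hypothetical records; the "date" key is an assumption of this sketch.
all_records = [
    {"id": 1, "date": date(2016, 5, 1)},
    {"id": 2, "date": date(2018, 2, 14)},
]
recent = records_since(all_records, cutoff=date(2018, 1, 1))
```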

The system may be adaptable such that a user such as a data scientist may select or add different machine learning algorithms and/or pre-existing models, for example in order to analyse data in different ways or to provide for further analyses.

The system may include further pre-processing algorithms which could be used to suggest or generate labels automatically at the start where there are not yet any labels.

The system may be configured to indicate to a user when it detects that there may be an entirely new class. This may provide for an "early warning" about new issues that have not previously arisen.

The system may be configured to allow a user to highlight certain portions of text within a corpus of text and to add a label or class. This may enable more accurate training of the machine learning models where the data records may include large corpuses of text and/or multiple labels which may be assigned to a corpus of text. Therefore, the performance of the models may be improved. The system may allow a user to view and edit labels that have been assigned manually or automatically.

The system may output confusion matrices to a user. The confusion matrices may indicate how well particular classes or labels are performing. The user may be able to select, inspect, and relabel incorrectly classified records such as false positives or false negatives in order to retrain and improve the models. For example, the system may comprise a user interface via which a user may provide an input in order to manipulate classes and/or records.
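As an illustration, the counts underlying a confusion matrix, and the list of records a user might wish to inspect and relabel, could be computed in Python as sketched below. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from typing import List


def confusion_counts(true_labels: List[str], predicted_labels: List[str]) -> Counter:
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def misclassified_indices(true_labels: List[str], predicted_labels: List[str]) -> List[int]:
    """Return the positions of records whose predicted class differs from the true class."""
    return [
        index
        for index, (true, predicted) in enumerate(zip(true_labels, predicted_labels))
        if true != predicted
    ]


# Hypothetical labels for four records.
true_labels = ["billing", "billing", "delivery", "delivery"]
predicted_labels = ["billing", "delivery", "delivery", "delivery"]
counts = confusion_counts(true_labels, predicted_labels)
to_review = misclassified_indices(true_labels, predicted_labels)
```

In this example one "billing" record has been predicted as "delivery", so it appears as an off-diagonal count and is flagged for review.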

In response to classes and records being manipulated by a user, the system may be configured to automatically trigger remodelling. For example, a class may be split, or a number of classes may need to be merged. For this purpose, a user interface may be provided in order to allow a user to manipulate the classes and records. Classes within a hierarchy may also therefore be easily manipulated.
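Merging a number of classes into one might be sketched in Python as follows. The function merge_classes and the sample labels are hypothetical; the remodelling that the system may trigger afterwards is not shown.

```python
from typing import Dict, Set


def merge_classes(labels: Dict[str, str], sources: Set[str], target: str) -> Dict[str, str]:
    """Return a copy of the labels in which every source class is renamed to the target."""
    return {
        record_id: (target if label in sources else label)
        for record_id, label in labels.items()
    }


# Hypothetical labels: record id -> class.
labels = {"r1": "late delivery", "r2": "lost parcel", "r3": "billing"}
merged = merge_classes(labels, {"late delivery", "lost parcel"}, "delivery")
```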

Classes labelled from pre-existing data may be imported along with training data in order to automatically generate an initial model.

The assignment of a class label to data may be based on threshold criteria for the classification confidence coming from the machine learning model.
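A minimal Python sketch of such threshold-based assignment is given below. The function name assign_label, the threshold value of 0.8 and the sample confidences are hypothetical.

```python
from typing import Dict, Optional


def assign_label(class_confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most confident class, or None if it does not reach the threshold."""
    best_class = max(class_confidences, key=class_confidences.get)
    if class_confidences[best_class] >= threshold:
        return best_class
    return None  # leave the record unlabelled


confident = assign_label({"happy": 0.91, "unhappy": 0.09}, threshold=0.8)
uncertain = assign_label({"happy": 0.55, "unhappy": 0.45}, threshold=0.8)
```

A record for which no class reaches the threshold is left unlabelled in this sketch, and could then be presented to a user for labelling.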

The system may comprise a memory comprising computer readable instructions and a processor. The instructions may be executed by the processor in order to perform required steps for training and/or classifying data, such as, for example, the steps described with reference to Figure 2. The system may comprise a computer readable medium having computer readable code stored thereon, the computer readable code, when executed by at least one processor, causing the performance of any of the operations described with reference to Figure 2.

Figure 2 is a flow chart illustrating steps of a method which may be performed by the system shown in Figure 1.

In step S301, the method comprises receiving a data record definition for a plurality of heterogeneous data records. Pre-processing instructions for the plurality of data records may also be received. The data records may be provided in a plurality of different formats. As described above with reference to Figure 1, the data record abstractor may be configured to receive the data record definition and the pre-processing instructions. The data records may comprise unstructured text, or may comprise data fields which change over time. The data record abstractor may store the data record definition and data record pre-processing instructions.

In step S302, the received data record definition is applied to the structure of the received heterogeneous data records by mapping the structure to a common format for input to a machine learning process.

If the received data records are to be pre-processed, the pre-processing may comprise processing the data records into text documents suitable for being input to a machine learning process, in order to input the data records into a data modeller and a classification model generator. The common format allows the data to be input into a data modeller regardless of the input format of the data. The text document is generated based on the determined common format.

A data modeller may arrange the processed data records into a matrix. The matrix may then be input to a classification model generator in order to generate a data model trained in accordance with the Active Learning process. Therefore, a trained data model may be generated in accordance with an Active Learning process.

In step S303, the method comprises classifying the data records in the common format using the data model which has been trained according to the machine learning process. The method may comprise providing a recommendation to a user for a data record to be classified by the user, in accordance with a machine learning process.

If a data record comprises data which does not belong to an existing class, the system may provide an indication that the data record does not correspond to an existing class. The data record in question may then be presented to a user in order for the user to label the data.

The classified data records are output in step S304. The data records may, in some embodiments, also be fed back into the classification process via feedback loop S305, in order to train the machine learning process.

Embodiments of the present invention therefore provide for a system and a method that is easy to use and adaptable to a variety of situations. The system requires little setup, and it is constantly analysed for effectiveness. The system may therefore be used by users who do not have specialist knowledge in the field of data science, while providing accurate and detailed information on which further business analysis may be performed. The system and method provide the technical benefit of enabling data in a variety of formats to be utilised in a machine learning process.

The present invention has been described with reference to a number of exemplary embodiments and examples. It should be appreciated that the particular embodiments shown and described herein are illustrative of the invention and are not intended to limit in any way the scope of the invention as set forth in the claims. It will be recognised that changes and modifications may be made to the exemplary embodiments without departing from the scope of the present invention. These and other changes or modifications are intended to be included within the scope of the present invention, as expressed in the following claims.