US20180165272 - Automatic locale determination for electronic documents

Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

Claims

1. A data processing method comprising:
receiving, at a server computer, an electronic document comprising a plurality of unknown-language data elements each associated with one or more types;
assigning to at least one unknown-language data element, of the plurality of unknown-language data elements, a weight value based on a type, of the one or more types, of the at least one unknown-language data element;
determining that a data value of the at least one unknown-language data element has matched to a data value of at least one known-language data element associated with a particular language;
based at least in part on the weight value assigned to the at least one unknown-language data element, determining a language confidence level value specifying a level of machine confidence that the document is expressed in the particular language.
2. The method of claim 1, further comprising selecting the at least one unknown-language data element from the plurality of unknown-language data elements based on at least one of: a document type of the document or a document schema of the document.
3. The method of claim 1, further comprising:
receiving the document as part of receiving a request to process the document, the request comprising one or more additional data elements;
selecting an additional data element that indicates possible language for the request, the additional data element assigned to a particular weight;
based on a data value of the additional data element and the particular weight, adjusting the language confidence level value for the document.
4. The method of claim 1, wherein the type of the at least one unknown-language data element is a data field name of the at least one unknown-language data element or the data value of the at least one unknown-language data element of the document.
5. The method of claim 1, wherein determining that the data value of the at least one unknown-language data element has matched to the data value of the at least one known-language data element comprises comparing the data value of the at least one unknown-language data element to data values of a plurality of known-language data elements.
6. The method of claim 5, wherein the comparing further comprises stemming the at least one unknown-language data element to match with the data values of the plurality of known-language data elements.
7. The method of claim 1, further comprising, based on the language confidence level value for the particular language exceeding a language threshold value, automatically processing the document using the particular language.
8. The method of claim 7, further comprising determining the language threshold value based on a maximum language confidence level value possible for the document.
9. The method of claim 7, further comprising determining the language threshold value based on a plurality of language confidence level values, for a plurality of languages, determined for the document that includes the language confidence level value.
10. A data processing method comprising:
receiving, at a server computer, an electronic document comprising a plurality of unknown-locality data elements;
selecting at least one unknown-locality data element of the plurality of unknown-locality data elements such that the at least one unknown-locality data element has a data value that can vary in formats based on an actual locality of the document;
based on a type of the at least one unknown-locality data element, assigning to the at least one unknown-locality data element a weight value;
based on a format of the data value, determining at least one possible locality with which the format of the data value of the at least one unknown-locality data element is associated;
based at least in part on the weight value of the at least one unknown-locality data element, determining a machine confidence level value for the at least one possible locality to be the actual locality of the document.
11. The method of claim 10, wherein the format of the data value is based at least on one of the following: a date format, a number format, or a currency value format.
12. The method of claim 10, further comprising, based on the machine confidence level value for the at least one possible locality exceeding a threshold value, automatically processing the document using the at least one possible locality.
13. A server computer system comprising:
one or more processors;
one or more storage media storing one or more computer programs for execution by the one or more processors, the one or more computer programs comprising instructions for:
receiving, at the server computer system, an electronic document comprising a plurality of unknown-language data elements each associated with one or more types;
assigning to at least one unknown-language data element, of the plurality of unknown-language data elements, a weight value based on a type, of the one or more types, of the at least one unknown-language data element;
determining that a data value of the at least one unknown-language data element has matched to a data value of at least one known-language data element associated with a particular language;
based at least in part on the weight value assigned to the at least one unknown-language data element, determining a language confidence level value specifying a level of machine confidence that the document is expressed in the particular language.
14. The system of claim 13, wherein the one or more computer programs further comprise instructions for selecting the at least one unknown-language data element from the plurality of unknown-language data elements based on at least one of: a document type of the document or a document schema of the document.
15. The system of claim 13, wherein the one or more computer programs further comprise instructions for:
receiving the document as part of receiving a request to process the document, the request comprising one or more additional data elements;
selecting an additional data element that indicates possible language for the request, the additional data element assigned to a particular weight;
based on a data value of the additional data element and the particular weight, adjusting the language confidence level value for the document.
16. The system of claim 13, wherein the type of the at least one unknown-language data element is a data field name of the at least one unknown-language data element or the data value of the at least one unknown-language data element of the document.
17. The system of claim 13, wherein the one or more computer programs further comprise instructions for comparing the data value of the at least one unknown-language data element to data values of a plurality of known-language data elements.
18. The system of claim 13, wherein the one or more computer programs further comprise instructions for, based on the language confidence level value for the particular language exceeding a language threshold value, automatically processing the document using the particular language.
19. The system of claim 18, wherein the one or more computer programs further comprise instructions for determining the language threshold value based on a maximum language confidence level value possible for the document.
20. The system of claim 18, wherein the one or more computer programs further comprise instructions for determining the language threshold value based on a plurality of language confidence level values, for a plurality of languages, determined for the document that includes the language confidence level value.
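
Illustrative sketch (not part of the claims). The weighted matching recited in claims 1 and 13, the threshold test of claims 7 and 12, and the format-based locality scoring of claims 10 and 11 might be approximated as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the weight table, the known-term dictionary, the two date patterns, the 0.5 threshold, and all function names are hypothetical and chosen only for illustration. Dividing by the maximum achievable score is one possible reading of the "maximum language confidence level value possible for the document" in claims 8 and 19.

```python
import re
from collections import defaultdict

# Hypothetical weights per element type (claim 4: field name vs. data value).
TYPE_WEIGHTS = {"field_name": 1.0, "field_value": 2.0}

# Hypothetical known-language data elements: data value -> language.
KNOWN_TERMS = {
    "invoice": "en", "total": "en",
    "rechnung": "de", "gesamt": "de",
    "facture": "fr",
}

# Hypothetical locality-specific date formats (claim 11 names date formats).
DATE_FORMATS = {
    "en-US": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "de-DE": re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$"),
}


def language_confidence(elements):
    """Score languages for (element_type, data_value) pairs.

    Each element gets a weight from its type; a match against a known term
    adds that weight to the term's language. Scores are divided by the
    maximum achievable score, so each confidence lies in [0, 1].
    """
    scores = defaultdict(float)
    max_score = 0.0
    for element_type, value in elements:
        weight = TYPE_WEIGHTS[element_type]
        max_score += weight
        language = KNOWN_TERMS.get(value.strip().lower())
        if language is not None:
            scores[language] += weight
    if max_score == 0:
        return {}
    return {lang: score / max_score for lang, score in scores.items()}


def locality_confidence(values, weight=1.0):
    """Score localities by which date format each data value follows."""
    scores = defaultdict(float)
    max_score = 0.0
    for value in values:
        max_score += weight
        for locality, pattern in DATE_FORMATS.items():
            if pattern.match(value.strip()):
                scores[locality] += weight
    if max_score == 0:
        return {}
    return {loc: score / max_score for loc, score in scores.items()}


def pick(confidences, threshold=0.5):
    """Return the best candidate only if its confidence exceeds the threshold."""
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    return best if confidences[best] > threshold else None


if __name__ == "__main__":
    doc = [("field_name", "Rechnung"), ("field_value", "Gesamt"),
           ("field_name", "Total")]
    print(pick(language_confidence(doc)))
    print(pick(locality_confidence(["31.12.2017", "01.02.2018", "12/31/2017"])))
```

In this sketch a document whose best candidate does not exceed the threshold yields None, which a caller could treat as a signal to fall back to manual review rather than automatic processing. The stemming of claim 6 and the request-level adjustment of claims 3 and 15 are deliberately omitted to keep the example short.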