The need for complex processing of text data stored in ERP systems (and not only there) arises quite often. As an introduction, a few examples:
- Unification of the names in the commodity nomenclature
- Automatic assignment of formal attributes to goods based on their names or descriptions
- Conversion of postal addresses, both for unification and for formal structuring
- Determining the gender of a person by their name
- Extraction of information from notes on documents (for example, to automatically link a record from a bank statement to shipping documents)
- and so on (one can keep fantasizing for a long time)
Figuring out what we want
Over the years, having created many one-off solutions for the problems listed above and others like them, we came to the need to somehow generalize our approach to the whole set, with a margin for the future.
Since we will be working with texts that are entered with only minimal adherence to any rules, and more often with none at all, regular expressions and languages such as perl or awk are not applicable. This is described quite accessibly in an article devoted to a similar problem.
Our desires are as follows:
- We want to convert the texts in the database so that they are as unified as possible.
For example, there are two items: “Soap Fluid. PALMOL aromatherapy 300 ml” and “Palmolive liquid soap Vita-oil 300ml deep food”.
Instead of these two names we would like to see, say, these:
“Liquid soap Palmolive aromatherapy 300ml” and “Liquid soap Palmolive Vita-oil 300ml deep nutrition”.
- We really want to write fewer rules (especially since the syntax, as will be seen below, is no gift, for both objective and subjective reasons). Therefore, we will need some grouping facility to replace several variants in one stroke.
Say: palmolive << palmoliv | palmolv | palmmol.
- Marketers at many of our customers require a good classification of products by some 20 attributes, and the fitness clubs we automate complain when an employee records a visitor named Pyotr Ivanovich as female.
Consequently, it is very important to us that the program automatically assigns brands, group membership, and the strength and volume of alcoholic beverages to products, since the employees have already put all or much of this into the names.
In other words, a mechanism is needed to extract features from text objects and transform these features into formal attributes of entities.
- Sometimes, when replacing text or extracting formal features from it, it would be desirable to limit the context in which a rule applies.
The point is this: the word “powder”, for example, can mean either powdered sugar or cosmetic face powder (in English texts the ambiguity can be even greater). By limiting the context, we can interpret this word differently. Say, by restricting the context to a list of cosmetics brands, we can confidently classify occurrences of the word “powder” in a product name as face powder.
Getting down to the solution
In essence, the task comes down to defining language constructs and the actions on them. It is a pity to create the 1001st language, but we had to; thankfully, it is not that complicated.
Before talking about the language, I will define the concepts that it should implement:
- Token - the literal definition of a word or a delimiter character
- Token class - a grouping definition for a set of tokens that satisfy a built-in set of characteristics.
Examples of classes of tokens:
- a word is a sequence of characters without separators (a single separator is also a subset of this class)
- digital word - a sequence of decimal digits without delimiters
- digital word of a given length - a fixed-length sequence of decimal digits without delimiters
- ANY - a non-empty sequence of words
- Operation - an action that must be performed on any part of the text
There are not many operations:
- Replace - replace part of text with target expression
- Case change - a simple operation that sets the letter case of words to one of three variants:
- All lower case
- All caps
- Capital (first capital, followed by lowercase)
- Signal - the formation of a special well-structured string that is returned to the calling module in order to inform it that the text contains the corresponding construction, enabling the module to perform some task (at its own discretion) efficiently
- The source expression is the part of the source text on which it is necessary to perform any operation.
- The target expression is a construct that replaces the original expression or is an object of the signal.
- A group is a part of the source expression that can be referenced in the target expression. This concept is identical to the same concept in regular expressions.
- A cluster is a set of source expressions for which a common operation is defined.
- Variant - in simple terms, the OR operator. Used in source expressions to list different constructions to which the same operation should be applied. Strictly speaking, variant and cluster are almost synonyms; the differences between them are syntactic.
- A tuple is a named cluster, that is, a set of source expressions given a name, thanks to which it can be reused in different constructions. Naturally, tuples allow recursion: a reference to another tuple can be an element or sub-element of a tuple.
- Context is a set of conditions that a text must satisfy for a rule to be applied. We have not implemented this concept yet, but we know how it should work and will get to it soon.
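To make these definitions a bit more tangible, here is a minimal sketch of how the concepts could be modeled in code. The names and the structure are my illustration, not the actual engine's (which is not published); Python is used here and in the sketches below.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

# Hedged sketch: one possible in-memory model for the concepts above
# (invented names; the real engine's structures are not published).
class Operation(Enum):
    REPLACE = auto()       # replace the source expression with the target
    CASE_LOWER = auto()    # case change: all lower case
    CASE_UPPER = auto()    # case change: all caps
    CASE_CAPITAL = auto()  # case change: first capital, rest lowercase
    SIGNAL = auto()        # form a structured string for the calling module

@dataclass
class Rule:
    variants: List[str]           # the variants (OR) of the source expression
    operation: Operation
    target: Optional[str] = None  # target expression with %1..%9 group refs

@dataclass
class TupleDef:
    name: str            # referenced from other rules by this name
    variants: List[str]  # a named cluster: reusable source expressions
```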
In general, the text processing process consists of the following steps:
- Parsing a set of rules
- For each text to be processed by this set of rules:
- Converting the text into a chain of tokens: all letters are converted to lowercase and the text is split into tokens (words and delimiters); each token is dropped into a common pool (a hash table), and the text to be processed turns into a chain of integer identifiers, which is what we work with (see the sketch after this list).
Important: the processing rules go through a similar process of “digestion”, as a result of which a simple token match is established by an integer comparison.
- Indexing the text. I will describe the details of indexing below.
- Each rule is applied in turn to the resulting image of the text. If, as a result of applying a rule, the text has been transformed, the image is converted back into text and again into a chain of identifiers. Such a somersault may seem wasteful in terms of performance, but (for now) it is needed because the transformation is not fully reversible (in particular, because of the case-change operation), and also because the text has to be re-indexed.
If a rule defines a signal operation, a callback function is called with the resulting signal object.
- After the last rule has been processed, the final text is assembled and used at the discretion of the calling module (at its simplest it is written to a log, but more often it replaces the original text).
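Here is a minimal sketch of the interning step from this list (invented names, not the engine's code): every distinct lexeme receives an integer identifier from a common pool, so that matching a rule token against a text token becomes an integer comparison.

```python
# Sketch of the token pool described above (invented names, not the engine):
# lexemes are interned into integer ids; matching then compares integers.
class TokenPool:
    def __init__(self):
        self.ids = {}       # lexeme -> integer id (the "hash table" pool)
        self.lexemes = []   # integer id -> lexeme, to rebuild the text later

    def intern(self, lexeme):
        if lexeme not in self.ids:
            self.ids[lexeme] = len(self.lexemes)
            self.lexemes.append(lexeme)
        return self.ids[lexeme]

def to_chain(pool, text, split_lexemes):
    # Lowercase, split into lexemes (splitting rules: next section),
    # and turn the text into a chain of integer identifiers.
    return [pool.intern(lex) for lex in split_lexemes(text.lower())]

def to_text(pool, chain):
    # The reverse "somersault": identifiers back into a plain string.
    return "".join(pool.lexemes[i] for i in chain)
```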
Features of dividing text into lexemes
When splitting the source text into lexemes, in addition to the obvious splitting on spaces and other delimiters, there is an implicit split. Currently it is made between digit and non-digit characters.
Let me give an example: the text “model 900-a product” will be split into the following lexemes: “model”, “ ”, “900”, “-”, “a”, “ ”, “product”.
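Here is a short sketch of such a splitter (an illustration, not the actual engine's code) that reproduces the example, including the implicit digit/non-digit boundary:

```python
import re

# Illustrative splitter (not the engine's code): letter runs, digit runs and
# single delimiter characters each become a lexeme; separating digit runs from
# letter runs gives the implicit break between digit and non-digit characters.
def split_lexemes(text):
    return re.findall(r"\d+|[^\W\d]+|\W", text)

print(split_lexemes("model 900-a product"))
# -> ['model', ' ', '900', '-', 'a', ' ', 'product']
```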
The language
Now we can proceed to the description of the language. The description is not strict; I will concentrate on examples. This technology now works on the Universe-HTT server, nightly processing more than 1.3 million items of goods against a regularly updated set of rules. In order not to be unsubstantiated, I will take the examples from this set.
So.
General rules
The main part of the description language is the lexemes themselves, so almost all characters (spaces, tabs, and of course letters and digits) are significant. The end of a line is the end of a rule (or of a part of a cluster). There is no line continuation (no analogue of '\' in C macros).
Only leading and trailing spaces are not significant.
The main prefix of service constructs is %. For a literal percent character, the doubled %% is used.
Comments
The characters %- and everything following them to the end of the line are ignored as a comment.
Source expression
To represent the source expression, literal lexemes are used together with service constructs:
- %s one or more space or tab characters
- %z zero or more space or tab characters
- %_ an explicit space. Can be used at the end of a line (since trailing spaces are ignored)
- %t a tab
- %. an optional dot. Abbreviations can be written with or without a dot; this metacharacter was added so as not to produce extra rules.
- %- an optional hyphen. A controversial metacharacter. It was introduced because the same thing is often spelled both with and without a hyphen, but by itself it works poorly. Here is why: suppose we need to unify “chupa-chups” and “chupachups”. The construct “chupa%-chups” recognizes only “chupa-chups”: “chupachups” is a single lexeme that will not be recognized, and “chupa chups” contains a space that we did not take into account. Thus, to solve our problem we would have to write “chupa%z%-%zchups”, and the whole idea of brevity falls apart. However, there are cases when the optional hyphen works well: the junction of numeric and alphabetic lexemes.
- %^ start of the text. Important: not of a line, but of the entire text.
- %$ end of the text.
- %w any single token.
- %d a digital token of any length, that is, a continuous sequence of decimal digits.
- %D[0-9][0-9] a digital token of the specified length; the length specifier is exactly two digits. For example, “%D08” means eight consecutive decimal digits.
- %f a number with an optional decimal point. Both a continuous sequence of decimal digits and two such sequences separated by a dot fall under this pattern. A leading minus is not considered. If there are no digits after the dot, only the digits before it (without the dot) are taken into consideration. (A sketch pinning down these semantics follows this list.)
Examples:
“100” - “100”
“3.1415926” - “3.1415926”
“-3.R” - “3”
“2. 5” - “2”
- %( and %) denote the beginning and the end of a group, respectively. The most unpleasant construction from the point of view of syntax.
I will give an example of use:
“%(%d%),%(%d%)%>%1.%2” is a rule that turns two comma-separated numbers into a number with a decimal point. When I spoke of the troubles with this pair, I meant that %( and %) noticeably hurt the readability of a rule. Alas, plain brackets cannot be used (IMHO) because they occur far too often in live texts.
- %* an arbitrary, but not empty, sequence of tokens. A very useful metacharacter for rearranging expressions in a text and as a palliative for contexts, which have not yet been implemented.
Examples:
“paint %(%*%) for hair%>hair dye %1” - we unify the names of the whole range of hair dyes.
“bag %(%*%) elk%>bag %1 elkay plastics” - elk is an abbreviation of elkay plastics, but not everywhere; in the context of bags (bag), it is.
- %@tuplename. a reference to the tuple named tuplename (the dot after “tuplename” is significant: it separates the tuple reference from the subsequent text). Examples of using tuples are given below.
- %| Variant. Separates two variants of the source expression so that the same operator is applied to each of them.
Examples:
gift bag%<gift bags%|gft bg%|gft bgs%|gft/bg - “gift bags”, “gft bg”, “gft bgs” and “gft/bg” will be replaced with “gift bag”.
for modeling%<for modeler%.%|for modelers%. - the abbreviations “for modeler” and “for modelers” (with optional dots) will be expanded to the full form “for modeling”.
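To pin down the %f semantics promised above, here is a small illustration; it models the behavior described in the list and reproduces its examples, but it is not the engine's code.

```python
import re

# Illustrative model of the %f semantics (not the engine's code): a run of
# decimal digits, optionally followed by a dot and a second run of digits;
# a leading minus is not considered, and a dot with no digits after it drops.
def match_f(text):
    m = re.search(r"\d+(?:\.\d+)?", text)
    return m.group(0) if m else None

assert match_f("100") == "100"
assert match_f("3.1415926") == "3.1415926"
assert match_f("-3.R") == "3"    # leading minus ignored, bare dot dropped
assert match_f("2. 5") == "2"    # no digits directly after the dot
```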
Target expression
As already mentioned, the target expression is used either to replace the source expression, or to signal the encountered source expression.
The target expression, like the source expression, contains lexemes and service structures. Here the range of metacharacters is poorer:
- %[1-9] - a reference to a group in the source expression, by ordinal number, for substitution into the target (a sketch of how such a substitution plays out follows this list).
Example:
“mr%.%s%(%w%)%>mr.%1” - remove the space in combinations like “mr. x” between “mr.” and the surname, and add a dot after “mr” (if there is none).
- %_ - just an explicit space.
- %E - an empty string. Since the rule syntax does not allow omitting the target expression where one is expected, this metacharacter covers the need to simply remove the found source expression.
Example:
“%^%_%>%E” - remove a leading space from the text.
Operators
In the examples above I have already demonstrated the replacement operators %> (TO) and %< (FROM).
Here I will list the operators and give the necessary explanations (all of them can be seen in the examples of this article):
- %> (TO) - replace the source expression on the left with the target expression on the right
- %< (FROM) - replace each of the source variants listed on the right with the target expression on the left
- %A - case change; in the example below it converts the listed lexemes to all caps (the operation exists in the three variants described earlier, but only %A appears in the examples here)
- %= - attached to the closing %} of a cluster, gives the cluster a name, turning it into a tuple
- %! - signal: pass the string that follows, with groups substituted, to the calling module
Clusters
Clusters are needed to simplify the rule set.
Syntactically, a cluster is framed with the %{ and %} metacharacters at the beginning of lines, and the contents of the cluster are placed on the lines between them.
The cluster definition must be followed by an operator.
An example of a cluster with the case-change operator:
%{
edt
edp
edc
bic
lg
cd
usa
e%_
qvs
vs
vsop
xo
%}%A
Tuples
I have already dwelt on tuples in some detail. Here is an example of defining and using a tuple:
%{
%dml
%dl
%d
%d
%}%=volume
PET %1%<pet%z%(%@volume.%)%|(pet%.)%.%z%(%@volume.%)%|pet%.%z%(%@volume.%)%|(pet)%.%z%(%@volume.%)
In this example, the designation of plastic containers is unified before the bottle capacity is determined. In order not to list all possible units of capacity with a numeric value in each variant, we combined them into a tuple named volume. Note that we cannot simply write the replacement rule “PET%<pet%|(pet%.)%|pet%.%|(pet)%.”, because outside the context of container capacity these source expressions may have a completely different meaning.
The promised complex example of using signals
An enterprise that sells automobile tires exports the names of goods, together with stock levels and prices, from its corporate system to its own online store. In the office everyone is happy with the names and the rest of the classification of the goods. At the same time, on the web site a potential client should be able to choose tires by size, group and brand.
The following example demonstrates the reformatting of the item names that came from the office system on the server that manages the online store, with the subsequent assignment of the numeric parameters of the tire sizes. I should note that the example, although real, is reduced: here it only formalizes the sizes contained in the name into the format “w/h Rd”.
%{
%}%=tiretitle
%(%@tiretitle.%) %(%f/%d%)x%(%f%)%>%1 %2 R%3
%(%@tiretitle.%) %(%f%)x%(%f%)%>%1 %2 R%3
%(%@tiretitle.%) %(%f/%d%)%(R%f%)%>%1 %2 %3
%(%@tiretitle.%) %(%f/%d%)%(R%f%)%>%1 %2 %3
%(%@tiretitle.%) %(%f%)%(R%f%)%>%1 %2 %3
%(%@tiretitle.%) %(%f%)%(R%f%)%>%1 %2 %3
%(%@tiretitle.%) %(%f/%d R%f%) %>%1 %2c
%(%@tiretitle.%) %(%f/%d R%f%) c%>%1 %2c
%(%@tiretitle.%) %(%f R%f%) %>%1 %2c
%(%@tiretitle.%) %(%f R%f%) c%>%1 %2c
%@tiretitle. %(%d%)/%(%d%)%sr%(%f%)%!class=tire; gcdimx=%1; gcdimz=%2; gcdimy=%3
At the beginning, a tuple is defined that allows tires to be distinguished from the other goods stored in the database (its contents are not shown in this reduced example).
Next come the rules that (partially) reformat the names into a uniform view. Finally, the last directive, a signal, instructs the system to assign the class and the numeric classifiers to the goods that fall under the given format.
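On the receiving side this might look as follows: a sketch of a callback that parses the signal string into attributes. The "key=value; ..." layout follows the directive above; the callback interface and the storage call are invented for illustration.

```python
# Hedged sketch of a signal callback (the real interface is not published):
# the directive above produces a well-structured string such as
# "class=tire; gcdimx=195; gcdimz=65; gcdimy=15", which the calling
# module parses into formal attributes of the goods record.
def on_signal(item_id, signal):
    attrs = {}
    for part in signal.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    # Here the calling module would update the record, e.g. set the class
    # and the numeric tire dimensions (hypothetical storage call):
    # store.update(item_id, attrs)
    return attrs

print(on_signal(42, "class=tire; gcdimx=195; gcdimz=65; gcdimy=15"))
# -> {'class': 'tire', 'gcdimx': '195', 'gcdimz': '65', 'gcdimy': '15'}
```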
Performance
Given the huge number of text objects that have to be processed and the considerable number of rules applied to each of them (thousands and tens of thousands), the performance of the algorithms involved is not an idle question.
What we did to optimize execution speed
- First of all, as mentioned above, matching is carried out at the level of integer token identifiers. This in itself greatly simplifies the task, translating its dimension from the character lengths of the text and the rules into the counts of token chains.
- To reduce the constant cost of each iteration of matching a rule's source expression against the text, we had to get rid (as far as possible) of memory allocation and freeing. All buffers, once allocated, are reused in subsequent iterations.
- Finally, the most significant acceleration (a reduction of the task's dimension) is achieved by indexing the text to be processed. The indexing as such is nothing new: an index is built that maps each lexeme of the text to the positions where it occurs. Since we use generalized classes defined by metacharacters, the occurrences of many of them are indexed as well (down to numeric sequences of each of the 99 possible lengths).
Obviously, the acceleration comes from quickly establishing that a lexeme is absent from the text, or from jumping straight to the desired position in it.
We did not derive a theoretical estimate of the complexity of the resulting algorithm, but on a test performing a total of more than 4E9 matching operations the absolute running time improved by a third.
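For illustration, here is a sketch of such an index over the chain of token identifiers (invented names, not the engine's code): the absence of a rule's lexeme is detected immediately, and otherwise matching is attempted only at the recorded positions.

```python
from collections import defaultdict

# Illustrative index (not the engine's code): map each token id to the
# positions where it occurs; the real engine also indexes occurrences of
# the generalized classes (%w, %d, fixed-length digit runs, and so on).
def build_index(chain):
    index = defaultdict(list)
    for pos, token_id in enumerate(chain):
        index[token_id].append(pos)
    return index

def candidate_positions(index, first_token_id):
    # An empty list means the rule cannot match anywhere; otherwise
    # these are the only positions worth trying.
    return index.get(first_token_id, [])
```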
I confess I do not remember what it is called, but there is a classic algorithm for matching a set of patterns against a text that uses similar indexing.
In total, the previously mentioned processing of the goods directory on Universe-HTT, more than 1,300,000 items against a set of several thousand rules, takes three and a half hours. If someone can suggest what to compare this with, I will be grateful.
What's next?
Everything that has not been forgotten has been told. It remains to inform the public about what is still missing.
There are several important features that should be added to the described technology:
- Contexts (I wrote about them a little above). In general terms, the idea is to specify a set of expressions that must occur in the text for certain rules to become applicable.
- Stemming. The language needs a construct that would allow lexemes to be matched with stemming taken into account.
- The include directive to include files, for example, with prepared tuples (a list of cities, brands, etc.).
- A simplified-syntax directive. The percent characters are very hard on the eyes; in some cases it would be possible to enable a simplified syntax and use the bare characters ( ) { } < > ! as service characters, disabling the simplified syntax where there is a chance of confusion.