
Processing text objects in ERP systems

The need for complex processing of text data stored in ERP systems (and elsewhere) arises quite often. A few examples by way of introduction:







Deciding what we want



Over the years we built many one-off solutions for the problems listed above and for similar ones, and eventually saw the need for a single, general approach that covers the whole set and leaves room for future cases.



Since we will be working with text that is entered with minimal observance of any rules, and more often with none at all, regular expressions and languages such as Perl or awk are not an option. The reasons are explained quite accessibly in an article on this same site devoted to a similar problem.

Our requirements are as follows:





Starting on a solution



In essence, the task comes down to defining language constructs and the actions performed on them. It is a pity to create the 1001st language, but I had to. Fortunately, it is not very complicated.



Before talking about the language, I will define the concepts that it should implement:





In general, text processing consists of the following steps:





Features of dividing text into lexemes


When the original text is split into lexemes, there is an implicit split in addition to the obvious split on spaces and other delimiters. At present it is made between digit and non-digit characters.

Let me give an example: the text “model 900-a product” will be divided into the following lexemes: “model”, “900”, “-”, “a”, “”, “product”.
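
The splitting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the function name `split_lexemes` is hypothetical, every non-alphanumeric character is treated as a standalone delimiter lexeme, and whitespace is dropped rather than emitted as a lexeme of its own.

```python
def split_lexemes(text: str) -> list[str]:
    """Split text into lexemes (illustrative sketch, not the real engine).

    Assumptions: whitespace only separates lexemes and is discarded;
    any other non-alphanumeric character becomes a lexeme by itself;
    an implicit boundary is placed between digit and non-digit characters.
    """
    lexemes: list[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            lexemes.append(current)
            current = ""

    for ch in text:
        if ch.isspace():
            flush()                      # explicit separation by spaces
        elif not ch.isalnum():
            flush()
            lexemes.append(ch)           # a delimiter is its own lexeme
        else:
            # implicit separation between digits and non-digits
            if current and current[-1].isdigit() != ch.isdigit():
                flush()
            current += ch
    flush()
    return lexemes


print(split_lexemes("model 900-a product"))
# ['model', '900', '-', 'a', 'product']
```

With these assumptions a fused string such as "abc123def" is also split into three lexemes, which is the effect of the implicit digit/non-digit boundary.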



The language



We can now move on to describing the language. The description is not rigorous; I will focus on examples instead. This technology currently runs on the Universe-HTT server, where every night it processes more than 1.3 million goods items against a continually updated rule set. To keep the discussion concrete, I will draw my examples from that set.



So.

General rules


The bulk of the description language consists of the lexemes themselves, so almost every character (spaces, tabs, and of course letters and digits) is significant. The end of a line is the end of a rule (or of one part of a cluster). There is no line continuation (no analogue of '\' in C macros).

Only leading and trailing spaces are not significant.

The main prefix of service constructs is %. To represent a literal percent sign, the doubled form %% is used.



Comments


The characters %- and everything after them up to the end of the line are ignored as a comment.
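
As a rough illustration of how the general rules and the comment rule interact, here is a minimal Python sketch of cleaning a single rule line. The function name `clean_rule_line` is hypothetical, and the sketch assumes that the comment marker is a percent sign followed by a hyphen and that every service construct starts with % plus one more character. The doubled %% is skipped as a unit so that a literal percent sign followed by a hyphen is not mistaken for a comment.

```python
COMMENT_CHAR = "-"  # assumption: a comment starts with "%" followed by "-"


def clean_rule_line(line: str) -> str:
    """Drop the comment and the insignificant leading/trailing spaces.

    Sketch only: assumes each service construct is "%" plus one character,
    so "%%" (literal percent) is consumed as a pair and never starts a comment.
    """
    i = 0
    while i < len(line):
        if line[i] == "%" and i + 1 < len(line):
            if line[i + 1] == COMMENT_CHAR:
                line = line[:i]          # cut the comment off
                break
            i += 2                       # skip "%%" or another service construct
        else:
            i += 1
    # Only spaces are stripped: tabs are significant according to the rules.
    return line.strip(" ")


print(clean_rule_line("  PET%<pet %- unify container names  "))
# PET%<pet
```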



Source expression


The source expression is written using literal lexemes together with service constructs:





Target expression


As already mentioned, the target expression is used either to replace the source expression or to signal that the source expression has been encountered.

The target expression, like the source expression, contains lexemes and service constructs. Here the set of metacharacters is smaller:







Operators


In the examples above, I have already demonstrated the use of the replacement operators %> (TO) and %< (FROM).

Here I will list all possible operators and give the necessary explanations:





Clusters


Clusters are needed to simplify the rule set.

Syntactically, a cluster is delimited by the %{ and %} metacharacters at the beginning of lines, and the contents of the cluster are placed on the lines between them.

After the cluster is defined, an operator should follow.

An example of a clustered case-change operator:

 %{
 edt
 edp
 edc
 bic
 lg
 cd
 usa
 e%_
 qvs
 vs
 vsop
 xo
 %}%A
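
To make the cluster layout concrete, here is a small Python sketch that reads a cluster from a list of rule lines. It assumes one cluster member per line, with %{ alone on the opening line and the operator following %} on the closing line; the function name `parse_cluster` is hypothetical and the sketch does not interpret the operator itself.

```python
def parse_cluster(lines: list[str]) -> tuple[list[str], str]:
    """Return (members, operator_text) for the first cluster in `lines`.

    Sketch under assumptions: "%{" stands alone on its line, each following
    line is one member, and the closing line starts with "%}" followed by
    the operator that applies to the whole cluster.
    """
    members: list[str] = []
    inside = False
    for raw in lines:
        line = raw.strip(" ")            # leading/trailing spaces are insignificant
        if not inside:
            if line == "%{":
                inside = True
            continue
        if line.startswith("%}"):
            return members, line[2:]     # whatever follows %} is the operator
        if line:
            members.append(line)
    raise ValueError("no complete cluster found")


members, operator = parse_cluster([" %{", " vs", " vsop", " xo", " %}%A"])
print(members, operator)
# ['vs', 'vsop', 'xo'] %A
```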




Tuples


I have already covered tuples in some detail. Here is an example of defining and using a tuple:

 %{
 %dml
 %dl
 %d
 %d
 %}%=volume
 PET %1%<pet%z%(%@volume.%)%|(pet%.)%.%z%(%@volume.%)%|pet%.%z%(%@volume.%)%|(pet)%.%z%(%@volume.%)


In this example, the notation for plastic (PET) containers is unified ahead of determining the bottle capacity. To avoid listing every possible capacity unit together with a numeric value, we combined them into a tuple named volume. Note that we cannot simply write the replacement rule “PET%<pet%|(pet%.)%|pet%.%|(pet)%.”, because outside the context of container capacity the source expressions may mean something entirely different.



The promised complex example using signals


A company that sells car tires exports product names, along with stock balances and prices, from its corporate system to its own online store. In the office, everyone is happy with the existing names and classifications. On the website, however, a potential customer should be able to select tires by size, group and brand.

The following example shows how product names received from the office system are reformatted on the server that runs the online store, after which the numeric tire-size parameters are assigned. One caveat: the example is real but abridged; here it only normalizes the sizes contained in the name to the format “w/h Rd”.



 %{
 %}%=tiretitle
 %(%@tiretitle.%) %(%f/%d%)x%(%f%)%>%1 %2 R%3
 %(%@tiretitle.%) %(%f%)x%(%f%)%>%1 %2 R%3
 %(%@tiretitle.%) %(%f/%d%)%(R%f%)%>%1 %2 %3
 %(%@tiretitle.%) %(%f/%d%)%(R%f%)%>%1 %2 %3
 %(%@tiretitle.%) %(%f%)%(R%f%)%>%1 %2 %3
 %(%@tiretitle.%) %(%f%)%(R%f%)%>%1 %2 %3
 %(%@tiretitle.%) %(%f/%d R%f%) %>%1 %2c
 %(%@tiretitle.%) %(%f/%d R%f%) c%>%1 %2c
 %(%@tiretitle.%) %(%f R%f%) %>%1 %2c
 %(%@tiretitle.%) %(%f R%f%) c%>%1 %2c
 %@tiretitle. %(%d%)/%(%d%)%sr%(%f%)%!class=tire; gcdimx=%1; gcdimz=%2; gcdimy=%3




The example begins by defining a tuple that distinguishes tires from other products stored in the database.

Next come the lines that (partially) bring the names into a uniform format. Finally, the last line, a signal directive, instructs the system to assign a class and numeric classifiers to every item that matches the given format.



Performance



Given the huge number of text objects to be processed and the considerable number of rules applied to each of them (thousands and tens of thousands), the performance of the algorithms involved is far from an idle question.

What we have done to optimize execution speed:




Overall, the previously mentioned processing of the Universe-HTT goods directory, which contains more than 1,300,000 items, against a set of several thousand rules takes three and a half hours. If anyone can suggest what to compare this with, I will be grateful.
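
To put these figures into per-item terms, the arithmetic can be done directly. The short Python snippet below uses only the two numbers quoted above and treats 1,300,000 as a lower bound on the item count, so the throughput it prints is a lower bound as well.

```python
items = 1_300_000            # "more than 1,300,000 items", taken as a lower bound
seconds = 3.5 * 3600         # three and a half hours = 12,600 seconds

items_per_second = items / seconds
ms_per_item = 1000 * seconds / items

print(f"{items_per_second:.0f} items/s, {ms_per_item:.1f} ms per item")
# 103 items/s, 9.7 ms per item
```

In other words, the whole rule set is applied to one item in roughly ten milliseconds.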



What's next?



Everything I have not forgotten has now been told. It remains to tell the public what is still missing.

There are several important features that should be added to the described technology:

Source: https://habr.com/ru/post/202502/


