Generating large amounts of useful data

I want to share the experience of creating a mechanism for generating a large database of goods. With it, our users can generate in a few minutes more than a million of the same type, but different records.

We created a large inventory system. The number of goods promised to exceed several million and it became clear to us that manually adding products to the table would not be enough. Products are usually of the same type and differ only in parameters. For example:

The screw a galvanized cross of D 3 mm length is 16 mm for a tree
Screw black cross D 3mm length 20mm for ledger
The galvanized screw slot D 4mm length 16mm for metal
Screw black slot D 4mm length 20mm universal
...

Screws can be several thousand. For the “hardware” product group, we also need nails, bolts, nuts, washers, dowels ... In the example, it is clear that the structure of the name consists of several fields, each of which has several options of values. The whole set of screw names is a language (I remind you that a language is a set of strings) and this language can be described by a regular expression. Usually, a regular expression is used to answer the question: does a string belong to a language or not? In our case, we need to solve the inverse problem: on a regular expression, construct the entire set of strings or generate a language. That is, the user can enter something like
(|) D (2|3|3,5|4|5|6|7|8|9|10) (10|12|16|20|25|30|40|50|60|80|100) (|) ( | | |)
The program should generate all sorts of screw options. It should be noted that some regular expressions describe infinite languages, so not any expression can be used to generate data. For example, the expression [0-9]+ describes the language of non-negative integers and it is infinite.
To generate data, we use two methods.

Generation with Automation

The OpenSource Automation package can work with finite automata, which is what a regular expression is. Among other features, the library is able to generate strings that match the specified regular expression. Then the matter of technology is to generate all the options and put them in the database.
However, this method has a significant drawback - the speed of work. It works well on several thousand lines, but it can take several hours for millions of records (mainly due to the slow insertion of one line into the database). Therefore, a second method was developed.

Regular Expression SQL Generation

Imagine what happens if you execute a SQL query (MySQL dialect)

 select concat('  ', t1.a, '  ', t2.a, '') from ( SELECT '3' as a union select '4' as a ) as t1, ( SELECT '16' as a union select '20' as a ) as t2

The result will be 4 rows, which can be immediately inserted into the table:

 The screw diameter 3mm length 16mm
 The screw diameter 4mm length 16mm
 The screw diameter 3mm length 20mm
 The screw diameter 4mm length 20mm

This simple example shows that you can write a SELECT query and use it as part of an INSERT query to insert the necessary rows into the database. To generate such SQL, we wrote a simple regular expression parser. It simply selects the brackets inside the expression, then uses the same Automation to generate all sorts of strings inside the brackets. That is, for example, for the expression
(3|4) (16|20)
The program selects the expressions in brackets 3|4 and 16|20 , then with the help of Automation for each pair of brackets generates various values and generates a SQL query.
Such a query runs much faster than inserting one entry at a time and is able to add millions of lines in a few seconds.
')

Limiting the generated data

The last problem remained. Imagine a screw with a thickness of 2 mm and a length of 100 mm. There is no such. We needed a mechanism for limiting some combinations of parameters. The easiest way is to ask the user to write a function in some scripting language that accepts all parameters and returns true / false. But most likely it would meet misunderstanding from users. Therefore, we decided to use the Pivot Table so that the user can select the desired combination of parameters, which we then simply add to the WHERE condition within the SELECT query.

The illustration shows a pivot table, with the columns open and the cells highlighted in yellow. Blue color indicates products that have already been added to the database, green, those that have not yet been added. It can be seen that all black screws for wood have already been added, and half (110/220) of all galvanized screws for metal has been added (in order to understand exactly what, you need to expand the “Material for metal” column in the summary table). Also, the user has selected and is preparing to add black GK screws for a Phillips screwdriver and galvanized ones for a slotted screwdriver.
I omit the details of how we ensured the removal of erroneously generated data and the processing of re-adding existing records. This is more related to the subtleties of working with MySQL (or another database) and does not correspond to the topic of the post.
This mechanism is well established. Users were able to master it and use it in their work. At the moment, the largest volume of the database of goods from one of our clients exceeds 50 million records. Most of this volume was created in the first month of work.

Source: https://habr.com/ru/post/195624/

All Articles

Generating large amounts of useful data

Generation with Automation

Regular Expression SQL Generation

Limiting the generated data

More articles: