
Does the amount of data affect the complexity of development? Accounting in the anthill

I recently had a discussion with a colleague about whether the amount of data affects the complexity of development.

The bottom line: for the developer, frankly, the conclusions were neither very cheerful nor clear-cut.
The discussion did not arise out of nowhere; it grew out of a real problem involving a simple computational algorithm but a large amount of data.

The purpose of this article is to share my experience of processing two linked lists of a billion entries each in a reasonable time.

Formulation of the problem


The data presented here are similar to those in the original problem; the changes are mainly in the domain terminology. Let's call the task "Accounting in the anthill".

For clarity, the story goes like this:
In a certain fairy-tale forest, the animals decided to conduct a census. For this, a simple accounting system was created in which all the animals were recorded, about a thousand in all. The system was built, the animals were assigned numbers, and numbers were also given to the burrows, dens, and nests. In short, everything turned out simple and quick.

Word of this reached the anthill, and the ants decided to develop their own accounting system, based on the animals' one.

As a result, each ant received a number as a name and a number for its home. Each ant was assigned a type: worker, soldier, and so on. The houses were divided by area of the anthill: center, top, bottom, and so on.

After the registration was completed and everyone at the ant assembly had happily celebrated the completion of the work, the chief soldier ant took the floor. He said (spreading all sorts of pheromones as he spoke, of course, so the ants would understand better) that he liked the census, that it was necessary and useful, and that lists of the soldier ants living in each part of the anthill now had to be sent to all parts of the anthill. This, as is customary in such cases, had to be done urgently and without delay.

At this point the other ant representatives began to fuss as well: they too needed lists, for distributing all kinds of worms and bugs and for holding other equally important events.

And there were a billion ants.


The data structure is shown in the figure:

[Figure: ER diagram]

The task: for each area of the anthill (hill_type), prepare a list of the ants living in it (ant_list), of a given type (ant_type), together with their house numbers (cell_list).
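For orientation, the five lists from the diagram can be sketched as plain C# types. The field names below are guesses derived from the list names in the text, not the actual CSV columns:

```csharp
// Hypothetical shapes of the five lists named above; the real column
// layout is defined by the CSV files that the test-data utility generates.
class AntType   { public int Id;  public string Name; }      // worker, soldier, ...
class HillType  { public int Id;  public string Name; }      // center, top, bottom, ...
class Ant       { public long Id; public int AntTypeId; }    // ant_list: number-name plus type
class Cell      { public long Id; public int HillTypeId; }   // cell_list: house plus area of anthill
class AntToCell { public long AntId; public long CellId; }   // link: which ant lives in which house
```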

Test data


For those interested, there is a utility for generating test data. It is a console program written in C# (Visual Studio Community 2013).

The utility's code is published on GitHub in the file anthill-makedata.cs. To get a working program in Visual Studio, create a console project, replace program.cs with anthill-makedata.cs, and build the project. This approach was chosen for compactness, clarity, and safety of distribution.

By default (without parameters), the utility prepares a set of test data (5 files in CSV format) with a thousand records. As a parameter, you can specify the desired data size, from a thousand up to a billion:

anthill-makedata.exe 1000000000

Preparing a billion-record test set takes up to 30 minutes and about 50 GB of disk space. The runtime can be checked in the log (anthill-makedata.log).
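As a rough illustration of what such a generator does, here is a scaled-down sketch. The ';' delimiter, the column order, and the type-code range are assumptions made for illustration; the real layout is defined by anthill-makedata.cs:

```csharp
using System;
using System.IO;
using System.Text;

class MakeDataSketch
{
    // Builds a toy version of anthill-ant-list.csv: one "id;typeCode"
    // line per ant. A fixed seed keeps the generated data repeatable.
    public static string BuildAntList(int count)
    {
        var sb = new StringBuilder();
        var rnd = new Random(42);                     // fixed seed: repeatable test data
        for (int id = 1; id <= count; id++)
            sb.AppendLine(id + ";" + rnd.Next(1, 5)); // type codes 1..4
        return sb.ToString();
    }

    static void Main()
    {
        // The real utility streams up to a billion such lines straight to disk.
        File.WriteAllText("anthill-ant-list.csv", BuildAntList(1000));
    }
}
```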

You can use FAR Manager to view the data.

Solution


At this point, interested readers may want to pause and think about how they would approach this problem.

During development it turned out that my grasp of the fundamentals was lacking, because neither my first nor my second approach achieved acceptable speed. Processing the complete data set went very, very slowly. The culprit was the lookup procedure, which ran quickly on small data but slowed down indecently as the data grew.

In the end a solution was found. Moreover, the hardware of my not very powerful computer turned out to be more than enough to process the complete data: CPU: Core i5, RAM: 12 GB (there were 8, I added 4), HDD: the simplest kind (about 5400 rpm).

The essence of the solution is to use an ordinary index array. For each list, an array of a billion elements is created. The cell index in the array corresponds to the object's number (ant or house), and the cells hold, respectively, the identifiers of the ant types and of the anthill-area types. Once the arrays of ants and houses are filled (from the test data files anthill-ant-list.csv and anthill-cell-list.csv), the link file anthill-ant-to-cell.csv is streamed through, and the result files (CSV) are written at the same time.

Accessing an array cell by index is practically instantaneous, since no time is spent searching. Most of the runtime is therefore spent reading and writing files (HDD I/O).
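The idea can be shown with a scaled-down sketch. This is an assumed model of the approach described above, not the actual anthill-makeresult.cs code; type codes, link layout, and sizes are made up for illustration:

```csharp
using System;

class IndexArrayDemo
{
    // Resolves ant type and area type by plain array indexing: the ID itself
    // is the array index, so each lookup is O(1) with no search or join.
    // Returns how many (ant, cell) links match the wanted type/area pair.
    public static int CountMatches(byte[] antType, byte[] cellType,
                                   long[][] links, byte wantedAntType, byte wantedArea)
    {
        int count = 0;
        foreach (long[] link in links)          // link[0] = antId, link[1] = cellId
        {
            if (antType[link[0]] == wantedAntType && cellType[link[1]] == wantedArea)
                count++;                        // real program: write the row to a result CSV
        }
        return count;
    }

    static void Main()
    {
        // IDs are dense integers starting at 1, so index 0 is unused.
        // In the real task these arrays hold a billion elements each and are
        // filled by streaming anthill-ant-list.csv and anthill-cell-list.csv;
        // the links come from streaming anthill-ant-to-cell.csv.
        byte[] antType  = { 0, 1, 2, 1, 2 };    // antType[antId]   = ant type code
        byte[] cellType = { 0, 1, 1, 2, 1 };    // cellType[cellId] = area-of-anthill code
        long[][] links = {
            new long[] { 1, 1 }, new long[] { 2, 2 },
            new long[] { 3, 4 }, new long[] { 4, 3 }
        };
        Console.WriteLine(CountMatches(antType, cellType, links, 1, 1)); // prints 2
    }
}
```

Note the `long` IDs: C# arrays accept `Int64` indices, which matters once object numbers approach a billion, and a `byte[]` of a billion type codes fits in about 1 GB of RAM, which is why 12 GB was enough.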

The solution was prepared as a console program written in C# (Visual Studio Community 2013).

The program's code is published on GitHub in the file anthill-makeresult.cs. To get a working program in Visual Studio, create a console project, replace program.cs with anthill-makeresult.cs, and build the project (note: in the project properties, in the Build section, uncheck the [Prefer 32-bit] checkbox). Run the program in the same directory as the test data files.

By default (without parameters), the program prepares the lists of soldier ants (CSV files). As a parameter, you can specify the desired ant type:

anthill-makeresult.exe 4 (4 is the worker type code)

Preparing the results on the complete data takes up to 30 minutes and, together with the test data, requires about 100 GB of disk space.

Brief summary


For the task considered here, the claim that the volume of data does not significantly affect the complexity of development turned out to be true.


There were, however, some minuses as well.

Source: https://habr.com/ru/post/264807/
