I recently had a discussion with a colleague about whether the amount of data affects the complexity of development.
The bottom line was this:
- The amount of data should not significantly affect the complexity of development. The main complexity usually lies in the data-processing algorithm, not in the volume of data. Knowing the actual data volume in advance, it is enough to develop code that works on small data and then apply it to the required volume.
- All the basic computational algorithms have been known for a long time (for several decades at least). The key is to pick the right approach to the task as early as possible, before development starts. This is not a question of effort but of professional competence: the fundamentals have to be studied in advance, so that development itself goes quickly.
- No customer will understand why a few hundred lines of code took a long time to develop. It is easier for the customer to change the team than to invest time and money in someone's learning process or in some incomprehensible experiment.
- Some small overhead related to the amount of data is, of course, possible. But such costs usually fall within the margin of error of the initial (correct) effort estimate, so there is no point in accounting for them separately.
Frankly, for the developer these conclusions turned out to be not very cheerful, but quite unambiguous.
The discussion did not arise out of nowhere: it came up while we were discussing a problem with a simple computational algorithm but a large amount of data.
The purpose of this publication is to share the experience of processing, in a reasonable time, two linked lists of a billion records each.
Formulation of the problem
The data presented here are similar to the data in the original problem; the changes mainly affected the applied terminology. Let's call the task "Accounting in the anthill".
For clarity, the story goes like this. In one fairy-tale forest, the animals decided to take a census. For this, a simple accounting system was created in which all the animals were recorded - about a thousand in total. The system was built, the animals were assigned numbers, and numbers were also given to the burrows, dens, and nests. In short, everything turned out simple and quick.
The anthill heard about this and decided to develop its own accounting system, based on the one for the animals.
As a result, each ant received an ID number and a home number. Each ant was assigned a type: worker, soldier, and so on. Homes were divided by anthill area: center, top, bottom, and so on.
After registration was complete and everyone at the ant meeting was happily celebrating the end of the work, the chief soldier ant took the floor. He said (spreading all sorts of pheromones as he spoke, of course - that way it is clearer to ants) that he liked the census, that it was necessary and useful, and that now lists of the soldier ants living in each part of the anthill had to be sent out to all parts of the anthill. This had to be done, as is customary in such cases, urgently and without delay.
At that, the other ant representatives began to fuss as well - they also needed lists, for distributing all kinds of bugs and worms and for other equally important events.
And there were a billion ants.

The data structure is shown in the figure:
- ant_type - ant types (1 - queen, 2 - larva, 3 - nanny, 4 - worker, 5 - soldier)
- hill_type - anthill areas (1 - center, 2 - top, 3 - bottom, 4 - north, 5 - south, 6 - east, 7 - west)
- ant_list - the list of ants (a billion records)
- cell_list - the list of homes (a billion records)
- ant_to_cell - which ant lives in which home (a billion records)
The task: for each anthill area (hill_type), prepare a list of the ants of a given type (ant_type) living in it, together with their house numbers (cell_list).
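For orientation, the two reference tables translate directly into code. A minimal C# rendering (the enum and member names are mine; the numeric codes are the ones listed above):

```csharp
// Reference table ant_type (codes as listed above).
enum AntType : byte { Queen = 1, Larva = 2, Nanny = 3, Worker = 4, Soldier = 5 }

// Reference table hill_type (codes as listed above).
enum HillType : byte { Center = 1, Top = 2, Bottom = 3, North = 4, South = 5, East = 6, West = 7 }
```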
Test data
For those interested, there is a utility for generating test data. The utility is a console program written in C# (Visual Studio Community 2013).
The utility code is published on GitHub in the file anthill-makedata.cs. To get a working program in Visual Studio, create a console project, replace program.cs with anthill-makedata.cs, and build the project. This approach was chosen for compactness, clarity, and safety of distribution.
By default (without parameters), the utility prepares a set of test data (5 files in CSV format) of a thousand records. The desired data size - from a thousand to a billion - can be passed as a parameter:
anthill-makedata.exe 1000000000
Preparing a billion test records takes up to 30 minutes and about 50 GB of disk space. The runtime can be checked in the log (anthill-makedata.log).
You can use FAR Manager, for example, to view the data.
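The real generator lives in anthill-makedata.cs; purely as an illustration of the idea, a stripped-down sketch might look like the following. The semicolon separator, the exact column layout, and the random ant-to-cell assignment are my assumptions, and the two small reference files are omitted:

```csharp
using System;
using System.IO;

class MakeData
{
    static void Main(string[] args)
    {
        // Record count: 1,000 by default, up to 1,000,000,000 via the first argument.
        int n = args.Length > 0 ? int.Parse(args[0]) : 1000;
        var rnd = new Random();

        using (var ants = new StreamWriter("anthill-ant-list.csv"))
        using (var cells = new StreamWriter("anthill-cell-list.csv"))
        using (var links = new StreamWriter("anthill-ant-to-cell.csv"))
        {
            for (int i = 1; i <= n; i++)
            {
                ants.WriteLine("{0};{1}", i, rnd.Next(1, 6));   // ant id; ant_type 1..5
                cells.WriteLine("{0};{1}", i, rnd.Next(1, 8));  // cell id; hill_type 1..7
                links.WriteLine("{0};{1}", i, rnd.Next(n) + 1); // ant i lives in a random cell
            }
        }
    }
}
```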
Solution
At this point, those who are interested can stop and think about how they would approach this problem.
During development it turned out that my grasp of the fundamentals was not as solid as I had thought: neither the first nor the second approach achieved acceptable speed. Processing the complete data crawled along very slowly. The reason was the lookup procedure: it ran quickly on small data, but as the data grew, everything slowed down indecently - when a lookup is performed for each of a billion records, even a small per-record cost adds up to hours, and anything worse than constant time is hopeless.
In the end, a solution was found. Moreover, the hardware of my not very powerful computer turned out to be more than enough for the complete data: CPU - Core i5, RAM - 12 GB (there were 8, I added 4), HDD - the most basic kind (around 5400 RPM).
The essence of the solution is to use an ordinary indexed array. For each list, an array of a billion elements is created. The cell number in the array corresponds to the object number (ant, house). The array cells hold, respectively, the ant type identifiers and the anthill area identifiers. After the arrays of ants and houses are filled (from the test data files anthill-ant-list.csv and anthill-cell-list.csv), the link file anthill-ant-to-cell.csv is streamed through once and the result files (CSV) are written at the same time, as in the sketch below.
Accessing an array cell by index is effectively instantaneous, since no time is spent searching. As a result, most of the running time goes to reading and writing files (HDD work).
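The actual program is anthill-makeresult.cs; what follows is only a minimal sketch of the core idea. The input file names match the ones above, while the semicolon separator, the column layout, and the result file names are my assumptions; error handling and logging are omitted, and the data are assumed to be consistent:

```csharp
using System;
using System.IO;

class MakeResult
{
    const int N = 1000000000; // maximum object number (a billion)

    static void Main(string[] args)
    {
        // Ant type to select: 5 (soldier) by default, or taken from the first argument.
        byte wantedType = args.Length > 0 ? byte.Parse(args[0]) : (byte)5;

        // Index arrays: the array index is the object number, so no searching is needed.
        // Two arrays of a billion bytes each take ~2 GB of RAM,
        // which is why the 32-bit build has to be disabled.
        var antType = new byte[N + 1];   // antType[antId]   = ant_type code
        var cellArea = new byte[N + 1];  // cellArea[cellId] = hill_type code

        // Fill the index arrays from the two object lists.
        foreach (var line in File.ReadLines("anthill-ant-list.csv"))
        {
            var f = line.Split(';');
            antType[int.Parse(f[0])] = byte.Parse(f[1]);
        }
        foreach (var line in File.ReadLines("anthill-cell-list.csv"))
        {
            var f = line.Split(';');
            cellArea[int.Parse(f[0])] = byte.Parse(f[1]);
        }

        // One result file per anthill area (hill_type codes 1..7).
        var result = new StreamWriter[8];
        for (int area = 1; area <= 7; area++)
            result[area] = new StreamWriter("anthill-result-" + area + ".csv");

        // Single pass over the link file: two direct array accesses per record.
        foreach (var line in File.ReadLines("anthill-ant-to-cell.csv"))
        {
            var f = line.Split(';');
            int antId = int.Parse(f[0]);
            int cellId = int.Parse(f[1]);
            if (antType[antId] == wantedType)
                result[cellArea[cellId]].WriteLine("{0};{1}", antId, cellId);
        }

        for (int area = 1; area <= 7; area++)
            result[area].Close();
    }
}
```

File.ReadLines streams the input lazily, so apart from the two index arrays the memory footprint stays small, and the whole job reduces to sequential reads and writes.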
The solution was prepared as a console program written in C# (Visual Studio Community 2013).
The program code is published on GitHub in the file anthill-makeresult.cs. To get a working program in Visual Studio, create a console project, replace program.cs with anthill-makeresult.cs, and build the project (note: in the project properties, under BUILD, uncheck the [Prefer 32-bit] checkbox, since two billion-element arrays do not fit in a 32-bit address space). Run the program in the same directory as the test data files.
By default (without parameters), the program prepares lists of soldier ants (CSV files). The desired ant type can be passed as a parameter:
anthill-makeresult.exe 4 (4 is the worker type code)
Preparing the results on the complete data takes up to 30 minutes. Together with the test data, this requires about 100 GB of disk space.
Brief summary
For the task at hand, the claim that the volume of data does not significantly affect the complexity of development turned out to be true:
- The finished solution turned out simple and fits in 300 lines of code, including error handling, logging, and comments.
- The solution works the same way on small data and on large data (up to the specified size).
- The main bottleneck is not the data processing but the I/O operations (slow HDD).
On the minus side:
- The approach is not universal. A different data structure or a different applied data format, for example, could significantly affect the solution.
- Working with CSV files is not always convenient, is not protected against errors, and does not guarantee data integrity.