All-Russian population census: how your data is compressed
I have been working with the recognition and processing of population censuses and agricultural censuses since the 2000th year. This is the case when you have been writing software for more than a year, which should work once, but without errors.
Why? In the all-Russian population census in 2010, 500 thousand people and 10 thousand more IT-users participated in all regions of the Russian Federation. The scanner takes 150 sheets per minute. Real-time recognition at about the same speed. Multiply by the number of scanners in the country - and get the data stream, where any bug immediately destroys the work of a huge number of people. ')
And the second point - together with the Research Institute of Statistics, we conduct research work on data recovery algorithms.
How is the census going
If this is the All-Russian population census, about half a million people (most often students) bypass all people in the country. The task is to reach everyone and ask a series of questions, the answers are recorded on paper on a special machine-readable form. If agricultural census - people walk less, but still. Here, for example, is the standard census taker of the agricultural census, with whom he walks on his land:
Next - get from these forms tens of millions of tables, each of which has specific data on the regions, important for services of different levels.
That is, the procedure is as follows: • Prepare lists of surveyed objects and split them into areas for scribes; • Collect data physically, "feet." • Load machine-readable documents into a streaming scanner that quickly and gently flips through them. • Recognize what is recognized (and here, for a moment, handwritten handwriting). • Make several corrections on what was not recognized, so that the operator can finish the data from the forms with his hands. • Once again check the data for consistency with each other by logic (grandfather can not be younger than his son, and so on). • Collect a common database from across the country. • If necessary, load this database into the analytics system so that the customer can make atypical reports and cut an unreal sea of ​​reports from it. • Deliver scans of paper forms by secure mail; • Arrange storage of paper forms in the field.
Many participants-census operators see a computer for the first time in their lives (I’m not exaggerating, we both had a habit of moving the mouse and getting used to it, and many more things happened in the villages). Plus, not everyone understands the census procedure, there are many non-trivial operations. Naturally, this causes a sharp increase in the load on support, which is highly undesirable on peak days. Therefore (even though we were not asked to do this), we recorded a 40-minute instructional video explaining all aspects of how to do the census step by step. Here is a short excerpt from 2004 (as they used to write on pirated discs - “voiced by professional programmers”):
On the other hand, former agronomists and chairmen of cooperatives are interviewed at agricultural censuses. They are keenly versed in the subject, and are interested in the result, because they themselves have repeatedly used the collected data in their work. It is very pleasant to work with these people. They often also do not understand where to feed the computer, but they are not afraid to ask questions and learn. And they also have a damn important property for data integrity - they can tell by the grandmother how many piglets she has, and if she didn’t have one of the scribes. By the way, about the deep knowledge of the topic - not all testers knew that several hectares of cannabis were grown in one of the regions. Because it is the most valuable strategic raw material. For medicine and light industry.
For the following such censuses on the agricultural subject, the customer generally wants to get rid of the paper: distribute the tablets to the representatives so that the data will be filled in immediately. There, of course, there are features with personal data - you need to come up with a solution that prevents leakage even during routing, but this is all solved.
Implementation
I'll start a little from the end. Given the size of the database, a suitable solution is Microsoft SQL + Microsoft OLAP. When we started working with MS OLAP for generation, we had very little experience, but we had faith in ourselves and the will to win. But then never regretted. This scale of projects in Microsoft OLAP in the world are very few. Naturally, we walked around the rake and ran into errors that could not be identified in the tests - the developers simply did not have a living base of such size and a pair of powerful data centers next to each other, grinding the data. By the way, the Rosstat data center.
The entire primary is processed locally, the data is checked for completeness and consistency. Then the data gets to the data center in Moscow in two ways:
Processed in digital form - via VPN from operators workstations.
Scans of paper originals - by courier mail. From the disks everything is loaded into the database already here. Physically, all this lies in protected premises, the postal system of this class itself is designed even for sending top-secret documents.
So, we get about 6 TB of raw data for processing, from which a database of 500 GB is obtained. This level requires data recovery to be representative. For example, in the district there were about 2 thousand people who participated in the census and 15 “refusers” who were not found or who were not reached for other reasons. It is logical to assume that statistically (and we are only interested in large numbers), they will on average correspond to the other inhabitants of the region. This is a very simplified example of how data is restored. In practice, together with the research institutes, we confirmed the following hypothesis with a series of experiments: if you take a sufficiently large array of answers, where everything is filled (real census data of previous years), then randomly delete up to 10% of answers, and then restore the data, then the results in the final cuts should differ by no more than tenths of a percent.
Many solutions are used - from searching through a database of similar profiles (for example, we know the gender and age structure of the family of farmers, which has not been surveyed - the algorithm will look for similar families in regions with similar conditions and rely on them, etc.). In practice, only in our country there is a ready mechanism for working with such algorithms. The same research institute working with statistics cannot - it does not have enough data center capacity to parse huge databases.
Another important component of report processing is the special BI of our Australian colleagues working with Big Data. An important feature is privacy protection . The first layer is the inability to upload reports, where it is possible to get to specific numbers per person. No matter how hard you try, the internal processing unit is 3 people. Another special analytics ensures that it is impossible to upload a report containing a matrix corresponding to another matrix with similar data. Because cunning pentesters in the discussion of protection learned how to subtract one matrix from another in order to get specifics on people. Now this is followed by a special mechanism. BI is called SuperStar.
Data in the region
Unlike the elections, when the residents themselves come to the polling stations (and if someone does not come, it's okay) you need to go to each census and get the most complete data. Ok, the student collected the papers, filled them in as properly as possible, checked and brought them to the district center. Then they are under the protection of the police (police) in the territorial statistical bodies, where there is a scanner of machine-readable documents. From the scanner paper go under protection.
Papers come tied to the plots. For example, "here is the package, here are 400 people, this is a village such and such." The system of division into accounting units was built up in the USSR, it works like a clock.
Further comparison of the completeness of the data is a difficult job, which allows us to understand, for example, according to the data of the grandfather's questionnaire with three grandchildren, that these grandchildren must be somewhere, and if there are not, then something went wrong. On such a procedure, for example, we found a single camel in the Chelyabinsk region. A little crazy, they thought a bug, asked to check - there really is someone holding a camel. More often there are situations like filling errors - there are two cows, five of them are dairy ones. With tablets it will be easier, there at the UI level there will be many checks.
The input complex is one of the interesting parts. At first our Russian promscanners stood, as in the photo, but at the last census foreign ones were already used. 150 sheets per minute. World practice - to give further to the discriminating machine, then to the verifier. Three cars are a wild luxury, so we collect one PACK, where right during the scan the operator can see the data on the screen and edit what the system could not “chew”.
Naturally, different handwritings cause the greatest difficulty at this stage. We, fortunately, have a lot of reference data - there are plenty of labels on machine-readable documents that allow us to accurately determine the direction of the text, where it is on the page, and so on. Where there should be numbers, where the name of the village and so on, which reduces the number of hypotheses. Therefore, we were able to drive into recognition not only more or less printed numbers, but also a lot of handwriting samples. In the first censuses, we collected a database of the most common handwriting features and were able to successfully recognize the vast majority of handwritten texts on our forms.
The “help robots” training screen: smaller loops, lines without a break, if possible, do not trace the numbers a second time, try not to leave the field.All the same, there are bad options, but after learning them much less.
As a result, quite a bit, significantly less than a percent, you need to rule with your hands. A special database of poorly recognized documents is being collected, which is being pursued by operators.
Then - another check, this time physical. There should be a kilogram of documents, judging by the mass - not enough 20 pieces of paper. Under the table is not forgotten?
Then a formal logical control, the establishment of data connections.
And only then sending.
Result
Due to the automation of almost every step, we have reduced the number of necessary personnel very significantly. For example, even the same routing sheet is automatically compiled, which optimizes the walking time.
The staff in such events is the most expensive pleasure, and even 5-7 days of work of the TierIII data center in comparison with this are pennies.
Setting tasks on such projects is very, very unusual. The customer is well aware of their specifics, ready to explain - but does not think in terms of development. The first time we received a 700-page brick - almost an artistic text as a TK, which the analyst turned into requirements. The second time and further, the customer has already begun to understand how to explain this to us, and we began to deeply understand the topic and understand their jargon. Practice shows that it is worthwhile to take, for example, the lead tester after receiving the task, but not before, and everything, somewhere, he will learn to ignore the specifics. For deep knowledge of the topic we are very much appreciated - this is the key to developing such solutions.
In a short time, we shovel a bunch of data. There is no chance to repeat the procedure, so huge budgets are spent on testing. We even recruit specially trained collective farmers retired whose task is to mischief as much as possible. We are coping. We understand that the census participants are professionals in their subject, and it is perfectly normal not to work with IT. We make very simple interfaces. We think about usability of decisions on recognition-checking. We save time and nerves for many. It is difficult and very interesting.
The next census of VSHP will be in 2016. The All-Russian Population Census is scheduled for 2020. For professional issues, you can write me at ICherepov@croc.ru or right here in the comments.