
How we helped conduct a medical census in the Republic of Bangladesh



Bangladesh is a country in South Asia that borders India and Burma (Myanmar). It ranks eighth in the world by population (Russia, according to Wikipedia, is now in ninth place). The overwhelming majority of Bangladeshis, 135 million out of 160 million, live in rural areas, and their living conditions are, to put it mildly, far from ideal. Not all households have access to drinking water, and sanitary conditions leave much to be desired.

This article is about how the Ministry of Health of Bangladesh used our ABBYY FlexiCapture to process the results of a medical census. Such a census is needed to make the right health policy decisions.

Bangladesh's 160 million people live on only 147 thousand square kilometers, so the population density is very high. Compare the area we live on with the area Bangladeshis live on, given that the two populations are roughly the same.



Bangladesh has recently made significant progress in health care. The World Health Organization reports outstanding progress in many critical areas: mortality among newborns and children under five has fallen, life expectancy has risen, vaccinations have become more widely available, and the fight against tuberculosis has become more successful.



Since 1961, the Directorate General of Health Services (DGHS) of the Ministry of Health and Family Welfare of Bangladesh has conducted a regular population survey to collect data on the health of the country's rural residents: the most common chronic diseases, mortality and its causes, and household and economic living conditions. These data are needed to make the right strategic decisions in health care, since medical support programs are developed on their basis, but collecting them is expensive and difficult.

The Ministry of Health of Bangladesh carries out the census itself. It has a whole staff of employees, called community health workers, who regularly visit rural households to see how things are and whether help is needed; there are about 23,000 of them. They were the ones who collected the data and filled out the questionnaires.

The questionnaires had always been processed manually. Transferring the data to an electronic system took two whole years, since processing 30 million pages is no joke. Given the importance of the information collected, that was too long. In 2011, the Bureau of Statistics of Bangladesh studied how similar surveys and censuses are conducted around the world and decided to automate questionnaire processing using intelligent character recognition (ICR).

Very briefly, here is how ICR differs from OCR (optical character recognition).
OCR recognizes printed characters. ICR, in this case, recognizes characters written by hand in block letters (sometimes called hand-printed characters).

So the government of Bangladesh announced a tender for processing the medical census questionnaires, which our partner Devnet won with our ABBYY FlexiCapture solution. But long before the winner was chosen, the tender organizers, with the help of the participating companies, developed a questionnaire that a machine could read.



The questionnaire is bilingual. Most of the labels for the fields and check boxes are in Bengali, the official language of the Republic of Bangladesh (locals call it “Bangla”). English is, of course, taught in Bangladeshi schools, but not everyone knows it, so out of consideration for the census takers the questionnaire was not made entirely in English. The labels of the main form elements are duplicated in English so that our technical support can find their way around and understand what is at issue if data processing runs into difficulties.

Bengali is a distinctive and rather complex language that we do not yet recognize, so the census takers had to fill in all the fields intended for recognition (the empty cells) in English.

Readers are surely curious about what information actually had to be collected. We were curious too, so we asked our partner to translate the questionnaire for this article.



The first question, the codes of the region, district and household, is the main identifier of the questionnaire. If the residents of a house did not all fit on one sheet, this code is what lets the system gather all the sheets together during processing so that no one is lost.

In the second question, respondents were asked to indicate their source of drinking water, and here many subtleties emerged. It turns out that groundwater in Bangladesh (as in some neighboring regions) is often contaminated with arsenic, and this is a big problem. There is a whole program under which well water is tested for arsenic and the wells are then marked: green for safe (Tube well green) and red for dangerous (Tube well red). Some wells had not yet been tested (or the residents who own them refused testing); this is the third answer. You can read more about the well-marking project in the book Arsenic Exposure and Health Effects, which is partially available online.

The third question asked about the type of toilet, the fourth about the family's economic situation.

Starting with the sixth question, the census takers had to list all the inhabitants of the household and indicate whether they had any chronic diseases. Anyone in the house who had died since the last census also had to be entered on the questionnaire, along with the date and cause of death.

Once the questionnaire was developed, our partners wrote instructions for the census takers, with a brief explanation of what a “machine-readable form” is and the rules for filling one out: write with a black or dark blue pen, use capital English letters, stay inside the cell, and leave an empty cell after each word. The instructions also included examples of correct and incorrect filling, like these.




In addition to receiving the instructions, the census takers underwent special training, and their whole part of the work (from the training to handing in the completed questionnaires) took about 10 months. Despite the careful instructions, there were quite a few mistakes in the completed forms. On average, about 10% of the text went beyond the cells; the census takers often marked more than one check box where only one answer was possible; and the handwriting was often hard to make out. In addition, the tender to choose a partner for processing the questionnaires was delayed (as we know, this often happens with tenders), the collected questionnaires were kept in poor conditions, and some of them were damaged by water and careless handling. All this complicated the processing of the questionnaires.

The processing scheme of the questionnaire looks like this:



First, the questionnaires are scanned. For this purpose, 10 Kodak i1420 scanners and Kodak i3400 scanners were used. The i3400 handles 50 pages per minute, or 15 thousand pages per day; the i1420 handles 45 pages per minute, or 13 thousand pages per day.





The meticulous reader must have noticed that the background of the questionnaire, the frames and some of the explanatory text are printed in red, and wondered how a poor country could afford so much ink. The red color is, of course, no accident. The scanner can be configured to remove a given color at the scanning stage (a drop-out color). After scanning, every element disappears from the form except the registration marks (the black squares at the corners) and the filled-in fields.



This is necessary to improve recognition quality. For example, if the census taker not only strayed outside the cell but part of a letter or digit also ran into the red text (say, the name of a row), the program would find it hard to recognize the character. Removing the red color solves this problem. Previously, this could only be done by the scanner; in the latest releases of FlexiCapture, the colored background can also be removed in software if necessary.
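To make the idea of a drop-out color concrete, here is a minimal sketch in Python. It is not how the scanner or FlexiCapture does it; the function names, the RGB representation and the thresholds (`min_red`, `margin`) are assumptions chosen for illustration only.

```python
# Hypothetical illustration of software color drop-out.
# An image is a list of rows; each pixel is an (R, G, B) tuple.
WHITE = (255, 255, 255)

def is_reddish(pixel, min_red=150, margin=60):
    """Return True when the red channel clearly dominates green and blue."""
    r, g, b = pixel
    return r >= min_red and r - max(g, b) >= margin

def drop_red(image):
    """Replace reddish pixels with white; leave all other pixels untouched."""
    return [[WHITE if is_reddish(p) else p for p in row] for row in image]

# A tiny 1x4 "scan": red frame, black ink, dark blue ink, white paper.
scan = [[(220, 40, 40), (10, 10, 10), (20, 20, 120), (255, 255, 255)]]
cleaned = drop_red(scan)
```

In this sketch the red frame pixel becomes white, while the black and dark blue ink, the two pen colors the instructions allowed, pass through unchanged.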

When the scan is complete, FlexiCapture processes the scanned images (removes noise, corrects distortions) and recognizes the data in the form.

Questionnaires in the system are combined into “packages”. A package is a set of census sheets that belong to one address. One sheet fits 12 people, but households in Bangladesh are crowded, and many homes had more than 12 people; in that case the census taker started a new sheet. Now imagine that an employee of the scanning center is carrying a stack of documents to the scanner and suddenly drops it. The documents are picked up from the floor in a different order than before and, of course, are scanned in that random order. The census customer believed that in such a case the verifier would not be skilled enough to manually reassemble in the system all the sheets belonging to the same house. Therefore, the system assembled the packages automatically, using the coded address (which, as we remember, is the main identifier of the census form) and the name of the census taker.
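The grouping logic can be sketched in a few lines of Python. This is only an illustration of the principle, not FlexiCapture's implementation; the field names (`scan_id`, `address_code`, `census_taker`) and the code format are invented for the example.

```python
from collections import defaultdict

def assemble_packages(sheets):
    """Group scanned sheets by (address code, census taker), whatever the scan order."""
    packages = defaultdict(list)
    for sheet in sheets:
        key = (sheet["address_code"], sheet["census_taker"])
        packages[key].append(sheet["scan_id"])
    return dict(packages)

# Sheets as they might come off the scanner after the stack was dropped
# (hypothetical values).
sheets = [
    {"scan_id": 1, "address_code": "10-04-0027", "census_taker": "RAHIM"},
    {"scan_id": 2, "address_code": "10-04-0031", "census_taker": "RAHIM"},
    {"scan_id": 3, "address_code": "10-04-0027", "census_taker": "RAHIM"},
]
packages = assemble_packages(sheets)
```

Here sheets 1 and 3 end up in the same package even though another household's sheet was scanned between them.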

Since the program can make mistakes, all the data had to be verified: a person compares the recognized characters with the scanned image by eye. Two large verification centers were set up, where 120 people worked in two shifts. It looked like this:



The check-box data was not verified by eye. Instead, it was checked against rules: for example, some questions allowed only one answer.
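A rule of this kind is easy to express in code. The sketch below is an assumption about what such a check might look like, not the project's actual rule set; the function name and the option labels are illustrative.

```python
def check_single_choice(marked):
    """Return an error message for a single-answer question, or None if it is valid."""
    if len(marked) == 1:
        return None
    if not marked:
        return "no answer marked"
    return "more than one answer marked"

# Hypothetical recognition results for the drinking water question.
ok = check_single_choice(["tube well green"])
bad = check_single_choice(["tube well green", "tube well red"])
```

A result other than `None` would mark the sheet for an operator's attention.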

To verify the data in the remaining fields, scripts were developed that helped the operators by detecting errors in certain fields. For example, a phone number had to contain exactly 11 characters, and there were specific rules for kinship codes, house numbers, region codes and so on. When a rule was broken, the program flagged the field so that the operator would pay attention to it. The operator then decided whether to correct the error by comparing the recognized data with the scanned image of the sheet. If, for example, the census taker's pen wrote poorly and the system failed to recognize a character, the operator corrected the error. If it could not be corrected, the error was assigned critical status.
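As a rough sketch of how such field rules might be wired together, consider the Python below. The 11-character phone rule comes from the project; reading it as "exactly 11 ASCII digits", as well as the function and field names, are assumptions made for this example.

```python
def validate_phone(value):
    """Assumed rule: a phone number is exactly 11 ASCII digits."""
    return len(value) == 11 and value.isascii() and value.isdigit()

def flag_fields(record, rules):
    """Return the names of fields that break their rule, for operator review."""
    return [name for name, rule in rules.items() if not rule(record.get(name, ""))]

rules = {"phone": validate_phone}

# Hypothetical recognized records: the second has one digit missing.
good_record = {"phone": "01712345678"}
bad_record = {"phone": "0171234567"}

flagged_good = flag_fields(good_record, rules)
flagged_bad = flag_fields(bad_record, rules)
```

Rules for kinship codes, house numbers or region codes could be added to the same `rules` dictionary as further functions.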

During verification, the operator can view either the entire sheet or just a single field (while another operator checks another field). As a rule, the second method is more efficient, and it was the one used in the project. After verification, the data was uploaded to the database, Microsoft SQL Server 2012 Enterprise. An English-Bengali dictionary of names was integrated into the system, and with its help all names were exported directly in Bengali.

That, in fact, is the whole story of the census in Bangladesh. To conclude, as usual, some statistics: on average, operators processed a little more than 100 thousand pages a day, and all 30 million pages were processed in about 9 months.
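These two figures can be checked against each other with simple arithmetic. The sketch below assumes that 9 months is roughly 270 days and that work went on every day; neither assumption comes from the project data.

```python
TOTAL_PAGES = 30_000_000
PAGES_PER_DAY = 100_000      # the text says "a little more than" this
ASSUMED_DAYS = 270           # assumption: 9 months of about 30 days each

# Days needed at exactly 100 thousand pages a day.
days_at_100k = TOTAL_PAGES / PAGES_PER_DAY

# Daily rate needed to finish within the assumed 270 days.
required_rate = TOTAL_PAGES / ASSUMED_DAYS

print(f"{days_at_100k:.0f} days at 100k pages/day")
print(f"{required_rate:,.0f} pages/day to finish in {ASSUMED_DAYS} days")
```

At exactly 100 thousand pages a day the job would take 300 days, so finishing in about 9 months implies a rate of roughly 111 thousand pages a day, which matches “a little more than 100 thousand”.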

Svetlana Luzgina
Corporate communications service with the support of ABBYY 3A (three A = Asia, Africa, Latin America).

Source: https://habr.com/ru/post/312058/

