📜 ⬆️ ⬇️

Automation against chaos



The modern development of IT technology has allowed to curb huge data streams.
The business has a variety of tools: CRM, ERP, BPM, accounting systems, or, as a last resort, just Excel and Word.

Companies are also different. Some consist of many separate branches. In this case, the business has the problem of data synchronization in the zoo IT systems. Moreover, the branches have different vendors or software versions. And frequent changes in reporting requirements from the management company provoke attacks of uncontrollable “joy” in the field.
')
This story is about a project in which I happened to face the chaos that needed to be systematized and automated. A modest budget and tight deadlines limited the use of most industrial solutions, but opened up scope for creativity.

Universal format


The customer set the task of collecting data for reports from all branches of the company. To understand the scale of a disaster, there are dozens of systems, among which there are both samopisnye and monsters like SAP, and, of course, 1C - where without it.

In one report, data from: accounting, repairmen, PR managers, MES, meteorologists could cross.

Prior to the start of the project, the main part of the data was sent to the parent company via email in the form of Word / Excel attachments. Further, the process resembled the sunset manually: the data were processed by specially trained people and entered into a pair of systems. The result of labor were dozens of reports, on the basis of which management decisions were taken.

On the choice of the approach, we pushed the format of the files being sent, namely xlsx / docx. Even the “ancient” systems in the branches didn’t support the uploading of data to these formats, well, or in extreme cases, copy-paste was not canceled.

Our plan was stubborn:

  1. we describe the structure of each report and the rules for its transmission;
  2. We lower the requirements for setting up systems for sending documents by email in accordance with the regulations. Where there are no systems - sending, as before, by hands;
  3. develop a program that:
    • selects certain documents from incoming mail;
    • extracts data from them;
    • writes the extracted data to the database, as well as “beats the hands” of the violators of the regulations.




Implementation


Organizational matters


At the stage of collecting and formalizing the requirements for the structure and regulations of the data transfer, it turned out that there was no description of the structure of the reports, at all. Everything was stored in the heads of some of the employees and was transmitted as word-of-mouth by Russian folk tales. But the real problems started later - when setting up data exchange.

Problem one


The difference in the structure of documents from the reference and data quality. In the reports, the amounts sometimes did not converge, the columns were mixed up in places or had incorrect names. The problem was mainly observed in the branches, where the data were driven in manually.

The solution is the introduction of a three-step test:

  1. Creating Excel reference documents with a rigid structure, means of Excel itself. In such documents, only data entry cells were available. On which checks were additionally superimposed: type, convergence of amounts, etc.
  2. Checks when extracting data from the report. For example, comparison of the current date and date in the paragraph of the Word document or arithmetic checks for data from an Excel document (if they cannot be specified in the document itself).
  3. In-depth data analysis after collection. For example, the detection of significant deviations in key indicators in comparison with previous periods.

Problem two


A systematic violation of the data transfer schedule or unscrupulous attempts at sabotage: “We never sent data to anyone, but here you are with yours ...”, “Yes, I sent everything on time, this ping is probably bad.”

The solution is feedback. The system automatically notifies the responsible persons in the branch, in case of violation of the schedule. Later, the feedback subsystem was screwed to the input data quality control system and to the final reports generation system, so that the branch immediately received a summary of its data and comparison with its “neighbors”. To make it clear why he ogrebet.

Subsystems developed



  1. The document data types configurator, where you can quickly describe:
    • attributes for document identification;
    • transfer regulations;
    • data extraction algorithm;
    • other attributes like a code path that checks and stores data.

  2. Recipient of the mail, moving attachments to an isolated storage (sandbox) and storing the accompanying information about the letter;
  3. Attachment parser, which defines document types and extracts data from them.



Configurator


Historically, all documents with data come to the general mail, where it is full of other important and not very letters. Signs are needed by which the necessary documents will be determined. The name of the document or the text in the body of the email is all unreliable and inconvenient for the sender. Therefore, it was decided that the report will be determined only by the content of the document. In addition, you need to uniquely determine what type of report contains a document.

By brainstorming, the horse-radish thought up how many signs to identify the document: the color of the text in the cell, the font and so on. But the most correct was the sign of the presence of a substring in a particular cell “slot” or an array of cells for Excel and a paragraph or heading for Word. For the “slot,” simple formal logic was added: “equal”, “unequal”, “more”, “less”, etc. Example for Excel: in the A2-E4 range, the cell text should be equal to “Daily equipment load summary”.



Similarly, the document area is set up in which you need to search for the beginning and end of the data (approx. End search conditions: 2 blank lines in a row).



A list of other useful settings: a list of allowed senders, the type of document (Excel / Word) and the path to export data.

At the output we get a JSON structure (template) describing the report.

Mail recipient


The mail reader, which places all the documents from attachments in the sandbox, saves the attributes of the letter and puts the documents in the task queue for parsing.

There are 2 security questions:

  1. What if the data is sent to another branch?
  2. What if data is sent by intruders?

The first question is solved by checking the email address of the sending branch and the branch specified in the report body.

The second is using SPF .

Attachment parser


Almost all libraries for parsing Word and Excel have serious limitations on the supported versions, so first you need to convert the document. Libre Office solves this problem on the top five.

After conversion:

  1. we remove the array of templates by basic features from the configurator (Word / Excel, sender ...);
  2. we run the document with the remaining templates;
  3. If the template is found, we retrieve the data and transfer it to the repository.

Total


We made it!
After two months of hard work, the head office began to regularly receive data for reports from all branches. Moreover, the quality and completeness of the data was unprecedentedly different from what it was before, and the released human resources paid for the project costs by the end of the year.

For ourselves, we learned that integration is not always painful and highlighted the main aspects of success:

  1. we did not climb inside the systems in the branches;
  2. formalized and approved a single structure of reports and the rules for their transfer;
  3. created output formats for all systems that are generally available in the form of Excel and Word documents;
  4. Choose the most common method of data delivery - email.

And two main disadvantages:

Source: https://habr.com/ru/post/421705/


All Articles