Hi, Habr!
This article is an attempt to look at the architecture of classic enterprise accounting and planning systems (let's call them ERP for short) from the vantage point of new experience. The trigger for writing it was a small performance test of a web application that managed to aggregate 10 million facts in about a minute and a half with a simple scan. Of course, for a modern DBMS this is a laughably small volume, but I recalled that roughly the same number of transactions per month was generated by the ERP system of a trading and industrial holding where I once worked. Several factories, continuous production, wholesale distribution, a relatively simple financial model: a perfectly typical company and a typical accounting system, yet monthly reports took over 20 minutes to build, on a cluster of application servers working against a SQL cluster, with indexes, statistics and hints. The annual report took about six hours, and the need to recalculate it in "what-if" mode turned into sleepless nights for accountants and economists. Systems of this class are quite complex and expensive, and after more than 15 years in this industry I continue to insist that they are unjustifiably complex and expensive.
To clarify the idea, I will briefly describe the architecture of a typical Western ERP (this is done in the section below); for now, imagine that we set out to design such a system from scratch, taking into account today's requirements rather than those of 30 years ago, when all these systems were created. The first step is to describe all the inputs and outputs of our "black box", and then to understand which internal abstractions are really necessary and which are redundant.
Inputs and Outputs
At the input, business transactions, including forecasts, plans and budgets, are recorded in primary documents of various types. A document has a tree structure with up to three levels of nesting: a header, lines, and sub-lines (for example, storage lots or storage cells). Each document type has its own set of attributes, some of which take part in further calculation algorithms, while others are purely reference data. The input data is partially validated at entry; some checks are performed at the posting stage, after additional attributes have been filled in (which is usually the job of specially trained people). Thus the document has a life cycle, during which it "acquires" new data and links to other documents.
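To make this concrete, here is a minimal sketch of such a three-level document as a plain data structure (the field names and values are illustrative assumptions on my part, not taken from any particular ERP):

```python
# Illustrative three-level document: header -> lines -> sub-lines,
# plus attributes that are "acquired" later in the life cycle.
purchase_invoice = {
    "doc_type": "purchase_invoice",
    "status": "entered",              # entered -> approved -> posted
    "header": {
        "number": "PI-000123",
        "date": "2019-03-31",
        "vendor": "ACME Ltd",
        "currency": "USD",
    },
    "lines": [
        {
            "item": "steel_sheet_2mm",
            "qty": 10_000,
            "price": 4.20,
            # sub-lines: storage lots / cells for this line
            "sublines": [
                {"lot": "L-556", "cell": "A-01-03", "qty": 6_000},
                {"lot": "L-557", "cell": "A-01-04", "qty": 4_000},
            ],
            # attributes filled in later, at the posting stage
            "gl_account": None,
            "cost_center": None,
        }
    ],
    # links to other documents, added as the life cycle progresses
    "linked_docs": [],
}
```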
At the output, in theory, summary reports on the company's activities are expected (balances, turnovers, various kinds of financial results, cost, forecasts, plans, tax registers, etc.). This is how these systems are presented to the customer: dashboards for top management aggregating all the macro indicators, plus the theoretical ability to drill any figure down to the primary document in a dozen mouse clicks. However, modern requirements for the output are somewhat broader:
- Now, in an era of total "presumption of financial guilt", both the FTS and the Big Four auditors work the same way: first they request the consolidated financial statements, then all the primary documents (more often in electronic form, but they can also ask for paper). The documents are loaded into the auditor's own accounting system, processed according to the accounting policy of the audited company, and the results are compared. To be prepared for such a check in advance, companies are forced to build their reports down to the level of primary documents, which essentially makes the whole procedure of posting with hierarchical roll-up pointless: to put it bluntly, instead of a simple P&L you have to be able to see the share of the factory cleaner's wages in the cost of finished goods, especially since modern OLAP systems are quite capable of handling that amount of data.
- Modern business-analysis and machine-learning systems want all the big data as it is: even a seemingly minor reference attribute of the source document (the manager's position, the geographic location) turns out to matter in certain cases. Yet in our system there is a chain "source document - module posting - general ledger", where the higher the level, the less concrete the data: not a single table contains exhaustive information about the business transaction, and the data is spread over a couple of dozen tables.
- There is not always a one-to-one link between a GL transaction and a line of the source document; usually it is a many-to-many relationship, which means the attributes of the source document cannot be pulled into a summary report with a simple join, while "allocation on the fly" is usually slow and produces those same tens of minutes or even hours when building a seemingly simple report (a toy illustration of such an allocation follows this list). And a couple of dozen outer joins usually do performance no favors either.
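As a toy illustration of what "allocation on the fly" means here, the following hypothetical snippet splits one GL amount back across the document lines it covers, so that line-level attributes become reportable (all names are mine, not from any real system):

```python
# One GL posting covers several document lines (many-to-many), so its amount
# has to be allocated back before line-level attributes can appear in a report.
def allocate(gl_amount, doc_lines):
    """Split a GL amount across document lines, proportionally to line amounts."""
    total = sum(line["amount"] for line in doc_lines)
    return [
        {**line, "allocated": gl_amount * line["amount"] / total if total else 0.0}
        for line in doc_lines
    ]

doc_lines = [
    {"id": "L1", "amount": 80.0, "manager": "Ivanov", "region": "Urals"},
    {"id": "L2", "amount": 40.0, "manager": "Petrov", "region": "Siberia"},
]
print(allocate(120.0, doc_lines))
# each line now carries both its own attributes and its share of the GL amount
```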
As a result, because every operational indicator has to be decipherable down to its documents, the value of summary P&L and balance-sheet reports keeps decreasing, while the need for instant and highly detailed reports keeps growing. In retail, management expects the margin by the evening of the current day, and in manufacturing everyone wants to monitor the current cost price without waiting for the end of the month (I have seen projects with ten-day and weekly closing periods, and that is not the limit).
Summary: at the output our system should produce every possible analytic in maximum detail (that is, down to the line of the primary document), with all significant and insignificant attributes; in effect, it should completely duplicate the input, supplemented with the results of calculations.
And now, to understand the suffering of ERP programmers and their users, let's take a brief dive into the current database structure of such systems.
Legacy Architecture
In those distant times, when computers were weak, an RDBMS was used as storage, and big-data technology was in its infancy, the proposed architecture was to create a single accounting register storing snapshots of all documents in the most unified and impersonal form, yet suitable for quickly building OLAP reports. That register is the general ledger (its counterpart in 1C, the "transaction journal", was eventually abandoned because of performance problems). In the process of posting, each document ultimately generates transactions in the general ledger, from which, in theory, all management reports should be built. In practice the process has two steps, since each module has its own small ledgers: the document first produces module postings, and only then, based on them, the general ledger postings are formed.
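Schematically, the two-step process looks something like the sketch below (a deliberate simplification with invented field names, not the actual data model of any system):

```python
from collections import defaultdict

def module_postings(doc):
    """Step 1: one detailed module posting per document line."""
    for line in doc["lines"]:
        yield {
            "doc_id": doc["id"],
            "account": line["gl_account"],
            "amount": line["qty"] * line["price"],
            "analytics": {"item": line["item"], "dept": doc["dept"]},
        }

def roll_up_to_gl(postings):
    """Step 2: fold module postings by account, losing line-level detail."""
    totals = defaultdict(float)
    for p in postings:
        totals[(p["doc_id"], p["account"])] += p["amount"]
    return [{"doc_id": d, "account": a, "amount": amt}
            for (d, a), amt in totals.items()]
```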
A typical system is divided into modules, of which there are about ten (Inventory, Purchasing, Sales, Advance Reports, Cash and Bank, Production, Maintenance, Projects, Planning, Fixed Assets). Let's very roughly walk through the main stages of the document life cycle and estimate the number of DBMS tables involved (a quick tally follows the list):
- Initial entry of documents. As a rule, the data structure consists of 1-3 tables: the document header, the document lines, and the sub-lines (for example, the list of storage lots or storage cells). At this stage the basic reference books are used (counterparties, items, departments, purposes, projects, etc.). In each module, documents and lines of each type are stored in separate tables: about 20 tables in total.
- Conversion of units of measure and currencies. Usually the calculated values are placed in separate fields of the tables from step 1.
- Manual and semi-automatic "linking" of a document to documents of other types (for example, a purchase of transport services is linked to goods-movement documents, customs clearance costs are linked to a purchase document, etc.). This stage may be absent if all the links are created automatically when the period is closed, but in real life it varies. Count 1 table per pair of modules, about 5 in total.
- Account assignment of operations, that is, determining, on the basis of the reference data from step 1, the accounting registers in which the document should be posted. This is usually a balance account with analytical dimensions and a corresponding account with dimensions, defined separately for the document header and for each line. The registers are written as module postings at the maximum level of detail. That gives us 10 tables (one per module).
- The balances of the module registers are updated (stock balances, debts, budget execution, etc.). Another 10 tables.
- The module postings are rolled up per document and recorded in the general ledger: 1-2 tables.
- The general ledger balances are updated: 1 table.
- Calculation of taxes per line and/or per header, depending on the tax system: 2 tables. In Russia you also need to create electronic invoices and lines of the purchase/sales ledgers: 4 more tables, 6 in total.
- Matching of the generated postings with the postings of other modules (based on the links from step 3 or by matching rules). For example, issued warehouse lots are matched many-to-many with receipts, or the postings that form receivables/payables are matched with advances issued/received. At this point we have already broken away from the original document and are comparing postings with postings, that is, one artificial entity with another. Count 1 table per pair of modules, approximately 5.
- Revaluation of the cost of write-offs based on the matching from step 9. The revaluation is recorded as a separate category of transactions; count another 2 tables.
- The audit trail is formed (who, when, what, how much). One more table.
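Before summing up, here is a quick tally of the table counts from the steps above (step 2 adds fields to existing tables rather than new ones; the labels are my own shorthand):

```python
# Rough tally of tables touched by the document life cycle described above
tables_per_step = {
    "document headers, lines, sub-lines (per module)": 20,
    "inter-module document links":                      5,
    "module postings":                                 10,
    "module register balances":                        10,
    "general ledger postings":                          2,
    "general ledger balances":                          1,
    "taxes, invoices, purchase/sales ledgers":          6,
    "posting-to-posting matching":                      5,
    "write-off cost revaluation":                       2,
    "audit trail":                                      1,
}
print(sum(tables_per_step.values()))  # 62, i.e. "about 60" tables
```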
So, even without the reference books, our operations are spread across about 60 tables, none of which contains complete information about the underlying business transaction.
This data scheme has several rationales:
- At that time RDBMSs were used whose transactional performance was the bottleneck, so the data was normalized as much as possible.
- Adding, extending or renaming a column in an RDBMS table (and accordingly in all its indexes) is a heavy, usually blocking operation that requires stopping the system, so the module postings are stored separately from the document, the general ledger separately from the module postings, taxes separately from the postings, and so on. This makes it easier to implement changes in a single module (for example, taxes) without much impact on the structure of, say, the general ledger tables.
- It is easier to configure separation of access to data, and fraud is harder (to make consistent changes across 15 tables, you have to be a professional).
What does this scheme lead to?
- The user is usually interested in the full information about the fact: who sold what to whom, on what transport and at what price it was delivered, whether the sale has been paid for, the current balances and the balances at the moment of the operation, the manager's name, and so on. Therefore, reports have to gather data from all these tables with SQL queries (materialized views, mirrored databases, OLAP, etc.).
- Programmers have at their disposal a fairly powerful, dynamically optimized SQL language, which encourages nested queries (each JOIN is essentially a nested loop, at least over an index) instead of more predictable one-pass algorithms. As a result, system performance degrades non-linearly and progressively as the data grows, which makes capacity planning difficult (for example, one new entity is added and performance drops tenfold). A small comparison of the two approaches follows this list.
- If we consider the ERP system as a black box, we have a document with a complete set of data at the input, and the user needs the same document at the output, only with a set of additional attributes (accounts and analytical dimensions, settlement status, data from related documents). The user is not interested in artificial entities such as postings, matchings or registers: any reconciliation (against the primary documents or with a counterparty) is done on documents, any audit or big-data analysis requires drilling down to the document line, and so on. But between the input document and the output one lies a normalized schema full of artificial entities, created by programmers for the convenience of programmers and system administrators; as a result, we first disassemble the document into molecules (during posting) and then reassemble it from those molecules (in reports).
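To illustrate the performance point above, here is a toy comparison of a nested-scan report versus a single-pass one over the same data (a deliberately simplified sketch with invented field names, not real ERP code):

```python
# Joining postings to documents with nested scans is O(n*m); a single pass
# that first builds a dictionary of documents stays roughly O(n+m).
def report_nested(postings, documents):
    out = []
    for p in postings:                              # outer loop: n postings
        for d in documents:                         # inner loop: m documents
            if d["id"] == p["doc_id"]:
                out.append({**p, "manager": d["manager"]})
                break
    return out

def report_single_pass(postings, documents):
    by_id = {d["id"]: d for d in documents}         # one pass over documents
    return [{**p, "manager": by_id[p["doc_id"]]["manager"]}
            for p in postings]                      # one pass over postings
```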
Target Architecture
Today, when the performance of servers and workstations has grown by orders of magnitude while the volume of accounting data has not changed that much, isn't it time to revise the architecture? Why do we need these conversions back and forth if the input and the output are so similar? Clearly, the registers of balances, matchings and cost prices are not going anywhere, since they are the result of calculations, but do we really need the intermediate, artificial entities in the form of module postings and general ledger transactions? Since the document is a cross-cutting entity, needed both at the input and at the output, in the new ERP it is proposed to make the document the central data structure. Attributes that appear at different stages of the life cycle (analytical dimensions, matchings, recalculations) should be recorded in the document itself. This makes it possible to produce reports in a uniform and predictable way, and in linear time.
To unify the reporting algorithms, all document types are proposed to be stored in a single database, and since the attribute set can differ between types and, moreover, be dynamic, this database should be NoSQL. Calculation and reporting algorithms scan the database and look for the required attribute in each document or combination of documents (NoSQL stores support indexes on attribute combinations, so a full scan is not always needed). The presence of an attribute (or a combination of attributes) becomes the criterion for including a document in a particular calculation or report.
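A minimal sketch of this attribute-driven selection, assuming a plain in-memory list of documents for brevity (a real implementation would rely on a NoSQL store with indexes on attribute combinations rather than scanning a list):

```python
documents = [
    {"doc_type": "sales_invoice",    "vat_amount": 20.0, "total": 120.0},
    {"doc_type": "budget_plan",                          "total": 500.0},
    {"doc_type": "purchase_invoice", "vat_amount": 15.0, "total":  90.0},
]

def select_for(required_attrs, docs):
    """Pick the documents that carry every attribute a calculation needs."""
    return [d for d in docs if all(a in d for a in required_attrs)]

vat_register = select_for({"vat_amount"}, documents)
print(sum(d["vat_amount"] for d in vat_register))  # 35.0 -- plans are skipped
```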
Documents can refer to each other, which means joining the storage with itself; this, in turn, requires, firstly, ordered storage of all reference data and documents (a causality graph) and, secondly, a revision of the approach to report generation in favor of single-pass algorithms with short-term memory instead of nested loops, however well optimized.
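And a first taste of such a single-pass algorithm with short-term memory (the topic the postscript defers to a separate article), again as an illustrative sketch:

```python
# Documents, ordered along the causality graph (here simply assumed to be
# sorted by date), are scanned exactly once, and only the running state
# needed by the report is kept in memory.
def running_balances(documents_in_causal_order):
    balances = {}                        # short-term memory: account -> balance
    for doc in documents_in_causal_order:
        for line in doc["lines"]:
            acc = line["account"]
            balances[acc] = balances.get(acc, 0.0) + line["amount"]
        # balances "as of this document" could be written back into it here
    return balances
```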
A number of additional entities remain (balances, matchings, supplementary valuations), but, first, these entities reflect objects of the real world, and second, the basic principle must be respected: complete data about a business transaction should be obtainable from the main document and the documents linked to it, without the use of synthetic entities such as postings, accounting registers, and so on.
An additional bonus: the proposed system will make it possible to create additional reference attributes for specific document types on the fly, without stopping the database and with minimal intervention in the program code.
The drawback is the need to implement the storage schema programmatically (valid attribute sets, valid links, data validation), without relying on the built-in tools of the DBMS. To be fair, most commercial systems do not use foreign keys or triggers anyway, and data integrity is maintained entirely by the algorithms on the application server side.
PS
Single-pass algorithms with short-term memory, which will have to be used for calculations and reports with such a storage scheme, deserve an article of their own; but even without that, the amount of work for the first MVP (storage schema + input interface) significantly exceeds our current capacity, so the project will move slowly, and I may report on its status in future articles.