Here is a quote from Linus Torvalds from 2006:
I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful... I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
Which is very similar to Eric Raymond's "Rule of Representation" from 2003:
Fold knowledge into data, so program logic can be stupid and robust.
Which was itself a summary of ideas like this one from Rob Pike in 1989:
Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Which cites this classic from Fred Brooks in 1975:
Representation is the essence of programming.

Beyond craftsmanship lies invention, and it is here that lean, spare, fast programs are born. Almost always these are the result of strategic breakthrough rather than tactical cleverness. Sometimes the strategic breakthrough will be a new algorithm, such as the Cooley-Tukey Fast Fourier Transform or the substitution of an n log n sort for an n² set of comparisons. Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of a program lies. Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.
So for almost half a century, smart people have been saying it again and again: focus on the data first. But sometimes it feels like this is the smartest advice that everyone forgets.
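To make that advice concrete before the war stories, here's a minimal, hypothetical sketch (the ticket categories and team names are invented) of what "folding knowledge into data" looks like in practice: the knowledge moves into a table, and the logic becomes one stupid, robust line.

```python
# Hypothetical example: routing support tickets to teams.

# Code-first: the knowledge is trapped in branching logic, and every
# new category means editing (and re-testing) this function.
def route_ticket_v1(category: str) -> str:
    if category == "billing":
        return "finance-team"
    elif category == "outage":
        return "sre-team"
    elif category == "login":
        return "identity-team"
    else:
        return "triage-team"

# Data-first: the knowledge lives in a table that could just as easily
# be loaded from a config file or a database row.
ROUTES = {
    "billing": "finance-team",
    "outage": "sre-team",
    "login": "identity-team",
}

def route_ticket_v2(category: str) -> str:
    return ROUTES.get(category, "triage-team")

print(route_ticket_v2("outage"))  # sre-team
```

The two functions behave identically today; the difference is that in the second one the data structure carries the design, and the code barely matters.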
Let me give some real examples.
The highly scalable system that couldn't scale
This system was designed from the start for incredible scalability under CPU-intensive load. Nothing was synchronous. Everything was done with callbacks, task queues, and worker pools.

But there were two problems. The first was that the "CPU-intensive load" turned out not to be so intensive: a single task took a few milliseconds at most, so most of the architecture did more harm than good. The second problem was that the "highly scalable distributed system" actually ran on only one machine. Why? Because all communication between the asynchronous components was done through files on the local filesystem, which had now become the bottleneck for any scaling. The original design said almost nothing about data at all, except to advocate local files in the name of "simplicity". Most of the design was devoted to the extra architecture that was "obviously" needed to cope with the "heavy" CPU load.
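To show the shape of the problem, here's a minimal, hypothetical sketch (the paths and names are invented, not taken from the actual system): however many callbacks and worker pools you wrap around it, every hand-off goes through one local directory, so the whole "distributed" system is pinned to one box.

```python
# Hypothetical sketch of the data flow (paths and names invented).
import json
import os
import uuid

QUEUE_DIR = "/var/spool/pipeline"  # the real interface of the system

def handle(task: dict) -> None:
    pass  # the "CPU-intensive" work: a few milliseconds at most

def submit(task: dict) -> None:
    # Producer: enqueue by writing a file. Only visible on this machine.
    path = os.path.join(QUEUE_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(task, f)

def worker_loop() -> None:
    # Consumer: poll the same local directory for work.
    for name in os.listdir(QUEUE_DIR):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            handle(json.load(f))
        os.remove(path)
```

No amount of asynchrony changes where the data lives; add a second machine and it has no way to see the queue.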
The service-oriented architecture that was still data-oriented
This system followed a microservices design: single-purpose applications with REST APIs. One component was a database that stored documents (mostly responses to standard forms and other electronic paperwork). Naturally, it exposed an API for storing and retrieving data, but fairly quickly a need arose for more complex search functionality. The developers felt that adding search to the existing document API would violate the principles of microservice design. Since "search" is fundamentally different from "get/put", the architecture shouldn't couple them. Besides, they planned to use a third-party tool for the search index, so creating a new "search" service made sense for that reason, too.
What ended up being created was a search API and a search index that was effectively a duplicate of the data in the main database. The data was updated dynamically, so any component that changed document data through the main database API also had to send an update request through the search API. With REST APIs it's impossible to do this without a race condition, so the two data sets occasionally went out of sync anyway.
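Here's a minimal, hypothetical sketch of that dual write (the URLs and names are invented): each caller has to update two independent services, and nothing makes the pair of writes atomic.

```python
# Hypothetical sketch of the dual write (URLs and names invented).
import requests  # third-party HTTP client, standing in for any REST calls

DOC_API = "http://documents.internal/api"
SEARCH_API = "http://search.internal/api"

def update_document(doc_id: str, fields: dict) -> None:
    # Write 1: the system of record.
    requests.put(f"{DOC_API}/documents/{doc_id}", json=fields)
    # <-- If another caller runs both of its writes inside this gap,
    #     its newer search-index entry is overwritten below by our
    #     older data, and the two stores silently disagree.
    # Write 2: the search index, a second copy of the same data.
    requests.put(f"{SEARCH_API}/index/{doc_id}", json=fields)
```

Two clients calling update_document() concurrently can land their first writes in one order and their second writes in the other; with plain REST calls there is no transaction spanning both services to prevent it.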
Despite what the architecture diagram promised, the two APIs were tightly coupled through their data dependencies. Later the developers recognised that the search index should be part of a unified document service, and this made the system much more maintainable. "Do one thing" works at the data level, not at the verb level.
The fantastically modular and configurable ball of mud
This system was a kind of automated deployment pipeline. The original team wanted a tool flexible enough to solve deployment problems across the whole company. They wrote a set of pluggable components, with a configuration file system that not only configured the components but also acted as a domain-specific language (DSL) for programming how the components fit into the pipeline.
Fast forward a few years, and the tool had become "that program". There was a long list of known bugs that no one ever fixed. No one wanted to touch the code for fear of breaking something. No one used the flexibility of the DSL. Every user copy-pasted the same known-working configuration as everyone else.
What went wrong? Although the original design document was full of words like "modular", "decoupled", "extensible", and "configurable", it said nothing about data at all. So data dependencies between components ended up being handled in an ad hoc way through a globally shared JSON blob. Over time, the components made more and more undocumented assumptions about what was or wasn't in the JSON blob. Sure, the DSL allowed rearranging the components in any order, but most configurations simply didn't work.
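A minimal, hypothetical sketch of that coupling (the keys and step names are invented): two "decoupled" plugins whose real interface is a set of undocumented assumptions about one shared blob.

```python
# Hypothetical sketch (keys and step names invented).
def push(host: str, artifact: str) -> None:
    print(f"pushing {artifact} to {host}")

def build_step(blob: dict) -> dict:
    # Quietly assumes 'repo' and 'version' are already in the blob.
    blob["artifact"] = f"{blob['repo']}-{blob['version']}.tar.gz"
    return blob

def deploy_step(blob: dict) -> dict:
    # Quietly assumes 'artifact' exists (set by build_step) and that
    # someone, at some point, put 'targets' in the blob.
    for host in blob["targets"]:
        push(host, blob["artifact"])
    return blob

# The DSL claims the steps compose in any order, but run deploy_step
# before build_step and it dies with a KeyError: the blob's implicit
# schema is the real (and only) interface between components.
```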
Lessons
I chose these three projects because their examples make the general point easy to explain, not to pick on anyone. Once I tried to build a website and instead got bogged down in some cranky XML database that didn't even solve my data problems. There was another project that turned into a broken imitation of half the functionality of make, again because I hadn't thought about what I really needed. I've already written about the time I spent building an endless hierarchy of OOP classes that should have been encoded as data.
Update:
Apparently, many people still think I'm trying to make fun of someone. People who have actually worked with me know that I'm much more interested in fixing real problems than in blaming whoever created them, but, okay, here's what I think of the developers involved in these projects.
To be honest, the first situation obviously happened because the system architect was more interested in applying research papers than in solving the actual problem. Many of us are guilty of that (me included), but it really annoys our colleagues, because they're the ones who have to help maintain the thing once we get bored of the toy. If you recognise yourself here, please don't be offended, just stop (although I'd still rather work on the single-node distributed system than on anything built around my "XML database").
There's nothing personal about the second example. Sometimes it seems like everyone around is saying how wonderful it is to split up services, but no one talks about when it's better not to. People are learning the hard way all the time.
The third story actually happened to some of the smartest people I've ever worked with.
(End of update).
The question "What does it say about the data?" turns out to be a pretty useful litmus test for good system design. It's also very handy for spotting false "experts" and their advice. The hard, messy problems in systems design are data problems, so false experts love to ignore them. They'll show you a wonderfully beautiful architecture but say nothing about what kind of data it's suited for and (crucially) what kind of data it isn't.
For example, a false expert might say you should use a pub/sub system because pub/sub systems are loosely coupled, and loosely coupled components are more maintainable. That sounds nice and makes for pretty diagrams, but it's backwards thinking. Pub/sub doesn't make your components loosely coupled; pub/sub itself is loosely coupled, which may or may not match the needs of your data.
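A minimal, hypothetical sketch (the topic and field names are invented), with a toy in-memory bus standing in for a real broker: the producer and consumer never reference each other, yet they are still coupled through the shape of the message.

```python
# Hypothetical sketch (topic and field names invented).
# A toy in-memory pub/sub bus standing in for a real broker.
subscribers: dict = {}

def subscribe(topic: str, handler) -> None:
    subscribers.setdefault(topic, []).append(handler)

def publish(topic: str, message: dict) -> None:
    for handler in subscribers.get(topic, []):
        handler(message)

# Consumer: holds no reference to the producer...
def on_order(message: dict) -> None:
    # ...but silently depends on the message containing 'customer_id'.
    print("shipping order for customer", message["customer_id"])

subscribe("orders", on_order)

publish("orders", {"customer_id": 42, "total": 99.5})  # works
# publish("orders", {"cust": 42, "total": 99.5})       # KeyError
```

The diagram shows two boxes with no arrow between them, but the data dependency is still there; the broker just hides it.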
On the other hand, a well-designed data-oriented architecture goes a long way. Functional programming, service meshes, RPC, design patterns, event loops, whatever: they all have their merits. But personally, I've seen far more successful production systems running on boring old DBMSs.