On errors that come from nowhere and for which no one is to blame: the phenomenon of smeared responsibility

This article is not about irresponsible employees, as the title might suggest. It is about a real technical danger that may be waiting for you if you build distributed systems.

One enterprise system had a component that collected data from users about a certain product and wrote it to a database. It consisted of three standard parts: the user interface, the business logic on the server, and the tables in the database.
The component worked well, and for several years no one touched its code.

But one day, for no apparent reason, strange things began to happen to the component.

For some users, the component suddenly began to throw errors in the middle of a session. It happened rarely but, as usual, at the most inopportune moment. Most puzzling of all, the first errors appeared in a stable production version of the system, one in which no component had changed for several months.

We began to analyze the situation. We checked the component under heavy load: it worked fine. We repeated fairly extensive integration tests: there, too, our component worked fine.

In short, the error came from who knows where, who knows when.

We began to dig deeper. A detailed analysis and comparison of the log files showed that the error messages shown to the user were caused by a primary key constraint violation in the database table mentioned above.

The component wrote data to the table using Hibernate, and sometimes Hibernate, when trying to write the next row, reported a constraint violation.

I will not bore readers with further technical details and will go straight to the essence of the error. It turned out that our component was not the only one writing to this table: occasionally (extremely rarely) another component did so too. And it did so quite simply, with a plain SQL INSERT statement.

Hibernate, as configured here, writes as follows. To optimize writing, it asks the database sequence for the next primary key once and then performs several writes by simply incrementing the key value itself (10 writes in this case, as set by allocationSize). If, after that request, the second component cut in and wrote a row to the table using the next primary key value, Hibernate's subsequent write attempt ended in a constraint violation.

If you are interested in the technical details, see below.

Technical details

The class code began like this:

    @Entity
    @Table(name="PRODUCT_XXX")
    public class ProductXXX {

        @Id
        @Basic(optional=false)
        @Column(
            name="PROD_ID",
            columnDefinition="integer not null",
            insertable=true,
            updatable=false)
        @SequenceGenerator(
            name="GEN_PROD_ID",
            sequenceName="SEQ_PROD_ID",
            allocationSize=10)
        @GeneratedValue(
            strategy=GenerationType.SEQUENCE,
            generator="GEN_PROD_ID")
        private long prodId;

A discussion of a similar problem on Stack Overflow:
https://stackoverflow.com/questions/12745751/hibernate-sequencegenerator-and-allocationsize
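
To make the mechanism concrete, here is a minimal in-memory sketch of the interleaving. It is a toy model, not Hibernate itself: the classes Sequence, Table, BlockWriter and DirectWriter are invented for illustration. How exactly the second component chose its key is not shown here, so the sketch simply assumes it took MAX(PROD_ID) + 1.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the collision; no Hibernate and no database involved. */
final class KeyCollisionDemo {

    /** Stand-in for a database sequence that hands out block start values. */
    static final class Sequence {
        private final int increment;
        private long next = 1;

        Sequence(int increment) { this.increment = increment; }

        long nextVal() {
            long value = next;
            next += increment;
            return value;
        }
    }

    /** Stand-in for the table: only the primary-key constraint is modeled. */
    static final class Table {
        private final Set<Long> keys = new HashSet<>();

        void insert(long key) {
            if (!keys.add(key)) {
                throw new IllegalStateException("constraint violation: duplicate key " + key);
            }
        }

        long maxKey() {
            long max = 0;
            for (long key : keys) max = Math.max(max, key);
            return max;
        }
    }

    /** First component: one sequence call per block, then local increments. */
    static final class BlockWriter {
        private final Sequence sequence;
        private final Table table;
        private final int blockSize;
        private long nextKey;
        private int leftInBlock = 0;

        BlockWriter(Sequence sequence, Table table, int blockSize) {
            this.sequence = sequence;
            this.table = table;
            this.blockSize = blockSize;
        }

        long write() {
            if (leftInBlock == 0) {
                nextKey = sequence.nextVal();   // the only sequence call for this block
                leftInBlock = blockSize;
            }
            leftInBlock--;
            long key = nextKey++;               // later keys are computed in memory
            table.insert(key);
            return key;
        }
    }

    /** Second component (ASSUMPTION): plain INSERT with key = MAX(PROD_ID) + 1. */
    static final class DirectWriter {
        private final Table table;

        DirectWriter(Table table) { this.table = table; }

        long write() {
            long key = table.maxKey() + 1;
            table.insert(key);
            return key;
        }
    }

    /** Replays the interleaving; returns the error message, or null if nothing failed. */
    static String replay() {
        Table table = new Table();
        BlockWriter first = new BlockWriter(new Sequence(10), table, 10);
        DirectWriter second = new DirectWriter(table);
        first.write();       // reserves keys 1..10 in memory, inserts key 1
        second.write();      // sees MAX = 1, inserts key 2
        try {
            first.write();   // still believes key 2 is free
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

As long as only the first writer runs, the model never collides; the failure needs the second writer to slip in between two writes of the same block, which is why it stayed hidden for so long.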

As it happened, for many months after the second component had been changed to write to the table, the write operations of the first and second components never overlapped. They began to overlap when the work schedule changed slightly in one of the departments using the system.

The integration tests, for their part, passed cleanly because the time windows in which the two components were exercised within the tests did not overlap either.

In a sense, one could say that no one was really to blame for the error.

Or is that not so?

Observations and Reflections

Once the true cause of the error was found, the error was fixed.

But I would like to end this article not with that happy ending but with some reflection on this error as a representative of a broad category of errors that have become common since the move from monolithic to distributed systems.

From the point of view of the individual components or services in the enterprise system described, everything seemed to have been done correctly. All components, or services, had independent life cycles. When the second component needed to write to the table, the operation seemed so minor that a pragmatic decision was made to implement it directly in that component in the simplest way and not to touch the stable first component.

But alas, what happened is something that happens often in distributed systems (and comparatively rarely in monolithic ones): responsibility for operations on a specific object was smeared across subsystems. If both write operations had been implemented in the same microservice, a single technology would surely have been chosen for both, and the error described would not have occurred.

Distributed systems, and the microservice concept in particular, have effectively helped solve a number of problems inherent in monolithic systems. Paradoxically, however, dividing responsibility among individual services provokes the opposite effect. Components now "live" as independently of each other as possible. And when making a big change in one component, there is an inevitable temptation to “bolt on right here” a bit of functionality that would be better implemented in another component. This gets the end result quickly and reduces the amount of coordination and testing. So, from change to change, components acquire features that are alien to them, the same internal algorithms and functions are duplicated, and the same problem ends up being solved in several different ways (sometimes non-deterministically). In other words, a distributed system degrades over time too, only differently from a monolithic one.

"Smearing" of responsibility across components in large systems made up of many services is one of the typical and painful problems of modern distributed systems. Opaque shared subsystems such as caching optimizations, prediction of subsequent operations, service orchestration, and so on complicate the situation even further.

Centralizing access to the database, at least at the level of a single library, is a fairly obvious requirement. However, many modern distributed systems have historically grown up around databases and use the data stored in them directly (via SQL) rather than through access services.
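
What "a single library" could mean in practice can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: ProductGateway and GatewayDemo are hypothetical names, and an in-memory map and counter stand in for the table and the sequence. The point is that key allocation and insertion live in exactly one place, whichever component calls them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical shared access library: the only code allowed to insert product rows. */
final class ProductGateway {
    private final AtomicLong nextId = new AtomicLong(1);               // stand-in for the sequence
    private final Map<Long, String> rows = new ConcurrentHashMap<>();  // stand-in for the table

    /** Allocates the key and stores the row in one place, whoever the caller is. */
    long insert(String payload) {
        long id = nextId.getAndIncrement();
        if (rows.putIfAbsent(id, payload) != null) {
            throw new IllegalStateException("constraint violation: duplicate key " + id);
        }
        return id;
    }

    int rowCount() {
        return rows.size();
    }
}

final class GatewayDemo {

    /** Two "components" write concurrently through the same gateway; returns the row count. */
    static int run(int rowsPerComponent) {
        ProductGateway gateway = new ProductGateway();
        Runnable component = () -> {
            for (int i = 0; i < rowsPerComponent; i++) {
                gateway.insert("row " + i);
            }
        };
        Thread a = new Thread(component, "component-a");
        Thread b = new Thread(component, "component-b");
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return gateway.rowCount();
    }

    public static void main(String[] args) {
        System.out.println(run(1000));
    }
}
```

Because neither caller computes keys on its own, the overlap in time between the two writers no longer matters.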

ORM frameworks and libraries such as Hibernate also “help” to smear responsibility. When using them, many developers of database access services are tempted to return objects that are as full-featured as possible as the result of a query. A typical example is a request for user data in order to show it in a greeting or in a field with the authentication result. Instead of returning the user's name as three text variables (first_name, mid_name, last_name), such a query often returns a full-fledged user object with dozens of attributes and related objects, such as the list of the user's roles. This in turn complicates the processing of the query result, creates unnecessary dependencies of the handler on the type of the returned object and, on top of that, provokes the smearing of responsibility, because logic associated with the object can now be implemented in the service that receives it.
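
The difference between the two styles of result can be shown with a small sketch. All class names here (User, UserName, GreetingDemo) and the sample data are hypothetical; the User class only imitates what an ORM entity might look like.

```java
import java.util.List;

/** Hypothetical full entity, roughly what an ORM query might hand back. */
final class User {
    final long id;
    final String firstName;
    final String midName;
    final String lastName;
    final String passwordHash;
    final List<String> roles;

    User(long id, String firstName, String midName, String lastName,
         String passwordHash, List<String> roles) {
        this.id = id;
        this.firstName = firstName;
        this.midName = midName;
        this.lastName = lastName;
        this.passwordHash = passwordHash;
        this.roles = roles;
    }
}

/** Narrow result type: the three name parts and nothing else. */
final class UserName {
    final String firstName;
    final String midName;
    final String lastName;

    UserName(String firstName, String midName, String lastName) {
        this.firstName = firstName;
        this.midName = midName;
        this.lastName = lastName;
    }
}

final class GreetingDemo {

    /** Access-service side: project the entity down before it leaves the service. */
    static UserName nameOf(User user) {
        return new UserName(user.firstName, user.midName, user.lastName);
    }

    /** Caller side: depends only on UserName, so it cannot reach roles or credentials. */
    static String greeting(UserName name) {
        return "Hello, " + name.firstName + " " + name.lastName + "!";
    }

    public static void main(String[] args) {
        User user = new User(1L, "Anna", "Maria", "Smith", "not-a-real-hash", List.of("ADMIN"));
        System.out.println(greeting(nameOf(user)));
    }
}
```

With JPA, the same narrowing can usually be done in the query itself by means of a JPQL constructor expression (SELECT NEW ...), so that the full entity is never loaded at all.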

So what to do? (Recommendations)

Alas, in certain cases the smearing of responsibility is forced, and sometimes even inevitable and justified.

Nevertheless, wherever possible, try to adhere to the principle of dividing responsibility between components: one component, one responsibility.

And if operations on certain objects cannot be concentrated strictly in one system, such smearing should be recorded very carefully in the system-wide (“supra-component”) documentation as a specific dependency of the components on a data element, on a domain object, or on one another.
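
One possible way to keep such a record from going stale is to make it machine-readable. The sketch below is only an illustration of that idea, not a description of what was done in the system discussed: WriterRegistry is a hypothetical class, and every component and table name except PRODUCT_XXX is invented.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical registry in which every component declares the tables it writes to. */
final class WriterRegistry {
    private final Map<String, Set<String>> writersByTable = new TreeMap<>();

    void declareWriter(String component, String table) {
        writersByTable.computeIfAbsent(table, t -> new TreeSet<>()).add(component);
    }

    /** Tables written by more than one component: the dependencies that must be documented. */
    Map<String, Set<String>> sharedTables() {
        Map<String, Set<String>> shared = new TreeMap<>();
        writersByTable.forEach((table, writers) -> {
            if (writers.size() > 1) {
                shared.put(table, writers);
            }
        });
        return shared;
    }

    public static void main(String[] args) {
        WriterRegistry registry = new WriterRegistry();
        registry.declareWriter("product-collector", "PRODUCT_XXX");   // invented component name
        registry.declareWriter("second-component", "PRODUCT_XXX");    // invented component name
        registry.declareWriter("product-collector", "AUDIT_LOG");     // invented table name
        System.out.println(registry.sharedTables());
    }
}
```

A check like this could run in a build and at least force a conscious decision whenever a second writer appears for a table.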

It would be interesting to hear your opinion on this, as well as cases from practice that confirm or refute the theses of this article.

Thank you for reading the article to the end.

Illustration: "Multimedia Miher", by the author of the article.

Source: https://habr.com/ru/post/458270/
