Top 10 Systems Scaling Errors

Martin L. Abbott and Michael T. Fisher, authors of The Art of Scalability, list the most common architectural, organizational, and technological mistakes that product teams make when scaling. The list draws on their own experience and on conversations with clients, and it formed the basis of their first book.

Architectural errors

1. Designing the implementation instead of the architecture

Which of the following approaches would you use to describe your product architecture?

Option A: “We are a Java shop running GlassFish and Apache Felix, with MySQL and MongoDB as our databases.”

Option B: “We have a 3-tier architecture with isolated zones that, for fault tolerance, do not communicate with each other synchronously. Persistent storage is a combination of relational and NoSQL databases using built-in replication and horizontal sharding.”

The correct answer is B. Option A describes how the architecture happens to be implemented at a given moment, but it says nothing about scalability. Components (programming languages, operating systems, databases, hardware, and so on) scale or fail to scale depending on how they interact. Scalability and availability do not appear out of thin air; they have to be designed in. Only a proper design makes it easy to replace individual components of a software solution when necessary.

2. Designing without anticipating failure

Everything fails sooner or later: hardware, software, data centers, internet providers, processes, and people, especially people. When the system is properly designed and the main services or customer segments are isolated from each other, the failure of any one element has only minor consequences. A temporary payment outage on an e-commerce platform should not take down the ability to search, browse, or add products to the cart for later purchase. Extremely high load from one customer should not disrupt service for all the others.

The sinking of the Titanic is one of the most famous examples of a design flaw that doomed an entire project: the ship's compartments were not fully sealed off from each other, so when the bow dipped, water simply spilled over the tops of the bulkheads until the ship was flooded.
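The payment example above can be illustrated with a minimal sketch. Everything here is hypothetical: `ServiceDown`, `search_products`, `charge_card`, and `checkout` are stand-ins invented for illustration, not part of any real library or of the authors' own material. The point is only that a failure in the payment zone is caught at its boundary, so the cart survives and search keeps working.

```python
class ServiceDown(Exception):
    """Raised by a (hypothetical) service client when its backend is unavailable."""


def search_products(query):
    # Stand-in for the search service; it does not depend on payments at all.
    catalog = ["red shoes", "blue shoes", "green hat"]
    return [item for item in catalog if query in item]


def charge_card(order_id):
    # Stand-in for the payment service; in this sketch it is always down.
    raise ServiceDown("payment provider unreachable")


def checkout(order_id, cart):
    """Try to pay; if payments are down, keep the cart for a later purchase."""
    try:
        charge_card(order_id)
        return {"status": "paid", "cart": []}
    except ServiceDown:
        # The failure stays inside the payment zone: the cart is preserved
        # and no other feature is affected.
        return {"status": "payment_unavailable", "cart": list(cart)}


result = checkout(42, ["red shoes"])
hits = search_products("shoes")
```

In a real system the boundary would more likely be a network call with a timeout than an exception raised in the same process, but the design choice is the same: each zone handles the failure of its neighbor instead of propagating it.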

3. Scaling vertically instead of splitting the load

Many products still rely on a relational database (MySQL, Oracle, SQL Server, etc.) as their primary data store. Instead of segmenting customers across multiple small databases (each hosting several customers to keep costs down), many companies still rely on expensive high-performance hardware to scale a monolithic system.

Such a “solution” ultimately leads to a higher cost per transaction and, as the company grows, to more damage whenever a failure occurs. The return on investment is also poor, because the oversized system sits partly idle while the load gradually grows into it. In the end, even the largest system available will not be large enough, and your product will still have problems. Recall eBay in 1999 or PayPal in 2004.
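Segmenting customers across several small databases can be sketched as a routing function. This is a minimal illustration under stated assumptions: the shard names in `SHARDS` and the function `shard_for` are hypothetical, and hash-modulo routing is just one simple option, not the scheme the authors prescribe.

```python
import hashlib

# Hypothetical names of four small customer databases.
SHARDS = ["customers_db_0", "customers_db_1", "customers_db_2", "customers_db_3"]


def shard_for(customer_id, shards=SHARDS):
    """Map a customer to one shard, always the same one for the same ID."""
    # A cryptographic hash gives a stable, evenly spread value; Python's
    # built-in hash() is randomized per process for strings, so it is avoided.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Every query for a given customer goes to that customer's shard only.
target = shard_for("acme-corp")
```

One caveat of this simple scheme: changing the number of shards remaps most customers. A lookup table from customer to shard, or consistent hashing, avoids that at the cost of extra machinery.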

4. Using the wrong tools

Ask a carpenter to fix your bathroom and he will show up with a hammer. You probably will not be happy with the result. The same happens when you ask a database specialist to help with product architecture. A relational database is still the default and is often the best fit for data that requires strict ACID guarantees, or for relational data (for example, organizational hierarchies or hierarchical product catalogs). However, there are now many alternatives for building distributed storage, depending on the type of data. If your data is JSON, for example, a document store is likely the better fit. The benefit of storing data in its natural format should always be weighed against the overhead of introducing and supporting an additional technology.

Organizational errors

5. Organizing by function

In the past, when we built and sold packaged software, the job of functional managers was to insulate their departments from every distraction so that people could concentrate fully on the task. That worked when each team had a narrow specialization and handed its output down an assembly line. Today's SaaS companies deliver a service, and changes ship every two weeks, sometimes several times a day. This requires product managers to talk with engineers more often, and infrastructure engineers to get involved before coding even starts.

Organizing by function sometimes leads to lower quality, slower time to market, less innovation, and conflict between functional teams. The best teams today are cross-functional and autonomous, and they own the service from the initial idea through to support (William McDonough's “cradle to cradle” design principle, applied to services). If you doubt this principle, ask yourself: “What do we do in a crisis?” The answer is almost always: “We pull people from different teams, put them in one room, and ask them to solve the problem.” If this matters in a crisis, why don't we work that way every day?

6. Teams that are too big

Another major mistake in scaling an organization is having too many people working on the same code. Once a team grows beyond 15-20 engineers, communication and coordination begin to suffer. Disagreements arise over resource planning, reporting lines, and decision making. These conflicts eat into the time set aside for building new functionality, which reduces the value of the product to customers.

Isolating faults within services (see item 2 above) creates natural boundaries in the product that defuse these conflicts. It is fine for one team to be responsible for several services (login, registration, search), but two teams should never be responsible for the same service.

7. Neglecting your garden

Good leaders constantly sow, water, and weed their “garden” of employees: they bring in new talent (sowing), develop team members (watering), and, when necessary, let people go (weeding). To get the best results, you need to evaluate your team continuously. We like to assess employees in three areas: skill, growth, and behavior.

Skill is how well people know and perform the role they were hired for. If someone is a Java developer, how well do they code in Java? Growth is whether people have the potential to move beyond their current role. Are some of them ready to become senior developers? Could they be architects? Are they willing and able to lead? The last category is behavior: how well does a person's conduct fit the company culture? This often-overlooked category has the greatest negative effect on the team as a whole. Someone can be a great Java developer and the world's best architect of scalable systems, but if they demotivate or obstruct the rest of the team, you need to remove them like a weed.

Process errors

8. Inability to learn

Santayana's prophetic words, “Those who cannot remember the past are condemned to repeat it,” hold true for organizations and products. If your service went down and all you care about is getting it running again, you are missing a valuable opportunity to learn. Every failure should be treated as a chance to make sure past mistakes are not repeated. It takes discipline to set aside the time and conduct a thorough investigation. If you conclude that your outage was caused by a hardware failure, you have definitely missed something. Keep asking “Why?” until you uncover the underlying causes in architecture, people, and processes.

9. Faith in Agile as a panacea

Agile development is a great thing, but adopting it does not solve problems with team structure (see items 5 and 6 above) or business problems such as poor communication between the product and sales departments. For it to succeed, it is important to understand Agile as part of a business process and not merely as a software development theory. Finally, Agile can fix only some of the problems it was designed to solve. Expecting it to guarantee specific results by specific dates is like expecting the carpenter to fix your bathroom (see item 4 above).

10. Believing that load and performance testing will reveal every scaling problem

If your company depends on an event that happens once a year, such as Cyber Monday, then you should certainly invest in load and performance testing. The problem with such testing, however, is that it is built on assumptions about what users will do. That works reasonably well if the software does not change. But in most companies the software changes frequently, and customer behavior changes with it. Customers may run heavier reports, or make twice as many queries simply because you changed the color of the search button. In other words, their behavior is partly unpredictable. And since testing compares the current version against a baseline, consider how some intermediate version would compare with the latest one.

Do not assume that test results will reveal every possible failure. Instead, divide customers into isolation zones (see item 2 above), roll the update out to a small group of users first, and use real customer behavior on the new software as your test.
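Rolling an update out to a small group of users first can be sketched as follows. This is an illustrative assumption and not the authors' implementation: the function `release_for`, its `canary_percent` parameter, and the release names "new" and "stable" are all hypothetical.

```python
import zlib


def release_for(customer_id, canary_percent):
    """Pick the release a customer sees; the same customer always gets the same answer."""
    # crc32 yields a stable number for a given ID, reduced here to a bucket 0..99.
    bucket = zlib.crc32(customer_id.encode("utf-8")) % 100
    # Customers in the lowest buckets form the small group that gets the update first.
    return "new" if bucket < canary_percent else "stable"


# Start by sending roughly 5% of customers to the new release,
# then raise the percentage as real traffic shows it behaves well.
version = release_for("acme-corp", canary_percent=5)
```

Because the assignment depends only on the customer ID, raising `canary_percent` keeps every customer who already has the new release on it and only adds more customers to the group.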

We recommend reading the book. The authors do not lay down dogma; they point out subtleties that even the best of us sometimes forget, namely:

properly selected and applied tools and methodologies;
competent management in a team;
reliance on previous experience.

Is such a triad enough for the success of a software product?

Source: https://habr.com/ru/post/269861/