6 Ways to Kill Your Servers - Learning Scalability the Hard Way

Learning how to scale an application when you have no prior experience is hard. There are plenty of sites devoted to the topic these days, but unfortunately no single solution fits every case. You still have to work out the answers that fit your own requirements - just as I had to.

A few years ago, my boss came to me and said: "We have a new project for you - migrating a site that already gets 1 million visitors per month. You need to move it and make sure traffic can keep growing without any problems." I was already an experienced programmer, but I had no experience with scalability. And I had to learn scalability the hard way.

The site ran on a PHP CMS using MySQL and Smarty. The first step was to find a hosting company with experience running high-load projects, and we gave them our requirements.

What we got (the hoster assured us it would be enough) was a load balancer in front of two web servers running Apache + mod_php, plus a dedicated MySQL database server.

To keep files in sync between the web servers, the hoster set up DRBD in an active-active configuration.
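For reference, here is a minimal sketch of what a dual-primary DRBD resource of that era looked like. The resource name, hostnames, devices, and addresses are placeholders, not our actual configuration:

```
# /etc/drbd.conf sketch (DRBD 8-era syntax; all values are placeholders)
resource r0 {
  protocol C;                 # fully synchronous replication
  startup {
    become-primary-on both;   # promote both nodes on startup
  }
  net {
    allow-two-primaries;      # this is what makes the setup active-active
  }
  on web1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on web2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Dual-primary DRBD is only safe when a cluster-aware filesystem (such as OCFS2 or GFS) sits on top of it; with an ordinary filesystem, concurrent writes from both nodes will corrupt it.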

Finally, migration day arrived. Early in the morning we switched the domain to the new IP and started monitoring our scripts. Traffic arrived almost immediately, and everything seemed to be working: pages loaded quickly, MySQL handled a pile of queries, and everyone was happy.
Then, unexpectedly, the phone rang: "We can't reach the website - what is going on?!" We looked at our monitoring software and saw that the servers had crashed and the site was down. Of course, the first thing we did was call the hoster: "All of our servers are down. What is happening?!" They promised to check the servers and call back. A while later they called: "Your filesystem is hopelessly corrupted. What did you do?!" They stopped the balancer and told me to look at one of the web servers. When I opened index.php, I was shocked: it contained incomprehensible chunks of C code, error messages, and something that looked like log fragments. After a brief investigation, we determined that our DRBD setup was the cause.

Lesson number 1

Put the Smarty cache on an active-active DRBD cluster under high load, and your site will crash.

While the hoster was restoring the web servers, I rewrote part of the CMS so that the Smarty cache files were stored on the local filesystem. With the problem found and fixed, we went back online.
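The change itself was small. A minimal sketch of the idea using the Smarty 2-era API (the paths are placeholders): point the compile and cache directories at local disk, so cache writes never touch the replicated volume.

```php
<?php
require_once 'Smarty.class.php';

$smarty = new Smarty();
$smarty->template_dir   = '/var/www/site/templates';    // shared, read-mostly
$smarty->compile_dir    = '/var/local/smarty/compile';  // local disk, per server
$smarty->cache_dir      = '/var/local/smarty/cache';    // local disk, per server
$smarty->caching        = 1;                            // enable page caching
$smarty->cache_lifetime = 300;                          // seconds
```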

It was still early in the day. The traffic peak usually came in the late afternoon and lasted into the early evening; at night there were almost no visitors. We went back to watching the system. The site was up, but as peak time approached, the load grew and responses slowed. I increased the Smarty cache lifetime, hoping it would help, but it didn't. Soon the servers started producing timeout errors and blank pages: two web servers simply could not handle the load.

Our client was nervous, but he understood that a migration usually brings some problems with it.

We needed to reduce the load somehow, and we discussed it with the hoster. One of their admins had a good suggestion: "Your servers are currently running Apache + mod_php. Can we switch to Lighttpd? It's a small project, but even Wikipedia uses it." We agreed.

Lesson number 2

Install a web server with its out-of-the-box configuration, tune nothing, and your site will crash.

The admin reconfigured our servers as quickly as he could, dropping Apache in favour of a Lighttpd + FastCGI + XCache setup. How long would the servers hold up this time?
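I don't have the admin's exact configuration, but a typical Lighttpd + FastCGI setup of that era looked roughly like the sketch below (paths and process counts are illustrative; XCache itself is just an opcode-cache extension enabled in php.ini):

```
# lighttpd.conf fragment (illustrative values)
server.modules += ( "mod_fastcgi" )

fastcgi.server = ( ".php" => ((
    "bin-path"  => "/usr/bin/php-cgi",
    "socket"    => "/var/run/lighttpd/php.socket",
    "max-procs" => 4,                        # FastCGI spawner processes
    "bin-environment" => (
        "PHP_FCGI_CHILDREN"     => "16",     # PHP workers per spawner
        "PHP_FCGI_MAX_REQUESTS" => "10000"   # recycle workers periodically
    )
)))
```

The win over untuned Apache + mod_php is that a fixed pool of PHP processes serves the dynamic requests, while the lightweight event-driven server handles static files and slow clients.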

Surprisingly, the servers held up well. The load was significantly lower than before, and the average response time was good. After that we went home to get some rest - it was already late, and we agreed there was nothing more we could do for now.

Over the following days the servers handled the load reasonably well, but at peak times they came close to falling over. We found that the bottleneck was MySQL and called the hoster again. They advised MySQL master-slave replication, with a slave running on each web server.

Lesson number 3

Even a powerful database server has its limits - and when you reach them, your site will crash.

This problem was not so easy to fix. The CMS was very simplistic in this respect: it had no built-in way to split SQL queries between servers. Modifying it took some time, but the result was worth it.
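The idea, reduced to a minimal sketch (the class and connection details are made up for illustration, not the CMS's real API): send writes to the master and reads to the replica running on the local web server.

```php
<?php
// Illustrative sketch: route SELECTs to the local slave, everything else
// (INSERT/UPDATE/DELETE/...) to the master.
class SplitDb
{
    private $master;
    private $slave;

    public function __construct()
    {
        $this->master = new mysqli('db-master.internal', 'app', 'secret', 'cms');
        $this->slave  = new mysqli('127.0.0.1', 'app', 'secret', 'cms');
    }

    public function query($sql)
    {
        $isRead = preg_match('/^\s*SELECT\b/i', $sql) === 1;
        return ($isRead ? $this->slave : $this->master)->query($sql);
    }
}
```

One caveat worth remembering with this pattern: replication lag means a read issued immediately after a write may not see it yet, so read-after-write paths should be pinned to the master.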

MySQL replication truly worked a miracle, and the site was finally stable. Over the following weeks the site grew in popularity and the number of users kept climbing. It was only a matter of time before traffic would once again exceed our resources.

Lesson number 4

Plan nothing in advance, and sooner or later your site will crash.

Fortunately, we kept observing and planning. We optimized the code, reduced the number of SQL queries, and then stumbled upon memcached. I started by adding memcached to a handful of the heaviest functions. When we deployed the changes to production, we could not believe the results - it was as if we had found the Holy Grail. We cut the number of queries per second by at least 50%. Instead of buying another web server, we decided it was better to invest in memcached.
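What we did maps onto the classic cache-aside pattern. A minimal sketch using PHP's Memcached extension (the function, key, and query are illustrative, not from the CMS):

```php
<?php
// Cache-aside: try the cache first, fall back to MySQL, store the result.
function getTopArticles(Memcached $cache, mysqli $db)
{
    $key = 'top_articles_v1';

    $rows = $cache->get($key);
    if ($rows !== false) {
        return $rows;                      // hit: no SQL executed at all
    }

    // Miss: run the expensive query once...
    $result = $db->query(
        'SELECT id, title FROM articles ORDER BY views DESC LIMIT 10'
    );
    $rows = $result->fetch_all(MYSQLI_ASSOC);

    // ...and keep the result hot for 5 minutes.
    $cache->set($key, $rows, 300);
    return $rows;
}
```

Applied to the few heaviest functions, every cache hit is a batch of SQL queries that never reaches MySQL, which is where the 50% drop came from.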

Lesson number 5

Cache nothing and keep spending money on new hardware instead, and your site will crash.

memcached helped us reduce the load on MySQL by 70-80%, which translated into a huge performance gain. Pages loaded faster than ever.

Finally, our configuration seemed perfect. Even at peak times we no longer had to worry about outages or long response times. But then one of the web servers suddenly started giving us trouble: error messages, blank pages, and so on. The load was fine, and in most cases the server worked well - but only in "most cases".

Lesson number 6

Put a few hundred thousand small files in one directory, forget about inode limits, and your site will crash.

Yes, really. We had been so busy optimizing MySQL, PHP, and the web servers that we hadn't paid enough attention to the filesystem. The Smarty cache was stored on the local filesystem, all in a single directory. The solution was to move the Smarty cache onto a separate ReiserFS partition. We also enabled Smarty's 'use_subdirs' option.
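In the Smarty releases of that era the switch is the $use_sub_dirs property, which hashes cache files into a tree of subdirectories instead of one flat directory. A two-line sketch (the mount point is a placeholder):

```php
<?php
$smarty->cache_dir    = '/mnt/smarty-cache';  // dedicated ReiserFS partition
$smarty->use_sub_dirs = true;                 // fan files out into subdirectories
```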

Over the following years we kept optimizing: we moved the Smarty cache into memcached, installed Varnish to reduce the load on the I/O system, switched to Nginx (Lighttpd had started randomly throwing 500 errors), bought better hardware, and so on.

Conclusion

Scaling a website is a never-ending process. As soon as you fix one bottleneck, you will most likely run into the next. Never think, "That's it, we're done." It will ruin your servers and possibly your business. Optimization is a continuous process: if you can't do the work yourself for lack of experience or resources, find a competent partner to work with. And never stop discussing current and future problems with your team and your partners.

About the author - Steffen Konerow, author of the High Performance Blog.

Source: https://habr.com/ru/post/102494/

