As a sysadmin, I advise you to get the most expensive dedicated server without support, RAID, plenty of storage for special files, a nicer template for the site, and to buy AdWords for at least two days.
In the previous part I described the general architecture of the application and some features of the infrastructure. Today I would like to dwell on a few points in more detail and tell you what problems were created literally out of thin air. Along the way I will explain why some frankly dubious decisions were made (based on conversations with my predecessor).
The platform was not monitored at all. Meanwhile, users constantly complained that parts of the site were slow. My predecessor solved the problem with horizontal scaling: once every 2-3 months another server was simply purchased and added to the Nginx config on the load balancer. Looking ahead, I will say that once I started collecting capacity-usage statistics, it turned out that 90% of the infrastructure was simply sitting idle. The money spent renting those servers was wasted. The reasoning behind this approach: "Well, if something breaks, the customers will tell us; and why run yet another daemon?"
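To make the scheme concrete: the "scaling" consisted of appending one more line to an upstream block roughly like the one below, with no data on how loaded the existing backends actually were. The names and addresses here are invented for illustration; only the idea is from the story.

```nginx
# On the balancer, inside the http context
upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;   # every couple of months another line like this appeared
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
    }
}
```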
Over my years in the industry, all the distributions have merged into one for me personally. Whereas earlier, when planning infrastructure, I would stick to a particular distribution simply because I had more experience with it (or because I wanted to try a new one), nowadays I am mostly guided by the cost of supporting a particular solution in a particular situation.
In the project I am describing now, my predecessor had read somewhere that Gentoo scales perfectly to dozens of servers: build a package once and you can simply rsync it onto the other machines. The theory is beautiful (I have even seen such a setup work, for admin workstations), but in practice nobody managed to synchronize the Portage tree even once a week, which over time made installing packages nearly impossible. Security updates were not even on the table. It took me a couple of weeks to bring everything back into a decent state, after which I started thinking about moving to a binary distribution: I did not want to spend several days every month on updates and on rebuilding reverse dependencies (hello, ZeroMQ broker implemented in Ruby via libffi). The reason for using Gentoo: "Well, look how quickly you can roll out a new server with my scripts and add it to the infrastructure."
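For reference, the workflow my predecessor was apparently aiming for looks roughly like the sketch below (hostnames are invented, and this is my reconstruction rather than his scripts). Note the last comment: the tree still has to be kept fresh everywhere, which is exactly the part that was not happening.

```sh
# On the "reference" host: keep binary packages of everything that gets built
echo 'FEATURES="buildpkg"' >> /etc/portage/make.conf
emerge --sync && emerge -uDN @world      # binpkgs land in PKGDIR (/usr/portage/packages by default)

# Copy the packages to another machine
rsync -a /usr/portage/packages/ web02:/usr/portage/packages/

# On web02: install from the binary packages, skipping compilation.
# The Portage tree and profiles still need to be present and reasonably
# fresh on every box, so regular syncing does not go away.
emerge --sync && emerge -uDN --usepkgonly @world
```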
Since I have started talking about the broker, let me describe the problems I had with it. There was no state monitoring at all (more precisely, the broker code contained stubs for functions like ping_service(), get_service_state(), get_stats() and the like). The only implemented function, ping_broker(), worked from just one service and could only be called from the Rails console: ServiceName.ping_broker(). That was it. Services did not know when the broker was down. Services did not know how to re-register if the broker restarted. The broker was stateless, so it "forgot" about all the services after every restart, and I had to go around the servers by hand, attach to the screen sessions and restart all the services and their event handlers. And as the cherry on top, the broker was responsible for assigning ports to services. A min_port:max_port pool was set in the broker's settings; on startup a service asked the broker which port to bind to and then tried to listen on it. Since the broker ran on one server and a service might start on another, the port the broker handed out could already be taken there, and the service simply failed to start with "Address already in use". Monitoring services under such a scheme was impossible. The very purpose of using this broker, to spread the load across servers and to be able to run each service on its own machine, was never achieved.
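For illustration, here is a minimal Ruby sketch of how a service can avoid the "Address already in use" trap entirely: instead of asking a remote broker for a port number, bind to port 0 locally, let the OS pick a free port, and only then report it. This is not what the project's broker did, and all names here are mine.

```ruby
require "socket"

# The broker handed out ports from a static pool while knowing nothing about
# what was already bound on the service's host. The standard alternative:
# bind locally to port 0 and let the kernel choose a free port.
def bind_to_free_port
  server = TCPServer.new("0.0.0.0", 0)  # port 0 means "any free port"
  [server, server.addr[1]]              # addr[1] is the port the OS actually assigned
end

server, port = bind_to_free_port
# register_with_broker(service: "geocoder", host: Socket.gethostname, port: port)  # hypothetical call
puts "listening on #{port}"
```

With this arrangement the broker only needs to record what it is told, and two services starting on the same machine can never collide.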
For those interested, here is a link to the project: http://walterdejong.imtqy.com/synctool/ . In principle it had a right to exist. But firstly, a pile of bash scripts plus rsync is not configuration management, and secondly, by then I had already been introduced to Ansible, which turned out to be far more flexible. There is not much to tell here: over a couple of days I moved all the logic from synctool to Ansible and forgot about it like a bad dream. The reason for using synctool: "Well, I looked at Puppet, it seemed complicated, but with synctool you can solve everything with scripts." People simply had not heard of Ansible or Chef.
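For comparison, the kind of task those synctool scripts boiled down to looks like this as an Ansible play. The hosts, paths and file names below are invented for illustration.

```yaml
# playbook.yml -- render a config from a template and reload the service
- hosts: app_servers
  become: true
  tasks:
    - name: Deploy the nginx vhost from a template
      template:
        src: templates/app.vhost.j2
        dest: /etc/nginx/conf.d/app.conf
      notify: reload nginx

  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded
```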
In the first part I mentioned Falcon but forgot to give a link to it, so let me correct that: http://www.falconpl.org/ . It is a mix of a procedural and functional scripting language with multi-threading support and its own virtual machine. In principle it is a powerful and interesting thing with a low entry threshold, but why it was used solely to run ssh dba@db01 "echo 'SELECT stuff FROM table' | psql -U postgres app_db" remained beyond my understanding. I never did get to ask the question "what the hell is this doing here?" about Falcon.
One last point for today. Rails has a wonderful mechanism that covers 99% of the cases where you need to configure your application differently for production and development. This mechanism was not used: the host names of the services, the Redis address, the database address and port, and the application's domain name were all hard-coded right in the source. At some point I had to migrate Redis and the database to other servers, and the platform was down for more than a day while I hunted down all such places. The reasons were the development model and the not-very-high qualification of the programmer. At first the project was written practically "on the knee", then new features kept being added and added, nobody ever did any refactoring, and at some point it turned into what is shown in the picture:
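As for the mechanism itself: the text does not say exactly which Rails facility is meant, but one common way to keep environment-specific settings out of the code looks roughly like this (assuming Rails 4.2 or newer, which has the config.x custom namespace, and the redis-rb gem; hostnames are made up):

```ruby
# config/environments/production.rb
Rails.application.configure do
  config.x.redis = { host: "redis01.internal", port: 6379 }
end

# config/environments/development.rb
Rails.application.configure do
  config.x.redis = { host: "localhost", port: 6379 }
end

# anywhere in the application, instead of a hard-coded hostname:
redis = Redis.new(Rails.application.config.x.redis)
```

Moving Redis or the database then becomes a one-line change per environment instead of a day-long grep through the codebase.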
In the last part I will talk about how the platform looks now, what technologies are used and why, how using tools suited to the task helps save money, and why the sysadmin should not write code and the programmer should not administer servers.
Source: https://habr.com/ru/post/317408/