Introduction
Maintenance windows happen unexpectedly on every project and every site - you cannot avoid them, you can only prepare for them. In this article we have collected our experience of switching a front-end farm to autonomous operation - without storage and without backends. The approaches we went through:
- stub
- proxy_store
- proxy_cache_use_stale + memcache ttl = 0
1. Stub
This is the simplest approach: put a static page on the local disk of every node in the farm and rewrite all requests to it.
server {
    listen 80;

    location / {
        # send every request to the stub page
        rewrite ^.*$ /maintance.html;
    }

    location /maintance.html {
        alias .../maintance.html;
        expires -1;   # a date in the past, so the browser does not cache the stub
    }
}
Benefits
- quick to prepare
- no surprises during the maintenance window
- it is better than the browser's "unable to connect to the server" message
Problems
- users do not get what they came for
- the projects lose money during the maintenance
upd: In the comments it was also suggested to use the try_files directive, which solves the same task without rewrite and separate locations, and to set expires explicitly to a date in the past so that the browser does not cache the stub.
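A minimal sketch of that variant, assuming the stub lives in a hypothetical /var/www/maintenance directory:

server {
    listen 80;
    root /var/www/maintenance;   # assumed directory holding the stub page

    location / {
        # serve the stub for any URI; return 503 if the file is somehow missing
        try_files /maintance.html =503;
        expires -1;
    }
}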
1.5
Of course this could not go on for long, and before the next scheduled eight-hour downtime we were given the task: keep serving the ads at any cost.
To get a sense of the scale: about one and a half million unique visitors per day across two hundred news projects, tens (close to a hundred) of millions of hits to various content on the front farm; most images and video are on a CDN. The front farm consists of three nginx nodes with a hardware balancer in front of them.
2. proxy_store
By the time of the maintenance, an nginx-night node had been added to the farm with the following setup:
- the balancer sent it a quarter of the user requests
- the three main nodes of the farm acted as its upstream for all projects
- all responses passing through it were written to an SSD array using the nginx proxy_store directive
location / {
    proxy_pass http://nginx-farm;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_pass_header X-Accel-Redirect;
    proxy_pass_header X-Accel-Expires;
    proxy_ignore_headers X-Accel-Redirect;

    # build a sharded on-disk path out of the URI and query string
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    proxy_store /data/cache/store/$host/$new_store_path;
}

location ~ \.(flv|asf|mp4)$ {
    # video is only proxied, never stored
    proxy_pass http://nginx-farm;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
At hour X the main nodes were removed from balancing and the nginx-night config was changed to something like this:
location / {
    root /data/cache/store/$host/;
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    rewrite ^.*$ /$new_store_path break;
    expires 1m;
}

location ~ css {
    # stored file names carry no extension, so force the MIME type for stylesheets
    default_type text/css;
    root /data/cache/store/$host/;
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    rewrite ^.*$ /$new_store_path break;
    expires 1m;
}
Compared to the stub, it was a huge step forward, but
- 700 MB/s - that was the write rate to the array while responses were being accumulated
- there is no way to save the response status and headers
- to handle user-friendly URLs and spread pages across subdirectories, the URI had to be chopped up with a regexp and the query string appended - this makes it impossible to determine the content type of most stored files (which is why we had to add the separate css location to the read-only config)
- nobody guarantees that 1 URL = 1 page (personalized blocks, pjax)
- once more than 200 GB had accumulated, the SSD array started to fall over
upd: In fact, proxy_cache would have been the more correct tool here, but we had only one week between "keep the ads running" and "flip the switch", so we chose a flawed but predictable solution.
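For comparison, a minimal sketch of what a proxy_cache-based variant could look like; the zone name, sizes and validity times are assumptions, not our settings:

# cache zone on the same SSD array; unlike proxy_store, its size can be capped with max_size
proxy_cache_path /data/cache/nginx levels=1:2 keys_zone=farm_cache:128m max_size=200g inactive=7d;

server {
    location / {
        proxy_pass http://nginx-farm;
        proxy_set_header Host $host;

        proxy_cache farm_cache;
        proxy_cache_key $host$request_uri;
        # the response status and headers are cached along with the body
        proxy_cache_valid 200 301 302 10m;
    }
}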
2.5
As a result, we were even tempted to solve the discovered problems with a home-grown solution, but we had no priority for such global tasks, and in the meantime the setup managed to change a bit:
- local caches were moved to a ramdisk; thanks to this, the design cache sizes were increased by more than an order of magnitude, and inactive was set to ten
- reading static files from the storage was moved to a separate pool of nginx processes on each node; the dispatcher proxies requests to it, and a local ramdisk cache is configured for small statics (see the sketch after this list)
- a multi-site memcache was set up as a global cache controlled by the applications
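A rough sketch of how the dispatcher and the static pool could be wired on a node; the port, cache path, zone name and timings below are assumptions rather than our actual settings:

# hypothetical separate nginx pool listening on a local port and reading from the storage
upstream static_pool {
    server 127.0.0.1:8081;
}

# small-statics cache kept on the ramdisk
proxy_cache_path /dev/shm/static-cache levels=1:2 keys_zone=static_small:64m inactive=10d;

server {
    location /static/ {
        proxy_pass http://static_pool;
        proxy_cache static_small;
        proxy_cache_key $host$request_uri;
        proxy_cache_valid 200 301 302 60m;
    }
}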
And then we again needed to shut down the sites' internals for eight hours, this time including physically moving network equipment. We had no desire to use proxy_store again, so we tried to take it to the next level.
3. proxy_cache_use_stale
The following was set for all projects:
proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
This directive tells nginx to serve data from the cache when the backend fails, even if the cache entry is already considered stale.
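A minimal sketch of a project location with this directive in place; the cache zone, key and validity times are assumptions:

proxy_cache_path /dev/shm/project-cache levels=1:2 keys_zone=project:64m inactive=10d;

server {
    location / {
        proxy_pass http://project_app;   # assumed app upstream name
        proxy_set_header Host $host;

        proxy_cache project;
        proxy_cache_key $host$request_uri;
        proxy_cache_valid 200 301 302 5m;

        # serve stale entries while the backend is down or returning errors
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}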
In addition to this
- memcached instances were brought up on the nginx nodes; every project that writes data to memcache additionally writes it to these instances with ttl = 0 (unlimited lifetime) - see the sketch after this list
- some order was brought to the project configs: all upstream settings (app and memcache) live in separate files
- personalized blocks return a placeholder when the backend is unavailable
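A rough sketch of one possible wiring, assuming nginx fetches pages by key via memcached_pass; the upstream names, file layout and key format are assumptions:

# assumed conf.d/upstream_memcache.conf - these addresses are what gets rewritten at hour X
upstream project_memcache {
    server 127.0.0.1:11211;   # local instance holding the ttl = 0 copies
}

# in a project location, assuming pages are looked up by key
location /news/ {
    set $memcached_key "$host$request_uri";
    memcached_pass project_memcache;
    default_type text/html;

    # if the key is missing or memcache is unreachable, fall through to the application
    error_page 404 502 504 = @app;
}

location @app {
    proxy_pass http://project_app;
    proxy_set_header Host $host;
}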
Then the whole site operates normally until hour X, when the following manipulations are carried out:
- we shut down the pool of processes that read from the storage; small statics not covered by the CDN are served from the cache thanks to use_stale
- we rewrite the app upstreams to a non-existent local port and set a 5 ms connect timeout on it, so use_stale kicks in again (see the sketch after this list)
- we rewrite the memcache addresses to point at the local instances; this part keeps working as usual
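A minimal sketch of that upstream swap, assuming the separate upstream include files mentioned above; the port number is arbitrary and nothing listens on it:

# conf.d/upstream_app.conf at hour X: point the app upstream at a dead local port
upstream project_app {
    server 127.0.0.1:9999;
}

# in the project location: fail fast so every request falls back to a stale cache entry
location / {
    proxy_pass http://project_app;
    proxy_connect_timeout 5ms;
    proxy_cache project;
    proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
}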
During a dry run before the maintenance, support pleased us by closing tickets with the phrase "no complaints from users". In production the scheme ran autonomously for nine hours and met all expectations: news was read, videos played, ads were served. Of course there were still some problems:
- it required some changes to the applications
- some projects are served with no-cache, some of them due to configuration errors
- a local cache does not guarantee that fresh articles are present on every node, so users occasionally ran into 404s and 502s
- low-traffic projects are a separate case: their caches should be warmed up before such maintenance
- we still have no protection against incorrect data in the caches
As a nice bonus, after all these improvements we can switch any individual project to static mode whenever the need arises.
3.5
The main goal of the next step will be to learn to say "YES" in the situation "everything is broken, we urgently need to roll the project back to the way it was 15/30 minutes ago - can you do that while we fix the root cause?")
Tags: proxy_store, proxy_cache_use_stale, proxy_cache_path, Memcache::set