Introduction
Maintenance windows happen unexpectedly on every project and every site - you cannot avoid them, you can only prepare for them. In this article we have collected our experience of switching a front-end farm to autonomous operation - without storage and without backends. The approaches we went through:
- stub
- proxy_store
- proxy_cache_use_stale + memcache ttl = 0
1. Stub
This is the simplest approach: put a static page on the local disk of every node in the farm and rewrite all requests to it.
server {
    listen 80;

    location / {
        # send every request to the stub page
        rewrite ^.*$ /maintance.html;
    }

    location /maintance.html {
        alias .../maintance.html;
        expires -1;   # a date in the past, so the browser does not cache the stub
    }
}
Benefits
- quick to prepare
- no surprises during the maintenance window
- it is better than the browser's "unable to connect to the server" message
Problems
- users do not get what they came for
- the projects lose money during the maintenance
upd: In the comments it was also suggested to use the try_files directive, which solves the same task without rewrite and separate locations, and to set expires explicitly to a date in the past so that the browser does not cache the stub.
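A minimal sketch of that variant, assuming the stub lives in a hypothetical /var/www/maintenance directory:

server {
    listen 80;
    root /var/www/maintenance;   # assumed directory holding the stub page

    location / {
        # serve the stub for any URI; return 503 if the file is somehow missing
        try_files /maintance.html =503;
        expires -1;
    }
}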
1.5
Of course this could not go on for long, and before the next scheduled eight-hour downtime we were given the task: keep serving the ads at any cost.
To get a sense of the scale: about one and a half million unique visitors per day across two hundred news projects, tens (close to a hundred) of millions of hits to various content on the front farm; most images and video are on a CDN. The front farm consists of three nginx nodes with a hardware balancer in front of them.
2. proxy_store
By the time of the maintenance, an nginx-night node had been added to the farm with the following setup:
- the balancer sent it a quarter of the user requests
- the three main nodes of the farm acted as its upstream for all projects
- all responses passing through it were written to an SSD array using the nginx proxy_store directive
location / {
    proxy_pass http://nginx-farm;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_pass_header X-Accel-Redirect;
    proxy_pass_header X-Accel-Expires;
    proxy_ignore_headers X-Accel-Redirect;

    # build a sharded on-disk path out of the URI and query string
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    proxy_store /data/cache/store/$host/$new_store_path;
}

location ~ \.(flv|asf|mp4)$ {
    # video is only proxied, never stored
    proxy_pass http://nginx-farm;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
At hour X the main nodes were removed from balancing and the nginx-night config was changed to something like this:
location / {
    root /data/cache/store/$host/;
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    rewrite ^.*$ /$new_store_path break;
    expires 1m;
}

location ~ css {
    # stored file names carry no extension, so force the MIME type for stylesheets
    default_type text/css;
    root /data/cache/store/$host/;
    set $store_path ---$request_uri---$query_string;
    if ($store_path ~ "(.*)(.{1})(.{2})"){
        set $new_store_path $3/$2/$store_path;
    }
    rewrite ^.*$ /$new_store_path break;
    expires 1m;
}
Compared to the stub, it was a huge step forward, but
- 700 MB/s - that was the write rate to the array while responses were being accumulated
- there is no way to save the response status and headers
- to handle user-friendly URLs and spread pages across subdirectories, the URI had to be chopped up with a regexp and the query string appended - this makes it impossible to determine the content type of most stored files (which is why we had to add the separate css location to the read-only config)
- nobody guarantees that 1 URL = 1 page (personalized blocks, pjax)
- once more than 200 GB had accumulated, the SSD array started to fall over
upd: In fact, proxy_cache would have been the more correct tool here, but we had only one week between "keep the ads running" and "flip the switch", so we chose a flawed but predictable solution.
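For comparison, a minimal sketch of what a proxy_cache-based variant could look like; the zone name, sizes and validity times are assumptions, not our settings:

# cache zone on the same SSD array; unlike proxy_store, its size can be capped with max_size
proxy_cache_path /data/cache/nginx levels=1:2 keys_zone=farm_cache:128m max_size=200g inactive=7d;

server {
    location / {
        proxy_pass http://nginx-farm;
        proxy_set_header Host $host;

        proxy_cache farm_cache;
        proxy_cache_key $host$request_uri;
        # the response status and headers are cached along with the body
        proxy_cache_valid 200 301 302 10m;
    }
}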
2.5
As a result, we were even tempted to solve the discovered problems with a home-grown solution, but we had no priority for such global tasks, and in the meantime the setup managed to change a bit:
- local caches were moved to a ramdisk; thanks to this, the design cache sizes were increased by more than an order of magnitude, and inactive was set to ten
- reading static files from the storage was moved to a separate pool of nginx processes on each node; the dispatcher proxies requests to it, and a local ramdisk cache is configured for small statics (see the sketch after this list)
- a multi-site memcache was set up as a global cache controlled by the applications
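A rough sketch of how the dispatcher and the static pool could be wired on a node; the port, cache path, zone name and timings below are assumptions rather than our actual settings:

# hypothetical separate nginx pool listening on a local port and reading from the storage
upstream static_pool {
    server 127.0.0.1:8081;
}

# small-statics cache kept on the ramdisk
proxy_cache_path /dev/shm/static-cache levels=1:2 keys_zone=static_small:64m inactive=10d;

server {
    location /static/ {
        proxy_pass http://static_pool;
        proxy_cache static_small;
        proxy_cache_key $host$request_uri;
        proxy_cache_valid 200 301 302 60m;
    }
}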
And then we again needed to shut down the sites' internals for eight hours, this time including physically moving network equipment. We had no desire to use proxy_store again, so we tried to take it to the next level.
3. proxy_cache_use_stale
The following was set for all projects:
proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
This directive tells nginx to serve data from the cache when the backend fails, even if the cache entry is already considered stale.
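A minimal sketch of a project location with this directive in place; the cache zone, key and validity times are assumptions:

proxy_cache_path /dev/shm/project-cache levels=1:2 keys_zone=project:64m inactive=10d;

server {
    location / {
        proxy_pass http://project_app;   # assumed app upstream name
        proxy_set_header Host $host;

        proxy_cache project;
        proxy_cache_key $host$request_uri;
        proxy_cache_valid 200 301 302 5m;

        # serve stale entries while the backend is down or returning errors
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}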
In addition to this
- memcached instances were brought up on the nginx nodes; every project that writes data to memcache additionally writes it to these instances with ttl = 0 (unlimited lifetime) - see the sketch after this list
- some order was brought to the project configs: all upstream settings (app and memcache) live in separate files
- personalized blocks return a placeholder when the backend is unavailable
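A rough sketch of one possible wiring, assuming nginx fetches pages by key via memcached_pass; the upstream names, file layout and key format are assumptions:

# assumed conf.d/upstream_memcache.conf - these addresses are what gets rewritten at hour X
upstream project_memcache {
    server 127.0.0.1:11211;   # local instance holding the ttl = 0 copies
}

# in a project location, assuming pages are looked up by key
location /news/ {
    set $memcached_key "$host$request_uri";
    memcached_pass project_memcache;
    default_type text/html;

    # if the key is missing or memcache is unreachable, fall through to the application
    error_page 404 502 504 = @app;
}

location @app {
    proxy_pass http://project_app;
    proxy_set_header Host $host;
}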
Then the whole site operates normally until hour X, when the following manipulations are carried out:
- we shut down the pool of processes that read from the storage; small statics not covered by the CDN are served from the cache thanks to use_stale
- we rewrite the app upstreams to a non-existent local port and set a 5 ms connect timeout on it, so use_stale kicks in again (see the sketch after this list)
- we rewrite the memcache addresses to point at the local instances; this part keeps working as usual
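A minimal sketch of that upstream swap, assuming the separate upstream include files mentioned above; the port number is arbitrary and nothing listens on it:

# conf.d/upstream_app.conf at hour X: point the app upstream at a dead local port
upstream project_app {
    server 127.0.0.1:9999;
}

# in the project location: fail fast so every request falls back to a stale cache entry
location / {
    proxy_pass http://project_app;
    proxy_connect_timeout 5ms;
    proxy_cache project;
    proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
}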
During a dry run before the maintenance, support pleased us by closing tickets with the phrase "no complaints from users". In production the scheme ran autonomously for nine hours and met all expectations: news was read, videos played, ads were served. Of course there were still some problems:
- it required some changes to the applications
- some projects are served with no-cache, some of them due to configuration errors
- a local cache does not guarantee that fresh articles are present on every node, so users occasionally ran into 404s and 502s
- low-traffic projects are a separate case: their caches should be warmed up before such maintenance
- we still have no protection against incorrect data in the caches
As a nice bonus, after all these improvements we can switch any individual project to static mode whenever the need arises.
3.5
The main goal of the next step will be to learn to say "YES" in the situation "everything is broken, we urgently need to roll the project back to the way it was 15/30 minutes ago - can you do that while we fix the root cause?")
Tags: proxy_store, proxy_cache_use_stale, proxy_cache_path, Memcache::set