
Proxy and save

One November the world changed and will never be the same again. Censorship appeared on the Russian Internet - the well-known registry of banned sites. For some this is the most important political issue, for others a reason to study encryption and anonymity-protection technology, for still others just another strange law that has to be complied with on the fly. We will talk about the technological aspect.

In this tutorial we will learn how to quickly and easily make a working mirror of any site, which lets you change its IP and assign it any domain name. We will even try to hide the domain in the URL, and then save a complete local copy of the site. All exercises can be done on any virtual server - I personally use Hetzner hosting and Debian. And of course, we will use the best web server of all time - NGINX!

By this paragraph, the inquisitive reader has already acquired and configured some kind of dedicated server, or simply launched Linux on the old computer under the table, and has the latest version of Nginx running with its default stub page.

Before starting, you must compile nginx with the ngx_http_substitutions_filter_module module (formerly known as substitutions4nginx).
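For reference, the build might look something like this (a sketch: the nginx version and directory names are assumptions, adjust them to your environment):

  git clone https://github.com/yaoweibin/ngx_http_substitutions_filter_module.git
  cd nginx-1.2.6    # your unpacked nginx source tree
  ./configure --add-module=../ngx_http_substitutions_filter_module
  make && make install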

Further configuration will be shown using the example of www.6pm.com . This is the site of a popular online store selling brand-name goods at good discounts. It is notable for its categorical refusal to serve buyers from Russia. Why not thumb our nose at capitalist censorship?

We already have a working Nginx doing useful work - serving a Livestreet-based site about the benefits of overseas shopping. To bring up the 6pm mirror, we register a DNS record with the name 6pm.pokupki-usa.ru pointing to the server's IP. As you understand, the choice of the subdomain name is completely arbitrary. This name will be sent in the Host field each time our new resource is accessed, which is what makes virtual hosting on Nginx possible.
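In zone-file notation the record might look something like this (the IP address is a placeholder from the documentation range; substitute your server's own):

  6pm.pokupki-usa.ru.  IN  A  203.0.113.10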

In the http section of the nginx configuration we declare an upstream - the donor site, as we will call it from now on. In standard guides the proxied site is usually called the back-end, and the reverse proxy the front-end.

http {
    ...
    upstream 6pm {
        server www.6pm.com;
    }
    ...
}


Next you need to create a server section. This is how it looks:

  server {
      listen 80;
      server_name 6pm.pokupki-usa.ru;
      limit_conn gulag 64;
      access_log /var/log/nginx/6pm.access.log;
      error_log /var/log/nginx/6pm.error.log;

      location / {
          root /var/www/6pm;
          try_files $uri @static;
      }

      location @static {
          include '6pm.conf';
          proxy_cookie_domain 6pm.com 6pm.pokupki-usa.ru;
          proxy_set_header Accept-Encoding "";
          proxy_set_header Host www.6pm.com;
          proxy_pass http://6pm;
          proxy_redirect http://www.6pm.com http://6pm.pokupki-usa.ru;
          proxy_redirect https://secure-www.6pm.com https://6pm.pokupki-usa.ru;
      }
  }


The standard listen and server_name directives define the virtual host that this server section answers for. Log files are best kept separate.

We declare the root location and specify the path to its local storage - root /var/www/6pm; - and then use try_files. This is a very important nginx directive that lets you organize local storage for downloaded files. It first checks whether a file named $uri exists on disk, and if it finds none, it falls through to the named location @static.
$uri is an nginx variable that contains the path from the HTTP request.

The "@" prefix defines a named location. Such a location is not used during normal request processing; it is intended only for redirecting requests to it. Such locations cannot be nested and cannot contain nested locations.


In our case, this construction is used only to substitute the robots.txt file, to keep the mirror's content out of search indexes. But full mirroring and caching in nginx are done in exactly the same way.
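For example, a minimal /var/www/6pm/robots.txt that forbids all indexing could look like this; try_files finds it on disk, so the donor's own robots.txt is never requested:

  User-agent: *
  Disallow: /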

include '6pm.conf'; - this is where the substitution module logic lives.

proxy_cookie_domain is a new directive that appeared in nginx 1.1.15; before it existed, cookie domains had to be rewritten by hand. No more racking your brains: add one line and cookies just start working.
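To illustrate the effect (the cookie name and value here are invented):

  # donor response:  Set-Cookie: session=abc123; Domain=.6pm.com; Path=/
  # after rewriting: Set-Cookie: session=abc123; Domain=6pm.pokupki-usa.ru; Path=/
  proxy_cookie_domain 6pm.com 6pm.pokupki-usa.ru;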

proxy_set_header Accept-Encoding ""; is a very important directive: it makes the donor site return content uncompressed, otherwise the substitution module would have nothing to match against.

proxy_set_header Host is another important directive: it sets the correct Host field in the request to the donor site. Without it, the name of our proxy server would be sent instead, and the request would fail.
proxy_pass - direct addressing does not work inside a named location, which is why we registered the donor site's address in the upstream directive.
proxy_redirect - many sites use redirects for their own needs; every redirect has to be caught and rewritten here, otherwise the request, and the client with it, will escape the bounds of our cozy domain.
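A hypothetical exchange shows what happens to the Location header:

  # donor response:  Location: http://www.6pm.com/cart
  # after rewriting: Location: http://6pm.pokupki-usa.ru/cart
  proxy_redirect http://www.6pm.com http://6pm.pokupki-usa.ru;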

Now let's look at the contents of 6pm.conf. It is no accident that I moved the transformation logic into a separate file: it can hold thousands of replacement rules and hundreds of kilobytes of filters without any loss of performance. In our case we just want to finish the proxying job, so the file contains only five lines.

Change the Google Analytics codes:
 subs_filter 'UA-8814898-13' 'UA-28370154-3' gi;
 subs_filter "'.6pm.com']," "'6pm.pokupki-usa.ru']," gi;

I assure you, this is the most harmless prank possible: we get the visit statistics, and those visits simply disappear from the donor's own analytics.

Replace all direct links with new ones:
 subs_filter "www.6pm.com" "6pm.pokupki-usa.ru" gi; subs_filter "6pm.com" "6pm.pokupki-usa.ru" gi; 


As a rule, on normal sites all the images live on CDN networks that do not bother to check the origin of requests, so it is enough to replace links to the main domain only. In our case 6pm showed off and put some of the images on domains that refuse visitors from Russia. Fortunately, the substitution module supports regular expressions, and it is not difficult to write a general rule for a group of links. In our case we did not even need a complicated regexp - changing two characters in the domain was enough. It turned out like this:

 subs_filter "http://a..zassets.com" "http://l3.zassets.com" gi; 


The only, but very serious, limitation of the substitution module is that it operates within a single line. The restriction is architectural: the module runs while the page is still being transferred in chunks (chunked transfer encoding), so there is no way to run a full-text regexp over the whole document.
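A contrived illustration (pattern and markup invented for the example): a rule like

  subs_filter '<span>old</span>' '<span>new</span>' gi;

will never fire if the donor happens to emit the markup across two lines:

  <span>old
  </span>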

That's it - you can look at the result: everything works, even paying for an order goes through without difficulty.

So, we have moved the site to a new IP address and a new domain name. That was the simple task. Is it possible to roll a site not into a new domain, but into a subdirectory of an existing one? It can be done, but there are difficulties. First, let's recall what kinds of HTML links exist:
  1. Absolute links of the form "www.example.com/some/path"
  2. Root-relative links of the form "/some/path"
  3. Relative links of the form "some/path"


Item 1 is simple: we replace all such links with the new path, including the subdirectory.
Item 3 is just as simple: we touch nothing and everything works by itself, as long as the base href attribute is not used. If it is used, which is extremely rare on modern sites, it is enough to replace it and everything will keep working.
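A hypothetical rule for that rare case might look like this (the exact attribute value depends on the donor's markup):

  subs_filter '<base href="http://www.6pm.com/"' '<base href="http://pokupki-usa.ru/6pm/"' gi;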

The real difficulty is item 2, because we have to turn a mass of links like /... into /subdirectory/... If you do this head-on, the site will almost certainly stop working altogether, because such a blanket replacement breaks many constructions that merely contain a slash, ruining practically every JavaScript snippet on the page.

In theory you could write one sufficiently universal regexp that picks out only the patterns that need replacing; in practice it is much easier to write a few simple regexps that translate the necessary links piece by piece.

Let's return to our patient:

  location /6pm {
      root /var/www/6pm;
      try_files $uri @6pm-static;
      access_log /var/log/nginx/6pm.access.log;
  }

  location @6pm-static {
      include '6pm2.conf';
      proxy_cookie_domain 6pm.com pokupki-usa.ru;
      proxy_cookie_path / /6pm/;
      rewrite ^/6pm/(.*) /$1 break;
      proxy_set_header Accept-Encoding "";
      proxy_set_header Host www.6pm.com;
      proxy_pass http://6pm;
      proxy_redirect http://www.6pm.com http://pokupki-usa.ru/6pm;
      proxy_redirect http://www.6pm.com/login http://pokupki-usa.ru/6pm;
      proxy_redirect https://secure-www.6pm.com https://pokupki-usa.ru/6pm;
  }


The server configuration has undergone some changes.

First, all the logic has moved from the server section directly into locations. It is easy to guess that we decided to create a /6pm directory in which the proxied site will live.

proxy_cookie_path / /6pm/; - moves cookies from the site root into the subdirectory. This is not strictly required, but when many sites are proxied side by side their cookies can intersect and overwrite one another.
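Again an invented cookie to show the effect:

  # donor response:  Set-Cookie: cart=xyz; Path=/
  # after rewriting: Set-Cookie: cart=xyz; Path=/6pm/
  proxy_cookie_path / /6pm/;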

rewrite ^/6pm/(.*) /$1 break; - this bit of magic cuts out of the client request the subdirectory we added, so that the proxy_pass directive sends the correct path to the donor server.
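Step by step, for a hypothetical request:

  # client requests:        /6pm/some/product
  # rewrite strips prefix:  /some/product
  # proxy_pass forwards:    http://www.6pm.com/some/product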

Catching redirects has become a little harder: now every link to the root has to be redirected into /6pm.

Let's look at the transformation logic:

 subs_filter_types text/css text/javascript;

 # Fix direct links
 subs_filter "http://6pm.com" "http://pokupki-usa.ru/6pm" gi;
 subs_filter "http://www.6pm.com" "http://pokupki-usa.ru/6pm" gi;

 # Fix root-relative links
 subs_filter 'src="/' 'src="/6pm/' gi;
 subs_filter 'href="/' 'href="/6pm/' gi;
 subs_filter 'action="/' 'action="/6pm/' gi;

 # Fix some js
 subs_filter "\"/le.cgi" "\"/6pm/le.cgi" gi;
 subs_filter "\"/track.cgi" "\"/6pm/track.cgi" gi;
 subs_filter "\"/onload.cgi" "\"/6pm/onload.cgi" gi;
 subs_filter "\"/karakoram" "\"/6pm/karakoram" gi;
 subs_filter "/tealeaf/tealeaf.cgi" "/6pm/tealeaf/tealeaf.cgi" gi;

 # Css and js paths
 subs_filter "script\('/" "script('/6pm/" gi;
 subs_filter "url\(/" "url(/6pm/" gi;

 # Analytics and CDN fixes, as before
 subs_filter 'UA-8814898-13' 'UA-28370154-3' gi;
 subs_filter "'.6pm.com']," "'pokupki-usa.ru/6pm']," gi;
 subs_filter "http://a..zassets.com" "http://l3.zassets.com" gi;


First, we enabled filtering of CSS and JavaScript files (HTML parsing is enabled by default).
Second, we begin carefully finding and replacing the various kinds of root-relative links. We got a site of medium complexity, in which some of the scripts contain such paths.

The result was: http://pokupki-usa.ru/6pm/

Unfortunately, I did not manage to finish the filter for the subdirectory case. I never got as far as transforming the dynamic requests of the shopping-cart scripts, although I have no doubt the problem is solvable. My knowledge of JavaScript is simply not enough for the necessary debugging; I would be glad of advice on how to get the shopping cart working, which is broken in the example above.

In any case, this is probably the first guide that describes how to proxy a site into a subdirectory.

Source: https://habr.com/ru/post/158393/

