
I decided to visit a site dedicated to the internals of apache2, which I had not visited for several years. All I saw was a message from the hoster saying that the
site had been temporarily blocked. I searched Yandex: there was nothing in the search engine's cache, so the site must have been down for quite a while. Via SMS I extended the hosting for one day and saw that there had been no updates since December 2007. I tried to contact the author at the e-mail address listed in the domain contacts. I received no answer...
It should be noted that I was lucky with the site I was saving: it is simple and logical. Copying articles and pictures one by one is a long and boring process, so I decided to copy everything at once. What we have:
- The source site is completely static.
- ModRewrite sends every request to a single php script.
- The script looks for the page in the cache; if it is not cached, it requests it from the source site.
- I decided to store everything in sqlite.
For data storage, I chose between the following options:
- Keep the files as is: if the source address of a page is /topic/123.html, create a topic directory and put the file there. This is what wget does, for example. I did not like this approach.
- Take an md5 hash of the URI and dump a pile of files like data/<hash>.db into a data directory. Finding anything in that folder afterwards is impossible. I did not like this either.
- Take the md5 hash and store everything in an sqlite database. Essentially this is the same as the previous option, but with a single file. I also considered storing the data in mysql, but it is not portable at all and is cumbersome. With sqlite you copy a few files to a new location and the site is deployed and ready to go.
As for ModRewrite, I did not invent anything new:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php
# Redirect any other host name to the canonical domain:
RewriteCond %{HTTP_HOST} !^apache2dev\.ru$
RewriteRule ^ http://apache2dev.ru%{REQUEST_URI} [R=301,L]
How it all works - index.php
The source code can be found here:
apache2dev.ru/index.phps
1. Take $_SERVER['REQUEST_URI'] and, just in case, prepend the address of the source site.
2. Compute the md5 hash.
3. Check whether the requested page is already in the local cache.
4. If not, request it from the source site and parse the response headers. Only two of them are interesting: Content-Type and Last-Modified.
5. Save the headers and the response in the cache and return the result to the client, lightly post-processing HTML files (the criterion is Content-Type = text/html).
6. Add Expires for one day.
I decided for myself that I would keep the data in the cache in its original form.
Create a database:
CREATE TABLE storage (loc TEXT PRIMARY KEY, heads TEXT, fdata TEXT, location TEXT);
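For illustration, here is a minimal sketch of that flow, under a few assumptions the article does not spell out: loc holds the md5 of the requested address, the database file is called storage.db, access goes through PDO's sqlite driver, and source-site.example stands in for the original site. The real index.php is the authoritative version.
<?php
// Sketch of the flow described above: look the page up in the local sqlite
// cache, fetch it from the source site on a miss, store it in its original
// form, then replay the saved headers and body to the client.

$source = 'http://source-site.example';          // placeholder for the original site
$uri    = $source . $_SERVER['REQUEST_URI'];     // step 1
$key    = md5($uri);                             // step 2

$db   = new PDO('sqlite:storage.db');            // database file name is an assumption
$stmt = $db->prepare('SELECT heads, fdata FROM storage WHERE loc = ?');
$stmt->execute(array($key));
$row = $stmt->fetch(PDO::FETCH_ASSOC);           // step 3

if ($row === false) {
    // Step 4: cache miss, request the page from the source site.
    $body  = file_get_contents($uri);
    $heads = array();
    foreach ($http_response_header as $line) {   // filled in by file_get_contents
        if (stripos($line, 'Content-Type:') === 0 ||
            stripos($line, 'Last-Modified:') === 0) {
            $heads[] = $line;                    // keep only the two interesting headers
        }
    }
    // Step 5: save headers and body in their original form.
    $ins = $db->prepare('INSERT INTO storage (loc, heads, fdata) VALUES (?, ?, ?)');
    $ins->execute(array($key, implode("\n", $heads), $body));
    $row = array('heads' => implode("\n", $heads), 'fdata' => $body);
}

// Replay the saved headers and add Expires for a day (step 6).
foreach (explode("\n", $row['heads']) as $h) {
    if ($h !== '') header($h);
}
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 86400) . ' GMT');

$out = $row['fdata'];
if (strpos($row['heads'], 'text/html') !== false) {
    // This is where the HTML post-processing would go (preg_replace: trim ads,
    // make links relative, add the "this is a copy" notice).
}
echo $out;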
Now, as soon as I opened my site and looked at a page with pictures, the first portion of data was saved into the local cache.
Then I run
wget -r apache2dev.ru
and let it work for a while. wget with the -r option tries to download the whole site recursively. It does not do this very well: for example, it knows nothing about javascript. So afterwards I open the site and click through the pages myself to pick up what wget missed.
Finally, I knocked together one more quick script that shows what is currently in the local database. Skimming through the list, I deleted a couple of entries.

Source:
apache2dev.ru/list.phps
Then I switched the site into operation mode. The logic is now this: if the data is not in the local cache, nothing is requested from the source site; the script simply returns a
404 error.
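For reference, a minimal sketch of what such a list script could look like, with the same assumptions as the sketch above (storage.db, PDO sqlite driver); the real list.php may be quite different:
<?php
// Sketch: show what is currently sitting in the local cache.
$db = new PDO('sqlite:storage.db');   // same database file as in the sketch above

foreach ($db->query('SELECT loc, length(fdata) AS size FROM storage') as $row) {
    // loc is the cache key; size is the length of the stored body.
    echo $row['loc'] . "\t" . $row['size'] . "\n";
}

// Deleting an unwanted entry would be a one-liner, e.g.:
// $db->prepare('DELETE FROM storage WHERE loc = ?')->execute(array($unwanted_key));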
Measuring speed
I measured with
ab: a 70 KB picture and an arbitrary html page. The only difference between the two cases is the extra preg_replace for HTML (I allowed myself to cut a bit of advertising, replace absolute links with relative ones and insert a notice that the page is a copy).
# ab -c 10 -n 1000 http://apache2dev.ru/images/ff_adds/validator.gif
Requests per second: 417.32 [#/sec] (mean)
# ab -c 10 -n 1000 http://apache2dev.ru/2006/01/28/ustanovka-apache-20-2
Requests per second: 29.66 [#/sec] (mean)
The result suits me perfectly.
Pros and cons of this approach:
- "Save" the site need only once, and then you can forget about it. Wordpress I'm somehow afraid: you need to follow it, update it, etc.
- The site fit in 7 files, convenient to copy; sqlite base - 20mb
- Minimum system requirements
- The most important disadvantage (in my opinion) is the fact that not every site can be saved in a similar way. There are sites walking on the links of which you can never stop.
Possible improvements:
- The database could be stored in a processed form (preg_replace, etc.)
- The data could also be stored in compressed form. Check the client's Accept-Encoding: if it includes gzip, send the data as is, otherwise unpack it first.
- If you need high performance, you can ask nginx to cache data produced by the php script
- Pre-process the page URIs. For example, strip all the ?from=top10 and ?from=ap2.2 parameters: right now the same pages (under different addresses) are saved several times.
- Pass max_redirects = 0 to file_get_contents, handle the result and save the Location header. At the moment a user requesting /get.php?=/download/123.pdf immediately receives the data, as if the pdf file were located at that address: file_get_contents (in our index.php), having seen the redirect, silently follows it on its own. Properly, the user should receive an HTTP/1.1 302 Found response from the server and then make a second request directly to the file itself (see the sketch after this list).
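A sketch of that last improvement, using the standard HTTP stream context options of file_get_contents (follow_location / max_redirects) to stop redirects from being followed; the host name source-site.example and the way this would be wired into index.php are assumptions:
<?php
// Sketch: request a page without following redirects, so a 302 from the
// source site can be stored and replayed instead of being silently resolved.
$ctx = stream_context_create(array('http' => array(
    'follow_location' => 0,     // do not follow redirects
    'ignore_errors'   => true,  // still return a body for non-2xx responses
)));

$url  = 'http://source-site.example/get.php?=/download/123.pdf'; // host is a placeholder
$body = file_get_contents($url, false, $ctx);

$status   = $http_response_header[0];        // e.g. "HTTP/1.1 302 Found"
$location = null;
foreach ($http_response_header as $line) {
    if (stripos($line, 'Location:') === 0) {
        $location = trim(substr($line, strlen('Location:')));
    }
}

if ($location !== null) {
    // Save $location into the table's location column, and when serving the
    // cached copy, reply with the same redirect instead of the file body.
    header($status);
    header('Location: ' . $location);
    exit;
}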
By the way, the site in question (it is about apache) is here:
apache2dev.ru
The complete set of files:
apache2dev.ru/catcher.tgz