Continuing the topic I started earlier about building your own search engine.
So, there are several major tasks that a search system has to solve. Let's start with the fact that an individual page has to be fetched and saved.
There are several ways to do this, depending on which processing methods you choose later on.
Obviously, you need a queue of pages to be downloaded from the web, if only so that you can look at them later on long winter evenings when there is nothing better to do. I prefer to keep a queue of sites and their main pages, plus a local mini-queue of what I am processing at the moment. The reason is simple: the list of all the pages I would like to download in just one month can easily exceed the size of my rather large hard drive :), so I store only what is really needed, namely the sites themselves (there are currently 600 thousand of them) together with their priorities and download times.
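To make that concrete, here is a minimal sketch of such a two-level queue in Perl; the sites, field names and priorities are made up purely for illustration:

use strict;
use warnings;

# Main queue: one entry per site with its priority and next allowed download time.
my %site_queue = (
    'site.ru'    => { priority => 10, next_fetch => time() },
    'example.ru' => { priority => 3,  next_fetch => time() + 3600 },
);

# Local mini-queue: pages of the site I am processing right now.
my @local_queue = ('http://site.ru/', 'http://site.ru/news.html');

# Pick the highest-priority site whose download time has already come.
my ($current_site) =
    sort { $site_queue{$b}{priority} <=> $site_queue{$a}{priority} }
    grep { $site_queue{$_}{next_fetch} <= time() }
    keys %site_queue;

print "next site to crawl: $current_site\n" if defined $current_site;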
When a regular page is loaded, every link on it should either go into the local queue, if it stays within the site I am currently processing, or into the main list of sites that I will have to come back to sooner or later.
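A rough sketch of that sorting, using the HTML::LinkExtor and URI modules (the function name and the way the two lists are returned are just for illustration):

use strict;
use warnings;
use HTML::LinkExtor;
use URI;

sub sort_links {
    my ($html, $base) = @_;    # $base is the URL of the page that was just downloaded
    my (@found, @local, @external);

    # When given a base URL, HTML::LinkExtor hands the callback absolute links.
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            push @found, $attr{href} if $tag eq 'a' && $attr{href};
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;

    my $site = URI->new($base)->host;
    for my $link (@found) {
        my $uri = URI->new("$link");
        next unless $uri->scheme && $uri->scheme =~ /^https?$/;
        if ($uri->host eq $site) { push @local,    $uri->as_string }  # same site -> local queue
        else                     { push @external, $uri->host      }  # other site -> main site list
    }
    return (\@local, \@external);
}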
How many pages should you take from one site at a time? Personally, I prefer no more than 100 thousand, although from time to time I drop this limit to just 1000 pages. Besides, there are not that many sites with more pages than that anyway.
Now let's take a closer look:
If we fetch one page at a time, all pages strictly in sequence, how many pages will we manage to process in, say, an hour?
- the time to fetch a page consists of:
· The time we wait for the DNS response (which, as practice shows, is far from negligible). DNS maps the site name "site.ru" to the IP address of the server it lives on, and this is not the easiest task, given that sites tend to move, packet routing paths change, and so on. In short, the DNS server keeps a table of addresses, and every time we knock on it to find out which address to go to for the page.
· The time to connect and send the request (quick, if you have at least an average channel)
· The time to receive the actual response, that is, the page itself
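For reference, a single sequential fetch with LWP (more on ready-made modules below) looks roughly like this; the URL, timeout and user-agent string are just examples:

use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(time);

my $ua = LWP::UserAgent->new(timeout => 30, agent => 'MyCrawler/0.1');

my $start = time();
my $resp  = $ua->get('http://site.ru/');   # DNS lookup + connect + request + response
my $took  = time() - $start;

if ($resp->is_success) {
    printf "got %d bytes in %.2f s\n", length($resp->content), $took;
} else {
    printf "failed: %s (%.2f s)\n", $resp->status_line, $took;
}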
That is why Yandex, according to rumor, once ran into this very first problem: if you fetch a really large number of pages, the provider's DNS simply cannot cope. In my experience the delay reached up to 10 seconds per address, and on top of that the answer has to travel back and forth over the network, and I am far from the provider's only customer. Note that when you request 1000 pages in a row from one site, you will be pulling the same name from the provider 1000 times.
With modern hardware it is quite easy to set up your own local caching DNS server and load it with this work instead of the provider's; then the provider will also pass your packets along faster. Alternatively, you can go further and build the cache right into your page loader, if you are writing at a sufficiently low level.
If you use ready-made solutions such as the LWP or HTTP modules for Perl, then a local DNS server is the optimal choice.
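And if you do go the low-level route, the crudest possible in-process cache is simply a hash in front of the resolver; a sketch without any TTL handling:

use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

my %dns_cache;   # hostname -> packed IPv4 address

sub resolve_cached {
    my ($host) = @_;
    # Hit the real resolver only on the first request for a name;
    # every later request for the same site is answered from the hash.
    $dns_cache{$host} //= inet_aton($host);
    return $dns_cache{$host};
}

my $packed = resolve_cached('site.ru');
print 'site.ru -> ', inet_ntoa($packed), "\n" if $packed;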
Now suppose a response takes 1 to 10 seconds on average (there are fast servers, and there are very slow ones). Then you receive 6 to 60 pages per minute, 360 to 3600 per hour, and roughly 8,000 to 60,000 per day (deliberately rounding down for all sorts of delays: in reality, requesting one page at a time without a local DNS on a 100 Mbit/s channel, you will get about 10,000 pages per day, assuming the sites are different and you are not just hitting one very fast one).
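The same estimate as a trivial calculation, before any rounding down:

# rough sequential-crawl estimate for the average response times above
for my $sec_per_page (1, 10) {
    my $per_hour = 3600 / $sec_per_page;
    printf "%2d s/page -> %4d pages/hour, %5d pages/day\n",
        $sec_per_page, $per_hour, $per_hour * 24;
}
# prints 3600/hour and 86400/day at 1 s, 360/hour and 8640/day at 10 s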
And that is without even counting the time to process and save the pages; the result, frankly, is meager.
Fine, I said, and started making 128 requests in parallel at once. Everything flew beautifully, peaking at 120 thousand pages per hour, until raw logs started arriving from the admins of the servers I was hammering, complaining about DDoS attacks; indeed, not every hosting can take 5000 requests in 5 minutes.
Everything was solved by downloading from 8 to 16 different sites at the same time, with no more than 2 or 3 pages in parallel per site. That gave somewhere around 20 to 30 thousand pages per hour, which suited me fine. I should note that at night the numbers grow considerably.
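I will not post the real loader here, but a sketch of this scheme using Parallel::ForkManager (the module choice, the stub queues and the two-page limit are just for illustration; pages within one site are fetched one after another for simplicity):

use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Stub queues, just to keep the sketch self-contained; the real version reads
# from the site queue and local page queues described earlier.
my %local_queue = (
    'site.ru'    => ['http://site.ru/', 'http://site.ru/news.html'],
    'example.ru' => ['http://example.ru/'],
);
my @sites = keys %local_queue;

my $pm = Parallel::ForkManager->new(16);   # at most 16 sites handled at the same time

for my $site (@sites) {
    $pm->start and next;                   # fork a separate worker process for this site

    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'MyCrawler/0.1');
    # Only a couple of pages per site in one go, so no single host gets hammered.
    for my $url (@{ $local_queue{$site} }[0 .. 1]) {
        next unless defined $url;
        my $resp = $ua->get($url);
        print "$site: fetched $url\n" if $resp->is_success;   # the real code saves the page
    }

    $pm->finish;                           # this worker is done
}
$pm->wait_all_children;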
The full table of contents and list of my articles on the search engine will be kept updated here:
http://habrahabr.ru/blogs/search_engines/123671/