
Working with URLs and storing them

Well, here is one of the tastiest parts of the database: it stores billions of different links and provides access to them in any order.

The first obvious observation is that URLs naturally group by site, i.e. all links within one site can be stored together for speed. So I split off the site part of the URL and began storing the list of sites separately - there are now 600 thousand of them, and a separate table of the kind described earlier handles them easily. An AVL tree with the CRCs of all known sites resides in memory, so when checking whether a URL exists, the first step is to look up the site ID corresponding to it, if the site is already in the database.
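
As a rough illustration (not the original code), the in-memory site lookup might look like this; std::map stands in for the AVL tree, and get_site_id is a hypothetical helper:

```cpp
// Minimal sketch of the site index: CRC of the site name -> site ID.
// The article describes an AVL tree; std::map (a red-black tree) stands in here.
#include <cstdint>
#include <map>
#include <string>

// Plain bitwise CRC-32 (polynomial 0xEDB88320), enough for the sketch.
uint32_t crc32(const std::string& s) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : s) {
        crc ^= c;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

std::map<uint32_t, uint32_t> site_index;  // CRC(site name) -> site ID
uint32_t next_site_id = 1;

// Return the site ID for a host name, registering the host if it is new.
uint32_t get_site_id(const std::string& host) {
    uint32_t key = crc32(host);
    auto it = site_index.find(key);
    if (it != site_index.end()) return it->second;  // site already known
    site_index[key] = next_site_id;                 // new site: remember its ID
    return next_site_id++;
}
```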

The rest of the link - everything after the site name - I cut off and compute a CRC for it; let's call it Hash. Thus any link is more or less uniquely identified by the pair (site ID, Hash). Within each site the links can be kept sorted by Hash, which makes checking whether a link already exists easy: scan that site's list of links until we either meet the desired Hash or meet a larger one, which means the link is not there. The speedup is not huge, but it is still about 2x on average.
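
A sketch of that membership check, under the assumption that a site's link hashes are available as a sorted array:

```cpp
// Scan a sorted list of hashes; stop as soon as a larger hash appears,
// so on average only about half the list is read.
#include <cstdint>
#include <vector>

bool link_exists(const std::vector<uint32_t>& sorted_hashes, uint32_t hash) {
    for (uint32_t h : sorted_hashes) {
        if (h == hash) return true;   // found the link
        if (h > hash)  return false;  // passed the place where it would be
    }
    return false;                     // ran off the end of the list
}
```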

I should mention that I assign each link its own ID so that it takes up less space in the index - 4 bytes instead of 2 * 4. A separate table maps ID to (site ID, Hash) for quick lookups.
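
A sketch of that scheme (names are illustrative, not the author's): a link costs 4 bytes in the index, and a side table resolves the ID back to the 8-byte pair when needed.

```cpp
#include <cstdint>
#include <vector>

struct LinkKey {
    uint32_t site_id;  // which site the link belongs to
    uint32_t hash;     // CRC of the part after the site name
};                     // 8 bytes

std::vector<LinkKey> id_to_key;  // link ID -> (site ID, Hash)

uint32_t register_link(uint32_t site_id, uint32_t hash) {
    id_to_key.push_back({site_id, hash});
    return static_cast<uint32_t>(id_to_key.size() - 1);  // the new 4-byte link ID
}
```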
Now, a little about how to store lists of millions of links, kept sorted, for 600 thousand sites. For this another type of table is implemented - one with a two-dimensional index: first, ID1 gives access to a list of records sorted by ID2, and importantly ID2 does not have to run from 1 to N, as it does in most cases. I also used this table for storing the inverted index, though a more efficient solution now works there. The table itself is implemented so that adding a new ID2 to a list keeps the list sorted.
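
A minimal sketch of that two-dimensional index, with an in-memory map of vectors standing in for the on-disk table:

```cpp
// ID1 selects a list of records kept sorted by ID2; ID2 values are arbitrary,
// not a dense 1..N range. Insertion preserves the ordering.
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

std::map<uint32_t, std::vector<uint32_t>> table;  // ID1 -> sorted list of ID2

void insert(uint32_t id1, uint32_t id2) {
    std::vector<uint32_t>& list = table[id1];
    auto pos = std::lower_bound(list.begin(), list.end(), id2);
    if (pos == list.end() || *pos != id2)  // skip duplicates
        list.insert(pos, id2);             // keeps the list sorted
}
```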

All URL data is divided into 64 parts by ID1: table K holds the records of sites with site ID % 64 = K, and each table is divided into segments allocated per specific ID1. When you need the list of entries for a given ID1, they all lie contiguously on disk, so the position to seek to and where to start buffered reading is known immediately. We read exactly until we meet the needed Hash.
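
A sketch of that lookup path, under the assumption that a segment directory (offset and record count per site) has already been loaded for each table file:

```cpp
#include <cstdint>
#include <cstdio>

struct Segment { long offset; long count; };  // where one site's records lie in the file

// Which of the 64 tables a site's data lives in.
int table_number(uint32_t site_id) { return site_id % 64; }

// Seek to the site's contiguous run of hashes and read until the wanted hash
// (or a larger one) shows up.
bool segment_contains(std::FILE* table_file, const Segment& seg, uint32_t hash) {
    std::fseek(table_file, seg.offset, SEEK_SET);   // jump straight to the run
    for (long i = 0; i < seg.count; ++i) {
        uint32_t h;
        if (std::fread(&h, sizeof(h), 1, table_file) != 1) break;
        if (h == hash) return true;                 // found it
        if (h > hash) break;                        // list is sorted, so it is absent
    }
    return false;
}
```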

Inserting into such a table is not very fast; on the other hand, a large cache and the small size of a single record allow a batch to be inserted fairly quickly. Updates are accumulated into a batch - currently about 32000 links per table - and then inserted in one pass: a temporary copy of the table is made, into which the data from the cache and from the old table are merged.
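
A sketch of that one-pass merge, with in-memory vectors standing in for the on-disk tables:

```cpp
// The accumulated batch is sorted, then merged with the already-sorted old
// table into a "temporary copy" that replaces the original.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

std::vector<uint32_t> merge_batch(const std::vector<uint32_t>& old_table,
                                  std::vector<uint32_t> batch) {
    std::sort(batch.begin(), batch.end());          // batch arrives unsorted
    std::vector<uint32_t> merged;
    merged.reserve(old_table.size() + batch.size());
    std::merge(old_table.begin(), old_table.end(),  // old table is already sorted
               batch.begin(), batch.end(),
               std::back_inserter(merged));
    return merged;                                  // new copy of the table
}
```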

URLs with www and without are treated as the same: whichever of them was linked first is considered the main one, it goes into the database, and all other variants are glued to it. This avoids mindlessly stripping (or not stripping) www - a site may simply not work without www - but it does not completely solve all the problems caused by the fact that the www address and the bare address can host different sites.
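
A sketch of that gluing rule (my illustration, not the original code): the key is the host with "www." stripped, and the value records whichever spelling was seen first.

```cpp
#include <map>
#include <string>

std::map<std::string, std::string> main_host;  // host without "www." -> main spelling

std::string canonical_host(const std::string& host) {
    std::string key = host;
    if (key.compare(0, 4, "www.") == 0) key.erase(0, 4);  // strip "www." for the key only
    auto it = main_host.find(key);
    if (it != main_host.end()) return it->second;  // glue to the first-seen spelling
    main_host[key] = host;                         // this spelling becomes the main one
    return host;
}
```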

The most painstaking work was resolving relative links. An example:
On the page "site.ru/index" there is a link "./"; it should resolve to "site.ru/". But add a trailing slash to the base page, "site.ru/index/", and even though it may be the very same page on the site, the resolved link becomes "site.ru/index/". So you cannot trim the trailing slash, nor the query arguments, nor the name of the executable file.

In general, I split a link into parts: protocol, site, path, file name, arguments, and the named anchor (everything after #).
From the two links split this way - the base page and the relative link - I assemble a new one, walking through the list of parts and substituting the missing elements into the result where needed.
Then one must not forget constructions of the form "./././././", which have to be removed, as well as "../../../", and then the "#" part is removed and the "?" arguments are removed or substituted.
Overall the process is not long, but it is very tedious in terms of inventing tests and handling all possible combinations. But once it was written, everything works as it should.
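
For illustration only, here is a rough sketch of this kind of resolution, not the author's parser: it assumes an absolute base like "http://site.ru/index", splits off the site and path, strips the "#" fragment, leaves the "?" arguments as part of the path, and collapses "./" and "../" segments.

```cpp
#include <string>
#include <vector>

// Collapse "." and ".." segments in a path that starts with '/'.
std::string normalize_path(const std::string& path) {
    std::vector<std::string> parts;
    std::string seg;
    for (size_t i = 0; i <= path.size(); ++i) {
        if (i == path.size() || path[i] == '/') {
            if (seg == "..") { if (!parts.empty()) parts.pop_back(); }
            else if (!seg.empty() && seg != ".") parts.push_back(seg);
            seg.clear();
        } else seg += path[i];
    }
    std::string out;
    for (const std::string& p : parts) out += "/" + p;
    bool trailing_slash = !path.empty() && path.back() == '/';
    if (out.empty()) out = "/";          // everything collapsed away
    else if (trailing_slash) out += "/"; // keep the trailing slash, as discussed above
    return out;
}

// Resolve a relative reference against an absolute base (very simplified).
std::string resolve(const std::string& base, const std::string& link) {
    if (link.find("://") != std::string::npos) return link;  // already absolute
    size_t scheme = base.find("://") + 3;
    size_t slash = base.find('/', scheme);
    std::string site = (slash == std::string::npos) ? base : base.substr(0, slash);
    std::string base_path = (slash == std::string::npos) ? "/" : base.substr(slash);

    std::string rest = link;                                 // drop the "#..." part
    size_t hash_pos = rest.find('#');
    if (hash_pos != std::string::npos) rest.erase(hash_pos);

    if (!rest.empty() && rest[0] == '/') return site + normalize_path(rest);
    // Relative path: substitute the missing directory from the base page.
    std::string dir = base_path.substr(0, base_path.rfind('/') + 1);
    return site + normalize_path(dir + rest);
}
```

With this sketch, resolve("http://site.ru/index", "./") gives "http://site.ru/", while resolve("http://site.ru/index/", "./") gives "http://site.ru/index/", matching the example above.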

The full table of contents and the list of my articles about the search engine will be kept updated here: http://habrahabr.ru/blogs/search_engines/123671/

Source: https://habr.com/ru/post/123955/

