Introduction
The easiest way to develop methods of promoting and growing websites for a specific search engine (PS) is to develop your own PS.
I am not talking about implementing complex algorithms; abstracted solutions are enough. You can simply build a simplified model of an algorithm and work with it. The important thing is to capture all the related parameters: for example, the estimated implementation time, the load on the server, and the running time of the algorithms. By measuring these parameters you can obtain a lot of additional information and use it for your own purposes.
Most novice webmasters and optimizers start from "I want" notions: I want the PS to give great weight to all links, I want non-unique content to be indexed well, and so on. In reality this is not the case, and many consider it unfair. At the same time, they never stop to think what the Internet would look like if all their wishes came true, especially at scale. There is only one way to fight this: go over to the enemy's side, in our case the PS. By developing techniques and algorithms that counteract cheating, spam and low-quality sites, you can not only find the right methodology for developing your resources, but also exploit the vulnerabilities that appear in places where it is genuinely difficult for a search engine to find an algorithmic solution to a problem.
I'll start with myself
I do not claim that this algorithm is unique; I am merely following the approach described above of reconstructing the search engine's various algorithms, not in order to actually build them, but to use their basic ideas.
One interesting thought came to me while developing an algorithm for checking the uniqueness of new texts against existing ones. The basic algorithm was as follows: there is a base of finished texts and a buffer base where new texts are placed. After a text lands in the buffer, it is checked for uniqueness against the main base; if it is completely unique it is moved to the main base, otherwise it is deleted. The problem was that a character-by-character comparison of the full text would not do: the algorithm was not meant to find exact duplicates, but to find fragments of the new text inside the texts of the main base. I stored the found fragments in an array attached to the checked (new) text, as pairs of word-number intervals that correspond to each other in the two texts.
And here I noticed an interesting detail that I would like to share. When the check of a new text is finished, the list of text identifiers from the main base and the match intervals for those identifiers are stored only in the text that was just checked. You might think the same intervals could also be written into the texts of the main base, but then the question arises: why do this? The main base is already built and satisfies all needs and requirements. Only the new text is being tested, and manipulating the finished base would only create extra confusion.
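To make this concrete, here is a minimal Python sketch of such a buffer check. The fragment search (a greedy scan for common word runs of at least MIN_RUN words), the MainBase class and all names are my own simplifications for illustration, not the actual implementation; the point is only that the match intervals are attached to the new text, while the texts already sitting in the main base are never touched.

```python
MIN_RUN = 3  # assumed minimum length (in words) of a matching fragment

def find_matches(new_words, base_words):
    """Return matching fragments as pairs of word-number intervals:
    ((start_new, end_new), (start_base, end_base)), inclusive."""
    matches = []
    i = 0
    while i < len(new_words):
        best = None
        for j in range(len(base_words)):
            k = 0
            while (i + k < len(new_words) and j + k < len(base_words)
                   and new_words[i + k] == base_words[j + k]):
                k += 1
            if k >= MIN_RUN and (best is None or k > best[1]):
                best = (j, k)
        if best:
            j, k = best
            matches.append(((i, i + k - 1), (j, j + k - 1)))
            i += k
        else:
            i += 1
    return matches

class MainBase:
    def __init__(self):
        self.texts = {}          # doc_id -> list of words

    def check_new_text(self, doc_id, text):
        """Check a buffered text against the main base.
        Match intervals are attached only to the new document;
        documents already in the base are never modified."""
        new_words = text.lower().split()
        duplicates = {}          # base doc_id -> list of interval pairs
        for base_id, base_words in self.texts.items():
            found = find_matches(new_words, base_words)
            if found:
                duplicates[base_id] = found
        if not duplicates:       # completely unique -> goes into the base
            self.texts[doc_id] = new_words
        return duplicates        # non-empty means the text is rejected

base = MainBase()
base.check_new_text(1, "proper hair care starts with a mild shampoo")
print(base.check_new_text(2, "every guide says hair care starts with a mild shampoo"))
# {1: [((3, 9), (1, 7))]}
```

In a real system the quadratic inner scan would of course be replaced by an inverted index or shingle hashes; it is written this way only to keep the sketch short.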
Thus, we have the following picture:
Suppose the PS receives 10 new documents. After the uniqueness check it turns out that document 2 has one duplicate and document 8 has four duplicates in the existing main base. Then another 10 documents are added, in which the fourth document is a duplicate of the 8th document of the first ten. Thus the fourth document of the second batch carries an array of five duplicates, while the eighth document of the first batch still has four, even though five duplicates of this text now exist.
This gives us the first metric: the number of duplicates of the text at the time of indexing, stored as a constant.
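Here is a toy model of this metric that reproduces the two batches of ten documents above. The exact-content comparison is a stand-in for the fragment matching described earlier; what matters is that the duplicate count is computed once, at indexing time, and never recalculated for documents already in the base.

```python
class Index:
    def __init__(self, existing):
        self.docs = dict(existing)        # doc_id -> content already in the base
        self.metric1 = {}                 # doc_id -> duplicate count, frozen

    def add(self, doc_id, content):
        # Metric 1: number of duplicates found at the moment of indexing.
        self.metric1[doc_id] = sum(1 for c in self.docs.values() if c == content)
        self.docs[doc_id] = content       # earlier documents are never revisited

# the main base already holds four copies of some text "B"
index = Index({f"old{i}": "B" for i in range(4)})

index.add("doc8_batch1", "B")             # finds 4 duplicates
index.add("doc4_batch2", "B")             # finds 5 (the four old ones plus doc8)

print(index.metric1["doc8_batch1"])       # 4 -- frozen at indexing time
print(index.metric1["doc4_batch2"])       # 5
```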
But that is only the technical side of a text. Besides it there is also the subject: a short tag that indicates what the text is about, so we will call it the thematic tag. Suppose the PS knows of 10 documents with the thematic tag "hair care". In essence these are 10 identical documents, although on the technical side the overlap between them is always less than 100%. Consequently, just as with the number of technical duplicates in the first metric, the PS calculates the number of thematic duplicates for a new document against the existing base.
From this we get the second metric: the number of duplicates of the document's thematic tag.
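The second metric can be modelled the same way. In this sketch, which uses the same toy assumptions, the thematic duplicate count of a new document is simply the number of documents with the same tag that the index already knew about when the document arrived.

```python
from collections import defaultdict

class ThematicIndex:
    def __init__(self):
        self.by_tag = defaultdict(list)   # tag -> doc_ids already indexed
        self.metric2 = {}                 # doc_id -> thematic duplicate count

    def add(self, doc_id, tag):
        # Metric 2: duplicates of the thematic tag at indexing time, frozen.
        self.metric2[doc_id] = len(self.by_tag[tag])
        self.by_tag[tag].append(doc_id)

tindex = ThematicIndex()
for n in range(10):                       # ten documents tagged "hair care"
    tindex.add(f"doc{n}", "hair care")

print(tindex.metric2["doc0"], tindex.metric2["doc9"])   # 0 9
```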
Now let's try to rank the available documents based on the metrics described above. At first glance these two metrics, taken not as the only ones but as the main ones, are quite sufficient. However, a third metric suggests itself here, and it will become visible at the end of the example.
Suppose the PS needs to rank 50 documents carrying the "hair care" tag described above. To do this, we build an algorithm on the two metrics found so far.
We take the first metric as the primary one, since technical uniqueness of the text is preferable, and perform a multidimensional sort of the 50 documents: first by the first metric, then by the second (a sketch of this sort follows the conclusions below). Now let's draw conclusions from the resulting ordering:
- After the sort, the top positions are occupied not only by technically unique documents, but also by documents that now have many duplicates yet carry the oldest dates (their duplicate counts were frozen back when they were still unique).
- The last positions are occupied not only by non-unique documents, but also by unique documents with recent dates of appearance, which fell there because of the sort by the second metric.
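Here is a sketch of that multidimensional sort under the toy model above; the sample documents and their metric values are invented purely to show the mechanics. Both metrics are sorted ascending, with the first metric taking precedence.

```python
docs = [
    # (doc_id, metric1, metric2, indexed)
    ("old_article",  0, 0, "2008-01-10"),  # unique and first on its topic when indexed
    ("fresh_site",   0, 9, "2010-06-01"),  # technically unique, but tenth on the topic
    ("old_rewrite",  2, 1, "2008-03-02"),
    ("scraper_copy", 7, 8, "2010-05-30"),
]

# lexicographic sort: metric 1 first, then metric 2, both ascending
ranked = sorted(docs, key=lambda d: (d[1], d[2]))
for doc_id, m1, m2, indexed in ranked:
    print(f"{doc_id:13} m1={m1} m2={m2} indexed={indexed}")
```

Because both counts are frozen at indexing time, documents indexed early carry small values and stay on top even if many copies of them appear later, while a technically unique but late document sinks on the second metric, which is exactly the picture in the two conclusions above.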
From these conclusions it follows that it is very hard for a freshly baked site to reach the top positions, and a favorite method of increasing visitor numbers suggests itself: churning out large volumes of similar, same-type documents and sites. This, in turn, suggests the following metric, which keeps the number of documents at an acceptable level.
The third metric: the number of documents required to satisfy visitor demand, calculated on the basis of search queries.
Using the third metric, the documents that exceed this number are deleted from the end of the sorted list described above. This, at the very least, explains why fresh sites on very popular and competitive topics drop out of the index so quickly.
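Finally, a sketch of the cut-off by the third metric. The demand formula here (one retained document per hundred daily queries) is purely a placeholder of my own; the point is only that the tail of the ranked list is dropped.

```python
# ranked doc ids as produced by the sort above (best first)
ranked = ["old_article", "fresh_site", "old_rewrite", "scraper_copy"]

QUERIES_PER_DAY = 180          # assumed demand for "hair care" queries
DOCS_PER_100_QUERIES = 1       # assumed: one retained document per 100 daily queries

limit = max(1, QUERIES_PER_DAY * DOCS_PER_100_QUERIES // 100)
kept, dropped = ranked[:limit], ranked[limit:]
print("kept:", kept)           # ['old_article']
print("dropped:", dropped)     # everything past the demand cut-off, fresh sites included
```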
Results
Of course, the metrics described here are basic but not the only ones, as I already mentioned. If you analyze the resulting algorithm carefully, you can add another dozen or so secondary metrics for a more accurate analysis, but that, as they say, is beyond the scope of this article.