
Googlebot's strange behavior

Recently, when logging into Google Webmaster Tools, I began to notice a growing number of “Transition failed” errors reported for my site, which in my case were caused by so-called cyclic redirection. That is simply how the site's "engine" works. Yet the error never reproduced in any browser, and a "manual" request via telnet showed no anomalies either. Nevertheless, the errors in GWT kept reappearing, pointing at the same URLs of my site and annoying me by their mere existence. It took some hard thinking, but I finally got to the bottom of the problem.

So, the site has a tag cloud that generates links of the form /blog/tag/tag_name/. Since tags are created by users, they can contain almost any UTF-8 characters. In addition, other sites, social networks, and forums periodically link to these per-tag pages. Because URL-encoding practices differ, those links often lead to non-canonical pages.

Here is an example. There is a “Harry Potter” tag, which in encoded form can appear either as “Harry+Potter” or as “Harry%20Potter”. Links with both variants can point to the site. But once REQUEST_URI is decoded, the site's engine sees these links as absolutely identical, which produces duplicate pages, something search engines dislike very much. To fight these duplicates, when a page loads I decode the URL, then encode it back with the PHP function urlencode() and compare it with the originally requested string. If they do not match, I return a 301 status code to the browser and send it to the correct URL. This way, the duplicate pages are “glued together” in the eyes of the search engine.
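The engine described above is PHP, but the canonicalization logic can be sketched in Python. The function name canonical_redirect is mine, and urllib.parse.quote differs slightly from PHP's urlencode() (spaces become %20 rather than +), so treat this as an approximation of the idea, not the site's actual code:

```python
from urllib.parse import quote, unquote_plus

def canonical_redirect(request_uri: str):
    """Return the canonical URL to 301-redirect to, or None if the
    requested URI is already canonical (sketch of the logic above)."""
    # Decode any percent-escapes (and '+' as space, as PHP's urldecode does),
    # then re-encode consistently: "Harry+Potter", "Harry Potter" and
    # "Harry%20Potter" all normalize to the same string.
    canonical = quote(unquote_plus(request_uri), safe="/")
    return canonical if canonical != request_uri else None

print(canonical_redirect("/blog/tag/Harry+Potter/"))    # -> /blog/tag/Harry%20Potter/
print(canonical_redirect("/blog/tag/Harry%20Potter/"))  # -> None (already canonical)
```

In a real handler, a non-None result would trigger the 301 response with that value in the Location header.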

It all seems simple enough. So why did Googlebot get stuck on some of these links? Fortunately, GWT has a special feature, “Fetch as Googlebot”, which lets you look at the site through the eyes of the search bot. Let's try it. For this example I will take another tag: “guns'n'roses”. So, we tell the bot to load the page /blog/tag/guns'n'roses/. The bot replies that everything is fine, and the response is received:
HTTP/1.1 301 Moved Permanently ... Location: /blog/tag/guns%27n%27roses/ 

That is correct: single quotes are encoded as %27 according to RFC 3986. Now let's point the bot at the URL /blog/tag/guns%27n%27roses/ (as if we were a regular browser). In response we get:
 HTTP/1.1 301 Moved Permanently ... Location: /blog/tag/guns%27n%27roses/ 

along with a message that seems perfectly fair at first glance: “A redirect to itself was found on the page. This can lead to an endless redirect loop.”
But in the server logs we see that the actual request was, once again, “GET /blog/tag/guns'n'roses/ HTTP/1.1” instead of the clearly and explicitly specified “GET /blog/tag/guns%27n%27roses/ HTTP/1.1”. It turns out that Googlebot decided to interpret the value of the Location header in its own way, disregarding the RFC and torturing both itself and my site with pointless requests.
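The observed behavior can be modeled with a toy client/server sketch (both functions are hypothetical, written purely for illustration): the server 301-redirects non-canonical URLs as described earlier, while the bot, contrary to RFC 3986, percent-decodes the Location header before issuing the next request, so it keeps asking for the same decoded URL:

```python
from urllib.parse import quote, unquote

def server(request_uri):
    """Toy model of the site's canonicalization: 301 to the re-encoded URL."""
    canonical = quote(unquote(request_uri), safe="/")
    return ("301", canonical) if canonical != request_uri else ("200", request_uri)

def buggy_bot(url, max_hops=3):
    """Toy model of the observed Googlebot behavior: it decodes the
    Location header before issuing the next request."""
    hops = [url]
    for _ in range(max_hops):
        status, location = server(url)
        if status != "301":
            break
        url = unquote(location)  # the bug: %27 turns back into '
        hops.append(url)
    return hops

print(buggy_bot("/blog/tag/guns'n'roses/"))  # same decoded URL over and over
```

A compliant client would request the Location value verbatim, receive a 200 on the first hop, and never see a loop.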

Further googling revealed that Google's search bot has a particular fondness for the following characters:
, @ ~ * ( ) ! $ '
and does not convert them to the corresponding codes
%2C %40 %7E %2A %28 %29 %21 %24 %27
which lets it fray the nerves and waste the time of already long-suffering webmasters :)
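The list above can be checked against an RFC 3986-style encoder, for instance Python's urllib.parse.quote. One caveat: '~' belongs to the RFC 3986 "unreserved" set, so quote leaves it as-is, whereas PHP's urlencode() turns it into %7E:

```python
from urllib.parse import quote

# Percent-encode each character Googlebot reportedly leaves untouched.
# quote follows RFC 3986, so '~' (unreserved) stays literal.
for ch in ",@~*()!$'":
    print(repr(ch), "->", quote(ch, safe=""))
```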

Source: https://habr.com/ru/post/151517/

