
Googlebot now finds links in JavaScript.

I must have missed something. I always thought Google did not see links inside JavaScript code, and that even if it did, such links had no SEO value: they are not taken into account when calculating PageRank and are not used for indexing. In other words, if a page can be reached only as a result of script execution and there are no direct links to it, it will not be indexed at all. So what now? Is that information out of date?

Here is my story.


I have a new site. It is only a month old, it has few pages, and there are almost no inbound links, so it is fairly easy to trace how Google indexes it. The site runs, in test mode, a service that checks web pages for hidden malicious inclusions (invisible spammy links, iframes, scripts, redirects). The service makes heavy use of AJAX.


Recently, while checking my visit statistics, I noticed that someone had come to my site from Google via the query "types of hidden spam". I decided to see where my site appears in the results for this query. It turned out to be in first place, out of a total of more than 14 million results. Nice, but a bit unexpected for a brand-new site.
Mysterious result

I was even more puzzled by the "page" the result pointed to: unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl= . I use this URL (or rather, this fragment of a URL) inside a script to dynamically build personalized links for display in reports. Not a single page on my site (or on anyone else's) links to unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=

At the same time, the site does have a static page with similar text about common types of hidden links, dedicated to detecting infected WordPress blogs, and other pages link to it directly.

So why did Google prefer an incomplete dynamic URL hidden inside JavaScript, with no inbound links, over a full page with a static URL and direct inbound links? Maybe that page had not been indexed for some reason? I entered the query site:unmaskparasites.com. The site is quite small, and all its pages, including that one, were indexed. Moreover, the query revealed pages that should not have been indexed at all, since they are used only inside the service's AJAX requests (unmaskparasites.com/results/ and unmaskparasites.com/token/ in the screenshot).

Indexed AJAX URLs

What the heck! How did Google find out about them?!

After digging around in the source code of my service and in the pages cached by Google, I can say with near certainty that Google parses JavaScript, executes it, finds links in it, and uses them for indexing.

Evidence



Links in AJAX requests.

unmaskparasites.com/results and unmaskparasites.com/token are service URLs that are used exclusively in AJAX (JavaScript) requests; they appear nowhere else. This is how my scripts use them:
$.get('/token/', function(txt){ ...

and
$.post("/results/", { ...

As you can see, a simple regular expression won't do here. The links are relative, and you need to understand what the code does in order to distinguish the strings containing such links from all the other strings.
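For context, here is a minimal self-contained sketch of what such calls typically look like in jQuery. The callback bodies and the request parameters are my assumptions, not the actual service code:

  // Hypothetical sketch: fetch a token, then request the results.
  // The "token" parameter and callback bodies are assumptions.
  $.get('/token/', function(txt){
      var token = txt;  // the service presumably returns a token as plain text
      $.post('/results/', { token: token }, function(html){
          $('#results').html(html);  // insert the returned report into the page
      });
  });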

Links in strings with HTML code.

The URL unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl= is likewise found only in the script, this time inside a string in which HTML code is prepared for insertion into the right place on the page:

...'<a href="/security-tools/find-hidden-links/site/?siteUrl=' + escape($("#id_siteUrl").val()) + '">'...

If the spider executes this code, the resulting string will be:
  '<a href="/security-tools/find-hidden-links/site/?siteUrl=">'
because the spider does not fill out the form, so the value of the id_siteUrl field is empty. The result is a URL identical to the one Google somehow indexed, once again converted from relative to absolute form.
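You can reproduce the spider's point of view by evaluating the same concatenation with an empty field value; a quick sketch:

  // With an unfilled form, $("#id_siteUrl").val() returns "",
  // so the dynamic part of the string simply disappears:
  var siteUrl = '';  // what the spider "sees" in the empty form field
  var html = '<a href="/security-tools/find-hidden-links/site/?siteUrl=' + escape(siteUrl) + '">';
  // html === '<a href="/security-tools/find-hidden-links/site/?siteUrl=">'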

Googlebot's JavaScript is not the same as the one in our browsers.

One gets the impression that Googlebot executes only that part of the script code that is needed to find links and ignores everything else.

Analyzing the page unmaskparasites.com/results cached by Google, it is clearly visible that it was fetched via a GET request with empty parameters. Yet if my code were actually executed: 1) with empty parameters the call itself could never be reached, since validation would fail; 2) the request would be a POST, not a GET.

We can assume that Googlebot is not equipped with a full JavaScript engine. It can only parse the code, find links, and execute a limited set of operations (string concatenation, for example).
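Purely as a guess, such a truncated engine would not even need to run anything: it could simply glue adjacent string literals together and collect whatever looks like a URL. A toy sketch of that idea (entirely my speculation, not Google's actual mechanism):

  // Toy link extractor: joins string literals connected by "+",
  // treating the non-literal expression between them as empty,
  // then collects everything that looks like a relative URL.
  function extractLinks(source) {
      var joined = source.replace(/'([^']*)'\s*\+[^+]*\+\s*'([^']*)'/g, "'$1$2'");
      var links = [], re = /['"](\/[^'"\s]*)['"]/g, m;
      while ((m = re.exec(joined)) !== null) {
          links.push(m[1]);
      }
      return links;
  }
  // Applied to the snippet above, it yields:
  // ["/security-tools/find-hidden-links/site/?siteUrl="]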

jQuery

I also suspect this is only possible when Google sees that the code is based on libraries it knows. I use jQuery and load it directly from Google's servers:
ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js

This is the only third-party library my pages load, and Google can be sure that the $.post(...) and $.get(...) functions load pages via AJAX requests, and that $('#results').html(...) inserts HTML code into the div with that id.
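In the page source, that is just an ordinary script include pointing at Google's CDN:

  <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js"></script>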

Google toolbar

As an alternative explanation of how links from JavaScript got into the search results, suppose Google learned about them through the toolbar installed in my browser. However, a number of factors indicate that the toolbar has nothing to do with it:
  1. The links used in AJAX requests never get into the address bar of the browser, which means there is no reason to request PageRank for them.
  2. The toolbar requests information only for URLs that occur in real use. With the toolbar's help, Google would sooner have indexed links like unmaskparasites.com/security-tools/find-hidden-links/site or unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=example.com, but certainly not unmaskparasites.com/security-tools/find-hidden-links/site/?siteUrl=
  3. Other "secret" pages that I loaded in my browser remained unindexed.
  4. And have you ever seen a page on a month-old domain, without a single inbound link, come out on top of the search results (even for an unpopular query) against more than 14 million competitors?

Some official information from Google.

I also found some indirect evidence in the official Google Webmaster Central blog.

A spider's view of Web 2.0
“The main problem with Ajax sites is that although Googlebot is well versed in the structure of HTML links, it may have difficulty indexing sites that use JavaScript for navigation. Although we are working to better understand JavaScript, the best way to make a site search-engine friendly is to provide HTML links to your content.”

Notice: they say it is difficult to deal with JavaScript, but they do not say it is impossible. And at the same time they are “working to better understand JavaScript”. Now, nine months later, it seems they already understand something in it.

Improved Flash indexing
“Googlebot does not execute some JavaScript commands.”


Just what I said: Googlebot does execute JavaScript, but its support is rather limited.
“As for ActionScript, we are able to find links that are loaded using ActionScript.”


If they can find links in ActionScript, then what prevents them from doing the same with JavaScript?

New milestone?



Flash, JavaScript, what's next? It seems that search robots will soon be able to "see" web pages almost the way we humans do. In the meantime, check the scripts on your pages: you may be showing Google more than it should see. I have already added a few new Disallow rules to my robots.txt (see the sketch below).
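For my service URLs those rules look roughly like this (a sketch; the paths are the ones mentioned above):

  User-agent: *
  Disallow: /results/
  Disallow: /token/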

Or is it my paranoia?

The original version of this article about links in JavaScript is in my blog.

Source: https://habr.com/ru/post/30735/

