
Another article about indexing ajax sites by search engines


Building a site on AJAX is stylish, fashionable, and trendy these days: from the user's point of view it is fast and convenient, but search engines can have problems with such sites.

The most correct solution is to use regular links but load the content via AJAX, keeping the ability to get the content from a plain link for users with JavaScript disabled (there are such users) and for robots. That is, you develop the site the old-fashioned way, with regular links, layouts, and views, and then intercept all link clicks with JavaScript, attaching AJAX content loading to them using the URL from the href attribute of the a tag. In a very simplified form it should look something like this:
    $(document).on('click', 'a.ajaxlinks', function(e) {
        e.stopPropagation();
        e.preventDefault();
        var pageurl = $(this).attr('href');
        $.ajax({
            url: pageurl,
            data: { ajax: 1 },
            success: function(resp) {
                $('#content').html(resp);
            }
        });
    });

Here we simply load the same pages, but via AJAX; on the backend you need to handle the special GET parameter ajax and, when it is present, return the page without the layout (roughly speaking).
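
For example, here is a minimal sketch of such backend handling with Express (the framework, route, and template names are assumptions for illustration, not part of the original setup):

    // Render only the content partial when ?ajax=1 is present,
    // otherwise render the full page with its layout.
    var express = require('express');
    var app = express();
    app.set('view engine', 'ejs');

    app.get('/cats/:slug', function (req, res) {
        // 'cats/content' and 'cats/full' are hypothetical templates
        var view = req.query.ajax ? 'cats/content' : 'cats/full';
        res.render(view, { slug: req.params.slug });
    });

    app.listen(3000);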

But the architecture is not always designed for this; besides, sites on AngularJS and the like work somewhat differently, substituting variables into a loaded HTML template. For such sites (you can already call them applications) the search engines came up with the HashBang scheme. In short, this is a link of the form example.com/#!/cats/grumpy-cat : when the search robot sees #! it makes a request to the server at example.com/?_escaped_fragment_=/cats/grumpy-cat , i.e. it replaces "#!" with "?_escaped_fragment_=", and the server should return to the search engine generated HTML identical to what the user would see at the original link. But if the application uses the HTML5 History API and #! links are not used, you need to add a special meta tag to the head section:
 <meta name="fragment" content="!" /> 

Seeing this tag, the search robot will understand that the site is powered by AJAX and will redirect all requests for site content to example.com/?_escaped_fragment_=/cats/grumpy-cat instead of example.com/cats/grumpy-cat .
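
Just to illustrate the hashbang substitution described above (this helper is not part of the setup, it only shows the URL rewriting a crawler performs):

    // Turn a #! URL into the URL a crawler will actually request:
    // "#!" is replaced with "?_escaped_fragment_=".
    function toEscapedFragmentUrl(url) {
        return url.replace('#!', '?_escaped_fragment_=');
    }

    console.log(toEscapedFragmentUrl('http://example.com/#!/cats/grumpy-cat'));
    // -> http://example.com/?_escaped_fragment_=/cats/grumpy-cat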

You could handle these requests with whatever server-side framework you use, but in a complex AngularJS application that means a lot of extra code.

The path we take is described in the following scheme from Google:
[Diagram: Google's AJAX crawling scheme]

To do this, we will catch all requests containing _escaped_fragment_ and forward them to PhantomJS running on the server, which will generate an HTML snapshot of the requested page with its server-side WebKit engine and return it to the crawler. Regular users will keep working with the site directly.

First, install the necessary software, if not installed yet, like this:
    yum install screen npm
    npm install phantomjs
    ln -s /usr/local/node_modules/phantomjs/lib/phantom/bin/phantomjs /usr/local/bin/phantomjs

Next, write (or take a ready-made) server-side JS script (server.js) that will generate the HTML snapshots:
    var system = require('system');

    if (system.args.length < 3) {
        console.log("Missing arguments.");
        phantom.exit();
    }

    var server = require('webserver').create();
    var port = parseInt(system.args[1]);    // port to listen on, e.g. 8888
    var urlPrefix = system.args[2];         // site base URL, e.g. http://example.com

    // Parse the query string of the incoming request into an object.
    var parse_qs = function(s) {
        var queryString = {};
        var a = document.createElement("a");
        a.href = s;
        a.search.replace(
            new RegExp("([^?=&]+)(=([^&]*))?", "g"),
            function($0, $1, $2, $3) { queryString[$1] = $3; }
        );
        return queryString;
    };

    // Open the page in PhantomJS and pass the rendered HTML to the callback.
    var renderHtml = function(url, cb) {
        var page = require('webpage').create();
        page.settings.loadImages = false;
        page.settings.localToRemoteUrlAccessEnabled = true;
        page.onCallback = function() {
            cb(page.content);
            page.close();
        };
        // page.onConsoleMessage = function(msg, lineNum, sourceId) {
        //     console.log('CONSOLE: ' + msg + ' (from line #' + lineNum + ' in "' + sourceId + '")');
        // };
        page.onInitialized = function() {
            // Take the snapshot 10 seconds after initialization,
            // giving the application time to render itself.
            page.evaluate(function() {
                setTimeout(function() {
                    window.callPhantom();
                }, 10000);
            });
        };
        page.open(url);
    };

    server.listen(port, function (request, response) {
        var route = parse_qs(request.url)._escaped_fragment_;
        // var url = urlPrefix
        //     + '/' + request.url.slice(1, request.url.indexOf('?'))
        //     + (route ? decodeURIComponent(route) : '');
        var url = urlPrefix + '/' + request.url;
        renderHtml(url, function(html) {
            response.statusCode = 200;
            response.write(html);
            response.close();
        });
    });

    console.log('Listening on ' + port + '...');
    console.log('Press Ctrl+C to stop.');

And run it in a screen session with phantomjs:
 screen -d -m phantomjs --disk-cache=no server.js 8888 http://example.com 
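
Before wiring up the web server, you can check that the daemon answers. A quick sanity check (the route here is just an example path, and it assumes Node.js is available on the machine):

    // Request a snapshot directly from the PhantomJS daemon on port 8888
    // and print the beginning of the rendered HTML.
    var http = require('http');

    http.get('http://127.0.0.1:8888/?_escaped_fragment_=/cats/grumpy-cat', function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { console.log(body.slice(0, 200)); });
    });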

Next, we configure nginx (Apache is configured similarly) to proxy these requests to the running daemon:
    server {
        ...
        if ($args ~ "_escaped_fragment_=(.+)") {
            set $real_url $1;
            rewrite ^ /crawler$real_url;
        }
        location ^~ /crawler {
            proxy_pass http://127.0.0.1:8888/$real_url;
        }
        ...
    }

Now, instead of example.com/cats/grumpy-cat , search robots will request example.com/?_escaped_fragment_=cats/grumpy-cat ; the request will be intercepted by nginx and passed to PhantomJS, which will generate the HTML on the server through the browser engine and return it to the robot.

Besides the Google, Yandex, and Bing search robots, this also works when sharing a link on Facebook.

References:
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
https://help.yandex.ru/webmaster/robot-workings/ajax-indexing.xml

UPD (12/02/16):
Apache2 configs from kot-ezhva:

If html5mode is used:
    RewriteEngine on
    RewriteCond %{QUERY_STRING} (.*)_escaped_fragment_=
    RewriteRule ^(.*) http://127.0.0.1:8888/$1 [P]
    ProxyPassReverse / http://127.0.0.1:8888/

If the URLs use a hash (#):
    RewriteEngine on
    RewriteCond %{QUERY_STRING} _escaped_fragment_=(.*)
    RewriteRule ^(.*) http://127.0.0.1:8888/$1 [P]
    ProxyPassReverse / http://127.0.0.1:8888/

Source: https://habr.com/ru/post/254213/

