
CRAWL of dynamic pages for the Google and Yandex search engines (snapshots, _escaped_fragment_, ajax, fragment)




Peace for everyone!



Contents:


1. What is CRAWL?

2. Dynamic CRAWL

3. Tasks, tools, solution

4. Further reading

5. Conclusions







1. What is CRAWL?





Crawling is the scanning of a site's pages by a search engine in order to collect the information it needs. The result of the scan is the HTML representation of the page at the end point (each search engine has its own settings, in particular whether to load and execute JS, and whether to load CSS, images, and so on), also known as a "snapshot" of the site.



2. Dynamic CRAWL





Here we will talk about crawling dynamic pages, that is, when your site has dynamic (AJAX) content. I have a project built with Angular.js + the HTML5 router (i.e. domain.ru/path instead of domain.ru/#!/path): all content changes inside <ng-view></ng-view>, there is a single index.php, and special .htaccess settings so that everything is displayed correctly after a page refresh.



This is written in the settings of the angular router:

$locationProvider.html5Mode({ enabled: true, requireBase: false }); 
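
For context, that switch lives inside the normal router configuration. A minimal sketch might look like this (the module, route, and template names are illustrative and not taken from the original project):

 // Minimal sketch (illustrative names): html5Mode next to an ordinary route.
 angular.module("app", ["ngRoute"])
   .config(["$routeProvider", "$locationProvider",
     function ($routeProvider, $locationProvider) {
       $routeProvider
         .when("/product/:id", {
           templateUrl: "views/product.html",
           controller: "ProductCtrl"
         })
         .otherwise({ redirectTo: "/" });

       // the setting from the article: pretty URLs without #!
       $locationProvider.html5Mode({ enabled: true, requireBase: false });
     }
   ]);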




It is written in .htaccess:

 RewriteEngine on
 # Don't rewrite existing files or directories
 RewriteCond %{REQUEST_FILENAME} -f [OR]
 RewriteCond %{REQUEST_FILENAME} -d
 RewriteRule ^ - [L]
 # Rewrite everything else to index.php to allow html5 state links
 RewriteRule ^ index.php [L]




3. Tasks, tools, solution





Tasks:


1. Serve the page's dynamic content as it looks after rendering and application initialization have finished

2. Create, optimize, and compress the HTML snapshot of the page

3. Serve the HTML snapshot to the search engine



Tools:


1. NPM installed (npm is the Node.js package manager; with it you can manage modules and dependencies.)

2. The html-snapshots module, installed with the command:

  npm install html-snapshots 


3. Proper configuration



Solution:


For performance, I recommend crawling on localhost (local web server)



First, add this meta tag to the <head> of the main index.php:

 <meta name="fragment" content="!"> 




Example sitemap.xml:

 <?xml version="1.0" encoding="UTF-8"?>
 <!-- created with www.mysitemapgenerator.com -->
 <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
   <url>
     <loc>http://localhost/domain.ru/www/product/30</loc>
     <lastmod>2016-07-22T19:47:25+01:00</lastmod>
     <priority>1.0</priority>
   </url>
 </urlset>




Server.js configuration:

 var fs = require("fs");
 var path = require("path");
 var util = require("util");
 var assert = require("assert");
 var htmlSnapshots = require("html-snapshots");
 var minify = require('html-minifier').minify;

 htmlSnapshots.run({
   // #1 input from a sitemap
   //input: "sitemap",
   //source: "sitemap_localhost.xml",
   // #2 input from an array of URLs
   input: "array",
   source: ["http://localhost/domain.ru/www/product/30"],
   //protocol: "https",
   // setup and manage the output
   outputDir: path.join(__dirname, "./tmp"),
   // do not clean the output directory between runs
   outputDirClean: false,
   // selector to wait for; it appears once the content inside <ng-view></ng-view> has rendered
   selector: "#product",
   // wait up to 120 seconds for the page to finish rendering
   timeout: 120000,
   // PhantomJS options used during the crawl
   phantomjsOptions: [
     "--ssl-protocol=any",
     "--ignore-ssl-errors=true",
     "--load-images=false"
   ]
 }, function (err, snapshotsCompleted) {
   var body;
   console.log("completed snapshots:");
   assert.ifError(err);
   snapshotsCompleted.forEach(function (snapshotFile) {
     body = fs.readFileSync(snapshotFile, { encoding: "utf8" });
     // remove inline <style> blocks
     var regExp = /<style[^>]*?>.*?<\/style>/ig;
     var clearBody = body.replace(regExp, '');
     // replace the localhost address with the production domain
     var domain = /http:\/\/localhost\/domain.ru\/www/ig;
     clearBody = clearBody.replace(domain, '//domain.ru');
     // minify the html snapshot
     clearBody = minify(clearBody, {
       conservativeCollapse: true,
       removeComments: true,
       removeEmptyAttributes: true,
       removeEmptyElements: true,
       collapseWhitespace: true
     });
     // overwrite the snapshot file with the optimized html
     fs.open(snapshotFile, 'w', function(e, fd) {
       if (e) return;
       fs.write(fd, clearBody);
     });
   });
 });
 console.log('FINISH');




Run the command:

 node server 




Understanding the algorithm:



1. First, it crawls all the pages.

2. It creates files and folders named according to your URLs: product/30/index.html (or product/30.html if you prefer; either way it affects nothing) — see the sketch after this list.

3. It then calls the callback -> snapshotsCompleted, where each index.html snapshot of your pages is optimized.
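
A rough sketch of that URL-to-file mapping (the helper name is illustrative and not part of html-snapshots; it only mirrors the directory layout described above):

 var path = require("path");

 // Illustrative only: mirrors how a crawled URL ends up as a snapshot file
 // under outputDir (e.g. ./tmp), based on the layout described above.
 function snapshotFileFor(outputDir, url) {
   // "http://localhost/domain.ru/www/product/30"
   //   -> "<outputDir>/domain.ru/www/product/30/index.html"
   var pathname = require("url").parse(url).pathname;
   return path.join(outputDir, pathname, "index.html");
 }

 console.log(snapshotFileFor(path.join(__dirname, "tmp"),
   "http://localhost/domain.ru/www/product/30"));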



The snapshots of your site are now ready; it remains to serve them to the search bot when it requests a page:



index.php

 if (isset($_GET['_escaped_fragment_'])) {
     if ($_GET['_escaped_fragment_'] != '') {
         // hashbang URLs: the snapshot path comes from the fragment value
         $val = $_GET['_escaped_fragment_'];
         include_once "snapshots" . $val . '/index.html';
     } else {
         // html5 push state URLs: the snapshot path comes from the request path
         $url = "https://" . $_SERVER["HTTP_HOST"] . $_SERVER["REQUEST_URI"];
         $arrUrl = parse_url($url);
         $val = $arrUrl['path'];
         include_once "snapshots" . $val . '/index.html';
     }
 } else {
     include_once('pages/home.php');
 }




Explanation:



1. html5 push state

If you use html5 push state (recommended):

Just add this meta tag to your head.

 <meta name="fragment" content="!"> 




If your URLs look like this:

www.example.com/user/1

Then access your URLs like this:

www.example.com/user/1?_escaped_fragment_=



2. hashbang

If you use the hashbang (#!):

If your URLs look like this:

www.example.com/#!/user/1



Then access your URLs like this:

www.example.com/?_escaped_fragment_=/user/1
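
To check this mapping by hand, a small helper like the one below (purely illustrative, not from the original article) rewrites a public URL into the form the bot actually requests:

 // Illustrative helper: convert a public URL to its _escaped_fragment_ form.
 function toEscapedFragmentUrl(url) {
   var hashbang = url.indexOf("#!");
   if (hashbang !== -1) {
     // hashbang style: example.com/#!/user/1 -> example.com/?_escaped_fragment_=/user/1
     return url.slice(0, hashbang) + "?_escaped_fragment_=" + url.slice(hashbang + 2);
   }
   // html5 push state style: example.com/user/1 -> example.com/user/1?_escaped_fragment_=
   return url + "?_escaped_fragment_=";
 }

 console.log(toEscapedFragmentUrl("http://www.example.com/user/1"));
 // -> http://www.example.com/user/1?_escaped_fragment_=
 console.log(toEscapedFragmentUrl("http://www.example.com/#!/user/1"));
 // -> http://www.example.com/?_escaped_fragment_=/user/1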



Additionally, for those who already have snapshots but no optimization:



 var fs = require("fs");
 var minify = require('html-minifier').minify;
 var path = require("path");
 var util = require("util");
 var assert = require("assert");
 var htmlSnapshots = require("html-snapshots");

 // path to the already generated snapshots
 var myPath = path.join(__dirname, "./tmp/domain.ru/www/");

 // recursively collect all files under a directory
 function getFiles(dir, files_) {
   files_ = files_ || [];
   var files = fs.readdirSync(dir);
   for (var i in files) {
     var name = dir + '/' + files[i];
     if (fs.statSync(name).isDirectory()) {
       getFiles(name, files_);
     } else {
       files_.push(name);
     }
   }
   return files_;
 }

 var allFiles = getFiles(myPath);
 //var allFiles = [ 'C:\\xampp\\htdocs\\nodejs\\crawler\\tmp\\domain.ru\\www\\/product/30/index.html' ];
 var body;
 allFiles.forEach(function (snapshotFile) {
   body = fs.readFileSync(snapshotFile, { encoding: "utf8" });
   // remove inline <style> blocks
   var regExp = /<style[^>]*?>.*?<\/style>/ig;
   var clearBody = body.replace(regExp, '');
   // replace the localhost address with the production domain
   var domain = /http:\/\/localhost\/domain.ru\/www/ig;
   clearBody = clearBody.replace(domain, '//domain.ru');
   // minify the html
   clearBody = minify(clearBody, {
     conservativeCollapse: true,
     removeComments: true,
     removeEmptyAttributes: true,
     removeEmptyElements: true,
     collapseWhitespace: true
   });
   // strip the social links block from the snapshot
   var social = /<ul class=\"social-links\">.*?<\/ul>/ig;
   clearBody = clearBody.replace(social, '');
   // overwrite the snapshot file with the optimized html
   fs.open(snapshotFile, 'w', function(e, fd) {
     if (e) return;
     fs.write(fd, clearBody);
   });
 });
 console.log('COMPLETE');




4. Further reading





stackoverflow.com/questions/2727167/getting-all-filenames-in-a-directory-with-node-js - working with files in node.js

github.com/localnerve/html-snapshots - html-snapshots module documentation

perfectionkills.com/experimenting-with-html-minifier - html-minifier options



yandex.ru/support/webmaster/robot-workings/ajax-indexing.xml - yandex crawler info

developers.google.com/webmasters/ajax-crawling/docs/specification - google crawler info



www.ng-newsletter.com/posts/serious-angular-seo.html - article

prerender.io/js-seo/angularjs-seo-get-your-site-indexed-and-to-the-top-of-the-search-results - article

prerender.io/documentation - article



regexr.com - regexr

stackoverflow.com/questions/15618005/jquery-regexp-selecting-and-removeclass - regexr



5. Conclusions





Now you can safely write any SPA without worrying about how a search bot will crawl it, and you can pick the right "server" or "client" configuration for your own set of tools!



I wish everyone professional success!

Source: https://habr.com/ru/post/306644/


