
CRAWL dynamic pages for Google and Yandex search engines (snapshots, _escaped_fragment_, ajax, fragment)


Peace to everyone!

The content of the article:
1. What is CRAWL?
2. Dynamic CRAWL
3. Tasks, tools, solution
4. Further reading
5. Conclusions



1. What is CRAWL?



CRAWL is the scanning of a site's pages by search engines in order to collect the information they need. The result of this scan is the final HTML representation of each page (every search engine has its own settings, namely whether to execute JS, load CSS, images, and so on), also known as a "snapshot" of the site.

2. Dynamic CRAWL



Here we will talk about crawling dynamic pages, i.e. when your site has dynamic content (also called AJAX content). I have a project built with Angular.js + the HTML5 router (that is, URLs like domain.ru/path instead of domain.ru/#!/path), where all the content changes inside <ng-view></ng-view>, there is a single index.php, and a special .htaccess configuration so that everything is displayed correctly after a page refresh.

The Angular router is configured like this:
$locationProvider.html5Mode({ enabled: true, requireBase: false }); 
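
For context, here is a minimal sketch of how this setting typically sits inside an Angular.js module config; the module name, route, templateUrl, and controller are placeholders for illustration, not taken from the original project:

 // Hypothetical app module: names and the route are for illustration only.
 angular.module('app', ['ngRoute'])
   .config(['$locationProvider', '$routeProvider',
     function ($locationProvider, $routeProvider) {
       // Enable HTML5 (push state) URLs: domain.ru/path instead of domain.ru/#!/path
       $locationProvider.html5Mode({ enabled: true, requireBase: false });

       // Example route whose template is rendered inside <ng-view></ng-view>
       $routeProvider.when('/product/:id', {
         templateUrl: 'views/product.html',
         controller: 'ProductCtrl'
       });
     }]);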


And this is in .htaccess:
 RewriteEngine on

 # Don't rewrite files or directories
 RewriteCond %{REQUEST_FILENAME} -f [OR]
 RewriteCond %{REQUEST_FILENAME} -d
 RewriteRule ^ - [L]

 # Rewrite everything else to index.php to allow html5 state links
 RewriteRule ^ index.php [L]


3. Tasks, tools, solution



Tasks:

1. Serve the dynamic content of a page as it looks after the application has finished rendering and initializing
2. Create, optimize, and compress the HTML snapshot of each page
3. Serve the HTML snapshot to the search engine

Tools:

1. NPM installed (npm is the Node.js package manager; with it you can manage modules and dependencies.)
2. The html-snapshots module installed with the command:
  npm install html-snapshots 

3. Proper configuration

Solution:

For performance, I recommend crawling on localhost (a local web server).

First, you need to add this meta tag to the head of the main index.php:
 <meta name="fragment" content="!"> 


Example sitemap.xml:
 <?xml version="1.0" encoding="UTF-8"?>
 <!-- created with www.mysitemapgenerator.com -->
 <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
   <url>
     <loc>http://localhost/domain.ru/www/product/30</loc>
     <lastmod>2016-07-22T19:47:25+01:00</lastmod>
     <priority>1.0</priority>
   </url>
 </urlset>


Server.js configuration:
 var fs = require("fs");
 var path = require("path");
 var util = require("util");
 var assert = require("assert");
 var htmlSnapshots = require("html-snapshots");
 var minify = require('html-minifier').minify;

 htmlSnapshots.run({
   // option #1: take the list of URLs from a sitemap
   //input: "sitemap",
   //source: "sitemap_localhost.xml",

   // option #2: take the list of URLs from an array
   input: "array",
   source: ["http://localhost/domain.ru/www/product/30"],
   //protocol: "https",

   // setup and manage the output
   outputDir: path.join(__dirname, "./tmp"),
   // do not clean the output directory before the run
   outputDirClean: false,
   // a selector from the dynamic content (inside <ng-view></ng-view>) that must appear before the snapshot is taken
   selector: "#product",
   // wait up to 120 seconds per page (in milliseconds)
   timeout: 120000,
   // PhantomJS options used during the crawl
   phantomjsOptions: [
     "--ssl-protocol=any",
     "--ignore-ssl-errors=true",
     "--load-images=false"
   ]
 }, function (err, snapshotsCompleted) {
   var body;
   console.log("completed snapshots:");
   assert.ifError(err);
   snapshotsCompleted.forEach(function (snapshotFile) {
     body = fs.readFileSync(snapshotFile, { encoding: "utf8" });

     // remove inline <style> blocks
     var regExp = /<style[^>]*?>[\s\S]*?<\/style>/ig;
     var clearBody = body.replace(regExp, '');

     // replace the localhost URLs with the production domain
     var domain = /http:\/\/localhost\/domain.ru\/www/ig;
     clearBody = clearBody.replace(domain, '//domain.ru');

     // minify the html snapshot
     clearBody = minify(clearBody, {
       conservativeCollapse: true,
       removeComments: true,
       removeEmptyAttributes: true,
       removeEmptyElements: true,
       collapseWhitespace: true
     });

     // write the optimized snapshot back to disk
     fs.writeFileSync(snapshotFile, clearBody);
   });
 });
 console.log('FINISH');
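
Note that server.js also requires the html-minifier module, which the installation step above does not mention. Assuming it is not already present, install it the same way:
  npm install html-minifier 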


Run the command:
 node server 


Understanding the algorithm:

1. First, it crawls all the pages.
2. It creates files and names folders according to your URLs: product/30/index.html (or product/30.html if you prefer; it makes no difference). The expected layout is shown after this list.
3. After that it calls the callback with snapshotsCompleted, where each index.html snapshot of your pages gets optimized.
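
For reference, with the configuration above (outputDir set to ./tmp and the URL http://localhost/domain.ru/www/product/30), the generated layout should look roughly like this:

 tmp/
   domain.ru/
     www/
       product/
         30/
           index.html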

The snapshots of your site are now ready; all that remains is to serve them to the search bot when it visits:

index.php
 if (isset($_GET['_escaped_fragment_'])) {
     if ($_GET['_escaped_fragment_'] != '') {
         // hashbang URLs: the path comes in the _escaped_fragment_ parameter
         $val = $_GET['_escaped_fragment_'];
         include_once "snapshots" . $val . '/index.html';
     } else {
         // html5 push state URLs: the path is the request path itself
         $url = "https://" . $_SERVER["HTTP_HOST"] . $_SERVER["REQUEST_URI"];
         $arrUrl = parse_url($url);
         $val = $arrUrl['path'];
         include_once "snapshots" . $val . '/index.html';
     }
 } else {
     // a regular visitor: serve the normal application
     include_once('pages/home.php');
 }
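
One caveat: this code looks for the snapshots in a snapshots/ directory next to index.php, mirroring the URL path (e.g. snapshots/product/30/index.html), while the crawler above writes them to ./tmp/domain.ru/www/. Assuming that layout, you could simply copy the generated files into place, for example:

 cp -r tmp/domain.ru/www/. snapshots/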


Explanation:

1. html5 push state
If you use html5 push state (recommended):
Just add this meta tag to your head.
 <meta name="fragment" content="!"> 


If your URLs look like this:
www.example.com/user/1
Then access your URLs like this:
www.example.com/user/1?_escaped_fragment_=

2. hashbang
If you use the hashbang (#!):
If your URLs look like this:
www.example.com/#!/user/1

Then access your URLs like this:
www.example.com/?_escaped_fragment_=/user/1
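
A simple way to verify that the snapshot is actually served is to request the _escaped_fragment_ URL yourself, for example with curl (the domain and path here are just the examples used above); the response should contain the pre-rendered HTML rather than an empty <ng-view>:

 curl "https://domain.ru/product/30?_escaped_fragment_=" 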

Additionally, for those who already have snapshots but have not optimized them yet:

 var fs = require("fs");
 var minify = require('html-minifier').minify;
 var path = require("path");
 var util = require("util");
 var assert = require("assert");
 var htmlSnapshots = require("html-snapshots");

 // path to the generated snapshots
 var myPath = path.join(__dirname, "./tmp/domain.ru/www/");

 // recursively collect all file names under a directory
 function getFiles(dir, files_) {
   files_ = files_ || [];
   var files = fs.readdirSync(dir);
   for (var i in files) {
     var name = dir + '/' + files[i];
     if (fs.statSync(name).isDirectory()) {
       getFiles(name, files_);
     } else {
       files_.push(name);
     }
   }
   return files_;
 }

 var allFiles = getFiles(myPath);
 //var allFiles = [ 'C:\\xampp\\htdocs\\nodejs\\crawler\\tmp\\domain.ru\\www\\/product/30/index.html' ];

 var body;
 allFiles.forEach(function (snapshotFile) {
   body = fs.readFileSync(snapshotFile, { encoding: "utf8" });

   // remove inline <style> blocks
   var regExp = /<style[^>]*?>[\s\S]*?<\/style>/ig;
   var clearBody = body.replace(regExp, '');

   // replace the localhost URLs with the production domain
   var domain = /http:\/\/localhost\/domain.ru\/www/ig;
   clearBody = clearBody.replace(domain, '//domain.ru');

   // minify the html snapshot
   clearBody = minify(clearBody, {
     conservativeCollapse: true,
     removeComments: true,
     removeEmptyAttributes: true,
     removeEmptyElements: true,
     collapseWhitespace: true
   });

   // remove the social links block from the snapshot
   var social = /<ul class="social-links">[\s\S]*?<\/ul>/ig;
   clearBody = clearBody.replace(social, '');

   // write the optimized snapshot back to disk
   fs.writeFileSync(snapshotFile, clearBody);
 });
 console.log('COMPLETE');


4. Further reading



stackoverflow.com/questions/2727167/getting-all-filenames-in-a-directory-with-node-js - working with files in node.js
github.com/localnerve/html-snapshots - snapshots module doc
perfectionkills.com/experimenting-with-html-minifier - html-minifier options

yandex.ru/support/webmaster/robot-workings/ajax-indexing.xml - yandex crawler info
developers.google.com/webmasters/ajax-crawling/docs/specification - google crawler info

www.ng-newsletter.com/posts/serious-angular-seo.html - article
prerender.io/js-seo/angularjs-seo-get-your-site-indexed-and-to-the-top-of-the-search-results - article
prerender.io/documentation - article

regexr.com - regex testing tool
stackoverflow.com/questions/15618005/jquery-regexp-selecting-and-removeclass - regex example

5. Conclusions



Now you can safely write any SPA application without worrying about how search bots will crawl it, and you can choose whichever "server" or "client" configuration fits your own set of tools!

I wish everyone professional success!

Source: https://habr.com/ru/post/306644/

