Indexing AJAX sites

As we developed Joosy, AJAX suddenly, though predictably, took over every project we work on. The approach has proved extremely successful in every respect but one, the classic one: "AJAX? Indexing? Pff...". As long as we are building online banking, that suits us fine. But how do you keep this exquisite pleasure on public web resources, where indexing matters?

Here's how: Google AJAX Crawling is a Google scheme under which the crawler, on meeting an AJAX address (one containing #!), requests a different, specially formed address instead (one carrying the _escaped_fragment_ parameter). At that address Google expects an HTML dump of the page, which it happily digests. Good people have already written an article about how it works. What remains for us is to learn to produce this dump efficiently, and to do so without touching the code of the application itself.
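To make the address mapping concrete, here is a minimal Ruby sketch of how a crawler following the scheme would turn a #! address into the address it actually requests. The helper names crawler_url and escape_fragment are made up for illustration, and the set of escaped characters (control characters and space, #, %, &, + and non-ASCII bytes) reflects our reading of the published scheme rather than any Hashbang code.

```ruby
# Percent-escape the bytes that the AJAX crawling scheme asks crawlers to
# escape inside the fragment. Hypothetical helper, not part of Hashbang.
def escape_fragment(fragment)
  fragment.bytes.map do |byte|
    must_escape = byte <= 0x20 || byte >= 0x7F || '#%&+'.include?(byte.chr)
    must_escape ? format("%%%02X", byte) : byte.chr
  end.join
end

# Turn a "pretty" #! address into the address the crawler requests.
def crawler_url(pretty_url)
  base, fragment = pretty_url.split("#!", 2)
  return pretty_url if fragment.nil? # no hashbang, nothing to rewrite

  # Append to an existing query string if there is one.
  separator = base.include?("?") ? "&" : "?"
  "#{base}#{separator}_escaped_fragment_=#{escape_fragment(fragment)}"
end
```

For example, crawler_url("http://example.com/#!/posts/1") gives http://example.com/?_escaped_fragment_=/posts/1, and that second address is the one that has to answer with ready-made HTML.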

Hashbang is a small Ruby proxy built on the Rack protocol. That means any web server that works with Ruby and/or Rails can run it. For those who use Rails itself, we have prepared a couple of special treats. But first things first.

General design

During initialization, Hashbang creates a WebKit browser instance internally. When a request with a given URL arrives, it opens that address, waits for a special JavaScript event, and returns the HTML as it stands at the moment the event fires.

This means the only change your existing application needs is a call to

Sunscraper.finish()

at the point where the page built by JavaScript can be considered complete.

In production, it looks like this:

About the internal browser and performance

We experimented a lot with possible implementations of a "headless" browser. We tried Watir and various existing Qt bindings. Nothing good came of it. In desperation we simply wrote our own binding to QtWebKit that can return the HTML once the event fires: Sunscraper. This miracle is written in a mix of C and C++ and connects to Ruby via FFI. This means Sunscraper should work not only on MRI but also on JRuby and Rubinius. Unfortunately, it does not yet work on Rubinius because of bugs in that platform's FFI implementation.

Since all we launch is the WebKit engine itself, performance is close to the best achievable for this task. Real figures from live servers are still being collected.

Before installation

Sunscraper uses Qt, so you will need Qt installed in order to install the Hashbang gem. On a Mac we recommend Homebrew: brew install qt . On Linux, any reasonably recent Qt package will do.

Development mode for those on Rails

If you are not developing on Rails, feel free to skip to the next section, which covers deploying Hashbang.

To install Hashbang in a Rails project, do the following:

Add gem hashbang to the Gemfile
Generate the base application with rails g hashbang

The Hashbang application itself now lives in the hashbang folder inside your Rails application. This means you can skip the first paragraph of the "Setup and Startup" section.

In the development environment, Hashbang inserts its middleware into the Rails stack, where it intercepts every request containing the magic _escaped_fragment_ parameter and processes it automatically. There is one problem: WEBrick runs in a single thread, and since Hashbang requests the application "itself", this leads to a deadlock. So to test the application locally, start it with rake hashbang:rails . This command launches your application under the Unicorn server with two worker processes. Once it is running, open localhost:3000/?_escaped_fragment_ and check the HTML. Just do not forget that the AJAX application itself has to call Sunscraper.finish() .
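The interception idea can be sketched as a plain Rack middleware. This is not Hashbang's actual code: the class name EscapedFragmentInterceptor and the renderer object (anything that takes a URL and returns HTML, standing in for the headless browser) are assumptions made for the example.

```ruby
require "uri"

# Hypothetical middleware illustrating the idea; Hashbang's real
# internals may be organised differently.
class EscapedFragmentInterceptor
  def initialize(app, renderer)
    @app = app
    @renderer = renderer # any object responding to call(url) and returning HTML
  end

  def call(env)
    params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h
    fragment = params["_escaped_fragment_"]
    return @app.call(env) if fragment.nil? # ordinary request, pass it through

    # Rebuild the #! address the crawler originally saw. Other query
    # parameters are dropped here to keep the sketch short.
    scheme = env["rack.url_scheme"]
    host = env["HTTP_HOST"]
    path = env["PATH_INFO"]
    url = "#{scheme}://#{host}#{path}#!#{fragment}"

    html = @renderer.call(url)
    [200, { "content-type" => "text/html; charset=utf-8" }, [html]]
  end
end
```

Because the renderer has to fetch the page from the same application, a server that handles only one request at a time would block on itself here, which is the deadlock described above.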

To emulate Hashbang running in production, where it is addressed as /?url=http://..., use the rake hashbang:standalone command.

Setup and Startup

If you do not use Rails, the base application can be taken from a special repository. All you need to do is put it somewhere, make sure the bundler gem is installed, and run bundle install in the application root.

Inside the generated or copied Hashbang application is the file config.rb, which needs editing before Hashbang will work properly. It has only two directives:

url : a regular expression that the requested URL must match
timeout : how long, in milliseconds, Hashbang will wait for the Sunscraper.finish() event
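The exact syntax of config.rb is not reproduced here, so the following is only a sketch of what the two settings amount to, with hypothetical constant and method names and example.com as a placeholder domain. The point worth noting is that the URL pattern should be anchored, otherwise the proxy could be talked into rendering arbitrary sites.

```ruby
# Hypothetical values; the real config.rb directives may be spelled differently.
ALLOWED_URL = %r{\Ahttps?://example\.com(/|\z)}
TIMEOUT_MS  = 5_000

# Only URLs on our own domain should ever be handed to the browser.
def renderable?(url)
  ALLOWED_URL.match?(url)
end

# Ruby timeouts are usually expressed in seconds, so convert once.
def timeout_seconds
  TIMEOUT_MS / 1000.0
end
```

With this pattern, http://example.com/#!/posts/1 passes, while http://example.com.evil.test/ and http://evil.test/?next=http://example.com/ are rejected.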

Suppose we run the service with the Passenger module, which provides Rack support on top of Nginx. For everything to work correctly, we need to achieve the following:

The Hashbang application should run on a dedicated internal address.
All requests containing _escaped_fragment_ should be forwarded to this application, with the absolute URL passed URI-escaped in the url=... parameter.
The number of parallel requests to this application should be limited, because we are unlikely to be crawled in a hundred threads, and WebKit is resource-hungry.
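The forwarding rule in the list above can be illustrated in Ruby. In a real deployment this rewriting is done by the web server, so the helper below (hashbang_request_path, a made-up name) only shows what the resulting request to the Hashbang application might look like. That the absolute URL includes the restored #! fragment is our assumption.

```ruby
require "uri"

# Hypothetical helper: builds the path requested from the internal
# Hashbang application for one crawler request.
def hashbang_request_path(scheme, host, path, escaped_fragment)
  absolute = "#{scheme}://#{host}#{path}#!#{escaped_fragment}"
  # The whole absolute URL travels as a single, fully escaped parameter.
  "/?url=#{URI.encode_www_form_component(absolute)}"
end

# What the Hashbang side would recover from such a request.
def requested_url(request_path)
  query = request_path.split("?", 2).last
  URI.decode_www_form(query).to_h["url"]
end
```

Here hashbang_request_path("http", "example.com", "/", "/posts/1") yields /?url=http%3A%2F%2Fexample.com%2F%23%21%2Fposts%2F1, and requested_url turns that back into http://example.com/#!/posts/1.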

Here is a configuration file you can use: https://gist.github.com/2127685 . It is an example of using Hashbang in a Rails application.

The sad part

Unfortunately, this standard has not yet reached our home turf, Yandex. Google supports it, Bing supports it (and therefore so does Yahoo), and even the Facebook crawler supports it. Yandex does not. This means Hashbang will not help your indexing in the domestic segment of the Internet, at least for now. We send furious rays of goodwill to the Yandex team and hope they soon turn their attention to such an actively developing segment of Web technology :).

Finally

Although we already use Hashbang in production, we have not yet tested it on every possible configuration. If you run into problems building or configuring it, we are always glad to see new Issues on GitHub.

Thanks :).

Source: https://habr.com/ru/post/140291/
