Below is a translation of an article that draws attention to what is, in my opinion, a rather acute problem of the Web 2.0 era: the cleanliness of URLs.
Using Lifehacker.com as its example, it shows what blindly chasing state-of-the-art technology and SEO, while rejecting the principle of progressive enhancement, can turn into.

Last Monday, Lifehacker.com was unavailable because of a JavaScript failure. Lifehacker.com, along with the rest of the Gawker network, displayed an empty home page with no content, no advertising, nothing. Following a Google search result to an article page simply bounced you back to the home page.
JavaScript-dependent URLs
Gawker, like Twitter before it, rebuilt its sites to be completely dependent on JavaScript, including its page URLs. When that JavaScript failed to load, both the content and the URLs broke.
The new page addresses now look like this:
http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker
Until Monday, the address was the same, just without the #!.
Fragment IDs
The # is a special URL character: it tells the browser that everything after it refers to an HTML element with that id, or a named anchor, on the current page. The fragment is handled entirely by the browser and is never sent to the server. In the case of lifehacker.com, the current page is the home page.
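To make this concrete, here is a quick check with the standard URL parser (a small sketch; it runs in any modern browser console or in Node):

```typescript
// Parse the new hash-bang address and the old clean address with the
// standard URL API. Everything after "#" is a fragment: the browser keeps
// it to itself and never sends it to the server.
const hashBang = new URL("http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker");
console.log(hashBang.pathname); // "/"  -- as far as the server knows, this is the home page
console.log(hashBang.hash);     // "#!5753509/hello-world-this-is-the-new-lifehacker"

const clean = new URL("http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker");
console.log(clean.pathname);    // "/5753509/hello-world-this-is-the-new-lifehacker"
```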
In effect, where the site used to consist of a million pages, it now consists of one page with a million fragment identifiers.
Why? I don't know. When Twitter switched to the same technique, its answer was that it lets Google index tweets. That is true, but the same could have been achieved with the previous, well-formed URL structure, at far less cost.
Solution to the problem
The #! (hash-bang) URL syntax first entered the web development arena when Google announced a way for developers to make their sites accessible to its indexing robot.
Before that, there was no well-known correct solution, and sites that loaded their content with fancy techniques like Ajax saw poor indexing and ranking for relevant keywords, because the bot could not find the content hidden behind JavaScript calls.
Google spent a long time trying to solve this problem, did not succeed, and decided to approach it from the other end: instead of trying to discover this elusive content itself, it would let site owners declare it. For that purpose a specification was developed.
To Google's credit, it was careful to remind developers that they should build sites with progressive enhancement and not rely on JavaScript for the content itself:
"If you're starting from scratch, one good approach is to build your site's structure and navigation using only HTML. Then, once you have the site's pages, links, and content in place, you can spice up appearance and interface with AJAX. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your AJAX bonuses."
In other words, the #! syntax was designed specifically for sites that had already violated this fundamental principle of web development; it threw them a lifeline so their content could still be discovered by the bot.
And now this lifeline seems to have been adopted as the One True Way of web development by the engineers at Facebook, Twitter and, now, Lifehacker.com.
Clean URLs
In Google's specification, #! addresses are called "pretty URLs", and the bot transforms them into something considerably more grotesque.
Last Sunday, Lifehacker's address looked like this:
http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker
Good enough. The seven-digit number in the middle is the only opaque part, but the CMS needs it to identify the article unambiguously. So it is, for all practical purposes, a clean URL.
Today, the same article is available at:
http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker
The address is now less clean than before, because adding the #! fundamentally changes its structure:
- The path /5753509/hello-world-this-is-the-new-lifehacker becomes simply /.
- A new fragment identifier, !5753509/hello-world-this-is-the-new-lifehacker, is appended to the address.
Have we gained anything? No. And the mangling of the address does not end there.
Google's specification says the crawler converts such an address into one with a query parameter:
http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker
It is this address that actually returns the content; in other words, this is the canonical address, the one the bot will index.
Which looks an awful lot like:
http://example.com/default.asp?page=about_us
Lifehacker.com and the rest of Gawker have just thrown away ten years of accumulated experience with clean URLs and regressed to a typical ASP-era site (how long before they go back to FrontPage?).
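To recap the whole chain, the rewrite described in Google's specification can be sketched like this (an illustration of the mapping only, not Googlebot's actual code; the helper name is mine):

```typescript
// Turn a "pretty" #! URL into the "ugly" _escaped_fragment_ URL that the
// crawler actually requests from the server, as described in Google's
// AJAX-crawling specification. Sketch only.
function toEscapedFragmentUrl(prettyUrl: string): string {
  const url = new URL(prettyUrl);
  if (!url.hash.startsWith("#!")) return prettyUrl;   // not an AJAX-crawlable address
  const state = url.hash.slice(2);                    // drop the leading "#!"
  url.hash = "";                                      // the fragment never reaches the server anyway
  url.searchParams.set("_escaped_fragment_", state);  // the value gets percent-encoded, hence "escaped"
  return url.toString();
}

console.log(toEscapedFragmentUrl("http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"));
// -> "http://lifehacker.com/?_escaped_fragment_=5753509%2Fhello-world-this-is-the-new-lifehacker"
```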
What is the problem?
The main problem is that Lifehacker.com's URLs no longer point to content. Every URL points to the home page. If you are lucky and JavaScript works, the right content then gets loaded into that home page.
That is a more complicated scheme than ordinary URLs, more error-prone and more fragile.
So a client requesting an address that is supposedly bound to a piece of content does not receive that content: the breakage is designed in. Lifehacker.com deliberately prevents bots from following links to its content, unless, of course, they jump through the hoop Google invented.
So why do you need this hoop?
The purpose of #!
So why use #! addresses at all, if they are synthetic addresses that still have to be converted into yet another address, the one that actually serves the content?
Of all the reasons, the strongest is "because it's cool!". Note that I said the strongest reason, not a good one.
Engineers will mumble something about preserving state in Ajax applications. Frankly, that is a poor reason to break URLs this way. The address in the href attribute can remain a well-formed link to the content; since you are using JavaScript anyway, you can mangle it later, inside the click handler, adding the #! wherever you need it. That way you still get your state management, without hiding the site from bots and from every non-JavaScript client.
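A minimal sketch of that approach, assuming a page with a #content container and a hypothetical loadArticle() helper (both are my inventions for illustration, not Lifehacker's actual code):

```typescript
// Hypothetical helper: fetch an article and inject it into an assumed
// <div id="content"> container.
async function loadArticle(path: string): Promise<void> {
  const response = await fetch(path);
  document.querySelector("#content")!.innerHTML = await response.text();
}

// Links keep their real, crawlable hrefs; JavaScript only enhances them.
document.addEventListener("click", (event) => {
  const target = event.target as HTMLElement | null;
  const link = target?.closest<HTMLAnchorElement>("a[href^='/']");
  if (!link) return;                               // not one of our internal links
  event.preventDefault();                          // take over navigation only when JS is alive
  void loadArticle(link.pathname);                 // Ajax in the content
  location.hash = "#!" + link.pathname.slice(1);   // keep the state in the hash if you must
});
// Without JavaScript, or when it breaks, the href still points at real content.
```

The crucial difference is that the #! only ever appears after JavaScript has already proven it is working; the markup itself stays clean.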
Locking out all bots (except Googlebot)
Every non-browser agent (spiders, aggregators, indexing crawlers) that fully supports HTTP/1.1 and the URL specification (RFC 2396, for instance) is now unable to crawl Lifehacker.com; except, of course, Googlebot.
Therefore, the following consequences should be considered:
- Caching no longer works: intermediate proxies never see a canonical representation of the content, so they cache nothing. Lifehacker loads more slowly, and Gawker pays for the extra traffic.
- Crawlers that comply with HTTP/1.1 and RFC 2396 see nothing but an empty home page, and every application or service built on top of such crawlers suffers accordingly.
- The potential of microformats is drastically reduced: only browser-based tools and Google's own aggregators can still see the microformat data.
- Facebook Like widgets, which identify pages by URL, need extra work to Like an individual article: by default the home page is the only thing a direct URL points to, and every crooked #! address resolves to that same home page again.
Dependence on flawless JavaScript
If content cannot be reached by its URL, the site is, for all practical purposes, broken. Gawker took this link-breaking step deliberately, leaving the availability of its sites at the mercy of every kind of JavaScript error.
- A failure to load the JavaScript took every Gawker property offline for about five hours last Monday (7 February 2011).
- A stray trailing comma at the end of an object or array literal is enough to throw an error in Internet Explorer (see the snippet after this list).
- A console.log() accidentally left in the code throws an error for every visitor whose browser has no developer console open.
- Ad code is riddled with errors, and now an error in an ad unit means no site at all. Experienced web developers know that the shoddiest code of all lives in ad banners.
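For example, something as small as this, harmless in modern browsers, is enough to throw a syntax error in older versions of Internet Explorer and, on a fully JavaScript-dependent site, take every page down with it (the object here is purely illustrative):

```typescript
// The trailing comma after the last property is legal in modern engines,
// but older Internet Explorer treats it as a syntax error -- and on a
// hash-bang site a single syntax error means no content at all.
const article = {
  id: 5753509,
  title: "Hello World, This is the New Lifehacker", // <-- this trailing comma breaks old IE
};
```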
All this fragility, for no real reason and for benefits that come nowhere near outweighing the drawbacks. There are far better techniques than the one Lifehacker chose; HTML5's History API alone would be a better solution.
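For comparison, a minimal sketch of the History API route (again using the hypothetical loadArticle() from the earlier sketch; browsers without pushState simply fall back to ordinary page loads):

```typescript
// Same hypothetical Ajax loader as in the earlier sketch.
declare function loadArticle(path: string): Promise<void>;

// Enhance internal links with pushState: the address bar keeps a real,
// server-resolvable path, and there is no #! anywhere.
document.addEventListener("click", (event) => {
  const link = (event.target as HTMLElement | null)?.closest<HTMLAnchorElement>("a[href^='/']");
  if (!link || !("pushState" in history)) return; // no support: let the browser navigate normally
  event.preventDefault();
  history.pushState({ path: link.pathname }, "", link.pathname);
  void loadArticle(link.pathname);
});

// Handle the back/forward buttons.
window.addEventListener("popstate", (event) => {
  const path = (event.state as { path?: string } | null)?.path ?? location.pathname;
  void loadArticle(path);
});
```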
An architectural nightmare
Gawker and Lifehacker violated the principle of progressive enhancement and paid for it immediately, with their whole stack going down on launch day. From now on, every JavaScript mistake will take the sites down, directly hitting Gawker's revenue and its audience's trust.
Further reading
- http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch
- http://blog.benward.me/post/3231388630