This remarkable story, today's most popular news item on the web, has caused a lot of misconceptions. Even people close to web technologies do not always assess what happened accurately, to say nothing of the rest of the online public, some of whom have already declared the incident to be viral advertising. I will try to dispel the conspiracy theories by answering the questions asked in the comments.
Q: How did a search engine get access to SMS texts?
A: They were always available to everyone, by design. Let me remind you that we are talking about anonymous sending of SMS from the site. Of course, you do not need to be a MegaFon subscriber or register on the portal for this. That is the beauty of the service, especially when every second counts. However, the developers took the trouble to provide the visitor with a minimum of convenience: for each sending attempt, a page with a random address is generated that displays the text of the SMS and its delivery status. That page is what anyone can read, including robots.
Q: Is it really impossible to limit access to these pages without complicating users' lives?
A: Of course it is possible. Here are just the most obvious measures: binding the page to a session cookie in the browser, a strict limit on the page's lifetime and, finally, a robots.txt that prohibits search engines from indexing these pages. The robots.txt file was added only during today's emergency patching of holes, which is confirmed by the official response of Yandex. Why didn't the developers think about it? I have a theory about this: shaky :)
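The first two measures can be sketched in a few lines. This is only an illustration under stated assumptions, not MegaFon's actual code: the function names, the in-memory storage and the 15-minute lifetime are all hypothetical.

```python
import secrets
import time

# Assumed page lifetime; the real service's value is unknown.
PAGE_TTL_SECONDS = 15 * 60

# In-memory stand-in for the service's storage: page token -> record.
_pages = {}


def create_status_page(sms_text, session_id, now=None):
    """Store the message and return the random token used in the page address."""
    token = secrets.token_urlsafe(16)
    _pages[token] = {
        "text": sms_text,
        "owner": session_id,
        "created": time.time() if now is None else now,
    }
    return token


def read_status_page(token, session_id, now=None):
    """Return the SMS text only to the creating session, and only before expiry."""
    now = time.time() if now is None else now
    page = _pages.get(token)
    if page is None:
        return None
    if now - page["created"] > PAGE_TTL_SECONDS:
        del _pages[token]  # expired pages are dropped for good
        return None
    # compare_digest avoids leaking the session ID through timing differences.
    if not secrets.compare_digest(page["owner"], session_id):
        return None
    return page["text"]
```

With a check like this, a robot that learns the page address still gets nothing, because it does not carry the sender's session cookie, and the sender loses nothing in convenience.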
Q: Why doesn't Google see anything?
A: To index pages, a search engine must first learn about them. As a rule, search engines reach new pages through links from pages they already know, and Google had no such links. It nevertheless indexed several pages; next to Yandex's results that just looked unimpressive.
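Link-based discovery is easy to picture as a breadth-first walk over a link graph. The sketch below is a toy model, not any search engine's real crawler, and the URLs in it are made up: a page that nothing links to is simply never reached.

```python
from collections import deque


def discover(seed_urls, links):
    """Breadth-first crawl: return every URL reachable from the seeds via links."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: the status page exists, but no page links to it.
links = {
    "/": ["/sms/send"],
    "/sms/send": ["/help"],
    "/sms/status/k3J9xQ": [],
}
found = discover(["/"], links)
```

Here `found` contains the home page, the sending form and the help page, but not the status page. A crawler needs some other source of addresses to find it.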
Q: But how did Yandex find them?
A: This is Yandex, “everything will be found”. The most plausible version is the Yandex.Metrica code installed on the site. Traces of it have been noticed. In the course of the emergency work MegaFon got rid of it, but at present Google still holds a cache of July 5 in which it is present. The addresses of all pages visited on the service became known to Yandex; this is how Metrica works. Curiously, Google Analytics code was also present there, but the two search engines used the information they received differently. I would not call this a MegaFon fail: these were good tools in normal use. And to hide non-public data, I repeat, you need to use robots.txt, binding the session to the browser, authorization on the site and other methods.
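As a rough sketch of the robots.txt part, Python's standard `urllib.robotparser` can show what such a rule tells a well-behaved crawler. The `/sms/status/` path and the `example.com` host are assumptions made for illustration; the real address scheme of the service is not given here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: the real path of the status pages is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sms/status/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant robot asks before fetching each URL.
status_allowed = parser.can_fetch("*", "https://example.com/sms/status/k3J9xQ")
form_allowed = parser.can_fetch("*", "https://example.com/sms/send")
```

Under this rule `status_allowed` is false and `form_allowed` is true. Keep in mind that robots.txt is only a request to crawlers, not access control, which is why it belongs next to session binding and authorization rather than in place of them.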
Q: Why are so few messages indexed?
A: First of all, let me remind you that these are only messages sent from the site; they do not number in the millions, as messages sent from phones do. Now for the search engines. Yandex never tries to download an entire site if its pages number in the tens or hundreds of thousands, unless it is something as highly cited as Wikipedia. The pages were downloaded gradually, leaving the crowded queue in an unpredictable order, so by the time the robot arrived they could already have been removed by MegaFon. What share of the messages eventually got into the search is unclear, but it is definitely small. And the old pages simply dropped out of the index with subsequent cache updates: garbage does not live long in search.
Q: Then why are all the messages so interesting? Where are the monosyllabic "OK", "Yes", "No"? Where are “I will be there in 5 minutes” and “Busy, will call you back”? Why so little transliteration and so many mistakes?
A: Again, this comes down to the specifics of both the service and the search. The site is not used on the run; it is suited precisely to long SMS. Replying from it is also inconvenient, since the question arrived on the phone. There is no need for transliteration: if the text does not fit in one message, write a second one, it is free. And anonymity provokes a lot: some of these texts may well be stupid pranks and setups. But even if 99% of the messages are boilerplate, Yandex will show on the first pages exactly the 1% that is “interesting” from its point of view. This is how ranking works for a query that is restricted to a site but contains no text. The citation score of every page is zero, the behavioral factors are equally absent, and only the content remains: the more unusual (expressive, misspelled) words a page has, the higher its uniqueness and the more valuable it is. All this turned the search results into a branch of Bashorg.
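The effect can be reproduced with a toy scoring function. To be clear, this is not Yandex's ranking formula, which is not public; it is a minimal sketch that assumes "uniqueness" behaves roughly like the average inverse document frequency of a message's words, and the sample messages are invented.

```python
import math
from collections import Counter


def rarity_scores(messages):
    """Score each message by the mean inverse document frequency of its words."""
    tokenized = [m.lower().split() for m in messages]
    # Number of messages each word appears in.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n = len(messages)
    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        idf = [math.log(n / doc_freq[w]) for w in words]
        scores.append(sum(idf) / len(idf))
    return scores


messages = ["ok", "ok", "yes", "ok yes", "my dearest kitten forgive me"]
scores = rarity_scores(messages)
# Highest score first, as a results page would show them.
ranked = [m for _, m in sorted(zip(scores, messages), reverse=True)]
```

In this sample the long message made of rare words comes out on top and the plain "ok" ends up at the bottom, even though template replies are the majority.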
Q: Come on, this is surely viral marketing! There is no such thing as bad PR.
A: It seems that in marketing, as in football, everyone is an expert :) I will quote a comment by niketas from a topic that has since gone to drafts:
It seems to me that even when you catch your girlfriend in bed with another guy, you will say, “Nice prank you played on me, joker!” and go to the kitchen to put the kettle on.
Irreparable damage has been done to the operator's reputation. The Investigative Committee of the Russian Federation has begun checking the leak, and the affected subscribers whose correspondence became public are going to sue for monetary compensation. What kind of mess must you have in your head to see a benefit for MegaFon here? A fail like this cannot attract new subscribers, but it can easily lose old ones, on top of the loss of reputation and money.
UPD (07/20/2011): Updated the information on Yandex.Metrica, because evidence appeared that it was on the site (thanks to w0den). Copied my answer about PR from the comments.