
Roman Ivanov: “Blog search is quite complex”

Roman Ivanov, head of the communication services department at Yandex, talks in an interview with Habrahabr about the specifics of blog search and about the trends visible in the Runet blogosphere.

How did you get to Yandex?

Before Yandex, I worked at JetStyle, a Yekaterinburg company. I worked there as a developer, system administrator and manager, and among other things took part in creating the WackoWiki wiki engine and NPW, a blog-wiki hosting service that is innovative but incomprehensible to the average person.
In fact, it was because of these projects that Yandex noticed me: they invited Kolya Yaremko (co-author of WackoWiki and the main author of NPW) and me to come and talk, and then to work.

By the way, we still cooperate with JetStyle regularly.

Why did you create NPW? Was it an experiment?

Yes, it was an experiment of that kind: an attempt to build a service around concepts rather than around users' wishes. NPW was created by a group of people with different goals who shared one interest, or rather a need: a tool that helps a group (or groups) of people work with each other and with various texts. One goal of the project was Kolya Yaremko's scientific work; another was to create a communication environment for role-playing games; another was to create a corporate tool for organizing work with knowledge and notifications; and, finally, to occupy an interesting, innovative niche at the intersection of blog hosting and wiki.

Now the project is slowly drifting without firm control. Its ideological developers are busy with their own interesting work, and the community lives its own life. The main brain of NPW is Kolya Yaremko, but he does not have much time for the project now.

Has anyone tried to buy NPW?

The project, the site or a license? A license was bought several times. Nobody has tried to buy the site or the project.

Can you name the buyers?
I can name two companies: “Electronic City” and Abak-Press.

Your business card says "head of the communication services department." Can you explain what these services are?

These are all the services related to communication on the web. In addition, as it happened, I also lead the development of end-user software. Among the services currently open are “Yandex.Mail” (and its new version), blog search (we call it “PPB” for short), “People”, “Yandex.Lenta” and “Bookmarks”. The programs include Bar, Personal Yandex Search and Spam Defense.

How long have you led the department?

A year and a half, since January 2005.

Big department?

Right now, besides me, there are four people in the department, all of them managers. The developers have a parallel department, “development of communication services”, and there are many more of them. By the way, our developers do not report to the managers; they work together on a common cause.

The "Bookmarks" service will probably be released in a new version soon? Of everything listed above, it is perhaps the most "ancient" service, in the sense that it does not match the spirit of the times.

We traditionally do not talk about plans, so I will not comment on whether a new version will come out. As for being ancient, that is not quite true. The service was one of the very first, appearing in 2000, and it had a social side from the start: public bookmarks and so on. The only thing it lacked was tags.
In 2004 it was completely redone, becoming the personal part of Yandex.Catalog and losing all its social functions.

When will Yandex.Mail be switched to the AJAX interface that is available at mail.ya.ru?

Any user can already enable this interface as the default in the settings.
We do not plan to force the new interface on everyone in the near future; the transition will be gradual.

Why?

Because you cannot force users to switch to a completely new interface. You can tell them about the new one and recommend it, but you cannot force them.
It is unlikely that any Windows XP user would be delighted to turn on the computer tomorrow and find Vista there instead of XP, without any warning.

What is the size of the Russian-language blogosphere now, at the end of July 2006? How many new blogs in Russian appear every month? Do you have such statistics?

The size of the blogosphere is difficult to estimate exactly. We know of almost 900 thousand blogs, but there is also a noticeable number of inactive blogs that are no longer updated on the systems we began indexing some time after they appeared rather than from the start, such as Liveinternet, Damochka and Diary.Ru.

There are also several blog hosting sites that still have no RSS, such as darkdiary and gothicjournal.

So it is safe to say there are more than a million blogs, but how many more is not very clear.

How fast are LiveInternet and Diary growing? When, in your estimation, will they push Livejournal off the top spot among popular blog hosting sites?

In June we discovered 85 thousand new blogs, of which 21 thousand are on Livejournal, 25.5 thousand on Liveinternet, 16.5 thousand on Blogs@Mail.Ru, 6 thousand on Diary.Ru and 5 thousand on Rambler-Planeta.

As for when Livejournal will be overtaken, I would not dare to predict.

Rambler-Planeta and Blogs@Mail.Ru appeared at the same time, but the first, judging by the statistics, is several times smaller than the second. Why do you think the blogosphere on Rambler is growing more slowly than the one on Mail.Ru?

In fact, “Planeta” started advertising much later, about six months later, it seems. But that is not the only reason. It seems to me that Mail.Ru has a larger audience on the services from which people move easily to blogs: dating and photo hosting. In addition, as far as I could see, Mail.Ru advertised its blogs more heavily on those services.

And finally, the positioning of Blogs@Mail.Ru is easier to understand. The “Planeta” metaphor still has to be “mastered”, whereas with “Blogs” it is enough to learn one new word.

Why do you think Rambler needs “Damochka”?
Rambler is a company whose strategy I will not comment on.

I don’t know why Rambler needs love.rambler.ru, planeta.rambler.ru, mama.ru and damochka.ru all at the same time. Perhaps there is some kind of strategy.

Tell us, how does blog search work? How does indexing work? What is the name of the spider that crawls blogs?

Blog search is pretty complex. It is fundamentally different from web search: for web search, the material accumulated in previous years hardly matters, because the database is completely refreshed within a fairly short time. For blog search, on the other hand, losing the archives would be catastrophic, because blog search indexes only new entries: RSS feeds (the only source for indexing) usually contain just the last 10-20 entries, and there would be nowhere to get the old entries from.
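
Since a feed exposes only its latest entries, an archive has to be accumulated by merging every downloaded snapshot into what is already stored. Below is a minimal sketch of that idea, assuming RSS 2.0 items identified by `guid` (falling back to `link`). The names `merge_snapshot` and `make_feed` and the in-memory dictionary are purely illustrative and say nothing about how Yandex's storage actually works.

```python
import xml.etree.ElementTree as ET


def merge_snapshot(archive, rss_xml):
    """Add entries from one RSS snapshot to the archive; return how many were new.

    `archive` maps an entry key (guid, or link if guid is absent) to its title.
    Entries already in the archive are left untouched, so old posts survive
    even after they scroll out of the feed.
    """
    root = ET.fromstring(rss_xml)
    added = 0
    for item in root.iter("item"):
        key = item.findtext("guid") or item.findtext("link")
        if key is None or key in archive:
            continue
        archive[key] = item.findtext("title", default="")
        added += 1
    return added


def make_feed(ids):
    """Build a tiny RSS document containing the given entry ids (demo helper)."""
    items = "".join(
        f"<item><guid>{i}</guid><title>Post {i}</title></item>" for i in ids
    )
    return f"<rss version='2.0'><channel>{items}</channel></rss>"


archive = {}
first = merge_snapshot(archive, make_feed([1, 2, 3]))   # feed shows posts 1-3
second = merge_snapshot(archive, make_feed([3, 4, 5]))  # posts 1-2 scrolled out
print(first, second, sorted(archive))
```

After the second download, posts 1 and 2 are no longer in the feed, yet they remain in the archive. If the archive were lost, they could not be fetched again.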

What does blog search consist of?

1. A robot called blogindexd. The robot downloads RSS feeds and puts them in the repository. Its user-agent is YandexBlog/0.99.101 (compatible; DOS3.30; Mozilla/5.0; B; robot;) NN readers, where NN readers is the number of subscribers to the feed in Yandex.Lenta; this information may be of interest to the feed's author.
2. A repository for entry texts, called bulca. It is a file-system-based storage developed at Yandex.
3. A storage for meta-information (entry date, id of the entry's feed, etc.). It uses MySQL.
4. A full-text index and the search program on top of that index. This is essentially ordinary "Yandex.Server". Strictly speaking, there is not one index but several: permanent indexes, which contain the archives; static indexes, which contain the entries of recent weeks and are updated fairly rarely, about once a day; and dynamic indexes, which are updated much more often, up to every five minutes.
5. A scheduler, which decides from a feed's history when to download it again. It is a fairly intelligent program whose purpose is to download feeds as often as possible without overloading the servers we download them from. In the first months of blog search, we occasionally brought down Livejournal.com's servers by downloading RSS from them too actively.
6. A large number of additional scripts responsible for fighting spam (there is spam in blogs too), for switching off news feeds (in blog search we try to keep only feeds that contain opinions: blogs, forums, groups, etc.) and for much more.
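
The scheduler from item 5 can be pictured as a function that turns a feed's posting history into the delay before the next download. This is a rough sketch under invented assumptions: the name `next_poll_delay`, the "half of the median gap between posts" heuristic and the 5-minute and 24-hour limits are all hypothetical and are not Yandex's actual algorithm.

```python
from statistics import median


def next_poll_delay(post_times, min_delay=300, max_delay=86400):
    """Return the number of seconds to wait before downloading a feed again.

    `post_times` are UNIX timestamps of the entries seen so far.
    Frequently updated feeds are polled more often, but never more often
    than `min_delay`, so the hosting server is not overloaded.
    """
    if len(post_times) < 2:
        return max_delay  # no usable history yet: poll rarely
    times = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Poll at about half the typical gap between posts, within the limits.
    delay = median(gaps) / 2
    return int(min(max_delay, max(min_delay, delay)))


busy = next_poll_delay([0, 60, 120, 180])      # a post every minute
daily = next_poll_delay([0, 86400, 172800])    # a post every day
unknown = next_poll_delay([1000])              # a single known entry
print(busy, daily, unknown)
```

The lower bound is what protects the hosting server: even a feed that gets a new entry every minute is not requested more often than once per `min_delay` seconds.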

How many servers support blog search?

A lot. First, I don’t know the exact figure, and second, I couldn’t tell you anyway. It all started with about ten servers; now there are more.

As far as I know, you give each server a name, sometimes a funny one. What are the "blog" servers called?

Not all blog search servers have fancy names. The servers with "permanent" indexes are called puzzle1 and so on, and the rest have names that are ordinary abbreviations (db, m1a, s1 ...).
But with the front-end servers (which blog search shares with a bunch of other services) people traditionally have fun: plague, earthshake, shout, steemroll, soulcry, flamestrike, etc. As I understand it, these are all spell names from ADnD.

How much spam is there in blogs? How fast is its volume growing? Are there any statistics?

We currently know of more than a thousand spam RSS feeds, mostly hosted on large blog hosting sites.

Until March 2006, when blog search came out of beta, there was practically no spam at all, but the very day after the “launch” we had to clean up the first timid attempts by hand. Since then we have built automatic tools that let us say there is almost no spam in blog search. Of course, there is always room for improvement, and I can construct a search query that will show at least a dozen spam blogs, but the amount of spam in the visible part of the search is not growing, only shrinking. We identify about a dozen spam feeds every day and a half.

It is also worth noting that search spam in blogs is almost always aimed not at visitors coming from Yandex blog search but at web search robots, both Yandex's and probably those of other search engines: these are attempts to show robots new doorway pages or to inflate the link relevance of other doorways.

There is also non-search spam, when off-topic messages are posted in a community, but that has nothing to do with blog search.

How has the blogosphere in Russia changed over the past year? What trends are visible? What would you point out?

The most important change is the emergence of other pillars of the blogosphere besides Livejournal. A year ago there were no blogs on Mail.Ru or Rambler's Planeta, and the sizes of diary.ru and liveinternet.ru were unclear. Over the same year Liveinternet came to understand more about social services and the rest of Web 2.0, and began to change a lot.
In the same year, mobile operators (MTS and Megafon) also reached out to blogs.

It is evident that many new people have come to the blogosphere. Many of them do not know how to write well: they are not journalists, writers or “geeks”, but ordinary people with ordinary concerns.

Thanks to blog search, the connectedness of the blogosphere has increased greatly. There used to be a few large blog hosting sites and a handful (well, hundreds) of standalone bloggers; now, in two clicks, you can find links to yourself in any blog or collect opinions about one event or another from the entire blogosphere.

I am sure that many Internet-savvy companies carefully monitor bloggers' opinions. In any case, I personally monitor opinions and reviews about the Yandex services that are most interesting and important to me.

Yandex.News now shows opinions from blogs next to news stories. When did you recognize the power of blogging?

The power of blogs was recognized at Yandex when the idea of doing blog search came up. That was before I joined the company, probably in the first half of 2004. It was recognized publicly and fully when blog search came out of beta and joined the row of search "tabs" under the search bar, in early 2006.

Further integration into other services is a matter of time. Integration with news is an idea that lies right on the surface; many people came up with it while blog search has existed. Turning an idea into a specific implementation, however, is not always easy. In this case it worked out, though not always "cleanly". We are working on that.

And when did you personally feel the power of the blogosphere? Do you remember this moment?

For me personally, probably almost immediately, that is, in 2001, on LJ.
A question asked in my blog would often get a quick and good answer, and the question could be on almost any topic, from medicine for my son to choosing a scanner.

Power in some wider sense? That was then, too. On September 11, 2001, there was more information about what was happening in my friends feed and the fifa feed (a combined feed of all Russian-speaking LiveJournal users that operated at the time) than in any single media outlet.

The topic of blogs fascinated me: I took part in developing the Reg][ster engine in 2003 and NPW in 2003-2005. And then there was “Yandex”.

Why did Reg][ster "stall"? The engine had every chance of growing into a large platform, but it didn't work out?

For two main reasons. First, the code, written by Dima Smirnov, was rather sloppy and hard to extend (almost no modularity, a procedural approach, etc.). Second, there was no enthusiast willing to take over the development of "Register" once its creators ran out of enthusiasm. Mine, in particular, ran out because more interesting projects came along: WackoWiki and, later, NPW.

Corporate blogging is not very popular in Russia. Why do you think that is?

For two reasons. First, our blogging audience is not as big as in the West, although the growth in the number of people who know what a blog is is certainly impressive; see the ROMIR data showing that blog popularity has doubled in the last nine months. Second, not all managers and PR departments are ready for the openness that a corporate blog implies.

Who reads comments on corporate blog entries?

Many people: the comments go to a shared mail folder that any employee has the right to read. Judging by the replies, Elena Kolmanovskaya and Ilya Segalovich read them constantly, as do technical support staff. And I read them constantly, too.

What do people write most often? Can you recall the most bizarre feedback?

For a long time the most frequent comment was “afftar zhzhot”, in response to the post about the query-based speller. There are regular comments like “I am a newbie, I beg you for help”, to which customer support staff try to respond whenever possible.

The strangest?
Perhaps this one, although it is long for an interview.

Why do some Yandex hosts send an ICMP Echo Reply with the same TTL with which they received the request?

Just a curious example:
# traceroute -P ya.ru

7 ix2-m9.yandex.net (193.232.244.93) 55.974 ms 37.562 ms 40.819 ms
8 c3-vlan3.yandex.net (213.180.192.171) 63.987 ms 41.410 ms 80.810 ms
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 ya.ru (213.180.204.8) 61.545 ms! 48.058 ms! 49.508 ms!
Hops 9 through 15 are, as I understand it, false ones, i.e. host 213.180.204.8 (maybe there is something else in front of it) responds to ICMP with the same TTL with which the packets reach it, and therefore the replies do not come back until the TTL is doubled.
What is this for? If it is not too much trouble, please answer ... Was it done for security reasons, or is it some kind of tricky hardware? Does some load balancer behave this way?

Oh, here is a shorter one:

I receive messages on e-mail in English. Can I get letters in Russian?

Anton Antich wants to make Blogus a central place for studying the Russian blogosphere. What do you think about that?

I have known about Blogus for a long time. Anton and I have met and discussed how best to give them link counts from blog search.

I think we should let a hundred flowers bloom. Any meaningful resource around the blogosphere is good for it.

What do you think is the central place for studying the blogosphere? Take Yandex blog search: is it central?

I think in many ways our blog search is such a place. Of course, the ideal is unattainable, but one should strive for it. We think a lot about what other services are needed to become such a center for studying the blogosphere, and we are building those services.

When should we expect them?

I can't talk about timing, you understand. But judging by how briskly things have been introduced and improved on the service over the past six months, we can assume it will be fairly soon. For example, the ability to search only in blogs or only in forums with one click, right from the search results page, appeared about a month ago without any announcement. I hope our users find it useful.

Source: https://habr.com/ru/post/4223/

