Not so long ago we collected questions for Ilya Segalovich (iseg), Yandex's director of technology and development. The Habr editors picked the most interesting ones and Ilya answered them. And how he answered!
New technologies appear with enviable regularity. Do you manage to adopt them? Do you always strive to use new technologies, or do you prefer those that have stood the test of time?
I hope the question is not about software development tools but about the technologies used in Internet services. We live by the principle of “least effort for the greatest return”. As soon as we see that a “technological” solution (the word is used here as the opposite of “manual”) may be useful, even as a very first approximation, we try to give it to the user.
The art is, having come up with a technology that is not quite perfect (and no technology is ever completely perfect), to find the right wrapper for it:
- achieve acceptable quality in terms of completeness and accuracy;
- establish the boundary conditions of its applicability;
- make its behavior understandable, so that even its errors do not put users off;
- if necessary, allow the user to configure or disable it.
Novelty is only one of the criteria for judging a technology's impact. It is an important one, though: both we and our users enjoy getting something that nobody else offers. Other things being equal, it is better to give people new capabilities than to repeat old solutions, especially if the old ones are not critical for the service.
We often, and rather boldly, “throw out” to the general public things that nobody before us has done (at our scale): be it a fully self-filling address book, automatic biographies in News, or automatic keyword selection in Direct.
What are the stages of introducing a new technology at Yandex, both third-party and in-house?
Much depends on whose technology it is, ours or external. In the first case, there is a team of managers and developers who watch what is happening in the outside world and make their plans based on users' needs, their own capabilities and their ideas for new products.
One's own ideas are, of course, dearer than other people's, but often no clear line can be drawn. We frequently see a novelty released by someone else at a moment when our own beta version of it is already running.
As for external technologies, that is, cases where we lack the internal resources to solve a problem, we play an active role: we look for possible suppliers, hold a tender among them, and sometimes even buy teams and small companies that have the technology we need. It also happens the other way round: sometimes people come to us and offer a technology we are missing.
As for implementation, the workflow is roughly as follows: building a test version, debugging and tuning, settling interface issues, settling marketing issues, setting up monitoring, launching.
Tell us which new technological problems have arisen in the past few years and how you dealt with them.
I will talk about just one problem, which confronted us sharply in the spring of 2006. The task was stated simply: achieve world-class ranking quality in web search, or die as a technology company.
We managed to rewrite the search in depth (the run-time almost completely), learned how to really train it, and learned how to improve search quality in a scalable way. Scalable means that dozens or even hundreds of people can work on producing signals at the same time, and the technology we built lets them combine their efforts effectively. We began the transition to the new technology at the very start of 2007, but only from late 2007 and throughout 2008 have we been reaping the benefits of growing quality.
In 2004, when we first released a search trained on assessor judgments, we ranked by a number of factors and query reformulation methods, and the system was tuned practically by hand on a database of several hundred evaluated queries. Now we work with a database of tens of thousands of labeled queries, about two hundred signals take part in ranking, and the rules for reformulating, expanding and classifying a query cover thesauri, abbreviations, transliteration, translation, detection of the query's topic and other aspects, and much, much more.
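To make the idea of tuning ranking weights from labeled queries concrete, here is a minimal sketch. It is a toy pointwise linear model, not the method Yandex used: the three signal names, the numbers and the training procedure are all assumptions made up for illustration.

```python
# Toy illustration of learning ranking weights from assessor labels.
# The signals, labels and model are hypothetical; a real system would
# use hundreds of signals and a far more capable learner.

def score(weights, features):
    """Linear relevance score: a weighted sum of ranking signals."""
    return sum(w * f for w, f in zip(weights, features))


def train(examples, n_features, epochs=2000, lr=0.05):
    """Fit the weights by stochastic gradient descent on squared error."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in examples:
            error = score(weights, features) - label
            weights = [w - lr * error * f for w, f in zip(weights, features)]
    return weights


# Each example: ([text_match, link_signal, spam_signal], assessor_label)
labeled = [
    ([0.9, 0.8, 0.0], 1.0),
    ([0.8, 0.1, 0.0], 0.7),
    ([0.7, 0.9, 1.0], 0.1),
    ([0.1, 0.2, 0.0], 0.0),
    ([0.2, 0.1, 1.0], 0.0),
]

weights = train(labeled, n_features=3)


def rank(candidates):
    """Order (name, features) pairs by descending learned score."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


candidates = [
    ("spammy", [0.7, 0.9, 1.0]),
    ("good", [0.9, 0.8, 0.0]),
    ("weak", [0.1, 0.2, 0.0]),
]
ranked = rank(candidates)
```

With this toy data the well-matching, non-spam document comes out first, and the learned weight for the spam signal is negative. Adding a signal only means adding a column to the labeled data, which hints at why such a setup lets many people contribute signals in parallel.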
Tell us what Yandex still cannot do but would really like to.
I want us to be able to search the entire world's Internet flawlessly. I would like to have a high-quality map of the country, down to each house in any village. There is a lot more I would like, all of it things we are working on, and I don't want to wish for things we are not working on: one's desires should still be kept in check :-)
Is there an AI department at Yandex? One that also does fundamental research, not just short-term projects?
Inside Yandex we try to avoid the ambiguous and thoroughly discredited term “artificial intelligence”. What doesn't get called AI these days, even the control unit of a washing machine. But Yandex has a computational linguistics department, a fact extraction group, and a ranking group that in essence does machine-learned ranking. All these groups put into practice the algorithms commonly referred to as “artificial intelligence”.
As for fundamental research: perhaps we have not yet grown big enough to fund work on Hilbert's problems or a proof of Fermat's theorem. But our industry is no less knowledge-intensive and interesting than many deeply theoretical disciplines.
Did 2008 bring any new technologies that you found interesting and useful?
How could there be any doubt? Every year brings something new, technology included. I am not a professional market analyst and my impressions are rather subjective, but I will try to name a few things.
If we talk about hardware, then in 2008 devices appeared (more precisely, became commodity items) that change the technological landscape of the industry: (1) smartphones and cheap netbooks with Wi-Fi support; (2) 16-core servers for the price of 8-core ones; (3) the iPhone, with its unique interface solutions, came to Russia. New hardware creates a “new dimension” for software: (1) Wi-Fi usage is growing sharply, and because of this the technologies for determining a user's location are changing; (2) on the CPU-versus-disk and even the CPU-versus-memory seesaw, the server's CPU resource has become much “lighter”, so it can now be used far more intensively, often at the expense of memory and disk; (3) a powerful smartphone lets you move a significant share of signal processing to the client.
The technologies used by mass Internet portals also saw many new products and changes. The long-awaited content-based image analysis finally went live everywhere: face detection (Google, Yandex), face recognition (Google Picasa), finding near-duplicates (Yandex) or even simply similar images (Microsoft), and the first versions of web search over OCR text (Google). The first important steps were taken in audio analysis too: a version of voice search for the iPhone (Google) went live.
There were many changes in web search interfaces. There is a clear trend toward structuring and structured annotation of search results: perhaps the most notable step was taken by Yahoo! with its open and extensible SearchMonkey, and “non-list” search interfaces made a vivid appearance (Yahoo! India, Cuil, Google's experimental “Alternate views for search results”). Almost all search engines started playing video right in the results, and Yahoo! also lets you listen to music.
A hit of 2008 was query suggestions as you type. The technology finally reached all the search engines and was even adopted for the main search interface. Yandex's version, in which the desired site appears directly, without a search, is perhaps the boldest to date.
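As a rough illustration of how suggestions as you type can work, here is a minimal sketch based on a sorted list of past queries and a binary search for the typed prefix. The `Suggester` class, the sample queries and their counts are hypothetical; nothing here describes how any real search engine implements the feature.

```python
import bisect


class Suggester:
    """Prefix-based query suggestions ranked by (hypothetical) popularity."""

    def __init__(self, query_counts):
        self._counts = dict(query_counts)
        # In sorted order, all queries sharing a prefix are contiguous.
        self._queries = sorted(self._counts)

    def suggest(self, prefix, limit=3):
        """Return up to `limit` queries starting with `prefix`, most popular first."""
        start = bisect.bisect_left(self._queries, prefix)
        matches = []
        for query in self._queries[start:]:
            if not query.startswith(prefix):
                break  # left the contiguous block of matching queries
            matches.append(query)
        # Sort by descending count, breaking ties alphabetically.
        matches.sort(key=lambda q: (-self._counts[q], q))
        return matches[:limit]


suggester = Suggester({
    "yandex": 200,
    "yandex mail": 120,
    "yandex maps": 50,
    "yahoo": 30,
})

print(suggester.suggest("yan"))  # ['yandex', 'yandex mail', 'yandex maps']
```

The sorted list keeps each lookup cheap, which matters because a request is made on nearly every keystroke.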
In 2008 search engines paid close attention to working with webmasters: all the players published and supported the Robot Exclusion Protocol (REP), every search engine added many improvements and new features to its webmaster interfaces, and almost all of them began warning about infected sites.
In the world of mobile devices, the main breakthrough of 2008 was probably the introduction of a combined (Cell ID plus GPS) method of determining the user's location (Google, Yandex).
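One simple way to combine the two sources is to prefer a GPS fix when it is accurate enough and otherwise fall back to the known position of the serving cell. The sketch below assumes exactly that policy; the `Fix` type, the `CELL_DB` table, the cell identifier and the 100 m threshold are all invented for the example and do not describe any real service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fix:
    lat: float
    lon: float
    accuracy_m: float  # estimated error radius in metres
    source: str        # "gps" or "cell"


# Hypothetical database: cell id -> (lat, lon, coverage radius in metres)
CELL_DB = {
    "250-01-7701-1234": (55.7340, 37.5880, 800.0),
}


def locate(gps: Optional[Fix], cell_id: Optional[str],
           max_gps_error_m: float = 100.0) -> Optional[Fix]:
    """Pick the best available position estimate."""
    # A sufficiently accurate GPS fix wins outright.
    if gps is not None and gps.accuracy_m <= max_gps_error_m:
        return gps
    # Otherwise use the serving cell's known position, if we have one.
    if cell_id in CELL_DB:
        lat, lon, radius = CELL_DB[cell_id]
        return Fix(lat, lon, radius, "cell")
    # Last resort: a poor GPS fix (or None if there is nothing at all).
    return gps


indoors = locate(None, "250-01-7701-1234")
outdoors = locate(Fix(55.7512, 37.6184, 15.0, "gps"), "250-01-7701-1234")
```

Here `indoors` comes from the cell table and `outdoors` from GPS, so the user gets some position quickly and a precise one whenever satellites are visible.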
Anticipating possible questions, I will note that although 2008 was the year of social networks in Russia, and despite the phenomenal worldwide popularity of resources such as Facebook or Odnoklassniki, I find it hard to call their success technological. Unless, of course, we are talking about “social technologies”. :-)
What New Year's gift is Yandex preparing for SEOs? :D
Webmasters can expect many useful tools for monitoring how their sites are indexed and shown. As for the “over-optimized”, I hope we will be able to “pessimize” them more accurately and more effectively.
What, in your opinion, is the future of large companies' tooling? What is used inside Yandex for design and coding: products, methodologies, languages?
Konstantin Kolomeets (kolomeetz) answers:
If you can get a group of developers together in one place, “offline” design tools work far better than any others: whiteboard and marker, pen and paper. However, Yandex is a large company, its employees are spread across several offices, and they often work on the same task. So we make heavy use of internal communication services: wiki, mailing lists and a bug tracker. In the mailing lists (we have more internal lists than employees) we discuss ideas and problems, in the bug tracker we record tasks and bug reports, and the wiki is the glue for information from various sources. These communication tools are common to everyone. As for the development environment and platform, each team chooses what it considers most suitable. There is no single standard.
Ilya Segalovich (iseg) answers:
It is worth distinguishing between data preparation and run-time.
The run-time that serves users on the portal has traditionally looked roughly like this: search and the banner system are written in C++ and Perl, while other services also use Java and Python, in particular thanks to the flexible system of module interaction adopted on the portal (see highpower's presentation at RIT 2007).
Data preparation is mainly C++, plus a lot of scripting glue.
Are you planning to make a browser? ;)
Right now, no. :-)
Let's add some specifics. This year it has become much harder to promote new sites, and the time it takes to reach the top has grown a lot. Will Yandex keep to this policy of filtering new sites? In effect it can be called the same “sandbox” that Google has. Will your company do something about slow indexing, and will it introduce some kind of sandbox for new sites?
You can call it whatever you like, “sandbox” or something else, but the fact remains: search as a system is getting more complex at every level and stage, and more and more classifiers and decision modules are being built into it. Automatically choosing what to start indexing and include in the search, and when, is just one of these modules. It is not going anywhere. We are working on making it smarter.
Do you know Bobuk? Do you listen to Radio-T? What do you think of it?
I do know him, how could I not :) Unfortunately I don't listen to Radio-T yet: my laptop has no headphones, and I only recently got a decent device for listening to podcasts (an iPhone). Right now I am fighting with the wonderful iTunes for Windows; when I win, I will start listening.
Do you agree with Peter Norvig's approach, [exaggerating] “to improve search, let's use simpler algorithms on larger amounts of data”? Do you have a “technological philosophy” of your own?
In general I agree, with one small amendment: we still try to strike a balance between an approach built on expert knowledge and one that relies solely on statistics over large data. In fact, we are constantly looking for ways to cross the two. Expert knowledge often serves as the seed for an algorithm, and it is definitely needed for evaluation and tuning. That is why, in particular, we have linguists on staff, though not many.
What innovations do you plan to introduce so as not to lose the war with Google?
And what innovations does Google plan to introduce in order to finally win the war with us?
What is the approximate amount of data taken up by all the photos on Yandex?
I don't know exactly, let me ask :-)
Roman Ivanov (kukutz) answers:
The photos on Yandex.Photos take up almost a hundred terabytes.
Do you yourself search with Google or with Yandex?
Sorry, I didn't catch the question. What did you say? Gugl? :-)
What gadgets does Yandex's director of R&D use?
It depends on what counts as a gadget. If a gadget is a device with a microprocessor and replaceable software, then of the things I configure myself using updates from the Internet:
- Fujitsu-Siemens Lifebook S-series laptop;
- iPhone 3G and Nokia N82 smartphones (and before them an N73, a 6681 and an HP rx3715);
- Linksys WAG325N Wi-Fi router (and before it a WAG54G) for Stream;
- a portable carrier for the PPC BSR chip-tuning software for the car.
There are also gadgets that never receive updates: washing machine, garage door, refrigerator, coffee maker. The family has gadgets too, all sorts of them.
Do you use languages like Erlang, Haskell, Lisp or other functional languages in development? If so, could you say in which areas?
I personally don't, but Yandex has, for example, a project in Erlang: our Jabber server. There are also quite a few UNIX programmers who live in Emacs, so they use Lisp one way or another. Finally, XSLT is also a functional language, and it is very widely used at Yandex.
Are there plans to introduce something Web 2.0-like, as Google experimented with, so that users can vote for or against sites and raise or lower their rating for particular queries?
We have long been collecting the main stream of signals from users: page visits, query reformulations, clicks from search results, and much more. Letting the user actively signal which result he needs for a query is a fairly obvious idea. The open questions are about the signal itself: its useful volume and its resistance to manipulation. We do not yet know how to answer them.
What is Yandex's contribution to Open Source?
Ilya Segalovich (iseg) answers:
Small so far; I would like it to be bigger.
Grigory Bakunov (bobuk) answers:
One way or another we support almost all the open source projects we use. In particular, lighttpd, ejabberd, libxml2/libxslt, omniorb, django, psi, qt and many others regularly receive bug reports and patches from us. We also have open projects of our own; for example, the Chat component for Ya.Online is available in source form. More about this can be found on our experimental site, nano.yandex.ru.
There is another way in which we actively help open projects: our mirror of open source repositories, mirror.yandex.ru. Most Linux and FreeBSD users in our country download their OS packages from our servers without knowing it.
And it is probably worth adding that many people who work here develop their own open source projects or help with others'.
What can you say about young professionals? How has the general level of people who want to join you changed over the last, say, three years?
On average the level seems to me to be rising, although it is hard to feel. We have two opposing trends. On the one hand, the company is growing fast, and the era of hand-picked hiring, when each developer was interviewed by almost the entire company, is long over. Throughout 2007 and 2008 we hired a lot. And although scary sagas are written about the fierce, lengthy interviews for a job at Yandex, and we really do take every measure to keep the entry level as high as possible, such a pace of hiring inevitably lowers the bar somewhat.
On the other hand, over the past three or four years we have become much more visible, and the audience of our services has grown substantially. We have many interesting new problems, which greatly affects how attractive the work is. On top of that, salaries have risen noticeably and some developers have received options. As a result, top specialists from very well-known Moscow companies and from branches of Western companies have started moving to us, and people have even begun relocating from abroad.
By the way, the current crisis has so far not hit us so hard that we have given up hiring, as can be seen at company.yandex.ru/inside/job, although we are now focused on finding experienced, top-class specialists.
A separate story for Yandex is that the education system does not really produce specialists with our profile: statistical text processing and machine learning are subjects that are hardly taught in the right amount and form even at the best institutions. Graduates of the domestic equivalents of computer science departments, as a rule, do not know what an SVM or a language model is.
So we created our own educational institution, the School of Data Analysis (ShAD). It is a two-year master's-level course with evening classes and internships at Yandex. For two years running we have enrolled about 80 students. Some students are already working successfully at Yandex, and the teachers are actively involved in our research. Besides the ShAD, we constantly run internships in the development department and in the operations department.
Your forecast for the development of the Internet over the next 5-10 years. Should we expect significant technological revolutions and upheavals?
I am no Cassandra, nor even an “Internet technology market analyst”; I have no forecast for the next 5-10 years, and I don't want to write generalities.
Which IT specialists are you short of (fields, perhaps specialists in particular languages)? If you personally have enough (well, who wouldn't want to work at Yandex), I would like to hear how you see the market situation in general.
I have already answered above about the shortage of the specialists we need.
We are catastrophically short of system administrators with experience running large 24x7 systems. There are very few people who make wide use of machine learning in software development, whether for processing text, images or sound, or for behavioral analysis of humanoids (people, robots) or computer systems. There are few experts in statistical processing of text or social graphs. But with programming languages there are almost no problems.
Are there plans to fix the header returned by the server, which causes a protocol violation in .Net: “In end-of-line code, use CRLF; using CR or LF alone is not allowed”? And likewise the responses to POST requests to wow.ya.ru that do not contain 100-Continue, which is used to separate the header and the data sent in a POST request?
PS: on the client side one has to use the settings useUnsafeHeaderParsing = "true" and expect100Continue = "false".
It is planned. Thanks, I have filed a bug report.
What operating system and which (default) browser are installed on your laptop?
Windows XP with all service packs, plus Firefox and Thunderbird (since the days when they were NOT yet called that). When I wrote code myself, I lived in the console: the simple editor joe, g++, gdb, gprof, awk, perl. Under Windows I used MS Visual C++, and before that Borland (for a long time).
Does Yandex plan to create services similar to Google Docs?
Not yet.
Why are promising projects abandoned? Yandex.Lenta, for example.
This is partly the sad truth: not every project can be kept in an active development phase. But we still love Lenta; we recently made a PDA version of it with scaled pictures and the correct screen width. By the way, I highly recommend it: I now use it even more than the main version.
And yes, we are planning various useful changes to this service.
Ilya, please kill all the SEOs. I am tired of finding not what I am looking for but what they are promoting. Thanks.
Thanks for the advice. It is very bloodthirsty, but your point is clear. We are working seriously on “unwinding and pessimizing” excessively “promoted and optimized” sites; it is one of the main directions of our work.
Do Yandex developers have priorities (a philosophy) that they should follow when writing software, for example, in descending order: reliability, convenience (of the code), extensibility, performance? If so, what are they?
Anatoly Orlov (anatolix) answers:
It is impossible to set priorities for all occasions. For example, there are projects where a 10% speed-up in 100 lines of code is worth several million dollars in hardware, and there are others where, instead of tying up a programmer for six months, it is easier to add another 20-30 servers. In general, we want a programmer to write readable, maintainable code that is reasonably efficient but free of premature optimization. We want a sensible balance between “never touch anyone else's code, just bolt your own solution on the side” and “rewrite everything you see”.
In addition, many of our projects have their own specifics: a huge amount of data to process and the requirement to run 24x7x365.25. You have to be able to live with that, that is, to write efficient algorithms and to design an architecture without single points of failure.
This is not a wish: without it you cannot write anything at Yandex. On the other hand, nobody is born with these skills, and senior colleagues will help and teach.
Ilya Segalovich (iseg) answers:
A couple more thoughts on aesthetics and coding standards:
- ugly but working code is always better than beautiful but non-working code;
- demand stricter adherence to the rules from yourself than from others.
What new things, technically speaking, await Yandex users in the new year, 2009?
The year has not started yet; I hope we will manage to show and launch many interesting things. We'll see.
Why do so few “programmer” thingamajigs (frameworks, libraries...) come out of Yandex's walls?
Come and join us, and more “thingamajigs” will come out :-)