
Exclusive: How the Google algorithm controls the Internet

From the translator: I do not claim to be breaking new ground with this translation, and a sophisticated reader will probably not find much here that is new or unusual. Still, in my opinion, this is a good general-interest article that conveniently gathers and reviews the main milestones and principles of how search engines work. The original article was published in the March 2010 issue of Wired. Fair warning: the article is long.

Do you want to know how Google is going to change your life? Stop by the Ouagadougou meeting room on a Tuesday morning. It is decided here in Mountain View, California, at the head office of the most influential Internet company in the world, in a room filled with three dozen engineers, managers and executives who work out how to make the search engine even smarter. This year, Google is introducing about 550 improvements to its legendary algorithm, and each of them affects which results are returned. Decisions made at the weekly search quality meeting affect the results for any of your queries, whether it is the Samsung SF-755p printer, the Ed Hardy MySpace page, or even the capital of Burkina Faso, which, by the way, shares its name with this meeting room. Udi Manber, head of Google Search since 2006, leads the proceedings. The proposed changes are presented one after another, together with the results of months of testing in various countries and languages. Side-by-side screens show the results of sample queries before and after each change. When the results for the query "wah-wow guitar center" come up, Manber shouts: "It worked!"

You might think that after a decade of absolute dominance in the search market, Google could relax. The company now handles 65 percent of all searches, and it is the only one whose name is synonymous with the word “search”. But just as Google is not ready to rest on its laurels, its competitors are not ready to admit defeat. For many years the Silicon Valley giant has used its mysterious, seemingly omniscient algorithm in pursuit of its stated goal of "organizing the world's information." But over the past few years, many companies have challenged Google's core premise: that a single search engine, through technological magic and constant improvement, can satisfy any possible query. Facebook is moving in on Google's territory by betting that people trust the advice of friends more than the results of a faceless formula. By making its continuous stream of messages searchable, Twitter introduced the concept of real-time search, capturing conversations and discussions as they unfold. Yelp helps people find restaurants, dry cleaners and babysitters by searching user ratings. None of these new services is a threat in and of itself, but collectively they point to a future in which search is open and broad, ruled not by a single search engine but by a collection of specialized search services.

On top of that, the greatest threat to Google sits about one and a half thousand kilometers to the north: Bing. Microsoft's redesigned and renamed search service, whose name was meant to evoke the exclamation “Bingo!” or “Eureka!” (or perhaps the name of a famous American singer, or of the strip club from The Sopranos), launched in June of last year to rather optimistic forecasts. (The Wall Street Journal called it "more attractive than Google.") The updated look and a $100 million advertising campaign helped raise its share of the US search market from 8 to about 11 percent, and that figure will double once regulators approve the use of the Bing search engine by Yahoo.
The Bing team has focused on specific cases in which Google's algorithms do not always deliver satisfactory results. For example, while Google does an excellent job of indexing public websites, it does not have real-time access to the complex, ever-changing data on flight departures and arrivals. So Microsoft bought Farecast, a website that constantly monitors airline fares to predict whether ticket prices will rise or fall, and feeds its data into Bing's search results. Microsoft has made similar acquisitions in healthcare, regulatory documents, and the buying and selling of goods and services: areas where, by its reckoning, Google's algorithms fall short.

But even the Bing team acknowledges that when you just need to type in a search term and get relevant results, Google is still far ahead. Microsoft believes, however, that if Bing excels in a few subject areas, people will get used to turning to it for certain kinds of queries. “Algorithms are extraordinarily important in search, but they are not the only thing that matters,” says Brian MacDonald, Microsoft's vice president of search. “You don't choose a car only for its engine power.”

Google's response can be summarized in four words: “mike siwek lawyer mi”.

Amit Singhal types this koan into his company's search box. Singhal, a forty-year-old gentleman, holds the honorary title of Google Fellow, awarded four years ago for his 2001 rewrite of the search engine's code. He hits the Enter key. In an instant, a page of links appears. The top result leads to the lawyer Michael Siwek of Grand Rapids, Michigan. It is a completely unremarkable search, one of the kind that Google's servers handle billions of times a day, but it is deceptively complex. Type the same query into Bing, for example, and the first entry on the page is about US National Football League players, one of whom is named Lawyer Milloy. The results run to several pages, but not one of them links directly to Siwek.

This comparison shows the power, and even the intelligence, of Google's algorithm, which has been refined countless times. It seems to have a magical ability to understand what users want, no matter how well or badly the query is phrased. Google calls this ability search quality, and over the years the company has carefully guarded the process by which such accurate results are produced. But now I am sitting with Singhal in Building 43 on the search giant's campus, home to the search engine development team, because Google has given me an unprecedented opportunity to see how search quality is achieved. The reason is obvious: you may think the algorithm is little more than an engine, but wait until we lift the hood and see what it is capable of.

Key features of Google search


“Google's search algorithm is constantly changing and improving to produce the highest quality results. Here are some of the most significant additions and improvements since the introduction of PageRank, the page citation index.” - Steven Levy

Backrub (September 1997) - this search engine, which ran on Stanford University's servers for almost two years, was later renamed Google. Its foundation, ranking pages by the quantity and quality of incoming links, was a major technical achievement.

New algorithm (August 2001) - the search algorithm was completely overhauled to make it easier to add new ranking factors.

Nexus Analysis (February 2003) - Google's first patented feature, which ranks sites higher when they are referenced by more reputable sources.

Fritz (summer 2003) - this improvement lets Google update its index continuously rather than in large batches.

Personalized Results (June 2005) - users can let Google track their search history in order to receive personalized results.

Bigdaddy (December 2005) - an engine update that allowed more complete indexing of the Internet.

Universal search (May 2007) - building on image search, Google News and book search, the new universal search lets users get information in various formats on a single results page.

Real-time search (December 2009) - displays results from Twitter and blogs immediately after they are published.

The history of Google's algorithm begins with PageRank, a citation index of pages invented in 1997 by Google co-founder Larry Page while he was studying at Stanford. Page's legendary insight was to rank pages based on the number and importance of the links pointing to them. In this way, the collective intelligence of the Internet was used to determine the relevance of sites. The concept is simple and powerful, and as Google quickly became the most successful search engine on the Internet, Page and Sergey Brin, Google's other founder, credited PageRank as their company's fundamental innovation.
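To make the idea of ranking by "number and importance of links" concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration only, not Google's production code: the toy link graph, the damping value of 0.85 and the function name `pagerank` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    The damping factor 0.85 is a commonly quoted default, assumed here.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)  # the page with the highest score
```

In this toy graph, page "c" comes out on top because it has the most incoming links, and "a" comes second because its single incoming link is from the most important page.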

But that is not the whole story. “People latched onto the idea of PageRank because it is easy to understand,” says Manber, “but there are many other things that improve the relevance of the results.” This involves the use of certain factors, or contextual signals, that help the search engine rank millions of possible results for a query, ensuring that the most useful ones rise to the top of the list.

Web search is a multi-stage process. First, Google crawls the web, collecting information from every site it can reach. This information is then broken up and indexed (organized alphabetically, roughly as in a dictionary). A page is later found through the information it contains. Each time a user enters a query, the index is combed for matches, and a list, usually containing hundreds of thousands or millions of entries, is returned to the user. The whole trick, however, is ranking: deciding which of the answers to put at the top of the list.
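The indexing and lookup stages described above can be sketched with a tiny inverted index. This is a simplified illustration under assumed names (`build_index`, `search`) and made-up documents; it deliberately leaves out crawling and ranking.

```python
from collections import defaultdict


def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
docs = {
    1: "capital of Burkina Faso",
    2: "Burkina Faso travel guide",
    3: "guitar center locations",
}
index = build_index(docs)
matches = search(index, "burkina faso")  # documents 1 and 2 both match
```

The lookup only answers "which documents match"; deciding the order of those matches is the ranking problem the rest of the article is about.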

This is where contextual signals come to the rescue. All search engines use them, but none takes as many factors into account, or uses them as skillfully, as Google. PageRank is itself one such factor: an attribute of a web page (in this case, its importance relative to the rest of the web) that helps determine the page's relevance. Some factors now seem obvious. Early on, Google's algorithm paid particular attention to the page title, certainly an important feature for determining relevance. Another key factor is anchor text, the text of the links that lead from one page to another. As a result, “when you did a search, the page you wanted would come out on top, even if it did not itself contain the words you typed in the query,” says Scott Hassan, an early Google architect who worked with Page and Brin at Stanford. "It was cool." Later factors included the freshness of the information (for some queries, newer pages may be more useful than older ones) and location (Google can roughly determine where the user is and prefer results that are nearby). Today the search engine uses more than 200 factors to rank results.
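One simple way to picture how several factors combine is a weighted sum. The sketch below is purely illustrative: the three factors, their weights and the page data are invented for the example, and the article does not say how Google actually combines its 200-plus factors.

```python
def score(page, query_words, weights):
    """Combine a few relevance factors into one number (illustrative only)."""
    title_words = page["title"].lower().split()
    anchor_words = page["anchor_text"].lower().split()
    title_hits = sum(w in title_words for w in query_words)
    anchor_hits = sum(w in anchor_words for w in query_words)
    factors = {
        "title": title_hits / len(query_words),
        "anchor": anchor_hits / len(query_words),
        "link_score": page["link_score"],  # e.g. a PageRank-like value
    }
    return sum(weights[name] * value for name, value in factors.items())


# Invented weights; real systems tune such values from data.
WEIGHTS = {"title": 2.0, "anchor": 1.5, "link_score": 1.0}

# Hypothetical pages. The first never mentions the query in its title,
# but other sites link to it with the text "acme widgets official site".
pages = [
    {"url": "example.com/acme", "title": "Welcome",
     "anchor_text": "acme widgets official site", "link_score": 0.6},
    {"url": "example.org/blog", "title": "Thoughts on widgets",
     "anchor_text": "my blog", "link_score": 0.2},
]
query = ["acme", "widgets"]
ranked = sorted(pages, key=lambda p: score(p, query, WEIGHTS), reverse=True)
```

Here the first page wins on anchor text alone, which mirrors Hassan's point that the right page can come out on top even when it does not contain the query words itself.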

Google's engineers discovered that some of the most important factors come from the very way Google works. PageRank is regarded as a remarkable system for determining popularity: millions of people decide for themselves what to link to on the web. It is a real democracy. But Singhal notes that the engineers in Building 43 draw on another democracy as well: the work of the hundreds of millions of people who search with Google. The data that users generate while searching (which results they click, which words they replace in a query when the results do not satisfy them, how their queries depend on their location) turns out to be invaluable for identifying new factors and improving search results. The most obvious example is personalized search, a feature that uses a user's query history and location as factors in determining the relevance of a result. But for the most part, Google uses this huge mass of accumulated data to tune the algorithm. In the same way it builds an unusually deep knowledge base, which helps it interpret the meaning of complex queries.

Take, for example, the way Google identifies synonyms. “We discovered a clever trick,” Singhal says. “People change the words in their queries. For example, someone types "pictures of dogs", and then "pictures of puppies". From that we learn that "puppies" and "dogs" may be interchangeable. We also learned that if you boil water, it is hot water. Thanks to our users we were relearning semantics, and that was a very big step forward.”
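The trick Singhal describes, spotting words that users swap when they rephrase a query, can be sketched as follows. The session data and the function name `substitution_counts` are assumptions for illustration; a real system would work on vastly more data and apply statistical thresholds.

```python
from collections import Counter


def substitution_counts(sessions):
    """Count word pairs that users swap between consecutive queries.

    A pair is counted only when two consecutive queries have the same
    length and differ in exactly one word position.
    """
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.lower().split(), after.lower().split()
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # frozenset makes (dogs, puppies) and (puppies, dogs) equal.
                counts[frozenset(diffs[0])] += 1
    return counts


# Hypothetical search sessions, each a list of consecutive queries.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["cute dogs", "cute puppies"],
    ["puppies for sale", "dogs for sale"],
    ["boiling water", "hot water"],
]
counts = substitution_counts(sessions)
dog_puppy = counts[frozenset({"dogs", "puppies"})]
```

Pairs that are swapped often become synonym candidates. As the next paragraph of the article shows, such candidates still need context before they can be trusted.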

There were obstacles, however. Google's synonym system understood that a dog is similar to a puppy and that boiling water is hot. But it also concluded that a hot dog is the same as a boiling puppy. The problem was fixed at the end of 2002 with an innovative update based on the philosopher Ludwig Wittgenstein's theories about how the meaning of a word depends on its context. As Google crawled and stored billions of documents and web pages, it analyzed them to work out which words occur close to each other. The phrase "hot dog" turned up in searches that also contained "bread", "mustard" and "baseball", not "boiling mongrels". This helped the algorithm understand what "hot dog" means, along with millions of other terms. “Today, if you type "Gandhi bio", we know that bio means biography,” Singhal says. “And if you type "bio warfare", it means biological.”
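The context-based reading of "bio" can be illustrated with simple co-occurrence counts. This is a toy sketch: the four documents, the candidate expansions and the function names are assumptions, and it stands in for far richer statistics over billions of pages.

```python
from collections import Counter


def cooccurring_words(documents, term):
    """Count the words that appear in the same document as term."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        if term in words:
            counts.update(w for w in words if w != term)
    return counts


def expand(abbrev, candidates, query, documents):
    """Pick the expansion whose usual context best fits the rest of the query."""
    others = [w for w in query.lower().split() if w != abbrev]

    def support(candidate):
        context = cooccurring_words(documents, candidate)
        return sum(context[w] for w in others)

    return max(candidates, key=support)


# Hypothetical mini-corpus.
documents = [
    "gandhi biography early life and work",
    "a short biography of gandhi",
    "biological warfare treaty history",
    "biological agents used in warfare",
]
candidates = ["biography", "biological"]
gandhi_sense = expand("bio", candidates, "gandhi bio", documents)
warfare_sense = expand("bio", candidates, "bio warfare", documents)
```

With this corpus, "bio" next to "gandhi" resolves to "biography", while "bio" next to "warfare" resolves to "biological", because those are the words each expansion tends to keep company with.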

Throughout its history, Google has developed ways to add new ranking factors without disrupting users. Every couple of years there is a major change to the system, a bit like a new version of Windows coming out. It is a big event in Mountain View, but it is not made public. “Our job is like changing the engines on a plane that is flying at 1,000 kilometers per hour, ten thousand meters above the ground,” says Singhal. In 2001, Singhal rewrote Page and Brin's original algorithm to keep up with the rapid growth of the Internet, building in a system for adding new factors quickly. (One of the first factors added to the new system distinguishes commercial from non-commercial pages, giving better results to users who are looking for products or services.) That same year, engineer Krishna Bharat reasoned that links from recognized authoritative sites should carry more weight, and developed an influential factor that gives more trust to links from such sites. (It became Google's first patent.) The most recent major change, called Caffeine, reworked the indexing system to make it even easier for engineers to add new factors.

Google is known for its ingenuity in encouraging such innovations. Each year the company holds an internal demo fair called CSI (Crazy Search Ideas) to surface unusual but useful proposals. For the most part, however, improving a search engine is a constant, grinding job of sifting through bad results to work out what went wrong. One failed search has become legend: sometime in 2001, Singhal learned of poor results for the query “audrey fino”. Google kept sending users to Italian sites dedicated to Audrey Hepburn (fino means “fine” in Italian). “We realized that this is actually a person's name,” Singhal says, “but the system wasn't smart enough.”

The “audrey fino” failure led Singhal on a multiyear effort to improve the handling of proper names, which account for 8 percent of all queries. To crack the problem, he had to master the black art of bigram breakage, that is, splitting a multiword query into meaningful units. For example, “new york” is a two-word unit that belongs together (a bigram). But there is also “new york times”, which clearly signals a different kind of search. And everything changes again if the query is “new york times square”. A human spots the difference immediately, but Google does not have a basement full of hundreds of thousands of operators at little desks. It relies on an algorithm.
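The "new york times square" example shows why a greedy longest-match rule is not enough: it would grab "new york times" and strand "square". A minimal sketch of one alternative, scoring every possible split with dynamic programming, follows. The phrase scores are invented for the example; in practice they would come from query and document statistics, and nothing here is claimed to be Google's actual method.

```python
# Invented phrase scores; higher means the words more strongly belong together.
PHRASE_SCORES = {
    ("new", "york"): 5.0,
    ("new", "york", "times"): 6.0,
    ("times", "square"): 4.0,
}


def segment(query, scores=PHRASE_SCORES, max_len=3):
    """Split a query into units so that the total phrase score is maximal."""
    words = query.lower().split()
    n = len(words)
    # best[i] holds (total score, segments) for the first i words.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        for size in range(1, min(max_len, end) + 1):
            start = end - size
            chunk = tuple(words[start:end])
            # Single words are always allowed; longer chunks must be known.
            if size > 1 and chunk not in scores:
                continue
            total = best[start][0] + scores.get(chunk, 0.0)
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + [" ".join(chunk)])
    return best[n][1]


two_words = segment("new york")
newspaper = segment("new york times")
landmark = segment("new york times square")
```

With these scores, "new york times" stays together as one unit, while "new york times square" splits into "new york" and "times square", because that split has the higher total score.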

The query “mike siwek lawyer mi” illustrates how Google gets it right. When Singhal types a command that exposes the code beneath each result, it becomes clear which factors determined the top links: a bigram connection showing that this is a name, a synonym, and a geographic location. “Let's break this query down from an engineer's point of view,” Singhal explains. “We say: aha! We can split the query here. We figure out that "lawyer" is not a surname and "siwek" is not a first name. And, by the way, "lawyer" is not a city in Michigan. A lawyer is an attorney.”

This is the hard-won understanding at the heart of Google's search engine, built from the data entered by billions of users: “rock” is rock. It is also a stone, and perhaps a boulder. Spell it “rokc” and it is still “rock”. But put “little” in front of it and it becomes the capital of Arkansas (Little Rock). And Arkansas has nothing to do with an ark, unless Noah is nearby (Noah's Ark). “The holy grail of search is to understand what the user wants,” says Singhal. “That means matching not words but meanings.”

And Google keeps improving. Recently, search engineer Maureen Heymans discovered a problem with the query “Cindy Louise Greenslade”. The algorithm understood that it should look for a person, in this case a psychologist from Garden Grove, California, but it failed to place Greenslade's home page in the top ten results. Heymans found that Google was discounting the relevance of the page because Greenslade used only her middle initial there, not the full middle name given in the query. “We need to take that into account,” says Heymans. So she added a factor that looks for middle initials. Now Greenslade's home page is the fifth result.
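A check of the kind Heymans describes might look like the sketch below, which treats a middle name and its initial as a match. The function name and the exact rules are assumptions for illustration, not a description of Google's factor.

```python
def name_matches(query_name, page_name):
    """Compare two personal names, letting a middle name match its initial."""
    q = query_name.lower().replace(".", "").split()
    p = page_name.lower().replace(".", "").split()
    if len(q) != len(p):
        return False
    for i, (qw, pw) in enumerate(zip(q, p)):
        if qw == pw:
            continue
        is_middle = 0 < i < len(q) - 1
        # Only middle names may be abbreviated, in either direction.
        if is_middle and (len(qw) == 1 or len(pw) == 1) and qw[0] == pw[0]:
            continue
        return False
    return True


full_vs_initial = name_matches("Cindy Louise Greenslade", "Cindy L. Greenslade")
wrong_initial = name_matches("Cindy Louise Greenslade", "Cindy M. Greenslade")
```

The first comparison succeeds and the second fails, so a page signed with the right middle initial is no longer penalized against a query that spells the middle name out.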

Dozens of such changes now go through a well-established testing process. Hundreds of Google evaluators around the world, working from their home computers, judge the results of various queries, marking whether a change improves or worsens the results. But Google also has a far larger army of testers: the billions of users who unwittingly take part in a continuous experiment on search quality. Every time engineers test a change, they run the new algorithm on a small percentage of random users, with the remaining users serving as a large control group. There are so many experiments running that Google has had to abandon the traditional scientific principle that only one experiment should be conducted at a time. “For most Google queries, you are in several control or experimental groups at the same time,” says search quality engineer Patrick Riley, who then corrects himself: “In truth, essentially all queries are part of some test.” In other words, every time you search on Google, you are a guinea pig.
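Running many experiments at once is commonly done by hashing each user separately for each experiment, so that group membership in one test is independent of membership in another. The sketch below shows that general technique; the experiment names, percentages and function names are hypothetical, and the article does not describe Google's actual mechanism.

```python
import hashlib


def bucket(user_id, experiment, buckets=100):
    """Deterministically map a user to a bucket 0..99 for one experiment."""
    key = f"{experiment}:{user_id}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % buckets


def assignments(user_id, experiments):
    """Return the group ("treatment" or "control") for every experiment."""
    return {
        name: "treatment" if bucket(user_id, name) < percent else "control"
        for name, percent in experiments.items()
    }


# Hypothetical experiments with the percentage of users who see the change.
EXPERIMENTS = {"middle_initial_factor": 5, "fresher_results": 10}

groups = assignments("user-42", EXPERIMENTS)

# Roughly 5 percent of users should land in the first treatment group.
treated = sum(
    assignments(f"user-{i}", EXPERIMENTS)["middle_initial_factor"] == "treatment"
    for i in range(10000)
)
```

Because the experiment name is part of the hash key, a user who is in the treatment group of one test is no more or less likely to be in the treatment group of another, which is what makes overlapping tests interpretable.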

Flexibility, the ability to add new factors, update the code, and test the results all at once, is what Google says will let it withstand any competition from Bing, Twitter or Facebook. In fact, over the past six months Google has introduced more than 200 improvements, some of which appear to imitate, or even outdo, what its competitors offer. (Google says this is a coincidence and points out that it has been adding new features regularly for many years.) One of them is real-time search, eagerly awaited ever since Page said a few months ago that Google should be scanning the entire Internet every second. Now, when someone enters a query on a topic of current interest, Google adds among the ten blue links a "recent results" box: a scrolling list of freshly written items from news sites, blogs and microblogs. Here Google once again uses factors to pick the most relevant entries for the continuously updating stream. “We look at the responses to a post, how many people read it, and we determine whether the author is a person or a robot,” says Singhal. “We know how to do this because we have been doing it for ten years.”

Along with real-time search, Google has introduced other new features, including a service called Goggles, which treats photos taken with a mobile phone as search queries. It is part of the company's ongoing effort to make search a constant and ubiquitous presence in people's lives. With image and voice recognition, smartphones have become eyes and ears. If the right factors are found, anything can be searched.

Enormous computing power and bandwidth give Google an undeniable advantage. Some observers say this advantage makes it hard for new services to get from the test stage to full operation. But Manber says that infrastructure alone did not make Google the leader: "A very, very, very important key ingredient is that we hire the right people."

By all accounts, Qi Lu is one of those people. “I have tremendous respect for him,” says Manber, who worked with the 48-year-old computer scientist at Yahoo. But early last year Lu joined Microsoft to lead the Bing team. Asked about his goals, Lu, a slight man in jeans and a Bing-logo shirt, answers quietly and evenly: "It is very important to remember that this is a long journey." As he speaks he has the look of Uma Thurman in Kill Bill, and his whole bearing says, “I am not going anywhere.”

Indeed, the company that won the browser war of the last decade has great potential in search. It also has an eerie certainty that at some point people will want more than Google's algorithm can provide. “If the paradigm does not change, it will be very, very difficult to compete with the current winners,” says Harry Shum, Microsoft's head of search engine development. "But, on the other hand, the paradigm may change."

In any case, even if such a change comes, Google's algorithm can adapt to it. That is why Google is such a formidable opponent: it has built a machine nimble enough to absorb almost any threat, and so far its competitors have nothing to match the quality of Google's search results. Anyone can come up with a new way to buy plane tickets. But only Google knows how to find Mike Siwek.

Source: https://habr.com/ru/post/95833/

