Technosphere lectures. Semester 2: Information Retrieval (Spring 2016)

We take the quality of a modern search engine for granted, yet it is a sophisticated hardware and software system whose creators had to solve a huge number of practical problems, from handling the sheer volume of data to accounting for the nuances of how people perceive search results. In the second-semester Technosphere course "Modern methods and means of building information retrieval systems", we cover the main methods used to build search engines. Some are good examples of ingenuity; others show where and how modern mathematical tools can be applied.

The course authors, who built the search engine for the Mail.Ru portal, share their own experience of developing artificial intelligence systems. The course shows how interesting and exciting it is to build a search engine and solve natural language text processing problems, and which methods and tools are used to solve them.

Lecture 1. “Introduction to information retrieval”

Alexey Voropaev, Mail.Ru Search Recommendations Team Leader, defines the concept of information retrieval and reviews existing search engines, talks about indexing and search clusters.

Lecture 2. “Features of web search. Crawler architecture”

In this lecture you will learn about the history of search engines, the fundamentals of modern web search, user preferences, and the empirical evaluation of search results. The lecture is given by Jan Kissel, head of the Mail.Ru Search Infrastructure Group.

Lecture 3. "Crawler prioritization"

Dmitry Soloviev, lead developer in the ranking group, talks about search engines. He gives an overview of crawlers and covers the analysis of site clusters, experiments with quotas, definitions of index quality, and more.

Lecture 4. "The use of self-organizing maps in a search engine"

Dmitry Soloviev addresses the problem of data analysis and visualization, discusses ways to use self-organizing maps in a search engine, and leads a seminar on identifying and analyzing segments for prioritization.

Lecture 5. "Search for duplicate documents"

Jan Kissel defines duplicates and their types and shows an example of shingling (converting documents into sets). He covers every step of identifying similar documents, including minhashing (converting large sets into short signatures) and techniques for scaling.
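
The two steps named above can be illustrated with a minimal Python sketch. It is not the lecturer's implementation: the word-level shingle size `k`, the number of hash functions `num_hashes`, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Convert a document into a set of k-word shingles (k is an assumed default)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _seeded_hash(seed, shingle):
    """One member of a hash family: a 64-bit BLAKE2b digest of 'seed:shingle'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Short signature: for each hash function keep the minimum hash over the set."""
    return [min(_seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_hashes)]


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumps over the sleepy dog")
    print("exact:", jaccard(doc_a, doc_b))
    print("estimate:", estimated_jaccard(minhash_signature(doc_a),
                                         minhash_signature(doc_b)))
```

The estimate approaches the exact Jaccard similarity as `num_hashes` grows; comparing fixed-length signatures instead of full shingle sets is what makes the approach practical for large collections.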

Lecture 6. “Search for duplicate documents. Part 2”

A continuation of the previous lecture. Jan talks about methods for removing page boilerplate, text normalization, and global detection, and ends the lecture with what to do next about duplicated text and images.

Lecture 7. “Indexing and Boolean Search”

This lecture covers approaches to indexing and compression: what a search index is, how to intersect lists quickly, and which compression options are used in web search. The lecture is given by Jan Kissel.
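
As a rough illustration of these ideas, the Python sketch below intersects two sorted lists of document IDs with a linear merge and compresses a list with delta plus variable-byte encoding. Both are common textbook techniques picked here as assumptions; the lecture may cover different or more advanced variants, and the toy `index` dictionary is hypothetical.

```python
def intersect(a, b):
    """Boolean AND of two sorted doc-ID lists using a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def varbyte_encode(doc_ids):
    """Store gaps between sorted doc IDs, 7 bits per byte, low bits first.
    The high bit marks the last byte of each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)
    return bytes(out)


def varbyte_decode(data):
    """Inverse of varbyte_encode: rebuild the doc IDs from the encoded gaps."""
    doc_ids = []
    value = shift = prev = 0
    for byte in data:
        if byte & 0x80:  # last byte of the current gap
            value |= (byte & 0x7F) << shift
            prev += value
            doc_ids.append(prev)
            value = shift = 0
        else:
            value |= byte << shift
            shift += 7
    return doc_ids


if __name__ == "__main__":
    # Hypothetical toy index: term -> sorted list of document IDs.
    index = {"search": [1, 3, 5, 9, 300], "engine": [3, 4, 5, 10, 300]}
    print(intersect(index["search"], index["engine"]))
    packed = varbyte_encode(index["search"])
    print(len(packed), varbyte_decode(packed))
```

Storing gaps rather than absolute IDs keeps most values small, so most entries fit in a single byte.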

Lecture 8. "Methods for optimizing the inverted index"

Jan continues the topic of indexing. This time the lecture covers building the index dictionary, collecting results at web scale, and the specifics of working with memory and writing daemons.

Lecture 9. "Cleaning the search index: antispam"

The first lecture on content filtering. This part covers how spam affects a search engine and how to counter it. Dmitry Soloviev shows methods for identifying spam sites and for detecting spam by analyzing page content.

Lecture 10. "Cleaning the search index: antiporn"

The second part on filtering, this time fighting porn. Unlike spam, this task calls for different approaches. The lecture covers methods for filtering queries, web pages and images, including methods based on convolutional neural networks.

Lecture 11. “Semantic markup. Sentence boundary detector”

Applied linguist Igor Andreev devotes his lecture to snippets (fragments of text used to describe a link in search results). Igor talks about search engine design, the semantic web, RDF (Resource Description Framework), semantic markup, and how all of this relates to snippets.

Lecture 12. "Building snippets"

The second part of the discussion of snippets: automatic text summarization, the move to generating organic snippets, and a brief look at how a forward index is organized. The last part covers evaluating snippet quality.

Lecture 13. “Typo correction. Suggestions. Reformulations”

Evgeny Chernov, head of the query analysis team, devotes two lectures to correcting typos in search queries. Evgeny talks about the types of errors, simple typo lookup, Levenshtein distance, statistical language models, generating replacement candidates, and the different types of corrections.
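
For readers unfamiliar with Levenshtein distance, here is a minimal Python sketch of the classic dynamic-programming algorithm, followed by a naive corrector that picks the closest dictionary word. The `closest_word` helper and its tiny vocabulary are hypothetical illustrations; a production corrector would also weigh candidates with a language model, as the lecture description suggests.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def closest_word(query, vocabulary):
    """Naive correction: return the vocabulary word with the smallest distance."""
    return min(vocabulary, key=lambda word: levenshtein(query, word))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(closest_word("serch", ["search", "speech", "march"]))
```

Scanning the whole vocabulary costs one distance computation per word, which is why real systems first narrow down the candidate set before scoring.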

Lecture 14. “Suggestions, reformulations, classifiers”

In the final lecture, Evgeny Chernov talks about search suggestions, reformulations (sets of queries that have something in common with a given one), and a whole group of different classifiers.

The playlist of all lectures is available at the link. As a reminder, current lectures and programming master classes by our IT specialists from the Technopark, Technosphere and Technotrack projects continue to be published on the Technostream channel.

Source: https://habr.com/ru/post/312972/
