Technosphere Lectures. Information Retrieval. Part 1 (Spring 2017)

We are releasing a new set of video lectures from our educational project, Technosphere. This time the course is dedicated to information retrieval.

Every Internet user has experience with search engines: we regularly enter queries and get results. Search engines have become so familiar that it is hard to imagine a time without them, and the quality of modern search is taken for granted, although fifteen years ago things were completely different. Yet a modern search engine is a complex combination of software and hardware whose creators had to solve a huge number of practical problems, ranging from the sheer volume of data processed to the nuances of how people perceive search results.

In this course, we cover the basic methods used to build search engines. Some of them are good examples of ingenuity; others show where and how modern mathematics can be applied.

List of lectures:

Course instructors:

Jan Kisel, Mail.Ru Search Infrastructure Manager;
Julia Sergukova, Mail.Ru Search Infrastructure Programmer;
Dmitry Solovyov, Lead Developer of Mail.Ru Search Ranking Group;
Andrei Murashev, programmer of Mail.Ru Search recommender systems;
Mikhail Plekhanov, Mail.Ru Search Infrastructure Programmer;
Evgeny Chernov, Head of the Mail.Ru Search Query Analysis Department.

Lecture 1. Introduction

Overview lecture on the importance of information retrieval.

Lecture 2. Features of Web Search. Search Robot Architecture

The first part of the lecture is devoted to web search: it gives some historical background, briefly touches on advertising in search, and describes how web search is organized. The second part is devoted to search robots (spiders, also called crawlers): defining the data collection task, then downloading, updating, and storing the data.

Lecture 3. Search Robot Scheduler

The lecture formulates the task of scheduling the search robot's work, considers Focused Crawler algorithms, and analyzes the "Garden of Stones" algorithm. Quota issues are also covered.

Lecture 4. Indexing and Boolean Search

The lecture considers the composition and purpose of the search index and briefly discusses search engine hardware. It also covers fast intersection of blocks, index compression, and techniques for improving the compression ratio.
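To make the ideas of intersection and compression more concrete, here is a minimal sketch. It assumes posting lists are stored as sorted lists of document IDs, uses a plain two-pointer merge for the intersection (the lecture's block-based scheme is likely more elaborate), and shows delta (gap) encoding as one common preparatory step before compression. The function names and sample data are illustrative and not taken from the course.

```python
def delta_encode(doc_ids):
    """Replace sorted doc IDs with gaps between neighbours.

    Gaps are smaller numbers than the IDs themselves, so a later
    integer compression step needs fewer bits for them.
    """
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Restore the original sorted doc IDs from the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def intersect(a, b):
    """Two-pointer intersection of two sorted posting lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical posting lists for two terms.
postings_cat = [2, 5, 9, 14, 30]
postings_dog = [5, 9, 10, 30, 41]

common = intersect(postings_cat, postings_dog)   # documents containing both terms
gaps = delta_encode(postings_cat)                # [2, 3, 4, 5, 16]
```

The intersection runs in time proportional to the combined length of the two lists, which is why keeping posting lists sorted matters.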

Lecture 5. Boolean index and search

A continuation of the previous lecture. The topic of compression comes up again: the Simple9 algorithm and working with binary data in Python are considered. The second part of the lecture is devoted to the search dictionary: stop words are introduced and aspects of dictionary storage are discussed. The third part of the lecture describes the query tree: what it is, how the tree is executed, and how queries are parsed.

At the end of the lecture, you will learn how the overall indexing workflow is put together.
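As a rough illustration of how a query tree can be executed, the sketch below evaluates a Boolean query against a tiny inverted index. It assumes the query has already been parsed into nested tuples and that posting lists fit in memory as Python sets. The tuple format, the `evaluate` function, and the sample index are hypothetical and not the representation used in the course.

```python
def evaluate(node, index, all_docs):
    """Recursively evaluate a Boolean query tree.

    node is either a term (str) or a tuple: ("AND", child, ...),
    ("OR", child, ...) or ("NOT", child).
    index maps a term to the set of doc IDs that contain it.
    """
    if isinstance(node, str):
        return index.get(node, set())
    op, *children = node
    results = [evaluate(child, index, all_docs) for child in children]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical inverted index over four documents.
index = {
    "search": {1, 2, 3},
    "engine": {2, 3, 4},
    "spam": {3},
}
all_docs = {1, 2, 3, 4}

# search AND engine AND NOT spam
query = ("AND", "search", "engine", ("NOT", "spam"))
matches = evaluate(query, index, all_docs)
```

A production engine would typically stream over sorted posting lists instead of materializing sets, but the recursive shape of the evaluation stays the same.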

Lecture 6. Finding duplicates

Finding duplicates is a big topic, split across two lectures. First, you will learn the terminology, look at examples of duplicates, and learn about shingling. Then practical methods for finding duplicates are considered: improvements to the algorithms, the Minshingle signature generation method, the Jaccard measure, and the Broder algorithm.
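The sketch below shows how shingling, the Jaccard measure, and min-hash signatures fit together. It is written in the spirit of min-wise hashing and is not necessarily the exact variant presented in the lecture. Word-level shingles of length 3, salted MD5 as the family of hash functions, and 64 hashes per signature are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard measure: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function keep the minimum hash over all shingles.

    Assumes shingle_set is non-empty (the text has at least k words).
    """
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """The share of matching positions estimates the Jaccard measure."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

set_a, set_b = shingles(doc_a), shingles(doc_b)
exact = jaccard(set_a, set_b)  # 4 shared shingles out of 10 distinct ones: 0.4
approx = estimate_similarity(minhash_signature(set_a), minhash_signature(set_b))
```

The estimate only approaches the exact Jaccard value as the number of hash functions grows; the benefit is that fixed-length signatures are much cheaper to store and compare than full shingle sets.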

Lecture 7. Finding duplicates (part 2)

This lecture is devoted to finding duplicates in very large document collections. We consider a method for finding near-duplicates (Locality-Sensitive Hashing), discuss algorithms with an indivisible signature, and conclude by comparing the features of the different algorithms.
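One common way to apply Locality-Sensitive Hashing to document signatures is the banding technique, sketched below. It assumes each document already has a fixed-length integer signature (for example, a min-hash signature); the lecture may use a different LSH scheme. The function name, the number of bands, and the toy signatures are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands):
    """Return pairs of doc IDs whose signatures agree on at least one band.

    signatures maps a doc ID to a list of ints; all lists have equal length.
    Only the returned candidate pairs need a full similarity check, which
    avoids comparing every document with every other one.
    """
    sig_len = len(next(iter(signatures.values())))
    if sig_len % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = sig_len // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        start = band * rows
        for doc_id, sig in signatures.items():
            # Documents with an identical slice land in the same bucket.
            buckets[tuple(sig[start:start + rows])].append(doc_id)
        for doc_ids in buckets.values():
            for pair in combinations(sorted(doc_ids), 2):
                candidates.add(pair)
    return candidates


# Toy signatures: "a" and "b" agree on the first and last band.
signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 6],
    "c": [7, 7, 7, 7, 7, 7],
}
pairs = lsh_candidate_pairs(signatures, bands=3)
```

More bands with fewer rows each make the filter more permissive (more candidates, fewer missed duplicates); fewer, wider bands make it stricter.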

Lecture 8. Filtering pornography

The lecture begins by explaining why it is important to always filter pornographic material and discusses common approaches to the problem. It then describes techniques for filtering web pages, queries, and images.

Lecture 9. Antispam

This is another highly relevant topic. First, the reasons spam exists are considered and the problem is outlined. The lecture covers how spam affects search engines and ways to counter that effect. You will learn how to detect spam by analyzing page content and how to identify spam sites. Methods for combating fraud and spam applications are also considered.

Lecture 10. Snippets

In this lecture you will learn what search snippets are and what is recommended when designing search results. The main elements of the SERP are discussed, the “semantic web” is explained, and semantic markup (microdata) on the page is considered. The lecture ends with inorganic snippets and sentence boundary detection.

Lecture 11. Building Snippets

Continuing the topic of snippets. This time you will learn what text summarization is; organic snippets and the forward (direct) index are considered, and a method for evaluating snippet quality is discussed.

Lecture 12. Correcting typos in queries

The lecture is devoted to methods for detecting and correcting typos in the queries users enter.
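A simple baseline for typo correction, shown here only as an illustration, picks the dictionary word closest to the input by Levenshtein edit distance and breaks ties by word frequency. The dictionary, its frequencies, and the distance threshold are made up; production systems usually add error models and query context, and the lecture's methods may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, preferring more frequent ones.

    dictionary maps a word to its frequency. The input is returned
    unchanged if it is already known or nothing is close enough.
    """
    if word in dictionary:
        return word
    best, best_key = word, None
    for candidate, freq in dictionary.items():
        distance = edit_distance(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -freq)  # smaller distance first, then higher frequency
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best


# Hypothetical word frequencies collected from a query log.
dictionary = {"search": 900, "engine": 500, "snippet": 120}

fixed = [correct(w, dictionary) for w in "serach enigne".split()]
```

Scanning the whole dictionary is only practical for small vocabularies; larger ones need an index over candidates, such as character n-grams.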

Lecture 13. Suggestions, reformulations, classifiers

The last lecture of the course is devoted to generating suggestions while the user is typing a search query; methods for reformulating queries to improve search results are also considered. Finally, various query classifiers are discussed.

The playlist of all lectures is available at the link. As a reminder, current lectures and programming master classes from our IT specialists in the Technopark, Technosphere, and Technotrack projects are still published on the Technostream channel.

Source: https://habr.com/ru/post/329072/
