
Technosphere Lectures. Information Retrieval. Part 2 (Spring 2017)



We present the second part of our training course on information retrieval.


Every Internet user has experience with search engines: we regularly type in queries and get results. Search engines have become so familiar that it is hard to imagine a time when they did not exist, and the quality of modern search is taken for granted, even though fifteen years ago things were completely different. Yet a modern search engine is a complex software and hardware system, and its creators had to solve a huge number of practical problems, from the sheer volume of data being processed to the nuances of how people perceive search results.


In this course we cover the basic methods used to build search engines. Some of them are fine examples of ingenuity; others show where and how modern mathematical tools can be applied.


List of lectures:


  1. Linguistics. Text Processing Basics
  2. Collocations, N-grams, hidden Markov models
  3. Text ranking. Language models
  4. Evaluation of search quality. Splits. Assessors
  5. Link-based ranking
  6. Behavioral ranking
  7. Machine learning in ranking. Part 1
  8. Machine learning in ranking. Part 2
  9. Search using neural networks
  10. Tricky text ranking models
  11. Multimedia search

Course lead:



Lecture 1. Linguistics. Text Processing Basics



The first lecture introduces the stages of ranking and the basic terminology. You will get acquainted with the main stages of linguistic document processing: normalization and tokenization. The lecture covers query processing, encoding conversion, and entity extraction, and discusses language detection, synonym identification, query expansion, and suffix stripping. Lemmatization and a number of other linguistic text-processing tasks are also considered.
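
To make these steps concrete, here is a minimal Python sketch of tokenization, case normalization, and crude suffix stripping; the suffix list and the sample sentence are illustrative assumptions, not part of the course materials.

```python
import re

# Illustrative pipeline: tokenization, case normalization, and crude suffix
# stripping as a stand-in for real stemming/lemmatization.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed toy suffix list

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Strip one known suffix; real stemmers (e.g. Porter) are far more careful."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc = "Search engines are indexing billions of documents."
tokens = tokenize(doc)
print(tokens)                     # normalized tokens
print([stem(t) for t in tokens])  # crudely stemmed tokens
```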


Lecture 2. Collocations, N-grams, hidden Markov models



The second lecture covers collocations and methods for finding them in text, N-grams, Markov models for text processing, hidden Markov models, and tagging.
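
As an illustration of the collocation topic, the sketch below ranks bigrams of a toy corpus by pointwise mutual information (PMI); the corpus is made up, and PMI is only one of several association measures such a course might cover.

```python
import math
from collections import Counter

# Toy corpus; whitespace tokenization keeps the example short.
corpus = ("new york is a big city . "
          "new york city never sleeps . "
          "a big city is loud .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def pmi(w1, w2):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_joint = bigrams[(w1, w2)] / (total - 1)
    return math.log2(p_joint / ((unigrams[w1] / total) * (unigrams[w2] / total)))

# Rank observed bigrams by PMI; high-PMI pairs are collocation candidates.
for w1, w2 in sorted(bigrams, key=lambda b: pmi(*b), reverse=True)[:3]:
    print(w1, w2, round(pmi(w1, w2), 2))
```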


Lecture 3. Text ranking. Language models



You will learn what ranked retrieval is and get acquainted with vector and probabilistic ranking models, as well as latent models.
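
As a rough illustration of the language-model approach, the sketch below scores documents by smoothed query likelihood; the toy documents and the Jelinek-Mercer smoothing weight are assumptions made for the example, not values from the lecture.

```python
import math
from collections import Counter

# Two toy documents and a smoothing weight chosen for the example.
docs = {
    "d1": "information retrieval ranks documents by relevance".split(),
    "d2": "neural networks learn representations of documents".split(),
}
collection = [w for d in docs.values() for w in d]
coll_freq = Counter(collection)
LAMBDA = 0.7  # weight of the document model vs. the collection model

def score(query, doc):
    """log P(query | doc) under a smoothed unigram language model."""
    tf = Counter(doc)
    result = 0.0
    for w in query.split():
        p = LAMBDA * tf[w] / len(doc) + (1 - LAMBDA) * coll_freq[w] / len(collection)
        result += math.log(p) if p > 0 else float("-inf")
    return result

query = "documents relevance"
for name in sorted(docs, key=lambda d: score(query, docs[d]), reverse=True):
    print(name, round(score(query, docs[name]), 3))
```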


Lecture 4. Evaluation of search quality. Splits. Assessors



We formulate the task of evaluating search quality and discuss types of metrics and standard test collections. You will learn about methodologies for evaluating binary and ranked retrieval and get acquainted with benchmark tests and assessors. The lecture also covers Discounted Cumulative Gain, A/B testing, and splits.
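
For concreteness, here is a small sketch of DCG and nDCG as they are commonly defined; the relevance grades below are invented assessor labels.

```python
import math

def dcg(relevances, k=None):
    """DCG@k with the common 2^rel - 1 gain and log2(position + 1) discount."""
    rels = relevances[:k] if k else relevances
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Invented assessor grades, listed in the order the system ranked the documents.
grades = [3, 2, 3, 0, 1, 2]
print("DCG@5  =", round(dcg(grades, 5), 3))
print("nDCG@5 =", round(ndcg(grades, 5), 3))
```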


Lecture 5. Link-based ranking



The lecture begins with a historical look at how link-based ranking emerged. It covers the problem posed by the diversity of search queries that need to be ranked. You will learn how anchor text is indexed, what the link graph is and how to build it, and get acquainted with the HITS algorithm. A large part of the lecture is devoted to the central task of computing PageRank, and finally the computation of SiteRank is covered.
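
As an illustration of the PageRank computation, here is a minimal power-iteration sketch on a toy link graph; the graph and the damping factor of 0.85 are assumptions made for the example.

```python
# Toy link graph: page -> pages it links to (no dangling nodes for simplicity).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
DAMPING = 0.85  # assumed damping factor
pages = list(graph)
rank = {p: 1.0 / len(pages) for p in pages}

# Power iteration: repeatedly redistribute rank along outgoing links.
for _ in range(50):
    new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
    for page, outlinks in graph.items():
        share = DAMPING * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, value in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(value, 3))
```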


Lecture 6. Behavioral ranking



The lecture explains where to obtain information about user behavior and how to use this data. It covers the task and methods of building a model of user behavior and analyzing search sessions. Behavioral click models are discussed: CTR, baseline, cascade, DCM, UBM, CCM, GCM, CRA, PRM, MEM, JRE. The models are compared and their advantages and disadvantages examined. The relevance and attractiveness of search results for the user is modeled with a Dynamic Bayesian Network. The lecture also covers computing ClickRank and BrowseRank, and finally eye tracking while users view a page.
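
To give a feel for click models, the sketch below estimates document attractiveness under a simple cascade model from a synthetic click log; the real models listed above (DCM, UBM, and the rest) are considerably more elaborate.

```python
from collections import defaultdict

# Synthetic session log: (ranked results, clicked document or None).
sessions = [
    (["d1", "d2", "d3"], "d2"),
    (["d1", "d2", "d3"], "d1"),
    (["d2", "d1", "d3"], None),
    (["d1", "d3", "d2"], "d3"),
]

examined = defaultdict(int)  # how often a document was examined
clicked = defaultdict(int)   # how often it was clicked

for results, click in sessions:
    for doc in results:
        examined[doc] += 1
        if doc == click:
            clicked[doc] += 1
            break  # cascade assumption: examination stops at the first click
    # with no click, the whole list counts as examined (a simplifying assumption)

for doc in sorted(examined):
    print(doc, "attractiveness:", round(clicked[doc] / examined[doc], 2))
```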


Lecture 7. Machine learning in ranking. Part 1



The lecture introduces the terminology and formulates the ranking task itself, then considers the features needed for ranking. It examines the DCG measure and the pointwise and pairwise approaches, and discusses the linear ranking SVM and the RankNet and LambdaRank techniques. Problems of overfitting, positive feedback, and noisy data are considered. The lecture then turns to active machine learning: density sampling, self-organizing maps, balancing a dataset with an SOM, and the Query-by-Bagging algorithm.
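
As a taste of the pairwise approach, here is a minimal RankNet-style sketch that trains a linear scorer on synthetic document pairs; the features, labels, and learning rate are illustrative assumptions, not course data.

```python
import math

# Synthetic pairs: (features of the better document, features of the worse one).
pairs = [
    ([1.0, 0.2], [0.3, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
    ([0.9, 0.4], [0.4, 0.8]),
]
weights = [0.0, 0.0]
LEARNING_RATE = 0.1  # assumed for the example

def score(w, x):
    """Linear scoring function."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Minimize the RankNet-style pairwise loss log(1 + exp(-(s_better - s_worse))).
for _ in range(200):
    for better, worse in pairs:
        diff = score(weights, better) - score(weights, worse)
        grad_coeff = -1.0 / (1.0 + math.exp(diff))  # d(loss)/d(diff)
        for i in range(len(weights)):
            weights[i] -= LEARNING_RATE * grad_coeff * (better[i] - worse[i])

print("learned weights:", [round(w, 3) for w in weights])
```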


Lecture 8. Machine learning in ranking. Part 2



Continuing the previous lecture, the YetiRank ranking algorithm is considered and compared with the previously discussed LambdaRank. Next, you will learn about the so-called listwise approach to ranking: the SoftRank, AdaRank, and ListNet algorithms. Finally, the three approaches are compared: pointwise, pairwise, and listwise.
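
To illustrate the listwise idea, the sketch below computes a ListNet-style cross-entropy between the top-one distributions induced by model scores and by relevance labels; the scores and labels are synthetic.

```python
import math

def softmax(values):
    """Top-one probabilities induced by a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(model_scores, relevance_labels):
    """Cross-entropy between label-induced and score-induced distributions."""
    p_true = softmax(relevance_labels)
    p_model = softmax(model_scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))

labels = [3.0, 1.0, 0.0, 2.0]       # synthetic assessor grades for one query
good_scores = [2.9, 0.8, 0.1, 2.1]  # roughly agrees with the labels
bad_scores = [0.1, 2.5, 3.0, 0.2]   # disagrees with the labels

print("loss for the good ranking:", round(listnet_loss(good_scores, labels), 3))
print("loss for the bad ranking: ", round(listnet_loss(bad_scores, labels), 3))
```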


Lecture 9. Search using neural networks



The lecture is devoted to searching for information in photographs of people. It describes the problem setting, how photos are prepared for analysis, and various approaches to analyzing them with neural networks.
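
A common final step in such neural pipelines is nearest-neighbour search over embedding vectors; the sketch below shows this with cosine similarity over synthetic vectors standing in for real network outputs, and is only an assumption about how the pieces fit together, not the method from the lecture.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Synthetic embeddings standing in for the output of a trained network.
indexed_photos = {
    "photo_a": [0.9, 0.1, 0.2],
    "photo_b": [0.1, 0.8, 0.3],
    "photo_c": [0.2, 0.2, 0.9],
}
query_embedding = [0.85, 0.15, 0.25]  # embedding of the photo being searched for

best = max(indexed_photos, key=lambda p: cosine(indexed_photos[p], query_embedding))
print("closest match:", best)
```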


Lecture 10. Tricky text ranking models



We consider the shortcomings of classic text-ranking models and the drawbacks of LSA and Word2vec. The lecture then discusses unsupervised ranking models: Doc2vec and semantic hashing. Next come ranking models based on machine translation: what statistical machine translation is, how the text is processed, the WTM algorithm, and word- and phrase-based translation models. The final part of the lecture is devoted to ranking models based on neural networks: the Siamese neural network is discussed, and the DPM, DSSM, and CLSM models are considered.
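
As a rough illustration of the unsupervised "embed and compare" idea behind models such as Word2vec and Doc2vec, the sketch below represents documents by averaged word vectors and ranks them by cosine similarity to the query; the tiny vectors are synthetic, not trained embeddings.

```python
import math

# Tiny synthetic word vectors; real systems would use trained embeddings.
word_vectors = {
    "search": [0.9, 0.1], "engine": [0.8, 0.2], "ranking": [0.7, 0.3],
    "music": [0.1, 0.9], "audio": [0.2, 0.8],
}

def embed(tokens):
    """Represent a text as the average of its word vectors."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

docs = {"d1": "search engine ranking".split(), "d2": "music audio".split()}
query_vec = embed("search ranking".split())
for name in sorted(docs, key=lambda d: cosine(query_vec, embed(docs[d])), reverse=True):
    print(name, round(cosine(query_vec, embed(docs[name])), 3))
```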


Lecture 11. Multimedia search



The lecture consists of two parts. The first is devoted to annotation-based search for images, audio, and video. The second is devoted to content-based search, again for images and audio.




A playlist of all the lectures is available at the link. As a reminder, up-to-date lectures and master classes on programming from our IT specialists in the Technopark, Technosphere, and Tekhnotrek projects continue to be published on the Tekhnostrim channel.


Other Technosphere courses on Habr:



Information about all of our educational projects can be found in a recent article.



Source: https://habr.com/ru/post/329352/

