
Technosphere lectures. Semester 2: Modern methods and tools for building information retrieval systems



Our educational column is back on air. This time we invite you to get acquainted with the next Technosphere course, dedicated to information retrieval. The goal of the course is to cover the main methods used in building search engines. Some of them are fine examples of ingenuity, others show where and how the modern mathematical apparatus can be applied. Course lecturers: Alexey Voropayev, Vladimir Gulin, Dmitry Soloviev, Igor Andreev, Alexey Romanenko, Jan Kisel.

Lecture 1. Introduction to information retrieval. Search Engine Architecture Overview


Definition of the information retrieval task. Examples of search engines. Tasks related to information retrieval. The history of search engines. The logical model of information retrieval and its tasks. Principles of boolean search. The term-document matrix. The inverted index: dictionary and postings (coordinate) blocks. Building an inverted index. Token splitting and sorting. Dictionaries and postings blocks.
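
To make the boolean search and inverted index topics more concrete, here is a minimal sketch in Python (the toy collection and query are hypothetical): it builds sorted postings lists and answers an AND query by merge-style intersection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to a sorted list of document ids (postings)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

docs = ["new home sales top forecasts",        # hypothetical toy collection
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
print(intersect(index["home"], index["july"]))  # -> [1, 2]
```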



Lecture 2. Linguistics


What linguistics is and what its tasks are. The history of the origin and development of linguistics as a science. Problems solved by linguistics and its branches. General linguistics: phonetics, phonology, morphology, syntax, semantics, pragmatics. Historical linguistics. Linguistic typology. Sociolinguistics. Dialectology. Lexicography. Psycholinguistics. Mathematical linguistics. Statistical linguistics. Approaches to language: rationalist and empirical. Morphology. Corpus linguistics. Concordances, Zipf's laws, their corrections and the Mandelbrot formula.
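
Zipf's law, mentioned above, says that the frequency of the i-th most frequent word is roughly proportional to 1/i; the Mandelbrot formula adds correcting parameters. A minimal sketch of checking this on a corpus (the file name corpus.txt is a hypothetical placeholder):

```python
from collections import Counter
import re

# corpus.txt is a hypothetical file name; any large plain-text corpus will do
with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"\w+", f.read().lower())

counts = Counter(words).most_common()
for rank, (word, freq) in enumerate(counts[:20], start=1):
    # Under Zipf's law, rank * freq should stay roughly constant
    print(f"{rank:>4} {word:<15} {freq:>8} {rank * freq:>10}")
```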



Lecture 3. Basics of text processing


Document encoding criteria. Levels of linguistic analysis. Tokens and terms. Language detection: graphematic, N-gram and lexical approaches. Normalization. Tokenization problems: the presence and absence of spaces; Chinese, Japanese, Arabic; accents and diacritics. Equivalence classes. Lowercasing. Stop words. Lemmatization. Stemming. The predictor. Types of languages. Statistical homonymy resolution. Splitting text into sentences. Search query expansion.
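
A minimal sketch of a normalization pipeline for a whitespace-delimited language, assuming a hypothetical tiny stop-word list; real tokenizers must also deal with the harder cases listed above (Chinese and Japanese without spaces, Arabic, equivalence classes):

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "an", "in", "of", "and"}   # hypothetical, tiny list

def strip_diacritics(text):
    """Fold accented characters into their base form (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text):
    text = strip_diacritics(text.lower())
    tokens = re.findall(r"\w+", text)          # naive tokenization by word characters
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The Café of the résumé"))     # -> ['cafe', 'resume']
```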



Lecture 4. Collocation


Probability estimation methods: parametric and non-parametric approaches; standard and binomial distributions, multinomial and normal distributions, approximation. The Bayesian approach to statistics. Definition of collocations and their characteristics. Bigram frequencies. Filtering by part of speech. Deviations and deviation histograms. Searching for collocations; t-test examples. Searching for differences in usage. Pearson's χ² criterion. The likelihood-ratio criterion. Relative frequencies. Mutual information. Data sparseness. F-measure.
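
To illustrate the t-test for collocations: under the null hypothesis the words of a bigram are independent, so the expected bigram probability is P(w1)·P(w2), and the statistic compares it with the observed frequency. A sketch with illustrative corpus counts:

```python
import math

def t_test(count_w1, count_w2, count_bigram, n_tokens):
    """t statistic for the bigram (w1, w2) under the independence hypothesis."""
    mu = (count_w1 / n_tokens) * (count_w2 / n_tokens)   # expected P(w1 w2)
    x_bar = count_bigram / n_tokens                       # observed P(w1 w2)
    # For a rare Bernoulli variable the sample variance is approximately x_bar
    return (x_bar - mu) / math.sqrt(x_bar / n_tokens)

# Illustrative counts: 'new' occurs 15828 times, 'companies' 4675 times,
# the bigram 'new companies' 8 times, in a corpus of 14 307 668 tokens.
t = t_test(15828, 4675, 8, 14_307_668)
print(round(t, 3))   # t is about 1.0, well below 2.576, so independence is not rejected
```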



Lecture 5. Language models. N-grams. Markov chains


Tasks of language recognition. Language models. Search using language models. The fundamental problem of data sparsity. Constructing N-grams. The maximum likelihood method. Smoothing. Model validation. Linear model interpolation (mixing). Markov chains. The transition matrix and the sequence of states. Hidden Markov models. The three HMM problems. The forward and backward algorithms. The Viterbi and Baum-Welch algorithms. Application of HMMs: the tagger. Analysis of user behavior.
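
A minimal sketch of a bigram language model with linear interpolation (mixing) as a smoothing strategy, over a hypothetical toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()   # hypothetical toy corpus

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_interpolated(prev, word, lam=0.7):
    """P(word | prev) as a mix of bigram and unigram maximum-likelihood estimates."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / total
    return lam * p_bigram + (1 - lam) * p_unigram

print(p_interpolated("the", "cat"))   # frequently seen continuation
print(p_interpolated("the", "ate"))   # unseen bigram, backed off to the unigram estimate
```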



Lecture 6. Machine translation


Definition and tasks of machine translation. The history of machine translation. Approaches to machine translation: rule-based, corpus-based, hybrid. The three core methodologies. RBMT and its comparison with SMT; their advantages and disadvantages. Parallel corpora. Sentence alignment. Word-based models. The IBM Models and their limitations. Phrase-based models: phrase-based statistical translation, computing the translation probability, the language model, the translation model, building the phrase table. Decoding. Evaluation of machine translation. BLEU (Bilingual Evaluation Understudy). The evolution of machine translation.
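
A simplified sketch of BLEU, restricted to unigram and bigram modified precision and a single reference translation (real BLEU uses up to 4-grams and clipped counts over multiple references); the candidate and reference sentences are hypothetical:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "the cat is on the mat".split()          # hypothetical system output
ref = "there is a cat on the mat".split()       # hypothetical reference translation
print(round(bleu(cand, ref), 3))
```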



Lecture 7. Indexing


The general layout of the search database. The purpose of the inverted index. Technical limitations and the disk subsystem. The structure of the inverted index and options for building it. Optimizing postings intersection. Compression of postings (coordinate) blocks: a comparison of bitwise and byte-aligned approaches: Fibonacci codes, VarByte, gamma codes, Simple9. Practical tips for reducing index size. Data structures used to build the dictionary. Approaches to storing stop words. Problems of indexing large volumes. Document distribution and database balancing. Indexer architecture.
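
A sketch of byte-aligned (VarByte) compression of a postings block: document ids are converted to gaps, and each gap is written in 7-bit chunks with the high bit marking the final byte of a number (conventions differ between implementations; this is one common layout):

```python
def encode_number(n):
    """Encode one integer as VarByte: 7 data bits per byte, high bit terminates."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128           # mark the last byte of the number
    return bytes(chunks)

def encode_postings(doc_ids):
    """Gap-encode a sorted postings list, then VarByte each gap."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += encode_number(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decode_postings(data):
    doc_ids, n, prev = [], 0, 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            prev += n * 128 + (byte - 128)
            doc_ids.append(prev)
            n = 0
    return doc_ids

postings = [824, 829, 215406]          # hypothetical sorted postings list
assert decode_postings(encode_postings(postings)) == postings
```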



Lecture 8. Web search architecture. Text ranking


The logical scheme of a search engine. The search cluster. Indexing. Boolean search. Weight calculation; the Jaccard coefficient. The frequency matrix. The bag-of-words model. Term frequency. Logarithmic weighting. Document frequency and IDF. Documents as vectors. Optimization methods for text ranking. Terms with high IDF. Documents containing more of the query terms. Static weights, total weight. Tiers (echelons). Index clustering. Parametric indexes and zones. Fields (numeric zones). Indexes for zones. Compact records. Probabilistic retrieval. Using language models in search. Options for comparing models. Query and document likelihood. Model comparison. Relevance feedback. The binary probabilistic model. Bayesian networks in the ranking problem.
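
A minimal sketch of tf-idf weighting with logarithmic term frequency and cosine similarity between bag-of-words vectors, over a hypothetical toy collection:

```python
import math
from collections import Counter

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "new home construction slows"]          # hypothetical toy collection
N = len(docs)
tokenized = [d.lower().split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(tokens):
    """Logarithmic tf times idf for every term of a document (or query)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if df.get(t)}

def norm(v):
    return math.sqrt(sum(x * x for x in v.values())) or 1.0

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return dot / (norm(u) * norm(v))

query = tfidf_vector("home sales".split())
for i, tokens in enumerate(tokenized):
    print(i, round(cosine(query, tfidf_vector(tokens)), 3))
```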



Lecture 9. Design of search results. Snippets. Evaluation of search quality


Examples of search results page layouts from various resources. SERP components. Organic results. Paragraph highlighting. Sentence splitting. Snippet generation and the general generation algorithm. Snippet enrichment. Snippet metrics. Assessment by assessors. Search engine quality metrics. Search quality. Standard collections. TREC. Precision and recall. Criticism of pure relevance. Marker tests. Search for peripheral sites. Regional navigation. Subject search. Overall search quality. The assessment service. Evaluating document relevance. Cross-validation. SOM maps. Automatic error detection. Online metrics. Hypothesis evaluation. Click metrics. Correlation with assessors.
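
A minimal sketch of the set-based precision, recall and F-measure computation for a single query, with hypothetical retrieved and assessor-judged relevant document ids:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall and F-measure for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0 and recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Hypothetical engine output and assessor judgements for one query
print(precision_recall_f(retrieved=[1, 2, 3, 5, 8], relevant=[2, 3, 4, 8, 9, 11]))
```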



Lecture 10. Features of web search. The spider


The popularity of search. The history of search engines. Basics of web search. User needs. Empirical evaluation of search results by users. Collecting web documents. Search advertising: how it is ranked, its pros and cons. The spider (crawler) and its tasks. URL queues. Search robots. The basic architecture of a spider. Parsing: URL normalization. The distributed spider. Interaction between servers. The Mercator scheme. Front queues and back queues. Freshness of the database. The Deep Web (hard-to-reach sites). Site maps. Document storage. Noise removal.
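
A sketch of the URL normalization step of the parser using only the standard urllib.parse module: it resolves relative links, lower-cases the scheme and host, and drops fragments (real crawlers also canonicalize query parameters, handle trailing slashes and respect robots.txt):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize_url(base, href):
    """Resolve href against the page URL and bring it to a canonical form."""
    absolute = urljoin(base, href)                  # handle relative links
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

print(normalize_url("http://Example.COM/a/b.html", "../C.html#section2"))
# -> http://example.com/C.html
```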



Lecture 11. Finding duplicates on the Web


Comparing documents: exact and inexact duplicates, near-duplicates, print versions. The three stages of finding similar documents. Shingles and the compression option. The set model and the matrix model. Searching for similar columns. Signatures. Identifying similar sets (minhashing). Finding similar pairs. Selecting candidates from minhash signatures. Locality-sensitive hashing. Partitioning into bands and buckets. LSH trade-offs. Finding duplicates on the Web.
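
A sketch of near-duplicate detection via shingling and minhash: the Jaccard similarity of two shingle sets is estimated by the fraction of matching positions in their signatures (the documents and the seeded use of Python's built-in hash are illustrative simplifications):

```python
import random

def shingles(text, k=3):
    """Set of k-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, hash_seeds):
    """One minimum per (seeded) hash function approximates a random permutation."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in hash_seeds]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
seeds = [random.randrange(2**32) for _ in range(128)]   # 128 hash functions

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
sig_a = minhash_signature(shingles(doc_a), seeds)
sig_b = minhash_signature(shingles(doc_b), seeds)
print(round(estimated_jaccard(sig_a, sig_b), 2))   # close to the true Jaccard similarity
```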



Lecture 12. The use of self-organizing maps in a search engine


The lecture is divided into two parts. The first part covers prioritization for the search engine spider: algorithms for segmenting large sites into parts and distributing crawl priorities across the segments. The second part covers algorithms for analyzing and visualizing large amounts of data with Kohonen self-organizing maps (SOM), the use of this tool for analyzing the structure of the web and prioritizing the search robot, and the possibility of using SOM for data analysis in other areas of search engine development.
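
A minimal sketch of the core SOM training loop on random data (a real application would feed it site or page features and use a much larger map): find the best-matching unit for each sample and pull it and its grid neighbours towards the sample, with decaying learning rate and neighbourhood radius.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3                 # 10x10 map over 3-dimensional data
weights = rng.random((grid_h, grid_w, dim))
data = rng.random((500, dim))                   # hypothetical input vectors

n_iter = 2000
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # Best-matching unit: the map cell whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Learning rate and neighbourhood radius decay over time
    lr = 0.5 * (1 - t / n_iter)
    radius = max(grid_h, grid_w) / 2 * (1 - t / n_iter) + 1
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0])**2 + (cols - bmu[1])**2
    influence = np.exp(-grid_dist2 / (2 * radius**2))
    weights += lr * influence[..., None] * (x - weights)

print(weights.shape)   # trained map, ready for visualisation
```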



Lecture 13. Identification of spam sites based on the analysis of content pages.


Various aspects of cleaning the search index of garbage. Issues of classifier construction. Basic machine learning topics: correctly building a training set, feature generation, and the choice of classification algorithms. The problem of building classifiers for different classes of data.
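
As a toy illustration of classifier construction, here is a sketch of a multinomial naive Bayes spam/not-spam classifier over bag-of-words features; the labelled pages and the naive train/test split are purely hypothetical, and a real pipeline would rely on assessor judgements, richer features and cross-validation:

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical labelled pages: (text, is_spam)
pages = [("buy cheap viagra casino bonus", 1),
         ("cheap casino bonus win money now", 1),
         ("free money buy now casino", 1),
         ("information retrieval lecture notes", 0),
         ("search engine architecture overview", 0),
         ("machine learning for text classification", 0)]

random.seed(1)
random.shuffle(pages)
train, test = pages[:4], pages[4:]              # naive split; use cross-validation in practice

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def log_posterior(text, label):
    """Multinomial naive Bayes with add-one smoothing."""
    logp = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for w in text.split():
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

for text, label in test:
    pred = max(class_counts, key=lambda c: log_posterior(text, c))
    print(label, pred, text)
```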



Lecture 14. Behavioral and link-based ranking


Computing behavioral relevance. Indexing anchor text. The HITS and PageRank algorithms. The block structure method. Systems for graph processing.
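
A minimal sketch of PageRank by power iteration over a small hypothetical link graph, with dangling nodes and the damping factor handled in the simplest possible way:

```python
def pagerank(links, damping=0.85, n_iter=50):
    """Power iteration over an adjacency dict {page: [outgoing links]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(n_iter):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                       # dangling node: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical link graph
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print({p: round(r, 3) for p, r in pagerank(graph).items()})
```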



Lecture 15. Ranking with machine learning


Classic ranking. Ranking factors. Ranking based on machine learning. The specifics of the learning-to-rank task. Formal problem statement. Gradient descent. Decision trees; "oblivious" decision trees. Algorithmic compositions over decision trees (bagging, boosting). Stacking. The BagBoo algorithm. Issues of building training data. Active learning. Uncertainty sampling. Committee-based methods of active learning. Using self-organizing maps to sample training data. The SOM + QBag algorithm for active learning to rank.
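
A minimal sketch of the learning-to-rank setup in its simplest, pointwise form: documents are described by feature vectors, relevance labels stand in for assessor judgements, and a linear model is fitted by gradient descent on squared error (production rankers use the tree ensembles listed above; all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features = 200, 5
X = rng.random((n_docs, n_features))                   # hypothetical ranking factors
true_w = np.array([0.8, 0.1, 0.0, 0.5, -0.3])          # hidden "ideal" weights
y = X @ true_w + 0.05 * rng.standard_normal(n_docs)    # assessor-style relevance labels

w = np.zeros(n_features)
lr = 0.1
for _ in range(500):                                   # gradient descent on squared error
    grad = 2 * X.T @ (X @ w - y) / n_docs
    w -= lr * grad

scores = X @ w
top = np.argsort(-scores)[:5]                          # documents ranked by predicted relevance
print(np.round(w, 2), top)
```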



Previous issues


Technopark:

Technosphere:

Subscribe to the Technopark and Technosphere YouTube channel!

Source: https://habr.com/ru/post/257119/

