
Technosphere Lectures. Information Retrieval. Part 1 (Spring 2017)


Here is a new release of video lectures from our educational project Technosphere. This time the course is dedicated to information retrieval.


Every Internet user has experience with search engines: we regularly enter queries and get results. Search engines have become so familiar that it is hard to imagine a time without them, and the quality of modern search is taken for granted, although fifteen years ago things were completely different. Yet a modern search engine is a sophisticated combination of software and hardware whose creators had to solve a huge number of practical problems, ranging from the sheer volume of data processed to the nuances of how people perceive search results.


In this course, we cover the basic methods used to build search engines. Some of them are good examples of ingenuity; others show where and how modern mathematical tools can be applied.


List of lectures:


  1. Introduction
  2. Features of web search. Search Robot Architecture
  3. Search Robot Planner
  4. Indexing and Boolean Search
  5. Boolean index and search
  6. Finding duplicates
  7. Finding duplicates (part 2)
  8. Pornography Filtering
  9. Antispam
  10. Snippets
  11. Building Snippets
  12. Correction of typos in queries
  13. Hints, reformulations, classifiers

Course lead:



Lecture 1. Introduction



Overview lecture on the importance of information retrieval.


Lecture 2. Features of web search. Search Robot Architecture



The first part of the lecture is devoted to web search: it gives historical background, briefly touches on advertising in search, and outlines the overall schemes of web search. The second part is devoted to search robots (spiders): the problem statement for data collection, and downloading, updating and storing data.


Lecture 3. Search Robot Planner



The problem of scheduling the search robot's work is stated, Focused Crawler algorithms are considered, and the "Garden of Stones" algorithm is analyzed. Quota issues are also covered.


Lecture 4. Indexing and Boolean Search



The composition and purpose of the search index are considered, and search engine hardware is briefly discussed. The lecture covers fast intersection of blocks, index compression, and techniques for improving the compression ratio.
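
To make the ideas of intersection and index compression concrete, here is a minimal sketch in Python. It assumes posting lists are sorted lists of integer document IDs, intersects them with a plain two-pointer merge (the simplest form, without block skipping), and compresses them with delta encoding plus a variable-byte code. The function names and the choice of variable-byte coding are illustrative assumptions, not necessarily what the lecture presents.

```python
from typing import Iterable, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists of document IDs with a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def varbyte_encode(doc_ids: Iterable[int]) -> bytes:
    """Store gaps between sorted doc IDs, 7 payload bits per byte.

    The high bit of a byte is set when more bytes of the same gap follow.
    """
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def varbyte_decode(data: bytes) -> List[int]:
    """Inverse of varbyte_encode: rebuild the sorted doc IDs from the gaps."""
    doc_ids, prev, current, shift = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes of this gap follow
        else:
            prev += current     # gap complete: add it to the previous ID
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids
```

Small gaps dominate in long posting lists, which is why storing gaps instead of absolute IDs usually pays off before any further compression is applied.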


Lecture 5. Boolean index and search



A continuation of the previous lecture. The topic of compression is raised again: the Simple9 algorithm and working with binary data in Python are considered. The second part of the lecture is devoted to the search dictionary: the concept of stop words is introduced and aspects of dictionary storage are discussed. The third part of the lecture describes the query tree: what it is, how the tree is executed, and how queries are parsed.


And at the end of the lecture, you will learn how the overall indexing workflow is built.
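
As an illustration of the Simple9 idea mentioned above, the following sketch packs small non-negative integers (for example, gaps between document IDs) into 32-bit words: 4 bits select one of nine layouts and the remaining 28 bits hold the values. The greedy choice of layout, the little-endian storage via `struct`, and the function names are assumptions made for this example rather than details taken from the lecture.

```python
import struct
from typing import List

# (how many numbers, bits per number) for each of the nine Simple9 layouts.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def simple9_encode(numbers: List[int]) -> bytes:
    """Greedily pack numbers into 32-bit words, densest fitting layout first."""
    words = []
    pos = 0
    while pos < len(numbers):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = numbers[pos:pos + count]
            if len(chunk) == count and all(0 <= n < (1 << bits) for n in chunk):
                word = selector << 28           # top 4 bits: layout selector
                for k, n in enumerate(chunk):
                    word |= n << (k * bits)     # low 28 bits: packed values
                words.append(word)
                pos += count
                break
        else:
            raise ValueError("values must be in the range [0, 2**28)")
    return struct.pack(f"<{len(words)}I", *words)


def simple9_decode(data: bytes) -> List[int]:
    """Unpack every 32-bit word according to its selector."""
    numbers = []
    for (word,) in struct.iter_unpack("<I", data):
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        for k in range(count):
            numbers.append((word >> (k * bits)) & mask)
    return numbers
```

For example, 28 gaps equal to 1 fit into a single word, whereas one gap of 100000 needs a word of its own.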


Lecture 6. Finding duplicates



Finding duplicates is a big topic split across two lectures. First, you will learn the terminology, look at examples of duplicates, and learn about shingling. Then practical methods for finding duplicates are considered: improvements to the algorithms, the Minshingle signature generation method, the Jaccard measure, and the Broder algorithm.
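
The sketch below shows, under simplifying assumptions, how shingling, the Jaccard measure and a min-hash style signature fit together. It uses word-level shingles of length `k` and simulates a family of hash functions by prefixing a seed before hashing with `hashlib.blake2b`; the function names and these choices are illustrative and are not claimed to match the exact algorithms presented in the lecture.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of word k-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard measure: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family (an assumption of this sketch)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: Set[str], num_hashes: int = 64) -> List[int]:
    """For each hash function keep the minimum hash over all shingles."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """The share of matching positions estimates the Jaccard measure."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

The signature has a fixed length regardless of document size, so comparing two documents costs the same no matter how long they are; the price is that the similarity becomes an estimate rather than an exact value.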


Lecture 7. Finding duplicates (part 2)



This lecture is devoted to finding duplicates in very large collections of documents. We consider a method for finding near-duplicates, Locality-Sensitive Hashing (LSH), discuss algorithms with an indivisible signature, and conclude with a comparison of the features of the different algorithms.
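
One common way to apply Locality-Sensitive Hashing to document signatures is the banding technique, sketched below. It assumes each document already has a fixed-length integer signature, splits the signature into bands, and treats documents that agree on every value in at least one band as candidate pairs. The function name, the input format and the banding scheme itself are assumptions of this example, not a description of the lecture's exact method.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def lsh_candidate_pairs(
    signatures: Dict[str, List[int]], bands: int
) -> Set[Tuple[str, str]]:
    """Return pairs of document IDs whose signatures match in at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) % bands != 0:
            raise ValueError("signature length must be divisible by the band count")
        rows = len(sig) // bands
        for band in range(bands):
            # The band index is part of the key, so band 0 never collides with band 1.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)

    pairs = set()
    for docs in buckets.values():
        for a, b in combinations(sorted(docs), 2):
            pairs.add((a, b))
    return pairs
```

Only the candidate pairs need a full comparison afterwards, which avoids comparing every document with every other one. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.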


Lecture 8. Pornography Filtering



The lecture begins by explaining why it is important to always filter pornographic material and discusses common approaches to the problem. It then describes techniques for filtering web pages, queries and images.


Lecture 9. Antispam



This is another highly relevant topic. First, the reasons spam exists are considered and the scope of the problem is discussed. The lecture covers the ways spam affects search engines and how to counter them. You will learn how to detect spam by analyzing page content and how to identify spam sites. Methods of combating fraud and spam applications are also considered.


Lecture 10. Snippets



In this lecture you will learn what search snippets are and what recommendations exist for the design of search results. The main elements of the SERP are discussed, the “semantic web” is explained, and microdata markup on pages is considered. The lecture ends with inorganic snippets and sentence boundary detection.


Lecture 11. Building Snippets



Continuing the topic of snippets. This time you will learn what text summarization is; organic snippets and the forward (direct) index are considered, and a method for evaluating snippet quality is discussed.


Lecture 12. Correction of typos in queries



The lecture is devoted to methods for detecting and correcting typos in user queries.
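
The announcement does not say which methods the lecture uses, so the following is only a generic illustration of one classic approach: pick the dictionary word with the smallest Levenshtein (edit) distance to the typed word. The vocabulary, the distance threshold and the function names are hypothetical.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def correct(word: str, vocabulary: Iterable[str], max_distance: int = 2) -> Optional[str]:
    """Return the closest vocabulary word, or None if nothing is close enough."""
    best = min(
        vocabulary,
        key=lambda candidate: (edit_distance(word, candidate), candidate),
        default=None,
    )
    if best is None or edit_distance(word, best) > max_distance:
        return None
    return best
```

A linear scan over the vocabulary is fine for a toy dictionary but far too slow for real query logs; a production system would also weigh candidates by how often they occur.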


Lecture 13. Hints, reformulations, classifiers



The last lecture of the course is devoted to generating hints while the user types a search query; methods of reformulating queries to improve search results are also considered. Finally, various query classifiers are discussed.
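
As a rough illustration of hint generation while the user types, the sketch below completes a prefix from a log of past queries and ranks the completions by frequency. The query counts, the function name and the ranking rule are hypothetical; the lecture may well use different data structures and signals.

```python
from bisect import bisect_left
from typing import Dict, List


def suggest(prefix: str, query_counts: Dict[str, int], limit: int = 3) -> List[str]:
    """Return up to `limit` past queries starting with `prefix`, most frequent first."""
    # Sorting on every call keeps the sketch short; a real system would sort once.
    sorted_queries = sorted(query_counts)
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break  # queries sharing a prefix are contiguous in sorted order
        matches.append(query)
    matches.sort(key=lambda q: (-query_counts[q], q))
    return matches[:limit]
```

Binary search finds the first query that could match, so only the queries that actually share the prefix are scanned.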


The playlist of all lectures is available at the link. As a reminder, current lectures and programming master classes from our IT specialists in the Technopark, Technosphere and Technotrack projects are still published on the Technostream channel.



Source: https://habr.com/ru/post/329072/

