
Information Retrieval Course at the Winter Pushchino School: we teach high school students to create search engines

More than 200 of our colleagues teach in our technology projects. Many of them do not stop there and also run master classes, courses and lectures on other educational platforms. One such enthusiast is Roman Vasilyev, who this spring taught a course on information retrieval at the Winter Pushchino School. Under his guidance, in just six lessons, schoolchildren in grades 7 to 11 (!) wrote their own search engine and defended the project. How they managed it, what kind of search engine it is, and what the Winter Pushchino School is held for: read on in Roman Vasilyev's philosophical, thoughtful, easygoing and rather humorous article.



Introduction. Where is our education heading, and what can we do about it?


Having observed the school system for more than 20 years, first from the inside as a student and then from the outside while talking to today's students, I see that the world is no longer what it used to be. Despite the leap in information technology, which is meant to improve our lives, I do not envy those who study in modern Russian schools. Why, you ask? Remember your childhood. Surely you wanted to grow up as soon as possible, looked for chances to show independence, and enjoyed the freedom of choice and action in those moments. When I walked home from school and stopped at the market to buy groceries, I was a little adult. In those moments nobody controlled me and we had no mobile phones; I walked with friends or thought about something on my own, and this breath of freedom was vital for me.


And now? Parents are often obliged to take their children to school and bring them back; pupils are now on a leash. People say that times have changed. Really? I maintain that people have stayed the same as they were, there are no more maniacs than before, and, as the saying goes, if you are afraid of wolves, do not go into the forest. More and more tools are being introduced to monitor and measure progress: electronic grade books, new types of exams. But as my physics teacher Alexey Innokentievich says, if you did not put a coin in your pocket in the morning, then no matter how hard you look, it will not be there by lunchtime. By this I mean that the emphasis should be on how to pass knowledge on to students rather than on testing what they already have, and, most importantly, on instilling in them the ability to think, to think critically and to find solutions in non-standard situations.


In practice, I often hear from teachers that students simply do not want to learn, and from students that entire disciplines sometimes drop out; in some schools, for example, there was no programming at all. Yes, indeed: a group of geology students comes to me, I have to teach them computer science, and their level of knowledge is so much a matter of luck that it is very difficult to adapt the course. Or take the commonplace case in literature classes, where pupils get lower marks only because they disagree with the teacher's opinion and say so in their essays. Do teachers understand their subject thoroughly, are they genuinely interested in it? Alas, not all of them. And it is hardly surprising, given that the older generation of teachers is inevitably passing away, and that after the collapse of the USSR a teacher's life became more a matter of survival, so strong applicants simply avoided teacher training colleges. Can you get a child interested in something when you are not on fire with it yourself?


And if we add that relationships in schools are sometimes unhealthy, both within the class and between teacher and student, it all becomes very sad. Would you want to study there? Much depends on which school you end up in. So what do you do if you are out of luck and your parents have no money for a private school and tutors? Yet gifted children, olympiad competitors, often study in exactly such ordinary schools. I know from my own experience that socialization can be a problem for these children: peers, for example, envy them and do not understand them. And what if there are more such children than it seems, but their talents are buried in ground that is not very fertile? How do we let a child feel the heady taste of knowledge, of mastering new material? How do we put him in a group where he will feel free and natural, like a fish in water, where he will meet like-minded people, friends and, finally, his love?


The idea of the Winter Pushchino School


It turns out that there is an island where the world of knowledge shines in every color, where lessons are communication and co-creation, and where students come even from far away to spend their holidays. This is the Winter Pushchino School. Oh, if only I had known about it 15 years ago! Located in the small town of Pushchino, on the banks of the Oka River not far from Serpukhov, it has existed since 1990. Its 28th season took place at the end of March. One of the main goals of the School is to let students look at the learning process from a different angle, to immerse them in a different atmosphere and to let them choose what interests them. For more than 15 years the project as a whole was led by Mikhail Abramovich Roytberg, D.Sc., Head of the Laboratory of Applied Mathematics of the Institute of Mathematics and Mechanics, RAS, and Head of the Department of Algorithms and Programming Theory, FIVT MIPT. Unfortunately, he passed away last summer, and the School is now managed collegially; the organizing committee consists mainly of academics and university professors.


The teachers are young educators, enthusiastic undergraduates, graduate students and graduates of the best universities in Russia: more than a hundred volunteers who want to kindle in schoolchildren a love of science and of their own field. Without hesitation, I decided to use this opportunity to talk about something interesting and relevant that is not taught in any school.


The structure of the school. Activities


Like Hogwarts, the School has four departments: exact sciences, natural sciences, humanities and psychology. Freedom in choosing both the course topic and the way the material is presented is limited by almost nothing except the size of the classroom and the availability of particular equipment. On each of the five days there are three course slots of one hour each. This time can be divided freely between lectures, workshops and master classes. In the evening there are studios. This format is ideal for art classes and trainings, where a creative atmosphere is created. At the School you can, for example, teach a course in nuclear physics as well as run a studio on making origami flowers or playing the ukulele. Each course should produce some result that its participants present as a poster at a conference at the end of the week. That is why it is important to think the lesson plan through so that there is a logical conclusion and a visible result.


To register with the School as a staff member, that is, as a teacher, you need to apply about a month in advance, giving, if possible, two versions of the course title: a standard one and a poetic one. You can choose which grades the course is intended for and limit the number of students. Each course must also have an abstract, which is printed in the booklet, and a poster. You do not have to be an artist: most teachers send images and sketches that the organizers then polish. Once registration is complete, new staff members come to the Intellectual High School, where they give a short talk of about ten minutes on their course. The Organizing Committee makes a decision on the course (usually a positive one) and may give recommendations.


Since the School is a bit like a camp where children come for a little over a week, there is also an opportunity to be a counselor. All staff are reimbursed for travel and accommodation in Pushchino. I had to travel there by bus 6 days in a row, because my day job had not gone anywhere. It was pretty hard: I left at 9 in the morning, made it in time for the third slot by noon, taught my lesson, had lunch and went back to Moscow, arriving at the office at about half past four. I worked as long as I could, got home after midnight and prepared for the next lesson before going to bed. Next year I will probably take a vacation; besides, I would like to attend some courses myself, especially in psychology, and getting up at 5:30 in the morning to make the first slot is sheer murder.


My first impressions. Location, atmosphere, people


Yes, it lasts only a week, but the charge you get there lasts a long time. Sunday morning: the bus takes me to Pushchino through snow-covered fields and forest. The gymnasium stands at the entrance to the town, in a microdistrict with the mysterious name “AB”. In the red-brick building, shaped like the Cyrillic letter "П", everyone was invited to the Red Hall for the grand opening of the School. When I got there and looked at the faces around me, nostalgia washed over me: I remembered going to olympiads in mathematics and physics. Only this time I was there in a different capacity, with no problems to solve and no competition.


Now the main task is to get the children interested, because Sunday is the day of course presentations. Posters hung in the corridors, and classes were held everywhere: not only in classrooms but also in the recreation areas. In general, the atmosphere at the School is full of kindness, care and trust. Formalities were kept to a minimum, and everything was aimed at giving free rein to the imagination and not restricting the work of students or staff in any way. For example, this squirrel has become not only everyone's darling but also ... a full participant in the School.



On the ground floor, next to the headquarters, an African café was set up, where you could always have tea or coffee and enjoy some fruit.


Three times, for exactly 10 minutes each, you present your course. At the end of the day the students make their choice. In total, about 200 to 300 schoolchildren each choose 3 courses out of several dozen. The courses on data analysis and neural networks turned out to be more popular, as far as I could see, and 6 people signed up for my information retrieval course. I must say this is good: not too few and not too many, which is convenient for a dialogue. I had been afraid there would be fewer.


The lectures by the so-called mammoths, major scientists who visit the School and inspire interest in science, are the pearl of the Winter Pushchino School. By the way, the plan for building a course mirrors the work of a scientist, which combines research with communication at conferences and publication of the results.


One of the School's problems is the lack of computer classrooms. I was given the Russian literature classroom. A projector was found, but the Wi-Fi there was patchy, so we had to use mobile internet. Not everyone had a laptop, so I brought two: I gave the presentation from one and lent the other out for the workshops.


Days of study. How the information retrieval course was built: topics and the educational process


So, I set myself the task of creating an information retrieval course that would be interesting, accessible and interactive at once. I am convinced that schools lack mentorship, personal involvement and deep immersion in the educational process. I wanted the course to be like a fascinating journey on which I would lead the guys, knowing all the trails like an experienced forester. And under no circumstances should the course resemble the tedious speech of a party official at a congress of the CPSU. So lectures alone would not do.


At the first lesson I immediately understood the mood of the guys. For the most part they knew Python well and wanted, not to say were itching, to write code. Moreover, everyone wanted to write a whole search engine, even if a small one. Such a desire is contagious, and I decided that not producing a search engine by the end of the course would mean giving up. And giving up is not for us, we are not that kind! There was one big problem: only 5 hours, plus some more time on Friday, which is set aside for conference preparation. So I decided to talk first about the parts without which a search engine is inconceivable, writing code along the way, and then see how it went. And since it is almost impossible to write such a project from scratch in that time, I decided to bring in templates and separate parts that had to be studied, finished and assembled. That is a feasible task.


In the end the course consisted of two lectures, three mixed lessons and a hackathon. From my colleagues I borrowed materials that let me prepare quickly and easily, for which I thank them very much. Another convenient thing is that the curious and interested can be pointed to our Tekhnostrim channel, where all the same topics are covered in more detail.


In the first lesson I talked about the general architecture of search engines, the problems we face and how to solve them. Overnight before the second lesson I knocked together a crawler of sorts, launched it and downloaded 2,000 pages from the portal lenta.ru . First I told the children how a spider works, and at the end of the lesson I gave them the prepared code. Together we identified shortcomings that I had physically not had time to fix (I did need to sleep at some point), and on the third day they brought me an extended version, already in Python 3. I mentioned crawler prioritization only to say that it exists and is based on click information. Of course, a search engine cannot do without a query parser and an indexer, nor, difficult as it is, without an understanding of how the index and the dictionary are represented in binary form. This was included in the course and was the subject of the third and fourth lessons.


As far as time allowed, we went through the indexer code. By then the guys had already written a beautiful web interface for the search engine, which was just waiting to be hooked up to the backend. On the fourth day after class, Danya (one of my students) and I drew the course emblem. At the last lesson, by popular demand, I talked about methods for finding duplicates. The guys learned what shingles are and how the Minshingle algorithm works. The Broder algorithm caused no problems either, because in the course on processing large volumes of data they had been taught how MapReduce works. And at the end came a small hackathon, where we adapted the indexer to the data the crawler was feeding us and implemented snippet generation in the crudest possible way.
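To make the duplicate-detection idea concrete, here is a minimal sketch of word shingles combined with a MinHash-style signature. It is not the code from the course: the function names, the shingle length of 3 words, the 64 hash functions and the use of MD5 with a seed are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Hypothetical helper: one 64-bit hash function per seed, built on MD5."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each hash function keep the minimum hash over all shingles.

    Assumes shingle_set is non-empty.
    """
    return [min(_seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Share of matching positions; it estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the crawler downloads pages and saves them to disk"))
    b = minhash_signature(shingles("the crawler downloads pages and saves them to a disk"))
    print(estimated_similarity(a, b))
```

Two documents whose signatures agree in most positions can be treated as near-duplicates; the threshold is a tuning decision and is not specified in the article.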


By 6 pm all the critical bugs had been fixed, the web interface had been hooked up, and the guys went to the conference with a search engine of sorts. It was a success! Terribly tired but happy, I left for Moscow. The next day I slept soundly, certain that my students would manage.


Heroes of the course. A couple of words from me, an interview with the guys: Danya Gorlov, Artem Brustovetsky, Katya Khvatova


As if they, of all people, would fail to pull it off! "If we can't pull it off, nobody can."


I have never seen such energy, such indomitable enthusiasm, such a radiance of talent in any audience I have taught before. And I have taught no fewer than five groups of fuel geologists at Moscow State University. As I saw it, the lessons were like a dialogue, and if attention wandered, it only meant that everyone was looking at the code. There is a lot I want to say here, and I run into the difficulty that a thought can be conveyed precisely in words, but feelings cannot always. On Sunday my father asked me: “Weren't you too lazy to travel a hundred kilometers every day? Have you nothing better to do? When will you finally write your dissertation?”


For you, my students, I am not too lazy, even if it were not a hundred kilometers but all of two hundred. Looking at you doubles my own desire to develop, to move forward and upward. I will always be glad to talk. And I was very pleased to learn at the end that you liked the course and that it turned out a success.


Who are they - my students?


Daniel Gorlov



He is in the 7th grade at the “Dubravushka” boarding school in Obninsk and is interested in web programming, IT in general and information retrieval technologies in particular, and especially in programming microcontrollers, on which he built a greenhouse control system operated over the Internet. The system reads the water level in the tank, the humidity and the air temperature, and provides automatic climate control. Another of Danya's projects is an implementation of a network protocol for Arduino, created as an alternative to HTTP, including at a low level. The main distinctive feature of the protocol is modularity and extensibility. His main programming language is Python; the Flask framework and distributed computing in the MapReduce paradigm using Hadoop also fall within his interests. His plans are to master technologies for processing large amounts of data, machine learning and deep learning, and then to create an operating system for a neurointerface. I am sure he will succeed.


Artem Brustovetsky



An 11th-grade pupil from Pushchino, he has been attending the School for the sixth year in a row and is extremely passionate about programming. He chose the information retrieval course because the discipline is related to web and backend development, which interest him, and because it offered a great opportunity to write code. Artem's fears about understanding the material did not come true, and in the end he learned how to program search engines, which he was very pleased about. Artem's best and favorite programming language is Python; he treats writing code with special care and strives to bring its quality to perfection. Most of all Artem likes network programming, multithreading, sockets and game development. He is now creating a game in Unity, with the main emphasis on the server side and on data transfer performance. He plans to grow in areas such as machine learning and neural networks and to keep improving his web development skills. One of the ideas he wants to bring to life is turning the Web into a decentralized system in which each participant stores their own piece of the data. Artem's impressions of the School are the brightest, and he advises all schoolchildren to come here and immerse themselves in this unique atmosphere.


And now a few words from Katya Khvatova, an 11th-grade pupil from Moscow.



“I am an optimist in life. My hobbies are quite diverse, but the main ones are biology and programming. I chose this course because I liked the topic itself; I am very interested in the principles behind search engines such as Mail.ru Search. For me these classes were very unusual and interesting because they were challenging enough. Sometimes it was too difficult, because I came to the course hardly knowing the language. Despite this, the principles themselves became clear to me immediately, since the way of teaching was very convenient. Seen from the outside, I was the weakest programmer, but I liked it. I learned a lot, for example how to make queries more understandable to a search engine, and this can be very helpful in the future. We had rather little practice, but I liked that, because I wanted to learn not so much programming (although that too) as the principles themselves. I am interested in a variety of languages, but most of all in Java and C++, because they are the ones mostly used lately. They are also among the most convenient languages. But despite this, at the moment I use Pascal, and I have several projects in it. The world of programming itself is very diverse; I am interested in the branches bordering on biology, for example neural networks and bioinformatics. In fact, I learned everything I wanted to, and I am very glad that I chose this course.”


Results of the course. The search engine. What the guys built


And now the most solemn moment has come: to show the results of what we did during the course and in the two weeks after it. And here it is: small and somewhat cut down, but still a working web search engine. It is now deployed on a server at http://alphase.ru/ , link to the source here . Here is its architecture:



Green marks the components that can run periodically to keep the index up to date. Orange marks those that carry a constant load and process search queries in real time. The store of web documents is shown in blue; in our implementation the documents are simply kept in the file system.


On the one hand, the crawler traverses web pages, respecting the contents of robots.txt files and handling possible HTTP errors, downloads the pages and saves them to disk. So far a small collection of web documents from habrahabr.ru has been gathered. The original HTML markup is processed by the Boilerpipe utility, which is used as an external system, a “black box”. As a result, only the text content of the pages remains, without the page frame (header, footer), the site menu and advertising. Before starting, the indexer runs the duplicate removal utility, and then processes the text of the documents and generates an index and a dictionary. Duplicate documents are detected with the Minshingle algorithm. The index is currently small, on the order of several tens of thousands of documents, but it is gradually growing as new documents arrive from the crawler.
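The robots.txt check mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's crawler: the function name, the user agent string "alphase-bot" and the sample rules are made up for the example, and the rules are parsed from a string so that the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="alphase-bot"):
    """Return a function that tells whether a URL may be fetched.

    robots_txt is the text of a site's robots.txt file; in a real crawler
    it would be downloaded once per host before any page is requested.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical rules, used only to demonstrate the check.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    allowed = build_robots_checker(SAMPLE_ROBOTS)
    for url in ("https://example.com/post/1", "https://example.com/private/x"):
        print(url, "->", "fetch" if allowed(url) else "skip")
```

A crawler would call such a checker before every download and simply skip the URLs it rejects.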


On the other hand, the frontend receives a query from the user, splits it into separate words (terms) and builds a query tree from them. The tree is passed to the server, which consults the dictionary and finds there the numbers of the blocks corresponding to all the terms in the tree. Next, the server extracts the necessary blocks from the index and computes the intersection of the document sets from each block. For the documents found, snippets are generated (in our case, the first fragments of the documents that contain one of the query keywords) and the URLs are determined; together these make up the search results. The results are passed to the frontend and shown to the user, who (we hope) is satisfied with them. The frontend itself is written with the Flask framework.
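The query path described above can be sketched in a few lines. This is a simplified illustration, not the project's server: it keeps the index in an in-memory dict instead of binary blocks, treats every query as a plain AND of its terms rather than building a query tree, and all function names are assumptions.

```python
def build_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index


def intersect(a, b):
    """Intersect two sorted posting lists by walking them in parallel."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


def snippet(text, query, width=5):
    """Return `width` words starting at the first query term found in the text."""
    terms = set(query.lower().split())
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in terms:
            return " ".join(words[i:i + width])
    return " ".join(words[:width])
```

Sorting the posting lists by length before intersecting is a common optimization; whether the project does this is not stated in the article.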


All components are written in Python 3.6, with the exception of Boilerpipe, which is used as a jar (a Java application). The index and the dictionary are stored in binary form for fast lookup in large collections. A subsystem for ranking results based on simple text features is currently being built.
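One possible way to store an index and a dictionary in binary form is sketched below. The actual file layout of the project is not described in the article, so everything here is an assumption: document ids are written as little-endian unsigned 32-bit integers, and the dictionary maps each term to the byte offset and length of its posting list.

```python
import io
import struct


def write_index(postings_by_term, stream):
    """Write posting lists to a binary stream and return the dictionary.

    The dictionary maps term -> (byte offset, number of document ids).
    """
    dictionary = {}
    for term in sorted(postings_by_term):
        postings = postings_by_term[term]
        dictionary[term] = (stream.tell(), len(postings))
        # "<" = little-endian, "I" = unsigned 32-bit integer.
        stream.write(struct.pack(f"<{len(postings)}I", *postings))
    return dictionary


def read_postings(stream, dictionary, term):
    """Seek to the term's block and decode only that block."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    stream.seek(offset)
    data = stream.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))


if __name__ == "__main__":
    buffer = io.BytesIO()  # stands in for an index file on disk
    dictionary = write_index({"python": [1, 3], "index": [2, 3]}, buffer)
    print(dictionary)
    print(read_postings(buffer, dictionary, "python"))
```

The point of such a layout is that answering a query requires reading only the blocks for its terms, not the whole index.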



As a small extra, we added a weather forecast page that calls a public API to determine the user's location and download the current weather data for it.



And the cool thing here is this: it works!




Source: https://habr.com/ru/post/353932/

