
Site search with Reindexer is easy. Or how to make an "instant search" across all of Habrahabr

Hello,


In the previous article I wrote that we had made Reindexer, a new fast and feature-rich in-memory database.


In this article I want to show how you can implement full-text site search with Reindexer while writing a minimum of application code.



Full-text search is an important feature, nowadays mandatory for any web site. How quickly users find the information they are interested in, or the products they plan to purchase, depends on the quality and speed of that search.


Some 15-20 years ago search was completely non-interactive and crude: sites had a search box and a "Search" button. The user was required to type what he wanted to find precisely, without typos and in the exact word form, and click "Search". Then came seconds of waiting, a page reload, and finally the results.


Often not the ones the user expected to see. And everything repeated all over again: a new query, the "Search" button and seconds of waiting. By modern standards, a blatant mockery of the basic principles of UX.


Over the past decades the average level of search engines has grown noticeably: they are ready to forgive the user typos and words in different word forms, and the most advanced can handle queries typed in translit or in the wrong keyboard layout, for example "zyltrc" for "Yandex" typed by mistake in the English layout.


Search interactivity has also grown up: engines have learned to show "suggests", hints about what to type next in the search box. For example, the user starts typing "presi" and is automatically offered the word "president" as he types.


An even more advanced variant of interactive search is "search as you type": the search results are displayed automatically as the user types the query.


All these features are not free, however: the more errors a search engine can correct, the slower it works. And if the search is slow, suggestions and instant search are out of the question.


So developers often have to compromise: disable part of the functionality, drop the interactivity, or throw hardware at the problem and spend a fortune on server infrastructure.


Enough of the lyrical introduction. Let's move on to practice: with the help of Reindexer we will build site search without compromises.


And we will start right away with the result, what came out of it: we parsed the whole Habr, including comments and metadata, loaded it into Reindexer, and made a backend and a frontend for searching across all of Habr.



You can try the result live here: http://habr-demo.reindexer.org/


As for the amount of data, it is about 5 GB of text: 170 thousand articles and 6 million comments.


The search works with all the features: translit, wrong keyboard layout, typos and word forms.


A disclaimer, though: the project was thrown together in a week, in evenings free from other work. So please do not judge too strictly.


It is powered by a single VPS with 4 cores and 12 GB of RAM. It could be squeezed down to 1 core and 10 GB of RAM, but I left some headroom: a habr-effect may suddenly strike, you know.


The implementation of the whole project is under 1000 lines of code, a noticeable part of which is the parser of Habr pages that decomposes html into data structures.




The rest of the article describes how it is implemented.


Backend


Structure and components used


The backend is a golang application. fasthttp and fasthttprouter are used as the HTTP server and router. In this particular case any other server and router combination would do, but I decided to settle on these.
Reindexer is used as the database, and the goquery library is used for parsing html pages.
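For illustration, here is a minimal sketch of how the server and router might be wired together. The route path matches the server logs shown later in the article; the port and the overall layout are my assumptions, not the project's actual code:

package main

import (
	"log"

	"github.com/buaazp/fasthttprouter"
	"github.com/valyala/fasthttp"
)

func main() {
	// Register the search handler on the route seen in the server logs below
	router := fasthttprouter.New()
	router.GET("/api/search", SearchPosts)

	// Start the fasthttp server (the port is an assumption)
	log.Fatal(fasthttp.ListenAndServe(":8080", router.Handler))
}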


The structure of the application is very simple and consists of only 4 modules:



API methods (fasthttp handlers)
Data models
Repository
Habr page parser



Data models


Data models are golang structures. When working with Reindexer, the indexes that will be built over the fields are described in the tags of the structure fields.


Let me dwell on the choice of indexes in more detail: both query execution speed and memory consumption depend on it.


Therefore, it is very important to assign correct index types to the fields that will be searched or filtered.


The structure describing a post:


type HabrPost struct {
	// ID of the post. The `id` index is declared with the `pk` option (Primary Key),
	// so Reindexer uses the `id` field to uniquely identify documents
	ID int `reindex:"id,,pk" json:"id"`
	// Publication time. The API sorts and filters by time ranges, hence a `tree` index
	Time int64 `reindex:"time,tree,dense" json:"time"`
	// Text of the post. No standalone index is built on `text` (type `-`):
	// the field is searched only through the composite full-text index
	Text string `reindex:"text,-" json:"text"`
	// Title of the post. Like `text`, no standalone index (type `-`):
	// searched only through the composite full-text index
	Title string `reindex:"title,-" json:"title"`
	// Author of the post. The API filters by exact user name, so the default `HASH` index is used
	User string `reindex:"user" json:"user"`
	// Hubs of the post. `HASH` index, filtered by exact value
	Hubs []string `reindex:"hubs" json:"hubs"`
	// Tags of the post. `HASH` index, filtered by exact value
	Tags []string `reindex:"tags" json:"tags"`
	// Number of likes. The API only sorts by this field, so the `likes` index
	// is declared as `-` (no index structure is built) and `dense` to save memory
	Likes int `reindex:"likes,-,dense" json:"likes,omitempty"`
	// Number of users who added the post to favorites. Same options as `likes`
	Favorites int `reindex:"favorites,-,dense" json:"favorites,omitempty"`
	// Number of views. Same options as `likes`
	Views int `reindex:"views,-,dense" json:"views"`
	// Whether the post has an image. Not indexed at all
	HasImage bool `json:"has_image,omitempty"`
	// Comments of the post - a joined field, stored in a separate namespace
	Comments []*HabrComment `reindex:"comments,,joined" json:"comments,omitempty"`
	// Composite full-text index over the title, text and user fields.
	// The index is named `search`; the `dense` option saves memory
	_ struct{} `reindex:"title+text+user=search,text,composite;dense"`
}

The comment structure is noticeably simpler, so we will not dwell on it.


Implementing the search method


Handler


At the REST API level the handler is a usual fasthttp handler. Its main job is to get the request parameters, call the search method of the repository and return the response to the client.


func SearchPosts(ctx *fasthttp.RequestCtx) {
	// Get the request parameters
	text := string(ctx.QueryArgs().Peek("query"))
	limit, _ := ctx.QueryArgs().GetUint("limit")
	offset, _ := ctx.QueryArgs().GetUint("offset")
	sortBy := string(ctx.QueryArgs().Peek("sort_by"))
	sortDesc, _ := ctx.QueryArgs().GetUint("sort_desc")

	// Get the results from the repository
	items, total, err := repo.SearchPosts(text, offset, limit, sortBy, sortDesc > 0)
	if err != nil {
		ctx.Error(err.Error(), fasthttp.StatusInternalServerError)
		return
	}

	// Convert the results and send the response to the client
	resp := PostsResponce{
		Items:      convertPosts(items),
		TotalCount: total,
	}
	respJSON(ctx, resp)
}

The main work is done by the repository method SearchPosts: it creates a query (Query) to Reindexer, gets the answer and converts the response from []interface{} to an array of pointers to HabrPost models.


func (r *Repo) SearchPosts(text string, offset, limit int, sortBy string, sortDesc bool) ([]*HabrPost, int, error) {
	// Create a query to the `posts` namespace of Reindexer:
	// match against the full-text index `search`, converting the text to DSL
	query := r.db.Query("posts").
		Match("search", textToReindexFullTextDSL(r.cfg.PostsFt.Fields, text)).
		ReqTotal()

	// Ask Reindexer to build snippets of the found text:
	// 30 symbols of context before and after each match,
	// matches wrapped in <b> </b>,
	// snippets delimited by "..." at the start and "...<br/>" at the end
	query.Functions("text = snippet(<b>,</b>,30,30, ...,... <br/>)")

	// By default the results are sorted by relevance;
	// if an explicit sort field was requested, apply it with `query.Sort`
	if len(sortBy) != 0 {
		query.Sort(sortBy, sortDesc)
	}

	// Apply the paging parameters
	applyOffsetAndLimit(query, offset, limit)

	// Execute the query. The actual search happens here, inside query.Exec()
	it := query.Exec()

	// Check for errors
	if err := it.Error(); err != nil {
		return nil, 0, err
	}
	// The iterator must be closed after use
	defer it.Close()

	// Fetch the results
	items := make([]*HabrPost, 0, it.Count())
	for it.Next() {
		item := it.Object()
		items = append(items, item.(*HabrPost))
	}
	return items, it.TotalCount(), nil
}

Formation of DSL and search rules


Usually the search box of a site expects the query in a normal human language, for example "Big data in science" or "Rust vs C++", while search engines require the query in a special DSL format that specifies additional search parameters.


The DSL specifies which fields to search in and tunes relevance: for example, it can declare that matches found in the title field are more relevant than matches in the post text. The DSL also configures search options, for example, whether to search only for exact occurrences of a word or also for word forms and words with typos.


Reindexer is no exception: it also accepts queries in a DSL. The DSL documentation is available on github.


The textToReindexFullTextDSL function is responsible for converting the text into DSL. It transforms the text like this:


Text entered | DSL | Comments
"Big data" | @*^0.4,user^1.0,title^1.6 *Big*~ +*data*~ | A match in the title field gets relevance 1.6, in the user field 1.0, in all other fields 0.4. Every word is searched in all its word forms, as a prefix or suffix, and with possible typos.
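The real implementation lives in the project sources; below is a minimal sketch of how such a conversion could look. This is my simplified approximation, not the project's actual code:

import "strings"

// Convert a plain user query into Reindexer full-text DSL.
// `fields` carries the per-field boosts, e.g. "*^0.4,user^1.0,title^1.6".
func textToReindexFullTextDSL(fields string, text string) string {
	var dsl strings.Builder
	// Restrict the search to the given fields with their relevance boosts
	dsl.WriteString("@" + fields)
	for i, word := range strings.Fields(text) {
		if i == 0 {
			dsl.WriteString(" ")
		} else {
			// Every following word is required in the document
			dsl.WriteString(" +")
		}
		// *word* - match as prefix or suffix, ~ - also allow typos
		dsl.WriteString("*" + word + "*~")
	}
	return dsl.String()
}

Called as textToReindexFullTextDSL("*^0.4,user^1.0,title^1.6", "Big data"), this sketch produces exactly the DSL from the table above.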

Receiving and loading data


For ease of debugging, the process of fetching and parsing data from Habr and loading it into Reindexer is divided into two separate stages:


Parsing Habr


The DownloadPost function is responsible for downloading and parsing Habr pages. Its task is to download the article with the given ID from Habr, parse the received html page, load the first picture of the article and make a thumbnail out of it.


The result of DownloadPost is a filled HabrPost structure with all the fields, including the comments to the article, plus a []byte array with the picture.


You can see how the parser works on github.
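To give an idea of the approach, here is a hedged sketch of parsing a post page with goquery. The CSS selectors and the function name below are my assumptions for illustration, not the selectors from the real parser:

import (
	"io"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Parse a downloaded post page into the HabrPost model.
func parsePost(id int, page io.Reader) (*HabrPost, error) {
	doc, err := goquery.NewDocumentFromReader(page)
	if err != nil {
		return nil, err
	}
	post := &HabrPost{ID: id}
	// The selectors below are illustrative guesses at the page markup
	post.Title = strings.TrimSpace(doc.Find("h1.post__title").Text())
	post.Text, _ = doc.Find("div.post__text").Html()
	doc.Find("a.hub-link").Each(func(i int, s *goquery.Selection) {
		post.Hubs = append(post.Hubs, strings.TrimSpace(s.Text()))
	})
	return post, nil
}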


In data import mode the application calls DownloadPost in a loop for IDs from 1 to 360000 in several threads, and saves the results to a set of json and jpg files.


With 5 download threads, the whole Habr is downloaded in about 8 hours. Of the possible 360,000 IDs, only about 170,000 return valid articles; for the rest, one error or another is returned.
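Schematically the import loop might look like this. DownloadPost's exact signature and the savePostToFile helper are my assumptions for illustration:

import "sync"

// Download posts 1..360000 in 5 parallel workers and save the
// successfully parsed ones to disk as json + jpg files.
func importAll() {
	const workers = 5
	ids := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				post, img, err := DownloadPost(id)
				if err != nil {
					// Most of the 360k IDs return an error - just skip them
					continue
				}
				savePostToFile(post, img)
			}
		}()
	}
	for id := 1; id <= 360000; id++ {
		ids <- id
	}
	close(ids)
	wg.Wait()
}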


The total amount of parsed data is about 5 GB.


Load data into Reindexer


After the import of Habr completes, we have 170k json files. The RestoreAllFromFiles function is responsible for loading this set of files into Reindexer.


This function converts each saved JSON into a HabrPost structure and loads it into the posts and comments namespaces. Note that the comments are put into a separate namespace so that individual comments can be searched.


It would be possible to do it differently and store everything in one namespace (this, by the way, would reduce the size of the index in memory), but then searching for individual comments would be impossible.
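With comments in their own namespace, they can be queried directly. A hedged sketch of such a query; it assumes the comments namespace also has a composite full-text index named `search`, and a CommentsFt config section analogous to PostsFt (both are my assumptions):

// Search individual comments, independently of posts.
func (r *Repo) SearchComments(text string) ([]*HabrComment, error) {
	it := r.db.Query("comments").
		Match("search", textToReindexFullTextDSL(r.cfg.CommentsFt.Fields, text)).
		Limit(10).
		Exec()
	defer it.Close()
	if err := it.Error(); err != nil {
		return nil, err
	}
	comments := make([]*HabrComment, 0, it.Count())
	for it.Next() {
		comments = append(comments, it.Object().(*HabrComment))
	}
	return comments, nil
}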


This operation does not take long: loading all the data into Reindexer in a single thread takes about 5-10 minutes.
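A minimal sketch of such a loader, assuming one json file per post as described above (the real RestoreAllFromFiles is in the project sources; the function below is my approximation):

import (
	"encoding/json"
	"os"
	"path/filepath"

	"github.com/restream/reindexer"
)

// Load every saved json file into the posts and comments namespaces.
func restoreAllFromFiles(db *reindexer.Reindexer, dir string) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		return err
	}
	for _, file := range files {
		data, err := os.ReadFile(file)
		if err != nil {
			return err
		}
		post := &HabrPost{}
		if err = json.Unmarshal(data, post); err != nil {
			return err
		}
		// Comments go to their own namespace, so they can be searched individually
		for _, c := range post.Comments {
			if err = db.Upsert("comments", c); err != nil {
				return err
			}
		}
		post.Comments = nil
		if err = db.Upsert("posts", post); err != nil {
			return err
		}
	}
	return nil
}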


Configuring the full-text index


The full-text index has a whole set of options. These settings, together with the settings from the DSL, directly determine the quality of the search.


The settings include, among other things: the allowed number of typos, translit search, wrong keyboard layout correction and word-form stemmers.



In our application, the repository's Init function is responsible for setting the search parameters.
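A hedged sketch of what such tuning can look like with the reindexer Go binding. The option names follow FtFastConfig and may differ between versions; the chosen values and the function name are illustrative, not the project's actual Init:

// Configure the full-text index `search` of the `posts` namespace.
func (r *Repo) initFtConfig() error {
	cfg := reindexer.DefaultFtFastConfig()
	cfg.EnableTranslit = true           // find "яндекс" typed as "zyltrc"
	cfg.EnableKbLayout = true           // correct the wrong keyboard layout
	cfg.MaxTyposInWord = 1              // tolerate one typo per word
	cfg.Stemmers = []string{"en", "ru"} // match word forms in both languages
	return r.db.ConfigureIndex("posts", "search", cfg)
}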


About the frontend and a Chrome bug with "infinite" scrolling


The frontend is implemented with vue.js - https://github.com/igtulm/reindex-search-ui


While implementing "infinite" scrolling with on-demand loading of results, we ran into a very unpleasant Google Chrome bug: according to Chrome, fetching the response from the server during scrolling sometimes takes 3-4 seconds.



How so! We have a fast backend with Reindexer that answers within milliseconds, and here it is, 4 whole seconds. We started investigating.


According to the server logs everything is fine: the responses are served in milliseconds.


2018/04/22 16:27:27 GET /api/search?limit=10&query=php&search_type=posts 200 8374 2.410571ms
2018/04/22 16:27:28 GET /api/search?limit=10&offset=10&query=php&search_type=posts 200 9799 2.903561ms
2018/04/22 16:27:34 GET /api/search?limit=10&offset=20&query=php&search_type=posts 200 21390 1.889076ms
2018/04/22 16:27:42 GET /api/search?limit=10&offset=30&query=php&search_type=posts 200 8964 3.640659ms
2018/04/22 16:27:44 GET /api/search?limit=10&offset=40&query=php&search_type=posts 200 9781 2.051581ms

Server logs, of course, are not the ultimate truth, so I looked at the traffic with tcpdump. And tcpdump also confirmed that the server responds within milliseconds.


I tried Safari and Firefox - they do not have this problem. So the problem is clearly not in the backend response time, but somewhere else.


It seems the problem is still in Chrome.


A few hours of googling bore fruit: there is an article on stackoverflow with a workaround.


Adding the magical "workaround" from the article partly fixed the problem in Chrome:


mousewheelHandler(event) {
  if (event.deltaY === 1) {
    event.preventDefault();
  }
}

However, it did not help completely: if you scroll very actively with the touchpad, the delay still occasionally shows up.


A small bonus track instead of a conclusion


Since the publication of the previous article, many new features have appeared in Reindexer. The most important of them is a full-fledged server (standalone) mode of operation.


The golang API in server mode is fully compatible with the API in embedded mode, so any existing application can be switched from embedded to standalone mode by replacing a single line.


This is how the application works in embedded mode, saving data on the local file system in /tmp/reindex/testdb:


     db := reindexer.NewReindex("builtin:///tmp/reindex/testdb") 

This is how the application will work with a standalone server over the network:


  db := reindexer.NewReindex("cproto://127.0.0.1:6534/testdb") 

The standalone server can either be installed from dockerhub or compiled from sources.


Also, we have opened an official Reindexer support channel in Telegram. If you have questions or suggestions, welcome!



Source: https://habr.com/ru/post/354034/

