"Dadata" from 2014 sawing " Tips ". They help you quickly and without error to enter contact information: addresses, details of banks and companies, emails - that's all.
"Tips" are intricately built, and we decided to talk about how. We will use address tips as the example, because they are the most complex.
"Tips" know what to suggest because they rely on giant reference directories. Although this article is about address tips, for completeness I will also list the other Dadata directories.
Tips for | Reference directory | Where to get it |
---|---|---|
Addresses | FIAS | Download from the official site |
Legal entities | EGRUL and EGRIP | Buy annual access from the FTS: 150,000 ₽ per directory |
Banks | Central Bank of the Russian Federation directory of credit institutions | Download from the official site |
Full names | Surnames, first names, patronymics | Compile yourself or find ready-made lists |
Emails | | |
Searching a raw reference directory is slow and thankless work. So we take the wonderful Lucene library and turn the source data into a search index.
A search index is a format in which information can be found very, very fast.
Physically, an index is a collection of two types of files:
The address index and data take up 20 gigabytes in total. The company index is about the same size, and the rest are smaller.
To save space, we strip from the official directories any data that we neither search by nor return. We also remove duplicates and obvious errors. For example, the address index does not include:
The way "Tips" work is fairly intricate. For simplicity, I will split the process into stages and describe each one in more detail. If questions remain, ask in the comments.
1. It begins: a person types characters into the "Tips" field.
2. The "Tips" plugin builds the request. A go-between sits between the person and the server: the "Tips" jQuery plugin (source code on GitHub).
The plugin takes the search input, packages it into a request, and sends it to the server.
The plugin itself adds how many addresses to return. The number is set as a parameter when integrating "Tips". If it is not specified, "Tips" return 10 results. Asking for more than 20 is pointless: only 20 options will come back.
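The count rule is easy to express in code. Below is a minimal sketch in Python, not the plugin's actual code: the name `effective_count` and both constants are hypothetical, and the handling of values below 1 is my assumption, since the article does not cover it.

```python
from typing import Optional

DEFAULT_COUNT = 10  # used when the integration does not specify a count
MAX_COUNT = 20      # the service never returns more than this


def effective_count(requested: Optional[int] = None) -> int:
    """Return how many results a request will actually get."""
    if requested is None:
        return DEFAULT_COUNT
    # Assumption: values below 1 are raised to 1; the article does not say.
    return max(1, min(requested, MAX_COUNT))
```

With these rules, `effective_count()` gives 10 and `effective_count(50)` gives 20.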
The plugin also passes filtering parameters, which are likewise set when integrating "Tips". Here are the available filters:
There is also geoboost. It resembles the restriction to a parent object, but affects only the ranking of addresses. If you want Omsk streets to rank above Moscow ones, go ahead.
By default, geolocation is enabled in the plugin: it sends the user's location to the server. This is also a search parameter.
When integrating, you can adjust the delay before requests go to the server. For example, set the delay to 100 milliseconds. If some virtuoso hammers in four characters within four milliseconds, a single request with all four new characters goes to the server, not four requests one after another.
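This delay is what is usually called debouncing. Here is a rough simulation in Python rather than the plugin's JavaScript: it takes timestamped keystrokes and reports which query texts would actually be sent. The function name `debounce` and the input format are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def debounce(keystrokes: List[Tuple[float, str]], delay_ms: float = 100.0) -> List[str]:
    """Return the query texts that would actually be sent to the server.

    keystrokes: (time in ms, character) pairs in chronological order.
    A request fires only if no new key arrives within delay_ms.
    """
    sent: List[str] = []
    text = ""
    for i, (t, ch) in enumerate(keystrokes):
        text += ch
        is_last = i == len(keystrokes) - 1
        # The pending request survives only if the next key comes late enough.
        if is_last or keystrokes[i + 1][0] - t >= delay_ms:
            sent.append(text)
    return sent


# Four characters within four milliseconds: one request.
print(debounce([(0, "M"), (1, "o"), (2, "s"), (3, "c")]))  # ['Mosc']
# A slow typist: every character becomes its own request.
print(debounce([(0, "M"), (150, "o")]))  # ['M', 'Mo']
```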
The plugin works in IE from version 10 and in all normal browsers. It also requires jQuery 1.10+.
3. Check the cache. When a request arrives at the server, "Tips" first look in the cache. They look for an entry that matches the query in every single parameter.
Caching helps with short queries like “M”, “Mo”, “C”. An enormous number of such identical queries come in. Since each letter typed is a separate request, caching shields the server from millions of hits to the search index.
The cache lives entirely in RAM and holds 100,000 results.
4. Look for matching tips in the index. If the cache has nothing suitable, "Tips" go to the search index.
"Tips" search for addresses by:
The algorithm assumes that only the last word in the request is incomplete or erroneous. If a person typed “Moscow Turch”, "Tips" search for “Moscow Turch*”.
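To make the "last word is a prefix" idea concrete, here is a toy sketch in Python. It is not how "Tips" are implemented: the real service queries a Lucene index, while this version scans a plain list, and it handles only the incomplete last word, not typos. The function names and sample addresses are hypothetical.

```python
from typing import List


def matches(query: str, address: str) -> bool:
    """True if every complete query word appears in the address and the
    last query word is a prefix of some address word."""
    q_words = query.lower().split()
    a_words = address.lower().split()
    if not q_words:
        return False
    *complete, last = q_words
    if any(word not in a_words for word in complete):
        return False
    return any(word.startswith(last) for word in a_words)


def search(query: str, addresses: List[str]) -> List[str]:
    """Linear scan standing in for a real index lookup."""
    return [a for a in addresses if matches(query, a)]


# Hypothetical sample data, for illustration only.
ADDRESSES = [
    "Moscow Turchaninov lane",
    "Moscow Tverskaya street",
    "Omsk Turgenev street",
]
print(search("Moscow Turch", ADDRESSES))  # ['Moscow Turchaninov lane']
```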
If geolocation is disabled in the plugin, then for queries of 1-2 characters "Tips" search only among regions, municipal districts, and cities. The service looks for houses starting from the second word in the query.
"Tips" assign each result a weight. The weight is needed because the algorithm sometimes finds thousands of candidates, especially for short queries, while at most 20 can be returned. So "Tips" sort the results by weight and return the top ones.
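Picking the heaviest few out of thousands of candidates does not require sorting everything. The sketch below uses Python's `heapq.nlargest` for this. The weights themselves are made up: how they are actually computed is not disclosed, and `top_results` is a hypothetical name.

```python
import heapq
from typing import List, Tuple


def top_results(scored: List[Tuple[str, float]], limit: int = 20) -> List[str]:
    """Return up to `limit` addresses, heaviest first."""
    best = heapq.nlargest(limit, scored, key=lambda item: item[1])
    return [address for address, _weight in best]


# Made-up weights, for illustration only.
candidates = [("Omsk", 0.1), ("Moscow", 0.9), ("Mozhaysk", 0.5)]
print(top_results(candidates, limit=2))  # ['Moscow', 'Mozhaysk']
```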
The ranking algorithm is Dadata's know-how. It is such a serious matter that I cannot describe it in detail: the developers would curse me.
5. Sort the results. If search results have the same weight, "Tips" sort them among themselves. The sorting algorithm is also home-grown, so it too stays a mystery.
6. Prepare the response. The addresses that "Tips" return differ slightly in format from FIAS:
7. Cache. Before returning the result, "Tips" cache the request with all its parameters together with the response.
The cache is capped at 100,000 entries using the LRU algorithm, so the service evicts rarely used requests. Popular ones like “Mo” are always in the cache.
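For readers who have not met LRU before, here is a minimal in-memory sketch in Python built on `collections.OrderedDict`. It only illustrates the eviction rule and the "match on every parameter" key. The class `LruCache`, the helper `cache_key`, and the exact set of key fields are my assumptions, not the service's real code.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional, Tuple


class LruCache:
    """Keeps at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity: int = 100_000) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


def cache_key(query: str, count: int, filters: Tuple[str, ...] = ()) -> Tuple:
    """Hypothetical key: two requests match only if every field is equal."""
    return (query, count, filters)


cache = LruCache(capacity=2)
cache.put(cache_key("M", 10), ["Moscow"])
cache.put(cache_key("Mo", 10), ["Moscow"])
cache.get(cache_key("M", 10))            # "M" is now the freshest entry
cache.put(cache_key("C", 10), ["Chita"])  # evicts "Mo", the stalest one
print(cache.get(cache_key("Mo", 10)))  # None
```

A frequently requested key keeps moving to the fresh end, which is why popular queries stay cached while rare ones fall out.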
8. The plugin renders the tips. It receives the response from the server, displays the addresses on screen, and highlights the matches. If you press Enter while typing, the plugin compares the text with the tips found and puts the most suitable one into the field.
That's how it works. If you set out to build your own tips, this article will help a little. Better yet, come work with us, and we will come up with cool things together. Right now we are looking for a Java developer for "Tips" and 7 more specialists.
Source: https://habr.com/ru/post/349872/