If you are not interested in reinventing the wheel, please read no further, and do not throw stones at my back.
Anyone who has something to say on the subject is always welcome.
Now I am going to go through the main issues I need to settle in order to scale the system.
For successful scaling, the system, data included, needs to be broken down into elementary “bricks”, and the breakdown itself should be as simple as possible: the simpler it is, the less confusion later. A good partitioning can also pay off when solving other problems. This has already happened in my practice, when a detailed data structure made it possible to solve new problems that nobody had thought of at the start of the project. To be honest, though, that thoroughness grew out of plain laziness: things were built so that one brick could easily be replaced with another without the rest of the structure noticing.
End of the lyrical digression.
In my opinion, the basic unit of information in search should be the site. Presumably the major search engines work this way too, and if they do not, I feel sorry for them. It is scary to imagine that a search within the Yandex catalog is not a search over a group of sites but a filtering of the global search results. Or that Google implements result filtering for China not by switching off the unwanted sites but by thinning out the results. Then again, I would not be surprised if it simply builds a separate “Chinese” index.
So, what do we gain from storing the index by site and being able to access it by site?
1. The ability to provide search as a service to individual sites. The large search engines do offer search restricted to a single site, but for some reason the sites themselves do not use it and prefer to install local search engines. At the very least this market (local search engines) exists, and it can be used to mutual benefit: the sites become a platform for testing and for breaking in new functionality.
2. The ability to search within a group of sites, as in the Yandex catalog. The idea is not new, but it is unlikely ever to become irrelevant.
3. The ability to exclude unwanted sites from the search. For example, a "family search" that children can use. Hardly any parent would want porn sites to show up in the results, even by accident.
In other words, a site-by-site organization of the index gives ample room for inclusive and exclusive filtering (including or excluding individual sites or entire groups).
4. This idea is perhaps the most seditious: you do not need a backup! Instead of restoring from a backup, you rebuild the index from scratch. That takes longer than restoring from a backup, but it cuts hardware costs, since there is no need to keep what is effectively a second copy of the index. While you are working with a single site, keeping a backup is not much of a burden. But as volumes grow, the problem of storing and maintaining backups grows at the same pace.
I do not intend to abandon backups completely, only to limit them to the critical parts: key directories and indices. First, the volume of that data is much smaller; second, losing it would be a real catastrophe.
5. Mobility. Moving part of the index to another server is quick and painless, which greatly simplifies upgrading the server fleet. That matters if we intend to develop the project for a long time.
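The points above can be sketched in a few lines. This is only a minimal illustration, assuming a toy in-memory inverted index and a whitespace tokenizer; the names `SiteIndex`, `SearchEngine` and `rebuild_site` are hypothetical and not taken from any real system. It shows the key property: the set of sites is chosen first, and only those bricks are searched, rather than filtering a global result afterwards.

```python
from collections import defaultdict


class SiteIndex:
    """One 'brick': an inverted index (term -> set of URLs) for a single site."""

    def __init__(self, site):
        self.site = site
        self.postings = defaultdict(set)

    def add(self, url, text):
        # Naive whitespace tokenizer, good enough for the illustration.
        for term in text.lower().split():
            self.postings[term].add(url)

    def search(self, term):
        return self.postings.get(term.lower(), set())


class SearchEngine:
    """Keeps one SiteIndex per site and searches only the selected ones."""

    def __init__(self):
        self.sites = {}  # site name -> SiteIndex

    def add_page(self, site, url, text):
        self.sites.setdefault(site, SiteIndex(site)).add(url, text)

    def search(self, term, include=None, exclude=()):
        # include=None means "all sites"; exclude removes sites from that set.
        chosen = list(self.sites) if include is None else include
        hits = set()
        for site in chosen:
            if site in exclude or site not in self.sites:
                continue
            hits |= self.sites[site].search(term)
        return sorted(hits)

    def rebuild_site(self, site, pages):
        # Point 4: instead of restoring a backup, re-index one site from
        # scratch and swap the new brick in; other sites are untouched.
        fresh = SiteIndex(site)
        for url, text in pages:
            fresh.add(url, text)
        self.sites[site] = fresh
```

With this layout, search for a single site is `search(term, include=[site])`, a catalog is an `include` list, and a "family search" is an `exclude` set.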
How many of these index bricks go on a single server is decided by the resources available, and that is the next topic.
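As a rough sketch of the mobility point, placement can be kept as a simple map from site to server. Everything here is an assumption made for illustration: the `Cluster` class, the server names, and especially the idea of measuring capacity as a plain count of bricks (a real system would weigh disk, memory and load).

```python
class Cluster:
    """Tracks which server holds which site brick."""

    def __init__(self, capacity):
        # capacity: server name -> max number of bricks (hypothetical unit)
        self.capacity = dict(capacity)
        self.placement = {}  # site -> server

    def load(self, server):
        return sum(1 for s in self.placement.values() if s == server)

    def assign(self, site):
        # Put the brick on the least-loaded server that still has room.
        free = [s for s in self.capacity if self.load(s) < self.capacity[s]]
        if not free:
            raise RuntimeError("no server has free capacity")
        server = min(free, key=self.load)
        self.placement[site] = server
        return server

    def drain(self, server):
        # Retire a machine: close it for new bricks, then move its
        # bricks elsewhere one by one. Other bricks are not touched.
        self.capacity[server] = 0
        moved = {}
        for site, current in list(self.placement.items()):
            if current == server:
                moved[site] = self.assign(site)
        return moved
```

Because each brick is self-contained, draining an old server is just a series of independent moves, which is what makes replacing machines painless.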
P.S. For now I am not considering what to do if a site's index is too large for one server.
First, there are not that many such sites, and it will be possible to think about this once we get close to them.
Second, this problem can be solved in parallel, without interfering with the main system and without disrupting it with rework.
UPD: The case where a site wants not only a site-wide search but also the option to restrict the search to one or several sections fits the proposed structure fine. The site-site_group-group_of_groups-... scheme is simply replaced by the group-group_of_groups-...-site scheme.
Both go by the same general name: a hierarchical structure. The main question is what the basic limits should be: how many levels of nesting, and how many children can a single section have? Having no limits gives flexibility but hurts speed. Hard limits let you work with fixed-length lists, which speeds things up. The main task is to choose limits that are satisfied in most cases. Thanks to Infanty for developing the idea.
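To show why hard limits help, here is a small sketch under assumed limits. The values `MAX_DEPTH = 4` and `MAX_CHILDREN = 256` are invented for the example, as are the function names and the choice of levels (for instance group, site, section). With fixed limits every node gets a fixed-length path that packs into one integer, and "everything under this node" becomes a simple range check.

```python
MAX_DEPTH = 4       # hypothetical limit on nesting levels
MAX_CHILDREN = 256  # hypothetical limit on children per node (ids 1..255)


def make_path(*parts):
    """Pad a path such as (group, site, section) with zeros to fixed length."""
    if len(parts) > MAX_DEPTH:
        raise ValueError("path is deeper than MAX_DEPTH")
    if any(not 1 <= p < MAX_CHILDREN for p in parts):
        raise ValueError("child id out of range")
    # 0 is reserved to mean "no node at this level".
    return tuple(parts) + (0,) * (MAX_DEPTH - len(parts))


def encode(path):
    """Pack a fixed-length path into one integer, one 'digit' per level."""
    key = 0
    for p in path:
        key = key * MAX_CHILDREN + p
    return key


def subtree_range(*prefix):
    """Return the (low, high) key range covering the node and its descendants."""
    low = encode(make_path(*prefix))
    span = MAX_CHILDREN ** (MAX_DEPTH - len(prefix))
    return low, low + span - 1


def in_subtree(key, *prefix):
    low, high = subtree_range(*prefix)
    return low <= key <= high
```

For example, the key of section 3 of site 2 in group 1, `encode(make_path(1, 2, 3))`, falls inside `subtree_range(1)` and `subtree_range(1, 2)` but outside `subtree_range(2)`. The price is exactly the trade-off described above: a site with more than 255 sections or more than 4 levels does not fit.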