
Introduction
Today I will talk about the design of highly loaded, fault-tolerant systems. The emphasis will be on practical development and juicy facts rather than dry theory. After reading, you will not be afraid to develop a service with a billion users, provided you have enough servers. The topic is very broad, but I will try to be brief and concise.
What do we want to get?
Let's start with a concise statement of the problem and the system requirements; there are not many of them.
- The system should automatically distribute tasks among the servers available to it.
- The simultaneous physical failure (shutdown) of any two servers must not affect the integrity of the system.
- The physical failure of any single piece of equipment must not bring down the whole system.
- Adding a new server must not require any manual configuration.
- A client can interact with any frontend server equally.
Two ways.
There are two ways. The first is Microsoft's way: each service is developed independently, and each group of developers makes its own decisions.
The second is Google's way: there is a single platform and a single system in which the applications live (search engine, mail, groups, and so on).
The second way wins decisively, because when any part of the system is optimized, every service improves at once, and during development you do not need to think about how it will all actually work underneath. The first way is very reminiscent of Krylov's fable about the swan, the crayfish, and the pike.
The main components of the platform.
Any set of components can be started on each server; the number of servers for each component is determined automatically.
- Data store. The state of the system is stored only in it.
- Cache subsystem. Used to temporarily store hot data in memory.
- Lock service. Used to prevent concurrent execution of procedures that must run sequentially.
- Application service. All end services live in it.
- Client interaction service. Includes a web server and a DNS server.
Data store.
The state of the system is stored only here; no other repositories are available to applications. I choose MongoDB. To understand how to approach it, I recommend reading my previous article, MongoDB - we make good coffee; pay special attention to clustering. MapReduce is also important.
Cache subsystem.
It is needed to reduce database load for both simple object lookups and complex queries. I choose memcached. How to make MongoDB and memcached work together was covered in the MongoDB article mentioned above; I advise you to take a look.
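The interplay between the cache and the data store follows the classic cache-aside pattern. A minimal sketch, with plain dicts standing in for memcached and MongoDB (keys and data here are made up):

```python
# Cache-aside pattern: check the cache first, fall back to the database,
# then populate the cache. Simulated with dicts; in production the cache
# would be a memcached client and the store a MongoDB collection.
database = {"user:1": {"name": "Alice"}}   # stands in for MongoDB
cache = {}                                 # stands in for memcached

def get_object(key):
    if key in cache:                       # cache hit: no database load
        return cache[key]
    value = database.get(key)              # cache miss: query the database
    if value is not None:
        cache[key] = value                 # warm the cache for next time
    return value
```

The second call to `get_object("user:1")` never touches the database, which is exactly the load reduction described above.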
Lock service.
Used to prevent concurrent execution of procedures that must run sequentially. Before performing a procedure that requires exclusive access, the application sends a request to the lock service and receives either consent to proceed or a refusal. I choose phpDaemon::LockServer. If one node fails, all clients driven by it switch to another node.
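The consent-or-refusal protocol can be sketched in a few lines. This is an in-process simulation, not the phpDaemon::LockServer wire protocol; lock names and owner ids are hypothetical:

```python
# Minimal lock service sketch: a client asks for a named lock and gets
# consent (True) or refusal (False). The real service runs over the
# network; here a dict tracks which owner holds which lock.
held_locks = {}  # lock name -> owner id

def acquire(name, owner):
    if name in held_locks:
        return False          # refusal: someone else holds the lock
    held_locks[name] = owner  # consent: record the owner
    return True

def release(name, owner):
    if held_locks.get(name) == owner:  # only the owner may release
        del held_locks[name]
```

A second crawler asking for the same host's lock gets a refusal until the first one releases it.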
Application service.
All end services live in it.
OpenVZ containers can act as the application service. Applications can also be asynchronous phpDaemon modules. When developing applications, any heavy operation must be put into a queue.
Client interaction service.
Includes a web server (nginx), a DNS server (bind9), and a firewall through which requests are sent directly to applications. Fault tolerance and load balancing are provided either via DNS round-robin or via BGP after obtaining AS (autonomous system) status.
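DNS round-robin balancing works by rotating the list of frontend addresses on each lookup, so successive clients connect to different servers. A toy sketch of that rotation (the IPs are made up; a real DNS server does this internally):

```python
# DNS round-robin sketch: each lookup returns the frontend list rotated
# by one position, spreading clients across servers.
from itertools import count

frontends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
_lookups = count()  # counts lookups to drive the rotation

def resolve():
    shift = next(_lookups) % len(frontends)
    return frontends[shift:] + frontends[:shift]
```

Clients typically use the first address returned, so the first server in the answer changes on every lookup.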
File storage
Files are stored in the database using the GridFS mechanism; an implementation exists as a FUSE module.
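GridFS stores a file as a metadata document plus a series of fixed-size chunk documents. A sketch of the splitting step, assuming the 255 KB chunk size used as the default in current drivers (the document shape is simplified):

```python
# Split file bytes into GridFS-style chunk documents: each chunk records
# the owning file id, its sequence number n, and its slice of the data.
CHUNK_SIZE = 255 * 1024  # default GridFS chunk size in current drivers

def split_into_chunks(file_id, data):
    return [
        {"files_id": file_id, "n": n, "data": data[i:i + CHUNK_SIZE]}
        for n, i in enumerate(range(0, len(data), CHUNK_SIZE))
    ]
```

Reassembly is just concatenating the chunks sorted by `n`, which is why random access inside large files stays cheap.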
An example of use. A description of the web search service.
This service is divided into several components:
1. Storage of information about documents.
For this, a documents collection is used, with documents containing url, host, type, size, mtime, atime, title, text, description, indexed, and other fields. Sharding is done on the url field.
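Sharding on the url field means every node can compute a document's placement from a stable hash of the url, with no coordination. A sketch; the shard count and the choice of md5 are illustrative, not what MongoDB does internally:

```python
# Pick a shard from a stable hash of the url: the same url always lands
# on the same shard, and documents spread roughly evenly across shards.
import hashlib

SHARDS = 4  # illustrative shard count

def shard_for(url):
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARDS
```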
2. Crawlers, which save the contents of pages to the database.
A crawler requests documents.find({host: ..., indexed: 0}), first acquiring a lock on that host. It fetches each document over the network and adds it to the database, setting indexed: 1.
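One crawler pass can be sketched as follows, with the collection and the lock service simulated in memory and the page fetch passed in as a function (the data here is made up):

```python
# Crawler step: take the host lock, process that host's unindexed
# documents, mark each indexed: 1, and release the lock.
documents = [
    {"url": "http://example.com/a", "host": "example.com", "indexed": 0},
    {"url": "http://example.com/b", "host": "example.com", "indexed": 1},
]
locks = set()  # stands in for the lock service

def crawl_host(host, fetch):
    if host in locks:
        return 0              # refusal: another crawler owns this host
    locks.add(host)
    try:
        crawled = 0
        for doc in documents:
            if doc["host"] == host and doc["indexed"] == 0:
                doc["text"] = fetch(doc["url"])  # a network request in reality
                doc["indexed"] = 1
                crawled += 1
        return crawled
    finally:
        locks.discard(host)   # always release the host lock
```

The per-host lock is what keeps two crawlers from hammering the same site and indexing the same pages twice.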
3. Full-text search.
To effectively search the text, you need a special mechanism that allows you to quickly find documents by specified words.
3.1. Reduction of words to a "normal" form. Words occur in different word forms, and they all need to be reduced to one normal form. For this you need dictionaries (see spelldump from Sphinx), plus heuristic stemmers for words missing from the dictionaries.
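A heuristic stemmer of the kind used as a fallback for out-of-dictionary words can be as simple as suffix stripping. A sketch for English; the suffix list and minimum stem length are illustrative, and a real system would consult the dictionary first:

```python
# Crude heuristic stemmer: lowercase the word and strip the first
# matching suffix, keeping at least a 3-letter stem.
def normalize(word):
    word = word.lower()
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

This is exactly why a query for "girls" below matches documents indexed under "girl".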
3.2. Index storage.
There are two approaches.
The first approach is more often used in search engines with a relatively small number of search queries. It splits a single search index into several parts, and all nodes work on a query in parallel: when a search query arrives, all nodes are polled and their responses are merged. To increase performance under this approach, mirrors of nodes are added and the load is balanced between these mirrors. But it is very inefficient: the larger the index, the more servers are required, and the popularity of words is not taken into account. Many words will never be requested at all.
The second approach distributes the index by word: when we query "hot girls", we first determine which servers are responsible for each of these words, then query only those servers and merge the results. Servers not responsible for these words are not loaded at all. This gives a huge performance gain.
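The routing step of the word-distributed approach can be sketched like this; the server names are hypothetical and the hash choice is illustrative:

```python
# Word distribution: hash each query word to the index server responsible
# for it, then group the query words per server so only those servers are
# contacted.
import hashlib

SERVERS = ["index-0", "index-1", "index-2"]  # hypothetical index nodes

def server_for(word):
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

def plan_query(words):
    plan = {}
    for word in words:
        plan.setdefault(server_for(word), []).append(word)
    return plan
```

For a two-word query the plan touches at most two servers, however large the cluster grows.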
Therefore, we choose the second option. When a document is indexed, it is broken down into words, they are reduced to normal form, and words.upsert({word: ...}, {"$push": {"docs": ...}}) is called for each word.
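The indexing step amounts to maintaining an inverted index: each word maps to the list of documents containing it. A sketch with a dict standing in for the words collection (normalization is omitted for brevity):

```python
# Inverted index build: split the document into unique words and append
# the document id to each word's postings list, mimicking the
# upsert + $push call from the text.
words = {}  # word -> list of document ids

def index_document(docid, text):
    for word in set(text.lower().split()):
        words.setdefault(word, []).append(docid)
```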
3.3. Search.
For a simple search for "hot girls", we query words.find({word: "hot"}) and words.find({word: "girl"}) (note "girl", not "girls"). Then the intersections between the results are computed and ranking is performed.
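The intersection step can be sketched as follows, with hardcoded postings lists standing in for the two find() results:

```python
# Search step: intersect the postings lists of the normalized query
# words; ranking would then reorder the surviving document ids.
postings = {
    "hot": [1, 2, 5],   # document ids containing "hot"
    "girl": [2, 5, 9],  # document ids containing "girl"
}

def search(*words):
    result = set(postings.get(words[0], []))
    for word in words[1:]:
        result &= set(postings.get(word, []))  # keep docs with every word
    return sorted(result)
```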
For effective ranking it is necessary to store the distances between all pairs of words in a document. A wordpairs collection holds objects of the form {docid: ..., word1: "hot", word2: "girl", distance: 1}. This way you can quickly find the documents in which "hot" and "girl" occur closest together.
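Building those wordpairs entries for one document might look like this; the {docid, word1, word2, distance} shape follows the text, and taking the minimum distance between repeated occurrences is an assumption:

```python
# For every pair of distinct words in a document, record the minimum
# distance between their positions, producing wordpairs-style objects.
from itertools import combinations

def word_pairs(docid, text):
    positions = {}
    for i, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(i)  # word -> positions
    pairs = []
    for w1, w2 in combinations(sorted(positions), 2):
        distance = min(abs(a - b)
                       for a in positions[w1] for b in positions[w2])
        pairs.append({"docid": docid, "word1": w1,
                      "word2": w2, "distance": distance})
    return pairs
```

At ranking time, documents where the query words sit closest together can then be promoted with a single lookup.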
Conclusion
In fact, building such a system is not difficult at all. Moreover, I am working on a framework to make creating such systems simple.
It is very difficult to fit such a huge amount of information into one article, but I hope I have managed to convey the essence. I wonder which aspects you will find particularly interesting; I will write about them next time.
The main thing is to understand that if you run into a problem while working with a simpler tool, switching to a ready-made solution that hides such problems does not mean the problem is solved; most likely it is "solved" in a place so deep that you cannot even see it. In general, I am suspicious of products that do not follow the Keep It Simple, Stupid principle. When I cannot control what happens at the elementary level, I know it will not be possible to do everything correctly, since automatic optimizers (of SQL queries, code, and so on) are often mistaken.
For example, in the comments to one of the articles, someone asked whether MongoDB can perform an analogue of the SQL query "SELECT * FROM table ORDER BY RAND() * hosts LIMIT 5". This is of course convenient, but we could just as well select the whole table and then pick what we need in the script; it would not be much slower, because MySQL in this case computes RAND() * hosts for every row, sorts all those rows in memory, and cuts off the first 5 results. Naturally, the larger the table, the slower it gets. So you should always think about what actually happens as a result of your actions; otherwise you will never learn how to build highly loaded systems.
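The client-side alternative described above is trivial to write. A sketch, with a list of dicts standing in for the fetched query result (the rows and column are made up):

```python
# Instead of ORDER BY RAND() over the whole table, fetch the candidate
# rows once and sample five client-side; the database does no per-row
# RAND() computation and no in-memory sort of the whole table.
import random

rows = [{"id": i} for i in range(100)]  # imagine this came from the database

def pick_random(rows, n=5):
    return random.sample(rows, min(n, len(rows)))
```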
Thank you for your attention!