These principles work everywhere - MySQL, Oracle, or even a homegrown database. The smarter the database, the more it tries to optimize itself, but it is still better to help it.
1. Divide and conquer - or simply cluster the database. All data of the same type can be further divided into clusters: separate tables, where each table receives the records that satisfy some simple rule, for example, the table with index I gets the data with ID % N == I, where N is the number of clusters. This is a very simple and effective way to split data that does not need to be read sequentially - for example, we split all the words into blocks of 100-200 million each, where block I holds only the words for which ID % N == I. In a large system such as a social network, you can split all the data by the user it belongs to - for example, spread all photos across N tables, so that a photo's information goes into table K = USER_ID % N.
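A minimal sketch of this modulo-based sharding, assuming hypothetical shard tables named photos_0 through photos_15 and N = 16:

```cpp
#include <cstdint>
#include <string>

constexpr uint32_t kNumShards = 16;  // N, the number of clusters (assumed value)

// K = USER_ID % N: all photos of one user land in the same shard table,
// so per-user reads never have to touch the other N-1 tables.
std::string PhotoTableFor(uint64_t user_id) {
    uint32_t k = static_cast<uint32_t>(user_id % kNumShards);
    return "photos_" + std::to_string(k);
}
```

Applying the same rule on every read and write keeps each table at roughly 1/N of the total size.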
2. Roughly speaking: work with the disk correctly. Always write (insert) sequentially, cache and buffer writes, and try to read in order from beginning to end. The speedup from writing correctly can be simply fantastic - many orders of magnitude - just because you use writes the way the write algorithm of your disk (or its manufacturer) expects. You can almost always sort the data before writing - in memory, even when it is spread across different files as chunks of text: you can always build an index, or a simple array sorted by data ID, and then read and write the records in the same order as the index. And you can always think up a more optimal storage structure. For example, when a chunk of one table has to be inserted into another, it is better to do it sequentially, from the smallest ID to the largest, with the indexing mechanism disabled - and to re-enable it after the insert.
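A minimal sketch of the sort-then-write idea, assuming a hypothetical fixed-size Record with a numeric payload:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record {
    uint64_t id;
    uint32_t payload;  // stand-in for the real record data (assumption)
};

// Sort in memory first, then write everything in one sequential pass,
// so the disk sees a single append-only stream instead of random seeks.
bool WriteSorted(std::vector<Record>& records, const char* path) {
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.id < b.id; });
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written =
        std::fwrite(records.data(), sizeof(Record), records.size(), f);
    std::fclose(f);
    return written == records.size();
}
```

Sorting once in memory turns thousands of scattered writes into one sequential pass, which is where the orders-of-magnitude gain comes from.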
3. Do not keep on disk what you need often - load it into memory. These days a gigabyte, or even two, fits into memory easily. If everything does not fit, break the data into pieces and work on one piece at a time. No memcached or its analogs will help with such a task: only you know in what order the data will be processed, so you can write a solution ten times faster than the standard utilities. Many "old" structures are well suited for laying the data out (a sketch of the hashing approach follows the list):
- hashing (all data is split according to a rule, for example CRC % N)
- AVL and B-trees (do yourself a favor: take paper, a pen, and C/C++ and implement them yourself from the definitions alone, without reading the algorithm - work it out on your own. It sharpens your wits and raises your overall level)
- a heap with ordered extraction (in effect, a priority queue)
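A minimal sketch of the hashing rule, combining both ideas above: keys are split into N buckets by hash % N (FNV-1a here as a stand-in for CRC), and each bucket can then be loaded and processed on its own when the whole set does not fit in memory. The bucket count is an assumed value:

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t kBuckets = 256;  // N, the number of pieces (assumed value)

// FNV-1a, used here only as a stand-in for CRC.
uint32_t Hash32(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Every key goes to bucket hash % N; each bucket can be processed
// independently, one piece at a time.
std::vector<std::vector<std::string>> Partition(
        const std::vector<std::string>& keys) {
    std::vector<std::vector<std::string>> buckets(kBuckets);
    for (const std::string& k : keys)
        buckets[Hash32(k) % kBuckets].push_back(k);
    return buckets;
}
```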
4. Use pointers and linked structures. An example from social networks: among 100 million photos it is hard to find the ones in which a particular person is tagged, but it is easy to store the list of such photos in the information about the person himself. Similarly, it is impossible to keep in memory all the links to the pages known to the search engine, but the URL of a site's root fits easily. By computing a CRC or a root ID from any link (in whatever way you like), you can quickly find the place in the file where all the known links of that site are stored and check whether the given one exists.
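A minimal sketch of that root-pointer lookup, under assumed file-layout parameters: the link file is divided into fixed-size slots, one per site, indexed by a hash of the site root (FNV-1a again standing in for CRC):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

constexpr uint32_t kSlots = 1u << 16;  // slots in the link file (assumed value)
constexpr long kSlotBytes = 4096;      // bytes reserved per site (assumed value)

// FNV-1a hash of the site root, standing in for CRC.
uint32_t HashRoot(const std::string& root) {
    uint32_t h = 2166136261u;
    for (unsigned char c : root) { h ^= c; h *= 16777619u; }
    return h;
}

// Jump straight to the region holding this site's known links and scan
// only that slot for the URL, instead of scanning the whole file.
bool SiteHasLink(std::FILE* links, const std::string& root,
                 const std::string& url) {
    long off = static_cast<long>(HashRoot(root) % kSlots) * kSlotBytes;
    if (std::fseek(links, off, SEEK_SET) != 0) return false;
    std::string slot(kSlotBytes, '\0');
    std::size_t n = std::fread(&slot[0], 1, kSlotBytes, links);
    slot.resize(n);
    return slot.find(url) != std::string::npos;
}
```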
I will try to add all new nuances to this post, except for the very simple or trivial rules. I will not describe how to set up indexes in MySQL - there are hundreds of manuals for that, and many ways to check whether your solution actually works. If you still do not know such simple things, then I am not sure the information in this post was useful to you...
And one more thing: I am of the opinion that in any task there is something to optimize, and if performance is critical, you should work on the optimization - and, along the way, on your own education in the topic.
The full table of contents and the list of my articles on the search engine will be kept updated here:
http://habrahabr.ru/blogs/search_engines/123671/