
16 practical tips on working with CouchDB

About a year ago, our project reached the point where we had to choose between painstakingly tuning and optimizing the MySQL server and painstakingly reworking the queries that hit the database. It so happened that this coincided with a boom of articles about MongoDB, CouchDB and other NoSQL databases, and the temptation to try them on a live project was extremely strong.

When choosing, the deciding factor was the phrase "CouchDB is designed specifically for the web", as well as the fact that no intermediate layers are required for access: everything is exposed over my favorite REST, and the API looks simple and elegant. On top of that, CouchDB has a very convenient web administration interface, Futon, which MongoDB did not have at the time, as well as rock-solid crash resistance.

Looking ahead, I will say that the choice was completely justified: we got rid of a huge number of problems in developing and designing the database, the project code became much simpler and better structured, but the most important thing was the shift in thinking that CouchDB gave us. Along the way I personally hit quite a few bumps, and I would like to share that experience with the Habr community. These are not tips for beginners; they are tips for using CouchDB in live production.


Use more databases


In many beginner tutorials (including CouchDB: The Definitive Guide), the examples look very nice, but they are completely impractical. The point is that as soon as the number of your documents reaches any real scale, say 100,000 documents in a database, developing temporary views becomes nearly impossible, because the server has to run every document through the map function. On top of that, every map function will contain something like this:
function (doc) {
  if (doc.type == 'photo') {
    ...
  }
}
which amounts to a little reinvented wheel in every view.

The CouchDB logic is such that updating a single document in a database "touches" all of that database's views: absolutely every view will update its ETag when even one document changes. This is another drawback of keeping multiple document types, distinguished by a type field, in one database. At the same time, updating one document does not affect the ETags returned for other documents of that database, since a document's ETag is its latest revision.
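To make the contrast concrete, here is a hedged sketch (the document fields and view purpose are invented for illustration; the emit stub only stands in for CouchDB's real emit so the functions can be run on their own). With one document type per database, the type guard disappears from every map function, and updating, say, a comment can no longer invalidate the photo views.

```javascript
// Map function for a mixed database: every photo view must filter by type.
function mapMixed(doc) {
  if (doc.type === 'photo') {
    emit(doc.created_at, doc.title);
  }
}

// Map function for a dedicated "photos" database: no filtering needed.
function mapDedicated(doc) {
  emit(doc.created_at, doc.title);
}

// Minimal emit stub so the functions can be exercised outside CouchDB.
var rows = [];
function emit(key, value) { rows.push([key, value]); }

mapMixed({ type: 'comment', created_at: 1, text: 'hi' }); // filtered out
mapMixed({ type: 'photo', created_at: 2, title: 'cat' }); // emitted
mapDedicated({ created_at: 3, title: 'dog' });            // emitted
```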

Replication should occur in the same local network


Replication is considered one of CouchDB's competitive advantages. It is started with a POST request and can run in the background. On a live server it became clear that replication only succeeds reliably within a local network. As soon as your servers are far apart, completely intractable glitches begin: dropped connections, inability to fetch changes, and so on. On top of that, the replicator may log a message and calmly pretend that all is well. Hence the advice: replicate data only within a single local network .
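As a minimal sketch of how replication is started (hostnames and database names below are placeholders, not from the project): this is the JSON body that gets POSTed to the /_replicate endpoint, and per the advice above, source and target should sit on the same LAN.

```javascript
// Build the body for a POST to http://<server>:5984/_replicate
// (URLs here are placeholder LAN addresses).
function replicationDoc(sourceUrl, targetUrl, continuous) {
  return {
    source: sourceUrl,
    target: targetUrl,
    continuous: !!continuous  // background replication that follows new changes
  };
}

var body = replicationDoc(
  'http://10.0.0.1:5984/comments',
  'http://10.0.0.2:5984/comments',
  true
);
// POST JSON.stringify(body) with Content-Type: application/json
```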

Use native reduce-functions on Erlang


Do not reinvent the wheel. The documentation often gives reduce functions like this as examples:
function (key, values, rereduce) {
  return sum(values);
}
Try to avoid them and use the native reducers written in Erlang, "_count" and "_sum", which also work much faster than their JavaScript counterparts.
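For example, a design document using the native reducers could look like this (view and field names are illustrative, not taken from the project above):

```javascript
// Sketch of a design document: the reduce field holds the name of a
// built-in Erlang reducer instead of a JavaScript function body.
var designDoc = {
  _id: '_design/stats',
  views: {
    by_city: {
      map: "function (doc) { emit(doc.city, 1); }",
      reduce: '_count'   // native reducer: counts emitted rows
    },
    rating_total: {
      map: "function (doc) { emit(doc.city, doc.rating); }",
      reduce: '_sum'     // native reducer: sums emitted values
    }
  }
};
```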

Think three times before using complex reduce-functions


This point is not covered well in the documentation, which merely says that if you do not use reduce functions, you may well be missing out. When the result set is too large, reduce is called again on its own output, causing a rereduce. But in practice all of this falls apart as soon as your view becomes even slightly complex.

In our project we have a comments database in which we store comments. Each comment lives in a separate document, which also stores the comment's city (ours is a Russian portal spanning several cities) and its so-called affiliation, the belongs field. The task is to bring up the last N discussions. In MySQL the task comes down to something like this:
SELECT * FROM comments GROUP BY belongs, city ORDER BY timestamp DESC
The main problem with views in CouchDB is that they are sorted by key, while we need the newest discussion threads first. This means grouping via group / group_level is no longer an option. That is where we turned to (re)reduce. The reduce function that trims the result set ended up looking like this:
function (key, values, rereduce) {
  if (rereduce) {
    var data = [], meta = [], record, tmp, index, i, j;

    for (i = 0; i < values.length; i++) {
      for (j = 0; j < values[i].length; j++) {
        record = values[i][j];

        // a thread is identified by its belongs + city pair
        tmp = record[2] + '_' + record[3];
        index = meta.indexOf(tmp);

        if (index === -1) {
          meta.push(tmp);
          data.push(record);
        } else {
          // keep the newest timestamp seen for this thread
          data[index][1] = Math.max(data[index][1], record[1]);
        }
      }
    }

    // newest threads first
    data.sort(function (a, b) {
      if (a[1] === b[1]) {
        return 0;
      }
      return (a[1] > b[1]) ? -1 : 1;
    });

    return data.slice(0, 7);
  } else {
    var output = [], i;
    for (i = 0; i < values.length; i++) {
      output.push([values[i]._id, values[i].ts, values[i].belongs, values[i].city]);
    }
    return output;
  }
}
And everything worked, but updating this view was slow. After a single new comment, refreshing the view took 2 seconds on a server with 4 GB of memory and an Athlon 64 X2 5600+ processor ( link ). With a constant stream of comments, a constantly bogged-down database was unacceptable. Right now the database holds 22,000 documents and the view 258,000 rows. Hence the conclusion: use heavy reduce functions only on powerful servers. Otherwise the whole idea becomes pointless.

Cache data through ETag


Fetching data via the "If-None-Match / ETag" pair really is faster than plain fetching, by about a factor of 3 (synthetic tests). Do not forget that when you send the If-None-Match header and get a 304 status, the response body is always empty , since the server assumes you keep the data on your side. In our project we use Memcached for this, with a small, simple wrapper for working with CouchDB ( link )
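Here is a rough sketch of the caching idea (the transport is injected as a function so the example runs without a live CouchDB; the cache shape, URL, and fake revision are invented for illustration):

```javascript
// Conditional GET with an ETag cache: send If-None-Match when we have a
// cached copy, and reuse the cached body on 304.
function cachedGet(url, cache, transport) {
  var cached = cache[url];
  var headers = {};
  if (cached) {
    headers['If-None-Match'] = cached.etag;
  }
  var res = transport(url, headers);
  if (res.status === 304) {
    // Body is always empty on 304: the server assumes we kept our copy.
    return cached.body;
  }
  cache[url] = { etag: res.etag, body: res.body };
  return res.body;
}

// Fake transport: answers 304 whenever the client already has the revision.
function fakeCouch(url, headers) {
  if (headers['If-None-Match'] === '"1-abc"') {
    return { status: 304 };
  }
  return { status: 200, etag: '"1-abc"', body: { total_rows: 42 } };
}

var cache = {};
var first = cachedGet('/db/_design/list/_view/by_name', cache, fakeCouch);
var second = cachedGet('/db/_design/list/_view/by_name', cache, fakeCouch);
// The second call hits the 304 path and returns the same cached body.
```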

Think CouchDB Style


Thinking in CouchDB is a separate skill that takes some understanding. It is not enough to write a converter from %DBMS% to CouchDB; it is important to genuinely get used to thinking in the CouchDB style, and then you will see every task from a completely different angle.

I will give a simple example. There are events that take place at some point in time. If we need to find out in MySQL which events are taking place today, we write a query like this:
SELECT * FROM table_name WHERE UNIX_TIMESTAMP() BETWEEN start_timestamp AND finish_timestamp
Now back to CouchDB, and recall that views have no such thing as the current time. A view is generated once and then merely updated as documents are created / modified / deleted. Accordingly, we have only documents and nothing more. It is important to understand that you must build the view so that the parameter you need can be passed to it as a key; that is, you can only restrict a view by key. The solution here is to build a view in which, for each document, every day the event spans is emitted as a key. Then, to get all events taking place on a given day, you only need to query the view with "?key=current_day".
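A hedged sketch of such a map function (the field names start, finish and title are assumptions, and timestamps are assumed to be aligned to midnight UTC; the emit stub only replaces CouchDB's built-in emit so the function runs standalone):

```javascript
// Emit one row per day the event spans, so the view can be queried
// with ?key="2011-06-15".
function mapEventDays(doc) {
  var DAY = 86400 * 1000;
  // Walk from the start day to the finish day, emitting each date as a key.
  // (Assumes doc.start and doc.finish are midnight-aligned UTC timestamps.)
  for (var t = doc.start; t <= doc.finish; t += DAY) {
    var key = new Date(t).toISOString().slice(0, 10); // "YYYY-MM-DD"
    emit(key, doc.title);
  }
}

var rows = [];
function emit(key, value) { rows.push(key); }

// A three-day event: emits three date keys.
mapEventDays({
  start: Date.UTC(2011, 5, 14),
  finish: Date.UTC(2011, 5, 16),
  title: 'conference'
});
```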

Almost everything you do in SQL can be implemented in CouchDB far more simply and elegantly.


Let me dilute this tabloid-sounding title with an observation of mine from 3 years ago. At one time I worked as a junior programmer at a company that churned out satellite sites. Those sites saw no more than 30 visitors a day, yet each of them ran on a powerful engine with a server-side XSL template transformer. I don't even want to explain why that is stupid. The general idea is that you should always choose whatever best fits the problem at hand. For satellite sites, that means simple HTML pages your CMS can generate; for powerful, high-traffic portals, it is by no means a free Joomla.

Back to the title. Not all programmers understand how their code works, and especially how it interacts with the database. You often see queries with a huge number of JOINs fetching trivial data, and even EXPLAIN will not help such a person figure out which part of the query is slow, because the whole query was written without engaging the brain. Moreover, on a live project everything boils down to simple PRIMARY KEY lookups; every other query becomes a burden, and the skill of composing complex SQL queries turns out to be useless.

At the moment I am deeply convinced that CouchDB makes novice programmers switch their heads on instead of piling up monstrous queries just to make something work. The convenience of reduce functions lets you avoid writing dumb result trimming that risks exhausting memory. Almost everything needed for simple sites with up to 5,000 visitors a day is more elegant and easier to implement in CouchDB: fetching pages by URL, listing news, running a guest book, photo galleries, and so on. At the same time, the fact that UTF-8 is the only supported encoding spares you a whole class of things you would otherwise have to think about during development.

Use utilities to view current actions.


All current activity in CouchDB can be observed. On Mac OS the utility is called CouchDBX . A similar utility exists for Windows. They run a CouchDB server on port 5984 and let you watch requests to the server in real time. On Linux it is enough to start the server in non-daemon mode (the -d option of /usr/bin/couchdb is responsible for daemonizing) and all requests will be printed to the console.

Also, all current actions can be viewed in the “Status” tab of “Futon”.

Do not use CouchDB for frequently updated data.


Everything has its sweet spot, and frequently updated data is not CouchDB's. Repeatedly reading the same data, on the other hand, is its ideal case. Why? When one document in a database is updated, the ETag of every view in that database is reset, meaning they all become stale. For the views this means a rebuild and a new ETag on the next call (i.e. at minimum one extra request per view in the database). At the server level it means the database swells in size, which you will have to fight with the compaction operation.

Do not forget about Compaction


Every update of a document creates a new revision. It also causes the views this document participates in to be regenerated on their next call (adding and deleting documents triggers regeneration too). All old revisions are kept, and you rarely need access to 600 revisions of a document whose current revision is the thousandth. The database keeps growing, and server disk space is not infinite, so do not forget to run the compaction operation for both views and documents. It will reclaim a lot of disk space.
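For reference, a sketch of the HTTP calls involved (the database and design-document names are placeholders; all three are POST requests with Content-Type: application/json):

```javascript
// Enumerate the compaction-related endpoints for one database and one
// design document.
function compactionRequests(db, designDoc) {
  return [
    // Compact the database itself: drops old document revisions.
    { method: 'POST', path: '/' + db + '/_compact' },
    // Compact the views of one design document.
    { method: 'POST', path: '/' + db + '/_compact/' + designDoc },
    // Remove index files of view groups that no longer exist.
    { method: 'POST', path: '/' + db + '/_view_cleanup' }
  ];
}

var reqs = compactionRequests('comments', 'list');
```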

View regeneration: stale=update_after


Before CouchDB 1.1.0, fetching data from a not-yet-generated view was a minor pain. The suggested workaround was to pass the "stale=ok" parameter when querying the view and hang the actual view regeneration on, say, a crontab entry. Starting with version 1.1.0 there is the "stale=update_after" parameter, which behaves like "stale=ok" but triggers an update of the view after the response is returned. Together with plain view reads, this gives us everything needed for fast work even with complex design documents.
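A small sketch of building such view URLs (the database, design document, and view names are illustrative):

```javascript
// Assemble a view URL with query parameters.
function viewUrl(db, ddoc, view, params) {
  var qs = Object.keys(params).map(function (k) {
    return k + '=' + encodeURIComponent(params[k]);
  }).join('&');
  return '/' + db + '/_design/' + ddoc + '/_view/' + view + (qs ? '?' + qs : '');
}

// stale=ok: never blocks, may return outdated rows; the index is only
// rebuilt by some external trigger (e.g. a cron job).
var fast = viewUrl('comments', 'list', 'by_name', { stale: 'ok' });

// stale=update_after (CouchDB 1.1.0+): responds immediately from the
// current index, then schedules a rebuild, so a later call sees fresh data.
var fresh = viewUrl('comments', 'list', 'by_name', { stale: 'update_after' });
```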

Adding or changing a view on a production server affects the neighboring views of its design document


When a view is added to a design document, the design document's index is rebuilt. This means that all the time the new view is being built (say, _design/list/_view/by_name), its neighboring views (say, _design/list/_view/by_age) will be unavailable. Do not forget this when adding a view on a production server.

Install from source. Update often.


As many have come to expect, Ubuntu / Debian maintainers are in no hurry to update packages in the repositories. This means that in Ubuntu Maverick CouchDB is version 1.0.1, and in Lucid it is 0.10, while CouchDB has long been a top-level Apache project and is constantly evolving. The latest version at the moment, 1.1.0, brings a number of new features.

Full text search


I said that CouchDB is suitable for many tasks, but not all; full-text search falls squarely under this exception. Since we cannot pass an arbitrary parameter into a view, we cannot search for arbitrary text in the database, so you cannot build site search on CouchDB alone. There are various workarounds, but they are all reinvented wheels. Frankly, this is not always so bad: it often forces you to understand what your visitors will actually want to search for. One more important point: search is barely needed on a low-traffic site, while on a large portal search must be relevant enough that simple LIKE / LOCATE queries would not do anyway.

A simple solution to this problem is to use Yandex Site Search or Google Custom Search Engine.

A more complex and solid solution is a separate search engine : Sphinx, Apache Solr, Lucene (the couchdb-lucene bridge is mentioned in the documentation). That is really a topic for a separate article, so I will not dwell on it now.

In addition, you should clearly separate full-text search from tag search in your head, even though their URLs look similar.

Geo search


Another weak spot of CouchDB is geo search, for example finding all objects within a radius of N meters. In SQL-like databases this task is handled by a small function that computes the distance between two points from latitude and longitude. In CouchDB we have only one sorting axis, the key, so finding all points inside even a bounding box seems nearly impossible. However, the author of CouchDB mentioned on Twitter that geo search can be implemented the same way it is done in MongoDB, namely with the Geohash idea: any coordinates can be represented as an alphanumeric hash , and the more precise the coordinates, the longer the hash. Thus you can emit the geohash as the key and vary its length in the startkey / endkey parameters to refine the search radius (of course, it is not exactly a radius but a box). There are a great many geohash implementations; you can always read one or write your own .
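A minimal geohash encoder as a sketch (this is the standard public-domain algorithm, not code from CouchDB or the project above): it interleaves longitude and latitude bits and packs every 5 bits into one base32 character.

```javascript
var BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

// Encode a latitude/longitude pair into a geohash of the given length.
function geohashEncode(lat, lon, precision) {
  var latRange = [-90, 90], lonRange = [-180, 180];
  var evenBit = true, bits = 0, bitCount = 0, hash = '';

  while (hash.length < precision) {
    var range = evenBit ? lonRange : latRange;
    var value = evenBit ? lon : lat;
    var mid = (range[0] + range[1]) / 2;

    bits <<= 1;
    if (value >= mid) {
      bits |= 1;
      range[0] = mid;   // keep the upper half of the interval
    } else {
      range[1] = mid;   // keep the lower half of the interval
    }

    evenBit = !evenBit;
    if (++bitCount === 5) {
      hash += BASE32.charAt(bits);
      bits = 0;
      bitCount = 0;
    }
  }
  return hash;
}

// A longer hash is a refinement of a shorter one, which is exactly what
// makes startkey/endkey prefix queries on the view work.
var coarse = geohashEncode(42.6, -5.6, 3); // "ezs"
var fine = geohashEncode(42.6, -5.6, 5);   // "ezs42"
```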

Data backups


Backups are one of the things to love CouchDB for. A backup is made by simply copying the database files from the /var/lib/couchdb directory. Remember that you may copy the files only while the CouchDB server is stopped , otherwise all your copied database files will be corrupted. The general procedure is therefore:
  1. turn off the CouchDB server
  2. copy the databases we need
  3. enable CouchDB server
Replication, in contrast, is performed while the server is running. Files with the *.couch extension contain all the documents of the corresponding databases. The .%database_name%_design directories contain the generated views of the corresponding databases. If you do not copy the view directories, nothing terrible happens: the views will simply be regenerated on their first request.

Do not forget that all database and view files must belong to the couchdb user, so check file permissions after copying and fix them, if necessary, with the chown utility.



The post is written in 1999 and published at his request.

Source: https://habr.com/ru/post/123338/

