
I want to share the story of how our project was down for an hour and a half, and what it took to find out why.
At some point we realized that part of the site was loading with a 15-minute delay, and the rest simply did not work, returning a 504 error.
Attention! Since people like to be clever and do not like to read, I will say it right here: the purpose of this post is to describe how to get out of an emergency. Everything else is just background, which for some reason is what everyone focuses on.
I work on a project that uses CouchDB as its database. The site has a “Poster” section where users can add events. In particular, you can add a recurring event by setting the start date and the end date of its period.
When an event is added, an event document is created in the database, and a time interval for each day of the period is added to a separate field of that document. A view built over these intervals is used for output on the site. The view, in effect, simply emits the time intervals from all documents.
So adding a 7-day event produces a document with 7 entries in the period field, and 7 rows in our view.
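A minimal sketch of this document shape, written in Python for illustration. The field names (`periods`, `start`, `end`) and the `view_rows` helper are assumptions; the real schema and the CouchDB map function are not shown in this post.

```python
from datetime import date, timedelta


def build_event_doc(title, start, end):
    """Build an event document with one interval per day of the period.

    Field names are hypothetical; the post does not show the real schema.
    """
    days = (end - start).days + 1  # inclusive of both start and end dates
    periods = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        periods.append({"start": day.isoformat(), "end": day.isoformat()})
    return {"type": "event", "title": title, "periods": periods}


def view_rows(doc):
    """Mimic a map function that emits one row per stored interval."""
    return [(p["start"], doc["title"]) for p in doc["periods"]]


doc = build_event_doc("Concert", date(2024, 5, 1), date(2024, 5, 7))
print(len(doc["periods"]), len(view_rows(doc)))  # prints: 7 7
```

The number of view rows grows linearly with the length of the period, and nothing in this shape limits it.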
Fail
The server did not check the maximum length of an event's period. For some reason this was not foreseen, probably in the hope that only users with a paid account would add events, and that such users would behave responsibly.
Dirty user
Along comes a user with a paid account who, just for fun, adds an event with an end date in the year 2100.
On the server, php-fpm got to work, adding 365 * 100 daily intervals. It did add them, but the user did not wait for the success message. Probably deciding that something had glitched or the connection had dropped, the user submitted the event again with a slightly different time, and the process ran a second time. php-fpm did not create any serious load, but top showed more php-fpm processes than usual, which was confusing and sent us looking in the wrong direction for a while.
As a result, we had 2 documents in the database with 365 * 100 time intervals each. CouchDB started updating the view and could not finish.
The server logs showed something like:
[<0.738.0>] Exit from linked
pid: {<0.742.0>,
{timeout,
{gen_server,call,
[couch_query_servers,
{get_proc,<<"javascript">>}]}}}
Trying to open the database in Futon produced an os_process_error. The Status section in Futon showed an entry that never went away, marked as belonging to the events database (see line 1):

We suspected a bug or a corrupted database, but service couchdb restart did not help, and neither did replacing the database on the server with the latest replicated copy from another server.
After some googling, a solution turned up in the CouchDB mailing list archive: while updating the view, the database was running into os_process_timeout = 5000 (5 seconds). The view simply could not process the document in the time allotted to it. After we raised the value in the config to 15 seconds, the changes were finally applied and the site started working normally.
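For reference, this setting lives in the `[couchdb]` section of CouchDB's ini configuration, and its value is in milliseconds. The Python sketch below only rewrites that key in an ini string; the sample file content is an assumption, and on a real server you would edit the local ini file (its path depends on the installation) and restart CouchDB.

```python
import configparser
import io

# Hypothetical excerpt of a CouchDB ini file; 5000 ms is the value from the post.
SAMPLE_INI = """\
[couchdb]
os_process_timeout = 5000
"""


def set_os_process_timeout(ini_text, timeout_ms):
    """Return ini_text with [couchdb] os_process_timeout set to timeout_ms."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("couchdb"):
        parser.add_section("couchdb")
    parser.set("couchdb", "os_process_timeout", str(timeout_ms))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


print(set_os_process_timeout(SAMPLE_INI, 15000))  # 15 seconds, as in the post
```

Note that `configparser` drops comments when it writes, so for a real config file a manual edit of the single line is the safer option.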
Having found out why the site would not load and returned a 504 error, and having sorted out the database, we finally reconstructed the sequence of events and took measures to prevent this from happening again.
By the way, the 2 offending documents had to be deleted with a quickly written script, because the browser simply refused to open them in Futon and hung, evidently trying to process the array of time intervals.
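The script itself is not shown in this post; a rough Python equivalent might look like the following. The server URL, database name, and document IDs are placeholders. The idea is to read only the revision with a HEAD request (CouchDB returns it in the ETag header), so the huge document body is never downloaded, and then send a DELETE with that revision.

```python
import urllib.parse
import urllib.request


def rev_from_etag(etag):
    """Extract the revision from an ETag header such as '"1-abc"'."""
    if etag.startswith("W/"):  # tolerate a weak-validator prefix
        etag = etag[2:]
    return etag.strip('"')


def doc_url(base_url, db, doc_id, rev=None):
    """Build the document URL, optionally with a ?rev= query parameter."""
    url = "{}/{}/{}".format(
        base_url.rstrip("/"), db, urllib.parse.quote(doc_id, safe="")
    )
    if rev is not None:
        url += "?" + urllib.parse.urlencode({"rev": rev})
    return url


def delete_doc(base_url, db, doc_id):
    """Delete a document without ever downloading its body."""
    head = urllib.request.Request(doc_url(base_url, db, doc_id), method="HEAD")
    with urllib.request.urlopen(head) as resp:
        rev = rev_from_etag(resp.headers["ETag"])
    delete = urllib.request.Request(
        doc_url(base_url, db, doc_id, rev), method="DELETE"
    )
    with urllib.request.urlopen(delete) as resp:
        return resp.status


# Hypothetical usage; the URL, database name, and IDs are placeholders:
# for doc_id in ["event-id-1", "event-id-2"]:
#     delete_doc("http://localhost:5984", "events", doc_id)
```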
We reconstructed the sequence of events in roughly the reverse order of this narration, which was fairly nerve-racking, since I was facing this for the first time (credit to management, who did not run around demanding that the site be found and brought back immediately, and calmly let us deal with the problem).
Based on the above, some conclusions suggest themselves.
- Do not store a lot of data in one document, especially if it is an array that feeds a view. Whether many small documents or a few large ones are better is a separate, debatable question, but overly large documents did not pay off in our case;
- if opening the database gives an os_process_error and the logs mention timeout, try increasing os_process_timeout in the config; this lets the database process the latest changes, start working again, and return results;
- the golden rule, which we, alas, did not follow in this case: validate the data entered by the user, and think like a cunning, malicious user.
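As an illustration of the last point, a server-side check could reject periods that are too long before anything is written to the database. This is a Python sketch, not the project's actual code; the 90-day limit and the function name are assumptions chosen for the example.

```python
from datetime import date

MAX_EVENT_DAYS = 90  # hypothetical limit; pick whatever suits the product


def validate_event_period(start, end, max_days=MAX_EVENT_DAYS):
    """Raise ValueError if the period is reversed or longer than max_days."""
    if end < start:
        raise ValueError("end date is before start date")
    days = (end - start).days + 1  # inclusive day count
    if days > max_days:
        raise ValueError(
            "period of {} days exceeds the limit of {}".format(days, max_days)
        )
    return days


print(validate_event_period(date(2024, 5, 1), date(2024, 5, 7)))  # prints 7

try:
    validate_event_period(date(2024, 5, 1), date(2100, 1, 1))
except ValueError as err:
    print("rejected:", err)
```

With a check like this in place, an event ending in 2100 is refused up front instead of producing a document with tens of thousands of intervals.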
I hope this post saves those who, like us, run into this for the first time from agonizing googling. Our project went down, alas, at peak traffic; I hope this does not happen to you.
P.S. A useful article on Habr:
16 practical tips on working with CouchDB.