This translation is not meant to discourage the reader from using Node.js (even the most experienced slip up sometimes). It is only a call to be careful, and it may perhaps point those who suddenly see the same behavior in their own application toward a solution. The author's vocabulary is left unchanged and uncensored.
Long story short, MelonCard was featured today on TechCrunch along with other companies, and all of a sudden everything broke. Every little thing. We had just updated the site to make it look and feel more responsive, using long-polling NodeJS with the coolest dynamic frontend built on jQuery Templates and KnockoutJS. We did our best and ran both manual tests and unit tests with Vows. All systems go, full speed ahead? Not quite.
Our NodeJS system takes the user's state, for example “I am waiting for updates to these two records”, and the server (starting with a timestamp check) returns either “Your records are up to date” or “Record xxx has changed to yyy”. In fact, everything is a bit more complicated, with shared Redis variables, sessions, and other security checks tying together Rails, MySQL, Redis, and Node. It all looks crystal clear, but even simple NodeJS code can turn into hell when something goes wrong. That happened today.
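To make the exchange described above concrete, here is a minimal sketch of the comparison step. It is not the MelonCard code: the function name `diffRecords` and the shape of the state objects (record id mapped to a version value) are assumptions made purely for illustration.

```javascript
// Hypothetical helper: compare what the client believes with what the server has.
// Both arguments are assumed to map record ids to some version value.
function diffRecords(clientState, serverState) {
  const changes = {};
  for (const id of Object.keys(clientState)) {
    // Only records the client asked about are checked.
    if (serverState[id] !== clientState[id]) {
      changes[id] = serverState[id];
    }
  }
  return Object.keys(changes).length === 0
    ? { upToDate: true }                      // "Your records are up to date"
    : { upToDate: false, changes: changes };  // "Record xxx has changed to yyy"
}

console.log(diffRecords({ a: 1, b: 2 }, { a: 1, b: 3 }));
```

A long-polling server would presumably hold the request open and run a check like this whenever the data changes, but that part is left out here.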
After the articles about us were published, a stream of delighted users rushed to us (say, 50-100 new users per hour). And suddenly everything fell apart. Pages no longer worked, and our inboxes started filling up with mail from disgruntled users. I poured myself a coffee and got ready for battle.
My first thought was that NodeJS handles load well; that is what it is famous for. Fifty or a hundred users could not bring the system down. And as it turned out, it was not the fault of NodeJS per se, but more on that later. The server started returning completely unexpected answers, as if the user were saying “I have entries a, b and c,” and the server responded, “You are an idiot, delete entries x, y and z, but you need entries a, b and c.” Isolating and reproducing the problem was impossible, given the terrible error handling and debugging capabilities in Node. I had to keep running the following command (yes, right on production):
NODE_ENV='production' node /privacy.js | grep "Returned Results"
You can imagine how terribly hard it was to dig through that pile. It is worth noting that everything kept working perfectly on the test server and all the unit tests passed, so I had nothing more to go on. On top of that, our system carefully checked sessions (for security), and frustrated users were logging in and out in different browser tabs, which produced a huge number of warnings that they were not logged in (and made it impossible to filter out the real errors). I got errors like this:
Trace:
    at EventEmitter.<anonymous> (/—/Node/privacy.js:118:11)
    at EventEmitter.emit (events.js:81:20)
The line it points to (the only one Node reported) was:
process.on('uncaughtException', function (err) {
  console.log(['Caught exception', err]);
  console.trace();
});
Well, at least the application did not crash, but all the same there was nothing to go on. It should be noted that nothing we did (manual testing of the interface, unit testing, error handling, etc.) revealed any errors related to this line. Yes, we should have done load testing, but there is no certainty that it would have exposed this problem either.
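One likely reason the trace was so unhelpful: console.trace() prints the stack of the place where it is called, which here is the handler itself, not the place where the error was thrown. A minimal sketch of a handler that logs the error's own stack instead is shown below. The names `formatUncaught` and `installUncaughtLogger` are made up for this example and are not from the original code.

```javascript
// Build a log line from the error itself. err.stack records where the error
// was created, unlike console.trace(), which reports where it is called.
function formatUncaught(err) {
  const detail = err instanceof Error ? err.stack : String(err);
  return 'Caught exception: ' + detail;
}

// Hypothetical installer: call once at startup.
function installUncaughtLogger() {
  process.on('uncaughtException', function (err) {
    console.error(formatUncaught(err));
  });
}
```

Calling installUncaughtLogger() once at startup would replace the handler shown earlier. Whether it would have pointed at the real bug in this story is another matter, since the mix-up produced wrong answers, not exceptions.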
After four hours of debugging (and switching the home page to a 503 - Temporarily Unavailable), and while my co-founder personally replied with apologies to every frustrated and inquisitive user, I noticed that the server was confusing the parameters of my request with those of other users. To be fair, the server was designed so that only YOUR information is returned in response to YOUR request, but it mixed up what you had asked for. That is, you said "I love apples and melons," and it replied, "Nonsense, you love mango." So everything was secure, but still damn wrong. Why would my ExpressJS server confuse what I had asked it? I started digging and found this:
app.all('/apps/:user_id/status', function(req, res, next) {
  // …
  initial = extractVariables(req.body);
});
See the problem? Yes, it's a total failure. I am not a JavaScript expert, but I will explain it as best I can. In JavaScript, a variable is declared either in the scope of a function or in the global scope (with some subtleties in how lookups pass through nested scopes from the current one up to the global one). When I assigned to “initial” without “var”, the lookup walked up from the current scope, reached the global scope without finding a declaration, and created a global variable “initial” there. When the next request came in, the same lookup happened again and wrote its data to that same variable (the very one the previous request was still going to use). And so it went with every subsequent request. When the server answered a request after some processing, it read from this constantly overwritten variable and returned crazy results. Complete shit. It should have been written like this:
var initial = extractVariables(req.body);
That code creates the variable in the scope of my anonymous function, where no other request can overwrite it. It was an amateur mistake, but it slipped unnoticed past all the debugging and tests I could throw at it.
So this is the moment when you say "you should have used CoffeeScript." And you will be right. In other circumstances, things could have been even worse (what if I had made the same scoping mistake with a session variable?). On top of that, the lack of decent error handling (in Rails, we catch errors and email each unique one to the team) and of decent debugging tools (beyond grep and less) took me back to the years when programming was not a pleasant thing. Or maybe I just needed to be careful.
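Besides switching languages, one guard against this class of mistake is ECMAScript 5 strict mode, where assigning to an undeclared name throws instead of silently creating a global. The sketch below assumes a runtime that supports strict mode; the source does not say whether the Node version in use at the time did. The names `strictHandler` and `leaked` are invented for the example.

```javascript
function strictHandler(body) {
  'use strict';                  // strict mode for this function only
  try {
    leaked = body.wants;         // undeclared name: throws in strict mode
    return 'assigned';
  } catch (err) {
    return err.name;
  }
}

console.log(strictHandler({ wants: 'apples' }));  // → ReferenceError
```

In a real handler the try/catch would not be there; the point is that the very first request would fail loudly, on the test server, instead of quietly sharing state in production.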
After four hours of downtime and several hundred frustrated users, I found the problem and quickly fixed it on the production server. The clouds parted, the birds sang and the sun came out. We started replying to users with apologies, counted our losses and moved on. But I still cannot get over how much damage one missing keyword caused. Does missing a single var really make me flawed?