
Difficult lessons: five years with Node.js

After five years of working with Node.js, I've learned a lot. I have already shared some stories, but this time I want to talk about the lessons that came hardest: bugs, problems, surprises, and lessons you can apply to your own projects!

Basic concepts


Each new platform has its own quirks, but by now these concepts are second nature to me. Tracking down your own bug is a great way to guarantee that the lesson sticks, even if it is a little painful!

Classes


When I first started working with Node.js, I wrote a scraper. Very quickly I realized that, left unchecked, it would issue a huge number of requests in parallel. That alone was an important discovery. But since I had not yet learned my way around the ecosystem, I sat down and wrote my own concurrency limiter. It worked, ensuring that no more than N requests were active at any one time.

Later, I needed a second level of concurrency limiting to make sure we served only N users at a time. But when I created a second instance of the class, very strange problems began to appear. The logs stopped making sense. In the end, I realized that the class-property syntax in CoffeeScript does not create a new array for each instance, but a single one shared by all of them!
Having used object-oriented languages for a long time, I was comfortable with classes. But I did not fully understand what CoffeeScript's constructs were doing behind the scenes. Study your tools well. Check all your assumptions.
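Here is a minimal sketch of the same trap in plain JavaScript (the original code was CoffeeScript, and the class and field names below are invented): anything placed on the prototype is shared by every instance.

```javascript
class Limiter {
  constructor(maxActive) {
    this.maxActive = maxActive;
    // Per-instance state must be created here, e.g. this.active = [];
  }
}
Limiter.prototype.active = []; // one array shared by ALL limiters

const a = new Limiter(5);
const b = new Limiter(10);
a.active.push('request-1');
console.log(b.active.length); // 1 -- instance b sees a's request!
```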

NaN


While working on a contract, I implemented a sort based on user-supplied parameters, to be used in a multi-stage production pipeline. But we saw really strange behavior: the order was not stable. Each time we sent the same set of user parameters, yet the order of the elements changed!

I was confused. This was supposed to be a deterministic procedure: the same input, the same output. A quick investigation went nowhere, so I finally added detailed logging. And that is where the NaN values surfaced. They had been produced by earlier calculations and wreaked havoc on the sorting algorithm. NaN is not equal to itself or to anything else, so it broke the transitivity that sorting depends on.

Be careful with math operations in JavaScript. It is best to have strong guarantees about the quality of your incoming data, but you can also check the results of your own calculations.
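A small sketch of the hazard (the field name is illustrative): a NaN score makes the comparator return NaN, so the sort order is no longer deterministic. Checking computed values before sorting is one defensive option.

```javascript
const items = [{ score: 3 }, { score: NaN }, { score: 1 }];
items.sort((a, b) => a.score - b.score); // ordering is unspecified once NaN appears

// Fail loudly instead of sorting garbage.
const bad = items.filter((item) => Number.isNaN(item.score));
if (bad.length > 0) {
  console.error('NaN scores detected:', bad);
}
```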

Logic after callback


While working on one of my Postgres-backed applications, I noticed strange behavior after a test failure. If an early test failed, then all subsequent tests timed out! This did not happen very often, because my tests do not fail often, but it did happen, and it began to bother me. I decided to figure it out.

My code used the Node pg module, so I started rummaging through the node_modules directory and adding logging. I found that the pg module needed to do some internal cleanup after completing a request, and it did so after calling the user-provided callback. So if the callback threw an exception, that cleanup code was skipped. As a result, pg was left in a bad state and not ready for the next request. I sent a pull request, which was substantially reworked and released in version 1.0.2.

Get into the habit of calling callbacks last. It is also a good idea to precede them with a return. Sometimes they cannot literally be on the last line, but they should always be the last thing executed.
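A sketch of the pattern described above (the names are invented stand-ins, not the real pg internals): any work placed after the user callback is skipped if that callback throws.

```javascript
const db = { execute: (sql, cb) => setImmediate(() => cb(null, [])) }; // stand-in
const releaseConnection = () => console.log('connection released');    // stand-in

function runQuery(sql, callback) {
  db.execute(sql, (err, rows) => {
    callback(err, rows);   // if user code throws here...
    releaseConnection();   // ...this cleanup never runs
  });
}

// Safer shape: finish internal work first, then call the callback last and return it.
function runQuerySafer(sql, callback) {
  db.execute(sql, (err, rows) => {
    releaseConnection();
    return callback(err, rows);
  });
}

runQuerySafer('SELECT 1', (err, rows) => console.log(err, rows));
```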

Architecture


Bugs can live in individual lines of code, but it is far more painful when the bug is in the application architecture itself...

Blocking the event loop


On one contract, I was asked to take a single-page application written in React.js and render it on the server. After fixing a few pieces that assumed they were running in the browser, everything worked. But I was worried that the synchronous work would bring the Node.js server to its knees, so I added a new metric to the statistics we collected from the server: how long does each page take to render?

When the data started coming in, it was clear that the situation was not good. Most pages were fine, but some took over 400 ms to render. Far, far too long for a production server. My experience with Gatsby prepared me well for the next step: rendering static files.

Think carefully about what you ask of your Node.js server. Synchronous work is really bad news. Rendering HTML requires a lot of synchronous work, and that can bog down the process, not only with React.js but also with lightweight templating tools like Jade/Pug! Even the type-checking phase for a large GraphQL payload can take a lot of synchronous time!
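A minimal sketch of the kind of measurement mentioned above, assuming a React component rendered with react-dom/server (the 100 ms threshold is my own choice): time each synchronous render so slow pages show up in your metrics.

```javascript
const React = require('react');
const { renderToString } = require('react-dom/server');

function renderPage(Component, props) {
  const start = Date.now();
  const html = renderToString(React.createElement(Component, props)); // blocks the event loop
  const elapsed = Date.now() - start;
  if (elapsed > 100) console.warn(`slow render: ${elapsed}ms`);
  return html;
}
```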

Specifically for React.js, the Rapscallion renderer from Dale Bustad demonstrates a promising approach: it breaks up the synchronous work of rendering a React component tree to a string. Redfin's react-server is another, more heavyweight, attempt to solve this problem.

Implicit dependencies


I was well into a contract, delivering features at a good pace. For the next feature, I was told I could look at another Node.js application as a reference implementation.

I looked at the Express handler for the endpoint in question. And I found a whole bunch of references to req.randomThing and even some calls to req.randomFunction(). Then I had to go look at all the middleware that had already run before it in order to understand what was going on.

Make dependencies explicit unless it is absolutely necessary to do otherwise. For example, instead of attaching localized strings to req.messages in middleware, pass req.locale to a directly required var getMessagesForLocale = require('./get_messages'). Now you can clearly see what your code depends on. It works in the other direction too: if you are the developer of random_thing.js, you definitely want to know which parts of the project use your code!
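A sketch of the two styles (the route, field, and file names are illustrative; only ./get_messages comes from the text above):

```javascript
const express = require('express');
const app = express();

// Implicit: some middleware far away attached req.messages, and this handler
// silently depends on it.
app.get('/greeting', (req, res) => {
  res.send(req.messages.greeting);
});

// Explicit: the dependency is required right here, and only req.locale is read.
const getMessagesForLocale = require('./get_messages');
app.get('/greeting-explicit', (req, res) => {
  const messages = getMessagesForLocale(req.locale);
  res.send(messages.greeting);
});
```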

Data, APIs and Versions


The client wanted me to add features to a Node.js API that served as the backend for a large number of installed native applications on tablets and smartphones. I quickly discovered that I couldn't just add a field, because the application developers had programmed defensively: the first thing the app did when fetching data was validate it against a strict schema.

Given that check, and the installed applications themselves, it became clear that two new kinds of versioning would be needed. One for API clients, so they could upgrade and get access to new features. The second for the data itself, so we could be sure all these new features were built reliably on top of MongoDB. While adding all of this to the application, I kept thinking about how that first version should have been designed.

There is something about mutable JavaScript objects combined with a document-oriented DBMS that delights people: "I can create any object in my code, just let me save it somewhere!" Unfortunately, those people seem to wander off after writing the first version. They do not think about the second, third, or fourth versions. I have learned my lesson, so I use Postgres and think about versioning from the very first version.
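A minimal sketch of data versioning under my own assumptions (the field names and the migration itself are invented): every stored document carries a schemaVersion so later code knows how to upgrade it when read.

```javascript
const CURRENT_SCHEMA_VERSION = 2;

function migrate(doc) {
  if (!doc.schemaVersion || doc.schemaVersion === 1) {
    // v1 stored a single `name`; v2 splits it into first/last (invented example).
    const [firstName, ...rest] = (doc.name || '').split(' ');
    doc = { ...doc, firstName, lastName: rest.join(' '), schemaVersion: CURRENT_SCHEMA_VERSION };
  }
  return doc;
}

console.log(migrate({ name: 'Ada Lovelace', schemaVersion: 1 }));
```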

Hardware


As the lead Node.js expert on a large contract project, I was approached by the DevOps specialist to talk about the production servers. Provisioning new machines in the data center took a long time, and he wanted to be sure he had the right plan. I appreciated that.

I nodded when he said that one Node.js process would run on each server. But I stopped nodding when he mentioned that each server had four physical cores. I explained that only one core would actually be used, and he shrugged: those were the only servers available. They had previously been a .NET shop, and these were their standard configurations. Soon after, we introduced cluster.

It is hard to grasp what is going on if you are coming from another platform. Node.js always runs your code on a single thread, while almost every other web server platform scales up with the machine: just add cores. Use cluster, but keep an eye on it; it is not a magical solution.
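A minimal cluster sketch using the standard Node.js API: fork one worker per core so the remaining cores are not left idle.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited; starting another`);
    cluster.fork();
  });
} else {
  http.createServer((req, res) => res.end('hello')).listen(3000);
}
```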

Testing


I have written a lot of JavaScript by now, and I have learned that testing is absolutely, completely necessary. It really is.

"Easy job"


The client explained that their Node.js experts were gone and that I would be replacing them. The company had big plans for this project; that was the main topic of discussion before I started work. Now the client was even more determined: they had realized that things were not in good shape, and this time they wanted to do everything right.

I was disappointed by the level of test coverage, but at least there were some tests. And I was pleased that JSHint was already in place. Just to be sure, I checked the current rule set, and I was surprised to find that the unused option was not enabled. I turned it on and was shocked by the flood of new errors. I spent several hours doing nothing but deleting code.

Programming in JavaScript is hard. But we have tools to make it saner. Learn to use ESLint effectively. With a few small annotations, Flow can help catch incorrect function calls. A lot of benefit for a minimum of effort.
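One possible .eslintrc.js starting point; the exact rule selection here is a suggestion, not the configuration from the project described above.

```javascript
module.exports = {
  env: { node: true, es6: true },
  extends: 'eslint:recommended',
  rules: {
    'no-unused-vars': 'error',   // the same class of check that exposed the dead code
    'consistent-return': 'warn',
    'no-shadow': 'warn',
  },
};
```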

Cleaning up after tests


Once I was asked to help a developer figure out why a test was failing during the test run. When we looked at the mocha output, we could not see any error that explained the failure. When we studied the call stack carefully, it became clear that the error was coming from code completely unrelated to that test.

After digging deeper, it turned out that a previous test had reported success while still kicking off a series of asynchronous operations. The exceptions thrown by that code were caught by mocha's process-level handler and attributed to the currently running test. Further investigation showed that mocks that had not been cleaned up were also leaking into other tests.

If you use callbacks, every test should end with a call to done(). This is easy to check during code review: if there is a nested function anywhere in the test, you probably need done(). There is one subtlety, though: you cannot call a done that was never passed into the test function in the first place. One of those classic coding mistakes. Also use Sinon's sandbox feature; it helps make sure that everything is put back in its place at the end of your test.
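A sketch of both habits, assuming mocha and sinon as dev dependencies (loadUser is a hypothetical function under test):

```javascript
const sinon = require('sinon');
const fs = require('fs');

function loadUser(id, cb) {
  fs.readFile(`/users/${id}.json`, 'utf8', (err, data) => {
    if (err) return cb(err);
    return cb(null, JSON.parse(data));
  });
}

describe('loadUser', () => {
  let sandbox;

  beforeEach(() => { sandbox = sinon.createSandbox(); });
  afterEach(() => { sandbox.restore(); }); // every stub is put back in place

  it('parses the user file', (done) => {     // accept done...
    sandbox.stub(fs, 'readFile').yields(null, '{"name":"Ada"}');
    loadUser('42', (err, user) => {
      if (err) return done(err);
      if (user.name !== 'Ada') return done(new Error('unexpected user'));
      return done();                         // ...and finish with it
    });
  });
});
```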

Mutability


On that contract project, the standard way to run tests was to run unit tests and integration tests separately, at least during local development. Jenkins, however, ran a full pass with both suites together. In one of my pull requests I added a couple of new tests, and they failed on Jenkins. I was very surprised: the tests passed just fine locally!

After some fruitless head-scratching, I switched into investigation mode. I ran exactly the command I knew Jenkins used to run the tests. It took a while, but I managed to reproduce the problem. My head spun as I tried to figure out what could possibly differ between the runs; I had no ideas. Detailed logging to the rescue! After comparing two runs, I managed to spot some differences. After several false starts, the right logging was in place and the picture became clear: the unit tests were mutating key application data that the integration tests relied on!

Bugs of this type are extremely hard to track down. I am proud that I found this one, but it makes me think more and more about immutability. The Immutable.js library is not bad, but you have to give up lodash. And seamless-immutable silently swallows attempted mutations in production builds.

Now you can see why I am interested in Elixir: all data in Elixir is immutable.
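A sketch of the failure mode with invented names: a shared fixture mutated in one suite changes what the other suite sees when both run in one process. Freezing shared data is one lightweight way to surface this immediately.

```javascript
const sharedConfig = { retries: 3, endpoints: ['a', 'b'] };

// Unit test (runs first in the combined Jenkins pass):
sharedConfig.retries = 0; // this mutation leaks out of the test

// An integration test running later now silently starts with retries = 0.

// With the fixture frozen, strict-mode code throws on that assignment
// instead of leaking the change.
Object.freeze(sharedConfig);
```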

Ecosystem


A large part of the benefit of Node.js comes from making effective use of its huge ecosystem: choosing good dependencies and managing them properly.

Dependencies and Versions


I use the webpack-static-site-generator tool to generate my blog. As part of preparing my Git repository for a public release, I deleted the node_modules directory and installed everything from scratch. This usually works, because I pin exact version numbers in package.json. Not this time. Everything broke in the strangest way, with no meaningful error message at all.

Since I have sent a number of pull requests to Gatsby, I know the codebase pretty well. First, I added a few key logging statements, and an error message appeared! It was hard to interpret, though. Then I dug into webpack-static-site-generator and found that it uses Webpack to build one large bundle.js containing the code of the entire application, which is then run under Node.js. Madness! And that is where the error was coming from: deep inside that file, when it was run under Node.js.

From there I quickly followed the trail. A few minutes later I had a specific code snippet that reproduced the same error message. The problem turned out to be a dependency using new ES6 language features that fail under Node.js version 4! That dependency had an unpinned sub-dependency, which pulled in too new a version of punycode.

Lock your entire dependency tree to specific versions with Yarn. If you cannot, at least pin your direct dependencies to exact versions. But know that any loose version ranges remaining in the tree can lead to exactly this situation.

Documentation and Versions


On one project I used Async.js, specifically the filterLimit function. I had a list of paths, and I wanted to hit the file system for each file's stats, which would determine whether that path should stay in the list. I wrote the filter function in the normal asynchronous way, with the standard callback(err, result) signature. But nothing worked.

I turned to the documentation, which at the time lived on the project's main GitHub page. I looked at the description of filterLimit, and the expected signature was there: callback(err, result). I went back to my project and ran npm outdated. I had v1.5.2, and the latest was v2.0.0-rc.1. I was not about to upgrade to that, so I ran npm info async to check whether I had the latest 1.x version. I did.

Still at a loss, I went back to the code and added extremely detailed logging. Useless. In the end, I went to the Async.js source on GitHub. What was this stupid function actually doing? And then I realized what had happened: the master branch on GitHub held the 2.x code. To see the documentation for my installed version 1.5.2, I had to look through the history. When I did, I found the real signature: callback(result), with no way to propagate errors.
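A sketch of the mismatch (async and fs are real modules; the paths are illustrative). Under async 1.x the filter iteratee's callback takes only the boolean result, while 2.x switched to the error-first style the current docs described.

```javascript
const async = require('async');
const fs = require('fs');

const paths = ['./a.txt', './b.txt'];

// async 1.x style: callback(keep)
async.filterLimit(paths, 5, (p, callback) => {
  fs.stat(p, (err, stats) => {
    callback(!err && stats.isFile());
  });
}, (results) => {
  console.log('files that exist:', results); // 1.x final callback: (results)
});

// async 2.x style would be callback(err, keep) with a final (err, results).
```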

Look very carefully at which version of the documentation you are reading. It is easy to get lazy here, because the first documentation you find usually fits: methods rarely change, or you happen to be on the latest version. But it is better to double-check.

What New Relic doesn't understand


I was working on Node.js server performance for a client, so I was given access to their monitoring tool: New Relic. Several years earlier I had used New Relic to monitor a client's Rails applications, and I had also prepared an analysis of the New Relic / Node.js combination for another client, so I generally knew how the system worked and how it integrates with Express and asynchronous calls.

So I dug in. The system had surprisingly comprehensive traces of how incoming requests were handled: the Express middleware and the calls out to other servers. But I searched everywhere and could not find the key per-process metric: the state of the event loop. So I had to resort to a workaround: I manually sent the readings from toobusy-js to New Relic and built a new chart on top of them.
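A sketch of that workaround (newrelic and toobusy-js are real packages; the metric name and one-second interval are my own): sample event-loop lag and report it as a custom metric.

```javascript
const newrelic = require('newrelic');
const toobusy = require('toobusy-js');

setInterval(() => {
  newrelic.recordMetric('Custom/EventLoopLagMs', toobusy.lag());
}, 1000);
```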

With this additional data, I had more confidence in the analysis. Sure enough, spikes in event-loop lag lined up with what New Relic calls time spent in requests. Then I looked closer and was troubled to find that the components did not add up to the total. The overall time spent on a request and the sum of its parts simply did not match. Sometimes there was an 'Other' category that tried to make up the difference; other times it was missing entirely.

Do not use New Relic to monitor Node.js applications. The tool has no concept of the event loop. Its stacked 'average time spent' charts are thoroughly misleading: with a slow event loop, the endpoints it highlights will simply be the ones that block the event loop the most. New Relic can tell you that a problem exists, but it does not help you track down its source.

If you still need reasons not to use New Relic, here you are:

  1. The default views display averages, a poor way of representing real-world data. You have to click past the defaults to get meaningful charts, like the 95th percentile. But do not get too comfortable, because you cannot add those meaningful charts to your dashboards!
  2. It does not understand the use of cluster on a single machine. If you manually send data such as the event-loop lag from toobusy-js, only one value per second wins for the whole server, even if there are four workers.

All clear!


Emotional events are remembered with the greatest clarity and accuracy. That is why each of these situations is firmly lodged in my memory. You will not remember them as well as I do; but maybe if you picture my struggles, it will help?

Surprisingly, I originally wanted to cover twice as many situations in this article, many of which do not relate directly to Node.js. So expect more posts like this!

Source: https://habr.com/ru/post/327058/

