Hi, Habr.
Last time I told you how we wrote a RAW converter in JavaScript, and you told me that it was slow. Today I want to talk about how we sped up raw.pics.io almost 3 times. I will not post a wall of code describing every step; instead I will outline the general optimization approaches we used. I also decided not to write about DOM access, reducing the number of HTTP requests, concatenating and minifying files, server-side compression options, and so on. All of that is routine technical work that has already been written about well and at length.
How it all began
We began by finding the critical sections of code that take up most of the execution time. Almost every program has nasty chunks that slow it down the most, and that is where you should start. A developer usually has a rough idea of where the time goes in his code, so finding the bottleneck is not hard; you only need to want to. Besides this sacred knowledge, we used a profiler and measured execution time with console.time(). Browsers have also recently gained a convenient performance object for this purpose.
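For illustration, here is roughly how such a measurement looks; decodeRaw and buffer are placeholders, not code from our converter:

```javascript
// Coarse timing of a suspect code path with console.time()
console.time('decode');
decodeRaw(buffer);               // placeholder for the function being measured
console.timeEnd('decode');       // prints something like "decode: 1234.567ms"

// The same measurement with the High Resolution Time API,
// which gives sub-millisecond precision
var start = performance.now();
decodeRaw(buffer);
console.log('decode took ' + (performance.now() - start).toFixed(2) + ' ms');
```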
Technology optimization
Technological optimization is probably the most global of all the approaches to improving performance. You can change the language, the compiler, or the whole technology stack to achieve the desired result. In our case, the most significant performance gain came from switching to parallel computing.
SIMD (Single Instruction, Multiple Data)
With MMX and SSE you can achieve significant acceleration thanks to fast operations on vectors. Such facilities have long existed in many languages, and recently Intel introduced a language extension that adds them to JavaScript. This news delighted us. We searched around and even found an implementation in a nightly build of Firefox. But the joy was premature: after a couple of tests we realized that it cannot be used yet. The SIMD implementation we tried is currently very slow. We continue to follow its development, but for now the technology is too raw to use.
WebCL
Another potentially interesting technology that we would really like to use is WebCL. A couple of months ago the Khronos Group published the first specification describing how to harness GPU power for parallel computing in the browser. It sounds great, but so far it exists only as synthetic tests in specially built browsers or as plug-ins. I hope we will be able to test it towards the end of the year. For now it is not usable at all. =(
WebWorkers
In the end, the debayering task fit perfectly onto WebWorkers, which turned out to be very convenient and easy to work with. We split the whole image into chunks and process each one in its own thread. At first we wrote our own wrapper to manage the workers; now we use the Parallel.js library. We had to take care of cutting the image up and then gluing the pieces back together. Keep in mind that initializing each worker takes some time, and there are difficulties with passing certain data types to a worker and back. The main problem, though, is that WebWorkers are inconvenient to debug: at the moment only Chrome supports it. Another interesting question is how to determine the optimal number of workers. We found a correlation with the number of CPU cores, but browsers still offer no way to detect that number. A sketch of the chunking approach is shown below.
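A minimal sketch of the idea with the plain Worker API, assuming a hypothetical worker script process-chunk.js that processes its slice of pixels and posts it back (this is not our actual wrapper and not the Parallel.js API):

```javascript
// main.js — split a large pixel buffer into chunks and hand each one to a worker
var WORKER_COUNT = 4;                        // picked empirically, see the note on cores above
var pixels = new Uint16Array(rawPixelData);  // rawPixelData: placeholder for decoded sensor data
var chunkSize = Math.ceil(pixels.length / WORKER_COUNT);
var result = new Uint16Array(pixels.length);
var finished = 0;

for (var i = 0; i < WORKER_COUNT; i++) {
  var worker = new Worker('process-chunk.js');         // hypothetical worker script
  var chunk = pixels.slice(i * chunkSize, (i + 1) * chunkSize);

  worker.onmessage = (function (offset) {
    return function (e) {
      result.set(new Uint16Array(e.data), offset);     // glue the processed chunk back in place
      if (++finished === WORKER_COUNT) {
        onAllChunksDone(result);                       // placeholder callback
      }
    };
  })(i * chunkSize);

  // Transfer the chunk's buffer instead of copying it
  worker.postMessage(chunk.buffer, [chunk.buffer]);
}
```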
Of everything we tried, WebWorkers turned out to be the most workable option, and they are what powers raw.pics.io now.
Algorithm level optimization
One of the most important approaches is choosing the right architecture and fast algorithms. We process arrays of 20,000,000+ elements, so speeding up the processing of each element reduces the overall execution time. Naturally, speeding up or getting rid of unnecessary operations helps a lot. That is why we analyzed and compared many interpolation algorithms, rewrote them several times, replaced some mathematical operations with bit shifts, and removed checks and conditions that did not affect the result. Those milliseconds, multiplied by a huge number of iterations, added up to significant savings. For example, we sped up the algorithm by skipping insignificant parts of the photo (inactive areas of the camera sensor) and by dropping redundant checks.
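A toy illustration of the kind of micro-rewrite we mean (not our actual interpolation code): a division by a power of two can be replaced with a bit shift when the operands are known to be non-negative integers:

```javascript
// Example sensor values; in reality these come from the hot interpolation loop.
var left = 1023, right = 2047;

// Straightforward version: a function call plus floating-point division.
var avg = Math.floor((left + right) / 2);

// The same thing with a bit shift — no call, no floating point.
// Safe here because left and right are non-negative integers below 2^16.
var avgFast = (left + right) >> 1;
```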
Optimization of structures and data types
Typed Arrays
Modern browsers provide fast typed arrays, which give a good performance boost compared to regular arrays. They are not applicable everywhere, but for operations on binary data they are a real breath of fresh air. All of our calculations are now built on typed arrays; without them it would have been impossible to reach the speed we have now.
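A simple sketch of the difference (the 16-bit sample depth here is an assumption for illustration):

```javascript
// A regular array stores boxed values that can change type at any time.
var plain = new Array(1024 * 1024);

// A typed array is a fixed-size view over a contiguous binary buffer —
// exactly what decoded sensor data is.
var samples = new Uint16Array(1024 * 1024);   // one 16-bit sample per photosite

// Typed arrays also let you reinterpret raw bytes without copying them,
// e.g. when reading values straight out of the downloaded file:
var bytes = new Uint8Array(16);               // stand-in for a slice of the file's ArrayBuffer
var view = new DataView(bytes.buffer);
var firstSample = view.getUint16(0, false);   // big-endian read of the first 16-bit value
```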
Simple structures
The very first version of our decoder contained beautiful data structures, with a class hierarchy and a bunch of modules. Correct as that was from an OOP point of view, it slowed down both initialization and access to objects. After stepping back and rethinking the structure once more, we simplified everything down to a few modules. This denormalization reduced the number of modules and the connections between them, which made the code a bit harder to understand, but we did it consciously, for the sake of performance.
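Roughly the kind of simplification we mean, in schematic form (these are not our real decoder classes):

```javascript
// Before: a pretty hierarchy where every pixel is an object.
function Pixel(r, g, b) { this.r = r; this.g = g; this.b = b; }
function DecodedImage(width, height) {
  this.width = width;
  this.height = height;
  this.pixels = [];                        // millions of small objects to allocate and chase
}

// After: one flat typed array plus a couple of plain fields.
var image = {
  width: 4000,
  height: 3000,
  data: new Uint16Array(4000 * 3000 * 3)   // r, g, b interleaved; index math instead of objects
};
```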
Language level optimization
Nicholas Zakas has a great series on JavaScript performance. I will not list everything we did, but I will describe the main points. The cost of slow code is the sum (well, the product, really =)) of the cost of an operation and the number of such operations. If we cannot reduce the number of iterations, then it is worth reducing the cost of each operation. At every step we had a function call, or even two. Calling a function is a fairly expensive operation, and it makes sense to avoid calls inside large loops. JavaScript has no mechanism (as C++ does) to tell the compiler that a function should be inlined, so we denormalized the code and got rid of those calls by hand. It became less readable, but faster. Thanks to this approach we significantly improved the performance of the big loops.
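A schematic illustration of this kind of manual inlining (the names and the clamp logic are placeholders, not our decoder code):

```javascript
var data = new Uint16Array(1024 * 1024);   // placeholder input
var offset = 16;

// Before: a helper called millions of times inside the hot loop.
function clamp(value) {
  return value < 0 ? 0 : (value > 65535 ? 65535 : value);
}
for (var i = 0; i < data.length; i++) {
  data[i] = clamp(data[i] + offset);
}

// After: the helper's body is pasted straight into the loop.
for (var j = 0; j < data.length; j++) {
  var v = data[j] + offset;
  data[j] = v < 0 ? 0 : (v > 65535 ? 65535 : v);
}
```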
You should also keep in mind that your loops contain invariants: values that do not change from one iteration to the next. These invariants should be hoisted out of the loop. A simple example is sketched below.
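In this illustration (image and process are placeholders) the invariants are the pixel count and the property lookup:

```javascript
// Before: the invariant expression is recomputed on every iteration.
for (var i = 0; i < image.width * image.height; i++) {
  process(image.data[i]);
}

// After: compute it once, outside the loop.
var total = image.width * image.height;
var data = image.data;
for (var i = 0; i < total; i++) {
  process(data[i]);
}
```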
In general, there are a lot of similar optimizations, and they all boil down to the same thing: reducing the cost of each operation.
Another good example is the so-called lookup table (LUT), where we cache certain values instead of recalculating them at every step of the loop. Some values (pixel brightness in our case) repeat, and there is no point in computing them on every iteration. A LUT is not always justified, though: sometimes, when the elements vary greatly from iteration to iteration, the total cost of building the table is higher than the cost of computing the elements directly.
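A small sketch of the idea, using a made-up gamma curve over 16-bit brightness values:

```javascript
// Build the table once: one entry for every possible 16-bit brightness value.
var GAMMA = 1 / 2.2;
var lut = new Uint8Array(65536);
for (var v = 0; v < 65536; v++) {
  lut[v] = Math.round(255 * Math.pow(v / 65535, GAMMA));
}

// Inside the hot loop an expensive Math.pow() becomes a single array read.
function toDisplayValue(sample) {   // sample is a 16-bit sensor value
  return lut[sample];
}
```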
Platform level optimization
JavaScript engine implementations vary from browser to browser, and the same code performs differently in Chrome and Firefox. I will not go into this, since we have not done this kind of optimization yet, but I am sure it is possible to write code that is better adapted to a specific browser. If anyone has thoughts on this, please share them in the comments.
Perception optimization
The most interesting thing you can do is show the user a result as soon as possible. This, too, is a kind of optimization: it buys you a few seconds that the user would otherwise spend sadly watching a spinning progress bar. We can show some intermediate result, small tiles of a map, or a pre-loaded low-quality image; the point is to let the user start absorbing the information in at least some form. For example, before its images load, Pinterest shows rectangles filled with the average color of each image, which creates the impression that it works faster. We can do the same by showing the embedded JPEG preview and then replacing it with the result of processing the raw data.
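A rough sketch of that swap; extractEmbeddedJpeg and decodeRaw are placeholders for the real extraction and decoding steps:

```javascript
var img = document.getElementById('preview');

// 1. Show the small JPEG that most RAW files embed — it is available almost instantly.
extractEmbeddedJpeg(rawFile, function (jpegBlob) {
  img.src = URL.createObjectURL(jpegBlob);
});

// 2. Meanwhile decode the actual raw data and swap it in when it is ready.
decodeRaw(rawFile, function (canvas) {
  img.parentNode.replaceChild(canvas, img);   // replace the preview with the full-quality result
});
```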
When to finish
Each new optimization step yields a progressively smaller performance gain. As soon as the effort required becomes large (for example, you would have to rewrite a sizeable part of the code) while the gain is already minimal (less than 5%), you should probably stop and take a look around: maybe it is already good enough, or maybe you are doing something wrong. Then again, if your operations really are long, an extra 5% can still save users precious seconds.
Afterword
Useful tools
Often you have to choose between execution speed and memory consumption. And although memory is quite cheap these days, there are still cases when there is not enough of it. A profiler helps track down who is spending how much memory, but sometimes it is not enough to understand what is going on. Google Chrome provides excellent developer tools here. Chrome can also be launched with interesting flags, which is sometimes very useful: for example, you can get access to the memory object and the gc() method, which are normally unavailable, or redirect all errors straight to the terminal and find a lot of interesting things. chrome://about shows the complete list of Chrome's built-in utility pages, which can be a great help for debugging and development.
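For instance, assuming Chrome was started with --js-flags="--expose-gc" (and, for exact numbers, --enable-precise-memory-info), a crude heap check looks roughly like this:

```javascript
// window.gc is only defined when Chrome is launched with --js-flags="--expose-gc".
function logHeapAfterGc(label) {
  if (window.gc) {
    window.gc();                                  // force a garbage collection first
  }
  var used = performance.memory.usedJSHeapSize;   // Chrome-only; values are rounded unless
                                                  // --enable-precise-memory-info is passed
  console.log(label + ': ' + (used / 1024 / 1024).toFixed(1) + ' MB in use');
}

logHeapAfterGc('before decoding');
// ... run the heavy processing here ...
logHeapAfterGc('after decoding');
```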
How to test options
When you have found a slow piece of code and there are several ways to optimize it, you need to understand which one will actually be faster. Instead of rewriting the real code several times, it is useful to run small synthetic tests. We use jsperf.com. Such benchmarks make it clear which of the candidate approaches is the best. And if you search, you can often find ready-made tests and adapt them to your case.
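When a full jsperf test case feels like overkill, even a rough local comparison (the variant names below are illustrative) already points in the right direction:

```javascript
// Two candidate implementations of the same step (placeholders).
function variantA(data) { /* first approach */ }
function variantB(data) { /* second approach */ }

function bench(name, fn, data, runs) {
  var start = performance.now();
  for (var i = 0; i < runs; i++) {
    fn(data);
  }
  console.log(name + ': ' + ((performance.now() - start) / runs).toFixed(3) + ' ms per run');
}

var input = new Uint16Array(1024 * 1024);
bench('variant A', variantA, input, 100);
bench('variant B', variantB, input, 100);
```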
In general, whether to optimize at all (and, more importantly, how) has to be decided case by case, and that decision depends on many factors that are hard to systematize. By combining the methods above we managed to speed up RAW file conversion several times over, something that at first seemed almost impossible. Which means you will succeed too.
