
Is R not fast enough for you? Looking for hidden reserves

Sometimes one has to confront the conviction that R, being an interpreted language, is too slow for the analytical tasks of a "fast" business. In most cases such claims come from analysts who have no experience developing serious software, including high-performance or embedded systems that are extremely demanding of limited hardware resources. That is perfectly normal; no one can know everything. However, in 95% of cases it turns out that R has nothing to do with it: the problem lies in inefficient management of memory and of the computation process.


That is exactly why this note touches on 5 important points (one could name 6 or 10, but let it be 5), work on which very often helps to transform slow code. The write-up is rather abstract and only loosely tied to R, since the principles themselves have long been known. Consider it a "cheat sheet" for starting an investigation into possible optimizations of existing code.


  1. All calculated data must be in RAM.


  2. Everything that is invariant with respect to the loop variables must be calculated outside the loop.


  3. Keep object copying under control.


  4. Keep track of the structure of your objects; clean up the garbage.


  5. Optimize wisely.

Now step by step.


Everything in RAM


Do not process data "blindly". Analyze the task, estimate the volume of data you are working with, and see how you can divide it into autonomous blocks commensurate with the available RAM. You can process these blocks sequentially, or, if the opportunity exists, parallelize the work on each independent block ("Map-Reduce"). Swapping to disk means a performance drop of several orders of magnitude.
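A minimal sketch of this block-wise "Map-Reduce" style in base R. The data, block sizes, and the trivial `sum` aggregation here are illustrative assumptions; in practice each block would be read from disk one at a time (e.g. with `read.csv(..., skip =, nrows =)`) rather than held in memory all at once.

```r
library(parallel)

# "Map" step: reduce each independent block to a small summary
process_block <- function(block) {
  sum(block)
}

# Simulate ten independent, RAM-sized blocks of a larger dataset
blocks <- split(1:1000, rep(1:10, each = 100))

# Sequential processing of the blocks
partial <- lapply(blocks, process_block)

# Or parallel: one worker handles each independent block
cl <- makeCluster(2)
partial_par <- parLapply(cl, blocks, process_block)
stopCluster(cl)

# "Reduce" step: combine the small per-block summaries
total <- Reduce(`+`, partial)
```

The key property is that no step ever needs more than one block plus the small partial results in memory at once.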


"Take it out of the brackets"


The simple algebraic property of distributivity is more relevant than ever for computations in loops. Any calculation, explicit or hidden inside a functional construct, that does not depend on the loop index should be moved out of the loop. Values that are expensive to compute and used several times are best moved out of the main body of the program entirely and turned into lookup tables. In a long loop, a few extra microseconds per iteration can turn into minutes, hours, or days.


Watch the copying


In high-level languages, quite complex structures can hide behind every object or variable, carrying a lot of additional service and meta information. Then even a simple operation on an object, like this:


t <- f(t, n) 

can lead to dynamic memory allocation and the copying of far more than a dozen bytes. Placed inside a loop, such an operation can devour any amount of CPU at any clock speed and ask for more.
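A sketch of exactly this effect in R, where copy-on-modify semantics make growing a vector inside a loop especially expensive; the numbers and the squaring operation are arbitrary examples.

```r
n <- 1e4

# Bad: each c() call allocates a new, longer vector and copies the old one
grow <- function(n) {
  v <- numeric(0)
  for (i in 1:n) {
    v <- c(v, i^2)
  }
  v
}

# Better: one allocation up front, then in-place modification
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) {
    v[i] <- i^2
  }
  v
}
```

Both produce identical results, but the first version copies on every iteration, turning a linear algorithm into a quadratic one in memory traffic.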


Throw out the garbage


Take care of RAM. Have you used an object and no longer need it? Remove it manually; do not wait for the garbage collector to come around. You may not live to see it.
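A minimal sketch of releasing a large temporary explicitly; the matrix here is an arbitrary stand-in for any bulky intermediate object.

```r
# A large temporary (~8 MB) needed only to compute a small summary
big <- matrix(runif(1e6), ncol = 100)
result <- colSums(big)

rm(big)  # drop the reference explicitly, right away
gc()     # optionally trigger collection immediately to return the memory
```

`rm()` only removes the binding; the explicit `gc()` call is what asks R to actually reclaim the memory now rather than at some later point of its own choosing.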


Computed a regression and need only the table of values? Do you know how much an lm object weighs and how much excess it carries?
As an example, a blog post: “Reducing your R memory footprint by 7000x”
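A quick sketch of the idea behind that post, using the built-in `mtcars` data purely for illustration: keep only the coefficient table if that is all that is used downstream.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The fitted lm object drags along the model frame, residuals,
# the QR decomposition, captured environments, and more
full_size <- object.size(fit)

# Keep only what is actually needed: the coefficient table
coefs <- summary(fit)$coefficients
small_size <- object.size(coefs)
```

On this toy model the difference is already noticeable; on a model fitted to millions of rows, where the model frame and residuals scale with the data, it becomes dramatic.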


Are you sure you use functions efficiently? Do you know how they work under the hood? Here is another telling example:


t <- raw.df$timestamp
object.size(t)  # the vector takes about 130 Kb
m <- lapply(t, function(x) round(x, units = "hours"))     # 130 Kb turn into 35 Mb
m <- lapply(t, function(x) round_date(x, unit = "hour"))  # runs for 8 minutes!!!
m <- round_date(t, unit = "hour")                         # 130 Kb, almost instant

If you then continue to work with the object m, and not only for reading, which will be faster to process: 130 Kb or 35 Mb? In my opinion, the answer is quite obvious.


Code optimization


If the code runs slowly, look at its execution profile. Find the bottlenecks and optimize them. Do not waste time optimizing non-critical parts.
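A sketch of base-R profiling with `Rprof`; the deliberately slow function and the repetition count are arbitrary examples (packages such as profvis present the same data interactively).

```r
# An intentionally slow function: repeated dense linear solves
f_slow <- function() {
  for (i in 1:200) {
    solve(matrix(runif(160000), 400) + diag(400))
  }
}

prof_file <- tempfile()
Rprof(prof_file)   # start sampling the call stack
f_slow()
Rprof(NULL)        # stop profiling

# summaryRprof ranks functions by self time:
# optimize only the entries at the top of this list
head(summaryRprof(prof_file)$by.self)
```

The point of the profile is precisely the advice above: it tells you which few functions dominate the run time, so the other 95% of the code can be left alone.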


And let us not forget the great scientist's statement: "Premature optimization is the root of all evil in programming." — Donald Knuth, lecture "Computer Programming as an Art", published in Communications of the ACM (Vol. 17, Issue 12, December 1974, p. 671).


Useful links with recommendations on debugging and profiling are easy to find.



Conclusion


Practice shows again and again that these simple recommendations often help speed up code several times over through small modifications alone: without expensive hardware upgrades, without switching to parallel computing with complex orchestration, and without moving parts of the computation into C/C++ modules.


Previous post: "Using R to work with the statement 'Who is to blame? Of course, IT!'"
Next post: "We harness R to serve the business in '1-2-3'"



Source: https://habr.com/ru/post/310472/

