One sometimes encounters the conviction that R, being an interpreted language, is too slow for the analytical tasks of a “fast” business. In most cases, such claims come from analysts who have no experience developing serious software, including high-performance or embedded systems that make heavy demands on limited hardware resources. This is perfectly normal; no one can know everything. However, in 95% of cases it turns out that R has nothing to do with it: the problem lies in inefficient memory management and an inefficient calculation process.
For this reason, this note touches on five important points (one could name six or ten, but let it be five) that very often help to speed up slow code. The note is fairly general, though framed in terms of R, since the principles themselves have long been known. Treat it as a cheat sheet for starting to look into possible optimizations of existing code.
All data being processed must fit in RAM.
All loop invariants must be computed outside the loop.
Be careful about how objects are copied.
Keep track of the structure of objects and clean up garbage.
Profile the code and optimize only the bottlenecks.
Now step by step.
Do not "just" process data. Analyze the task, estimate the amount of data you are working with, and see how the data can be divided into autonomous blocks commensurate with the available RAM. You can process these blocks sequentially or, if possible, parallelize the work on each independent block (“Map-Reduce”). Swapping to disk degrades performance by several orders of magnitude.
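The block-wise approach can be sketched roughly as follows. This is only an illustration under assumptions: the vector x stands in for a large data set, block_size is an arbitrary value, and in a real task each block would more likely be read from disk or a database than sliced from a vector already in memory.

```r
# Hypothetical workload: a numeric vector standing in for a large data set.
set.seed(42)
x <- rnorm(1e6)

# Assumed block size; in practice it is chosen to fit comfortably in RAM.
block_size <- 1e5
n <- length(x)
starts <- seq(1, n, by = block_size)

# "Map": reduce every independent block to a small partial result.
partials <- lapply(starts, function(s) {
  idx <- s:min(s + block_size - 1, n)
  b <- x[idx]
  c(sum = sum(b), n = length(b))
})

# "Reduce": combine the partial results into the final answer.
total <- Reduce(`+`, partials)
chunked_mean <- total[["sum"]] / total[["n"]]
print(chunked_mean)
```

Since the blocks are independent, lapply could be swapped for parallel::mclapply on Unix-like systems if parallel execution is worth the overhead.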
The simple algebraic property of distributivity is more relevant than ever for computations in loops. Any calculations, whether explicit or hidden inside a functional approach, that do not depend on the loop index should be moved out of the loop. Values that take a long time to compute and are used several times should preferably be computed outside the main body of the program and then used as table values. In a long loop, a few extra microseconds per iteration can add up to minutes, hours, or days.
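A minimal sketch of both ideas, hoisting an invariant and tabulating reusable values. All names (x, w, w_norm, log_fact_table) are made up for the illustration.

```r
set.seed(1)
x <- runif(1e4)
w <- runif(100)

# Slow: sqrt(sum(w^2)) does not depend on i, yet it is recomputed on every pass.
slow <- numeric(length(x))
for (i in seq_along(x)) {
  slow[i] <- x[i] / sqrt(sum(w^2))
}

# Faster: compute the invariant once, outside the loop.
w_norm <- sqrt(sum(w^2))
fast <- numeric(length(x))
for (i in seq_along(x)) {
  fast[i] <- x[i] / w_norm
}

# Values that are reused many times can be tabulated in advance.
keys <- sample(1:50, 1e4, replace = TRUE)
log_fact_table <- lgamma(1:50 + 1)   # log(k!) for k = 1..50, computed once
looked_up <- log_fact_table[keys]    # plain indexing instead of recomputation
```

In this toy case the whole loop could also be replaced by the vectorised expression x / w_norm; the loop is kept only to show where the invariant moves.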
In high-level languages, each object or variable can hide quite complex structures containing a lot of additional service information and metadata. A seemingly simple operation on an object, such as:
t <- f(t, n)
can lead to dynamic memory allocation and to copying far more than a dozen bytes. Placed inside a loop, such an operation can consume all the cycles of any processor and ask for more.
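One common form of this problem is growing a vector inside a loop. The sketch below compares three ways of producing the same result; the function names grow, prealloc and vectorised are invented for the example.

```r
n <- 1e4

# Slow pattern: a new, longer vector is allocated and copied on every iteration.
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# Faster pattern: allocate the result once, then fill it.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

# Idiomatic R: a vectorised expression with no explicit loop.
vectorised <- function(n) seq_len(n)^2

res_grow <- grow(n)
res_prealloc <- prealloc(n)
res_vec <- vectorised(n)
```

Wrapping each call in system.time() shows the difference on a given machine; the gap widens quickly as n grows, because the first variant copies an ever longer vector.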
Take care of RAM. Finished with an object and no longer need it? Remove it manually; do not wait for the garbage collector to do it. It may not happen in time.
Fitted a regression and need only a table of values? Do you know how much an lm object weighs and how much of it is superfluous?
As an example, a blog post: “Reducing your R memory footprint by 7000x”
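A small sketch of the same idea on synthetic data (the data frame df and the model formula are assumptions made for the illustration): keep the coefficient table, then drop the fitted object.

```r
set.seed(7)
n <- 1e4
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

fit <- lm(y ~ x, data = df)

# Keep only the small table of coefficients.
coef_table <- coef(summary(fit))

size_fit <- object.size(fit)
size_table <- object.size(coef_table)
print(size_fit, units = "Kb")
print(size_table, units = "Kb")

# Drop the heavy object and ask the garbage collector to reclaim the memory.
rm(fit)
invisible(gc())
```

The fitted object carries the model frame, residuals, fitted values and the QR decomposition, which is why it is so much larger than the table that is actually needed.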
Are you sure that you use the functions effectively? Do you know how they work? Here is another interesting example:
library(lubridate)  # provides round_date()
t <- raw.df$timestamp
object.size(t) # 130 kB
m <- lapply(t, function(x){round(x, units="hours")}) # 130 kB -> 35 MB
m <- lapply(t, function(x){round_date(x, unit = "hour")}) # 8 !!!
m <- round_date(t, unit = "hour") # 130 kB
If you go on working with this object m, and not only reading it, which will be faster to process: 130 kB or 35 MB? In my opinion, the answer is quite obvious.
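The effect can be reproduced without the original data. The sketch below uses base R only and a synthetic timestamp vector as a hypothetical stand-in for raw.df$timestamp; the exact sizes will differ from the figures quoted above.

```r
# Hypothetical stand-in for raw.df$timestamp: 5000 POSIXct values, one per minute.
t <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC") + 60 * seq_len(5000)

# Element-wise rounding: the result is a list in which every element
# is a separate POSIXlt object with its own fields and attributes.
m_list <- lapply(t, function(x) round(x, "hours"))

# Vectorised rounding, converted back to a compact POSIXct vector.
m_vec <- as.POSIXct(round(t, "hours"))

size_list <- as.numeric(object.size(m_list))
size_vec <- as.numeric(object.size(m_vec))
cat("list of POSIXlt:", size_list, "bytes\n")
cat("POSIXct vector: ", size_vec, "bytes\n")
```

Both objects hold the same rounded timestamps; only the representation differs, and with it the memory footprint and the cost of every later operation.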
If the code runs slowly, look at its execution profile. Find the bottlenecks and optimize them. Do not waste time optimizing non-critical parts.
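A minimal way to obtain such a profile with base R is sketched below. The function slow_fun is a made-up example of a slow routine, and Rprof is assumed to be available, as it is in standard R builds.

```r
# Hypothetical slow function: grows a vector inside a loop.
slow_fun <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, sqrt(i))
  out
}

# Coarse timing of the whole call.
timing <- system.time(res <- slow_fun(3e4))
print(timing)

# Sampling profiler: shows which functions the time is actually spent in.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
res <- slow_fun(3e4)
Rprof(NULL)
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```

The by.self table lists functions by the time spent in them directly, which is usually the quickest pointer to the place worth optimizing.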
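And let us not forget the great scientist's words: "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth, lecture "Computer Programming as an Art", published in Communications of the ACM (Vol. 17, Issue 12, December 1974, p. 671)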
Useful links with debugging and profiling recommendations.
Practice consistently shows that these simple recommendations often help to speed up code many times over with only small modifications, without expensive hardware upgrades, without switching to parallel computing with complex orchestration, and without moving part of the computation into C/C++ modules.
Previous post: “Using R to work with the statement ‘Who is to blame? Of course IT!’”
Next post: “We harness R to serve the business on ‘1-2-3’”
Source: https://habr.com/ru/post/310472/