The book "The Art of Programming in R. Immersion in Big Data"

Hi, Habrozhiteli! Many people use R for isolated tasks: building a histogram here, running a regression analysis there, or performing other one-off operations in statistical data processing. This book, however, is written for those who want to develop software in R. The programming skills of its prospective readers may range widely, from professional qualifications to “I took a college programming course,” but the key goal is the same: writing R code for specific purposes. (A deep knowledge of statistics is generally not necessary.)

A few examples of readers who could benefit from this book:

An analyst (say, working in a hospital or in a government office) who has to regularly issue statistical reports and develop programs for this purpose.
A researcher developing statistical methodology, either new methods or combinations of existing methods into integrated procedures. The methodology needs to be coded so that the research community can use it.
Specialists in marketing, legal support, journalism, publishing, etc., engaged in the development of code for building complex graphical representations of data.
Professional programmers with software development experience assigned to projects related to statistical analysis.
Students studying statistics and data processing.

Thus, this book is not a reference for the countless statistical methods of the wonderful R package. It is about programming, and it covers programming topics that are rarely found in other books on R. Even fundamental topics are viewed from a programming angle. Some examples of this approach:

The book contains sections titled "Extended examples". They usually present complete, general-purpose functions rather than isolated code fragments tied to specific data. Moreover, some of these functions may be useful in your daily work with R. By studying these examples, you will learn not only how individual R constructs work but also how to combine them into useful programs. In many cases, I describe alternative solutions and answer the question, “Why was it done this way?”
The material is presented with a programmer's sensibilities in mind. For example, when describing data frames, I not only state that a data frame in R is a list but also point out the programming consequences of this fact. Where useful, the text also compares R with other languages (for readers who know those languages).
Debugging plays an important role in programming in any language, yet most books on R barely mention it. In this book, I devote an entire chapter to debugging tools, apply the "extended examples" principle, and present fully worked demonstrations of how programs are debugged in practice.
Nowadays, multicore computers are common even in homes, and the programming of graphics processing units (GPUs) is quietly revolutionizing scientific computing. More and more R applications require very large amounts of computation, and parallel processing has become relevant for R programmers. An entire chapter of this book is devoted to this topic; besides describing the mechanics, it also includes extended examples.
A separate chapter describes how to use knowledge of R's internal implementation and other aspects of R to speed up R code.
One chapter is devoted to interfacing R with other programming languages, such as C and Python. Again, special attention is paid to extended examples and debugging recommendations.
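The point in the list above about a data frame being a list can be illustrated with a minimal sketch. The data frame d and its column names are made up for illustration and do not come from the book.

```r
# A small illustrative data frame (names and values are hypothetical).
d <- data.frame(kid = c("Jack", "Jill"), age = c(12, 10),
                stringsAsFactors = FALSE)

is.list(d)      # a data frame is a list whose elements are the columns
length(d)       # so length() counts columns, not rows
d[["age"]]      # list-style indexing extracts a whole column

# Functions that walk over lists therefore walk over the columns.
col_classes <- sapply(d, class)
col_classes
```

One consequence for the programmer is that anything written to work on lists (lapply(), sapply(), double-bracket indexing) can be applied to a data frame column by column.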

Excerpt 7.8.4. When should global variables be used?

Programmers have not reached a consensus on the use of global variables. Obviously, there is no right answer to the question in the title of this section, since it is a matter of personal preference and style. Nevertheless, many programmers believe that the complete ban on global variables advocated by many programming teachers is too harsh. In this section, we explore the possible benefits of global variables in the context of R structures. The term “global variable” will denote any variable located in the environment hierarchy at a level higher than that of the code of interest.

Global variables are used in R more often than one might expect. Surprisingly, R itself uses them very widely in its internal implementation (both in C code and in R functions). For instance, the superassignment operator <<- is used in many R library functions (although usually to write to a variable just one level up in the environment hierarchy). Threaded code and GPU code, which are used to write fast programs (see Chapter 16), tend to rely heavily on global variables, which provide the main mechanism of communication between parallel actors.
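To make "one level up in the environment hierarchy" concrete, here is a minimal sketch. It is not taken from R's library code, and the function names outerfn and innerfn are hypothetical.

```r
x <- "top level"

outerfn <- function() {
  x <- "local to outerfn"
  innerfn <- function() {
    # The superassignment operator searches the enclosing environments,
    # starting one level up, and assigns to the first x it finds.
    # Here that is the x local to outerfn(), not the top-level x.
    x <<- "changed by innerfn"
  }
  innerfn()
  x   # return outerfn()'s own x after the call
}

result <- outerfn()
result   # the x inside outerfn() was modified
x        # the top-level x is untouched
```

In this sketch the variable written by innerfn() is "global" only in the relative sense defined above: it lives one level higher than the code doing the writing.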

And now, for the sake of concreteness, let's return to the earlier example from section 7.7:

 f <- function(lxxyy) {  # lxxyy is a list containing x and y
    ...
    lxxyy$x <- ...
    lxxyy$y <- ...
    return(lxxyy)
 }
 # set x and y
 lxy$x <- ...
 lxy$y <- ...
 lxy <- f(lxy)
 # use the new x and y
 ... <- lxy$x
 ... <- lxy$y

As mentioned earlier, this code can become cumbersome, especially if x and y are lists themselves.

On the other hand, take a look at the alternative scheme using global variables:

 f <- function() {
    ...
    x <<- ...
    y <<- ...
 }
 # set x and y
 x <- ...
 y <- ...
 f()  # x and y are changed here
 # use the new x and y
 ... <- x
 ... <- y

Arguably, the second version is much cleaner and less cumbersome, and it requires no manipulation of lists. Clear code usually causes fewer problems in writing, debugging, and maintenance.
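The two schemes can be compared side by side in a runnable form. The sketch below fills the elided parts with arbitrary operations (doubling x and incrementing y) that are assumptions made purely for illustration; the names f_list and f_global are likewise hypothetical.

```r
# Scheme 1: pass a list in and get a modified list back.
f_list <- function(lxxyy) {
  lxxyy$x <- lxxyy$x * 2   # arbitrary illustrative update
  lxxyy$y <- lxxyy$y + 1
  return(lxxyy)
}
lxy <- list(x = 5, y = 10)
lxy <- f_list(lxy)          # the caller must reassign the result

# Scheme 2: the function updates the top-level variables directly.
f_global <- function() {
  x <<- x * 2               # same illustrative update, via superassignment
  y <<- y + 1
  invisible(NULL)
}
x <- 5
y <- 10
f_global()                  # x and y are changed as a side effect
```

Both versions end with the same values; the difference lies in how much packing and unpacking the caller has to do.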

For these reasons (simpler, less bulky code), we chose to use global variables rather than return lists in the DES (discrete-event simulation) code given earlier. Let's consider this example in more detail.

Two global variables were used (both are lists containing various information): the variable sim is associated with the library code, and the variable mm1glbls is associated with the code of the specific M/M/1 application. Let's start with sim.

Even programmers who are wary of global variables agree that such variables can be justified if they are truly global, in the sense that they are used widely throughout the program. This is the case for the sim variable in the DES example: it is used both in the library code (in schedevnt(), getnextevnt(), and dosim()) and in the M/M/1 code (in mm1reactevnt()). In this particular example, subsequent accesses to sim are read-only, but in some situations writes are possible. A typical example is a possible implementation of event cancellation. Such a situation can arise when modeling a “whichever comes first” scenario: two events are scheduled, and when one of them occurs, the other must be canceled.

Thus, using sim as a global variable seems justified. However, if we were determined to avoid global variables, sim could be placed in a local variable inside dosim(). This function would then pass sim as an argument to all the functions mentioned in the previous paragraph (schedevnt(), getnextevnt(), and so on), and each of these functions would return the modified sim.
For example, line 94:

 reactevnt(head)

would be converted to the following form:

 sim <- reactevnt(head)

Then the following line would have to be added to the application-specific function mm1reactevnt():

 return(sim)

You can do something similar with mm1glbls by introducing a local variable in dosim() named, for example, appvars. But if this is done with two variables, they must be placed in a list so that both can be returned from a function, as in the example of the function f() above. The result is the cumbersome structure of lists within lists mentioned earlier, or rather, lists within lists within lists.
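A minimal sketch of what this no-globals style looks like in practice follows. The function bodies are hypothetical, much-simplified stand-ins that share only their names with the book's DES library; events are represented here as plain numeric times, which is an assumption made for brevity.

```r
# Hypothetical stand-in: add an event time and return the modified state.
schedevnt <- function(sim, evnttime) {
  sim$evnts <- sort(c(sim$evnts, evnttime))   # keep events in time order
  return(sim)
}

# Hypothetical stand-in: two results (state and event) force a list return.
getnextevnt <- function(sim) {
  head <- sim$evnts[1]
  sim$evnts <- sim$evnts[-1]
  return(list(sim = sim, head = head))
}

dosim <- function() {
  sim <- list(evnts = NULL)      # local to dosim(), not global
  sim <- schedevnt(sim, 3.2)     # every call must reassign sim
  sim <- schedevnt(sim, 1.5)
  nxt <- getnextevnt(sim)        # a list containing a list
  sim <- nxt$sim
  return(list(first = nxt$head, remaining = sim$evnts))
}

out <- dosim()
out$first
out$remaining
```

Even in this toy version, getnextevnt() already has to wrap its results in a list that contains the sim list, which hints at how the nesting grows once application variables are added.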

On the other hand, opponents of global variables point out that the simplicity of the code comes at a price. Their concern is that during debugging it is hard to find the places where a global variable changes value, since the change can occur anywhere in the program. In a world of modern text editors and integrated development tools that can find every occurrence of a variable, the problem might seem to recede into the background (the original article calling for the abandonment of global variables was published in 1970!). Yet this factor must still be considered.

Another problem that critics mention arises when a function is called from several unrelated parts of a program with different values. For example, imagine that the function f() is called from different parts of a program, with each call needing its own x and y values rather than a single shared value for each. The problem can be solved by creating vectors of x and y values in which a separate element corresponds to each instance of f() in your program. However, the simplicity of using global variables is then lost.
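The vector workaround described above might look roughly like this. The index argument i, acting as an "instance id", and the particular updates are assumptions introduced only for this sketch.

```r
# One slot per call site; slot 1 and slot 2 belong to unrelated
# parts of the program.
x <- c(0, 0)
y <- c(0, 0)

f <- function(i) {
  # Each caller identifies itself by i and touches only its own slot.
  x[i] <<- x[i] + 1
  y[i] <<- y[i] + 10
  invisible(NULL)
}

f(1); f(1)   # calls from the first part of the program
f(2)         # a call from an unrelated part, with its own x and y

x
y
```

The function now needs an extra argument and every access needs an index, which is the loss of simplicity the text refers to.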

These problems arise not only in R but in programming generally. However, in R, using global variables at the top level creates an additional problem, since the user typically has many variables at that level. There is a danger that code using global variables may accidentally overwrite a completely unrelated variable of the same name.
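The name-clash danger is easy to reproduce. In the sketch below, dosim_demo() is a hypothetical function, and the contents of the list it writes are assumed for illustration.

```r
# The user's own top-level variable, unrelated to any library code.
sim <- "notes on last week's results"

# A hypothetical library function that keeps its state in a global
# variable that happens to have the same name.
dosim_demo <- function() {
  sim <<- list(currtime = 0.0, evnts = NULL)
  invisible(NULL)
}

dosim_demo()
class(sim)   # the user's character string has been replaced by a list
```

Nothing warns the user that the original value of sim is gone; the overwrite is silent.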

Of course, the problem is easily solved: it is enough to choose long, application-specific names for global variables. However, environments also provide a reasonable compromise, as in the following variation on the DES example.

Inside the dosim() function, the line

 sim <<- list()

may be replaced by the line

 assign("simenv",new.env(),envir=.GlobalEnv)

This creates a new environment, which is referenced by the variable simenv at the top level. The environment serves as a container that encapsulates the global variables, which can be accessed through calls to get() and assign(). For example, the lines

 if (is.null(sim$evnts)) {
    sim$evnts <<- newevnt

in schedevnt() take the form

 if (is.null(get("evnts",envir=simenv))) {
    assign("evnts",newevnt,envir=simenv)

Yes, this solution is also cumbersome, but at least it is not as complicated as lists within lists within lists. It also protects against accidentally writing to an unrelated variable at the top level. The superassignment operator still gives less cumbersome code, but this trade-off should be taken into account.
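The environment-as-container idea can be tried out in a self-contained form. The sketch below assumes a much-simplified, hypothetical schedevnt() in which events are plain numeric times, and it assumes that evnts is initialized to NULL inside the container so that the is.null(get(...)) test has something to find.

```r
# Create the container once, at the top level.
assign("simenv", new.env(), envir = .GlobalEnv)
assign("evnts", NULL, envir = simenv)   # assumed initial state: no events

# Hypothetical, simplified schedevnt(): all state lives inside simenv.
schedevnt <- function(newevnt) {
  if (is.null(get("evnts", envir = simenv))) {
    assign("evnts", newevnt, envir = simenv)
  } else {
    assign("evnts", sort(c(get("evnts", envir = simenv), newevnt)),
           envir = simenv)
  }
  invisible(NULL)
}

sim <- "an unrelated top-level variable"
schedevnt(3.2)
schedevnt(1.5)

get("evnts", envir = simenv)   # the event times, kept inside the container
sim                            # the user's own variable is left alone
```

Only the single name simenv is exposed at the top level; everything else is reachable only through the container.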

As usual, no single programming style gives the best results in all situations. The global variable approach is one more option to include in your arsenal of programming tools.

7.8.5. Closures

Recall that an R closure consists of a function's arguments and body together with its environment at the time of the call. The fact that the environment is included is exploited in a programming paradigm that uses a feature also known as a closure (the terminology is somewhat overloaded here).

A closure is a function that creates a local variable and then creates another function that accesses this variable. The description is too abstract, so I’d better give an example.

 > counter
 function () {
    ctr <- 0
    f <- function() {
       ctr <<- ctr + 1
       cat("this count currently has value",ctr,"\n")
    }
    return(f)
 }

Check how this code works before diving into the implementation details:

 > c1 <- counter()
 > c2 <- counter()
 > c1
 function() {
       ctr <<- ctr + 1
       cat("this count currently has value",ctr,"\n")
    }
 <environment: 0x8d445c0>
 > c2
 function() {
       ctr <<- ctr + 1
       cat("this count currently has value",ctr,"\n")
    }
 <environment: 0x8d447d4>
 > c1()
 this count currently has value 1
 > c1()
 this count currently has value 2
 > c2()
 this count currently has value 1
 > c2()
 this count currently has value 2
 > c2()
 this count currently has value 3
 > c1()
 this count currently has value 3

Here, the counter() function is called twice, and the results are assigned to c1 and c2. As expected, these two variables hold functions, namely copies of f(). However, f() accesses the variable ctr through the superassignment operator, and this will be the variable of that name local to counter(), since it is the first one found on the path up the environment hierarchy. It is part of the environment of f() and, as such, is packaged in what is returned to the caller of counter(). The key point is that for different calls to counter(), the ctr variable will be in different environments (in the example, the environments were stored in memory at addresses 0x8d445c0 and 0x8d447d4). In other words, different calls to counter() create physically different instances of ctr.

As a result, the functions c1() and c2() work as fully independent counters. This is evident from the example, where each function is called several times.
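The independence of the two counters can also be checked directly by looking inside the environment that each closure carries. This is a small sketch that repeats the counter() definition so that it runs on its own; the variable names n1, n2, and same_env are introduced here only for illustration.

```r
counter <- function() {
  ctr <- 0
  f <- function() {
    ctr <<- ctr + 1
    cat("this count currently has value", ctr, "\n")
  }
  return(f)
}

c1 <- counter()
c2 <- counter()
c1(); c1()   # advance the first counter twice
c2()         # advance the second counter once

# environment() returns the environment packaged with each closure;
# each one holds its own ctr.
n1 <- get("ctr", envir = environment(c1))
n2 <- get("ctr", envir = environment(c2))
same_env <- identical(environment(c1), environment(c2))
n1
n2
same_env
```

The two ctr values differ, and the two environments are distinct objects, which matches the different addresses printed in the session above.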

» More information about the book can be found on the publisher's website.
» Table of Contents
» Excerpt

For Habrozhiteli, a 25% discount with the coupon R

Upon payment for the paper version of the book, an electronic version is sent by e-mail.

Source: https://habr.com/ru/post/452746/
