Using R for "industrial" development

It is a continuation of previous publications . It is no secret that when R is mentioned, among the tools used, the second most popular issue is the possibility of its use in “industrial development”. The palm in Russia is invariably held by the question “What is R?”

Let's try to understand the aspects and the possibility of using R in the "industrial" development.

What is industrial software development?

It is not at all surprising that everyone understands something by this term. Starting with "only C ++" or "only java" and ending with "having a sales plan and a roadmap for 10 years ahead." But without a clear definition of the term, it is impossible to continue moving.

Since there is no clear definition of a well-established definition, to answer the question posed, let’s gather this concept from a combination of external and internal artifacts:

Management artifacts . Industrial software development: professional employment in the field of software development in an organization or division, the main purpose of which is to create a software product. A source
Internal artifacts . The combination of the team, management methodologies, tools used and agreements on software development methods, which allow to issue software with the required characteristics in terms of functionality, reliability, performance and ergonomics within a given period of time. This definition is based on the works of various famous people: D. Knut, S. McConnell, C. Wigers, M. Fowler, E. Hunt, F. Brooks, K. Beck, J. Spolsky and others.

From the manager's point of view, everything is clear from the manager’s point of view, but they have little reference to the programming language. There are a lot of internal artifacts, but for many of them it doesn’t matter what language is being developed. In particular, when applying R, they can be used from among the beloved / familiar

requirements management system (methodology and tools);
project management system (methodology and tools);
version control system;
bugtracker
etc.

Thus, the main claims can be reformulated in the context of reliability (“aha, interactive work in the console is not 24x7 for you, and everyone wrote for academic research!”). The remark is generally true, but for slightly obsolete data. In fact, R currently has a set of packages and approaches that allow you to easily and elegantly create applications in 24x7 format. Therefore, further we will focus on the most interesting points in the task of ensuring the reliability of the software being developed:

support for various programming paradigms;
debugging;
static code analysis;
logging and analysis during execution;
functional and unit testing;
packaging and automated assembly;
documentation generation system.

In part, these moments intersect with the area of "defensive programming" .

Also, we should not forget that R is a high-level language, focused on solving data manipulation problems and having a wide range of proven libraries (packages). The substantial part of solving even very complex tasks can take only a few hundred lines of code. For larger projects, it is desirable to use a packaging mechanism that allows the code to be structured by building your own libraries (packages).

Support for various programming paradigms

The existing set of packages complements and expands the capabilities of base-R not only in terms of mathematical algorithms, but also in terms of development paradigms. Taking out the tidyverse universe, it makes sense to mention two packages that extend the implementation of OOP and functional programming:

System r6 classes You can get acquainted with the implementation of OOP on the basis of R6 in detail using the video from the UseR conference ! 2017. The R6 Class System , Video and of course, on the CRAN website. There you can read about the comparison of performance characteristics as compared with the standard system of classes S3.

Functional approach. The lambda.r package implements the functional programming paradigm.

If we talk about the functional approach, the implementation of a separate set of elements of this approach in the purrr package in conjunction with the pipe operator %>% allows in practice to greatly simplify and secure data processing with a significant reduction in the amount of code required for this. As a start, you can read a good report on this topic: Happy R Users Purrr - Tutorial

Debugging

In addition to the classic debugging tools that are well described in the article “Debugging with RStudio”, I would mention the following useful tools:

The DebugFnW() function from the wrapr package to save the environment in case of falls inside the function. Links to videos on this tool can be found in the brief description of the package.
Package listviewer for interactive analysis of hierarchical objects. Widget is based on jsoneditor code.
Package diffobj for visual comparison of various objects.

Static code analysis

It's all quite simple. The most popular tool is lintr @ CRAN or lintr @ github .
Lintr integrates with RStudio IDE, not to be repeated, details can be viewed in the Lintr integration with RStudio branch.

Logging and analysis at runtime

Logging

Logging the process of executing software at logically significant points, although adding overhead, but being organized in the right way greatly simplifies the task of providing subsequent technical support for the developed software. Given the high level of R and the compactness of the code, even permanently enabled logging is not overhead for resources.

The futile.logger package is the most convenient for the logging task. The semantics of log4j is known to many and does not require the study of a new one. Of the useful, convenient add-ons / extensions, I would single out the following:

The logger configuration in the form of flog.appender(appender.tee(log_name)) allows you to include simultaneous output in the file and in the console.
For the formation of complex lines containing variable values, instead of base::paste much more convenient to use glue::glue . For example, the paste0(" FROM", ch_db$table, "WHERE ", where_string, sep=" ") string paste0(" FROM", ch_db$table, "WHERE ", where_string, sep=" ") turns into one format string glue(" FROM {ch_db$table} WHERE {where_string}") . Taking into account the vectorization of glue, printing a table selection also turns into one line. Details can be found in the announcement of glue 1.2.0.
You can use the capture.output(fun...) function to output messages issued by functions to stdout.
To display the execution time of a block of commands, it is very convenient to use the tic() , toc() functions from the tictoc package. At the same time, the finalizing function should be immediately included in the logger in the form of the following construction: flog.info(glue("Data query response time: {capture.output(toc())}"))

Real-time validation

In general, especially in languages with dynamic typing, the rule of good tone is the verification of data received for processing in a particular function. In a good way, the verification should be divided into two stages: physical verification (data of the required type) and logical verification (the content of the data of the correct type also follows the established requirements). The time for executing a logical check can be an order of magnitude longer than a physical one. An elementary example is that at the stage of physical verification we see that a vector of floating point numbers has arrived at the input, and at the stage of logical verification we see that all elements of the vector are non-negative.

In R for this, everything is there, and even with a very good choice. I will mention only the most interesting and promising packages. checkmate for physical verification and assertr , validate for logical.

It's nice that, unlike assertive , the implementation of checkmate initially focused on speed and minimum overhead, as can be found in the publication “checkmate: Fast Argument Checks for Defensive R Programming” .
Well, the ability to write compact validation rules in the regexp style with qassert means qassert very good, because it allows you to minimize the typical checking function by 2 lines to a line of several characters.

As part of the logical test - here everyone can choose a convenient way. It all depends on what kind of data, whether the processing goes independently or in the pipeline (pipe), what exactly needs to be checked.

Depending on what is required by the logic of the program, you can either make a check for compliance with the conditions with obtaining TRUE/FALSE and then branching logic, or generate an exception (assert).

Exception Handling

The mechanism for generating exceptions is undoubtedly useful when working with data, and a detailed output of related information can be very useful for interactive work in the console. However, during the transition to streaming execution, stopping the program when an error occurs is completely unnecessary, and the diversity in the ways of generating diagnostic messages begins to tire when creating handlers.
Exception handling by standard tryCatch mechanisms tryCatch well described in the Advanced R book. More interesting and useful in relation to data processing in software mode are the following two extensions:

exception handling without stopping the pipeline (pie);
unification of responses from functions.

Both functions are implemented in the purrr package purrr family of purrr functions safely . Receiving a standardized list with any error / result fields from any function allows not to interrupt stream processing, but to handle the exceptions and errors that occurred after the completion of the pipeline. It is not at all necessary to always ring the bell and raise an exception if during the processing of the vector a division by 0 occurred. It is enough to mark an incorrect element and proceed to the next one. Such an encapsulation of exception handling makes it possible, instead of a few dozen lines of code, designed to take account of unforeseen situations in data processing to reduce everything to a single wrapper. Less code, less redundant variability - more stable result.

Well and briefly, the possibilities for using were safely described in the RStudio blog: purrr 0.2.0 by Hadley Wickham + documentation.

Functional and unit testing

Package testthat . The study can begin with the article “testthat: Get Started with Testing” , continue with CRAN and Hadley Wickham books, as well as the book “Testing R Code” . Taking into account the provisions mentioned above, I would shift when reading a function from assertive to functions from the checkmate package. Self-tests can be written both for packages and for individual functions.

Bundling and automated assembly

We simply state that there is, but not everyone knows about it. Assembly, validation, documentation, verification, etc., are all integrated by RStudio IDE. Briefly covered in the article "Building, Testing, and Distributing Packages" . Thoroughly everything is described in an excellent book Hadley Wickham: "R packages" . usethis helper may be usethis

There is also a packrat package packrat allows you to create snapshot of packages required for the operation of a particular application. This ensures the independence of the software environment from the packages installed in the system.

Documentation generation system

We just state what is for packages, but not everyone knows about it. Built on the basis of roxygen, integrated with RStudio IDE.
Thoroughly everything is described in an excellent book Hadley Wickham: "R packages" .

Conclusion

Now R is very actively developing. Every week there are new useful features, packages, approaches, or existing ones are improved. In tasks related to data processing, the set of packages specified in the publication allows writing fast, stable, compact and predictable code to R. A year ago, there were fewer packages, which will be at the end of 2018 - only time will tell.

In such a situation, it is incorrect to draw conclusions about the possibility or impossibility of using a language and platform based on data from two or more years ago. At a minimum, you should familiarize yourself with the current state.

As for speed, then, as always, this question is relative. 2 days of development + 10 seconds for execution at R is much less than 2 weeks of development + 0.1 seconds for execution, for example, in Java. About the speed you need to talk in context For functions requiring execution speed, it is possible to implement it in C ++ without going beyond the R boundaries by using the Rcpp package. A brief overview of the features can also be found in one of the author’s articles: “Extending R with C ++: A Brief Introduction to Rcpp” .

Tasks on a variety of data processing (from collection to visualization) is becoming more and more. Why not look towards R?

Previous publication - "R, Asterisk and Wardrobe" .

Source: https://habr.com/ru/post/342254/

All Articles