This is a continuation of previous publications. It is no secret that whenever R comes up among the tools in use, the second most popular question is whether it can be used in "industrial development". In Russia, first place is invariably held by the question "What is R?"
Let's try to sort out the aspects and the feasibility of using R in "industrial" development.
It is not at all surprising that everyone understands something different by this term. Interpretations range from "only C++" or "only Java" to "having a sales plan and a roadmap for 10 years ahead". But without a clear definition of the term, it is impossible to move forward.
Since there is no well-established definition, to answer the question posed, let's assemble this concept from a combination of external and internal artifacts:
With the external artifacts, everything is clear from the manager's point of view, but they have little to do with the programming language. There are many internal artifacts, but for many of them it does not matter which language the development is done in. In particular, when applying R, you can use any of the beloved / familiar
Thus, the main objections can be reformulated in terms of reliability ("aha, interactive work in the console is no 24x7 for you, and it was all written for academic research!"). The remark is broadly true, but only for somewhat outdated data. In fact, R now has a set of packages and approaches that let you build 24x7 applications easily and elegantly. So below we will focus on the most interesting points in ensuring the reliability of the software being developed:
In part, these points overlap with the area of "defensive programming".
Also, we should not forget that R is a high-level language focused on data manipulation, with a wide range of proven libraries (packages). The core of a solution to even a very complex task can take only a few hundred lines of code. For larger projects, it is desirable to use the packaging mechanism, which lets you structure the code by building your own libraries (packages).
The existing set of packages complements and expands the capabilities of base R not only in terms of mathematical algorithms, but also in terms of development paradigms. Setting the tidyverse aside, it makes sense to mention two packages that extend the implementation of OOP and functional programming:
As for the functional approach, the purrr package implements a set of its elements and, combined with the pipe operator %>%, in practice greatly simplifies data processing and makes it safer, while significantly reducing the amount of code required. As a starting point, there is a good talk on this topic: Happy R Users Purrr - Tutorial
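As a rough illustration of the purrr plus %>% style mentioned above, here is a minimal sketch. The `readings` list and its contents are made up for the example; only `keep()`, `map_dbl()` and the pipe come from purrr.

```r
library(purrr)  # purrr re-exports the %>% pipe from magrittr

# Hypothetical input: named groups of measurements, some empty or with NAs.
readings <- list(
  a = c(1.5, 2.5, NA),
  b = c(10, 20),
  c = numeric(0)
)

# Drop empty groups, then compute one mean per remaining group.
means <- readings %>%
  keep(~ length(.x) > 0) %>%
  map_dbl(~ mean(.x, na.rm = TRUE))

print(means)
```

The same logic written with explicit loops would need an accumulator, a length check and manual name handling; here each step is one line in the pipeline.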
In addition to the classic debugging tools that are well described in the article “Debugging with RStudio”, I would mention the following useful tools:
- The DebugFnW() function from the wrapr package, which saves the environment when a failure occurs inside a function. Links to videos on this tool can be found in the brief description of the package.
- listviewer for interactive analysis of hierarchical objects. The widget is based on the jsoneditor code.
- diffobj for visual comparison of various objects.

Code linting is all quite simple. The most popular tool is lintr (lintr @ CRAN or lintr @ GitHub).
lintr integrates with the RStudio IDE; to avoid repetition, the details can be found in the "Lintr integration with RStudio" section.
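Outside the IDE, lintr can also be called from a script. The sketch below assumes lintr 3.0 or later, where linters are passed as calls such as `assignment_linter()`. The temporary file and its two lines are invented for the example.

```r
library(lintr)

# Hypothetical script with one style problem: "=" used for assignment.
script <- tempfile(fileext = ".R")
writeLines(c("x = 1", "y <- x + 1"), script)

# Run a single linter so the result is easy to read.
lints <- lint(script, linters = assignment_linter())

print(lints)
```

In a real project the default set of linters (or a `.lintr` configuration file) would normally be used instead of a single hand-picked linter.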
Logging program execution at logically significant points does add overhead, but when organized properly it greatly simplifies subsequent technical support of the software. Given the high-level nature of R and the compactness of the code, even permanently enabled logging is not a noticeable burden on resources.
The futile.logger package is the most convenient one for logging. Its log4j-style semantics are familiar to many and do not require learning anything new. Of the useful, convenient add-ons / extensions, I would single out the following:
- flog.appender(appender.tee(log_name)) enables simultaneous output to a file and to the console.
- Instead of base::paste, it is much more convenient to use glue::glue. For example, the string paste0(" FROM ", ch_db$table, " WHERE ", where_string) turns into a single format string glue(" FROM {ch_db$table} WHERE {where_string}"). Thanks to the vectorization of glue, printing a table selection also becomes a one-liner. Details can be found in the announcement of glue 1.2.0.
- The capture.output(fun...) function, for sending to the log the messages that functions print to stdout.
- The tic(), toc() functions from the tictoc package, for timing. The finalizing function should be embedded directly in the logger call, using the following construction: flog.info(glue("Data query response time: {capture.output(toc())}"))
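The pieces listed above can be combined as in the following sketch. The `run_query()` function is a hypothetical stand-in for a real data request, and the log file is a temporary file. The futile.logger, glue and tictoc calls are the ones named in the text.

```r
library(futile.logger)
library(glue)
library(tictoc)

# Write every log record both to the console and to a file.
log_name <- tempfile(fileext = ".log")
flog.appender(appender.tee(log_name))
flog.threshold(INFO)

# Hypothetical stand-in for a database query.
run_query <- function(n) {
  Sys.sleep(0.1)
  data.frame(id = seq_len(n), value = seq_len(n) * 1.5)
}

tic()
result <- run_query(3)
# toc() prints the elapsed time to stdout; capture.output() turns it into a string.
flog.info(glue("Data query response time: {capture.output(toc())}"))

# glue is vectorized, so one call produces one log line per table row.
flog.info(glue("Row {result$id}: value = {result$value}"))
```

After the run, the file at `log_name` contains the same records that were printed to the console.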
In general, and especially in dynamically typed languages, it is good practice to verify the data a function receives for processing. Ideally, verification is split into two stages: physical (the data has the required type) and logical (the contents of correctly typed data also meet the stated requirements). A logical check can take an order of magnitude longer than a physical one. An elementary example: at the physical stage we confirm that a vector of floating-point numbers has arrived at the input, and at the logical stage we confirm that all elements of the vector are non-negative.
R has everything needed for this, and with a very good choice at that. I will mention only the most interesting and promising packages: checkmate for physical verification, and assertr and validate for logical verification.
It is nice that, unlike assertive, checkmate was designed from the start for speed and minimal overhead, as described in the publication "checkmate: Fast Argument Checks for Defensive R Programming".
And the ability to write compact validation rules in a regexp-like style with qassert is very good, because it lets you shrink a typical two-line check to a string of a few characters.
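To make the compact rule syntax concrete, here is a small sketch with checkmate. The function `mean_duration()` and its inputs are hypothetical. The rule string follows the checkmate convention of type letter, length marker and optional range.

```r
library(checkmate)

# Hypothetical function that expects non-negative durations.
mean_duration <- function(x) {
  # "N"    numeric vector without missing values
  # "+"    at least one element
  # "[0,)" every value is >= 0
  qassert(x, "N+[0,)")
  mean(x)
}

ok <- mean_duration(c(0.5, 1.5))

# qtest() applies the same rule but returns TRUE/FALSE instead of raising an error.
bad_type  <- qtest("1.5", "N+[0,)")
bad_value <- qtest(c(1, -2), "N+[0,)")

# The assert variant stops with an error on invalid input.
raised <- tryCatch({ mean_duration(c(1, -2)); FALSE }, error = function(e) TRUE)
```

The longer form `assert_numeric(x, lower = 0, any.missing = FALSE, min.len = 1)` expresses the same check with named arguments, which may read better in code that others maintain.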
For logical checks, everyone can choose whatever approach is convenient. It all depends on what kind of data you have, whether processing runs standalone or in a pipeline (pipe), and what exactly needs to be checked.
Depending on what the program logic requires, you can either check the conditions, get TRUE/FALSE and branch on the result, or raise an exception (assert).
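Both styles can be sketched with assertr, one of the packages mentioned above. The `payments` data frame is invented for the example. `success_logical` and `error_logical` are the assertr handlers that turn a check into a plain TRUE/FALSE result.

```r
library(assertr)

# Hypothetical data: one payment has a zero amount.
payments <- data.frame(id = 1:3, amount = c(10, 0, 25))

# Exception style: verify() returns the data unchanged if the rule holds
# and stops with an error otherwise.
checked <- verify(payments, amount >= 0)

raised <- tryCatch(
  { verify(payments, amount > 100); FALSE },
  error = function(e) TRUE
)

# TRUE/FALSE style: replace the default handlers and branch on the result.
all_positive <- verify(
  payments, amount > 0,
  success_fun = success_logical,
  error_fun = error_logical
)

if (!all_positive) {
  payments <- payments[payments$amount > 0, ]
}
```

Which style fits better depends on whether a failed check should stop processing or just redirect it.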
The exception mechanism is undoubtedly useful when working with data, and detailed output of the related information can be very helpful during interactive work in the console. However, once you move to streaming execution, stopping the program on every error is completely unnecessary, and the variety of ways in which diagnostic messages are generated becomes tiresome when writing handlers.
Exception handling with the standard tryCatch mechanism is well described in the Advanced R book. More interesting and useful for data processing in programmatic (non-interactive) mode are the following two extensions:
Both functions are implemented in the purrr package, in the safely family of functions. Receiving a standardized list with error / result fields from any function lets you avoid interrupting stream processing and handle the exceptions and errors after the pipeline has completed. There is no need to sound the alarm and raise an exception every time a division by 0 occurs while processing a vector. It is enough to mark the incorrect element and move on to the next one. Encapsulating exception handling this way makes it possible to replace a few dozen lines of code, written to account for unforeseen situations in data processing, with a single wrapper. Less code and less redundant variability mean a more stable result.
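A minimal sketch of this pattern follows. The inputs are made up, and `log()` on a character value is used as the failing operation because it raises a genuine R error. `possibly()` is a sibling wrapper from purrr that the text above does not name explicitly; it is included here as an assumption about what the "family" covers.

```r
library(purrr)

# Wrap a function so that it never throws: each call returns
# a list with `result` and `error` fields.
safe_log <- safely(log)

inputs  <- list(10, "a", 100)   # the second element will fail
results <- map(inputs, safe_log)

# Inspect failures after the whole pipeline has run.
failed <- map_lgl(results, ~ !is.null(.x$error))
values <- map(results, "result")

# possibly() replaces a failed result with a default value instead.
log_or_na <- possibly(log, otherwise = NA_real_)
clean <- map_dbl(inputs, log_or_na)
```

Here `failed` marks the bad element and `clean` carries an `NA` in its place, so downstream steps can continue and the errors can be reported in one place.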
Briefly, the ways of using safely are described on the RStudio blog: purrr 0.2.0 by Hadley Wickham, plus the documentation.
The testthat package. You can start with the article "testthat: Get Started with Testing", then continue with CRAN and Hadley Wickham's books, as well as the book "Testing R Code". Given the points made above, when reading I would replace the functions from assertive with functions from the checkmate package. Self-tests can be written both for packages and for individual functions.
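A self-test for an individual function might look like the sketch below. The function `share_of_total()` is hypothetical. The argument check uses checkmate, in line with the preference stated above.

```r
library(testthat)
library(checkmate)

# Hypothetical function under test: converts amounts to shares of the total.
share_of_total <- function(x) {
  assert_numeric(x, lower = 0, any.missing = FALSE, min.len = 1)
  x / sum(x)
}

test_that("share_of_total returns shares and rejects bad input", {
  expect_equal(share_of_total(c(1, 1, 2)), c(0.25, 0.25, 0.5))
  expect_error(share_of_total(c(1, -1)))  # negative value
  expect_error(share_of_total("a"))       # wrong type
})
```

Inside a package, the same `test_that()` block would live in a file under `tests/testthat/`.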
We will simply note that this exists, though not everyone knows about it. Building, validation, documentation, checking and so on are all integrated into the RStudio IDE. It is briefly covered in the article "Building, Testing, and Distributing Packages" and described thoroughly in Hadley Wickham's excellent book "R packages". The usethis package can serve as a helper.
There is also the packrat package, which lets you create a snapshot of the packages required for a particular application. This makes the application's software environment independent of the packages installed in the system.
We will simply note that documentation support for packages exists, though not everyone knows about it. It is built on roxygen and integrated with the RStudio IDE.
Everything is described thoroughly in Hadley Wickham's excellent book "R packages".
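For readers who have not seen roxygen-style documentation, here is a small sketch of a documented function. The function `rescale01()` is hypothetical; the `#'` comments and tags follow the usual roxygen2 conventions.

```r
#' Rescale a numeric vector to the [0, 1] range
#'
#' @param x Numeric vector with at least two distinct values.
#' @return Numeric vector of the same length as `x`.
#' @examples
#' rescale01(c(2, 4, 6))
#' @export
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

scaled <- rescale01(c(2, 4, 6))
```

Inside a package, running `roxygen2::roxygenise()` (or the corresponding RStudio build command) turns these comments into help pages.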
R is now developing very actively. Every week new useful features, packages and approaches appear, or existing ones are improved. For data processing tasks, the set of packages mentioned in this publication lets you write fast, stable, compact and predictable R code. A year ago there were fewer packages; what there will be by the end of 2018, only time will tell.
In such a situation, it is wrong to draw conclusions about whether a language and platform can be used based on data from two or more years ago. At a minimum, you should familiarize yourself with the current state.
As for speed, the question is, as always, relative. 2 days of development + 10 seconds of execution in R is far less than 2 weeks of development + 0.1 seconds of execution in, say, Java. Speed has to be discussed in context. For functions that require fast execution, you can implement them in C++ without leaving R by using the Rcpp package. A brief overview of its capabilities can be found in one of the articles by the package's author: "Extending R with C++: A Brief Introduction to Rcpp".
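As a rough sketch of what "C++ without leaving R" looks like, the example below compiles a small C++ function from an R session with `Rcpp::cppFunction()`. It assumes a working C++ toolchain is installed. The function `sum_squares()` is made up for the illustration.

```r
library(Rcpp)

# Compile a C++ function and expose it to R under the same name.
cppFunction('
double sum_squares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];
  }
  return total;
}')

total <- sum_squares(c(1, 2, 3))
```

For a function this simple, vectorized R (`sum(x^2)`) is just as fast; the approach pays off for loops that cannot be vectorized.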
There are more and more tasks involving all kinds of data processing (from collection to visualization). Why not take a look at R?
Previous publication: "R, Asterisk and Wardrobe".
Source: https://habr.com/ru/post/342254/