Although common business tasks are often far removed from the popular topics of big data and machine learning, and typically involve processing relatively small amounts of information (tens of megabytes to tens of gigabytes) scattered in arbitrary representations across various kinds of sources, using R as the main tool lets you automate and speed up these tasks easily and elegantly.
And, of course, after the analysis the results have to be presented, for which Shiny serves well. Below I describe a number of tricks and approaches that can help with such tasks. I am sure any practicing analyst can easily add their own; it all depends on the class of problems being solved.
The situation is typical and occurs in any task several times a day:
Objective: to import this jelly-like substance into a structured representation that can then be processed by standard means (not only in R).
The task breaks down into several successive stages. Let me note right away that, to solve it, I try to make maximum use of the approaches implemented by Hadley Wickham in his "tidyverse" philosophy. This was done for several reasons:
Everything would be fine, and you could simply use the readxl package, but there are nuances.

To guarantee the import, the read_excel function must be given an explicit specification of ALL columns as text. However, the number of column specifications that read_excel expects is not the number of columns in the resulting data.frame, but the number of columns as perceived by the input parser. On some Excel files, a construction of the type

```r
raw <- read_excel(fname)
ctypes <- rep("text", ncol(raw))
cnames <- str_c("grp_", seq_along(ctypes))
raw <- read_excel(fname, col_types = ctypes, col_names = cnames)
```

breaks with the message "Error: Need one name and type for each column". A surface-level study of the objects gives no reasonable answer; to understand how to proceed, you have to dig through GitHub and StackOverflow.
To summarize, the issues and the workaround are as follows:
- The number of columns that the read_excel function expects may not match the data. The situation is further aggravated by the differing behavior of the various xls* file formats. But in 100% of cases for such Excel files, the number of columns seen by read_excel and the number of columns in the source file differ.
- The workaround is to use internal (non-exported) functions of the readxl package. The way to apply them follows from studying the package source on GitHub. This is not ideal, but it solves the problem. Both branches (for .xls and for .xlsx files) have to be handled separately, even though read_excel hides all of this behind its facade.

The approach is demonstrated on the Excel file format described above:
```r
ncol(read_excel("col_test.xlsx"))                 # 4
length(readxl:::xlsx_col_types("col_test.xlsx"))  # 5
ncol(read_excel("col_test.xls"))                  # 2
length(readxl:::xls_col_types("col_test.xls"))    # 5, via the .xls branch
```
Having obtained the correct number of columns, we can import without further problems, following the documentation.
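Putting the pieces together, a minimal sketch of the workaround might look like this. The import_raw helper is my own illustration, and the internal readxl:::*_col_types functions are assumed to match the readxl version current at the time of writing:

```r
library(readxl)
library(stringr)

import_raw <- function(fname) {
  # ask the internal parser for the column count, not ncol() of a trial import
  n <- if (str_detect(fname, "\\.xlsx$")) {
    length(readxl:::xlsx_col_types(fname))  # branch for .xlsx files
  } else {
    length(readxl:::xls_col_types(fname))   # branch for .xls files
  }
  ctypes <- rep("text", n)                  # force ALL columns to text
  cnames <- str_c("grp_", seq_along(ctypes))
  read_excel(fname, col_types = ctypes, col_names = cnames)
}

raw <- import_raw("col_test.xlsx")
```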
Let us consider the solution using the following Excel file (a fragment) as an example.
We can see that a column name is, in fact, spread across anywhere from one to three rows.
The strategy is quite simple and consists of the following: import everything as text; reconstruct full column names from the first few rows of the sheet; replace the reconstructed names with short machine-friendly ones; drop the header and empty rows; and convert the remaining columns to their proper types.
Below is just one possible example of code that solves this problem. In production you could collapse everything into a single processing chain, but to make the process easier to follow, the code is deliberately split into steps. Along the way, a number of functions are used that greatly simplify life:
- repair_names - to fix the names of imported columns;
- na.locf - to fill NA rows with the last non-NA value encountered;
- complete.cases - to remove empty rows;
- tribble (transposed tibble) - for manual construction of a data.frame row by row, not column by column;
- stri_replace_all_fixed - its vectorization properties are used for batch renaming of strings.

```r
library(tidyverse)  # dplyr, tidyr, stringr, tibble
library(readxl)
library(zoo)        # na.locf
library(stringi)    # stri_replace_all_fixed
library(magrittr)   # %<>%

raw <- read_excel(...)  # all columns as text; merged cells come in as NA
df0 <- raw %>% repair_names(prefix = "repaired_", sep = "")

# rows 1-2 of the imported frame (rows 2-3 of the sheet) hold the remaining
# fragments of the column names; a straightforward variant would be:
# name_c2 <- tidyr::gather(df0[1, ], key = name, value = name_c2)
# name_c3 <- tidyr::gather(df0[2, ], key = name, value = name_c3)
# names.df <- dplyr::full_join(name_c2, name_c3, by = "name")
names.df <- tibble(name_c2 = tidyr::gather(df0[1, ], key = name, value = v)$v,
                   name_c3 = tidyr::gather(df0[2, ], key = name, value = v)$v) %>%
  mutate(name_c2 = na.locf(name_c2)) %>%  # fill NAs left by merged cells
  # if name_c3 is NA, keep name_c2 alone; otherwise glue the fragments
  mutate(name.fix = ifelse(is.na(name_c3), name_c2,
                           str_c(name_c2, name_c3, sep = ": "))) %>%
  mutate(name.fix = str_replace_all(name.fix, "\r", " ")) %>%
  mutate(name.fix = str_replace_all(name.fix, "\n", " ")) %>%
  mutate(name.fix = str_replace_all(name.fix, "  ", " "))

df1 <- df0
# the pattern strings are fragments of the original (Russian) headers;
# the Cyrillic parts were lost in this copy of the text
repl.df <- tribble(
  ~pattern,  ~replacement,
  " 1: 2",   "angle_in",
  " 1: 3",   "speed_diff_in",
  " 5: 1: ", "slot_in",
  "",        "mark_out"
)
names(df1) <- stri_replace_all_fixed(names.df$name.fix,
                                     pattern = repl.df$pattern,
                                     replacement = repl.df$replacement,
                                     vectorize_all = FALSE)
df1 %<>% repair_names(prefix = "repaired_", sep = "")  # repair duplicated names

df2 <- df1 %>%
  select(angle_in, speed_diff_in, slot_in, pressure_in,
         concentration_in, performance_out, weight_out, mark_out) %>%
  filter(row_number() > 6) %>%   # drop the header rows
  filter(complete.cases(.)) %>%  # drop rows with missing values
  distinct() %>%                 # drop duplicates
  mutate_each(funs(as.numeric), -mark_out)  # everything except mark_out to numeric
```
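To make the two less common helpers concrete, here is a quick toy illustration (my own example, not part of the pipeline above):

```r
library(zoo)      # na.locf
library(stringi)  # stri_replace_all_fixed

# na.locf carries the last observed value forward over NAs, which is exactly
# what merged header cells need, since the value appears only once per group:
na.locf(c("group A", NA, NA, "group B", NA))
#> [1] "group A" "group A" "group A" "group B" "group B"

# with vectorize_all = FALSE, every pattern/replacement pair is applied to
# every input string, so one call performs the whole batch renaming:
stri_replace_all_fixed("the quick brown fox",
                       pattern       = c("quick", "brown"),
                       replacement   = c("slow", "red"),
                       vectorize_all = FALSE)
#> [1] "the slow red fox"
```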
The concept of reactive programming is orthogonal to the classical linear execution of code, and is therefore hard for analysts to grasp at first. Half of Shiny code is not code to be executed, but declarations of reactions to something. Given that Shiny is developing very actively, it is extremely useful to periodically refresh your understanding of its current state. In fact, the materials of the 2016 Shiny Developer Conference, in particular the "Effective Shiny Programming" talk, gave me grounds, after a routine review, to rework the Shiny code of my applications, reducing the code by roughly 25% and increasing its overall transparency.
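As a minimal toy illustration of this declarative style (my own example, not code from those applications):

```r
library(shiny)

ui <- fluidPage(
  numericInput("n", "Sample size", value = 100),
  plotOutput("hist")
)

server <- function(input, output, session) {
  # a declaration, not a statement: Shiny decides when to recompute this,
  # namely when input$n changes AND something downstream needs the value
  sample_data <- reactive({
    rnorm(input$n)
  })

  output$hist <- renderPlot({
    hist(sample_data(), main = paste("n =", input$n))
  })
}

shinyApp(ui, server)
```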
What's interesting:

- a move away from reactiveValues in favor of separate reactive and observe functions.

Among the more practical tricks, there is a way to partially work around the UTF-8 encoding problems under Windows. Calling the source function from app.R causes problems, and Cyrillic characters cannot be ruled out in principle (comments can be rewritten, but what if they appear on the axes of graphs or in table headers...).
This problem is easily solved by the following replacement:
# source("common.R") eval(parse("common.R", encoding="UTF-8"))
Sometimes a failure occurs inside a function that performs quite complex processing via functions from other packages. More precisely, an exception may be raised, such that the function does not finish its work but is simply interrupted. As a result, the structure of the pipeline's data stream is broken, and everything downstream collapses. A close look at your own code gives nothing, because the problem lies outside it. And wrapping everything in tryCatch, especially in scripts executed once or twice, is something nobody likes to do, as a rule.
In this case, as in many others (including parallel computing), logging helps. Even elementary logging of function entry and exit can let you quickly localize the data row that causes the crash, run the algorithms on it alone, and figure out how to fix the bug.
The futile.logger package is built on the principles of classic loggers and does not require in-depth study. Yet using it can noticeably improve your efficiency and free up time.
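A minimal sketch of the entry/exit logging idea (the process_row helper and the failing data are made up for illustration):

```r
library(futile.logger)

flog.threshold(DEBUG)                           # log everything from DEBUG up
flog.appender(appender.file("processing.log"))  # write the log to a file

process_row <- function(df, i) {
  flog.debug("enter process_row, row = %s", i)
  x <- suppressWarnings(as.numeric(df$value[i]))
  stopifnot(!is.na(x))  # simulates an exception deep inside third-party code
  flog.debug("exit process_row, row = %s", i)
  sqrt(x)
}

df <- data.frame(value = c("1", "4", "oops", "9"), stringsAsFactors = FALSE)
for (i in seq_len(nrow(df))) process_row(df, i)
# the last "enter" record without a matching "exit" in processing.log
# points straight at the failing row
```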
Naturally, these are not all the useful tricks, only the most frequently used ones in the context of the class of problems under consideration.
Previous Post: “What ProgRess Reached”
Source: https://habr.com/ru/post/322066/