
R in Enterprise Tasks: Tips and Tricks

Common business tasks are often far removed from the fashionable topics of big data and machine learning: they typically involve processing relatively small volumes of information (tens of megabytes to tens of gigabytes) scattered in arbitrary representations across various kinds of sources. Even so, using R as the primary tool lets you automate and speed up these tasks easily and elegantly.



And, of course, after the analysis the results have to be presented, for which Shiny works very well. Below I describe a number of tricks and approaches that can help with this. I am sure any practicing analyst can easily add their own; it all depends on the class of problems being solved.





Trick number 1. Importing Excel data formatted for human consumption



The situation is typical and comes up several times a day in almost any task.





Objective: to import this amorphous mass into a structured representation that can be processed by standard tools (not only R).



The task breaks down into several successive stages. I will note right away that, to solve the problem, I try to make maximum use of the approaches implemented by Hadley Wickham in his "tidyverse" philosophy; there are several reasons for this.





Stage 1. Import data



Everything would be fine, and you could simply use the readxl package, but there are some nuances.



  1. Since the sheet is a mixture of numbers and text, including cleverly formatted column names inside the data, all of the data has to be imported as text. To do this, the read_excel function must be given an explicit specification of ALL columns as textual.
  2. To build such a specification you need to know the number of columns; and not the number of columns in the resulting data.frame, but the number perceived by the input parser.


On some Excel files, a construction of the following type



 raw <- read_excel(fname)
 ctypes <- rep("text", ncol(raw))
 cnames <- str_c("grp_", seq_along(ctypes))
 raw <- read_excel(fname, col_types = ctypes, col_names = cnames)


breaks with the message "Error: Need one name and type for each column". Inspecting the objects at the top level gives no reasonable answer; to understand what to do, you have to dig through GitHub and StackOverflow.



To summarize, the issues and the workaround are as follows:



  1. Excel itself determines the size of a table not only from cell contents but also from formatting. A cell with a fill or borders automatically expands the number of "data" columns, even though from read_excel's point of view those columns may contain no data. The situation is further aggravated by the differing behavior of the various xls* file formats. But for such Excel files the number of columns seen by read_excel and the number in the source file always differ.


[image]



  2. To get the correct number of columns, you have to use internal functions of the readxl package; the way to apply them follows from studying the package source on GitHub. This is not ideal, but it solves the problem. Note that both branches (for .xls and for .xlsx files) have to be handled separately, even though read_excel hides all of this behind its facade.


The approach is demonstrated on the Excel file format shown above:



 ncol(read_excel("col_test.xlsx"))                 # 4
 length(readxl:::xlsx_col_types("col_test.xlsx"))  # 5
 ncol(read_excel("col_test.xls"))                  # 2
 length(readxl:::xls_col_types("col_test.xls"))    # 5


Having obtained the correct number of columns, we import without further problems, following the documentation.
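The two branches can be wrapped into a small helper. This is only a sketch: the helper name read_excel_as_text is mine, and readxl's xlsx_col_types / xls_col_types are internal functions that may change between package versions.

```r
library(readxl)
library(stringr)

# Sketch of a helper (name is mine): import a "decorated" sheet entirely
# as text, taking the column count from readxl's internal scanners so it
# matches what the parser actually sees. Both format branches are
# handled explicitly, as readxl keeps them separate internally.
read_excel_as_text <- function(fname) {
  n <- switch(tolower(tools::file_ext(fname)),
              xlsx = length(readxl:::xlsx_col_types(fname)),
              xls  = length(readxl:::xls_col_types(fname)),
              stop("not an Excel file: ", fname))
  read_excel(fname,
             col_types = rep("text", n),
             col_names = str_c("grp_", seq_len(n)))
}
```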



Stage 2. Selecting the columns needed for analysis, given the fill format of each individual sheet



Let us consider the solution using the following Excel file (fragment) as an example.



[image]



We can see that a column's name is in fact spread across one to three rows.



The strategy is quite simple: glue the multi-row headers into single names, map those names onto canonical column names, and then select and clean the data.





Below is just one possible example of code that solves this problem. In production everything can be collapsed into a single processing chain, but for clarity the code is deliberately broken into steps. Along the way, a number of functions are used that greatly simplify life.





 raw <- read_excel(...)  # import; empty cells come in as NA

 # repair empty and duplicated column names
 df0 <- raw %>% repair_names(prefix = "repaired_", sep = "")

 # the real column names are spread over sheet rows 2-3, i.e. rows 1-2 of df0;
 # gather both rows and glue the pieces together
 # name_c2 <- tidyr::gather(df0[1, ], key = name, value = name_c2)  # 1st header row
 # name_c3 <- tidyr::gather(df0[2, ], key = name, value = name_c3)  # 2nd header row
 # a join would not preserve the column order, so the variant below is used instead
 # names.df <- dplyr::full_join(name_c2, name_c3, by = "name")
 names.df <- tibble(name_c2 = tidyr::gather(df0[1, ], key = name, value = v)$v,
                    name_c3 = tidyr::gather(df0[2, ], key = name, value = v)$v) %>%
   mutate(name_c2 = na.locf(name_c2)) %>%  # fill NA gaps left by merged cells
   # if name_c3 is NA, take name_c2 alone, otherwise glue the two parts
   mutate(name.fix = ifelse(is.na(name_c3), name_c2,
                            str_c(name_c2, name_c3, sep = ": "))) %>%
   mutate(name.fix = str_replace_all(name.fix, "\r", " ")) %>%  # drop line breaks
   mutate(name.fix = str_replace_all(name.fix, "\n", " ")) %>%  # drop line breaks
   mutate(name.fix = str_replace_all(name.fix, " ", " "))

 df1 <- df0
 repl.df <- tribble(
   ~pattern, ~replacement,
   " 1:  2", "angle_in",
   " 1:  3", "speed_diff_in",
   " 5:  1:  ", "slot_in",
   "", "mark_out"
 )
 names(df1) <- stri_replace_all_fixed(names.df$name.fix,
                                      pattern = repl.df$pattern,
                                      replacement = repl.df$replacement,
                                      vectorize_all = FALSE)
 # repair the names once more after the replacement
 df1 %<>% repair_names(prefix = "repaired_", sep = "")

 # select the required columns and clean the data
 df2 <- df1 %>%
   select(angle_in, speed_diff_in, slot_in, pressure_in, concentration_in,
          performance_out, weight_out, mark_out) %>%
   filter(row_number() > 6) %>%       # drop the header rows
   filter(complete.cases(.)) %>%      # drop incomplete rows
   distinct() %>%                     # drop duplicate rows
   mutate_each(funs(as.numeric), -mark_out)
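The core of the header-merge step can be illustrated on a toy in-memory example (the column names here are invented for the illustration):

```r
library(tibble)
library(dplyr)
library(stringr)
library(zoo)   # for na.locf

# Toy illustration (invented names) of the header-merge step: the first
# header row has NA gaps left by merged cells, the second holds the
# sub-names; na.locf fills the gaps, str_c glues the parts together.
name_c2 <- c("Speed", NA, "Pressure")
name_c3 <- c("in", "out", NA)

names.df <- tibble(name_c2 = name_c2, name_c3 = name_c3) %>%
  mutate(name_c2 = na.locf(name_c2)) %>%
  mutate(name.fix = ifelse(is.na(name_c3), name_c2,
                           str_c(name_c2, name_c3, sep = ": ")))
names.df$name.fix
# "Speed: in"  "Speed: out" "Pressure"
```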


Trick number 2. Shiny code optimization



The concept of reactive programming is orthogonal to classical linear code execution and is therefore hard for analysts to grasp fully. Half of Shiny code is not code to be executed but a declaration of a reaction to something. Given how actively Shiny is developing, it is extremely useful to refresh your understanding of its current state periodically. In fact, the materials of the 2016 Shiny Developer Conference, in particular the "Effective Shiny Programming" talk, after a routine review gave grounds for reworking the Shiny code of my applications, reducing it by about 25% while increasing its overall transparency.
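One recommendation from those materials is to factor shared computations into reactive() expressions, so that several outputs reuse one cached result instead of each renderer recomputing it. A minimal sketch (the input and output names here are invented):

```r
library(shiny)

# Sketch (invented inputs/outputs): the shared reactive() computes the
# filtered data once per change of input$cyl; both renderers reuse the
# cached result instead of filtering independently.
server <- function(input, output, session) {
  filtered <- reactive({
    subset(mtcars, cyl == input$cyl)   # recomputed only when input$cyl changes
  })
  output$tbl  <- renderTable(filtered())
  output$hist <- renderPlot(hist(filtered()$mpg))
}
```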

What's interesting:





Among the more useful tricks is a way to partially fix utf-8 encoding problems under Windows. Using the source function in app.R causes problems: non-Latin characters cannot be ruled out in principle (comments can be rewritten, but not text that appears on plot axes or in table headers).

This problem is easily solved by the following replacement:



 # source("common.R")
 eval(parse("common.R", encoding = "UTF-8"))
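A self-contained way to see the effect (the file contents here are invented): write a small UTF-8 script to a temporary file and evaluate it via parse() with an explicit encoding.

```r
# Self-contained check (invented file contents): a script containing
# Cyrillic text is written as UTF-8 and evaluated via parse() with an
# explicit encoding; the string then survives intact even on Windows,
# where plain source() may garble it.
tmp <- tempfile(fileext = ".R")
writeLines('greeting <- "\u043f\u0440\u0438\u0432\u0435\u0442"', tmp)
eval(parse(tmp, encoding = "UTF-8"))
nchar(greeting)   # 6 -- six Cyrillic letters, not mojibake
```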


Trick number 3. Fighting unpredictable inputs



Sometimes a failure occurs inside a function that performs rather complex processing via functions from other packages. More precisely, an exception may be thrown, and in such a way that the function does not finish its work but is simply interrupted. As a result, the structure of the data stream in the pipeline is broken, and then everything collapses. A close look at the code gives nothing, because the problem lies outside it. And wrapping everything in tryCatch, especially in programs run once or twice, is something nobody likes doing.



In this case, as in many others (including parallel computing), logging helps. Even elementary logging of function entry and exit can let you quickly localize the data row that causes the crash, run the algorithms on it alone, and figure out how to fix the bug.



The futile.logger package is built on the principles of classic loggers and does not require in-depth study. Nevertheless, using it can noticeably improve your efficiency and free up time.
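A minimal entry/exit logging sketch with futile.logger (the processing function and data here are invented):

```r
library(futile.logger)

# Sketch (invented function and data): entry/exit logging around a
# row-wise computation localizes the offending input row when a deep
# exception interrupts processing -- the last "enter" line without a
# matching "exit" names the row to investigate.
process_row <- function(i, x) {
  flog.debug("enter process_row: row %d, x = %s", i, x)
  res <- log(x)                         # stand-in for complex processing
  flog.debug("exit process_row: row %d", i)
  res
}

flog.threshold(DEBUG)                   # show debug-level messages
v <- c(1, exp(1), exp(2))
sapply(seq_along(v), function(i) process_row(i, v[i]))
# 0 1 2
```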



Naturally, these are not all the useful tricks, just the ones used most often for the class of problems considered.







Source: https://habr.com/ru/post/322066/


