Although common business tasks are often far removed from the popular topics of big data and machine learning, and typically involve processing relatively small amounts of information (tens of megabytes to tens of gigabytes) scattered in arbitrary representations across various kinds of sources, using R as the main tool lets you automate and speed up these tasks easily and elegantly.
And, of course, after the analysis the results have to be presented, for which Shiny serves well. Below I describe a number of tricks and approaches that can help with such tasks. I am sure any practicing analyst can easily add their own; it all depends on the class of problems being solved.
The situation is typical and occurs in any task several times a day:
Objective: to import this jelly-like substance into a structured representation that can then be processed by standard means (not only in R).
The task breaks down into several successive stages. Let me note right away that, to solve it, I try to make maximum use of the approaches implemented by Hadley Wickham in his "tidyverse" philosophy. This was done for several reasons:
Everything would be fine, and you could simply use the readxl package, but there are nuances.

To guarantee the import, the read_excel function must be given an explicit specification of ALL columns as text. However, the number of column specifications that read_excel expects is not the number of columns in the resulting data.frame, but the number of columns as perceived by the input parser. On some Excel files, a construction of the type

```r
raw <- read_excel(fname)
ctypes <- rep("text", ncol(raw))
cnames <- str_c("grp_", seq_along(ctypes))
raw <- read_excel(fname, col_types = ctypes, col_names = cnames)
```

breaks with the message "Error: Need one name and type for each column". A surface-level study of the objects gives no reasonable answer; to understand how to proceed, you have to dig through GitHub and StackOverflow.
To summarize, the issues and the workaround are as follows:
- The number of columns that the read_excel function expects may not match the data. The situation is further aggravated by the differing behavior of the various xls* file formats. But in 100% of cases for such Excel files, the number of columns seen by read_excel and the number of columns in the source file differ.
- The workaround is to use internal (non-exported) functions of the readxl package. The way to apply them follows from studying the package source on GitHub. This is not ideal, but it solves the problem. Both branches (for .xls and for .xlsx files) have to be handled separately, even though read_excel hides all of this behind its facade.

The approach is demonstrated on the Excel file format described above:
```r
ncol(read_excel("col_test.xlsx"))                 # 4
length(readxl:::xlsx_col_types("col_test.xlsx"))  # 5
ncol(read_excel("col_test.xls"))                  # 2
length(readxl:::xls_col_types("col_test.xls"))    # 5, via the .xls branch
```
Having obtained the correct number of columns, we can import without further problems, following the documentation.
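Putting the pieces together, a minimal sketch of the workaround might look like this. The import_raw helper is my own illustration, and the internal readxl:::*_col_types functions are assumed to match the readxl version current at the time of writing:

```r
library(readxl)
library(stringr)

import_raw <- function(fname) {
  # ask the internal parser for the column count, not ncol() of a trial import
  n <- if (str_detect(fname, "\\.xlsx$")) {
    length(readxl:::xlsx_col_types(fname))  # branch for .xlsx files
  } else {
    length(readxl:::xls_col_types(fname))   # branch for .xls files
  }
  ctypes <- rep("text", n)                  # force ALL columns to text
  cnames <- str_c("grp_", seq_along(ctypes))
  read_excel(fname, col_types = ctypes, col_names = cnames)
}

raw <- import_raw("col_test.xlsx")
```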
Let us consider the solution using the following Excel file (a fragment) as an example.
We can see that a column name is, in fact, spread across anywhere from one to three rows.
The strategy is quite simple and consists of the following: import everything as text; reconstruct full column names from the first few rows of the sheet; replace the reconstructed names with short machine-friendly ones; drop the header and empty rows; and convert the remaining columns to their proper types.
Below is just one possible example of code that solves this problem. In production you could collapse everything into a single processing chain, but to make the process easier to follow, the code is deliberately split into steps. Along the way, a number of functions are used that greatly simplify life:
- repair_names - to fix the names of imported columns;
- na.locf - to fill NA rows with the last non-NA value encountered;
- complete.cases - to remove empty rows;
- tribble (transposed tibble) - for manual construction of a data.frame row by row, not column by column;
- stri_replace_all_fixed - its vectorization properties are used for batch renaming of strings.

```r
library(tidyverse)  # dplyr, tidyr, stringr, tibble
library(readxl)
library(zoo)        # na.locf
library(stringi)    # stri_replace_all_fixed
library(magrittr)   # %<>%

raw <- read_excel(...)  # all columns as text; merged cells come in as NA
df0 <- raw %>% repair_names(prefix = "repaired_", sep = "")

# rows 1-2 of the imported frame (rows 2-3 of the sheet) hold the remaining
# fragments of the column names; a straightforward variant would be:
# name_c2 <- tidyr::gather(df0[1, ], key = name, value = name_c2)
# name_c3 <- tidyr::gather(df0[2, ], key = name, value = name_c3)
# names.df <- dplyr::full_join(name_c2, name_c3, by = "name")
names.df <- tibble(name_c2 = tidyr::gather(df0[1, ], key = name, value = v)$v,
                   name_c3 = tidyr::gather(df0[2, ], key = name, value = v)$v) %>%
  mutate(name_c2 = na.locf(name_c2)) %>%  # fill NAs left by merged cells
  # if name_c3 is NA, keep name_c2 alone; otherwise glue the fragments
  mutate(name.fix = ifelse(is.na(name_c3), name_c2,
                           str_c(name_c2, name_c3, sep = ": "))) %>%
  mutate(name.fix = str_replace_all(name.fix, "\r", " ")) %>%
  mutate(name.fix = str_replace_all(name.fix, "\n", " ")) %>%
  mutate(name.fix = str_replace_all(name.fix, "  ", " "))

df1 <- df0
# the pattern strings are fragments of the original (Russian) headers;
# the Cyrillic parts were lost in this copy of the text
repl.df <- tribble(
  ~pattern,  ~replacement,
  " 1: 2",   "angle_in",
  " 1: 3",   "speed_diff_in",
  " 5: 1: ", "slot_in",
  "",        "mark_out"
)
names(df1) <- stri_replace_all_fixed(names.df$name.fix,
                                     pattern = repl.df$pattern,
                                     replacement = repl.df$replacement,
                                     vectorize_all = FALSE)
df1 %<>% repair_names(prefix = "repaired_", sep = "")  # repair duplicated names

df2 <- df1 %>%
  select(angle_in, speed_diff_in, slot_in, pressure_in,
         concentration_in, performance_out, weight_out, mark_out) %>%
  filter(row_number() > 6) %>%   # drop the header rows
  filter(complete.cases(.)) %>%  # drop rows with missing values
  distinct() %>%                 # drop duplicates
  mutate_each(funs(as.numeric), -mark_out)  # everything except mark_out to numeric
```
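To make the two less common helpers concrete, here is a quick toy illustration (my own example, not part of the pipeline above):

```r
library(zoo)      # na.locf
library(stringi)  # stri_replace_all_fixed

# na.locf carries the last observed value forward over NAs, which is exactly
# what merged header cells need, since the value appears only once per group:
na.locf(c("group A", NA, NA, "group B", NA))
#> [1] "group A" "group A" "group A" "group B" "group B"

# with vectorize_all = FALSE, every pattern/replacement pair is applied to
# every input string, so one call performs the whole batch renaming:
stri_replace_all_fixed("the quick brown fox",
                       pattern       = c("quick", "brown"),
                       replacement   = c("slow", "red"),
                       vectorize_all = FALSE)
#> [1] "the slow red fox"
```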
The concept of reactive programming is orthogonal to the classical linear execution of code, and is therefore hard for analysts to grasp at first. Half of Shiny code is not code to be executed, but declarations of reactions to something. Given that Shiny is developing very actively, it is extremely useful to periodically refresh your understanding of its current state. In fact, the materials of the 2016 Shiny Developer Conference, in particular the "Effective Shiny Programming" talk, gave me grounds, after a routine review, to rework the Shiny code of my applications, reducing the code by roughly 25% and increasing its overall transparency.
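As a minimal toy illustration of this declarative style (my own example, not code from those applications):

```r
library(shiny)

ui <- fluidPage(
  numericInput("n", "Sample size", value = 100),
  plotOutput("hist")
)

server <- function(input, output, session) {
  # a declaration, not a statement: Shiny decides when to recompute this,
  # namely when input$n changes AND something downstream needs the value
  sample_data <- reactive({
    rnorm(input$n)
  })

  output$hist <- renderPlot({
    hist(sample_data(), main = paste("n =", input$n))
  })
}

shinyApp(ui, server)
```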
What's interesting:

- a move away from reactiveValues in favor of separate reactive and observe functions.

Among the more practical tricks, there is a way to partially work around the UTF-8 encoding problems under Windows. Calling the source function from app.R causes problems, and Cyrillic characters cannot be ruled out in principle (comments can be rewritten, but what if they appear on the axes of graphs or in table headers...).
This problem is easily solved by the following replacement:
# source("common.R") eval(parse("common.R", encoding="UTF-8"))
Sometimes a failure occurs inside a function that performs quite complex processing via functions from other packages. More precisely, an exception may be raised, such that the function does not finish its work but is simply interrupted. As a result, the structure of the pipeline's data stream is broken, and everything downstream collapses. A close look at your own code gives nothing, because the problem lies outside it. And wrapping everything in tryCatch, especially in scripts executed once or twice, is something nobody likes to do, as a rule.
In this case, as in many others (including parallel computing), logging helps. Even elementary logging of function entry and exit can let you quickly localize the data row that causes the crash, run the algorithms on it alone, and figure out how to fix the bug.
The futile.logger package is built on the principles of classic loggers and does not require in-depth study. Yet using it can noticeably improve your efficiency and free up time.
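A minimal sketch of the entry/exit logging idea (the process_row helper and the failing data are made up for illustration):

```r
library(futile.logger)

flog.threshold(DEBUG)                           # log everything from DEBUG up
flog.appender(appender.file("processing.log"))  # write the log to a file

process_row <- function(df, i) {
  flog.debug("enter process_row, row = %s", i)
  x <- suppressWarnings(as.numeric(df$value[i]))
  stopifnot(!is.na(x))  # simulates an exception deep inside third-party code
  flog.debug("exit process_row, row = %s", i)
  sqrt(x)
}

df <- data.frame(value = c("1", "4", "oops", "9"), stringsAsFactors = FALSE)
for (i in seq_len(nrow(df))) process_row(df, i)
# the last "enter" record without a matching "exit" in processing.log
# points straight at the failing row
```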
Naturally, these are not all the useful tricks, only the most frequently used ones in the context of the class of problems under consideration.
Previous Post: “What ProgRess Reached”
Source: https://habr.com/ru/post/322066/