Data tidying: Preparing Data Sets for Case-Based Analysis

This article arose from the processing and translation of information into Russian, taken from two sources:

from the article "Tidy Data"
from the relevant swirl tidyr package lesson

For data analysis professionals, this may look like a long-learned multiplication table — they are unlikely to find anything new here. And those who, like me, are only acquainted with this field and the possibilities of the R language, I invite you to continue reading.

Introduction

There is a widespread opinion that up to 80% of the data analysis process is the time spent on their preparation. In this article we will focus only on one aspect of preparing data for analysis - structuring and putting in order data sets, the so-called. data tidying

The term “tidy data” can be translated into Russian as “accurate data” or “ordered data”. Paraphrasing Leo Tolstoy, all the ordered data sets look alike, each random data set is disordered in its own way.
')
Let's start with the question of what an ordered data set is.

Defining an ordered data set

Most of the data sets are tables containing rows and columns . Data sets contain values . This is usually either numbers (quantitative data) or strings (qualitative data). Each value refers on the one hand to the variable , on the other hand to the corresponding observation . At the same time, observations can be grouped into types of observation units (separation units) in order to ensure their separate storage and to prevent possible inconsistencies.

Whether the dataset is ordered or random is dependent on how the rows, columns, and tables correspond to the observations, variables, and types of observation units. There are three signs of an ordered data set:

Each variable forms a column;
Each observation forms a string;
Each type of observation unit forms a table.

Violation of any of the listed symptoms means that the data set is erratic.

Sort datasets

In the real world, you can get an ordered set of data right away, except by chance. Consider the five main problems in data sets and their solutions using the R tidyr and dplyr language packages.

1. The column headings contain values, not variable names.

Consider a data set of students:

> students grade male female 1 A 1 5 2 B 5 0 3 C 5 2 4 D 5 5 5 E 7 4

In the first column of the Grade, the marks given by the students are indicated, and in the second and third - what number of boys and girls, respectively, were among them.

In fact, there are three variables in this data set — score, gender, and quantity. The value of the variable "sex" is contained in the headers of the second and third columns. The variable number describes how many students there are for each combination of grade and gender.

To organize this data set, we need to ensure that each column describes a separate variable. This can be easily done using the Gather function:

 > gather(students, sex, count, -grade) grade sex count 1 A male 1 2 B male 5 3 C male 5 4 D male 5 5 E male 7 6 A female 5 7 B female 0 8 C female 2 9 D female 5 10 E female 4

Using the Gather function, we collect several columns in key-value pairs. In this case, sex is key, and count is value. The "-grade" parameter means that this variable is not involved in the process and remains unchanged.

2. Several variables are stored in one column.

Consider the dataset Students2:

 > students2 grade male_1 female_1 male_2 female_2 1 A 3 4 3 4 2 B 6 4 3 5 3 C 7 4 3 8 4 D 4 0 8 1 5 E 1 1 2 7

It is similar to the previous data set. The difference is that there is a division into two classes and the number of students is indicated with the division by gender and class. In addition to the previous problem, a new one is added here - the variables “gender” and “class” are stored in one column.

In this case, the problem of ordering the data set is solved in two moves. First, we print the count variable, preserving the union of the gender and class variables. Next, we separate the variables "gender" and "class" in different columns. For convenience, in the mini-script we combine all actions with the chain operator:

 students2 %>% gather(sex_class, count, -grade) %>% separate(sex_class, into = c("sex", "class")) %>% print

Result of work:

  grade sex class count 1 A male 1 3 2 B male 1 6 3 C male 1 7 4 D male 1 4 5 E male 1 1 6 A female 1 4 7 B female 1 4 8 C female 1 4 9 D female 1 0 10 E female 1 1 11 A male 2 3 12 B male 2 3 13 C male 2 3 14 D male 2 8 15 E male 2 2 16 A female 2 4 17 B female 2 5 18 C female 2 8 19 D female 2 1 20 E female 2 7

3. Variables are stored both in columns and in rows.

Consider the dataset students3:

 > students3 name test class1 class2 class3 class4 class5 1 Sally midterm A <NA> B <NA> <NA> 2 Sally final C <NA> C <NA> <NA> 3 Jeff midterm <NA> D <NA> A <NA> 4 Jeff final <NA> E <NA> C <NA> 5 Roger midterm <NA> C <NA> <NA> B 6 Roger final <NA> A <NA> <NA> A 7 Karen midterm <NA> <NA> CA <NA> 8 Karen final <NA> <NA> CA <NA> 9 Brian midterm B <NA> <NA> <NA> A 10 Brian final B <NA> <NA> <NA> C

For each of the five students, we have an interim and final assessment. In addition, each was trained in two classes out of five possible.

The problems of this dataset begin with the fact that the class1: class5 column names contain the values of one class variable. The values of the test (midterm, final) column must be variable and contain a grade value for each student.
To solve this problem, organize the class variable and hide the column names class1: class5 in its values. Next, we expand the values of the test column into the final and midterm variables. And finally, we remove the redundancy of the values of the variable class, leaving only the numbers there. Below is a mini script:

 students3 %>% gather(class, grade, class1:class5, na.rm = TRUE) %>% spread(test, grade) %>% mutate(class, class = extract_numeric(class)) %>% print

And below are the results of his work - an ordered set of data:

  name class final midterm 1 Brian 1 BB 2 Brian 5 CA 3 Jeff 2 ED 4 Jeff 4 CA 5 Karen 3 CC 6 Karen 4 AA 7 Roger 2 AC 8 Roger 5 AB 9 Sally 1 CA 10 Sally 3 CB

4. Several types of observation units are stored in one table.

The following data set, students4, looks almost the same as the ordered students3 from the previous example. The main difference is the addition of columns id and gender.

 > students4 id name sex class midterm final 1 168 Brian F 1 BB 2 168 Brian F 5 AC 3 588 Sally M 1 AC 4 588 Sally M 3 BC 5 710 Jeff M 2 DE 6 710 Jeff M 4 AC 7 731 Roger F 2 CA 8 731 Roger F 5 BA 9 908 Karen M 3 CC 10 908 Karen M 4 AA

The problem with typing is data redundancy - combinations (id, name, sex) occur twice. The solution to the problem consists in splitting the data set into two tables:

The first will store information on students (id, name, sex)
The second will store the estimated information (id, class, midterm, final)

We collect a table of information on students (choosing the necessary columns and removing duplicates):

 student_info <- students4 %>% select(id, name, sex) %>% unique() %>% print id name sex 1 168 Brian F 3 588 Sally M 5 710 Jeff M 7 731 Roger F 9 908 Karen M

The second table is obtained by simply choosing the fields we need:

 gradebook <- students4 %>% select(id, class, midterm, final) %>% print id class midterm final 1 168 1 BB 2 168 5 AC 3 588 1 AC 4 588 3 BC 5 710 2 DE 6 710 4 AC 7 731 2 CA 8 731 5 BA 9 908 3 CC 10 908 4 AA

The tables are linked by the id field (student ID).

5. One observation unit is stored in several tables.

Let's give an example the reverse of the previous - the observation unit is stored in different tables (passed - those who passed the exam and failed - those who did not pass):

 > passed name class final 1 Brian 1 B 2 Roger 2 A 3 Roger 5 A 4 Karen 4 A > failed name class final 1 Brian 5 C 2 Sally 1 C 3 Sally 3 C 4 Jeff 2 E 5 Jeff 4 C 6 Karen 3 C

At the same time, information on whether the student passed the exam or not is contained in the assessment itself (A, B - passed, in other cases - did not pass).

We merge information into one table, after adding the status column with a sign, whether the student passed the exam or not:

 passed <- mutate(passed, status = "passed") failed <- mutate(failed, status = "failed") > rbind_list(passed, failed) Source: local data frame [10 x 4] name class final status 1 Brian 1 B passed 2 Roger 2 A passed 3 Roger 5 A passed 4 Karen 4 A passed 5 Brian 5 C failed 6 Sally 1 C failed 7 Sally 3 C failed 8 Jeff 2 E failed 9 Jeff 4 C failed 10 Karen 3 C failed

As a result, we received an ordered data set with the passed / failed sign.

List of sources:

"Tidy data" by Hadley Wickham ( http://vita.had.co.nz/papers/tidy-data.pdf )
http://swirlstats.com/students.html

Source: https://habr.com/ru/post/248741/

All Articles

Data tidying: Preparing Data Sets for Case-Based Analysis

Introduction

Defining an ordered data set

Sort datasets

List of sources:

More articles: