Imagine your data before you collect it.

Translation prepared for students of the course "Applied Analytics on R".

As data scientists, we are often handed a dataset and asked to extract insights from it. We use R to process, visualize, and model the data, and to prepare tables and graphs for sharing or publishing the results. Seen this way, it doesn’t matter where the data came from: the sample size, the set of variables, and their scales are fixed. However, the procedures used to collect or generate data are extremely important for the later analysis and for the quality of the conclusions we can ultimately draw. The data collection process affects how the data should be analyzed. For studies that measure cause-and-effect relationships, it matters which data are taken into account and which are not.

Because these processes are so important, we wanted to create a tool that helps scientists and other researchers imagine (simulate) their data before collecting it, so that any changes to the data collection step can be made before it is too late.

If the data has already been collected, the tool lets you imagine your data before analyzing it. When we make data processing and modeling decisions based on the results of each step, or on the output of statistical models, we unknowingly become vulnerable to biases such as the “garden of forking paths” or p-hacking, which can lead us to choose the analysis procedure that gives the best result. We work with the actual data because we have no good alternative: there is no other data with the same structure and characteristics as the data we collected.

This article introduces the fabricatr package (part of the DeclareDesign suite of packages), whose role is to simulate the structure and variables of a dataset. See the overview of DeclareDesign on RViews, which describes its philosophy. The fabricatr package helps you think about your data before you start analyzing or even collecting it. What data will you have? How is it structured? What measurements will you take? What are their ranges, and how do they correlate? fabricatr can help you simulate fake data before collecting real data and test various estimation strategies without worrying about changing your assumptions.

Imagine your data structure

In its simplest form, fabricatr creates a single-level dataset with the specified number of observations.

    library(fabricatr)

    fabricate(
      N = 100,
      temp_fahrenheit = rnorm(N, mean = 80, sd = 20)
    )
    ## Warning: `is_lang()` is deprecated as of rlang 0.2.0.
    ## Please use `is_call()` instead.
    ## This warning is displayed once per session.
    ## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
    ## Please use `call_name()` instead.
    ## This warning is displayed once per session.

ID	TEMP_FAHRENHEIT
001	56.6
002	46.3
003	90.5
004	75.1
005	85.1
006	102.8
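
Variables inside a fabricate() call can build on variables defined earlier in the same call. A minimal sketch, assuming a second column temp_celsius (a name introduced here for illustration, not part of the original example) derived from temp_fahrenheit:

```r
library(fabricatr)

set.seed(42)  # arbitrary seed so the sketch is reproducible

dat <- fabricate(
  N = 100,
  temp_fahrenheit = rnorm(N, mean = 80, sd = 20),
  # hypothetical extra variable: convert the simulated temperature to Celsius
  temp_celsius = (temp_fahrenheit - 32) * 5 / 9
)

head(dat)
summary(dat$temp_fahrenheit)
```

Looking at the simulated ranges this way is a quick check on whether the assumed mean and standard deviation produce plausible values.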

Social science data is often hierarchical. For example, schools contain classrooms, and classrooms contain students. The add_level function in fabricatr handles this case as well. By default, each new level is nested within the level above it.

    library(fabricatr)

    fabricate(
      # five schools
      school = add_level(N = 5, n_classrooms = sample(10:15, N, replace = TRUE)),
      # 10 to 15 classrooms per school
      classroom = add_level(N = n_classrooms),
      # 15 students per classroom
      student = add_level(N = 15)
    )
    ## Warning: `lang_modify()` is deprecated as of rlang 0.2.0.
    ## Please use `call_modify()` instead.
    ## This warning is displayed once per session.

SCHOOL	N_CLASSROOMS	CLASSROOM	STUDENT
1	12	01	001
1	12	01	002
1	12	01	003
1	12	01	004
1	12	01	005
1	12	01	006
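
One way to confirm that the levels nested as intended is to count units at each level. A sketch under stated assumptions: the seed and the helper name classrooms_per_school are choices made here, not part of the original example.

```r
library(fabricatr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

dat <- fabricate(
  school = add_level(N = 5, n_classrooms = sample(10:15, N, replace = TRUE)),
  classroom = add_level(N = n_classrooms),
  student = add_level(N = 15)
)

# number of distinct classrooms observed within each school
classrooms_per_school <- tapply(
  dat$classroom, dat$school,
  function(x) length(unique(x))
)
classrooms_per_school

# one row per student, 15 students in every classroom
nrow(dat) == 15 * sum(classrooms_per_school)
```

Each school should show between 10 and 15 classrooms, and the final comparison should be TRUE if every classroom received exactly 15 students.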

In the real world, non-nested, overlapping hierarchies often arise. For example, student data may come from both a primary school and a secondary school. In that case each student belongs to two different schools, and those schools are not nested within each other. Below is an example of how to create such cross-classified data. The rho parameter determines how strongly primary_rank and secondary_rank should correlate.

    library(ggplot2)

    # Note: recent fabricatr releases name this helper join_using() instead of join().
    dat <- fabricate(
      primary_schools = add_level(N = 5, primary_rank = 1:N),
      secondary_schools = add_level(N = 6, secondary_rank = 1:N, nest = FALSE),
      students = link_levels(N = 15, by = join(primary_rank, secondary_rank, rho = 0.9))
    )
    ## `link_levels()` calls are faster if the `mvnfast` package is installed.

    ggplot(dat, aes(primary_rank, secondary_rank)) +
      geom_point(position = position_jitter(width = 0.1, height = 0.1), alpha = 0.5) +
      theme_bw()

Similarly, you can generate longitudinal data with cross_levels:

    # Note: recent fabricatr releases name the join() helper join_using().
    fabricate(
      students = add_level(N = 2),
      years = add_level(N = 20, year = 1981:2000, nest = FALSE),
      student_year = cross_levels(by = join(students, years))
    )

STUDENTS	YEARS	YEAR	STUDENT_YEAR
1	01	1981	01
2	01	1981	02
1	02	1982	03
2	02	1982	04
1	03	1983	05
2	03	1983	06
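
To see what crossing two levels means, here is a base R sketch that builds the same student-by-year panel with expand.grid(). It illustrates the shape of the result only and is not how fabricatr implements cross_levels; the object name panel is made up for this example.

```r
students <- c("1", "2")
years <- 1981:2000

# every student is paired with every year; the first column varies fastest
panel <- expand.grid(
  students = students,
  year = years,
  stringsAsFactors = FALSE
)

# running ID for each student-year observation
panel$student_year <- seq_len(nrow(panel))

nrow(panel)  # 2 students x 20 years
head(panel)
```

The row count is the product of the two level sizes, which is what distinguishes a full crossing from the nested structures shown earlier.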

Imagine your variables

R has many great tools for simulating variables. However, some common types of variables are surprisingly difficult to simulate. fabricatr provides a small number of functions with simple syntax for creating frequently used variable types. Two examples are described here; see the article for the rest.

Variables with intraclass correlation

Using the tools described above, you can construct data with both within-group and between-group variation, for example variation within classrooms and variation between classrooms. In many cases you need to set the level of intraclass correlation (ICC) more precisely. The draw_normal_icc and draw_binary_icc functions help here.

    dat <- fabricate(
      N = 1000,
      clusters = sample(LETTERS, N, replace = TRUE),
      Y1 = draw_normal_icc(clusters = clusters, ICC = .2),
      Y2 = draw_binary_icc(clusters = clusters, ICC = .2)
    )

    ICC::ICCbare(clusters, Y1, dat)
    ## [1] 0.09726701
    ICC::ICCbare(clusters, Y2, dat)
    ## [1] 0.176036
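
The estimates printed above differ from the target of 0.2; with only 26 clusters, that amount of sampling noise is to be expected. The base R sketch below shows the underlying idea with more clusters: split a total variance of 1 into a between-cluster share equal to the ICC and a within-cluster share of 1 minus the ICC, then recover the ICC with the one-way ANOVA estimator. It is an illustration under those assumptions, not necessarily how draw_normal_icc works internally, and all object names are made up here.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

n_clusters <- 200
cluster_size <- 50
icc <- 0.2

cluster <- rep(seq_len(n_clusters), each = cluster_size)

# between-cluster variance = icc, within-cluster variance = 1 - icc
cluster_effect <- rnorm(n_clusters, mean = 0, sd = sqrt(icc))
y <- cluster_effect[cluster] + rnorm(length(cluster), mean = 0, sd = sqrt(1 - icc))

# one-way ANOVA mean squares for a balanced design
cluster_means <- as.numeric(tapply(y, cluster, mean))
msb <- cluster_size * sum((cluster_means - mean(y))^2) / (n_clusters - 1)
msw <- sum((y - cluster_means[cluster])^2) / (length(y) - n_clusters)

# ANOVA estimator of the intraclass correlation
icc_hat <- (msb - msw) / (msb + (cluster_size - 1) * msw)
icc_hat
```

Rerunning the sketch with n_clusters set to 26 gives a feel for how widely the estimate can vary around the target when there are few clusters.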

Ordered outcomes

We also have tools for discrete random variables, including ordered outcomes. Here we take a latent variable (for example, test_ability) and convert it into an ordered variable (test_score).

    library(ggplot2)

    dat <- fabricate(
      N = 100,
      test_ability = rnorm(N),
      test_score = draw_ordered(test_ability, breaks = c(-.5, 0, .5))
    )

    ggplot(dat, aes(test_ability, test_score)) +
      geom_point() +
      theme_bw()
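
The cut-point logic can be sketched in base R with findInterval(): three breaks split the latent scale into four ordered categories. This is an illustration of the idea rather than fabricatr's own code, and numbering the categories from 1 is a choice made for this sketch.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

test_ability <- rnorm(100)
breaks <- c(-0.5, 0, 0.5)

# findInterval() counts how many breaks lie at or below each value (0 to 3);
# adding 1 labels the categories 1 to 4
test_score <- findInterval(test_ability, breaks) + 1

table(test_score)
```

Moving the breaks changes how many observations fall into each category, which is a simple way to try out different grading schemes before any real scores exist.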

fabricatr is compatible with almost any R function that creates variables. In this article, we have described some terrific R packages that help simulate social science variables.

Where to go next

This article is a high-level overview of fabricatr's functionality. For more information, read the article Getting started with fabricatr.

You can install fabricatr from CRAN:

    install.packages("fabricatr")
    library(fabricatr)

Have questions? Write in the comments!

Source: https://habr.com/ru/post/460187/
