RMarkdown, R and ggplot

This article is neither documentation nor anything fundamentally new; treat it as an overview or a cheat sheet.

Preamble

A conference is first and foremost its talks, and how the slides are put together matters a great deal.

Of course, there are speakers who can give a talk without a single slide, but even then slides generally complement the narrative well. For some it is enough to throw a few memes into the talk and the job is done; others need to show code, even assembly (for those who don't know, JPoint is a Java conference); and others need to show charts. Combinations of these are common too.

Probably the best-known tools for creating slides are:

- PowerPoint and its variations: LibreOffice Impress, Apple Keynote
- cloud variations of the same approach: Google Slides
- LaTeX
- and the relatively new (for me) RMarkdown

The first two are essentially binary formats, and Google Slides and other cloud tools also require an Internet connection, which is an unpleasant restriction on trips and flights. The last two work offline and are plain text, which means you can keep the full history of changes in git, hg, or whatever you prefer. Besides, their scope is far from limited to slides.

LaTeX is a format with a long history, and a lot has been written and said about it, whereas RMarkdown is young, even a little hipster, though not over the top.

Markdown

Markdown is a lightweight markup language designed to make text as readable and as easy to edit as possible. It is easy to understand and easy to read even without any conversion.

See for yourself: _this is italic_, **this is strong emphasis**, and much more is described in the Markdown cheatsheet.

Markdown is supported by GitHub, Habrahabr, Sublime, Jira (which has a similar syntax), and many others.

R

R is a programming language for statistical data processing and working with graphics.

What usually puts people off is the impression that it is very difficult, that it is all math, and that it is not needed anyway. But no one forces you to use all the available functionality; perhaps the simplest and most intuitive part is graphics and visualization.

Excel is often enough to build a chart, but it becomes hard to cope when the number of data points approaches a million, whereas for R this is not a difficult task.

Data and graphics

Let's leave aside the battle over which is better, tables or charts. It is a matter of taste.

To build charts we will use the ggplot2 package.

Addendum: we need R itself and RStudio, for example on macOS with Homebrew:

$ brew tap homebrew/science
$ brew install r
$ brew cask install rstudio

Install the package for R from within RStudio:

 install.packages("ggplot2")

But to build charts you need data, and it is reasonable to keep it separate from the presentation, for example in CSV, which is again a simple text format.

My data is the results obtained with jmh for my talk Inside the VM through the keyhole hashCode. I like the style used by Alexei Shipilev: write the benchmark results as a comment at the beginning of the file; then a bit of grep and sed, and we have a CSV file.

Contents of csv/allocations.csv:

 pos,alloc,value,error
 10,single-threaded,2.836,0.285
 20,java,9.878,2.676
 28,epsilon,75.289,23.667
 30,sync,186.672,21.195
 40,cas,74.721,0.192
 50,tlab,8.506,1.849
 55,javaHashCode,60.270,12.318
 57,readHashCode,7.296,0.316

We build a data frame from the CSV file. By default the file is assumed to have a header row, which helps when referring to individual columns:

```{r}
df = read.csv(file = "csv/allocations.csv")
```

If we want to filter, for example by specific values in the alloc column:

 df <- subset(df, alloc == "cas" | alloc == "java" | alloc == "sync" | alloc == "tlab" )
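
A minimal, self-contained sketch of the loading and filtering steps above. It assumes only base R: the rows are copied from csv/allocations.csv and passed through the text argument of read.csv instead of a file, and the name csv_text is made up for this illustration. The filter is written with %in%, which should be equivalent to the chain of == and | comparisons.

```r
# Rows copied from csv/allocations.csv; inlined so the sketch runs without the file.
csv_text <- "pos,alloc,value,error
10,single-threaded,2.836,0.285
20,java,9.878,2.676
28,epsilon,75.289,23.667
30,sync,186.672,21.195
40,cas,74.721,0.192
50,tlab,8.506,1.849
55,javaHashCode,60.270,12.318
57,readHashCode,7.296,0.316"

# header = TRUE is the default for read.csv, so the first row becomes the column names.
df <- read.csv(text = csv_text)

# Keep only the allocation types of interest.
df <- subset(df, alloc %in% c("cas", "java", "sync", "tlab"))

print(df)
```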

Bar charts

First we define the mapping of variables (aes) from the data frame: we plot value against the type, in our case the allocation type, and the bar color is also chosen by allocation type.

 ggplot(data=df, aes(x=alloc, y=value, fill=alloc))

We draw it as bars (a bar chart) with + geom_bar():

 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) + geom_bar(stat="identity")

result:

Rotate the coordinate system (from vertical to horizontal bars) with + coord_flip():

 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) + geom_bar(stat="identity") + coord_flip()

Add the measurement error with + geom_errorbar() (remember, the CSV file has an error column):

 ggplot(data = df, aes(x = alloc, y = value, fill = alloc)) +
   geom_bar(stat = "identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width = 0.5, alpha = 0.5)

For clarity, add the values next to the bars with + geom_text() (naturally, the text is value):

 ggplot(data = df, aes(x = alloc, y = value, fill = alloc)) +
   geom_bar(stat = "identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width = 0.5, alpha = 0.5) +
   geom_text(aes(label = value))

Polish the labels in + geom_text():

- use a function to turn the label into value ± error: label = base::sprintf("%0.2f ± %0.2f", value, error) (hello, good old sprintf and %f format templates!)
- play with the horizontal (hjust) and vertical (vjust) placement of the label
- change the font size (size) and style (fontface) of the label

ggplot(data = df, aes(x = alloc, y = value, fill = alloc)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_errorbar(aes(ymin = value - error, ymax = value + error), width = 0.5, alpha = 0.5) +
  geom_text(aes(label = base::sprintf("%0.2f ± %0.2f", value, error)), hjust = -0.1, vjust = -0.4, size = 5, fontface = "bold")
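
The label formatting can be tried on its own, outside of a plot. A small sketch using two rows of values from csv/allocations.csv; the variable names are only for illustration.

```r
# Values taken from the "cas" row of csv/allocations.csv.
value <- 74.721
error <- 0.192

# %0.2f prints a number with two digits after the decimal point.
label <- sprintf("%0.2f ± %0.2f", value, error)
print(label)

# sprintf is vectorised, so it can be applied to whole columns inside aes().
labels <- sprintf("%0.2f ± %0.2f", c(74.721, 8.506), c(0.192, 1.849))
print(labels)
```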

- switch the theme: + theme_classic()
- remove the legend: + theme(legend.position = "none")
- add captions with + labs(title = ..., x = ..., y = ..) and set the axis fonts with + theme(axis.text.y = ..)

ggplot(data = df, aes(x = alloc, y = value, fill = alloc)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  geom_errorbar(aes(ymin = value - error, ymax = value + error), width = 0.5, alpha = 0.5) +
  geom_text(aes(label = base::sprintf("%0.2f ± %0.2f", value, error)), hjust = -0.1, vjust = -0.4, size = 5, fontface = "bold") +
  labs(title = "@Threads( 4 )", x = "", y = "ns/op") +
  theme_classic() +
  theme(axis.text.y = element_text(size = 16, face = "bold")) +
  theme(axis.title = element_text(size = 16, face = "bold")) +
  theme(legend.position = "none")

And for the final polish:

- specify the colors: + scale_fill_manual()
- remove the gap between the vertical axis and the bars with + scale_y_continuous(), and slightly widen the range of values so that the error bar and the label fit
- fix the order of the bars according to the pos column: x = reorder(alloc, -pos)

 ggplot(data = df, aes(x = reorder(alloc, -pos), y = value, fill = alloc)) +
   geom_bar(stat = "identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width = 0.5, alpha = 0.5) +
   geom_text(aes(label = base::sprintf("%0.2f ± %0.2f", value, error)), hjust = -0.1, vjust = -0.4, size = 5, fontface = "bold") +
   scale_fill_manual(values = c('java' = '#a9a518', 'sync' = '#fa8074', 'cas' = '#00b3f6', 'tlab' = '#e67bf3')) +
   labs(title = "@Threads( 4 )", x = "", y = "ns/op") +
   theme_classic() +
   scale_y_continuous(limits = c(0, max(df$value) + 40), expand = c(0, 0)) +
   theme(axis.text.y = element_text(size = 16, face = "bold")) +
   theme(axis.title = element_text(size = 16, face = "bold")) +
   theme(legend.position = "none")
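
To see what reorder(alloc, -pos) does to the bar order, it may help to look at the factor levels directly. A small sketch with four rows taken from csv/allocations.csv (only the two columns needed here); the names default_levels and ordered_levels are illustrative.

```r
# Four rows from csv/allocations.csv, reduced to the columns used for ordering.
df <- data.frame(
  pos   = c(20, 30, 40, 50),
  alloc = c("java", "sync", "cas", "tlab")
)

# Default order of a discrete axis: alphabetical factor levels.
default_levels <- levels(factor(df$alloc))
print(default_levels)

# reorder() sorts the levels by the given score in ascending order;
# negating pos therefore puts the row with the smallest pos last.
ordered_levels <- levels(reorder(df$alloc, -df$pos))
print(ordered_levels)
```

After coord_flip() the first level is drawn at the bottom, so with -pos the row with the smallest pos should end up at the top of the chart.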

Tip: you can save the chart to a file with ggsave("allocations.svg"), but do not overuse vector formats: if the chart has many points, save it as a raster image, for example PNG. Saving to SVG requires the svglite package:

 install.packages("svglite")

The default color palette deserves a separate mention: its colors are clearly distinguishable even for people with impaired color vision.

Do not combine red and green, even if you use only two colors to mark what is better and what is worse: people with impaired color vision can hardly tell these colors apart.

Tip: colorbrewer2 will help you pick colors, including colorblind-safe ones.

Points

Suppose we have some distribution of addresses across threads:

 step,thread,address
 1,thread-0,807437816
 1,thread-1,807437784
 1,thread-2,807437800
 ..........

Draw the individual points with + geom_point(). To tell one thread from another, we vary not only the color, aes(..., color = thread, ...), but also the marker shape, aes(..., shape = thread, ...):

 ggplot(data = df, aes(x = address, y = index, group = thread, color = thread, shape = thread)) +
   geom_point(size = 2)

The first thing that catches the eye (apart from the sheer amount of data) is that the addresses are shown in scientific notation. We are more used to seeing addresses in hexadecimal, so let's format the values along the X axis with + scale_x_continuous():

 ggplot(data = df, aes(x = address, y = index, group = thread, colour = thread, shape = thread)) +
   geom_point(size = 2) +
   scale_x_continuous(
     labels = function(n) { base::sprintf("0x%X", as.integer(n)) }
   )

Better, but still hard to read and not very clear.

Since any function can be supplied, why not display the offset relative to some base address, for example the minimum one:

 min_address = min(df$address)

 ggplot(data = df, aes(x = address, y = index, group = thread, colour = thread, shape = thread)) +
   geom_point(size = 2) +
   scale_x_continuous(
     labels = function(n) { base::sprintf("+ %d MiB", as.integer((n - min_address) / 1024 / 1024)) }
   )

Again out of habit, we place the tick marks (breaks =) at round values, namely 16, 32, 48 and 64 MiB:

 ggplot(data = df, aes(x = address, y = index, group = thread, colour = thread, shape = thread)) +
   geom_point(size = 2) +
   scale_x_continuous(
     labels = function(n) {
       ifelse(n == min_address,
              base::sprintf("base"),
              base::sprintf("+ %d MiB", as.integer((n - min_address) / 1024 / 1024)))
     },
     breaks = c(min_address,
                min_address + 16 * 1024 * 1024,
                min_address + 32 * 1024 * 1024,
                min_address + 48 * 1024 * 1024,
                min_address + 64 * 1024 * 1024)
   )
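
The label functions passed to scale_x_continuous() are ordinary R functions, so they can be checked without drawing anything. A sketch under that assumption; the helper names hex_label and offset_label are made up here, and the two addresses come from the sample rows shown earlier.

```r
# Two addresses from the sample data; the smaller one plays the role of the base.
addresses <- c(807437784, 807437816)
min_address <- min(addresses)

# Hexadecimal label, as in the first scale_x_continuous() variant.
hex_label <- function(n) sprintf("0x%X", as.integer(n))

# Offset in MiB relative to the base address; the base itself is labelled "base".
offset_label <- function(n) {
  ifelse(n == min_address,
         "base",
         sprintf("+ %d MiB", as.integer((n - min_address) / 1024 / 1024)))
}

print(hex_label(addresses))
print(offset_label(min_address + c(0, 16, 32) * 1024 * 1024))
```

Note that as.integer() returns NA for values above 2147483647, so the hexadecimal variant only works for addresses that fit into a 32-bit signed integer.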

There is a lot of data, so let's look at a small part of it:

- limit the data set with nrows when loading it via read.csv(...)
- add labels in the right places on the Y axis: + scale_y_continuous(breaks = c(...))

df = read.csv(file = "csv/hashCodesNoTLAB.csv", nrows = 36, header = TRUE)

min_address = min(df$address)

ggplot(data = df, aes(x = address, y = index, group = thread, color = thread, shape = thread)) +
  geom_point(size = 4) +
  scale_x_continuous(
    labels = function(n) { ifelse(n == min_address, base::sprintf("base"), base::sprintf("+ %d", as.integer(n - min_address))) },
    breaks = c(min_address, min_address + 16 * 10, min_address + 2 * 16 * 10, min_address + 3 * 16 * 10, min_address + 4 * 16 * 10, min_address + 5 * 16 * 10)
  ) +
  scale_y_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

Histogram

And of course the histogram, a frequency distribution. Very roughly, it is the number of elements falling within each range of values. For example, for the series [1, 2, 3, 1, 1] the histogram looks like [3, 1, 1], because element 1 occurs 3 times and elements 2 and 3 once each.
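
The counting in this example can be reproduced in base R with table(). This is only a sketch of the idea: table() counts exact values, whereas geom_histogram() groups continuous values into bins.

```r
# The series from the text.
x <- c(1, 2, 3, 1, 1)

# table() counts how many times each distinct value occurs.
counts <- table(x)
print(counts)

# The same counts as a plain integer vector.
print(as.vector(counts))
```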

 addressHashCode = read.csv(file = "csv/addressHashCode.csv")
 defaultHashCode = read.csv(file = "csv/defaultHashCode.csv")

 ggplot() +
   geom_histogram(data = addressHashCode, aes(x = hashCode, fill = "address"), alpha = 0.7, bins = 500) +
   geom_histogram(data = defaultHashCode, aes(x = hashCode, fill = "MXSRng"), alpha = 0.7, bins = 500)

Add the already familiar options to get the desired look:

 ggplot() +
   geom_histogram(data = addressHashCode, aes(x = hashCode, fill = "address"), alpha = 0.7, bins = 500) +
   geom_histogram(data = defaultHashCode, aes(x = hashCode, fill = "MXSRng"), alpha = 0.7, bins = 500) +
   scale_fill_manual(name = " hashCode:", labels = c("address" = "", "MXSRng" = "MXS-"), values = c("address" = "#003dae", "MXSRng" = "#ae003d")) +
   labs(title = sprintf("   : %sk,  MXS-: %s k", round(sum(duplicated(addressHashCode)) / 1000, 1), round(sum(duplicated(defaultHashCode)) / 1000, 1)), x = "hashCode") +
   theme_classic() +
   theme(axis.title.y = element_blank()) +
   scale_y_continuous(labels = function(n) { format(n, big.mark = "_", scientific = FALSE) }, expand = c(0, 0)) +
   scale_x_continuous(labels = function(n) { format(n, big.mark = "_", scientific = FALSE) }, expand = c(0, 0)) +
   theme(axis.title = element_text(size = 16, face = "bold")) +
   theme(axis.text.y = element_text(size = 14, face = "bold")) +
   theme(axis.text.x = element_text(size = 14, face = "bold")) +
   theme(axis.title = element_text(size = 16, face = "bold")) +
   theme(axis.title.x = element_text(margin = margin(t = 20))) +
   theme(legend.text = element_text(size = 14, face = "bold")) +
   theme(title = element_text(size = 16, face = "bold"))

And a pony? Easy!

- the png package is needed: install.packages('png')
- load the image: img <- readPNG("images/pony.png")
- render it into an internal buffer: g <- rasterGrob(img)
- add the pony as an annotation with + annotation_custom():

 # install.packages('png')
 library(png)   # provides readPNG()
 library(grid)  # provides rasterGrob()

 img <- readPNG("images/pony.png")
 g <- rasterGrob(img, interpolate = TRUE, x = 0.1, y = 0.9, width = 0.2, height = 0.2)

 ggplot() +
   annotation_custom(g, xmin = -Inf, ymin = -Inf, xmax = Inf, ymax = Inf) +
   geom_histogram(data = addressHashCode, aes(x = hashCode, fill = "address"), alpha = 0.7, bins = 500) +
   geom_histogram(data = defaultHashCode, aes(x = hashCode, fill = "MXSRng"), alpha = 0.7, bins = 500)

Some useful links: the ggplot2 tutorial and another useful blog, Data wrangling, exploration, and analysis with R.

RMarkdown

You don't have to be Captain Obvious to figure out that RMarkdown is R + Markdown.

Install the package for R:

 install.packages("rmarkdown")

Then render the RMarkdown file:

rmarkdown::render("path_to_file.Rmd", encoding = "UTF-8")

Output is possible in the following formats:
- HTML
- PDF
- and even Word (doc)

It is enough to specify the format in the header, for example:

output: pdf_document

Slides / presentations:
- HTML5 with JavaScript / CSS, although not everyone approves of HTML presentations
- PDF, where LaTeX macros can be used as well

To do this, it is enough to change output in the header:

output: ioslides_presentation
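
To show where the output line lives, here is a sketch that writes a tiny .Rmd file from R and renders it. The file name, the title and the chunk contents are hypothetical. html_document is used here only because pdf_document additionally needs a LaTeX installation; swapping the value should be all it takes to change the format.

```r
# Three backticks, built programmatically so the chunk fence is easy to embed here.
fence <- strrep("`", 3)

# Hypothetical document: a YAML header, a heading and one R chunk.
rmd_lines <- c(
  "---",
  "title: \"Allocations\"",
  "output: html_document",   # swap for pdf_document or ioslides_presentation
  "---",
  "",
  "## Summary",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",           # cars is a data set that ships with R
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only when rmarkdown and pandoc are actually available.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, encoding = "UTF-8", quiet = TRUE)
  print(out_file)
}
```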

Source: https://habr.com/ru/post/327912/
