
RMarkdown, R and ggplot



This article is neither documentation nor anything fundamentally new; treat it as an overview or a cheat sheet.


Preamble


A conference is first and foremost about talks, and how the slides are designed matters a great deal.


Of course, there are speakers who can give a talk without a single slide, but slides generally complement the narrative well. For some it is enough to throw a few memes into the talk and be done; others need to show code, sometimes even in assembly language (for those who don't know: JPoint is a Java conference); and some need to show charts. Often it is a combination of all of these.


Perhaps the best-known tools for creating slides are:




The first two are essentially binary formats, and Google Slides, being a cloud service, requires an Internet connection, which is an unpleasant restriction on trips and flights. The last two, LaTeX and RMarkdown, work offline and are plain text, which means you can keep the history of all changes in git, hg, or whatever you prefer. In addition, their scope is far from limited to slides.


LaTeX is a format with a long history, and a lot has been written and said about it. RMarkdown is young, even a little hipster, though not overly so.


Markdown


Markdown is a lightweight markup language designed to be as readable and easy to edit as possible. It is easy to read and understand even in raw form, without any rendering.


See for yourself: _this is italic_, **this is strong emphasis**, and much more is described in the Markdown cheatsheet.


Markdown is supported by GitHub, Habrahabr, Sublime, Jira (which has a similar syntax), and many others.


R


R is a programming language for statistical data processing and working with graphics.


People are usually put off by it: it's very hard, it's math, and I don't need it. But nobody forces you to use all the available functionality, and perhaps the simplest and most intuitive part is graphics and visualization.


Graphs are often built in Excel, but Excel struggles once the amount of data approaches a million rows, whereas for R this is not a difficult task.

Data and graphics


Let's leave aside the battle over which is better, tables or charts. It is a matter of taste.

To build graphs, we use the ggplot2 package.


Added: we need R itself and RStudio, for example on macOS with Homebrew:


$ brew tap homebrew/science
$ brew install r
$ brew cask install rstudio

Install the package for R in RStudio:


 install.packages("ggplot2") 

But to build graphs you need data, and it is reasonable to keep it separate from the presentation, for example in CSV, which is, again, a simple text format.


My data are the results obtained with jmh for my talk Inside the VM through the keyhole hashCode. I like the style used by Alexei Shipilev: write the benchmark results as a comment at the beginning of the file. A bit of grep and sed, and we have a CSV file.

csv/allocations.csv data:


 pos,alloc,value,error
 10,single-threaded,2.836,0.285
 20,java,9.878,2.676
 28,epsilon,75.289,23.667
 30,sync,186.672,21.195
 40,cas,74.721,0.192
 50,tlab,8.506,1.849
 55,javaHashCode,60.270,12.318
 57,readHashCode,7.296,0.316

We build a data frame from the CSV file. By default the file is assumed to have a header row, which helps later when referring to individual columns by name:


```{r}
df = read.csv(file = "csv/allocations.csv")
```

If we want to filter, for example, by specific values in the alloc column:


 df <- subset(df, alloc == "cas" | alloc == "java" | alloc == "sync" | alloc == "tlab" ) 
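The reading and filtering steps can be tried without the file. The following is a minimal base-R sketch, not the author's code: it parses an inline copy of a few rows through the text argument of read.csv, and the names csv_text, wanted and filtered are illustrative only.

```r
# Inline copy of a few rows of csv/allocations.csv
# (assumption: same layout as the real file)
csv_text <- "pos,alloc,value,error
10,single-threaded,2.836,0.285
20,java,9.878,2.676
30,sync,186.672,21.195
40,cas,74.721,0.192
50,tlab,8.506,1.849"

# header = TRUE is the default for read.csv,
# so the first line becomes the column names
df <- read.csv(text = csv_text)

# %in% is a compact equivalent of chaining alloc == "..." with |
wanted <- c("cas", "java", "sync", "tlab")
filtered <- subset(df, alloc %in% wanted)

print(names(df))
print(filtered)
```

The %in% form scales better than a chain of == comparisons when the list of wanted values grows.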

Bar chart


First, set the aesthetic mapping (aes) of variables from the data frame: the x axis shows the type (in our case, the allocation type), the y axis shows the value, and the bar fill color is also chosen by allocation type.


 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) 

We display the data as a bar chart with + geom_bar():


 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) + geom_bar(stat="identity") 



Rotate the coordinate system (from vertical to horizontal bars) with + coord_flip():


 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) + geom_bar(stat="identity") + coord_flip() 


Let's add the measurement error with + geom_errorbar() (remember, we have an error column in the CSV file):


 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) +
   geom_bar(stat="identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width=0.5, alpha=0.5)


For clarity, add the values next to the bars with + geom_text() (logically, the text is value):


 ggplot(data=df, aes(x=alloc, y=value, fill=alloc)) +
   geom_bar(stat="identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width=0.5, alpha=0.5) +
   geom_text(aes(label=value))


Polish the label text in + geom_text():
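The label format used in the final version of the plot, "%0.2f ± %0.2f", can be tried on its own. This is a small base-R sketch with sample values copied from the CSV rows shown earlier; the variable names are illustrative.

```r
# Sample values taken from the java and cas rows of csv/allocations.csv
value <- c(9.878, 74.721)
error <- c(2.676, 0.192)

# sprintf is vectorised: it returns one formatted label per (value, error) pair,
# each number rounded to two decimal places
labels <- sprintf("%0.2f ± %0.2f", value, error)
print(labels)
```

Because sprintf is vectorised, the same expression works unchanged inside aes(label = ...), where value and error are whole columns.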






And to add the final polish:



 ggplot(data=df, aes(x=reorder(alloc, -pos), y=value, fill=alloc)) +
   geom_bar(stat="identity") +
   coord_flip() +
   geom_errorbar(aes(ymin = value - error, ymax = value + error), width=0.5, alpha=0.5) +
   # label each bar as "value ± error" with two decimals
   geom_text(aes(label=base::sprintf("%0.2f ± %0.2f", value, error)), hjust=-0.1, vjust=-0.4, size=5, fontface = "bold") +
   # fixed color per allocation type
   scale_fill_manual(values=c('java'='#a9a518', 'sync'='#fa8074', 'cas'='#00b3f6', 'tlab'='#e67bf3')) +
   labs(title = "@Threads( 4 )", x = "", y = "ns/op") +
   theme_classic() +
   # leave room on the right for the text labels
   scale_y_continuous(limits=c(0, max(df$value) + 40), expand = c(0, 0)) +
   theme(axis.text.y = element_text(size = 16, face = "bold")) +
   theme(axis.title = element_text(size = 16, face = "bold")) +
   theme(legend.position="none")


Tip: you can save the graph to a file by calling ggsave("allocations.svg") after the plot, but do not overuse the vector format: if there are many points on the graph, save to a raster format such as png. Saving to svg requires the svglite package:


 install.packages("svglite") 

The default color palette deserves a separate mention: the colors it selects are clearly distinguishable even for people with color vision deficiency.

Do not use the red / green combination, even if you only need two colors to mark what is better and what is worse: these colors are hard to tell apart for people with color vision deficiency.

Tip: colorbrewer2 will help you choose colors, including colorblind-safe ones.


Points


Suppose we have a distribution of addresses across threads:


 step,thread,address
 1,thread-0,807437816
 1,thread-1,807437784
 1,thread-2,807437800
 ..........



The first thing that catches the eye (apart from the sheer amount of data) is that the addresses are shown in scientific notation. We are more used to seeing addresses in hexadecimal, so let's format the values along the X axis with + scale_x_continuous():


 ggplot(data=df, aes(x = address, y = index, group=thread, colour=thread, shape=thread)) +
   geom_point(size=2) +
   scale_x_continuous(
     labels = function(n){base::sprintf("0x%X", as.integer(n))}
   )
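The label function passed to scale_x_continuous can be checked outside the plot. This is a minimal base-R sketch; the name hex_label is hypothetical, and the addresses are the three sample rows shown above.

```r
# Label function of the kind passed to scale_x_continuous(labels = ...)
hex_label <- function(n) sprintf("0x%X", as.integer(n))

# Addresses from the sample rows above
addresses <- c(807437816, 807437784, 807437800)
print(hex_label(addresses))
```

Note that as.integer only covers values up to 2^31 - 1, so this approach fits the addresses shown here but would yield NA for larger ones.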


Better, but still somewhat hard to read and not very clear.


Since we can supply any function, why not display the offset relative to some base address, for example the minimum one:


 min_address = min(df$address)
 ggplot(data=df, aes(x = address, y = index, group=thread, colour=thread, shape=thread)) +
   geom_point(size=2) +
   scale_x_continuous(
     labels = function(n){base::sprintf("+ %d MiB", as.integer((n - min_address)/1024/1024))}
   )


Again, out of habit, we put the ticks (breaks =) at round values, namely 16, 32, 48 and 64 MiB:


 ggplot(data=df, aes(x = address, y = index, group=thread, colour=thread, shape=thread)) +
   geom_point(size=2) +
   scale_x_continuous(
     labels = function(n){ifelse(n == min_address, base::sprintf("base"), base::sprintf("+ %d MiB", as.integer((n - min_address)/1024/1024)))},
     breaks=c(min_address, min_address + 16*1024*1024, min_address + 32*1024*1024, min_address + 48*1024*1024, min_address + 64*1024*1024)
   )
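To see what the labels and breaks arguments produce together, the same logic can be run on plain numbers. This is a sketch under assumptions: min_address is set to the smallest of the three sample addresses rather than computed from the full data, and MiB, offset_label and breaks are illustrative names.

```r
MiB <- 1024 * 1024
min_address <- 807437784  # assumption: the smallest address among the sample rows

# Same logic as the labels function in the plot:
# the base address is labelled "base", everything else as an offset in MiB
offset_label <- function(n) {
  ifelse(n == min_address,
         "base",
         sprintf("+ %d MiB", as.integer((n - min_address) / MiB)))
}

# Ticks at the base address and at 16, 32, 48 and 64 MiB above it
breaks <- min_address + c(0, 16, 32, 48, 64) * MiB
print(offset_label(breaks))
```

Writing the breaks as a vector multiplied by MiB is just a shorter spelling of the explicit list used in the plot.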


There is a lot of data, and we want to look at a small part of it.




Histogram


And of course, the histogram: a frequency distribution. Very roughly, it is the number of elements falling within each range of values. For example, for the sequence [1, 2, 3, 1, 1] the histogram looks like [3, 1, 1], because element 1 occurs 3 times, and elements 2 and 3 once each.
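The toy example can be reproduced in base R. This is a minimal sketch; the bin edges passed to hist are an assumption chosen so that each value gets its own bin.

```r
x <- c(1, 2, 3, 1, 1)

# table() counts how many times each distinct value occurs
counts <- table(x)
print(counts)

# hist() counts elements per bin; plot = FALSE returns the counts without drawing
h <- hist(x, breaks = c(0.5, 1.5, 2.5, 3.5), plot = FALSE)
print(h$counts)
```

geom_histogram does the same kind of binning, with the number of bins controlled by its bins argument.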


 addressHashCode = read.csv(file = "csv/addressHashCode.csv")
 defaultHashCode = read.csv(file = "csv/defaultHashCode.csv")
 ggplot() +
   geom_histogram(data=addressHashCode, aes(x=hashCode, fill="address"), alpha=0.7, bins = 500) +
   geom_histogram(data=defaultHashCode, aes(x=hashCode, fill="MXSRng"), alpha=0.7, bins = 500)


Add the already familiar options to get the desired look:


 ggplot() +
   geom_histogram(data=addressHashCode, aes(x=hashCode, fill="address"), alpha=0.7, bins = 500) +
   geom_histogram(data=defaultHashCode, aes(x=hashCode, fill="MXSRng"), alpha=0.7, bins = 500) +
   scale_fill_manual(name=" hashCode:", labels=c("address"="", "MXSRng"="MXS-"), values=c("address" ="#003dae", "MXSRng" = "#ae003d")) +
   # the title shows the number of duplicated rows, in thousands
   labs(title = sprintf("   : %sk,  MXS-: %s k", round( sum(duplicated(addressHashCode)) / 1000, 1), round( sum(duplicated(defaultHashCode)) / 1000, 1)), x = "hashCode") +
   theme_classic() +
   theme(axis.title.y=element_blank()) +
   # plain numbers with "_" as the thousands separator instead of scientific notation
   scale_y_continuous(labels = function(n){format(n, big.mark = "_", scientific = FALSE)}, expand = c(0, 0)) +
   scale_x_continuous(labels = function(n){format(n, big.mark = "_", scientific = FALSE)}, expand = c(0, 0)) +
   theme(axis.title = element_text(size = 16, face = "bold")) +
   theme(axis.text.y = element_text(size = 14, face = "bold")) +
   theme(axis.text.x = element_text(size = 14, face = "bold")) +
   theme(axis.title.x=element_text(margin=margin(t=20))) +
   theme(legend.text = element_text(size = 14, face = "bold")) +
   theme(title = element_text(size = 16, face = "bold"))
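Two helpers in this plot are easy to misread, so here is a small base-R sketch of what they compute. The vector codes and the name fmt are made up for illustration.

```r
# duplicated() marks every repeat of an earlier element
# (or of an earlier row, when given a data frame)
codes <- c(11, 42, 42, 7, 42, 11)
n_dup <- sum(duplicated(codes))
print(n_dup)

# Axis-label formatter as in the plot: group digits with "_"
# and avoid scientific notation
fmt <- function(n) format(n, big.mark = "_", scientific = FALSE)
print(fmt(1234567))
```

So sum(duplicated(...)) counts the repeats only, not the first occurrence of each repeated value.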


And a pony? Easy!



 # install.packages('png')
 library(png)   # provides readPNG
 library(grid)  # provides rasterGrob
 img <- readPNG("images/pony.png")
 g <- rasterGrob(img, interpolate=TRUE, x = 0.1, y = 0.9, width = 0.2, height = 0.2)
 ggplot() +
   annotation_custom(g, xmin=-Inf, ymin = -Inf, xmax=Inf, ymax=Inf) +
   geom_histogram(data=addressHashCode, aes(x=hashCode, fill="address"), alpha=0.7, bins = 500) +
   geom_histogram(data=defaultHashCode, aes(x=hashCode, fill="MXSRng"), alpha=0.7, bins = 500)


Some useful links: the ggplot2 tutorial and another useful blog, Data wrangling, exploration, and analysis with R.


RMarkdown


You don't have to be Captain Obvious to figure out that RMarkdown is R + Markdown.


Install the package for R:


 install.packages("rmarkdown") 

RMarkdown can then be rendered to HTML, LaTeX (PDF) and other formats:


rmarkdown::render("path_to_file.Rmd", encoding = "UTF-8")


The output format is set in the document header, for example:


output: pdf_document


To get slides instead, it is enough to change output in the header:


output: ioslides_presentation
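For orientation, here is a sketch of what a complete minimal .Rmd file with such a header might look like, written out from R so that it can be inspected. The title, the file name slides.Rmd and the chunk contents are hypothetical; the render call is left commented out because it needs the rmarkdown package and pandoc.

```r
fence <- strrep("`", 3)  # three backticks, built here to avoid nesting code fences

rmd <- c(
  "---",
  "title: \"Allocations\"",            # hypothetical title
  "output: ioslides_presentation",     # switch to pdf_document for a PDF
  "---",
  "",
  "## Results",                        # a second-level heading starts a new slide
  "",
  paste0(fence, "{r, echo=FALSE}"),    # R chunk; echo=FALSE hides the code on the slide
  "plot(c(2.836, 9.878, 8.506), type = \"h\")",
  fence
)

path <- file.path(tempdir(), "slides.Rmd")  # hypothetical file name
writeLines(rmd, path)
cat(readLines(path), sep = "\n")

# Rendering requires rmarkdown and pandoc:
# rmarkdown::render(path, encoding = "UTF-8")
```

The same source file yields a PDF or a slide deck depending only on the output line, which is what makes the plain-text workflow convenient.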

Source: https://habr.com/ru/post/327912/

