
Combining R and Python: why, when and how?


Many people who do data analysis have probably wondered whether R and Python can be used together. And if they can, why would you want to? When will it be useful and effective for a project? And how do you choose the best way to combine the languages when Google returns countless options?

Let's try to understand these issues.

Why



Let's look at the first point in more detail. The summary below is certainly subjective and can be extended. It is based on a systematization of key articles about the advantages of each language and on personal experience. But the world, as we know, changes very quickly.

Python was created by clever programmers as a general-purpose language and was later, as data science developed, adapted for specific data analysis tasks. Hence the main advantages of the language. In data analysis it is the best choice for:


The main asset of R is its extensive collection of libraries. The language, especially early on, was developed mainly by statisticians rather than programmers. The statisticians tried very hard, and their achievements are difficult to dispute.

If you are thinking about trying a tasty new statistical model that you recently heard about at a conference, then before you start writing it from scratch, google "R package <INSERT NAME: new great stats model>". Your chances of success are very high! Thus, the undoubted advantage of R is advanced statistical analysis, in particular for a number of specific areas of science and practice (econometrics, bioinformatics, etc.). In my opinion, time series analysis is also still much better developed in R.

Another key and so far undisputed advantage of R over Python is interactive graphics. The possibilities for creating and customizing dashboards and simple applications without any knowledge of JS are truly enormous. If you don't believe it, take a little time to explore a couple of libraries from this list: htmlwidgets, flexdashboard, shiny, slidify. For example, the materials for this article were originally put together as an interactive presentation in slidify.

But no matter how hard the statisticians tried, they are not strong in everything. They did not manage to achieve memory management as efficient as in Python. More precisely, good and fast code for large volumes of data is quite possible in R, but it takes much more effort and self-discipline than in Python.

Gradually the differences are being erased, and the two languages are becoming more and more interchangeable. In Python, visualization capabilities are developing (seaborn was a big step forward) and econometric libraries are being added, although they do not always work (PyFlux, pymaclab, etc.); in R, memory management efficiency and data processing capabilities are improving (data.table). Here, for example, you can see examples of basic operations with data in R and Python. So whether combining the languages is an advantage for your project is up to you to decide.

As for the second point, improving the speed and convenience of project implementation, this is mainly about how the project is organized. For example, you have two people on a project, one of whom is stronger in R and the other in Python. Provided that you can arrange code review and other controls for both languages, you can try to distribute the tasks so that each participant uses their best skills. Of course, your own experience in solving specific problems in each language also matters.

It should be clarified, though, that we are talking about research projects that work with data. For production solutions, other criteria are important. Combining languages will most likely not benefit the stability and scalability of the computations. This brings us smoothly to the question of when it is more convenient to combine languages.

When


Given the features of both languages, you can gain from combining R and Python in the following cases:


Here are examples of possible language combinations:


All examples given are real projects.

Again, even though 2 years have passed since my first talk about the possibilities of combining R and Python, I am still not ready to recommend combining the languages in production, except perhaps when they are two almost separate entities or models that are not critically tied to each other.

If any lucky readers have put R + Python into production, please share your experience in the comments!

How


Now to the point. The approaches to combining R and Python fall into three main categories:


Let us consider in more detail each of the approaches.

Command line tools


The idea is to divide the project into separate, relatively independent parts, each performed in R or Python, and to transfer data between them via disk in any format convenient for both languages.
The syntax is extremely simple. Scripts are executed from the command line according to the following scheme:

 <cmd_to_run> <path_to_script> <any_additional_args> 

<cmd_to_run> - the command that executes an R or Python script from the command line
<path_to_script> - the path to the script file
<any_additional_args> - the list of arguments passed to the script

The table below shows in more detail how scripts are run from the command line and how the passed arguments are read. The comments indicate the type of the object that holds the arguments.

| Command | Python | R |
| --- | --- | --- |
| Cmd | python path/to/myscript.py arg1 arg2 arg3 | Rscript path/to/myscript.R arg1 arg2 arg3 |
| Fetch arguments | import sys; my_args = sys.argv  # list, 1st el. - file executed | myArgs <- commandArgs(trailingOnly = TRUE)  # character vector of args |
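To make the table more concrete, here is a minimal sketch of a Python script that receives its arguments from the command line. The file name `max_args.py` and the helper `max_of_args` are illustrative and do not come from the original article. The point it shows is that arguments always arrive as strings and have to be converted explicitly.

```python
# max_args.py (hypothetical file name, used only for illustration)
import sys


def max_of_args(argv):
    """Return the largest number among the command-line arguments.

    argv follows the sys.argv convention: argv[0] is the executed file,
    all remaining elements are strings.
    """
    values = [float(a) for a in argv[1:]]
    if not values:
        raise ValueError("expected at least one numeric argument")
    return max(values)


# Example call from a shell: python max_args.py 11 3 9 42
if __name__ == "__main__" and len(sys.argv) > 1:
    print(max_of_args(sys.argv))
```

The same conversion step is needed on the R side, where `commandArgs(trailingOnly = TRUE)` returns a character vector.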


Detailed examples for those who want them are given below.

R script from Python


To begin with, let's write a simple R script that finds the maximum of a list of numbers and call it max.R.
    # max.R
    # Read the command-line arguments (a character vector)
    myArgs <- commandArgs(trailingOnly = TRUE)
    # Convert them to numbers
    nums <- as.numeric(myArgs)
    # Write the maximum to stdout so that the calling process can capture it
    cat(max(nums))


Now let's run it from Python through the command line, passing a list of numbers to find the maximum of.

    # calling R from Python
    import subprocess

    # Define command and arguments
    command = 'Rscript'
    path2script = 'path/to your script/max.R'

    # Variable number of args in a list
    args = ['11', '3', '9', '42']

    # Build subprocess command
    cmd = [command, path2script] + args

    # check_output will run the command and store to result
    x = subprocess.check_output(cmd, universal_newlines=True)

    print('The maximum of the numbers is:', x)
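In practice it helps to check the return code and to convert the captured text back to a number. The following is a sketch under assumptions: the helper name `run_script` is hypothetical, and a one-line Python program stands in for `Rscript path/to/max.R` so that the example runs without R installed.

```python
import subprocess
import sys


def run_script(command, script_args, timeout=60):
    """Run an external script and return its stdout as stripped text.

    command     -- list such as ['Rscript', 'path/to/max.R']
    script_args -- values that are converted to strings before the call
    """
    cmd = list(command) + [str(a) for a in script_args]
    completed = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        timeout=timeout,
        check=False,          # the return code is inspected below
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout.strip()


# Stand-in for Rscript (an assumption made only to keep the sketch runnable):
# a tiny Python program that prints the maximum of its arguments.
stand_in = [sys.executable, "-c",
            "import sys; print(max(float(a) for a in sys.argv[1:]))"]
result = float(run_script(stand_in, [11, 3, 9, 42]))
print("The maximum of the numbers is:", result)
```

Replacing `stand_in` with `['Rscript', 'path/to your script/max.R']` gives the call from the example above, with an explicit error if the R script fails.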


Python script from R

First, create a simple Python script that splits a text string into parts and call it `splitstr.py`.

    # splitstr.py
    import sys

    # Get the arguments passed in
    string = sys.argv[1]
    pattern = sys.argv[2]

    # Perform the splitting
    ans = string.split(pattern)

    # Join the resulting list of elements into a single newline
    # delimited string and print
    print('\n'.join(ans))

And now let's run it from R through the command line, passing a text string and the pattern to split it on.

    # calling Python from R
    command = "python"

    # Note the single + double quotes in the string (needed if paths have spaces)
    path2script = '"path/to your script/splitstr.py"'

    # Build up args in a vector
    string = "3523462---12413415---4577678---7967956---5456439"
    pattern = "---"
    args = c(string, pattern)

    # Add path to script as first arg
    allArgs = c(path2script, args)

    output = system2(command, args=allArgs, stdout=TRUE)

    print(paste("The Substrings are:\n", output))


For intermediate storage of files when passing information from one script to another, you can use a variety of formats, depending on your goals and preferences. Both languages have libraries (more than one) for working with each of the formats.

File Formats for R and Python

| Medium Storage | Python | R |
| --- | --- | --- |
| Flat files | | |
| csv | csv, pandas | readr, data.table |
| json | json | jsonlite |
| yaml | PyYAML | yaml |
| Databases | | |
| SQL | sqlalchemy, pandasql, pyodbc | RSQLite, RODBC, RMySQL, sqldf, dplyr |
| NoSQL | pymongo | RMongo |
| Feather | | |
| for data frames | feather | feather |
| Numpy | | |
| for numpy objects | numpy | RcppCNPy |


The classic format is, of course, flat files. CSV is often the simplest, most convenient and most reliable option. If you want to structure the information or store it for a relatively long time, storing it in a database (SQL / NoSQL) will probably be the best choice.
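As an illustration of the flat-file handoff, here is a minimal sketch that writes a small table to CSV and reads it back using only the Python standard library. The file name, the helper names and the values are made up for the example; on the R side the same file could be read with one of the libraries from the table above.

```python
import csv
import os
import tempfile


def write_table(path, rows, fieldnames):
    """Write a list of dict records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_table(path):
    """Read a CSV file back into a list of dict records."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Made-up records, only to have something to write
rows = [{"id": 1, "value": 0.5}, {"id": 2, "value": 1.25}]
path = os.path.join(tempfile.mkdtemp(), "handoff.csv")

write_table(path, rows, ["id", "value"])
restored = read_table(path)
print(restored)
```

Note that everything comes back as text: CSV does not store types, so both the Python and the R side have to agree on how columns are converted after reading.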

For fast transfer of numpy objects to R and back, there is the fast and stable RcppCNPy library. An example of its use can be found here.

There is also the feather format, designed specifically for transferring data frames between R and Python. The format's original selling points are that it is tailored to R and Python, easy to process in both languages, and very fast to write and read. The idea is excellent, but the implementation, as sometimes happens, has its nuances. The format's developers themselves have repeatedly pointed out that it is not yet suitable for long-term solutions: when the libraries for working with the format are updated, the whole process may break and require significant code changes.

Still, writing and reading feather in R and Python really is fast. A test comparison of read and write speed for a file of 10 million rows is shown in the figure below. In both cases, feather is significantly faster than the key libraries for working with the classic CSV format.

Comparison of speed with feather and CSV for R and Python


File: a data frame of 10 million rows. 10 attempts for each library.

R: CSV read / write through data.table and dplyr. Feather was the fastest, but its lead over data.table was small. At the same time, there are certain difficulties with setting up and using feather in R, and its support is in doubt: feather was last updated a year ago.

Python: CSV read / write with pandas. The speed gain from feather turned out to be significant, and there were no problems working with the format in Python 3.5.

It is important to note that speed can only be compared separately within R and separately within Python. A comparison between the languages would be incorrect, because the whole test was run from R, for convenience in producing the final figure.
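A comparison of this kind can be reproduced with a small timing harness. The sketch below is not the author's benchmark: it uses the standard `csv` module and a deliberately small, made-up table so that it runs anywhere, and the helper name `time_call` is hypothetical. To time feather, the same harness could be given callables that wrap `pandas.DataFrame.to_feather` and `pandas.read_feather`.

```python
import csv
import os
import tempfile
import time


def time_call(func, repeats=10):
    """Call func `repeats` times and return the list of wall-clock timings."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


n_rows = 10_000  # small made-up size; the article used 10 million rows
path = os.path.join(tempfile.mkdtemp(), "bench.csv")


def write_csv():
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "value"])
        writer.writerows((i, i * 0.5) for i in range(n_rows))


def read_csv():
    with open(path, newline="") as fh:
        return sum(1 for _ in csv.reader(fh))  # number of lines read


write_times = time_call(write_csv)
read_times = time_call(read_csv)
print(f"write: fastest of {len(write_times)} runs = {min(write_times):.4f} s")
print(f"read:  fastest of {len(read_times)} runs = {min(read_times):.4f} s")
```

Keeping all ten timings rather than a single number makes it possible to look at the spread, which matters when disk caching affects the first run.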

Let's sum up


Benefits


Disadvantages


Interfacing R and Python


This approach consists in directly launching one language from another and provides for in-memory transmission of information.

Since people began to think about combining R and Python instead of opposing them, quite a lot of libraries have been created. Of these, only two are successful and robust across various conditions, including use on Windows. But their very existence opens up new horizons and greatly simplifies the process of combining the languages.

For calling each language from the other, three libraries are presented below, in order of decreasing quality.

R from Python


The most popular and stable library is rpy2. It works quickly and is well described both in the official pandas documentation and on a separate site. Its main feature is integration with pandas: the key object for transferring information between the languages is the data frame. Direct support for ggplot2, the most popular R visualization package, is also declared. That is, you write code in Python and see the graph directly in the Python IDE. On Windows, however, the ggplot2 support is really troublesome.

The disadvantage of rpy2 is perhaps just one: you need to spend some time studying the tutorials. This is necessary to work correctly with the two languages, since there are non-obvious nuances in the syntax and in how object types are matched during transfer. For example, when you pass a number from Python, what arrives in R is not a number but a vector of one element.

The key advantage of the pyper library, which takes second place in the table below, is speed. An implementation through pipes does indeed speed up the work on average (there is even an article on this topic in JSS), and the support for working with pandas objects between R and Python inspires hope. But the existing disadvantages firmly keep the library in second place. The key drawback is poor support for installing R libraries from Python. If you want to use any non-standard library in R (and R is often needed for exactly that), then to get it installed and working you have to download all of its dependencies one by one (!!!). And for some libraries there may be a great many of them. The second major drawback is inconvenient work with graphics: a plot can be viewed only by writing it to a file on disk. The third drawback is poor documentation. If something goes wrong just outside the standard set of cases, you often will not find a solution in the documentation or on stackoverflow.

The pyrserve library is easy to use, but also significantly limited in functionality. For example, it does not support the transfer of tables. The maintenance and support of the library by its developers also leave much to be desired: version 0.9.1 has been the latest available for more than 2 years.

| Libraries | Comments |
| --- | --- |
| rpy2 | C-level interface; direct pandas support; graphics support (+ ggplot2); weak Windows support |
| pyper | Python code; use of pipes (on average faster); indirect support for pandas; limited graphics support; bad documentation |
| pyrserve | Python code; use of pipes (on average faster); indirect support for pandas; limited graphics support; bad documentation; low level of project support |


Python from R


The best library currently available is reticulate, an official RStudio development. I can see no drawbacks in it, but there are plenty of advantages: active support, clear and convenient documentation, immediate output of script results and errors to the console, and easy transfer of objects (as in rpy2, the main object is the data frame). Regarding active support, here is an example from personal experience: the average response time to a question on stackoverflow and in github issues is about 1 minute. To work with it, you can learn the syntax from the tutorial and then import individual Python modules and write functions. Or you can execute individual pieces of Python code through the py_run family of functions. They let you easily run a Python script from R, passing the necessary arguments, and get back an object with the entire output of the script.

Second place in quality goes to the rPython library. Its key advantage is simple syntax: just 4 commands, and you have mastered using R and Python together. The documentation does not lag behind either: everything is clear, simple and accessible. The main drawback is the clumsy implementation of data frame transfer: both R and Python require an extra step for the object to be passed as a table. The second important drawback is that the result of a function call is not printed to the console. When you run Python commands from R, you cannot quickly tell whether they executed successfully. The documentation even contains a nice passage from the authors saying that to see the result of executing the Python code you need to go into Python itself and look. Working with the package on Windows is possible, but only through pain. Useful links are in the table.

Third place goes to the Rcpp library. In fact, this is a very good option. As a rule, what is implemented in R through C++ works stably, efficiently and quickly. But it takes time to figure out, which is why it is listed last here.

Outside the table, RSPython is worth mentioning. The idea was good (a single platform in both directions with a single logic and syntax), but the implementation failed. The package has not been supported since 2005, although the old version can, in general, still be started and poked with a stick.

| Libraries | Comments |
| --- | --- |
| reticulate | good documentation; active development (since August 2016); magic function py_run_file("script.py") |
| rPython | data transfer via json; indirect transfer of tables; good documentation; weak Windows support; often crashes with Anaconda |
| Rcpp | through C++ (Boost.Python and Rcpp); specific skills are needed; a good example |
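The "indirect transfer of tables" noted for JSON-based exchange can be illustrated without any of the libraries above. The sketch below does not use rPython or its API; the function names are hypothetical. It only shows why a table needs an extra step when the channel is JSON: the table is first turned into a dictionary of columns, and the receiving side rebuilds the rows.

```python
import json


def table_to_json(columns):
    """Serialize a column-oriented table (dict of equal-length lists)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return json.dumps(columns)


def json_to_rows(payload):
    """Rebuild a list of row records from the column-oriented payload."""
    columns = json.loads(payload)
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Made-up table, used only to demonstrate the round trip
payload = table_to_json({"id": [1, 2, 3], "label": ["a", "b", "c"]})
rows = json_to_rows(payload)
print(rows)
```

A dictionary of columns maps naturally onto a named list in R, which is why a column-oriented layout is a convenient intermediate form for this kind of exchange.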


Detailed usage examples for the two most popular libraries are given below.

Interfacing R from Python: rpy2
For simplicity, the example uses the built-in R data set iris rather than a separate script.

    from rpy2.robjects import pandas2ri   # loading rpy2
    from rpy2.robjects import r
    from rpy2.robjects.packages import importr   # needed for importr() below

    pandas2ri.activate()   # activating pandas module

    # Note: in rpy2 3.x these converters are named rpy2py / py2rpy
    df_iris_py = pandas2ri.ri2py(r['iris'])   # from r data frame to pandas
    df_iris_r = pandas2ri.py2ri(df_iris_py)   # from pandas to r data frame

    # ggplot2 example: the function plots the data frame passed to it
    plotFunc = r("""
      library(ggplot2)
      function(df){
        p <- ggplot(df, aes(x = Sepal.Length, y = Petal.Length)) +
          geom_point(aes(color = Species))
        print(p)
        ggsave('iris_plot.pdf', plot = p, width = 6.5, height = 5.5)
      }
    """)

    gr = importr('grDevices')   # necessary to shut the graph off
    plotFunc(df_iris_r)
    gr.dev_off()


Interfacing Python from R: reticulate
For simplicity, we will use the Python script mentioned in the first examples, splitstr.py.

    library(reticulate)

    # aliasing the main module
    py <- import_main()

    # set parameters for Python directly from R session
    py$pattern <- "---"
    py$string <- "3523462---12413415---4577678---7967956---5456439"

    # run splitstr.py
    # (assumes a variant of the script that reads `string` and `pattern`
    # from the main module instead of sys.argv)
    result <- py_run_file('splitstr.py')

    # read Python script result in R
    result$ans
    # [1] "3523462" "12413415" "4577678" "7967956" "5456439"


Let's sum up


Benefits


Disadvantages


Other approaches


As a separate category, I would like to mention a few more ways of combining the languages: through the special functionality of notebooks and through separate platforms. Please note that most of the links below are practical examples.

Among notebooks, you can use:


There was also the wonderful Beaker Notebooks project, which made it very convenient to switch between languages in an interface similar to both Jupyter and Zeppelin. To transfer in-memory objects, the authors wrote a separate beaker library. In addition, it was convenient to publish (share) the result right away. But for some reason all of this is gone, and even the old publications have been deleted: the developers have concentrated on the BeakerX project.

Among the special software that makes it possible to combine R and Python, the following should be highlighted:


Advanced example


In conclusion, I would like to walk through a detailed example of solving a small research problem in which it is convenient to use both R and Python.

Suppose the task is as follows: compare inflation rates and economic growth rates in countries that use inflation targeting (or a close analogue) as their monetary policy regime, starting from 2013, the approximate start of the introduction of this regime in Russia.

We need to solve the problem quickly, but we only remember how to download the data quickly with Python, and how to process and visualize it with R.

So we open an R Notebook with trembling hands, write `python` in the header of the top cell and download the data from the World Bank website. The data will be passed on via CSV.

Python: loading data from the World Bank website
    from datetime import datetime
    import wbdata as wd

    # define a period of time
    start_year = 2013
    end_year = 2017

    # list of countries under inflation targeting monetary policy regime
    countries = ['AM', 'AU', 'AT', 'BE', 'BG', 'BR', 'CA', 'CH', 'CL', 'CO',
                 'CY', 'CZ', 'DE', 'DK', 'XC', 'ES', 'EE', 'FI', 'FR', 'GB',
                 'GR', 'HU', 'IN', 'IE', 'IS', 'IL', 'IT', 'JM', 'JP', 'KR',
                 'LK', 'LT', 'LU', 'LV', 'MA', 'MD', 'MX', 'MT', 'MY', 'NL',
                 'NO', 'NZ', 'PK', 'PE', 'PH', 'PL', 'PT', 'RO', 'RU', 'SG',
                 'SK', 'SI', 'SE', 'TH', 'TR', 'US', 'ZA']

    # set dictionary for wbdata
    inflation = {'FP.CPI.TOTL.ZG': 'CPI_annual',
                 'NY.GDP.MKTP.KD.ZG': 'GDP_annual'}

    # download wb data
    # (the standard datetime is used because pd.datetime was removed from pandas)
    df = wd.get_dataframe(inflation, country = countries,
                          data_date = (datetime(start_year, 1, 1),
                                       datetime(end_year, 1, 1)))

    print(df.head())
    df.to_csv('WB_data.csv', index = True)


Next, the data is preprocessed in R: we compute the spread of inflation values across countries (min / max / mean) and the average rate of economic growth (real GDP growth rate). We also change the names of some countries to shorter ones to make the visualization more convenient later.

R: data preprocessing
    library(tidyverse)
    library(data.table)
    library(DT)

    # get df with python results
    cpi <- fread('WB_data.csv')

    cpi <- cpi %>%
      group_by(country) %>%
      summarize(cpi_av = mean(CPI_annual),
                cpi_max = max(CPI_annual),
                cpi_min = min(CPI_annual),
                gdp_av = mean(GDP_annual)) %>%
      ungroup

    cpi <- cpi %>%
      mutate(country = replace(country,
                               country %in% c('Czech Republic', 'Korea, Rep.', 'Philippines',
                                              'Russian Federation', 'Singapore', 'Switzerland',
                                              'Thailand', 'United Kingdom', 'United States'),
                               c('Czech', 'Korea', 'Phil', 'Russia', 'Singap', 'Switz',
                                 'Thai', 'UK', 'US')),
             gdp_sign = ifelse(gdp_av > 0, 'Positive', 'Negative'),
             gdp_sign = factor(gdp_sign, levels = c('Positive', 'Negative')),
             country = fct_reorder(country, gdp_av),
             gdp_av = abs(gdp_av),
             coord = rep(ceiling(max(cpi_max)) + 2, dim(cpi)[1])
      )

    print(head(data.frame(cpi)))
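For readers who think in Python, the aggregation step of the R code above (minimum, maximum and mean inflation plus mean GDP growth per country) can be sketched with the standard library alone. This is an illustration rather than part of the original workflow: the function name is hypothetical and the records are made-up numbers for two fictional countries, using the column names from the CSV file.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_country(records):
    """Group records by country and compute the same summaries as the R code."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["country"]].append(rec)

    summary = {}
    for country, recs in grouped.items():
        cpi = [r["CPI_annual"] for r in recs]
        gdp = [r["GDP_annual"] for r in recs]
        summary[country] = {
            "cpi_av": mean(cpi),
            "cpi_max": max(cpi),
            "cpi_min": min(cpi),
            "gdp_av": mean(gdp),
        }
    return summary


# Made-up values for two fictional countries, not World Bank data
records = [
    {"country": "A", "CPI_annual": 2.0, "GDP_annual": 1.0},
    {"country": "A", "CPI_annual": 4.0, "GDP_annual": 3.0},
    {"country": "B", "CPI_annual": 10.0, "GDP_annual": -2.0},
]
summary = summarize_by_country(records)
print(summary)
```

Seeing the two versions side by side makes it easier to decide which part of a project to keep in which language.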


Then, with a small amount of R code, you can create a readable and pleasant graph that answers the original question about comparing inflation and GDP growth rates in countries that use inflation targeting.

R: visualization
    library(viridis)
    library(scales)

    ggplot(cpi, aes(country, y = cpi_av)) +
      geom_linerange(aes(x = country, y = cpi_av, ymin = cpi_min, ymax = cpi_max,
                         colour = cpi_av), size = 1.8, alpha = 0.9) +
      geom_point(aes(x = country, y = coord, size = gdp_av, shape = gdp_sign),
                 alpha = 0.5) +
      scale_size_area(max_size = 8) +
      scale_colour_viridis() +
      guides(size = guide_legend(title = 'Average annual\nGDP growth, %',
                                 title.theme = element_text(size = 7, angle = 0)),
             shape = guide_legend(title = 'Sign of\nGDP growth, %',
                                  title.theme = element_text(size = 7, angle = 0)),
             colour = guide_legend(title = 'Average\nannual CPI, %',
                                   title.theme = element_text(size = 7, angle = 0))) +
      ylim(floor(min(cpi$cpi_min)) - 2, ceiling(max(cpi$cpi_max)) + 2) +
      labs(title = 'Average Inflation and GDP Rates in Inflation Targeting Countries',
           subtitle = paste0('For the period 2013-2017'),
           x = NULL, y = NULL) +
      coord_polar() +
      theme_bw() +
      theme(legend.position = 'right',
            panel.border = element_blank(),
            axis.text.x = element_text(colour = '#442D25', size = 6, angle = 21, vjust = 1))

    ggsave('IT_countries_2013_2017.png', width = 11, height = 5.7)




The resulting graph in polar coordinates shows the range of inflation values over 2013 to 2017, as well as the average GDP growth values (placed above the inflation values). A circle means a positive GDP growth rate, a triangle a negative one.

On the whole, the figure allows us to draw some preliminary conclusions about the success of the inflation targeting regime in Russia relative to other countries. But that is beyond the scope of this article. If you are interested, I can give links to various materials on the topic.

Conclusions


Source: https://habr.com/ru/post/348260/

