Search for "R package <INSERT NAME: new great stats model>" and your chances of success are very high! The undoubted advantage of R is the possibility of advanced statistical analysis, in particular for a number of specific areas of science and practice (econometrics, bioinformatics, etc.). In my opinion, time series analysis is also still much better developed in R.
Another strength of R is its convenient tools for interactive reporting and presenting results: htmlwidgets, flexdashboard, shiny, slidify. For example, the materials for this article were initially collected as an interactive presentation in slidify.
Both languages keep developing: in Python, visualization libraries are appearing (seaborn became a big step forward) and new modelling libraries (pyflux, pymaclab, etc.) are being added; in R, memory management efficiency and data processing capabilities are improving (data.table). Here, for example, you can see examples of basic operations with data in R and Python. So whether there is an advantage in combining the languages in your project is up to you to decide.
Here are a few possible combinations of the two languages in one project:

- modelling (xgboost, xgboostExplainer) + visualization via Markdown reports, using R
- data processing in Python (numpy, pandas) + output of the result to a dashboard or a shiny application in R (flexdashboard, htmlwidgets)
- web scraping in R (rvest) + NLP in Python + a parameterized report in R (RMarkdown Parameterized Reports)

The simplest way to combine the two languages is to run scripts from the command line. The general form of the call is:

<cmd_to_run> <path_to_script> <any_additional_args>
Here:

- <cmd_to_run> is the command that executes an R or Python script from the command line
- <path_to_script> is the path to the script file
- <any_additional_args> is the list of arguments passed to the script

Command | Python | R
---|---|---
Cmd | python path/to/myscript.py arg1 arg2 arg3 | Rscript path/to/myscript.R arg1 arg2 arg3
Fetch arguments | import sys; sys.argv # list, 1st el. - file executed | commandArgs(trailingOnly = TRUE) # character vector of args
As an example, take a simple R script that should determine the maximum number from a list of arguments, and call it max.R.
```r
# max.R
randomvals <- rnorm(75, 5, 0.5)
par(mfrow = c(1, 2))
hist(randomvals, xlab = 'Some random numbers')
plot(randomvals, xlab = 'Some random numbers', ylab = 'value', pch = 3)
```
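As published, this snippet only draws random values; for the Python call below to print a maximum, max.R has to read its command-line arguments and write the result to stdout. A minimal sketch of such a script (a hypothetical version, assuming the standard commandArgs()/cat() approach):

```r
# max.R (hypothetical sketch): read the arguments and print their maximum
myArgs <- commandArgs(trailingOnly = TRUE)  # character vector of arguments
nums <- as.numeric(myArgs)                  # convert the strings to numbers
cat(max(nums))                              # write the result to stdout for the caller
```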
```python
# calling R from Python
import subprocess

# Define command and arguments
command = 'Rscript'
path2script = 'path/to your script/max.R'

# Variable number of args in a list
args = ['11', '3', '9', '42']

# Build subprocess command
cmd = [command, path2script] + args

# check_output will run the command and store to result
x = subprocess.check_output(cmd, universal_newlines=True)
print('The maximum of the numbers is:', x)
```
In the opposite direction: a simple Python script, splitstr.py, splits a string by a given pattern and is then called from R.

```python
# splitstr.py
import sys

# Get the arguments passed in
string = sys.argv[1]
pattern = sys.argv[2]

# Perform the splitting
ans = string.split(pattern)

# Join the resulting list of elements into a single newline
# delimited string and print
print('\n'.join(ans))
```
```r
# calling Python from R
command = "python"

# Note the single + double quotes in the string (needed if paths have spaces)
path2script = '"path/to your script/splitstr.py"'

# Build up args in a vector
string = "3523462---12413415---4577678---7967956---5456439"
pattern = "---"
args = c(string, pattern)

# Add path to script as first arg
allArgs = c(path2script, args)

output = system2(command, args = allArgs, stdout = TRUE)
print(paste("The Substrings are:\n", output))
```
Data can also be exchanged between the two languages through intermediate storage:

Storage | Python | R
---|---|---
Flat files | |
csv | csv, pandas | readr, data.table
json | json | jsonlite
yaml | PyYAML | yaml
Databases | |
SQL | sqlalchemy, pandasql, pyodbc | RSQLite, RODBC, RMySQL, sqldf, dplyr
NoSQL | pymongo | RMongo
Feather | |
for data frames | feather | feather
Numpy | |
for numpy objects | numpy | RcppCNPy
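To make the flat-file part of the table concrete, a small sketch of reading such files on the R side (the functions are the standard ones from the packages above; the file names are made up for illustration):

```r
# Reading data from the flat-file formats listed above in R.
# "data.csv", "data.json" and "config.yaml" are hypothetical file names.
library(data.table)  # fread() for fast csv reading
library(jsonlite)    # fromJSON() for json
library(yaml)        # read_yaml() for yaml

df_csv  <- fread("data.csv")                     # data.table from csv
df_json <- as.data.frame(fromJSON("data.json"))  # data frame from json
cfg     <- read_yaml("config.yaml")              # named list from yaml
```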
For transferring numpy objects to R and back there is the fast and stable RcppCNPy library. An example of its use can be found here.
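A minimal sketch of what this looks like on the R side (assuming a matrix previously saved from numpy with numpy.save(); the file names are made up):

```r
# Read a numpy .npy file into R and write an R matrix back to .npy.
# "matrix.npy" and "result.npy" are hypothetical file names.
library(RcppCNPy)

m <- npyLoad("matrix.npy")    # numeric matrix created by numpy.save()
res <- m * 2                  # some processing on the R side
npySave("result.npy", res)    # numpy.load() can read this back in Python
```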
A separate story is the feather format, designed specifically for transferring data frames between R and Python. Its original selling points are the focus on these two languages, ease of handling in both, and very fast writing and reading. The idea is excellent, but the implementation, as sometimes happens, has nuances: the developers themselves have repeatedly pointed out that the format is not yet suitable for long-term storage, and when the libraries for working with it are updated, the whole pipeline may break and require significant code changes.

In terms of speed, feather is significantly ahead of the key libraries for working with the classic csv format. In R I compared it with data.table and dplyr: working with feather was the fastest, although the margin over data.table was not large. At the same time, there are certain difficulties with setting up and working with feather in R, and its support is in doubt: the feather package for R was last updated a year ago.

In Python I compared it with pandas. The speed gain for feather turned out to be significant, and there were no problems working with the format under Python 3.5.
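A minimal sketch of the round trip (assuming the feather package on the R side; the file name is made up):

```r
# Write a data frame to feather in R; pandas.read_feather() can open the
# same file on the Python side. "iris.feather" is a hypothetical file name.
library(feather)

write_feather(iris, "iris.feather")  # fast binary write
df <- read_feather("iris.feather")   # read it back as a data frame / tibble
```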
The first library for calling R from Python is rpy2. It works quickly and has a good description both as part of the official pandas documentation and on a separate site. The main feature is integration with pandas: the key object for transferring information between the languages is the data frame. Direct support for ggplot2, the most popular R visualization package, is also declared, i.e. you write code in Python and see the graph right in your Python IDE; getting ggplot2 output to work under Windows, however, is really messy.

The downside of rpy2 is perhaps only one: the need to spend some time studying the tutorials. For correct work with the two languages this is necessary, since there are unobvious nuances in the syntax and in the matching of object types during transfer. For example, when passing a number from Python to R you get not a number but a vector of one element.
The main advantage of the pyper library, which takes second place in the table below, is speed. The implementation through pipes does indeed speed things up on average (there is even an article on this topic in JSS), and the declared support for pandas objects in the R-Python exchange inspires hope. But the existing disadvantages reliably push the library into second place. The key drawback is poor support for installing R libraries from Python: if you want to use any non-standard R library (and R is often needed precisely for that), you have to download all of its dependencies one by one for it to install and work, and for some libraries there can be an enormous number of them. The second major drawback is inconvenient work with graphics: a plot can be viewed only by writing it to a file on disk. The third drawback is poor documentation: if something goes wrong even slightly outside the standard scenarios, you often will not find a solution in the documentation or on Stack Overflow.
The pyrserve library is easy to use but also significantly limited in functionality: for example, it does not support the transfer of tables. The updating and support of the library by its developers also leaves much to be desired: version 0.9.1 has been the latest available for more than 2 years.

Libraries | Comments
---|---
rpy2 | - C-level interface - direct support pandas - graphics support (+ ggplot2) - weak Windows support |
pyper | - Python code - use of pipes (on average faster) - indirect support for pandas - limited graphics support - bad documentation |
pyrserve | - Python code - use of pipes (on average faster) - indirect support for pandas - limited graphics support - bad documentation - low level of project support |
For calling Python from R, the best option is reticulate, an official RStudio development. No real cons are visible in it, while the advantages are plentiful: active support, clear and convenient documentation, output of script results and errors straight to the console, and easy transfer of objects (as in rpy2, the main object is the data frame). Regarding active support, an example from personal experience: the average speed of getting an answer to a question on Stack Overflow or in github issues is about 1 minute. To work with it, you can learn the syntax from the tutorial and then attach individual Python modules and write functions, or execute separate pieces of Python code through the py_run functions. They allow you to easily execute a Python script from R, passing the necessary arguments, and to get an object with the entire output of the script.
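A minimal sketch of both styles of use (numpy here is just an example of an attached module, and the variable names are made up):

```r
# Attaching a Python module and running a piece of Python code from R.
library(reticulate)

np <- import("numpy")        # attach a Python module
np$mean(c(1, 5, 9))          # call its functions as if they were R functions

out <- py_run_string("squares = [i ** 2 for i in range(5)]")
out$squares                  # objects created in Python are accessible from R
```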
The last library considered is rPython. Its key advantage is the simplicity of the syntax: only 4 commands and you have mastered using R and Python together. The documentation does not lag behind either: everything is clear, simple and accessible. The main drawback is the clumsy implementation of data frame transfer: both R and Python require an extra step for an object to be passed as a table. The second important drawback is the lack of console output of execution results: when you run some Python commands from R, you cannot quickly tell whether they executed successfully or not. The documentation even contains a telling remark from the authors that, in order to see the result of executing the Python code, you need to go into Python itself and look. Working with the package from Windows is possible only through pain, but it is possible. Useful links are in the table.
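The basic commands are presumably python.exec, python.assign, python.get and python.call; a minimal sketch of how they fit together (the variable names are made up, and treating these as the four commands is my assumption):

```r
# Basic rPython workflow: push a value to Python, run code, pull the result back.
library(rPython)

python.assign("xs", c(3, 1, 4, 1, 5))          # send an R vector to Python
python.exec("res = sorted(xs, reverse=True)")  # run arbitrary Python code
python.get("res")                              # fetch the result back into R
python.call("len", c(3, 1, 4, 1, 5))           # call a Python function directly
```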
There is also the option of linking the languages through C++ with the Rcpp library (plus Boost.Python on the Python side, see the table). In fact, this is a very good option: as a rule, what is implemented for R in C++ works stably, efficiently and fast. But it takes time to figure out, which is why it is listed at the end.

Libraries | Comments
---|---
reticulate | - good documentation - active development (since August 2016) - magic function |
rPython | - data transfer via json - indirect transfer of tables - good documentation - weak Windows support - often crashes under Anaconda |
Rcpp | - through C++ (Boost.Python and Rcpp) - specific skills are needed - a good example |
An example of working with rpy2: the iris data frame is moved between R and pandas and then plotted with ggplot2.
```python
# rpy2: moving a data frame between R and pandas and plotting with ggplot2
from rpy2.robjects import pandas2ri           # loading rpy2
from rpy2.robjects import r
from rpy2.robjects.packages import importr    # needed for grDevices below

pandas2ri.activate()                          # activating the pandas module

df_iris_py = pandas2ri.ri2py(r['iris'])       # from R data frame to pandas
df_iris_r = pandas2ri.py2ri(df_iris_py)       # from pandas to R data frame

plotFunc = r("""
    library(ggplot2)
    function(df){
        p <- ggplot(df, aes(x = Sepal.Length, y = Petal.Length)) +
            geom_point(aes(color = Species))
        print(p)
        ggsave('iris_plot.pdf', plot = p, width = 6.5, height = 5.5)
    }
""")  # ggplot2 example

gr = importr('grDevices')                     # necessary to shut the graphics device off
plotFunc(df_iris_r)
gr.dev_off()
```
An example of working with reticulate, running the splitstr.py script from the earlier example (for this to work, the script has to read string and pattern from the main module rather than from sys.argv):
```r
library(reticulate)

# aliasing the main module
py <- import_main()

# set parameters for Python directly from the R session
py$pattern <- "---"
py$string = "3523462---12413415---4577678---7967956---5456439"

# run splitstr.py (the script from the earlier example)
result <- py_run_file('splitstr.py')

# read the Python script result in R
result$ans
# [1] "3523462"  "12413415" "4577678"  "7967956"  "5456439"
```
To use the two languages in one Jupyter notebook, there is the R magic syntax, which works together with the rpy2 library discussed above. It is not difficult on the whole, but again you need to spend some time studying the syntax. Another option is to install IRkernel, but in that case the kernels can only be run separately, transferring files through writes to disk.

Things are more convenient in R Notebooks: you connect the reticulate library and then indicate, for each chunk of the notebook, the language in which you are going to write. A convenient thing!
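A sketch of what such a notebook looks like (an R Markdown / R Notebook skeleton; the chunk contents are just for illustration):

````markdown
```{r setup}
library(reticulate)  # enables {python} chunks and the py / r bridge objects
```

```{python}
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3]})  # created in a Python chunk
```

```{r}
head(py$df)  # the same data frame, accessed from an R chunk via py$
```
````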
Earlier it was also convenient to combine the two languages in notebooks of the beaker library; a nice bonus was that the result could immediately be published (shared). But for some reason all of that is no more, even the old publications have been deleted: the developers have concentrated on the BeakerX project.

As a closing example: downloading World Bank inflation and GDP data for inflation-targeting countries in Python, then processing and visualizing it in R.

```python
import pandas as pd
import wbdata as wd

# define a period of time
start_year = 2013
end_year = 2017

# list of countries under inflation targeting monetary policy regime
countries = ['AM', 'AU', 'AT', 'BE', 'BG', 'BR', 'CA', 'CH', 'CL', 'CO', 'CY',
             'CZ', 'DE', 'DK', 'XC', 'ES', 'EE', 'FI', 'FR', 'GB', 'GR', 'HU',
             'IN', 'IE', 'IS', 'IL', 'IT', 'JM', 'JP', 'KR', 'LK', 'LT', 'LU',
             'LV', 'MA', 'MD', 'MX', 'MT', 'MY', 'NL', 'NO', 'NZ', 'PK', 'PE',
             'PH', 'PL', 'PT', 'RO', 'RU', 'SG', 'SK', 'SI', 'SE', 'TH', 'TR',
             'US', 'ZA']

# set dictionary for wbdata
inflation = {'FP.CPI.TOTL.ZG': 'CPI_annual', 'NY.GDP.MKTP.KD.ZG': 'GDP_annual'}

# download wb data
df = wd.get_dataframe(inflation, country = countries,
                      data_date = (pd.datetime(start_year, 1, 1),
                                   pd.datetime(end_year, 1, 1)))
print(df.head())
df.to_csv('WB_data.csv', index = True)
```
```r
library(tidyverse)
library(data.table)
library(DT)

# get df with python results
cpi <- fread('WB_data.csv')

cpi <- cpi %>%
    group_by(country) %>%
    summarize(cpi_av = mean(CPI_annual),
              cpi_max = max(CPI_annual),
              cpi_min = min(CPI_annual),
              gdp_av = mean(GDP_annual)) %>%
    ungroup

cpi <- cpi %>%
    mutate(country = replace(country,
                             country %in% c('Czech Republic', 'Korea, Rep.', 'Philippines',
                                            'Russian Federation', 'Singapore', 'Switzerland',
                                            'Thailand', 'United Kingdom', 'United States'),
                             c('Czech', 'Korea', 'Phil', 'Russia', 'Singap',
                               'Switz', 'Thai', 'UK', 'US')),
           gdp_sign = ifelse(gdp_av > 0, 'Positive', 'Negative'),
           gdp_sign = factor(gdp_sign, levels = c('Positive', 'Negative')),
           country = fct_reorder(country, gdp_av),
           gdp_av = abs(gdp_av),
           coord = rep(ceiling(max(cpi_max)) + 2, dim(cpi)[1]))

print(head(data.frame(cpi)))
```
```r
library(viridis)
library(scales)

ggplot(cpi, aes(country, y = cpi_av)) +
    geom_linerange(aes(x = country, y = cpi_av, ymin = cpi_min, ymax = cpi_max,
                       colour = cpi_av),
                   size = 1.8, alpha = 0.9) +
    geom_point(aes(x = country, y = coord, size = gdp_av, shape = gdp_sign),
               alpha = 0.5) +
    scale_size_area(max_size = 8) +
    scale_colour_viridis() +
    guides(size = guide_legend(title = 'Average annual\nGDP growth, %',
                               title.theme = element_text(size = 7, angle = 0)),
           shape = guide_legend(title = 'Sign of\nGDP growth, %',
                                title.theme = element_text(size = 7, angle = 0)),
           colour = guide_legend(title = 'Average\nannual CPI, %',
                                 title.theme = element_text(size = 7, angle = 0))) +
    ylim(floor(min(cpi$cpi_min)) - 2, ceiling(max(cpi$cpi_max)) + 2) +
    labs(title = 'Average Inflation and GDP Rates in Inflation Targeting Countries',
         subtitle = paste0('For the period 2013-2017'),
         x = NULL, y = NULL) +
    coord_polar() +
    theme_bw() +
    theme(legend.position = 'right',
          panel.border = element_blank(),
          axis.text.x = element_text(colour = '#442D25', size = 6, angle = 21, vjust = 1))

ggsave('IT_countries_2013_2017.png', width = 11, height = 5.7)
```
To sum up: if you need to call R from Python, rpy2 is the first choice; if you need to call Python from R, reticulate.

Source: https://habr.com/ru/post/348260/