
Scraping data from a website using R and the rvest library

To analyze data, you must first collect it. There are many methods for doing so. In this article we will talk about scraping data directly from a website. There are several articles on Habr about scraping with Python; here we will use the R language (version 3.4.2) and its rvest library. As an example, we will scrape data from Google Scholar (GS).


GS is a search engine that searches not the entire Internet, but only published articles and patents. This can be very helpful, for example, when searching for scientific articles on certain keywords: you simply enter those words in the search bar. Or, say, you need to find the articles published by a particular author. You can simply type his last name, but it is better to use the keyword 'author' and enter something like 'author:"D Smith"'.


Let's do this search and look at the result.


[Screenshot: GS search results page, with elements (1)-(7) labeled]


GS shows that about 400 thousand articles were found. For each article it gives the title (1), the names of the authors (2), the title of the journal (3), the year of publication (4), and a brief summary (5). If a PDF of the article is available, the corresponding link (6) is given on the right. An important parameter (7) is also shown for each article, namely how many times the article has been cited in other works ("Cited by"). This parameter shows how popular the work is. Based on this parameter across an author's articles, one can estimate the productivity and "demand" of a scientist via the so-called Hirsch index (h-index). According to Wikipedia, the Hirsch index is defined as follows:


"A scientist has an index h, if h from his Np articles is cited at least h times each, while the remaining (Np - h) articles are cited no more than h times each."

In other words, if the list of all articles by a given author is sorted in decreasing order of the citation count $N_c$, then the sequence number $n$ of the last article for which the condition $n \le N_c$ still holds is the Hirsch index. Because various scientific foundations and organizations now pay special attention to the Hirsch index and other scientometric indicators, people have learned to "inflate" these indicators. But that is another story.
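
To make the definition concrete, here is a minimal R sketch that computes the h-index from a vector of citation counts (the values here are made up for illustration):

# Hypothetical citation counts of one author's articles
Nc <- c(120, 57, 33, 20, 8, 5, 2, 1)
Nc <- sort(Nc, decreasing = TRUE)   # sort by decreasing number of citations
h <- sum(Nc >= seq_along(Nc))       # the largest n for which n <= Nc[n]
h
## [1] 5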


In this article we will search through the articles of a particular author. As an example, take the Russian scientist Alexey Yakovlevich Chervonenkis. A. Ya. Chervonenkis, a famous Russian computer scientist, made a significant contribution to the theory of data analysis and machine learning. He died tragically in 2014.


We will collect the following 7 parameters of the articles: title of the article (1), list of authors (2), name of the journal (3), year of publication (4), publisher (8), number of citations (7), and link to the citing articles (7). The last parameter is needed, for example, if we want to look at relationships between authors and identify the network of people conducting research in a certain direction.


Preparation


First you need to install the rvest package. To do this, use the following R command:


install.packages("rvest")

Now we need a tool that shows which component of the web page's code corresponds to a given parameter. You can use the tools built into the browser: in Firefox, for example, select "Tools -> Web Development -> Inspector" from the menu. The HTML code of the web page will be displayed, and by hovering over an element of the page you can see which CSS selector corresponds to it. But we will take an easier route and use the SelectorGadget tool. To set it up, go to the SelectorGadget site and drag the bookmarklet link (located at the end of the program description) into your browser's bookmarks.


[Screenshot: adding the SelectorGadget bookmarklet to the browser]


Now, on any page, you can click this bookmark, and a handy tool will appear that lets you determine which code components correspond to the elements of the page (for details, see below).


Before scraping the data, it is also useful to study the properties of the web page: how its appearance changes depending on the request, which addresses the various links correspond to, and so on. We will search for the articles of A. Ya. Chervonenkis with the following request: author:"A Chervonenkis". This corresponds to the following address:


 https://scholar.google.com/scholar?hl=en&as_sdt=1%2C5&as_vis=1&q=author%3A%22A+Chervonenkis%22&btnG= 

With this query syntax, patents are not taken into account, nor are articles that cite Chervonenkis but in which he is not an author. An address like this can also be assembled programmatically, as sketched below.
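
A minimal sketch using base R's URLencode(); note that it encodes the space as %20 rather than +, which GS accepts equally:

# Build the GS query address (parameters mirror the address above)
base <- 'https://scholar.google.com/scholar'
query <- URLencode('author:"A Chervonenkis"', reserved = TRUE)
url <- paste0(base, '?hl=en&as_sdt=1%2C5&as_vis=1&q=', query, '&btnG=')
url
## [1] "https://scholar.google.com/scholar?hl=en&as_sdt=1%2C5&as_vis=1&q=author%3A%22A%20Chervonenkis%22&btnG="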


Scraping the data


Now let's write the scraping program. First we load the rvest library:


 library(rvest) 

Next, set the required address and read the HTML code of the web page:


url <- 'https://scholar.google.com/scholar?hl=en&as_sdt=1%2C5&as_vis=1&q=author%3A%22A+Chervonenkis%22&btnG='
wpage <- read_html(url)

Now let's extract the titles of the articles. To do this, launch the SelectorGadget and click on the title of some article. The title is highlighted in green, but other components of the page are highlighted as well (in yellow). The SelectorGadget line shows that 152 components were initially selected.


[Screenshot: SelectorGadget with 152 components initially selected]


Now we simply click on the components we do not need. As a result, only 10 of them remain (matching the number of articles on the page), and the SelectorGadget line shows the CSS selector corresponding to the article title, namely ".gs_rt a".


[Screenshot: SelectorGadget narrowed down to the ".gs_rt a" selector]


Using this selector we can extract all the titles and convert them to text format with the following commands (the last command shows the structure of the titles variable):


titles <- html_nodes(wpage, '.gs_rt a')
titles <- html_text(titles)
str(titles)
## chr [1:10] "On the uniform convergence of relative frequencies of ..."

Hereinafter, lines starting with '##' show the program output (in truncated form).


Similarly, we determine that the names of the authors, the title of the journal, the year, and the publisher correspond to the ".gs_a" component. The text of this component has the format "<authors> - <journal, year> - <publisher>". We extract the text from ".gs_a", select the parameters we need (authors, journals, year, publ) according to this format, and remove unnecessary characters and spaces:


# Scrape the combined data and convert to text format:
comb <- html_nodes(wpage, '.gs_a')
comb <- html_text(comb)
str(comb)
## chr [1:10] "VN Vapnik, AY Chervonenkis - Measures of complexity, ..."

lst <- strsplit(comb, '-')
# Find authors, journal, year, publisher by extracting components of the list
authors <- sapply(lst, '[[', 1)  # Take 1st component of the list
publ <- sapply(lst, '[[', 3)
lst1 <- strsplit(sapply(lst, '[[', 2), ',')
journals <- sapply(lst1, '[[', 1)
year <- sapply(lst1, '[[', 2)

# Replace the ellipsis with ~, trim spaces, convert 'year' to numeric:
authors <- trimws(gsub(authors, pattern = '…', replacement = '~'))
journals <- trimws(gsub(journals, pattern = '…', replacement = '~'))
year <- as.numeric(gsub(year, pattern = '…', replacement = '~'))
publ <- trimws(gsub(publ, pattern = '…', replacement = '~'))
str(authors)
## chr [1:10] "VN Vapnik, AY Chervonenkis " "VN Vapnik, AY Chervonenkis " ...

Note that sometimes the name of the journal itself contains a hyphen "-"; in that case the procedure for extracting the parameters from the source line has to change somewhat. One option is sketched below.
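
A minimal sketch of a more robust variant, assuming GS always renders the field separator as a hyphen surrounded by spaces:

# Split on ' - ' (hyphen surrounded by spaces) so that hyphens
# inside journal titles are left intact
lst <- strsplit(comb, ' - ', fixed = TRUE)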


Next, using the SelectorGadget, we determine the name of the component for the number of citations per article, remove the extra words, and convert the data into a numeric format.


cit0 <- html_nodes(wpage, '#gs_res_ccl_mid a:nth-child(3)')
cit <- html_text(cit0)
lst <- strsplit(cit, ' ')
cit <- as.numeric(sapply(lst, '[[', 3))
str(cit)
## num [1:10] 3586 256 136 102 30 ...

Finally, we extract the appropriate link:


cit_link <- html_attr(cit0, 'href')
str(cit_link)
## chr [1:10] "/scholar?cites=3657561935311739131&as_sdt=2005&..."

Now we have 7 vectors (titles, authors, journals, year, publ, cit, cit_link) for our 7 parameters. We can combine them into a single structure (a data frame):


df1 <- data.frame(titles = titles, authors = authors, journals = journals,
                  year = year, publ = publ, cit = cit, cit_link = cit_link,
                  stringsAsFactors = FALSE)
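
To check the result and keep it for later analysis, a small sketch (the file name is arbitrary):

# Inspect the structure and save the table to a CSV file
str(df1)
write.csv(df1, 'chervonenkis_articles.csv', row.names = FALSE)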

You can programmatically go to the next page of results by adding 'start=n&' to the address, where n/10 + 1 corresponds to the page number. In this way it is possible to collect information on all of the author's articles; a loop doing this is sketched below. Further, using the links to the citing articles (cit_link), you can find data on other authors.
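
Here is a sketch of such a loop; it collects only the titles, the number of pages is arbitrary, and the randomized pause between requests anticipates the remarks on Google's rate limiting below:

# Collect the titles from the first npages result pages
npages <- 3
base <- 'https://scholar.google.com/scholar?hl=en&as_sdt=1%2C5&as_vis=1&q=author%3A%22A+Chervonenkis%22&btnG='
all_titles <- character(0)
for (i in seq_len(npages)) {
  n <- (i - 1) * 10                           # start = 0, 10, 20, ...
  page <- read_html(paste0(base, '&start=', n))
  all_titles <- c(all_titles, html_text(html_nodes(page, '.gs_rt a')))
  Sys.sleep(runif(1, 5, 10))                  # random pause between requests
}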


In conclusion, a few comments. The terms for using Google’s services include the following:


"Do not misuse our Services. For example, do not need to use our services."

Information on the Internet indicates that Google tracks access to its web pages, in particular to GS. If Google suspects that information is being retrieved by a bot, it can restrict or block access from a specific IP address. For example, if requests come too often, or at regular intervals, such behavior is considered suspicious.


The considered method can easily be adapted to other websites. The combination of R, rvest, and SelectorGadget makes scraping data quite simple.


In preparing this article, information was used from here.



Source: https://habr.com/ru/post/350820/

