
R as a price monitoring tool

In this article I would like to touch on the topic of monitoring competitors. I understand that this topic has both supporters, since, in one way or another, monitoring is necessary for the successful development of almost any company, and opponents, who protect the interests of their business from those doing the monitoring.

Those who are in any way connected with sales in a competitive market probably know that monitoring competitors is an important task. The results are used for very different purposes, from adjusting local pricing policy and maintaining the assortment to drawing up the company's strategic plans. The author decided to practice solving this problem on one of the largest electronics retailers in Russia, of which he is a regular customer. What came of it is described below.

Instead of an introduction


It should be said right away that the article will not describe methods of social engineering or communication with companies that provide monitoring services. I will also add that there will be no analysis of the monitoring results, only the collection algorithm and some difficulties encountered along the way. Recently the author has been using R more and more, so it was decided to do the data collection with it. In addition, open data is gaining more and more popularity (for example, here , here or here ) and the skill of working with it directly from the environment you use will come in handy. All actions are purely demonstrative, and the collected data was not passed on to anyone.

Site analysis


The first thing to do is to study the structure of the competitor's site. Let's start with the content. The product classifier has several levels, and the number of levels is the same for every product. The list of goods can be reached through level 2 or level 3; for further work it was decided to use level 2.
The next step was to examine the source code of a page with a list of the required products and figure out its structure. Each item sits in a separate HTML container. Here the first difficulty awaits us: the page code contains information only about the first 30 products in the list, which clearly does not suit us. The author could not figure out how to programmatically load the next 30 products, so it was decided to study the mobile version of the site. In the mobile version, the bottom of the page contains links to the next and the last page. As a bonus, the mobile version of the site is much less "littered" with unnecessary links and tags.
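Both observations are easy to check directly from R. The sketch below loads one category page of the mobile version, counts the product containers and pulls out the pagination links; the category URL here is a placeholder, while the XPath expressions are the same ones used in the functions further on.

library(RCurl)
library(XML)

# placeholder URL of a level-2 category page of the mobile version
strURL <- "http://m.tramlu.ru/some-category"

html <- getURL(strURL, .encoding = 'UTF-8')
html.raw <- htmlTreeParse(html, useInternalNodes = TRUE)

# each product sits in its own <section class='b-product'> container, up to 15 per page
length(xpathApply(html.raw, path = "//section[@class='b-product']", fun = xmlValue))

# links to the next and the last page at the bottom of the mobile page
unlist(xpathApply(html.raw, path = "//a[@class='page g-nouline']", fun = xmlValue))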

Writing the code


Initially, it was decided to divide the code into several components:
1. A function that collects information from a specific page. It accepts as input the URL of a page containing up to 15 products (a peculiarity of the mobile version). The function returns a data.frame with information on availability, name, article number, number of reviews and price of the goods.
First function
# required packages: RCurl for loading pages, XML for parsing them
library(RCurl)
library(XML)

getOnePageBooklet <- function(strURLsub = "", curl = getCurlHandle()){
  # loading the required page
  html <- getURL(strURLsub, .encoding = 'UTF-8', curl = curl)
  # parsing html
  html.raw <- htmlTreeParse(html, useInternalNodes = TRUE)
  # searching for SKU nodes
  html.parse.SKU <- xpathApply(html.raw, path = "//section[@class='b-product']", fun = xmlValue)
  # some regex :)
  noT <- gsub(' ([0-9]+)\\s([0-9]+) ', ' \\1\\2 ', unlist(html.parse.SKU))
  noT <- gsub(';', ',', noT)
  noT <- gsub('\r\n', ';', noT)
  noT <- trimws(noT)  # the original used trim(); trimws() is the base-R equivalent
  noT <- gsub("(\\s;)+", " ", noT)
  noT <- gsub("^;\\s ", "", noT)
  noT <- gsub(";\\s+([0-9]+)\\s+;", "\\1", noT)
  noT <- gsub(" ; ", "", noT)
  noT <- gsub(" ", "", noT)
  noT <- gsub("\\s+;\\s*", "", noT)
  noT <- gsub("\\s+.;\\s*", "", noT)
  noT <- gsub(";\\s+", ";", noT)
  # text to list
  not.df <- strsplit(noT, ';')
  # list to nice df
  tryCatch(
    not.df <- as.data.frame(matrix(unlist(not.df), nrow = length(not.df), byrow = TRUE)),
    error = function(e) {print(strURLsub)}
  )
}
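A minimal usage sketch of the first function (the page URL is a placeholder; real URLs are built from the classifier, as shown further on):

curl <- getCurlHandle()
# hypothetical URL of one mobile page of a category
onePage <- getOnePageBooklet("http://m.tramlu.ru/some-category?pageNum=1", curl)
head(onePage)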


2. A function that, using function 1, collects information from all pages of a given level of the product classifier. Its main job is to find the number of the last page, run from the first page to the last using function 1 and combine the results into one data.frame. The result is a data.frame with all goods of the given classifier level.
Second function
getOneBooklet <- function(strURLmain = "", curl = getCurlHandle()){
  # data frame for the result
  df <- data.frame(inStock = character(), SKU = character(), Article = numeric(),
                   Comment = numeric(), Price = numeric())
  # loading the main subpage
  html <- getURL(strURLmain, .encoding = 'UTF-8', curl = curl)
  # parsing the main subpage
  html.raw <- htmlTreeParse(html, useInternalNodes = TRUE)
  # finding the last subpage
  html.parse.pages <- xpathApply(html.raw, path = "//a[@class='page g-nouline']", fun = xmlValue)
  if(length(html.parse.pages) == 0){
    urlMax <- 1
  } else {
    urlMax <- as.numeric(unlist(html.parse.pages)[length(unlist(html.parse.pages))])
  }
  # loop over all subpages
  tryCatch(
    for(iPage in 1:urlMax){
      strToB <- paste0(strURLmain, '?pageNum=', iPage)
      df.inter <- getOnePageBooklet(strToB, curl)
      df <- rbind(df, df.inter)
    },
    error = function(e) {print(iPage)})
  # write.table(df, paste0('D:\\', as.numeric(Sys.time()), '.csv'), sep = ";")
  df
}


3. A function that, using function 2, collects information from all available levels of the product classifier. In addition, the names of the classifier levels are added to the collected data. Just in case, once the information on a category has been collected in full, the result is saved to disk.
Third function
getOneCity <- function(urlMain = "http://m.tramlu.ru", curl = getCurlHandle()){
  df.prices <- data.frame(inStock = character(), SKU = character(), Article = numeric(),
                          Comment = numeric(), Price = numeric(),
                          level1 = character(), level2 = character())
  # getAllLinks() returns the names and URLs of the classifier levels found on a page;
  # it is part of the full script linked at the end of the article
  level1 <- getAllLinks(urlMain, curl)
  numLevel1 <- length(level1[, 2])
  for (iLevel1 in 1:numLevel1){
    strURLsubmain <- paste0(urlMain, level1[iLevel1, 2])
    level2 <- getAllLinks(strURLsubmain, curl)
    numLevel2 <- length(level2[, 2])
    for (iLevel2 in 1:numLevel2){
      strURLsku <- paste0(urlMain, level2[iLevel2, 2])
      df.temp <- getOneBooklet(strURLsku, curl)
      df.temp$level1 <- level1[iLevel1, 1]
      df.temp$level2 <- level2[iLevel2, 1]
      df.prices <- rbind(df.prices, df.temp)
    }
    # save intermediate results after each level-1 category
    write.table(df.prices, paste0('D:\\', iLevel1, '.csv'), sep = ";", quote = FALSE)
  }
  df.prices
}


Later, fearing the ban hammer, one more function had to be added, for working through a proxy. It turned out that nobody was planning to ban anyone: even counting all the script testing, the entire classifier was collected without problems, so everything was limited to creating a data.frame with information on a hundred proxies that was never used.
Getting proxy list
getProxyAddress <- function(){
  htmlProxies <- getURL('http://www.google-proxy.net/', .encoding = 'UTF-8')
  # htmlProxies <- gsub('</td></tr>', ' \n ', htmlProxies)
  htmlProxies <- gsub('\n', '', htmlProxies)
  htmlProxies <- gsub('(</td><td>)|(</td></tr>)', ' ; ', htmlProxies)
  # parsing the proxy list page
  htmlProxies.raw <- htmlTreeParse(htmlProxies, useInternalNodes = TRUE)
  # extracting the table body with the proxies
  html.parse.proxies <- xpathApply(htmlProxies.raw, path = "//tbody", fun = xmlValue)
  html.parse.proxies <- gsub('( )+', '', html.parse.proxies)
  final <- unlist(strsplit(as.character(html.parse.proxies), ';'))
  # 100 proxies x 8 columns
  final <- as.data.frame(matrix(final[1:800], nrow = length(final)/8, ncol = 8, byrow = TRUE))
  # final <- gsub('( )+', '', final)
  names(final) <- c('IP', 'Port', 'Code', 'Country', 'Proxy type', 'Google', 'Https', 'Last checked')
  sapply(final, as.character)
}


You can use a proxy like this
How to use a proxy
opts <- list(
  proxy = "1.1.1.1",
  proxyport = "8080"
)
getURL("http://habrahabr.ru", .opts = opts)
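If the ban had actually come, the table returned by getProxyAddress() could be plugged into the same .opts mechanism, for example by taking the first proxy from the list. A sketch that, as said above, was never needed in practice:

proxies <- getProxyAddress()
opts <- list(
  proxy = proxies[1, 'IP'],
  proxyport = proxies[1, 'Port']
)
getURL("http://habrahabr.ru", .opts = opts)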


The structure of all the code components is similar. The functions used are getURL from the RCurl package, htmlTreeParse and xpathApply from the XML package, plus several regular expressions.
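As a toy illustration of that pipeline, here are the same steps on a hand-made HTML fragment; the class name mirrors the one used above, but the fragment itself and the field layout are made up.

# a made-up fragment with two product containers
htmlToy <- "<html><body>
<section class='b-product'>In stock; Phone X; art. 12345; 10 reviews; 9 990 rub.</section>
<section class='b-product'>In stock; Phone Y; art. 67890; 3 reviews; 19 990 rub.</section>
</body></html>"

toy.raw <- htmlTreeParse(htmlToy, asText = TRUE, useInternalNodes = TRUE)
toy.nodes <- unlist(xpathApply(toy.raw, path = "//section[@class='b-product']", fun = xmlValue))
# the first regular expression glues the thousands back together: '9 990' -> '9990'
gsub(' ([0-9]+)\\s([0-9]+) ', ' \\1\\2 ', toy.nodes)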
The last difficulty was specifying the city whose prices we want to find out. By default, the loaded data contained prices for delivery by mail, with uncertain availability of goods. When you enter the site of the company under investigation, a window appears asking you to choose your city. This information is then stored in a browser cookie and used to display the prices and products for the selected city. For R to save cookies, the properties of the connection being used have to be set. After that you just need to load in R the page corresponding to the city of interest.
Connection properties
 agent ="Mozilla/5.0" curl = getCurlHandle() curlSetOpt(cookiejar="cookies.txt", useragent = agent, followlocation = TRUE, curl=curl) 


Instead of a conclusion


That is all: having set the required parameters, we run function 3, wait about half an hour and get a list of all the products and categories of the site under investigation. It came to more than 60,000 prices for the city of Moscow. The result is a data.frame plus several files saved to disk.
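Put together, one full run looks roughly like this (a sketch; the city page URL is again a placeholder):

agent <- "Mozilla/5.0"
curl <- getCurlHandle()
curlSetOpt(cookiejar = "cookies.txt", useragent = agent, followlocation = TRUE, curl = curl)

# choose the city (placeholder URL), then collect the whole classifier
getURL("http://m.tramlu.ru/moscow", curl = curl)
df.moscow <- getOneCity("http://m.tramlu.ru", curl)

nrow(df.moscow)   # over 60,000 rows for Moscow
str(df.moscow)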
The entire script is on github.

Thank you for your attention.

Source: https://habr.com/ru/post/255173/

