
Python, Selenium and Krisha.kz: first steps in Big Data

Foreword


Learning something new is always interesting; it absorbs you completely, or at least that is how it works for me. This time, carried away by learning Python, I wondered where else it could be used besides a photo separator (an article about it will come a little later) and a sales accounting program, and came across an article about Big Data. Having studied the material on Big Data, I realized that the field is very promising and worth the time.


After that I worked through a huge number of articles and a couple of dozen tutorial videos, until I reached the point where new information simply stopped sinking in: it was time to apply the acquired skills and knowledge in practice. Of course, there are ready-made datasets online where every step is spelled out for you, but I get little pleasure from that kind of exercise, even though it certainly gives knowledge and experience. Starting from scratch, from data collection (hello, web scraping) to structuring the data for further analysis, seems to me more useful both as training and for practical application.


Getting started


The search for a suitable candidate site began. After a brief search I narrowed the list to Market, OLX, Kolesa and Krisha. The next selection criterion was whether the results of the analysis could be put to practical use.


Only Krisha.kz met this criterion. Krisha.kz is a popular site where people from all over Kazakhstan post ads for selling, buying or renting apartments, houses and offices, and it carries the largest number of such ads. After skimming the ads on the site, I decided to concentrate on apartments for sale in Astana in order to identify the following:


  1. Problem houses. Indicator: the number of ads for the house exceeds the average number of ads per house;
  2. Successful real estate agencies. Indicator: the number of published ads;
  3. Hidden entrepreneurs who resell apartments. I think this would be useful for the tax authorities;
  4. Unscrupulous realtors who use other people's photos of apartments to attract customers. I should say right away that I had to abandon this idea because of limited resources: I would have to store a huge number of photos, and I only have a laptop and home Internet.

Collection of information


Having settled on the main directions of the analysis, I started collecting information by web scraping. Since I had never had to collect data from a site before, I turned to the indispensable assistant, Google.com. Many libraries are available for Python (BeautifulSoup, urllib, Selenium and others), but in some article on Habrahabr or GitHub (useful sites for Russian speakers involved in programming) I came across a recommendation that Selenium is better suited for web scraping. Fortunately, installing libraries in Python is surprisingly simple: type “pip install selenium” on the command line and the library is installed.


After two days of trial and error and Google searches, I managed to write a script that collected, for each ad: the publication date, the price of the apartment, the number of rooms, the address of the house, the floor, who posted the ad (the owner or an agency), the description (type of renovation, name of the residential complex) and the number of views.


This information was sufficient for tasks 1 and 2, but task 3 required a phone number, which can only be obtained by opening the ad and clicking the “Show phone” button, and that complicated things. The problem was solved by reading additional material online and the Selenium documentation.
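
The "open the ad, click the button, read the number" step can be sketched roughly as follows. This is not the original script: the function name `get_phone` and both CSS selectors are hypothetical and would have to be taken from the real page markup. The `driver` argument is assumed to be a Selenium WebDriver; `"css selector"` is the locator strategy string that Selenium also exposes as `By.CSS_SELECTOR`. A simple polling loop stands in for Selenium's explicit waits to keep the sketch short.

```python
import time

CSS = "css selector"  # the same string as selenium's By.CSS_SELECTOR


def get_phone(driver, ad_url, button_css, phone_css, timeout=10.0, poll=0.5):
    """Open one ad, click the "Show phone" button and return the numbers.

    button_css and phone_css are hypothetical selectors for the button
    and for the elements that hold the revealed phone numbers.
    """
    driver.get(ad_url)
    driver.find_element(CSS, button_css).click()

    # The numbers appear only after the click, so poll until they show up.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        texts = [el.text.strip() for el in driver.find_elements(CSS, phone_css)]
        texts = [t for t in texts if t]
        if texts:
            return texts
        time.sleep(poll)
    return []  # nothing appeared within the timeout
```

Returning an empty list on timeout, instead of raising, lets the caller record the ad as "no phone" and move on to the next one.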


Time required to collect information


Collecting only the basic information, without phone numbers (about 23,000 - 27,000 ads in Astana alone), takes 1 to 2 hours depending on the speed of the Internet connection, but getting the phone numbers takes much longer: 24 hours or more.


The script collected information from the main page and opened a separate tab for each apartment ad; it did not move on to the next page until the phone numbers of all the ads on the current page had been opened and retrieved. Over dozens of runs the program often crashed, and the collection had to be started again from the beginning. For this reason the collection stage had to be split into two sub-steps:


  1. Collecting the information without the phone number, but saving a separate link to each ad.
  2. Getting the phone number by following the ad link saved in the first step.
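
The point of the split is that the slow second step can be resumed after a crash instead of restarted. A minimal sketch of that idea, assuming the links from step 1 are available as a list and that results are appended to a CSV file; the names `load_done`, `collect_phones` and `fetch_phone` are illustrative, not taken from the original script:

```python
import csv
import os


def load_done(path):
    """Return the set of ad URLs whose phone numbers are already saved."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}


def collect_phones(ad_urls, fetch_phone, out_path):
    """Step 2: fetch phones only for links that are not in out_path yet.

    fetch_phone is any callable url -> phone string (for example, a
    wrapper around the Selenium code); it is an assumption of this sketch.
    """
    done = load_done(out_path)
    need_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "phone"])
        if need_header:
            writer.writeheader()
        for url in ad_urls:
            if url in done:
                continue  # already collected in an earlier run
            writer.writerow({"url": url, "phone": fetch_phone(url)})
            f.flush()  # keep finished rows on disk if the next ad crashes
            done.add(url)
    return done
```

If the run dies halfway, calling `collect_phones` again with the same arguments skips everything that is already in the file.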

After both steps we get an Excel file with the following content:

image


As you can see, the data is unstructured. For analysis it has to be structured so that each column holds only one piece of information:


  1. The “RSS” column should be split into 4 columns: number of rooms, apartment area, apartment floor, number of floors in the building;
  2. The “Advertiser” column is split into: who posted the ad and, if it is an agency, the name of the agency;
  3. The “Street” column: remove the extra information about the street intersection and process it to get 2 columns: street name and house number;
  4. From the “Description” column, extract separate columns for: year of construction, district and name of the residential complex.
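
Items 1 and 3 of this list could be handled with regular expressions along these lines. The exact text format of the source columns is not shown in the article, so the sample strings below (`"3-room apartment, 85.5 m², 4/9 floor"` and `"Kabanbay Batyr 11 - Sarayshyk"`) and the function names are assumptions made for illustration; the patterns would need adjusting to the real data.

```python
import re

# Assumed shape: "<rooms>-room ..., <area> m..., <floor>/<floors> ..."
SUMMARY_RE = re.compile(
    r"(?P<rooms>\d+)-room.*?"
    r"(?P<area>\d+(?:[.,]\d+)?)\s*m.*?"
    r"(?P<floor>\d+)/(?P<floors>\d+)"
)


def split_summary(text):
    """Split one summary string into four numeric columns, or None."""
    m = SUMMARY_RE.search(text)
    if m is None:
        return None  # leave unparsable rows for manual inspection
    return {
        "rooms": int(m["rooms"]),
        "area_m2": float(m["area"].replace(",", ".")),
        "floor": int(m["floor"]),
        "floors_total": int(m["floors"]),
    }


def split_address(text):
    """Drop the intersecting street, then split street name and house number."""
    # Assumed: the intersection follows a dash surrounded by spaces.
    main = re.split(r"\s+[-\u2013\u2014]\s+", text.strip(), maxsplit=1)[0]
    m = re.match(r"(?P<street>.*?)\s+(?P<house>\d+\S*)$", main)
    if m is None:
        return main, None  # no house number found
    return m["street"], m["house"]
```

Rows that do not match are returned as `None` (or with a missing house number) rather than guessed, so they can be counted and checked by hand.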

Analysis


Overview of ads


With better-structured data we can move on to the analysis. First, let us determine who posts ads on the site. Below is a chart for ads from April 14 to April 19, 2018.


image


Before this analysis I assumed that ads were posted only by real estate agencies and apartment owners; I did not expect to find developers and construction companies on the site, although their share is less than 4% of all ads.


Interestingly, even banks have started posting ads on the site. Apparently they are selling off mortgaged apartments.


More than half of the ads come from real estate agencies. Since most of them duplicate ads that apartment owners have already posted on the site, these data will be excluded from further analysis. They will still be used, however, to identify the most successful real estate agency by number of ads.


Problem houses


To identify a problem house, we need the ratio of the number of ads for the house to the number of apartments in the house or residential complex, and then compare this ratio with the market average. The number of sale ads per house is available in our database, but the total number of apartments in each house has to be looked up on other sites. That takes additional time, so this part had to be postponed.
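
Once apartment counts are available, the comparison itself is simple. A minimal sketch, assuming two dictionaries keyed by house and taking "market average" to mean total ads divided by total apartments over the houses with known counts (one possible definition, not stated in the article); the function name and the `factor` threshold are illustrative:

```python
def flag_problem_houses(ads_per_house, apartments_per_house, factor=1.0):
    """Return {house: ratio} for houses whose ads/apartments ratio
    exceeds factor * market average.

    Houses without a known (non-zero) apartment count are skipped.
    """
    known = [h for h in ads_per_house if apartments_per_house.get(h)]
    if not known:
        return {}

    ratios = {h: ads_per_house[h] / apartments_per_house[h] for h in known}

    # Assumed definition of the market average: all ads over all apartments.
    total_ads = sum(ads_per_house[h] for h in known)
    total_apartments = sum(apartments_per_house[h] for h in known)
    market_avg = total_ads / total_apartments

    return {h: r for h, r in ratios.items() if r > factor * market_avg}
```

Raising `factor` above 1.0 narrows the result to houses that stand out more strongly from the average.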


Real Estate Agencies


image


This histogram shows the number of apartment sale ads posted by real estate agencies from April 14 to April 19, 2018.


Most of the ads were published by the “Realty” real estate agency: 1,929 over the period. It is followed, in decreasing order of ad count, by agencies such as “Real Price”, “Vitrina” and “Novaya Ploshchadka”, each with more than 500 listings.


If we assume that the criterion of success in the real estate services market is the number of ads (the more ads, the greater the chance that a client will see them and contact the agency), then the undisputed leader is the real estate agency "Floor".


Realtors or hidden entrepreneurs?


image


This histogram shows the number of ads published from a single phone number. For example, the first column, labeled “2”, shows that more than 1,460 people submitted more than 2 ads on the site.
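
The counts behind such a histogram can be obtained in two passes: first count ads per phone number, then count how many numbers share each ad count. A minimal sketch using only the standard library, assuming the phone numbers are already normalized to one format; the function name is illustrative:

```python
from collections import Counter


def ads_per_phone_histogram(phones, min_ads=2):
    """Map "number of ads" -> "how many phone numbers posted that many".

    phones holds one entry per ad; empty or missing values are ignored.
    Only phone numbers with at least min_ads ads are included.
    """
    per_phone = Counter(p for p in phones if p)
    histogram = Counter(n for n in per_phone.values() if n >= min_ads)
    return dict(sorted(histogram.items()))
```

In practice the same number can be written in several ways (with or without spaces, with different prefixes), so normalizing it before counting matters for the result.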


Many realtors, aware of the negative attitude towards them, often post ads as owners, as the number and frequency of such ads suggest.


If I were the tax authority, I would first determine the TIN of the owners of these phone numbers, then check the database for how much property is registered to each of them and the average length of ownership. If the ownership period is less than 1-2 years, or the person often buys and sells apartments within a short time at the same price (that is, supposedly without economic benefit), this owner deserves a closer look.


Conclusion


It should be noted that the data analysis in this article is superficial; I was more interested in understanding what difficulties and pitfalls can come up when obtaining and processing data, for example:


  1. Lack of information needed to reach the original goals, as in the case of identifying problem houses.
  2. Limited time.
  3. The task has to be defined precisely and followed. During this analysis I had to fight an unbridled desire to look at everything from all sides, and in the end a kind of battle of ideas took place in my head.

Each of these questions can be examined in more detail, for example by identifying problem houses not by the number of ads but by how often they are posted, but I will leave that to more inquisitive people with free time.



Source: https://habr.com/ru/post/359459/

