Studying something new is always interesting and completely absorbing, at least for me. This time, fascinated by learning Python, I wondered where else it could be used besides the photo separator I had written (an article about it will follow a little later) and a sales accounting program, and I came across an article about Big Data. Having studied materials on Big Data, I realized that this field is very promising and worth the time to study.
After that, I started reading an enormous number of articles and, after watching a couple of dozen tutorial videos, reached the point where new information simply would not sink in; I needed to apply the acquired skills and knowledge in practice. Of course, there are ready-made datasets online where everything you need to do is spelled out for you, but I get little pleasure from that, although you certainly gain knowledge and experience. Starting from scratch, from data collection (hello, web scraping) through structuring the data for further analysis, seems to me more useful both for learning and for practical application.
The search for a suitable candidate (i.e. a site) began. After a brief search, I settled on Market, OLX, Kolesa and Krisha. A further criterion for choosing the site was the practical applicability of the results of the analysis.
Only one site, Krisha.kz, met this criterion. Krisha.kz is a popular site where people from all over Kazakhstan post ads for selling, buying or renting apartments, houses and offices, and it carries the largest number of ads. Having skimmed the ads on the site, I decided to concentrate on ads for the sale of apartments in Astana in order to determine the following:
Having settled on the main directions of the analysis, I began collecting information by web scraping. Since I had never had to collect data from a site before, I turned to the indispensable assistant, Google.com. A number of libraries are available for Python (BeautifulSoup, urllib, Selenium, etc.), but in some article on Habrahabr or GitHub (useful sites for Russian speakers involved in programming) I came across a recommendation that Selenium would be better suited to web scraping. Fortunately, installing libraries in Python is surprisingly simple: type “pip install selenium” on the command line and that’s it, the library is installed.
After two days of trial and error and Google searches, I managed to create a script that collected the following for each ad: the date of the ad, the price of the apartment, the number of rooms, the address of the building, the floor, who posted the ad (the owner or an agency), the description (type of renovation, name of the residential complex) and the number of views.
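The original script is not shown here, so the following is only a minimal sketch of what such a collection loop might look like with the current Selenium 4 API. The listing URL, the `page` query parameter and the `div.a-card` selector are assumptions for illustration and would have to be checked against the real page in a browser.

```python
from urllib.parse import urlencode

# Hypothetical values: the real URL layout and CSS classes must be verified.
BASE_URL = "https://krisha.kz/prodazha/kvartiry/astana/"
CARD_SELECTOR = "div.a-card"  # assumed selector for a single ad card


def page_url(page: int) -> str:
    """Return the URL of results page `page` (assumes a `page` query parameter)."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def scrape_pages(first: int, last: int) -> list[str]:
    """Collect the visible text of every ad card on pages first..last."""
    # Imported lazily so that page_url() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # requires a local Chrome installation
    texts = []
    try:
        for page in range(first, last + 1):
            driver.get(page_url(page))
            for card in driver.find_elements(By.CSS_SELECTOR, CARD_SELECTOR):
                texts.append(card.text)
    finally:
        driver.quit()  # always close the browser, even after an error
    return texts
```

Calling `scrape_pages(1, 3)` would open a browser and return the raw text of the cards on the first three pages; splitting that text into price, rooms, floor and so on is a separate step.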
This information was sufficient for tasks 1 and 2, but task 3 required a phone number, which can be obtained only by opening the ad and clicking the “Show phone” button, and that complicated matters. I solved this problem by studying additional material on the Internet and the Selenium documentation.
Collecting only the basic information, without phone numbers (about 23,000 to 27,000 ads in Astana alone), takes 1 to 2 hours, depending on the speed of the Internet connection, but collecting the phone numbers takes much longer: 24 hours or more.
The script collected information from the listing page and opened a separate tab for each apartment ad; it did not move on to the next page until the phone numbers from all the ads on the current page had been opened and retrieved. Over dozens of runs the program often crashed, and the collection had to be restarted from the beginning. This made it necessary to split the information gathering stage into two sub-steps.
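One way to make the slow phone-number step survive crashes is to save every result as soon as it is obtained and to skip already processed ads on the next run. The sketch below illustrates that idea with the standard library only; the file layout and the `fetch_phone` callback (which in practice would wrap the Selenium click on “Show phone”) are hypothetical names, not the original implementation.

```python
import csv
import os


def load_done(phones_path):
    """Return the set of ad URLs whose phone number is already saved."""
    if not os.path.exists(phones_path):
        return set()
    with open(phones_path, newline="", encoding="utf-8") as f:
        return {row[0] for row in csv.reader(f) if row}


def collect_phones(ad_urls, phones_path, fetch_phone):
    """Fetch a phone number for every URL not yet processed.

    Each result is appended and flushed immediately, so a crash loses
    at most the ad that was being processed at that moment.
    """
    done = load_done(phones_path)
    with open(phones_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in ad_urls:
            if url in done:
                continue  # collected during an earlier run
            writer.writerow([url, fetch_phone(url)])
            f.flush()
```

With this layout the list of ad URLs is gathered once in the fast first pass, and `collect_phones` can simply be rerun after a crash until every URL has a row.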
As you can see, the data is not structured. For analysis, it has to be structured so that each column contains only one piece of information.
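As an illustration of this structuring step, the sketch below splits a raw ad title and price string into separate fields with regular expressions. The input format (for example "2-room apartment, 54 m², 3/9 floor") is an assumed, simplified shape; the real strings on the site are likely to differ and the patterns would need to be adapted.

```python
import re

# Assumed shape of a raw title: "<rooms>-room apartment, <area> m², <floor>/<floors> floor"
TITLE_RE = re.compile(
    r"(?P<rooms>\d+)-room apartment,\s*"
    r"(?P<area>\d+(?:\.\d+)?)\s*m²"
    r"(?:,\s*(?P<floor>\d+)/(?P<floors>\d+)\s*floor)?"
)


def parse_title(title):
    """Split a raw title into one value per column; None if it does not match."""
    m = TITLE_RE.search(title)
    if m is None:
        return None
    return {
        "rooms": int(m["rooms"]),
        "area_m2": float(m["area"]),
        "floor": int(m["floor"]) if m["floor"] else None,
        "floors_total": int(m["floors"]) if m["floors"] else None,
    }


def parse_price(text):
    """Turn a price such as '18 500 000 tenge' into an integer; None if no digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None
```

Returning `None` for unparseable rows, rather than raising, makes it easy to count and inspect the ads that do not fit the expected pattern.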
With more structured data, we can turn to the analysis. First, we determine who the main submitters of ads on the site are. Below is a chart for ads from April 14 to April 19, 2018.
Before this analysis, I assumed that ads were posted only by real estate agencies and apartment owners; I did not expect to find representatives of developers and construction companies on the site, although their share is less than 4% of all ads.
Interestingly, even banks have started posting ads on the site. Apparently, they are selling mortgaged apartments.
More than half of the ads come from real estate agencies. Since the bulk of them duplicate ads already posted on the site by apartment owners, these data will be excluded from further analysis. However, they will still be used to identify the most successful real estate agency by number of ads.
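The share of each submitter type and the exclusion of agency ads can be computed in a few lines. This is a sketch under the assumption that every ad is a dictionary with a `submitter` field whose values include `"agency"`; those names are illustrative, not taken from the original data.

```python
from collections import Counter


def submitter_shares(ads):
    """Return {submitter type: share of all ads, in percent}."""
    counts = Counter(ad["submitter"] for ad in ads)
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


def without_agencies(ads):
    """Drop agency ads, which largely duplicate the owners' ads."""
    return [ad for ad in ads if ad["submitter"] != "agency"]
```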
To identify a problem house, we need the ratio of the number of ads for a house to the number of apartments in that house or residential complex, and then compare this ratio with the market average. While the number of ads for apartments in a particular house is available in our database, the total number of apartments in the house would have to be looked up on other sites. That requires additional time, so I had to postpone this work.
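Although this part was postponed, the comparison itself is simple arithmetic once the apartment counts are known. In the sketch below, the market average is taken as total ads divided by total apartments across all houses, and a house is flagged when its own ratio exceeds that average by a chosen `factor`; both the pooled average and the threshold are assumptions made for illustration.

```python
def problem_houses(houses, factor=2.0):
    """Flag houses with an unusually high share of apartments for sale.

    houses: {house name: (number of ads, number of apartments)}
    factor: how many times the market average a ratio must exceed (assumed threshold)
    """
    total_ads = sum(ads for ads, _ in houses.values())
    total_apartments = sum(apts for _, apts in houses.values())
    market_ratio = total_ads / total_apartments
    return sorted(
        name
        for name, (ads, apts) in houses.items()
        if ads / apts > factor * market_ratio
    )
```

For example, if one house has 30 of its 100 apartments listed while two others have 5 of 100 each, the market ratio is 40/300 (about 0.13), and only the first house exceeds twice that value.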
This histogram shows the number of apartment sale ads posted by real estate agencies from April 14 to April 19, 2018.
Most ads were published by the “Realty” real estate agency: 1,929 over the period. It is followed, in decreasing order of the number of ads, by agencies such as “Real Price”, “Vitrina” and “Novaya Ploshchadka”, which have more than 500 listings.
If we assume that the criterion of success in real estate services is the number of ads (the more ads, the greater the chance that a client will see them and contact the agency), then the undisputed leader is the real estate agency “Floor”.
This histogram shows the number of ads published from a single phone number. For example, the first column, labeled “2”, shows that more than 1,460 people submitted more than 2 ads on the site.
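A histogram of this kind can be built by counting twice: first the ads per phone number, then the phone numbers per ad count. The sketch below assumes the input is simply a list with one phone number per collected ad; the function names are illustrative.

```python
from collections import Counter


def ads_per_phone_histogram(phones):
    """Map 'number of ads' to 'how many phone numbers posted that many ads'."""
    ads_per_phone = Counter(phones)          # phone -> number of ads
    return Counter(ads_per_phone.values())   # number of ads -> number of phones


def frequent_posters(phones, min_ads=2):
    """Return the phone numbers that appear in at least `min_ads` ads."""
    return sorted(p for p, n in Counter(phones).items() if n >= min_ads)
```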
Many realtors, aware of the negative attitude towards them, often post ads as owners, as evidenced by the number and frequency of such ads.
If I were the tax authority, I would first determine the TIN of the owners of these phone numbers, then check the database for how much property is registered to each of them and the average tenure of that property. If the tenure is less than 1-2 years, or the owner often buys and sells apartments within a short time at the same price (that is, supposedly without economic benefit), this owner deserves a closer look.
It should be noted that the data analysis in this article was superficial; I was more interested in understanding what difficulties and pitfalls can be encountered in obtaining and processing data, for example:
Each of these questions can be examined in more detail, for example by identifying problem houses not by the number of ads but by how often they are posted, but we will leave this to the more inquisitive and to people with free time.
Source: https://habr.com/ru/post/359459/