
How programmers look for apartments

[image — caption: "In fact, everything is going wrong..."]

A friend of mine asked me to write this article. It tells the story of his adventures, which might (could) be misunderstood by the administrators of certain Internet resources. And those, in turn, might (could) complain about my friend to the proper authorities. So I am writing this article in his words. And he has left. For Honduras. So that's that.

Problem


A couple of years ago, a moment (finally!) came in my life when I needed to buy an apartment. All that remained was to find it. The matter was complicated by the fact that I had my own views on what my ideal apartment should be. Namely, it had to be on the top floor — so that nobody would stomp around over my head. And anyway, it's simply more convenient that way.

The central local property search site (the absolute majority of agencies and owners list their apartments there) was, to put it mildly, "made a little inconvenient". Its apartment search offered the settings standard for such services: year of construction, number of floors, price, not (!) the last/first floor, and so on. But when I asked it for an apartment with a separate bathroom, it would sometimes return one with a combined bathroom. A similar story with the balcony. And if the search sometimes returned apartments that did not match my query, then perhaps it was also failing to show ones that did. With my filter (apartment on the top floor, separate bathroom, more than 5 floors in the building, not far from the metro, and blah blah blah), many apartments, by definition, could never make it into the results...

Kowalski, options!


There was only one option left: download all the apartments from the site to my own machine — save them into some database, take SQL into my hands (well, or something of the sort), "and off we go" (c).
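
Once the data is local, the querying side really is that simple. A minimal sketch of the kind of query I had in mind — the schema and most column names here are invented for illustration, not the parser's actual ones:

    import sqlite3

    # Invented schema, purely for illustration
    connection = sqlite3.connect('flats.db')
    cursor = connection.cursor()
    cursor.execute('''SELECT flatAddress, flatPriceInfo FROM Flats
                      WHERE flatFloor = flatFloorsCount      -- top floor only
                            AND flatFloorsCount > 5
                            AND flatNearestSubwayStationDistance <= 2000
                      ORDER BY flatPriceInfo''')
    for (flatAddress, flatPriceInfo) in cursor.fetchall():
        print('%s: %s' % (flatAddress, flatPriceInfo))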

Getting the data there, though, was easier said than done. The first idea was to study the site's engine, look for holes in it, get to the server where all the apartment data lives, and copy it from there. But that would be bad; besides, my qualifications were apparently not up to it at the time.

The target site had a section for real estate agencies — partnership and all that. There, provided you were an agency, you could get (buy?) access to specialized software which, judging by its manual and the screenshots, made it possible to submit ads automatically on behalf of the agency (hello, spammers?). In theory, one could also dig the server-side endpoints out of this software and pull the apartment data through them. For that, I think, my qualifications would have sufficed. But I had no access to the software, and becoming an agency was not something I wanted.

So there was nothing left to do but to write a ...

Parser


We visit the site programmatically, "search" for all the apartments, parse the results, and save them into a local database. I decided to write the parser in Python — it was a relatively new language for me at the time, and it was a good way to raise my level in it (hence the code is what it is).

For fetching pages, the standard urllib was used:

    from urllib import FancyURLopener, quote_plus

    ...

    flatsPageContent = urlOpener.open(flatsPageURL).read()

For parsing the HTML, I decided (after some intensive googling) to use the lxml library:

    from lxml.html import parse

    ...

    flatsPageDocument = parse(flatsPageFilePath).getroot()
    if flatsPageDocument is not None:
        flatsTables = flatsPageDocument.xpath('//*[@id="list"]')
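
Individual fields then come out of the results table with a bit more XPath. A minimal sketch of the idea — the row and cell structure below is invented; the real site's markup was different:

    # The row and cell structure is invented for illustration only
    for flatRow in flatsPageDocument.xpath('//*[@id="list"]//tr'):
        cellTexts = flatRow.xpath('./td/text()')
        if len(cellTexts) >= 3:
            (flatAddress, flatFloorInfo, flatPriceInfo) = cellTexts[:3]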

All of this is trivial and uninteresting. The interesting parts were elsewhere.

How far is the subway?


As a carless man who travels strictly by public transport, I considered the proximity of the metro to my future apartment critical — say, no more than 2,000 meters away. Hence the idea: for each apartment, determine the nearest metro station and the distance to it. And here is the implementation:

Some code
    def getFlatLocation(flatPageName, flatAddress, mode, geoDBCursor):
        logging.info('Retrieving geo code info for flat \'%s\' (mode \'%s\')...' % (flatPageName, mode))
        flatFullAddress = (flatBaseAddress + flatAddress).encode('utf8')
        geoCodeResult = ''
        isGeoCodeResultCached = 1
        # First, try the local cache of previously geocoded addresses
        geoDBCursor.execute("SELECT geoCode FROM %s WHERE address = ?" % ("GeoG" if mode == 'G' else "GeoY"), (flatFullAddress,))
        geoCodeResultRow = geoDBCursor.fetchone()
        if geoCodeResultRow is not None:
            geoCodeResult = geoCodeResultRow[0]
        if geoCodeResult is None or len(geoCodeResult) == 0:
            # Cache miss - ask Google (mode 'G') or Yandex for the coordinates
            isGeoCodeResultCached = 0
            geoCodeURL = ('http://maps.google.com/maps/api/geocode/json?sensor=false&address=' if mode == "G"
                          else 'http://geocode-maps.yandex.ru/1.x/?format=json&geocode=') + quote_plus(flatFullAddress)
            urlOpener = UrlOpener()
            geoCodeResult = urlOpener.open(geoCodeURL).read()
            if geoCodeResult is None:
                geoCodeResult = ''
        logging.info('Geo code result for flat \'%s\' was fetched (mode \'%s\', from cache - %d)' % (flatPageName, mode, isGeoCodeResultCached))
        flatLocation = 0
        geoCodeJson = json.loads(geoCodeResult)
        if geoCodeJson is not None and (len(geoCodeJson['results']) if mode == 'G' else len(geoCodeJson['response'])):
            if isGeoCodeResultCached == 0:
                geoDBCursor.execute("INSERT INTO %s VALUES (?, ?)" % ("GeoG" if mode == 'G' else "GeoY"), (flatFullAddress, geoCodeResult))
            if mode == "G":
                geoCodeLocation = geoCodeJson['results'][0]['geometry']['location']
                flatLocation = {'lat': float(geoCodeLocation['lat']), 'lng': float(geoCodeLocation['lng'])}
            else:
                # Yandex returns "lng lat" as a single space-separated string
                geoCodeLocation = geoCodeJson['response']['GeoObjectCollection']['featureMember'][0]['GeoObject']['Point']['pos']
                (flatLocationLng, flatLocationLat) = re.search('(.*) (.*)', geoCodeLocation).group(1, 2)
                flatLocation = {'lat': float(flatLocationLat), 'lng': float(flatLocationLng)}
            logging.info('Geo code info for flat \'%s\' was retrieved (mode \'%s\')' % (flatPageName, mode))
        else:
            logging.warning('Geo code info for flat \'%s\' was NOT retrieved (mode \'%s\')' % (flatPageName, mode))
        return (flatLocation, isGeoCodeResultCached)


As can be seen from the code, both Google and Yandex are used as sources of geocoding data. Why not just one of them? Because for new streets (as well as for old or incorrectly entered ones), either source could return wrong or averaged data — for example, the coordinates of the city center. With two engines queried at once, obviously wrong results could be screened out by eye. Naturally, both Google and Yandex had daily per-IP request quotas, so the results of "resolving" addresses were carefully stored in the database for reuse on subsequent parser runs.
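
Having two engines also suggests an automatic cross-check. Here is a sketch of the idea (not the parser's actual code — in practice I screened results by eye, and the 500 m threshold is arbitrary), reusing getFlatLocation above and the calculateDistance helper shown below:

    def isLocationSuspicious(flatPageName, flatAddress, geoDBCursor, maxDiscrepancy=500):
        # Geocode the same address with both engines and compare the results;
        # a large discrepancy means at least one of them most likely guessed
        (googleLocation, _) = getFlatLocation(flatPageName, flatAddress, 'G', geoDBCursor)
        (yandexLocation, _) = getFlatLocation(flatPageName, flatAddress, 'Y', geoDBCursor)
        if googleLocation == 0 or yandexLocation == 0:
            return True  # at least one engine failed to geocode the address at all
        return calculateDistance(googleLocation, yandexLocation) > maxDiscrepancy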

With the help of Google Maps, a table was filled in with the coordinates of the metro stations, including those still under construction. And the distance was determined using the haversine formula:

    def calculateDistance(location1, location2):
        # haversine formula, see http://www.movable-type.co.uk/scripts/latlong.html for details
        R = 6371 * 1000  # radius of the Earth in m
        dLat = (location2['lat'] - location1['lat']) * (math.pi / 180)
        dLng = (location2['lng'] - location1['lng']) * (math.pi / 180)
        a = math.sin(dLat / 2) * math.sin(dLat / 2) + \
            math.cos(location1['lat'] * (math.pi / 180)) * math.cos(location2['lat'] * (math.pi / 180)) * \
            math.sin(dLng / 2) * math.sin(dLng / 2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        d = R * c
        return d

And here is how the nearest metro station was found:

    def getFlatDistanceInfo(flatLocation):
        flatSubwayStationDistances = map(lambda subwayStationInfo: calculateDistance(flatLocation, subwayStationInfo['location']),
                                         subwayStationInfos)
        flatNearestSubwayStationDistance = min(flatSubwayStationDistances)
        flatNearestSubwayStationName = subwayStationInfos[flatSubwayStationDistances.index(flatNearestSubwayStationDistance)]['name']
        flatTownCenterDistance = flatSubwayStationDistances[0]  # station #0 doubles as the city-center reference point
        return (flatNearestSubwayStationName, flatNearestSubwayStationDistance, flatTownCenterDistance)
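
Putting it together, a usage sketch with made-up coordinates, applying the 2,000-meter cutoff mentioned earlier:

    flatLocation = {'lat': 53.906, 'lng': 27.555}  # made-up coordinates
    (stationName, stationDistance, centerDistance) = getFlatDistanceInfo(flatLocation)
    if stationDistance <= 2000:
        logging.info('Flat is %.0f m from station \'%s\' - close enough' % (stationDistance, stationName))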


Apartment price tracking


All of us have probably read articles like "Apartment prices in city N have begun to fall by X% a month" more than once. Well, I formed my own opinion on the subject.

Once all the extracted apartments were saved to the local database, it became possible to track changes in their prices. By looking into the old database and finding the same apartment there, the parser could compute the delta of its price:

    isFlatInfoUpdated = 0
    flatPriceDelta = 0
    if len(oldFlatsDBFilePath):
        oldFlatsDBCursor.execute('''SELECT flatPriceInfo FROM Flats
                                    WHERE flatPageURL = ? AND flatAddress = ? AND flatWholeSquare = ?
                                          AND flatLivingSquare = ? AND flatKitchenSquare = ?''',
                                 (flatPageURL, flatAddress, flatWholeSquare, flatLivingSquare, flatKitchenSquare,))
        oldFlatInfoRow = oldFlatsDBCursor.fetchone()
        if oldFlatInfoRow is not None and oldFlatInfoRow[0] is not None:
            isFlatInfoUpdated = 1
            oldFlatPriceInfo = oldFlatInfoRow[0]
            try:
                flatPriceDelta = float(flatPriceInfo) - float(oldFlatPriceInfo)
            except ValueError:
                pass
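
With deltas accumulated over several runs, you can do your own "market analysis". A sketch, assuming (hypothetically) that the computed deltas are written back to a flatPriceDelta column of the Flats table:

    # Hypothetical aggregation over a hypothetical flatPriceDelta column
    flatsDBCursor.execute('SELECT COUNT(*), AVG(flatPriceDelta) FROM Flats WHERE flatPriceDelta != 0')
    (changedFlatsCount, averageFlatPriceDelta) = flatsDBCursor.fetchone()
    logging.info('%d flats changed price, average delta: %s' % (changedFlatsCount, averageFlatPriceDelta))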

So every time I read yet another analysis of the real estate market, I smiled, knowing that "my" apartments were not rising in price at all. Maybe nobody needed them but me?

Are you separate or combined?


I am a programmer, and programmers think a lot. And can one really think properly in a combined bathroom?

The problem was that the property search site hid this information inside the apartment description page and did not show it in the search results list. So a special parser mode was added, called "flatsDeepParseMode". As they say, "We need to go deeper" (c). It allowed the parser to download not only the search results pages but also the apartment description pages themselves, and to extract from them the additional information about the bathroom and other details.
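
Mechanically it is the same fetch-and-parse, just one level deeper. A rough sketch — the selector and field names below are invented; the real page structure was different:

    from lxml.html import fromstring

    if flatsDeepParseMode:
        # Fetch the flat's own description page and pull out the details
        # that the search results list does not show
        flatPageContent = urlOpener.open(flatPageURL).read()
        flatPageDocument = fromstring(flatPageContent)
        bathroomNodes = flatPageDocument.xpath('//*[@id="bathroom"]/text()')  # invented selector
        flatBathroomInfo = bathroomNodes[0] if bathroomNodes else ''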

Fault tolerance


In deep parsing mode, the script could put a heavy load on the server, hammering it with requests for thousands of pages. This, in turn, sometimes made the server pensive and, at times, made it refuse requests outright. So after a few such incidents, the script gained a mechanism for "re-requesting" pages, with a gradually increasing timeout between attempts.
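
A minimal sketch of such a retry helper (illustrative, not the parser's actual code):

    import time

    def openPageWithRetries(urlOpener, pageURL, maxAttempts=5, initialTimeout=1):
        # Retry failed requests, doubling the pause between attempts each time
        timeout = initialTimeout
        for attempt in range(1, maxAttempts + 1):
            try:
                return urlOpener.open(pageURL).read()
            except IOError:
                logging.warning('Attempt %d/%d for \'%s\' failed, retrying in %d s...' % (attempt, maxAttempts, pageURL, timeout))
                time.sleep(timeout)
                timeout *= 2
        return None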

Disguise


Then one day the script stopped working. There were messages that the server would not respond, timeouts, and blah blah blah. It turned out that the owners of the real estate site had hired a group of specially trained people to filter the user agents connecting to the server (seriously?). And my script got caught in the sweep. The fix was simple — the script started pretending to be a browser:

    class UrlOpener(FancyURLopener, object):
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
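
(urllib's URLopener — and hence FancyURLopener — sends the contents of its version attribute as the User-Agent header on every request, so this single override is enough to pass for Firefox.)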

But one day a terrible thing happened ...

You are banned!


Yes, I was banned. And, as it turned out, not only me. Arriving at work in the morning (I still had to earn the money for the apartment), I saw the already familiar error message: the server this, timeouts that, blah blah blah. Changing the user agent to another browser's did not help. In fact, even real browsers could no longer open the real estate site... Yes, our entire static IP had been banned on the server side. I do not know why that happened. Perhaps "some kind of virus program was sending many requests to the server", or a few dozen of the company's employees had decided to go apartment hunting. Be that as it may, we were banned.

As it happened, our company's lawyers needed to search for something on that very site (perhaps an apartment for our overseas colleagues). And they gave the administration of the brazen resource no peace: we had done nothing of the kind, nobody had DDoSed anybody, true true. In short, we got unbanned. Honestly.

Features, features, features ...


The parser learned to do a lot more besides: parse apartments up to a given price, mark removed and newly added listings, count the number of photos per apartment, and so on.
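
Marking removed and newly added listings, for example, boils down to a set difference over listing URLs between the old and the new database. Roughly (a sketch; the cursor names follow the earlier snippets):

    oldFlatsDBCursor.execute('SELECT flatPageURL FROM Flats')
    oldFlatPageURLs = set(row[0] for row in oldFlatsDBCursor.fetchall())
    flatsDBCursor.execute('SELECT flatPageURL FROM Flats')
    newFlatPageURLs = set(row[0] for row in flatsDBCursor.fetchall())

    removedFlatPageURLs = oldFlatPageURLs - newFlatPageURLs  # gone since the last run
    addedFlatPageURLs = newFlatPageURLs - oldFlatPageURLs    # appeared since the last run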

...

We'll take it, wrap it up


In the end, I found my ideal apartment. Top floor, near the metro, the whole package. Would I have found it without writing the parser? I don't know — maybe. But that would have been unsporting, somehow not programmer-like...

PS


And yes, I remember the old article on Habr where the same sort of enthusiast-pervert as myself parsed and analyzed apartments in R. He, too, ended up buying the right apartment. Which means the approach works.

By the way, my parser may no longer work, or may work incorrectly (due to possible changes on the site), since it has not been used for a long time. And be careful with it — you can get banned (there have been cases).

And the code?!


At my friend's request, I have published the parser's source code on bitbucket.org. In the repository you can also find a file with a rather large SQL query that renders all the extracted data. The code, of course, is posted for reference only.

Thank you all for your attention.

Source: https://habr.com/ru/post/242085/

