No sooner had the voting on joining or leaving the renovation program ended than the turnout data for each particular house disappeared from the Moscow mayor's website; only the votes for and against were left. The news reports do cite certain figures, of course, but you want to see them for yourself, play around with the statistics, build graphs, don't you?
And after statements like:
As for the popularity of these services, the "My Documents" public service centers attracted a little more than half of all those who voted, slightly yielding to the "Active Citizen" portal
certain light doubts arise. So, let's start collecting the data! And then we will analyze it. For that we need a language (say, Python), some kind of database (say, SQLite), and some kind of web scraper; luckily, Python has plenty of those. I'll say right away: at the end I will give a link to the resulting database, and you can do anything you like with it.
We go to the City Hall website and scroll around. Data needs to be collected for 4543 houses. We click on a random house, then on the regional lists, and look at what format the data comes in.
Apparently, each house has a certain id, which we see in the url:
https://www.mos.ru/otvet-stroitelstvo/itogi-golosovaniya-zhitelej-po-proektu-programmy-renovacii/?u=121
So, to process all the houses, we need to get a list of all the IDs somewhere and go through them, taking the necessary information from each page, namely: the number of votes for, the number of votes against, the total number of votes (there are also "undecided" votes, when half of an apartment voted for and half against), and whether there was a general house meeting (why? what if it comes in handy).
Where to get the list of IDs? As we have already seen, the regional lists contain no links to the houses listed; it is just a textual list of addresses and nothing more. A pity, it would have been convenient. Looks like we will have to brute-force it. But for starters, let's write a function that collects data for one specific ID.
To begin with, let's make a GET request for some house and look at the response, to understand what we need to scrape from it.
import requests

r = requests.get('https://www.mos.ru/otvet-stroitelstvo/itogi-golosovaniya-zhitelej-po-proektu-programmy-renovacii/?u=121')
print(r.text)
We are surprised to find that there is no voting data in the response; we are given some kind of page template. Let's see with our own eyes how the results load. Indeed, at first the browser shows a page with empty fields, and after a while the data appears in them. Take a browser with the appropriate developer tools, for example Mozilla, and see what actually happens there.

Aha! We found a browser request to a certain API, and the ID in it is the same as the ID of the house we are looking at.
In other words, a GET request to
www.mos.ru/altmosprx/api/1/renovation/house_result/121
without any parameters gives us a JSON response of the following type:
{ "execTime": 0.044450044631958, "errorMessage": "", "result": { "table": "<table class="table table-big"><thead><tr class="table-header"><td> </td><td> </td><td></td><td></td><td> </td></tr></thead><tbody><tr><td class="apartment-id">0G6O4</td><td class="voting-info"><p>bf659227e8e3</p><p>5f9403659209</p></td><td class="voting-choice"><p></p><p></p></td><td class="voting-date"><p>18.05</p><p>18.05</p></td><td class="apartment-position apartment-agree"><p></p></td></tr><tr><td class="apartment-id">0G6O5</td><td class="voting-info"><p>3f12be5cea77</p></td><td class="voting-choice"><p></p></td><td class="voting-date"><p>15.05</p></td><td class="apartment-position apartment-agree"><p></p></td></tr><tr> ... <td class="apartment-id">0G6V1</td><td class="voting-info"><p>5acd126a410ea1a842e67066ea68fa8f</p></td><td class="voting-choice"><p></p></td><td class="voting-date"><p>24.05</p></td><td class="apartment-position apartment-agree"><p></p></td></tr></tbody></table>", "total": { "und": 0, "za": 100, "protocol_res": 0, "protiv": 0, "gorod_mark": 0, "protocol_date": null, "house_status": 1, "gorod": 0 }, "und_table": "<table class=\"table table-big\"><thead><tr class=\"table-header\"><td> </td><td> </td><td></td><td></td><td> </td></tr></thead><tbody></tbody></table>", "address": " , 63, 2" }, "request_id": "empty_requestid", "errorCode": 0 }
Well, here we don't even need to scrape anything; all the data comes straight from the API. All we need is to count the number of apartment-id occurrences (the total number of votes), and take the ready-made data about the percentage of votes for and about the general house meeting.
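A minimal sketch of pulling out exactly those three things, based on the JSON structure shown above (the URL and field names are taken from the observed response):

import requests

r = requests.get('https://www.mos.ru/altmosprx/api/1/renovation/house_result/121/')
r.encoding = 'utf-8'
res = r.json()['result']

# every apartment row carries an apartment-id cell, in both tables
total_votes = res['table'].count('apartment-id') + res['und_table'].count('apartment-id')
percent_za = res['total']['za']                        # percentage of votes "for"
meeting_happened = bool(res['total']['protocol_res'])  # was there a general house meeting
print(total_votes, percent_za, meeting_happened)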
But still, what range of IDs should we look through? There should be 4543 houses in total. We give the API 0: such a house exists. We give -1: an error, well, thanks for that. The lower limit is determined. We give 10,000: such a house exists. Okay, there are clearly more than 4543 of them. Let's try to determine the upper limit approximately by looking at a house in the recently annexed territories. Back to the map, we dig somewhere in "New Moscow", away from the "old" one, say: Kokoshkino settlement, dacha village of Kokoshkino, Truda Street, 2, id: 440931. So there are at least half a million of them.
Going through half a million links in an ordinary loop is not the best idea, so we'll use the concurrent.futures module. Of course, we could use something like asyncio, but the task is not that large-scale and we can get away with "little blood". Everything is very simple. We look at what the API returns when the house ID is known to be valid and when it is obviously invalid, and write a function to check for this. Then we run it all in a loop with parallel requests, since there are quite a lot of IDs to process. Then we parse the result and write it down. In general, you get something like this:
import requests
from concurrent.futures import ProcessPoolExecutor
import concurrent.futures

def check(url):
    # The body of this function was lost from the original post; a minimal
    # version, assuming (per the sample response above) that the API returns
    # errorCode 0 for an existing house and an error otherwise.
    try:
        r = requests.get(url)
        return r.json()['errorCode'] == 0
    except ValueError:
        return False
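The rest of the block did not survive either, so here is a hedged sketch of how the parallel scan might look: feed batches of candidate IDs to a pool of six workers (the worker count comes from the next paragraph) and append the hits to a file. The batching, the file name and the scan() helper are my assumptions, not the original code.

from concurrent.futures import ProcessPoolExecutor, as_completed

BASE = 'https://www.mos.ru/altmosprx/api/1/renovation/house_result/'

def scan(start, stop, workers=6):
    # Check ids in [start, stop) in parallel and return the valid ones.
    hits = []
    with ProcessPoolExecutor(max_workers=workers) as ex:
        futures = {ex.submit(check, BASE + str(i) + '/'): i for i in range(start, stop)}
        for fut in as_completed(futures):
            if fut.result():
                hits.append(futures[fut])
    return hits

if __name__ == '__main__':
    # Scan in batches so we don't queue millions of futures at once.
    with open('mosres.txt', 'a') as f:
        for start in range(0, 1_000_000, 10_000):
            for i in sorted(scan(start, start + 10_000)):
                f.write('{}\n'.format(i))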
We start it and go about our business; this takes a long time even with 6 workers. Looking ahead, I'll say that after processing the first million I had about 70 houses fewer than in the official results, so I had to crank this grind up to 10 million. That takes a long while, so I left it running and went about my day.
Of course, we could raise the number of parallel requests, but we are using someone else's resource, so we should behave politely (or we might suddenly get banned).
So now we have a list of all the house IDs, and we need to process them and take all the available information. Handling an array like this through a plain text file is inconvenient, so let's use sqlite3.
Create the database. All fields are obvious, except that we will split the house number and additional elements such as building or block designations into separate fields, just in case we later want, for example, to look at trends among neighboring houses.
import sqlite3

schema = """CREATE TABLE `houses` (
    `id` INTEGER PRIMARY KEY,
    `street` TEXT NOT NULL,
    `house_nbr` TEXT NOT NULL,
    `house_additional` TEXT,
    `total_votes` INTEGER,
    `total_za` INTEGER,
    `meeting` INTEGER DEFAULT '0',
    `flats` INTEGER
);"""

conn = sqlite3.connect('renovation.db')
cur = conn.cursor()
db = cur.execute(schema)
conn.commit()
conn.close()
Now, to business! Let's write a function that fetches information on a house and adds it to the database, while recording all JSON errors (most likely, there was no JSON in the response) and unknown errors (most likely, there was no response at all), so that we can look at them later and rerun them separately, if there are any, of course.
import requests
import re
import sqlite3

def gethouseinfo(idd):
    print(idd)
    urly = 'https://www.mos.ru/altmosprx/api/1/renovation/house_result/' + str(idd) + '/'
    try:
        r = requests.get(urly)
        r.encoding = 'utf-8'
        results = r.json()
        adress = results['result']['address']
        print(adress)
        # street, house number and an optional building part, comma-separated
        if re.match('(.*), (.*), (.*)', adress):
            adress_street = re.match('(.*), (.*), (.*)', adress).group(1)
            adress_house = re.match('(.*), (.*), (.*)', adress).group(2)
            adress_building = re.match('(.*), (.*), (.*)', adress).group(3)
        else:
            adress_street = re.match('(.*), (.*)', adress).group(1)
            adress_house = re.match('(.*), (.*)', adress).group(2)
            adress_building = ''
        # every apartment row carries an apartment-id cell, in both tables
        totalvotes = len(re.findall('apartment-id', results['result']['table'])) + len(re.findall('apartment-id', results['result']['und_table']))
        aye = results['result']['total']['za']
        meetinghappened = bool(results['result']['total']['protocol_res'])
        check = cur.execute('SELECT * FROM houses WHERE id=?', [idd])
        res = check.fetchone()
        if res:
            print('already exists')
        else:
            insert = cur.execute('INSERT INTO houses (id, street, house_nbr, house_additional, total_votes, total_za, meeting) values (?, ?, ?, ?, ?, ?, ?)',
                                 [idd, adress_street, adress_house, adress_building, totalvotes, aye, meetinghappened])
            print('added ' + str(idd))
    except ValueError:
        print('no data for id ' + str(idd))
        jsonerror.append(idd)
    except Exception:
        print('unknown eggog')
        unknownerror.append(idd)

jsonerror = []
unknownerror = []

with open('/home/deb/mosres.txt') as fc:
    mosres = fc.read().splitlines()

conn = sqlite3.connect('/home/deb/renovation.db')
cur = conn.cursor()
for house in mosres:
    gethouseinfo(house)
conn.commit()
conn.close()

if jsonerror:
    with open('/home/deb/jsonerror.txt', 'w') as f:
        for item in jsonerror:
            f.write('{}\n'.format(item))
if unknownerror:
    with open('/home/deb/unknownerror.txt', 'w') as f:
        for item in unknownerror:
            f.write('{}\n'.format(item))
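If the error lists turn out non-empty, they can simply be fed back through the same function. A small retry sketch; this pass is my assumption, the original only says the errors could be dealt with separately:

import sqlite3

# Hypothetical retry pass over the ids that failed with a JSON error the first time.
conn = sqlite3.connect('/home/deb/renovation.db')
cur = conn.cursor()
jsonerror, unknownerror = [], []  # gethouseinfo() appends to these globals
with open('/home/deb/jsonerror.txt') as f:
    for idd in f.read().splitlines():
        gethouseinfo(idd)
conn.commit()
conn.close()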
So, that's something. Now we have a database of the renovation votes, built from the public information on the City Hall website. We can finally dig in and (at last!) draw some graphs!
All results (sorted in ascending order, to clearly see the share of pros and cons; the X axis is houses, the Y axis is the percentage of votes "for"; the green line, for convenience, marks the 66 percent threshold that had to be cleared):

Distribution of votes "for":

All this is good, but we are interested in the turnout. And here it gets hard. The fact is that there is no normal, simple, centralized resource with the number of apartments per house. Perhaps 2GIS has it; at least in the samples of its paid data exports the number of apartments is there, but paying? That's not our way! We will take another route.
Merciless googling and Yandexing pointed to the site tvoyadres.ru/doma, which seems to list houses, often with apartment-count information. But how to collect it? Ideally, we should first collect the entire list of houses (well, at least streets) for which there is data, then try to link the addresses in our database, stored in the mayor's office format, with the addresses in this site's format, and then pull out the apartment counts for those we managed to link. Shall we start with the streets? There is tvoyadres.ru/ulitsy, but paging through 200 pages, scraping and processing each one, is very dreary. Perhaps we can find some kind of API here?
Success awaited on the city's page, where there is not only a list of streets but also a "More streets" button:
tvoyadres.ru/moskovskaya-oblast/goroda/551
Aha! A request of the form
http://tvoyadres.ru/js/street.php?region=81&city=&count=2073&_=1499809159225
gives us a list of streets with links to them (and the links contain IDs). Great, it remains to understand what the parameters in the query mean. The region we obviously won't touch, nor the city. Let's try dropping the last, incomprehensible one, leaving only
tvoyadres.ru/js/street.php?region=81&city=Moscow&count=2073 and the result is the same. OK, press the button again: we see the same request sent, but with count 100 less. Let's play with this parameter manually.
0: an error is returned. 1: one street is returned. 2: two streets are returned, one of which we have already seen. 100: returns 100 streets. 200: returns another 100 streets. We started with 2073, so let's try 2173? Yes, these are the very first hundred streets that were displayed on the city page. 2174?
Critical error
You have an error in your SQL syntax; check the syntax for the use of mySQL server.
Oops. Well, it seems count goes into the LIMIT of a SELECT query: the limit is always 100 rows, and the row to start from is computed as count minus 2173. Curiously, by the way, I could not pull off an SQL injection here, since the value is not inserted into the query directly but computed, though if something other than a number lands in there, it falls over trivially. Anyway, we have what we need. Great, now to the point.
Everything here is even simpler than before:
# base URL assumed from the request we saw above
url = 'http://tvoyadres.ru/js/street.php?region=81&city=Moscow&count='

def getstreets(num):
    r = requests.get(url + str(num))
    results = r.json()
    result = results['string']
    return result

totalres = ''
for i in range(1, 2272, 100):
    totalres += getstreets(i)
and write down the result. The result looks like a heap of HTML code. Generally, at this point we could already move on to something better than regular expressions, but since the task is very simple, we'll just pull out the streets with their IDs and make a dict keyed by street name.
# findall returns (id, name) pairs; flip them into a name -> id dict
sids = {name: sid for sid, name in re.findall('ulitsy\/(.*?)">(.*?)<\/a>', totalres)}
But then we need to get the houses from the street pages, in other words, to work over a heap of HTML heaps. Trying to parse all that with a pile of regular expressions would cost more than it's worth. So let's get acquainted with BeautifulSoup.
We write a function and run it to build a separate dict, where the keys are street IDs and the values are houses. The logic is as follows: from a street's name we can get its ID, and from the ID, its houses.
import re
from bs4 import BeautifulSoup

def gethouses(num):
    r = requests.get('http://tvoyadres.ru/moskovskaya-oblast/moskva/ulitsy/' + str(num) + '/')
    results = r.text
    soup = BeautifulSoup(results, 'html.parser')
    ul = soup.find("ul", {"class": "next"})
    houses = []
    try:
        for li in ul.find_all("li"):
            urly = li.a['href']
            urly = re.search('doma\/(.*)\/', urly).group(1)
            houses.append([li.get_text(), urly])
        return houses
    except:
        # street page without a house list
        print('None')
        return 'None'

totalyres = {}
for key in sids:
    num = sids[key]
    totalyres[num] = gethouses(num)
Two things come to mind here. The first: instead of a dict where each street maps to a list whose items are [house name, house ID] pairs, let's make a dict of nested dicts. This is done with a simple loop (see the sketch below), so we won't dwell on it; let's just call this dict totre. Yes, by this point my naming imagination had completely deserted me. Sorry.
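The "simple loop" might look like this (a sketch: the lowercasing matches how house numbers are compared further below, and gethouses() returned the string 'None' for streets it failed on):

totre = {}
for street_id, houses in totalyres.items():
    if houses == 'None':  # streets where gethouses() found no house list
        continue
    # nested dict: lowercased house name -> house id
    totre[street_id] = {name.lower(): house_id for name, house_id in houses}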
The second: the URL of each house includes not only an ID, but also a transliteration of the street name!
An example.
for key in totre:
    urlo = 'http://tvoyadres.ru/moskovskaya-oblast/moskva/ulitsy/' + key + '/'
    ra = requests.get(urlo)
    try:
        streetname = re.search('<ul class="next"><li><a href="\/moskovskaya-oblast\/moskva\/(.+?)\/doma\/', ra.text).group(1)
        totre[key]['streetname'] = streetname
    except:
        print(key)
And here we come to almost the hardest part. We need to link the street names we have in the database from the mayor's office with the IDs, using the street names from tvoyadres.ru.
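The heavy lifting is done by difflib.get_close_matches from the standard library. A quick illustration with made-up street names, just to show the two knobs (number of matches and similarity cutoff) that the glue() function below exposes as maxres and freq:

import difflib

candidates = ['pervomayskaya', 'parkovaya', '1-ya pervomayskaya']
# n caps how many candidates come back; cutoff (0..1) is the minimum similarity
print(difflib.get_close_matches('ulitsa pervomayskaya', candidates, n=3, cutoff=0.6))
# a looser cutoff lets more doubtful candidates through for manual review
print(difflib.get_close_matches('ulitsa parkovaya', candidates, n=10, cutoff=0.4))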
But it is not enough to hope for difflib alone: we will need a manual check by the user, because similarity scores are all well and good, yet we want to avoid silly mistakes. We look at the address formats, notice that in one place the letter "ё" is absent as a matter of principle, in another the word "street" is dropped from the street name, and we do this:
import difflib
import json
import sqlite3

conn = sqlite3.connect('renovation.db')
cur = conn.cursor()
streets = cur.execute('SELECT DISTINCT street FROM houses ORDER BY street ASC')
streeets = streets.fetchall()
conn.close()

exactmatches = {}
keyslist = list(sids.keys())

def glue(maxres=3, freq=0.6):
    for each in streeets:
        eachnoyo = each[0].replace('ё', 'е')  # the site spells street names without "ё"
        diffres = difflib.get_close_matches(eachnoyo, keyslist, maxres, freq)
        if each[0] not in exactmatches.keys():
            if len(diffres) == 1:
                print(each[0] + ': ' + diffres[0])
                notcompleted = False
                while notcompleted == False:
                    inp = input('Correct? y/n ')
                    if inp == 'y':
                        notcompleted = True
                        exactmatches[each[0]] = sids[diffres[0]]
                    elif inp == 'n':
                        notcompleted = True
                    else:
                        print('Incorrect input, try again')
            elif len(diffres) == 0:
                print('No matches for ' + each[0])
            elif len(diffres) > 1:
                print(each[0] + ': ' + str(diffres))
                notcompleted = False
                while notcompleted == False:
                    inp = input('List number? Or n ')
                    try:
                        listnum = int(inp)
                    except ValueError:
                        listnum = None
                    if inp == 'n':
                        notcompleted = True
                    elif listnum in range(0, len(diffres)):
                        notcompleted = True
                        # take the street the user actually picked from the list
                        exactmatches[each[0]] = sids[diffres[listnum]]
                    else:
                        print('Incorrect input, try again')

glue()  # first pass with the stricter default parameters

with open('exactmatches.json', 'w') as f:
    json.dump(exactmatches, f, ensure_ascii=False)
We sit at the console, look at the output, press the buttons. When the whole pass is finished, we run the function again with gentler parameters, for example glue(10, freq=0.4).

My patience lasted for 506 streets out of 700-odd: in my opinion an excellent result, and most importantly, (most likely) statistically significant.
And now we need to do the same for the houses, and also grab, at last, their apartment counts. And put them into the database.
conn = sqlite3.connect('renovation.db')
cur = conn.cursor()
allhouses = cur.execute('SELECT * FROM houses WHERE flats IS NULL ORDER BY id')
allhousesres = allhouses.fetchall()

url2 = 'http://tvoyadres.ru/moskovskaya-oblast/moskva/'

def getnumberofflats(streetname, houseid):
    urlo = url2 + str(streetname) + '/doma/' + str(houseid) + '/'
    r = requests.get(urlo)
    results = r.text
    numbe = re.search('<span class="left"> <\/span> <span class="right">(\d*)<', results).group(1)
    return numbe

def gluehousesnumbers(freq=3, ratio=0.6):
    for house in allhousesres:
        if house[1] in exactmatches.keys():
            housenbr = house[2].replace('ё', 'е')  # normalize, as with the streets
            if house[3]:
                housenbr = housenbr + ' ' + house[3]
            housenbr = housenbr.lower()
            diffres = difflib.get_close_matches(housenbr, totre[exactmatches[house[1]]].keys(), freq, ratio)
            if len(diffres) == 1:
                print(housenbr + ': ' + diffres[0])
                notcompleted = False
                while notcompleted == False:
                    inp = input('Correct? y/n ')
                    if inp == 'y':
                        notcompleted = True
                        try:
                            flatsnumber = getnumberofflats(totre[exactmatches[house[1]]]['streetname'], totre[exactmatches[house[1]]][diffres[0]])
                            insertion = cur.execute('UPDATE houses SET flats = ? WHERE id = ?', [flatsnumber, house[0]])
                        except:
                            print('weird, no flat number for ' + str(house))
                    elif inp == 'n':
                        notcompleted = True
                    else:
                        print('Incorrect input, try again')
            elif len(diffres) > 1:
                print(housenbr + ': ' + str(diffres))
                notcompleted = False
                while notcompleted == False:
                    inp = input('List number? Or n ')
                    try:
                        listnum = int(inp)
                    except ValueError:
                        listnum = None
                    if inp == 'n':
                        notcompleted = True
                    elif listnum in range(0, len(diffres)):
                        notcompleted = True
                        try:
                            # use the house the user picked from the list
                            flatsnumber = getnumberofflats(totre[exactmatches[house[1]]]['streetname'], totre[exactmatches[house[1]]][diffres[listnum]])
                            insertion = cur.execute('UPDATE houses SET flats = ? WHERE id = ?', [flatsnumber, house[0]])
                        except:
                            print('weird, no flat number for ' + str(house))
                    else:
                        print('Incorrect input, try again')

conn.commit()
conn.close()
We carry on with the fun. The main thing is not to forget to commit the changes after each run of the function. I deliberately did not put the connection and the commit inside the functions themselves, so as not to hammer the database constantly.
Yes, this is quite tiring. But, again, if you leave everything to difflib's mercy, there will be silly errors, as in the screenshot, where it finds "35b" closer to "35" than to "35 b". This, of course, is not difflib's fault; and honestly, I would have spent more time crafting the perfect query and still stumbled somewhere. Better with user verification: more confidence.

Total: we have apartment counts for 3592 of roughly 4500 houses! An excellent result (if you don't praise yourself, nobody will). But, of course, when you confirm that many matches, there will be mistakes.

Let's discard the 43 houses in which more apartments voted than the house, apparently, has at all. Clearly, either I made a mistake there when confirming matches, or the data was plainly wrong.
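Finding them is a single query. A sketch, with the column names from the schema above; whether to delete the rows or just blank out the apartment count is a matter of taste, here the count is NULLed so these houses drop out of the turnout math:

import sqlite3

conn = sqlite3.connect('renovation.db')
cur = conn.cursor()
bad = cur.execute('SELECT id FROM houses WHERE total_votes > flats').fetchall()
print(len(bad))  # the text above says 43 such houses
cur.execute('UPDATE houses SET flats = NULL WHERE total_votes > flats')
conn.commit()
conn.close()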
With the rest, we can finally have fun. Turnout is obviously computed as the number of votes divided by the number of apartments. The graphs also draw themselves in the obvious way; the only thing is that plotting a point for every distinct turnout value makes no sense, since the result would look like hatching on paper. It is better to round the turnout values and take the average result for each rounded data point.
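A sketch of that rounding-and-averaging step, assuming matplotlib for the drawing (rounding here is to the nearest multiple of five; swap in round(turnout) for the integer version):

import sqlite3
from collections import defaultdict
import matplotlib.pyplot as plt

conn = sqlite3.connect('renovation.db')
rows = conn.execute('SELECT total_votes, total_za, flats FROM houses '
                    'WHERE flats IS NOT NULL AND flats > 0').fetchall()
conn.close()

buckets = defaultdict(list)
for votes, za, flats in rows:
    turnout = 100 * votes / flats
    buckets[5 * round(turnout / 5)].append(za)  # nearest multiple of five

xs = sorted(buckets)
ys = [sum(buckets[x]) / len(buckets[x]) for x in xs]  # mean "for" share per bucket
plt.plot(xs, ys)
plt.xlabel('turnout, %')
plt.ylabel('average vote "for", %')
plt.show()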
Rounded to the nearest integer:

Rounded to the nearest multiple of five:

In general, apart from a couple of anti-peaks around 80% turnout (or, with finer rounding, around 30%, 40% and 80%), turnout did not really affect the result. Except that 100% turnout, amazingly, always produced a 100% result. The average turnout was 58.7%.
Was it worth it? For me, yes: I learned a lot of new things. And for the reader? Well, for the reader I am posting the database itself.
Maybe you can do something more interesting with this data.