Reading the official data on the number of municipalities with text formatting options using the xlrd library

For one public project (visualization of budgets of settlements and regions), I needed data on the hierarchy of municipalities and their size.
I found the necessary data, but the form of their presentation for correct use left much to be desired.
All the necessary data was in one file , however, its formatting made it difficult to extract the hierarchy. All the names of municipalities, districts and regions were located in one column and differed only in the format of the cell and font. The area was bold, the area bold, and the settlement was indented. There were also several errors in the file, for example, one region was selected as an area (or, on the contrary, I don’t remember), and another place in the middle of the name of the formation encountered a line break (this moment became clear during the import to Google App Engine, when db. StringProperty () cursed on multiline), the source file in these places had to be fixed.

In search of a solution for how best to do this, I came across an xlrd library, the possibilities of which are sufficient for this task with interest. More information about it can be read here , but I will give the code of the import program and the available parameters when working with formatting.
The cell formatting and font type provide the following information:

Cell

alignment (XFAlignment object):
hor_align: 0
indent_level: 0
rotation: 0
shrink_to_fit: 0
text_direction: 0
text_wrapped: 0
vert_align: 2
background (XFBackground object):
background_colour_index: 65
fill_pattern: 0
pattern_colour_index: 64
border (XFBorder object):
bottom_colour_index: 64
bottom_line_style: 7
diag_colour_index: 0
diag_down: 0
diag_line_style: 0
diag_up: 0
left_colour_index: 64
left_line_style: 1
right_colour_index: 64
right_line_style: 1
top_colour_index: 64
top_line_style: 1
font_index: 7
format_key: 0
is_style: 0
lotus_123_prefix: 0
parent_style_index: 0
protection (XFProtection object):
cell_locked: 1
formula_hidden: 0
xf_index: 556

Font

struck_out: 0
underline_type: 0
underlined: 0
weight: 400
bold: 1
character_set: 204
colour_index: 32767
escapement_type: 0
family: 0
font_index: 10
height: 200
italic: 1
name: u'Arial Cyr '
outline: 0
shadow: 0

I didn’t understand all the parameters; I had enough of the font type and indent (ident) in the cell for this task. But it is clear that the possibilities are large, up to the color of a particular cell border or line type.

Script code

# -*- coding: utf-8 -*- from __future__ import unicode_literals import xlrd import json rb = xlrd.open_workbook('Tabl-35-12.xls', formatting_info=True) font_list = rb.font_list # list of all fonts in excel table sheet = rb.sheet_by_index(1) rows_number = sheet.nrows peoples_dict = {} # main dict for rownum in range(7, rows_number): # data starts with line 7 cell = sheet.cell(rownum, 0) value = cell.value.strip().replace('\n', ' ') # delete spaces at start and end peoples_count = sheet.cell(rownum, 1).value if peoples_count == 0 or peoples_count == '': # empty row - continue continue peoples_count = int(peoples_count) # from 12313.0 to integer cell_format = rb.xf_list[cell.xf_index] current_font = font_list[cell_format.font_index] bold = bool(current_font.bold) italic = bool(current_font.italic) indent = cell_format.alignment.indent_level is_region = bold and not italic is_raion = bold and italic is_municipal = (indent == 2) if is_region: region = value peoples_dict[region] = {'count': peoples_count} elif is_raion: raion = value peoples_dict[region][raion] = {'count': peoples_count} elif is_municipal: municipal = value peoples_dict[region][raion][municipal] = {'count': peoples_count} print peoples_dict[' ']['  ']['  ']['count'] with open('peoples.json', 'w') as outfile: json.dump(peoples_dict, outfile)

')
As a bonus, the resulting output file is in json format . Structure - nested dictionaries, where each element contains the key 'count', where the number of education is recorded, and the keys of its descendants.
That is, the number of the Moscow region can be obtained as follows:

 peoples_dict[' ']['count']

and the number of the city of Istria of the Istrinsky municipal district is:

 peoples_dict[' ']['  ']['  ']['count']

Source: https://habr.com/ru/post/182548/

All Articles

Reading the official data on the number of municipalities with text formatting options using the xlrd library

More articles: