
Parsing 0.5 TB of XML in a few hours: searching for organizations in the open data of the SME register of the Federal Tax Service

In my line of work (process automation and information systems architecture), one often needs to write a script and get a result "here and now" for a task that has arrived unexpectedly, in a situation where external developers cannot be brought in quickly.

This review is devoted to the solution of one such problem. At some point, it became necessary to analyze the month-by-month dynamics of the number of organizations with a certain type of activity, namely agricultural enterprises, based on the open data of the Unified Register of Small and Medium-Sized Enterprises of the Federal Tax Service (FTS). The approaches used to solve it will, I hope, be useful to those looking for ways to process large structured XML data sets when common tools such as SelectFromXML or online XML processors are not suitable for some reason: the functionality is limited, there are problems with Cyrillic encodings, the required performance is not reached, or the hardware resources are constrained. I hope professional programmers will not judge the coding style and the choice of implementation methods too strictly; criticism and advice in the comments are welcome.

So the task:

As of February 2018, the SME register contains 18 zip archives of 3-4 GB each. Each archive holds about 5-6 thousand files with information on approximately 6 million organizations, unpacking to about 40 GB. From this array, only the records belonging to agricultural enterprises need to be selected, and the month-by-month dynamics of their number analyzed.
The FTS source files are published on the open data portal data.nalog.ru (the exact links are listed in the task file below).

The organization description files have the following structure (the Cyrillic attribute names and text values were garbled in this copy of the article; the skeleton below is restored according to the published RSMP schema, with the lost Cyrillic text values left empty):

< ="VO_RRMSPSV_0000_9965_20170110_01b07970-41d2-4d1e-bb80-0abee395d333" ="4.01" ="" ="900"> <> < ="-" ="-"/> </> < ="4e28d9a9-c004-0f72-a27d-7d677620df81" ="10.01.2017" ="01.08.2016" ="2" ="1" ="2"> < ="636204531704"> < ="" ="" =""/> </> < ="63"> < ="" =""/> < ="" =""/> < ="" =""/> </> <> < ="42.21" ="      , " ="2014"/> < ="52.21.2" =" ,    " ="2014"/> < ="74.30" ="     " ="2014"/> < ="63.91" ="  " ="2014"/> < ="95.23" ="      " ="2014"/> < ="42.21" ="      , " ="2014"/> < ="62.09" =",        , " ="2014"/> < ="25.72" ="   " ="2014"/> < ="47.54" ="      " ="2014"/> < ="42.22.1" ="     " ="2014"/> < ="47.99" ="    , , " ="2014"/> < ="82.19" ="              " ="2014"/> < ="49.32" =" " ="2014"/> < ="42.22.2" ="     " ="2014"/> </> </> < ="7a14e521-68a3-9514-7540-04cb03799ac4" ="10.01.2017" ="10.09.2016" ="2" ="1" ="1"> < ="636204538611"> < ="" ="" =""/> </> < ="63"> < ="" =""/> < ="" =""/> < ="" =""/> </> <> < ="47.11" ="    ,  ,      " ="2014"/> < ="47.25.12" ="     " ="2014"/> </> </> < ="ad8636bb-78c3-763c-52d2-4fe5a93e9a8f" ="10.01.2017" ="10.09.2016" ="2" ="1" ="1"> < ="636204540794"> < ="" ="" =""/> </> < ="63"> < ="" =""/> < ="" =""/> < ="" =""/> </> <> < ="47.41" ="  ,          " ="2014"/> < ="49.20.9" ="  " ="2014"/> < ="47.78" ="     " ="2014"/> </> </> 

Processing will be performed in the bash shell on a Linux virtual machine with 2 cores, 8 GB of RAM and 100 GB of disk space:

 %Cpu0 :  6.1 us,  2.0 sy,  0.0 ni, 91.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 %Cpu1 : 54.1 us, 11.2 sy,  0.0 ni,  6.1 id, 28.6 wa,  0.0 hi,  0.0 si,  0.0 st
 KiB Mem :  8258760 total,    64684 free,  5645284 used,  2548792 buff/cache
 KiB Swap:  2129916 total,  1157076 free,   972840 used.  2271428 avail Mem

The script has to: download the zip archives from the FTS website, rename the files for further processing, unpack them, run the parser (xmlstarlet) to find the organizations matching the criteria set in the script, clear the disk of temporary files (while being processed, the source files occupy dozens of GB), and save the result in a format convenient for data analysis systems and for import into spreadsheets (CSV in our case).

Downloading and renaming are done with wget. So that the script knows which SME register archives to process, we create a file under the conditional name "flight task", listing which files to process and how to name the result.

Each line of the configuration file contains, separated by semicolons: a link to the file, the name of the resulting file, and a '*' mark indicating that the file needs to be processed (for cases when not the entire set of files has to be downloaded).

rmspfiles.txt

 http://data.nalog.ru/opendata/7707329152-rsmp/data-08262016-structure-08012016.zip;20160826;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-09102016-structure-08012016.zip;20160910;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-10102016-structure-08012016.zip;20161010;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-11252016-structure-08012016.zip;20161125;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-12122016-structure-08012016.zip;20161212;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-01112017-structure-08012016.zip;20170111;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-02102017-structure-08012016.zip;20170212;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-03102017-structure-08012016.zip;20170310;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-04102017-structure-08012016.zip;20170410;*
 http://data.nalog.ru/opendata/7707329152-rsmp/data-05102017-structure-08012016.zip;20170510
 http://data.nalog.ru/opendata/7707329152-rsmp/data-11062017-structure-08012016.zip;20170611
 http://data.nalog.ru/opendata/7707329152-rsmp/data-07102017-structure-08012016.zip;20170710
 http://data.nalog.ru/opendata/7707329152-rsmp/data-08102017-structure-08012016.zip;20170810
 http://data.nalog.ru/opendata/7707329152-rsmp/data-09112017-structure-08012016.zip;20170911
 http://data.nalog.ru/opendata/7707329152-rsmp/data-10102017-structure-08012016.zip;20171010
 http://data.nalog.ru/opendata/7707329152-rsmp/data-11102017-structure-08012016.zip;20171110
 http://data.nalog.ru/opendata/7707329152-rsmp/data-12112017-structure-08012016.zip;20171211
 http://data.nalog.ru/opendata/7707329152-rsmp/data-01112018-structure-08012016.zip;20180111
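The downloader shown below walks through every line of this file; the '*' mark in the third field is not actually checked by it. If only the marked lines should be fetched, a pre-filter of this kind can be applied first (a small sketch, not part of the original scripts):

 # keep only the lines whose third semicolon-separated field is '*'
 awk -F';' '$3 == "*"' rmspfiles.txt > rmspfiles_active.txt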

When the download and renaming are complete, a loop runs over the received archives: unzip, search the XML for the required records, write the result to CSV files. Before the next archive is processed, the disk space taken by its source files is freed.
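The downloader and the parser are kept as separate scripts; the orchestration between them is a plain loop of this kind (a sketch, assuming the parser below is saved as parsz.sh and takes the archive name as its only argument):

 #!/bin/bash
 # run the parser for every archive named in the flight task;
 # parsz.sh unpacks the archive, extracts the records to csv
 # and removes the unpacked XML before the next iteration
 while IFS=';' read -r url name mark; do
     ./parsz.sh "$name"
 done < rmspfiles.txt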

Despite the apparent simplicity of the task, the script went through several stages of debugging and improvement and ended up rather intricate.

So, what happened in the end:

1. File downloader:

 #!/bin/bash
 # **************** batch downloader from rmsp v 1.0. 2018-02-15 ***********************
 start=`date +%s`
 dt=`date`
 logFn='output_wget.log'
 touch $logFn        # so that the very first run has something to prepend to
 mkdir -p ./zip2     # make sure the download directory exists

 printf "********************************************************************************************\n" | tee tmp_output.log
 echo "* ${dt} wget *" | tee -a tmp_output.log
 printf "*********************************************************************************************\n\n" | tee -a tmp_output.log

 # download loop over the "flight task": fetch each archive and rename it into ./zip2
 IFS=';'
 while read line; do
     read -r -a array <<< "$line"
     echo "${array[0]} | ${array[1]} "
     # wget ${array[0]} -O ./zip2/${array[1]}.zip | tee -a tmp_output.log 2>&1
     # ask the server for the size of the remote file, for the log line below
     FILESIZE=$(wget --spider ${array[0]} 2>&1 | awk '/Length/ {print $2}')
     # -c: continue an interrupted download; 2>&1: send wget's progress (stderr) into the log
     wget -c ${array[0]} -O ./zip2/${array[1]}.zip 2>&1 | tee -a tmp_output.log
     end=`date +%s`; runtime=$((end-start)); dt=`date '+%Y-%m-%d %H:%M:%S'`
     printf "%s %4d sec %10d %s [ %s ]\n" ${dt} $runtime $FILESIZE ${array[0]} ${array[1]} | tee -a tmp_output.log
 done < rmspfiles.txt
 echo "" | tee -a tmp_output.log

 # prepend this run's log to the cumulative log, newest entries first
 cat tmp_output.log $logFn > tmp_output2.log; mv tmp_output2.log $logFn
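Since each archive is 3-4 GB, transfers do get interrupted; wget -c resumes them, and once the loop finishes, each archive can be verified before parsing. A hedged addition, not part of the original script:

 # test the integrity of every downloaded archive;
 # unzip -t reads the whole zip and checks the CRCs
 for f in ./zip2/*.zip; do
     unzip -tq "$f" > /dev/null || echo "corrupt archive: $f"
 done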

2. Parser (the Cyrillic comments, CSV column names and XPath attribute names in this copy of the article were garbled; in the listing below they are restored according to the published RSMP schema, and the lost Cyrillic search literal is left empty):

 #!/bin/bash
 # 2018-02-16 v 1.1: header row added to the resulting csv
 # 2018-02-19: field separator changed so that the csv imports cleanly into Excel
 # 2018-02-19: sed post-processing: \n -> ; and @@; -> \n
 # NOTE: the Cyrillic comments, column names, XPath attribute names and the search
 # literal were garbled in this copy; they are restored per the published RSMP schema.

 # field separator of the resulting csv (the original value was lost; ';' assumed)
 sp=';'
 # working directories: unpacked XML sources, downloaded archives, results
 path_src="./src"
 path_zip="./zip2"
 path_res="./res"
 t1="p1.log"
 t2="p2.log"
 t3="parsz.log"
 fnExt="$1"
 # make sure the working directories exist
 mkdir -p $path_src $path_res

 start=`date +%s`
 dt=`date '+%Y-%m-%d %H:%M:%S'`

 echo "**** | parsz | ${dt} unzip from: $path_zip/$fnExt.zip to $path_src/$fnExt" # | tee $t1
 # -q quiet mode (-qq => quieter)
 # -o overwrite files WITHOUT prompting
 # -j junk paths. The archive's directory structure is not recreated; all files
 #    are deposited in the extraction directory (by default, the current one).
 unzip -j -q -o $path_zip/$fnExt.zip -d $path_src/$fnExt/

 end=`date +%s`
 runtime=$((end-start))
 MOREF1=`ls "$path_src/$fnExt/" | wc -l`
 echo " ${dt}, $runtime sec [${MOREF1}] | files from: $path_src/$fnExt/ to $path_res/$fnExt.csv" | tee -a $t1

 # header row of the resulting csv; '@@' marks the end of a record for the sed step below
 echo "ИНН$sp\
 Наименование$sp\
 Категория (микро/малое/среднее)$sp\
 Вид (ЮЛ/ИП)$sp\
 ОКВЭД$sp\
 ТипРегиона$sp\
 Регион$sp\
 ТипРайона$sp\
 Район$sp\
 ТипНаселПункта$sp\
 НаселПункт$sp\
 ДатаСост$sp\
 ДатаВклМСП$sp\
 КодРегиона$sp\
 Файл@@\
 " > $path_res/res-$fnExt.csv

 # select the matching organizations from every unpacked XML file;
 # the Cyrillic literal of contains() was lost in this copy and is left empty
 /usr/bin/find $path_src/$fnExt/ -name "*.xml" | xargs -n1 xmlstarlet sel -T -f -t \
  -m "//Файл/Документ/ОргВклМСП[contains(@НаимОрг,'')]" \
  -v "@ИННЮЛ" -o "$sp" \
  -v "@НаимОрг" -o "$sp" \
  --if "../@КатСубМСП=1" -o "микро" --else --if "../@КатСубМСП=2" -o "малое" --else -o "среднее" --break --break -o "$sp" \
  --if "../@ВидСубМСП = 1" -o "ЮЛ" --else -o "ИП" --break -o "$sp[" \
  -v "../СвОКВЭД/СвОКВЭДОсн/@КодОКВЭД" -o "]" \
  -v "../СвОКВЭД/СвОКВЭДОсн/@НаимОКВЭД" -o "$sp" \
  -v "../СведМН/Регион/@Тип" -o "$sp" \
  -v "../СведМН/Регион/@Наим" -o "$sp" \
  -v "../СведМН/Район/@Тип" -o "$sp" \
  -v "../СведМН/Район/@Наим" -o "$sp" \
  -v "../СведМН/НаселПункт/@Тип" -o "$sp" \
  -v "../СведМН/НаселПункт/@Наим" -o "$sp" \
  -v "../@ДатаСост" -o "$sp" \
  -v "../@ДатаВклМСП" -o "$sp" \
  -v "../СведМН/@КодРегион" -o "$sp" \
  -o "$fnExt@@" \
  -n >> $path_res/res-$fnExt.csv

 end=`date +%s`
 runtime=$((end-start))
 dt=`date '+%Y-%m-%d %H:%M:%S'`
 echo " ${dt}, $runtime sec :parsing" | tee -a $t1

 # glue each record onto a single csv line: join all output lines with ';',
 # then restore the record boundaries at the '@@' marker
 sed -e ':a;N;$!ba;s/\n/;/g' $path_res/res-$fnExt.csv > $path_res/sed_tmp.csv
 sed -e 's/@@;/\n/g' $path_res/sed_tmp.csv > $path_res/res-$fnExt.csv

 end=`date +%s`
 runtime=$((end-start))
 dt=`date '+%Y-%m-%d %H:%M:%S'`
 echo " ${dt}, $runtime sec :sed " | tee -a $t1

 # prepend this run's log to the cumulative parser log
 cat $t1 $t3 > $t2; mv $t2 $t3

 # remove the unpacked XML sources to free disk space before the next archive
 rm -rf $path_src/$fnExt/*
 echo "removed XML sources: rm -rf $path_src/$fnExt/*"
 rm $t1
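The parser takes the archive name without the .zip extension as its only argument, e.g. ./parsz.sh 20180111. On the two-core machine described above, archives could in principle be processed two at a time; a sketch (the log file names p1.log/p2.log/parsz.log in the script would have to be made unique per run first, otherwise parallel instances would overwrite each other's logs):

 # feed the second field of the flight task to two parallel parser instances
 cut -d';' -f2 rmspfiles.txt | xargs -n1 -P2 ./parsz.sh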

The entire data array of 18 archives, totaling hundreds of GB, is processed in about 6 hours.
The run is logged to files for further debugging and optimization of the script.
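A first look at the monthly dynamics is possible straight from the shell, before any spreadsheet import: each res-*.csv is one monthly snapshot, so the per-file record counts already form the series. A small sketch (the first line of each file is the header row):

 # records per monthly snapshot, minus the header row
 for f in ./res/res-*.csv; do
     printf '%s %d\n' "$f" $(( $(wc -l < "$f") - 1 ))
 done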

After importing the resulting CSV into MS Excel, we get the dynamics of the number of enterprises by month.

Source: https://habr.com/ru/post/358654/

