
The fastest SAX parser for Python

Out of the blue, I wanted to count all the XML tags in 240 thousand XML files with a total weight of 180 GB. In Python, and fast.

Task


Actually, I wanted to estimate what it would take to convert the Library That Must Not Be Named from fb2 to DocBook. Given the "specifics" of FB2, the first step is to figure out which tags can simply be skipped because they are rare. That is, just count the number of occurrences of each tag across all the files.
Along the way I planned to compare the various SAX parsers. Unfortunately, the comparison fell through: both xml.sax and lxml broke on the very first fb2. So xml.parsers.expat was what remained.
Oh, and one more thing: the *.fb2 files are packed into zip archives.
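
For reference, the xml.sax variant would have looked roughly like this (a minimal sketch of the same tag counting, not the exact code I ran; it dies with a SAXParseException on the first malformed fb2):

import sys
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    # count every opening tag the parser reports
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = {}

    def startElement(self, name, attrs):
        self.tags[name] = self.tags.get(name, 0) + 1

handler = TagCounter()
xml.sax.parse(sys.argv[1], handler)  # raises SAXParseException on bad XML
print handler.tags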

Initial data


The input data is a snapshot of the Library as of 2013-02-01, pulled from the Internet: 242525 *.fb2 files with a total weight of 183909288096 bytes, packed into 56 zip archives with a total weight of 82540008 bytes.
Platform: Asus X5DIJ (Pentium Dual-Core T4500, 2 × 2.30 GHz, 2 GB RAM); Fedora 18, Python 2.7.

Code


Written in haste, with some pretension to universality:
#!/bin/env python
# -*- coding: utf-8 -*-
'''
Count the occurrences of every XML tag in a *.fb2 file, a zip archive
of fb2 files, or a whole directory of archives.
'''
import sys, os, zipfile
import xml.parsers.expat, magic

mime = magic.open(magic.MIME_TYPE)
mime.load()
tags = dict()   # tag name -> number of occurrences
files = 0       # number of successfully parsed files

reload(sys)
sys.setdefaultencoding('utf-8')

def start_element(name, attrs):
    tags[name] = tags[name] + 1 if name in tags else 1

def parse_dir(fn):
    dirlist = os.listdir(fn)
    dirlist.sort()
    for i in dirlist:
        parse_file(os.path.join(fn, i))

def parse_file(fn):
    # dispatch on MIME type: zip archive or bare xml/fb2
    m = mime.file(fn)
    if m == 'application/zip':
        parse_zip(fn)
    elif m == 'application/xml':
        parse_fb2(fn)
    else:
        print >> sys.stderr, 'Unknown mime type (%s) of file %s' % (m, fn)

def parse_zip(fn):
    print >> sys.stderr, 'Zip:', os.path.basename(fn)
    z = zipfile.ZipFile(fn, 'r')
    filelist = z.namelist()
    filelist.sort()
    for n in filelist:
        try:
            parse_fb2(z.open(n))
            print >> sys.stderr, n
        except:
            print >> sys.stderr, 'X:', n   # this fb2 failed to parse

def parse_fb2(fn):
    global files
    if isinstance(fn, str):   # got a path rather than a file object
        fn = open(fn)
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(fn.read(), True)
    files += 1

def print_result():
    out = open('result.txt', 'w')
    for k, v in tags.iteritems():
        out.write(u'%s\t%d\n' % (k, v))
    print 'Files:', files

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print >> sys.stderr, 'Usage: %s <xmlfile|zipfile|folder>' % sys.argv[0]
        sys.exit(1)
    src = sys.argv[1]
    if os.path.isdir(src):
        parse_dir(src)
    else:
        parse_file(src)
    print_result()
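
One spot worth noting: parse_fb2() slurps each file whole with fn.read(). Expat can also consume a file object in chunks via ParseFile(); a sketch of that variant (I did not measure whether it is faster, but it keeps memory flat on big files):

def parse_fb2(fn):
    global files
    if isinstance(fn, str):
        fn = open(fn)
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.ParseFile(fn)   # reads the stream in chunks instead of one big read()
    files += 1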

Results


We launch:
 time nice ./thisfile.py ~/Torrent/....ec > out.txt 2>err.txt 

We get:
* running time: 74 min 15..45 s (a little work was being done in parallel, and music was playing, naturally);
* which makes the processing speed ~40 MB/s (or ~58 CPU cycles per byte);
* 2584 *.fb2 files were rejected (expat is a non-validating parser, but surely not to that degree...), i.e. about 1%;
* the tag counts landed in result.txt, and what isn't in there...;
* and now what it was all started for: of the 65 FB2 tags, only one goes completely unused (output-document-class); a couple more can be skipped (output, part, stylesheet); the rest each occur upwards of 10 thousand times;
* by rough estimate, reading the files (with unpacking) takes 52% of the time, parsing 40%, and handling start_element 8% (one way to measure this is sketched below).
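
The 52/40/8 split is only an estimate; one way to obtain numbers like that is cProfile. A sketch, wrapping the dispatch from the main block of the script above (parse_dir, parse_file, and src are the names from that script):

import cProfile, pstats

# profile the whole run and dump the raw stats to a file
cProfile.run('parse_dir(src) if os.path.isdir(src) else parse_file(src)', 'parse.prof')
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(15)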

So, can it be done faster? In Python.
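
One obvious direction on a dual-core box: the 56 zip archives are independent, so they can be farmed out to worker processes and the per-archive counts merged at the end. A sketch under the assumption that it lives in the same script (it reuses tags and parse_zip from above, and only pays off if the CPU, not the disk, is the bottleneck):

import glob, multiprocessing

def count_archive(fn):
    # runs in a worker process: reset the counter, parse one archive
    tags.clear()
    parse_zip(fn)
    return dict(tags)

if __name__ == '__main__':
    archives = sorted(glob.glob(os.path.join(sys.argv[1], '*.zip')))
    pool = multiprocessing.Pool()   # one worker per CPU core by default
    total = {}
    for part in pool.map(count_archive, archives):
        for k, v in part.iteritems():   # merge per-archive counts
            total[k] = total.get(k, 0) + v

With two workers, at least the decompression and parsing of different archives overlap; whether that beats the sequential run is exactly the kind of thing to measure with the profiling snippet above.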


Source: https://habr.com/ru/post/171447/

