More about parsing on Prolog

Here I stumbled upon, in general, a simple task consisting in parsing a text file containing 5 million floats (and counting their sum). The file is generated with the following C # code:

static void Main( string [] args) { using ( Stream stm = new FileStream ( @"d:\numbers_large.txt" , FileMode .Create)) { TextWriter wr = new StreamWriter(stm); System. Random r = new System. Random (); for ( int i = 0; i < 5000000; i++) { double d=10000*r.NextDouble() * (r.NextDouble() > 0.7 ? -1.0 : 1.0); wr.Write( "{0} " , d); } wr.Flush(); }

The task was set in the context of discussing the performance of haskell in applying it to the tasks of parsing. I knew that on the prologue such problems are solved beautifully and naturally using the DCG technique (Definite clause grammar: 1 , 2 , 3 , 4 ). In fact, this is a description of grammars in the Prolog language, and parsing on them, based on the redundant-recoil prolog principle.

Well, that is, it usually turns out very briefly and beautifully (for example, here is the solution to the problem of balancing brackets by this method: a program of 7 lines ), but, I suspected that it was not always fast. Actually, I wanted to check it out.

Pretty quickly (about 15 minutes) the obvious (naive) version of the program was written:
')

:- set_prolog_flag (float_format, '%.15g' ). integer ( I ) --> digit ( D0 ), digits ( D ), { number_chars ( I , [ D0 | D ]) }. digits ([ D | T ]) --> digit ( D ), ! , digits ( T ). digits ([]) --> []. digit ( D ) --> [ D ], { code_type ( D , digit) }. float ( F ) --> ( "-" , { Sign = -1} ; "" , { Sign = 1} ), ! , integer ( N ), "," , integer ( D ), { F is Sign * ( N + D / 10^( ceiling ( log10 ( D )))) }. sum ( S ) --> float ( F1 ), ! , " " , ( sum ( S1 ) ; { S1 = 0} ), { S is F1 + S1 }. go :- read_file_to_codes ( 'numbers_large.txt' , Codes ,[]), writeln ( 'Processing...' ), sum ( S , Codes ,[]), writeln ( S ).

I, of course, suspected a bummer on parsing such a large file, but the subring happened before. The read_file_to_codes predicate fell in Out of global stack. That, in general, was pretty obvious. The fact is that in the implementation of the SWI prolog I use there is a fundamental limitation on 32-bit architecture - it cannot use more than 128 MB for the global stack and 128 MB for the local stack (restrictions are eliminated in 64-bit axes). This I later read in the dock, and before that I tried to play with the keys -G500M -L500M, but this, of course, did not save, because to read ~ 82M characters in memory required significantly more space (than 128).

Helped the guys on the mailing list (by the way, I want to emphasize that the author of SWI Prolog, Jan Wielemaker is actively involved in the mailing list, helped this time too!).

They suggested using the phrase_from_file predicate from library (pio), which works with lazy-style streams, and in fact combines reading from a file and parsing in one predicate. Just what you need!

Moreover, anticipating further problems, rewrote the predicate sum, performing parsing with simultaneous summation, in a 'tail recursion' form. For the uninitiated, I’ll tell you in more detail: in functional and logical programming languages, you usually cannot create loops, so almost everything is done using recursion. But if you write it correctly (when it is possible), then you can achieve the effect that the code will compile into a loop, although externally it will be a recursion. On the prologue for writing tail recursion, the predicate should be in the form

head :- <guard goals>, !, <deterministic stuff>, head. head :- ...

Actually, the main thing in tail recursion is that at the very end of the predicate there is a call of his own.

The code was rewritten as follows:

:- set_prolog_flag (float_format, '%.15g' ). integer ( I ) --> digit ( D0 ), digits ( D ), { number_chars ( I , [ D0 | D ]) }. digits ([ D | T ]) --> digit ( D ), ! , digits ( T ). digits ([]) --> []. digit ( D ) --> [ D ], { code_type ( D , digit) }. float ( F ) --> ( "-" , { Sign = 3D -1} ; "" , { Sign = 3D 1} ), integer ( N ), "," , integer ( D ), { F is Sign * ( N + D / 10^( ceiling ( log10 ( D )))) }. sum ( S , Total ) --> float ( F1 ), ! , " " , { S1 is S + F1 }, sum ( S1 , Total ), ! . sum ( Total , Total ) --> []. go1 :- phrase_from_file ( sum (0, S ), 'numbers_large.txt' , [ buffer_size (16384)]), writeln ( S ).

But again, a bummer, this time with a local stack (and a memory consumption of about 130 MB)

?- go1.
ERROR: Out of local stack
Exception: (1,973,549) sum(3943621517.84343, _G2, [45, 50, 55, 54, 57,
44, 49, 48|...], []) ?

This could only mean that tail recursion did not work, and this time it was guilty prudently (indeed, premature optimization is the root of all ills) put at the end of the predicate sum / 2 cut-off icon (cut!). It was he who was called upon to cut off the search for unnecessary alternatives, being put at the end, turned off tail tail recursion optimization.

In general, removing this ill-fated (!) From there, but adding it to the float after an alternative to the sign associated with parsing (to avoid remembering return points), received the final version of the program:

:- set_prolog_flag (float_format, '%.15g' ). integer ( I ) --> digit ( D0 ), digits ( D ), { number_chars ( I , [ D0 | D ]) }. digits ([ D | T ]) --> digit ( D ), ! , digits ( T ). digits ([]) --> []. digit ( D ) --> [ D ], { code_type ( D , digit) }. float ( F ) --> ( "-" , { Sign = -1} ; "" , { Sign = 1} ), ! , integer ( N ), "," , integer ( D ), { F is Sign * ( N + D / 10^( ceiling ( log10 ( D )))) }. sum ( S , Total ) --> float ( F1 ), ! , " " , { S1 is S + F1 }, sum ( S1 , Total ). sum ( Total , Total ) --> []. go1 :- phrase_from_file ( sum (0, S ), 'numbers_large.txt' , [ buffer_size (16384)]), writeln ( S ).

On Core2Duo @ 3Ghz, ~ 1min is running, consuming no more than 5 MB of memory and developing a parsing speed of ~ 1.3 MB / s. Well, not as fast as a variant on Haskell, but, IMHO, very bad for an interpreted SWI-Prolog, and at the same time very elegant, compared to the solution from the root of the RSDN).

By the way, python, surprisingly not pleased with the speed. Naive, true version

from decimal import * getcontext().prec=15 print sum(Decimal(f.replace( ',' , '.' ) or '0' ) for f in open( 'numbers_large.txt' ).read().split( ' ' )) * This source code was highlighted with Source Code Highlighter .

I ate 236 megabytes of memory and worked ~ 4 minutes on the same configuration.

Source: https://habr.com/ru/post/60430/

All Articles

More about parsing on Prolog

More articles: