Poetic discourse with a taste of reverse engineering

"The old man assembler noticed us
And going to the tomb, blessed "

Once I decided to write a program that composes poems. The algorithm came to mind quickly: put the rhyming words at the ends of the lines, then fill in the rest of each line with words, taking into account rhyme, rhythm, and the likelihood of finding them next to other words in existing coherent texts. In other words, Markov chains with rhymes bolted on.
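To make the idea concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the toy corpus, and the "same last two letters" rhyme test are assumptions of this sketch, and rhythm is ignored entirely.

```python
import random
from collections import defaultdict

def build_predecessors(words):
    """For every word, collect the words seen immediately before it."""
    prev = defaultdict(list)
    for left, right in zip(words, words[1:]):
        prev[right].append(left)
    return dict(prev)

def rhymes(a, b, tail=2):
    """Toy rhyme test (an assumption): the words share their last `tail` letters."""
    return a[-tail:] == b[-tail:]

def compose_line(prev, last_word, max_words, rng):
    """Grow a line right to left, starting from its rhyming last word."""
    line = [last_word]
    while len(line) < max_words and line[0] in prev:
        line.insert(0, rng.choice(prev[line[0]]))
    return line

def compose_couplet(words, max_words=4, seed=0):
    """Pick a rhyming pair of words, then build one line ending in each."""
    rng = random.Random(seed)
    prev = build_predecessors(words)
    vocab = sorted(set(words))
    pairs = [(a, b) for a in vocab for b in vocab if a < b and rhymes(a, b)]
    if not pairs:
        return None  # no rhymes at all in this text
    first, second = rng.choice(pairs)
    return [compose_line(prev, first, max_words, rng),
            compose_line(prev, second, max_words, rng)]

corpus = "the cat sat on the mat and the old rat ran to the flat hat".split()
couplet = compose_couplet(corpus)
for line in couplet:
    print(" ".join(line))
```

Each line is built backwards from its last word, so every pair of neighbouring words is one that actually occurs in the source text.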

Before implementing the algorithm, I decided to look at what others had already created. The first result in Yandex search was (who would doubt it!) Yandex.Autopoet, which uses neural networks trained on the verses of the classics. The second was the program “Poet's Assistant,” which on closer inspection turned out to be an ordinary rhyming dictionary. But in third place was the site of Lleo, aka Leonid Kaganov, a famous writer and seasoned FIDO user.

Why was it there? Because as a student at the Mining Institute, Lleo wrote a poem-composing program as his diploma thesis. I do not know how poetic the defense of that thesis was, but the program seemed to work well: poems written by it are posted on the author's website. The program itself can also be found there; it runs under MS-DOS with the 32-bit DOS/4GW extender, and the source code of that version is posted too. From the explanatory note to the thesis I learned that there was also a version for OS/2, apparently even with a graphical interface, but I could not find its source. The MS-DOS version, however, can be run under DOSBox and seen in action: it really does produce rhymed and fairly coherent verses, although not very meaningful ones. For 1996, when Lleo wrote this program, that level of auto-generated verse was very cool. In my opinion, the results are not much worse than the poems of Yandex.Autopoet. Or maybe Lleo became a famous writer with the help of a modified version of his program? (Scandals, intrigues, investigations! Just kidding, of course, but who knows...)

I began to study how the program works. To compose poems it needs two files: a word base, and a template describing the rhymes and meter of the verses to be composed. The sources are in assembly language, about 3,500 lines for TASM. The author wrote about this: “I chose to solve most of the tasks in assembler. I used it for all the coursework that let me choose the programming language, mainly because I write and debug programs in this language faster and more easily: it allows more flexible interaction with the machine.” Here I fully agree: assembler is very flexible and does not impose any programming paradigm, although of course it is faster to write programs in modern languages, assembling them from ready-made library building blocks. The source code has all the characteristic signs of assembler programs of that era: short, not always intelligible variable and function names that only the author understands; fixed-size arrays with the comment “probably enough”; a bunch of global variables; and sometimes witty author comments and error messages, like “Creative Crisis !!!” when the program runs out of words while selecting rhymes. Something in this spirit:

    @@punkt13:
        ;call io
        ;db 13,10,'{{F_LEVEL}}=',0
        ;movzx eax,[F_LEVEL]
        ;call pr_dec
        cmp [nomer_LEVEL],0     ;13) ... 16
        jne @@punkt16
        ;call io
        ;db 13,10,'  ',0
        ;call key
        cmp [F_LEVEL],0         ;13.0) ... =
        jne @@punkt14
        cmp [FREE_RHYME],0      ;13.1) ...
        je @@error_twor         ; ...
        stc
        ret                     ; ...
    @@punkt14:
        cmp [F_LEVEL],1         ;14) ... = ...
        je @@565656
        ;call io
        ;db 13,10,'   ,    ',0

That is where it all began: fascinated by the source code, I forgot that I had originally planned to implement the algorithm from scratch and, remembering my own assembler programs, decided to port Lleo's algorithm to a modern programming language. Besides, his algorithm was very similar to the one I had in mind. I chose Python as the target language: it is very convenient for working with text.

I started the analysis by walking through the code and removing all the commented-out code, of which there was a lot. I left only the commented-out debugging output, since it helped me understand what was happening at that point in the program. Next, I deleted all the service routines, such as command-line key parsing and file I/O. Once only the code related to the algorithm remained, I began to make sense of it and port it to Python. Functions whose purpose was clear I either renamed to something understandable right away or rewrote in Python and then deleted the assembler code. The rest I began to translate to Python line by line. “Line by line” is, of course, an overstatement: the program made heavy use of global state variables, and to keep that out of the Python code, in many places the algorithm had to be rewritten completely, without regard for the lines of the original program. At this stage the code looked like this, no longer assembler but not yet Python:

    randomValue = init random(777)
    if curMode=='C':
        print' ',13,10,' : ',0
        BASEname NAME_SHABLON
        call loadBASE
        call CREATE
    else if curMode=='U':
        print'  ',13,10,' : ',0
        call loadBASE
        call stat
        call setUdarenie_N
        call saveBASE
        # call automat -
        # jmp @@udara1
    return

Unfortunately, I did not think to look into the explanatory note to the thesis right away, being sure that it contained only the standard boilerplate about economic justification. Because of this I literally reverse-engineered the verse-composition algorithm and the word base format: judging by the intelligibility of the variable and function names, the program's source code is not far from a disassembler listing. Yet in general terms, both the algorithm and the format of the word base are described in the thesis note. I repent and sprinkle ashes on my head. Fortunately, the analysis took no more than a couple of evenings. Besides, not all details of the format and the algorithm turned out to be documented, so the reverse engineering continued later anyway. For working with binary data, Python has a very convenient unpack function (in the struct module). After a little tinkering with the byte order (little-endian, of course, since the program was written for Intel processors), I was able to load the word base. The file describing the rhythm of the verse is plain text with a very simple format, so there was no need to dig through its loading code.
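As an illustration of how unpack handles little-endian binary records, here is a small sketch. The record layout below is purely hypothetical and is not the real format of the original word base; the field names and the cp866 encoding are likewise assumptions made for the example.

```python
import struct

# Hypothetical record layout (NOT the real format of the original word base):
#   uint8   length of the word in bytes
#   uint8   number of syllables
#   uint8   index of the stressed syllable
#   uint32  offset of a related word
# "<" selects little-endian byte order with no padding between fields.
RECORD_HEADER = struct.Struct("<BBBI")

def read_record(buf, offset=0):
    """Decode one record; return the parsed fields and the offset of the next record."""
    length, syllables, stress, link = RECORD_HEADER.unpack_from(buf, offset)
    start = offset + RECORD_HEADER.size
    # cp866 is assumed here as a typical DOS Cyrillic code page
    word = buf[start:start + length].decode("cp866")
    record = {"word": word, "syllables": syllables, "stress": stress, "link": link}
    return record, start + length

# Build a sample record by hand and read it back
sample = struct.pack("<BBBI", 5, 3, 1, 0x12345678) + b"poema"
record, next_offset = read_record(sample)
print(record, next_offset)
```

In a little-endian file the 32-bit value 0x12345678 is stored as the bytes 78 56 34 12, which is exactly the kind of detail that is easy to get wrong when reading a hex dump by eye.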

Now I had to understand the verse-composition algorithm itself. As I wrote above, it is described in general terms in the explanatory note, but some details are missing. For example, the verse template is specified in reverse order, from the end of the stanza to the beginning. Another example: the Latin letter 'p' is replaced with the Cyrillic 'р' everywhere. This is a FIDO legacy: the Cyrillic 'р' caused glitches there and was replaced with the Latin letter, so in Russian texts downloaded from FIDO the Latin 'p' was everywhere and had to be converted back to Cyrillic. And other similar trivia. Overall, the algorithm resembled the Markov chains with rhymes described at the beginning of the article, but differed in that it used a stack to save its state while composing a stanza, so that it could roll back when it reached a dead end without finding a word with the required stress and number of syllables. The code also showed an attempt to compose poems on a given topic: an initial word for the topic was chosen, and then the search proceeded through related words. But it seems this feature never worked, and the function make_RND_FIELD_TEMA contains just one hard-coded word index from which the program starts selecting words.
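The stack-based rollback can be sketched roughly as follows. This is not the original algorithm: the function names, the vowel-counting syllable rule, and the toy data are assumptions, and stress is left out for brevity. Like the original template, the pattern is listed from the end of the line.

```python
def count_syllables(word):
    """Toy syllable counter (an assumption): one syllable per vowel letter."""
    return sum(ch in "aeiouy" for ch in word)

def fill_line(pattern, last_word, prev):
    """Fill a line right to left, rolling back on dead ends.

    pattern: required syllable counts, listed from the END of the line
    prev:    dict mapping a word to the words that may stand before it
    """
    if count_syllables(last_word) != pattern[0]:
        return None
    # Each stack frame holds a chosen word and an iterator over the
    # not-yet-tried candidates for the slot to its left.
    stack = [(last_word, iter(prev.get(last_word, [])))]
    while stack:
        if len(stack) == len(pattern):
            return [word for word, _ in reversed(stack)]
        _, candidates = stack[-1]
        need = pattern[len(stack)]
        for cand in candidates:
            if count_syllables(cand) == need:
                stack.append((cand, iter(prev.get(cand, []))))
                break
        else:
            stack.pop()  # dead end: roll back to the previous state
    return None

prev = {"night": ["darkest", "silent"], "darkest": ["the"], "silent": ["very"]}
print(fill_line([1, 2, 2], "night", prev))
```

Here "darkest" is tried first, leads nowhere (only the one-syllable "the" can precede it), gets popped off the stack, and the search resumes with "silent".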

There were some funny moments while I was picking the program apart.
For example, the program began with this fragment:

        jmp @@skip
        ; ...
        db 'WATCOM'     ; ... DOS4GW
    @@skip:

The point is that the 32-bit DOS/4GW extender was written for programs built with the commercial Watcom compilers, and was itself a commercial product. It detected that a program had been compiled by Watcom by looking for the string 'WATCOM' at the beginning of the program code; if the string was missing, DOS/4GW refused to work. In fairness, advanced users of that time preferred the PMODE/W extender by Tran, which has no such nonsense, is noticeably smaller, is free, and can be bound into the program's executable, whereas DOS/4GW usually sits next to the program as a separate executable file.

There was also this piece of code:

    proc bswap_eax      ; ... 386 ... bswap!
        mov [bswap_mes],eax
        ; ...

Indeed, the 80386 processor had no bswap instruction; it appeared with the 80486 and turned out to be very handy, for example for converting between little-endian and big-endian byte order. So people wrote functions and comments like these.
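For readers who have never met bswap, here is what the operation does, sketched in Python. The function names are made up for this example, and neither variant is a transcription of the original assembler routine.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

def bswap32_shifts(value):
    """The same operation done manually with masks and shifts."""
    return (((value & 0x000000FF) << 24) |
            ((value & 0x0000FF00) << 8) |
            ((value & 0x00FF0000) >> 8) |
            ((value & 0xFF000000) >> 24))

print(hex(bswap32(0x12345678)))
```

Applying the swap twice gives back the original value, so the same function converts in both directions.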

Another curious thing happened when I was testing the verse-composition algorithm. For the test, I set the end of the stanza so that the word "busy", which I had definitely seen in the word base file, was bound to be picked as the rhyme. For some reason, however, this rhyme never appeared. It turned out that the word actually stored in the base is shorter, without the final "o", and the "o" I had seen after it is part of the service data: a pointer to a related word. The fact that this byte happened to be the letter “o” is just a coincidence.

Once verse composition was working, I quickly added prose generation (ordinary Markov chains), and then I wanted more: I wanted my program to be able to generate the word base from a text by itself, and not just use the ready-made base from the original program. It turned out that nearly the larger part of the original program is devoted to base generation, and thanks to how well thought out its algorithm is, this part impressed me more than the part that composes verses. In fact, it does all the preparatory work for composing poems: it parses words from the input text, splits them into syllables, and even places stress automatically, based on previously collected statistics of manual stress placement by syllable. The stress is not always placed correctly, though; as far as I know, Russian has no consistent rule for stress placement. The stress statistics were stored in a separate file, $$$$SLOG.BSY, whose format is not described in the thesis note. Here again I had to do a little reverse engineering.

Once word base generation was working, I could start experimenting with various texts. The experiments showed that the syllable-splitting algorithm taken from the original program does not always work correctly, so I rewrote it from scratch. This also let me refine the algorithm that extracts the rhyming ending: it now works with syllables and stress, instead of simply walking to the vowel with the required ordinal number and relying on that to find the right syllable.
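One plausible way to get a rhyming ending from syllables and stress is to take everything from the stressed vowel to the end of the word. The sketch below assumes exactly that rule; the function name is invented, and transliterated Russian words are used so that a Latin vowel set is enough.

```python
VOWELS = set("aeiouy")

def rhyme_ending(syllables, stressed):
    """Return the part of the word from the stressed vowel to the end.

    syllables: the word already split into syllables
    stressed:  0-based index of the stressed syllable
    """
    head = syllables[stressed]
    # Drop the consonants in front of the stressed vowel
    for i, ch in enumerate(head):
        if ch in VOWELS:
            head = head[i:]
            break
    return head + "".join(syllables[stressed + 1:])

# "korova" and "snova" are stressed on "ro" and "sno" respectively
print(rhyme_ending(["ko", "ro", "va"], 1))
print(rhyme_ending(["sno", "va"], 0))
```

Two words can then be treated as rhyming when their endings are equal, regardless of how many syllables precede the stress.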

After that, I packed all the functionality into objects, split them into modules, and quickly wrote a Python command-line script that uses these modules, accepts the same keys, and can still do everything the original program could. It can also load databases from the original program, although it stores them in its own format, serializing the data with Python's pickle. Although the algorithm of the original program has been heavily reworked, in many places I kept the original comments: they are interesting to read and preserve the spirit of that era. In addition, when launched with the –oldschool key, the script prints the help text of the original program to the console, with lots of greetings to various people and steamships.

Here is an example of the poems it composes:

***

***

The result is a program that composes graphomaniac verses and makes it rather interesting to experiment with various texts.

What else could be improved in it:

build a rhyme dictionary while generating the word base, which would speed up word selection
teach the program to use a stress dictionary from the Internet, which would make stress placement more accurate (though still not always correct, since Russian has words with the same spelling but different stress)
teach the program to write in foreign languages where the spelling uniquely determines the pronunciation. English would be a problem: there, as they say, “We write Liverpool we read Manchester”, but French is quite feasible. Besides, stress is easier to place there, since it (almost) always falls at the end of the word.
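
The first idea, a rhyme dictionary built alongside the word base, could look roughly like this. It is a sketch only: the function names are invented, and the "last two letters" key is a placeholder for a real rhyming-ending function.

```python
from collections import defaultdict

def build_rhyme_dictionary(words, ending_of):
    """Group words by their rhyming ending once, at base-generation time."""
    index = defaultdict(list)
    for word in words:
        index[ending_of(word)].append(word)
    return dict(index)

def find_rhymes(index, word, ending_of):
    """Look up all rhymes for a word without scanning the whole base."""
    return [w for w in index.get(ending_of(word), []) if w != word]

def last_two(word):
    """Placeholder rhyme key (an assumption): the last two letters."""
    return word[-2:]

index = build_rhyme_dictionary(["night", "light", "day", "way", "sight"], last_two)
print(find_rhymes(index, "night", last_two))
```

With such an index, finding rhymes becomes a single dictionary lookup instead of a pass over every word in the base.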

That is really all I wanted to say in this article.
I have posted the fruits of my labor on GitHub.

Thank you all for your attention!

Source: https://habr.com/ru/post/323034/
