This post prompted me to write this program (and, later, this article). It just so happens that I have a habit of keeping the articles I have read, since it is impossible to remember everything and you never know when something might come in handy. So, after reading the post above and remembering the dear-to-me option of printing Wikipedia pages to PDF, the thought occurred to me to make the same kind of "printer" for Habr, so that articles could go into my personal archive.
My first attempt was to use the script so kindly provided by the author of that post. And almost immediately I hit a snag that was beyond my strength to ignore: code highlighting.
Let me say up front that I am a newcomer on Habr and have only a vague idea of how everything works here. However, after looking at the source of a page containing a code fragment, the source of the problem became clear. It is *drum roll* the fact that the highlighting is done by JavaScript. Reading in a browser is all well and good, but Python's pisa, which renders the page to PDF, cannot execute the coloring code in principle.
So an idea was needed; I had to come up with something.
The plan: modify the source of the article so that the basic constructs of the language whose code appears in the article are wrapped in special HTML tags. Then it is enough to add a few lines of CSS to get syntax-highlighted code, visible both in the browser and in the PDF printed by pisa. What is needed for this? First of all, to pick out the lexemes of the language themselves. And here the fun begins. Surely one is not going to write a full-featured parser for every language! Fortunately, that was not required.
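Roughly speaking, the target looks like this (the tag name follows the <language_class> scheme described further below, and the color is purely illustrative):

before: int x = 0;
after:  <c_typeword>int</c_typeword> x = 0;
css:    c_typeword { color: green; }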
Let us recall some theory. Programming languages, as a rule, belong to the class of languages described by context-free grammars, which include regular grammars as a subset. Regular grammars describe the basic elements of a language, the lexemes, from which all other syntactic constructs are built. Everyone's beloved code highlighting boils down to displaying certain classes of lexemes in a different color, which makes the code easier and more pleasant to read. So the task is as follows: write a regular expression for each class of highlighted lexemes, find all the matches, and wrap each one in the corresponding HTML tag. The catch is that composing a long regular expression takes time and effort. There are several such expressions for each language. There are many languages. And the expressions themselves are not as simple as they might seem. For example, let us try to define a regular expression that matches a C data type (we will limit ourselves to a few, since there are plenty). What could be complicated about that? First attempt:
r'int|short|long|char'
Right? No. Such a regular expression will also match inside the word "chelintano", for example, and we would get highlighting in the middle of a word. The obvious fix is to add whitespace characters at the beginning and at the end.
r'\s+(int|short|long|char)\s+'
Wrong again. A type can be preceded by a parenthesis, a square bracket, a curly brace, and, if we recall that a type name can also mean a type cast, quite a lot more. It turns out to be easier to say what cannot come before a type name: a letter, a digit or an underscore. So in the end we get this regular expression:
r'(?P<prefix>^|[^a-zA-Z0-9_])(?P<body>int|short|long|char)(?P<postfix>$|[^a-zA-Z0-9_])'
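A quick sanity check in the interpreter (the test strings here are mine):

import re

typeword = re.compile(r'(?P<prefix>^|[^a-zA-Z0-9_])'
                      r'(?P<body>int|short|long|char)'
                      r'(?P<postfix>$|[^a-zA-Z0-9_])')

print(typeword.search('chelintano'))             # None: no hit in the middle of a word
print(typeword.search('(int)x').group('body'))   # 'int': the type cast is caught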
Now imagine such drudgery for every class of lexemes. For every language. It would be nice to specify only the essentials and have the regular expression compose itself. And that can be done.
The solution that seemed most workable to me was to describe the language's lexeme classes in a file with an INI-like structure. For each lexeme class one can distinguish the main components: the prefix (characters or character sequences that may come before a lexeme), the body of the lexeme, and the postfix (characters that may come after it). Each component, in turn, may consist of simple expressions, that is ordinary strings such as int or function, or of regular expressions, for example [0-9]+(\.[0-9]+)? (a regular expression for a floating-point number). Thus, the following parameters can be set in each block of the INI file:
lexem
before
after
regexpr
The value of the regexpr parameter is a regular expression. This parameter can be used several times; the resulting regular expression will then match any of the given values. For the first three parameters the value is, as a rule, a set (enumeration) or a list, written, in the best Python tradition, in curly braces or square brackets respectively. Sometimes it is convenient to separate the values with a special character; the delimiter parameter specifies it (by default the delimiter is the empty string, that is, every character is treated as a possible lexeme body). This parameter changes the delimiter character within a file block until another definition of its value is encountered. It happens that the beginnings or endings of the lines in the lexem enumeration repeat; a vivid example is the C++ preprocessor directives (#include, #define, #pragma). To avoid writing too much (what if there really are many of them?), one can set the prefix and postfix parameters. Their values are appended to every line of the lexem enumeration at the beginning and at the end respectively.
Here is an example for the same C++:
[classname]
delimiter = ;
postfix = \s+
before = [class]
regexpr = [a-zA-Z_][a-zA-Z0-9_]*
eqstyle = typeword

[number]
delimiter =
regexpr = [0-9]+(\.[0-9]+)?
regexpr = 0x[0-9a-fA-F]+
before = {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_}
after = {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_}
Note the eqstyle parameter here. It exists purely for practical reasons, to avoid adding redundant entries when defining the highlighting (in this case, wrapper tags and entries in the CSS file). The eqstyle parameter should be read as "for this lexeme class, use everything the same as for the class <value>".
What follows is straightforward: read this file and compile a regular expression for each lexeme class. As a result we get a dictionary whose keys are strings, the names of the lexeme classes, and whose values are compiled regular expressions (let us call this dictionary the "style" of the language). It remains to run the block of code published in the article through each expression of the "style".
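Here is a minimal sketch of how such a "style" might be assembled (simplified: the real program reads the descriptions from the INI-like file, while here two classes of a toy C subset are hard-coded, and the function name is mine):

import re

def compile_class(bodies, before, after):
    # Join the lexeme bodies into one alternation; before/after become
    # the prefix/postfix groups, exactly as in the hand-written example above
    pattern = '(?P<prefix>^|%s)(?P<body>%s)(?P<postfix>$|%s)' \
              % (before, '|'.join(bodies), after)
    return re.compile(pattern)

# The "style": lexeme class name -> compiled regular expression
style = {
    'typeword': compile_class(['int', 'short', 'long', 'char'],
                              '[^a-zA-Z0-9_]', '[^a-zA-Z0-9_]'),
    'number':   compile_class([r'[0-9]+(\.[0-9]+)?', r'0x[0-9a-fA-F]+'],
                              '[^a-zA-Z0-9_]', '[^a-zA-Z0-9_]'),
}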
def modify(style, eqstyles, stylename, block):
    if style is None:
        return block
    # ... the matching loop described below
So what happens inside. For each regular expression, the first match is searched for. This gives us the offset of the lexeme relative to the beginning, as well as the lengths of the prefix and the postfix. A triple (<start position of the lexeme>, <end position of the lexeme>, <lexeme class>) is added to the list of matches. The start position is computed as the offset of the match relative to the beginning of the code plus the prefix length (the end position is computed similarly). Along the way we check that the lexeme does not overlap others and is not overlapped itself. What does that mean? Remember that lexemes are just strings, and one string can be a substring of another. If that other string is itself a lexeme, it "overlaps" the nested one. For example, in the string "int is a type of variable for storing a four-byte integer", the word int must not be highlighted: it is just part of a string, even though it would match as a lexeme. After a lexeme is processed, the search string is truncated on the left at the position where the found lexeme ends, the search continues in the remaining text, and so on until there are no more matches.
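A minimal sketch of that loop (simplified relative to the real modify(): the overlap check here only rejects lexemes nested inside ones already found, and the name is mine):

def find_tokens(style, block):
    # Returns sorted triples (start, end, lexeme class) for one code block
    matches = []
    for name, regexp in style.items():
        offset, text = 0, block
        while True:
            m = regexp.search(text)
            if m is None:
                break
            start = offset + m.start() + len(m.group('prefix'))
            end = start + len(m.group('body'))
            # A lexeme lying inside an already found one is just part of a string
            if not any(s <= start and end <= e for s, e, _ in matches):
                matches.append((start, end, name))
            # Truncate the search string on the left, continue past the lexeme
            text = block[end:]
            offset = end
    return sorted(matches)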
The simplest part remains: using the list of lexeme positions, wrap them in HTML tags. Tag names are composed simply: <"language name"_"lexeme class">. As a result we get the block of code augmented with HTML markup. Add style definitions for each of these tags, and we get code highlighting that both the browser and pisa render equally well.
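And a sketch of this final step (continuing the names above; the line in the comment is what it produces for the toy style):

def wrap_tokens(stylename, block, matches):
    # Insert tags starting from the end of the block so that the
    # earlier positions in the list stay valid
    for start, end, name in sorted(matches, reverse=True):
        tag = '%s_%s' % (stylename, name)
        block = '%s<%s>%s</%s>%s' % (block[:start], tag,
                                     block[start:end], tag, block[end:])
    return block

# wrap_tokens('c', 'int x = 0xFF;', find_tokens(style, 'int x = 0xFF;'))
# -> '<c_typeword>int</c_typeword> x = <c_number>0xFF</c_number>;'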
Here is one of the printed pages of this article. The code blocks are styled after Obsidian, shamelessly borrowed from my beloved Notepad++.

The highlighting method presented here is certainly not ideal, either in completeness (sometimes the available parameters are not enough to pin a lexeme down exactly) or in performance. However, on the articles I have printed it has made no significant blunders. Besides, we are not launching a shuttle into space; the cost of a mistake here is minimal, and I see no point in optimizing. If anyone knows other ways of implementing highlighting (frankly, I have never looked into them), I will gladly listen and read.
The program can be downloaded from here.

P.S.: I would be grateful to anyone who helps write language descriptions and highlighting styles.