Backlit editor. Problems and Solutions

Most of the audience Habra regularly writes code. In a text editor or IDE. And no matter how many windows and menus are in it, the heart of any editor is the component (widget) that edits and highlights the code.

More than a year ago on Habré there was a cycle of articles from the namespace about the QScintilla component ( 1 , 2 , 3 ), and my article with its criticism . It turned out some understatement. It is clear that everything is bad, but it is not clear what to do.

Now I wrote my ~~bike~~ component called Qutepart, and the cycle has a sequel.
This article will tell about the syntax highlighting in my project: what problems arose and how they were solved. It is about approaches, and not about the specifics of a specific GUI-toolkit. If it is interesting to look under the hood of a text editor, welcome under the cat.

At once I will make a reservation that this is not a fundamental scientific article. This is my experience. It will be very interesting to read in the comments about alternative approaches and solutions.
')

Motivation

I make a text editor in the spirit of vim and emacs. Universal, cross-platform, extensible, focused on advanced users. But with a slightly more intuitive and modern GUI on PyQt . And for that, I need a code editor.
There were only 2 options: QScintilla and katepart.
I initially used QScintilla, but because of the flaws listed here , I decided to give it up.
katepart is very good, but does not like what depends on the KDE libraries. And having them in dependencies is not very convenient, especially for Windows and MacOS.

Parsing code

My project is not focused on any particular technology, I want to highlight a lot of languages. More than I am able to master myself. Therefore, it was decided to use the existing database of syntax descriptions from katepart.

For each of the supported programming languages in katepart there is a Highlight Definition - an XML file describing the syntax and highlighting.
Highlight Definition describes something like a finite state machine. The machine sequentially parses the text, changing its state - “Context”. Each of the contexts has a set of rules according to which it is necessary to switch to another context. The automaton remembers from which state it has moved to the current one, and can return to previous contexts (the stack of contexts).
Example: C ++. In the context, the code is a rule: if the character "meets" - go to the context "string" . The context "string" highlights the characters in red and supports the rule: if the character "meets" - return to the previous context .
For each of the contexts, the style by which it highlights the text is set.

The system is very convenient and versatile. The file format is well documented. Perhaps that is why katepart highlights about 2 hundred languages and formats.
The downside, in my opinion, is only that the interpreter, which parses the code based on Syntax Definition, will in most cases be worse in performance than a parser for a specific programming language.

Optimization

When I wrote syntax highlighting, there was no limit to my joy. Everything works, everything is beautiful. All 59 files in different languages from the katepart collection look right and open quickly. Hooray! Who said Python is a slow language ?!
And then I tried to open a large file. Really big. And ~~suddenly~~ it turned out that my parser is not so fast. I had to take on the optimization.

A couple of hours with a profiler accelerated the backlight several times. But it still worked too slowly. And the space for optimization has been almost exhausted. It was possible to speed up the parser by a couple of tens of percent due to the terribly confusing code, but such weather optimization does not.

I began to learn how to write modules in C. It turned out to be not at all difficult, Python for extensions is very friendly.
In the process of writing the parser, a problem arose: there are no regular expressions in C. And I really didn’t want to connect dependencies. The problem was solved due to the fact that the interaction of C-Python works in 2 directions. Python calls C for parsing, and from C a Python function is jerked to check the regular expression.

When I started testing the parser with the extension, it turned out that the performance is not significantly different from the version in Python. I took the profiler again and went to look for the problem.
It turned out that 90% of the time, my parser calls Python to check regular expressions. Well, the hack failed. I had to use an external library. So the component has a single dependency - a regular expression processing library on C pcre . With her, the performance was quite acceptable (the numbers will be lower).

As a result, I was not disappointed in the expediency of using Python. The C parser is about 1/3 of the code base of my component. I think that in terms of labor costs such a hybrid version turned out to be easier than a C ++ solution.

Asynchronous backlight

Most text files are rather small in size. But sometimes I had to edit the source, in which more than 300K lines. No matter how cool the parser developer is, and how quick his language is, the final file will be longer to understand than the user agrees to wait.

katepart highlights code in a GUI stream. And it does it lazily - it highlights as much as it needs to be drawn on the screen. This approach works very well if the file is opened at the beginning. However, if you jump to the end of a large file, the GUI just hangs. I did not accept this approach.

vim and emacs, if necessary, draw the end of a large file and parse the text from the middle. The approach is good because it allows not to block the GUI for a long time when highlighting. But not everything is so simple. Programming languages are sequenced. For example, in order to correctly handle the end character of a multi-line comment, you need to know whether the comment began in the previous line. It turns out that in some cases the parsing from the middle will produce the wrong highlighting (as in the screenshot from vim).

Now, when the increase in clock speeds has shifted to an increase in the number of cores, it is often relevant to optimize the calculations due to the separation by cores. But even here problems arise, and not only because the file needs to be parsed sequentially.
The user is constantly editing the code. The GUI thread handles keystrokes and changes the document. If the document is parsed and highlighted in the streams, the changes need to be synchronized. The Qt toolkit is not everywhere focused on multithreading, I have not found a way to reliably synchronize access to the document. It was necessary to refuse threads.

As a result of the experiments, I got the following solution: the file is parsed and highlighted in the GUI stream by timer. The timer works 20 milliseconds, then returns control to the main loop to process user actions, then it is called again ... If the user opened a huge file and immediately jumped to its end, the file is displayed, but without illumination.
The code can be edited, and the backlight will appear a little later.

Incremental backlight

When a user edits a code, the text should be highlighted again from the place being edited, not from the beginning. And, as a rule, one or several lines change.

In the process of parsing, a block of metadata with the state of the parser is attached to each line of the file. If the text has been modified, metadata is used to begin parsing from a specific line.
The incremental parsing continues until a string is found whose metadata has not changed after the new parsing.

Performance comparison

Comparison of the performance of the backlight is not a thankful business. It depends on many factors: hardware, software versions, language, content of a specific file, phase of the moon, ...
However, without it, the article would not be complete. Therefore, it is necessary to add a section.

Denial of responsibility

As mentioned above, performance depends on many factors, and in other circumstances, the picture may be completely different.
This section is not the main purpose of the article. Measurements were made only on one file in one language, very superficially. Figures, observations and conclusions may contain gross errors.
Therefore, the measurement method is not published; I propose to assume that the information below, I invented myself exclusively for the entertainment of the Habr audience.
The information given in no way can be used to conclude that the text editor X is better than the text editor Y.

I opened a large C ++ file (364121 line) in several editors that are interesting to me, and collected my observations into this table.

Component or editor	Time to highlight the entire file	Blocks GUI	Backlight problems
Qutepart	44 seconds	Never	Open file 3 seconds
katepart	Was killed after 6 minutes	While not highlight as much as you want to display
QScintilla	3 seconds	Never	Brakes when editing
Scintilla	3 seconds	Never	Brakes when editing
Sublime text	23 seconds	When editing, until it updates all the changed backlight
gedit	8 seconds	Never	Brakes when editing
Qt creator	20 seconds	When editing, hangs until it updates all the changed backlight
Ninja IDE	14 seconds	When opening	Highlighted only the first 51 lines. Terribly slow when editing.
vim	Instantly	Never	Parsit file from the middle, in some cases shows the wrong result.
emacs	Instantly	Never	Parsit file from the middle, in some cases shows the wrong result. It is hung up for about a minute when rewinding up.

Detailed version of the table
As you can see, the Qutepart is the slowest to highlight the text. This is natural, since it uses an interpreted language, a bunch of Python-Qt, and interpreted syntax definitions in the form of XML.
On the other hand, high-level language and technology allow us to highlight many languages, not to block GUI and not to show artifacts.
When working with real files in the absolute majority of cases, the file is opened already highlighted, and when editing, the user does not see how the backlight is updated. Therefore, the current state of affairs suits me and I have refused further optimization.

And what happened?

I have a component for editing the Qutepart code and a text editor based on it, Enki .
The component depends on PyQt and pcre. Requires building an extension module for C. For small files, you can do without the extension and without pcre.
Syntax Definition files and code alignment algorithms are borrowed from katepart.
Like the katepart, the project is available under the LGPL.

Today I released the first version of Qutepart and Enki based on it, because I decided that the current version is already better than the QScintilla version. Functionality is not much. TODO-list is big. It is periodically updated due to the wishes of users and becomes less due to the features made.

I will be glad to get feedback from the Habra community!

Source: https://habr.com/ru/post/188144/

All Articles