Everyone who tried to program at least the simplest text editor at a low level was faced with the task of organizing memory for storing editable text. The data structure for storing text must meet the following requirements:
- have low memory overhead. Most of the available memory should be used to store text, not service information;
- allow effective insertion and deletion in any place of the text.
To meet these requirements at the same time is not easy. If we consider well-known data structures, such as arrays, lists, trees, stacks, queues, ring buffers, then there is no such structure that would effectively fulfill both requirements. In the case of an array, we have insignificant memory overhead, but the insert operation has complexity O (n), where n is the size of the text being edited. In the case of a list, the complexity of insertion and deletion is O (1), but the memory overhead is several times the size of the actual text. Trees, heaps, ring buffers, associative arrays and other structures are completely inapplicable for storing text in the editor.
There are hybrid solutions when text is stored in a set of arrays, which, in turn, are combined into a list. It would seem that this approach allows combining the advantages of arrays and lists (quick insertion / deletion with low memory overhead). However, this solution is difficult to implement. It also leads to memory fragmentation.
I bring to your attention an efficient data structure for storing editable text that is easy to implement, has constant memory overhead and fast insertion / deletion in an arbitrary place. It also allows you to effectively edit files that do not fully fit into the RAM.
')
Despite the fact that this data structure was discovered long ago and was used in text editors on old computers in the 8-bit era, this secret ancestral knowledge was largely lost and is rarely found in modern editors. Try opening a single-megabyte-by-10 file in Notepad or Far. Inserting and deleting characters will take seconds.
Description
We take a fixed-size memory block (in old 8-bit computers, this block takes all free memory) and places the edited text at the beginning of this block before the cursor, and the text after the cursor at the end of the block. The text is stored in memory in two continuous chunks, and in the gap between them is all the free memory available to the text. In the process of editing the text, we support this invariant: at the beginning of the buffer is the text before the cursor, and at the end - the text after the cursor. Everything!
This data structure is called
Gap Buffer in English terminology. I don’t know the Russian name; only the term “buffer window” is found in Russian
Wikipedia . I do not know, really, to trust such a translation of the term or not.
Analysis
- The operations of insertion and deletion are cheap: O (1) both when inserting a single character and their group (for example, pasting from the clipboard).
- Memory overhead is low: O (1). You need to store only the position of the cursor and the total size of the buffer.
- The element search operation by index is cheap: O (1).
- The operation of copying a section of content (for example, for copying to the clipboard or saving text to a disk) is O (n), where n is the size of the copied section. The complexity is the same as for the array.
In other words, for the above operations, we managed to realize all the advantages that the array provides.
When using Gap Buffer in a text editor, there are costs for moving the cursor. After all, to ensure the invariant, you have to copy the text from the lower filled section of the buffer to the upper one or vice versa. The difficulty of moving the cursor is O (n), where n is the distance that the cursor moves. When moving from the beginning to the end of the text, a noticeable user delay may appear. However, moving from the beginning to the end of the text is a relatively rare operation when editing it. Often there are moves to a character, line or page forward or backward, and then you have to move only a small portion of the text. The delay will be invisible even for 8-bit computers.
Thus, Gap Buffer exploits the property of the text editing process that insertion and deletion always occur in the place where the cursor is located; in turn, cursor movements are usually small.
Variations
1. Shadow write cursor
In its basic implementation, Gap Buffer can still cause unwanted delays. For example, when searching for text or using bookmarks, significant cursor movements can occur. Also, not every movement of the cursor through the text is accompanied by editing. Hence the solution: enter the “record cursor” invisible to the user, which corresponds to the gap in the Gap Buffer. This cursor only moves when the user begins to insert or delete text. With simple movements of the visible cursor in the text or replacement (that is, when the total text size does not change), the recording cursor should not be moved. This eliminates delays in any random movements of the visible cursor when viewing text, using bookmarks, searching, etc.
2. Editing files that do not fit in RAM
In our era, gigabyte-sized RAM is used by very few people who use editors, who do without full text loading into memory. Most of the documents created by people fit into it, so there is simply no point in complicating the editor. However, sometimes it becomes necessary to edit large files that do not even fit into the memory of today's computers. This can be computer-generated data, such as log files or textual representation of a record from any sensors or instruments.
With the help of Gap Buffer, you can effectively implement editing files that are larger than the size of the computer’s RAM. You may notice that Gap Buffer is actually two stacks, one of which stores data up to the cursor, and the second after the cursor. The first of the stacks grows up, and the second - down. Hence the idea: to implement both stacks in the form of files, while storing in memory only those sections that are closer to the top.
A stack in memory can grow up or down, and a stack implemented as a file can only grow up. Only at the end of the file can we add data, or vice versa, delete them from there. If we now go back to memory operations and consider stacks growing in different directions, then it is clear that in the stack growing up, the data are arranged in direct order relative to the sequence of putting them on the stack, and in the stack growing down in the reverse order . So, if we want to implement Gap Buffer in the form of two stacks growing up - we will have to change the order of the elements in the second stack, which stores the text after the cursor. Therefore, the temporary file that contains the data after the cursor should contain them in the reverse order, backwards.
In temporary files, we will not store the entire contents of the stacks, but only their lower part. The upper part will be located in memory, forming a “local” Gap Buffer and allowing the user, as before, to quickly edit the text and navigate through it within certain limits. When moving the cursor through the text outside the area loaded into memory, its part farthest from the cursor will be saved to one of the temporary files, and the other part will be loaded from the other.
When you finish editing the text, we simply move the cursor to its end. All the text falls into the first stack, growing up, that is, the first temporary file. In memory, perhaps, there is still a piece of text - we simply add it to the file. After this, the first temporary file contains the saved text.
You can implement a few more optimizations. When opening a file, for example, the cursor is usually at its beginning. Therefore, all the text should be located in the second stack, that is, we should copy almost all the original contents of the edited file into the second temporary file backwards. This is a lengthy operation. It can be avoided if you postpone the creation of a second temporary file until it becomes necessary. That is, as long as the contents of the tail of the source file completely coincides with the contents of the second stack. During this period, instead of using the second temporary file, we can load information from the source file. To create a second temporary file, thus, you must first move the cursor over the text a distance greater than the size of free RAM, then edit something there, and then go back to the beginning of the text.
The same optimization can be implemented for the first temporary file. Do not create it until it becomes necessary. And it will appear no earlier than the contents of the part of the first stack located on the disk will no longer coincide with the initial part of the source file. You can also use the “shadow write cursor” and thus avoid unnecessary copying operations when moving the cursor through text without editing. Then, if the user moves the cursor far across the text without editing, the speed will be no worse than that of a simple text viewer without fully loading it into memory.
Text editing implemented by the described method without its full loading into memory thus has a moderate overhead in disk space: temporary files require space equal to the size of the text.
Interestingly, even if a large text in principle climbs into RAM, using an editor that does not load it entirely, such a document will open faster. Again, increased user convenience.
Conclusion
This is not the first time that I have come across the fact that some technologies developed for old computers with limited capabilities are forgotten in our time, which leads to the spread of inefficient solutions. I hope that the “secret knowledge of ancestors” described in this article will help the birth of good text editors.
Thanks
I first learned about the described data structure from an
ESL user in a thread on the
zx.pk.ru forum. There was also discussed the organization of memory for a text editor with reference to a computer with limited resources, such as the ZX Spectrum.