Roberto Iurusimshi explains how to effectively combine non-modifiable strings.
Despite the fact that the code is written in Lua, the algorithm is suitable for other languages ​​in which the strings cannot be changed.
Prologue
In Lua, the “accumulation” of the resultant string (i.e., a cycle with a body of the form s = s..x) can be extremely resource-intensive. This note describes an efficient way to create a string piece by piece in Lua.
Problem
Suppose that you make a string in parts, for example, reading from a file line by line. A typical code might look like this:
Despite its innocent appearance, such Lua code can cause significant performance loss for large files: for example, it takes almost a minute to read a 350K file.
')
Often this is not a problem. For small lines, a similar cycle is normal. To read the entire file, you can use the "* all" option, which allows you to read the entire file. But sometimes there is no such simple solution. In this case, the only solution is a more efficient algorithm for this problem. Here we show it (algorithm, not the problem).
Decision
The heart of the algorithm is the stack, which stores large lines at the bottom, and small lines come from above. The main invariant of such a stack is similar to the popular (among programmers) “Hanoi Tower”: a line in the stack cannot be on top of a shorter one. When a new line is placed on top of a shorter one, then (and only then) the algorithm merges them together. Combining lines creates a large line that may be larger than its neighbor from the bottom floor. If this happens, the resulting string and the neighbor from the bottom are also merged. These joins go down the stack until the loop reaches a larger line or bottom of the stack.
function newBuffer () return {n=0}
To get the final result, we just need to combine all the lines from top to bottom:
function toString (stack) for i=stack.n-1, 1, -1 do stack[i] = stack[i]..tremove(stack) end return stack[1] end
Using our new data structure, we can rewrite the program as follows:
local s = newBuffer() while 1 do local line = read() if line == nil then break end addString(s, line.."\n") end s = toString(s)
This program reduced the time to read a 350K file from 40 seconds to 0.5 seconds. The read call "* all" is still faster, doing the same in 0.02 seconds).
Explanation
To understand what happens when we use the naive approach, suppose that we are in the middle of the reading process; buff already contains a 50KB string and each 20 byte row. After concatenation operation
buff = buff..line.."\n"
The buff variable already contains 50020 bytes, and the old line has become garbage. After two iterations of the loop, the buff already contains 50040 bytes, and already two lines form more than 100Kb of garbage. Thus, Lua decides, quite correctly, that it is time to start the garbage collector, and frees these 100Kb. The problem is that this happens every two iterations, and Lua will start the garbage collector two thousand times before the loop ends. And even with all these actions, memory consumption will be three times larger than the file size. And even worse, each concatenation must copy the contents of the entire line (50Kb and more) into a new line.
This problem is not only Lua: other languages ​​with a real garbage collector, and where the lines of non-modifiable objects behave also (the most famous of which is Java).
Our initial cycle used a “linear” approach to the problem, combining small lines one after another into a variable – accumulator. A new algorithm avoids this using a binary approach. It combines many small lines with each other, and sometimes combines the resulting large line with other large lines.
PS
Examples are given on a fairly old version of Lua, and I (Xitsa) did not change them, since the meaning of the algorithm is clear enough.