Automatic calculation of column width

Task

The task sounds simple - print the table. Print it so that it looks beautiful and, if possible, does not crawl.

After some thought, it was decided to use FOP to generate PDF. The hitch is that Apache FOP does not support table-layout:auto , that is, when building a table, you must manually set the width of the columns (it’s good that you can set the relative width in percent). If you make all columns of the same width, the table will look somewhat inelegant. It turns out that you will have to calculate the width manually.

The basic idea is that the width of the columns should be chosen in such a way as to reduce the number of line breaks inside the table cells as much as possible.
')

Decision

I solved the problem in C ++ using the Qt framework, so there will definitely be some references to the language and Qt, but in general, I will try to avoid the code and describe the main idea of the solution.

Data preparation

First, we decide what data will be required. The text in each cell of the table is divided into words, for each word we calculate how much space it takes on paper. Thus, for each cell we will store an array of word lengths of its text.

In addition, you must know the width of the paper, and expressed in the same units as the length of the words.

I used QFontMetrics :: width to determine the length of words. Thus, the word length was obtained in pixels. Then in order to determine the width of the paper in pixels, it is enough to multiply its width in inches on the screen DPI, which can be obtained using the QPaintDevice :: logicalDpiX function .

It may turn out that the width of the paper is not enough to fit the table, even if the columns are compressed so that each line will fit only one word.

The probability of such a scenario is small (if you are not a German and not a tester, of course), so I did not develop any complex algorithm for handling such cases. Just check that the minimum width of the table, which is defined as the sum of the maximum lengths of words in the columns, is less than the paper size.

minimal
width
tables

determined by
by sum
lengths
the most
long
of words
in columns

If the table does not fit, you have to cut the long words. Since, once you have decided to cut the word, you can cut it anywhere, we will consider the effective length of such cropped words to be zero.

So, we will cut the longest word in the table until the table fits on the paper.

At this data preparation can be considered complete, and you can go to the algorithm for calculating the width of the columns.

Payment

At the beginning of the article I identified the main idea of the solution. Now it is time to express it more formally. In fact, it is necessary to display the table on paper in such a way as to reduce the area of the table as much as possible.
The area occupied by the table can be defined as the sum of the areas of its columns:
$S = \ sum_ {k = 1} ^ {n} {h_k w_k}.$
Here

and

- height and width of the column, respectively.

Of course, this is not quite true, but such a model in practice gives quite satisfactory results.

Let's write out what is the height of the column:

where m is the number of rows in the table, lh is the height of a line of text, p is a certain coefficient describing the height of indents, wc is the number of line breaks.

Since it is impossible to influence the number of lines, line height and indents, they can be taken out as a separate coefficient A.

But the number of line breaks depends on the width of the column. Thus, the height of the column can be rewritten in the form:

And the task is to minimize the function

Yes, yes, I was not mistaken, A and B will not depend on k , since B is determined by the height of the line, and A , in addition, by the size of indents and the number of lines.

S can be rewritten as:

Now I will do a lot of rudeness.

- this is the ideal table area, when there are no transfers at all . But since we are not interested in optimizing the width of the table (in the end, we still stretch the table across the entire page), we replace

, that is, on a permanent. Now everything is simple: for the task of minimization, the coefficients G and B do not play any role, so you can safely throw them out and find the minimum of the function

To find the minimum of the function, we take the gradient descent method as a basis, but we will not descend in the direction of the gradient, but along that coordinate, the change in function (partial derivative multiplied by the change in coordinate) along which has the smallest (it is also the largest in module) value .

Did you say the derivative?

In fact, the function of the number of carries, depending on the width of the column, is not particularly smooth. This is a step function that tolerates discontinuities at the points of change in the number of carries. It looks something like this:

Such a function has obvious problems with derivatives, but we will proceed simply: we will consider not the function itself, but some monotonically decreasing function, which at the points of change in the number of carries coincides with the right limit of the original function.

As a starting point, we take a point with the smallest possible column widths.

Algorithm

For each column:

We construct arrays of points at which the number of hyphenation in this column changes - m _k .
The arrays of values of the smooth function of the number of carries at these points (I defined the smooth function in the previous paragraph) - wc _k .

In order to form these arrays, we use the data on the length of words in each cell of the table prepared at the first stage.

Let's add arrays (there are as many of them as there are columns) with starting points.

As a result, for the column

minimal
width
tables

we get the following data structure:

m ₀	wc ₀
eleven	2
13 (“width” and “tables” - on one line)	one
24	0

Similar structures will be obtained for the remaining columns.

Now run the iteration process:
While the sum of the widths of the columns is less than the width of the page:

Calculate the change in the area function ( U ) when moving along each coordinate using the formula

(This is a well-known formula for taking the differential of the product: d (AB) = B d A + A d B)
If there is only one element left in an array, then there are no line breaks in this column ( wc _k [0] == 0 ). We do not take such a column into consideration.
If none of the arrays were considered, in all the columns there are no line breaks, we finish the calculations.
Increase the width of the k- th column, in which the value of d _k is the greatest. Remove the first elements of the arrays wc _k and m _k for this column.

Finally, we convert the resulting column widths from absolute units to percentages. Thereby, let's get rid of the font metric that was used.

Total

As a result, the table is exactly what I would have done with manual formatting.

PS
The article does not pretend to mathematical rigor, but it shows where a little mathematics in applied programming can be useful to you.

Source: https://habr.com/ru/post/167507/

All Articles