Table interface: fast / uncomfortable - slow / convenient

There are questions that, it would seem, can not be resolved. Too often we meet with them in everyday life. But look carefully - and it turns out, no, not resolved. Everybody does differently. And not always good. One such issue is the interaction of the user interface of working with the table and the database management system (DBMS).

The requirements are clear. The data should be displayed quickly, create a minimum load on the DBMS and work with them should be convenient for the user. Solutions seem to be all there too. But all the same, even in very successful projects, technologies have been used that suggest that the developers decided to come up with the “best” solution again.
')
I would like to consider modern approaches to solving this problem and consider whether there is a better option. And, if not, when that is better to use.

Question 1. What is the fastest way to work with a table?

The fastest operation in the database is getting the record by the values of the indexed fields, for example, the following query instantly returns the next record in an ordered set.

select sortField from table where sortField > [ ] limit 1;

where sortField is a field (set of fields) by which the data set is ordered. Typically, such a set of fields includes the primary key, so that the order of entries is always the same.

Hereinafter, I intentionally simplify the look of queries in order to improve readability. It is clear that in real life for a set of fields the request will look more cumbersome.

Thus, the fastest way to display data is to load them in sequence as you scroll. Naturally, in this case, we do not know in advance when the scrolling will reach the end and display the slider, based on the currently loaded records.

In this case, either page-by-page display can be used, such as, for example, in search engines, on Yandex.Market or Google mail, or “endless scrolling”, such as on VKontakte. There are combinations of these approaches, but the principle of working with a database is one.

In the case of paginated viewing, the user is immediately displayed the first page and it is determined whether there is a next one (the next item in the sorted data set is checked) Sometimes there is a check for the following pages. In the case of an infinite scroll, it is assumed that there are only loaded records in the table. As you scroll, additional entries are loaded (if any) and the position of the slider in the scroll bar is adjusted. In any case, there is a sequential reading of records.

For the user, the work is quite logical. It sets some sorting and sorting conditions for records. Further, it usually requires only the upper entries. If the user decided to see what's next, he may well do it by scrolling through several pages. So we, for example, work with search results in search engines.

However, there are drawbacks.

1. When working with tables (especially not very large), using the scroll bar to navigate through all the records is extremely convenient. In this case, this opportunity is lost. Experience shows that most of the tables the user works with do not contain many records.

2. There is no possibility to go to the record. Suppose, when working with my record library, I want to find some song that (I know) someone uploaded right after “Queen. The Show Must Go On. If I enter in the selection conditions “Queen. The Show Must Go On ”, the song I need will not be displayed. I will certainly see “Queen.” The Show Must Go On ”, but I don’t know what was downloaded before or after it. The only option in this case is to scroll through all the entries to the required and see what’s next. A similar problem happens when using the mail server when you need to find a letter in the vicinity of a given one. First you have to find this most specific letter, look at the date, and then set another filter by date.

3. All downloaded data must be kept in memory.

4. If the user selected a record and left it for other operations, then when returning to the table interface, to restore the old view, you will have to reload all records from the very first one.

How serious these inconveniences are is to judge system analysts and, ultimately, users. In different cases, these inconveniences manifest themselves in different ways.

However, overcoming these inconveniences in any case suggests that it is necessary to somehow estimate the total number of records that fall within the sampling range, and to ensure a transition over them. Those. you must immediately show the correct position of the slider on the full scroll bar.

Again, there are two options: a page view and the so-called “live” (live) view. In the latter case, the records are loaded at the time of scrolling. In this case, records that are outside the visible range are forgotten.

To return the required records in most DBMSs, the offset keyword is currently provided, which returns entries with an offset by sorted population. For all those who suffered from the performance of the DBMS, the next question immediately arises.

Question 2. Can the offset keyword be used?

The answer is no.

For many, this answer will seem obvious, for example, for those who have read the PostgreSQL manual ( http://www.postgresql.org/docs/9.4/static/queries-limit.html ): “Strings that are skipped using OFFSET should all equal to the server side; therefore, working with large values of the OFFSET parameter may be ineffective. ”

But in reality, this is not the obvious answer.

Contrary to well-established beliefs, offset works fairly quickly if the table is correctly indexed and the statistics are updated.

Yes, many have tried unsuccessfully to use offset on large tables. It was even worse for those who tried to use the row_number function in MS SQL Server. There are two good articles on this topic. They are dedicated to MS SQL Server ( http://sqlperformance.com/2015/01/t-sql-queries/pagination-with-offset-fetch ) and My SQL ( https://explainextended.com/2009/10/23/ mysql-order-by-limit-performance-late-row-lookups ). The point is that offset should not be applied immediately to the sample:

 select * from table order by attr1, clusteredField offset 1000;

and indirectly through the field on which the clustered index is built

 select * from table where clusteredField in (select clusteredField from table order by attr1, clusteredField offset 1000);

Here it is assumed that a clustered index is built across the clusteredField field and all standard recommendations are applied, for example, not to make the clusteredField field of the “unique identifier” type.

Analysis of this query shows that the processor usage, the number of reads and the total time approximately correspond to the operation of counting the number of records to the required one. This is not surprising, the operation algorithm when using both offset and count is about the same. The only difference is that queries with count are simpler and the query optimizer is less likely to create an inadequate execution plan.

Worse is another: queries that use both count and offset , being fast, are not instantaneous. Counting is counting. Let even statistics be used. Those. on large tables, the user will in any case expect some, perhaps very uncomfortable, time before receiving the necessary entries.

Can I get rid of it? Yes. The algorithm is given in the article https://habrahabr.ru/post/278773 .

Simply put, the search for records is not by number, but by the field values specified for sorting (hereinafter referred to as sorting fields).

For example, if the first last name in a sorted phone directory is Andreev (A is the first letter), and the last is Yakovlev (I am the 33rd letter), then if the user moved the slider to the middle, then most likely you need to show something with the letter P ( 17th letter).

After the user instantly displays the records with the letter P (instantly, since the search is performed on the index), you can start the counting operation and after some time move the slider to where it should be. For the user, this will not be an inconvenience. He is already working with the required entries.

At the same time, you can accumulate a so-called “interpolation” table, which contains pairs: the last name is the real position of the slider. The next time you move the slider, you can take into account the actual position of the letter P.

If several points of the interpolation table are calculated in advance, then a rebound, if any, is minimal and inconspicuous to the user.

You will say that several operations with counting the number of records are a lot. This is not true. Several points can be calculated in one query of the following form:

 select count(*) ,sum(case attr1 < '' then 1 else 0 end) ,sum(case attr1 < '' then 1 else 0 end) from table;

The cost of this request will be slightly higher than just:

 select count(*) from table;

Indeed, all calculations in this case are done within a single index scan.

Of course, if in our directory one Andreev, one Yakovlev, and all the other persons — it just so happens — have a last name starting with “P” (but we do not know this in advance), then a query on all letters of the alphabet will not help, you still need there will be the following iterations in order to “expand” a set of records on “P”. But this is a rare case that will not affect the performance on average.

The described method allows you to instantly display the required records to the user. Thus, the usability becomes significantly higher than when using offset. The load on the DBMS is no different. The only disadvantage is the rebound of the engine, which in reality is minimal.

A more serious disadvantage is the complicated implementation. But such problems of the end user should not worry. High-level developers, frankly, too. That's why development frameworks are needed to solve such problems for developers.

Naturally, using the approach in which the user sees the entire scroll bar at once, in any implementation, involves the use of slow operations (no matter using the keywords count or offset ), which load the DBMS. But you still have to pay for the convenience. In this case, pay not so much. If a table displays many fields, then often the resources expended to count records, even in large sets, are less than a sample.

Moreover, it is necessary to evaluate not the entire data set, but only the part that falls into the user's sample. For example, if there are 10 million records in a table, but more than 1 million records cannot be included in the sampling conditions, then it is necessary to take exactly 1 million records for the evaluation.

We experimented on data from the Federal Address System (FIAS). This is over a million entries.

The experiment was very simple. We implemented two options for working with the database: (1) fast - with sequential data loading) and (2) convenient - in which the full scroll bar and the correct slider position are shown to the user. During the initial display, the 1st method actually works several times faster. However, after we scrolled through several pages of records, it turned out that the load on the DBMS in both cases is quite comparable. I deliberately do not give exact figures in order not to complicate the article. In addition, it is clear that, even with similar operations, the DBMS operation may work differently. Therefore, it is the estimated values that are important: “slightly more resources”, “significantly more resources”, “approximately the same”, etc. In this case, it’s about “slightly more resources”.

In conclusion, I would like to confess that at the beginning of the article, while pointing out the shortcomings of the sequential loading method, an important detail was not indicated. If we supplement this method with pieces of the algorithm for displaying records based on the values of sorting fields, some of the drawbacks can be removed.

It should take into account the following.

The user will still not have a scroll bar, on the basis of which you can estimate both the number of entries and the place of the currently displayed entries. Those. still be uncomfortable.
If we enable the transition to the remote records, then in any case we load the DBMS, i.e. In one way or another, we lose the main advantage of the method of sequential data loading.

Thus, we have two poles (let's call them less radically than in the title of the article): (1) very quickly / satisfactorily convenient, (2) very convenient / fairly quickly. There are options between these poles, but in this case it does not make sense to look for them.

findings

The use of scrolling, in which the user immediately sees the position of the slider, even on large amounts of data (millions of records in the sample) when positioning by the values of the sorting fields, ensures quick and convenient work with the table with an acceptable penalty on the performance of the database management server.
The use of solutions that ensure consistent data download is justified only if the search is carried out in huge amounts of data: archives of letters, music library, etc.
The use of the keyword offset is impractical because the method of displaying records is worse than searching records by the values of sorting fields, gives the same convenience in terms of navigation and does not give any performance advantages.

PS In Microsoft Dynamics NAV, in versions prior to 2013, when working with tables, the full scroll bar was displayed, the records for display were searched for by the values of sorting fields. Starting from version 2013, an algorithm is used in which data is loaded sequentially. One gets the feeling that Microsoft sacrificed the convenience of 99% of users in order to expand the use of NAV by 1%.

It remains to hope that Microsoft will not refuse from 3d in favor of 2d on Xbox. More demanding users.

Source: https://habr.com/ru/post/280157/

All Articles

Table interface: fast / uncomfortable - slow / convenient

Question 1. What is the fastest way to work with a table?

Question 2. Can the offset keyword be used?

findings

More articles: