"There are three kinds of lies: lies, damned lies, and statistics" Benjamin Disraeli, the 40th Prime Minister of Great BritainStatistical information about data in a DBMS plays an important role in system performance. Taking into account a number of comments to the
previous article , we decided to describe why we need statistics in the Teradata DBMS, how it is taken into account by the optimizer, what it influences, and to give some practical tips on collecting statistics for those who work with our DBMS.
What statistics are and why they are needed
Statistical information about the data is needed, first of all, by the query optimizer. Knowing the statistical characteristics of the data and the criteria by which the user requests it, the optimizer can choose one way or another to retrieve that data. A simple example: a user requests all residents of the city of Belozersk. Suppose we have an index on the city-code column. If the optimizer knows that about 10 thousand people live in Belozersk (out of roughly 143 million residents of the Russian Federation), it will prefer index access, since that will be much faster than reading the entire table. On the other hand, if you select data for one or several large cities, which together yield tens of millions of records, the result comes faster not through the index but by reading the entire table. This is a very simplified example, but it is sufficient to show how statistics can influence the decisions made by the optimizer.
In the previous article, we mentioned that there are no optimizer hints in the Teradata DBMS. This means that the query optimizer makes all decisions based on the objective information available to it. The calculation takes into account the number of AMPs in the system, the number of nodes, the number and types of processors, the memory currently available, the types of disks and much more, including data demographics. Demographic information, which we call statistics, includes the number of rows in a table, the average row size, the number of rows with the same column value, the number of NULLs, and so on. Knowledge of these parameters also allows the optimizer to correctly calculate the size of the temporary memory (spool) allocated to the query for data transformations.
In addition to statistics on individual columns, statistics can be collected on indexes. The principles are exactly the same.
Data-distribution statistics often need to be tracked not just for a single table column, but for combinations of columns. For example, suppose you often query for people whose last name is Ivanov and who live in that same Belozersk. If you collect statistics separately on the last name and separately on the city, this will not give the optimizer good information: there are 1 million people with the last name Ivanov and 10 thousand residents of Belozersk, but the combination yields far fewer rows. Therefore, statistics need to be collected on the combination of columns. For Teradata, the order of the columns does not matter at all: if you collect statistics on "Surname, City" and on "City, Surname", the result will be the same (to understand why, just look again at which demographic indicators are used).
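As a sketch, multicolumn statistics for the example above could be collected roughly like this (the table and column names are hypothetical, and the classic pre-Teradata-14 syntax is shown):

```sql
-- Hypothetical table: people(last_name, city)
-- Single-column statistics alone are not enough for the combined predicate
COLLECT STATISTICS ON people COLUMN last_name;
COLLECT STATISTICS ON people COLUMN city;

-- Multicolumn statistics let the optimizer estimate
-- WHERE last_name = 'Ivanov' AND city = 'Belozersk' accurately;
-- the column order inside the parentheses does not matter
COLLECT STATISTICS ON people COLUMN (last_name, city);
```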
How statistics are stored in the DBMS
The collected statistics are stored in the data dictionary (DBC.TVFields, DBC.Indexes, and DBC.StatsTbl in version 14) and, from the DBMS point of view, are interval histograms. The more intervals a histogram has, the more accurately it can reflect the distribution of the data. In Teradata 13.10, the maximum number of intervals in a histogram is 200, which gives approximately 0.5% of the data in each interval. Here you may recall the epigraph to this article and realize that statistics give only an averaged picture. The number of rows in each interval may vary. For example: a table has 1 million rows, which means that on average there will be 5000 rows per interval, given 200 intervals in the histogram. Suppose that in one of the intervals one value accounts for 4900 rows, and the next value for 300 rows. It may happen that these rows end up in one interval, and the estimate of the number of rows in it will be 4900 + 300 = 5200. Or it may happen that the 300 rows fall into the next interval, and then the previous one will hold only 4900.
If a value occurs in more than 0.25% of the rows, it is stored in special intervals set aside for frequently occurring values. To be fair, it should be said that the number of intervals grows from version to version: in Teradata 12 it was increased from 100 to 200, and in Teradata 14 the default number of intervals is 250, and it can be raised to 500.
Depending on the distribution of values, one of three types of histograms is used to store the statistics:
- Equal-height histogram. Each interval holds the same number of values. It is applicable when the distribution of values is more or less uniform, without obvious skews.
- High-biased histogram. Each interval stores a maximum of two values. This type is used only when the distribution has a significant skew.
- Compressed histogram. A histogram that combines equal-height intervals with high-biased intervals holding only two values. If anyone is curious to see, down to the bytes, what is stored in the statistics and how, this is clearly shown in the diagrams here.
Statistics collection
The process of collecting statistics in Teradata, as in other vendors' databases, is started with the COLLECT STATISTICS command. Each statistic defined on a table requires a separate pass over the table. Starting with version 14, the collection of several statistics can be combined into a single pass over the table. For each statistic collected, the following steps are performed:
- A request of the following form is executed:
SELECT x1, y1, COUNT(*) AS cnt FROM t_coll_stats GROUP BY 1,2;
When executing this query, any alternative access paths to the table data can be used, for example, JOIN/HASH indexes.
- Using the resulting aggregate, Teradata performs the following steps:
- Builds a table of intervals
- Using this table, determines the histogram intervals
- Sends the histogram structure to the AMPs
- Each AMP reads the aggregated spool and fills in its own histograms
- The system consolidates the local histograms
- The consolidated histograms are written to the statistics dictionaries.
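The single-pass collection mentioned above can be sketched in the Teradata 14 syntax roughly as follows (the statistics names after AS are hypothetical; the table and columns are taken from the query above):

```sql
-- Teradata 14+: several statistics gathered in a single table scan
-- instead of one pass per statistic
COLLECT STATISTICS
       COLUMN (x1)     AS stat_x1,
       COLUMN (y1)     AS stat_y1,
       COLUMN (x1, y1) AS stat_x1_y1
ON t_coll_stats;
```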
For large (millions of rows) and very large (tens of billions of rows) tables, which are very common in data-warehouse projects, this process can take hours despite the parallelism. For this reason, statistics collection must be approached responsibly, without loading the system unless there is a real need. To prevent an untrained user from accidentally loading the system, Teradata has a special privilege for collecting statistics.
In addition to organizational constraints, to minimize the load on the system you can collect statistics using sampling (COLLECT STATISTICS USING SAMPLE, with an optional percentage of rows). In this case Teradata first executes a query for a random set of rows; TOP n or SAMPLE (for partitioned tables) can be used for this. The resulting rows are then used for the subsequent aggregation and creation of the histograms. However, sampling can be used only when you are confident that the examined part of the data (the sample) adequately represents the demographics of the entire table for the columns on which you plan to use sampled statistics. If that is not the case, you need to collect full statistics.
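A minimal sketch of sampled collection, assuming a hypothetical table orders with a nearly unique column customer_id:

```sql
-- Classic syntax: build the histograms from a sample of the rows
COLLECT STATISTICS USING SAMPLE ON orders COLUMN customer_id;
```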
How statistics are used
So, statistics are collected. How will the optimizer use them? Below is a diagram describing the process of optimizing a query based on statistics:

When executing a query, the optimizer:
- Searches for the table header in the cache. If the table header is found in the cache, dynamic AMP sampling is performed if necessary; if it is not found, the header is read from disk and dynamic sampled statistics are collected.
- Searches for the statistics needed for query optimization. Teradata looks for the required statistics in the cache; if they are not found there, a request is made to read the statistics from disk and cache them.
- Uses the statistics to optimize the query. If statistics were not collected or are out of date, the optimizer can use statistics obtained by dynamic sampling. This is valid only for index columns; otherwise heuristics are used.
In the process described above, an important role is played by statistics obtained by dynamic sampling (random AMP sampling). As bebop quite rightly pointed out in one of the comments on our first article:
"Guys, no statistics are ever complete - the optimizer does not have complete information about how many records each subquery in the plan returns, especially if the selection criteria are quite complex."
This is especially true for data-warehouse applications, where a significant share of the queries are ad hoc and it is impossible to foresee all the necessary statistics in advance.
Teradata always collects statistics by dynamic sampling; it cannot be disabled. The process is as follows:
- The AMP whose data will be used is selected by hashing the identifier of the table being sampled.
- The master index is read to obtain the list of cylinders containing the table's data.
- The number of cylinders with data is counted.
- One of the cylinders is selected at random, and the list of blocks containing the table's data is read from it.
- The number of blocks is counted.
- A data block is selected at random, and the number of rows in it is counted.
- The number of rows in the table is then estimated by the formula: rows in table = rows in sampled block × blocks per cylinder × cylinders per AMP × number of AMPs
This method collects a smaller set of statistics than the traditional approach:
- Number of rows
- Average number of rows per value
For indexes only, the following can additionally be collected:
- Average number of rows per index value
- Average index size per AMP
- Number of unique values
This method of collecting statistics is characterized by an extremely short execution time. In addition, these statistics are stored in the cache (by default, for no more than 4 hours) and are not rebuilt on every query execution.
Another purpose of this method is to detect staleness of the statistics collected by users. Staleness is determined by comparing the number of rows in the table obtained by dynamic sampling with the stored value. If the deviation exceeds 10%, the statistics are considered outdated. When statistics are identified as outdated, Teradata can compensate by extrapolating the statistics collected by the user.
Considerations on when to use each method of collecting statistics are summarized below:

Full statistics
Specifications:
- Collects all statistics
- May take considerable time
- The most complete and accurate statistics
- Statistics are stored in the data dictionary
Application:
- The best choice for columns or indexes with a significant skew
- Recommended for tables with fewer than 1000 rows per AMP
- Recommended for columns with a small number of unique values
- Recommended for most NUSIs, PARTITION columns and other columns used in predicates
- Recommended for all indexes and columns for which sampled statistics do not provide satisfactory accuracy

Sampled statistics
Specifications:
- Collects all statistics, but without reading every row in the table
- Significantly faster than full collection
- Statistics are stored in the data dictionary
Application:
- Recommended for unique or nearly unique columns and indexes
- Recommended for very large tables, in order to save resources
- Not recommended for tables where the number of rows is less than 20 × the number of AMPs in the system

Dynamic AMP sampling
Specifications:
- Collects a smaller set of statistics
- Very fast execution
- Statistics are cached
- Collection is automatic
- Automatically refreshed after batch updates (INSERT ... DELETE) when the table size changes by more than 10%
Application:
- Suitable for estimating the size of a table with very little or no skew, when the table has significantly more rows than there are AMPs in the system
- Collects reliable statistics for NUSIs if they have only a slight skew and the table contains significantly more rows than the number of AMPs in the system
- An indispensable mechanism for protecting against missing or outdated statistics
In conclusion, we would also like to mention Teradata's mechanism of statistics inheritance. Statistics can be inherited by child objects from parent objects (TABLE -> JOIN INDEX) and vice versa (TABLE <- JOIN INDEX). This mechanism also helps minimize the system resources spent on collecting statistical information, without loss of performance.
Practical recommendations for collecting statistics
So where and how should statistics be collected? The basic rule is to collect only those statistics that are relevant to your queries.
You may well have a tricky query for which you will still have to work out which statistics are needed. But there are a number of typical recommendations for collecting statistics, namely:
Collect full statistics
- on non-indexed columns used in predicates (WHERE conditions);
- on join columns, if the column is non-unique;
- on all Non-Unique Secondary Indexes (and remove unused NUSIs);
- on Unique Secondary Indexes / Unique Primary Indexes that are used in predicates with a condition other than equality;
- on Non-Unique Primary Indexes;
- on all indexes and columns used in the predicates of small tables (fewer than 100 rows per AMP);
- on the PARTITION column of all tables, whether partitioned or not (these statistics are collected very quickly).
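For instance, PARTITION statistics are collected like any other column statistics (the table name here is hypothetical):

```sql
-- PARTITION is a system-derived column; statistics on it are cheap
-- to collect and useful even for non-partitioned tables
COLLECT STATISTICS ON sales COLUMN PARTITION;
```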
You can rely on Dynamic AMP Sampling and not collect statistics
- if a Unique Secondary Index / Unique Primary Index is used in predicates only with an equality condition;
- if a uniformly distributed NUPI is used for the join.
Use sampled statistics (USING SAMPLE)
- for columns of unique indexes;
- for columns that are close to unique (uniqueness > 95%).
Collect multicolumn statistics
- for a group of columns that are often used together in predicates with the “equal” condition, if the first 16 bytes of the values of the grouped columns are sufficiently unique;
- for groups of columns used in joins or aggregations, provided the column values are correlated with one another.
Additional Information
For tables with a partitioned primary index, it is always recommended to collect statistics on:
- the PARTITION column. This gives the optimizer information about the number of partitions and the number of rows in them.
- the partitioning column. This determines the number of rows per value when the partitioning column is used in a query predicate.
For tables with a partitioned primary index, if the partitioning column is not part of the primary index, also collect statistics on:
- (PARTITION, PI). These statistics are most important when a PI value may be present in several partitions; they allow the optimizer to estimate the cost of sliding-window and rowkey-based merge joins, as well as of dynamic partition elimination.
- (PARTITION, PI, partitioning column). These statistics give the number of unique values for the combination of PI and partitioning column after partition elimination has been performed.
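The two recommendations above could look roughly like this for a hypothetical PPI table sales with primary index store_id, partitioned by sale_date:

```sql
-- (PARTITION, PI): helps cost sliding-window / rowkey-based merge joins
COLLECT STATISTICS ON sales COLUMN (PARTITION, store_id);

-- (PARTITION, PI, partitioning column): unique-value counts after
-- partition elimination
COLLECT STATISTICS ON sales COLUMN (PARTITION, store_id, sale_date);
```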
Dynamic AMP sampling can be performed with the participation of all AMPs instead of one (the default). For small tables (fewer than 25 rows per AMP), all AMPs are sampled automatically. When using all-AMP sampling, keep in mind that:
- Dynamic all-AMP sampling gives a more accurate result for tables with a NUPI. This matters when there are no statistics on the NUPI and the NUPI has an uneven distribution.
- Statistics for any column of the table are extrapolated if the optimizer detects table growth. Growth is detected by comparing the number of rows obtained by sampling with the saved statistics. If dynamic single-AMP sampling gives an inaccurate result, it is recommended to use dynamic all-AMP sampling or to refresh the statistics on the PARTITION column.
- Query parsing time may increase. However, the statistics are stored in the data-dictionary cache and are not re-collected until the cache is reset (periodically or for other reasons), so all-AMP dynamic sampling will not increase the parsing time of every query.
For temporal tables, all of the above recommendations are also relevant.
How often should statistics be collected? It all depends on how quickly the data changes over time. As a rule, if the data in a table has changed by more than 5-10%, you should refresh the statistics on that table so that the optimizer knows about the changes.
Separately, note that Teradata does not update statistics automatically behind your back: you control when and how statistics collection is started. These recommendations provide a basis. The complete process of optimizing physical structures ("physical design", as we call it) is quite a creative one and consists of more than just working with statistics. But that is the subject of a separate article :)