String distribution and access to Teradata (Primary Index)

Previous Post: What is Teradata?

How does Teradata distribute strings?

Teradata uses a hashing algorithm for randomly distributing table rows between AMPs (advantages: distribution is the same, regardless of the amount of data, and depends on the content of the row, not the demographics of the data)
Primary Index determines whether the rows of the table are evenly or unevenly distributed between AMPs.
A uniform distribution of the rows of the table leads to a uniform load distribution.
Each AMP is responsible only for its own subset of rows of each table.
Strings are placed in an irregular manner (advantages: support for maintaining order is not required, order is not dependent on any submitted query)

Primary Key (PK) vs. Primary Index (PI)

Primary Key (primary key) - this is the convention of the relational model, which uniquely identifies each line.
The Primary Index is a Teradata convention, which defines the distribution of rows and access.
A well-designed database contains tables in which PI is the same as PK, as well as tables in which PI is defined in columns other than PK and may affect access paths.

Primary Key (PK)	Primary Index (PI)
Logical data modeling concept	Mechanism for row allocation and access
Teradata does not need a PK definition	The table must necessarily have one PI
No limit on the number of columns	Can be from 1 to 64 columns
PK is defined in a logical data model.	PI is defined when the table is created.
Value must be unique	The value does not have to be unique
Uniqueness identifies each line	Used to place a string in AMP
Values should not change	Values are subject to change (updated)
Cannot be null	May be NULL
Not applicable to the access path	Determines the most efficient access path.
Selected for logical correctness	Selected for the physical performance of the compound values

There are two types of PIs: UPI (Unique Primary Index) and NUPI (Non-Unique Primary Index). When using UPI, lines are distributed evenly among AMPs, and when using NUPI, lines with the same index values belong to the same AMP-y.

Creating tables with UPI and NUPI:

CEATE TABLE Table1 ( Col1 INTEGER, Col2 INTEGER ) UNIQUE PRIMARY INDEX (Col1);

 CEATE TABLE Table2 ( Col1 INTEGER, Col2 INTEGER ) PRIMARY INDEX (Col2);

Hash allocation of strings

The Primary Index value is passed to the hashing algorithm, which ensures uniform distribution of unique values among all AMPs. The algorithm produces a 32-bit hash value. The first 16 bits (Hash Bucket Number) are used as a pointer to a hash map (Hash Map). Hash values are calculated using a hashing algorithm. The hash card is uniquely configured for each system; it is an array that binds the DSW (statistics key) to a specific AMP. Two systems with the same number of AMPs will have the same hash card. Changing the number of AMPs in the system requires changes in the hash map.

NoPI Table

NoPI- table is a table without Primary Index (feature of the 13th version of Teradata). In this case, new rows are always added to the end of the table and are never added to the middle of the hash sequence.
Rows will also be distributed between AMPs. A new random code will determine which AMPs will receive strings or groups of strings. Within AMP, rows are simply added to the end of the table. They will have a unique identifier, which increases the uniqueness of the value.

Benefits:
The table will reduce the offset in intermediate tables ETL (Extract, Transform, Load — Extract, transform, load) that do not have a Primary Index.
Downloads (FastLoad and TPump Array Insert) in the NoPI intermediate table are faster.

PPI and MLPPI

PPI (Partitioned Primary Index) is a partitioning mechanism that is used to improve performance for large tables when sending queries that specify a range limit.
Data distribution using PPI is based on PI: Primary Index -> Hash value -> AMP definition, which will get the string. With NPPI (No Partitioned Primary Index), the rows in the AMP are arranged in the order of the hash rows. When using PPI, strings are first arranged by partitions, and then in the order of the hash strings.

MLPPI (Multi-Level Partitioned Primary Index) allows for multipartitioning or the use of an uncompressed Join Index. In multilevel partitioning, each partitioning level is determined independently using the RANGE_N or CASE_N expressions .

The goal of partitioning (partitioning) is to reduce the load on the system if the data in the table is rarely or even never updated (for example, various logs).

PS
Just today, we published an article Physical Design of Storage Structures in the Teradata DBMS on the Teradata blog, so I will not write about the Secondary Index .
I hope the company will continue the regular series of articles, and I will not have to translate the documentation from English.

Source: https://habr.com/ru/post/209166/

All Articles

String distribution and access to Teradata (Primary Index)

Primary Key (PK) vs. Primary Index (PI)

Hash allocation of strings

NoPI Table

PPI and MLPPI

More articles: