Continuing our series on physical modeling techniques in the Teradata DBMS, we want, as promised in the previous article, to talk about additional techniques that are not directly related to ordinary indexes (although we will touch on specific index types worth knowing about).
In addition to the usual indexes, Teradata offers a number of specific and sometimes subtle techniques that can be used in physical modeling. We describe most of them below, and if you have questions or want to learn more, we will be happy to answer in the comments.
Join Index
This is a special kind of index: a stored and automatically maintained result of a SQL query. The index gets its name because the query may include table joins. But there may be no joins at all, in which case the index is called a Single Table Join Index. This technique is worth considering when there are frequent queries that join several tables and/or perform aggregations. When the DBMS receives a user query that can be answered using the Join Index data alone, the optimizer will most likely prefer this access path. There are also ways to help the optimizer join the Join Index back to the base table if, for example, the index does not contain all the fields the user requested. This is done by including the ROWID field in the Join Index; the DBMS can then, when necessary, perform an efficient join between the index and the base table.
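As an illustration (the table and column names here are hypothetical, not taken from the article), a multi-table aggregate Join Index and a Single Table Join Index that carries ROWID might look roughly like this:
-- Hypothetical aggregate join index: pre-joins and pre-aggregates sales by department.
CREATE JOIN INDEX Sales_By_Dept_JI AS
SELECT D.Dept_Number, D.Dept_Name, SUM(S.Amount) AS Total_Amount
FROM Sales S INNER JOIN Department D ON S.Dept_Number = D.Dept_Number
GROUP BY D.Dept_Number, D.Dept_Name
PRIMARY INDEX (Dept_Number);
-- Hypothetical single-table join index: ROWID lets the optimizer join it back
-- to the base table when a query needs columns that are not in the index.
CREATE JOIN INDEX Employee_ST_JI AS
SELECT Dept_Number, Last_Name, ROWID
FROM Employee
PRIMARY INDEX (Dept_Number);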
Indexes can be limited not only in width (the number of base-table columns they include) but also in depth. When creating a Join Index you can specify a WHERE condition, so that only the necessary data is included in the index. This method suits queries with fairly specific filtering conditions: those conditions can be built into the index, reducing its volume and the resources spent on it. Keep in mind, however, that the optimizer will consider such an index only for queries whose filtering conditions are the same as, or stricter than, the conditions used when the index was created.
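A minimal sketch of such a sparse Join Index, again with hypothetical names:
-- Hypothetical sparse join index: only rows matching the WHERE condition are stored.
CREATE JOIN INDEX Active_Customers_JI AS
SELECT Customer_Id, Last_Name, Status
FROM Customer
WHERE Status = 'Active'
PRIMARY INDEX (Customer_Id);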
In addition, since a Join Index is almost a table in its own right, you can build secondary indexes on it and collect statistics on it.
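For example (using the hypothetical Active_Customers_JI defined above), that might look like this:
-- Secondary index and statistics on a join index.
CREATE INDEX (Last_Name) ON Active_Customers_JI;
COLLECT STATISTICS ON Active_Customers_JI COLUMN (Customer_Id);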
Hash-ordered NUSI
By default, a NUSI index is distributed across the AMPs in the same way as the base table on which it is built. This gives an even distribution of the index across the AMPs and makes it easy to use. Within each AMP, the NUSI is sorted by the hash value of all its columns. But the index can also be created so that it is sorted by the hash value of one specific field included in the index. Such an index can significantly speed up queries with equality conditions: the value compared with the index column is hashed, and that hash is quickly located in the index, which is sorted by it.
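A minimal sketch, with hypothetical table and column names:
-- Hypothetical hash-ordered NUSI: rows within each AMP are ordered by the hash of Last_Name.
CREATE INDEX (Last_Name, First_Name) ORDER BY HASH (Last_Name) ON Employee;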
Value-ordered NUSI
Besides sorting by hash, you can have the index sorted by the value of a specific field. Such indexes are well suited to optimizing queries whose filter conditions are ranges or inequalities. When the index is created, it is sorted by the specified field and stored on disk in that form; the advantages of searching for ranges in a sorted list hardly need explanation.
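A sketch with hypothetical names (the ORDER BY VALUES column must be a single numeric or date column of four bytes or less):
-- Hypothetical value-ordered NUSI: rows within each AMP are ordered by Txn_Date,
-- which speeds up range and inequality predicates on that column.
CREATE INDEX (Txn_Date) ORDER BY VALUES (Txn_Date) ON Sales;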
Intentionally "skewed" data
An MPP system with a Shared Nothing architecture performs best when data is distributed as evenly as possible. This topic has received plenty of attention in this and other articles. But there are cases when it makes sense to create a data "skew" intentionally, and this will have a positive impact on performance.
For example, suppose you have a system with 144 AMPs and a small reference table of, say, 200 records distributed more or less evenly. When you fetch data from this table, all 144 AMPs go to their disks to extract one or two records each. Does it make sense to bother that many AMPs for such a trivial operation? In such situations you can create an artificial data "skew": add a dummy column to the reference table, make it the Primary Index, and fill it with a single value. Then, when the table is accessed, only one AMP does the work; it quickly extracts all the records and redistributes them to the other AMPs. This approach cannot be recommended for every such case; just keep in mind that it is possible. Whether it will be beneficial in your particular case is a separate question.
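A minimal sketch of such a deliberately skewed reference table (the names are hypothetical):
-- Hypothetical reference table: every row gets the same PI value,
-- so the whole table lands on a single AMP on purpose.
CREATE TABLE Small_Ref
( dummy_pi   BYTEINT DEFAULT 0 NOT NULL
, ref_code   CHAR(2)
, ref_name   VARCHAR(100)
) PRIMARY INDEX (dummy_pi);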
Referential integrity
Teradata, like other relational DBMSs, supports referential integrity, but it has a number of features worth mentioning. The first is that the columns of the parent table being referenced do not have to be declared as a Primary Key. Teradata only requires that these columns form a unique index, primary or secondary. Defining a PK remains an available option, because Teradata physically implements any uniqueness constraint as a unique index.
The remaining features relate to individual types of referential integrity.
Standard RI
Standard referential integrity checking. The check is executed for every row being inserted, updated or deleted. It is added with the following command:
ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept) REFERENCES Department (Dept);
A feature of this type of referential integrity in Teradata is that when it is added to existing tables, the constraint will be created and activated even if its conditions are violated (you will not be able to perform operations that violate referential integrity). At the same time, Teradata automatically creates an error table, named after the child table with the suffix "_0", into which it places the rows of the child table that did not satisfy the constraint at the moment referential integrity was enabled. Resolving these problem links and removing the error table is then the user's responsibility. This implementation makes it possible to enable and activate referential integrity without waiting for the table linkage problems to be resolved, to detect problems easily (if the error table exists, there are problems), to judge their scale, and to take corrective measures (all problem rows are in the error table).
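For example, after enabling standard RI on the Employee table above, a quick check of the automatically created error table might look like this (the "_0" suffix follows the rule described above):
-- Rows of Employee that violated the constraint at activation time
-- are copied into the error table Employee_0.
SELECT COUNT(*) FROM Employee_0;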
Batch RI
The implementation of this type of referential integrity in Teradata has no special features. The check is performed after the DML statement completes; detected violations cause the entire transaction to be rolled back, and violations that exist at the time the constraint is created cause an error. It is added with the following command:
ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept_Number) REFERENCES WITH CHECK OPTION Department (Dept_Number);
Soft RI
This type of referential integrity is specific to Teradata. Either of the types described above helps the optimizer but at the same time increases the load on the system by honestly performing the checks. Soft RI could also be called "trusted referential integrity", because adding it to tables does not switch on any mechanism for checking that the data is actually related. Creating this constraint only adds information to the data dictionary about how tables A and B are related. Why, you may ask. With good data quality, you give the optimizer more information while avoiding costly referential integrity checks. It is added with the following command:
ALTER TABLE Employee ADD CONSTRAINT fk1 FOREIGN KEY (Dept_Number) REFERENCES WITH NO CHECK OPTION Department (Dept_Number);
What does RI give to the optimizer?
A referential integrity constraint defined between tables (of any of the types above) allows the optimizer to perform Join Elimination, that is, to skip joining the tables when executing a query that does not reference columns of the parent table in its SELECT list or WHERE clause. For example, suppose we have a view:
REPLACE VIEW EmpDept AS SELECT Employee_Number, Last_Name, First_Name, E.Dept_Number, Dept_Name FROM Employee E INNER JOIN Department D ON E.Dept_Number = D.Dept_Number;
where Department is the parent table and Employee is the child table, with referential integrity defined between them.
When the following query is executed, the join is not performed, since the query does not reference any columns of the parent table:
SELECT Employee_Number, Last_Name, Dept_Number FROM EmpDept;
Note: when using Soft RI, be sure of the quality of your data, otherwise join elimination may lead to an incorrect result.
Data compression
As mentioned above, one of the main tasks in optimizing the physical data model is reducing the number of I/O operations. Compressed data takes up less disk space and can be read with fewer block reads. But the data then has to be decompressed, which requires additional resources.
Data compression can be implemented both in software and in hardware. We will not consider hardware compression within the scope of physical design and will dwell on software compression.
Note that compression makes sense for tables that occupy a truly significant amount of space. If a table is small, you are unlikely to notice any significant benefit from compressing it.
Compressing NULL values
Let us agree right away: in this section an empty value means NULL and nothing else. Philosophical debates about NULL being "nothing" rather than an empty value are welcome, but in the comments.
If you have a table with a large percentage of NULLs in fixed-size columns, then, given a large number of rows, it makes sense to think about compressing the NULLs. This is done very simply: in the DDL for creating the table (or in an ALTER), the keyword COMPRESS is specified after the column attributes. The system will then mark every NULL in such columns with a special presence bit, meaning roughly "there was a NULL here, but we squeezed it, and it was gone".
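A minimal sketch, with hypothetical table and column names:
-- COMPRESS without a value list compresses NULLs in the column.
CREATE TABLE Customer
( Customer_Id  INTEGER NOT NULL
, Birth_Date   DATE COMPRESS
, Region_Code  CHAR(4) COMPRESS
) PRIMARY INDEX (Customer_Id);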
Compression of specific values
It often happens that, besides NULLs, tables contain frequently occurring values. For example, it could be a statistically dominant surname, say, Ivanov. Any field (especially a character field) for which statistically frequent values can be identified is a good candidate for compression. There may be special cases: suppose you decide to keep the client status as text (do not ask why). A client table in a data warehouse typically holds tens of millions of records. With just a limited set of statuses, "Active", "Inactive", "Blocked", you can apply compression and save on tens of millions of rows (in our example, going from VARCHAR(16) down to two bits).
The list of values to compress can be specified for each column separately. Such lists are stored in the table header, and when the table is accessed, the compressed values (flags) are replaced with the real values. The additional overhead of this operation is so small that it can be neglected.
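A sketch of this multi-value compression for the status example above (names are hypothetical; compressing VARCHAR columns requires a reasonably recent Teradata version):
-- The three listed statuses are stored once in the table header
-- and replaced by small presence bits in each row.
CREATE TABLE Client
( Client_Id  INTEGER NOT NULL
, Status     VARCHAR(16) COMPRESS ('Active', 'Inactive', 'Blocked')
) PRIMARY INDEX (Client_Id);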
When designing models, pay attention to the data demographics mentioned many times in this article. If a table contains millions of records and some values can be compressed, doing so will speed up queries against the table.
Block Level Compression (BLC)
Block-level compression, or BLC, is compression applied to entire data blocks before they are written to disk. It can be applied either to the whole system or to individual tables. With this type of compression, the gain in disk space comes at the cost of increased processor load, so it is important to understand whether this trade-off suits you.
Once this type of compression has been enabled for a table, it cannot simply be switched off, since it affects the physical layout of the blocks: you would have to rewrite the table into a copy without compression and then replace the original with it.
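As a sketch (the table name is hypothetical, and the BLOCKCOMPRESSION table option assumes a reasonably recent Teradata version), block-level compression can be requested at table creation time:
-- Table-level block compression; the exact options and defaults depend on the Teradata version.
CREATE TABLE Sales_Archive
, BLOCKCOMPRESSION = MANUAL
( Sale_Id    INTEGER NOT NULL
, Sale_Date  DATE
, Amount     DECIMAL(18,2)
) PRIMARY INDEX (Sale_Id);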
Columnar compression
As we described in one of the previous articles (Column and Hybrid Storage of Records in the Teradata DBMS), the Teradata DBMS supports both row and column storage of records. When designing a physical data model, you should take the best from each option. For a table with column storage, automatic compression is available. Its algorithm relies on the fact that if a data block contains values of a single column, the probability of compressing them well is quite high (especially if the values are non-unique). The values of one column are compressed at the container level (see the article above). If the system determines that compression gives no gain for a given value, it simply does not compress it; if it sees that compression makes sense, it checks which of the built-in compression algorithms gives the greatest gain and uses it. For example, if you enable automatic column compression on a "balance" field, you are unlikely to get much benefit: each client or subscriber, as a rule, has their own, nearly unique balance.
For the designer of the model this means one thing: if you decide to create a table with column storage, presumably it makes sense for you. And if so, analyze the demographics; perhaps for a number of columns you will want to enable automatic compression.
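A minimal sketch of a column-partitioned table (hypothetical names; in the Teradata versions that introduced columnar storage such a table is created without a primary index, and automatic compression is applied per container by default):
-- Column-partitioned table: values of each column are stored in their own containers,
-- where the automatic compression described above can be applied.
CREATE TABLE Call_Detail
( Call_Id    BIGINT NOT NULL
, Call_Date  DATE
, Duration   INTEGER
, Status     VARCHAR(16)
)
NO PRIMARY INDEX
PARTITION BY COLUMN;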
What to choose?
The table below will help you decide on the type of compression.
|  | Compression of specific values | Block-level compression | Columnar compression |
| Ease of use | Easy to apply to well-studied data | Turn it on and forget about it | Turn it on and forget about it |
| Analysis needed? | Data demographics must be analyzed | The space gain must be weighed against the CPU cost | You must understand for which columns compression makes sense and for which it does not |
| Flexibility | Works for a wide range of situations and data types | Can be combined with other types of compression | Can only be used for column-storage tables |
| Impact | Minimal impact on CPU consumption | Increases CPU load | For each row it is determined whether it was compressed or not |
| Application area | Replacing specific values | Compression of all data | Compression of a set of columns of the table |
The result of physical design work
The result of physical modeling should be a set of DBMS objects that provides the required performance for the main user workload at the minimum cost in system resources:
- First of all, these are tables (including storage attributes such as compression parameters) and primary indexes (including partitioning parameters), which require no additional system resources to maintain. Ideally, primary indexes should cover the basic need for access to the rows of the tables on which they are created and promote maximum use of AMP-local join plans. Creating primary indexes is a mandatory part of physical design on the Teradata platform.
- Second, secondary indexes, designed to provide alternative ways to access the required rows of tables. Unlike primary indexes, the system has to spend additional resources maintaining them, so when deciding to create them you must weigh the benefit of their use against the cost of their maintenance. Unused secondary indexes should be removed.
- Third, other objects that need to be created to achieve the desired performance. Such objects include, for example, various kinds of join and hash indexes.
Throughout its life, the physical model should be revised regularly. This is related not only to changes in data demographics, but also to changes in the nature of the workload, extensions of the model, and so on.
Summing up
The attentive reader has already realized that physical design of a data model is both a preparation process (before the model is implemented) and a fine-tuning process (improvement after implementation). You cannot predict the load and the data 100 percent, but 50 percent you can. Next time it may be 60 percent, and with time perhaps 80 percent, so the fine-tuning will take less and less time. The skill of designing physical models grows with experience, so the more experiments and projects you carry out, the better you will understand how to work with models in the future.
Learn the basics and do not be afraid to experiment. We hope that the data processing mechanisms of the Teradata DBMS described in this article, the index selection criteria, and the physical modeling techniques will be useful to specialists already working with our database and will spark interest in those who have yet to start.
And, as always, we are ready to answer in the comments any questions you may have.