In the previous article, I described the concept and implementation of a database built on functions rather than on tables and fields, as in relational databases. It contained many examples showing the advantages of this approach over the classical one, but many readers did not find them convincing enough.
In this article, I will show how this concept lets you quickly and conveniently balance write and read load on the database without changing the operation logic at all. Similar functionality has been attempted in modern commercial DBMSs (in particular, Oracle and Microsoft SQL Server). At the end of the article I will show that what they ended up with is, to put it mildly, not great.
Description
As before, for better understanding I will begin the description with examples. Suppose we need to implement logic that returns a list of departments with the number of employees in each and their total salary.
In a functional database, it will look like this:
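The original code listing is not reproduced here, so below is a minimal lsFusion-style sketch of what such logic could look like. The class and property names (Department, Employee, department, salary) and the query notation are illustrative assumptions, not the exact code from the original example:

CLASS Department;
CLASS Employee;

name = DATA STRING[100] (Department);
department = DATA Department (Employee); // link from an employee to his department
salary = DATA NUMERIC[14,2] (Employee);

// number of employees and total salary, defined per department
countEmployees (Department d) = GROUP SUM 1 IF department(Employee e) == d;
salarySum (Department d) = GROUP SUM salary(Employee e) IF department(e) == d;

// the query itself: all departments with both aggregates
SELECT name(Department d), countEmployees(d), salarySum(d);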
The complexity of executing this query in any DBMS will be on the order of O(number of employees), since the calculation has to scan the entire employee table and then group it by department. There will also be a small additional cost (we assume there are far more employees than departments) of O(log(number of employees)) or O(number of departments), depending on the chosen plan, for the grouping and so on.
Obviously, the execution overhead may differ from one DBMS to another, but the complexity itself will not change.
In the proposed implementation, the functional DBMS will build a single subquery that calculates the required values per department and then JOIN it with the department table to obtain the name. However, any function can be given a special MATERIALIZED marker when it is declared. The system will automatically create a corresponding field for each such function. When the value of the function changes, the value of that field is changed within the same transaction, and any subsequent access to the function simply reads the stored field.
In particular, if you mark the functions countEmployees and salarySum as MATERIALIZED, two fields will be added to the table with the list of departments, storing the number of employees and their total salary. Whenever employees, their salaries, or their department membership change, the system will automatically update the values of these fields. The query above will read these fields directly and will execute in O(number of departments).
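In the sketch above, this amounts to adding the marker to both declarations (again, an illustration in assumed syntax rather than the original code):

countEmployees (Department d) = GROUP SUM 1 IF department(Employee e) == d MATERIALIZED;
salarySum (Department d) = GROUP SUM salary(Employee e) IF department(e) == d MATERIALIZED;

From this point on, reads of countEmployees and salarySum go to the stored fields, while every change to an employee, salary, or department link updates those fields in the same transaction.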
What are the limitations? There is only one: such a function must have a finite number of input values for which its value is defined. Otherwise it would be impossible to build a table storing all of its values, since a table cannot have an infinite number of rows.
Example:
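The original listing is not preserved here; the following is a sketch of the kind of function that is meant (the name and exact syntax are hypothetical):

// defined for every integer N, including all negative values,
// so the set of input values for which it is defined is infinite
square (INTEGER n) = n * n;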
Such a function is defined for an infinite number of values of N (for example, any negative value is valid), so it cannot be marked MATERIALIZED. This is a logical limitation, not a technical one (that is, not because we were unable to implement it). Apart from that there are no restrictions: you can use grouping, sorting, AND and OR, PARTITION, recursion, and so on.
For example, in task 2.2 of the previous article you can put MATERIALIZED on both functions:
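The functions from task 2.2 are not reproduced in this article, so here is a purely hypothetical pair of functions with the same key structure (Customer, Product, and an integer, for example a year); all names are invented for illustration:

year = DATA INTEGER (Order); // assumed data property
soldQuantity (Customer c, Product p, INTEGER y) =
    GROUP SUM quantity(OrderDetail d)
        IF customer(order(d)) == c AND product(d) == p AND year(order(d)) == y MATERIALIZED;
soldSum (Customer c, Product p, INTEGER y) =
    GROUP SUM lineTotal(OrderDetail d)
        IF customer(order(d)) == c AND product(d) == p AND year(order(d)) == y MATERIALIZED;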
The system itself will create a single table with keys of the Customer, Product, and INTEGER types, add two fields to it, and update the values of those fields on any change. Subsequent calls to these functions will not recompute them but will read the values from the corresponding fields.
This mechanism also lets you get rid of recursion (CTEs) in queries. In particular, consider groups that form a tree via a child/parent relationship (each group holds a link to its parent group):
In the functional database, the recursive logic can be defined as follows:
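The original listing is not reproduced here; the following is a hedged sketch of how such logic might be written (the exact RECURSION syntax of the platform may differ in detail):

parent = DATA Group (Group); // link from a group to its parent

// level(child, parent) is non-NULL when parent is an ancestor of child
level (Group child, Group parent) = RECURSION 1l IF child IS Group AND parent == child
                                         STEP 2l IF parent == parent($parent);

isParent (Group child, Group parent) = TRUE IF level(child, parent) MATERIALIZED;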
Since the isParent function is marked MATERIALIZED, a table with two keys (both groups) will be created for it, in which the isParent field will be TRUE only if the first key is a descendant of the second. The number of records in this table will equal the number of groups multiplied by the average depth of the tree. If you then need, for example, to count the number of descendants of a particular group, you can simply refer to this function:
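A sketch of such a count, reading only the materialized isParent table (the function name is illustrative):

// number of descendants of a given group
descendantCount (Group g) = GROUP SUM 1 IF isParent(Group child, g);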
There will be no CTE in the SQL query. Instead, there will be a simple GROUP BY.
Using this mechanism, you can also easily denormalize the database if necessary:
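A sketch of such a denormalization (class and property names are assumed; the separate INDEX statement is also an assumption about the platform's syntax):

date = DATA DATE (Order);
order = DATA Order (OrderDetail); // link from an order line to its order

// the order date, stored directly in the order line table
date (OrderDetail d) = date(order(d)) MATERIALIZED;
INDEX date(OrderDetail d);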
When the date function is called for an order line, the stored (and indexed) field is simply read from the order line table. When the date of an order changes, the system automatically recalculates the denormalized date in its lines.
Benefits
What is this mechanism for? In classical DBMSs, without rewriting queries, a developer or DBA can only change indexes, define statistics, and hint the query planner on how to execute queries (and HINTs exist only in the commercial DBMSs). No matter how hard they try, they will not be able to execute the first query in this article in O(number of departments) without modifying queries or adding triggers. In the proposed scheme, you do not have to think at the development stage about the data storage structure or which aggregations to use; all of this can easily be changed on the fly, directly in production.
In practice it looks like this. Some people develop the logic based directly on the business task. They do not understand algorithms and their complexity, execution plans, join types, or any other technical details; these people are more business analysts than developers. Then it all goes to testing or production, where logging of long-running queries is enabled. When a long query is detected, other, more technical people (essentially DBAs) decide to enable MATERIALIZED on some intermediate function. Writes slow down slightly (since an additional field now has to be updated in the transaction), but not only is this query significantly accelerated, so are all other queries that use this function. Deciding which function to materialize is relatively easy. The two main parameters are the number of possible input values (that is, how many records the corresponding table will contain) and how often the function is used in other functions.
Analogs
Modern commercial databases have similar mechanisms: MATERIALIZED VIEW with FAST REFRESH in Oracle and INDEXED VIEW in Microsoft SQL Server. In PostgreSQL, a MATERIALIZED VIEW cannot be updated inside a transaction, only on explicit request (and even then with very strict restrictions), so it is not considered here. These mechanisms, however, have several problems that significantly limit their use.
First, you can enable materialization only on top of an already created regular VIEW. Otherwise you would have to rewrite all remaining queries so that they access the newly created view and thus benefit from the materialization. Or you could leave everything as it is, which is inefficient at the very least: the data is already precomputed, yet many queries will not use it and will compute it anew.
Secondly, they have a huge number of restrictions:
Oracle
5.3.8.4 General Restrictions on Fast Refresh
The defining query of the materialized view is restricted as follows:
- The materialized view must not contain references to non-repeating expressions like SYSDATE and ROWNUM.
- The materialized view must not contain references to RAW or LONG RAW data types.
- It cannot contain a SELECT list subquery.
- It cannot contain analytic functions (for example, RANK) in the SELECT clause.
- It cannot reference a table on which an XMLIndex index is defined.
- It cannot contain a MODEL clause.
- It cannot contain a HAVING clause with a subquery.
- It cannot contain nested queries that have ANY, ALL, or NOT EXISTS.
- It cannot contain a [START WITH …] CONNECT BY clause.
- It cannot contain multiple detail tables at different sites.
- ON COMMIT materialized views cannot have remote detail tables.
- Nested materialized views must have a join or aggregate.
- Materialized join views and materialized aggregate views with a GROUP BY clause cannot select from an index-organized table.
5.3.8.5 Restrictions on Fast Refresh on Materialized Views with Joins Only
Defining queries for materialized views with joins only (and no aggregates) have the following restrictions on fast refresh:
- All restrictions from "General Restrictions on Fast Refresh".
- They cannot have GROUP BY clauses or aggregates.
- Rowids of all the tables in the FROM list must appear in the SELECT list of the query.
- Materialized view logs must exist with rowids for all the base tables in the FROM list of the query.
- You cannot create a fast refreshable materialized view from multiple tables with simple joins that include an object type column in the SELECT statement.
Also, the refresh performance of such materialized views may not be optimally efficient if:
- The defining query uses an outer join that behaves like an inner join. If the defining query contains such a join, consider rewriting the defining query to contain an inner join.
- The SELECT list of the materialized view contains expressions on columns from multiple tables.
5.3.8.6 Restrictions on Fast Refresh on Materialized Views with Aggregates
Defining queries for materialized views with aggregates or joins have the following restrictions on fast refresh:
Fast refresh is supported for both ON COMMIT and ON DEMAND materialized views, however the following restrictions apply:
- All tables in the materialized view must have materialized view logs, and the materialized view logs must:
  - Contain all columns from the table referenced in the materialized view.
  - Specify with ROWID and INCLUDING NEW VALUES.
  - Specify the SEQUENCE clause if the table is expected to have a mix of inserts/direct-loads, deletes, and updates.
- Only SUM, COUNT, AVG, STDDEV, VARIANCE, MIN and MAX are supported for fast refresh.
- COUNT(*) must be specified.
- Aggregate functions must occur only as the outermost part of the expression. That is, aggregates such as AVG(AVG(x)) or AVG(x) + AVG(x) are not allowed.
- For each aggregate such as AVG(expr), the corresponding COUNT(expr) must be present. Oracle recommends that SUM(expr) be specified.
- If VARIANCE(expr) or STDDEV(expr) is specified, COUNT(expr) and SUM(expr) must be specified. Oracle recommends that SUM(expr * expr) be specified.
- The SELECT column in the defining query cannot be a complex expression with columns from multiple base tables. A possible workaround to this is to use a nested materialized view.
- The SELECT list must contain all GROUP BY columns.
- The materialized view must not be based on one or more remote tables.
- If you use a CHAR data type in the filter columns of a materialized view log, the character sets of the master site and the materialized view must be the same.
- If the materialized view has one of the following, then fast refresh is supported only on conventional DML inserts and direct loads:
  - Materialized views with MIN or MAX aggregates
  - Materialized views which have SUM(expr) but no COUNT(expr)
  - Materialized views without COUNT(*)
  Such a materialized view is called an insert-only materialized view.
- A materialized view with MAX or MIN is fast refreshable after delete or mixed DML statements if it does not have a WHERE clause.
  The max/min fast refresh after delete or mixed DML does not have the same behavior as the insert-only case. It deletes and recomputes the max/min values for the affected groups. You need to be aware of its performance impact.
- Materialized views with named views or subqueries in the FROM clause can be fast refreshed provided the views can be completely merged. For information on which views will merge, see Oracle Database SQL Language Reference.
- If there are no outer joins, you may have arbitrary selections and joins in the WHERE clause.
- Materialized aggregate views with outer joins are fast refreshable after conventional DML and direct loads, provided only the outer table has been modified. Also, unique constraints must exist on the join columns of the inner join table. If there are outer joins, all the joins must be connected by ANDs and must use the equality (=) operator.
- For materialized views with CUBE, ROLLUP, grouping sets, or concatenation of them, the following restrictions apply:
  - The SELECT list should contain a grouping distinguisher, either a GROUPING_ID function on all GROUP BY expressions or GROUPING functions, one for each GROUP BY expression. For example, if the GROUP BY clause of the materialized view is "GROUP BY CUBE(a, b)", then the SELECT list should contain either "GROUPING_ID(a, b)" or "GROUPING(a) AND GROUPING(b)" for the materialized view to be fast refreshable.
  - GROUP BY should not result in any duplicate groupings. For example, "GROUP BY a, ROLLUP(a, b)" is not fast refreshable because it results in duplicate groupings "(a), (a, b), AND (a)".
5.3.8.7 Restrictions on Fast Refresh on Materialized Views with UNION ALL
Materialized views with the UNION ALL set operator support the REFRESH FAST option if the following conditions are satisfied:
- The defining query must have the UNION ALL operator at the top level. The UNION ALL operator cannot be embedded inside a subquery, with one exception: the UNION ALL can be in a subquery in the FROM clause provided the defining query is of the form SELECT * FROM (view or subquery with UNION ALL), as in the following example:
CREATE VIEW view_with_unionall AS
(SELECT c.rowid crid, c.cust_id, 2 umarker
FROM customers c WHERE c.cust_last_name = 'Smith'
UNION ALL
SELECT c.rowid crid, c.cust_id, 3 umarker
FROM customers c WHERE c.cust_last_name = 'Jones');
CREATE MATERIALIZED VIEW unionall_inside_view_mv
REFRESH FAST ON DEMAND AS
SELECT * FROM view_with_unionall;
Note that the view view_with_unionall satisfies the requirements for fast refresh.
- Each query block in the UNION ALL query must satisfy the requirements of a fast refreshable materialized view with aggregates or a fast refreshable materialized view with joins. The appropriate materialized view logs must be created on the tables as required for the corresponding type of fast refreshable materialized view.
  Note that the Oracle Database also allows the special case of a single table materialized view with joins only, provided the ROWID column has been included in the SELECT list and in the materialized view log. This is shown in the defining query of the view view_with_unionall.
- The SELECT list of each query must include a UNION ALL marker, and the UNION ALL column must have a distinct constant numeric or string value in each UNION ALL branch. Further, the marker column must appear in the same ordinal position in the SELECT list of each query block. See "UNION ALL Marker and Query Rewrite" for more information regarding UNION ALL markers.
- Some features such as outer joins, insert-only aggregate materialized view queries, and remote tables are not supported for materialized views with UNION ALL. Note, however, that materialized views used in replication, which do not contain joins or aggregates, can be fast refreshed when UNION ALL or remote tables are used.
- The compatibility initialization parameter must be set to 9.2.0 or higher to create a fast refreshable materialized view with UNION ALL.
I do not mean to offend Oracle fans, but judging by this list of restrictions, it seems the mechanism was not designed for the general case from some kind of model, but written by a thousand outsourced developers, each of whom was handed their own branch and implemented whatever they could. Using this mechanism for real logic is like walking through a minefield: at any moment you can hit one of the non-obvious restrictions. How it works internally is also a separate question, but it is beyond the scope of this article.
Microsoft SQL Server
Additional Requirements
In addition to the SET options, the following requirements must be met:
- The user that executes CREATE INDEX must be the owner of the view.
- When you create the index, the IGNORE_DUP_KEY option must be set to OFF (the default setting).
- Tables must be referenced by two-part names, schema.tablename, in the view definition.
- User-defined functions referenced in the view must be created by using the WITH SCHEMABINDING option.
- Any user-defined functions referenced in the view must be referenced by two-part names, <schema>.<function>.
- The data access property of a user-defined function must be NO SQL, and the external access property must be NO.
- Common language runtime (CLR) functions can appear in the select list of the view, but cannot be part of the definition of the clustered index key. CLR functions cannot appear in the WHERE clause of the view or in the ON clause of a JOIN operation in the view.
- CLR functions and methods of CLR user-defined types used in the view definition must have the properties set as shown in the following table.
- The view must be created by using the WITH SCHEMABINDING option.
- The view must reference only base tables that are in the same database as the view. The view cannot reference other views.
- The SELECT statement in the view definition must not contain the following Transact-SQL elements: …
  1 The indexed view can contain float columns; however, such columns cannot be included in the clustered index key.
- If GROUP BY is present, the VIEW definition must contain COUNT_BIG(*) and must not contain HAVING. These GROUP BY restrictions are applicable only to the indexed view definition. A query can use an indexed view in its execution plan even if it does not satisfy these GROUP BY restrictions.
- If the view definition contains a GROUP BY clause, the key of the unique clustered index can reference only the columns specified in the GROUP BY clause.
Here you can see that outsourcers were not brought in; they apparently decided to follow the principle of "do little, but do it well". That is, there are more mines in their field, but their locations are more transparent. What is most upsetting is this restriction:
The view must reference only base tables that are in the same database as the view. The view cannot reference other views.
In our terminology, this means that a function cannot use another materialized function. This kills the whole ideology at the root.
This restriction (and the long list that follows it in the documentation) also greatly narrows the use cases:
The SELECT statement in the view definition must not contain the following Transact-SQL elements: …
OUTER joins, UNION, ORDER BY, and many other things are prohibited. It would probably have been easier to list what is allowed rather than what is not; that list would surely be much shorter.
Summing up: a huge set of restrictions in each of these (let me stress, commercial) DBMSs, versus none (apart from one logical, not technical, limitation) in an LGPL technology. However, it should be noted that implementing this mechanism in relational logic is somewhat more complicated than in the functional logic described here.
Implementation
How does it work? PostgreSQL is used as a kind of "virtual machine". Inside, there is a complex algorithm that builds the queries. Here is the source code. It is not just a big set of heuristics with a bunch of ifs, so if you have a couple of months to spare, you can try to figure out the architecture.
Does it work efficiently? Efficiently enough, although unfortunately this is hard to prove. I can only say that if you consider the thousands of queries that exist in large applications, on average they turn out more efficient than those of a good developer. An excellent SQL programmer can write any single query more efficiently, but across a thousand queries he will simply have neither the motivation nor the time to do so. The only thing I can offer as evidence of efficiency right now is that the platform built on this database runs several ERP projects with thousands of different MATERIALIZED functions, thousands of users, and terabyte-sized databases with hundreds of millions of records, all on an ordinary two-processor server. That said, anyone can test or refute the efficiency by downloading the platform and PostgreSQL, enabling SQL query logging, and trying to change the logic and data there.
In the following articles I will also talk about how to impose constraints on functions, how to work with change sessions, and much more.