
Balancing writes and reads in the database


In the previous article, I described the concept and implementation of a database built on functions rather than on tables and fields, as in relational databases. It contained many examples showing the advantages of this approach over the classical one, but many readers did not find them convincing enough.

In this article, I will show how this concept lets you quickly and conveniently balance writes and reads in the database without any change to the application logic. Similar functionality has been attempted in modern commercial DBMSs (in particular, Oracle and Microsoft SQL Server). At the end of the article I will show that what they ended up with is, to put it mildly, not great.

Description


As before, for a better understanding, I will begin the description with examples. Suppose we need to implement logic that returns a list of departments with the number of employees in each and their total salary.

In a functional database, it will look like this:
CLASS Department;
name = DATA STRING[100] (Department);

CLASS Employee;
department = DATA Department (Employee);
salary = DATA NUMERIC[10,2] (Employee);

countEmployees (Department d) =
    GROUP SUM 1 IF department(Employee e) = d;
salarySum (Department d) =
    GROUP SUM salary(Employee e) IF department(e) = d;

SELECT name(Department d), countEmployees(d), salarySum(d);
The complexity of executing this query in any DBMS will be on the order of O(number of employees), because the calculation requires scanning the entire employee table and then grouping it by department. There will also be a small additional cost (we assume there are far more employees than departments) of O(log number of employees) or O(number of departments), depending on the chosen plan, for the grouping itself, and so on.

Of course, the execution overhead will differ between DBMSs, but the complexity will not change.

In the proposed implementation, the functional DBMS will generate a single subquery that calculates the required values per department and then JOINs it with the department table to get the name. However, when declaring a function, you can mark it with a special MATERIALIZED keyword. For each such function, the system will automatically create a corresponding field. Whenever the value of the function changes, the value of the field is updated in the same transaction. Any access to this function then reads that stored field instead.
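
For illustration, the SQL generated for this query without materialization could look roughly like this (the table and column names here are hypothetical; the actual generated query may differ):

SELECT d.name, agg.count_employees, agg.salary_sum
FROM department d
LEFT JOIN (
    SELECT e.department_id,
           SUM(1)        AS count_employees,
           SUM(e.salary) AS salary_sum
    FROM employee e
    GROUP BY e.department_id
) agg ON agg.department_id = d.id;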

In particular, if you mark the countEmployees and salarySum functions as MATERIALIZED, two fields will be added to the table with the list of departments, storing the number of employees and their total salary. Whenever employees, their salaries, or their department assignments change, the system will automatically update the values of these fields. The query above will read these fields directly and will execute in O(number of departments).
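
With the same hypothetical naming, once the two functions are materialized the query reduces to a plain read of the department table:

SELECT d.name, d.count_employees, d.salary_sum
FROM department d;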

What are the limitations? Only one: such a function must have a finite number of input values for which its value is defined. Otherwise it would be impossible to build a table storing all of its values, since a table cannot have an infinite number of rows.

Example:
employeesCount (Department d, NUMERIC[10,2] N) =
    GROUP SUM salary(Employee e) IF department(e) = d AND salary(e) > N;
This function is defined for an infinite number of values of N (for example, any negative value works), so it cannot be marked MATERIALIZED. This is a logical, not a technical, limitation (that is, not because we could not implement it). Beyond that, there are no restrictions: you can use grouping, sorting, AND and OR, PARTITION, recursion, and so on.

For example, in task 2.2 of the previous article you can put MATERIALIZED on both functions:
bought (Customer c, Product p, INTEGER y) = GROUP SUM sum(Detail d) IF
    customer(order(d)) = c AND
    product(d) = p AND
    extractYear(date(order(d))) = y MATERIALIZED;
rating (Customer c, Product p, INTEGER y) =
    PARTITION SUM 1 ORDER DESC bought(c, p, y), p BY c, y MATERIALIZED;

SELECT contactName(Customer c), name(Product p) WHERE rating(c, p, 1997) < 3;
The system itself will create a single table with keys of types Customer, Product, and INTEGER, add two fields to it, and update their values on any change. Subsequent calls to these functions will not recalculate anything, but will simply read the values from the corresponding fields.
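
A sketch of what such an automatically created table might look like (the names and exact types are hypothetical; the platform chooses its own):

CREATE TABLE customer_product_year (
    customer_id INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    year        INTEGER NOT NULL,
    bought      NUMERIC(16,2),  -- stores the value of bought(c, p, y)
    rating      INTEGER,        -- stores the value of rating(c, p, y)
    PRIMARY KEY (customer_id, product_id, year)
);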

This mechanism also lets you, for example, get rid of recursion (CTE) in queries. In particular, consider groups that form a tree via a child/parent relationship (each group has a link to its parent):
parent = DATA Group (Group);
In the functional database, the recursive logic can be defined as follows:
level (Group child, Group parent) = RECURSION 1l IF child IS Group AND parent == child
    STEP 2l IF parent == parent($parent);
isParent (Group child, Group parent) = TRUE IF level(child, parent) MATERIALIZED;
Since isParent is marked MATERIALIZED, a table with two keys (both groups) will be created for it, in which the isParent field will be TRUE only if the first key is a descendant of the second. The number of records in this table will equal the number of groups multiplied by the average depth of the tree. If you then need, for example, to count the number of descendants of a particular group, you can simply use this function:
childrenCount (Group g) = GROUP SUM 1 IF isParent(Group child, g);
There will be no CTE in the SQL query. Instead, there will be a simple GROUP BY.
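
Assuming the materialized isParent function is stored in a table roughly like group_parent(child_id, parent_id, is_parent) (hypothetical names), counting descendants becomes an ordinary aggregation:

SELECT gp.parent_id, SUM(1) AS children_count
FROM group_parent gp
WHERE gp.is_parent
GROUP BY gp.parent_id;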

Using this mechanism, you can also easily denormalize the database if necessary:
CLASS Order;
date = DATA DATE (Order);

CLASS OrderDetail;
order = DATA Order (OrderDetail);
date (OrderDetail d) = date(order(d)) MATERIALIZED INDEXED;
When the date function is accessed for an order line, the system reads an indexed field directly from the order-lines table. When the date of an order changes, the system automatically recalculates the denormalized date in its lines.
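
For example, a selection of order lines for a given period would then use the index on the denormalized column directly, with no join to the orders table (names are hypothetical):

-- reads order_detail.date through its index; no join to the order table is needed
SELECT od.id
FROM order_detail od
WHERE od.date BETWEEN DATE '1997-01-01' AND DATE '1997-12-31';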

Benefits


What is this mechanism for? In classical DBMSs, without rewriting queries, a developer or DBA can only change indexes, define statistics, and hint the query planner on how to execute queries (and HINTs are available only in commercial DBMSs). No matter how hard they try, they will not be able to execute the first query in this article in O(number of departments) without changing queries and adding triggers. In the proposed scheme, at the development stage you do not need to think about the data storage structure or about which aggregations to keep. All of this can easily be changed on the fly, directly in production.
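
To make the contrast concrete, here is a rough sketch of what maintaining the same aggregates by hand in a classical DBMS might involve (PostgreSQL-flavored, hypothetical names, INSERT only; UPDATE and DELETE of employees would each need additional code):

ALTER TABLE department ADD COLUMN count_employees INTEGER DEFAULT 0;
ALTER TABLE department ADD COLUMN salary_sum NUMERIC(12,2) DEFAULT 0;

CREATE FUNCTION employee_after_insert() RETURNS trigger AS $$
BEGIN
    -- keep the denormalized aggregates on department in sync
    UPDATE department
    SET count_employees = count_employees + 1,
        salary_sum      = salary_sum + COALESCE(NEW.salary, 0)
    WHERE id = NEW.department_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER employee_after_insert_trg
AFTER INSERT ON employee
FOR EACH ROW EXECUTE FUNCTION employee_after_insert();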

In practice, it looks like this. Some people develop the logic directly from the business task. They understand neither algorithms and their complexity, nor execution plans, nor join types, nor any other technical component; these people are more business analysts than developers. Then it all goes to testing or production, and logging of slow queries is enabled. When a slow query is discovered, other people (more technical, essentially DBAs) decide to enable MATERIALIZED on some intermediate function. This slows down writes a bit (since an additional field has to be updated in the transaction), but in return not only that query is significantly accelerated, so are all the others that use this function. The decision about which function to materialize is relatively easy to make, based on two main parameters: the number of possible input values (that is, how many records the corresponding table will contain) and how often the function is used by other functions.

Analogs


Modern commercial databases have similar mechanisms: MATERIALIZED VIEW with FAST REFRESH (Oracle) and INDEXED VIEW (Microsoft SQL Server). In PostgreSQL, a MATERIALIZED VIEW cannot be updated within a transaction, only on explicit request (and even then with very strict restrictions), so it is not considered here. But both mechanisms have several problems that significantly limit their use.
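
For reference, the PostgreSQL variant looks roughly like this (hypothetical names): the refresh must be requested explicitly and recomputes the whole view rather than updating it incrementally inside the writing transaction:

CREATE MATERIALIZED VIEW department_totals AS
SELECT e.department_id,
       COUNT(*)      AS count_employees,
       SUM(e.salary) AS salary_sum
FROM employee e
GROUP BY e.department_id;

-- must be run manually or by a scheduler, and rebuilds the view in full
REFRESH MATERIALIZED VIEW department_totals;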

First, you can enable materialization only on top of an already created regular VIEW. Otherwise you will have to rewrite the remaining queries to reference the newly created view in order to benefit from the materialization, or leave everything as is, which is inefficient at best: the data has already been computed, yet many queries do not use it and calculate it anew.
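
For comparison, setting up the first example of this article as an Oracle fast-refresh materialized view could look roughly like this (names are hypothetical); note the extra bookkeeping the restrictions below demand, such as the materialized view log, the mandatory COUNT(*), and a COUNT(expr) next to each SUM(expr):

CREATE MATERIALIZED VIEW LOG ON employee
    WITH SEQUENCE, ROWID (department_id, salary) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW department_totals
    REFRESH FAST ON COMMIT
AS
SELECT e.department_id,
       COUNT(*)        AS cnt,          -- COUNT(*) is mandatory for fast refresh
       SUM(e.salary)   AS salary_sum,
       COUNT(e.salary) AS salary_cnt    -- COUNT(expr) must accompany SUM(expr)
FROM   employee e
GROUP BY e.department_id;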

Secondly, they have a huge number of restrictions:

Oracle
5.3.8.4 General Restrictions on Fast Refresh

The defining query is restricted as follows:
  • The materialized view must not contain references to non-repeating expressions like SYSDATE and ROWNUM .
  • The materialized view must not contain references to RAW or LONG RAW data types.
  • It cannot contain a SELECT list subquery.
  • It cannot contain analytic functions (for example, RANK ) in the SELECT clause.
  • It cannot reference a table on which an XMLIndex index is defined.
  • It cannot contain a MODEL clause.
  • It cannot contain a HAVING clause with a subquery.
  • It cannot contain nested queries that have ANY , ALL , or NOT EXISTS .
  • It cannot contain a [START WITH …] CONNECT BY clause.
  • It cannot contain multiple detail tables at different sites.
  • ON COMMIT materialized views cannot have remote detail tables.
  • Nested materialized views must have a join or aggregate.
  • Materialized join views and materialized aggregate views with a GROUP BY clause cannot select from an index-organized table.

5.3.8.5 Restrictions on Fast Refresh on Materialized Views with Joins Only

Defining queries for materialized views with joins only, and no aggregates, have the following restrictions on fast refresh:
  • All restrictions from " General Restrictions on Fast Refresh ".
  • They cannot have GROUP BY clauses or aggregates.
  • Rowids of all tables in the FROM list must appear in the SELECT list of the query.
  • Materialized view logs must exist with rowids for all the base tables in the FROM list of the query.
  • You cannot create a fast refreshable materialized view from multiple tables with simple joins that include an object type column in the SELECT statement.

Also, the performance of refresh may degrade compared with an optimal refresh if:
  • The defining query uses an outer join that behaves like an inner join. If the defining query contains such a join, consider rewriting the defining query to contain an inner join.
  • The SELECT list of the materialized view contains expressions on columns from multiple tables.

5.3.8.6 Restrictions on Fast Refresh on Materialized Views with Aggregates

Defining queries for materialized views with aggregates or joins have the following restrictions on fast refresh:

Fast refresh is supported for both ON COMMIT and ON DEMAND materialized views, however the following restrictions apply:
  • All tables in the materialized view must have materialized view logs, and the materialized view logs must:
    • Contain all columns from the table referenced in the materialized view.
    • Specify with ROWID and INCLUDING NEW VALUES .
    • Specify the SEQUENCE clause if the table is expected to have a mix of inserts/direct-loads, deletes, and updates.

  • Only SUM , COUNT , AVG , STDDEV , VARIANCE , MIN and MAX are supported for fast refresh.
  • COUNT(*) must be specified.
  • Aggregate functions must occur only as the outermost part of the expression. That is, aggregates such as AVG(AVG(x)) or AVG(x) + AVG(x) are not allowed.
  • For each aggregate such as AVG(expr) , the corresponding COUNT(expr) must be present. Oracle recommends that SUM(expr) be specified.
  • If VARIANCE(expr) or STDDEV(expr) is specified, COUNT(expr) and SUM(expr) must be specified. Oracle recommends that SUM(expr*expr) be specified.
  • The SELECT column in the defining query cannot be a complex expression with columns from multiple base tables. A possible workaround to this is to use a nested materialized view.
  • The SELECT list must contain all GROUP BY columns.
  • The materialized view is not based on one or more remote tables.
  • If you use a CHAR data type in the filter columns of a materialized view log, the character sets of the master site and the materialized view must be the same.
  • If the materialized view has one of the following, then fast refresh is supported only on conventional DML inserts and direct loads:
    • Materialized views with MIN or MAX aggregates
    • Materialized views which have SUM(expr) but no COUNT(expr)
    • Materialized views without COUNT(*)

    Such a materialized view is called an insert-only materialized view.
  • A materialized view with MAX or MIN is fast refreshable after delete or mixed DML statements if it does not have a WHERE clause.
    The max/min fast refresh after delete or mixed DML does not have the same behavior as the insert-only case. It deletes and recomputes the max/min values for the affected groups. You need to be aware of its performance impact.
  • Materialized views with named views or subqueries in the FROM clause can be fast refreshed provided the views can be completely merged. For information on which views will merge, see Oracle Database SQL Language Reference .
  • If there are no outer joins, you may have arbitrary selections and joins in the WHERE clause.
  • Materialized aggregate views with outer joins are fast refreshable after conventional DML and direct loads, provided only the outer table has been modified. Also, unique constraints must exist on the join columns of the inner join table. If there are outer joins, all the joins must be connected by ANDs and must use the equality ( = ) operator.
  • For materialized views with CUBE , ROLLUP , grouping sets, or concatenation of them, the following restrictions apply:
    • The SELECT list should contain a grouping distinguisher, either a GROUPING_ID function on all GROUP BY expressions or GROUPING functions, one for each GROUP BY expression. For example, if the GROUP BY clause of the materialized view is CUBE(a, b) , then the SELECT list should contain either " GROUPING_ID(a, b) " or " GROUPING(a) AND GROUPING(b) " for the materialized view to be fast refreshable.
    • GROUP BY should not result in any duplicate groupings. For example, " GROUP BY a, ROLLUP(a, b) " is not fast refreshable because it results in duplicate groupings " (a), (a, b), AND (a) ".

5.3.8.7 Restrictions on Fast Refresh on Materialized Views with UNION ALL

Materialized views with the UNION ALL set operator support the REFRESH FAST option if the following conditions are satisfied:
  • The defining query must have the UNION ALL operator at the top level.

    The UNION ALL operator cannot be embedded inside a subquery, with one exception: the UNION ALL can be in a subquery in the FROM clause provided the defining query is of the form SELECT * FROM (view or subquery with UNION ALL), as in the following example:
      CREATE VIEW view_with_unionall AS
      (SELECT c.rowid crid, c.cust_id, 2 umarker
       FROM customers c WHERE c.cust_last_name = 'Smith'
       UNION ALL
       SELECT c.rowid crid, c.cust_id, 3 umarker
       FROM customers c WHERE c.cust_last_name = 'Jones');
     
      CREATE MATERIALIZED VIEW unionall_inside_view_mv
      REFRESH FAST ON DEMAND AS
      SELECT * FROM view_with_unionall;
     
    Note that the view view_with_unionall satisfies the requirements for fast refresh.
  • Each query block in the UNION ALL query must satisfy the requirements of a fast refreshable materialized view with aggregates or a fast refreshable materialized view with joins.

    The appropriate materialized view logs must be created on the tables as required for the corresponding type of fast refreshable materialized view.
    Note that Oracle Database also allows the special case of a single-table materialized view with joins only, provided the ROWID column has been included in the SELECT list and in the materialized view log. This is shown in the defining query of the view view_with_unionall .
  • The SELECT list of each query must include a UNION ALL marker, and the UNION ALL column must have a distinct constant numeric or string value in each UNION ALL branch. Further, the marker column must appear in the same ordinal position in the SELECT list of each query block. See " UNION ALL Marker and Query Rewrite " for more information regarding UNION ALL markers.
  • Some features such as outer joins, insert-only aggregate materialized view queries, and remote tables are not supported for materialized views with UNION ALL . Note, however, that materialized views used in replication, which do not contain joins or aggregates, can be fast refreshed when UNION ALL or remote tables are used.
  • The compatibility initialization parameter must be set to 9.2.0 or higher to create a fast refreshable materialized view with UNION ALL .

I do not want to offend Oracle fans, but judging by this list of restrictions, it seems that the mechanism was written not for the general case from some unified model, but by an army of developers, each assigned their own branch and each doing whatever they could. Using this mechanism for real-world logic is like walking through a minefield: at any moment you can step on one of the non-obvious restrictions. How it works internally is a separate question, but it is outside the scope of this article.

Microsoft SQL Server

Additional Requirements


In addition to the SET options requirements, the following requirements must be met:
  • The user that executes CREATE INDEX must be the owner of the view.
  • When you create the index, the IGNORE_DUP_KEY option must be set to OFF (the default setting).
  • Tables must be referenced by two-part names, schema . tablename , in the view definition.
  • User-defined functions referenced in the view must be created by using the WITH SCHEMABINDING option.
  • Any user-defined functions referenced in the view must be referenced by two-part names, <schema> . <function> .
  • The data access property of a user-defined function must be NO SQL , and the external access property must be NO .
  • Common language runtime (CLR) functions can appear in the select list of the view, but cannot be part of the definition of the clustered index key. CLR functions cannot appear in the WHERE clause of the view or in the ON clause of a JOIN operation in the view.
  • CLR functions and methods of CLR user-defined types used in the view definition must have the properties set as shown in the following table.
    Property             | Note
    DETERMINISTIC = TRUE | Must be declared explicitly as an attribute of the Microsoft .NET Framework method.
    PRECISE = TRUE       | Must be declared explicitly as an attribute of the .NET Framework method.
    DATA ACCESS = NO SQL | Determined by setting the DataAccess attribute to DataAccessKind.None and the SystemDataAccess attribute to SystemDataAccessKind.None.
    EXTERNAL ACCESS = NO | This property defaults to NO for CLR routines.
  • The view must be created by using the WITH SCHEMABINDING option.
  • The view must reference only base tables that are in the same database as the view. The view cannot reference other views.
  • The SELECT statement in the view definition must not contain the following Transact-SQL elements:
    COUNT | ROWSET functions ( OPENDATASOURCE , OPENQUERY , OPENROWSET , and OPENXML ) | OUTER joins ( LEFT , RIGHT , or FULL )
    Derived tables (defined by specifying a SELECT statement in the FROM clause) | Self-joins | Specifying columns by using SELECT * or SELECT <table_name>.*
    DISTINCT | STDEV , STDEVP , VAR , VARP , or AVG | Common table expressions (CTE)
    float 1 , text , ntext , image , XML , or filestream columns | Subqueries | OVER clause, which includes ranking or aggregate window functions
    Full-text predicates ( CONTAINS , FREETEXT ) | SUM function that references a nullable expression | ORDER BY
    CLR user-defined aggregate functions | TOP | CUBE , ROLLUP , or GROUPING SETS operators
    MIN , MAX | UNION , EXCEPT , or INTERSECT operators | TABLESAMPLE
    Table variables | OUTER APPLY or CROSS APPLY | PIVOT , UNPIVOT
    Sparse column sets | Inline (TVF) or multi-statement table-valued functions (MSTVF) | OFFSET
    CHECKSUM_AGG

    1 The indexed view can contain float columns; however, such columns cannot be included in the clustered index key.
  • If GROUP BY is present, the VIEW definition must contain COUNT_BIG(*) and must not contain HAVING . These GROUP BY restrictions are applicable only to the indexed view definition; a query can use an indexed view in its execution plan even if it does not satisfy these GROUP BY restrictions.
  • If the view definition contains a GROUP BY clause, the key of the unique clustered index can reference only the columns specified in the GROUP BY clause.

Here it is clear that no army of developers was thrown at the problem; they apparently decided to follow the principle of "do little, but do it well". That is, there are more mines on their field, but their locations are more transparent. What is most upsetting is this restriction:
The view must reference only base tables that are in the same database as the view. The view cannot reference other views.

In our terminology, this means that a function cannot refer to another materialized function, which kills the whole ideology at the root.
This restriction (and others further in the text) also greatly reduces the possible use cases:
The SELECT statement in the view definition must not contain the following Transact-SQL elements:
COUNT | ROWSET functions ( OPENDATASOURCE , OPENQUERY , OPENROWSET , and OPENXML ) | OUTER joins ( LEFT , RIGHT , or FULL )
Derived tables (defined by specifying a SELECT statement in the FROM clause) | Self-joins | Specifying columns by using SELECT * or SELECT <table_name>.*
DISTINCT | STDEV , STDEVP , VAR , VARP , or AVG | Common table expressions (CTE)
float 1 , text , ntext , image , XML , or filestream columns | Subqueries | OVER clause, which includes ranking or aggregate window functions
Full-text predicates ( CONTAINS , FREETEXT ) | SUM function that references a nullable expression | ORDER BY
CLR user-defined aggregate functions | TOP | CUBE , ROLLUP , or GROUPING SETS operators
MIN , MAX | UNION , EXCEPT , or INTERSECT operators | TABLESAMPLE
Table variables | OUTER APPLY or CROSS APPLY | PIVOT , UNPIVOT
Sparse column sets | Inline (TVF) or multi-statement table-valued functions (MSTVF) | OFFSET
CHECKSUM_AGG

OUTER JOINs, UNION, ORDER BY, and many more are prohibited. It might have been easier to list what is allowed rather than what is not; that list would probably be much shorter.
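
To give a feel for what remains, here is a rough sketch of about the most that an indexed view allows (table and column names are hypothetical): an aggregation with the mandatory COUNT_BIG(*), materialized by a unique clustered index:

CREATE VIEW dbo.DepartmentTotals
WITH SCHEMABINDING
AS
SELECT e.DepartmentId,
       COUNT_BIG(*)  AS EmployeeCount,   -- mandatory when GROUP BY is present
       SUM(e.Salary) AS SalarySum        -- Salary must be declared NOT NULL
FROM dbo.Employee AS e
GROUP BY e.DepartmentId;
GO

-- The view becomes materialized only after a unique clustered index is created on it.
CREATE UNIQUE CLUSTERED INDEX IX_DepartmentTotals
    ON dbo.DepartmentTotals (DepartmentId);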

Summing up: a huge set of restrictions in each of these (note: commercial) DBMSs versus none (except the one logical, not technical, limitation) in an LGPL technology. That said, implementing this mechanism on top of relational logic is somewhat harder than in the functional logic described here.

Implementation


How does it work? PostgreSQL is used as a kind of "virtual machine". Inside there is a complex algorithm that builds the queries. Here is the source code . It is by no means just a big pile of heuristics with a bunch of ifs, so if you have a couple of months to spare, you can try to understand the architecture.

Does it work efficiently? Efficiently enough, although this is hard to prove. I can only say that if you consider the thousands of queries that exist in large applications, on average they turn out to be more efficient than those of a good developer. An excellent SQL programmer can write any single query more efficiently, but across a thousand queries they will simply have neither the motivation nor the time to do so. The only evidence of efficiency I can offer right now is that several ERP projects are running on the platform built on top of this database, with thousands of different MATERIALIZED functions, thousands of users, and terabyte-sized databases with hundreds of millions of records, all on an ordinary two-processor server. However, anyone can test or disprove the efficiency by downloading the platform and PostgreSQL, enabling SQL query logging, and trying to change the logic and data there.

In the following articles, I will also talk about how you can impose constraints on functions, work with change sessions, and more.

Source: https://habr.com/ru/post/459066/

