📜 ⬆️ ⬇️

What is faster: 0 or NULL?

There are three aggregate functions that are most often used in practice: COUNT , SUM and AVG .

And if the first has already been discussed earlier , then there are interesting nuances with performance with the rest. But let's get everything in order ...

When using aggregate functions on the execution plan, depending on the input stream, there can be two operators: Stream Aggregate and Hash Match .
')
The first one may require a pre-sorted input set of values, and the Stream Aggregate does not block the execution of subsequent statements.

In turn, Hash Match is a blocking operator (with rare exceptions ) and does not require sorting the input stream. For Hash Match , a hash table is used that is created in memory and if the expected number of rows is not correctly estimated, the operator can merge the results into tempdb .

In summary, it turns out that Stream Aggregate works well on small sorted data sets, and Hash Match does well with large, non-sorted sets and lends itself well to parallel processing.

Now that we have overcome the theory we will start to see how the aggregate functions work.

Suppose we need to calculate the average price among all products:

SELECT AVG(Price) FROM dbo.Price 

According to the table with a fairly simple structure:

 CREATE TABLE dbo.Price ( ProductID INT PRIMARY KEY, LastUpdate DATE NOT NULL, Price SMALLMONEY NULL, Qty INT ) 

As we have scalar aggregation, we expect to see Stream Aggregate on the execution plan:



Internally, this operator performs two aggregation operations COUNT_BIG and SUM (although this is performed as a single operation on the physical level) on the Price column:



Do not forget that the average is calculated only for NOT NULL , since the COUNT_BIG operation goes through the column, not with an asterisk. Accordingly, such a request:

 SELECT AVG(v) FROM ( VALUES (3), (9), (NULL) ) t(v) 

will return as a result not 4, but 6.

Now let's take a look at Compute Scalar , inside of which there is an interesting expression for checking division by zero:

 Expr1003 = CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005]/CONVERT_IMPLICIT(money,[Expr1004],0) END 

Let's try to calculate the total amount:

 SELECT SUM(Price) FROM dbo.Price 

The execution plan will remain the same:



But if you look at the operations that Stream Aggregate performs ...



You can be a bit surprised. Why does SQL Server count the amount if I only need the amount? The answer lies in Compute Scalar :

 [Expr1003] = Scalar Operator(CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END) 

If COUNT is not taken into account, then according to the semantics of the T-SQL language , when there are no rows in the input stream, then we should return NULL , not 0 . This behavior works for both scalar and vector aggregation:

 SELECT LastUpdate, SUM(Price) FROM dbo.Price GROUP BY LastUpdate OPTION(MAXDOP 1) 

 Expr1003 = Scalar Operator(CASE WHEN [Expr1008]=(0) THEN NULL ELSE [Expr1009] END) 

Moreover, such a check is done for both NULL and NOT NULL columns. Now we will consider examples in which the features of SUM and AVG described above will be useful.

If we want to calculate the average, then you do not need to use COUNT + SUM :

 SELECT SUM(Price) / COUNT(Price) FROM dbo.Price 

Because such a request would be less efficient than using AVG explicitly.

Further ... Explicitly transferring NULL to an aggregate function is not necessary:

 SELECT SUM(CASE WHEN Price < 100 THEN Qty ELSE NULL END), SUM(CASE WHEN Price > 100 THEN Qty ELSE NULL END) FROM dbo.Price 

Since in this design:

 SELECT SUM(CASE WHEN Price < 100 THEN Qty END), SUM(CASE WHEN Price > 100 THEN Qty END) FROM dbo.Price 

Optimizer does substitution automatically:



But what if I want to get 0 in the results instead of NULL ? Very often use ELSE and do not think:

 SELECT SUM(CASE WHEN Price < 100 THEN Qty ELSE 0 END), SUM(CASE WHEN Price > 100 THEN Qty ELSE 0 END) FROM dbo.Price 

Obviously, in this case, we will achieve the desired ... and even one warning will cease to call the eyes:

 Warning: Null value is eliminated by an aggregate or other SET operation. 

Although it is best to write a query like this:

 SELECT ISNULL(SUM(CASE WHEN Price < 100 THEN Qty END), 0), ISNULL(SUM(CASE WHEN Price > 100 THEN Qty END), 0) FROM dbo.Price 

And this is good not because the CASE operator will work faster. We already know that the optimizer inserts ELSE NULL there automatically ... So what are the advantages of the latter option?

As it turned out, aggregation operations in which NULL values ​​prevail are processed faster.

 SET STATISTICS TIME ON DECLARE @i INT = NULL ;WITH E1(N) AS ( SELECT * FROM ( VALUES (@i),(@i),(@i),(@i),(@i), (@i),(@i),(@i),(@i),(@i) ) t(N) ), E2(N) AS (SELECT @i FROM E1 a, E1 b), E4(N) AS (SELECT @i FROM E2 a, E2 b), E8(N) AS (SELECT @i FROM E4 a, E4 b) SELECT SUM(N) -- 100.000.000 FROM E8 OPTION (MAXDOP 1) 

The execution took me:

 SQL Server Execution Times: CPU time = 5985 ms, elapsed time = 5989 ms. 

Now we change:

 DECLARE @i INT = 0 

And re-perform:

 SQL Server Execution Times: CPU time = 6437 ms, elapsed time = 6451 ms. 

Not so significant, but the reason for optimization nevertheless it gives in certain situations.

The end of the play and the curtain? Not. That's not all…

As one of my friends said: “There is neither black nor white ... The world is multicolored” and therefore, finally, I will give an interesting example when NULL can harm.

Create a slow function and test table:

 USE tempdb GO IF OBJECT_ID('dbo.udf') IS NOT NULL DROP FUNCTION dbo.udf GO CREATE FUNCTION dbo.udf (@a INT) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @i INT = 1000 WHILE @i > 0 SET @i -= 1 RETURN REPLICATE('A', @a) END GO IF OBJECT_ID('tempdb.dbo.#temp') IS NOT NULL DROP TABLE #temp GO ;WITH E1(N) AS ( SELECT * FROM ( VALUES (1),(1),(1),(1),(1), (1),(1),(1),(1),(1) ) t(N) ), E2(N) AS (SELECT 1 FROM E1 a, E1 b), E4(N) AS (SELECT 1 FROM E2 a, E2 b) SELECT * INTO #temp FROM E4 

And execute the query:

 SET STATISTICS TIME ON SELECT SUM(LEN(dbo.udf(N))) FROM #temp 

 SQL Server Execution Times: CPU time = 9109 ms, elapsed time = 11603 ms. 

Now let's try the result of the expression, which is transmitted in SUM , to wrap in ISNULL :

 SELECT SUM(ISNULL(LEN(dbo.udf(N)), 0)) FROM #temp 

 SQL Server Execution Times: CPU time = 4562 ms, elapsed time = 5719 ms. 

The speed of execution has decreased by 2 times. I’ll say right away that this is not magic ... But a bug in the SQL Server engine , which Microsoft has already “sort of” fixed in SQL Server 2012 CTP .

The essence of the problem is as follows: the result of an expression inside the SUM or AVG functions can be executed twice if the optimizer considers that it can return NULL .

Everything was tested on Microsoft SQL Server 2012 (SP3) (KB3072779) - 11.0.6020.0 (X64) .

If you want to share this article with an English-speaking audience:
What is faster inside SUM & AVG: 0 or NULL?

Source: https://habr.com/ru/post/275069/


All Articles