
There are three aggregate functions that are most often used in practice:
COUNT ,
SUM and
AVG .
And if the first has already been discussed
earlier , then there are interesting nuances with performance with the rest. But let's get everything in order ...
When using aggregate functions on the execution plan, depending on the input stream, there can be two operators:
Stream Aggregate and
Hash Match .
')
The first one may require a pre-sorted input set of values, and the
Stream Aggregate does not block the execution of subsequent statements.
In turn,
Hash Match is a blocking operator (with rare
exceptions ) and does not require sorting the input stream. For
Hash Match , a hash table is used that is created in memory and if the expected number of rows is not correctly estimated, the operator can merge the results into
tempdb .
In summary, it turns out that
Stream Aggregate works well on small sorted data sets, and
Hash Match does well with large, non-sorted sets and lends itself well to parallel processing.
Now that we have overcome the theory we will start to see how the aggregate functions work.
Suppose we need to calculate the average price among all products:
SELECT AVG(Price) FROM dbo.Price
According to the table with a fairly simple structure:
CREATE TABLE dbo.Price ( ProductID INT PRIMARY KEY, LastUpdate DATE NOT NULL, Price SMALLMONEY NULL, Qty INT )
As we have scalar aggregation, we expect to see
Stream Aggregate on the execution plan:

Internally, this operator performs two aggregation operations
COUNT_BIG and
SUM (although this is performed as a single operation on the physical level) on the Price column:

Do not forget that the average is calculated only for
NOT NULL , since the
COUNT_BIG operation goes through the column, not with an asterisk. Accordingly, such a request:
SELECT AVG(v) FROM ( VALUES (3), (9), (NULL) ) t(v)
will return as a result not 4, but 6.
Now let's take a look at
Compute Scalar , inside of which there is an interesting expression for checking division by zero:
Expr1003 = CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005]/CONVERT_IMPLICIT(money,[Expr1004],0) END
Let's try to calculate the total amount:
SELECT SUM(Price) FROM dbo.Price
The execution plan will remain the same:

But if you look at the operations that
Stream Aggregate performs ...

You can be a bit surprised. Why does
SQL Server count the amount if I only need the amount? The answer lies in
Compute Scalar :
[Expr1003] = Scalar Operator(CASE WHEN [Expr1004]=(0) THEN NULL ELSE [Expr1005] END)
If
COUNT is not taken into account, then according to the semantics of the
T-SQL language , when there are no rows in the input stream, then we should return
NULL , not
0 . This behavior works for both scalar and vector aggregation:
SELECT LastUpdate, SUM(Price) FROM dbo.Price GROUP BY LastUpdate OPTION(MAXDOP 1)
Expr1003 = Scalar Operator(CASE WHEN [Expr1008]=(0) THEN NULL ELSE [Expr1009] END)
Moreover, such a check is done for both
NULL and
NOT NULL columns. Now we will consider examples in which the features of
SUM and
AVG described above will be useful.
If we want to calculate the average, then you do not need to use
COUNT + SUM :
SELECT SUM(Price) / COUNT(Price) FROM dbo.Price
Because such a request would be less efficient than using
AVG explicitly.
Further ... Explicitly transferring
NULL to an aggregate function is not necessary:
SELECT SUM(CASE WHEN Price < 100 THEN Qty ELSE NULL END), SUM(CASE WHEN Price > 100 THEN Qty ELSE NULL END) FROM dbo.Price
Since in this design:
SELECT SUM(CASE WHEN Price < 100 THEN Qty END), SUM(CASE WHEN Price > 100 THEN Qty END) FROM dbo.Price
Optimizer does substitution automatically:

But what if I want to get 0 in the results instead of
NULL ? Very often use
ELSE and do not think:
SELECT SUM(CASE WHEN Price < 100 THEN Qty ELSE 0 END), SUM(CASE WHEN Price > 100 THEN Qty ELSE 0 END) FROM dbo.Price
Obviously, in this case, we will achieve the desired ... and even one warning will cease to call the eyes:
Warning: Null value is eliminated by an aggregate or other SET operation.
Although it is best to write a query like this:
SELECT ISNULL(SUM(CASE WHEN Price < 100 THEN Qty END), 0), ISNULL(SUM(CASE WHEN Price > 100 THEN Qty END), 0) FROM dbo.Price
And this is good not because the
CASE operator will work faster. We already know that the optimizer inserts
ELSE NULL there automatically ... So what are the advantages of the latter option?
As it turned out, aggregation operations in which
NULL values ​​prevail are processed faster.
SET STATISTICS TIME ON DECLARE @i INT = NULL ;WITH E1(N) AS ( SELECT * FROM ( VALUES (@i),(@i),(@i),(@i),(@i), (@i),(@i),(@i),(@i),(@i) ) t(N) ), E2(N) AS (SELECT @i FROM E1 a, E1 b), E4(N) AS (SELECT @i FROM E2 a, E2 b), E8(N) AS (SELECT @i FROM E4 a, E4 b) SELECT SUM(N) -- 100.000.000 FROM E8 OPTION (MAXDOP 1)
The execution took me:
SQL Server Execution Times: CPU time = 5985 ms, elapsed time = 5989 ms.
Now we change:
DECLARE @i INT = 0
And re-perform:
SQL Server Execution Times: CPU time = 6437 ms, elapsed time = 6451 ms.
Not so significant, but the reason for optimization nevertheless it gives in certain situations.
The end of the play and the curtain? Not. That's not all…
As one of my friends said:
“There is neither black nor white ... The world is multicolored” and therefore, finally, I will give an interesting example when
NULL can harm.
Create a slow function and test table:
USE tempdb GO IF OBJECT_ID('dbo.udf') IS NOT NULL DROP FUNCTION dbo.udf GO CREATE FUNCTION dbo.udf (@a INT) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @i INT = 1000 WHILE @i > 0 SET @i -= 1 RETURN REPLICATE('A', @a) END GO IF OBJECT_ID('tempdb.dbo.#temp') IS NOT NULL DROP TABLE #temp GO ;WITH E1(N) AS ( SELECT * FROM ( VALUES (1),(1),(1),(1),(1), (1),(1),(1),(1),(1) ) t(N) ), E2(N) AS (SELECT 1 FROM E1 a, E1 b), E4(N) AS (SELECT 1 FROM E2 a, E2 b) SELECT * INTO #temp FROM E4
And execute the query:
SET STATISTICS TIME ON SELECT SUM(LEN(dbo.udf(N))) FROM #temp
SQL Server Execution Times: CPU time = 9109 ms, elapsed time = 11603 ms.
Now let's try the result of the expression, which is transmitted in
SUM , to wrap in
ISNULL :
SELECT SUM(ISNULL(LEN(dbo.udf(N)), 0)) FROM #temp
SQL Server Execution Times: CPU time = 4562 ms, elapsed time = 5719 ms.
The speed of execution has decreased by 2 times. I’ll say right away that this is not magic ... But a
bug in the SQL Server engine , which
Microsoft has already “sort of” fixed in
SQL Server 2012 CTP .
The essence of the problem is as follows: the result of an expression inside the
SUM or
AVG functions can be executed twice if the optimizer considers that it can return
NULL .
Everything was tested on
Microsoft SQL Server 2012 (SP3) (KB3072779) - 11.0.6020.0 (X64) .
If you want to share this article with an English-speaking audience:
What is faster inside SUM & AVG: 0 or NULL?