During my work, I came across a wide range of tasks. Some tasks required monotonous work, others came down to pure creativity.
The most interesting tasks that I can now recall, one way or another, raised questions of query optimization.
Optimization is, first of all, the search for the optimal query plan. However, what to do in a situation when a standard language construct produces a plan that is very far from optimal?
')
I encountered this kind of problem when I used the
UNPIVOT construct to convert columns to rows.
Through a small comparative analysis, a more effective alternative was found for
UNPIVOT .
So that the task does not seem abstract, suppose that we have at our disposal a table containing information about the number of medals among users.
IF OBJECT_ID('dbo.UserBadges', 'U') IS NOT NULL DROP TABLE dbo.UserBadges GO CREATE TABLE dbo.UserBadges ( UserID INT , Gold SMALLINT NOT NULL , Silver SMALLINT NOT NULL , Bronze SMALLINT NOT NULL , CONSTRAINT PK_UserBadges PRIMARY KEY (UserID) ) INSERT INTO dbo.UserBadges (UserID, Gold, Silver, Bronze) VALUES (1, 5, 3, 1), (2, 0, 8, 1), (3, 2, 4, 11)
Taking this table as a basis, we present various methods for converting columns to rows, as well as plans for their implementation.
1. UNION ALL
At one time,
SQL Server 2000 did not provide an efficient way to convert columns to rows. As a result, the practice of multiple sampling from the same table, but with a different set of columns united through the
UNION ALL construction, was widely practiced:
SELECT UserID, BadgeCount = Gold, BadgeType = 'Gold' FROM dbo.UserBadges UNION ALL SELECT UserID, Silver, 'Silver' FROM dbo.UserBadges UNION ALL SELECT UserID, Bronze, 'Bronze' FROM dbo.UserBadges
A huge disadvantage of this approach is the repeated readings of data, which significantly reduced the efficiency in performing such a query.
If you look at the execution plan, you can easily see this:
2. UNPIVOT
With the release of
SQL Server 2005 , it became possible to use the new design of the
T-SQL language -
UNPIVOT .
Using
UNPIVOT, the previous request can be simplified to:
SELECT UserID, BadgeCount, BadgeType FROM dbo.UserBadges UNPIVOT ( BadgeCount FOR BadgeType IN (Gold, Silver, Bronze) ) unpvt
Upon execution, we will get the following plan:
3. VALUES
Beginning with
SQL Server 2008, it became possible to use the
VALUES construct not only for creating multi-line
INSERT queries, but also inside the
FROM block.
Using the
VALUES construct, the query above can be rewritten as:
SELECT p.UserID, t.* FROM dbo.UserBadges p CROSS APPLY ( VALUES (Gold, 'Gold') , (Silver, 'Silver') , (Bronze, 'Bronze') ) t(BadgeCount, BadgeType)
At the same time, in comparison with
UNPIVOT , the implementation plan is slightly simpler:
4. Dynamic SQL
Using dynamic SQL, it is possible to create a "universal" query for any table. The main condition for this is that the columns that are not part of the primary key must have compatible data types.
You can find the list of such columns by the following query:
SELECT c.name FROM sys.columns c WITH(NOLOCK) LEFT JOIN ( SELECT i.[object_id], i.column_id FROM sys.index_columns i WITH(NOLOCK) WHERE i.index_id = 1 ) i ON c.[object_id] = i.[object_id] AND c.column_id = i.column_id WHERE c.[object_id] = OBJECT_ID('dbo.UserBadges', 'U') AND i.[object_id] IS NULL
If you look at the query plan, you will notice that connecting to
sys.index_columns is quite expensive:
To get rid of this connection, you can use the
INDEX_COL function. As a result, the final version of the request will take the following form:
DECLARE @table_name SYSNAME SELECT @table_name = 'dbo.UserBadges' DECLARE @SQL NVARCHAR(MAX) SELECT @SQL = ' SELECT * FROM ' + @table_name + ' UNPIVOT ( value FOR code IN ( ' + STUFF(( SELECT ', [' + c.name + ']' FROM sys.columns c WITH(NOLOCK) WHERE c.[object_id] = OBJECT_ID(@table_name) AND INDEX_COL(@table_name, 1, c.column_id) IS NULL FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '') + ' ) ) unpiv' PRINT @SQL EXEC sys.sp_executesql @SQL
When executed, a request will be generated in accordance with the template:
SELECT * FROM <table_name> UNPIVOT ( value FOR code IN (<unpivot_column>) ) unpiv
Even if we take into account the optimization that we have done, it is worth noting that this method is still slower compared to the previous two:
5. XML
It is more elegant to implement dynamic
UNPIVOT by using the following trick with
XML :
SELECT p.UserID , BadgeCount = tcvalue('.', 'INT') , BadgeType = tcvalue('local-name(.)', 'VARCHAR(10)') FROM ( SELECT UserID , [XML] = ( SELECT Gold, Silver, Bronze FOR XML RAW('t'), TYPE ) FROM dbo.UserBadges ) p CROSS APPLY p.[XML].nodes('t/@*') t(c)
In which for each line an
XML of the form is formed:
<t Column1="Value1" Column2="Value2" Column3="Value3" ... />
Then the name of each attribute and its value is parsed.
In most cases, using
XML results in a slower execution plan - this is a payback for universality.
Now compare the results:
There is no
dramatic difference in execution speed between
UNPIVOT and
VALUES . This statement is true when it comes to simply converting columns to rows.
Let's complicate the task and consider another option, where for each user you need to find out the type of medals, which he has the most.
Let's try to solve the problem using the
UNPIVOT construction:
SELECT UserID , GameType = ( SELECT TOP 1 BadgeType FROM dbo.UserBadges b2 UNPIVOT ( BadgeCount FOR BadgeType IN (Gold, Silver, Bronze) ) unpvt WHERE UserID = b.UserID ORDER BY BadgeCount DESC ) FROM dbo.UserBadges b
The execution plan shows that the problem is observed in re-reading the data and sorting, which is necessary to organize the data:
Getting rid of re-reading is easy enough if we recall that it is allowed to use columns from the outer block in the subquery:
SELECT UserID , GameType = ( SELECT TOP 1 BadgeType FROM (SELECT t = 1) t UNPIVOT ( BadgeCount FOR BadgeType IN (Gold, Silver, Bronze) ) unpvt ORDER BY BadgeCount DESC ) FROM dbo.UserBadges
Repeated readings are gone, but the sorting operation has not gone away:
Let's see how the
VALUES construction behaves in this task:
SELECT UserID , GameType = ( SELECT TOP 1 BadgeType FROM ( VALUES (Gold, 'Gold') , (Silver, 'Silver') , (Bronze, 'Bronze') ) t (BadgeCount, BadgeType) ORDER BY BadgeCount DESC ) FROM dbo.UserBadges
The plan is expectedly simplified, but the sorting is still present in the plan:
Let's try to bypass the sorting using the aggregate function:
SELECT UserID , BadgeType = ( SELECT TOP 1 BadgeType FROM ( VALUES (Gold, 'Gold') , (Silver, 'Silver') , (Bronze, 'Bronze') ) t (BadgeCount, BadgeType) WHERE BadgeCount = ( SELECT MAX(Value) FROM ( VALUES (Gold), (Silver), (Bronze) ) t(Value) ) ) FROM dbo.UserBadges
We got rid of sorting:
Small results:
In a situation where it is necessary to make a simple conversion of columns into rows, it is most preferable to use the constructions
UNPIVOT or
VALUES .
If, after conversion, the obtained data is planned to be used in aggregation or sorting operations, then it is more preferable to use the
VALUES construct, which, in most cases, allows to obtain more efficient execution plans.
If the number of columns in a table is variable, it is recommended to use
XML , which, unlike dynamic SQL, can be used inside table functions.
PS To adapt, part of the examples for the features of
SQL Server 2005 , the construction using
VALUES :
SELECT * FROM ( VALUES (1, 'a'), (2, 'b') ) t(id, value)
It is necessary to replace the combination
SELECT UNION ALL SELECT :
SELECT id = 1, value = 'a' UNION ALL SELECT 2, 'b'
UPDATE 10/16/2013 : How do UNPIVOT and VALUES behave on large amounts of data?
The table is based on the following structure (25 columns in total).
CREATE TABLE [dbo].[WorkOutFactors] ( WorkOutID BIGINT NOT NULL PRIMARY KEY, NightHours INT NOT NULL, EveningHours INT NOT NULL, HolidayHours INT NOT NULL, ... )
This table contains ~ 186000 rows. From a cold start on a local
SQL Server 2012 SP1 , the operation to convert rows to columns gave the following results.
UNPIVOT execution
plan :
VALUES execution plan:
In comparison, you can see that
VALUES runs faster (by 3 seconds), but it requires a lot of CPU resources:
From myself I will add that in each specific situation the difference in performance will differ.