
MS SQL: generating pseudo-random data using NEWID(). Opportunities and pitfalls

It is well known that the built-in function NEWID() is widely used by developers not only for its intended purpose, generating unique primary keys, but also as a means of generating arrays of pseudo-random data.
Among the built-in functions, NEWID() is in fact the only one that is not merely non-deterministic but, one might say, "super-non-deterministic": unlike all the others, it can produce a new value for every row rather than one value for the whole batch, which makes it extremely useful for mass generation. NEWSEQUENTIALID() shares this property, but it may only be used in the DEFAULT definition of uniqueidentifier columns and is prohibited everywhere else.
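A quick way to see this per-row behavior (a minimal sketch; run it against any SQL Server database):

```sql
-- RAND() without a seed is evaluated once per statement:
-- every row gets the same value. NEWID() is evaluated per row.
SELECT TOP (5)
       RAND()  AS same_for_all_rows,   -- one value, repeated in all 5 rows
       NEWID() AS unique_per_row       -- five distinct GUIDs
FROM sys.objects;
```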
You do not have to look far for examples. The code below

SELECT TOP 100 ABS(CHECKSUM(NEWID())) % 1000
FROM sysobjects A
CROSS JOIN sysobjects B

or this one (if CHECKSUM seems too expensive an operation):

SELECT TOP 100 ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 1000
FROM sysobjects A
CROSS JOIN sysobjects B

generates a table of 100 random integers in the range 0 to 999.
For floating-point numbers, you can use RAND()'s ability to be seeded with an integer:

SELECT TOP 100 RAND(CHECKSUM(NEWID()))
FROM sysobjects A
CROSS JOIN sysobjects B

Here RAND() is, in essence, used simply as a converter from the int32 range to the range [0..1). A statistical quality check of the resulting distribution on about a million records shows that it is not inferior to the standard approach of seeding RAND() once and then calling it in a loop. So you can use it safely.

Another interesting option is generating normally distributed data. Here we use the Box-Muller transform:

SELECT TOP 1000
    COS(2 * PI() * RAND(BINARY_CHECKSUM(NEWID())))
    * SQRT(-2 * LOG(RAND(BINARY_CHECKSUM(NEWID()))))
FROM sysobjects A
CROSS JOIN sysobjects B
CROSS JOIN sysobjects C

Those interested can verify that the generated distribution is very close to normal by plotting it.

All this works well and lets you generate even ten million records very quickly, without resorting to brute-force solutions such as loops, cursors, or inserting records one at a time from the application layer. You just need to make sure that the tables used as the row source contain enough rows, and either increase the number of CROSS JOINs or use table variables with the required number of rows as the source.
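For example, instead of relying on the capacity of system tables, a dedicated row source can be built up front (a sketch; the row counts and the @N name are arbitrary):

```sql
-- Build a guaranteed row source instead of relying on sysobjects capacity.
DECLARE @N TABLE (ID INT);
INSERT INTO @N VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Six cross joins of a 10-row table variable yield 10^6 candidate rows.
SELECT TOP (1000000)
       ABS(CHECKSUM(NEWID())) % 1000 AS RNDVAL
FROM @N A CROSS JOIN @N B CROSS JOIN @N C
     CROSS JOIN @N D CROSS JOIN @N E CROSS JOIN @N F;
```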

However, that is not all there is to the topic. In the overwhelming majority of cases the generated rows are materialized, that is, inserted into a permanent or temporary table or a table variable. If that is your case, you can stop reading here: materialized data will work fine. But there are cases where statements like the above are used in subqueries, and here the SQL engine behaves in ways that look baffling at first sight. Let us look at some examples, then try to analyze why this happens and how to deal with it.

To begin, let us simply put NEWID() in a subquery and run the statement several times in a loop:

declare @c int = 0
while @c < 5
begin
    SELECT *
    FROM (
        SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
        FROM (SELECT 1 AS ID) ROWSRC
    ) SUBQ
    set @c = @c + 1
end

The code works as expected: it produces 5 result sets, each containing exactly one row with a number in the range 0 to 4. I am not including a screenshot of the results; when everything is fine, there is little point.

Now for something more interesting. Let us join the result of SUBQ to some other table. You can create a real one, or stack subquery on subquery; the result will be the same. We write:

declare @c int = 0
while @c < 5
begin
    SELECT *
    FROM (
        SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
        FROM (SELECT 1 AS ID) ROWSRC
    ) SUBQ
    INNER JOIN (
        SELECT 0 AS VAL UNION ALL SELECT 1 UNION ALL SELECT 2
        UNION ALL SELECT 3 UNION ALL SELECT 4
    ) NUM ON SUBQ.RNDIDX = NUM.VAL
    set @c = @c + 1
end

Look at the screenshot of the execution result and slowly slide under your chair: the number of rows in each result set is not strictly 1. Somewhere there are zero rows (even though RNDIDX is always in the range [0..4], which is unbelievable in itself!), and somewhere there is more than one (!) row.

image

Now we make a seemingly innocent change: INNER becomes LEFT:

declare @c int = 0
while @c < 5
begin
    SELECT *
    FROM (
        SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
        FROM (SELECT 1 AS ID) ROWSRC
    ) SUBQ
    LEFT JOIN (
        SELECT 0 AS VAL UNION ALL SELECT 1 UNION ALL SELECT 2
        UNION ALL SELECT 3 UNION ALL SELECT 4
    ) NUM ON SUBQ.RNDIDX = NUM.VAL
    set @c = @c + 1
end

Run it: everything works correctly (!). Please check for yourself; I did not take a screenshot of correct behavior. Note that since for any RNDIDX value in the range [0..4] that the SUBQ subquery can produce there is always a matching VAL in the NUM subquery, from a logical point of view LEFT and INNER JOIN must give identical results. In fact, they do not!

Another test: we return to INNER, but add TOP / ORDER BY to the first subquery. Why? More on that later; for now, let us just try it:

declare @c int = 0
while @c < 5
begin
    SELECT *
    FROM (
        SELECT TOP 1 ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
        FROM (SELECT 1 AS ID) ROWSRC
        ORDER BY RNDIDX
    ) SUBQ
    INNER JOIN (
        SELECT 0 AS VAL UNION ALL SELECT 1 UNION ALL SELECT 2
        UNION ALL SELECT 3 UNION ALL SELECT 4
    ) NUM ON SUBQ.RNDIDX = NUM.VAL
    set @c = @c + 1
end

Everything works correctly again. Mysterious!

A bit of googling shows that SQL developers all over the world run into this behavior from time to time (examples here or here).

People suggest that materializing the subquery helps. Indeed, if you rewrite the example to first select the rows explicitly into a temporary table and then join against it, everything works fine. So why does replacing INNER with LEFT, or adding a TOP / ORDER BY where it is not needed, restore normal operation? For the same reason: in one case the subquery results are materialized, in the other they are not. The difference is clearer from the execution plan of a slightly larger case, for example this one:

DECLARE @B TABLE (VAL INT)
INSERT INTO @B VALUES (0), (1), (2), (3), (4)

SELECT *
FROM (
    SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5 AS RNDIDX
    FROM @B
) SUBQ
INNER JOIN @B B ON SUBQ.RNDIDX = B.VAL

Here (query 2) the row source and the join target are the same table variable @B. The execution plan:
image

We see that the query joins the two row streams before computing the value of the column that depends on NEWID(). This may happen because the SQL engine assumes that the value returned by NEWID(), although non-deterministic, does not change during the whole batch. That is not the case, and this is most likely why the query misbehaves. Now change INNER to LEFT and look at the plan:

image

Yes: LEFT JOIN made the SQL engine execute Compute Scalar before joining the streams, so our query started working correctly.

Finally, check the version with the addition of TOP / ORDER BY:

image

The diagnosis is clear. MS SQL does not account for the peculiarities of NEWID() and therefore plans incorrectly, relying on the function returning a constant value within the scope of the batch. There is a workaround: force the SQL engine to materialize the sample, by any means, before it is used in dependent queries. How you materialize it is up to you, but table variables are the best choice, especially when the subsample is small. Otherwise the result is, to put it mildly, not 100% guaranteed. Besides, there is no guarantee that one day you yourself, or someone else, will not break the code by throwing out the "unnecessary" TOP / ORDER BY or shrewdly replacing LEFT with INNER.
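A sketch of this materialization, applied to the problematic INNER JOIN example (the @SUBQ name is arbitrary):

```sql
-- Materialize the random value first, then join: the engine can no longer
-- recompute NEWID() on the other side of the join.
DECLARE @SUBQ TABLE (RNDIDX INT);
INSERT INTO @SUBQ
SELECT ABS(CONVERT(INT, CONVERT(BINARY(4), NEWID()))) % 5;

SELECT *
FROM @SUBQ SUBQ
INNER JOIN (
    SELECT 0 AS VAL UNION ALL SELECT 1 UNION ALL SELECT 2
    UNION ALL SELECT 3 UNION ALL SELECT 4
) NUM ON SUBQ.RNDIDX = NUM.VAL;
-- Exactly one row, on every run.
```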

That is all. Happy SQL programming!

Source: https://habr.com/ru/post/177221/

