📜 ⬆️ ⬇️

How to fill an MS SQL database with heterogeneous random data or 17 hours of waiting

Good day,
The developer often has the task of conducting a test of the database on large amounts of data, but where can we get this very data? After all, everyone knows that the structure of the database can reach over 50 tables that you do not really want to fill with your hands. And if you think about foreign keys and composite primary keys whose values ​​are associated with other tables, then the head starts to heat up proportionally to the old AMD with the cooling turned off.
On the Internet, there are many solutions for filling a database with random values ​​using .NET, C ++, Java, etc. This article will highlight the topic of filling a database with random values ​​with T-SQL using MS SQL Server.

Introduction


A few days ago I received the task of filling a database running with MS SQL Server with random data. Moreover, the entire implementation should be performed only by means of T-SQL. After a long search for such solutions on resources, I came to the conclusion that I would have to do it myself and set to work. Not being (until recently) an expert on T-SQL, but having only a set of knowledge from the course “Database” of the university, the implementation turned out to be very “crutch” and slow (the main problem), but it works.

The main purpose of this article is to discuss with the habr community the possibility to optimize the solution, or its Ctrl + A and Shift + Del with reference to the ready-made implementation.

And so, what was at the entrance:

What with all this had to be done:

Implementation


The whole implementation received a view of each other's calling procedures:

I propose to dwell on each procedure in detail. (it is assumed that the reader, unlike the author , is familiar with the basics of SQL)
')
randomString
CREATE PROCEDURE [dbo].[randomString] @inputSize int, @outputRandomString nvarchar(max) output AS BEGIN --        . END; 


I used one of the first implementation options I got from the MS SQL forum. The procedure receives the length of the string as input, and the output produces a string of random characters of the NVARCHAR (MAX) type of the required size. In this case, the implementation is not critical, since it does not have significant time costs for large amounts of data. We go further.

randomInt
 CREATE PROCEDURE [dbo].[randomInt] @inputSize int, @outputInteger int output AS BEGIN DECLARE @TEMP bigint SET @TEMP = SUBSTRING('999999999999999999',1,@inputSize) SET @outputInteger = (ABS(CHECKSUM(NewId())) % @TEMP) END 


The function is small and not very beautiful (especially the place with SUBSTRING), but it completely suited me with its speed, so for now we leave it and move on.

generateDataByType
 CREATE PROCEDURE [dbo].[generateDataByType] @tableName nvarchar(40), --  ,      @inputColumName nvarchar(40), --  ,      @inputType nvarchar(10), @inputSize int, @outputString nvarchar(max) output --  AS BEGIN DECLARE @isFK bit = 0; DECLARE @FKName NVARCHAR(MAX); DECLARE @ParentTable NVARCHAR(MAX); -- @tableName     FK (ccu.table_name)     (references_table)           DECLARE columnsCursor1 CURSOR FOR SELECT kcu.column_name, ccu.table_name AS references_table FROM information_schema.table_constraints tc INNER JOIN information_schema.key_column_usage kcu ON tc.constraint_catalog = kcu.constraint_catalog AND tc.constraint_schema = kcu.constraint_schema AND tc.constraint_name = kcu.constraint_name INNER JOIN information_schema.referential_constraints rc ON tc.constraint_catalog = rc.constraint_catalog AND tc.constraint_schema = rc.constraint_schema AND tc.constraint_name = rc.constraint_name AND tc.constraint_type = 'FOREIGN KEY' INNER JOIN information_schema.constraint_column_usage ccu ON rc.unique_constraint_catalog = ccu.constraint_catalog AND rc.unique_constraint_schema = ccu.constraint_schema AND rc.unique_constraint_name = ccu.constraint_name WHERE tc.table_name = @tableName OPEN columnsCursor1; FETCH NEXT FROM columnsCursor1 INTO @FKName,@ParentTable --     WHILE @@FETCH_STATUS = 0 BEGIN -- ,        @inputColumName       IF (@inputColumName = @FKName) BEGIN SET @isFK = 1; --   true -   FK         . DECLARE @selectedPK NVARCHAR(MAX); DECLARE @params NVARCHAR(MAX); --           ,         SET @selectedPK = N'SELECT TOP 1 @outputString =' + @FKName + ' FROM ' + @ParentTable + ' ORDER BY NEWID(); '; SET @params = N'@FKName NVARCHAR(MAX), @ParentTable NVARCHAR(MAX), @outputString NVARCHAR(MAX) OUTPUT'; EXEC sp_executesql @selectedPK , @params, @FKName = @FKName, @ParentTable = @ParentTable, @outputString = @outputString OUTPUT; END FETCH NEXT FROM columnsCursor1 INTO @FKName,@ParentTable END; CLOSE columnsCursor1; DEALLOCATE columnsCursor1; --          . IF (@isFK <> 1) BEGIN IF (@inputType = 'nvarchar') BEGIN EXECUTE randomString @inputSize, @outputRandomString = @outputString OUTPUT ; END ELSE --          . END 


And so, without reaching the "main" procedure, we get huge time costs when filling in the foreign key of the table with data from the found parent table. If this search is noticed by substituting random numbers in a given range, productivity increases dramatically. Perhaps it's a SELECT from the system table and random sorting. For comparison: writing 1 million rows to a table without FK takes about 20 minutes, writing 1 million rows to a table with FK takes more than 17 hours. For reference, writing one million lines of pure INSERT into one field takes 6-10 seconds.
At the moment, I could not think of anything more optimal, which was the impetus for writing this article, but about this in conclusion.

insertRandomData
 CREATE PROCEDURE [dbo].[insertRandomData] @childTableName nvarchar(MAX), @insertRowCount int AS BEGIN DECLARE @i int = 0 /*  */ DECLARE @columnName NVARCHAR(30); DECLARE @columnType NVARCHAR(10); DECLARE @columnLenght INT; DECLARE @columnUniq INT; /*      */ DECLARE @insertQuery NVARCHAR(MAX); DECLARE @insertColumnsQuery NVARCHAR(MAX); DECLARE @insertValuesQuery VARCHAR(MAX); DECLARE @params NVARCHAR(MAX); SET @insertColumnsQuery = ''; SET @insertValuesQuery = ''; begin transaction WHILE (@i < @insertRowCount) BEGIN DECLARE columnsCursor CURSOR FOR ----------        @childTableName    ----- SELECT all_columns.column_id, all_columns.name, systypes.name, all_columns.max_length FROM SYS.all_objects join SYS.all_columns on all_columns.object_id = all_objects.object_id join SYS.systypes on all_columns.system_type_id = systypes.xtype WHERE all_objects.name like @childTableName and all_objects.type = 'U' AND systypes.name <> 'sysname'/*-     2   ( )*/ ORDER BY all_columns.column_id; OPEN columnsCursor; --  ,   ID- (   ) FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; DECLARE @tempLenght INT = 0; WHILE @@FETCH_STATUS = 0 BEGIN /*   (   -1   30) , -1   MAX  ,         30 . */ IF(@columnLenght >= 0) BEGIN SET @tempLenght = @columnLenght; END ELSE BEGIN SET @tempLenght = 30; END --    INSERT,   . SET @insertColumnsQuery = @insertColumnsQuery + @columnName + ', '; DECLARE @TEMPValues nvarchar(MAX) = '' ---  generateStringByType,    TEMPValues--- EXECUTE generateDataByType @childTableName, @columnName, @columnType, @tempLenght, @outputString = @TEMPValues OUTPUT --     INSERT,  . SET @insertValuesQuery = @insertValuesQuery +'''' + @TEMPValues + ''',' FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; END; --         INSERT SET @insertColumnsQuery = SUBSTRING(@insertColumnsQuery, 1, LEN(@insertColumnsQuery)-1); SET @insertValuesQuery = SUBSTRING(@insertValuesQuery,1, LEN(@insertValuesQuery)-1); --       . SET @insertQuery = N'INSERT INTO ' + @childTableName + N' (' + @insertColumnsQuery + N') VALUES (' + @insertValuesQuery + ') ;'; EXEC(@insertQuery); --   


This procedure is “relatively” not time-consuming, although it climbs into the system tables to get the structure of the incoming table, but contains several obvious weak points. For example, a jump through the first element of the table in the hope that it is PK.

Conclusion


The above can be useful for someone, for which the author sincerely hopes, since he could not find such solutions. But the decision presented to the community is not optimal in terms of time and requires major changes. I hope that everyone who is interested will help me bring it to mind (if that makes sense) or point a different way.

Source: https://habr.com/ru/post/220185/


All Articles