How to fill an MS SQL database with heterogeneous random data or 17 hours of waiting

Good day,
The developer often has the task of conducting a test of the database on large amounts of data, but where can we get this very data? After all, everyone knows that the structure of the database can reach over 50 tables that you do not really want to fill with your hands. And if you think about foreign keys and composite primary keys whose values are associated with other tables, then the head starts to heat up proportionally to the old AMD with the cooling turned off.
On the Internet, there are many solutions for filling a database with random values using .NET, C ++, Java, etc. This article will highlight the topic of filling a database with random values with T-SQL using MS SQL Server.

Introduction

A few days ago I received the task of filling a database running with MS SQL Server with random data. Moreover, the entire implementation should be performed only by means of T-SQL. After a long search for such solutions on resources, I came to the conclusion that I would have to do it myself and set to work. Not being ~~(until recently)~~ an expert on T-SQL, but having only a set of knowledge from the course “Database” of the university, the implementation turned out to be very “crutch” and slow (the main problem), but it works.

The main purpose of this article is to discuss with the habr community the possibility to optimize the solution, or its Ctrl + A and Shift + Del with reference to the ready-made implementation.

And so, what was at the entrance:

A database with a number of related tables;
All primary keys (further PK) - auto increments;
There are tables containing a composite PK consisting of foreign keys (further FK).

What with all this had to be done:

Generate random data depending on the type of attribute (column);
Skip autoincrement filling.
Fill FK child tables with random PK parent tables.

Implementation

The whole implementation received a view of each other's calling procedures:

randomString - generation of a random string of characters of a given length .;
randomInt - generation of a random number from the specified range;
generateDataByType - gets the type of the attribute (column) of the table and calls the desired procedure for generating random values.
insertRandomData - the main procedure, asking for input only the name of the table and the number of records that you want to add

I propose to dwell on each procedure in detail. (it is assumed that the reader, ~~unlike the author~~ , is familiar with the basics of SQL)
')

randomString

CREATE PROCEDURE [dbo].[randomString] @inputSize int, @outputRandomString nvarchar(max) output AS BEGIN --        . END;

I used one of the first implementation options I got from the MS SQL forum. The procedure receives the length of the string as input, and the output produces a string of random characters of the NVARCHAR (MAX) type of the required size. In this case, the implementation is not critical, since it does not have significant time costs for large amounts of data. We go further.

randomInt

 CREATE PROCEDURE [dbo].[randomInt] @inputSize int, @outputInteger int output AS BEGIN DECLARE @TEMP bigint SET @TEMP = SUBSTRING('999999999999999999',1,@inputSize) SET @outputInteger = (ABS(CHECKSUM(NewId())) % @TEMP) END

The function is small and not very beautiful (especially the place with SUBSTRING), but it completely suited me with its speed, so for now we leave it and move on.

generateDataByType

 CREATE PROCEDURE [dbo].[generateDataByType] @tableName nvarchar(40), --  ,      @inputColumName nvarchar(40), --  ,      @inputType nvarchar(10), @inputSize int, @outputString nvarchar(max) output --  AS BEGIN DECLARE @isFK bit = 0; DECLARE @FKName NVARCHAR(MAX); DECLARE @ParentTable NVARCHAR(MAX); -- @tableName     FK (ccu.table_name)     (references_table)           DECLARE columnsCursor1 CURSOR FOR SELECT kcu.column_name, ccu.table_name AS references_table FROM information_schema.table_constraints tc INNER JOIN information_schema.key_column_usage kcu ON tc.constraint_catalog = kcu.constraint_catalog AND tc.constraint_schema = kcu.constraint_schema AND tc.constraint_name = kcu.constraint_name INNER JOIN information_schema.referential_constraints rc ON tc.constraint_catalog = rc.constraint_catalog AND tc.constraint_schema = rc.constraint_schema AND tc.constraint_name = rc.constraint_name AND tc.constraint_type = 'FOREIGN KEY' INNER JOIN information_schema.constraint_column_usage ccu ON rc.unique_constraint_catalog = ccu.constraint_catalog AND rc.unique_constraint_schema = ccu.constraint_schema AND rc.unique_constraint_name = ccu.constraint_name WHERE tc.table_name = @tableName OPEN columnsCursor1; FETCH NEXT FROM columnsCursor1 INTO @FKName,@ParentTable --     WHILE @@FETCH_STATUS = 0 BEGIN -- ,        @inputColumName       IF (@inputColumName = @FKName) BEGIN SET @isFK = 1; --   true -   FK         . DECLARE @selectedPK NVARCHAR(MAX); DECLARE @params NVARCHAR(MAX); --           ,         SET @selectedPK = N'SELECT TOP 1 @outputString =' + @FKName + ' FROM ' + @ParentTable + ' ORDER BY NEWID(); '; SET @params = N'@FKName NVARCHAR(MAX), @ParentTable NVARCHAR(MAX), @outputString NVARCHAR(MAX) OUTPUT'; EXEC sp_executesql @selectedPK , @params, @FKName = @FKName, @ParentTable = @ParentTable, @outputString = @outputString OUTPUT; END FETCH NEXT FROM columnsCursor1 INTO @FKName,@ParentTable END; CLOSE columnsCursor1; DEALLOCATE columnsCursor1; --          . IF (@isFK <> 1) BEGIN IF (@inputType = 'nvarchar') BEGIN EXECUTE randomString @inputSize, @outputRandomString = @outputString OUTPUT ; END ELSE --          . END

And so, without reaching the "main" procedure, we get huge time costs when filling in the foreign key of the table with data from the found parent table. If this search is noticed by substituting random numbers in a given range, productivity increases dramatically. Perhaps it's a SELECT from the system table and random sorting. For comparison: writing 1 million rows to a table without FK takes about 20 minutes, writing 1 million rows to a table with FK takes more than 17 hours. For reference, writing one million lines of pure INSERT into one field takes 6-10 seconds.
At the moment, I could not think of anything more optimal, which was the impetus for writing this article, but about this in conclusion.

insertRandomData

 CREATE PROCEDURE [dbo].[insertRandomData] @childTableName nvarchar(MAX), @insertRowCount int AS BEGIN DECLARE @i int = 0 /*  */ DECLARE @columnName NVARCHAR(30); DECLARE @columnType NVARCHAR(10); DECLARE @columnLenght INT; DECLARE @columnUniq INT; /*      */ DECLARE @insertQuery NVARCHAR(MAX); DECLARE @insertColumnsQuery NVARCHAR(MAX); DECLARE @insertValuesQuery VARCHAR(MAX); DECLARE @params NVARCHAR(MAX); SET @insertColumnsQuery = ''; SET @insertValuesQuery = ''; begin transaction WHILE (@i < @insertRowCount) BEGIN DECLARE columnsCursor CURSOR FOR ----------        @childTableName    ----- SELECT all_columns.column_id, all_columns.name, systypes.name, all_columns.max_length FROM SYS.all_objects join SYS.all_columns on all_columns.object_id = all_objects.object_id join SYS.systypes on all_columns.system_type_id = systypes.xtype WHERE all_objects.name like @childTableName and all_objects.type = 'U' AND systypes.name <> 'sysname'/*-     2   ( )*/ ORDER BY all_columns.column_id; OPEN columnsCursor; --  ,   ID- (   ) FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; DECLARE @tempLenght INT = 0; WHILE @@FETCH_STATUS = 0 BEGIN /*   (   -1   30) , -1   MAX  ,         30 . */ IF(@columnLenght >= 0) BEGIN SET @tempLenght = @columnLenght; END ELSE BEGIN SET @tempLenght = 30; END --    INSERT,   . SET @insertColumnsQuery = @insertColumnsQuery + @columnName + ', '; DECLARE @TEMPValues nvarchar(MAX) = '' ---  generateStringByType,    TEMPValues--- EXECUTE generateDataByType @childTableName, @columnName, @columnType, @tempLenght, @outputString = @TEMPValues OUTPUT --     INSERT,  . SET @insertValuesQuery = @insertValuesQuery +'''' + @TEMPValues + ''',' FETCH NEXT FROM columnsCursor INTO @columnUniq, @columnName, @columnType, @columnLenght; END; --         INSERT SET @insertColumnsQuery = SUBSTRING(@insertColumnsQuery, 1, LEN(@insertColumnsQuery)-1); SET @insertValuesQuery = SUBSTRING(@insertValuesQuery,1, LEN(@insertValuesQuery)-1); --       . SET @insertQuery = N'INSERT INTO ' + @childTableName + N' (' + @insertColumnsQuery + N') VALUES (' + @insertValuesQuery + ') ;'; EXEC(@insertQuery); --

This procedure is “relatively” not time-consuming, although it climbs into the system tables to get the structure of the incoming table, but contains several obvious weak points. For example, a jump through the first element of the table in the hope that it is PK.

Conclusion

The above can be useful for someone, for which the author sincerely hopes, since he could not find such solutions. But the decision presented to the community is not optimal in terms of time and requires major changes. I hope that everyone who is interested will help me bring it to mind (if that makes sense) or point a different way.

Source: https://habr.com/ru/post/220185/

All Articles

How to fill an MS SQL database with heterogeneous random data or 17 hours of waiting

Introduction

Implementation

Conclusion

More articles: