
How to effectively import big data

Sometimes we need to import very large amounts of data into a database, sometimes reaching several tens of gigabytes. We carry out regular backups and hot backups, and in important services we use replication and High Availability. Most often, users rely on the built-in DBMS tools, use them with the default settings, wait until the import process is over, and sometimes do not wait for it at all.

In this blog, I want to talk about different ways of importing data into the CUBRID DBMS, showing which one is more efficient and why. Some of these recommendations can also be applied to other database management systems.

So, in CUBRID, data can be imported using the following tools: CSQL in stand-alone (offline) mode (-S), CSQL in client-server mode (-C), the PHP driver, and CUBRID Manager.

First, I will give the results of a small test so that you can see the big picture and understand why some of the above solutions work faster than others. Then I will talk about recommendations that will help you to significantly speed up the process of importing data.

About the test


We will test each of the methods on a small amount of data (100,000 records) in order to get an idea of how each of them scales. The test will be conducted on Windows 7 x86 with CUBRID 8.4.0 installed. Thus, we will use all four of the tools listed above.

The following configurations will also be present.

Test script


Running CSQL on the command line (-S and -C modes)

CSQL is a command-line tool for interpreting and executing SQL statements. Compared to CUBRID Manager, CSQL is much lighter and faster. It can work in two modes: the first is the offline (stand-alone) mode (-S), and the second is the client-server mode (-C).

More details about these modes can be found in the manual.
It's time to create a database for our test. We can do this on the command line by running cubrid createdb dbtest.

Then we need to connect to this database using CSQL and create the necessary tables.

$> csql dbtest

CUBRID SQL Interpreter

Type `;help' for help messages.

csql> CREATE TABLE test1(a int, b TIMESTAMP, c int AUTO_INCREMENT)
csql> ;ru

Current transaction has been committed.

1 command(s) successfully processed.
csql> ;ex


Now that everything is prepared, all that remains is to import the data contained in the dbtest.sql file.

INSERT INTO test1 VALUES (0, SYS_TIMESTAMP, NULL);
INSERT INTO test1 VALUES (1, SYS_TIMESTAMP, NULL);
……………………
INSERT INTO test1 VALUES (99998, SYS_TIMESTAMP, NULL);
INSERT INTO test1 VALUES (99999, SYS_TIMESTAMP, NULL);


To run CSQL offline and read all data from a file, enter the following command.

$> csql -u dba -p 1111 -S -i dbtest.sql dbtest

To run CSQL in Client-server mode and read all data from a file, enter the following command.

$> csql -u dba -p 1111 -C -i dbtest.sql dbtest

Importing data with PHP

For this, we will use the same database and the same schema. Then we run the following script to insert 100,000 records.

 $host_ip = "localhost";
 $host_port = 33000;
 $db_name = "dbtest";
 $userId = "dba";
 $password = "1111";

 $cubrid_con = @cubrid_connect($host_ip, $host_port, $db_name, $userId, $password);

 if ($cubrid_con)
 {
     // Prepared INSERT into the test1 table; column c is AUTO_INCREMENT.
     $sql = "INSERT INTO test1 (a, b) VALUES (?, SYS_TIMESTAMP)";
     $reg = cubrid_prepare($cubrid_con, $sql, CUBRID_INCLUDE_OID);

     // Insert 100,000 records in the loop.
     for ($i = 0; $i < 100000; ++$i)
     {
         $res = cubrid_bind($reg, 1, $i);
         $res = cubrid_execute($reg);

         // Commit once every 5,000 records (commit cycle).
         if (($i + 1) % 5000 == 0)
         {
             cubrid_commit($cubrid_con);
             echo $i, "\n";
         }
     }

     cubrid_disconnect($cubrid_con);
 }


Importing data with CUBRID Manager

Here we will import data from the same file that we created in the CSQL test. Using the Import Data function (see screenshot below), we import all the data.

[Screenshot: the Import Data dialog in CUBRID Manager]

Test results


The following test results are shown in seconds.
              50,000 records    100,000 records    300,000 records
csql -S                    5                 10                 29
csql -C                  111                224                599
PHP                       68                136                413
CM                        17                 33                 96

[Chart: comparison of import times for the four methods]

Conclusion and recommendations


Use CSQL offline

As seen in the chart above, CSQL offline is the fastest way to import data into CUBRID. The reason is that it works directly with the database files, bypassing the server process: in this mode CSQL behaves like the server itself, not like a client connected to a server, which is why it imports data the fastest.

However, there are times when we cannot use CSQL offline, since in this mode only one user can be connected. This means that CSQL must be the only application working with the database at that time, which in turn means that the database server must be stopped. If the database is running, another user (or host) is using it, and in that case an attempt to connect with CSQL offline will produce the following error.

$> csql -S demodb

ERROR: Unable to mount disk volume "C:\CUBRID\databases\demodb\demodb_lgat".
The database "C:\CUBRID\DATABA~1\demodb\demodb", to which the disk volume belongs,
is in use by user USER-PC$ on process 3096 of host user-PC since Thu Sep 22 11:04:01 2011.


In such cases, you either need to make sure that no one else is connected to the database by completely shutting it down, or use other methods of importing data. To stop the database server on the command line, enter cubrid server stop dbtest.
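A complete stand-alone import session could then look like the sketch below (assuming the dbtest database and the dbtest.sql file from the test above):

rem stop the database so that CSQL can open it exclusively in stand-alone mode
$> cubrid server stop dbtest
rem import the data with CSQL in stand-alone (-S) mode
$> csql -u dba -p 1111 -S -i dbtest.sql dbtest
rem start the server again for regular clients
$> cubrid server start dbtest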

Create any CONSTRAINT after importing data

This is one of the most important recommendations for developers who plan to import large amounts of data.

Do not create any indexes before you finish importing all the data. This applies to indices such as: INDEX (regular index), UNIQUE (unique index), REVERSE INDEX (reverse index), REVERSE UNIQUE (reverse unique index) and even PRIMARY KEY (the primary key will automatically create a regular index).

Otherwise, every inserted row (INSERT) during the import will also have to be indexed, which will increase the total import time. Therefore, follow these steps (a minimal SQL sketch is shown after the list):
  1. Create a table.
  2. Add all the required columns and specify their data types, but do not add any constraints, not even the primary key.
  3. Import all the data.
  4. And only then create all the necessary indexes and the primary key.
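As an illustration, here is a minimal sketch based on the test1 table from the test above (the constraint and index names pk_test1_c and idx_test1_a are just hypothetical examples):

-- Steps 1-2. Create the table with columns only, no constraints yet.
CREATE TABLE test1 (a INT, b TIMESTAMP, c INT AUTO_INCREMENT);

-- Step 3. Import all the data, for example: csql -u dba -S -i dbtest.sql dbtest

-- Step 4. Add the primary key and any indexes only after the import has finished.
ALTER TABLE test1 ADD CONSTRAINT pk_test1_c PRIMARY KEY (c);
CREATE INDEX idx_test1_a ON test1 (a);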

Disable logging during import.

There are two types of logging in CUBRID: client-side logging and server-side logging.

Client side logging

Client-side logging refers to the SQL_LOG parameter of the CUBRID Broker. The default value is SQL_LOG=ON.

When logging is enabled on the client side, every SQL statement processed by a CAS (CUBRID Application Server) is saved in the logs, which increases the total import time. Therefore, if you do not need a log of the entire import process (as you would, for example, in High Availability mode), turn it off until the import is complete.

There are several ways to disable logging. Below I will show how to do this in the CUBRID Manager and in the command line.

CM example

To disable logging (SQL_LOG=OFF), right-click on the Broker and select Properties.

[Screenshot: Broker context menu in CUBRID Manager]

In the modal window, set the SQL_LOG parameter to OFF. Then save the changes by clicking OK.



To apply the changes, you must restart the Broker: right-click on the Broker and select Broker Off to stop it, and then Broker On to start it again. Now you can run the import.

Command line example

In a text editor, open cubrid_broker.conf, the Broker configuration file located in the conf directory where you installed CUBRID. In this file you can change the values of all Broker parameters, including SQL_LOG. See the snippet below.

[broker]
...
SQL_LOG =OFF
...


Then restart the Broker in the command line using the cubrid broker restart command.

Server side logging

Server-side logging refers to the media_failure_support parameter of the CUBRID server itself, which determines whether archive logs should be preserved in case of storage device failure. If the value of this parameter is yes, which is the default, active logs are copied to archive logs whenever the active logs fill up while a transaction is still running. If media_failure_support=no, all archive logs created after the active logs have filled up are automatically deleted. Note that any existing archive logs are deleted immediately once the value of this parameter is changed to no.

Thus, by setting media_failure_support=no, you can reduce the total data import time. To change this parameter in CUBRID Manager, right-click on the host and select Properties.

[Screenshot: host Properties menu in CUBRID Manager]

In the modal window, set the media_failure_support parameter to no. Then save the changes by clicking OK.

[Screenshot: setting media_failure_support to no]

To apply all changes, restart the server.
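If you prefer to edit configuration files directly, the same parameter can be changed in cubrid.conf, the server configuration file in the same conf directory. The snippet below is a sketch; the [common] section name follows the 8.4 layout and may differ in other versions:

[common]
...
# do not keep archive logs while the bulk import is running
media_failure_support=no
...

Then stop and start the database server (cubrid server stop dbtest and cubrid server start dbtest) so the change takes effect.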

Specify the number of threads and transaction commits.

If you decide to work with CUBRID Manager to import data, be sure to specify the number of threads, as well as the number of transactions in one commit.

Threads allow CUBRID Manager to use multiple parallel connections to execute the INSERT statements. Set their number in the Thread count field. However, remember that too many threads will not do any good; it all depends on the capabilities of your hardware. Usually we recommend 5 to 10 threads.

[Screenshot: Thread count setting in the CUBRID Manager import dialog]

The import time also depends on the number of transactions in one commit cycle. This value determines how often the inserted data is committed. Committing too frequently will result in poor performance, while committing too rarely may require a lot of RAM. So, again, it depends on your hardware.

Use data_buffer_size

data_buffer_size is one of the most important parameters for optimizing the CUBRID server as a whole. It determines the number of data pages kept in the server's cache. The larger the data_buffer_size value, the more data pages can be kept in the server's buffer, which can significantly reduce the number of I/O operations and therefore improve overall performance.

But if the value of this parameter is too large, the buffer pool may be swapped out by the operating system due to a lack of memory, and if no swap space exists, the database (note: the database itself, not the server) will not start because it simply does not have enough memory. Therefore, based on the physical capabilities of your hardware, it is recommended to set data_buffer_size to roughly two thirds of the system memory. The default is data_buffer_size=512M (megabytes).
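For example, on a machine with 3 GB of RAM, the corresponding cubrid.conf entry could look like the sketch below (2048M is only an illustration of the "about two thirds of system memory" rule, and the [common] section name again follows the 8.4 layout):

[common]
...
# page buffer of roughly two thirds of system RAM
data_buffer_size=2048M
...

Restart the server for the new buffer size to take effect.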

Use insert_execution_mode

insert_execution_mode is a very useful parameter that allows INSERT statements to be executed on the server side rather than on the client side. This is useful when very little memory is available on the client side, or when you need a dirty read of the data before it is committed, for example for periodic backups.

The parameter can take seven values (see Database Server Parameters for details). The default value is insert_execution_mode=1, which means that all queries of the form INSERT INTO ... SELECT ... are executed on the server side. Since we do not use that query form when importing data, we need to change the value to 2, which makes all queries of the form INSERT INTO ... VALUES ... run on the server side (see the figure below).

[Screenshot: setting insert_execution_mode to 2]
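In cubrid.conf this corresponds to a one-line change (again a sketch, under the same assumptions about the configuration file layout as above):

[common]
...
# execute INSERT INTO ... VALUES ... statements on the server side
insert_execution_mode=2
...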

So, to reduce the total import time and increase the performance of the CUBRID server, follow these tips:
  1. Use CSQL in stand-alone (offline) mode whenever you can stop the database.
  2. Create indexes, primary keys and other constraints only after the data has been imported.
  3. Disable client-side logging (SQL_LOG=OFF) and server-side archive logging (media_failure_support=no) for the duration of the import.
  4. Choose a reasonable number of threads and a reasonable commit cycle in CUBRID Manager.
  5. Set data_buffer_size to about two thirds of the system memory.
  6. Set insert_execution_mode=2 so that INSERT INTO ... VALUES ... statements run on the server side.

Did I miss something? Write in the comments about your experience and how often you import data.

Source: https://habr.com/ru/post/128847/

