⬆️ ⬇️

The second competition is CUBRID. Search for a solution

Many have probably heard that the open source project CUBRID decided to arrange a contest, and since the time for delivery of works has already ended, I will tell you about how I solved the contest task, which method I used and which features CUBRID faced.



Task (approximate)



Dana DB, which consists of tables containing well-defined types of columns:
VARCHAR, CHAR, STRING, INT, SMALLINT, BIGINT, NUMERIC, FLOAT, DOUBLE, DATE, TIME, DATETIME  TIMESTAMP. 


It is required to find the most frequent non-numeric value in the database (the one that consists not only of numbers) and the number of its uses. The answer must be recorded in the results table. And that's it (in brief, read more on the contest page).



Task analysis



The solution can stand out from the rest only if it is faster and less demanding of RAM (+ there can be a lot of values). Therefore, I entrusted all the statistics maintenance using the temporary table. And the counters decided to update with the help of the prefix on dublicate key update key update "count"="count"+1 for all insert requests. I did not use hashing, since the tests showed that there is no use for it (most likely, CUBRID already uses hashing for string keys). Multithreading will also not have an effect - at one point in time only one process can change the data in the table, while others have to wait (you can remove the lock, but the integrity of the data is broken).



Decision



The following decision comes to mind:

 init(); foreach(list_tables() as $table){ process_table($table); } save_result(); 


It remains to implement only 4 functions:

  1. init is a function that creates a temporary table.

    Sample code:

     global $tempTableName,$conn; $tempTableName='"temp'.rand(0,32767).'"'; $q='CREATE TABLE '.$tempTableName.'( "val" character varying(255) NOT NULL, "count" bigint DEFAULT 1 NOT NULL, CONSTRAINT pk_temp_hash_val PRIMARY KEY("val") )'; cubrid_execute($conn,$q); 
  2. list_tables is a function that gets a list of tables in the current database.

    Since CUBRID is similar to MySql, the first thing you can do is look at SHOW TABLES , which, however, is not supported by version 8.3.1 (alas, there is little in the documentation about getting a list of tables in the database), so we all go to the Cubrid WebQuery project page and study the code. For the most impatient I immediately quote the required sql query:

      select class_name as table_name from db_class where class_type='CLASS' and is_system_class='NO' and class_name!='results' order by a asc; 


    And use it in our code.

    ')

    You can use the cubrid_schema function, but there you also need to filter the type of classes and make additional. processing.
  3. process_tables is the most interesting function in the entire solution, which makes statistics accordingly.



    There are many possible implementation options:

    • select * + setFetchSize/cubrid_unbuffered_query

      As my tests showed (to whom it is interesting - the java solution is also contained in the archive at the bottom of the post) - not the fastest solution.
    • select * + condition with a call to a stored function in java type:

      where "int"<0 and counter("int") or length(translate("str",'0123456789',''))>0 and counter("str")

      where counter saves the value in the database and returns false.

      However, such a request simply caused my system to freeze (most likely in CUBRID, the functions in java are not completely refined).
    • insert from select ... on dublicate key update "count"="count"+1

      It is this option that I position as the main one (although, in view of the errors in the php driver, when processing double values, I had to invent my crutches).


    Now we need to somehow get a list of columns and their types, since it is obvious that for each type of columns their own processing can take place:

    • For integer type (or decimal without decimals), it suffices to check for a character.

    • For all string values, you must delete all the digits and then check the length of the string.

    • For all other values, no checks are needed.


    The question also arises - in what format to convert the values ​​into a string. After much discussion on the forum, I came to the conclusion that the most ideal option is to use the same format as in the cubrid manager (when it shows the result of the select query).



    Since the same type of verification can be applied to different types of columns (+ there are still different names for the same data type), I brought all the conditions and “aliases” into separate arrays:

     $aliases=array( "STRING"=>"VARCHAR", 'CHAR'=>'VARCHAR', "INT"=>'INTEGER', "SHORT"=>'INTEGER', 'SMALLINT'=>'INTEGER', 'BIGINT'=>'INTEGER', 'DECIMAL'=>'NUMERIC', 'DEC'=>'NUMERIC', "REAL"=>'FLOAT', "DOUBLE PRECISION"=>'DOUBLE', ); // column process criteria and/or format $handlers=array( 'VARCHAR'=>array( 'select'=>'cast(%s as varchar(255))', 'criteria'=>"length(translate(%s,'0123456789',''))>0", ), 'INTEGER'=>array( 'select'=>'cast(%s as varchar(255))', 'criteria'=>"%s<0" ), 'FLOAT'=>array( 'select'=>"to_char(%s,'9.999999EEee')", ), 'DOUBLE'=>array( 'select'=>"to_char([double],'9.9999999999999999EEee')", ), 'DATE'=>array( 'select'=>"TO_CHAR(%s,'YYYY-MM-DD')", ), 'TIME'=>array( 'select'=>"TO_CHAR(%s,'HH24:MI:SS')", ), 'DATETIME'=>array( 'select'=>"to_char(%s,'YYYY-MM-DD HH24:MI:SS.FF')", ), 'TIMESTAMP'=>array( 'select'=>"to_char(%s,'YYYY-MM-DD HH24:MI:SS')", ), 'DEFAULT'=>array( 'select'=>'cast(%s as varchar(255))', ) ); 


    And finally, the final cycle:

     foreach(get_columns($q) as $column){ //echo "\tProcess column:".$column['column']."\n"; while(isset(self::$aliases[$column['type']])) $column['type']=self::$aliases[$column['type']]; // If column is decimal and has no precision then convert type to integer if($column['type']==='NUMERIC' && $column['scale']===0) $column['type']='INTEGER'; $criteria=(isset(self::$handlers[$column['type']])) ? self::$handlers[$column['type']]: self::$handlers["DEFAULT"]; $toSelect=(isset($criteria['select'])) ? $criteria['select'] : '%s'; // Depending of the column type build appropiate criteria $q='insert into '.$tempTableName.' ("hash","val") '. 'SELECT 1,'.$toSelect.' '. 'FROM `'.$qtable.'` '. 'where %s is not null and '.$criteria.' '. 'ON DUPLICATE KEY UPDATE "count"="count"+1'; $q=str_replace('%s','`'.$this->escape($column['column']).'`',$q); cubrid_execute($q); } 


  4. save_result - saves the result and deletes the temporary table

    In order not to overload the article I quote only an sql query to save the result.

     'replace into results ("userid","most_duplicated_value","total_occurrences") select \''.cubrid_real_escape_string($userid).'\',val,"count" from "'.$tempTableName.'" order by "count" desc limit 1' 


    It is used by replace since during testing the code is run many times, and in the case of the insert, an error will occur for the same set of verification data.


findings



Although I sent a solution that was not completely finished to my mind (there was a problem with char and the name of the temporary table was not random, there was no check for null values ​​either) and, at the very last moment, during that time I understood the cubrid performance and mastered java at the beginner level, practiced English and gained a lot of valuable experience. As an experimental method, I found out that CUBRID has a very fast query interpreter, it works very quickly with tables, but it lacks support for temporary tables in RAM and good documentation (now it looks more like a brief reference). Since CUBRID turned out to be a very productive DBMS, in the future (when some of the errors I found are corrected), I will use it in high-load projects.



Links



  1. Curid it - competition page.
  2. Cubrid it forum - discussion page.
  3. Cubrid Webquery - analog cubrid manager, contains useful sql queries.
  4. Cubrid solution - my solution, corrected (java + php + code to fill the test database).




PS: This is my first article on Habré, tried to take into account all the rules (but if something is wrong - write, correct, I will learn from mistakes). I decided to transfer the topic to the sql blog (it will not work in the cubrid company blog), since there is a place for it.

PS2: Habrayuser from cubrid advised a suitable blog. Moved there.

PS3: The decision of the winner of the competition is also published on Habré - we look

Source: https://habr.com/ru/post/123727/



All Articles