Hi, Habrayuzer! I offer you the solution of the problem, which won the
second competition CUBRID it! The essence of the competition is to find the most optimal solution to the SQL problem using Java or PHP. The solution is purely algorithmic, so even if you are not affiliated with CUBRID and the CUBRID it! Contest, then still look under the cat - it can be just interesting and even useful. Go!
The challenge of the competition
There is a database, of course,
CUBRID . The task of the competition is to find the value that is most often found in the database. The database tables use data types from a limited list. The value needs to be cast to the string type using standard database functions and the value should not consist only of numbers. Detailed condition of the problem can be found
here .
Decision
First, we need to get a list of columns and their types in all the tables in our database. To do this, we can use the
PDO features or
the native CUBRID drivers .
')
Getting a list of tables:
cubrid_schema( $this->conn , CUBRID_SCH_CLASS );
Having chosen the necessary tables, we will make a query to each of them:
$result = cubrid_execute( $this->conn , "SELECT * FROM {$table['NAME']} LIMIT 0;"); $column_names = cubrid_column_names($result); $column_types = cubrid_column_types($result);
$column_names
and
$column_types
contain, respectively, arrays with the names of the table fields and the data type of each of them.
Next, we need to make a sample for each column. In general, the query will look like this:
SELECT TO_CHAR(`fieldname`), count(*) FROM `tablename` GROUP BY `fieldname` ORDER BY 2 DESC;
We must do this query for each column in the database. The query may vary slightly, depending on the data type of the column. As a result, we get a table of the form “
Value ” - “
Number of values ​​in a column ”, sorted in descending order of the number of values. Example query result:
Value | amount |
---|
CUBRID | 159 |
HOLA! | 80 |
14.3 | 50 |
... | ... |
This means that the value “
CUBRID ” is found in the field of the table 159 times, and “
14.3 ” is only 50.
The request that we use is quite simple and very fast, with almost no checks, except in some cases. There are many such requests - one for each column in the database, but don’t be scared, we won’t load all these values ​​into PHP memory, we’ll work with a pointer to the result of the query. For convenience, we will save pointers to the results of queries as an array.
Now we have to calculate which value is most often found in the tables of our database. To do this, in the loop we will take one result of the query from our array of pointers and extract the current line from it.
For example, take the three columns
Field1 ,
Field2 ,
Field3 , from which the following data was obtained as a result of the query:
Query result for Field1 :Value | amount |
---|
CUBRID | 159 |
HOLA! | 80 |
14.3 | 50 |
... | ... |
The result of the query for Field2 :Value | amount |
---|
14.3 | 160 |
CUBRID | 100 |
xyz | 90 |
... | ... |
The result of the query for Field3 :Value | amount |
---|
HOLA! | 100 |
14.3 | ten |
xyz | five |
... | ... |
The tables can be of any size, and accordingly the result that is obtained after the query can be a very large table.
At each step of the cycle, we will read one row of these tables and put them into a special array, so after the first round of the tables the result will look like this:
Array('CUBRID' => 159, '14,3' => 160, 'HOLA!' => 100)
At this stage it is checked that the string does not consist only of numbers, according to the condition of the problem.
The resulting data we record in the array. For clarity, I will show the current result in the form of a pivot table, which after the first tour looks like this:
Value | Field1 | Field2 | Field3 | Amount | Potential amount |
---|
14.3 | - | 160 | - | 160 | 259 |
CUBRID | 159 | - | - | 159 | 260 |
HOLA! | - | - | 100 | 100 | 319 |
The data is entered into the array at each new iteration of the loop. A hyphen means that this value has not yet met in this column, and accordingly it can still occur. But if it is met, its quantity will be no more than the quantity of the previous one found in the same column, since these our result tables are sorted in descending order of the number of values. This means that even if the “
CUBRID ” value is next in the “
Field2 ” column, its amount will be no more than 160. “Potential amount” is how much this value can dial, i.e. essentially "the sum of hyphens." The potential quantity for “
CUBRID ” is determined on the assumption that this value will occur in the next iteration in the “
Field2 ” and “
Field3 ”
fields with the maximum possible number (160 and 100), respectively.
What do these data give us? They tell us about how much a particular value
can accumulate. Amount + Potential amount - this is the amount that could theoretically be the result for the current value. Based on this number, we can make an assumption whether it can be first in this list, or it can already be deleted and not taken into account in the future. Next we will see how it works.
After the next step (“
HOLA! ” - 80, “
CUBRID ” - 100, “
14.3 ” - 10) the table looks like this:
Value | Field1 | Field2 | Field3 | Amount | Potential amount |
---|
CUBRID | 159 | 100 | - | 259 | ten |
HOLA! | 80 | - | 100 | 180 | 100 |
14.3 | - | 160 | ten | 170 | 80 |
Pay attention to the value of "
14.3 ": its sum is 170 and potentially it can only dial 80. That is, with the most successful scenario, its quantity will be 250, which is already less than that of the current leader (
CUBRID - 259). So this value will never take first place in our list. It serves him right. We can safely remove the value "14.3" from the list and no longer take it into account. So we will do with all the values ​​that can not become leaders. Thus, we form a "Top" of those values ​​that have a chance to take first place.
At the next step ("
14.3 " - 50, "
xyz " - 90, "
xyz " - 5) we will see that now, any value of which is not "Top" can not dial more than 145 (50 + 90 + 5), and therefore take first place. On this we can stop our search and not watch the remaining values ​​in the tables, since these values ​​simply do not fall into the "Top".
Now that we have an accurate list of candidates for leadership, it is enough for us to find their number in the columns in which we have a hyphen, i.e. in those in which we can still have values. This is done by simple queries of the form:
SELECT `Field3`, count(*) FROM `tablename` WHERE `Field3` = `CUBRID` GROUP BY `Field3`;
Summing up the number in different columns for each value, we find out the most common value in the database. What we actually need from the conditions of the problem.
Thank you for your attention, some interesting links:
Solution source code (GPL v2 license)
An interesting
decision of the winner of the contest
EkstaziChallenge Contest (Eng)Announcement of the contest on Habré (Rus)Winners of the competition (
Eng ,
Rus )
CUBRID Documentation