
One day, our team discovered that the performance of the system was draining when performing a fairly simple SQL query:
select count(*) n from products where category_id = ?
')
Of course, the question arose how to optimize it.
A savvy reader may immediately think about indexes, hints, and other DBMS features. But today the story will not be about them. And in general, will not touch the topic of optimizing SQL queries.
Today we will talk about a very simple method of refactoring, which in this particular case allowed to significantly increase system performance.
This query was in that old code that no one had climbed for a couple of years, in the SQLConsts class, among other SQL queries:
public class SQLConsts { public static final String PRODUCTS_SQL = "select count(*) n from products where category_id = ?"; ...
And it was used in another class - CategoryRepository:
public class CategoryRepository { ... private boolean isCategoryVisible(int categoryID) { ResultSet resultSet = executeQuery(SQLConsts.PRODUCTS_SQL, categoryID); int n = resultSet.getIntegerFieldByName("n"); return n > 0; } ...
Even a not very experienced programmer will probably notice that it is unnecessary to calculate the number of lines in the request, if later this number is simply compared with zero.
How did this apparent epicfile come about? Analysis of Git-logs showed that initially in the isCategoryVisible method there was a more complex logic that used the number of lines in its calculations. But then they abandoned complex logic, and only
n > 0
remained. Apparently, the programmer who made these changes simply didn’t have a question, what is
n
, and he didn’t watch the SQL query itself, especially since it was in a completely different file.
Now that these two pieces of code are nearby, the optimization becomes apparent. As a result, the isCategoryVisible method was rewritten:
select count(*)
replaced by the construction with
where exists
, which gave a noticeable performance increase on large amounts of data; and from class SQLConsts subsequently got rid of.
public class CategoryRepository { ... private boolean isCategoryVisible(int categoryID) { ResultSet resultSet = executeQuery( "select null from dual where exists (select null from products where category_id = ?)", categoryID ); return !resultSet.isEmpty(); } ...
Hence the rule: “that which changes simultaneously, must be stored in one place. The data and functions that use this data usually change together, ”wrote Martin Fowler in his book“ Refactoring. Improving existing code "more than ten years ago.
In our case, the data (SQL query) was stored in one class — SQLConsts, and the isCategoryVisible function that used this data — in the other: in the CategoryRepository. Fowler calls such functions envious, because they are more interested not in the class they are in, but in some other. And most often, the envy is data, as in our case: isCategoryVisible seems to be jealous that another SQLConsts class stores a SQL query that is not needed by this class, but isCategoryVisible.
Once again:
that which changes at the same time, must be kept in one place - repeat it as a mantra, until this rule becomes a habit with you. When you stop thinking about it and follow it on a subconscious level, you will not notice how your code will become cleaner.
Functional envy
It should be noted that the article provides a non-classical example of an envious function. In the original, the envious functions are called “feature envy”, which literally means “functional envy”. Thus, I believe that functional envy according to Fowler is not limited to functions / methods alone and can also be extended to whole classes, although Fowler himself cites only envy of methods as an example.
Therefore, it might be advisable to talk about the category Category enpository envious class, since in reality all methods of this class used data from SQLConsts, and SQLConsts itself did not use this data.
In more detail my vision of this question was voiced here:
habrahabr.ru/post/220883/#comment_7547819PS
And what if the variable
n
was called, for example,
productCount
, and the constant
PRODUCTS_SQL
-
PRODUCT_COUNT_IN_CATEGORY
? Then
productCount > 0
would push the developer to think about whether he needs to calculate the quantity in the request.
Therefore, the second rule:
let us understand the names of variables, constants, methods and classes . Perhaps this rule is even more important than the rule of the first.
UPDATE
A small educational program for those who do not like
exists
.
The
exists
operator returns
true
if at least one record from the subquery satisfies the condition category_id =?
Thus, the DBMS will not select all rows from the subquery: it is enough to find the first record that satisfies the condition.
Therefore, these two options will be equally effective:
select null from dual where exists (select null from products where category_id = ?)
select null from products where category_id = ? and rownum = 1
*
rownum = 1
in Oracle is the same as
limit 1
in MySQL.
But
where NOT exists
really leads to the enumeration of all suitable records. But it should be noted that in this case it will not be possible to use
rownum = 1
.
UPDATE 2
For those who are interested in whether the index was used for the category_id column or not, I inform you that YES has been used.
From the point of view of SQL optimization, nothing could be done with the original query without changing the logic of its operation.