📜 ⬆️ ⬇️

Greetings to Yandex developers

ClickHouse and PVS-Studio

Approximately once every six months, someone from Yandex employees writes us, is interested in licensing PVS-Studio, shakes the trial and disappears. This is normal, we are accustomed to the slow process of selling our analyzer to large companies. However, once the occasion presented itself, it would not be superfluous to convey hello to the Yandex developers and remind you about the PVS-Studio tool.

Honestly, the article was largely random. We were already offered to check ClickHouse, but somehow this idea was forgotten. The other day, walking across the expanses of the Internet, I again met the mention of ClickHouse and became interested in this project. This time I decided not to delay and check out this project.

Clickhouse


ClickHouse is a column-based DBMS for OLAP (online analytic query processing). ClickHouse was developed in Yandex to solve Yandex.Metrica problems. ClickHouse allows you to perform analytical queries on updated data in real time. The system is linearly scalable and capable of working with trillions of records and petabytes of data. In June 2016, ClickHouse was released on open source under the Apache 2.0 license.


Project analysis using PVS-Studio


I checked the ClickHouse source code, taken from the repository on August 14, 2017. For verification, I used the beta version of the PVS-Studio v6.17 analyzer. By the time of publication of the article, the release of this version has already taken place.
')
The following directories were excluded from the check:
The size of the remaining C ++ source code is 213 KLOC. At the same time, 7.9% of the lines are comments. It turns out that our own code, which was verified, is quite a bit: about 196 KLOC.

As you can see, the ClickHouse project has a small size. At the same time, the quality of the code is definitely high and I will not be able to write a shocking article. In total, the analyzer issued 130 warnings (General analysis, High and Medium messages).

About the number of false positives difficult to answer. There are many warnings that can not be formally called false, but at the same time there is no practical benefit from them. The easiest way is to explain this by giving an example.

int format_version; .... if (format_version < 1 || format_version > 4) throw Exception("Bad checksums format version: " + ....); if (format_version == 1) return false; if (format_version == 2) return read_v2(in); if (format_version == 3) return read_v3(in); if (format_version == 4) return read_v4(in); return false; 

The analyzer draws attention to the fact that if an expression begins to be evaluated (format_version == 4) , then it will always be true. As you can see, there is a check above that if the value of format_version goes beyond [1..4], then an exception is thrown. And the return false statement is never executed at all.

Formally, the analyzer is right and it is not clear how to substantiate that this is a false positive. On the other hand, it is clear that this code is correct and simply written with a “safety margin”.

In such cases, analyzer warnings can be suppressed in various ways or rewritten code. For example, you can write this:

 switch(format_version) { case 1: return false; case 2: return read_v2(in); case 3: return read_v3(in); case 4: return read_v4(in); default: throw Exception("Bad checksums format version: " + ....); } 

There are also warnings about which I simply cannot say: they indicate an error or not. I am not familiar with the project and have no idea how some code fragments should work. Consider this case.

There is a certain scope with 3 functions:

 namespace CurrentMemoryTracker { void alloc(Int64 size); void realloc(Int64 old_size, Int64 new_size); void free(Int64 size); } 

The names of the formal arguments of the functions suggest that the functions must take on some dimensions. Suspicion of the analyzer is caused by cases when, for example, not the size of the structure, but the size of a pointer is passed to the alloc function.

 using Large = HyperLogLogCounter<K, Hash, UInt32, DenominatorType>; Large * large = nullptr; .... CurrentMemoryTracker::alloc(sizeof(large)); 

The analyzer does not know whether it is an error or not. I don't know either, but in my opinion, this code is suspicious.

In general, I will not speak about such cases. If the ClickHouse developers have an interest, they will be able to independently check the project and study the list of warnings in more detail. I in the article will consider only those code fragments that seemed to me the most interesting.

Interesting code snippets


1. CWE-476: NULL Pointer Dereference (3 errors)

 bool executeForNullThenElse(....) { .... const ColumnUInt8 * cond_col = typeid_cast<const ColumnUInt8 *>(arg_cond.column.get()); .... if (cond_col) { .... } else if (cond_const_col) { .... } else throw Exception( "Illegal column " + cond_col->getName() + // <= " of first argument of function " + getName() + ". Must be ColumnUInt8 or ColumnConstUInt8.", ErrorCodes::ILLEGAL_COLUMN); .... } 

PVS-Studio warning : V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 765

Here the situation when an error occurs is incorrectly handled. Instead of generating an exception, the null pointer is dereferenced.

To create an error message, the function is called: cond_col-> getName () . This can not be done, since the pointer cond_col will be zero.

A similar error is found here: V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 1061

Consider another variation on the use of the null pointer:

 void processHigherOrderFunction(....) { .... const DataTypeExpression * lambda_type = typeid_cast<const DataTypeExpression *>(types[i].get()); const DataTypes & lambda_argument_types = lambda_type->getArgumentTypes(); if (!lambda_type) throw Exception("Logical error: .....", ErrorCodes::LOGICAL_ERROR); .... } 

PVS-Studio warning : V595 The 'lambda_type' pointer was used against nullptr. Check lines: 359, 361. TypeAndConstantInference.cpp 359

At the beginning, the lambda_type pointer is dereferenced, and only then is it checked. To fix the code, you need to move the pointer check just above:

 if (!lambda_type) throw Exception("Logical error: .....", ErrorCodes::LOGICAL_ERROR); const DataTypes & lambda_argument_types = lambda_type->getArgumentTypes(); 

2. CWE-665: Improper Initialization (1 error)

 struct TryResult { .... explicit TryResult(Entry entry_) : entry(std::move(entry)) // <= , is_usable(true) , is_up_to_date(true) { } .... Entry entry; .... } 

V546 Member of a class is initialized by itself: 'entry (entry)'. PoolWithFailoverBase.h 74

Because of a typo, the entry member is initialized by itself and as a result actually remains uninitialized. To fix the code, add an underscore:

 : entry(std::move(entry_)) 

3. CWE-672: Operation on a Resource after Expiration or Release (1 error)

 using Strings = std::vector<std::string>; .... int mainEntryClickhousePerformanceTest(int argc, char ** argv) { .... Strings input_files; .... for (const String filename : input_files) // <= { FS::path file(filename); if (!FS::exists(file)) throw DB::Exception(....); if (FS::is_directory(file)) { input_files.erase( // <= std::remove(input_files.begin(), // <= input_files.end(), // <= filename) , // <= input_files.end() ); // <= getFilesFromDir(file, input_files, recursive); } else { if (file.extension().string() != ".xml") throw DB::Exception(....); } } .... } 

PVS-Studio warning : V789 Iterators for the 'input_files' container, use it. PerformanceTest.cpp 1471

The input_files container is used in range-based for loop. At the same time, inside the loop, the container may change due to the removal of some elements. If the reader does not understand why it is impossible to do this, I suggest reading the description of the diagnostic V789 .

4. CWE-563: Assignment to Variable without Use ('Unused Variable') (1 error)

 struct StringRange { const char * first; const char * second; .... StringRange(TokenIterator token_begin, TokenIterator token_end) { if (token_begin == token_end) { first = token_begin->begin; // <= second = token_begin->begin; // <= } TokenIterator token_last = token_end; --token_last; first = token_begin->begin; // <= second = token_last->end; // <= } }; 

The analyzer generates two warnings:
Under certain conditions, the value of token_begin-> begin is assigned first to the variable first and the variable second . Further, the value of these variables in any case changes again. Most likely this code contains some kind of logical error or something is missing in it. For example, the return statement is probably forgotten:

 if (token_begin == token_end) { first = token_begin->begin; second = token_begin->begin; return; } 

5. CWE-570: Expression is Always False (2 errors)

 DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override { .... if (!((.....)) || ((left_is_string || left_is_fixed_string) && (.....)) || (left_is_date && right_is_date) || (left_is_date && right_is_string) || (left_is_string && right_is_date) || (left_is_date_time && right_is_date_time) // 1 || (left_is_date_time && right_is_string) // 1 || (left_is_string && right_is_date_time) // 1 || (left_is_date_time && right_is_date_time) // 2 || (left_is_date_time && right_is_string) // 2 || (left_is_string && right_is_date_time) // 2 || (left_is_uuid && right_is_uuid) || (left_is_uuid && right_is_string) || (left_is_string && right_is_uuid) || (left_is_enum && right_is_enum && .....) || (left_is_enum && right_is_string) || (left_is_string && right_is_enum) || (left_tuple && right_tuple && .....) || (arguments[0]->equals(*arguments[1])))) throw Exception(....); .... } 

In this condition, the three subexpressions are repeated twice. PVS-Studio warnings:
There are two options. The first is that there is no error, the condition is simply redundant and it can be simplified. The second is that there is an error and some conditions are not checked. In any case, authors should check out this code snippet.

Consider another case where the condition is always false.

 static void ipv6_scan(const char * src, unsigned char * dst) { .... uint16_t val{}; unsigned char * colonp = nullptr; while (const auto ch = *src++) { const auto num = unhex(ch); if (num != -1) { val <<= 4; val |= num; if (val > 0xffffu) // <= return clear_dst(); saw_xdigit = 1; continue; } .... } 

PVS-Studio warning : V547 Expression 'val> 0xffffu' is always false. The value range of unsigned short type: [0, 65535]. FunctionsCoding.h 339

When parsing a string containing an IPv6 address, some incorrect IPv6 addresses will be perceived as valid. It is expected that between separators numbers in hexadecimal format with a value of no more than FFFF can be written. If the number is greater, the address should be considered incorrect. To identify this situation in the code there is a check " if (val> 0xffffu) ". But it does not work. The variable val is of type uint16_t , which means it cannot be greater than 0xFFFF. As a result, the function will "swallow" incorrect addresses. As the next part of the address will be the last 4 hex numbers to the separator.

6. CWE-571. Expression is Always True (1 error)

 static void formatIP(UInt32 ip, char *& out) { char * begin = out; for (auto i = 0; i < 3; ++i) *(out++) = 'x'; for (size_t offset = 8; offset <= 24; offset += 8) { if (offset > 0) // <= *(out++) = '.'; /// Get the next byte. UInt32 value = (ip >> offset) & static_cast<UInt32>(255); /// Faster than sprintf. if (value == 0) { *(out++) = '0'; } else { while (value > 0) { *(out++) = '0' + value % 10; value /= 10; } } } /// And reverse. std::reverse(begin, out); *(out++) = '\0'; } 

PVS-Studio warning : V547 Expression 'offset> 0' is always true. FunctionsCoding.h 649

The condition " offset> 0 " is always fulfilled, therefore a point is always added. It seems to me that there is no error and the check is just superfluous. Although of course I'm not sure. If this is not an error, then the check should be removed so that it does not confuse other programmers and static code analyzers.

Conclusion


Perhaps the project developers will be able to find some more errors by looking at the analyzer warnings, which are not reflected in the article. I will finish the story, especially since in order to “say hello,” I had enough material :).

In general, I want to note the high quality code of the ClickHouse project developers. However, how high-class developers would not be, they are not immune from errors, and this article again demonstrates this. The PVS-Studio static code analyzer will help prevent many errors. The greatest effect of static analysis, developers get when writing new code. It makes no sense to spend time debugging errors that can be detected by the analyzer immediately after checking the new code.

I invite everyone to download and try PVS-Studio.


If you want to share this article with an English-speaking audience, then please use the link to the translation: Andrey Karpov. Give my Best Regards to Yandex Developers

Read the article and have a question?
Often our articles are asked the same questions. We collected answers to them here: Answers to questions from readers of articles about PVS-Studio, version 2015 . Please review the list.

Source: https://habr.com/ru/post/337182/


All Articles