📜 ⬆️ ⬇️

Finding errors in the LLVM project code using the PVS-Studio analyzer

PVS-Studio vs LLVM About two months ago I wrote an article about testing the GCC compiler using the PVS-Studio analyzer. The idea of ​​the article was the following: GCC warnings are good, but not enough. One should use specialized code analysis tools, for example, PVS-Studio. As a confirmation, I showed errors that PVS-Studio could find in the GCC code. A number of readers have noticed that the quality of the GCC code and its diagnostics is so-so, while the Clang compiler is modern, high-quality, fresh and young. In general, Clang is hoo! Well, the time has come for me to check the LLVM project with PVS-Studio.

Checking LLVM with the help of Linux version of PVS-Studio


I think there are few who do not know what LLVM is. However, I will keep the tradition of briefly describing a project that has been verified.

LLVM (Low Level Virtual Machine) is a universal system for analyzing, transforming and optimizing programs that implements a virtual machine with RISC-like instructions. It can be used as an optimizing compiler for this bytecode into machine code for various architectures, or for its interpretation and JIT compilation (for some platforms). As part of the LLVM project, Clang frontend was developed for C, C ++ and Objective-C languages, translating source codes into LLVM bytecode, and allowing using LLVM as a full-featured compiler.

Official site: http://llvm.org/
')
Revision 282481 was checked. The analysis was carried out with a new version of PVS-Studio working under Linux. Since PVS-Studio for Linux is a new product, below I will describe in more detail how the check was performed. I’m sure it will show that using our analyzer in Linux is not at all difficult, and you can, without delay, try to check your own project.

Penguin on the PVS-Studio unicorn



The Linux version of the analyzer is available for download on the following page: http://www.viva64.com/ru/pvs-studio-download-linux/

We checked the previous projects with the help of a universal mechanism that keeps track of compiler runs. This time we will use to check the information that PVS-Studio will take from the JSON Compilation Database. Details can be found in the documentation section " How to run PVS-Studio in Linux ".

In LLVM 3.9, they completely abandoned autoconf in favor of CMake, and this was a good reason to test the JSON Compilation Database support in action. What it is? This is the format used by the Clang utilities. It stores the list of compiler calls in the following form:
[ { "directory": "/home/user/llvm/build", "command": "/usr/bin/c++ .... file.cc", "file": "file.cc" }, .... ] 

For CMake projects, getting such a file is very simple - just generate a project with an additional option:
 cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=On ../llvm 

After that, compile_commands.json will appear in the current directory. We need it for verification. Since some projects use code generation, we first do the build.
 make -j8 

Now everything is ready for analysis. It starts with one line:
 pvs-studio-analyzer analyze -l ~/PVS-Studio.lic -o PVS-Studio.log -j 

For projects not using CMake, you can get compile_commands.json using the Bear utility. But for complex build systems that actively use environment variables or cross-compilation, the commands received do not always provide detailed information about the translation unit.

Note N1. How to work with PVS-Studio report in Linux .

Note N2. We provide quality and quick support to our customers and potential users. Therefore, if you do not understand something or something does not work, write to us in support . You will enjoy our service.

Test results


By the way, this is not the first LLVM check. Articles written based on previous checks:
Unfortunately, I can not say anything about the number of false positives and the density of errors found. The project is big, there are a lot of warnings, and I studied them very superficially. In my defense, I can say that the preparation for the release of PVS-Studio for Linux took a lot of time, and I managed to work on the article only with fits and starts.

Everything, the lyrics are over, let's move on to the most interesting. Let us examine the suspicious places in the LLVM code that PVS-Studio pointed to me.

Unbit Fields


In the code there is such an enumeration:
 enum Type { ST_Unknown, // Type not specified ST_Data, ST_Debug, ST_File, ST_Function, ST_Other }; 

This, if I may say so, is the “classical listing”. Each name in the enumeration is assigned an integer value that corresponds to a specific place in the order of values ​​in the enumeration:
Once again, I stress that this is just an enumeration, not a set of masks. If these constants could be combined with each other, then they would be a power of 2.

Now it's time to look at the code where this enumeration is used incorrectly:
 void MachODebugMapParser::loadMainBinarySymbols(....) { .... SymbolRef::Type Type = *TypeOrErr; if ((Type & SymbolRef::ST_Debug) || (Type & SymbolRef::ST_Unknown)) continue; .... } 

PVS-Studio warning : V616 The 'SymbolRef :: ST_Unknown' is used in the bitwise operation. MachODebugMapParser.cpp 448

Recall that the ST_Unknown constant is zero. Therefore, the expression can be abbreviated:
 if (Type & SymbolRef::ST_Debug) 

There is obviously something wrong here. Apparently the programmer who wrote this code decided that it works with an enumeration representing flags. That is, he expected that one constant or another bit corresponds to each constant. But it is not. I think the correct check in the code should look like this:
 if ((Type == SymbolRef::ST_Debug) || (Type == SymbolRef::ST_Unknown)) 

To avoid such errors, I think, one should use the enum class . In this case, the incorrect expression simply would not compile.

One-time cycles


The function is not very complicated, so I decided to bring it entirely. Before reading the article further, I propose to independently guess what is suspicious here.
 Parser::TPResult Parser::TryParseProtocolQualifiers() { assert(Tok.is(tok::less) && "Expected '<' for qualifier list"); ConsumeToken(); do { if (Tok.isNot(tok::identifier)) return TPResult::Error; ConsumeToken(); if (Tok.is(tok::comma)) { ConsumeToken(); continue; } if (Tok.is(tok::greater)) { ConsumeToken(); return TPResult::Ambiguous; } } while (false); return TPResult::Error; } 

PVS-Studio warning : V696 The 'continue' operator will terminate 'do {...} while (FALSE)' loop because the condition is always false. Check lines: 1642, 1649. ParseTentative.cpp 1642

The developers of LLVM, of course, will immediately be able to understand whether there is an error here or not. I have to play detective. Considering this code, I reasoned as follows. The function should read the opening '<' bracket, then it reads identifiers and commas in a loop. If there is no comma, then a closing bracket is expected. If something went wrong, the function returns an error code. I think that the following function algorithm (pseudocode) was conceived:
The trouble is that the cycle is trying to resume using the operator continue . It transfers control at all not to the beginning of the cycle body, but to checking the conditions for continuing the cycle. And the condition is always false . As a result, the cycle immediately ends and the algorithm is as follows:
Thus, only a sequence of one element enclosed in square brackets can be correct. If there are several elements in the sequence separated by a comma, the function will return the error status TPResult :: Error .

We now consider another case where no more than 1 iteration of the loop is performed:
 static bool checkMachOAndArchFlags(....) { .... unsigned i; for (i = 0; i < ArchFlags.size(); ++i) { if (ArchFlags[i] == T.getArchName()) ArchFound = true; break; } .... } 

PVS-Studio warning : V612 An unconditional 'break' within a loop. MachODump.cpp 1206

Pay attention to the break statement. It will break the loop immediately after the first iteration. It seems to me that the break statement should relate to the condition, and then the correct code will look like this:
 for (i = 0; i < ArchFlags.size(); ++i) { if (ArchFlags[i] == T.getArchName()) { ArchFound = true; break; } } 

There are two more similar places, but in order not to get the article too big, I’ll only give analyzer warnings:

The mixed operator || and &&


 static bool containsNoDependence(CharMatrix &DepMatrix, unsigned Row, unsigned Column) { for (unsigned i = 0; i < Column; ++i) { if (DepMatrix[Row][i] != '=' || DepMatrix[Row][i] != 'S' || DepMatrix[Row][i] != 'I') return false; } return true; } 

PVS-Studio warning : V547 Expression is always true. Probably the '&&' operator should be used here. LoopInterchange.cpp 208

The expression does not make sense. I will simplify the code to highlight the essence of the error:
 if (X != '=' || X != 'S' || X != 'I') 

The variable X will always be something not equal. As a result, the condition is always true. Most likely, instead of the operators "||" should have used the operators "&&" , then the expression becomes meaningful.

The function returns a reference to a local object.


 SingleLinkedListIterator<T> &operator++(int) { SingleLinkedListIterator res = *this; ++*this; return res; } 

PVS-Studio warning : V558 Function returns temporary reference: temporary object: res. LiveInterval.h 679

The function represents the traditional implementation of the postfix increment:
The error is that the function returns a link. This link is not valid, since when exiting the function, the temporary res object will be destroyed.

To fix the situation, you need to return the value, not the link:
 SingleLinkedListIterator<T> operator++(int) { .... } 

Reassignment


I'll give the whole function, so that no one would think that before re-assigning the variable ZeroDirective is somehow used.
 HexagonMCAsmInfo::HexagonMCAsmInfo(const Triple &TT) { Data16bitsDirective = "\t.half\t"; Data32bitsDirective = "\t.word\t"; Data64bitsDirective = nullptr; ZeroDirective = "\t.skip\t"; // <= CommentString = "//"; LCOMMDirectiveAlignmentType = LCOMM::ByteAlignment; InlineAsmStart = "# InlineAsm Start"; InlineAsmEnd = "# InlineAsm End"; ZeroDirective = "\t.space\t"; // <= AscizDirective = "\t.string\t"; SupportsDebugInformation = true; MinInstAlignment = 4; UsesELFSectionDirectiveForBSS = true; ExceptionsType = ExceptionHandling::DwarfCFI; } 

PVS-Studio warning : V519 The 'ZeroDirective' variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 25, 31. HexagonMCAsmInfo.cpp 31

The variable ZeroDirective is a simple pointer of type const char * . At the beginning, he points to the string "\ t.skip \ t", and just below he is assigned the address of the string "\ t.space \ t". It is strange and does not make sense. There is a high probability that one of the assignments should change a completely different variable.

Consider another case of reassignment.
 template <class ELFT> void GNUStyle<ELFT>::printFileHeaders(const ELFO *Obj) { .... Str = printEnum(e->e_ident[ELF::EI_OSABI], makeArrayRef(ElfOSABI)); printFields(OS, "OS/ABI:", Str); Str = "0x" + to_hexString(e->e_version); // <= Str = to_hexString(e->e_ident[ELF::EI_ABIVERSION]); // <= printFields(OS, "ABI Version:", Str); Str = printEnum(e->e_type, makeArrayRef(ElfObjectFileType)); printFields(OS, "Type:", Str); .... } 

PVS-Studio warning : V519 The 'Str' variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 2407, 2408. ELFDumper.cpp 2408

Apparently we are dealing with a typo. Instead of reassignment, it was necessary to concatenate two strings using the + = operator. Then the correct code should look like this:
 Str = "0x" + to_hexString(e->e_version); Str += to_hexString(e->e_ident[ELF::EI_ABIVERSION]); 

There are some more code fragments where reassignment occurs. In my opinion, these reassignments do not pose any danger, so I’ll simply list the corresponding warnings:

Suspicious work with smart pointers


 Expected<std::unique_ptr<PDBFile>> PDBFileBuilder::build( std::unique_ptr<msf::WritableStream> PdbFileBuffer) { .... auto File = llvm::make_unique<PDBFile>( std::move(PdbFileBuffer), Allocator); File->ContainerLayout = *ExpectedLayout; if (Info) { auto ExpectedInfo = Info->build(*File, *PdbFileBuffer); .... } 

PVS-Studio warning : V522 Dereferencing of the null pointer 'PdbFileBuffer' might take place. PDBFileBuilder.cpp 106

I don’t understand the code, because, for example, I didn’t study what llvm :: make_unique is and how it all works. Nevertheless, the analyzer and I am alarmed by the fact that at first glance the possession of an object from the smart pointer PdbFileBuffer goes to File . After that, the smart pointer PdbFileBuffer is dereferenced, which, in theory, at this moment already contains nullptr inside itself. That is, the following is alarming:
 .... llvm::make_unique<PDBFile>(::move(PdbFileBuffer), Allocator); .... .... Info->build(*File, *PdbFileBuffer); 

If this is a mistake, then it should be corrected in 3 more places in the same file:

A typo in the condition


 static bool areExclusiveRanges(BinaryOperatorKind OpcodeLHS, const APSInt &ValueLHS, BinaryOperatorKind OpcodeRHS, const APSInt &ValueRHS) { .... // Handle cases where the constants are different. if ((OpcodeLHS == BO_EQ || OpcodeLHS == BO_LE || // <= OpcodeLHS == BO_LE) // <= && (OpcodeRHS == BO_EQ || OpcodeRHS == BO_GT || OpcodeRHS == BO_GE)) return true; .... } 

PVS-Studio warning : V501 There are identical sub-expressions 'OpcodeLHS == BO_LE' operator. RedundantExpressionCheck.cpp 174

This is a classic typo. The variable OpcodeLHS is twice compared with the constant BO_LE . It seems to me that one of the constants BO_LE should be replaced by BO_LT . As you can see the names of the constants are similar to each other and easy to confuse.

The following example demonstrates how static analysis complements other methodologies for writing high-quality code. Consider the error code:
 std::pair<Function *, Function *> llvm::createSanitizerCtorAndInitFunctions( .... ArrayRef<Type *> InitArgTypes, ArrayRef<Value *> InitArgs, ....) { assert(!InitName.empty() && "Expected init function name"); assert(InitArgTypes.size() == InitArgTypes.size() && "Sanitizer's init function expects " "different number of arguments"); .... 

}

PVS-Studio warning : V501 There are identical sub-expressions 'InitArgTypes.size ()' to the left. ModuleUtils.cpp 107

One good way to increase code reliability is to use assert () macros. This and similar macros help to identify many errors during the development and debugging of the program. However, I will not go into the detailed descriptions of the benefits brought by such macros, since this is beyond the scope of the article.

What matters to us is that the createSanitizerCtorAndInitFunctions () function uses assert () macros to check the correctness of the input values. That's just because of a typo second assert () is useless.

Fortunately, we are helped by a static analyzer, which notices that the size of the array compares with itself. As a result, we can fix the check, and the correct condition in assert () over time can help prevent some other error.

Apparently, the condition should compare the sizes of the InitArgTypes and InitArgs arrays :
 assert(InitArgTypes.size() == InitArgs.size() && "Sanitizer's init function expects " "different number of arguments"); 

Confusion between release () and reset ()


In the class std :: unique_ptr there are two consonant functions: release and reset . As my observations show, they are sometimes confused. Apparently it happened here:
 std::unique_ptr<DiagnosticConsumer> takeClient() { return std::move(Owner); } VerifyDiagnosticConsumer::~VerifyDiagnosticConsumer() { .... SrcManager = nullptr; CheckDiagnostics(); Diags.takeClient().release(); } 

PVS-Studio warning : V530. VerifyDiagnosticConsumer.cpp 46

Perhaps there is no mistake here and there is some kind of special cunning logic. But it’s more like a resource drain. In any case, this code does not hurt to once again check the developers.

Excess conditions


 bool ARMDAGToDAGISel::tryT1IndexedLoad(SDNode *N) { LoadSDNode *LD = cast<LoadSDNode>(N); EVT LoadedVT = LD->getMemoryVT(); ISD::MemIndexedMode AM = LD->getAddressingMode(); if (AM == ISD::UNINDEXED || LD->getExtensionType() != ISD::NON_EXTLOAD || AM != ISD::POST_INC || LoadedVT.getSimpleVT().SimpleTy != MVT::i32) return false; .... } 

PVS-Studio warning : V590 Consider inspecting this expression. The expression is misprint. ARMISelDAGToDAG.cpp 1565

The condition is long, so I will highlight the most important thing:
 AM == ISD::UNINDEXED || AM != ISD::POST_INC 

This condition is redundant and can be requested to:
 AM != ISD::POST_INC 

So here we see just redundancy in the condition or some kind of error. Perhaps redundancy indicates to us that they wanted to write some other condition. I do not presume to judge how dangerous this place is, but it’s worth checking out. At the same time I want to draw the attention of developers to 2 more warnings of the analyzer:

My Favorite Warnings V595


Pointers in C and C ++ are an endless programmer headache. Check them to zero, check, and somewhere - once! - and again null pointer dereference. Diagnostics V595 detects situations when checking a pointer to zero equality is too late. Before this check, the pointer is already used. This is one of the most typical errors that we find in the code of various applications ( proof ). However, in defense of C / C ++ I’ll say that the situation in C # is not much better. From the fact that pointers in C # were called links, such errors did not disappear ( proof ).

Let's return to the LLVM code and consider a simple version of the error:
 bool PPCDarwinAsmPrinter::doFinalization(Module &M) { .... MachineModuleInfoMachO &MMIMacho = MMI->getObjFileInfo<MachineModuleInfoMachO>(); if (MAI->doesSupportExceptionHandling() && MMI) { .... } 

PVS-Studio warning : V595 The 'MMI' pointer was used before it was verified against nullptr. Check lines: 1357, 1359. PPCAsmPrinter.cpp 1357

The case is simple and everything is immediately clear. The check (... && MMI) tells us that the MMI pointer can be zero. If so, the program flow will not get to this check. It will be interrupted earlier due to null pointer dereferencing.

Consider another code snippet:
 void Sema::CodeCompleteObjCProtocolReferences( ArrayRef<IdentifierLocPair> Protocols) { ResultBuilder Results(*this, CodeCompleter->getAllocator(), CodeCompleter->getCodeCompletionTUInfo(), CodeCompletionContext::CCC_ObjCProtocolName); if (CodeCompleter && CodeCompleter->includeGlobals()) { Results.EnterNewScope(); .... } 

PVS-Studio warning : V595 The 'CodeCompleter' pointer was used against nullptr. Check lines: 5952, 5955. SemaCodeComplete.cpp 5952

The CodeCompleter pointer is first dereferenced, and below is a check for the equality of this pointer to zero. The same code is found three more times in the same file:
These were simple cases, but there is also more confusing code, where I cannot tell right away how dangerous it is. Therefore, I suggest that developers independently verify the following sections of the LLVM code:

Strange code


I apologize for presenting a hard-to-read snippet of code. Suffer it, until the end of the article there is not much.
 static bool print_class_ro64_t(....) { .... const char *r; uint32_t offset, xoffset, left; .... r = get_pointer_64(p, offset, left, S, info); if (r == nullptr || left < sizeof(struct class_ro64_t)) return false; memset(&cro, '\0', sizeof(struct class_ro64_t)); if (left < sizeof(struct class_ro64_t)) { memcpy(&cro, r, left); outs() << " (class_ro_t entends past the .......)\n"; } else memcpy(&cro, r, sizeof(struct class_ro64_t)); .... } 

PVS-Studio warning : V649 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains function return. This means that the statement is senseless. Check lines: 4410, 4413. MachODump.cpp 4413

Pay attention to the check:
 if (.... || left < sizeof(struct class_ro64_t)) return false; 

If the value contained in the variable left is smaller than the class size, then the function will exit. It turns out that this choice of behavior does not make sense:
 if (left < sizeof(struct class_ro64_t)) { memcpy(&cro, r, left); outs() << " (class_ro_t entends past the .......)\n"; } else memcpy(&cro, r, sizeof(struct class_ro64_t)); 

The condition is always false, and therefore the else-branch is always executed. It is very strange. Perhaps the program contains a logical error, or we are dealing with a typo.

At the same time you should check this place:

Different trifle


Inside the template RPC class, the SequenceNumberManager class is declared. It has this move assignment operator:
 SequenceNumberManager &operator=(SequenceNumberManager &&Other) { NextSequenceNumber = std::move(Other.NextSequenceNumber); FreeSequenceNumbers = std::move(Other.FreeSequenceNumbers); } 

PVS-Studio warning : V591 Non-void function should return a value. RPCUtils.h 719

As you can see at the end forgot to write return:
 return *this; 

In fact, there is nothing wrong with that. Compilers, as a rule, do not work with bodies of functions of template classes, if these functions are not used. Here, apparently, is such a case. Although I did not check, but I’m sure: if you call such a move statement, the compiler will generate a compilation error or a loud warning. So there is nothing wrong here, but I decided to point out this flaw.

We met several strange sections of the code where the value of the pointer that returned the new operator is checked for equality to zero. , , std::bad_alloc . :
 LLVMDisasmContextRef LLVMCreateDisasmCPUFeatures(....) { .... // Set up the MCContext for creating symbols and MCExpr's. MCContext *Ctx = new MCContext(MAI, MRI, nullptr); if (!Ctx) return nullptr; .... } 

PVS-Studio: V668 There is no sense in testing the 'Ctx' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. Disassembler.cpp 76

2 :
This code does not look dangerous, so I decided to describe them in the non-essential warnings section. Most likely, all these three checks can be simply deleted.

Conclusion


As you can see, compiler warnings are good, but not enough. Specialized static analysis tools, such as PVS-Studio, will always be ahead of the compilers in diagnostic capabilities and configuration flexibility when working with warnings. Actually, this is where analyzer developers make their money.

It is also important to note that the main effect of applying the methodology of static analysis is achieved with the regular use of static code analyzers. Many errors will be detected at the earliest stage, and they will not need to be debugged or the user be asked to describe in detail the sequence of actions leading to the crash of the program. Here is a complete analogy with the compiler warnings (in fact, these are the same warnings, but more intelligent). You see the compiler warnings constantly, not once a month?

We invite you to download and try PVS-Studio on the code of your project.


If you want to share this article with an English-speaking audience, then please use the link to the translation: Andrey karpov. Finding bugs in the code of LLVM project with the help of PVS-Studio .

Read the article and have a question?
Often our articles are asked the same questions. We collected answers to them here: Answers to questions from readers of articles about PVS-Studio, version 2015 . Please review the list.

Source: https://habr.com/ru/post/314044/


All Articles