Static analysis of PHP code on the example of PHPStan, Phan and Psalm

Badoo has been around for over 12 years. We have a lot of PHP code (millions of lines) and probably even lines written 12 years ago have been preserved. We have code written back in the days of PHP 4 and PHP 5. We post the code twice a day, and each layout contains about 10-20 tasks. In addition, programmers can post urgent patches - small changes. And on the day of such patches, we have a couple of dozen. In general, our code is changing very actively.

We are constantly looking for opportunities to both speed development and improve the quality of the code. And once we decided to implement static code analysis. What came out of it, read under the cut.

Strict types: why we do not use it yet

Once we had a discussion in our corporate PHP chat. One of the new employees told how at the previous place of work they introduced mandatory strict_types + scalar type hints for the entire code - and this significantly reduced the number of production bugs.
')
Most of the old-timers chat was against this innovation. The main reason was that PHP does not have a compiler that, at the compilation stage, would check the conformity of all types in the code, and if you don’t have 100% coverage of the code with tests, there is always a risk that errors will pop up during production, want to admit.

Of course, strict_types will find a certain percentage of bugs caused by type mismatch and the way PHP "silently" converts types. But many experienced PHP programmers already know how the type system works in PHP, what are the rules for type conversion, and in most cases they write the correct, working code.

But we liked the idea of having a certain system showing where in the code there is a mismatch of types. We are thinking about alternatives to strict_types.

At first we even wanted to patch PHP. We wanted that if the function takes some scalar type (say, int), and another scalar type comes in (for example, float), then TypeError (which is essentially an exception) would not throw, but type conversion would occur, as well as the logging of this event in error.log. This would allow us to find all the places where our assumptions about types are wrong. But such a patch seemed risky to us, and there could be problems with external dependencies, which were not ready for such behavior.

We abandoned the idea of patching PHP, but in time it all coincided with the first releases of the static analyzer Phan, the first commits of which were made by Rasmus Lerdorf himself. So we came up with the idea to try static code analyzers.

What is static code analysis

Static code analyzers simply read the code and try to find errors in it. They can perform both very simple and obvious checks (for example, the existence of classes, methods and functions, and more cunning ones (for example, search for type discrepancies, race conditions or vulnerabilities in code). The key is that analyzers do not perform code - they analyze the program text and check it for typical (and not so) errors.

The most obvious example of a static PHP code analyzer are inspections in PHPStorm: when you write code, it highlights incorrect function calls, methods, mismatch of parameter types, etc. At the same time, PHPStorm does not run your PHP code — it only analyzes it.

I note that in this article we are talking about analyzers that are looking for errors in the code. There is another class of analyzers - they check the style of writing code, cyclomatic complexity, the size of methods, the length of lines, etc. We do not consider such analyzers here.

Although not all that the analyzers we are looking at are exactly the wrong thing. By mistake, I mean the code that creates Fatal in production. Very often, what analyzers find is rather an inaccuracy. For example, in PHPDoc, an invalid parameter type may be specified. This inaccuracy does not affect the operation of the code, but subsequently the code will evolve - another programmer may make a mistake.

Existing PHP code analyzers

There are three popular PHP code analyzers:

And there is Exakat , which we have not tried.

From the user's side, all three analyzers are the same: you install them (most likely through Composer), configure them, after which you can run an analysis of the entire project or a group of files. As a rule, the analyzer is able to beautifully display the results in the console. You can also output the results in JSON format and use them in CI.

All three projects are now actively developing. Their maintainers are very actively responding to issues on GitHub. Often, on the first day after the creation of a ticket, at least they react to it (they comment or put a bug / enhancement tag). Many bugs we found were fixed in a couple of days. But I especially like the fact that the maintainers of projects actively communicate with each other, report bugs to each other, and send pull requests.

We have implemented and use all three analyzers. Everyone has their own nuances, their own bugs. But using three analyzers at the same time makes it easier to understand where the real problem is and where the false positive is.

What analyzers can do

Analyzers have many common features, so first consider what they can do, and then we turn to the features of each of them.

Standard checks

Of course, the analyzers perform all standard code checks to ensure that:

the code does not contain syntax errors;
all classes, methods, functions, constants exist;
variables exist;
in PHPDoc, hints are true.

In addition, the analyzers check the code for unused arguments and variables. Many of these errors lead to real fatals in the code.

At first glance it may seem that good programmers do not make such mistakes, but sometimes we are in a hurry, sometimes copy-paste, sometimes we are just inconsiderate. And in such cases, these checks greatly save.

Data Type Checks

Of course, static analyzers also perform standard checks regarding data types. If the code says that the function accepts, say, int, then the analyzer will check if there are any places where the object would be passed to this function. With most analyzers, you can set the test strictness and imitate strict_types: check that no string or Boolean is passed to this function.

In addition to standard checks, analyzers still know how much

Union types

All analyzers support the concept of Union types. Suppose you have a type function:

/** * @var string|int|bool $yes_or_no */ function isYes($yes_or_no) :bool {    if (\is_bool($yes_or_no)) {         return $yes_or_no;     } elseif (is_numeric($yes_or_no)) {         return $yes_or_no > 0;     } else {         return strtoupper($yes_or_no) == 'YES';     } }

Its contents are not very important - the type of the input parameter string|int|bool is important. That is, the variable $yes_or_no is either a string, or an integer, or Boolean .

Using PHP, this type of function parameter cannot be described. But in PHPDoc this is possible, and many editors (for example, PHPStorm) understand it.

In static analyzers, this type is called the union type , and they are very good at checking such data types. For example, if we would write the above function like this (without checking for Boolean ):

 /** * @var string|int|bool $yes_or_no */ function isYes($yes_or_no) :bool {    if (is_numeric($yes_or_no)) {        return $yes_or_no > 0;    } else {        return strtoupper($yes_or_no) == 'YES';    } }

the analyzers would see that either a string or a Boolean can come to the strtoupper, and return an error — you cannot pass a Boolean to the strtoupper.

This type of check helps programmers correctly handle errors or situations where a function cannot return data. We often write functions that can return some data or null :

 // load()  null   \User $User = UserLoader::load($user_id); $User->getName();

In the case of such a code, the analyzer will tell you that the $User variable here can be equal to null and this code can lead to fatal.

Type false

PHP itself has quite a few functions that can return either a value or false. If we wrote such a function, how would we document its type?

          /** @return resource|bool */ function fopen(...) {       … }

Formally, everything seems to be right here: fopen returns either resource or false (which is of type Boolean ). But when we say that a function returns a data type, it means that it can return any value from the set belonging to that data type. In our example for the analyzer, this means that fopen() can return true . And, for example, in the case of such a code:

 $fp = fopen('some.file','r'); if($fp === false) {     return false; } fwrite($fp, "some string");

analyzers would complain that fwrite accepts the first resource parameter, and we pass bool (because the analyzer sees that the variant with true is possible). For this reason, all analyzers understand this “artificial” data type as false , and in our example we can write @return false|resource . PHPStorm also understands this type description.

Array shapes

Very often, arrays in PHP are used as a record type - a structure with a clear list of fields, where each field has its own type. Of course, many programmers are already using classes for this. But we have a lot of legacy code in Badoo, and arrays are actively used there. It also happens that programmers are too lazy to start a separate class for a one-time structure, and in such places they also often use arrays.

The problem with such arrays is that there is no clear description of this structure (a list of fields and their types) in the code. Programmers can make mistakes while working with such a structure: forget the required fields or add the “left” keys, confusing the code even more.

Analyzers allow description of such structures:

 /** @param array{scheme:string,host:string,path:string} $parsed_url */ function showUrl(array $parsed_url) { … }

In this example, we described an array with three string fields: scheme, host and path . If inside the function we turn to another field, the analyzer will show an error.

If you do not describe the types, the analyzers will try to "guess" the structure of the array, but, as practice shows, with our code, this is not very good for them. :)

This approach has one drawback. Suppose you have a structure that is actively used in the code. It is impossible to declare a pseudotype in one place and then use it everywhere. You have to write PHPDoc with the array description everywhere in the code, which is very inconvenient, especially if there are many fields in the array. It will also be problematic to edit this type afterwards (add and delete fields).

Description of key types of arrays

In PHP, array keys can be integers and strings. Sometimes types can be important for static analysis (and for programmers). Static analyzers allow you to describe array keys in PHPDoc:

 /** @var array<int, \User> $users */ $users = UserLoaders::loadUsers($user_ids);

In this example, we used PHPDoc to add a hint that in the $users array, the keys are int-int integers, and the values are objects of the \User class. We could describe the type as \ User []. This would tell the analyzer that there are objects of the \User class in the array, but nothing would tell us about the type of keys.

PHPStorm supports this array description format starting from version 2018.3.

Your PHPDoc namespace

PHPStorm (and other editors) and static analyzers can understand PHPDoc differently. For example, analyzers support this format:

 /** @param array{scheme:string,host:string,path:string} $parsed_url */ function showUrl($parsed_url) { … }

And PHPStorm doesn't understand it. But we can write this:

 /** * @param array $parsed_url * @phan-param array{scheme:string,host:string,path:string} $parsed_url * @psalm-param array{scheme:string,host:string,path:string} $parsed_url */ function showUrl($parsed_url) { … }

In this case, both the analyzers and PHPStorm will be satisfied. PHPStorm will use @param , and analyzers will use their PHPDoc tags.

Checks related to PHP features

This type of verification is best illustrated with an example.

Do we all know what the explode () function can return? If you skim through the documentation, it seems that it returns an array. But if we look more closely, we will see that it can also return false. In fact, it may return both null and an error if you pass the wrong types to it, but passing the wrong value with the wrong data type is already an error, so this option does not interest us right now.

Formally from the point of view of the analyzer, if the function can return false or an array, then, most likely, then the code should check for false. But the explode () function returns false only if the separator (the first parameter) is an empty string. Often, it is explicitly registered in the code, and analyzers can check that it is not empty, which means that in this place, the explode () function accurately returns an array and does not need to check for false.

PHP doesn’t have many of these features. Analyzers gradually add relevant checks or improve them, and we, programmers, no longer need to memorize all these features.

We proceed to the description of specific analyzers.

PHPStan

Development of a certain Ondřej Mirtes from the Czech Republic. Actively developed since the end of 2016.

To start using PHPStan, you need:

Install it (the easiest way to do this is through Composer).
(optional) Configure.
In the simplest case, just run:

vendor/bin/phpstan analyse ./src

(instead of src may be a list of specific files that you want to check).

PHPStan will read the PHP code from the transferred files. If he meets unknown classes, he will try to load them with autoload and through reflection to understand their interface. You can also transfer the path to the Bootstrap file through which you configure the autoload and include some additional files to simplify the PHPStan analysis.

Key features:

You can not analyze the entire code base, but only a part - PHPStan will try to load unknown classes with autoload.
If for some reason some of your classes are not in the autoload, PHPStan will not be able to find them and will give an error.
If you actively use magic methods through __call / __get / __set , then you can write a plugin for PHPStan. There are already plugins for symfony, Doctrine, Laravel, Mockery and others.
In fact, PHPStan performs autoloads not only for unknown classes, but in general for all. We have a lot of old code written before the appearance of anonymous classes, when we create a class in one file, and then we instantiate it right away, and maybe even call some methods. Autoload ( include ) such files leads to errors, because the code is not executed in a normal environment.
Configs in neon format (never heard of this format anywhere else).
No support for their PHPDoc tags like @phpstan-var, @phpstan-return , etc.

Another feature is that errors have text, but there is no type. That is, an error text is returned to you, for example:

Method \SomeClass::getAge() should return int but returns int|null
Method \SomeOtherClass::getName() should return string but returns string|null

In this example, both errors, in general, are about the same: the method must return one type, but in reality it returns the other. But the texts of errors are different, although similar. Therefore, if you want to filter out any errors in PHPStan, do it only through regular expressions.

For comparison, in other analyzers errors have a type. For example, in Phan, such an error has the type PhanPossiblyNullTypeReturn , and you can specify in the config that no verification is required for such errors. Also, having the type of error, you can, for example, easily collect statistics on errors.

Since we do not use Laravel, Symfony, Doctrine and similar solutions, and we rarely use magical methods in the code, the main feature of PHPStan was unclaimed. ; (Besides, due to the fact that PHPStan includes all checked classes, sometimes its analysis simply does not work on our code base.

Nevertheless, PHPStan remains useful for us:

If you need to check several files, PHPStan is noticeably faster than Phan and a little (20-50%) faster than Psalm.
PHPStan reports make it easier for us to find false-positive in other analyzers. Usually, if there is some obvious fatal in the code, it is shown by all analyzers (or at least two out of three).

Update:
PHPStan author Ondřej Mirtes also read our article and suggested that PhpStan, like Psalm, has a sandbox site: https://phpstan.org/ . This is very convenient for bug reports: you reproduce the error in and give a link in GitHub.

Phan

Developed by Etsy. The first commits are from Rasmus Lerdorf.

Of the three considered, Phan is the only true static analyzer (in the sense that it does not execute any of your files — it parses the entire code base and then analyzes what you say). Even for analyzing several files in our code base, it takes about 6 GB of RAM, and this process takes four to five minutes. But a complete analysis of the entire code base takes about six to seven minutes. For comparison, Psalm analyzes it for several tens of minutes. And from PHPStan, we could not achieve a complete analysis of the entire code base due to the fact that it includes classes.

The impression of Phan is twofold. On the one hand, it is the most qualitative and stable analyzer, it finds a lot and the least problems with it when it is necessary to analyze the entire code base. On the other hand, it has two unpleasant features.

Under the hood, Phan uses the php-ast extension. Apparently, this is one of the reasons that the analysis of the entire code base is relatively fast. But php-ast shows the internal representation of the AST tree as it is displayed in PHP itself. And in PHP itself, the AST tree does not contain information about comments that are located inside the function. That is, if you wrote something like:

 /** * @param int $type */ function doSomething($type) {   /** @var \My\Object $obj **/   $obj = MyFactory::createObjectByType($type);   … }

then inside the AST tree there is information about the external PHPDoc for the doSomething() function, but there is no PHPDoc hint information that is inside the function. And, accordingly, Phan also knows nothing about her. This is the most common cause of false-positive in Phan. There are some recommendations on how to insert hints (via strings or assertions), but, unfortunately, they are very different from what our programmers are used to. We partially solved this problem by writing a plugin for Phan. But plugins are discussed below.

The second unpleasant feature is that Phan poorly analyzes the properties of objects. Here is an example:

 class A { /** * @var string|null */ private $a; public function __construct(string $a = null) {      $this->a = $a; } public function doSomething() {      if ($this->a && strpos($this->a, 'a') === 0) {          var_dump("test1");      } } }

In this example, Phan will tell you that in strpos you can pass null. You can learn more about this problem here: https://github.com/phan/phan/issues/204 .

Summary. Despite some difficulties, Phan is a very cool and useful development. In addition to these two types of false-positive , it almost does not make mistakes, or makes mistakes, but on some really complex code. We also liked the fact that the config is in a PHP file - this gives a certain flexibility. Phan can also work as a language server, but we didn’t use this feature, as PHPStorm is enough for us.

Plugins

Phan has a well-developed API for plug-in development. You can add your own checks, improve type inference for your code. This API has documentation , but it’s especially cool that there are already ready-made working plugins inside that can be used as examples.

We managed to write two plugins. The first was intended for a one-time check. We wanted to evaluate how our code is ready for PHP 7.3 (in particular, to find out if there are any case-insensitive constants in it). We were almost sure that there were no such constants, but in 12 years anything could have happened - it should have been checked. And we wrote a plugin for Phan that would curse if the third parameter was used in define() .

The plugin is very simple.

 <?php declare(strict_types=1); use Phan\AST\ContextNode; use Phan\CodeBase; use Phan\Language\Context; use Phan\Language\Element\Func; use Phan\PluginV2; use Phan\PluginV2\AnalyzeFunctionCallCapability; use ast\Node; class DefineThirdParamTrue extends PluginV2 implements AnalyzeFunctionCallCapability { public function getAnalyzeFunctionCallClosures(CodeBase $code_base) : array {   $define_callback = function (       CodeBase $code_base,                  Context $context,                  Func $function,                  array $args    ) {      if (\count($args) < 3) {         return;      }       $this->emitIssue(       $code_base,      $context,      'PhanDefineCaseInsensitiv',      'Define with 3 arguments',      []      );    };    return [          'define' => $define_callback,    ]; } } return new DefineThirdParamTrue();

In Phan, different plugins can be hung on different events. In particular, AnalyzeFunctionCallCapability plug- AnalyzeFunctionCallCapability work when a function call is analyzed. In this plugin, we made it so that when we call the define() function, our anonymous function is called, which checks that define() no more than two arguments. Then we just started Phan, found all the places where define() was called with three arguments, and made sure that we don't have case-insensitive- .

Using the plugin, we also partially solved the false-positive problem, when Phan does not see PHPDoc hints inside the code.

We often use factory methods that take a constant as input and create an object from it. Often the code looks like this:

 /** @var \Objects\Controllers\My $Object */ $Object = \Objects\Factory::create(\Objects\Config::MY_CONTROLLER);

Phan does not understand such PHPDoc hints, but in this code the class of the object can be obtained from the name of a constant passed to the create() method. Phan allows you to write a plugin that is triggered when analyzing the return value of a function. And using this plugin, you can tell the analyzer what type of function the function returns in this call.

An example of this plugin is more complicated. But in the Phan code there is a good example in vendor/phan/phan/src/Phan/Plugin/Internal/DependentReturnTypeOverridePlugin.php.

In general, we are very pleased with the Phan analyzer. Listed above false-positive we partially (in simple cases, with a simple code) learned to filter. After that, Phan became an almost standard analyzer. However, the need to immediately parse the entire code base (time and a lot of memory) still complicates the process of its implementation.

Psalm

Psalm is a Vimeo development. Honestly, I didn’t even know that Vimeo uses PHP until I saw Psalm.

This analyzer is the youngest of our troika. When I read the news that Vimeo released Psalm, I was at a loss: “Why invest resources in Psalm if you already have Phan and PHPStan?” But it turned out that Psalm has its own useful features.

Psalm followed in the footsteps of PHPStan: it can also be given a list of files for analysis, and it will analyze them, and connect the missing classes with an autoload. At the same time, it only includes non-found classes, and the files that we are asked to analyze will not include (this is the difference from PHPStan). The config is stored in an XML file (for us, this is rather a minus, but not very critical).

Psalm has a sandbox site where you can write PHP code and analyze it. This is very convenient for bug reports: you reproduce the error on the site and give a link in GitHub. And, by the way, the site describes all possible types of errors. For comparison: there are no types of errors in PHPStan, but they are in Phan, but there is no single list that could be found.

We also liked that when displaying errors, Psalm immediately shows the lines of code where they were found. This greatly simplifies the reading of reports.

But perhaps the most interesting feature of Psalm are its custom PHPDoc tags, which allow for improved analysis (especially the definition of types). We list the most interesting of them.

@ psalm-ignore-nullable-return

It sometimes happens that a method can formally return null , but the code is already organized in such a way that it never happens. In this case, it is very convenient that you can add this PHPDoc hint to the method / function - and Psalm will assume that null not returned.

A similar hint also exists for false: @psalm-ignore-falsable-return .

Types for closure

If you have ever been interested in functional programming, you might have noticed that there often a function can return another function or take a function as a parameter. In PHP, this style can be very confusing for your colleagues, and one of the reasons is that PHP does not have standards for documenting such functions. For example:

 function my_filter(array $ar, \Closure $func) { … }

How can a programmer understand which interface the function has in the second parameter? What parameters should it take? What should she return?

Psalm supports syntax for describing functions in PHPDoc:

 /** * @param array $ar * @psalm-param Closure(int):bool $func */ function my_filter(array $ar, \Closure $func) { … }

With such a description it is already clear that in my_filter you need to pass an anonymous function, which takes int as input and returns bool. And, of course, Psalm will check that you have exactly such a function here in the code.

Enums

Suppose you have a function that accepts a string parameter, and only certain strings can be passed there:

 function isYes(string $yes_or_no) : bool {     $yes_or_no = strtolower($yes_or_no)     switch($yes_or_no)  {           case 'yes':                 return true;          case 'no':                 return false;           default:                throw new \InvalidArgumentException(…);     } }

Psalm allows you to describe the parameter of this function like this:

 /** @psalm-param 'Yes'|'No' $yes_or_no **/ function isYes(string $yes_or_no) : bool { … }

In this case, the Psalm will try to understand what specific values are passed to this function, and give errors if there are values other than Yes and No

Read more about enum ah here .

Type aliases

Above, when describing array shapes I mentioned that, although analyzers allow us to describe the structure of arrays, it is not very convenient to use this, since the description of the array has to be copied in different places. The correct solution, of course, is to use classes instead of arrays. But in the case of multi-year legacy this is not always possible.

, , , :

;
closure;
union- (, );
enum.

, , PHPDoc , , . Psalm . alias PHPDoc alias . , : PHP-. . , Psalm.

Generics aka templates

. , :

 function identity($x) { return $x; }

? ? ?

, , , — mixed , .

mixed — . , . , identity() / , : , . -. , :

 $i = 5; // int $y = identity($i);

(int) , , $y ( int ).

? Psalm PHPDoc-:

 /** * @template T * @psalm-param T $x * @psalm-return T */ function identity($x) { $return $x; }

templates Psalm , / .

Psalm templates:

— vendor/vimeo/psalm/src/Psalm/Stubs/CoreGenericFunctions.php ;
— vendor/vimeo/psalm/src/Psalm/Stubs/CoreGenericClasses.php .

Phan, : https://github.com/phan/phan/wiki/Generic-Types .

, Psalm . , «» . , Psalm , , Phan PHPStan. .

PHPStorm

: , . , , .

. Phan, language server. PHPStorm, , .

, , PHPStorm ( ), . — Php Inspections (EA Extended). — , , . , . , scopes - scopes.

, deep-assoc-completion . .

Badoo

?

, .

, . , , git diff / , , () . , .

, : - git diff . . , . . , , , , .

, , :

false-positive . , , Phan , , . , - Phan , , .

QA

— , , , . :

100% ( , );
, code review;
, .

strict types . , strict types , :

, strict types , ;
, (, , );
, PHP (, union types , PHP);
strict types , .

:

, . There are several reasons for this.

-, , , - , .

-, , — , , PHPDoc. — .

-, . , - , PHPDoc. :)

, , . , .

Source: https://habr.com/ru/post/426605/

All Articles