
GitPHP at Badoo

Badoo is a project with a huge Git repository containing thousands of branches and tags. We use a heavily modified GitPHP ( http://gitphp.org ) version 0.2.4, to which we have added many extensions (including integration with our JIRA workflow, support for our code review process, and so on). On the whole we were satisfied with the product, until we noticed that pages for our main repository took more than 20 seconds to open. In this article we describe how we investigated GitPHP's performance and what results we achieved.

Setting up timers


When developing badoo.com in the development environment, we use a very simple debug panel for setting timers and debugging SQL queries. So the first thing we did was port it to GitPHP and start measuring the execution time of sections of code, excluding the time of nested timers. The results are shown in a debug panel.


The first column contains the name of the method (or action) being called. The second contains additional information: the launch arguments, the beginning of the command output, and a stack trace. The last column is the time spent on the call (in seconds).

Here is a short excerpt from the implementation of the timers themselves:

    <?php
    class GitPHP_Log
    {
        // ...

        public function timerStart()
        {
            array_push($this->timers, microtime(true));
        }

        public function timerStop($name, $value = null)
        {
            $timer = array_pop($this->timers);
            $duration = microtime(true) - $timer;
            // shift the start of every enclosing timer so that nested time is not counted in it
            foreach ($this->timers as &$item) $item += $duration;
            unset($item); // drop the dangling reference left by foreach
            $this->Log($name, $value, $duration);
        }

        // ...
    }

Using this API is very simple. timerStart() is called at the beginning of the measured code, and timerStop() at the end, with the timer name and optional additional data:

    <?php
    $Log = new GitPHP_Log;
    $Log->timerStart();
    $result = 0;
    $mult = 4;
    for ($i = 1; $i < 1000000; $i += 2) {
        $result += $mult / $i;
        $mult = -$mult;
    }
    $Log->timerStop("PI computation", $result);

Calls can be nested, and the class above takes this into account: the time of an inner timer is subtracted from every timer that encloses it.
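The nesting behaviour can be illustrated with a small standalone sketch. It is not the GitPHP class: SimpleTimerLog and its entries property are hypothetical names used only for this example, and the sleeps stand in for real work.

```php
<?php
// Standalone sketch of the nested-timer idea; SimpleTimerLog is a hypothetical name.
final class SimpleTimerLog
{
    private $timers = array();  // stack of start timestamps
    public $entries = array();  // timer name => exclusive duration in seconds

    public function timerStart()
    {
        $this->timers[] = microtime(true);
    }

    public function timerStop($name)
    {
        $duration = microtime(true) - array_pop($this->timers);
        // Push the start of every enclosing timer forward so that
        // this block's time is not counted in them as well.
        foreach ($this->timers as $i => $start) {
            $this->timers[$i] = $start + $duration;
        }
        $this->entries[$name] = $duration;
    }
}

$log = new SimpleTimerLog();
$log->timerStart();   // outer timer
usleep(20000);        // about 20 ms of the outer block's own work
$log->timerStart();   // inner timer
usleep(100000);       // about 100 ms attributed to the inner timer only
$log->timerStop('inner');
$log->timerStop('outer');

printf("inner: %.3f s, outer (exclusive): %.3f s\n", $log->entries['inner'], $log->entries['outer']);
```

The outer entry reports only its own time, so the two numbers can be added up without counting the inner block twice.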

To make it easier to debug code inside Smarty, we made “auto-timers”. They make it easy to measure the time spent in methods that have multiple exit points (many places where return is executed):

    <?php
    class GitPHP_DebugAutoLog
    {
        private $name;

        public function __construct($name)
        {
            $this->name = $name;
            GitPHP_Log::GetInstance()->timerStart();
        }

        public function __destruct()
        {
            GitPHP_Log::GetInstance()->timerStop($this->name);
        }
    }

Using this class is very simple: insert $Log = new GitPHP_DebugAutoLog('timer_name'); at the beginning of any function or method, and its execution time is measured automatically when the function exits:

    <?php
    function doSomething($a)
    {
        // the timer stops when $Log is destroyed, i.e. on any return from the function
        $Log = new GitPHP_DebugAutoLog('doSomething');
        if ($a > 5) {
            echo "Hello world!\n";
            sleep(5);
            return;
        }
        sleep(1);
    }

Thousands of calls to git cat-file -t <commit>


Thanks to the timers, we quickly found where GitPHP 0.2.4 was spending most of its time. For each tag in the repository it made one call to git cat-file -t, only to find out the object type and so determine whether the tag is a “lightweight tag” ( http://git-scm.com/book/en/Git-Basics-Tagging#Lightweight-Tags ). A lightweight tag is the kind of tag Git creates by default; it is simply a reference to a specific commit. Since our repository contained no other kinds of tags, we simply removed this check and saved a couple of thousand git cat-file -t calls, which together took about 20 seconds.

How did it happen that GitPHP needed to find out for each tag in the repository whether it is “lightweight”? It's pretty simple.

On every GitPHP page, each commit is displayed together with the branches and tags that point to it.



To do this, the GitPHP_TagList class has a method that is responsible for getting a list of tags that reference the specified commit:

    <?php
    class GitPHP_TagList extends GitPHP_RefList
    {
        // ...

        public function GetCommitTags($commit)
        {
            if (!$commit) return array();
            $commitHash = $commit->GetHash();
            if (!$this->dataLoaded) $this->LoadData();

            $tags = array();
            foreach ($this->refs as $tag => $hash) {
                if (isset($this->commits[$tag])) {
                    // ...
                } else {
                    $tagObj = $this->project->GetObjectManager()->GetTag($tag, $hash);
                    $tagCommitHash = $tagObj->GetCommitHash();
                    // ...
                    if ($tagCommitHash == $commitHash) {
                        $tags[] = $tagObj;
                    }
                }
            }
            return $tags;
        }

        // ...
    }

That is, for each commit whose tags are needed, the following happens:

  1. On the first call, the list of all tags in the repository is loaded (the LoadData() call).
  2. All tags are iterated over.
  3. For each tag, the corresponding object is loaded.
  4. GetCommitHash() is called on the tag object, and the result is compared with the hash being looked for.

Leaving aside the fact that one could first build a map of the form array( commit_hash => array(tags) ), note the GetCommitHash() method: it calls Load($tag), which, in the implementation based on the external Git utility, does the following:

    <?php
    class GitPHP_TagLoad_Git implements GitPHP_TagLoadStrategy_Interface
    {
        // ...

        public function Load($tag)
        {
            // ...
            $args[] = '-t';
            $args[] = $tag->GetHash();
            $ret = trim($this->exe->Execute($tag->GetProject()->GetPath(), GIT_CAT_FILE, $args));
            if ($ret === 'commit') {
                // ...
                return array(/* ... */);
            }
            // ...
            $ret = $this->exe->Execute($tag->GetProject()->GetPath(), GIT_CAT_FILE, $args);
            // ...
            return array(/* ... */);
        }
    }

That is, to show which branches and tags point to a commit, GitPHP loads the list of all tags and calls git cat-file -t for each of them. Not bad, Christopher, keep it up!
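The map of the form array( commit_hash => array(tags) ) mentioned above could be built once per request, so that looking up the tags of a commit no longer means walking the whole tag list. The following is only a sketch: buildCommitTagMap() is a hypothetical helper, not part of GitPHP, and it assumes that every tag has already been resolved to the hash of the commit it points to. The hashes are made up for the example.

```php
<?php
/**
 * Hypothetical helper: inverts tag_name => commit_hash
 * into commit_hash => list of tag names.
 */
function buildCommitTagMap(array $tagToCommit)
{
    $map = array();
    foreach ($tagToCommit as $tag => $commitHash) {
        $map[$commitHash][] = $tag;
    }
    return $map;
}

// Made-up data for illustration.
$tags = array(
    'v1.0'      => 'a1b2c3',
    'release-1' => 'a1b2c3',
    'v1.1'      => 'd4e5f6',
);
$map = buildCommitTagMap($tags);

// Looking up the tags of one commit is now a single array access.
$tagsForCommit = isset($map['a1b2c3']) ? $map['a1b2c3'] : array();
echo implode(', ', $tagsForCommit), "\n";
```

With such a map the cost of building it is paid once, and each commit on the page needs one lookup instead of one pass over all tags.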

Hundreds of calls to git rev-list --max-count=1 ... <commit>


The situation with commit information is similar. To load the date, commit message, author, and so on, git rev-list --max-count=1 ... <commit> was called every time. This operation is not free either:

    <?php
    class GitPHP_CommitLoad_Git extends GitPHP_CommitLoad_Base
    {
        public function Load($commit)
        {
            // ...
            /* get data from git_rev_list */
            $args = array();
            $args[] = '--header';
            $args[] = '--parents';
            $args[] = '--max-count=1';
            $args[] = '--abbrev-commit';
            $args[] = $commit->GetHash();
            $ret = $this->exe->Execute($commit->GetProject()->GetPath(), GIT_REV_LIST, $args);
            // ...
            return array(
                // ...
            );
        }

        // ...
    }

Solution: batch commit loading (git cat-file --batch)


To avoid many single calls to git cat-file, Git lets you load many objects at once with the --batch option. The command reads a list of object hashes from stdin and writes the results to stdout. So we can first write all the commit hashes we need to a file, run git cat-file --batch, and load all the results at once.

Here is an example of code that does this (written for GitPHP 0.2.4 and *nix operating systems):

    <?php
    class GitPHP_Project
    {
        // ...

        public function BatchReadData(array $hashes)
        {
            if (!count($hashes)) return array();

            $outfile = tempnam('/tmp', 'objlist');
            $hashlistfile = tempnam('/tmp', 'objlist');
            file_put_contents($hashlistfile, implode("\n", $hashes));

            $Git = new GitPHP_GitExe($this);
            $Git->Execute(GIT_CAT_FILE, array('--batch', ' < ' . escapeshellarg($hashlistfile), ' > ' . escapeshellarg($outfile)));
            unlink($hashlistfile);

            $fp = fopen($outfile, 'r');
            unlink($outfile); // on *nix the open handle stays readable after unlink

            $types = $contents = array();
            while (!feof($fp)) {
                // header line: "<hash> <type> <size>"
                $ln = rtrim(fgets($fp));
                if (!$ln) continue; // the empty line that follows each object
                list($hash, $type, $n) = explode(" ", $ln);
                $contents[$hash] = fread($fp, $n);
                $types[$hash] = $type;
            }
            fclose($fp);

            return array('contents' => $contents, 'types' => $types);
        }

        // ...
    }

We started using this function on most pages that display commit information (that is, we collect the list of commits and load them all with a single call to git cat-file --batch). This optimization cut the average page load time from more than 20 seconds to 0.5 seconds. That solved the problem of slow GitPHP in our project.
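For reference, git cat-file --batch prints each object as a header line of the form <hash> <type> <size>, followed by exactly <size> bytes of content and a newline. For an object it cannot find it prints <hash> missing with no body. The sketch below parses such output from an in-memory string instead of a file. parseBatchOutput() is a hypothetical helper, and the hashes and contents in the sample are shortened, made-up values.

```php
<?php
/**
 * Hypothetical helper: parses the output of `git cat-file --batch`
 * into hash => array('type' => ..., 'contents' => ...).
 */
function parseBatchOutput($output)
{
    $objects = array();
    $pos = 0;
    $len = strlen($output);
    while ($pos < $len) {
        $eol = strpos($output, "\n", $pos);
        if ($eol === false) {
            break; // truncated header, stop parsing
        }
        $parts = explode(' ', substr($output, $pos, $eol - $pos));
        $pos = $eol + 1;
        if (count($parts) !== 3) {
            continue; // e.g. "<hash> missing": no body follows
        }
        list($hash, $type, $size) = $parts;
        $size = (int)$size;
        $objects[$hash] = array(
            'type'     => $type,
            'contents' => (string)substr($output, $pos, $size),
        );
        $pos += $size + 1; // skip the body and the newline after it
    }
    return $objects;
}

// Made-up sample in the batch format (real hashes are 40 hex characters).
$body = "tree 1234\nauthor Someone\n\nInitial commit\n";
$sample = "aaa commit " . strlen($body) . "\n" . $body . "\n"
        . "bbb missing\n"
        . "ccc blob 5\nhello\n";

$objects = parseBatchOutput($sample);
echo count($objects), " objects parsed\n";
```

Reading the body by its declared size rather than line by line matters because object contents may themselves contain newlines.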

Open-source: optimizing GitPHP 0.2.9 (master)


After some thought, we realized that we did not have to rewrite all the code in order to use git cat-file --batch. Although this is not documented, the command lets you load information one commit at a time without losing performance! It reads one line from standard input and sends the result to standard output without buffering. This means we can open git cat-file --batch via proc_open() and get results immediately, without reworking the architecture!

Here is an excerpt from the implementation (error handling is removed for readability):

    <?php
    // ...
    class GitPHP_GitExe implements GitPHP_Observable_Interface
    {
        // ...

        public function GetObjectData($projectPath, $hash)
        {
            $process = $this->GetProcess($projectPath);
            $pipes = $process['pipes'];

            // ask for one object and flush so that git sees the request immediately
            $data = $hash . "\n";
            fwrite($pipes[0], $data);
            fflush($pipes[0]);

            // header line: "<hash> <type> <size>"
            $ln = rtrim(fgets($pipes[1]));
            $parts = explode(" ", $ln);
            list($hash, $type, $n) = $parts;

            // read exactly $n bytes of object contents
            $contents = '';
            while (strlen($contents) < $n) {
                $buf = fread($pipes[1], min(4096, $n - strlen($contents)));
                $contents .= $buf;
            }
            fgetc($pipes[1]); // consume the newline git prints after the contents

            return array(
                'contents' => $contents,
                'type' => $type,
            );
        }

        // ...
    }

Now that we can load object contents very quickly without launching a git command every time, getting a big performance boost became easy: all it takes is replacing every git cat-file and git rev-list call with a call to our optimized function.
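The excerpt above calls GetProcess() without showing it. One way such a method might work is to start one long-lived child process per repository path with proc_open() and reuse it on later calls. The sketch below is an assumption about that idea, not the GitPHP code: PersistentProcessPool, get() and closeAll() are hypothetical names, and error handling is minimal.

```php
<?php
// Hypothetical helper, not part of GitPHP: keeps one child process per working directory.
final class PersistentProcessPool
{
    private $processes = array(); // cwd => array('proc' => resource, 'pipes' => array)

    /** Returns the cached process for $cwd, starting $command there on first use. */
    public function get($command, $cwd)
    {
        if (!isset($this->processes[$cwd])) {
            $spec = array(
                0 => array('pipe', 'r'), // child's stdin: requests are written here
                1 => array('pipe', 'w'), // child's stdout: responses are read here
            );
            $proc = proc_open($command, $spec, $pipes, $cwd);
            if (!is_resource($proc)) {
                throw new RuntimeException("Cannot start: $command");
            }
            $this->processes[$cwd] = array('proc' => $proc, 'pipes' => $pipes);
        }
        return $this->processes[$cwd];
    }

    /** Closes the pipes and waits for every child to exit. */
    public function closeAll()
    {
        foreach ($this->processes as $process) {
            foreach ($process['pipes'] as $pipe) {
                fclose($pipe);
            }
            proc_close($process['proc']);
        }
        $this->processes = array();
    }
}
```

With a pool like this, a call such as $pool->get('git cat-file --batch', $projectPath) would start git only on the first request for a repository; later requests write a hash to the same stdin pipe and read the answer from the same stdout pipe.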

We collected all the changes into a single commit and sent a pull request to the GitPHP developer. After some time, the patch was accepted! Here is the commit:

source.gitphp.org/projects/gitphp.git/commitdiff/3c87676b3afe4b0c1a1f7198995cec17170000482

The author made some corrections to the code (in separate commits), and the master branch now contains a significantly faster version of GitPHP! To use the optimizations, you need to turn off “compatibility mode”, that is, set $compat = false; in the configuration.

Yuri youROCK Nasretdinov, PHP developer, Badoo
Eugene eZH Makhrov, QA Engineer, Badoo

Source: https://habr.com/ru/post/200946/

