Git internals: data storage and merge

While migrating from SVN to Git, we had to rewrite our internal deployment tools, which assumed a linear history of edits (and development on trunk). Possible solutions to this problem based on git-svn have already been published on Habr, but we took a different route. We needed Git features such as branching and merging, so we decided to understand the basics of how Git works and how to integrate with it.

About the article

The material is aimed primarily at readers who can use Git at the level of an ordinary user and know its basic concepts. The article may contain nothing new for developers of version control systems that support cheap branching and reliable merging. All information is taken from open sources, including the Git source code (2d242fb3fc19fc9ba046accdd9210be8b9913f64).

Data Storage: Objects

The information is based on the last chapter of Pro Git.

In Git, the unit of data storage is an object, which is uniquely identified by a 40-character SHA-1 hash. Git stores almost everything in objects: commits, file contents, and the directory hierarchy. Initially, objects are ordinary files in the .git/objects folder; after git gc they are packed into .pack files, which are described below. To save disk space, the contents of every object are additionally compressed with zlib.
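As a minimal sketch of how an object gets its name: Git hashes a short header (object type, a space, the size in bytes, a NUL byte) followed by the content. The function names below are made up for illustration, and the header layout comes from Git's documented object format rather than from this article.

```python
import hashlib
import zlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 object ID Git assigns to `data` stored as `obj_type`."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of the loose object file: first two hex digits form the directory."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(data: bytes, obj_type: str = "blob") -> bytes:
    """Bytes of the loose object file: the header plus data, zlib-compressed."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return zlib.compress(header + data)


oid = git_object_id(b"test content\n")
print(oid, loose_object_path(oid))
```

The same ID should come out of `git hash-object` for a file with identical content, which is a convenient way to check the sketch against a real Git installation.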
You can find out the type of an object by running git cat-file -t <object>. The object types are as follows:

BLOB (file contents)

An object of type BLOB stores the contents of a file in full. That is, a BLOB holds the whole file, not a diff. For example, if you have a file of 100 000 lines and add a single line to it, a whole new object of roughly the same size is created:

 $ git init
 Initialized empty Git repository in test/.git/
 $ for((i=0;i<=100000;i++)); do echo $i; done >test.txt
 $ ls -lh
 575K test.txt
 $ git add test.txt
 $ git commit -m "First commit"
 [master (root-commit) b3061d2] First commit
  1 file changed, 100001 insertions(+)
  create mode 100644 test.txt
 $ find .git/objects -type f | xargs ls -lh
 204K .git/objects/97/578648a76227f183339438512ad99a383b48cc   # the file
 ...
 $ echo 10001 >> test.txt
 $ git commit -m "Added another line" test.txt
 [master 0361e3c] Added another line
  1 file changed, 1 insertion(+)   # Git reports that 1 line was added
 $ find .git/objects -type f | xargs ls -lh
 204K .git/objects/59/e434385635dccf949e66353f7a74a077357438   # new version of the file
 204K .git/objects/97/578648a76227f183339438512ad99a383b48cc   # old version of the file
 ...

Storing objects whole also makes it possible to merge branches reliably and to resolve conflicts, but more on that later.
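The experiment above can be approximated without a repository. The sketch below rebuilds the two versions of test.txt in memory and prints the ID and compressed size each would get as a loose object. It assumes the standard blob header and zlib's default compression level, so the sizes will not necessarily match the 204K reported by ls.

```python
import hashlib
import zlib


def blob_id(data: bytes) -> str:
    """SHA-1 of the blob header plus content, as Git computes it."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed object (header plus content)."""
    return len(zlib.compress(b"blob %d\x00" % len(data) + data))


# Version 1: the numbers 0..100000, one per line. Version 2: one extra line.
v1 = "".join(f"{i}\n" for i in range(100001)).encode()
v2 = v1 + b"10001\n"

for label, data in (("first commit", v1), ("second commit", v2)):
    print(label, blob_id(data), compressed_size(data), "bytes compressed")
```

Both versions compress to almost the same size, and neither is expressed in terms of the other: a one-line change still produces a complete second object.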

Tree (FS hierarchy)

A tree object stores a list of entries that corresponds to the file system hierarchy. Each entry has the following form:

 <mode> <object type> <object sha1> <file name>

File modes in Git can take only a very limited set of values:

040000 - directory;
100644 - regular file;
100755 - executable file;
120000 - symbolic link.

The object type is either BLOB or tree, for a file and a directory respectively. The tree object for the root directory therefore describes the entire file system hierarchy, because a tree can contain references to other trees.
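A rough sketch of how such entries become a tree object. It assumes Git's binary tree layout, which is not spelled out in this article: each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the object ID, with directories stored as mode 40000 (no leading zero). The helper names are illustrative, and submodule entries are ignored.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex object id)


def _sort_key(entry: Entry) -> bytes:
    mode, name, _ = entry
    # Git orders entries by name, comparing directories as if they ended in "/".
    return (name + "/" if mode == "40000" else name).encode()


def serialize_tree(entries: List[Entry]) -> bytes:
    """Binary body of a tree object for the given entries."""
    body = b""
    for mode, name, oid in sorted(entries, key=_sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
    return body


def tree_id(entries: List[Entry]) -> str:
    """Object ID of the tree: SHA-1 over the "tree <size>" header and the body."""
    body = serialize_tree(entries)
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


def describe(entries: List[Entry]) -> List[str]:
    """Render entries as '<mode> <type> <id><TAB><name>', like git ls-tree."""
    lines = []
    for mode, name, oid in sorted(entries, key=_sort_key):
        obj_type = "tree" if mode == "40000" else "blob"
        lines.append(f"{mode.zfill(6)} {obj_type} {oid}\t{name}")
    return lines


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = tree_id([])
root = [("100644", "README", EMPTY_BLOB), ("40000", "src", EMPTY_TREE)]
print("\n".join(describe(root)))
print(tree_id(root))
```

Because a tree refers to its subtrees by ID, changing one file changes the ID of every tree on the path up to the root, while untouched subtrees keep their IDs and are shared between commits.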

Commit

In Git, a commit is a reference to the tree object of the root directory plus a reference to the parent commit (except for the very first commit in the repository). A commit also records the author and a UNIX timestamp of its creation.

If a commit is a simple merge (git merge <branch>), it has two parents: the current HEAD and the commit that <branch> points to. Git also supports the octopus merge strategy, which can merge more than two branches at once. Such commits have more than two parents.

 $ git cat-file -p 0361e3c6d16fb3bbbcac8faa4e673667ea6fe20b
 tree ce9f2ced0ebb4346676879c7b12b92628378477f
 parent b3061d23da6f1a62dbc8f97b2a06e10e1aee2afa
 author Yuriy Nasretdinov <...> 1354450065 +0400
 committer Yuriy Nasretdinov <...> 1354450065 +0400

 Added another line
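The text printed by git cat-file -p is easy to take apart in a script. Below is a minimal sketch of a parser for it. The function names are illustrative, and multi-line headers (such as signatures) are deliberately not handled.

```python
from typing import Dict


def parse_commit(text: str) -> Dict[str, object]:
    """Split commit text into its headers, a list of parents, and the message."""
    headers, _, message = text.partition("\n\n")
    commit: Dict[str, object] = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge commit has several parents
        else:
            commit[key] = value
    return commit


def author_timestamp(commit: Dict[str, object]) -> int:
    """Extract the UNIX timestamp from 'Name <email> <timestamp> <timezone>'."""
    return int(str(commit["author"]).rsplit(" ", 2)[1])


SAMPLE = (
    "tree ce9f2ced0ebb4346676879c7b12b92628378477f\n"
    "parent b3061d23da6f1a62dbc8f97b2a06e10e1aee2afa\n"
    "author Yuriy Nasretdinov <...> 1354450065 +0400\n"
    "committer Yuriy Nasretdinov <...> 1354450065 +0400\n"
    "\n"
    "Added another line\n"
)

commit = parse_commit(SAMPLE)
print(commit["tree"], commit["parents"], author_timestamp(commit))
```

The number of items in `parents` tells the kinds of commit apart: none for the root commit, one for an ordinary commit, two or more for a merge.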

Pack files

If Git really kept every object whole (albeit compressed), the .git folder would be a huge pile of files taking up far more space than the working copy. That does not happen; instead, mysterious pack files appear, into which the objects are packed. Oddly enough, there is little information on the Internet about how Git stores data in these files, so here is an excerpt from an email by Linus Torvalds that sheds some light on them (source: gcc.gnu.org/ml/gcc/2007-12/msg00165.html ):


"... It's worth explaining (you are probably aware of it, but
let me go through the basics anyway) how git delta-chains work, and how
they are so different from most other systems.

In other SCM's, a delta-chain is generally fixed. It might be "forwards"
or "backwards", and it might evolve a bit as you work with the repository,
but generally it's a chain of changes to a single file represented as some
kind of single SCM entity. In CVS, it's obviously the *,v file, and a lot
of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more "loosely". There
is no fixed entity. Delta's are generated against any random other version
that git deems to be a good delta candidate (with various fairly
successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It's good for various conceptual
reasons (ie git internally never really even needs to care about the whole
revision chain - it doesn't really think in terms of deltas at all), but
it's also great because getting rid of the inflexible delta rules means
that git doesn't have any problems at all with merging two files together,
for example - there simply are no arbitrary *,v "revision files" that have
some hidden meaning.

It also means that the choice of deltas is a much more open-ended
question. If you limit the delta chain to just one file, you really
don't have a lot of choices on what to do about deltas, but in git, it
really can be a totally different issue."

In short, objects in pack files are grouped by similarity (for example, by type and size) and then stored as "chains". The first element of a chain is the newest version of an object, and each following element is a diff against the previous one. The most recent versions are assumed to be requested most often, so they sit at the head of the chain.

So Git does store diffs after all, but only at the level of physical data storage. From the point of view of any higher-level API, Git operates on whole objects, which makes it possible to implement different merge strategies and to resolve conflicts easily.
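To make the idea of a chain concrete, here is a toy model of one: the newest version is kept whole, and each older version is a list of "copy" and "insert" instructions against the version before it in the chain. This is an illustration only. Git's real pack format encodes deltas over bytes in a binary form, and difflib is used here merely as a convenient way to find matching regions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def make_delta(base: List[str], target: List[str]) -> List[Tuple]:
    """Describe `target` as copies of line ranges from `base` plus literal inserts."""
    ops: List[Tuple] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base lines are simply not copied
    return ops


def apply_delta(base: List[str], ops: List[Tuple]) -> List[str]:
    """Rebuild the target version from `base` and its delta."""
    out: List[str] = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Three versions of one file, oldest to newest.
old = [f"line {i}\n" for i in range(1000)]
mid = old + ["appended\n"]
new = mid[:500] + ["inserted\n"] + mid[500:]

# The chain: newest version in full, then deltas leading back in time.
HEAD_VERSION = new
DELTAS = [make_delta(new, mid), make_delta(mid, old)]


def read_version(depth: int) -> List[str]:
    """depth 0 is the newest version; each step applies one more delta."""
    current = HEAD_VERSION
    for delta in DELTAS[:depth]:
        current = apply_delta(current, delta)
    return current


print([len(read_version(d)) for d in range(3)])
```

Reading the newest version costs nothing extra, while older versions require walking the chain, which matches the assumption that recent versions are requested most often.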

History storage

Git has no separate storage for history. The entire history can be reconstructed only by following parent links from the commit you are interested in. If you only need the history of one file (or subdirectory), Git still has to walk the same commits; it merely filters the results. Keep this in mind when integrating with Git, and avoid making Git walk the full history separately for each file.
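A small model of what this means in practice. The commit graph below is invented for the example: each commit lists its parents and the paths it changed. History for a path is the full walk with a filter on top, so collecting several paths in one walk is cheaper than repeating the walk per path.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Toy graph: commit id -> (parent ids, paths changed by the commit).
COMMITS: Dict[str, Tuple[List[str], Set[str]]] = {
    "c1": ([], {"README"}),
    "c2": (["c1"], {"src/main.c"}),
    "c3": (["c2"], {"README"}),
    "c4": (["c3"], {"src/util.c"}),
}


def walk(start: str, commits) -> List[str]:
    """Visit every commit reachable from `start` by following parent links."""
    seen, queue, order = set(), deque([start]), []
    while queue:
        cid = queue.popleft()
        if cid in seen:
            continue
        seen.add(cid)
        order.append(cid)
        queue.extend(commits[cid][0])
    return order


def touches(paths: Iterable[str], prefix: str) -> bool:
    """True if any changed path is `prefix` itself or lies below it."""
    return any(p == prefix or p.startswith(prefix + "/") for p in paths)


def history_for_path(start: str, commits, prefix: str) -> List[str]:
    """The whole graph is still walked; the path only filters the output."""
    return [cid for cid in walk(start, commits) if touches(commits[cid][1], prefix)]


def histories_for_paths(start: str, commits, prefixes: List[str]) -> Dict[str, List[str]]:
    """One walk shared by all requested paths."""
    result: Dict[str, List[str]] = {prefix: [] for prefix in prefixes}
    for cid in walk(start, commits):
        for prefix in prefixes:
            if touches(commits[cid][1], prefix):
                result[prefix].append(cid)
    return result


print(histories_for_paths("c4", COMMITS, ["README", "src"]))
```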

In addition, as you may have noticed, Git does not store any information about file renames. When it needs to determine whether a file was renamed, Git compares the contents of the objects it has and, within some (adjustable) tolerance, decides that the file was renamed.
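Content-based rename detection can be sketched as pairing each deleted path with the most similar added path and accepting the pair if the similarity reaches a threshold. The similarity measure below (difflib's ratio over lines) is a stand-in chosen for brevity, not Git's own metric. The 0.5 default mirrors the 50% similarity that Git's -M option uses by default.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Line-based similarity between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines(), autojunk=False).ratio()


def detect_renames(deleted: Dict[str, str], added: Dict[str, str],
                   threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Pair deleted paths with added paths whose content is similar enough."""
    renames = []
    unmatched = dict(added)
    for old_path, old_text in deleted.items():
        best = max(unmatched, key=lambda p: similarity(old_text, unmatched[p]), default=None)
        if best is not None and similarity(old_text, unmatched[best]) >= threshold:
            renames.append((old_path, best))
            del unmatched[best]  # each added file can be a rename target only once
    return renames


deleted = {"a.txt": "1\n2\n3\n4\n"}
added = {"b.txt": "1\n2\n3\n4\n5\n", "c.txt": "x\ny\n"}
print(detect_renames(deleted, added))
```

Raising the threshold makes the detector stricter: with `threshold=0.95` the same pair is reported as an unrelated deletion and addition.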

Merge: three-way merge (resolve strategy)

When you merge two branches, Git (in the version discussed here) uses the recursive strategy by default, but more on that later. Before that strategy appeared, Git used the resolve strategy, which is a three-way merge. Such a merge needs three versions: the common ancestor, the version from one branch, and the version from the other branch. For individual files, a three-way merge can be performed with the diff3 utility from the standard diffutils package. This modest and rarely mentioned utility does, in one form or another, all the dirty work of merging in most existing version control systems, including RCS, CVS, SVN and, of course, Git.

Besides using an analogue of diff3 (the implementation used in Git is LibXDiff), Git also detects file renames on the fly and uses this information when merging tree objects. Merging directory hierarchies is not fundamentally harder than merging files, but it produces many different kinds of conflicts.
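The core rule of a three-way merge is easiest to see at the level of trees, where each path maps to an object ID. The sketch below applies that rule to three such maps. It is a simplification for illustration: it does no rename detection and does not attempt a content-level merge when both sides changed the same path.

```python
from typing import Dict, List, Tuple


def three_way_merge(base: Dict[str, str], ours: Dict[str, str],
                    theirs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Merge path -> object-id maps; return (merged tree, conflicting paths)."""
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o          # both sides agree (same change, or both deleted)
        elif o == b:
            result = t          # only their side changed the path
        elif t == b:
            result = o          # only our side changed the path
        else:
            conflicts.append(path)  # both sides changed it differently
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "1", "c": "1"}
ours = {"a": "2", "b": "1", "c": "2"}
theirs = {"a": "1", "b": "3", "c": "3", "d": "4"}
print(three_way_merge(base, ours, theirs))
```

Without the common ancestor the merge could not tell "we changed a" from "they reverted a", which is why all three versions are needed.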

A small illustration of how Git performs a three-way merge in a simple case (taken from man git-merge ):

Suppose we have the following history and the current branch is master:

       A---B---C topic
      /
 D---E---F---G master

Then git merge topic replays the changes made on topic since the history diverged (commit E) and creates a new commit H, which has two parents and a commit message supplied by the user.

       A---B---C topic
      /         \
 D---E---F---G---H master

Development on both topic and master can continue, however, and then the next merge no longer looks so simple: more than one commit may fit the definition of a "common ancestor":

       A---B---C---K---L---M topic
      /         \
 D---E---F---G---H---N---O---P master

Suppose the merge took the oldest common ancestor (commit E) as its base. Any conflicts that were already resolved in commit H would then have to be resolved again. For the graph above, git merge-base master topic reports C rather than E, because after the merge in H commit C is reachable from both branches, so a three-way merge that starts from C sees only the changes made after it.

To perform the merge, Git takes the merge base as the common ancestor and commits M and P as the two new versions. If a change made in commit C caused a conflict, it can be rolled back with git revert (as is done, for example, in commit K); the final state M then no longer contains the conflicting change, and merging it produces no conflict.
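Which commits qualify as the "best" common ancestors can be checked mechanically. The sketch below models the graph above as a dictionary of parent links (with H recording the merge of C into master) and keeps only those common ancestors that are not themselves ancestors of another common ancestor, which is how git merge-base defines a merge base. The second graph is an invented criss-cross included to show a case with two merge bases.

```python
from typing import Dict, List, Set

# The history drawn above: H has two parents, G and C.
PARENTS: Dict[str, List[str]] = {
    "D": [], "E": ["D"], "F": ["E"], "G": ["F"], "H": ["G", "C"],
    "N": ["H"], "O": ["N"], "P": ["O"],
    "A": ["E"], "B": ["A"], "C": ["B"], "K": ["C"], "L": ["K"], "M": ["L"],
}

# An invented criss-cross: each branch merged the other's earlier commit.
CRISS_CROSS: Dict[str, List[str]] = {
    "X": [], "A1": ["X"], "B1": ["X"], "A2": ["A1", "B1"], "B2": ["B1", "A1"],
}


def ancestors(commit: str, parents: Dict[str, List[str]]) -> Set[str]:
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge_bases(a: str, b: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Common ancestors that are not ancestors of any other common ancestor."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    }


print(merge_bases("M", "P", PARENTS))
print(merge_bases("A2", "B2", CRISS_CROSS))
```

In the criss-cross case neither base is better than the other, and that ambiguity is what the recursive strategy is designed to handle.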

Merge made by the 'recursive' strategy

Imagine this history:

       A---B---C---K---L---M topic
      /         \ /
 D---E---F---G---H---N---O---P master

Now we need to run git merge topic while on the master branch. We could pick commit E as the common ancestor, but Git with the recursive strategy does something different. There is a good article that describes this strategy in some detail: codicesoftware.blogspot.com/2011/09/merge-recursive-strategy.html . The algorithm it describes boils down to the following:

compile a list of all common ancestors, starting with the most recent;
take the first ancestor in the list as the current commit;
merge the current commit with the next ancestor to obtain a virtual commit, which becomes the current commit;
repeat the previous step until the list of common ancestors is exhausted.

The result of this operation is a virtual commit that represents the merged state of all common ancestors in the correct order. Conflict resolutions made along the way also end up in this commit, and more recent commits take priority. Once this single common ancestor exists, the three-way merge described above is performed against it.

Excerpt from merge-recursive.c :

 int merge_recursive(...)
 {
     <...>
     if (!ca) {
         ca = get_merge_bases(h1, h2, 1);
         ca = reverse_commit_list(ca);
     }
     <...>
     merged_common_ancestors = pop_commit(&ca);
     <...>
     for (iter = ca; iter; iter = iter->next) {
         <...>
         merge_recursive(o, merged_common_ancestors, iter->item,
                         NULL, &merged_common_ancestors);
         <...>
     }
     <...>
     clean = merge_trees(o, h1->tree, h2->tree,
                         merged_common_ancestors->tree, &mrtree);
     <...>
     return clean;
 }
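The control flow of the excerpt can be mirrored in a toy model. In the sketch below a commit is just a name, a tree (path to value) and a list of parents; the class and function names are invented, conflicts are represented by a placeholder string instead of conflict markers, and an empty tree is assumed as the base when two commits share no ancestor.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # identity-based equality keeps commits hashable
class Commit:
    name: str
    tree: Dict[str, str]
    parents: List["Commit"] = field(default_factory=list)


def ancestors(commit: Commit) -> set:
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(c.parents)
    return seen


def merge_bases(a: Commit, b: Commit) -> List[Commit]:
    """Best common ancestors, sorted by name to keep the order deterministic."""
    common = ancestors(a) & ancestors(b)
    best = [c for c in common
            if not any(c is not o and c in ancestors(o) for o in common)]
    return sorted(best, key=lambda c: c.name)


def merge_trees(base: Dict[str, str], ours: Dict[str, str],
                theirs: Dict[str, str]) -> Dict[str, str]:
    merged = {}
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t or t == b:
            result = o
        elif o == b:
            result = t
        else:
            result = f"<conflict {o}|{t}>"  # toy stand-in for conflict markers
        if result is not None:
            merged[path] = result
    return merged


def merge_recursive(h1: Commit, h2: Commit) -> Commit:
    bases = merge_bases(h1, h2)
    if bases:
        merged_base = bases[0]
        for other in bases[1:]:
            # Fold the remaining bases into one virtual common ancestor.
            merged_base = merge_recursive(merged_base, other)
        base_tree = merged_base.tree
    else:
        base_tree = {}
    tree = merge_trees(base_tree, h1.tree, h2.tree)
    return Commit(f"virtual({h1.name},{h2.name})", tree, [h1, h2])


# A criss-cross history with two merge bases (a1 and b1).
root = Commit("root", {"f": "v0", "g": "w0"})
a1 = Commit("a1", {"f": "v1", "g": "w0"}, [root])
b1 = Commit("b1", {"f": "v0", "g": "w1"}, [root])
a2 = Commit("a2", {"f": "v1", "g": "w1"}, [a1, b1])
b2 = Commit("b2", {"f": "v1", "g": "w1"}, [b1, a1])
a3 = Commit("a3", {"f": "v2", "g": "w1"}, [a2])
b3 = Commit("b3", {"f": "v1", "g": "w2"}, [b2])

print(merge_recursive(a3, b3).tree)
```

Here the two bases are first merged into a virtual commit whose tree already contains both earlier changes, so the final three-way merge only has to combine what happened after the criss-cross.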

Low Level Git Commands

If you have worked with Git for a while, you probably know the checkout, branch, pull, push, rebase, commit and some other commands. But Git was originally created not as a full-fledged version control system, but as a toolkit for building one. As a result, Git has a very rich set of built-in low-level commands. Here are some that we find very useful:

git rev-parse <revision>

This command is very simple: it returns the commit hash for the specified revision. For example, git rev-parse HEAD will return the hash of the commit pointed to by HEAD.

git rev-list <commit> ...

The command prints a list of commit hashes for the given query and can be used as a faster alternative to git log. For example, git rev-list branch ^origin/branch ^origin/master prints all commits on branch that have not yet been pushed (provided that origin/branch and origin/master are up to date, for example because git fetch was run beforehand).

Pitfalls: for queries of the form branch ^other_branch, Git may produce incorrect results if commits carry wrong timestamps. For example, the output may omit commits that "happened in the future" relative to the branches being merged.
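When these commands are called from a script, it helps to build the argument list separately from running it. The wrapper below is a sketch under that assumption: the helper names are invented, it needs a git executable on PATH and a repository in `cwd` to actually run, and it assumes the remote is called origin.

```python
import subprocess
from typing import Iterable, List


def rev_list_args(include: Iterable[str], exclude: Iterable[str] = ()) -> List[str]:
    """Arguments for: git rev-list <include>... ^<exclude>..."""
    return ["git", "rev-list", *include, *(f"^{ref}" for ref in exclude)]


def run_git(args: List[str], cwd: str = ".") -> List[str]:
    """Run a git command and return its whitespace-separated output tokens."""
    done = subprocess.run(args, cwd=cwd, check=True, capture_output=True, text=True)
    return done.stdout.split()


def rev_parse(revision: str, cwd: str = ".") -> str:
    """Resolve a revision such as HEAD to a full commit hash."""
    return run_git(["git", "rev-parse", "--verify", revision], cwd)[0]


def unpushed_commits(branch: str, cwd: str = ".") -> List[str]:
    """Commits on `branch` that are in neither origin/<branch> nor origin/master."""
    return run_git(rev_list_args([branch], [f"origin/{branch}", "origin/master"]), cwd)


print(" ".join(rev_list_args(["branch"], ["origin/branch", "origin/master"])))
```

Passing the arguments as a list avoids shell quoting problems with unusual branch names.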

git diff-index

Shows the difference between a tree (for example HEAD) and the working copy or the index (.git/index). In the index, Git keeps a cache of lstat() results for every file it knows about.

Pitfalls: if you move files from one server to another (or copy the folder), git diff-index reports many changes even though there are none. The reason is that .git/index stores almost all lstat fields, including the inode, and diff-index does not examine file contents. You therefore need to run git update-index --refresh first, or use the ordinary git diff, which does this automatically. More about .git/index: www.kernel.org/pub/software/scm/git/docs/v1.6.5/technical/racy-git.txt
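The pitfall is easy to reproduce outside Git. The sketch below caches a few lstat fields for a file, copies the file, and then compares the copy against the cached values. The choice of fields (inode, size, modification time) is an assumption made for the example; the real index records more.

```python
import os
import shutil
import tempfile
from typing import Tuple


def stat_signature(path: str) -> Tuple[int, int, int]:
    """A few lstat fields of the kind a stat cache would remember."""
    st = os.lstat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def looks_modified(path: str, cached: Tuple[int, int, int]) -> bool:
    """Cheap check: compares stat data only and never reads the file."""
    return stat_signature(path) != cached


def really_modified(path: str, original: bytes) -> bool:
    """Expensive check: compares the actual contents."""
    with open(path, "rb") as fh:
        return fh.read() != original


def demo() -> Tuple[bool, bool]:
    content = b"same content\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "file.txt")
        with open(src, "wb") as fh:
            fh.write(content)
        cached = stat_signature(src)
        copy = os.path.join(tmp, "copy.txt")
        shutil.copy(src, copy)  # same bytes, but a different inode
        return looks_modified(copy, cached), really_modified(copy, content)


print(demo())
```

The copy looks modified to the stat-based check although its contents are identical, which is the situation a refresh of the cached stat data resolves.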

git cat-file <object>

This command has already appeared in the article, but it is worth mentioning again. It prints the contents of a commit or of any other Git object.

git ls-tree <object>

Prints the contents of a tree object in human-readable form.

git ls-remote <repository>

Displays information about branches and tags (along with commit hashes) from the specified remote repository.
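The output of git ls-remote is one line per reference: the object hash, a tab, and the reference name. A sketch of parsing it follows. The sample text is made up for the example (the first hash is borrowed from the commit shown earlier), and the helper names are illustrative.

```python
from typing import Dict


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Map each reference name to its object hash."""
    refs: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        oid, ref = line.split("\t", 1)
        refs[ref] = oid
    return refs


def branches(refs: Dict[str, str]) -> Dict[str, str]:
    """Keep only branches, with the refs/heads/ prefix removed."""
    prefix = "refs/heads/"
    return {ref[len(prefix):]: oid for ref, oid in refs.items() if ref.startswith(prefix)}


SAMPLE_OUTPUT = (
    "0361e3c6d16fb3bbbcac8faa4e673667ea6fe20b\tHEAD\n"
    "0361e3c6d16fb3bbbcac8faa4e673667ea6fe20b\trefs/heads/master\n"
    "b3061d23da6f1a62dbc8f97b2a06e10e1aee2afa\trefs/tags/v1\n"
)

print(branches(parse_ls_remote(SAMPLE_OUTPUT)))
```

Comparing these hashes with local ones is a cheap way to find out whether a fetch is needed at all.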

GIT_SSH

If you have written scripts that run git pull, you have most likely run into SSH asking you to confirm the "authenticity" of the remote host, and doing so interactively. The solution is not very elegant, because GIT_SSH must be the path to an executable file (it cannot contain SSH options):

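 # write a two-line wrapper script; "$@" passes Git's arguments through to ssh
 # (note that these options switch off host key verification)
 printf '%s\n' '#!/bin/sh' 'exec ssh -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "$@"' >/tmp/gitssh
 chmod +x /tmp/gitssh
 # then run git pull through the wrapper:
 GIT_SSH=/tmp/gitssh git pull …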
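The same setup can be done from a deployment script. This is a sketch with invented helper names: it writes the wrapper to a temporary file rather than a fixed path in /tmp, marks it executable, and builds an environment with GIT_SSH pointing at it.

```python
import os
import stat
import tempfile
from typing import Dict, Optional

WRAPPER = """#!/bin/sh
exec ssh -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "$@"
"""


def write_ssh_wrapper(directory: Optional[str] = None) -> str:
    """Create the wrapper script, make it executable, and return its path."""
    fd, path = tempfile.mkstemp(prefix="gitssh-", dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(WRAPPER)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def git_env(wrapper_path: str) -> Dict[str, str]:
    """A copy of the current environment with GIT_SSH set to the wrapper."""
    env = dict(os.environ)
    env["GIT_SSH"] = wrapper_path
    return env


wrapper = write_ssh_wrapper()
print(wrapper, git_env(wrapper)["GIT_SSH"])
```

A pull would then be started with something like `subprocess.run(["git", "pull"], env=git_env(wrapper), check=True)`, and the wrapper file can be deleted once the command has finished.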

Conclusion

As you can see, Git lets you work with branches conveniently and reliably, and it correctly handles situations where two branches have more than one common ancestor. Unless your goal is to write your own version control system, we recommend relying on the results Git produces rather than trying to reproduce its merge algorithm.

We hope this material was interesting and helped you understand why Git works the way it does. We believe the article will also be useful to developers of various Git interfaces, giving them a deeper understanding of what happens "under the hood".

Yuri Nasretdinov, Badoo developer

Source: https://habr.com/ru/post/163853/
