DVCS and DAGs

Translation of Eric Sink's article "DVCS and DAGs" (Part 1 and Part 2).

Translator's note: In this article I use the original English abbreviations DVCS and DAG for Distributed Version Control System and Directed Acyclic Graph.

Part 1

There are two categories of people:

Those who try to divide everything into two categories.
Those who do not.

I am one of the former. :-)
There are two categories of version control systems:

Those in which history is a Line.
Those in which history is a Directed Acyclic Graph (DAG).

Traditional tools (such as Subversion and Vault) model history as a Line. In DVCS tools (such as Git and Mercurial), history is a DAG. The difference between these two models is quite interesting.

The linear model is tried and tested. History is a sequence of versions, one after another.
[Figure 1 (1761_image001)]

To create a new version you need:

Get the latest version.
Make changes.
Commit the changes made.

People love the linear model for its simplicity. It gives an unambiguous answer to the question of which version is the latest.

But the linear model has one big problem: you can create a new version only if it is based on the latest version. And often the following happens:

I get the latest version from the repository. At the moment I fetch it, that is version 3.
I make some changes to it.
While I am doing this, someone else commits version 4.
When I try to commit my changes, I cannot, because they are not based on the current version. The “base” for my changes was version 3, which was current when I started work.

The linear model of history will not allow me to create version 5 as shown in the picture. Instead, I have to take the changes made between version 3 and version 4 and merge them into my version.
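
The rule that makes the linear model reject my commit can be sketched in a few lines of Python. This is only a toy model, not how Subversion or Vault is implemented; the class `LinearHistory`, its `commit` method and the `StaleBaseError` exception are names invented for this sketch.

```python
class StaleBaseError(Exception):
    """Raised when a commit is not based on the latest version (hypothetical name)."""


class LinearHistory:
    """Toy model of a linear history: versions are numbered 1, 2, 3, ..."""

    def __init__(self, latest=0):
        self.latest = latest  # number of the most recent version

    def commit(self, base):
        """Create a new version on top of `base` and return its number."""
        if base != self.latest:
            # The linear model accepts only changes based on the latest version.
            raise StaleBaseError(
                f"base is version {base}, but the latest is version {self.latest}; "
                "merge first"
            )
        self.latest += 1
        return self.latest


history = LinearHistory(latest=3)
my_base = history.latest          # I fetch version 3 and start working
history.commit(base=3)            # meanwhile, someone else commits version 4
try:
    history.commit(base=my_base)  # my changes are still based on version 3
    rejected = False
except StaleBaseError:
    rejected = True               # I have to merge with version 4 first
```

Only after I bring my work up to date with version 4 does `history.commit(base=4)` succeed and produce version 5.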

The obvious question: what happens if we allow version 5 to be committed with version 3 as its base? Our history ceases to be a Line and turns into a DAG.

And why would we want to do this?

The main advantage of the DAG model is that it does not interrupt the developer at the moment he is trying to commit his work. In this respect, the DAG is perhaps a more accurate reflection of what really happens in a team practicing parallel development (translator's note: this is a not quite accurate rendering of the term “concurrent development”, which stresses that developers' parallel changes often touch the same code and often conflict). Version 5 was in fact based on version 3, so why not record that fact?

True, it turns out that there are reasons not to do this. In this graph, we do not know which version is the "latest". And this leads to many problems:

Suppose we need the changes from both version 4 and version 5 for a release. At the moment we cannot get them: there is no version in the system that includes both sets of changes.
Our build system is set up to build the latest version automatically. What should it do in this situation?
Even if we build both versions, 4 and 5, which one should be passed to QA for testing?
If a developer wants to update his tree to the latest version, which one should he pick?
When a developer wants to make some changes, which version should he base them on?
Our project manager wants to know which tasks have been completed and how much work remains. His understanding of "done" is very closely tied to the concept of "latest". If he cannot tell which version is the latest, his brain melts in the attempt to update the Gantt chart.

Yes, it is a sad picture. The world as we know it is literally crumbling before our eyes.

To avoid dogs and cats living together in a state of continuous mass hysteria, tools that use the DAG model of history do everything they can to help us resolve the confusion. The answer is the same as for the linear model: we need a merge. But instead of requiring the developer to merge before committing, we allow the merge to be done later.

[Figure 3 (1761_image003)]

Someone needs to create a version that includes all the changes made in versions 4 and 5. When this version is committed, it has arrows pointing to both of its “parents”.
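
The effect of such a merge on the graph can be sketched in Python. This is a toy model under my own assumptions: history is a dictionary that maps each version number to the list of its parents, and the helper `end_nodes` is a name invented for this sketch.

```python
def end_nodes(parents):
    """Return the versions that are not the parent of any other version."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(v for v in parents if v not in has_child)


# Versions 1..3 form a line; versions 4 and 5 are both based on version 3.
history = {1: [], 2: [1], 3: [2], 4: [3], 5: [3]}
before_merge = end_nodes(history)   # two candidates for "latest"

# Version 6 merges versions 4 and 5, so it has two parents.
history[6] = [4, 5]
after_merge = end_nodes(history)    # a single end node again
```

Before the merge there are two end nodes, 4 and 5. Once version 6 records both of them as parents, it is the only end node left.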

Order is restored. We once again know which version is the "latest". If someone remembers to reboot our manager, he will most likely realize right away that this graph looks almost like a line. Apart from the fact that something incomprehensible and confusing happened between versions 3 and 6, it is a Line. But the best thing for our manager is not to worry too much about it.

What this manager does not know is that this particular crisis was a trifle. He thinks that his old paradigm has been completely destroyed, but one fine day he will come to the office and discover this:
[Figure 4 (1761_image004)]

& @ #!

And now what?

If you live within a linear model, this graph is an absolute nightmare. It has FOUR end nodes. Everything that depends on knowing which version is the latest is doomed to fail, including the manager described above, who is now probably curled up in his office, hoping that his mother remembered to pack his favorite cookies with his lunch.

The linear model looks very attractive at times like this. And that is a good explanation for why 99.44% of developers use SCM tools based on a linear model of history (yes, I made that statistic up).

But still, despite all this obvious chaos, we have to remind ourselves of the main advantage of the DAG model: it describes more accurately how programmers really work. It does not force developers to bend to its wishes, as the linear model does. When a developer wants to commit something, he does. And the DAG simply records what really happened.

Many teams will always prefer the linear model. And there is nothing wrong with that. Life is easier with this approach.

But for some other teams, the DAG model can be very useful. And some teams get it as part of the package, simply because they need a DVCS for some other reason. DVCS tools use a DAG model because they have no choice. If we cannot assume a permanent connection to a central server, we have no way to force developers to fit all their work into a linear model.

So we need to find ways to manage the DAG. How do we do that?

One option is to redefine every operation. If you tell the doctor "it hurts when I have to determine the latest version", the doctor will tell you to "stop trying to do that". Instead, always specify exactly which node to use:

The build system does not build the latest node. Instead, it builds exactly the node we tell it to. Or maybe it builds every node.
QA tests whichever builds someone decides need testing.
Developers do not update their tree to the "latest". Instead, they look at the graph, select a node, and update to it.

I am not saying that this approach is practical. I just want to note that it is theoretically valid. As long as you can specify the node you want to use, each of these operations can be performed.

But how do we specify a node? One thing that makes this approach problematic is that nodes often have complicated names. For example, in Git, a node name looks something like e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. Developers do not find this naming scheme very intuitive.
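
Where do such names come from? In Git, an object's name is a SHA-1 hash computed over a short header followed by the object's content. The Python sketch below does this for a file object (a "blob"); commit names are produced the same way from a `commit` header and the commit's text. The function name `git_blob_id` is mine, chosen for illustration, and the sketch assumes Git's default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Name Git (in its default SHA-1 mode) gives to a file with this content."""
    # Header format: object type, a space, the content length, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id)
```

For empty content this yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the very name quoted above: it is the name Git gives to an empty file.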

All DVCS tools use a DAG. And they all do various things either to prevent the “multiple end-node crisis” or to help the team manage it. But they each do it a little differently.

Fortunately, this gives me a convenient opportunity to divide them all into two groups:

Those that solve this problem in a way that I like.
Those that solve this problem in a way that I do not like.

Part 2

The first part of this article drew two kinds of noteworthy feedback (translator's note: Eric Sink, the author of the original article, published it in two parts a week apart, which explains the separate reader reaction to the first part):

Several people accused me of spreading fear, uncertainty and doubt (translator's note: the original uses the common slang abbreviation FUD, for Fear, Uncertainty and Doubt, which has no direct equivalent in Russian) with respect to the linear model, because I only mentioned the problems of the DAG model in passing and stopped just short of claiming that the DAG model could cure cancer, stop global warming and bring peace and tranquility to the Middle East.
Some people asked how I drew such cool diagrams.

Before starting the second part, let me briefly respond to both kinds of feedback.

My response in defense of DVCS

Yes, my company (SourceGear, translator's note) develops a version control system (Vault) based on a linear model of history. So any DVCS is, to some extent, a direct competitor to my product.

I am well aware that I am breaking a number of rules:

Business people like me are not supposed to say anything good about their competitors.
Our job is to fear change and spread that fear to others.
We are supposed to pretend we do not know that every choice has its trade-offs, and to insist that our option is better in every situation.

My mother will easily confirm that I do not always follow the rules well :-)

The simple fact is that I find this topic interesting. I have worked in the version control industry for more than ten years. I am writing a book on this topic. This is what I do. This is what interests me.

Really.

But there is more to it than just me in the role of a capitalist rebel.

Git fans, you should cool down a bit.

Seriously, the ardent defenders of Git make this world uninhabitable. Git is a really great thing, but it's just not the right choice in all situations.

In their defense, one must admit that in this matter the apple did not fall far from the tree. When people become interested in DVCS, one of the first things they stumble upon is a video of Linus Torvalds talking about Git, recorded in 2007. And there they see a man who, it seems, does not understand this either.

Guys, Subversion is probably the most popular version control tool in the world at the moment. Almost everyone who uses a version control system today uses one based on a linear model of history. And they use these tools successfully and productively. Anyone who refuses to acknowledge that this model has any merit looks foolish.

The Torvalds video has done a lot of damage. That attitude is a big turn-off for people who are curious about what is new in the world of version control.

So, my dear Git fans, if you are trying to warn people off DVCS and want to make sure they never change their current ways, then carry on exactly as you are. You are doing great.

But if you really want to help the world see the benefits of Git and similar tools, then acknowledge that people were doing their work productively before those tools appeared.

My response about those cool diagrams

These pictures were drawn by SourceGear's illustrator, John Woolley, who also created all the illustrations for the Evil Mastermind comic book. All the design work and illustrations for my upcoming book on source code management are also being done by John.

Nevertheless, since John's pictures received much more praise than my “thousand words”, I decided to take offense and refuse to include any of his illustrations in the second part of the article. :-)

OK, let's talk more about graphs

As I said in the first part, if a DAG is allowed to grow without any control, everything can turn into a real mess. DAGs are easier to create. Lines are easier to use. As soon as we start using the DAG model to the fullest to get all its advantages, the very next thing that happens is that we want our lines back.

That is why every DVCS has features for keeping the growth of the DAG manageable. These features are designed to keep developers from committing changes while dodging all responsibility for the resulting mess, which grows every time we create a new branch point.

In other words, each DVCS contains capabilities that allow developers to take a fragment of a graph and treat it as a line.
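
One way to picture "treating a fragment of the graph as a line" is to start at a chosen end node and follow a single parent at each step. The Python sketch below does this on a toy graph. The dictionary layout and the helper name `line_from` are assumptions made for illustration, and always following the first listed parent is just one possible convention, not a description of how any particular tool behaves.

```python
def line_from(parents, node):
    """Walk back from `node`, following the first parent at every step."""
    line = [node]
    while parents[node]:
        node = parents[node][0]
        line.append(node)
    return list(reversed(line))


# Version 6 merges versions 4 and 5, both of which are based on version 3.
history = {1: [], 2: [1], 3: [2], 4: [3], 5: [3], 6: [4, 5]}
mainline = line_from(history, 6)
```

Here `mainline` is the sequence 1, 2, 3, 4, 6: a line, with version 5 hanging off to the side.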

Git

Git manages the growth of the DAG by supporting named branches. You are not supposed to commit changes whose “parent” is not an end node (a “leaf”).

Thus, if I use the Git checkout command to update my working copy to a node that is not an end node, Git politely warns me:

  eric $ git checkout 9542b
 Note: moving to "9542b" which
 If you want to create a new branch, you may do so
 (now or later) by using -b with the checkout command again.  Example:
   git checkout -b
 HEAD is now at 9542b5f ... initial

If you always commit changes based only on an end node, your history stays very similar to a line.

Mercurial

Historically, Mercurial has been described as supporting only one branch per repository. Comparisons with Git often focused on an obvious flaw: the lack of support for branching within a single repository (translator's note: here the translation departs from the original text, since the author acknowledged a typo at this point in the comments to the article).

I use the past tense because I have heard that Mercurial has since added some features in this area.

I mention Mercurial here so as not to leave its fans out. I cannot speak from much experience with this system.

But I still think it is fair to cite Mercurial in support of my point: in its early versions (at least), Mercurial controlled the growth of the graph by preventing it from branching. And that contributed to the widespread perception of Mercurial as a very easy-to-use tool.

Bazaar

This is the DVCS I have used the most, but I still cannot consider myself an expert in it. In my experience, I would characterize Bazaar as a system that works hard to control DAG growth.

Every time I try to push changes from my local repository to a central server, Bazaar requires me to merge with the other changes from the end node, exactly as any system with a linear model would.

But, and this is probably a good thing, Bazaar lets me work without a central server, like a real DVCS. Even in this mode, the same key restriction applies: I cannot commit any changes unless they are based on the end node in the repository.

When I use Bazaar, I get the feeling that I am using a tool with a linear model.

My preferences

Of these, the Git approach to solving this problem is the one closest to my heart.

Bazaar seems to believe that DAG branching is permissible only if it occurs in separate instances of the repository, and must be eliminated before changes from one repository are transferred to another.

I like Git's ability to switch my local copy of the repository with the command “git checkout branch_name”. I understand that to people who are not used to thinking in terms of a DAG all the time, this feature seems confusing. But I like it.

Please note that I still like linear model tools, such as Subversion and Vault. I just want to say that a tool with a DAG model should act accordingly.

Fossil

The DVCS that has intrigued me most recently is Fossil. It was written by D. Richard Hipp, the creator of SQLite.

Fossil has a number of interesting features. The most significant is the built-in support for bug tracking. This is an area in which all other DVCS are inferior to it. They give you distributed version control, but when the time comes for the developer to update the information in the bug tracker, the world becomes centralized again.

Anyway, I’m just starting to look closely at Fossil, but I like the way the branching problem is described on its website:

Having more than one end node in a tree is usually considered undesirable, so branching is usually either avoided entirely, as in Fig. 1, or quickly eliminated, as in Fig. 3 (translator's note: this refers to the pictures in this article, not those on the Fossil site). But sometimes one needs to have several end nodes. For example, a project can have one end node that represents the latest version of the product under development, and a second that holds the latest “stable” (tested) version. When multiple end nodes are appropriate, we call it branching, not forking. (Translator's note: it is important to feel the difference between the English terms "branch" and "fork" here. The author wants to stress the positive sense of "branch" and the somewhat negative sense of "fork".)

Perfect. My impression is that Fossil works like Git in this respect. When the DAG forks, complexity increases.

Source: https://habr.com/ru/post/57335/
