"In one basket": A little about the storage code

Efficient data storage matters to everyone who works in IT. At the IaaS provider 1cloud we regularly study the experience of our colleagues; quite recently, for example, we discussed how large companies store their data.

Today we will continue this topic and discuss how best to store your code: in one repository or in several. We will also take a look at two examples that demonstrate the features of both approaches.

/ photo by Dennis Skley CC

Should you keep your source code in a single, monolithic repository, or split it into parts and store them in several repositories? As a rule, it depends on the team and the project it is working on. First, let us consider the advantages and disadvantages of both approaches.

Monolithic repository

Usually, the first idea is to keep all the code in one repository, at least at the early stage: most projects start this way. A repository is called monolithic if it contains two or more separate projects that are loosely related or not related at all, and the repository itself holds a very large number of files, commits, and other objects.

The main advantage of storing code in a single repository is that collaboration on the code is much easier to organize. We can create one common project consisting of several subprojects, and then link these subprojects as we wish.

If a developer needs to change the code or the way the parts of a project communicate, it is easier to do so with access to the code of the entire project. Suppose we are building an online trading system based on a microservice architecture. When we write code for the cart service and need to view or change the shared library, we can go straight to it: there is no need to open another project or repository. Since we can edit dependencies directly, we can quickly make global changes without worrying about versioning.
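
A minimal sketch of that situation, assuming a hypothetical layout with `libs/shared` and `services/cart` directories inside one Git repository (the directory and file names are invented for illustration). A change to the shared library and the matching change in the cart service land in a single commit:

```shell
#!/bin/sh
set -e
# Throwaway identity so the commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical monorepo layout: one repository, two projects.
mono=$(mktemp -d)
cd "$mono"
git init -q
mkdir -p libs/shared services/cart
echo 'price_format=v1' > libs/shared/format.conf
echo 'uses=price_format:v1' > services/cart/deps.conf
git add .
git commit -q -m "Initial layout"

# Change the shared library and its consumer together, in one commit.
echo 'price_format=v2' > libs/shared/format.conf
echo 'uses=price_format:v2' > services/cart/deps.conf
git commit -q -a -m "Switch cart to price format v2"

# List the files touched by the last commit: both projects appear.
changed=$(git show --name-only --pretty=format: HEAD)
echo "$changed"
```

Because both edits share one commit, there is no intermediate state in which the cart service refers to a library version that does not exist yet.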

When all the code is stored in one place, we only have to start the build process and, for example, watch how changes in the shared library affect the cart service. Everything is available at any time from anywhere, and changes are quick and painless. But not everything is so smooth.

Managers often choose a single repository simply because it seems easier to them, and they supposedly know what they are doing. As a result of such decisions, developers frequently end up changing parts of the code that they should not touch. This is easy to do when you have access to all the code and the project has no clearly defined boundaries.

Many problems also arise with deployment and scaling, and the integrity of the system can suffer. The larger the repository, the slower its checks run. If the code is stored in several repositories, this process can be parallelized, and errors in one part of the project cannot bring down all the services.

Conclusion: If you have a small team or you are not going to expand, it is more logical to store all the code in one place. It is convenient to use a single repository even if you are not working with microservices, but are developing a monolithic application.

A Habr user offers several tips for mitigating the shortcomings of monolithic repositories in Git (large files, a large number of commits and refs).

Storing code in multiple repositories

Some of the problems of a single repository are solved by using several. With microservices, ideally each service should have its own repository. This approach simplifies version control: change the library and bump its version; fix the service code and bump its version.
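
The "change the library, bump its version" routine can be sketched with annotated Git tags in the library's own repository. The file name and the version numbers `v1.0.0` and `v1.1.0` are assumptions made for this example, not something the approach prescribes:

```shell
#!/bin/sh
set -e
# Throwaway identity so the commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# The shared library lives in a repository of its own.
lib=$(mktemp -d)
cd "$lib"
git init -q
echo 'format=v1' > format.conf
git add format.conf
git commit -q -m "First release of the shared library"
git tag -a v1.0.0 -m "shared library 1.0.0"

# Change the library, then publish the change as a new version.
echo 'format=v2' > format.conf
git commit -q -a -m "Add price format v2"
git tag -a v1.1.0 -m "shared library 1.1.0"

# A service that depends on the library can pin an exact version and
# read the file as it was at that tag.
pinned=$(git show v1.0.0:format.conf)
latest=$(git describe --tags --abbrev=0)
echo "pinned content: $pinned, latest version: $latest"
```

A service pinned to `v1.0.0` keeps seeing the old file until its own code is updated and it moves to the newer tag.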

Having several repositories forces you to write code as if third-party developers were going to read it (which, by the way, is quite likely). Instead of treating a code change as a large-scale change to the entire program, the developer starts to think about how to change one module without affecting the rest of the system. As a result, coupling between modules decreases.

This allows modules to be deployed independently. If our order processing service works with both versions of a protocol, we can deploy it even before the cart code is updated. Such an approach requires a high level of discipline.

Conclusion: If your team is experienced enough to keep versions updated regularly and to work with microservices, or if there are many people organized in small groups, it is better to store the code in several repositories. The approach also helps when onboarding new employees, who become more disciplined by following the versioning rules and respecting the boundaries between services.

How Google and Kiln store their code

Judging by these conclusions, most companies, especially large ones, should prefer several repositories. Even so, there is at least one big exception to this rule. Surprisingly, tens of thousands of Google developers use a monolithic repository that holds about two billion lines of code. To support this scale, Google had to develop its own version control system, known as Piper.

Access to Piper is provided through the Clients in the Cloud (CitC) system, which consists of cloud storage and a FUSE file system for Linux. Each developer has a workspace in which the files they have modified are stored. All writes are stored in CitC as snapshots, which makes it possible to roll the work back several steps if necessary.

CodeSearch, the code search tool integrated with CitC, lets you make minor edits to the code and send the modified code for review with an auto-commit option: once the review is passed, tests are run, and then the system makes the commit itself.

The monolithic repository model is built on an approach called trunk-based development. The main line (trunk) holds the latest version of the code, and changes are committed to it once, one after another. Immediately after a commit, the new version of the code is available to all Piper users, so in effect a developer always sees a fresh version of the code.

When functionality is added, the old and the new code exist side by side, and configuration flags control which one is used. This approach avoids the problems that arise when merging changes.
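
The idea of keeping old and new code side by side behind a configuration flag can be sketched in a few lines of shell. The flag name `USE_NEW_CHECKOUT` and both functions are hypothetical; nothing here describes how Google actually implements its flags:

```shell
#!/bin/sh
# Old and new code paths live side by side on the main line.
checkout_old() { echo "old checkout flow"; }
checkout_new() { echo "new checkout flow"; }

# The flag decides which path runs; when unset, the old behaviour is used.
checkout() {
    if [ "${USE_NEW_CHECKOUT:-0}" = "1" ]; then
        checkout_new
    else
        checkout_old
    fi
}

# Run once with the flag unset and once with it switched on.
default_result=$(unset USE_NEW_CHECKOUT; checkout)
flagged_result=$(USE_NEW_CHECKOUT=1; checkout)
echo "$default_result / $flagged_result"
```

The new path can be committed to the trunk long before it is enabled, so no long-lived feature branch has to be merged later.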

Stack Overflow users advise keeping the code in a single repository even when it could be split into several. There are tools for this, such as submodules in Git, externals in Subversion, and subrepositories in Mercurial.

All of them are designed to build the internal hierarchy of a large project, and they can be used to pick out individual modules: just put each project in a separate repository, and then use submodules to include the necessary projects at the right level of the hierarchy.
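
A minimal sketch of the Git variant, assuming two hypothetical local repositories, `shared-lib` and `shop`. The library is included into the parent project at `libs/shared`:

```shell
#!/bin/sh
set -e
# Throwaway identity so the commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# A standalone repository for one project (hypothetical name: shared-lib).
git init -q "$work/shared-lib"
cd "$work/shared-lib"
echo 'format=v1' > format.conf
git add format.conf
git commit -q -m "Shared library"

# The parent project includes it at a chosen place in its hierarchy.
git init -q "$work/shop"
cd "$work/shop"
# protocol.file.allow=always is needed only because this demo clones from a
# local path; recent Git versions block that for submodules by default.
git -c protocol.file.allow=always submodule --quiet add "$work/shared-lib" libs/shared
git commit -q -m "Include shared-lib as a submodule"

# The parent records the submodule path and the exact commit it points to.
submodule_path=$(git config -f .gitmodules --get submodule.libs/shared.path)
git submodule status
```

The parent repository stores only a reference to a specific commit of the library, so each project keeps its own history.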

In addition, Git can create independent branches, called orphan branches. They have nothing in common with one another, and each keeps its own history. This is how a new orphan branch is created:

git checkout --orphan BRANCHNAME

Each individual project can be represented by a separate orphan branch. A new orphan branch still carries the index and the working-tree files of the branch it was created from, so they need to be cleaned up after the branch is created:

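 rm .git/index   # drop the index inherited from the previous branch
 rm -r *         # delete the working-tree files (dotfiles are left in place)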

Before cleaning up, make sure that everything you need has been committed. After that, you can safely use the branch.

Another option is to create several repositories and push these branches to each of them (the repository names must not be the same):

 # repo 1
 git push origin master:master-1
 # repo 2
 git push origin master:master-2
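
Returning to orphan branches, the steps above can be put together in one sketch. The branch name `project-b` and the file names are hypothetical, and the cleanup uses `git rm -rf .`, the variant suggested in the git-checkout documentation, instead of deleting the index by hand:

```shell
#!/bin/sh
set -e
# Throwaway identity so the commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'project A' > a.txt
git add a.txt
git commit -q -m "Project A"
first_branch=$(git rev-parse --abbrev-ref HEAD)

# Start a branch with no parent commit, then clear the files and index
# entries that were carried over from the previous branch.
git checkout -q --orphan project-b
git rm -rf -q .
echo 'project B' > b.txt
git add b.txt
git commit -q -m "Project B"

# The two branches share no history: merge-base finds no common ancestor.
if git merge-base "$first_branch" project-b >/dev/null; then
    shared_history=yes
else
    shared_history=no
fi
echo "shared history: $shared_history"
```

Everything on the first branch was committed before the cleanup, so nothing is lost when the working tree is emptied.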

The developers of Kiln, who once moved from a monolithic Subversion repository to multiple Mercurial repositories, take a different view of code storage. Their project is divided into five parts: the exe clients, a server for client interaction (Reflector), the website, the billing system, and the Aadvark library.

For each part, they created two repositories, devel and stable. New features go into the first and later move to the second; bug fixes, by contrast, go into stable first and are then carried back into devel. Tags are used for synchronization. In Mercurial, tags are repository metadata.

For example, to deploy a new version of the website, the website-stable and aadvark-stable repositories are used. Each is tagged with, say, Website-000123. Then the build process starts: it clones both repositories from the server into the build directory and runs hg up -C Website-000123 to switch the local copy to the required tag. Once the build is complete, it is deployed.

Conclusion

The choice of where and how to store code deserves careful thought and takes some effort. Neither approach can be called definitively better than the other. You need to take into account the composition of your team, your experience, and your goals, and decide on that basis. Moreover, if you wish, you can always move from one repository to several, and vice versa.

In any case, understanding comes with experience. Sometimes it is useful to learn from your own mistakes, so that you know what to watch out for and which methods are sure to work. Time, and everyone's desire to contribute as much as possible to the product, will help you understand what really suits your team best.


Source: https://habr.com/ru/post/308060/
