MultiGitRepository

Using a single Git repository is certainly more comfortable than working with multiple Git repositories. On the other hand, distributed development can hardly be performed in a single repository (unless you believe in a single Blockchain for all the source code on the planet). How to orchestrate multiple Git repositories to work together? That is the thousand-dollar question various teams seek an answer to! For example, Robert Munteanu gave a talk about it at GeeCON 2017 in Prague.


Let's assume we have a project split into multiple Git repositories. What are the options?

Remember non-Distributed Version Control Systems?

There used to be a time when people were afraid of distributed version control systems like Mercurial or Git. Users of CVS or Subversion couldn't understand how one could develop and commit in parallel without integrating into the tip of the development branch! If each developer or team of developers has its own tip, where is the truth?

These days we know where the truth is: there is a master (integration) repository somewhere out there and whatever its tip is, that is the truth. There can of course be multiple repositories, people are free to fork GitHub repositories like crazy, and some may even agree that one of the forks is the important one. Yet, unless the fork overtakes the original repository in the minds of the majority of developers, the truth will always remain in the original repository.

The situation with multiple repositories isn't that different. History repeats itself at a new level. It is just necessary to explain, even to users of a single Mercurial or Git repository, that there is nothing to be afraid of!

Gates for Correctness

A typical GitHub workflow uses pull requests together with Travis or some other form of ContinuousIntegration, usually well integrated with the review tool. As soon as one creates a PR, the continuous builder runs the tests and marks the PR as valid or broken. This greatly contributes to the stability of the master branch - it is almost impossible to break it by merging in PRs.

On the other hand, please note that before your PR gets merged it may contain as many broken (e.g. not fully correct) commits as you wish. It is quite common that one makes changes to the system, pushes them to a branch of one's own repository fork and creates a PR, just to find out that while the functionality is OK, there are other things that need to be polished (formatting and proper spacing being my favorite). One then adds a few more commits to polish the non-semantic problems of the code.

What I'd like to point out is: it is absolutely OK to have broken commits as long as they get fixed before merging into the master branch. Now we are going to transplant this observation to the MultiGitRepository case.

Single Integration Repository

Just like there is the master branch in a classical Git repository, where all the commits ultimately have to end up (be merged), there has to be such an integration point in the MultiGitRepository scenario as well. That means there has to be a single integration repository which references all the other repositories and identifies the exact commits at which they were integrated together.

One can use Git submodules for that, but other mechanisms that uniquely identify the changesets work as well (GraalVM uses a tool called MX which keeps these references in a special file called suite.py). All that matters is to have a single version of the truth - a single place that uniquely and completely identifies all the source code spread among all the repositories.
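To make the idea concrete, here is a minimal sketch of what such a reference file might look like. It only mimics the spirit of MX's suite.py; the repository names, URLs and commit-hash placeholders are made up for illustration.

```python
# Illustrative only: a suite.py-style file pinning the exact commits of the
# other repositories. The real MX format is richer; the important part is
# that one file in the integration repository uniquely identifies every
# dependent repository and its revision.
suite = {
    "name": "integration",
    "imports": {
        "suites": [
            {
                "name": "slave-a",                           # hypothetical slave repository
                "version": "<exact-commit-sha-of-slave-a>",  # revision to integrate
                "urls": [{"url": "https://example.org/slave-a.git"}],
            },
            {
                "name": "slave-b",
                "version": "<exact-commit-sha-of-slave-b>",
                "urls": [{"url": "https://example.org/slave-b.git"}],
            },
        ]
    },
}
```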

As in the single repository case, it is good to have a gate: an automated check that verifies, for every PR to be merged into the master branch of the integration repository, that everything is still OK, still consistent. Such a Travis or other ContinuousIntegration test checks out all the dependent repositories at their appropriate revisions (which are stored somewhere in the integration repository) and runs the tests. If they pass, the PR is eligible for being merged. That guarantees the master branch of the integration repository is always correct.
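Such a gate can be sketched in a few lines. The script below is only an illustration, assuming the pinned revisions come from a file like the one above and that a hypothetical run-all-tests.sh drives the overall test suite.

```python
# Sketch of an integration gate: check out every dependent repository at the
# revision pinned in the integration repository, then run the overall tests.
# URLs, revisions and the test command are hypothetical placeholders.
import subprocess

PINNED_REVISIONS = {
    "https://example.org/slave-a.git": "<exact-commit-sha-of-slave-a>",
    "https://example.org/slave-b.git": "<exact-commit-sha-of-slave-b>",
}

def checkout_pinned_revisions():
    """Clone each dependent repository and check out the pinned commit."""
    for url, revision in PINNED_REVISIONS.items():
        name = url.rsplit("/", 1)[-1].removesuffix(".git")
        subprocess.run(["git", "clone", url, name], check=True)
        subprocess.run(["git", "-C", name, "checkout", revision], check=True)

def run_gate():
    """A PR is eligible for merging only if this whole run passes."""
    checkout_pinned_revisions()
    subprocess.run(["./run-all-tests.sh"], check=True)

if __name__ == "__main__":
    run_gate()
```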

What happens in the individual repositories meanwhile, you may ask? Well, anything. Things may even get broken there (from a global perspective), but note that this was also the case in the single repository setup: there could be broken commits in the meantime - all that mattered was to fix them before integrating. The same applies to the MultiGitRepository case: all that matters is that the changes from a single repository are correct before they get integrated (which means updating the appropriate commit references in the integration repository, creating a PR and merging it into the master branch of the integration repository). And correct they have to be, as we have a gate in the integration repository which would refuse our PR otherwise!
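If the integration repository tracks the slaves as Git submodules, the "update the references and create a PR" step boils down to moving the submodule pointer and pushing that change through the integration repository's gate. The following sketch assumes submodules and hypothetical names; with an MX-style suite.py one would edit that file instead.

```python
# Sketch: propose an integration by bumping a submodule pointer in the
# integration repository. Directory names, branch name and commit are
# hypothetical; the PR itself is then opened in the usual way.
import subprocess

def propose_integration(slave_dir: str, slave_commit: str, branch: str) -> None:
    def git(*args, cwd="integration"):
        subprocess.run(["git", *args], cwd=cwd, check=True)

    git("checkout", "-b", branch)                                  # branch of the integration repo
    git("fetch", "origin", cwd=f"integration/{slave_dir}")
    git("checkout", slave_commit, cwd=f"integration/{slave_dir}")  # move the submodule pointer
    git("add", slave_dir)
    git("commit", "-m", f"Integrate {slave_dir} at {slave_commit}")
    git("push", "origin", branch)                                  # now open a PR against master
```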

Of course, individual teams working on the non-integration slave repositories are encouraged to run tests and have their own gates. However, such tests only give a hint; they aren't the ultimate source of truth. Just like developers working on a branch of a single repository are advised to execute tests before making commits, yet cannot expect such tests to guarantee their code will merge into the master branch without any changes, so in the MultiGitRepository case, regardless of what happens in your slave repository, nothing is guaranteed with respect to the integration repository.

Only when the final PR in the integration repository gets merged can one claim that there is a new version of the truth and that it has just moved forward.

Always Correct vs. Eventually Correct

For a long time I was proposing the usage of a lazy MultiGitRepository scheme. That is: let anything happen in the individual repositories and concentrate only on the final gate check in the integration repository. Once the changes pass the gate and get merged, everything has to be OK. Clearly a win-win situation, I thought. However, there must be some cultural aspect of this lazy verification which prevented my colleagues from accepting it as a solution. I am still not sure what the problem was, as in my opinion it mimics the single repository behavior - anything can happen on branches and forks, even broken commits are allowed - all that matters is that the problems get fixed before the PR that contains them gets merged into master.

Possibly the biggest psychological problem is that one can integrate into the master branch of one of the non-integration slave repositories and only then find out that such a change cannot get into the integration repository. There is no rational explanation why that should be a problem: the master branch in Git is just a name for the commit that most users of the repository treat as the tip of development. There is nothing special about it. If it is broken, you can add more commits on top of it to fix whatever needs to be fixed, or you can ignore the last few commits, assign the name master to some other commit and try again. In the end (that is, when some new commit from your slave repository successfully passes the integration repository gate) it has to be correct - in other words, the model leads to eventually correct code.

There is, however, a fix for this psychological problem that has recently been implemented in one of the GraalVM teams and which seems to overcome the psychological barrier. It modifies the ContinuousIntegration builder of the slave repository to take the tip of the integration repository, update it with the commit in the slave and perform all the gate tests necessary for the integration. Only if these tests pass can the PR be integrated - this time into both repositories at once - so the master branches are always correct. This kind of eager check for correctness seems to be more acceptable among my colleagues.
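A rough sketch of such an eager check, as it might run in the slave repository's ContinuousIntegration, follows. Everything here - the repository URL, the helper scripts, the way the reference is updated - is a hypothetical placeholder; the essential idea is that the integration repository's tip plus the new slave commit is tested before anything is merged.

```python
# Sketch of the eager check: take the tip of the integration repository,
# point its reference for this slave at the commit under test, and run the
# full integration gate. Only a passing run allows the PR to be merged -
# into both repositories at once.
import subprocess

INTEGRATION_REPO = "https://example.org/integration.git"  # hypothetical
SLAVE_NAME = "slave-a"                                     # hypothetical

def eager_gate(slave_commit: str) -> None:
    subprocess.run(["git", "clone", INTEGRATION_REPO, "integration"], check=True)
    # Update the reference to this slave (a hypothetical helper; with
    # submodules it would be a checkout inside the submodule directory).
    subprocess.run(["./update-reference.sh", SLAVE_NAME, slave_commit],
                   cwd="integration", check=True)
    # Run the same tests the integration repository's own gate would run.
    subprocess.run(["./run-all-tests.sh"], cwd="integration", check=True)
```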

Tight Coupling

The always correct approach may be the better choice if the repositories are separate but closely related. For example, the repositories may be split for licensing reasons. Then it is very common that a technical change cross-cuts both of these repositories and one needs to integrate both parts of the change at once. Given the nature of such tight coupling, the always correct integration policy seems to reduce hassle on balance compared to the eventually correct approach. It is better to have longer gate times and verify each commit properly than to force people to do that manually (which they would have to do almost every time anyway).

This nicely shows the importance of API for the economy of your project. If you have two repositories isolated by a BackwardCompatible API, you can start practicing distributed development - that is, disconnect the repositories a bit by using the eventually correct approach. If you don't bother with maintaining a BackwardCompatible API, you immediately increase the coupling and have to treat the repositories as one - and burn CPU cycles on the always correct verifications.

The Scalability Problem

When you sign up for the always correct approach, you need to run the overall integration repository tests for every PR in each slave repository. Soon you may find out that you are running out of computation capacity. Of course, there are rational reasons for testing every commit and every PR. It is true that bugs fixed at an early stage are cheap, whereas fixing bugs at a later stage increases the cost significantly. If we had enough computation power, we should fully verify every commit! Alas, we don't have it, and that is the reason why we can't automatically verify every commit nor, in fact, every push. Even if you think the computing resources are well spent on testing everything, you can get into a situation where you can no longer trade computing resources for developer resources.


As such, you are likely to seek policies that slowly turn always correct checks into eventually correct ones. The eventually correct checks certainly scale better: for example, you may run them just once a day, or once a week - of course under the assumption that you can culturally accept the uncertainty that until your changes are in the integration repository, they may be completely broken. Remember the most critical goal: we want to be assured that the master branch is free of bugs. We never want to see errors on master that could have been avoided by automated testing. However, this is single repository thinking! In the world of many slave repositories and one integration repository, we care "only" about the master branch in the integration repository. That is the integration place that has to be free of bugs. Master branches in the slave repositories are less important and we can save a significant amount of resources by applying the eventually correct approach in their gates.


The nirvana lies in a system properly designed around BackwardCompatible APIs. Then most of the commits are just locally important and you can save a lot of the testing. It is enough to run limited sanity tests, verify binary compatibility with SigTest and delay the thorough checks for later. The system has to become eventually correct before the changes get merged into the master branch of the integration repository anyway, right?
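For illustration, a slave repository's per-PR gate under the eventually correct policy might shrink to something like the sketch below. The scripts are hypothetical placeholders (the compatibility step is where a SigTest-style check would plug in), and the expensive cross-repository testing is deferred to the integration repository's own, less frequent gate.

```python
# Sketch of a lightweight per-PR gate for a slave repository: quick sanity
# tests plus an API compatibility check, with the thorough cross-repository
# verification deferred to the integration repository's gate.
import subprocess

def light_slave_gate() -> None:
    subprocess.run(["./run-sanity-tests.sh"], check=True)          # fast, local tests only
    subprocess.run(["./check-api-compatibility.sh"], check=True)   # e.g. backed by SigTest
    # No overall integration testing here - that runs later (daily or
    # weekly) in the integration repository.
```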

In a system with APIs designed for distributed development, going through the overall testing for each commit is clearly a waste of resources.

Single vs. Multi: Where's the difference?

It is 2018 and we, developers, have already learned to work with distributed version control systems like Git. Time to learn to work in a MultiGitRepository setup too! It is not that hard; in fact, it perfectly matches what we have been taught so far. We just need an integration repository and we need to bring our expectations to a new level:


Action | Single Git Repository | MultiGitRepository
final commit destination | master branch | master branch in the integration repository
request to integrate | PR targeting master branch | PR updating references to slave repositories in the integration one
temporary work | done on branches or in forks | anything done in slave repositories before master in the integration repository references it
collaborative areas | branches in the repository | even master branches in slave repositories
origin of team sanity builds | a dedicated branch with full featured ContinuousIntegration | best to use the master branch of a slave repository (has ContinuousIntegration by default)
ultimate gate | runs before a PR is merged to the master branch | runs before the reference to a slave is updated in the integration repository
bug free system | master branch is bug free | master branch in the integration repository and all referenced slave repository versions are bug free

Don't be afraid to work in a MultiGitRepository setup. With a single integration repository it is not complicated at all!

Appendix A: Local Collaboration Area

Robert commented that having a broken master branch in a slave repository is bad for collaboration. That is indeed true! When I wrote about broken, I meant broken from a global perspective, not from the perspective of the slave repository.

Let's envision a team using the slave repository approach that develops, for example, GraalJS. Then there could be a GraalVM integration repository that includes the GraalJS one and integrates it together with other languages. In such a situation, a commit in GraalJS may break the interop functionality between GraalJS and some other language. But one will not know for sure until the change gets integrated into the GraalVM integration repository.

From the overall perspective the slave repository may get into a broken state. However, from a local perspective, there is no reason to have a broken master branch in any repository, right? There are tests and we develop via PRs and merge only when everything is (locally) green, right?

In some sense, the master branch of a slave repository is another temporary collaboration area. When you need to collaborate in the single repository setup, you create a branch and let multiple members of a team commit into that branch. Only when the work is done does it get integrated into the final commit destination - that is, the master branch of that repository. However, in the case of a MultiGitRepository setup, a team may easily collaborate in the master branch of their repository. Until a reference to the latest commit is integrated into the integration repository, all such work is just a temporary collaboration.


The advantage of using the master branch for collaborative development is the simplicity of producing daily builds ready for quality checks or for publishing to a Maven snapshot repository. Most projects do that for the master branch. In the MultiGitRepository setup each slave repository gets such infrastructure automatically. Again, these are just temporary builds, not fully correct from the global integration repository perspective, but for many usages they are good enough: many teams use such temporary bits for manual sanity checks to be sure everything is OK before they include their work (e.g. the id of their repository's latest master commit) into the integration repository.
