DVCS Round-Up: One System to Rule Them All?--Part 2


In the first part of this article series, I presented some history concerning the development of distributed revision control systems, and took a closer look at both SVK and monotone. In this sequel, we will take a closer look at the remaining four systems promised to be reviewed, namely darcs, Bazaar, Git, and Mercurial. The latter three are quite popular, being used by Ubuntu (Bazaar), the Linux kernel (Git), and the Mozilla project (Mercurial).

darcs: nice & clean... but scales badly

The darcs system (“David's advanced revision control system”) was developed by David Roundy. The software provides a completely distributed version control system and has existed since early 2003. In contrast to SVK, for example, darcs has no concept of a central repository, which marks it as a truly distributed system. Every repository is potentially equal, and becomes special only if the developers choose it to be. Another feature that sets it apart from older systems like CVS, Subversion and GNU arch is that darcs does not separate the “repository” from the “checkout directory.” Revision information is stored alongside the working files, essentially creating a directory that is able to remember (and go back to) older states.

Unlike other systems, darcs has no direct notion of a “branch.” Branching is done by copying the repository to a different place. On filesystems that support linking, the copy shares quite a bit of data with the original, making this task relatively fast and efficient. The operation of darcs is based upon the concept of a “patch algebra”: any state of the directory is modelled as the consecutive application of a sequence of “patches” to the initial state. Patches can be changed, and can even be removed. Since there are no branches inside one repository, recording changes is pretty straightforward. Merging is simply copying patch information between repositories. Conflicts that cannot be resolved automatically are marked for manual resolution. All in all, the system is quite impressive in its elegance and simplicity. Of course, you have to adapt your workflow to the way darcs works, not the other way round.
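
The idea that a working directory is just the result of applying a sequence of patches, and that merging means copying patches between repositories, can be sketched in a few lines of Python. This is a toy model and not how darcs is implemented: the names Patch, Repo, record, branch and pull are made up for illustration, and real darcs additionally reorders (commutes) patches, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Tree = Dict[str, str]  # hypothetical model: file path -> file content


@dataclass(frozen=True)
class Patch:
    """A named change of one file from `old` to `new` content (None = absent)."""
    name: str
    path: str
    old: Optional[str]
    new: Optional[str]

    def apply(self, tree: Tree) -> Tree:
        # The patch only applies if the file is in the state it expects.
        if tree.get(self.path) != self.old:
            raise ValueError(f"conflict while applying {self.name!r}")
        result = dict(tree)
        if self.new is None:
            result.pop(self.path, None)
        else:
            result[self.path] = self.new
        return result


@dataclass
class Repo:
    """In this toy model a repository is just an ordered list of patches."""
    patches: List[Patch] = field(default_factory=list)

    def state(self) -> Tree:
        # Replay every patch on top of the empty initial state.
        tree: Tree = {}
        for patch in self.patches:
            tree = patch.apply(tree)
        return tree

    def record(self, patch: Patch) -> None:
        patch.apply(self.state())  # refuse patches that do not apply
        self.patches.append(patch)

    def branch(self) -> "Repo":
        # Branching is copying the repository.
        return Repo(list(self.patches))

    def pull(self, other: "Repo") -> None:
        # Merging is copying the patches we do not have yet.
        known = {patch.name for patch in self.patches}
        for patch in other.patches:
            if patch.name not in known:
                self.record(patch)


main = Repo()
main.record(Patch("add-readme", "README", None, "hello"))
feature = main.branch()
feature.record(Patch("add-notes", "NOTES", None, "todo"))
main.record(Patch("edit-readme", "README", "hello", "hello world"))
main.pull(feature)
print(main.state())
```

After the pull, the main repository contains both the edited README and the NOTES file recorded in the copy. A patch that does not apply raises an error, which loosely corresponds to a conflict that has to be resolved by hand.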

For small repositories, darcs works quite fast, and its feature set is sufficient. It tracks directories as well as files, and its rename tracking works rather well. Of course it is always possible to produce merge conflicts with file renames, but the usual “harmless” standard cases I tested (e.g., developer 1 changes the file's content, developer 2 changes its name) worked without a hitch. This trick alone is something that the average Subversion user is not quite used to, as renaming in SVN still is an invitation to disaster. The unique design of darcs actually allows for some neat tricks, as you can create patch sets as unions or intersections of other patch sets. Tagging is also very flexible, as a tag is just a “virtual patch” that depends on an arbitrary set of patches.

Darcs works well on different platforms, and prebuilt versions for MacOS, Windows, and Unix systems exist. However, synchronizing between repositories takes a surprisingly long time on all systems, depending on your repository and memory size. Darcs simply appears not to scale too well with project size, as there are reports of memory usage of up to 1GB for large trees (i.e., putting the current Linux tree under darcs control would probably not be a good idea). I have personally seen memory footprints of up to 400MB when checking in a 50MB binary file, although it seems to have gotten a bit better with darcs 2.2. All things considered, darcs should be fine for small and medium-sized projects as long as you do not put large binaries under darcs control, and as long as you can live with its strict “a branch is a repository is a branch” philosophy.

Another drawback is the complete lack of alternative user interfaces. There is no GUI, no TortoiseSVN-like Explorer integration on Windows and no Eclipse support either, as far as I know. On the other hand, the darcs commands are quite easy to understand and the tool is actually pretty nice once you are used to it. Since it is so radically different from CVS and SVN, it does not even try to emulate their behaviour, resulting in a more or less completely orthogonal command set. Whether this is a good or a bad thing is largely a matter of personal taste. The fact remains that darcs is a nice and clean system, and that it has its small but devoted user base for good reason.

Bazaar: Jack of All Trades

Bazaar is the revision control system used by the Ubuntu project. Its development started in 2005 and is coordinated by Canonical, the company founded by Mark Shuttleworth. The two slogans used by the project to describe itself are “a distributed version control system that Just Works” and “Version Control for Human Beings”. So it is clear what the Bazaar people are aiming at: a powerful system that can still be used without big headaches. It is implemented in Python, making it very portable and ensuring availability on more or less any platform one could wish for.

The basic tests were passed by Bazaar with flying colors. Checking in a 50 megabyte binary file took just a few seconds, suggesting the hard disk's speed as the actual bottleneck, and rename tracking tests showed no obvious flaws. As in darcs, directories are first-class objects, meaning that the software tracks even empty folders. Repositories can be created in two flavors: the “standalone” version works just like in darcs, making it very easy to put an existing directory under Bazaar's control. Shared repositories have advantages when working with multiple branches that only differ slightly (and thus have much in common).

The normal workflow is similar to darcs and monotone. But where darcs merges branches automatically and monotone pulls changes without the need to merge at all, Bazaar simply refuses to pull changes if branches have diverged. In this case, the only way to pull the changes is through the merge command, essentially emulating darcs' behavior. After merging, the merged state has to be committed to the local branch and can then be pushed back, synchronizing both repositories.
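
The rule "pull only if the branches have not diverged" can be pictured as an ancestry check on the revision graph. The following Python sketch is an illustration under assumptions, not Bazaar's actual implementation: the history dictionary, the revision names and the functions ancestors and can_pull are all made up for the example.

```python
from typing import Dict, List, Set

# Hypothetical history: revision id -> list of parent revision ids.
History = Dict[str, List[str]]


def ancestors(history: History, rev: str) -> Set[str]:
    """Return rev and everything reachable from it through parent links."""
    seen: Set[str] = set()
    stack = [rev]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(history.get(current, []))
    return seen


def can_pull(history: History, local_tip: str, remote_tip: str) -> bool:
    """A plain pull works only if the remote tip already contains the local tip."""
    return local_tip in ancestors(history, remote_tip)


history: History = {
    "r1": [],
    "r2": ["r1"],
    "r3-local": ["r2"],    # committed locally after r2
    "r3-remote": ["r2"],   # committed remotely after r2
    "r4-remote": ["r3-remote"],
}

print(can_pull(history, "r2", "r4-remote"))        # local branch is simply behind
print(can_pull(history, "r3-local", "r4-remote"))  # branches have diverged
```

In the second case the local tip is not part of the remote history, so an explicit merge followed by a commit is the only way forward, as described above.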

Personally, I do not care much for Bazaar's behavior during merges, as it tends to make merging more tedious (especially when merging over the network). In systems with automatic local branches, you first get all the remote changes, can kill the connection, and then deal with the merge afterwards. However, Bazaar does allow you to commit your own changes into the local branch before you attempt a merge, just as monotone does, for example. It should be noted that the global structure of revisions in Bazaar is once again a DAG, although here the local branch is treated as special.

Performance feels quite good, as Bazaar seems to have been improved a lot in that regard lately. Reports by people on the 'Net confirm this, rendering Bazaar a viable solution even for larger projects. Graphical shells are available, too, as well as integration into Eclipse and other IDEs. The Windows version comes bundled with TortoiseBZR, which is yet another variant of TortoiseSVN. However, it seems that the version bundled with Bazaar 1.6 is still fairly limited in its functionality, meaning that you will have to do some things at the command prompt.

Bazaar can also be extended by plugins, and a large number are already available, ranging from command extensions to notification addons. Particularly interesting is the rebase plugin, which essentially enables you to move sub-branches around in the revision graph. This is a powerful feature that allows you to re-order revisions before pushing them back to the main repository, and it adds functionality that historically was found only in Git. All in all, Bazaar is an impressive system, and one cannot really go wrong in choosing it. Personally, I found the advertising texts on the web page a bit over the top, but maybe that's just me (the “comparisons” against Mercurial and Git were definitely a bit childish, although the response by the Mercurial team was not much better).

Git: The Speed King

As mentioned before, Git is the revision control system developed by Linus Torvalds who is (in his own words) “not a source control person.” In fact Torvalds claims that this was essential to the evolution of Git, since he never managed to convince himself that “CVS did something sane.” Of course, several distributed systems already existed at the time Torvalds designed Git and undoubtedly have influenced its design. Apart from the proprietary BitKeeper, the two systems I would like to mention in this context are darcs and monotone. Both had already existed for roughly two years at the time of Git's inception, and both show clear similarities to Git concerning workflow.

The basic operation of Git is nothing new to someone who has used other distributed systems before. It implements the same “the directory is the repository” philosophy we have already seen, making it especially easy to put existing code under the VCS's control. Directories are second-class citizens, as Git does not track them explicitly. In general Git does not care much about files either, but (according to its creator) rather about their content. This also means that it is usually quite unimpressed by file renames. Large binary checkins were no problem during my tests. Storage is quite efficient and strongly compressed, often resulting in a repository that is actually smaller in its entirety than the current checkout.
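
The remark that Git tracks content rather than files can be illustrated by the way it names file contents. Git identifies a blob by the SHA-1 hash of a short header followed by the raw bytes, so the file name is not part of the identity. The sketch below reproduces that calculation in Python; the helper name git_blob_id and the sample paths are made up for the example, and it covers only the naming of blobs, not trees, commits or pack files.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git assigns to a blob with the given content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical working tree before and after a rename: path -> content.
before = {"src/old_name.c": b"int main(void) { return 0; }\n"}
after = {"src/new_name.c": b"int main(void) { return 0; }\n"}

ids_before = {git_blob_id(data) for data in before.values()}
ids_after = {git_blob_id(data) for data in after.values()}

# The path changed, the content did not, so the blob ID is identical.
print(ids_before == ids_after)
```

Since the stored object is the same before and after the rename, nothing new has to be written for the file's content. For reference, the empty blob hashes to e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is also what `git hash-object` reports for an empty file.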

Apart from duplicating the whole repository (like darcs), Git can work with named and unnamed sub-branches in the local repository. After pulling in changes from other repositories, Git will often produce different strands of development each with their respective “heads” in the local repository, just like monotone does. These branches can be very easily merged by anyone who happens to have the respective changesets, and then pushed back. A nice feature is the possibility to have a “local” branch that is not automatically pushed to the other developers, making it easy to test changes without the need to create full-fledged branch repositories.
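
The "heads" mentioned above are simply the changesets that no other changeset names as a parent. A minimal Python sketch, assuming a history modelled as a dictionary from changeset to parents (the names heads, pull and the changeset labels are invented for the example, and no actual file content is merged here):

```python
from typing import Dict, List

History = Dict[str, List[str]]  # hypothetical model: changeset -> parents


def heads(history: History) -> List[str]:
    """Changesets that are nobody's parent, i.e. the tips of all strands."""
    parents = {p for plist in history.values() for p in plist}
    return sorted(rev for rev in history if rev not in parents)


def pull(local: History, remote: History) -> History:
    """Pulling just adds the changesets we are missing; nothing is merged."""
    combined = dict(local)
    combined.update(remote)
    return combined


local = {"a": [], "b": ["a"], "c-mine": ["b"]}
remote = {"a": [], "b": ["a"], "d-theirs": ["b"]}

repo = pull(local, remote)
print(heads(repo))  # two strands of development after the pull

repo["m"] = ["c-mine", "d-theirs"]  # a merge changeset joins both strands
print(heads(repo))  # a single head again
```

The merge changeset is an ordinary entry in the graph, which is why anyone who has both parent changesets can create it and push it back.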

One aspect that really sets Git apart is its speed. Updates are blazingly fast and the dependence on repository size is very, very weak. For all intents and purposes, Git shows nearly a flat-line behavior when it comes to the dependence of its performance on the number of files and/or revisions in the repository, a feat no other VCS in this review can duplicate (although Mercurial does come quite close). Apart from raw speed, Git also provides some powerful features like “git-rebase”, which allows for efficient management and manipulation of local, temporary branches. However, Bazaar and Mercurial are rapidly gaining ground due to their plugin-based design (and already offer plugins for rebase's functionality, for example).

Git is available as source code for practically any POSIX-like platform (BSD, Linux, MacOS etc.), and most major binary distributions provide packages ready to be used. One drawback at this time is Git's (lacking) support for Microsoft Windows platforms. These are only “officially” supported through Cygwin, although another version (msysgit) provides almost complete functionality in an easy-to-use installer package. Nevertheless, Windows support can still be problematic e.g. if you have to work in a mixed environment. A few years back, msysgit did not exist, leaving Cygwin as the only option and effectively making the Mozilla project reject Git for that very reason. Of course, things have improved greatly with the availability of the native client and I expect further improvements to it.

Mercurial: The Rival

Roughly at the same time Linus Torvalds designed his Git system, kernel hacker Matt Mackall started his own: Mercurial. While Git was a POSIX application very much in the UNIX spirit, consisting of several small tools written in C and shell script, Mackall decided on a different approach for Mercurial. His solution consisted of a monolithic Python application, which used C only where it absolutely mattered for speed.

The net result of that decision was that Mercurial was available for several platforms (including Windows platforms) from the get-go, and this has not changed. Since it uses optimized compiled code for critical operations, it is very fast. Not quite as fast as Git, but very close and with similar scaling properties. This means that Mercurial handles large repositories and/or large revision histories particularly well, just like Git. Mackall is a kernel developer and designed Mercurial explicitly with that goal in mind, because at the time there was even the possibility that it would end up as the kernel's VCS.

Although the kernel developers settled on Git, the Mercurial project was continued and its interoperability with Git further improved. It is possible to convert repositories between the two systems without much hassle, and a synchronized, unofficial Mercurial kernel repository even exists on the kernel.org site. As I said before, the two systems are eerily similar in many aspects (for example, Mercurial does not directly track directories, just like Git), although the implementation details differ greatly. However, while Git was very powerful from the very beginning but lacked a user interface that enabled it to be used by mere mortals, the development of Mercurial happened the other way round. It had a clean and usable interface from the get-go, but a number of advanced features were only added later.

One example of this is the handling of branches. Early versions of Mercurial had no support for named branches, and it was considered best practice to just create a “clone” of the repository. So in this regard, early Mercurial was very much like darcs and monotone. Named branch support was implemented as a later improvement, and “temporary” branches (i.e. local to the repository) only recently. In fact the Mercurial team long argued that a project's history should be both complete and fixed, and that Git's “history-changing” features like local branches or rebase were both unnecessary and dangerous.

Actually, both positions are correct in their own way, and one can work perfectly without features like git-rebase. It is a matter of personal choice and philosophy, which is why some of the more advanced functions in Mercurial are offered via plugins (e.g., patch management via “Mercurial Queues”). Feature-wise, Mercurial has gotten very powerful indeed in recent releases, mirroring and sometimes even surpassing Git's functionality while remaining its closest rival in terms of performance. Since its main part is implemented as a Python API, it is quite easy to extend Mercurial's functionality via custom plugins.

Conclusions and Outlook

After looking at all the contestants, it is really difficult to nominate a “winner.” Four systems of the six are definitely viable choices for most purposes and the choice is largely a matter of personal taste, leaving SVK and darcs as the only contestants I cannot recommend for general use. SVK has so many serious flaws I do not have the room to enumerate them all; to name the worst: it is based on Subversion (inheriting many of its limitations), it is horrendously slow, it hogs quite a lot of space on the hard disk and the setup process is apparently optimised for people who actually like pain. Add to that its inflexible, centralised workflow and the fact that it has been written in Perl by essentially a single person, which results in both slow development and an uncertain future.

darcs, on the other hand, feels indeed very nice and clean, and it should actually work quite well for small projects. However, it is not very flexible, also tends to hog disk space, and quickly gets pretty slow when your project grows. Although I hate to have to say it, it is simply not worth the effort, considering that the alternatives offer at least comparable performance for small projects, can be used in a way that very closely mirrors darcs' workflow (but are more flexible), and scale much better. The patch algebra idea is intriguing, but that's it, more or less.

There is no truly “bad choice” among the other four systems, since they are all actively developed and used by prominent open source projects. Monotone adds an additional layer of security that some people will find attractive. Bazaar allows both central and standalone, workspace-integrated repositories and also offers many interesting plugins, although I find the developers' “we are the best” attitude a bit annoying. Mercurial and Git are the speed kings, and are also feature-rich and flexible.

If you really want a direct recommendation, it would be thus: download Git and Mercurial, then throw dice. Personally I like the “it just works” approach of Mercurial a bit better, and the fact that it is written in Python. Also for cross-platform projects needing Windows support Mercurial is (at this moment) probably a tiny bit less problematic. If you work in a completely UNIX-centered environment, Git might be the slightly better choice. However, both systems work very well, are stable and very, very fast. So it's a tie, more or less.

In the third (and last) part of this article series, I will try to confirm the impressions I got so far about performance. So expect lots of boring numbers, but maybe a few pretty graphs as well. Stay tuned!

 

 

Git vs Mercurial, and benchmarking DVCS
Submitted by jnareb on Fri, 01/23/2009 - 15:14.

First, a disclaimer: I use Git, and contribute to it a little (gitweb mainly), while I know Mercurial only from documentation, discussion on mailing list, and discussion on IRC. Therefore I might be biased here, and my knowledge of Mercurial outdated.

With respect to Git vs Mercurial as choice for DVCS: in Mercurial named local branches are still second-class citizens (they are available via the localbranch extension), and IMHO Git has much better support for a multiple-branches-in-a-single-repository workflow, and for interaction with other multi-branch repositories (the concept of remote-tracking branches). Rename detection left a little to be desired in Mercurial: for example, a rename was shown as copy plus removal in history browsing, but this might have changed. Having directories as first-class citizens, which among other things means dealing with empty directories and renaming directories (or, to be more exact, the situation where one side renamed a directory and the other side created new files in the old-name directory), is not impossible with Git (there were some patches proposed on the git mailing list) but it seems impossible in Mercurial with its repository structure. Mercurial also doesn't support octopus merges (merges with more than two parent commits) due to the fixed record structure of its commit (version) representation.

On the other hand Mercurial has Python API to write extensions (while Git has scriptability helpers), it has better support on MS Windows, and "hg serve" is (I think) simpler as means to serve repository and have web interface than setting up equivalent (git-daemon or web server + hooks, gitweb or cgit) in Git. And it has good documentation, although nowadays Git documentation is also of good quality.

With respect to benchmarking: take a look at http://git.or.cz/gitwiki/GitBenchmarks and http://vcscompare.blogspot.com/ . The latter currently lacks published speed comparisons, but you might want to contact the author, Pieter de Bie, for ideas.

Re: Git vs Mercurial
Submitted by rmfendt on Fri, 01/23/2009 - 17:43.

You are right, there are things that Git is still better at than Mercurial. And for some things, this will likely remain the case, since there are differences in philosophy. As I mentioned in the article, hg did not have anything like rebase for a very long time, and one reason for that was that "changing history" was not considered good practice. Truth be told, I am no big fan of extensive restructuring, either, so I find myself to be a bit biased towards Mercurial there.

Yes, renaming is still represented as a copy&remove. In fact, it is more or less the same from a semantic point of view. The important bit is that the renamed file still carries the complete history, and that changes are not lost during a merge. Mercurial copes quite well with the "harmless standard cases" (which already profoundly confuse SVN, for example). There are endless possibilities to create renaming conflicts, and I have yet to see one system that would somehow magically catch all problem cases. You can even confuse Bazaar, and those guys take special pride in the quality of their rename tracking.

The localbranch extension is still considered "experimental", since AFAIK not all the hg developers really like the idea. Personally I never missed that particular feature: if I want to try something, I just clone the repo. If it doesn't work out, I kill the clone and that's it. Octopus merges and directory tracking go in the same direction: one can live quite happily without them, and doing so avoids a lot of complication and hacks. As I said, there are some differences in design and philosophy that will probably never go away. However, the fact remains that both systems are very powerful, much more powerful in fact than most projects will ever need.

Thanks for the benchmarking links, I will certainly look at them. I have conducted my own measurements simulating a linearly growing repository, which I will present on LDN shortly. Perhaps I will do some testing with the kernel tree as well, although this will not be possible with the two 'losing' systems (SVK probably would literally take DAYS for the initial check-in, and darcs would just explode from the memory usage). However, I consider the results of the synthetic benchmarks I did so far already quite revealing.

Re: Git vs Mercurial, benchmarking DVCS
Submitted by jnareb on Fri, 01/23/2009 - 18:19.

I don't know if Mercurial local clones are smart enough (e.g. by using hardlinks, like git-clone now does by default, or some pointer to an alternate database like the alternates mechanism in Git) to share the repository database and so reduce the filesystem cost of creating a new clone (new branch). Even if they are, cloning is not as lightweight as branching is in Git, and lightweight branching (creating a new clone in a new directory is IMHO not very lightweight) is extremely useful in a branch-heavy workflow. IMHO a branch-heavy workflow using topic branches is the preferred workflow when/if you interact with a large number of developers.

BTW I forgot one thing that, in my opinion, Mercurial got wrong: tags. They should be unversioned and transferable, and in Mercurial (from what I understand) they are either versioned and transferable (and use horrible hacks to behave sanely) or unversioned and untransferable (local tags, which simply lack a transfer mechanism).

About benchmarking DVCS: http://vcscompare.blogspot.com/ currently contains only a repository size comparison, but the early posts quite nicely discuss how the example projects and the conversion tools were chosen.

I'm not sure if you can import Linux kernel repository to Mercurial, as it has octopus merges in it...

BTW, from what I remember, bisect and rebase went from Git to Mercurial (rebase as a Mercurial extension named 'transplant'), while hg-serve was the inspiration for gitweb, and bundle went from Mercurial to Git (well, they are not entirely equivalent...). In that vein, the Darcs 'record' command, which allows you to select diff chunks to commit, was the inspiration for Git's interactive add.

Re: Git, Mercurial differences
Submitted by rmfendt on Fri, 01/23/2009 - 21:00.

Local clones do indeed use links to reduce filesystem load, and always have. So yes, they are "heavy-weight" branches, but I still prefer it that way. However, it is quite easy to create a light-weight branch in Mercurial. At some point there has to be a 'fork' in the DAG, or in other words one changeset that is present in one branch but not the other. Just step to the parent changeset, commit any independent changes to it and you have thus created a head inside the repository.

You can switch between heads at any time, and even give them local symbolic names which are tracked much like in Git (via the 'bookmarks' extension which comes with hg). If you want to track a "branch" with contents that are not to be pushed during synchronisation, you have possibilities, too. For example you can just maintain a local patch queue (the queues extension is also part of the standard distribution), which is in my opinion the best variant for such cases.

The question of tagging is a topic of ongoing discussion. Mercurial's implementation is not without problems, but the concept is actually quite appealing in its simplicity. BTW, interactive patching is also available in Mercurial ('record' extension, also comes with hg). Importing the Linux kernel is apparently possible, since there is even a synchronised Mercurial repository on kernel.org. What is actually still a bit immature in Mercurial is support for sub-repositories. There is support there (through the 'forest' extension), but it is not quite as stable as in Git.

However, I do not think it is quite productive to go on listing perceived flaws in one system or the other. The systems are different, yes, but that is not the same as one of them being "superior". ;-)

Re: Git, Mercurial differences
Submitted by jnareb on Fri, 01/23/2009 - 21:16.

The question of tagging is a topic of ongoing discussion. Mercurial's implementation is not without problems, but the concept is actually quite appealing in its simplicity.

Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Besides, take a look at how .hgtags differs in behaviour from e.g. .hgignore.

As to synchronized Mercurial repository for Linux kernel: I wonder how they dealt with octopus merges...

Octopus merges
Submitted by bboissin on Mon, 01/26/2009 - 20:40.

As to synchronized Mercurial repository for Linux kernel: I wonder how they dealt with octopus merges...

Octopus merges look cool (that's what Linus said when he introduced them) but they do not serve any real purpose. You can always emulate them by doing repeated merges. I don't think you can cite this as a limitation of Mercurial; there are a bunch of limitations, but this is not one of them.

Re: octopus merges
Submitted by rmfendt on Mon, 01/26/2009 - 21:28.

Octopus merges look cool (that's what Linus said when he introduced them) but they do not serve any real purpose.

Exactly. Even the Git manual states that the normal way to merge is two-way merges and that the octopus algorithm is much more fragile. That is the reason I have not counted that as a must-have feature, the end-result being "Mercurial has the slightly cleaner user interface, Git is a bit more flexible" (and thus largely a matter of personal taste).
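
The claim in this thread that an octopus merge can be emulated by repeated two-way merges can be pictured on the revision graph alone. The Python sketch below is only an illustration under assumptions: the history dictionary, the changeset names and the functions octopus_merge and pairwise_merge are invented, and it compares graph shapes only, without merging any file content.

```python
from typing import Dict, List, Set

History = Dict[str, List[str]]  # hypothetical model: changeset -> parents


def ancestors(history: History, rev: str) -> Set[str]:
    """Return rev and everything reachable from it through parent links."""
    seen: Set[str] = set()
    stack = [rev]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(history.get(current, []))
    return seen


def octopus_merge(history: History, name: str, parents: List[str]) -> History:
    """One merge changeset with any number of parents."""
    merged = dict(history)
    merged[name] = list(parents)
    return merged


def pairwise_merge(history: History, name: str, parents: List[str]) -> History:
    """The same result built from a chain of two-parent merges."""
    merged = dict(history)
    current = parents[0]
    for step, other in enumerate(parents[1:], start=1):
        last = step == len(parents) - 1
        merge_name = name if last else f"{name}-step{step}"
        merged[merge_name] = [current, other]
        current = merge_name
    return merged


base: History = {"root": [], "x": ["root"], "y": ["root"], "z": ["root"]}
octopus = octopus_merge(base, "m", ["x", "y", "z"])
pairwise = pairwise_merge(base, "m", ["x", "y", "z"])

original = set(base)
same_history = (ancestors(octopus, "m") & original) == (ancestors(pairwise, "m") & original)
print(same_history)
```

Both variants end in a changeset "m" that contains the same original changesets; the pairwise variant merely records one extra intermediate merge, which is the price of a fixed two-parent record structure.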