A Subversion User Looks at Git

Subversion was, until yesterday, the only SCM system that I understood well enough to use. Today, I feel I can add Git to that list. The disclaimer on what follows is that my understanding comes mostly from reading documentation. Git appears to have an excellent documentation set but, if those documents mislead in some way, I have likely been misled too. Having said this, I'm not going to couch this in weasel words in order to appear circumspect. This is my current understanding of Git and its pros and cons. I may be wrong.

Basic Git Architecture

From an architectural perspective, Git is gloriously simple. There are four essential objects: blobs, trees, commits and tags:

A blob is strictly a piece of file content. I believe that blobs are generally segmented along file boundaries, but I haven't yet worked out if blobs are also used to track portions of a file. Blobs are named by the SHA1 hash of their contents. This can lead to a performance problem if your files are large - as a pathological case, I created a git repository of several 500MB AIFF files - it took rather a long time and ate all my RAM. That's hardly the normal case, however.
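
To make the naming concrete, here is a minimal Python sketch of how a blob's name can be computed. It assumes the object format as I understand it, in which the hash covers a short header (the word "blob", the content length and a NUL byte) followed by the content, rather than the bare content alone. The function name blob_id is my own invention, not something Git provides.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 name a blob with this content would get."""
    # The hash covers a small header ("blob <size>" plus a NUL byte)
    # followed by the raw bytes of the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The name depends only on the bytes, never on the file name or location,
# so two files with identical content share one blob.
print(blob_id(b"hello world\n"))
print(blob_id(b"hello world\n") == blob_id(b"hello world\n"))
```

Note that the file name plays no part in the hash. That is left to the tree, described next.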

A tree assembles blobs and other trees into a hierarchical structure, matching the on-disk hierarchy of your files. A tree is essentially a mapping between a blob's name (i.e. its SHA1) and a file name. Trees are stand-alone objects in the history of a project - they don't contain any information about where they came from.

The commit object refers to a tree - specifically, the state of the tree after that commit is applied - and contains some information about who committed and what was done. It is the commit object, rather than its related tree, which connects the commit to its predecessor (or predecessors, in the case of a merge).

The tag object simply collects the SHA1 sum of an object, the object's type and a symbolic name. My understanding is that you could, in principle, tag a blob, a commit or a tree. I'm not completely certain whether one should tag commits or trees, but I suspect commits would be the correct object. It's not clear that one can reach a commit from a given tree object.
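
The relationships between these objects can be sketched as a toy model in Python. This is only an illustration of who points at whom: the class names, the field names and the short object names such as "c1" and "t1" are all made up for the example, and none of it reflects Git's real storage format.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Tree:
    # Pairs of (file or directory name, name of a blob or sub-tree).
    entries: Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Commit:
    tree: str                  # name of the tree as it stands after this commit
    parents: Tuple[str, ...]   # none, one, or (for a merge) several commit names
    author: str
    message: str


@dataclass(frozen=True)
class Tag:
    target: str        # name of the tagged object
    target_type: str   # "commit", "tree" or "blob"
    name: str          # symbolic name, e.g. "v1.0"


def history(commits: Dict[str, Commit], start: str) -> Iterator[str]:
    """Walk back from a commit through its first parents, yielding commit names."""
    current: Optional[str] = start
    while current is not None:
        yield current
        parents = commits[current].parents
        current = parents[0] if parents else None


# Hypothetical two-commit history; real names would be SHA1 hashes.
store = {
    "c1": Commit(tree="t1", parents=(), author="me", message="initial import"),
    "c2": Commit(tree="t2", parents=("c1",), author="me", message="fix headers"),
}
print(list(history(store, "c2")))
```

In this model a commit leads you to its tree and to its parents, but nothing in a Tree points back at a commit. Getting from a tree to the commits that use it would mean searching through the commits, which is one reason I suspect commits are the natural thing to tag.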

One nice feature of Git is that it allows you to undo or change a commit after it has been made. Here's one example of where it's super useful: I work between a desktop and a laptop machine. Using Subversion, when I have to move machines, I commit my work in progress and then update the machine I'm moving onto. This is generally fine, but it means there are a lot of commits in the repository that represent points at which I wouldn't normally commit code - where things are broken, incomplete or don't compile. With Git and some care, you can commit your work in progress, pull the changes to another machine and then undo the last commit.
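
Why undoing a commit is cheap becomes clearer with a toy model. The sketch below assumes, as I understand it, that a branch is just a name pointing at one commit, so "undoing" the last commit amounts to pointing the branch back at that commit's parent. The dictionaries, the commit names and the function undo_last_commit are hypothetical and exist only for this example; in Git itself I believe "git reset" is the command that does this job.

```python
from typing import Dict, Optional

# Toy history: each commit name maps to its parent's name (None for the first).
parents: Dict[str, Optional[str]] = {"c1": None, "c2": "c1", "wip": "c2"}

# A branch is only a name that points at one commit.
branches: Dict[str, str] = {"master": "wip"}


def undo_last_commit(
    branches: Dict[str, str],
    parents: Dict[str, Optional[str]],
    branch: str,
) -> Dict[str, str]:
    """Return a copy of the branch table with 'branch' moved back one commit."""
    parent = parents[branches[branch]]
    if parent is None:
        raise ValueError("the first commit has no parent to go back to")
    updated = dict(branches)
    updated[branch] = parent
    return updated


after = undo_last_commit(branches, parents, "master")
print(after["master"])
```

The work-in-progress commit is not destroyed by this; the branch simply no longer leads to it. The care mentioned above comes in when somebody else has already pulled the commit you are undoing.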

The Directory Index

One concept that exists in Git that doesn't exist in Subversion in quite the same way is the notion of the Directory Cache, which the documentation also calls the index. The directory cache is a file which describes a tree, although the tree which it describes may not exist in the repository yet. As you work, you add changes to this cache and when you commit, the tree described by the directory cache is written to the repository with an associated commit object. The key line from the documentation here is: "creating a new tree always involves a controlled modification of the index file" (ref: core-intro.txt).

The index file is not so very different in practice from Subversion's idea of having added files that are not yet committed. The index file is Git's representation of the same.

Having said that, Git's notion of "adding" a file is slightly different from Subversion's. In SVN, you're telling SVN to "start tracking file X". In Git, you're saying "take a snapshot of the content of file X and store it in the index for the next commit". As a result, you have to - at least in principle - perform a "git add filex.c" every time you change filex.c. There is, however, some syntactic sugar in the form of "git commit -a" which adds all the modifications to known files and commits in one step.

This is pretty powerful: how often have you done some work on a feature and cleaned up some headers as you went by? In Subversion, when you're done, you have to look at each file you've changed and perhaps do a number of commits to specific files. In Git, you can just decide not to "git add" those clean-ups to the index until after you've committed the meat of your work.
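
The snapshot behaviour of "git add" can be sketched with a toy index in Python. Everything here is an assumption made for illustration: the ToyIndex class, its method names and the shortened hash used as a content name are mine, and the working copy is just a dictionary of file contents.

```python
import hashlib
from typing import Dict


def snapshot(content: bytes) -> str:
    """Stand-in for storing a blob: return a short name derived from the content."""
    return hashlib.sha1(content).hexdigest()[:8]


class ToyIndex:
    """A toy index: path -> name of the content snapshot taken at 'add' time."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def add(self, path: str, working_copy: Dict[str, bytes]) -> None:
        # Records the content as it is right now; later edits are not seen.
        self.entries[path] = snapshot(working_copy[path])

    def add_all_tracked(self, working_copy: Dict[str, bytes]) -> None:
        # Roughly the "add" half of "git commit -a": re-add every known file.
        for path in list(self.entries):
            self.add(path, working_copy)

    def commit(self) -> Dict[str, str]:
        # The committed tree is a copy of the index, not of the working copy.
        return dict(self.entries)


working_copy = {
    "filex.c": b"int main(void) { return 0; }\n",
    "util.h": b"#pragma once\n",
}
index = ToyIndex()
index.add("filex.c", working_copy)
index.add("util.h", working_copy)
first = index.commit()

# Change both files, but stage only the real work; the header clean-up waits.
working_copy["filex.c"] = b"int main(void) { return 1; }\n"
working_copy["util.h"] = b"#pragma once\n/* tidied */\n"
index.add("filex.c", working_copy)
second = index.commit()

print(second["filex.c"] != first["filex.c"])  # the staged change was committed
print(second["util.h"] == first["util.h"])    # the clean-up was not
```

The design point is that a commit is built from the index rather than from whatever happens to be on disk, which is what makes it possible to leave the clean-up out of the first commit.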

Branching and Merging

Branches, and merges between those branches, are a central concept in Git. Given that Git was developed to track the Linux kernel, this is hardly surprising.

Given that NIB files are not mergeable with any common merge algorithm, it's not clear that this style of working would be terribly good for Cocoa development. The documentation does not say a whole lot about what happens to binary files. It's not that Git is unsuitable for handling NIBs - far from it. I just observe that the approach of frequently repeated branch-and-merge operations rather depends on a high probability of clean automatic merges to be bearable. The fact that a git merge will automatically commit in the absence of conflicts suggests that this expectation underlies the design.
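
To show why an unmergeable file hurts, here is a toy three-way merge in Python that treats every file as an opaque lump. This is emphatically not Git's merge algorithm, which as far as I understand goes on to merge text files line by line. It is only the whole-file case, which is all that is left when a file such as a NIB can't be merged internally. The function name and the content names like "m1" and "n1" are made up for the example.

```python
from typing import Dict, Set, Tuple


def merge_trees(
    base: Dict[str, str],
    ours: Dict[str, str],
    theirs: Dict[str, str],
) -> Tuple[Dict[str, str], Set[str]]:
    """Three-way merge of {path: content name} maps, one whole file at a time."""
    merged: Dict[str, str] = {}
    conflicts: Set[str] = set()
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree (including both deleting)
        elif o == b:
            result = t            # only their side changed
        elif t == b:
            result = o            # only our side changed
        else:
            conflicts.add(path)   # both sides changed, in different ways
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"main.m": "m1", "Main.nib": "n1"}
ours = {"main.m": "m2", "Main.nib": "n2"}
theirs = {"main.m": "m1", "Main.nib": "n3"}

merged, conflicts = merge_trees(base, ours, theirs)
print(merged)
print(conflicts)
```

Here main.m merges cleanly because only one side touched it, while Main.nib conflicts because both sides did. With opaque files, any pair of branches that both touch the same NIB ends in a conflict, so the automatic commit never gets a chance to happen.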

Having said that, it's no easier to merge a NIB in Subversion. It's just that merging isn't so commonplace an operation in SVN. The correct solution, of course, is for Apple to make NIB files more easily mergeable.

Repository Layout

One thing that I already love about Git is that it does not depend on putting a dot-directory in every directory in a working copy (recall that every Git working copy is also a repository). There's one .git directory at the root of the repository and absolutely nothing else. Anyone who has had to check RTFd files into Subversion and then edit those files with TextEdit will be cheering right about now. For those who haven't, understand that RTFd files are actually bundles, and bundles are directories. Thus, Subversion adds a .svn directory inside your RTFd file. When TextEdit saves this file, the .svn directory is lost and the file appears disconnected from its history.
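
With a single .git directory at the root, a tool working in any subdirectory has to find the repository by looking upwards. The Python sketch below shows that idea under my own assumptions: the function name find_repository_root and the directory names are hypothetical, and the search is a simplification rather than Git's actual lookup code.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_repository_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above 'start' that contains '.git'."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        if (directory / ".git").exists():
            return directory
    return None


# Build a throwaway project with a bundle-style directory inside it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    bundle = root / "Docs" / "ReadMe.rtfd"
    bundle.mkdir(parents=True)
    (root / ".git").mkdir()

    # Nothing needs to live inside the bundle for it to be found.
    print(find_repository_root(bundle) == root.resolve())
```

Because nothing is stored inside the bundle itself, an application that rewrites the whole bundle on save has nothing to destroy.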

For this fact alone, I'm looking at the implications of switching to Git.

What's Missing

Currently, the only Subversion feature that appears to me to be obviously missing from Git is the concept of svn:externals. I use externals a fair bit in my SVN projects, and I'm not yet certain how one could replicate them in Git.

You can add so-called "remote tracking branches" in Git, in which your repository tracks a branch in the repository you originally created yours from or, indeed, arbitrary branches from arbitrary repositories. This lets you switch your working copy to another branch from somewhere else, but it doesn't let you attach an arbitrary tree to an arbitrary point in your tree.

I suspect the approach might be to import source from some remote repository, create a tracking branch and then merge between the tracking branch and some subdirectory of your working copy. I have not yet seen any documentation on how to do this, nor on how to do it if the other repository is not using Git but, say, Subversion.

Conclusions

Git's rethink of the entire content management problem enables some powerful new capabilities. I write this whilst on holiday with very sporadic net access. I've been coding away in my Subversion-managed projects, but unable to commit in sensible chunks without internet access. With Git, there would be no problem whatsoever.

Because few operations depend on the network, Git's performance is excellent for most common operations and cases.

Git is confusing and alien to someone raised on CVS and Subversion, that much is certain. I feel like I understand the component parts of Git, but that I'm not necessarily entirely understanding their implications and interactions just yet. It also feels like Git gives you slightly more rope with which to hang yourself, but I do recall feeling that way about Subversion when I started using it. With SVN, I've come to trust that my usual workflow and conventions don't produce broken results and, when I'm doing something new there are good docs to back me up. I suspect I could reach the same position with Git quite easily.

Finally, I continue to ask myself whether using Git would really confer serious advantages to a (usually) solo Cocoa developer. The answer is that I'm currently not sure, for the following reasons:


  • I rarely have several branches in active development at any one time. Even if I have multiple SVN branches, I'm usually only working on one at a time.
  • Git offers nothing new to the problem of merging NIBs.
  • Git's optimism about the probability of a clean automatic merge is somewhat less justified for Cocoa projects than for, say, the Linux kernel.
  • I don't often collaborate with large numbers of people on projects.
  • Git's pretty confusing, even after reading the docs twice.
  • Everyone else uses Subversion (see the point about svn:externals).


Where does Git provide compelling improvements?


  • Much cleaner handling of bundle files.
  • The ability to revert a commit is something I've, er, occasionally had reason to wish existed in SVN :-)
  • Working disconnected on a laptop no longer requires either (a) a gigantic checkin once you get home or (b) picking apart your changes to commit separate features.
  • Performance is a feature.
  • The ability to explicitly define the contents of a commit in a structure other than the current state of the working copy is pretty nice.


There will certainly be more on this as time goes on. I've been hearing too much buzz about Git from people whom I respect to ignore it. I don't hear anything about arch, monotone, BitKeeper, codeville, SVK or darcs from anywhere except the nerdiest of SCM nerds.