Understanding Git Submodules

I've been using Git full-time for over a year now, but I had not yet adopted Git's submodule feature for my projects. Git submodules are functionally similar to Subversion's svn:externals mechanism, but submodules do appear slightly alien and confusing at first (and second) glance.

So I went deep and here, for the internet, is my best explanation: Git submodules are basically the same as svn:externals, except that Git submodules are locked to a specific revision and don't automatically track the external project's HEAD.

Git submodules behave more like svn:externals that are managed by Piston than like Subversion's default externals.

In my experience, to understand Git, you have to understand its implementation. Git is very driven by its model layer and, once you understand the model layer, I find that everything else follows quite logically.

As you may know, Git stores file contents as blobs, and trees which describe the layout of those blobs in the filesystem; a commit points at one tree. Every object, commits included, is identified by the SHA-1 hash of its contents. I simplify slightly, but that's the core. Keep this in mind.
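To make "identified by the hash of its contents" concrete, here is a small sketch. It assumes git and sha1sum are on your PATH (on macOS, shasum plays the same role) and that Git is using its default SHA-1 object format. The variable names are mine. Strictly speaking, Git hashes a short header followed by the contents, which is part of the "simplify slightly".

```shell
# Git's ID for a blob holding the six bytes "hello\n".
git_id=$(printf 'hello\n' | git hash-object --stdin)

# The same ID by hand: SHA-1 over the header "blob 6", a NUL byte, then the contents.
manual_id=$(printf 'blob 6\000hello\n' | sha1sum | cut -d' ' -f1)

echo "git:    $git_id"
echo "manual: $manual_id"
```

The two lines should print the same 40-character ID. Commits and trees are hashed the same way, with their own header.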

Git submodules are implemented using two moving parts: the .gitmodules file and a special kind of tree entry. These together triangulate a specific revision of a specific repository which is checked out into a specific location in your project.

The .gitmodules file contains one entry per submodule, and each entry has two properties:

[submodule "FooKit"]
path = FooKit
url = git@github.com:fspeirs/fookit.git

The submodule's definition contains a path, which is the location in your repository where the submodule should be placed. The `url` is the URL of the repository to clone from. This example is a GitHub URL but it could equally be a path to a repository on your system. Thus far, Git knows where to get your submodule and where to put it.
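Because .gitmodules uses the same syntax as Git's own config files, you can query it with "git config -f" rather than parsing it yourself. A minimal sketch, recreating the FooKit entry above in a temporary directory (the variable names are just for illustration):

```shell
set -eu
dir=$(mktemp -d)

# Recreate the example .gitmodules entry from above.
cat > "$dir/.gitmodules" <<'EOF'
[submodule "FooKit"]
    path = FooKit
    url = git@github.com:fspeirs/fookit.git
EOF

# Keys take the form submodule.<name>.<property>.
sub_path=$(git config -f "$dir/.gitmodules" --get submodule.FooKit.path)
sub_url=$(git config -f "$dir/.gitmodules" --get submodule.FooKit.url)

echo "path: $sub_path"
echo "url:  $sub_url"
```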

The second question is which commit should be checked out into the submodule's path. You tell Git this by adding the submodule path to your index and committing.

Let's try an example. This is a directory which contains a Git repository called "a" and another called "super". We will add "a" as a submodule of "super":

[/tmp/git]$ ls -l
total 0
drwxr-xr-x 4 fspeirs wheel 136 May 11 11:03 a
drwxr-xr-x 4 fspeirs wheel 136 May 11 11:03 super

The first thing to do is run "git submodule add" in super:

[/tmp/git/super(master)]$ git submodule add /tmp/git/a ProjectA
Initialized empty Git repository in /private/tmp/git/super/ProjectA/.git/

[/tmp/git/super(master)]$ git submodule status
-85ab8ba4edf9168ab051ded7ddbbe20861b71528 ProjectA

[/tmp/git/super(master)]$ ls ProjectA/

Having done that, let's look at the impact of that command on the project "super":

[/tmp/git/super(master)]$ git status
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
# new file: .gitmodules
# new file: ProjectA

We have the new .gitmodules file, which should be checked in, and a new file called "ProjectA", which is the "path" of our submodule. Let's commit these two now:

[/tmp/git/super(master)]$ git commit -m "added submodule"
[master]: created ffba648: "added submodule"
2 files changed, 4 insertions(+), 0 deletions(-)
create mode 100644 .gitmodules
create mode 160000 ProjectA

Note the mode "160000" on ProjectA - that's a special mode for a certain kind of entry in the Git index, sometimes called a "gitlink". Instead of pointing at a blob or a tree, it records the ID of a commit in another repository.

Now, if we look at the contents of the Git index, we'll see the SHA-1 for the tracked files:

[/tmp/git/super(master)]$ git ls-files --stage
100644 831cdc0dc1b88e69aa9943cf09907ae1bcd031fc 0 .gitmodules
160000 85ab8ba4edf9168ab051ded7ddbbe20861b71528 0 ProjectA
100644 16f5c2d3aa9656fc424352e4cfaa2523c809778b 0 super.txt

Notice the SHA for ProjectA - 85ab8ba - this is the SHA-1 of the commit to which the submodule ProjectA is locked. Commit 85ab8ba does not exist in the "super" repository - it refers to a commit in the submodule's repository.
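If you want to poke at this yourself, the following sketch rebuilds the "a" and "super" repositories in a temporary directory and inspects the entry. The demo identity, the "work" directory and the "gitf" helper are my own scaffolding, not part of Git. The helper exists because recent Git versions refuse to clone a submodule from a local path unless the "file" transport is explicitly allowed.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: git with the local "file" transport allowed.
gitf() { git -c protocol.file.allow=always "$@"; }

# Two throwaway repositories, mirroring "a" and "super" above.
git init -q a
echo a > a/a.txt
git -C a add a.txt
git -C a commit -q -m "initial a"
git init -q super
echo super > super/super.txt
git -C super add super.txt
git -C super commit -q -m "initial super"

gitf -C super submodule --quiet add "$work/a" ProjectA
git -C super commit -q -m "added submodule"

# The committed tree lists ProjectA with mode 160000 and type "commit".
entry=$(git -C super ls-tree HEAD ProjectA)
recorded=$(git -C super rev-parse HEAD:ProjectA)
sub_head=$(git -C super/ProjectA rev-parse HEAD)

# The recorded commit is not an object in the superproject's own repository.
if git -C super cat-file -e "$recorded" 2>/dev/null; then in_super=yes; else in_super=no; fi

echo "$entry"
echo "recorded in super:  $recorded"
echo "submodule HEAD:     $sub_head"
echo "object in super:    $in_super"
```

The recorded ID and the submodule's HEAD should match, and the last line should say "no".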

So Git now knows the three things required to set up your submodules when cloning a project:

  • The what comes from the "url" property in the submodule's entry in your .gitmodules file.
  • The where comes from the corresponding "path" property in .gitmodules.
  • The when, if you will, comes from the SHA-1 stored in the superproject's index for the submodule's path.
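Here is how those three pieces come together on a fresh clone, as a sketch under the same assumptions as before: throwaway repositories in a temporary directory, a demo identity, and a hypothetical "gitf" helper that allows the local "file" transport for submodule clones.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
gitf() { git -c protocol.file.allow=always "$@"; }

git init -q a
echo a > a/a.txt
git -C a add a.txt
git -C a commit -q -m "initial a"
git init -q super
echo super > super/super.txt
git -C super add super.txt
git -C super commit -q -m "initial super"
gitf -C super submodule --quiet add "$work/a" ProjectA
git -C super commit -q -m "added submodule"

# A fresh clone of the superproject gets an empty ProjectA directory.
git clone -q super clone
before=$(git -C clone submodule status)

# init copies the url from .gitmodules into the clone's config;
# update clones that url into the path and checks out the recorded commit.
git -C clone submodule --quiet init
gitf -C clone submodule --quiet update
after=$(git -C clone submodule status)

echo "before: $before"
echo "after: $after"
```

Before init and update, the status line starts with "-". Afterwards it starts with a space and ProjectA contains a.txt.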

Working in a Submodule

The checked out submodule is, of course, a full Git repository in itself and you should treat it that way. It is perfectly possible to make changes in your checked-out submodule. As you commit in your submodule, the SHA-1 of the submodule's HEAD will advance away from the SHA-1 that the superproject has stored in its index.

To return to the example, suppose some change is made in ProjectA:

[/tmp/git/super(master)]$ cd ProjectA/
[/tmp/git/super/ProjectA(master)]$ echo "b" >> a.txt
[/tmp/git/super/ProjectA(master)]$ git commit -a -m "Added B"
[master]: created 82b6450: "Added B"
1 files changed, 1 insertions(+), 0 deletions(-)
[/tmp/git/super/ProjectA(master)]$ cd ..
[/tmp/git/super(master)]$ git submodule init
Submodule 'ProjectA' (/tmp/git/a) registered for path 'ProjectA'
[/tmp/git/super(master)]$ git submodule status
+82b64501654dca53ba570827d8d3e7d465abbae5 ProjectA (heads/master)
[/tmp/git/super(master)]$ git ls-files --stage | grep ProjectA
160000 85ab8ba4edf9168ab051ded7ddbbe20861b71528 0 ProjectA

Notice that, now, the SHA-1 of the submodule's HEAD is at 82b6450, whilst the superproject is expecting 85ab8ba. There are two ways Git shows you that you're out of sync:

  • "git submodule status" will show a "+" in front of the SHA-1 of the HEAD of any submodule that has advanced from the SHA-1 stored in the superproject.
  • Running "git status" in the superproject will show the submodule as modified.
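Both checks are easy to script. The sketch below rebuilds the example in a temporary directory, commits inside the submodule, and captures what each command reports. As before, the demo identity and the "gitf" helper (which allows the local "file" transport) are my own assumptions.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
gitf() { git -c protocol.file.allow=always "$@"; }

git init -q a
echo a > a/a.txt
git -C a add a.txt
git -C a commit -q -m "initial a"
git init -q super
echo super > super/super.txt
git -C super add super.txt
git -C super commit -q -m "initial super"
gitf -C super submodule --quiet add "$work/a" ProjectA
git -C super commit -q -m "added submodule"

# Advance the submodule's HEAD without telling the superproject.
echo b >> super/ProjectA/a.txt
git -C super/ProjectA commit -q -a -m "Added B"

# Check 1: a leading "+" in the submodule status line.
status_line=$(git -C super submodule status)
# Check 2: the superproject sees ProjectA as modified.
porcelain=$(git -C super status --porcelain)

echo "$status_line"
echo "$porcelain"
```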

If you want to commit the superproject to using the new HEAD of the submodule, simply add and commit the submodule's directory as you would any other file:

[/tmp/git/super(master)]$ git submodule status
+82b64501654dca53ba570827d8d3e7d465abbae5 ProjectA (heads/master)
[/tmp/git/super(master)]$ git add ProjectA
[/tmp/git/super(master)]$ git commit -m "Advanced ProjectA to new HEAD"
[master]: created 37750a6: "Advanced ProjectA to new HEAD"
1 files changed, 1 insertions(+), 1 deletions(-)
[/tmp/git/super(master)]$ git ls-files --stage
100644 831cdc0dc1b88e69aa9943cf09907ae1bcd031fc 0 .gitmodules
160000 82b64501654dca53ba570827d8d3e7d465abbae5 0 ProjectA
100644 16f5c2d3aa9656fc424352e4cfaa2523c809778b 0 super.txt

Notice how 'git ls-files --stage' and 'git submodule status' now show the same SHA-1 for ProjectA?

Gotchas for those used to svn:externals

The big thing to remember is that, unlike svn:externals, pulling new commits into your superproject does not update the working copies of its submodules. If you think about it, this makes sense: the submodules are locked to specific commits in their respective repositories. When a pull changes the commit a submodule is locked to, you have to run 'git submodule update' yourself to check that commit out.
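This gotcha can be reproduced in a few steps. The sketch below uses throwaway repositories again, with a demo identity and a hypothetical "gitf" helper that allows the local "file" transport. One clone advances the submodule, a second clone pulls the superproject, and the second clone's submodule stays where it was until "git submodule update" is run.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
gitf() { git -c protocol.file.allow=always "$@"; }

git init -q a
echo a > a/a.txt
git -C a add a.txt
git -C a commit -q -m "initial a"
git init -q super
echo super > super/super.txt
git -C super add super.txt
git -C super commit -q -m "initial super"
gitf -C super submodule --quiet add "$work/a" ProjectA
git -C super commit -q -m "added submodule"

# A second working copy of the superproject, with its submodule checked out.
git clone -q super clone
gitf -C clone submodule --quiet update --init

# Upstream "a" gains a commit, and "super" is advanced to it.
echo b >> a/a.txt
git -C a commit -q -a -m "Added B"
git -C super/ProjectA pull -q --ff-only
git -C super add ProjectA
git -C super commit -q -m "Advanced ProjectA to new HEAD"

# Pulling the superproject moves the recorded commit, not the checkout.
gitf -C clone pull -q --ff-only
stale=$(git -C clone submodule status)

# Only this step brings the submodule's working copy up to date.
gitf -C clone submodule --quiet update
fresh=$(git -C clone submodule status)

echo "after pull:   $stale"
echo "after update:$fresh"
```

After the pull, the status line carries the "+" marker. After the update, it starts with a space again.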

It's also important to remember the distributed nature of what you're doing. If you advance HEAD in a submodule and then commit that change in the superproject, push the submodule's commits before you push the superproject's. If you don't, your superproject will contain references to commits that only exist in your local clone of the submodule.
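One way to check before pushing the superproject is to ask the submodule which of its remote-tracking branches already contain the recorded commit. The sketch below assumes a bare repository, "a.git", standing in for the submodule's shared remote. The demo identity and the "gitf" helper (which allows the local "file" transport) are again my own scaffolding.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
gitf() { git -c protocol.file.allow=always "$@"; }

git init -q a
echo a > a/a.txt
git -C a add a.txt
git -C a commit -q -m "initial a"
git clone -q --bare a a.git

git init -q super
echo super > super/super.txt
git -C super add super.txt
git -C super commit -q -m "initial super"
gitf -C super submodule --quiet add "$work/a.git" ProjectA
git -C super commit -q -m "added submodule"

# Commit inside the submodule, then record the new commit in the superproject.
echo b >> super/ProjectA/a.txt
git -C super/ProjectA commit -q -a -m "Added B"
git -C super add ProjectA
git -C super commit -q -m "Advanced ProjectA to new HEAD"

# No remote-tracking branch contains the recorded commit yet.
recorded=$(git -C super rev-parse HEAD:ProjectA)
before=$(git -C super/ProjectA branch -r --contains "$recorded")

# Push the submodule first; after that the superproject is safe to push.
git -C super/ProjectA push -q origin HEAD
after=$(git -C super/ProjectA branch -r --contains "$recorded")

echo "before push: [$before]"
echo "after push:  [$after]"
```

An empty list means the commit exists only locally. Git can also enforce this for you: 'git push --recurse-submodules=check' in the superproject refuses to push when a recorded submodule commit has not been pushed anywhere.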

Wrapping Up

This post does not attempt to cover every command for working with Git submodules. In particular, you should be aware of the 'git submodule init' and 'git submodule update' subcommands - read the git-submodule man page for those.

Git submodules really aren't that complex or scary. They have comparatively few moving parts and, to my mind, enforce a certain welcome stability and discipline in your use of external projects.