
We have a repository in which a couple of directories were extracted as submodules two years ago.

Because git submodules cause too many headaches, we decided to undo the extraction and bring the directories back into the parent repository.

Now the question is: what is the best way to do so while keeping all history?

I was thinking of adding the submodules as remotes and then cherry-picking all the changes. But for that I would need to tell git to treat the paths in each commit as relative to the submodule's directory, not to the parent repo's root.

Does anyone know a way to do it with cherry-pick, or any other clever way?

Thanks a lot!

1 Answer

You can do this with git filter-branch, using the example in the man page or the slightly modified version in this answer. This is the man-page version in git v1.8.2:

To move the whole tree into a subdirectory, or remove it from there:

git filter-branch --index-filter \
    'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
        git update-index --index-info &&
        mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
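
To see what the sed expression in that filter does, here is a small sketch that applies the same substitution to a single index line outside of git. The object name and path are made-up placeholders, and a literal tab stands in for \t so the result should not depend on the sed implementation.

```shell
# One line in the format printed by `git ls-files -s`:
#   <mode> <object> <stage><TAB><path>
# The object name and the path below are placeholders for illustration.
tab=$(printf '\t')
entry="100644 1111111111111111111111111111111111111111 0${tab}src/main.c"

# Same substitution as in the man-page filter: match the tab (plus an
# optional opening quote, which git emits for unusual path names) and
# append the new directory prefix right after it.
rewritten=$(printf '%s\n' "$entry" | sed "s-${tab}\"*-&newsubdir/-")

# Show the rewritten index line.
printf '%s\n' "$rewritten"
```

The path in the rewritten line starts with newsubdir/; git update-index --index-info then builds the new index from lines in this form.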

First, add each submodule as a remote in the parent repo, then check out the master branch of each as a local tracking branch (e.g. submoduleA-master, submoduleB-master, etc.). Git will throw a warning because the branches don't share history, but will otherwise let you proceed. Rewrite each submodule branch's history into the appropriate subdirectory and merge it into the parent's master. In the end, you'll have one merge commit per subdirectory and a single, cohesive history in the parent repo.

It sounds far more complex than it is. Be sure to make backups in case something goes wrong. Script the whole thing so you can try it until you get it right. The rough order of execution for every submodule is:

git remote add submodule submodule_remote
git fetch submodule          # Needed so that submodule/master exists locally.
git checkout -b submodule-master submodule/master
git filter-branch ...        # With the index-filter described above.
                             # Depending on length of history, this could
                             # take quite a while to process.
git checkout master          # Get back on parent's master.

Now you're faced with a choice: do you rewrite the parent to remove all traces of the submodules, or not? If not, remove the submodule from the parent repository using a method appropriate for your git version, and then run git merge submodule-master (git 2.9 and later refuse to merge unrelated histories unless you pass --allow-unrelated-histories). If you also want to erase all submodule commits from the history, you can rewrite the parent as well with git filter-branch.
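
For the second option, one possible shape of the parent rewrite is an index filter that drops the submodule entry and .gitmodules from every commit. This is a sketch on a throwaway repository, not a tested recipe for your history: the path libfoo is hypothetical, the submodule entry is faked with git update-index, and removing .gitmodules outright assumes it lists no other submodules you want to keep.

```shell
#!/bin/sh
# Throwaway rehearsal: erase a submodule entry from a parent's history.
# Repository and path names are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # ignored by older git versions

git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/master
echo parent > README && git add README && git commit -q -m "parent: initial"

# Fake a submodule: a gitlink entry (mode 160000) at libfoo plus .gitmodules.
# The commit id used for the gitlink is arbitrary; it only has to look valid.
link=$(git rev-parse HEAD)
mkdir libfoo                              # empty, like an uninitialised submodule
git update-index --add --cacheinfo 160000 "$link" libfoo
printf '[submodule "libfoo"]\n\tpath = libfoo\n\turl = ../libfoo\n' > .gitmodules
git add .gitmodules
git commit -q -m "parent: add libfoo submodule"

# Drop the gitlink and .gitmodules from every commit. --ignore-unmatch keeps
# the filter from failing on commits that have neither; --prune-empty removes
# commits that end up with no changes at all.
git filter-branch -f --prune-empty --index-filter '
  git rm -q --cached --ignore-unmatch libfoo .gitmodules
' HEAD >/dev/null

# Inspect what is left in the rewritten history.
git log --oneline
git ls-tree -r HEAD
```

Run this kind of rewrite before merging the rewritten submodule branches, so that the libfoo/ path in the parent is free when the real files arrive.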

I once did this for 35 disparate repositories. Here's a tip: spend $10 on a few hours of cluster-compute in AWS. git filter-branch is extremely RAM-bound. Something your laptop couldn't do in 20 hours, an AWS cluster-compute instance could finish over lunch. It's a beautifully simple, cheap way to conduct operations like this.

One final note. If you use BSD sed there's a fair chance the \t substitution in the man page will fail. Jeff King's perl version will work around that issue:

git filter-branch --index-filter '
  git ls-files -s |
    perl -pe "s{\t\"?}{$&newsubdir/}" |
    GIT_INDEX_FILE="$GIT_INDEX_FILE.new" git update-index --index-info &&
  mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
' HEAD
