3

The problem

I write a lot of exploratory code in my research. As I go along, I put functionality that I'd like to reuse in a central location. A project might look like this:

./mylib
./exploration
    /experiment_1
    /experiment_2
    /experiment_3

Where each experiment uses some functionality from mylib.

Now I come along and start my fourth experiment. In the process, I may need to change my library in some backwards-incompatible way. Then I can't re-run my first three experiments without updating their code to be compatible with the newest library.

Note: As of now, I keep mylib in version control, and exploration in version control, using git. This means that all of the experiments are in the same repository. This is done so that a single push or fetch and merge in experiments syncs all of my experiments between computers. I feel that there may be a better way, but that might be for another question...

Possible solutions

  • I could bite the bullet and update the old experiments manually whenever I need to run them (bad, tedious, but straightforward).
  • I could "vendorize" my library by copying it whenever I make a new experiment. (Bad, bugfixes have to be inserted into each copy).
  • Since I keep my library in version control, I could tag points in the library's history by whatever is required by an experiment. When I want to run experiment n, I'd checkout tag n. (Better, but what if I want to run two experiments simultaneously? It also seems like there should be a way to automatically use a specific version of the library.)
  • Whenever I start an experiment, I'll make a new branch in the library. In each experiment folder, I'll clone the library repo and checkout the correct branch. (This seems reasonable, though it is perhaps wasteful of space, since I'm duplicating all branches when I clone. Also, I might have a lot of experiments, meaning that there will be lots and lots of branches in my repository, cluttering things unnecessarily.)

Should I reconsider any of the above solutions?

I have also heard about git's subtrees and submodules, and while they sound like they might be the answer to my problem, I want to get the input of more knowledgeable coders before sinking time into a rabbit-hole.

7 Answers

4

This is not so much a question of the version control system you are using, but more of your general configuration management strategy. First think about your strategy, then check how you map this to your VCS.

Each version of your library you release into "production" should have a unique version number. You should keep track of which of your "experiments" uses which version of the lib, and which experiments you still have "under maintenance". This lets you find out for which older versions of your lib you may need "maintenance releases", and for which you can omit them. The version numbers can be included in the file name of your lib if that helps you use several versions in parallel (whether that's necessary depends on your physical library management / resolving strategy).

Let's say you have 3 versions lib_v1.0, lib_v2.0 and lib_v3.0, used by experiment1, experiment2 and experiment3 respectively. Now, during development of experiment4, you make incompatible changes for lib_v4 and find a bug which affects all former versions. Let's further assume you immediately fix that bug in V4. Now you have the following alternatives:

  • don't fix the bug in older versions. For example, experiment1 is not "in production" any more - then there is no need to fix the bug in V1.0. Or you know for sure experiment1 is not affected by the bug, and you know "experiment1" is the only program using your lib, then there is also no need to fix the bug in V1.0

  • upgrade all affected experiments to your current lib_v4. This can become tedious, but with @RobertHarvey's suggestion of using an Adapter (or of avoiding breaking changes) it may be a feasible solution

  • if upgrading the affected experiments is too much effort, consider porting the bugfix down to your older lib versions (creating lib V1.1, V2.1, V3.1)

Of course, you can mix these strategies: experiments 2 and 3 may be easily switched to V4, while experiment1 needs to stick to lib V1; then you only have to port the bugfix down to V1. That leaves you with lib_V1.1 in maintenance and lib V4 in active development, with no need to maintain V2 and V3 any more.

What you should avoid is to have more than one version tree of your lib under "active development". When you decide to improve an older experiment, either stick with the library version it is currently linked to, or switch to the newest library version for this older experiment.

A remark about version control: this development model maps easily to each VCS which supports tagging and maintenance branches (in other words: basic features supplied by any decent software worth the title "VCS").
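The bookkeeping described in this answer can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `versions_needing_fix` are made up for this sketch, and the version numbers mirror the lib_v1.0 to lib_v4 example.

```python
# Hypothetical record of which library version each experiment is pinned to.
BEFORE = {"experiment1": "1.0", "experiment2": "2.0",
          "experiment3": "3.0", "experiment4": "4.0"}

# The same record after experiments 2 and 3 were switched to V4.
AFTER = {"experiment1": "1.0", "experiment2": "4.0",
         "experiment3": "4.0", "experiment4": "4.0"}


def versions_needing_fix(pinned, maintained, fixed_in):
    """Return the library versions that still need a maintenance release.

    A version needs a backported fix if at least one experiment that is
    still under maintenance is pinned to it and the bug is not fixed there.
    """
    in_use = {version for exp, version in pinned.items() if exp in maintained}
    return sorted(in_use - set(fixed_in))


maintained = set(BEFORE)  # assume all four experiments are still maintained

print(versions_needing_fix(BEFORE, maintained, fixed_in={"4.0"}))
print(versions_needing_fix(AFTER, maintained, fixed_in={"4.0"}))
```

With the first record, three old versions would need a backport; after moving experiments 2 and 3 to V4, only V1 remains.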

2

After spending the afternoon reading, it looks like git subtree is what I'm after. In this approach, I keep my library in version control with git, and each experiment goes into a separate repository. When I start an experiment, I pull in the latest version of the library with git subtree add. Each experiment has its own version of the library. If I want to update an experiment to use a new version of the library, I can make a branch, do a git subtree pull, patch up my experiment code to work with the new interface, and merge back into master.

The great thing about git subtree is that the history of the experiment is tied to the history of the library that I am using. This is incredibly useful in exploratory research where reproducibility is paramount. For example, I might run an experiment with version 1 of a library and get a certain result. Later, after updating the experiment's code and moving to version 2 of the library, I might re-run the experiment and find, to my surprise, that the result is different. With subtrees, if I have the commit hash that produced my original result, I can restore my experiment to the exact state it was in when I ran it originally, library and all, by simply checking out that commit.
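As a rough sketch of the workflow in this answer, the two git subtree invocations could be assembled from Python as follows. The helper names, the `mylib` prefix and the `../mylib` path are assumptions for illustration; only `git subtree add`, `git subtree pull`, `--prefix` and `--squash` are real git options. The sketch only builds and prints the command, so it runs without a repository.

```python
import subprocess


def subtree_command(action, prefix, repo, ref, squash=False):
    """Build a `git subtree add` or `git subtree pull` argument list."""
    if action not in ("add", "pull"):
        raise ValueError("action must be 'add' or 'pull'")
    cmd = ["git", "subtree", action, f"--prefix={prefix}"]
    if squash:
        # Optional: collapse the library history into a single commit.
        cmd.append("--squash")
    cmd += [repo, ref]
    return cmd


def run(cmd, cwd="."):
    """Execute the command inside an experiment repository (needs git)."""
    subprocess.run(cmd, cwd=cwd, check=True)


# Hypothetical layout: the library repository sits next to the experiment.
print(" ".join(subtree_command("add", "mylib", "../mylib", "master")))
print(" ".join(subtree_command("pull", "mylib", "../mylib", "master")))
```

Calling `run(...)` on the first command when starting an experiment, and on the second from a branch when upgrading, would follow the steps described above.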

1
  • That's putting the cart before the horse - as Doc Brown says, work out your strategy /then/ investigate the best way to implement it in a VCS.
    – Gwyn Evans
    Commented Mar 26, 2014 at 0:33
1

Use an Adapter.

An adapter is a class that converts from one version of an API to another. On one side of the adapter is the original API. On the other side is the API for the newest version of your library.

You could, of course, simply make your library backwards-compatible, retaining the old API calls for the benefit of your existing experiments.
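A minimal sketch of the idea in Python, assuming a made-up library whose old API offered `train(samples)` and whose new API offers `fit(samples, normalize=...)`. Both class names and both method names are hypothetical stand-ins.

```python
class NewLibrary:
    """Stand-in for the newest version of the library (hypothetical API)."""

    def fit(self, samples, *, normalize):
        scale = max(samples) if normalize else 1
        return [s / scale for s in samples]


class OldApiAdapter:
    """Offers the old call signature on top of the new library."""

    def __init__(self, new_lib):
        self._new_lib = new_lib

    def train(self, samples):
        # Assumption: the old API always normalized, so translate the call.
        return self._new_lib.fit(samples, normalize=True)


# An old experiment keeps calling train() and never sees the new API.
legacy = OldApiAdapter(NewLibrary())
print(legacy.train([1, 2, 4]))
```

Old experiments would import the adapter instead of the library itself, so only the adapter has to change when the library does.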

3
  • 1
    I fear that using adapters might result in a lot of clutter. My libraries can change wildly: for example, in the past week I decided that my python library should take on a nested-directory structure instead of being contained in a single file. I could write adapter functions to map the new module structure to the old, but when my library changes this much, I feel that version control has to provide a better alternative.
    – jme
    Commented Mar 25, 2014 at 15:51
  • 1
    That's why I suggested that you make your libraries backward-compatible. If that also produces too much clutter, it might be time to revisit your design process to see if it could be improved to lessen the number of changes required between versions. What does a "nested directory structure" have to do with anything? Does that change the way your code calls the API? If it does, then you simply have to balance the benefit of the reorganization against the disruption that it causes. Commented Mar 25, 2014 at 15:53
  • 1
    For better or for worse, my "design process" is the scientific process: a lot of trying this and that and seeing what sticks. Nevertheless, it's usually more useful than not to build a library as I go along, but as you might imagine, making backwards compatible changes can be difficult in such a setting. And yes, using nested directories in a python package is a way of organizing its namespace. It does change the API.
    – jme
    Commented Mar 25, 2014 at 16:12
1

I think what you need is somewhere to store and manage versioned artifacts. As you have noticed, this is slightly different from keeping the library under source control. How to do this depends on the language: usually every language community tends to reinvent this sort of thing. For instance I would use:

  • SBT for Scala
  • Maven for Java
  • Bower for client-side JavaScript or CoffeeScript
  • NPM for node JavaScript or CoffeeScript

All of these tools work either with a remote repository or in local mode, where packages are cached somewhere on your filesystem. Bower works with git and tags, but makes a local checkout in each project.

Since you mention Python, the closest equivalent I can think of is Pip, but I am not sure whether Pip allows you to use local packages.

1
  • Interesting. I'm not familiar with these tools, so I'll have to read more. Python has virtualenv, but it might be overkill to make a new environment for every experiment.
    – jme
    Commented Mar 25, 2014 at 16:15
0

I quite like how http://cocoapods.org/ does it. It can be used publicly or privately.

We have a similar issue where we are developing multiple Objective-C controls that will be used in many applications. We couldn't make breaking changes, as this would risk breaking a project we know nothing about. This seriously hampers progress / innovation.

So with cocoapods we basically have a repo that lists the names / versions and locations of all of our controls (a list of podspecs). Inside cocoapods we say, for this project use

  1. control1 - v0.0.1
  2. control4 - v0.0.7
  3. control5 - v0.0.2

Then cocoapods will go to each control's repo and pull out the tag named after the specified version number, e.g. "0.0.1".

This is working quite well for us: we can use specific versions of multiple libraries in many different projects.

I'm not sure whether cocoapods supports your platform; you might need to build something yourself or just come up with your own process, but the idea works well.

0

Keep your library in one git repository and your experiments in another. Use git submodules to keep track of versions of the library. It's actually built for this...

4
  • That's good to hear. All I've ever heard about submodules is that they probably aren't the right tool for the job, so I haven't spent much time investigating them. Perhaps I should reconsider.
    – jme
    Commented Mar 25, 2014 at 16:00
  • why shouldn't they be a good fit? Where did you read that?
    – iveqy
    Commented Mar 25, 2014 at 18:54
  • There are several blog posts out there which discuss the problems of submodules. For example: one, two, three. Many of the authors suggest using subtrees instead. Now I'm not sure that I totally appreciate the relative merits of subtrees over submodules, but I do like how their history is tied to the superproject. I think they are what I'm looking for.
    – jme
    Commented Mar 25, 2014 at 21:10
  • The posts are basically saying that you shouldn't use submodules if you don't understand how they work. Several of the statements in those links are outdated. In your case, submodules sound like what you're looking for (but don't start using them before understanding them; they are not like svn externals)
    – iveqy
    Commented Mar 26, 2014 at 7:16
0

You could keep binaries for each library with a version name in each, e.g.

mylib/thelibname-alpha.dll
mylib/thelibname-beta.dll

Your tests then reference the relevant version. If you need to patch a library, all tests using it will benefit, but other tests will be unaffected.

The reason for doing this is the same as for embedding a version string in any library - you can control precisely which one you are using, and control what is the "active" version using symbolic links if you want. Take a look in /lib on a Linux host and you'll see exactly this arrangement being used:

$ ls -l /lib/libaudit.so.*
lrwxrwxrwx 1 root root     17 Feb  8 14:24 /lib/libaudit.so.1 -> libaudit.so.1.0.0
-rwxr-xr-x 1 root root 112224 Mar 14  2012 /lib/libaudit.so.1.0.0
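
The same symlink arrangement can be reproduced for your own library. The following Python sketch assumes a POSIX filesystem; the directory names `mylib-1.0.0`, `mylib-2.0.0` and `mylib-current` and the helper `set_active` are invented for the example.

```python
import os
import tempfile


def set_active(lib_dir, link_name, target_name):
    """Point `link_name` at `target_name` inside `lib_dir`, replacing any old link."""
    link_path = os.path.join(lib_dir, link_name)
    tmp_path = link_path + ".tmp"
    # Relative link, like libaudit.so.1 -> libaudit.so.1.0.0 in the listing.
    os.symlink(target_name, tmp_path)
    # Renaming over the old link swaps the active version in one step.
    os.replace(tmp_path, link_path)
    return link_path


with tempfile.TemporaryDirectory() as lib_dir:
    for name in ("mylib-1.0.0", "mylib-2.0.0"):
        os.mkdir(os.path.join(lib_dir, name))
    link = set_active(lib_dir, "mylib-current", "mylib-1.0.0")
    print(os.readlink(link))
    link = set_active(lib_dir, "mylib-current", "mylib-2.0.0")
    print(os.readlink(link))
```

Code that needs a fixed version refers to the versioned directory directly, while code that should follow the "active" version goes through the link.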
