
I have about 160k commits, each updating 3 files (I've been using GitHub as a website), and I'm looking for a way to extract the files so I can load their contents into a real database.

My question is: how can I get (download?) the updated files from each commit and save them to a folder, with a timestamp or commit SHA appended to each name to avoid naming conflicts?

Is this possible with git? I know I can use the GitHub site to see the files and what has changed, but the problem is that there are over 160k commits.

  • What I understand is that you are trying to get all 160k versions of the files, is that right? Commented Aug 29, 2016 at 19:12
  • Correct. Once I have the files I know what to do with them, getting them is the problem.
    – Tribe
    Commented Aug 29, 2016 at 19:15

3 Answers

2

This is not the most elegant solution but it should work.

First you have to get a local copy of the repository using:

git clone <repo-url>

You get the <repo-url> from the GitHub page of your project (check the "Clone or download" button).

Then you cd into the local repo and run something along these lines:

# Commits in which file1 does not exist are skipped
for rev in $(git log --format=%H); do
    git checkout "$rev" -- file1 2>/dev/null || continue
    cp file1 "../history/file1-$rev"
done

Make sure you create the history directory in advance (next to the repository, for example with mkdir ../history). Duplicate the two lines inside the loop for each file you need to get.

Run git reset --hard at the end to restore the repository to its original state.
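With 160k commits it may be worth not touching the working tree at all. The following is only a sketch of a different technique from the checkout loop above: it reads each version straight out of the object database with git show <rev>:<path>. The function name export_versions is made up for illustration, and it assumes you run it from the top of the repository with a path relative to that directory.

```shell
# Sketch: save every committed version of one file without checking anything out.
# export_versions is a hypothetical helper name, not a git command.
export_versions() {
    local file="$1" outdir="$2" rev
    mkdir -p "$outdir"
    # Giving git log a path limits the list to commits that changed that file.
    git log --format=%H -- "$file" | while read -r rev; do
        # "rev:path" names the file's content as recorded in that commit;
        # skip commits where the file is absent (for example, a deletion).
        git cat-file -e "$rev:$file" 2>/dev/null || continue
        git show "$rev:$file" > "$outdir/$(basename "$file")-$rev"
    done
}

# Example: export_versions file1 ../history
```

Because nothing is checked out, no git reset --hard is needed afterwards with this variant.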

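If you also need the timestamp of the file you can get it using git log -1 --format=%ct <rev> -- file1, which prints the commit time (in seconds since the Unix epoch) of the most recent commit at or before <rev> that changed file1. The -1 matters: without it git log prints one timestamp per matching commit. Replace the cp command with:

ts=$(git log -1 --format=%ct "$rev" -- file1)
cp file1 "../history/file1-$rev-$ts"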

Check the documentation for other file or commit properties you can get using git log.
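If what you want in the name is the time of the commit itself, rather than of the last change to the file, something like the following might do. It is a sketch only: version_name is a hypothetical helper, and the name layout (file, SHA, timestamp) simply mirrors the one used above.

```shell
# Sketch: build an output name from a commit's SHA and its own commit timestamp.
# version_name is a hypothetical helper, not part of git.
version_name() {
    local file="$1" rev="$2" ts
    # -s suppresses the diff; %ct prints the committer date as seconds since the epoch.
    ts=$(git show -s --format=%ct "$rev") || return 1
    printf '%s-%s-%s\n' "$(basename "$file")" "$rev" "$ts"
}

# Example: cp file1 "../history/$(version_name file1 "$rev")"
```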

  • Thanks to all 3 of you (larsks, Fabrizio Migotto, and you, axiac). Marking this one as correct only because it has all the steps. Now I get to wait while I download 500Kish txt files!
    – Tribe
    Commented Aug 29, 2016 at 20:07
1

Once you have a local working copy of your repository*, you can get the files from any git commit just by checking out that commit, as in:

git checkout 1e6c98511d9154bfdc49a31fd26229953df0bd70

So, to get the files from every commit in your project history, you would just need to (a) generate a list of commits for your project, and then (b) iterate over that list, checking out each commit and processing the files.

The git rev-list HEAD command will generate a list of all the commits reachable from HEAD, i.e. the history of the current branch, from newest to oldest. If you want to process the commits oldest first, you can pipe that to tac to reverse the list (or pass --reverse to git rev-list), e.g.

for rev in $(git rev-list HEAD | tac); do
  git checkout -q "$rev"
  # ...do something here...
done

* by running git clone <repourl>
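Since each commit in the question only touches a few files, one way to fill in the "do something" step is to ask git which files a commit changed and save just those. This is a rough sketch under some assumptions: export_changed_files is a made-up name, it should be run from the top of the repository, file names are assumed not to contain newlines or characters that git would quote, and merge commits are skipped because git diff-tree lists nothing for them by default.

```shell
# Sketch: for every commit (oldest first), save each file that commit added or modified.
# export_changed_files is a hypothetical helper, not a git command.
export_changed_files() {
    local outdir="$1" rev path
    mkdir -p "$outdir"
    git rev-list --reverse HEAD | while read -r rev; do
        # --root makes the very first commit list its files too;
        # --diff-filter=AM keeps only added (A) and modified (M) paths.
        git diff-tree --no-commit-id --name-only -r --root --diff-filter=AM "$rev" |
        while IFS= read -r path; do
            # Slashes are replaced so that everything lands in one flat folder.
            git show "$rev:$path" > "$outdir/$(printf '%s' "$path" | tr '/' '_')-$rev"
        done
    done
}

# Example: export_changed_files ../history
```

Nothing is checked out here, so the working copy stays on whatever branch you started from.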

1

Assuming you are new to git, you will first have to install the git tools from here:

https://git-scm.com/

Then clone your repository by running this in the git console:

git clone https://github.com/username/repositoryname.git

After these steps you will be able to move between the different commits as @larsks explains.

For listing every commit of a particular file:

List all commits for a specific file
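In case it helps, listing the commits for one file can be sketched roughly as below. The wrapper name commits_for_file is invented for illustration; the git log options themselves are standard.

```shell
# Sketch: print the SHA of every commit that changed a given file, newest first.
# commits_for_file is a hypothetical wrapper name.
commits_for_file() {
    # --follow keeps listing history across renames of the file;
    # %H prints the full commit hash, one per line.
    git log --follow --format=%H -- "$1"
}

# Example: commits_for_file file1
```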
