
Archiving papers and projects is important for most people, and everyone has their own method for it.

Personally, I keep a separate folder for each paper. In each paper's folder, alongside the manuscript, I keep my simulation files, experimental data, program code, etc., in different subfolders. I also keep track of revisions of the paper, reviewer comments, and my responses to reviewers in their own subfolders. The advantage of my method is that all the material related to a paper is in one place, and I can quickly trace the whole process from initial submission to final proofreading. However, it usually leads to duplicated files.

I work on both my office desktop and my personal laptop, and transfer files between them with a USB flash drive. I agree that cloud storage is probably a better choice these days, but I have put it off because of my poor internet connection at home.

I am curious to know how others approach archiving their work. I also hope to find methods more efficient than mine, or to get some tips for improving it.

Edit (2 Apr 2013): Thanks to these great answers, I have now been using Git for version control for almost a month. I also manage my repositories on Bitbucket, which gives me unlimited storage for an unlimited number of projects.

  • Related (but not a duplicate): academia.stackexchange.com/questions/5277/… Commented Mar 4, 2013 at 10:16
  • After you publish the paper, don't forget to publish these supplemental materials too! Some journals allow you to include them, or you can put them on Github or on your own site. Some grad student who wants to reproduce your work will be very grateful. Commented Mar 4, 2013 at 17:47
  • Just to mention that a tool like Figshare could come in handy. But I'm no expert.
    – Aubrey
    Commented Jan 5, 2015 at 0:31

6 Answers

16

I have a similar approach with folders, with two additions:

  1. Everything goes into a revision control system. In my case, I've got some things in Subversion repositories and others in Mercurial repositories (I've also dabbled with Git, but haven't made the final transition). The benefits of revision control are that you can always go back to a previous version, you won't have old versions littering your folders, and sharing with collaborators is relatively easy. This should take care of your duplicated-file problem.

  2. I also use Dropbox extensively, in order to have my files available on any computer at any time. Dropbox provides a modicum of version control (30 days' worth), but it should not replace a proper revision control system. It does, however, provide a cloud-based backup of your work.

    Finally, regardless of how you're keeping your work arranged, make sure you keep an off-site backup (e.g., via Dropbox or personally-controlled media).
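As a concrete illustration of point 1, a first Git session for a paper folder might look like this. Git is only one of the systems the answer mentions, and every folder and file name below is made up:

```shell
# Put a paper's folder under version control (Git shown as one option;
# the folder and file names are illustrative, not from the answer).
mkdir -p paper-demo && cd paper-demo
git init -q
git config user.email "you@example.org"   # needed once in a fresh setup
git config user.name "Your Name"
printf '%s\n' '\documentclass{article}' > manuscript.tex
git add manuscript.tex
git commit -q -m "Initial draft"
# Any earlier version can now be recovered from the history:
git log --oneline
```

From here, every round of edits becomes one more commit, so old versions live in the history instead of littering the folder as duplicate files.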

  • If you use a cloud-based version control repository such as BitBucket, you can access the files via their web interface even without setting up git on the local machine. Commented Mar 4, 2013 at 13:26
  • @JackAidley Yes, you can access them, but it becomes a pain when you want to modify or update anything. I wouldn't actually suggest people do that... it's an extremely clumsy way of using it, inefficient and possibly error-prone too.
    – user5633
    Commented Mar 5, 2013 at 5:56
  • @John Q. Public: It is. I'm certainly not suggesting you use it as a primary method! However, Chris suggested that an advantage of DropBox is that you can access the files on any machine; this advantage is shared by BitBucket, so you don't need to duplicate your files to DropBox to get it. Obviously, if you're doing any amount of work you should use Git properly (by the way, it's perfectly possible to run a portable Git from a memory stick). Commented Mar 5, 2013 at 11:56
9
  1. I generally write in LaTeX and do statistics in R, so both can easily be versioned in a Git repository on BitBucket. As a side note, I chose BitBucket over GitHub because collaborating is easier: it lets you set up repositories that can be forked while still preventing the forker from sharing them publicly, whereas GitHub (at least when I last looked) required you to integrate tightly into teams. Also, Git submodules let me include common parts (such as my BibTeX files) in multiple projects without duplicating versions (though technically the files do reside in multiple places on the drive); this sometimes causes me grief, however, because I'm not very skilled with Git.

  2. I have tried to acquire additional Dropbox space in every way I can. It is now large enough that I can generally keep 100% of my active projects in it (some of them are the Git repos from above). This way everything is always backed up offsite with no intervention. Because I work out of Dropbox, as opposed to "My Documents" or the like, I can also move seamlessly from my Windows box at work to my laptop running Ubuntu or OS X. This also means that my stuff is backed up not only in the cloud but also on my other machine's hard drive. One tricky thing for me is sharing between the two OSes on the laptop; for that I keep Dropbox on a shared partition, to avoid wasting twice the disk space.

  3. Finally, I do an incremental backup every week or so when I think to plug my external drive in. That external drive is well cared for in a fancy case with fans and it is usually powered off except when I'm backing up. It never leaves my home.
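The submodule arrangement from point 1 can be sketched as follows. The repository names are hypothetical, and a local path stands in for what would normally be a BitBucket URL:

```shell
# Sketch: one shared BibTeX repo included in a paper repo as a Git
# submodule, so a single copy of the .bib files serves many projects.
# All repository names/paths here are invented for illustration.
git init -q bibtex-common
cd bibtex-common
git config user.email "demo@example.org"
git config user.name "Demo"
printf '%s\n' '@book{knuth1984, title={The TeXbook}}' > refs.bib
git add refs.bib
git commit -q -m "Shared bibliography"
cd ..

git init -q paper-2013
cd paper-2013
git config user.email "demo@example.org"
git config user.name "Demo"
# Newer Git versions require explicitly allowing file-path submodules:
git -c protocol.file.allow=always submodule add ../bibtex-common bib
git commit -q -m "Add shared bibliography as a submodule"
```

Each paper repo then pins a specific revision of the shared bibliography, which is exactly the "no duplicated versions, but multiple working copies on disk" trade-off described above.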

So I have the storage part pretty well worked out. However, I'm constantly trying to tweak the organization part. Right now I keep a projects directory in my Dropbox with multiple subprojects labeled like "2012-XXX_YYY_ZZZ" in some attempt to sort them. In these directories are generally subdirectories for study materials (I work with human subjects), analysis, notes, and any products, such as papers, that came from the work. In the analysis directory is generally a subdirectory with the actual R code; I try to name it something like R_git, as I generally use the ending to signify that something is backed up elsewhere in another repo.

My system gets difficult when I create a product (such as a paper) based on two projects (perhaps two studies). In that case I usually just move the paper subdirectory out to the main projects directory to avoid duplication. I do sometimes find myself searching through directories trying to remember where a given paper was stored, so clearly my system needs more work.

  • By the way, when I was a grad student I was disappointed by the lack of blogs and resources discussing this topic. At the time I was often looking for Mac software to help me keep my life in order. Because grad school differs so much from field to field, it seemed that not many people were writing about using technology in it. I discovered that the legal community does have such blogs, so I used to follow some of those.
    – Eric Marsh
    Commented Mar 6, 2013 at 9:38
9
+250

Toward the less-addressed organizational aspect of the question: I find that I need to maintain a fairly structured organization in order to effectively manage my papers, presentations, code, etc. over time. My methods result in a minor amount of duplication, but it is rare, and there are never more than three copies of a document. This is, of course, my own idiosyncratic system, but perhaps it will be useful as inspiration or as a template for thinking about how to develop your own.

First, my driving principles of organization:

  • Since collaborative projects have to be shared in so many different ways, I do not use any specialty organizational software, but just the hierarchy of the file system.
  • My primary sorting heuristic mirrors how I organize my time (and how it is attributed to projects in funding bookkeeping).
  • Higher-level directory names are shorter, since they are more persistent and more frequently typed; lower-level names are as long as they need to be to make clear, far in the future, what they contain.
  • No directory should have more than a dozen or so subdirectories.

Following these principles, my first layer of directories are sorted by the main business functions of academia:

  • pursuit: all proposals and funding pursuit goes here
  • projects: this directory contains one subdirectory for each funded project that is currently active (plus one for each major line of preliminary work). Rationale: each grant/contract needs to have its activities tracked individually for reporting to the funder.
  • internal: administrative dealings with my institution, such as travel receipts, training documents, and internal process documents go here. Travel receipts get their own subdirectory.
  • service: professional service, including teaching, recommendation letters, conference organization, seminar series, journal editing, and reviewing. One subdirectory for each major topic (e.g., one for each conference, another for all recommendation letters).
  • notes: all personal notes and reading, with a subdirectory for talk notes, another for manuals, and another for downloaded papers (with further subdirectories for major topics)
  • sites: contains one directory for each website where I am one of the maintainers.

Every one of these also contains an archive subdirectory, into which I move completed tasks, organized either by topic (e.g., pursuit, projects) or by year (e.g., internal, service).
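For what it's worth, this top layer (with its archive subdirectories) could be bootstrapped in a shell along these lines; only the directory names come from the description above, everything else is illustrative:

```shell
# Create the six top-level directories from the answer, each with the
# 'archive' subdirectory for completed tasks, plus a few of the named
# second-level directories (travel receipts, notes subtopics).
for d in pursuit projects internal service notes sites; do
  mkdir -p "$d/archive"
done
mkdir -p internal/travel-receipts
mkdir -p notes/talks notes/manuals notes/papers
```

The point of scripting it is consistency: every new machine or fresh start reproduces exactly the same skeleton.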

In any second-level directory, I maintain a README file that tells me what I will need to know when I revisit something after having forgotten all about it. The directories for funded projects also have a stereotyped structure:

  • contract: This is where all contract documents go for funded projects.
  • admin: all reporting, deliverables, etc.
  • publications: each paper gets its own directory; the conference presentation for a paper and any derivative papers also go here.
  • presentations: all presentations not directly associated with a paper go here
  • Beyond that, there are directories for each major strand of work in the project
  • For publications, every published paper (and supplement) also gets a copy, with a long informative name including the year, in the publications directory for my professional website.
  • For collaborative projects, there may also be a top-level split between internal and shared, with certain documents having a master version in internal and a second copy in shared.

Finally, everything that I care about must be backed up in two different ways:

  1. By the backup system of each machine that I use.
  2. By means of some sort of synchronization software (with version control when possible). I am currently using a mixture of SVN, git, Mercurial, Dropbox, and BitTorrent Sync, chosen per-project based on the collaborators.
  • Could you please tell me more about how you organize and archive your downloaded/read ebooks and papers? Also, do you have a naming standard for the files, or do you leave their names unchanged?
    – enthu
    Commented Jan 4, 2015 at 22:40
  • In my archive of papers that I have read, I start by dumping everything into a single directory. When any directory starts to have more than a couple dozen items, I break out clusters into subdirectories. I name each paper as FirstauthorSeniorauthor-SummaryPhrase-JournalYear, e.g. SmithWang-PlasmidJuggling-PNAS2007
    – jakebeal
    Commented Jan 4, 2015 at 22:45
  • I have to thank you for your answer to this question. I am giving you the 250-reputation bounty, though I know it is worth little next to your great knowledge and perfect answer. Thank you very much.
    – enthu
    Commented Jan 9, 2015 at 8:55
  • Glad that it has been helpful for you, @EnthusiasticStudent
    – jakebeal
    Commented Jan 9, 2015 at 12:57
5

I do something vaguely similar; however, I also use a makefile to integrate the simulations with the LaTeX source of the paper (generally there is also a set of MATLAB files that generate the contents of the tables and the figures). Then, if necessary, the experiments can be repeated and the results stitched back into the paper just by typing "make". More recently, however, I have had to make heavy use of my university's High Performance Computing facility, which makes this much more difficult. It works nicely for simpler projects, though.
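A minimal makefile along those lines might look like the following sketch, under the assumption of a single MATLAB script producing one figure and one table; all file names and the exact MATLAB invocation are illustrative, not from the answer:

```makefile
# Hypothetical Makefile: re-run the experiments, then rebuild the paper.
# File names (paper.tex, simulate.m, ...) are made up for illustration.
paper.pdf: paper.tex figure1.pdf table1.tex
	pdflatex paper.tex
	pdflatex paper.tex   # second pass to resolve cross-references

figure1.pdf table1.tex: simulate.m
	matlab -nodisplay -r "simulate; exit"   # writes figure1.pdf and table1.tex

.PHONY: clean
clean:
	rm -f paper.pdf figure1.pdf table1.tex
```

Typing `make` then re-runs only the steps whose inputs have changed; in real use the multi-target MATLAB rule is usually guarded with a stamp file so the simulation runs only once per change.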

4

The question of archiving and synchronizing is a good one, and for me it also relates to backup and storage as key questions, particularly when working on more than one computer. I will describe what I do without implying it is THE solution; it simply works for me. I have a computer at work as well as a desktop at home, plus several laptops.

  1. On the two desktops, at home and at work, the data resides on the hard disk, organized in a folder system that has developed over time. I have two 600 GB 2.5" portable hard disks which I carry between home and work and use as shuttles/backups by synchronizing them with the two desktops. I use a piece of software called Total Commander (shareware), which works fine for me, though I am sure others would do as well. This way I have three copies, at work, at home, and on the portable disk(s), at all times (except, in the worst case, for anything produced between syncs).

  2. I do not keep any data on my laptops, but use one of the 2.5" disks as an external hard disk when I am away from home and work. I keep the second 2.5" disk as a backup when I am away for longer periods, and try to transport the two disks separately, one in carry-on and one in checked luggage. (I should add that I have no secret data, so I do not worry about disks being stolen beyond my own loss.) I could keep all the data on the laptop as well, but I have opted for a faster but smaller SSD in the laptop, so my data will not fit.

  3. I use Dropbox to keep a limited number of files that I use frequently and most often need to share with others. I also use Dropbox to deposit files that I think I might need for specific purposes when out of the office, as an extra backup and for quick access. I do not use Dropbox as a backup but rather back up Dropbox occasionally, particularly collaborative files.

This works for me; the solution has developed over time, and the syncing is now a natural beginning and end to the work day. I live with three exact copies (four when I sync the second, pure-backup disk). I could add automatic backup as well, but have not felt it was worth it at this stage. With this system I always carry with me all the files I have ever produced. This will clearly be impossible for some activities, but quite feasible for most.

3

Several great answers here already, but here are a few more things to consider:

You could take a look at rsync (graphical interfaces are available) and unison for synchronising data/documents between different computers (and/or a USB memory stick). Rsync is simpler but unidirectional, though you can effectively do a full sync by rsyncing in one direction and then the other. Unison is much more powerful: it makes the two copies end up identical, and lets you specify which files to ignore, how to resolve conflicts, and so on. I use unison every day to keep my laptop and desktop in sync.

Also, don't discount your own university's networked storage, if available: it will usually take care of backup and give you protection against most problems short of a nuclear bomb. It can be slow when accessed offsite, but for ours I find that doesn't matter much as long as your files aren't enormous. If nothing else, it is useful as a "master" copy to synchronise your other copies against.

Finally, however you choose to organise your files and folders, spend a few minutes writing down your scheme in a read-me file so that, if the worst should happen, your colleagues can work out how to access your files and your work won't be wasted.
