28

I plan to back up my large HDDs with rsync, and I expect it to take a few days. Is it safe to keep using the original HDD (adding files) while rsync is working? Or is it better to leave the HDDs untouched until rsync has finished?

2
  • 1
    Note that "using" may be as simple as having a browser open doing nothing. Browsers tend to write a lot of random stuff in their data directories. In the worst case, what you get is an inconsistent backup, i.e. when restoring, you may be unable to restore your tabs, your bookmarks may be gone (because the database is corrupted) or something in that order of magnitude. Commented Feb 1, 2017 at 8:39
  • If you have that much data to back up, you may want to consider splitting the backup into smaller pieces (sub-trees). Then, only the part that is currently running needs to be kept as static as possible, and you can see which part that is by following the progress of your script (with a log, etc.). Since it's not one big backup, some of the pieces could be a little out of sync with the others, but if you're running one big backup on a live system, that's going to happen anyway.
    – Joe
    Commented Feb 6, 2017 at 19:10

5 Answers

36

As others have already pointed out, it is safe to read from the source disk, or use the target disk outside of the target directory, while rsync is running. It is also safe to read within the target directory, especially if the target directory is being populated exclusively by the rsync run.

What's not generally safe is to write within the source directory while rsync is running. A "write" is anything that modifies the content of the source directory or any subdirectory thereof, so it includes file updates, deletions, creations, etc.

Doing so won't actually break anything, but the change may or may not actually get picked up by rsync for copying to the target location. That depends on the type of change, whether rsync has scanned that particular directory yet, and whether rsync has copied the file or directory in question yet.

However, there is an easy way around that: Once it finishes, run rsync again, with the same parameters. (Unless you have some funky delete parameter; if you do, then be a bit more careful.) Doing so will cause it to re-scan the source, and transfer any differences that weren't picked up during the original run.

The second run should transfer only differences that happened during the previous rsync run, and as such will complete much faster. Thus, you can feel free to use the computer normally during the first run, but should avoid as much as possible making any changes to the source during the second run. If you can, strongly consider remounting the source file system read-only before starting the second rsync run. (Something like mount -o ro,remount /media/source should do.)
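
A minimal sketch of that two-pass sequence might look like the following. The helper names and the /media/backup path are made up for illustration, /media/source is assumed to be a mount point of its own, and the remount step needs root and will fail while any file on the source is still open for writing.

```shell
# One rsync pass: copy the contents of directory $1 into directory $2,
# preserving permissions, timestamps and symlinks (-a, archive mode).
sync_pass() {
    rsync -a "$1"/ "$2"/
}

# Remount the file system mounted at $1 read-only, so nothing can change
# under rsync during the final pass. Needs root.
freeze_source() {
    mount -o ro,remount "$1"
}

# Make the file system mounted at $1 writable again. Needs root.
thaw_source() {
    mount -o rw,remount "$1"
}

# Assumed sequence, with example paths:
#   sync_pass /media/source /media/backup   # long first pass, disk in normal use
#   freeze_source /media/source             # stop all writes to the source
#   sync_pass /media/source /media/backup   # short second pass, picks up the rest
#   thaw_source /media/source
```

The second call uses exactly the same rsync parameters as the first, which is what lets it transfer only the differences.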

9
  • 7
    One can even do a third run after a second run: it may take even less time... ;-)
    – gerlos
    Commented Jan 31, 2017 at 16:05
  • 5
    @gerlos A pattern seems to be emerging. It sounds almost like one could just keep running the rsync command at the end of each use session, and within a few days it would be done in no time. Commented Jan 31, 2017 at 16:28
  • 5
    @gerlos If you remount read-only before running rsync the second time, that won't be necessary and the backup will be all but guaranteed to be consistent while minimizing the time during which you cannot write to the source file system.
    – user
    Commented Jan 31, 2017 at 16:38
  • 1
    @gerlos As an aside, that's why I have an entry much like @reboot root find / -print &>/dev/null in my system crontab, to populate the cache. (The actual entry is more complex to account for a few special cases on my particular system.) It uses some RAM and some wallclock time early after startup to improve directory-tree scanning quite a bit IME.
    – user
    Commented Feb 1, 2017 at 15:27
  • 1
    @MichaelKjörling: Interesting idea to cache the hierarchy. But maybe you should run updatedb (which builds locate's database) or slocate -u (the same, if you have slocate) instead? That way you still cache the hierarchy, but you also build up the locate or slocate database, so you can use those commands to find files quickly. Commented Feb 1, 2017 at 17:25
23

This depends on the backup system you use, but in general it is a bad idea to modify the contents of a device while you're backing it up. However, you can read its contents; that's a safe operation, even if it will slow down the process.

In your case, rsync builds a file list and then starts copying. A file you add to the source HDD after rsync has already scanned its directory will therefore not be copied in that run.

What I do is not use the device at all during a backup. This is the safest way to obtain a fast and consistent backup.

1
  • 15
    I usually let it run and then do a second run of rsync which will finish in a few seconds because only the files that I have changed during the run will be copied. Everything will be in the caches, so it is way easier to refrain from modifications during that period. Commented Jan 31, 2017 at 21:46
15

It is safe to read data from the source areas while rsync is operating, but if you update anything the copy that rsync creates/updates is likely to be inconsistent:

  1. If you update a file that rsync has already scanned, it will not see the update until a future run. If you update a file it has yet to scan, the change will be reflected in the destination. If you update files that both have and have not been scanned, you will end up with a mix of old and new versions in the destination.

  2. If you add a file to a directory that has already been scanned, it will be missing from the destination copy this time around. If you remove a file from a directory that has already been scanned, it will be left in the destination copy this time. Depending on how you invoke rsync, the whole tree may be scanned at the start or it may be scanned incrementally as the sync process happens.

  3. In some circumstances rsync will see the inconsistency and warn you. If you remove a file or sub-directory from a directory that has already been scanned itself but has not had its contents scanned you will get an error message about the object being missing. In similar circumstances it can sometimes (if the size and/or timestamp has changed) also warn about files changing mid-scan.
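
For a wrapper script, the warning case in item 3 also shows up in rsync's exit status: the rsync man page lists 24 as "Partial transfer due to vanished source files" and 23 as "Partial transfer due to error". The helper below is a hypothetical sketch of how a script could react to those codes; the function name and the wording of the verdicts are my own.

```shell
# Map an rsync exit status ($1) to a short verdict for a backup script.
rsync_verdict() {
    case "$1" in
        0)  echo "ok" ;;
        24) echo "rerun: source files vanished during the transfer" ;;
        23) echo "check log: some files or attributes were not transferred" ;;
        *)  echo "failed: rsync exit status $1" ;;
    esac
}

# Assumed usage, with example paths:
#   rsync -a /media/source/ /media/backup/
#   rsync_verdict "$?"
```

A clean exit status does not prove the copy is consistent; files changed after they were scanned produce no warning at all.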

For some backups this inconsistency may not be a massive issue, but for most it will be, so it is recommended that you don't try to sync an actively changing source.

If you use LVM to partition your storage, you could use a temporary snapshot to take a point-in-time backup. This requires enough free space in the volume group to create a snapshot volume large enough to hold all the changes made while the snapshot exists. Check the LVM documentation (or one of the many online examples: search for "LVM snapshot backup" or similar) for more details.
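
As a rough sketch of the LVM route, assuming a volume group vg0 with a logical volume data (both names, the snapshot size and the mount point are hypothetical): create the snapshot, mount it read-only, run rsync from the snapshot, then remove it. All of this needs root, and some file systems need extra mount options for a snapshot, so treat it as an outline rather than a finished script. Setting DRY_RUN=1 only prints the commands.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# snapshot_backup VG LV SNAP_SIZE DEST
# Back up a point-in-time snapshot of /dev/VG/LV into directory DEST.
snapshot_backup() {
    vg=$1 lv=$2 snap_size=$3 dst=$4
    snap="${lv}_backup_snap"      # hypothetical snapshot name
    mnt="/mnt/$snap"              # hypothetical mount point

    # The snapshot must be large enough to hold every change made to the
    # origin volume while the backup runs.
    run lvcreate --snapshot --size "$snap_size" --name "$snap" "/dev/$vg/$lv" || return 1
    run mkdir -p "$mnt"
    run mount -o ro "/dev/$vg/$snap" "$mnt" &&
        run rsync -a "$mnt"/ "$dst"/
    rc=$?

    # Clean up even if the copy failed.
    run umount "$mnt"
    run lvremove --yes "/dev/$vg/$snap"
    return "$rc"
}

# Assumed usage, printing the commands without running them:
#   DRY_RUN=1
#   snapshot_backup vg0 data 20G /media/backup
```

The origin volume stays writable throughout; only the snapshot is read by rsync.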

Even without LVM some filesystems support snapshots themselves - so you may wish to look into that option too.

If you want to back up large active volumes without long downtime and can't use snapshots, it may be sufficient to run the "live" scan to completion, then stop access to the volume and run rsync again. The second run may take far less time (if very little has changed, it only scans the directory tree and then transfers the few updated files). This way, the period during which you must avoid changes can be much shorter.

1
  • I like your answer best because you go into detail about what happens if files are modified. You not only provide an alternative but also address the inconsistencies it can cause (missing an update, warning about a missing file, etc.). In my situation, using rsync to seed a long backup and then refreshing it days later is no big deal, and that sounds like the OP's situation as well. It doesn't sound like he/she requires an enterprise-level backup the first time through, but just wants to use the computer in the meantime. I say just run rsync a second time to catch the updated files.
    – ibennetch
    Commented Feb 1, 2017 at 16:04
12
  • You can read anything on the source HDD while rsync is running.

  • You can write to the source HDD, as long as the content is not part of what rsync is copying.

  • You can read anything on the destination HDD while rsync is running.

  • You can write anything to the destination HDD while rsync is running, provided enough free space remains for the synced content.

Of course, in all of these cases performance will be reduced.
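
One rough way to check the free-space condition before writing to the destination is to compare the size of the source tree with the space available on the destination file system. This is only an estimate (sparse files, hard links and different block sizes can all shift the real number), and the function name and paths are hypothetical.

```shell
# Succeed if the file system holding $2 has at least as many free KiB
# as the tree under $1 currently occupies.
enough_space() {
    needed=$(du -sk "$1" | awk '{print $1}')
    free=$(df -Pk "$2" | awk 'NR == 2 {print $4}')
    [ "$free" -ge "$needed" ]
}

# Assumed usage, with example paths:
#   enough_space /media/source /media/backup || echo "destination may fill up"
```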

0

All of the current answers are talking about data safety in terms of consistency and assuming perfect hardware.

Another thing to consider is the safety of the hardware itself. If you have hard drives that have never been backed up and could be on the verge of failing (you may not even know it yet), and you are making an initial comprehensive backup, don't use them while it runs. Don't even mount them if the data is critical. You can use a tool such as dd to clone the disk as a block device. What you don't want is the disk head seeking, and possibly writing, while you are trying to make a backup. Plus, dd should be faster for the initial backup since it just copies the bits in order. (If the drive isn't mostly full, I suppose rsync would win in the initial case as well.)
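
A minimal sketch of such a clone, assuming the source is an unmounted device node (the /dev/sdX name and the image path below are placeholders, not real values): dd reads the source from start to end and writes the bytes to an image file on the backup disk. The function name is made up for illustration, and reading a raw device needs root.

```shell
# Copy every byte of $1 (a device node or an image file) to $2, in order.
# The source should not be mounted while this runs.
clone_device() {
    dd if="$1" of="$2" bs=4194304
}

# Assumed usage, with placeholder names:
#   clone_device /dev/sdX /media/backup/disk.img
```

Double-check which argument is the source and which is the target before running anything like this; dd overwrites the target without asking.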

For subsequent incremental backups rsync is a great choice and I agree with the other answers 100%.

2
  • 1
    If the media is marginal or even potentially marginal, dd is not the best choice. Use ddrescue instead; it handles partial failures much better. But that was not a consideration in the original question.
    – user
    Commented Feb 2, 2017 at 13:06
  • @MichaelKjörling That is a good point.
    – Zak
    Commented Feb 2, 2017 at 15:58
