4

I'm working on a server via SSH and need to copy a directory from the server to an HPC with rsync:

rsync -a -q "sourcedir" "username@hpc:~/destdir/"

In this example, sourcedir is on the server and contains a sub-directory which contains a small .csv file. destdir doesn't exist on the HPC, but gets created by rsync (when it works).

It works roughly three-quarters of the time, but sometimes fails with one of two errors:

  1. a 'stale file handle':

    rsync: recv_generator: failed to stat "/home/u27/username/destdir/path/to/file.csv": Stale file handle (116)
    rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]
    

OR

  2. an 'error in file IO':

    rsync: mkdir "/home/u27/username/destdir" failed: File exists (17)
    rsync error: error in file IO (code 11) at main.c(656) [Receiver=3.1.2]
    

The stale file handle error (1) happens more often than the error in file IO (2).

Environment:

  • Host is a Virtual Machine running Ubuntu 18.04.6 on OpenStack. Data are stored in a mounted volume. Both the VM and data are hosted on OpenStack.
  • The HPC is running CentOS Linux 7 and is accessed via SSH through a bastion host.

I suspect that the problem is happening on the host, not the HPC, because I can't reproduce this error on my laptop running macOS. I can reproduce this problem both through the SFTP setup for transferring files to the HPC and over plain SSH.

Any ideas what could be causing this, or what further steps I could do to debug or eliminate this error?

Cross-posted here: https://github.com/WayneD/rsync/discussions/415

UPDATE: I'm able to reproduce this problem with or without a trailing "/" on username@hpc:~/destdir/

12
  • 2
    "destdir doesn't exist on the HPC" -> mkdir "/home/u27/username/destdir" failed: File exists (17) tells us it does exist sometimes. Commented Dec 5, 2022 at 21:41
  • No, destdir does not exist on the HPC because I remove it before each test run of rsync in this example. Also, rsync generally does not care if the directory exists or not. If it doesn't exist, it makes it, right?
    – anon
    Commented Dec 6, 2022 at 21:43
  • 1
    @DavidLeBauer a stale file handle means that NFS tries to access a file via its inode, which has in the meantime changed on disk. Unmounting and remounting the NFS filesystem before copying the files could help. If that doesn't help, additionally try restarting the NFS server. You might want to check if you run the same version of NFS on both client and server. NFS4 has some problems with NFS3 servers. Commented Dec 8, 2022 at 3:35
  • 1
    If this was a problem with the HPC NFS, wouldn't I be able to reproduce the error from another host besides the server?
    – anon
    Commented Dec 8, 2022 at 15:22
  • 1
    The errors all refer to "destdir", and "stale file handle" is NFS-related. So this heavily points towards an HPC server issue. The lack of reproducibility on macOS may be due to other factors (speed comes to mind, but there may be implementation subtleties). Can you provide the filesystem mount options for both source and destination directories?
    – Uriel
    Commented Dec 10, 2022 at 11:12
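
The mount options requested in the comment above can be read on Linux from /proc/mounts. A minimal sketch, assuming an absolute path without symlinks; `mount_info` is a hypothetical helper, and `findmnt -T <path>` reports the same information where util-linux is installed:

```shell
# Print "fstype options" for the mount that contains the path in $1.
# $2 is the mounts table to read (defaults to /proc/mounts on Linux).
mount_info() {
    awk -v p="$1" '
        {
            mp = $2
            # A mount point contains p if it is "/", equals p,
            # or is a leading path component of p.
            if (mp == "/" || p == mp || index(p, mp "/") == 1) {
                # Keep the longest (most specific) matching mount point.
                if (length(mp) >= best) { best = length(mp); out = $3 " " $4 }
            }
        }
        END { if (out != "") print out; else exit 1 }
    ' "${2:-/proc/mounts}"
}

# Usage, run on the HPC with the path from the error message:
# mount_info /home/u27/username/destdir
```

A filesystem type of nfs or nfs4 in the output would support the NFS explanation; running the same check on the source volume of the VM covers the other half of the request.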

2 Answers

2

Based on the elements provided, the issue is likely due to the target destdir being an NFS mount. The rsync command tries to remove/recreate entries, and the NFS layer latency ends up providing an inconsistent view of the filesystem.

Some possible actions:

  • run the same command against a local directory on the server, to ensure that NFS is indeed the culprit. If possible, switch to such local storage.
  • review and adjust the NFS mount options, keeping in mind that changing those may have significant performance or consistency implications
  • play with the --delay-updates and --delete-delay options of rsync, which may help by doing some cleanup and other actions after the transfer phase. You may however only hide the problem with those, and it may reappear under some timing conditions.
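
Because the failure is intermittent, the last option can be combined with a simple retry. A minimal sketch, assuming a POSIX shell; the `retry` helper is hypothetical (it is not part of rsync), and like the options themselves it may only mask the underlying inconsistency:

```shell
# Run a command until it succeeds, at most $1 times, sleeping $2 seconds
# between attempts. Returns the exit status of the last attempt.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while :; do
        "$@" && return 0
        retry_status=$?
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "giving up after $retry_n attempts (exit $retry_status)" >&2
            return "$retry_status"
        fi
        echo "attempt $retry_n failed (exit $retry_status), retrying in ${retry_delay}s" >&2
        sleep "$retry_delay"
        retry_n=$((retry_n + 1))
    done
}

# Usage with the command from the question:
# retry 3 5 rsync -a -q --delay-updates "sourcedir" "username@hpc:~/destdir/"
#
# --delete-delay can be added in the same way, but as far as I understand the
# man page it also enables deletion of destination files that are missing
# from the source, so check that this is wanted first.
```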
2
  • This seems a likely culprit. When I do testing to try to reproduce this issue, I've noticed that sometimes when I run rm -r destdir on the HPC it tells me the directory doesn't exist, but if I run ls then destdir shows up and I'm able to delete it after that. It's like the directory isn't there until it needs to be.
    – anon
    Commented Dec 13, 2022 at 19:43
  • @EricScott the rsync command probably faces the same situation, hence the strange error message. You can either tweak the rsync options or fix the root cause (NFS client or server options, network issues...)
    – Uriel
    Commented Dec 13, 2022 at 22:20
1

Based on the command you provided above, the issue would seem to be as simple as removing the "/" after destdir, and should read like this:

rsync -a -q "sourcedir" "username@hpc:~/destdir"

If you review the man page for rsync, a trailing "/" in the destination directory string explicitly tells rsync to expect that to be pre-existing, and prevents rsync from creating it on the fly.

If you remove that directory just before rsync, you may also want to force a filesystem "sync", first on source host, then on mirror host, before attempting the rsync.

Lastly, are you sure that you successfully removed that target directory before proceeding? Do you actually test to reconfirm that your delete action did in fact work?

I'm just suggesting the obvious checks and tests that you might want to incorporate into your script logic.
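
The delete-then-verify step could be sketched as follows. This is a minimal sketch under assumptions: `remove_and_confirm` is a hypothetical helper, it is meant to run on the machine that holds destdir (here the HPC, for example in a script executed over ssh before rsync starts), and the `ls` call simply mirrors the observation in a comment on the other answer that listing made a lingering directory visible again.

```shell
# Remove a directory and confirm that it is really gone.
# Arguments: path, maximum attempts, seconds to wait between attempts.
remove_and_confirm() {
    rc_path=$1
    rc_tries=$2
    rc_delay=$3
    # Refuse an empty path so that rm -rf never runs on an unintended target.
    [ -n "$rc_path" ] || return 2
    rc_i=1
    while [ "$rc_i" -le "$rc_tries" ]; do
        rm -rf -- "$rc_path"
        # List the parent directory before testing for the path again.
        ls -a "$(dirname "$rc_path")" >/dev/null 2>&1 || true
        if [ ! -e "$rc_path" ]; then
            sync    # flush pending writes, as suggested above
            return 0
        fi
        sleep "$rc_delay"
        rc_i=$((rc_i + 1))
    done
    echo "still present after $rc_tries attempts: $rc_path" >&2
    return 1
}

# Usage on the HPC, before starting rsync from the server:
# remove_and_confirm "$HOME/destdir" 5 2
```

A non-zero return value means the directory was still visible after the last attempt, so the calling script can stop instead of starting the transfer.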

3
  • 1
    This is not correct. I tested and rsync creates the target directory whether or not it ends with a "/". rsync reacts differently to a trailing "/" on the source, but that is not related. Please cite the man section where you found this.
    – Uriel
    Commented Dec 12, 2022 at 10:36
  • I'm able to reproduce the problem without the trailing "/". I'll update the question. If I don't remove the target directory before proceeding, I'm pretty sure the rsync always works. rsync sees that the files do not need to be updated and copies nothing. So yes, I check that the delete action is successful, but an unsuccessful delete wouldn't cause a problem.
    – anon
    Commented Dec 12, 2022 at 18:13
  • 1
    I confused the two directory references. The trailing "/" rule applies to the source dir in the command line, not the target location. From the man page: A trailing slash on the source changes this behavior to avoid creating an additional directory level at the destination. You can think of a trailing / on a source as meaning "copy the contents of this directory" as opposed to "copy the directory by name", but in both cases the attributes of the containing directory are transferred to the containing directory on the destination. I will remove this answer in 2 days. Commented Dec 12, 2022 at 22:34
