
I want to make backups of a directory, with everything the same as the original files, except that the file contents are converted to all zeros (sparse files), while keeping the same file sizes. So for example, I have

  • dir/file1 with 1 GB non-zero data
  • dir/file2 with 2 GB non-zero data
  • ... (the dir is large enough with many more files and sub-directories)

and I need

  • dir_bak/file1 with exactly the same metadata as dir/file1 and 1 GB of zeros (or a corresponding sparse file)
  • dir_bak/file2 with exactly the same metadata as dir/file2 and 2 GB of zeros (or a corresponding sparse file)
  • ...

The resulting files, as well as the sub-directory structure, should be compressed into an archive (preferred) or stored sparsely, so that only a little disk space is used.

One key constraint here is that I have no spare disk space for a full backup, so please don't suggest solutions like making a full backup first and processing it later. The whole process should also not use up much extra disk space.

FYI, it's ext4 on Linux, with basic shell commands (like ls, find, cp, tar, dd, truncate, etc.) available.

P.S. Any other kind of backup/snapshot/image/archive solution that meets the requirement (empty content with full metadata) is welcome too, not just shell scripts.


1 Answer


First, copy the metadata and the structure, using the command you know (use sudo if needed):

cp -R --attributes-only --preserve=all dir dir_bak

Now the regular files in dir_bak are each of size 0. We will fix this. For every regular file in dir_bak we will read the size of the corresponding file in dir and truncate the one in dir_bak accordingly. If the filesystem supports sparse files, the files in dir_bak will become fully sparse. Then we need touch to fix the mtime affected by truncate (ctime is also affected, but you cannot change it easily; note that our cp … does not preserve ctime in the first place). Ask yourself if you need sudo for all this.
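For a single file, the mechanics look like this (a minimal sketch using the question's example names, run from the common parent directory; GNU coreutils assumed):

truncate -r dir/file1 dir_bak/file1   # same apparent size; the grown region is a hole, so the copy is sparse
touch    -r dir/file1 dir_bak/file1   # copy atime/mtime back from the original

The find commands below do exactly this for every regular file in the tree.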

Important:

cd dir_bak

Check twice that you are in dir_bak. Invoking the find … command below (or sudo find … if needed) in another directory may affect files in that directory, and you may lose data.

Also note that you need to replace path/to/source/dir/ with the actual path to the source dir/ (with the trailing slash). If you used the exact cp … command I provided, the path may be ../dir/; otherwise it will be something else. Either way, adjust the code; the full absolute path to the source dir/ is always a safe choice.

# make sure you're in dir_bak
# MAKE SURE you're in dir_bak
find . -type f -print \
   -exec truncate -r 'path/to/source/dir/'{} {} \; \
   -exec touch    -r 'path/to/source/dir/'{} {} \;

The above will work with GNU find. In general, find is only obliged to substitute one occurrence of {} provided as a separate argument. Support for substituting more than one {}, and/or for substituting a {} that is only a substring of an argument (as in …/dir/{}), is an extra feature your find may lack. If your find is limited in this way, use the following snippet instead:

# make sure you're in dir_bak
# MAKE SURE you're in dir_bak
find . -type f -exec sh -c '
   for f do
      printf "%s\n" "$f"            # show progress
      r="path/to/source/dir/$f"     # the corresponding source file
      truncate -r "$r" "$f"         # match its size (the copy becomes sparse)
      touch    -r "$r" "$f"         # match its timestamps
   done
' find-sh {} +

The -print (or printf) is not really needed; it's there to show progress.

I'm not sure whether all implementations of truncate support -r. If yours doesn't, use this snippet instead (it still uses touch -r):

# make sure you're in dir_bak
# MAKE SURE you're in dir_bak
find . -type f -exec sh -c '
   for f do
      printf "%s\n" "$f"            # show progress
      r="path/to/source/dir/$f"     # the corresponding source file
      s=$(<"$r" wc -c)              # its size in bytes
      [ "$?" -eq 0 ] || continue    # skip if the source could not be read
      truncate -s "$s" "$f"         # grow the empty copy to that size
      touch    -r "$r" "$f"         # restore timestamps from the source
   done
' find-sh {} +
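Once a run finishes, a quick sanity check can confirm the result (my addition, not part of the recipe; GNU coreutils assumed, run from the common parent of dir and dir_bak):

du -sh --apparent-size dir dir_bak           # apparent sizes should match
du -sh dir_bak                               # blocks actually allocated; should be tiny
stat -c '%s %y %n' dir/file1 dir_bak/file1   # size and mtime should agree per file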
  • The code works well, many thanks. However, for better performance, especially when processing a huge number of files on devices with slow I/O, is it possible to store the result files directly in an archive like a tarball instead of on disk, to improve efficiency and avoid creating excessive file entries? Commented Mar 11, 2022 at 5:27
  • @user17549713 Well, an easy improvement may be to place your dir_bak in /dev/shm or a similar mount (try echo "$XDG_RUNTIME_DIR" and compare with df -h; you want tmpfs), then create a tarball from there. For a huge number of files you may need to tweak the mount options (e.g. mount -o remount,nr_inodes=10G /dev/shm). Or modify the source of tar and compile a variant that does what you want (I haven't found this functionality built in; I'm not a programmer, so I cannot help you with that). Commented Mar 11, 2022 at 7:39
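Following up on the tarball question in the comments: GNU tar has a sparse-aware mode, so archiving the sparse dir_bak directly is also an option (my sketch, not from the answer above; GNU tar assumed, run from the parent directory of dir_bak):

tar -cSzf dir_bak.tar.gz dir_bak   # -S/--sparse records holes instead of storing runs of zeros

Add --remove-files if you want tar to delete each file as it is archived, keeping peak disk usage minimal; since the payload is all holes, the tarball ends up roughly the size of the metadata.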

