
We are faced with a situation where data has been backed up to several external media, and we are undertaking an exercise to consolidate it. The data consists of binary files, audio, video, compressed archives, virtual machines, databases, etc.

  1. Is it a best practice to copy all the files to a single source prior to deduplicating the data, or is it normal to run the procedure across multiple media?

  2. Is it best to run file-level or block-level deduplication? I am aware of the technical differences but am unclear why you would choose one over the other. We are after accuracy rather than performance.

EDIT

When I say copy, I mean we would copy each source to a single drive or NAS. Each source would be represented by a directory. All the data is currently stored on external hard drives. The objective is to deduplicate the data and have a single source of truth.

  • The paid version of CCleaner can detect duplicate files. I don't know if it scans network drive locations. Your actual question is not all that clear.
    – Ramhound
    Commented Feb 27, 2014 at 16:50
  • How would you copy it all to a single source? Are you talking about a single drive instead of several network locations? Are you talking about one folder vs. multiple folders? What about some HDD, SSD and other removable media? Please clarify. Commented Feb 27, 2014 at 17:28

1 Answer


Tools like rsync can manage the comparison operations and the moving of bits back and forth, but you're going to have to supply your own logic about which version of the data is canonical.

Is it best to run file-level or block-level deduplication?

This part of your question is easy, at least: you should never need to care about what is going on at the block level.
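To make the file-level approach concrete, here is a minimal sketch in Python, assuming the sources have already been copied into one subdirectory per source under a single consolidation root (the /mnt/consolidated path is a placeholder). It groups files by a SHA-256 hash of their contents so duplicates can be reviewed before anything is deleted:

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
        """Hash a file in chunks so large media files do not exhaust memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root: Path) -> dict[str, list[Path]]:
        """Group every file under root by content hash; keep only repeated hashes."""
        groups: dict[str, list[Path]] = defaultdict(list)
        for path in root.rglob("*"):
            if path.is_file():
                groups[file_digest(path)].append(path)
        return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

    if __name__ == "__main__":
        # Placeholder consolidation root; each backup source is a subdirectory.
        for digest, paths in find_duplicates(Path("/mnt/consolidated")).items():
            print(digest)
            for p in paths:
                print("   ", p)

In practice you would group files by size first and only hash files whose sizes collide, which avoids reading most of the unique data.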

  • My understanding is that rsync uses file-level checksums to verify duplicate data. How accurate is this? Why use rsync over opendedup, for example? I assume that it's better to use block-level deduplication.
    – Motivated
    Commented Feb 27, 2014 at 19:02
  • By default rsync only compares file size and modified time. If you use the --checksum option (which I would recommend, as it doesn't take that long) it will compute checksums on each file as it goes; see the sketch after these comments for the difference between the two modes. I am not sure what algorithm rsync uses; probably MD5. MD5 is no longer considered cryptographically secure, but it's fine for declaring two files to be identical (with very high probability). I am not familiar with opendedup. Rsync is powerful but notoriously difficult to use; there certainly might be more appropriate tools out there. Commented Feb 27, 2014 at 19:36
  • How does that differ from standard file-level deduplication?
    – Motivated
    Commented Mar 2, 2014 at 23:43
  • It differs in that you do not have to actually transfer the file over the network to do the comparison. Re-reading your question, I see that your data is on external hard drives, so you don't care about this feature of rsync and it is probably not the right tool. Sorry. Commented Mar 5, 2014 at 21:56
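To illustrate the difference described in the comments above, here is a minimal sketch in Python of the two comparison strategies: a quick check on size and modification time (roughly what rsync does by default) versus a full content checksum (roughly what --checksum adds). The file paths are placeholders, and MD5 is used only as a cheap content hash, not because it is confirmed to be rsync's algorithm.

    import hashlib
    import os

    def quick_match(a: str, b: str) -> bool:
        """Roughly rsync's default: compare only file size and modification time."""
        sa, sb = os.stat(a), os.stat(b)
        return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

    def checksum_match(a: str, b: str, chunk_size: int = 1 << 20) -> bool:
        """Roughly what --checksum adds: compare the actual file contents."""
        def digest(path: str) -> str:
            h = hashlib.md5()  # adequate for spotting identical files, not for security
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    h.update(chunk)
            return h.hexdigest()
        return digest(a) == digest(b)

    # Placeholder paths on two of the external drives being consolidated.
    print(quick_match("/mnt/source1/video.mkv", "/mnt/source2/video.mkv"))
    print(checksum_match("/mnt/source1/video.mkv", "/mnt/source2/video.mkv"))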
