
A "soon to enter beta" online backup service, Bitcasa, claims to offer both de-duplication (you don't back up something already in the cloud) and client-side encryption.

http://techcrunch.com/2011/09/12/with-bitcasa-the-entire-cloud-is-your-hard-drive-for-only-10-per-month/

A patent search yields nothing under their company name, but the patents may well be in the pipeline and not yet granted.

I find the claim pretty dubious with the level of information I have now. Does anyone know more about how they claim to achieve that? Had the founders of the company not had a serious business background (Verisign, MasterCard...), I would have classified the product as snake oil right away, but maybe there is more to it.

Edit: found a worrying tweet: https://twitter.com/#!/csoghoian/status/113753932400041984. The encryption key for each file would be derived from its hash, so this is definitely looking like not the place to store your torrented film collection, not that I would ever do that.

Edit2: We actually guessed it right. They use so-called convergent encryption, and thus someone owning the same file as you do can know whether yours is the same, since they have the key. This makes Bitcasa a very bad choice when the files you want to keep confidential are not original. http://techcrunch.com/2011/09/18/bitcasa-explains-encryption/

Edit3: https://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure asks the same question and has different answers.

  • I wonder if this deduplication feature could be used to identify individuals that had uploaded the same data? I think that would be a privacy issue.
    – MToecker
    Commented Sep 14, 2011 at 17:24
  • @MToecker: In this case, the purpose of the encryption is strictly to conceal the data. It is not intended to provide privacy. (That does have to be made clear to users, of course!) Commented Sep 20, 2011 at 10:32
  • You could ask the developers at Conformal Systems about their Cyphertite project?
    – user5428
    Commented Oct 13, 2011 at 21:09

8 Answers

26

I haven't thought through the details, but if a secure hash of the file content were used as the key then any (and only) clients who "knew the hash" would be able to access the content.

Essentially the cloud storage would act as a collective partial (very sparse, in fact) rainbow table for the hashing function, allowing it to be "reversed".

From the article: "Even if the RIAA and MPAA came knocking on Bitcasa’s doors, subpoenas in hand, all Bitcasa would have is a collection of encrypted bits with no means to decrypt them." -- true, because Bitcasa doesn't hold the objectid/filename-to-hash/key mapping; only their clients do (client-side). If the RIAA/MPAA knew the hashes of the files in question (well known for e.g. specific song MP3s), they'd be able to decrypt and prove you had a copy, but first they'd need to know which cloud-storage object/file held which song.

Clients would need to keep the hash for each cloud-stored object, and their local name for it, of course, to be able to access and decrypt it.
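The hash-as-key scheme described above can be illustrated in a few lines of Python. This is a toy sketch, not Bitcasa's actual implementation: the XOR keystream built from SHA-256 stands in for a real cipher such as AES-CTR, and all names and data are invented for illustration.

```python
import hashlib

def convergent_key(data: bytes) -> bytes:
    """Derive the per-file key as the SHA-256 hash of the plaintext."""
    return hashlib.sha256(data).digest()

def keystream_encrypt(key: bytes, data: bytes) -> bytes:
    """Illustrative XOR keystream from SHA-256 in counter mode.
    NOT a production cipher -- it stands in for e.g. AES-CTR."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

file_contents = b"the same MP3 on two different machines"

# Two independent clients derive identical keys and identical ciphertexts,
# so the server can de-duplicate without ever learning the key.
ct_alice = keystream_encrypt(convergent_key(file_contents), file_contents)
ct_bob = keystream_encrypt(convergent_key(file_contents), file_contents)
assert ct_alice == ct_bob
```

Because encryption is deterministic in the plaintext, deduplication works, but so does the confirmation attack discussed elsewhere on this page: anyone who has the plaintext can recompute the key.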

Regarding some of the other features claimed in the article:

  • "compression" -- wouldn't work server-side (the encrypted content will not compress well) but could be applied client-side before encryption
  • "accessible anywhere" -- if the objid-to-filename-and-hash/key mapping is only on the client then the files are useless from other devices, which limits the usefulness of cloud storage. Could be solved by e.g. also storing the collection of objid-to-filename-and-hash/key tuples, client-side encrypted with a passphrase.
  • "patented de-duplication algorithms" -- there must be more going on than the above to justify a patent -- possibly de-duplication at a block, rather than file level?
  • the RIAA/MPAA would be able to come with a subpoena and an encrypted-with-its-own-hash copy of whatever song/movie they suspect people have copies of. Bitcasa would then be able to confirm whether that file had been stored. They wouldn't be able to decrypt it (without the RIAA/MPAA giving them the hash/key), and (particularly if they aren't enforcing per-user quotas because they offer "infinite storage") they might not have retained logs of which users uploaded/downloaded it. However, I suspect they could be required to remove the file (under DMCA safe harbour rules) or possibly to retain the content but then log any accounts which upload/download it in the future.
  • It seems like it would be easy to dodge the RIAA's known hash of an MP3 by simply setting an ID3 tag to a long random string. Some similar non-operational modification to movie files would hamper efforts by the MPAA.
    – bstpierre
    Commented Oct 25, 2011 at 12:27
  • Deduplication isn't likely to happen at the file level, but rather on blocks of a selected size. So hashing and deduplication probably couldn't be used to obtain very useful information about specific files.
    – deed02392
    Commented Feb 18, 2014 at 17:25
23

The commercial ad you link to, and the company web site, are really short on information; and waving "20 patents" as a proof of competence is weird: patents do not prove that the technology is good, only that there are some people who staked a few thousand dollars on the idea that the technology will sell well.

Let's see if there is a way to make these promises come true.

If data is encrypted client-side, then there must be a secret key Kf for that file. The point of the thing is that Bitcasa does not know Kf. To implement de-duplication and caching and, more importantly, sharing, it is necessary that every user encrypting a given file f will end up using the same Kf. There is a nifty trick which consists in using the hash of the file itself, with a proper hash function (say, SHA-256), as Kf. With this trick, the same file will always end up into the same encrypted format, which can then be uploaded and de-duplicated at will.

Then a user would have a local store (on his computer) of all the Kf for all his files, along with a file ID. When user A wants to share the file with user B, user A "right clicks to get the sharing URL" and sends it to B. Presumably, the URL contains the file ID and Kf. The text says that both users A and B must be registered users for the sharing to work, so the "URL" is probably intercepted, on B's machine, by some software which extracts the ID and Kf from that "URL", downloads the file from the server, and decrypts it locally with its newly acquired knowledge of Kf.

For some extra resilience and usability, the set of known keys Kf for some user could be stored on the servers, too -- so you just need to "remember" a single Kf key, which you could transfer from one computer to another.

So I say that what Bitcasa promises is possible -- since I would know how to do it, and there is nothing really new or technologically advanced here. I cannot claim that this is what Bitcasa does, only that this is how I would do it. The "hard" part is integrating that in existing operating systems (so that "saving a file" triggers the encryption/upload process): some work, but hardly worth a patent, let alone 20 patents.

Note that using Kf = h(f) means that an attacker can mount an exhaustive search on the file contents. This is unavoidable anyway in a service with de-duplication: by "uploading" a new file and just timing the operation, you can tell whether the file was already known server-side or not.
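The exhaustive-search risk is easy to demonstrate: anyone who already holds a candidate plaintext can re-derive Kf and the resulting ciphertext, then check whether the server stores a match. A toy sketch in Python (the deterministic "encryption" is modeled with hashes; this is not any real service's code, and all data is invented):

```python
import hashlib

def convergent_encrypt(data: bytes) -> bytes:
    """Stand-in for deterministic encryption under Kf = h(f).
    Any deterministic scheme behaves the same way for this attack."""
    key = hashlib.sha256(data).digest()
    return hashlib.sha256(key + data).digest()

# What the server holds: opaque ciphertexts, no keys.
stored_ciphertexts = {convergent_encrypt(b"leaked-song.mp3 contents")}

def confirm_file(suspected_plaintext: bytes) -> bool:
    """An outsider with a copy of the plaintext can confirm
    that the server holds it, without any key material."""
    return convergent_encrypt(suspected_plaintext) in stored_ciphertexts

assert confirm_file(b"leaked-song.mp3 contents")       # known file: confirmed
assert not confirm_file(b"my original novel draft")    # original file: safe
```

This is exactly why convergent encryption protects original, high-entropy files but not files that an adversary can already obtain or guess.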

  • Ain't TechCrunch a pinnacle of fair and ethical reporting ;-) Commented Sep 14, 2011 at 12:55
  • If the technology functioned as you described, would that mean that if your hard drive crashed you wouldn't be able to recover your files from the cloud, as the originals (and probably the keys too) would have been lost to you? If this is the case it would make the service useless as backup, correct? Commented Sep 19, 2011 at 16:01
  • @Joshua: well, with crypto you always have to start from something. If the servers stored everything in such a way that your data could be recovered even if you did not remember anything at all, then the system would not be secure against the servers themselves. What could be done is to store all the Kf in a file, and then just "remember" the Kf for that file -- possibly encrypting it with a password, or writing it down on a piece of paper which you store in a safe. With crypto you can begin with a single, small key, which can be stored with low-tech tools. Commented Sep 19, 2011 at 16:06
16

Bruce Schneier touched on the subject in May, http://www.schneier.com/blog/archives/2011/05/dropbox_securit.html, in relation to the Dropbox problem of that week. TechRepublic offers a great 7-page white paper on the subject for the price of an e-mail sign-up at http://www.techrepublic.com/whitepapers/side-channels-in-cloud-services-the-case-of-deduplication-in-cloud-storage/3333347.

The paper focuses on the side-channel and covert-channel attacks available in cloud deduplication. The attacks leverage cross-user deduplication. For example, if you knew Bob was using the service and his template-built salary contract was stored there, you could craft versions of it until you hit his salary, with success indicated by how long the file took to upload.

Of course, your protection is to encrypt before using the service. That would, however, eliminate almost all deduplication opportunities and with them the cost savings that make the service economically viable, so the service will not encourage that choice.
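The salary-contract attack described above can be made concrete with a short sketch. Everything here is hypothetical (the template, the salary range, the in-memory store); the point is only that a low-entropy unknown embedded in a known template falls to simple enumeration under cross-user deduplication:

```python
import hashlib

def convergent_id(data: bytes) -> str:
    """Deterministic identifier the server could de-duplicate on."""
    return hashlib.sha256(data).hexdigest()

def contract(salary: int) -> bytes:
    """Hypothetical template whose only unknown field is the salary."""
    return b"Employment contract for Bob. Annual salary: $%d" % salary

# The dedup store already holds Bob's real contract (salary unknown to us).
dedup_store = {convergent_id(contract(83000))}

# The attacker enumerates the small space of plausible salaries; a
# "de-duplicated" upload (observable e.g. via upload timing) confirms a guess.
recovered = next(s for s in range(50000, 200000, 1000)
                 if convergent_id(contract(s)) in dedup_store)
assert recovered == 83000
```

Note that no cryptographic weakness is exploited: the leak comes purely from the deduplication behavior being observable per guess.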



9

In addition to the other good answers here, I'd like to point you to the following two academic papers, which were published recently:

  • Martin Mulazzani, Sebastian Schrittwieser, Manuel Leithner, Markus Huber, and Edgar Weippl, Dark Clouds on the Horizon: Using Cloud Storage as Attack Vector and Online Slack Space, Usenix Security 2011.

    This paper describes how Dropbox does de-duplication and identifies attacks on the mechanism. They propose a novel way to defend against some -- but not all -- of these attacks, based upon requiring the client to prove they know the contents of the file (not just its hash) before they're allowed to access the file.

  • Danny Harnik, Benny Pinkas, Alexandra Shulman-Peleg. Side channels in cloud services, the case of deduplication in cloud storage, IEEE Security & Privacy Magazine.

    This paper analyzes three cloud storage services that perform de-duplication (Dropbox, Mozy, and Memopal), and points out the consequent security and privacy risks. They propose a novel defense against these risks, based upon ensuring that a file is de-duplicated only if there are many copies of it, thus reducing the information leakage.

These papers seem directly relevant to your question. They also demonstrate that there is room for innovation on non-trivial mitigations for the risks of naive de-duplication.
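The second paper's mitigation can be sketched roughly as follows: the server keeps requesting full uploads of a file until its copy count passes a secret random threshold, so observing a full upload no longer proves the file was previously unknown. This is a toy model with invented parameters, not the paper's exact construction:

```python
import hashlib
import random

class ThresholdDedupStore:
    """Sketch of randomized-threshold deduplication: visible dedup only
    kicks in once a file's copy count exceeds a secret random threshold,
    blunting the one-copy side channel."""

    def __init__(self, max_threshold: int = 10, seed: int = 0):
        self.counts = {}
        self.thresholds = {}
        self.max_threshold = max_threshold
        self.rng = random.Random(seed)

    def upload(self, data: bytes) -> bool:
        """Return True if the client must actually transfer the data
        (which is what a timing attacker can observe)."""
        h = hashlib.sha256(data).hexdigest()
        if h not in self.thresholds:
            # Secret per-file threshold, at least 2.
            self.thresholds[h] = self.rng.randint(2, self.max_threshold)
        self.counts[h] = self.counts.get(h, 0) + 1
        return self.counts[h] < self.thresholds[h]

store = ThresholdDedupStore()
# The second uploader may still have to transfer the file, so a full
# upload no longer confirms that the file was previously unknown.
assert store.upload(b"popular file") is True  # first copy always transfers
```

The trade-off is that the provider forgoes some bandwidth and storage savings on the first few copies of each file in exchange for hiding the "is this file already known?" signal.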

6

Encryption and de-duplication between arbitrary users are not compatible if you are concerned about attacks that distinguish certain plaintexts. If you are not concerned about these types of attacks, then it can be safe.

If the data is only de-duplicated for a certain user, the server doesn't know anything about the equivalence of plaintexts and the attacks that remain are really minor.

If the data is de-duplicated between a circle of friends that share something that isn't known to the service provider (doable automatically), only people from that circle of friends can distinguish plaintexts (via timing etc.).

But if the data is de-duplicated between all users, all a hypothetical attacker who wishes to know which plaintexts are accessed needs to do is store the file in the cloud themselves and then monitor which user accounts access the same data. Sure, the service can just "not log" the user accounts / IP addresses accessing the data, but that has nothing to do with encryption, and the same "protection" would remain even if the files were plaintext.

None of the other answers given here seem to propose anything that would stop this attack and I believe Bitcasa does not either. I would be glad to be proven wrong though.

(Note: There are some ways to possibly achieve something close to this - there have been quite a few papers published about secure cloud storage using all sorts of innovative techniques - but these are new research and most of them will probably be broken or shown infeasible rather fast. I wouldn't trust my data on any of them yet.)

  • To this I can only add that the MPAA and RIAA will most likely just get a court order or law forcing Bitcasa to implement a mechanism enabling the two organizations to get a list of users having certain content. So the problem is not even technical. Commented Sep 15, 2011 at 0:58
5

The same question was asked at the cryptography stack exchange. Please see my answer there, as there is a subtlety that is easy to overlook and that has been carefully analyzed by the Tahoe-LAFS open source project: https://crypto.stackexchange.com/questions/729/is-convergent-encryption-really-secure/758#758

  • Can you expand just a little here - a couple of bullet points on the subtlety you mention would help users.
    – Rory Alsop
    Commented Sep 28, 2011 at 14:23
  • There are two possible attacks. The first one, which we call the "confirmation of a file attack", is the obvious problem that deduplication exposes the fact that two things were the same as each other. This issue was immediately appreciated and discussed when convergent encryption was first proposed (not under that name) on the cypherpunks mailing list in 1996. (Before Microsoft applied for a patent on convergent encryption, so the cypherpunks discussion is prior art that invalidates the Microsoft patent.)
    – Zooko
    Commented Oct 1, 2011 at 5:19
  • The second attack, which we call "learn the remaining information", is not so obvious, and as far as I know nobody was aware of this attack until 2008, when Drew Perttula and Brian Warner developed it as an attack against the Tahoe-LAFS secure filesystem. In the "learn the remaining information" attack, the attacker can make guesses about a few secret, random, unknown parts of a larger file and then find out if one of their guesses is correct. Please see the write-up at: tahoe-lafs.org/hacktahoelafs/drew_perttula.html
    – Zooko
    Commented Oct 1, 2011 at 5:21
2

Aside from the great answer @Misha just posted on the "known hash", client-side encryption effectively removes any other way to do de-duplication unless there is an escrow key, which would potentially cause other logistical issues anyway.

  • I don't believe that is correct. Metadata is one side channel that can provide a deduplication avenue. Just look at the file size of all your documents. Prior to easily available hashing, file size was a frequently used metric for duplicate detection.
    – this.josh
    Commented Sep 16, 2011 at 7:36
  • I didn't realise it was actually used. Is anyone still doing it? Way too easy to spoof whatever file you want, surely?
    – Rory Alsop
    Commented Sep 16, 2011 at 17:03
  • It was used in Windows (3.1 and 95) shareware programs to look for duplicate files (when the filename wasn't enough). I don't think anyone is still using that technique explicitly, but size is an important protection against appending data to bring a modified file back to a target hash value. For the average home user, it used to be that they only had a few documents and they were usually different sizes. The massive amount of data the average consumer now has, along with hordes of nearly identically sized files (pictures), makes file size a poor indicator.
    – this.josh
    Commented Sep 16, 2011 at 19:10
-1

You're totally right! Using just convergent encryption is not a good choice, even for non-original files: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html Fortunately, it looks like there is a solution that combines encryption and deduplication, called ClouDedup: http://elastic-security.com/2013/12/10/cloudedup-secure-deduplication/
