11
$\begingroup$

Intro

For deduplication purposes, I need to split a stream of plaintext bytes into variable-sized chunks. The way this is traditionally done is using a rolling hash function defined over some window $w$ (e.g. 48 bytes). This window "slides" along the byte stream and is evaluated at each byte position, and when the value of the function is let's say $f(w) \equiv 0\ (\textrm{mod}\ 1024)$ a cutpoint is made, marking the end of the current chunk and beginning of the next one. The hash function is extremely fast to evaluate when the following is known:

  • The value of the rolling hash function at some window position
  • The byte entering the window
  • The byte leaving the window

The issue

Now if I encrypt each chunk and upload it to the server, I already leak some information about the plaintext:

  • The adversary knows that the hash value (a simple CRC32-like checksum) of the last $w$ bytes of each plaintext chunk $\equiv 0\ (\textrm{mod}\ 1024)$
  • All the other windows are hashed to a value $\not\equiv 0\ (\textrm{mod}\ 1024)$.

Related

Here is a similar question, but it seems people there did not realize that the hash value (output of the rolling hash function) is never published or known to the attacker. The only information disclosed is whether the hash value $\equiv 0\ (\textrm{mod}\ 1024)$ (true or false) for every window position in the input plaintext stream. So they recommend to use SipHash or $AES(f(w)) \equiv 0\ (\textrm{mod}\ 1024)$, but both solutions are just too slow compared to using the plain rolling hash.

There seem to be similarities with the hash-table flooding attacks, as the hash value is also never directly published. Besides the already-mentioned SipHash, Microsoft has proposed Marvin32 for this particular issue.


Questions

  1. Besides the already mentioned solutions, what is the proper way to perform chunking without leaking information about the plaintext? Performance is very critical here though.

  2. Would UMAC or Universal hashing have any application here?

  3. Would this rolling hash function solve the information leakage on its own without encrypting the hash value with AES/SipHash? If not, what can be done to fix the information leakage? Pseudo-code for this hash function:

    // My secret key, never shared or disclosed to anyone
    var key = ....; 
    
    // Generate substitution table
    var substitution = new uint32[256];
    var substitutionKey = HMAC(key, SUBSTITUTION_SALT);
    for (uint32 i = 0; i <= 255; i++)
    {
        var randomBytes = HMAC(substitutionKey, UInt32ToBytes(i));
        substitution[i] = BytesToUInt32(randomBytes);
    }
    
    // Generate random irreducible polynomial of degree 33 (GF 2) and store it
    // in a 32-bit register (33 coefficients, but the 33rd is implied because
    // it's always 1)
    // The algorithm is not shown here due to its complexity, but let's assume
    // the polynomial is derived from the key
    var polynomialKey = HMAC(key, POLYNOMIAL_SALT);
    uint32 polynomial = RandomIrreduciblePolynomial33(polynomialKey);
    
    // Hash an input window
    uint8[] window = new uint8[48];
    LoadPlaintext(ref window);
    uint32 hash = 0;
    for (uint32 i = 0; i < 48; i++)
    {
        // Galois multiplication by x and subsequent reduction
        hash = (hash << 1) ^ ((hash >> 31) * polynomial);
        // Add the substitution polynomial
        hash = hash ^ substitution[window[i]];
    }
    
    ...
    
    // Now if we want to move the window forward by a single byte, there
    // exist a very efficient algorithm that does that, but it's not
    // relevant to the issue at hand
    
$\endgroup$
4
  • $\begingroup$ Maybe you want to take a look at BLAKE2sp? $\endgroup$
    – SEJPM
    Commented Jun 13, 2015 at 9:37
  • $\begingroup$ @SOJPM: Thanks, I will check it out. But after a quick investigation of the SipHash paper, they compared SipHash to BLAKE-512, and SipHash turned out to be 5.5 times faster than BLAKE-512 for 64 byte messages (BLAKE-512 has 16 rounds vs 2 rounds of SipHash). I wonder if Blake2sp improves the situation so much on 32-bit platforms as to get ahead in terms of performance. $\endgroup$
    – Paya
    Commented Jun 13, 2015 at 13:48
  • $\begingroup$ BLAKE2SP is the parallelized version of BLAKE2S which is the faster Version of BLAKE-256 which is 32-bit based. (Like SHA-256 vs SHA-512) $\endgroup$
    – SEJPM
    Commented Jun 13, 2015 at 15:58
  • 1
    $\begingroup$ Looking at the Blake2s and Blake2sp, there is no way for them to be faster than SipHash simply by comparing the amount of computation done per round and the number of rounds (2 vs 10), unless I would target 16-core processors, in which case the parallel computation might begin to catch up. SipHash and Blake2 were both made by Jean-Philippe Aumasson, and SipHash was specifically designed for hashing short messages unlike Blake2. Also, Blake2s is not available for my platform (C#). $\endgroup$
    – Paya
    Commented Jul 3, 2015 at 0:50

1 Answer 1

5
$\begingroup$

UMAC uses AES, so it will not lead to a faster algorithm than the rolling hash + AES solution suggested in the previous question. Universal hashing based MACs like UMAC and Poly1305 usually use a PRF (which you could just use directly) to hide the hash. There are universal hashing MACs that do not, but as far as I know they are even slower. (Further, I am not sure if a secure MAC would even be enough, or if you actually need a PRF.)

Without leaking any more information, there are not many options. Rolling hash + SipHash is probably faster anywhere that you lack AES hardware, but that may be the best you can do.

If you find the performance cost to be unacceptable, you might want to allow some information to leak. For example, if you define chunking using parity(w) = 0 and then PRF(k, f(w)) mod (n / 2) = 0, you halve the number of PRF calls while only leaking one extra bit of information on the window (and even less for those positions where you did not end a chunk).

You can of course reveal even more information to reduce the number of calls further, but in the end you back at something leaks as much as just using f(w) mod n = 0.

$\endgroup$
2
  • $\begingroup$ The reason why I mentioned UMAC and universal hashing is that it might be enough to use the universal hashing directly, without the final step of AES/PRF. That's because in the case of deduplication, an attacker never gets to see the output of the hash function. I tried to illustrate attacks against such a model here - I am yet to find any attacks. And if it's a secure construction (for that specific application), it completely solves all performance issues. $\endgroup$
    – Paya
    Commented Sep 22, 2015 at 22:21
  • $\begingroup$ @Paya, if that is secure, then yes, it could be faster. I cannot answer the linked question, however, so I don't know if it is. $\endgroup$
    – otus
    Commented Sep 23, 2015 at 6:25

Not the answer you're looking for? Browse other questions tagged or ask your own question.