I'm building a service that will contain personal data relating to real people. Initially the dataset will be quite small, so it may be possible to identify individuals if the search parameters are narrowed sufficiently. One example of a filter that can be applied is app platform/version (Android, iOS, etc.). The data will be displayed as sums and percentages. For example:

Android users:

{
  "users" : 124,
  "percentage" : 23
}

There will be many filter types, including date ranges, gender, age, etc.
How can I ensure reasonable anonymity?
I am considering requiring a minimum result-set size to help with this: for example, if a query returns fewer than 25 results, the service won't return the data, but will instead return some sort of notification about the minimum dataset size.
What is a good way to approach this problem?

5 Answers

This is a difficult problem, regardless of how you look at it. But some solutions might be good enough. Here, you have the advantage that you are returning aggregate statistics, and not individual records.

Enforcing a minimum size for result sets is a good idea, and should provide privacy guarantees similar to k-anonymity. If a query result would be too small, you could provide a minimum count (e.g. returning users: 25 even if the true value is users: 4).
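
A minimal sketch of that floor, applied to the response shape from the question (Python; the threshold of 25 and the field names come from the question, the rest is my assumption):

K_MIN = 25  # smallest count the service is willing to report

def report(true_count, total):
    # Never report a count below the floor, so a narrow filter cannot
    # reveal that only a handful of people matched it.
    shown = max(true_count, K_MIN)
    return {"users": shown, "percentage": round(100 * shown / total)}

# report(4, 1000) -> {"users": 25, "percentage": 2}  (true value was 4)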

But if larger counts are accurate and all possible categories are known, this may still enable recovery of the true value via multiple queries. E.g. if we know that there are 1000 records in total, that there are 599 Android users and 399 iOS users, and that the only other possible OS is Windows Mobile, we can infer that exactly 2 users fall into that category. Similarly, multiple requests with overlapping date ranges could allow inferences about small result sets.

Differential Privacy makes such inferences unreliable. This works by adding noise to the result, where the appropriate level of noise depends on the specific query. If all queries merely count the number of matching records, this is fairly feasible (in differential-privacy lingo, the L1-sensitivity of such queries is 1). On average, the noisy results will still be accurate (and the bit of noise won't matter on larger counts), but adversaries cannot make accurate inferences about small samples.
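
As a rough illustration of the counting case (a sketch under my own parameter assumptions, not part of the answer): with L1-sensitivity 1, adding Laplace noise with scale 1/ε gives ε-differential privacy.

import numpy as np

def dp_count(true_count, epsilon=0.5):
    # A counting query has L1-sensitivity 1, so Laplace noise with
    # scale 1/epsilon yields epsilon-differential privacy.
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return max(0, round(true_count + noise))

# For a true count of 124 the reported value will usually land within a
# few units of 124; for a true count of 2 it is essentially meaningless.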

In practice, this can still be difficult to implement appropriately. For example, an attacker might try to recover the true result of a query by issuing it multiple times. If each response samples a different noisy result, the true value will be around the average of the responses. E.g. Reddit fuzzes vote counts, but rolls a different number each time a vote count is requested. Request the count three times, and you'll have a good idea of the true vote count.

You can defend against this by using the same RNG seed for each query-response, for example by hashing the query + true result.
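
For example (a sketch; the normalization format and the use of SHA-256 are my assumptions), the seed can be derived so that repeating an identical query replays exactly the same noise:

import hashlib

def noise_seed(normalized_query, true_count):
    # Same query + same underlying count -> same seed -> same noise,
    # so averaging repeated responses gains the attacker nothing.
    material = f"{normalized_query}|{true_count}".encode()
    return hashlib.sha256(material).digest()

# random.Random(noise_seed(query, count)) can then drive the noise sampling.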

But then this makes it possible for an attacker to detect when the true value changes: as long as they keep getting the same result, they can be somewhat sure that the true value has not changed. If the rate of change is lower than the rate of queries, this could enable an attacker to estimate the true rate of change, which may allow inferences about the true result of the query (especially since you allow queries for specific time ranges).

There is still the issue that an attacker can issue multiple distinct queries with overlapping result sets in order to make inferences. E.g. if an attacker issues a query for A="Monday", B="Tuesday", and C="Monday to Tuesday", they obtain three random variables, but they know that their expected values are related via E[A] + E[B] = E[C]. They can use this to defeat some of the added noise. This means that the noise you add (its standard deviation, which is tied to your differential-privacy budget) must be large enough to obfuscate the true result of all queries combined, not just the result of a single query.
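
To see what that constraint is worth to an attacker, a small worked example (my numbers, not the answer's): suppose each response carries independent noise of variance σ². The attacker then has two unbiased estimates of the Monday-to-Tuesday count: C itself, with variance σ², and A + B, with variance 2σ². The inverse-variance-weighted combination ĉ = (2/3)·C + (1/3)·(A + B) has variance (2/3)²·σ² + (1/3)²·2σ² = (2/3)·σ², so even these three overlapping queries cut the uncertainty by a third, and a larger family of queries erodes it further.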

Depending on the sensitivity of the data, I would combine a number of strategies to reduce possible inferences by an attacker (a combined sketch follows the list):

  • do not provide a complete list of facets/categories/values to users
  • reject requests that go over some rate limit
  • reject requests that are very specific (e.g. too many facets, too small date ranges)
  • count the true number of matching records for the query
  • raise that count to at least some minimum value
  • derive an RNG seed, for example by hashing the count, current date, and normalized query. Use a cryptographically secure hash function, e.g. SHA-2
  • add random noise to the count. The standard deviation of the noise distribution should match your privacy requirements. Larger SD → more privacy, but less accurate results. Use a cryptographically secure RNG, or just use the hash as a source of random bits if it is long enough.
  • if the query could cover recent/live data: cache the noisy result for a reasonable time in order to mask changes to the true value
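
A sketch of how those steps could fit together (Python; the thresholds, cache lifetime, noise scale, and the day-granularity seed are illustrative assumptions, and the rate limiting and facet checks from the list are omitted for brevity):

import hashlib
import json
import random
import time

MIN_COUNT = 25       # assumed floor for reported counts
NOISE_SCALE = 10.0   # Laplace scale; tie this to your privacy budget
CACHE_TTL = 3600     # cache noisy answers for an hour to mask live changes

_cache = {}          # normalized query -> (expiry timestamp, noisy answer)

def answer_query(normalized_query, true_count):
    now = time.time()
    cached = _cache.get(normalized_query)
    if cached and cached[0] > now:
        return cached[1]

    # Raise the true count to the minimum value.
    count = max(true_count, MIN_COUNT)

    # Derive a deterministic seed from the normalized query, the count and
    # the current day, using a cryptographically secure hash (SHA-256).
    material = json.dumps([normalized_query, count, int(now // 86400)])
    rng = random.Random(hashlib.sha256(material.encode()).digest())

    # Add Laplace noise: the difference of two Exp(1) samples, scaled,
    # is Laplace-distributed; a counting query has L1-sensitivity 1.
    noise = NOISE_SCALE * (rng.expovariate(1.0) - rng.expovariate(1.0))
    noisy = max(MIN_COUNT, round(count + noise))

    # Cache the noisy answer so rapid re-queries cannot watch it move.
    _cache[normalized_query] = (now + CACHE_TTL, noisy)
    return noisy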

If you have very stringent privacy requirements, do not run live queries. Instead, determine all possible queries up front, derive a combined privacy budget from their combined query complexity, then evaluate them together and store the results. Alternatively, determine some privacy budget up front and run live queries, but stop serving responses once the privacy budget has been exhausted.
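
The "stop serving once the budget is exhausted" variant can be as simple as an accountant that charges every answered query against a fixed total ε (the numbers are placeholders, and this only models basic sequential composition):

class PrivacyBudget:
    def __init__(self, total_epsilon=1.0):
        self.remaining = total_epsilon

    def charge(self, epsilon):
        # Sequential composition: each answered query spends part of the
        # total budget; refuse to answer once the budget is used up.
        if epsilon > self.remaining:
            return False
        self.remaining -= epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
if budget.charge(0.1):
    pass  # run the query and return a noisy answer
else:
    pass  # refuse: the privacy budget is exhausted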

  • Excellent! My team's been studying DP for several years; for a real guarantee of "privacy" it's basically the only option. That said, "this can still be difficult to implement appropriately" is an understatement! Implementing DP such that you actually get the security guarantee you wanted is F-ing hard. Consider getting expert help; maybe this guy knows someone who could consult with you. Commented Mar 23, 2023 at 15:28
  • A little anecdote by way of warning: I participated in a survey in which we were assured the results would be anonymous, and in which we were asked to rate our boss on certain characteristics. In one of the questions, all of his staff rated him below 3. Therefore he knew that I had rated him below 3. I think it's almost impossible to avoid that kind of effect. Commented Mar 25, 2023 at 15:08
  • @MichaelKay: actually, if correctly implemented, DP does defend against that kind of effect, and worse. More specifically, DP would place an upper bound on the possible change in your boss's best estimate of the likelihood that you had rated him ≤3, even if he used some side channel to learn everyone else's exact responses! Of course, the tighter that upper bound, the greater the cost in accuracy. Commented Mar 25, 2023 at 19:36

As the other answer points out, total anonymity is difficult to accomplish, especially when considering an adversary who is actively trying to break your data privacy scheme. But total anonymity might not be what you're really trying to achieve.

Ask yourself exactly what you are trying to accomplish. Do you believe your data is valuable and someone is going to try to steal it? Are you trying to comply with a specific data privacy law? Are you trying to satisfy ethical requirements set by an internal or external review board? Are you just trying to do a good job and sleep well at night?

You can only provide "reasonable anonymity" after you commit to a specific threat model or objective. What is reasonable for protecting data in a lawyer's office or health data under HIPAA is different from what is reasonable to protect someone's operating system and hardware revision number.

One point in particular to make: most data privacy laws require some kind of adherence to best practices. This is challenging because best practices change over time. If someone comes up with a new differential-privacy attack tomorrow, and data handling standards change in response, then a company might find itself in violation of the law simply by not keeping up with recent developments. Complying with such laws really requires someone in the organization to own the data handling process and ensure that the company is staying current. Best-practice compliance can also mean regular outside audits or other forms of external verification. There's no simple technical solution here like k-anonymity: this is really an organizational and process problem more than a technical one.

Depending on your privacy requirements, another option could be obfuscating small occurrences by pruning: when the count for a particular filter value is small enough to risk identification, just regroup it under "Other" or something similar.

E.g. say the results for a query about the preferred fruit of people aged between 25 and 35 who own a sidecar are { "apples": 512, "bananas": 317, "lychees": 3, "pineapples": 1 }; then return { "apples": 512, "bananas": 317, "other": 4 } instead.
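
A sketch of that regrouping step (the threshold of 5 is an assumption; use whatever your minimum group size is):

def prune_small_groups(counts, threshold=5):
    # Move every category whose count is below the threshold into "other",
    # so rare combinations cannot single out individuals.
    pruned = {k: v for k, v in counts.items() if v >= threshold}
    other = sum(v for v in counts.values() if v < threshold)
    if other:
        pruned["other"] = pruned.get("other", 0) + other
    return pruned

# prune_small_groups({"apples": 512, "bananas": 317, "lychees": 3, "pineapples": 1})
# -> {"apples": 512, "bananas": 317, "other": 4}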

That's certainly not bulletproof, but it is probably good enough in many cases.

If you want to absolutely ensure anonymity, you'll have to do the aggregation before any queries are run. Any anonymity policy applied to individual queries is vulnerable to data leakage: while it's theoretically possible to design a per-query system that prevents leakage, you can't know for sure that there isn't a vulnerability you've missed. This also means that you can't add data one record at a time; new data will have to be added in batches, with each batch large enough to aggregate safely. The advantage, however, is that there is less concern about securely storing the data. If you store personal information in the database and try to anonymize it after a query has been run, then depending on the jurisdiction you may have serious legal issues, and those issues are likely to be much smaller if the database only ever stores aggregated data.
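
A sketch of what aggregating at ingest time could look like (the field names, key choice, and minimum batch size are assumptions, not part of the answer):

from collections import Counter

MIN_BATCH = 100          # refuse batches too small to aggregate safely (assumed)

aggregated = Counter()   # e.g. ("android", "2023-03", "25-34") -> 512

def ingest_batch(records):
    # Refuse small batches: a batch of one is effectively an individual record.
    if len(records) < MIN_BATCH:
        raise ValueError("batch too small to aggregate safely")
    for r in records:
        aggregated[(r["platform"], r["month"], r["age_band"])] += 1
    # The raw records are discarded; only the aggregated counts are stored.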

One thing that might help would be to deterministically round the value to a multiple of, say, 25. An attacker might still be able to infer information from how the results of different queries add up, though.
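
For instance (a sketch; rounding to the nearest multiple rather than down is my choice):

def rounded_count(true_count, granularity=25):
    # Deterministic rounding: the same true count always maps to the same
    # reported value, so repeating the query leaks nothing new.
    return granularity * round(true_count / granularity)

# rounded_count(137) -> 125, rounded_count(4) -> 0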
