Differential Privacy: The Basics

[Interactive chart: kernel density of the raw vs. privatized distribution, with a slider for the privacy parameter ε]

Description

The graph above is the kernel density estimate of the chosen distribution. I have included randomly generated numbers as well as some values sampled* from the NYC taxi dataset for illustrative purposes. The blue dashed line represents the raw distribution, while the red line shows what happens to its shape when differential privacy is applied. The slider reflects ε, our privacy parameter.

Notice how as the privacy parameter decreases, the distribution becomes increasingly random. Features of the distribution, such as peaks and troughs, start to flatten out, while other (nonsense) features appear. Refreshing the noise makes this clearer. Conversely, as ε increases, the noise distribution starts to match the raw data more closely.

Privacy calculation

So how do we calculate how much noise to add? Let's consider the query. Since we want to map the entire distribution of our sample, our query is simply the identity function: Q(D) = D. Now, the sensitivity calculation requires us to consider how much the query's output changes when one record is removed. If we apply the formula strictly, we run into trouble: we cannot compare sets of different lengths. So how can we get around this? What if we turned our query into some kind of aggregation? That would allow us to compare Q(D) and Q(D'), as the sensitivity formula requires.
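For reference, the formula in question is the standard global sensitivity: the largest possible change in the query's output between two neighboring datasets, Δ(Q) = max ‖Q(D) − Q(D′)‖, where the maximum runs over all pairs D, D′ that differ in exactly one record.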

The noble histogram supplies our aggregation mechanism. By splitting our distribution into several evenly sized buckets (I have used 100 in this example) and counting the number of values that fall into each bucket, we have a framework for privatizing this query. Consider what would happen if we removed one record (any record). The count in one of our buckets would decrease by 1, while all the others would stay the same. Therefore our sensitivity is 1.**
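A minimal sketch of this counting query, assuming NumPy and a synthetic stand-in for the bounded taxi sample (the `values` array and its bounds are hypothetical), shows the sensitivity of 1 directly:

```python
import numpy as np

def histogram_query(values, lower, upper, n_buckets=100):
    """Count how many values fall into each of n_buckets equal-width buckets."""
    counts, edges = np.histogram(values, bins=n_buckets, range=(lower, upper))
    return counts, edges

# Toy data standing in for the bounded taxi sample (hypothetical values).
values = np.random.uniform(0, 50, size=10_000)
counts, edges = histogram_query(values, lower=0, upper=50)

# Removing any single record changes exactly one bucket count by 1 and leaves
# the rest untouched, so the L1 distance between the count vectors is 1.
counts_minus_one, _ = histogram_query(values[1:], lower=0, upper=50)
print(np.abs(counts - counts_minus_one).sum())  # -> 1
```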

It is now straightforward to add a Laplace(1/ε) random variable to the count in each bucket. We can then translate these noisy counts back into a kernel density plot to complete the illustration.
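Here is a sketch of that step under the same assumptions as above; the value of `eps` is an illustrative choice of mine, and the clipping and renormalization at the end are post-processing steps that do not affect the privacy guarantee:

```python
import numpy as np

eps = 0.5  # privacy parameter ε (hypothetical choice)

# Hypothetical bounded sample and the 100-bucket counting query from before.
values = np.random.uniform(0, 50, size=10_000)
counts, edges = np.histogram(values, bins=100, range=(0, 50))

# Sensitivity is 1, so adding Laplace noise with scale 1/ε to each count
# gives ε-differential privacy for the whole histogram.
noisy_counts = counts + np.random.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)

# Post-processing: clip negative counts and renormalize so the buckets
# integrate to 1, ready to be smoothed back into a density plot.
noisy_counts = np.clip(noisy_counts, 0, None)
bucket_width = edges[1] - edges[0]
density = noisy_counts / (noisy_counts.sum() * bucket_width)
```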

* I bounded the sample to a reasonable range so that erroneous or outlying entries did not spoil the visualization. This also simplified the privatization, as unlikely or impossible values were essentially excluded from the universe of possible outcomes.

** Technically, for the taxi queries, the sensitivity could be more than 1, since we would be interested in the difference if one individual (as opposed to one cab ride) were removed from the dataset. Depending on whom we are trying to protect, that individual could be a passenger, a taxi driver, or even a medallion number.