The NYC taxicab dataset is a rich and diverse source of information, and there has been some talk online about some of its more interesting features. The visualizations above display some of these features, and the effects of applying differential privacy to these findings both in aggregate and at the individual level.
This graph shows the average speed driven per hour of the day by taxi drivers in NYC. The top graph is the average for all drivers, while the bottom shows the average speed over the year for just one driver.
One can see an interesting (if unsurprising) pattern - between the busy hours of 8am and 6pm, taxis crawl along at an average of 12mph, after which traffic becomes more manageable and the average speed increases, peaking at 5am before it begins to slow again.
From a privacy perspective, it is clear that the speed for all drivers barely changes even under the most stringent privacy settings. This is intuitive - with roughly 40,000 drivers in New York, the removal of one, even if he is the fastest driver, has a negligible effect on the average.
It is a different story at the individual level. Play with the slider to see the effect on the privatized result. The speed varies wildly at low ε levels, but does approach the true value as ε increases. Therefore, by choosing a reasonable privacy budget, one can still extract meaningful and accurate information while protecting the individual.
This is the query used to obtain the true averages:
SELECT HOUR(TIMESTAMP(pickup_datetime)) AS hour,
ROUND(AVG(trip_distance/trip_time_in_secs*60*60),3) AS speed
WHERE trip_time_in_secs > 10 AND trip_distance < 90
AND (trip_distance/trip_time_in_secs*60*60) < 70 /*i.e. speed < 70; the alias cannot be referenced in WHERE*/
AND hack_license = "39C68074F40525E67E6328A533836A90" /*for individual query*/
GROUP BY hour
ORDER BY hour
To privatize this for all drivers, noise is added to each of the 24 averages. The sensitivity calculation is straightforward: consider the greatest change any of these averages could undergo when a single individual is removed from the dataset. This yields the highly benign sensitivity of 0.000003844.
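As a sketch of this step (the hourly values below are stand-ins, not the real data; only the sensitivity figure comes from the calculation above), adding Laplace(sensitivity/ε) noise to each hourly average might look like:

```python
import numpy as np

rng = np.random.default_rng()

def privatize_averages(averages, sensitivity, epsilon):
    """Add independent Laplace(sensitivity/epsilon) noise to each average."""
    noise = rng.laplace(scale=sensitivity / epsilon, size=len(averages))
    return [a + n for a, n in zip(averages, noise)]

# Stand-in values: 24 hourly speed averages (mph), plus the sensitivity
# quoted above.
hourly_speeds = [12.0] * 24
private = privatize_averages(hourly_speeds, sensitivity=0.000003844, epsilon=0.1)
```

With a sensitivity this small, the noise scale is on the order of 0.00004mph, which is why the aggregate curve barely moves even at strict settings of ε.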
The treatment is different for the individual. As explained in the appendix to this post, we should not simply add noise to the true averages. Rather, we create a series of buckets (I created one for each speed from 8 to 25mph) and increment a bucket by 1 for each actual value that falls in it. Because we consider each of the 24 hours of the day independently, there are 24 sets of buckets, and each count is at most 1. Privatizing is as simple as adding Laplace(1/ε) noise to each of the counts. Technically I should display the whole histogram, but it makes much more sense here to just plot the bucket with the highest privatized count, which is what you see in the graph.
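A minimal sketch of this bucket approach for a single hour's value (the bucket range matches the one described above; rounding the true speed to the nearest bucket is my simplification):

```python
import numpy as np

rng = np.random.default_rng()

def noisy_bucket_speed(true_speed, epsilon, lo=8, hi=25):
    """One-hot histogram over speeds lo..hi mph: the bucket holding the
    true value gets count 1, all others 0. Laplace(1/epsilon) noise is
    added to every count and the bucket with the highest noisy count wins."""
    buckets = list(range(lo, hi + 1))
    counts = np.array([1.0 if b == round(true_speed) else 0.0 for b in buckets])
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    return buckets[int(np.argmax(noisy))]
```

At small ε the winning bucket is close to a uniform draw over 8-25mph; as ε grows, the true bucket wins almost every time - exactly the slider behaviour described above.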
For more details, view the source code behind this page.
The graph above shows the raw and privatized results from querying an individual driver's income. I have included 10 examples here for illustration, although in reality, of course, the privacy budget would shrink with each query due to composition.
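Under basic sequential composition, the ε values of repeated queries simply add up; a back-of-envelope sketch (the per-query budget here is illustrative):

```python
# Ten answers to the same query at epsilon = 0.1 each would together
# consume a total budget of epsilon = 1.0 under basic sequential
# composition - the budget is spent, not reused.
per_query_epsilon = 0.1
num_queries = 10
total_epsilon = per_query_epsilon * num_queries
```

This is why the 10 examples shown are an illustration only: a real deployment would have to split its budget across them.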
What can clearly be seen here is that when ε is low, the results are highly varied, and show little to no correlation with the true values. By clicking "Refresh Noise", we can see how much they vary. With a high privacy parameter, we can get a lot closer, but, as noted below, the accuracy of our answer is bounded by the width of our histogram intervals.
More interesting here from a privacy perspective are the two horizontal lines that represent the true and privatized average incomes respectively. Even with the strictest privacy parameter, the privatized average does not stray too far from the true average. This makes intuitive sense from the context of differential privacy, as no individual's privacy is jeopardised by knowledge of the average.
The individual averages were privatized by considering a query that asks for total income for a certain driver: e.g.
WHERE hack_license = 'A81AB69E50BD54D76B7D6B8A7FD25F6A';
As with the other point queries, this is still a point query: although we aggregate over taxi rides, the individual driver is our true data point. For each driver we create a set of buckets over a reasonable range and increment the count in the bucket containing their actual income. In this case I chose $10,000-wide buckets up to $150,000, and privatized the counts by adding a Laplace(1/ε) random variable to each actual count. The chart shows the midpoint of the bucket with the highest privatized count.
The overall average was privatized differently, as it is a true aggregation. Differential privacy is applied by adding a Laplace(sensitivity/ε) random variable to the true average. The sensitivity is obtained by asking, "how much would the average change if the most extreme individual (in this case the highest earner) were removed from the dataset?" This is $15.55, and explains why the change in the average is so small.
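The sensitivity question can be made concrete with toy numbers (the $15.55 figure comes from the real dataset; the incomes below are hypothetical):

```python
def average_sensitivity(incomes):
    """Largest change the mean can undergo when any one record is removed."""
    n = len(incomes)
    full_mean = sum(incomes) / n
    return max(abs(full_mean - (sum(incomes) - x) / (n - 1)) for x in incomes)

# Hypothetical incomes: nine ordinary earners and one outlier. The
# sensitivity is driven by removing the highest earner.
toy = [40_000.0] * 9 + [150_000.0]
sens = average_sensitivity(toy)
```

With roughly 40,000 real drivers rather than ten, the same calculation shrinks dramatically, which is how a figure as small as $15.55 arises.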
The maps above show the top 100 pickup locations for all drivers (left) and one specific driver (right). The circles represent the actual locations, while the grid squares show the result of privatization.
This data is certainly interesting from an urban study perspective, but could also be used by an adversary to locate a particular taxi driver. For example, by hanging out near 40th & 8th, I would have a high chance of coming across this particular driver.
By adding noise, privacy is assured, as it is no longer possible to pinpoint exact pickup coordinates. As in my other examples, the effect of ε is negligible when looking at all drivers, but at the individual level it would take an incredibly lenient privacy budget to reveal the true data. The use of a grid further blurs the results, and was chosen to best balance privacy and accuracy.
The true coordinates were obtained with the following query:
SELECT CONCAT(ROUND(pickup_longitude,3),',',ROUND(pickup_latitude,3)) AS coordinate,
COUNT(*) AS cnt
WHERE hack_license = "39C68074F40525E67E6328A533836A90" /*for individual query*/
GROUP BY coordinate
ORDER BY cnt DESC
The privacy calculation is very similar to that done for the celebrity queries. A grid was laid across the map, and the count of actual pickups in each square was recorded. A Laplace random variable was added to the count in each square, although here, unlike our other examples, the sensitivity is not equal to 1. Rather, the sensitivity here represents the maximum pickups for any one driver in any location. Examination of the data revealed this to be in the region of 3,000 (which is in itself an interesting result, and explains why our individual map is so unstable when privatized).
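The grid privatization can be sketched in the same way (the grid size and counts below are made up; only the ~3,000 sensitivity comes from the data):

```python
import numpy as np

rng = np.random.default_rng()

def privatize_grid(counts, sensitivity, epsilon):
    """Add Laplace(sensitivity/epsilon) noise to every grid-square count."""
    return counts + rng.laplace(scale=sensitivity / epsilon, size=counts.shape)

# A hypothetical 4x4 grid of pickup counts with a single hotspot.
grid = np.zeros((4, 4))
grid[1, 2] = 5000
noisy = privatize_grid(grid, sensitivity=3000, epsilon=1.0)
```

With noise on the same order as the hotspot itself, it is easy to see why the individual map becomes so unstable once privatized.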
The chord diagram above shows 2013 taxi rides* between the following 6 neighborhoods: East Village, Greenwich Village, Little Italy, Lower East Side, SoHo and West Village. The arcs on the outside represent the total trips taken from the neighborhood, while the chords reflect the number of trips between each source and destination.
The privatized version is again interesting in that it emphasises the importance of the privacy parameter ε. At low levels of privacy there is little change in the diagram, while with strict privacy (ε is small) the results are much less meaningful. Clearly, the optimal result lies somewhere between these two extremes.
It is left to the reader to consider why our aggregate results here are perhaps not quite as stable as seen in the other 3 diagrams.
*Note that this data is for illustration only, as it was taken from a subset of the NYC taxi dataset.
In this case, the query returns a matrix A, with each element A[i][j] giving the number of trips from source i to destination j. As such, there are 6x6=36 counts returned by this query, and the sensitivity is given by the maximum change this matrix could undergo with the elimination of one driver - which is the maximum number of trips taken by any one driver within these neighborhoods.
Since I have restricted the dataset by time for this query, I need only consider the maximum trips taken within this subset - which yields 18. Adding a Laplace(18/ε) random variable to each entry of the matrix guarantees differential privacy.
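That sensitivity can be found by counting each driver's trips within the subset; a toy sketch with made-up records (real driver ids and counts would come from the dataset):

```python
from collections import Counter

# Hypothetical (driver, source, destination) records restricted to the six
# neighborhoods; the busiest driver sets the sensitivity bound.
trips = [("d1", 0, 1)] * 18 + [("d2", 2, 3)] * 5 + [("d3", 1, 1)] * 2
per_driver = Counter(driver for driver, _, _ in trips)
sensitivity = max(per_driver.values())
```

Laplace(sensitivity/ε) noise on each of the 36 matrix entries then yields ε-differential privacy, exactly as in the other count queries.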