To preserve data privacy, companies have traditionally relied on data masking, also known as de-identification. The basic idea is to scrub each record of all personally identifiable information (PII). However, a series of high-profile incidents has shown that even apparently de-identified data can compromise consumer privacy. In 1996, an MIT researcher re-identified the health records of the then-governor of Massachusetts in a supposedly de-identified dataset by matching health information with public voter registration data.
In 2006, UT Austin researchers re-identified the movies watched by thousands of people by linking a supposedly anonymous dataset released by Netflix with public data from IMDB. And in a 2022 Nature publication, researchers used AI to fingerprint and re-identify more than half of the mobile phone records in a supposedly anonymous dataset. All of these examples show how attackers can use “side” information to re-identify apparently anonymized data.
Differential privacy emerged in response to these failures. Instead of sharing data, companies share the results of data processing, mixed with random noise. The noise level is chosen so that the output reveals nothing statistically significant about any target individual to a would-be attacker: the same output could plausibly have come from a database that contained the target or from one that did not. Because the shared results reveal no personal information, everyone’s privacy is protected.
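To make this concrete, here is a minimal sketch of the idea using the Laplace mechanism, one standard way to calibrate the noise; the function name, the toy records, and the epsilon value are illustrative assumptions, not part of any particular product.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(records, epsilon=1.0):
    """Laplace mechanism: a count query plus noise scaled to 1/epsilon.

    A single individual changes the true count by at most 1, so Laplace noise
    with scale 1/epsilon makes the output epsilon-differentially private.
    """
    return len(records) + rng.laplace(scale=1.0 / epsilon)

with_target = ["alice", "bob", "carol", "target"]  # database containing the target
without_target = ["alice", "bob", "carol"]         # same database without the target

# The two noisy answers overlap: an attacker cannot tell which database produced them.
print(noisy_count(with_target))     # roughly 4, plus or minus noise
print(noisy_count(without_target))  # roughly 3, plus or minus noise
```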
In the beginning, operationalizing differential privacy was a huge challenge. The earliest applications came from companies like Apple, Google, and Microsoft, which have large data research and engineering teams. As the technology matures and costs fall, how can any enterprise with a modern data infrastructure apply differential privacy in real-life applications?
Differential privacy is most commonly used to produce differentially private aggregates for analysts who do not have direct access to the data. The sensitive data sits behind an API that only returns noisy responses, protecting privacy. This API can run aggregations over the entire dataset, from simple SQL queries to complex machine learning training jobs.
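The sketch below shows what such a query layer might look like, assuming a pandas DataFrame as the backing store and a simple per-analyst privacy budget; the class, method names, and parameters are hypothetical, invented for illustration rather than taken from any real product.

```python
import numpy as np
import pandas as pd

class PrivateQueryAPI:
    """Hypothetical query layer: analysts never see raw rows, only noisy aggregates."""

    def __init__(self, df: pd.DataFrame, total_budget: float):
        self._df = df                 # raw data stays inside the API
        self._budget = total_budget   # total epsilon the analyst may spend
        self._rng = np.random.default_rng()

    def _spend(self, epsilon: float):
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon

    def count(self, column: str, predicate, epsilon: float = 0.1) -> float:
        """Noisy count of rows where predicate(value) is True (sensitivity 1)."""
        self._spend(epsilon)
        true_count = int(self._df[column].apply(predicate).sum())
        return true_count + self._rng.laplace(scale=1.0 / epsilon)

    def sum(self, column: str, lower: float, upper: float, epsilon: float = 0.1) -> float:
        """Noisy sum of a column clipped to [lower, upper] to bound sensitivity."""
        self._spend(epsilon)
        clipped = self._df[column].clip(lower, upper)
        sensitivity = max(abs(lower), abs(upper))
        return clipped.sum() + self._rng.laplace(scale=sensitivity / epsilon)

# Example: the analyst asks questions without ever touching individual rows.
api = PrivateQueryAPI(pd.DataFrame({"age": [34, 71, 68, 52, 80]}), total_budget=1.0)
print(api.count("age", lambda a: a > 65, epsilon=0.2))
print(api.sum("age", lower=0, upper=100, epsilon=0.2))
```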
One drawback of this approach is that, unlike with data masking, analysts no longer have access to individual records to “get a feel for the data.” One way around this restriction is to supply differentially private synthetic data: fake records, generated by the data owner, that reproduce the statistical properties of the original dataset.
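As a rough illustration of how a data owner might do this for a single categorical column, the sketch below builds a noisy histogram and samples synthetic records from it; real synthetic-data generators model many columns jointly, and the function name and epsilon here are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def dp_synthetic_column(series: pd.Series, n_synthetic: int, epsilon: float = 1.0,
                        rng=None) -> pd.Series:
    """Generate synthetic values for one categorical column.

    Simplest recipe: histogram the column, add Laplace noise to each bin
    (adding or removing one record changes one bin by 1, so sensitivity is 1),
    then sample new records from the normalized noisy histogram.
    """
    rng = rng or np.random.default_rng()
    counts = series.value_counts()
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = noisy.clip(lower=0)          # probabilities cannot be negative
    probs = noisy / noisy.sum()
    return pd.Series(rng.choice(counts.index.to_numpy(),
                                size=n_synthetic,
                                p=probs.to_numpy()))

# Example: release 1,000 synthetic ZIP codes instead of the real ones.
zips = pd.Series(["02139", "02139", "10001", "94103", "10001", "02139"])
print(dp_synthetic_column(zips, n_synthetic=1000).value_counts())
```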