Data anonymization blunders highlight common privacy pitfalls

Eradicating PII details from big data can be more challenging than many would imagine, which has led to numerous data anonymization blunders.

26 January 2023

James Tyrrell

Out in the open: there are many examples of anonymized data blunders that have led to the reidentification of sensitive information. Image credit: Shutterstock Generate.

There are some similarities between cryptography and data anonymization. What appears to be a watertight and unbreakable code turns out to have a weakness, allowing its secrets to be deciphered. Similarly, with data anonymization, what looks like an unrecognizable list of attributes can be turned back into highly sensitive information, often through unforeseen means. In cryptography, developers are strongly advised against home-grown solutions, as these will not have been tested in the wild. Applying well-proven methods lowers the risk of data anonymization blunders too.

Study guide

Resolving the conflict between privacy and progress requires study, and data anonymization expertise is highly prized. Khaled El Emam and Luk Arbuckle’s book ‘Anonymizing Health Data’ runs to more than 200 pages, which gives a sense of the number of factors that data protection officers and their colleagues need to take into account. Automation would certainly help make large data sets available to researchers sooner, but the complexity of the task and the fear of data anonymization blunders often mean adding a manual inspection step to the job sheet.

Turning our attention to where things have gone wrong, there are some classic examples that serve as warnings. It should be said that data anonymization is hard to get right in the internet age, particularly as there’s no accounting for how much time people will spend trying to figure out secrets. AOL discovered this in 2006 when it released, for research purposes, twenty million search queries from what the online service provider believed to be 650,000 anonymized users. Stripping out IP addresses and usernames turned out to be insufficient to protect user privacy. Fascinated internet users poring through the data were able to stitch together clues in the search phrases and link entries to real-life individuals.

Data fingerprints

A few months later, Netflix succumbed to a similar snafu. Hoping to crowdsource the talents of data scientists to predict which films users would want to watch next based on their movie ratings, it made linked, but apparently anonymized, information available. To encourage participation, there was even a $1 million prize for the winning entry. However, it turns out, somewhat unsurprisingly with hindsight, that movie ratings are personal things. If you had a rough idea of a subscriber’s taste in films, as well as when they were likely to have watched them, you had a good chance of identifying that user, despite the steps that Netflix had taken to delete obvious PII fields.
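To make the idea concrete, here is a minimal Python sketch of a linkage attack on a toy data set. It is an illustration only, not the method used against the Netflix data: the records, the pseudonymous IDs, the function names (match_score, best_candidates) and the tolerance values are all invented for this example. It assumes each released record holds a star rating and a date per film, and that the attacker knows a couple of a target’s ratings only approximately.

```python
from datetime import date

# Toy "anonymized" release: pseudonymous ID -> {title: (stars, date rated)}.
# Every record below is made up for illustration.
RELEASED = {
    "user_0193": {
        "Film A": (5, date(2005, 3, 1)),
        "Film B": (2, date(2005, 3, 9)),
        "Film C": (4, date(2005, 6, 20)),
    },
    "user_0572": {
        "Film A": (4, date(2005, 3, 3)),
        "Film B": (5, date(2005, 8, 15)),
        "Film D": (3, date(2005, 7, 2)),
    },
    "user_0841": {
        "Film B": (2, date(2005, 11, 30)),
        "Film C": (1, date(2005, 6, 19)),
        "Film D": (5, date(2005, 1, 12)),
    },
}

# What the attacker roughly knows about one person, e.g. from public posts.
KNOWN = {
    "Film B": (2, date(2005, 3, 10)),
    "Film C": (4, date(2005, 6, 18)),
}


def match_score(record, known, star_tol=1, day_tol=14):
    """Count how many known observations are consistent with a record."""
    score = 0
    for title, (stars, when) in known.items():
        if title not in record:
            continue
        rec_stars, rec_when = record[title]
        close_stars = abs(rec_stars - stars) <= star_tol
        close_date = abs((rec_when - when).days) <= day_tol
        if close_stars and close_date:
            score += 1
    return score


def best_candidates(released, known, star_tol=1, day_tol=14):
    """Return the pseudonymous IDs with the highest score, plus that score."""
    scores = {
        uid: match_score(rec, known, star_tol, day_tol)
        for uid, rec in released.items()
    }
    top = max(scores.values())
    return sorted(uid for uid, s in scores.items() if s == top), top


candidates, score = best_candidates(RELEASED, KNOWN)
print(candidates, score)
```

With only two approximate observations, a single record in this toy release matches on both, which is the essence of the problem: no name or email address is needed when the pattern of ratings is itself distinctive.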

Unique and identifying characteristics

Browsers themselves can have characteristics that are sufficient to fingerprint their users, and privacy tools such as ad-blockers can have unintended consequences: the more unusual the browser setup, the easier it becomes to narrow the field down to individuals. The Electronic Frontier Foundation has a useful web app, dubbed ‘Cover Your Tracks’, which gives an overview of your browser’s most unique and identifying characteristics. Digital fingerprints can be unintended but raise privacy concerns nonetheless. As far back as 2012, researchers pointed out that power signatures collected by smart meters have the capacity to reveal what types of appliances are in use and when. The data could even reveal which programs are being watched on TV, based on changes in screen brightness.
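One rough way to see why an unusual setup stands out is to count how many visitors share each combination of attributes. The Python sketch below does this for a handful of invented visitors. It is not how the EFF tool works internally; the attribute choices, the sample data and the function names (anonymity_set_size, surprisal_bits) are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical visitor log: (browser, screen size, time zone, ad-blocker).
COMMON = ("Chrome", "1920x1080", "UTC+0", "no-adblock")
LESS_COMMON = ("Firefox", "1920x1080", "UTC+0", "no-adblock")
RARE_1 = ("Firefox", "2560x1440", "UTC+1", "adblock")
RARE_2 = ("Safari", "1440x900", "UTC-5", "no-adblock")

VISITORS = [COMMON] * 4 + [LESS_COMMON] * 2 + [RARE_1, RARE_2]


def anonymity_set_size(fingerprint, visitors):
    """Number of visitors who share exactly this combination of attributes."""
    return Counter(visitors)[fingerprint]


def surprisal_bits(fingerprint, visitors):
    """Bits of identifying information: -log2(share of visitors who match)."""
    share = anonymity_set_size(fingerprint, visitors) / len(visitors)
    return -math.log2(share)


for fp in (COMMON, LESS_COMMON, RARE_1):
    size = anonymity_set_size(fp, VISITORS)
    bits = surprisal_bits(fp, VISITORS)
    print(fp, "shared by", size, "visitor(s),", bits, "bits")
```

A visitor whose combination is shared by half the log gives away one bit, while a visitor whose combination appears only once among eight gives away three bits and is singled out completely in this sample.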

The nature of the modern world can make some data anonymization blunders tough to anticipate, but that shouldn’t stop us from doubling down on best practices. And there’s always plenty to be learned from mistakes.