Data anonymization blunders highlight common privacy pitfalls

Eradicating PII details from big data can be more challenging than many would imagine, which has led to numerous data anonymization blunders.
26 January 2023

Out in the open: there are many examples of anonymized data blunders that have led to the reidentification of sensitive information. Image credit: Shutterstock Generate.

There are some similarities between cryptography and data anonymization. What appears to be a watertight and unbreakable code turns out to have a weakness, allowing its secrets to be deciphered. Similarly, with data anonymization, what looks like an unrecognizable list of attributes can be transformed into highly sensitive information, often through unforeseen means. In cryptography, developers are strongly advised against home-grown solutions, as these will not have been tested in the wild. And applying well-proven methods lowers the risk of data anonymization blunders too.

Big data can have tremendous value in areas such as health research, and locking down this information risks holding back medical progress. Some business models – for example, in the emerging fintech sector – can be dependent on aggregated data to provide features and services that set firms apart from their traditional competitors. But, at the same time, data protection officers will be well aware of compliance responsibilities, including those under the General Data Protection Regulation (GDPR).

According to GDPR, data records shouldn’t identify individuals without their consent. But, thanks to social media, product reviews, and a ton of other information circulating on the web, data anonymization can be challenging, as many organizations have discovered to their cost. What’s more, data anonymization – in the strict sense – refers to information that has been completely decoupled from any personal details. Simple steps such as replacing a customer name with an ID number, so-called pseudonymization, aren’t sufficient for GDPR purposes, as the data is only one security failure away from becoming identifiable. For example, anonymity needs to be preserved even if a customer list were to be unexpectedly published on the internet.
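To make the distinction concrete, here is a minimal Python sketch of pseudonymization and of how it unravels once the lookup table leaks. The function names, field names, and sample records are all hypothetical and chosen purely for illustration; real pipelines will differ.

```python
import secrets


def pseudonymize(records, name_field="name"):
    """Swap each name for a random ID. Returns (pseudonymized records, lookup table)."""
    lookup = {}  # name -> pseudonym; whoever holds this table can undo everything
    output = []
    for record in records:
        name = record[name_field]
        if name not in lookup:
            lookup[name] = "user-" + secrets.token_hex(4)
        cleaned = {k: v for k, v in record.items() if k != name_field}
        cleaned["id"] = lookup[name]
        output.append(cleaned)
    return output, lookup


def reidentify(pseudonymized, leaked_lookup):
    """What an attacker can do if the lookup table is ever published."""
    reverse = {pid: name for name, pid in leaked_lookup.items()}
    return [{**r, "name": reverse[r["id"]]} for r in pseudonymized]


# Made-up sample data (assumption, not from any real dataset).
records = [
    {"name": "Alice Example", "purchase": "insulin"},
    {"name": "Bob Example", "purchase": "running shoes"},
    {"name": "Alice Example", "purchase": "test strips"},
]

pseudo, lookup = pseudonymize(records)
restored = reidentify(pseudo, lookup)
```

The pseudonymized rows carry no names, yet the same person still maps to the same ID, and a single leaked table restores every name. That is why this approach, on its own, falls short of anonymization in the strict sense.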

Study guide

Resolving the conflict between privacy and progress requires study. And data anonymization expertise is highly prized. Khaled El Emam and Luk Arbuckle’s book, ‘Anonymizing Health Data’, runs to more than 200 pages, highlighting the number of factors that data protection officers and their colleagues need to take into account. Automation would certainly help make large data sets available sooner to researchers, but the complexity of the task and fear of data anonymization blunders often mean adding a manual inspection step to the job sheet.

Turning our attention to where things have gone wrong, there are some classic examples that serve as warnings. It should be said that data anonymization is hard to get right in the internet age – particularly as there’s no accounting for how much time people will spend trying to figure out secrets. AOL discovered this in 2006 when it released, for research purposes, twenty million search queries from what the online service provider believed to be 650,000 anonymized users. Stripping out IP addresses and usernames turned out to be insufficient to protect user privacy. Fascinated internet users poring through the data were able to stitch together clues in the search phrases to link entries to real-life individuals.

Data fingerprints

A few months later, Netflix succumbed to a similar snafu. Motivated to crowdsource the talents of data scientists in solving the puzzle of how to predict what films users will want to watch next based on their movie reviews, it made linked, but apparently anonymized, information available. And, to encourage participation, there was even a $1 million prize for the winning entry. However, it turns out – somewhat unsurprisingly with hindsight – that movie reviews are personal things. And, if you have a rough idea about subscribers’ taste in films as well as when they are likely to have watched them, you have a good chance of identifying users – despite the steps that Netflix had taken to delete obvious PII fields.
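The idea behind this kind of linkage can be sketched in a few lines of Python. This is not the method used against the Netflix data; it is a simplified illustration in which the user IDs, film titles, ratings, dates, and tolerance values are all invented. It scores each "anonymized" user by how many of an outsider's rough observations fit that user's record.

```python
from datetime import date


def match_score(candidate, known, rating_tol=1, day_tol=14):
    """Count how many known (film -> (rating, date)) observations fit a candidate user."""
    score = 0
    for film, (rating, seen) in known.items():
        if film in candidate:
            c_rating, c_date = candidate[film]
            close_rating = abs(c_rating - rating) <= rating_tol
            close_date = abs((c_date - seen).days) <= day_tol
            if close_rating and close_date:
                score += 1
    return score


def rank_candidates(released, known):
    """Return (user_id, score) pairs, best match first."""
    scores = {uid: match_score(ratings, known) for uid, ratings in released.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical released data: names removed, ratings and dates intact.
released = {
    "u1001": {"Film A": (5, date(2005, 3, 1)), "Film B": (2, date(2005, 3, 9)),
              "Film C": (4, date(2005, 4, 2))},
    "u1002": {"Film A": (4, date(2005, 6, 20)), "Film D": (3, date(2005, 7, 1))},
    "u1003": {"Film B": (1, date(2005, 3, 10)), "Film C": (1, date(2005, 9, 15)),
              "Film D": (5, date(2005, 5, 5))},
}

# What an outsider roughly knows about one person (for example, from public reviews).
known = {
    "Film A": (5, date(2005, 3, 3)),
    "Film B": (3, date(2005, 3, 12)),
    "Film C": (4, date(2005, 3, 28)),
}

ranked = rank_candidates(released, known)
```

In this toy data only one record fits all three observations, even though the outsider's ratings and dates are approximate. A handful of imprecise data points can be enough to single someone out.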

Data anonymization blunders can be found in various quarters. And geo-tagged information can be particularly problematic, as Strava discovered when a heat map released by the fitness software company showing aggregated data attracted security concerns. The visualization, which combined global data collected over two years, suddenly made clear that wearers of Garmin watches and other fitness trackers were exercising in some unexpected places. Data science enthusiasts soon made the link between the geo-coordinates of the running and walking routes and the location of military bases. And while the location of military bases can be deduced from satellite images, the concern here, from a national security standpoint – as noted by the BBC – is that the heat map revealed which facilities were most active, and included paths taken by personnel.

Anonymized geo-data is particularly sensitive to reidentification as people often engage in predictable activity, such as commuting to work, which can soon reveal their home and employment locations. And this applies regardless of your job title. The New York Times Privacy Project was able to deanonymize location data and track the whereabouts of then President of the United States, Donald Trump, as shown on a browser-based map.
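A rough Python sketch shows why routine gives the game away. It assumes a stream of timestamped pings for a single device ID, and it uses a simple heuristic: the most frequent night-time location is treated as home, and the most frequent weekday office-hours location as work. The coordinates, time windows, and function names are illustrative assumptions rather than a description of any real analysis.

```python
from collections import Counter
from datetime import datetime


def cell(lat, lon, precision=3):
    """Snap a coordinate to a coarse grid cell (3 decimals is roughly 100 m)."""
    return (round(lat, precision), round(lon, precision))


def infer_home_and_work(pings):
    """pings: iterable of (timestamp, lat, lon) for one pseudonymous device."""
    night, office_hours = Counter(), Counter()
    for ts, lat, lon in pings:
        if ts.hour >= 22 or ts.hour < 6:
            night[cell(lat, lon)] += 1
        elif ts.weekday() < 5 and 9 <= ts.hour < 17:
            office_hours[cell(lat, lon)] += 1
    home = night.most_common(1)[0][0] if night else None
    work = office_hours.most_common(1)[0][0] if office_hours else None
    return home, work


# Invented pings for one device over a few days.
pings = [
    (datetime(2023, 1, 23, 23, 10), 51.5001, -0.1202),
    (datetime(2023, 1, 24, 2, 30), 51.5003, -0.1199),
    (datetime(2023, 1, 24, 10, 0), 51.5152, -0.0899),
    (datetime(2023, 1, 24, 14, 45), 51.5149, -0.0902),
    (datetime(2023, 1, 24, 19, 0), 51.5300, -0.1000),
    (datetime(2023, 1, 25, 11, 15), 51.5151, -0.0901),
]

home, work = infer_home_and_work(pings)
```

Once a home and a workplace are paired, the combination is often distinctive enough to be looked up in public records, with no name ever appearing in the data.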

Unique and identifying characteristics

Browsers themselves can have characteristics that are sufficient to fingerprint their users. And ad-blockers can have some unintended consequences. In fact, the more unusual the browser setup, the easier it becomes to single out individuals. The Electronic Frontier Foundation has a useful web app, dubbed ‘Cover your tracks’, which gives an overview of your browser’s most unique and identifying characteristics. Digital fingerprints can be unintended but raise privacy concerns nonetheless. As far back as 2012, researchers pointed out that power signatures collected by smart meters have the capacity to reveal what types of appliances are in use and when. And the data could even fingerprint which programs are being watched on TV based on changes in the screen brightness.
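One common way to reason about browser uniqueness is in bits of identifying information: an attribute value shared by one browser in 64 contributes 6 bits. The Python sketch below adds up those bits across attributes. The shares and the population figure are made-up numbers, and treating attributes as statistically independent is a simplifying assumption that real measurements would not fully support.

```python
import math


def surprisal_bits(share):
    """Bits of identifying information in a value held by `share` of all browsers."""
    return -math.log2(share)


# Hypothetical fraction of browsers sharing each of this browser's attribute values.
shares = {
    "user_agent": 1 / 64,
    "timezone": 1 / 8,
    "screen_size": 1 / 16,
    "installed_fonts": 1 / 256,
}

# Assuming independence, bits simply add up: 6 + 3 + 4 + 8.
total_bits = sum(surprisal_bits(s) for s in shares.values())

# Expected number of browsers with the same combination in a hypothetical population.
population = 10_000_000
expected_lookalikes = population / 2 ** total_bits
```

With 21 bits in this example, only a handful of browsers in ten million would be expected to share the same combination, which is why an unusual extension or font set can make a browser easier, not harder, to track.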

The nature of the modern world can make some data anonymization blunders tough to anticipate, but that shouldn’t stop us from doubling down on best practices. And there’s always plenty to be learned from mistakes.