Synthetic data — all the perks without the risk?

By creating synthetic data, organizations can take away the privacy risk from software and tool development — but just how far can we take it?
22 October 2020
  • As organizations weigh the risk of cyber-attacks and malware that could inflict significant revenue losses on the business, synthetic data could provide added accessibility to data without compromising privacy

Data has incredible importance for every business today. By developing specific software and scripts, organizations can extract insights and value from it that can ultimately inform strategy, shape business models, and optimize customer communications, among much else.

These tools and software, and the results they churn out, need to be rigorously tested and validated. But in an age of continued digitization, where privacy and protection of personal information is a priority for organizations, running these tests with the real thing is not always the most prudent idea. While a combination of policies, protocols, and cybersecurity tools is crucial for safeguarding customer data, the risk of using actual customer data is never eliminated entirely. Organizations that work this way face legal consequences and reputational damage if that data is misused or compromised in the process of developing new tools.

To take advantage of big data without falling foul of the law, big data players are turning to a fairly new solution: synthetic data.

What is synthetic data?

Much as a scientist might create a synthetic material to conduct experiments at low risk, synthetic data can be used as a stand-in for real data to serve the same function. But it must resemble real-world data to be effective: while entirely invented and containing no authentic records, it must have the same mathematical and statistical properties as the real-world dataset it replaces.
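
To make that idea concrete, here is a minimal, purely illustrative Python sketch: it fits the mean and covariance of an invented numeric dataset and then samples new rows that share those statistics. The column names and distributions are hypothetical, and real synthetic-data tools handle mixed data types and far richer structure than this.

```python
import numpy as np
import pandas as pd

# Minimal illustration only: fit the mean and covariance of a numeric dataset
# and sample new rows with matching first- and second-order statistics.
rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset (values are entirely invented).
real = pd.DataFrame({
    "transaction_amount": rng.gamma(shape=2.0, scale=40.0, size=1_000),
    "time_on_page_secs": rng.normal(loc=75.0, scale=20.0, size=1_000),
})

mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Draw synthetic rows from a multivariate normal with the fitted moments.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1_000),
    columns=real.columns,
)

print(real.describe().round(1))
print(synthetic.describe().round(1))
```

A simple generator like this preserves the means and correlations but not, for instance, the skew of the first column, which is why more sophisticated generators are used in practice, as discussed below.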

The concept of synthetic data isn't new, but more organizations are looking towards it in light of the increased risk of cyber-attacks and malware that could lead to significant revenue losses for the business. With many businesses now operating remotely, and facing greater cybersecurity vulnerabilities as a result, synthetic data offers a viable means for individuals and departments across an organization to work with consistent datasets without risk.

Most recently, MIT researchers developed a set of open-source tools to expand data access without compromising privacy. The project, called the Synthetic Data Vault, grew out of a common scenario in which artificial data was created for students to work with instead of the sensitive original. Before that, developers would hand-build a simplified version of the data they needed, only to see systems crash when that data was put to real use.

Following rounds of imperfect attempts, the researchers eventually managed to generate highly realistic data through an artificial intelligence (AI)-powered generator built on generative adversarial networks (GANs). The ability to generate such precise data has garnered plenty of fresh interest from several sectors; banking in particular, with its increased digitization and new data privacy rules, has already been marked as an ideal area for application.
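
For readers curious about the mechanics, the sketch below shows the bare-bones adversarial setup behind such generators, applied to a single invented numeric feature in PyTorch. It is not the Synthetic Data Vault's implementation; production tabular GANs handle categorical columns, conditioning, and much longer training.

```python
import torch
import torch.nn as nn

# Bare-bones GAN sketch for one numeric feature (illustrative only).
torch.manual_seed(0)
real_data = torch.randn(10_000, 1) * 15 + 100   # invented "real" feature

generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                              nn.Linear(32, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2_000):
    batch = real_data[torch.randint(0, len(real_data), (128,))]
    fake = generator(torch.randn(128, 8))

    # Discriminator step: label real samples 1, generated samples 0.
    d_opt.zero_grad()
    d_loss = (loss_fn(discriminator(batch), torch.ones(128, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(128, 1)))
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(128, 1))
    g_loss.backward()
    g_opt.step()

with torch.no_grad():
    synthetic = generator(torch.randn(1_000, 8))
print(float(real_data.mean()), float(synthetic.mean()))
```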

Here are some of the advantages and drawbacks of using synthetic data.

The advantages

Given data security-related legal obligations like the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR), synthetic data can help organizations manage their privacy obligations as lawmakers place growing emphasis on these issues.

While the CCPA does not reference synthetic data, it expressly excludes de-identified or anonymized data from its scope. The GDPR likewise makes no mention of it, but states that its provisions do not apply to anonymous information. In effect, these major global regulatory mandates do not apply to the collection, storage, and use of synthesized data.

That gives organizations far more leeway to experiment with synthetic datasets, providing the breathing room for innovation to take place.

Creating synthetic data with a generative model is also far more cost-effective and efficient than collecting it from the real world. Building autonomous vehicle systems, for example, takes hundreds of hours on the road so that systems can continually identify and learn new scenarios and events. Engineers can expedite the process by creating synthetic data that mimics the uncertainty of real driving conditions, accelerating the pace of machine learning.
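
As a purely hypothetical illustration of that augmentation idea, the sketch below generates a table of invented "driving log" rows and deliberately oversamples a rare hazard event so a learning system sees it far more often than hundreds of road hours would naturally provide. Real autonomous-driving pipelines rely on full sensor and scene simulation rather than simple tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 50_000

# Invented, simplified "driving log" features.
frames = pd.DataFrame({
    "speed_kmh": rng.normal(55, 20, n).clip(0, 130),
    "distance_to_lead_m": rng.gamma(3.0, 15.0, n),
    "rain": rng.random(n) < 0.10,
})

# Inject the rare event (e.g. a sudden pedestrian crossing) at a rate far
# above its real-world frequency so the model encounters it often.
frames["pedestrian_crossing"] = rng.random(n) < 0.05

print(frames["pedestrian_crossing"].mean())  # roughly 5% of rows
```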

As well as using GANs to create new data modeled closely on an original dataset, synthetic data can be created by stripping personal information such as names, birthdates, and addresses from real datasets until they are completely anonymized. Important financial or behavioral information that is free of any identifiable data (such as transaction amount, purchase date, or time on page) can then be analyzed and used to safely monetize datasets without compromising privacy.
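
A minimal sketch of that column-dropping approach, with hypothetical column names, might look like the following. Note that genuine de-identification also has to consider quasi-identifiers (combinations such as birthdate plus postcode), not just the obviously personal fields.

```python
import pandas as pd

# Invented example records; only the column-dropping step is illustrated.
customers = pd.DataFrame({
    "name": ["A. Example", "B. Example"],
    "birthdate": ["1980-02-01", "1992-07-14"],
    "address": ["1 Example St", "2 Sample Ave"],
    "transaction_amount": [42.50, 17.99],
    "purchase_date": ["2020-10-01", "2020-10-03"],
    "time_on_page_secs": [88, 132],
})

PII_COLUMNS = ["name", "birthdate", "address"]

# Keep only the behavioral and financial fields described above.
anonymized = customers.drop(columns=PII_COLUMNS)
print(anonymized)
```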

As data use becomes a central component of how businesses operate, there is greater demand for individuals with the skills and expertise to work with and manage large datasets. But with a shortage of data scientists, organizations are increasingly looking in-house to upskill talent with the necessary credentials. Synthetic datasets can be used in training programs as a risk-free way for individuals to get to grips with data management, modeling, and compliance.

The drawbacks

Synthetic data can closely resemble the properties of authentic data, but it doesn't reflect the original content exactly. GAN models seek out common trends and can therefore miss important outliers in the original dataset. As such, the models, tools, or software developed on synthetic data won't necessarily be as accurate as expected. In some cases this won't matter much; in others it could pose a critical issue.
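
One quick, informal way to spot that problem is to compare the tails of the real and the synthetic distributions. The sketch below uses invented data and a hypothetical column; a shrunken upper quantile in the synthetic column is the warning sign that extremes have been smoothed away.

```python
import numpy as np
import pandas as pd

def tail_report(real: pd.Series, synthetic: pd.Series) -> pd.DataFrame:
    """Compare upper-tail quantiles of a real and a synthetic column."""
    quantiles = [0.95, 0.99, 0.999]
    return pd.DataFrame({
        "real": real.quantile(quantiles),
        "synthetic": synthetic.quantile(quantiles),
    })

rng = np.random.default_rng(1)
real = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=100_000))
# Invented "synthetic" column whose generator under-represents the tail.
synthetic = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=100_000))

print(tail_report(real, synthetic).round(1))
```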

The quality of synthetic data depends on the model that created it. GANs can recognize statistical regularities but are also susceptible to statistical 'noise': adversarial perturbations can cause a model to misclassify data entirely and produce inaccurate outputs as a result. In these cases, human-annotated data must be fed into the model to assess whether its outputs are consistent.

Ensuring the reliability of synthetic data also requires a verification server that performs identical analyses on both the synthetic dataset and the authentic dataset it's based on. This confirms the two sets are consistent and that results derived from the synthetic dataset aren't built on faulty assumptions.
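
A stripped-down illustration of that verification idea, using invented data and hypothetical column names: the same analysis (here, a simple least-squares fit of spend against time on page) is run on both datasets and the results are compared against a tolerance. A real verification service would re-run whatever queries and models the synthetic data is actually used for.

```python
import numpy as np
import pandas as pd

def fit_slope(df: pd.DataFrame) -> float:
    """Fit transaction_amount ~ time_on_page_secs by ordinary least squares."""
    X = np.column_stack([np.ones(len(df)), df["time_on_page_secs"]])
    coef, *_ = np.linalg.lstsq(X, df["transaction_amount"], rcond=None)
    return float(coef[1])

def consistent(real_df: pd.DataFrame, synth_df: pd.DataFrame,
               tolerance: float = 0.1) -> bool:
    """Run the identical analysis on both datasets and compare the results."""
    real_slope, synth_slope = fit_slope(real_df), fit_slope(synth_df)
    return abs(real_slope - synth_slope) <= tolerance * abs(real_slope)

# Invented datasets sharing the same underlying relationship.
rng = np.random.default_rng(3)
t_real = rng.uniform(10, 300, 5_000)
real_df = pd.DataFrame({
    "time_on_page_secs": t_real,
    "transaction_amount": 0.2 * t_real + rng.normal(0, 5, 5_000),
})
t_synth = rng.uniform(10, 300, 5_000)
synth_df = pd.DataFrame({
    "time_on_page_secs": t_synth,
    "transaction_amount": 0.2 * t_synth + rng.normal(0, 5, 5_000),
})

print(consistent(real_df, synth_df))  # True if the fitted slopes roughly agree
```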