The true promise of synthetic data
Researchers at MIT launch the Synthetic Data Vault, a set of open source tools aimed at expanding access to data without compromising privacy.
Every year, the world generates more data than the previous year. In 2020 alone, an estimated 59 zettabytes of data will be “created, captured, copied and consumed,” according to the International Data Corporation, enough to fill about a trillion 64-gigabyte hard drives.
But just because data is proliferating doesn’t mean everyone can use it. Companies and institutions, legitimately concerned about the privacy of their users, often restrict access to datasets – sometimes even within their own organizations. And now that the Covid-19 pandemic has closed laboratories and offices, preventing people from visiting centralized data warehouses, sharing information securely is even more difficult.
Without access to data, it is difficult to build tools that actually work. Enter synthetic data: artificial information that developers and engineers can use as a stand-in for real data.
Synthetic data is a bit like diet soda. To be effective, it has to resemble the “real thing” in certain ways. Diet soda should look, taste, and fizz like regular soda. Similarly, a synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it stands in for. “It looks similar, and has a similar format,” says Kalyan Veeramachaneni, principal investigator of the Data to AI (DAI) Lab and a principal research scientist at MIT’s Laboratory for Information and Decision Systems. Run through a model, or used to build or test an application, it performs as the real-world data would.
But – just as diet soda must have fewer calories than the regular variety – a synthetic dataset must also differ from a real one in crucial ways. If it is based on a real dataset, for example, it must not contain, or even hint at, any of the information in that dataset.
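Both requirements can be checked mechanically. The sketch below is purely illustrative – the column values and the similarity threshold are made up, and this is not how the SDV itself works – but it shows the two tests side by side: column statistics should match, and no real record should leak through verbatim.

```python
import statistics

# Toy "real" dataset: (age, resting_heart_rate) pairs -- hypothetical values.
real = [(34, 72), (51, 68), (29, 80), (62, 64), (45, 70)]
# A synthetic stand-in: statistically similar rows, none copied from `real`.
synthetic = [(33, 74), (55, 66), (31, 79), (60, 65), (47, 69)]

def column_means(rows):
    return [statistics.mean(col) for col in zip(*rows)]

real_means = column_means(real)
synth_means = column_means(synthetic)

# 1) Statistical similarity: column means should be close (toy threshold).
similar = all(abs(r - s) < 5 for r, s in zip(real_means, synth_means))

# 2) Privacy: no synthetic row may be a verbatim copy of a real record.
leaked = set(real) & set(synthetic)

print(similar, len(leaked))  # True 0
```

Real evaluations compare far more than column means (distributions, correlations, model performance), but the diet-soda tradeoff – alike in the aggregate, different in the particulars – is the same.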
Threading this needle is difficult. After years of work, Veeramachaneni and his collaborators recently unveiled a set of open source data generation tools – a one-stop shop where users can get as much data as they need for their projects, in formats from tables to time series. They call it the Synthetic Data Vault.
Maximize access while maintaining privacy
Veeramachaneni and his team first attempted to create synthetic data in 2013. They had been tasked with analyzing a large amount of information from the edX online learning program, and they wanted to bring in some MIT students to help. The data was sensitive and couldn’t be shared with these new hires, so the team decided to create artificial data that the students could work with instead – thinking that “once they wrote the processing software, we could use it on the real data,” says Veeramachaneni.
This is a common scenario. Imagine that you are a software developer hired by a hospital. You have been asked to build a dashboard that allows patients to access their test results, prescriptions, and other health information. But you are not allowed to see any actual patient data, because it is private.
Most developers in this situation will make “a very simplistic version” of the data they need and do the best they can, says Carles Sala, a researcher in the DAI Lab. But when the dashboard goes live, there’s a good chance that “everything will fall apart,” he says, “because there are some edge cases they weren’t taking into account.”
High-quality synthetic data, as complex as what it is intended to replace, would help solve this problem. Companies and institutions could share it freely, allowing teams to work more collaboratively and efficiently. Developers could even carry it around on their laptops, knowing they weren’t putting any sensitive information at risk.
Refine the formula – and handle the constraints
In 2013, the Veeramachaneni team gave themselves two weeks to create the data pool they needed for that edX project. The timeline “seemed really reasonable,” says Veeramachaneni. “But we failed completely.” They soon realized that if they built a series of synthetic data generators, they could make the process faster for everyone else.
In 2016, the team completed an algorithm that accurately captures correlations between the different fields in a real dataset – think of a patient’s age, blood pressure, and heart rate – and creates a synthetic dataset that preserves those relationships without any identifying information. When data scientists were asked to solve problems using this synthetic data, their solutions were as effective as those made with real data 70 percent of the time. The team presented this research at the 2016 IEEE International Conference on Data Science and Advanced Analytics.
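As a simplified illustration of the general idea – preserving correlations without copying records – the sketch below fits the means, standard deviations, and Pearson correlation of two hypothetical columns, then samples brand-new rows from that Gaussian model. The column names and values are invented, and the team’s actual algorithm is far more sophisticated than this two-column toy.

```python
import math
import random

random.seed(0)

# Hypothetical paired measurements (e.g., age and systolic blood pressure).
ages = [30, 40, 50, 60, 70, 35, 45, 55, 65, 75]
bps  = [115, 120, 128, 135, 142, 118, 124, 131, 138, 145]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def corr(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

rho = corr(ages, bps)  # correlation learned from the "real" data

def sample(n):
    """Draw synthetic (age, bp) rows that preserve the learned correlation."""
    rows = []
    for _ in range(n):
        z1 = random.gauss(0, 1)
        # Correlate the second variable with the first, then rescale.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        rows.append((mean(ages) + std(ages) * z1,
                     mean(bps) + std(bps) * z2))
    return rows

synthetic = sample(5000)
synth_rho = corr([a for a, _ in synthetic], [b for _, b in synthetic])
print(round(rho, 2), round(synth_rho, 2))  # the two correlations should be close
```

No real row appears in the output – only rows drawn from a model of the real rows – which is exactly the property that let the edX data be handed to students.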
For the next step, the team delved into the machine learning toolbox. In 2019, DAI Lab PhD student Lei Xu presented a new algorithm, CTGAN, at the 33rd Conference on Neural Information Processing Systems in Vancouver. CTGAN (for “conditional tabular generative adversarial networks”) uses GANs to build and refine tables of synthetic data. GANs are pairs of neural networks that “play against each other,” says Xu. The first network, called a generator, creates something – in this case, a row of synthetic data – and the second, called a discriminator, tries to tell whether it is real or fake.
“Eventually, the generator can generate perfect [data], and the discriminator cannot tell the difference,” says Xu. GANs are best known for generating artificial images, but they also work well for synthetic data: CTGAN outperformed classic synthetic data creation techniques in 85 percent of the cases tested in Xu’s study.
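A toy version of that adversarial loop fits in a few dozen lines. The sketch below is not CTGAN: it trains a linear generator against a logistic-regression discriminator on a single made-up numeric column, with hand-derived gradients instead of a deep-learning framework. But it shows the generator/discriminator dynamic Xu describes.

```python
import math
import random

random.seed(1)

def sigmoid(s):
    s = max(-60.0, min(60.0, s))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-s))

# "Real" data: samples from a hypothetical distribution, N(mean=5, std=1).
def sample_real(n):
    return [random.gauss(5, 1) for _ in range(n)]

# Generator: turns noise z ~ N(0, 1) into a fake sample x = w*z + b.
w, b = 1.0, 0.0
# Discriminator: scores a sample, d(x) = sigmoid(u*x + c).
u, c = 0.0, 0.0

lr, batch = 0.05, 32
for step in range(3000):
    zs = [random.gauss(0, 1) for _ in range(batch)]
    fakes = [w * z + b for z in zs]
    reals = sample_real(batch)

    # --- Discriminator update: push d(real) -> 1 and d(fake) -> 0 ---
    d_r = [sigmoid(u * x + c) for x in reals]
    d_f = [sigmoid(u * x + c) for x in fakes]
    grad_u = (sum(-(1 - dr) * xr for dr, xr in zip(d_r, reals)) +
              sum(df * xf for df, xf in zip(d_f, fakes))) / batch
    grad_c = (sum(-(1 - dr) for dr in d_r) + sum(d_f)) / batch
    u -= lr * grad_u
    c -= lr * grad_c

    # --- Generator update: push d(fake) -> 1 (fool the discriminator) ---
    d_f = [sigmoid(u * x + c) for x in fakes]
    grad_x = [-(1 - df) * u for df in d_f]          # dLoss_G / dx_fake
    w -= lr * sum(g * z for g, z in zip(grad_x, zs)) / batch
    b -= lr * sum(grad_x) / batch

# After training, generated samples should cluster near the real mean (5).
generated = [w * random.gauss(0, 1) + b for _ in range(1000)]
print(round(sum(generated) / len(generated), 1))
```

In a real tabular GAN the generator emits whole rows (including categorical columns, which is much of what CTGAN contributes) and both networks are deep, but the alternation above – discriminator step, then generator step – is the core of the game.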
Statistical similarity is crucial. But depending on what they represent, datasets also come with their own context and vital constraints, which must be preserved in the synthetic data. DAI Lab researcher Sala gives the example of a hotel ledger: a guest always checks out after checking in. The dates in a synthetic hotel-reservation dataset must follow this rule too: “They have to be in the correct order,” he says.
Large datasets can contain a number of different relationships like this, each strictly defined. “Models cannot learn the constraints, because constraints are very context-dependent,” says Veeramachaneni. So the team recently finished an interface that allows people to tell a synthetic data generator where those limits are. “The data is generated within those constraints,” says Veeramachaneni.
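One way to honor such a constraint is to build it into the sampling step rather than hope a model learns it. The sketch below is illustrative only – the column names and distributions are invented, and this is not the SDV’s actual interface – but it shows a check-out date that follows the check-in date by construction.

```python
import random
from datetime import date, timedelta

random.seed(0)

def sample_stay():
    """Sample a synthetic hotel stay that satisfies checkout > checkin
    by construction: instead of drawing two independent dates (which a
    naive model might do), draw a check-in date plus a positive length
    of stay. Column names and distributions are hypothetical."""
    checkin = date(2020, 1, 1) + timedelta(days=random.randrange(365))
    nights = random.randint(1, 14)          # always at least one night
    return {"checkin": checkin, "checkout": checkin + timedelta(days=nights)}

stays = [sample_stay() for _ in range(1000)]
print(all(s["checkout"] > s["checkin"] for s in stays))  # True
```

Parameterizing the data this way (date plus duration, total plus shares, and so on) guarantees the constraint holds in every generated row, with no rejection or post-hoc filtering needed.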
Such accurate data could help businesses and organizations in many different industries. One example is banking, where increased digitization, along with new data privacy regulations, has “sparked a growing interest in ways to generate synthetic data,” says Wim Blommaert, a team leader for financial services at ING. Existing solutions, such as data masking, often destroy valuable information that banks could use to make decisions, he said. A tool like the SDV has the potential to sidestep the sensitive aspects of data while preserving those important constraints and relationships.
One vault to rule them all
The Synthetic Data Vault combines everything the group has built so far into “a whole ecosystem,” says Veeramachaneni. The idea is that stakeholders – from students to professional software developers – can come to the vault and get what they need, be it a large table, a small amount of time series data, or a mix of many different types of data.
The vault is open source and expandable. “There are a lot of different areas where we find that synthetic data can also be used,” says Sala. For example, if a particular group is underrepresented in a sample dataset, synthetic data can be used to fill in those gaps – a sensitive endeavor that requires a lot of finesse. Companies may also want to use synthetic data to plan for scenarios they have not yet experienced, such as a large surge in user traffic.
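As a toy illustration of that gap-filling idea – with invented group labels and distributions, not a demo of any SDV feature – the sketch below oversamples an underrepresented group by fitting its mean and standard deviation and drawing new synthetic rows until the groups are balanced.

```python
import random
import statistics

random.seed(0)

# Hypothetical sample where "group B" is badly underrepresented.
data = ([("A", random.gauss(50, 10)) for _ in range(95)] +
        [("B", random.gauss(70, 10)) for _ in range(5)])

# Fit a simple model of group B from the few rows available.
b_values = [v for g, v in data if g == "B"]
mu, sigma = statistics.mean(b_values), statistics.stdev(b_values)

# Generate synthetic "B" rows from that fitted distribution
# until the two groups are the same size.
needed = sum(1 for g, _ in data if g == "A") - len(b_values)
synthetic_b = [("B", random.gauss(mu, sigma)) for _ in range(needed)]

balanced = data + synthetic_b
counts = {g: sum(1 for gg, _ in balanced if gg == g) for g in ("A", "B")}
print(counts)  # {'A': 95, 'B': 95}
```

The finesse the article mentions is real: a model fitted to five rows is a weak one, and naive oversampling can amplify whatever bias those rows carry, which is why this use case needs care.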
As use cases keep popping up, more tools will be developed and added to the vault, Veeramachaneni says. That may keep the team busy for at least another seven years, but they are ready: “We’re only touching the tip of the iceberg.”