Even as the availability of processing power increases exponentially, and machine learning algorithms are commoditized, there is a problem that persistently slows the development of complex AI: obtaining high-quality, accurate training data.
All supervised machine learning algorithms require training data, and every ML data scientist knows the phrase 'rubbish in, rubbish out' all too well.
The challenge is, how do you efficiently obtain enough quality data to train and optimize your model? And what if obtaining real training data is not even an option?
Synthetic data might be the answer.
The principle behind synthetic data is simple - instead of gathering real data, you generate it.
Synthetic data is not a new concept - the video games industry has been generating synthetic worlds for decades, and modern games are becoming increasingly lifelike. Gartner estimates that 60% of data used for machine learning projects will be synthetically generated by 2024. Nvidia's Omniverse Replicator is an example of a synthetic data engine, "an engine for generating synthetic data with ground truth for training AI networks."
Training models in this user-controlled environment makes it easy to engineer computer ‘understanding’ in any scenario that you think is important (within the realms of your imagination and generation capacity). For instance, an AV manufacturer might worry about the performance of a car in certain edge-case situations e.g. foggy conditions. Instead of relying on limited real-life data of foggy road driving, a realistic computer-generated foggy situation could be created and employed to train the requisite model.
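At its simplest, this kind of scenario engineering can even be done as an augmentation step on existing images. The sketch below is a deliberately crude, hypothetical example - a uniform white-haze blend standing in for a physically based fog model - assuming images are held as NumPy RGB arrays:

```python
import numpy as np

def add_fog(image: np.ndarray, density: float = 0.5) -> np.ndarray:
    """Blend a white haze into an RGB image (values 0-255).

    density=0 leaves the image untouched; density=1 turns it fully white.
    A uniform blend is a crude stand-in for a real fog simulation.
    """
    fog = np.full_like(image, 255.0, dtype=np.float64)
    blended = (1.0 - density) * image.astype(np.float64) + density * fog
    return blended.astype(np.uint8)

# A tiny 2x2 "road scene" stand-in: dark asphalt pixels.
scene = np.full((2, 2, 3), 40, dtype=np.uint8)
foggy = add_fog(scene, density=0.5)
```

A production synthetic data engine would instead render fog with depth-aware scattering, but the principle - programmatically injecting the edge case you care about - is the same.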
As yet, synthetic data is far from perfect. When training a complex model on simulated data, it's often unrealistic to expect it to work perfectly when exposed to real data for the first time.
For example, the synthetic street below is lifelike and detailed, but it's still recognizably synthetic and is missing some key features that might occur in real data. Moreover, it's tough to specify exactly what's missing from the image.
Also, it's tricky to build real-world outliers into synthetic data. Whether it's a cat, a pushchair, a scooter, or a unicycle, synthetically generating every feature that might appear in a real dataset is meticulous, time-consuming work.
Currently, synthetic data can be an excellent way to bolster and complement real datasets, helping expose complex models to more scenarios than would otherwise be possible. In the future, synthetic data might form the backbone of training data. Find our in-depth guide to creating training data here.
Synthetic data should not always be seen as a replacement or alternative to real data - both have their own distinctive pros and cons for data and ML projects of varying types. Here are some examples of synthetic data use cases:
Synthetic data is quick and efficient, which is an asset in fast-moving industries where businesses need to provide data to third parties without negotiating time-consuming legal, technical, and regulatory processes. For example, synthetic data negates the regulatory lag of negotiating deals that involve sensitive data - a representative dataset can be generated and passed over instead of handing over the real thing.
Moreover, personally identifiable information (PII) is subject to numerous privacy and data protection rules which extend to storage and transfer. Rather than negotiating the issues of storing or transferring sensitive datasets, synthetic datasets can be generated and downloaded on-site.
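One simple (and simplified) way to produce such a privacy-preserving stand-in is to fit summary statistics to the real data and sample fresh rows from them. The sketch below assumes numeric tabular data and uses a multivariate normal fit - a hypothetical illustration only; dedicated synthetic data tools model real distributions far more faithfully:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sensitive real dataset: 500 rows of (age, income).
real = rng.multivariate_normal(
    mean=[40.0, 55_000.0],
    cov=[[100.0, 20_000.0], [20_000.0, 1e8]],
    size=500,
)

# Fit simple summary statistics, then sample entirely new rows from them.
# No real individual's row is copied into the synthetic set.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=500)
```

The synthetic table preserves the broad statistical shape of the original - means, variances, correlations - while containing no actual records, which is what makes it shareable where the real data is not.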
Monetizing valuable real datasets also presents legal and regulatory issues regarding privacy, anonymity, and intellectual property rights. It may not be possible to monetize a dataset even when all PII and other sensitive information is obscured or anonymized.
Synthetic data negates regulatory barriers to dataset monetization (granted that a synthetic dataset may not be as valuable as its real counterpart).
Synthetic data is the only option for data science projects that simulate the unforeseen or unknown, where training the model on natural data is impossible. For example, NASA tests its spacecraft on vast synthetic datasets that simulate possible conditions.
Synthetic data is fed into non-destructive digital twin models to study the outputs of a system when exposed to experimental variables. This is vital for building large-scale simulations - Monte Carlo models use and generate synthetic datasets which form the basis of complex simulation studies.
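A Monte Carlo study in miniature looks like this: generate synthetic inputs from assumed distributions, run them through the system model, and read the output statistics. The example below estimates the failure probability of a hypothetical two-component system - the component failure rates (0.1 and 0.2) are invented for illustration:

```python
import random

random.seed(42)

def simulate_failure_probability(trials: int = 100_000) -> float:
    """Estimate how often a hypothetical two-component system fails,
    using synthetic trials instead of observed real failures.
    Components fail independently; the system fails only if both do."""
    failures = 0
    for _ in range(trials):
        a_fails = random.random() < 0.1
        b_fails = random.random() < 0.2
        if a_fails and b_fails:
            failures += 1
    return failures / trials

estimate = simulate_failure_probability()
# Analytically the answer is 0.1 * 0.2 = 0.02; the estimate converges to it.
```

Real digital twin simulations replace the toy failure model with a detailed physical or statistical model, but the synthetic-data-in, statistics-out pattern is identical.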
Synthetic data can work in tandem with real data. In data analytics, it’s common practice to combine synthetic data with real data to enrich and enhance sparse datasets (i.e. imputation). If only a small amount of real data is available, it’s possible to use that data as a platform on which to predict other values, thus forming a hybrid real and synthetic dataset.
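A minimal sketch of that hybrid approach, assuming a sparse numeric series with NaN marking missing readings: fit a model (here, just a linear trend) to the observed real values, then fill the gaps with its predictions.

```python
import numpy as np

# A sparse "real" dataset: sensor readings with gaps (NaN = missing).
readings = np.array([1.0, 2.1, np.nan, 3.9, np.nan, 6.2])

# Fit a linear trend to the observed points only...
x = np.arange(len(readings))
observed = ~np.isnan(readings)
slope, intercept = np.polyfit(x[observed], readings[observed], deg=1)

# ...then fill the gaps with predicted (synthetic) values,
# producing a hybrid real/synthetic series.
hybrid = np.where(observed, readings, slope * x + intercept)
```

In practice the predictive model would be far richer than a straight line, but the result is the same kind of dataset described above: real values where they exist, synthetic values where they don't.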
Let’s take a balanced look at the pros and cons of synthetic data:
One of the most challenging elements of any machine learning project is obtaining enough data. In general, the more you have, the better (though quality is also crucial). Synthetic data enables AI teams to tackle the issue of scale - as many training images can be generated as needed (GPU permitting). Moreover, synthetic data is often auto-labeled, which saves a huge amount of time - see our guide to automated labeling here.
Real-world data is subject to laws and regulations regarding privacy and data usage, such as GDPR. Synthetic data solves confidentiality and privacy issues without compromising quality in sensitive industries like finance and healthcare.
Real-world data can be free, but it can also be costly. Since supervised machine learning projects rely on accurate, quality datasets, cost is a pressing concern. Synthetic data isn't always cheap, but it can have the edge over purchasing expensive datasets. It's cheaper to generate thousands of miles of streets and pavements than to hire advanced image and video capture, for example.
Synthetic data can be engineered with specific user-controlled characteristics that can scale to the complexity of the model. If a feature isn’t present in a dataset, it can be added later. New features can be engineered until the model is suitably accurate.
Synthetic data is both predictable and flexible. Alphabet’s Waymo and General Motors’ Cruise use synthetic simulations to generate LiDAR data. Here, the generated data represents an empirical ground truth. At the same time, synthetic data can be chopped and changed to alter model outcomes and accuracy. This is a huge advantage in complex, iterative training processes.
Even the latest high-grossing video games from top publishers still fall short of true realism. Replicating the real world is not an easy task by any stretch, and synthetic data engines right now are still just 'good enough' rather than a genuine match for real data.
Training certain models with any purpose-made, overly cross-sectional, or biased data can result in significant issues when the model is exposed to real data. Synthetic data is not guaranteed to be truly representative of a real-world sample.
Additionally, synthetic data is constrained by the features of the platforms that offer it. This may be an issue if you want to engineer certain nuances, outliers, or edge cases. By the time you're deliberately engineering niche features to reduce bias in the model, you are eroding the efficiencies gained over simply using a genuine, representative dataset.
A model trained on high-quality, well-annotated real-world data is more trustworthy in high-stakes situations. While any model should be thoroughly tested before deployment, synthetic data doesn’t contain the same detail as real-world data, which may cause issues when models are deployed rapidly without sufficient testing. Synthetic data provides a tempting shortcut, but ultimately, what use is an inaccurate model?
The process of matching the complexity of synthetic data to real data may be more hassle than it's worth. One statistic illustrates the current limits of trust in synthetic data: MIT, a proponent of synthetic data, conducted a study comparing real datasets to synthetic ones, reported under the headline that artificial data give the same results as real data without compromising privacy. Yet models trained on synthetic data only matched those trained on real data in 11 of 15 tests (roughly 73% of the time). Clearly, synthetic data is not yet up to the same standard as real data.
Synthetic data solutions provided by the likes of Nvidia are distinctly enterprise-level and have a price tag to match. One of the most exciting applications of machine learning is developing novel solutions to small-scale problems with significant impacts. Aya Data understands this - we work with many clients who require high-quality datasets for niche, creative projects where real data is irreplaceable.
For the majority of projects, real data does still have the edge over synthetic data, though they are both well suited to different situations.
Creating synthetic data is simply not always possible or appropriate for certain ML projects, and the reverse applies to real data. For example, Aya Data sourced and labeled images of maize diseases to help our client build a disease classification application. Creating synthetic data wouldn’t have been an option here - the project required real images of maize disease.
Synthetic data solves the issue of scale in projects that require vast quantities of simple data. But the narrative that it will solve the problems of large-scale ML projects belies a core limitation: synthetic data is inherently constrained by what has already been observed and captured, and by the imagination of its creator. Moreover, as long as humans operate the controls of synthetic testing and training environments, bias and misrepresentation must be carefully monitored.
Overall, synthetic data unquestionably has a growing place in machine learning, particularly for enterprise-level projects that require a multitude of data points. On the other hand, many cutting-edge, creative, or novel applications of ML require at least a portion of real data, and always will.