Is Automatic Data Labeling the Future of Machine Learning?
As demands increase for high-quality, large-scale training datasets, data labeling has become an increasingly important function within AI.
Manually labeling training data is labor-intensive and can be difficult and expensive (see our guide to data labeling here), but is automatic data labeling a valid alternative?
This guide explores the extent to which automatic data labeling can assist or replace human labeling teams.
Data labeling is the task of annotating raw data with the tags, classes, or structures that supervised machine learning algorithms learn from. The terms “annotation” and “labeling” are used interchangeably.
Supervised machine learning algorithms learn from labeled data, which trains them to accurately map inputs to outputs when exposed to real data.
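As a purely illustrative sketch of that mapping (the features, labels, and values below are invented, and scikit-learn is just one convenient library for the demonstration), a model is fit on labeled examples and then predicts labels for new, unlabeled inputs:

```python
# Minimal sketch: a supervised model learns a mapping from inputs to labels.
from sklearn.linear_model import LogisticRegression

# Labeled training data: each row of X_train is an input, each entry of y_train its label.
X_train = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]]
y_train = ["cat", "dog", "cat", "dog"]

model = LogisticRegression()
model.fit(X_train, y_train)           # learn the input-to-label mapping

# When exposed to new, unlabeled data, the model predicts labels itself.
print(model.predict([[0.15, 0.85]]))  # -> ['cat']
```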
Data labeling varies with the algorithm(s) or project, but broadly falls into one of three categories: text labeling for natural language processing (NLP); audio labeling for conversational AI, voice recognition, and transcription; and image/video labeling for computer vision (CV).
Labels range from simple bounding boxes used to identify discrete objects to more complex polygon annotation, segmentation masks, and panoptic segmentation. See our ultimate guide to creating training data here.
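To make those label types concrete, here is a hypothetical example of a single bounding-box-plus-polygon annotation in the widely used COCO format. The field names follow COCO’s published schema; the image, category, and coordinate values are invented.

```python
# Illustrative only: a COCO-style annotation for one object in one image.
annotation = {
    "image_id": 42,
    "category_id": 3,                     # e.g. "car" in the project's label taxonomy
    "bbox": [120.0, 85.0, 200.0, 150.0],  # [x, y, width, height] in pixels
    "segmentation": [[120.0, 85.0, 320.0, 85.0, 320.0, 235.0, 120.0, 235.0]],
    "iscrowd": 0,
}
```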
Cognilytica found that 80% of time allocated to machine learning projects is taken up by preparation, of which 25% consists of data labeling.
To speed up data labeling is to speed up the end-to-end ML project timeline. This is why automated data labeling workflows and synthetic data (see our guide to synthetic data here) are of such massive interest right now.
There are three broad methods of labeling data. Labeling projects typically involve an element of manual labeling (unless synthetic data is generated with labels); however, manual labeling can be enhanced to varying degrees by a number of automated labeling techniques.
Manual labeling makes sense on many practical and theoretical levels. Firstly, only humans are capable of making many of the project-critical labeling decisions required to build accurate models, provided they have domain expertise (where necessary) and are well trained in data labeling. Secondly, source data is rarely perfect, and manual labeling is ideal for highlighting potential problems before the project progresses.
Manual labeling is particularly important when dealing with edge cases, or niche industries/sectors where public or synthetic datasets are insufficient or non-existent. For example, many ML projects require real data collected from the field and labeled by human experts.
For example, Aya Data labeled images of eScooters – an unaddressed edge case in our client’s AV workflow. In this situation, manufacturing synthetic data wouldn’t have been possible. Here, leveraging Aya Data’s HIIT team to manually label the eScooters was economical, practical, and effective.
Airbus highlights a similar dilemma here, noting that neither fully automated nor fully manual labeling pipelines suffice in diverse, large-scale ML projects.
When considering fully manual labeling, there is also a decision to make between crowdsourcing labelers, using in-house specialists, and engaging a third-party managed service provider – this is explored in our guide to labeling here.
AI-assisted labeling is already in use across most major data labeling platforms.
Model-assisted labeling lets users load pre-labeled images into a labeling platform, which learns from those images and pre-labels forthcoming ones. Some platforms offer API and Python SDK integrations for plugging pre-labeled training data into their UIs.
Then, when exposed to new images, the model will auto-apply labels for humans to check and adjust.
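In rough terms, a model-assisted workflow might look something like the sketch below. The model, image loader, and review-queue functions are hypothetical stand-ins for whatever platform integration a project actually uses; the point is that every proposed label still passes through a human reviewer.

```python
# Simplified sketch of model-assisted labeling: a previously trained model proposes
# labels for new images, and everything is queued for human review rather than
# accepted outright. All names here are illustrative placeholders.

CONFIDENCE_THRESHOLD = 0.6

def pre_label(pretrained_model, load_images, send_to_review_queue):
    for image in load_images():
        label, confidence = pretrained_model.predict(image)
        if confidence >= CONFIDENCE_THRESHOLD:
            # A proposed label is attached, but a human still verifies it.
            send_to_review_queue(image, proposed_label=label)
        else:
            # Low-confidence items go to humans with no pre-label at all.
            send_to_review_queue(image, proposed_label=None)
```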
One drawback of model-assisted labeling is that it still relies on human teams to check that the labels are accurate.
A well-cited study presented at the Conference on Human Factors in Computing Systems found that data labelers are prone to making hasty, error-ridden decisions when exposed to pre-labeled data, since it essentially sidelines their own decision-making. As a result, poorly auto-labeled images are often approved rather than edited or rejected.
See an exploration of how model-assisted labeling affects decision-making here.
Another form of model-assisted learning is active learning. Here, a subset of pre-labeled images is handed to an algorithm that learns to infer new labels based on that subset.
When exposed to new data, the model returns some automatically generated labels that are then checked for accuracy, while actively ‘asking questions’ about any data it cannot automatically label.
The human answers these questions, which further teaches the algorithm. The process is repeated over and over – the model labels the simplest data, refers data it’s unsure about to humans, and then learns from the humans’ decisions. In addition to predicting data labels, active learning also helps discard data that isn’t needed.
For example, an algorithm can learn to label all cars in an AV dataset but discard anything above the horizon line (like a plane or dark cloud).
(Image: Active Learning for Video Search – MIT)
Active learning bears similarities to a human learning process. For example, suppose you’re labeling millions of messages as spam. In that case, you might expect to teach a human workforce with a selection of strong examples of spam, and expect them to infer future labels based on that small sample. The team should ask questions if they’re unsure and actively learn from the answers. Active learning works in the same way.
This is similar to model-assisted labeling, but the model and human work together in synchrony to build a progressively more accurate auto-labeling model.
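A minimal sketch of that loop, using uncertainty sampling with scikit-learn, might look like the following. The data is random and ask_human_for_label is a hypothetical stand-in for the manual annotation step; only the control flow matters here.

```python
# Bare-bones active-learning loop: train on a small labeled seed set, find the
# pool items the model is least confident about, ask a "human" to label them,
# and retrain. Data and the human-answer function are purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))      # small pre-labeled seed set
y_labeled = rng.integers(0, 2, size=20)
X_pool = rng.normal(size=(500, 5))        # large unlabeled pool

def ask_human_for_label(x):
    # Placeholder for the human annotator answering the model's "question".
    return int(x.sum() > 0)

model = LogisticRegression()
for _ in range(5):
    model.fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_pool)
    uncertainty = 1 - probs.max(axis=1)        # least-confident predictions
    query_idx = np.argsort(uncertainty)[-10:]  # the 10 items the model is most unsure about
    new_labels = [ask_human_for_label(x) for x in X_pool[query_idx]]
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
```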
The most advanced form of auto-labeling is programmatic labeling. Both AI-assisted and active learning link the human labeler to an auto-labeling algorithm. However, programmatic labeling goes one step further and establishes a link between human interpretations, or heuristics, and the auto-labeling algorithm.
In cognitive science and psychology, heuristics are shortcuts we use to establish meaning and understanding via the application of quick rules. Trial and error, a rule of thumb, and educated guesses are examples of heuristics.
In relation to automatic data labeling, programmatic labeling attempts to transplant some of these human decision-making processes into the auto-labeling algorithm.
Consider labeling spam or fraudulent messages. There are a number of ‘rules of thumb’ you could apply to make an educated guess about whether a message is genuine, spam, or fraudulent.
For example, a message with a high typo rate or a misspelled opening, “Der Sir/ madam,” can form the basis of a heuristic interpretation that the message is not genuine.
Moreover, a URL payment link that involves a random string of characters, “paypx.com/1J88XHAO,” might immediately raise suspicion. Programmatic labeling seeks to transplant these innate human heuristic judgments into automated labeling AIs. The user ‘conveys’ their high-level judgments in a way that AI can understand and replicate on new data. Programmatic auto-labeling models are still young. Snorkel AI has been working on heuristic and programmatic labeling for many years in conjunction with Stanford and other top research universities.
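As a toy illustration of the idea – plain Python rather than any particular library’s API, with invented heuristics and an invented example message – each ‘labeling function’ below encodes one rule of thumb and votes spam, genuine, or abstains:

```python
# Sketch of programmatic labeling: human heuristics expressed as small functions
# whose votes are combined into an automatic label. This mirrors the labeling-
# function idea popularized by Snorkel, but does not use any specific library.
import re

SPAM, GENUINE, ABSTAIN = 1, 0, -1

def lf_suspicious_greeting(message):
    # Heuristic: a misspelled, generic opening like "Der Sir/ madam" hints at spam.
    return SPAM if re.search(r"\bder\s+sir\b", message, re.IGNORECASE) else ABSTAIN

def lf_random_payment_link(message):
    # Heuristic: payment URLs with random-looking character strings raise suspicion.
    return SPAM if re.search(r"paypx\.com/\w{6,}", message, re.IGNORECASE) else ABSTAIN

def lf_known_signature(message):
    # Heuristic: a familiar sign-off suggests the message is genuine.
    return GENUINE if "your account manager" in message.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_suspicious_greeting, lf_random_payment_link, lf_known_signature]

def label_message(message):
    votes = [v for v in (lf(message) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN                           # no heuristic fired; route to a human
    return max(set(votes), key=votes.count)      # simple majority vote over heuristics

print(label_message("Der Sir/ madam, pay here: paypx.com/1J88XHAO"))  # -> 1 (SPAM)
```

In practice, frameworks in this space combine many such noisy heuristics with a statistical model that weighs their agreements and conflicts, rather than a simple majority vote.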
Labeling platforms such as LabelBox and V7 Labs have their own built-in automation tools that use some of the above principles. For example, V7 Labs’ auto-labeler, called V7 Darwin, auto-generates polygon and pixel-wise masks. They suggest that it can speed up labeling by some 90%.
LabelBox’s AI-assisted labeler allows users to import pre-labeled datasets and train their own auto-labeling models.
It’s highly likely that commercial labeling platforms will offer more and more automation features going forward. These will involve a blend of model and AI-assisted labeling and, eventually, programmatic labeling. When programmatic labeling is refined, it should offer an almost-perfect blend of the scale of AI and the innate decision-making and interpretive skills of human labeling teams.
There is no one-size-fits-all method for labeling data for machine learning projects. The phrase “rubbish in, rubbish out” rings true here: the accuracy of a model directly depends on the quality of the data it was trained on.
While automated labeling workflows are becoming more accessible and easier to use, they’re not yet a panacea for the data labeling bottleneck. However, even shaving a few seconds or minutes from each labeling session has a cumulative benefit, and the model-assisted tools offered by top labeling platforms are genuinely useful.
In the future, programmatic labeling is likely to become more straightforward and easier to leverage in smaller projects. For now, this cutting-edge, enterprise-level approach to scalable, automated data labeling remains largely ring-fenced for the industry’s most extensive projects.
In any and all cases, building robust foundational manual datasets or ‘gold sets’ is imperative.
Aya Data’s expert HIIT workforce has proven multi-industry data labeling experience – our datasets have been used to train leading-edge AIs across a multitude of sectors and verticals. Contact us to get a free quote on our data labeling services.