Data Annotation

As demands increase for high-quality, large-scale training datasets, data labeling has become an increasingly important function within AI.

Manually labeling training data is labor-intensive, slow, and expensive (see our guide to data labeling here) – but is automatic data labeling a viable alternative?

This guide explores the extent to which automatic data labeling can assist or replace human labeling teams.

What Is Data Labeling?

Data labeling is the task of annotating and labeling data for supervised machine learning algorithms. The terms “annotation” and “labeling” are used interchangeably.

Supervised machine learning algorithms learn from labeled data, which trains them to accurately map inputs to outputs when exposed to real data.
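As a minimal illustration of that mapping (scikit-learn is our choice here, not something prescribed by any particular project; the toy features and labels are invented), the labels applied by annotators become the target values a supervised model learns to predict:

```python
# Minimal sketch of supervised learning from labeled data (scikit-learn).
# The human-applied labels become the target values y.
from sklearn.linear_model import LogisticRegression

# Toy labeled dataset: each row is a feature vector, each y a human label.
X = [[120, 3], [15, 0], [200, 7], [10, 1]]  # e.g. message length, typo count
y = [1, 0, 1, 0]                            # 1 = spam, 0 = genuine

model = LogisticRegression().fit(X, y)      # learn the input-to-label mapping
print(model.predict([[150, 5]]))            # predict a label for an unseen input
```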

Data labeling varies with the algorithm(s) or project, but broadly falls into one of three categories: text labeling for natural language processing (NLP); audio labeling for conversational AI, voice recognition, and transcription; and image/video labeling for computer vision (CV).

Labels range from simple bounding boxes used to identify clearly defined objects to more complex polygon annotation, segmentation masking, and panoptic segmentation. See our ultimate guide to creating training data here.
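To make these label types concrete, here is a hedged sketch of how a bounding box and a polygon might be stored in a COCO-style annotation record (the field names follow the COCO convention; the image, categories, and coordinates are invented):

```python
# COCO-style annotation records (all values invented for illustration).
bounding_box_label = {
    "image_id": 42,
    "category_id": 1,            # e.g. "car"
    "bbox": [100, 150, 80, 60],  # [x, y, width, height] in pixels
}

polygon_label = {
    "image_id": 42,
    "category_id": 2,            # e.g. "pedestrian"
    # Flat list of x, y vertex coordinates tracing the object outline –
    # this captures irregular shapes a rectangle cannot.
    "segmentation": [[310, 200, 330, 180, 355, 210, 340, 250, 315, 240]],
}
```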

The Machine Learning Bottleneck

Cognilytica found that 80% of the time allocated to machine learning projects is taken up by data preparation, of which 25% consists of data labeling.

To speed up data labeling is to speed up the end-to-end ML project timeline. This is why automated data labeling workflows and synthetic data (see our guide to synthetic data here) are of such massive interest right now.

Time Spent on Data Prep

Three Approaches To Data Labeling

There are three broad methods of labeling data. All labeling projects typically involve an element of manual labeling (unless synthetic data is generated with labels); however, manual labeling can be enhanced to varying degrees with a number of automated labeling techniques:

  1. Manual labeling
  2. AI-assisted labeling
  3. Programmatic labeling

1: Manual Labeling

Manual labeling makes sense on many practical and theoretical levels. Firstly, only humans are capable of making many of the project-critical labeling decisions required to build accurate models, provided they have domain experience (where necessary) and are well trained in data labeling. Secondly, source data is rarely perfect, and manual labeling is ideal for highlighting potential problems before the project progresses.

Medical annotation requires high domain knowledge and/or training

Manual labeling is particularly important when dealing with edge cases, or with niche industries and sectors where public or synthetic datasets are insufficient or non-existent – many ML projects require real data collected from the field and labeled by human experts.

For example, Aya Data labeled images of eScooters – an unaddressed edge case in our client’s AV workflow. Manufacturing synthetic data wasn’t possible in this situation, and leveraging Aya Data’s HIIT team to manually label the eScooters proved economical, practical, and effective.

Airbus highlights a similar dilemma here, noting that neither fully automated nor fully manual labeling pipelines suffice in diverse, large-scale ML projects.

When considering fully manual labeling, there is also a choice to make between crowdsourcing labelers, using in-house specialists, and engaging a third-party managed service provider – this is explored in our guide to labeling here.

Pros of Manual Data Labeling

  • Manual labeling is usually the only option for tasks that require direct training or high-level domain knowledge. HIIT workforces can be briefed on the project’s requirements, things to look out for, essential points to remember while labeling, and other project-critical details.
  • Projects that operate in a new, niche, or emerging area of a sector/industry likely require hand-annotated data. In addition, manually annotated gold sets can be used to train auto-labeling models later on if required.
  • By collecting data from the field and manually annotating it, it’s possible for businesses and organizations to claim full rights over the data, labels, and models. Conversely, ownership may be difficult to negotiate if the training/test sets involve synthetic, paid, or open data.
  • Manually labeled data is customizable. Involving expert labelers in the end-to-end machine learning process unlocks value beyond the labels alone: labelers can make suggestions, flag problems, and raise concerns such as bias and representation.

Cons of Manual Data Labeling

  • Manual labeling is undoubtedly slow, and while a slow and steady pace suits some machine learning projects, it will hold back others.
  • Privacy and security issues arise when human labelers are exposed to personally identifiable or sensitive information. In this situation, creating a compliant labeling framework is essential. Measures must be taken to anonymize sensitive data and ensure data security.
  • Without a robust labeling pipeline, human error threatens to increase the error rate or introduce unforeseen behavior into the model. Human error can be reduced with blind or double-blind labeling pipelines, in which two or more labelers annotate the same item independently and disagreements are flagged for review – see the sketch after this list.
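As a rough illustration of the double-blind idea, the sketch below compares two labelers’ bounding boxes on the same object and routes disagreements to review; the iou() helper and the 0.8 agreement threshold are our own illustrative choices, not a standard:

```python
# Minimal sketch of a double-blind agreement check: two labelers annotate the
# same object independently, and low box overlap (IoU) flags it for review.

def iou(a, b):
    """Intersection-over-union of two [x, y, width, height] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

labeler_a = [120, 150, 80, 60]   # two independent annotations of one object
labeler_b = [125, 148, 78, 63]

if iou(labeler_a, labeler_b) < 0.8:  # agreement threshold, chosen arbitrarily
    print("Disagreement: route to a senior reviewer")
else:
    print("Labels agree: accept")
```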

2: Model/AI-Assisted Labeling

AI-assisted labeling is already in use across most major data labeling platforms.

Model-assisted labeling lets users load pre-labeled images into a labeling platform, which learns from those images and pre-labels the ones that follow. Some platforms offer API and Python SDK integrations for feeding pre-labeled training data into their UIs.

Then, when exposed to new images, the model will auto-apply labels for humans to check and adjust.
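A rough sketch of that pre-labeling step, using a pretrained torchvision detector as a stand-in for a platform’s built-in auto-labeler (the 0.7 confidence threshold is our assumption, not any platform’s default):

```python
# Sketch of model-assisted pre-labeling: a pretrained detector proposes boxes,
# and only confident proposals are queued as pre-labels for human review.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # placeholder for a real image tensor
with torch.no_grad():
    prediction = model([image])[0]     # dict with 'boxes', 'labels', 'scores'

CONFIDENCE = 0.7                       # arbitrary threshold for illustration
pre_labels = [
    {"bbox": box.tolist(), "class_id": int(label)}
    for box, label, score in zip(
        prediction["boxes"], prediction["labels"], prediction["scores"]
    )
    if score >= CONFIDENCE
]
# pre_labels now goes to a human labeler to verify, tighten, or reject.
```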

One drawback of model-assisted labeling is that it still relies on human teams to check that the labels are accurate.

A well-cited study presented at the Conference on Human Factors in Computing Systems found that data labelers are prone to making hasty, error-ridden decisions when exposed to pre-labeled data, since it essentially disarms their decision-making. As a result, poorly auto-labeled images are often approved rather than edited or rejected.

See an exploration of the effect of model-assisted labeling on decision-making here.

Active Learning

Another form of model-assisted learning is active learning. Here, a subset of pre-labeled images is handed to an algorithm that learns to infer new labels based on that subset.

When exposed to new data, the model returns some labels automatically, which are then checked for accuracy, while it actively ‘asks questions’ about any data it cannot confidently label.

The human answers these questions, which further teaches the algorithm. The process is repeated over and over – the model labels the simplest data, refers data it’s unsure about to humans, and then learns from the humans’ decisions. In addition to predicting data labels, active learning also helps discard data that isn’t needed.

For example, an algorithm can learn to label all cars in an AV dataset but discard anything above the horizon line (like a plane or dark cloud).
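A minimal sketch of that loop using uncertainty sampling with scikit-learn; the dataset, query budget, and the ask_human() stand-in are invented for illustration:

```python
# Minimal active-learning loop (uncertainty sampling) with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=500, random_state=0)
labeled = list(range(20))              # small hand-labeled seed set
unlabeled = list(range(20, 500))

def ask_human(i):
    return y_true[i]                   # stand-in for a real human labeler

model = LogisticRegression(max_iter=1000)
for _ in range(10):                    # query budget of 10 rounds
    model.fit(X[labeled], y_true[labeled])
    proba = model.predict_proba(X[unlabeled])
    # The model 'asks a question' about the example it is least sure of.
    uncertainty = 1 - proba.max(axis=1)
    query = unlabeled[int(np.argmax(uncertainty))]
    _ = ask_human(query)               # human answers; label joins the pool
    labeled.append(query)
    unlabeled.remove(query)

print(f"Labeled pool grew to {len(labeled)} examples")
```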

Active Learning For Video Search – MIT

Active learning bears similarities to a human learning process. For example, suppose you’re labeling millions of messages as spam: you might teach a human workforce with a selection of strong examples of spam and expect them to infer future labels from that small sample. The team should ask questions when they’re unsure and actively learn from the answers. Active learning works in the same way.

This is similar to model-assisted labeling, but the model and human work together in synchrony to build a progressively more accurate auto-labeling model.

Pros of Model-Assisted Labeling

  • Model-assisted labeling attempts to pre-apply labels without replacing human agency altogether. While it is probably only reasonably accurate when applied to simple images, more advanced auto-labeling models can be developed using active learning.
  • Labeling platforms such as V7 Labs, Roboflow, and Labelbox have model-assisted auto-labelers built into the UI. These allow you to train your own models using a sample of manually labeled images; the platform then auto-labels subsequent images as you load them.
  • Model-assisted labeling can be scoped to the difficulty of the task. For example, large objects might be relatively easy to auto-label with bounding boxes, freeing labelers to focus on the more complex forms in the same image that require polygon annotation rather than on menial labeling tasks.

Cons of Model-Assisted Labeling

  • Model-assisted labeling might shave time off the labeling process, but training auto-labeling models also takes time, and the resulting models might not be accurate enough to be usable – it can be more hassle than it’s worth.
  • Studies have shown that when labelers are given pre-labeled data, they’re less likely to check the pre-labels for accuracy (e.g., is the bounding box tight? Is the class label correct?). Instead of checking the labels, they might simply ‘pass’ the image.
  • Using auto-labeling models to save time is risky when quality is sacrificed. If you’re considering auto-labeling to scale up your training and test data, weigh the trade-off: a larger dataset of poorer quality may underperform a smaller, expertly labeled one.

3: Programmatic Labeling

The most advanced form of auto-labeling is programmatic labeling. Both AI-assisted labeling and active learning link the human labeler to an auto-labeling algorithm. Programmatic labeling goes one step further and establishes a link between human interpretations, or heuristics, and the auto-labeling algorithm.

In cognitive science and psychology, heuristics are shortcuts we use to establish meaning and understanding via the application of quick rules. Trial and error, a rule of thumb, and educated guesses are examples of heuristics.

In relation to automatic data labeling, programmatic labeling attempts to transplant some of these human decision-making processes into the auto-labeling algorithm. 

Consider labeling spam or fraudulent messages. There are a number of ‘rules of thumb’ you could apply to make an educated guess about whether a message is genuine, spam, or fraudulent.

For example, a message with a high typo rate or a misspelled opening – “Der Sir/ madam” – can form the basis of a heuristic judgment that the message is not genuine.

Similarly, a URL payment link that involves a random string of characters, “paypx.com/1J88XHAO,” might immediately raise suspicion.

Programmatic labeling seeks to transplant these innate human heuristic judgments into automated labeling AIs: the user ‘conveys’ their high-level judgments in a way that an AI can understand and replicate on new data.

Programmatic auto-labeling models are still young. Snorkel AI has been working on heuristic and programmatic labeling for many years in conjunction with Stanford and other top research universities.
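As a hedged sketch of what these heuristics look like in practice, the labeling functions below loosely follow the pattern popularized by Snorkel; the rules and the simple majority-vote combiner are our own simplifications (Snorkel’s actual label model aggregates votes probabilistically):

```python
# Labeling functions encode human heuristics as code. Each function votes
# SPAM, GENUINE, or abstains; votes are combined into a final label.
import re
from collections import Counter

SPAM, GENUINE, ABSTAIN = 1, 0, -1

def lf_misspelled_greeting(text):
    # "Der Sir/ madam"-style openings suggest a non-genuine message.
    return SPAM if re.search(r"\bder\s+sir", text, re.I) else ABSTAIN

def lf_suspicious_payment_url(text):
    # Payment-like links with random character strings raise suspicion.
    return SPAM if re.search(r"payp\w*\.com/[A-Z0-9]{6,}", text) else ABSTAIN

def lf_expected_invoice(text):
    return GENUINE if "invoice attached as agreed" in text.lower() else ABSTAIN

def label(text):
    votes = [lf(text) for lf in (lf_misspelled_greeting,
                                 lf_suspicious_payment_url,
                                 lf_expected_invoice)]
    votes = [v for v in votes if v != ABSTAIN]
    # Simple majority vote; a real label model weighs rule reliability.
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(label("Der Sir/ madam, pay now at paypx.com/1J88XHAO"))  # -> 1 (SPAM)
```

The value of this pattern is that many weak, noisy rules can be written, tested, and edited far faster than individual examples can be hand-labeled, which is what lets programmatic labeling scale to colossal datasets.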

Pros of Programmatic Labeling

  • Programmatic labeling allows human labelers and domain experts to embed their skills and knowledge in a model, which then operates via heuristic rules rather than probability alone. Done properly, programmatic labeling provides a rare blend of AI scale and human agency.
  • Programmatic labeling offers hope for colossal, complex datasets where model-assisted techniques aren’t accurate enough.
  • Multiple heuristic rules can be created and edited as the project develops. Programmatic labeling is ideal for high-maturity enterprise-level ML workflows.

Cons of Programmatic Labeling

  • Programmatic labeling is cutting-edge enterprise-level technology – this approach to data labeling is highly complex and requires significant investment. It’s simply not viable for the vast majority of ML projects.

Built-In Label Automation Tools

Labeling platforms such as Labelbox and V7 Labs have their own built-in automation tools that use some of the above principles. For example, V7 Labs’ auto-labeler, V7 Darwin, auto-generates polygon and pixel-wise masks; the company suggests it can speed up labeling by some 90%.

Labelbox’s AI-assisted labeler allows users to import pre-labeled datasets and train their own auto-labeling models.

It’s highly likely that commercial labeling platforms will offer more and more automation features going forward, blending model- and AI-assisted labeling and, eventually, programmatic labeling. Once programmatic labeling is refined, it should offer an almost-perfect blend of AI’s scale and the innate decision-making and interpretive skills of human labeling teams.

Summary: Guide to Automated Data Labeling

There is no one-size-fits-all method for labeling data for machine learning projects. The phrase “rubbish in, rubbish out” rings true here: a model’s accuracy depends directly on the quality of the data it was trained on.

While automated labeling workflows are becoming more accessible and easier to use, they’re not yet a panacea for the data labeling bottleneck. However, even shaving a few seconds or minutes from each labeling session has a cumulative benefit, and the model-assisted tools offered by top labeling platforms are genuinely useful.

In the future, programmatic labeling is likely to become more straightforward and easier to leverage in smaller projects. For now, this cutting-edge, enterprise-level approach to scalable, automated data labeling remains largely ring-fenced for the industry’s most extensive projects.

In any and all cases, building robust foundational manual datasets or ‘gold sets’ is imperative.

Aya Data’s expert HIIT workforce has proven multi-industry data labeling experience – our datasets have been used to train leading-edge AIs across a multitude of sectors and verticals. Contact us to get a free quote on our data labeling services.
