How to Find Training Data for Machine Learning

Supervised machine learning projects require training data. By learning from training data, a supervised algorithm aims to be able to accurately predict outcomes when exposed to real data. 

Training data is required for all types of supervised machine learning projects: 

  • Images, video, LiDAR, and other visual media are annotated for the purposes of computer vision (CV). 
  • Audio is annotated and labeled for the purposes of training conversational AIs and technologies with audio sensors. 
  • Text is annotated and labeled for chatbots, sentiment analysis, named entity recognition, and autotranslation. 
  • Numerical data, text, or statistics are used to train all manner of regression or classification algorithms. 

Finding appropriate training data is often a challenge. There are many public and open datasets online, but some are becoming heavily dated and many fail to accommodate the latest developments in AI and ML. That said, there are still several sources of high-quality, well-maintained datasets. 

This is a guide to finding training data for supervised machine learning projects. See our Ultimate Guide to Creating Training Data Here.

What Is a Good Dataset?

A good dataset must meet four basic criteria:  

1: Dataset Must be Large Enough to Cover Numerous Iterations of The Problem

Generally speaking, the larger the sample size, the more accurate the model can be. However, size alone is not enough: noisy data combined with an excessively complex model can result in overfitting, where the model memorizes quirks of the training set rather than learning patterns that generalize. Conversely, sparse data or data with too little variety can result in underfitting and bias.

Rather than assuming that more is always better, aim for a dataset that contains enough information to cover the problem space without introducing unnecessary noise to the model. For example, Aya Data built a training dataset of maize diseases to train a disease identification model, accessible through an app. Here, the problem space is limited by the number of diseases present: we created 5,000 labeled images of diseased maize crops to train a detection model with 95% accuracy. 

2: Data Must be Well-Labeled And Annotated

Skilled data labelers with domain knowledge can make the project-critical labeling decisions required to build accurate models. For example, in medical imaging, it’s often necessary to understand the visual characteristics of a disease so it can be appropriately labeled. 

The annotations themselves should be applied properly. Even simple bounding boxes are subject to quality issues if the box doesn't fit tightly around the feature. In the case of LiDAR, pixel segmentation, polygon annotation, or other complex labeling tasks, specialized labeling skills are essential.  
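
One common way to quantify how tightly a bounding box fits is intersection over union (IoU) against a trusted reference box. The snippet below is a minimal sketch of that idea, not a description of any particular labeling tool: the `Box` type, the function names, and the example coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    # A box with non-positive width or height has zero area.
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 is a perfect match, 0.0 is no overlap."""
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    overlap_area = area(overlap)
    union_area = area(a) + area(b) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0


# Illustrative values only: a reference box and a looser annotation around it.
reference = Box(10, 10, 110, 110)
loose = Box(0, 0, 120, 120)
print(round(iou(reference, loose), 2))  # about 0.69
```

A review workflow might flag any annotation whose IoU against a reviewer's box falls below a chosen threshold. The right threshold depends on the project.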

3: Data Must be Representative And Unbiased

Non-representative or biased data has already impacted public trust in AI. When it comes to autonomous vehicles (AVs), biased training data might be a matter of life and death. In other situations, e.g., recruitment, using biased AIs can result in regulatory issues or even breaking the law. 

Numerous models have already failed due to bias and misrepresentation, such as Amazon’s recruitment AI, which was biased against women, and several AV models that failed to detect people with darker skin tones. The lack of diversity in datasets is a pressing issue with both technical and ethical ramifications that remain largely unaddressed today. 

AI teams need to actively engage with the issue of bias and representation when building training data. Unfortunately, this is a distinctly human problem that AIs cannot yet fully solve themselves. 
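
A simple first check is to count how often each group or category appears in the dataset's metadata. The snippet below is a rough sketch under stated assumptions: the group labels, the `min_share` threshold of 10%, and the function names are hypothetical. A balanced count does not prove a dataset is unbiased, but a heavily skewed one is a clear warning sign.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(labels, min_share=0.10):
    """List groups whose share falls below min_share (an arbitrary example threshold)."""
    shares = group_shares(labels)
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical metadata: one group label per sample.
sample_groups = ["group_a"] * 90 + ["group_b"] * 8 + ["group_c"] * 2
print(underrepresented(sample_groups))  # ['group_b', 'group_c']
```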

4: Data Must Comply With Privacy Regulations

Privacy and data protection laws and regulations, such as GDPR, constrain the use of data that involves people’s identities or personal property. In regulated industries such as finance and healthcare, both internal and external policies introduce red tape to the use of sensitive data. 

Public and Open Source Data For ML Projects

Public and open source data can be used, reused, and redistributed without restriction. Strictly speaking, open data is not restricted by copyright, patents, or other forms of legal or regulatory control. It is still the user’s responsibility to conduct appropriate due diligence to ensure the legal and regulatory compliance of the project itself. 

There are many types of open datasets out there, but a large share aren’t suitable for training modern or commercial-grade ML models. Some are intended for educational or experimental purposes, though they can still make good test sets. Nevertheless, the internet is home to a huge range of open datasets for AI and ML projects, and new models are being trained on old datasets all the time. 

Government Datasets

Government-maintained datasets are offered by many countries, including the US (data.gov), the UK (data.gov.uk), Australia (data.gov.au), and Singapore (data.gov.sg). Most data is available in XML, CSV, HTML or JSON format. Public sector data relates to everything from transport and health to public expenditure and law, crime, and policing. 
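
Because CSV and JSON are so common on these portals, a downloaded file can usually be read with Python's standard library alone. The snippet below is a minimal sketch: the column names and values are made up for illustration and do not come from any real portal, and the data is held in strings so the example runs without a download.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse JSON text that is expected to hold a list of records."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data


# Made-up sample data in both formats.
csv_text = "region,year,incidents\nNorth,2021,14\nSouth,2021,9\n"
json_text = '[{"region": "North", "year": 2021, "incidents": 14}]'

csv_rows = rows_from_csv(csv_text)
json_rows = rows_from_json(json_text)

# CSV values arrive as strings, so numeric columns need converting.
total_incidents = sum(int(row["incidents"]) for row in csv_rows)
print(total_incidents)  # 23
```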

18 Public and Open Source Datasets

Here is a list of eighteen well-known datasets and dataset repositories. Some include already-labeled data. This is by no means an exhaustive list, and it’s also worth noting that some of these datasets are becoming dated (e.g. StanfordCars) and may only be useful for experimental or educational purposes. You can find some other datasets for NLP projects here.

  1. Awesome Public Datasets - A huge list of well-maintained datasets related to agriculture and science, demographics and government, transportation, and sports.
  2. AWS Registry of Open Data - Amazon Web Services’ source for open data, ideal for ML projects built on AWS.
  3. Google Dataset Search - A search engine for datasets. Google Finance, Google Public Data, and Google Scholar are also mineable for training data. 
  4. ImageNet - A vast image database for object recognition tasks, organized according to the WordNet hierarchy. A subset of the images includes bounding box annotations. 
  5. Microsoft Research Open Data - A range of datasets for healthcare, demography, sciences, crime, and legal and education.
  6. Kaggle Datasets - Probably the go-to for public datasets, of which there are over 20,000 on the site. 
  7. Places and Places2 - Scenes for object recognition projects. Features some 1.8 million images grouped in 365 scene categories. 
  8. VisualGenome - Datasets that connect images to structured language descriptions. 
  9. StanfordCars - Contains 16,185 images of 196 classes of cars.
  10. FloodNet - A natural disaster dataset created using UAVs. 
  11. The CIFAR-10 dataset - Contains some 60,000 32x32 images in 10 classes.
  12. Kinetics - 650,000 video clips of human actions.
  13. Labeled Faces in the Wild - More than 13,000 face images for face recognition tasks. 
  14. CityScapes Dataset - Street scenes for AV and CCTV models. 
  15. EarthData - NASA’s dataset hub. 
  16. COCO Dataset - Common Objects in Context, a dataset for object detection, segmentation, and captioning. 
  17. Mapillary Vistas Dataset - Global street-level imagery dataset for urban semantic segmentation. 
  18. NYU Depth V2 - For indoor semantic segmentation.

Creating Custom Data For ML Projects

It’s also possible to create custom datasets using any combination of data retrieved from existing datasets, data mining, and self-captured data.

  • Image and video footage can be annotated for CV projects. You can find some examples in our case studies - such as annotating ultra-HD satellite images to train a model that could identify changing land uses. 
  • Text can be taken from customer support logs and queries to train bespoke chatbots. This is how many businesses and organizations train chatbots, rather than using open datasets. 
  • Audio can be sampled from customer recordings, or other audio sources. This data can then be used to train speech or audio recognition algorithms, or transcribed into other languages. 

The advantage of using public and open data is that it’s (generally) free from regulation and control. Data mining, by contrast, introduces several challenges for data use and privacy, even when working with open-source data. For example, mining public data from the internet and using it to construct a dataset might be prohibited under GDPR even when efforts are made to anonymize the data. Many sites also have policies restricting data mining, and crawling may be disallowed by their robots.txt file. 
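
Checking robots.txt before collecting data from a site can be automated with Python's standard `urllib.robotparser` module. The snippet below is a minimal sketch: the rules, the user agent name `example-dataset-bot`, and the example.com URLs are invented for illustration, and the rules are parsed from a string so the example runs offline. Passing this check does not mean the data may legally be used; a site's terms of service and privacy law still apply.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented rules for illustration: everything under /private/ is off limits.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

AGENT = "example-dataset-bot"  # hypothetical crawler name
print(is_allowed(SAMPLE_ROBOTS_TXT, AGENT, "https://example.com/public/data.csv"))   # True
print(is_allowed(SAMPLE_ROBOTS_TXT, AGENT, "https://example.com/private/data.csv"))  # False
```

For a live site, `RobotFileParser` can also fetch the file itself via `set_url()` and `read()`.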

In contrast, if a business or organization possesses its own data (e.g. from live chat logs or customer databases), then this can be used to construct custom datasets. 

Working With a Data Sourcing and Labeling Partner

If you’re looking to create custom, bespoke data, data labeling services can label data from existing datasets, create entirely new custom datasets, or employ data mining techniques to discover new data. 

Leveraging the skills and experience of data sourcing and annotation specialists is ideal when you need a dataset to cover a broad spectrum of scenarios (e.g. edge cases), or when control over intellectual property is needed. Moreover, labeling partners can accommodate advanced labeling projects that require domain knowledge or particular industry-specific skills. 

Training models is usually an iterative task that involves stages of training, testing, and optimizing. Whether it be adding new samples or changing labeled classes and attributes, being able to change training data on the fly is an asset. 
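
When samples are added between iterations, it helps if existing samples stay in the same training, validation, or test split, so that results remain comparable from one round to the next. One way to do this is to assign each sample by hashing a stable ID. The snippet below is a sketch of that approach, not a prescribed workflow: the function name, the ID format, and the 10% validation and 10% test shares are assumptions.

```python
import hashlib


def assign_split(sample_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a sample ID to 'train', 'val', or 'test'.

    The assignment depends only on the ID, so adding new samples later
    never moves an existing sample into a different split.
    """
    if val_pct < 0 or test_pct < 0 or val_pct + test_pct > 100:
        raise ValueError("val_pct and test_pct must be non-negative and sum to at most 100")
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer from 0 to 99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical sample IDs, e.g. image file names.
for sample_id in ["maize_0001.jpg", "maize_0002.jpg", "maize_0003.jpg"]:
    print(sample_id, assign_split(sample_id))
```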

Labeling partners help AI and ML projects overcome the typical challenges involved with using inadequate pre-existing training data. 

The Human In The Loop (HITL)

Whatever the demands of your project, you will likely need a human-in-the-loop workforce to help train and optimize your model.

From bounding boxes, polygon annotation, and image segmentation to named entity recognition, sentiment analysis, and audio and text transcription, Aya Data’s HITL workforce has the skills and experience required to tackle new and emerging problems in AI and machine learning. 

Summary: How To Find Training Data

Sourcing training data is an essential part of the supervised machine learning process. While synthetic training data may yet become a panacea for all ML training data needs (see our full exploration of this here), in the immediate future human-labeled data is needed to fill the majority of global demand. 

Public datasets are advantageous in that they’re free to use for most purposes, but that doesn’t mean the resulting model can be monetized, as some dataset licenses restrict commercial use. While many of these open datasets are expertly maintained, many machine learning projects require the edge that custom data provides. 
