Supervised machine learning projects require training data. By learning from training data, a supervised algorithm aims to be able to accurately predict outcomes when exposed to real data.
Training data is required for all types of supervised machine learning projects:
Finding appropriate training data is often a challenge. There are many public and open datasets online, but some are becoming heavily dated and many fail to accommodate the latest developments in AI and ML. That said, there are still several sources of high-quality, well-maintained datasets.
This is a guide to finding training data for supervised machine learning projects. See our Ultimate Guide to Creating Training Data here.
A good dataset must meet four basic criteria:
Generally speaking, the larger the sample size, the more accurate the model can be. However, noisy or highly variable data can lead to overfitting, where an overly complex model learns the noise in the training set rather than the underlying pattern. Conversely, sparse data or data with too little variety can lead to underfitting and bias.
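One common way to spot overfitting or underfitting is to compare a model's error on its training data with its error on held-out data. The sketch below is a toy illustration only: the two "models" (a mean predictor and a nearest-neighbour lookup), the synthetic `y = 2x` data, and all function names are assumptions made for this example, not part of any real project.

```python
import random

def mse(predict, points):
    """Mean squared error of predict(x) against y over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

def fit_mean(train):
    """Underfit-prone model: always predicts the mean training label."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_nearest(train):
    """Overfit-prone model: copies the label of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def make_points(n, rng):
    """Noisy samples of the (assumed) true relationship y = 2x, x in [0, 1]."""
    return [(x, 2 * x + rng.gauss(0, 0.3)) for x in (rng.random() for _ in range(n))]

rng = random.Random(0)  # fixed seed so the comparison is repeatable
train, val = make_points(30, rng), make_points(30, rng)

for name, fit in [("mean", fit_mean), ("nearest", fit_nearest)]:
    model = fit(train)
    print(name, "train:", round(mse(model, train), 3), "validation:", round(mse(model, val), 3))
```

The nearest-neighbour model memorizes the training set, so its training error is zero while its validation error is not, which is the signature of overfitting. The mean predictor ignores the input entirely and typically shows a similarly high error on both sets, which is the signature of underfitting.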
Rather than assuming that more data is always better, datasets should contain enough information to cover the problem space without introducing unnecessary noise to the model. For example, Aya Data built a training dataset of maize diseases to train a disease identification model, accessible through an app. Here, the problem space is limited by the number of diseases present: we created 5,000 labeled images of diseased maize crops to train a detection model with 95% accuracy.
Skilled data labelers with domain knowledge can make the project-critical labeling decisions required to build accurate models. For example, in medical imaging, it’s often necessary to understand the visual characteristics of a disease so it can be appropriately labeled.
The annotations themselves should be applied properly. Even simple bounding boxes are subject to quality issues if the box doesn't fit tightly around the feature. In the case of LiDAR, pixel segmentation, polygon annotation, or other complex labeling tasks, specialized labeling skills are essential.
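Box tightness can be checked numerically with intersection over union (IoU), a standard overlap measure between two boxes. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates and compares a labeler's box against a reference box from a reviewer; the coordinates and the idea of a review threshold are illustrative assumptions.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; zero if the box is empty."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes, in the range [0, 1]."""
    # The intersection is itself a box (possibly empty).
    intersection = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter_area = box_area(intersection)
    union_area = box_area(box_a) + box_area(box_b) - inter_area
    return inter_area / union_area if union_area else 0.0

reference_box = (10, 10, 50, 50)  # tight box drawn by a reviewer (hypothetical)
loose_box = (0, 0, 60, 60)        # labeler's box with too much padding (hypothetical)

print(round(iou(reference_box, loose_box), 3))
```

A quality-control step might send any annotation whose IoU against a reviewed sample falls below a project-specific threshold back for relabeling. The right threshold depends on the task and is not something this sketch can prescribe.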
Non-representative or biased data has already impacted public trust in AI. When it comes to autonomous vehicles (AVs), biased training data might be a matter of life and death. In other situations, e.g., recruitment, using biased AI systems can result in regulatory issues or even breaking the law.
Numerous models have already failed due to bias and misrepresentation, such as Amazon’s recruitment AI, which was biased against women, and several AV models that failed to detect people with darker skin tones. The lack of diversity in datasets is a pressing issue with both technical and ethical ramifications that remain largely unaddressed today.
AI teams need to actively engage with the issue of bias and representation when building training data. Unfortunately, this is a distinctly human problem that AIs cannot yet fully solve themselves.
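A simple first check for representation is to count how often each group appears in the dataset's metadata. The sketch below is a minimal example under stated assumptions: the `group` field, the group names, and the 30% minimum share are all hypothetical, and a count like this only surfaces imbalance. Deciding which groups matter and what share is acceptable remains a human judgment.

```python
from collections import Counter

def group_shares(records, key):
    """Fraction of records belonging to each value of the given metadata key."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def underrepresented(records, key, min_share):
    """Sorted list of groups whose share of the dataset is below min_share."""
    shares = group_shares(records, key)
    return sorted(group for group, share in shares.items() if share < min_share)

# Hypothetical metadata: 8 samples from group_a, 2 from group_b.
records = [{"group": "group_a"}] * 8 + [{"group": "group_b"}] * 2

print(group_shares(records, "group"))
print(underrepresented(records, "group", min_share=0.3))
```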
Privacy and data protection laws and regulations, such as GDPR, constrain the use of data that involves people’s identities or personal property. In regulated industries such as finance and healthcare, both internal and external policies introduce red tape to the use of sensitive data.
Public and open source data can be used, reused, and redistributed without restriction. Strictly speaking, open data is not restricted by copyright, patents, or other forms of legal or regulatory control. It is still the user’s responsibility to conduct appropriate due diligence to ensure the legal and regulatory compliance of the project itself.
There are many types of open datasets out there, but many aren’t suitable for training modern or commercial-grade ML models. Instead, many are intended for educational or experimental purposes, though some can make good test sets. Nevertheless, the internet is home to a huge range of open datasets for AI and ML projects and new models are being trained on old datasets all the time.
Government-maintained datasets are offered by many countries, including the US (data.gov), the UK (data.gov.uk), Australia (data.gov.au), and Singapore (data.gov.sg). Most data is available in XML, CSV, HTML or JSON format. Public sector data relates to everything from transport and health to public expenditure and law, crime, and policing.
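Because the same kind of data may arrive as CSV from one portal and JSON from another, it often helps to normalize both into a single structure early on. The sketch below uses only Python's standard `csv` and `json` modules; the column names (`region`, `count`) and the sample content are made up for illustration and do not come from any real portal.

```python
import csv
import io
import json

def rows_from_csv(text):
    """Parse CSV text with a header row into a list of dicts (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_json(text):
    """Parse JSON text into a list of dicts; a single object becomes a one-item list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

# Hypothetical samples standing in for downloaded files.
csv_text = "region,count\nnorth,3\nsouth,5\n"
json_text = '[{"region": "north", "count": 3}]'

print(rows_from_csv(csv_text))
print(rows_from_json(json_text))
```

Note that CSV values are read as strings while JSON preserves numeric types, so a type-conversion step is usually needed before the two sources can be combined.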
Here is a list of eighteen well-known datasets. Some include already-labeled data. This is by no means an exhaustive list, and it’s also worth noting that some of these datasets are becoming quite aged (e.g. StanfordCars) and may only be useful for experimental or educational purposes. You can find some other datasets for NLP projects here.
It’s also possible to create custom datasets using any combination of data retrieved from datasets, data mining, and self-captured data.
The advantage of using public and open data is that it’s (generally) free from regulation and control. But, conversely, data mining introduces several challenges for data use and privacy, even when working with open-source data. For example, mining public data from the internet and using that to construct a dataset might be prohibited under GDPR even when efforts are made to anonymize data. Many sites also have their own policies restricting data mining, and crawling may be disallowed by their robots.txt.
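Before collecting data from a site, a crawler can check the site's robots.txt rules. The sketch below uses `urllib.robotparser` from Python's standard library and parses an inline example file so that it runs without network access; the rules, the `my-crawler` user agent, and the `example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content; a real crawler would download the site's own file.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/public/data.csv"))
print(allowed_by_robots(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/private/data.csv"))
```

Passing this check only means the site's crawling rules do not disallow the request. It says nothing about the site's terms of service or about privacy law such as GDPR, which still need to be reviewed separately.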
In contrast, if a company, business, or organization possesses its own data (e.g. from live chat logs or customer databases), then this can be used to construct custom datasets.
If you’re looking to create custom, bespoke data, data labeling services can label data from existing datasets, create entirely new custom datasets, or employ data mining techniques to discover new data.
Leveraging the skills and experience of data sourcing and annotation specialists is ideal when you need a dataset to cover a broad spectrum of scenarios (e.g. edge cases), or when control over intellectual property is needed. Moreover, labeling partners can accommodate advanced labeling projects that require domain knowledge or particular industry-specific skills.
Training models is usually an iterative task that involves stages of training, testing, and optimizing. Whether it be adding new samples or changing labeled classes and attributes, being able to change training data on the fly is an asset.
Labeling partners help AI and ML projects overcome the typical challenges involved with using inadequate pre-existing training data.
Whatever the demands of your project, you will likely need a human-in-the-loop workforce to help train and optimize your model.
From bounding boxes, polygon annotation, and image segmentation to named entity recognition, sentiment analysis, and audio and text transcription, Aya Data’s HITL workforce has the skills and experience required to tackle new and emerging problems in AI and machine learning.
Sourcing training data is an essential part of the supervised machine learning process. While synthetic training data may yet become a panacea for all ML training data needs (see our full exploration of this here), in the immediate future human-labeled data is needed to fill the majority of global demand.
Public datasets are advantageous in that they’re free to use for most purposes, but that doesn’t necessarily mean the resulting model can be monetized, so check each dataset's license terms. While many of these open datasets are expertly maintained, many machine learning projects require the edge that custom data provides.