
The Challenges of Text, Audio, Photo, and Video Data Collection for ML Training Models

Published 31/08/2023 / Blog posts


The basis of every machine learning project is data collection and acquisition. But that is also the first stumbling block, and the point at which many ML projects fail. In this article, we will discuss the challenges of text, audio, photo, and video data collection for ML training models so that you can anticipate and avoid the many pitfalls.

By the end, you will understand why data collection is not as easy as it may seem. You will know what to look out for if you are thinking of implementing AI solutions within your company and, hopefully, be able to avoid any missteps. If nothing else, you will be able to make an informed decision about whether an ML project is a worthwhile investment in your current circumstances.

So, let’s get started.

The Importance of Data Collection

Data scientists are not very dogmatic by and large, but they do hold one strong belief: machine learning models trained on bad data will not perform their intended functions. There is even a long-standing computing proverb that they apply to ML training data: garbage in, garbage out.

And the foundation of reliable, relevant, high-quality training data is professional data collection. There are multiple ways by which data is collected, from physical data collection to automatic data collection by IT systems to conducting surveys to purchasing data from existing data banks.

And all of these modes of data collection come with certain challenges, ones we will now discuss.

The Basic Challenges of Data Collection

Data collection may seem like the easiest step in an ML project. After all, the data already exists in one form or another, so it just needs to be gathered and later it can be curated, right? The simplest answer is – no.

A McKinsey survey on the adoption of AI by organizations found that data collection was one of the biggest barriers to implementing AI solutions: 24% of respondents stated that a lack of available data was the biggest barrier, while 20% pointed to the limited usefulness of data.

Taken together, 44% of respondents (24% plus 20%) named a data problem as their biggest barrier to adopting AI: they lack relevant, accurate, high-quality data to use as training data for their algorithms.

When you delve a little deeper, it’s not that surprising. So let’s explore the basic challenges of text, audio, photo, and video data collection for ML training models.

A Lack of Available or Usable Data

Many ML projects are unique, which also means that the required data that will later be used for training datasets is hard to come by or non-existent. Think of three (simplified) scenarios:

A company wishes to develop a machine learning model that will conduct predictive weather analyses in a specific region up to three weeks in advance. This type of analysis requires access to historical data and satellite imagery – both of which are generally easily available. In this scenario, data collection is not a big challenge (and is one of the reasons predictive analytics for weather patterns are so common).
A healthcare provider wishes to develop a machine learning algorithm that will predict the occupancy of their facilities. This type of predictive model needs to be partially based on historical occupancy data from their facilities. The healthcare provider has this data but in physical format, i.e., ledgers, charts, and other types of documents. In order to utilize that information for ML projects, it first needs to be digitized, then structured and curated. This type of data collection is more of a challenge than in the previous scenario.
A large agricultural company wishes to develop a computer vision model that will be used for early disease detection in specific crops, with the purpose of increasing yields. This type of project would require physically collecting large volumes of images and videos of diseased plants because that type of data is either non-existent or unavailable. Later, the images and videos will be annotated and fed into an ML model. In this scenario, because of the large volume of data required for high-accuracy disease detection, data collection alone is a significant challenge. This is not a purely hypothetical situation: a similar project has been conducted in Ghana for maize plants.
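The second scenario (digitize, then structure, then curate) can be sketched in a few lines of Python. This is only a minimal illustration, assuming the paper ledgers have already been transcribed into text lines of the form "date, ward, occupied beds". That line format, the field names, and the `parse_ledger` helper are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class OccupancyRecord:
    day: date
    ward: str
    occupied_beds: int


def parse_ledger(lines):
    """Turn transcribed ledger lines into structured records.

    Returns (records, rejected) so that malformed lines can be
    reviewed by a person instead of being silently dropped.
    """
    records, rejected = [], []
    for raw in lines:
        parts = [p.strip() for p in raw.split(",")]
        try:
            if len(parts) != 3:
                raise ValueError("expected 3 fields")
            day = datetime.strptime(parts[0], "%d/%m/%Y").date()
            beds = int(parts[2])
            if beds < 0 or not parts[1]:
                raise ValueError("implausible value")
        except ValueError:
            rejected.append(raw)
            continue
        records.append(OccupancyRecord(day, parts[1].title(), beds))
    return records, rejected


# Hypothetical transcribed ledger lines, including two bad ones.
sample = [
    "01/03/2022, maternity, 14",
    "01/03/2022, Surgery, 9",
    "02/03/2022, maternity, fourteen",  # transcription error
    "03/03/2022, , 7",                  # missing ward
]
records, rejected = parse_ledger(sample)
print(len(records), "structured records;", len(rejected), "need review")
```

Keeping the rejected lines separate is a deliberate choice: in a real digitization effort, the lines that fail to parse are often where the curation work is concentrated.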

And these challenges of data collection are only related to availability. There is another major challenge to consider – usability, i.e., even when relevant data is accessible, it may not be in a format that can be utilized for training datasets. This has to do with the nature of structured and unstructured data.

In the simplest of terms, structured data is organized, defined, and formatted, often in the form of tabular data. It is easily searchable, feature selection is straightforward, and it is usable as training data. Unstructured data, on the other hand, is unorganized, non-defined, and non-formatted.

It is typically found in its native format, like image, video, and audio files, sensor data, etc. Before it can be used as training data, it needs to be heavily curated and annotated. The issue is that most data, 80% – 90% according to estimates, is unstructured. This is another hurdle that data collection experts need to overcome.
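To make the distinction concrete, here is a small sketch of how a team might take stock of what it has collected. It assumes a hypothetical mapping from file extension to modality and a hypothetical set of files that already have annotations; none of these names come from a specific tool.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-modality map; extend it for your own data.
MODALITY_BY_EXTENSION = {
    ".csv": "structured",
    ".txt": "text",
    ".wav": "audio", ".mp3": "audio",
    ".jpg": "photo", ".png": "photo",
    ".mp4": "video",
}


def summarize(file_names, annotated):
    """Count files per modality and list unstructured files without labels."""
    counts = Counter()
    unlabeled = []
    for name in file_names:
        ext = PurePosixPath(name).suffix.lower()
        modality = MODALITY_BY_EXTENSION.get(ext, "unknown")
        counts[modality] += 1
        # Anything that is not tabular still needs annotation before training.
        if modality != "structured" and name not in annotated:
            unlabeled.append(name)
    return counts, unlabeled


# Hypothetical collection: one table and four native-format files.
files = ["yields.csv", "field_01.JPG", "field_02.jpg", "interview.wav", "drone.mp4"]
annotated = {"field_01.JPG"}  # files that already have labels
counts, unlabeled = summarize(files, annotated)
unstructured_share = 1 - counts["structured"] / len(files)
print(dict(counts))
print(f"{unstructured_share:.0%} unstructured, {len(unlabeled)} still need annotation")
```

An inventory like this does not solve the curation problem, but it shows early on how much of a collection is usable as-is and how much annotation work remains.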


Legal Regulations

Another challenge of text, audio, photo, and video data collection for ML training models is the body of legal regulations governing the collection of personal data. Any company that wishes to compile personal data must comply with stringent regulations, both local and international.

An example is the EU’s General Data Protection Regulation, which governs how the personal data of individuals in the EU can be collected and processed. A more localized example is the California Consumer Privacy Act. And there are many more local and regional laws that apply to specific situations. In short, any collection of personal data comes with additional legal challenges.

The Cost Associated with Physical Data Collection

On top of availability, usability, and legality, data collection can also be hampered by a more practical concern: cost. Collecting physical data requires a large amount of manual labor and/or expensive equipment, from drones to sensor systems. This is a challenge that many organizations cannot overcome if they wish to develop machine learning models in-house.

The Expertise Required

Finally, data collection needs to be done by trained personnel in order to be done correctly. Hiring or training staff, naturally, requires financial investment and time. The issue is that many ML projects are one-off: once an organization has implemented its AI solution, it no longer needs the services of in-house data scientists.

So, an organization would need to hire or train staff for relatively short-lived projects, which is not cost-effective. Additionally, many experts will not look for employment that does not offer job security. Thus, the expertise required for data collection is in short supply for many employers. One way that organizations overcome this challenge is by outsourcing data collection.

We Can Help You Overcome the Challenges of Data Collection

Aya Data provides expert data collection and acquisition services for ML projects. We employ full-time data collection professionals who can acquire the data you need, or we can augment your staff with those same experts for as long as you need them.

If you have any questions about our collection process or wish to discuss your project, schedule a free consultation with one of our senior staff members and we can talk about how Aya Data can add value to your project.
