The Importance of Data Quality for Machine Learning: How Bad Data Kills Projects
You will hear many data scientists use the phrase ‘rubbish in, rubbish out’. In the world of machine learning, it means that an algorithm will not serve its intended purpose if the training data is no good. And professional data acquisition for machine learning is the first step to having good training data.
That is what this article will focus on – data acquisition. We will explain what data acquisition is in general, expand on several data acquisition methods for ML algorithms, list the characteristics of good datasets, and discuss why accurate training data is important, while providing real-life examples of these processes. So let’s get started.
In the simplest of terms, data acquisition is the process of sourcing data that can be cleaned and pre-processed and later used to train machine learning algorithms. In a narrower sense, data acquisition for machine learning refers to measuring and recording real-world signals that are then digitized so that a machine can read them and learn from them.
And this is where some confusion stems from, because when people say data acquisition, they may mean:
- sourcing existing data that can be cleaned and pre-processed into training data (the broader sense), or
- measuring and recording real-world signals and digitizing them (the narrower sense).
Depending on the type of machine learning project, the first or the second approach or even both may be utilized to acquire the necessary data. To complicate things a bit further, you can even use synthetic data as training data, i.e., non-real data that was simulated and generated, although synthetic data has some limitations.
But, to avoid overly complicating things at the very beginning, we will first explain how data acquisition systems (DAS) gather and digitize real-world data, then exemplify the three basic data acquisition sources, before we go on to give an overview of synthetic data. So let’s discuss DAS.
A data acquisition system has three basic components:
- sensors (transducers) that convert a physical phenomenon, such as temperature, pressure, or sound, into an electrical signal;
- signal conditioning circuitry that amplifies and filters that signal so it can be measured reliably;
- an analog-to-digital converter (ADC) that turns the conditioned signal into digital values a computer can process.
So, a DAS is the basis of data acquisition when we are talking about it in the narrow sense, that is, measuring real-world data and transforming it into a format computers can decipher.
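To make the digitization step concrete, here is a minimal sketch of what an ADC does, using NumPy. The sine wave standing in for a conditioned sensor signal, the 10-bit resolution, and the voltage range are all invented for illustration – this is a toy model, not a real device driver.

```python
import numpy as np

def digitize(signal, n_bits=10, v_min=-1.0, v_max=1.0):
    """Quantize an analog signal the way a simple ADC would."""
    levels = 2 ** n_bits
    # Clip to the ADC's input range, then map voltages to integer codes.
    clipped = np.clip(signal, v_min, v_max)
    codes = np.round((clipped - v_min) / (v_max - v_min) * (levels - 1))
    return codes.astype(int)

t = np.linspace(0, 1, 1000)          # 1 second sampled at 1 kHz
analog = np.sin(2 * np.pi * 5 * t)   # a 5 Hz "sensor" signal (made up)
samples = digitize(analog)           # integer codes a computer can store
```

The result is a stream of integers that can be stored, cleaned, and eventually fed into a training pipeline.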
Now, let’s talk about data acquisition for machine learning in the broader sense – sourcing data that can be cleaned and processed to be used as training data for ML algorithms. There are three basic sources of data and, as we’ve mentioned, an ML project may use just one or a combination of them, depending on the needs of the project.
One data source is manual collection – an organization gathers the specific data it needs for a project itself. This may be the case when a machine learning algorithm is required to do predictive analysis, so it only needs accurate data from one organization or company.
For instance, an eCommerce website may wish to use only the information it collects about its own customers to accurately predict their behavior. Manual data collection also covers cases where a company digitizes its physical ledgers.
Of course, manual data collection also includes the case where a company or organization uses a data acquisition system to measure real-world signals and transform them into a digital format, as we’ve discussed above. Manually collecting data does not always require the involvement of a data scientist.
If we consider the example of digitizing ledgers – most people could likely do it with a little training. Acquiring customer data from an eCommerce website is something many software engineers can do.
And many devices that measure real-world phenomena and transform them into a digital format do so automatically. However, preparing collected data for machine learning does require data scientists: training data needs to be accurate, so it must be correctly chosen, cleaned, and processed.
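As a sketch of what that cleaning step can look like, here is a small pandas example. The customer records, column names, and quality problems are all made up – the point is only to show the shape of the work.

```python
import pandas as pd

# Hypothetical raw customer records with typical quality problems.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, -1, -1, 51, None],          # -1 and None are bad entries
    "total_spent": ["120.50", "80", "80", "notanumber", "45.99"],
})

clean = (
    raw.drop_duplicates(subset="customer_id")   # drop the duplicated customer
       .assign(
           # Invalid (non-positive) ages become missing values.
           age=lambda df: df["age"].where(df["age"] > 0),
           # Non-numeric strings become missing values instead of crashing.
           total_spent=lambda df: pd.to_numeric(df["total_spent"],
                                                errors="coerce"),
       )
       .dropna()                                # discard rows we could not repair
)
print(clean)
```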
A data warehouse is a centralized storage space for data. A data warehouse may be internal, such as when an organization manually collects its data and stores it in a central repository. The data in a data warehouse is typically structured, fitting into a tabular format.
The data in a data warehouse is usually stored via the extract, transform, load (ETL) approach – data is transformed before it is loaded into the warehouse, which is what makes it structured. Think of a digital warehouse full of tables akin to Google Sheets.
Data lakes store data just like data warehouses, but the difference is that they hold both structured and unstructured data. So, besides data in tabular format, you can find video files, images, PDFs, audio, etc. Data enters a data lake via the ELT (extract, load, transform) approach: it is loaded raw and transformed (processed) later, when it is needed.
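The difference between the two approaches is simply where the transform step sits. Here is a schematic sketch in Python; the extract and transform functions are stand-ins, not a real pipeline.

```python
# Schematic only: the storage targets and the transform are stand-ins.
def extract():
    return [{"price": "19.99"}, {"price": "5.00"}]

def transform(rows):
    # Impose structure up front, e.g. parse strings into numbers.
    return [{"price": float(r["price"])} for r in rows]

warehouse, lake = [], []

# ETL: transform first, so only structured rows reach the warehouse.
warehouse.extend(transform(extract()))

# ELT: load raw rows as-is; structure is imposed later, at read time.
lake.extend(extract())
structured_view = transform(lake)
```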
However, data warehouses and data lakes don’t need to be internal to an organization. There are public, open, cloud-based data warehouses and lakes that compile data from many sources and can be used for data acquisition for machine learning projects. Some of the biggest organizations provide these types of open datasets, for example Amazon (the Registry of Open Data on AWS), Google (Google Dataset Search), and Microsoft (Azure Open Datasets).
Similar to public data lakes and warehouses, there are organizations that compile data and then sell it. This data acquisition source functions much like the previous one, with the (major) exception that you need to pay for the data you use for your ML project.
Another data acquisition method for ML algorithms is not to use real-world data at all, but to generate synthetic data. Generating synthetic data has some serious limitations, however, because it is very difficult to reproduce every feature that could be encountered in a natural dataset.
In practice, synthetic data is most often used to complement real-world datasets. To avoid complicating things further, we will not expand on the pros, cons, and use cases of synthetic data, but, if you are interested, you can read an in-depth analysis of synthetic data here.
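For a flavor of what generation looks like, here is a minimal sketch that samples synthetic customer records from assumed distributions using NumPy. The feature names and distributions are invented; real synthetic-data tools fit them to an actual dataset rather than hard-coding them.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000

# Invented feature distributions, purely for illustration.
synthetic = {
    "age": rng.normal(40, 12, n).clip(18, 90).round(),
    "monthly_spend": rng.lognormal(mean=4.0, sigma=0.5, size=n).round(2),
    "is_returning": rng.random(n) < 0.3,   # ~30% returning customers
}
```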
Until now, we’ve discussed data acquisition methods for ML algorithms, but an ML algorithm does not serve its purpose if the training data is not good. So, what are the characteristics of a good dataset for ML? Broadly, good training data is:
- accurate – the values are correct and correctly labeled;
- relevant – the data actually reflects the problem the model is meant to solve;
- complete – important values are not missing;
- representative and balanced – the dataset covers the cases the model will encounter, without skew toward one group;
- sufficiently large – there is enough data for the algorithm to learn reliable patterns.
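Several of these characteristics can be checked cheaply before training. Here is a small, hypothetical pandas helper that surfaces the most common problems; the checks and the function name are our own illustration, not a standard API.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, label: str) -> dict:
    """Cheap checks that catch common training-data problems early."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        # A heavily skewed label distribution points to an unbalanced dataset.
        "label_balance": df[label].value_counts(normalize=True).to_dict(),
    }

# Example with a toy, deliberately unbalanced dataset.
toy = pd.DataFrame({"feature": range(10), "churned": [0] * 9 + [1]})
print(quality_report(toy, label="churned"))
```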
What we’ve discussed so far has been mostly theoretical, so let’s take a look at a real-life example of an ML project from Columbia University that was trained on bad data. A machine learning algorithm was created to sort which pneumonia patients should be admitted to the hospital and which should stay at home and take antibiotics.
Historical data from clinics was used to assess the risks and sort the patients. The algorithm was mostly accurate but contained a crucial flaw. Asthmatic patients who have pneumonia are always sent to intensive care because asthmatics have the highest risk of complications, and because they receive intensive care, these patients have low death rates.
As a result of that low death rate, the ML algorithm did not interpret asthma as a severe risk factor during pneumonia and recommended that these patients be sent home. If the recommendations had been followed and these patients sent home, the death rate would have risen severely.
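You can reproduce the trap with a few lines of pandas. The numbers below are made up to mirror the study’s pattern: asthmatics survive because they receive intensive care, yet a model that only sees the outcome concludes that asthma is protective.

```python
import pandas as pd

# Made-up numbers: 100 asthmatics (all in intensive care, 5% mortality)
# and 900 non-asthmatics (standard care, 10% mortality).
records = pd.DataFrame({
    "asthma":    [True] * 100 + [False] * 900,
    "intensive": [True] * 100 + [False] * 900,   # confounded with asthma
    "died":      [False] * 95 + [True] * 5
               + [False] * 810 + [True] * 90,
})

# A model trained on raw outcomes sees asthma as "protective"...
print(records.groupby("asthma")["died"].mean())
# ...because the treatment driving the low death rate is hidden in the
# labels. Dropping the confounder column does not remove its effect
# from the outcomes the model learns from.
```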
This is a clear example of why good training data is necessary for ML algorithms.
In case you are reading about data acquisition for machine learning because you are thinking of implementing ML and AI solutions into your organization, Aya Data can help. Contact us to talk to an expert if you need help focusing your ideas and to learn how we can put your ideas into practice.