Basic Data Collection Methods for Machine Learning Projects Explained
Data collection might seem like a simple process on the surface. But because it’s the basic building block of all ML projects, the data needs to be accurate, relevant, and representative of all variations of a problem. Consequently, the choice of data collection methods for machine learning used to gather it is crucial.
Without a good dataset, you can’t have good training data, and without good training data, an ML algorithm can’t serve its intended purpose. And that is what we will examine in this article – data collection techniques and how they can be applied in practice. However, let’s start at the beginning and discuss some basic collection methods.
What Is Data Collection?
In general terms, data collection is the organized process of acquiring information on a specific subject for the purpose of making informed decisions. When it comes to AI and ML, data collection is the methodical process of acquiring data relevant to an ML or AI project.
The collected data becomes a dataset that is then cleaned and pre-processed in order to become training data that can be fed into a machine learning model. While it sounds simple, data collection is a crucial step in ML projects because training data directly impact the performance of machine learning algorithms.
To put it more bluntly, and to borrow a phrase data scientists often use: rubbish in, rubbish out. Consequently, what type of data is collected, and how well it aligns with the goals and objectives of a machine learning or AI project, is crucial.
Types of Data: Primary vs. Secondary
So how can we classify data? A distinction can be made between:
- Primary data – primary or first-party data is data that is gathered directly by an organization. So, an organizational ledger that contains employee information like payrolls, working hours, sick days, etc. is primary data.
Similarly, the data an eCommerce website collects on the way users behave on the site is primary data. Primary data is often unstructured because many organizations lack the capacity to organize and interpret the data they collect. On the other hand, primary data is often the most relevant for gathering specific insights and for ML projects.
- Secondary data – secondary data is data that was collected and stored by another organization or institution. It can be anything from research papers to census data from the US Census Bureau to primary data that other organizations share. Secondary data is often structured, but it may not always be fully applicable to a problem that an organization is trying to solve.
Both primary and secondary data collection methods are used for ML projects, often in tandem.
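As a rough illustration of using both kinds of data in tandem, the sketch below attaches region-level figures from an external source (secondary data) to a company's own order records (primary data). All field names and values are hypothetical and invented for the example.

```python
# Hypothetical primary data: order records the organization collected itself.
primary_orders = [
    {"customer_id": 1, "region": "north", "order_total": 120.0},
    {"customer_id": 2, "region": "south", "order_total": 80.0},
    {"customer_id": 3, "region": "east", "order_total": 45.5},
]

# Hypothetical secondary data: region statistics published by a third party.
secondary_regions = {
    "north": {"median_income": 52000},
    "south": {"median_income": 47000},
}


def enrich(orders, region_stats):
    """Attach region-level secondary attributes to each primary record."""
    enriched = []
    for order in orders:
        stats = region_stats.get(order["region"], {})
        # Regions missing from the secondary source get None, not a guess.
        enriched.append({**order, "median_income": stats.get("median_income")})
    return enriched


rows = enrich(primary_orders, secondary_regions)
```

The third record keeps a missing value because the secondary source has no entry for its region, which reflects the point that secondary data may not fully cover the problem at hand.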
Traditional Data Collection Techniques
Before we delve into the data collection methods for machine learning, let’s first discuss some of the most common traditional methods of gathering primary data for business purposes.
Surveys
Surveys are often used to get internal feedback from employees or external feedback from customers. Surveys can gather both qualitative data, via written responses, and quantitative data, via multiple-choice questions. With surveys, there is minimal interaction with the respondents.
For instance, a software company may utilize a survey to gather data on how to improve a product. They can ask questions about the features of the software that consumers most often use, what other features they would like to see, what they use the software for, etc. When collected and analyzed, this type of data can provide valuable business insight.
Observation
The observation method refers to observing and recording the way people act in an environment and interact with the objects in it. Observation can introduce bias into the data if the observer interacts with the subjects and influences their behavior.
So how can a company utilize observation for business purposes? A common application of this technique is website heat maps, which let analysts see how people behave on a webpage and interact with its components. This method can provide valuable information on the UX design of a page and what alterations are (potentially) needed.
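A heat map of this kind can be approximated by counting recorded clicks per cell of a grid laid over the page. The following is a minimal sketch under that assumption; the function name, the page size, and the click coordinates are hypothetical.

```python
def click_heatmap(clicks, page_width, page_height, cols, rows):
    """Count clicks per grid cell; clicks are (x, y) pixel coordinates."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in clicks:
        if not (0 <= x < page_width and 0 <= y < page_height):
            continue  # ignore clicks recorded outside the page area
        col = int(x * cols / page_width)
        row = int(y * rows / page_height)
        grid[row][col] += 1
    return grid


# Hypothetical click log for a 1000x800 pixel page.
clicks = [(50, 40), (60, 45), (950, 780), (500, 400), (55, 30)]
heatmap = click_heatmap(clicks, page_width=1000, page_height=800, cols=4, rows=4)
```

Cells with high counts mark the page components users interact with most. A real analytics tool would typically also render the grid as a colored overlay.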
Focus Groups and Interviews
Focus groups are typically used to gather qualitative data, i.e., how a select group of people feels about a specific subject. A company that is considering rebranding may first create a focus group of its target audience to assess how people respond to the new design, logo, overall image, etc. Interviews function like focus groups, only one-on-one.
Online Tracking
Online tracking is used to gather mostly quantitative data and has a wide range of uses. For instance, online stores can track purchases to identify their best- and worst-selling products in order to know where to focus their marketing efforts.
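As a simple sketch of the purchase-tracking example, the snippet below tallies units sold per product from a log of tracked orders and ranks the products. The log format and the product names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical purchase log: one (product, quantity) entry per tracked order line.
purchases = [
    ("mug", 2), ("t-shirt", 1), ("mug", 1),
    ("poster", 1), ("t-shirt", 3), ("mug", 4),
]

units_sold = Counter()
for product, quantity in purchases:
    units_sold[product] += quantity

ranked = units_sold.most_common()  # sorted from most to least units sold
best_seller = ranked[0]
worst_seller = ranked[-1]
```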
Social media monitoring is another form of online tracking. Companies may use it to analyze the interests and motivations of their users and how they engage with certain content or products. A significant benefit of online tracking as a data collection method is that the data can be easily stored as soon as it is collected.
However, online tracking can have serious ethical and legal ramifications. With the other common data collection methods, participants give explicit consent; with online tracking, that is not always the case.
There are many regulations regarding the collection of personal data of online users, and any organization that wishes to apply this method should first make sure that its processes and procedures comply with the rules and regulations applicable in its area.
Basic Data Collection Methods for Machine Learning
Machine learning algorithms require relevant and accurate data that can be processed and fed into a model. The dataset also needs to be large enough to cover all relevant variations of a problem. Consequently, some traditional primary data collection techniques are not practically applicable to many ML projects.
Methods like surveys, interviews, observations, or focus groups often cannot produce a large enough volume of data that can be used as training data. Additionally, there is the challenge of analyzing, cleaning, and processing all of this information, which often requires the involvement of a data scientist.
However, secondary data is readily available via large online repositories. Some of these repositories are provided for free by large organizations like Amazon, Microsoft, or Google, while there are other organizations that specialize in data collection and offer their datasets for a fee.
In other words, ML projects require a large amount of data that is difficult to collect via primary data collection methods. Thanks to the availability of secondary data (and some other data collection methods), however, this does not limit the practical application of ML for businesses. So let’s delve a little deeper.
Primary Data Collection
Again, primary data collection refers to the information that an organization has gathered itself. Some companies have been compiling information for years and can use that data for machine learning projects. That data may still need to be digitized and most likely cleaned and processed, but it is available.
One advantage of primary data is that it is highly relevant to the organization that gathered it. Thus, if your company wishes to create a machine learning algorithm that will predict the behavior of your consumers, the data you gathered yourself is the most useful.
However, while primary data collection serves a purpose, many ML projects rely largely on secondary data.
Secondary Data Collection
Secondary data is information that other organizations have collected, stored in large cloud-based repositories, and made available to share. There are two basic types of online repositories:
- Data warehouses – data warehouses usually store structured data, that is, data in a tabular format. When data is stored in a warehouse, it is first extracted, then transformed, and then loaded (this is called the ETL approach). So the data is transformed into a specific format before it is stored, which makes it structured.
- Data lakes – data lakes typically store both structured and unstructured data. Just like data warehouses, lakes can contain data in tabular format, but also audio, images, video, unstructured text, etc. Data lakes take the ELT approach – extract, load, transform. The data is stored in its raw form, so it needs to be transformed into the desired format by whoever uses it. Many ML projects, particularly those involving images, audio, or text, work with unstructured data.
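The difference between the two approaches comes down to when the transformation happens. The sketch below is a deliberately simplified illustration, with plain Python lists standing in for the warehouse and the lake; the record fields and the `transform`, `etl`, and `elt` functions are hypothetical.

```python
# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"name": " Alice ", "signup": "2023-01-05", "spend": "120.50"},
    {"name": "BOB", "signup": "2023-02-11", "spend": "80"},
]


def transform(record):
    """Normalize one raw record into a fixed, typed schema."""
    return {
        "name": record["name"].strip().title(),
        "signup": record["signup"],
        "spend": float(record["spend"]),
    }


def etl(records):
    """ETL: transform first, so the warehouse only holds structured rows."""
    return [transform(r) for r in records]


def elt(records):
    """ELT: load raw data as-is; transformation is deferred until read time."""
    return list(records)


warehouse = etl(raw_records)
lake = elt(raw_records)

# A consumer of the lake applies the transformation it needs when it reads.
training_rows = [transform(r) for r in lake]
```

In this toy case both paths end with the same rows. The practical difference is that the lake still holds the untouched originals, so a different project can later apply a different transformation to them.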
Data warehouses and data lakes are the basic sources of data for many ML projects because datasets for most ML needs can be found in them. But what if neither primary nor secondary data collection methods can produce a dataset that meets the requirements of an ML project?
Data Augmentation
When neither primary nor secondary data collection yields a suitable dataset, data augmentation is an option. Data augmentation refers to the process of expanding a dataset’s size without manually collecting new data. In other words, if you have a relevant dataset but it doesn’t cover all the potential variations of a problem, you can enrich it via data augmentation by transforming the existing data to create new variations.
Image augmentation is a common example of data augmentation. With an image, you can change the saturation, brightness, and contrast, or you can mirror it, crop it, rotate it, etc. This creates new variations that augment the existing dataset used as training data for a machine learning algorithm.
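A few of these transformations can be sketched in plain Python, treating a grayscale image as a nested list of pixel values. This is only an illustration with hypothetical helper names; real projects usually rely on an image or ML library for augmentation.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamped to the 0-255 range."""
    return [[max(0, min(255, px + delta)) for px in row] for row in image]


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# A tiny hypothetical 2x3 grayscale image (pixel values 0-255).
image = [
    [10, 20, 30],
    [40, 50, 250],
]

# Each transformation yields a new variation of the same source image.
augmented = [
    flip_horizontal(image),
    rotate_90_clockwise(image),
    adjust_brightness(image, 20),
    crop(image, top=0, left=1, height=2, width=2),
]
```

Here one source image produces four additional training examples, and the original stays unchanged because every helper returns a new list.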
The Next Step – Pre-processing
This article has discussed data collection methods and, more specifically, data collection methods for machine learning. However, data collection is just the first step in creating training data for machine learning algorithms.
For accurate ML models, you still need the data to be cleaned, annotated, labeled, and properly analyzed. And that is what Aya Data can help you with. We provide data acquisition and processing solutions that can help you build ML and AI datasets at scale.