The Lack of Diversity in Datasets and Why We Should Care


Machine learning models are often trained on datasets that lack diversity, which has led to ethical harms against specific communities. AI ethics researchers are pushing for solutions that bring more transparency to model development and dataset creation. Regulations on AI are still coming into effect; until they do, there will be little pressure to diversify datasets.

Despite the advancement and adoption of machine learning, much work remains on bias, diversity, and inclusion within datasets themselves. Leaving specific communities out of datasets embeds that lack of representation in the algorithms trained on them. One manifestation of this problem is facial recognition failing to process Black faces, as highlighted by the Algorithmic Justice League in the documentary Coded Bias. Facial recognition can also misidentify faces, resulting in harms against those communities. One striking case occurred in 2015, when Google Photos labeled a Black couple as gorillas; Google temporarily resolved the problem by removing the ‘gorilla’ tag from its categorization. The same kind of problem recurred in the middle of 2020 at Meta (known at that time as Facebook), when a user watching a video from a British tabloid featuring Black men saw an automated prompt asking if they would like to “keep seeing videos of Primates.”

These examples typically involve supervised learning, which requires humans to manually label data. However, unsupervised machine learning algorithms, which use vast quantities of data without human labeling, have problems of their own. One notable example is OpenAI’s GPT-3, a language generation model that creates text from little input. It was trained on 570 GB of data and exhibits anti-Muslim bias in its generated text. The training dataset includes text posted to the internet, such as English-language Wikipedia, and books uploaded to the internet. The training data contains linguistic regularities that reflect unconscious human biases, such as racism, sexism, and ableism.

The inability to address ethical problems in machine learning systems has begun to affect companies on the frontier of AI development. Google Cloud recently turned down a request for a custom financial AI, saying that research on combating unfair bias must catch up and, “until that time, we are not in a position to deploy solutions.” Meta announced plans to shut down its decade-old facial recognition system and delete the face scan data of more than one billion users.

Without a way to properly address these ethical issues, the progress of AI will be slowed and could potentially stop altogether.

How Can We Move Forward?

Academic researchers in AI ethics are pushing for changes that reduce the chances of deploying machine learning models in contexts for which they are not well suited. One paper, Model Cards for Model Reporting, proposes model cards, short documents that record a model's performance characteristics, to avoid this issue. The model cards would accompany trained machine learning models and provide "benchmarked evaluation in a variety of conditions, such as different cultural, demographic, or phenotypic groups (e.g. race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g. age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains." The model cards would also include the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
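To make the idea of benchmarked evaluation across groups and intersectional groups more concrete, here is a minimal sketch in Python. It is not taken from the Model Cards paper. The function name `disaggregated_accuracy`, the record fields, and the sample values are all assumptions made up for illustration, and accuracy is only one of many metrics a model card might report.

```python
from collections import defaultdict


def disaggregated_accuracy(records, group_keys):
    """Return accuracy per group.

    A group is the tuple of values a record has for the attributes
    named in group_keys, so passing two keys yields intersectional groups.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in group_keys)
        total[group] += 1
        if record["label"] == record["prediction"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records; every value here is invented.
records = [
    {"sex": "female", "skin_type": "V", "label": 1, "prediction": 0},
    {"sex": "female", "skin_type": "V", "label": 1, "prediction": 1},
    {"sex": "female", "skin_type": "II", "label": 1, "prediction": 1},
    {"sex": "male", "skin_type": "V", "label": 0, "prediction": 0},
    {"sex": "male", "skin_type": "II", "label": 1, "prediction": 1},
    {"sex": "male", "skin_type": "II", "label": 0, "prediction": 0},
]

# Single-attribute breakdown.
by_sex = disaggregated_accuracy(records, ["sex"])

# Intersectional breakdown (sex and Fitzpatrick skin type).
by_sex_and_skin = disaggregated_accuracy(records, ["sex", "skin_type"])

for group, accuracy in sorted(by_sex_and_skin.items()):
    print(group, round(accuracy, 2))
```

In this toy data, the breakdown by sex alone already shows a gap, but the intersectional breakdown shows that the errors are concentrated in one subgroup. That is the kind of detail an aggregate score can hide.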

Another paper, Datasheets for Datasets, suggests a datasheet for each dataset that would describe its operating characteristics, test results, recommended uses, and other information. Datasheets would be used to improve transparency and accountability in the machine learning community. Microsoft, Google, and IBM have started to pilot datasheets for datasets within their product teams. Creating more documentation within the machine learning development process is one step closer to building more inclusive algorithms.
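As a rough illustration, a datasheet can be thought of as a small structured record that travels with the dataset. The sketch below is an assumption about how a team might represent one in Python, not a format defined by the Datasheets for Datasets paper. The class name `Datasheet`, its field names (which mirror the sections listed above), and the example dataset are hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Datasheet:
    """A hypothetical, minimal datasheet record for one dataset."""
    name: str
    operating_characteristics: str
    test_results: dict = field(default_factory=dict)
    recommended_uses: list = field(default_factory=list)
    other_information: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the names of sections that were left empty."""
        return [key for key, value in asdict(self).items() if not value]


# Hypothetical dataset; the name and descriptions are invented.
sheet = Datasheet(
    name="example-faces-v1",
    operating_characteristics="Frontal face photos taken in indoor lighting",
    recommended_uses=["research on face detection"],
)

# Serialize the datasheet so it can be stored next to the data.
print(json.dumps(asdict(sheet), indent=2))

# Flag sections that still need to be filled in before release.
print(sheet.missing_sections())
```

A check like `missing_sections` is one simple way a team could make incomplete documentation visible before a dataset is shared.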

Changing Regulations Around Data Diversity

Even as more large companies are driven to build AI systems, diversity and inclusion in datasets have yet to become heavily regulated.

The EU has proposed rules that classify AI systems into three risk categories. The Canadian federal government has started requiring algorithmic impact assessments for all systems delivered to the federal government. The US Federal Trade Commission published an article clarifying its authority under existing law to pursue enforcement actions against organizations that fail to mitigate AI bias or other unfair or harmful outcomes of using AI. The Office for AI in the UK recently released the National AI Strategy, which includes plans to develop the UK’s position on governing and regulating AI; that position is set for publication in early 2022.

Although North America and Europe have started moving towards more regulation, it is unclear when these rules will come into effect. GDPR, for example, was proposed in 2012, adopted in 2016, and went into effect in 2018. Until official regulation comes into effect, companies will lack the momentum needed to bring more diversity into their datasets.
