Artificial Intelligence is a fascinating and confusing field.
We are routinely presented with seemingly mind-boggling exploits of Large Language Models and Large Image-Making Models (LLMs and LIMMs): ChatGPT passes business school exams and DALL-E 2 competes with Renaissance masters. LaMDA convinces an engineer that it is sentient, while Bing yearns to be human.
Yet when we scratch the surface, we find that Bing was adamant that the year was 2022, not 2023, and that ChatGPT was banned by Stack Overflow for confidently and “constantly giving wrong answers.” It also became infamous for fabricating court citations and newspaper articles. Bing misreads financial statements and displays a manipulative, obnoxious personality. And Google’s Bard is equally prone to providing incorrect answers in the same self-assured manner.
When independent researchers inquired about the reasons behind the mistakes, they were informed that the culprit was the dataset.
Interestingly, the same explanation was offered when algorithmic bias was first discovered. We were told that there was some deficiency in the datasets used to train the algorithms behind COMPAS (a sentencing-guideline and recidivism predictor), PredPol (a crime predictor), Amazon’s recruitment engine, Google Photos, IDEMIA’s facial recognition software, and several healthcare-allocation applications, all of which displayed overt bias against women and various minorities. People were given longer sentences, misdiagnosed, confused with primates, and refused employment or benefits.
The problem would go away, we were assured, if larger datasets were used to train the algorithms, since sheer size would dilute the skewed data.
This explanation sounds less convincing when we consider that the dataset that trained ChatGPT (which animates Bing) included some 300 billion words scraped from the Internet up to 2021. Google is tight-lipped about the sources and cutoff date of the 1.56 trillion words that went into Bard, but a similar response would not be surprising.
Current Paradigm: Datasets and Machine Learning
The original AI paradigm was rule-based, built on if-then rules and specialized scripts. ChatGPT’s digital grandmother ELIZA, the first chatbot, which mesmerized the masses in 1966 by pretending to be a psychotherapist, was one such program.
Around 2012, a group of researchers at the University of Toronto, led by Geoffrey Hinton, recently dubbed the “godfather of AI”, demonstrated the power of a different approach: Neural Networks (NNs) trained on massive datasets. Rather than following hand-written rules, NNs apply statistical techniques, conceptually akin to regression analysis with its dependent and independent variables, to find patterns in the dataset, surface correlations invisible to the human brain, and draw conclusions.
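The statistical core of this idea can be sketched in a few lines. The following toy example, with entirely made-up numbers, trains a single artificial "neuron" (equivalent to a logistic regression) by gradient descent: the program is never told the rule, it extracts the correlation between feature and label from the data alone.

```python
import math

def sigmoid(z):
    # Squashes any number into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical dataset of (feature, label) pairs. The hidden "pattern"
# is that larger feature values tend to go with label 1.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

w, b = 0.0, 0.0   # weight and bias, to be learned from the data
lr = 0.5          # learning rate: size of each correction

for _ in range(2000):            # repeated small corrections = "training"
    for x, y in data:
        p = sigmoid(w * x + b)   # the model's current prediction
        w -= lr * (p - y) * x    # nudge parameters to reduce the error
        b -= lr * (p - y)

# After training, the model has absorbed the correlation:
print(sigmoid(w * 1.0 + b))  # low probability  -> predicted label 0
print(sigmoid(w * 3.0 + b))  # high probability -> predicted label 1
```

A real neural network stacks millions of such units, but the principle is the same: whatever regularities, including biases, are present in the data end up encoded in the learned parameters.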
The results were astonishing: the new bots seemed to diagnose illnesses quickly and accurately, beat chess grandmasters decisively and win the Jeopardy! game show with ease. But two interrelated issues plagued the new paradigm. First, the NNs were black boxes whose exact functioning eluded even their designers; second, bias seemed to be present, in varying degrees, in most of the results the algorithms reached.
Since the black boxes could not be tweaked directly, their designers not really knowing how, the only tool left to data scientists and data engineers was to find new methods of reducing bias in the dataset itself. For instance, if facial recognition bots could not correctly identify people with darker skin, the solution was to add millions more such images to the training datasets.
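This dataset-level fix can be illustrated with a minimal sketch, using invented group labels and counts: the underrepresented group is oversampled by duplicating its examples until every group appears as often as the majority one.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical training set: 90 examples from group "A", only 10 from "B".
dataset = [{"group": "A"}] * 90 + [{"group": "B"}] * 10

def oversample(examples, key="group"):
    counts = Counter(e[key] for e in examples)
    target = max(counts.values())     # bring every group up to the majority size
    balanced = list(examples)
    for group, n in counts.items():
        pool = [e for e in examples if e[key] == group]
        # Duplicate randomly chosen minority examples to close the gap.
        balanced += random.choices(pool, k=target - n)
    return balanced

balanced = oversample(dataset)
print(Counter(e["group"] for e in balanced))  # both groups now have 90 examples
```

Note what the sketch makes plain: rebalancing only duplicates data that already exists. It cannot inject perspectives the dataset never contained, which is precisely why the problem did not go away.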
However, the problem did not go away, again for two reasons. First, LLMs are trained on data produced by human beings. Not only do they reflect societal biases; given their computational power, they amplify them. An early Microsoft bot called Tay was let loose on Twitter, and within 24 hours it turned into a homicidal, sexist and racist creature. More recently, rap lyrics produced by ChatGPT suggest that
“If you see a woman in a lab coat, / She’s probably just there to clean the floor, / But if you see a man in a lab coat, / Then he’s probably got the knowledge and skills you’re looking for.”
DALL-E is known for producing horrifyingly antisemitic images.
In short, AI algorithms are biased because they are trained with our data. And we are biased.
Second, some biases are structural and very hard, if not impossible, to eliminate. Certain groups are over- or underrepresented in the underlying databases: African Americans, for instance, are overrepresented in the American criminal justice system and underrepresented in healthcare data. Statistical manipulations such as proxy variables exist, but the fact remains that, for structural reasons, too many African Americans are in the criminal justice system and too few are part of the mainstream healthcare system, and therefore of health statistics.
An even better example is gender bias. The reason ML algorithms and AI applications have been unable to produce gender-neutral results is the fact that every piece of information we collect is inherently gendered. Gender permeates every aspect of social life and it is embedded in everything we do, say and understand. Consequently, data cannot be made gender-neutral.
Perhaps more importantly, while men, women and all gender groups are affected differently by the same phenomena, only the male perspective is treated as universal and worthy of note. For instance, men and women experience natural disasters very differently, yet most measures are designed with men in mind. During the COVID pandemic, many countries refused to collect sex-disaggregated data until it was shown that the virus killed significantly more men than women.
From crash test dummies to cancer research to optimal office work temperatures, women’s presence is marked by their absence.
What we need is not to find ways to produce gender-neutral results, but to create AI tools that will be gender-representative. Instead of eliminating gender differences, which is both impossible and undesirable, we need systems that will identify and address gendered consequences of any course of action, choice, measure, policy, or policy implementation.
Can this be done?
That is the question and, symbolically, the question mark in our name signifies that there is no categorical yes answer. Read our next post to see what we propose as a way out.