
The Flaws of the Machine Learning Paradigm

Generative AI and LLMs fascinate people. In an earlier post entitled “Her,” I linked to many breathless comments on the revolutionary impact of ChatGPT-4o.


My acquaintances who are aware of my interest in the field constantly feel the urge to share with me the amazing feats of ChatGPT they have observed. With a hint of glee, they wonder where this will all end. A Sora clip on YouTube seems to be a harbinger of Artificial General Intelligence.


They are not alone. Some luminaries, like Geoffrey Hinton, believe that LLMs are already capable of sentient thought and are afraid that this will end badly for humanity.  


Conversely, when I look at LLMs, I see a nascent technology with many inherent problems. As you can see in my previous posts, the laundry list includes incorrect or bogus results known as hallucinations, shameless instances of plagiarism, gleeful appropriation of intellectual property, a huge carbon footprint and a dearth of useful applications. And of course, a Mechanical Turk problem, with more examples popping up all the time.


I am not the only one expressing doubts about Gemini, Claude, Llama and ChatGPT-4o. Many researchers voice similar concerns. Gary Marcus has been saying it for a long time. Even Meta’s Chief AI Scientist, Yann LeCun, has recently called LLMs a dead end.


Recently, I had the opportunity to explain to some colleagues my reluctance to accept the Ray Kurzweil vision, and their feedback convinced me that my perspective might be worth sharing.


Because I come from a social science background (Political Economy, with solid Sociology and Social Psychology add-ons), my outlook is not purely technical: I question the LLMs’ ability to reach AGI from an interdisciplinary perspective. I believe that, at least in their current form, LLMs are unlikely to turn into highly intelligent tools, let alone become agents of the Singularity.


There are several reasons for that.


1.   The Technology Behind LLMs Is Limited


From the outset, AI research was about duplicating the way the human brain works. However, after a series of inconclusive attempts, including rule-based expert systems and, most notably, Doug Lenat’s Symbolic AI project (Cyc), Geoffrey Hinton’s neural networks revolutionized the field around 2013. DeepMind, the champion-beating Chess and Go programs, and the Transformer approach all grew out of that work.


Speaking of the latter, what really made current LLMs viable was the Transformer concept introduced in 2017, as summarized in the Attention Is All You Need paper. A group of Google engineers were trying to improve Google’s machine translation, which at the time relied on an iterative, word-by-word (and back) approach. They came up with an innovative solution: assigning a value to each word in a sentence based on its connections to the other words, and doing the same for the rest of the text, created a small context and expanded the “attention” of the algorithm.



When they trained the algorithm on a large number of other texts annotated this way, with their variably connected words, the statistical connections it captured enabled it to translate texts more accurately.



Starting with representations of individual words, or even pieces of words, the model aggregates information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing whether the sentence ends in “... road.” or “... river.”
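
For the technically curious, here is a toy Python sketch of that weighting idea. The four-dimensional word vectors are ones I made up for illustration; real models learn embeddings with thousands of dimensions and use learned projections, but the gist of letting the context pull an ambiguous word toward one sense is the same.

```python
# A toy illustration of attention-style weighting, not a real Transformer.
# The 4-dimensional word vectors below are invented for this example.
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Hypothetical embeddings: the first two dimensions stand for the
# "waterway" sense, the last two for the "roadside building" sense,
# and "bank" starts out ambiguously in the middle.
vocab = {
    "bank":  np.array([0.5, 0.5, 0.5, 0.5]),
    "river": np.array([0.9, 0.8, 0.0, 0.1]),
    "road":  np.array([0.0, 0.1, 0.9, 0.8]),
}

def contextualize(word, context):
    """Blend a word's vector with its context, weighted by similarity."""
    q = vocab[word]
    keys = np.stack([vocab[w] for w in context])
    weights = softmax(keys @ q)          # how much "bank" attends to each word
    return weights, (q + weights @ keys) / 2

for ending in ("river", "road"):
    weights, adjusted = contextualize("bank", ["bank", ending])
    print(f"...crossing the {ending}:",
          "attention", np.round(weights, 2),
          "adjusted 'bank'", np.round(adjusted, 2))
```

Run it and you will see the adjusted “bank” vector drift toward the waterway sense after “river” and toward the roadside sense after “road.”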


I will propose something more basic to illustrate the point. If I used the epithet “voracious” in a text, you would not normally think that the next word might be “swimmer,” right? Your likely guess would be “reader,” since in most texts you have ever come across, “voracious” qualifies “reader.” Perhaps “eater” in some cases, but never “swimmer.”


Multiply this correlation by a billion. You get the idea.
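
If you want to see the most primitive version of that guessing game, here is a toy sketch that simply counts word pairs in a corpus I made up. It is nowhere near what an LLM actually does, but the principle of picking the statistically most likely continuation is the same.

```python
# A toy next-word guesser built from raw word-pair counts.
# The "corpus" is invented; real models digest billions of texts
# and use far more context than the single preceding word.
from collections import Counter, defaultdict

corpus = (
    "she is a voracious reader . he was a voracious reader as a child . "
    "the puppy is a voracious eater ."
).split()

next_word = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev][nxt] += 1

print(next_word["voracious"].most_common())
# [('reader', 2), ('eater', 1)] -> "reader" wins;
# "swimmer" never appears, so it would never be guessed.
```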


This eventually led to LLMs that could go beyond translation and generate intelligible, meaningful sentences after going through billions of texts and capturing the statistical connections between words.


I am somewhat oversimplifying the process, as there is complicated math behind it, but conceptually not by much. Given this underlying technology, you realize that LLMs have no real understanding of the texts they generate. They just blindly guess what the next word could or should be, based on statistical connections and the context/subject provided by the prompt.

The so-called reversal curse provides another illustration.


We test GPT-4 on pairs of questions like, “Who is Tom Cruise’s mother?” and, “Who is Mary Lee Pfeiffer’s son?” for 1,000 different celebrities and their actual parents. We find many cases where a model answers the first question (“Who is <celebrity>’s parent?”) correctly, but not the second. We hypothesize this is because the pretraining data includes fewer examples of the ordering where the parent precedes the celebrity (eg “Mary Lee Pfeiffer’s son is Tom Cruise”).


Tom Cruise and his mother are two tokens statistically linked in one direction. That is it. In other words, LLMs do not constitute a form of intelligence, artificial or otherwise.
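
A caricature makes the one-directional link obvious. The sketch below is only a caricature, of course: GPT-4 does not store lookup tables. But it shows why a fact memorized in one ordering does not automatically become available in the other.

```python
# A caricature of the reversal curse: "knowledge" acquired only in the
# forward ordering that appears in the training data.
training_sentences = [
    "Tom Cruise's mother is Mary Lee Pfeiffer",
    # The reverse ordering ("Mary Lee Pfeiffer's son is Tom Cruise")
    # is rare in real training data, so it is simply not here.
]

completions = {}
for sentence in training_sentences:
    prompt, answer = sentence.rsplit(" is ", 1)
    completions[prompt + " is"] = answer

print(completions.get("Tom Cruise's mother is", "<no idea>"))      # Mary Lee Pfeiffer
print(completions.get("Mary Lee Pfeiffer's son is", "<no idea>"))  # <no idea>
```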


AI researcher Simon Willison recently called this method a “statistical autocomplete.” But perhaps a more illuminating analogy is Derek Slater’s “musical chord progression,” which shows why this is not such an amazing feat.


2.   The Learning Assumptions Behind LLMs Are Flawed


There is an implicit model behind the LLM approach: the way children learn how to speak.


Researchers assume that babies mimic sounds, then connect them with objects, and through trial and error discover laws of physics like gravity, followed by social rules and norms. Eventually, they start speaking with a clearer understanding of their context. Here is a descriptive corporate write-up:


Imagine teaching a child to speak. You start with the basics – words, then sentences. Over time, with guidance and feedback, the child learns not just to speak, but to converse. This process is very similar to the current approach to training LLMs.


If this model is correct, then OpenAI CEO Sam Altman and other leading AI figures have a point. If, given enough resources and educational opportunities, kids can go from babbling toddlers to Nobel Prize winners, imagine what ChatGPT52 could do with much larger training datasets, multitrillion-dollar data centers and unlimited energy sources?


Or what self-supervised greatness Yann LeCun’s V-JEPA platform could achieve with slightly fewer resources?


However, unlike them, I contend that this “Toddler Learning Process” model is based on flawed cognitive assumptions.


Let’s examine Language and Vision.


Cognitive Assumption I: Language


The toddler learning process might appear plausible when you look at your kids and their linguistic and cognitive development in a social environment, but this is not what we are doing with LLMs. Unlike children, LLMs are not social beings with formidably plastic brains equipped with five senses that evolved over six million years. This is the first time we are attempting to teach these alien digital life forms how to speak.


We need to think more in terms of the origin of language.


Let’s do a thought experiment, the kind theoretical physicists liked to do early in the previous century to make concepts intelligible, like Einstein’s train in a thunderstorm. Let’s consider the implicit notions regarding the birth of language.


This is what most people imagine: early in our evolution, one of our ancestors had the serendipitous idea of pointing to an object and making a sound, and others liked the sound and accepted it as the name of that object.


Let’s use Lucy as our inventor. Apple, she said, apple it was. Stone she intoned, and stone was named. Sun, trees, berries and cave followed. And a few centuries later, our hominid ancestors were gathered around a communal fire telling tall tales to one another using their new verbal skills.


What that simplistic model ignores is that such a scenario would require several preexisting elements. Lucy would have to have been able to recognize herself and her interlocutor as two separate individuals. The Self and the Other. This is a prerequisite for communication.


She would also have to understand what communicating means, as it is an abstract concept, and she would have to have the desire, motivation or reason to undertake it.


There would have to have been a prior social convention about the meaning of pointing your finger at an object, and both she and the other person would have to be aware of it (our closest genetic cousins, the primates, do not follow our pointing gesture, whereas our fully socialized dogs do).


Finally, she would have to have a conceptual construct about naming objects as distinct entities.

None of these premises are possible without a preexisting linguistic structure to hold these concepts and a logical construct to give them meaning and purpose. No self, no other, no communication, no gesture, no naming, and no invention of language.


That is why there is a philosophical debate about whether language is innate or learned. If you are curious, Google Noam Chomsky and then Jean Piaget’s theory of cognitive development. Our current understanding is that it is a combination of both.


If that is correct, and I cannot fathom how it could not be, I am unable to comprehend how feeding more data to the toddler/algorithm will make them/it more intelligent, since it largely lacks that preexisting logical/linguistic structure.


If we are following the toddler learning process as a model, then it is clear to me that, without preexisting rules and logical structures in place, the resulting AI will not have the contextual understanding and common sense of a five-year-old kid or the family pet. 


The Vision component makes that even clearer.


Cognitive Assumption II: Vision


In 1993, the late and great neurologist Oliver Sacks published a short article in the New Yorker magazine about a blind man who regained his eyesight after nearly five decades of sightless existence. He called him Virgil.


After Virgil’s successful operation, instead of the Biblical scales falling from his eyes and him receiving sight, “what he saw had no coherence. His retina and optic nerve were active, transmitting impulses, but his brain could make no sense of them; he was, as neurologists say, agnosic.”

In other words, he could see optically but he was blind mentally. Data went in but the processing did not yield the same result as it would in a sighted person.

In practical terms, Virgil could see his cat and dog, see their heads, ears and tails, but could not perceive them as a whole cat or dog and kept confusing them with one another. Unless, that is, he touched them.


He could see the steps on a staircase but could not understand the spatial connection that combines them and could not climb them without his cane. Colors and movement confused him, and space and distance eluded him. No amount of explanation made any difference.


Remember, this is a person with a regular information-processing brain.


Eventually, Virgil went back to being blind because his underdeveloped visual cortex could not handle the pressure of pretending to be sighted. More importantly, despite the plasticity of the human brain, his visual cortex never got any better at making sense of the visual data. And this despite his best efforts to understand the visual world and the constant help of his wife and family.


If we use this analogy, what the Machine Learning approach is attempting to do is push massive amounts of data through the optic nerve and get a rudimentary neural network to make sense of it. The hope is that, over time, the system will develop the same abilities as a human being.

Is that a realistic expectation?



For example, we can learn new information by just seeing it once, while artificial systems need to be trained hundreds of times with the same pieces of information to learn them. Furthermore, we can learn new information while maintaining the knowledge we already have, while learning new information in artificial neural networks often interferes with existing knowledge and degrades it rapidly.
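
The second half of that claim is known as catastrophic forgetting, and it is easy to reproduce even in a toy model. The sketch below uses a deliberately tiny two-weight model and two made-up “tasks” of my own invention; real systems use various mitigations, but the basic effect is there.

```python
# A minimal demonstration of catastrophic forgetting with a two-weight
# linear model and two invented "tasks" that share a weight.
import numpy as np

def train(w, x, y, steps=200, lr=0.1):
    """Plain gradient descent on squared error for a single example."""
    for _ in range(steps):
        error = w @ x - y
        w = w - lr * error * x
    return w

def loss(w, x, y):
    return float((w @ x - y) ** 2)

task_a = (np.array([1.0, 1.0]), 1.0)   # task A uses both weights
task_b = (np.array([1.0, 0.0]), -1.0)  # task B reuses (and overwrites) the first

w = np.zeros(2)
w = train(w, *task_a)
print("task A after training on A:", round(loss(w, *task_a), 4))  # ~0

w = train(w, *task_b)
print("task B after training on B:", round(loss(w, *task_b), 4))  # ~0
print("task A after training on B:", round(loss(w, *task_a), 4))  # much worse
```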


Yet Virgil’s case and similar ones suggest that even the vastly superior human brain is incapable of handling this much data without a highly capable processing structure in place. Data by itself does not lead to more comprehension.


That is why OpenAI’s Sora, a multimodal text-to-video AI, has no idea of the laws of physics or how space works.


Can this Dalmatian safely make it to the next window in the real world and not fall off?


Just like Virgil, Sora is agnosic.


3.   Not Enough Data


Even if LLM proponents dismissed my “limited technology based on flawed assumptions” argument, they would have to concede that there are other obstacles.


One is the insufficient quantity of fresh data to train LLMs. It looks like we are not producing enough new data to create larger training datasets. Both the Stanford AI Index and Epoch AI predict that we will run out of high-quality language stock sometime this year or next.


I suspect that if the current intellectual property litigation prevents LLM companies from accessing copyrighted material, this will happen much sooner. Already, there are companies generating and marketing “synthetic data” for LLM training purposes. Synthetic data refers to datasets created by LLMs to train other LLMs.


You probably know about the massive bias problems that plague the ML and LLM sectors. Imagine how that problem could become exponentially worse when biased algorithms produce ever more biased datasets to train yet more LLMs.


This is GIGO on steroids.
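
Here is a hand-wavy simulation of that feedback loop, assuming a single binary attribute and a “model” that over-produces whatever is already the majority by a modest five percent. The numbers are invented; the compounding dynamic is the point.

```python
# A crude simulation of bias amplification across generations of
# synthetic data. All numbers are invented; the point is the feedback loop.
import random

random.seed(1)

def train_and_generate(data, n_samples=10_000, skew=1.05):
    """The 'model' just learns the majority frequency and slightly inflates it."""
    p = sum(data) / len(data)
    p_skewed = min(1.0, p * skew)
    return [1 if random.random() < p_skewed else 0 for _ in range(n_samples)]

data = [1] * 6_000 + [0] * 4_000        # generation 0: a 60/40 split
for generation in range(1, 6):
    data = train_and_generate(data)
    share = 100 * sum(data) / len(data)
    print(f"generation {generation}: {share:.1f}% majority class")
```

Within a handful of generations, the 60/40 split drifts well past 75/25, and nothing in the loop ever pushes it back.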


Then there is the data poisoning possibility with synthetic data. Ironically, the link I provided is from CrowdStrike as part of their warning literature.


4.   Not Enough Energy

No one really knows how much energy LLM companies use in training their models, because they stopped publishing those figures. There are some estimates, but they are not easy to make.


The challenge of making up-to-date estimates, says Sasha Luccioni, a researcher at French-American AI firm Hugging Face, is that companies have become more secretive as AI has become profitable.


Alex de Vries, a researcher at VU Amsterdam, calculated “that by 2027 the AI sector could consume between 85 to 134 terawatt hours each year. That’s about the same as the annual energy demand of de Vries’ home country, the Netherlands.”


If you look at combined data center figures, the picture is bleaker. The International Energy Agency “says current data center energy usage stands at around 460 terawatt hours in 2022 and could increase to between 620 and 1,050 TWh in 2026 — equivalent to the energy demands of Sweden or Germany, respectively.”


A report from July 2024 quotes the semiconductor research company TechInsights: “AI chips represent 1.5% of electricity use over the next five years, a substantial part of the world’s energy” (2,318 TWh).
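
As a rough sanity check of that number, assuming global electricity consumption of about 30,000 TWh per year (my assumption, not a figure from the report), 1.5 percent over five years lands in the same ballpark as the quoted 2,318 TWh:

```python
# Back-of-the-envelope check of the TechInsights figure, assuming global
# electricity consumption of ~30,000 TWh per year (my assumption).
world_twh_per_year = 30_000
years = 5
ai_chip_share = 0.015

ai_twh = world_twh_per_year * years * ai_chip_share
print(f"{ai_twh:,.0f} TWh over {years} years")  # ~2,250 TWh, close to the quoted 2,318
```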


This is clearly untenable in the age of global warming.


What Is the Way Forward?


If, for the reasons I cited, the data-centric approach behind LLMs and Generative AI constitutes an epistemological and/or practical obstacle, what is the way forward?


Is there a way to tweak the Machine Learning paradigm and LLMs to claim they are now capable of reasoning?


The recently announced Strawberry and Self-Discover are two such approaches.


Can the toddler turn into Einstein and Virgil into a sharpshooter through Agentic AI?


That's my next assignment.

 
 
 
