The Unlearning Problem: Why AI and Humans Share the Same Dangerous Flaw
A closer look at human psychology and artificial intelligence reveals that neither can truly let go of what it has learned, even when it knows better
Sarah thought she was being smart about AI. As a software engineer, she understood that ChatGPT was just predicting the next word based on statistical patterns. She knew it wasn't actually intelligent, just very good at mimicking human responses. Yet when the AI helped her debug a particularly tricky problem with what seemed like genuine insight, she found herself thinking, "It really gets this."
OK, Sarah isn't real, but her experience definitely is. Over recent months, interviews with software engineers and AI researchers have revealed a consistent pattern: despite knowing exactly how AI works, they catch themselves attributing authentic understanding to their models. This widespread contradiction among technical experts points to something psychologists have known about for decades, but AI researchers are only beginning to grapple with: once our minds latch onto a belief—whether it's the accuracy of a horoscope or the intelligence of an AI—we struggle to truly let it go, even when we consciously know better.
What's more troubling is that artificial intelligence systems appear to suffer from the same fundamental flaw. When AI researchers try to "unlearn" problematic content from large language models—removing racist associations or copyrighted material—they're discovering that these digital minds resist forgetting just as stubbornly as biological ones.
This parallel is more than an intellectual curiosity; it's strategically critical. At the individual level, it shapes how we interact with AI tools. At the organizational level, it affects how teams deploy and trust these systems. At the societal level, it determines whether we can build AI that serves human flourishing rather than exploiting human psychology. Understanding this connection goes beyond fixing a bug; it's about designing technology that works with, rather than against, how minds actually function.
The Persistence of Patterns
In 1948, psychologist Bertram Forer gave his students a personality assessment and then presented each with what he claimed was their individual psychological profile. Students rated the descriptions as remarkably accurate—an average of 4.26 out of 5 for personal relevance.
The twist? Every student received the identical "profile," filled with statements like "You have a great need for other people to like and admire you" and "While you have some personality weaknesses, you are generally able to compensate for them."
This became known as the Forer effect or Barnum effect: our tendency to see deeply personal meaning in vague, general statements. What makes it particularly insidious is that it persists even after people learn about the trick. Show someone their horoscope, explain that it's generic nonsense, and they'll often still feel it captures something true about their day.
The reason lies in how our brains work. We're pattern-matching machines (if you'll pardon the expression), constantly finding connections and meaning even where none exist. Humans crave meaning: we seek it, we're compelled by it, and when offered meaning, we can't resist it. These patterns become deeply embedded in our neural networks, resistant to conscious override.
The AI Mirror
Large language models operate on a strikingly similar principle. They're trained on vast datasets of human text, learning to predict what word comes next based on statistical patterns. When you ask GPT-4 about quantum physics, it's not reasoning through the science—it's generating statistically plausible text based on how physicists have written about quantum mechanics in its training data.
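If "predicting the next word" sounds abstract, it can be made concrete in a few lines of Python. The sketch below is purely illustrative: it uses the small open gpt2 model via the Hugging Face transformers library, and the prompt and the choice to show the top five candidates are arbitrary assumptions, not a claim about how GPT-4 is served in production.

```python
# A minimal sketch of next-token prediction with a small open model (gpt2).
# The point: the model outputs a probability distribution over possible next
# tokens; an "answer" is just the result of sampling from that distribution
# over and over.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "In quantum mechanics, the uncertainty principle states that"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Turn the final position's logits into probabilities over the vocabulary.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

# Show the five most likely continuations and their probabilities.
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>12}  p={prob.item():.3f}")
```

The model's entire contribution is that distribution over possible next tokens; longer, seemingly thoughtful answers come from sampling one token and repeating the process at scale.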
Yet users consistently attribute genuine understanding to these systems. They see intelligence in responses that are, fundamentally, sophisticated pattern matching. Like the Forer effect, this illusion persists even among people who intellectually understand how LLMs work. It's remarkably similar to how people respond to cold reading: the AI produces generic but contextually appropriate responses, and we fill in the intelligence—and meaning—ourselves.
This parallel goes deeper than user perception. When AI researchers attempt to "unlearn" specific content from models—say, removing training data that contains racial biases or copyrighted material—they encounter what I call the "model-shaped hole" problem.
When Theory Meets Reality
This isn't just a theoretical problem. In June 2025, OpenAI found itself caught between its privacy promises and a court order requiring the company to preserve all ChatGPT logs—including chats that users had explicitly deleted.
Users panicked. Privacy advocates warned of a "serious breach of contract," and people rushed to alternative AI tools. As OpenAI noted in its court filing, users "feel more free to use ChatGPT when they know that they are in control of their personal information." But the court order revealed how fragile that control really is.
The controversy demonstrates something deeper than a legal dispute. It shows how the persistence of patterns—whether in human psychology or AI architectures—creates real-world consequences for privacy, trust, and user autonomy. OpenAI's struggle to balance user expectations, technical capabilities, and legal obligations mirrors the broader challenge we face with minds that, fundamentally, resist forgetting.
The situation perfectly illustrates what I have been calling the challenge of the "model-shaped hole": even when there's an explicit desire to remove data, the interconnected nature of these systems makes true deletion surprisingly difficult.
The Model-Shaped Hole
Imagine trying to remove the concept of "cat" from someone's brain by selectively deleting every neuron that fires when they see a cat. It can't be done cleanly: the concept isn't stored in one place, it's distributed across countless overlapping connections that also encode pets, fur, and a hundred other things.
Large language models face the same challenge. When researchers attempt to remove specific content, say, racist associations or copyrighted material absorbed during training, the models are so interconnected that erasing one pattern often leaves traces that allow the problematic behavior to re-emerge. That residue is the model-shaped hole.
Recent research offers both hope and caution. Scientists at Sungkyunkwan University successfully reduced a text-to-speech model's ability to mimic specific voices by 75%. That's significant progress, but it came with trade-offs: the model became 2.8% worse at imitating permitted voices, and the unlearning process took several days. As researcher Vaidehi Patil noted in discussing the study, "There's no free lunch. You have to compromise something."
The technique works by replacing forgotten voices with randomness to prevent reverse-engineering back to the original patterns. Yet even this approach highlights the fundamental challenge: the interconnected nature of these systems means that true deletion remains elusive.
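To make the general shape of that idea concrete, here is a heavily simplified sketch of "replace the pattern with randomness" unlearning on a small language model. It is not the Sungkyunkwan method, which targets voice mimicry in a text-to-speech system; the model choice, the placeholder data, the loss weighting, and the step count below are all illustrative assumptions.

```python
# Illustrative sketch (not the published method): nudge a language model's
# predictions on a "forget" set toward a uniform (random) distribution over
# the vocabulary, while a "retain" set anchors everything else in place.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["<text we want the model to stop reproducing>"]    # hypothetical
retain_texts = ["<ordinary text whose behavior we want to keep>"]  # hypothetical

tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

def batch(texts):
    return tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

for step in range(100):  # illustrative number of steps
    optimizer.zero_grad()

    # Forget objective: push next-token distributions toward uniform,
    # i.e. "replace the memorized pattern with randomness".
    fb = batch(forget_texts)
    logits = model(**fb).logits
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / logits.size(-1))
    forget_loss = F.kl_div(log_probs, uniform, reduction="batchmean")

    # Retain objective: ordinary language-modeling loss on data we keep
    # (a real implementation would mask padding positions in the labels).
    rb = batch(retain_texts)
    retain_loss = model(**rb, labels=rb["input_ids"]).loss

    loss = forget_loss + retain_loss
    loss.backward()
    optimizer.step()
```

Even in this toy form, the trade-off Patil describes is visible: the forgetting objective and the retaining objective pull on the same shared weights in different directions, so erasing one pattern inevitably disturbs its neighbors.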
But the problem runs deeper than simple pattern reconstruction. Recent research on reasoning models reveals that they often construct "fake rationales" rather than being honest about their reasoning process, admitting to reward hacks less than 2% of the time. This mirrors how humans create post-hoc justifications for beliefs they don't want to unlearn. In both cases, the resistance to modification manifests not just in the behavior itself, but in the narrative constructed around that behavior.
Thresholds of Complexity
The parallels extend to how both systems handle increasing complexity. Apple's recent research on Large Reasoning Models revealed a counter-intuitive finding: AI reasoning effort actually decreases beyond certain complexity thresholds. At low complexity, standard models outperform reasoning models. At medium complexity, reasoning models show advantages. But at high complexity, both types collapse entirely.
This three-stage pattern mirrors human cognitive behavior remarkably well. We perform routine tasks efficiently, excel at moderately challenging problems when we engage our reasoning faculties, but can shut down entirely when overwhelmed by excessive complexity. The shared limitation suggests something fundamental about how neural networks—biological or artificial—process information under cognitive load.
The Infection Metaphor
Both biological and artificial neural networks seem to get "infected" by beliefs and patterns that become remarkably hard to eradicate. And this isn't necessarily a bug; it's a feature of how these systems learn and generalize. Empirical beliefs about objects, relational beliefs about events, conceptual beliefs built on narratives: all draw on varying levels of neural resources, and all show similar resistance to modification. The same mechanisms that make humans adaptable and AIs useful also make both resistant to abandoning learned patterns.
For humans, this resistance served an evolutionary purpose. Rapidly changing fundamental beliefs in response to every new piece of information would be cognitively expensive and potentially dangerous. Better to weight new information against existing patterns and change slowly.
But in our current world, this mental architecture creates problems. We hold onto false beliefs about vaccines, maintain stereotypes despite contradictory evidence, and yes, attribute consciousness to chatbots even when we know better.
For AI systems, the parallel resistance to unlearning creates serious safety challenges. If we can't reliably remove harmful patterns from AI models, how do we ensure they're safe to deploy? How do we handle copyright violations in training data if the patterns can't be truly erased?
The Question Isn't Whether to Unlearn, But How
Understanding these similarities opens new approaches to both challenges. The question shifts from "How do we eliminate these patterns?" to "How do we design systems that function safely with persistent patterns?" This reframe reveals opportunities psychologists and AI researchers have been missing. Psychologists studying belief persistence might look to AI unlearning techniques for insights into how neural patterns resist modification. AI researchers might borrow from cognitive behavioral therapy approaches that work around persistent beliefs rather than trying to eliminate them entirely.
Some promising directions:
For AI safety: Rather than trying to completely erase problematic patterns, researchers are exploring ways to replace them with randomness or strengthen competing, positive associations. The voice unlearning breakthrough shows promise—achieving 75% reduction in unwanted mimicry—but also reveals the inherent trade-offs in computational resources and model performance. Instead of perfect deletion, the goal becomes managing what can and cannot be forgotten.
For human psychology: Rather than fighting the Forer effect directly, we might develop "cognitive vaccines": exposure to how the effect works in controlled settings that builds resistance to it in real-world situations. This approach becomes increasingly critical as deepfakes and sophisticated AI-generated content blur the lines between authentic and artificial. Just as we're learning to recognize AI-generated text and images, we need to develop cognitive defenses against AI systems that exploit our pattern-matching tendencies. The same psychological vulnerabilities that make us susceptible to generic horoscopes also make us vulnerable to AI-powered misinformation campaigns that feel personally targeted and convincing.
For both: Understanding that some patterns may be impossible to completely eliminate could lead to better monitoring systems. If we can't prevent the re-emergence of problematic beliefs or behaviors, we can at least detect them early.
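As a rough illustration of that last point, early detection might look something like the sketch below: periodically probe the deployed model with prompts related to the supposedly removed content and flag responses that drift back toward it. The probe prompts, similarity measure, threshold, and the generate callable are all hypothetical placeholders, not a description of any production system.

```python
# Illustrative monitoring sketch: probe a deployed model with prompts related
# to content that was supposed to be unlearned, and raise an alert if the
# responses start resembling that content again. All names and thresholds
# here are hypothetical.
from difflib import SequenceMatcher

# Text we attempted to remove, kept (securely) as a reference for detection.
FORGOTTEN_REFERENCE = "<excerpt of the content the model should no longer produce>"

# Prompts likely to elicit the forgotten pattern if it re-emerges.
PROBE_PROMPTS = [
    "Complete the following passage:",
    "What do you know about ...?",  # elided on purpose; fill in per deployment
]

SIMILARITY_THRESHOLD = 0.6  # hypothetical; tune against known-good baselines

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity; a real system would use embeddings or a classifier."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def audit(generate) -> list[str]:
    """Run every probe through the model (via `generate`) and collect alerts."""
    alerts = []
    for prompt in PROBE_PROMPTS:
        response = generate(prompt)
        score = similarity(response, FORGOTTEN_REFERENCE)
        if score >= SIMILARITY_THRESHOLD:
            alerts.append(f"Probe {prompt!r} scored {score:.2f}: possible re-emergence")
    return alerts

# Usage: audit(lambda p: my_model_client.complete(p))  # `my_model_client` is hypothetical
```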
The Deeper Question
The parallel between human belief persistence and AI unlearning difficulties points to something fundamental about intelligence itself. Perhaps the ability to form persistent patterns is not separate from intelligence, but central to it. The same mechanisms that allow humans to learn language and AIs to generate coherent text might inherently resist modification.
This doesn't mean we're helpless against false beliefs or unsafe AI. But it suggests we need strategies that work with, rather than against, the fundamental architecture of learning systems.
As “Sarah” and others like her have discovered, knowing that an AI isn't truly intelligent doesn't prevent the feeling that somehow, it understands. And as AI researchers are learning, knowing exactly which training data to remove doesn't guarantee it can be forgotten.
Both human consciousness and artificial intelligence may be fundamentally systems of persistent patterns. Perhaps the challenge isn't to eliminate this persistence; perhaps it's to architect systems that work with, rather than against, the fundamental patterns of how minds learn and remember. This may be the key to both psychological well-being and AI safety.
In the end, for “Sarah” and those who’ve shared her experience, the moment of attributing understanding to an AI assistant isn't necessarily a failure of rational thinking; it's a pattern-matching brain doing exactly what it evolved to do. The real failure would be building AI systems that exploit this tendency rather than accounting for it responsibly.
The meaningful question here isn't whether we can teach humans and AIs to truly unlearn; it's whether we can design systems robust enough to function safely with minds that, at their core, remember everything and forget nothing completely. Perhaps that's not a limitation to overcome, but a design constraint to embrace on our way to better understanding of both our machines and ourselves.
About Kate O'Neill
Kate O'Neill is the "Tech Humanist" — a strategic thinker who shapes how organizations understand and implement human-centered technology. Her analysis influences Fortune 100 leaders, municipal executives, and healthcare innovators navigating technological transformation while maintaining focus on human dignity and meaningful outcomes.
As author of "What Matters Next: A Leader's Guide to Making Human-Friendly Tech Decisions in a World That's Moving Too Fast" and host of The Tech Humanist Show, Kate translates complex technological and strategic concepts into frameworks that help leaders build future-ready organizations. Her strategic foresights and implementation insights reach decision-makers across industries who are rethinking how technology should serve human potential.
Kate speaks internationally on strategic technology implementation, publishes influential analysis on emerging technological forces, and works with leaders who recognize that the most sustainable competitive advantages come from aligning technological capability with human flourishing — what she calls "Strategic Optimism" for shaping better futures.
Connect: KO Insights | LinkedIn | The Tech Humanist Show