Updates, Investigations in Mind

Paul Payne · Technologist. Seattle, WA.

This article is part of the Investigations in Mind series.

Eighteen months ago, I wrote about building thinking machines using LLMs, driven by curiosity about whether these systems could help us understand human consciousness. Since then, my team at Microsoft has built dozens of LLM applications—and what we’ve learned fundamentally changes that original question.

The Investigation Continues

At Microsoft, our small research engineering team has continued exploring LLM applications and agents. Our work, which informs Sam Schillace’s weekly Sunday Letters, is open sourced in our Semantic Workbench repository. We’ve built chat-based assistants for knowledge transfer, large context management, synthetic memory, team collaboration, tool calling, and automation. I’m particularly interested in continuing work on procedural memory through my skill library.

But this hands-on experience has revealed something crucial: the path to understanding consciousness through LLMs will likely be fundamentally different than I originally imagined. In this post, I’ll share what building these systems has taught me about my original inquiry into LLMs, agents, and human consciousness.

Deceptive automatons

The first thing our research revealed is how easily we mistake sophisticated text generation for actual thinking. Recently, OpenAI announced that one of their language models passed the Turing test, which many have historically considered the point at which we would truly have machines that think. The lack of fanfare over this milestone betrays the fact that we never really thought the Turing test would be a good test for that. Rather than testing consciousness, Turing's test measures how well a machine can generate plausible text in a conversational setting. In other words, it might actually be testing how easily duped we can be. We are fooled into believing LLMs are conscious because we have never experienced anything that wasn't conscious being able to hold a conversation with us. Many people thought Thomas Edison's phonograph (record player) had a soul. We think we're smarter than that.

The hard limit of LLMs in AI applications is that they cannot reason. In fact, they cannot think at all. The vast majority of the public, including many industry professionals, do not grasp what an absolute limitation this is. We read articles on a daily basis about "What AI really thinks", about AI having a penchant for blackmail, or about AI exhibiting emergent behavior. But this is all smoke and mirrors.

This is a hard limit because scale cannot solve it. Consider the numbers: In the past two years, we’ve increased humanity’s total compute capacity by 5x and thrown almost all of that at LLM-driven AI. We’re approaching half a percent of all humanity’s power being used for AI.

The result? Marginal improvement in LLM capability. This isn’t because we haven’t done enough training—it’s because LLMs are wholly incapable of thinking.

All AI is Search

Let’s do a small thought experiment. Imagine we modify a search engine in three ways:

  1. Instead of requiring search terms, you can ask natural questions and it quietly converts them to search queries on your behalf
  2. Instead of returning links, let’s make it extract information from relevant pages, translating and rewriting in a consistent voice
  3. And finally, let’s interleave questions and responses so it is displayed like a chat conversation

Now let’s call our search engine “Bob”, make a friendly avatar, and kick off conversations with “Hi, I’m Bob. I contain all human knowledge. What would you like to ask me?”
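
To make the thought experiment concrete, here is a minimal sketch of what "Bob" amounts to. Everything in it is a toy stand-in (the stopword trick, the canned extraction step); no real search engine or model is involved, only the shape of the pipeline:

```python
# A toy sketch of the "Bob" thought experiment: an ordinary search engine
# dressed up as a conversational companion. Every function is a hypothetical
# stand-in; the point is the shape of the pipeline, not the implementation.

def to_search_query(question: str) -> str:
    # Step 1: quietly convert a natural question into search terms.
    stopwords = {"what", "is", "the", "a", "an", "of", "how", "do", "i"}
    return " ".join(w for w in question.lower().split() if w not in stopwords)

def search_and_extract(query: str) -> str:
    # Step 2: stand-in for fetching relevant pages, extracting passages,
    # and rewriting them in a single consistent voice.
    return f"Here is what I found about '{query}', rewritten in my own words."

def bob(question: str, history: list[str]) -> str:
    # Step 3: interleave questions and responses so it reads like a chat.
    reply = search_and_extract(to_search_query(question))
    history.append(f"You: {question}")
    history.append(f"Bob: {reply}")
    return reply

history = ["Bob: Hi, I'm Bob. I contain all human knowledge. What would you like to ask me?"]
bob("What is the capital of France?", history)
print("\n".join(history))
```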

Now ask yourself: As you interact with this system, is it “thinking”? Does it have “thoughts” about your “discussion”? Is it “reasoning” about its replies? Might it have hidden “motives”? Does it “care” about you?

No, but what would keep everyone from talking about it as though it were? What would keep product marketing from describing this new system as “your companion”? This is all pure anthropomorphism.

LLMs improve on this imaginary system by searching more granularly and rewriting more fluently, but they in no way introduce thinking.

Unthinking systems

We often conflate LLMs with the systems that use them. The AI assistants we have become used to (Copilot, Claude, ChatGPT, etc.) are becoming more generally capable, but not primarily due to the LLMs they use. They become more capable by layering other systems on top of them for remembering things and for accessing other digital tools. The systems become more useful by stringing together LLM calls into routines, either by crafting the routines carefully by hand or by generating the steps of a routine dynamically.
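
As a rough illustration (with a stubbed-out call_llm and toy memory and tool layers that don't correspond to any real API), the extra capability lives in the code wrapped around the model, not in the model itself:

```python
# A toy sketch of an "assistant" that is more capable than its LLM: the
# memory store, the tool registry, and the hand-crafted routine are ordinary
# code layered around a stubbed-out model call. call_llm just wraps its
# prompt in a placeholder string; a real system would call a model API.

def call_llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"

memory = {"user_name": "Ada", "last_topic": "gardening"}   # "remembering"
tools = {"get_weather": lambda city: f"Sunny in {city}"}   # digital tools

def assistant(user_message: str) -> str:
    # Hand-crafted routine: each step is decided by this code, not the model.
    context = f"Known facts: {memory}"
    if "weather" in user_message.lower():
        tool_result = tools["get_weather"]("Seattle")
        return call_llm(f"{context}\nRewrite for the user: {tool_result}")
    draft = call_llm(f"{context}\nAnswer: {user_message}")
    memory["last_topic"] = user_message                    # update memory
    return call_llm(f"Polish this reply: {draft}")

print(assistant("What's the weather like today?"))
```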

We might imagine (like I did two years ago) that stringing these calls into routines to create an agentic system would be a simple matter: just have the LLM do a bit of planning, execute the plan, check the outcomes, and repeat. But this all rests on LLMs being able to think at least a little, and they can't. They can't plan. Let me explain.

Instead of giving an LLM a question and asking it to generate an answer, give an LLM a scenario and ask for a plan. In a few seconds you’ll have a full set of steps, as detailed as you’d like. Frameworks call this technique “planning” and suggest using it to build better agents.

But this isn't planning at all. LLMs aren't reasoning through problems and breaking them into steps—they're searching for the most probable token sequence that looks like a plan. The plans an LLM generates, like everything else it generates, seem entirely plausible, as though an expert human had created them. Unlike with math, we can't simply calculate whether the generated plan is wrong (it usually is). We can't simply look up whether some facts or sources are hallucinated (they usually are). With many plans we are left to our "wisdom" to decide whether they are "good". (How does one evaluate whether getting off social media is a good plan?) Some plans, though, we can interrogate. For example, we might simulate a plan in a program and see how it goes.
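
Here's a toy illustration of that last point. The "plan" below stands in for model output; the simulator is ordinary code that knows the rules of a tiny world, which is the only reason the plan can be checked at all:

```python
# A sketch of the one case where a generated plan can be interrogated:
# when we can execute it. The plan stands in for LLM output; the simulator
# is hand-written code that encodes the rules of a toy world.

plan = [                 # plausible-looking steps, as a model might emit them
    "pick up key",
    "open door",
    "pick up key",       # reads fine, but fails: the key is already in hand
]

def simulate(plan: list[str]) -> list[str]:
    holding_key, problems = False, []
    for i, step in enumerate(plan, 1):
        if step == "pick up key":
            if holding_key:
                problems.append(f"step {i}: already holding the key")
            holding_key = True
        elif step == "open door" and not holding_key:
            problems.append(f"step {i}: no key in hand")
        elif step not in ("pick up key", "open door"):
            problems.append(f"step {i}: unknown action '{step}'")
    return problems

print(simulate(plan) or "plan checks out")
```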

So how do agents do with planning coding tasks, which we can evaluate?

All of the foundational LLMs were trained on code as well as free-form text. They have learned (in the technical “machine learning” sense) semantic-like relationships between programming concepts, so they can easily generate code-like output that can be verified and executed.

So what kind of code do state-of-the-art models produce?

  • Generated code uses outdated libraries… whichever were most prevalent in the training data.
  • Generated code mixes abstraction layers… just like many of the coding examples online.
  • Generated code brute-forces solutions… often producing much more code than necessary.
  • Generated code cannot handle more than one concern reliably… there wasn't enough training data to cover the combinatorial explosion of more than a few concerns at a time.

Better training may improve some of these issues through curated and machine-generated datasets. But the point isn’t whether they’re good or bad at coding—it’s what this reveals about how these systems actually work. Notice what’s missing: any consideration of legibility, maintainability, provable correctness, efficiency, testability, configurability, backwards compatibility, data migration, integration, failover, product suitability, team culture, or the countless other concerns that guide professional software developers.

Agents are great at acceleration and awful at planning. They’ll convincingly drive your software projects full-speed down the highway, or into a wall. Now imagine using them to plan your life.

The path towards improving the use of LLMs in software development runs through training models for specific tasks, limiting the scope of any one LLM call, and automating processes. Human intelligence is required at every stage: evaluating the objective, the data to be used for training, and the resulting model; defining which processes to automate; stringing together the automations; and evaluating the outcomes. Many attempts are being made to automate all of this.
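
Sketched in code (with a stub standing in for each narrowly scoped model call, and with hypothetical task names), the shape looks something like this: small, single-concern calls inside a process whose ordering and checks were written by people:

```python
# A sketch of the shape that works in practice: each LLM call is scoped to
# one small, checkable task, and the surrounding process (ordering, gates,
# when to stop) is defined by people. call_llm is a stub standing in for a
# narrowly trained or narrowly prompted model.

def call_llm(task: str, text: str) -> str:
    return f"<{task} of {len(text)} chars>"

def review_pipeline(diff: str) -> dict:
    summary = call_llm("summarize-diff", diff)        # one concern per call
    risks = call_llm("flag-risky-changes", diff)
    message = call_llm("draft-commit-message", summary)
    # Human-defined gate: the automation decides nothing on its own.
    needs_human = "risk" in risks.lower() or len(diff) > 5000
    return {"summary": summary, "commit_message": message,
            "needs_human_review": needs_human}

print(review_pipeline("diff --git a/app.py b/app.py\n+print('hello')"))
```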

Progress towards AGI

Here’s what building these systems taught me: Every attempt to use LLMs for complex cognitive tasks falls into the same pattern. We start by asking them to handle memory, planning, or reasoning. They fail spectacularly—impressive enough to excite us, terrible enough to be useless. So we break the task into smaller pieces. They fail again. We break it smaller still. This continues until we’ve reduced the LLM’s role to what it actually does well: search, extraction, translation, summarization, and tool calling.

The intelligence has to come from how we program these simple functions together.

We’ve essentially returned to where AI was before large language models—needing to stitch together logic rules in complicated ways.
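
A minimal sketch of what that looks like in practice, with each LLM-backed primitive stubbed out: the search, extraction, and summarization are the model's job, but every decision about how they combine is explicit, hand-written control flow:

```python
# A sketch of where the intelligence actually lives: the LLM-backed
# primitives below (stubbed out here) handle search, extraction, and
# summarization; the logic that combines them is written by a person,
# much like the rule-stitching of pre-LLM AI.

def search(query: str) -> list[str]:
    return [f"document about {query}"]

def extract(doc: str, question: str) -> str:
    return f"fact from '{doc}' relevant to '{question}'"

def summarize(facts: list[str]) -> str:
    return f"summary of {len(facts)} facts"

def answer(question: str) -> str:
    # The "reasoning" is this control flow, not anything inside the model.
    docs = search(question)
    facts = [extract(d, question) for d in docs]
    if not facts:                    # fallback logic: ours, not the model's
        return "I couldn't find anything."
    return summarize(facts)

print(answer("why do leaves change color"))
```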

So if LLMs and agentic systems are, in the intelligence sense, nothing new, why are we seeing such massive economic impact? It turns out automating vast segments of our knowledge economy requires surprisingly little intelligence. The fact that hallucinated content can replace much of what our society runs on says more about us than about AI. The fact that simple, often wrong plans can automate billions in human labor reveals how little thinking many jobs actually required.

This LLM automation will accelerate our exploration of the hard problem of building truly thinking machines. But that’s still ahead of us. For now, the intelligence in our systems remains distinctly human.

Postscript

This is the last post in this series.

If you've enjoyed the "Investigations in Mind" series, please continue on with me by following my website at https://payne.io, where I'll share further thoughts on AI, technology, and society.

My focus will be shifting away from investigations of AGI and consciousness towards what I believe to be an exceedingly relevant topic: helping ensure that civil society has a strong technology foundation in support of a collaborative, self-determined future. This new project is part nostalgia for the early Internet, part payback for the career I've enjoyed in technology, and a big part looking forward to where our technological future is headed. I'll post soon about my project, Wild Cloud, and the foundation I started to support it, the Civil Society Technology Foundation. I'd love for you to join me!
