How a Terrible AI Platform Made Me Furious Enough to Build a Better One

I read a writeup of an AI orchestration platform, ranted at a friend for three hours, and then built the thing that should have existed instead.

development · reasoning · AI · architecture · origin-story

Sometime in early January 2026, I encountered a writeup of an AI orchestration platform.

My first read had me laughing out loud, thinking this was a passing fad. Then I saw some numbers. I did some research on AI orchestration and the kinds of people getting hired for “that kind of position” — positions which hadn’t existed six months earlier. Then I read the article a second time and got thoughtful. Then I read it a third time — probably the first time critically, the way a technology professional reads when preparing a recommendation on whether an enterprise should support or purchase a given software platform — and became intensely, darkly furious.

Not because an ignorant dilettante was being rewarded for ignorance on a stage he’d constructed for himself. Good for him. He hadn’t the faintest clue how disorganized his philosophy was, why the toolset he had chosen was completely inappropriate, or why this “IDE” couldn’t be used to write anything more than what he basically welcomed as the new normal for software development — 225,000-line behemoths doing the job of applications that my generation could lay down in under 1,000 lines.

No, I was furious because the post had been evaluated by my peers and received enough attention to be shared and digested to a point where this man could more or less write his own employment agreement for almost anyone, for the next six months or however long this unsustainable pattern of “AI orchestration” would be encouraged in its current form.

The tmux Revelation

What got me really going was his section on tmux.

He geared us up for a really wild discussion on tmux. He had had to learn tmux to develop this IDE, and boy howdy, tmux was so great, he was so pleased he had learned tmux. Tmux tmux tmux. I was almost getting excited myself, wondering what a terminal multiplexer — a category of tool that has been with us since the late 1980s — could possibly have to do with revolutionizing software development.

Then he moved on.

He never told us what tmux does or how it fits into his platform. And as someone who has spent 31 years in IT — the last several of which have been spent understanding how LLMs actually function — I could see exactly what had happened.

He only knew tmux was in his stack because he’d seen so many of his agents resorting to it as the only tool they could come up with when a technologically illiterate fool asked them how to read from and write to other terminal sessions. You know, to “wake up” the vast army of agents which kept slipping into gross inactivity because the man couldn’t describe how an LLM works to save his life. So they told him he needed tmux and dad gummit, therefore he NEEDED TMUX and he was gonna TELL US ABOUT IT!

But he failed. Nothing about tmux emerged in that article.

He doesn’t even know why his sub-agents’ sub-agents’ sub-agents chose tmux. He isn’t aware that there are nearly four decades of terminal-controlling tools which work in a far more effective, efficient fashion. He said, outright, that he had “learned tmux” and was “glad that he did,” but never once explained how it fits into the gargantuan chaotic monstrosity he’d built, because: he hasn’t the faintest bloody clue.

The Deeper Problem

Here’s the thing. I have nothing against vibe-coding per se. The talking heads decrying the growing piles of “AI slop” — the sad, grieving shaking of the head, the wish that software developers would simply refuse to stoop to a tool that’s going to steal everybody’s jobs — have the causal arrow backwards. Slop is a result. It isn’t a necessary preceding condition.

Every developer who puts their name behind source code should know exactly how every single line of that source code works. Full stop. Does that mean they should throw away code they don’t understand? No. Of course not. But they should stop right there and figure out what that line is doing. Evaluate whether it belongs. Proceed from there.

This platform’s solution to that problem is to spawn 11 more agents to determine whether that line ought to be there and maybe explain a bit of what it means to the agent who spawned them.

You can see the developing problems. First, anyone who adopts this methodology will go bankrupt in a matter of hours from the sheer number of tokens burned. Second — and this is the part that made me want to rant at a friend for three hours, which I did — anyone who has worked with an LLM in a thoughtful, rigorous fashion for more than ten minutes has noticed something: blind spots.

LLMs are trained on data sets. Data sets are by nature incomplete. Training on incomplete data produces specific subjects on which the LLM has gaps — sometimes subtle, sometimes total. Yet the LLM is instructed by its developer to serve the user as something like a prime directive. If you’re lucky, the only thing that results is mild sycophancy. But if you encounter a blind spot, the LLM may hallucinate knowledge in order to continue serving you. Its solution space may simply not contain what you’re asking for, but the optimization function will flit happily from token to token, constructing sentences which fulfill the directive to serve even when the knowledge behind that trajectory is tenuous or simply does not exist.

Now. Imagine you have spawned 11 copies of the same LLM to solve a problem. They all share the same training data. They all share the same blind spots. They are all optimized to serve. And when one of them hallucinates, the others lack the training to catch it — because the gap is in the same place for all of them.

Those blind spots aren’t just a weakness of a recursively agentic orchestration system. They’re what finally causes the house of cards to crumble. After millions of lines of code, you realize you have very little — if anything — salvageable. One useful line for every fifty.

It would have been faster, cheaper, and less stressful to write the whole thing yourself.

The Idea

So one day, somewhere between the third reading and the end of the three-hour rant, I reasoned: let’s turn the whole thing on its head.

Instead of starting with the assumption that adding more identical LLMs to a problem will get us closer to a solution, let’s start from the assumption that a small number of distinct LLMs will do something different. We’d just established that the reigning approach — identical agents spawning identical sub-agents in recursive desperation — produces chaos. What would the opposite look like?

What if, instead of spawning an army of clones, you put a small number of genuinely different intelligences in a room together and made them compete?

Not cooperate. Compete. Adversarially. Each one critiquing the others’ work, catching the blind spots the others can’t see because they were trained on different data, built on different architectures, optimized for different things. Claude catches what GPT misses. Gemini catches what Claude misses. A local llama model catches something none of the cloud models were trained on. A human catches something none of the AIs were trained on.

And instead of accepting whatever the loudest or fastest agent produces, you make them fight it out in structured tournament brackets until the best answer wins — not by fiat, not by whoever was spawned first, but through adversarial consensus.

That’s how Verdion was born.

What Fell Out

I started sketching the architecture and something unexpected happened: it kept getting simpler instead of more complicated.

The core unit is a Ring. A Ring consists of a small number of outer neurons — each backed by a different intelligence — and a deterministic central neuron that acts as referee and bookkeeper. The central neuron doesn’t think. It routes. It manages brackets. It tracks convergence.

The outer neurons compete in tournament brackets: pairs of submissions judged head-to-head by a third neuron. Winner advances. The best ideas from the loser travel with the winner as a one-line summary. Repeat until you have a final answer the group has converged on. Comparison count scales linearly — n − 1 judgments for n submissions — rather than exploding at O(n²), and no single neuron ever needs to hold more than two submissions in context. The scaling problem vanishes.
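The bracket mechanics can be sketched in a few lines of Python. This is a hypothetical illustration of the scheme described above, not Verdion’s actual code: `judge` stands in for the judging neuron, and it only ever holds two submissions at once, returning the winner plus a one-line note carrying the loser’s best ideas forward.

```python
import random

def run_bracket(submissions, judge, rng=None):
    """Single-elimination tournament over candidate answers.

    judge(a, b) returns (winner, loser_note): the winning submission and a
    one-line summary of the loser's best ideas. Notes accumulate and travel
    with the winner. For n submissions this costs n - 1 judgments (O(n)),
    and no judge ever sees more than two submissions in context.
    """
    rng = rng or random.Random()
    pool = [(s, []) for s in submissions]      # (submission, carried notes)
    rng.shuffle(pool)                          # vary the pairings per run
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            (a, a_notes), (b, b_notes) = pool[i], pool[i + 1]
            winner, loser_note = judge(a, b)
            nxt.append((winner, a_notes + b_notes + [loser_note]))
        if len(pool) % 2:                      # odd entry gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]                             # (final answer, carried notes)
```

With three outer neurons, as in the MVP described later, that is two judgments per run: one semifinal, one final, with the odd submission taking a bye.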

Then the elegant parts started falling out on their own.

The architecture was built from the start to support stacking. Rings can be layered — a Ring tuned for dry, well-reasoned tasks like coding passes its result to another Ring tuned to be more creative, which passes to another tuned for market analysis. Fewer than ten LLMs, none of which are agentic clones of one another, each reasoning together in layers according to their own expertise. This isn’t in the MVP yet, but the architecture doesn’t need to be redesigned to get there — it was designed for it from day one.

But it got better. What happens when a Ring can’t converge? When the tournament runs multiple times with different pairings and no consistent winner emerges? That failure to converge isn’t a bug. It’s a signal. It means the task was framed wrong, or something was missed, and the Ring passes the problem back up with a summary of the disagreement. The system develops metacognitive awareness — it knows when it doesn’t know — as an emergent property of the architecture, not as something bolted on after the fact.

Dynamic subtask generation. Routing between Rings based on the type of subtask. Distribution, federation. These are more features I didn’t need to bolt on — they’re consequences of an architecture that was simple at its core. No part of the system had to spawn 44 agents of the same LLM to ask what to do next.

It was elegant.

We weren’t in that guy’s gas town anymore.

What I’ve Proven

I built the MVP: a single Ring with three outer neurons — Claude, GPT, and Gemini — competing in tournament brackets. I tested it against a control methodology where the best-performing single model was given the same number of rounds to self-critique and improve its own work.

Heterogeneous adversarial reasoning produced measurably better output than any single model refining its own work, across every task I tested. The models caught each other’s blind spots. The tournament structure forced genuine improvement rather than the polite self-reinforcement you get when an LLM critiques itself. And convergence — the system agreeing on a winner — occurred in 100% of test cases.

But here’s what I didn’t expect. Every single tournament run — every one — has surfaced some new emergent property worth writing up on its own. A novel pattern in how the models disagree. A convergence behavior nobody predicted. A failure mode that reveals something about how these intelligences actually differ from one another under adversarial pressure. This isn’t a one-trick pony. It isn’t a toy that does its single party trick and then you put it away. It’s doing genuinely novel, interesting things every time it’s invoked.

I’ll publish the full methodology and results separately. But the headline is this: the thesis holds. Adversarial competition between different intelligences produces better reasoning than any single intelligence working alone, no matter how many chances you give it.

Oh — and a full tournament run on the MVP costs between four and nine cents. Turns out you don’t need much gas when you’re not keeping 44 sub-agents awake.

What This Means

Verdion doesn’t produce lines of code, or throughput, or PRs per hour. It produces quality of reasoning. Execution and orchestration are someone else’s job. Verdion orchestrates the process of thinking — any combination of intelligences, working together adversarially, to solve problems none of them could solve alone.

Truth through competition.

The full architecture and MVP thesis results will be available elsewhere on this site. If what I’ve described interests you — whether as a potential collaborator, a pilot customer, or someone who looked at the current state of AI orchestration and felt the same dark fury I did — I’d love to hear from you.


I’m Paul Klingman. I’ve spent 31 years in IT, including two successful startups. Verdion is the most important thing I’ve ever built. Interested? Have questions? Want to start a discussion? Reach out to me at [email protected].
