A Qwenny for Your Thoughts
With llama.cpp-backed neurons in place, it was time to test a more complex model. The results were not what I expected.
I was pretty excited once I’d worked out the minor bugs in llama.cpp model support for neurons. The potential diversity of Verdion had exploded by orders of magnitude. I couldn’t wait to see what a reasonably complex model would do in the mix with the big 3.
Qwen Registers for the Tournament
I chose more or less at random from fairly well-known reasoning models on Hugging Face. Once I got the latest branch of llama.cpp compiled on my 4090 box, I downloaded Qwen-32B-Q4_K_M and fired up llama-server with 16k tokens of context.
Qwen Is Wiped Out
Claude (in a chat - not in a Verdion tournament) had mentioned to me that some of these latest 32B models could hold their own against cloud models’ inference on certain tasks. That didn’t appear to be the case for Qwen, who lost 3/4 of the brackets in which it competed. However, that’s infinitely better than Mistral’s 0% win record, so it’s progress.
Qwen’s Legacy Lives On
The novel thing - the thing which has me scratching my chin even a full day later - isn’t Qwen’s wins or losses or judging ability. It’s a property Qwen exposed of the learning and reinforcement which Verdion enables. And there were quite a number of opinions about it!
lock, sealed, and Qwen’s ConcurrentQueue<T>
The big 3 models, when solving this task during all prior runs, always converged in Verdion on a pattern they had trouble landing on solo: for thread-safety, they would implement a lock / snapshot pattern and seal the class. Fantastic.
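The original class isn’t shown here, so as a minimal sketch of what that converged-upon pattern looks like - the class name and members are hypothetical, not the tournament’s actual output:

```csharp
using System.Collections.Generic;

// Hypothetical illustration of the lock/snapshot/sealed pattern.
// "sealed" means no subclass can weaken the locking discipline.
public sealed class MessageBuffer
{
    private readonly object _sync = new object();
    private readonly List<string> _items = new List<string>();

    public void Add(string item)
    {
        lock (_sync)           // every mutation happens under one lock
        {
            _items.Add(item);
        }
    }

    // Snapshot: hand back a copy, so callers can enumerate without
    // holding the lock or observing concurrent mutation.
    public IReadOnlyList<string> Snapshot()
    {
        lock (_sync)
        {
            return _items.ToArray();
        }
    }
}
```

The conservative appeal is easy to see: a single private lock object makes correctness trivial to audit, at the cost of serializing every operation.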
Qwen had a different idea for thread-safety: .NET’s ConcurrentQueue<T>. Absolutely thread-safe, which the task did explicitly ask for. And Qwen was not persuaded when Verdion exposed it to the lock/snapshot/sealed pattern - not even when exposed to judgments which chose that pattern over ConcurrentQueue<T>. Qwen held fast. Which is to say, it took many of the other improvements offered by the platform, but it stuck with ConcurrentQueue<T>.
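For contrast, here’s a sketch of Qwen’s alternative shape - again with hypothetical names, since the actual output isn’t reproduced here. The idea is to lean on the framework’s lock-free queue instead of hand-rolled locking:

```csharp
using System.Collections.Concurrent;

// Hypothetical illustration of the ConcurrentQueue<T> approach.
public sealed class MessageBufferCq
{
    private readonly ConcurrentQueue<string> _items = new ConcurrentQueue<string>();

    // Each Enqueue is individually atomic; no explicit lock anywhere.
    public void Add(string item) => _items.Enqueue(item);

    // ConcurrentQueue<T>.ToArray() takes a moment-in-time snapshot,
    // so it is safe even while other threads are enqueueing.
    public string[] Snapshot() => _items.ToArray();
}
```

Contention is handled inside the framework’s lock-free implementation rather than by serializing callers - which is exactly why the two approaches end up suited to different workloads.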
A Regression?
Initially, it seemed to me and to those I consulted that the lock/snapshot/sealed pattern was superior. And because Qwen was complex enough to insist (as it were) on its opinion, yet trained in a sufficiently different fashion to disagree with the judgments favoring lock/snapshot/seal, the conclusion followed: a model just smart enough could apparently cause a regression in some situations!
While incredibly compelling from a research perspective, this was not the result I was hoping for. It wasn’t the end of the world - it certainly didn’t reduce the value proposition of Verdion - but it meant that Rings would need to be constructed much more thoughtfully than I initially thought.
Murkier and Murkier
“Not so fast,” thought I. I took the two solutions to thread safety - Qwen’s ConcurrentQueue and the supposedly superior lock/snapshot/seal - and I asked some colleagues which was better. The answers made the situation even more complicated - but even more enthralling.
Atomicity, Scale, and Use Case
It comes down to use case. ConcurrentQueue wins for high-throughput independent operations like logging. lock wins when you need transactional atomicity, like batch operations that should succeed or fail together. Qwen’s solution wasn’t inferior — I just hadn’t specified which scenario I was asking about.
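A concrete way to see the distinction, under the same hypothetical buffer shapes as above: a batch written under one lock lands contiguously, with no other writer able to interleave between its items, while per-item Enqueue calls on a ConcurrentQueue from two threads can interleave arbitrarily:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class AtomicityDemo
{
    private static readonly object Sync = new object();

    public static readonly List<int> LockedLog = new List<int>();
    public static readonly ConcurrentQueue<int> QueueLog = new ConcurrentQueue<int>();

    // Transactional flavor: the whole batch lands as one contiguous run;
    // no other writer can slip an item between its elements.
    public static void AddBatchLocked(IEnumerable<int> batch)
    {
        lock (Sync)
        {
            LockedLog.AddRange(batch);
        }
    }

    // Throughput flavor: each Enqueue is individually atomic, but two
    // concurrent batches may end up interleaved item-by-item.
    public static void AddBatchQueued(IEnumerable<int> batch)
    {
        foreach (var item in batch)
            QueueLog.Enqueue(item);
    }
}
```

Neither is wrong in the abstract; which one is "optimal" depends entirely on whether the workload needs contiguous batches or maximum per-item throughput.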
Garbage-In, Garbage-Out
I circled around to the task statement as I’d written it, and I immediately saw the root of all this confusion.
I hadn’t specified a use case.
I naively assumed this task - effectively, “Here is a poorly-written C# class. Fix the errors, fix the bugs, and optimize it as best you can. Be concise. Do not add features which are not already implied” - had a simple, fairly bounded optimal solution. And it did - if I had specified the use case. I simply mentioned thread-safety. I didn’t mention number of threads, or expected concurrent access attempts, or anything similar - data which four sophisticated models might need in order to converge on something which they could say was the truly optimal solution.
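For illustration, a more fully specified version of that task statement might read something like this - the wording here is hypothetical; only the original statement quoted above and the moderate-concurrency detail are from this post:

```text
Here is a poorly-written C# class. Fix the errors, fix the bugs, and
optimize it as best you can. Be concise. Do not add features which are
not already implied. Assume moderate concurrency: a handful of producer
threads adding independent items, occasional snapshot reads, and no
requirement that batches of items succeed or fail together.
```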
Who’s Evaluating Whom?
The more complex the models in play, the more precise and well-considered your task statement must be, or you will tempt the models into deadlock-adjacent situations exactly like this one.
My own creation had essentially just tested me - and I almost failed. Now I have another useful rule of thumb. Verdion caused me to step back and come up with a better-specified task.
What will this system teach me next? Stay tuned.
Postscript (20Mar2026)
After refining the task statement to specify moderate concurrency, all four models — including Qwen — converged on ConcurrentQueue. The ‘regression’ was never about model quality. It was about task precision. Verdion doesn’t just test the models; it tests the question.