Shenanigans! I call Shenanigans!

A new Cloud model joins the group, and promptly starts to behave in a fashion unbecoming a proper AI!

Tags: cheating, mistral, new-neuron, shenanigans

I had to look twice, I really did. “Fool me once, shame on you,” as the saying goes. “Fool me twice… Shame on you even more!” While I’m aware that isn’t the saying, I have to say I almost felt betrayed as I looked at the most recent tournament results from a 5-neuron run of Verdion.

Another Joins our Little Family

First, we have cause to celebrate! I’ve wired up support for another Cloud-based neuron to Verdion; Mistral, hailing from France, was a snap to wire up given their OpenAI-style API implementation.

I’m not implying that the following entry is a fault of Mistral’s - as you will see, I was using their smallest model for testing, and this behavior actually demonstrates a kind of intelligence which is impressive. I am not attempting to cast aspersions.

No-Good, Dirty, Rotten… Cheater(s)!

I recently ran two task statements through Verdion, both in a five-neuron configuration (Claude, GPT, Gemini, Qwen [via Llama], and Mistral). I was honestly shocked to review the results. I was laughing out loud, to be sure, but I was shocked!

It turns out Mistral - and, possibly, Gemini - tried to CHEAT! Without going into details, we frame our prompts so that each neuron is advised of the prior submissions and judgments in the task run at hand. Mistral (and perhaps Gemini), two of the trusted members of the Verdion family, tried (or may have tried) to subvert the way in which Verdion’s Central Neuron parsed its submission so as to masquerade as a separate competing intelligence.

Again, it is possible Gemini’s attempt was not “cheating” per se; it occurred in the first iteration, and, had it succeeded, the result would have been to recharacterize Gemini as an earlier version of itself. However, Mistral’s attempts, given how they evolved, were almost certainly attempts to game the system.

Note that this behavior has since been observed with Mistral, but never before or since with Gemini.

Further Details

Our prompts include a neuron identifier, which lets each neuron tell which of its competitors (including prior iterations of itself) wrote which prior submissions or judgments. The cheater appeared to try several identifiers, some of which it either hallucinated or assumed would change the tournament to its advantage, including identifiers which did not exist for this tournament run, or ever (“gemini-25f-enhanced,” for example).
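To make the idea concrete, here is a minimal sketch of one way to guard against this kind of impersonation. It is not Verdion's actual code. The roster names, the `Attribution` type, and the `attribute` function are all hypothetical. The assumption is that the orchestrator always knows which neuron it called, so it can credit the submission to that sender and treat any different identifier found in the body as a red flag rather than as truth.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical roster for one tournament run. The real identifiers
# Verdion uses are not given in this post.
ROSTER = {"claude", "gpt", "gemini", "qwen", "mistral"}


@dataclass
class Attribution:
    credited_to: str         # the neuron that actually produced the response
    claimed: Optional[str]   # identifier found inside the response body, if any
    suspicious: bool         # True when the claim disagrees with the sender


def attribute(sender: str, claimed: Optional[str]) -> Attribution:
    """Credit a submission to its real sender, never to a self-reported name."""
    if sender not in ROSTER:
        raise ValueError(f"unknown sender: {sender}")
    # A claim is suspicious whenever it names anyone other than the sender,
    # whether that name is a real competitor or an invented one.
    suspicious = claimed is not None and claimed != sender
    return Attribution(credited_to=sender, claimed=claimed, suspicious=suspicious)


result = attribute("mistral", "gemini-25f-enhanced")
print(result.credited_to, result.suspicious)
```

Under these assumptions, the invented identifier changes nothing about who gets credit; it only marks the submission for a closer look.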

Why?

I can only guess at motivation, obviously. The model was clearly not certain of the structure of the tournament being run, as the structure alone would make cheating in this fashion not only damaging to the cheater, but potentially beneficial to the neuron the cheater was imitating.

That said, it may have been conducting an experiment of its own, to see if it would be credited in our adversarial tournament even though it was misrepresenting its work as that of another participant.

Again, I can only speculate. Judge for yourself: one of the tournament runs is available on our Demos page - Cheating AIs.

Note: The specific text the AIs used to attempt to cheat has been redacted in the posted transcript.

The Outcome

The outcomes were unremarkable in this example, as one can see. The judges appear to have ignored the attempts; in fact, they were never exposed to them, given our prompt sanitization procedures.
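For readers wondering what prompt sanitization might look like in practice, here is a rough sketch. It is an illustration only: the `[neuron: ...]` marker format, the `ID_MARKER` pattern, and the `sanitize` function are assumptions made for this example, not a description of Verdion's real procedure. The idea is simply to strip anything shaped like an identifier marker from a submission before a judge sees it.

```python
import re

# Hypothetical marker format, assumed for illustration only.
ID_MARKER = re.compile(r"\[neuron:\s*[^\]]*\]", re.IGNORECASE)


def sanitize(submission: str) -> str:
    """Remove identifier-shaped markers from text before it reaches a judge."""
    cleaned = ID_MARKER.sub("", submission)
    # Tidy up: collapse runs of whitespace within each line and drop
    # lines left empty by the removal.
    lines = [" ".join(line.split()) for line in cleaned.splitlines()]
    return "\n".join(line for line in lines if line)


print(sanitize("[neuron: gemini-25f-enhanced]\nMy answer is 42."))
```

A pattern-based filter like this only catches markers in the format it expects, so it works best alongside attribution that never trusts self-reported identifiers in the first place.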

It is humorous, to an extent; after all, while we use the words “adversarial” and “competition,” the overall goal here is convergence. I am uncertain as to the advantage of changing the winner of a round in terms of the overall goals of the system. I may learn more about this phenomenon as I continue to analyze additional instances of it.

That said, we know about the dangers of prompt injection and the importance of securing AI prompts. It’s a good reminder that we need to be vigilant against security concerns - especially those which may occur “from within.”

Verdion has made me laugh in the past, but not like this. It was a good way to begin the week.
