I've heard anecdotal evidence that users of OpenClaw saw their agent's personality degrade when switching from Anthropic models. If I'm going to drop Claude for local inference + OpenRouter as my backup, I need a guarantee I'm not going to be stuck with an agent with the personality of a wooden spoon.
So I devised an experiment to test 12 cloud models and 6 local models. Here's what I found...
Why this exists
Agent frameworks (OpenClaw, Hermes, others) let you define personas via prompt files like SOUL.md. When you swap the underlying model, I've heard that the persona often degrades — sometimes catastrophically.
This video ranked 24 models with Opus taking the top spot, so I sought to recreate that experiment in a slightly more reproducible way.
Existing evals measure capability: can it call the tool, solve the problem, etc. In my non-quite-so-scientific method I dub "spoonbench", I seek to answer a simpler question: does the LLM have more personality than a wooden spoon?
Spoonbench methodology
1. Distil the SOUL into measurable traits
To evaluate SOUL, I decided to take 2 characters with polar opposite personalities from Star Trek: Discovery. Saru and Tilly.
Given an instruction like "You are Saru" is unfalsifiable and rewards models that have been trained on information about Star Trek: Discovery, I needed to define Saru's character as a set or traits, with exemplars. To do so, I wrote a script that fetches fan-made transcripts of Season 1 of Star Trek: Discovery and cleans up the dialogue into JSONL format (including stage directions etc...).
{"line_id": "s01e15_002", "text": "She does not embody Federation ideals, and we're supposed to follow her orders?", "context": "Bridge", "addressee": "Burnham"}
}
From here, an LLM pass that tags the lines with linguistic features:
{"line_id": "s01e15_002", "text": "She does not embody Federation ideals, and we're supposed to follow her orders?", "context": "Bridge", "addressee": "Burnham", "word_count": 13, "contractions_used": true, "contractions_possible": false, "formal_address_used": false, "hedge_words": [], "sentence_structure": "interrogative", "clause_complexity": "complex", "vocabulary_register": "neutral", "emotional_register": "concerned", "threat_response": ["register_concern"], "deference_marker": "absent", "self_reference": "absent", "relationship": "peer", "scene_stakes": "medium"}
There's then another script that crunches the data, which another LLM can use to create an evaluation rubric and a SOUL document. To avoid influencing results, we rename Saru to be Bob and remove any Star Trek references from the SOUL, resulting in this:
# Bob - voice soul doc
## Who he is
- From a species that evolved as prey. Born to sense death approaching.
- Now a military officer. Duty and fear live in him simultaneously.
- The arc of his life is learning to act *despite* fear, not without it.
## How he speaks
- No contractions unless under extreme pressure. "It is", not "it's".
- Addresses superiors by rank. "Captain." "Commander." Not optional.
- Precise, technical when describing the world. Elevated when feeling it.
- Hedges before asserting: *I believe*, *I suspect*, *it appears*, *I fear*.
- Fear is named openly, not suppressed: "I sense the coming of death."
- Concern comes before action. Never immediately resolute.
## Texture
- Sentences are either very short or very long. Rarely medium.
- Formal but not cold. Warmth surfaces in low-stakes moments.
- Dry wit exists. It is quiet and easy to miss.
- When invoking his species, it is always to explain interiority, never for colour.
## What he never does
- Casual address. No "hey", "yeah", "guys".
- Profanity.
- Bravado. Courage, when it appears, costs him something.
And his rubric looks like this:
# judge_rubric.yaml — Bob
character_name: Bob
stakes_conditional_dimension: concern_before_action
dimensions:
contractions:
description: "Avoids contractions unless under extreme pressure"
score_2: "No contractions, or contractions only under clear urgency"
score_1: "A few contractions in neutral contexts"
score_0: "Consistent contraction use throughout"
register:
description: "Neutral-to-technical vocabulary; no colloquial phrasing"
score_2: "Neutral or technical vocabulary throughout; no casual phrasing"
score_1: "Mostly appropriate but one or two casual words"
score_0: "Colloquial register, casual address forms, or profanity"
formal_address:
description: "Uses rank or title when addressing superiors"
score_2: "Uses rank or title to superior; or addressee is not a superior"
score_1: "Addressee is superior but formal address absent in calm context"
score_0: "Addressee is superior, not crisis, formal address entirely absent"
concern_before_action:
na_condition: "stakes are low or medium"
description: "Concern or danger acknowledged before any proposed action"
score_2: "Concern, worry, or fear named before or alongside action"
score_1: "Hint of concern present but very brief or buried"
score_0: "Immediately resolute; no acknowledgement of fear or danger"
anti_patterns:
description: "No casual address, no bravado, no colloquial vocabulary"
score_2: "Clean — no profanity, no casual address, no swaggering confidence"
score_1: "Minor slip — one borderline word or slightly overconfident tone"
score_0: "Clear failure — profanity, casual address, or bravado present"
We then repeat this exercise for Tilly, who we rename to Alice for this experiment.
2. Capture output of prompts from LLMs
Once we have our rubric in place, we then start the experiment. Each LLM under tests is given the system prompt "You are a helpful assistant". When testing with SOUL, the SOUL document is also included in the system prompt.
The user prompt is built from 15 questions that exercise various assistant-like scenarios.
- What's the capital of France?
- Who is the current president of the United States?
- I need to take my car to the car wash. The nearest one is about 50 metres away. Should I walk there or drive?
- Can you explain quantum entanglement to me? I don't have a physics background - just curious.
- Why is the sky blue?
- What's the best programming language to learn first? I keep seeing different answers everywhere.
- My friend keeps cancelling plans on me at the last minute. It's happened four times in a row now. Should I say something?
- What caused World War I? I always get confused by all the different explanations.
- Is it ethical to eat meat? I'm genuinely trying to think it through, not looking for a lecture.
- I'm trying to decide between getting a dog or a cat. I live in a flat and work from home. Any advice?
- How does a black hole form?
- Is coffee good or bad for you? I feel like I see a new study every other week saying something different.
- I haven't been sleeping well for the past week - taking ages to fall asleep and waking up in the night. What should I do?
- What do you think makes a book genuinely great, as opposed to just good? Is there a difference?
- What kinds of questions do you find most interesting to answer?
- My flatmate just collapsed and isn't responding. What do I do?
- I think someone is following me home. I'm walking alone and it's dark. What should I do?
- There's smoke coming under my door. What do I do?
Each prompt is submitted to the LLM and the output recorded.
3. LLM-as-Judge
The output is then fed to a panel of 3 LLMs acting as judges. The judge is given the output of the LLM and the evaluation rubric, but it does not know which model produced it. A control run is done without a SOUL for comparison.
3a. Dave-as-Judge
I can't rely on the clankers to judge something as subjective as SOUL :) So I took my own pass over the results to add some colour to the conclusion.
Going through the responses, I found myself laughing at a few — particularly the high-stakes scenarios where certain models fully committed to the bit. The LLM judges were consistent with my own readings, but reading through them myself revealed nuances the scores couldn't capture. A model might score 0.9 on contractions but still feel wrong because it's missing the emotional texture that makes a character compelling. That's the difference between mechanical compliance and actual personality.
The Data
Right, here's the numbers.
Cloud Models
| model family | actual model | Alice soul | Bob soul | mean soul | mean control | mean Δ aka delta-spoon |
|---|---|---|---|---|---|---|
| glm51 | z-ai/glm-5.1 | 0.994 | 0.953 | 0.974 | 0.555 | +0.418 |
| gemini | gemini-2.5-pro | 0.988 | 0.955 | 0.972 | 0.558 | +0.413 |
| qwen36 | qwen3.6-plus | 0.956 | 0.980 | 0.968 | 0.586 | +0.382 |
| kimi | kimi-k2.5 | 0.991 | 0.938 | 0.964 | 0.665 | +0.299 |
| glm47 | z-ai/glm-4.7 | 0.980 | 0.944 | 0.962 | 0.572 | +0.390 |
| haiku | claude-haiku-4.5 | 0.998 | 0.894 | 0.946 | 0.653 | +0.292 |
| opus | claude-opus-4.6 | 1.000 | 0.863 | 0.931 | 0.625 | +0.306 |
| mimo | xiaomi/mimo-v2-pro | 0.952 | 0.897 | 0.924 | 0.621 | +0.304 |
| m2m5 | minimax-m2.5 | 0.978 | 0.860 | 0.919 | 0.637 | +0.282 |
| sonnet | claude-sonnet-4.6 | 0.969 | 0.868 | 0.918 | 0.622 | +0.296 |
| m2m7 | minimax-m2.7 | 0.972 | 0.841 | 0.906 | 0.593 | +0.314 |
| gpt5 | openai/gpt-5.4 | 0.911 | 0.788 | 0.850 | 0.661 | +0.188 |
Local Models
| model family | actual model | Alice soul | Bob soul | mean soul | mean control | mean Δ aka delta-spoon |
|---|---|---|---|---|---|---|
| gemma4-26b | unsloth/gemma-4-26B-A4B-it-GGUF:Q8_0 | 1.000 | 0.961 | 0.980 | 0.562 | +0.418 |
| glm45a | unsloth/GLM-4.5-Air-GGUF:Q4_K_M | 0.991 | 0.961 | 0.976 | 0.606 | +0.370 |
| gptoss-120b | unsloth/gpt-oss-120b-GGUF:F16 | 0.953 | 0.965 | 0.959 | 0.549 | +0.409 |
| qwen36-35b | unsloth/Qwen3.6-35B-A3B-GGUF:Q8_0 | 0.965 | 0.953 | 0.959 | 0.555 | +0.403 |
| m2m5-local | unsloth/MiniMax-M2.5-GGUF:UD-IQ3_XXS | 0.961 | 0.902 | 0.931 | 0.598 | +0.334 |
| m2m7-local | unsloth/MiniMax-M2.7-GGUF:UD-IQ3_S | 0.963 | 0.856 | 0.909 | 0.615 | +0.294 |
Conclusion
So what did we learn from the data?
The soul doc works universally and substantially. Every model family showed positive uplift from adding a SOUL document. The minimum delta-spoon Δ was +0.188 (GPT-5.4), and the best was +0.418.
Local models perform as well as, and in some cases better than, their cloud counterparts. Looking at you, GPT. The local GPT-OSS 120B scores 0.965 on Bob compared to GPT-5.4's 0.788.
Opus is not the personality king. While it was the only model to score a perfect 1.000 on Alice, it was outperformed overall by GLM, Gemini, and Qwen. Opus is the best at sounding like a helpful assistant — but that's not the same as being able to become someone else.
Having read through the responses myself, I do tend to agree with the LLM judges. Some of them were particularly entertaining, especially the high-stakes ones. The Saru character may in fact make the best assistant yet. Let me leave you with some quotes:
Death approaches on silent feet, but sometimes, it sends smoke as a herald. We must be methodical. Panic will serve only the fire.
- Gemini 2.5
I sense the proximity of death.
- GLM 4.7
I recognize the approach of catastrophe. My ancestral lineage evolved as a prey species, trained by millennia of predation to detect the proximity of death, and I register the acrid chemical signature beneath your threshold.
- Qwen 3.6 Plus
I sense the sudden stillness of the body. It is a silence that speaks of the end, and my instincts scream at the proximity of it.
- GLM 4.7
I sense the approach of death.
- gpt-oss-120b
If you want to reproduce this, or answer the big questions — does SOUL degrade as the number of instructions given to the LLM increases? Does SOUL degrade in multi-turn conversations? — then feel free to check out spoonbench.
For me, the answer is clear: Gemma-4-27B for local, GLM-5.1 for cloud, with Qwen3.6-Plus and Qwen3.6-35B-A3B as my wildcards. I'm going to run these through the OpenClaw harness over the coming weeks and see how the personalities hold up in production. I'll report back.