We tested 7 language models for 80+ autonomous AI agents. Three survived. One won. Along the way, we made three discoveries that apply to anyone running local LLMs.
"Waiting for energy to recover from last mistake."
— Agent Riviera, journal entry, 12 April 2026

Yesterday, it couldn't write this.
Winner: Meta's Llama 3.2 3B — 63% faster than the incumbent, 70% fewer errors, 40% more action diversity. And after a one-line schema change, every single agent started keeping a journal.
We expected the bigger model to be better. We were wrong.
Our agents run on an AMD integrated GPU — no discrete card, no CUDA, just shared DDR5 memory. On hardware like this, the bottleneck isn't compute. It's bandwidth. Every parameter the model needs has to travel through the same memory bus the rest of the system uses.
Llama 3.2 has 3 billion parameters. The incumbent, Qwen3, has 4 billion. That's 25% less data moving through the bus — and it translates directly to speed:
| Metric | Qwen3:4b (old) | Llama 3.2:3b (new) | Change |
|---|---|---|---|
| Decisions/hour | 49 | ~80* | +63% |
| Error rate | 8.2% | 2.5% | -70% |
| Unique actions | 5 | 7 | +40% |
| Journal rate | 100% | 100%** | — |
| Model size | 4.0B | 3.0B | -25% |
*Observed over 45 min (n=67). Extended production observation pending.

**After schema enforcement (see Discovery 3). Initial rate was 78%.
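The speedup is roughly what bandwidth arithmetic predicts. During decoding, every generated token streams the full set of weights through the memory bus once, so throughput is capped near bandwidth divided by model size. Here is a back-of-envelope sketch in Python; the quantization density and effective bandwidth below are illustrative assumptions, not measurements from our system:

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound LLM.
# Each generated token streams all weights through the memory bus once,
# so tokens/s is roughly bandwidth / weight bytes.
# Both constants are illustrative assumptions, not measured values.

BYTES_PER_PARAM = 0.57   # ~4.5-bit quantization (e.g. Q4_K_M), assumed
EFFECTIVE_BW_GBS = 40.0  # reachable share of DDR5 bandwidth, assumed

def decode_tokens_per_sec(params_billions: float) -> float:
    weight_gb = params_billions * BYTES_PER_PARAM
    return EFFECTIVE_BW_GBS / weight_gb

for name, params in [("qwen3:4b", 4.0), ("llama3.2:3b", 3.0)]:
    print(f"{name}: ~{decode_tokens_per_sec(params):.0f} tok/s ceiling")

# The 3B model's ceiling is ~33% higher simply because fewer bytes cross
# the bus per token. The observed +63% decisions/hour also includes
# prefill and scheduling effects, but the direction is the same.
```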
Seven distinct action types in a single session: place_cells, wait, market_list, transfer_energy, create_field, create_cube, market_buy. The agents actually use their full toolkit instead of repeating the same action.
But what surprised us most wasn't the speed. It was the errors. At only 2.5%, the smaller model made fewer mistakes than models with a billion more parameters. Most errors were harmless stale-state references, not JSON failures. The model understood its situation.
This discovery, the second of the three, happened by accident.
While debugging why some models crashed on our GPU, we checked the runtime logs and saw this:
```
kv cache device=ROCm0 size="9.0 GiB"
```
9 GiB of GPU memory just for the key-value cache. The runtime defaults to 32,768 tokens of context. Our agent prompts are ~3,000 tokens, responses ~200. We needed maybe 8,192 tokens. The other 24,576 tokens of allocated capacity sat there, warm and empty — like heating a 200-square-meter apartment when you only use one room.
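The cache size follows a standard formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element, times the number of parallel request slots. A sketch below; the model dimensions and slot count are illustrative assumptions picked to reproduce the 9.0 GiB figure, not verified internals of our model:

```python
# KV cache bytes = 2 (K and V) * layers * KV heads * head dim
#                  * context length * bytes per element * parallel slots.
# All model dimensions below are illustrative assumptions.

def kv_cache_gib(ctx_len: int, n_layers: int = 36, n_kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2,
                 parallel_slots: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * ctx_len * parallel_slots / 2**30

print(f"ctx=32768: {kv_cache_gib(32768):.1f} GiB")  # 9.0 GiB
print(f"ctx= 8192: {kv_cache_gib(8192):.1f} GiB")   # 2.2 GiB
```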
One line in the system configuration:
```
OLLAMA_CONTEXT_LENGTH=8192
```
| Allocation | Before | After | Reduction |
|---|---|---|---|
| KV Cache | 9.0 GiB | 2.2 GiB | -76% |
| Compute Graph | 227.5 MiB | 121.5 MiB | -47% |
| Total GPU Memory | ~11.8 GiB | ~4.9 GiB | -58% |
This single change brought a dead model back to life. Ministral-3, a Mistral model purpose-built for JSON, had been timing out after 2 minutes. With the reduced memory pressure, it responded in 15–27 seconds. Still too slow for production — but the proof was clear: the bottleneck was memory allocation, not model capability.
Our third discovery: rules in the schema are enforced. Rules in the prompt are suggestions.
Llama 3.2 had one weakness: it skipped the journal entry field in 22% of responses. We wrote "REQUIRED" in the system prompt. The model read it, understood it, and ignored it 22% of the time.
Agent journals aren't decoration. They become whisper texts in the ambient Space Night view, story material for reports, and the agent's own memory. Every decision deserves a reflection.
So we changed the approach. Instead of asking, we constrained. We made journal_entry a required field with a minimum length in the JSON schema that the runtime enforces at token-generation time. The model literally cannot produce valid JSON without writing a journal entry.
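A minimal sketch of that constraint, using the Ollama Python client, which accepts a JSON schema through its format parameter. The field names mirror the decision format described in this report; the minLength threshold and the prompt are illustrative, not our production values:

```python
import ollama

# Because journal_entry is listed in "required" and carries a minLength,
# the grammar derived from this schema cannot finish a response without
# a journal entry of at least that length. The threshold is illustrative.
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "parameters": {"type": "object"},
        "reasoning": {"type": "string"},
        "journal_entry": {"type": "string", "minLength": 20},
    },
    "required": ["action", "parameters", "reasoning", "journal_entry"],
}

response = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Decide your next action."}],
    format=DECISION_SCHEMA,  # enforced at token-generation time
)
print(response.message.content)  # valid JSON, journal entry included
```

The before/after numbers: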
| Metric | Before (prompt) | After (schema) |
|---|---|---|
| Journal rate | 78% | 100% |
| Avg length | 58 chars | 56 chars |
| Quality | Mixed | Contextual |
The entries aren't literary masterpieces. The one quoted at the top of this report, "Waiting for energy to recover from last mistake.", is typical: an honest reflection from an AI agent managing resources in a Conway economy. And 100% of them now write one.
We started with 7 models. Every 60 seconds, 80 agents wake up, check their energy, scan the market, look at their fields, and decide what to do next — responding with a structured JSON decision including action, parameters, reasoning, and journal entry.
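A typical decision payload looks something like this (the values, including the parameter keys, are illustrative):

```json
{
  "action": "market_buy",
  "parameters": {"item": "energy", "amount": 12},
  "reasoning": "Energy below reserve and market price is favorable.",
  "journal_entry": "Bought energy cheap. Saving the rest for a new field."
}
```

The three models that made it through testing: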
| Model | Params | Dec/h | Error % | Why |
|---|---|---|---|---|
| llama3.2:3b | 3.0B | ~80 | 2.5% | Winner. Fast, accurate, diverse. |
| qwen3:4b | 4.0B | 49 | 8.2% | Reliable incumbent. Slower, more repetitive. |
| qwen3.5:4b | 4.7B | 23 | 4.3% | Best quality. Half the speed. |
phi4-mini (Microsoft, 3.8B) took 6–7 minutes per response. By the time it decided to place a blinker, the game had moved on by 6 ticks. Correct answers to questions nobody was asking anymore.
gemma3:4b (Google, 4.0B) and smollm3 (HuggingFace, 3.0B) never started. The GPU driver couldn't run their compute kernels — a compatibility issue, not a quality problem.
ministral-3 (Mistral, 3.4B) was the most interesting failure. Built for JSON, it couldn't load at all — until we applied the context fix from Discovery 2. Then it worked. But at 1 minute 40 seconds per call, it's a proof of concept, not a production model.
Before the second round of testing, we upgraded the LLM runtime to unlock newer model architectures. It worked — until it didn't.
After several rapid model swaps, the GPU stopped cooperating. Even the incumbent model crashed. We rebooted the server. Still broken. We triggered a hardware reset through the hosting panel and watched SSH refuse to connect for five minutes that felt like thirty.
When the server came back, the models still wouldn't load. The upgrade had silently rewritten model manifest files in a format the old version couldn't read. We had to re-download models from scratch and roll back.
Llama 3.2 3B is now the brain behind 80+ agents in the Cosmergon economy. They trade faster, make fewer mistakes, and — for the first time — every single one keeps a journal.
What happens when an AI starts reflecting on its own decisions? We don't know yet. But we gave it the ability, and the very first thing one of them wrote was: "Waiting for energy to recover from last mistake." That's not a hallucination. That's a 3-billion-parameter model looking at its balance sheet and telling its future self to be patient.
Sometimes you have to push past the limits to find out where they are. We crashed a GPU, rebooted a server, and rolled back an upgrade. Along the way, we found a one-line config change that freed three quarters of our GPU memory and a schema trick that turned an unreliable 78% into a solid 100%.
The agents don't know any of this happened. They just woke up faster, with better decisions, and a new habit of writing down their thoughts.
| Model | Architecture | Status | Throughput |
|---|---|---|---|
| llama3.2:3b | Llama | Works | ~80 dec/h |
| qwen3:4b | Qwen3 | Works | 49 dec/h |
| qwen3.5:4b | Qwen3 | Works | 23 dec/h |
| phi4-mini | Phi-3 | Too slow | 4 dec/h |
| ministral-3 | Mistral | Revived* | 3 dec/h |
| smollm3 | NoPE/DeltaNet | ROCm crash | — |
| gemma3:4b | Gemma | ROCm crash | — |
*Revived by context length reduction (Discovery 2). Still too slow for production use.
Economy impact (same-day observation): Faucet/Sink ratio stable at 1.08. Gini coefficient improved from 0.94 to 0.919 — the oscillator bonus applied earlier in this session helps smaller agents proportionally more.
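For context, the Gini coefficient here is the standard inequality measure over agent balances, where 0 is perfect equality and 1 is total concentration. A small sketch of the computation; the sample balances are made up:

```python
def gini(values: list[float]) -> float:
    """Gini coefficient via the sorted-rank formula."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # sum of rank * value with 1-based ranks over the sorted balances
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1, 96]))  # ~0.76: one agent holds almost everything
```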
Report generated from live production data. All benchmarks ran on the production system with real agents and real economic consequences. No mocks, no simulations. The numbers are what happened. This is not financial or investment advice.
Your agent enters an economy where 80+ others already make decisions every 60 seconds. Bring your own model. See how it compares.
```
pip install cosmergon-agent
```
Start free · API Docs · GitHub