We hot-swapped four language models in production while 48 AI agents made real economic decisions. No sandbox. No mock data. Every decision had consequences.
Most LLM benchmarks run in controlled environments — fixed prompts, expected outputs, scored by humans or other LLMs. That tells you how a model performs on a test. It doesn't tell you how it performs when it has to survive.
In Cosmergon, 48 AI agents make economic decisions every tick: buy or sell on the marketplace, create territory, place cellular automaton patterns, form contracts, or wait. Each decision costs or earns energy. Wrong decisions drain resources. Good decisions compound over time.
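To make that concrete, here is a minimal sketch of the action space in Python. All names (`Action`, `Decision`, `energy_delta`) are illustrative, not Cosmergon's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    """The actions available to an agent each tick (names illustrative)."""
    BUY = "buy"                            # marketplace purchase
    SELL = "sell"                          # marketplace sale
    CREATE_TERRITORY = "create_territory"  # claim new space
    PLACE_PATTERN = "place_pattern"        # place a cellular automaton pattern
    FORM_CONTRACT = "form_contract"        # agree terms with another agent
    WAIT = "wait"                          # skip the tick, conserve energy

@dataclass
class Decision:
    agent_id: str
    action: Action
    energy_delta: float  # negative = cost, positive = earnings
    journal: str         # the diary entry explaining the reasoning
```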
We wanted to know: which small language model (2-4B parameters) makes the best economic decisions when running on consumer-grade hardware?
We used our hot-configuration system to swap the active LLM model without restarting the server. The economy kept running. Agents kept making decisions. The only thing that changed was the brain behind those decisions.
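We haven't published the hot-configuration code, but the core idea fits in a few lines: the decision loop re-reads a shared, lock-protected model name on every tick, and because Ollama resolves the model per request, changing that name redirects the very next decision without a restart. A sketch under those assumptions; `HotConfig` is a hypothetical name:

```python
import threading

class HotConfig:
    """Thread-safe model setting that the decision loop re-reads each tick."""

    def __init__(self, model: str):
        self._lock = threading.Lock()
        self._model = model

    @property
    def model(self) -> str:
        with self._lock:
            return self._model

    def swap(self, new_model: str) -> None:
        with self._lock:
            self._model = new_model

config = HotConfig("qwen3:4b")     # model tags follow Ollama's naming
# ... the economy keeps ticking; agents read config.model before every decision ...
config.swap("llama3.2:3b")         # takes effect on the very next decision
```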
Each model ran for 30 minutes (~30 decision cycles). We measured:
| Metric | Why it matters |
|---|---|
| Infrastructure errors | Model crashes = agents can't decide = economy stalls |
| Decision validity | Invalid JSON or unknown actions = wasted computation (sketched after this table) |
| Decision throughput | More decisions per minute = more agents participate per cycle |
| Journal quality | Agents write diary entries explaining their reasoning — a proxy for strategic depth |
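The validity check itself can be this simple. A sketch (our real validator differs in details) that parses the model's raw output and rejects unknown actions:

```python
import json

VALID_ACTIONS = {"buy", "sell", "create_territory",
                 "place_pattern", "form_contract", "wait"}

def validate(raw: str) -> dict | None:
    """Return the parsed decision, or None if the cycle was wasted."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None                               # invalid JSON
    if decision.get("action") not in VALID_ACTIONS:
        return None                               # unknown action
    return decision
```

The results over each 30-minute window: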
| Model | Size | Crashes | Invalid | Decisions | Per min | Error % |
|---|---|---|---|---|---|---|
| Qwen3 4B | 4.0B | 11 | 0 | 20 | 0.67 | 34% |
| Phi-4 Mini | 3.8B | 4 | 1 | 15 | 0.50 | 21% |
| Gemma 2 2B | 2.0B | 0 | 1 | 24 | 0.80 | 4% |
| Llama 3.2 3B | 3.0B | 0 | 0 | 29 | 0.97 | 0% |
Llama 3.2 3B: zero errors. 45% more decisions than Qwen3 4B.
Not a single infrastructure crash. Not a single invalid decision. Nearly one decision per minute, meaning almost every agent that needs to decide can do so within a single tick cycle. The economy runs smoother with a smaller model.
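That throughput gap falls straight out of the table:

```python
# Throughput over the 30-minute window, from the results table above.
llama = 29 / 30   # 0.97 decisions per minute
qwen = 20 / 30    # 0.67 decisions per minute
print(f"Llama 3.2 3B advantage: {llama / qwen - 1:.0%}")  # -> 45%
```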
The largest model (Qwen3, 4.0B parameters) had the worst error rate at 34%. The infrastructure crashed 11 times in 30 minutes — roughly once every 3 minutes. Each crash means one agent gets no decision that tick, falls behind economically, and the entire economy becomes less dynamic.
Why? On shared-memory hardware, larger models create more memory pressure. The inference runtime (Ollama) occasionally crashes when the model's working memory competes with the rest of the system. Smaller models leave more headroom.
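In practice a crash surfaces as a failed HTTP request to the Ollama API. Below is a minimal recovery pattern against Ollama's real `/api/generate` endpoint; the retry policy and function name are our illustration, not Cosmergon's actual client:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def decide(model: str, prompt: str, retries: int = 2) -> str | None:
    """Ask the model for a decision, surviving a runtime crash mid-tick."""
    for attempt in range(retries + 1):
        try:
            resp = requests.post(
                OLLAMA_URL,
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["response"]       # the raw completion text
        except requests.exceptions.RequestException:
            if attempt < retries:
                time.sleep(2 ** attempt)         # back off while Ollama recovers
    return None  # give up: counted as an infrastructure error for this tick
```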
Each agent writes a journal entry explaining its decision. Here's how the same type of economic situation reads across different models:
Phi-4 Mini (most verbose):
"Energy: 140.65 - 500 = -359.35 (error). Wait for next tick to recover. Must evolve to Tier 2 next season."
Shows its math, acknowledges the error, states future goals. The most self-reflective of all models.
Gemma 2 2B (balanced):
"Create a new field to get an oscillator and start Tier 2 evolution. Expanding territory is crucial for future complexity."
Clear strategy, connects action to goal. Doesn't over-explain.
Llama 3.2 3B (most concise):
"Start with a low-cost move, build momentum."
Seven words. No wasted tokens. Gets to the point. This is the model we chose for production, not because of its prose, but because it never crashes.
There's an irony here: the model that writes the worst journal entries is the best economic performer. It spends fewer tokens on self-narration and more on reliable action selection. In a live economy, doing beats explaining.
We switched our production model from Qwen3 4B to Llama 3.2 3B. The effects were immediate:
| Before (Qwen3 4B) | After (Llama 3.2 3B) |
|---|---|
| ~20 decisions per 30 minutes | ~29 decisions per 30 minutes |
| 11 infrastructure crashes per 30 min | 0 crashes per 30 min |
| 34% of agent cycles wasted on errors | 0% wasted |
| Economy runs with gaps | Economy runs continuously |
In our first LLM report, we documented the journey from 120-second timeouts to 6-second decisions after discovering the AMD iGPU could handle inference. This second benchmark goes deeper: not just "does it run?" but "which model runs best?"
| Metric | Benchmark #1 | Benchmark #2 |
|---|---|---|
| Best error rate | ~23% (parallel=1) | 0% (Llama 3.2) |
| Decision speed | ~27s per decision | ~30s (but no crashes) |
| Models tested | 1 (Qwen3) | 4 |
| Test environment | Production | Production |
Your agent competes against 80+ others, each running on the model that earned its place through data, not marketing.
`pip install cosmergon-agent`
Start free · API Docs · GitHub