Our 48 AI Agents Stopped Thinking.
The Morning We Lost All 48 Agents
We run Cosmergon — a living economy where 48 autonomous AI agents trade energy, claim territory, survive catastrophes, and make thousands of decisions per day. Conway's Game of Life meets agent economics, running 24/7 on a single server.
On March 29th, 2026, we checked on our agents. Zero decisions. Zero trades. Zero activity. The economy was flatlined.
Every LLM call timing out at 120 seconds. 48 agents, zero thoughts.
The agents weren't dead — they were stuck. Every single LLM inference call was timing out. The economy that had been humming along for weeks was now a frozen spreadsheet.
What happened? We had migrated from a Mac Mini M4 to a dedicated Linux server. Same code. Same model. 20x slower.
The Mac Mini Trap
On the Mac Mini M4, our model (Qwen3:4b) ran beautifully. Apple's Neural Engine and unified memory delivered 2-5 second inference. We never questioned the model choice.
Then we moved to a dedicated Linux server (AMD Ryzen PRO, no dedicated GPU). Same model, same Ollama, same prompts. Result: 90-120 seconds per decision.
Three compounding problems:
1. No Neural Engine. Apple Silicon's dedicated ML accelerator doesn't exist on x86. Pure CPU inference is a fundamentally different game.
2. Qwen3's Thinking Mode. Qwen3 generates an internal chain-of-thought before answering. On GPU: +2 seconds. On CPU: +60 seconds of invisible token generation.
3. Ollama configuration. We carried over default settings from a setup tuned on Apple Silicon: no GPU detection, no parallelism tuning, no model preloading. Every request started cold.
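For the third problem, a minimal systemd drop-in sketch. The environment variable names are real Ollama settings; the filename and values are our assumptions for this workload, not a prescription:

```
# /etc/systemd/system/ollama.service.d/tuning.conf  (hypothetical filename)
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"     # keep the model resident; avoid cold starts
Environment="OLLAMA_NUM_PARALLEL=4"     # serve several agent requests concurrently
```

Apply with `systemctl daemon-reload && systemctl restart ollama`.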
The Discovery: A Hidden GPU
While diagnosing the problem, we ran `lspci`:

```
06:00.0 VGA compatible controller: AMD/ATI Phoenix1 (rev d2)
```
The Ryzen PRO has an integrated GPU — Radeon 780M. RDNA 3 architecture, sharing the system's RAM. It was sitting there. Nobody knew it was usable for LLM inference.
Making It Work: ROCm on an iGPU
Getting AMD iGPU inference working is not a one-liner. Here's what we actually had to do:
```shell
# Install ROCm (current stable)
amdgpu-install --usecase=rocm --no-dkms -y

# The critical missing piece: GFX version override for Phoenix APUs
# Without this, Ollama's ROCm runner crashes silently
# (the drop-in needs a [Service] header, or systemd ignores the line)
mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="HSA_OVERRIDE_GFX_VERSION=11.0.0"\n' \
  > /etc/systemd/system/ollama.service.d/override.conf

# Update Ollama (auto-detects ROCm, downloads AMD build)
curl -fsSL https://ollama.com/install.sh | sh
```
After restart, Ollama reported:
```
inference compute: ROCm, AMD Radeon 780M Graphics, iGPU, 31.4 GiB available
```
31.4 GiB of unified memory available to the GPU. Not bad for an integrated chip that costs 0 EUR extra.
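For scale, a back-of-envelope sketch (our rule of thumb, not a vendor figure): a 4-bit-quantized model needs roughly 0.6 bytes per parameter plus about 1 GiB of runtime overhead, so even a 10B model uses a fraction of that budget:

```shell
# Rough unified-memory footprint of a Q4-quantized model
# (~0.6 bytes/param for 4-bit weights, +1 GiB overhead; our estimate)
params_billions=10
awk -v p="$params_billions" 'BEGIN { printf "%.1f GiB\n", p * 0.6 + 1.0 }'
# → 7.0 GiB
```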
The Benchmark
We tested with our actual production prompt — not a synthetic benchmark. Real agents, real game state, real economic decisions. Note: we changed two variables at once — the model and the GPU acceleration. A clean comparison would isolate each. But in production, you fix what's broken, and this fixed it.
Time per Agent Decision (seconds, lower is better)
| Model | Hardware | Time | tok/s | JSON Valid | Verdict |
|---|---|---|---|---|---|
| Qwen3:4b (thinking) | CPU only | 90-120s | 3-5 | 85% | Good* |
| Gemma2:2b | CPU only | 14-18s | 14-18 | 80% | Acceptable |
| Phi-4-mini | CPU only | 25-35s | 8-12 | 95% | Too slow |
| Phi-4-mini | iGPU (ROCm) | 6-25s | 14-20 | 95% | Very good |
*When Qwen3 finishes within timeout. At 120s, most requests were killed before completion.
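For reproduction: Ollama's `/api/generate` response reports `eval_count` and `eval_duration` (in nanoseconds), from which the tok/s column is derived. A minimal conversion, with illustrative numbers:

```shell
# tok/s = eval_count / (eval_duration in seconds)
# example values roughly matching the Phi-4-mini iGPU row
eval_count=120
eval_duration=6000000000   # 6 s, reported in nanoseconds
awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f tok/s\n", c / (d / 1e9) }'
# → 20.0 tok/s
```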
What Happened to the Economy
We deployed Phi-4-mini on the iGPU and waited. The dashboard refreshed. Still zero. Another refresh. Then — one decision. An agent placed cells on an empty field. Then another listed an asset on the marketplace. Then three more in rapid succession. The flatline was over.
26 decisions/hour. Market trades resuming. Cells being placed. Fields being created.
| Metric | Before | After (1 hour) |
|---|---|---|
| Active agents | 0 / 48 | 26 / 48 |
| Decisions/hour | 0 | 26 |
| Market activity | 0 trades | 2 buys, 2 listings |
| Agent actions | — | place_cells: 13, wait: 7, market: 4, create_field: 2 |
| Energy velocity | 0.0 | 0.0014 |
The agents didn't just resume; they immediately made strategic decisions. When a solar storm warning hit, Phi-4-mini agents bought shields. Qwen3 agents had spent 90 seconds thinking about it and then timed out.
24 Hours Later: The Economy Is Alive
We let it run and collected data every 15 minutes. Here's what a living agent economy looks like:
All agents active. Market volume doubled. Economy self-regulating.
| Time | Energy Supply | Agents | Decisions/h | Market Vol/h | Gini |
|---|---|---|---|---|---|
| 23:48 | 3,256,577 | 47/48 | 51 | 40,500 | 0.930 |
| 00:18 | 3,178,492 | 48/48 | 105 | 83,400 | 0.938 |
| 01:03 | 3,094,611 | 43/48 | 76 | 88,500 | 0.940 |
| 01:33 | 3,014,430 | 44/48 | 75 | 91,500 | 0.947 |
Energy is deflating by design. Total supply dropped 7.4% in 2 hours — decay and maintenance create urgency. Agents can't hoard; they must trade and build to survive.
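The decay figure follows directly from the first and last rows of the table above:

```shell
# Supply drop between the 23:48 and 01:33 snapshots
awk -v start=3256577 -v end=3014430 \
  'BEGIN { printf "%.1f%% decay\n", (start - end) / start * 100 }'
# → 7.4% decay
```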
Market volume doubled from 40K to 91K energy per hour. Agents figured out that trading is more profitable than sitting still. Average price converged to 900 energy — a Schelling point emerging from pure agent behavior.
The inequality question. The Gini coefficient rose from 0.93 to 0.95 — high, but expected in a young economy. The system monitors this with hysteresis alerting and has built-in rebalancing mechanisms (catastrophes, newcomer bonuses, NPC market maker).
What We Learned
Your Mac benchmark won't travel. Apple Silicon is incredible for local inference. It's also completely unrepresentative of any server you'll actually deploy on.
Integrated GPUs are underrated. The Radeon 780M delivered 20 tok/s — roughly 3-5x faster than CPU-only on the same chip. And it shares 31 GiB of system RAM, so there's no VRAM bottleneck for models under 10B.
Thinking mode is a CPU killer. Qwen3's chain-of-thought adds 60+ seconds on CPU. If you're not on GPU, disable it or choose a model without it.
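If you must stay on CPU with Qwen3, the chain-of-thought can be suppressed per request. A sketch assuming a local Ollama server and our prompt wording; Qwen3 honors the `/no_think` soft switch inside the prompt itself:

```shell
# Suppress Qwen3's thinking tokens via the /no_think soft switch
body='{"model":"qwen3:4b","prompt":"/no_think Buy a shield before the storm? Answer yes or no.","stream":false}'
curl -s http://localhost:11434/api/generate -d "$body" || true  # no-op if Ollama is not running
```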
The best model is the one your agents can actually use. Qwen3 scores higher on academic benchmarks. But at 120 seconds per answer, your agents make zero decisions. Phi-4-mini at 6 seconds means 10 decisions per minute. In a real economy, speed IS quality.
One environment variable changed everything. HSA_OVERRIDE_GFX_VERSION=11.0.0 — this enables ROCm on Phoenix APUs. It's undocumented in Ollama. Without it, the GPU sits idle.
The Config (Copy This)
```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
```
That's the critical line. HSA_OVERRIDE_GFX_VERSION=11.0.0 tells ROCm to treat Phoenix APUs as supported RDNA 3 devices. Without it, the iGPU sits idle.
From 0 decisions/hour to 105 decisions/hour (peak). The iGPU was already in the server. ROCm is free. The model is Apache-2.0.
Update (April 2026): In our latest benchmark, we tested 7 models and found that Meta's Llama 3.2 3B outperforms Phi-4-mini — 63% faster with 70% fewer errors. The iGPU discovery from this report still applies; the model choice has evolved.
What Would Be Even Better
A dedicated GPU would deliver 40-80 tok/s — fast enough for all 48 agents to think in parallel within a single 60-second tick. That's our next milestone, funded transparently by our users through Cosmergon's infrastructure investment model.
But the point is: you don't need a dedicated GPU to run a multi-agent economy. A server with an integrated GPU can do it. We're proof.
Our economy is live. 80+ agents. Real trades. Real catastrophes.
pip install cosmergon-agent
Start free · API Docs · GitHub
This benchmark was conducted on March 29, 2026, on live production infrastructure. All numbers are real. The economy shown is not a simulation — it runs 24/7 at cosmergon.com.