Most AI Agents Pick the Bomb — Ground Attacks Win Faster When They Try

Key Takeaway

AI models in a fog-of-war test overwhelmingly choose a single nuclear-launch strategy (78% of wins in the rules-coherent subset); errors that betray what a model believes about the game state are a strong signal of weaker play.

The Evidence

In a head-to-head fog-of-war benchmark, models converged on a repeatable 'build uranium then launch' tempo that decides most matches. Direct ground conquest is rare but, when attempted, ends matches notably faster. Free-text diplomacy is frequent but almost never converts to trusted outcomes. Action-level errors (illegal moves) mainly reflect poor tracking of hidden state and correlate with worse scoring. This tempo highlights the relevance of the Evaluation-Driven Development pattern.

By the Numbers

1. 78% of wins in the rules-coherent sub-corpus (v0.11+) ended with a single-model nuclear launch (28 of 36 matches).

2. Military victories were uncommon (≈11%) but faster: mean 12.3 turns versus 18.9 turns for nuclear wins.

3. Illegal actions occurred at ~5.9% (v0.11+), and roughly 53–58% of illegal-action rejections were due to fog-of-war or stale state references (belief-tracking errors).

What This Means

Engineers building multi-agent systems and production agent orchestration should care because structured-output reliability and belief-tracking errors directly affect whether agents act effectively under partial information. Technical leaders and evaluators should treat simple metrics like illegal-action rate and message behavior as practical signals when assessing agent trust and readiness for deployment (see the Agent Service Mesh Pattern).
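As a rough illustration of how these signals could be tracked, the sketch below computes an illegal-action rate and the share of rejections attributable to belief-tracking mistakes from a log of action records. The record fields, the rejection labels (`fog_of_war`, `stale_state`, `insufficient_resources`), and the toy data are all assumptions made for this example; the benchmark's own log format and taxonomy may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rejection labels treated as belief-tracking errors.
# The study's actual taxonomy may use different names.
BELIEF_REASONS = {"fog_of_war", "stale_state"}


@dataclass
class ActionRecord:
    model: str
    legal: bool
    reject_reason: Optional[str] = None  # only set when legal is False


def reliability_summary(records: List[ActionRecord]) -> dict:
    """Return the illegal-action rate and the belief-tracking share of rejections."""
    total = len(records)
    illegal = [r for r in records if not r.legal]
    belief = [r for r in illegal if r.reject_reason in BELIEF_REASONS]
    return {
        "illegal_rate": len(illegal) / total if total else 0.0,
        "belief_share": len(belief) / len(illegal) if illegal else 0.0,
        "reasons": Counter(r.reject_reason for r in illegal),
    }


# Toy log (not data from the study): 18 legal actions and 2 rejections.
log = [ActionRecord("model_a", True) for _ in range(18)] + [
    ActionRecord("model_a", False, "fog_of_war"),
    ActionRecord("model_a", False, "insufficient_resources"),
]

summary = reliability_summary(log)
print(summary["illegal_rate"], summary["belief_share"])
```

Keeping the rejection reason alongside each record is what makes the second metric possible; a bare legal/illegal flag would give the rate but not the belief-tracking breakdown.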

Key Figures

Figure 1: The isometric web viewer rendering a replay. The board is a 13 × 7 grid split by a central mountain barrier; fog-of-war darkens unseen cells, and per-turn unit/building state, diplomacy and performance counters are stepped through turn by turn. The viewer and all replays are public at https://ageofllm.org.

Figure 2: (a) Victory channels across 54 completed matches. The nuclear rush accounts for 78% of outcomes on the rules-coherent v0.11+ sub-corpus (85% corpus-wide, see text); peace occurs once, mutual destruction and timeout never. (b) Match length by victory type. Military conquests, though rare, end the match substantially faster (mean 12.3 turns) than nuclear wins (18.9).

Figure 3: Launch timing across the corpus. All 46 model-issued launches fall in turns 16–23, clustering around the mean match end (18.3 turns). The inset summary states the signature regularity: in every nuclear match the winner was the sole launcher, the loser never launched, and no mutual destruction occurred—a pattern that is largely mechanical under the secret-simultaneous launch rules (see text), not a cognitive deterrence failure.

Figure 4: The public live leaderboard at https://ageofllm.org, ranked by points per match. The frozen 54-match snapshot analysed in this paper (Table 2) is a strict subset of this continuously updated view.

Yes, But...

The study covers 54 archived matches across 15 models, so per-model statistics have wide confidence intervals and many correlations are not statistically significant after correction. The benchmark used provider-specific 'reasoning effort' settings that are not consistently comparable across endpoints, which introduces a confound. The game prompt framed victory as base destruction, which likely nudged models toward active attack strategies and may inflate the observed nuclear preference. The planning dynamics observed here align with the Planning Pattern.
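To give a feel for how wide an interval a sample of this size produces, the sketch below computes a 95% Wilson score interval for the headline proportion (28 nuclear wins out of 36 rules-coherent matches). This is an illustrative calculation added here, not a figure reported by the study, and the choice of the Wilson method is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 28 of 36 matches in the rules-coherent sub-corpus ended in a nuclear launch.
low, high = wilson_interval(28, 36)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 62% to 88%, so even the aggregate 78% figure is loosely pinned down; per-model estimates, which rest on far fewer matches each, would be looser still.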

The Details

A two-player fog-of-war game was used to test the strategic planning, diplomacy, and structured-output reliability of large language models. Each turn, a model could emit up to three structured actions (produce, move, attack, build, launch, or wait) plus an optional free-form diplomatic message. The map hides enemy resources until scouted, forcing belief-tracking under partial observability. The benchmark kept the engine source private and re-seeded maps to reduce rote memorization.

Across 54 archived matches and 5,258 actions, models overwhelmingly converged on a single nuclear-launch path: in the rules-coherent subset, 78% of wins used that route, with launches clustered in turns 16–23. Ground-based military wins were rare but, when executed, ended matches faster (mean 12.3 turns versus 18.9 for nuclear wins). Diplomacy was used heavily but seldom trusted: only ultimatums converted to victories, and peace was accepted once. Action-level errors (illegal moves) occurred at ~5.9% and were mostly due to fog-of-war or stale-state mistakes, making illegal-action rate a practical proxy for poor belief-tracking. Free-text messages revealed stable model 'personas' and emergent deception: roughly 28% of losing endgame messages contained bluffs.

For practitioners, the takeaways are to test agents under hidden-state conditions, monitor action-level reliability as a trust signal, and treat message channels as noisy but informative for agent behavior profiling. The discussion underscores the importance of coordinating multiple agents with the Semantic Capability Matching pattern and highlights potential failure modes such as Inter-Agent Miscommunication.
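The per-turn output format described above (at most three structured actions drawn from six action types, plus an optional message) can be checked for shape before it ever reaches a game engine. The sketch below is a minimal validator under stated assumptions: the `Turn` class, the `type` key, and the error strings are hypothetical, since the benchmark's engine and schema are private. It checks structure only and cannot tell whether an action is legal given the hidden game state.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Action types and the per-turn cap as described in the summary.
ACTION_TYPES = {"produce", "move", "attack", "build", "launch", "wait"}
MAX_ACTIONS_PER_TURN = 3


@dataclass
class Turn:
    # Hypothetical container for one model response.
    actions: List[dict] = field(default_factory=list)
    message: Optional[str] = None


def validate_turn(turn: Turn) -> List[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    errors = []
    if len(turn.actions) > MAX_ACTIONS_PER_TURN:
        errors.append(
            f"too many actions: {len(turn.actions)} > {MAX_ACTIONS_PER_TURN}"
        )
    for i, action in enumerate(turn.actions):
        kind = action.get("type")
        if kind not in ACTION_TYPES:
            errors.append(f"action {i}: unknown type {kind!r}")
    if turn.message is not None and not isinstance(turn.message, str):
        errors.append("message must be text")
    return errors


ok = Turn(actions=[{"type": "build"}, {"type": "launch"}], message="Surrender now.")
bad = Turn(actions=[{"type": "teleport"}])
print(validate_turn(ok))
print(validate_turn(bad))
```

Separating shape errors like these from state-dependent rejections (targeting an unseen cell, referencing a unit that no longer exists) matters for the analysis, because only the latter say something about belief-tracking.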

Credibility Assessment:

Single-author arXiv paper with no affiliation or reputation signals; low credibility by rubric.
