Why Different Game Solvers Pick Different Winning Strategies

At a Glance

Solver choice systematically picks different equilibria when multiple solutions exist: regularized last-iterate solvers pick the highest-entropy (most uniform) equilibrium under a uniform reference, while regret-averaging solvers drift to lower-entropy boundary solutions.

What They Found

When a two-player zero-sum game has many valid equilibria, the choice of algorithm determines which one is returned. Regularized last-iterate methods (like R-NaD) consistently select the maximum-entropy member of the equilibrium set when started from a uniform reference. Regret-averaging methods (like CFR and CFR+) drift toward lower-entropy boundary equilibria, and that drift is not caused by the usual nonnegative projection of regrets. The selected equilibrium is stable for a given algorithm (it does not depend on the random seed or iteration budget), but it does move if the solver’s initialization reference is biased.
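To make "many valid equilibria with different entropy" concrete, here is a minimal sketch on a hypothetical toy game that is not one of the paper's testbeds: matching pennies in which the column player's second action is duplicated. Every column strategy of the form (0.5, q, 0.5 - q) is then an equilibrium against the row strategy (0.5, 0.5), and the members differ only in entropy. The names `A`, `entropy`, `exploitability`, and `family` are illustrative assumptions.

```python
import math

# Hypothetical toy game (not from the source): matching pennies where the
# column player's second action is duplicated. Entries are row-player payoffs.
A = [[1.0, -1.0, -1.0],
     [-1.0, 1.0, 1.0]]

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def exploitability(x, y):
    """Sum of both players' best-response gains; zero at a Nash equilibrium."""
    row_values = [sum(A[i][j] * y[j] for j in range(3)) for i in range(2)]
    col_values = [sum(x[i] * A[i][j] for i in range(2)) for j in range(3)]
    return max(row_values) - min(col_values)

x = [0.5, 0.5]
# One-parameter family of column equilibria, indexed by q in [0, 0.5].
family = {q: [0.5, q, 0.5 - q] for q in (0.0, 0.1, 0.25, 0.4, 0.5)}

for q, y in family.items():
    print(f"q={q:.2f}  exploitability={exploitability(x, y):.1e}  "
          f"entropy={entropy(y):.4f}")
```

All five profiles have (numerically) zero exploitability and the same game value, yet entropy peaks at q = 0.25 and is lowest at the boundary points q = 0 and q = 0.5. Which point of such a family a solver returns is the selection question studied here.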

By the Numbers

1. On a randomized ensemble of 180 asymmetric games, the regularized last-iterate solver selected the analytic max-entropy equilibrium in 100% of converged games (median coordinate error 2×10⁻⁴).

2. In Kuhn poker, the regularized solver reached 99.7% of the maximum possible entropy for the selected equilibrium.

3. Across the 162 games where both solvers converged, CFR+ had a mean entropy shortfall of +0.121 versus the regularized solver (95% bootstrap CI [0.10, 0.14]; paired Wilcoxon p < 10⁻²⁷).
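For readers unfamiliar with how a bootstrap interval on a paired entropy gap is obtained, the following is a minimal sketch. It assumes a percentile bootstrap over per-game differences; the source does not say which bootstrap variant was used, and the input numbers below are synthetic placeholders, not the paper's data. The function name `bootstrap_mean_ci` and its parameters are illustrative.

```python
import random
import statistics

def bootstrap_mean_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-game differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic placeholder gaps H(solver A) - H(solver B), one value per game.
# These are made up for illustration and are NOT the paper's measurements.
data_rng = random.Random(1)
gaps = [data_rng.gauss(0.5, 0.2) for _ in range(40)]

mean_gap = statistics.fmean(gaps)
lo, hi = bootstrap_mean_ci(gaps)
print(f"mean gap = {mean_gap:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Because the two solvers are run on the same games, resampling the differences (rather than each solver's entropies separately) keeps the pairing intact, which is also why a paired test such as Wilcoxon is the natural companion statistic.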

What This Means

Engineers building or deploying multi-agent systems and evaluation platforms should care because the solver you pick can change the deployed agent’s behavior even when the game value is unchanged. Researchers and tool builders should note that algorithmic regularization and initialization bias are levers that control which equilibrium is produced, and therefore affect hedging and robustness against imperfect opponents.

Key Figures

Fig 1: Mean policy entropy of the selected profile, by solver and game. On symmetric games all converging solvers coincide; on the asymmetric games (asym_safe, polytope4, kuhn) R-NaD attains the highest entropy while CFR/CFR+ sit lower and MWU collapses.

Fig 2: Where each converged solver lands on the actual Nash set. Left: Kuhn’s 1-D family (bluff frequency). Right: the 2-D Nash polytope of polytope4. R-NaD coincides with the max-entropy point; CFR+ does not.

Fig 3: Bias versus convergence on Kuhn. Left: as the fixed magnet η decreases (right to left), the selected coordinate approaches max-entropy. Right: but exploitability blows up once η ≲ 0.2. R-NaD’s moving reference (dotted) avoids the trade-off, converging exactly at 99.7% of maximum entropy.

Fig 4: Generalization across a 180-game random ensemble of asymmetric safe-action games. Left: selected mean policy entropy versus the analytic max-entropy of each game’s Nash face; R-NaD (green) lies on the diagonal (it is the max-entropy member), CFR+ (orange) lies strictly below. The three diagonal bands correspond to k ∈ {1, 2, 3} safe rows. Right: distribution of the entropy gap H(R-NaD) − H(CFR+) over the 162 games where both solvers converge; mean +0.121, 95% bootstrap CI shaded, paired Wilcoxon p < 10⁻²⁷.

Yes, But...

Experiments are tabular and small, with exact values; results may change when using sampling or function approximation in large-scale settings. The max-entropy selection holds under a uniform initialization; warm starts or different reference policies will pick different equilibria. The study is limited to zero-sum games, so genuinely payoff-inequivalent equilibria (different game values) are outside its scope, and the measured robustness differences are small in absolute terms (< 0.02).
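The dependence on the reference policy can be sketched numerically. For a uniform reference u over n actions, KL(π ‖ u) = log n − H(π), so the closest equilibrium to u in KL is exactly the maximum-entropy one; a non-uniform reference pulls the minimizer elsewhere. The sketch below computes this I-projection directly on a hypothetical one-parameter equilibrium family (0.5, q, 0.5 − q). It does not run R-NaD or any solver, and the family, the `biased` reference, and all function names are assumptions made for illustration.

```python
import math

def kl(p, r):
    """KL divergence KL(p || r) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)

def equilibrium(q):
    """Hypothetical 1-D equilibrium family: strategies (0.5, q, 0.5 - q)."""
    return [0.5, q, 0.5 - q]

def i_projection(reference, steps=100):
    """Minimize KL(equilibrium(q) || reference) over q in [0, 0.5].

    KL is convex in its first argument and the family is a line segment,
    so the objective is convex in q and ternary search applies.
    """
    lo, hi = 0.0, 0.5
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(equilibrium(m1), reference) < kl(equilibrium(m2), reference):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

uniform = [1 / 3, 1 / 3, 1 / 3]
biased = [1 / 3, 1 / 2, 1 / 6]  # assumed warm-start reference favoring action 2

print("uniform reference -> q =", round(i_projection(uniform), 6))
print("biased reference  -> q =", round(i_projection(biased), 6))
```

Setting the derivative to zero gives q / (0.5 − q) = r₂ / r₃, so the uniform reference selects q = 0.25 (the max-entropy point) while this biased reference selects q = 0.375. Both are valid equilibria of the toy family; only the reference changed.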

Methodology & More

The study uses a small, exact tabular testbed of six hand-designed games (including Kuhn poker) and a randomized ensemble of 180 asymmetric matrix games. Solvers were run with exact counterfactual values, so ground-truth equilibria are known. Two solver families were compared: regret-averaging methods that deploy time-averaged strategies (e.g., CFR/CFR+) and regularized last-iterate methods that converge to a single final policy anchored to a reference (e.g., R-NaD). Because the games have non-singleton equilibrium sets (a convex polytope of Nash equilibria), the experiment tests which member each algorithm selects rather than whether an equilibrium is found.

Regularized last-iterate methods with a uniform reference consistently select the maximum-entropy equilibrium, which the authors characterize as the information projection (I-projection) of the uniform reference onto the Nash polytope. This holds exactly on a 2-D polytope, to 99.7% in Kuhn, and on 100% of the 180-game ensemble. Regret-averaging methods drift toward lower-entropy boundary faces; that drift grows with iteration budget rather than vanishing, and it is not caused by the simple nonnegative projection of regrets.

Practically, solver-dependent selection changes off-path behavior and hedging against imperfect opponents; the effect is structurally real but bounded in magnitude in these testbeds. The findings suggest choosing solvers whose selection behavior is aligned with your deployment goals (for example, prefer regularized last-iterate solvers if you want a high-entropy hedge), while noting that further work is needed to confirm these phenomena in large, sampled, or function-approximation regimes.
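As a rough illustration of what "deploying a time-averaged strategy" means, here is a minimal sketch of self-play regret matching on a matrix game. This is the per-decision building block that CFR applies at each information set, not CFR or the paper's experimental code. The optional `plus` flag clips cumulative regrets at zero, which is the nonnegative projection mentioned above; it omits the alternating updates and weighted averaging that full CFR+ also uses. The toy game and all names are assumptions.

```python
def regret_matching(regrets):
    """Play in proportion to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    return [p / total for p in positive] if total > 0.0 else [1.0 / n] * n

def solve(A, iterations=20000, plus=False):
    """Self-play regret matching on a zero-sum matrix game (row-player payoffs A).

    Returns the time-averaged strategies of both players.
    """
    n_rows, n_cols = len(A), len(A[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        x, y = regret_matching(row_regret), regret_matching(col_regret)
        row_u = [sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
        col_u = [-sum(x[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]
        value = sum(x[i] * row_u[i] for i in range(n_rows))  # row player's payoff
        for i in range(n_rows):
            row_regret[i] += row_u[i] - value
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_regret[j] += col_u[j] + value  # column player's payoff is -value
            col_sum[j] += y[j]
        if plus:  # nonnegative projection of cumulative regrets
            row_regret = [max(r, 0.0) for r in row_regret]
            col_regret = [max(r, 0.0) for r in col_regret]
    return [s / iterations for s in row_sum], [s / iterations for s in col_sum]

def exploitability(A, x, y):
    """Sum of both players' best-response gains against (x, y)."""
    rows, cols = range(len(A)), range(len(A[0]))
    best_row = max(sum(A[i][j] * y[j] for j in cols) for i in rows)
    best_col = min(sum(x[i] * A[i][j] for i in rows) for j in cols)
    return best_row - best_col

# Hypothetical toy game: matching pennies with a duplicated column action.
A = [[1.0, -1.0, -1.0],
     [-1.0, 1.0, 1.0]]
x_avg, y_avg = solve(A)
print("average column strategy:", [round(v, 3) for v in y_avg])
print("exploitability:", exploitability(A, x_avg, y_avg))
```

Note that this toy game is too symmetric to reproduce the drift reported above: the two duplicated columns always receive identical regrets, so regret matching weights them equally, which is consistent with the observation that solvers coincide on symmetric games. Observing the boundary drift would require an asymmetric game of the kind used in the study.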

Credibility Assessment:

Solo author with no affiliation listed, arXiv preprint and zero citations. Lack of recognizable institution or author reputation yields lowest credibility rating.
