Find reliable AI behavior using only ranked preferences

In Brief

You can compute stable mixed strategies even when players supply only ranked preferences (no numeric scores): treat the rankings as votes and use a voting rule to pick best responses.

Key Findings

A mixed-strategy equilibrium notion works when payoffs are only ordinal: each player treats the distribution of other players' contexts as a population of ranked votes and applies a voting rule to pick best responses. Under mild continuity conditions on the voting rule, such equilibria exist and can be approximated with regularization. Practical learning methods (follow-the-regularized-leader style) use random perturbations to make best responses continuous and produce usable equilibria for agent evaluation and stochastic elections. Related implementation guidance can be found in the A2A Protocol Pattern.

Data Highlights

1. The illustrative example uses an opponent mix of [25%, 30%, 45%] to build a vote population and pick a best response via voting.

2. The regularization parameter p ranges from 0 to 1: p=0 returns the social-choice best response, and p=1 yields the uniform (fully regularized) distribution.

3. Agent-evaluation experiments confirm prior findings that the 'rainbow' agent ranks top across 9 popular voting rules when using ranked evaluations.

Implications

Engineers and teams doing agent-to-agent evaluation or continuous agent evaluation can use ranked human or agent judgments directly, without forcing numeric scores. Technical leads and researchers studying multi-agent trust or tactical voting can model strategic behavior where only preference rankings are available. For structural patterns and practical modeling guidance, see the Role-Based Agent Pattern and the Agentic RAG Pattern.

Key Figures

Fig 1: A Nash equilibrium (NE) is a strategy profile where each player best responds to its co-players. In a game without payoffs, but where players can rank their actions, we use social choice (voting) theory to define a best response. Consider playing an opponent in rock-paper-scissors whose mixed strategy is [25%, 30%, 45%]. For each action the opponent might play, we cast a rank vote over our own actions. Imagine a population of votes with representation proportional to the opponent's mixed strategy: 25% of the votes are our ranking against the opponent's first action, 30% are our ranking against the second, and so on. We define a best response as the outcome of a voting rule on this population, for example the action that Borda elects.
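
The voting-based best response in Fig 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes the opponent mix is ordered (rock, paper, scissors), which the caption does not state, and it uses the standard rock-paper-scissors rankings with Borda points of 2, 1, 0. The names `RANKINGS`, `borda_scores`, and `borda_best_response` are hypothetical.

```python
ACTIONS = ["rock", "paper", "scissors"]

# Our ranking over our own actions (best first) for each opponent action.
# Assumption: standard rock-paper-scissors, win > draw > loss.
RANKINGS = {
    "rock": ["paper", "rock", "scissors"],
    "paper": ["scissors", "paper", "rock"],
    "scissors": ["rock", "scissors", "paper"],
}


def borda_scores(opponent_mix):
    """Weighted Borda score of each of our actions over the vote population."""
    m = len(ACTIONS)
    scores = {a: 0.0 for a in ACTIONS}
    for context, weight in opponent_mix.items():
        # Each context contributes one ranked vote, weighted by its probability.
        for position, action in enumerate(RANKINGS[context]):
            scores[action] += weight * (m - 1 - position)
    return scores


def borda_best_response(opponent_mix, tol=1e-12):
    """Actions elected by Borda (more than one if scores tie)."""
    scores = borda_scores(opponent_mix)
    best = max(scores.values())
    return [a for a in ACTIONS if scores[a] >= best - tol]


# Assumed ordering of the mix from Fig 1: rock 25%, paper 30%, scissors 45%.
opponent_mix = {"rock": 0.25, "paper": 0.30, "scissors": 0.45}
scores = borda_scores(opponent_mix)
winners = borda_best_response(opponent_mix)
print(scores)
print(winners)
```

Under that assumed ordering, the weighted Borda scores are 1.15 for rock, 0.80 for paper, and 1.05 for scissors, so rock is elected. A different ordering of the mix would change the winner.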

Fig 2: Algorithm 1 applied to the example from Fig. 1. Step 1: The original co-player strategy $x_{-i} = [25\%, 30\%, 45\%]$ is perturbed via $\hat{x}_{-i} \sim \mathrm{Dir}(\mathbf{1} + \frac{1}{q} x_{-i})$, yielding $[22\%, 31\%, 47\%]$; as $q \to 0$, $\hat{x}_{-i} \to x_{-i}$. Step 2: A usurper action $u$ is sampled from $\mathrm{Cat}(\mu_i)$, giving the vote $v_u$ in which $u$ is top-ranked and all other actions are tied. Step 3: Each vote is independently replaced by $v_u$ with probability $p$ (coin flip). When $p = 0$ the population is unchanged; when $p = 1$ all votes become $v_u$. A voting rule $\nu_i$ determines a best response from this perturbed population. Results are averaged over trials to obtain the final regularized best response.
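
The three steps in Fig 2 can be sketched as a Monte Carlo routine. This is a rough sketch under stated assumptions, not the paper's Algorithm 1 verbatim: it uses Borda as the voting rule, draws a finite population of `n_votes` votes per trial, scores the tied actions in the usurper vote with the average of the remaining Borda points, and splits credit evenly among tied winners. The function names and the `n_votes`, `n_trials`, and `seed` parameters are hypothetical.

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
# Assumed rankings over our own actions (best first), keyed by the
# co-player's action index: 0 = rock, 1 = paper, 2 = scissors.
RANKINGS = {0: [1, 0, 2], 1: [2, 1, 0], 2: [0, 2, 1]}


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def ranking_points(ranking, m):
    """Borda points for a strict ranking: m-1 for first place down to 0."""
    points = [0.0] * m
    for position, action in enumerate(ranking):
        points[action] = float(m - 1 - position)
    return points


def usurper_points(u, m):
    """Borda points for v_u: u on top, all others tied (assumed: they share the average)."""
    tied = (m - 2) / 2.0
    return [float(m - 1) if a == u else tied for a in range(m)]


def regularized_best_response(x_opp, p, q, mu, n_votes=500, n_trials=200, seed=0):
    rng = random.Random(seed)
    m = len(ACTIONS)
    totals = [0.0] * m
    for _ in range(n_trials):
        # Step 1: perturb the co-player strategy with Dir(1 + x / q).
        x_hat = sample_dirichlet([1.0 + x / q for x in x_opp], rng)
        # Step 2: sample a usurper action from Cat(mu) and build its vote.
        u = rng.choices(range(m), weights=mu)[0]
        v_u = usurper_points(u, m)
        # Step 3: draw votes from x_hat; replace each with v_u with probability p.
        scores = [0.0] * m
        for context in rng.choices(range(m), weights=x_hat, k=n_votes):
            vote = v_u if rng.random() < p else ranking_points(RANKINGS[context], m)
            for a in range(m):
                scores[a] += vote[a]
        # Voting rule (Borda here): credit the winner, splitting ties evenly.
        best = max(scores)
        winners = [a for a in range(m) if scores[a] == best]
        for a in winners:
            totals[a] += 1.0 / len(winners)
    # Average over trials to get the regularized best-response distribution.
    return [t / n_trials for t in totals]


x_opp = [0.25, 0.30, 0.45]
uniform = [1 / 3, 1 / 3, 1 / 3]
br_unregularized = regularized_best_response(x_opp, p=0.0, q=0.001, mu=uniform)
br_fully_regularized = regularized_best_response(x_opp, p=1.0, q=0.001, mu=uniform)
print(br_unregularized)
print(br_fully_regularized)
```

With `p=0.0` and a small `q`, most of the mass should land on the unregularized Borda winner; with `p=1.0` every vote is the usurper vote, so the averaged result tracks the target distribution `mu`. Intermediate values of `p` interpolate between the two.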

Fig 3: (a) Exploitability

Fig 4: (a) Agent Marginals

Yes, But...

Computing these equilibria can be computationally hard in general; certain voting rules map to games that are not solvable in polynomial time. Results depend on the chosen voting rule and on the quality of the elicited conditional rankings, so careful rule selection and robust elicitation matter. Regularization choices (how much noise to add and how strongly to inject a uniform target) affect the learned equilibrium and should be tuned to the application. For governance and evaluation considerations, see the Inter-Rater Reliability term and, if needed, the Supervisor Pattern.

Methodology & More

Define each player’s preference as a conditional ranking over their actions given every joint choice of co-players. Treat the co-players’ mixed strategy as a distribution over contexts that generates a population of ranked votes. Apply a probabilistic social choice rule (a voting rule that can return distributions over winners) to that vote population to get a best-response distribution. A profile is a context-ordinal Nash equilibrium when every player’s mixed strategy places probability only on actions elected by their voting-based best response.

The paper proves existence when each best-response mapping is upper hemicontinuous, and identifies common families of voting rules (score/positional, probabilistic, social grading) that fit these conditions or can be regularized to fit them. It introduces a regularization scheme that perturbs the co-player mix and randomly injects a target vote so the best response becomes single-valued and continuous, and uses that scheme to define approximation and regret metrics.

Learning is implemented via follow-the-regularized-leader style updates using the regularized best response, and the approach is demonstrated on agent evaluation (Atari agents) and stochastic ranked-choice elections. The method is robust to utility mis-specification because it relies only on rankings, letting teams evaluate agents or model tactical voting without inventing numeric payoffs. Consider the intuition from the Dynamic Task Routing Pattern for how tasks and decisions can be routed under uncertainty.
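
The equilibrium condition described above can be checked directly for a small game. The sketch below is an illustration under assumptions, not the paper's implementation: it uses a symmetric two-player rock-paper-scissors game where both players share the same conditional rankings, and it stands in for the probabilistic social choice rule with the set of Borda winners. The names `borda_winners` and `is_context_ordinal_ne` are hypothetical.

```python
ACTIONS = ["rock", "paper", "scissors"]

# Assumed conditional rankings (best first) given the co-player's action;
# both players share them because the game is symmetric.
RANKINGS = {
    "rock": ["paper", "rock", "scissors"],
    "paper": ["scissors", "paper", "rock"],
    "scissors": ["rock", "scissors", "paper"],
}


def borda_winners(co_player_mix, tol=1e-9):
    """Set of actions elected by Borda on the vote population induced by the mix."""
    m = len(ACTIONS)
    scores = {a: 0.0 for a in ACTIONS}
    for context, weight in zip(ACTIONS, co_player_mix):
        for position, action in enumerate(RANKINGS[context]):
            scores[action] += weight * (m - 1 - position)
    best = max(scores.values())
    return {a for a in ACTIONS if scores[a] >= best - tol}


def is_context_ordinal_ne(x1, x2, tol=1e-9):
    """True if each player's support lies inside their voting-based best response."""
    for own, other in ((x1, x2), (x2, x1)):
        winners = borda_winners(other)
        for action, prob in zip(ACTIONS, own):
            if prob > tol and action not in winners:
                return False
    return True


uniform = [1 / 3, 1 / 3, 1 / 3]
print(is_context_ordinal_ne(uniform, uniform))
print(is_context_ordinal_ne([1.0, 0.0, 0.0], [0.25, 0.30, 0.45]))
```

Against a uniform co-player all three actions tie under Borda, so any mix is supported on elected actions and the uniform profile passes the check. In the second call, pure rock is a best response to the [25%, 30%, 45%] mix, but that mix is not a best response to pure rock, so the profile fails.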

Credibility Assessment:

The author list includes Marc Lanctot (h-index ~41), a highly established researcher. Although the paper is an arXiv preprint, the strong author reputation warrants the top rating.
