Key Takeaway

A live, permissionless system now records AI forecasts on a public ledger and scores them with principled metrics, producing forgery-resistant agent track records you can independently verify.

What They Found

An on-chain evaluation platform captures agent forecasts on live prediction markets, records every commit and reveal publicly, and uses proper scoring rules to measure forecasting quality. The analysis shows that edges over market consensus are typically small and hard to detect with few rounds, and that measuring probability accuracy and informational edge separately reveals failure modes that profit-based metrics miss. Initial 50-round runs put top language models very close to market consensus, while noisy market-tracking agents underperform on informational edge despite similar raw accuracy.
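The commit and reveal steps mentioned above can be illustrated with a minimal sketch. It assumes a salted SHA-256 hash over a fixed-precision encoding of the probability; the platform's actual on-chain hash function, encoding, and contract interface are not described in this summary, so `make_commitment` and `verify_reveal` are hypothetical names.

```python
import hashlib
import hmac
import secrets


def make_commitment(probability: float, salt: bytes) -> str:
    """Hash a forecast with a secret salt so it can be published before the reveal.

    ASSUMPTION: SHA-256 and a 6-decimal text encoding are stand-ins; the real
    system may use a different hash and encoding.
    """
    payload = f"{probability:.6f}".encode("utf-8") + salt
    return hashlib.sha256(payload).hexdigest()


def verify_reveal(commitment: str, probability: float, salt: bytes) -> bool:
    """Check that a revealed (probability, salt) pair matches the earlier commitment."""
    expected = make_commitment(probability, salt)
    return hmac.compare_digest(commitment, expected)


if __name__ == "__main__":
    salt = secrets.token_bytes(32)          # kept secret until the reveal phase
    commitment = make_commitment(0.62, salt)
    print(verify_reveal(commitment, 0.62, salt))  # honest reveal
    print(verify_reveal(commitment, 0.70, salt))  # altered forecast is rejected
```

The salt keeps the forecast from being guessed by hashing candidate probabilities, and the published hash keeps the agent from changing its forecast after seeing how the market moves.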

Data Highlights

1. Alpha values relative to market consensus are very small, typically 0.01–0.03, so detecting a true forecasting edge requires many rounds.
2. Frontier language models reached Brier scores ≈ 0.122 when given the crowd forecast and ≈ 0.136 without it; human superforecasters score ≈ 0.096.
3. With 50 rounds, the top three models stayed within |alpha| ≤ 0.005 of market consensus; the random baseline was clearly different (t = −7.24, p ≪ 10⁻¹⁰).
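A t statistic like the one in the third highlight is what a one-sample test of mean alpha against zero produces. The summary does not say which test the authors used, so the sketch below is an assumption, and the input values are made-up placeholders rather than data from the benchmark.

```python
import math
import statistics


def one_sample_t(values, null_mean=0.0):
    """Return the one-sample t statistic for the mean of `values` against `null_mean`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)        # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - null_mean) / standard_error


if __name__ == "__main__":
    # Hypothetical per-forecast alpha values, for illustration only.
    alphas = [0.02, -0.01, 0.03, 0.00, 0.01]
    print(f"t = {one_sample_t(alphas):.3f}")
```

A mean alpha of 0.01 across only five noisy observations gives a t statistic well short of conventional significance, which is the practical meaning of "requires many rounds".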

Why It Matters

AI engineers building agents that will make real-world forecasts, or trade on those forecasts, can use this to create an auditable reputation that survives model updates. Platform operators, risk teams, and procurement leads can use the independent record to compare calibration (how well probabilities match outcomes) with informational edge (what actually adds value over the market).

Limitations

Statistical power is limited with small numbers of rounds: 50 rounds are often insufficient to declare small edges significant. The benchmark focuses on liquid, binary markets from a single market provider, so results may not generalize to thin markets or non-binary questions. The system measures probabilistic accuracy and informational edge, not trading skills such as timing or position sizing, so profit-based judgments still require separate evaluation.
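To get a feel for why 50 rounds is often too few, here is a rough sample-size sketch. It uses the textbook normal-approximation formula for a two-sided one-sample test, not the authors' own sample-size rule (which this summary does not reproduce), and the per-forecast standard deviation of 0.15 is an illustrative assumption, not a reported figure.

```python
import math
from statistics import NormalDist


def required_samples(effect, sd, alpha=0.05, power=0.80):
    """Approximate number of independent scored forecasts needed to detect `effect`.

    Standard approximation: n = ((z_{1-alpha/2} + z_{power}) * sd / effect) ** 2
    """
    if effect <= 0 or sd <= 0:
        raise ValueError("effect and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd / effect) ** 2)


if __name__ == "__main__":
    assumed_sd = 0.15  # ASSUMPTION: placeholder spread of per-forecast alpha
    for effect in (0.03, 0.02, 0.01):
        n = required_samples(effect, assumed_sd)
        print(f"edge {effect:.2f}: about {n} forecasts")
```

Because the required count grows with the inverse square of the edge, halving the edge roughly quadruples the number of forecasts needed.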

Deep Dive

Foresight Arena is a live, open system that records AI agents' probability forecasts on real, high-volume binary prediction markets and stores every action on a public blockchain. Agents first submit commitments and later reveal their probabilities; outcomes are resolved via an independent oracle. Each forecast is scored by the Brier Score (measuring absolute calibration) and by a new Alpha Score (measuring informational edge over the market). Both scores are proven to be strictly proper, meaning the best strategy is to report true beliefs, and the authors derive formulas for the variance of Alpha and a sample-size rule for detecting a given effect size.

Applied to 50-round runs on Polygon using curated Polymarket questions, the platform shows that top language models sit very close to market consensus (alpha near zero), while a random baseline is easily distinguishable. Models that simply track the market with added noise post similar Brier scores but have worse Alpha, a failure mode that profit-and-loss measures would miss.

The system creates a persistent, tamper-resistant reputation for agents, useful for licensing or selection, and is designed to grow more powerful as more rounds accumulate. Future extensions include focused rounds by market category, staking, ensemble scoring, and conditional scoring that evaluates only disagreements with the crowd, trading sample size for larger effect sizes.
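The two scores can be sketched in a few lines. The Brier score below is the standard mean squared error between forecast probabilities and binary outcomes. The exact definition of the Alpha Score is not given in this summary, so `alpha_score` assumes it is the market's Brier score minus the agent's (positive means the agent beat the market); treat that as a hypothetical stand-in. All numbers are invented for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def alpha_score(agent_probs, market_probs, outcomes):
    """ASSUMED definition: Brier improvement over the market (higher is better)."""
    return brier_score(market_probs, outcomes) - brier_score(agent_probs, outcomes)


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0]              # hypothetical resolved markets
    market = [0.7, 0.3, 0.6, 0.4]        # hypothetical market-implied probabilities
    agent = [0.8, 0.2, 0.6, 0.5]         # hypothetical agent forecasts

    print(f"market Brier: {brier_score(market, outcomes):.4f}")
    print(f"agent Brier:  {brier_score(agent, outcomes):.4f}")
    print(f"alpha:        {alpha_score(agent, market, outcomes):.4f}")
```

Reporting the two numbers side by side is what separates an agent that is merely accurate because it copies the market from one that adds information beyond it.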
Credibility Assessment:

arXiv preprint with no listed affiliations and minimal citations (1). There is some signal of community attention, but the authors are not clearly established.