The Big Picture
AI research agents beat strong baselines on many forecasting tasks but still miss the mark on genuine forward-looking financial reasoning; a live, weekly benchmark exposes where they succeed and where they fail.
Key Findings
A live multi-agent evaluation system generates real, research-style forecasting tasks every week at both company and macro levels. The benchmark ran for ten weeks across 1,314 listed companies and 8 global economies, and evaluated 13 representative agent methods. Agents consistently outperformed standard baselines, yet their forecasts still lack the depth and forward-looking reasoning needed for high-stakes financial decisions. A public leaderboard enables ongoing tracking and comparison of agent performance.
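To make the cadence concrete, here is a minimal, self-contained sketch of a weekly live-evaluation loop. All names (WeeklyTask, make_tasks, get_forecast, run_live_benchmark) are hypothetical stand-ins for illustration; the paper does not publish this API.

```python
# A sketch of the weekly live-evaluation cadence, assuming hypothetical
# task-generation and forecasting helpers (not the paper's actual code).
from dataclasses import dataclass
from datetime import date, timedelta
import random

@dataclass
class WeeklyTask:
    question: str
    truth: float  # resolved outcome, known only after the week ends

def make_tasks(week_start: date) -> list[WeeklyTask]:
    # Stand-in for the system's automatic weekly task generation.
    return [WeeklyTask(f"Revenue surprise for week of {week_start}?",
                       random.gauss(0, 1))]

def get_forecast(agent: str, task: WeeklyTask) -> float:
    # Stand-in for querying one of the benchmarked agent methods.
    return random.gauss(0, 1)

def run_live_benchmark(start: date, weeks: int = 10) -> dict[str, float]:
    """Accumulate mean absolute error per agent over the live window."""
    agents = ["agent_a", "agent_b", "naive_baseline"]
    errors: dict[str, list[float]] = {a: [] for a in agents}
    for w in range(weeks):
        for task in make_tasks(start + timedelta(weeks=w)):
            for agent in agents:
                errors[agent].append(abs(get_forecast(agent, task) - task.truth))
    return {a: sum(e) / len(e) for a, e in errors.items()}

print(run_live_benchmark(date(2024, 1, 1)))
```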
By the Numbers
1. FinDeepForecastBench covers 1,314 listed companies across 8 global economies.
2. Evaluation ran weekly over a 10-week horizon and produced recurrent and non-recurrent forecasting tasks.
3. 13 representative agent methods were benchmarked and compared against strong baselines in a live, end-to-end system.
What This Means
Engineers building multi-agent AI systems can use the benchmark to stress-test forecasting workflows and agent delegation strategies. Technical product leaders and risk teams evaluating agent reliability can use the live leaderboard to follow agents' track records over time. Quant researchers and data teams in finance can use the dataset and tasks for pre-production testing and method development. The leaderboard can also inform governance and trust assessments following the LLM-as-Judge approach.
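As an illustration of the LLM-as-Judge idea, the sketch below grades agents' written forecast rationales with a judge model. The rubric, the 1-5 scale, and the judge_fn callable are assumptions for illustration; the paper's actual judging protocol may differ.

```python
# A hedged LLM-as-Judge sketch: a judge model scores each rationale for
# forward-looking reasoning on an assumed 1-5 scale.
from statistics import mean
from typing import Callable

RUBRIC = (
    "Score 1-5: does this forecast rationale show genuine forward-looking "
    "reasoning rather than restating past data? Reply with a single integer."
)

def judge_trust(rationales: list[str], judge_fn: Callable[[str], str]) -> float:
    scores = []
    for text in rationales:
        reply = judge_fn(f"{RUBRIC}\n\nRationale:\n{text}")
        # Take the first in-range digit from the reply; fall back conservatively.
        digits = [int(c) for c in reply if c.isdigit() and 1 <= int(c) <= 5]
        scores.append(digits[0] if digits else 1)
    return mean(scores)

# Usage with a trivial stand-in judge (replace with a real model call):
fake_judge = lambda prompt: "4"
print(judge_trust(["Guidance implies margin expansion next quarter."], fake_judge))
```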
Yes, But...
The evaluation covers a 10-week live window, which may not reflect long-term market regimes or rare events. Benchmarked agents represent a snapshot of current methods; tuned or domain-specialized agents could perform differently. The system focuses on research-oriented forecasts, not execution, so forecasting skill does not imply safe, profitable trading in production environments.
Deep Dive
FinDeepForecast is a live, multi-agent system that automatically generates research-style financial forecasting tasks and evaluates agent performance on an ongoing basis. Tasks follow a dual-track taxonomy: recurrent tasks (regular, repeatable signals) and non-recurrent tasks (one-off events), posed at both the corporate level (individual companies) and the macro level (economies and markets). From this system the team produced FinDeepForecastBench, a weekly benchmark spanning a 10-week horizon that covers 1,314 listed companies across 8 economies and evaluates 13 representative agent methods against strong baselines.
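One plausible way to encode the dual-track taxonomy is as two orthogonal tags on each task, as in the sketch below. The enum values mirror the taxonomy described above, while the ForecastTask fields are illustrative assumptions rather than the benchmark's actual schema.

```python
# Illustrative data model for the dual-track taxonomy: every task is tagged
# on two axes, track (recurrent vs. non-recurrent) and level (corporate vs.
# macro).
from dataclasses import dataclass
from enum import Enum

class Track(Enum):
    RECURRENT = "recurrent"          # regular, repeatable signals (e.g., weekly returns)
    NON_RECURRENT = "non_recurrent"  # one-off events (e.g., an M&A announcement)

class Level(Enum):
    CORPORATE = "corporate"  # individual listed companies
    MACRO = "macro"          # economies and markets

@dataclass(frozen=True)
class ForecastTask:
    track: Track
    level: Level
    subject: str   # ticker or economy identifier (assumed field)
    question: str

task = ForecastTask(Track.RECURRENT, Level.CORPORATE, "AAPL",
                    "Will next week's close be above this week's?")
print(task.track.value, task.level.value)
```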
Results show that modern research agents often outperform standard baseline methods, demonstrating useful capabilities in parsing information and producing forecasts. However, agents still struggle with genuine forward-looking financial reasoning and with the nuance of one-off events, leaving a gap between current agent output and the robust forecasting needed for high-stakes decisions. The live nature of the system and the public leaderboard enable continuous head-to-head agent evaluation, help surface failure modes, and support building agent track records and trust signals for governance and pre-production testing.
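To show what leaderboard-style comparison against a baseline can look like, here is a small sketch that ranks agents by directional accuracy relative to a naive baseline. Both the metric and the skill-over-baseline delta are illustrative choices, not the benchmark's actual scoring rules.

```python
# Rank agents by directional accuracy minus a baseline's accuracy
# ("skill over baseline"), an illustrative scoring scheme.
def directional_accuracy(forecasts: list[float], outcomes: list[float]) -> float:
    hits = sum((f > 0) == (o > 0) for f, o in zip(forecasts, outcomes))
    return hits / len(outcomes)

def leaderboard(results: dict[str, list[float]], outcomes: list[float],
                baseline: str) -> list[tuple[str, float]]:
    base = directional_accuracy(results[baseline], outcomes)
    skill = {name: directional_accuracy(preds, outcomes) - base
             for name, preds in results.items() if name != baseline}
    return sorted(skill.items(), key=lambda kv: kv[1], reverse=True)

# Toy resolved outcomes and forecasts (signs encode direction).
outcomes = [0.8, -0.2, 0.5, -1.1]
results = {"agent_a": [0.5, -0.1, 0.2, 0.3],
           "agent_b": [0.1, 0.4, -0.6, -0.9],
           "naive_baseline": [0.0, 0.0, 0.0, 0.0]}
print(leaderboard(results, outcomes, baseline="naive_baseline"))
```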
Credibility Assessment:
Multiple authors are affiliated with the National University of Singapore, and several have a substantial h-index (e.g., Xiaofen Xing ~25, Tat-Seng Chua ~18, among others), lending strong institutional and author reputation despite the arXiv venue.