
The Big Picture

Creating fully correct, personalized step-by-step workflows from scattered documents remains hard. DRFLOW shows that a workflow-focused agent can beat strong baselines (by up to 10.02% average F1) but still misses many steps, orderings, and conditional details.

Key Findings

Agents can be trained to pull evidence from many sources and produce action sequences, but current systems often fail to recover complete steps, resolve conditional rules, or order steps correctly. DRFLOW provides a focused benchmark: 100 real-world tasks across five domains, 1,246 reference steps, and evidence drawn from over 3,900 sources. A workflow-focused reference agent improves performance versus strong baselines by as much as 10.02% average F1, yet most diagnostic metrics show large gaps to perfect performance. The result highlights that workflow prediction is a distinct, practical challenge beyond summarization.

By the Numbers

1. 100 tasks across five domains, with 1,246 reference workflow steps grounded in over 3,900 evidence sources
2. The reference workflow agent improved average F1 score by up to 10.02% compared to strong baselines
3. The benchmark defines 7 diagnostic metrics covering grounding, step recovery, order, condition handling, and personalization

Why It Matters

Engineers building agents that must give users concrete how-to plans (product teams shipping assistants, customer support automation, and enterprise workflow tools) should use DRFLOW to test real-world behavior. Evaluation and reliability teams can use the benchmark to measure agent weaknesses (missing steps, wrong order, unhandled conditions) before deployment.

Yes, But...

Dataset size is moderate (100 tasks) and covers five domains, so results may not generalize to every industry or highly specialized workflows. Reference steps and grounding are curated, which may be cleaner than messy real-world documentation agents will face. Metrics focus on workflow structure and grounding rather than user satisfaction or downstream task success, so complement DRFLOW with human trials for final validation. For evaluation-driven development approaches, see Evaluation-Driven Development (EDDOps).

Deep Dive

DRFLOW is a new benchmark focused on predicting personalized, step-by-step workflows from heterogeneous and scattered information sources. Each task asks an agent to find relevant evidence across documents and then produce an ordered sequence of action steps tailored to a user's scenario. The dataset contains 100 tasks spanning five domains, a total of 1,246 reference steps, and more than 3,900 supporting sources. The benchmark includes seven diagnostic metrics that separately measure factual grounding (did the agent cite correct evidence?), step recovery (did it include the right steps?), structural ordering (are steps in the right sequence?), condition resolution (did it handle 'if/when' rules?), and personalization (did it adapt to the user’s constraints?).
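To make two of these metric families concrete, here is a minimal Python sketch of step recovery and structural ordering. It is not DRFLOW's actual scoring code: this summary does not give the benchmark's formulas, so the sketch assumes steps have already been normalized to unique identifiers that can be compared by exact match, and the function names (`step_f1`, `pairwise_order_accuracy`) and the example workflow are hypothetical.

```python
from itertools import combinations


def step_f1(predicted: list[str], reference: list[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 over step identifiers.

    Assumption: steps are matched by exact identifier equality.
    """
    pred_set, ref_set = set(predicted), set(reference)
    matched = len(pred_set & ref_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(ref_set) if ref_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def pairwise_order_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference step pairs whose relative order the prediction keeps.

    Only steps present in both workflows are considered, so missing steps
    are penalized by step_f1 rather than counted twice here.
    """
    position = {step: index for index, step in enumerate(predicted)}
    shared = [step for step in reference if step in position]
    pairs = list(combinations(shared, 2))  # pairs follow reference order
    if not pairs:
        return 0.0
    kept = sum(1 for first, second in pairs if position[first] < position[second])
    return kept / len(pairs)


# Hypothetical workflow: one reference step is missing, one extra step is
# added, and two steps are swapped.
reference = ["gather_documents", "fill_form", "pay_fee", "submit_application"]
predicted = ["gather_documents", "pay_fee", "fill_form", "book_appointment"]

print(step_f1(predicted, reference))
print(pairwise_order_accuracy(predicted, reference))
```

Keeping the two scores separate mirrors the diagnostic intent described above: an agent can recover most steps and still sequence them badly, and a single blended number would hide that.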
Credibility Assessment:

The paper is an arXiv preprint, but its author list includes some recognizable researchers (a mix of established and lesser-known authors), which suggests moderate credibility.