The Big Picture

AI models can often guess high-level intent and produce runnable visual workflows, but they frequently fail to generate correct, stable workflows when tasks get complex or requirements change.

Key Findings

Chat2Workflow is a focused benchmark of 273 real-world workflow tasks across six domains that checks whether models can generate executable visual workflows from everyday language. Across the 15 representative language models tested (open- and closed-source), models often capture high-level intent but make structural or logical mistakes that break execution. Workflow quality also drops as dialogues evolve: models degrade across rounds when requirements change, revealing fragility in real deployment scenarios.

Key Data

1. 273 task instances covering six domains (AIGC, Research, Document, Education, Enterprise, Developer) used to evaluate workflow generation.
2. 15 language models evaluated (4 closed-source, 11 open-source) to measure general capability and variability.
3. Over 70% of real-world agent deployments rely on off-the-shelf language models without weight tuning, highlighting the practical importance of robust automated workflow generation.

What This Means

Engineers building automated agents and platform owners who want to let non-developers create workflows should care: the benchmark shows where current models break, so teams can decide when to trust automation. Technical leads evaluating vendor models can use Chat2Workflow as a testbed to compare how reliably models produce correct, executable pipelines under changing requirements.

Key Figures

Figure 1: An example task in Chat2Workflow, which features realistic, variable natural-language instruction inputs and produces outputs that can be directly transformed and integrated into real-world workflow platforms (e.g., Dify and Coze).
Figure 2: Distribution of task types in Chat2Workflow. The benchmark covers six domains: AIGC, Research, Document, Education, Enterprise, and Developer.
Figure 3: Overview of Chat2Workflow benchmark construction and evaluation framework. Left: We collect workflows from six task domains (Research, Document, Enterprise, Developer, Education, AIGC) and reverse-engineer multi-turn instructions. Center: Users interact with LLMs through dialogue, and the model generates workflows in JSON format with CoT reasoning (node selection, design principles, and structured workflow). Right: The JSON workflow is converted to YAML format, uploaded to the platforms for execution, and evaluated against test cases to compute Pass Rate and Resolve Rate.
Figure 4: Performance degradation across dialogue rounds. We show the Pass Rate and Resolve Rate for all 15 models across the first three dialogue rounds. Most models exhibit a steady decline in both metrics as the number of interaction rounds increases, indicating the challenge of maintaining workflow quality under evolving requirements.
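
The per-round comparison described for Figure 4 amounts to grouping evaluation outcomes by dialogue round and computing two ratios. The Python sketch below illustrates that bookkeeping only. The record layout, the `Outcome` and `rates_by_round` names, and the metric definitions are assumptions made for illustration (Pass Rate is taken to mean the workflow ran on the platform, Resolve Rate that it also satisfied the test cases); the paper's exact definitions may differ, and the sample values are toy data, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One generated workflow's result (hypothetical record layout)."""
    model: str
    round: int       # dialogue round, starting at 1
    passed: bool     # assumed meaning: the workflow ran on the platform
    resolved: bool   # assumed meaning: it also satisfied the test cases


def rates_by_round(outcomes):
    """Return {round: (pass_rate, resolve_rate)} over the given outcomes."""
    totals = defaultdict(lambda: [0, 0, 0])  # round -> [count, passed, resolved]
    for o in outcomes:
        bucket = totals[o.round]
        bucket[0] += 1
        bucket[1] += o.passed
        # A workflow only counts as resolved if it also passed.
        bucket[2] += o.passed and o.resolved
    return {r: (p / n, s / n) for r, (n, p, s) in sorted(totals.items())}


# Toy data for illustration only; these are not results from the benchmark.
sample = [
    Outcome("model-a", 1, True, True),
    Outcome("model-a", 1, True, False),
    Outcome("model-a", 2, True, False),
    Outcome("model-a", 2, False, False),
]
per_round = rates_by_round(sample)
for round_no, (pass_rate, resolve_rate) in per_round.items():
    print(f"round {round_no}: pass={pass_rate:.2f} resolve={resolve_rate:.2f}")
```

Keeping the raw per-task outcomes rather than only the aggregated rates makes it possible to re-slice the same data by model or by domain later.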

Yes, But...

The dataset is modest in size (273 examples) and may not capture the full variety of complex enterprise logic. Node interfaces were simplified to ensure executability, so real-world parameter complexity might expose more failure modes. Only 20 high-frequency node types were included, so performance may differ when rare or highly specialized tools are needed in production.

Deep Dive

Chat2Workflow creates a practical test for converting natural-language requests into executable visual workflows: collect real workflows from six domains, reverse-engineer multi-turn user instructions, and store gold-standard workflow representations. Models interact through dialogue and must output structured JSON workflows, which are converted to YAML and executed on a popular workflow platform to compute pass and resolve rates. The benchmark emphasizes not just high-level intent but structural correctness: node selection, control flow, and parameter wiring must align with user intent to succeed.

Fifteen models were evaluated (four closed-source and eleven open-source), and two top models received an advanced agentic evaluation. Results show models frequently capture the gist of tasks but fail on details that break execution: invalid node connections, inconsistent logic in conditionals and loops, and brittleness when requirements are revised across dialogue rounds. Performance consistently declines over the first three interaction rounds, indicating that evolving user requests are a major pain point. The benchmark exposes a clear gap between language understanding and reliable, structured workflow synthesis, pointing to the need for improved structured reasoning, better tooling for preserving correctness during edits, and richer evaluation tailored to production constraints.
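
To make the "invalid node connections" failure mode concrete, the Python sketch below checks a generated workflow for a few structural problems before it would be handed to a platform. It is an illustration under stated assumptions, not the benchmark's evaluator: the JSON layout (`nodes` with `id` and `type`, `edges` with `source` and `target`), the `ALLOWED_TYPES` vocabulary, and the `structural_errors` function are all hypothetical, and real platform schemas such as Dify's or Coze's carry more fields.

```python
import json

# Hypothetical node vocabulary; the benchmark's actual 20 node types are not listed here.
ALLOWED_TYPES = {"start", "llm", "if-else", "code", "end"}


def structural_errors(workflow):
    """Return a list of human-readable structural problems in a workflow dict."""
    errors = []
    nodes = {}
    for node in workflow.get("nodes", []):
        if node["id"] in nodes:
            errors.append(f"duplicate node id: {node['id']}")
        nodes[node["id"]] = node["type"]

    for node_id, node_type in nodes.items():
        if node_type not in ALLOWED_TYPES:
            errors.append(f"unknown node type: {node_type} ({node_id})")

    incoming, outgoing = set(), set()
    for edge in workflow.get("edges", []):
        src, dst = edge["source"], edge["target"]
        for endpoint in (src, dst):
            if endpoint not in nodes:
                errors.append(f"edge references missing node: {endpoint}")
        outgoing.add(src)
        incoming.add(dst)

    # Every node except the start needs an input; every node except the end needs an output.
    for node_id, node_type in nodes.items():
        if node_type != "start" and node_id not in incoming:
            errors.append(f"node has no incoming edge: {node_id}")
        if node_type != "end" and node_id not in outgoing:
            errors.append(f"node has no outgoing edge: {node_id}")
    return errors


# Example: the model wired the LLM node to a node "d" that it never declared.
generated = json.loads("""
{
  "nodes": [
    {"id": "a", "type": "start"},
    {"id": "b", "type": "llm"},
    {"id": "c", "type": "end"}
  ],
  "edges": [
    {"source": "a", "target": "b"},
    {"source": "b", "target": "d"}
  ]
}
""")
problems = structural_errors(generated)
for problem in problems:
    print(problem)
```

A check like this catches only shape-level errors. Logic mistakes in conditionals and loops, which the results above also highlight, still require executing the workflow against test cases.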
Credibility Assessment:

The authors have mostly low to mid h-indices (some around 12), there is no clear top-tier institutional affiliation, and the paper is available only on arXiv. Overall credibility is reasonable but not top-tier.