
The Big Picture

A single execution-capable coding agent can discover, model, and query enterprise data end-to-end, producing inspectable, runnable artifacts and matching or beating prior best results on seven public SQL benchmarks without model fine-tuning.

Key Findings

A system that runs one coding agent in three roles (reading raw sources, building a schema, and generating queries) over a shared workspace can replace brittle text handoffs with executable artifacts that domain experts can review. The query component matched or surpassed the best published results on seven SQL benchmarks covering four task types and four SQL dialects, using a single language model and no fine-tuning. The approach trades compute for reliability: average response time ranges from under a minute to about ten minutes per question. The remaining failures stem largely from misreading user intent (semantic mismatches) rather than from execution bugs.

Key Data

1. Matches or surpasses the best published results on all 7 public SQL benchmarks evaluated.
2. Evaluation covers 4,187 total instances spanning 4 task categories and 4 SQL dialects.
3. Average time per question runs from under 1 minute up to roughly 10 minutes, depending on task complexity.

What This Means

Data engineers and analytics platform teams who wrestle with cleaning, documenting, and turning raw sources into queryable databases can use this pattern to compress handoffs and produce reviewable artifacts. Engineering leaders evaluating agent-based tooling should note it improves reliability and traceability, though at higher per-query cost. Researchers building conversational and code-generating agents will find the shared-workspace, execution-grounded approach a useful design to reduce brittle text-only transfers. See Evaluation-Driven Development (EDDOps) for related practices.

Key Figures

Figure 1: DIA against the best prior system on each of the seven SQL benchmarks, ordered by margin. Each benchmark is scored by its official metric.
Figure 2: The DIA system. A single ACA operating over a shared workspace W realizes three agents (Data Interpreter, Schema Creator, and Query Generator), turning raw data D and a question q into a grounded answer R. Each agent reads and writes executable artifacts in W; all draw on a shared memory M; domain experts review each artifact.
Figure 3: Composition of failures per benchmark, aggregated over each benchmark's task categories and ordered by the share of reasoning failures. Segment labels are failure counts.
Figure 4: Interaction-time scaling on BIRD-Interact: the fraction of the 600 instances whose passing submission landed within the first k total turns across both phases, computed from the single full-budget run. The dashed line is the final score of the best prior system at its full budget.
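The quantity plotted in Figure 4 can be illustrated with a short sketch. It assumes each instance is recorded as the turn at which its first passing submission landed, or None if it never passed. The function name and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def pass_fraction_within_k(first_pass_turn: Sequence[Optional[int]], k: int) -> float:
    """Fraction of instances whose first passing submission landed at turn <= k.

    An entry of None means the instance never produced a passing submission.
    """
    if not first_pass_turn:
        raise ValueError("need at least one instance")
    solved = sum(1 for t in first_pass_turn if t is not None and t <= k)
    return solved / len(first_pass_turn)


# Toy data invented for illustration (8 instances, not the 600 in the benchmark).
turns = [1, 3, None, 2, 5, None, 4, 2]

# One point per turn budget k, all derived from the same recorded run.
curve = [pass_fraction_within_k(turns, k) for k in range(1, 6)]
print(curve)
```

Because every point is computed from the same recorded run, the curve is non-decreasing in k by construction.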


Yes, But...

The system increases compute and wall-clock time: iterative generate-execute-verify loops mean an answer takes, on average, from under a minute to about ten minutes per question. Verification checks the execution shape derived from the agent's own interpretation of the question, so if intent is misread, a wrong result can still pass. The evaluation used a single language model, simulated conversational users, and only qualitative memory analysis, so results may vary with different models or real users. For design guidance on managing these loops, consider the Tool Use Pattern.
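The shape-check limitation is easier to see in a small example. The sketch below is an illustration under stated assumptions, not the paper's verifier: the orders table, both queries, and the passes_shape_check helper are hypothetical, and SQLite stands in for whichever dialect is in use.

```python
import sqlite3


def passes_shape_check(conn, sql, expected_columns, expected_rows):
    """Hypothetical check: compare only column names and row count, not values."""
    cur = conn.execute(sql)
    rows = cur.fetchall()
    columns = [d[0] for d in cur.description]
    return columns == expected_columns and len(rows) == expected_rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT);
INSERT INTO orders (region, amount, status) VALUES
  ('north', 100.0, 'paid'), ('north', 50.0, 'refunded'), ('south', 70.0, 'paid');
""")

# Assumed user intent: revenue per region, counting paid orders only.
intended = ("SELECT region, SUM(amount) AS revenue FROM orders "
            "WHERE status = 'paid' GROUP BY region ORDER BY region")
# Misread intent: the status filter is dropped.
misread = ("SELECT region, SUM(amount) AS revenue FROM orders "
           "GROUP BY region ORDER BY region")

for name, sql in [("intended", intended), ("misread", misread)]:
    ok = passes_shape_check(conn, sql, ["region", "revenue"], 2)
    print(name, ok, conn.execute(sql).fetchall())
```

Both queries return two rows with the same column names, so a shape-only check accepts both even though the revenue values differ.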

Deep Dive

The system treats an execution-capable coder as the main building block. One sandboxed agent is invoked in three roles (interpreting raw sources, creating a validated relational schema, and producing executed SQL for natural-language questions), and all artifacts persist in a shared workspace for expert review. Every output is an executable artifact (code, schema, or query) that the agent runs and inspects, which enables execution-aware fixes instead of fragile text-only handoffs. The system also includes a shared memory of past episodes, which aligns with the Agent Service Mesh Pattern for orchestrated, execution-grounded collaboration across components.

In a fully autonomous setup using a single language model with no fine-tuning, the query component matched or exceeded prior best results on seven public SQL benchmarks (4,187 instances) across generation, debugging, conversational interaction, and project completion tasks, and across four SQL dialects. The tradeoff is cost: with the iterative execution loop, average per-question time ranges from under a minute to about ten minutes, depending on complexity.

Key limitations are shallow semantic checks (the system validates result shape but can inherit misread intent), limited evaluation diversity (one model, simulated users), and the need to better organize accumulated experience. Practical next steps are adding true semantic validation, widening model and user studies, and structuring workspace memory for faster reuse. For a concise alignment with semantic interoperability, see the Semantic Capability Matching Pattern.
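The generate-execute-verify loop described in this section can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the loop function, the stub generator (which stands in for a language model call), and the verify callback are all hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def generate_execute_verify(
    conn: sqlite3.Connection,
    generate: Callable[[Optional[str]], str],
    verify: Callable[[list], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[list], List[Tuple[str, str]]]:
    """Run candidate SQL, feeding execution feedback into the next attempt."""
    feedback: Optional[str] = None
    trace: List[Tuple[str, str]] = []  # (sql, outcome) pairs kept for review
    for _ in range(max_attempts):
        sql = generate(feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"execution error: {exc}"
            trace.append((sql, feedback))
            continue
        if verify(rows):
            trace.append((sql, "ok"))
            return rows, trace
        feedback = f"verification failed on {len(rows)} rows"
        trace.append((sql, feedback))
    return None, trace


def stub_generator(feedback: Optional[str]) -> str:
    """Stand-in for a model call: misspells a column until it sees feedback."""
    if feedback is None:
        return "SELECT nme FROM users ORDER BY id"
    return "SELECT name FROM users ORDER BY id"


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
    "INSERT INTO users (name) VALUES ('ada'), ('lin');"
)
rows, trace = generate_execute_verify(conn, stub_generator, verify=lambda r: len(r) == 2)
print(rows)
for sql, outcome in trace:
    print(outcome, "|", sql)
```

The trace plays the role of an inspectable artifact: each attempted query is kept next to its execution outcome, so a reviewer can see why the first attempt was rejected. The verify callback here checks only the row count, which is the kind of shape check that cannot catch a misread question.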
Credibility Assessment:

arXiv preprint with no affiliations or known authors listed, so credibility signals are limited.