
In Brief

A six-agent, safety-first pipeline can automatically produce auditable pull requests without introducing test regressions, but about half of its pilot PRs need better file targeting to be true fixes.

Key Findings

Phoenix runs six specialized language-model agents through triage, optional reproduction, implementation, testing, failure analysis, and handoff to create pull requests that preserve correctness. On a 24-case benchmark slice, it reached 75% oracle success (18/24) while never introducing test regressions on successful runs. In a 42-issue pilot across 14 repositories, Phoenix preserved correctness for every run, though roughly half the PRs placed code in new or generic paths rather than editing the correct file, pointing to a file-localization limit.
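The control flow behind these stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Phoenix's implementation: the names `run_pipeline`, `plan`, `implement`, `run_tests`, `analyze_failure`, and `reproduce`, and the retry budget `max_retries=3`, are hypothetical. Only the stage order, the non-blocking Reproducer, and the `ai:review` and `ai:failed` labels come from the summary and the pipeline figure caption.

```python
from typing import Any, Callable, Optional, Tuple


def run_pipeline(
    issue: str,
    plan: Callable[[str], Any],
    implement: Callable[[Any, Optional[str]], Any],
    run_tests: Callable[[Any], Tuple[bool, str]],
    analyze_failure: Callable[[str], str],
    reproduce: Optional[Callable[[str, Any], None]] = None,
    max_retries: int = 3,  # hypothetical budget; the source gives no number
) -> str:
    """Run the stages in order and return the resulting label."""
    task_plan = plan(issue)  # triage and planning

    if reproduce is not None:
        try:
            reproduce(issue, task_plan)  # optional: try to add a failing test
        except Exception:
            pass  # Reproducer failure is non-blocking

    feedback: Optional[str] = None
    for _ in range(max_retries):
        patch = implement(task_plan, feedback)  # implementation
        passed, report = run_tests(patch)  # testing
        if passed:
            return "ai:review"  # handoff to human review
        feedback = analyze_failure(report)  # failure analysis feeds the retry
    return "ai:failed"  # retries exhausted
```

Passing each stage in as a callable keeps the input/output contract of every agent explicit, which is one plausible way to make failures easy to isolate.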

Key Data

1. 75% oracle resolution on the evaluated SWE-bench Lite slice (18 out of 24 instances)
2. 100% correctness preservation on a 42-issue pilot: no test regressions introduced on any pilot run
3. Mean time to terminal label for oracle-passing runs: 170 seconds from ai:ready to completion

Why It Matters

Phoenix matters to engineers building autonomous code agents and to platform teams evaluating agent deployment because it shows a path to automated fixes that are auditable and safety-gated. Open-source maintainers and DevOps leads can use its baseline-aware testing and label-driven workflow to reduce risk from automated PRs while keeping human review in the loop.

Key Figures

Figure 2: Phoenix agent pipeline. After planning, the Reproducer (when enabled) attempts to add a failing test that demonstrates the issue on the base branch; implementation and testing follow, with Failure Analyst feedback on test failure. Reproducer failure is non-blocking. On retry exhaustion the issue is labeled ai:failed.
Figure 3: Baseline-aware test evaluation. Phoenix stashes its changes and runs the suite on the unmodified branch (step 2) to collect baseline failures $B$, then restores changes and re-runs (step 4) to collect post-change failures $P$. Correctness is preserved iff $P \setminus B = \emptyset$: no previously passing test now fails.
Figure 4: GitHub label state machine. Labels serve as the persistent state store; transitions are atomic and mutually exclusive. The ai:revise loop enables iterative human-in-the-loop refinement. ai:failed is a terminal failure state; ai:review is the handoff to human review.
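The label state machine in Figure 4 can be modeled as a small transition table. The sketch below is an assumption-laden illustration: the label names `ai:ready`, `ai:review`, `ai:revise`, and `ai:failed` appear in this summary, but the exact set of allowed transitions in `ALLOWED` and the helper names `current_state` and `transition` are hypothetical and may differ from Phoenix's actual workflow.

```python
from typing import Dict, Iterable, Set

# Assumed transition table; only the label names come from the text.
ALLOWED: Dict[str, Set[str]] = {
    "ai:ready": {"ai:review", "ai:failed"},
    "ai:review": {"ai:revise"},              # a human asks for changes
    "ai:revise": {"ai:review", "ai:failed"},
    "ai:failed": set(),                      # terminal failure state
}


def current_state(labels: Iterable[str]) -> str:
    """Return the single workflow label, enforcing mutual exclusivity."""
    states = [label for label in labels if label in ALLOWED]
    if len(states) != 1:
        raise ValueError(f"expected exactly one workflow label, got {states}")
    return states[0]


def transition(labels: Iterable[str], target: str) -> Set[str]:
    """Swap the workflow label for `target`, leaving other labels untouched."""
    state = current_state(labels)
    if target not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return (set(labels) - {state}) | {target}
```

For example, `transition({"bug", "ai:ready"}, "ai:review")` returns `{"bug", "ai:review"}`, while any transition out of `ai:failed` raises `ValueError`. Against a real issue tracker the swap would need to be applied as one update to stay atomic; this sketch only models the rule in memory.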


Limitations

Evaluation focused on Python repositories and a curated 24-case slice, so overall benchmark claims are not protocol-matched to public leaderboards. About half of the generated PRs add code at new file paths rather than editing the correct existing files, a failure attributed to the keyword-based file ranker; semantic retrieval over code structure is required to improve true fix rates. Human review remains necessary: Phoenix preserves correctness (no new failing tests) but does not guarantee that every PR implements the intended semantic fix.

Methodology & More

Phoenix decomposes issue resolution into six narrow agents that mirror a human workflow: triage and planning, an optional reproducer that tries to add a failing test, implementation, testing, failure analysis, and a review handoff. Each agent has a small, well-defined input/output contract so failures are easier to isolate. Safety is enforced with a baseline-aware test protocol that records pre-change failures, applies the agent’s changes, then checks post-change failures; a change is considered safe only if no previously passing test starts failing (that is, the set of post-change failures minus the set of baseline failures is empty). All outputs are auditable pull requests, and label transitions serve as the state machine for runs and retries.
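The safety rule reduces to a set difference over failing test identifiers. The following is a minimal sketch of that check only, assuming the two test runs have already produced collections of failing test IDs; the function names and the example test IDs are hypothetical, and the stash and re-run steps are omitted.

```python
from typing import Iterable, Set


def introduced_failures(baseline: Iterable[str], post_change: Iterable[str]) -> Set[str]:
    """Return tests that fail after the change but did not fail before it."""
    return set(post_change) - set(baseline)


def preserves_correctness(baseline: Iterable[str], post_change: Iterable[str]) -> bool:
    """A change is safe only if it introduces no new failing test."""
    return not introduced_failures(baseline, post_change)


# Made-up test IDs for illustration.
baseline = {"tests/test_io.py::test_legacy_encoding"}
safe_run = {"tests/test_io.py::test_legacy_encoding"}
unsafe_run = {"tests/test_io.py::test_legacy_encoding", "tests/test_api.py::test_get"}

print(preserves_correctness(baseline, safe_run))    # True
print(preserves_correctness(baseline, unsafe_run))  # False
print(introduced_failures(baseline, unsafe_run))    # {'tests/test_api.py::test_get'}
```

Comparing against the baseline means a test that was already failing on the unmodified branch does not block a change, and a change that happens to fix a baseline failure still passes the check.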
Credibility Assessment:

arXiv paper; authors include one with a very low h-index (1) and no strong affiliations.