Key Takeaway
Automatically reproducing a biology model and running controlled, isolated code edits can reliably identify which components drive performance — the system hits 88.9% end-to-end success and 93.3% accuracy in spotting critical parts.
Core Insights
An autonomous multi-agent system first reconstructs a runnable baseline (fixing environments, dependencies, and data issues), then runs targeted, isolated code mutations to test which components change model performance. It uses a graph-based workflow and an adaptive sampling strategy that balances expected impact with execution cost, and it leverages a domain knowledge base extracted from papers and code to propose sensible hypotheses. Across three diverse single-cell prediction models, the system completed reproduce-then-ablate studies reliably and matched or exceeded human and prior-agent baselines in both running experiments and identifying important components.
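The "isolated code mutations" idea can be illustrated with git worktrees, which this page's methodology notes mention as the isolation mechanism. The following is only a minimal sketch under assumptions: the helper names (`git`, `run_in_worktree`), the callback signatures, and the cleanup policy are hypothetical and not taken from the system itself. It relies only on the standard `git worktree add`, `git worktree remove`, and `git branch -D` commands.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run a git command in `cwd` and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_in_worktree(repo: Path, branch: str, mutate, evaluate):
    """Apply one mutation in a throwaway worktree and evaluate it there.

    `mutate(workdir)` edits files; `evaluate(workdir)` returns a result.
    Both callbacks are hypothetical placeholders for an agent's edit and
    measurement steps. The main checkout at `repo` is never modified.
    """
    tmp_root = Path(tempfile.mkdtemp(prefix="ablation-"))
    workdir = tmp_root / "wt"
    # Create a new branch at HEAD, checked out in its own directory.
    git("worktree", "add", "-b", branch, str(workdir), "HEAD", cwd=repo)
    try:
        mutate(workdir)
        return evaluate(workdir)
    finally:
        # Discard the edited checkout and its branch, even on failure.
        git("worktree", "remove", "--force", str(workdir), cwd=repo)
        git("branch", "-D", branch, cwd=repo)
        shutil.rmtree(tmp_root, ignore_errors=True)
```

Because each mutation lives in its own directory and branch, several ablations could in principle run side by side without touching the reproduced baseline. Whether the actual system cleans up worktrees this way is not stated in the source.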
Data Highlights
1. 96.3% reproduction task success rate (TSR), a +26.9 percentage point improvement over a prior repo-focused agent
2. 92.0% ablation TSR (successfully executed controlled edits and measurements), up +46.2 percentage points vs. a strong smaller-agent baseline
3. 93.3% accuracy in identifying performance-critical components, with 88.9% end-to-end workflow TSR overall
What This Means
Machine-learning engineers and research teams who maintain or evaluate scientific codebases can use this to automate verification and attribution of model changes, saving manual debugging time. Technical leads and platform teams can adopt the reproduce-then-ablate pattern to scale reproducibility checks and prioritize where to invest human attention during model refinement.
Key Figures

Fig 1: Motivation for AblateCell. Existing AI agents scale idea synthesis but not idea attribution. Automated ablation bridges this gap by systematically verifying which components truly matter.

Fig 2: Overview of the AblateCell reproduce-then-ablate framework. The system (i) reproduces baselines via planner-executor agents in a Docker container, then (ii) conducts autonomous ablation by selecting hypotheses with bandit sampling and executing them via a graph-based workflow in isolated worktrees, guided by domain knowledge throughout.

Fig 3: Component importance across BioLORD, CPA, and GEARS. The x-axis shows component indices (names omitted), and the y-axis shows mean reward.

Fig 4: An example of AblateCell applied to GEARS for end-to-end reproduce-then-ablate execution.
Considerations
The system was evaluated on single-cell perturbation prediction repositories with a prepared domain knowledge base; generalizing to other scientific domains may require building or adapting similar priors. It assumes a runnable or nearly runnable baseline repository; fully reimplementing a method from a paper was outside scope. Running many ablations is computationally costly, so users should budget compute and expect some human review for ambiguous or high-impact findings.
Methodology & More
The system runs in two phases: reproduce then ablate. In reproduction it uses planner and executor agents inside containerized environments to auto-configure dependencies, fix common data and environment errors, and re-run the official training and inference pipelines to produce verifiable artifacts. For ablation it turns the codebase into a graph of components, proposes mutation hypotheses grounded by a domain knowledge base parsed from papers and repositories, and executes isolated edits in separate worktrees so experiments don’t interfere. An adaptive bandit-style sampler chooses which mutations to run under a reward that trades off likely performance impact and execution cost.
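The cost-aware sampling idea can be sketched as a small multi-armed bandit. This is a hedged illustration, not the system's actual sampler: it assumes standard UCB1 selection and a reward of the form "absolute metric change minus a runtime penalty", and every name here (`Arm`, `cost_aware_reward`, `cost_weight`, `run_experiment`) is hypothetical. The source says only that the sampler is "bandit-style" and trades off impact against execution cost.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """One candidate mutation hypothesis (illustrative structure)."""
    name: str
    pulls: int = 0
    mean_reward: float = 0.0


def cost_aware_reward(metric_change: float, runtime_s: float,
                      cost_weight: float = 0.001) -> float:
    """Assumed reward: larger performance impact is better, runtime is penalized."""
    return abs(metric_change) - cost_weight * runtime_s


def select_arm(arms: list[Arm], total_pulls: int, c: float = 1.0) -> Arm:
    """UCB1: try each arm once, then pick the highest upper confidence bound."""
    for arm in arms:
        if arm.pulls == 0:
            return arm
    return max(
        arms,
        key=lambda a: a.mean_reward
        + c * math.sqrt(2 * math.log(total_pulls) / a.pulls),
    )


def update(arm: Arm, reward: float) -> None:
    """Incrementally update the running mean reward of an arm."""
    arm.pulls += 1
    arm.mean_reward += (reward - arm.mean_reward) / arm.pulls


def run_bandit(arms: list[Arm], run_experiment, budget: int) -> list[Arm]:
    """Spend `budget` runs; `run_experiment(name)` returns (metric_change, runtime_s)."""
    for t in range(budget):
        arm = select_arm(arms, t)
        metric_change, runtime_s = run_experiment(arm.name)
        update(arm, cost_aware_reward(metric_change, runtime_s))
    # Rank components by estimated cost-adjusted importance.
    return sorted(arms, key=lambda a: a.mean_reward, reverse=True)


if __name__ == "__main__":
    # Made-up effects for demonstration only: (metric change, runtime in seconds).
    simulated = {"encoder": (0.30, 60.0), "aux_loss": (0.02, 30.0), "dropout": (0.05, 20.0)}
    arms = [Arm(name) for name in simulated]
    for arm in run_bandit(arms, lambda name: simulated[name], budget=30):
        print(arm.name, arm.pulls, round(arm.mean_reward, 3))
```

A per-arm mean reward like the one tracked here corresponds loosely to the "mean reward" axis described for Fig 3, though the exact reward definition used in the paper is not given on this page.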
On three representative single-cell prediction models (covering graph-based, autoencoder, and disentangled representation approaches), the system achieved high success rates: it reproduced baselines far more often than prior agents, executed controlled ablations reliably, and identified the components that truly change performance with high accuracy. That combination makes it practical to automate rigorous attribution studies: teams can confirm which modules matter, prioritize fixes or improvements, and produce reproducible artifacts for verification. Limitations include domain dependence, the need for a runnable baseline, and compute costs for many ablation runs; extending to more domains and reducing cost are next steps.
Credibility Assessment:
Contains a high-profile author (Zhangyang Gao, h-index 27) indicating established expertise, but multiple low-h-index coauthors, no strong institutional affiliation listed, and only an arXiv preprint.