An AI That Re-runs Biology Code and Finds Which Parts Really Matter

Key Takeaway

Automatically reproducing a biology model and running controlled, isolated code edits can reliably identify which components drive performance — the system hits 88.9% end-to-end success and 93.3% accuracy in spotting critical parts.

Core Insights

An autonomous multi-agent system first reconstructs a runnable baseline (fixing environments, dependencies, and data issues), then runs targeted, isolated code mutations to test which components change model performance. It uses a graph-based workflow and an adaptive sampling strategy that balances expected impact with execution cost, and it leverages a domain knowledge base extracted from papers and code to propose sensible hypotheses. Across three diverse single-cell prediction models, the system completed reproduce-then-ablate studies reliably and matched or exceeded human and prior-agent baselines in both running experiments and identifying important components.
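As a rough illustration of what a graph-based workflow can look like, the sketch below runs a chain of stages in dependency order using Python's standard-library `graphlib`. The stage names, the `run_workflow` helper, and the placeholder handlers are all hypothetical; they are assumptions made for this example and are not taken from the system's code.

```python
from graphlib import TopologicalSorter

# Hypothetical stages; each key lists the stages that must finish before it.
WORKFLOW = {
    "reproduce_baseline": set(),
    "propose_hypotheses": {"reproduce_baseline"},
    "apply_mutation": {"propose_hypotheses"},
    "run_experiment": {"apply_mutation"},
    "score_component": {"run_experiment"},
}


def run_workflow(graph, handlers, state):
    """Run each stage after its dependencies, threading a shared state dict."""
    for stage in TopologicalSorter(graph).static_order():
        state = handlers[stage](state)
    return state


def make_handler(name):
    """Build a placeholder handler that only records that the stage ran."""
    def handler(state):
        return {**state, "trace": state["trace"] + [name]}
    return handler


if __name__ == "__main__":
    handlers = {name: make_handler(name) for name in WORKFLOW}
    final = run_workflow(WORKFLOW, handlers, {"trace": []})
    print(" -> ".join(final["trace"]))
```

Expressing the stages as a graph rather than a fixed script means a cyclic or missing dependency is detected before any experiment runs, since `TopologicalSorter` raises `CycleError` on cycles.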

Data Highlights

1. 96.3% reproduction task success rate (TSR), a +26.9 percentage point improvement over a prior repo-focused agent

2. 92.0% ablation TSR (controlled edits and measurements executed successfully), up +46.2 percentage points vs. a strong smaller-agent baseline

3. 93.3% accuracy in identifying performance-critical components, with 88.9% end-to-end workflow TSR overall
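To make the percentage-point comparisons concrete, here is a small sketch of the arithmetic. It assumes TSR is simply successful runs divided by attempted runs, which this page does not spell out, and the run counts in the usage lines are made up for illustration. The implied baseline figures follow directly from subtracting the stated gains from the reported rates.

```python
def task_success_rate(successes: int, attempts: int) -> float:
    """TSR as a percentage, assuming it means successes / attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def implied_baseline(reported_pct: float, gain_pp: float) -> float:
    """Baseline rate implied by a reported rate and a percentage-point gain."""
    return reported_pct - gain_pp


if __name__ == "__main__":
    # Hypothetical counts, only to show the calculation.
    print(task_success_rate(9, 10))
    # Baselines implied by the reported figures.
    print(round(implied_baseline(96.3, 26.9), 1))  # reproduction baseline
    print(round(implied_baseline(92.0, 46.2), 1))  # ablation baseline
```

Under that reading, the prior repo-focused agent reproduced about 69.4% of tasks and the smaller-agent baseline completed about 45.8% of ablations.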

What This Means

Machine-learning engineers and research teams who maintain or evaluate scientific codebases can use this approach to automate verification and attribution of model changes, reducing time spent on manual debugging. Technical leads and platform teams can adopt the reproduce-then-ablate pattern to scale reproducibility checks and decide where human attention is best spent during model refinement.

Key Figures

Figure 1: Motivation for AblateCell. Existing AI agents scale idea synthesis but not idea attribution. Automated ablation bridges this gap by systematically verifying which components truly matter.

Figure 2: Overview of the AblateCell reproduce-then-ablate framework. The system (i) reproduces baselines via planner-executor agents in a Docker container, then (ii) conducts autonomous ablation by selecting hypotheses with bandit sampling and executing them via a graph-based workflow in isolated worktrees, guided by domain knowledge throughout.

Figure 3: Component importance across BioLORD, CPA, and GEARS. The x-axis shows component indices (names omitted), and the y-axis shows mean reward.

Figure 4: An example of AblateCell applied to GEARS for end-to-end reproduce-then-ablate execution.

Considerations

The system was evaluated on single-cell perturbation prediction repositories with a prepared domain knowledge base; generalizing to other scientific domains may require building or adapting similar priors. It assumes a runnable or nearly runnable baseline repository; fully reimplementing a method from a paper was out of scope. Running many ablations is computationally costly, so users should budget compute and expect some human review of ambiguous or high-impact findings.

Methodology & More

The system runs in two phases: reproduce, then ablate. In the reproduction phase, planner and executor agents work inside containerized environments to configure dependencies automatically, fix common data and environment errors, and re-run the official training and inference pipelines to produce verifiable artifacts. In the ablation phase, the system turns the codebase into a graph of components, proposes mutation hypotheses grounded in a domain knowledge base parsed from papers and repositories, and executes each edit in a separate worktree so experiments don't interfere with one another. An adaptive bandit-style sampler chooses which mutations to run, using a reward that trades off likely performance impact against execution cost.

On three representative single-cell prediction models (covering graph-based, autoencoder, and disentangled-representation approaches), the system achieved high success rates: it reproduced baselines far more often than prior agents, executed controlled ablations reliably, and identified the components that truly change performance with high accuracy. That combination makes it practical to automate rigorous attribution studies: teams can confirm which modules matter, prioritize fixes or improvements, and produce reproducible artifacts for verification.

Limitations include domain dependence, the need for a runnable baseline, and the compute cost of many ablation runs; extending to more domains and reducing cost are next steps.
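The bandit-style sampling described in this section can be illustrated with a standard UCB1 sampler over candidate mutations. This is a minimal sketch under stated assumptions, not the system's actual sampler: the `reward` formula (absolute metric change minus a weighted cost), the `cost_weight` value, the mutation names, and the simulated outcomes are all hypothetical.

```python
import math


def reward(delta_metric: float, cost_hours: float, cost_weight: float = 0.1) -> float:
    """Hypothetical reward: size of the performance change minus a cost penalty."""
    return abs(delta_metric) - cost_weight * cost_hours


class UCBSampler:
    """UCB1 over candidate mutations (arms)."""

    def __init__(self, arms, exploration: float = 1.0):
        self.arms = list(arms)
        self.exploration = exploration
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.total = 0

    def select(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def score(arm):
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[arm])
            return self.means[arm] + self.exploration * bonus

        return max(self.arms, key=score)

    def update(self, arm, observed_reward: float) -> None:
        # Incremental running mean of the rewards seen for this arm.
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (observed_reward - self.means[arm]) / self.counts[arm]


# Simulated (metric change, cost in hours) per hypothetical mutation.
SIMULATED = {
    "mutation_a": (1.0, 1.0),  # large effect, cheap
    "mutation_b": (0.3, 1.0),  # small effect, cheap
    "mutation_c": (0.5, 5.0),  # moderate effect, expensive
}


def simulate(rounds: int = 200) -> UCBSampler:
    sampler = UCBSampler(SIMULATED)
    for _ in range(rounds):
        arm = sampler.select()
        delta, cost = SIMULATED[arm]
        sampler.update(arm, reward(delta, cost))
    return sampler


if __name__ == "__main__":
    print(simulate().counts)
```

In this simulation most of the budget goes to `mutation_a`, while the expensive `mutation_c` is sampled only often enough to confirm that its cost outweighs its effect. Real ablation outcomes are noisy, so a production sampler would need rewards from repeated runs rather than the fixed values used here.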

Credibility Assessment:

Contains a high-profile author (Zhangyang Gao, h-index 27) indicating established expertise, but multiple low-h-index coauthors, no strong institutional affiliation listed, and only an arXiv preprint.

multi-agent orchestration agent reliability reproduce-and-ablate single-cell models
