Key Takeaway
Evolving how multiple AI agents are organized and scored meaningfully improves success on computational science tasks: Mimosa reached a 43% task success rate by iteratively mutating and evaluating workflows, producing stable, auditable pipelines after several rounds.
Key Findings
Iteratively changing and testing multi-agent workflows lets an automated system discover better ways to complete data-driven scientific tasks. Using a loop that mutates a workflow, runs specialized agents, and scores the run, Mimosa raised task-level performance and produced workflows that converge to stable, reusable designs. Gains accumulate across early iterations but tend to plateau after about eight rounds, and effectiveness depends on the underlying language model. An archive mechanism for reusing past workflows exists but wasn’t exercised here because the benchmark tasks were too diverse.
Data Highlights
1. 43% success rate on the ScienceAgentBench benchmark using DeepSeek-V3.2.
2. Positive reward gains across iterations 2–9 with large effect sizes (Cohen’s d > 0.8); cohort 4 showed d = 2.75.
3. Observed improvements are statistically significant versus a null permutation distribution (p < 0.0001); gains plateau around iteration 8, with transitions 8→9 and 9→10 often regressing.
Why It Matters
Engineers building AI orchestration for scientific or data workflows can use iterative workflow evolution to improve automation and produce auditable pipelines. Teams operating automated labs or large-scale analysis pipelines can borrow the archive-and-mutate idea to speed reuse of proven workflows. Research leads evaluating agentic systems can use the results to decide whether to invest in workflow evolution, judge calibration, or multi-model configurations.
Key Figures

Figure 1: The Mimosa Framework. The system operates through five sequential layers: (0) Divide into sub-tasks — the planner divides the user objective into sub-tasks; (1) Discover Tools — available MCP servers are scanned and enumerated via Toolomics; (2) Initialize Workflow — the archive is queried for a similar prior task, and the best matching workflow is retrieved or synthesized from scratch; (3) Execute Workflow — specialized agents execute the task using discovered relevant tools; (4) Evaluate Workflow — a judge scores the execution and the meta-orchestrator mutates the workflow for the next iteration; the evaluated workflow is archived, and the loop repeats until the judge score exceeds 0.9 or the predefined number of iterations is reached.

Figure 2: Iterative workflow refinement via single-incumbent search. (1) Mutate Incumbent Workflow — at each iteration, the meta-orchestrator takes the best-performing workflow observed so far (the incumbent) together with its evaluation feedback and proposes a single local modification to generate a mutated workflow. (2) Execute Workflow — each node in the workflow is executed by a SmolAgent CodeAgent instance. (3) Evaluate Workflow — the judge evaluates the resulting execution trace across goal alignment, collaboration efficiency, output quality, and answer plausibility, and returns structured feedback with an overall score. (4) Archive and Select Incumbent — the evaluated workflow is archived, and the highest-scoring workflow observed so far is retained as the incumbent for the next iteration. If the new workflow does not improve the score, the previous incumbent is kept. The cycle repeats for a predefined number of iterations or until the judge score exceeds 0.9.
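The single-incumbent search in Figure 2 is essentially hill climbing with an archive. A minimal sketch is below; `mutate`, `execute`, and `judge` are hypothetical stand-ins for the meta-orchestrator, the SmolAgent executors, and the judge model, not the paper's actual interfaces.

```python
def evolve(initial_workflow, mutate, execute, judge,
           max_iters=10, target_score=0.9):
    """Hill-climb on a single incumbent workflow (sketch of Fig. 2)."""
    incumbent = initial_workflow
    incumbent_score = judge(execute(incumbent))
    archive = [(incumbent, incumbent_score)]       # every evaluated workflow is kept
    for _ in range(max_iters):
        if incumbent_score > target_score:         # early stop on a strong judge score
            break
        candidate = mutate(incumbent)              # one local modification
        score = judge(execute(candidate))
        archive.append((candidate, score))
        if score > incumbent_score:                # keep the better workflow,
            incumbent, incumbent_score = candidate, score
        # otherwise the previous incumbent carries over unchanged
    return incumbent, incumbent_score, archive
```

The key design choice mirrored here is that a regressing mutation is archived but never adopted, so the incumbent's score is monotonically non-decreasing across iterations.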

Figure 3: Reward gains from successive evolution iterations. Average change in reward relative to the previous iteration with SEM error bars. Data is pooled across runs from all evaluated models (GPT-4o, DeepSeek-V3.2, Claude Haiku 4.5); per-model breakdowns are discussed in the text.

Figure 4: Statistical validation of evolution efficacy. (a) Mean gain by cohort with 95% bootstrap confidence intervals. All cohorts (2–9) except cohort 10 show significant positive gains (green bars). (b) Probability of improvement (green) versus regression (red) at each transition. Early transitions show improvement rates >50%, while transitions 8→9 and 9→10 favor regression. (c) Permutation test distribution vs observed mean gain (red line). The observed gain significantly exceeds the null distribution (p < 0.0001), confirming improvement is not due to random chance. (d) Effect size (Cohen’s d) by cohort. Cohorts 2–9 achieve large effect sizes (d > 0.8), with cohort 4 showing the strongest effect (d = 2.75). All statistical tests are computed on pooled data across models; per-model analyses are deferred to a future revision.
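The two headline statistics in Figure 4 — Cohen's d against zero and a permutation test on mean gain — can be sketched as follows on per-task reward gains. The sign-flip null and the sample data are illustrative assumptions, not the paper's exact procedure or numbers.

```python
import random
from statistics import mean, stdev

def cohens_d(gains):
    """One-sample Cohen's d of per-task gains against zero."""
    return mean(gains) / stdev(gains)

def permutation_p(gains, n_perm=10_000, seed=0):
    """Sign-flip permutation test: how often does a random
    sign assignment produce a mean gain at least as large?"""
    rng = random.Random(seed)
    observed = mean(gains)
    hits = 0
    for _ in range(n_perm):
        flipped = [g * rng.choice((-1, 1)) for g in gains]
        if mean(flipped) >= observed:
            hits += 1
    return hits / n_perm
```

With mostly positive gains, the observed mean sits in the far right tail of the sign-flip null, yielding a small p-value, as in panel (c).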
Considerations
Evaluation focused on isolated computational tasks, not full closed-loop lab experiments, so real-world gains may differ when instruments and wet labs are involved. The judge that scores executions provides coarse directional feedback and may introduce biases; judge calibration and cross-model adjudication remain needed. All reported runs are single-seed, and archive retrieval wasn’t tested due to low task similarity, so variance and transfer benefits are not yet quantified.
Full Analysis
Mimosa is a layered framework that evolves multi-agent workflows for computational scientific tasks. It discovers available tools, initializes a workflow (from the archive or from scratch), executes sub-tasks with specialized agents, and uses a judge to score the resulting execution trace. A meta-orchestrator applies single local mutations to the best-known workflow (the incumbent), reruns the workflow, archives the result, and keeps the best-performing configuration.

The evaluation used ScienceAgentBench, a 102-task benchmark spanning biology, chemistry, geographic data, and cognitive science, run in task mode to isolate the orchestration layer’s contribution. Iterative evolution produced measurable gains: across the models tested, reward and success improved substantially during early iterations, with statistical significance versus random baselines. Workflows tended to converge after roughly eight refinement rounds, and improvements were model-dependent, pointing to an important interplay between orchestration strategies and model capabilities.

Key next steps include better judge signals (to avoid bias), multi-population search strategies to avoid premature convergence, seeding the archive with domain templates to speed reuse, and replicating runs to quantify variance. The framework emphasizes auditable, archived executions, making it a practical step toward reproducible, adaptive automation for computational science.
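The archive-retrieval step described above (query for a similar prior task, else synthesize from scratch) can be illustrated with a toy similarity lookup. The keyword-Jaccard metric and the 0.5 threshold are assumptions for illustration; the source does not specify Mimosa's actual similarity measure.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two task descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(archive, task, threshold=0.5):
    """Return the best-matching archived workflow, or None to
    signal that a workflow should be synthesized from scratch."""
    best = max(archive, key=lambda item: jaccard(item["task"], task),
               default=None)
    if best and jaccard(best["task"], task) >= threshold:
        return best["workflow"]
    return None  # low similarity: initialize from scratch
```

This also illustrates why retrieval went unused in the reported runs: when benchmark tasks are diverse, every lookup falls below the similarity threshold and each workflow is initialized fresh.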
Credibility Assessment:
One author has an h-index of roughly 40 (a top researcher), which qualifies the work for a 5-star rating despite unspecified affiliations and an arXiv-only venue.