The Big Picture
Generating and verifying small, executable tools while the model is answering a problem significantly improves accuracy on complex scientific tasks compared with relying on fixed tool libraries.
Key Findings
Evolving tools at test time—by breaking problems into sub-steps, synthesizing missing calculators or routines, verifying them, and adding reusable pieces to a live library—helps agents solve more hard, multi-step scientific problems. Starting from an empty library, the on-the-fly approach outperforms top static-tool systems and standard prompting on multiple benchmarks. The same process can adapt an existing domain library to a new field, reusing primitives rather than forcing full hand-built coverage. Overall, dynamic tool evolution increases both problem-solving accuracy and the practical reusability of generated tools.
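A minimal sketch of this loop, assuming a Python agent stack, is shown below; every name in it (ToolLibrary, decompose, synthesize_tool, verify_tool, solve) is an illustrative placeholder rather than the paper's actual API.

```python
# Illustrative sketch only: names and behaviors are placeholders,
# not the framework's actual API.
from typing import Callable, Optional

ToolFn = Callable[[dict], dict]


class ToolLibrary:
    """Live registry that starts empty and grows while problems are solved."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolFn] = {}

    def __len__(self) -> int:
        return len(self._tools)

    def retrieve(self, sub_goal: str) -> Optional[ToolFn]:
        # Exact-match lookup here; a real system would use semantic retrieval.
        return self._tools.get(sub_goal)

    def register(self, sub_goal: str, tool: ToolFn) -> None:
        self._tools[sub_goal] = tool


def decompose(problem: str) -> list[str]:
    # Placeholder: treat ';'-separated clauses as sub-goals. A real analyzer
    # would use an LLM to plan executable sub-goals.
    return [part.strip() for part in problem.split(";") if part.strip()]


def synthesize_tool(sub_goal: str) -> ToolFn:
    # Placeholder: a real synthesizer would generate executable code with an LLM.
    def tool(state: dict) -> dict:
        state[sub_goal] = f"result of {sub_goal!r}"
        return state
    return tool


def verify_tool(tool: ToolFn, sub_goal: str) -> bool:
    # Placeholder check: the candidate must run on a scratch state and
    # produce an entry for its sub-goal.
    try:
        return sub_goal in tool({})
    except Exception:
        return False


def solve(problem: str, library: ToolLibrary) -> dict:
    """Decompose, reuse or synthesize tools, verify, register, execute."""
    state: dict = {}
    for sub_goal in decompose(problem):
        tool = library.retrieve(sub_goal)
        if tool is None:
            candidate = synthesize_tool(sub_goal)
            if not verify_tool(candidate, sub_goal):
                raise RuntimeError(f"no verified tool for: {sub_goal}")
            library.register(sub_goal, candidate)
            tool = candidate
        state = tool(state)
    return state


if __name__ == "__main__":
    lib = ToolLibrary()
    print(solve("compute molar mass; convert mass to moles", lib))
    print(f"library now holds {len(lib)} reusable tools")
```

The design point this sketch captures is that every verified tool remains registered for later sub-goals and later problems, which is how the library can grow from empty during evaluation.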
Data Highlights
1. On the new SciEvo benchmark (1,590 test instances), the evolved tool library contains 925 generated tools covering physics, chemistry, math, and materials.
2. On SciBench, test-time tool evolution from scratch reached 0.45 accuracy, versus 0.37 for the strongest baseline (KTCE) and 0.34 for a domain-specific static agent (CheMatAgent).
3. On the SciEvo benchmark, the live-evolution method hit 0.62 accuracy vs. 0.56 for CheMatAgent and 0.55 for KTCE, and improved over basic chain-of-thought prompting by +0.29.
What This Means
For engineers building AI agents for scientific or engineering tasks, this reduces the need for exhaustive, hand-built tool libraries and helps agents handle novel problems. For technical leads deciding where to invest in agent infrastructure, on-demand tool generation trades up-front curation work for runtime compute and broader problem coverage. For researchers focused on AI for science, the approach suggests a path from passive tool selection to active, reproducible tool discovery.
Key Figures

Figure 1: Paradigm comparison: Static Tool Paradigm (left) vs. Test-Time Tool Evolution (right). Static approaches require pre-collected tool libraries, limiting coverage and domain adaptability. Test-time evolution starts with an empty library and generates tools on demand during problem-solving, enabling continuous evolution to new domains and problems.

Figure 2: The architecture of the Test-Time Tool Evolution (TTE) framework. The system operates through a closed-loop workflow comprising five integrated stages. (1) Structured Task Decomposition: The Problem Analyzer decomposes complex scientific queries into a sequence of executable sub-goals. (2) Dynamic Tool Retrieval: The system queries the Dynamic Tool Registry for existing atomic tools. If retrieval fails, it triggers (3) Generative Tool Synthesis: The Tool Synthesizer creates candidate tools on-the-fly, which undergo strict verification by the Tool Verifier. (4) Atomic Tool Refinement: Validated tools are decoupled into reusable atomic units by the Atomic Decomposer, filtered by the Redundancy Checker, and registered to update the library. (5) Runtime Execution Engine: Once the required tools are successfully retrieved or generated for all the steps, the Tool Executor executes the sequence to synthesize the final answer.
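To make the verification stage in (3) concrete, here is a hedged sketch of what a tool verifier might do: compile the synthesized source, expose only a small whitelist of globals, and run the candidate on test cases before it can be registered. The function verify_candidate, the whitelist, and the ideal-gas example are assumptions for illustration, and exec here is not a genuine security sandbox.

```python
import math


def verify_candidate(source: str, entry_point: str,
                     test_cases: list[tuple]) -> bool:
    """Return True if the candidate compiles and passes all (args, expected) cases."""
    namespace: dict = {"math": math}  # only a small whitelist of globals
    try:
        exec(compile(source, "<candidate_tool>", "exec"), namespace)
        fn = namespace[entry_point]
    except Exception:
        return False  # syntax error or missing entry point
    for args, expected in test_cases:
        try:
            got = fn(*args)
        except Exception:
            return False  # runtime failure on a test case
        if not math.isclose(got, expected, rel_tol=1e-6):
            return False  # wrong numerical result
    return True


# Hypothetical candidate tool and a test case the verifier might check.
candidate = '''
def ideal_gas_pressure(n, T, V, R=8.314):
    """Pressure in pascals from moles, temperature (K), and volume (m^3)."""
    return n * R * T / V
'''
cases = [((1.0, 300.0, 0.0249), 1.0 * 8.314 * 300.0 / 0.0249)]
print(verify_candidate(candidate, "ideal_gas_pressure", cases))  # True
```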

Figure 3: Tool distribution of the curated SciEvo benchmark. SciEvo covers 25 sub-disciplines across four major scientific fields: Physics (499 tools), Chemistry (192), Mathematics (171), and Materials (63), demonstrating comprehensive coverage of diverse scientific computational needs.

Figure 4: Accuracy comparison on SciEvo. We compare the “No Tool call” baseline against our TTE-Zero method using direct queries (“Q + Tools”) and Sub-goal Decomposition (“S + Tools”).
Considerations
Evolving tools during inference increases computational cost and response time compared with selecting from a fixed library, so production use needs strategies to skip evolution for trivial queries. Automatic generation raises dual-use and safety concerns; human review and strict filtering are required before releasing evolved tools. Benchmarks focus on precise, multi-step computational problems—results may not extend directly to noisy real-world experiments or tasks requiring physical lab execution.
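One way to limit that runtime overhead is a lightweight gate that answers trivial queries directly and routes only multi-step, computation-heavy ones into tool evolution. The heuristic below, including the needs_tool_evolution name, its regex-based step counting, and its threshold, is an illustrative assumption rather than anything the reported benchmarks prescribe; a production gate would more likely use a model-based difficulty estimate.

```python
import re


def needs_tool_evolution(question: str, max_direct_steps: int = 1) -> bool:
    """Heuristic gate: only multi-step, computation-heavy queries go to evolution."""
    # Rough step count from clause separators and sequencing words (assumed heuristic).
    steps = 1 + len(re.findall(r"\bthen\b|\bafter that\b|;", question.lower()))
    # Numeric content is a crude proxy for needing executable tools.
    has_numbers = bool(re.search(r"\d", question))
    return has_numbers and steps > max_direct_steps


queries = [
    "What is the boiling point of water at sea level?",
    "Compute the reaction enthalpy, then use it to estimate the "
    "equilibrium constant at 350 K; report both values.",
]
for q in queries:
    route = "tool evolution" if needs_tool_evolution(q) else "direct answer"
    print(f"{route:15s} <- {q}")
```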
Deep Dive
The approach replaces a fixed, pre-built toolbox with a closed-loop system that creates, verifies, refines, and reuses small executable tools while solving a problem. A five-stage workflow decomposes a complex question into sub-goals, checks a dynamic registry for matching primitives, synthesizes candidate tools when nothing fits, runs verification on those candidates, breaks validated tools into atomic reusable units, and executes the resulting sequence to produce the final answer. Two modes are highlighted: starting from an empty library (ab-initio tool synthesis) and adapting an existing library from one scientific domain to another.
Evaluations use three benchmarks, including a newly released SciEvo dataset of 1,590 instances and 925 evolved tools spanning physics, chemistry, math, and materials. The test-time evolution method establishes new state-of-the-art accuracy across these datasets (for example, 0.62 on SciEvo vs. ~0.55–0.56 for the top static systems) and shows more efficient tool-reuse patterns. Trade-offs include higher runtime cost and the need for robust verification and safety screening. The work shifts the paradigm for scientific agents from selecting pre-made tools to actively discovering small, verifiable computational primitives, an important step toward agents that can handle novel scientific problems without exhaustive manual tool curation.
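As a rough illustration of the refinement step mentioned above, the following sketch registers validated atomic tools only after a redundancy check. The source-similarity heuristic (difflib with a 0.85 threshold) and both helper names are assumptions standing in for whatever matching the actual Redundancy Checker performs.

```python
import difflib


def is_redundant(candidate_src: str, existing_srcs: list[str],
                 threshold: float = 0.85) -> bool:
    """Treat a candidate as redundant if it is near-identical to an existing tool."""
    return any(
        difflib.SequenceMatcher(None, candidate_src, src).ratio() >= threshold
        for src in existing_srcs
    )


def register_atomic_tools(validated_tools: list[tuple[str, str]],
                          library: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Add validated (name, source) atomic tools, skipping near-duplicates."""
    existing = [src for _, src in library]
    for name, src in validated_tools:
        if not is_redundant(src, existing):
            library.append((name, src))
            existing.append(src)
    return library


# The second candidate is a trivial rename of the first and gets filtered out.
tools = [
    ("molar_mass_h2o", "def molar_mass_h2o():\n    return 2 * 1.008 + 15.999\n"),
    ("water_molar_mass", "def water_molar_mass():\n    return 2 * 1.008 + 15.999\n"),
]
print([name for name, _ in register_atomic_tools(tools, [])])
```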
Credibility Assessment:
The authors show low h-indices overall and no strong institutional signals, and the work has a minimal citation count, indicating limited credibility.