
Key Takeaway

AI models can markedly improve math and coding accuracy by running four role-based agents—one that generates problems, one that plans, one that solves, and one that critiques—using only 500 seed examples and automatic verifiers to drive training.

Core Insights

A four-role setup (Challenger, Planner, Solver, Critic) lets a single model play multiple roles and co-evolve without large human datasets. The Challenger creates new, progressively harder problems, while the Solver is rewarded for verifier-confirmed correctness, producing a self-rewarding loop that expands and improves the training curriculum. The Critic enforces task quality and output format, which stabilizes training and reduces noisy examples. The method shows consistent gains across small-to-medium model sizes and generalizes to harder, out-of-distribution math and coding benchmarks.
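The loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the four role functions and the arithmetic "verifier" are hypothetical stand-ins for LLM calls and the ground-truth/unit-test checks described in the paper.

```python
import random

# Toy sketch of the four-role self-rewarding loop. In SAGE all four
# roles are played by one language model; here each role is a stub.

def challenger(seed):
    # Generate a harder variant of a seed arithmetic task.
    a, b = seed
    return (a + random.randint(1, 3), b + random.randint(1, 3))

def planner(task):
    a, b = task
    return [f"add {a} and {b}"]           # decompose into steps

def solver(task, plan):
    a, b = task
    return a + b                           # execute the plan

def critic(task, plan):
    return 1.0 if plan else 0.0            # crude quality/format score

def verifier(task, answer):
    a, b = task
    return answer == a + b                 # automatic correctness check

def self_evolve(seeds, rounds=3):
    pool = list(seeds)
    for _ in range(rounds):
        task = challenger(random.choice(pool))
        plan = planner(task)
        if critic(task, plan) < 0.5:       # Critic filters noisy tasks
            continue
        answer = solver(task, plan)
        if verifier(task, answer):         # verified solves expand the pool
            pool.append(task)
    return pool

pool = self_evolve([(1, 2)], rounds=5)
```

The key design point is that only verifier-confirmed examples enter the growing pool, so the curriculum expands without human labels.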

Data Highlights

1. Bootstrapped from only 500 seed examples drawn from math and code datasets.
2. Training and baseline comparisons used 200 fine-tuning steps; validation accuracy typically peaked around step 100–120 before declining.
3. Demonstrated improvements across multiple model sizes: experiments on 3B, 4B, and 7B parameter models.

What This Means

Engineers building autonomous AI agents or agent orchestration systems can use this to reduce dependence on large human-labeled datasets while still improving reasoning on verifiable tasks. ML teams focused on math or code assistants can use the approach to scale problem curricula automatically and improve out-of-distribution robustness. Technical leaders evaluating agent evaluation and trust systems can adopt the critic and format-reward ideas as quality controls.

Key Figures

Figure 1: Overview of the SAGE framework. Four specialized agents—Challenger, Planner, Solver, and Critic—interact through quality filtering and format validation to enable closed-loop self-evolution.
Figure 2: The SAGE training pipeline. (1) The Challenger generates questions from reference examples, filtered by the Critic for quality; (2) verified questions expand the dataset; (3) sampled questions are processed by the Planner and Solver to produce solutions; (4) all agents are jointly updated using Task-Relative REINFORCE++ with per-role advantage normalization.
Figure 3: Training dynamics on Qwen-2.5-3B. The Challenger steadily expands the question pool (bars) throughout training, while validation accuracy (line) reaches peak performance around step 100–120 before gradual decline, suggesting potential over-specialization on the self-generated curriculum.
Figure 4: Qualitative case study. The Challenger generates a math word problem, the Planner decomposes it into structured steps, the Solver executes the plan to produce the final answer, and the Critic provides quality scores for both the question and the plan.


Considerations

The method relies on verifiable domains where correctness can be checked automatically (math with ground-truth answers, code with unit tests), so it won't directly apply to subjective or open-ended tasks. It requires a modest seed set (here, 500 examples) to bootstrap learning; performance with fewer seeds is untested. Training can over-specialize to the self-generated curriculum, so monitor validation accuracy and apply early stopping or periodic human review to avoid degradation.
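The over-specialization risk noted above (validation accuracy peaking around step 100–120 and then declining, per Figure 3) suggests a simple patience-based early-stopping monitor. The class below is a generic sketch, not part of the SAGE codebase; `patience` counts evaluation rounds without a new best score.

```python
class EarlyStopper:
    """Stop training when validation accuracy stops improving.

    Hypothetical helper: `step()` returns True once `patience`
    consecutive evaluations have failed to beat the best score.
    """

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_rounds = 0

    def step(self, val_acc):
        if val_acc > self.best:
            self.best = val_acc        # new best: reset the counter
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1       # no improvement this round
        return self.bad_rounds >= self.patience

stopper = EarlyStopper(patience=2)
# An accuracy curve shaped like Figure 3: rise, peak, then decline.
curve = [0.40, 0.55, 0.62, 0.61, 0.58, 0.50]
stopped_at = next(i for i, acc in enumerate(curve) if stopper.step(acc))
```

With `patience=2`, training halts two evaluations after the peak, keeping the best checkpoint (here, accuracy 0.62) before the decline erodes it.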

Methodology & More

SAGE runs a closed-loop, four-role multi-agent process inside a single language model: the Challenger creates tasks, the Planner breaks them into steps, the Solver produces solutions, and the Critic scores task quality, plan clarity, and answer format. The loop works adversarially: when the Solver fails, the Challenger is rewarded for producing harder tasks; when the Solver succeeds, those verified examples expand the training pool. Rewards combine verifier correctness with a soft format score to keep outputs structured. Training uses a policy-gradient method (Task-Relative REINFORCE++) with per-role advantage normalization so each role's learning signal stays stable. In experiments on math and code, SAGE starts from 500 seed examples and is trained across multiple model sizes (3B, 4B, 7B). The system shows sample-efficient gains and better out-of-distribution performance on competition-level benchmarks than baselines that relied on human-curated data. Practical implications include automated curriculum generation, reduced labeling needs for verifiable tasks, and built-in quality controls via the Critic. Limitations include the need for automatic verifiers, the initial seed-set size, and the risk of over-specialization without monitoring; these should guide where and how SAGE is deployed.
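The per-role advantage normalization mentioned above can be sketched as follows. This is a minimal illustration of the general idea (centering and scaling each role's rewards against that role's own batch, so a high-variance role cannot swamp the shared policy update), not the exact Task-Relative REINFORCE++ formulation; the function name and inputs are assumptions.

```python
from statistics import mean, pstdev

def per_role_advantages(rewards_by_role):
    """Normalize rewards within each role's own batch.

    Each role (Challenger, Planner, Solver, Critic) gets advantages
    computed relative to its own reward distribution, so the scale
    of one role's rewards does not dominate the joint update.
    """
    advantages = {}
    for role, rewards in rewards_by_role.items():
        mu = mean(rewards)
        sigma = pstdev(rewards) or 1.0     # guard against zero variance
        advantages[role] = [(r - mu) / sigma for r in rewards]
    return advantages

adv = per_role_advantages({
    "challenger": [0.0, 1.0, 1.0, 0.0],    # binary difficulty rewards
    "solver":     [0.2, 0.8, 0.5, 0.5],    # correctness + format scores
})
```

After normalization each role's advantages have zero mean within its batch, which is what keeps the joint update balanced across roles.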
Credibility Assessment:

The authors have low h-indices (≤ 4) and no institutions are listed; the work is an arXiv preprint with no citations yet, indicating limited established reputation.