In Brief
Jointly improving how a question is asked and the model's instructions produces better AI answers; a dual-track multi-agent process lifted accuracy by up to 3.95% while often using very few training examples.
Key Findings
Optimizing question wording and prompt instructions together yields consistently better results than optimizing prompts alone. A planner breaks tasks into paired goals, and two specialist agents alternately propose and critique question reforms and prompt edits, while a mediator accepts only complementary improvements. The learned question-reformulation strategies transfer to inference, so refined questions plus optimized prompts produce higher accuracy with fewer optimization calls.
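A minimal sketch of the alternating accept-only-if-complementary loop described above. All function names and the scoring heuristic are hypothetical illustrations, not taken from the paper; in practice the proposers and scorer would themselves be LLM calls.

```python
def co_evolve(question, prompt, propose_question, propose_prompt, score, rounds=4):
    """Alternate question and prompt proposals; a mediator keeps an update
    only if the new (question, prompt) pair scores higher together."""
    best = score(question, prompt)
    for i in range(rounds):
        if i % 2 == 0:  # Question-Architect's turn
            candidate = (propose_question(question, prompt), prompt)
        else:           # Prompt-Architect's turn
            candidate = (question, propose_prompt(question, prompt))
        if score(*candidate) > best:  # Mediator: accept only complementary gains
            question, prompt = candidate
            best = score(question, prompt)
    return question, prompt
```

The key design point mirrored here is that the mediator scores the pair jointly, so an edit that helps the prompt but hurts the question (or vice versa) is rejected.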
Key Data
1. Up to 3.95% absolute improvement in task accuracy versus strong baselines across 12 benchmarks.
2. Evaluated on 12 diverse datasets (including BBH and MMLU), showing consistent gains across reasoning and knowledge tasks.
3. Often reached top accuracy using only a single training sample and fewer API calls, achieving the highest accuracy-per-cost in representative tests.
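The accuracy-per-cost framing in the data points above can be made concrete with a trivial ratio. This is only an illustrative metric; the paper's exact "prompt efficiency" formula is not given in this summary.

```python
def accuracy_per_cost(accuracy, api_calls):
    """Illustrative efficiency metric: accuracy per optimization API call.
    A method that nearly matches a rival's accuracy at a fraction of the
    calls scores far higher on this ratio."""
    if api_calls <= 0:
        raise ValueError("api_calls must be positive")
    return accuracy / api_calls

# Example: 82% accuracy from 40 calls beats 85% accuracy from 400 calls
# on this metric, even though raw accuracy is slightly lower.
```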
What This Means
Engineers building production AI agents who want more reliable answers—especially when cost or query budget is tight—can use joint question-and-prompt tuning to squeeze extra accuracy without huge data needs. Technical leaders deciding how to improve model outputs should consider investing in instruction-and-query co-design rather than tweaking prompts alone.
Key Figures

Fig 1: Comparison of three strategies for pronoun disambiguation. Left: the original question without prompt instructions yields an incorrect answer. Middle: adding a CoT prompt to the original question still fails. Right: Helix jointly optimizes question formulation and prompt instructions, producing the correct prediction.

Fig 2: Overview of the Helix framework, which uses six LLM-based agents for joint optimization of question reformulation and prompt instructions. The ① Planner decomposes the task into a sequence of helix objectives; dual-helix co-evolution alternates between the ② Prompt-Architect and ③ Question-Architect with ④ Mediator validation; and the ⑤ Question-Generator together with the ⑥ Question-Judge produces validated refined questions, which are paired with the optimized prompt and fed to the target LLM for inference.

Fig 3: Accuracy–cost trade-off across 12 tasks for different prompt optimization methods. Bubble size denotes the number of training samples; Helix achieves the highest accuracy with the fewest API calls using only a single sample.

Fig 4: Prompt efficiency (PE) comparison on four representative BBH tasks, where Helix consistently achieves the highest performance per optimization cost.
Yes, But...
The observed gains are modest (a maximum reported boost of 3.95%) and may vary by task or model family. Results come from benchmarks commonly used in research; performance on domain-specific or multilingual data is untested. The method also depends on additional model calls during training (though often few), so cost and latency trade-offs should be evaluated for your setup.
Methodology & More
Helix treats prompt optimization as a two-way problem: how you ask a question and what instructions the model gets are interdependent. A Planner agent first splits a task into a sequence of paired goals that target both question reformulation and prompt refinement. Two specialist agents, a Question-Architect and a Prompt-Architect, alternate proposing changes and critiquing each other's outputs; a Mediator approves only those updates that make the pair more complementary. During inference, a Question-Generator follows the learned reformulation strategy to rewrite new queries, and a Question-Judge checks the rewrites before pairing them with the optimized prompts for the target model.

Across 12 benchmarks (including challenging reasoning and knowledge tests), Helix consistently outperformed six strong baselines, showing up to 3.95% absolute accuracy gains and better accuracy per optimization cost. The approach produces interpretable reformulation strategies that generalize to unseen queries and often achieves top results with very few training samples and API calls. That makes Helix attractive for teams that need modest but reliable improvements without heavy data or compute investments, though practical deployment should weigh training-call costs and domain coverage.
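The inference-time flow described above can be sketched as a small pipeline. Every name here (the generator, judge, and target-model callables, the retry count) is a hypothetical stand-in for the corresponding agent, not an API from the paper:

```python
def answer(query, optimized_prompt, generate_rewrite, judge, target_llm, max_tries=3):
    """Inference-time sketch: rewrite the incoming query with the learned
    strategy, keep a rewrite only if the judge validates it, then pair the
    result with the optimized prompt for the target model."""
    refined = query                          # fall back to the original query
    for _ in range(max_tries):
        candidate = generate_rewrite(query)  # Question-Generator
        if judge(query, candidate):          # Question-Judge validation
            refined = candidate
            break
    return target_llm(optimized_prompt, refined)
```

Note the fallback: if no rewrite passes the judge, the original query is used, so validation can only improve on the unmodified baseline.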
Credibility Assessment:
All authors have low h-indexes, no affiliations or venue prestige (arXiv preprint), and zero citations — limited signals of credibility.