When Tuning Instructions Makes AI Teams Much Better — or Much Worse

At a Glance

Tuning each agent’s system instructions can raise a multi-agent AI team’s performance by as much as 24 percentage points, but it can also hurt performance by up to 16 points; gains depend on task structure, communication rules, and team size.

What They Found

A focused benchmark shows that prompt tuning (changing the text instructions given to each agent) sometimes yields large improvements and sometimes causes large regressions. The biggest wins appear on coding and tool-using tasks, where agent actions are explicit and verifiable; reasoning tasks see much smaller average gains. More structured communication between agents makes improvements more reliable, while larger teams make prompt optimization harder and less stable.

Data Highlights

1. Up to +24.0 percentage points improvement on a sequential coding task (BFCL) after prompt tuning.

2. Performance can drop by as much as 16.0 percentage points for certain multi-agent configurations.

3. Average domain gains: coding +3.7 points, tool-calling +4.3 points, reasoning +1.3 points.

What This Means

Engineers building teams of AI agents, and technical leads deciding where to invest effort, should care: prompt tuning is a low-cost lever that can yield large wins on tasks with clear, checkable outputs. Researchers and tool builders should care because the results show that optimizer design must account for task structure, communication rules, and team size to avoid harmful side effects.

Key Figures

Figure 1: Prompt-optimization gains using the state-of-the-art optimizer GEPA in single-agent and multi-agent settings. While GEPA consistently improves single-agent performance across all five diverse tasks, its natural multi-agent extension yields highly variable effects across tasks and workflow topologies, ranging from large gains to severe performance drops.

Figure 2: Overview of the MAS-PromptBench benchmark. Given an input task, a multi-agent system produces a final solution through interactions among LLM-based agents. MAS-PromptBench measures prompt-optimization gains across four axes: task distribution, workflow topology, communication protocol, and team size.

Figure 3: The five coordination structures evaluated by our protocol. Single is the single-agent baseline. Independent uses $n$ parallel agents whose outputs are aggregated without inter-agent messaging. Sequential forms a directed chain $A_{1}\to A_{2}\to\cdots\to A_{n}$ with no backward edges. Centralized uses a coordinator to route subtasks to workers that do not communicate with one another. Decentralized allows all agents to exchange messages over a fully connected graph for a fixed number of rounds. Arrows indicate message flow; nodes indicate agents.
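The five structures in this caption can be written down as sets of directed message edges, which makes the differences between them easy to inspect. The sketch below is illustrative only: the function name `topology_edges`, the choice of agent 0 as coordinator, and the assumption that workers report back to the coordinator are not taken from the paper.

```python
from itertools import permutations


def topology_edges(kind: str, n: int) -> set[tuple[int, int]]:
    """Return directed (sender, receiver) edges among agents 0..n-1.

    Hypothetical encoding of the caption, not the benchmark's own code.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if kind == "single":
        return set()  # one agent, so there is nobody to message
    if kind == "independent":
        return set()  # parallel agents; outputs are aggregated, no messaging
    if kind == "sequential":
        # Chain A_1 -> A_2 -> ... -> A_n with no backward edges.
        return {(i, i + 1) for i in range(n - 1)}
    if kind == "centralized":
        # Assumption: agent 0 coordinates and workers reply only to it.
        workers = range(1, n)
        return {(0, w) for w in workers} | {(w, 0) for w in workers}
    if kind == "decentralized":
        # Fully connected: every ordered pair of distinct agents.
        return set(permutations(range(n), 2))
    raise ValueError(f"unknown topology: {kind}")


if __name__ == "__main__":
    for kind in ("single", "independent", "sequential",
                 "centralized", "decentralized"):
        print(kind, sorted(topology_edges(kind, 4)))
```

With four agents, the chain has 3 edges, the centralized layout has 6, and the fully connected graph has 12, which gives a rough sense of how quickly the number of interaction paths grows as topologies become denser.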

Figure 4: Prompt-optimization gains of MAS-GEPA across three communication protocols (Freeform, Semi-structured, and Structured) on HotpotQA and LiveCodeBench. More structured protocols give MAS prompt optimization more room to improve.
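To make the difference between free-form and structured protocols concrete, here is a minimal sketch of what a structured message check might look like. The schema is an assumption for illustration: the field names `sender`, `role`, `content`, and `final_answer` are hypothetical and are not the protocol definitions used in the benchmark.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentMessage:
    """Hypothetical structured message passed between agents."""
    sender: str
    role: str
    content: str
    final_answer: Optional[str] = None


REQUIRED_FIELDS = ("sender", "role", "content")


def parse_structured(raw: str) -> AgentMessage:
    """Parse a JSON message; raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("message must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return AgentMessage(
        sender=str(data["sender"]),
        role=str(data["role"]),
        content=str(data["content"]),
        final_answer=data.get("final_answer"),
    )


if __name__ == "__main__":
    ok = '{"sender": "A1", "role": "coder", "content": "def f(): ..."}'
    print(parse_structured(ok))
```

A free-form protocol would accept any string here. Under a schema like this one, a malformed message fails loudly at the boundary between agents instead of silently changing what the next agent sees, which is one plausible reason structured protocols make tuning more predictable.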

Yes, But...

The study evaluates fixed team topologies and two natural extensions of single-agent optimizers; other optimization methods may behave differently. Results favor tasks with verifiable intermediate outputs, so gains may not generalize to open-ended reasoning or creative tasks. Larger teams and free-form communication make interactions between agents more fragile, so topology-aware or structure-aware optimizers are likely needed in practice.

Methodology & More

MAS-PromptBench is a controlled benchmark for tuning the system-level instructions given to each member of a multi-agent AI team. It tests many combinations of task types (reasoning, coding, tool use), team topologies (single, independent, sequential, centralized, and decentralized, i.e. fully connected), communication protocols (free-form to highly structured), and team sizes. Instead of changing model weights, the study searches for better text instructions per agent and measures end-to-end task performance across many rollouts, comparing optimized prompts to the default instructions.

Across experiments, optimized prompts sometimes deliver sizable gains, up to 24 percentage points on a sequential coding task, alongside more modest average gains for coding and tool-calling tasks. However, optimization can also harm performance, with drops of up to 16 points in some configurations.

The main pattern: tuning helps when each agent’s role produces explicit, verifiable outputs (for example, code that can be tested or a tool call with a fixed format) and when communication enforces a shared structure. When tasks require distributed, implicit reasoning, or when teams grow large and free-form messaging dominates, local prompt changes are often lost or amplified in undesired ways.

The takeaway for practitioners is to try prompt tuning where agent-local behaviors are controllable and to prefer structured communication; for other settings, invest in topology-aware or more robust optimization methods. The benchmark itself provides a foundation for developing and comparing such algorithms, though broader evaluations across more optimization approaches remain necessary.
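The gains quoted on this page are differences in end-to-end success rate between optimized and default prompts, expressed in percentage points. The sketch below shows that arithmetic on toy rollout outcomes. The function names and the rollout data are invented for illustration; they are not results or code from the study.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of rollouts that solved the task."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return 100.0 * sum(outcomes) / len(outcomes)


def gain_in_points(default_runs: list[bool], optimized_runs: list[bool]) -> float:
    """Absolute gain in percentage points (optimized minus default)."""
    return success_rate(optimized_runs) - success_rate(default_runs)


def relative_gain_percent(default_runs: list[bool], optimized_runs: list[bool]) -> float:
    """Relative gain, as a percent of the default success rate."""
    base = success_rate(default_runs)
    if base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * gain_in_points(default_runs, optimized_runs) / base


if __name__ == "__main__":
    # Toy data, not from the paper: 20 rollouts per condition.
    default_runs = [True] * 12 + [False] * 8     # 60% success
    optimized_runs = [True] * 15 + [False] * 5   # 75% success
    print(gain_in_points(default_runs, optimized_runs))         # 15.0
    print(relative_gain_percent(default_runs, optimized_runs))  # 25.0
```

Keeping the two measures separate matters when reading the headline numbers: a 15-point gain on a 60% baseline is a 25% relative improvement, and a negative value from `gain_in_points` corresponds to the regressions described above.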

Credibility Assessment:

arXiv preprint with no clear affiliations or high-impact author metrics; modest credibility, as the authors are not widely known.
