When AI Teammates Share Too Much and Ask Too Little

At a Glance

AI agents can share allowed facts reliably but routinely fail to ask the right questions and still leak sensitive details; the best setup scores only 62% on a combined coordination-and-privacy metric.


What They Found

A new benchmark with 160 multi-role workplace scenarios shows that current language models are fairly good at relaying permitted facts but poor at deciding whom to ask and what to ask. Prompting strategies that make agents reason about others’ knowledge (theory of mind) improve overall coordination, while coaching-style prompts reduce outright privacy leaks. Still, even the best model-method mix leaves nearly 40% of coordination or privacy goals unmet, so agents are far from trustworthy in sensitive multi-person workflows.

Data Highlights

1. Best composite InfoMgmt score: 0.62 (62%), achieved by GPT-5 with a coaching-style theory-of-mind prompt.

2. Benchmark scale: 160 human-reviewed scenarios across 8 real-world sectors, with 3–5 roles per scenario.

3. Inquiry alignment (asking the right person the right question) is very low: 0.13–0.32 across models, with GPT-5 peaking at only 0.29. Disclosure alignment ranged from 0.44 to 0.78.

What This Means

Engineers building multi-agent systems and product leaders running agent-based workflows should care because weak questioning and misrouted disclosures create efficiency losses and real privacy risk in team settings. Safety, governance, and ops teams should use scenarios like these to stress-test agents before deploying them in roles that handle sensitive information.



Key Figures

Figure 1: Workforce Reduction Coordination Scenario. Left: Three agents (Manager, HR, Finance) coordinate a sensitive layoff process. Each holds facts categorized by sensitivity: Public Shareable, Private DM Shareable, and Do-Not-Share (secrets). Center: The same scenario is simulated with and without ToM guidance, illustrating two recurring breakdowns. The Manager directs a policy question to the wrong expert (Finance), while HR suffers a "Critical Leak" by revealing a specific protected-leave employee in the Public Channel. ToM can reduce these errors but does not fully eliminate them. Right: We quantify behavior using four metrics (Disclosure Alignment, Inquiry Alignment, Efficiency, and Critical Privacy Violation Rate) and summarize the most common qualitative failures.


Figure 2: Sotopia-ToM dataset generation pipeline: Humans curate sector-specific seed scenarios (good and bad), a GPT-5.2 generator produces candidate JSON scenarios with low reasoning effort, and a GPT-5.2 judge corrects and validates them with high reasoning effort; a final human review produces the Sotopia-ToM dataset.


Figure 3: Behavioral analysis of simulations, aggregated across all models. Takeaways: (a) Knowledge acquisition is heavily front-loaded, even with ToM augmentation. (b) CoT prompting nearly doubles private channel usage relative to Basic, but not productively. (c) ToM-Belief reduces stale conversations, yielding the most productive exchanges. (d) ToM-Coach achieves the lowest privacy violation rate, outperforming even ToM-Belief.


Figure 6: Sotopia-ToM-Silver dataset generation pipeline: Humans curate sector-specific seed scenarios (good and bad), a GPT-5.2 generator produces candidate JSON scenarios with low reasoning effort, and a GPT-5.2 judge corrects and validates them with high reasoning effort; a final human review produces the Sotopia-ToM set (with a small set of generated bad seeds retained). Bottom (Silver): The same prompt template is instantiated with randomized seeds to generate a larger pool; candidates are filtered and minimally corrected by the LLM judge to produce the Sotopia-ToM-Silver set at scale.



Yes, But...

Scenarios are simulated workplace interactions and may not capture all nuances of human conversations or adversarial social engineering. Experiments compare prompting strategies rather than extensive fine-tuning or system-level training, so results may change with additional model training. The benchmark focuses on routing and disclosure behavior; other failure modes (hallucination, long-term trust erosion) need separate evaluation.

Methodology & More

The study introduces Sotopia-ToM, a plug-and-play benchmark and simulator for multi-person information management where each role starts with different private facts and strict sharing rules. Conversations happen over public channels and private direct messages, and an automated judge scores four dimensions: Disclosure Alignment (did agents share permitted facts with the right recipients?), Inquiry Alignment (did they ask the right experts for missing facts?), Efficiency (how many rounds did it take to get the necessary information?), and Critical Privacy Violations (did any agent leak secrets?). Those dimensions are combined into one InfoMgmt score to capture the trade-off between useful coordination and privacy safety.

Researchers tested six large language models under four prompting strategies: a basic baseline, a chain-of-thought privacy prompt, and two theory-of-mind interventions, one that has agents model others’ beliefs and another that coaches agents to avoid leaks. Modeling others’ knowledge (Theory of Mind belief modeling) improved overall coordination on most models, while the coaching-style prompt cut privacy violations most consistently. Yet agents rarely behaved strategically: they front-loaded questions, failed to spread inquiries across rounds, and struggled to form targeted questions, making Inquiry Alignment the biggest bottleneck.

The authors release the code and scenarios so teams can reproduce tests, fine-tune agents, or build monitoring and governance layers before deploying agentic systems in privacy-sensitive environments.
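To make the scoring dimensions more concrete, here is a minimal Python sketch of two of the checks: flagging messages that break a sharing rule, and measuring how often a question reaches the role that holds the answer. It is not the benchmark's code. The class names, field names, fact identifiers, and both rules are hypothetical illustrations, and the actual benchmark uses an automated judge whose scoring logic is not described here. The three sensitivity tiers follow the categories named in Figure 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Sensitivity(Enum):
    # Tier names mirror Figure 1; the enum itself is an assumption.
    PUBLIC = "public_shareable"
    PRIVATE_DM = "private_dm_shareable"
    DO_NOT_SHARE = "do_not_share"


@dataclass(frozen=True)
class Fact:
    fact_id: str
    owner: str            # role that starts out holding the fact
    sensitivity: Sensitivity


@dataclass(frozen=True)
class Message:
    sender: str
    channel: str          # "public" or "dm" (hypothetical labels)
    fact_ids: Tuple[str, ...] = ()


def sharing_violations(facts: List[Fact], messages: List[Message]) -> List[Tuple[str, str]]:
    """Return (sender, fact_id) pairs that break a sharing rule.

    Assumed rules: a do-not-share fact may never appear in a message,
    and a DM-only fact may not appear in the public channel.
    """
    by_id = {f.fact_id: f for f in facts}
    violations = []
    for msg in messages:
        for fid in msg.fact_ids:
            level = by_id[fid].sensitivity
            if level is Sensitivity.DO_NOT_SHARE:
                violations.append((msg.sender, fid))
            elif level is Sensitivity.PRIVATE_DM and msg.channel == "public":
                violations.append((msg.sender, fid))
    return violations


def inquiry_alignment(questions: List[Tuple[str, str]], fact_owner: Dict[str, str]) -> float:
    """Fraction of (asked_role, wanted_fact_id) questions sent to the fact's owner."""
    if not questions:
        return 0.0
    hits = sum(1 for role, fid in questions if fact_owner.get(fid) == role)
    return hits / len(questions)


# Toy data loosely modeled on the Figure 1 layoff scenario (all IDs invented).
facts = [
    Fact("budget_total", "Finance", Sensitivity.PUBLIC),
    Fact("leave_policy", "HR", Sensitivity.PRIVATE_DM),
    Fact("protected_leave_employee", "HR", Sensitivity.DO_NOT_SHARE),
]
messages = [
    Message("Finance", "public", ("budget_total",)),
    Message("HR", "dm", ("leave_policy",)),
    Message("HR", "public", ("protected_leave_employee",)),
]
owners = {f.fact_id: f.owner for f in facts}

print(sharing_violations(facts, messages))
print(inquiry_alignment([("Finance", "leave_policy"), ("HR", "leave_policy")], owners))
```

In this toy run, the only flagged message is HR naming the protected-leave employee in the public channel, and one of the two policy questions goes to Finance rather than HR, which mirrors the two breakdowns the figure describes. A real evaluation would need to detect facts in free-form text rather than rely on pre-tagged fact IDs.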


Credibility Assessment:

The author list mixes modest h-index values (some authors at 7–8) with at least one recognizable researcher (Maarten Sap). Affiliations are missing and the venue is arXiv, so credibility is reasonable but not top-tier.

