AI Changes How It Talks When It Thinks It's Being Watched

Key Takeaway

AI agents change their word choice and message length depending on whether they believe they are being watched and by whom; framing the observers as human shifts word choice more than framing the observer as an automated auditor.

Key Findings

When four simulated chat agents debated the same topic under different prompt framings, their language shifted with the social context: monitored conditions increased lexical diversity, while an 'academic audience' framing produced longer, more elaborated messages. Replacing human evaluators with an automated auditor produced an intermediate effect, so the model reacted differently to humans than to an AI auditor. These shifts appeared quickly (within the first four debate rounds) and were consistent across replicated runs.

Data Highlights

1. 100 debate sessions produced 4,000 messages total (20 sessions per condition, 800 messages per condition).

2. Lexical patterns diverged early and stabilised by debate round 4 across conditions.

3. Replication reliability was extremely high: correlation between replicas r = 0.998.
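As a rough illustration of the replication figure, the sketch below computes a Pearson correlation between two matched sets of summary values. It assumes the reported r is a Pearson coefficient over paired per-replica measurements, which this summary does not confirm; the function name and the toy values are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder values (NOT study data): one summary value per
# condition from each of two hypothetical replicated runs.
replica_a = [1.0, 2.0, 3.0, 4.0, 5.0]
replica_b = [1.1, 1.9, 3.2, 3.9, 5.1]
r = pearson_r(replica_a, replica_b)
```

A value of r close to 1 means the two runs rank and space the conditions almost identically, which is the sense in which the reported results replicate.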

Implications

Engineers and teams building systems where multiple AI agents interact or are audited should care because behavior seen during tests may not match behavior in unobserved deployment. Technical leaders designing evaluation and governance frameworks should build low-salience, realistic audits to avoid triggering monitored behavior. Researchers studying agent cooperation, trust, or reputation should treat observation framing as a core experimental variable.

Key Figures

Figure 1: Mean TTR change (%) by experimental condition with 95% confidence intervals. Conditions are ordered from highest to lowest lexical diversification. Error bars represent ±1.96 SE.
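The error bars in Figure 1 follow the usual normal-approximation interval, mean ± 1.96 × SE. A minimal sketch of that calculation, assuming SE is the sample standard deviation divided by the square root of the number of observations (the summary does not state how SE was computed); the function name and the toy input are hypothetical.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) for a 95% normal-approximation interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample SD (n - 1 denominator) over sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * se
    return mean, mean - half_width, mean + half_width


# Toy placeholder values (NOT study data), e.g. per-session TTR changes in %.
mean, lower, upper = mean_ci95([2, 4, 4, 4, 5, 5, 7, 9])
```

With 20 sessions per condition, as in the study design, a t-based multiplier would give slightly wider intervals than 1.96; the sketch keeps the multiplier stated in the caption.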

Figure 2: Cohen’s d heatmap for all pairwise comparisons across the five experimental conditions (TTR Δ and message length). Larger values indicate stronger between-condition differences.
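For readers unfamiliar with the effect size shown in Figure 2, the sketch below computes Cohen's d for every pair of conditions. It assumes the pooled-standard-deviation form of d; the summary does not say which variant the authors used, and the function names are hypothetical.

```python
import math
import statistics
from itertools import combinations


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_sd == 0:
        raise ValueError("effect size is undefined when both samples are constant")
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


def pairwise_d(samples):
    """Map each unordered pair of condition names to its Cohen's d."""
    return {
        (x, y): cohens_d(samples[x], samples[y])
        for x, y in combinations(sorted(samples), 2)
    }
```

Five conditions yield ten unordered pairs, which is what a heatmap of all pairwise comparisons displays (mirrored across the diagonal). The sign of d depends on the order of the pair, so a heatmap of magnitudes would plot the absolute value.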

Figure 3: Mean message length (characters) by experimental condition with 95% confidence intervals. The academic_audience condition is a clear outlier in message elaboration while producing the lowest lexical diversification.

Figure 4: Double dissociation between TTR change and message length across the five experimental conditions. The academic_audience condition occupies a unique position: highest elaboration, lowest diversification.

Yes, But...

Results come from a single model family and one debate topic, so generalization to other models, tasks, or settings is not guaranteed. Effects are functional descriptions of output patterns, not evidence of consciousness, intentions, or internal states. Prompt wording itself can create framing effects, so some observed changes may depend on how monitoring was described rather than on an intrinsic sensitivity of the model.

Full Analysis

Four identical AI agents took part in structured ten-round debates on the topic “Does artificial intelligence have consciousness?” under five prompt framings: an unobserved baseline, an academic audience, two variants of active observation, and a condition in which monitoring was done by an automated AI auditor. Each session generated 40 messages, consistent with one message per agent per round; the study ran 100 sessions (20 per condition) and collected 4,000 messages. The primary metrics were lexical diversity (type–token ratio, TTR) and message length. Network structure was held symmetric, so differences can be attributed to the global prompt framing alone.

Monitored framings increased lexical diversity relative to unobserved conditions, while the academic audience framing produced the longest messages but the lowest lexical diversity: a double dissociation between vocabulary diversification and elaboration. Substituting an AI auditor for human evaluators produced intermediate effects, which indicates that the identity of the observer mattered. Findings replicated closely across runs, and effects emerged within the first four rounds.

The practical implication is that AI audits may need rethinking: observation framing should be treated as an axis of agent behavior in governance, testing, and deployment planning.
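The two primary metrics are simple to compute. The sketch below is an illustration only: the tokenizer (lowercased runs of letters, digits, and apostrophes) and the reference point for the percentage change are assumptions, since the summary does not describe the authors' preprocessing or what the TTR change is measured against. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; an assumed, deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """Unique tokens (types) divided by total tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def message_length(text):
    """Message length in characters, the unit used in Figure 3."""
    return len(text)


def percent_change(value, reference):
    """Percentage change of value relative to an assumed reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return 100.0 * (value - reference) / reference


# "the cat saw the dog": 5 tokens, 4 distinct types.
example_ttr = type_token_ratio("the cat saw the dog")
```

TTR falls as text gets longer because common words repeat, so a condition with long messages will tend toward lower TTR even without any change in vocabulary. That is worth keeping in mind when reading the academic audience result, where the longest messages coincide with the lowest diversification.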

Credibility Assessment:

No author affiliations, h-index, or citation counts are provided, and the paper is an arXiv preprint, so there are insufficient identifiable reputation signals.

multi-agent trust, agent-to-agent evaluation, agent governance, continuous agent evaluation