How to Test Whether an AI ‘Town’ Behaves Predictably

The Big Picture

A synthetic population driven by a large language model can be made controllable: a 120-person fictional town reliably reproduces an imposed trust structure and responds predictably to graded messages, letting you calibrate the instrument before studying real communities.

The Evidence

A minimal, fictional municipality (Montelago) with 120 synthetic personas and a withheld, known attitudinal structure was used to test whether an LLM-driven population responds in an ordered, replicable way to seven message variants. Across three randomness settings, the system passed seven pre-registered criteria for fidelity, stability, noise floor, specificity, and sensitivity. The primary trust measure recovered the designed latent structure strongly, the instrument’s noise was measured rather than assumed, and micro-level transcripts revealed coherent persona-specific reactions that aggregate numbers miss. This aligns with the Evaluation-Driven Development (EDDOps) approach.

Data Highlights

1. Primary trust and credibility measures exceeded the pre-registered fidelity threshold: correlation r > 0.80 across the full temperature sweep.

2. Instrument noise floor measured at about 0.77 scale points for a fixed profile, with cross-agent spread ≈ 1.4 scale points.

3. Baseline group means recover the imposed latent ordering (at t = 0.2): LOW = 2.58, MED = 5.30, HIGH = 7.08, showing monotone recovery of attitude structure.
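
The two fidelity checks in the highlights above (monotone group ordering and a correlation threshold) can be sketched as follows. This is a minimal illustration, not the study's code: the per-persona `latent` and `observed` lists are hypothetical toy values, the function names are invented here, and treating fidelity as a Pearson correlation between encoded latent level and observed score is an assumption. Only the three group means and the 0.80 threshold come from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Baseline group means at t = 0.2, as reported in the highlights.
group_means = {"LOW": 2.58, "MED": 5.30, "HIGH": 7.08}
ordered_means = [group_means[g] for g in ("LOW", "MED", "HIGH")]
monotone_ok = is_strictly_increasing(ordered_means)

# Hypothetical toy data: encoded latent level (1 = LOW, 2 = MED, 3 = HIGH)
# and the trust score each persona reported before any stimulus.
latent = [1, 1, 1, 2, 2, 2, 3, 3, 3]
observed = [2, 3, 3, 5, 5, 6, 7, 7, 8]

FIDELITY_THRESHOLD = 0.80  # pre-registered threshold from the text
r = pearson_r(latent, observed)
fidelity_ok = r > FIDELITY_THRESHOLD

print(f"monotone ordering recovered: {monotone_ok}")
print(f"fidelity r = {r:.3f}, passes threshold: {fidelity_ok}")
```

In a real run the same two checks would be repeated at each temperature setting, since the criterion is stated for the full sweep.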

What This Means

Engineers building AI-driven simulation tools can use this approach to test whether their synthetic populations actually follow the structure they impose before using them for policy or communication work. Technical leads and researchers running policy experiments or institutional communications studies benefit because calibration catches stimulus design flaws and quantifies the instrument’s sensitivity and noise. For example, teams exploring governance simulations may find the approach relevant to use cases such as Multi-Agent Government Services.

Key Figures

Figure 1: Distribution of PRE fiducia_istituzione scores by latent group (LOW / MED / HIGH) across the temperature sweep, condition POS, n = 40 per group. Boxes show interquartile range; horizontal lines show medians. The three groups are well separated and stable across temperatures, confirming monotone recovery of the latent structure (C1).

Figure 2: Mean Δ (POST − PRE) on fiducia_istituzione by condition and temperature sweep (t ∈ {0.2, 0.5, 0.7}, DeepSeek deepseek-chat, n = 120). Conditions ordered left to right by observed delta at t = 0.2 (same order as Table 6). The dashed horizontal line marks Δ = 0.
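
The quantity plotted in Figure 2, the mean POST − PRE change per condition and temperature, can be computed along these lines. This is a sketch under assumptions: the record layout, the function name `mean_delta_by_cell`, the `NEG` and `PLACEBO` condition labels, and all scores are hypothetical; only the `POS` label, the placebo idea, and the temperature value 0.2 appear in the source.

```python
from collections import defaultdict
from statistics import mean

def mean_delta_by_cell(records):
    """Average POST - PRE change for each (condition, temperature) cell."""
    cells = defaultdict(list)
    for rec in records:
        key = (rec["condition"], rec["temperature"])
        cells[key].append(rec["post"] - rec["pre"])
    return {key: mean(deltas) for key, deltas in cells.items()}

# Hypothetical toy records: one row per persona response.
records = [
    {"condition": "POS", "temperature": 0.2, "pre": 5, "post": 6},
    {"condition": "POS", "temperature": 0.2, "pre": 3, "post": 5},
    {"condition": "PLACEBO", "temperature": 0.2, "pre": 5, "post": 5},
    {"condition": "PLACEBO", "temperature": 0.2, "pre": 7, "post": 7},
    {"condition": "NEG", "temperature": 0.2, "pre": 6, "post": 4},
    {"condition": "NEG", "temperature": 0.2, "pre": 4, "post": 3},
]

deltas = mean_delta_by_cell(records)

# Order cells by observed delta, mirroring the figure's left-to-right layout.
ordered_cells = sorted(deltas, key=deltas.get)
for cell in ordered_cells:
    print(cell, deltas[cell])
```

Keeping temperature in the cell key makes it straightforward to check whether the ordering of conditions holds at each setting rather than only on average.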

Yes, But...

Results speak to internal validity only: the experiment shows the instrument tracks its own designed structure, not that it matches real human behaviour. The setup is deliberately minimal (no spatial embedding, no social networks, no mobility) and used a single model and population size, so results may change with different models or richer environments. One stimulus initially behaved opposite to intent and had to be recalibrated, underscoring that stimulus design interacts with the instrument and must be validated too. The potential risk of Inter-Agent Miscommunication should be considered in future work.
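
The stimulus problem described above, a message that moved trust in the opposite direction to its design, is the kind of failure a simple sign check can surface. The sketch below is illustrative only: the function name, the condition labels other than `POS`, and the observed deltas are hypothetical, and using the reported noise floor of 0.77 as the tolerance for a placebo is an assumption rather than the study's stated rule.

```python
def direction_mismatches(intended_sign, observed_delta, tolerance=0.0):
    """Return stimuli whose observed change contradicts the intended direction.

    intended_sign: +1 (should raise trust), -1 (should lower it), 0 (no effect).
    tolerance: largest absolute change still accepted as 'no effect'.
    """
    flagged = []
    for name, sign in intended_sign.items():
        delta = observed_delta[name]
        if sign == 0:
            if abs(delta) > tolerance:
                flagged.append(name)
        elif delta * sign <= 0:
            flagged.append(name)
    return flagged

# Hypothetical design and hypothetical observed mean deltas.
intended = {"POS": +1, "PLACEBO": 0, "NEG": -1}
observed = {"POS": -0.4, "PLACEBO": 0.05, "NEG": -1.2}

flagged = direction_mismatches(intended, observed, tolerance=0.77)
print("stimuli needing recalibration:", flagged)
```

A flagged stimulus does not say whether the message or the instrument is at fault; it only marks where the two disagree and a closer look at transcripts is warranted.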

Methodology & More

Methodology: Built Montelago, a fictional municipality with 120 synthetic personas whose attitudinal profiles were encoded and hidden from the language model. Each persona saw one of seven short municipal messages about the water network; the campaign ran across three randomness settings (temperatures 0.2, 0.5, 0.7). Seven validation criteria were pre-registered to measure fidelity (does the output reflect the encoded attitudes?), stability (are pre-stimulus responses consistent?), noise floor (what is the smallest detectable change?), specificity (does an unrelated placebo do nothing?) and sensitivity (do graded messages produce graded responses?). The primary outcome was an institutional trust item with a pre-registered fidelity threshold of r > 0.80.

Findings and implications: Under the tested model, all seven criteria passed across the temperature sweep. The instrument’s noise floor (~0.77) and cross-agent variability (~1.4) were measured, not assumed, giving concrete limits on what effect sizes the system can reliably detect. Micro-level reaction transcripts showed believable, persona-specific patterns (for example, working-class anger that confirms distrust rather than triggers action), and the experiment uncovered a poorly designed positive stimulus that behaved negatively until recalibrated, a practical demonstration of why instrument calibration matters. The study does not claim the model matches real people; instead it offers a concrete protocol for checking that a generative synthetic population is internally controllable before being used to draw conclusions about real communities. The next step is to apply the characterised instrument to the real-world Caffaro case with appropriate ethical safeguards. For further structuring of development and deployment, see Agent Registry Pattern and Orchestrator-Worker Pattern.
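
One way to read the noise-floor criterion is as the spread of repeated responses from a single fixed profile, which then sets a rough lower bound on the changes worth interpreting. The sketch below assumes that reading: the use of sample standard deviation, the function names, and the repeated scores are all hypothetical, and only the reported value of about 0.77 comes from the text.

```python
from statistics import stdev

def noise_floor(repeated_scores):
    """Sample standard deviation of repeated responses from one fixed profile."""
    return stdev(repeated_scores)

def exceeds_noise(delta, floor):
    """True when an observed change is larger in magnitude than the noise floor."""
    return abs(delta) > floor

# Hypothetical repeated trust scores for the same persona and prompt.
repeats = [5, 6, 5, 4, 6, 5, 5, 6]
estimated_floor = noise_floor(repeats)

REPORTED_NOISE_FLOOR = 0.77  # value reported in the text

print(f"estimated noise floor: {estimated_floor:.3f}")
print("delta 1.5 detectable:", exceeds_noise(1.5, REPORTED_NOISE_FLOOR))
print("delta 0.1 detectable:", exceeds_noise(0.1, REPORTED_NOISE_FLOOR))
```

Comparing a group-mean change against a single-response noise floor is deliberately conservative; averaging over many agents would shrink the noise on the mean, so a stricter analysis would account for sample size.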

Credibility Assessment:

Single author with no affiliation provided; arXiv preprint with no citations. Minimal identifiable credibility signals.
