The Big Picture
Coordinating separate AI characters with a shared world view and critic-guided refinement yields longer stories with fewer contradictions and more coherent plot progression.
The Evidence
Magnet, an ensemble of character agents plus a shared, updating world state, keeps facts and goals aligned across scenes so characters act more consistently over long narratives. A critic step refines actions and a narrator module turns those actions into readable prose, while dynamic goal switching avoids repetitive story trajectories. Atlas evaluates generated stories by building a graph of characters, events, and relationships to spot contradictions across scenes. Together, these methods reduce editorial critiques and internal hallucinations compared with baseline single-agent generation.
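The propose, critique, narrate loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `WorldState`, `CharacterAgent`, `critic_select`, `narrate`, and `run_step` are hypothetical names, and the model calls are replaced with plain Python stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorldState:
    """Shared log that every agent reads and updates (hypothetical structure)."""
    events: list[str] = field(default_factory=list)


@dataclass
class CharacterAgent:
    name: str
    persona: str
    # Stand-in for a model call: returns candidate actions for this character.
    propose: Callable[["CharacterAgent", WorldState], list[str]]


def critic_select(agent: CharacterAgent, candidates: list[str], world: WorldState) -> str:
    """Toy critic: prefer the first candidate the character has not already done."""
    for action in candidates:
        if f"{agent.name}: {action}" not in world.events:
            return action
    return candidates[0]  # everything was seen before; fall back to the first


def narrate(agent: CharacterAgent, action: str) -> str:
    """Toy narrator: wrap the selected action in a sentence."""
    return f"{agent.name} {action}."


def run_step(agents: list[CharacterAgent], world: WorldState) -> list[str]:
    """One time step: each agent proposes, the critic selects, the narrator renders."""
    prose = []
    for agent in agents:
        candidates = agent.propose(agent, world)
        action = critic_select(agent, candidates, world)
        world.events.append(f"{agent.name}: {action}")  # update the shared state
        prose.append(narrate(agent, action))
    return prose


def simple_propose(agent: CharacterAgent, world: WorldState) -> list[str]:
    # Fixed candidates for illustration; a real agent would condition on persona and world.
    return ["searches the archive", "questions the guard"]


world = WorldState()
agents = [CharacterAgent("Ada", "curious historian", simple_propose)]
print(run_step(agents, world))
print(run_step(agents, world))
```

Because the critic consults the shared event log, the second step picks a different action than the first even though the agent proposes the same candidates. In this toy, that log lookup is what the shared world state provides to every module.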
Data Highlights
1. Character actions tended to become repetitive after about 15 time steps; Magnet replaces stale goals not completed within 15 steps.
2. The system forces larger story direction changes every ~40 steps to introduce new conflicts and avoid repeating trajectories.
3. The character action adapter was trained with 1,012 preference examples and validated on 53 held-out examples to improve in-character, grounded actions.
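The two timing rules in the first two highlights can be expressed as a small scheduling function. The sketch below is an assumption-laden illustration: `Goal`, `next_goal`, and `replacement_steps` are hypothetical names, the thresholds are treated as exact (the summary gives roughly 15 and roughly 40), and "not completed within 15 steps" is read as "at least 15 steps old".

```python
from dataclasses import dataclass
from typing import Callable

STALE_AFTER = 15  # steps a goal may remain incomplete before replacement
SHIFT_EVERY = 40  # steps between forced story direction changes (approximate)


@dataclass
class Goal:
    text: str
    started_at: int
    done: bool = False


def next_goal(
    goal: Goal,
    step: int,
    make_goal: Callable[[int], str],
    make_direction: Callable[[int], str],
) -> Goal:
    """Return the goal to pursue at `step`.

    `make_goal` and `make_direction` are hypothetical stand-ins for
    automatic goal generation.
    """
    if step > 0 and step % SHIFT_EVERY == 0:
        # Forced direction change takes priority over the staleness rule.
        return Goal(make_direction(step), started_at=step)
    if goal.done or step - goal.started_at >= STALE_AFTER:
        # Completed or stale goal: swap in a fresh one.
        return Goal(make_goal(step), started_at=step)
    return goal


def replacement_steps(total_steps: int) -> list[tuple[int, str]]:
    """Simulate a character that never completes a goal and record each swap."""
    goal = Goal("initial goal", started_at=0)
    changes = []
    for step in range(1, total_steps + 1):
        updated = next_goal(goal, step, lambda s: "new goal", lambda s: "new direction")
        if updated is not goal:
            changes.append((step, updated.text))
            goal = updated
    return changes


print(replacement_steps(80))
```

In this simulation, a forced direction change also resets the staleness clock, so the next stale-goal swap happens 15 steps after the change rather than on the original 15-step cadence. Whether Magnet behaves this way is not stated in the summary.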
What This Means
Engineers building AI storytelling platforms and creative tools will gain a practical architecture for keeping multi-character plots consistent over long documents. Technical product leads and researchers evaluating agent behavior can use the shared world-state and Atlas graph checks to detect internal contradictions and guide improvements. The framework aligns with patterns like the Orchestrator-Worker Pattern to coordinate actions across modules.
Key Figures

Figure 1: Magnet’s generation pipeline.

Figure 2: Atlas’s evaluation pipeline.
Considerations
The approach is computationally expensive because multiple agent modules interact repeatedly, which limited evaluations to a small set of stories. Results depend on closed-source large models and a modest preference dataset, so behavior may vary and could improve with more training data. Atlas relies on extracting explicit graph information from text, so sparse or vague outputs reduce its ability to detect contradictions. These challenges relate to potential Explanation Degradation in long-form generation.
Methodology & More
Magnet breaks story generation into distinct roles: character agents propose grounded actions based on their persona, a critic refines or selects actions to keep behavior on-character, and a narrator converts selected actions into polished prose. All agents read and update a shared world state that tracks entities, events, and relationships across scenes. The system also uses automatic goal generation and replacement rules (swap incomplete goals after ~15 steps and force domain changes every ~40 steps) to prevent repetition and keep narratives moving.
Atlas evaluates stories by extracting a graph-like world model from each scene and comparing the current scene’s facts to prior scenes to detect contradictions and hallucinations.
The team trained a preference-based adapter on 1,012 preference examples to make character actions more relevant and less repetitive, and evaluated Magnet in 2-page, 20-page, and proof-of-concept 100-page story settings. Findings show improvements in editorial critique counts and pairwise rubric scores, along with fewer internal inconsistencies, though the gains come with higher compute cost and dependence on specific pretrained models. The work offers a practical path toward controllable, character-grounded long-form generation while highlighting scaling and evaluation limits that future work must address.
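The cross-scene comparison that Atlas performs can be illustrated with a toy contradiction check. This is a sketch under strong assumptions rather than the Atlas method itself: facts are hand-written (subject, relation, value) triples instead of being extracted from text, every (subject, relation) pair is treated as single-valued and unchanging, and `Fact` and `find_contradictions` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    value: str


def find_contradictions(scenes: list[list[Fact]]) -> list[tuple[int, Fact, Fact]]:
    """Compare each scene's facts against facts established in earlier scenes.

    Returns (scene_index, earlier_fact, conflicting_fact) tuples.
    Assumes each (subject, relation) pair has exactly one valid value.
    """
    known: dict[tuple[str, str], Fact] = {}
    conflicts = []
    for index, scene in enumerate(scenes):
        for fact in scene:
            key = (fact.subject, fact.relation)
            prior = known.get(key)
            if prior is not None and prior.value != fact.value:
                conflicts.append((index, prior, fact))
            else:
                known[key] = fact  # first mention, or a consistent repeat
    return conflicts


scenes = [
    [Fact("Mira", "eye_color", "green"), Fact("Mira", "sibling_of", "Tomas")],
    [Fact("Tomas", "occupation", "blacksmith")],
    [Fact("Mira", "eye_color", "brown")],  # conflicts with scene 0
]

for scene_index, earlier, later in find_contradictions(scenes):
    print(f"scene {scene_index}: {later} contradicts {earlier}")
```

A scene that yields no facts contributes nothing to the check, which mirrors the limitation noted under Considerations: sparse or vague text gives the graph little to compare. A fuller version would also need to separate legitimate state changes (a character moving to a new location) from true contradictions, which this sketch does not attempt.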
Credibility Assessment:
The authors are affiliated with industry labs (Lockheed AI Center, Algoverse), but they have low h-indices and the work is an arXiv preprint. There is some institutional backing, though not top-tier, which suggests emerging credibility.