When Your AI Forgets Safely — Or Gets Tricked: Protecting Long-Term Memory in Assistants

At a Glance

Trust-based memory filters can stop many poisoning attempts, but they either wipe out valuable long-term memory when set conservatively or can be bypassed when the model is overconfident — balancing safety and usefulness is essential.

ON THIS PAGE

What They Found

Memory-poisoning attacks that sneak malicious instructions into an assistant’s long-term memory remain practical, but their success depends heavily on how much legitimate memory exists, how many attacker prompts are used, and how many past entries the system retrieves. A two-stage input/output moderation plus a trust-scored memory sanitization approach can prevent many injections, but tuning matters: one conservative configuration rejected every candidate memory (including benign ones), while another accepted dozens of poisoned items because the model assigned them perfect confidence. Effective defense therefore needs trust scoring plus independent verification (for example, checking critical ID redirects against a canonical database) and careful calibration to avoid killing memory utility. Memory poisoning

Not sure where to start?Get personalized recommendations

Learn More

Data Highlights

123 candidate memory entries processed by the sanitization layer were all rejected (0 accepted); mean trust score ≈ 0.28 with sd ≈ 0.10, max 0.4.

2Out of 101 poison queries evaluated, 2 were blocked by the memory defense layer (≈1.98% blocking rate at that stage) and 0 leaked into long-term memory in that run.

3In a second experiment the system accepted 54 poisoned entries with perfect trust scores (1.0), showing model overconfidence can completely bypass the filter.

What This Means

Engineers building assistants that keep long-term memory (for example, clinical or customer support agents) should care because memory corruption can change who or what the system refers to and cause harmful decisions. Product and security leads for high-stakes systems (healthcare, finance, legal) need to weigh the safety vs. utility trade-off and add independent verification for critical facts. Security researchers and reliability teams can use the findings to design layered defenses and operational monitoring for memory trust signals. AI governance

Key Figures

Fig 1: Figure 1: Memory Poisoning Attack on EHR Example

Fig 2: Figure 2: Action Distribution: Percentage of memory entries accepted (APPEND) vs. rejected (REJECT).

Fig 3: Figure 3: Trust Score Distribution for all memory entries. The red dashed line indicates the mean trust score (0.70).

Fig 4: Figure 4: Trust Score Distribution by Guard Decision. Green bars show entries that were accepted (APPEND), red bars indicate rejected entries.

Ready to evaluate your AI agents?

Learn how ReputAgent helps teams build trustworthy AI through systematic evaluation.

Learn More

Yes, But...

Experiments used specific large models and controlled attack patterns; results may differ with other models, retrieval embeddings, or real-world mixed workloads. The conservative defense that rejected everything prevented poisoning but also disabled memory benefits — practical systems need a tunable balance. The threat model assumes only query-level attacker access; attackers with stronger access or different goals might succeed where these defenses do not. Consensus-Based Decision Pattern

Methodology & More

Evaluations explored how query-only memory poisoning (where an ordinary user induces the assistant to store malicious memory) behaves under realistic conditions and whether practical defenses can stop it. Attack variables tested included the amount of pre-existing legitimate memory, number of attacker “indication” prompts (which signal the assistant to store an item), and how many past memories the assistant retrieves when answering. Two defenses were developed: a two-stage input/output moderation gate (static heuristics + optional semantic model checks + code-safety analysis) and a trust-aware memory sanitization layer that assigns trust scores at append time and filters at retrieval time with temporal decay. LLM-as-Judge Semantic Capability Matching Pattern

Avoid common pitfallsLearn what failures to watch for

Learn More

Credibility Assessment:

Authors lack clear reputation signals or affiliations and it's an arXiv preprint with no citations.

agent trust signals agent reliability agent governance multi-agent trust

Not sure where to start?