Save the Middle Steps: Make AI Work Fixable by Keeping Its Notes

At a Glance

Keep typed, versioned intermediate artifacts (evidence, criteria, plans) as first-class state so later edits target the real source of errors instead of redoing or guessing from the final answer.

ON THIS PAGE

Key Findings

Durable intermediate artifacts—typed, addressable pieces like evidence digests, claim matrices, and criteria—unlock targeted revision: change the right artifact and regenerate only the downstream pieces that depend on it. Treating these artifacts as authoritative and supersession-aware prevents hidden or flattened state from producing opaque, hard-to-fix outputs. The paper formalizes what makes an artifact first-class, distinguishes artifact lineage from prompt or memory traces, and proposes additive versus superseding update rules so the system can resolve which artifact is "current." Finally, evaluation should measure maintained-state quality (can the runtime pick the right artifact after edits?) not just whether the final output looks acceptable.

Need expert guidance?We can help implement this

Learn More

By the Numbers

1Defines "active-resolution accuracy" — the percentage (0–100%) of perturbation tasks where the runtime resolves the correct "current" artifact after an edit.

2Proposes a benchmark setup using controlled artifact graphs with one gold-active artifact per role so you can report per-role and overall resolution accuracy as percent-correct scores.

3Points out a possible evaluation gap: final-output acceptability and maintained-state correctness can diverge dramatically—potentially up to 100 percentage points in extreme cases (perfect final output but no coherent maintained-state).

What This Means

Engineers building multi-step or agent-driven systems should adopt artifact-first designs so targeted changes (for example, new constraints or corrected evidence) only recompute what depends on them. Technical leaders and governance teams should use maintained-state metrics when evaluating agent reliability, trust, and long-term track records rather than relying solely on one-off final answers. Researchers designing benchmarks and agent-to-agent evaluation can use the artifact semantics and the active-resolution metric to test longitudinal improvability.

Ready to evaluate your AI agents?

Learn how ReputAgent helps teams build trustworthy AI through systematic evaluation.

Learn More

Yes, But...

Typed artifacts require careful schema design and discipline; bad schemas can create brittle or superficial structure. Choosing artifact granularity is a tradeoff: too coarse loses localizability, too fine adds management cost. Persisting artifacts does not guarantee correctness—durability makes revision possible but still relies on validation, good judgment, and selective substrate entry for exploratory work. Versioning considerations and discipline are part of getting this right.

The Details

The paper argues that multi-step AI work should preserve durable, inspectable intermediate artifacts rather than collapsing everything into a final answer or an opaque transcript. An artifact-first data model treats intermediate outputs (evidence digests, claim matrices, criteria, plans) as first-class state when they are typed, addressable, authority-bearing, and supersession-aware. The model separates artifact lineage (how artifacts depend on one another) from prompt lineage or memory, and it introduces two update semantics—additive and superseding—to control how new edits become "current." A concrete example shows how changing a recommendation criterion should only trigger regeneration of dependent artifacts (implementation plan, syntheses) while leaving unrelated upstream evidence untouched. For evaluation and governance, the paper proposes measuring maintained-state quality with executable tests: controlled perturbation tasks, gold-active annotations per role, and an "active-resolution accuracy" metric that reports the percentage of cases where the runtime resolves the correct artifact after a change. That shifts part of system assessment from final-output acceptability to whether the runtime preserves coherent, editable state that humans and agents can reliably intervene on. Practical tradeoffs include schema discipline, granularity choices, and the fact that durable artifacts can still be wrong or noisy; the model is a substrate-level proposal for making AI work maintainable and improvable over time rather than a claim that artifacts make models more accurate by themselves. Evidence digests, claim matrices, and criteria Active-resolution accuracy

Avoid common pitfallsLearn what failures to watch for

Learn More

Credibility Assessment:

Solo/very low h-index authors and arXiv venue with no institutional backing — minimal identifiable credibility.

agent governance agent reliability agent track record multi-agent orchestration

Not sure where to start?