
Key Takeaway

Hallucination risk depends on the type of contract claim: obligation and numeric claims are hallucinated about 65–74% of the time, while temporal claims are hallucinated about 29–35% of the time. Auditing by claim type and using a calibrated multi-agent debate pipeline cuts fabricated detections by 45%.

Key Findings

Aggregate accuracy masks a consistent 38–41 percentage-point gap between types of contract claims: obligations and numbers fail roughly twice as often as dates. The Risk Direction Index condenses the direction in which errors lean (missing vs. invented clauses) into a single number, so that direction can be compared across systems. Calibrating a multi-agent debate pipeline to the measured failure profile reduces fabricated detections by 45%, with the largest gains on obligation and numeric claims.

Data Highlights

1. Obligation and numeric claim hallucination rates: 64.8%–74.3% across tested models.
2. Temporal (date) claim hallucination rates: 29.0%–35.1%.
3. Typed hallucination gap ≈ 38–41 percentage points; the calibrated debate pipeline reduced fabricated detections by 45% and lowered obligation false-accept rates by ~8.2 percentage points.

What This Means

Compliance officers, legal operations teams, and procurement evaluators should use typed audits and the Risk Direction Index to choose systems that match their risk tolerance (missed obligations vs. false positives). AI engineers and product leads building legal extraction pipelines should tune multi-agent review to the model's per-claim failure profile to concentrate mitigation where it matters most.

Key Figures

Figure 1: Typed hallucination rates on the 510-contract benchmark. The grey band marks the aggregate $\mathrm{Hal_{TP}}$ cluster (50.9–56.5%). Numeric and obligation claims hallucinate at 64.8–74.3% across every tested model; temporal claims remain at 29.0–35.1%. The resulting within-model gap (approximately 38–41 pp) is not observable under aggregate reporting.
Figure 2: Error direction across benchmark models (percentage of contradicted TP findings). Scope errors dominate universally (62–71%), but the residual signal reveals a deployment-critical distinction: qwen3-32b predominantly omits conditions (23.7% missing-condition errors), whereas gpt-5.2 predominantly invents them (21.0% extra-condition errors). Both systems report 52% aggregate $\mathrm{Hal_{TP}}$. Only the directional decomposition separates their compliance risk profiles.
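The omit-versus-invent distinction in Figure 2 can be illustrated with a simple signed score. This summary does not give the formula for the Risk Direction Index, so the sketch below is not the paper's definition: `direction_score` is a hypothetical stand-in, and the error counts are invented for illustration only.

```python
def direction_score(missing: int, extra: int) -> float:
    """Hypothetical directional score in [-1, 1] (NOT the paper's RDI formula).

    Negative values lean toward omitted conditions,
    positive values lean toward invented conditions.
    """
    total = missing + extra
    if total == 0:
        return 0.0  # no directional errors observed
    return (extra - missing) / total


# Invented counts for two imaginary systems with similar aggregate error.
omitter = direction_score(missing=24, extra=8)
inventor = direction_score(missing=9, extra=21)

print(f"omitting system:  {omitter:+.2f}")
print(f"inventing system: {inventor:+.2f}")
```

A workflow in which a missed condition is costly would treat a strongly negative score as the warning sign, while a workflow in which false positives are costly would watch for a strongly positive one.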
Figure 3: Typed debate pipeline, organised into three phases. (1) Debate: a Skeptic issues claim-type-specific challenges (Appendix C); a Supporter defends with verbatim contract quotes; a Route node directs traffic. If the Skeptic flags a structural error in Round 1, the Re-extractor fires once and the loop restarts. If agents disagree with rounds remaining, the loop continues; on deadlock, the Arbiter tie-breaks conservatively. (2) Independent verify: the Verifier searches the contract independently and checks definition fit. (3) Judge with safety gates: the Add gate (absent $\to$ present) requires both Verifier confirmation and debate consensus, blocking fabricated additions; the Del gate (present $\to$ absent) is blocked when the Verifier confirms presence, preventing erasure of correct findings. The asymmetry encodes the measured FAR $>$ FRR profile from Experiment 1.
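The asymmetric safety gates in phase (3) of Figure 3 might be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Evidence` and `judge` and both field names are hypothetical, and the rule that a deletion goes through whenever the Verifier does not confirm presence is an assumption, since the caption only states when the Del gate is blocked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical container for the signals the Judge receives."""
    verifier_confirms_presence: bool  # Verifier independently found the clause
    debate_consensus: bool            # Skeptic and Supporter agreed on the change


def judge(current_present: bool, proposed_present: bool, ev: Evidence) -> bool:
    """Return the final present/absent label after applying the safety gates."""
    if current_present == proposed_present:
        return current_present  # no change proposed, nothing to gate

    if proposed_present:
        # Add gate (absent -> present): needs BOTH Verifier confirmation
        # and debate consensus, otherwise the finding stays absent.
        return ev.verifier_confirms_presence and ev.debate_consensus

    # Del gate (present -> absent): blocked when the Verifier confirms presence.
    if ev.verifier_confirms_presence:
        return True   # deletion blocked, finding stays present
    return False      # assumption: deletion proceeds otherwise


# An addition backed only by the Verifier is rejected.
print(judge(False, True, Evidence(verifier_confirms_presence=True, debate_consensus=False)))
# A deletion is blocked when the Verifier confirms the clause is present.
print(judge(True, False, Evidence(verifier_confirms_presence=True, debate_consensus=True)))
```

Additions face two independent checks while deletions face one, which is how the sketch mirrors the caption's point that the gates are deliberately asymmetric.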
Figure 4: Per-type deltas from Experiment 2. Gains concentrate on obligation ($\Delta\mathrm{FAR} = -8.2$, $\Delta\mathrm{Hal_{Gen}} = -6.3$) and factual ($\Delta\mathrm{FAR} = -5.8$). Temporal $\mathrm{Hal_{Gen}}$ is essentially unchanged ($+0.6$ pp), consistent with temporal being the lowest-hallucination type at baseline. The calibrated intervention produces the per-type pattern predicted by Experiment 1. $\Delta$Hal in the legend denotes $\Delta\mathrm{Hal_{Gen}}$.


Considerations

Results come from 510 English-US commercial contracts (CUAD) and may not generalize to other jurisdictions or document types without re-measurement. All measurements use a single automated judge for label checking; judge noise could affect small differences in the Risk Direction Index. Experiments assume full-document context and one backbone for the mitigation run, so long documents and different extraction backbones may introduce other failure modes.

Full Analysis

LegalHalluLens replaces a single aggregate hallucination number with a typed profile that splits clause-level extractions into four verification categories: numeric, temporal, obligation/entitlement, and factual. Evaluating four architectures on 510 contracts showed a persistent pattern: numeric and obligation claims hallucinate at roughly 65–74%, while temporal claims hallucinate at roughly 29–35%, creating a within-model gap of about 38–41 percentage points.

The Risk Direction Index (RDI) summarizes whether a model tends to invent conditions or omit them, a directional signal that aggregate accuracy conceals.

Using the typed profile to calibrate a multi-agent debate pipeline (a Skeptic, a Supporter, an independent Verifier, and a conservative Judge with asymmetric safety gates) produced measurable gains: fabricated detections fell by 45%, and the biggest improvements concentrated on the high-failure categories (obligation and numeric).

Practical takeaway: procurement and deployment should include per-task, per-claim-type audits and prefer models or pipelines whose RDI matches the workflow’s risk tolerance. Even after mitigation, high overall hallucination rates mean qualified human review remains required for high-stakes legal work.
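A per-claim-type audit of the kind recommended here can be sketched in a few lines. The example below is an illustration under stated assumptions, not the LegalHalluLens code: the function name `typed_hallucination_rates`, the `(claim_type, is_hallucinated)` record layout, and the toy findings are all hypothetical, and the labels would in practice come from whatever judge or reviewer the audit uses.

```python
from collections import defaultdict


def typed_hallucination_rates(findings):
    """Compute the hallucination rate per claim type.

    findings: iterable of (claim_type, is_hallucinated) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for claim_type, is_hallucinated in findings:
        totals[claim_type] += 1
        hallucinated[claim_type] += int(is_hallucinated)
    return {t: hallucinated[t] / totals[t] for t in totals}


# Toy labels invented for illustration; they are not benchmark data.
findings = [
    ("obligation", True), ("obligation", True), ("obligation", False),
    ("temporal", False), ("temporal", True), ("temporal", False), ("temporal", False),
]

rates = typed_hallucination_rates(findings)
aggregate = sum(flag for _, flag in findings) / len(findings)
gap_pp = 100 * (rates["obligation"] - rates["temporal"])

for claim_type, rate in sorted(rates.items()):
    print(f"{claim_type:<10} {rate:.1%}")
print(f"aggregate  {aggregate:.1%}  (hides a {gap_pp:.1f} pp gap between types)")
```

Reporting the per-type rates next to the aggregate makes the within-model gap visible, which is the point of replacing the single number with a typed profile.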
Credibility Assessment:

Published on arXiv; no author affiliations or notable author metrics are provided, so there is limited information for assessing credibility.