At a Glance
Upfront human-crafted test rules, run automatically, find behavioral and procedural agent failures (like fake tool calls or skipped compliance) that single-response scoring misses.
ON THIS PAGE
Key Findings
Curating evaluation intelligence once — domain context, adversarial traps, juror perspectives, scoring rules, and audit checks — lets an automated harness run wide, repeatable tests that surface hard-to-see failures. Smaller or cheaper evaluator models can still detect objective, trace-verifiable problems when guided by those curated rules, although their subjective scores tend to be more lenient. Linking scores to the agent's recorded trace makes many failures reproducible and suitable for regression testing and governance.
Data Highlights
147 completed evaluation configurations across code, finance, and healthcare
2470 adversarial runs and 23,500 agent turns in total (median reported per configuration over 10 runs of 50 turns = 500 turns per configuration)
3Five evaluator tiers used (4B, 8B, 32B, 70B, 120B parameters) to show smaller evaluators can detect objective, trace-verifiable failures when guided by curated traps and audit rules
Why It Matters
Engineers building agentic systems: use reusable evaluation intelligence to catch procedural bugs that final answers hide. Product and compliance leads: adopt an evidence-linked testing loop for regression checks and release gating. Research and QA teams: leverage smaller local evaluators inside a structured harness to reduce cost while keeping adversarial coverage.
Not sure where to start?Get personalized recommendations
Key Figures

Fig 1: Figure 1 : State of the art in AI agent evaluation. Current methodologies include static benchmarks, interactive agent benchmarks, Human-in-the-Loop review, LLM-as-judge evaluation, red teaming, trace and tool-use auditing, log-based evaluation, and open evaluation infrastructure. These methods show that agent evaluation extends beyond single-response LLM evaluation toward behavioral, procedural, adversarial, and evidence-linked assessment.

Fig 2: Figure 2 : Human-on-the-Bridge (HOB) as a human-curated paradigm for AI agent evaluation. Human experts curate evaluation intelligence, including domain context, Red-Team Traps, human-curated edge cases, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then executes multi-turn adversarial evaluations against the Agent Under Test using a Harness LLM, captures behavioral evidence, and produces reproducible evaluation reports.

Fig 3: Figure 3 : Operational view of Human-on-the-Bridge (HOB). Human experts curate evaluation intelligence upstream. ProofAgent Harness then performs evaluation design, executes multi-turn adversarial trials, applies multi-juror behavioral assessment, and produces evidence-linked reports with audit trails.
Ready to evaluate your AI agents?
Learn how ReputAgent helps teams build trustworthy AI through systematic evaluation.
Learn MoreYes, But...
The experiment used a selected sweep of configurations rather than a full factorial grid, so cross-configuration differences are indicative rather than definitive. Subjective juror scores become more lenient with smaller evaluator models, so rely more on objective, trace-verifiable detections in production. Further calibration (inter-rater studies) and ablation tests are needed to measure how much curated intelligence vs. evaluator size contributes to detection power. Calibration
The Details
Human-on-the-Bridge (HOB) reframes evaluation as upfront curation plus automated execution. Human experts encode reusable "evaluation intelligence" — domain context, red-team traps, juror personas, scoring rubrics, audit rules, and fallback policies — and an execution harness runs adversarial multi-turn trials against the agent while recording traces and evidence. That trace-aware output distinguishes objective, verifiable failures (e.g., claiming a tool was used when it wasn’t, skipping a mandatory compliance action, drifting from policy, or refusing valid tasks) from subjective judgments about answer quality.
Experiments covered code generation, financial advice, and medical triage across agents built on recent large model backbones and five evaluator model sizes. Each completed configuration was run 10 times for 50 turns; results reported as medians. Key findings: automated harnesses guided by curated traps can scale human expertise, allow smaller evaluators to surface reproducible failures, and enable production-friendly regression testing and governance. The approach reduces repeated human review and token cost, but needs ongoing expansion of the trap library, juror calibration, and reproducibility checks to mature into production-grade tooling.
Agentic RAG Pattern
Test your agentsValidate against real scenarios
Credibility Assessment:
Single author with modest h-index (8) and no listed affiliation; arXiv preprint and no citations — emerging/limited information.