The Big Picture
Capability-focused tests that check perception, memory, planning, and decision steps find more navigation failures and identify which skill caused each one, enabling targeted fixes that repair 81–97% of the failures found.
Key Findings
Generating tests that target individual skills and comparing agent outputs to expected skill-level answers uncovers many more failures than system-level tests alone. An automated setup created capability-specific checks (called oracles) for perception, memory, planning and decisions, and used their signals to guide test generation. That feedback both increased the number of discovered navigation failures and let the system attribute each failure to a root cause skill. Using those oracles to patch errors repaired between about 81% and 97% of the problems the tests found.
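To make the idea of a capability-specific oracle concrete, here is a minimal sketch of comparing an agent's per-step outputs with expected skill-level answers. The `StepRecord` type, the dictionary layout, and the `disagreement_rates` helper are hypothetical names chosen for illustration; the source does not describe the actual data format.

```python
from dataclasses import dataclass

# The four capabilities named in the summary.
CAPABILITIES = ("perception", "memory", "planning", "decision")


@dataclass
class StepRecord:
    """One navigation step (hypothetical layout, for illustration only)."""
    agent: dict   # capability name -> what the agent produced
    oracle: dict  # capability name -> what the oracle expected


def disagreement_rates(steps):
    """Return, per capability, the fraction of steps where agent != oracle."""
    if not steps:
        return {cap: 0.0 for cap in CAPABILITIES}
    rates = {}
    for cap in CAPABILITIES:
        mismatches = sum(
            1 for step in steps if step.agent.get(cap) != step.oracle.get(cap)
        )
        rates[cap] = mismatches / len(steps)
    return rates
```

A run whose task-level outcome looks fine can still show a nonzero rate for one capability, which is the extra signal a system-level check would miss.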
Data Highlights
1. CanTest found 23.34%–33.70% more failure cases than the best baseline fuzzing method across experiments.
2. Repairing failures using the constructed capability checks fixed between 81.30% and 96.69% of targeted issues.
3. Experiments were run on 3 advanced vision-and-language navigation models to validate results across architectures.
Implications
Engineers building or testing embodied navigation agents (assistive robots, delivery bots, indoor guides) should care because this gives a way to find what specific skill fails instead of just knowing a run failed. Technical leads and QA teams can use capability-focused tests to prioritize fixes and reduce risky behavior before deployment.
Key Figures

Figure 1: Overview of CanTest.

Figure 2: The comparison between CanTest and the baselines on the number of failure cases for all target models.

Figure 3: Examples of Failure Due to Different Capabilities.

Figure 4: The results of the ablation study for feedback.
Keep in Mind
The method assumes an "expert" or reference (near-optimal planner and semantic labels) to build the capability checks; without that, oracle construction needs extra annotation or proxy models. All results are in simulation, so the oracle designs and thresholds must be adapted and validated for real robots and noisy sensors. The approach targets modular capability faults (perception/memory/planning/decision); deeply entangled or learned end-to-end behaviors may be harder to attribute precisely without additional instrumentation.
Deep Dive
CanTest turns navigation tests from blunt outcome checks into capability-level probes. It generates navigation instructions from annotated 3D scenes, maintains a pool of test seeds, and mutates instructions with mild or aggressive edits. For each run it computes capability-specific oracles — expected perception labels, waypoint/memory expectations, near-optimal plans, and decision correctness — and compares agent outputs to those expectations. A feedback score that blends task-level failure and capability-level disagreement guides which seeds to select and how aggressively to mutate them, focusing generation on scenarios likely to reveal skill failures.
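The feedback loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not CanTest's implementation: the linear blend, the weight `alpha`, the sampling floor, the threshold, and the rule that high-scoring seeds get mild edits while low-scoring seeds get aggressive ones are all hypothetical choices, since the summary does not give the exact formula or policy.

```python
import random


def feedback_score(task_failed, disagreements, alpha=0.5):
    """Blend task-level failure with mean capability-level disagreement.

    `disagreements` maps capability name -> disagreement in [0, 1].
    `alpha` is a hypothetical weight; the real blend is not given in the source.
    """
    if disagreements:
        capability_term = sum(disagreements.values()) / len(disagreements)
    else:
        capability_term = 0.0
    task_term = 1.0 if task_failed else 0.0
    return alpha * task_term + (1.0 - alpha) * capability_term


def select_seed(pool, rng, floor=0.05):
    """Sample one (instruction, score) pair with probability ~ score + floor.

    The floor keeps zero-score seeds selectable (assumed design choice).
    """
    weights = [score + floor for _, score in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


def choose_mutation(score, threshold=0.5):
    """Assumed policy: stay close to promising seeds, explore from weak ones."""
    return "mild" if score >= threshold else "aggressive"


# Usage: pick a seed from a small pool and decide how hard to mutate it.
rng = random.Random(0)
pool = [("walk past the sofa and stop at the door", 0.7),
        ("go to the kitchen", 0.1)]
instruction, score = select_seed(pool, rng)
mutation_kind = choose_mutation(score)
```

The point of blending both signals is that a run can succeed at the task level while still disagreeing with an oracle, and such near-misses are plausibly good starting points for further mutation.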
Across experiments on three state-of-the-art navigation models, this strategy consistently discovered more failures than random generation and two prior fuzzing baselines (23–34% more cases versus the best baseline). The capability oracles also let the framework attribute each failure to a concrete skill (for example, a mislabeled landmark in perception that cascades into bad planning). When the team used the oracles to repair or override the faulty capability outputs, 81–97% of those failures were fixed, showing the oracles are practical for debugging. Limitations include reliance on expert supervision for oracle construction and simulation-only evaluation; adapting the approach to real-world robots will need human-in-the-loop checks or learned surrogate experts.
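One simple way to picture attribution and repair is sketched below. It assumes the capabilities run in a fixed pipeline order and blames the earliest one that disagrees with its oracle, which matches the cascading example above but is an assumption rather than the paper's stated rule. The function names and dictionary layout are hypothetical.

```python
# Assumed pipeline order: errors upstream can cascade into later stages.
PIPELINE_ORDER = ("perception", "memory", "planning", "decision")


def attribute_failure(agent_outputs, oracle_outputs):
    """Return the earliest capability whose output disagrees with the oracle.

    Returns None when every capability matches.
    """
    for capability in PIPELINE_ORDER:
        if agent_outputs.get(capability) != oracle_outputs.get(capability):
            return capability
    return None


def repair(agent_outputs, oracle_outputs, capability):
    """Return a copy of the outputs with one capability overridden by the oracle."""
    patched = dict(agent_outputs)
    patched[capability] = oracle_outputs[capability]
    return patched


# Usage: a mislabeled landmark is blamed on perception even though
# planning and decision are also wrong further down the pipeline.
agent = {"perception": "chair", "memory": "wp1", "planning": "left", "decision": "left"}
oracle = {"perception": "sofa", "memory": "wp1", "planning": "right", "decision": "right"}
root_cause = attribute_failure(agent, oracle)
patched = repair(agent, oracle, root_cause)
```

In this sketch the override only replaces the faulty output; a real agent would need its downstream stages re-run on the patched value to check whether the navigation failure actually disappears.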
Credibility Assessment:
Published at ACL (top-tier venue). Despite low author h-indices and unspecified affiliations, top conference acceptance signals high credibility.