Protocol · Experimental · MCP

eval-view

by hidai25

CI-friendly regression testing for agent behavior and tool-call diffs

Python
Updated Jun 3, 2026
114 Stars
20 Forks


Overview

Detects regressions in agent behavior by snapshotting outputs and diffing tool calls across runs. Runs in CI to catch behavior drift and API-level changes, comparing agent responses and tool interactions against stored golden snapshots. Distinctive features include CLI-friendly snapshots and integrations with LangGraph, CrewAI, OpenAI, and Anthropic for multi-agent and agent-framework workflows. See also the A2A Protocol Pattern and the Open Agent Specification (Agent Spec) for how agents coordinate across tool boundaries.
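
The snapshot-and-diff idea described above can be illustrated with a minimal sketch. This is not eval-view's actual API or snapshot format: the record shape (a list of dicts with `name` and `args`) and the helper names `tool_call_signature` and `diff_tool_calls` are assumptions made for illustration, using only the Python standard library.

```python
import difflib
import json


def tool_call_signature(call):
    """Render one tool call as a stable, comparable string (hypothetical record shape)."""
    args = json.dumps(call.get("args", {}), sort_keys=True)
    return f"{call['name']}({args})"


def diff_tool_calls(golden, current):
    """Return unified-diff lines between two tool-call sequences; empty if identical."""
    a = [tool_call_signature(c) for c in golden]
    b = [tool_call_signature(c) for c in current]
    return list(
        difflib.unified_diff(a, b, fromfile="golden", tofile="current", lineterm="")
    )


# Illustrative data: the current run inserted an extra tool call.
golden = [
    {"name": "search", "args": {"query": "refund policy"}},
    {"name": "send_email", "args": {"to": "user@example.com"}},
]
current = [
    {"name": "search", "args": {"query": "refund policy"}},
    {"name": "lookup_order", "args": {"order_id": 42}},
    {"name": "send_email", "args": {"to": "user@example.com"}},
]

for line in diff_tool_calls(golden, current):
    print(line)
```

Serializing arguments with sorted keys keeps the comparison insensitive to dict ordering, so only real changes in the call sequence or arguments show up in the diff.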

Key Benefits

As agents evolve and depend on other agents or tools, silent regressions in behavior become a major trust risk. eval-view keeps a record of an agent's behavior over time and surfaces behavioral deltas before they reach production. That visibility is essential for continuous agent evaluation and for building reproducible agent-to-agent evaluation pipelines. This aligns with the Mutual Verification Pattern for establishing cross-agent confidence.

Ideal For

Developer teams adding automated regression checks for agent outputs and tool interactions in CI, especially when using LangGraph or CrewAI with hosted LLM providers such as OpenAI and Anthropic. Consider integrating the Human-in-the-Loop Pattern for critical scenarios where human oversight should complement automated checks.

Use Cases

  • Catch behavioral regressions after model or prompt updates by diffing snapshots in CI
  • Validate agent tool-call sequences remain stable during refactors or dependency changes
  • Audit agent-to-agent interaction changes when switching LLM providers (OpenAI/Anthropic)
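
As a rough sketch of how the use cases above could gate a CI job, the function below compares the tool-call names from a run against a stored JSON snapshot and returns a process exit code. It is an assumption-laden illustration, not eval-view's CLI or file format: `check_snapshot`, its `update` flag, and the snapshot layout (a JSON list of tool names) are all hypothetical.

```python
import json
import sys
from pathlib import Path


def check_snapshot(snapshot_path, current_calls, update=False):
    """Compare tool-call names to a stored snapshot; return 0 on match, 1 on drift.

    Hypothetical helper: writes the snapshot when it is missing or update=True.
    """
    path = Path(snapshot_path)
    names = [call["name"] for call in current_calls]

    if update or not path.exists():
        # First run (or explicit refresh): record the current behavior as golden.
        path.write_text(json.dumps(names, indent=2))
        return 0

    golden = json.loads(path.read_text())
    if golden == names:
        return 0

    print(f"tool-call sequence changed: {golden} -> {names}", file=sys.stderr)
    return 1
```

A CI step could pass the return value to `sys.exit` so that any drift fails the build. Note that this sketch silently creates the snapshot when the file is missing; a stricter pipeline might prefer to fail in that case so that goldens are always committed deliberately.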

Works With

autogen, crewai, openai, anthropic

Topics

agent-benchmark, agent-evaluation, agentic-ai, ai-agents, anthropic, autogen, cli, crewai, evaluation, langchain-agent, +8 more

Similar Tools

agent-playground, autogen

Keywords

multi-agent trust, A2A evaluation, agent track record, regression-testing