Evaluation · Production Ready
giskard-oss
by Giskard-AI
Open-source LLM and agent evaluation, red-teaming, and continuous testing
Python
Updated Feb 11, 2026
Summary
Provides an open-source framework for evaluating and testing LLMs and agent behaviors. Runs red-team tests, metrics-driven evaluations, and fairness checks using configurable test suites and data sinks. Offers interactive dashboards, automated test pipelines, and connectors to common model providers for reproducible LLM/agent validation. Related patterns: MCP Pattern, Defense in Depth Pattern.
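A minimal sketch of what such a scan can look like with the giskard Python package, based on its documented Model/scan API; the stub prediction function, model name, and description are illustrative stand-ins for a real LLM or agent, and the LLM-assisted detectors additionally require an LLM client/API key to be configured.

```python
import giskard
import pandas as pd

# Illustrative stand-in for a real LLM or agent call.
def answer_questions(df: pd.DataFrame) -> list[str]:
    return [f"Stub answer to: {q}" for q in df["question"]]

# Wrap the agent so Giskard can probe it with adversarial, robustness,
# and fairness detectors.
qa_model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="Support QA agent",
    description="Answers customer support questions.",
    feature_names=["question"],
)

# Run the automated scan and export an interactive HTML report.
scan_results = giskard.scan(qa_model)
scan_results.to_html("scan_report.html")
```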
Why It Matters
As agents are composed and delegated across services, systematic evaluation is needed to surface failure modes and measure reliability. Giskard makes continuous evaluation and red-team testing practical, so teams can track an agent's record and catch regressions over time. For multi-agent trust, it supplies the metrics and test harnesses needed to compare agents and feed reputation systems such as RepKit. Related pattern: Agent Registry Pattern.
Target Use Cases
Teams validating LLMs or agent components before deployment who need automated tests, fairness checks, and dashboards for continuous agent evaluation. Related pattern: Human-in-the-Loop.
How It's Used
- Run red-team and adversarial tests against LLM-driven agents to find unsafe behaviors
- Automate regression and continuous evaluation pipelines for model updates (see the CI sketch after this list)
- Measure fairness, robustness, and performance across model providers for pre-production gating
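A hedged sketch of how a scan could feed a pre-deployment regression gate in CI, again assuming the documented giskard API (scan, generate_test_suite, run); the stub prediction function and suite name are illustrative, and the exact shape of the suite result may vary by version.

```python
import giskard
import pandas as pd

# Illustrative stand-in for the updated agent under test.
def answer_questions(df: pd.DataFrame) -> list[str]:
    return [f"Stub answer to: {q}" for q in df["question"]]

qa_model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="Support QA agent (candidate build)",
    description="Answers customer support questions.",
    feature_names=["question"],
)

# Re-scan the candidate, turn the findings into a reusable test suite,
# and fail the pipeline if any generated test regresses.
scan_results = giskard.scan(qa_model)
suite = scan_results.generate_test_suite("Pre-deployment regression suite")
results = suite.run()

if not results.passed:
    raise SystemExit("Giskard regression suite failed; blocking this release.")
```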
Works With
openai, huggingface, transformers, langchain
Topics
agent-evaluation, ai-red-team, ai-security, ai-testing, fairness-ai, llm, llm-eval, llm-evaluation, llm-security, llmops, +7 more
Similar Tools
lm-eval-harness, evidently
Keywords
agent-to-agent evaluation, multi-agent trust, ai-testing, llm-evaluation