Evaluation · Production Ready

giskard-oss

by Giskard-AI

Open-source LLM and agent evaluation, red-teaming, and continuous testing

Python
Updated Feb 11, 2026
5.1k Stars · 393 Forks · 12 Commits/Week · 12 Commits/Month

View on GitHub

Summary

Provides an open-source framework for evaluating and testing LLMs and agent behaviors. Runs red-team tests, metrics-driven evaluations, and fairness checks using configurable test suites and data sinks. Offers interactive dashboards, automated test pipelines, and connectors to common model providers for reproducible LLM/agent validation. Related patterns: MCP Pattern, Defense in Depth Pattern.
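
A minimal sketch of what such an evaluation can look like in code, assuming the Giskard Python API as documented upstream (giskard.Model, giskard.Dataset, giskard.scan); the prediction function and names below are placeholders for your own agent, and exact signatures may differ between releases:

```python
# Minimal sketch: wrap an LLM-backed function and scan it for unsafe or
# low-quality behaviors. Class/function names follow the Giskard Python
# docs; signatures may differ between releases, and the prediction
# function below is a placeholder for the agent under test.
import pandas as pd

import giskard


def answer_questions(df: pd.DataFrame) -> list[str]:
    # Placeholder: call your own LLM or agent for each input row.
    return [f"(answer to: {q})" for q in df["question"]]


model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="support-agent",
    description="Answers customer questions about billing.",
    feature_names=["question"],
)

dataset = giskard.Dataset(
    pd.DataFrame({"question": ["How do I cancel my plan?"]})
)

# Run the automated scan (red-team probes plus quality/robustness detectors).
# The LLM-assisted detectors typically need an LLM client configured,
# e.g. an OpenAI API key in the environment.
report = giskard.scan(model, dataset)
report.to_html("scan_report.html")
```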

Why It Matters

As agents are composed and delegated across services, systematic evaluation is needed to surface failure modes and measure reliability. Giskard makes continuous evaluation and red-team testing practical, so teams can track an agent's record and catch regressions over time. For multi-agent trust, it supplies the metrics and test harnesses needed to compare agents and feed reputation systems such as RepKit. Related pattern: Agent Registry Pattern.

Target Use Cases

Teams validating LLMs or agent components before deployment who need automated tests, fairness checks, and dashboards for continuous agent evaluation. Related pattern: Human-in-the-Loop.

How It's Used

  • Run red-team and adversarial tests against LLM-driven agents to find unsafe behaviors
  • Automate regression and continuous evaluation pipelines for model updates
  • Measure fairness, robustness, and performance across model providers for pre-production gating (see the sketch after this list)
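
The pre-production gate from the last two bullets can be sketched as follows, again assuming the Giskard Python API; generate_test_suite, run, and passed are taken from the upstream docs and may vary by version, and the wrappers mirror the sketch above:

```python
# Minimal sketch of a regression / pre-production gate: turn scan findings
# into a reusable test suite and re-run it on every model update.
# API names are assumed from the Giskard Python docs.
import sys

import pandas as pd

import giskard


def answer_questions(df: pd.DataFrame) -> list[str]:
    # Placeholder: call the updated model or agent being gated.
    return [f"(answer to: {q})" for q in df["question"]]


model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="support-agent",
    description="Answers customer questions about billing.",
    feature_names=["question"],
)
dataset = giskard.Dataset(pd.DataFrame({"question": ["How do I cancel my plan?"]}))

# Generate the suite once from a scan, then run it as a regression check in CI;
# a non-zero exit code is what blocks the deployment step.
suite = giskard.scan(model, dataset).generate_test_suite("pre-release gate")
results = suite.run()

if not results.passed:
    sys.exit("Evaluation suite failed -- blocking deployment.")
```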

Works With
openai, huggingface, transformers, langchain

Topics
agent-evaluation, ai-red-team, ai-security, ai-testing, fairness-ai, llm, llm-eval, llm-evaluation, llm-security, llmops, +7 more

Similar Tools
lm-eval-harness, evidently

Keywords
agent-to-agent evaluation, multi-agent trust, ai-testing, llm-evaluation