
Evaluation-Driven Development and Operations (EDDOps)

Overview

The Challenge

Traditional development separates building and testing phases, but LLM agents require continuous evaluation throughout their lifecycle.

The Solution

Embed evaluation as a core driver of agent design, unifying offline (development-time) and online (runtime) evaluation in a closed feedback loop.

Deep Dive

Overview

Evaluation-Driven Development and Operations (EDDOps) is a process model that integrates evaluation throughout the LLM agent lifecycle. Rather than treating evaluation as a final checkpoint, EDDOps makes it central to design, development, and operations.

The EDDOps Lifecycle

┌─────────────────────────────────────────────┐
│                 DESIGN PHASE                │
│   Requirements → Evaluation Criteria        │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│               DEVELOPMENT PHASE             │
│   Build → Evaluate → Iterate                │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│               DEPLOYMENT PHASE              │
│   Deploy → Monitor → Evaluate → Improve     │
└─────────────────────┬───────────────────────┘
                      │
                      └──────► Feedback Loop

Six Evaluation Drivers

D1: Lifecycle Coverage

Span pre-deployment, post-deployment, and continuous operation. Don't just evaluate before launch—monitor throughout.

D2: Metric Mix Beyond Aggregates

Combine:

  • End-to-end outcomes (task success rate)
  • Intermediate step-level checks (tool call accuracy)
  • Slice-aware analysis (performance by user segment)
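
As a minimal sketch of this metric mix, the snippet below scores a batch of evaluation records on end-to-end success, step-level tool-call accuracy, and per-segment slices. The record fields (task_success, tool_calls_ok, segment) are illustrative, not a prescribed schema.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalRecord:
    task_success: bool        # end-to-end outcome
    tool_calls_ok: int        # intermediate tool calls that passed step-level checks
    tool_calls_total: int
    segment: str              # user segment, for slice-aware analysis

def metric_mix(records: list[EvalRecord]) -> dict:
    """Combine aggregate, step-level, and slice-aware metrics in one report."""
    report = {
        "task_success_rate": sum(r.task_success for r in records) / len(records),
        "tool_call_accuracy": sum(r.tool_calls_ok for r in records)
                              / max(1, sum(r.tool_calls_total for r in records)),
    }
    # Slice-aware analysis: break task success down by user segment.
    by_segment: dict[str, list[EvalRecord]] = defaultdict(list)
    for r in records:
        by_segment[r.segment].append(r)
    report["success_by_segment"] = {
        seg: sum(r.task_success for r in rs) / len(rs) for seg, rs in by_segment.items()
    }
    return report

print(metric_mix([
    EvalRecord(True, 4, 4, "enterprise"),
    EvalRecord(False, 2, 3, "free_tier"),
    EvalRecord(True, 3, 3, "free_tier"),
]))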

D3: System-Level Anchor

Anchor evaluation in system behavior, not isolated model performance. Use model-level probes to explain system-level outcomes.

D4: Temporal Awareness

Track how performance changes over time:

  • Concept drift detection
  • Seasonal patterns
  • Degradation alerts
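
One lightweight way to make this concrete is a rolling comparison of recent scores against a historical baseline, alerting when the gap exceeds a tolerance. The sketch below assumes one success-rate reading per day; the window sizes and tolerance are illustrative defaults, not recommendations.

from statistics import mean

def degradation_alert(daily_success_rates: list[float],
                      baseline_window: int = 28,
                      recent_window: int = 7,
                      tolerance: float = 0.05) -> bool:
    """Flag degradation when the recent average falls well below the baseline.

    daily_success_rates is ordered oldest -> newest, one value per day.
    """
    if len(daily_success_rates) < baseline_window + recent_window:
        return False  # not enough history to compare yet
    baseline = mean(daily_success_rates[-(baseline_window + recent_window):-recent_window])
    recent = mean(daily_success_rates[-recent_window:])
    return (baseline - recent) > tolerance

# Example: a slow decline over the last week triggers the alert.
history = [0.92] * 28 + [0.90, 0.88, 0.87, 0.85, 0.84, 0.83, 0.82]
print(degradation_alert(history))  # True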

D5: Multi-Stakeholder Perspectives

Different stakeholders care about different metrics:

  • Users: Task completion, response time
  • Ops: Error rates, resource usage
  • Business: Cost, conversion rates
  • Safety: Harm incidents, policy violations
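
In practice this can start as a simple mapping from stakeholder to the metrics they track, so a single evaluation run can be reported from each perspective. The metric names below are placeholders.

# Illustrative mapping from stakeholder to the metrics they care about.
STAKEHOLDER_VIEWS = {
    "users":    ["task_completion_rate", "p95_response_time_s"],
    "ops":      ["error_rate", "gpu_utilization"],
    "business": ["cost_per_task_usd", "conversion_rate"],
    "safety":   ["harm_incidents", "policy_violations"],
}

def report_for(stakeholder: str, metrics: dict[str, float]) -> dict[str, float]:
    """Project one shared metrics dict onto a single stakeholder's view."""
    return {name: metrics[name] for name in STAKEHOLDER_VIEWS[stakeholder] if name in metrics}

run_metrics = {"task_completion_rate": 0.87, "error_rate": 0.02, "cost_per_task_usd": 0.011}
print(report_for("users", run_metrics))  # {'task_completion_rate': 0.87}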

D6: Continuous Improvement Loop

Evaluation findings should automatically trigger:

  • Prompt refinements
  • Agent architecture changes
  • Training data updates
  • Threshold adjustments
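
A minimal sketch of that wiring: each rule maps an evaluation finding to a follow-up task such as a prompt refinement or a threshold adjustment. The finding names, thresholds, and action labels are assumptions for illustration.

from typing import Callable

# Each rule: (name, predicate over eval findings, action to enqueue).
ImprovementRule = tuple[str, Callable[[dict], bool], str]

RULES: list[ImprovementRule] = [
    ("low_task_success", lambda f: f["task_success_rate"] < 0.85, "refine_prompt"),
    ("tool_errors",      lambda f: f["tool_call_accuracy"] < 0.95, "review_tool_schemas"),
    ("segment_gap",      lambda f: f["worst_segment_success"] < 0.70, "add_training_data"),
    ("noisy_guardrail",  lambda f: f["false_block_rate"] > 0.05, "adjust_thresholds"),
]

def improvement_actions(findings: dict) -> list[str]:
    """Turn one round of evaluation findings into a queue of improvement tasks."""
    return [action for _name, triggered, action in RULES if triggered(findings)]

findings = {
    "task_success_rate": 0.81,
    "tool_call_accuracy": 0.97,
    "worst_segment_success": 0.66,
    "false_block_rate": 0.02,
}
print(improvement_actions(findings))  # ['refine_prompt', 'add_training_data']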

Evaluation Types

Offline Evaluation

  • Benchmark suites: Standardized test sets
  • Regression testing: Compare new vs. old versions
  • Adversarial testing: Red-team probing
  • Human evaluation: Expert review of samples
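
Regression testing, for example, can be as simple as running the old and new agent versions over the same benchmark suite and rejecting the new version if the average grade drops by more than an allowed margin. The agent and grader callables below are stand-ins, not a specific SDK.

from typing import Callable

# A "grader" scores one (input, output) pair; an "agent" maps input -> output.
Agent = Callable[[str], str]
Grader = Callable[[str, str], float]

def benchmark(agent: Agent, suite: list[str], grade: Grader) -> float:
    """Average grade of an agent over a fixed benchmark suite."""
    return sum(grade(x, agent(x)) for x in suite) / len(suite)

def regression_check(old: Agent, new: Agent, suite: list[str],
                     grade: Grader, max_drop: float = 0.02) -> bool:
    """Return True when the new version has not regressed beyond max_drop."""
    return benchmark(new, suite, grade) >= benchmark(old, suite, grade) - max_drop

# Toy example: exact-match grading on a two-item suite.
suite = ["2+2", "capital of France"]
answers = {"2+2": "4", "capital of France": "Paris"}
grade = lambda x, y: 1.0 if answers[x].lower() == y.lower() else 0.0
old_agent = lambda x: answers[x]                            # always right
new_agent = lambda x: "Paris" if "capital" in x else "5"    # regressed on math
print(regression_check(old_agent, new_agent, suite, grade))  # False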

Online Evaluation

  • A/B testing: Compare variants in production
  • Shadow mode: Run new agents in parallel without serving their outputs to users
  • Canary deployment: Gradual rollout with monitoring
  • Real-time metrics: Live dashboards and alerts
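
Shadow mode, for instance, sends every request to the production agent as usual while also invoking the candidate in the background and logging its answer for later comparison; only the production answer is served. The sketch below is synchronous for brevity and uses placeholder agents.

from typing import Callable

Agent = Callable[[str], str]
shadow_log: list[dict] = []  # in practice: a proper telemetry sink

def serve_with_shadow(request: str, prod: Agent, candidate: Agent) -> str:
    """Serve the production answer; record the candidate's answer without serving it."""
    served = prod(request)
    try:
        shadowed = candidate(request)        # never returned to the user
    except Exception as exc:                 # candidate failures must not affect users
        shadowed = f"<error: {exc}>"
    shadow_log.append({"request": request, "served": served, "shadow": shadowed})
    return served

prod = lambda q: "blue"
candidate = lambda q: "azure"
print(serve_with_shadow("favorite color?", prod, candidate))  # "blue"
print(shadow_log[-1]["shadow"])                               # "azure"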

Agent-Specific Evaluations

Skill Evaluation

Test individual agent capabilities in isolation.
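
Skill evaluations read like ordinary unit tests: each one exercises a single capability with controlled inputs and no surrounding orchestration. The extract_due_date skill below is a made-up example.

import re
from datetime import date

def extract_due_date(text: str) -> date | None:
    """Toy skill under test: pull an ISO date (YYYY-MM-DD) out of free text."""
    match = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)

def test_extract_due_date() -> None:
    # Isolated checks on one skill, independent of planning, routing, or tools.
    assert extract_due_date("Ship the report by 2025-03-31.") == date(2025, 3, 31)
    assert extract_due_date("No deadline was mentioned.") is None

test_extract_due_date()
print("skill checks passed")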

Trace Analysis

Review complete execution traces for:

  • Planning quality
  • Tool use correctness
  • Error handling
  • Context management
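
These reviews can be partly automated. Given a recorded trace as a list of steps, simple structural checks can flag issues such as a tool call with no matching result or an error that was never handled. The step schema below is illustrative, not a standard format.

def check_trace(trace: list[dict]) -> list[str]:
    """Return a list of issues found in one execution trace (empty = clean)."""
    issues = []
    call_ids = {s["id"] for s in trace if s["type"] == "tool_call"}
    result_ids = {s["call_id"] for s in trace if s["type"] == "tool_result"}
    for missing in sorted(call_ids - result_ids):
        issues.append(f"tool call {missing} has no result (tool use correctness)")
    errors = [s for s in trace if s["type"] == "tool_result" and s.get("is_error")]
    handled = any(s["type"] == "recovery" for s in trace)
    if errors and not handled:
        issues.append("tool error occurred but no recovery step followed (error handling)")
    if trace and trace[0]["type"] != "plan":
        issues.append("trace does not start with an explicit plan (planning quality)")
    return issues

trace = [
    {"type": "plan", "content": "look up order, then refund"},
    {"type": "tool_call", "id": "t1", "name": "lookup_order"},
    {"type": "tool_result", "call_id": "t1", "is_error": True},
    {"type": "tool_call", "id": "t2", "name": "refund"},
]
print(check_trace(trace))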

Router Evaluation

For systems with branching logic:

  • Routing accuracy
  • Fallback behavior
  • Edge case handling
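
Routing accuracy can be measured against a labeled set of requests with expected branches, and fallback behavior checked by how often out-of-scope requests actually reach the fallback branch. The toy router and labels below are assumptions for illustration.

from typing import Callable

Router = Callable[[str], str]  # maps a request to a branch name

def routing_metrics(router: Router, labeled: list[tuple[str, str]]) -> dict[str, float]:
    """Routing accuracy plus fallback recall on a labeled request set."""
    correct = sum(router(req) == expected for req, expected in labeled)
    oos = [(req, exp) for req, exp in labeled if exp == "fallback"]
    fallback_hits = sum(router(req) == "fallback" for req, _ in oos)
    return {
        "routing_accuracy": correct / len(labeled),
        "fallback_recall": fallback_hits / len(oos) if oos else 1.0,
    }

# Toy keyword router and a small labeled set (includes an edge case).
def router(req: str) -> str:
    if "refund" in req.lower():
        return "billing"
    if "password" in req.lower():
        return "account"
    return "fallback"

labeled = [
    ("I want a refund", "billing"),
    ("reset my password", "account"),
    ("my card was charged twice", "billing"),   # edge case: no obvious keyword
    ("what's the weather", "fallback"),
]
print(routing_metrics(router, labeled))  # {'routing_accuracy': 0.75, 'fallback_recall': 1.0}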

Tooling Ecosystem

Tool           Strength
OpenAI Evals   Custom eval frameworks
DeepEval       Unit testing for LLMs
InspectAI      Multi-turn evaluation
Phoenix        Trace analysis
GALILEO        Production monitoring

Implementation Checklist

  • Define evaluation criteria during requirements
  • Build benchmark suites before implementation
  • Instrument agents for trace collection
  • Set up continuous evaluation pipelines
  • Create feedback loops to development
  • Monitor production metrics with alerting
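
For the continuous-evaluation item above, a minimal pipeline gate runs the benchmark on every build and blocks deployment when any tracked metric misses its threshold. The metric names and thresholds here are placeholders.

# Minimal continuous-evaluation gate: block deployment when any tracked
# metric falls outside its threshold. Names and limits are placeholders.
THRESHOLDS = {
    "task_success_rate": 0.85,
    "tool_call_accuracy": 0.95,
    "policy_violation_rate": 0.01,   # treated as an upper bound below
}

def gate(metrics: dict[str, float]) -> tuple[bool, list[str]]:
    failures = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation run")
        elif name.endswith("_violation_rate") and value > limit:
            failures.append(f"{name}: {value:.3f} above limit {limit:.3f}")
        elif not name.endswith("_violation_rate") and value < limit:
            failures.append(f"{name}: {value:.3f} below threshold {limit:.3f}")
    return (not failures, failures)

ok, failures = gate({"task_success_rate": 0.88, "tool_call_accuracy": 0.93,
                     "policy_violation_rate": 0.0})
print(ok, failures)   # False ['tool_call_accuracy: 0.930 below threshold 0.950']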

Considerations

Invest in evaluation infrastructure early. The cost of retrofitting evaluation is much higher than building it in from the start.