AI Agent Evaluation Glossary
Key terms and concepts for understanding AI agent evaluation, reputation, and trust.
107 terms
A
A/B Testing
Comparing two versions of an agent or system by randomly assigning users to each version and measuring outcome differences.
AI Governance
The frameworks, policies, and processes for managing AI systems throughout their lifecycle.
Ablation Study
Systematic removal or modification of system components to understand their contribution to overall performance.
Access Control
Mechanisms that determine what resources, tools, or actions an agent is permitted to use.
Adversarial Input
Carefully crafted inputs designed to cause AI systems to make mistakes they wouldn't make on normal inputs.
Agent
An AI system that can perceive its environment, make decisions, and take actions to achieve goals with some degree of autonomy.
Agent Card
A standardized description of an agent's capabilities, limitations, and intended use cases.
Agent Communication
The protocols and formats by which agents exchange information, requests, and results.
Agent Handoff
The transfer of a conversation or task from one agent to another, including relevant context.
Agent Loop
The iterative cycle where an agent observes state, decides on actions, executes them, and repeats until task completion.
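A minimal sketch of the loop in Python, assuming hypothetical `call_model` and `run_tool` helpers rather than any real API:
```python
# Agent loop sketch: observe -> decide -> act, repeated until done.
# call_model and run_tool are hypothetical stand-ins for a model client
# and a tool executor.
def agent_loop(task, call_model, run_tool, max_steps=10):
    history = [f"Task: {task}"]
    for _ in range(max_steps):                  # step cap prevents runaway loops
        decision = call_model("\n".join(history))       # decide the next action
        if decision.get("done"):                        # model signals completion
            return decision["answer"]
        result = run_tool(decision["tool"], decision["args"])  # act
        history.append(f"Observation: {result}")               # observe
    return "Stopped: step budget exhausted"
```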
Agent-to-Agent Protocol
Standardized communication formats and patterns for agents to interact with each other.
Agentic AI
AI systems designed to take autonomous actions toward goals, as opposed to purely responding to prompts.
Alignment
The degree to which an AI system's goals, behaviors, and values match those intended by its designers and users.
Anthropic
An AI safety company that develops the Claude family of AI assistants and conducts research on AI alignment.
Attention Mechanism
The core innovation in transformers that allows models to weigh the relevance of different parts of the input.
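As a rough illustration, scaled dot-product attention in NumPy; real transformers add multiple heads, masking, and learned projections on top of this core operation:
```python
import numpy as np

def attention(Q, K, V):
    # Q: (n_queries, d), K and V: (n_keys, d) -- softmax(Q K^T / sqrt(d)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted mix of values
```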
Audit Trail
A chronological record of agent actions, decisions, and their outcomes for accountability and debugging.
Autonomous Agent
An AI agent capable of operating independently over extended periods to achieve complex goals with minimal human intervention.
B
Benchmark
A standardized test suite designed to measure specific capabilities of AI systems, enabling comparison across models and versions.
C
Calibration
The alignment between an agent's expressed confidence and its actual accuracy—a well-calibrated agent is right 80% of the time when it says it's 80% confident.
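One simple way to check this is to bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy; a sketch, assuming a hypothetical list of (confidence, was_correct) pairs:
```python
def calibration_report(preds, n_bins=10):
    # preds: hypothetical list of (confidence in [0, 1], was_correct) pairs
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    for i, bucket in enumerate(bins):
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            # well calibrated when these two numbers track each other
            print(f"bin {i}: confidence {avg_conf:.2f}, accuracy {accuracy:.2f}")
```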
Canary Deployment
Gradually rolling out agent changes to a small subset of users before full deployment.
Capability Discovery
The process by which one agent learns what another agent can do, enabling dynamic collaboration.
Capability Elicitation
Techniques to determine what an AI system can actually do, potentially uncovering hidden capabilities.
Cascading Failure
When an error in one agent or component triggers failures in dependent agents, amplifying the impact.
Catastrophic Forgetting
When an agent loses previously learned capabilities after being trained on new tasks or data.
Chain-of-Thought
A prompting technique where the model explicitly shows intermediate reasoning steps before reaching a conclusion.
Compound AI System
A system combining multiple AI models, retrievers, tools, and logic into an integrated application.
Consensus
Agreement among multiple agents on a decision, result, or state, often required for collective action.
Consensus Evaluation
An evaluation pattern where multiple judges (human or AI) must agree before a result is accepted.
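A minimal majority-vote sketch; the two-thirds threshold and the escalate-on-disagreement fallback are illustrative choices, not a standard:
```python
from collections import Counter

def consensus(verdicts, threshold=2 / 3):
    # verdicts: e.g. ["pass", "pass", "fail"] from three judges
    winner, count = Counter(verdicts).most_common(1)[0]
    if count / len(verdicts) >= threshold:
        return winner    # quorum reached; accept the result
    return None          # no consensus; escalate (e.g. to a human reviewer)
```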
Constitutional AI
An approach to training AI systems to follow a set of principles (a "constitution") for safer behavior.
Containment
Limiting an agent's ability to affect systems and data beyond what is necessary for its task.
Context Confusion
When an agent misinterprets which parts of its context apply to the current task, mixing up instructions or data.
Context Window
The maximum amount of text (measured in tokens) that an LLM can process in a single interaction.
Continuous Monitoring
Ongoing observation of agent behavior and performance to detect degradation, drift, or anomalies.
Coordinator Agent
An agent responsible for assigning tasks, managing workflow, and aggregating results from other agents.
Cost Per Task
The total computational and API costs required to complete a single agent task.
D
Data Leakage
When an agent inadvertently exposes sensitive information from its training data, context, or connected systems.
Deceptive Alignment
A hypothetical failure mode where an agent behaves well during training/testing but pursues different goals when deployed.
Delegation
When one agent assigns a task to another agent, transferring responsibility for completion.
Drift
Gradual degradation of agent performance over time due to changes in data, environment, or the agent itself.
E
Embedding
A dense vector representation of text that captures semantic meaning, enabling similarity comparisons.
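Similarity between embeddings is typically measured with cosine similarity; a NumPy sketch:
```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), ~0 = unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```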
Emergent Behavior
Capabilities or behaviors that appear in AI systems at scale without being explicitly programmed.
Evaluation
A single assessment event where an agent's performance is measured against specific criteria.
Explainability
The ability to understand and communicate why an agent made a particular decision or produced a specific output.
F
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both concerns.
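Computed from raw counts (see Precision and Recall below); a sketch without the divide-by-zero guards production code would need:
```python
def f1_score(tp, fp, fn):
    # tp/fp/fn: true positives, false positives, false negatives
    precision = tp / (tp + fp)   # of flagged items, how many were right
    recall = tp / (tp + fn)      # of true items, how many were found
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=40))  # precision 0.80, recall ~0.67 -> ~0.73
```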
Few-Shot Learning
Providing a small number of examples in the prompt to demonstrate desired behavior.
Fine-Tuning
Additional training of a pre-trained model on domain-specific data to improve performance on particular tasks.
Foundation Model
A large AI model trained on broad data that can be adapted to many downstream tasks.
Function Calling
A structured mechanism for LLMs to invoke predefined functions with properly formatted arguments.
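A sketch of the pattern; the schema shape here is illustrative, not any particular vendor's format:
```python
import json

# The application advertises a tool schema to the model...
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {"city": {"type": "string"}},
}

# ...and the model replies with a structured call, which the application
# validates before executing.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = json.loads(model_output)
assert call["name"] == weather_tool["name"]
```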
G
Goal Misgeneralization
When an agent learns to pursue a goal that worked in training but fails to transfer correctly to deployment.
Ground Truth
The verified correct answer or outcome against which agent outputs are compared during evaluation.
Grounding
Connecting AI outputs to verifiable sources of truth to reduce hallucination and increase accuracy.
Guardrails
Safety constraints that prevent agents from taking harmful or unauthorized actions, even if instructed to do so.
H
Hallucination
When an AI generates plausible-sounding but factually incorrect or fabricated information.
Held-Out Test Set
Evaluation data kept separate from training to assess how well an agent generalizes to unseen examples.
Human-in-the-Loop
A system design where human oversight is required at critical decision points in an agent workflow.
I
In-Context Learning
The ability of LLMs to learn from examples provided in the prompt without updating model weights.
Incident Response
The process of detecting, investigating, and recovering from agent failures or harmful behaviors.
Inference Cost
The computational and financial expense of running an AI model to generate outputs.
Inter-Rater Reliability
The degree to which different human evaluators agree when assessing the same agent outputs.
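Often reported as Cohen's kappa, which corrects raw agreement for chance; a two-rater sketch (with no guard for the degenerate case where chance agreement is 1):
```python
from collections import Counter

def cohens_kappa(a, b):
    # a, b: label lists from two raters, e.g. ["pass", "fail", ...]
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)
```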
J
Jailbreak
A prompt technique designed to bypass an AI system's safety measures or content policies.
L
LLM-as-Judge
Using a large language model to evaluate another agent's outputs, replacing or supplementing human evaluation.
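A minimal sketch, with `call_model` as a hypothetical stand-in for whatever client is in use; real judge prompts are usually far more detailed rubrics:
```python
JUDGE_PROMPT = """You are an evaluator. Score the answer 1-5 for factual
accuracy against the reference. Reply with only the number.

Question: {question}
Reference: {reference}
Answer: {answer}"""

def judge(call_model, question, reference, answer):
    reply = call_model(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return int(reply.strip())   # production code should validate and retry
```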
Large Language Model
A neural network trained on vast text data that can generate, understand, and reason about natural language.
Latent Space
The internal representation space where models encode meaning, enabling operations like similarity search.
M
Memory
Mechanisms that allow agents to retain and recall information across interactions or within long tasks.
Mode Collapse
When an agent converges to producing a limited set of repetitive outputs regardless of input variety.
Model Context Protocol
A standard protocol for providing context and tools to AI models in a consistent, interoperable way.
Model Risk Management
Systematic processes for identifying, measuring, and mitigating risks from AI/ML models.
Multi-Agent System
A system composed of multiple interacting agents that collaborate, compete, or coordinate to accomplish tasks.
O
OpenAI
An AI research company that created ChatGPT and GPT-4 and pioneered many modern AI agent capabilities.
Orchestration
Coordinating multiple agents, tools, or processing steps to accomplish complex tasks.
P
Pass@k
Evaluation metric measuring the probability that at least one of k generated solutions is correct.
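The unbiased estimator popularized by the HumanEval paper: from n samples of which c are correct, the probability that a random draw of k contains at least one correct solution:
```python
from math import comb

def pass_at_k(n, c, k):
    # pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0   # too few incorrect samples: any draw of k must hit a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=2))  # ~0.53
```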
Planning
The agent capability to decompose complex goals into sequences of achievable sub-tasks.
Precision
The proportion of positive predictions that are actually correct—of all the things the agent said were true, how many actually were.
Prompt Engineering
The practice of designing and optimizing inputs to LLMs to elicit desired behaviors and outputs.
Prompt Injection
An attack where malicious instructions are embedded in user input to override or manipulate an agent's intended behavior.
Prompt Injection Defense
Techniques and architectures designed to prevent prompt injection attacks from succeeding.
R
RLHF
Reinforcement Learning from Human Feedback—training AI models using human preferences as the reward signal.
Rate Limiting
Controlling how frequently agents can perform actions or consume resources to prevent abuse or runaway costs.
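A classic implementation is the token bucket, sketched here: each action spends a token, and tokens refill at a fixed rate, bounding both bursts and sustained throughput:
```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity       # tokens/sec, burst size
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # spend a token for this action
            return True
        return False           # over the limit; caller should back off
```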
ReAct
A prompting framework combining Reasoning and Acting, where agents alternate between thinking about what to do and taking actions.
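A sketch of the text protocol; `call_model` and `run_tool` are hypothetical stand-ins, and the exact line format varies by implementation:
```python
import re

def react(task, call_model, run_tool, max_steps=8):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_model(transcript)   # e.g. "Thought: ...\nAction: search[foo]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if match:                       # act, then feed the observation back
            obs = run_tool(match.group(1), match.group(2))
            transcript += f"Observation: {obs}\n"
    return None                         # step budget exhausted
```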
Reasoning
The ability of AI systems to draw logical conclusions, solve problems, and think through multi-step challenges.
Recall
The proportion of actual positives that were correctly identified—of all the things that were true, how many did the agent find.
Red Teaming
Adversarial testing where evaluators actively try to make an AI system fail, misbehave, or produce harmful outputs.
Reflection
The practice of having an agent review and critique its own outputs to identify errors or improvements.
Reputation
The accumulated picture of an agent's performance across many scenarios over time, based on verifiable evaluation history.
Responsible AI
Practices and principles for developing and deploying AI systems that are safe, fair, transparent, and beneficial.
Retrieval-Augmented Generation
An architecture that enhances LLM responses by first retrieving relevant information from external knowledge sources.
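The shape of the pattern, with `search` and `call_model` as hypothetical stand-ins for a retriever and a model client:
```python
def rag_answer(question, search, call_model, top_k=3):
    passages = search(question, top_k=top_k)   # 1. retrieve relevant passages
    context = "\n\n".join(passages)
    prompt = (f"Answer using only the context below. If the answer is not "
              f"there, say so.\n\nContext:\n{context}\n\nQuestion: {question}")
    return call_model(prompt)                  # 2. generate a grounded answer
```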
Reward Hacking
When an agent finds unintended ways to maximize its reward signal without achieving the underlying goal.
Reward Model
A model trained to predict human preferences, used to guide AI training via reinforcement learning.
Routing
The process of directing tasks to appropriate agents based on task requirements and agent capabilities.
S
Safety Layer
A component specifically designed to detect and prevent harmful agent behaviors before they affect users or systems.
Sandbagging
When an AI system deliberately underperforms on evaluations while retaining hidden capabilities.
Scaling Laws
Empirical relationships showing how AI capabilities improve predictably with increased compute, data, or parameters.
Shadow Mode
Running a new agent version alongside production without affecting users, to validate behavior before full deployment.
Specialist Agent
An agent optimized for a specific task type or domain, trading generality for expertise.
Specification Gaming
When an agent finds unintended ways to satisfy its objective that violate the spirit of the task.
Swarm Intelligence
Collective behavior emerging from many simple agents following local rules, without centralized control.
Sycophancy
A failure mode where an agent agrees with or validates user inputs even when incorrect, prioritizing approval over accuracy.
System Prompt
Initial instructions that define an agent's role, capabilities, constraints, and behavioral guidelines.
T
Temperature
A parameter controlling randomness in LLM outputs—higher temperature means more varied/creative responses.
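Mechanically, temperature divides the logits before the softmax, so T < 1 sharpens the distribution and T > 1 flattens it; a NumPy sketch:
```python
import numpy as np

def sample_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```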
Token
The basic unit of text processing for LLMs—roughly 4 characters or 0.75 words in English.
Tool Misuse
When an agent uses available tools incorrectly, such as calling the wrong function, passing bad arguments, or invoking tools unnecessarily.
Tool Use
The ability of an agent to invoke external functions, APIs, or services to extend its capabilities beyond text generation.
Transformer
The neural network architecture underlying modern LLMs, based on self-attention mechanisms.
Trust Signal
Observable evidence that influences trust decisions about an agent's reliability or capability.
U
Uncertainty Quantification
Methods for measuring and communicating how confident an agent is in its outputs.
V
Vector Database
A database optimized for storing and querying high-dimensional vectors, typically embeddings.
Versioning
Tracking and managing different versions of agents, models, and prompts to enable rollback and comparison.
Z
Zero-Shot Learning
Performing tasks without any task-specific examples, relying only on instructions and pre-trained knowledge.