
Agent Evaluation

Quick nav: Why Evaluate? · Metrics to Track · Implementation · Example · Tools


Why Evaluate?

When you create custom agents in .claude/agents/, you’re encoding specialized expertise into reusable workflows. But how do you know whether your agents are actually effective?

Without evaluation, you’re building blind:

  • ❌ No way to measure if agent responses are improving or degrading over time
  • ❌ Can’t compare different agent configurations objectively
  • ❌ Difficult to identify which aspects of agent context/instructions need refinement
  • ❌ No data to justify investment in agent development

With evaluation, you iterate with confidence:

  • ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
  • ✅ A/B test different agent configurations with measurable outcomes
  • ✅ Identify patterns in successful vs failed interactions
  • ✅ Build feedback loops for continuous improvement

Core principle: Agents are code. Like all code, they need tests, metrics, and observability.


Metrics to Track

Response Quality

What to measure:

  • Task completion rate: Did the agent accomplish the stated goal?
  • Correctness: Were the agent’s outputs factually accurate?
  • Relevance: Did the response stay on-topic and address the actual question?
  • Hallucination rate: How often did the agent invent information?

How to track:

.claude/hooks/log-response-quality.sh

# Triggered after each agent response
# Log structure (correctness_score is a 1-5 user rating):
{
  "timestamp": "2026-02-10T14:32:00Z",
  "agent_id": "backend-architect",
  "task_completed": true,
  "correctness_score": 4.5,
  "hallucinations": 0,
  "response_tokens": 1250
}

Implementation tip: Use user feedback prompts (thumbs up/down) or automated checks (for example, whether the test suite still passes after agent code generation).
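For the automated-check route, a short post-generation step can run the project’s test suite and record whether the agent’s change left it green. This is a minimal sketch, assuming an npm-based test suite and a hypothetical .claude/logs/response-quality.jsonl log file:

Terminal window

# Sketch: record whether the test suite still passes after agent-generated code
# (assumes `npm test`; the log path is illustrative, not prescribed by Claude Code)
if npm test --silent; then
  TASK_COMPLETED=true
else
  TASK_COMPLETED=false
fi
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent_id\":\"backend-architect\",\"task_completed\":$TASK_COMPLETED}" \
  >> .claude/logs/response-quality.jsonl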


Tool Usage

What to measure:

  • Tool call success rate: Percentage of tool calls that executed without errors
  • Tool selection accuracy: Did agent choose the right tool for the task?
  • Tool call efficiency: Minimum calls to achieve goal (avoid unnecessary reads/searches)
  • Error recovery: Did agent handle tool failures gracefully?

How to track:

.claude/hooks/log-tool-usage.sh

# Triggered after each tool call
# Log structure:
{
  "timestamp": "2026-02-10T14:32:05Z",
  "agent_id": "backend-architect",
  "tool_name": "Read",
  "tool_success": true,
  "tool_parameters": {"file_path": "src/auth.ts"},
  "execution_time_ms": 45
}

Implementation tip: Use Claude Code hooks system (see examples/hooks/) to automatically log tool calls.
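Hooks only fire once they are registered. One way to wire the logging script into Claude Code is a PostToolUse entry in .claude/settings.json; treat this as a sketch of the schema at the time of writing and check the current hooks documentation before relying on it:

Terminal window

# Register the logging script as a PostToolUse hook (sketch; this overwrites an
# existing .claude/settings.json, so merge by hand if you already have one)
cat > .claude/settings.json <<'EOF'
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/log-tool-usage.sh" }
        ]
      }
    ]
  }
}
EOF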


Performance and Cost

What to measure:

  • Response time: Total time from user prompt to complete response
  • Token efficiency: Input/output tokens used per task
  • Context utilization: How much of context window was used?
  • Cost per task: API cost for the full interaction

How to track:

.claude/hooks/log-performance.sh

# Triggered at end of session
# Log structure:
{
  "timestamp": "2026-02-10T14:35:00Z",
  "agent_id": "backend-architect",
  "session_duration_s": 180,
  "input_tokens": 3500,
  "output_tokens": 2800,
  "total_cost_usd": 0.15,
  "context_utilization": 0.42
}

Implementation tip: Parse Claude Code session logs or use MCP observability tools.
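Once a few sessions are logged, a one-liner can turn the raw entries into a cost and latency summary. A minimal sketch, assuming the log structure above is appended to a hypothetical .claude/logs/performance.jsonl:

Terminal window

# Summarize session cost and latency from the performance log (illustrative path)
jq -s '{
  sessions: length,
  avg_duration_s: (map(.session_duration_s) | add / length),
  total_cost_usd: (map(.total_cost_usd) | add),
  avg_context_utilization: (map(.context_utilization) | add / length)
}' .claude/logs/performance.jsonl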


User Satisfaction

What to measure:

  • Explicit feedback: User ratings, comments, bug reports
  • Implicit signals: Did user accept agent’s suggestions? Did they retry the prompt?
  • Adoption rate: How often is this agent used vs alternatives?
  • Retention: Do users return to this agent for similar tasks?

How to track:

Terminal window

# Manual feedback collection
# After agent completes task, prompt user:
"Rate this agent's performance (1-5): _"

# Log:
{
  "timestamp": "2026-02-10T14:35:10Z",
  "agent_id": "backend-architect",
  "user_rating": 5,
  "user_comment": "Perfect analysis of auth flow",
  "would_use_again": true
}

Implementation tip: Add feedback prompts to agent templates or use post-session surveys.


Implementation

Strategy 1: Automated Logging with Hooks

Use Case: Automatically track all agent interactions without manual intervention.

Setup:

.claude/hooks/post-tool-use.sh

#!/bin/bash
# Triggered after every tool call
# These variables arrive as plain strings, so no jq parsing is needed
AGENT_ID="$CLAUDE_AGENT_ID"
TOOL_NAME="$CLAUDE_TOOL_NAME"
TOOL_SUCCESS="$CLAUDE_TOOL_SUCCESS"

# Append to metrics log
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
  >> .claude/logs/agent-metrics.jsonl

Pros: Zero manual overhead, complete coverage, time-series data.

Cons: Requires parsing Claude Code environment variables (which may change across versions).
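The resulting agent-metrics.jsonl is easy to aggregate. A minimal sketch that reports call volume and success rate per tool, assuming the log format written by the hook above:

Terminal window

# Per-tool call counts and success rates from the hook's JSONL log
jq -s 'group_by(.tool)
       | map({tool: .[0].tool,
              calls: length,
              success_rate: ((map(select(.success)) | length) / length)})' \
  .claude/logs/agent-metrics.jsonl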


Strategy 2: Agent Unit Tests

Use Case: Regression testing to ensure agent improvements don’t break existing capabilities.

Setup:

tests/agents/backend-architect.test.sh

#!/bin/bash
# Test 1: Agent correctly identifies hexagonal architecture layers
echo "Test: Hexagonal architecture analysis"
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
if echo "$RESULT" | grep -q "domain layer"; then
  echo "✅ PASS: Identified layers"
else
  echo "❌ FAIL: Did not identify layers"
  exit 1
fi

# Test 2: Agent recommends correct patterns
echo "Test: Pattern recommendations"
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
if echo "$RESULT" | grep -q "Result<T, E>"; then
  echo "✅ PASS: Recommended Result pattern"
else
  echo "❌ FAIL: Incorrect pattern"
  exit 1
fi

Pros: Automated, catches regressions, CI/CD integration.

Cons: Requires maintenance, may have false positives/negatives.
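For CI/CD integration, a small runner can execute every agent test suite and fail the build on any regression. A minimal sketch, using a hypothetical tests/agents/run-all.sh wrapper around test scripts like the one above:

tests/agents/run-all.sh

#!/bin/bash
# Run every agent test suite; exit non-zero if any fails (suitable as a CI step)
FAILED=0
for test in tests/agents/*.test.sh; do
  echo "Running $test"
  if ! bash "$test"; then
    FAILED=1
  fi
done
exit $FAILED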


Strategy 3: A/B Testing Agent Configurations

Use Case: Compare two versions of an agent to determine which performs better.

Setup:

# .claude/agents/backend-architect-v1.md (control)
name: backend-architect
version: 1.0
instructions: |
  You are a backend architect specializing in...
  [original instructions]

# .claude/agents/backend-architect-v2.md (experiment)
name: backend-architect-v2
version: 2.0
instructions: |
  You are a backend architect specializing in...
  [modified instructions with new pattern emphasis]

Evaluation:

Terminal window

# Run the same task with both agents and compare metrics
# Task: "Analyze src/auth.ts for security issues"

# Version 1 metrics:
#   - Response time: 45s
#   - Issues found: 3
#   - User rating: 4/5

# Version 2 metrics:
#   - Response time: 38s
#   - Issues found: 5 (2 additional critical issues)
#   - User rating: 5/5

# Conclusion: Version 2 is more thorough and faster → promote to production

Pros: Data-driven decisions, quantifiable improvements.

Cons: Requires discipline to run controlled experiments.
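A tiny harness keeps the comparison controlled: run the identical task against both versions and log the timing, then gather ratings separately. This sketch reuses the claude agent invocation style from the unit-test example above; adapt it to however you actually launch agents, and the log paths are illustrative:

Terminal window

# Run the same task against control and experiment, recording wall-clock time
TASK="Analyze src/auth.ts for security issues"
for AGENT in backend-architect backend-architect-v2; do
  START=$(date +%s)
  claude agent "$AGENT" "$TASK" > ".claude/logs/ab-$AGENT.txt"
  END=$(date +%s)
  echo "{\"agent\":\"$AGENT\",\"task\":\"$TASK\",\"duration_s\":$((END - START))}" \
    >> .claude/logs/ab-test.jsonl
done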


Strategy 4: User Feedback Loop

Use Case: Continuously improve an agent based on real-world usage data.

Setup:

Terminal window

# After the agent completes a task
echo "How would you rate this response? (1-5, or 'skip'): "
read RATING
if [ "$RATING" != "skip" ]; then
  echo "Any specific feedback?: "
  read COMMENT
  # Log feedback
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
    >> .claude/logs/agent-feedback.jsonl
fi

# Weekly: Review agent-feedback.jsonl and identify patterns
# Monthly: Update agent instructions based on aggregated feedback

Pros: Aligns the agent with actual user needs, identifies edge cases.

Cons: Requires manual review and action on feedback.
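The manual-review half of the loop can be narrowed automatically: pull out only the low-rated responses so their comments get read first. A minimal sketch against the agent-feedback.jsonl written above:

Terminal window

# Surface low-rated responses (3 or below) and their comments for weekly review
jq -c 'select(.rating <= 3) | {agent, rating, comment}' \
  .claude/logs/agent-feedback.jsonl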


Example

Full template available: examples/agents/analytics-with-eval/ includes the complete agent definition, hooks, analysis scripts, and a report template.

Setup: Analytics Agent with Built-in Metrics

.claude/agents/analytics-agent.md

---
name: analytics-agent
description: SQL query generator with evaluation hooks
version: 1.0
tools:
  - Read
  - Write
  - Bash
hooks:
  post_response: .claude/hooks/log-analytics-metrics.sh
---

# Analytics Agent

You are an expert SQL analyst helping users query databases.

## Evaluation Criteria

After each query:

1. **Correctness**: Does the query produce the expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?

## Instructions

[... agent instructions ...]

.claude/hooks/log-analytics-metrics.sh

#!/bin/bash
# Triggered after analytics-agent response
# Extract query from response (naive grep; improve with jq)
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')

if [ -n "$QUERY" ]; then
  # Time the query (requires a database connection)
  EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')

  # Check for destructive operations
  if echo "$QUERY" | grep -qiE 'DELETE|DROP|TRUNCATE'; then
    SAFETY="FAIL"
  else
    SAFETY="PASS"
  fi

  # Log metrics
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
    >> .claude/logs/analytics-metrics.jsonl
fi

Terminal window

# Monthly review: Analyze metrics
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
  .claude/logs/analytics-metrics.jsonl

# Output:
# [
#   {"safety": "PASS", "count": 127},
#   {"safety": "FAIL", "count": 3}
# ]

# Action: Review the 3 failed queries and update the agent instructions to prevent future violations

Tools

nao

URL: github.com/getnao/nao

What it provides:

  • Built-in evaluation framework for analytics agents
  • Unit testing capabilities for agent responses
  • Metrics collection (response quality, tool usage, performance)
  • Feedback loop integration

How to adapt for Claude Code:

  • Context builder pattern: Apply nao’s structured context approach to .claude/agents/ config
  • Evaluation hooks: Translate nao’s evaluation framework to Claude Code hooks system
  • Metrics schema: Use nao’s metrics schema as template for your logs

Status: Production-ready, actively maintained, TypeScript + Python


Claude Code integration points:

Hooks system: .claude/hooks/ for automated logging (see examples/hooks/README.md)

Agents directory: .claude/agents/ for custom agent definitions (see guide/ultimate-guide.md Section 4)

MCP observability: Use MCP servers for advanced logging and metrics aggregation


Start small and ramp up over a month:

  • Week 1: Add a basic logging hook (tool calls only)
  • Week 2: Add a user feedback prompt (manual ratings)
  • Week 3: Build a dashboard to visualize metrics
  • Week 4: Run your first A/B test on an agent configuration

Don’t track metrics you won’t act on. Prioritize:

  1. Task completion rate → Refine agent instructions
  2. Tool call errors → Improve context or add examples
  3. User ratings → Identify confusing or unhelpful responses

Manual evaluation doesn’t scale. Use:

  • Hooks for automatic logging
  • CI/CD integration for agent unit tests
  • Scripts for periodic metric aggregation

Metrics are useless without action:

  • Weekly: Review metrics, identify patterns
  • Monthly: Update agent instructions based on data
  • Quarterly: Major agent refactoring if needed


Next steps:

  1. Add logging hook to your most-used agent
  2. Collect 1 week of metrics
  3. Analyze and refine agent based on data

Template: See examples/agents/analytics-with-eval/ for a complete implementation with hooks, scripts, and a report template.