
Agent Evaluation

Quick nav: Why Evaluate? · Metrics to Track · Implementation · Example · Tools


Why Evaluate?

When you create custom agents in .claude/agents/, you’re encoding specialized expertise into reusable workflows. But how do you know whether your agents are actually effective?

Without evaluation, you’re building blind:

  • ❌ No way to measure if agent responses are improving or degrading over time
  • ❌ Can’t compare different agent configurations objectively
  • ❌ Difficult to identify which aspects of agent context/instructions need refinement
  • ❌ No data to justify investment in agent development

With evaluation, you iterate with confidence:

  • ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
  • ✅ A/B test different agent configurations with measurable outcomes
  • ✅ Identify patterns in successful vs failed interactions
  • ✅ Build feedback loops for continuous improvement

Core principle: Agents are code. Like all code, they need tests, metrics, and observability.


Metrics to Track

Response Quality

What to measure:

  • Task completion rate: Did the agent accomplish the stated goal?
  • Correctness: Were the agent’s outputs factually accurate?
  • Relevance: Did the response stay on-topic and address the actual question?
  • Hallucination rate: How often did the agent invent information?

How to track:

.claude/hooks/log-response-quality.sh

# Triggered after each agent response
# Log structure (correctness_score is a 1-5 user rating):
{
  "timestamp": "2026-02-10T14:32:00Z",
  "agent_id": "backend-architect",
  "task_completed": true,
  "correctness_score": 4.5,
  "hallucinations": 0,
  "response_tokens": 1250
}

Implementation tip: Use user feedback prompts (thumbs up/down) or automated checks (for example, whether the test suite still passes after agent code generation).
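For the automated-check route, a short post-generation step can run the project’s test suite and record whether the agent’s change left it green. This is a minimal sketch, assuming an npm-based test suite and a hypothetical .claude/logs/response-quality.jsonl log file:

Terminal window

# Sketch: record whether the test suite still passes after agent-generated code
# (assumes `npm test`; the log path is illustrative, not prescribed by Claude Code)
if npm test --silent; then
  TASK_COMPLETED=true
else
  TASK_COMPLETED=false
fi
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent_id\":\"backend-architect\",\"task_completed\":$TASK_COMPLETED}" \
  >> .claude/logs/response-quality.jsonl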


Tool Usage

What to measure:

  • Tool call success rate: Percentage of tool calls that executed without errors
  • Tool selection accuracy: Did agent choose the right tool for the task?
  • Tool call efficiency: Minimum calls to achieve goal (avoid unnecessary reads/searches)
  • Error recovery: Did agent handle tool failures gracefully?

How to track:

.claude/hooks/log-tool-usage.sh

# Triggered after each tool call
# Log structure:
{
  "timestamp": "2026-02-10T14:32:05Z",
  "agent_id": "backend-architect",
  "tool_name": "Read",
  "tool_success": true,
  "tool_parameters": {"file_path": "src/auth.ts"},
  "execution_time_ms": 45
}

Implementation tip: Use Claude Code hooks system (see examples/hooks/) to automatically log tool calls.
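Hooks only fire once they are registered. One way to wire the logging script into Claude Code is a PostToolUse entry in .claude/settings.json; treat this as a sketch of the schema at the time of writing and check the current hooks documentation before relying on it:

Terminal window

# Register the logging script as a PostToolUse hook (sketch; this overwrites an
# existing .claude/settings.json, so merge by hand if you already have one)
cat > .claude/settings.json <<'EOF'
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/log-tool-usage.sh" }
        ]
      }
    ]
  }
}
EOF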


Performance and Cost

What to measure:

  • Response time: Total time from user prompt to complete response
  • Token efficiency: Input/output tokens used per task
  • Context utilization: How much of context window was used?
  • Cost per task: API cost for the full interaction

How to track:

.claude/hooks/log-performance.sh

# Triggered at end of session
# Log structure:
{
  "timestamp": "2026-02-10T14:35:00Z",
  "agent_id": "backend-architect",
  "session_duration_s": 180,
  "input_tokens": 3500,
  "output_tokens": 2800,
  "total_cost_usd": 0.15,
  "context_utilization": 0.42
}

Implementation tip: Parse Claude Code session logs or use MCP observability tools.
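Once a few sessions are logged, a one-liner can turn the raw entries into a cost and latency summary. A minimal sketch, assuming the log structure above is appended to a hypothetical .claude/logs/performance.jsonl:

Terminal window

# Summarize session cost and latency from the performance log (illustrative path)
jq -s '{
  sessions: length,
  avg_duration_s: (map(.session_duration_s) | add / length),
  total_cost_usd: (map(.total_cost_usd) | add),
  avg_context_utilization: (map(.context_utilization) | add / length)
}' .claude/logs/performance.jsonl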


User Satisfaction

What to measure:

  • Explicit feedback: User ratings, comments, bug reports
  • Implicit signals: Did user accept agent’s suggestions? Did they retry the prompt?
  • Adoption rate: How often is this agent used vs alternatives?
  • Retention: Do users return to this agent for similar tasks?

How to track:

Terminal window

# Manual feedback collection
# After agent completes task, prompt user:
"Rate this agent's performance (1-5): _"

# Log:
{
  "timestamp": "2026-02-10T14:35:10Z",
  "agent_id": "backend-architect",
  "user_rating": 5,
  "user_comment": "Perfect analysis of auth flow",
  "would_use_again": true
}

Implementation tip: Add feedback prompts to agent templates or use post-session surveys.


Implementation

Strategy 1: Automated Logging with Hooks

Use Case: Automatically track all agent interactions without manual intervention.

Setup:

.claude/hooks/post-tool-use.sh

#!/bin/bash
# Triggered after every tool call
# These variables arrive as plain strings, so no jq parsing is needed
AGENT_ID="$CLAUDE_AGENT_ID"
TOOL_NAME="$CLAUDE_TOOL_NAME"
TOOL_SUCCESS="$CLAUDE_TOOL_SUCCESS"

# Append to metrics log
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
  >> .claude/logs/agent-metrics.jsonl

Pros: Zero manual overhead, complete coverage, time-series data.

Cons: Requires parsing Claude Code environment variables (which may change across versions).
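The resulting agent-metrics.jsonl is easy to aggregate. A minimal sketch that reports call volume and success rate per tool, assuming the log format written by the hook above:

Terminal window

# Per-tool call counts and success rates from the hook's JSONL log
jq -s 'group_by(.tool)
       | map({tool: .[0].tool,
              calls: length,
              success_rate: ((map(select(.success)) | length) / length)})' \
  .claude/logs/agent-metrics.jsonl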


Strategy 2: Agent Unit Tests

Use Case: Regression testing to ensure agent improvements don’t break existing capabilities.

Setup:

tests/agents/backend-architect.test.sh

#!/bin/bash
# Test 1: Agent correctly identifies hexagonal architecture layers
echo "Test: Hexagonal architecture analysis"
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
if echo "$RESULT" | grep -q "domain layer"; then
  echo "✅ PASS: Identified layers"
else
  echo "❌ FAIL: Did not identify layers"
  exit 1
fi

# Test 2: Agent recommends correct patterns
echo "Test: Pattern recommendations"
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
if echo "$RESULT" | grep -q "Result<T, E>"; then
  echo "✅ PASS: Recommended Result pattern"
else
  echo "❌ FAIL: Incorrect pattern"
  exit 1
fi

Pros: Automated, catches regressions, CI/CD integration.

Cons: Requires maintenance, may have false positives/negatives.
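For CI/CD integration, a small runner can execute every agent test suite and fail the build on any regression. A minimal sketch, using a hypothetical tests/agents/run-all.sh wrapper around test scripts like the one above:

tests/agents/run-all.sh

#!/bin/bash
# Run every agent test suite; exit non-zero if any fails (suitable as a CI step)
FAILED=0
for test in tests/agents/*.test.sh; do
  echo "Running $test"
  if ! bash "$test"; then
    FAILED=1
  fi
done
exit $FAILED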


Strategy 3: A/B Testing Agent Configurations

Use Case: Compare two versions of an agent to determine which performs better.

Setup:

# .claude/agents/backend-architect-v1.md (control)
name: backend-architect
version: 1.0
instructions: |
  You are a backend architect specializing in...
  [original instructions]

# .claude/agents/backend-architect-v2.md (experiment)
name: backend-architect-v2
version: 2.0
instructions: |
  You are a backend architect specializing in...
  [modified instructions with new pattern emphasis]

Evaluation:

Terminal window

# Run the same task with both agents and compare metrics
# Task: "Analyze src/auth.ts for security issues"

# Version 1 metrics:
#   - Response time: 45s
#   - Issues found: 3
#   - User rating: 4/5

# Version 2 metrics:
#   - Response time: 38s
#   - Issues found: 5 (2 additional critical issues)
#   - User rating: 5/5

# Conclusion: Version 2 is more thorough and faster → promote to production

Pros: Data-driven decisions, quantifiable improvements.

Cons: Requires discipline to run controlled experiments.
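A tiny harness keeps the comparison controlled: run the identical task against both versions and log the timing, then gather ratings separately. This sketch reuses the claude agent invocation style from the unit-test example above; adapt it to however you actually launch agents, and the log paths are illustrative:

Terminal window

# Run the same task against control and experiment, recording wall-clock time
TASK="Analyze src/auth.ts for security issues"
for AGENT in backend-architect backend-architect-v2; do
  START=$(date +%s)
  claude agent "$AGENT" "$TASK" > ".claude/logs/ab-$AGENT.txt"
  END=$(date +%s)
  echo "{\"agent\":\"$AGENT\",\"task\":\"$TASK\",\"duration_s\":$((END - START))}" \
    >> .claude/logs/ab-test.jsonl
done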


Strategy 4: User Feedback Loop

Use Case: Continuously improve an agent based on real-world usage data.

Setup:

Terminal window

# After the agent completes a task
echo "How would you rate this response? (1-5, or 'skip'): "
read RATING
if [ "$RATING" != "skip" ]; then
  echo "Any specific feedback?: "
  read COMMENT
  # Log feedback
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
    >> .claude/logs/agent-feedback.jsonl
fi

# Weekly: Review agent-feedback.jsonl and identify patterns
# Monthly: Update agent instructions based on aggregated feedback

Pros: Aligns the agent with actual user needs, identifies edge cases.

Cons: Requires manual review and action on feedback.
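The manual-review half of the loop can be narrowed automatically: pull out only the low-rated responses so their comments get read first. A minimal sketch against the agent-feedback.jsonl written above:

Terminal window

# Surface low-rated responses (3 or below) and their comments for weekly review
jq -c 'select(.rating <= 3) | {agent, rating, comment}' \
  .claude/logs/agent-feedback.jsonl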


Example

Full template available: examples/agents/analytics-with-eval/ includes the complete agent definition, hooks, analysis scripts, and a report template.

Setup: Analytics Agent with Built-in Metrics

.claude/agents/analytics-agent.md

---
name: analytics-agent
description: SQL query generator with evaluation hooks
version: 1.0
tools:
  - Read
  - Write
  - Bash
hooks:
  post_response: .claude/hooks/log-analytics-metrics.sh
---

# Analytics Agent

You are an expert SQL analyst helping users query databases.

## Evaluation Criteria

After each query:

1. **Correctness**: Does the query produce the expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?

## Instructions

[... agent instructions ...]

.claude/hooks/log-analytics-metrics.sh

#!/bin/bash
# Triggered after analytics-agent response
# Extract query from response (naive grep; improve with jq)
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')

if [ -n "$QUERY" ]; then
  # Time the query (requires a database connection)
  EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')

  # Check for destructive operations
  if echo "$QUERY" | grep -qiE 'DELETE|DROP|TRUNCATE'; then
    SAFETY="FAIL"
  else
    SAFETY="PASS"
  fi

  # Log metrics
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
    >> .claude/logs/analytics-metrics.jsonl
fi

Terminal window

# Monthly review: Analyze metrics
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
  .claude/logs/analytics-metrics.jsonl

# Output:
# [
#   {"safety": "PASS", "count": 127},
#   {"safety": "FAIL", "count": 3}
# ]

# Action: Review the 3 failed queries and update the agent instructions to prevent future violations

Tools

nao

URL: github.com/getnao/nao

What it provides:

  • Built-in evaluation framework for analytics agents
  • Unit testing capabilities for agent responses
  • Metrics collection (response quality, tool usage, performance)
  • Feedback loop integration

How to adapt for Claude Code:

  • Context builder pattern: Apply nao’s structured context approach to .claude/agents/ config
  • Evaluation hooks: Translate nao’s evaluation framework to Claude Code hooks system
  • Metrics schema: Use nao’s metrics schema as template for your logs

Status: Production-ready, actively maintained, TypeScript + Python


Claude Code integration points:

Hooks system: .claude/hooks/ for automated logging (see examples/hooks/README.md)

Agents directory: .claude/agents/ for custom agent definitions (see guide/ultimate-guide.md Section 4)

MCP observability: Use MCP servers for advanced logging and metrics aggregation


Start small and ramp up over a month:

  • Week 1: Add a basic logging hook (tool calls only)
  • Week 2: Add a user feedback prompt (manual ratings)
  • Week 3: Build a dashboard to visualize metrics
  • Week 4: Run your first A/B test on an agent configuration

Don’t track metrics you won’t act on. Prioritize:

  1. Task completion rate → Refine agent instructions
  2. Tool call errors → Improve context or add examples
  3. User ratings → Identify confusing or unhelpful responses

Manual evaluation doesn’t scale. Use:

  • Hooks for automatic logging
  • CI/CD integration for agent unit tests
  • Scripts for periodic metric aggregation

Metrics are useless without action:

  • Weekly: Review metrics, identify patterns
  • Monthly: Update agent instructions based on data
  • Quarterly: Major agent refactoring if needed


Next steps:

  1. Add logging hook to your most-used agent
  2. Collect 1 week of metrics
  3. Analyze and refine agent based on data

Template: See examples/agents/analytics-with-eval/ for a complete implementation with hooks, scripts, and a report template.