Agent Evaluation
Quick nav: Why Evaluate? · Metrics to Track · Implementation · Example · Tools
Why Evaluate Agents?
When you create custom agents in .claude/agents/, you’re encoding specialized expertise into reusable workflows. But how do you know if your agents are actually effective?
Without evaluation, you’re building blind:
- ❌ No way to measure if agent responses are improving or degrading over time
- ❌ Can’t compare different agent configurations objectively
- ❌ Difficult to identify which aspects of agent context/instructions need refinement
- ❌ No data to justify investment in agent development
With evaluation, you iterate with confidence:
- ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
- ✅ A/B test different agent configurations with measurable outcomes
- ✅ Identify patterns in successful vs failed interactions
- ✅ Build feedback loops for continuous improvement
Core principle: Agents are code. Like all code, they need tests, metrics, and observability.
Metrics to Track
1. Response Quality Metrics
What to measure:
- Task completion rate: Did the agent accomplish the stated goal?
- Correctness: Were the agent’s outputs factually accurate?
- Relevance: Did the response stay on-topic and address the actual question?
- Hallucination rate: How often did the agent invent information?
How to track:
```jsonc
// Triggered after each agent response
// Log structure:
{
  "timestamp": "2026-02-10T14:32:00Z",
  "agent_id": "backend-architect",
  "task_completed": true,
  "correctness_score": 4.5,   // User rating 1-5
  "hallucinations": 0,
  "response_tokens": 1250
}
```
Implementation tip: Use user feedback prompts (thumbs up/down) or automated checks (test suite passing after agent code generation).
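For the automated-check route, a minimal sketch might look like the script below: it runs after the agent has edited code and records whether the existing test suite still passes. It assumes the project's tests run via `npm test`; the script name, hard-coded agent id, and `.claude/logs/agent-quality.jsonl` path are illustrative.

```bash
#!/bin/bash
# check-correctness.sh (hypothetical): run after the agent has edited code,
# and record whether the project's test suite still passes.

AGENT_ID="backend-architect"              # or read from your hook environment
LOG=".claude/logs/agent-quality.jsonl"    # illustrative path

if npm test --silent > /dev/null 2>&1; then
  TASK_COMPLETED=true
else
  TASK_COMPLETED=false
fi

echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent_id\":\"$AGENT_ID\",\"task_completed\":$TASK_COMPLETED}" \
  >> "$LOG"
```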
2. Tool Usage Metrics
What to measure:
- Tool call success rate: Percentage of tool calls that executed without errors
- Tool selection accuracy: Did agent choose the right tool for the task?
- Tool call efficiency: Minimum calls to achieve goal (avoid unnecessary reads/searches)
- Error recovery: Did agent handle tool failures gracefully?
How to track:
```jsonc
// Triggered after each tool call
// Log structure:
{
  "timestamp": "2026-02-10T14:32:05Z",
  "agent_id": "backend-architect",
  "tool_name": "Read",
  "tool_success": true,
  "tool_parameters": {"file_path": "src/auth.ts"},
  "execution_time_ms": 45
}
```
Implementation tip: Use the Claude Code hooks system (see examples/hooks/) to automatically log tool calls.
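Once tool calls land in a JSONL log (via the hook above, or Pattern 1 below), a jq one-liner can turn them into a per-tool success rate. This is a sketch: the log path and field names follow the example record above, so adjust them to whatever your hook actually writes.

```bash
# Per-tool call volume and success rate from the tool-call log
jq -s 'group_by(.tool_name)
       | map({tool: .[0].tool_name,
              calls: length,
              success_rate: ((map(select(.tool_success)) | length) / length)})' \
  .claude/logs/agent-metrics.jsonl
```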
3. Performance Metrics
What to measure:
- Response time: Total time from user prompt to complete response
- Token efficiency: Input/output tokens used per task
- Context utilization: How much of context window was used?
- Cost per task: API cost for the full interaction
How to track:
```jsonc
// Triggered at end of session
// Log structure:
{
  "timestamp": "2026-02-10T14:35:00Z",
  "agent_id": "backend-architect",
  "session_duration_s": 180,
  "input_tokens": 3500,
  "output_tokens": 2800,
  "total_cost_usd": 0.15,
  "context_utilization": 0.42
}
```
Implementation tip: Parse Claude Code session logs or use MCP observability tools.
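As a sketch of what parsing those logs can yield, the snippet below totals spend and output tokens for the current month. The `.claude/logs/agent-sessions.jsonl` path is illustrative; field names follow the example record above.

```bash
# Spend and token usage for the current month, from session logs
MONTH=$(date +%Y-%m)
jq -s --arg month "$MONTH" '
  map(select(.timestamp | startswith($month)))
  | {sessions: length,
     total_cost_usd: (map(.total_cost_usd) | add),
     total_output_tokens: (map(.output_tokens) | add)}' \
  .claude/logs/agent-sessions.jsonl
```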
4. User Satisfaction Metrics
What to measure:
- Explicit feedback: User ratings, comments, bug reports
- Implicit signals: Did user accept agent’s suggestions? Did they retry the prompt?
- Adoption rate: How often is this agent used vs alternatives?
- Retention: Do users return to this agent for similar tasks?
How to track:
```jsonc
// Manual feedback collection
// After the agent completes a task, prompt the user:
//   "Rate this agent's performance (1-5): _"
// Log:
{
  "timestamp": "2026-02-10T14:35:10Z",
  "agent_id": "backend-architect",
  "user_rating": 5,
  "user_comment": "Perfect analysis of auth flow",
  "would_use_again": true
}
```
Implementation tip: Add feedback prompts to agent templates or use post-session surveys.
Implementation Patterns
Pattern 1: Logging Hook System
Use Case: Automatically track all agent interactions without manual intervention
Setup:
```bash
#!/bin/bash
# Triggered after every tool call

AGENT_ID="$CLAUDE_AGENT_ID"
TOOL_NAME="$CLAUDE_TOOL_NAME"
TOOL_SUCCESS="$CLAUDE_TOOL_SUCCESS"

# Append to metrics log
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
  >> .claude/logs/agent-metrics.jsonl
```
Pros: Zero manual overhead, complete coverage, time-series data
Cons: Requires parsing Claude Code environment variables (may change across versions)
Pattern 2: Agent Unit Tests
Use Case: Regression testing to ensure agent improvements don’t break existing capabilities
Setup:
```bash
#!/bin/bash

# Test 1: Agent correctly identifies hexagonal architecture layers
echo "Test: Hexagonal architecture analysis"
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
if echo "$RESULT" | grep -q "domain layer"; then
  echo "✅ PASS: Identified layers"
else
  echo "❌ FAIL: Did not identify layers"
  exit 1
fi

# Test 2: Agent recommends correct patterns
echo "Test: Pattern recommendations"
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
if echo "$RESULT" | grep -q "Result<T, E>"; then
  echo "✅ PASS: Recommended Result pattern"
else
  echo "❌ FAIL: Incorrect pattern"
  exit 1
fi
```
Pros: Automated, catches regressions, CI/CD integration
Cons: Requires maintenance, may have false positives/negatives
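As the test list grows, the repeated if/else blocks can be factored into a small helper. This is a sketch that reuses the same `claude agent <name> "<prompt>"` invocation shown above; the helper name is arbitrary.

```bash
#!/bin/bash
# run_agent_test <description> <agent> <prompt> <expected substring>
run_agent_test() {
  local desc="$1" agent="$2" prompt="$3" expected="$4"
  echo "Test: $desc"
  if claude agent "$agent" "$prompt" | grep -q "$expected"; then
    echo "✅ PASS: $desc"
  else
    echo "❌ FAIL: $desc (expected output containing: $expected)"
    exit 1
  fi
}

run_agent_test "Hexagonal architecture analysis" backend-architect \
  "Analyze src/auth.ts for layer violations" "domain layer"
run_agent_test "Pattern recommendations" backend-architect \
  "Improve error handling in src/api.ts" "Result<T, E>"
```

A non-zero exit still fails the script, so it can run in CI unchanged.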
Pattern 3: A/B Testing Configurations
Use Case: Compare two versions of an agent to determine which performs better
Setup:
```yaml
# .claude/agents/backend-architect-v1.md (control)
name: backend-architect
version: 1.0
instructions: |
  You are a backend architect specializing in...
  [original instructions]
```

```yaml
# .claude/agents/backend-architect-v2.md (experiment)
name: backend-architect-v2
version: 2.0
instructions: |
  You are a backend architect specializing in...
  [modified instructions with new pattern emphasis]
```
Evaluation:
```bash
# Run same task with both agents, compare metrics
# Task: "Analyze src/auth.ts for security issues"

# Version 1 metrics:
# - Response time: 45s
# - Issues found: 3
# - User rating: 4/5

# Version 2 metrics:
# - Response time: 38s
# - Issues found: 5 (2 additional critical issues)
# - User rating: 5/5

# Conclusion: Version 2 is more thorough and faster → promote to production
```
Pros: Data-driven decisions, quantifiable improvements
Cons: Requires discipline to run controlled experiments
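The comparison itself can also be scripted so both versions get exactly the same task under the same conditions. A sketch, assuming the invocation style from Pattern 2; the `.claude/logs/ab-test.jsonl` output path is illustrative.

```bash
#!/bin/bash
# Run the same task against both agent versions and record wall-clock time.
TASK="Analyze src/auth.ts for security issues"

for AGENT in backend-architect backend-architect-v2; do
  START=$(date +%s)
  claude agent "$AGENT" "$TASK" > ".claude/logs/ab-${AGENT}.txt"
  END=$(date +%s)
  echo "{\"agent\":\"$AGENT\",\"task\":\"$TASK\",\"response_time_s\":$((END - START))}" \
    >> .claude/logs/ab-test.jsonl
done
```

Issue counts and ratings still need a human (or a grader) to score, but response time and the full transcript are captured automatically for each version.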
Pattern 4: Feedback Loop Integration
Use Case: Continuously improve an agent based on real-world usage data
Setup:
```bash
# After agent completes task
echo "How would you rate this response? (1-5, or 'skip'): "
read RATING

if [ "$RATING" != "skip" ]; then
  echo "Any specific feedback?: "
  read COMMENT

  # Log feedback
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
    >> .claude/logs/agent-feedback.jsonl
fi

# Weekly: Review feedback.jsonl, identify patterns
# Monthly: Update agent instructions based on aggregated feedback
```
Pros: Aligns agent with actual user needs, identifies edge cases
Cons: Requires manual review and action on feedback
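For the weekly review, a short jq query can surface the lowest-rated interactions together with their comments, so instruction updates target concrete failures. Field names match the log written by the snippet above.

```bash
# Weekly review: interactions rated 2 or lower, newest first
jq -s 'map(select(.rating <= 2))
       | sort_by(.timestamp) | reverse
       | .[] | {timestamp, agent, rating, comment}' \
  .claude/logs/agent-feedback.jsonl
```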
Example: Agent with Evaluation
Full template available: examples/agents/analytics-with-eval/ includes complete agent definition, hooks, analysis scripts, and report template.
Setup: Analytics Agent with Built-in Metrics
```markdown
---
name: analytics-agent
description: SQL query generator with evaluation hooks
version: 1.0
tools:
  - Read
  - Write
  - Bash
hooks:
  post_response: .claude/hooks/log-analytics-metrics.sh
---

# Analytics Agent

You are an expert SQL analyst helping users query databases.

## Evaluation Criteria

After each query:

1. **Correctness**: Does query produce expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?

## Instructions

[... agent instructions ...]
```

Metrics Hook
```bash
#!/bin/bash
# Triggered after analytics-agent response

# Extract query from response (naive grep, improve with jq)
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')

if [ -n "$QUERY" ]; then
  # Test query (requires database connection)
  EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')

  # Check for destructive operations
  if echo "$QUERY" | grep -qiE 'DELETE|DROP|TRUNCATE'; then
    SAFETY="FAIL"
  else
    SAFETY="PASS"
  fi

  # Log metrics
  echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
    >> .claude/logs/analytics-metrics.jsonl
fi
```

Analysis
```bash
# Monthly review: Analyze metrics
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
  .claude/logs/analytics-metrics.jsonl

# Output:
# [
#   {"safety": "PASS", "count": 127},
#   {"safety": "FAIL", "count": 3}
# ]

# Action: Review 3 failed queries, update agent instructions to prevent future violations
```

Tools & References
Open-Source Evaluation Frameworks
nao (Analytics Agents)
What it provides:
- Built-in evaluation framework for analytics agents
- Unit testing capabilities for agent responses
- Metrics collection (response quality, tool usage, performance)
- Feedback loop integration
How to adapt for Claude Code:
- Context builder pattern: Apply nao’s structured context approach to your .claude/agents/ config
- Evaluation hooks: Translate nao’s evaluation framework to the Claude Code hooks system
- Metrics schema: Use nao’s metrics schema as a template for your logs
Status: Production-ready, actively maintained, TypeScript + Python
Claude Code Native Patterns
Hooks system: .claude/hooks/ for automated logging (see examples/hooks/README.md)
Agents directory: .claude/agents/ for custom agent definitions (see guide/ultimate-guide.md Section 4)
MCP observability: Use MCP servers for advanced logging and metrics aggregation
Best Practices
Start Simple
- Week 1: Add basic logging hook (tool calls only)
- Week 2: Add user feedback prompt (manual ratings)
- Week 3: Build dashboard to visualize metrics
- Week 4: Run first A/B test on agent configuration
Focus on Actionable Metrics
Don’t track metrics you won’t act on. Prioritize:
- Task completion rate → Refine agent instructions
- Tool call errors → Improve context or add examples
- User ratings → Identify confusing or unhelpful responses
Automate Where Possible
Manual evaluation doesn’t scale. Use:
- Hooks for automatic logging
- CI/CD integration for agent unit tests
- Scripts for periodic metric aggregation (see the sketch below)
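One way to automate the periodic aggregation is a small script invoked from cron; the script path, schedule, and summary format below are illustrative, and the field names follow the Pattern 1 hook log.

```bash
#!/bin/bash
# .claude/scripts/weekly-summary.sh (hypothetical): roll up the tool-call log
# into a dated summary. Example cron schedule (every Monday at 09:00):
#   0 9 * * MON  /path/to/repo/.claude/scripts/weekly-summary.sh

OUT=".claude/logs/summary-$(date +%Y-%m-%d).json"

jq -s '{
  tool_calls: length,
  success_rate: ((map(select(.success)) | length) / length)
}' .claude/logs/agent-metrics.jsonl > "$OUT"

echo "Wrote $OUT"
```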
Build Feedback Loops
Metrics are useless without action:
- Weekly: Review metrics, identify patterns
- Monthly: Update agent instructions based on data
- Quarterly: Major agent refactoring if needed
Related Sections
- Agents: Creating custom agents
- Hooks: Automation with event hooks
- Observability: Logging and monitoring strategies
- AI Ecosystem: External frameworks like nao
Next steps:
- Add logging hook to your most-used agent
- Collect 1 week of metrics
- Analyze and refine agent based on data
Template: See examples/agents/analytics-with-eval/ for complete implementation with hooks, scripts, and report template