⚠️ Breaking Change (Opus 4.6, Feb 2026): Opus 4.6 replaces budget-based thinking with Adaptive Thinking, which automatically decides when to use deep reasoning based on query complexity. The budget_tokens parameter is deprecated on Opus 4.6+.
How it works: The effort parameter controls the model’s overall computational budget — not just thinking tokens, but the entire response including text generation and tool calls. The model dynamically allocates this budget based on query complexity.
Key insight: effort affects everything, even when thinking is disabled. Lower effort = fewer tool calls, more concise text. Higher effort = more tool calls with explanations, detailed analysis.
Effort levels (API only, official descriptions):
max: Maximum capability, no constraints. Opus 4.7+ only (returns error on other models). Cross-system reasoning, irreversible decisions.
Example: "Analyze the microservices event pipeline for race conditions across order-service, inventory-service, and notification-service"
xhigh (Opus 4.7+, v2.1.114+): Extra-high effort, between high and max. Default in Claude Code (all plans) with Opus 4.7. Use when you want more reasoning depth without full max latency.
Example: "Debug the race condition in the distributed job queue with concurrent writes"
high (default for API): Complex reasoning, coding, agentic tasks. Best for production workflows requiring deep analysis.
Example: "Redesign error handling in the payment module: add retry logic, partial failure recovery, and idempotency guarantees"
medium: Balance between speed, cost, and performance. Good for agentic tasks with moderate complexity.
Example: "Convert fetchUser() in api/users.ts from callbacks to async/await"
low: Most efficient. Ideal for classification, lookups, sub-agents, or tasks where speed matters more than depth.
Example: "Rename getUserById to findUserById across src/"
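As a rough illustration of the list above, the sketch below maps a task category to an effort level. The function name and the category labels are hypothetical and chosen only for this example; the level names are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a rough task category to an effort level.
# Category names are illustrative; only the level names come from the list above.
pick_effort() {
  case "$1" in
    rename|lookup|classification) echo "low" ;;
    single-file-refactor)         echo "medium" ;;
    redesign|multi-file)          echo "high" ;;
    race-condition)               echo "xhigh" ;;
    cross-service|irreversible)   echo "max" ;;
    *)                            echo "high" ;;  # API default per the list above
  esac
}

# Possible usage before launching Claude Code:
# export CLAUDE_CODE_EFFORT_LEVEL="$(pick_effort rename)"
```

Which values the environment variable accepts beyond the three slider levels (low, medium, high) is not stated here, so treat xhigh and max in that usage line as an assumption to verify.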
The effort parameter significantly impacts how Claude uses tools:
low effort: Combines operations to minimize tool calls. No explanatory preamble before actions. Faster, more efficient for simple tasks.
high effort: More tool calls with detailed explanations. Describes the plan before executing. Provides comprehensive summaries after operations. Better for complex workflows requiring transparency.
Example: With low effort, Claude might read 3 files and edit them in one flow. With high effort, Claude explains why it’s reading those files, what it’s looking for, then provides a detailed summary of changes made.
Relationship between effort and thinking:
Opus 4.6: effort is the recommended control for thinking depth. The budget_tokens parameter is deprecated on 4.6 (though still functional for backward compatibility).
Opus 4.5: effort works in parallel with budget_tokens. Both parameters are supported and affect different aspects of the response.
Without thinking enabled: effort still controls text generation and tool calls. It’s not a thinking-only parameter.
CLI usage: Three methods to control effort level in Claude Code:
/model command with left/right arrow keys to adjust the effort slider (low, medium, high)
CLAUDE_CODE_EFFORT_LEVEL environment variable (set before launching Claude)
effortLevel field in settings.json (persistent across sessions)
Alt+T toggles thinking on/off globally (separate from effort level).
assistant-prefill: Deprecated on Opus 4.6. Previously allowed pre-filling Claude’s response to guide output format. Now unsupported — use system prompts or examples instead.
New features:
Fast mode API: Add speed: "fast" + beta header fast-mode-2026-02-01 for 2.5x faster responses (6x cost)
📖 Complete Workflow Guide: See GitHub Actions Workflows for 5 production-ready patterns using the official anthropics/claude-code-action (PR review, triage, security, scheduled maintenance).
Code Review (Teams/Enterprise): For automated PR review without manual prompting, see Code Review — Anthropic’s multi-agent review feature that posts inline GitHub comments on every PR.
Windows Note: Git hooks run in Git Bash on Windows, so the bash syntax below works. Alternatively, you can create .cmd or .ps1 versions and reference them from a wrapper script.
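Commit-msg hook:
.git/hooks/commit-msg
#!/bin/bash
# Run Claude Code for commit message validation.
# Git passes the path of the message file as $1 to commit-msg (pre-commit receives no arguments).
COMMIT_MSG=$(cat "$1")
claude -p "Is this commit message good? '$COMMIT_MSG'. Reply YES or NO with reason."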
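The hook above prints Claude's reply but never acts on it, so Git proceeds either way. A minimal sketch of turning a YES/NO reply into an exit status follows; verdict_status is a hypothetical helper name, and failing open on an unclear reply is a design assumption, not something the guide prescribes.

```shell
#!/usr/bin/env bash
# Sketch: convert a free-text YES/NO reply into a hook exit status.
# verdict_status is a hypothetical helper, not part of Claude Code or Git.
verdict_status() {
  local reply first_word
  reply="$1"
  # Take the first word, strip punctuation, and upper-case it.
  first_word=$(printf '%s\n' "$reply" | awk '{print $1; exit}' \
    | tr -d '[:punct:]' | tr '[:lower:]' '[:upper:]')
  case "$first_word" in
    YES) return 0 ;;  # allow the commit
    NO)  return 1 ;;  # non-zero exit makes Git abort the commit
    *)   return 0 ;;  # unclear reply: fail open rather than block work
  esac
}

# Possible usage inside the hook (assumes the claude CLI is on PATH):
# REPLY=$(claude -p "... Reply YES or NO with reason.")
# verdict_status "$REPLY" || { echo "$REPLY" >&2; exit 1; }
```

Parsing only the first word keeps the check tolerant of whatever reasoning follows the verdict.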
Pre-push hook:
.git/hooks/pre-push
#!/bin/bash
# Security check before push
claude -p "Scan staged files for secrets and security issues. Exit 1 if found."
Focus on security, performance, and code quality. \
Output as markdown." --bare
--bare flag for CI scripting (v2.1.81+): Add --bare to any claude -p call to get a deterministic, hermetic execution environment. It disables hooks, LSP, plugin sync, and skill directory scanning, so local developer config never leaks into CI. It requires ANTHROPIC_API_KEY (no OAuth/keychain) and also disables auto-memory.
# Without --bare: picks up local hooks, plugins, skills — non-deterministic in CI
An alternative to generating release notes from commits is to capture the context while implementing, not at release time. The “changelog fragments” pattern replaces a shared CHANGELOG.md with one YAML file per PR, accumulated in changelog/fragments/, assembled automatically at release.
The core problem with commit-based approaches: by the time you run git log to generate release notes, context is gone. The developer who fixed a race condition three weeks ago is the only one who understood the impact. The commit message says fix SSE handling.
The fragments pattern solves this with 3 enforcement layers:
Layer 1 — CLAUDE.md rule: Load a git-workflow.md rule that encodes the full fragment workflow. When a developer asks Claude Code to “create the PR,” it reads the diff, infers type/scope/title, generates the YAML, validates it, and commits it as part of the branch. Claude handles it autonomously.
title: "Fix empty chat after starting activity due to SSE race condition"
description: |
SSE workplan fires before AI stream completes, causing ChatWrapper to mount
with 0 messages. Added isStartingActivityRef guard and await response.text().
breaking: false
migration: false
Layer 2 — UserPromptSubmit hook: Detects PR creation intent and checks whether the fragment was already mentioned.
# Tier 0 enforcement in smart-suggest.sh
if echo "$PROMPT_LC" | grep -qE '(create.*pr|make.*pr|pull.?request)'; then
  if ! echo "$PROMPT_LC" | grep -qE '(changelog|fragment|skip-changelog)'; then
    suggest "pnpm changelog:add" "REQUIRED before merge: fragment missing"
  else
    suggest "/pr" "PR creation with structured description"
  fi
fi
The hook is non-blocking and shows one suggestion inline, before Claude processes the prompt. If the fragment is already mentioned, the hook skips the changelog reminder and suggests the normal PR command instead.
Layer 3 — CI gate: Two independent GitHub Actions jobs. The first validates fragment existence and structure. The second checks that migration: true is set if the PR adds SQL migration files — this job runs regardless of bypass labels, because a “skip-changelog” PR can still add a migration that the deployment team needs to know about.
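The two CI checks could be sketched roughly as below. The function names, the fragment path in the usage line, and the "any changed *.sql file counts as a migration" rule are assumptions made for illustration; the field names come from the example fragment shown earlier.

```shell
#!/usr/bin/env bash
# Sketch of the two CI checks. Names and the *.sql convention are assumptions.

# Check 1: the fragment exists and carries the fields shown in the example fragment.
validate_fragment() {
  local fragment="$1" key
  [ -f "$fragment" ] || { echo "missing fragment: $fragment" >&2; return 1; }
  for key in title description breaking migration; do
    grep -qE "^${key}:" "$fragment" || { echo "missing field: $key" >&2; return 1; }
  done
}

# Check 2: if any changed file is a SQL file, the fragment must say migration: true.
# Changed file paths are read from stdin, one per line.
check_migration_flag() {
  local fragment="$1"
  if grep -qE '\.sql$'; then
    grep -qE '^migration:[[:space:]]*true[[:space:]]*$' "$fragment" || {
      echo "PR adds SQL files but fragment lacks migration: true" >&2
      return 1
    }
  fi
}

# Possible usage in a CI step (fragment path is hypothetical):
# git diff --name-only origin/main...HEAD | check_migration_flag changelog/fragments/my-pr.yml
```

Keeping the migration check as a separate function mirrors the point above: it can run even when a bypass label skips the fragment-existence check.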
Assembly at release:
pnpm changelog:assemble --version 1.8.0 [--dry-run]
Reads all fragments, groups by type, inserts a versioned section into CHANGELOG.md replacing a ## [Next Release] placeholder, archives fragments to changelog/fragments/released/{version}/.
Benefits over commit-based generation:
Zero merge conflicts (each fragment is a unique file per PR)
Context written at implementation time, not reconstructed later
DB migrations surfaced explicitly in every fragment
Bypass is auditable (closed label list visible in PR history)
Claude Code can automate deployments to Vercel, GCP, and other platforms using stored credentials. The key is assembling three components: secret management, a deploy skill, and mandatory guardrails.
For multi-platform secrets (GitHub, Vercel, AWS simultaneously), Infisical provides centralized management with versioning and point-in-time recovery — a useful open-source alternative to HashiCorp Vault:
# Install Infisical CLI
brew install infisical/get-cli/infisical
# Inject secrets into Claude Code session
infisical run -- claude
# Infisical automatically sets all project secrets as env vars
These guardrails are not optional. Production deployments without them create incidents:
| Guardrail | Implementation | Why |
|---|---|---|
| Staging-first | Always deploy to staging before prod | Catch environment-specific failures |
| Human confirmation | Stop and ask before --prod flag | No autonomous production deploys |
| Smoke test | Verify HTTP 200 on key endpoints after deploy | Catch silent deployment failures |
| Rollback ready | Keep previous deployment ID before promoting | vercel rollback <deployment-id> |
Hook for confirmation (prevent accidental production deploys):
.claude/settings.json
{
"hooks": {
"PreToolUse": [{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "scripts/check-prod-deploy.sh"
}]
}]
}
}
#!/bin/bash
# check-prod-deploy.sh: exit 2 to block, exit 0 to allow
INPUT=$(cat)
if echo "$INPUT" | grep -q "vercel deploy --prod\|gcloud deploy.*production"; then
  # Write the reason to stderr, which is what Claude Code surfaces for a blocking exit code
  echo "BLOCKED: Production deploy requires manual confirmation. Run the command directly from your terminal." >&2
  exit 2
fi
exit 0
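To try the matching logic without a live session, the same pattern can be wrapped in a function and fed sample payloads. This is a sketch: is_prod_deploy and check_payload are hypothetical names, and the JSON shape of the payloads is an assumption about what the hook receives on stdin.

```shell
#!/usr/bin/env bash
# Sketch: the same pattern as check-prod-deploy.sh, wrapped for offline testing.
# Function names are hypothetical; the JSON payloads are assumed shapes.
is_prod_deploy() {
  printf '%s' "$1" | grep -q "vercel deploy --prod\|gcloud deploy.*production"
}

check_payload() {
  if is_prod_deploy "$1"; then echo "block"; else echo "allow"; fi
}

check_payload '{"tool_name":"Bash","tool_input":{"command":"vercel deploy --prod"}}'
check_payload '{"tool_name":"Bash","tool_input":{"command":"vercel deploy"}}'
check_payload '{"tool_name":"Bash","tool_input":{"command":"vercel --prod"}}'
```

The third payload is allowed even though `vercel --prod` also targets production, which shows why a substring pattern like this should be reviewed against the exact commands your team uses.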
Sources: Vercel deploy skill pattern documented by the community (lobehub.com, haniakrim21); Infisical multi-platform secrets management at infisical.com. No end-to-end automated deploy workflow exists in the community as of March 2026 — the building blocks are available but the staging-to-production promotion pattern is something each team assembles themselves.
New: Xcode 26.3 RC+ includes native Claude Agent SDK support, using the same harness as Claude Code:
Requirements: Xcode 26.3 RC or later (macOS)
Setup: Configure API key in Xcode → Preferences → Claude
Use:
Built-in code assistant powered by Claude
Same capabilities as Claude Code CLI
Native integration with Xcode workflows
Claude Agent SDK: Separate product from Claude Code, but shares the same agent execution framework. Enables Claude-powered development tools in IDEs beyond VS Code.
Note: Claude Agent SDK is not Claude Code — it’s Anthropic’s framework for building agent-powered developer tools. Claude Code CLI and Xcode integration both use this SDK.
Problem: Fullstack development often requires long-running processes (dev servers, watchers) that block the main Claude session, preventing iterative frontend work.
Solution: Use Ctrl+B to background tasks and maintain tight feedback loops across the stack.
💡 Key insight: Background tasks optimize fullstack workflows by decoupling infrastructure (servers, watchers) from iterative development. Use them strategically to maintain tight feedback loops across the entire stack.
All the loops above validate code. None of them tell Claude whether the UI actually looks correct, whether a form works, or whether the page renders without errors. Without a browser connection, Claude can only infer — it writes code and assumes the result matches intent.
Claude in Chrome closes that gap. It’s a Chrome browser extension that gives Claude Code direct control over your browser: navigate to URLs, click elements, read the console, fill forms, take screenshots, and observe the rendered result of what it just built.
Setup:
Install the Claude in Chrome extension from the Chrome Web Store
Enable it for your session:
claude --chrome      # start with Chrome integration enabled
claude --no-chrome   # disable for this session
/chrome              # check connection status / manage permissions
What Claude can do with Chrome access:
| Capability | Practical use |
|---|---|
| Navigate to localhost | Verify the page renders after a change |
| Read console errors | No copy-paste; Claude sees errors directly |
| Click through flows | Test that a form submission actually works |
| Screenshot + compare | Check visual output against expectations |
| Fill inputs | Test validation, edge cases, empty states |
The key insight from Boris Cherny (Claude Code creator): “If Claude can’t see the result, it can’t improve it.” Code feedback loops catch syntax and logic errors. Browser feedback loops catch the rest — layout, interactions, runtime errors.
When /chrome is hidden: Claude Code hides the /chrome command when no Chrome integration is available for your current auth setup (v2.1.87+). Verify the extension is installed and Chrome is running if it doesn’t appear.
Introduced in v2.0.72 as “Claude in Chrome Beta”. The --chrome/--no-chrome flags and /chrome command control the browser integration. This is separate from the claude-in-chrome MCP server, which is a different browser automation mechanism.
Control how Claude responds to match your workflow and learning preferences. Output styles are a built-in product feature — not a prompt trick — and apply at the session level.
Explanatory and Learning produce longer responses by design, increasing output tokens. Prompt caching reduces this cost after the first request in a session.
Since December 2025, you can define your own styles in .claude/styles/. Create a Markdown file and reference it by filename (without extension) as the outputStyle value.
.claude/styles/
└── strict-reviewer.md # Custom style definition
{
"outputStyle": "strict-reviewer"
}
See examples/styles/ for a ready-to-use custom style template.
Claude Code can generate Mermaid diagrams for visual documentation. This is useful for architecture documentation, flow visualization, and system understanding.
Jens Rusitschka identifies “context overload” as the primary failure mode of vibe coding: dumping entire codebases into context, hoping Claude will figure it out.
Symptoms:
Pasting 5K+ lines of code in first prompt
“Read the entire repo and implement X”
Expecting Claude to maintain context across 20+ file changes
Performance degradation after context pollution (see §2.2 Fresh Context Pattern)
Why it fails:
Attention dilution across too many files and concerns
Lost architectural reasoning in noise
Failed attempts accumulate, further degrading quality
Context bleeding between unrelated tasks
The Phased Context Strategy:
Instead of big-bang context dump, use a staged approach that leverages Claude Code’s native features:
| Phase | Tool | Purpose | Context Size |
|---|---|---|---|
| 1. Exploration | /plan mode | Read-only analysis, safe investigation | Controlled (plan writes findings) |
| 2. Implementation | Normal mode | Execute planned changes | Focused (plan guides scope) |
| 3. Fresh Start | Session handoff | Reset when context >75% | Minimal (handoff doc only) |
Practical workflow:
# Phase 1: Exploration (read-only, safe)
/plan
You: "How should I refactor the auth system for OAuth?"
The insight: Rusitschka’s “Vibe Coding, Level 2” is Claude Code’s native workflow — it just needed explicit framing as an anti-pattern antidote. Plan mode prevents context pollution during exploration, fresh context prevents accumulation during implementation, and handoffs enable clean phase transitions.
Vibe coding gets things built fast. The codebases it produces tend to rot in ways that are hard to see: abstractions drift, naming becomes inconsistent, error handling gets done three different ways. The code still works, but working in it gets progressively worse.
“Slop” — a term coined by Simon Willison in 2024 for unwanted, unreviewed AI-generated content — is the quality problem that vibe coding at scale inevitably produces.
Desloppify (github.com/peteromallet/desloppify) is a community tool that directly addresses this. It installs a workflow guide into Claude Code as a skill, then runs a prioritized fix loop: scan → get next issue → fix → resolve → repeat until a quality score target is hit. The scoring is designed to resist gaming — improving the number requires actually improving the code.
pip install --upgrade "desloppify[full]"
desloppify update-skill claude   # installs workflow as a Claude Code skill
# Before scanning: exclude generated files, build output, vendored code
desloppify exclude node_modules
desloppify exclude .next
desloppify scan --path .
desloppify next                  # get first prioritized fix
# fix it, then:
desloppify resolve <issue-id>
desloppify next                  # repeat
The loop handles both mechanical issues (dead code, duplication, complexity) and structural ones (naming clarity, abstraction design, module boundaries). A score above 98 is meant to correlate with what a senior engineer would call a clean codebase.
Status: Early-stage (released February 2026, ~2K GitHub stars). Promising native Claude Code integration but not yet battle-tested at scale. Evaluate token cost before running on large codebases — multi-pass LLM review across a full codebase can be substantial.
Batch operations extend beyond code changes. The same pattern applies to file conversion pipelines using native macOS tooling, with no external dependencies.
Use case: Convert a folder of PPTX presentations to PDF using Keynote.
# Requirements: macOS + Keynote installed. No LibreOffice, no Python.
./pptx-to-pdf.sh ~/Downloads/Prose   # recursive, processes all subdirectories
Finds all .pptx files recursively under the target folder
Skips files where a .pdf already exists (idempotent, safe to re-run)
Opens each file via shell, exports to PDF via AppleScript, then closes Keynote
Prints a summary of all generated PDFs at the end
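The script itself is not reproduced in this guide. As a sketch of just the discovery and skip-if-exists steps listed above, something like the following would work; pending_conversions is a hypothetical name, and the Keynote export step is left out.

```shell
#!/usr/bin/env bash
# Sketch of the discovery step only: list .pptx files that still need a PDF.
# pending_conversions is a hypothetical name; the real pptx-to-pdf.sh is not shown here.
pending_conversions() {
  local root="$1" pptx pdf
  find "$root" -type f -name '*.pptx' |
    while IFS= read -r pptx; do
      pdf="${pptx%.pptx}.pdf"
      # Skip files that already have a PDF next to them (idempotent re-runs).
      [ -e "$pdf" ] || printf '%s\n' "$pptx"
    done
}

# Possible usage:
# pending_conversions ~/Downloads/Prose
```

Reading line by line keeps paths with spaces intact; file names containing newlines would need a null-delimited variant.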
Critical gotcha — open via shell, not AppleScript:
The intuitive approach fails:
-- This triggers error -1719 "Index non valable" (invalid index) on ~12% of files
tell application "Keynote" to open pptx_file
-- document 1 is sometimes empty, AppleScript throws on access
The fix: use open -a "Keynote" "$pptx" from the shell before the AppleScript block, with an 8-second sleep to let Keynote fully register the document. When Keynote opens a file via its own open command, it doesn’t always add it to the documents list. When the shell hands it a file path via open -a, it does.
# Correct pattern
open -a "Keynote" "$pptx"   # shell open
sleep 8                     # wait for Keynote to register the document
osascript <<EOF
tell application "Keynote"
  if (count of documents) > 0 then
    export document 1 to (POSIX file "$pdf") as PDF
    close document 1 saving no
  end if
end tell
EOF
This same shell-open-then-AppleScript pattern generalizes to any macOS app that supports scripting but has unreliable document registration via its own open command.
❌ Don't:
Skip project context (CLAUDE.md) - leads to repeated corrections
Use vague prompts like “fix this” or “check my code”
Ignore errors in logs or dismiss warnings
Automate workflows without testing in safe environments first
Accept changes blindly without reviewing diffs
Work without version control or backups
Mix multiple unrelated tasks in one session
Forget to commit after completing tasks
✅ Do:
Maintain and update CLAUDE.md regularly with:
Tech stack and versions
Coding conventions and patterns
Architecture decisions
Common gotchas specific to your project
Be specific and goal-oriented in prompts using WHAT/WHERE/HOW/VERIFY format
Monitor via logs or OpenTelemetry when appropriate
Test automation in dev/staging environments first
Always review agent outputs before accepting — especially polished ones (see Artifact Paradox below)
Use git branches for experimental changes
Break complex tasks into focused sessions
Commit frequently with descriptive messages
⚠️ The Artifact Paradox — Anthropic AI Fluency Index (Feb 2026)
Anthropic research on 9,830 Claude conversations reveals a critical counter-intuitive finding: when Claude produces a polished artifact (code, files, configs), users become measurably less critical, not more.
Compared to sessions without artifact production:
−5.2pp likelihood of identifying missing context
−3.7pp likelihood of fact-checking the output
−3.1pp likelihood of questioning the reasoning
Users do become more directive (+14.7pp clarifying goals, +14.5pp specifying format) — but their critical evaluation drops precisely when the output looks finished.
For Claude Code, this is the nominal case. Every generated file, every written test, every created config is an artifact. The polished compile-and-run output is exactly when you should apply the most scrutiny — not the least.
Counter-measures:
Run tests before accepting generated code, not after
Explicitly ask: “What edge cases or requirements did you not address?”
❌ Don't:
Try to learn everything at once - overwhelming and inefficient
Skip the basics and jump to advanced features
Expect perfection from AI - it’s a tool, not magic
Blame Claude for errors without reviewing your prompts
Work in isolation without checking community resources
Give up after first frustration
Trust AI output without proportional verification - AI code has 1.75× more logic errors than human-written code (source). Match verification effort to risk level (see Section 1.7)
✅ Do:
Follow progressive learning path:
Week 1: Basic commands, context management
Week 2: CLAUDE.md, permissions
Week 3: Agents and commands
Month 2+: MCP servers, advanced patterns
Start with simple, low-risk tasks
Iterate on prompts based on results
Review this guide and community resources regularly
Join Claude Code communities (Discord, GitHub discussions)
Git worktrees (available since Git 2.5.0, July 2015) create multiple working directories from the same repository, each checked out to a different branch.
Traditional workflow problem:
# Working on feature A
git checkout feature-a
# 2 hours of work...
# Urgent hotfix needed
git stash                 # Save current work
git checkout main
git checkout -b hotfix
# Fix the bug...
git checkout feature-a
git stash pop             # Resume work
Worktree solution:
# One-time setup
git worktree add ../myproject-hotfix hotfix
git worktree add ../myproject-feature-a feature-a
# Now work in parallel
cd ../myproject-hotfix      # Terminal 1
claude                      # Fix the bug
cd ../myproject-feature-a   # Terminal 2
claude                      # Continue feature work
When to use worktrees:
✅ Use worktrees when:
Working on multiple features simultaneously
Need to test different approaches in parallel
Reviewing code while developing
Running long CI/CD builds while coding
Maintaining multiple versions (v1 support + v2 development)
❌ Don’t use worktrees when:
Simple branch switching is sufficient
Disk space is limited (each worktree = full working directory)
Team is unfamiliar with worktrees (adds complexity)
Worktree lifecycle commands:
The full worktree lifecycle is covered by 4 companion commands:
| Command | Purpose |
|---|---|
| /git-worktree | Create worktree with branch validation, symlinked deps, background checks |
💡 Tip — Symlink node_modules: The /git-worktree command symlinks node_modules from the main worktree by default, saving ~30s per worktree creation and significant disk space. Use --isolated when you need fresh dependencies (e.g., testing upgrades).
Worktree management:
# List all worktrees
git worktree list
# Remove worktree (after merging feature)
git worktree remove .worktrees/feature/new-api
# Cleanup stale worktree references
git worktree prune
💡 Team tip — Shell aliases for fast worktree navigation: The Claude Code team uses single-letter aliases to hop between worktrees instantly:
# ~/.zshrc or ~/.bashrc
alias za="cd .worktrees/feature-a"
alias zb="cd .worktrees/feature-b"
alias zc="cd .worktrees/feature-c"
alias zlog="cd .worktrees/analysis"   # Dedicated worktree for logs & queries
The dedicated “analysis” worktree is used for reviewing logs and running database queries without polluting active feature branches.
# --worktree / -w flag: creates a temporary worktree based on HEAD
claude --worktree
claude -w
The worktree is created automatically, Claude runs inside it, and it is cleaned up on exit (if no changes were made).
Breaking change (v2.1.133): worktree.baseRef now defaults to fresh, reverting the v2.1.128 behavior where EnterWorktree branched from local HEAD. If you have unpushed commits you need in the worktree branch, set worktree.baseRef: "head" explicitly.
worktree.baseRef (fresh | head, default: fresh): Controls the base commit for worktrees created via --worktree, EnterWorktree, and agent-isolation worktrees.
| Value | Behavior |
|---|---|
| fresh | Branch from origin/<default-branch>: always a clean remote base |
| head | Branch from local HEAD: includes unpushed commits |
The ConfigChange hook fires whenever a configuration file changes during a session. Use it to audit or block unauthorized live configuration modifications — particularly useful in enterprise environments with managed policy hooks.
.claude/settings.json
{
"hooks": {
"ConfigChange": [
{
"matcher": "",
"hooks": [
{
"type": "command",
"command": "scripts/audit-config-change.sh"
}
]
}
]
}
}
Example audit-config-change.sh (log + optionally block):
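The script body is not reproduced here. The following is only a sketch of what a log-and-optionally-block script could look like, under two assumptions: the hook passes a JSON payload on stdin, and exit code 2 blocks the change (as in the PreToolUse example earlier). AUDIT_LOG and BLOCK_PATTERN are hypothetical environment variables introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch of an audit script: log every change, optionally block some.
# Assumptions: JSON payload on stdin; exit code 2 blocks the change.
# AUDIT_LOG and BLOCK_PATTERN are hypothetical names, not Claude Code settings.
audit_config_change() {
  local log="${AUDIT_LOG:-$HOME/.claude/config-audit.log}"
  local pattern="${BLOCK_PATTERN:-}"
  local payload
  payload=$(cat)
  mkdir -p "$(dirname "$log")"
  # Append a UTC timestamp and the raw payload to the audit log.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$payload" >> "$log"
  if [ -n "$pattern" ] && printf '%s' "$payload" | grep -qE "$pattern"; then
    echo "BLOCKED: configuration change matches $pattern" >&2
    return 2
  fi
  return 0
}

# In audit-config-change.sh itself, the last lines would be:
# audit_config_change
# exit $?
```

Logging happens before the block decision, so blocked attempts still leave an audit trail.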
Enterprise note: disableAllHooks (v2.1.49+) can no longer bypass managed hooks — hooks set via organizational policy always run regardless of this setting. Only non-managed hooks are affected.
Policy fragment deployment with managed-settings.d/ (v2.1.83+)
In multi-team organizations, editing a single managed-settings.json creates merge conflicts and coordination overhead. The managed-settings.d/ drop-in directory solves this: each file is an independent policy fragment that Claude Code merges alphabetically at startup.
/etc/claude-code/managed-settings.d/
├── 00-security-baseline.json # From security team
├── 10-allowed-tools.json # From platform team
└── 50-team-hooks.json # From individual team
Each fragment follows the same schema as managed-settings.json. Conflicts are resolved by merge order (alphabetical). This lets security provide a global baseline without blocking teams from deploying their own fragments independently.
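To preview the order in which fragments would be visited, a small sketch like this can help. fragment_merge_order is a hypothetical helper, and byte-order sorting is an assumption about what "alphabetical" means here; how conflicting keys are resolved is not modeled.

```shell
#!/usr/bin/env bash
# Sketch: print policy fragments in the order an alphabetical merge would visit them.
# fragment_merge_order is a hypothetical helper; only ordering is shown, not merging.
fragment_merge_order() {
  local dir="$1"
  # LC_ALL=C gives byte-order sorting, so 00- sorts before 10- before 50-.
  find "$dir" -maxdepth 1 -type f -name '*.json' | LC_ALL=C sort
}

# Possible usage:
# fragment_merge_order /etc/claude-code/managed-settings.d
```

Numeric prefixes with a fixed width (00, 10, 50) keep the order stable as teams add fragments.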
By default, if Claude Code cannot start the sandbox (macOS Seatbelt / Linux seccomp unavailable), it silently falls back to running unsandboxed. In security-sensitive environments this silent fallback is a compliance risk.
Set sandbox.failIfUnavailable: true in managed-settings.json to fail hard instead:
{
"sandbox": {
"failIfUnavailable": true
}
}
Recommended for: regulated environments (SOC 2, HIPAA), CI runners where sandbox availability is guaranteed, any context where an unsandboxed fallback is not acceptable.
By default, subprocesses spawned by Claude Code (Bash tool, hooks, MCP stdio) inherit the full shell environment, including Anthropic API keys and cloud provider credentials. Set CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=1 to strip those credentials before subprocess execution:
export CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=1
This scrubs ANTHROPIC_API_KEY, AWS_*, GOOGLE_*, AZURE_*, and similar cloud provider variables from the subprocess environment. Claude Code’s own API calls are unaffected — only the child processes are restricted.
When to enable: any hook or MCP script that makes outbound network calls and should not have access to your API credentials.
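To make the effect concrete, the sketch below runs a command with the named variables removed from its environment. It illustrates what scrubbing means for a child process and is not how Claude Code implements the setting; run_scrubbed is a hypothetical name, and only the four patterns named above are covered.

```shell
#!/usr/bin/env bash
# Illustration only: run a command with credential-like variables removed.
# Not Claude Code's implementation; run_scrubbed is a hypothetical helper.
run_scrubbed() {
  (
    # The subshell keeps the caller's environment untouched.
    for name in $(awk 'BEGIN { for (k in ENVIRON) print k }'); do
      case "$name" in
        ANTHROPIC_API_KEY|AWS_*|GOOGLE_*|AZURE_*) unset "$name" ;;
      esac
    done
    "$@"
  )
}

# Possible usage:
# run_scrubbed ./my-hook.sh
```

Running a hook through a wrapper like this is a quick way to check that it still works without access to your credentials before enabling the setting.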
When running multiple agents in parallel worktrees, the hardest problem isn’t setup — it’s coordination. There is no built-in automatic dependency detection between worktree agents. You manage it explicitly.
The pattern: analyze files touched, then set blockedBy manually
Before spawning parallel agents, identify which tasks share files:
# Quick dependency check: list files each task will touch
Parallelize with explicit conflict resolution step
Tasks touch same files
Sequence them
Task B needs Task A’s API contract
Block Task B until Task A’s interface is defined
Practical rule: A 5-minute analysis to find file overlaps before spawning agents saves hours of merge conflict resolution.
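The overlap analysis can be as simple as intersecting two file lists. A minimal sketch, assuming each task's expected files are written to a plain text file with one path per line; shared_files is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: report files that two planned tasks would both touch.
# Inputs are text files listing one path per line; shared_files is a hypothetical name.
shared_files() {
  local a_sorted b_sorted
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  sort -u "$1" > "$a_sorted"
  sort -u "$2" > "$b_sorted"
  comm -12 "$a_sorted" "$b_sorted"   # lines present in both lists
  rm -f "$a_sorted" "$b_sorted"
}

# Possible usage:
# shared_files task-a-files.txt task-b-files.txt
```

Any output means the two tasks share files and should be sequenced or given an explicit conflict-resolution step; empty output suggests they can run in parallel.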
Tooling: coderabbitai/git-worktree-runner provides a bash-based worktree manager with basic AI tool integration. It handles the worktree lifecycle but not dependency detection — that stays manual.
Note: Fully automatic dependency detection (where the system infers which tasks conflict) doesn’t exist in Claude Code or the broader ecosystem as of March 2026. The approaches above are the practical state of the art.
Important: Claude Code uses lazy loading - it doesn’t “load” your entire codebase at startup. Files are read on-demand when you ask Claude to analyze them. The main context consumers at startup are your CLAUDE.md files and auto-loaded rules.
CLAUDE.md Token Cost Estimation:
File Size
Approximate Tokens
Impact
50 lines
500-1,000 tokens
Minimal (recommended)
100 lines
1,000-2,000 tokens
Acceptable
200 lines
2,000-3,500 tokens
Upper limit
500+ lines
5,000+ tokens
Consider splitting
Note: These are loaded once at session start, not per request. A 200-line CLAUDE.md costs ~2K tokens upfront but doesn’t grow during the session. The concern is the cumulative effect when combined with multiple @includes and all files in .claude/rules/.
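For a quick check against the table above, a line count can be turned into a rough token range. This sketch uses 10 to 20 tokens per line, a heuristic read off the 50- and 100-line rows; it is not a tokenizer, and it overshoots the table's upper bound for longer files. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Rough estimator based on the table above: about 10 to 20 tokens per line.
# A heuristic, not a tokenizer; estimate_claude_md_tokens is a hypothetical name.
estimate_claude_md_tokens() {
  local lines
  lines=$(wc -l < "$1" | tr -d ' ')
  echo "$lines lines, roughly $((lines * 10))-$((lines * 20)) tokens"
}

# Possible usage:
# estimate_claude_md_tokens CLAUDE.md
```

Running it over CLAUDE.md plus every file in .claude/rules/ gives a sense of the cumulative startup cost mentioned above.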
Important: Beyond file size, context files containing non-essential information (style guides, architecture descriptions, general conventions) add +20-23% inference cost per session regardless of line count — because agents process and act on every instruction. The same research confirms that LLM-generated context files reduce task success by ~3%, while developer-written files improve it by ~4%. (Gloaguen et al., 2026)
# ❌ Bloated CLAUDE.md (wastes tokens on every session)
- 500+ lines of instructions
- Multiple @includes importing other files
- Rarely-used guidelines
# ✅ Lean CLAUDE.md
- Essential project context only (<200 lines)
- Move specialized rules to .claude/rules/ (auto-loaded at session start)
- Split by concern: team rules in project CLAUDE.md, personal prefs in ~/.claude/CLAUDE.md
Research note (Gloaguen et al., ETH Zürich, Feb 2026 — 138 benchmarks, 12 repos): The first empirical study on context files shows developer-written CLAUDE.md improves agent success rate by +4%, but LLM-generated files reduce it by -3%. Cause: agents faithfully follow all instructions, even those irrelevant to the task, leading to broader file exploration and longer reasoning chains. Recommendation: include only build/test commands and project-specific tooling. Style guides and architecture descriptions belong in separate docs. (Full evaluation)
2. Use targeted file references:
# ❌ Vague request (Claude reads many files to find context)
"Fix the authentication bug"
# ✅ Specific request (Claude reads only what's needed)
"Fix the JWT validation in @src/auth/middleware.ts line 45"
3. Compact proactively:
# ❌ Wait until 90% context
/status    # Context: 92% - Too late, degraded performance
# ✅ Compact at 70%
/status    # Context: 72%
/compact   # Frees up context, maintains performance
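The thresholds in this example can be written down as a small decision helper. This is only a sketch of the rule of thumb (compact from 70%, treat 90% and above as late); context_action is a hypothetical name and the percentage has to be read from /status by hand.

```shell
#!/usr/bin/env bash
# Sketch of the rule of thumb above. context_action is a hypothetical helper;
# the percentage is whatever /status reports, passed in as an integer.
context_action() {
  local pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo "compact now (late: expect degraded performance)"
  elif [ "$pct" -ge 70 ]; then
    echo "compact"
  else
    echo "continue"
  fi
}

# Possible usage:
# context_action 72
```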
4. Agent specialization:
---
name: test-writer
description: Generate unit tests (use for test generation only)
model: haiku
---
Generate comprehensive unit tests with edge cases.
Benefits:
Haiku costs less than Sonnet
Focused context (tests only)
Faster execution
5. Batch similar operations:
# ❌ Individual sessions for each fix
claude -p "Fix typo in auth.ts"
claude -p "Fix typo in user.ts"
claude -p "Fix typo in api.ts"
# ✅ Batch in single session
claude
You: "Fix typos in auth.ts, user.ts, and api.ts"
# Single context load, multiple fixes
6. Pre-structural indexing:
Instead of letting Claude read files on demand throughout a session, pre-build a structural index of your codebase before starting. Claude queries the index (1 call) rather than reading files sequentially (5-10 reads per task).
# With CodeXRay (npx setup, SQLite-backed, 15 languages):
npx codexray   # Interactive setup + first index build
cxr watch &    # Background sync on file changes
# Claude Code then queries the graph instead of reading files:
# "find the payment module" → 1 graph query vs 5-10 file reads
Tools built on this pattern replace 5-10 file reads with 1 structured query — roughly 75% fewer tool calls for discovery tasks.
Dead code and circular dependency detection:
A structural index also enables analysis that file-by-file reading cannot surface efficiently:
Dead code: Functions defined but never called — safe to delete, reducing future context noise
Circular dependencies: Module A imports B imports A — architectural debt that silently inflates Claude’s reasoning overhead
Hotspots: Files with the highest dependency count — prioritize for documentation or refactoring first
# With grepai (zero callers = dead code candidate):
grepai trace callers "MyFunction"  # Empty result → safe to investigate for deletion
# With a structural MCP tool (if available):
# Tools like CodeXRay expose: codexray_deadcode, codexray_circular, codexray_hotspots
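The three analyses above (dead code, circular dependencies, hotspots) can be sketched over a plain in-memory graph. This is a minimal illustration only, not how CodeXRay or grepai work internally. The `calls` and `imports` dictionaries and all function names are hypothetical stand-ins for whatever a real index stores, and "hotspot" is interpreted here as "imported by the most modules", which is an assumption.

```python
from __future__ import annotations

from collections import Counter


def dead_code_candidates(calls: dict[str, set[str]], entry_points: set[str]) -> set[str]:
    """Functions that nothing calls and that are not known entry points."""
    called = set().union(*calls.values())
    return set(calls) - called - entry_points


def find_cycle(imports: dict[str, set[str]]) -> list[str] | None:
    """Return one import cycle as a list of modules, or None if the graph is acyclic."""
    visiting: list[str] = []  # modules on the current depth-first path
    done: set[str] = set()    # modules fully explored without finding a cycle

    def visit(module: str) -> list[str] | None:
        if module in visiting:
            return visiting[visiting.index(module):] + [module]
        if module in done:
            return None
        visiting.append(module)
        for dep in sorted(imports.get(module, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(module)
        return None

    for module in sorted(imports):
        cycle = visit(module)
        if cycle:
            return cycle
    return None


def hotspots(imports: dict[str, set[str]], top: int = 3) -> list[tuple[str, int]]:
    """Modules ranked by how many other modules import them."""
    counts = Counter(dep for deps in imports.values() for dep in deps)
    return counts.most_common(top)
```

A candidate returned by `dead_code_candidates` is only a starting point for investigation: dynamic dispatch, reflection, and framework entry points do not show up in a static call graph.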
Community tools: CodeXRay (Tree-sitter + SQLite, 16 MCP tools, 15 languages) and Claudette (Go binary, 4 languages) are early implementations of this approach. Both are alpha-stage as of March 2026 — use grepai for production workflows.
Caveman is a Claude Code skill (also available for Cursor, Windsurf, Codex, Gemini CLI, and 26 other agents) that rewrites the assistant’s output style into compressed, telegraphic fragments. Articles, pleasantries, transitional summaries, and verbose explanations are stripped. Code blocks, file paths, URLs, commands, headings, and version numbers are preserved verbatim.
/caveman lite # grammar intact, only filler removed
/caveman ultra # maximum telegraphic compression
stop caveman # return to normal
Also auto-triggers on phrases like “be brief” or “less tokens please.” Auto-disables for security-critical messages and destructive operations.
Four compression modes:
| Mode | Style |
| --- | --- |
| Lite | Full grammar, pleasantries stripped |
| Full (default) | Fragmented sentences, articles dropped |
| Ultra | Maximum telegraphic compression |
| Wenyan (文言文) | Classical Chinese literary mode (experimental) |
How it saves tokens — two mechanisms:
Output compression: Prose responses run 65% shorter on average (22–87% range depending on task type). Most effective on explanation-heavy back-and-forth: architecture discussions, debugging narratives, Q&A.
Input compression via /caveman-compress: Rewrites your CLAUDE.md and project memory files into compressed form in-place — claimed ~46% reduction in session startup token cost. Code blocks, URLs, and paths are untouched.
Companion tools included:
/caveman-commit — conventional commit messages under 50 chars, focused on “why”
/caveman-review — one-line PR comments with emoji severity markers
caveman-shrink — MCP wrapper that compresses tool/prompt description fields before they load into context
Honest numbers: The headline “75% fewer output tokens” applies to individual prose responses. In a typical session, prose represents a small fraction of total token budget — whole-session savings are closer to 4–10%. Caveman pays off most on sessions heavy in conversational back-and-forth, and least on sessions dominated by file reads, tool calls, or code generation.
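The gap between per-response and whole-session savings is simple multiplication. The sketch below assumes that only prose output is compressed and that everything else (file reads, tool calls, code) is untouched. The prose shares in the loop are illustrative values, not measurements, and `whole_session_savings` is a hypothetical helper name.

```python
def whole_session_savings(prose_share: float, prose_reduction: float) -> float:
    """Fraction of total session tokens saved when only prose output is compressed."""
    return prose_share * prose_reduction


# Illustrative prose shares of the total token budget (assumed, not measured).
for share in (0.06, 0.10, 0.15):
    saved = whole_session_savings(share, prose_reduction=0.65)
    print(f"prose share {share:.0%}: session savings {saved:.1%}")
```

With the 65% average reduction quoted above, a prose share of 10% yields 6.5% whole-session savings, which is why the session-level figure stays in single digits.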
When NOT to use it:
Documentation generation — output is meant to be read by humans
Code review comments shared with non-technical stakeholders
Debugging sessions where reasoning transparency matters
Multi-agent chains where downstream agents parse prior responses to reconstruct state
Stats: 53K GitHub stars | Created 2026-04-04 | MIT | Benchmark harness (evals/) is still maturing — treat specific percentages as directional
Available primitives: strip_ansi, replace, match_output, strip/keep_lines_matching, truncate_lines_at, head/tail_lines, max_lines, on_empty
Debug: RTK_NO_TOML=1 bypasses all TOML filters. RTK_TOML_DEBUG=1 shows which filter fires.
Integration Strategies:
Hook-first install (recommended):
rtk init --global  # Sets up PreToolUse hook + patches settings.json automatically
CLAUDE.md instruction (manual wrapper):
## Token Optimization
Use RTK for all supported commands:
- `rtk git log` (92.3% reduction)
- `rtk git status` (76.0% reduction)
- `rtk git diff` (55.9% reduction)
Skill (auto-suggestion):
Template: examples/skills/rtk-optimizer/SKILL.md
Detects high-verbosity commands
Suggests RTK wrapper automatically
Hook (automatic wrapper):
Template: examples/hooks/bash/rtk-auto-wrapper.sh
PreToolUse hook intercepts bash commands
Applies RTK wrapper when beneficial
Configuration Options:
~/.config/rtk/config.toml
exclude_commands = ["my-interactive-tool", "fzf"] # Never rewrite these
Migration Note (v0.25.0+):
After upgrading from v0.24.0 or earlier, run rtk init --global to install the new thin-delegator hook. The old hook still works, but won’t pick up new command mappings automatically.
RTK handles command outputs (what you run). Smart explore handles code reading (what you read). Together they cover both major token sinks in a Claude Code session.
The problem: When Claude explores a codebase, it reads files completely — 400 lines when it needed 3 function signatures. A typical 10-file module exploration costs 35,000 tokens. With progressive exploration, the same task costs 3,500.
The pattern (3 steps, 86-92% reduction):
Step 1 — Structure (~200 tokens per file)
Get function signatures, types, fields only
Claude answers "what exists?" without reading any body
Step 2 — Target (~350 tokens per function)
Read one specific function by line offset
Not the whole file — just lines 45-90
Step 3 — Cross-reference (~150 tokens)
Find callers of a function
rg "function_name" --type rust -n
This is the same pattern Aider uses for its repo map (40k+ stars) — validated at scale since 2023.
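Steps 1 and 2 of the pattern can be sketched in a few lines. Python's standard `ast` module stands in here for the Tree-sitter or language-specific tooling a real setup would use, so this version only handles Python source. `outline` and `read_span` are hypothetical names chosen for illustration.

```python
import ast


def outline(source: str) -> list[str]:
    """Step 1: list top-level functions, classes, and methods with their line numbers."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(f"{node.lineno}: def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"{node.lineno}: class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"{item.lineno}:     def {item.name}({ast.unparse(item.args)})")
    return lines


def read_span(source: str, start: int, end: int) -> str:
    """Step 2: return only lines start..end (1-indexed, inclusive), not the whole file."""
    return "\n".join(source.splitlines()[start - 1:end])
```

The outline carries line numbers so that the follow-up read in step 2 can target one function body instead of the entire file.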
Approach A: No setup — CLAUDE.md discipline
The fastest path. Add to your project’s CLAUDE.md:
## Code Exploration Protocol
When exploring a codebase or understanding a module:
1. **Structure first** — run the appropriate command for the language:
code-review-graph is the strongest standalone option: MIT, 10k+ stars, 8.2x average token reduction across real codebases (gin: 16x, flask: 9x, FastAPI: 8x, Next.js: 8x). Builds a Tree-sitter AST of your repo, tracks blast radius per change, and exposes 28 MCP tools so Claude reads only the files that matter. Supports 23 languages + Jupyter notebooks, auto-updates on every git commit (< 2s re-index), and ships a multi-repo daemon for editor-agnostic setups.
pip install code-review-graph
code-review-graph install  # auto-detects Claude Code, Cursor, Windsurf, Zed, Continue, Kiro...
code-review-graph build    # first-time parse (~10s for 500 files)
Honest benchmarks:
| Task | Without smart-explore | With smart-explore | Savings |
| --- | --- | --- | --- |
| Understand 5-file module | ~18,000 tokens | ~2,500 tokens | 86% |
| Find where to add a feature | ~8,000 tokens | ~800 tokens | 90% |
| PR review (10 changed files) | ~25,000 tokens | ~3,500 tokens | 86% |
| Single function lookup | ~3,000 tokens | ~350 tokens | 88% |
RTK vs Smart Explore — complete picture:
| | RTK | Smart Explore |
| --- | --- | --- |
| What it saves | Command output tokens | Code reading tokens |
| When | After running git, cargo, npm | Before reading source files |
| How | Regex + text filtering | AST parsing (signatures only) |
| Typical savings | 60-90% on CLI outputs | 86-92% on code exploration |
| Setup | `rtk init --global` (2 min) | CLAUDE.md rule (0 min) or script (5 min) |
Use both. A 30-minute session with RTK + smart explore: ~15-20k tokens instead of ~150-200k.
Spending 2 hours manually debugging to save $1 in API costs
Using Haiku for complex tasks, generating incorrect code
Over-compacting context, losing valuable history
✅ Smart optimization:
Use right model for the task (time saved >> cost)
Invest in good prompts and memory files (reduce iterations)
Automate with agents (consistent, efficient)
Perspective on ROI:
Time savings from effective Claude Code usage typically far outweigh API costs for most development tasks. Rather than calculating precise ROI (which depends heavily on your specific context, hourly rate, and task complexity), focus on whether the tool is genuinely helping you ship faster. For team-level measurement, see Contribution Metrics — Anthropic’s GitHub-integrated dashboard for tracking PR and code attribution (Team/Enterprise plans, public beta).
15 structured development methodologies have emerged for AI-assisted development (2025-2026). This section provides quick navigation; detailed workflows are in dedicated files.
Memorable named patterns for effective Claude Code interaction. These patterns have emerged from community best practices and help you communicate more effectively.
Reading time: 5 minutes
Skill level: Week 2+
Status: Research Preview (as of January 2026)
Session teleportation allows migrating coding sessions between cloud (claude.ai/code) and local (CLI) environments. This enables workflows where you start work on mobile/web and continue locally with full filesystem access.
Related: Ultraplan uses the same web ↔ terminal handoff specifically for the planning phase — plan in the cloud with browser-based review, then teleport the approved plan back to your terminal for execution. If your primary goal is collaborative plan review before implementation, see Ultraplan first.
TL;DR: Multi-instance orchestration = advanced pattern for teams managing 10+ concurrent features. Requires modular architecture + budget + monitoring. 95% of users don’t need this — sequential workflows with 1-2 instances are more efficient for most contexts.
Research Preview — Available on Pro, Max, Team, Enterprise, and Claude API plans. Opt-in: claude agents.
Before setting up tmux grids or third-party orchestrators, try Agent View — Claude Code’s built-in session manager.
How to access:
claude agents from any terminal
Left arrow ← from within any active session
What you see: each row shows session name, status (working / waiting on you / done), last response preview, and time since last interaction.
Key commands:
| Action | How |
| --- | --- |
| Open agent view | `claude agents` or ← from any session |
| Background current session | `/bg` |
| Launch new background session | `claude --bg [task]` |
| Peek at last turn | Select session |
| Reply inline (waiting session) | Select → type reply → session resumes |
| Attach to full transcript | Enter on any session |
Workflow patterns (from early users):
Dispatch and return: Send multiple tasks with claude --bg, return to a list of PRs ready for review
Long-running agents: PR babysitters and looping jobs show next run time in the list
Quick context switch: Left arrow, start a related task or quick question, peek for the answer, right arrow back
Status scan: Status indicators tell you which sessions produced a PR without entering each one
Relation to third-party tools: Before Agent View, parallel session management required tmux, multiclaude, or apps like Conductor. Agent View covers the core “what’s running and what needs me” use case natively. Conductor and similar tools remain relevant for GitHub CI integration, PR workflows, and multi-repo management beyond what Agent View provides.
/goal [condition] sets a natural-language completion condition. Claude keeps working across turns without waiting for your input, stopping only when it believes the condition is satisfied.
/goal all unit tests pass and no TypeScript errors
/goal the PR description is written and the branch is pushed
/goal the migration is complete and smoke tests pass
While /goal is active, a status overlay shows:
Elapsed time — how long the session has been running
Turns used — number of back-and-forth turns consumed
Tokens used — running token consumption
When to use: Long-running tasks where you want to step away and return to a finished result rather than babysitting the session. Works best with clear, verifiable conditions (“tests pass”) rather than vague ones (“looks good”).
Cancel: Send any message to interrupt before the condition is met.
The Boris pattern validation: Boris’s $500-1K/month cost and 259 PRs/month align with Anthropic’s enterprise data showing positive ROI at >3 parallel instances.
Anti-pattern alert (Anthropic findings):
Over-delegation (>5 agents): Coordination overhead > productivity gain
Critical context: Boris is the creator of Claude Code, working with perfect architecture, Anthropic resources, and ideal conditions. This is not representative of average teams.
Key insights from Boris:
On multi-clauding: “I use Cowork as a ‘doer,’ not a chat: it touches files, browsers, and tools directly. I think about productivity as parallelism: multiple tasks running while I steer outcomes.”
On CLAUDE.md: “I treat Claude.md as compounding memory: every mistake becomes a durable rule for the team.”
On plan-first workflow: “I run plan-first workflows: once the plan is solid, execution gets dramatically cleaner.”
On verification loops: “I give Claude a way to verify output (browser/tests): verification drives quality.”
Why Opus 4.6 or Opus 4.7 with Adaptive Thinking: Although more expensive per token ($5/1M input vs $3/1M for Sonnet), Opus requires fewer correction iterations thanks to adaptive thinking. Net result: faster delivery and lower total cost despite higher unit price.
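The "lower total cost" claim has a simple break-even condition, worked out below. This sketch uses only the input prices quoted above and ignores output-token pricing, caching, and any difference in how many tokens each model reads per iteration, so treat it as a rough bound. Both helper names are hypothetical.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def break_even_ratio(expensive_price: float, cheap_price: float) -> float:
    """Largest share of the cheaper model's token volume the pricier model
    can consume while still costing no more in total."""
    return cheap_price / expensive_price


ratio = break_even_ratio(expensive_price=5.0, cheap_price=3.0)
print(f"Opus breaks even if it uses at most {ratio:.0%} of the tokens Sonnet would")
```

At $5 vs $3 per million input tokens, the pricier model only comes out cheaper when fewer correction iterations cut its total token volume to about 60% or less of what the cheaper model would have used.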
The supervision model: Boris describes his role as “tending to multiple agents” rather than “doing every click yourself.” The workflow becomes about steering outcomes across 5-10 parallel sessions, unblocking when needed, rather than sequential execution.
Team patterns (broader Claude Code team, Feb 2026):
The broader team extends Boris’s individual workflow with institutional patterns:
Skills as institutional knowledge: Anything done more than once daily becomes a skill checked into version control. Examples:
/techdebt — run at end of session to eliminate duplicate code
Context dump skills — sync 7 days of Slack, Google Drive, Asana, and GitHub into a single context
Analytics agents — dbt-powered skills that query BigQuery; one engineer reports not writing SQL manually for 6+ months
CLI and scripts over MCP: The team prefers shell scripts and CLI integrations over MCP servers for external tool connections. Rationale: less magic, easier to debug, and more predictable behavior. MCP is reserved for cases where bidirectional communication is genuinely needed.
Re-plan when stuck: Rather than pushing through a stalled implementation, the team switches back to Plan Mode. One engineer uses a secondary Claude instance to review plans “as a staff engineer” before resuming execution.
Claude writes its own rules: After each correction, the team instructs Claude to update CLAUDE.md with the lesson learned. Over time, this compounds into a team-specific ruleset that prevents recurring mistakes.
While Boris’s workflow demonstrates horizontal scaling (5-15 instances in parallel), an alternative pattern focuses on vertical separation: using two Claude instances with distinct roles for quality-focused workflows.
Pattern source: Jon Williams (Product Designer, UK), transition from Cursor to Claude Code after 6 months. LinkedIn post, Feb 3, 2026
This pattern is orthogonal to Boris’s approach: instead of scaling breadth (more features in parallel), it scales depth (separation of planning and execution phases).
| Your Context | Use Dual-Instance? | Monthly Cost |
| --- | --- | --- |
| Solo dev, spec-heavy work | ✅ Yes | $100-200 |
| Small team, complex requirements | ✅ Yes | $150-300 |
| Product designers coding | ✅ Yes | $100-200 |
| High-volume parallel features | ❌ No, use Boris pattern | $500-1K+ |
Use when:
You need plan verification before execution
Specs are complex or ambiguous (interview-based clarification helps)
Lower budget than Boris pattern ($100-200/month vs $500-1K+)
Quality > speed (willing to sacrifice parallelism for better plans)
Don’t use when:
You need to ship 10+ features simultaneously (use Boris pattern)
Plans are straightforward (single instance with /plan is enough)
Isolated by role separation (planner vs implementer)
| Aspect | Boris pattern | Dual-instance pattern |
| --- | --- | --- |
| Accountability | Git history (commits per instance) | Human-in-the-loop (review plans before execution) |
| Tooling required | Worktrees, teleport, /commit-push-pr | Plans/ directory structure |
| Coordination | Self-orchestrated (Boris steers 10 sessions) | Human gatekeeper (approve plans) |
| Best for | Shipping 10+ features/day, experienced teams | Complex specs, quality-critical, budget-conscious |
Key insight: These patterns are not mutually exclusive. You can use dual-instance for complex features (planning rigor) and Boris pattern for high-volume simple features (speed).
While git worktrees are foundational, daily productivity improves with automation wrappers. Multiple professional teams have independently created worktree management tools—a validated pattern.
Alias feels limiting (want CI status, LLM commits, project hooks)
Volume increases to 15+ worktrees/week
Team adopts multi-instance workflows (need consistent tooling)
Bottom line: Most readers (80%) should start with vanilla git or alias. Worktrunk is for power users managing 5-10+ instances daily where typing friction and CI visibility matter.
+50% productivity (self-reported, vs +20% 12 months prior)
2-3x increase year-over-year in usage and output
59% of work involves Claude (vs 28% a year ago)
27% of work “wouldn’t have been done otherwise” (scope expansion, not velocity)
Autonomous actions:
21.2 consecutive tool calls without human intervention (vs 9.8 six months prior)
+116% increase in autonomous action chains
33% reduction in human interventions required
Average task complexity: 3.8/5 (vs 3.2 six months before)
Critical concerns (verbatim quotes from engineers):
“When producing is so easy and fast, it’s hard to really learn”
“It’s difficult to say what roles will be in a few years”
“I feel like I come to work each day to automate myself”
Implications: Even at Anthropic (perfect conditions: created the tool, ideal architecture, unlimited budget), engineers express uncertainty about long-term skill development and role evolution.
Five months after the internal study, Anthropic published updated productivity data alongside a new analytics feature for Team and Enterprise customers.
Updated metrics (Anthropic internal):
+67% PRs merged per engineer per day (vs Aug 2025 self-reported +50%)
70-90% of code now written with Claude Code assistance across teams
Methodological note: These figures are PR/commit-based (measured via GitHub integration), not self-reported surveys as in the Aug 2025 study. However, Anthropic discloses no baseline period, no team breakdown, and defines measurement only as “conservative — only code where we have high confidence in Claude Code’s involvement.” Treat as directional indicators, not rigorous benchmarks.
Product feature — Contribution Metrics dashboard:
Status: Public beta (January 2026)
Availability: Claude Team and Enterprise plans (exact add-on requirements unconfirmed)
Tracks: PRs merged and lines of code committed, with/without Claude Code attribution
Access: Workspace admins and owners only
Setup: Install Claude GitHub App → Enable GitHub Analytics in Admin settings → Authenticate GitHub organization
Positioning: Complement to existing engineering KPIs (DORA metrics, sprint velocity), not a replacement
Goal: Scale to 3-5 instances with orchestration framework.
# 1. Deploy orchestration framework (choose based on needs)
# - Headless PM (manual coordination)
# - Gas Town (parallel task execution)
# - multiclaude (self-hosted, tmux-based)
# - Entire CLI (governance + sequential handoffs)
# 2. Define roles
# - Architect (reviews PRs)
# - Backend (API development)
# - Frontend (UI development)
# - QA (test automation)
# 3. Weekly retrospectives
# - Review conflict rate
# - Measure ROI (cost vs output)
# - Adjust instance count
Orchestration framework options:
| Tool | Paradigm | Best For |
| --- | --- | --- |
| Manual (worktrees) | No framework | 2-3 instances, full control |
| Gas Town | Parallel coordination | 5+ instances, complex parallel tasks |
| multiclaude | Self-hosted spawner | Teams needing on-prem/airgap |
| Entire CLI | Governance + handoffs | Sequential workflows with compliance |
Entire CLI (Feb 2026): Alternative to parallel orchestration, focuses on sequential agent handoffs with governance layer (approval gates, audit trails). Useful for compliance-critical workflows (SOC2, HIPAA) or multi-agent handoffs (Claude → Gemini). See AI Ecosystem Guide for details.
Success criteria: Sustained 3-5% productivity gain over 3 months.
Every API request Claude Code makes now includes an X-Claude-Code-Session-Id header. Reverse proxies and API gateways can use it to aggregate costs, latency, and quota usage by session without inspecting the request body.
nginx example:
map $http_x_claude_code_session_id $session_id {
    default $http_x_claude_code_session_id;
}
log_format claude '$remote_addr - $session_id - $request_time - $status';
This lets you build per-session dashboards, enforce session-level rate limits, or attribute API costs to individual developers or CI jobs — all without modifying Claude Code’s configuration.
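As a rough illustration of the per-session aggregation, the sketch below parses lines written in the `claude` log format from the nginx example and totals requests, latency, and errors per session ID. It assumes the four fields are separated by " - " exactly as in that format string. `aggregate_by_session` is a hypothetical helper, and the sample values in the usage note are made up.

```python
from collections import defaultdict


def aggregate_by_session(log_lines: list[str]) -> dict[str, dict[str, float]]:
    """Sum request count, total latency, and error count per session ID."""
    stats: dict[str, dict[str, float]] = defaultdict(
        lambda: {"requests": 0, "seconds": 0.0, "errors": 0}
    )
    for line in log_lines:
        parts = line.strip().split(" - ")
        if len(parts) != 4:
            continue  # skip lines that do not match the expected format
        _addr, session_id, request_time, status = parts
        entry = stats[session_id]
        entry["requests"] += 1
        entry["seconds"] += float(request_time)
        if int(status) >= 400:
            entry["errors"] += 1
    return dict(stats)
```

Feeding it `["10.0.0.5 - s1 - 1.5 - 200", "10.0.0.5 - s1 - 0.5 - 429"]` groups both requests under session `s1`, with one counted as an error.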
The paradigm shift: Traditional codebases are optimized for human developers. AI agents have different needs—they excel at pattern matching but struggle with implicit knowledge and scattered context.
Key principles:
Domain Knowledge Embedding: Put business logic and design decisions directly in code (CLAUDE.md, ADRs, comments)
Code Discoverability: Make code “searchable” like SEO—use synonyms, tags, complete terms
Documentation Formats: Use llms.txt for AI-optimized documentation indexing (complements MCP servers)
Token Efficiency: Split large files, remove obvious comments, use verbose flags for debug output
Testing for Autonomy: TDD is more critical for agents than humans—tests guide behavior
Guardrails: Hooks, CI checks, and PR reviews catch agent mistakes early
When to optimize for agents: High-impact files (core business logic, frequently modified modules) and greenfield projects. Don’t refactor stable code just for agents.
Why this matters: Agents read code sequentially and lack the “mental model” humans build over time. What’s obvious to you (e.g., “this service handles auth”) must be made explicit.
... (minimal context, rest is framework conventions)
Recommendation: For greenfield projects with AI-assisted development, prefer opinionated frameworks unless architectural constraints require custom design. The reduction in agent cognitive load often outweighs loss of flexibility.
Problem: Agents lack context about your business domain, design decisions, and project history. They can read code syntax but miss the “why” behind decisions.
Solution: Embed domain knowledge directly in discoverable locations.
- **Appointment**: External calendar system's term (Google/Outlook)
- **Sync Job**: Background process reconciling our DB with external calendars
- **Conflict Resolution**: Algorithm handling overlapping events (see `src/services/conflict-resolver.ts`)
## Gotchas
- Google Calendar API has 10 req/sec rate limit per user → batch operations in `syncEvents()`
- Outlook timezone handling is non-standard → use `normalizeTimezone()` helper
- Event deletion = soft delete (set `deletedAt`) to maintain audit trail for compliance
Why this works: When the agent encounters syncEvents(), it understands the rate limiting constraint. When it sees deletedAt, it knows not to use hard deletes.
Problem: Agents need to discover and consume project documentation efficiently. Traditional documentation (wikis, Confluence) is hard to find and parse. MCP doc servers require installation and configuration.
Solution: Use the llms.txt standard for AI-optimized documentation indexing.
llms.txt is a lightweight standard for making documentation discoverable to LLMs. It’s like robots.txt for AI agents—a simple index file that tells agents where to find relevant documentation.
Anthropic publishes two LLM-optimized variants of the Claude Code documentation:
| File | URL | Size | Tokens (approx.) | Use case |
| --- | --- | --- | --- | --- |
| llms.txt | code.claude.com/docs/llms.txt | ~65 pages | ~15-20K | Quick index, section discovery |
| llms-full.txt | code.claude.com/docs/llms-full.txt | ~98 KB | ~25-30K | Fact-checking, complete documentation, source of truth |
Recommended pattern: fetch llms.txt first to identify the relevant section, then fetch the specific page (or llms-full.txt) for the details. This avoids loading 98 KB when only 2 pages are needed.
These URLs are the official source to consult first when a claim about Claude Code seems uncertain or potentially outdated.
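The index-first lookup can be sketched as a small parser. It assumes the index follows the usual llms.txt convention of Markdown link lines (`- [Title](url): description`); the exact layout of the Claude Code index is not verified here. Fetching is left out so the example runs offline, and `find_pages` and the sample URLs are hypothetical.

```python
import re

# Matches "[Title](url)" with an optional ": description" suffix.
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?")


def find_pages(index_text: str, keyword: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs whose title or description mentions the keyword."""
    needle = keyword.lower()
    matches = []
    for line in index_text.splitlines():
        m = LINK.search(line)
        if not m:
            continue
        title, url, description = m.group(1), m.group(2), m.group(3) or ""
        if needle in title.lower() or needle in description.lower():
            matches.append((title, url))
    return matches
```

Only the URLs returned by `find_pages` then need to be fetched, which keeps the full documentation dump out of context unless it is actually required.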
Most developers pick one approach and stick with it. But Claude Code’s tooling supports systematic variation—testing multiple approaches to find the optimal solution.
Permutation Frameworks formalize this: instead of hoping your first approach works, you systematically generate and evaluate variations.
A permutation framework defines dimensions of variation and lets Claude generate all meaningful combinations. Each dimension represents a design choice; each combination is a distinct implementation approach.
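Generating the combinations is a Cartesian product over the dimensions, optionally filtered down to the meaningful ones. The sketch below shows that mechanic only. The dimension names and values in the usage note are invented for illustration, and `permutations_of` is a hypothetical helper rather than part of any Claude Code feature.

```python
from itertools import product
from typing import Callable


def permutations_of(
    dimensions: dict[str, list[str]],
    is_meaningful: Callable[[dict[str, str]], bool] = lambda variant: True,
) -> list[dict[str, str]]:
    """Expand named dimensions into every combination, keeping those that pass the filter."""
    names = list(dimensions)
    variants = (dict(zip(names, combo)) for combo in product(*dimensions.values()))
    return [v for v in variants if is_meaningful(v)]
```

For example, `permutations_of({"state": ["redux", "context"], "styling": ["css-modules", "tailwind"]})` yields four variants, each of which could be handed to Claude as a separate implementation brief.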
Agent teams enable multiple Claude instances to work in parallel on a shared codebase, coordinating autonomously without human intervention. One session acts as team lead to break down tasks and synthesize findings from teammate sessions.
Key difference from Multi-Instance (§9.17):
Multi-Instance = You manually orchestrate separate Claude sessions (independent projects, no shared state)
Agent Teams = Claude manages coordination automatically (shared codebase, git-based communication)
Two distinct coordination patterns exist for multi-agent review, and the choice matters:
| Dimension | Sequential Specialists | Swarm Mode |
| --- | --- | --- |
| Structure | Predefined lead + members | Ad-hoc, no hierarchy |
| Coordination | Lead assigns tasks, synthesizes | Each reviewer works independently |
| Leadership | Team lead orchestrates | Human synthesizes findings |
| Task assignment | Lead delegates to specific agents | All relevant agents get the same input |
| Best for | Tasks with dependencies between reviewers | Independent review, final pre-merge pass |
| When to use | Complex workflows, state needs sharing | PR review, unfamiliar codebase, thoroughness |
Swarm Mode in practice (Every.to compound-engineering pattern):
Launch all relevant specialist reviewers in parallel against the same diff or PR, with no coordination between them. Each produces independent findings. You read all findings and decide what to act on.
# Swarm: all reviewers see the same input, report independently
This is distinct from Agent Teams: there is no persistent team structure, no shared context between agents, no lead synthesizing in real time. It is faster to set up and appropriate when thoroughness matters more than coordination.
Rule of thumb: Use Agent Teams for workflows with sequential dependencies (agent A’s output feeds agent B). Use Swarm when each reviewer can work from the same starting point and you want maximum coverage with minimum setup overhead.
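The fan-out shape of Swarm Mode can be sketched with a thread pool: every reviewer receives the same diff, none sees another's output, and the caller gets all findings back keyed by reviewer. The two reviewer functions are hypothetical stand-ins for real sub-agent invocations, and their checks are deliberately trivial.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Reviewer = Callable[[str], list[str]]


def run_swarm(diff: str, reviewers: dict[str, Reviewer]) -> dict[str, list[str]]:
    """Run every reviewer on the same diff in parallel; return findings per reviewer."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(review, diff) for name, review in reviewers.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical placeholder reviewers; a real swarm would call specialist agents here.
def security_reviewer(diff: str) -> list[str]:
    return ["possible hardcoded secret"] if "password=" in diff else []


def style_reviewer(diff: str) -> list[str]:
    too_long = any(len(line) > 100 for line in diff.splitlines())
    return ["line longer than 100 characters"] if too_long else []
```

Because nothing is merged or deduplicated, the human reading the result remains the synthesis step, which matches the "Human synthesizes findings" row in the comparison above.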
Standard multi-agent pipelines have a systematic flaw: audit agents over-report. When you ask three sub-agents to find contradictions, duplications, or coverage gaps in a set of artifacts, they will find them everywhere, including in patterns that are intentional, complementary, or simply not conflicting.
The solution is a fourth agent whose only job is to reject false positives from the first three.
How it works:
Phase 1: Artifact inventory (orchestrator builds the inventory)
Phase 2: Pairwise analysis (3 agents in parallel, each owns one pair-type)
├── Agent A: standards vs skills
├── Agent B: standards vs commands
└── Agent C: skills vs commands
Phase 3: Skeptical review (1 agent reviews all raw findings)
└── Applies false-positive filter criteria
└── Produces KEEP/REJECT log + final report
The skeptical reviewer agent operates with explicit anti-hallucination rules. From the Packmind playbook-audit implementation:
“Be skeptical. Audit agents tend to over-report; your job is to filter. A 50%+ rejection rate is normal and healthy.”
False positive criteria the reviewer applies before keeping a finding:
Intentional scope limits: The artifacts address different scopes (all files vs migration files only) and do not actually conflict within the narrower scope
Complementary content: One artifact defines a rule, the other implements it; this is design, not duplication
Different contexts: The artifacts address different situations, even if they use similar language
Trivial overlap: Both mention the same concept but neither prescribes conflicting rules about it
Delegation pattern: A command invoking a skill (or vice versa) is complementary, not a gap or contradiction
Evidence requirement: The reviewer only keeps a finding when it can point to specific passages in both artifacts. No evidence from both sides, no finding.
Detection-only scope: The skeptical reviewer produces a report. It does not modify any artifact. Fixing is a separate step triggered by a human reading the report.
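The bookkeeping side of the skeptical reviewer can be sketched as a filter over raw findings. Deciding whether a false-positive criterion applies is the reviewer agent's judgment call and is not modeled here; the sketch only enforces the evidence requirement and produces the KEEP/REJECT log. The `Finding` fields and reason labels are hypothetical names, not taken from the Packmind implementation.

```python
from dataclasses import dataclass

FALSE_POSITIVE_REASONS = {
    "intentional_scope_limit",
    "complementary_content",
    "different_contexts",
    "trivial_overlap",
    "delegation_pattern",
}


@dataclass
class Finding:
    summary: str
    evidence_a: str = ""             # passage quoted from the first artifact
    evidence_b: str = ""             # passage quoted from the second artifact
    false_positive_reason: str = ""  # set by the reviewer when a criterion applies


def skeptical_review(findings: list[Finding]) -> tuple[list[Finding], list[str]]:
    """Return (kept findings, KEEP/REJECT log). Artifacts are never modified."""
    kept, log = [], []
    for f in findings:
        if not (f.evidence_a and f.evidence_b):
            log.append(f"REJECT {f.summary}: evidence required from both artifacts")
        elif f.false_positive_reason in FALSE_POSITIVE_REASONS:
            log.append(f"REJECT {f.summary}: {f.false_positive_reason}")
        else:
            kept.append(f)
            log.append(f"KEEP {f.summary}")
    return kept, log
```

Keeping the rejected items in the log, rather than silently dropping them, lets a human check whether the rejection rate looks healthy.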
When to apply this pattern:
| Situation | Apply? |
| --- | --- |
| Auditing a set of N artifacts for cross-artifact consistency | Yes |
| Running a doc-vs-codebase audit across many files | Yes |
| Code review where you want coverage, not noise | Yes |
| Single-agent analysis of one file | No |
Connection to Swarm Mode: Swarm Mode (above) sends the same input to multiple reviewers in parallel for coverage. The Skeptical Reviewer pattern adds a synthesis layer that filters swarm output before surfacing it. They compose naturally: run the swarm, pipe its output through the skeptical reviewer.
Paul Rayner (CEO Virtual Genius, EventStorming Handbook author):
“Running 3 concurrent agent team sessions across separate terminals. Pretty impressive compared to previous multi-terminal workflows without coordination.”
Workflows used (Feb 2026):
Job search app: Design research + bug fixing
Business ops: Operating system + conference planning
Context: In February 2026, Anthropic published a COBOL modernization playbook positioning Claude Code as a direct replacement for legacy consulting teams. The same day, IBM stock dropped 13% (its worst single-day performance since October 2000). The workflow described is validated by independent research, and it applies to any large legacy codebase (COBOL, Fortran, VB6, PL/I), not just COBOL.
The real cost isn’t the migration itself — it’s the discovery phase. Original developers have retired. Documentation is absent or wrong. Code has been patched for decades by engineers who never understood the full system. Finding what talks to what requires consultants billing by the hour.
AI changes the economics by automating this exact phase.
COBOL context (for scale reference):
~220 billion lines of COBOL still in production (IBM estimate)
~95% of US ATM transactions run on COBOL-based systems (Reuters/industry consensus — methodology varies by source)
Independent validation: Academic research (WJAETS 2025) shows a 25-30% timeline reduction on average. Best case: Airbnb migrated 3,500 test files in 6 weeks vs. an estimated 1.5 years. COBOL→Java accuracy: 93% in controlled studies (arXiv, April 2025).
Step 1 — Automated Exploration & Discovery
Map the entire codebase:
- Identify all program entry points and execution paths
- Trace subroutine calls across hundreds of files
- Document implicit dependencies via shared files, databases, and global state
- Generate a dependency graph before touching a single line
Prompt pattern:
"Read the entire [COBOL/legacy] codebase. Map its structure:
and any implicit dependencies via shared data structures,
global variables, or file I/O. Output a dependency map."
Step 2 — Risk Analysis & Opportunity Mapping
With the dependency map in hand:
- Assess coupling levels between modules (high coupling = high risk)
- Surface isolated components as safe modernization candidates
- Identify duplicated logic and dead code
- Flag shared state as the highest-risk zones
Prompt pattern:
"Based on the dependency map: rank modules by coupling level.
Which components can be modernized in isolation?
Which share state with 3+ other modules and should be touched last?"
Step 3 — Strategic Planning
Human + AI collaboration:
- AI suggests prioritization based on risk/dependency analysis
- Team reviews against business priorities (what breaks = most expensive)
- Define target architecture and code standards
- Design function-level tests for validation before migration begins
This phase is not fully automatable — business context requires human judgment.
Hybrid human-AI workflows show 31% higher completion rates within initial time estimates
vs. purely automated approaches (WJAETS 2025).
Step 4 — Incremental Implementation
Never migrate the whole system at once:
- Translate logic component by component
- Create API wrappers for legacy components still in use
- Run old and new code side-by-side in production
- Validate each component independently before proceeding to the next
Prompt pattern:
"Translate [module X] to [target language].
Preserve exact business logic — no optimization yet.
Add a compatibility wrapper so both versions can run in parallel.
Write tests that verify identical outputs for identical inputs."
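A parity test of the kind the prompt requests can be sketched as follows. The two fee functions are hypothetical stand-ins for a legacy routine and its translation; in a real migration the legacy side would usually be reached through the compatibility wrapper.

```python
from typing import Any, Callable, Iterable


def find_mismatches(
    legacy: Callable[[Any], Any],
    migrated: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list[tuple[Any, Any, Any]]:
    """Return (input, legacy_output, migrated_output) for every disagreement."""
    mismatches = []
    for value in inputs:
        expected = legacy(value)
        actual = migrated(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches


def legacy_late_fee_cents(days_late: int) -> int:
    """Stand-in for the legacy routine (hypothetical business rule)."""
    if days_late <= 0:
        return 0
    fee = days_late * 50
    if fee > 1000:
        fee = 1000
    return fee


def migrated_late_fee_cents(days_late: int) -> int:
    """Stand-in for the translated routine: same rule, no optimization."""
    return min(max(days_late, 0) * 50, 1000)


# Sweep boundary values (negative, zero, the cap) as well as typical ones.
mismatches = find_mismatches(legacy_late_fee_cents, migrated_late_fee_cents, range(-5, 60))
```

An empty `mismatches` list is the signal to proceed to the next component. A non-empty list gives the exact inputs where the translation diverges.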
“Years to quarters” is real — but it’s the optimistic scenario, not the average:
| Scenario | Timeline reduction | Source |
|---|---|---|
| Conservative estimate | -25 to -30% | WJAETS 2025 academic review |
| Automation-heavy phases | -40 to -50% | Fullstack Labs industry synthesis |
| Best-case (test migration) | about -92% (6 weeks vs 1.5 yr) | Airbnb case study |
| COBOL→Java conversion accuracy | 93% | arXiv, April 2025 |
The average gains are real and significant. The headline numbers require favorable conditions: good test coverage, isolated modules, and a team that understands both the legacy system and the target stack.
Reading time: 7 minutes
Skill level: Week 2+
Status: Research Preview (as of February 2026)
Availability: Pro and Max plans only — not available on Team, Enterprise, or API keys
Remote Control lets you monitor and control a local Claude Code session from a phone, tablet, or web browser — without migrating anything to the cloud. Your terminal keeps running locally; the mobile/web interface is a remote window onto that session.
Key difference from Session Teleportation (§9.16): Teleportation migrates a session (web → local). Remote Control mirrors a local session to a remote viewer. Execution always stays on your local machine.
- Slash commands: /new, /compact, etc. are treated as plain text in the remote UI
- Plans: Pro/Max only; not available on Team, Enterprise, or API keys
⚠️ Slash commands limitation: When you type /new, /compact, or any slash command in the remote interface (mobile app or browser), they are treated as plain text messages — not forwarded as commands to the local CLI. Use slash commands from your local terminal instead.
# Pane 1: claude → run /rc → share URL with your phone
# Pane 2: claude (local only)
# Pane 3: claude (local only)
# To switch which session you're controlling remotely:
# → Go to pane 2, run /rc (disconnects pane 1's remote, connects pane 2)
Each tmux pane hosts its own Claude session. Only one can use remote-control at a time, but you can switch between sessions by running /rc in different panes.
Remote Control works on remote machines (VMs, cloud servers) running in tmux:
# On your cloud server (e.g., Clever Cloud, AWS, etc.):
tmux new-session -s claude-server
claude remote-control
# → Scan QR code from your phone
# → Control a cloud-hosted Claude session from mobile
# → Sessions survive laptop reboots (tmux keeps them alive)
This gives you persistent sessions that survive closing your laptop. Combine 6-8 Claude sessions in tmux for continuous uninterrupted work while traveling.
Known bug (Research Preview) — use claude.ai/code in Safari instead (see below)
| Issue | Fix |
|---|---|
| QR code opens app but session not visible | Known bug on iOS: scan with the native camera app and open in Safari rather than the Claude app |
| QR code not showing | Press spacebar after starting remote-control |
| Slash commands not working | Type them in your local terminal instead |
| Session expired | Reconnect: run /rc again |
| Corporate firewall blocking | HTTPS outbound (port 443) must be allowed |
| “Not available” error | Verify Pro or Max subscription (not Team/Enterprise) |
Known bug (Research Preview, March 2026): On iOS (confirmed iPhone), scanning the QR code opens the Claude app but the remote session doesn’t appear in the session list. The bug also affects automatic session discovery in the Claude mobile app. MacStories confirmed this is inconsistent on non-local machines.
Most reliable workaround: open claude.ai/code in Safari on your phone — your active session appears in the list there. Alternatively, copy the session URL from the terminal and paste it directly in Safari. Both paths bypass the app’s sync bug entirely.
See also: §9.10 Continuous Improvement Mindset — the conceptual foundation for this section. §9.23 is the operational layer: detecting when to act, and how.
As your Claude Code setup matures — skills, agents, rules, CLAUDE.md — a silent failure mode emerges: your configuration drifts away from how you actually work. Skills accumulate assumptions that no longer hold. CLAUDE.md describes a codebase that has evolved. Rules cover edge cases that became the norm. The agent keeps making the same correctable mistakes because nothing captures what you learned last week.
This section covers how to detect that drift early and close the loop — turning session observations into concrete config improvements.
Skills accumulate. Without a lifecycle policy, you end up with 20+ skills where half are unused, two contradict each other, and none have version history.
When to create a skill:
A task is worth encoding as a skill when you’ve done it manually 3+ times and the steps are stable enough to write down. If you’re still figuring out the right approach, don’t encode it yet — premature skills crystallize bad patterns.
When to update a skill (patch):
- A command in the skill fails because an API or path changed
- The output needs a small clarification you keep adding manually
- You added a convention and the skill doesn’t reflect it yet
When to version a skill (minor/major):
Add a version field and updated date to your skill frontmatter:
---
version: 1.2.0
updated: 2026-03-02
breaking_since: null
---
Use a simple policy:
- patch (x.x.Z): rewording, clarification, examples added; no behavior change
- minor (x.Y.z): new instructions, extended scope, new opt-in behavior
- major (X.y.z): default behavior changes; annotate what broke and when in your CHANGELOG
When to deprecate a skill:
Add a deprecated: true flag and a note explaining what replaced it. Don’t delete immediately — other skills or commands may reference it.
CI staleness check — CLAUDE.md vs source modules:
If your CLAUDE.md is assembled from source modules (e.g., via a pnpm ai:configure pipeline), add a CI job to catch divergence before it causes silent failures:
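One way such a check could work is to re-run the assembly in CI and fail when the result differs from the committed file. The sketch below assumes, purely for illustration, that assembly is plain concatenation of the source modules; `assemble`, `is_stale`, and `check_files` are hypothetical names, and a real pipeline would call its own assembly step instead.

```python
from pathlib import Path


def assemble(modules: list[str]) -> str:
    """Stand-in for the real assembly step: join modules with blank lines."""
    return "\n\n".join(part.strip() for part in modules) + "\n"


def is_stale(committed: str, modules: list[str]) -> bool:
    """True when the committed CLAUDE.md differs from a fresh assembly."""
    return committed != assemble(modules)


def check_files(claude_md: Path, module_paths: list[Path]) -> int:
    """Return a process exit code: 0 if CLAUDE.md is current, 1 if stale."""
    committed = claude_md.read_text(encoding="utf-8")
    modules = [path.read_text(encoding="utf-8") for path in module_paths]
    if is_stale(committed, modules):
        print(f"{claude_md} is out of date; re-run the assembly step and commit.")
        return 1
    return 0
```

A CI job would pass the result of `check_files` to `sys.exit`, so a stale CLAUDE.md blocks the merge instead of silently misleading the next session.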
The update loop formalizes what you already do informally: something doesn’t work well → you notice → you fix it. The difference is making the “notice” step systematic rather than accidental.
┌──────────────────────────────────────────────┐
│ THE UPDATE LOOP │
│ │
│ Session → Observe friction │
│ (repeated fixes, tool fails) │
│ ↓ │
│ Analyze root cause │
│ (which skill/rule is missing?) │
│ ↓ │
│ Delta update │
│ (targeted edit, not rewrite) │
│ ↓ │
│ Canary test │
│ (verify the fix holds) │
│ ↓ │
│ Next session → repeat │
└──────────────────────────────────────────────┘
The delta update principle: when updating a skill or rule, make the smallest targeted edit that fixes the observed problem. Don’t rewrite the whole skill — you’ll lose what was working. One problem, one edit, one test.
Integrating into /tech:handoff:
If you use a handoff command to persist session context, add a mandatory retrospective step before saving:
# Append to your handoff command prompt
Before saving context, answer:
- Which rules or skills were missing for today's work?
- Which corrections did you make more than once?
- What's the smallest edit that would prevent the most repeated friction?
Save conclusions via: write_memory("retro_[date]", your answers)
Canary testing a skill after update:
Before committing a skill change, verify it still produces the expected output on a known input:
# Example: test that typescript-aristote skill generates Zod validation
claude -p "Using the typescript-aristote skill: create a basic user tRPC router" \
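A canary check can also be scripted as a command plus a list of markers that must appear in its output. This is a minimal sketch, not a prescribed harness: the `z.object` marker is an assumed sign of Zod validation, and the stand-in command exists only so the example runs without the CLI installed.

```python
import subprocess
import sys


def missing_markers(output: str, required: list[str]) -> list[str]:
    """Markers from `required` that do not appear in `output`."""
    return [marker for marker in required if marker not in output]


def run_canary(command: list[str], required: list[str], timeout: float = 300.0) -> list[str]:
    """Run `command` and report which required markers its stdout lacks."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=False
    )
    return missing_markers(result.stdout, required)


# Stand-in command so the sketch runs anywhere. For a real canary, swap in e.g.
# ["claude", "-p", "Using the typescript-aristote skill: create a basic user tRPC router"]
stand_in = [sys.executable, "-c", "print('const input = z.object({ id: z.string() })')"]
missing = run_canary(stand_in, ["z.object"])
```

An empty `missing` list means the skill still produces the expected pattern. Anything else names the marker that disappeared after the edit.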
If you want to automate prompt optimization beyond the manual update loop, two frameworks are worth knowing:
DSPy (Stanford, open-source) — optimizes prompts programmatically given a metric and a set of examples. Requires 20+ labeled examples per skill for reliable results. Useful when you have a well-defined task and enough session history to build a dataset. dspy.ai
TextGrad — treats prompts as differentiable parameters and iterates using LLM-generated feedback as “gradients”. Better for creative or domain-specific tasks where the evaluation is qualitative. github.com/zou-group/textgrad
Both require more setup than the manual loop above, and neither eliminates the need for human judgment on what to optimize. Start with the update loop and canary tests — they’ll surface most of the value with a fraction of the overhead.
Relationship to §9.23: The Update Loop handles deliberate config maintenance — you notice drift, you fix it. Instinct-based learning handles incidental capture — useful observations you’d otherwise forget by end of session.
Standard session-end prompts (“what did you learn this session?”) produce verbose summaries that rarely get acted on. The friction between “observation” and “encoded rule” is high enough that most corrections never make it back into your config.
What actually gets encoded: corrections you make twice, then a third time, until the repetition forces you to write a rule. That’s too slow, and it only captures the painful patterns — not the useful ones.
Instincts are lightweight, low-commitment observations — candidate rules that haven’t been validated yet. They sit below skills (stable, tested, promoted) and below memory (project context, decisions):
Session observation
↓
Instinct (low confidence, 0.1–0.4)
↓ confirmed across multiple sessions
Candidate rule (medium confidence, 0.5–0.7)
↓ tested explicitly
Skill or CLAUDE.md rule (high confidence, 0.8+)
Each instinct tracks: content (the observation), confidence (0.0–1.0, starts low and grows with confirmation), source (which session/context), and decay (confidence drops if not confirmed over time).
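As a rough illustration, the four tracked fields and the promotion tiers could be modeled like this. The starting confidence, boost size, and decay rate are assumptions chosen for the example; only the tier thresholds come from the pipeline above.

```python
from dataclasses import dataclass


@dataclass
class Instinct:
    content: str              # the observation itself
    source: str               # which session or context produced it
    confidence: float = 0.2   # assumed starting value inside the 0.1-0.4 band
    decay: float = 0.05       # assumed confidence lost per unconfirmed session

    def confirm(self, boost: float = 0.15) -> None:
        """Raise confidence when another session confirms the observation."""
        self.confidence = round(min(1.0, self.confidence + boost), 2)

    def age(self, sessions_without_confirmation: int = 1) -> None:
        """Lower confidence when sessions pass without confirmation."""
        lost = self.decay * sessions_without_confirmation
        self.confidence = round(max(0.0, self.confidence - lost), 2)

    @property
    def tier(self) -> str:
        """Map confidence onto the three tiers of the pipeline."""
        if self.confidence >= 0.8:
            return "skill or CLAUDE.md rule"
        if self.confidence >= 0.5:
            return "candidate rule"
        return "instinct"


example = Instinct(
    content="Run the typecheck before the test suite in this repo",  # hypothetical
    source="session-12",                                             # hypothetical
)
```

Reaching the top tier here only marks an instinct as ready for review; promotion itself stays a manual decision.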
The key design choice: capture at the Stop hook, not at UserPromptSubmit.
Why Stop, not UserPromptSubmit: UserPromptSubmit runs before every message — adding extraction logic there adds latency to every interaction. Stop runs once when the session ends — zero impact on session speed, and the full session context is available for pattern extraction.
.claude/hooks/capture-instincts.sh
#!/bin/bash
# Stop hook: extract candidate observations from the completed session
The promotion step stays manual by design — you decide what gets encoded. The pipeline reduces the friction of capturing observations, not the friction of validating them.
1. Add capture-instincts.sh as a Stop hook in settings.json
2. Review weekly: 5 minutes maximum
3. Promote 0–2 high-confidence instincts per week; delete the rest
What not to capture: project-specific context (use memory), patterns you’re already confident in (write the skill directly), one-off workarounds (let them go).
Credit: Instinct-based learning pipeline and the Stop hook capture pattern from Everything Claude Code v2 (Affaan Mustafa). The confidence scoring, decay model, and instinct → skill evolution pipeline are their original contribution.
The core insight: model capability and execution reliability are orthogonal. The same model produces fundamentally different outcomes depending on the infrastructure around it, not the model’s quality. That infrastructure is the harness.
The harness is everything in the engineering environment around the agent: the instruction files, initialization scripts, state tracking, verification commands, and feedback loops. It is not a prompt file and not a list of guidelines. The harness is the workbench the agent operates inside.
Five subsystems make up a complete harness:
| Subsystem | Purpose | Core artifacts |
|---|---|---|
| Instructions | Defines what the agent should do and how to behave | AGENTS.md, CLAUDE.md |
| Tools | Shell access, file editing, command execution | Native Claude Code tools |
| Environment | Dependencies, versions, reproducible baseline | init.sh, lockfiles, devcontainers |
| State | Tracks scope and progress across sessions | feature_list.json, progress.md |
| Feedback | Signals whether work is correct before declaring done | Tests, lint, typecheck, E2E |
The most common failure modes map directly to missing subsystems. Agents that forget context between sessions are missing State. Agents that redo completed work are missing State. Agents that declare done before tests pass are missing Feedback.
The most dangerous failure mode in agentic workflows: the agent announces “done” while tests are still failing, types are broken, or the build doesn’t compile. This is not a model quality issue; it is a harness design issue. Without an enforced verification step, the agent relies on code inspection rather than actual execution, and its confidence is uncalibrated.
The fix is to make verification non-optional. Add a three-layer check before the agent can declare completion:
# Layer 1: Static analysis
npm run lint && npm run typecheck
# Layer 2: Unit and integration tests
npm test
# Layer 3: End-to-end smoke test
npm run e2e
Encode this as a hard rule in CLAUDE.md:
## Definition of Done
A feature is NOT done until all three layers pass:
1. `npm run lint && npm run typecheck`: clean
2. `npm test`: all tests pass
3. `npm run e2e`: smoke test passes
Do NOT commit or report completion before running all three.
The third layer matters more than most teams expect. Unit tests pass when components work in isolation. End-to-end tests catch interface mismatches, state propagation errors, and lifecycle issues that unit tests structurally cannot detect. Agents that know E2E verification is enforced also tend to write better integration code, because they know it will be tested.
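To make the gate harder to skip, the layers can be wrapped in one script that stops at the first failure and records evidence for each step. This is a sketch under assumptions: `run_gate` is a hypothetical helper, `NPM_CHECKS` mirrors the npm scripts named above, and the demo uses stand-in commands so it runs without a Node project.

```python
import subprocess
import sys


def run_gate(checks: list[tuple[str, list[str]]]) -> tuple[bool, list[str]]:
    """Run checks in order, stop at the first failure, return evidence lines."""
    evidence = []
    for name, command in checks:
        try:
            code = subprocess.run(command, capture_output=True, check=False).returncode
        except FileNotFoundError:
            code = 127  # the command is not installed
        if code != 0:
            evidence.append(f"{name}: FAILED (exit {code})")
            return False, evidence
        evidence.append(f"{name}: pass")
    return True, evidence


# The three layers, assuming the npm scripts from the Definition of Done exist.
NPM_CHECKS = [
    ("lint", ["npm", "run", "lint"]),
    ("typecheck", ["npm", "run", "typecheck"]),
    ("unit and integration tests", ["npm", "test"]),
    ("e2e smoke test", ["npm", "run", "e2e"]),
]

# Stand-in checks that only demonstrate the stop-at-first-failure behavior.
demo_checks = [
    ("passing step", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("failing step", [sys.executable, "-c", "raise SystemExit(3)"]),
    ("never reached", [sys.executable, "-c", "raise SystemExit(0)"]),
]
ok, evidence = run_gate(demo_checks)
```

In a project, `run_gate(NPM_CHECKS)` would replace the demo call, and the returned evidence lines can be pasted into whatever record the agent keeps of what was verified.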
When multiple features are in progress simultaneously, verification becomes ambiguous (which feature broke the tests?), progress tracking becomes noisy, and context fills faster with no clear completion signal. The agent distributes attention across the full task list instead of closing one thing.
Enforce WIP=1 in your feature list: only one feature can be in active state at any time. The agent picks one, finishes it through all three verification layers, then picks the next. This constraint feels restrictive, but it produces measurably better completion rates.
A reliable session follows this sequence every time, not just at startup:
| Step | Action | Subsystem |
|---|---|---|
| 1. READ | Read AGENTS.md and CLAUDE.md | Instructions |
| 2. INIT | Run ./init.sh to verify the environment is healthy | Environment |
| 3. RESUME | Read progress.md to see what happened last session | State |
| 4. SELECT | Pick one feature with not_started status from feature_list.json | State |
| 5. EXECUTE | Implement only that feature | n/a |
| 6. VERIFY | Run all three verification layers | Feedback |
| 7. UPDATE | Set feature status to passing, record evidence | State |
| 8. LOG | Update progress.md with what changed and what’s next | State |
| 9. CLEANUP | Remove temp files, leave repo in restartable state | Environment |
| 10. COMMIT | Commit only when verification passes and state is clean | n/a |
Steps 2 (INIT) and 6 (VERIFY) are where most harness failures occur. INIT that silently continues past broken dependencies produces confusing errors for the rest of the session. VERIFY that runs but doesn’t block completion produces false positives that erode trust in the agent’s output.
A plain text task list is insufficient for reliable agent operation: no machine-readable state, no evidence field, no dependency ordering. feature_list.json adds structure that both the agent and your tooling can read.
Each feature needs three things: a description of the expected behavior, the verification command that proves it works, and a status field the agent updates throughout the session.
{
  "features": [
    {
      "id": "feat-001",
      "name": "Document Import",
      "description": "User can import PDF and TXT files from the local filesystem",
      "dependencies": [],
      "status": "passing",
      "evidence": "lint: clean; typecheck: clean; tests: pass; e2e: pass"
    },
    {
      "id": "feat-002",
      "name": "Document Chunking",
      "description": "Imported documents split into ~500-char chunks with position metadata",
      "dependencies": ["feat-001"],
      "status": "active",
      "evidence": ""
    },
    {
      "id": "feat-003",
      "name": "Search Index",
      "description": "Full-text search across all imported documents",
      "dependencies": ["feat-002"],
      "status": "not_started",
      "evidence": ""
    }
  ]
}
Status values follow a one-way flow: not_started → active → passing (or blocked if a dependency is unresolvable). The evidence field is the highest-signal part of the schema: it records what verification actually ran, not just that the code was written. An empty evidence field on a passing feature is a red flag.
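Because the file is machine-readable, these rules can be checked by tooling rather than by trust. The sketch below assumes the field names used in this section (`id`, `status`, `dependencies`, `evidence`); `validate` and `next_feature` are hypothetical helpers, not part of any existing tool.

```python
def validate(features: list[dict]) -> list[str]:
    """Flag WIP violations and passing features that carry no evidence."""
    problems = []
    active = [f["id"] for f in features if f["status"] == "active"]
    if len(active) > 1:
        problems.append(f"WIP limit exceeded: {', '.join(active)} are all active")
    for feature in features:
        if feature["status"] == "passing" and not feature.get("evidence", "").strip():
            problems.append(f"{feature['id']} is passing but has no evidence")
    return problems


def next_feature(features: list[dict]) -> "str | None":
    """Pick the first not_started feature whose dependencies are all passing."""
    status = {f["id"]: f["status"] for f in features}
    if "active" in status.values():
        return None  # finish the active feature first (WIP=1)
    for feature in features:
        ready = all(status.get(dep) == "passing" for dep in feature.get("dependencies", []))
        if feature["status"] == "not_started" and ready:
            return feature["id"]
    return None


# Hypothetical in-memory copy of the "features" array.
features = [
    {"id": "feat-001", "status": "passing", "evidence": "npm test: pass", "dependencies": []},
    {"id": "feat-002", "status": "not_started", "evidence": "", "dependencies": ["feat-001"]},
    {"id": "feat-003", "status": "not_started", "evidence": "", "dependencies": ["feat-002"]},
]
problems = validate(features)
selected = next_feature(features)
```

Running `validate` in CI or at session start turns the empty-evidence red flag into a hard failure, and `next_feature` keeps selection consistent with dependency order.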
Every session starts from an unknown environment state. Dependencies may have changed, build artifacts may be stale, or types may be broken from a previous incomplete session. init.sh establishes a known-good baseline before any work begins.
#!/bin/bash
set -e  # Fail fast on any error
echo "=== Initialization ==="
npm install
npm run build
npm run typecheck
npm test
echo "=== Environment ready ==="
echo "Next: read feature_list.json and pick one not_started feature"
set -e is non-negotiable. If install fails, the script stops. An agent that proceeds past a broken environment produces confusing errors for the rest of the session, and the root cause becomes difficult to isolate. Keep the script idempotent: calling it five times should produce the same result as calling it once.
Context windows are finite. Every session that ends without a handoff note forces the next session to reconstruct context from scratch: reading git log, grepping for recent changes, inferring what was in progress. This reconstruction is expensive and imprecise, and it’s where subtle errors get introduced.
progress.md eliminates the reconstruction cost. It’s a short, structured note written at the end of every session, read at the start of the next.
# Session Progress
## Last Updated
2026-05-04 — Session 7
## Active Feature
feat-002: Document Chunking
## Done This Session
- [x] Implemented chunk() function in src/services/chunker.ts
- [x] Added position metadata (start_char, end_char, chunk_index)
- [x] Unit tests pass (8/8)
## In Progress
- [ ] Chunker integration with DocumentService
  - Status: function exists, wiring not complete
  - Blocker: none
## Next Steps
1. Wire chunker into DocumentService.import()
2. Add integration test covering full import-to-chunk flow
3. Update feat-002 status to passing once integration test passes
## Evidence
- lint: clean
- typecheck: clean
- unit tests: 8/8 pass
- integration tests: not yet (feat-002 not complete)
## Notes for Next Session
chunk() is in src/services/chunker.ts:42. DocumentService expects a
ChunkResult[] type (defined in src/types/documents.ts:18). The wiring
point is DocumentService.import() at line 67.
The “Notes for Next Session” section is the highest-ROI part: concrete file paths, line numbers, and specific wiring points that save 5-10 minutes of orientation at session start. Treat it as a message to a colleague who knows the codebase but has no memory of what happened today.
The most common failure pattern with instruction files: they start small and accumulate. Every team adds rules, guidelines, conventions, and exceptions. After three months the file is 800 lines. The agent reads all 800 lines every session, consuming context budget before any work starts. Rules that appear 600 lines in are effectively invisible. The file cannot be linted. Contradictions accumulate silently.
The failure mode is structural, not a content quality problem. A long AGENTS.md will degrade regardless of how carefully each rule is written.
The OpenAI Codex team’s approach: keep AGENTS.md to approximately 100 lines and make it a map, not a manual. The file tells the agent where to look, not everything it needs to know.
AGENTS.md
## Architecture
See docs/DESIGN.md for system architecture.
See docs/design-docs/core-beliefs.md for foundational decisions.
Taste invariants (enforced by linters): docs/RELIABILITY.md, docs/SECURITY.md
Frontend conventions: docs/FRONTEND.md
## External Libraries
LLM-ready docs for external dependencies: docs/references/
Example: docs/references/nixpacks-llms.txt
## Verification
Before marking done: run `make verify` (lint + typecheck + tests + e2e).
Definition of Done: all layers pass, no skips.
The docs/ hierarchy does the heavy lifting. The agent reads only what it needs for the current task: the product spec for the feature it is implementing, the exec plan for the task it is executing, the reliability doc when touching infrastructure. Progressive disclosure through the file system.
CI enforcement: the knowledge base must be maintained like code. Linters check that docs/ references in AGENTS.md resolve, that exec plans in active/ are not stale, and that QUALITY_SCORE.md reflects the last cleanup run. A broken link in AGENTS.md is a build failure, not a documentation oversight.
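The first of those linters, checking that docs/ references resolve, could look roughly like the sketch below. It is not the Codex team's implementation: the regular expression, the function names, and the 100-line cap applied as a hard limit are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches repo-relative paths such as docs/DESIGN.md or docs/references/
DOC_REF = re.compile(r"(?<![\w/])docs/[\w./-]*[\w/]")


def extract_doc_refs(agents_md: str) -> list[str]:
    """Unique docs/ paths mentioned in AGENTS.md, in order of appearance."""
    seen: list[str] = []
    for ref in DOC_REF.findall(agents_md):
        if ref not in seen:
            seen.append(ref)
    return seen


def lint_agents_md(agents_md: str, repo_root: Path, max_lines: int = 100) -> list[str]:
    """Return one error per problem; an empty list means the file passes."""
    errors = []
    line_count = len(agents_md.splitlines())
    if line_count > max_lines:
        errors.append(f"AGENTS.md has {line_count} lines (limit {max_lines})")
    for ref in extract_doc_refs(agents_md):
        if not (repo_root / ref).exists():
            errors.append(f"AGENTS.md references missing path: {ref}")
    return errors


sample = (
    "See docs/DESIGN.md for system architecture.\n"
    "Taste invariants: docs/RELIABILITY.md, docs/SECURITY.md\n"
    "LLM-ready docs for external dependencies: docs/references/\n"
)
refs = extract_doc_refs(sample)
```

A CI step would read the real AGENTS.md, call `lint_agents_md` with the repository root, and fail the build when the returned list is non-empty.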
Agents have one knowledge boundary: the repository. Everything that exists outside the repository (Slack threads, video calls, Google Docs, tacit understanding between teammates) does not exist for the agent. This is not a limitation to work around. It is a design constraint that shapes how a team must operate.
A decision made in a Slack thread and not encoded as a markdown file in the repo will be violated by the agent on the next task. Not because the agent is careless, but because it genuinely does not know. The same is true of conventions discussed in a code review but not written into a linter rule or doc. The same is true of architecture decisions made six months ago that “everyone on the team knows.”
The practical test: “If a new engineer joined the team today with no onboarding, would they know this from reading the repo?” If not, the agent doesn’t know it either.
Three categories require particular attention:
Decisions: architectural choices, rejected alternatives, tradeoffs accepted. These belong in docs/design-docs/ as design records, not in someone’s memory. A design record does not need to be long. A short document that states the decision, the alternatives considered, and the reason for the choice is sufficient and survives every team change.
Conventions: naming rules, structural patterns, file organization. These belong in linter rules (so they are enforced, not just documented) or in targeted docs that AGENTS.md links to. A convention that lives only in a README section will drift.
Plans: what is being built, why, and in what sequence. These belong in exec plans (see §9.25.3). A plan that exists only in a project management tool the agent cannot read is not a plan for the agent.
The corollary: when a human makes a decision during code review or changes direction mid-task, that decision must be written into the repo before the next agent session. Review comment responses that change architecture are not repo content. Writing them into a design doc or updating an exec plan is the required step, not optional cleanup.
A structured docs/ hierarchy turns the knowledge boundary from a liability into an asset. When all relevant context is in the repo and consistently organized, the agent can navigate to exactly what it needs for any task.
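docs/
├── design-docs/
│ └── core-beliefs.md
├── exec-plans/
│ ├── active/
│ ├── completed/
│ └── tech-debt-tracker.md
├── generated/
│ └── db-schema.md # Auto-generated from actual schema (never edited by hand)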
├── product-specs/
│ └── index.md # One spec per feature
├── references/
│ └── nixpacks-llms.txt # LLM-ready docs for each external library
├── DESIGN.md # System architecture overview
├── FRONTEND.md # Frontend conventions
├── PLANS.md # Current planning status
├── PRODUCT_SENSE.md # Product judgment and principles
├── QUALITY_SCORE.md # Quality scores per domain/layer
├── RELIABILITY.md # Reliability requirements and taste invariants
└── SECURITY.md # Security requirements and patterns
Exec plans as first-class artifacts: for any non-trivial task, the agent creates a plan document before writing code. Simple changes get ephemeral plans: a short markdown file with the approach and expected outcome, created at the start of the task and moved to completed/ when done. Complex tasks get full exec plans with progress logs, decision records, and explicit notes on alternatives rejected. The separation of active/ and completed/ keeps the agent’s attention on current work while preserving a searchable history of past decisions. The tech-debt-tracker.md is the backlog for known quality issues, populated by the background cleanup agents described in §9.25.5, addressed incrementally rather than in a disruptive periodic cleanup.
generated/ directory: certain documentation must track code exactly. Database schemas, API surface areas, generated type definitions. These go in generated/ and are produced by automated scripts, not written by hand. The doc-gardening agent (described below) enforces the invariant that generated/ files match the actual runtime state.
The doc-gardening agent: a recurring background agent that reads docs/ and compares documentation claims against actual code behavior. When it detects drift (a documented API that has changed signature, or a design record that contradicts current implementation), it opens a PR to fix the documentation. This treats the knowledge base as code: it has correctness requirements, and those requirements are enforced automatically. Without this agent, the knowledge base degrades as the codebase evolves. With it, the degradation is caught and corrected continuously rather than discovered when an agent acts on stale information.
references/ for external libraries: each significant external dependency gets a dedicated file in references/ (the library’s official llms.txt if available, or a curated summary of the relevant API surface). The agent reads the relevant reference file when implementing against that library rather than relying on its training data, which may be outdated or incomplete.
The verification stack in §9.25 (lint, typecheck, tests, e2e) covers correctness. A separate layer covers performance and runtime behavior: observability. Without it, the agent cannot answer whether a change meets performance requirements and can only inspect code and guess.
The OpenAI Codex team gave each git worktree its own ephemeral, isolated observability stack. The stack is created at task start and torn down after completion; it is never committed to the repository.
Data pipeline: app logs/metrics/traces → Vector (collector/router)
↓
Storage layer: VictoriaLogs (logs) VictoriaMetrics (metrics) trace store (traces)
↓
Query APIs: LogQL PromQL TraceQL
↓
Agent access: curl / CLI tools → structured data in agent context
The stack enables metric-based prompts that were previously impossible. Instead of “implement service startup,” the prompt becomes “ensure service startup completes in under 800ms.” Instead of “optimize the checkout flow,” it becomes “no UI journey through checkout should exceed 2 seconds.” The agent implements a change, restarts the application, runs the workload, queries the observability stack, reads the result, and iterates. The feedback loop is closed without human measurement.
This approach requires infrastructure that not every team has available. The pattern is worth knowing because it illustrates the direction: as harness investment increases, the agent can take on work that was previously impossible to delegate because verification required human judgment on runtime behavior. Teams without this stack can approximate it by making performance requirements explicit (run this benchmark before and after, compare output) and scripting the measurement, even if the infrastructure is not as complete.
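For teams without the full stack, the scripted approximation might be as small as the following sketch. The 800 ms budget echoes the startup example above; the function names and the 10% regression tolerance are assumptions, and a real workload would replace the placeholder.

```python
import statistics
import time
from typing import Callable


def measure_ms(workload: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock time of `workload` in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def check_budget(
    before_ms: float, after_ms: float, budget_ms: float, max_regression: float = 0.10
) -> list[str]:
    """Compare a before/after pair against an absolute budget and a relative limit."""
    problems = []
    if after_ms > budget_ms:
        problems.append(f"over budget: {after_ms:.0f} ms > {budget_ms:.0f} ms")
    if before_ms > 0 and (after_ms - before_ms) / before_ms > max_regression:
        problems.append(f"regression: {before_ms:.0f} ms -> {after_ms:.0f} ms")
    return problems


# Placeholder workload; in practice this would start the service or run the journey.
baseline_ms = measure_ms(lambda: sum(range(10_000)))
```

The agent runs `measure_ms` before and after its change and treats a non-empty result from `check_budget` the same way it treats a failing test.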
At agent throughput levels, the natural tendency toward entropy accelerates. Agents replicate patterns they observe in the codebase. If an imperfect pattern exists anywhere, it will be reproduced everywhere within a few sessions. The compounding is faster than with human developers because the agent works faster and is more likely to generalize from examples. Architecture must be enforced, not documented.
Layered domain architecture
The OpenAI Codex team enforced a fixed layer order within each business domain:
Types → Config → Repo → Service → Runtime → UI
Each layer may depend only on layers below it. Cross-cutting concerns (auth, connectors, telemetry, feature flags) are available only through explicit Provider interfaces, not by importing directly. Violations are build failures enforced by custom linters and structural tests.
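A structural test for the layer rule can be sketched in a few lines. This is an illustration, not the Codex team's linter: it assumes "below" means earlier in the chain (Types at the bottom, UI at the top) and that an earlier step has already reduced the code to a map of layer to imported layers.

```python
LAYER_ORDER = ["types", "config", "repo", "service", "runtime", "ui"]


def layer_violations(imports: dict[str, set[str]]) -> list[str]:
    """Report every import that points from a lower layer to a higher one."""
    rank = {layer: index for index, layer in enumerate(LAYER_ORDER)}
    violations = []
    for importer in sorted(imports, key=rank.__getitem__):
        for imported in sorted(imports[importer], key=rank.__getitem__):
            if rank[imported] > rank[importer]:
                violations.append(
                    f"{importer} must not import {imported} "
                    f"({imported} sits above {importer} in the layer order)"
                )
    return violations


# Hypothetical import map for one business domain: layer -> layers it imports.
observed = {
    "service": {"repo", "types"},
    "repo": {"ui"},
    "ui": {"service", "runtime"},
}
violations = layer_violations(observed)
```

Wired into the build, a non-empty `violations` list fails the run, which is what turns the layer order from documentation into enforcement.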
This is the kind of architecture typically deferred in early-stage products with the reasoning “we’ll add this structure when we have more engineers.” At agent throughput levels, the reasoning inverts: without this structure, agents will introduce cross-layer dependencies within days, and the resulting tangle is difficult to reverse. Layered architecture becomes a prerequisite rather than a future optimization.
Taste invariants and custom linters
Taste invariants are opinionated rules that go beyond style. Examples: “prefer shared utility packages over ad-hoc helpers,” “validate at boundaries or use typed SDKs,” “use structured logging in all service-layer code,” “schemas and types follow the naming convention X.” These rules are not written as guidelines; they are encoded as custom linters.
The linter error messages are written specifically for agent consumption, not for human developers. A conventional linter message says what is wrong. A taste-invariant linter message says what is wrong and what to do instead, written in a form the agent can act on:
TASTE-003: Untyped API response found in services/payment.ts:47
Prefer typed SDK responses. Use PaymentClient from @internal/payment-sdk
instead of direct fetch(). See docs/RELIABILITY.md#api-boundaries for the pattern.
The error message injects the fix instruction directly into the agent’s context window. Once encoded, the rule applies instantly to every file in the codebase, including files the agent has never seen. This is the amplifier effect: one linter rule enforces consistent behavior across the entire project with zero additional per-file effort.
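A toy version of such a linter shows the shape of the idea. It is a regex-based sketch, whereas a production rule would more likely work on the syntax tree; the rule ID and message text are taken from the example above, and everything else is assumed.

```python
import re

# Naive detector for direct fetch() calls; it will also match inside comments.
DIRECT_FETCH = re.compile(r"\bfetch\s*\(")


def lint_untyped_fetch(path: str, source: str) -> list[str]:
    """Emit one agent-oriented message per line that calls fetch() directly."""
    messages = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if DIRECT_FETCH.search(line):
            messages.append(
                f"TASTE-003: Untyped API response found in {path}:{line_number}\n"
                "Prefer typed SDK responses. Use PaymentClient from @internal/payment-sdk\n"
                "instead of direct fetch(). See docs/RELIABILITY.md#api-boundaries for the pattern."
            )
    return messages


# Hypothetical TypeScript source to lint.
sample_source = (
    "export async function charge(id: string) {\n"
    "  const res = await fetch(`/api/payments/${id}`);\n"
    "  return res.json();\n"
    "}\n"
)
messages = lint_untyped_fetch("services/payment.ts", sample_source)
```

Each message carries both the location and the remedy, so the agent that triggers it has what it needs to fix the line without further lookup.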
The custom linters were themselves generated by the Codex agents, not written by hand. A human describes the rule in plain language; the agent generates the linter implementation. This compounds the amplifier: taste invariants are cheap to create, so more of them get created, so more of the codebase behavior is enforced rather than documented.
Anti-entropy via background cleanup agents
The problem with architectural drift: it is incremental and invisible until it compounds. An agent replicates a slightly imperfect pattern. Another agent extends it. A third adds a dependency that should not exist. Three months in, the codebase has structural problems that are expensive to reverse, and no single change introduced them.
The OpenAI Codex team’s approach was to treat anti-entropy like garbage collection: continuous incremental cleanup rather than periodic disruptive rewrites. A team of background agents runs on a recurring schedule:
Scan for deviations from taste principles and architectural layer rules
Update QUALITY_SCORE.md with current scores per domain and layer
Open targeted refactor PRs for detected violations
The PRs are scoped to be reviewable in under a minute and auto-merged when they pass verification. Each addresses one deviation, not a broad refactor. The cumulative effect is that tech debt is paid down continuously rather than in a disruptive periodic cleanup. The tech-debt-tracker.md in docs/exec-plans/ records known issues, and the background agents work through them incrementally.
QUALITY_SCORE.md tracks health over time per architectural layer and business domain. A quality score that is declining is a signal before the decline becomes a problem.
High-throughput merge philosophy
At 3.5 PRs per engineer per day, conventional merge gates become the bottleneck. A PR that waits two hours for a flaky CI run is a two-hour delay in a workflow that produces multiple PRs per hour. The OpenAI team’s approach: minimal merge blocks, fixes applied via follow-up runs rather than blocking merges.
The reasoning: at genuine agent throughput levels, a broken test is fixed faster by a follow-up agent run than by blocking the current PR. “Fixes are cheap; waiting is expensive” inverts the usual risk calculus that is correct at human development throughput.
This philosophy only applies when throughput is genuinely high. At normal development throughput, blocking merges on failing tests is correct: the cost of a merge block is low, and the cost of merging broken code is high. The inversion happens only when the agent can produce a fix faster than a human can review and unblock the PR. Applying this philosophy prematurely, without the throughput to support it, produces a codebase with accumulated failures rather than one with efficient flow.
Sources: Session lifecycle, Verification Gap, WIP=1, feature_list.json, init.sh, and progress.md patterns from Learn Harness Engineering (HumanLayer, 2026). AGENTS.md-as-TOC, knowledge boundary principle, exec plans, docs/ structure, ephemeral observability stack, taste invariants, doc-gardening agent, anti-entropy model, layered domain architecture, and high-throughput merge philosophy from “Harness engineering: exploiting Codex in the agent era,” Ryan Lopopolo, OpenAI Engineering blog, Feb 11, 2026 (https://openai.com/index/harness-engineering/).