When /compact goes wrong: Compaction fires when the model has the most accumulated context, meaning it is also at its most distracted point. If the model cannot predict where the work is heading (e.g., auto-compact fires mid-debugging and your next message is “now fix that warning in bar.ts”), it may drop future-relevant info from the summary. Mitigate by compacting proactively and with context: /compact focus on the auth refactor, drop the test debugging guides the summary toward what matters next. (Source: Anthropic internal guidance)
Option 2: Clear (/clear)

- Starts fresh
- Loses all context
- Use when changing topics
“One Task, One Chat” — mixing unrelated topics across turns degrades model accuracy by ~39%. Context accumulates noise (“context rot”) that distorts judgment even when total token usage stays low. Use /clear aggressively between distinct tasks, not just when the context bar turns red.
Option 3: Summarize from here (v2.1.32+)

- Use /rewind (or Esc + Esc) to open the checkpoint list
- Select a checkpoint and choose “Summarize from here”
- Claude summarizes everything from that point forward, keeping earlier context intact
- Frees space while keeping critical context
- More precise than full /compact
Option 4: Targeted Approach

- Be specific in queries
- Avoid “read the entire file”
- Use symbol references: “read the calculateTotal function”
When approaching the red zone (75%+), /compact alone may not be enough. You need to actively decide what information to preserve before compacting.
Priority: Keep

| Keep | Why |
| --- | --- |
| CLAUDE.md content | Core instructions must persist |
| Files being actively edited | Current work context |
| Tests for the current component | Validation context |
| Critical decisions made | Architectural choices |
| Error messages being debugged | Problem context |
Priority: Evacuate

| Evacuate | Why |
| --- | --- |
| Files read but no longer relevant | One-time lookups |
| Debug output from resolved issues | Historical clutter |
| Long conversation history | Summarized by /compact |
| Files from completed tasks | No longer needed |
| Large config files | Can be re-read if needed |
Pre-Compact Checklist:

1. Document critical decisions in CLAUDE.md or a session note
2. Commit pending changes to git (creates restore point)
3. Note the current task explicitly (“We’re implementing X”)
4. Run /compact to summarize and free space
Pro tip: If you know you’ll need specific information post-compact, tell Claude explicitly: “Before we compact, remember that we decided to use Strategy A for authentication because of X.” Claude will include this in the summary.
- Explicitly saved with write_memory("key", "value")
- Survives across sessions
- Ideal for: architectural decisions, API patterns, coding conventions
Pattern: End-of-Session Save
# Before ending a productive session:
"Save our authentication decision to memory:
- Chose JWT over sessions for scalability
- Token expiry: 15min access, 7d refresh
- Store refresh tokens in httpOnly cookies"
# Claude calls: write_memory("auth_decisions", "...")
# Next session:
"What did we decide about authentication?"
# Claude calls: read_memory("auth_decisions")
When to use which:

- Session memory: Active problem-solving, debugging, exploration
- Auto-memory: Decisions and context you want Claude to rediscover next session without manual effort (v2.1.59+)
- Persistent memory (Serena): Structured key-value store for architectural decisions across many projects
- CLAUDE.md: Team conventions, project structure (versioned with git)
Auto-compact and PostToolUse memory capture — a conflict to know about:
Claude Code auto-compacts the conversation when the remaining context drops below a fixed buffer threshold (roughly the last 6-7% of the context window, or about 13K tokens from the effective limit). In practice, this triggers somewhere in the 90-95% usage range depending on the model’s context window and reserved output tokens. Before full compaction runs, Claude Code also applies micro-compaction — a lighter pass that selectively compresses older tool results (file reads, bash outputs, search results) to free space incrementally without summarizing the whole conversation. If auto-compact fails (e.g., due to a rate limit), it retries up to 3 consecutive times before giving up for that session.
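That trigger arithmetic can be sketched in a few lines. The 32K reserved-output figure below is an assumed example, not a documented value; the ~13K buffer comes from the description above.

```python
# Back-of-envelope estimate of where auto-compact fires: it triggers when the
# remaining context drops below a ~13K-token buffer, measured against the
# effective limit (context window minus reserved output tokens).
def autocompact_trigger_pct(context_window: int, reserved_output: int,
                            buffer: int = 13_000) -> float:
    """Usage percentage of the effective input budget at which auto-compact fires."""
    effective = context_window - reserved_output
    return 100 * (effective - buffer) / effective

# 200K window, assumed 32K reserved for output:
pct = autocompact_trigger_pct(200_000, 32_000)
# lands in the 90-95% range described above
```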
If you use a hook-based memory capture tool (like claude-mem) that saves session history via PostToolUse, auto-compact can fire and discard conversation history before the save pipeline has a chance to capture it.
Two ways to handle this:
// Option 1: disable auto-compact in your project settings.json
// (you manage compaction manually via /compact)
{
"autoCompactEnabled": false
}
# Option 2: keep auto-compact on, but set your tool's save threshold
# to trigger well below 80% (e.g., at 60% context usage)
# — check your memory plugin's cooldowns/threshold config
Option 1 gives full control but requires discipline. Option 2 is safer if you forget to compact manually. The general guide advice (use /compact proactively at 75%) still applies — auto-compact disabled just means you own the timing.
Research shows LLM performance degrades significantly with accumulated context:
- 20-30% performance gap between focused and polluted prompts (Chroma, 2025)
- Degradation starts at ~16K tokens for older Claude models (Chroma, 2025); Anthropic reports noticeable degradation around 300-400K tokens on the 1M context window (task-dependent, not a fixed threshold)
- Failed attempts, error traces, and iteration history dilute attention
Instead of managing context within a session, you can restart with a fresh session per task while persisting state externally.
Naming note: “Ralph Loop” is used in two distinct ways in the community. Geoffrey Huntley’s original pattern (above) is about context rotation — spawning fresh sessions to avoid context rot. A separate usage, popularized by Addy Osmani and others in 2026, applies the same term to atomic task iteration in multi-agent teams: pick task → implement → validate → commit → reset context → repeat. Both share the same core mechanic (stateless loop with external state), but the scope differs. When the term appears without attribution, clarify which variant is meant.
State persists via:
TASK.md — Current task definition with acceptance criteria
A lightweight alternative for interactive sessions (no loop required): after each user correction, Claude updates tasks/lessons.md with the rule to avoid the same mistake. Reviewed at the start of each new session.
tasks/
├── todo.md # Current plan (checkable items)
└── lessons.md # Rules accumulated from corrections
The difference from PROGRESS.md: lessons.md captures behavioral rules (“always diff before marking done”, “never mock without asking”) rather than task state. It compounds over time — the mistake rate drops as the ruleset grows.
Good fit:

- Tasks with clear success criteria (tests pass, build succeeds)

Poor fit:

- Interactive exploration
- Design without clear spec
- Tasks with slow/ambiguous feedback loops
Variant: Session-per-Concern Pipeline

Instead of looping the same task, dedicate a fresh session to each quality dimension:

1. Plan session — Architecture, scope, acceptance criteria
2. Test session — Write unit, integration, and E2E tests first (TDD)
3. Implement session — Code until all linters and tests pass
4. Review sessions — Separate sessions for security audit, performance, code review
5. Repeat — Iterate with scope adjustments as needed
This combines Fresh Context (clean 200K per phase) with OpusPlan (Opus for review/strategy sessions, Sonnet for implementation). Each session generates progress artifacts that feed the next.
The default model depends on your subscription: Max/Team Premium subscribers get Opus 4.7 by default, while Pro/Team Standard subscribers get Sonnet 4.6. If Opus usage hits the plan threshold, it auto-falls back to Sonnet.
Model lineup (April 2026): Claude Opus 4.7 is the standard production Opus model (claude-opus-4-7). Claude Mythos Preview is more capable but remains in limited release. Opus 4.7 is the recommended upgrade path from Opus 4.6.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Notes |
| --- | --- | --- | --- | --- |
| Sonnet 4.6 | $3.00 | $15.00 | 200K tokens | Default (Pro/Team Standard) |
| Sonnet 4.5 | $3.00 | $15.00 | 200K tokens | Legacy |
| Opus 4.7 | $5.00 | $25.00 | 200K tokens | Released April 2026; default for Max/Team Premium |
| Opus 4.7 (1M context) | $5.00 | $25.00 | 1M tokens | GA for Max/Team/Enterprise; API requires tier 4 |
| Opus 4.6 (standard) | $5.00 | $25.00 | 200K tokens | Previous generation |
| Opus 4.6 (1M context) | $5.00 | $25.00 | 1M tokens | Previous generation |
| Opus 4.6 (fast mode) | $30.00 | $150.00 | 200K tokens | 2.5× faster, 6× price |
| Haiku 4.5 | $0.80 | $4.00 | 200K tokens | Budget option |
Opus 4.7 tokenizer: A new tokenizer means the same input can map to roughly 1.0–1.35× more tokens depending on content type. At higher effort levels, Opus 4.7 also produces more output tokens (more reasoning). Measure your real traffic when migrating from Opus 4.6; use the effort parameter to control spend.
Reality check: A typical 1-hour session costs $0.10 - $0.50 depending on usage patterns.
Model retirement (April 2026): claude-3-haiku-20240307 (Claude 3 Haiku) was retired on April 20, 2026. If your CLAUDE.md, agent definitions, or scripts still hardcode this model ID, migrate to claude-haiku-4-5-20251001 (Haiku 4.5) immediately. Source: platform.claude.com/docs/model-deprecations
The 1M context window (GA for Max/Team/Enterprise plans; API tier 4 still required for direct API use) is a significant capability jump — but community feedback consistently frames it as a niche premium tool, not a default.
Retrieval accuracy at scale (MRCR v2 8-needle 1M variant)
The benchmark is the “8-needle 1M variant” — finding 8 specific facts in a 1M-token document. Opus 4.6 drops from 93% to 76% when scaling from 256K to 1M; Sonnet 4.5 collapses to 18.5%. Community validation: a developer loaded ~733K tokens (4 Harry Potter books) and Opus 4.6 retrieved 49/50 documented spells in a single prompt (HN, Feb 2026). Sonnet 4.6 MRCR not yet published, but community reports suggest it “struggles with following specific instructions and retrieving precise information” at full 1M context.
Cost per session (approximate)
Above 200K input tokens on direct API, all tokens in the request are charged at premium rates — not just the excess. Note: on Max/Team/Enterprise Claude Code plans, Opus 4.6 1M is the default at standard rates (no premium) as of v2.1.75 (March 2026).
| Session type | ~Tokens in | ~Tokens out | Sonnet 4.6 | Opus 4.6 |
| --- | --- | --- | --- | --- |
| Bug fix / PR review (≤200K) | 50K | 5K | ~$0.23 | ~$0.38 |
| Module refactoring (≤200K) | 150K | 20K | ~$0.75 | ~$1.25 |
| Full service analysis (>200K, 1M context) | 500K | 50K | ~$4.13 | ~$6.88 |
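The table rows follow directly from the listed per-MTok prices. A small sketch of that arithmetic, including the whole-request premium rule for requests above 200K input tokens:

```python
# Reproducing the session-cost table from the listed API prices ($/MTok),
# including the long-context rule: above 200K input tokens, every token in the
# request bills at the doubled premium rate, not just the excess.
PRICES = {
    "sonnet-4.6": {"in": 3.00, "out": 15.00, "in_prem": 6.00, "out_prem": 22.50},
    "opus-4.6":   {"in": 5.00, "out": 25.00, "in_prem": 10.00, "out_prem": 37.50},
}

def session_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICES[model]
    if tokens_in > 200_000:  # premium applies to the whole request
        return (tokens_in * p["in_prem"] + tokens_out * p["out_prem"]) / 1e6
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1e6

print(round(session_cost("sonnet-4.6", 50_000, 5_000), 2))   # bug-fix row
print(round(session_cost("opus-4.6", 500_000, 50_000), 2))   # full-analysis row
```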
For comparison: Gemini 1.5 Pro offers a 2M context window at $3.50/$10.50/MTok — significantly cheaper for pure long-context RAG. Community advice: use Gemini for large-document RAG, Claude for reasoning quality and agentic workflows.
When to use which
| Scenario | Recommendation |
| --- | --- |
| Bug fix, PR review, daily coding | Sonnet 4.6 @ 200K — fast and cheap |
| Full-repo audit, entire codebase load | Opus 4.7 @ 1M — worth the cost for precision |
| Cross-module refactoring | Sonnet 4.6 @ 1M — but weigh cost vs. chunking + RAG |
| Architecture analysis, Agent Teams | Opus 4.7 @ 1M — strongest retrieval at scale |
| Large-document RAG (PDFs, legal, books) | Consider Gemini 1.5 Pro — cheaper at this scale |
Key facts

- Opus 4.7 max output: 128K tokens (same as Opus 4.6); Sonnet 4.6 max output: 64K tokens
- 1M context ≈ 30,000 lines of code / 750,000 words
- 1M context is GA for Max/Team/Enterprise Claude Code plans (v2.1.75, March 2026) — API direct use still requires tier 4 or custom rate limits
- API direct use above 200K input tokens: Sonnet 4.6 doubles to $6/$22.50/MTok; Opus 4.6 doubles to $10/$37.50/MTok (standard rate applies for Claude Code Max/Team/Enterprise plans)
- If input stays ≤200K, standard pricing applies even with the beta flag enabled
- Practical workaround: check context at ~70% and open a new session rather than hitting compaction (HN pattern)
- Community consensus: 200K + RAG is the default; 1M Opus is reserved for cases where loading everything at once is genuinely necessary
Claude Code manages prompt caching without any configuration on your part. Understanding the mechanics helps you make decisions that keep cache hit rates high and costs low.
Cache prefix hierarchy
Every API call Claude Code makes structures content in this fixed order: tools → system → messages. Cache matching always starts from the beginning of this prefix. A stable tool list + stable CLAUDE.md + growing conversation history means the first two layers are almost always cache hits, while only new message turns require fresh computation.
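The three-layer prefix can be sketched as a raw Messages API payload with explicit ephemeral cache breakpoints. Claude Code constructs this for you; the tool definition, model ID, and prompt text below are illustrative placeholders, shown only to make the layering concrete.

```python
# tools -> system -> messages, in the shape the Messages API accepts.
# Cache matching starts at "tools" and stops at the first changed block,
# so the stable early layers stay cached while new turns are recomputed.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [  # layer 1: stable tool definitions
        {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "system": [  # layer 2: stable system prompt (CLAUDE.md etc.)
        {
            "type": "text",
            "text": "<CLAUDE.md contents>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [  # layer 3: the growing conversation
        {"role": "user", "content": "Fix the failing test"},
    ],
}
```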
The 20-block lookback — the long-session trap
Cache matching uses a bounded lookback of approximately 20 blocks. In a long session with many tool calls and exchanges, blocks from early in the conversation fall outside this window and become cache misses. Practical consequence: very long sessions gradually lose cache efficiency at the message layer. The fix is /compact — it compresses the conversation history into a single summary block, resetting the lookback window and restoring high hit rates.
Minimum token thresholds by model
A block must meet a minimum size to be eligible for caching. Blocks smaller than the threshold are never cached, regardless of how stable they are:
| Model family | Minimum tokens |
| --- | --- |
| Claude Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6 | 2,048 |
| Claude Sonnet 4.5, Sonnet 4, Sonnet 3.7, Opus 4.1, Opus 4 | 1,024 |
| Claude Haiku 3.5, Haiku 3 | 2,048 |
Short CLAUDE.md files (under ~1,000 tokens) may not be cached at all on Sonnet models. If cost optimization matters, make sure your system prompt crosses the threshold for your target model.
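A quick sanity check against the thresholds above can be done with a rough character-per-token heuristic. The ~4 characters/token ratio is an assumption, not a measured value; measure real token counts before relying on it.

```python
# Rough check that a CLAUDE.md clears a model's caching floor.
# Assumption: ~4 characters per token (a common heuristic for English/code).
THRESHOLDS = {  # minimum cacheable block size, from the table above
    "opus-4.7": 4096,
    "sonnet-4.6": 2048,
    "sonnet-4.5": 1024,
    "haiku-4.5": 4096,
}

def likely_cacheable(text: str, model: str) -> bool:
    estimated_tokens = len(text) / 4
    return estimated_tokens >= THRESHOLDS[model]

claude_md = "# Conventions\n" * 800   # ~11K chars, ~2,800 estimated tokens
# clears Sonnet 4.6's 2,048-token floor, but not Opus 4.7's 4,096
```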
Tool result size and cache economics
Tool results land in the message history and stay there for the rest of the session. Every subsequent API call re-reads that history — at cache read price (0.1x), but still proportional to size. A git status output of 500 tokens costs 500 × 0.1x to read on every turn that follows. The same output at 50 tokens (filtered by a tool like RTK) costs 50 × 0.1x — 90% less, compounding across every turn in the session. Compact tool outputs are not just faster to process; they make the entire cached prefix cheaper to maintain.
The same logic applies to cache writes: a smaller history prefix means cheaper initial writes (1.25x × fewer tokens).
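The compounding described above is easy to quantify. A minimal sketch, using Sonnet input pricing as the example rate:

```python
# Cumulative cost of carrying one tool result in cached history: it is re-read
# (at the 0.1x cache-read multiplier) on every later turn. Prices in $/MTok.
def carry_cost(result_tokens: int, later_turns: int,
               price_per_mtok: float = 3.00, read_multiplier: float = 0.1) -> float:
    per_turn = result_tokens * price_per_mtok * read_multiplier / 1e6
    return per_turn * later_turns

verbose = carry_cost(500, 40)   # raw `git status`, carried over 40 turns
compact = carry_cost(50, 40)    # filtered to 50 tokens: 90% cheaper
```

The absolute numbers are small per result, but every tool output in a long session pays this tax on every turn, which is why compact outputs compound.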
Monitoring cache performance in your own pipelines
When building agents or pipelines on top of the Anthropic API, the response usage object exposes cache metrics directly:
response = client.messages.create(...)
print(response.usage.cache_creation_input_tokens) # Tokens written to cache this request
print(response.usage.cache_read_input_tokens) # Tokens read from cache (hits)
Calculate your hit rate as cache_read / (cache_read + cache_creation) across requests. A ratio above 0.8 means your prompt structure is working well. Low ratios usually mean content in the stable prefix is changing between requests — check for timestamps, random IDs, or dynamic content embedded in your system prompt.
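That hit-rate calculation can be aggregated across a batch of requests directly from the usage fields. The numbers below are illustrative only:

```python
# Aggregating cache usage fields into a hit rate across several requests.
def cache_hit_rate(usages: list[dict]) -> float:
    read = sum(u["cache_read_input_tokens"] for u in usages)
    created = sum(u["cache_creation_input_tokens"] for u in usages)
    total = read + created
    return read / total if total else 0.0

# Illustrative numbers: the first request writes the prefix, later ones read it.
usages = [
    {"cache_creation_input_tokens": 12_000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 800,    "cache_read_input_tokens": 12_000},
    {"cache_creation_input_tokens": 900,    "cache_read_input_tokens": 12_800},
]
rate = cache_hit_rate(usages)   # ~0.64 here; above 0.8 means a stable prefix
```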
No dedicated monitoring tool exists specifically for Claude Code session cache metrics. Cost tracking via ccusage covers overall spend but does not break out cache hit rates. For cache-specific visibility in custom pipelines, parse the response fields above.
Practical rules

- Keep CLAUDE.md stable between sessions — edits invalidate the system cache one-shot, then it re-warms on the next request
- Run /compact before the conversation gets very long, not after performance degrades
- Avoid dynamic content in stable sections (dates, random values, per-request context)
- Larger CLAUDE.md = more expensive cache write, but also more tokens saved per read — profitable after ~2 hits
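The “profitable after ~2 hits” break-even follows from the multipliers: a cache write costs 1.25× the base input price once, each later hit reads at 0.1×, versus 1.0× per request uncached. A minimal sketch:

```python
# Relative cost of serving the same prefix N times, in units of the base
# input price for that prefix.
def cached_relative_cost(requests: int) -> float:
    return 1.25 + 0.1 * (requests - 1)   # one write, then cache reads

def uncached_relative_cost(requests: int) -> float:
    return 1.0 * requests                # full price every request

# 1 request:  1.25 vs 1.00 -> caching loses
# 2 requests: 1.35 vs 2.00 -> caching already wins
```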
Known cache bugs (v2.1.69+)
Two active bugs silently break caching on v2.1.69+. Apply these workarounds immediately:
- --resume / --continue causes a full cache rebuild (0% hit ratio) on every resume because the session JSONL strips deferred tool records before write. Workaround: avoid --resume until fixed.
- A per-session billing header injects a unique hash as the first system prompt block, causing a cold miss on every session start and subagent call. Workaround: set "CLAUDE_CODE_ATTRIBUTION_HEADER": "false" in ~/.claude/settings.json.
Perspective on costs: If Claude Code saves you meaningful time on a task, the API cost is usually negligible compared to your hourly rate. Don’t over-optimize for token costs at the expense of productivity.
When to optimize:
✅ You’re on a tight budget (student, hobbyist)
✅ High-volume usage (>4 hours/day)
✅ Team usage (5+ developers)
When NOT to optimize:
❌ Your time is more expensive than API costs
❌ You’re spending more time optimizing than the savings
❌ Optimization hurts productivity (being too restrictive)
Note: Anthropic’s plans evolve frequently. Always verify current pricing and limits at claude.com/pricing.
How Subscription Limits Work
Unlike API usage (pay-per-token), subscriptions use a hybrid model that’s deliberately opaque:
| Concept | Description |
| --- | --- |
| 5-hour rolling window | Primary limit; resets when you send your next message after the 5-hour window lapses |
| Weekly aggregate cap | Secondary limit; resets every 7 days. Both limits apply simultaneously |
| Hybrid counting | Advertised as “messages,” but actual capacity is token-based, varying with code complexity, file size, and context |
| Model weighting | Opus consumes 8-10× more quota than Sonnet for equivalent work |
Approximate Token Budgets by Plan (Jan 2026, community-verified)
| Plan | 5-Hour Token Budget | Claude Code prompts/5h | Weekly Sonnet Hours | Weekly Opus Hours | Claude Code Access |
| --- | --- | --- | --- | --- | --- |
| Free | 0 | 0 | 0 | 0 | ❌ None |
| Pro ($20/mo) | ~44,000 tokens | ~10-40 prompts | 40-80 hours | N/A (Sonnet only) | ✅ Limited |
| Max 5x ($100/mo) | ~88,000-220,000 tokens | ~50-200 prompts | 140-280 hours | 15-35 hours | ✅ Full |
| Max 20x ($200/mo) | ~220,000+ tokens | ~200-800 prompts | 240-480 hours | 24-40 hours | ✅ Full |
Warning: These are community-measured estimates. Anthropic does not publish exact token limits, and limits have been reduced without announcement (notably Oct 2025). The 8-10× Opus/Sonnet ratio means Max 20x users get only ~24-40 Opus hours weekly despite paying $200/month. “Prompts/5h” is a rough practical translation of the token budget — actual capacity varies significantly with task complexity, context size, and sub-agent usage. Monthly cap: ~50 active 5-hour windows across all plans.
Why “Hours” Are Misleading
The term “hours of Sonnet 4” refers to elapsed wall-clock time during active processing, not calendar hours. This is not directly convertible to tokens without knowing:
- Tool usage (Bash execution adds ~245 input tokens per call; text editor adds ~700)
- Context re-reads and caching misses
Tier-Specific Strategies
| If you have… | Recommended approach |
| --- | --- |
| Pro plan | Sonnet only; batch sessions, avoid context bloat |
| Limited Opus quota | OpusPlan essential: Opus for planning, Sonnet for execution |
| Max 5x | Sonnet default, Opus only for architecture/complex debugging |
| Max 20x | More Opus freedom, but still monitor weekly usage (24-40h goes fast) |
The Pro User Pattern (validated by community):
1. Opus → Create detailed plan (high-quality thinking)
2. Sonnet/Haiku → Execute the plan (cost-effective implementation)
3. Result: Best reasoning where it matters, lower cost overall
This is exactly what OpusPlan mode does automatically (see Section 2.3).
Monitoring Your Usage
/status   # Shows current session: cost, context %, model
Anthropic provides no in-app real-time usage metrics. Community tools like ccusage help track token consumption across sessions.
For subscription usage history: Check your Anthropic Console or Claude.ai settings.
Historical Note: In October 2025, users reported significant undocumented limit reductions coinciding with Sonnet 4.5’s release. Pro users who previously sustained 40-80 Sonnet hours weekly reported hitting limits after only 6-8 hours. Anthropic acknowledged the limits but did not explain the discrepancy.
You: "Create a session handoff document for what we accomplished today"
Claude will analyze git status, conversation history, and generate a structured handoff.
Handoff Triad Pattern: For teams or multi-session workflows, a three-command protocol adds explicit merge semantics on top of the basic handoff:
| Command | Job |
| --- | --- |
| /handoff:create | Generates the structured document from current session context |
| /handoff:resume | Loads a handoff document, confirms understanding, and waits for approval before starting |
| /handoff:update | Updates an existing handoff with section-specific merge rules (see below) |
The critical addition is per-section merge rules in update:
| Section | Merge Rule |
| --- | --- |
| Task, Scope | Keep or refine |
| Files | Merge — combine original with new files touched |
| Discoveries | Append — add new findings, never remove prior ones |
| Work Done | Append only — add new entries, never delete history, include commit hashes |
| Status | Replace — write current state |
| Next Steps | Replace — write updated checklist |
The append-only Work Done section creates an audit trail across sessions. Even if earlier work was revised, the revision appears as a new entry rather than an overwrite.
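The merge rules amount to a small, mechanical function. A sketch over handoff documents represented as dicts; the keys are illustrative, not the template’s exact schema:

```python
# Per-section merge semantics: keep/refine, merge, append-only, replace.
def merge_handoff(original: dict, update: dict) -> dict:
    merged = dict(original)
    for key in ("task", "scope"):                       # keep or refine
        if key in update:
            merged[key] = update[key]
    merged["files"] = sorted(                           # merge file lists
        set(original.get("files", [])) | set(update.get("files", []))
    )
    for key in ("discoveries", "work_done"):            # append, never remove
        merged[key] = original.get(key, []) + update.get(key, [])
    for key in ("status", "next_steps"):                # replace outright
        if key in update:
            merged[key] = update[key]
    return merged

merged = merge_handoff(
    {"task": "Auth refactor", "files": ["auth.ts"],
     "work_done": ["extracted JWT helper"], "status": "in progress"},
    {"files": ["auth.test.ts"], "work_done": ["added tests (abc1234)"],
     "status": "tests passing"},
)
```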
Fork-ready templates at examples/commands/handoff/ in this repo.
Recommended frequency: Boris Cherny (Head of Claude Code at Anthropic) starts approximately 80% of tasks in Plan Mode — letting Claude plan before writing a single line of code. Once the plan is approved, execution is almost always correct on the first try.
— Lenny’s Newsletter, February 19, 2026
Press Shift+Tab to toggle back to Normal Mode (Act Mode). You can also type a message and Claude will ask: “Ready to implement this plan?”
Note: Shift+Tab toggles between Plan Mode and Normal Mode during a session. Use Shift+Tab twice from Normal Mode to enter Plan Mode, once from Plan Mode to return.
Claude Code supports six model aliases via /model (each always resolves to the latest version):
| Alias | Resolves To | Use Case |
| --- | --- | --- |
| default | Latest model for your plan tier | Standard usage |
| sonnet | Claude Sonnet 4.6 | Fast, cost-efficient |
| opus | Claude Opus 4.7 | Deep reasoning |
| haiku | Claude Haiku 4.5 | Budget, high-volume |
| sonnet[1m] | Sonnet with 1M context | Large codebases |
| opusplan | Opus (plan) + Sonnet (act) | Hybrid intelligence |
Model can also be set via claude --model <alias>, ANTHROPIC_MODEL env var, or "model" in settings.json. Priority: /model > --model flag > ANTHROPIC_MODEL > settings.json.
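That precedence order can be sketched as a tiny resolver, first non-empty source wins (the function and argument names are illustrative, not part of Claude Code):

```python
# Documented precedence: /model > --model flag > ANTHROPIC_MODEL > settings.json.
def resolve_model(slash_command=None, cli_flag=None, env_var=None,
                  settings=None, fallback="default"):
    for source in (slash_command, cli_flag, env_var, settings):
        if source:
            return source
    return fallback

assert resolve_model(cli_flag="opus", settings="sonnet") == "opus"
assert resolve_model(settings="haiku") == "haiku"
```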
Knowledge cutoffs (what each model knows about):
| Model | Knowledge Cutoff |
| --- | --- |
| Claude Opus 4.7 | Not yet published |
| Claude Sonnet 4.6 | August 2025 |
| Claude Opus 4.6 | May 2025 |
| Claude Haiku 4.5 | February 2025 |
Claude Code injects the cutoff date for the active model into the system prompt at the start of each session. You can ask Claude directly — “what’s your knowledge cutoff?” — to confirm which date applies to your current session.
Concept: Use Opus for planning (superior reasoning) and Sonnet for implementation (cost-efficient).
Why OpusPlan?

- Cost optimization: Opus tokens cost more than Sonnet
- Best of both worlds: Opus-quality planning + Sonnet-speed execution
- Token savings: Planning is typically shorter than implementation
Activation:
/model opusplan
Or in ~/.claude/settings.json:
{
"model": "opusplan"
}
How It Works:

- In Plan Mode (/plan or Shift+Tab twice) → Uses Opus
- In Act Mode (normal execution) → Uses Sonnet
- Automatic switching based on mode
Recommended Workflow:
1. /model opusplan → Enable OpusPlan
2. Shift+Tab × 2 → Enter Plan Mode (Opus)
3. Describe your task → Get Opus-quality planning
4. Shift+Tab → Exit to Act Mode (Sonnet)
5. Execute the plan → Sonnet implements efficiently
Alternative Approach with Subagents:
You can also control model usage per agent:
.claude/agents/planner.md
---
name: planner
model: opus
tools: Read, Grep, Glob
---
# Strategic Planning Agent
.claude/agents/implementer.md
---
name: implementer
model: haiku
tools: Write, Edit, Bash
---
# Fast Implementation Agent
Pro Users Note: OpusPlan is particularly valuable for Pro subscribers with limited Opus tokens. It lets you leverage Opus reasoning for critical planning while preserving tokens for more sessions.
Budget Variant: SonnetPlan (Community Hack)
opusplan is hardcoded to Opus+Sonnet — there’s no native sonnetplan alias. But you can remap what the opus and sonnet aliases resolve to via environment variables, effectively creating a Sonnet→Haiku hybrid:
Caveat: The model’s self-report (what model are you?) is unreliable — models don’t always know their own identity. Trust the status bar (Model: Sonnet 4.6 in plan mode) or verify via billing dashboard. GitHub issue #9749 tracks native support.
Unfamiliar domain where first instincts are often wrong
Pattern:
## Round 1: Initial analysis
User: /plan
User: Analyze the current auth system. What are the key components,
dependencies, and potential risks of migrating to OAuth2?
Claude: [Initial analysis]
## Round 2: Deep challenge
User: Now use extended thinking. Challenge your own analysis:
- What assumptions did you make?
- What failure modes did you miss?
- What would a senior security engineer flag?
Claude: [Deeper analysis with self-correction]
## Round 3: Final plan
User: Based on both rounds, write the definitive migration plan.
Include rollback strategy and risk mitigation for each step.
Claude: [Refined plan incorporating both rounds]
## Execute
User: /execute
User: Implement the plan from round 3.
Why it works: Each round forces Claude to reconsider assumptions. Round 2 typically catches 30-40% of issues that round 1 missed. Round 3 synthesizes into a more robust plan.
📊 Empirical backing — Anthropic AI Fluency Index (Feb 2026)
An Anthropic study analyzing 9,830 Claude conversations quantifies exactly why plan review works: users who iterate and question the AI’s reasoning are 5.6× more likely to catch missing context and errors compared to users who accept the first output. A second round of review makes you 4× more likely to identify what was left out.
The Rev the Engine pattern operationalizes this finding: each round of deep challenge triggers the questioning behavior that produces measurably better plans.
Status: Research preview — requires Claude Code v2.1.91+ and a Claude Code on the web account.
Concept: Offload planning to Anthropic’s cloud while your terminal stays free. Claude drafts the plan remotely using multiple Opus 4.6 agents in parallel; you review it in your browser with inline comments, then choose whether to execute in the cloud or teleport the plan back to your terminal.
This solves the core friction of local Plan Mode: on complex tasks, the terminal blocks for minutes while planning runs. Ultraplan runs asynchronously — you keep working, check back when ready.
How It Works
1. CLI launches a cloud session → terminal shows a live status indicator
2. Multiple Opus 4.6 agents explore the codebase in parallel (planning windows up to 30 minutes)
3. Browser opens the plan with outline sidebar, inline commenting, and emoji reactions
4. You iterate on the plan — comment on specific sections, request revisions
5. Choose where to execute: cloud (opens a PR) or terminal (teleports the plan back)
Activation (3 methods)
# 1. Dedicated command
/ultraplan migrate the auth service from sessions to JWTs
# 2. Keyword anywhere in a prompt
Plan with ultraplan a full refactor of the payments module
# 3. From a local plan approval dialog
# → choose "No, refine with Ultraplan on Claude Code on the web"
The command and keyword paths show a confirmation dialog first. The local plan path skips it.
Terminal Status Indicators
| Status | Meaning |
| --- | --- |
| ◇ ultraplan | Claude is researching and drafting |
| ◇ ultraplan needs your input | Clarification needed — open the browser link |
| ◆ ultraplan ready | Plan is ready to review |
Run /tasks to see the session link, agent activity, and a Stop ultraplan action.
Browser Review Interface
- Outline sidebar: navigate between sections without scrolling
- Inline comments: highlight any passage, leave targeted feedback
- Emoji reactions: signal approval or concern on a section without writing a comment
- Revision cycles: ask Claude to address your comments; it presents an updated draft — iterate as many times as needed
Execution: Two Choices
Once the plan looks right, choose in the browser:
| Option | What happens |
| --- | --- |
| Approve and start coding | Cloud session implements the plan, creates a PR; terminal clears |
| Approve and teleport back | Plan sent to your terminal with 3 sub-options |
Teleport sub-options:
1. Implement here — inject plan into current conversation, proceed immediately
2. Start new session — fresh session with plan as context (prints claude --resume to return to current session)
3. Cancel — saves plan to a file, prints the path
Requirements and Constraints
| Requirement | Detail |
| --- | --- |
| Claude Code version | v2.1.91+ |
| Account | Pro, Max, Team, or Enterprise (not free tier) |
| Repository | GitHub only (no GitLab, Bitbucket) |
| Providers | Anthropic API only — not available on Bedrock, Vertex, Foundry |
| Conflict | Incompatible with Remote Control (both use claude.ai/code) |
Ultraplan vs. OpusPlan vs. Plan Mode
| Feature | Plan Mode | OpusPlan | Ultraplan |
| --- | --- | --- | --- |
| Execution | Local | Local | Cloud |
| Terminal blocked? | Yes | Yes | No |
| Models | Active model | Opus (plan) + Sonnet (act) | Opus 4.7 (multi-agent) |
| Review surface | Terminal scrollback | Terminal scrollback | Browser with inline comments |
| Requires GitHub | No | No | Yes |
| Token accounting | Counts locally | Counts locally | Cloud planning free from local quota |
When to Use Ultraplan
Best fit:

- Complex architectural changes touching many files (service migrations, large refactors)
- Tasks where you want to keep working while planning runs
- Situations where stakeholders need to review the plan before implementation

Skip it for:

- Simple, focused changes where local Plan Mode takes under a minute
- Environments without internet or not on GitHub
- Sessions using Remote Control
Token Note: Early tests show cloud planning consuming ~37% fewer tokens than equivalent local plans (82K vs 131K for a ~55 min migration task). Cloud planning tokens don’t count against your local quota; only implementation tokens do.
See also: §9.16 Session Teleportation for the broader web ↔ terminal workflow. Ultraplan uses the same cloud infrastructure with planning-specific review capabilities.
Cloud-based parallel multi-agent code review. Where Ultraplan handles planning, Ultrareview handles review: multiple Opus 4.7 agents read through your changes simultaneously and surface bugs and design issues that careful reviewers would catch.
Activation:
/ultrareview        # Review current branch (diff from base)
/ultrareview <PR#>  # Review a specific GitHub PR
Ultrareview operates on diffs, not the full codebase — it reviews what changed on the current branch, or the changes in a given PR. The cloud session dispatches parallel agents to analyse the diff; results arrive in the browser and can optionally be teleported back to the terminal.
Launch offer: Pro and Max subscribers receive three free ultrareviews to try the feature.
Note on costs: Estimates based on API pricing (Haiku $0.80/$4.00 per MTok, Sonnet $3/$15, Opus $5/$25). Pro/Max subscribers pay a flat rate, so prioritize quality over cost. See Section 2.2 for full pricing breakdown.
Budget modifier (Teams Standard/Pro): downgrade one tier per phase — use Sonnet where the table says Opus, Haiku where it says Sonnet for mechanical implementation tasks. Community pattern: Sonnet for Plan → Haiku for Implementation on a $25/mo Teams Standard plan.
The effort parameter (Opus 4.6 API) controls the model’s overall computational budget — not just thinking tokens, but tool calls, verbosity, and analysis depth. Low effort = fewer tool calls, no preamble. High effort = more explanations, detailed analysis.
Calibrated gradient — one real prompt per level:
low — Mechanical, no design decisions needed
"Rename getUserById to findUserById across src/" — Find-replace scope, zero reasoning required.
medium — Clear pattern, defined scope, one concern
"Convert fetchUser() in api/users.ts from callbacks to async/await" — Pattern is known, scope bounded.
high — Design decisions, edge cases, multiple concerns
"Redesign error handling in the payment module: add retry logic, partial failure recovery, and idempotency guarantees" — Architectural choices, not just pattern application.
xhigh (Opus 4.7+, v2.1.114+) — extra-high effort between high and max; the default for Claude Code (all plans) with Opus 4.7
"Debug this race condition in the distributed job queue with concurrent writes and partial reads" — More reasoning depth than high, faster than max.
max (Opus 4.7+ only; returns an error on other models) — cross-system reasoning, irreversible decisions
"Analyze the microservices event pipeline for race conditions across order-service, inventory-service, and notification-service" — Multi-service hypothesis testing, adversarial thinking.
Skills can declare their own effort level in frontmatter. The skill’s value overrides the session setting for the duration of that skill’s execution, then reverts. This eliminates the need to manually toggle effort between mechanical and analytical tasks.
```yaml
# Mechanical skill — always fast, never wastes reasoning budget
---
name: release-notes        # illustrative name
description: Read git history and format release notes
effort: low
---

# Analytical skill — always deep, regardless of session setting
---
name: architecture-review
description: Full architectural analysis with trade-off evaluation
effort: high
---
```
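A sketch of the override-then-revert behavior; the class and method names here are illustrative, not Claude Code internals:

```python
class EffortContext:
    """Session effort setting that a skill's frontmatter can temporarily override."""
    def __init__(self, session_effort: str = "high"):
        self.effort = session_effort
        self._session_effort = session_effort

    def run_skill(self, frontmatter: dict, body):
        override = frontmatter.get("effort")
        if override:
            self.effort = override            # skill's value wins for the duration
        try:
            return body(self.effort)
        finally:
            self.effort = self._session_effort  # revert to the session setting

ctx = EffortContext(session_effort="high")
result = ctx.run_skill({"name": "release-notes", "effort": "low"},
                       lambda e: f"ran at effort={e}")
# result == "ran at effort=low", and ctx.effort is back to "high" afterwards
```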
Decision table for common skill types:

| Skill type | Recommended effort | Reasoning |
| --- | --- | --- |
| Commit, push, sync | low | Sequential steps, no design decisions |
| Changelog, release notes | low | Reads git + formats, mechanical |
| Scaffolding, boilerplate | low | Template instantiation |
| Code review (single PR) | medium | Pattern recognition, bounded scope |
| Issue triage, backlog | medium | Categorization + some analysis |
| Security audit | high | Threat modeling, adversarial thinking |
| Architecture review | high | Design decisions, cross-component reasoning |
| Multi-agent orchestration | high | Coordination + planning |
Cost model: low effort means fewer tool calls, no preamble, direct output. high effort means more tool calls with explanations, detailed summaries, deeper exploration. Match effort to where analysis adds value — not to “effort = quality” uniformly.
```yaml
---
description: Mechanical execution agent. Scope must be defined explicitly in the task.
model: haiku
tools: Write, Edit, Bash, Read, Grep, Glob
---
```
Note: Haiku is for mechanical tasks only. If the implementation requires design decisions or complex business logic, use Sonnet — state this in the task prompt.
External Services: Claude can’t access your databases directly
Your Intent: Claude needs clear instructions
Hidden Files: Claude respects .gitignore by default
⚠️ Pattern Amplification: Claude mirrors the patterns it finds. In well-structured codebases, it produces consistent, idiomatic code. In messy codebases without clear abstractions, it perpetuates the mess. If your code lacks good patterns, provide them explicitly in CLAUDE.md or use semantic anchors (Section 2.9).
The most common mistake is treating Claude Code like a chatbot — typing ad-hoc requests and hoping for good output. What separates casual usage from production workflows is a shift in thinking:
Chatbot mode: You write good prompts. Context system: You build structured context that makes every prompt better.
“Stop treating it like a chatbot. Give it structured context. CLAUDE.md, hooks, skills, project memory. Changes everything.”
— Robin Lorenz, AI Engineer (comment)
Claude Code has four layers of persistent context that compound over time:
XML-structured prompts provide semantic organization for complex requests, helping Claude distinguish between different aspects of your task for clearer understanding and better results.
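A minimal sketch of what this can look like (the tag names are illustrative; Claude does not require a fixed schema):

```xml
<task>Migrate auth/session.ts from session cookies to JWT</task>
<constraints>
  Keep the public API of AuthService unchanged.
  Do not touch the password-reset flow.
</constraints>
<verify>npm test -- auth</verify>
```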
The Claude Code team internally treats prompts as challenges to a peer, not instructions to an assistant. This subtle shift produces higher-quality outputs because it forces Claude to prove its reasoning rather than simply comply.
Three challenge patterns from the team:
1. The Gatekeeper — Force Claude to defend its work before shipping:
"Grill me on these changes and don't make a PR until I pass your test"
Claude reviews your diff, asks pointed questions about edge cases, and only proceeds when satisfied. This catches issues that passive review misses.
2. The Proof Demand — Require evidence, not assertions:
"Prove to me this works — show me the diff in behavior between main and this branch"
Claude runs both branches, compares outputs, and presents concrete evidence. Eliminates the “trust me, it works” failure mode.
3. The Reset — After a mediocre first attempt, invoke full-context rewrite:
"Knowing everything you know now, scrap this and implement the elegant solution"
This forces a substantive second attempt with accumulated context rather than incremental patches on a weak foundation. The key insight: Claude’s second attempt with full context consistently outperforms iterative fixes.
Why this works: Provocation triggers deeper reasoning paths than polite requests. When Claude must convince rather than comply, it activates more thorough analysis and catches its own shortcuts.
LLMs are statistical pattern matchers trained on massive text corpora. Using precise technical vocabulary helps Claude activate the right patterns in its training data, leading to higher-quality outputs.
When you say “clean code”, Claude might generate any of dozens of interpretations. But when you say “SOLID principles with dependency injection following Clean Architecture layers”, you anchor Claude to a specific, well-documented pattern from its training.
Key insight: Technical terms act as GPS coordinates into Claude’s knowledge. The more precise, the better the navigation.
💡 Pro tip: When Claude produces generic code, try adding more specific anchors. “Use clean code” → “Apply Martin Fowler’s Refactoring catalog, specifically Extract Method and Replace Conditional with Polymorphism.”
Important: Everything you share with Claude Code is sent to Anthropic servers. Understanding this data flow is critical for protecting sensitive information.
2. Never connect production databases to MCP servers. Use dev/staging with anonymized data.
3. Use security hooks to block reading of sensitive files (see Section 7.4).
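As a sketch, one shape such a hook can take in `.claude/settings.json`, assuming a PreToolUse hook that routes Read calls through a checker script (the script path is hypothetical; Section 7.4 has the canonical setup):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read",
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/block_sensitive_reads.py" }
        ]
      }
    ]
  }
}
```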
Full guide: For complete privacy documentation including known risks, community incidents, and enterprise considerations, see Data Privacy & Retention Guide.
Reading time: 5 minutes
Goal: Understand the core architecture that powers Claude Code
This section provides a summary of Claude Code’s internal mechanisms. For the complete technical deep-dive with diagrams and source citations, see the Architecture & Internals Guide.
How tool execution works: Claude Code can start executing tools marked as concurrency-safe (read-only operations like Read, Grep, Glob) while the model is still generating its response, reducing total turn time. Non-concurrent tools (writes, bash commands) wait for the response to complete and run serially. When multiple read-only tools appear in a single response, they run in parallel — up to 10 concurrent by default.
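The parallel/serial split can be sketched as follows; this is an illustrative model, not Claude Code's actual scheduler, with the tool names and the concurrency cap taken from the paragraph above:

```python
import asyncio

CONCURRENCY_SAFE = {"Read", "Grep", "Glob"}  # read-only tools per the text
MAX_PARALLEL = 10                            # default concurrency cap per the text

async def run_tools(calls):
    """Run concurrency-safe tools in parallel (capped); run the rest serially."""
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def run_safe(name, arg):
        async with sem:
            await asyncio.sleep(0)  # stand-in for real file I/O
            return f"{name}({arg})"

    safe = [c for c in calls if c[0] in CONCURRENCY_SAFE]
    unsafe = [c for c in calls if c[0] not in CONCURRENCY_SAFE]

    results = list(await asyncio.gather(*(run_safe(n, a) for n, a in safe)))
    for name, arg in unsafe:  # writes and bash commands run one at a time
        results.append(f"{name}({arg})")
    return results

out = asyncio.run(run_tools([("Read", "a.ts"), ("Grep", "foo"), ("Edit", "b.ts")]))
# Read and Grep ran in parallel; Edit ran serially afterwards
```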
⚠️ Important: Use repository-specific task list IDs to avoid cross-project contamination. Tasks with the same ID are shared across all sessions using that ID.
Task schema example:
```json
{
  "id": "task-auth-login",
  "title": "Implement login endpoint",
  "description": "POST /auth/login with JWT token generation"
}
```
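One hypothetical scheme for deriving a repository-specific task list ID (the naming convention is an assumption, not Claude Code's):

```python
import hashlib
import pathlib

def task_list_id(repo_path: str) -> str:
    """Derive a stable, repository-specific task list ID (illustrative scheme)."""
    name = pathlib.Path(repo_path).name
    digest = hashlib.sha256(repo_path.encode()).hexdigest()[:8]
    return f"{name}-{digest}"

task_list_id("/home/me/shop")  # stable for a given path, distinct across repos
```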
The Diagnostic Principle: When Claude’s task list doesn’t match your intent, the problem isn’t Claude—it’s your instructions.
Task lists act as a mirror for instruction clarity. If you ask Claude to plan a feature and the resulting tasks surprise you, that divergence is diagnostic information:
Your instruction: "Refactor the auth system"
Claude's task list:
- [ ] Read all auth-related files
- [ ] Identify code duplication
- [ ] Extract shared utilities
- [ ] Update imports
- [ ] Run tests
Your reaction: "That's not what I meant—I wanted to switch from session to JWT"
Diagnosis: Your instruction was ambiguous. "Refactor" ≠ "replace".
Divergence patterns and what they reveal:
| Divergence Type | What It Means | Fix |
| --- | --- | --- |
| Tasks too broad | Instructions lack specificity | Add WHAT, WHERE, HOW, VERIFY |
| Tasks too narrow | Instructions too detailed, missing big picture | State the goal, not just the steps |
| Wrong priorities | Context missing about what matters | Add constraints and priorities |
| Missing tasks | Implicit knowledge not shared | Make assumptions explicit in prompt |
| Extra tasks | Claude inferred requirements you didn't intend | Add explicit scope boundaries |
Using task divergence as a workflow:
```
## Step 1: Seed with loose instruction
User: "Improve the checkout flow"

## Step 2: Review Claude's task list (don't execute yet)
Claude generates: [task list]

## Step 3: Compare against your mental model
- Missing: payment retry logic? → Add to instructions

User: "Actually, here's what I need: [refined instruction with specifics]"
```
Pro tip: Run TaskList after initial planning as a sanity check before execution. If more than 30% of tasks surprise you, your prompt needs work. Iterate on the prompt, not the tasks.
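The 30% heuristic can be made concrete; this is a hand-rolled illustration of the rule above, not a built-in feature:

```python
def prompt_needs_work(tasks, surprising):
    """Per the heuristic above: if more than 30% of generated tasks
    surprise you, iterate on the prompt before executing."""
    return len(surprising) / len(tasks) > 0.30

tasks = ["read auth files", "dedupe", "extract utils", "update imports", "run tests"]
surprising = ["dedupe", "extract utils"]   # you wanted a JWT migration, not cleanup
prompt_needs_work(tasks, surprising)       # 2/5 = 40% → True: refine the prompt
```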
Claude Code operates within a 200K token context window (1M beta available via API; see the 200K vs 1M comparison):
| Component | Approximate Size |
| --- | --- |
| System prompt | 5-15K tokens |
| CLAUDE.md files | 1-10K tokens |
| Conversation history | Variable |
| Tool results | Variable |
| Reserved for response | 40-45K tokens |
When context fills up (~75% in VS Code, ~95% in CLI), older content is automatically summarized. However, research shows this degrades quality (50-70% performance drop on complex tasks). Use /compact proactively at logical breakpoints, or trigger session handoffs at 85% to preserve intent over compressed history. See Session Handoffs and Auto-Compaction Research.
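A sketch of these thresholds as a decision helper (the 75%/85% cutoffs come from the guidance above; the function itself is illustrative):

```python
def context_action(used_tokens: int, window: int = 200_000) -> str:
    """Decide when to compact or hand off, per the thresholds above."""
    usage = used_tokens / window
    if usage >= 0.85:
        return "handoff"   # preserve intent over compressed history
    if usage >= 0.75:
        return "compact"   # proactive /compact at a logical breakpoint
    return "continue"

context_action(100_000)  # "continue" — 50% used, keep working
context_action(155_000)  # "compact"  — 77.5%, compact at the next breakpoint
context_action(180_000)  # "handoff"  — 90%, start a session handoff
```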
Status: Partially feature-flagged, progressive rollout in progress.
TeammateTool enables multi-agent orchestration with persistent communication between agents. Unlike standard sub-agents that work in isolation, teammates can coordinate through structured messaging.
Core Capabilities:
| Operation | Purpose |
| --- | --- |
| spawnTeam | Create a named team of agents |
| discoverTeams | List available teams |
| requestJoin | Agent requests to join a team |
| approveJoin | Team leader approves join requests |
| Messaging | JSON-based inter-agent communication |
Execution Backends (auto-detected):
In-process: Async tasks in same Node.js process (fastest)
⚠️ Note: This is an experimental feature. Capabilities may change or be removed in future releases. Always verify current behavior with official documentation.
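A toy in-process model of the messaging pattern, assuming nothing about TeammateTool's real API beyond what the operations table above describes:

```python
import json
import queue

class Team:
    """Minimal in-process sketch of team-style JSON messaging (illustrative only)."""
    def __init__(self, name: str):
        self.name = name
        self.members = {}          # agent name -> inbox queue

    def join(self, agent: str):
        self.members[agent] = queue.Queue()

    def send(self, sender: str, recipient: str, payload: str):
        msg = json.dumps({"from": sender, "body": payload})
        self.members[recipient].put(msg)   # JSON-based inter-agent message

    def receive(self, agent: str) -> dict:
        return json.loads(self.members[agent].get_nowait())

team = Team("refactor-squad")
team.join("planner")
team.join("implementer")
team.send("planner", "implementer", "extract shared utils from auth/")
msg = team.receive("implementer")  # {"from": "planner", "body": "extract ..."}
```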
“Do more with less. Smart architecture choices, better training efficiency, and focused problem-solving can compete with raw scale.”
— Daniela Amodei, Anthropic President
Claude Code trusts the model’s reasoning instead of building complex orchestration systems. This means: