Production Safety Rules
Production Safety Rules
Section titled “Production Safety Rules”Audience: Teams deploying Claude Code in production environments. For solo learners: See Getting Started instead.
TL;DR (30 seconds)
Section titled “TL;DR (30 seconds)”6 non-negotiable rules for production teams:
- ✅ Port Stability: Never change backend/frontend ports
- ✅ Database Safety: Always backup before destructive ops
- ✅ Feature Completeness: Never ship half-implemented features
- ✅ Infrastructure Lock: Docker/env changes require permission
- ✅ Dependency Safety: No new dependencies without approval
- ✅ Pattern Following: Conform to existing codebase conventions
When to Use These Rules
Section titled “When to Use These Rules”| Project Type | Use These Rules? | Why |
|---|---|---|
| Learning / Tutorials | ❌ No | Too restrictive for exploration |
| Solo prototypes | ❌ No | Overhead not worth it |
| Small teams (2-3), staging env | ⚠️ Partial | Rules 1, 3, 6 only |
| Production apps, multi-dev teams | ✅ Yes | All 6 rules |
| Regulated industries (HIPAA, SOC2) | ✅ Yes + add compliance rules | Critical safety |
Rule 1: Port Stability
Section titled “Rule 1: Port Stability”The Problem
Section titled “The Problem”Changing ports breaks:
- Local development environments
- Docker Compose configurations
- Deployed service configs
- Team member setups
Real incident: Backend port changed from 3000 → 8080 during refactor. All developers lost a day re-configuring local envs. Staging deployment failed silently because nginx proxy still pointed to 3000.
The Rule
Section titled “The Rule”Never modify backend/frontend ports without explicit team permission.
Implementation
Section titled “Implementation”Option A: Permission deny in settings.json
{ "permissions": { "deny": [ "Edit(docker-compose.yml:*ports*)", "Edit(package.json:*PORT*)", "Edit(.env.example:*PORT*)", "Edit(vite.config.ts:*port*)" ] }}Option B: Pre-commit hook
if [[ "$TOOL" == "Edit" ]]; then FILE=$(echo "$INPUT" | jq -r '.tool.input.file_path') CONTENT=$(echo "$INPUT" | jq -r '.tool.input.new_string')
if [[ "$FILE" =~ (docker-compose|vite.config|package.json) ]] && \ [[ "$CONTENT" =~ (port|PORT):[[:space:]]*[0-9] ]]; then echo "⚠️ BLOCKED: Port modification detected in $FILE" echo "Ports must remain stable across team. Request permission first." exit 2 fifiOption C: CLAUDE.md constraint
## Port Configuration
**CRITICAL**: Ports are locked for team coordination.
Current ports:- Frontend (Vite): 5173- Backend (Express): 3000- Database: 5432
To change ports:1. Create RFC document in `/docs/rfcs/`2. Get team approval (3+ reviewers)3. Update all environments simultaneously4. Notify team 48h in advanceEdge Cases
Section titled “Edge Cases”| Scenario | Behavior |
|---|---|
| Adding NEW service | OK (doesn’t break existing) |
| Changing test env port | OK (isolated from dev/prod) |
| Port conflict on machine | Ask user to resolve locally (.env.local) |
Rule 2: Database Safety
Section titled “Rule 2: Database Safety”The Problem
Section titled “The Problem”Accidental deletions in production = data loss.
Real incidents:
DELETE FROM users WHERE id = 123→ ForgotWHERE→ All users deletedDROP TABLE sessionsduring cleanup → Production table dropped- Migration rollback → Data loss because no backup
The Rule
Section titled “The Rule”Always backup before destructive operations.
Destructive operations:
DELETE FROM(withoutLIMIT 1)DROP TABLETRUNCATEALTER TABLE ... DROP COLUMN- Database migrations that can’t rollback
Implementation
Section titled “Implementation”Option A: Pre-tool hook with backup enforcement
#!/bin/bashINPUT=$(cat)TOOL=$(echo "$INPUT" | jq -r '.tool.name')
if [[ "$TOOL" == "Bash" ]]; then COMMAND=$(echo "$INPUT" | jq -r '.tool.input.command')
# Detect destructive database operations if [[ "$COMMAND" =~ (DROP TABLE|DELETE FROM|TRUNCATE|ALTER.*DROP) ]]; then echo "🚨 BLOCKED: Destructive database operation detected" echo "" echo "Required steps:" echo "1. Create backup: pg_dump -U user dbname > backup_\$(date +%Y%m%d_%H%M%S).sql" echo "2. Verify backup size is reasonable" echo "3. Re-run after backup confirmation" exit 2 fifi
exit 0Option B: Migration safety wrapper
#!/bin/bashset -e
echo "🔍 Pre-migration checks..."
# 1. Check environmentif [[ "$NODE_ENV" == "production" ]]; then echo "❌ BLOCKED: Use migration service for production" exit 1fi
# 2. Create backupBACKUP_FILE="backups/pre-migration-$(date +%Y%m%d_%H%M%S).sql"mkdir -p backupspg_dump $DATABASE_URL > "$BACKUP_FILE"echo "✅ Backup created: $BACKUP_FILE"
# 3. Run migrationecho "🚀 Running migration..."npm run prisma:migrate:dev
# 4. Verifyecho "🔍 Verifying database state..."npm run prisma:validate
echo "✅ Migration complete. Backup: $BACKUP_FILE"Option C: CLAUDE.md protocol
## Database Operations
### Destructive Operations Protocol
**NEVER run these without backup**:- DELETE, DROP, TRUNCATE, ALTER...DROP
**Required steps**:1. Announce in #dev-ops Slack channel2. Create backup: `./scripts/backup-db.sh`3. Verify backup: `ls -lh backups/` (should be >0 bytes)4. Execute in staging FIRST5. Wait 24h for issues6. Execute in production with on-call engineer present
**Emergency rollback**:```bashpsql $DATABASE_URL < backups/[latest].sql### MCP Database Safety
If using MCP database servers (Postgres, MySQL, etc.):
```json{ "mcpServers": { "database": { "command": "npx", "args": ["@modelcontextprotocol/server-postgres"], "env": { "POSTGRES_URL": "postgres://readonly:***@dev-db.example.com:5432/appdb" }, "comment": "READ-ONLY user for safety" } }}Critical: Use read-only database users for MCP. See Data Privacy Guide.
Rule 3: Feature Completeness
Section titled “Rule 3: Feature Completeness”The Problem
Section titled “The Problem”Claude Code sometimes “half-asses” features when context runs low:
- Deletes existing functionality instead of fixing bugs
- Adds
TODOcomments for core features - Leaves error states unhandled
- Creates mock implementations
Real incidents:
- Payment validation “fixed” by removing validation entirely
- Error handling “added” with
throw new Error("Not implemented") - Feature “completed” with
// TODO: Add actual logic here
The Rule
Section titled “The Rule”Never ship half-implemented features. If you start, you finish to working state.
Implementation
Section titled “Implementation”Option A: CLAUDE.md constraint
## Feature Implementation Standards
### NON-NEGOTIABLE
1. **No TODOs for core functionality** - TODOs allowed ONLY for future enhancements - Core features must be complete and working
2. **No mock implementations** - No `throw new Error("Not implemented")` - No fake data generators in production code paths
3. **Complete error handling** - Every async call has try/catch - Every user input is validated - Every API call has timeout and retry logic
4. **Downgrade = Delete the feature entirely** - If you can't fix properly, remove the feature - Document why in commit message - Create issue for proper implementation
### Validation
Before accepting changes, verify:- [ ] No `TODO` in modified files (except future enhancements)- [ ] No `throw new Error("Not implemented")`- [ ] No commented-out code without explanation- [ ] All new functions have error handlingOption B: Pre-commit git hook
#!/bin/bash
# Check staged files for "half-assing" patternsSTAGED=$(git diff --cached --name-only --diff-filter=ACM)
for FILE in $STAGED; do if [[ "$FILE" =~ \.(ts|tsx|js|jsx|py)$ ]]; then # Check for TODOs in core logic (not tests) if ! [[ "$FILE" =~ test|spec ]]; then if git diff --cached "$FILE" | grep -E "^\+.*TODO.*implement|^\+.*Not implemented"; then echo "❌ COMMIT BLOCKED: TODO/Not implemented in $FILE" echo " Complete the feature or remove it entirely." exit 1 fi fi
# Check for mock placeholders if git diff --cached "$FILE" | grep -E "^\+.*(MOCK_DATA|fakeData|placeholder)"; then echo "⚠️ WARNING: Mock data detected in $FILE" echo " Ensure this is intentional for staging/dev only." fi fidone
exit 0Option C: Output evaluator command
# Before committing/validate-changes
# This runs the output-evaluator agent (see examples/agents/output-evaluator.md)# which scores changes on:# - Correctness (10/10)# - Completeness (10/10) ← Detects half-assing# - Safety (10/10)Rule 4: Infrastructure Lock
Section titled “Rule 4: Infrastructure Lock”The Problem
Section titled “The Problem”Claude might modify infrastructure configs without understanding production implications:
- Changes Docker Compose volumes → data loss
- Modifies
.env.example→ breaks onboarding - Updates Terraform → unintended resource changes
- Tweaks Kubernetes manifests → downtime
The Rule
Section titled “The Rule”Infrastructure modifications require explicit team permission.
Files to protect:
docker-compose.yml,Dockerfile.env.example(templates, NOT personal .env.local)kubernetes/,k8s/,terraform/,helm/- CI/CD configs (
.github/workflows/,.gitlab-ci.yml) - Database schemas (requires migration review)
Implementation
Section titled “Implementation”Option A: Permission deny
{ "permissions": { "deny": [ "Edit(docker-compose.yml)", "Edit(Dockerfile)", "Edit(.env.example)", "Edit(terraform/**)", "Edit(kubernetes/**)", "Edit(.github/workflows/**)", "Edit(prisma/schema.prisma)" ] }}Option B: CLAUDE.md rule
## Infrastructure Changes
You are **FORBIDDEN** from modifying these without explicit permission:
- `docker-compose.yml`, `Dockerfile`- `.env.example` (template for new developers)- `terraform/`, `kubernetes/` (infrastructure as code)- `.github/workflows/` (CI/CD pipelines)- `prisma/schema.prisma` (database schema)
**If infrastructure change is needed**:1. Ask user: "This requires infrastructure change. Should I create an RFC?"2. Create RFC document in `docs/rfcs/YYYYMMDD-<title>.md`3. Do NOT modify files until RFC approvedNote: Personal .env.local files are OK to modify (they’re gitignored).
Rule 5: Dependency Safety
Section titled “Rule 5: Dependency Safety”The Problem
Section titled “The Problem”Adding dependencies without team approval:
- Increases bundle size (performance)
- Introduces security vulnerabilities
- Creates license compliance issues
- Adds maintenance burden
Real incidents:
- Added
moment.js(200KB) whendate-fns(tiny) already in project - Installed
lodashwhen project usesramda - Added GPL library → license violation for proprietary codebase
The Rule
Section titled “The Rule”No new dependencies without explicit approval.
Implementation
Section titled “Implementation”Option A: Permission deny on package managers
{ "permissions": { "deny": [ "Bash(npm install *)", "Bash(npm i *)", "Bash(pnpm add *)", "Bash(yarn add *)", "Bash(pip install *)", "Bash(poetry add *)" ], "allow": [ "Bash(npm install)", "Bash(pnpm install)", "Bash(pip install -r requirements.txt)" ] }}Option B: CLAUDE.md protocol
## Dependency Management
### Immutable Stack Rule
**You are FORBIDDEN from adding new dependencies** (`npm install <package>`).
**If new dependency is needed**:1. Check if existing dependency solves it: - Date manipulation? Use existing `date-fns` - HTTP requests? Use existing `axios` - State management? Use existing `zustand`2. If genuinely needed, ASK: - "I need [package] for [reason]. Existing alternatives: [X, Y]. Should I add it?"3. Wait for explicit approval4. User will run: `npm install <package>` manually
**Allowed without asking**:- `npm install` (installs existing package.json deps)- Dev dependencies for testing (`-D` flag after approval)Option C: Pre-tool hook
if [[ "$TOOL" == "Bash" ]]; then COMMAND=$(echo "$INPUT" | jq -r '.tool.input.command')
# Block dependency installation if [[ "$COMMAND" =~ (npm|pnpm|yarn)[[:space:]]+(install|add|i)[[:space:]]+[a-zA-Z] ]]; then echo "🚨 BLOCKED: New dependency installation" echo "" echo "Dependencies must be approved by team lead." echo "Create PR with RFC explaining:" echo "1. Why this dependency is needed" echo "2. Alternatives considered" echo "3. Bundle size impact" echo "4. License compatibility" exit 2 fi
# Allow: npm install (no args), npm install -g, pnpm install if [[ "$COMMAND" =~ ^(npm|pnpm|yarn)[[:space:]]+install$ ]]; then exit 0 fifiRule 6: Pattern Following
Section titled “Rule 6: Pattern Following”The Problem
Section titled “The Problem”Claude introduces new patterns inconsistent with codebase:
- Uses
classcomponents when project is functional React - Imports
lodashwhen project usesramda - Writes REST endpoints when project is GraphQL
- Uses
fetchwhen project standardized onaxios
The Rule
Section titled “The Rule”Conform to existing codebase conventions. Check before implementing.
Implementation
Section titled “Implementation”Option A: CLAUDE.md conventions
## Code Conventions
### Tech Stack (DO NOT DEVIATE)
**Frontend**:- React 18 with **function components + hooks** (NO class components)- State: Zustand (NOT Redux, Context)- HTTP: axios (NOT fetch)- Styling: Tailwind CSS (NOT styled-components, emotion)- Forms: React Hook Form + Zod
**Backend**:- Node.js + Express- Database: Prisma ORM (NOT raw SQL, TypeORM)- Auth: JWT via jose library- Validation: Zod schemas
**Testing**:- Unit: Vitest (NOT Jest)- E2E: Playwright (NOT Cypress)
### Import Patterns
**Always use**:```typescriptimport { useState } from 'react' // ✅ Named importsimport axios from 'axios' // ✅ Default importNever use:
import React from 'react' // ❌ Deprecated patternimport * as axios from 'axios' // ❌ Namespace importFile Structure
Section titled “File Structure”src/ features/ ← Group by feature (NOT by type) auth/ components/ hooks/ api/ shared/ ← Shared utilities components/ hooks/Design System
Section titled “Design System”UI changes MUST use existing design system:
- Check
src/shared/components/for existing components - Use Tailwind utility classes from
tailwind.config.js - Colors from
colors.tspalette ONLY - Typography from
typography.config.js
Before creating new component:
- Search:
rg "Button" src/shared/components/ - If exists, use it
- If doesn’t exist, ask: “Should I create new Button component or use existing primitive?”
**Option B: Pre-implementation analysis**
```markdown## Before Implementing
**ALWAYS** run these checks:
1. **Pattern check**:```bash# How does codebase handle X?rg "import.*useState" src/ # Check React patternsrg "axios\." src/ # Check HTTP patternsrg "prisma\." src/ # Check DB patterns- Existing components:
# Does component already exist?find src/shared/components -name "*Button*"find src/shared/components -name "*Modal*"- Ask user if unclear:
- “I see project uses [X]. Should I follow this pattern or use [Y]?”
**Option C: Automated validation**
```bash#!/bin/bashif [[ "$TOOL" == "Write" ]] || [[ "$TOOL" == "Edit" ]]; then FILE=$(echo "$INPUT" | jq -r '.tool.input.file_path')
# Check for pattern violations if [[ "$FILE" =~ \.(tsx?)$ ]]; then CONTENT=$(cat "$FILE")
# Violation: class components in React if echo "$CONTENT" | grep -q "class.*extends.*Component"; then echo "⚠️ WARNING: Class component detected in $FILE" echo " Project uses function components. Consider refactoring." fi
# Violation: wrong HTTP library if echo "$CONTENT" | grep -q "import.*fetch\|window.fetch"; then echo "⚠️ WARNING: fetch() detected in $FILE" echo " Project uses axios. Use: import axios from 'axios'" fi fifiRule 7: The Verification Paradox
Section titled “Rule 7: The Verification Paradox”The Problem
Section titled “The Problem”When AI succeeds 99% of the time, traditional human verification becomes fragile:
The paradox: As AI reliability increases, human review quality decreases.
- Vigilance fatigue: Rare errors (1%) slip through when humans unconsciously trust patterns that usually work
- Pattern-trusting behavior: Manual review degrades as reviewers stop expecting errors
- False confidence: “It worked last 50 times” creates blind spots for the 51st failure
- Cognitive load: Humans aren’t optimized to catch 1-in-100 errors consistently
Real incidents:
- Payment validation bypassed after 200 successful transactions → fraud on transaction #201
- Security check skipped because “AI always gets auth right” → credentials leaked
- Test suite passing 99% → production bug from the 1% case that wasn’t tested
Source: Alan Engineering Team (Charles Gorintin, Maxime Le Bras), Feb 2026
The Rule
Section titled “The Rule”Build automated safety systems instead of relying on human vigilance.
When AI reliability crosses ~95%, shift from manual review to automated guardrails.
Anti-Patterns vs Better Approaches
Section titled “Anti-Patterns vs Better Approaches”| Anti-Pattern | Better Approach |
|---|---|
| Manual review for every AI output | Automated test suites + selective review |
| Trust because “it worked last time” | Verification contracts (tests, types, lints) |
| Human as sole error detector | Guardrails that fail fast (CI/CD gates) |
| “Spot-check” strategy for high-frequency AI ops | Comprehensive automated validation |
| Reviewer fatigue = lower standards over time | Consistent automated quality bars |
Implementation
Section titled “Implementation”Option A: Automated Guardrail Stack
name: AI Output Validation
on: [pull_request]
jobs: validate: runs-on: ubuntu-latest steps: - name: Type safety run: npm run typecheck # Catch type errors AI missed
- name: Lint rules run: npm run lint # Enforce code standards
- name: Unit tests run: npm run test # Verify behavior contracts
- name: E2E tests run: npm run test:e2e # Catch integration failures
- name: Security audit run: npm audit # Detect vulnerable dependencies
- name: Bundle analysis run: npm run analyze # Catch bloat/regressions
# Human review ONLY after all automation passesOption B: Verification Contracts in CLAUDE.md
## Verification Protocol
### NEVER rely on human review alone
**Automated verification required**:1. **Type safety**: `npm run typecheck` must pass (zero errors)2. **Tests**: `npm run test` coverage ≥ 80% for new code3. **Lint**: `npm run lint` must pass (zero warnings)4. **Security**: `npm audit` must show zero high/critical vulnerabilities5. **Performance**: Lighthouse score ≥ 90 for affected pages
**Human review is for**:- Architecture decisions- UX/design choices- Business logic validation- Edge cases automation can't catch
**Human review is NOT for**:- Syntax errors (use linters)- Type errors (use TypeScript)- Performance regressions (use benchmarks)- Security issues (use automated scanners)Option C: Pre-merge checklist (automated)
#!/bin/bash
echo "🔍 Running automated verification (Verification Paradox defense)..."
# 1. Type safetynpm run typecheck || { echo "❌ Type errors detected"; exit 1; }
# 2. Lintnpm run lint || { echo "❌ Lint errors detected"; exit 1; }
# 3. Testsnpm run test || { echo "❌ Tests failing"; exit 1; }
# 4. Securitynpm audit --audit-level=high || { echo "❌ Security vulnerabilities detected"; exit 1; }
echo "✅ All automated checks passed"echo "💡 Human review can now focus on architecture/UX/business logic"Edge Cases
Section titled “Edge Cases”| Scenario | Behavior |
|---|---|
| AI writes perfect code 99.9% | STILL run automation (paradox applies even at 99.9%) |
| Time pressure, “just ship it” | Automation is non-negotiable (fast ≠ skip safety) |
| Trivial changes (typo fix) | Run automation (typos can break prod) |
| Emergency hotfix | Automation REQUIRED (stress = higher error rate) |
Why This Matters
Section titled “Why This Matters”Old model (pre-AI):
- Code quality = human expertise + careful review
- Errors caught by experienced developers
- Review quality stays consistent
New model (AI-assisted):
- AI produces high-quality code 95%+ of the time
- Humans become complacent (“AI usually gets it right”)
- 5% error rate slips through fatigued review
Solution: Automate the boring verification (syntax, types, tests), reserve human attention for creative/strategic review.
Integration with Other Rules
Section titled “Integration with Other Rules”- Rule 3 (Feature Completeness): Automated tests verify features are actually complete
- Rule 2 (Database Safety): Migration tests catch destructive operations
- Rule 6 (Pattern Following): Linters enforce project conventions automatically
The agent-side counterpart to this paradox is the Verification Gap pattern documented in the TDD workflow.
Integration with Existing Workflows
Section titled “Integration with Existing Workflows”With Plan Mode
Section titled “With Plan Mode”# Before multi-file changes/plan
# Claude enters read-only mode, explores codebase# Identifies patterns, conventions, existing implementations# Proposes plan following project conventions# You review before executionWith Git Hooks
Section titled “With Git Hooks”These rules integrate with existing git workflows:
#!/bin/bash
# Run safety checks./.claude/hooks/production-safety-check.sh
# If blocked, commit failsexit $?With CI/CD
Section titled “With CI/CD”Add validation step:
name: PR Validationon: [pull_request]
jobs: validate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3
- name: Check for half-assing run: | if git diff origin/main...HEAD | grep -E "TODO.*implement|Not implemented"; then echo "❌ PR contains incomplete features" exit 1 fi
- name: Check for unauthorized deps run: | git diff origin/main...HEAD -- package.json | grep -E '^\+.*"[^"]+": "[^"]+"' || exit 0 echo "⚠️ New dependencies detected. Review required."Troubleshooting
Section titled “Troubleshooting””These rules are too restrictive”
Section titled “”These rules are too restrictive””Solution: Adapt based on your team size and stage.
| Team Size | Recommended Rules |
|---|---|
| 1-2 devs | Rules 1, 3, 6 only |
| 3-10 devs | Rules 1, 3, 5, 6 |
| 10+ devs or production | All 6 rules |
”Claude keeps getting blocked”
Section titled “”Claude keeps getting blocked””Solution: Rules are working! Options:
-
Grant temporary permission:
Terminal window # In CLAUDE.md## Temporary Override (expires 2026-01-25)For this feature only: infrastructure changes allowed.Reason: Setting up new microservice. -
Create exception:
{"permissions": {"allow": ["Edit(docker-compose.dev.yml)"],"deny": ["Edit(docker-compose.prod.yml)"]}} -
Review if rule is appropriate:
- Solo dev blocking themselves? → Remove rule
- Team needs flexibility? → Use “ask” instead of “deny"
"How do I enforce rules across team?”
Section titled “"How do I enforce rules across team?””Solution: Commit to repo, not personal config.
# Shared team rules/project/.claude/settings.json # Committed/project/CLAUDE.md # Committed
# Personal overrides/project/.claude/settings.local.json # Gitignored/project/.claude/CLAUDE.md # GitignoredTeam settings take precedence, but individuals can opt-in to stricter rules.
See Also
Section titled “See Also”- Ultimate Guide §9.12 Git Best Practices — Commit workflow, Plan → Act pattern
- Security Hardening Guide — MCP security, secret protection, hook stack
- Data Privacy Guide — MCP database risks, retention policies
- Enterprise AI Governance — Org-level governance: usage charters, MCP approval workflow, guardrail tiers, compliance
- Adoption Approaches — Team setup, shared conventions, enterprise rollout
- Plan Mode — Safe exploration before execution
- Permissions System — Allow/deny rules, hooks
Quick Reference
Section titled “Quick Reference”Rule Severity
Section titled “Rule Severity”| Rule | Severity | Breaking this causes |
|---|---|---|
| 1. Port Stability | 🔴 Critical | Team downtime, deployment failures |
| 2. Database Safety | 🔴 Critical | Data loss, customer impact |
| 3. Feature Completeness | 🟡 High | Production bugs, tech debt |
| 4. Infrastructure Lock | 🟠 High | Downtime, security issues |
| 5. Dependency Safety | 🟡 Medium | Bundle bloat, license issues |
| 6. Pattern Following | 🟢 Low | Code inconsistency, maintenance burden |
Enforcement Methods
Section titled “Enforcement Methods”| Method | Strictness | Setup Time | Best For |
|---|---|---|---|
| Permission deny | 100% (blocks) | 2 min | Critical rules (1, 2, 4) |
| Pre-tool hooks | 100% (blocks) | 10 min | Custom logic, team-specific |
| CLAUDE.md rules | ~70% (Claude respects) | 5 min | Conventions, guidelines |
| Post-tool warnings | ~30% (warns only) | 5 min | Best practices, suggestions |
| Git hooks | 100% (blocks commits) | 15 min | Final safety net before push |
Common Patterns
Section titled “Common Patterns”Allow staging changes, block production:
{ "permissions": { "allow": ["Edit(docker-compose.dev.yml)"], "deny": ["Edit(docker-compose.prod.yml)"] }}Require confirmation for sensitive ops:
{ "permissions": { "ask": ["Bash(rm -rf *)", "Bash(DROP TABLE *)"] }}Temporary override with expiry:
## Temporary Override (expires 2026-02-01)Infrastructure changes allowed for migration project.After expiry: revert to standard rules.Rule 6: Autonomous Loop Safety
Section titled “Rule 6: Autonomous Loop Safety”The Problem
Section titled “The Problem”Autonomous agent loops — a Claude session running unattended for hours, processing a queue, monitoring a system — have a failure mode that’s hard to debug: the process appears to be running but has silently stalled. No error. No exit code. Just nothing happening, consuming your API budget.
This happens when:
- Claude enters a reasoning loop with no exit condition
- A tool call hangs waiting for a resource that’s gone
- The session hits an edge case that produces no output but also no failure
The Rule
Section titled “The Rule”For any autonomous session expected to run longer than a few minutes, implement a heartbeat mechanism. If the heartbeat stops, kill the entire process group — not just the parent process.
Why process group, not just parent: Claude Code spawns child processes (shell commands, tool calls). Killing only the parent leaves orphaned children consuming resources and potentially taking actions without oversight.
Implementation
Section titled “Implementation”Heartbeat writer — runs inside the autonomous loop as a PostToolUse hook:
#!/bin/bash# PostToolUse hook: write timestamp after every tool use
HEARTBEAT_FILE="${CLAUDE_HEARTBEAT_FILE:-/tmp/claude-heartbeat-$$}"date +%s > "$HEARTBEAT_FILE"Dead-man watchdog — runs as a separate process alongside Claude:
#!/bin/bash# Usage: ./scripts/watchdog.sh <timeout_seconds> <pid>
TIMEOUT="${1:-30}"TARGET_PID="${2:-}"HEARTBEAT_FILE="${CLAUDE_HEARTBEAT_FILE:-/tmp/claude-heartbeat-$TARGET_PID}"
if [[ -z "$TARGET_PID" ]]; then echo "Usage: watchdog.sh <timeout_seconds> <pid>" >&2 exit 1fi
while true; do sleep 5
if ! kill -0 "$TARGET_PID" 2>/dev/null; then echo "Watchdog: process $TARGET_PID has exited cleanly" exit 0 fi
if [[ -f "$HEARTBEAT_FILE" ]]; then LAST_BEAT=$(cat "$HEARTBEAT_FILE") NOW=$(date +%s) AGE=$(( NOW - LAST_BEAT ))
if [[ "$AGE" -gt "$TIMEOUT" ]]; then echo "Watchdog: no heartbeat for ${AGE}s (limit: ${TIMEOUT}s) — killing process group" kill -TERM -"$TARGET_PID" 2>/dev/null || kill -TERM "$TARGET_PID" sleep 2 kill -KILL -"$TARGET_PID" 2>/dev/null || true rm -f "$HEARTBEAT_FILE" exit 1 fi fidoneLaunch script that wires them together:
#!/bin/bashexport CLAUDE_HEARTBEAT_FILE="/tmp/claude-heartbeat-$$"
claude --print "Process the task queue in tasks.json" &CLAUDE_PID=$!
./scripts/watchdog.sh 30 "$CLAUDE_PID" &WATCHDOG_PID=$!
wait "$CLAUDE_PID"EXIT_CODE=$?
kill "$WATCHDOG_PID" 2>/dev/nullrm -f "$CLAUDE_HEARTBEAT_FILE"exit "$EXIT_CODE"Tuning the Timeout
Section titled “Tuning the Timeout”| Task type | Recommended timeout |
|---|---|
| Simple file operations | 15–30s |
| Web requests, API calls | 60s |
| Code compilation, test runs | 120s |
| Long-running research tasks | 300s |
Start at 30s and increase only when you observe legitimate pauses longer than your timeout.
When NOT to Use This
Section titled “When NOT to Use This”- Interactive sessions (you’re watching) — the watchdog adds no value
- Tasks under 5 minutes — setup overhead isn’t justified
- Pipelines where failure is expected and handled by retry logic
Credit: Heartbeat dead-man switch pattern for autonomous agents from Everything Claude Code Security Guide (Affaan Mustafa). The process-group kill and watchdog separation are their contributions.
Version: 1.0.0 Last Updated: 2026-01-21 Changelog: Initial version based on community-validated production patterns