Agent Tools: Beyond Claude Code

Claude Code is one tool in a field that has expanded dramatically since 2024. Dozens of agent frameworks, autonomous coders, and multi-agent systems have shipped, each with different trade-offs. This page maps that field so you can decide when Claude Code is the right call, and when something else fits better.

What this page covers: terminal coding agents, autonomous coders, multi-agent orchestration frameworks, and agent orchestration tooling. Claude Code’s own multi-agent capabilities (agent teams, event-driven workflows, programmatic usage) are documented separately, linked throughout.

What it does not cover: GUI-based AI coding IDEs (Cursor, Windsurf, Cline), which are covered in AI Ecosystem §6. Multi-Claude orchestration tools (Gas Town, multiclaude, Conductor desktop app) are in Third-Party Tools: Multi-Agent Orchestration.


Agent tools fall on a spectrum from interactive to autonomous:

Interactive pair programmer
    Claude Code, Codex CLI, Aider, Goose
        |
    Hermes Agent (interactive + scheduled + messaging gateways)
        |
Autonomous issue fixer
    SWE-agent, Devin, claude -p in CI
        |
Multi-agent framework (build your own)
    CrewAI, LangGraph, AutoGen/MAF

Interactive agents: you stay in the loop, approve actions, redirect the agent. Best for daily coding, debugging, and exploratory work where requirements shift.

Autonomous agents: you assign a task and come back to a result. Best for well-specified, bounded tasks: fix this bug, implement this spec, review this PR. The quality of the task description determines the quality of the output more than the agent choice.

Multi-agent frameworks: libraries for building custom agent systems. Not coding tools themselves. You use LangGraph to build an agent, not to write code.


These tools do what Claude Code does: sit in your terminal, read your codebase, write code, run commands. The differences are in model support, cost model, and specific capabilities.


Codex CLI is OpenAI’s direct answer to Claude Code. It launched in April 2025, is built in Rust, and is open-sourced under Apache 2.0.

| Attribute | Details |
| --- | --- |
| GitHub | openai/codex |
| Stars | 86,200+ (May 2026) |
| Install | npm install -g @openai/codex |
| Language | Rust (96%) |
| License | Apache 2.0 |
| Version | v0.134.0 (May 26, 2026) |
| Releases | 800+ since April 2025 |
| Contributors | 400+ |

A terminal AI agent for writing, editing, and running code, built on OpenAI’s model family. The architecture mirrors Claude Code closely: you describe a task, the agent reads files, makes edits, runs tests, and iterates. The main difference is the model provider: Codex CLI talks to GPT-4o, o3, o4-mini, and other OpenAI models, not Claude.

ChatGPT Pro and Team subscribers get Codex CLI usage included in their plan, making it a zero-marginal-cost tool for teams already paying for OpenAI.

| Aspect | Claude Code | Codex CLI |
| --- | --- | --- |
| Models | Claude 3.5/4 family only | GPT-4o, o3, o3-mini, o4-mini, plus future OpenAI models |
| Language | TypeScript | Rust |
| License | Open source | Apache 2.0 |
| Subscription | Anthropic Claude Max ($20-$200/mo) | OpenAI ChatGPT Pro/Team ($20-$30/mo) |
| MCP Support | Native, growing ecosystem | MCP compatible |
| Release cadence | Weekly | Very high (800+ releases in 13 months) |
| Memory | CLAUDE.md + Auto Memory | AGENTS.md convention |
| Skills/Hooks | Full system | Compatible with agentskills.io standard |

Good fit if you are already on a ChatGPT Pro or Team plan and want to avoid a second subscription. Also the right call if you prefer GPT-4o or o3 for specific tasks (reasoning, long-context analysis) and want a terminal agent that uses those models natively.

Poor fit if your team has invested in Claude Code workflows, CLAUDE.md files, and Anthropic-specific patterns. The cognitive cost of context-switching between two agent environments is real.

    npm install -g @openai/codex
    export OPENAI_API_KEY=sk-...
    codex

OpenAI’s Codex docs cover setup in detail.


Hermes Agent is the most starred open-source agent framework as of May 2026. It was created by Nous Research, the AI lab known for its Hermes series of fine-tuned models. It was called OpenClaw until late 2025, when it rebranded after Anthropic reinstated subscription support.

| Attribute | Details |
| --- | --- |
| GitHub | NousResearch/hermes-agent |
| Stars | 170,000+ (May 2026) |
| Install | pip install hermes-agent or curl -sSL install.hermes-agent.dev \| sh |
| Language | Python (89%), TypeScript (8%) |
| License | MIT |
| Version | v0.14.0 (May 16, 2026) |
| Release cadence | Weekly (v0.10 Apr 16 → v0.14 May 16) |
| Contributors | 215+ |
| Creator | Nous Research (Teknium, @teknium1) |

A self-improving terminal agent that works with 200+ LLM providers, runs on any platform, and connects to 22 messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, Teams, LINE, SimpleX, and more). The distinguishing feature is its learning loop: after completing tasks, Hermes analyzes what worked, extracts reusable patterns, and generates skills automatically. Each session makes the agent marginally better at your specific workflows.

The OpenClaw history matters for two reasons. First, the migration path is clean: hermes-agent imports OpenClaw memories, skills, and settings during setup, so switching costs are low. Second, the Anthropic billing controversy from early 2026 was specifically about OpenClaw/Hermes being used on Claude Max subscriptions without proper programmatic billing attribution. Anthropic now explicitly includes Hermes in the programmatic usage bucket (see Billing: Programmatic vs Interactive).

| Aspect | Claude Code | Hermes Agent |
| --- | --- | --- |
| Models | Claude only | 200+ via OpenRouter, OpenAI, Anthropic, HuggingFace, local |
| Self-improvement | Each session starts fresh | Skills auto-generated from recurring patterns |
| Messaging | Terminal + IDE | Terminal + 22 chat platforms |
| Cron scheduling | Routines (Anthropic cloud) | Built-in cron, runs locally |
| Billing | Subscription or API | Pay your LLM provider directly |
| Agent SDK | Anthropic-specific | ctx.llm plugin for any provider |
| Skills | SKILL.md system | Skills Hub (agentskills.io) + auto-generated |
| Memory | CLAUDE.md + Auto Memory | Cross-session persistent memory, agent-curated |

The model-agnostic case is the strongest argument. If you want to run Claude for code generation, GPT-4o for specific reasoning tasks, and a local model (via Ollama) for offline work, Hermes handles all three in a single agent. Claude Code cannot.

The self-improving loop is genuinely differentiated. Over 30-40 sessions on the same codebase, Hermes builds a library of skills specific to your project’s patterns. This compounds in a way that static CLAUDE.md files do not, though the comparison is complex because CLAUDE.md is human-authored and intentional while Hermes skills are machine-generated.

The 22 messaging platform integrations are useful for teams that want to interact with their agent via Telegram or Slack rather than a terminal. Not a priority for most developers, but critical for some workflows.

Poor fit if you are invested in Anthropic’s ecosystem (Claude Max subscription, Routines, the Agent SDK). Running Hermes with Claude models hits the programmatic billing bucket, meaning your $200/mo Max subscription’s $200 credit gets consumed by both interactive terminal use and Hermes API calls. Factor that in.

    pip install hermes-agent
    # Or one-line installer
    curl -sSL install.hermes-agent.dev | sh
    # Import from OpenClaw if migrating
    hermes import --from openclaw
    # Start
    hermes

Aider is the original terminal AI pair programmer. Launched in 2023 by Paul Gauthier, before Claude Code existed, it established many of the conventions that later tools adopted: direct file editing, automatic git commits, multi-file context windows.

| Attribute | Details |
| --- | --- |
| GitHub | Aider-AI/aider |
| Stars | 45,400+ (May 2026) |
| Install | pip install aider-install && aider-install |
| Language | Python (80%) |
| License | Apache 2.0 |
| Creator | Paul Gauthier (paul-gauthier) |
| PyPI downloads | 5.3M+ |

A Python-based coding assistant that edits files in your local git repo and auto-commits with descriptive messages. Key characteristic: near-universal model support via LiteLLM, covering GPT-4o, Claude 3.5/4, Gemini, Ollama, and dozens of other providers. Aider popularized the “whole file” and “diff” editing formats that informed how later agents handle file modifications.

The SWE-Bench benchmark trajectory tells the story well: Aider held the top score on SWE-Bench Verified for several months in 2024-2025 before larger-context models and more capable agents surpassed it. That benchmark record established its reputation as a serious tool, not just a convenience wrapper.

| Aspect | Claude Code | Aider |
| --- | --- | --- |
| Model support | Claude only | GPT-4o, Claude, Gemini, Ollama, 50+ providers |
| Git integration | Native (reads .git, runs git) | Deep (auto-commits, commit messages, blame context) |
| Architecture | Anthropic proprietary | Open source, LiteLLM under the hood |
| File editing | Tool-based (Edit, Write) | Whole-file or diff format sent to model |
| Web search | Via MCP | Not native (requires plugin) |
| Agentic loop | Full (multi-turn, tool use) | Full (auto-accepts changes in architect mode) |
| Release cadence | Weekly | Monthly (last: v0.86.0, Aug 2025) |

The last release date (August 2025) is worth noting. Aider remains maintained and functional, but the release cadence has slowed relative to Claude Code and Hermes. This is not a warning sign by itself, but worth checking if you need cutting-edge features.

Best case: you need multi-model support in a mature, battle-tested tool and do not want the operational overhead of Hermes. Aider is simpler to configure than Hermes, has a smaller footprint, and has years of community documentation.

Also a good fit for teams that have strong git discipline and want every AI change explicitly committed with a clear message. Aider’s auto-commit behavior is more aggressive than Claude Code’s (which asks before committing by default).

    pip install aider-install && aider-install
    # With Claude
    export ANTHROPIC_API_KEY=sk-ant-...
    aider --model claude-sonnet-4-6
    # With GPT-4o
    export OPENAI_API_KEY=sk-...
    aider

See aider.chat for the full model list and configuration options.


Goose is a general-purpose agent, not just a coding tool. It was originally built by Block (formerly Square) and later transferred to the Linux Foundation’s AAIF (Agentic AI Infrastructure Foundation) for long-term governance neutrality.

Full coverage in AI Ecosystem §11.1: Goose.

Quick stats: 45,900+ stars (May 2026), Rust (63%) + TypeScript (30%), Apache 2.0, daily active development, 368+ contributors. The headline difference from Claude Code: provider-agnostic (Claude, GPT, Gemini, Ollama, 15+ providers), with recipe-based reusable workflows and heterogeneous subagent teams where each subagent can run a different model.


These tools run without you watching. You give them a task description (a GitHub issue, a spec, a bug report), and they produce a pull request. The interaction model is fundamentally different from terminal agents: less iterative, more like assigning work to a colleague.


Devin, from Cognition, is the first commercial fully autonomous software engineer. It is closed-source, cloud-hosted, and enterprise-priced.

| Attribute | Details |
| --- | --- |
| Website | devin.ai |
| Type | Cloud SaaS, proprietary |
| Pricing | Core: $20/mo (pay-as-you-go ACUs), Team: $500/mo (250 ACUs), Enterprise: custom |
| Launched | 2024 |
| Valuation | $25B (April 2026 fundraise) |
| Notable acquisition | Windsurf AI-native IDE (July 2025) |
| Enterprise customers | Goldman Sachs, Microsoft, Palantir, Citi, Dell |

An autonomous software engineer that runs in a cloud-based Linux VM with its own shell, code editor, and browser. Devin plans its approach, writes code, runs tests, reads error messages, and iterates until the task is complete or it gets stuck. The primary interface is Slack: you send a message like “fix issue #342” and Devin opens a PR when done.

Billing is in ACUs (Agent Compute Units), where 1 ACU maps to roughly 15 minutes of agent work. A complex feature might consume 10-20 ACUs; a simple bug fix might use 1-3.
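To make the ACU arithmetic concrete, here is a rough estimator. It uses only the figures quoted in this section (the Team plan at $500 for 250 ACUs, and roughly 15 minutes of agent work per ACU). The function name and the assumption that every ACU is billed at the Team plan's effective rate are illustrative, not Devin's actual billing logic.

```python
# Rough ACU cost estimator built from the figures quoted in this section.
# Assumption: every ACU is billed at the Team plan's effective rate.
TEAM_PLAN_USD = 500
TEAM_PLAN_ACUS = 250
MINUTES_PER_ACU = 15  # "roughly 15 minutes of agent work"

USD_PER_ACU = TEAM_PLAN_USD / TEAM_PLAN_ACUS


def estimate_task(acus: int) -> dict:
    """Return approximate agent time and cost for a task of `acus` ACUs."""
    return {
        "acus": acus,
        "agent_hours": acus * MINUTES_PER_ACU / 60,
        "cost_usd": acus * USD_PER_ACU,
    }


if __name__ == "__main__":
    for label, acus in [("simple bug fix", 3), ("complex feature", 20)]:
        print(label, estimate_task(acus))
```

Under these assumptions, the 250 ACUs in the Team plan correspond to about a dozen 20-ACU features per month, which is a useful sanity check before committing to the plan.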

| Aspect | Claude Code | Devin |
| --- | --- | --- |
| Execution environment | Your local machine | Cloud Linux VM (sandboxed) |
| Interaction model | Interactive (you watch) | Async (assign and check back) |
| State | Session-scoped | Persistent across the task |
| Pricing | Subscription ($20-$200/mo) | Per-task ACU billing (about $2/ACU on the Team plan: $500 for 250 ACUs) |
| Who drives | You (pair programming) | Agent (autonomous, you review) |
| Task specification | Conversational, iterative | Upfront (better spec = better output) |
| Browser access | Via MCP (Playwright) | Built-in, native |
| Code review integration | You review in your IDE | Devin posts a PR, you review on GitHub |

Devin works best when the task is well-specified, bounded, and does not require continuous judgment calls. Refactoring a specific module, implementing a documented API endpoint, fixing a regression with a known root cause: these are Devin tasks. Designing a new system architecture, debugging an obscure production issue, or writing code that depends on implicit context in your codebase: these require a more interactive loop.

The $500/month Team plan (250 ACUs) is substantial. At that price point, you are paying for the async value: developers not blocked waiting for agent output, agents running in parallel on multiple tasks, no context switching. If your bottleneck is developer attention rather than raw throughput, Devin is worth the calculation. If you want to stay in the loop and iterate interactively, Claude Code at $200/month delivers more value per dollar.

The Windsurf acquisition (July 2025) signals Cognition moving toward a full developer environment, not just a background agent. Watch for integrated workflows combining interactive coding (Windsurf IDE) and autonomous task execution (Devin) in the same product.


SWE-agent is an academic agent designed specifically for resolving GitHub issues from an issue description alone. It was introduced in a NeurIPS 2024 paper from the Princeton NLP Group and Stanford.

| Attribute | Details |
| --- | --- |
| GitHub | SWE-agent/SWE-agent |
| Stars | 19,300+ (May 2026) |
| Paper | NeurIPS 2024 |
| License | MIT |
| Language | Python (95%) |
| Version | v1.1.0 (May 2025) |
| Maintainers | Princeton NLP Group + Stanford |

An agent pipeline that takes a GitHub issue URL and a model, then attempts to reproduce the bug, write a fix, and produce a patch. Its architecture uses an Agent-Computer Interface (ACI) layer that abstracts terminal, file editing, and test running into a consistent set of commands regardless of the underlying environment. This ACI design is the main academic contribution: it shows that agent performance correlates strongly with how well the environment exposes information, not just with the model’s raw capability.
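The ACI idea can be sketched as a small, fixed command vocabulary that returns concise, structured feedback instead of raw terminal output. The class, its two commands, and the output format below are hypothetical and exist only to illustrate the design; they are not SWE-agent's actual command set.

```python
# Toy Agent-Computer Interface: a fixed command set with concise, structured feedback.
# All names here are hypothetical; this is not SWE-agent's real command set.
from dataclasses import dataclass, field


@dataclass
class ToyACI:
    files: dict = field(default_factory=dict)  # path -> list of lines

    def open(self, path: str, start: int = 1, window: int = 3) -> str:
        """Show a numbered window of a file instead of dumping the whole thing."""
        lines = self.files[path]
        chunk = lines[start - 1:start - 1 + window]
        body = "\n".join(f"{start + i}: {text}" for i, text in enumerate(chunk))
        return f"[{path}: {len(lines)} lines total]\n{body}"

    def edit(self, path: str, line: int, new_text: str) -> str:
        """Replace one line and confirm the change explicitly."""
        self.files[path][line - 1] = new_text
        return f"edited {path}:{line}"

    def run(self, command: str, *args) -> str:
        """Dispatch a command by name; unknown commands get an actionable error."""
        handler = {"open": self.open, "edit": self.edit}.get(command)
        if handler is None:
            return f"unknown command '{command}'; available: open, edit"
        return handler(*args)


aci = ToyACI(files={"app.py": ["def add(a, b):", "    return a - b", ""]})
print(aci.run("open", "app.py"))
print(aci.run("edit", "app.py", 2, "    return a + b"))
```

The point of the sketch is that every response is short, labeled, and tells the model what it can do next, which is the kind of environment design the paper argues matters as much as model capability.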

SWE-agent paired with Claude 3.7 is reported as state of the art on SWE-Bench Full. The benchmark is the key context: SWE-Bench measures the percentage of real GitHub issues an agent can resolve end-to-end, and SWE-agent was designed with that benchmark as its optimization target.

Primarily academic and research use. If you want to run systematic evaluations of how different models perform on real GitHub issues, SWE-agent is the right tool because it has the reproducibility infrastructure (trajectory logging, evaluation harness, config YAML) that production tools skip.

For production batch issue resolution, Devin’s cloud sandbox and better error recovery make it more practical. SWE-agent requires you to set up the environment and handle failures manually.

The research value is real: teams building agent systems can use SWE-agent’s trajectory data (generated from issue resolution runs) to fine-tune models. Nous Research’s SWE-agent-LM-32b (open weights, SoTA on SWE-Bench for open models) was trained on trajectories generated by SWE-agent.

    pip install swe-agent
    # Run on a GitHub issue
    sweagent run \
      --agent.model.name=claude-sonnet-4-6 \
      --env.repo.github_url=https://github.com/org/repo \
      --problem_statement.github_url=https://github.com/org/repo/issues/123

Claude Code’s own autonomous mode: claude -p "task" runs a single instruction non-interactively and exits. Combined with CI/CD, it becomes an autonomous agent that triggers on GitHub events, runs on schedule via Routines, or processes tasks programmatically via the Agent SDK.

This falls in the programmatic billing bucket since June 15, 2026. See Billing: Programmatic vs Interactive for the credit limits and overage rates.

Patterns:

    # Single task, exits when done
    claude -p "Write tests for src/auth.ts, aim for 80% coverage"
    # GitHub Actions: triggered by issue label
    # See workflows/event-driven-agents.md for the full pattern
    # Agent SDK: programmatic with tools
    # See ai-ecosystem.md §14 (Claude Managed Agents)
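For scripted use, the single-task pattern can be wrapped in a few lines of Python. This is a minimal sketch: only the `claude -p "<task>"` invocation comes from the text above, while the function names, the timeout, and the behavior when the CLI is missing are assumptions made for illustration.

```python
# Minimal wrapper for running Claude Code headlessly from a script or CI job.
# Only `claude -p "<task>"` is taken from the text; everything else is illustrative.
import shutil
import subprocess
from typing import List, Optional


def build_headless_command(task: str) -> List[str]:
    """Build the argv list for a single non-interactive run."""
    if not task.strip():
        raise ValueError("task description must not be empty")
    return ["claude", "-p", task]


def run_headless(task: str, timeout_s: int = 600) -> Optional[str]:
    """Run the task and return stdout, or None if the CLI is not installed."""
    if shutil.which("claude") is None:
        return None
    result = subprocess.run(
        build_headless_command(task),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return result.stdout


print(build_headless_command("Write tests for src/auth.ts, aim for 80% coverage"))
```

Passing the task as a single argv element avoids shell quoting problems when the description comes from an issue title or other untrusted text.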

Cross-references:


These are not coding tools. They are libraries for building custom multi-agent applications from scratch: marketing pipelines, research automation, document processing, customer support bots. You would use them if you are building a product that has AI agents inside it, not if you are a developer wanting an agent to write code for you.

The relationship to Claude Code: Claude (the model) can be one of the LLMs powering agents built with these frameworks. The frameworks themselves do not compete with Claude Code any more than Express.js competes with a browser.


CrewAI is a framework for role-based multi-agent orchestration. It is the dominant choice for teams that want to define agents by job function (Researcher, Writer, Editor) and let them collaborate on structured tasks.

| Attribute | Details |
| --- | --- |
| GitHub | crewAIInc/crewAI |
| Stars | 52,300+ (May 2026) |
| Language | Python (99%) |
| License | MIT |
| Version | v1.14.5 (May 18, 2026) |
| Executions | 2B+ agent task executions reported |
| Downloads | 27M+ |
| Enterprise customers | 150+ |

You define agents with a role, a goal, and a backstory (the “crew”). You define tasks and assign them to agents. CrewAI handles routing: sequential (A finishes, then B starts), parallel (A and B run simultaneously), or hierarchical (a manager agent delegates to specialists). Each agent can use tools, including MCP servers and web search. Multiple LLM providers supported (Claude, GPT, Gemini, Ollama).

It stands apart from LangChain (the older framework it frequently gets compared to) because it does not depend on LangChain at all. Standalone Python library.

The right level of abstraction for teams that can describe their workflow in human roles. If you can say “I want a researcher who gathers information, a writer who drafts, and an editor who refines,” CrewAI handles the orchestration and inter-agent communication. You write agent definitions, not orchestration code.

Avoid it when your workflow has complex conditional branching, requires durable execution across failures, or needs fine-grained control over how state passes between agents. LangGraph handles those cases better.

    from crewai import Agent, Task, Crew, Process

    researcher = Agent(
        role="Technical Researcher",
        goal="Find accurate technical information",
        backstory="Expert at synthesizing documentation and research papers",
        llm="claude-sonnet-4-6",
    )
    writer = Agent(
        role="Technical Writer",
        goal="Write clear, accurate documentation",
        backstory="Experienced at translating technical concepts",
        llm="claude-sonnet-4-6",
    )
    # The researcher's task runs first; its output reaches the writer via `context`
    research_task = Task(
        description="Research the new auth API endpoints",
        expected_output="Notes on each endpoint, its parameters, and its responses",
        agent=researcher,
    )
    writing_task = Task(
        description="Document the new auth API endpoints",
        expected_output="Markdown documentation with examples",
        agent=writer,
        context=[research_task],  # researcher's output feeds writer
    )
    crew = Crew(
        agents=[researcher, writer],
        tasks=[research_task, writing_task],
        process=Process.sequential,
    )
    result = crew.kickoff()

LangGraph is a graph-based agent orchestration library from LangChain. It is lower-level than CrewAI, more flexible, and better suited to complex stateful workflows.

| Attribute | Details |
| --- | --- |
| GitHub | langchain-ai/langgraph |
| Stars | 33,100+ (May 2026) |
| Language | Python (99%) + JS version available |
| License | MIT |
| Version | v1.2.2 (May 26, 2026) |
| Production users | Klarna, Replit, Elastic |

An agent construction framework that models workflows as directed graphs with nodes (agent steps) and edges (transitions). The key primitives are state (a typed dict that persists across all nodes), conditional edges (branching based on state), and persistence (checkpointing so an interrupted workflow resumes from the last checkpoint, not from scratch). Human-in-the-loop is a first-class pattern: you can pause execution at any node and wait for a human decision before continuing.

LangGraph does not bundle agents. You define the workflow logic and plug in whatever LLM you want. The framework ensures that state transitions are predictable, failures are recoverable, and the workflow can be debugged step by step.

The right tool when your agent needs to survive failures, branch on runtime conditions, or require human approval at specific decision points. Examples: a code review pipeline that escalates to a human when the agent detects a security-relevant change; a data processing workflow that checkpoints after each expensive step so restarts do not re-process completed stages; a multi-step research agent that pauses for human guidance when it hits ambiguous source material.

Steeper learning curve than CrewAI. Worth it when the workflow complexity justifies the investment.

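    from typing import TypedDict
    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import StateGraph, END

    class AgentState(TypedDict):
        messages: list
        task_complete: bool

    # Placeholder node and router functions; replace the bodies with real LLM and tool calls
    def call_agent(state: AgentState) -> dict:
        return {"task_complete": True}
    def call_tools(state: AgentState) -> dict:
        return {"messages": state["messages"]}
    def should_continue(state: AgentState) -> str:
        return "end" if state["task_complete"] else "continue"

    graph = StateGraph(AgentState)
    graph.add_node("agent", call_agent)
    graph.add_node("tools", call_tools)
    graph.add_conditional_edges("agent", should_continue, {"continue": "tools", "end": END})
    graph.add_edge("tools", "agent")
    graph.set_entry_point("agent")
    app = graph.compile(checkpointer=MemorySaver())  # In-memory checkpoints; use a persistent saver for durable execution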

LangSmith (LangChain’s observability product) integrates natively for debugging and tracing agent runs.
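The checkpoint-and-resume behavior described in this section can be illustrated without the framework. The sketch below is plain Python, not LangGraph's API: the `run_pipeline` function, the JSON checkpoint file, and the stage names are all assumptions chosen to show why a restart does not re-process completed stages.

```python
# Plain-Python illustration of checkpoint-and-resume; this is not LangGraph's API.
import json
import tempfile
from pathlib import Path


def run_pipeline(stages, checkpoint_path: Path) -> list:
    """Run stages in order, skipping any that a previous run already completed."""
    done = json.loads(checkpoint_path.read_text()) if checkpoint_path.exists() else []
    executed = []
    for name, step in stages:
        if name in done:
            continue  # finished in an earlier run, do not redo it
        step()
        done.append(name)
        checkpoint_path.write_text(json.dumps(done))  # persist after every stage
        executed.append(name)
    return executed


def failing_step():
    raise RuntimeError("simulated crash")


def noop():
    return None


checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"

try:
    run_pipeline([("fetch", noop), ("transform", failing_step)], checkpoint)
except RuntimeError:
    pass  # the first run dies in the second stage, after "fetch" was checkpointed

resumed = run_pipeline([("fetch", noop), ("transform", noop), ("load", noop)], checkpoint)
print(resumed)
```

The second call skips the stage that was already recorded and continues from the failure point, which is the property a persistent checkpointer provides for a real graph.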


AutoGen/MAF is Microsoft’s multi-agent framework, currently mid-transition from the original AutoGen library (maintenance mode since September 2025) to the Microsoft Agent Framework (MAF), which merges AutoGen and Semantic Kernel into one SDK.

| Attribute | Details (MAF) |
| --- | --- |
| GitHub | microsoft/agent-framework |
| Stars | 10,800+ (MAF, active) |
| Legacy GitHub | microsoft/autogen (58,400 stars, maintenance mode since Sep 2025) |
| Language | Python + C# + TypeScript |
| License | MIT |
| Version | python-1.6.0 (May 22, 2026) |
| Production release | v1.0 (April 2026) |

MAF is the merge of AutoGen (Python, conversational multi-agent) and Semantic Kernel (C# + Python, function-calling abstractions). The result is a cross-runtime framework: Python agents can coordinate with .NET agents, all backed by the same messaging layer. It implements the A2A (Agent-to-Agent) protocol, Microsoft’s contribution to agent interoperability, and supports MCP.

The AutoGen star count (58,400) reflects its historical reputation. AutoGen pioneered the “conversable agent” pattern where agents talk to each other in a structured conversation loop. That pattern is still the dominant mental model in the framework even as the implementation evolved.

Strong fit for Microsoft ecosystem teams: .NET + Python shops, Azure deployments, enterprise environments where Semantic Kernel is already established. The cross-runtime story is real: a Python agent can call tools implemented as .NET Semantic Kernel functions.

Less compelling for teams without existing .NET investment. If you are Python-only, CrewAI or LangGraph have larger communities and more tutorials.


The Anthropic Agent SDK is Anthropic’s own framework for building multi-agent systems programmatically, distinct from Claude Code. It is covered in detail in AI Ecosystem §14: Claude Managed Agents.

The operative distinction: Claude Code is a finished product you use as a developer; the Agent SDK is a library you use to build products that have Claude inside them. The Agent SDK handles tool use, context management, and multi-agent coordination via the Messages API. It is also in the programmatic billing bucket (see billing cross-reference above).


Tools that sit above agent frameworks and manage how agents are deployed, routed, and operated at scale. Not to be confused with multi-Claude orchestration tools (Gas Town, multiclaude) which are covered in Third-Party Tools.


A development methodology, not a product. “Conductor” started as an extension for Gemini CLI that enforces a Context, Spec, Plan, Implement workflow: before writing any code, the agent creates and commits a spec document, then a plan document, then implements against both.

| Attribute | Details |
| --- | --- |
| GitHub | gemini-cli-extensions/conductor |
| Stars | 3,600+ (May 2026) |
| License | Apache 2.0 |

The methodology has been ported to Claude Code via community repos: lackeyjb/claude-conductor, ryanmac/code-conductor, and the wshobson/agents plugin marketplace. None of these have significant traction on their own, but the pattern itself (spec-before-code, committed documentation) maps directly to Claude Code’s Spec-First Development workflow.
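The spec-before-code rule amounts to a simple gate: implementation is allowed only once the earlier documents are committed. The sketch below is a hypothetical illustration of that gate. The file names and the `next_stage` function are assumptions, not the extension's actual layout or API.

```python
# Hypothetical gate for a Context -> Spec -> Plan -> Implement workflow.
# File names are assumptions for illustration, not the extension's layout.
STAGES = [("context", "context.md"), ("spec", "spec.md"), ("plan", "plan.md")]


def next_stage(committed_files: set) -> str:
    """Return the earliest stage whose document is missing, else 'implement'."""
    for stage, required_doc in STAGES:
        if required_doc not in committed_files:
            return stage
    return "implement"


print(next_stage(set()))
print(next_stage({"context.md", "spec.md"}))
print(next_stage({"context.md", "spec.md", "plan.md"}))
```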


Microsoft Conductor is an entirely separate project from the Gemini CLI extension of the same name. It is a YAML-first CLI for deterministic multi-agent workflows where the routing logic is static configuration, not LLM decisions.

| Attribute | Details |
| --- | --- |
| GitHub | microsoft/conductor |
| Stars | 158 (May 2026, brand new) |
| License | MIT |
| Launched | May 14, 2026 (Microsoft Open Source Blog) |

Core idea: define your agent workflow in YAML (which agents run in sequence, which in parallel, which model each uses, what gets passed between stages) and execute it deterministically. No LLM in the orchestration loop, only in the agent steps. Supports both GitHub Copilot SDK and Anthropic Agent SDK as providers. Very early stage (158 stars, days old at time of writing), but backed by Microsoft’s open-source team.
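The core idea, static routing with model calls confined to the agent steps, can be sketched in a few lines. The configuration schema, agent names, and runner below are hypothetical stand-ins written in Python rather than YAML; they do not reflect Microsoft Conductor's actual file format or API.

```python
# Deterministic workflow runner: routing comes from static config, never from an LLM.
# The config schema and agent functions are hypothetical, not Microsoft Conductor's format.
from concurrent.futures import ThreadPoolExecutor

WORKFLOW = [
    {"stage": "research", "agents": ["docs_reader", "code_reader"]},  # run in parallel
    {"stage": "write", "agents": ["writer"]},                         # runs after research
]

# Stand-ins for model-backed agent steps; each receives the previous stage's outputs.
AGENTS = {
    "docs_reader": lambda inputs: "docs summary",
    "code_reader": lambda inputs: "code summary",
    "writer": lambda inputs: "draft based on: " + ", ".join(inputs),
}


def run_workflow(workflow, agents) -> list:
    """Execute stages in listed order; agents within a stage run concurrently."""
    outputs = []
    for stage in workflow:
        previous = outputs  # what this stage's agents get as input
        with ThreadPoolExecutor() as pool:
            # map() preserves the configured agent order, so results are reproducible
            outputs = list(pool.map(lambda name: agents[name](previous), stage["agents"]))
    return outputs


print(run_workflow(WORKFLOW, AGENTS))
```

Because the stage order and fan-out are fixed in `WORKFLOW`, two runs with the same agent outputs take exactly the same path, which is what makes this style easier to audit than LLM-driven routing.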


Hermes Control Room is a community template by Shann (@shannhk, Lisbon) for managing a fleet of Hermes agents on a VPS. It is not a Nous Research project.

| Attribute | Details |
| --- | --- |
| GitHub | shannhk/hermes-agent-control-room |
| Stars | 474 (May 2026) |
| Age | 12 days (as of May 27, 2026) |
| Type | Template/documentation, not executable software |

The concept: a folder structure with governance docs, a registry of deployed agents, runbooks for common operations, and 8 bundled Hermes skills for VPS provisioning, task routing, backup, security auditing, and cron planning. Agents share a filesystem-based task bus (inbox/working/outbox/archive per specialty). The orchestrator reads the control room docs to know agent capabilities, routes tasks via the bus, and synthesizes results.
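A filesystem task bus of this kind can be sketched with nothing but directories and file moves. The four state names come from the description above; the helper functions, the JSON payload, and the example specialty are assumptions made for illustration and are not taken from the repo.

```python
# Filesystem task bus sketch: a task moves inbox -> working -> outbox -> archive.
# Function names and the payload format are assumptions, not the repo's actual layout.
import json
import tempfile
from pathlib import Path

STATES = ("inbox", "working", "outbox", "archive")


def init_bus(root: Path, specialty: str) -> Path:
    """Create the four state directories for one agent specialty."""
    bus = root / specialty
    for state in STATES:
        (bus / state).mkdir(parents=True, exist_ok=True)
    return bus


def submit(bus: Path, task_id: str, payload: dict) -> Path:
    """Drop a new task file into the inbox."""
    path = bus / "inbox" / f"{task_id}.json"
    path.write_text(json.dumps(payload))
    return path


def advance(bus: Path, task_id: str, src: str, dst: str) -> Path:
    """Move a task file between states with a single rename."""
    target = bus / dst / f"{task_id}.json"
    (bus / src / f"{task_id}.json").rename(target)
    return target


bus = init_bus(Path(tempfile.mkdtemp()), "security-audit")
submit(bus, "task-001", {"goal": "audit open ports"})
advance(bus, "task-001", "inbox", "working")   # agent claims the task
advance(bus, "task-001", "working", "outbox")  # agent publishes its result
print(sorted(p.name for p in (bus / "outbox").iterdir()))
```

Using a rename for each transition means a task file is only ever in one state directory, so an orchestrator can learn the status of every task by listing folders.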

The pattern is sound for anyone running 3+ Hermes agents. The specific repo is too new (7 commits) to recommend as a production dependency. Watch for a v1.0 with more operational hardening.


| Tool | Open Source | Stars | Model Support | Mode | Language | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Code | Yes (TS) | 112K | Claude only | Interactive + headless | TypeScript | $20-$200/mo |
| Codex CLI | Yes | 86K | GPT-4o, o3, o4-mini | Interactive + headless | Rust | Included in ChatGPT Pro/Team |
| Hermes Agent | Yes (MIT) | 170K | 200+ providers | Interactive + cron + messaging | Python | Pay-per-LLM-call |
| Aider | Yes | 45K | 50+ providers | Interactive | Python | Pay-per-LLM-call |
| Goose | Yes | 46K | 15+ providers | Interactive + subagents | Rust | Pay-per-LLM-call |
| Devin | No | N/A | Proprietary | Fully autonomous | Proprietary | $20-$500/mo |
| SWE-agent | Yes (MIT) | 19K | Any (Claude, GPT…) | Autonomous (issue → PR) | Python | Pay-per-LLM-call |
| CrewAI | Yes (MIT) | 52K | 50+ providers | Framework (build your own) | Python | Framework is free |
| LangGraph | Yes (MIT) | 33K | Any | Framework | Python/JS | Framework is free |
| AutoGen/MAF | Yes (MIT) | 58K/11K | Any | Framework | Python/C#/TS | Framework is free |

| Situation | Recommended |
| --- | --- |
| Daily coding, already on Claude Max | Claude Code |
| Daily coding, already on ChatGPT Pro | Codex CLI |
| Daily coding, want any model | Hermes Agent or Aider |
| Daily coding, general-purpose agent | Goose |
| Assign a task, come back to a PR | Devin ($500/mo) or claude -p in CI |
| Fix GitHub issues autonomously, research/benchmark | SWE-agent |
| Orchestrate multiple Claude Code instances | Gas Town, multiclaude, Ruflo (see Third-Party Tools) |
| Build a multi-agent product with roles | CrewAI |
| Build a stateful, recoverable workflow | LangGraph |
| Build in .NET + Python with Microsoft stack | AutoGen/MAF |
| Anthropic ecosystem, cloud-hosted agents | Anthropic Agent SDK (see ai-ecosystem.md §14) |
| Manage a fleet of Hermes agents on VPS | Hermes Control Room pattern |

The single most clarifying question for choosing between Claude Code, Codex CLI, Hermes, Aider, and Goose: does the tool need to work with exactly one model provider, or multiple?

If you are committed to Claude and the Anthropic ecosystem (subscription, Routines, Agent SDK, CLAUDE.md tooling), Claude Code is unambiguously the right choice. The integration is native and the feature velocity from Anthropic is high.

If you need model flexibility (local models for sensitive code, cheaper models for routine tasks, specific models for benchmarking), Hermes Agent handles the broadest range with the most automation. Aider and Goose are simpler alternatives with smaller footprints.

If your team is OpenAI-first and already paying for ChatGPT Pro, Codex CLI costs nothing incremental.
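The provider question above reduces to a short lookup. The sketch below encodes only the recommendations stated in this section; the function name, the provider labels, and the `wants_simple_setup` flag are illustrative assumptions rather than any official selection rule.

```python
# Encodes the provider-lock-in question from this section as a lookup.
# The function, provider labels, and flag are illustrative assumptions.
def recommend_terminal_agent(providers: set, wants_simple_setup: bool = False) -> str:
    """Suggest a terminal agent from the set of model providers a team needs."""
    if providers == {"anthropic"}:
        return "Claude Code"
    if providers == {"openai"}:
        return "Codex CLI"
    # Several providers or local models: pick a model-agnostic agent.
    return "Aider or Goose" if wants_simple_setup else "Hermes Agent"


print(recommend_terminal_agent({"anthropic"}))
print(recommend_terminal_agent({"anthropic", "openai", "local"}))
print(recommend_terminal_agent({"openai", "local"}, wants_simple_setup=True))
```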

Higher autonomy means the agent can complete more work without you watching, but also means more ways to go off track on ambiguous tasks. The right autonomy level depends on how well-specified your tasks are, not on which tool is “more powerful.”

Claude Code headless (claude -p) and SWE-agent give you controlled autonomy: you set the task, the agent runs, you review the output. Devin gives you maximal autonomy with a cloud sandbox: the agent has a full Linux environment and can take actions you did not anticipate. More power, more review required before merging.

Interactive agents (Claude Code terminal, Hermes, Aider, Goose) give you real-time control. You watch the agent think, redirect it when it goes wrong, and approve destructive actions. For exploratory work where requirements shift mid-session, interactive is faster than autonomous despite appearing more manual.