Agent systems fail in ways that traditional software does not. A web server either returns a response or throws an exception. An agent might return a confidently wrong answer, get stuck in a retry loop, silently drop context, or cascade a single tool failure across an entire multi-agent pipeline. Handling these failure modes requires patterns beyond standard try/catch blocks.
The Claude Agent SDK exposes hooks and configuration options at every layer — tool execution, agent turns, handoffs, and run-level orchestration. Building reliable systems means using these hooks to detect failures early, recover gracefully, and preserve enough diagnostic information to fix the root cause.
This guide covers the error handling patterns that prevent the most damaging production failures: tool errors that corrupt agent state, cascading failures across agent teams, and silent degradation that goes undetected until users complain.
Invest in structured error handling whenever a system acts autonomously or serves multiple users concurrently. For simple, interactive prototypes where a human reviews every output, basic try/catch may suffice.
Never let tool exceptions propagate unhandled. Return structured error objects that the agent can reason about.
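import { Tool } from "@anthropic-ai/agent-sdk";
import { z } from "zod";

const apiCallTool = new Tool({
  name: "fetch_pricing_data",
  description: "Retrieve current pricing from the external pricing service",
  inputSchema: z.object({
    productId: z.string(),
    region: z.enum(["us", "eu", "apac"]),
  }),
  async execute({ productId, region }) {
    try {
      const response = await fetch(
        // Encode the ID because it comes from model-generated input
        `https://pricing.internal/api/v2/products/${encodeURIComponent(productId)}?region=${region}`,
        { signal: AbortSignal.timeout(5000) } // abort the request after 5s
      );
      if (!response.ok) {
        // Report HTTP failures as data so the agent can decide what to do next
        return {
          success: false,
          error: `Pricing service returned ${response.status}`,
          // Server errors and rate limits (429) are worth retrying; other 4xx are not
          retryable: response.status >= 500 || response.status === 429,
        };
      }
      const data = await response.json();
      return { success: true, data };
    } catch (err) {
      // AbortSignal.timeout() rejects with a DOMException named "TimeoutError"
      const isTimeout = err instanceof DOMException && err.name === "TimeoutError";
      // err is `unknown` under strict TypeScript, so narrow before reading .message
      const message = err instanceof Error ? err.message : String(err);
      return {
        success: false,
        error: isTimeout ? "Pricing service timed out after 5s" : `Unexpected error: ${message}`,
        retryable: isTimeout,
      };
    }
  },
});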
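The result shape returned by the tool above can be made explicit with a discriminated union, so that every tool reports success and failure with the same fields. The following is a minimal sketch; the names `ToolResult`, `ok`, `fail`, and `describeResult` are illustrative assumptions and are not part of any SDK.

```typescript
// Hypothetical result types for tools; not part of any SDK.
type ToolSuccess<T> = { success: true; data: T };
type ToolFailure = { success: false; error: string; retryable: boolean };
type ToolResult<T> = ToolSuccess<T> | ToolFailure;

// Small constructors keep the shape consistent across tools.
function ok<T>(data: T): ToolResult<T> {
  return { success: true, data };
}

function fail(error: string, retryable: boolean): ToolResult<never> {
  return { success: false, error, retryable };
}

// Comparing against the literal `false` narrows the union to ToolFailure,
// so `error` and `retryable` are available inside the branch.
function describeResult<T>(result: ToolResult<T>): string {
  if (result.success === false) {
    return result.retryable
      ? `retryable: ${result.error}`
      : `terminal: ${result.error}`;
  }
  return "ok";
}
```

With a shared type like this, a caller can branch on `success` and `retryable` without parsing error strings.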
For retryable failures, wrap agent runs in a retry mechanism that doubles the delay after each failed attempt, up to a fixed cap (exponential backoff).
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

async function runWithRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = { maxRetries: 3, baseDelayMs: 1000, maxDelayMs: 10000 }
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Normalize non-Error throwables so the checks below are safe
      lastError = err instanceof Error ? err : new Error(String(err));
      if (attempt === config.maxRetries) break; // out of attempts

      // Only transient failures are retried; anything else is rethrown immediately
      const isRetryable =
        lastError.message.includes("rate_limit") ||
        lastError.message.includes("overloaded") ||
        lastError.message.includes("timeout");
      if (!isRetryable) throw lastError;

      // Exponential backoff: baseDelayMs * 2^attempt, capped at maxDelayMs
      const delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt),
        config.maxDelayMs
      );
      console.warn(`Attempt ${attempt + 1} failed, retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage with an agent run (researchAgent and userQuery are defined elsewhere)
const result = await runWithRetry(() =>
  researchAgent.run(userQuery, { maxTurns: 10 })
);
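The retry wrapper above decides retryability by matching substrings in the error message, which can break when a provider rewords its errors. When the thrown error carries an HTTP status, classifying by status code is likely to be more stable. The sketch below assumes errors expose a numeric `status` property, which is a convention of this example rather than a guarantee of any particular client; `statusOf` and `isRetryableError` are hypothetical helper names.

```typescript
// Reads a numeric `status` field from an unknown error value, if present.
// The `status` field is an assumed convention; check what your client attaches.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// 408 (request timeout), 429 (rate limit), and 5xx are treated as transient.
// Other statuses, such as 400, 401, and 403, are treated as terminal.
function isRetryableError(err: unknown): boolean {
  const status = statusOf(err);
  if (status !== undefined) {
    return status === 408 || status === 429 || status >= 500;
  }
  // AbortSignal.timeout() rejects with an error named "TimeoutError"
  return err instanceof Error && err.name === "TimeoutError";
}
```

A function like this could replace the `isRetryable` expression inside the `catch` block, leaving the backoff logic unchanged.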
When an external service is consistently failing, stop calling it temporarily rather than wasting tokens on repeated failures.
class CircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private readonly threshold: number = 5,
    private readonly resetTimeMs: number = 30000
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime > this.resetTimeMs) {
        // Cool-down elapsed: let one trial call through
        this.state = "half-open";
      } else {
        console.warn("Circuit breaker open, using fallback");
        return fallback();
      }
    }
    try {
      const result = await fn();
      // Any success closes the circuit and clears the count, so the
      // threshold measures consecutive failures rather than lifetime failures
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      this.lastFailureTime = Date.now();
      if (this.failures >= this.threshold) {
        this.state = "open";
        console.error(`Circuit breaker tripped after ${this.failures} failures`);
      }
      return fallback();
    }
  }
}

// Trip after 3 consecutive failures; allow a trial call after 60s
const pricingBreaker = new CircuitBreaker(3, 60000);

// fetchPricingFromAPI and getCachedPrice are assumed to be defined elsewhere
const resilientPricingTool = new Tool({
  name: "fetch_pricing_resilient",
  description: "Retrieve pricing with circuit breaker protection",
  inputSchema: z.object({ productId: z.string() }),
  async execute({ productId }) {
    return pricingBreaker.execute(
      () => fetchPricingFromAPI(productId),
      () => ({
        success: false,
        error: "Pricing service temporarily unavailable. Use cached or estimated pricing.",
        cached: getCachedPrice(productId),
      })
    );
  },
});
Error handling in multi-agent teams requires coordination at the orchestration layer. In a Sequential Pipeline, a failure in stage two should not silently pass corrupted data to stage three.
import { Agent } from "@anthropic-ai/agent-sdk";

const resilientPipeline = new Agent({
  name: "resilient-pipeline",
  model: "claude-sonnet-4-20250514",
  instructions: `You coordinate a multi-stage pipeline. After each agent
completes, validate its output before passing to the next stage.
Validation rules:
- Research agent must return at least 3 sources
- Analysis agent must include confidence scores between 0 and 1
- If validation fails, retry the failing agent once with clarified instructions
- If retry fails, return a partial result with a clear explanation of what failed`,
  handoffs: [
    { agent: researchAgent, condition: "Start with research gathering" },
    { agent: analysisAgent, condition: "When research passes validation" },
    { agent: reportAgent, condition: "When analysis passes validation" },
  ],
});
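Validation rules written into instructions depend on the model applying them consistently. If stage outputs are available as structured data, the same rules can also be enforced in code between stages. The sketch below is one way to do that; the `ResearchOutput` and `AnalysisOutput` shapes and the `runValidated` helper are assumptions made for illustration, not SDK types.

```typescript
// Hypothetical stage output shapes; adjust to whatever your agents return.
interface ResearchOutput {
  sources: string[];
}
interface AnalysisOutput {
  findings: { claim: string; confidence: number }[];
}
interface Validation {
  valid: boolean;
  reason?: string;
}

// Rule: the research stage must return at least 3 sources.
function validateResearch(output: ResearchOutput): Validation {
  if (output.sources.length < 3) {
    return { valid: false, reason: `expected at least 3 sources, got ${output.sources.length}` };
  }
  return { valid: true };
}

// Rule: every confidence score must lie between 0 and 1.
function validateAnalysis(output: AnalysisOutput): Validation {
  for (const finding of output.findings) {
    // Written as a negated range check so NaN is rejected too
    if (!(finding.confidence >= 0 && finding.confidence <= 1)) {
      return { valid: false, reason: `confidence out of range for "${finding.claim}"` };
    }
  }
  return { valid: true };
}

// Runs a stage, validates its output, and retries once, passing the
// failure reason back to the stage as a clarifying hint.
async function runValidated<T>(
  run: (hint?: string) => Promise<T>,
  validate: (output: T) => Validation
): Promise<{ output: T; validation: Validation }> {
  let output = await run();
  let validation = validate(output);
  if (!validation.valid) {
    output = await run(validation.reason);
    validation = validate(output);
  }
  return { output, validation };
}
```

If `validation.valid` is still `false` after the retry, the caller can stop the pipeline and return a partial result with `validation.reason` attached, rather than passing the output to the next stage.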
In Parallel Worker patterns, use Promise.allSettled rather than Promise.all so that one agent's failure does not prevent collecting results from agents that succeeded. The orchestrator can then decide whether partial results are sufficient or if the entire operation should be retried.
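A minimal sketch of that approach follows. It assumes each worker is an async function wrapping an agent run; `runWorkers` and `WorkerOutcome` are hypothetical names introduced for this example.

```typescript
interface WorkerOutcome<T> {
  succeeded: T[];
  failed: { index: number; reason: string }[];
}

// Starts all workers concurrently, waits for every one to settle,
// and separates successful results from failures.
async function runWorkers<T>(
  workers: Array<() => Promise<T>>
): Promise<WorkerOutcome<T>> {
  const settled = await Promise.allSettled(workers.map((worker) => worker()));
  const outcome: WorkerOutcome<T> = { succeeded: [], failed: [] };

  settled.forEach((result, index) => {
    if (result.status === "fulfilled") {
      outcome.succeeded.push(result.value);
    } else {
      // Rejection reasons can be any value, so normalize to a string
      const reason =
        result.reason instanceof Error ? result.reason.message : String(result.reason);
      outcome.failed.push({ index, reason });
    }
  });

  return outcome;
}
```

The orchestrator can then inspect `outcome.failed` and apply whatever policy suits the task, for example proceeding when a minimum number of workers succeeded and rerunning only the failed indexes otherwise.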
Return errors as data, not exceptions. When a tool fails, return a structured object describing the failure. The agent can then reason about the error and decide on next steps, rather than having the entire run abort.
Set timeouts at every boundary. Every HTTP call, every database query, and every agent run should have an explicit timeout. A missing timeout is a latency bomb waiting to go off under load.
Distinguish retryable from terminal errors. Rate limits and timeouts are retryable. Authentication failures and invalid input are not. Retrying a terminal error wastes time and tokens.
Log the full error chain. When an agent run fails, capture which tool failed, what input it received, and what error it returned. In multi-agent systems, also log which agent was active and what turn number the failure occurred on.
Test failure paths explicitly. Write tests that simulate tool timeouts, malformed API responses, and rate limit errors. The failure paths are the ones that break in production, and they are the ones most often untested.
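One way to make failure paths testable is to pass the HTTP client into the tool body, so a test can substitute stubs that time out, return a rate limit, or return malformed JSON. The sketch below is framework-agnostic and uses a small `check` helper instead of a test library; `FetchLike`, `fetchPricing`, and the stubs are all hypothetical names, and the simplified `fetchPricing` stands in for a real tool body.

```typescript
// Minimal fetch-like signature so tests can inject a stub (an assumption of this sketch).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

type ToolResult =
  | { success: true; data: unknown }
  | { success: false; error: string; retryable: boolean };

// Simplified tool body that receives its HTTP client as a parameter.
async function fetchPricing(fetchFn: FetchLike, url: string): Promise<ToolResult> {
  try {
    const response = await fetchFn(url);
    if (!response.ok) {
      const retryable = response.status >= 500 || response.status === 429;
      return { success: false, error: `status ${response.status}`, retryable };
    }
    return { success: true, data: await response.json() };
  } catch (err) {
    const isTimeout = err instanceof Error && err.name === "TimeoutError";
    return { success: false, error: isTimeout ? "timeout" : "unexpected error", retryable: isTimeout };
  }
}

// Stubs simulating a timeout, a rate limit, and a malformed response body.
const timeoutStub: FetchLike = async () => {
  const err = new Error("The operation timed out");
  err.name = "TimeoutError";
  throw err;
};
const rateLimitStub: FetchLike = async () => ({
  ok: false,
  status: 429,
  json: async () => ({}),
});
const malformedStub: FetchLike = async () => ({
  ok: true,
  status: 200,
  json: async () => {
    throw new SyntaxError("Unexpected token < in JSON");
  },
});

function check(condition: boolean, label: string): void {
  if (!condition) throw new Error(`Failed: ${label}`);
}

async function testFailurePaths(): Promise<void> {
  const url = "https://example.test/products/1";

  const timedOut = await fetchPricing(timeoutStub, url);
  check(timedOut.success === false && timedOut.retryable === true, "timeout is retryable");

  const limited = await fetchPricing(rateLimitStub, url);
  check(limited.success === false && limited.retryable === true, "rate limit is retryable");

  const malformed = await fetchPricing(malformedStub, url);
  check(malformed.success === false && malformed.retryable === false, "malformed body is terminal");
}
```

Calling `testFailurePaths()` from a test runner exercises all three paths without touching the network; the same stubs can be reused to test retry and circuit breaker behavior.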