Claude Code is a long-running, networked process that talks to the Anthropic API, executes shell commands, reads and writes files, and manages sub-agents — all of which can fail. Rather than letting errors propagate arbitrarily, the codebase has a layered error architecture that classifies failures, decides whether to retry or surface them, and ensures the user always sees a meaningful message.
Source files covered: utils/errors.ts → utils/toolErrors.ts → utils/errorLogSink.ts → services/api/withRetry.ts → services/api/errors.ts → ink/components/ErrorOverview.tsx → utils/conversationRecovery.ts → components/SentryErrorBoundary.ts
There are four distinct layers in the error stack:
- Typed Error Classes (utils/errors.ts): a vocabulary of named exceptions for each failure domain
- API Retry Engine (services/api/withRetry.ts): exponential backoff, 529 fallback, auth refresh, context overflow auto-adjust
- Terminal Error Overlay (ink/components/ErrorOverview.tsx): inline source excerpt with highlighted crash line
- Conversation Recovery (utils/conversationRecovery.ts): resume from interrupted/mid-turn sessions by cleaning and replaying transcript
Typed Error Classes (utils/errors.ts)
Every major failure mode gets its own named class. This is not just style — it lets callers use instanceof checks that survive minification and class-name mangling in production builds.
| Class | Purpose | Key fields |
|---|---|---|
| ClaudeError | Base class; sets this.name to the subclass constructor name | none |
| AbortError | User-initiated cancellation (Escape / Ctrl-C) | name = 'AbortError' |
| MalformedCommandError | Slash-command parse failures | none |
| ConfigParseError | Corrupt or unreadable config file; carries the default to fall back to | filePath, defaultConfig |
| ShellError | Shell command exit with non-zero code | stdout, stderr, code, interrupted |
| TeleportOperationError | Teleport SSH operations that need a formatted user-facing message | formattedMessage |
| TelemetrySafeError_I_VERIFIED_... | Errors that are safe to send to telemetry (no file paths, no code) | telemetryMessage |
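To make the table concrete, here is a minimal sketch of how such a hierarchy could be declared. Only the class names and field names come from the table; the constructor signatures, the message text, and the describeFailure helper are assumptions made for illustration.

```typescript
// Sketch only: constructor signatures are assumed, field names follow the table.
class ClaudeError extends Error {
  constructor(message: string) {
    super(message)
    // Use the subclass constructor name so logs show e.g. "ShellError"
    this.name = new.target.name
  }
}

class ShellError extends ClaudeError {
  constructor(
    public readonly stdout: string,
    public readonly stderr: string,
    public readonly code: number,
    public readonly interrupted: boolean,
  ) {
    super(`Shell command failed with exit code ${code}`)
  }
}

class ConfigParseError extends ClaudeError {
  constructor(
    message: string,
    public readonly filePath: string,
    public readonly defaultConfig: unknown,
  ) {
    super(message)
  }
}

// Hypothetical caller: branch on the class, not on message text.
function describeFailure(e: unknown): string {
  if (e instanceof ShellError) return `exit ${e.code}: ${e.stderr || e.stdout}`
  if (e instanceof ConfigParseError) return `bad config at ${e.filePath}; using defaults`
  return e instanceof Error ? e.message : String(e)
}
```

Because each branch narrows the type, the fields in the "Key fields" column are available without casts.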
The isAbortError three-way check
The abort signal comes from three different sources depending on context, and their class names mangle in production builds. The helper guards all three:
export function isAbortError(e: unknown): boolean {
  return (
    e instanceof AbortError || // our own class
    e instanceof APIUserAbortError || // SDK's class, checked by instanceof
    (e instanceof Error && e.name === 'AbortError') // DOMException from AbortController
  )
}
Why not check e.name for all three? APIUserAbortError never sets this.name, and minified builds mangle constructor names to short strings like 'nJT'. String matching silently fails in production. The comment in the source makes this explicit.
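The failure mode can be reproduced without a minifier. In the sketch below, a class declared with a short name stands in for the SDK class after mangling; the alias APIUserAbortErrorRef is a hypothetical name for the symbol that callers import.

```typescript
// Stand-in for the SDK's APIUserAbortError after minification: the imported
// binding still points at the class, but its runtime name is mangled.
class nJT extends Error {}
const APIUserAbortErrorRef = nJT // hypothetical alias for the imported symbol

const thrown: unknown = new APIUserAbortErrorRef('Request was aborted.')

// String matching fails: name is the inherited 'Error', constructor.name is 'nJT'.
const matchesByName =
  thrown instanceof Error &&
  (thrown.name === 'APIUserAbortError' ||
    thrown.constructor.name === 'APIUserAbortError')

// Identity check works: instanceof compares prototypes, not strings.
const matchesByInstance = thrown instanceof APIUserAbortErrorRef
```

Here matchesByName evaluates to false and matchesByInstance to true, which is the asymmetry the helper relies on.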
Utility helpers — catch-site boundaries
Rather than casting unknown to Error everywhere, a small set of functions normalize the catch value at the boundary:
// Normalize unknown → Error (use at catch-site when you need an Error instance)
export function toError(e: unknown): Error {
  return e instanceof Error ? e : new Error(String(e))
}

// Only need the message string (for logging, display)
export function errorMessage(e: unknown): string {
  return e instanceof Error ? e.message : String(e)
}

// Trim stack to top-N frames so tool_result payloads don't waste tokens
export function shortErrorStack(e: unknown, maxFrames = 5): string {
  if (!(e instanceof Error)) return String(e)
  if (!e.stack) return e.message
  const lines = e.stack.split('\n')
  const header = lines[0] ?? e.message
  const frames = lines.slice(1).filter(l => l.trim().startsWith('at '))
  if (frames.length <= maxFrames) return e.stack
  return [header, ...frames.slice(0, maxFrames)].join('\n')
}
shortErrorStack is specifically designed for tool results sent to the model. Full stacks are 500–2000 characters of mostly internal frames. Truncating to 5 frames keeps the model's context window free for what matters.
Filesystem error helpers
Node.js filesystem errors carry an errno code on the error object, but TypeScript types a caught value as unknown (or any, depending on compiler settings), so reading .code requires a cast. The codebase replaces the unsafe cast pattern with typed helpers:
// Safe alternative to (e as NodeJS.ErrnoException).code
export function getErrnoCode(e: unknown): string | undefined
// Covers: ENOENT | EACCES | EPERM | ENOTDIR | ELOOP
export function isFsInaccessible(e: unknown): e is NodeJS.ErrnoException
isFsInaccessible covers five errno codes because any of them can appear when trying to access a path — a file named .claude where a directory is expected triggers ENOTDIR; circular symlinks give ELOOP. Checking only ENOENT would miss these.
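The source shows only the signatures, so the bodies below are a guess at how such helpers might be written, not the codebase's implementation. To stay self-contained, the sketch uses a local ErrnoLike type in place of NodeJS.ErrnoException, and readOptional is a hypothetical caller.

```typescript
// Sketch under assumptions: a local ErrnoLike type replaces NodeJS.ErrnoException.
type ErrnoLike = Error & { code: string }

// The five codes listed in the text.
const INACCESSIBLE_CODES = new Set(['ENOENT', 'EACCES', 'EPERM', 'ENOTDIR', 'ELOOP'])

function getErrnoCode(e: unknown): string | undefined {
  if (typeof e !== 'object' || e === null || !('code' in e)) return undefined
  const code = (e as { code: unknown }).code
  return typeof code === 'string' ? code : undefined
}

function isFsInaccessible(e: unknown): e is ErrnoLike {
  const code = getErrnoCode(e)
  return e instanceof Error && code !== undefined && INACCESSIBLE_CODES.has(code)
}

// Hypothetical caller: treat "path not reachable" as "nothing there", rethrow the rest.
function readOptional(read: () => string): string | undefined {
  try {
    return read()
  } catch (e) {
    if (isFsInaccessible(e)) return undefined
    throw e
  }
}
```

Passing the read operation in as a callback keeps the sketch free of filesystem imports; in practice the callback would wrap a call such as readFileSync.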
The telemetry safety discipline
TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS is intentionally long. The name forces the developer to consciously acknowledge that the error message contains no sensitive data before using it. The two-argument form lets you log a full message to the user (with file paths) while sending a sanitized version to telemetry:
// Two-arg form: full message for user, scrubbed for telemetry
throw new TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS(
  `MCP tool timed out after ${ms}ms`, // full message
  'MCP tool timed out' // telemetry message (no timing data)
)
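A class supporting that call shape might look roughly like this. The constructor signature is inferred from the usage above and the telemetryMessage field from the class table; the one-argument fallback and the telemetryText helper are assumptions.

```typescript
// Sketch only: the constructor shape is assumed from the two-argument usage.
class TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS extends Error {
  readonly telemetryMessage: string

  constructor(message: string, telemetryMessage?: string) {
    super(message)
    this.name = 'TelemetrySafeError'
    // Assumed one-arg behavior: the author vouches that the message itself is safe.
    this.telemetryMessage = telemetryMessage ?? message
  }
}

// Hypothetical reporter: only vetted text is forwarded.
function telemetryText(e: unknown): string {
  return e instanceof TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS
    ? e.telemetryMessage
    : 'unknown error' // never forward arbitrary messages
}
```

The reporter never reads e.message, so an error that was not explicitly vetted cannot leak paths or code into telemetry.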
utils/toolErrors.ts
When a tool fails, its error must be formatted for two consumers: the terminal (for the user) and the model (in a tool_result block). toolErrors.ts handles both, including a hard 10,000-character cap with center-truncation to protect context budgets.
export function formatError(error: unknown): string {
  if (error instanceof AbortError) {
    return error.message || INTERRUPT_MESSAGE_FOR_TOOL_USE
  }
  if (!(error instanceof Error)) return String(error)
  const parts = getErrorParts(error)
  const fullMessage = parts.filter(Boolean).join('\n').trim()
    || 'Command failed with no output'
  if (fullMessage.length <= 10000) return fullMessage
  // Center-truncate: keep head + tail of large outputs
  const halfLength = 5000
  return `${fullMessage.slice(0, halfLength)}\n\n...${fullMessage.length - 10000} characters truncated...\n\n${fullMessage.slice(-halfLength)}`
}
For ShellError, the parts are assembled in priority order: exit code first, then stderr, then stdout. This mirrors what a developer would want to see — the most diagnostic information first.
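The ordering can be sketched as follows. getErrorParts is not shown in the source, so its body here is an assumption, as is the "Exit code N" wording; the ShellError class is a local stand-in with the fields listed earlier.

```typescript
// Sketch: a local ShellError stand-in with the fields listed for the real class.
class ShellError extends Error {
  constructor(
    readonly stdout: string,
    readonly stderr: string,
    readonly code: number,
    readonly interrupted: boolean,
  ) {
    super('Shell command failed')
  }
}

// Assumed shape of getErrorParts: most diagnostic information first.
function getErrorParts(error: Error): string[] {
  if (error instanceof ShellError) {
    return [`Exit code ${error.code}`, error.stderr, error.stdout]
  }
  return [error.message]
}

const failure = new ShellError('built 3 of 4 targets', 'ld: symbol not found', 1, false)
// Same assembly step as formatError: drop empty parts, join with newlines.
const formatted = getErrorParts(failure).filter(Boolean).join('\n').trim()
```

With these inputs, formatted holds three lines: the exit code, then stderr, then stdout. An empty stderr would simply be dropped by filter(Boolean).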
Zod validation errors → LLM-friendly messages
When the model calls a tool with input that fails schema validation, the resulting ZodError is converted to a structured English message that the model can understand and correct:
// Input: ZodError with two issues — missing param + wrong type
// Output:
"FileEditTool failed due to the following issues:
The required parameter `old_string` is missing
The parameter `new_string` type is expected as `string` but provided as `number`"
By naming the exact parameter path (for example todos[0].activeForm) and specifying expected vs. received types, the message lets the model self-correct on the next attempt without a human in the loop.
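A conversion along these lines could be sketched as below. SchemaIssue is a hypothetical, reduced stand-in for a Zod issue (real issues carry more fields and codes), and the sketch assumes that a missing parameter is reported with received set to 'undefined'. The output wording follows the example shown earlier.

```typescript
// Hypothetical minimal issue shape, loosely modeled on a Zod invalid_type issue.
interface SchemaIssue {
  path: (string | number)[]
  expected: string
  received: string
}

// ['todos', 0, 'activeForm'] -> 'todos[0].activeForm'
function formatPath(path: (string | number)[]): string {
  return path.reduce<string>((acc, seg) => {
    if (typeof seg === 'number') return `${acc}[${seg}]`
    return acc ? `${acc}.${seg}` : seg
  }, '')
}

function formatIssues(toolName: string, issues: SchemaIssue[]): string {
  const lines = issues.map(issue => {
    const param = formatPath(issue.path)
    // Assumption: a missing required parameter arrives as received === 'undefined'.
    return issue.received === 'undefined'
      ? `The required parameter \`${param}\` is missing`
      : `The parameter \`${param}\` type is expected as \`${issue.expected}\` but provided as \`${issue.received}\``
  })
  return `${toolName} failed due to the following issues:\n${lines.join('\n')}`
}
```

Numeric path segments become bracketed indices, so a nested failure inside an array points at the exact element rather than at the whole parameter.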
utils/errorLogSink.ts
Error logging is decoupled from the actual write implementation through a sink pattern. log.ts is dependency-free — it queues events until a sink is attached. errorLogSink.ts contains the heavy implementation (file I/O, axios enrichment) and is initialized once during startup.
The sink writes errors as JSONL (one JSON object per line) to a date-stamped file. Each entry includes timestamp, session ID, cwd, and version. For axios errors it extracts the request URL, HTTP status, and server error body — the three fields most useful for diagnosing API issues.
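The queue-until-attached idea can be illustrated in a few lines. Every name in this sketch (LogEntry, LogSink, logError, attachSink) is hypothetical, and the entry is reduced to a timestamp and a message; the real entries carry the additional fields described above.

```typescript
// Sketch of a queue-until-attached sink. All names are hypothetical.
interface LogEntry {
  timestamp: string
  message: string
}
type LogSink = (entry: LogEntry) => void

let sink: LogSink | null = null
const pending: LogEntry[] = []

// Dependency-free side: callable from anywhere, even before startup finishes.
function logError(message: string): void {
  const entry: LogEntry = { timestamp: new Date().toISOString(), message }
  if (sink) sink(entry)
  else pending.push(entry)
}

// Heavy side: called once during startup; drains anything queued so far, in order.
function attachSink(next: LogSink): void {
  sink = next
  for (const entry of pending.splice(0)) next(entry)
}
```

Errors raised before the sink exists are not lost; they are written as soon as attachSink runs, and later errors go straight through.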
// Buffered JSONL writer: flushes every 1 second or after 50 entries
// On first write: mkdirSync creates parent dirs, then appendFileSync
// Registered with cleanupRegistry so it flushes on process exit
function createJsonlWriter(options: {
  writeFn: (content: string) => void
  flushIntervalMs?: number // default: 1000
  maxBufferSize?: number // default: 50
}): JsonlWriter
The appendToLog function returns early when process.env.USER_TYPE !== 'ant', so error logs are written only for internal Anthropic employees, not external users. This prevents accumulating potentially sensitive user data in log files.
API Retry Engine (services/api/withRetry.ts)
withRetry is an async generator that wraps every Anthropic API call. It implements a sophisticated retry policy that handles transient network failures, rate limits, auth token expiry, context overflow, and the Claude-specific 529 overload status.
Retry delay formula
The backoff uses exponential delay with up to 25% additive jitter (the code adds between 0% and 25% of the base delay and never subtracts) to avoid a thundering herd when many clients hit rate limits simultaneously:
export function getRetryDelay(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000 // honor server directive
  }
  const baseDelay = Math.min(
    500 * Math.pow(2, attempt - 1), // 500ms, 1s, 2s, 4s... cap 32s
    maxDelayMs,
  )
  const jitter = Math.random() * 0.25 * baseDelay
  return baseDelay + jitter
}
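To see the resulting schedule, the formula can be evaluated with the random source pinned to its extremes. The function below repeats the no-header branch of getRetryDelay; the injected random parameter is an addition for this sketch so the numbers are reproducible.

```typescript
// Same formula as the no-header branch of getRetryDelay, with the random
// source injected. The `random` parameter is an addition for this sketch.
function retryDelayMs(attempt: number, random: () => number, maxDelayMs = 32000): number {
  const baseDelay = Math.min(500 * Math.pow(2, attempt - 1), maxDelayMs)
  return baseDelay + random() * 0.25 * baseDelay
}

const noJitter = () => 0
const maxJitter = () => 1 // upper bound; Math.random() itself stays below 1

const schedule = [1, 2, 3, 4, 5, 6, 7, 8].map(attempt => ({
  attempt,
  minMs: retryDelayMs(attempt, noJitter),
  maxMs: retryDelayMs(attempt, maxJitter),
}))
// attempt 1: 500 to 625 ms
// attempt 4: 4000 to 5000 ms
// attempt 7 and later: 32000 to 40000 ms (the cap applies before jitter)
```

Because the cap is applied to the base delay and jitter is added afterwards, the longest possible wait with the default cap is just under 40 seconds, not 32.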
The 529 overloaded error — special handling
HTTP 529 is a Claude-specific status code meaning the API is overloaded. It gets its own logic because:
- Background query sources (title generators, summarizers) bail immediately — retrying from dozens of clients during a capacity event would amplify the cascade
- After 3 consecutive 529s on an Opus model, the engine falls back to a configured fallback model via FallbackTriggeredError
- The SDK sometimes fails to set status=529 during streaming, so the fallback checks error.message.includes('"type":"overloaded_error"')
export function is529Error(error: unknown): boolean {
  if (!(error instanceof APIError)) return false
  return (
    error.status === 529 ||
    // SDK streaming bug: status not set, check message content
    (error.message?.includes('"type":"overloaded_error"') ?? false)
  )
}
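The consecutive-529 rule from the list above could be tracked with a small counter. This is a sketch under several assumptions: the constant name, the track529 helper, the FallbackTriggeredError constructor, resetting the streak on any non-529 outcome, and detecting Opus by substring are all illustrative rather than taken from the source. The model names are placeholders.

```typescript
const MAX_CONSECUTIVE_529 = 3 // threshold from the text; the constant name is hypothetical

// Local stand-in; the real class's constructor is not shown in the source.
class FallbackTriggeredError extends Error {
  constructor(readonly originalModel: string, readonly fallbackModel: string) {
    super(`Falling back from ${originalModel} to ${fallbackModel}`)
  }
}

// Called once per failed attempt. Returns the updated streak, or throws
// when the caller should switch to the fallback model.
function track529(
  consecutive529s: number,
  is529: boolean,
  model: string,
  fallbackModel: string | undefined,
): number {
  if (!is529) return 0 // assumption: any other outcome breaks the streak
  const next = consecutive529s + 1
  // Assumption: Opus models are recognized by name.
  if (next >= MAX_CONSECUTIVE_529 && fallbackModel && model.includes('opus')) {
    throw new FallbackTriggeredError(model, fallbackModel)
  }
  return next
}
```

Signalling the switch with an exception lets the retry loop unwind cleanly, while the caller that catches FallbackTriggeredError decides how to restart the request on the other model.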
Context overflow auto-adjustment
When a request is rejected with a 400 "input length and max_tokens exceed context limit" error, withRetry parses the token counts from the message and automatically reduces maxTokens for the next attempt — without requiring any user action:
// Error message format:
// "input length and `max_tokens` exceed context limit: 188059 + 20000 > 200000"
// Auto-adjustment:
const availableContext = Math.max(0, contextLimit - inputTokens - 1000) // safety buffer
retryContext.maxTokensOverride = Math.max(FLOOR_OUTPUT_TOKENS, availableContext, minRequired)
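As a worked example, the sample message can be parsed and adjusted as follows. The regular expression and both function names are assumptions, floorOutputTokens is a hypothetical stand-in for FLOOR_OUTPUT_TOKENS with an arbitrary value, and the minRequired term is left out because the source does not define it.

```typescript
interface OverflowInfo {
  inputTokens: number
  maxTokens: number
  contextLimit: number
}

// Matches the "<input> + <max_tokens> > <limit>" tail of the error message.
function parseContextOverflow(message: string): OverflowInfo | undefined {
  const match = /(\d+)\s*\+\s*(\d+)\s*>\s*(\d+)/.exec(message)
  if (!match) return undefined
  const [, input, max, limit] = match
  return {
    inputTokens: Number(input),
    maxTokens: Number(max),
    contextLimit: Number(limit),
  }
}

// floorOutputTokens is a hypothetical stand-in for FLOOR_OUTPUT_TOKENS;
// 3000 is an arbitrary value chosen for the sketch.
function adjustedMaxTokens(info: OverflowInfo, floorOutputTokens = 3000): number {
  const availableContext = Math.max(0, info.contextLimit - info.inputTokens - 1000)
  return Math.max(floorOutputTokens, availableContext)
}

const sample = 'input length and `max_tokens` exceed context limit: 188059 + 20000 > 200000'
const parsedOverflow = parseContextOverflow(sample)
// 200000 - 188059 - 1000 = 10941 tokens left for output on the retry
```

For the sample message the override drops from the requested 20000 to 10941, which leaves the 1000-token safety buffer in place.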
Persistent retry mode (CLAUDE_CODE_UNATTENDED_RETRY)
For unattended / CI sessions, setting CLAUDE_CODE_UNATTENDED_RETRY=1 enables indefinite retry on 429/529 with a maximum backoff of 5 minutes and a 6-hour total cap. Long waits are chunked into 30-second heartbeat yields so the host environment (CI runner, tmux session) does not mark the process idle.
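The chunking step is simple arithmetic: one long wait becomes a series of sleeps of at most 30 seconds, with a heartbeat emitted between them. The helper below is a hypothetical illustration of that split only; it does not sleep or yield anything itself.

```typescript
// Split one long backoff into heartbeat-sized sleeps. The name is hypothetical.
function heartbeatChunks(totalMs: number, chunkMs = 30000): number[] {
  const chunks: number[] = []
  for (let remaining = totalMs; remaining > 0; remaining -= chunkMs) {
    chunks.push(Math.min(chunkMs, remaining))
  }
  return chunks
}

const seventyFiveSeconds = heartbeatChunks(75000) // [30000, 30000, 15000]
const fiveMinutes = heartbeatChunks(5 * 60 * 1000) // ten 30-second chunks
```

A retry loop built on this would sleep for each chunk in turn and yield a status update after every one, so even the maximum 5-minute backoff produces visible activity every 30 seconds.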
services/api/errors.ts
All user-visible error strings are defined as named constants in one file. This makes them searchable and testable, and prevents message drift between the place that throws and the place that detects:
export const INVALID_API_KEY_ERROR_MESSAGE = 'Not logged in · Please run /login'
export const TOKEN_REVOKED_ERROR_MESSAGE = 'OAuth token revoked · Please run /login'
export const REPEATED_529_ERROR_MESSAGE = 'Repeated 529 Overloaded errors'
export const API_TIMEOUT_ERROR_MESSAGE = 'Request timed out'
export const CREDIT_BALANCE_TOO_LOW_ERROR_MESSAGE = 'Credit balance is too low'
Interactive vs. non-interactive sessions get different guidance for media errors. The same getImageTooLargeErrorMessage() function returns "Double press esc to go back" for REPL users and "Try resizing the image" for SDK/headless callers:
export function getImageTooLargeErrorMessage(): string {
  return getIsNonInteractiveSession()
    ? 'Image was too large. Try resizing the image or using a different approach.'
    : 'Image was too large. Double press esc to go back and try again with a smaller image.'
}
Terminal Error Overlay (ink/components/ErrorOverview.tsx)
When an unhandled exception reaches the Ink render tree, ErrorOverview displays it directly in the terminal — not a stack dump, but a formatted overlay with the crash location and inline source context. It uses two libraries: StackUtils to parse V8 stack frames, and code-excerpt to read and display the relevant source lines.
// 1. Parse the first stack frame to get file + line + column
const stack = error.stack ? error.stack.split('\n').slice(1) : undefined
const origin = stack ? getStackUtils().parseLine(stack[0]!) : undefined

// 2. Read source file synchronously (sync OK: error overlay, can't go async)
const sourceCode = readFileSync(filePath, 'utf8')
excerpt = codeExcerpt(sourceCode, origin.line)

// 3. Render: crash line is highlighted in red; surrounding lines are dim
const isCrashLine = line_0 === origin.line
<Text
  backgroundColor={isCrashLine ? 'ansi:red' : undefined}
  color={isCrashLine ? 'ansi:white' : undefined}
  dim={!isCrashLine}
>
  {' ' + value}
</Text>
The readFileSync call is wrapped in a silent try/catch, so an unreadable source file (for example, because the process changed working directories or the file was deleted) does not break the overlay. It degrades gracefully: the error message and full parsed stack are still shown, just without the source excerpt.
The SentryErrorBoundary
Alongside ErrorOverview, there is a React error boundary at components/SentryErrorBoundary.ts. It catches render errors from any child component and renders null (silent failure) instead of crashing the whole Ink tree:
export class SentryErrorBoundary extends React.Component<Props, State> {
  static getDerivedStateFromError(): State {
    return { hasError: true }
  }

  render(): React.ReactNode {
    if (this.state.hasError) return null // silent: don't crash the whole UI
    return this.props.children
  }
}
ErrorOverview is used for critical unhandled errors that terminate the current render — it surfaces them to the user. SentryErrorBoundary wraps non-critical UI components that can be silently dropped without breaking the whole session.
Conversation Recovery (utils/conversationRecovery.ts)
When Claude Code crashes mid-turn or is forcefully killed, the conversation transcript on disk may be in an inconsistent state. conversationRecovery.ts is responsible for loading, cleaning, and restoring that transcript into a resumable state.
The four-stage deserialization pipeline
deserializeMessagesWithInterruptDetection runs the raw persisted messages through four filters before handing them back to the REPL:
- Legacy migration: transform old attachment types (new_file → file, new_directory → directory) and backfill missing displayPath fields
- Strip bad permission modes: remove permissionMode values not in the current build's PERMISSION_MODES set, which prevents crashes from stale config
- Filter invalid messages: remove unresolved tool_use pairs, orphaned thinking-only assistant messages, and whitespace-only assistant messages
- Interrupt detection: classify the transcript end as none (completed), interrupted_prompt (user sent a message, AI never responded), or interrupted_turn (AI was mid-tool-use)
Interruption classification
After filtering, the last "turn-relevant" message (skipping system messages, progress messages, and API-error assistant messages) determines what happened:
// Last message is an assistant → turn completed normally
if (lastMessage.type === 'assistant') return { kind: 'none' }

// Last message is a plain user prompt → CC hadn't started responding
if (lastMessage.type === 'user' && !isToolUseResultMessage(lastMessage))
  return { kind: 'interrupted_prompt', message: lastMessage }

// Last message is a tool_result → AI was mid-tool-use
if (isToolUseResultMessage(lastMessage)) {
  // Special case: brief mode ends on SendUserMessage tool_result legitimately
  if (isTerminalToolResult(lastMessage, messages, lastMessageIdx))
    return { kind: 'none' }
  return { kind: 'interrupted_turn' }
}
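The same decision can be written as a self-contained function over a simplified message model. The types and the classifyEnd name are illustrative; the sketch leaves out the API-error skip and the brief-mode special case, and treating an empty transcript as 'none' is an assumption.

```typescript
// Simplified message model for illustration; the real transcript types are richer.
type TranscriptMessage =
  | { type: 'assistant' }
  | { type: 'user'; isToolResult: boolean }
  | { type: 'system' }
  | { type: 'progress' }

type InterruptionKind = 'none' | 'interrupted_prompt' | 'interrupted_turn'

function classifyEnd(messages: TranscriptMessage[]): InterruptionKind {
  // Walk backwards to the last turn-relevant message.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i]
    if (!m || m.type === 'system' || m.type === 'progress') continue
    if (m.type === 'assistant') return 'none'
    // Only user messages remain: a tool result means the turn was in flight.
    return m.isToolResult ? 'interrupted_turn' : 'interrupted_prompt'
  }
  return 'none' // assumption: an empty transcript has nothing to resume
}
```

Trailing system or progress messages do not affect the result, which is why the classification looks at the last turn-relevant message rather than the last array element.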
Synthetic continuation messages
An interrupted_turn (AI was mid-tool when killed) is converted to interrupted_prompt by injecting a synthetic "Continue from where you left off." user message. This unifies both interruption kinds so the consumer only needs to handle one case:
if (internalState.kind === 'interrupted_turn') {
  const [continuationMessage] = normalizeMessages([
    createUserMessage({ content: 'Continue from where you left off.', isMeta: true })
  ])
  filteredMessages.push(continuationMessage!)
  turnInterruptionState = { kind: 'interrupted_prompt', message: continuationMessage! }
}
The API-valid sentinel
If the last relevant message after all filtering is a user message, the Anthropic API would reject the conversation (which must end on an assistant turn for streaming). A synthetic NO_RESPONSE_REQUESTED assistant sentinel is spliced in after the user message so the conversation is always API-valid even if no resume action is taken.
The sentinel is inserted at lastRelevantIdx + 1, not at the array end. This is intentional: removeInterruptedMessage calls splice(idx, 2) to remove the user message and the sentinel as a pair. Inserting at the end would break this if there are trailing system or progress messages.
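A small array walkthrough shows why the insertion position matters. The Entry type, the transcript contents, and the sentinel text are placeholders for the sketch; only the two splice calls mirror the behavior described in the text.

```typescript
type Entry = { role: 'user' | 'assistant' | 'system'; text: string }

const NO_RESPONSE_REQUESTED = 'No response requested.' // placeholder text for the sketch

const transcript: Entry[] = [
  { role: 'assistant', text: 'Done.' },
  { role: 'user', text: 'Now run the tests' }, // last relevant message
  { role: 'system', text: 'progress: saving session' }, // trailing noise
]
const lastRelevantIdx = 1

// Insert the sentinel directly after the user message, not at the end.
transcript.splice(lastRelevantIdx + 1, 0, { role: 'assistant', text: NO_RESPONSE_REQUESTED })

// Later, dropping the interrupted prompt removes both entries as a pair.
const removed = transcript.splice(lastRelevantIdx, 2)
// removed: the user message and the sentinel
// transcript: the first assistant message and the trailing system message
```

Had the sentinel been appended at the end instead, the same splice(lastRelevantIdx, 2) would have removed the user message and the system message, leaving an orphaned sentinel behind.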
Skill state restoration
Before deserialization, restoreSkillStateFromMessages walks the transcript for invoked_skills attachments and re-registers those skills in process state. Without this, a second compaction after resume would lose track of which skills were active:
for (const message of messages) {
  if (message.attachment?.type === 'invoked_skills') {
    for (const skill of message.attachment.skills) {
      addInvokedSkill(skill.name, skill.path, skill.content, null)
    }
  }
  // Suppress re-sending skill listing if already in transcript
  if (message.attachment?.type === 'skill_listing') suppressNextSkillListing()
}
Key Takeaways
- Every failure domain has a dedicated typed error class (ShellError, ConfigParseError, AbortError, etc.), so callers use instanceof instead of string-matching, which survives minification.
- isAbortError checks three different abort shapes (own class, SDK class, DOMException) because minification mangles constructor.name and the SDK never sets this.name.
- withRetry handles 10+ distinct failure modes in one loop: 529 model fallback, context overflow auto-adjustment, OAuth token refresh, Bedrock/Vertex auth cache clearing, and stale connection keep-alive disable.
- Background query sources (title generators, summarizers) bail immediately on 529 without retry; retrying from dozens of concurrent clients would amplify the capacity event and make the outage worse.
- Tool errors are truncated to 10,000 chars (5k head + 5k tail) before being sent to the model; large compiler outputs would otherwise waste the context window.
- The terminal ErrorOverview reads the crash source file synchronously and highlights the exact line, which is acceptable because this is the error path and the REPL is already broken.
- Conversation recovery runs a 4-stage filter pipeline to clean broken transcripts, then classifies the interruption type and injects synthetic messages to make the conversation API-valid before resuming.
- TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS uses a deliberately long name as a code-review forcing function: the developer must consciously confirm no sensitive data is in the message.
Knowledge Check
- Why does isAbortError use instanceof APIUserAbortError instead of checking e.name === 'APIUserAbortError'?
- What does withRetry do when it receives a context overflow 400 error?
- When deserializeMessages finds an interrupted_turn (process killed mid-tool-use), what does it inject?
- Why does ErrorOverview.tsx use synchronous readFileSync instead of an async read?