Lesson 44 — Error Handling and Recovery

01 Overview

Claude Code is a long-running, networked process that talks to the Anthropic API, executes shell commands, reads and writes files, and manages sub-agents — all of which can fail. Rather than letting errors propagate arbitrarily, the codebase has a layered error architecture that classifies failures, decides whether to retry or surface them, and ensures the user always sees a meaningful message.

Source files covered

utils/errors.ts → utils/toolErrors.ts → utils/errorLogSink.ts → services/api/withRetry.ts → services/api/errors.ts → ink/components/ErrorOverview.tsx → utils/conversationRecovery.ts → components/SentryErrorBoundary.ts

There are four distinct layers in the error stack:

Layer 1

Typed Error Classes

utils/errors.ts — a vocabulary of named exceptions for each failure domain

Layer 2

API Retry Engine

services/api/withRetry.ts — exponential backoff, 529 fallback, auth refresh, context overflow auto-adjust

Layer 3

Terminal Error Overlay

ink/components/ErrorOverview.tsx — inline source excerpt with highlighted crash line

Layer 4

Conversation Recovery

utils/conversationRecovery.ts — resume from interrupted/mid-turn sessions by cleaning and replaying transcript

02 The Error Taxonomy (utils/errors.ts)

Every major failure mode gets its own named class. This is not just style — it lets callers use instanceof checks that survive minification and class-name mangling in production builds.

Class	Purpose	Key fields
`ClaudeError`	Base class; sets `this.name` to the subclass constructor name	—
`AbortError`	User-initiated cancellation (Escape / Ctrl-C)	`name = 'AbortError'`
`MalformedCommandError`	Slash-command parse failures	—
`ConfigParseError`	Corrupt or unreadable config file; carries the default to fall back to	`filePath`, `defaultConfig`
`ShellError`	Shell command exit with non-zero code	`stdout`, `stderr`, `code`, `interrupted`
`TeleportOperationError`	Teleport SSH operations that need a formatted user-facing message	`formattedMessage`
`TelemetrySafeError_I_VERIFIED_...`	Errors that are safe to send to telemetry (no file paths, no code)	`telemetryMessage`
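A minimal sketch of how such a hierarchy could be laid out is shown here. The field names follow the table above, but the constructor signatures and the error message text are assumptions, not the actual contents of utils/errors.ts.

```typescript
// Hypothetical sketch: constructor shapes are assumed, field names follow the table.
class ClaudeError extends Error {
  constructor(message: string) {
    super(message)
    // new.target is the subclass being constructed, so each subclass
    // picks up its own name without repeating this line.
    this.name = new.target.name
    // Keeps instanceof working when compiling to ES5 targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

class ShellError extends ClaudeError {
  readonly stdout: string
  readonly stderr: string
  readonly code: number
  readonly interrupted: boolean

  constructor(stdout: string, stderr: string, code: number, interrupted: boolean) {
    super(`Shell command failed with exit code ${code}`)
    this.stdout = stdout
    this.stderr = stderr
    this.code = code
    this.interrupted = interrupted
  }
}

// Callers branch on the class, never on a name string.
function describeFailure(e: unknown): string {
  if (e instanceof ShellError) return `exit ${e.code}: ${e.stderr.trim()}`
  if (e instanceof ClaudeError) return e.message
  return String(e)
}
```

Because the branch is an instanceof check, it keeps working even if a bundler renames ShellError to something unrecognizable.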

The `isAbortError` three-way check

The abort signal comes from three different sources depending on context, and their class names mangle in production builds. The helper guards all three:

export function isAbortError(e: unknown): boolean {
  return (
    e instanceof AbortError ||           // our own class
    e instanceof APIUserAbortError ||    // SDK's class — checked by instanceof
    (e instanceof Error && e.name === 'AbortError')  // DOMException from AbortController
  )
}

Why not just check e.name for all three?

The SDK's APIUserAbortError never sets this.name, and minified builds mangle constructor names to short strings like 'nJT'. String matching silently fails in production. The comment in the source makes this explicit.
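The failure can be reproduced in a few lines. In this sketch the SDK class is simulated by a local class with a mangled constructor name ('nJT', the example above) that never assigns this.name; isAbortErrorByName is a hypothetical name-only check added for comparison.

```typescript
// Stand-in for the SDK class after minification: mangled constructor name,
// and this.name is never assigned.
const APIUserAbortError = class nJT extends Error {
  constructor(message?: string) {
    super(message)
    Object.setPrototypeOf(this, new.target.prototype) // keeps instanceof working on ES5 targets
  }
}

class AbortError extends Error {
  constructor(message?: string) {
    super(message)
    this.name = 'AbortError'
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Hypothetical name-only check, shown to illustrate what goes wrong.
function isAbortErrorByName(e: unknown): boolean {
  return e instanceof Error && e.name === 'AbortError'
}

function isAbortError(e: unknown): boolean {
  return (
    e instanceof AbortError ||
    e instanceof APIUserAbortError ||
    (e instanceof Error && e.name === 'AbortError')
  )
}

const sdkAbort = new APIUserAbortError('Request was aborted.')
console.log(sdkAbort.name)                // 'Error' (inherited from Error.prototype)
console.log(isAbortErrorByName(sdkAbort)) // false
console.log(isAbortError(sdkAbort))       // true
```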

Utility helpers — catch-site boundaries

Rather than casting unknown to Error everywhere, a small set of functions normalize the catch value at the boundary:

// Normalize unknown → Error (use at catch-site when you need an Error instance)
export function toError(e: unknown): Error {
  return e instanceof Error ? e : new Error(String(e))
}

// Only need the message string (for logging, display)
export function errorMessage(e: unknown): string {
  return e instanceof Error ? e.message : String(e)
}

// Trim stack to top-N frames so tool_result payloads don't waste tokens
export function shortErrorStack(e: unknown, maxFrames = 5): string {
  if (!(e instanceof Error)) return String(e)
  if (!e.stack) return e.message
  const lines = e.stack.split('\n')
  const header = lines[0] ?? e.message
  const frames = lines.slice(1).filter(l => l.trim().startsWith('at '))
  if (frames.length <= maxFrames) return e.stack
  return [header, ...frames.slice(0, maxFrames)].join('\n')
}

Context budget

shortErrorStack is specifically designed for tool results sent to the model. Full stacks are 500–2000 characters of mostly internal frames. Truncating to 5 frames keeps the model's context window free for what matters.

Filesystem error helpers

Node.js filesystem errors carry a string code (such as ENOENT) on the error object, but a caught value is typed as unknown (or any), so TypeScript cannot see that field without a cast. The codebase replaces the unsafe cast pattern with typed helpers:

// Safe alternative to (e as NodeJS.ErrnoException).code
export function getErrnoCode(e: unknown): string | undefined

// Covers: ENOENT | EACCES | EPERM | ENOTDIR | ELOOP
export function isFsInaccessible(e: unknown): e is NodeJS.ErrnoException

isFsInaccessible covers five errno codes because any of them can appear when trying to access a path — a file named .claude where a directory is expected triggers ENOTDIR; circular symlinks give ELOOP. Checking only ENOENT would miss these.
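Only the signatures are shown above, so the bodies below are an assumed implementation that matches the described behavior. ErrnoLike is a local stand-in for NodeJS.ErrnoException so the sketch has no dependency on Node typings, and readOptional is a hypothetical caller.

```typescript
// Assumed implementation: only the two signatures appear in the text.
type ErrnoLike = Error & { code?: string } // stand-in for NodeJS.ErrnoException

const FS_INACCESSIBLE_CODES = new Set(['ENOENT', 'EACCES', 'EPERM', 'ENOTDIR', 'ELOOP'])

function getErrnoCode(e: unknown): string | undefined {
  if (typeof e !== 'object' || e === null || !('code' in e)) return undefined
  const code = (e as { code: unknown }).code
  return typeof code === 'string' ? code : undefined
}

function isFsInaccessible(e: unknown): e is ErrnoLike {
  const code = getErrnoCode(e)
  return e instanceof Error && code !== undefined && FS_INACCESSIBLE_CODES.has(code)
}

// Hypothetical caller: treat "cannot reach this path" as "no value",
// and let every other failure propagate.
function readOptional(read: () => string): string | undefined {
  try {
    return read()
  } catch (e) {
    if (isFsInaccessible(e)) return undefined
    throw e
  }
}
```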

The telemetry safety discipline

TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS is intentionally long. The name forces the developer to consciously acknowledge that the error message contains no sensitive data before using it. The two-argument form shows the full message to the user (which may include file paths or other specifics) while sending a sanitized version to telemetry:

// Two-arg form: full message for user, scrubbed for telemetry
throw new TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS(
  `MCP tool timed out after ${ms}ms`,  // full message
  'MCP tool timed out'              // telemetry message (no timing data)
)
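The class itself is not shown in this lesson. The sketch below is one plausible shape, assuming the one-argument form reuses the message for telemetry; telemetryText is a hypothetical helper on the reporting side.

```typescript
// Assumed shape: the text names the class and its telemetryMessage field,
// the constructor details are a guess.
class TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS extends Error {
  readonly telemetryMessage: string

  constructor(message: string, telemetryMessage?: string) {
    super(message)
    this.name = 'TelemetrySafeError'
    // One-argument form: the message itself has been verified as safe.
    this.telemetryMessage = telemetryMessage ?? message
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Hypothetical reporting helper: only verified errors contribute any text.
function telemetryText(e: unknown): string {
  return e instanceof TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS
    ? e.telemetryMessage
    : 'unclassified error'
}
```

With this shape, an ordinary Error that happens to contain a path never reaches telemetry verbatim, because it fails the instanceof check.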

03 Tool Error Formatting (utils/toolErrors.ts)

When a tool fails, its error must be formatted for two consumers: the terminal (for the user) and the model (in a tool_result block). toolErrors.ts handles both. Messages longer than 10,000 characters are center-truncated to a 5,000-character head and a 5,000-character tail, joined by a short marker, to protect context budgets.

export function formatError(error: unknown): string {
  if (error instanceof AbortError) {
    return error.message || INTERRUPT_MESSAGE_FOR_TOOL_USE
  }
  if (!(error instanceof Error)) return String(error)
  const parts = getErrorParts(error)
  const fullMessage = parts.filter(Boolean).join('\n').trim()
    || 'Command failed with no output'
  if (fullMessage.length <= 10000) return fullMessage

  // Center-truncate: keep head + tail of large outputs
  const halfLength = 5000
  return `${fullMessage.slice(0, halfLength)}\n\n...${fullMessage.length - 10000} characters truncated...\n\n${fullMessage.slice(-halfLength)}`
}

For ShellError, the parts are assembled in priority order: exit code first, then stderr, then stdout. This mirrors what a developer would want to see — the most diagnostic information first.
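To make the truncation arithmetic concrete, here is the center-truncation step pulled out into a standalone helper. centerTruncate and its maxLength parameter are illustrative names, not part of toolErrors.ts.

```typescript
// Generalized sketch of the center-truncation step used by formatError.
function centerTruncate(text: string, maxLength = 10000): string {
  if (text.length <= maxLength) return text
  const half = Math.floor(maxLength / 2)
  const dropped = text.length - 2 * half
  return `${text.slice(0, half)}\n\n...${dropped} characters truncated...\n\n${text.slice(-half)}`
}

const big = 'a'.repeat(6000) + 'b'.repeat(6345) // 12,345 characters
const out = centerTruncate(big)
console.log(out.includes('...2345 characters truncated...')) // true
console.log(out.startsWith('a'.repeat(5000)))                // true
console.log(out.endsWith('b'.repeat(5000)))                  // true
```

Note that the result is slightly longer than 10,000 characters because the marker is added on top of the two 5,000-character halves.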

Zod validation errors → LLM-friendly messages

When the model calls a tool with the wrong schema, a ZodError is converted to a structured English message that the model can understand and correct:

// Input: ZodError with two issues — missing param + wrong type
// Output:
"FileEditTool failed due to the following issues:
The required parameter `old_string` is missing
The parameter `new_string` type is expected as `string` but provided as `number`"

Design intent

Generic Zod messages like "Required" are confusing to LLMs. By formatting paths as todos[0].activeForm and specifying expected vs. received types, the model can self-correct on the next attempt without a human in the loop.
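The conversion can be sketched without depending on Zod itself. IssueLike below is a simplified stand-in modeled on Zod 3's invalid_type issue (where a missing required field is reported with received set to 'undefined'); the function names are illustrative, not the ones in toolErrors.ts.

```typescript
// Simplified stand-in for the fields of a Zod issue that matter here.
interface IssueLike {
  code: 'invalid_type' | 'other'
  path: (string | number)[]
  expected?: string
  received?: string
  message: string
}

// ['todos', 0, 'activeForm'] -> 'todos[0].activeForm'
function formatPath(path: (string | number)[]): string {
  return path.reduce<string>((acc, seg) => {
    if (typeof seg === 'number') return `${acc}[${seg}]`
    return acc === '' ? seg : `${acc}.${seg}`
  }, '')
}

function formatIssue(issue: IssueLike): string {
  const param = formatPath(issue.path)
  if (issue.code === 'invalid_type' && issue.received === 'undefined') {
    return `The required parameter \`${param}\` is missing`
  }
  if (issue.code === 'invalid_type') {
    return `The parameter \`${param}\` type is expected as \`${issue.expected}\` but provided as \`${issue.received}\``
  }
  return issue.message // anything else falls back to the validator's own text
}

function formatIssues(toolName: string, issues: IssueLike[]): string {
  return [`${toolName} failed due to the following issues:`, ...issues.map(formatIssue)].join('\n')
}
```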

04 The Error Log Sink (utils/errorLogSink.ts)

Error logging is decoupled from the actual write implementation through a sink pattern. log.ts is dependency-free — it queues events until a sink is attached. errorLogSink.ts contains the heavy implementation (file I/O, axios enrichment) and is initialized once during startup.

flowchart LR
  A["logError(err)\nlog.ts"] -->|"queues if no sink"| Q["In-memory queue"]
  A -->|"drains to sink once attached"| S["logErrorImpl()\nerrorLogSink.ts"]
  S --> D["logForDebugging()\ndebug log"]
  S --> F["JSONL append\n~/.cache/claude/errors/DATE.jsonl"]
  S --> AX{"axios\nerror?"}
  AX -->|"yes"| EN["Enrich:\nurl, status, body"]
  AX -->|"no"| F
  EN --> F
  style A fill:#22201d,stroke:#7d9ab8,color:#b8b0a4
  style F fill:#1e251b,stroke:#6e9468,color:#b8b0a4
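The queue-until-attached behavior can be sketched in a few lines. attachErrorLogSink and ErrorLogSink are assumed names; only logError and the queueing behavior come from the description above.

```typescript
// Sketch of a dependency-free logging front end with a late-bound sink.
type ErrorLogSink = (error: Error) => void

let sink: ErrorLogSink | null = null
const pending: Error[] = []

function logError(e: unknown): void {
  const error = e instanceof Error ? e : new Error(String(e))
  if (sink) sink(error)
  else pending.push(error) // no sink yet: hold the event in memory
}

function attachErrorLogSink(next: ErrorLogSink): void {
  sink = next
  // Drain in arrival order so errors raised during startup are not lost.
  for (const error of pending.splice(0)) next(error)
}
```

The benefit of this split is that any module can call logError during startup without importing the file I/O code, which avoids import cycles and keeps early errors from being dropped.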

The sink writes errors as JSONL (one JSON object per line) to a date-stamped file. Each entry includes timestamp, session ID, cwd, and version. For axios errors it extracts the request URL, HTTP status, and server error body — the three fields most useful for diagnosing API issues.

// Buffered JSONL writer — flushes every 1 second or after 50 entries
// On first write: mkdirSync creates parent dirs, then appendFileSync
// Registered with cleanupRegistry so it flushes on process exit
function createJsonlWriter(options: {
  writeFn: (content: string) => void
  flushIntervalMs?: number     // default: 1000
  maxBufferSize?: number        // default: 50
}): JsonlWriter
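Since only the signature appears above, the following is an assumed implementation of the same idea: buffer serialized lines, flush when the buffer reaches maxBufferSize or when a timer fires. The JsonlWriter method names are guesses, and cleanup registration is left out.

```typescript
// Assumed implementation of a buffered JSONL writer.
interface JsonlWriter {
  write(entry: object): void
  flush(): void
  dispose(): void
}

function createJsonlWriter(options: {
  writeFn: (content: string) => void
  flushIntervalMs?: number
  maxBufferSize?: number
}): JsonlWriter {
  const { writeFn, flushIntervalMs = 1000, maxBufferSize = 50 } = options
  let buffer: string[] = []
  let timer: ReturnType<typeof setTimeout> | undefined

  function flush(): void {
    if (timer !== undefined) {
      clearTimeout(timer)
      timer = undefined
    }
    if (buffer.length === 0) return
    const content = buffer.join('')
    buffer = []
    writeFn(content) // one write call per batch
  }

  return {
    write(entry) {
      buffer.push(JSON.stringify(entry) + '\n') // JSONL: one object per line
      if (buffer.length >= maxBufferSize) flush()
      else if (timer === undefined) timer = setTimeout(flush, flushIntervalMs)
    },
    flush,
    dispose: flush, // flushing on exit is the only cleanup this sketch needs
  }
}
```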

Ant-only logging

The appendToLog function returns early unless process.env.USER_TYPE is 'ant', so error logs are only written for internal Anthropic employees, not external users. This prevents accumulating potentially sensitive user data in log files.

05 The API Retry Engine (services/api/withRetry.ts)

withRetry is an async generator that wraps every Anthropic API call. It implements a sophisticated retry policy that handles transient network failures, rate limits, auth token expiry, context overflow, and the Claude-specific 529 overload status.

flowchart TD
  START["API call attempt"] --> TRY["try: operation()"]
  TRY -->|"success"| RETURN["return result"]
  TRY -->|"error"| CLASSIFY["Classify error"]
  CLASSIFY --> ABORT{"AbortSignal\nset?"}
  ABORT -->|"yes"| THROW_ABORT["throw APIUserAbortError"]
  ABORT -->|"no"| FM{"Fast mode\nactive + 429/529?"}
  FM -->|"short retry-after"| CONTINUE["continue (fast mode preserved)"]
  FM -->|"long/unknown"| COOLDOWN["triggerFastModeCooldown()\nretryContext.fastMode = false"]
  COOLDOWN --> CONTINUE
  FM -->|"no"| BG529{"Background\nquery source?"}
  BG529 -->|"yes"| DROP["throw CannotRetryError\n(no amplification)"]
  BG529 -->|"no"| CTX{"Context\noverflow?"}
  CTX -->|"yes"| ADJUST["Compute adjustedMaxTokens\nretryContext.maxTokensOverride = N\ncontinue"]
  CTX -->|"no"| AUTH{"Auth\nerror?"}
  AUTH -->|"401/403 OAuth"| REFRESH["handleOAuth401Error()\nforce token refresh\nnew client on next attempt"]
  AUTH -->|"Bedrock 403"| CLEAR_AWS["clearAwsCredentialsCache()\nnew client on next attempt"]
  AUTH -->|"Vertex 401"| CLEAR_GCP["clearGcpCredentialsCache()"]
  REFRESH --> RETRY_CHECK
  CLEAR_AWS --> RETRY_CHECK
  CLEAR_GCP --> RETRY_CHECK
  AUTH -->|"no auth"| RETRY_CHECK{"attempt ≤\nmaxRetries?"}
  RETRY_CHECK -->|"yes + retryable"| BACKOFF["getRetryDelay(attempt)\nexponential + jitter\nyield SystemAPIErrorMessage"]
  RETRY_CHECK -->|"no"| THROW_RETRY["throw CannotRetryError"]
  BACKOFF --> START
  style RETURN fill:#1e251b,stroke:#6e9468,color:#b8b0a4
  style THROW_ABORT fill:#2c1d18,stroke:#c47a50,color:#b8b0a4
  style THROW_RETRY fill:#2c1d18,stroke:#c47a50,color:#b8b0a4
  style DROP fill:#31271d,stroke:#b8965e,color:#b8b0a4
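Stripped of the special cases, the core of such a loop is an async generator that yields a notice before each wait and returns the result on success. The sketch below is a heavily simplified illustration: RetryNotice, the options object, the injected sleep function, and the drain helper are all assumptions, and none of the fast-mode, auth, or context-overflow branches are modeled.

```typescript
// Simplified sketch of an async-generator retry loop.
interface RetryNotice {
  attempt: number
  delayMs: number
  error: Error
}

class CannotRetryError extends Error {
  readonly originalError: Error
  constructor(originalError: Error) {
    super(originalError.message)
    this.name = 'CannotRetryError'
    this.originalError = originalError
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

async function* withRetrySketch<T>(
  operation: (attempt: number) => Promise<T>,
  options: {
    maxRetries: number
    isRetryable: (e: Error) => boolean
    delayFor: (attempt: number) => number
    sleep: (ms: number) => Promise<void>
  },
): AsyncGenerator<RetryNotice, T, void> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation(attempt)
    } catch (e) {
      const error = e instanceof Error ? e : new Error(String(e))
      if (attempt > options.maxRetries || !options.isRetryable(error)) {
        throw new CannotRetryError(error) // wraps the original for the caller
      }
      const delayMs = options.delayFor(attempt)
      yield { attempt, delayMs, error } // lets the UI report the pending retry
      await options.sleep(delayMs)
    }
  }
}

// Consumer: collect the notices and return the final value.
async function drain<T>(
  gen: AsyncGenerator<RetryNotice, T, void>,
): Promise<{ result: T; notices: RetryNotice[] }> {
  const notices: RetryNotice[] = []
  for (;;) {
    const step = await gen.next()
    if (step.done) return { result: step.value, notices }
    notices.push(step.value)
  }
}
```

Using a generator rather than a plain async function means the caller sees each retry as it happens and can surface it to the user, instead of only learning the final outcome.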

Retry delay formula

The backoff uses exponential delay with up to 25% additive jitter (the delay is only ever lengthened, never shortened) to avoid a thundering herd when many clients hit rate limits simultaneously:

export function getRetryDelay(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000  // honor server directive
  }
  const baseDelay = Math.min(
    500 * Math.pow(2, attempt - 1),   // 500ms, 1s, 2s, 4s... cap 32s
    maxDelayMs,
  )
  const jitter = Math.random() * 0.25 * baseDelay
  return baseDelay + jitter
}
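Because of Math.random(), the function above is awkward to reason about directly. The helper below, an illustrative addition rather than part of withRetry.ts, computes the smallest and largest delay the same formula can produce for a given attempt when no Retry-After header is present.

```typescript
// Deterministic view of the backoff schedule produced by the formula above.
function delayBounds(attempt: number, maxDelayMs = 32000): [number, number] {
  const base = Math.min(500 * Math.pow(2, attempt - 1), maxDelayMs)
  // Math.random() is in [0, 1), so the upper bound is exclusive.
  return [base, base * 1.25]
}

for (let attempt = 1; attempt <= 8; attempt++) {
  const [min, max] = delayBounds(attempt)
  console.log(`attempt ${attempt}: ${min}ms to just under ${max}ms`)
}
```

Two things follow from this. The cap applies before the jitter is added, so from attempt 7 onward the actual wait can approach 40 seconds rather than 32. And a numeric Retry-After header bypasses both the cap and the jitter, since the server's value is returned as-is.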

The 529 overloaded error — special handling

HTTP 529 is a non-standard status code that the Anthropic API returns when it is overloaded. It gets its own logic because:

Background query sources (title generators, summarizers) bail immediately — retrying from dozens of clients during a capacity event would amplify the cascade
After 3 consecutive 529s on an Opus model, the engine falls back to a configured fallback model via FallbackTriggeredError
The SDK sometimes fails to set status=529 during streaming — the fallback checks error.message.includes('"type":"overloaded_error"')

export function is529Error(error: unknown): boolean {
  if (!(error instanceof APIError)) return false
  return (
    error.status === 529 ||
    // SDK streaming bug: status not set, check message content
    (error.message?.includes('"type":"overloaded_error"') ?? false)
  )
}

Context overflow auto-adjustment

When a request is rejected with a 400 "input length and max_tokens exceed context limit" error, withRetry parses the token counts from the message and automatically reduces maxTokens for the next attempt — without requiring any user action:

// Error message format:
// "input length and `max_tokens` exceed context limit: 188059 + 20000 > 200000"

// Auto-adjustment:
const availableContext = Math.max(0, contextLimit - inputTokens - 1000) // safety buffer
retryContext.maxTokensOverride = Math.max(FLOOR_OUTPUT_TOKENS, availableContext, minRequired)

Floor output tokens

The floor is set to 3,000 tokens. If even that doesn't fit (the context is absolutely full), the error is re-thrown rather than attempting a call that would produce truncated or empty output.
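The parse-and-adjust step can be sketched as two small functions. The regular expression and the function names are assumptions; the 1,000-token safety buffer and 3,000-token floor are the values quoted above, and minRequired is treated as an optional caller-supplied minimum since the lesson does not define it.

```typescript
// Sketch of parsing the overflow message and deriving the retry limit.
const FLOOR_OUTPUT_TOKENS = 3000
const SAFETY_BUFFER = 1000

interface OverflowInfo {
  inputTokens: number
  maxTokens: number
  contextLimit: number
}

function parseContextOverflow(message: string): OverflowInfo | undefined {
  const m = /exceed context limit: (\d+) \+ (\d+) > (\d+)/.exec(message)
  if (!m) return undefined
  return { inputTokens: Number(m[1]), maxTokens: Number(m[2]), contextLimit: Number(m[3]) }
}

// Returns the max_tokens value to retry with, or undefined when not even
// the floor fits and the original error should be re-thrown.
function adjustedMaxTokens(info: OverflowInfo, minRequired = 0): number | undefined {
  const available = Math.max(0, info.contextLimit - info.inputTokens - SAFETY_BUFFER)
  if (available < FLOOR_OUTPUT_TOKENS) return undefined
  return Math.max(FLOOR_OUTPUT_TOKENS, available, minRequired)
}

const info = parseContextOverflow(
  'input length and `max_tokens` exceed context limit: 188059 + 20000 > 200000',
)
console.log(info && adjustedMaxTokens(info)) // 10941
```

For the example message, 200000 - 188059 - 1000 leaves 10,941 tokens, so the retry asks for 10,941 output tokens instead of 20,000.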

Persistent retry mode (`CLAUDE_CODE_UNATTENDED_RETRY`)

For unattended / CI sessions, setting CLAUDE_CODE_UNATTENDED_RETRY=1 enables indefinite retry on 429/529 with a maximum backoff of 5 minutes and a 6-hour total cap. Long waits are chunked into 30-second heartbeat yields so the host environment (CI runner, tmux session) does not mark the process idle.
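The chunking can be illustrated as follows. The 30-second chunk size comes from the text; chunkWait, sleepWithHeartbeat, and the injected sleep function are hypothetical names used for this sketch only.

```typescript
// Sketch of splitting one long backoff into heartbeat-sized waits.
const HEARTBEAT_MS = 30000

function chunkWait(totalMs: number, chunkMs = HEARTBEAT_MS): number[] {
  const chunks: number[] = []
  for (let remaining = totalMs; remaining > 0; remaining -= chunkMs) {
    chunks.push(Math.min(chunkMs, remaining))
  }
  return chunks
}

async function* sleepWithHeartbeat(
  totalMs: number,
  sleep: (ms: number) => Promise<void>,
): AsyncGenerator<{ remainingMs: number }, void, void> {
  let remaining = totalMs
  for (const chunk of chunkWait(totalMs)) {
    await sleep(chunk)
    remaining -= chunk
    yield { remainingMs: remaining } // heartbeat: shows the process is still alive
  }
}

console.log(chunkWait(75000)) // [30000, 30000, 15000]
```

Each yield gives the caller a chance to emit output, which is what keeps a CI runner or terminal multiplexer from treating a five-minute backoff as a hung process.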

06 User-Facing Error Message Constants (services/api/errors.ts)

All user-visible error strings are defined as named constants in one file. This makes them searchable and testable, and prevents message drift between the place that throws and the place that detects:

export const INVALID_API_KEY_ERROR_MESSAGE = 'Not logged in · Please run /login'
export const TOKEN_REVOKED_ERROR_MESSAGE    = 'OAuth token revoked · Please run /login'
export const REPEATED_529_ERROR_MESSAGE     = 'Repeated 529 Overloaded errors'
export const API_TIMEOUT_ERROR_MESSAGE      = 'Request timed out'
export const CREDIT_BALANCE_TOO_LOW_ERROR_MESSAGE = 'Credit balance is too low'

Interactive vs. non-interactive sessions get different guidance for media errors. The same getImageTooLargeErrorMessage() function returns "Double press esc to go back" for REPL users and "Try resizing the image" for SDK/headless callers:

export function getImageTooLargeErrorMessage(): string {
  return getIsNonInteractiveSession()
    ? 'Image was too large. Try resizing the image or using a different approach.'
    : 'Image was too large. Double press esc to go back and try again with a smaller image.'
}

07 The Terminal Error Overlay (ink/components/ErrorOverview.tsx)

When an unhandled exception reaches the Ink render tree, ErrorOverview displays it directly in the terminal — not a stack dump, but a formatted overlay with the crash location and inline source context. It uses two libraries: StackUtils to parse V8 stack frames, and code-excerpt to read and display the relevant source lines.

// 1. Parse the first stack frame to get file + line + column
const stack = error.stack ? error.stack.split('\n').slice(1) : undefined
const origin = stack ? getStackUtils().parseLine(stack[0]!) : undefined

// 2. Read source file synchronously (sync OK: error overlay, can't go async)
const sourceCode = readFileSync(filePath, 'utf8')
excerpt = codeExcerpt(sourceCode, origin.line)

// 3. Render: crash line is highlighted in red; surrounding lines are dim
const isCrashLine = line_0 === origin.line
<Text
  backgroundColor={isCrashLine ? 'ansi:red' : undefined}
  color={isCrashLine ? 'ansi:white' : undefined}
  dim={!isCrashLine}
>
  {' ' + value}
</Text>

The readFileSync call is wrapped in a silent try/catch, so if the source file is unreadable (e.g., the process changed working directories or the file was deleted), the overlay degrades gracefully: it still shows the error message and the full parsed stack, just without the source excerpt.
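The degrade-gracefully read can be sketched independently of Ink. In this illustration the file reader is injected so the logic is testable in isolation; readExcerpt and ExcerptLine are hypothetical names, and the real component uses readFileSync together with the code-excerpt package.

```typescript
// Sketch of reading a source excerpt around the crash line.
interface ExcerptLine {
  line: number
  value: string
}

function readExcerpt(
  readFile: (path: string) => string,
  filePath: string,
  crashLine: number, // 1-based, as in V8 stack frames
  around = 3,
): ExcerptLine[] | undefined {
  let source: string
  try {
    source = readFile(filePath)
  } catch {
    return undefined // unreadable file: render the overlay without an excerpt
  }
  const lines = source.split('\n')
  const first = Math.max(1, crashLine - around)
  const last = Math.min(lines.length, crashLine + around)
  const excerpt: ExcerptLine[] = []
  for (let n = first; n <= last; n++) {
    excerpt.push({ line: n, value: lines[n - 1] ?? '' })
  }
  return excerpt
}
```

The caller only needs to check for undefined before rendering the excerpt rows, so a missing file never turns one error into two.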

Why sync file I/O here?

The component renders synchronously inside the Ink reconciler. Going async would require suspense restructuring. Since this is the error path (process is already broken), sync I/O is acceptable — there is no REPL loop to block.

The `SentryErrorBoundary`

Alongside ErrorOverview, there is a React error boundary at components/SentryErrorBoundary.ts. It catches render errors from any child component and renders null (silent failure) instead of crashing the whole Ink tree:

export class SentryErrorBoundary extends React.Component<Props, State> {
  static getDerivedStateFromError(): State {
    return { hasError: true }
  }

  render(): React.ReactNode {
    if (this.state.hasError) return null  // silent: don't crash the whole UI
    return this.props.children
  }
}

Two different recovery strategies

ErrorOverview is used for critical unhandled errors that terminate the current render — it surfaces them to the user. SentryErrorBoundary wraps non-critical UI components that can be silently dropped without breaking the whole session.

08 Conversation Recovery (utils/conversationRecovery.ts)

When Claude Code crashes mid-turn or is forcefully killed, the conversation transcript on disk may be in an inconsistent state. conversationRecovery.ts is responsible for loading, cleaning, and restoring that transcript into a resumable state.

The four-stage deserialization pipeline

deserializeMessagesWithInterruptDetection runs the raw persisted messages through four filters before handing them back to the REPL:

Stage 1

Legacy migration

Transform old attachment types (new_file → file, new_directory → directory) and backfill missing displayPath fields

Stage 2

Strip bad permission modes

Remove permissionMode values not in the current build's PERMISSION_MODES set — prevents crashes from stale config

Stage 3

Filter invalid messages

Remove unresolved tool_use pairs, orphaned thinking-only assistant messages, and whitespace-only assistant messages

Stage 4

Interrupt detection

Classify the transcript end as: none (completed) / interrupted_prompt (user sent a message, AI never responded) / interrupted_turn (AI was mid-tool-use)
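As an illustration of Stage 3, the sketch below filters a toy transcript. The message model is deliberately simplified (the real transcript types carry many more fields) and the rules are a loose reading of the stage description, not the actual filter code.

```typescript
// Simplified message model for illustration only.
type Block =
  | { type: 'text'; text: string }
  | { type: 'thinking'; thinking: string }
  | { type: 'tool_use'; id: string }
  | { type: 'tool_result'; tool_use_id: string }

interface Msg {
  type: 'user' | 'assistant'
  content: Block[]
}

function filterInvalidMessages(messages: Msg[]): Msg[] {
  // Collect every tool_use id that received a result.
  const resolved = new Set<string>()
  for (const m of messages) {
    for (const b of m.content) {
      if (b.type === 'tool_result') resolved.add(b.tool_use_id)
    }
  }

  return messages.filter(m => {
    if (m.type !== 'assistant') return true
    // Unresolved tool_use: the process died before the tool finished.
    if (m.content.some(b => b.type === 'tool_use' && !resolved.has(b.id))) return false
    // Thinking-only: no visible output survived.
    if (m.content.length > 0 && m.content.every(b => b.type === 'thinking')) return false
    // Whitespace-only (or empty) text.
    if (m.content.every(b => b.type === 'text' && b.text.trim() === '')) return false
    return true
  })
}
```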

Interruption classification

After filtering, the last "turn-relevant" message (skipping system messages, progress messages, and assistant messages that only carry an API error) determines what happened:

// Last message is an assistant → turn completed normally
if (lastMessage.type === 'assistant') return { kind: 'none' }

// Last message is a plain user prompt → CC hadn't started responding
if (lastMessage.type === 'user' && !isToolUseResultMessage(lastMessage))
  return { kind: 'interrupted_prompt', message: lastMessage }

// Last message is a tool_result → AI was mid-tool-use
if (isToolUseResultMessage(lastMessage)) {
  // Special case: brief mode ends on SendUserMessage tool_result legitimately
  if (isTerminalToolResult(lastMessage, messages, lastMessageIdx))
    return { kind: 'none' }
  return { kind: 'interrupted_turn' }
}

Synthetic continuation messages

An interrupted_turn (AI was mid-tool when killed) is converted to interrupted_prompt by injecting a synthetic "Continue from where you left off." user message. This unifies both interruption kinds so the consumer only needs to handle one case:

if (internalState.kind === 'interrupted_turn') {
  const [continuationMessage] = normalizeMessages([
    createUserMessage({ content: 'Continue from where you left off.', isMeta: true })
  ])
  filteredMessages.push(continuationMessage!)
  turnInterruptionState = { kind: 'interrupted_prompt', message: continuationMessage! }
}

The API-valid sentinel

If the last relevant message after all filtering is a user message, the Anthropic API would reject the conversation (which must end on an assistant turn for streaming). A synthetic NO_RESPONSE_REQUESTED assistant sentinel is spliced in after the user message so the conversation is always API-valid even if no resume action is taken.

removeInterruptedMessage splice contract

The sentinel is inserted at lastRelevantIdx + 1, not at the array end. This is intentional: removeInterruptedMessage calls splice(idx, 2) to remove the user message and the sentinel as a pair. Inserting at the end would break this if there are trailing system or progress messages.
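A toy transcript makes the contract easy to see. The string entries below stand in for real message objects, and the index value is chosen by hand for the example.

```typescript
// Illustration of the splice contract with a trailing non-relevant entry.
const transcript = ['assistant: done', 'user: fix the bug', 'system: session resumed']
const lastRelevantIdx = 1 // the trailing system entry is not turn-relevant

// Insert the sentinel directly after the user message, not at the array end.
transcript.splice(lastRelevantIdx + 1, 0, 'assistant: NO_RESPONSE_REQUESTED')

// Later, removing the interrupted message takes the pair out in one call.
const removed = transcript.splice(lastRelevantIdx, 2)
console.log(removed)    // ['user: fix the bug', 'assistant: NO_RESPONSE_REQUESTED']
console.log(transcript) // ['assistant: done', 'system: session resumed']
```

Had the sentinel been pushed to the end instead, splice(lastRelevantIdx, 2) would have removed the user message together with the system entry and left the sentinel orphaned.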

Skill state restoration

Before deserialization, restoreSkillStateFromMessages walks the transcript for invoked_skills attachments and re-registers those skills in process state. Without this, a second compaction after resume would lose track of which skills were active:

for (const message of messages) {
  if (message.attachment?.type === 'invoked_skills') {
    for (const skill of message.attachment.skills) {
      addInvokedSkill(skill.name, skill.path, skill.content, null)
    }
  }
  // Suppress re-sending skill listing if already in transcript
  if (message.attachment?.type === 'skill_listing') suppressNextSkillListing()
}

09 End-to-End Error Flow

flowchart TD
  subgraph "Runtime Errors"
    TE["Tool execution\nfails"] --> FE["formatError()\ntoolErrors.ts"]
    FE --> TR["tool_result block\nto model"]
    FE --> TER["Terminal display"]
  end
  subgraph "API Errors"
    AE["Anthropic API\nreturns error"] --> WR["withRetry()\nwithRetry.ts"]
    WR -->|"retryable"| BACK["Backoff + yield\nSystemAPIErrorMessage"]
    BACK --> WR
    WR -->|"exhausted"| CNR["CannotRetryError\nwraps original"]
    WR -->|"529 × 3 Opus"| FBT["FallbackTriggeredError"]
    CNR --> EL["logError()\nerrorLogSink.ts"]
    EL --> JSONL["JSONL log file\n~/.cache/claude/errors/"]
    CNR --> UI["ErrorOverview.tsx\nterminal overlay"]
  end
  subgraph "Session Recovery"
    CRASH["Process killed\nmid-turn"] --> RESUME["--continue / --resume"]
    RESUME --> DSR["deserializeMessages()\nconversationRecovery.ts"]
    DSR --> FILTER["4-stage filter pipeline"]
    FILTER --> DETECT["detectTurnInterruption()"]
    DETECT -->|"interrupted_prompt"| AUTO["Auto-resumes\nwith user message"]
    DETECT -->|"interrupted_turn"| SYNTH["Inject synthetic\n'Continue...' message"]
    SYNTH --> AUTO
    DETECT -->|"none"| CLEAN["Clean resume\nno action needed"]
  end
  style UI fill:#2c1d18,stroke:#c47a50,color:#b8b0a4
  style JSONL fill:#1e251b,stroke:#6e9468,color:#b8b0a4
  style AUTO fill:#1c2228,stroke:#7d9ab8,color:#b8b0a4

Key Takeaways

Every failure domain has a dedicated typed error class — ShellError, ConfigParseError, AbortError etc. — so callers use instanceof instead of string-matching, which survives minification.
isAbortError checks three different abort shapes (own class, SDK class, DOMException) because minification mangles constructor.name and the SDK never sets this.name.
withRetry handles 10+ distinct failure modes in one loop: 529 model fallback, context overflow auto-adjustment, OAuth token refresh, Bedrock/Vertex auth cache clearing, and stale connection keep-alive disable.
Background query sources (title generators, summarizers) bail immediately on 529 without retry — amplifying capacity events from dozens of concurrent clients would make the outage worse.
Tool errors are truncated to 10,000 chars (5k head + 5k tail) before being sent to the model — large compiler outputs would otherwise waste the entire context window.
The terminal ErrorOverview reads the crash source file synchronously and highlights the exact line — acceptable because this is the error path and the REPL is already broken.
Conversation recovery runs a 4-stage filter pipeline to clean broken transcripts, then classifies the interruption type and injects synthetic messages to make the conversation API-valid before resuming.
TelemetrySafeError_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS uses a deliberately long name as a code-review forcing function — the developer must consciously confirm no sensitive data is in the message.