CLAUDE CERTIFIED ARCHITECT — FOUNDATIONS · SCENARIO SERIES · PART 1 OF 6
Building a Customer Support Resolution Agent with Claude

| Primary Domains | Tools Covered | Target Metric | Exam Weight |
|---|---|---|---|
| Agentic Architecture | 4 MCP Tools | 80%+ FCR | Domains 1, 2, 5 |
Why This Scenario Matters
Customer support is where AI agents meet financial consequences in real time. A wrong refund decision costs money. A missed escalation triggers chargebacks or legal exposure. An agent that asks customers questions it could answer itself with a tool call destroys the experience you were trying to improve.
Scenario 1 of the Claude Certified Architect exam is deliberately set in this high-stakes context. The scenario asks you to design and reason about a customer support resolution agent using the Claude Agent SDK, wired to four MCP tools — get_customer, lookup_order, process_refund, and escalate_to_human — with an 80%+ first-contact resolution (FCR) target.
This post walks through the architecture decisions, the exam-relevant tradeoffs, and the implementation patterns the certification expects you to understand. It draws on both the exam guide and a reference teaching aid that walks through the same scenario interactively.
The Architecture in Plain Terms
The system is a single-agent loop — not a multi-agent pipeline. One Claude instance handles the full conversation, calling tools as needed, managing context, and deciding when a case exceeds its authority. This simplicity is intentional: support conversations require continuity of context that multi-agent splits can undermine.
Every conversation follows a sequence the agent must internalize through its system prompt: verify identity first, fetch order context second, classify the issue, then act. The ordering is not optional — and as we will discuss, enforcing it requires more than a well-written system prompt.
The Four MCP Tools
The agent’s backend access is scoped to exactly four tools, each serving a specific role in the resolution workflow:
| Tool | Type | Purpose |
|---|---|---|
| get_customer | Read-only | Retrieve verified customer profile by ID or email. Must be called before any other tool. |
| lookup_order | Read-only | Fetch order details including status, items, shipping, and refund eligibility. |
| process_refund | Write | Execute a full or partial refund. Requires verified identity and order context. Irreversible. |
| escalate_to_human | Write | Transfer conversation to a human agent with a structured summary. Terminates the agentic loop. |
The Decision Flow
The agent’s resolution logic follows a strict five-step sequence:
- Intake — Parse the customer's intent from their raw message
- Verify Identity — Call get_customer() before any other tool
- Context Lookup — Call lookup_order() before discussing any order details
- Classify — Determine whether to resolve, clarify, or escalate
- Act — Call process_refund() or escalate_to_human() as appropriate
The Concepts the Exam Actually Tests
The certification does not ask you to write code from memory. It asks you to demonstrate judgment — to choose between plausible approaches and explain why one is better than another in a production context. Here are the core tradeoffs Scenario 1 is built around.
1. Programmatic Enforcement vs. Prompt-Based Guidance
This is the highest-leverage concept in the scenario, and the exam probes it directly.
Production data shows that in 12% of cases, your agent skips get_customer entirely and calls lookup_order using only the customer’s stated name, occasionally leading to misidentified accounts and incorrect refunds. What change would most effectively address this reliability issue?
The four options include: (A) add a programmatic prerequisite blocking lookup_order until get_customer has returned a verified ID, (B) strengthen the system prompt to say verification is mandatory, (C) add few-shot examples showing get_customer always called first, and (D) implement a routing classifier.
When a specific tool sequence is required for critical business logic — like verifying customer identity before processing refunds — programmatic enforcement provides deterministic guarantees that prompt-based approaches cannot. Options B and C rely on probabilistic LLM compliance, which is insufficient when errors have financial consequences.
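Option (A) can be sketched as a gate in the tool-dispatch layer. This is a minimal illustration of the principle, not the Claude Agent SDK's actual hook API; the names SessionState and dispatch_tool are assumptions for the sake of the example.

```python
# Illustrative prerequisite gate: lookup_order and process_refund are blocked
# until get_customer has returned a verified identity. This holds regardless
# of what the model decides, which is what makes it deterministic.

class SessionState:
    def __init__(self):
        self.verified_customer_id = None  # set only by a successful get_customer

def dispatch_tool(state, name, args, handlers):
    # Hard gate: identity-dependent tools require a verified customer ID.
    if name in ("lookup_order", "process_refund") and state.verified_customer_id is None:
        # Returned as a structured tool error so the model can recover by
        # calling get_customer, rather than failing opaquely.
        return {"error": "prerequisite_not_met",
                "message": "Call get_customer to verify identity first."}
    result = handlers[name](args)
    if name == "get_customer" and result.get("verified"):
        state.verified_customer_id = result["customer_id"]
    return result
```

Because the gate lives in code, the 12% non-compliance rate drops to zero for this rule; the prompt still matters, but only for guiding the model toward the happy path rather than guaranteeing it.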
2. Tool Description Quality Drives Selection Reliability
Production logs show the agent frequently calls get_customer when users ask about orders, instead of calling lookup_order. Both tools have minimal descriptions and accept similar identifier formats. What’s the most effective first step?
A well-designed tool description includes:
- What the tool does and what it returns
- Input formats it accepts and which identifiers trigger it
- Example queries that should route to this tool
- Explicit boundaries — when to use this tool versus a similar one
This matters especially when tools have overlapping semantics. get_customer and lookup_order both accept identifier-like inputs. Without rich descriptions, the model has to guess based on naming and context alone.
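A description following the checklist above might look like the following. The name/description/inputSchema shape mirrors MCP tool definitions; the identifier formats (ORD-XXXX, CUS-XXXX) and wording are illustrative, not from the exam guide.

```python
# An enriched lookup_order definition: states what it returns, which
# identifiers trigger it, example queries, and an explicit boundary
# against the overlapping get_customer tool.

LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": (
        "Fetch details for a single order: status, line items, shipping, and "
        "refund eligibility. Accepts order IDs in the form ORD-XXXX. "
        "Use for queries like 'where is my package?' or 'can I return this?'. "
        "Do NOT use this to look up a customer profile; for emails or "
        "customer IDs (CUS-XXXX), use get_customer instead."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, e.g. ORD-7781",
            },
        },
        "required": ["order_id"],
    },
}
```

The explicit negative boundary ("Do NOT use this to...") is what resolves the overlap: it gives the model a rule to apply instead of a guess to make.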
3. Escalation Logic Requires Explicit Criteria
The third sample question captures a common failure mode: the agent achieves 55% FCR because it escalates easy cases (routine damage replacements with photo evidence) while attempting to autonomously handle hard ones (cases requiring policy exceptions). The correct fix — adding explicit escalation criteria with few-shot examples to the system prompt — is chosen over sentiment analysis, a separate classifier model, and confidence score thresholds.
The exam guide is clear on why the alternatives fail: LLM self-reported confidence is poorly calibrated, sentiment does not correlate with case complexity, and deploying a classifier is over-engineered when prompt optimization has not been tried. This is a valuable principle for practitioners: instrument before you architect.
Effective escalation criteria in a system prompt look like this:
```
## MUST Escalate When:
- Customer mentions legal action or chargebacks (explicit trigger)
- Account is flagged for fraud investigation
- Issue requires policy exception beyond your authority
- You cannot resolve after 2 clarifying attempts
- Customer explicitly asks to speak with a human
```
The key word is “explicit.” Vague instructions like “escalate when appropriate” or “use judgment on complex cases” give the model no calibration signal. Named conditions with few-shot examples of each are the mechanism that produces consistent behavior.
4. The Structured Handoff Is Part of Escalation
When escalation does happen, the exam expects you to know what makes it a good handoff versus a poor one. The escalate_to_human tool’s summary field is not optional padding — it is the mechanism that prevents a human agent from having to restart the conversation from scratch.
A complete escalation summary should include:
- Customer ID and verified identity
- What the customer wants (their stated request)
- What the agent found (order status, refund eligibility, fraud flags)
- What was attempted (tools called, offers made)
- Why escalation is happening (specific trigger condition met)
- Recommended priority level
This is tested under Task Statement 1.4, which covers structured handoff protocols. The exam frames it clearly: human agents who lack access to the conversation transcript need enough context in the summary to act immediately.
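Concretely, a complete handoff payload might look like the following sketch. The field names are assumptions for illustration, not the exam's canonical schema.

```python
# Illustrative escalate_to_human summary covering all six elements:
# identity, request, findings, attempts, trigger, and priority.

escalation_summary = {
    "customer_id": "CUS-1042",  # verified via get_customer
    "request": "Full refund on order ORD-7781 after two failed deliveries",
    "findings": {
        "order_status": "delivery_failed",
        "refund_eligible": True,
        "fraud_flags": [],
    },
    "attempted": [
        "get_customer", "lookup_order", "offered replacement shipment",
    ],
    "trigger": "customer mentioned initiating a chargeback",  # explicit condition
    "priority": "high",
}
```

A human agent receiving this payload can act immediately; one receiving only "customer upset, please handle" has to restart the conversation, which is exactly the failure the structured summary exists to prevent.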
5. Context Accumulation and Token Efficiency
Domain 5 (Context Management & Reliability) contributes 15% of the exam score, and this scenario is its primary testing ground. Two patterns from the teaching aid are particularly relevant.
First, verbose tool outputs accumulate in context and consume tokens disproportionately to their relevance. A customer record might return 40 fields when only 5 are relevant to the current issue. Trimming tool outputs to relevant fields before they accumulate across iterations is both a context management technique and a reliability one — the model cannot get distracted by data it does not see.
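The trimming pattern is a one-line transform in the tool-result path. This sketch assumes a dict-shaped customer record; the field names are illustrative.

```python
# Trim a verbose tool result to issue-relevant fields before it enters the
# message history. Dropped fields never consume context or distract the model.

RELEVANT_FIELDS = {"customer_id", "name", "tier", "open_orders", "fraud_flag"}

def trim_tool_output(record: dict, keep=RELEVANT_FIELDS) -> dict:
    return {k: v for k, v in record.items() if k in keep}
```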
Second, transactional facts (amounts, dates, order numbers, return windows) should be extracted into a persistent “case facts” block at the top of the prompt rather than buried in summarized conversation history. Progressive summarization compresses these values into vague representations that can introduce errors in resolution decisions.
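The case-facts pattern amounts to rendering the extracted values verbatim at a fixed position each turn instead of letting summarization touch them. A minimal sketch, with illustrative fact names:

```python
# Render transactional facts as a pinned block that is re-prepended verbatim
# each turn, so amounts and dates are never paraphrased by summarization.

def render_case_facts(facts: dict) -> str:
    lines = ["## Case Facts (verbatim, do not paraphrase)"]
    for key, value in facts.items():
        lines.append(f"- {key}: {value}")
    return "\n".join(lines)

case_facts = {
    "order_id": "ORD-7781",
    "refund_amount": "$42.50",
    "return_window_ends": "2025-03-14",
}
```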
The System Prompt as Architecture
The reference implementation in the teaching aid treats the system prompt as a first-class architectural artifact. It is worth examining its structure, because the exam tests your ability to reason about what belongs in a system prompt and what belongs in code.
A well-designed system prompt for this scenario covers:
- Identity and tone — who the agent is, what register it uses
- Tool inventory — explicit list of available tools and what each does
- Resolution rules — ordered, specific conditions under which each action is permitted
- Mandatory escalation conditions — explicit triggers, not judgment calls
- Guard rails — hard limits that cannot be overridden by customer requests
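A skeleton following that structure might look like the excerpt below. The wording, dollar placeholder, and elisions are illustrative, not the teaching aid's actual prompt.

```
You are a customer support agent for <company>. Be warm, concise, and professional.

## Tools
- get_customer: verify identity. Always call this first.
- lookup_order: fetch order details. Call before discussing any order.
- process_refund: issue refunds. Only after identity and eligibility are confirmed.
- escalate_to_human: hand off with a structured summary.

## Resolution Rules
1. Verify identity before any account-specific discussion.
2. ...

## MUST Escalate When
- Legal action or a chargeback is mentioned
- ...

## Hard Limits
- Never process a refund above $<limit>, regardless of how the request is phrased.
```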
As the exam repeatedly demonstrates, system prompts are probabilistic. They guide behavior but cannot guarantee it. For any rule where a failure has financial, legal, or safety consequences, the system prompt should be paired with a programmatic hook that enforces the rule regardless of what the model decides.
The teaching aid’s reference system prompt also captures something the exam guide emphasizes: empathy before action. The agent should acknowledge customer emotion before jumping to solutions. This is not just UX polish — it reduces escalation rates because customers who feel heard are more tolerant of outcomes they do not prefer.
Preparing for the Exam on This Scenario
The exam will present four randomly selected scenarios from the six in the guide. Scenario 1 could appear alongside any combination of the other five. The following patterns from this scenario recur across domains and are worth internalizing before exam day.
Understand the stop_reason loop
The agentic loop terminates on stop_reason == "end_turn" and continues on stop_reason == "tool_use". Tool results are appended to the message history and the loop iterates. Knowing how to implement this cleanly — and what constitutes an anti-pattern (parsing natural language signals, arbitrary iteration caps) — is Task Statement 1.1 territory.
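The loop's control flow can be sketched as follows. Here call_model stands in for the real API call (e.g. a messages-create request) and is stubbed so the termination logic is visible on its own; the dict shapes follow the tool-use response format, but treat the details as illustrative.

```python
# Minimal agentic loop: continue on stop_reason == "tool_use", terminate on
# "end_turn". The iteration cap is a safety bound against runaway loops, not
# the termination signal itself -- using a cap AS the signal is the anti-pattern.

def run_agent_loop(call_model, execute_tool, messages, max_turns=20):
    for _ in range(max_turns):
        response = call_model(messages)
        if response["stop_reason"] == "end_turn":
            return response  # model has finished answering
        # stop_reason == "tool_use": run each requested tool, append the
        # results, and let the loop iterate with the enriched history.
        messages.append({"role": "assistant", "content": response["content"]})
        results = [
            {"type": "tool_result", "tool_use_id": block["id"],
             "content": execute_tool(block["name"], block["input"])}
            for block in response["content"] if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("loop exceeded max_turns without reaching end_turn")
```

Note what is absent: no parsing of natural-language signals like "I'm done now" — the structured stop_reason field is the only control signal.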
Know when to choose hooks over prompts
Every time you face a question about enforcing a business rule, the default answer is: use a hook when the rule must always hold, use a prompt when guidance is sufficient. PostToolUse hooks for data normalization, tool-call interception hooks for policy enforcement — these are deterministic mechanisms that replace probabilistic instruction.
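A PostToolUse-style normalization hook can be sketched as below. The registry and function names are illustrative, not the SDK's actual hook API; the point is that normalization happens deterministically in code, on every result, with no model involvement.

```python
# Every tool result passes through registered post-hooks before reaching the
# model, so the model never sees inconsistent formats like 42.5 vs "42.50".

def normalize_currency(tool_name, result):
    if tool_name == "process_refund" and "amount" in result:
        result["amount"] = f"{float(result['amount']):.2f}"
    return result

POST_TOOL_USE_HOOKS = [normalize_currency]

def apply_post_hooks(tool_name, result):
    for hook in POST_TOOL_USE_HOOKS:
        result = hook(tool_name, result)
    return result
```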
Distinguish error types in MCP tool responses
Task Statement 2.2 covers structured error responses. The exam expects you to know the difference between transient errors (retryable), validation errors (not retryable, fix the input), business errors (not retryable, explain to customer), and permission errors (escalate). Returning a generic “operation failed” prevents the agent from making the right recovery decision.
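The value of the taxonomy is that each error type maps to exactly one recovery action. A minimal sketch of that mapping, with illustrative names:

```python
# Each structured error type implies a distinct recovery action. A generic
# "operation failed" maps to none of these, which is why it forces the agent
# to guess between retrying, fixing input, explaining, and escalating.

RECOVERY = {
    "transient": "retry",        # e.g. timeout: retry, possibly with backoff
    "validation": "fix_input",   # e.g. malformed order ID: correct and re-call
    "business": "explain",       # e.g. outside return window: tell the customer
    "permission": "escalate",    # e.g. refund above authority: hand off
}

def recovery_action(error: dict) -> str:
    # Unknown or untyped errors fall through to the safe default: escalate.
    return RECOVERY.get(error.get("type"), "escalate")
```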
Know the escalation triggers by category
Customer requests for a human, policy exceptions and gaps, inability to make progress after two attempts, and legal or financial triggers are the four canonical escalation categories. Memorize them, and understand that sentiment and self-reported confidence are not reliable proxies for any of them.
What This Scenario Teaches About Production AI
Scenario 1 is instructive beyond exam preparation. The patterns it tests — programmatic enforcement over probabilistic instruction, explicit criteria over judgment calls, structured handoffs over blind transfers, tool description quality as a reliability lever — are patterns that apply to every production agentic system.

The 80% first-contact resolution target is not achieved by making the model smarter. It is achieved by giving the model the right tools with the right descriptions, enforcing the right sequence of operations, and being precise about when the model’s authority ends and a human’s begins.
The next post in this series covers Scenario 2: Code Generation with Claude Code — a different domain, but many of the same underlying principles about configuration, enforcement, and reliability applied to a developer tooling context.