a
Teaching Aid · Scenario 06

Structured
Data
Extraction

Build systems that transform unstructured documents — invoices, contracts, medical records, emails — into validated, schema-conformant JSON. Learn to handle edge cases gracefully and integrate with downstream systems.

JSON Schema Validation High Accuracy Edge Case Handling Downstream Integration
Document → JSON Transformation✓ Valid
Source Document

Invoice from Acme Corp dated March 4, 2025.

Bill to: TechStart Ltd
Services rendered: $4,250.00
Payment due: March 18, 2025
Terms: Net 14

"vendor": "Acme Corp",
"date": "2025-03-04",
"client": "TechStart Ltd",
"amount": 4250.00,
"due_date": "2025-03-18",
"terms": "Net 14",
"currency": "USD"

The Extraction Stack

CONCEPT_01
📄

Unstructured → Structured

Real-world documents are messy: dates in 12 different formats, amounts as "four thousand dollars" or "$4,000" or "4000 USD", missing fields, tables, footnotes, scanned PDFs. Claude must normalize these into a clean, consistent schema every time.

CONCEPT_02
🧬

JSON Schema as Contract

The JSON Schema is not optional decoration — it's the contract between extraction and downstream systems. Every field type, format, and required constraint must be defined upfront. Claude's output is validated against it on every run.

CONCEPT_03

Validation Over Trust

Never send Claude's raw output directly to a downstream system. Always validate against the schema first. Validation catches type errors, missing required fields, and format violations before they corrupt your database or crash an API call.

CONCEPT_04
🎯

Confidence Scores

Ask Claude to return a confidence score per field (0.0–1.0). Low-confidence fields trigger human review rather than silent errors. This turns extraction from a black box into a measurable, improvable process.

CONCEPT_05
⚠️

Graceful Null Handling

Missing fields must return null, not an invented value. Hallucinated data is worse than missing data — a wrong invoice amount causes real financial damage. The system prompt must explicitly instruct: "If a field is absent, return null."

CONCEPT_06
🔗

Downstream Integration

Extracted data feeds ERPs, CRMs, databases, and APIs. Each has specific format requirements. Design the schema around the downstream consumer, not the source document. Normalize dates to ISO 8601, amounts to numeric, enums to fixed values.

Live Extraction — Select a Document

Document:
📄 invoice_acme_march_2025.pdfUnstructured
ACME CORP
123 Business Ave, Chicago IL
[email protected]

INVOICE No. INV-2025-0341
Date: March 4, 2025
Payment Due: March 18, 2025

Bill To:
TechStart Ltd
attn: Accounts Payable

DescriptionQtyUnitTotal
Consulting services17$200$3,400.00
Travel expenses1$850$850.00

Total Due: $4,250.00 USD
Terms: Net 14
Extracted JSON
Click "Extract" to run Claude extraction.

Common Extraction Schemas

🧾InvoiceFinance
📑Contract ClauseLegal
🏥Medical NoteClinical
📧Support EmailCRM
🏠Property ListingReal Estate
invoice_v2
Extracts all billing data from vendor invoices for ERP integration and accounts payable automation.
vendor_namestrLegal name of the issuing vendorrequired
invoice_numberstrUnique identifier on the invoiceoptional
invoice_datestrIssue date, ISO 8601 (YYYY-MM-DD)required
due_datestrPayment due date, ISO 8601optional
total_amountnumTotal amount due, numeric (not string)required
currencystrISO 4217 currency code: USD, EUR, GBP…optional
tax_amountnumTax component if separately statedoptional
line_itemsarrArray of individual invoice line itemsoptional
_confidenceobjPer-field confidence scores 0.0–1.0required

End-to-End Architecture

1
Ingest & Preprocess
Receive document (PDF, DOCX, TXT, image). Convert to text via OCR if needed. Truncate at context limit. Tag document type.
2
Select Schema
Route to the correct schema based on document type detection or explicit caller metadata. Load the matching system prompt.
3
Extract with Claude
Pass system prompt + document text. Claude returns JSON with per-field confidence scores in _confidence object.
4
Parse & Validate
Parse JSON (handle markdown fences). Run jsonschema.validate(). If validation fails → retry with repair (max 2 attempts).
5
Confidence Routing
Compute overall confidence. Route: ≥0.90 → auto-approve, 0.70-0.89 → soft review, <0.70 → manual queue.
6
Downstream Delivery
Auto-approved records post to the target API/database. Review queue items surface in the review UI. Manual items go to human operators.
The retry-with-repair pattern: When validation fails, don't discard the extraction. Send Claude the validation errors and ask it to fix only the failing fields. This is more efficient than a full re-extraction and corrects ~80% of validation failures in one round.

Confidence Routing

≥ 0.90 — Auto-approve · Send directly to downstream system
0.70 – 0.89 — Soft review · Flag for human spot-check before sending
< 0.70 — Manual review · Hold and assign to human queue

What Goes Wrong — and How to Handle It

💱
Ambiguous Currency
$4,250 — USD or CAD?
Medium

When a document shows "$4,250" without a currency indicator, Claude cannot know if it's USD, CAD, AUD, or another dollar. Returning USD by assumption is a hallucination.

The system prompt must say: "If currency is ambiguous (e.g. $ without country context), return null for currency and set confidence to 0.5." This triggers human review for that field while allowing the rest to auto-approve.

💡 Strategy: Return null + low confidence for ambiguous fields. Never assume a default currency. Add a "currency_context" field where Claude can note the ambiguity.
📅
Date Format Chaos
"March 4th", "04/03/25", "03-04-2025"
Easy

Dates appear in dozens of formats. ISO 8601 normalization is straightforward for Claude but requires explicit instruction — otherwise it echoes the source format.

System prompt: "Normalize ALL dates to ISO 8601 (YYYY-MM-DD). If year is ambiguous (e.g. '25' could be 2025 or 1925), use the most recent plausible year and set confidence to 0.7." Also watch for European vs. US format ambiguity: 04/03/25 is April 3 or March 4 depending on locale.

💡 Strategy: Explicitly state date format in prompt. For locale-ambiguous dates, include source locale in document metadata and pass it in the extraction prompt.
🕳
Missing Required Fields
Invoice with no invoice number
Easy

Some documents simply don't contain certain fields. A handwritten invoice may have no formal invoice number. The wrong response is to invent one; the right response is null.

Without explicit instruction, Claude sometimes fabricates plausible-looking values ("INV-001") to seem helpful. This is worse than null — it silently corrupts records. The system prompt must say: "If a field is absent, return null — never generate a plausible value."

💡 Strategy: JSON Schema should mark required fields, but Claude should still return null for them when absent. The validation error then triggers human review rather than a silent hallucination.
📉
Partial / Truncated Documents
PDF extraction cut off mid-page
Hard

OCR failures, corrupted PDFs, and context window limits can result in Claude receiving an incomplete document. It may extract from the visible portion without flagging that data is missing.

Always tell Claude the expected document structure: "This is an invoice. If you do not see line items, totals, or other expected sections, set the relevant fields to null and note in a _parsing_notes field that the document appears truncated."

💡 Strategy: Add a "_parsing_notes" string field to your schema. Prompt Claude to use it for any concerns about document completeness, OCR quality, or ambiguity. Review all records with non-empty _parsing_notes.
🔁
Duplicate / Multi-Document Packets
3 invoices scanned together
Hard

A scanned PDF may contain multiple documents — 3 invoices, a cover letter and a contract, or a chain of email replies. The extractor must decide whether to extract one record or multiple.

The system prompt should specify: "If the document contains multiple distinct records (e.g. multiple invoices), return an array of extraction objects. If it is a single document with attachments, extract only the primary document and note attachments in _parsing_notes."

💡 Strategy: Wrap your schema in an array type at the top level and instruct Claude to use a single-element array for single documents and multi-element for multi-document packets.
🔢
Amount Calculation Errors
Line items don't sum to total
Medium

Claude extracts line items and a total. But sometimes the document has a math error — the vendor's total doesn't match the sum of line items. Should Claude correct the total or extract it verbatim?

The answer is verbatim extraction + a validation flag. Add a post-processing step: calculate sum(line_items.total) and compare to total_amount. If they differ by more than 0.01, set a "_validation_warnings" flag: "line_item_sum_mismatch: 4250.00 vs 4280.00". Do not silently correct the total.

💡 Strategy: Never have Claude perform arithmetic corrections. Extract what's in the document, then add a post-extraction calculation check in your Python validation layer.

Reference Implementation

extractor.py
# extractor.py — Core extraction with validation + retry import json, jsonschema from anthropic import Anthropic from typing import Any client = Anthropic() def extract(document: str, schema: dict, system_prompt: str, max_retries: int = 2) -> dict[str, Any]: messages = [{"role": "user", "content": f"Extract from this document:\n\n{document}"}] for attempt in range(max_retries + 1): response = client.messages.create( model="claude-sonnet-4-6", max_tokens=2048, system=system_prompt, messages=messages) raw = response.content[0].text clean = raw.strip() if clean.startswith("```"): clean = clean.split("\n", 1)[1].rsplit("```", 1)[0] try: extracted = json.loads(clean) except json.JSONDecodeError as e: if attempt == max_retries: raise ExtractionError(f"Invalid JSON: {e}") messages += [{"role":"assistant","content":raw},{"role":"user","content":f"JSON parse error: {e}. Return valid JSON only."}] continue errors = validate_schema(extracted, schema) if not errors: return extracted if attempt == max_retries: raise ValidationError("Schema validation failed", errors, extracted) error_summary = "\n".join(f"- {e}" for e in errors) messages += [{"role":"assistant","content":raw},{"role":"user","content":f"Validation failed:\n{error_summary}\n\nFix only the failing fields."}] def validate_schema(data, schema): validator = jsonschema.Draft7Validator(schema) return [e.message for e in sorted(validator.iter_errors(data), key=lambda e: e.path)]
validator.py
from dataclasses import dataclass from enum import Enum class RouteDecision(Enum): AUTO_APPROVE = "auto" # confidence ≥ 0.90 SOFT_REVIEW = "review" # confidence 0.70–0.89 MANUAL = "manual" # confidence < 0.70 @dataclass class ExtractionResult: data: dict; confidence: dict; overall_conf: float route: RouteDecision; validation_pass: bool; low_conf_fields: list def validate_and_route(extracted, schema): errors = validate_schema(extracted, schema) if errors: raise ValidationError("Schema validation failed", errors, extracted) conf = extracted.pop("_confidence", {}) overall = sum(conf.values()) / len(conf) if conf else 0.5 low_conf = [k for k, v in conf.items() if v < 0.70] if overall >= 0.90 and not low_conf: route = RouteDecision.AUTO_APPROVE elif overall >= 0.70: route = RouteDecision.SOFT_REVIEW else: route = RouteDecision.MANUAL return ExtractionResult(data=extracted, confidence=conf, overall_conf=overall, route=route, validation_pass=True, low_conf_fields=low_conf)
system_prompts.txt
═══ INVOICE EXTRACTION ═════════════════════════════════ You are a data extraction specialist for financial documents. ## Output Rules - Return ONLY valid JSON. No preamble, no explanation, no markdown. - If a field is absent from the document, return null — never invent values. - Normalize dates to ISO 8601 (YYYY-MM-DD). - Normalize amounts to numeric: 1234.56 not "$1,234.56". - For each field, include a _confidence score (0.0–1.0). ## Confidence Guide 1.0 — Explicitly stated | 0.8 — Reliably inferred 0.6 — Ambiguous | 0.3 — Low-evidence guess ═══ MEDICAL NOTE EXTRACTION ════════════════════════════ HIPAA applies. Do not add, infer, or embellish any clinical detail. - Use exact phrases from the document for diagnoses. - Medications must include exact dosage as stated. - Confidence < 0.8 for any clinical field triggers mandatory human review. ═══ SHARED RULES (all document types) ═════════════════ NEVER: Hallucinate values | Use today's date as fallback Round amounts | Return partial JSON
schemas/invoice.json
{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Invoice", "type": "object", "required": ["vendor_name", "total_amount", "invoice_date"], "additionalProperties": false, "properties": { "vendor_name": { "type": ["string", "null"] }, "invoice_number": { "type": ["string", "null"] }, "invoice_date": { "type": ["string", "null"], "format": "date" }, "due_date": { "type": ["string", "null"], "format": "date" }, "total_amount": { "type": ["number", "null"], "minimum": 0 }, "currency": { "type": ["string", "null"], "enum": ["USD","EUR","GBP","JPY","CAD",null] }, "line_items": { "type": "array", "items": { "type": "object", "required": ["description","total"], "properties": { "description": {"type":"string"}, "quantity": {"type":["number","null"]}, "unit_price": {"type":["number","null"]}, "total": {"type":"number"} }}}, "_confidence": { "type": "object", "additionalProperties": { "type": "number", "minimum": 0, "maximum": 1 }} } }

What Good Extraction Looks Like

99%
Schema Validation Pass Rate
95%
Field Accuracy (vs. ground truth)
<2%
Hallucination Rate
90%
Auto-Approval Rate (conf ≥ 0.90)

Measuring Accuracy Against Ground Truth

Field Accuracy — For each field, compare extracted value to manually verified ground truth. Target: 95%+. Measure per document type separately.

Null Precision — Of the fields Claude returned as null, what % were genuinely absent vs. missed extractions?

Hallucination Rate — The % of non-null fields where Claude returned a value not present in the document. Target: <2%. Any value with no textual support is a hallucination.

Auto-Approval Rate — What % of documents pass without human review? Higher is better for efficiency, but only if accuracy stays high.

Assessment