Structured Data Extraction
Build systems that transform unstructured documents — invoices, contracts, medical records, emails — into validated, schema-conformant JSON. Learn to handle edge cases gracefully and integrate with downstream systems.
The Extraction Stack
Unstructured → Structured
Real-world documents are messy: dates in 12 different formats, amounts as "four thousand dollars" or "$4,000" or "4000 USD", missing fields, tables, footnotes, scanned PDFs. Claude must normalize these into a clean, consistent schema every time.
JSON Schema as Contract
The JSON Schema is not optional decoration — it's the contract between extraction and downstream systems. Every field type, format, and required constraint must be defined upfront. Claude's output is validated against it on every run.
Validation Over Trust
Never send Claude's raw output directly to a downstream system. Always validate against the schema first. Validation catches type errors, missing required fields, and format violations before they corrupt your database or crash an API call.
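As a rough illustration of validating before forwarding, the sketch below checks a parsed record against a tiny hand-written field table. The field names, the nullable flags, and the `validate_invoice` helper are assumptions made for this example; in practice a full JSON Schema validator would take the place of this stand-in.

```python
from datetime import date

# Hypothetical, simplified contract: field name -> (expected type, nullable?)
INVOICE_FIELDS = {
    "invoice_number": (str, True),
    "invoice_date": (str, True),          # must be ISO 8601 when present
    "total_amount": ((int, float), False),
    "currency": (str, True),
}

def validate_invoice(record: dict) -> list:
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for name, (expected_type, nullable) in INVOICE_FIELDS.items():
        if name not in record:
            errors.append(f"{name}: key is missing")
            continue
        value = record[name]
        if value is None:
            if not nullable:
                errors.append(f"{name}: null is not allowed")
            continue
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"{name}: unexpected type {type(value).__name__}")
        elif name == "invoice_date":
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: not an ISO 8601 date")
    return errors
```

A record is forwarded downstream only when `validate_invoice(record)` returns an empty list; anything else is held back with its error messages.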
Confidence Scores
Ask Claude to return a confidence score per field (0.0–1.0). Low-confidence fields trigger human review rather than silent errors. This turns extraction from a black box into a measurable, improvable process.
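One way to act on per-field confidence is a small routing function. This sketch assumes each extracted field is shaped like `{"value": ..., "confidence": ...}` and uses an illustrative threshold of 0.8; both the shape and the threshold are assumptions, not values given in this guide.

```python
REVIEW_THRESHOLD = 0.8  # assumed cut-off; tune it against measured accuracy

def fields_needing_review(extraction: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return the names of fields whose confidence falls below the threshold.

    A field without a confidence score is treated as needing review.
    """
    flagged = []
    for name, field in extraction.items():
        confidence = field.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(name)
    return flagged

def route(extraction: dict) -> str:
    """Send the document to a human if any field was flagged."""
    return "human_review" if fields_needing_review(extraction) else "auto_approve"
```

Logging the flagged field names alongside the reviewer's correction gives the data needed to adjust the threshold over time.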
Graceful Null Handling
Missing fields must return null, not an invented value. Hallucinated data is worse than missing data — a wrong invoice amount causes real financial damage. The system prompt must explicitly instruct: "If a field is absent, return null."
Downstream Integration
Extracted data feeds ERPs, CRMs, databases, and APIs. Each has specific format requirements. Design the schema around the downstream consumer, not the source document. Normalize dates to ISO 8601, amounts to numeric, enums to fixed values.
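A downstream-side guard for "amounts to numeric" might look like the following sketch. The `normalize_amount` helper is hypothetical and deliberately strict: it assumes US-style separators (comma for thousands, point for decimals) and returns `None` for anything it does not recognize, such as spelled-out amounts, rather than guessing.

```python
import re
from decimal import Decimal
from typing import Optional

# Optional "$", digits with optional thousands commas, optional cents,
# optional three-letter currency code (which is ignored here).
_AMOUNT = re.compile(
    r"^\s*\$?\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?\s*(?:[A-Za-z]{3})?\s*$"
)

def normalize_amount(raw: str) -> Optional[Decimal]:
    """Convert strings like '$4,000' or '4000 USD' to Decimal, else None."""
    match = _AMOUNT.match(raw)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    cents = match.group(2) or ""
    return Decimal(whole + cents)
```

`Decimal` is used instead of `float` so that monetary values are not subject to binary rounding before they reach the ERP or database.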
Live Extraction: Sample Invoice
123 Business Ave, Chicago IL
[email protected]
INVOICE No. INV-2025-0341
Date: March 4, 2025
Payment Due: March 18, 2025
Bill To:
TechStart Ltd
attn: Accounts Payable
| Description | Qty | Unit | Total |
|---|---|---|---|
| Consulting services | 17 | $200 | $3,400.00 |
| Travel expenses | 1 | $850 | $850.00 |
Total Due: $4,250.00 USD
Terms: Net 14
Common Extraction Schemas
End-to-End Architecture
Confidence Routing
What Goes Wrong — and How to Handle It
When a document shows "$4,250" without a currency indicator, Claude cannot know whether it means USD, CAD, AUD, or another dollar-denominated currency. Returning USD by assumption is a hallucination.
The system prompt must say: "If currency is ambiguous (e.g. $ without country context), return null for currency and set confidence to 0.5." This triggers human review for that field while allowing the rest to auto-approve.
Dates appear in dozens of formats. ISO 8601 normalization is straightforward for Claude but requires explicit instruction — otherwise it echoes the source format.
System prompt: "Normalize ALL dates to ISO 8601 (YYYY-MM-DD). If year is ambiguous (e.g. '25' could be 2025 or 1925), use the most recent plausible year and set confidence to 0.7." Also watch for European vs. US format ambiguity: 04/03/25 reads as April 3 in US (month-first) notation and as March 4 in European (day-first) notation.
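The day/month ambiguity can also be detected in post-processing. The sketch below is an assumption-laden illustration: the `normalize_date` helper and its short format list are hypothetical, and it relies on the default English month names used by `strptime`. It tries every format and refuses to pick a winner when two formats yield different dates.

```python
from datetime import datetime

# Illustrative format list only; real documents need a longer one.
CANDIDATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d", "%m/%d/%y", "%d/%m/%y")

def normalize_date(raw: str, formats=CANDIDATE_FORMATS):
    """Return (iso_date, ambiguous).

    iso_date is None when nothing parsed or when formats disagree;
    ambiguous is True only in the disagreement case.
    """
    candidates = set()
    for fmt in formats:
        try:
            candidates.add(datetime.strptime(raw.strip(), fmt).date().isoformat())
        except ValueError:
            continue  # this format does not apply
    if len(candidates) == 1:
        return candidates.pop(), False
    return None, bool(candidates)
```

With this list, `normalize_date("04/03/25")` reports an ambiguity instead of a date, so the field can be sent to review, while `13/03/25` parses cleanly because only the day-first reading is valid.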
Some documents simply don't contain certain fields. A handwritten invoice may have no formal invoice number. The wrong response is to invent one; the right response is null.
Without explicit instruction, Claude sometimes fabricates plausible-looking values ("INV-001") to seem helpful. This is worse than null — it silently corrupts records. The system prompt must say: "If a field is absent, return null — never generate a plausible value."
OCR failures, corrupted PDFs, and context window limits can result in Claude receiving an incomplete document. It may extract from the visible portion without flagging that data is missing.
Always tell Claude the expected document structure: "This is an invoice. If you do not see line items, totals, or other expected sections, set the relevant fields to null and note in a _parsing_notes field that the document appears truncated."
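As a backstop for cases where the model extracts from a partial document without saying so, a post-check can compare empty sections against the `_parsing_notes` field. The section names and both helper functions here are assumptions chosen for illustration.

```python
EXPECTED_SECTIONS = ("line_items", "total_amount")  # assumed field names

def missing_sections(record: dict, expected=EXPECTED_SECTIONS) -> list:
    """List the expected sections that are null, absent, or an empty list."""
    return [name for name in expected if record.get(name) in (None, [])]

def needs_truncation_review(record: dict) -> bool:
    """True when expected sections are empty but no parsing note explains why."""
    return bool(missing_sections(record)) and not record.get("_parsing_notes")
```

A record that is missing sections and carries no note is the suspicious case: either the document was truncated upstream or the instruction was not followed.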
A scanned PDF may contain multiple documents — 3 invoices, a cover letter and a contract, or a chain of email replies. The extractor must decide whether to extract one record or multiple.
The system prompt should specify: "If the document contains multiple distinct records (e.g. multiple invoices), return an array of extraction objects. If it is a single document with attachments, extract only the primary document and note attachments in _parsing_notes."
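Because the output may then be either a single object or an array, the consuming code benefits from normalizing both shapes to a list. A minimal sketch, assuming the model's reply is a bare JSON string (the `parse_records` name is hypothetical):

```python
import json

def parse_records(raw_output: str) -> list:
    """Parse model output and always return a list of record dicts."""
    parsed = json.loads(raw_output)
    if isinstance(parsed, dict):
        return [parsed]  # single document: wrap it
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        return parsed    # multiple records: pass through
    raise ValueError("expected a JSON object or an array of objects")
```

Each element of the returned list can then go through the same validation and routing steps independently.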
Claude extracts line items and a total. But sometimes the document has a math error — the vendor's total doesn't match the sum of line items. Should Claude correct the total or extract it verbatim?
The answer is verbatim extraction plus a validation flag. Add a post-processing step: calculate sum(line_items.total) and compare it to total_amount. If they differ by more than 0.01, add a warning to "_validation_warnings", for example "line_item_sum_mismatch: 4250.00 vs 4280.00". Do not silently correct the total.
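The post-processing check can be sketched as follows. The record layout (`line_items` entries with a `total` key, plus `total_amount`) follows the field names used above; the function name and the ordering of the two numbers in the message (line-item sum first, stated total second) are assumptions.

```python
from decimal import Decimal

def check_line_item_sum(record: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return validation warnings; the extracted total is never modified."""
    line_sum = sum(
        (Decimal(str(item["total"])) for item in record["line_items"]),
        Decimal("0"),
    )
    stated_total = Decimal(str(record["total_amount"]))
    warnings = []
    if abs(line_sum - stated_total) > tolerance:
        warnings.append(
            f"line_item_sum_mismatch: {line_sum:.2f} vs {stated_total:.2f}"
        )
    return warnings
```

The returned list would be stored under `_validation_warnings`; values are converted through `str` into `Decimal` so that the comparison is not affected by float rounding.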
Reference Implementation
What Good Extraction Looks Like
Measuring Accuracy Against Ground Truth
Field Accuracy — For each field, compare extracted value to manually verified ground truth. Target: 95%+. Measure per document type separately.
Null Precision — Of the fields Claude returned as null, what % were genuinely absent vs. missed extractions?
Hallucination Rate — The % of non-null fields where Claude returned a value not present in the document. Target: <2%. Any value with no textual support is a hallucination.
Auto-Approval Rate — What % of documents pass without human review? Higher is better for efficiency, but only if accuracy stays high.
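For a single document, the first three metrics can be computed from a flat dict of extracted values and a matching dict of ground truth. This is a sketch under stated assumptions: the function name is hypothetical, and the hallucination figure is only a lower-bound proxy that counts values returned for fields the ground truth marks as absent, since ground truth alone cannot tell a misread value from an invented one.

```python
def extraction_metrics(extracted: dict, truth: dict) -> dict:
    """Compare one extraction against manually verified ground truth."""
    fields = list(truth)
    correct = [f for f in fields if extracted.get(f) == truth[f]]
    returned_null = [f for f in fields if extracted.get(f) is None]
    truly_absent = [f for f in returned_null if truth[f] is None]
    returned_value = [f for f in fields if extracted.get(f) is not None]
    # Proxy: a value was returned although the field is absent from the document.
    unsupported = [f for f in returned_value if truth[f] is None]

    def ratio(part, whole):
        return len(part) / len(whole) if whole else None

    return {
        "field_accuracy": ratio(correct, fields),
        "null_precision": ratio(truly_absent, returned_null),
        "hallucination_rate": ratio(unsupported, returned_value),
    }
```

Aggregating these per document type, rather than across the whole corpus, keeps a weak document type from hiding behind a strong one.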