pipeline: PASSING · scenario05 / CI-CD · coverage 94% · false-positive rate < 4% · PR review time ~45s · build #2,841
Scenario 05 · Teaching Aid

Claude Code in CI/CD

Design automated code review, test generation, and PR feedback systems using Claude Code in your CI pipeline. Learn to write prompts that produce actionable, low-noise output.

Automated Review · Test Generation · PR Feedback · False Positive Control · GitHub Actions
PR #847 — add payment retry logic · run #2841
checkout & install · 8s · passed
lint & typecheck · 12s · passed
unit tests · 34s · 94% cov
claude code review · ~45s · running
claude test gen · ~60s · queued
deploy to staging · waiting

Why Claude in CI?

01🔍

Automated Code Review

Claude reads the PR diff and project context to surface security issues, logic errors, performance problems, and style violations — before a human reviewer even opens the PR.

02🧪

Test Case Generation

Given a new function or changed code path, Claude generates test cases covering the happy path, edge cases, and error conditions. It reads existing test files first to match the project's testing patterns.

03💬

PR Feedback Comments

Claude posts structured, line-level comments on the PR via GitHub API. Comments are categorized by severity (CRITICAL, WARNING, SUGGESTION) so developers know what must be fixed.

04🎯

Actionable Over Verbose

The #1 failure mode: CI feedback nobody reads. Every comment must include what the problem is, why it matters, and a specific fix. Vague comments create noise.

05🚫

Minimizing False Positives

A false positive erodes developer trust rapidly. Once trust is lost, developers ignore all Claude feedback — including real bugs. False positive rate must stay below 5%.
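The threshold can also be enforced in code rather than prompt alone. A sketch, assuming the review prompt asks Claude to self-report a 0–1 "confidence" per finding (that field is an assumption, not part of any standard schema):

```python
# Sketch: drop low-confidence findings before posting anything.
# Assumes each finding carries a model-reported 0-1 "confidence" field;
# findings missing the field are treated as uncertain and dropped.

CONFIDENCE_FLOOR = 0.85  # mirrors the "85% confident" prompt rule

def apply_confidence_gate(issues: list[dict]) -> list[dict]:
    return [i for i in issues if i.get("confidence", 0.0) >= CONFIDENCE_FLOOR]
```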

06

Speed as a Requirement

CI feedback that takes 10 minutes is ignored. Claude review must complete in under 90 seconds. This constrains prompt design: send only the diff plus targeted context, never the full repository.
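Keeping review fast mostly means keeping the input small and predictable. A minimal sketch of per-file diff budgeting (the budget numbers are illustrative choices, not recommendations from this scenario):

```python
# Sketch: cap the diff sent to the model so review time stays predictable.
# Splits a unified diff on file headers and gives each file a fair share
# of the total character budget. Budget numbers are illustrative.

def budget_diff(diff: str, total_budget: int = 60_000) -> str:
    files = ["diff --git" + part for part in diff.split("diff --git") if part.strip()]
    if not files:
        return diff[:total_budget]
    per_file = total_budget // len(files)
    trimmed = []
    for f in files:
        if len(f) > per_file:
            f = f[:per_file] + "\n[FILE DIFF TRUNCATED]"
        trimmed.append(f)
    return "\n".join(trimmed)
```

Per-file budgeting avoids the failure mode where one giant generated file crowds every other change out of the prompt.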

Live Review — Select a Diff

PR Diff:
📄 src/payments/retry.ts (+18)

@@ -12,7 +12,25 @@ export class PaymentService {
   async processPayment(order: Order): Promise {
     const charge = await this.stripe.charge(order.amount);
-    return { success: true, chargeId: charge.id };
+    if (charge.status === "succeeded") {
+      return { success: true, chargeId: charge.id };
+    }
+    // Retry logic
+    for (let i = 0; i < 5; i++) {
+      await sleep(1000 * i);
+      const retry = await this.stripe.charge(order.amount);
+      if (retry.status === "succeeded") {
+        return { success: true, chargeId: retry.id };
+      }
+    }
+    return { success: false, error: "Payment failed after retries" };
   }
Claude Review Output

Build Your Review Prompt

Toggle components to compose a production-grade CI review prompt. Each component controls what Claude focuses on and how it formats output.

Review Scope
Security vulnerabilities
Logic & correctness errors
Performance issues
Style & formatting
Missing documentation
Output Format
Severity classification
Line number references
Suggested fix per issue
JSON output (for automation)
False Positive Controls
Skip style nits
Only flag if confident
Include positive feedback
system_prompt.txt — live preview
You are a senior code reviewer integrated into CI. Your role: surface real problems in PR diffs, not style preferences.

## Review Scope
Focus ONLY on the following categories:

### SECURITY
Flag: SQL injection, XSS, insecure deserialization, exposed secrets, authentication bypasses, insecure direct object references.

### LOGIC & CORRECTNESS
Flag: off-by-one errors, incorrect conditionals, unhandled error paths, race conditions, incorrect return values.

### PERFORMANCE
Flag: N+1 queries, missing indexes on frequent queries, synchronous I/O in hot paths, unbounded loops.

## Output Format
- Prefix each finding with its severity: [CRITICAL], [WARNING], or [SUGGESTION].
- Include the file path and line number for each finding.
- Include a specific, actionable fix for every finding.

## Rules
- Do NOT flag style nits (spacing, naming preference, formatting).
- Only report a finding if you are 85% confident it is a genuine problem. Omit uncertain findings entirely; silence is better than noise.
- Do not include positive feedback or praise.
- Ignore commented-out code unless it contains secrets.
- Do not flag TODO/FIXME comments; those are tracked separately.
- If the diff has no issues in scope, return an empty issues list.
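When the JSON output toggle is on, the prompt should also pin down the shape it expects. A sketch of the structure the CI scripts below consume, plus a defensive validator; the field names beyond issues, severity, file, line, and summary are assumptions inferred from those scripts, not a published schema.

```python
# Sketch: validate the findings JSON before acting on it in CI.
# Shape is inferred from how scripts/claude_review.py reads the response;
# treat it as an assumption, not a fixed contract.

import json

SEVERITIES = {"CRITICAL", "WARNING", "SUGGESTION"}

def parse_findings(raw: str) -> dict:
    findings = json.loads(raw)
    assert isinstance(findings.get("issues"), list), "missing 'issues' list"
    assert isinstance(findings.get("summary"), str), "missing 'summary'"
    for issue in findings["issues"]:
        assert issue.get("severity") in SEVERITIES, f"bad severity: {issue}"
        assert "file" in issue and "line" in issue, f"missing location: {issue}"
    return findings

# Illustrative example of the expected shape.
example = """{
  "summary": "1 blocking issue found",
  "issues": [
    {"severity": "CRITICAL", "file": "src/payments/retry.ts",
     "line": 14, "problem": "...", "fix": "..."}
  ]
}"""
```

Validating before posting means a malformed model response fails the review step loudly instead of posting garbage comments.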

CI Integration Code

.github/workflows/claude-review.yml
# .github/workflows/claude-review.yml
name: Claude Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  claude-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Get PR diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD \
            -- '*.ts' '*.py' '*.go' '*.js' \
            > /tmp/pr_diff.txt
      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code
      - name: Run Claude Review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python scripts/claude_review.py
        timeout-minutes: 3

  claude-test-gen:
    runs-on: ubuntu-latest
    needs: claude-review
    if: github.event.pull_request.draft == false
    steps:
      - uses: actions/checkout@v4
      - name: Generate missing tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python scripts/claude_testgen.py
# .gitlab-ci.yml — Claude review stage
stages:
  - test
  - review
  - deploy

claude-review:
  stage: review
  image: python:3.12-slim
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  variables:
    DIFF_BASE: $CI_MERGE_REQUEST_DIFF_BASE_SHA
  before_script:
    - pip install anthropic requests
    - git fetch origin $CI_MERGE_REQUEST_TARGET_BRANCH_NAME
  script:
    - git diff $DIFF_BASE...HEAD > /tmp/pr_diff.txt
    - python scripts/claude_review.py
  timeout: 3 minutes
  allow_failure: true
# scripts/claude_review.py
import os, json, requests
from anthropic import Anthropic

client = Anthropic()
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
PR_NUMBER = os.environ["PR_NUMBER"]
REPO = os.environ["GITHUB_REPOSITORY"]

# REVIEW_SYSTEM_PROMPT (the system prompt shown above), format_comment,
# and post_summary are defined elsewhere in this script; omitted here.

def run_review():
    diff = open("/tmp/pr_diff.txt").read()
    if len(diff) > 80_000:
        diff = diff[:80_000] + "\n\n[DIFF TRUNCATED]"

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=REVIEW_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Review this PR diff:\n\n{diff}"
        }]
    )

    raw = response.content[0].text
    clean = raw.replace("```json", "").replace("```", "").strip()
    findings = json.loads(clean)

    # Only CRITICAL and WARNING findings become line-level comments.
    for f in findings["issues"]:
        if f["severity"] in ["CRITICAL", "WARNING"]:
            post_pr_comment(f)

    post_summary(findings["summary"], findings["issues"])

    # Fail the CI job if any CRITICAL finding remains.
    critical_count = sum(
        1 for f in findings["issues"] if f["severity"] == "CRITICAL"
    )
    if critical_count > 0:
        raise SystemExit(1)

def post_pr_comment(finding):
    url = f"https://api.github.com/repos/{REPO}/pulls/{PR_NUMBER}/comments"
    requests.post(url, json={
        "body": format_comment(finding),
        "path": finding["file"],
        "line": finding["line"],
        "side": "RIGHT"
    }, headers={"Authorization": f"Bearer {GITHUB_TOKEN}"})

if __name__ == "__main__":
    run_review()
# scripts/claude_testgen.py
import os, subprocess
from anthropic import Anthropic

client = Anthropic()

# TEST_GEN_PROMPT, derive_test_path, and write_tests are defined
# elsewhere in this script; omitted here.

def generate_tests():
    changed = subprocess.check_output([
        "git", "diff", "--name-only", "origin/main...HEAD"
    ]).decode().strip().split("\n")

    # Skip test/spec files; only generate tests for changed source files.
    source_files = [
        f for f in changed
        if f.endswith((".ts", ".py")) and "test" not in f and "spec" not in f
    ]

    for filepath in source_files:
        source = open(filepath).read()
        testpath = derive_test_path(filepath)
        existing = open(testpath).read() if os.path.exists(testpath) else ""
        pattern = find_test_pattern()

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=TEST_GEN_PROMPT,
            messages=[{"role": "user", "content": f"""
Source file: {filepath}
{source}

Existing tests (if any):
{existing}

Test pattern from project:
{pattern}
"""}]
        )
        generated = response.content[0].text
        write_tests(testpath, generated, existing)
        print(f"Generated tests for {filepath} → {testpath}")

def find_test_pattern() -> str:
    # Sample up to two existing test files so Claude matches project style.
    test_files = subprocess.check_output(
        ["find", ".", "-name", "*.test.ts", "-not", "-path", "*/node_modules/*"]
    ).decode().strip().split("\n")[:2]
    return "\n\n---\n\n".join(open(f).read() for f in test_files if f)

Prompt Design That Reduces Noise

False positives are the #1 trust killer for AI-powered CI. Each example shows a problematic prompt pattern alongside the improved version.

Vague Scope → Noise
Problematic Prompt
Review this diff for any issues you notice. Return all problems you find.
Returns 12 findings: 8 are style opinions, 3 are legitimate, 1 is a false positive. Developers stop reading after day 2.
Improved Prompt
Review for: security vulnerabilities, logic errors, unhandled exceptions. Skip all style, naming, and formatting concerns entirely. Only flag issues you are 85% confident are genuine problems.
Returns 3 findings, all actionable. Developer acceptance rate: 92%.
Missing Context → Wrong Flags
Problematic Prompt
Review this diff. Flag any hardcoded values.
Flags "const MAX_RETRIES = 3" as a hardcoded magic number. It's actually a deliberate, well-named constant. False positive.
Improved Prompt
Review this diff. Flag hardcoded values only when: 1. They are numeric literals inline in logic (not named constants) 2. They represent config that varies by environment Do NOT flag named constants with clear intent.
No false positives on legitimate constants. Only flags true inline magic numbers.
No Severity → Everything Feels Critical
Problematic Prompt
List all issues found in the diff.
Returns a flat list where "missing semicolon" appears next to "SQL injection". Developer can't distinguish importance.
Improved Prompt
Classify each finding as: [CRITICAL] Must fix before merge (security, data loss, crashes) [WARNING] Should fix (correctness, performance) [SUGGESTION] Optional improvement Only post CRITICAL and WARNING as blocking comments.
Developers immediately know what to fix. CRITICAL findings have 98% fix rate.
No Fix Suggested → Ignored Feedback
Problematic Prompt
Flag potential security issues in this diff.
Comment: "This may have a security vulnerability." Developer doesn't know what to change. Comment is dismissed.
Improved Prompt
For every finding, include: 1. What the problem is (specific, not vague) 2. Why it matters (impact: data loss? auth bypass?) 3. The exact fix - a corrected code snippet A finding without a fix is noise, not feedback.
"SQL injection on line 14: parameterize query - db.query(sql, [email])". Developer fixes in 30 seconds.
Reviewing Style in CI → Team Conflict
Problematic Prompt
Review code quality, including style and best practices.
Claude flags 6 style opinions that contradict the team's ESLint config. Creates friction and reduces trust in all Claude feedback.
Improved Prompt
DO NOT review: - Formatting, indentation, or whitespace - Naming conventions (the linter handles these) - Code style preferences - Performance micro-optimizations without profiling data Focus only on logic errors and security issues.
Zero style conflicts. Claude is seen as complementary to existing tools.
Reviewing the Whole Repo → Irrelevant Findings
Problematic Prompt
Review our codebase for security issues.
Claude flags 20 issues in files that weren't changed in this PR. Developers don't know which are new vs. pre-existing.
Improved Prompt
Review ONLY the changed lines in this diff. Do not comment on unchanged context lines. Do not flag pre-existing issues in unchanged code. Each finding must reference a + (added) or modified line.
Every finding is directly attributable to code in this PR. Immediately actionable.
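Scoping can also be enforced after the fact: parse the diff's hunk headers to collect the added line numbers per file, then drop any finding that does not land on one. A sketch using only stdlib unified-diff parsing:

```python
# Sketch: post-filter findings so every one references an added (+) line.
# Parses unified-diff hunk headers (@@ -a,b +c,d @@) to map each file
# to its set of newly added line numbers.

import re

def added_lines(diff: str) -> dict[str, set[int]]:
    lines_by_file: dict[str, set[int]] = {}
    current_file, new_line = None, 0
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[6:]
            lines_by_file.setdefault(current_file, set())
        elif line.startswith("@@"):
            m = re.search(r"\+(\d+)", line)
            new_line = int(m.group(1)) if m else 0
        elif line.startswith("+") and current_file:
            lines_by_file[current_file].add(new_line)
            new_line += 1
        elif not line.startswith("-"):
            new_line += 1  # context line advances the new-file counter
    return lines_by_file

def in_scope(findings: list[dict], diff: str) -> list[dict]:
    allowed = added_lines(diff)
    return [f for f in findings
            if f["line"] in allowed.get(f["file"], set())]
```

A mechanical filter like this catches the cases where the model drifts out of scope despite the prompt instruction.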
The confidence gate rule: Add this to every CI review prompt — "Only include a finding if you are at least 85% confident it is a genuine problem. If you're unsure, omit it. Silence is better than noise." This single instruction reduces false positive rates by 30–50%.

What Good Looks Like

<5% · False Positive Rate
90s · Max Review Time
85% · Dev Acceptance Rate
3+ · Bugs Caught per 100 PRs

Measuring and Improving Over Time

False Positive Rate — Track by asking developers to thumbs-down unhelpful comments. If rate exceeds 5%, audit the last 20 flagged items and tighten the prompt's confidence gate.

Review Time — Diff size is the main variable. Set a hard limit: truncate diffs at 80k characters. Use timeout-minutes: 3 in CI.

Acceptance Rate — What % of findings does the developer act on? Target 85%. Below 70% means feedback is too noisy or vague.

Bugs Caught — Track production issues that were flagged (but ignored) in Claude's review. Even 1 per month is strong ROI.
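The first two quality metrics fall out of simple per-finding records. A sketch, assuming each logged finding carries "thumbs_down" and "acted_on" flags (hypothetical field names for the developer-reaction data described above):

```python
# Sketch: compute review-quality metrics from logged findings.
# Each record is assumed to carry "thumbs_down" (dev marked it unhelpful)
# and "acted_on" (dev made a change in response): hypothetical field names.

def review_metrics(findings: list[dict]) -> dict:
    total = len(findings)
    if total == 0:
        return {"false_positive_rate": 0.0, "acceptance_rate": 0.0}
    fps = sum(1 for f in findings if f.get("thumbs_down"))
    acted = sum(1 for f in findings if f.get("acted_on"))
    return {
        "false_positive_rate": fps / total,   # target: < 0.05
        "acceptance_rate": acted / total,     # target: >= 0.85
    }
```

Run this weekly over the finding log; a rising false-positive rate is the signal to audit recent flags and tighten the confidence gate.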

Assessment