Lesson 13: Thinking in Systems

Lesson 12 established that specs are scaffolding—temporary thinking tools deleted after implementation. But what makes a spec good enough to produce quality code?

Think of a spec as a zoom lens. Zoomed out, you see architecture—modules, boundaries, invariants. Zoomed in, you see implementation—edge cases, error handling, concurrency. You oscillate between views, and the spec sharpens through contact with implementation[1].

Precision Through Iteration

Vague specs produce vague code. Precision narrows the solution space:

| Vague | Precise |
| --- | --- |
| "Handle webhook authentication" | C-002: NEVER trust unsigned webhook — signature validation on line 1 of the handler |
| "Store payment data" | I-003: SUM(transactions) = account.balance — verified by generating 1K transactions and checking the sum after each batch |

But precision isn't achieved through contemplation alone—it's discovered through iteration[2]. Each pass through implementation reveals constraints the spec missed: a state transition you didn't anticipate, a concurrency edge case, an unrealistic performance budget. The bottleneck has shifted from "production" to "orchestration + verification"[3]—you orchestrate what gets built and verify it matches intent.

This has a practical consequence for debugging. When implementation diverges from intent, ask: is the architecture sound? If yes, fix the code—the agent made a mechanical error. If the model or boundaries are wrong, fix the spec and regenerate.

The Iterative Workflow

Start with three sections: Architecture, Interfaces, and State—enough to generate a first pass. The spec is a hypothesis. The code is an experiment. Implementation reveals what the spec missed: a state transition you didn't anticipate, a concurrency constraint, an unrealistic performance budget. Zoom out—extract the updated understanding from code via ChunkHound code research. Fix the architecture. Zoom back in—regenerate. Repeat until convergence, then delete the spec.

This is Lesson 3's four-phase cycle applied fractally. At the spec level: research the domain, plan architecture, write spec, validate completeness. At the code level: research codebase, plan changes, execute, validate tests. Each zoom transition—spec→code or code→spec—is itself a Research→Plan→Execute→Validate loop. The depth of iteration scales with complexity: a simple feature converges in one pass; a complex architectural change might take five.

The sections below are the questions this process surfaces. You won't answer them all upfront—you'll discover which ones matter because the code reveals gaps there.

Architecture: Modules, Boundaries, Contracts

Every system has internal structure. The architecture section forces you to make that structure explicit.

Modules

A module is a unit with a single responsibility. Not "handles payments"—that's a category. "Processes Stripe webhook events and updates payment state"—that's a responsibility.

| Module | Responsibility | Boundary |
| --- | --- | --- |
| webhook-handler | Process Stripe webhooks, update payment state | src/payment/webhooks/ |
| notification | Send emails on payment events | src/notification/ |

When you can't articulate what a module does in one sentence, it's doing too much.

Boundaries

Boundaries define what a module cannot import—the coupling constraint.

  • webhook-handler — NEVER imports from notification or order
  • webhook-handler — Publishes events to queue, consumers decide action

Boundaries prevent changes in one module from rippling through the system.
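
Boundary rules can be enforced mechanically rather than by convention. A minimal sketch using ESLint's built-in no-restricted-imports rule; the module paths follow this lesson's example layout, not a real codebase:

```ts
// eslint.config.js: fail the build when webhook-handler reaches across its boundary.
export default [
  {
    files: ['src/payment/webhooks/**'],
    rules: {
      'no-restricted-imports': ['error', {
        patterns: [{
          group: ['**/notification/**', '**/order/**'],
          message: 'webhook-handler must not import notification or order; publish an event instead.',
        }],
      }],
    },
  },
]
```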

Contracts

Contracts define how modules communicate—what the caller provides (preconditions) and what the callee guarantees (postconditions).

| Provider | Consumer | Contract |
| --- | --- | --- |
| webhook-handler | payment | processEvent(stripeEventId): PaymentIntent — precondition: event not yet processed |
| payment | notification | PaymentEvent { type, paymentId, amount, timestamp } — postcondition: immutable once published |
| payment | checkout | createIntent(orderId, amount): PaymentIntent — precondition: order exists and is unpaid |
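
In TypeScript, the Contract column translates almost directly into an interface. A sketch with names mirroring the table; preconditions live in doc comments because the type system can't express them:

```ts
interface PaymentIntent {
  id: string
  status: 'pending' | 'processing' | 'succeeded' | 'failed'
}

interface PaymentService {
  /** Precondition: event not yet processed (enforced by C-001's idempotency check). */
  processEvent(stripeEventId: string): Promise<PaymentIntent>
  /** Precondition: order exists and is unpaid. */
  createIntent(orderId: string, amount: number): Promise<PaymentIntent>
}
```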

Integration Points

Integration points are the doors in the boundary wall—where traffic crosses from external to internal or vice versa.

| Point | Type | Direction | Owner |
| --- | --- | --- | --- |
| /webhooks/stripe | HTTP endpoint | inbound | webhook-handler |
| /api/v1/payments | REST API | inbound | payment |
| payment-events | Message queue | internal pub/sub | payment |

Direction matters: inbound points need validation and rate limiting; internal pub/sub needs delivery guarantees. But direction alone doesn't explain why a particular validation exists—that requires stating what you believe about the external service.

Third-Party Assumptions

Integration points tell you where external services connect. Third-party assumptions capture what you believe about those services—behavioral guarantees your design silently depends on. When you don't make them explicit, design decisions appear arbitrary: an agent sees C-001 (idempotency check) but not the delivery semantic that demands it.

For the Stripe webhook system, the assumptions driving key design decisions are:

| Assumption | Source | Drives |
| --- | --- | --- |
| Webhooks deliver at-least-once, not exactly-once | Stripe docs | C-001 (idempotency), Redis lock, event-driven state model |
| Webhooks may arrive out of order | Stripe docs | State machine with explicit transitions |
| Payloads signed with HMAC-SHA256 | Stripe docs | C-002 (signature validation) |
| API availability ~99.99% | Stripe SLA | Circuit breaker, retry queue, manual fallback |

The Drives column is the point. It creates traceability from assumption to spec element—so when an assumption changes (you migrate from Stripe to Adyen, or Stripe changes delivery semantics), you know exactly which constraints, state models, and security decisions to revisit. Without it, a provider migration becomes an audit of the entire spec. With it, the audit is scoped to the rows whose assumptions no longer hold.

Extension Points

Not every integration point exists yet. When a specific variation is committed—funded, scheduled, required by a known deadline—declare the stable interface now so the current implementation doesn't cement itself.

| Variation | Stable Interface | Current | Planned By |
| --- | --- | --- | --- |
| PayPal checkout | PaymentGateway interface | Stripe-only implementation | Q3 — committed |
| Multi-currency | Amount { value, currency } | USD-hardcoded | Not committed — omit |

The principle is Protected Variation[4] (Cockburn/Larman): identify points of predicted variation and create a stable interface around them. The second row stays out—YAGNI gates which variations make it into the spec. Only committed business needs earn an abstraction.

Without this, agents build the simplest correct implementation—a hardcoded Stripe client. When PayPal arrives in Q3, that's a rewrite, not an extension. Declaring the interface now costs one abstraction; omitting it costs a migration.
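
What "declaring the interface now" costs is small. A sketch of the PaymentGateway stable interface; the method names are illustrative:

```ts
type PaymentIntent = { id: string; status: string }

interface PaymentGateway {
  createIntent(orderId: string, amount: number): Promise<PaymentIntent>
  verifyWebhook(rawBody: string, signature: string): boolean
}

// Today: class StripeGateway implements PaymentGateway.
// Q3: class PayPalGateway implements PaymentGateway, an extension rather than a
// rewrite, because call sites depend only on the interface.
```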

State: What Persists, What Changes, What Recovers

State is where bugs hide. The state section forces you to account for what the system remembers.

Entities

What persists beyond a single request? Where does it live? Who owns it?

| Entity | Persistence | Storage | Owner |
| --- | --- | --- | --- |
| PaymentIntent | persistent | payments table | payment service |
| WebhookEvent | persistent | webhooks table | payment service |
| ProcessingLock | ephemeral | Redis | payment service |

The distinction matters for crash recovery. If the process dies mid-operation, ephemeral state disappears. Your system must handle that.

State Models

How you model state determines how you think about transitions.

| Model | Use When | Tradeoff | Key Question |
| --- | --- | --- | --- |
| Declarative | UI rendering, infrastructure, schema convergence | Simple to reason about; needs a reconciler to diff and converge | "What should the end state be?" |
| Event-Driven | Webhooks, messaging, event sourcing, CQRS | Full audit trail and replay; eventual consistency, ordering complexity | "What happened, and in what order?" |
| State Machine | Payment lifecycles, order flows, approval chains | Illegal transitions are impossible; every edge must be enumerated upfront | "What transitions are legal from this state?" |

Declarative is increasingly the default across domains — React reconciles UI, Terraform reconciles infrastructure, SQL declares query results, GitOps reconciles deployments. The core pattern is always the same: desired_state + reconciliation_loop. You declare what, something else figures out how. When no reconciler exists for your domain, you're building one — that's the cost.
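
The pattern fits in a few lines. A minimal reconciler sketch; the domain (desired replica counts) and the scaleService callback are invented for illustration:

```ts
type State = Record<string, number>

// Converge actual toward desired: diff each key, patch only what diverges.
function reconcile(desired: State, actual: State, apply: (key: string, value: number) => void) {
  for (const [key, want] of Object.entries(desired)) {
    if (actual[key] !== want) apply(key, want)
  }
}

// Usage: reconcile({ web: 3 }, { web: 1 }, scaleService) issues one patch, scaling web to 3.
```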

Choose one model per entity. Payment lifecycle = state machine (pending → processing → succeeded/failed). Webhook ingestion = event-driven (append-only log, at-least-once delivery). Account balance = declarative (SUM(transactions) must converge to account.balance). The model shapes the code agents generate: state machines produce switch/case with explicit transitions, event-driven produces handlers and projections, declarative produces diff-and-patch reconcilers.
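
For the state-machine model, the spec's legal transitions become an explicit table in code, which is roughly the shape an agent should generate. A sketch of the payment lifecycle above:

```ts
type PaymentStatus = 'pending' | 'processing' | 'succeeded' | 'failed'

// Every legal edge enumerated upfront; anything absent is illegal by construction.
const transitions: Record<PaymentStatus, PaymentStatus[]> = {
  pending: ['processing'],
  processing: ['succeeded', 'failed'],
  succeeded: [],
  failed: [],
}

function transition(from: PaymentStatus, to: PaymentStatus): PaymentStatus {
  if (!transitions[from].includes(to)) throw new Error(`illegal transition: ${from} -> ${to}`)
  return to
}
```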

Error States

Errors aren't exceptions to your data model—they're part of it.

| Code | Meaning | Recovery |
| --- | --- | --- |
| PAYMENT_PENDING | Awaiting Stripe confirmation | Retry webhook check |
| PAYMENT_FAILED | Stripe declined | Notify user, allow retry |
| WEBHOOK_DUPLICATE | Already processed | Return 200, skip processing |

When you model error states explicitly, recovery paths become obvious.

Initialization and Crash Recovery

Systems don't start in steady state. Startup ordering and crash recovery determine whether a restart corrupts data or resumes cleanly.

| Order | Component | Depends On | Ready When | On Fail |
| --- | --- | --- | --- | --- |
| 1 | Database | — | Accepts connections | abort |
| 2 | Cache | Database | Ping succeeds | degrade |
| 3 | HTTP server | DB, Cache | Healthcheck 200 | retry 3×, abort |

If any startup step is not idempotent, a crash-and-restart can corrupt state. Specify what "ready" means for each component, and what happens when readiness fails.
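
The table translates to a boot sequence where each step declares what "ready" means and its failure policy. A sketch under the assumption that each readiness check is idempotent; the step functions are placeholders:

```ts
type OnFail = 'abort' | 'degrade' | { retry: number }

interface StartupStep {
  name: string
  ready: () => Promise<boolean> // idempotent readiness check, safe to re-run after a crash
  onFail: OnFail
}

async function boot(steps: StartupStep[]): Promise<void> {
  for (const step of steps) {
    const attempts = typeof step.onFail === 'object' ? step.onFail.retry : 1
    let ok = false
    for (let i = 0; i < attempts && !ok; i++) ok = await step.ready().catch(() => false)
    if (ok) continue
    if (step.onFail === 'degrade') continue // proceed without this component
    throw new Error(`${step.name} failed readiness, aborting startup`)
  }
}
```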


Architecture defines the internal skeleton—modules, boundaries, contracts. The next section flips the perspective: what does the system look like from the outside?

Picture a dashed line drawn around the system. Everything inside it is architecture: modules connected by contracts. Everything crossing it is an interface: data entering (inputs) or leaving (outputs) the system. Integration points are the doors in the wall.

Interfaces: Inputs and Outputs

Every system has a surface area—where data enters and exits. While architecture describes internal structure, interfaces describe the system's external surface: what crosses the boundary, in what format, and under what constraints.

Inputs

| Name | Source | Format | Validation | Rate Limit |
| --- | --- | --- | --- | --- |
| Stripe webhook | Stripe (HTTPS POST) | StripeEvent JSON | HMAC-SHA256 signature, timestamp < 5min | 10K/min |
| Payment request | Client app (REST API) | { orderId: UUID, amount: number } | JWT auth, orderId exists, amount > 0 | 100/min per client |

Every input crosses the boundary from an external source. The Format column is what you parse; the Validation column is what you reject; the Rate Limit column is what you throttle. Inputs without all three are bugs waiting to happen.
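
The webhook row's validation column is concrete enough to sketch. A generic HMAC-SHA256 check using Node's crypto module; this shows the general technique, not Stripe's exact header scheme (which also encodes a timestamp for the < 5min replay window):

```ts
import { createHmac, timingSafeEqual } from 'node:crypto'

// Recompute the signature over the raw body and compare in constant time.
function validSignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex')
  const a = Buffer.from(expected)
  const b = Buffer.from(signature)
  return a.length === b.length && timingSafeEqual(a, b)
}
```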

Outputs

| Name | Destination | Format | SLA |
| --- | --- | --- | --- |
| Webhook ack | Stripe (HTTP response) | 200 empty / 400 error code | < 100ms p95 |
| Payment notification | RabbitMQ (AMQP) | { event_type, payment_id, amount, timestamp } | at-least-once, < 500ms |
| Payment response | Client app (HTTP response) | { paymentId, status, created_at } | < 200ms p95 |

Every output row is a promise to an external consumer. The Format column is the contract they depend on. The SLA column is the promise they'll hold you to.

Constraints and Invariants: Defining Correctness

Constraints limit actions (NEVER do X). Invariants describe state (X is always true). Together they define what "correct" means for your system.

Constraints

| ID | Rule | Verified By | Data | Stress |
| --- | --- | --- | --- | --- |
| C-001 | NEVER process duplicate webhook | Unique constraint on stripe_event_id | 10K synthetic events, 5% duplicates | 100 concurrent deliveries |
| C-002 | NEVER trust unsigned webhook | Signature validation before processing | Valid + tampered payloads | — |
| C-003 | NEVER log card numbers | PCI compliance scanner in CI | Payloads containing PAN data | — |

The Data and Stress columns transform a constraint from a wish into a testable requirement. "NEVER process duplicates" is a policy. "NEVER process duplicates, verified with 10K events at 100 concurrent deliveries" is an engineering requirement with a verification plan. (Note that C-001 and C-002 trace back to third-party assumptions—they exist because of Stripe's delivery semantics and signing behavior, not as arbitrary security choices.)

During implementation, these IDs migrate into code as structured comments (Lesson 11):

```ts
// C-001: NEVER process duplicate webhook — idempotency via unique constraint on stripe_event_id
// C-002: NEVER trust unsigned webhook — HMAC-SHA256 validation before any processing
export async function handleWebhook(req: Request): Promise<Response> {
  verifySignature(req) // C-002
  const event = await req.json() // a Request body is a stream; parse before reading fields
  if (await isDuplicate(event.id)) return new Response(null, { status: 200 }) // C-001
  // ...
}
```

The spec table is the authoritative source during design. The code comments become the authoritative source after implementation. This is what makes deleting the spec safe—the constraints have migrated.

Invariants

| ID | Condition | Scope | Manifested By |
| --- | --- | --- | --- |
| I-001 | payment.status IN (pending, processing, succeeded, failed) | PaymentIntent | Insert invalid status, assert rejection |
| I-002 | webhook.processed_at IS NULL OR webhook.event_id IS UNIQUE | WebhookEvent | Process same event twice, verify single record |
| I-003 | SUM(transactions) = account.balance | Account ledger | Generate 1K transactions, verify sum after each batch |

Manifested By answers how a test exercises the invariant. Without it, invariants are assertions nobody checks. An invariant violation means your data model is corrupted—make sure you can detect it.
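
As a sketch, I-003's Manifested By column becomes a loop that re-checks the invariant after every batch. Here generateTransaction, applyBatch, and ledger are hypothetical test helpers:

```ts
// 10 batches of 100 = the 1K transactions the table calls for.
for (let batch = 0; batch < 10; batch++) {
  const txns = Array.from({ length: 100 }, generateTransaction)
  await applyBatch(ledger, txns)
  const sum = await ledger.sumTransactions()
  const balance = await ledger.balance()
  if (sum !== balance) throw new Error(`I-003 violated after batch ${batch}: ${sum} != ${balance}`)
}
```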

Verify Behavior: Concrete Scenarios at Boundaries

Constraints say NEVER. Invariants say ALWAYS. Neither answers: what should happen when amount=0?

Behavioral scenarios fill this gap—concrete Given-When-Then examples at system boundaries, specific enough to become tests without dictating test framework, mocks, or assertion syntax.

| ID | Given | When | Then | Edge Category |
| --- | --- | --- | --- | --- |
| B-001 | PaymentIntent in pending state | Webhook delivers succeeded with amount=0 | Transition to succeeded, balance unchanged | boundary value |
| B-002 | No matching PaymentIntent | Webhook delivers valid event for unknown intent | Return 200, log warning, no state change | null / missing |
| B-003 | Stripe API returns 503 | Client submits payment request | Return 502, queue for retry, no charge created | error propagation |
| B-004 | Two identical webhooks within 10ms | Both pass signature validation | First processes, second returns 200, no state change | concurrency |

Each scenario traces back to a constraint or invariant—B-001 exercises I-003 (balance integrity), B-004 exercises C-001 (no duplicate processing). The edge category column is a systematic checklist: boundary values, null/empty, error propagation, concurrency, temporal. Walk each category per interface; errors cluster at boundaries[5] because agents don't reliably infer them.

The spec captures what should happen, not how to test it. Framework choices, mock configurations, and assertion syntax belong in implementation—they change with the codebase. Behavioral examples survive refactoring.
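
For instance, B-002 might land in code like this; the test/expect harness and the deliverWebhook, signedEvent, and findIntent helpers are illustrative, not prescribed:

```ts
test('B-002: valid event for unknown intent is acknowledged without state change', async () => {
  const res = await deliverWebhook(signedEvent({ intentId: 'unknown' })) // When
  expect(res.status).toBe(200)                   // Then: return 200
  expect(await findIntent('unknown')).toBeNull() // Then: no state change
})
```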

Quality Attributes: How Good Is Good Enough?

Quality attributes define measurable thresholds across three tiers: target (normal operations), degraded (alerting), and failure (paging).

| Attribute | Target | Degraded | Failure | Measurement |
| --- | --- | --- | --- | --- |
| Latency p95 | 100ms | 200ms | 1s | APM traces |
| Availability | 99.9% | 99.5% | 99% | uptime/month |
| Recovery | 15min | 30min | 1h | incident drill |

Target = SLO. Degraded = alerts fire. Failure = on-call gets paged. Three tiers give you an error budget before the first outage and make "good enough" concrete rather than aspirational.

Performance Budget: Decomposing SLOs

Quality Attributes says "Latency p95: 100ms." But the webhook flow has five steps. Which step gets how many milliseconds?

| Flow Step | Budget | Hot/Cold |
| --- | --- | --- |
| Signature validation | 2ms | hot |
| Idempotency check (Redis) | 5ms | hot |
| Parse + validate payload | 3ms | hot |
| Update payment state (DB) | 15ms | hot |
| Publish event (queue) | 5ms | cold |
| Total | 30ms | |
| Headroom | 70ms | |

The budget forces two decisions agents can't make alone. First, hot vs. cold path: signature validation is synchronous and blocking—it gets a tight budget. Event publishing is async—it tolerates more. Second, headroom: the total is 30ms against a 100ms SLO, leaving 70ms for future operations on this path. Without decomposition, an agent might spend the entire budget on a single unoptimized query.

Per-operation budgets also surface algorithmic constraints. If "idempotency check" must complete in 5ms, that rules out a full-table scan—the agent knows to use an indexed lookup or bloom filter without being told.
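
A sketch of an idempotency check that fits the 5ms budget: a single O(1) Redis SET with NX instead of a scan. The ioredis-style client is an assumption; any client with an atomic set-if-absent works:

```ts
import Redis from 'ioredis'

// First delivery claims the key atomically; duplicates find it already set.
async function claimEvent(redis: Redis, eventId: string): Promise<boolean> {
  const claimed = await redis.set(`webhook:${eventId}`, '1', 'EX', 86_400, 'NX')
  return claimed === 'OK' // false means duplicate: return 200 and skip processing (C-001)
}
```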

Flows: Tracing Execution

Flows trace execution from trigger to completion, revealing integration points and error handling gaps.

Each step has three parts: what happens, what happens on success, what happens on failure. Flows force you to think through the actual execution path, not an idealized happy-path abstraction.
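
For illustration, the webhook flow from the performance budget, written in this three-part form:

  • Verify signature: on success continue; on failure return 400 (C-002)
  • Idempotency check: on success continue; on duplicate return 200 and stop (C-001)
  • Update payment state: on success continue; on failure return 500 so Stripe redelivers
  • Publish event: on success ack 200; on failure retry the publish asynchronously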

Security and Observability: System Properties

These aren't features you bolt on—they're system properties that emerge from correct boundaries and instrumentation.

Security

Where does trust end? What can an attacker control?

| Threat | Mitigation |
| --- | --- |
| Forged webhook | Signature verification with STRIPE_WEBHOOK_SECRET |
| Replay attack | Idempotency check on event_id |
| Secret exposure | Secrets from env vars, never logged |

Deep Security Checklist

For systems with significant attack surface, also specify: Authentication (how are identities verified?), Authorization (who can do what? default deny), Data Protection (what's PII? encrypted at rest? retention policy?). See the full template for the complete format.

Observability

How do you know it's working?

| Metric | Type | Alert Threshold |
| --- | --- | --- |
| webhook_processing_duration | histogram | p99 > 5s |
| payment_success_rate | gauge | < 95% over 5min |
| duplicate_webhook_rate | counter | > 10/min |

Deep Observability Checklist

For production-critical systems, also specify: Logging (structured format, correlation IDs, PII redaction), SLOs (availability/latency targets, burn-rate alerts), Tracing (propagation standard, sampling strategy, key spans). See the full template for the complete format.
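
Wiring the first metric row might look like this, using prom-client as one plausible choice; the bucket boundaries are assumptions:

```ts
import { Histogram } from 'prom-client'

const webhookDuration = new Histogram({
  name: 'webhook_processing_duration_seconds',
  help: 'End-to-end webhook handling time',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5], // top bucket aligns with the p99 > 5s alert
})

// In the handler: time the whole request.
const stop = webhookDuration.startTimer()
// ... handle webhook ...
stop()
```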

Deployment and Integration: The Operational Boundary

How a system gets to production and how it behaves when dependencies fail are as much a part of the spec as business logic.

Deployment Strategy

Specify the deployment method (blue-green, canary, rolling), rollback triggers (what metrics cause auto-rollback?), and migration approach (backward-compatible schema changes for how long?). These decisions affect code structure—canary deployments require feature flags; rolling updates require backward-compatible APIs.

Integration Dependencies

| Service | Contract | On Failure | Timeout |
| --- | --- | --- | --- |
| Stripe API | REST, idempotency key | Queue for retry, degrade to manual | 5s, circuit breaker at 50% failure |

Circuit breakers, timeouts, and fallback modes define how your system degrades. Without them, one slow dependency cascades into a full outage. These operational failure modes operationalize the architectural assumptions declared earlier—the circuit breaker exists because you assumed ~99.99% availability, not 100%.
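
A minimal circuit-breaker sketch matching the table's 50% threshold; windowing and half-open recovery are simplified away, and the defaults are assumptions:

```ts
class CircuitBreaker {
  private failures = 0
  private calls = 0
  private openUntil = 0

  constructor(private failureRate = 0.5, private minCalls = 10, private coolOffMs = 30_000) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) throw new Error('circuit open: use fallback, queue for retry')
    this.calls++
    try {
      return await fn()
    } catch (err) {
      this.failures++
      if (this.calls >= this.minCalls && this.failures / this.calls >= this.failureRate) {
        this.openUntil = Date.now() + this.coolOffMs // trip: stop calling the dependency
        this.failures = 0
        this.calls = 0
      }
      throw err
    }
  }
}

// Usage: breaker.exec(() => callStripe(...)); on an open circuit, degrade to manual processing.
```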

Converge, Don't Count Passes

The spec is a hypothesis. The code is an experiment. Verification is observation. This is the scientific method applied to engineering—and it terminates on convergence, not on a prescribed number of passes.

Always start with three sections: Architecture, Interfaces, and State. Generate a first pass. Then ask one question: is the architecture sound?

  • Yes → fix the code. The agent made a mechanical error—patch the implementation.
  • No → fix the spec and regenerate. Don't patch around flawed boundaries.

Each loop through this cycle reveals what the spec missed. The first pass might expose concurrency constraints—add Constraints. The second might surface a performance bottleneck—add a Performance Budget. The code pulls depth from you; you don't push depth onto it by categorizing complexity upfront. You can't know which sections matter before the code shows you where gaps are[2].

You're done when the loop produces no new gaps: the code passes all behavioral scenarios, the spec accounts for all constraints the code revealed, and the last pass surfaces nothing new. That's a testable termination condition. A simple feature converges in one loop. A complex architectural change might take five. But you discover which you're dealing with by running the loop, not by predicting it.

Iteration speed is the multiplier. Code generation is approaching post-scarcity[3]—the scarce resource is your judgment about what to build. The engineer who runs ten hypothesis→experiment→verify loops per day outperforms the one who runs two with a more thorough upfront spec[2][1]. This is the same insight that made Agile outperform Waterfall, compressed from weeks-per-iteration to minutes. Use exploration planning (Lesson 3) and ArguSeek (Lesson 5) to research before each loop. For system-level work, start from the full template. Validate through the SDD workflow—gap-analyze, implement, then delete the spec. What survives deletion: constraint IDs inlined in code (Lesson 11), and the small WHY residual (rejected alternatives, business rationale) committed as decision records.

Template Sections Not Covered

The full spec template includes sections not taught in this lesson: Background (problem statement + baseline metrics), Caching (strategy/TTL/invalidation), Endpoints (REST contract details), Cleanup Flows (teardown/rollback sequences), Code Traceability (file:line evidence columns). Use these when the code pulls them from you—not before.

Key Takeaways

  • Specs are a zoom lens, not a blueprint — oscillate between bird's-eye architecture and detail-level implementation.

  • Spec = hypothesis, code = experiment — each loop through the cycle tests whether your architectural assumptions hold. Converge when the loop produces no new gaps.

  • Precision is discovered, not specified — each spec↔code pass reveals gaps the previous spec missed. The code pulls depth from you.

  • Iteration speed is the multiplier — code is cheap, judgment is scarce. Maximize hypothesis→experiment→verify loops per day, not spec thoroughness per loop.

  • Architecture makes structure explicit — modules have single responsibilities, boundaries prevent coupling, contracts define communication.

  • Third-party assumptions are architectural drivers — make them explicit so agents know which decisions to revisit when providers change.

  • State modeling shapes transition code — choose declarative, event-driven, or state machine per entity.

  • Fix specs for architecture, fix code for bugs — flawed boundaries = regenerate from updated spec; mechanical errors = patch the implementation.

  • Delete the spec when done — code is the source of truth.


Footnotes

  1. Lloyd, Zach (2025) — First Round Capital interview — Compares upfront outcome-based specs to "writing a huge design doc for something up front"; advocates iterative, incremental agent guidance instead. Beck, Kent (2025) — "Augmented Coding: Beyond the Vibes" — Demonstrates plans failing on contact with implementation complexity; advocates incremental TDD cycles over upfront specification.

  2. Eberhardt, Colin (2025) — "Putting Spec Kit Through Its Paces: Radical Idea or Reinvented Waterfall?" — Found iterative prompting ~10x faster than specification-driven development. Li et al. (2025) — "Specine: An AI Agent That Writes Your Spec" (arXiv:2509.01313) — Confirms LLMs misperceive specification quality, requiring iterative alignment.

  3. Xu et al. (2025) — "When Code Becomes Abundant: Implications for Software Engineering in a Post-Scarcity AI Era" — Argues software engineering shifts from production to orchestration + verification as AI makes code generation cheap. Source: arXiv:2602.04830

  4. Cockburn, Alistair / Larman, Craig — "Protected Variation: The Importance of Being Closed" (IEEE Software). Reformulates the Open-Closed Principle as: "Identify points of predicted variation and create a stable interface around them." See also Fowler, Martin — YAGNI for the distinction between presumptive and known features.

  5. Boundary Value Analysis research consistently shows errors cluster at input extremes (min, max, off-by-one). See Ranorex — "What Is Boundary Value Analysis in Software Testing?" and NVIDIA — "Building AI Agents to Automate Software Test Case Creation" (HEPH framework for AI-driven positive/negative test specification).