Developing for Failure

Meghna Bharadwaj
August 13, 2025

Because nothing breaks quite like production.

Let’s be honest. No matter how many green ticks your CI gives you, something will inevitably fail in production. At scale, when no one is looking, and probably on a Friday afternoon.

Instead of pretending we can avoid it, it's time to build like we expect failure and actively plan to survive it. This isn't pessimism; it's engineering maturity. It's about designing systems that don't just work, but gracefully recover.


Step 1: Think Like It's Going to Fail

Before the code, before the backlog, before the tech stack — start with “How Might We Fail?”

In your discovery or exploration workshops, don’t just ask what users want. Ask what might go wrong. This is where agile risk planning begins. During sprint planning or backlog refinement, explicitly dedicate time to discuss potential failure modes for user stories and epics.

Examples:

  • What happens if the API is down?
  • What if the user refreshes during checkout?
  • What if 10,000 people all click the same button at once?

This mindset — “How Might We Fail?” (HMW-F) — gives you something far better than perfect plans: realistic expectations. This is maturity, not paranoia.

Incorporate failure paths into your story mapping process from day one — treat them as first-class citizens through negative user stories or error-state criteria. Design for failure, not just success. That includes your error pages, too.


Step 2: Architect With Chaos in Mind

Distributed systems don’t usually explode dramatically. They degrade slowly, silently, and weirdly. So instead of hoping for stability, architect for instability.

Some patterns that help, and which can be explicitly discussed during agile design sessions or spike tasks:

  • Retries with Backoff: For flaky APIs. Instead of hammering a failing service, implement exponential backoff (e.g., initial retry after 100ms, then 200ms, 400ms, with a jitter component to avoid thundering herds); a minimal sketch follows this list.
  • Circuit Breakers: To prevent cascading failures when a service is unresponsive. If a service consistently fails, the circuit breaker "opens," quickly failing subsequent calls without waiting for timeouts. That spares your callers from hanging on a dead dependency and gives the struggling service room to recover (sketched at the end of this step).
  • Bulkheads: Isolate failures across services or threads. Imagine a ship's compartments: a leak in one doesn't sink the whole ship. In software, this means isolating resource pools (e.g., separate thread pools for different external service calls) so one slow dependency doesn't block your entire application.
  • Idempotent Operations: Crucial for things like payments, messages, or retries. An operation is idempotent if executing it multiple times has the same effect as executing it once. This is achieved by using unique transaction IDs or conditional updates to prevent duplicate processing.
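
To make the first pattern concrete, here is a minimal retry-with-jitter sketch in Python. It is illustrative only: "call" stands in for whatever flaky request you are wrapping, and the delays are examples, not recommendations.

    import random
    import time

    def call_with_backoff(call, max_attempts=5, base_delay=0.1):
        """Retry a flaky zero-argument callable with exponential backoff plus jitter."""
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:  # in real code, catch only the errors you expect
                if attempt == max_attempts - 1:
                    raise  # out of attempts: let the failure surface
                delay = base_delay * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
                # Random jitter keeps a crowd of clients from retrying in lockstep.
                time.sleep(delay + random.uniform(0, delay))

Wrapping a real call is then one line, for example call_with_backoff(lambda: requests.get(url, timeout=2)).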

You don’t need a massive resilience framework to start; even basic timeouts and deduplication can prevent the weirdest bugs.
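
Deduplication, for instance, can start as a lookup keyed by a client-supplied request ID. A minimal sketch, assuming a hypothetical charge function that actually moves the money; in production the dict would be a database table with a unique constraint:

    # In-memory stand-in for a table with a unique constraint on request_id.
    _processed = {}

    def handle_payment(request_id, amount, charge):
        """Run charge(amount) at most once per client-supplied request_id."""
        if request_id in _processed:
            # A retry or double-submit: return the original result, don't charge again.
            return _processed[request_id]
        result = charge(amount)
        _processed[request_id] = result
        return result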

And remember: if you don’t design for broken APIs, flaky connections, and race conditions — your users will experience them anyway. These design discussions should be part of your definition of done for relevant features.
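
A circuit breaker, the second pattern above, doesn't need a framework either. A minimal, illustrative sketch; the failure threshold and reset timeout are arbitrary:

    import time

    class CircuitBreaker:
        """Fail fast once a dependency has failed repeatedly."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None

        def call(self, fn):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    # Open: reject immediately instead of waiting on timeouts.
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None  # half-open: let one probe call through
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success closes the circuit again
            return result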


Step 3: Test What Shouldn’t Work

Testing the happy path is great. But most users find the chaotic path on day one. So build tests that:

  • Simulate retries and timeouts: For example, using mock servers that introduce artificial delays or random failures.
  • Handle bad inputs, dropped packets, and corrupted state: Injecting malformed data, simulating network partitions or packet loss, and testing how the system recovers from inconsistent data states.
  • Validate what happens when external services go sideways: This includes testing their error codes and unexpected responses.
  • Double-submit forms or trigger race conditions: Using concurrent test frameworks or specifically designed scripts to hit endpoints simultaneously.

Throw weirdness at your system on purpose. Your users will — so you better do it first. And yes, make space in your pipeline for tests that simulate failure, not just correctness. This requires planning for test automation as part of your sprint work, rather than just manual QA.
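
For example, a test can force the failure you're worried about instead of waiting for production to find it. A sketch using pytest and unittest.mock; fetch_profile and its URL are illustrative stand-ins for your real code:

    from unittest import mock

    import pytest
    import requests

    def fetch_profile(user_id):
        # Illustrative code under test: calls an external profile service.
        resp = requests.get(f"https://profiles.example.com/{user_id}", timeout=2)
        resp.raise_for_status()
        return resp.json()

    def test_profile_service_timeout_is_surfaced():
        # Simulate the dependency hanging past its timeout.
        with mock.patch("requests.get", side_effect=requests.Timeout):
            with pytest.raises(requests.Timeout):
                fetch_profile("user-123")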


Step 4: CI/CD with Guardrails

Every deploy is a risk. So CI/CD isn’t just about automation; it’s about containment. Here’s how to reduce impact:

  • Run failure-specific tests in CI: Beyond unit/integration tests, include contract tests against mock services for external dependencies, and even lightweight chaos tests that can run quickly.
  • Make rollback paths explicit and testable: Ensure your deployment script includes a clear, automated rollback command, and periodically test it in a staging environment. This should be part of your release planning and definition of done for deployments.
  • Use progressive deployment: blue-green, canary, staged rollouts.
  • Wrap every new feature in a flag so you can flip it off without reverting code; a minimal sketch follows this list.
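
A kill switch can start as small as an environment variable. A minimal sketch; the flag name and the two checkout flows are hypothetical, and a real flag service would replace the environment lookup:

    import os

    def flag_enabled(name, default=False):
        """Read a feature flag from the environment (stand-in for a real flag service)."""
        value = os.environ.get(name)
        if value is None:
            return default
        return value.strip().lower() in ("1", "true", "on", "yes")

    def new_checkout_flow(cart):      # hypothetical risky new path
        return {"flow": "new", "items": cart}

    def legacy_checkout_flow(cart):   # hypothetical known-good fallback
        return {"flow": "legacy", "items": cart}

    def checkout(cart):
        # Setting NEW_CHECKOUT=off in the deploy config disables the new path
        # without reverting or redeploying code.
        if flag_enabled("NEW_CHECKOUT"):
            return new_checkout_flow(cart)
        return legacy_checkout_flow(cart)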

Don’t ship and hope. Ship with a plan for what to do if it tanks.


Step 5: Monitor Like You’re Blind Without It

Your monitoring shouldn’t just say “it’s up.” It should tell you:

  • What’s slow?
  • What’s acting weird?
  • What’s broken but not throwing errors?
  • What should’ve happened that didn’t?

Observability isn’t just dashboards. It’s clarity.

At Renben, we use:

  • Metrics
  • Tracing
  • Logs
  • Error reporting
  • Synthetic monitoring

And don’t forget to alert on business logic too: abandoned checkouts, sudden drop in usage, weird spikes in retries. Technical health without product context is just noise. These business-critical metrics should be defined and monitored in collaboration with product owners in your agile teams.
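
If you already export Prometheus metrics, for instance, business signals are just more counters. A sketch using the prometheus_client library; the metric names and checkout hooks are illustrative:

    from prometheus_client import Counter

    # Business-level signals alongside the technical ones.
    checkouts_started = Counter("checkouts_started", "Checkout flows begun")
    checkouts_completed = Counter("checkouts_completed", "Checkout flows completed")
    payment_retries = Counter("payment_retries", "Retries against the payment provider")

    def on_checkout_started():
        checkouts_started.inc()

    def on_checkout_completed():
        checkouts_completed.inc()

    def on_payment_retry():
        payment_retries.inc()

An alert on the ratio of completed to started checkouts over, say, the last 15 minutes will catch a broken payment flow even while every individual service reports healthy.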


Step 6: SRE Isn’t Just for Big Companies

Site Reliability Engineering (SRE) is a mindset shift. Treat failure as inevitable and design for graceful degradation. This can be integrated into agile practices at any scale.

Key practices:

  • Define SLIs (Service Level Indicators): What actual user experience are you measuring? (e.g., successful login rate, request latency below 200ms, data freshness within 5 minutes).
  • Commit to SLOs (Service Level Objectives): Realistic, measurable targets based on your SLIs (e.g., 99.9% of logins must be successful over 7 days, 95% of requests must complete within 200ms).
  • Track your Error Budget: How much failure are you willing to tolerate before pausing new feature development to focus on reliability? (e.g., if our SLO is 99.9%, our error budget is 0.1% downtime or failures).
  • Do Incident Reviews (Retrospectives): Not to assign blame, but to learn. Conduct blameless postmortems to identify root causes, systemic weaknesses, and action items for improvement, feeding back into your agile backlog as reliability-focused tasks.

Don’t just fix bugs. Fix the design that let the bug through.

You can start simple: a shared Google Doc of SLIs, a weekly discussion in your sprint retrospective on latency trends, a Slack channel for near-misses. You honestly don’t need a full SRE team, but it sure helps.
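
To make the error-budget arithmetic concrete, here it is as a tiny script; the 99.9% SLO and 30-day window are just examples:

    def error_budget_minutes(slo, window_days=30):
        """Minutes of failure an SLO tolerates over the window."""
        return window_days * 24 * 60 * (1 - slo)

    def budget_remaining(slo, bad_minutes_so_far, window_days=30):
        """Fraction of the error budget still unspent (negative means it's blown)."""
        budget = error_budget_minutes(slo, window_days)
        return (budget - bad_minutes_so_far) / budget

    print(error_budget_minutes(0.999))                      # ~43.2 minutes per 30 days
    print(budget_remaining(0.999, bad_minutes_so_far=30))   # ~0.31 of the budget left

When the remaining fraction heads toward zero, that's the signal to pause feature work and spend a sprint on reliability.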


Why All This?

Because real-world software doesn't fail neatly. It times out randomly, loses messages in transit, goes stale behind load balancers, gets called twice with different parameters, and fails in only one region.

Building for failure means:

  • You think about people, not just systems
  • You code for recovery, not just correctness
  • You test for entropy, not just coverage
  • You monitor for patterns, not just errors

This isn’t a slide back into waterfall-style upfront planning. It’s not paranoia. It’s just good engineering.

And yes, you can accomplish all this in an agile manner. Just add failure conversations to planning, testing, and review.


At Renben, we believe engineering isn't just about uptime — it’s about designing for the real world.

Explore what we’re building.