Because nothing breaks quite like production.
Let’s be honest. No matter how many green ticks your CI gives you, something will inevitably fail in production. At scale, when no one is looking, and probably on a Friday afternoon.
Instead of pretending we can avoid it, it's time to build like we expect failure and actively plan to survive it. This isn't pessimism; it's engineering maturity. It's about designing systems that don't just work, but gracefully recover.
Before the code, before the backlog, before the tech stack: start with “How Might We Fail?”
In your discovery or exploration workshops, don’t just ask what users want. Ask what might go wrong. This is where agile risk planning begins. During sprint planning or backlog refinement, explicitly dedicate time to discuss potential failure modes for user stories and epics.
Examples:

- What happens if the payment provider times out mid-checkout?
- What if two users edit the same record at the same time?
- What if a third-party API starts returning errors, or garbage, in only one region?
This mindset — “How Might We Fail?” (HMW-F) — gives you something far better than perfect plans: realistic expectations. This is maturity, not paranoia.
Incorporate failure paths into your story mapping process from day one, treating them as first-class citizens through negative user stories or error-state acceptance criteria (e.g., “As a shopper, when my payment fails, I see what went wrong and a way to retry”). Design for failure, not just success. That includes your error pages, too.
Distributed systems don’t usually explode dramatically. They degrade slowly, silently, and weirdly. So instead of hoping for stability, architect for instability.
Some patterns that help, and which can be explicitly discussed during agile design sessions or spike tasks:

- Timeouts on every network call, so a slow dependency fails fast instead of hanging you.
- Retries with exponential backoff and jitter, so transient blips heal themselves.
- Idempotency and deduplication, so a message delivered twice does its work once.
- Circuit breakers and fallbacks, so one failing dependency doesn’t take the whole system down.
- Graceful degradation, so users get a reduced experience instead of an error page.
You don’t need a massive resilience framework to start; even basic timeouts and deduplication can prevent the weirdest bugs.
And remember: if you don’t design for broken APIs, flaky connections, and race conditions — your users will experience them anyway. These design discussions should be part of your definition of done for relevant features.
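To make the basics concrete, here’s a minimal Python sketch of timeouts plus jittered retries, assuming an HTTP dependency called through the requests library; the function name and the numbers are illustrative, not a prescription:

```python
import random
import time

import requests


def call_with_timeout_and_retries(url: str, attempts: int = 3, timeout_s: float = 2.0):
    """Call a flaky dependency with a hard timeout and jittered retries."""
    for attempt in range(attempts):
        try:
            # Never call the network without a timeout: a hung socket
            # should fail fast, not pin a worker thread for minutes.
            response = requests.get(url, timeout=timeout_s)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller degrade gracefully
            # Exponential backoff with jitter, so a thousand clients
            # retrying at once don't synchronize into a thundering herd.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```

The jitter matters: without it, clients retrying on the same schedule can turn one blip into a self-inflicted outage.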
Testing the happy path is great. But most users find the chaotic path on day one. So build tests that:

- simulate timeouts, dropped connections, and slow dependencies;
- feed in malformed, duplicate, and out-of-order input;
- kill a dependency mid-request and assert the system degrades instead of crashing.
Throw weirdness at your system on purpose. Your users will, so you’d better do it first. And yes, make space in your pipeline for tests that simulate failure, not just correctness. That means planning test automation as part of your sprint work rather than leaving it to manual QA.
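For example, a failure-path test might look like this (pytest-style; `myapp.profiles`, `fetch_profile`, and its cache fallback are hypothetical stand-ins for your own code):

```python
import requests

from myapp.profiles import fetch_profile  # hypothetical module under test


def test_profile_degrades_when_dependency_times_out(monkeypatch):
    def boom(*args, **kwargs):
        raise requests.Timeout("simulated upstream timeout")

    # Make the dependency fail the way it eventually will in production.
    monkeypatch.setattr(requests, "get", boom)

    profile = fetch_profile(user_id=42)

    # The user should get a degraded-but-usable answer, not a 500.
    assert profile["source"] == "cache"
```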
Every deploy is a risk. So CI/CD isn’t just about automation; it’s about containment. Here’s how to reduce impact:

- Ship small, frequent changes, so each deploy carries less risk to reason about.
- Roll out behind feature flags and canaries, so new code meets a sliver of traffic first.
- Automate rollback, so recovery is one click (or zero) instead of a 2 a.m. scramble.
- Keep blue-green or staged environments, so you can cut back to known-good instantly.
Don’t ship and hope. Ship with a plan for what to do if it tanks.
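A minimal sketch of the feature-flag half of that, assuming deterministic percentage bucketing; `in_rollout` is a hypothetical helper you’d wire to your real flag store:

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket users, so a canary sees the same cohort
    on every request. Hypothetical helper; back it with your flag store."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100 < percent


# Expose the risky path to 5% of users first. Rolling back is flipping
# the percentage to 0: no redeploy, no drama.
if in_rollout(user_id="u-123", feature="new-checkout", percent=5):
    ...  # new code path
else:
    ...  # existing, known-good path
```

Deterministic bucketing means the same user stays in (or out of) the canary across requests, which keeps the experiment clean and the bug reports reproducible.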
Our logs shouldn’t just say “it’s up.” They should say:

- what failed, and in which service;
- which users and requests were affected;
- when it started, and which deploy it correlates with;
- a trace ID to follow the request across service boundaries.
Observability isn’t just dashboards. It’s clarity.
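In code, that tends to mean structured log lines rather than prose. A sketch using only the Python standard library; the field names are illustrative, and many teams would reach for structlog or their platform’s logger instead:

```python
import json
import logging

logger = logging.getLogger("checkout")


def log_failure(event: str, **context) -> None:
    """Emit one structured line: what failed, for whom, since when."""
    logger.error(json.dumps({"event": event, **context}))


log_failure(
    "payment_declined",
    user_id="u-123",
    order_id="o-456",
    deploy="2024-06-01.3",  # correlate failures with releases
    trace_id="abc123",      # follow one request across services
    upstream_status=503,
)
```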
At Renben, we use:
And don’t forget to alert on business logic too: abandoned checkouts, sudden drop in usage, weird spikes in retries. Technical health without product context is just noise. These business-critical metrics should be defined and monitored in collaboration with product owners in your agile teams.
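A business-logic alert can be as small as a windowed ratio check. A hypothetical sketch; the metric, window, and threshold are things to agree on with your product owner, not constants to copy:

```python
def checkout_abandonment_alert(started: int, completed: int,
                               threshold: float = 0.6) -> str | None:
    """Fire when the abandonment rate in a window looks abnormal.
    Numbers are placeholders; set real ones with the product owner."""
    if started == 0:
        return None  # no traffic at all is a different alert entirely
    rate = 1 - completed / started
    if rate > threshold:
        return f"ALERT: checkout abandonment at {rate:.0%} (threshold {threshold:.0%})"
    return None
```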
Site Reliability Engineering (SRE) is a mindset shift. Treat failure as inevitable and design for graceful degradation. This can be integrated into agile practices at any scale.
Key practices:

- Define SLIs and SLOs for the user journeys that matter, not just server health.
- Track an error budget, and slow feature work down when you’re burning through it.
- Run blameless postmortems that end in design changes, not finger-pointing.
- Rehearse failure: game days, chaos experiments, rollback drills.
Don’t just fix bugs. Fix the design that let the bug through.
You can start simple: a shared Google Doc of SLIs, a weekly discussion in your sprint retrospective on latency trends, a Slack channel for near-misses. You honestly don’t need a full SRE team, but it sure helps.
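Even the error-budget math fits in that shared doc, or in a few lines of code. A sketch with made-up numbers, assuming a simple request-success SLI:

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget left for the period."""
    allowed_failures = (1 - slo) * total  # e.g. a 99.9% SLO allows 0.1%
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed / allowed_failures)


# 99.9% SLO, one million requests this month, 600 failures so far:
# 1,000 failures were allowed, so about 40% of the budget remains.
print(error_budget_remaining(0.999, 1_000_000, 600))
```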
Because real-world software doesn’t fail neatly. It times out randomly, loses messages in transit, goes stale behind load balancers, gets called twice with different parameters, and fails in only one region.
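That “called twice” failure mode is exactly what idempotency keys are for. A minimal sketch, assuming the producer attaches an `idempotency_key` to each event; the in-memory set stands in for what would really be Redis or a database table with a TTL:

```python
# In production, `seen` would live in Redis or a database with a TTL;
# an in-memory set is enough to show the shape of the idea.
seen: set[str] = set()


def handle_payment_event(event: dict) -> None:
    key = event["idempotency_key"]  # assumed to be set by the producer
    if key in seen:
        return  # duplicate delivery: doing nothing must be safe
    seen.add(key)
    charge_card(event)  # hypothetical side-effecting work


def charge_card(event: dict) -> None:
    print("charging card for order", event["order_id"])
```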
Building for failure means:

- asking “How Might We Fail?” before you write the code;
- designing timeouts, retries, and idempotency in from the start;
- testing the chaotic path, not just the happy one;
- shipping behind flags, with a rollback plan;
- observing what users feel, not just whether servers are up.
It’s not a waterfall model. It’s not paranoia. It’s just good engineering.
And yes, you can accomplish all this in an agile manner. Just add failure conversations to planning, testing, and review.
At Renben, we believe engineering isn't just about uptime — it’s about designing for the real world.
Explore what we’re building.