Designing for Failure
In a real system, things break: networks drop packets, servers crash, dependencies time out. Designing for failure means assuming all of this and recovering gracefully so users barely notice.
Retries with Exponential Backoff
Many failures are temporary. A retry often succeeds on the second or third try. The trick is how you retry.
Retrying immediately can hammer a struggling server until it falls over. Exponential backoff doubles the wait each attempt: 50ms, 100ms, 200ms, 400ms. This gives the dependency time to recover and prevents your retries from amplifying the outage.
Add jitter (random wiggle) so many clients do not retry in lockstep, which would create thundering-herd waves of traffic.
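Putting the two ideas together, here is a minimal retry helper in Python. It doubles the wait each attempt (capped) and sleeps a random amount up to that wait ("full jitter"); the function name, attempt count, and delays are illustrative choices, not a standard API.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.05, max_delay=2.0):
    """Call fn(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the last error propagate
            # Double the base wait each attempt: 50ms, 100ms, 200ms, ... capped.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount in [0, delay] so clients desynchronize.
            time.sleep(random.uniform(0, delay))
```

In practice you would catch only the exceptions that signal a transient failure (timeouts, connection resets), since retrying a permanent error such as "not found" just wastes attempts.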
Circuit Breakers
A circuit breaker stops calling a failing dependency for a while. If the last N calls have failed, the breaker opens and short-circuits future calls with an immediate error. After a cooldown, it goes to half-open and lets one probe through. If that succeeds, it closes; if it fails, it stays open longer.
This protects your system from spending resources on hopeless calls and gives the dependency room to recover.
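The state machine above (closed, open, half-open) can be sketched in a few dozen lines. This is a simplified single-threaded version with invented names and thresholds; production breakers also need locking and per-dependency configuration.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive failures,
    half-opens after `cooldown` seconds to let a single probe call through."""

    def __init__(self, threshold=3, cooldown=5.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Open: short-circuit with an immediate error, no network call.
                raise RuntimeError("circuit open")
            # Cooldown elapsed: half-open, allow this one probe through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold or self.opened_at is not None:
                # Open the breaker (or re-open it after a failed probe).
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the breaker
            return result
```

A failed probe resets the cooldown clock, which is one simple way to "stay open longer"; real implementations often grow the cooldown exponentially as well.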
Graceful Degradation
When a non-critical feature fails, the rest of the app should keep working. If the recommendations service is down, show a default product list. If the avatar service times out, show initials. Fallbacks turn outages into mild annoyances.
Decide ahead of time which features are critical and what to substitute when the non-critical ones fail.
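One way to encode that decision is a small fallback wrapper. The helper below is a sketch; `fetch_recommendations` and `DEFAULT_PRODUCTS` are hypothetical names standing in for the recommendations example above.

```python
def with_fallback(fn, fallback):
    """Run fn(); on any failure, return the fallback value instead of propagating."""
    try:
        return fn()
    except Exception:
        # A real system would log the error here so the outage is visible to operators.
        return fallback

# Hypothetical recommendations service, down for this example.
def fetch_recommendations():
    raise TimeoutError("recommendations service is down")

# Critical decision made ahead of time: show bestsellers when recommendations fail.
DEFAULT_PRODUCTS = ["bestseller-1", "bestseller-2", "bestseller-3"]
products = with_fallback(fetch_recommendations, DEFAULT_PRODUCTS)
```

The key property is that the page still renders: the user sees a generic list instead of an error screen.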
Timeouts
Set a timeout on every network call. The default of "wait forever" is the most common cause of cascading outages. A short timeout with a retry is almost always better than a long timeout that ties up a thread or connection.
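Most clients expose a timeout directly (for example, Python's standard-library `urllib.request.urlopen` accepts a `timeout` argument). When a library does not, one generic sketch is to run the call in a worker thread and bound how long the caller waits; note the caveat in the comments, and that `call_with_timeout` is an invented helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool for outbound calls

def call_with_timeout(fn, timeout):
    """Run fn in a worker thread; raise TimeoutError after `timeout` seconds.

    Caveat: this bounds how long the *caller* waits. The worker thread may
    keep running, so fn should also use the library's own timeout if it has one.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout}s") from None
```

Pairing this with the retry helper above gives the pattern the section recommends: short timeout, then a retry, rather than one long wait that ties up a thread.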
Try It Yourself
- Add jitter to the backoff so the wait is random within a range.
- Implement a basic circuit breaker that opens after 3 failures and half-opens after 5 seconds.
- Wrap the retry helper around a function that returns a fallback value when all attempts fail.