Module 11 · Lesson 5

Designing for Failure

Build systems that tolerate things going wrong. Retries, exponential backoff, circuit breakers, and graceful degradation.

Audio: Designing for Failure
0:000:00

Designing for Failure

In a real system, things break: networks drop packets, servers crash, dependencies time out. Designing for failure means assuming all of this and recovering gracefully so users barely notice.

Retries with Exponential Backoff

Many failures are temporary. A retry often succeeds on the second or third try. The trick is how you retry.

Retrying immediately can hammer a struggling server until it falls over. Exponential backoff doubles the wait each attempt: 50ms, 100ms, 200ms, 400ms. This gives the dependency time to recover and prevents your retries from amplifying the outage.

Add jitter (random wiggle) so many clients do not retry in lockstep, which would create thundering-herd waves of traffic.

Circuit Breakers

A circuit breaker stops calling a failing dependency for a while. If the last N calls have failed, the breaker opens and short-circuits future calls with an immediate error. After a cooldown, it goes to half-open and lets one probe through. If that succeeds, it closes; if it fails, it stays open longer.

This protects your system from spending resources on hopeless calls and gives the dependency room to recover.

Graceful Degradation

When a non-critical feature fails, the rest of the app should keep working. If the recommendations service is down, show a default product list. If the avatar service times out, show initials. Fallbacks turn outages into mild annoyances.

Decide ahead of time which features are critical and what to substitute when the rest break.

Timeouts

Set a timeout on every network call. The default of "wait forever" is the most common cause of cascading outages. A short timeout with a retry is almost always better than a long timeout that ties up a thread or connection.

Try It Yourself

  • Add jitter to the backoff so the wait is random within a range.
  • Implement a basic circuit breaker that opens after 3 failures and half-opens after 5 seconds.
  • Wrap the retry helper around a function that returns a fallback value when all attempts fail.

Code Playground

Edit the code below and click Run to see the output. Switch between languages using the tabs.

Loading editor...

Enjoying the lesson? Unlock the full System Design & Architecture from $4.99/mo.

See plans →