Monitoring and Iteration

A system you cannot observe is a system you cannot fix. Once code is in production, monitoring is what tells you whether users are happy, what is breaking, and where to spend the next sprint.

Metrics

A metric is a number you record over time: request count, latency, error rate, queue depth, CPU usage. Four of the most broadly useful metrics are often called the golden signals: latency, traffic, errors, and saturation.

Aggregate metrics let you spot trends. Percentiles (p50, p95, p99) describe the user experience much better than averages do, because an average can look healthy while the slowest few percent of requests take seconds. A simple metric collector can track average latency and error rate and flag an alert when either crosses a threshold.
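
A minimal sketch of such a metric collector follows. The class name MetricCollector, its method names, and both threshold defaults are assumptions chosen for illustration, not values taken from any particular monitoring tool. It records one latency and one success-or-error outcome per request, then reports the average latency, the error rate, and whether either has crossed its threshold.

```python
from dataclasses import dataclass, field


@dataclass
class MetricCollector:
    """Collects per-request latency and error outcomes (hypothetical helper)."""

    latency_threshold_ms: float = 500.0  # assumed threshold for average latency
    error_rate_threshold: float = 0.05   # assumed threshold: 5% of requests failing
    latencies_ms: list[float] = field(default_factory=list)
    error_count: int = 0

    def record(self, latency_ms: float, is_error: bool = False) -> None:
        """Record the latency and outcome of a single request."""
        self.latencies_ms.append(latency_ms)
        if is_error:
            self.error_count += 1

    def report(self) -> dict:
        """Summarize everything recorded so far and decide whether to alert."""
        total = len(self.latencies_ms)
        if total == 0:
            # Nothing recorded yet: report zeros rather than dividing by zero.
            return {"requests": 0, "avg_latency_ms": 0.0,
                    "error_rate": 0.0, "alert": False}
        avg_latency = sum(self.latencies_ms) / total
        error_rate = self.error_count / total
        return {
            "requests": total,
            "avg_latency_ms": avg_latency,
            "error_rate": error_rate,
            # Alert when either signal exceeds its threshold.
            "alert": (avg_latency > self.latency_threshold_ms
                      or error_rate > self.error_rate_threshold),
        }


collector = MetricCollector()
for latency in (120.0, 90.0, 110.0):
    collector.record(latency)
collector.record(2400.0, is_error=True)  # one slow, failed request
print(collector.report())
```

Note that a single slow request is enough to drag the average over the threshold here, which is one reason percentiles are usually a better basis for alerts than averages.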

Logs

A log entry records what happened at a single moment: a request, an error, a state change. Good logs include enough context (user id, request id, timestamp) to follow a single transaction across services.

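Logs are noisy, so they are only useful if you can search them. Emit structured fields (for example, JSON key-value pairs) rather than free-form strings so entries can be filtered and queried. Never log secrets or personally identifiable information.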
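
One way to get structured, searchable entries is to format each log record as a single JSON object. The sketch below uses Python's standard logging module. The logger name "checkout", the "context" attribute, and the SENSITIVE_KEYS list are all assumptions made for this example; a real system would choose its own field names and redaction rules.

```python
import json
import logging
import sys
import time
import uuid

# Assumed list of field names that must never appear in logs.
SENSITIVE_KEYS = {"password", "token", "email"}


def redact(fields: dict) -> dict:
    """Replace the values of sensitive keys before they are written."""
    return {key: ("[REDACTED]" if key in SENSITIVE_KEYS else value)
            for key, value in fields.items()}


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied through `extra=`.
        entry.update(redact(getattr(record, "context", {})))
        return json.dumps(entry)


logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same request_id on every entry lets you follow one transaction.
request_id = str(uuid.uuid4())
logger.info("payment accepted", extra={"context": {
    "request_id": request_id, "user_id": 42, "token": "abc123"}})
```

Passing the same request id to every service that handles the request is what makes it possible to follow one transaction across their logs.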

Alerts

An alert wakes someone up when something is wrong. A good alert is actionable, urgent, and rare. Pages that fire constantly create alert fatigue, and real incidents start getting ignored.

Tier alerts by severity. Page only on user-facing problems. Send lower-severity issues to a dashboard or chat channel for next-business-day review.
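
The tiering rule above can be written down as a small routing function. This is a sketch under stated assumptions: the three severity levels, the Alert fields, and the destination names "page", "chat", and "dashboard" are illustrative choices, not the interface of any real alerting product.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    user_facing: bool


def route(alert: Alert) -> str:
    """Return where an alert should go: 'page', 'chat', or 'dashboard'."""
    # Only critical, user-facing problems are allowed to wake someone up.
    if alert.user_facing and alert.severity is Severity.CRITICAL:
        return "page"
    # Informational alerts are only worth a dashboard entry.
    if alert.severity is Severity.INFO:
        return "dashboard"
    # Everything else waits in chat for next-business-day review.
    return "chat"


print(route(Alert("checkout 5xx spike", Severity.CRITICAL, user_facing=True)))
print(route(Alert("disk 80% full", Severity.WARNING, user_facing=False)))
print(route(Alert("nightly job slow", Severity.INFO, user_facing=False)))
```

Keeping the paging condition this narrow is deliberate: every extra case that can page is another source of alert fatigue.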

Post-Mortems

After an incident, write a post-mortem: what happened, what users saw, the timeline, the root cause, and what you will change. The goal is blameless learning. Almost all incidents come from systems that allowed mistakes, not from bad people.

Track the action items from each post-mortem. If the same kind of incident keeps happening, the system, not the on-call rotation, needs to change.

Iterate

Use what you learn. Each cycle of build, ship, observe, and adjust makes the next one better. Real software is never finished; it is just better than it was last week.

Try It Yourself

Extend the metric collector to track p95 latency.
Add a sliding window so the report only considers the last N events.
Write a short post-mortem template and fill it in for an imagined outage.