I wrote some stuff while I was at Google about writing clean alerts and keeping an oncall rotation sane; after some cleanup they’ve allowed me to make it public. Of course, this represents my opinions and not Google’s. They do reflect what we think are best practices at Tumblr, though. We’re hiring Site Reliability Engineers.
Check out My Philosophy On Alerting.
When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:
- Pages should be urgent, important, actionable, and real.
- They should represent either ongoing or imminent problems with your service.
- Err on the side of removing noisy alerts.
- You should almost always be able to classify the problem into one of:
  - availability & basic functionality
  - correctness (completeness, freshness and durability of data)
  - feature-specific problems
- Alerting on symptoms, rather than on causes, captures more problems more comprehensively and robustly with less effort.
- The further up your serving stack you go, the more problems you catch in a single rule. Balance this with being able to distinguish what’s going on.
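
To make the symptom-based idea concrete, here is a minimal sketch of a paging rule evaluated at the top of the serving stack. It is not taken from the original document: the names `Sample` and `should_page`, the 5% threshold, and the three-interval window are all hypothetical choices for illustration. The rule pages only when the user-visible error ratio stays above the threshold for several consecutive intervals, which is one simple way to keep pages "real" rather than noisy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One measurement interval at the user-facing edge (hypothetical shape)."""
    requests: int  # requests served during the interval
    errors: int    # how many of those requests failed


def should_page(samples: List[Sample],
                threshold: float = 0.05,
                sustained: int = 3) -> bool:
    """Return True only if the error ratio exceeded `threshold` in each of
    the last `sustained` intervals.

    Assumption: a single bad interval is treated as a blip, not a page.
    """
    recent = samples[-sustained:]
    if len(recent) < sustained:
        return False  # not enough history to call the problem sustained
    for s in recent:
        if s.requests == 0:
            # Simplification: no traffic counts as "no evidence" here.
            # A real rule might treat missing traffic as its own symptom.
            return False
        if s.errors / s.requests <= threshold:
            return False
    return True


healthy = [Sample(1000, 2), Sample(1000, 3), Sample(1000, 1)]
blip = [Sample(1000, 2), Sample(1000, 200), Sample(1000, 3)]
outage = [Sample(1000, 2), Sample(1000, 200), Sample(1000, 300), Sample(1000, 250)]

print(should_page(healthy), should_page(blip), should_page(outage))
```

Because the rule looks only at what users experience, it fires whether the failures come from a bad deploy, a database problem, or a full disk. Working out which of those it was is left to dashboards and debugging once the page arrives.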