Alert Rule Standards

Intermediate

Enforce standards for Grafana alert rules — naming conventions, severity labels, for-duration requirements, runbook annotations, and notification routing rules.

File Patterns

**/grafana/alerting/**/*.yaml**/grafana/provisioning/**/*.yaml

This rule applies to files matching the patterns above.

Rule Content

rule-content.md

# Alert Rule Standards

## Rule
All Grafana alert rules MUST have severity labels, for-duration, runbook links, and follow consistent naming patterns. Alerts MUST route based on severity level.

## Naming Convention
```
[Service] Symptom Description
```
Examples:
- "[API] High Error Rate"
- "[Database] Connection Pool Exhausted"
- "[Worker] Job Queue Backlog Growing"

## Required Labels
| Label | Values | Purpose |
|-------|--------|---------|
| severity | critical, warning, info | Routing priority |
| team | backend, frontend, infra | Ownership |
| service | api, worker, database | Affected service |
| environment | production, staging | Environment scope |

## Required Annotations
| Annotation | Content |
|-----------|---------|
| summary | "{{ $value }} — brief description of what is wrong" |
| description | Detailed explanation with affected service and impact |
| runbook_url | Link to remediation steps |

## For-Duration Requirements
| Severity | Minimum for-duration |
|----------|---------------------|
| critical | 5 minutes |
| warning | 10 minutes |
| info | 15 minutes |

## Routing Rules
| Severity | Channel | Repeat Interval |
|----------|---------|----------------|
| critical | PagerDuty + Slack | 30 minutes |
| warning | Slack | 4 hours |
| info | Email digest | 24 hours |

## Examples

### Good
```yaml
- title: "[API] High Error Rate"
  for: 5m
  labels:
    severity: critical
    team: backend
    service: api
  annotations:
    summary: "Error rate is {{ $value }}% (threshold: 5%)"
    runbook_url: https://wiki.example.com/runbooks/api-error-rate
```

### Bad
```yaml
- title: "Alert 1"           # Non-descriptive name
  for: 0s                     # No for-duration
  # No severity label
  # No runbook link
```

## Enforcement
Review alert rules in provisioning PRs.
Audit active alerts monthly for noise/relevance.
Test alert routing in staging environment.

FAQ

Discussion

Loading comments...

# Alert Rule Standards ## Rule All Grafana alert rules MUST have severity labels, for-duration, runbook links, and follow consistent naming patterns. Alerts MUST route based on severity level. ## Naming Convention ``` [Service] Symptom Description ``` Examples: - "[API] High Error Rate" - "[Database] Connection Pool Exhausted" - "[Worker] Job Queue Backlog Growing" ## Required Labels | Label | Values | Purpose | |-------|--------|---------| | severity | critical, warning, info | Routing priority | | team | backend, frontend, infra | Ownership | | service | api, worker, database | Affected service | | environment | production, staging | Environment scope | ## Required Annotations | Annotation | Content | |-----------|---------| | summary | "{{ $value }} — brief description of what is wrong" | | description | Detailed explanation with affected service and impact | | runbook_url | Link to remediation steps | ## For-Duration Requirements | Severity | Minimum for-duration | |----------|---------------------| | critical | 5 minutes | | warning | 10 minutes | | info | 15 minutes | ## Routing Rules | Severity | Channel | Repeat Interval | |----------|---------|----------------| | critical | PagerDuty + Slack | 30 minutes | | warning | Slack | 4 hours | | info | Email digest | 24 hours | ## Examples ### Good ```yaml - title: "[API] High Error Rate" for: 5m labels: severity: critical team: backend service: api annotations: summary: "Error rate is {{ $value }}% (threshold: 5%)" runbook_url: https://wiki.example.com/runbooks/api-error-rate ``` ### Bad ```yaml - title: "Alert 1" # Non-descriptive name for: 0s # No for-duration # No severity label # No runbook link ``` ## Enforcement Review alert rules in provisioning PRs. Audit active alerts monthly for noise/relevance. Test alert routing in staging environment.