Monitoring and Alerting System Design: Production Best Practices
Situation
Monitoring and alerting are an SRE's lifeline. A well-built monitoring system senses system state in real time, surfaces latent problems early, and safeguards service availability and stability.
Conflict
Many teams run into the following challenges when building out monitoring:
- Alert storms: floods of duplicate alerts cause alert fatigue
- Monitoring blind spots: key metrics go unmonitored
- Unmeasurable SLAs: no clearly defined SLA metrics or monitoring
- Slow alert response: no effective escalation mechanism
- Data silos: monitoring data scattered across multiple systems
Question
How do you build an efficient, intelligent, SLA-driven monitoring and alerting system?
Answer
Drawing on real production cases, this article provides a complete set of best practices for building a monitoring and alerting system.
1. Monitoring Architecture Design
1.1 Architecture Overview
```mermaid
flowchart TD
    subgraph collect["Data Collection Layer"]
        A["Node Exporter"]
        B["Application Metrics"]
        C["Logs (Filebeat/Fluentd)"]
        D["Tracing (Jaeger/Zipkin)"]
    end
    subgraph store["Data Storage Layer"]
        E["Prometheus (time series)"]
        F["Elasticsearch (logs)"]
        G["Jaeger Storage (traces)"]
    end
    subgraph analyze["Data Analysis Layer"]
        H["Alertmanager"]
        I["Grafana"]
        J["Kibana"]
    end
    subgraph notify["Notification Layer"]
        K["PagerDuty"]
        L["Slack/DingTalk"]
        M["Email"]
        N["SMS"]
    end
    A --> E
    B --> E
    C --> F
    D --> G
    E --> H
    E --> I
    F --> J
    H --> K
    H --> L
    H --> M
    H --> N
    style collect fill:#e3f2fd
    style store fill:#fff3e0
    style analyze fill:#c8e6c9
    style notify fill:#f8bbd9
```
1.2 Metric Categories
| Metric Type | What to Monitor | Key Metrics |
|---|---|---|
| Infrastructure | CPU, memory, disk, network | CPU utilization, memory utilization, disk I/O, network latency |
| Application | Request volume, response time, error rate | QPS, P50/P95/P99 latency, error rate |
| Business | Business success rate, conversion rate | Order success rate, payment success rate, page conversion rate |
| SLA | Service availability, performance compliance | SLA attainment, MTTR, MTBF |
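The categories above map directly to PromQL. A few illustrative queries (the Node Exporter metric names are standard; the `http_requests_total` / `http_request_duration_seconds_bucket` metric and label names are assumptions matching the examples later in this article):

```promql
# Infrastructure: CPU utilization per instance (100 minus idle%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Application: QPS per service
sum by (service) (rate(http_requests_total[5m]))

# Application: P99 latency from a histogram
histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# Application: error rate (share of 5xx responses)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```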
2. Prometheus Configuration in Practice
2.1 Exporter Configuration
```yaml
# Node Exporter Service and DaemonSet
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          args:
            - --path.rootfs=/host
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
```
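Once the DaemonSet is running, every node serves plain-text metrics on `:9100/metrics`. To make the scrape target concrete, here is a minimal parser for one line of that exposition format (a sketch only; the real format also carries `# HELP`/`# TYPE` metadata and escaping rules that this ignores):

```python
def parse_metric_line(line: str):
    """Parse one Prometheus exposition line, e.g.
    node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
    Returns (name, labels_dict, value); comments and blanks return None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    metric, _, value = line.rpartition(" ")  # value is the last whitespace-separated token
    labels = {}
    if "{" in metric:
        name, _, rest = metric.partition("{")
        for pair in rest.rstrip("}").split(","):
            k, _, v = pair.partition("=")
            labels[k] = v.strip('"')
    else:
        name = metric
    return name, labels, float(value)

print(parse_metric_line('node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67'))
# -> ('node_cpu_seconds_total', {'cpu': '0', 'mode': 'idle'}, 12345.67)
```

This breaks on label values containing commas or spaces; for real work use an official Prometheus client library rather than hand parsing.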
2.2 Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
        target_label: __address__
```
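The last relabel rule rewrites the scrape address using the `prometheus.io/port` annotation: Prometheus joins the source labels with `;` and applies an anchored regex substitution. The rewrite logic can be sanity-checked in isolation (a sketch mirroring the rule, not Prometheus's actual implementation):

```python
import re

# Mirrors the relabel rule: host[:discovered_port];annotation_port -> host:annotation_port
RELABEL_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, port_annotation: str) -> str:
    joined = f"{address};{port_annotation}"  # Prometheus joins source_labels with ';'
    m = RELABEL_RE.fullmatch(joined)         # relabel regexes are fully anchored
    if not m:
        return address  # no match: __address__ is left untouched
    return f"{m.group(1)}:{m.group(2)}"

print(rewrite_address("10.0.0.5:8080", "9100"))  # -> 10.0.0.5:9100
print(rewrite_address("10.0.0.5", "9100"))       # -> 10.0.0.5:9100
```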
2.3 Alerting Rule Configuration
```yaml
# rules/alerts.yml
groups:
  - name: infrastructure_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "High CPU Usage"
          description: "{{ $labels.instance }} CPU usage is above 90% (current: {{ $value | humanize }}%)"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "High Memory Usage"
          description: "{{ $labels.instance }} memory usage is above 90% (current: {{ $value | humanize }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 85
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low Disk Space"
          description: "{{ $labels.instance }} disk usage is above 85% (current: {{ $value | humanize }}%)"
```
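Note how each rule combines a threshold with a `for:` duration: the alert becomes *pending* when its expression first evaluates true, and only *firing* once it has stayed true for the whole duration. A simplified model of that state machine (assumptions: the 15s `evaluation_interval` from the config above, and a streak-based approximation of Prometheus's timestamp bookkeeping):

```python
FOR_SECONDS = 300    # for: 5m
EVAL_INTERVAL = 15   # evaluation_interval: 15s

def alert_state(samples: list) -> str:
    """samples: expression results (bool) at consecutive evaluation steps.
    Returns the alert state after the last evaluation."""
    streak = 0
    for breached in samples:
        streak = streak + 1 if breached else 0  # any false evaluation resets the clock
    if streak == 0:
        return "inactive"
    held = (streak - 1) * EVAL_INTERVAL  # time since the streak's first breach
    return "firing" if held >= FOR_SECONDS else "pending"

print(alert_state([True] * 21))            # firing: held true for a full 300s
print(alert_state([True] * 5))             # pending: only 60s so far
print(alert_state([True] * 30 + [False]))  # inactive: one good evaluation resets it
```

This is why `for:` suppresses flapping: a brief CPU spike never reaches firing.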
3. SLA Monitoring and Assurance
3.1 SLA Metric Definitions
```yaml
# SLA alerting rules
groups:
  - name: sla_alerts
    rules:
      - alert: ApiLatencySlaBreach
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.15
        for: 5m
        labels:
          severity: critical
          team: sre
          sla: "api-latency"
        annotations:
          summary: "API Latency SLA Breach"
          description: "{{ $labels.service }} P95 latency is {{ $value | humanize }}s, exceeding the 150ms target"
      - alert: ApiSuccessRateSlaBreach
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 0.5
        for: 5m
        labels:
          severity: critical
          team: sre
          sla: "api-success-rate"
        annotations:
          summary: "API Success Rate SLA Breach"
          description: "API error rate is {{ $value | humanize }}%, exceeding the 0.5% target"
```
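The 0.5% error-rate ceiling above is equivalent to a 99.5% success-rate SLO, which implies an error budget. A sketch of the budget arithmetic (the 99.5% objective comes from the rule above; the request counts are illustrative):

```python
# Error-budget accounting for a 99.5% success-rate SLO (0.5% budget).
SLO_TARGET = 0.995

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent over the SLO window."""
    budget = (1 - SLO_TARGET) * total_requests  # failures we may tolerate
    return (budget - failed_requests) / budget

def burn_rate(error_rate: float) -> float:
    """How fast the budget burns relative to 'exactly on target' pace.
    A value above 1 means the budget runs out before the window ends."""
    return error_rate / (1 - SLO_TARGET)

print(error_budget_remaining(1_000_000, 2_000))  # ~0.6 -> 60% of the budget left
print(burn_rate(0.02))                           # ~4.0 -> burning 4x too fast
```

Burn rate is what multi-window SLO alerting keys on: page on a fast burn (e.g. > 10x over an hour), ticket on a slow one.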
3.2 SLA Dashboard Configuration
Grafana dashboard JSON snippet:
```json
{
  "title": "SLA Dashboard",
  "panels": [
    {
      "type": "stat",
      "title": "API Availability",
      "targets": [
        {
          "expr": "100 - (sum(rate(http_requests_total{status_code=~\"5..\"}[24h])) / sum(rate(http_requests_total[24h])) * 100)",
          "legendFormat": "Availability"
        }
      ],
      "thresholds": "99.5,99.9",
      "colorMode": "value"
    },
    {
      "type": "graph",
      "title": "API Latency (P95)",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95 Latency"
        }
      ],
      "yAxis": {
        "label": "Seconds",
        "min": 0,
        "max": 0.5
      }
    }
  ]
}
```
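The dashboard thresholds (99.5 / 99.9) translate directly into an allowed-downtime budget, which is often the easiest way to communicate an SLA to stakeholders. A quick calculation per target, assuming a 30-day month:

```python
# Allowed downtime per 30-day month for a given availability target.
MONTH_SECONDS = 30 * 24 * 3600

def downtime_budget_seconds(availability_percent: float) -> float:
    return MONTH_SECONDS * (1 - availability_percent / 100)

for target in (99.5, 99.9, 99.99):
    mins = downtime_budget_seconds(target) / 60
    print(f"{target}% -> {mins:.1f} min/month")
# 99.5%  -> 216.0 min/month
# 99.9%  -> 43.2 min/month
# 99.99% -> 4.3 min/month
```

The jump from "three and a half hours" to "four minutes" per extra nine is why SLA targets should be negotiated, not defaulted.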
4. Alert Management Best Practices
4.1 Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
route:
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-sre'
  routes:
    - match:
        severity: critical
      receiver: 'team-sre-critical'
      group_wait: 10s
    - match:
        severity: warning
      receiver: 'team-sre-warning'
      group_wait: 20s
receivers:
  - name: 'team-sre'
    email_configs:
      - to: 'sre@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'https://hooks.slack.com/services/XXX'
        send_resolved: true
  - name: 'team-sre-critical'
    email_configs:
      - to: 'sre@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'https://hooks.slack.com/services/XXX'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        send_resolved: true
  # Alertmanager rejects configs that route to an undefined receiver,
  # so the warning route above needs its receiver declared too:
  - name: 'team-sre-warning'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/XXX'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
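The inhibit rule above suppresses warning-level alerts whenever a critical alert with the same `alertname` and `instance` is firing, so an incident produces one page instead of a cascade. A simplified model of that matching logic (alerts as plain dicts; real Alertmanager matches on full label sets):

```python
# Simplified Alertmanager inhibition: drop a target alert when a source alert
# matches source_match (severity=critical) and agrees on all `equal` labels.
EQUAL_LABELS = ("alertname", "instance")

def apply_inhibition(alerts: list) -> list:
    criticals = [a for a in alerts if a["severity"] == "critical"]
    kept = []
    for a in alerts:
        suppressed = a["severity"] == "warning" and any(
            all(a.get(k) == c.get(k) for k in EQUAL_LABELS) for c in criticals
        )
        if not suppressed:
            kept.append(a)
    return kept

alerts = [
    {"alertname": "HighCpuUsage", "instance": "node-1", "severity": "critical"},
    {"alertname": "HighCpuUsage", "instance": "node-1", "severity": "warning"},
    {"alertname": "DiskSpaceLow", "instance": "node-2", "severity": "warning"},
]
print([a["alertname"] for a in apply_inhibition(alerts)])
# -> ['HighCpuUsage', 'DiskSpaceLow']  (the duplicate warning is suppressed)
```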
4.2 Alert Severity Levels
| Level | Definition | Response Time | Notification Channels |
|---|---|---|---|
| P0 | Complete system outage | Immediate | Phone + SMS + DingTalk + Email |
| P1 | Critical feature broken | Within 10 minutes | DingTalk + Email |
| P2 | Non-core feature broken | Within 30 minutes | DingTalk + Email |
| P3 | Informational | Within 24 hours | Email |
5. Log Management Best Practices
5.1 EFK Stack Configuration
```yaml
# Elasticsearch StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
          env:
            # With replicas: 3, `discovery.type: single-node` would make each
            # pod form its own cluster; use seed hosts and initial masters.
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: cluster.name
              value: logging
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: ES_JAVA_OPTS
              value: "-Xms512m -Xmx512m"
          ports:
            - containerPort: 9200
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
```
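One piece the manifest depends on but does not show: `serviceName: elasticsearch` requires a matching headless Service so each pod gets the stable DNS name (`elasticsearch-0.elasticsearch`, …) that the seed-hosts list refers to. A minimal sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
spec:
  clusterIP: None   # headless: gives each StatefulSet pod a stable DNS name
  selector:
    app: elasticsearch
  ports:
    - name: http
      port: 9200
```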
5.2 Fluentd Configuration
```conf
# fluentd.conf
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix kubernetes
</match>
```
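The `@type json` parser expects each container log line to be a JSON document whose `time` field matches the configured format. A quick illustration of what gets extracted from one such line (the sample line is invented; Python's `strptime` has no `%N`, so the nanoseconds are truncated to microseconds here):

```python
import json
from datetime import datetime

# A typical Docker/containerd JSON log line as Fluentd tails it.
line = '{"log":"GET /healthz 200\\n","stream":"stdout","time":"2026-04-28T09:15:02.123456789Z"}'

record = json.loads(line)
# %N in the Fluentd config is nanoseconds; Python's %f parses at most 6 digits,
# so truncate the fractional part before parsing.
ts = record["time"].rstrip("Z")
head, _, frac = ts.partition(".")
parsed = datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")

print(record["stream"], parsed.isoformat())  # stdout 2026-04-28T09:15:02.123456
```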
6. Distributed Tracing Configuration
6.1 Jaeger Configuration
```yaml
# Jaeger all-in-one Deployment (convenient for evaluation; production setups
# split collector/query and use a real storage backend)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 6831   # agent: compact thrift over UDP
              protocol: UDP
            - containerPort: 16686  # query UI
            - containerPort: 14268  # collector HTTP
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```
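Jaeger only pays off when services propagate trace context between hops. The W3C `traceparent` header that instrumented services pass downstream can be sketched with the standard library alone (illustration only; real services should get this from a Jaeger or OpenTelemetry client rather than hand-rolled code):

```python
import secrets

def new_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 128-bit trace id, hex-encoded
    span_id = secrets.token_hex(8)    # 64-bit span id for this hop
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled

def child_traceparent(parent: str) -> str:
    """Keep the trace id, mint a fresh span id for the downstream call."""
    version, trace_id, _, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = new_traceparent()
print(header)                     # e.g. 00-4bf92f35...-00f067aa...-01
print(child_traceparent(header))  # same trace id, new span id
```

Because every hop shares the trace id, Jaeger can stitch the spans from all services into one end-to-end trace.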
7. Monitoring Visualization Best Practices
7.1 Grafana Dashboard Design
System-overview dashboard JSON snippet:
```json
{
  "title": "System Overview",
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(node_cpu_seconds_total, instance)"
      }
    ]
  },
  "panels": [
    {
      "gridPos": { "x": 0, "y": 0, "w": 8, "h": 4 },
      "type": "graph",
      "title": "CPU Usage",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\", instance=~\"$instance\"}[5m])) * 100)",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "gridPos": { "x": 8, "y": 0, "w": 8, "h": 4 },
      "type": "graph",
      "title": "Memory Usage",
      "targets": [
        {
          "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
          "legendFormat": "{{instance}}"
        }
      ]
    }
  ]
}
```
8. Summary of Best Practices
8.1 Design Principles
| Principle | Description | Practical Advice |
|---|---|---|
| Full coverage | Monitor every key metric | Infrastructure + application + business metrics |
| SLA-driven | Put SLAs at the center | Define explicit SLA metrics |
| Alert convergence | Reduce alert noise | Use inhibition rules and grouping |
| Observability | Unified data platform | Prometheus + Elasticsearch + Jaeger |
| Automation | Automatic discovery and configuration | Kubernetes service discovery |
8.2 Common Problems and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Alert storms | Floods of duplicate alerts | Configure group_by and inhibit_rules |
| Monitoring blind spots | Critical issues go unnoticed | Audit monitoring coverage regularly |
| SLA misses | Low service availability | Build SLA monitoring and alerting |
| Slow alert response | Incidents handled late | Configure escalation policies |
| Scattered data | Hard to pinpoint problems | Unify the monitoring platform |
Conclusion
Monitoring and alerting are the backbone of service reliability. An SLA-driven monitoring system, well-tuned alerting rules, alert convergence, and automated response together yield a marked improvement in observability and incident-handling efficiency.
Further reading: for more monitoring-related interview questions, see "SRE Interview Question Analysis: Matching JDs Against Résumés".
Document Information
- Author: soveran zhong
- Permalink: https://blog.clockwingsoar.cn/2026/04/28/monitoring-alerting-sla-production-best-practices/
- License: free to redistribute, non-commercial, no derivatives, attribution required (Creative Commons 3.0)