Prometheus数据采集与指标设计：RED/USE方法论与生产实践指南

情境与背景

Prometheus是云原生监控的核心，但很多团队在使用时缺乏系统的指标设计方法论。本指南详细讲解Prometheus的指标模型、RED和USE方法论、四种指标类型的适用场景，以及生产环境中指标设计的最佳实践。

一、Prometheus指标模型

1.1 数据模型

Prometheus指标结构：

## Prometheus指标模型

### 数据模型

**指标格式**：

```yaml
metric_name{label1="value1", label2="value2"} value timestamp

# 示例
http_requests_total{method="GET", endpoint="/api/users", status="200"} 12345 1704067200

命名规范：

naming_conventions:
  format: "{namespace}_{name}_{type}"
  
  components:
    namespace: "产品/服务名"
    name: "指标功能描述"
    type: "后缀如total/count/histogram"
    
  examples:
    - "http_requests_total"
    - "kubernetes_pod_status_phase"
    - "process_cpu_seconds_total"

### 1.2 四种指标类型

**指标类型详解**：

```yaml
metric_types:
  counter:
    description: "只增不减的累计值"
    use_case: "请求总数、错误总数"
    example: "http_requests_total"
    code: |
      # 累加
      http_requests_total{path="/api"} 100
      http_requests_total{path="/api"} 101
      
  gauge:
    description: "可增可减的当前值"
    use_case: "CPU使用率、内存占用"
    example: "cpu_usage_percent"
    code: |
      # 可增可减
      cpu_usage_percent{host="node-1"} 45.2
      cpu_usage_percent{host="node-1"} 50.1
      
  histogram:
    description: "对采样数据分桶统计"
    use_case: "请求延迟、响应大小"
    example: "http_request_duration_seconds"
    buckets: "[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]"
    
  summary:
    description: "直接计算分位数"
    use_case: "需要精确分位数"
    example: "http_request_duration_seconds"
    quantiles: "[0.5, 0.9, 0.99]"

Histogram vs Summary对比：

histogram_vs_summary:
  histogram:
    advantages:
      - "服务端计算分位数"
      - "可跨服务聚合"
      - "bucket可自定义"
    disadvantages:
      - "客户端开销小"
      - "分位数精度取决于bucket"
      
  summary:
    advantages:
      - "精确分位数"
      - "客户端直接输出"
    disadvantages:
      - "不可跨服务聚合"
      - "客户端开销大"

二、RED方法论

2.1 RED定义

RED方法论概述：

## RED方法论

### RED定义

**适用场景**：

```yaml
red适用场景:
  description: "用于监控微服务/API等面向用户的服务"
  
  three_metrics:
    Rate:
      definition: "请求速率"
      question: "服务收到多少请求？"
      example: "每秒多少请求"
      
    Errors:
      definition: "错误率"
      question: "有多少请求失败？"
      example: "5xx错误比例"
      
    Duration:
      definition: "响应时间"
      question: "处理请求需要多久？"
      example: "P99延迟"

RED指标示例：

red_metrics_example:
  service: "用户服务 user-service"
  
  Rate:
    - "user_service_requests_total"
    - "label: method, endpoint, status"
    
  Errors:
    - "user_service_errors_total"
    - "label: method, endpoint, error_type"
    
  Duration:
    - "user_service_request_duration_seconds"
    - "type: histogram"
    - "buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5]"

2.2 RED应用示例

HTTP服务RED指标：

// Go语言实现RED指标
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "net/http"
)

var (
    // Rate: 请求总数
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Duration: 请求延迟
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency distribution",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}

func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Rate计数
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
        
        // Duration记录
        // ... 记录延迟
    })
}

Python实现：

# Python实现RED指标
from prometheus_client import Counter, Histogram, generate_latest

# Rate: 请求总数
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Duration: 延迟分布
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Errors直接用Counter的status label

三、USE方法论

3.1 USE定义

USE方法论概述：

## USE方法论

### USE定义

**适用场景**：

```yaml
use适用场景:
  description: "用于监控系统资源如CPU、内存、磁盘、网络"
  
  three_metrics:
    Utilization:
      definition: "资源利用率"
      question: "资源被使用了多少？"
      example: "CPU使用率百分比"
      
    Saturation:
      definition: "资源饱和度"
      question: "资源有多满？"
      example: "CPU队列长度"
      
    Errors:
      definition: "错误数"
      question: "资源出错了吗？"
      example: "网络丢包数"

USE指标示例：

use_metrics_example:
  resource: "CPU"
  
  Utilization:
    - "node_cpu_usage_percent"
    - "或: 1 - idle"
    
  Saturation:
    - "node_load1"  # 1分钟负载
    - "node_load5"  # 5分钟负载
    
  Errors:
    - "node_cpu_errors_total"  # CPU错误（如果有）

use_metrics_example_disk:
  resource: "Disk"
  
  Utilization:
    - "disk_usage_percent"
    
  Saturation:
    - "io_queue_length"
    
  Errors:
    - "disk_read_errors_total"
    - "disk_write_errors_total"

3.2 常用资源监控指标

系统资源指标：

system_metrics:
  cpu:
    utilization: "node_cpu_seconds_total{mode=\"idle\"}"
    saturation: "node_load1"
    
  memory:
    utilization: "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"
    saturation: "node_vmstat_pgpgin"
    
  disk:
    utilization: "1 - node_filesystem_avail_bytes{fstype!~\"tmpfs|fuse.lxcfs\"} / node_filesystem_size_bytes"
    saturation: "node_disk_io_time_seconds_total"
    
  network:
    utilization: "rate(node_network_receive_bytes_total[5m])"
    saturation: "rate(node_network_receive_drop_total[5m])"

Kubernetes资源指标：

k8s_metrics:
  node:
    cpu_utilization: "kubectl top node"
    memory_utilization: "kubectl top node"
    
  pod:
    cpu_usage: "kubectl top pod"
    memory_usage: "kubectl top pod"
    
  container:
    restart_count: "kube_pod_container_status_restarts_total"
    last_termination_reason: "kube_container_last_termination_reason"

四、标签设计原则

4.1 标签命名规范

标签设计原则：

## 标签设计原则

### 标签命名规范

**命名规范**：

```yaml
label_naming:
  style: "lowercase with underscores"
  
  examples:
    good: "user_id, request_count, http_status"
    bad: "userId, requestCount, HTTPStatus"
    
  cardinality:
    low_cardinality:
      - "status: 200, 404, 500"
      - "method: GET, POST, PUT, DELETE"
      - "endpoint: /api/users, /api/orders"
      
    high_cardinality:
      - "user_id: 1, 2, 3, ... (10万+)"
      - "request_id: uuid格式"
      - "trace_id: 分布式追踪ID"

4.2 高基数问题

高基数标签危害：

high_cardinality_problems:
  storage:
    - "指标数量爆炸"
    - "存储成本剧增"
    example: "userID有100万用户 → 100万时间序列"
    
  query:
    - "查询延迟增加"
    - "内存占用过高"
    
  cardinality_limit:
    prometheus: "每张指标卡片的标签组合数有限制"
    practical: "单指标标签组合应 < 10万"

解决方案：

high_cardinality_solutions:
  avoid_labels:
    - "user_id"
    - "session_id"
    - "request_id"
    - "trace_id"
    
  alternative:
    - "使用trace_id关联外部系统"
    - "用k/v存储原始数据"
    - "用histogram/summary聚合"
    
  good_labels:
    - "service"
    - "endpoint"
    - "method"
    - "status"
    - "job"
    - "instance"

五、采集配置最佳实践

5.1 抓取配置

scrape_configs配置：

# Prometheus抓取配置
global:
  scrape_interval: 15s      # 抓取间隔
  evaluation_interval: 15s  # 规则评估间隔

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
    - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

relabel_configs使用：

relabel_configs_examples:
  # 1. 过滤target
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: 'my-app'
    action: keep
    
  # 2. 重命名标签
  - source_labels: [__meta_kubernetes_pod_name]
    regex: '(.*)'
    target_label: pod
    replacement: '${1}'
    
  # 3. 添加标签
  - target_label: environment
    replacement: 'production'
    
  # 4. 删除标签
  - regex: '__meta_kubernetes_pod_label_(.*)'
    action: labeldrop

5.2 采集频率选择

抓取间隔选择：

scrape_interval_selection:
  10s:
    use_case: "高实时性要求场景"
    example: "核心交易系统"
    pros: "数据精确"
    cons: "资源消耗大"
    
  15s:
    use_case: "一般生产环境"
    example: "普通微服务"
    pros: "平衡之选"
    cons: "中等资源消耗"
    
  30s:
    use_case: "低频变化指标"
    example: "配置指标、日志统计"
    pros: "资源节省"
    cons: "数据粒度粗"
    
  60s+:
    use_case: "业务统计指标"
    example: "每日订单量"
    pros: "极低消耗"
    cons: "无法做实时告警"

5.3 资源消耗估算

Prometheus资源需求：

resource_requirements:
  per_target:
    memory: "~1MB"
    cpu: "~0.5m"
    
  estimation:
    formula: |
      Memory ≈ Targets × ScrapeInterval × SamplesPerScrape × 3
      CPU ≈ Targets × ScrapeInterval × 0.1m
      
  example:
    targets: 1000
    scrape_interval: 15s
    memory: "1000 × 15 × 100 × 3 ≈ 450MB"

六、生产环境最佳实践

6.1 指标设计检查清单

设计原则：

design_checklist:
  naming:
    - "遵循{namespace}_{name}_{type}规范"
    - "使用小写字母和下划线"
    - "包含单位后缀（如_seconds, _bytes）"
    
  labels:
    - "避免高基数标签"
    - "标签命名一致"
    - "控制在5-10个标签以内"
    
  types:
    - "累计值用Counter"
    - "瞬时值用Gauge"
    - "延迟分布用Histogram"
    - "精确分位用Summary"

6.2 常用指标命名约定

社区约定：

community_conventions:
  # 请求类指标
  http_requests_total:
    description: "HTTP请求总数"
    labels: "method, handler, status"
    
  http_request_duration_seconds:
    description: "HTTP请求延迟"
    type: "histogram"
    buckets: "[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]"
    
  # 业务类指标
  orders_total:
    description: "订单总数"
    labels: "status, type"
    
  user_login_total:
    description: "用户登录次数"
    labels: "method, status"

6.3 Grafana仪表盘设计

仪表盘最佳实践：

grafana_dashboard:
  panels:
    - "使用变量实现动态筛选"
    - "统一时间范围"
    - "适当使用模板"
    
  variables:
    - name: "env"
      query: "label_values(http_requests_total, env)"
      
    - name: "service"
      query: "label_values(http_requests_total{env=\"$env\"}, service)"

6.4 告警规则设计

告警设计原则：

alert_design:
  severity:
    critical: "服务不可用"
    warning: "性能降级"
    info: "需要关注"
    
  for_duration:
    critical: "5m (5分钟持续)"
    warning: "10m (10分钟持续)"
    
  thresholds:
    error_rate:
      critical: "> 1%"
      warning: "> 0.1%"
      
    latency_p99:
      critical: "> 2s"
      warning: "> 1s"

告警规则示例：

groups:
- name: service-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "服务错误率过高"
      description: "错误率: "

七、面试1分钟精简版（直接背）

完整版：

Prometheus指标设计：1. 方法论：RED用于服务监控（Rate请求速率、Errors错误率、Duration响应时间），USE用于资源监控（Utilization利用率、Saturation饱和度、Errors错误）；2. 指标类型：Counter用于累计值、Gauge用于瞬时值、Histogram用于延迟分布、Summary用于精确分位数；3. 标签设计：避免高基数标签（userID/requestID），控制在5-10个标签；4. 采集频率：一般15秒，高实时要求10秒；5. 命名规范：{namespace}{name}{type}，如http_requests_total_seconds。生产实践：Histogram优于Summary（可聚合）。

30秒超短版：

指标设计用RED（服务）和USE（资源），Counter累计Gauge瞬时，Histogram分布Summary分位，避免高基数标签，采集间隔15秒。

八、总结

8.1 方法论对比

methodology_comparison:
  RED:
    适用: "微服务/API"
    指标: "Rate/Errors/Duration"
    
  USE:
    适用: "系统资源"
    指标: "Utilization/Saturation/Errors"

8.2 指标类型选择

type_selection:
  counter:
    选: "需要累计的场景"
    
  gauge:
    选: "需要显示当前值的场景"
    
  histogram:
    选: "需要延迟分布且可聚合"
    
  summary:
    选: "需要精确分位数且不需聚合"

8.3 记忆口诀

指标设计有方法，RED看服务USE看资源，
Counter累计Gauge瞬，Histogram分布Summary分位，
标签设计要精简，高基数标签要避免，
采集间隔十五秒，命名规范记心间。

参考链接：SRE运维面试题全解析：从理论到实践（第二部分）

«Prometheus Service Discovery Best Practices

Prometheus Ha Best Practices»

文档信息

本文作者：soveran zhong
本文链接：https://blog.clockwingsoar.cn/2026/05/09/prometheus-metrics-design-best-practices/
版权声明：自由转载-非商用-非衍生-保持署名（创意共享3.0许可证）

codezhong

Prometheus Metrics Design Best Practices