Prometheus Pull vs Push模式深度对比：监控架构设计指南

一、引言

在监控系统设计中，数据采集模式是核心架构决策之一。Prometheus作为云原生时代监控的事实标准，选择了以Pull（拉取）模式为核心的数据采集方式，这与传统监控系统如Zabbix采用的Push（推送）模式形成鲜明对比。本文将深入剖析这两种模式的设计理念、技术实现、优缺点及适用场景，并提供生产环境最佳实践。

二、SCQA分析框架

情境（Situation）

监控系统是现代运维的核心基础设施
云原生环境下，服务动态扩缩容成为常态
微服务架构要求监控系统具备高度灵活性

冲突（Complication）

传统Push模式在动态环境下配置复杂
Pull模式需要解决网络穿透和短任务采集问题
不同场景对监控模式有不同需求

问题（Question）

Prometheus为什么选择Pull模式？
Pull模式和Push模式各有什么优缺点？
短生命周期任务如何采集？
生产环境中如何选择和配置采集模式？
Prometheus和Zabbix该如何选择？

答案（Answer）

Prometheus选择Pull模式是为了更好地适应云原生环境
Pull模式灵活、易于扩展，Push模式适合跨网络场景
使用Pushgateway处理短任务
根据场景选择合适的采集模式
根据环境特点选择监控系统

三、Prometheus Pull模式详解

3.1 核心架构

Prometheus采用典型的Pull模式架构：

┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Retrieval   │→│   Storage    │→│    PromQL        │   │
│  │  (拉取模块)   │  │  (TSDB存储)  │  │   (查询引擎)    │   │
│  └──────┬───────┘  └──────────────┘  └──────────────────┘   │
│         │                                                     │
│         ▼                                                     │
│  ┌──────────────────────────────────────────────────────┐     │
│  │              Scrape Configurations                   │     │
│  │  - 静态目标配置                                      │     │
│  │  - 服务发现 (K8s/Consul/DNS)                        │     │
│  │  - 采集间隔、超时配置                                │     │
│  └──────────────────────────────────────────────────────┘     │
└───────────────────────────┬───────────────────────────────────┘
                            │ HTTP GET /metrics
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Exporter 1  │    │ Exporter 2  │    │  应用服务   │
│ node_exporter│   │mysql_exporter│   │(内置metrics)│
└─────────────┘    └─────────────┘    └─────────────┘

3.2 工作流程

Pull模式的完整工作流程分为四个阶段：

阶段一：服务发现

# 静态配置
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

# Kubernetes服务发现
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: 'my-app'
        action: keep

阶段二：指标采集

Prometheus Server按照配置的采集间隔主动发起HTTP请求：

# 采集请求示例
GET /metrics HTTP/1.1
Host: node1:9100
Accept: application/openmetrics-text;version=1.0.0,text/plain;version=0.0.4;q=0.5,*/*;q=0.1

阶段三：数据存储

采集到的指标以时间序列格式存储在TSDB中：

指标名 + 标签 → 时间戳 + 值
http_requests_total{method="GET",status="200"} 12345 @ 1620000000

阶段四：查询与告警

# 查询5分钟内的请求速率
rate(http_requests_total[5m])

# 告警规则
groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status="5xx"}[5m]) > 0.1
    for: 5m

3.3 Pull模式的优势

优势一：灵活性与自管理

无需Agent部署：

目标服务只需暴露标准HTTP端点
无需在每个节点部署监控代理
降低运维复杂度

标准接口：

# 任何服务都可以暴露/metrics端点
curl http://localhost:8080/metrics
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
# http_requests_total{method="GET",path="/"} 1234

优势二：原生服务发现

动态环境适配：

# 自动发现Kubernetes Pod
kubernetes_sd_configs:
  - role: pod

# 自动发现Consul服务
consul_sd_configs:
  - server: 'consul:8500'

自动适应扩缩容：

新Pod创建后自动加入监控
Pod删除后自动停止采集
无需人工干预

优势三：集中控制

统一配置管理：

采集频率集中配置
目标列表统一管理
便于审计和变更追踪

灵活的采集策略：

scrape_configs:
  - job_name: 'critical-services'
    scrape_interval: 10s  # 高频采集
    scrape_timeout: 5s
  
  - job_name: 'background-services'
    scrape_interval: 60s  # 低频采集
    scrape_timeout: 30s

优势四：故障检测

快速发现不可达：

采集超时立即标记目标异常
无需等待推送超时
支持健康检查集成

状态追踪：

# 查看目标状态
curl http://localhost:9090/api/v1/targets

# 输出示例
{
  "status": "success",
  "data": {
    "activeTargets": [...],
    "droppedTargets": [...]
  }
}

四、Pull模式的局限性与解决方案

4.1 网络穿透问题

问题描述：

Prometheus需要访问目标的/metrics端点
防火墙或网络隔离环境下可能受限

解决方案：

方案一：反向代理

# 使用Ingress暴露Exporter
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: node-exporter
spec:
  rules:
  - host: exporter.example.com
    http:
      paths:
      - path: /node1
        pathType: Prefix
        backend:
          service:
            name: node-exporter-1
            port:
              number: 9100

方案二：Sidecar模式

# Sidecar代理采集
containers:
- name: app
  image: myapp:latest
- name: metrics-proxy
  image: nginx:latest
  ports:
  - containerPort: 9100

4.2 短生命周期任务

问题描述：

批处理任务、临时脚本存在时间短
Prometheus可能来不及采集

解决方案：Pushgateway

┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Server                          │
│                           │                                   │
│                           ▼                                   │
│                     Pushgateway                               │
│  ┌─────────────────────────────────────────────────────┐       │
│  │  临时存储区：接收短任务推送的指标                    │       │
│  └─────────────────────────────────────────────────────┘       │
│                           ▲                                   │
└───────────────────────────┼───────────────────────────────────┘
                            │ HTTP POST /metrics/job/{job}
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
    ┌───────┐          ┌───────┐          ┌───────┐
    │任务1  │          │任务2  │          │任务3  │
    │(10s)  │          │(5s)   │          │(15s)  │
    └───────┘          └───────┘          └───────┘

Pushgateway配置

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    scrape_interval: 10s
    static_configs:
      - targets: ['pushgateway:9091']

推送命令

# 方式一：curl推送
echo "batch_job_duration_seconds 12.5" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/batch_job

# 方式二：promtool推送
echo "# TYPE batch_job_duration_seconds gauge
batch_job_duration_seconds 12.5" > /tmp/metrics.txt
promtool tsdb create-blocks-from openmetrics /tmp/metrics.txt --output-dir=/tmp/blocks

# 方式三：客户端库推送（Go示例）
import "github.com/prometheus/client_golang/prometheus"
import "github.com/prometheus/client_golang/prometheus/push"

func main() {
    duration := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "batch_job_duration_seconds",
        Help: "Duration of batch job",
    })
    duration.Set(12.5)
    
    err := push.New("http://pushgateway:9091", "batch_job").
        Collector(duration).
        Push()
}

4.3 网络依赖问题

问题描述：

网络延迟直接影响数据采集
需要稳定的网络连接

解决方案：

方案一：调整超时配置

scrape_configs:
  - job_name: 'remote-services'
    scrape_interval: 30s
    scrape_timeout: 20s  # 增加超时时间
    scheme: https
    tls_config:
      insecure_skip_verify: false

方案二：本地Agent聚合

# 使用Prometheus Agent模式
remote_write:
  - url: 'http://central-prometheus:9090/api/v1/write'

五、Push模式详解（对比视角）

5.1 Zabbix Push模式架构

┌─────────────────────────────────────────────────────────────────┐
│                     Zabbix Server                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   Receiver   │←│   Storage    │  │    Alerting      │   │
│  │  (接收模块)   │  │  (数据库)    │  │   (告警引擎)    │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└───────────────────────────┬───────────────────────────────────┘
                            │ 等待Agent推送
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Zabbix     │    │ Zabbix     │    │ Zabbix     │
│ Agent      │    │ Agent      │    │ Agent      │
│ (主动推送) │    │ (主动推送) │    │ (主动推送) │
└─────────────┘    └─────────────┘    └─────────────┘

5.2 Push模式工作流程

Agent在目标主机上采集指标
Agent按照配置的间隔主动推送数据
Server接收并存储数据
Server执行告警规则

5.3 Push模式的优势

优势一：跨网络穿透

无需反向连接：

Agent主动连接Server
无需在防火墙上开放入站端口
适合跨地域、跨网络监控

优势二：离线缓存

网络不稳定时缓存数据：

Zabbix Agent配置示例
BufferSize=100
BufferSend=5

优势三：全面协议支持

支持多种监控协议：

SNMP
IPMI
JMX
SSH/Telnet

六、Pull vs Push模式对比

6.1 核心特性对比

特性	Pull模式（Prometheus）	Push模式（Zabbix）
数据流向	Server → Target	Agent → Server
网络要求	Server能访问Target	Agent能访问Server
Agent要求	无需Agent，Exporter可选	必须部署Agent
服务发现	原生支持多种发现机制	有限支持
数据格式	开放标准（OpenMetrics）	私有协议
故障检测	主动探测，快速发现	被动等待，延迟发现
离线缓存	无（需Pushgateway）	Agent内置缓存
配置复杂度	集中配置，简单	分布式配置，复杂
扩展性	水平扩展容易	扩展需要Proxy

6.2 适用场景对比

场景类型	Pull模式	Push模式
云原生环境	适合	不太适合
微服务架构	适合	不太适合
动态扩缩容	适合	不太适合
跨网络监控	不太适合	适合
IoT设备	不太适合	适合
传统IT环境	不太适合	适合
批处理任务	需要Pushgateway	适合

6.3 性能对比

维度	Pull模式	Push模式
Server负载	主动发起请求	被动接收请求
网络流量	可控（配置采集间隔）	不可控（Agent决定）
单点故障	Server故障不影响Target	Server故障影响数据接收
数据一致性	由Server控制	由Agent控制

七、生产环境最佳实践

7.1 标准Pull模式配置

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production

scrape_configs:
  # 监控Prometheus自身
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # 监控节点
  - job_name: 'node-exporter'
    scrape_interval: 10s
    scrape_timeout: 5s
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        target_label: node
  
  # 监控Pod
  - job_name: 'kubernetes-pods'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: 'true'
        action: keep
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        replacement: '${1}'

7.2 Pushgateway最佳实践

配置建议

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    scrape_interval: 10s
    static_configs:
      - targets: ['pushgateway:9091']

数据管理

# 清理过期数据
curl -X DELETE http://pushgateway:9091/api/v1/admin/wipe

# 删除特定job的数据
curl -X DELETE http://pushgateway:9091/metrics/job/batch_job

使用原则

仅用于短生命周期任务
设置合理的清理策略
使用honor_labels保留原始标签
避免存储长期指标

7.3 Remote Write扩展

remote_write:
  - url: 'http://thanos-receive:19291/api/v1/write'
    remote_timeout: 30s
    queue_config:
      capacity: 2000
      max_samples_per_send: 500
      batch_send_deadline: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'job_duration_seconds.*'
        action: keep

remote_read:
  - url: 'http://thanos-query:10902/api/v1/read'
    read_recent: true

7.4 高可用配置

# 多副本Prometheus
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  clusterIP: None
  selector:
    app: prometheus
  ports:
  - port: 9090

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  replicas: 2
  serviceName: prometheus
  template:
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: data
          mountPath: /prometheus
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

7.5 安全配置

# 启用TLS
scrape_configs:
  - job_name: 'secure-services'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/tls/ca.crt
      cert_file: /etc/prometheus/tls/client.crt
      key_file: /etc/prometheus/tls/client.key
      insecure_skip_verify: false

# 配置基本认证
  - job_name: 'auth-services'
    basic_auth:
      username: admin
      password: secret

八、监控系统选型指南

8.1 选择Prometheus的场景

场景	说明
Kubernetes环境	原生支持K8s服务发现
微服务架构	灵活的服务发现和动态配置
云原生应用	与CNCF生态深度集成
需要自定义指标	丰富的客户端库支持
大规模监控	支持水平扩展和联邦部署

8.2 选择Zabbix的场景

场景	说明
跨网络监控	推送模式更容易穿透防火墙
传统IT环境	支持SNMP、IPMI等传统协议
IoT设备监控	Agent模式更适合边缘设备
需要全面功能	内置告警、报表等完整功能
运维团队熟悉	成熟的企业级解决方案

8.3 混合架构建议

┌─────────────────────────────────────────────────────────────┐
│                     统一监控平台                            │
│                    (Grafana + Alertmanager)                │
└───────────────────────────────┬─────────────────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Prometheus    │       │ Zabbix        │       │ Pushgateway   │
│ (云原生服务)  │       │ (传统设备)    │       │ (批处理任务)  │
└───────────────┘       └───────────────┘       └───────────────┘

九、常见问题与解决方案

问题一：目标不可达

现象：

kubectl logs prometheus-0
level=warn ts=... caller=scrape.go:1327 component="scrape manager" scrape_pool=node-exporter target=http://node1:9100/metrics msg="Error on ingesting samples" err="context deadline exceeded"

原因：

网络不通
端口未开放
Exporter未运行

解决方案：

# 检查网络连通性
telnet node1 9100

# 检查Exporter状态
curl http://node1:9100/metrics

# 检查防火墙规则
iptables -L | grep 9100

问题二：指标缺失

现象：

promql> http_requests_total
# No data returned

原因：

Exporter配置错误
标签不匹配
采集间隔过长

解决方案：

# 检查Exporter输出
curl http://localhost:8080/metrics | grep http_requests

# 检查Prometheus配置
kubectl get configmap prometheus-config -o yaml

# 验证采集配置
curl http://prometheus:9090/api/v1/targets

问题三：Pushgateway数据过期

现象：

# 任务已完成但指标仍存在
promql> batch_job_duration_seconds
# 返回已完成任务的指标

原因：

任务完成后未清理数据
Pushgateway无自动清理机制

解决方案：

# 任务完成后主动清理
curl -X DELETE http://pushgateway:9091/metrics/job/batch_job

# 使用grouping标签管理
echo "batch_job_duration_seconds{instance=\"job1\"} 12.5" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/batch_job/instance/job1

问题四：数据延迟

现象：

# 查询最新数据存在延迟
promql> http_requests_total
# 返回的数据落后于实际时间

原因：

采集间隔过长
网络延迟
TSDB写入延迟

解决方案：

# 调整采集间隔
scrape_configs:
  - job_name: 'critical'
    scrape_interval: 5s
    scrape_timeout: 2s

十、总结

核心要点

Prometheus默认使用Pull模式，通过主动拉取目标的/metrics端点获取数据
Pull模式的优势：灵活性、服务发现集成、集中控制、快速故障检测
Pull模式的局限性：网络穿透困难、短任务采集需要Pushgateway
Push模式的适用场景：跨网络监控、传统IT环境、IoT设备
生产环境建议：标准场景使用Pull模式，短任务使用Pushgateway，跨网络场景可考虑Zabbix或混合架构

选型建议

环境类型	推荐方案
云原生/Kubernetes	Prometheus + Pull模式
传统IT/跨网络	Zabbix + Push模式
混合环境	Prometheus + Zabbix集成
批处理任务	Prometheus + Pushgateway

本文对应的面试题：Prometheus获取数据用的是pull还是push？

«Prometheus Pull与Push模式应用场景深度解析

StatefulSet四大特性深度解析：有状态应用管理最佳实践»

文档信息

本文作者：soveran zhong
本文链接：https://blog.clockwingsoar.cn/2026/06/08/prometheus-pull-push/
版权声明：自由转载-非商用-非衍生-保持署名（创意共享3.0许可证）

Prometheus Pull vs Push模式深度对比：监控架构设计指南

一、引言

二、SCQA分析框架

情境（Situation）

冲突（Complication）

问题（Question）

答案（Answer）

三、Prometheus Pull模式详解

3.1 核心架构

3.2 工作流程

阶段一：服务发现

阶段二：指标采集

阶段三：数据存储

阶段四：查询与告警

3.3 Pull模式的优势

优势一：灵活性与自管理

优势二：原生服务发现

优势三：集中控制

优势四：故障检测

四、Pull模式的局限性与解决方案

4.1 网络穿透问题

方案一：反向代理

方案二：Sidecar模式

4.2 短生命周期任务

Pushgateway配置

推送命令

4.3 网络依赖问题

方案一：调整超时配置

方案二：本地Agent聚合

五、Push模式详解（对比视角）

5.1 Zabbix Push模式架构

5.2 Push模式工作流程

5.3 Push模式的优势

优势一：跨网络穿透

优势二：离线缓存

优势三：全面协议支持

六、Pull vs Push模式对比

6.1 核心特性对比

6.2 适用场景对比

6.3 性能对比

七、生产环境最佳实践

7.1 标准Pull模式配置

7.2 Pushgateway最佳实践

配置建议

数据管理

使用原则

7.3 Remote Write扩展

7.4 高可用配置

7.5 安全配置

八、监控系统选型指南

8.1 选择Prometheus的场景

8.2 选择Zabbix的场景

8.3 混合架构建议

九、常见问题与解决方案

问题一：目标不可达

问题二：指标缺失

问题三：Pushgateway数据过期

问题四：数据延迟

十、总结

核心要点

选型建议

文档信息

Search

Table of Contents