K8S监控存储方案：从本地到Thanos全解析

情境与背景

监控数据是企业运维的重要资产，数据量大、增长快、查询频繁是监控存储的主要特点。作为高级DevOps/SRE工程师，需要深入理解各种监控存储方案的优缺点，选择适合业务场景的方案。本文从实战角度详细讲解K8S监控存储的完整方案。

一、监控存储需求分析

1.1 数据特点

监控数据特征：

# 监控数据特征
data_characteristics:
  volume:
    per_pod: "约100MB/天"
    cluster_100_pods: "约10GB/天"
    retention_90d: "约900GB"
    
  velocity:
    scrape_interval: "15秒"
    samples_per_minute: "4"
    per_metric_per_hour: "240"
    
  variety:
    metrics: "基础设施+应用+业务"
    cardinality: "高基数问题"

1.2 存储需求

存储需求分类：

需求类型	说明	时间范围
热数据	高频查询	0-2天
温数据	中频查询	2-30天
冷数据	低频查询	30-90天

二、存储方案对比

2.1 本地存储

HostPath/EmptyDir：

# HostPath存储配置
apiVersion: v1
kind: Pod
metadata:
  name: prometheus
spec:
  containers:
    - name: prometheus
      volumeMounts:
        - name: prometheus-storage
          mountPath: /data
  volumes:
    - name: prometheus-storage
      hostPath:
        path: /data/prometheus
        type: DirectoryOrCreate

适用场景：

开发/测试环境
单节点集群
临时监控需求

2.2 PV/PVC存储

持久化存储配置：

# PVC配置
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast
---
# Prometheus配置使用PVC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  template:
    spec:
      containers:
        - name: prometheus
          volumeMounts:
            - name: prometheus-data
              mountPath: /data
      volumes:
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-data

2.3 Thanos存储架构

Thanos架构：

flowchart TB
    A["Prometheus"] --> B["Thanos Sidecar"]
    B --> C["对象存储"]
    C --> D["Thanos Store"]
    D --> E["Thanos Query"]
    F["Thanos Query"] --> G["Grafana"]
    
    C --> H["S3/MinIO"]
    
    style A fill:#e3f2fd
    style C fill:#c8e6c9
    style H fill:#fff3e0

Thanos组件：

组件	功能	说明
Sidecar	数据上传	将Prometheus数据上传到对象存储
Store	数据查询	从对象存储读取历史数据
Query	统一查询	聚合多个数据源的查询
Receive	接收数据	接收远程写入的数据
Rule	告警规则	分布式告警

2.4 存储方案对比表

方案对比：

方案	存储容量	查询性能	成本	适用场景
本地存储	受限单节点	最高	低	开发测试
PV/PVC	可扩展	高	中	中小规模
Thanos+S3	无限扩展	中	低	大规模生产
InfluxDB	可扩展	高	高	时序数据专用

三、Thanos部署配置

3.1 Prometheus配置

Prometheus配置：

# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'prod'
    replica: '$(HOSTNAME)'

storage:
  tsdb:
    path: /data
    retention.time: 15d

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.monitoring:9093

3.2 Thanos Sidecar配置

Sidecar配置：

# Thanos Sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/data'
            - '--storage.tsdb.retention.time=15d'
            - '--web.enable-lifecycle'
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /data
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - sidecar
            - '--prometheus.url=http://localhost:9090'
            - '--objstore.config-file=/etc/thanos/object-storage.yaml'
            - '--tsdb.path=/data'
          volumeMounts:
            - name: data
              mountPath: /data
            - name: object-storage
              mountPath: /etc/thanos
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
        - name: object-storage
          secret:
            secretName: thanos-object-storage

3.3 对象存储配置

MinIO配置：

# object-storage.yaml
type: S3
config:
  bucket: "thanos"
  endpoint: "minio.monitoring:9000"
  access_key: "minioadmin"
  secret_key: "minioadmin"
  insecure: false
  signature_version2: false

MinIO部署：

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: monitoring
spec:
  serviceName: minio
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args:
            - server
            - http://minio-{0...3}.minio.monitoring:9000/data
            - --console-address
            - ":9001"
          env:
            - name: MINIO_ROOT_USER
              value: "minioadmin"
            - name: MINIO_ROOT_PASSWORD
              value: "minioadmin"
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi

3.4 Thanos Query配置

Query配置：

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - query
            - '--store=prometheus:10901'
            - '--store=thanos-store:10901'
            - '--grpc-grpc-server.tls-enabled=false'
          ports:
            - containerPort: 10901
            - containerPort: 10902
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi

四、数据生命周期管理

4.1 降采样配置

降采样策略：

# 降采样配置
downsampling:
  enabled: true
  
  rules:
    - name: "5m"
      duration: "5m"  # 保留5分钟原始数据
      resolution: "raw"
      
    - name: "1h"
      duration: "90d"  # 90天后降为1小时
      resolution: "5m"
      
    - name: "1d"
      duration: "365d"  # 1年后降为1天
      resolution: "1h"

4.2 数据压缩

压缩配置：

# Thanos Compactor
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - compact
            - '--data-dir=/data'
            - '--objstore.config-file=/etc/thanos/object-storage.yaml'
            - '--wait'
            - '--downsampling.disable'
          volumeMounts:
            - name: data
              mountPath: /data
            - name: object-storage
              mountPath: /etc/thanos
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2000m
              memory: 2Gi

五、最佳实践

5.1 存储容量规划

容量计算公式：

# 容量计算
capacity_calculation:
  # 单个指标每天大小
  per_metric_daily: "约100bytes * 4 samples/min * 60 min * 24h = 5.76MB"
  
  # 指标数量估算
  metrics_per_pod: 500
  pods: 100
  
  # 存储需求
  daily_total: "5.76MB * 500 * 100 = 288GB/天"
  
  # 90天存储
  retention_90d: "288GB * 90 = 25.9TB"
  
  # 预留余量
  buffer: 1.3
  total_storage: "25.9TB * 1.3 = 33.7TB"

5.2 性能优化

性能优化配置：

# Prometheus性能优化
prometheus:
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
      
  storage:
    tsdb:
      path: "/data"
      retention:
        time: "15d"
        
  query:
    max_samples: 10000000
    timeout: "2m"

5.3 监控告警

存储监控配置：

# Prometheus告警规则
groups:
  - name: monitoring-storage-alerts
    rules:
      - alert: PrometheusStorageRunningOut
        expr: |
          (prometheus_tsdb_storage_blocks_bytes / 1024 / 1024 / 1024) < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus存储空间不足"
          
      - alert: ThanosObjectStorageLatency
        expr: |
          thanos_objstore_bucket_operation_duration_seconds > 2
        for: 5m
        labels:
          severity: warning

六、故障排查

6.1 常见问题

问题与解决方案：

问题	原因	解决
存储满	数据保留过长	调整保留策略
查询慢	数据量大	启用降采样
上传失败	网络问题	检查网络
数据丢失	PV问题	启用备份

6.2 排查命令

排查方法：

# 检查存储使用
kubectl exec -it prometheus-0 -n monitoring -- df -h /data

# 检查Prometheus存储
curl -s http://prometheus:9090/api/v1/status/tsdb | jq

# 检查Thanos状态
thanos tools bucket verify --objstore.config-file=/etc/thanos/object-storage.yaml

# 检查上传状态
curl -s http://thanos-sidecar:10902/metrics | grep thanos

七、面试1分钟精简版（直接背）

完整版：

监控存储我们采用Thanos加对象存储的方案。Prometheus本地存储监控数据，通过Sidecar将数据定期上传到S3兼容存储（MinIO），实现长期存储和高可用。热数据保留在本地SSD上保障查询性能，温数据存储在对象存储降低成本。这套方案支持90天数据存储，同时通过压缩和降采样优化存储成本。查询时通过Thanos Query进行统一查询，对应用透明。

30秒超短版：

Thanos加对象存储，本地存热数据，S3存温数据，压缩降采样优化成本。

八、总结

8.1 方案选择指南

场景	推荐方案
开发测试	本地存储
中小生产	PV/PVC
大规模生产	Thanos+S3
超大规模	Thanos+对象存储+CDN

8.2 配置原则

原则	说明
分层存储	热数据本地，温数据对象存储
容量规划	根据指标数量和保留期计算
性能优先	SSD优先，本地存储优先
成本优化	降采样+压缩

8.3 记忆口诀

开发用本地，生产用PV，
大规模用Thanos，冷热数据分层存，
压缩降采样，成本最优控。

参考链接：SRE运维面试题全解析：从理论到实践（第二部分）

«Pod Per Node Planning Best Practices

Microservice Image Management Best Practices»

文档信息

本文作者：soveran zhong
本文链接：https://blog.clockwingsoar.cn/2026/05/08/monitoring-storage-best-practices/
版权声明：自由转载-非商用-非衍生-保持署名（创意共享3.0许可证）

codezhong

Monitoring Storage Best Practices