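Kubernetes Custom HPA Explained: Production Best Practices from Metric Collection to Autoscaling

Context and Background

In production Kubernetes environments, the default CPU- and memory-based HPA policies often cannot meet the needs of complex workloads. For example, an API gateway should scale on QPS, message queue consumers should adjust their worker count to the queue length, and batch workers should scale with the depth of the task queue. A custom-metrics HPA scales directly on business metrics, which makes it a core capability for building elastic cloud-native architectures.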

1. Custom HPA Architecture Overview

1.1 Core Components

flowchart TB
    A["Business metrics produced"] --> B["Metric collection"]
    B --> C["Metric exposure"]
    C --> D["Prometheus Adapter"]
    D --> E["Custom Metrics API"]
    E --> F["HPA controller"]
    F --> G["Compute replica count"]
    G --> H["Update Deployment"]
    H --> I["Pods scaled out or in"]

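1.2 Workflow

Stage	Component	Responsibility
Metric collection	Prometheus / custom exporter	Collect business metrics
Metric translation	Prometheus Adapter	Translate Kubernetes metrics API requests into PromQL queries
Metric exposure	Custom Metrics API	Provide a standardized metric query interface
Scaling decision	HPA controller	Compute the target replica count from the metrics
Scaling execution	Deployment/StatefulSet	Adjust the number of Pod replicas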
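The scaling decision in the table above follows the formula documented for the HPA controller: desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue), and no change is made while the ratio stays within a tolerance (0.1 by default). The following is a minimal Python sketch of that rule. The function name and the sample numbers are illustrative assumptions, and the real controller additionally handles unready Pods and missing metrics.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Core HPA rule: ceil(currentReplicas * currentValue / targetValue)."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 4 Pods averaging 150 req/s against a target of 100.
print(desired_replicas(4, 150, 100))  # 6
```

A ratio of 1.05 would leave the 4 replicas untouched, which is why small fluctuations around the target do not cause scaling.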

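2. Custom Metric Types

2.1 Metric Categories

Type	Scope	Examples
Resource	Pod-level resource metrics	CPU, memory
Pods	Pod-level custom metrics, averaged across Pods	QPS, request latency
Object	A metric describing a single Kubernetes object	Requests per second on an Ingress
External	Metrics from systems outside the cluster	Queue length, message count, database connections, third-party service metrics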
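Each metric type in the table is served by a different aggregated API, which matters when checking a metric by hand with `kubectl get --raw`. The sketch below builds the request path for each type. The helper name and the example arguments are assumptions made for illustration; the path layouts are those of the `metrics.k8s.io`, `custom.metrics.k8s.io` and `external.metrics.k8s.io` v1beta1 APIs.

```python
def metric_api_path(metric_type: str, namespace: str, metric: str = "",
                    resource: str = "", name: str = "") -> str:
    """Return the aggregated-API path that serves a given HPA metric type."""
    base = f"namespaces/{namespace}"
    if metric_type == "Resource":
        return f"/apis/metrics.k8s.io/v1beta1/{base}/pods"
    if metric_type == "Pods":
        # "*" asks for the metric on every Pod in the namespace.
        return f"/apis/custom.metrics.k8s.io/v1beta1/{base}/pods/*/{metric}"
    if metric_type == "Object":
        return f"/apis/custom.metrics.k8s.io/v1beta1/{base}/{resource}/{name}/{metric}"
    if metric_type == "External":
        return f"/apis/external.metrics.k8s.io/v1beta1/{base}/{metric}"
    raise ValueError(f"unknown metric type: {metric_type}")


print(metric_api_path("Pods", "default", "requests_per_second"))
print(metric_api_path("External", "default", "kafka_queue_length"))
```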

2.2 Metric Sources

# Common metric sources
Prometheus + custom exporter
Application-level metrics (e.g. Spring Boot Actuator)
Message Queue (Kafka/RabbitMQ)
Database metrics
External API metrics
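Whatever the source, Prometheus scrapes metrics in its text exposition format, so a custom exporter ultimately has to produce lines like the ones below. This is a standard-library-only sketch of that output. The function name and the sample values are assumptions, and a real exporter would normally use a Prometheus client library instead of formatting strings by hand.

```python
def render_metric(name: str, metric_type: str, help_text: str, samples: list) -> str:
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical gauge for a queue named "orders".
print(render_metric("kafka_queue_length", "gauge", "Messages waiting",
                    [({"queue": "orders"}, 7200)]))
```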

3. Deploying and Configuring Prometheus Adapter

3.1 Installing Prometheus Adapter

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      containers:
      - name: adapter
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.9.0
        args:
        - --metrics-relist-interval=1m
        # The port is part of --prometheus-url; the adapter has no separate port flag
        - --prometheus-url=http://prometheus:9090/
        - --config=/etc/adapter/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/adapter
      volumes:
      - name: config
        configMap:
          name: adapter-config

3.2 Configuring Metric Rules

# adapter-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{kubernetes_namespace!="",kubernetes_pod_name!=""}'
      resources:
        overrides:
          kubernetes_namespace: {resource: "namespace"}
          kubernetes_pod_name: {resource: "pod"}
      name:
        matches: "^http_(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

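    # Metrics consumed by "type: External" HPAs are declared under externalRules
    externalRules:
    - seriesQuery: 'kafka_queue_length{queue!=""}'
      resources:
        namespaced: false
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'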
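Two parts of a rule are easy to get wrong: the `name` rewrite decides which metric name the HPA must reference, and the `metricsQuery` template is filled in per request. The sketch below imitates both steps in Python so the result can be checked before deploying. It is an approximation, not the adapter's code: the function names are assumptions, the adapter uses Go regular expressions, and the template is assumed to contain the `<<.LabelMatchers>>` placeholder that restricts the query to the requested namespace and Pods.

```python
import re


def exposed_name(series: str, matches: str, as_template: str) -> str:
    """Apply a rule's name.matches / name.as pair to a Prometheus series name."""
    # Go writes the first capture group as ${1}; Python's equivalent is \g<1>.
    return re.sub(matches, as_template.replace("${1}", r"\g<1>"), series)


def expand_query(template: str, series: str, label_matchers: str, group_by: str) -> str:
    """Fill the adapter's template placeholders with concrete values."""
    return (template.replace("<<.Series>>", series)
                    .replace("<<.LabelMatchers>>", label_matchers)
                    .replace("<<.GroupBy>>", group_by))


# The same series yields different API names depending on the pattern.
print(exposed_name("http_requests_total", "^(.*)_total$", "${1}_per_second"))
print(exposed_name("http_requests_total", "^http_(.*)_total$", "${1}_per_second"))

print(expand_query(
    "sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)",
    "http_requests_total",
    'kubernetes_namespace="default",kubernetes_pod_name=~"web-api-1|web-api-2"',
    "kubernetes_pod_name",
))
```

The first pattern produces `http_requests_per_second` and the second produces `requests_per_second`; whichever name the rule produces is the one the HPA `metric.name` field and the `kubectl get --raw` checks must use.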

3.3 Verifying Metrics

# List the available custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

# Query a specific metric
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/requests_per_second" | jq .
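The per-metric query returns a `MetricValueList` whose values are Kubernetes quantities, so a reading such as `1500m` means 1.5, not 1500. The sketch below parses such a response. The sample JSON is a hypothetical, trimmed response, the function names are assumptions, and only the `m`, `k`, `M` and `G` suffixes are handled.

```python
import json
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; binary suffixes (Ki, Mi, ...) are omitted.
SUFFIXES = {"m": Decimal("0.001"), "k": Decimal(10**3),
            "M": Decimal(10**6), "G": Decimal(10**9)}


def parse_quantity(quantity: str) -> Decimal:
    """Convert a quantity string such as '1500m' or '2' to a number."""
    if quantity and quantity[-1] in SUFFIXES:
        return Decimal(quantity[:-1]) * SUFFIXES[quantity[-1]]
    return Decimal(quantity)


def pod_values(metric_value_list_json: str) -> dict:
    """Map each Pod name to its metric value."""
    doc = json.loads(metric_value_list_json)
    return {item["describedObject"]["name"]: parse_quantity(item["value"])
            for item in doc["items"]}


# Hypothetical response body, reduced to the fields used here.
sample = json.dumps({"kind": "MetricValueList", "items": [
    {"describedObject": {"kind": "Pod", "name": "web-api-1"},
     "metricName": "requests_per_second", "value": "1500m"},
    {"describedObject": {"kind": "Pod", "name": "web-api-2"},
     "metricName": "requests_per_second", "value": "2"},
]})
for pod, value in pod_values(sample).items():
    print(pod, float(value))
```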

4. Custom HPA Configuration in Practice

4.1 Pod-Level Custom Metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

4.2 External Metrics (Queue Length)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: kafka_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: AverageValue
        averageValue: "500"
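With an External metric and an `AverageValue` target, the controller divides the metric total by the per-Pod target, so the manifest above aims for roughly 500 queued messages per consumer. The following sketch works through that arithmetic under the default 0.1 tolerance. The function name and the backlog figures are assumptions for illustration.

```python
import math


def external_average_value_replicas(total: float, target_average: float,
                                    current_replicas: int, min_replicas: int,
                                    max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica count for an External metric with an AverageValue target."""
    usage_ratio = total / (target_average * current_replicas)
    if abs(usage_ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target, no change
    else:
        desired = math.ceil(total / target_average)
    # The result is always clamped to the HPA's replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical backlog of 7200 messages with 3 consumers running.
print(external_average_value_replicas(7200, 500, 3, 3, 20))   # 15
# A backlog of 12000 would need 24 consumers but is capped at maxReplicas.
print(external_average_value_replicas(12000, 500, 3, 3, 20))  # 20
```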

4.3 Combining Multiple Metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mixed-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "200"
  - type: External
    external:
      metric:
        name: active_users
      target:
        type: AverageValue
        averageValue: "1000"
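When several metrics are listed, the HPA computes a replica proposal for each one and applies the largest, so any single metric can trigger a scale-out. The sketch below shows that selection for the three metrics in the manifest above. The per-Pod readings are hypothetical, the function name is an assumption, and the tolerance check is left out to keep the example short.

```python
import math


def combine(current_replicas: int, metrics: dict,
            min_replicas: int, max_replicas: int) -> tuple:
    """metrics maps a metric name to (current per-Pod average, target average)."""
    proposals = {name: math.ceil(current_replicas * current / target)
                 for name, (current, target) in metrics.items()}
    winner = max(proposals, key=proposals.get)
    desired = max(min_replicas, min(max_replicas, proposals[winner]))
    return winner, desired


# Hypothetical readings with 4 replicas running.
readings = {
    "cpu": (60, 70),                    # percent utilization
    "requests_per_second": (450, 200),
    "active_users": (1250, 1000),
}
print(combine(4, readings, 2, 15))  # ('requests_per_second', 9)
```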

5. Advanced Configuration

5.1 Controlling Scaling Behavior

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: controlled-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
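The `policies` and `selectPolicy` fields cap how far one scaling step may move the replica count within a period. The sketch below evaluates the limits from the manifest above. It is a simplification that assumes no other scaling event happened during the period, and the function names are assumptions.

```python
import math


def scale_up_limit(start_replicas: int, policies: list, select_policy: str = "Max") -> int:
    """Highest replica count a scale-up may reach within one period."""
    limits = []
    for policy_type, value in policies:
        if policy_type == "Percent":
            limits.append(math.ceil(start_replicas * (1 + value / 100)))
        else:  # "Pods"
            limits.append(start_replicas + value)
    return max(limits) if select_policy == "Max" else min(limits)


def scale_down_limit(start_replicas: int, policies: list, select_policy: str = "Max") -> int:
    """Lowest replica count a scale-down may reach within one period."""
    limits = []
    for policy_type, value in policies:
        if policy_type == "Percent":
            limits.append(math.floor(start_replicas * (1 - value / 100)))
        else:  # "Pods"
            limits.append(start_replicas - value)
    # "Max" allows the largest change (lowest floor); "Min" the smallest change.
    return min(limits) if select_policy == "Max" else max(limits)


policies = [("Percent", 100), ("Pods", 4)]
print(scale_up_limit(2, policies, "Max"))    # 6: adding 4 Pods beats doubling
print(scale_up_limit(10, policies, "Max"))   # 20: doubling beats adding 4 Pods
print(scale_down_limit(10, [("Percent", 100), ("Pods", 2)], "Min"))  # 8
```

With `selectPolicy: Min` on scale-down, the 100 percent policy never takes effect while the Pods policy is more restrictive, so the workload shrinks by at most 2 Pods per 60 seconds.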

5.2 Multiple Metrics (the Largest Recommendation Wins)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: priority-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: error_rate
      target:
        type: AverageValue
        averageValue: "0.01"

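6. Production Best Practices

6.1 Choosing Metrics

Scenario	Recommended metric	Target value
API gateway	QPS	100-500 req/s per Pod
Message queue consumer	Queue length	< 500 messages
Batch worker	Task queue depth	< 100 tasks
Database connection pool	Connection utilization	< 80%
Cache service	Hit ratio	> 95%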

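6.2 Tuning Scaling Parameters

Parameter	Suggested value	Notes
minReplicas	2-3	Guarantees minimum availability
maxReplicas	10-50	Adjust to cluster capacity
scaleUp.stabilizationWindowSeconds	30-60	Avoids frequent scale-ups (the default is 0)
scaleDown.stabilizationWindowSeconds	300-600	Avoids premature scale-downs (the default is 300)
scaleUp.policies (type: Percent)	50-100	Maximum percentage growth per period
scaleUp.policies (type: Pods)	2-5	Maximum number of Pods added per period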
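The two stabilization windows in the table work on a history of past recommendations: a scale-down uses the highest recommendation seen inside the scale-down window, and a scale-up uses the lowest one inside the scale-up window. The sketch below models that rule. The function name, timestamps and replica counts are illustrative assumptions, and the history is assumed to include the newest recommendation.

```python
def stabilize(current_replicas: int, history: list, now: int,
              up_window: int, down_window: int) -> int:
    """history holds (timestamp_seconds, recommended_replicas) pairs."""
    up = min(r for t, r in history if now - t <= up_window)
    down = max(r for t, r in history if now - t <= down_window)
    result = current_replicas
    if result < up:
        result = up      # scale up only to the lowest recent recommendation
    if result > down:
        result = down    # scale down only to the highest recent recommendation
    return result


# Load drops quickly, but older recommendations are still inside the window.
history = [(0, 10), (100, 8), (200, 4), (290, 3)]
print(stabilize(10, history, now=290, up_window=0, down_window=300))  # 10

# Once the older recommendations have aged out, the scale-down goes through.
history.append((700, 3))
print(stabilize(10, history, now=700, up_window=0, down_window=300))  # 3
```

A longer scale-down window therefore delays shrinking after a traffic spike, while a non-zero scale-up window delays reacting to a sudden increase.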

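6.3 Monitoring and Alerting

# Prometheus alerts for HPA (metric names as exposed by kube-state-metrics)
groups:
- name: k8s_hpa_alerts
  rules:
  - alert: HpaMaxedOut
    expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_status_desired_replicas and kube_horizontalpodautoscaler_status_desired_replicas == kube_horizontalpodautoscaler_spec_max_replicas
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has reached its maximum replica count"

  - alert: HpaNotReady
    expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is in an abnormal state"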

6.4 Deployment Checklist

# 1. Verify that Prometheus Adapter is running
kubectl get pods -n monitoring -l app=prometheus-adapter

# 2. Verify that custom metrics are available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

# 3. Check HPA status
kubectl get hpa

# 4. Inspect the detailed HPA status
kubectl describe hpa web-api-hpa

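7. Troubleshooting Common Problems

7.1 HPA Cannot Fetch Metrics

# Check whether the metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/requests_per_second"

# Check the Prometheus Adapter logs
kubectl logs -n monitoring deploy/prometheus-adapter

# Check the Prometheus query (quote the URL so the shell does not interpret "?")
curl "http://prometheus:9090/api/v1/query?query=http_requests_total"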

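7.2 HPA Does Not Scale

# Check HPA status
kubectl describe hpa my-hpa

# Common causes:
# 1. The metric value has not crossed the target (changes within the default 10% tolerance are ignored)
# 2. The minReplicas/maxReplicas limit has already been reached
# 3. Metric collection is failing
# 4. The stabilization window has not elapsed yet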
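The causes listed above map onto the `status.conditions` that the HPA reports, which can be read from `kubectl get hpa my-hpa -o json`. The sketch below turns those conditions into hints. The function name, the hint wording and the sample status are assumptions; the condition types `AbleToScale`, `ScalingActive` and `ScalingLimited` are the ones defined by the autoscaling/v2 API.

```python
def diagnose(hpa: dict) -> list:
    """Turn HPA status conditions into human-readable hints."""
    findings = []
    status = hpa.get("status", {})
    for cond in status.get("conditions", []):
        ctype, cstatus = cond.get("type"), cond.get("status")
        reason = cond.get("reason", "")
        if ctype == "ScalingActive" and cstatus == "False":
            findings.append(f"metrics unavailable ({reason})")
        elif ctype == "AbleToScale" and cstatus == "False":
            findings.append(f"cannot update the scale target ({reason})")
        elif ctype == "ScalingLimited" and cstatus == "True":
            findings.append(f"clamped by minReplicas/maxReplicas ({reason})")
    if not findings and status.get("currentReplicas") == status.get("desiredReplicas"):
        findings.append("metric within tolerance of the target, or stabilization window active")
    return findings


# Hypothetical status of an HPA whose custom metric cannot be fetched.
sample_hpa = {"status": {"currentReplicas": 2, "desiredReplicas": 2, "conditions": [
    {"type": "AbleToScale", "status": "True", "reason": "SucceededGetScale"},
    {"type": "ScalingActive", "status": "False", "reason": "FailedGetPodsMetric"},
]}}
print(diagnose(sample_hpa))
```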

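8. Interview Cheat Sheet

8.1 One-Minute Version

The custom HPA flow has four core stages: 1) business metrics are produced (such as QPS or queue length); 2) the metrics are collected (by a monitoring system such as Prometheus); 3) the metrics are translated (Prometheus Adapter serves them through the Kubernetes metrics APIs); 4) the HPA controller computes the target replica count from the metrics and scales the workload. The key components are Prometheus for collection, Prometheus Adapter for API translation, and the HPA controller for decisions.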

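8.2 Mnemonic

Custom HPA is driven by metrics,
Prometheus collects, the Adapter translates to the API,
HPA decides, replica counts adjust dynamically,
Business metrics give precise control and smarter elasticity.

8.3 Keyword Quick Reference

Keyword	Description
Custom Metrics API	API that serves custom metrics
Prometheus Adapter	Adapter that translates metrics
Pods metrics	Pod-level custom metrics
External metrics	Metrics from external systems
stabilizationWindowSeconds	Stabilization window: how far back the HPA looks at past recommendations before changing the replica count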

Reference: SRE Operations Interview Questions Fully Explained: From Theory to Practice (Part 3)


Document Information

Author: soveran zhong
Link: https://blog.clockwingsoar.cn/2026/05/11/k8s-custom-hpa-best-practices/
License: Free to share with attribution, non-commercial, no derivatives (Creative Commons 3.0 license)
