K8s Node Scheduling Strategies Best Practices

2026/05/11

Kubernetes Node Scheduling Strategies in Depth: From Principles to Production Best Practices

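Context and Background

In a production Kubernetes environment, Pod scheduling is central to running a cluster efficiently. Kubernetes offers a range of scheduling controls, from the simple NodeSelector to affinity rules and preemption. How these are configured directly affects resource utilization, application performance, and service availability. For senior DevOps/SRE engineers, a solid understanding of how the scheduler works and when each strategy applies is essential to building a production-grade cluster.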

1. How the Scheduler Works

1.1 The Three-Stage Scheduling Flow

flowchart TB
    A["Pod creation request"] --> B["API Server receives it"]
    B --> C["Scheduler watches for unscheduled Pods"]
    C --> D["Filter stage"]
    D --> E["Score stage (Prioritize)"]
    E --> F["Bind stage"]
    F --> G["Pod scheduled to target node"]

    style D fill:#e3f2fd
    style E fill:#c8e6c9
    style F fill:#fff3e0
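
| Stage | Responsibility | Description |
| --- | --- | --- |
| Filter | Select feasible nodes | Excludes nodes with insufficient resources or unmet constraints |
| Score | Score the candidate nodes | Computes a score per node from several scoring strategies |
| Bind | Pick the highest-scoring node | Binds the Pod to the target node |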

1.2 Scheduler Startup Configuration

# Example kube-scheduler configuration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
leaderElection:
  leaderElect: true

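2. Basic Scheduling Strategies

2.1 NodeName - Direct Node Assignment

NodeName is the simplest placement mechanism: it names the node the Pod must run on, so the scheduler's filter and score stages are skipped entirely.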

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeName: node-001   # pin the Pod directly to this node
  containers:
  - name: nginx
    image: nginx:1.21
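
| Pros | Cons | Use cases |
| --- | --- | --- |
| Simple and direct | The Pod is never rescheduled to another node automatically | Static Pods, pinning a specific application |
| Fast (no scheduling pass) | Inflexible | Deploying system components |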

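2.2 NodeSelector - Label-Based Selection

NodeSelector restricts scheduling to nodes whose labels match every listed key-value pair. It is the most commonly used basic strategy.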

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd           # only nodes labeled disktype=ssd
    zone: cn-beijing-1      # only nodes in this availability zone
  containers:
  - name: nginx
    image: nginx:1.21
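
# Label the node
kubectl label nodes node-001 disktype=ssd
kubectl label nodes node-001 zone=cn-beijing-1

| Pros | Cons | Use cases |
| --- | --- | --- |
| Easy to use | Hard constraints only | Basic resource selection |
| Flexible labels | No support for complex logic | Distinguishing node types |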

2.3 Taints & Tolerations

A taint makes a node repel Pods that do not tolerate it; a toleration allows a Pod to run on a node that carries a matching taint.

flowchart TB
    A["Taint"] --> B["Key"]
    A --> C["Value"]
    A --> D["Effect"]

    D --> E["NoSchedule"]
    D --> F["PreferNoSchedule"]
    D --> G["NoExecute"]

    H["Toleration"] --> I["Matches the taint"]
    I --> J["Pod can be scheduled"]

# Taint the node
kubectl taint nodes node-001 dedicated=database:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: db-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
  containers:
  - name: db
    image: mysql:8.0
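
| Effect | Description | Use case |
| --- | --- | --- |
| NoSchedule | New Pods without a matching toleration are not scheduled onto the node | Node isolation |
| PreferNoSchedule | The scheduler tries to avoid the node but may still use it | Soft isolation |
| NoExecute | New Pods are not scheduled, and running Pods without a matching toleration are evicted | Faulty nodes |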
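
To illustrate how NoExecute interacts with tolerations, the sketch below shows a Pod that tolerates the built-in `node.kubernetes.io/unreachable` taint for a bounded time. The Pod name and the 60-second window are illustrative assumptions, not values from this article. `tolerationSeconds` only applies to NoExecute tolerations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-web        # hypothetical name
spec:
  tolerations:
  # Stay bound to an unreachable node for up to 60s, then be evicted.
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"      # match the taint regardless of its value
    effect: "NoExecute"
    tolerationSeconds: 60   # assumed window; omit to tolerate indefinitely
  containers:
  - name: web
    image: nginx:1.21
```

A shorter window moves the Pod off a failed node sooner, at the cost of more churn during brief network interruptions.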

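3. Advanced Scheduling Strategies

3.1 NodeAffinity - Node Affinity

NodeAffinity supports richer node-matching expressions than NodeSelector, with both hard constraints (required) and soft constraints (preferred).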

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx:1.21
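
| Operator | Description |
| --- | --- |
| In | The label value is one of the listed values |
| NotIn | The label value is not one of the listed values |
| Exists | The label key exists |
| DoesNotExist | The label key does not exist |
| Gt | The label value, parsed as an integer, is greater than the given value |
| Lt | The label value, parsed as an integer, is less than the given value |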
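
The operators other than In can be combined in a single term, as in the sketch below. The Pod name, the excluded zone `cn-beijing-3`, and the `example.com/cpu-cores` label are hypothetical and assume the nodes have been labelled accordingly.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-operators-demo   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Expressions inside one term are ANDed; separate terms are ORed.
        - matchExpressions:
          - key: disktype
            operator: Exists             # any value of the label is accepted
          - key: zone
            operator: NotIn
            values:
            - cn-beijing-3
          - key: example.com/cpu-cores   # hypothetical integer-valued label
            operator: Gt
            values:
            - "8"                        # exactly one value, quoted as a string
  containers:
  - name: nginx
    image: nginx:1.21
```

Exists and DoesNotExist take no `values` list, while Gt and Lt take exactly one value.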

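3.2 PodAffinity/PodAntiAffinity - Inter-Pod Affinity

PodAffinity places a Pod in the same topology domain (node, zone, and so on, as selected by topologyKey) as Pods that match a label selector. PodAntiAffinity does the opposite and keeps the Pod away from them.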

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - web
          topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx:1.21
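
| Scenario | Recommended strategy | Description |
| --- | --- | --- |
| Web + cache | PodAffinity | Keep web Pods in the same AZ as Redis |
| Multi-replica web | PodAntiAffinity | Spread replicas across different nodes |
| DB primary/replica | PodAntiAffinity | Must run on separate nodes |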
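
The example above uses a soft anti-affinity rule for web replicas. For the database primary/replica case, where separation is mandatory, a hard rule is the natural fit. The following is a minimal sketch: the Pod name and the `app: mysql` and `role` labels are assumptions, and container settings such as credentials are omitted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-replica          # hypothetical name
  labels:
    app: mysql                 # assumed label shared by primary and replica
    role: replica
spec:
  affinity:
    podAntiAffinity:
      # Hard rule: never share a node with another Pod labelled app=mysql.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: mysql
        topologyKey: kubernetes.io/hostname
  containers:
  - name: db
    image: mysql:8.0
```

With a required rule, the Pod stays Pending if no node free of `app: mysql` Pods is available, so the node count has to cover the replica count.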

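3.3 Priority & Preemption

A PriorityClass defines a Pod's priority. When resources are insufficient, a higher-priority Pod can preempt (evict) lower-priority Pods to make room for itself.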

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority Pod class"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: myapp:v1
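
| Priority tier | Value range | Description |
| --- | --- | --- |
| System (highest) | 2000000000 and above | kube-system components |
| High | 1000000 | Critical business workloads |
| Default | 0 | Regular applications |
| Low | < 0 | Test applications |

User-defined PriorityClass values cannot exceed 1000000000. The range above that is reserved for the built-in system-cluster-critical (2000000000) and system-node-critical (2000001000) classes.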
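
The Deployment in section 5.2 references a `medium-priority` class that this article does not define, and a Pod that names a missing PriorityClass is rejected. The sketch below fills that gap and adds a low-priority class. The values 100000 and -100 and the name `low-priority` are assumptions chosen to fit the tiers above.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority       # name used by the Deployment in section 5.2
value: 100000                 # assumed value between default (0) and high (1000000)
globalDefault: false
description: "Regular production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority          # hypothetical name
value: -100                   # negative: ranks below Pods with no PriorityClass
preemptionPolicy: Never       # Pods in this class never evict other Pods
globalDefault: false
description: "Test workloads that may be preempted"
```

Setting `preemptionPolicy: Never` is optional. It keeps a class from triggering evictions while still ordering its Pods in the scheduling queue.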

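4. Scoring Strategies in the Score Stage

4.1 Common Scoring Strategies

The first column lists the legacy priority-function names. In a KubeSchedulerConfiguration profile, the equivalent score plugins are referenced by the names in the second column.

| Legacy name | Score plugin | Description | Default |
| --- | --- | --- | --- |
| LeastRequestedPriority | NodeResourcesFit (LeastAllocated strategy) | Prefers nodes with the lowest utilization | Enabled |
| MostRequestedPriority | NodeResourcesFit (MostAllocated strategy) | Prefers nodes with the highest utilization | Disabled |
| BalancedResourceAllocation | NodeResourcesBalancedAllocation | Balances CPU and memory allocation | Enabled |
| SelectorSpreadPriority | PodTopologySpread (successor of SelectorSpread) | Spreads Pods evenly | Enabled |
| ImageLocalityPriority | ImageLocality | Prefers nodes that already have the image | Enabled |
| TaintTolerationPriority | TaintToleration | Prefers nodes with fewer PreferNoSchedule taints the Pod does not tolerate | Enabled |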
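
The "most requested" behaviour is off by default. In current scheduler releases it can be switched on through the NodeResourcesFit plugin's scoring strategy, as in this minimal sketch. The profile name `bin-packing-scheduler` and the equal resource weights are assumptions.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: bin-packing-scheduler   # hypothetical profile name
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated    # prefer fuller nodes; the default is LeastAllocated
        resources:
        - name: cpu
          weight: 1            # assumed: CPU and memory weighted equally
        - name: memory
          weight: 1
```

Packing Pods onto fewer nodes suits clusters that scale nodes down when idle. Spreading, the default, leaves more headroom on each node.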

4.2 Custom Scheduler Configuration

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
  plugins:
    score:
      disabled:
      - name: TaintToleration      # skip taint-based scoring
      enabled:
      - name: NodeResourcesFit     # LeastAllocated scoring by default
        weight: 5
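
A profile only affects Pods that ask for it by name. The sketch below shows a Pod opting in to the `my-scheduler` profile. The Pod name and image are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled       # hypothetical name
spec:
  schedulerName: my-scheduler  # must match a profile's schedulerName
  containers:
  - name: app
    image: nginx:1.21
```

Pods that omit `schedulerName` are handled by `default-scheduler`. If no running scheduler serves the requested name, the Pod stays Pending.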

5. Production Best Practices

5.1 Node Tiering and Label Planning

# Example node labels
# kubectl label nodes node-001 node-type=compute
# kubectl label nodes node-001 env=prod
# kubectl label nodes node-001 zone=cn-beijing-1
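
| Label dimension | Values | Example |
| --- | --- | --- |
| Node type | compute/storage/ingress | node-type=compute |
| Environment | prod/staging/dev | env=prod |
| AZ | cn-beijing-1/2/3 | zone=cn-beijing-1 |
| Resource size | large/medium/small | size=large |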

5.2 Example Scheduling Configuration for an Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      priorityClassName: medium-priority   # this PriorityClass must already exist in the cluster
      nodeSelector:
        env: prod
        node-type: compute
      affinity:
        podAntiAffinity:
          # Hard rule: at most one replica per node, so 6 replicas need at least 6 eligible nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web-frontend
            topologyKey: kubernetes.io/hostname
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: redis-cache
              topologyKey: topology.kubernetes.io/zone
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "web"
        effect: "NoSchedule"
      containers:
      - name: web
        image: nginx:1.21
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"

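5.3 Troubleshooting and Monitoring Scheduling Failures

# 1. Check Pod scheduling status (look for FailedScheduling in the Events section)
kubectl get pods -o wide
kubectl describe pod <pod-name>

# 2. Check node labels and taints
kubectl get nodes --show-labels
kubectl describe nodes node-001 | grep -i taint

# 3. Check the scheduler logs
kubectl logs -n kube-system kube-scheduler-<node-name>

# Example Prometheus alerting rule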
groups:
- name: k8s_scheduler_alerts
  rules:
  - alert: PodUnscheduled
    expr: kube_pod_status_phase{phase="Pending"} > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not scheduled"
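
The Pending phase also covers Pods that are already scheduled but still pulling images, so the rule above can fire for reasons unrelated to scheduling. A narrower signal is the scheduler's own queue metric, as in the sketch below. It assumes Prometheus scrapes the kube-scheduler metrics endpoint. The group and alert names are hypothetical.

```yaml
groups:
- name: k8s_scheduler_queue_alerts      # hypothetical group name
  rules:
  - alert: SchedulerUnschedulablePods   # hypothetical alert name
    # Assumes kube-scheduler metrics are scraped by Prometheus.
    expr: sum(scheduler_pending_pods{queue="unschedulable"}) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "kube-scheduler has had unschedulable Pods for 5 minutes"
```

This rule reports that something is stuck but not which Pod. `kubectl describe pod` remains the way to find the reason.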

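6. Interview Cheat Sheet

6.1 One-Minute Version

Kubernetes scheduling has three stages: the Filter stage selects feasible nodes based on constraints (NodeName, NodeSelector, affinity, taints); the Score stage (Prioritize) scores the candidate nodes; the Bind stage binds the Pod to the highest-scoring node. The main strategies are direct assignment with NodeName, label selection with NodeSelector, NodeAffinity/PodAffinity, Taints/Tolerations, and priority and preemption through PriorityClass. In production, combine them to fit the workload: for example, isolate stateful services with NodeAffinity plus taints, and use PodAntiAffinity to keep web front ends highly available.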

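6.2 Mnemonic

Filter first and get it right; scoring then ranks what remains.
NodeName pins the node directly; NodeSelector picks by label.
Use affinity and anti-affinity flexibly; taints and tolerations work as a pair.
Be careful with priority and preemption; never skip scheduling monitoring.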

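6.3 Keyword Quick Reference

| Keyword | Description |
| --- | --- |
| Filter | Filter stage: selects feasible nodes |
| Prioritize | Score stage: scores the nodes |
| Bind | Bind stage: final placement |
| requiredDuringScheduling | Hard constraint, must be satisfied |
| preferredDuringScheduling | Soft constraint, satisfied when possible |
| Taint Effect | Effect of a taint (NoSchedule and so on) |
| PriorityClass | Pod priority |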

Reference link: SRE Operations Interview Questions Fully Explained: From Theory to Practice (Part 3)
