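# Kubernetes Scheduling: Scheduler Workflow, Taints, and Tolerations in Depth

## Context and Background

The Kubernetes Scheduler is a core control-plane component that assigns each Pod to the most suitable node. A solid grasp of the scheduling mechanism, scheduling policies, and taints and tolerations is an essential skill for senior DevOps engineers and SREs.

## 1. Scheduler Workflow

### 1.1 Overview of the Scheduling Flow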

**Overall flow**:

```mermaid
flowchart TD
    A["New Pod created"] --> B["Added to scheduling queue"]
    B --> C["Filtering phase (Predicates)"]
    C --> D{"Node meets requirements?"}
    D -->|No| E["Node eliminated"]
    D -->|Yes| F["Scoring phase (Priorities)"]
    F --> G["Compute node scores"]
    G --> H["Select best node"]
    H --> I["Bind Pod to node"]
    I --> J["kubelet creates containers"]
    style A fill:#ffcdd2
    style J fill:#c8e6c9
```
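
**Scheduling queues**:

```yaml
scheduling_queue:
  active_q:
    description: "Active queue"
    priority: "High"
    content: "Pods waiting to be scheduled"
  backoff_q:
    description: "Backoff queue"
    priority: "Medium"
    content: "Pods whose last scheduling attempt failed and that are waiting out a backoff period"
  unschedulable_q:
    description: "Unschedulable queue"
    priority: "Low"
    content: "Pods that currently fit no node and wait for a cluster change before retrying"
```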
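
The time a Pod spends in the backoff queue can be pictured as an exponential backoff. The sketch below assumes the kube-scheduler defaults of a 1-second initial backoff (`podInitialBackoffSeconds`) and a 10-second cap (`podMaxBackoffSeconds`); the function name `backoff_seconds` is hypothetical, and this is an illustration rather than the scheduler's own code.

```python
def backoff_seconds(attempts: int, initial: float = 1.0, maximum: float = 10.0) -> float:
    """Backoff before retrying after `attempts` failed tries (1-based).

    The delay doubles with every failed attempt and is capped at `maximum`.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    return min(initial * 2 ** (attempts - 1), maximum)


# Delays for the first five failed attempts
print([backoff_seconds(n) for n in range(1, 6)])
```

Capping the delay keeps a repeatedly failing Pod from waiting indefinitely once the cluster state changes in its favor.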
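
### 1.2 Filtering Phase (Predicates)

Filtering removes every node that cannot run the Pod. Predicates is the legacy name for these checks; in the current scheduling framework they run as Filter plugins.

**Predicate policies**:

```yaml
predicates_strategies:
  PodFitsResources:
    description: "Whether the node has enough resources"
    check: "CPU / memory / ephemeral storage"
  PodFitsHostPorts:
    description: "Whether the requested ports conflict"
    check: "Host ports already in use"
  HostName:
    description: "Whether the node name matches"
    check: "spec.nodeName"
  MatchNodeSelector:
    description: "Whether the node selector matches"
    check: "spec.nodeSelector"
  NoDiskConflict:
    description: "Whether the volumes are free of conflicts"
    check: "Volumes requested by the Pod versus volumes already mounted on the node"
  PodToleratesNodeTaints:
    description: "Whether the tolerations match"
    check: "Node taints versus Pod tolerations"
  CheckNodeMemoryPressure:
    description: "Memory pressure check"
    check: "Node memory condition"
  CheckNodeDiskPressure:
    description: "Disk pressure check"
    check: "Node disk condition"
  CheckNodePIDPressure:
    description: "Process ID pressure check"
    check: "Node PID condition"
```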

**Filtering flow**:

```mermaid
flowchart TD
    A["Start filtering"] --> B["PodFitsResources"]
    B -->|Pass| C["PodFitsHostPorts"]
    B -->|Fail| Z["Node eliminated"]
    C -->|Pass| D["HostName"]
    C -->|Fail| Z
    D -->|Pass| E["MatchNodeSelector"]
    D -->|Fail| Z
    E -->|Pass| F["NoDiskConflict"]
    E -->|Fail| Z
    F -->|Pass| G["PodToleratesNodeTaints"]
    F -->|Fail| Z
    G -->|Pass| H["All predicates passed"]
    G -->|Fail| Z
    style H fill:#c8e6c9
    style Z fill:#ffcdd2
```
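
The filtering idea can be sketched as a chain of boolean checks, where a node is kept only if every check passes. The `Node` and `Pod` classes, their field names, and the two checks below are hypothetical simplifications of `PodFitsResources` and `MatchNodeSelector`, not the scheduler's real data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int    # allocatable CPU minus existing requests, in millicores
    free_mem_mi: int   # allocatable memory minus existing requests, in MiB
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)


def fits_resources(pod: Pod, node: Node) -> bool:
    """Simplified PodFitsResources: requests must fit into the free capacity."""
    return pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi


def matches_selector(pod: Pod, node: Node) -> bool:
    """Simplified MatchNodeSelector: every selector entry must equal a node label."""
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


FILTERS = [fits_resources, matches_selector]


def feasible_nodes(pod: Pod, nodes: list) -> list:
    """Keep the names of nodes that pass every filter."""
    return [n.name for n in nodes if all(check(pod, n) for check in FILTERS)]


nodes = [
    Node("node-a", 500, 1024, {"disktype": "ssd"}),    # too little CPU
    Node("node-b", 4000, 8192, {"disktype": "hdd"}),   # wrong label
    Node("node-c", 4000, 8192, {"disktype": "ssd"}),
]
pod = Pod(cpu_m=1000, mem_mi=2048, node_selector={"disktype": "ssd"})
print(feasible_nodes(pod, nodes))  # ['node-c']
```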
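
### 1.3 Scoring Phase (Priorities)

Nodes that survive filtering are ranked in this phase. Priorities is the legacy name; the scheduling framework implements the same idea as Score plugins.

**Priority policies**:

```yaml
priorities_strategies:
  LeastRequestedPriority:
    description: "Prefer nodes with the fewest requested resources"
    score: "score = (capacity - requested) × 10 / capacity, averaged over CPU and memory"
    favor: "Nodes with low resource usage"
  BalancedResourceAllocation:
    description: "Balanced resource allocation"
    score: "How close the CPU and memory utilization ratios are to each other"
    favor: "Nodes with evenly used resources"
  ImageLocalityPriority:
    description: "Prefer nodes that already hold the image"
    score: "Based on the size of the Pod's images already present on the node"
    favor: "Nodes that already have the image"
  NodeAffinityPriority:
    description: "Prefer nodes that match node affinity"
    score: "Based on how well the node matches the preferred node affinity terms"
    favor: "Nodes that satisfy the affinity"
  TaintTolerationPriority:
    description: "Prefer nodes whose taints are tolerated"
    score: "Based on how many PreferNoSchedule taints the Pod does not tolerate"
    favor: "Nodes with fewer untolerated taints"
  SelectorSpreadPriority:
    description: "Prefer spreading Pods of the same selector"
    score: "Based on the number of matching Pods in the same topology domain"
    favor: "Nodes that leave the Pods more evenly spread"
```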
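
To make the `LeastRequestedPriority` formula concrete, here is a minimal sketch on the legacy 0 to 10 scale. The function names and sample numbers are assumptions for illustration; the real plugin also handles extended resources and configurable weights.

```python
def least_requested_score(capacity: float, requested: float, max_score: int = 10) -> float:
    """(capacity - requested) * max_score / capacity; 0 when the node is overcommitted."""
    if capacity <= 0 or requested > capacity:
        return 0.0
    return (capacity - requested) * max_score / capacity


def node_score(cpu: tuple, mem: tuple) -> float:
    """cpu and mem are (capacity, requested) pairs; the two scores are averaged."""
    return (least_requested_score(*cpu) + least_requested_score(*mem)) / 2


# 4000m CPU with 1000m requested, 8192 MiB memory with 4096 MiB requested
print(node_score((4000, 1000), (8192, 4096)))  # 6.25
```

The emptier a node is, the higher it scores, which is why this policy tends to spread Pods across the cluster.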
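
**Score calculation**:

```yaml
score_calculation:
  step_1: "Each policy produces a score from 0 to 10 (0 to 100 in the scheduling framework)"
  step_2: "Multiply each policy's score by its weight"
  step_3: "Sum the weighted scores across all policies"
  formula: "NodeScore = Σ(weight_i × score_i)"
  example:
    LeastRequested: "weight=1, score=8"
    BalancedResource: "weight=1, score=6"
    ImageLocality: "weight=1, score=9"
    NodeAffinity: "weight=2, score=7"
    result: "1×8 + 1×6 + 1×9 + 2×7 = 37"
```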
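
The weighted sum above can be checked with a few lines of code. The `node-1` figures are the ones from the example; `node-2` and its scores are made up here purely to show how the highest total wins.

```python
WEIGHTS = {"LeastRequested": 1, "BalancedResource": 1, "ImageLocality": 1, "NodeAffinity": 2}

# Per-policy scores for each candidate node (node-2 is a hypothetical comparison)
CANDIDATES = {
    "node-1": {"LeastRequested": 8, "BalancedResource": 6, "ImageLocality": 9, "NodeAffinity": 7},
    "node-2": {"LeastRequested": 9, "BalancedResource": 5, "ImageLocality": 3, "NodeAffinity": 4},
}


def total_score(weights: dict, scores: dict) -> int:
    """NodeScore = sum of weight_i * score_i over all policies."""
    return sum(weights[name] * scores[name] for name in weights)


totals = {node: total_score(WEIGHTS, scores) for node, scores in CANDIDATES.items()}
best = max(totals, key=totals.get)
print(totals, best)  # {'node-1': 37, 'node-2': 25} node-1
```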

**Scoring flow**:

```mermaid
flowchart TD
    A["Start scoring"] --> B["LeastRequested"]
    B --> C["BalancedResource"]
    C --> D["ImageLocality"]
    D --> E["NodeAffinity"]
    E --> F["TaintToleration"]
    F --> G["SelectorSpread"]
    G --> H["Compute total score"]
    H --> I["Select highest-scoring node"]
    style I fill:#c8e6c9
```

## 2. Scheduling Policy Configuration

### 2.1 Built-in Scheduling Policies

**Scheduler configuration file**:

```yaml
# kube-scheduler configuration file
# kubescheduler.config.k8s.io/v1 is served from Kubernetes v1.25 onward
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```
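
**Predicates configuration**:

```yaml
# Enabled predicates
# Legacy names from the scheduling Policy API (not supported since v1.23);
# shown as a conceptual list, not as a loadable kube-scheduler config.
enabledPredicates:
  - CheckNodeCondition
  - CheckNodeMemoryPressure
  - CheckNodeDiskPressure
  - CheckNodePIDPressure
  - PodToleratesNodeTaints
  - NoDiskConflict
  - PodFitsResources
  - HostName
  - MatchNodeSelector
  - PodFitsHostPorts
```

**Priorities configuration**:

```yaml
# Enabled priorities and their weights
# Conceptual list; current clusters configure score plugins and weights
# in the profiles section of KubeSchedulerConfiguration.
enabledPriorities:
  - name: NodeResourcesLeastAllocated
    weight: 1
  - name: NodeResourcesBalancedAllocation
    weight: 1
  - name: ImageLocality
    weight: 1
  - name: InterPodAffinity
    weight: 1
  - name: NodeAffinity
    weight: 1
  - name: TaintToleration
    weight: 1
  - name: SelectorSpread
    weight: 1
```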
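
### 2.2 Custom Scheduling Policies

A cluster can run several schedulers side by side; each Pod names the one that should place it.

**Deploying a custom scheduler**:

```yaml
# Deploy a custom scheduler
apiVersion: v1
kind: Pod
metadata:
  name: my-custom-scheduler
  namespace: kube-system
spec:
  containers:
    - name: my-custom-scheduler
      image: registry.k8s.io/kube-scheduler:v1.28.0
      command:
        - kube-scheduler
        # The scheduler name is set through profiles[].schedulerName in the config file.
        # The path below is an assumption; the file must be mounted into the container.
        - --config=/etc/kubernetes/my-custom-scheduler-config.yaml
        - --leader-elect=false
```

**Selecting the scheduler from a Pod**:

```yaml
# Pod that uses the custom scheduler
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-custom-scheduler
  containers:
    - name: nginx
      image: nginx:latest
```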

### 2.3 Affinity and Anti-Affinity

**Node affinity**:

```yaml
# Node affinity example
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the node must be in zone-a
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - zone-a
      # Soft preference: SSD nodes score higher
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  containers:
    - name: nginx
      image: nginx:latest
```
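
How the required and preferred rules are evaluated can be sketched as follows. This is a simplified illustration with hypothetical function names: it covers only `matchExpressions` with the `In`, `NotIn`, `Exists`, and `DoesNotExist` operators and ignores `matchFields`, `Gt`, and `Lt`.

```python
def expr_matches(expr: dict, labels: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def term_matches(term: dict, labels: dict) -> bool:
    """Expressions inside one term are ANDed; an empty term matches nothing."""
    exprs = term.get("matchExpressions", [])
    return bool(exprs) and all(expr_matches(e, labels) for e in exprs)


def required_ok(terms: list, labels: dict) -> bool:
    """nodeSelectorTerms are ORed: one matching term is enough."""
    return any(term_matches(t, labels) for t in terms)


def preferred_score(preferred: list, labels: dict) -> int:
    """Sum the weights of all preferred terms that the node satisfies."""
    return sum(p["weight"] for p in preferred if term_matches(p["preference"], labels))


required = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["zone-a"]}]}]
preferred = [{"weight": 1, "preference": {"matchExpressions": [
    {"key": "disktype", "operator": "In", "values": ["ssd"]}]}}]

ssd_node = {"topology.kubernetes.io/zone": "zone-a", "disktype": "ssd"}
other_zone = {"topology.kubernetes.io/zone": "zone-b", "disktype": "ssd"}
print(required_ok(required, ssd_node), preferred_score(preferred, ssd_node))  # True 1
print(required_ok(required, other_zone))                                      # False
```

The required terms act as a filter, while the preferred terms only add to the node's score, which mirrors the two scheduling phases described earlier.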

**Pod affinity**:

```yaml
# Pod affinity example
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      # Place this Pod in a zone that already runs a Pod labeled app=database
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - database
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: nginx
      image: nginx:latest
```

**Pod anti-affinity**:

```yaml
# Pod anti-affinity example (high-availability deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web   # must match spec.selector and the anti-affinity labelSelector
    spec:
      affinity:
        podAntiAffinity:
          # At most one app=web Pod per node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx:latest
```
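
The required anti-affinity rule can be read as: reject a node if any already-placed Pod that matches the selector sits in the same topology domain. The sketch below uses a hypothetical function and data layout, treats the selector as plain label equality, and assumes every node carries the topology key.

```python
def violates_anti_affinity(candidate: str, node_labels: dict, placed_pods: list,
                           match_labels: dict, topology_key: str) -> bool:
    """Return True if placing a Pod on `candidate` breaks the anti-affinity rule.

    node_labels: {node_name: {label: value}}
    placed_pods: [(node_name, pod_labels), ...]
    """
    domain = node_labels[candidate][topology_key]
    for node_name, pod_labels in placed_pods:
        same_domain = node_labels[node_name][topology_key] == domain
        selected = all(pod_labels.get(k) == v for k, v in match_labels.items())
        if same_domain and selected:
            return True
    return False


node_labels = {
    "node-1": {"kubernetes.io/hostname": "node-1"},
    "node-2": {"kubernetes.io/hostname": "node-2"},
}
placed_pods = [("node-1", {"app": "web"})]   # one web replica already runs on node-1

for name in node_labels:
    blocked = violates_anti_affinity(name, node_labels, placed_pods,
                                     {"app": "web"}, "kubernetes.io/hostname")
    print(name, "blocked" if blocked else "allowed")
```

With `kubernetes.io/hostname` as the topology key each node is its own domain, so the three replicas end up on three different nodes or stay pending if fewer nodes are available.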
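
## 3. Taints and Tolerations

### 3.1 Taints in Detail

**Taint definition**:

```yaml
taint_definition:
  purpose: "Repel Pods from specific nodes"
  effect_types:
    NoSchedule: "New Pods without a matching toleration are not scheduled onto the node"
    PreferNoSchedule: "The scheduler tries to avoid the node but may still use it"
    NoExecute: "New Pods are not scheduled, and running Pods without a matching toleration are evicted"
```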

**Setting taints**:

```bash
# Add a taint
kubectl taint nodes node1 key=value:NoSchedule
# Add a PreferNoSchedule taint
kubectl taint nodes node1 dedicated=gpu:PreferNoSchedule
# Add a NoExecute taint
kubectl taint nodes node1 app=monitoring:NoExecute
# View taints
kubectl describe node node1 | grep Taints
# Remove a taint (note the trailing "-")
kubectl taint nodes node1 key=value:NoSchedule-
```
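
The taint arguments above follow the shape `key[=value]:effect`, with a trailing `-` meaning removal. The parser below is a hypothetical helper written only to make that syntax explicit; it is not how kubectl parses its arguments internally.

```python
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def parse_taint(spec: str) -> dict:
    """Split a kubectl-style taint argument into its parts."""
    remove = spec.endswith("-")
    if remove:
        spec = spec[:-1]
    key_value, _, effect = spec.partition(":")
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError("a taint needs a key")
    if effect and effect not in VALID_EFFECTS:
        raise ValueError(f"unknown effect: {effect}")
    if not effect and not remove:
        raise ValueError("an effect is required when adding a taint")
    return {"key": key, "value": value, "effect": effect, "remove": remove}


print(parse_taint("key=value:NoSchedule"))
print(parse_taint("key=value:NoSchedule-"))
print(parse_taint("maintenance-"))   # removes every taint with this key
```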
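
**Common taint examples**:

```yaml
common_taints:
  control_plane_node:
    # Older clusters use node-role.kubernetes.io/master instead
    key: "node-role.kubernetes.io/control-plane"
    effect: "NoSchedule"
    reason: "Control-plane nodes do not run ordinary Pods"
  gpu_node:
    key: "nvidia.com/gpu"
    value: "present"
    effect: "NoSchedule"
    reason: "GPU nodes are reserved for AI workloads"
  dedicated_node:
    key: "dedicated"
    value: "database"
    effect: "NoSchedule"
    reason: "Nodes dedicated to databases"
  memory_pressure:
    key: "node.kubernetes.io/memory-pressure"
    effect: "NoSchedule"
    reason: "Added automatically when the node is under memory pressure"
```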
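
### 3.2 Tolerations in Detail

A toleration only permits a Pod to land on a tainted node; it does not attract the Pod to that node.

**Toleration definition**:

```yaml
toleration_definition:
  purpose: "Allow a Pod to be scheduled onto nodes that carry a matching taint"
  match_mechanism: "key and effect must match; value must also match when operator is Equal"
```

**Toleration configuration**:

```yaml
# Basic toleration
tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
---
# Using the Exists operator (no value needed)
tolerations:
  - key: "key"
    operator: "Exists"
    effect: "NoSchedule"
---
# Tolerate every taint
tolerations:
  - operator: "Exists"
```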

**Complete toleration example**:

```yaml
# Pod toleration configuration
apiVersion: v1
kind: Pod
metadata:
  name: with-tolerations
spec:
  tolerations:
    # Older clusters use node-role.kubernetes.io/master instead
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "app"
      operator: "Equal"
      value: "monitoring"
      effect: "NoExecute"
      tolerationSeconds: 300   # stay for 300 seconds after the taint appears, then get evicted
  containers:
    - name: nginx
      image: nginx:latest
```
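
**Special tolerations**:

```yaml
# Tolerate every NoSchedule taint (no key, Exists operator)
tolerations:
  - operator: "Exists"
    effect: "NoSchedule"
---
# Tolerate every taint (not recommended)
tolerations:
  - operator: "Exists"
---
# Tolerate every taint with a given key, whatever its effect
tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: ""
```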
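
The matching rules behind these variants fit into a short function. The sketch below follows the documented semantics (empty effect matches any effect, empty key with `Exists` matches any key, `Equal` is the default operator); the function names and dictionary layout are assumptions for illustration.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (only valid together with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator") or "Equal"   # Equal is the default
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def untolerated(taints: list, tolerations: list) -> list:
    """Taints that would keep the Pod off the node (PreferNoSchedule is only a preference)."""
    return [t for t in taints
            if t["effect"] in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, t) for tol in tolerations)]


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

print(tolerates({"key": "nvidia.com/gpu", "operator": "Equal",
                 "value": "present", "effect": "NoSchedule"}, gpu_taint))   # True
print(tolerates({"operator": "Exists"}, gpu_taint))                          # True
print(tolerates({"key": "nvidia.com/gpu", "operator": "Exists",
                 "effect": "NoExecute"}, gpu_taint))                         # False
print(untolerated([gpu_taint], []))                                          # [gpu_taint]
```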

### 3.3 Using Taints and Tolerations Together

**Dedicated-node scenario**:

```mermaid
flowchart TD
    A["Taint the GPU node"] --> B["nvidia.com/gpu=present:NoSchedule"]
    C["Add a toleration to the AI Pod"] --> D["key=nvidia.com/gpu, value=present, effect=NoSchedule"]
    D --> E["Pod can be scheduled onto the GPU node"]
    style A fill:#ffcdd2
    style E fill:#c8e6c9
```

**Example 1: dedicated GPU nodes**:

```bash
# 1. Taint the GPU node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
```

```yaml
# 2. Add a toleration to the Pod
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  containers:
    - name: training
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1
```
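
**Example 2: dedicated database nodes**:

```bash
# 1. Taint the database node
kubectl taint nodes db-node-1 dedicated=database:NoSchedule
```

```yaml
# 2. Add a toleration to the database Pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "database"
          effect: "NoSchedule"
      # The toleration only permits placement; the nodeSelector pins the Pods
      # to nodes labeled node-role=database (the label must exist on the node).
      nodeSelector:
        node-role: database
      containers:
        - name: mysql
          image: mysql:latest
```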

**Example 3: temporary node maintenance (NoExecute)**:

```bash
# 1. Add a NoExecute taint to the node
kubectl taint nodes node-1 maintenance=true:NoExecute
```

```yaml
# 2. Add a toleration to the Pod (with tolerationSeconds)
apiVersion: v1
kind: Pod
metadata:
  name: production-app
spec:
  tolerations:
    - key: "maintenance"
      operator: "Equal"
      value: "true"
      effect: "NoExecute"
      tolerationSeconds: 3600   # keep running for up to one hour after the taint is added
  containers:
    - name: app
      image: app:latest
```
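
The effect of `tolerationSeconds` can be expressed as a simple deadline: a Pod with a matching `NoExecute` toleration stays until the taint's timestamp plus the tolerated seconds, a Pod with no `tolerationSeconds` stays indefinitely, and a Pod without a matching toleration is evicted right away. The helper below is a hypothetical sketch of the first two cases.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def eviction_time(taint_added_at: datetime,
                  toleration_seconds: Optional[int]) -> Optional[datetime]:
    """When a Pod with a matching NoExecute toleration gets evicted.

    None means the toleration has no time limit, so the Pod is never evicted
    because of this taint. Zero or negative values mean immediate eviction.
    """
    if toleration_seconds is None:
        return None
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


added = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)   # sample timestamp
print(eviction_time(added, 3600))   # 2024-01-01 11:00:00+00:00
print(eviction_time(added, None))   # None
```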
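
## 4. Production Best Practices

### 4.1 Scheduling Optimization

**Resource scheduling optimization**:

```yaml
resource_scheduling:
  resource_requests:
    description: "Set resource requests sensibly"
    practice: "requests should be close to actual usage"
    benefit: "The scheduler places Pods by their requests, so accurate values improve placement"
  resource_limits:
    description: "Set resource limits sensibly"
    practice: "limits should be greater than or equal to requests"
    benefit: "Caps burst usage so that one Pod cannot starve its neighbors"
  quality_of_service:
    description: "QoS classes"
    types:
      Guaranteed: "requests = limits for CPU and memory in every container (evicted last)"
      Burstable: "at least one request or limit is set, but the Pod is not Guaranteed (evicted after BestEffort)"
      BestEffort: "no requests or limits set (evicted first)"
```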
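
The QoS class is derived from the requests and limits of all containers in the Pod. The classifier below is a simplified sketch with a hypothetical function name: it compares quantities as plain strings, so `"1"` and `"1000m"` would not be treated as equal, and it ignores init containers.

```python
def qos_class(containers: list) -> str:
    """Classify a Pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit, so only the limit is mandatory.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))   # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"},
                  "limits": {"cpu": "500m"}}]))                    # Burstable
print(qos_class([{}]))                                             # BestEffort
```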
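
**Pod topology spread constraints**:

```yaml
# Topology spread constraint example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        # Hard rule: zones may differ by at most one matching Pod
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
        # Soft rule: prefer an even spread across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:latest
```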
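
The `maxSkew` check can be illustrated with a few lines of arithmetic: for a candidate domain, the skew is the number of matching Pods it would hold after placement minus the current minimum across all domains. The function names and the zone counts below are assumptions, and refinements such as `minDomains` are ignored.

```python
def skew_if_placed(pods_per_domain: dict, domain: str) -> int:
    """Skew of `domain` if one more matching Pod were placed there."""
    return pods_per_domain[domain] + 1 - min(pods_per_domain.values())


def allowed_domains(pods_per_domain: dict, max_skew: int) -> list:
    """Domains that still satisfy maxSkew under whenUnsatisfiable: DoNotSchedule."""
    return [d for d in pods_per_domain
            if skew_if_placed(pods_per_domain, d) <= max_skew]


zones = {"zone-a": 2, "zone-b": 2, "zone-c": 1}   # matching Pods per zone
print(allowed_domains(zones, max_skew=1))          # ['zone-c']
```

With `ScheduleAnyway` the same skew value would lower the node's score instead of excluding it.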

### 4.2 Taint Usage Strategy

**Taint use cases**:

```yaml
taint_use_cases:
  dedicated_nodes:
    description: "Dedicated nodes"
    example: "Database nodes, GPU nodes"
    taint: "dedicated=<role>:NoSchedule"
  specialized_workloads:
    description: "Specialized workloads"
    example: "AI training, big data"
    taint: "workload-type=<type>:NoSchedule"
  maintenance:
    description: "Node maintenance"
    example: "Upgrades, repairs"
    taint: "maintenance=true:NoExecute"
  resource_pressure:
    description: "Resource pressure"
    example: "Low memory, full disk"
    taint: "node.kubernetes.io/memory-pressure:NoSchedule, node.kubernetes.io/disk-pressure:NoSchedule (added automatically)"
```
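
**Taint management best practice**:

```bash
# Node maintenance workflow
# 1. Evict the Pods on the node (drain also cordons the node)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# 2. Add the maintenance taint
kubectl taint nodes node-1 maintenance=true:NoSchedule
# 3. Perform the maintenance work
# ...
# 4. Remove the maintenance taint
kubectl taint nodes node-1 maintenance-
# 5. Make the node schedulable again
kubectl uncordon node-1
```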
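
### 4.3 Scheduler High Availability

**High-availability architecture**:

```yaml
scheduler_high_availability:
  leader_election:
    description: "Leader election"
    lease_duration: "15 seconds"
    renew_deadline: "10 seconds"
    retry_period: "5 seconds (the kube-scheduler default is 2 seconds)"
  multiple_replicas:
    description: "Multi-replica deployment"
    recommendation: "At least 2 replicas"
    benefit: "Automatic failover; only the elected leader schedules at any given time"
```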
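
**High-availability configuration example**:

```yaml
# kube-scheduler high-availability configuration
# kubescheduler.config.k8s.io/v1 is served from Kubernetes v1.25 onward
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 5s
  resourceLock: leases
  resourceName: kube-scheduler
  resourceNamespace: kube-system
```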
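
The three timings are not independent. The sketch below checks them against the constraints that, to the best of my understanding, the client-go leader election library enforces: `leaseDuration` must exceed `renewDeadline`, and `renewDeadline` must exceed `retryPeriod` multiplied by a jitter factor of 1.2. The function name is hypothetical and all values are in seconds.

```python
JITTER_FACTOR = 1.2   # assumed to match the client-go leader election constant


def validate_leader_election(lease_duration: float, renew_deadline: float,
                             retry_period: float) -> list:
    """Return a list of problems; an empty list means the timings are consistent."""
    errors = []
    if lease_duration <= renew_deadline:
        errors.append("leaseDuration must be greater than renewDeadline")
    if renew_deadline <= JITTER_FACTOR * retry_period:
        errors.append("renewDeadline must be greater than retryPeriod * 1.2")
    return errors


print(validate_leader_election(15, 10, 5))   # [] (the values used above)
print(validate_leader_election(15, 10, 9))   # renewDeadline too close to retryPeriod
```

Shorter timings speed up failover but make the leader more sensitive to brief API server latency, so the values are a trade-off rather than a fixed recommendation.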
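
## 5. One-Minute Interview Answer (Memorize As-Is)

**Full version**:

The Scheduler works in three steps: 1. the filtering phase walks through all nodes and uses Predicates to drop the ones that do not meet the Pod's requirements; 2. the scoring phase ranks the remaining nodes, with common policies including LeastRequested and BalancedResource; 3. the Pod is bound to the highest-scoring node. A taint marks a node so that ordinary Pods are kept off it, as on control-plane nodes; a toleration lets a Pod be scheduled onto a node that carries a matching taint. The common effects are NoSchedule (do not schedule), PreferNoSchedule (avoid scheduling if possible), and NoExecute (do not schedule and evict running Pods).

**30-second version**:

Scheduling has two phases: filtering removes unfit nodes and scoring ranks the rest. Taints repel Pods, tolerations let Pods land on tainted nodes, and the three effects are NoSchedule, PreferNoSchedule, and NoExecute.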

## 6. Summary

### 6.1 Scheduling Mechanism Summary

```yaml
scheduling_summary:
  phases:
    predicates:
      name: "Filtering phase"
      purpose: "Filter out nodes that do not meet the requirements"
    priorities:
      name: "Scoring phase"
      purpose: "Score the nodes that passed filtering"
  key_strategies:
    predicates:
      - "PodFitsResources"
      - "PodFitsHostPorts"
      - "MatchNodeSelector"
      - "PodToleratesNodeTaints"
    priorities:
      - "LeastRequestedPriority"
      - "BalancedResourceAllocation"
      - "ImageLocalityPriority"
      - "NodeAffinityPriority"
```
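
### 6.2 Taints and Tolerations Summary

```yaml
taint_toleration_summary:
  taint_effects:
    NoSchedule: "Do not schedule new Pods"
    PreferNoSchedule: "Avoid scheduling new Pods if possible"
    NoExecute: "Do not schedule new Pods and evict running ones"
  toleration_options:
    operator:
      Equal: "value must be equal"
      Exists: "only the key has to exist"
    effect:
      specific: "matches that effect only"
      empty: "matches every effect"
```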
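
### 6.3 Best Practices Checklist

```yaml
best_practices:
  scheduling:
    - "Set resource requests and limits sensibly"
    - "Use topology spread constraints for high availability"
    - "Prefer Pod anti-affinity to spread replicas"
  taints:
    - "Taint dedicated nodes"
    - "Drain a node before maintenance"
    - "Pair NoExecute taints with tolerationSeconds"
  monitoring:
    - "Monitor scheduling latency"
    - "Monitor the scheduling failure rate"
    - "Monitor node resource utilization"
```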
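
### 6.4 Mnemonic

```text
K8s scheduling has two phases: filtering and scoring, keep them straight.
Filtering drops the nodes that do not fit; scoring ranks the ones that do.
Taints are set on nodes and repel ordinary Pods.
Tolerations are set on Pods and allow them onto tainted nodes.
NoSchedule means do not schedule; PreferNoSchedule means avoid if possible.
NoExecute means do not schedule and evict, so take care when maintaining nodes.
```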
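
## Document Information

- Author: soveran zhong
- Link: https://blog.clockwingsoar.cn/2026/05/09/k8s-scheduler-taints-best-practices/
- License: free to share for non-commercial use, no derivatives, attribution required (Creative Commons 3.0 license)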