K8s污点与容忍详解:节点隔离与专用节点最佳实践
情境与背景
在Kubernetes生产环境中,我们经常需要控制Pod在哪些节点上运行,例如专用GPU节点给AI作业用、数据库节点隔离出来给数据库用、控制节点不跑业务Pod等。污点(Taint)与容忍(Toleration)是K8s提供的用于实现节点隔离和Pod选择性调度的核心机制。
作为高级DevOps/SRE工程师,深入理解污点与容忍的工作原理、掌握各种Effect的区别和适用场景,是构建生产级K8s集群的必备技能。
一、污点与容忍核心概念
1.1 什么是污点(Taint)
污点是给节点(Node)添加的标签,具有排斥Pod的能力,让某些Pod无法调度到该节点上。
flowchart TB
A["污点 Taint"] --> B["Key"]
A --> C["Value"]
A --> D["Effect"]
D --> E["NoSchedule"]
D --> F["PreferNoSchedule"]
D --> G["NoExecute"]
H["有污点的节点"] --> I["排斥无容忍的Pod"]
H --> J["接受有容忍的Pod"]
污点的三个组成部分:
- Key:污点的键(例如
dedicated) - Value:污点的值(例如
gpu) - Effect:污点的效果(NoSchedule/PreferNoSchedule/NoExecute)
1.2 什么是容忍(Toleration)
容忍是给Pod添加的配置,让Pod能够容忍节点上的某些污点,从而调度到该节点上。
flowchart TB
K["容忍 Toleration"] --> L["匹配污点"]
L --> M["Pod可调度"]
N["无容忍"] --> O["Pod被排斥"]
容忍的主要配置项:
- key/value:要匹配的污点的key/value
- operator:
Equal(值相等)或Exists(只要key存在) - effect:要匹配的污点的effect,不指定则匹配所有
- tolerationSeconds:NoExecute场景下Pod在节点上的存活时间
二、污点Effect详解
2.1 NoSchedule - 不调度新Pod
NoSchedule是最常用的污点效果,让新Pod无法调度到该节点上,但节点上已运行的Pod不受影响。
# 给节点打NoSchedule污点
kubectl taint nodes node-001 dedicated=gpu:NoSchedule
# Pod配置容忍NoSchedule污点
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
containers:
- name: cuda
image: nvidia/cuda:11.0
| 优点 | 缺点 | 适用场景 |
|---|---|---|
| 新Pod不调度 | 已有Pod继续运行 | 专用资源节点 |
| 实现节点隔离 | - | GPU/DB专用节点 |
2.2 PreferNoSchedule - 尽量不调度
PreferNoSchedule是软约束,尽量不让Pod调度到该节点上,但如果没有其他节点可选,Pod还是能调度上去。
kubectl taint nodes node-001 dedicated=test:PreferNoSchedule
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "test"
effect: "PreferNoSchedule"
containers:
- name: test
image: busybox:1.34
| 优点 | 缺点 | 适用场景 |
|---|---|---|
| 软隔离,不强制 | 可能被调度 | 测试节点标记 |
2.3 NoExecute - 不调度+驱逐
NoExecute是最严格的污点效果,不仅不让新Pod调度,还会驱逐节点上已运行的Pod。
kubectl taint nodes node-001 node.kubernetes.io/unreachable:NoExecute
apiVersion: v1
kind: Pod
metadata:
name: critical-pod
spec:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 300 # 容忍5分钟
containers:
- name: critical
image: mycriticalapp:v1
| 优点 | 缺点 | 适用场景 |
|---|---|---|
| 强制隔离+驱逐 | 影响业务Pod | 故障/维护节点 |
| 快速释放资源 | - | 节点下线准备 |
三、容忍配置详解
3.1 基本容忍配置
apiVersion: v1
kind: Pod
metadata:
name: db-pod
spec:
tolerations:
# 使用Equal精确匹配
- key: "dedicated"
operator: "Equal"
value: "database"
effect: "NoSchedule"
# 使用Exists匹配key存在
- key: "node-role.kubernetes.io/worker"
operator: "Exists"
containers:
- name: db
image: mysql:8.0
3.2 容忍所有污点
apiVersion: v1
kind: Pod
metadata:
name: all-tolerations
spec:
tolerations:
- operator: "Exists" # 匹配所有污点
containers:
- name: admin
image: admin-tool:v1
3.3 tolerationSeconds - 存活时间
在NoExecute场景下,tolerationSeconds配置Pod能在有污点的节点上存活多久。
apiVersion: v1
kind: Pod
metadata:
name: fault-tolerant
spec:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 600 # 10分钟后才驱逐
containers:
- name: app
image: myapp:v1
四、生产环境最佳实践
4.1 专用节点规划
flowchart TB
subgraph 集群架构
A["控制节点"] -->|污点| B["node-role.kubernetes.io/control-plane:NoSchedule"]
C["Worker节点"] --> D["普通应用"]
E["GPU节点"] -->|污点| F["dedicated=gpu:NoSchedule"]
G["数据库节点"] -->|污点| H["dedicated=database:NoSchedule"]
I["Ingress节点"] -->|污点| J["dedicated=ingress:NoSchedule"]
end
4.2 GPU专用节点配置
# 1. 给GPU节点打污点
kubectl taint nodes gpu-node-001 dedicated=gpu:NoSchedule
kubectl taint nodes gpu-node-002 dedicated=gpu:NoSchedule
# 2. 给GPU节点打标签
kubectl label nodes gpu-node-001 accelerator=nvidia-gpu
kubectl label nodes gpu-node-002 accelerator=nvidia-gpu
# GPU作业Pod配置
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-training
spec:
replicas: 2
selector:
matchLabels:
app: gpu-training
template:
metadata:
labels:
app: gpu-training
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
containers:
- name: training
image: my-gpu-app:v1
resources:
limits:
nvidia.com/gpu: 1
4.3 数据库专用节点配置
# StatefulSet配置数据库Pod
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: mysql
spec:
serviceName: mysql
replicas: 3
selector:
matchLabels:
app: mysql
template:
metadata:
labels:
app: mysql
spec:
nodeSelector:
node-type: database
tolerations:
- key: "dedicated"
operator: "Equal"
value: "database"
effect: "NoSchedule"
containers:
- name: mysql
image: mysql:8.0
volumeMounts:
- name: data
mountPath: /var/lib/mysql
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
4.4 控制节点隔离
默认情况下,Kubernetes控制节点已经有污点:
kubectl describe nodes control-plane-001 | grep -i taint
# 控制节点的默认污点
Taints: node-role.kubernetes.io/control-plane:NoSchedule
# 如果需要让某些Pod调度到控制节点(系统组件)
apiVersion: v1
kind: Pod
metadata:
name: system-component
namespace: kube-system
spec:
tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
containers:
- name: monitoring
image: monitoring-agent:v1
4.5 节点维护操作
# 步骤1: 对节点执行drain操作,驱逐Pod
kubectl drain node-001 --ignore-daemonsets
# 步骤2: (可选) 给节点打NoExecute污点
kubectl taint nodes node-001 maintenance=true:NoExecute
# 步骤3: 执行维护操作...
# 步骤4: 给节点解除污点
kubectl taint nodes node-001 maintenance=true:NoExecute-
# 步骤5: 让节点重新加入调度
kubectl uncordon node-001
4.6 监控告警
# Prometheus监控污点变化
groups:
- name: node_taint_alerts
rules:
- alert: NodeHasNoExecuteTaint
expr: kube_node_spec_taint{effect="NoExecute"} > 0
for: 5m
labels:
severity: warning
annotations:
summary: "节点 有NoExecute污点"
五、常见问题排查
5.1 Pod无法调度
# 1. 查看Pod状态和事件
kubectl get pods
kubectl describe pod <pod-name>
# 2. 查看节点污点
kubectl describe nodes <node-name> | grep -i taint
# 3. 检查Pod容忍配置
kubectl get pod <pod-name> -o yaml | grep -A 10 tolerations
5.2 Pod被驱逐
# 1. 查看Pod历史
kubectl get pods --show-all
# 2. 查看事件
kubectl get events --field-selector involvedObject.name=<pod-name>
# 3. 检查节点污点和tolerationSeconds
kubectl describe nodes <node-name>
六、面试精简版
6.1 一分钟版本
K8s的污点(Taint)是给节点添加排斥标签,让特定Pod无法调度;容忍(Toleration)是给Pod添加配置,让Pod能接受有污点的节点。污点有三个Effect:NoSchedule(不调度新Pod)、PreferNoSchedule(尽量不调度)、NoExecute(不调度+驱逐现有Pod)。生产环境常用污点实现专用节点(GPU/DB/Ingress)、控制节点隔离和故障节点维护。
6.2 记忆口诀
污点打在节点上,Pod加容忍才能上,
NoSchedule不能调度,PreferNoSchedule尽量不调,
NoExecute还驱逐跑,节点隔离全靠它,
专用节点规划好,资源利用效率高。
6.3 关键词速查
| 关键词 | 说明 |
|---|---|
| Taint | 污点,给节点添加排斥标签 |
| Toleration | 容忍,让Pod接受有污点的节点 |
| NoSchedule | 不调度新Pod |
| PreferNoSchedule | 尽量不调度 |
| NoExecute | 不调度+驱逐 |
| tolerationSeconds | NoExecute场景下Pod存活时间 |
文档信息
- 本文作者:soveran zhong
- 本文链接:https://blog.clockwingsoar.cn/2026/05/11/k8s-taints-tolerations-best-practices/
- 版权声明:自由转载-非商用-非衍生-保持署名(创意共享3.0许可证)