# Pod Scheduling
By default, which Node a Pod runs on is worked out by the Scheduler component using the appropriate algorithms, and this process is not under manual control. In practice, however, that is often not enough: in many cases we want to steer certain Pods onto certain nodes. How do we do that? It requires understanding how Kubernetes schedules Pods. Kubernetes provides four broad categories of scheduling:
- Automatic scheduling: which node a Pod runs on is decided entirely by the Scheduler through a series of algorithms
- Directed scheduling: NodeName, NodeSelector
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint and toleration scheduling: Taints, Toleration
## Directed Scheduling
Directed scheduling means declaring nodeName or nodeSelector on the pod so that the Pod is scheduled onto the desired node. Note that this scheduling is mandatory: even if the target Node does not exist, the Pod is still assigned to it; the pod simply fails to run.
### **NodeName**
NodeName forcibly binds a Pod to the Node with the specified name. This approach skips the Scheduler's scheduling logic entirely and places the Pod directly on the node with that name.
Let's try it out. Create a pod-nodename.yaml file:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  nodeName: node01 # schedule this pod onto node01
```
```shell
# Check the Pod's NODE field; it was indeed scheduled onto node01
$ kubectl get pods pod-nodename -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 1/1 Running 0 62s 10.244.196.188 node01 <none> <none>
# Next, delete the pod and change nodeName to node03 (there is no node03 node)
# Checking again, the pod has been assigned to node03, but since node03 does not exist the pod cannot run
$ kubectl get pods pod-nodename -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 0/1 Pending 0 10s <none> node03 <none> <none>
```
### **NodeSelector**
NodeSelector schedules a pod onto nodes that carry the specified labels. It is implemented through Kubernetes' label-selector mechanism: before the pod is created, the scheduler uses the MatchNodeSelector policy to match labels, finds the target node, and then schedules the pod onto it. This matching rule is a hard constraint.
Let's try it out:
1) First, add a label to each node
```shell
$ kubectl label nodes node01 nodeenv=pro
node/node01 labeled
$ kubectl label nodes node02 nodeenv=test
node/node02 labeled
```
2) Create a pod-nodeselector.yaml file and use it to create a Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  nodeSelector:
    nodeenv: pro # schedule onto a node that has the label nodeenv=pro
```
```shell
# Check the Pod's NODE field; it was indeed scheduled onto node01
$ kubectl get pod pod-nodeselector -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeselector 1/1 Running 0 14s 10.244.196.189 node01 <none> <none>
# Next, delete the pod and change nodeSelector to nodeenv: xxxx (no node carries this label)
# Checking again, the pod stays Pending and its NODE field is <none>; describe shows why
$ kubectl describe pod pod-nodeselector
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
```
## Affinity Scheduling
The previous section introduced two forms of directed scheduling. They are very convenient to use, but they have a drawback: if no Node satisfies the condition, the Pod will not run at all, even when other usable Nodes exist in the cluster. This limits their use cases.
To address this, Kubernetes also provides affinity scheduling (Affinity). It extends **NodeSelector**: through configuration it can prefer Nodes that satisfy the conditions, and if none exist it can still schedule onto Nodes that do not, which makes scheduling more flexible.
Affinity comes in three flavors:
- nodeAffinity (node affinity): targets nodes; decides which nodes a pod may be scheduled onto
- podAffinity (pod affinity): targets pods; decides which existing pods a pod may share a topology domain with
- podAntiAffinity (pod anti-affinity): targets pods; decides which existing pods a pod must not share a topology domain with
> Notes on when to use affinity (and anti-affinity):
>
> **Affinity**: if two applications interact frequently, it is worth using affinity to keep them as close together as possible, reducing the performance cost of network communication.
>
> **Anti-affinity**: when an application is deployed with multiple replicas, it is worth using anti-affinity to spread the instances across nodes, which improves the service's availability.
### **NodeAffinity**
First, let's look at the configurable fields of `NodeAffinity`:
```shell
pod.spec.affinity.nodeAffinity
  requiredDuringSchedulingIgnoredDuringExecution  The node must satisfy all of the specified rules (hard constraint)
    nodeSelectorTerms  List of node selector terms
      matchFields  List of node selector requirements by node fields
      matchExpressions  List of node selector requirements by node labels (recommended)
        key  Label key
        values  Label values
        operator  Relational operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  preferredDuringSchedulingIgnoredDuringExecution  Prefer scheduling onto nodes that satisfy the specified rules (soft constraint / preference)
    preference  A node selector term, associated with the corresponding weight
      matchFields  List of node selector requirements by node fields
      matchExpressions  List of node selector requirements by node labels (recommended)
        key  Label key
        values  Label values
        operator  Relational operator; supports In, NotIn, Exists, DoesNotExist, Gt, Lt
    weight  Preference weight, in the range 1-100
```
```shell
Notes on the relational operators:
- matchExpressions:
  - key: nodeenv # match nodes that have a label whose key is nodeenv
    operator: Exists
  - key: nodeenv # match nodes whose nodeenv label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv # match nodes whose nodeenv label value is greater than "xxx" (Gt/Lt compare integer values; see the sketch below)
    operator: Gt
    values: ["xxx"]
```
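As a concrete sketch of the Gt case (assuming a hypothetical numeric node label such as cpu-cores, which does not exist in this chapter's cluster), the `affinity` fragment under `spec` could look like this:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cpu-cores     # hypothetical label whose value is an integer string, e.g. "8"
          operator: Gt       # Gt/Lt parse the label value and the listed value as integers
          values: ["4"]      # match nodes whose cpu-cores value is greater than 4
```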
Let's first demonstrate `requiredDuringSchedulingIgnoredDuringExecution`.
Create pod-nodeaffinity-required.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity: # affinity settings
    nodeAffinity: # node affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
        nodeSelectorTerms:
        - matchExpressions: # match labels whose nodeenv value is in ["xxx","yyy"]
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
```
```shell
# Check the pod status (it fails to run)
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-required 0/1 Pending 0 6s <none> <none> <none> <none>
# Check the Pod's details
# Scheduling failed; the message shows that node selection failed
$ kubectl describe pod pod-nodeaffinity-required
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 42s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
# Modify the file, changing values: ["xxx","yyy"] ------> ["pro","yyy"]
# Checking again, scheduling succeeds and the pod has been placed on node01
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-required 1/1 Running 0 7s 10.244.196.187 node01 <none> <none>
```
Next, let's demonstrate `preferredDuringSchedulingIgnoredDuringExecution`.
Create pod-nodeaffinity-preferred.yaml
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity: # affinity settings
    nodeAffinity: # node affinity
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 1
        preference:
          matchExpressions: # match labels whose nodeenv value is in ["xxx","yyy"] (none exist in the current environment)
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
```
```shell
# Create the pod
$ kubectl create -f pod-nodeaffinity-preferred.yaml
pod/pod-nodeaffinity-preferred created
# Check the pod status (it runs successfully)
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-preferred 1/1 Running 0 6s 10.244.196.186 node01 <none> <none>
```
```
Notes on NodeAffinity rules:
1) If both nodeSelector and nodeAffinity are defined, both conditions must be satisfied for the Pod to run on the target node
2) If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is sufficient (see the sketch below)
3) If a single nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
4) If the labels of the node a Pod is running on change while the Pod is running and no longer satisfy the Pod's node affinity, the system ignores the change
```
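For example, here is a minimal sketch illustrating rules 2 and 3 above (the disktype label is hypothetical and not set on this chapter's nodes): the two nodeSelectorTerms are OR'ed, while the two matchExpressions inside the first term must both be satisfied.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-terms
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:   # term 1: both expressions must match (AND)
          - key: nodeenv
            operator: In
            values: ["pro"]
          - key: disktype     # hypothetical label, not set in this chapter's cluster
            operator: Exists
        - matchExpressions:   # term 2: matching this term alone is also enough (OR)
          - key: nodeenv
            operator: In
            values: ["test"]
```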
### **PodAffinity**
PodAffinity uses running Pods as the reference: it places a newly created Pod in the same topology domain as the reference pod.
First, let's look at the configurable fields of `PodAffinity`:
```shell
pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution  Hard constraint
    namespaces  Namespaces of the reference pods
    topologyKey  Scheduling scope
    labelSelector  Label selector
      matchExpressions  List of selector requirements by pod labels (recommended)
        key  Label key
        values  Label values
        operator  Relational operator; supports In, NotIn, Exists, DoesNotExist
      matchLabels  Content equivalent to multiple matchExpressions, expressed as a map
  preferredDuringSchedulingIgnoredDuringExecution  Soft constraint
    podAffinityTerm  The affinity term
      namespaces
      topologyKey
      labelSelector
        matchExpressions
          key  Label key
          values  Label values
          operator
        matchLabels
    weight  Preference weight, in the range 1-100
```
```
topologyKey specifies the scope that is used during scheduling. For example:
  If set to kubernetes.io/hostname, each Node host is its own scope
  If set to beta.kubernetes.io/os, nodes are grouped by their operating system type
A zone-scoped sketch follows below.
```
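As a sketch of widening the scope (assuming the nodes carry the standard topology.kubernetes.io/zone label, which is not set in this chapter's cluster), only the topologyKey needs to change; this fragment sits under pod.spec:
```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          podenv: pro                            # co-locate with pods labeled podenv=pro
      topologyKey: topology.kubernetes.io/zone   # any node in the same zone satisfies the rule
```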
Next, let's demonstrate `requiredDuringSchedulingIgnoredDuringExecution`.
1) First, create a reference Pod, pod-podaffinity-target.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  labels:
    podenv: pro # set the label
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  nodeName: node01 # explicitly pin the target pod to node01
```
```shell
# Check the pod status
$ kubectl get pod -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-target 1/1 Running 0 10s 10.244.196.191 node01 <none> <none> podenv=pro
```
2) Create pod-podaffinity-required.yaml with the following content:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity: # affinity settings
    podAffinity: # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["xxx","yyy"]
          - key: podenv
            operator: In
            values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname
```
The configuration above means the new Pod must be placed on the same Node as a pod that carries the label podenv=xxx or podenv=yyy. Obviously no such pod exists right now, so let's run it and see what happens.
```shell
# Check the pod status; it is not running
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-required 0/1 Pending 0 67s <none> <none> <none> <none>
pod-podaffinity-target 1/1 Running 0 2m29s 10.244.196.191 node01 <none> <none>
# Check the details
$ kubectl describe pod pod-podaffinity-required
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 106s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match pod affinity rules. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
# Next, change values: ["xxx","yyy"] -----> values: ["pro","yyy"]
# This means the new Pod must be on the same Node as a pod labeled podenv=pro or podenv=yyy
# Then recreate the pod and check the result
# The Pod now runs normally
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-required 1/1 Running 0 15s 10.244.196.129 node01 <none> <none>
pod-podaffinity-target 1/1 Running 0 4m8s 10.244.196.191 node01 <none> <none>
```
The `preferredDuringSchedulingIgnoredDuringExecution` form of `PodAffinity` is not walked through in the cluster here; a minimal configuration sketch is given below for reference.
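This sketch (not run against the cluster here; the pod name is illustrative) prefers, with weight 50, to land on the same node as pods labeled podenv=pro:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-preferred
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: podenv
              operator: In
              values: ["pro"]
          topologyKey: kubernetes.io/hostname
```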
### **PodAntiAffinity**
PodAntiAffinity uses running Pods as the reference: it keeps a newly created Pod out of the topology domain occupied by the reference pod.
Its configuration fields are the same as PodAffinity's, so they are not explained again; let's go straight to a test case.
1) Continue to use the target pod from the previous case
```shell
$ kubectl get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
pod-podaffinity-target 1/1 Running 0 5m31s podenv=pro
```
2) Create pod-podantiaffinity-required.yaml with the following content:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
spec:
  containers:
  - name: nginx
    image: aaronxudocker/myapp:v1.0
  affinity: # affinity settings
    podAntiAffinity: # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["pro"]
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname
```
The configuration above means the new Pod must not be on the same Node as any pod carrying the label podenv=pro. Let's run it and see.
```shell
# Check the pod
# It was scheduled onto node02
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-target 1/1 Running 0 6m30s 10.244.196.191 node01 <none> <none>
pod-podantiaffinity-required 1/1 Running 0 8s 10.244.140.122 node02 <none> <none>
```
## Taints and Tolerations
**Taints**
The scheduling approaches so far all take the Pod's point of view: we add attributes to the Pod to decide whether it should be scheduled onto a given Node. We can also take the Node's point of view, by adding **taint** attributes to a Node to decide whether Pods are allowed to be scheduled onto it.
Once a Node is tainted, a repelling relationship exists between it and Pods, which makes the Node refuse to let Pods schedule onto it, and can even evict Pods that already exist on it.
The format of a taint is `key=value:effect`. key and value are the taint's label, and effect describes what the taint does. It supports three options:
- PreferNoSchedule: Kubernetes will **try to avoid scheduling Pods onto a Node with this taint**, unless no other node can take them
- NoSchedule: Kubernetes will **not schedule Pods onto a Node with this taint**, **but Pods already on the Node are unaffected**
- NoExecute: Kubernetes will **not schedule Pods onto a Node with this taint**, **and will also evict the Pods already running on the Node**
![image-20240918103251502](调度器/image-20240918103251502.png)
Example kubectl commands for setting and removing taints:
```shell
# Add a taint
kubectl taint nodes node1 key=value:effect
# Remove a specific taint
kubectl taint nodes node1 key:effect-
# Remove all taints with the given key
kubectl taint nodes node1 key-
```
Next, let's see taints in action:
```shell
# Add a PreferNoSchedule taint to node01
$ kubectl taint node node01 tag=eagle:PreferNoSchedule
node/node01 tainted
# Create 100 pods; all of them end up running on node02
$ kubectl create deployment myapp --image=aaronxudocker/myapp:v1.0 --replicas=100
deployment.apps/myapp created
$ kubectl get pod -o wide |grep node02|wc -l
100
# When creating 1000 pods, node02 runs out of capacity and fails the filtering (predicate) step, so some pods land on node01
$ kubectl get pod -o wide |grep node01|wc -l
106
$ kubectl get pod -o wide |grep node02|wc -l
103
# Change the taint on node01 (remove PreferNoSchedule, set NoSchedule)
$ kubectl taint nodes node01 tag:PreferNoSchedule-
node/node01 untainted
$ kubectl taint nodes node01 tag=eagle:NoSchedule
node/node01 tainted
# Create pods
$ kubectl create deployment myapp --image=aaronxudocker/myapp:v1.0 --replicas=100
deployment.apps/myapp created
$ kubectl get pod -o wide | grep node02 | wc -l
100
# After shutting down the node02 VM, newly created pods cannot be scheduled
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-9d57f8b4-2jf2f 0/1 Pending 0 4s <none> <none> <none> <none>
myapp-9d57f8b4-2rkhr 0/1 Pending 0 8s <none> <none> <none> <none>
myapp-9d57f8b4-2sfbn 0/1 Pending 0 7s <none> <none> <none> <none>
myapp-9d57f8b4-44hjs 0/1 Pending 0 4s <none> <none> <none> <none>
# Change the taint on node01 (remove NoSchedule, set NoExecute)
$ kubectl taint nodes node01 tag:NoSchedule-
# Create 10 pods; some of them run on node01
$ kubectl create deployment myapp --image=aaronxudocker/myapp:v1.0 --replicas=10
deployment.apps/myapp created
$ kubectl get pod -o wide |grep node01 |wc -l
5
# After setting NoExecute, the pods on node01 are evicted and all pods end up running on node02
$ kubectl taint nodes node01 tag=eagle:NoExecute
$ kubectl get pod -o wide |grep node01 |wc -l
0
```
```
Tip:
A cluster built with kubeadm adds a taint to the master (control-plane) node by default, so pods are not scheduled onto it.
```
**Tolerations**
We saw above that taints let us mark a node so that pods are not scheduled onto it. But what if we actually want to schedule a pod onto a tainted node? That is what **tolerations** are for.
<img src="调度器/image-20240918103742892.png" alt="image-20240918103742892" style="zoom:33%;" />
> A taint rejects, a toleration ignores: a Node uses taints to refuse pods, and a Pod uses tolerations to ignore that refusal
Let's look at the effect through a case first:
1. The previous subsection put a `NoExecute` taint on node01, so pods cannot be scheduled onto it
2. In this subsection we add a toleration to the pod so that it can be scheduled there
Create pod-toleration.yaml with the following content
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - image: aaronxudocker/myapp:v1.0
        name: myapp
      tolerations: # add a toleration
      - key: "tag" # key of the taint to tolerate
        operator: "Equal" # operator
        value: "eagle" # value of the taint to tolerate
        effect: "NoExecute" # the toleration rule; this must match the effect of the taint set on the node
```
```shell
# The pods after the toleration is added
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-6687b4f9d7-562lv 1/1 Running 0 3s 10.244.140.150 node02 <none> <none>
myapp-6687b4f9d7-6m7wb 1/1 Running 0 3s 10.244.196.128 node01 <none> <none>
myapp-6687b4f9d7-d9tgk 1/1 Running 0 3s 10.244.196.168 node01 <none> <none>
myapp-6687b4f9d7-fsqw7 1/1 Running 0 3s 10.244.140.159 node02 <none> <none>
myapp-6687b4f9d7-j7c4x 1/1 Running 0 3s 10.244.140.158 node02 <none> <none>
myapp-6687b4f9d7-j7dfx 1/1 Running 0 3s 10.244.196.178 node01 <none> <none>
myapp-6687b4f9d7-ll6z2 1/1 Running 0 3s 10.244.196.140 node01 <none> <none>
myapp-6687b4f9d7-lwqj4 1/1 Running 0 3s 10.244.140.167 node02 <none> <none>
myapp-6687b4f9d7-mvdtg 1/1 Running 0 3s 10.244.196.167 node01 <none> <none>
myapp-6687b4f9d7-qlw5x 1/1 Running 0 3s 10.244.140.147 node02 <none> <none>
```
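A common real-world use of tolerations is letting a pod (for example, a DaemonSet's pod template) run on the control-plane node despite the default taint mentioned in the tip above. A minimal sketch of the relevant fragment, assuming the kubeadm default taint node-role.kubernetes.io/control-plane:NoSchedule:
```yaml
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"   # tolerate the taint regardless of its value
  effect: "NoSchedule"
```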
Now let's look at the detailed toleration configuration:
```shell
$ kubectl explain pod.spec.tolerations
......
FIELDS:
key # the key of the taint to tolerate; empty means match every key
value # the value of the taint to tolerate
operator # the operator between key and value; supports Equal and Exists (default)
effect # the taint effect to match; empty means match every effect
tolerationSeconds # toleration period; only meaningful when effect is NoExecute, it is how long the pod may stay on the Node after the taint is added
```