Lab K203 - Advanced Pod Scheduling
In the Kubernetes bootcamp training, we have seen how to create a pod and and some basic pod configurations to go with it. But this chapter explains some advanced topics related to pod scheduling.
From the api document for version 1.11 following are the pod specs which are relevant from scheduling perspective.
- nodeSelector
- nodeName
- affinity
- schedulerName
- tolerations
nodeName
You could bind a pod to a specific node with a name using nodeName spec. Lets take an example where you want to run the deployment for result
service on a specific node. Lets look at how you would do it,
Begin by listing the nodes
kubectl get nodes
[sample output]
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 35h v1.29.2
kind-worker Ready <none> 35h v1.29.2
kind-worker2 Ready <none> 35h v1.29.2
Now, bind your pod to one node e.g. kind-worker
by modyfying the deployment spec as:
File : result-deploy.yaml
apiVersion: apps/v1
kind: Deployment
....
.....
spec:
containers:
- image: schoolofdevops/vote-result
name: vote-result
nodeName: kind-worker
apply and validate
kubectl apply -f result-deploy.yaml
kubectl get pods -o wide
nodeSelector
Using nodeSelector instead of directly specifying nodeName in Kubernetes offers greater flexibility and resilience in scheduling pods. While nodeName forces a pod to schedule on a specific node, effectively bypassing Kubernetes' scheduler, nodeSelector allows for more dynamic placement by specifying a set of criteria that nodes must meet for the pod to be scheduled there. This approach utilizes Kubernetes' intelligent scheduling capabilities, enabling the system to consider multiple suitable nodes that meet the specified labels. This not only improves fault tolerance by avoiding dependencies on a single node but also facilitates more efficient resource utilization across the cluster. Additionally, nodeSelector supports scenarios where the environment might change, such as when nodes are added or removed, or their labels are updated, ensuring that the pods can still be scheduled according to the current state of the cluster.
To use nodeSelector, begin by labeling your nodes as:
kubectl get nodes --show-labels
kubectl label nodes <node-name> zone=aaa
kubectl get nodes --show-labels
e.g.
kubectl label nodes kind-worker zone=aaa
kubectl label nodes kind-worker2 zone=bbb
kubectl get nodes --show-labels
where, replace kind-worker and kind-worker2 can be the the actual nodes in your cluster.
Now update one of the deployments and add the nodeSelector spec to the pod e.g.
File : result-deploy.yaml
spec:
containers:
- image: schoolofdevops/vote-result
name: vote-result
nodeSelector:
zone: bbb
Note: ensure you have removed nodeName if present.
apply and validate
kubectl apply -f result-deploy.yaml
kubectl get pods -o wide
You shall see the pod being recreated now on the node matching the label selected using nodeSelector.
Affinity and Anti-Affinity
We have discussed about scheduling a pod on a particular node using NodeSelector. Using affinity and anti-affinity in Kubernetes offers a more sophisticated and granular level of control compared to nodeSelector, enabling not just simple label matching but also complex rules that govern pod placement. Affinity rules allow you to specify preferences that attract pods to certain nodes, either based on the node's properties or other pods that are already running on those nodes. Conversely, anti-affinity rules are used to ensure pods are spread across different nodes or node groups, enhancing high availability and reducing the risk of simultaneous failures. This is particularly useful in large-scale deployments where maintaining workload balance and resilience is crucial. For example, you can ensure that multiple instances of a service run in different racks or availability zones, minimizing potential impact from physical infrastructure failures. These features allow Kubernetes to more effectively manage the distribution and redundancy of workloads, which is vital for maintaining robust, scalable applications.
More over, using nodeSelector wolud mean defining a strict condition which must be met. If the condition is not met, the pod cannot be scheduled. Node/Pod affinity and anti-affinity solves this issue by introducing soft and hard conditions which are flexible based on when they are applied. This is controlled using the following properties
- required
-
preferred
-
DuringScheduling
- DuringExecution
and using theese operators
- In
- NotIn
- Exists
- DoesNotExist
- Gt
- Lt
Lets take up some examples and understand this.
nodeAffinity
Examine the current pod distribution
kubectl get pods -o wide --selector="role=vote"
and node labels
kubectl get nodes --show-labels
Lets create node affinity criteria as
- Pods for vote app must not run on the master nodes
- Pods for vote app preferably run on a node in zone bbb
First is a hard affinity versus second being soft affinity.
file: vote-deploy-nodeaffinity.yaml
....
template:
....
spec:
containers:
- name: app
image: schoolofdevops/vote:v1
ports:
- containerPort: 80
protocol: TCP
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: DoesNotExist
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: zone
operator: In
values:
- bbb
clearn up previous deployment and apply this code as
kubectl delete deploy vote
kubectl apply -f vote-deploy-nodeaffinity.yaml
kubectl get pods -o wide
podAffinity and podAntiAffinity
Lets define pod affinity criteria as,
- Pods for vote and redis should be co located as much as possible (preferred)
- No two pods with redis app should be running on the same node (required)
kubectl get pods -o wide --selector="role in (vote,redis)"
file: vote-deploy-podaffinity.yaml
...
template:
...
spec:
containers:
- name: app
image: schoolofdevops/vote:v1
ports:
- containerPort: 80
protocol: TCP
affinity:
...
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
labelSelector:
matchExpressions:
- key: role
operator: In
values:
- redis
topologyKey: kubernetes.io/hostname
file: redis-deploy-podaffinity.yaml
....
template:
...
spec:
containers:
- image: schoolofdevops/redis:latest
imagePullPolicy: Always
name: redis
ports:
- containerPort: 6379
protocol: TCP
restartPolicy: Always
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: role
operator: In
values:
- redis
topologyKey: "kubernetes.io/hostname"
clean up the previous deployments and apply as
kubectl delete deploy vote
kubectl delete deploy,sts redis
kubectl apply -f redis-deploy-podaffinity.yaml
kubectl apply -f vote-deploy-podaffinity.yaml
check the pods distribution
kubectl get pods -o wide --selector="role in (vote,redis)"
Observations from the above output,
- Since redis has a hard constraint not to be on the same node, you would observe redis pods being on differnt nodes (node2 and node4)
- since vote app has a soft constraint, you see some of the pods running on node4 (same node running redis), others continue to run on node 3
If you kill the pods on node3, at the time of scheduling new ones, scheduler meets all affinity rules
Now try scaling up redis instances
kubectl scale deploy/redis --replicas=4
kubectl get pods -o wide
- Are all redis pods runnning ? Why?
When you are done experimenting, revert to original configurations
kubectl delete deploy vote
kubectl delete deploy redis
kubectl apply -f vote-deploy.yaml -f redis-deploy.yaml
Taints and Tolerations
- Affinity is defined for pods
- Taints are defined for nodes
You could add the taints with criteria and effects. Effetcs can be
Taint Specs:
- effect
- NoSchedule
- PreferNoSchedule
- NoExecute
- key
- value
- timeAdded (only written for NoExecute taints)
Observe the pods distribution
kubectl get pods -o wide
Lets taint a node.
kubectl taint node kind-worker2 dedicated=worker:NoExecute
kubectl describe node kind-worker2
after tainting the node
kubectl get pods -o wide
All pods running on node2 just got evicted.
Add toleration in the Deployment for worker.
File: worker-deploy.yml
apiVersion: apps/v1
.....
template:
....
spec:
containers:
- name: app
image: schoolofdevops/vote-worker:latest
tolerations:
- key: "dedicated"
operator: "Equal"
value: "worker"
effect: "NoExecute"
apply
kubectl apply -f worker-deploy.yml
Observe the pod distribution now.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
db-66496667c9-qggzd 1/1 Running 0 4h 10.233.74.74 node4
redis-5bf748dbcf-ckn65 1/1 Running 0 3m 10.233.71.26 node3
redis-5bf748dbcf-vxppx 1/1 Running 0 31m 10.233.74.79 node4
result-5c7569bcb7-4fptr 1/1 Running 0 4h 10.233.71.18 node3
result-5c7569bcb7-s4rdx 1/1 Running 0 4h 10.233.74.75 node4
vote-56bf599b9c-22lpw 1/1 Running 0 30m 10.233.74.80 node4
vote-56bf599b9c-4l6bc 1/1 Running 0 12m 10.233.74.83 node4
vote-56bf599b9c-bqsrq 1/1 Running 0 12m 10.233.74.82 node4
vote-56bf599b9c-xw7zc 1/1 Running 0 12m 10.233.74.81 node4
worker-6cc8dbd4f8-6bkfg 1/1 Running 0 1m 10.233.75.15 node2
You should see worker being scheduled on kind-worker2
To remove the taint created above
kubectl taint node kind-worker2 dedicated=worker:NoExecute-
Exercise
- Master node is unschedulable because of a taint. Find the taint on the master node and remove it. See if new pods get scheduled on it after that.