Troubleshooting the Kubernetes cluster
In this chapter we will learn how to troubleshoot our Kubernetes cluster at the control plane level and at the application level.
Troubleshooting the control plane
Listing the nodes in a cluster
The first thing to check when verifying that your cluster is healthy is the list of nodes associated with it.
kubectl get nodes
Make sure that all nodes are in the Ready state.
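On a cluster with many nodes, scanning the table by eye is error-prone. The following is a minimal sketch of a filter that prints only the nodes that need attention. The function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where the first column is the node name and the second is its status.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# Expects the output of `kubectl get nodes --no-headers` on stdin, where
# column 1 is the node name and column 2 is its status.
# Note: a cordoned node reports "Ready,SchedulingDisabled" and is listed too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reported Ready.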
List the control plane pods
If your nodes are up and running, the next thing to check is the status of the Kubernetes components. Run:
kubectl get pods -n kube-system
If any of the pods are restarting or crashing, look into the issue. Start by getting a description of the affected object. For example, in my cluster kube-dns is crashing. To fix this, first check the deployment for errors.
kubectl describe deployment -n kube-system kube-dns
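To find restarting or crashing pods without reading the whole listing, a small filter can help. This is a sketch only: the function name unhealthy_pods is hypothetical, and it assumes the default column order of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print name, status and restart count for pods that are not healthy.
# Expects the output of `kubectl get pods -n kube-system --no-headers` on stdin.
# A pod is reported when its status is neither Running nor Completed,
# or when it has restarted at least once.
unhealthy_pods() {
  awk '($3 != "Running" && $3 != "Completed") || $4 + 0 > 0 { print $1, $3, $4 }'
}

# Against a live cluster:
#   kubectl get pods -n kube-system --no-headers | unhealthy_pods
```

Each line of output names one pod worth describing in more detail.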
If your deployment looks good, the next thing to check is the log files. Their locations are given below.
/var/log/kube-apiserver.log - API Server logs
/var/log/kube-scheduler.log - Scheduler logs
/var/log/kube-controller-manager.log - Controller Manager logs
If your Kubernetes components are running as pods, you can get their logs with the commands below. Keep in mind that the actual pod names may differ from cluster to cluster.
kubectl logs -n kube-system -f kube-apiserver-node1
kubectl logs -n kube-system -f kube-scheduler-node1
kubectl logs -n kube-system -f kube-controller-manager-node1
On your worker nodes, check the kubelet's log for errors.
sudo journalctl -u kubelet
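The kubelet journal is usually long, so it can help to narrow it down to error lines. The sketch below assumes the kubelet writes its default text log format, in which each message starts with a severity letter followed by the date, for example E0102 for an error logged on January 2. If your kubelet is configured for JSON logging, this pattern will not match. The function name kubelet_errors is hypothetical.

```shell
# Keep only error (E) and fatal (F) lines from kubelet log output on stdin.
# Matches a severity letter, a four-digit date and an hh:mm:ss.micros time.
kubelet_errors() {
  grep -E '[[:space:]][EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+'
}

# Against a live node, limited to the last hour:
#   sudo journalctl -u kubelet --since "1 hour ago" --no-pager | kubelet_errors
```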
Troubleshooting the application
Sometimes your application (pod) may fail to start for various reasons. Let's see how to troubleshoot.
Getting detailed status of an object (pods, deployments)
object.status shows detailed information about the current state of an object (e.g. a pod) and why it is in that state. This can be very useful for identifying issues.
kubectl get pod vote -o yaml
Example output snippet when a wrong image was used to create a pod:
status:
  ...
  containerStatuses:
  ....
    state:
      waiting:
        message: 'rpc error: code = Unknown desc = Error response from daemon: manifest for schoolofdevops/vote:latst not found'
        reason: ErrImagePull
  hostIP: 22.214.171.124
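When only the reason is of interest, the full YAML is more than you need. As a sketch, kubectl's jsonpath output can pull out just the waiting reason of each container. The function name waiting_reason is hypothetical; the field path follows the status structure shown above.

```shell
# Print the waiting reason (for example ErrImagePull) of each container
# in the given pod. Prints nothing if no container is in the waiting state.
waiting_reason() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
}

# Usage:
#   waiting_reason vote
```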
Checking the status of Deployment
For this example I have a sample deployment called nginx.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  # apps/v1 requires a selector; it must match the template labels
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: ngnix:latest
        ports:
        - containerPort: 80
List the deployment to check the availability of its pods.
kubectl get deployment nginx
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx     1         1         1            0           20h
It is clear that my pod is unavailable. Let's dig further.
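The same comparison of desired and available replicas can be scripted, which is handy in a health check. This is a minimal sketch; the function name deployment_availability is hypothetical, and it reads the spec.replicas and status.availableReplicas fields of the Deployment through jsonpath.

```shell
# Report how many replicas of a deployment are available and return
# success only when the available count has reached the desired count.
deployment_availability() {
  name=$1
  desired=$(kubectl get deployment "$name" -o jsonpath='{.spec.replicas}')
  available=$(kubectl get deployment "$name" -o jsonpath='{.status.availableReplicas}')
  # availableReplicas is left out of the status when it is zero, so default to 0.
  available=${available:-0}
  echo "$name: $available/$desired available"
  [ "$available" -ge "$desired" ]
}

# Usage:
#   deployment_availability nginx || echo "nginx needs attention"
```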
Check the events of your deployment.
kubectl describe deployment nginx
List the pods to check for any registry-related errors.
kubectl get pods
NAME                     READY     STATUS             RESTARTS   AGE
nginx-57c88d7bb8-c6kpc   0/1       ImagePullBackOff   0          7m
As we can see, the image cannot be pulled (ImagePullBackOff). Let's investigate further.
kubectl describe pod nginx-57c88d7bb8-c6kpc
Check the Events section of the pod's description.
Events:
  Type     Reason                 Age                From                                                Message
  ----     ------                 ----               ----                                                -------
  Normal   Scheduled              9m                 default-scheduler                                   Successfully assigned nginx-57c88d7bb8-c6kpc to ip-11-0-1-111.us-west-2.compute.internal
  Normal   SuccessfulMountVolume  9m                 kubelet, ip-11-0-1-111.us-west-2.compute.internal   MountVolume.SetUp succeeded for volume "default-token-8cwn4"
  Normal   Pulling                8m (x4 over 9m)    kubelet, ip-11-0-1-111.us-west-2.compute.internal   pulling image "ngnix"
  Warning  Failed                 8m (x4 over 9m)    kubelet, ip-11-0-1-111.us-west-2.compute.internal   Failed to pull image "ngnix": rpc error: code = Unknown desc = Error response from daemon: repository ngnix not found: does not exist or no pull access
  Normal   BackOff                7m (x6 over 9m)    kubelet, ip-11-0-1-111.us-west-2.compute.internal   Back-off pulling image "ngnix"
  Warning  FailedSync             4m (x24 over 9m)   kubelet, ip-11-0-1-111.us-west-2.compute.internal   Error syncing pod
Bingo! The name of the image is ngnix instead of nginx. So fix the typo in your deployment file and redo the deployment.
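If you want to correct the running deployment right away, before reapplying the file, one option is to patch the image in place and wait for the rollout. This is a sketch, not the only way to do it; the function name fix_image and the 120 second timeout are arbitrary choices for illustration.

```shell
# Point one container of a deployment at a corrected image, then wait
# for the rollout to finish. The second step runs only if the first succeeds.
fix_image() {
  deployment=$1
  container=$2
  image=$3
  kubectl set image "deployment/$deployment" "$container=$image" &&
    kubectl rollout status "deployment/$deployment" --timeout=120s
}

# Usage:
#   fix_image nginx nginx nginx:latest
```

The deployment file still needs the same correction, otherwise the typo comes back the next time the file is applied.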
Sometimes your application (pod) may fail to start because of configuration issues. For those errors, we can follow the pod's logs.
kubectl logs -f nginx-57c88d7bb8-c6kpc
Any errors your application reports will show up in these logs.
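When a container keeps crashing and restarting, following the logs shows the new instance, which may not have failed yet. In that case the logs of the previous, terminated instance are usually more informative. A minimal sketch, with a hypothetical function name crash_logs and a default of 50 lines chosen only for illustration:

```shell
# Show the last lines logged by the previous (terminated) instance of a
# pod's container. The second argument is the number of lines, default 50.
crash_logs() {
  kubectl logs "$1" --previous --tail="${2:-50}"
}

# Usage:
#   crash_logs nginx-57c88d7bb8-c6kpc
#   crash_logs nginx-57c88d7bb8-c6kpc 200
```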