Tuesday, 8 July 2025

EKS Admin Guide

 🧑‍💻 Amazon EKS Admin Guide: Troubleshooting Common Issues 

This guide covers how to troubleshoot and resolve: 

  1. Pod stuck in Pending

  2. Pod in CrashLoopBackOff

  3. Nodes not joining the cluster

  4. Persistent storage issues (EBS/EFS)

  5. EKS IAM and network configuration problems

 

1️⃣ Pod Stuck in Pending 

🔍 Common Causes 

  • No matching node (resources/taints/nodeSelector) 

  • PVC not bound (Persistent Volume issue) 

  • StorageClass not provisioning correctly 

  • Cluster autoscaler not scaling up 

  • Node affinity or toleration mismatch 

✅ How to Troubleshoot 

a) Describe the Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • 0/3 nodes are available 

  • FailedScheduling 

  • PVC errors 

b) Check PVC Status

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

Common issue: PVC in Pending → no volume provisioned 

c) Check Storage Class

kubectl get sc
kubectl describe sc <sc-name>

Verify: 

  • Is provisioner correct (e.g., ebs.csi.aws.com, efs.csi.aws.com)? 

  • Is the default class set correctly? (A patch command to set it is sketched below.)
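
To change the default, patch the is-default-class annotation. A minimal sketch, assuming your class is named gp3 (adjust the name to match your kubectl get sc output):

kubectl patch storageclass gp3 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'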

d) Check if Cluster Autoscaler is Enabled (Optional) 

If your node group is too small: 

  • Ensure cluster autoscaler is running and properly configured. 

  • Review AWS ASG and node pool settings (a quick health check is sketched below).
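
A quick health check, assuming the autoscaler was installed from the standard manifest (the deployment name and app label may differ in your setup):

kubectl get deployment -n kube-system cluster-autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[].{Name:AutoScalingGroupName,Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'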

 

2️⃣ Pod in CrashLoopBackOff 

🔍 Common Causes 

  • App crashes on start 

  • Misconfigured environment variables 

  • Missing secrets/configs 

  • Storage/mount issues (e.g., EBS or EFS volume error) 

  • Bad liveness/readiness probes 

✅ How to Troubleshoot 

a) View Logs

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

b) Describe Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • Exit code (e.g., ExitCode: 1); a jsonpath query to pull this is shown after the list

  • Event like: Back-off restarting failed container 

  • Volume mount errors 

  • Probe failure messages 
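
To pull the last exit code and reason directly (the [0] index assumes a single-container pod):

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}'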

c) Check Liveness/Readiness Probes

yaml

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

A bad probe causes the kubelet to restart the container even when the app itself is healthy.
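
If the app is simply slow to start, a startupProbe holds off liveness checks until it first succeeds. A minimal sketch reusing the /healthz endpoint from the example above (tune failureThreshold and periodSeconds to your app's worst-case startup time):

yaml

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10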

 

3️⃣ Node Not Joining the Cluster 

🔍 Common Causes 

  • IAM role misconfigured 

  • Incorrect bootstrap/user-data 

  • Security groups misconfigured 

  • Kubelet can’t connect to control plane 

✅ How to Troubleshoot 

a) Check EC2 Instance Logs (for self-managed nodes)

ssh -i <key> ec2-user@<instance-ip>
sudo journalctl -u kubelet -f
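
On the instance itself you can also confirm the bootstrap ran. A sketch for an EKS-optimized Amazon Linux AMI (log paths differ on other AMIs; with IMDSv2 enforced, the metadata call needs a session token first):

sudo tail -n 50 /var/log/cloud-init-output.log
curl -s http://169.254.169.254/latest/user-data | head -n 20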

b) Describe Node Group (for managed nodes)

aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name>

Look for the following (a focused query is sketched after the list):

  • Health conditions 

  • Launch template issues 

  • Node status 
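
To pull just the health findings, query the nodegroup.health.issues path of the describe-nodegroup output:

aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --query 'nodegroup.health.issues'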

c) Check Node IAM Role 

Make sure the instance role has the following policies attached (a verification command follows the list):

  • AmazonEKSWorkerNodePolicy 

  • AmazonEC2ContainerRegistryReadOnly 

  • AmazonEKS_CNI_Policy 

  • AmazonEBSCSIDriverPolicy (if using EBS) 
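
A quick way to verify what is actually attached (fill in your node role name):

aws iam list-attached-role-policies --role-name <node-role-name>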

d) Verify Security Groups 

  • Allow TCP 443 (to control plane) 

  • Allow node-to-node communication on all ports (for kube-proxy, kubelet, CoreDNS); a lookup for the cluster security group follows
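
To find and then inspect the cluster security group:

aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text
aws ec2 describe-security-groups --group-ids <sg-id>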

 

4️⃣ Storage Issues: EBS/EFS Volumes 

🔍 Common Causes 

  • Volume not attaching 

  • Volume already attached to another node 

  • EBS volume stuck attaching or detaching

  • PVC/PV not bound 

✅ How to Troubleshoot 

a) Describe the Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • MountVolume.SetUp failed 

  • VolumeAttachment errors 

  • PVC stuck in Pending

b) Check PVC and PV

kubectl get pvc,pv -n <namespace>
kubectl describe pvc <pvc-name>
kubectl describe pv <pv-name>

c) Check Volume Status in AWS

aws ec2 describe-volumes --volume-ids <volume-id>

  • Is the volume in the in-use, available, or error state?

  • Is it attached to the right instance? (A one-line query for both is below.)
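
Both answers in one line:

aws ec2 describe-volumes --volume-ids <volume-id> \
  --query 'Volumes[0].[State,Attachments[0].InstanceId]' --output text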

d) Manually Detach a Stuck Volume (if needed)

aws ec2 detach-volume --volume-id <vol-id> --force

Note: --force can corrupt data if the volume is still mounted; use it only once the node is confirmed unhealthy.

e) CSI Logs

kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin

 

5️⃣ IAM, RBAC, and Networking Checks 

✅ IAM Role Binding (for Admin Access) 

If you're unable to access EKS as an admin: 

  1. Update the aws-auth ConfigMap:

kubectl edit -n kube-system configmap/aws-auth

Example:

yaml

mapRoles:
  - rolearn: arn:aws:iam::<account>:role/<EKS-Node-Role>
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes

  - rolearn: arn:aws:iam::<account>:role/AdminRole
    username: admin
    groups:
      - system:masters

  2. Ensure your IAM role or user is mapped in the ConfigMap, or add it with eksctl create iamidentitymapping (example below).
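
A sketch of the eksctl route, assuming an AdminRole like the one mapped above:

eksctl create iamidentitymapping \
  --cluster <cluster-name> \
  --region <region> \
  --arn arn:aws:iam::<account>:role/AdminRole \
  --username admin \
  --group system:masters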

 

🧪 Example Case: Pod Won’t Start Due to Volume Issue 

Pod YAML:

yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - mountPath: "/data"
      name: my-ebs
  volumes:
  - name: my-ebs
    persistentVolumeClaim:
      claimName: my-ebs-claim
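
For reference, a minimal my-ebs-claim that would pair with this pod; the gp3 class name and 5Gi size are assumptions for the sketch:

yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 5Gi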

PVC Describe Output:

kubectl describe pvc my-ebs-claim

Type:       PersistentVolumeClaim
Status:     Pending
Reason:     waiting for a volume to be created...

Action: 

  • Check StorageClass (a known-good example is sketched after this list)

  • Check CSI driver logs 

  • Check EBS volume in AWS Console 
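
If the StorageClass turns out to be the culprit, a known-good EBS CSI class looks like this sketch (the gp3 name matches the PVC above; WaitForFirstConsumer delays provisioning until the pod is scheduled, which avoids AZ mismatches):

yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer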

 

✅ Final Admin Checklist 

Component      | Check/Command
Pod status     | kubectl describe pod <pod>
PVC/PV         | kubectl get pvc,pv -n <ns> + describe
Logs           | kubectl logs <pod> + --previous
Node issues    | kubectl get nodes + describe node
IAM            | Verify IAM roles and the aws-auth ConfigMap
CSI drivers    | kubectl logs -n kube-system -l app=<csi-name>
EBS/EFS state  | aws ec2 describe-volumes or aws efs describe-...

 
