Tuesday, 8 July 2025

EKS Admin Guide

 🧑‍💻 Amazon EKS Admin Guide: Troubleshooting Common Issues 

This guide covers how to troubleshoot and resolve: 

  1. Pod stuck in Pending

  2. Pod in CrashLoopBackOff

  3. Nodes not joining the cluster

  4. Persistent storage issues (EBS/EFS)

  5. EKS IAM and network configuration problems

 

1️⃣ Pod Stuck in Pending 

🔍 Common Causes 

  • No matching node (resources/taints/nodeSelector) 

  • PVC not bound (Persistent Volume issue) 

  • StorageClass not provisioning correctly 

  • Cluster autoscaler not scaling up 

  • Node affinity or toleration mismatch 

✅ How to Troubleshoot 

a) Describe the Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • 0/3 nodes are available 

  • FailedScheduling 

  • PVC errors 

b) Check PVC Status

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

Common issue: PVC in Pending → no volume provisioned 

c) Check Storage Class

kubectl get sc
kubectl describe sc <sc-name>

Verify: 

  • Is provisioner correct (e.g., ebs.csi.aws.com, efs.csi.aws.com)? 

  • Is the default class set correctly? (A patch command to set it is sketched below.)
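
To change the default, patch the is-default-class annotation. A minimal sketch, assuming your class is named gp3 (adjust the name to match your kubectl get sc output):

kubectl patch storageclass gp3 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'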

d) Check if Cluster Autoscaler is Enabled (Optional) 

If your node group is too small: 

  • Ensure cluster autoscaler is running and properly configured. 

  • Review AWS ASG and node pool settings (a quick health check is sketched below).
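
A quick health check, assuming the autoscaler was installed from the standard manifest (the deployment name and app label may differ in your setup):

kubectl get deployment -n kube-system cluster-autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[].{Name:AutoScalingGroupName,Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'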

 

2️⃣ Pod in CrashLoopBackOff 

🔍 Common Causes 

  • App crashes on start 

  • Misconfigured environment variables 

  • Missing secrets/configs 

  • Storage/mount issues (e.g., EBS or EFS volume error) 

  • Bad liveness/readiness probes 

✅ How to Troubleshoot 

a) View Logs

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

b) Describe Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • Exit code (e.g., ExitCode: 1); a jsonpath query to pull this is shown after the list

  • Event like: Back-off restarting failed container 

  • Volume mount errors 

  • Probe failure messages 
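
To pull the last exit code and reason directly (the [0] index assumes a single-container pod):

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}'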

c) Check Liveness/Readiness Probes

yaml

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

A bad probe causes the kubelet to restart the container even when the app itself is healthy.
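
If the app is simply slow to start, a startupProbe holds off liveness checks until it first succeeds. A minimal sketch reusing the /healthz endpoint from the example above (tune failureThreshold and periodSeconds to your app's worst-case startup time):

yaml

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10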

 

3️⃣ Node Not Joining the Cluster 

🔍 Common Causes 

  • IAM role misconfigured 

  • Incorrect bootstrap/user-data 

  • Security groups misconfigured 

  • Kubelet can’t connect to control plane 

✅ How to Troubleshoot 

a) Check EC2 Instance Logs (for self-managed nodes)

ssh -i <key> ec2-user@<instance-ip>
sudo journalctl -u kubelet -f
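
On the instance itself you can also confirm the bootstrap ran. A sketch for an EKS-optimized Amazon Linux AMI (log paths differ on other AMIs; with IMDSv2 enforced, the metadata call needs a session token first):

sudo tail -n 50 /var/log/cloud-init-output.log
curl -s http://169.254.169.254/latest/user-data | head -n 20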

b) Describe Node Group (for managed nodes)

aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name>

Look for the following (a focused query is sketched after the list):

  • Health conditions 

  • Launch template issues 

  • Node status 
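
To pull just the health findings, query the nodegroup.health.issues path of the describe-nodegroup output:

aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --query 'nodegroup.health.issues'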

c) Check Node IAM Role 

Make sure the instance role has the following policies attached (a verification command follows the list):

  • AmazonEKSWorkerNodePolicy 

  • AmazonEC2ContainerRegistryReadOnly 

  • AmazonEKS_CNI_Policy 

  • AmazonEBSCSIDriverPolicy (if using EBS) 
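
A quick way to verify what is actually attached (fill in your node role name):

aws iam list-attached-role-policies --role-name <node-role-name>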

d) Verify Security Groups 

  • Allow TCP 443 (to control plane) 

  • Allow node-to-node communication on all ports (for kube-proxy, kubelet, CoreDNS); a lookup for the cluster security group follows
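
To find and then inspect the cluster security group:

aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text
aws ec2 describe-security-groups --group-ids <sg-id>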

 

4️⃣ Storage Issues: EBS/EFS Volumes 

🔍 Common Causes 

  • Volume not attaching 

  • Volume already attached to another node 

  • EBS volume stuck attaching or detaching

  • PVC/PV not bound 

✅ How to Troubleshoot 

a) Describe the Pod

kubectl describe pod <pod-name> -n <namespace>

Look for: 

  • MountVolume.SetUp failed 

  • VolumeAttachment errors 

  • PVC stuck in Pending

b) Check PVC and PV

kubectl get pvc,pv -n <namespace>
kubectl describe pvc <pvc-name>
kubectl describe pv <pv-name>

c) Check Volume Status in AWS

aws ec2 describe-volumes --volume-ids <volume-id>

  • Is the volume in the in-use, available, or error state?

  • Is it attached to the right instance? (A one-line query for both is below.)
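
Both answers in one line:

aws ec2 describe-volumes --volume-ids <volume-id> \
  --query 'Volumes[0].[State,Attachments[0].InstanceId]' --output text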

d) Manually Detach a Stuck Volume (if needed)

aws ec2 detach-volume --volume-id <vol-id> --force

Note: --force can corrupt data if the volume is still mounted; use it only once the node is confirmed unhealthy.

e) CSI Logs

kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin

 

5️⃣ IAM, RBAC, and Networking Checks 

✅ IAM Role Binding (for Admin Access) 

If you're unable to access EKS as an admin: 

  1. Update the aws-auth ConfigMap:

kubectl edit -n kube-system configmap/aws-auth

Example:

yaml

mapRoles:
  - rolearn: arn:aws:iam::<account>:role/<EKS-Node-Role>
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes

  - rolearn: arn:aws:iam::<account>:role/AdminRole
    username: admin
    groups:
      - system:masters

  2. Ensure your IAM role or user is mapped in the ConfigMap, or add it with eksctl create iamidentitymapping (example below).
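
A sketch of the eksctl route, assuming an AdminRole like the one mapped above:

eksctl create iamidentitymapping \
  --cluster <cluster-name> \
  --region <region> \
  --arn arn:aws:iam::<account>:role/AdminRole \
  --username admin \
  --group system:masters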

 

🧪 Example Case: Pod Won’t Start Due to Volume Issue 

Pod YAML:

yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - mountPath: "/data"
      name: my-ebs
  volumes:
  - name: my-ebs
    persistentVolumeClaim:
      claimName: my-ebs-claim
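
For reference, a minimal my-ebs-claim that would pair with this pod; the gp3 class name and 5Gi size are assumptions for the sketch:

yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 5Gi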

PVC Describe Output:

kubectl describe pvc my-ebs-claim

Type:       PersistentVolumeClaim
Status:     Pending
Reason:     waiting for a volume to be created...

Action: 

  • Check StorageClass (a known-good example is sketched after this list)

  • Check CSI driver logs 

  • Check EBS volume in AWS Console 
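
If the StorageClass turns out to be the culprit, a known-good EBS CSI class looks like this sketch (the gp3 name matches the PVC above; WaitForFirstConsumer delays provisioning until the pod is scheduled, which avoids AZ mismatches):

yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer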

 

✅ Final Admin Checklist 

Component      | Check/Command
Pod status     | kubectl describe pod <pod>
PVC/PV         | kubectl get pvc,pv -n <ns> + describe
Logs           | kubectl logs <pod> + --previous
Node issues    | kubectl get nodes + describe node
IAM            | Verify IAM roles and the aws-auth ConfigMap
CSI drivers    | kubectl logs -n kube-system -l app=<csi-name>
EBS/EFS state  | aws ec2 describe-volumes or aws efs describe-...

 
