Messed up storage solutions for AWS EKS

Photo by Taylor Vick on Unsplash

So, this is a continuation of the story of me handling the CTF deployment for Pentester Nepal's 11th Anniversary. During the deployment, I provisioned the EKS cluster and added the add-on for the EBS CSI driver, as I had done in similar setups before. I started with a single node in the Auto Scaling Group (ASG), since only CTFd was supposed to be hosted the day before the event and no challenges had been deployed yet. Before going into the details, let me introduce the AWS storage terms.

EKS: Stands for Elastic Kubernetes Service, a managed service from AWS that simplifies running Kubernetes on AWS without needing to operate your own Kubernetes control plane and nodes.

EBS: Stands for Elastic Block Store, which provides block-level storage for EC2 instances with high performance and durability for data-intensive applications. EBS volumes are designed for availability within a single Availability Zone. In EKS, they can be used through the AWS EBS CSI driver.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-storage-class
provisioner: ebs.csi.aws.com
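
For reference, a PersistentVolumeClaim along the lines of the sketch below (the claim name and size are illustrative) is how a workload would request a volume from this class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ctfd-data                  # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce                # EBS volumes attach to one node at a time
  storageClassName: ebs-storage-class
  resources:
    requests:
      storage: 10Gi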

EFS: Stands for Elastic File System, a storage solution provided by AWS that is NFS-compliant, making it usable across Availability Zones and even from on-premises. In EKS, it can be used through the AWS EFS CSI driver.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-storage-class
provisioner: efs.csi.aws.com
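
For contrast, an EFS-backed claim can use ReadWriteMany, so the same volume can be mounted by pods on multiple nodes. A minimal sketch follows (the claim name is illustrative; dynamic provisioning additionally requires parameters such as the EFS file system ID on the StorageClass):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files               # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany                # EFS allows concurrent access from multiple nodes
  storageClassName: efs-storage-class
  resources:
    requests:
      storage: 5Gi                 # required by the API; EFS itself is elastic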

The team uploaded the challenges to GitHub, and after analysing them I decided to use EBS, since none of the challenges required additional volumes to be mounted, which made EBS easy to manage for the storage we needed. Storage was mainly required for the database, Redis, and CTFd's uploaded files.

While I was deploying some of the challenges, CPU/memory pressure on the node caused the ASG to spin up a new node, which was later scaled back down to a single node. This was normal, but all of a sudden I noticed that CTFd's pod did not start because the scheduler was not assigning it to any node. This was new to me, as I had not faced such an issue before, so I started looking for logs explaining why it was not getting assigned. Searching the error everywhere did not turn up any promising answers, so I started looking into the YAML of the nodes, PVs, storage classes, and CTFd's deployment.

Error:

0/2 nodes are available: 2 node(s) had volume node affinity conflict. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

During this time I deleted the deployment and created it again to check whether it would get assigned, but the result was the same. Finally, looking into each PV, I saw it was bound to a node for some reason via the "selected node" key in the YAML. I took the risk and deleted that key from the YAML, but the scheduler still did not assign a node. Looking further, I observed that the YAML was also bound to an Availability Zone, and the current node was in a different AZ: during the scale-down, the ASG had deleted the older node in the AZ the EBS volume was bound to. I then updated the ASG to deploy into the same AZ as the volume, and once the new node came up, the scheduler finally assigned the pod and the platform was up and running.
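
For illustration, a dynamically provisioned EBS-backed PV typically carries a node affinity on the zone topology key, which is what produces the conflict above when no node is left in that zone (the zone value below is a placeholder):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone   # zone topology key set by the EBS CSI driver
              operator: In
              values:
                - us-east-1a                       # placeholder: the AZ the volume was created in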

So, what did I learn?

  • Firstly, EBS volumes are bound to an Availability Zone (AZ), and the Kubernetes nodes must be in the same AZ. If a pod gets scheduled onto a node in a different AZ from the volume, it will not be able to attach the PV (see the StorageClass sketch after this list).

  • EBS volumes can only be mounted to a single node at a time in ReadWriteOnce mode.
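
One mitigation worth noting (not something I had set at the time) is volumeBindingMode: WaitForFirstConsumer on the StorageClass, which delays volume creation until the pod is scheduled, so the volume is provisioned in that node's AZ. It does not help once a volume already exists; in that case the ASG still has to keep a node in the volume's AZ. A minimal sketch:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-storage-class
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # provision the volume in the AZ of the node the pod lands on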

To conclude this blog, choosing the right storage backend for AWS EKS depends on the specific use case. EBS provides high performance and low latency and works well for single-node, single-AZ workloads, while EFS offers flexibility and scalability with shared access across multiple nodes and AZs.