Amazon EKS: managing and fixing ETCD database size

Marcin Cuber
Jun 5, 2024 · 9 min read


A story detailing how to investigate and fix ETCD database issues when using EKS. You will find out how I managed to completely break our EKS cluster because of an overloaded ETCD.

[UPDATED 14/06/2024]

The Kyverno issue described in this story is being treated as the highest priority by the team. It is specifically related to Kyverno 1.12; please track this GitHub issue to avoid creation of too many ephemeralreports. Note that a fix is already available in 1.12.4-rc2 and the Kyverno team is working to release 1.12.4 as soon as possible.

Additionally, this is not the first report resource where overloading of ETCD has happened, so a solution has been proposed, currently named the Kyverno Reports Server. If you want to know more, please read -> https://kyverno.io/blog/2024/05/29/kyverno-reports-server-the-ultimate-solution-to-scale-reporting/

[UPDATED 01/07/2024]

Kyverno v1.12.4 is now released. If you are running 1.12, please upgrade to this version to pick up the fix for the ephemeralreports piling-up issue.

If you are seeing consistent creation of ephemeralreports, you can:

  1. disable reporting for admission events; please see this comment
  2. tune --aggregationWorkers to increase the capacity for consuming ephemeralreports; see this comment. It can be configured directly via the container flag, or through Helm extraArgs (see the example below).
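
For illustration, here is a minimal sketch of how the worker count could be raised via Helm. It assumes a Kyverno 3.x chart layout where the reports controller exposes an extraArgs map (reportsController.extraArgs) and uses an arbitrary value of 20 workers; verify the exact values path and a sensible worker count for your chart version before applying.

# Sketch only: bump --aggregationWorkers on the Kyverno reports controller.
# "reportsController.extraArgs" and the value 20 are assumptions, not verified defaults.
helm upgrade kyverno kyverno/kyverno \
  --namespace kyverno \
  --reuse-values \
  --set reportsController.extraArgs.aggregationWorkers=20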

Introduction

Let’s start with a little bit of information about the Amazon EKS structure. An EKS cluster consists of two primary components:

  1. The Amazon EKS control plane
  2. Amazon EKS nodes that are registered with the control plane

The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software, such as ETCD and the Kubernetes API server. The control plane runs in an account managed by AWS, and the Kubernetes API is exposed via the Amazon EKS endpoint associated with your cluster. Each Amazon EKS cluster control plane is single-tenant and unique, and runs on its own set of Amazon EC2 instances.

All of the data stored by the ETCD nodes and associated Amazon EBS volumes is encrypted using AWS KMS. The cluster control plane is provisioned across multiple Availability Zones and fronted by an Elastic Load Balancing Network Load Balancer. Amazon EKS also provisions elastic network interfaces in your VPC subnets to provide connectivity from the control plane instances to the nodes (for example, to support kubectl exec, logs, and proxy data flows).

My setup

EKS 1.30 and a kubectl CLI that matches that version:

$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0-eks-036c24b

What is ETCD? Why is it important to protect it and never let it reach its limit?

ETCD is an open-source, distributed, consistent key-value data store. All the objects that are part of a Kubernetes cluster (in my case EKS) are persistently stored and tracked in ETCD. ETCD is meant to be used as a consistent key-value store for configuration management, service discovery, and coordinating distributed work. The etcd documentation provides details on use cases and comparisons with other systems. When you create an EKS cluster, Amazon EKS provisions the maximum recommended database size for ETCD in Kubernetes, which is 8 GB. While an 8 GB etcd database is sufficient for most customer use cases, there are both valid and accidental scenarios where the maximum allowed database size can be exceeded.

Importantly, when the database size limit is exceeded, ETCD emits a no space alarm and stops taking further write requests. In essence, the EKS cluster becomes read-only, and all requests to mutate objects such as creating new pods, scaling deployments, etc., will be rejected by the cluster’s API server. Furthermore, users won’t be able to delete objects or object revisions to reclaim ETCD storage space. This is because deletion relies on the compaction operation to clean up objects, and compaction is not allowed when the no space alarm is active. While compaction frees up space within the ETCD database, it doesn’t free up the file system space taken up by the etcd database. To return the space back to the operating system, and to drop the size of the ETCD database, the defrag operation needs to run.
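
On EKS you cannot run these operations yourself because AWS operates the control plane, but for context, on a self-managed etcd the recovery sequence looks roughly like the sketch below (endpoints, TLS flags and the exact subcommand spelling depend on your etcd version and setup):

# Check whether the NOSPACE alarm is active
etcdctl alarm list
# Find the current revision, then compact away older revisions
# (some versions spell this subcommand "compaction")
REV=$(etcdctl endpoint status --write-out=json | grep -o '"revision":[0-9]*' | grep -o '[0-9]*' | head -1)
etcdctl compact "$REV"
# Defragment to return freed space to the filesystem, then clear the alarm
etcdctl defrag
etcdctl alarm disarm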

Monitor Control Plane Metrics and specifically ETCD Metrics

Monitoring Kubernetes API metrics can give you insights into control plane performance and identify issues. An unhealthy control plane can compromise the availability of the workloads running inside the cluster. For example, poorly written controllers can overload the API servers, affecting your application’s availability.

Kubernetes exposes control plane metrics at the /metrics endpoint.

You can view the metrics exposed using kubectl:

kubectl get --raw /metrics

This information will be important as the /metrics endpoint is heavily used, especially when there are storage issues with ETCD.

Important: in the Amazon EKS environment, etcd storage is limited to 8 GiB as per upstream guidance. You can monitor the current database size by running the command below. If your cluster has a Kubernetes version below 1.28, replace apiserver_storage_size_bytes with the following:

  • Kubernetes version 1.27 and 1.26 – apiserver_storage_db_total_size_in_bytes
  • Kubernetes version 1.25 and below – etcd_db_total_size_in_bytes

kubectl get --raw=/metrics | grep "apiserver_storage_size_bytes"
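
As a simple watch, you can compare the reported size against a threshold and alert before the 8 GiB quota is hit. The sketch below assumes Kubernetes 1.28+ (apiserver_storage_size_bytes) and an arbitrary 7 GiB warning threshold; adjust the metric name and threshold for your cluster.

# Warn when the ETCD database size reported by the API server crosses ~7 GiB
kubectl get --raw=/metrics \
  | awk -v threshold=$((7 * 1024 * 1024 * 1024)) '
      /^apiserver_storage_size_bytes/ {
        if ($2 + 0 >= threshold)
          printf "WARNING: ETCD database size %.0f bytes >= %d\n", $2, threshold
        else
          printf "OK: ETCD database size %.0f bytes\n", $2
      }'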

Broken ETCD

Recently, I came across an EKS cluster production issue where the cluster was effectively in read-only mode as the ETCD database space had been exceeded. This was a P1 incident and required a number of hours to resolve together with AWS Support.

So let me show you some of the information I gathered and the steps executed with the help of AWS, along with the many lessons I have learnt.

Firstly, it is important to note that, based on my experience, Amazon EKS clusters come with 8 GB of ETCD storage space; however, this is a soft limit. The hard limit is 10 GB, which we managed to breach as well, and this caused a major lock across the entire EKS control plane, which is managed by AWS in their centralised account.

How to detect when ETCD is out of space?

The Amazon EKS control plane logging feature provides audit and diagnostic logs directly from the cluster’s control plane to Amazon CloudWatch Logs in your account. One of the log types that can be enabled is Kubernetes audit logs. Audit logs provide a record of the individual users, administrators, or system components that have affected your cluster. When the cluster exceeds the limit of ETCD’s database size, the audit logs show the error response string “database space exceeded”. You can use the following Amazon CloudWatch Logs Insights query to find the timestamp of when this error message was first seen.

fields @timestamp, @message, @logStream
| filter @logStream like /kube-apiserver-audit/
| filter @message like /mvcc: database space exceeded/
| limit 10

A matching audit log entry contains:

responseObject.code     500
responseObject.kind     Status
responseObject.message  etcdserver: mvcc: database space exceeded
responseObject.status   Failure
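
If you prefer the terminal, the same query can be driven through the AWS CLI. This is a sketch that assumes control plane audit logging is enabled and the default log group name /aws/eks/<cluster-name>/cluster; the 24-hour window and GNU date syntax are illustrative.

# Run the Logs Insights query against the EKS control plane log group
CLUSTER_NAME="my-cluster" # replace with your cluster name
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/eks/${CLUSTER_NAME}/cluster" \
  --start-time "$(date -d '24 hours ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message, @logStream | filter @logStream like /kube-apiserver-audit/ | filter @message like /mvcc: database space exceeded/ | limit 10' \
  --query 'queryId' --output text)
sleep 10 # give the query a moment to complete
aws logs get-query-results --query-id "$QUERY_ID"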

How to identify what is consuming ETCD database space?

Object count

It can happen that the count of total objects stored in ETCD leads to an increase in storage consumption. The Kubernetes API server exposes a metric that shows the count of objects by type.

# 1.22 and later
kubectl get --raw=/metrics | grep apiserver_storage_objects | awk '$2>100' | sort -g -k 2
# 1.21 and earlier
kubectl get --raw=/metrics | grep etcd_object_counts | awk '$2>100' | sort -g -k 2

Example output you can expect:

$ kubectl get --raw=/metrics | grep apiserver_storage_objects |awk '$2>100' |sort -g -k 2
apiserver_storage_objects{resource="controllerrevisions.apps"} 109
apiserver_storage_objects{resource="externalsecrets.external-secrets.io"} 116
apiserver_storage_objects{resource="rolebindings.rbac.authorization.k8s.io"} 124
apiserver_storage_objects{resource="customresourcedefinitions.apiextensions.k8s.io"} 128
apiserver_storage_objects{resource="clusterrolebindings.rbac.authorization.k8s.io"} 135
apiserver_storage_objects{resource="policyreports.wgpolicyk8s.io"} 135
apiserver_storage_objects{resource="deployments.apps"} 163
apiserver_storage_objects{resource="clusterroles.rbac.authorization.k8s.io"} 170
apiserver_storage_objects{resource="serviceaccounts"} 171
apiserver_storage_objects{resource="configmaps"} 187
apiserver_storage_objects{resource="secrets"} 217
apiserver_storage_objects{resource="pods"} 619
apiserver_storage_objects{resource="replicasets.apps"} 1091
apiserver_storage_objects{resource="admissionreports.kyverno.io"} 2854
apiserver_storage_objects{resource="events"} 2955

How to reclaim etcd database space?

You can clean up unused or orphaned objects using the kubectl delete command.

For example, the shell script below shows how to delete the admissionreports.kyverno.io and/or ephemeralreports.reports.kyverno.io objects.

#!/bin/bash

COUNTER=5 # set the counter to (total number of to-be-deleted objects / $LIMIT)
LIMIT=1000 # number of objects to delete per kubectl call

# If the script outputs "Error from server (NotFound): the server could not find the requested resource",
# the following arguments may be misconfigured.
GROUP="reports.kyverno.io" # replace with your own resource group as needed
VERSION="v1" # replace with your own resource version as needed
NAMESPACE="trading" # replace with your own resource namespace as needed
KIND="ephemeralreports" # replace with your own resource kind as needed

# Remove comments to remove admissionreports
# GROUP="kyverno.io"
# VERSION="v2"
# NAMESPACE="trading"
# KIND="admissionreports"

function formRequestURI(){
  echo "/apis/$GROUP/$VERSION/namespaces/$NAMESPACE/$KIND"
}

requestURI=$(formRequestURI)
echo "Going to delete unwanted objects, requestURI: $requestURI, counter: $COUNTER, limit: $LIMIT"

for (( i = 1; i <= $COUNTER; i++ ))
do
  kubectl delete --now=true --wait=false --raw "$requestURI?limit=$LIMIT"
done

Notes:

  • counter: depending on the number of objects to be deleted, you can change the for-loop iteration count as required.
  • limit: kubectl delete --now=true --wait=false --raw $requestURI can time out without an appropriate limit due to a known issue, i.e. the GitHub link.
  • To form the requestURI: the resource URI is commonly formed as /apis/<group>/<version>/namespaces/<namespace>/<kind>; see the official doc: https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-uris and the quick check below.
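
For the values configured in the script above, the request URI expands to /apis/reports.kyverno.io/v1/namespaces/trading/ephemeralreports. A quick sanity check before running the deletion loop is a raw GET with a small limit; if the arguments are wrong you will see the NotFound error mentioned in the script comments:

# Verify the URI resolves before deleting anything
kubectl get --raw "/apis/reports.kyverno.io/v1/namespaces/trading/ephemeralreports?limit=1"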

Importantly, from my observation, the script above will only work when the ETCD database is not yet locked. So please make sure you monitor your ETCD size and never let it reach the 8 GB limit.
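
If the cluster is still responsive, a simpler (though potentially slower) alternative to the raw-URI loop is a standard namespace-scoped bulk delete. This is a sketch only; with hundreds of thousands of objects it can take a long time or time out, which is exactly why the limited raw deletes above exist.

# Delete all ephemeral reports in the affected namespace without waiting for finalisation
kubectl delete ephemeralreports.reports.kyverno.io -n trading --all --wait=false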

Broken EKS and its ETCD: steps and resolution

As already mentioned, I managed to completely break ETCD with EKS 1.30. Here are the steps that were executed to bring the cluster back to a healthy status.

  1. Identify which objects occupy the most ETCD space; in my case it was the ephemeralreports resources that caused the issues.
  2. Using the script above, I tried deleting those objects from ETCD; however, that did not work since ETCD was completely blocked.
  3. Engaged AWS Support with a business-critical ticket and escalated it as much as possible. It was clear that I couldn’t fix it from my end.
  4. After the initial support investigation, where we tested different scripts, we were not able to unlock the ETCD storage. So the EKS/ETCD team decided to increase the storage to 12 GB from 8/10 GB.
  5. Increasing the storage to 12 GB unlocked it; however, this caused other issues where none of the kubectl commands were actioned. It seemed like ETCD had lost connectivity with the API server.
  6. Further investigation and tests showed that the API server was lagging and ETCD was struggling with the leaked ephemeralreports objects. This investigation was carried out internally by the AWS EKS team.
  7. After hours of debugging and investigating, it was deemed that the AWS ETCD team needed to directly remove objects from the storage so that ETCD could function again.
  8. Lastly, an AWS Principal Engineer had to be engaged, who needed approval from my side and an AWS top-level director to clear out objects directly from ETCD using a command such as: meks etcdctl delete --prefix --key "/registry/reports.kyverno.io/ephemeralreports/trading" --create-review

The final step which required direct deletion of the leaked objects from ETCD solved the problem and EKS started functioning again.

In case you are interested, the entire issue took over 7 hours to resolve, and I made it clear to AWS Support a couple of times that they should remove the objects immediately by accessing ETCD on their end. I have seen ETCD issues before, so I knew this would likely be needed regardless. Such a long production incident could have been reduced from 7 hours to 30 minutes; however, AWS Support preferred not to listen. I had even authorised the support engineers in writing to action the removal of objects from ETCD immediately.

I hope this story is something that won’t happen to you. However, if you come across such problems then you will know how to deal with them.

Sponsor Me

Like with any other story on Medium written by me, I performed the tasks documented. This is my own research, based on issues I have encountered.

Thanks for reading everybody. Marcin Cuber

Written by Marcin Cuber

Principal Cloud Engineer, AWS Community Builder and Solutions Architect
