Amazon EKS: etcd monitoring and alerting using Container Insights
Find out how to implement CloudWatch monitoring and alerting for your EKS control plane using Terraform. We will specifically focus on etcd and the API server.
Introduction
In this story I will cover how to implement, easily and with Terraform, monitoring and alerting for the K8s control-plane metrics I consider most important. Until recently, I thought the managed control plane would be well covered by AWS and their support; however, this changed after I experienced a complete lockdown of my production EKS cluster, caused by the etcd database being full and locked. You can read about this incident in my previous story, which provides a lot more detail. Read more -> Amazon EKS- managing and fixing ETCD database size
Tools and applications used
- Amazon EKS 1.30
- Terraform 1.6
- Terraform aws provider 5.54.0
- EKS Container Insights (deployed/enabled through EKS addon)
- CloudWatch logs, metrics, alerts and dashboards
- SNS and SQS
Enabling Container Insights
I am using the Amazon CloudWatch Observability EKS add-on to enable Container Insights with enhanced observability for Amazon EKS. This allows us to collect infrastructure metrics, application performance telemetry, and container logs from the Amazon EKS cluster.
locals {
  amazon_cloudwatch_observability_config = file("${path.module}/configs/amazon-cloudwatch-observability.json")
}

resource "aws_eks_addon" "amazon_cloudwatch_observability" {
  cluster_name         = aws_eks_cluster.cluster.name
  addon_name           = "amazon-cloudwatch-observability"
  addon_version        = "v1.7.0-eksbuild.1"
  configuration_values = local.amazon_cloudwatch_observability_config
}
# amazon-cloudwatch-observability.json
{
  "agent": {
    "config": {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "enhanced_container_insights": true
          }
        }
      }
    }
  },
  "containerLogs": {
    "enabled": false
  }
}
As you can see, I am enabling enhanced_container_insights, since those metrics include the control-plane metrics I am after. You can also see that I am disabling containerLogs, because I use a different centralised solution for logs. If you do decide to enable container logs, make sure you have the budget for it; in my opinion it is super expensive.
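One more thing worth checking before deploying: the agent that the add-on installs needs IAM permissions to publish metrics and logs to CloudWatch. Below is a minimal sketch that grants them through the worker-node role; the aws_iam_role.node reference is hypothetical, and IRSA or EKS Pod Identity are common alternatives.
# Grant the CloudWatch agent pods permission to publish telemetry via the
# node instance role. "aws_iam_role.node" is a hypothetical reference;
# adjust to your own node role, or use IRSA / EKS Pod Identity instead.
resource "aws_iam_role_policy_attachment" "cloudwatch_agent" {
  role       = aws_iam_role.node.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}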
Once the above is deployed, you will be able to see a Container Insights dashboard for your cluster, which will look as follows:
As you can see, there is data about both the control plane and the worker nodes that are part of the EKS cluster. You can see even more detail when you go into the performance dashboard.
EKS control-plane dashboard
For control-plane metrics, I have built my own dashboard, since the ones provided by AWS are outdated and use incorrect metrics. Below you can see the four graphs I am interested in.
In case you want to deploy them quickly, here is the dashboard body that you can import. Just make sure to replace ACCOUNT_ID and CLUSTER_NAME with your own values.
{
  "widgets": [
    {
      "height": 6,
      "width": 12,
      "y": 0,
      "x": 12,
      "type": "metric",
      "properties": {
        "region": "eu-west-2",
        "title": "API server requests",
        "legend": {
          "position": "bottom"
        },
        "timezone": "LOCAL",
        "metrics": [
          [ "ContainerInsights", "apiserver_request_total", "ClusterName", "CLUSTER_NAME", { "id": "mm1m0", "stat": "Sum", "yAxis": "left", "accountId": "ACCOUNT_ID" } ],
          [ ".", "apiserver_request_duration_seconds", ".", ".", { "id": "mm2m0", "stat": "Average", "yAxis": "right", "accountId": "ACCOUNT_ID" } ]
        ],
        "liveData": false,
        "period": 60,
        "yAxis": {
          "left": {
            "label": "Count",
            "showUnits": false
          },
          "right": {
            "label": "Seconds",
            "showUnits": false
          }
        }
      }
    },
    {
      "height": 6,
      "width": 12,
      "y": 6,
      "x": 0,
      "type": "metric",
      "properties": {
        "region": "eu-west-2",
        "title": "API server admission controller duration / ETCD request duration",
        "legend": {
          "position": "bottom"
        },
        "timezone": "LOCAL",
        "metrics": [
          [ "ContainerInsights", "apiserver_admission_controller_admission_duration_seconds", "ClusterName", "CLUSTER_NAME", { "id": "mm1m0", "stat": "Average", "yAxis": "left", "accountId": "ACCOUNT_ID" } ],
          [ ".", "etcd_request_duration_seconds", ".", ".", { "id": "mm2m0", "label": "etcd_request_duration_seconds (alpha)", "stat": "Average", "yAxis": "right", "accountId": "ACCOUNT_ID" } ]
        ],
        "liveData": false,
        "period": 60,
        "yAxis": {
          "left": {
            "label": "Seconds",
            "showUnits": false
          },
          "right": {
            "label": "Seconds",
            "showUnits": false
          }
        }
      }
    },
    {
      "height": 6,
      "width": 12,
      "y": 0,
      "x": 0,
      "type": "metric",
      "properties": {
        "metrics": [
          [ "ContainerInsights", "apiserver_storage_objects", "ClusterName", "CLUSTER_NAME", { "id": "mm1m0", "stat": "Maximum", "yAxis": "left", "accountId": "ACCOUNT_ID", "region": "eu-west-2" } ],
          [ "ContainerInsights", "apiserver_storage_size_bytes", "ClusterName", "CLUSTER_NAME", { "id": "mm2m0", "stat": "Maximum", "yAxis": "right", "accountId": "ACCOUNT_ID", "region": "eu-west-2" } ]
        ],
        "region": "eu-west-2",
        "title": "API server storage objects",
        "legend": {
          "position": "bottom"
        },
        "timezone": "LOCAL",
        "liveData": false,
        "period": 60,
        "yAxis": {
          "left": {
            "label": "Count",
            "showUnits": false
          },
          "right": {
            "label": "Bytes",
            "showUnits": false
          }
        },
        "view": "timeSeries",
        "stacked": false
      }
    },
    {
      "height": 6,
      "width": 12,
      "y": 6,
      "x": 12,
      "type": "metric",
      "properties": {
        "region": "eu-west-2",
        "title": "REST client requests",
        "legend": {
          "position": "bottom"
        },
        "timezone": "LOCAL",
        "metrics": [
          [ "ContainerInsights", "rest_client_requests_total", "ClusterName", "CLUSTER_NAME", { "id": "mm1m0", "label": "rest_client_requests_total (alpha)", "stat": "Sum", "yAxis": "left", "accountId": "ACCOUNT_ID" } ],
          [ "ContainerInsights", "rest_client_request_duration_seconds", "ClusterName", "CLUSTER_NAME", { "id": "mm2m0", "label": "rest_client_request_duration_seconds (alpha)", "stat": "Average", "yAxis": "right", "accountId": "ACCOUNT_ID" } ]
        ],
        "liveData": false,
        "period": 60,
        "yAxis": {
          "left": {
            "label": "Count",
            "showUnits": false
          },
          "right": {
            "label": "Seconds",
            "showUnits": false
          }
        },
        "view": "timeSeries",
        "stacked": false
      }
    }
  ]
}
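If you would rather not import the JSON by hand, the dashboard can also be managed with Terraform. Here is a sketch, assuming you save the JSON above as configs/eks-control-plane-dashboard.json and replace the ACCOUNT_ID and CLUSTER_NAME placeholders with ${account_id} and ${cluster_name} so that templatefile() can substitute them; the file name and variable names are my own, not from the original setup.
data "aws_caller_identity" "current" {}

resource "aws_cloudwatch_dashboard" "eks_control_plane" {
  dashboard_name = "eks-${aws_eks_cluster.cluster.name}-control-plane"

  # Dashboard body is the JSON above, with the templatefile() placeholders
  # ${account_id} and ${cluster_name} substituted at plan time.
  dashboard_body = templatefile("${path.module}/configs/eks-control-plane-dashboard.json", {
    account_id   = data.aws_caller_identity.current.account_id
    cluster_name = aws_eks_cluster.cluster.name
  })
}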
CloudWatch alarms and email notifications for EKS control-plane
Here you will see the full Terraform implementation of CloudWatch alarms and email notifications. I focused my alarms on the following metrics:
- apiserver_storage_size_bytes
- apiserver_storage_objects
- apiserver_request_duration_seconds
- rest_client_request_duration_seconds
- etcd_request_duration_seconds
locals {
  eks_cluster_name = "YOUR_CLUSTER_NAME"
}

resource "aws_cloudwatch_metric_alarm" "eks_apiserver_storage_size_bytes" {
  alarm_name          = "eks-${local.eks_cluster_name}-apiserver-storage-size-bytes"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 6000000000 # 6 GB (max is 8 GB)
  alarm_description   = "Detecting high etcd storage usage when 75%+ is being used in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Maximum"
  namespace           = "ContainerInsights"
  metric_name         = "apiserver_storage_size_bytes"

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}

resource "aws_cloudwatch_metric_alarm" "eks_apiserver_storage_objects" {
  alarm_name          = "eks-${local.eks_cluster_name}-apiserver-storage-objects"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 100000
  alarm_description   = "Detecting 100k+ etcd storage objects in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Maximum"
  namespace           = "ContainerInsights"
  metric_name         = "apiserver_storage_objects"

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}

resource "aws_cloudwatch_metric_alarm" "eks_apiserver_request_duration_seconds" {
  alarm_name          = "eks-${local.eks_cluster_name}-apiserver-request-duration-seconds"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 1
  alarm_description   = "API server request duration exceeds 1 second in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Average"
  namespace           = "ContainerInsights"
  metric_name         = "apiserver_request_duration_seconds"

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}

resource "aws_cloudwatch_metric_alarm" "eks_rest_client_request_duration_seconds" {
  alarm_name          = "eks-${local.eks_cluster_name}-rest-client-request-duration-seconds"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 1
  alarm_description   = "REST client request duration exceeds 1 second in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Average"
  namespace           = "ContainerInsights"
  metric_name         = "rest_client_request_duration_seconds"

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}

resource "aws_cloudwatch_metric_alarm" "eks_etcd_request_duration_seconds" {
  alarm_name          = "eks-${local.eks_cluster_name}-etcd-request-duration-seconds"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 1
  alarm_description   = "etcd request duration exceeds 1 second in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Average"
  namespace           = "ContainerInsights"
  metric_name         = "etcd_request_duration_seconds"

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}

resource "aws_sns_topic" "eks_alerts" {
  name = "eks-${local.eks_cluster_name}-alerts"
}

resource "aws_sns_topic_subscription" "email_eks_alerts" {
  topic_arn = aws_sns_topic.eks_alerts.arn
  protocol  = "email"
  endpoint  = "eks_alerts@gmail.com"
}
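As a side note, the three duration alarms above differ only in metric name and description, so they could be collapsed into a single resource with for_each. This is a sketch of my own and not part of the original setup; the storage alarms keep their own resources since their thresholds and statistics differ.
locals {
  # Metric name => human-readable label used in the alarm description.
  eks_duration_alarms = {
    apiserver_request_duration_seconds   = "API server request duration"
    rest_client_request_duration_seconds = "REST client request duration"
    etcd_request_duration_seconds        = "etcd request duration"
  }
}

resource "aws_cloudwatch_metric_alarm" "eks_duration" {
  for_each = local.eks_duration_alarms

  alarm_name          = "eks-${local.eks_cluster_name}-${replace(each.key, "_", "-")}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  period              = 300
  evaluation_periods  = 5
  threshold           = 1
  alarm_description   = "${each.value} exceeds 1 second in ${local.eks_cluster_name} EKS cluster."
  alarm_actions       = [aws_sns_topic.eks_alerts.arn]
  statistic           = "Average"
  namespace           = "ContainerInsights"
  metric_name         = each.key

  dimensions = {
    ClusterName = local.eks_cluster_name
  }
}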
Note that the email subscription has to be confirmed via the link SNS sends to the endpoint before notifications are delivered. The final deployed state should look as follows in the AWS Console.
Conclusion
This completes the setup, and I hope you can reuse my work to speed up your own. Please do monitor your EKS control plane and set up alerting: a full etcd database can lock up the cluster and cause a lot of issues if you don't clean it up when needed.
As with any other story I have written on Medium, I personally performed the tasks documented here. This is my own research, and these are issues I have encountered myself.
Thanks for reading, everybody. Marcin Cuber