Amazon EKS Upgrade Journey From 1.19 to 1.20
Process and considerations while upgrading EKS control-plane to version 1.20
Overview
AWS recently released support for Kubernetes version 1.20 in Amazon Elastic Kubernetes Service (EKS). This is the so-called "Raddest Release". It introduces several new features while removing relatively few deprecated options. In this post I will go through the services that you must check, and upgrade if necessary, before even thinking about upgrading EKS. I have to say that these EKS upgrades are becoming nice and smooth, which is amazing.
For this release I also implemented EKS add-ons for kube-proxy and CoreDNS, so make sure you read everything. There were some complications with them, and you may find some answers below.
If you are looking at
- upgrading EKS from 1.18 to 1.19 then check out previous story
- upgrading EKS from 1.17 to 1.18 check out this story
- upgrading EKS from 1.16 to 1.17 check out this story
- upgrading EKS from 1.15 to 1.16 then check out this story
It is important to note that the Kubernetes project has recently switched from a cadence of four releases a year to three. This change, coupled with the continued maturation of the project, will lead to much larger, feature-packed releases. For that reason we can expect EKS to support a lot more features, delivered much faster.
The release of EKS 1.20 also means that EKS 1.15 has reached its end of life (end of support for EKS 1.15). Kubernetes 1.16 is a notably challenging version to upgrade to, due to many deprecated APIs being removed. Be sure to check out my previous blog post on 1.16 upgrade preparation and the Amazon EKS release calendar for future dates.
Kubernetes 1.20 features
- API Priority and Fairness is now in beta status and is enabled by default. This allows kube-apiserver to categorise incoming requests by priority levels.
- RuntimeClass has reached stable status. The RuntimeClass resource provides a mechanism for supporting multiple runtimes in a cluster and surfaces information about the container runtime to the control plane.
- Process ID Limits has now graduated to general availability.
- kubectl debug has reached beta status. kubectl debug provides support for common debugging workflows directly from kubectl.
- Pod Hostname as FQDN has graduated to beta status. This feature allows setting a pod's hostname to its Fully Qualified Domain Name (FQDN), giving the ability to set the hostname field of the kernel to the FQDN of a Pod. A minimal example pod spec follows this list.
- The client-go credential plugins can now be passed the current cluster information via the KUBERNETES_EXEC_INFO environment variable. This enhancement allows Go clients to authenticate using external credential providers, like Key Management Systems (KMS).
- CSI Volume Snapshot is now generally available. This feature provides a standard way to trigger volume snapshot operations in Kubernetes and allows Kubernetes users to incorporate snapshot operations in a portable manner on any Kubernetes environment, regardless of the underlying storage providers.
- The FSGroup's CSIDriver Policy is now beta in 1.20. This allows CSIDrivers to explicitly indicate whether they want Kubernetes to manage permissions and ownership for their volumes via fsGroup.
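To illustrate the Pod Hostname as FQDN feature mentioned above, here is a minimal sketch of a pod spec with setHostnameAsFQDN enabled. The pod name, hostname and subdomain values (fqdn-test, default-subdomain) are hypothetical, and the FQDN only resolves if a matching headless Service exists, so treat this purely as an example.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fqdn-test
spec:
  hostname: fqdn-test            # hypothetical hostname
  subdomain: default-subdomain   # assumes a headless Service with this name
  setHostnameAsFQDN: true        # the 1.20 beta feature discussed above
  containers:
  - name: busybox
    image: busybox:1.33
    command: ["sleep", "3600"]
EOF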
Docker deprecation
In general there is nothing to worry about. The Docker container runtime is now deprecated. The Kubernetes community has written a blog post about this in detail with a dedicated FAQ page. Docker-produced images can continue to be used and will work as they always have. You can safely ignore the dockershim deprecation warning message printed in kubelet startup logs. EKS will eventually move to containerd as the runtime for the EKS optimized Amazon Linux 2 AMI. You can follow the containers roadmap issue for more details.
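If you want to confirm which runtime your worker nodes currently report, a quick check like the one below is enough; it works on any cluster and does not assume anything specific to my setup.
# Show the container runtime in the CONTAINER-RUNTIME column
kubectl get nodes -o wide
# Or print each node name and runtime version directly from the node status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'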
More Upgrade Notes
- RuntimeClass feature graduates to General Availability. The node.k8s.io API group is promoted from v1beta1 to v1. v1beta1 is now deprecated and will be removed in a future release, so please start using v1. See the example manifest after this list.
- Kubectl: the --delete-local-data flag is deprecated.
- Kubelet's deprecated endpoint metrics/resource/v1alpha1 has been removed, please adopt metrics/resource.
- TokenRequest and TokenRequestProjection are now GA features. The following flags are required by the API server: --service-account-issuer, which should be set to a URL identifying the API server that will be stable over the cluster lifetime; --service-account-key-file, set to one or more files containing one or more public keys used to verify tokens; --service-account-signing-key-file, set to a file containing a private key to use to sign service account tokens. It can be the same file given to kube-controller-manager with --service-account-private-key-file.
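For the node.k8s.io promotion mentioned in the list above, this is a sketch of what a RuntimeClass manifest looks like with the v1 API group. The class name and handler (gvisor, runsc) are hypothetical and must match a runtime handler actually configured on your nodes.
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1   # was node.k8s.io/v1beta1 before the promotion
kind: RuntimeClass
metadata:
  name: gvisor               # hypothetical class name
handler: runsc               # must match a handler configured in your container runtime
EOF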
Upgrade your EKS with terraform
This time the upgrade of the control plane took around 40 minutes and didn't cause any issues. I noticed that the control plane wasn't available immediately after the upgrade, so the upgraded worker nodes took around 2 minutes to join the cluster.
I personally use Terraform to deploy and upgrade my EKS clusters. Here is an example of the EKS cluster resource.
resource "aws_eks_cluster" "cluster" {
enabled_cluster_log_types = ["audit"]
name = local.name_prefix
role_arn = aws_iam_role.cluster.arn
version = "1.20"
vpc_config {
subnet_ids = flatten([module.vpc.public_subnets, module.vpc.private_subnets])
security_group_ids = []
endpoint_private_access = "true"
endpoint_public_access = "true"
} encryption_config {
resources = ["secrets"]
provider {
key_arn = module.kms-eks.key_arn
}
} tags = var.tags
}
The templates I use for creating EKS clusters with Terraform can be found in my GitHub repository: https://github.com/marcincuber/eks/tree/master/terraform-aws
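After terraform apply completes, I like to verify that the control plane actually reports version 1.20 and that the worker nodes have rejoined. The commands below are just a sanity check; eks-test-eu is the cluster name that appears later in this post, so substitute your own.
# Check the control-plane version reported by EKS
aws eks describe-cluster --name eks-test-eu --query 'cluster.version' --output text
# Check the API server version and that nodes have rejoined
kubectl version
kubectl get nodes -o wide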
After upgrading EKS control-plane
Remember to upgrade core deployments and daemon sets that are recommended for EKS 1.20.
The above is just a recommendation from AWS. You should look at upgrading all of your components to match Kubernetes version 1.20; the version-check commands after this list can help. They could include:
- calico-node
- cluster-autoscaler
- kube-state-metrics
- calico-typha and calico-typha-horizontal-autoscaler
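A quick way to see which versions of these components you are currently running is to list the container images deployed in kube-system. This is only a generic sketch; kube-system is simply the usual namespace, so adjust it if you deploy these components elsewhere.
# List images used by deployments and daemon sets in kube-system
kubectl get deployments,daemonsets -n kube-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'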
Managed EKS Add-ons
With EKS 1.20, AWS also released managed add-ons for kube-proxy and CoreDNS. It is probably the right time to start utilising them.
Important note: VPC CNI is also provided as a managed add-on, however I am not a big fan of having this particular component managed by AWS. I would suggest you simply deploy your own configuration of VPC CNI (in YAML format) using Flux. That way you stay in control of what is actually being deployed. There have been many issues with it, so I wouldn't recommend moving it to a managed add-on.
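Whichever way you manage VPC CNI, it is worth knowing which version is currently deployed before you upgrade anything. A simple check, assuming the default aws-node daemon set in kube-system:
# Print the image (and therefore the version) of the aws-node daemon set
kubectl describe daemonset aws-node -n kube-system | grep Image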
Kube-proxy managed add-on using Terraform
Terraform resource configuration
resource "aws_eks_addon" "kube_proxy" {
count = var.create_cluster ? 1 : 0 cluster_name = aws_eks_cluster.cluster[0].name
addon_name = "kube-proxy"
addon_version = "v1.20.4-eksbuild.2"
resolve_conflicts = "OVERWRITE" tags = merge(
var.tags,
{
"eks_addon" = "kube-proxy"
}
)
}
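If you are unsure which addon_version to pin in the resource above, the AWS CLI can list the add-on versions that are compatible with a given Kubernetes version. This is a generic lookup, not tied to my cluster:
# List kube-proxy add-on versions compatible with Kubernetes 1.20
aws eks describe-addon-versions \
  --addon-name kube-proxy \
  --kubernetes-version 1.20 \
  --query 'addons[].addonVersions[].addonVersion'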
Issues:
After implementing kube-proxy with Terraform I immediately hit issues such as:
│ Error: unexpected EKS add-on (eks-test-eu:kube-proxy) state returned during creation: creation not successful (CREATE_FAILED): Errors:
│ Error 1: Code: ConfigurationConflict / Message: Apply failed with 1 conflict: conflict with "before-first-apply" using v1: .data.config
and
Error: unexpected EKS add-on (eks-test-eu:kube-proxy) state returned during creation: creation not successful (CREATE_FAILED): Errors:
│ Error 1: Code: AccessDenied / Message: clusterrolebindings.rbac.authorization.k8s.io "eks:kube-proxy" is forbidden: user "eks:addon-manager" (groups=["system:authenticated"]) is attempting to grant RBAC permissions not currently held:
│ {APIGroups:["discovery.k8s.io"], Resources:["endpointslices"], Verbs:["get"]}
Note that this happened to me in both Ireland (eu-west-1) and China Beijing (cn-north-1) regions.
Fix:
Add the missing permissions to the eks:addon-manager cluster role:
kubectl edit clusterrole eks:addon-manager
and make sure you have the following permissions added:
# Add this rule under the rules: section of the eks:addon-manager ClusterRole
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
  - get
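Before re-running Terraform, you can double-check that the edit took effect. This is just a sanity check and assumes you have permission to read cluster roles:
# Confirm the endpointslices rule is now present on the cluster role
kubectl get clusterrole eks:addon-manager -o yaml | grep -A 8 discovery.k8s.io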
CoreDNS managed add-on using Terraform
Terraform resource configuration
resource "aws_eks_addon" "core_dns" {
count = var.create_cluster ? 1 : 0cluster_name = aws_eks_cluster.cluster[0].name
addon_name = "coredns"
addon_version = "v1.8.3-eksbuild.1"
resolve_conflicts = "OVERWRITE"tags = merge(
var.tags,
{
"eks_addon" = "coredns"
}
)
}
Issues:
Error: unexpected EKS add-on (eks-test-eu:coredns) state returned during creation: timeout while waiting for state to become 'ACTIVE, CREATE_FAILED' (last state: 'CREATING', timeout: 20m0s)
│ [WARNING] Running terraform apply again will remove the kubernetes add-on and attempt to create it again effectively purging previous add-on configuration
Fix:
The deployment takes a long time, around 20 minutes, so just be patient. It failed the first time around, but the CoreDNS service kept working as expected inside the cluster.
Despite the error shown above, a second terraform apply worked as expected, so there are no major issues here.
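Once the second apply succeeds, it is worth confirming that the add-on reports ACTIVE and that the CoreDNS pods are healthy. Again, eks-test-eu is just the cluster name from the error messages above, so substitute your own.
# Check the managed add-on status as seen by EKS
aws eks describe-addon --cluster-name eks-test-eu --addon-name coredns --query 'addon.status'
# Check that the CoreDNS pods are running and using the expected image
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl get deployment coredns -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'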
Summary
I have to say that this was a nice, pleasant and fast upgrade. Yet again, no significant issues. The only issues occurred while implementing the managed add-ons.
If you are interested in the entire terraform setup for EKS, you can find it on my GitHub -> https://github.com/marcincuber/eks/tree/master/terraform-aws
I hope this article nicely aggregates all the important information around upgrading EKS to version 1.20 and helps people speed up the task.
Long story short, you either hate or love Kubernetes, but you still use it ;).
Sponsor Me
As with any other story I write on Medium, I performed the tasks documented here. This is my own research and these are the issues I encountered.
Thanks for reading everybody. Marcin Cuber