Implementing AWS ECR Pull Through Cache for an EKS Cluster: In-Depth Implementation Details

Marcin Cuber
Apr 5, 2024

Find out in detail how to implement AWS ECR pull-through cache for your EKS cluster using Terraform. You will find all the information step-by-step to do it swiftly and without pain.

Introduction

Pull through cache rules for ECR were first announced at AWS re:Invent 2021. I believe this was a huge feature, but at the time it lacked support for Docker Hub, which is essentially home to the majority of images. This has changed recently, so I am going to demonstrate how to use it. But first, what is a pull through cache? With pull through cache rules, you can sync the contents of an upstream registry with your Amazon ECR private registry.

At the time of writing, Amazon ECR supports creating pull through cache rules for the following upstream registries.

  • Docker Hub, Microsoft Azure Container Registry, and GitHub Container Registry (Requires authentication)
  • Amazon ECR Public, the Kubernetes container image registry, and Quay (Doesn’t require authentication)

For the upstream registries that require authentication, you must store your credentials in an AWS Secrets Manager secret. The Amazon ECR console makes it easy for you to create the Secrets Manager secret for each of the authenticated upstream registries. For more information on creating a Secrets Manager secret using the Secrets Manager console, see Storing your upstream repository credentials in an AWS Secrets Manager secret. I will also demonstrate how to create such a secret with Terraform later in the story.

After you’ve created a pull through cache rule for the upstream registry, simply pull an image from that upstream registry using your Amazon ECR private registry URI. Amazon ECR then creates a repository and caches that image in your private registry. On your subsequent pull requests of the cached image with a given tag, Amazon ECR checks the upstream registry to see if there is a new version of the image with that specific tag and attempts to update the image in your private registry at least once every 24 hours.

That is enough context around ECR pull through cache. Now, I will demonstrate how to create and configure caches for Docker Hub, GitHub, ECR Public, Quay, and Kubernetes using Terraform. You will also find examples of how to reference the new ECR repositories within HelmRelease resources. In my case, HelmRelease resources are managed by Flux2.

Implementation

Flow

[Figure: ECR pull through cache design and flow]

Terraform

resource "aws_ecr_pull_through_cache_rule" "docker_hub" {
  ecr_repository_prefix = "docker-hub"
  upstream_registry_url = "registry-1.docker.io"
  credential_arn        = aws_secretsmanager_secret.ecr_pullthroughcache_docker_hub.arn
}

resource "aws_ecr_pull_through_cache_rule" "github" {
  ecr_repository_prefix = "github"
  upstream_registry_url = "ghcr.io"
  credential_arn        = aws_secretsmanager_secret.ecr_pullthroughcache_github.arn
}

resource "aws_ecr_pull_through_cache_rule" "k8s" {
  ecr_repository_prefix = "k8s"
  upstream_registry_url = "registry.k8s.io"
}

resource "aws_ecr_pull_through_cache_rule" "public_ecr" {
  ecr_repository_prefix = "ecr"
  upstream_registry_url = "public.ecr.aws"
}

resource "aws_ecr_pull_through_cache_rule" "quay" {
  ecr_repository_prefix = "quay"
  upstream_registry_url = "quay.io"
}
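To make the effect of these rules concrete, here is a minimal Python sketch of how an upstream image reference maps onto a cached ECR URI. The mapping table mirrors the prefixes from the Terraform above; the function name `cached_image_uri` and the account ID `111122223333` are hypothetical and only there for illustration. Docker Hub official images need extra handling (the `library/` segment), which this sketch does not attempt.

```python
# Upstream registry host -> ECR repository prefix, copied from the Terraform rules.
PREFIX_BY_UPSTREAM = {
    "registry-1.docker.io": "docker-hub",
    "ghcr.io": "github",
    "registry.k8s.io": "k8s",
    "public.ecr.aws": "ecr",
    "quay.io": "quay",
}


def cached_image_uri(upstream_image: str, account_id: str, region: str) -> str:
    """Rewrite 'host/path:tag' into the matching ECR pull through cache URI.

    Hypothetical helper: it assumes the reference always starts with the
    registry host and that a rule exists for that host.
    """
    host, _, path = upstream_image.partition("/")
    if host not in PREFIX_BY_UPSTREAM or not path:
        raise ValueError(f"no pull through cache rule for {upstream_image!r}")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{PREFIX_BY_UPSTREAM[host]}/{path}"


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID.
    print(cached_image_uri(
        "registry.k8s.io/external-dns/external-dns:v0.14.0",
        "111122223333",
        "eu-west-2",
    ))
```

The point of the sketch is only that the upstream host is swapped for your private registry plus the rule's prefix, while the rest of the image path and the tag stay unchanged.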

As mentioned before, Docker Hub and GitHub require a secret, and that secret must have a name starting with `ecr-pullthroughcache/`. It won’t work without the required prefix.

#tfsec:ignore:aws-ssm-secret-use-customer-key
resource "aws_secretsmanager_secret" "ecr_pullthroughcache_docker_hub" {
  name = "ecr-pullthroughcache/docker-hub"

  recovery_window_in_days = 7
}

#tfsec:ignore:aws-ssm-secret-use-customer-key
resource "aws_secretsmanager_secret" "ecr_pullthroughcache_github" {
  name = "ecr-pullthroughcache/github"

  recovery_window_in_days = 7
}
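Since a wrong secret name is an easy mistake to make, a small pre-flight check can catch it before applying anything. The following is only a sketch: the helper name `is_valid_cache_secret_name` is hypothetical, and it checks nothing beyond the `ecr-pullthroughcache/` prefix described above.

```python
REQUIRED_PREFIX = "ecr-pullthroughcache/"


def is_valid_cache_secret_name(name: str) -> bool:
    """Return True if the name carries the required prefix plus a non-empty suffix."""
    return name.startswith(REQUIRED_PREFIX) and len(name) > len(REQUIRED_PREFIX)


if __name__ == "__main__":
    for candidate in ("ecr-pullthroughcache/docker-hub", "docker-hub-credentials"):
        print(candidate, is_valid_cache_secret_name(candidate))
```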

Now that we have the auth secrets in place, their content must include the username and accessToken keys, such as:

# github
{
  "username": "marcincuber",
  "accessToken": "ghp_token"
}
# dockerhub
{
  "username": "marcincuber",
  "accessToken": "dckr_pat_token"
}
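If you prefer to generate the secret value rather than type it by hand, a sketch along these lines produces a JSON string with exactly the two keys shown above. The function name `build_secret_string` is hypothetical, and the token values are placeholders; how you then store the string (console, CLI, or Terraform) is up to you.

```python
import json


def build_secret_string(username: str, access_token: str) -> str:
    """Serialize upstream registry credentials using the username and accessToken keys."""
    if not username or not access_token:
        raise ValueError("both username and access_token must be non-empty")
    return json.dumps({"username": username, "accessToken": access_token})


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    print(build_secret_string("marcincuber", "dckr_pat_token"))
```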

For GitHub, you need to generate a new personal access token (PAT) with the following permissions:

For Docker Hub, it wasn’t immediately obvious how to get the token, so here is the official Docker doc for it.

This completes our ECR pull through cache implementation.

EKS Node Permissions

The last step required to make the ECR pull through cache work smoothly was to allow EKS worker nodes to create ECR repositories. This was a must for me because the cached ECR repositories did not exist yet when an image was first pulled through the cache. Hence, the following minimum permissions need to be attached to all worker nodes in your EKS cluster. Note that I don’t use ECS, but I would expect the same permissions to be needed there.

resource "aws_iam_role" "eks_node_group" {
  name = "node-group"

  assume_role_policy = data.aws_iam_policy_document.eks_node_group_assume_role_policy.json

  managed_policy_arns = [
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  ]

  inline_policy {
    name   = "ecr-cache-policy"
    policy = data.aws_iam_policy_document.eks_node_custom_inline_policy.json
  }
}

data "aws_iam_policy_document" "eks_node_custom_inline_policy" {
  statement {
    actions = [
      "ecr:CreateRepository",
      "ecr:ReplicateImage",
      "ecr:BatchImportUpstreamImage"
    ]

    resources = ["*"]
  }
}
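For readers who want to see roughly what JSON this policy document amounts to, here is a Python sketch that builds an equivalent structure. It is an approximation: the exact JSON Terraform renders may differ in formatting, and the function name `cache_policy_document` is hypothetical. The optional `resources` argument, which narrows the policy to specific repository ARNs, is my own assumption and is not something the configuration above does.

```python
import json

# The three actions from the inline policy above.
CACHE_ACTIONS = [
    "ecr:CreateRepository",
    "ecr:ReplicateImage",
    "ecr:BatchImportUpstreamImage",
]


def cache_policy_document(resources=("*",)) -> dict:
    """Build an IAM policy document allowing the pull through cache actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(CACHE_ACTIONS),
                "Resource": list(resources),
            }
        ],
    }


if __name__ == "__main__":
    # Default: same scope as the Terraform above (all resources).
    print(json.dumps(cache_policy_document(), indent=2))
```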

The new inline policy with ECR-specific permissions is the only part required. If you are interested in seeing the full Terraform configuration that I use, please take a look at https://github.com/marcincuber/eks/tree/main/terraform and maybe leave a star ;).

Amazon ECR repositories created using the pull through cache workflow are treated like any other Amazon ECR repository. All repository features, such as replication and image scanning are supported.

When Amazon ECR creates a new repository on your behalf using a pull through cache action, the following default settings are applied to the repository unless there is a matching repository creation template. You can use a repository creation template to define the settings applied to repositories created by Amazon ECR on your behalf. For more information, see Manage your repository creation templates.

  • Tag immutability — Turned off, tags are mutable and can be overwritten.
  • Encryption — The default AES256 encryption is used.
  • Repository permissions — Omitted, no repository permissions policy is applied.
  • Lifecycle policy — Omitted, no lifecycle policy is applied.
  • Resource tags — Omitted, no resource tags are applied.

It is important to note that when an image is pulled using the pull through cache rule for the first time, a route to the internet may be required. Since this is needed in certain circumstances, it’s best to set up a route to avoid any failures. In my case, I have private nodes in private subnets which access the internet through a NAT Gateway.

Utilising ECR Cached Repositories

DockerHub

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: nats
  namespace: nats
spec:
  releaseName: nats
  chart:
    spec:
      chart: nats
      version: 1.1.9
      sourceRef:
        kind: HelmRepository
        name: nats
        namespace: flux-system
  interval: 15m0s
  values:
    config:
      cluster:
        enabled: true
        replicas: 3
    container:
      image:
        repository: nats
        tag: 2.10.12-alpine
        registry: AWS_ACCOUNT_ID.dkr.ecr.eu-west-2.amazonaws.com/docker-hub/library
    reloader:
      image:
        repository: natsio/nats-server-config-reloader
        pullPolicy: IfNotPresent
        registry: AWS_ACCOUNT_ID.dkr.ecr.eu-west-2.amazonaws.com/docker-hub

Here you can see a NATS HelmRelease which uses the Docker Hub cache through two differently structured registry endpoints.

For Docker Hub official images:

AWS_ACCOUNT_ID.dkr.ecr.region.amazonaws.com/docker-hub/library/image_name:tag

Important Note. For Docker Hub official images, the /library prefix must be included. For all other Docker Hub repositories, you should omit the /library prefix.

For all other Docker Hub images:

AWS_ACCOUNT_ID.dkr.ecr.region.amazonaws.com/docker-hub/repository_name/image_name:tag

This was probably the most problematic part, and it caused some delays in my development of this feature.
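The /library rule is easy to get wrong, so here is a minimal Python sketch of it. It assumes the input is a Docker Hub image name without a registry host (for example `nats:2.10.12-alpine` or `namespace/image:tag`) and treats any name without a namespace as an official image. The function name `docker_hub_cache_uri`, the account ID, and the reloader tag in the usage lines are hypothetical.

```python
def docker_hub_cache_uri(image: str, account_id: str, region: str,
                         prefix: str = "docker-hub") -> str:
    """Map a Docker Hub image name to its ECR pull through cache URI.

    Official images (no namespace, e.g. 'nats:2.10.12-alpine') get the
    'library/' segment; namespaced images are passed through unchanged.
    """
    # Look only at the part before the tag to decide whether a namespace exists.
    repository = image.split(":", 1)[0]
    if "/" not in repository:
        image = f"library/{image}"
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{image}"


if __name__ == "__main__":
    # Placeholder account ID; the reloader tag is made up for the example.
    print(docker_hub_cache_uri("nats:2.10.12-alpine", "111122223333", "eu-west-2"))
    print(docker_hub_cache_uri("natsio/nats-server-config-reloader:0.14.1",
                               "111122223333", "eu-west-2"))
```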

Kubernetes (registry.k8s.io)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns-private
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: external-dns-private
  template:
    metadata:
      labels:
        app: external-dns-private
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: AWS_ACCOUNT_ID.dkr.ecr.eu-west-2.amazonaws.com/k8s/external-dns/external-dns:v0.14.0
          args:
            - --source=service
            - --source=ingress
            - --provider=aws
            - --annotation-filter=private-hosted-zone-record in (true, True, TRUE)
            - --aws-zone-type=private
            - --registry=txt
      securityContext:
        fsGroup: 65534

I believe the two examples above cover the use of all types of ECR cached repositories. You can see how registries that require authentication and those that don’t are used. I hope this will simplify your implementation.

Validation

The last part I want to mention is validation of the pull through cache. The validate-pull-through-cache-rule AWS CLI command is used to validate a pull through cache rule for an Amazon ECR private registry. The following example uses the ecr namespace prefix. Replace that value with the prefix of the pull through cache rule you want to validate.

aws ecr validate-pull-through-cache-rule \
  --ecr-repository-prefix ecr \
  --region eu-west-2

In the response, the isValid parameter indicates whether the validation was successful or not. If true, Amazon ECR was able to reach the upstream registry and authentication was successful. If false, there was an issue and validation failed. The failure parameter indicates the cause.
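If you validate several rules in a script, a small helper can turn the response into a one-line verdict. The sketch below assumes the JSON output of the command has been captured as a string and reads only the `ecrRepositoryPrefix`, `isValid` and `failure` fields; the function name `summarize_validation` and the sample responses (including the failure text) are made up for illustration.

```python
import json


def summarize_validation(raw_json: str) -> str:
    """Summarize a validate-pull-through-cache-rule response in one line."""
    response = json.loads(raw_json)
    prefix = response.get("ecrRepositoryPrefix", "<unknown>")
    if response.get("isValid"):
        return f"{prefix}: OK"
    reason = response.get("failure") or "no failure reason returned"
    return f"{prefix}: FAILED ({reason})"


if __name__ == "__main__":
    # Illustrative responses, not real command output.
    ok = '{"ecrRepositoryPrefix": "ecr", "isValid": true}'
    bad = '{"ecrRepositoryPrefix": "docker-hub", "isValid": false, "failure": "example failure reason"}'
    print(summarize_validation(ok))
    print(summarize_validation(bad))
```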

Thanks for reading my article. Enjoy AWS, Terraform and Kubernetes!!!

Sponsor Me

Like with any other story on Medium written by me, I performed the tasks documented. This is my own research and issues I have encountered.

Thanks for reading everybody. Marcin Cuber
