Amazon EKS: implementing and using GPU nodes with NVIDIA drivers

Marcin Cuber
Jan 19, 2024


Find out how to implement GPU nodes using Flux, Helm and Karpenter.

Introduction

By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types that support NVIDIA GPUs.

Using the default configuration has the following limitations:

  • The pre-installed NVIDIA GPU driver and NVIDIA container runtime versions lag behind NVIDIA’s release schedule.
  • You must deploy the NVIDIA device plugin yourself and assume responsibility for upgrading it.

In my case, using the GPU-optimised AMIs provided in https://github.com/awslabs/amazon-eks-ami wasn’t sufficient on its own. I also had to make use of the NVIDIA GPU drivers, which I will demonstrate in this story.

The plugin I selected can be found at https://github.com/NVIDIA/k8s-device-plugin and it is well maintained.

That covers the introduction. The following sections show how I used Karpenter, Flux and Helm resources to deploy a GPU-enabled setup.

Node implementation

My workloads run in the London region, so I selected the g5.2xlarge instance type, which is available in the eu-west-2a and eu-west-2b availability zones. Please note that the g5 instance class is not available in eu-west-2c at the time of writing this article.
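If you want to double-check availability in your own account, the AWS CLI can list the availability zones that offer a given instance type (adjust the region and instance type to your needs):

aws ec2 describe-instance-type-offerings \
  --region eu-west-2 \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g5.2xlarge \
  --query "InstanceTypeOfferings[].Location"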

For development purposes I utilised Spot instances. Here is the full Karpenter node configuration:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: platform-ml
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "dev"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "dev"
  role: "dev-node-karpenter"
  tags:
    Name: "dev-platform-ml"
    Intent: "platform-ml"
    Environment: "dev"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
  detailedMonitoring: true
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: platform-ml
spec:
  template:
    metadata:
      labels:
        intent: platform-ml
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: platform-ml
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["g5.2xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
      kubelet:
        maxPods: 58
  limits:
    cpu: 32
    memory: 128Gi
  disruption:
    expireAfter: 1440h

The above NodePool defines the instance type we are using and the fact that it will be a Spot instance. G5 instances are the latest generation of NVIDIA GPU-based instances and can be used for a wide range of graphics-intensive and machine learning use cases. Since it is a GPU-based instance, Karpenter automatically selects a GPU-enhanced AMI; in my case it is ami-040bbe6351fa6133d, named amazon-eks-gpu-node-1.28-v20240110. Please note that if you run the same templates at a later date, Karpenter will fetch a newer AMI.
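As a sanity check, the recommended EKS GPU AMI for a given Kubernetes version can be read from the public SSM parameter that AL2 amiFamily resolution is based on (shown here for Kubernetes 1.28 in my region):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id \
  --region eu-west-2 \
  --query "Parameter.Value" --output text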

NVIDIA device plugin

The device plugin is essential for specifying GPU limits at the Kubernetes workload level. This can be in your Pod, Deployment, StatefulSet, Job or CronJob.
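For illustration, here is a minimal Deployment fragment requesting a single GPU; the names here are hypothetical, and a complete Pod example follows in the Testing section:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: inference
          image: nvidia/cuda:12.3.1-devel-ubuntu22.04
          resources:
            limits:
              nvidia.com/gpu: 1 # resource advertised by the device plugin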

I opted for a basic install with the device plugin only, as I don’t need the other additional features offered by the Helm chart available at https://github.com/NVIDIA/k8s-device-plugin.

Here is the implementation:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: nvidia-k8s-device-plugin
  namespace: flux-system
spec:
  interval: 30m
  url: https://nvidia.github.io/k8s-device-plugin
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  releaseName: nvidia-device-plugin
  chart:
    spec:
      chart: nvidia-device-plugin
      version: 0.14.3
      sourceRef:
        kind: HelmRepository
        name: nvidia-k8s-device-plugin
        namespace: flux-system
  interval: 1h0m0s
  install:
    remediation:
      retries: 3
  values:
    nodeSelector:
      karpenter.k8s.aws/instance-gpu-manufacturer: nvidia

Please note the important label at the namespace level:

pod-security.kubernetes.io/enforce: privileged

The privileged pod security level is required for the device plugin to function correctly.
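Once the plugin pods are running, you can confirm that GPU nodes advertise the nvidia.com/gpu resource:

kubectl get nodes \
  "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"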

Testing

To test the setup, I used the following pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: gpu-operator
spec:
  nodeSelector:
    intent: platform-ml
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.3.1-devel-ubuntu22.04
      args:
        - "nvidia-smi"
      resources:
        limits:
          nvidia.com/gpu: 1
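Apply the manifest and, once Karpenter has provisioned a GPU node and the pod has completed, read its logs (the file name is just my choice):

kubectl apply -f nvidia-smi-pod.yaml
kubectl -n gpu-operator logs nvidia-smi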

The output of the test pod:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   24C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Summary

I hope the above shows the steps required to enable the use of GPUs in your workloads. Machine learning is a big topic, so my implementation is just one example that you can use. Note that the Helm chart for the GPU operator has more components available, which can easily be enabled.

The full implementation of Karpenter, Flux and GPU operator can be found on my github at https://github.com/marcincuber/kubernetes-fluxv2.

Sponsor Me

As with any other story I have written on Medium, I performed the tasks documented here myself. This is my own research, including the issues I encountered.

Thanks for reading, everybody. Marcin Cuber
