Amazon EKS: implementing and using GPU nodes with NVIDIA drivers

Marcin Cuber
Jan 19, 2024


Find out how to implement GPU nodes using Flux, Helm and Karpenter.

Introduction

By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types that support NVIDIA GPUs.

Using the default configuration has the following limitations:

  • The pre-installed NVIDIA GPU driver and NVIDIA container runtime versions lag behind NVIDIA’s release schedule.
  • You must deploy the NVIDIA device plugin yourself and assume responsibility for upgrading it.

In my case, using the GPU-optimised AMIs provided in https://github.com/awslabs/amazon-eks-ami wasn’t sufficient on its own. I also had to make use of the NVIDIA GPU drivers, which I will demonstrate in this story.

The plugin I selected can be found at https://github.com/NVIDIA/k8s-device-plugin and it is well maintained.

That covers the introduction. The following sections show how I used Karpenter, Flux and Helm resources to deploy a GPU-enabled setup.

Node implementation

My workloads run in the London region, so I selected the g5.2xlarge instance type, which is available in the eu-west-2a and eu-west-2b availability zones. Please note that the g5 instance class is not available in eu-west-2c at the time of writing this article.
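If you want to double-check availability in your own account, the AWS CLI can list the availability zones that offer a given instance type (adjust the region and instance type to your needs):

aws ec2 describe-instance-type-offerings \
  --region eu-west-2 \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g5.2xlarge \
  --query "InstanceTypeOfferings[].Location"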

For development purposes I utilised Spot instances. Here is the full Karpenter node configuration:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: platform-ml
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "dev"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "dev"
  role: "dev-node-karpenter"
  tags:
    Name: "dev-platform-ml"
    Intent: "platform-ml"
    Environment: "dev"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
  detailedMonitoring: true
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: platform-ml
spec:
  template:
    metadata:
      labels:
        intent: platform-ml
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: platform-ml
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["g5.2xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
      kubelet:
        maxPods: 58
  limits:
    cpu: 32
    memory: 128Gi
  disruption:
    expireAfter: 1440h

The above NodePool defines the instance type we are using and the fact that it will be a Spot instance. G5 instances are the latest generation of NVIDIA GPU-based instances and can be used for a wide range of graphics-intensive and machine learning use cases. Since it is a GPU-based instance, Karpenter automatically selects a GPU-enhanced AMI; in my case it is ami-040bbe6351fa6133d, named amazon-eks-gpu-node-1.28-v20240110. Please note that if you run the same templates at a later date, Karpenter will fetch a newer AMI.
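As a sanity check, the recommended EKS GPU AMI for a given Kubernetes version can be read from the public SSM parameter that AL2 amiFamily resolution is based on (shown here for Kubernetes 1.28 in my region):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id \
  --region eu-west-2 \
  --query "Parameter.Value" --output text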

NVIDIA device plugin

The device plugin is essential for specifying GPU limits at the Kubernetes workload level. This can be in your Pod, Deployment, StatefulSet, Job or CronJob.
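For illustration, here is a minimal Deployment fragment requesting a single GPU; the names here are hypothetical, and a complete Pod example follows in the Testing section:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: inference
          image: nvidia/cuda:12.3.1-devel-ubuntu22.04
          resources:
            limits:
              nvidia.com/gpu: 1 # resource advertised by the device plugin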

I opted for a basic install with the device plugin only, as I don’t need the other additional features offered by the Helm chart available at https://github.com/NVIDIA/k8s-device-plugin.

Here is the implementation:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: nvidia-k8s-device-plugin
  namespace: flux-system
spec:
  interval: 30m
  url: https://nvidia.github.io/k8s-device-plugin
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  releaseName: nvidia-device-plugin
  chart:
    spec:
      chart: nvidia-device-plugin
      version: 0.14.3
      sourceRef:
        kind: HelmRepository
        name: nvidia-k8s-device-plugin
        namespace: flux-system
  interval: 1h0m0s
  install:
    remediation:
      retries: 3
  values:
    nodeSelector:
      karpenter.k8s.aws/instance-gpu-manufacturer: nvidia

Please note the important label at the namespace level:

pod-security.kubernetes.io/enforce: privileged

The privileged pod security level is required for the device plugin to function correctly.
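Once the plugin pods are running, you can confirm that GPU nodes advertise the nvidia.com/gpu resource:

kubectl get nodes \
  "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"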

Testing

To test the setup, I used the following pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: gpu-operator
spec:
  nodeSelector:
    intent: platform-ml
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.3.1-devel-ubuntu22.04
      args:
        - "nvidia-smi"
      resources:
        limits:
          nvidia.com/gpu: 1
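Apply the manifest and, once Karpenter has provisioned a GPU node and the pod has completed, read its logs (the file name is just my choice):

kubectl apply -f nvidia-smi-pod.yaml
kubectl -n gpu-operator logs nvidia-smi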

The output of the test pod:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   24C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Summary

I hope the above shows the steps required to enable the use of GPUs in your workloads. Machine learning is a big topic, so my implementation is just one example that you can use. Note that the Helm chart for the GPU operator has more components available, which can easily be enabled.

The full implementation of Karpenter, Flux and GPU operator can be found on my github at https://github.com/marcincuber/kubernetes-fluxv2.

Sponsor Me

As with any other story I have written on Medium, I performed the tasks documented here myself. This is my own research, including the issues I encountered.

Thanks for reading, everybody. Marcin Cuber
