Using GPUs in Kubernetes Pods

This topic describes how to identify and request GPU and MIG resources, and how to use node labels and the Kubernetes nodeAffinity feature to constrain the pods that are eligible for scheduling.

Identifying GPU Resources

You can view GPU and MIG resources in HPE Ezmeral Runtime Enterprise using the GUI or by using kubectl or nvidia-smi commands. See GPU and MIG Support.
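For a quick check from the command line, you can inspect the GPU resources that a node reports by using kubectl, or list the physical GPUs and MIG devices directly on a GPU host by using nvidia-smi. For example (the node name is a placeholder):

# Show the GPU capacity and allocatable count that the node reports
kubectl describe node <node-name> | grep nvidia.com/gpu

# On the GPU host itself, list the physical GPUs and any MIG devices
nvidia-smi -L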

Requesting GPU Resources

A Kubernetes application can request GPU resources in its YAML file, and these resources will be scheduled accordingly.

HPE Ezmeral Runtime Enterprise taints GPU hosts to discourage scheduling non-GPU pods on hosts with GPUs. However, GPU-equipped hosts can still be used for non-GPU pods if no other resources are available.

There are two key parts to specifying a GPU resource in the YAML file:

  • Specifying the correct key name in the resources: specification. For GPUs in HPE Ezmeral Runtime Enterprise, that key name is: nvidia.com/gpu

    For example:

        resources:
          limits:
            nvidia.com/gpu: 2
  • Setting the NVIDIA_DRIVER_CAPABILITIES environment variable to the value: compute,utility

    For example:

        env:
        - name: "NVIDIA_DRIVER_CAPABILITIES"
          value: "compute,utility"

    For a KubeDirector application with GPU support, such as Jupyter Notebook with ML toolkits, HPE Ezmeral Runtime Enterprise adds the NVIDIA_DRIVER_CAPABILITIES environment variable to the KubeDirector application YAML automatically when you select a nonzero GPU count in the UI. Otherwise, add the environment variable manually.

You include these items in any native Kubernetes resource that contains a Container object, including pods and higher-level pod-creating resources such as Deployment, StatefulSet, and DaemonSet, as shown in the following example.
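For example, a minimal Pod specification that requests two GPUs and sets the required environment variable might look like the following sketch; the pod name, container name, and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                # placeholder name
spec:
  containers:
  - name: gpu-container            # placeholder name
    image: <your-gpu-image>        # placeholder image
    env:
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    resources:
      limits:
        nvidia.com/gpu: 2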

In a KubeDirectorCluster specification, you include these items in the RoleSpec of the role that accesses GPUs.

To specify MIG resources, see Requesting MIG Resources.

Using nodeAffinity

You can use a combination of node labels and the Kubernetes nodeAffinity feature to constrain the nodes on which pods are eligible to be scheduled.

Using nodeAffinity to Select By GPU Type

You might want to restrict the application to run on a specific GPU type because of availability or cost considerations in your business environment. For example, using an A100 GPU might have a different billing rate than other types of GPUs.

The nodeAffinity feature includes an expressive matching language and the ability to specify a preference instead of a hard requirement. You can also use match expressions and operators to express an anti-affinity.
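For example, the following sketch uses the NotIn operator to express an anti-affinity, keeping pods off nodes with a particular GPU model (the label value shown is illustrative):

...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: NotIn
              values:
              - Tesla-P4
...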

The procedure, in concept, is the following:

  1. If needed, the Kubernetes Cluster Administrator or Platform Administrator can label the nodes to which you want to apply preferences or restrictions.

    If you want to use an existing default node label, you do not need to create and apply label key-value pairs to nodes, but you do need the Kubernetes Cluster Administrator or Platform Administrator to supply you with the list of node labels.

    A Kubernetes Cluster Administrator or Platform Administrator can get a valid list of keys and values of node labels by querying with kubectl commands. For an example, see Listing the nvidia.com Node Labels.

    For example, in HPE Ezmeral Runtime Enterprise, nodes that have GPUs have a set of default node labels, one of which has the key: nvidia.com/gpu.product. One of the valid values of that key is Tesla-P4.

    However, you might want to enable users who create applications to specify the appropriate category of GPU without knowing the exact model identifier of the GPU. For example, you might label one or more nodes as having "general-purpose" or "higher-performance" GPUs by using node labels such as gputype=general-purpose, and apply the same label to hosts that have any of several GPU models (see the example command after this procedure).

  2. Specify the nodeAffinity in the affinity field.

    Any native Kubernetes resource that includes a PodSpec object can include an affinity field in that object. This includes pods and higher-level pod-creating resources such as Deployment, StatefulSet, and DaemonSet.

    In a KubeDirectorCluster specification, you include the affinity field in the RoleSpec.
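As an example of step 1, a Kubernetes Cluster Administrator or Platform Administrator could apply the hypothetical gputype label described above with a command such as the following (the node name is a placeholder):

kubectl label node <node-name> gputype=general-purpose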

In the following example, nodeAffinity expresses a preference to schedule RESTserver pods on nodes with a Tesla-P4 GPU.

Specifying preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution indicates that, if a preferred node is not available when the pod is scheduled, the pod can be scheduled on a node that does not match the matchExpressions.

...
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
              - Tesla-P4
...

Requesting MIG Resources

As with requesting GPU resources, setting the NVIDIA_DRIVER_CAPABILITIES environment variable to compute,utility is required. However, the way you specify the MIG instance differs, both in resource requests and in the values of the standard nvidia.com/gpu.product node label.

For applications that support specifying resources for MIG-enabled GPUs, the way you specify the MIG instance differs depending on the Kubernetes MIG strategy chosen by the Platform Administrator.

single strategy

If the single strategy is used, when you request resources, you specify the number of MIG instances in the same way as for physical GPU devices.

For example:

...
    resources:
      limits:
        nvidia.com/gpu: 1
...

If you have different nodes with different MIG configurations, you can use the nodeAffinity field to select a node that has the MIG configuration you want to use.

The following example uses the standard nvidia.com/gpu.product key to require a particular MIG configuration. If a node with that configuration is not available, the pod will not be scheduled.

...
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: 'compute,utility'
...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
              - A100-SXM4-40GB-MIG-1g.5gb
...
mixed strategy

If the mixed strategy is used, when you request resources, you specify and enumerate MIG devices by their fully qualified resource name in the form:

nvidia.com/mig-<slice_count>g.<memory_size>gb

For example, a MIG device with three compute slices and 20 GB of memory is requested as nvidia.com/mig-3g.20gb.

If the mixed strategy is used, the value of the standard nvidia.com/gpu.product node label is the name of the physical GPU, as shown in the following example.

...
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
      env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: 'compute,utility'
...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
              - A100-SXM4-40GB
...
NOTE

As stated in "Device Enumeration" in the NVIDIA Multi-Instance GPU User Guide: "MIG supports running CUDA applications by specifying the CUDA device on which the application should be run. With CUDA 11, only enumeration of a single MIG instance is supported."

Therefore, an application can access only one GPU MIG instance (the first instance applied to the pod), even if the pod spec specifies a limit larger than one.

Listing the nvidia.com Node Labels

The following command queries all nodes for node labels whose key starts with nvidia.com. You must have Kubernetes Cluster Administrator or Platform Administrator rights to execute this command.

kubectl get nodes -o json | jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'

This command is useful for obtaining valid key-value pairs to use in nodeAffinity match expressions.
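For example, on a node with a Tesla P4 GPU, the output includes entries similar to the following. This output is illustrative only; the actual output typically contains additional nvidia.com labels, and the values depend on your GPU hardware and software versions:

{
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.product": "Tesla-P4"
}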