Deploying MIG Support

This topic describes how to configure and deploy a supported MIG-enabled GPU on HPE Ezmeral Runtime Enterprise.

You must configure MIG before adding the host to HPE Ezmeral Runtime Enterprise. If the host has not yet been added to HPE Ezmeral Runtime Enterprise, see Host Has Not Been Added to HPE Ezmeral Runtime Enterprise.

If the host has already been added to HPE Ezmeral Runtime Enterprise, see Host Already Added to HPE Ezmeral Runtime Enterprise.

Required access rights: Platform Administrator

Host Has Not Been Added to HPE Ezmeral Runtime Enterprise

If the host has not yet been added to HPE Ezmeral Runtime Enterprise, you install the driver, and then enable and configure MIG. You can then add the host to HPE Ezmeral Runtime Enterprise and to the Kubernetes cluster.

  1. Install NVIDIA driver version 470.57.02 or later. To install the driver, see GPU Driver Installation.
  2. Use the nvidia-smi tool to configure and enable MIG. See MIG Configuration Using nvidia-smi.
  3. Add the host to HPE Ezmeral Runtime Enterprise and to the Kubernetes cluster as a Kubernetes Worker. See Kubernetes Worker Installation Overview.
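As a quick aid for step 1, the following minimal sketch compares an installed driver version string against the 470.57.02 minimum. The function names are hypothetical helpers, not part of any HPE or NVIDIA tool, and the sketch assumes that the version is a dotted string of integers, such as the value reported by nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
from typing import Tuple

# Minimum driver version stated in this topic.
MIN_MIG_DRIVER = "470.57.02"


def parse_driver_version(version: str) -> Tuple[int, ...]:
    """Convert a dotted version such as '470.57.02' into (470, 57, 2)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(installed: str, minimum: str = MIN_MIG_DRIVER) -> bool:
    """Return True if the installed driver is at or above the minimum."""
    # Tuples compare element by element, so 470.57.02 < 470.82.01 < 510.47.03.
    return parse_driver_version(installed) >= parse_driver_version(minimum)


if __name__ == "__main__":
    for candidate in ("470.42.01", "470.57.02", "510.47.03"):
        print(candidate, driver_meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when a component has a different number of digits.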

Host Already Added to HPE Ezmeral Runtime Enterprise

Use the following procedure if you are adding GPU or MIG GPU support to a host and you are not performing this task as part of an upgrade of HPE Ezmeral Runtime Enterprise.

If you are upgrading from an earlier version of HPE Ezmeral Runtime Enterprise, you remove the host from the Kubernetes cluster and from HPE Ezmeral Runtime Enterprise as part of the upgrade procedure.

  1. Remove the host from the Kubernetes cluster. See Expanding or Shrinking a Kubernetes Cluster.
  2. Delete the host from HPE Ezmeral Runtime Enterprise. See Decommissioning/Deleting a Kubernetes Host.
  3. Ensure that the NVIDIA driver is version 470.57.02 or later. Update the driver on the host as needed. See GPU Driver Installation.
  4. Use the nvidia-smi tool to configure and enable MIG. See MIG Configuration Using nvidia-smi.
  5. Add the host to HPE Ezmeral Runtime Enterprise as a Kubernetes Worker. See Kubernetes Worker Installation Overview.

MIG Configuration Using nvidia-smi

You use the NVIDIA nvidia-smi command-line interface to configure, enable, and manage MIG.

See the following NVIDIA documentation (links open an external website in a new browser tab or window):

IMPORTANT

As stated in the NVIDIA documentation, to run CUDA workloads on the GPU, you must create both MIG GPU instances and their corresponding compute instances. However, the created MIG devices do not persist across system reboots or GPU resets.

The HPE Ezmeral Runtime Enterprise bds-nvidia-mig-config service preserves the MIG device configurations across system reboots, so no additional configuration or mitigation is required.
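The following sketch shows one way the nvidia-smi steps might be sequenced: enable MIG mode, create GPU instances together with their compute instances, and list the resulting devices. It only builds and prints the command lines and does not run them. The helper name, the GPU index, and the chosen profiles are illustrative assumptions. Valid profiles depend on the GPU model and can be listed with nvidia-smi mig -lgip, and the commands typically require root privileges. Treat the NVIDIA documentation as the authority on exact usage.

```python
import shlex
from typing import List


def mig_setup_commands(gpu_index: int, profiles: List[str]) -> List[List[str]]:
    """Build (but do not run) nvidia-smi commands for a basic MIG setup.

    Hypothetical helper for illustration only.
    """
    gpu = str(gpu_index)
    return [
        # Enable MIG mode on the selected GPU (may require a GPU reset).
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Create GPU instances; -C also creates the matching compute instances.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"],
        # List the GPUs and the MIG devices that now exist.
        ["nvidia-smi", "-L"],
    ]


if __name__ == "__main__":
    # Assumed example: two 1g.5gb instances on GPU 0.
    for command in mig_setup_commands(0, ["1g.5gb", "1g.5gb"]):
        print(shlex.join(command))
```

Printing the commands first makes it possible to review them before running them on the host.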

HPE Ezmeral Runtime Enterprise supports the following Kubernetes strategies for MIG deployment:

Single

The single strategy enables you to interact with MIG instances in the same way you interact with physical GPUs. All MIG devices on a node have the same MIG configuration, such as MIG 1g.5gb. All MIG and physical GPU devices are enumerated using the same resource type: nvidia.com/gpu.

For example: --limits=nvidia.com/gpu=1
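The same limit can also be expressed in a Pod manifest. The following minimal sketch builds such a manifest as a Python dictionary and prints it as JSON. The pod name and container image are hypothetical placeholders, and the function is an illustration rather than part of any HPE tooling.

```python
import json


def gpu_pod_manifest(name: str, image: str,
                     resource: str = "nvidia.com/gpu", count: int = 1) -> dict:
    """Return a minimal Pod manifest that requests `count` units of `resource`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Under the single strategy, a MIG device is requested
                    # exactly like a physical GPU.
                    "resources": {"limits": {resource: count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # The image name below is a placeholder, not a real registry path.
    manifest = gpu_pod_manifest("mig-single-demo",
                                "registry.example.com/cuda-app:latest")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with kubectl apply -f.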

Mixed

MIG devices on a node can have different configurations. Each MIG configuration in the cluster is identified by a resource type, in the form <slice_count>g.<memory_size>gb.

You specify and enumerate MIG devices by their fully qualified name in the form: nvidia.com/mig-<slice_count>g.<memory_size>gb

For example:

  • --limits=nvidia.com/mig-1g.5gb=1
  • --limits=nvidia.com/mig-3g.20gb=2

The mixed strategy supports nodes that include GPUs that do not support MIG. GPU devices that do not support MIG are enumerated using the resource type: nvidia.com/gpu.
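To illustrate the naming scheme used by the mixed strategy, the following sketch builds and parses fully qualified MIG resource names. The function names are hypothetical, and the sketch assumes that slice counts and memory sizes are whole numbers, as in the examples above.

```python
import re
from typing import Dict, Tuple

# Matches names such as nvidia.com/mig-1g.5gb or nvidia.com/mig-3g.20gb.
MIG_RESOURCE = re.compile(r"^nvidia\.com/mig-(\d+)g\.(\d+)gb$")


def mig_resource_name(slice_count: int, memory_gb: int) -> str:
    """Build the fully qualified resource name for a MIG configuration."""
    return f"nvidia.com/mig-{slice_count}g.{memory_gb}gb"


def parse_mig_resource(resource: str) -> Tuple[int, int]:
    """Return (slice_count, memory_gb), or raise ValueError for other names."""
    match = MIG_RESOURCE.match(resource)
    if match is None:
        raise ValueError(f"not a MIG resource name: {resource!r}")
    return int(match.group(1)), int(match.group(2))


def mixed_limits(requests: Dict[Tuple[int, int], int]) -> Dict[str, int]:
    """Map {(slice_count, memory_gb): count} to a Kubernetes limits mapping."""
    return {mig_resource_name(s, m): n for (s, m), n in requests.items()}


if __name__ == "__main__":
    print(mixed_limits({(1, 5): 1, (3, 20): 2}))
    print(parse_mig_resource("nvidia.com/mig-3g.20gb"))
```

A name such as nvidia.com/gpu is rejected by the parser, which reflects that GPUs without MIG support keep the plain resource type.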