GPU Support

Provides information about NVIDIA GPU support, MIG partitioning, preparing hosts for a GPU-enabled environment, and adding hosts and enabling GPUs in HPE Ezmeral Unified Analytics Software.

GPUs provide essential computational power and parallel processing capabilities to accelerate the training and inference processes of deep learning models, reading and processing data frames, processing SQL queries within Spark, and running experiments using Jupyter notebooks integrated with GPUs.

The hundreds or thousands of smaller cores working in parallel enable GPUs to process massive amounts of data in a short period of time.

HPE Ezmeral Unified Analytics Software supports homogeneous configuration deployment, where each GPU is split into N equal parts with the same amount of memory and the same number of CUDA cores. All GPU models on the same Kubernetes cluster must operate in the same configuration mode. HPE Ezmeral Unified Analytics Software does not support mixed configurations across multiple GPU models.

Supported GPU Models

To see the GPU models supported by HPE Ezmeral Unified Analytics Software, see GPU Models.

MIG Partitioning

HPE Ezmeral Unified Analytics Software supports the single-strategy Multi-Instance GPU (MIG) configuration. Use MIG when multiple applications require GPU acceleration. By using MIG, you can achieve higher resource utilization and cost efficiency.

In HPE Ezmeral Unified Analytics Software, GPU partitions are presented as whole devices through the MIG mechanism. When an application requests one GPU, it receives a partition; only that one GPU device is visible to the application. To learn more, see CUDA visible devices.

During the installation of HPE Ezmeral Unified Analytics Software, you must specify the GPU partition size (Whole, Large, Medium, or Small) and request the number of GPU instances required for your workloads.

For the A100 GPU, the partition size maps to the following profiles:
  • Whole: 1 vGPU per physical GPU (100%). MIG profile: No MIG - entire physical GPU (A100-40GB and A100-80GB). The GPU is not split into partitions; you get the entire physical GPU. In this configuration, applications can use only one virtual GPU at a time.
  • Large: 2 vGPUs per physical GPU (42% each). MIG profiles: 3g.20gb (A100-40GB), 3g.40gb (A100-80GB). The GPU is split into two equal partitions; 16% of the GPU remains idle.
  • Medium: 3 vGPUs per physical GPU (28% each). MIG profiles: 2g.10gb (A100-40GB), 2g.20gb (A100-80GB). The GPU is split into three equal partitions; 16% of the GPU remains idle.
  • Small: 7 vGPUs per physical GPU (14% each). MIG profiles: 1g.5gb (A100-40GB), 1g.10gb (A100-80GB). The GPU is split into seven equal partitions; 2% of the GPU remains idle.

For the A30 GPU, the partition size maps to the following profiles:
  • Whole: 1 vGPU per physical GPU (100%). MIG profile: No MIG - entire physical GPU (A30-24GB). The GPU is not split into partitions; you get the entire physical GPU. In this configuration, applications can use only one virtual GPU at a time.
  • Large: 2 vGPUs per physical GPU (50% each). MIG profile: all-2g.12gb. The GPU is split into two equal partitions; this configuration utilizes 100% of the GPU.
  • Small: 4 vGPUs per physical GPU (25% each). MIG profile: all-1g.6gb. The GPU is split into four equal partitions; this configuration utilizes 100% of the GPU.

For the H100-NVL GPU, the partition size maps to the following profiles:
  • Whole: 1 vGPU per physical GPU (100%). MIG profile: No MIG - entire physical GPU. The GPU is not split into partitions; you get the entire physical GPU. In this configuration, applications can use only one virtual GPU at a time.
  • Large: 2 vGPUs per physical GPU (45% each). MIG profile: 3g.47gb. The GPU is split into two equal partitions; 10% of the GPU remains idle.
  • Medium: 3 vGPUs per physical GPU (24% each). MIG profile: 2g.24gb. The GPU is split into three equal partitions; 28% of the GPU remains idle.
  • Small: 7 vGPUs per physical GPU (12% each). MIG profile: 1g.12gb. The GPU is split into seven equal partitions; 16% of the GPU remains idle.

For the H100-PCI GPU, the partition size maps to the following profiles:
NOTE
To use the H100-PCI GPU, manually update the MIG profile values in the gpunodeconfig CR.
  • Whole: 1 vGPU per physical GPU (100%). MIG profile: No MIG - entire physical GPU. The GPU is not split into partitions; you get the entire physical GPU. In this configuration, applications can use only one virtual GPU at a time.
  • Large: 2 vGPUs per physical GPU (40% each). MIG profile: 3g.40gb. The GPU is split into two equal partitions; 20% of the GPU remains idle.
  • Medium: 3 vGPUs per physical GPU (26% each). MIG profile: 2g.20gb. The GPU is split into three equal partitions; 22% of the GPU remains idle.
  • Small: 7 vGPUs per physical GPU (12% each). MIG profile: 1g.10gb. The GPU is split into seven equal partitions; 16% of the GPU remains idle.
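The idle percentages in the tables above follow from simple arithmetic: 100 minus (number of partitions x percent per partition). A quick sketch:

```shell
# Sketch: idle GPU share = 100 - (partitions x percent-per-partition).
idle() { echo $((100 - $1 * $2)); }   # $1 = partitions, $2 = % each
idle 2 42   # A100 Large  -> 16
idle 3 28   # A100 Medium -> 16
idle 7 14   # A100 Small  -> 2
idle 2 50   # A30 Large   -> 0 (fully utilized)
```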

NOTE
If you are using a GPU host on GCP (Google Cloud Platform), you must reboot the host after deployment to complete MIG partitioning. If you do not reboot, MIG is not enabled.

To learn about MIG profile names, see MIG Device Names.

Preparing the GPU Environment

HPE Ezmeral Unified Analytics Software supports GPUs on Kubernetes nodes. The underlying hosts must be running an operating system and version that is supported on the corresponding version of HPE Ezmeral Unified Analytics Software.

HPE Ezmeral Unified Analytics Software supports user-provided deployment.

Preparing hosts to use GPUs in the user-provided host model:

To use GPUs in HPE Ezmeral Unified Analytics Software, you must prepare your hosts before installation. The requirements are as follows:
  • The host can be a bare-metal machine, a VM with GPU pass-through, or an AWS EC2 instance.
  • Install the latest version of the supported operating system. To learn about the supported operating system versions for GPU in HPE Ezmeral Unified Analytics Software, see Operating System.
    NOTE
    Do not use operating systems with pre-installed NVIDIA drivers. HPE Ezmeral Unified Analytics Software does not support operating systems with pre-installed NVIDIA drivers. The GPU operator automatically installs NVIDIA drivers when the host is added to HPE Ezmeral Unified Analytics Software.
  • Disable SELinux on the host before adding the host to HPE Ezmeral Unified Analytics Software. This is an NVIDIA limitation; see GPU Operator with RHEL8/SELinux.
    NOTE
    After the host is successfully added to the HPE Ezmeral Unified Analytics Software cluster and the NVIDIA driver is successfully installed through the GPU operator, you can enable SELinux on that host and set it to enforcing mode.
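As a sketch, disabling SELinux before adding the host typically looks like this on RHEL-family systems (standard OS commands, not specific to this product):

```shell
# Disable SELinux immediately (lasts until reboot):
sudo setenforce 0
# Persist the setting across reboots:
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# Later, after the host is added and the NVIDIA driver is installed,
# you can re-enable enforcing mode:
#   sudo setenforce 1   (and restore SELINUX=enforcing in the config file)
```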

To learn more about user-provided hosts, see Installing on User-Provided Hosts (Connected and Air-gapped Environments).

The supported environments are as follows:
  • vSphere VM: Configure the VMs in the GPU pass-through setup by following the steps in the VMware setting up GPU pass-through documentation. Then add the hosts to HPE Ezmeral Unified Analytics Software.
  • AWS: Use an AWS account with access to provision GPU-based instances (p4d.24xlarge and p4de.24xlarge EC2 instances). Deploy the A100 EC2 instance (P4d instance) with an AMI running a supported operating system. Then add the hosts to HPE Ezmeral Unified Analytics Software.

Adding Hosts and Enabling GPU Environment

After you have prepared hosts for the GPU-enabled environment, add them to HPE Ezmeral Unified Analytics Software during installation or during cluster expansion. After a host is added, the GPU is enabled automatically.

Adding Hosts and Selecting GPU Environment During Installation
To add hosts and select the GPU environment in the cluster during installation, follow these steps:
  1. Perform the installation instructions provided in the installation documentation for your deployment target until you reach the Installation Details step in the installation wizard. See Installation.
  2. In the Installation Details step, to enable the GPU, check Use GPU.
    1. vGPU: Specify the vGPU instances for your cluster.

      The number of vGPUs allocated depends on the GPU configuration partition size, the number of added GPU worker hosts, and the number of GPU cards per host. The number of allocated vGPUs may be less than the number of requested vGPUs.

      For example: If one A100 GPU host is added with two GPU cards with the following configurations:
      • vGPU request: 10 vGPUs
      • vGPU configuration: large
      Then the number of allocated vGPUs is as follows:
      • vGPUs allocated: 2 GPU cards x 2 large vGPUs per card = 4
    2. GPU Configuration: Specify the GPU partition size.


    As each node is added to the HPE Ezmeral Unified Analytics Software inventory node pool, HPE Ezmeral Unified Analytics Software configures the MIG profile if it detects MIG-capable devices (e.g., A100).

  3. To specify the details for other boxes or options in the Installation Details step and to complete the cluster installation, refer to the installation documentation for your deployment target. See Installation.
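The allocation rule in the example above can be sketched as: allocated vGPUs = min(requested, hosts x GPU cards per host x partitions per card). This formula is an assumption inferred from the example; the installer may apply additional constraints:

```shell
# Sketch of vGPU allocation (assumption: allocation is capped by the
# total number of hardware partitions available).
partitions_per_card() {
  case "$1" in
    whole)  echo 1 ;;
    large)  echo 2 ;;
    medium) echo 3 ;;
    small)  echo 7 ;;
  esac
}

allocated_vgpus() {  # args: requested hosts cards_per_host size
  local cap=$(( $2 * $3 * $(partitions_per_card "$4") ))
  echo $(( $1 < cap ? $1 : cap ))
}

allocated_vgpus 10 1 2 large   # -> 4 (matches the example above)
```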

Adding Hosts and Selecting GPU Environment During Cluster Expansion
To add hosts and select the GPU environment in the cluster during cluster expansion, follow these steps:
  1. Perform the steps to expand the cluster until you reach the Expand Cluster screen. See Expanding the Cluster.
  2. To enable the GPU, in the Expand Cluster screen, check Use GPU.
    NOTE
    If you enabled the Use GPU option during cluster installation, GPU is already enabled and you cannot disable the Use GPU option while expanding the cluster.
    1. vGPU: Specify the additional vGPU instances for your cluster.
      NOTE
      The number of additional vGPUs allocated depends on the GPU configuration partition size, the number of added GPU worker hosts, and the number of GPU cards per host. The number of allocated vGPUs may be less than the number of requested vGPUs.
    2. GPU Configuration: Specify the GPU partition size.
      NOTE
      If you selected the partition size during cluster installation, you cannot update the partition size while expanding the cluster.


  3. To specify the details for other boxes or options in the Expand Cluster screen and to complete the cluster expansion, see Expanding the Cluster.

Viewing GPU Model Information

To retrieve information about the GPU model installed in HPE Ezmeral Unified Analytics Software, run the following command on the host operating system:
lspci | grep -i nvidia

To learn more about supported operating systems for GPUs in HPE Ezmeral Unified Analytics Software, see Operating System.

Updating GPU Partition Size Post Installation

To update the GPU partition size, modify the gpunodeconfig CR as follows:

  1. Verify that no GPU workloads are running in the cluster. To verify, SSH into a GPU node and run the following command:
    chroot /run/nvidia/driver/ nvidia-smi
    Then, look for any processes consuming the GPU. If no processes are shown, the GPU is not in use. Ensure this is consistent across all GPU nodes in the cluster.
  2. To fetch the resource name, SSH into the workload control plane node and run the following command:
    gpunodeconfig=$(kubectl get gpunodeconfig -n hpecp-gpunodeconfig -o jsonpath='{.items[].metadata.name}')
  3. To update the partition size, update the enabled field with the partition size of your choice.

    For example, to update the partition size to large, run:

    kubectl patch gpunodeconfig $gpunodeconfig -n hpecp-gpunodeconfig --type merge -p '{"spec":{"enabled":"large"}}'
  4. To verify that the new partition size is applied in your cluster, run the following command inside the driver daemon pod in the hpecp-gpu-operator namespace:
    nvidia-smi
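The steps above can be combined into a single sketch. The driver daemonset name below is an assumption (not stated in this document); adjust it to match your deployment:

```shell
# Sketch: patch the partition size to "small" and verify.
# Assumes kubectl access from the workload control plane node.
set -e
ns=hpecp-gpunodeconfig
cr=$(kubectl get gpunodeconfig -n "$ns" -o jsonpath='{.items[0].metadata.name}')
kubectl patch gpunodeconfig "$cr" -n "$ns" --type merge -p '{"spec":{"enabled":"small"}}'
# Verify from the driver daemon pod (daemonset name is an assumption):
kubectl exec -n hpecp-gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L
```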

Integrating GPU with Tools and Frameworks

In HPE Ezmeral Unified Analytics Software, both imported and included tools and frameworks support GPU. With a MIG configuration, only one GPU is assigned per application. Applications request GPUs using the nvidia.com/gpu resource specifier.
NOTE
HPE Ezmeral Unified Analytics Software does not support the MIG specifier nvidia.com/mig-Xg.YYgb.
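As a sketch, an application requests one vGPU (one MIG partition) through the standard Kubernetes resource mechanism. The pod name and container image below are illustrative assumptions, not part of this product:

```shell
# Sketch: run a pod that requests one vGPU and lists the device it sees.
# Pod and image names are illustrative assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1   # supported specifier; nvidia.com/mig-Xg.YYgb is not
EOF
```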
The following applications and frameworks support GPU in HPE Ezmeral Unified Analytics Software: