GPU Driver Installation

Download NVIDIA GPU drivers from NVIDIA, install them on hosts, and test the installation after the hosts have been added to HPE Ezmeral Runtime Enterprise.

IMPORTANT

The host OS NVIDIA driver must be compatible with the NVIDIA driver included in your application image (for example, TensorFlow in the HPE Ezmeral ML Ops Training image).

For MIG support on GPUs, the driver must also support MIG.

See GPU and MIG Support.
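
As a quick compatibility check, you can query the installed host driver version with nvidia-smi, and on MIG-capable GPUs also query the current MIG mode. This is a hedged sketch; the mig.mode.current query field is only available on newer drivers:

  # List the GPUs and the installed host driver version.
  nvidia-smi --query-gpu=name,driver_version --format=csv
  # On MIG-capable GPUs (for example, A100), show whether MIG is currently enabled.
  nvidia-smi --query-gpu=mig.mode.current --format=csv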

If possible, install the NVIDIA GPU drivers before adding the host to HPE Ezmeral Runtime Enterprise.

If you update the OS kernel on a host, you must reinstall the NVIDIA GPU drivers on that host (see Steps 7-10 of the RHEL and CentOS procedure below).
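
A hedged sketch of a quick post-update check, assuming the driver was previously working on this host:

  # Show the running kernel version after the update.
  uname -r
  # If the driver no longer matches the running kernel, nvidia-smi cannot
  # communicate with the driver and no nvidia modules appear in lsmod output;
  # in that case, reinstall the driver.
  nvidia-smi
  lsmod | grep nvidia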

If you want to add GPUs or update GPU drivers after a host has been added to HPE Ezmeral Runtime Enterprise, do the following:

  • If the host is a Kubernetes host, remove the Worker from the Kubernetes cluster and then remove the host from HPE Ezmeral Runtime Enterprise.

Installing or Updating the GPU Driver on RHEL and CentOS Hosts

You must perform the following procedure on each RHEL or CentOS host that will supply GPU devices to the deployment.

  1. Install the GPU devices in the host.
  2. Locate the appropriate GPU device driver and libraries package on the NVIDIA Downloads Index (link opens an external website in a new browser tab or window), and then download it to each GPU-providing host. You will use the downloaded file in Step 7 of this procedure.

    IMPORTANT

    Select Linux 64-bit (not Linux 64-bit RHEL6 or similar) to obtain a runfile. Selecting a specific Linux distribution will download an RPM, which will not work with your HPE Ezmeral Runtime Enterprise deployment.

  3. If you are performing the initial driver installation, execute the following commands as the root user (you should not need to perform this step when upgrading existing drivers):
    yum update -y
    yum install -y kernel-devel kernel-headers gcc-c++ perl pciutils
    yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
  4. If any packages were updated by yum update (in the previous step), verify that the GPU driver still works (execute the command nvidia-smi and confirm that it returns the expected results). If the GPU driver is not working, reinstall the driver (see Steps 7-10).
  5. NOTE

    This step reboots the host. If HPE Ezmeral Runtime Enterprise is already installed on this host, and if virtual nodes/containers are assigned to this host, then this step will briefly interrupt those nodes/containers.

    If you are performing the initial driver installation, execute the following commands as the root user. (You should not need to perform this step when upgrading existing drivers.)

    cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
    blacklist nouveau
    options nouveau modeset=0
    EOF
    rmmod nouveau
    dracut --force
    reboot
  6. After reboot, verify that the nouveau module is not loaded by executing the command lsmod | grep nouveau (no output means that the module is not loaded).

    If nouveau is still loaded, then repeat Steps 5 and 6.

  7. Install or upgrade the host GPU driver by executing the following commands as the root user (the commands assume that the runfile downloaded in Step 2 is in /nvidia):

    cd /nvidia
    chmod +x ./NVIDIA-Linux-*.run
    ./NVIDIA-Linux-*.run -s
  8. Execute the command nvidia-smi to query the available GPU devices on the host, to verify successful installation.
  9. Execute the command nvidia-modprobe -u -c=0.

    This command is needed to probe the nvidia-uvm kernel module, which is necessary in order for HPE Ezmeral Runtime Enterprise to recognize the host as having GPUs.

  10. Reboot the Worker host. After the reboot, you can re-verify the driver installation; see the sketch following this procedure.
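
The following is a hedged sketch that combines the checks from Steps 8 and 9 into a single post-reboot verification pass; run it as the root user:

  # Confirm that the NVIDIA kernel modules are loaded.
  lsmod | grep nvidia
  # Confirm that the GPU device nodes exist, including /dev/nvidia-uvm,
  # which HPE Ezmeral Runtime Enterprise requires to recognize the host GPUs.
  ls -al /dev/nvidia*
  # If /dev/nvidia-uvm is missing, re-probe the nvidia-uvm kernel module (Step 9).
  nvidia-modprobe -u -c=0
  # Query the available GPU devices (Step 8).
  nvidia-smi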

Installing GPU Drivers on SLES Hosts

You must perform the following procedure on each SLES host that will supply GPU devices to the deployment.

  1. Install the GPU devices in the host.
  2. Determine which version of the NVIDIA GPU drivers to install.

    The host OS NVIDIA driver must be compatible with the NVIDIA driver included in the application image. See GPU and MIG Support.

  3. Locate the appropriate GPU device driver and libraries package on the NVIDIA Downloads Index (link opens an external website in a new browser tab or window), and then download it to each GPU-providing host.
  4. Install the drivers and CUDA packages as appropriate for this operating system. See NVIDIA Driver Installation Quickstart Guide (link opens an external website in a new browser tab or window).
  5. To verify successful installation, execute the command to query the available GPU devices on the host:
    nvidia-smi
  6. Set permissions to enable read and write access to the GPU devices by tenant users (a scripted sketch of this change follows the procedure):
    1. In the /etc/modprobe.d/50-nvidia-default.conf file, change the entry NVreg_DeviceFileMode=0660 to the following:

      NVreg_DeviceFileMode=0666

    2. Reboot the host.
    3. Verify that the permissions are read and write for all users of the GPU devices:
      ls -al /dev/nvidia*
      crw-rw-rw-. 1 root video 195,   0 Jun 21 11:21 /dev/nvidia0
      crw-rw-rw-. 1 root video 195, 255 Jun 21 11:21 /dev/nvidiactl
      ...
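
If you prefer to script the change from Step 6, the following is a hedged sketch; it assumes that the NVreg_DeviceFileMode entry already exists in /etc/modprobe.d/50-nvidia-default.conf:

  # Change the device file mode from 0660 to 0666 in place.
  sed -i 's/NVreg_DeviceFileMode=0660/NVreg_DeviceFileMode=0666/' /etc/modprobe.d/50-nvidia-default.conf
  # Confirm the change, and then reboot the host for it to take effect.
  grep NVreg_DeviceFileMode /etc/modprobe.d/50-nvidia-default.conf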

Testing the Installation (Kubernetes Pods)

To test GPU installation in Kubernetes, see Using GPUs in Kubernetes Pods.
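
For a quick smoke test before consulting that topic, the following is a hedged sketch of a minimal pod that requests one GPU. It assumes kubectl access to the cluster, that the NVIDIA device plugin exposes the nvidia.com/gpu resource, and that the referenced CUDA image is reachable from your registry (the image name is an assumption, not an HPE Ezmeral Runtime Enterprise requirement):

  # Create a throwaway pod that requests one GPU and runs nvidia-smi.
  cat <<EOF | kubectl apply -f -
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-smoke-test
  spec:
    restartPolicy: Never
    containers:
    - name: gpu-smoke-test
      image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumed image; substitute your own
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  EOF
  # Inspect the pod logs; nvidia-smi output indicates the GPU is visible inside the pod.
  kubectl logs gpu-smoke-test
  # Clean up.
  kubectl delete pod gpu-smoke-test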