GPU Driver Installation
Download NVIDIA GPU drivers from NVIDIA, install them on hosts, and test the installation after the hosts have been added to HPE Ezmeral Runtime Enterprise.
The NVIDIA driver on the host OS must be compatible with the NVIDIA driver included in your application image (for example, TensorFlow in the HPE Ezmeral ML Ops Training image).
For MIG support on GPUs, the driver must also support MIG.
See GPU and MIG Support.
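One way to check the host side of this compatibility requirement is to read the installed driver version and compare it with the minimum version that your application image needs. The following is a minimal sketch, not part of the product procedure: the helper name version_ge and the value of REQUIRED_DRIVER are hypothetical placeholders, and treating compatibility as "at least a minimum version" is an assumption. See GPU and MIG Support for the actual requirement. The nvidia-smi query needs an installed driver, so it appears only as commented usage.

```shell
#!/usr/bin/env bash
# Sketch: compare the host NVIDIA driver version with a required minimum.
# REQUIRED_DRIVER is a hypothetical placeholder, not a value from HPE or NVIDIA.
REQUIRED_DRIVER="470.57.02"

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a host that already has the driver installed:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   if version_ge "$installed" "$REQUIRED_DRIVER"; then
#       echo "driver $installed meets the assumed minimum $REQUIRED_DRIVER"
#   else
#       echo "driver $installed is older than $REQUIRED_DRIVER" >&2
#   fi
```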
If possible, install the NVIDIA GPU drivers before adding the host to HPE Ezmeral Runtime Enterprise.
If you update the OS kernel on a host, you must reinstall the NVIDIA GPU drivers on that host (see Steps 8-11).
If you want to add GPUs or update GPU drivers after a host has been added to HPE Ezmeral Runtime Enterprise, do the following:
- If the host is a Kubernetes host, remove the Worker from the Kubernetes cluster and then remove the host from HPE Ezmeral Runtime Enterprise.
Installing or Updating the GPU Driver on RHEL and CentOS Hosts
You must perform the following procedure on each RHEL or CentOS host that will supply GPU devices to the deployment.
- Install the GPU devices in the host.
- Locate the appropriate GPU device driver and libraries package on the NVIDIA Downloads Index, and then download it to each GPU-providing host. You will use the downloaded file in both Steps 3 and 7 of this procedure.
IMPORTANT: Select Linux 64-bit (not Linux 64-bit RHEL6 or similar) to obtain a runfile. Selecting a specific Linux distribution downloads an RPM, which will not work with your HPE Ezmeral Runtime Enterprise deployment.
- If you are performing the initial driver installation, execute the following commands as the root user (you should not need to perform this step when upgrading existing drivers):
yum update -y
yum install -y kernel-devel kernel-headers gcc-c++ perl pciutils
yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
- If any packages were updated by yum update in the previous step, ensure that the GPU driver is still working. You can execute the following command and check for the expected results:
nvidia-smi
If the GPU driver is not working, reinstall the driver (see Steps 8-11).
-
NOTE: This step reboots the host. If HPE Ezmeral Runtime Enterprise is already installed on this host, and if virtual nodes/containers are assigned to this host, then this step will briefly interrupt those nodes/containers.
If you are performing the initial driver installation, execute the following commands as the root user. (You should not need to perform this step when upgrading existing drivers.)
cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
rmmod nouveau
dracut --force
reboot
- After the reboot, verify that the nouveau module is not loaded by executing the following command:
lsmod | grep nouveau
If the command prints any output, nouveau is still loaded. In that case, repeat Steps 5 and 6.
- Install or upgrade the host GPU driver by executing the following commands as the root user. The commands assume that the downloaded runfile is in the /nvidia directory:
cd /nvidia
chmod +x ./NVIDIA-Linux-*.run
./NVIDIA-Linux-*.run -s
- To verify successful installation, execute the following command to query the available GPU devices on the host:
nvidia-smi
- Execute the following command:
nvidia-modprobe -u -c=0
This command loads the nvidia-uvm kernel module, which HPE Ezmeral Runtime Enterprise requires in order to recognize the host as having GPUs.
- Reboot the Worker host.
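The verification steps in this procedure (nouveau unloaded, nvidia-smi responding, nvidia-uvm loaded) can be collected into a small script. This is a minimal sketch under stated assumptions, not an HPE-provided tool: the function names module_loaded and check_gpu_host are hypothetical, and the check relies on lsmod listing the nvidia-uvm module as nvidia_uvm, because the kernel reports module names with underscores.

```shell
#!/usr/bin/env bash
# Sketch: post-installation checks for a GPU host.
# Function names are illustrative; only lsmod, nvidia-smi, and
# nvidia-modprobe come from the procedure.

# module_loaded NAME: reads lsmod-style text on stdin and succeeds
# when NAME appears in the first (module name) column.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# check_gpu_host: runs the three checks and stops at the first failure.
check_gpu_host() {
    if lsmod | module_loaded nouveau; then
        echo "FAIL: nouveau is still loaded" >&2
        return 1
    fi
    if ! nvidia-smi > /dev/null 2>&1; then
        echo "FAIL: nvidia-smi did not run successfully" >&2
        return 1
    fi
    # lsmod shows the nvidia-uvm module as nvidia_uvm.
    if ! lsmod | module_loaded nvidia_uvm; then
        echo "FAIL: nvidia_uvm is not loaded; run nvidia-modprobe -u -c=0" >&2
        return 1
    fi
    echo "OK: GPU driver checks passed"
}
```

Call check_gpu_host after running nvidia-modprobe. A nonzero return status means that one of the checks failed, and the message on standard error names the failing check.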
Installing GPU Drivers on SLES Hosts
You must perform the following procedure on each SLES host that will supply GPU devices to the deployment.
- Install the GPU devices in the host.
- Determine which version of the NVIDIA GPU drivers to install.
The host OS NVIDIA driver must be compatible with the NVIDIA driver included in the application image. See GPU and MIG Support.
- Locate the appropriate GPU device driver and libraries package on the NVIDIA Downloads Index, and then download it to each GPU-providing host.
- Install the drivers and CUDA packages as appropriate for this operating system. See the NVIDIA Driver Installation Quickstart Guide.
- To verify successful installation, execute the following command to query the available GPU devices on the host:
nvidia-smi
- Set permissions to enable read and write access to the GPU devices by tenant users:
  - In the /etc/modprobe.d/50-nvidia-default.conf file, change the entry
    NVreg_DeviceFileMode=0660
    to the following:
    NVreg_DeviceFileMode=0666
  - Reboot the host.
  - Verify that the permissions are read and write for all users of the GPU devices:
    ls -al /dev/nvidia*
    crw-rw-rw-. 1 root video 195, 0 Jun 21 11:21 /dev/nvidia0
    crw-rw-rw-. 1 root video 195, 255 Jun 21 11:21 /dev/nvidiactl
    ...
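The permission change in this procedure can also be scripted. The following is a minimal sketch, assuming that the entry appears in the file literally as NVreg_DeviceFileMode=0660. The function name set_device_file_mode is hypothetical, and the .bak backup copy is an addition that the procedure itself does not call for.

```shell
#!/usr/bin/env bash
# Sketch: switch NVreg_DeviceFileMode from 0660 to 0666 in a modprobe config file.
# The function name is illustrative; the path and values come from the procedure.

# set_device_file_mode FILE: edits FILE in place and keeps a FILE.bak copy.
# Fails if the 0666 setting is not present afterwards.
set_device_file_mode() {
    local conf="$1"
    sed -i.bak 's/NVreg_DeviceFileMode=0660/NVreg_DeviceFileMode=0666/' "$conf" || return 1
    grep -q 'NVreg_DeviceFileMode=0666' "$conf"
}

# Usage as root on the SLES host, followed by a reboot:
#   set_device_file_mode /etc/modprobe.d/50-nvidia-default.conf
# After the reboot, print the octal mode of each device file (GNU stat);
# each line is expected to start with 666:
#   stat -c '%a %n' /dev/nvidia*
```

Checking the octal mode with stat gives the same information as the ls -al listing in the procedure, in a form that is easier to compare in a script.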
Testing the Installation (Kubernetes Pods)
To test GPU installation in Kubernetes, see Using GPUs in Kubernetes Pods.