Tutorial: Training a PyTorch Model (PyTorch MNIST)

Before beginning this tutorial, download the Kubeflow tutorials zip file if you have not already done so. The zip file contains sample files for all of the included Kubeflow tutorials.

To complete this tutorial:

Log in to the KubeDirector notebook as an LDAP user.
Create or upload the .yaml file for the PyTorch job: pytorch-mnist-ddp-cpu.yaml.
Open a terminal: either the web terminal in the HPE Ezmeral Runtime Enterprise UI, or a Terminal session within the KubeDirector notebook.
NOTE
By default, you cannot execute kubectl commands in a newly created KubeDirector notebook. To enable kubectl in a notebook, use one of the following methods:
- Through the HPE Ezmeral Runtime Enterprise UI:
  1. In the HPE Ezmeral Runtime Enterprise UI, navigate to the Tenant section and initialize a web terminal with the corresponding button.
  2. Start a new Terminal session inside the KubeDirector notebook. Check that the files inside your KubeDirector notebook have the appropriate file permissions that allow you to work with them.
  3. Move all files you want to work with to the following path:
```
/bd-fs-mnt/TenantShare
```
  4. You can now access the files inside the web terminal with kubectl.
- From inside the KubeDirector notebook:
  1. To authorize your user inside the KubeDirector notebook, execute the following Jupyter code cell:
```
from ezmllib.kubeconfig.ezkubeconfig import set_kubeconfig
set_kubeconfig()
```
  2. A prompt appears below the code cell you executed. Enter your user password in the prompt.
  3. kubectl is now enabled for your KubeDirector notebook. Start a Terminal session in the KubeDirector notebook to work with kubectl.
Create the PyTorch job:
```
kubectl apply -f pytorch-mnist-ddp-cpu.yaml
```
IMPORTANT
To complete this tutorial in an air-gapped environment, you must do the following:
1. Push the bluedata/pytorch:mnist-ddp-cpu image to your air-gap registry.
2. Add the prefix of your air-gap registry before the image name in the .yaml file. For example:
```
<air-gap-registry>/bluedata/pytorch:mnist-ddp-cpu
```
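The image prefix in step 2 can also be applied with a short script instead of a manual edit. The following is a minimal sketch, assuming the manifest references the image on lines of the form `image: bluedata/pytorch:mnist-ddp-cpu`. The helper name `prefix_images`, the sample manifest fragment, and the registry host `registry.example.com:5000` are hypothetical and are not part of the tutorial files.

```python
import re

IMAGE = "bluedata/pytorch:mnist-ddp-cpu"

def prefix_images(manifest_text, registry, image=IMAGE):
    """Return manifest_text with `image: <image>` rewritten to `image: <registry>/<image>`."""
    registry = registry.rstrip("/")
    # Match "image:" followed directly by the unprefixed image, optionally quoted.
    # An image that already carries a registry prefix does not match, so the
    # rewrite is safe to run more than once.
    pattern = re.compile(r'(image:\s*["\']?)' + re.escape(image))
    return pattern.sub(lambda m: f"{m.group(1)}{registry}/{image}", manifest_text)

# Hypothetical fragment of a manifest, used only to show the rewrite.
sample = (
    "containers:\n"
    "  - name: pytorch\n"
    "    image: bluedata/pytorch:mnist-ddp-cpu\n"
)
print(prefix_images(sample, "registry.example.com:5000"))
```

To use it on the real file, read pytorch-mnist-ddp-cpu.yaml into a string, pass it through `prefix_images` with your own registry, and write the result back before running kubectl apply.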

Verify the PyTorch job is created:

$ kubectl get pytorchjobs
NAME                    STATE       AGE
pytorch-mnist-ddp-cpu   Created     3s

Verify the related pods are created:

$ kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
NAME                             READY   STATUS              RESTARTS   AGE
pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-1   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-2   0/1     Init:0/1            0          5s

Check the status of the PyTorch job pods, and wait until all pods have the status Completed:
```
kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
```
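From a notebook cell, the "all pods Completed" check can be automated by parsing the table that kubectl prints. The following is a rough sketch, assuming the default table output with NAME and STATUS columns as shown earlier in this tutorial. The helper names `pod_statuses` and `all_completed` are hypothetical.

```python
def pod_statuses(table_text):
    """Map pod name -> STATUS from the default text output of `kubectl get pods`."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_col]] = fields[status_col]
    return statuses

def all_completed(table_text):
    """True only if at least one pod is listed and every pod has status Completed."""
    statuses = pod_statuses(table_text)
    return bool(statuses) and all(s == "Completed" for s in statuses.values())

# Output copied from the pod listing shown earlier in this tutorial.
sample = """\
NAME                             READY   STATUS              RESTARTS   AGE
pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
"""
print(pod_statuses(sample))
print(all_completed(sample))
```

In practice the table text would come from running the kubectl command above, for example through `subprocess.run(..., capture_output=True, text=True).stdout`, and the check would be repeated with a short sleep between attempts.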

Inspect the logs to observe PyTorch training progress:

PODNAME=$(kubectl get pods -l job-name=pytorch-mnist-ddp-cpu,replica-type=master,replica-index=0 -o name)
kubectl logs -f ${PODNAME}
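
The same two-step lookup (find the master pod, then follow its logs) can be run from a notebook cell. This is a minimal sketch, assuming kubectl is already enabled for the notebook and is on the PATH. The function names are hypothetical, and only the command construction runs when the cell is executed; `follow_master_logs` must be called explicitly.

```python
import subprocess

SELECTOR = "job-name=pytorch-mnist-ddp-cpu,replica-type=master,replica-index=0"

def master_pod_cmd(selector=SELECTOR):
    """Command that prints the master pod as `pod/<name>`."""
    return ["kubectl", "get", "pods", "-l", selector, "-o", "name"]

def logs_cmd(pod_name):
    """Command that streams the logs of the given pod."""
    return ["kubectl", "logs", "-f", pod_name]

def follow_master_logs():
    """Look up the master pod, then stream its logs until the pod exits."""
    result = subprocess.run(master_pod_cmd(), capture_output=True, text=True, check=True)
    pod = result.stdout.strip()
    if not pod:
        raise RuntimeError("no master pod found for selector " + SELECTOR)
    subprocess.run(logs_cmd(pod), check=True)

print(" ".join(master_pod_cmd()))
```

Assigning the pod name first and using it in a separate command matters: the name has to be known before the logs command is built.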

You can also check the status of the PyTorch job with the describe command:

kubectl describe pytorchjob pytorch-mnist-ddp-cpu
...
//message: PyTorchJob pytorch-mnist-ddp-cpu is successfully completed.
...
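
Instead of reading the describe output by eye, the job state can be extracted from the JSON form of the job, for example from `kubectl get pytorchjob pytorch-mnist-ddp-cpu -o json`. The sketch below assumes the job object exposes a `status.conditions` list whose entries carry `type`, `status`, and `message` fields, as Kubeflow training jobs generally do. The helper name `job_state` is hypothetical, and the sample JSON is hand-written for illustration rather than captured from a cluster.

```python
import json

def job_state(job_json):
    """Return (type, message) of the last condition whose status is "True", or None."""
    conditions = json.loads(job_json).get("status", {}).get("conditions", [])
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        return None
    last = active[-1]
    return last.get("type"), last.get("message", "")

# Hand-written sample; only the final message is taken from this tutorial.
sample = json.dumps({
    "status": {
        "conditions": [
            {"type": "Created", "status": "True", "message": "created"},
            {"type": "Running", "status": "False", "message": "running"},
            {
                "type": "Succeeded",
                "status": "True",
                "message": "PyTorchJob pytorch-mnist-ddp-cpu is successfully completed.",
            },
        ]
    }
})
print(job_state(sample))
```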

Clean up after the job run:

kubectl delete -f pytorch-mnist-ddp-cpu.yaml
...
//message: 
persistentvolumeclaim "pvcpy" deleted
pytorchjob.kubeflow.org "pytorch-mnist-ddp-cpu" deleted
...

HPE Ezmeral Runtime Enterprise 5.6 Documentation
Abstract	HPE Ezmeral Container Platform is a unified container platform built on open source Kubernetes and designed for both cloud-native applications and non-cloud-native applications running on any infrastructure either on-premises, in multiple public clouds, in a hybrid model, or at the edge.
Published	July 2024
Edition	5.6.0