Tutorial: Training a PyTorch Model (PyTorch MNIST)
Before beginning this tutorial, download the Kubeflow tutorials zip file if you have not already done so. The zip file contains sample files for all of the included Kubeflow tutorials.
To complete this tutorial:
- Log in to the KubeDirector notebook as an LDAP user.
- Create or upload the .yaml file for the PyTorch job: pytorch-mnist-ddp-cpu.yaml.
- Open the web terminal in the HPE Ezmeral Runtime Enterprise UI, or from the terminal within the KubeDirector notebook.
  NOTE: By default, you cannot execute kubectl commands in a newly created KubeDirector notebook. To enable kubectl in a notebook, select one of the following methods:
  - Through the HPE Ezmeral Runtime Enterprise UI:
    - In the HPE Ezmeral Runtime Enterprise UI, navigate to the Tenant section and initialize a web terminal with the corresponding button.
    - Start a new Terminal session inside the KubeDirector notebook. Check that the files inside your KubeDirector notebook have the appropriate file permissions that allow you to work with them.
    - Move all files you want to work with to the following path:
      /bd-fs-mnt/TenantShare
    - You can now access the files inside the web terminal with kubectl.
  - From inside the KubeDirector notebook:
    - To authorize your user inside the KubeDirector notebook, execute the following Jupyter code cell:
      from ezmllib.kubeconfig.ezkubeconfig import set_kubeconfig
      set_kubeconfig()
    - A prompt appears below the code cell you executed. Enter your user password in the prompt.
    kubectl is now enabled for your KubeDirector notebook. Start a Terminal session in the KubeDirector notebook to work with kubectl.
- Create the PyTorch job:
kubectl apply -f pytorch-mnist-ddp-cpu.yaml
  IMPORTANT: To complete this tutorial in an Air Gapped environment, you must perform the following:
  - Push the bluedata/pytorch:mnist-ddp-cpu image to your Air Gap registry.
  - Add the prefix of your Air Gap registry before the image name within the .yaml file. For example:
    <air-gap-registry>/bluedata/pytorch:mnist-ddp-cpu
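The image prefixing described in the Air Gap note can be sketched in Python. This is a minimal illustration, not part of the tutorial files: the function name prefix_image and the registry value registry.example.com:5000 are hypothetical, and the function only builds the string that you would then place in the image field of the .yaml file.

```python
def prefix_image(image: str, registry: str) -> str:
    """Return the image reference with an Air Gap registry prefix.

    `registry` is a hypothetical value such as "registry.example.com:5000";
    substitute the address of your own Air Gap registry.
    """
    registry = registry.strip().rstrip("/")
    if not registry:
        return image  # no registry given: leave the image unchanged
    if image.startswith(registry + "/"):
        return image  # already prefixed: avoid doubling the prefix
    return f"{registry}/{image}"


if __name__ == "__main__":
    # The image name comes from the tutorial; the registry is an assumed example.
    print(prefix_image("bluedata/pytorch:mnist-ddp-cpu", "registry.example.com:5000"))
```

The check for an existing prefix makes the helper safe to apply more than once to the same value.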
- Verify the PyTorch job is created:
$ kubectl get pytorchjobs
NAME                    STATE     AGE
pytorch-mnist-ddp-cpu   Created   3s
- Verify the related pods are created:
$ kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
NAME                             READY   STATUS              RESTARTS   AGE
pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-1   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-2   0/1     Init:0/1            0          5s
- Verify the status of the PyTorch job pods. Wait until all pods have the status Completed:
kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
- Inspect the logs to observe PyTorch training progress:
PODNAME=$(kubectl get pods -l job-name=pytorch-mnist-ddp-cpu,replica-type=master,replica-index=0 -o name)
kubectl logs -f ${PODNAME}
You can also check the status of the PyTorch job with the describe command:
kubectl describe pytorchjob pytorch-mnist-ddp-cpu
...
//message: PyTorchJob pytorch-mnist-ddp-cpu is successfully completed.
...
- Clean up after the job run:
kubectl delete -f pytorch-mnist-ddp-cpu.yaml
...
//message: persistentvolumeclaim "pvcpy" deleted
pytorchjob.kubeflow.org "pytorch-mnist-ddp-cpu" deleted
...
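The "wait until all pods have the status Completed" check from the steps above can be sketched in Python, for example in a notebook cell. This is an illustrative helper, not part of the tutorial files: the function names pod_statuses and all_completed are hypothetical, and the code assumes the default table output of kubectl get pods (a header row containing NAME and STATUS, with whitespace-separated columns). It only parses text that you pass in; it does not call kubectl itself.

```python
def pod_statuses(kubectl_output: str) -> dict:
    """Map each pod name to its STATUS, given `kubectl get pods` table text."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_col]] = fields[status_col]
    return statuses


def all_completed(kubectl_output: str) -> bool:
    """Return True only if at least one pod is listed and all are Completed."""
    statuses = pod_statuses(kubectl_output)
    return bool(statuses) and all(s == "Completed" for s in statuses.values())


if __name__ == "__main__":
    # Sample text modeled on the pod listing shown in this tutorial.
    sample = """\
NAME                             READY   STATUS              RESTARTS   AGE
pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
"""
    print(pod_statuses(sample))
    print(all_completed(sample))
```

Returning False for an empty listing is a deliberate choice: a label selector that matches no pods should not be mistaken for a finished job.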