Tutorial: Training a PyTorch Model (PyTorch MNIST)

Before beginning this tutorial, download the Kubeflow tutorials zip file if you have not already done so. It contains sample files for all of the included Kubeflow tutorials.

To complete this tutorial:

  1. Log in to the KubeDirector notebook as an LDAP user.
  2. Create or upload the .yaml file for the PyTorch job: pytorch-mnist-ddp-cpu.yaml.
  3. Open a web terminal in the HPE Ezmeral Runtime Enterprise UI, or open a terminal within the KubeDirector notebook.
    NOTE By default, you cannot execute kubectl commands in a newly created KubeDirector notebook. To enable kubectl in a notebook, select one of the following methods:
    • Through the HPE Ezmeral Runtime Enterprise UI:
      1. In the HPE Ezmeral Runtime Enterprise UI, navigate to the Tenant section and initialize a web terminal with the corresponding button.
      2. Start a new Terminal session inside the KubeDirector notebook. Check that the files inside your KubeDirector notebook have file permissions that allow you to work with them.
      3. Move all files you want to work with to the following path:
        /bd-fs-mnt/TenantShare
      4. You can now access the files inside the web terminal with kubectl.
    • From inside the KubeDirector notebook:
      1. To authorize your user inside the KubeDirector notebook, execute the following Jupyter code cell:
        from ezmllib.kubeconfig.ezkubeconfig import set_kubeconfig
        set_kubeconfig()
      2. A prompt appears below the code cell you executed. Enter your user password in the prompt.
      3. kubectl is now enabled for your KubeDirector notebook. Start a Terminal session in the KubeDirector notebook to work with kubectl.
  4. Create the PyTorch job:
    kubectl apply -f pytorch-mnist-ddp-cpu.yaml
    IMPORTANT To complete this tutorial in an air-gapped environment, you must do the following before creating the job:
    1. Push the bluedata/pytorch:mnist-ddp-cpu image to your air-gap registry.
    2. Add the prefix of your air-gap registry before the image name within the .yaml file. For example:
      <air-gap-registry>/bluedata/pytorch:mnist-ddp-cpu
  5. Verify the PyTorch job is created:
    $ kubectl get pytorchjobs
    NAME                    STATE       AGE
    pytorch-mnist-ddp-cpu   Created     3s
  6. Verify the related pods are created:
    $ kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
    NAME                             READY   STATUS              RESTARTS   AGE
    pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
    pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
    pytorch-mnist-ddp-cpu-worker-1   0/1     Init:0/1            0          6s
    pytorch-mnist-ddp-cpu-worker-2   0/1     Init:0/1            0          5s
  7. Check the status of the PyTorch job pods. Wait until all pods have the status Completed:
    kubectl get pods -l job-name=pytorch-mnist-ddp-cpu
  8. Inspect the logs to observe PyTorch training progress:
    PODNAME=$(kubectl get pods -l job-name=pytorch-mnist-ddp-cpu,replica-type=master,replica-index=0 -o name)
    kubectl logs -f "${PODNAME}"
    You can also check the status of the PyTorch job with the describe command:
    kubectl describe pytorchjob pytorch-mnist-ddp-cpu
    ...
    //message: PyTorchJob pytorch-mnist-ddp-cpu is successfully completed.
    ...
  9. Clean up after the job run:
    kubectl delete -f pytorch-mnist-ddp-cpu.yaml
    ...
    //message: 
    persistentvolumeclaim "pvcpy" deleted
    pytorchjob.kubeflow.org "pytorch-mnist-ddp-cpu" deleted
    ...
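
Steps 6 and 7 amount to rereading the STATUS column until every pod reports Completed. From inside the notebook, that check can be scripted. The following is a minimal Python sketch and not part of the original tutorial: the function names all_pods_completed and fetch_pod_listing are hypothetical, the sample listing is copied from step 6, and fetch_pod_listing assumes kubectl has already been enabled for the notebook as described in step 3.

```python
import subprocess


def all_pods_completed(listing: str) -> bool:
    """Return True if every pod row in `kubectl get pods` output has STATUS Completed."""
    rows = [line.split() for line in listing.strip().splitlines()]
    if not rows:
        # Empty output: no pods were listed, so nothing has completed.
        return False
    header, pods = rows[0], rows[1:]
    status_col = header.index("STATUS")
    return bool(pods) and all(pod[status_col] == "Completed" for pod in pods)


def fetch_pod_listing(job_name: str = "pytorch-mnist-ddp-cpu") -> str:
    """Run the command from step 7 and return its text output (assumes kubectl works here)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-l", f"job-name={job_name}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Sample listing taken from step 6, where the pods are still starting.
sample = """\
NAME                             READY   STATUS              RESTARTS   AGE
pytorch-mnist-ddp-cpu-master-0   0/1     ContainerCreating   0          6s
pytorch-mnist-ddp-cpu-worker-0   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-1   0/1     Init:0/1            0          6s
pytorch-mnist-ddp-cpu-worker-2   0/1     Init:0/1            0          5s
"""

print(all_pods_completed(sample))  # prints False
```

In a notebook cell, calling all_pods_completed(fetch_pod_listing()) in a loop with a short sleep between calls would approximate the manual wait in step 7. The parsing is kept separate from the kubectl call so it can be checked against saved output without cluster access.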
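
For the air-gapped case in step 4, the registry prefix has to be added to each image name in the .yaml file. The sketch below shows one way to do that with a text substitution. It is an illustration under stated assumptions, not the tutorial's own method: the function names are hypothetical, registry.example.com:5000 is a placeholder registry, the manifest fragment is invented for the example (the real pytorch-mnist-ddp-cpu.yaml may be laid out differently), and the pattern only handles unquoted image values.

```python
import re

# Matches an unquoted `image: <name>` line, optionally starting a list item.
IMAGE_LINE = re.compile(r"^([ \t]*(?:-[ \t]*)?image:[ \t]*)(\S+)[ \t]*$", re.MULTILINE)


def with_registry_prefix(image: str, registry: str) -> str:
    """Prepend the registry to an image name unless it is already there."""
    registry = registry.rstrip("/")
    if image.startswith(registry + "/"):
        return image
    return f"{registry}/{image}"


def prefix_manifest_images(manifest_text: str, registry: str) -> str:
    """Rewrite every unquoted `image:` value in the manifest text."""
    return IMAGE_LINE.sub(
        lambda m: m.group(1) + with_registry_prefix(m.group(2), registry),
        manifest_text,
    )


# Hypothetical fragment; only the image name comes from the tutorial.
fragment = """\
containers:
  - name: pytorch
    image: bluedata/pytorch:mnist-ddp-cpu
"""

print(prefix_manifest_images(fragment, "registry.example.com:5000"))
```

Because with_registry_prefix leaves an already prefixed image unchanged, running the rewrite twice on the same file does not add the prefix twice. Review the resulting file before running kubectl apply.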