Tutorial: Transition from KubeDirector to Kubeflow Training

This tutorial provides use cases to help you transition from KubeDirector training and deployment to their Kubeflow equivalents.

Prerequisites:
  • This tutorial assumes you have an existing KubeDirector notebook cluster up and running.
  • Before beginning this tutorial, download the KubeDirector transition tutorials zip file, which contains sample files for all the included KubeDirector tutorials.

Tutorial 1: Transition From KubeDirector Training to TFJob

  1. Provision the KubeDirector Training cluster:
    1. Apply the training.yaml file included under templates:
      kubectl apply -f training.yaml -n <tenant>
    2. Check the provisioning status of the cluster:
      kubectl get pods -n <tenant> | grep train1
    3. Run the example notebook training_sample.ipynb, which runs the sample TensorFlow script using the KubeDirector Training cluster.
  2. Run the TensorFlow job:

    Next, run the same training script using Kubeflow TFJob.

    You can run the scripts for this step using the tutorial.ipynb notebook included in both the tensorflow/KServe and tensorflow/Seldon folders. Select the folder corresponding to the type of inferencing that you want to run.

    The steps in the notebook are explained in detail below.

    1. Create an image that includes the required scripts and relevant datasets from the sample zip file. This image acts as the basis of the TFJob. Ensure that the required training and dataset files are available on your local machine.
    2. You must have access to a Docker daemon to build the image and push it to a compatible Docker registry. To install Docker, see the official Docker documentation.

      Ensure that the Docker registry is accessible from the HPE Ezmeral Runtime Enterprise cluster.

    3. Run the scripts using the tutorial.ipynb notebook from the folder that corresponds to your chosen inferencing type.

      The steps included in the notebook are as follows:

      1. Create a basic Dockerfile following the template in Dockerfile. Make sure to add the datasets and scripts provided in the sample to the image.
      2. After the image is ready, build and push the image to the registry:
        docker build -t <docker_image_name_with_tag> .
        docker push <docker_image_name_with_tag>

        The pushed image now serves as the base image during the training phase.

  3. Before beginning training, create a PVC for the saved model (a sketch of such a manifest appears after this procedure):
    1. Open and apply the PVC YAML available as part of the training folder:
      kubectl apply -f tfjob-pvc.yaml
    2. Verify that the PVC is created and in a bound state:
      kubectl get pvc
  4. Apply the TFJob CR YAML to run the training (a sketch of such a CR appears after this procedure):
    • If you are using KServe inferencing:
      kubectl -n <namespace> apply -f tfjob_kserve.yaml
    • If you are using Seldon inferencing:
      kubectl -n <namespace> apply -f tfjob_seldon.yaml
  5. A TFJob is created and a pod is provisioned to run the training. The output of the training is a model file saved in the associated PVC:
    kubectl get pods -n <namespace> | grep tfjob
    When the pod reaches the Completed state, the model building is complete. You can now deploy the generated model with KServe or Seldon. See: Tutorial 3: Inferencing with KServe or Tutorial 4: Inferencing with Seldon Core.
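
For reference, here is a minimal sketch of what a PVC manifest such as tfjob-pvc.yaml might contain, assuming a small single-writer volume on the cluster's default StorageClass; the actual file in the sample zip is authoritative, and pytorch-pvc.yaml in Tutorial 2 follows the same pattern:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: tfjob-pvc          # hypothetical name; must match the claimName referenced by the TFJob CR
  spec:
    accessModes:
      - ReadWriteOnce        # a single training pod writes the model to this volume
    resources:
      requests:
        storage: 1Gi         # assumed size; adjust to fit your saved model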
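
Likewise, here is a hedged sketch of a TFJob CR along the lines of tfjob_kserve.yaml or tfjob_seldon.yaml, assuming a single non-distributed worker, the image pushed in step 2, and the PVC created in step 3; consult the sample files for the exact spec:

  apiVersion: kubeflow.org/v1
  kind: TFJob
  metadata:
    name: tfjob                        # hypothetical name; the sample CRs may use another
  spec:
    tfReplicaSpecs:
      Worker:
        replicas: 1                    # single-worker, non-distributed training
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow       # the Kubeflow training operator expects this container name
                image: <docker_image_name_with_tag>
                command: ["python", "/opt/train.py"]   # hypothetical path of the training script in the image
                volumeMounts:
                  - name: model-store
                    mountPath: /mnt/models             # hypothetical path where the script saves the model
            volumes:
              - name: model-store
                persistentVolumeClaim:
                  claimName: tfjob-pvc                 # the PVC created in the previous step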

Tutorial 2: Transition From KubeDirector Training to PyTorchJob

This tutorial uses the notebook under examples/mlflow/PyTorch_sample.ipynb as an example. Sample scripts for this tutorial are located in the tutorials/pytorch folder in the sample zip file.
  1. Upload the notebook PyTorch_sample.ipynb to your KubeDirector notebook cluster. Familiarize yourself with the script, then proceed with the following steps to run the same script as part of a Kubeflow PyTorchJob.
  2. Create an image that includes the required scripts and relevant datasets from the sample zip file. This image acts as the basis of the PyTorchJob. Ensure that the required training and dataset files are available on your local machine.
  3. You must have access to a Docker daemon to build the image and push it to a compatible Docker registry. To install Docker, see the official Docker documentation.

    Ensure that the Docker registry is accessible from the HPE Ezmeral Runtime Enterprise cluster.

  4. Create a basic Dockerfile following the template in Dockerfile. Make sure to add the required datasets and scripts to the image.
  5. After the image is ready, build and push the image to the registry:
    docker build -t <docker_image_name_with_tag> .
    docker push <docker_image_name_with_tag>
  6. The pushed image now serves as the base image during the training phase. Before starting training, create a PVC to store the saved model:
    1. Open and apply the PVC YAML available as part of the training folder:
      kubectl apply -f pytorch-pvc.yaml
    2. Verify that the PVC is created and in a bound state:
      kubectl get pvc
  7. Apply the PyTorchJob CR YAML to run the training (a sketch of such a CR appears after this procedure):
    kubectl -n <namespace> apply -f pytorch.yaml
  8. A PyTorchJob is created and a pod is provisioned to run the training. The output of the training is a model file saved in the associated PVC:
    kubectl get pods -n <namespace> | grep tfjob1-sample-worker-0
    When the pod reaches the Completed state, the model building is complete. You can now deploy the generated model with Seldon Core. See: Tutorial 4: Inferencing with Seldon Core.
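
For reference, here is a minimal sketch of a PyTorchJob CR along the lines of pytorch.yaml, assuming a single master replica, the image pushed in step 5, and the PVC created in step 6; the sample file, which appears to name its job tfjob1-sample and run a worker replica, is authoritative:

  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    name: pytorch-sample               # hypothetical name; the sample appears to use tfjob1-sample
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1                    # single-replica, non-distributed training
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: pytorch          # the Kubeflow training operator expects this container name
                image: <docker_image_name_with_tag>
                command: ["python", "/opt/train.py"]   # hypothetical path of the training script in the image
                volumeMounts:
                  - name: model-store
                    mountPath: /mnt/models             # hypothetical path where the script saves the model
            volumes:
              - name: model-store
                persistentVolumeClaim:
                  claimName: pytorch-pvc               # the PVC created with pytorch-pvc.yaml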

Tutorial 3: Inferencing with KServe

  1. Obtain the KServe/inference_kserve.yaml file from the tensorflow directory in the sample zip file.
  2. Apply KServe/inference_kserve.yaml to the tenant namespace (a sketch of such a manifest appears after this tutorial's steps):
    kubectl apply -f KServe/inference_kserve.yaml -n <namespace>
  3. Ensure that the pods are up and running. You can track the status of the serving deployment with the following commands:
    kubectl get inferenceservices -n <namespace>
    kubectl get pods -n <namespace> | grep tfjob-serving
  4. After the pods are up and running, send a request to the model.

    Sample requests are available under tensorflow/KServe/requests_kserve.py.

    1. In the Jupyter Notebook terminal, install the following Python dependencies:
      pip install requests lxml --user
    2. From the Jupyter notebook, launch requests_kserve.py as follows:
      python requests_kserve.py http://<kserve-service>-default.<tenant-name>.svc.cluster.local:80
      The output appears similar to the following:
      200
      {u'predictions': [[0.841960549]]}
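
For reference, here is a hedged sketch of what an InferenceService manifest such as KServe/inference_kserve.yaml might look like, assuming a TensorFlow predictor that loads the SavedModel from the PVC created during training; the sample file is authoritative:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: tfjob-serving                # hypothetical name, chosen to match the pod prefix in step 3
  spec:
    predictor:
      tensorflow:
        storageUri: pvc://tfjob-pvc/   # load the SavedModel from the training PVC (path is an assumption)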

Tutorial 4: Inferencing with Seldon Core

  1. Obtain the inference_seldon.yaml file from the tensorflow directory in the sample zip file.
  2. Apply inference_seldon.yaml to the tenant namespace (a sketch of such a manifest appears after this tutorial's steps):
    kubectl apply -f inference_seldon.yaml -n <namespace>
  3. Ensure that the pods are up and running. You can track the status of the serving deployment with the following commands:
    kubectl get sdep -n <namespace>
    kubectl get pods -n <namespace> | grep tfserving
  4. After the pods are up and running, send a request to the model.

    Sample requests are available under tensorflow/Seldon/requests_seldon.py.

    1. In the Jupyter Notebook terminal, install the following Python dependencies:
      pip install requests lxml --user
    2. From the Jupyter notebook, launch requests_seldon.py as follows:
      python requests_seldon.py http://<seldon-service>.<tenant-name>.svc.cluster.local:8000
      The output appears similar to the following:
      200
      {u'predictions': [[0.841960549]]}
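
For reference, here is a hedged sketch of a SeldonDeployment along the lines of inference_seldon.yaml, assuming Seldon's pre-packaged TensorFlow server loading the model from the training PVC; the sample file is authoritative:

  apiVersion: machinelearning.seldon.io/v1
  kind: SeldonDeployment
  metadata:
    name: tfserving                        # hypothetical name, chosen to match the pod prefix in step 3
  spec:
    predictors:
      - name: default
        replicas: 1
        graph:
          name: model                        # hypothetical graph node name
          implementation: TENSORFLOW_SERVER  # Seldon pre-packaged TensorFlow server
          modelUri: pvc://tfjob-pvc/         # load the saved model from the training PVC (path is an assumption)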