MNIST Digits Recognition Workflow

Provides an end-to-end workflow in HPE Ezmeral Unified Analytics Software for an MNIST digits recognition example.

Scenario

A data scientist wants to use a Jupyter notebook to train a model that recognizes digits in images. The image files reside in object storage and must be transformed into Parquet format and placed in a shared directory on the shared volume that is mounted to the data scientist's notebook.

HPE Ezmeral Unified Analytics Software includes the following components and applications to support an end-to-end workflow for this scenario:

Spark
A Spark application pulls images from the HPE Ezmeral Data Fabric Object Store through the MinIO endpoint, transforms the images into Parquet format, and writes the Parquet data to the shared directory on the shared volume.
Airflow
An Airflow DAG, provided as code, that runs the Spark application.
Notebook (Jupyter)
A preconfigured Jupyter notebook, mounted to the shared volume, used to run code and train models through the following Kubeflow pipelines:
  • Run experiments with Katib to pick the best model and then deploy the model using KServe.
  • Full training with TensorFlow jobs.
The following diagram shows the components and applications in the workflow:

Steps

Prerequisites

Connect to your Jupyter notebook and perform setup tasks to prepare the environment for training the model. A <username> folder with a sample notebook file and an SSL certificate is provided for this tutorial. To connect to your notebook and perform the setup tasks, follow these steps:

  • An administrator must create an S3 object store bucket and load the data, because the Spark application reads raw data from the local S3 object store.
    To copy the required datasets to the ezaf-demo bucket under data/mnist, run the following code (to verify the upload afterward, see the sketch after this list):
    # Copy the MNIST digit dataset from ezua-tutorials to the ezaf-demo bucket
    # under data/mnist for the digits recognition example.
    import os
    import boto3

    s3 = boto3.client("s3", verify=False)
    bucket = "ezaf-demo"
    source_dir = "/mnt/shared/ezua-tutorials/current-release/Data-Science/Kubeflow/MNIST-Digits-Recognition/dataset"
    dest_dir = "data/mnist"

    # Build source file paths and destination object keys for every file in the dataset directory.
    source_files = []
    dest_objects = []
    for dataset in os.listdir(source_dir):
        source_files.append(source_dir + "/" + dataset)
        dest_objects.append(dest_dir + "/" + dataset)

    # Create the bucket if it does not already exist.
    buckets = s3.list_buckets()
    bucket_exists = any(b["Name"] == bucket for b in buckets["Buckets"])
    if not bucket_exists:
        s3.create_bucket(Bucket=bucket)

    # Upload the files.
    for source_file, dest_object in zip(source_files, dest_objects):
        s3.upload_file(Filename=source_file, Bucket=bucket, Key=dest_object)
  • (Air-gapped environments only) To run the example on clusters in an air-gapped network, manually pull the following images and make them available in your local registry:
    nikenano/launchernew:latest
    quay.io/aipipeline/kserve-component:v0.10.1
    Then add your local air-gapped registry path as a prefix to the previously mentioned images in the following YAML files:
    • component/kubeflow-launcher-component.yaml
    • component/kserve-component.yaml
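
To verify that the datasets were uploaded, you can list the objects in the bucket. The following is a minimal sketch, not part of the tutorial, that reuses the same boto3 client configuration as the upload code above:

    # List the uploaded objects to confirm the dataset landed under data/mnist.
    import boto3

    s3 = boto3.client("s3", verify=False)
    response = s3.list_objects_v2(Bucket="ezaf-demo", Prefix="data/mnist/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])
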
NOTE
If you are an administrator completing these tutorials, finish the administrator steps and then complete the prerequisite steps for member users.

If you are a member user, ask an administrator to complete the administrator steps before you complete these steps.

  1. In HPE Ezmeral Unified Analytics Software, go to Tools & Frameworks.
  2. Select the Data Science tab and then click Open in the Kubeflow tile.
  3. In Kubeflow, click Notebooks to open the notebooks page.
  4. Click Connect to connect to your notebook server.
  5. Go to the <username> folder.
  6. Copy the template object_store_secret.yaml.tpl file from the shared/ezua-tutorials/current-release/Data-Analytics/Spark directory to the <username> folder.
  7. In the <username>/MNIST-Digits-Recognition folder, open the mnist_katib_tf_kserve_example.ipynb file.
    NOTE
    If you do not see the MNIST-Digits-Recognition folder in the <username> folder, copy the folder from the shared/ezua-tutorials/current-release/Data-Science/Kubeflow directory into the <username> folder. The shared directory is accessible to all users. Editing or running examples from the shared directory is not advised. The <username> directory is specific to you and cannot be accessed by other users.

    If the MNIST-Digits-Recognition folder is not available in the shared/ezua-tutorials/current-release/Data-Science/Kubeflow directory, perform the following steps:

    1. Go to the GitHub repository for tutorials.
    2. Clone the repository.
    3. In the cloned repository, navigate to ezua-tutorials/Data-Science/Kubeflow.
    4. Navigate back to the <username> directory.
    5. Copy the MNIST-Digits-Recognition folder from the ezua-tutorials/Data-Science/Kubeflow directory into the <username> directory.
  8. To generate the secret that the Spark application (run by the Airflow DAG) uses to read the data source files from the S3 bucket, run the first cell of the mnist_katib_tf_kserve_example.ipynb file:
    # Create a Kubeflow Pipelines client and look up the user namespace.
    import kfp
    kfp_client = kfp.Client()
    namespace = kfp_client.get_user_namespace()
    # Render the object store secret by substituting the authentication token into the copied template.
    !sed -e "s/\$AUTH_TOKEN/$AUTH_TOKEN/" /mnt/user/object_store_secret.yaml.tpl > object_store_secret.yaml
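
    Optionally, confirm that the token was substituted into the rendered secret. The following check is a minimal sketch, not part of the tutorial notebook, and assumes the cell above wrote object_store_secret.yaml to the notebook's current working directory:

    # Optional check: verify that the rendered secret no longer contains the
    # unexpanded $AUTH_TOKEN placeholder.
    with open("object_store_secret.yaml") as f:
        rendered = f.read()
    print("Token substituted." if "$AUTH_TOKEN" not in rendered else "AUTH_TOKEN was not substituted; check that it is set in the notebook.")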
    

A - Run a DAG in Airflow

In Airflow, run the DAG named spark_read_write_parquet_mnist. The DAG runs a Spark application that pulls the images from object storage, transforms the data into Parquet format, and writes the transformed Parquet data into the shared volume.
  1. Go to Airflow using either of the following methods:
    • Click Data Engineering > Airflow Pipelines.
    • Click Tools & Frameworks, select the Data Engineering tab, and click Open in the Airflow tile.
  2. In Airflow, verify that you are on the DAGs tab.
  3. Click on the spark_read_write_parquet_mnist DAG.

    NOTE
    The DAG is pulled from a preconfigured HPE GitHub repository. The DAG submits a Spark application that pulls ubyte.gz files from an object storage bucket, converts the images into Parquet format, and places the converted files in a shared directory. If you want to use your own private GitHub repository, see Configuring Airflow for the steps to configure your repository.
  4. Click Code to view the DAG code.

  5. Click Graph to view the graphical representation of the DAG.
  6. Click Trigger DAG (play button) to open a screen where you can configure parameters.

  7. (Air-gapped environments only) Specify the air-gapped registry URL.
  8. Click the Trigger button at the bottom-left of the screen. Upon successful DAG completion, the transformed data is accessible inside your notebook server for further processing, by default in the following directory (to confirm the output from your notebook, see the sketch after these steps):
    shared/mnist-spark-data/
  9. To view details for the DAG, click Details. Under DAG Details, you can see green, red, and/or yellow buttons with the number of times the DAG ran successfully or failed.

  10. Click the Success or Failed button.
  11. To find your job, sort by End Date to see the latest jobs that have run, and then scroll to the right and click the log icon under Log URL for that run. Note that jobs run with the following configuration, shown in the Conf column:
    "username":"your_username"


    IMPORTANT
    The cluster clears the logs that result from the DAG runs. The duration after which the cluster clears the logs depends on the Airflow task, cluster configuration, and policy.
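
After a successful run, you can confirm the Parquet output directly from your notebook. The following is a minimal sketch, not part of the tutorial notebook; it assumes the shared volume is mounted at /mnt/shared (as in the earlier cells) and that pandas with a Parquet engine such as pyarrow is available in the notebook image:

    # Look for Parquet files written by the Spark application and inspect one of them.
    import glob
    import pandas as pd

    parquet_files = glob.glob("/mnt/shared/mnist-spark-data/**/*.parquet", recursive=True)
    print(f"Found {len(parquet_files)} Parquet file(s)")
    if parquet_files:
        df = pd.read_parquet(parquet_files[0])
        print(df.shape)
        print(df.head())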

B - View the Spark Application

After you run the DAG, you can view the status of the Spark application in the Spark Applications screen.
  1. To view the Spark application, go to Tools & Frameworks and then click on the Analytics tab.
  2. On the Analytics tab, select the Spark Operator tile and click Open.
  3. Identify the spark-mnist-<username>-<timestamp> application (for example, spark-mnist-hpedemo-user01-20230728103759) and view the status of the application.
  4. Optionally, in the Actions column, click View YAML.
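
If your account also has Kubernetes API access to your namespace (an assumption; it is not required by this tutorial), you can list the SparkApplication resources created by the Spark Operator programmatically. The following sketch assumes the kubernetes Python client is installed in the notebook image, that the notebook's service account can read SparkApplication objects, and that the operator serves the common sparkoperator.k8s.io/v1beta2 API:

    # List SparkApplication custom resources and print their current state.
    from kubernetes import client, config

    config.load_incluster_config()  # use the notebook pod's service account
    api = client.CustomObjectsApi()
    apps = api.list_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="<your-namespace>",  # hypothetical placeholder
        plural="sparkapplications",
    )
    for item in apps["items"]:
        state = item.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
        print(item["metadata"]["name"], state)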

C - Update Path of Spark-Generated Results

  1. Open the mnist_katib_tf_kserve_example.ipynb file.
  2. In the third cell of the mnist_katib_tf_kserve_example.ipynb file, update the user folder name as follows:

    user_mounted_dir_name = "<username-folder-name>"
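
    For example, if your user folder is named hpedemo-user01 (the hypothetical username used in the Spark application example above), the cell would read:

    user_mounted_dir_name = "hpedemo-user01"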


D - Train the Model

To train the model:
  1. In the Notebook Launcher, select the second cell of the notebook and select Run > Run Selected Cell and All Below.
  2. In the second to last cell, follow the Run Details link to open your Kubeflow Pipeline.
  3. Run the Kubeflow pipeline in the UI and wait for it to successfully complete.

  4. To get details about the components created by the pipeline run, go to the Experiments (AutoML) and Models pages in the Kubeflow UI. You can also check the run status programmatically, as shown in the sketch below.
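
The following is a minimal sketch, not part of the tutorial notebook, for checking recent pipeline runs from a notebook cell. It assumes the same kfp SDK client used in the prerequisite steps; the exact response fields can vary between kfp SDK versions:

    # List the most recent pipeline runs in your namespace and print their status.
    import kfp

    kfp_client = kfp.Client()
    response = kfp_client.list_runs(page_size=5, sort_by="created_at desc")
    for run in response.runs or []:
        print(run.name, run.status)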

E - Serve the Model

To serve the model with KServe and get a prediction, wait for the Kubeflow pipeline run to complete successfully. The notebook output then displays the prediction results.
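
For reference, a prediction request to the deployed KServe InferenceService can look like the following. This is a minimal sketch with hypothetical names and URL; substitute the InferenceService name and endpoint reported for your deployment, and adjust the input shape to match the trained model's serving signature:

    import json
    import requests

    # Hypothetical values; replace with the InferenceService name and URL for your deployment.
    model_name = "mnist-model"
    isvc_url = "https://mnist-model.example.com"

    # One flattened 28x28 image of zeros, for illustration only.
    payload = {"instances": [[0.0] * 784]}

    response = requests.post(
        f"{isvc_url}/v1/models/{model_name}:predict",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        verify=False,
    )
    print(response.json())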

End of Tutorial

You have completed this tutorial. This tutorial demonstrated that you can use Airflow, Spark, and Notebooks in HPE Ezmeral Unified Analytics Software to extract, transform, and load data into a shared volume and then run analytics and train models using Kubeflow pipelines.