MNIST Digits Recognition Workflow
Provides an end-to-end workflow in HPE Ezmeral Unified Analytics Software for an MNIST digits recognition example.
Scenario
A data scientist wants to use a Jupyter Notebook to train a model that recognizes the digits in images. The image files reside in object storage and must be transformed into Parquet format and placed in a shared directory on the shared volume that is mounted to the data scientist's notebook.
HPE Ezmeral Unified Analytics Software includes the following components and applications to support an end-to-end workflow for this scenario:
- Spark
- A Spark application pulls images from the HPE Ezmeral Data Fabric Object Store through the MinIO endpoint, transforms the images into Parquet format, and writes the Parquet data to the shared directory in the shared volume (a sketch of this transformation appears after this list).
- Airflow
- Coded Airflow DAG that runs the Spark application.
- Notebook (Jupyter)
- Preconfigured Jupyter notebook mounted to the shared volume to run code and train models for the following Kubeflow pipelines:
- Run experiments with Katib to pick the best model and then deploy the model using KServe.
- Full training with TensorFlow jobs.
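To make the flow concrete, the following is a minimal sketch of the kind of job the Spark application runs; it is not the packaged application. The bucket and prefix (ezaf-demo, data/mnist), the MNIST file names, the endpoint and credential placeholders, and the output directory /mnt/shared/mnist-spark-data are assumptions based on the defaults used elsewhere in this tutorial.

```python
# Minimal sketch of the Spark job described above: read the raw MNIST ubyte.gz
# files from the object store, decode them, and write the result as Parquet to
# the shared volume. This is NOT the packaged Spark application; the endpoint,
# credentials, bucket, file names, and output path are placeholders.
import gzip
import numpy as np
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mnist-to-parquet")
    # S3A settings for the object store's MinIO endpoint (placeholders).
    .config("spark.hadoop.fs.s3a.endpoint", "<minio-endpoint>")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

def decode_images(raw: bytes) -> np.ndarray:
    """Decode a gzip-compressed IDX image file into an (n, 784) uint8 array."""
    data = gzip.decompress(raw)
    n = int.from_bytes(data[4:8], "big")
    rows = int.from_bytes(data[8:12], "big")
    cols = int.from_bytes(data[12:16], "big")
    return np.frombuffer(data, dtype=np.uint8, offset=16).reshape(n, rows * cols)

def decode_labels(raw: bytes) -> np.ndarray:
    """Decode a gzip-compressed IDX label file into an (n,) uint8 array."""
    return np.frombuffer(gzip.decompress(raw), dtype=np.uint8, offset=8)

# Read the raw .gz files from the bucket as binary blobs.
blobs = {
    row.path.rsplit("/", 1)[-1]: bytes(row.content)
    for row in spark.read.format("binaryFile")
    .load("s3a://ezaf-demo/data/mnist/*ubyte.gz")
    .collect()
}
images = decode_images(blobs["train-images-idx3-ubyte.gz"])
labels = decode_labels(blobs["train-labels-idx1-ubyte.gz"])

# Build (label, pixels) rows and write them as Parquet to the shared directory
# that is mounted on the notebook servers.
df = spark.createDataFrame(
    [(int(l), p.tolist()) for l, p in zip(labels, images)],
    schema="label INT, pixels ARRAY<INT>",
)
df.write.mode("overwrite").parquet("/mnt/shared/mnist-spark-data/train")
```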
Steps
Prerequisites
Connect to your Jupyter notebook and perform setup tasks to prepare the environment to train the model. A <username> folder with a sample notebook file and SSL certificate is provided for the purpose of this tutorial. To connect to your notebook and perform the setup tasks, follow these steps:
- An administrator must create an S3 object store bucket and load the data, because the Spark application reads raw data from the local S3 Object Store. To copy the required datasets to the ezaf-demo bucket at data/mnist, run:

```python
# Code to copy mnist digit dataset from ezua-tutorials to ezaf-demo bucket
# at data/mnist for Digit Recognition Example
import os, boto3

s3 = boto3.client("s3", verify=False)
bucket = 'ezaf-demo'
source_dir = '/mnt/shared/ezua-tutorials/current-release/Data-Science/Kubeflow/MNIST-Digits-Recognition/dataset'
dest_dir = 'data/mnist'

# Get list of files under dataset dir
dataset_list = os.listdir(source_dir)

# Create source file path and destination object key strings
source_files = []
dest_objects = []
for dataset in dataset_list:
    source_files.append(source_dir + '/' + dataset)
    dest_objects.append(dest_dir + '/' + dataset)

# Check whether bucket is already created
buckets = s3.list_buckets()
bucket_exists = False
available_buckets = buckets["Buckets"]
for available_bucket in available_buckets:
    if available_bucket["Name"] == bucket:
        bucket_exists = True
        break
if not bucket_exists:
    s3.create_bucket(Bucket=bucket)

# Upload files
for i in range(len(source_files)):
    s3.upload_file(Filename=source_files[i], Bucket=bucket, Key=dest_objects[i])
```
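Optionally, you can confirm the upload with the same boto3 client; this quick check is an addition for convenience, not part of the tutorial code:

```python
# Optional check (not part of the tutorial code): list the objects that were
# uploaded under data/mnist in the ezaf-demo bucket.
response = s3.list_objects_v2(Bucket=bucket, Prefix=dest_dir)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```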
- (Air-gapped environment only) Manually pull the following images and make them available in the local repository so that clusters in an air-gapped network can run the example:
  - nikenano/launchernew:latest
  - quay.io/aipipeline/kserve-component:v0.10.1

  Next, add your local air-gapped repository path prefix to the images above in the following YAML files:
  - component/kubeflow-launcher-component.yaml
  - component/kserve-component.yaml
If you are a member user, ask your administrator to complete the administrator steps before you continue.
- In HPE Ezmeral Unified Analytics Software, go to Tools & Frameworks.
- Select the Data Science tab and then click Open in the Kubeflow tile.
- In Kubeflow, click Notebooks to open the notebooks page.
- Click Connect to connect to your notebook server.
- Go to the <username> folder.
- Copy the template object_store_secret.yaml.tpl file from the shared/ezua-tutorials/current-release/Data-Analytics/Spark directory to the <username> folder.
- In the <username>/MNIST-Digits-Recognition folder, open the mnist_katib_tf_kserve_example.ipynb file.
  NOTE: If you do not see the MNIST-Digits-Recognition folder in the <username> folder, copy the folder from the shared/ezua-tutorials/current-release/Data-Science/Kubeflow directory into the <username> folder. The shared directory is accessible to all users. Editing or running examples from the shared directory is not advised. The <username> directory is specific to you and cannot be accessed by other users.
  If the MNIST-Digits-Recognition folder is not available in the shared/ezua-tutorials/current-release/Data-Science/Kubeflow directory, perform these steps:
  - Go to the GitHub repository for tutorials.
  - Clone the repository.
  - Navigate to ezua-tutorials/Data-Science/Kubeflow.
  - Navigate back to the <username> directory.
  - Copy the MNIST-Digits-Recognition folder from the ezua-tutorials/Data-Science/Kubeflow directory into the <username> directory.
- To generate a secret that the Spark application (Airflow DAG) uses to read the data source files from the S3 bucket, run the first cell of the mnist_katib_tf_kserve_example.ipynb file:

```python
import kfp

kfp_client = kfp.Client()
namespace = kfp_client.get_user_namespace()

!sed -e "s/\$AUTH_TOKEN/$AUTH_TOKEN/" /mnt/user/object_store_secret.yaml.tpl > object_store_secret.yaml
```
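The sed command substitutes the AUTH_TOKEN environment variable into the secret template. As an optional sanity check (not part of the sample notebook), you can confirm the token is set in the notebook session before running the cell:

```python
# Optional sanity check (not in the sample notebook): confirm the AUTH_TOKEN
# environment variable used by the sed substitution is set in this session.
import os

if not os.environ.get("AUTH_TOKEN"):
    print("AUTH_TOKEN is not set; object_store_secret.yaml will be incomplete.")
```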
A - Run a DAG in Airflow
- Go to Airflow using either of the following methods:
- Click Data Engineering > Airflow Pipelines.
- Click Tools & Frameworks, select the Data Engineering tab, and click Open in the Airflow tile.
- In Airflow, verify that you are on the DAGs tab.
- Click the spark_read_write_parquet_mnist DAG.
  NOTE: The DAG is pulled from a preconfigured HPE GitHub repository. This DAG submits a Spark application that pulls ubyte.gz files from an object storage bucket, converts the images into Parquet format, and places the converted files in a shared directory (see the DAG sketch at the end of this section). If you want to use your private GitHub repository, see Configuring Airflow for the steps to configure your repository.
- Click Code to view the DAG code.
- Click Graph to view the graphical representation of the DAG.
- Click Trigger DAG (play button) to open a screen where you can configure parameters.
- (Air-gapped environment only) Specify the airgap registry URL.
- Click the Trigger button on the bottom-left of the screen. Upon successful DAG completion, the data is accessible inside your notebook server for further processing, by default in the following directory: shared/mnist-spark-data/
- To view details for the DAG, click Details. Under DAG Details, you can see green, red, and/or yellow buttons with the number of times the DAG ran successfully or failed.
- Click the Success or Failed button.
- To find your job, sort by End Date to see the latest jobs that have run, then scroll to the right and click the log icon under Log URL for that run. Note that jobs run with the configuration Conf "username":"your_username".
IMPORTANT: The cluster clears the logs that result from DAG runs. The duration after which the cluster clears the logs depends on the Airflow task, cluster configuration, and policy.
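For reference, a DAG that submits a Spark application through the Spark Operator typically looks like the following minimal sketch. This is not the preconfigured spark_read_write_parquet_mnist DAG; the DAG ID, manifest file name, namespace, application name, and connection ID are assumptions for illustration.

```python
# Minimal sketch of an Airflow DAG that submits a SparkApplication manifest
# through the Spark Operator, similar in shape to spark_read_write_parquet_mnist.
# The DAG id, manifest file name, namespace, application name, and connection id
# are illustrative assumptions, not the values used by the preconfigured DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="spark_read_write_parquet_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually, like the tutorial DAG
    catchup=False,
) as dag:
    # Submit the SparkApplication custom resource defined in a YAML manifest.
    submit = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="spark",
        application_file="spark-mnist-app.yaml",  # hypothetical manifest file
        kubernetes_conn_id="kubernetes_default",
    )

    # Wait until the SparkApplication reaches a terminal state.
    monitor = SparkKubernetesSensor(
        task_id="monitor_spark_app",
        namespace="spark",
        application_name="spark-mnist-app",  # metadata.name from the manifest
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> monitor
```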
B - View the Spark Application
- To view the Spark application, go to Tools & Frameworks and then click on the Analytics tab.
- On the Analytics tab, select the Spark Operator tile and click Open.
- Identify the spark-mnist-<username>-<timestamp> application, for example, spark-mnist-hpedemo-user01-20230728103759, and view the status of the application.
- Optionally, in the Actions column, click View YAML.
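If you prefer to check the application status programmatically rather than in the UI, a sketch like the following (assuming kubeconfig or in-cluster access and the standard sparkoperator.k8s.io API group used by the Spark Operator) reads the SparkApplication custom resource; the namespace and application name are placeholders.

```python
# Optional, programmatic alternative to the UI: read the SparkApplication
# custom resource managed by the Spark Operator and print its current state.
# The namespace and application name below are placeholders for your values.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

app = api.get_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="<your-namespace>",
    plural="sparkapplications",
    name="spark-mnist-<username>-<timestamp>",
)
print(app.get("status", {}).get("applicationState", {}))
```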
C - Update Path of Spark-Generated Results
- Open the mnist_katib_tf_kserve_example.ipynb file.
- In the third cell of the mnist_katib_tf_kserve_example.ipynb file, update the user folder name as follows:
  user_mounted_dir_name = "<username-folder-name>"
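Before training, you can optionally confirm that the Spark output is readable from the shared volume. A minimal check, assuming the default output directory shared/mnist-spark-data/ is mounted under /mnt/shared in the notebook, might look like this:

```python
# Optional check: read the Parquet data that the Spark application wrote to the
# shared volume. The mount point /mnt/shared and the directory mnist-spark-data
# are assumptions based on the defaults used in this tutorial.
import pandas as pd

df = pd.read_parquet("/mnt/shared/mnist-spark-data/")
print(df.shape)
print(df.head())
```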
D - Train the Model
- In the Notebook Launcher, select the second cell of the notebook and select Run > Run Selected Cell and All Below.
- In the second-to-last cell, follow the Run Details link to open your Kubeflow pipeline.
- Run the Kubeflow pipeline in the UI and wait for it to complete successfully.
- To get details about the components created by the pipeline run, go to the Experiments (AutoML) and Models pages in the Kubeflow UI.
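If you want to track the run from the notebook instead of the Kubeflow UI, the kfp client created in the first cell can list recent runs in your namespace. This is an optional sketch and assumes the kfp 1.x SDK; the response fields are named differently in kfp 2.x.

```python
# Optional: list recent Kubeflow pipeline runs from the notebook using the kfp
# client created in the first cell. Assumes the kfp 1.x SDK; the response
# fields (name, status, created_at) are named differently in kfp 2.x.
runs = kfp_client.list_runs(page_size=5, sort_by="created_at desc")
for run in runs.runs or []:
    print(run.name, run.status, run.created_at)
```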
E - Serve the Model
The Kubeflow pipeline run in section D deploys the best model with KServe, as described in the Scenario.
End of Tutorial
You have completed this tutorial. This tutorial demonstrated that you can use Airflow, Spark, and Notebooks in HPE Ezmeral Unified Analytics Software to extract, transform, and load data into a shared volume and then run analytics and train models using Kubeflow pipelines.