Using whylogs with Spark

Describes how to use whylogs with Spark.

Prerequisites

Sign in to HPE Ezmeral Unified Analytics Software as a member.

About this task

In HPE Ezmeral Unified Analytics Software, whylogs is integrated with Livy sessions that you submit from Kubeflow notebooks by using the %manage_spark magic function. You can use whylogs with Spark to profile, visualize, and monitor data to detect data drift.

To use whylogs with Spark, refer to the Data Validation example and the WhyLogs Profiling example in GitHub. The basic steps are as follows:
  1. Create a notebook or import your notebook into HPE Ezmeral Unified Analytics Software. See Creating and Managing Notebook Servers.
  2. Enter the %manage_spark command in your notebook and configure your Spark session on the tabs that appear. You must select Single Sign-On as the authentication type and Python as the runtime language. To learn about creating sessions by using %manage_spark, see %manage_spark.
  3. Enter the %config_spark magic in your notebook and set the value of the spark.kubernetes.container.image property to gcr.io/mapr-252711/spark<version>:<image-tag>. Click Submit when done. To learn about using %config_spark, see %config_spark.
  4. Verify that the session you created is in the Idle state. To check the state, click the Manage Sessions tab or navigate to the Spark Interactive Sessions screen. See Managing Interactive Sessions.
  5. Once the session is in the Idle state, you can set the environment variables and import the required libraries and modules from whylogs.
  6. Create the data frames that you want to profile or validate with whylogs, and then run the notebook.
  7. Once you finish running your notebook, navigate back to the HPE Ezmeral Unified Analytics Software home screen.
  8. In the left navigation bar, go to Data Engineering > Data Sources.
  9. Click Browse.
  10. Go to the /shared/<spark-whylogs> folder, which is the path set in your notebook for storing the whylogs output. The data profiles and the drift summary report are stored in the shared volume in the .html and .bin formats.
  11. To download a summary report, select Download from the Actions menu.
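
The notebook code for steps 5, 6, and 10 might look roughly like the following sketch. It is an illustration under stated assumptions, not the code from the GitHub examples: it assumes the whylogs v1 Python API (collect_dataset_profile_view from the experimental PySpark module, and NotebookProfileVisualizer from the viz extra) is available in the session image, and that the shared folder is writable from where the code runs. The helper names profile_path, profile_spark_dataframe, and write_drift_report are hypothetical, and the /shared/<spark-whylogs> placeholder is kept as-is and must be replaced with the folder set in your notebook.

```python
import posixpath

# Placeholder from step 10, kept as-is: replace it with the folder set in your notebook.
OUTPUT_DIR = "/shared/<spark-whylogs>"


def profile_path(name, output_dir=OUTPUT_DIR):
    """Return the path of the binary (.bin) profile file for a data set name."""
    return posixpath.join(output_dir, name + ".bin")


def profile_spark_dataframe(df, name, output_dir=OUTPUT_DIR):
    """Profile a Spark DataFrame with whylogs and save the profile as a .bin file."""
    # Imported inside the function so that whylogs and PySpark are only
    # required when the function is actually called in the Spark session.
    from whylogs.api.pyspark.experimental import collect_dataset_profile_view

    view = collect_dataset_profile_view(input_df=df)
    view.write(path=profile_path(name, output_dir))
    return view


def write_drift_report(target_view, reference_view, name="drift_summary",
                       output_dir=OUTPUT_DIR):
    """Compare two profile views and save a drift summary report as <name>.html."""
    # Assumption: the whylogs viz extra is installed in the session image.
    from whylogs.viz import NotebookProfileVisualizer

    viz = NotebookProfileVisualizer()
    viz.set_profiles(target_profile_view=target_view,
                     reference_profile_view=reference_view)
    viz.write(rendered_html=viz.summary_drift_report(),
              preferred_path=output_dir,
              html_file_name=name)
```

With two Spark data frames, for example a hypothetical reference_df and current_df, you would call profile_spark_dataframe once for each and pass the two returned views to write_drift_report. The .bin and .html files then appear in the folder that you browse to in step 10.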

Results

You can analyze the summary report to detect data drift and monitor your data.
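
To make the idea of drift concrete, the following sketch compares a reference sample with a target sample by using a two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. This is plain Python written for illustration only. It is not whylogs code, and it is not presented as the way the summary report computes its results. The function name ks_statistic and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, target):
    """Return the largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    tgt = sorted(target)
    if not ref or not tgt:
        raise ValueError("both samples must be non-empty")

    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(ref) | set(tgt):
        ref_cdf = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        tgt_cdf = bisect_right(tgt, x) / len(tgt)  # share of target values <= x
        gap = max(gap, abs(ref_cdf - tgt_cdf))
    return gap


reference = [1.0, 2.0, 3.0, 4.0, 5.0]
unchanged = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]

print(ks_statistic(reference, unchanged))  # 0.0: identical distributions
print(ks_statistic(reference, shifted))    # 1.0: no overlap at all
```

A value near 0 means that the two samples look alike, and a value near 1 means that they barely overlap. A column whose values have shifted in this way between a reference data set and a newer one is the kind of change that a drift report is meant to surface.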