Using whylogs with Spark
Describes how to use whylogs with Spark.
Prerequisites
About this task
In HPE Ezmeral Unified Analytics Software, whylogs is integrated to work with Livy sessions submitted through Kubeflow notebooks using the %manage_spark magic function. You can use whylogs with Spark to profile, visualize, and monitor data to detect drifts.
To use whylogs with Spark, refer to the Data Validation example and WhyLogs Profiling example in GitHub. The basic steps are outlined as follows:
- Create a notebook or import your notebook into HPE Ezmeral Unified Analytics Software. See Creating and Managing Notebook Servers.
- Enter the %manage_spark command in your notebook and configure your Spark session through the different tabs. You must select Single Sign-On as the authentication and Python as the runtime language. To learn about creating sessions by using %manage_spark, see %manage_spark.
- Enter the %config_spark magic in your notebook and update the value of the spark.kubernetes.container.image property to gcr.io/mapr-252711/spark<version>:<image-tag>. Click Submit when done. To learn about using %config_spark, see %config_spark.
- Verify that the session you created is in the Idle state. You can verify this by clicking the Manage Sessions tab or by navigating to the Spark Interactive Sessions screen. See Managing Interactive Sessions.
- Once the session is in the Idle state, you can set the environment variables and import the required libraries and modules from whylogs.
- Create data frames to profile the data or validate the data with whylogs and run the notebook.
- Once you finish running your notebook, navigate back to the HPE Ezmeral Unified Analytics Software home screen.
- In the left navigation bar, go to Data Engineering > Data Sources.
- Click Browse.
- Go to the /shared/<spark-whylogs> folder, which is the path set in your notebook to store the logs from whylogs. The data profiles and the drift summary report are stored in the shared volume in the .html and .bin formats.
- To download a summary report, select Download from the Actions menu.
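The profiling steps in the list above (importing whylogs, creating a data frame, and storing the profile) might look roughly like the following. This is a minimal sketch, not the code from the GitHub examples: the helper names profile_path and profile_spark_dataframe are hypothetical, the output directory is whatever path your notebook sets, and it assumes the whylogs PySpark integration (whylogs.api.pyspark.experimental) is available in the session image.

```python
import posixpath


def profile_path(base_dir, name):
    """Build the .bin path for a serialized whylogs profile (hypothetical helper)."""
    return posixpath.join(base_dir, name + ".bin")


def profile_spark_dataframe(spark_df, bin_path):
    """Profile a Spark DataFrame with whylogs and write the profile to bin_path."""
    # Imported inside the function so this module still loads where
    # whylogs is not installed; the call itself needs whylogs with Spark support.
    from whylogs.api.pyspark.experimental import collect_dataset_profile_view

    profile_view = collect_dataset_profile_view(input_df=spark_df)
    profile_view.write(path=bin_path)
    return profile_view


# Hypothetical usage inside the Livy session, where `spark` is already defined:
# df = spark.read.option("header", "true").option("inferSchema", "true").csv("<path-to-your-data>")
# view = profile_spark_dataframe(df, profile_path("/shared/<spark-whylogs>", "profile"))
```

Keeping the path construction separate from the profiling call makes it easy to confirm where the .bin file will land before running the session.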
Results
You can analyze the summary report to detect drifts and monitor your data.
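A drift summary report such as the one described here could be produced with the whylogs notebook visualizer. The following is a sketch under assumptions: write_drift_report and report_stem are hypothetical helper names, target_view and reference_view are two whylogs profile views you have already collected, and the session image is assumed to include the whylogs visualization extras.

```python
import os


def report_stem(html_path):
    """Drop a trailing .html, since the file name is passed without an extension."""
    stem, ext = os.path.splitext(html_path)
    return stem if ext.lower() == ".html" else html_path


def write_drift_report(target_view, reference_view, html_path):
    """Render a summary drift report comparing two profile views and save it as HTML."""
    # Imported lazily; requires whylogs with its visualization dependencies.
    from whylogs.viz import NotebookProfileVisualizer

    viz = NotebookProfileVisualizer()
    viz.set_profiles(
        target_profile_view=target_view,
        reference_profile_view=reference_view,
    )
    # The whylogs examples pass html_file_name without the .html suffix.
    viz.write(
        rendered_html=viz.summary_drift_report(),
        html_file_name=report_stem(html_path),
    )
```

The reference view is typically a profile of a known-good baseline data set, and the target view is the newer data you want to check against it.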