Monitoring
Describes monitoring in HPE Ezmeral Unified Analytics Software.
Model Monitoring
Model monitoring is the process of continuously observing and analyzing the performance and behavior of machine learning models deployed in production environments. It is a critical aspect of the machine learning lifecycle that ensures models remain reliable, accurate, and aligned with the intended objectives.
Model monitoring involves the collection, analysis, and visualization of various metrics and data related to the model's performance and data characteristics. It is an iterative process that helps ensure model reliability and enables timely adjustments or updates to maintain optimal performance. Model monitoring plays a crucial role in building trust in machine learning systems and making informed decisions based on model outputs.
Model monitoring metrics are essential for tracking and measuring the performance of deployed models.
In HPE Ezmeral Unified Analytics Software, you can use KServe or MLflow to monitor operational performance and whylogs to monitor functional performance.
Collected Metrics
- Knative metrics
Knative Serving does not have built-in support for model monitoring metrics. You can integrate KServe with other monitoring and observability tools, such as Prometheus, Grafana, Kiali, or EFK, to collect and analyze metrics related to the performance and behavior of your deployed models.
To learn more, see Importing dashboards to Grafana.
The following metrics are collected via KServe:
- Knative Serving: Revision HTTP Requests
- Knative Serving: Scaling Debugging
- Knative Serving: Revision CPU and Memory Usage
- Knative: Reconciler
- Knative Serving: Control Plane Efficiency
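If you export these metrics to Prometheus, you can also query them programmatically. The following is a minimal sketch that reads Knative revision request rates through the Prometheus HTTP API; the Prometheus URL, metric name, and label names are assumptions that depend on your monitoring setup and Knative version.

```python
# Minimal sketch: query Knative Serving revision HTTP request rates from Prometheus.
# Assumptions: Prometheus is reachable at PROM_URL and scrapes Knative/KServe metrics;
# the metric and label names follow common Knative queue-proxy conventions and may
# differ in your environment.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address

# Per-revision request rate over the last 5 minutes.
query = 'sum by (revision_name) (rate(revision_app_request_count[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    revision = result["metric"].get("revision_name", "unknown")
    rate_per_sec = float(result["value"][1])
    print(f"{revision}: {rate_per_sec:.2f} requests/sec")
```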
- MLflow metrics
Use OpenTelemetry (OTel) to collect and export telemetry data, including metrics and traces, from MLflow applications to third-party or external monitoring systems such as Prometheus, Jaeger, or Grafana for analysis and visualization. To learn more, see Configuring Endpoints.
The following metrics are collected via MLflow:
- mlflow_http_request_total: Total number of incoming HTTP requests.
- mlflow_http_request_duration_seconds_sum: Total duration, in seconds, of all incoming HTTP requests.
- mlflow_http_request_duration_seconds_count: Total count of all incoming HTTP requests.
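To illustrate the OTel export path, the following minimal sketch records a custom counter around MLflow tracking calls using the OpenTelemetry Python SDK. The meter name, counter name, and console exporter are assumptions for demonstration only; in practice you would configure an OTLP exporter pointing at your collector, and the mlflow_http_request_* metrics listed above are emitted by the MLflow tracking server itself rather than by this client-side code.

```python
# Minimal sketch: export client-side telemetry from an MLflow workflow with OpenTelemetry.
# Assumptions: the opentelemetry-sdk and mlflow packages are installed, and a console
# exporter stands in for an OTLP exporter pointed at your collector.
import mlflow
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Wire up a meter provider that periodically flushes metrics to the exporter.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("mlflow.client.example")  # hypothetical instrumentation name

run_counter = meter.create_counter(
    "mlflow_client_runs_total",  # hypothetical metric name
    description="Number of MLflow runs started by this application",
)

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.92)
    run_counter.add(1, {"experiment": "demo"})
```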
Model Monitoring with whylogs
With whylogs, you can:
- Track changes in the dataset and detect data drifts in the model input features.
- Create data constraints to validate data quality in model inputs or in a data pipeline.
- Detect training-serving skew, concept drift, and model performance degradation.
- Perform exploratory data analysis of massive datasets.
- Track data distributions and data quality for ML experiments.
- Standardize data documentation practices across the organization.
- Visualize the key summary statistics about the datasets in HTML and JSON file formats.
To learn more about whylogs, see whylogs documentation.
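As a starting point, profiling a dataset with whylogs takes only a few lines. The following is a minimal sketch, assuming a pandas DataFrame as the model input; the DataFrame and its column names are hypothetical stand-ins for your own features.

```python
# Minimal sketch: profile a dataset with whylogs and inspect its summary statistics.
# Assumptions: the whylogs and pandas packages are installed; the DataFrame and its
# columns are hypothetical stand-ins for your model's input features.
import pandas as pd
import whylogs as why

df = pd.DataFrame(
    {
        "age": [34, 45, 29, 52],
        "income": [58000.0, 72000.0, 41000.0, 95000.0],
        "segment": ["a", "b", "a", "c"],
    }
)

# Log the DataFrame to produce a statistical profile of each column.
results = why.log(df)
profile_view = results.view()

# Convert the profile to a pandas summary (counts, types, distribution metrics).
summary = profile_view.to_pandas()
print(summary)
```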
HPE Ezmeral Unified Analytics Software enables you to use an open-source library called whylogs in the preview environment. whylogs is integrated into the Notebook as a third-party package. You can access data from an external S3 object store when using whylogs for monitoring. To learn more about accessing data, see Accessing Data in External S3 Object Stores. You can use whylogs with the following applications and frameworks:
- Airflow. See Using whylogs with Airflow.
- MLflow. See Using whylogs with MLflow. NOTE: HPE Ezmeral Unified Analytics Software supports external data sources, such as AWS and MinIO, for whylogs with MLflow. You cannot use the S3 proxy as a data source.
- Ray. See Using whylogs with Ray.
- Spark. See Using whylogs with Spark.
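As a sketch of reading model input data from an external S3-compatible object store (for example, AWS S3 or MinIO) and profiling it with whylogs, the following example uses pandas with s3fs. The bucket name, object key, endpoint URL, and credentials are placeholders for your environment.

```python
# Minimal sketch: read a dataset from an external S3-compatible object store and
# profile it with whylogs.
# Assumptions: pandas, s3fs, and whylogs are installed; the bucket, key, endpoint,
# and credentials below are placeholders for your environment.
import pandas as pd
import whylogs as why

df = pd.read_csv(
    "s3://example-bucket/inference/inputs.csv",  # placeholder object path
    storage_options={
        "key": "YOUR_ACCESS_KEY",                # placeholder credentials
        "secret": "YOUR_SECRET_KEY",
        "client_kwargs": {"endpoint_url": "https://minio.example.com"},  # placeholder endpoint
    },
)

# Profile the input data and write the profile locally for later drift comparison.
results = why.log(df)
results.writer("local").write()
```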