Monitoring

Describes monitoring in HPE Ezmeral Unified Analytics Software.

Monitoring and alerting play an integral role in the observability framework. They involve monitoring the health, performance, and resource utilization of a Kubernetes cluster and its components. Administrators receive alerts about potential issues, which helps maintain optimal cluster and application operations and enables prompt responses to critical events.

NOTE
You cannot configure or turn off notifications. You must view alerts and notifications in HPE Ezmeral Unified Analytics Software.

Model Monitoring

Model monitoring is the process of continuously observing and analyzing the performance and behavior of machine learning models deployed in production environments. It is a critical aspect of the machine learning lifecycle that ensures models remain reliable, accurate, and aligned with the intended objectives.

Model monitoring involves the collection, analysis, and visualization of various metrics and data related to the model's performance and data characteristics. It is an iterative process that helps ensure model reliability and enables timely adjustments or updates to maintain optimal performance. Model monitoring plays a crucial role in building trust in machine learning systems and making informed decisions based on model outputs.

Model monitoring metrics are essential for tracking and measuring the performance of deployed models.

In HPE Ezmeral Unified Analytics Software, you can use KServe or MLflow to monitor operational performance and whylogs to monitor functional performance.

Collected Metrics

Knative metrics

Knative Serving does not have built-in support for model monitoring metrics. You can integrate KServe with other monitoring and observability tools, such as Prometheus, Grafana, Kiali, and ESK, to collect and analyze metrics related to the performance and behavior of your deployed models.

To learn more, see Importing dashboards to Grafana.

The following metrics are collected via KServe:
  • Knative Serving: Revision HTTP Requests
  • Knative Serving: Scaling Debugging
  • Knative Serving: Revision CPU and Memory Usage
  • Knative: Reconciler
  • Knative Serving: Control Plane Efficiency

MLflow metrics

Use OpenTelemetry (OTel) to collect telemetry data, including metrics and traces, from MLflow applications and export it to third-party or external monitoring systems such as Prometheus, Jaeger, or Grafana for analysis and visualization. To learn more, see Configuring Endpoints.

The following metrics are collected via MLflow:
  • mlflow_http_request_total: Total number of incoming HTTP requests.
  • mlflow_http_request_duration_seconds_sum: Total duration in seconds of all incoming HTTP requests.
  • mlflow_http_request_duration_seconds_count: Total count of all incoming HTTP requests.
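
The duration sum and count are most useful together: dividing the sum by the count gives the mean request duration. The following Python sketch illustrates that arithmetic on a scraped metrics payload. It assumes the metrics arrive in Prometheus text exposition format without timestamps. The sample values and the helper names parse_metrics and mean_request_duration are hypothetical and are not part of MLflow or HPE Ezmeral Unified Analytics Software.

```python
def parse_metrics(text):
    """Parse 'name value' lines of Prometheus text exposition output.

    Assumes no trailing timestamps. Samples that differ only by labels
    are summed under the bare metric name.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        base = name.split("{", 1)[0]  # drop a label set such as {method="GET"}
        samples[base] = samples.get(base, 0.0) + float(value)
    return samples


def mean_request_duration(samples):
    """Return the mean request duration in seconds, or None if no requests."""
    count = samples.get("mlflow_http_request_duration_seconds_count", 0.0)
    if count == 0:
        return None
    return samples["mlflow_http_request_duration_seconds_sum"] / count


# Illustrative values only, not real measurements.
SAMPLE = """\
# TYPE mlflow_http_request_total counter
mlflow_http_request_total 120
mlflow_http_request_duration_seconds_sum 30.0
mlflow_http_request_duration_seconds_count 120
"""

samples = parse_metrics(SAMPLE)
print(mean_request_duration(samples))  # 0.25
```

In a monitoring system such as Prometheus or Grafana, the same ratio is normally computed over a time window by the query layer rather than in application code.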

Model Monitoring with whylogs

NOTE
This feature is presented as a developer preview. Developer previews are not tested for production environments, and should be used with caution.

whylogs is an open-source library for logging any kind of data. With whylogs, you can generate summaries of your datasets (data profiles) that you can use to:
  • Track changes in the dataset and detect data drifts in the model input features.
  • Create data constraints to validate data quality in model inputs or in a data pipeline.
  • Detect training-serving skew, concept drift, and model performance degradation.
  • Perform exploratory data analysis of massive datasets.
  • Track data distributions and data quality for ML experiments.
  • Standardize data documentation practices across the organization.
  • Visualize the key summary statistics about the datasets in HTML and JSON file formats.

To learn more about whylogs, see whylogs documentation.
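
To make the idea of a data profile concrete, the following Python sketch summarizes a numeric feature and compares a serving-time profile against a training-time profile to flag drift. It is a conceptual illustration only and does not use the whylogs API. The function names profile and has_drifted, the sample values, and the threshold of three standard deviations are assumptions chosen for the example. whylogs computes richer profiles through its own interface.

```python
from statistics import mean, stdev


def profile(values):
    """Summarize a numeric column: row count, missing values, mean, stddev."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "stddev": stdev(present),  # requires at least two non-missing values
    }


def has_drifted(reference, current, threshold=3.0):
    """Flag drift when the current mean lies more than `threshold`
    reference standard deviations away from the reference mean."""
    if reference["stddev"] == 0:
        return current["mean"] != reference["mean"]
    shift = abs(current["mean"] - reference["mean"]) / reference["stddev"]
    return shift > threshold


# Illustrative feature values only.
training = [10.0, 12.0, 11.0, 13.0, 9.0]
serving_ok = [11.0, 10.0, 12.0, None, 11.5]
serving_shifted = [25.0, 27.0, 26.0, 24.0, 28.0]

reference = profile(training)
print(has_drifted(reference, profile(serving_ok)))       # False
print(has_drifted(reference, profile(serving_shifted)))  # True
```

Comparing compact profiles instead of raw rows is what makes this approach practical for large datasets: only the summaries need to be stored and compared over time.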

HPE Ezmeral Unified Analytics Software makes the open-source whylogs library available in the preview environment, where it is integrated into the Notebook as a third-party package. When you use whylogs for monitoring, you can access data from an external S3 object store. To learn more about accessing data, see Accessing Data in External S3 Object Stores.

The following applications and frameworks support whylogs in HPE Ezmeral Unified Analytics Software: