Alerting

Describes alerting in HPE Ezmeral Unified Analytics Software.

An alert in HPE Ezmeral Unified Analytics Software is a system notification that informs you of issues, warnings, and updates. Unified Analytics uses Prometheus to monitor and collect metrics from nodes, system processes, and applications that run in an HPE Ezmeral Unified Analytics Software cluster, and generates alerts based on the collected metrics. Alertmanager in Unified Analytics lets you control how alerts behave. For example, you can silence specific alerts or send notifications to a specific user when the system raises an alert.

To learn about Prometheus and Alertmanager in detail, see the Prometheus and Alertmanager documentation.

The alert system in HPE Ezmeral Unified Analytics Software consists of several components. The following sections include an architectural diagram, component descriptions, and the alerting workflow.

Alerting Workflow

The following steps give an overview of the alerting workflow in HPE Ezmeral Unified Analytics Software, with a description of each step.

Collect Metrics

Prometheus scrapes metrics from the endpoints that exporters expose. For example, Prometheus collects CPU usage metrics from servers through Node Exporter and database query latency from MySQL through Mysqld Exporter.
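
As a minimal sketch of this step, a plain Prometheus configuration might declare the two exporters as scrape targets as shown here. The host names and the 30-second interval are assumptions for illustration; 9100 and 9104 are the default ports of Node Exporter and Mysqld Exporter. Unified Analytics may generate its scrape configuration differently.

```yaml
# prometheus.yml (fragment): illustrative only, not the Unified Analytics configuration
global:
  scrape_interval: 30s            # assumed interval between scrapes

scrape_configs:
  - job_name: node                # CPU, memory, and filesystem metrics from Node Exporter
    static_configs:
      - targets: ["node-exporter.example.internal:9100"]    # hypothetical host

  - job_name: mysql               # MySQL metrics from Mysqld Exporter
    static_configs:
      - targets: ["mysqld-exporter.example.internal:9104"]  # hypothetical host
```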

Evaluate Alert Rules

Prometheus evaluates the alerting rules, which are defined in PromQL, against the collected metrics at regular intervals.

Generate Alerts

If the condition for a rule is met, Prometheus generates an alert. For example:
  • The following alert rule fires if the average rate of HTTP 500 errors from the API server, measured over a 5-minute window, exceeds 5 per minute.
    avg by (job) (rate(http_requests_total{job="api_server", status_code="500"}[5m])) * 60 > 5
  • The following alert rule fires if the root filesystem has less than 10 GiB free.
    node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
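
In Prometheus, a condition like the disk-space example becomes an alert only when it is wrapped in an alerting rule. A minimal sketch of such a rule file follows. The group name, alert name, `for` duration, severity label, and annotation text are illustrative assumptions, not rules shipped with Unified Analytics.

```yaml
# example-rules.yml: hypothetical rule file, loaded through rule_files in prometheus.yml
groups:
  - name: example-node-alerts                  # assumed group name
    rules:
      - alert: RootFilesystemLowSpace          # assumed alert name
        # Condition from the example above: less than 10 GiB free on "/"
        expr: node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
        for: 5m                                # condition must hold this long before the alert fires
        labels:
          severity: warning                    # label that Alertmanager can route on
        annotations:
          summary: "Less than 10 GiB free on / of {{ $labels.instance }}"
```

With a `for` clause, the alert stays in the pending state until the condition has held for the whole duration, which helps avoid notifications for short-lived dips.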

Dispatch Alerts to Alertmanager

Prometheus sends the generated alerts to the configured Alertmanager.
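
In a plain Prometheus setup, the Alertmanager that receives alerts is named in the `alerting` section of the Prometheus configuration. The fragment below is a sketch under that assumption. The host name is hypothetical, and 9093 is the default Alertmanager port.

```yaml
# prometheus.yml (fragment): illustrative only
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.example.internal:9093"]   # hypothetical host
```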

Process Alerts

Alertmanager deduplicates, groups, and routes the alerts based on configured rules.

Send Notifications to Receivers

Alertmanager sends notifications to the appropriate recipients through the designated channels.
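
The grouping, routing, and notification steps can be sketched as a single Alertmanager configuration. Everything specific in it is an assumption for illustration: the receiver names, email address, SMTP host, webhook URL, timing values, and the `severity` label used for routing.

```yaml
# alertmanager.yml: illustrative only, not the Unified Analytics configuration
global:
  smtp_smarthost: "smtp.example.com:587"       # hypothetical mail relay
  smtp_from: "alertmanager@example.com"        # hypothetical sender address

route:
  receiver: default-email                      # fallback receiver for alerts no child route matches
  group_by: ["alertname", "job"]               # alerts sharing these labels go into one notification
  group_wait: 30s                              # wait before sending the first notification for a group
  group_interval: 5m                           # wait before notifying about new alerts in a group
  repeat_interval: 4h                          # wait before resending a notification that was already sent
  routes:
    - matchers:
        - 'severity="critical"'                # assumed label set by the alerting rule
      receiver: oncall-webhook

receivers:
  - name: default-email
    email_configs:
      - to: "ops@example.com"                  # hypothetical recipient
  - name: oncall-webhook
    webhook_configs:
      - url: "https://hooks.example.com/alerts"  # hypothetical endpoint
```

In this sketch, critical alerts go to the webhook receiver and all other alerts fall through to the email receiver.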

Resource Events Alerting

Alerts are triggered for the following events:
  • High resource CPU usage
  • High resource memory usage
  • Unusual pod restart
  • Pods not in running state
  • PVC status not Bound
  • Failed jobs
  • Failed cronjobs
  • Node failures
  • Unusual node memory or CPU usage behavior
  • Kubelet failures
  • Node filesystem issues
  • Node network issues
  • Prometheus issues

To find the list of alerts generated in HPE Ezmeral Unified Analytics Software, see List of Alerts.