List of Alerts

Provides a list of platform alerts and application alerts in HPE Ezmeral Unified Analytics Software.

HPE Ezmeral Unified Analytics Software issues the following platform alerts:

Node Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| Node.Resources | NodeCPUUsageHigh | Alerts when a node's average CPU utilization over a five-minute window consistently exceeds 80% for 15 minutes. | warning |
| Node.Resources | NodeMemoryUsageHigh | Alerts if a node's memory usage surpasses 85% of its total memory for 10 minutes. | warning |
| Node.Requests | NodeMemoryRequestsVsAllocatableWarning80 | Warns when memory requests from pods on a node reach 80-90% of the node's allocatable memory, indicating potential problems scheduling new pods. | warning |
| Node.Requests | NodeMemoryRequestsVsAllocatableCritical90 | Triggers a critical alert if memory requests from pods reach or exceed 90% of the node's allocatable memory, indicating a high risk of new pods being stuck in a pending state. | critical |
| Node.Requests | NodeCPURequestsVsAllocatableWarning80 | Warns when CPU requests from pods on a node reach 80-90% of the node's allocatable CPU resources, indicating potential problems scheduling new pods. | warning |
| Node.Requests | NodeCPURequestsVsAllocatableCritical90 | Triggers a critical alert if CPU requests from pods reach or exceed 90% of the node's allocatable CPU resources, indicating a high risk of new pods being stuck in a pending state. | critical |
| node-exporter | NodeFilesystemSpaceFillingUp | Alerts when a filesystem's free space drops below a threshold, predicting potential exhaustion within 24 hours. | warning |
| node-exporter | NodeFilesystemSpaceFillingUp | Alerts if a filesystem's free space is critically low, predicting exhaustion within 4 hours. | critical |
| node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 5%. | warning |
| node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 3%. | critical |
| node-exporter | NodeFilesystemFilesFillingUp | Alerts when inode usage on a filesystem is predicted to reach exhaustion within 24 hours. | warning |
| node-exporter | NodeFilesystemFilesFillingUp | Alerts if a filesystem's free inodes are critically low, predicting exhaustion within 4 hours. | critical |
| node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 5%. | warning |
| node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 3%. | critical |
| node-exporter | NodeNetworkReceiveErrs | Alerts if a network interface reports a high rate of receive errors. | warning |
| node-exporter | NodeNetworkTransmitErrs | Alerts if a network interface reports a high rate of transmit errors. | warning |
| node-exporter | NodeHighNumberConntrackEntriesUsed | Alerts if a large percentage of conntrack entries are in use. | warning |
| node-exporter | NodeTextFileCollectorScrapeError | Alerts if the Node Exporter's textfile collector fails to scrape metrics. | warning |
| node-exporter | NodeClockSkewDetected | Alerts if the node's clock is significantly out of sync. | warning |
| node-exporter | NodeClockNotSynchronising | Alerts if the node's clock is not synchronizing with NTP. | warning |
| node-exporter | NodeRAIDDiskFailure | Alerts if a device in a RAID array has failed. | warning |
| node-exporter | NodeFileDescriptorLimit | Warns when file descriptor usage approaches a defined limit. | warning |
| node-exporter | NodeFileDescriptorLimit | Alerts critically when file descriptor usage breaches a limit. | critical |
| node-exporter | NodeCPUHighUsage | Alerts when CPU usage exceeds 90% for a sustained period (15 minutes). | info |
| node-exporter | NodeSystemSaturation | Warns when the average system load per CPU core exceeds a high threshold. | warning |
| node-exporter | NodeMemoryMajorPagesFaults | Alerts when the rate of major memory page faults is high. | N/A |
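
The rule expressions behind these alerts are managed by the platform, but as an illustration, a NodeCPUUsageHigh-style rule built on standard node-exporter metrics might look like the sketch below. The expression, window, and labels are assumptions, not the shipped rule.

```yaml
groups:
  - name: Node.Resources
    rules:
      - alert: NodeCPUUsageHigh
        # Average non-idle CPU share per node, computed over 5-minute
        # windows and sustained for 15 minutes before the alert fires.
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} has exceeded 80% for 15 minutes."
```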

Etcd Alerts

| Alert Title | Description | Severity |
|---|---|---|
| etcdMembersDown | Alerts when members of the etcd cluster are down or experiencing network connectivity issues. | critical |
| etcdInsufficientMembers | Alerts when the etcd cluster does not have a sufficient number of members to reach quorum. | critical |
| etcdNoLeader | Alerts when the etcd cluster does not have a leader, indicating potential leadership election issues. | critical |
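
These correspond to well-known etcd health checks that reduce to single gauges. A minimal sketch of an etcdNoLeader-style rule follows; the job selector and hold duration are assumptions.

```yaml
- alert: etcdNoLeader
  # etcd_server_has_leader is 1 when the member sees a leader, 0 otherwise.
  expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
  for: 1m
  labels:
    severity: critical
```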

Resources Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| kubernetes-apps | KubePodCrashLooping | Alerts when a pod is repeatedly restarting due to crashes (CrashLoopBackOff state). | warning |
| kubernetes-apps | KubePodNotReady | Alerts when a pod remains in a "not ready" state for over 15 minutes. | warning |
| kubernetes-apps | KubeDeploymentGenerationMismatch | Alerts if a Deployment generation mismatch occurs, suggesting a failed rollback. | warning |
| kubernetes-apps | KubeDeploymentReplicasMismatch | Alerts if a Deployment has not scaled to the desired number of replicas within 15 minutes. | warning |
| kubernetes-apps | KubeDeploymentRolloutStuck | Alerts if a Deployment's rollout stalls for more than 15 minutes. | warning |
| kubernetes-apps | KubeStatefulSetReplicasMismatch | Alerts if a StatefulSet has not scaled to the desired number of replicas within 15 minutes. | warning |
| kubernetes-apps | KubeStatefulSetGenerationMismatch | Alerts if a StatefulSet generation mismatch occurs, suggesting a failed rollback. | warning |
| kubernetes-apps | KubeStatefulSetUpdateNotRolledOut | Alerts if a StatefulSet's update has not finished rolling out completely. | warning |
| kubernetes-apps | KubeDaemonSetRolloutStuck | Alerts if a DaemonSet rollout stalls or fails to progress within 15 minutes. | warning |
| kubernetes-apps | KubeContainerWaiting | Alerts if a container within a pod is in a waiting state for over an hour. | warning |
| kubernetes-apps | KubeDaemonSetNotScheduled | Alerts if one or more pods in a DaemonSet fail to be scheduled. | warning |
| kubernetes-apps | KubeDaemonSetMisScheduled | Alerts if one or more pods in a DaemonSet are scheduled on ineligible nodes. | warning |
| kubernetes-apps | KubeJobNotCompleted | Alerts if a Job takes longer than 12 hours (43200 seconds) to complete. | warning |
| kubernetes-apps | KubeJobFailed | Alerts if a Job fails to complete (enters a failed state). | warning |
| kubernetes-apps | KubeHpaReplicasMismatch | Alerts if a HorizontalPodAutoscaler (HPA) has not scaled to the desired number of replicas within 15 minutes. | warning |
| kubernetes-apps | KubeHpaMaxedOut | Alerts if an HPA persistently operates at its maximum replica count for over 15 minutes. | warning |
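
Most of these rules follow the upstream kubernetes-mixin pattern of evaluating kube-state-metrics series. As a hedged sketch, a KubePodCrashLooping-style rule might be written as follows; the selector and windows are assumptions rather than the exact shipped rule.

```yaml
- alert: KubePodCrashLooping
  # Fires when a container has been seen in CrashLoopBackOff at any point
  # in the last 5 minutes, sustained for 15 minutes.
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
  for: 15m
  labels:
    severity: warning
```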

Container Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| container.highmemoryusage.rules | Container has a high memory usage | Alerts when a container uses more than 80% of its memory limit. Includes details about the pod, container, namespace, etc. | warning |
| container.highcpuusage.rules | Container has a high CPU utilization rate | Alerts when a container uses more than 80% of its CPU limit. Includes details about the pod, container, namespace, etc. | warning |
| container.restarted.rules | Container has multiple restarts | Alerts on containers with multiple restarts, usually indicating instability. Includes relevant details about the container. | warning |
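
A memory-limit rule of the kind described above can be sketched by dividing cAdvisor working-set usage by the kube-state-metrics limit series; the exact label handling and hold duration in the shipped rule may differ.

```yaml
- alert: ContainerHighMemoryUsage
  # Working-set memory as a fraction of the container's configured memory limit.
  expr: |
    sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
      /
    sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
      > 0.8
  for: 5m
  labels:
    severity: warning
```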

GPU Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| pod.gpu.evicted | Pod Preempted Due To Inactivity | Alerts on GPU-requesting pods evicted due to exceeding the inactivity limit in their PriorityClass. Guides troubleshooting. | warning |
| pod.gpu.pending | Pending Pods Due To GPU Requirement | Alerts when GPU-requesting pods are stuck in pending because of insufficient resources or scheduling constraints. | warning |

Storage Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| kubernetes-storage | KubePersistentVolumeFillingUp | Alerts when a PersistentVolume's free space falls below 3%. | critical |
| kubernetes-storage | KubePersistentVolumeFillingUp | Warns when a PersistentVolume is predicted to fill up within 4 days and currently has less than 15% of its space available. | warning |
| kubernetes-storage | KubePersistentVolumeInodesFillingUp | Alerts when a PersistentVolume's free inodes fall below 3%. | critical |
| kubernetes-storage | KubePersistentVolumeInodesFillingUp | Warns when a PersistentVolume is predicted to run out of inodes within 4 days and has less than 15% of its inodes free. | warning |
| kubernetes-storage | KubePersistentVolumeErrors | Triggers when a PersistentVolume enters a "Failed" or "Pending" state, indicating potential provisioning issues. | critical |
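
The FillingUp rules pair a current-usage check with a linear projection over kubelet volume statistics. A sketch of the warning-level variant, assuming the upstream kubernetes-mixin shape (the shipped rule may add namespace and PVC join filters):

```yaml
- alert: KubePersistentVolumeFillingUp
  # Less than 15% free now, and projected to hit zero free bytes within
  # 4 days based on the last 6 hours of usage.
  expr: |
    kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
      and
    predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
```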

Kubelet Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| kubernetes-system-kubelet | KubeNodeNotReady | Alerts when a Kubernetes node has been in the "Not Ready" state for more than 15 minutes. | warning |
| kubernetes-system-kubelet | KubeNodeUnreachable | Alerts when a Kubernetes node becomes unreachable, indicating potential workload rescheduling. | critical |
| kubernetes-system-kubelet | KubeletTooManyPods | Alerts when a Kubelet is approaching its maximum pod capacity (95%). | info |
| kubernetes-system-kubelet | KubeNodeReadinessFlapping | Alerts when a node's readiness status changes frequently in a short period (more than twice in 15 minutes), suggesting instability. | warning |
| kubernetes-system-kubelet | KubeletPlegDurationHigh | Alerts when the Kubelet's Pod Lifecycle Event Generator (PLEG) takes a significant time to relist pods (99th percentile duration exceeding 10 seconds). | warning |
| kubernetes-system-kubelet | KubeletPodStartUpLatencyHigh | Alerts when the time for pods to reach full readiness becomes high (99th percentile exceeding 60 seconds). | warning |
| kubernetes-system-kubelet | KubeletClientCertificateExpiration | Warns when a Kubelet's client certificate is about to expire within a week. | warning |
| kubernetes-system-kubelet | KubeletClientCertificateExpiration | Alerts critically when a Kubelet's client certificate is about to expire within a day. | critical |
| kubernetes-system-kubelet | KubeletServerCertificateExpiration | Warns when a Kubelet's server certificate is about to expire within a week. | warning |
| kubernetes-system-kubelet | KubeletServerCertificateExpiration | Alerts critically when a Kubelet's server certificate is about to expire within a day. | critical |
| kubernetes-system-kubelet | KubeletClientCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its client certificate. | warning |
| kubernetes-system-kubelet | KubeletServerCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its server certificate. | warning |
| kubernetes-system-kubelet | KubeletDown | Alerts critically when a Kubelet disappears from Prometheus target discovery, potentially indicating a serious issue. | critical |
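
The certificate-expiration alerts compare the TTL reported by the Kubelet's certificate manager against fixed horizons. A minimal sketch of the one-week warning, assuming the upstream metric name (7 * 24 * 3600 seconds = 1 week):

```yaml
- alert: KubeletClientCertificateExpiration
  # Remaining lifetime of the Kubelet client certificate, in seconds.
  expr: kubelet_certificate_manager_client_ttl_seconds < 7 * 24 * 3600
  labels:
    severity: warning
```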

Billing Alerts

| Group Name | Title | Description |
|---|---|---|
| ezbilling.clusterstate.rules | Cluster is in disabled state | Alerts when the EzBilling cluster is disabled. Suggests contacting HPE Support. |
| ezbilling.upload.rules | Billing usage records not uploaded | Alerts when billing usage records failed to upload for the past 24 hours. |
| ezbilling.activation.code.grace.peroid.rules | Activation code grace period started | Alerts when the activation code grace period begins, providing the expiration date. |

Licensing Alerts

| Group Name | Title | Description |
|---|---|---|
| ezlicense.license.rules | Cluster is in disabled state | Alerts when the EzLicense cluster enters a disabled state, suggesting the need to contact HPE Support. |
| ezlicense.expiry.tenday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 10 days, providing the expiration date. |
| ezlicense.expiry.thirtyday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 30 days, providing the expiration date. |

Licensing Capacity Alerts

| Group Name | Title | Description |
|---|---|---|
| ezlicense.capacity.vCPU.rules | Worker node capacity has exceeded vCPU license capacity | Alerts when the vCPU capacity of worker nodes surpasses the available vCPU license limit. |
| ezlicense.capacity.GPU.rules | Worker node capacity has exceeded GPU license capacity | Alerts when the GPU capacity of worker nodes surpasses the available GPU license limit. |
| ezlicense.capacity.no.gpu.license.rules | GPU worker node found but no GPU license exists | Alerts when a GPU worker node is detected, but there is no corresponding GPU license available. |

Prometheus Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| prometheus | PrometheusBadConfig | Alerts when Prometheus fails to reload its configuration. | critical |
| prometheus | PrometheusSDRefreshFailure | Alerts when Prometheus fails to refresh service discovery (SD) with a specific mechanism. | warning |
| prometheus | PrometheusNotificationQueueRunningFull | Alerts when the Prometheus alert notification queue is predicted to reach full capacity soon. | warning |
| prometheus | PrometheusErrorSendingAlertsToSomeAlertmanagers | Alerts when Prometheus encounters a significant error rate (> 1%) sending alerts to a specific Alertmanager. | warning |
| prometheus | PrometheusNotConnectedToAlertmanagers | Alerts when Prometheus is not connected to any configured Alertmanagers. | warning |
| prometheus | PrometheusTSDBReloadsFailing | Alerts when Prometheus encounters repeated failures (> 0) during the loading of data blocks from disk. | warning |
| prometheus | PrometheusTSDBCompactionsFailing | Alerts when Prometheus encounters repeated failures (> 0) during block compactions. | warning |
| prometheus | PrometheusNotIngestingSamples | Alerts when a Prometheus instance stops ingesting new metric samples. | warning |
| prometheus | PrometheusDuplicateTimestamps | Alerts when Prometheus reports samples being dropped due to duplicate timestamps. | warning |
| prometheus | PrometheusOutOfOrderTimestamps | Alerts when Prometheus reports samples being dropped due to arriving out of order. | warning |
| prometheus | PrometheusRemoteStorageFailures | Alerts when Prometheus encounters a significant error rate (> 1%) when sending samples to configured remote storage. | critical |
| prometheus | PrometheusRemoteWriteBehind | Alerts when Prometheus remote write operations fall behind significantly (> 2 minutes). | critical |
| prometheus | PrometheusRemoteWriteDesiredShards | Alerts when the desired number of shards calculated for remote write exceeds the configured maximum. | warning |
| prometheus | PrometheusRuleFailures | Alerts when Prometheus encounters repeated failures during rule evaluations. | critical |
| prometheus | PrometheusMissingRuleEvaluations | Alerts when Prometheus misses rule group evaluations due to exceeding the allowed evaluation time. | warning |
| prometheus | PrometheusTargetLimitHit | Alerts when Prometheus drops targets because the number of targets exceeds a configured limit. | warning |
| prometheus | PrometheusLabelLimitHit | Alerts if Prometheus drops targets due to exceeding configured limits on label counts or label lengths. | warning |
| prometheus | PrometheusScrapeBodySizeLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum scrape body size. | warning |
| prometheus | PrometheusScrapeSampleLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum sample count. | warning |
| prometheus | PrometheusTargetSyncFailure | Alerts when Prometheus is unable to synchronize targets successfully due to configuration errors. | critical |
| prometheus | PrometheusHighQueryLoad | Alerts when the Prometheus query engine reaches close to full capacity, with less than 20% remaining. | warning |
| prometheus | PrometheusErrorSendingAlertsToAnyAlertmanager | Alerts when there is a persistent error rate (> 3%) while sending alerts from Prometheus to any configured Alertmanager. | critical |
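
Many of the Prometheus self-monitoring rules reduce to simple rates over Prometheus's own internal counters. For example, a PrometheusNotIngestingSamples-style check can be sketched as below; the 10-minute hold is an assumption.

```yaml
- alert: PrometheusNotIngestingSamples
  # No samples appended to the TSDB head over the last 5 minutes.
  expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
  for: 10m
  labels:
    severity: warning
```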

HPE Ezmeral Unified Analytics Software issues the following application alerts:

Airflow Alerts

| Group Name | Title | Description |
|---|---|---|
| airflow.scheduler.healthy.rules | Airflow Scheduler Unresponsive | Airflow Scheduler is not responding to health checks. |
| airflow.dag.import.rules | Airflow DAG Import Errors | Errors detected during import of DAGs from the Git repository. |
| airflow.tasks.queued.rules | Airflow Tasks Queued and Not Running | Airflow tasks are queued and unable to be executed. |
| airflow.tasks.starving.rules | Airflow Tasks Starving for Resources | Airflow tasks cannot be scheduled due to a lack of available resources in the pool. |
| airflow.dags.gitrepo.rules | Airflow DAG Git Repository Inaccessible | Airflow cannot access the Git repository containing DAGs. |

Kubeflow Alerts

| Group Name | Title | Description |
|---|---|---|
| kubeflow.katib.rules | Kubeflow katib stuck | Indicates a potential issue with Katib, such as not starting new experiments or trials, or not completing trials successfully. Suggests restarting the Katib controller. |

MLflow Alerts

| Group Name | Title | Description |
|---|---|---|
| mlflow_http_request_total | High MLflow HTTP Request Rate without status 200 | Alerts if more than 5% of HTTP requests to the MLflow server over a 5-minute window fail (do not have a status code of 200). |
| mlflow_http_request_duration_seconds_bucket | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 95th percentile of MLflow HTTP request durations exceeds 5 seconds within a 5-minute window, indicating potential slowdowns. |
| mlflow_http_request_duration_seconds_sum | Total duration in seconds of all incoming HTTP requests | Alerts if the total time spent handling all MLflow HTTP requests exceeds 600 seconds over a 5-minute period, suggesting overload. |
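
Because this table lists the underlying metric names, the latency alert can be sketched directly with histogram_quantile over the bucket metric. The alert name and severity label here are assumptions, but the threshold and window mirror the description above.

```yaml
- alert: MLflowHighRequestLatency
  # 95th percentile of request duration over a 5-minute window.
  expr: |
    histogram_quantile(0.95, sum by (le) (rate(mlflow_http_request_duration_seconds_bucket[5m]))) > 5
  for: 5m
  labels:
    severity: warning
```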

Ray Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| ray.object.store.memory.high.pressure.alert | Ray: High Pressure on Object Store Memory | Alerts when 90% of Ray object store memory is used consistently for 5 minutes. | warning |
| ray.node.memory.high.pressure.alert | Ray: High Memory Pressure on Ray Nodes | Alerts when a Ray node's memory usage exceeds 90% of its capacity for 5 minutes. | warning |
| ray.node.cpu.utilization.high.pressure.alert | Ray: High CPU Pressure on Ray Nodes | Alerts when CPU utilization across Ray nodes exceeds 95% for 5 minutes. | warning |
| ray.autoscaler.failed.node.creation.alert | Ray: Autoscaler Failed to Create Nodes | Alerts when the Ray autoscaler has failed attempts at creating new nodes for 5 minutes. | warning |
| ray.scheduler.failed.worker.startup.alert | Ray: Scheduler Failed Worker Startup | Alerts when the Ray scheduler encounters failures during worker startup for 5 minutes. | warning |
| ray.node.low.disk.space.alert | Ray: Low Disk Space on Nodes | Alerts when a Ray node has less than 10% of disk space free for 5 minutes. | warning |
| ray.node.network.high.usage.alert | Ray: High Network Usage on Ray Nodes | Alerts when network usage (receive + transmit) on Ray nodes exceeds a threshold for 5 minutes, indicating potential congestion. | warning |

Spark Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| spark.app.high.failed.rule | Spark Operator: High Failed App Count | Alerts when the number of failed Spark applications handled by the operator surpasses a threshold within a 5-minute window. | warning |
| spark.app.high.latency.rule | Spark Operator: High Average Latency for App Starting | Alerts when the average latency (time to start) for Spark applications exceeds 120 seconds for a 5-minute period. | warning |
| spark.app.submission.high.failed.percentage.rule | Spark Operator: High Percentage of Failed Spark App Submissions | Alerts when the failure rate of Spark application submissions exceeds 10% of total submissions for 15 minutes. | warning |
| spark.app.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Applications | Alerts when the success rate of Spark applications drops below 80% of total submissions for a 20-minute period. | warning |
| spark.app.executor.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Application Executors | Alerts when the success rate of Spark executors drops below 90% of total executors for a 20-minute period. | warning |
| spark.workload.high.memory.pressure.rule | Spark Workload: High Memory Pressure | Alerts when overall memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning |
| spark.workload.high.heap.memory.pressure.rule | Spark Workload: High On-Heap Memory Pressure | Alerts when on-heap memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning |
| spark.workload.high.cpu.usage.rule | Spark Workload: High JVM CPU Usage | Alerts when JVM CPU usage within Spark exceeds a critical threshold for 5 minutes. | warning |

Superset Alerts

| Group Name | Title | Description | Severity |
|---|---|---|---|
| superset.http.request.duration | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 99th percentile of HTTP request duration exceeds 3 seconds for 5 minutes, indicating slow responses. | critical |
| superset.http.request.total | Superset total number of HTTP requests without status 200 | Alerts if more than 5% of HTTP requests to Superset within a 5-minute window fail (do not have a status code of 200). | critical |
| superset.http.request.exceptions.total | Total number of HTTP requests which resulted in an exception | Alerts when more than 10 HTTP requests to Superset result in exceptions within a 5-minute window. | critical |
| superset.gc.objects.collected | Objects collected during GC | Alerts if more than 100 objects (generation 0) are collected during garbage collection within a 5-minute window. | warning |
| superset.gc.objects.uncollectable | Uncollectable objects found during GC | Alerts if more than 50 uncollectable objects (generation 0) are found during garbage collection within a 5-minute window. | warning |
| superset.gc.collections | Number of times this generation was collected | Alerts if the youngest generation (generation 0) of garbage collection has run more than 100 times within a 5-minute window, indicating potential memory pressure. | warning |