List of Alerts
Provides a list of platform alerts and application alerts in HPE Ezmeral Unified Analytics Software.
HPE Ezmeral Unified Analytics Software issues the following platform alerts:
Node Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| Node.Resources | NodeCPUUsageHigh | Alerts when a node's average CPU utilization over a five-minute window consistently exceeds 80% for 15 minutes. | warning | 
| Node.Resources | NodeMemoryUsageHigh | Alerts if a node's memory usage surpasses 85% of its total memory for 10 minutes. | warning | 
| Node.Requests | NodeMemoryRequestsVsAllocatableWarning80 | Warns when memory requests from pods on a node reach 80-90% of the node's allocatable memory, indicating potential problems scheduling new pods. | warning | 
| Node.Requests | NodeMemoryRequestsVsAllocatableCritical90 | Triggers a critical alert if memory requests from pods reach or exceed 90% of the node's allocatable memory, indicating a high risk of new pods being stuck in a pending state. | critical | 
| Node.Requests | NodeCPURequestsVsAllocatableWarning80 | Warns when CPU requests from pods on a node reach 80-90% of the node's allocatable CPU resources, indicating potential problems scheduling new pods. | warning | 
| Node.Requests | NodeCPURequestsVsAllocatableCritical90 | Triggers a critical alert if CPU requests from pods reach or exceed 90% of the node's allocatable CPU resources, indicating a high risk of new pods being stuck in a pending state. | critical | 
| node-exporter | NodeFilesystemSpaceFillingUp | Alerts when a filesystem's free space drops below a threshold, predicting potential exhaustion within 24 hours. | warning | 
| node-exporter | NodeFilesystemSpaceFillingUp | Alerts if a filesystem's free space is critically low, predicting exhaustion within 4 hours. | critical | 
| node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 5%. | warning | 
| node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 3%. | critical | 
| node-exporter | NodeFilesystemFilesFillingUp | Alerts when inode usage on a filesystem is predicted to reach exhaustion within 24 hours. | warning | 
| node-exporter | NodeFilesystemFilesFillingUp | Alerts if a filesystem's free inodes are critically low, predicting exhaustion within 4 hours. | critical | 
| node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 5%. | warning | 
| node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 3%. | critical | 
| node-exporter | NodeNetworkReceiveErrs | Alerts if a network interface reports a high rate of receive errors. | warning | 
| node-exporter | NodeNetworkTransmitErrs | Alerts if a network interface reports a high rate of transmit errors. | warning | 
| node-exporter | NodeHighNumberConntrackEntriesUsed | Alerts if a large percentage of conntrack entries are in use. | warning | 
| node-exporter | NodeTextFileCollectorScrapeError | Alerts if the Node Exporter's text file collector fails to scrape metrics. | warning | 
| node-exporter | NodeClockSkewDetected | Alerts if the node's clock is significantly out of sync. | warning | 
| node-exporter | NodeClockNotSynchronising | Alerts if the node's clock is not synchronizing with NTP. | warning | 
| node-exporter | NodeRAIDDiskFailure | Alerts if a device in a RAID array has failed. | warning | 
| node-exporter | NodeFileDescriptorLimit | Warns when file descriptor usage approaches a defined limit. | warning | 
| node-exporter | NodeFileDescriptorLimit | Alerts critically when file descriptor usage breaches a defined limit. | critical | 
| node-exporter | NodeCPUHighUsage | Alerts when CPU usage exceeds 90% for a sustained period (15 minutes). | info | 
| node-exporter | NodeSystemSaturation | Warns when the average system load per CPU core exceeds a high threshold. | warning | 
| node-exporter | NodeMemoryMajorPagesFaults | Warns when the rate of major memory page faults is high. | N/A | 
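The Node.Requests thresholds above are simple ratio bands: summed pod requests divided by the node's allocatable capacity, with a warning between 80% and 90% and a critical alert at 90% and above. The sketch below is illustrative only; it is not the product's alerting rule, and the function name and inputs are hypothetical.

```python
def classify_requests_vs_allocatable(requested: float, allocatable: float) -> str | None:
    """Illustrative only: map the ratio of summed pod requests to a node's
    allocatable capacity onto the warning/critical bands described above."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    ratio = requested / allocatable
    if ratio >= 0.90:   # Critical90 band: high risk of new pods stuck pending
        return "critical"
    if ratio >= 0.80:   # Warning80 band: potential scheduling problems
        return "warning"
    return None         # below 80%: no alert

# Example: 29 GiB of memory requests on a node with 32 GiB allocatable
# is roughly 90.6%, which falls in the critical band.
print(classify_requests_vs_allocatable(29 * 2**30, 32 * 2**30))  # -> critical
```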
Etcd Alerts
| Alert Title | Description | Severity | 
|---|---|---|
| etcdMembersDown | Alerts when members of the etcd cluster are down or experiencing network connectivity issues. | critical | 
| etcdInsufficientMembers | Alerts when the etcd cluster doesn't have a sufficient number of members to reach quorum. | critical | 
| etcdNoLeader | Alerts when the etcd cluster does not have a leader, indicating potential leadership election issues. | critical | 
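For context on etcdInsufficientMembers: etcd can only accept writes while a majority of members (quorum) is healthy, and quorum for an n-member cluster is floor(n/2) + 1. A tiny illustrative sketch; the function is hypothetical, not part of the product:

```python
def etcd_quorum(members: int) -> int:
    """Quorum for an etcd cluster is a strict majority: floor(n/2) + 1."""
    return members // 2 + 1

# A 3-member cluster needs 2 healthy members; if 2 of 3 are down,
# quorum is lost and etcdInsufficientMembers fires.
print(etcd_quorum(3))  # -> 2
print(etcd_quorum(5))  # -> 3
```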
Resources Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| kubernetes-apps | KubePodCrashLooping | Alerts when a pod is repeatedly restarting due to crashes (CrashLoopBackOff state). | warning | 
| kubernetes-apps | KubePodNotReady | Alerts when a pod remains in a "not ready" state for over 15 minutes. | warning | 
| kubernetes-apps | KubeDeploymentGenerationMismatch | Alerts if a deployment's generation mismatch occurs, suggesting a failed rollback. | warning | 
| kubernetes-apps | KubeDeploymentReplicasMismatch | Alerts if a deployment hasn't scaled to the desired number of replicas within 15 minutes. | warning | 
| kubernetes-apps | KubeDeploymentRolloutStuck | Alerts if a deployment's rollout stalls for more than 15 minutes. | warning | 
| kubernetes-apps | KubeStatefulSetReplicasMismatch | Alerts if a StatefulSet hasn't scaled to the desired number of replicas within 15 minutes. | warning | 
| kubernetes-apps | KubeStatefulSetGenerationMismatch | Alerts if a StatefulSet's generation mismatch occurs, suggesting a failed rollback. | warning | 
| kubernetes-apps | KubeStatefulSetUpdateNotRolledOut | Alerts if a StatefulSet's update hasn't finished rolling out completely. | warning | 
| kubernetes-apps | KubeDaemonSetRolloutStuck | Alerts if a DaemonSet rollout stalls or fails to progress within 15 minutes. | warning | 
| kubernetes-apps | KubeContainerWaiting | Alerts if a container within a pod is in a waiting state for over an hour. | warning | 
| kubernetes-apps | KubeDaemonSetNotScheduled | Alerts if one or more pods in a DaemonSet fail to be scheduled. | warning | 
| kubernetes-apps | KubeDaemonSetMisScheduled | Alerts if one or more pods in a DaemonSet are scheduled on ineligible nodes. | warning | 
| kubernetes-apps | KubeJobNotCompleted | Alerts if a Job takes longer than 12 hours (43200 seconds) to complete. | warning | 
| kubernetes-apps | KubeJobFailed | Alerts if a Job fails to complete (enters failed state). | warning | 
| kubernetes-apps | KubeHpaReplicasMismatch | Alerts if a HorizontalPodAutoscaler (HPA) hasn't scaled to the desired number of replicas within 15 minutes. | warning | 
| kubernetes-apps | KubeHpaMaxedOut | Alerts if an HPA persistently operates at its maximum replica count for over 15 minutes. | warning | 
Container Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| container.highmemoryusage.rules | Container has a high memory usage | Alerts when a container uses more than 80% of its memory limit. Includes details about the pod, container, namespace, etc. | warning | 
| container.highcpuusage.rules | Container has a high CPU utilization rate | Alerts when a container uses more than 80% of its CPU limit. Includes details about the pod, container, namespace, etc. | warning | 
| container.restarted.rules | Container has multiple restarts | Alerts on containers with multiple restarts, usually indicating instability. Includes relevant details about the container. | warning | 
GPU Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| pod.gpu.evicted | Pod Preempted Due To Inactivity | Alerts on GPU-requesting pods evicted due to exceeding the inactivity limit in their PriorityClass. Guides troubleshooting. | warning | 
| pod.gpu.pending | Pending Pods Due To GPU Requirement | Alerts when GPU-requesting pods are stuck in pending because of insufficient resources or scheduling constraints. | warning | 
Storage Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| kubernetes-storage | KubePersistentVolumeFillingUp | Alerts when a PersistentVolume's free space falls below 3%. | critical | 
| kubernetes-storage | KubePersistentVolumeFillingUp | Warns when a PersistentVolume is predicted to fill up within 4 days, and currently has less than 15% space available. | warning | 
| kubernetes-storage | KubePersistentVolumeInodesFillingUp | Alerts when a PersistentVolume's free inodes fall below 3%. | critical | 
| kubernetes-storage | KubePersistentVolumeInodesFillingUp | Warns when a PersistentVolume is predicted to run out of inodes within 4 days, and has less than 15% of its inodes free. | warning | 
| kubernetes-storage | KubePersistentVolumeErrors | Triggers when a PersistentVolume enters a "Failed" or "Pending" state, indicating potential provisioning issues. | critical | 
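The two "FillingUp" warnings above are prediction-based: given the recent rate of consumption, they estimate when free space or free inodes will reach zero, similar in spirit to Prometheus's predict_linear function. A minimal sketch of that extrapolation, assuming two free-capacity samples are already available; it is not the product's actual rule:

```python
def hours_until_full(free_now: float, free_earlier: float, window_hours: float) -> float | None:
    """Illustrative linear extrapolation: estimate hours until free capacity
    (bytes or inodes) reaches zero, given two samples taken window_hours apart.
    Returns None if usage is not growing."""
    consumed = free_earlier - free_now
    if consumed <= 0:
        return None  # capacity is not shrinking; no exhaustion predicted
    rate_per_hour = consumed / window_hours
    return free_now / rate_per_hour

# Example: 12 GiB free now versus 15 GiB free six hours ago extrapolates to
# roughly 24 hours until the volume is full, well inside the 4-day warning window.
eta = hours_until_full(12 * 2**30, 15 * 2**30, 6)
print(f"estimated hours until full: {eta:.1f}")
```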
Kubelet Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| kubernetes-system-kubelet | KubeNodeNotReady | Alerts when a Kubernetes node has been in the "Not Ready" state for more than 15 minutes. | warning | 
| kubernetes-system-kubelet | KubeNodeUnreachable | Alerts when a Kubernetes node becomes unreachable, indicating potential workload rescheduling. | critical | 
| kubernetes-system-kubelet | KubeletTooManyPods | Warns when a Kubelet is approaching its maximum pod capacity (95%). | info | 
| kubernetes-system-kubelet | KubeNodeReadinessFlapping | Alerts when a node's readiness status frequently changes in a short period (more than twice in 15 minutes), suggesting instability. | warning | 
| kubernetes-system-kubelet | KubeletPlegDurationHigh | Alerts when the Kubelet's Pod Lifecycle Event Generator (PLEG) takes a significant time to relist pods (99th percentile duration exceeding 10 seconds). | warning | 
| kubernetes-system-kubelet | KubeletPodStartUpLatencyHigh | Alerts when the time for pods to reach full readiness becomes high (99th percentile exceeding 60 seconds). | warning | 
| kubernetes-system-kubelet | KubeletClientCertificateExpiration | Warns when a Kubelet's client certificate is about to expire within a week. | warning | 
| kubernetes-system-kubelet | KubeletClientCertificateExpiration | Alerts critically when a Kubelet's client certificate is about to expire within a day. | critical | 
| kubernetes-system-kubelet | KubeletServerCertificateExpiration | Warns when a Kubelet's server certificate is about to expire within a week. | warning | 
| kubernetes-system-kubelet | KubeletServerCertificateExpiration | Alerts critically when a Kubelet's server certificate is about to expire within a day. | critical | 
| kubernetes-system-kubelet | KubeletClientCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its client certificate. | warning | 
| kubernetes-system-kubelet | KubeletServerCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its server certificate. | warning | 
| kubernetes-system-kubelet | KubeletDown | Alerts critically when a Kubelet disappears from Prometheus' target discovery, potentially indicating a serious issue. | critical | 
Billing Alerts
| Group Name | Title | Description | 
|---|---|---|
| ezbilling.clusterstate.rules | Cluster is in disabled state | Alerts when the EzBilling cluster is disabled. Suggests contacting HPE Support. | 
| ezbilling.upload.rules | Billing usage records not uploaded | Alerts when billing usage records failed to upload for the past 24 hours. | 
| ezbilling.activation.code.grace.peroid.rules | Activation code grace period started | Alerts when the activation code grace period begins, providing the expiration date. | 
Licensing Alerts
| Group Name | Title | Description | 
|---|---|---|
| ezlicense.license.rules | Cluster is in disabled state | Alerts when the EzLicense cluster enters a disabled state, suggesting the need to contact HPE Support. | 
| ezlicense.expiry.tenday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 10 days, providing the expiration date. | 
| ezlicense.expiry.thirtyday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 30 days, providing the expiration date. | 
Licensing Capacity Alerts
| Group Name | Title | Description | 
|---|---|---|
| ezlicense.capacity.vCPU.rules | Worker node capacity has exceeded vCPU license capacity | Alerts when the vCPU capacity of worker nodes surpasses the available vCPU license limit. | 
| ezlicense.capacity.GPU.rules | Worker node capacity has exceeded GPU license capacity | Alerts when the GPU capacity of worker nodes surpasses the available GPU license limit. | 
| ezlicense.capacity.no.gpu.license.rules | GPU worker node found but no GPU license exists | Alerts when a GPU worker node is detected, but there's no corresponding GPU license available. | 
Prometheus Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| prometheus | PrometheusBadConfig | Alerts when Prometheus fails to reload its configuration. | critical | 
| prometheus | PrometheusSDRefreshFailure | Alerts when Prometheus fails to refresh service discovery (SD) with a specific mechanism. | warning | 
| prometheus | PrometheusNotificationQueueRunningFull | Alerts when the Prometheus alert notification queue is predicted to reach full capacity soon. | warning | 
| prometheus | PrometheusErrorSendingAlertsToSomeAlertmanagers | Alerts when Prometheus encounters a significant error rate (> 1%) sending alerts to a specific Alertmanager. | warning | 
| prometheus | PrometheusNotConnectedToAlertmanagers | Alerts when Prometheus is not connected to any configured Alertmanagers. | warning | 
| prometheus | PrometheusTSDBReloadsFailing | Alerts when Prometheus encounters repeated failures (>0) during the loading of data blocks from disk. | warning | 
| prometheus | PrometheusTSDBCompactionsFailing | Alerts when Prometheus encounters repeated failures (>0) during block compactions. | warning | 
| prometheus | PrometheusNotIngestingSamples | Alerts when a Prometheus instance stops ingesting new metric samples. | warning | 
| prometheus | PrometheusDuplicateTimestamps | Alerts when Prometheus reports samples being dropped due to duplicate timestamps. | warning | 
| prometheus | PrometheusOutOfOrderTimestamps | Alerts when Prometheus reports samples being dropped due to arriving out of order. | warning | 
| prometheus | PrometheusRemoteStorageFailures | Alerts when Prometheus encounters a significant error rate (> 1%) when sending samples to configured remote storage. | critical | 
| prometheus | PrometheusRemoteWriteBehind | Alerts when Prometheus remote write operations fall behind significantly (> 2 minutes). | critical | 
| prometheus | PrometheusRemoteWriteDesiredShards | Alerts when the desired number of shards calculated for remote write exceeds the configured maximum. | warning | 
| prometheus | PrometheusRuleFailures | Alerts when Prometheus encounters repeated failures during rule evaluations. | critical | 
| prometheus | PrometheusMissingRuleEvaluations | Alerts when Prometheus misses rule group evaluations due to exceeding the allowed evaluation time. | warning | 
| prometheus | PrometheusTargetLimitHit | Alerts when Prometheus drops targets because the number of targets exceeds a configured limit. | warning | 
| prometheus | PrometheusLabelLimitHit | Alerts if Prometheus drops targets due to exceeding configured limits on label counts or label lengths. | warning | 
| prometheus | PrometheusScrapeBodySizeLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum scrape body size. | warning | 
| prometheus | PrometheusScrapeSampleLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum sample count. | warning | 
| prometheus | PrometheusTargetSyncFailure | Alerts when Prometheus is unable to synchronize targets successfully due to configuration errors. | critical | 
| prometheus | PrometheusHighQueryLoad | Alerts when the Prometheus query engine reaches close to full capacity, with less than 20% remaining. | warning | 
| prometheus | PrometheusErrorSendingAlertsToAnyAlertmanager | Alerts when there's a persistent error rate (> 3%) while sending alerts from Prometheus to any configured Alertmanager. | critical | 
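To check which of the alerts listed on this page are currently firing, you can query the Prometheus HTTP API directly. A minimal sketch using the standard /api/v1/alerts endpoint; the URL below is a placeholder, and authentication details depend on your deployment:

```python
import requests

# Placeholder: substitute the Prometheus endpoint for your environment.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

def firing_alerts(prometheus_url: str = PROMETHEUS_URL) -> list[dict]:
    """Return name/severity summaries of currently firing alerts from the
    standard Prometheus /api/v1/alerts endpoint."""
    resp = requests.get(f"{prometheus_url}/api/v1/alerts", timeout=10)
    resp.raise_for_status()
    alerts = resp.json()["data"]["alerts"]
    return [
        {
            "alertname": a["labels"].get("alertname", ""),
            "severity": a["labels"].get("severity", "none"),
        }
        for a in alerts
        if a.get("state") == "firing"
    ]

for alert in firing_alerts():
    print(f'{alert["alertname"]}: {alert["severity"]}')
```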
HPE Ezmeral Unified Analytics Software issues the following application alerts:
Airflow Alerts
| Group Name | Title | Description | 
|---|---|---|
| airflow.scheduler.healthy.rules | Airflow Scheduler Unresponsive | Airflow Scheduler is not responding to health checks. | 
| airflow.dag.import.rules | Airflow DAG Import Errors | Errors detected during import of DAGs from the Git repository. | 
| airflow.tasks.queued.rules | Airflow Tasks Queued and Not Running | Airflow tasks are queued and unable to be executed. | 
| airflow.tasks.starving.rules | Airflow Tasks Starving for Resources | Airflow tasks cannot be scheduled due to lack of available resources in the pool. | 
| airflow.dags.gitrepo.rules | Airflow DAG Git Repository Inaccessible | Airflow cannot access the Git repository containing DAGs. | 
Kubeflow Alerts
| Group Name | Title | Description | 
|---|---|---|
| kubeflow.katib.rules | Kubeflow katib stuck | Indicates a potential issue with Katib: it is not starting new experiments or trials, or not completing trials successfully. Suggests restarting the Katib controller. | 
MLflow Alerts
| Group Name | Title | Description | 
|---|---|---|
| mlflow_http_request_total | High MLflow HTTP Request Rate without status 200 | Alerts if more than 5% of HTTP requests to the MLflow server over a 5-minute window fail (don't have a status code of 200). | 
| mlflow_http_request_duration_seconds_bucket | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 95th percentile of MLflow HTTP request durations exceeds 5 seconds within a 5-minute window, indicating potential slowdowns. | 
| mlflow_http_request_duration_seconds_sum | Total duration in seconds of all incoming HTTP requests | Alerts if the total time spent handling all MLflow HTTP requests exceeds 600 seconds over a 5-minute period, suggesting overload. | 
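The failure-rate alert above is a ratio check: non-200 responses divided by total requests over the window, compared against 5%. A small sketch of that arithmetic, using hypothetical counter values sampled at the start and end of a 5-minute window:

```python
def non_200_rate(total_start: int, total_end: int, ok_start: int, ok_end: int) -> float:
    """Illustrative only: fraction of requests in the window that did not
    return HTTP 200, computed from counter deltas."""
    total = total_end - total_start
    if total == 0:
        return 0.0
    failed = total - (ok_end - ok_start)
    return failed / total

# Example: 400 requests in the last 5 minutes, 370 of them returned 200,
# giving a 7.5% failure rate, which exceeds the 5% threshold described above.
rate = non_200_rate(10_000, 10_400, 9_600, 9_970)
print(f"non-200 rate: {rate:.1%}")
```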
Ray Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| ray.object.store.memory.high.pressure.alert | Ray: High Pressure on Object Store Memory | Alerts when 90% of Ray object store memory is used consistently for 5 minutes. | warning | 
| ray.node.memory.high.pressure.alert | Ray: High Memory Pressure on Ray Nodes | Alerts when a Ray node's memory usage exceeds 90% of its capacity for 5 minutes. | warning | 
| ray.node.cpu.utilization.high.pressure.alert | Ray: High CPU Pressure on Ray Nodes | Alerts when CPU utilization across Ray nodes exceeds 95% for 5 minutes. | warning | 
| ray.autoscaler.failed.node.creation.alert | Ray: Autoscaler Failed to Create Nodes | Alerts when the Ray autoscaler has failed attempts at creating new nodes for 5 minutes. | warning | 
| ray.scheduler.failed.worker.startup.alert | Ray: Scheduler Failed Worker Startup | Alerts when the Ray scheduler encounters failures during worker startup for 5 minutes. | warning | 
| ray.node.low.disk.space.alert | Ray: Low Disk Space on Nodes | Alerts when a Ray node has less than 10% of disk space free for 5 minutes. | warning | 
| ray.node.network.high.usage.alert | Ray: High Network Usage on Ray Nodes | Alerts when network usage (receive + transmit) on Ray nodes exceeds a threshold for 5 minutes, indicating potential congestion. | warning | 
Spark Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| spark.app.high.failed.rule | Spark Operator: High Failed App Count | Alerts when the number of failed Spark applications handled by the operator surpasses a threshold within a 5-minute window. | warning | 
| spark.app.high.latency.rule | Spark Operator: High Average Latency for App Starting | Alerts when the average latency (time to start) for Spark applications exceeds 120 seconds for a 5-minute period. | warning | 
| spark.app.submission.high.failed.percentage.rule | Spark Operator: High Percentage of Failed Spark App Submissions | Alerts when the failure rate of Spark application submissions exceeds 10% of total submissions for 15 minutes. | warning | 
| spark.app.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Applications | Alerts when the success rate of Spark applications drops below 80% of total submissions for a 20-minute period. | warning | 
| spark.app.executor.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Application Executors | Alerts when the success rate of Spark executors drops below 90% of total executors for a 20-minute period. | warning | 
| spark.workload.high.memory.pressure.rule | Spark Workload: High Memory Pressure | Alerts when overall memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning | 
| spark.workload.high.heap.memory.pressure.rule | Spark Workload: High On-Heap Memory Pressure | Alerts when on-heap memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning | 
| spark.workload.high.cpu.usage.rule | Spark Workload: High JVM CPU Usage | Alerts when JVM CPU usage within Spark exceeds a critical threshold for 5 minutes. | warning | 
Superset Alerts
| Group Name | Title | Description | Severity | 
|---|---|---|---|
| superset.http.request.duration | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 99th percentile of HTTP request duration exceeds 3 seconds for 5 minutes, indicating slow responses. | critical | 
| superset.http.request.total | Superset total number of HTTP requests without status 200 | Alerts if more than 5% of HTTP requests to Superset within a 5-minute window fail (don't have a status code of 200). | critical | 
| superset.http.request.exceptions.total | Total number of HTTP requests which resulted in an exception | Alerts when more than 10 HTTP requests to Superset result in exceptions within a 5-minute window. | critical | 
| superset.gc.objects.collected | Objects collected during gc | Alerts if more than 100 objects (generation 0) are collected during garbage collection within a 5-minute window. | warning | 
| superset.gc.objects.uncollectable | Uncollectable objects found during GC | Alerts if more than 50 uncollectable objects (generation 0) are found during garbage collection within a 5-minute window. | warning | 
| superset.gc.collections | Number of times this generation was collected | Alerts if the youngest generation (0) of garbage collection has run more than 100 times within a 5-minute window, indicating potential memory pressure. | warning |
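The duration alerts for MLflow and Superset are computed from cumulative histogram buckets (the _bucket series), from which a percentile is estimated by linear interpolation, the same idea as Prometheus's histogram_quantile(). A sketch of that estimation with made-up bucket counts; it is not the product's rule logic:

```python
def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Illustrative only: estimate the q-th quantile from cumulative histogram
    buckets [(upper_bound_seconds, cumulative_count), ...] by linear
    interpolation within the bucket that contains the target rank."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket; nothing to interpolate
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative buckets: (upper bound in seconds, cumulative requests).
buckets = [(0.5, 120), (1.0, 180), (3.0, 196), (5.0, 199), (10.0, 200)]
# p99 lands in the 3s-5s bucket and interpolates to about 4.3s, which would
# cross the 3-second threshold in the Superset duration alert above.
print(f"p99 ≈ {quantile_from_buckets(0.99, buckets):.2f}s")
```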