List of Alerts
Provides a list of platform alerts and application alerts in HPE Ezmeral Unified Analytics Software.
HPE Ezmeral Unified Analytics Software issues the following platform alerts:
Node Alerts
Group | Title | Description | Severity |
---|---|---|---|
Node.Resources | NodeCPUUsageHigh | Alerts when a node's average CPU utilization over a five-minute window consistently exceeds 80% for 15 minutes. | warning |
Node.Resources | NodeMemoryUsageHigh | Alerts if a node's memory usage surpasses 85% of its total memory for 10 minutes. | warning |
Node.Requests | NodeMemoryRequestsVsAllocatableWarning80 | Warns when memory requests from pods on a node reach 80-90% of the node's allocatable memory, indicating potential problems scheduling new pods. | warning |
Node.Requests | NodeMemoryRequestsVsAllocatableCritical90 | Triggers a critical alert if memory requests from pods reach or exceed 90% of the node's allocatable memory, indicating a high risk of new pods being stuck in a pending state. | critical |
Node.Requests | NodeCPURequestsVsAllocatableWarning80 | Warns when CPU requests from pods on a node reach 80-90% of the node's allocatable CPU resources, indicating potential problems scheduling new pods. | warning |
Node.Requests | NodeCPURequestsVsAllocatableCritical90 | Triggers a critical alert if CPU requests from pods reach or exceed 90% of the node's allocatable CPU resources, indicating a high risk of new pods being stuck in a pending state. | critical |
node-exporter | NodeFilesystemSpaceFillingUp | Alerts when a filesystem's free space drops below a threshold, predicting potential exhaustion within 24 hours. | warning |
node-exporter | NodeFilesystemSpaceFillingUp | Alerts if a filesystem's free space is critically low, predicting exhaustion within 4 hours. | critical |
node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 5%. | warning |
node-exporter | NodeFilesystemAlmostOutOfSpace | Alerts if a filesystem's free space falls below 3%. | critical |
node-exporter | NodeFilesystemFilesFillingUp | Alerts when inode usage on a filesystem is predicted to reach exhaustion within 24 hours. | warning |
node-exporter | NodeFilesystemFilesFillingUp | Alerts if inode usage is critically low, predicting exhaustion within 4 hours. | critical |
node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 5%. | warning |
node-exporter | NodeFilesystemAlmostOutOfFiles | Alerts if a filesystem's free inodes fall below 3%. | critical |
node-exporter | NodeNetworkReceiveErrs | Alerts if a network interface reports a high rate of receive errors. | warning |
node-exporter | NodeNetworkTransmitErrs | Alerts if a network interface reports a high rate of transmit errors. | warning |
node-exporter | NodeHighNumberConntrackEntriesUsed | Alerts if a large percentage of conntrack entries are in use. | warning |
node-exporter | NodeTextFileCollectorScrapeError | Alerts if the Node Exporter's text file collector fails to scrape metrics. | warning |
node-exporter | NodeClockSkewDetected | Alerts if the node's clock is significantly out of sync. | warning |
node-exporter | NodeClockNotSynchronising | Alerts if the node's clock is not synchronizing with NTP. | warning |
node-exporter | NodeRAIDDiskFailure | Alerts if a device in a RAID array has failed. | warning |
node-exporter | NodeFileDescriptorLimit | Warns when file descriptor usage approaches a defined limit. | warning |
node-exporter | NodeFileDescriptorLimit | Critically alerts when file descriptor usage breaches a limit. | critical |
node-exporter | NodeCPUHighUsage | Warns when CPU usage exceeds 90% for a sustained period (15 minutes). | info |
node-exporter | NodeSystemSaturation | Warns when the average system load per CPU core exceeds a high threshold. | warning |
node-exporter | NodeMemoryMajorPagesFaults | Warns when the rate of major page faults on a node is high. | warning |
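These node alerts, like the others on this page, are evaluated by the platform's Prometheus instance, so you can check which of them are currently firing through the Prometheus HTTP API. The following is a minimal sketch, assuming Python with the `requests` package and a Prometheus endpoint reachable at the URL shown (substitute the address for your deployment, for example one exposed through port-forwarding):

```python
# Minimal sketch: list currently firing alerts via the Prometheus HTTP API.
# PROMETHEUS_URL is an assumed placeholder; point it at your Prometheus instance.
import requests

PROMETHEUS_URL = "http://localhost:9090"

def firing_alerts():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10)
    resp.raise_for_status()
    alerts = resp.json()["data"]["alerts"]
    # Keep only alerts that are actively firing (not merely pending).
    return [a for a in alerts if a.get("state") == "firing"]

if __name__ == "__main__":
    for alert in firing_alerts():
        labels = alert["labels"]
        print(f'{labels.get("alertname")}  severity={labels.get("severity")}  '
              f'instance={labels.get("instance", "-")}')
```

Pending alerts (conditions that have been true for less than the rule's `for` duration) are excluded by the `state` filter.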
Etcd Alerts
Alert Title | Description | Severity |
---|---|---|
etcdMembersDown | Alerts when members of the etcd cluster are down or experiencing network connectivity issues. | critical |
etcdInsufficientMembers | Alerts when the etcd cluster doesn't have a sufficient number of members to reach quorum. | critical |
etcdNoLeader | Alerts when the etcd cluster does not have a leader, indicating potential leadership election issues. | critical |
Resources Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
kubernetes-apps | KubePodCrashLooping | Alerts when a pod is repeatedly restarting due to crashes (CrashLoopBackOff state). | warning |
kubernetes-apps | KubePodNotReady | Alerts when a pod remains in a "not ready" state for over 15 minutes. | warning |
kubernetes-apps | KubeDeploymentGenerationMismatch | Alerts if a deployment's observed generation does not match its metadata generation, suggesting a failed rollback. | warning |
kubernetes-apps | KubeDeploymentReplicasMismatch | Alerts if a deployment hasn't scaled to the desired number of replicas within 15 minutes. | warning |
kubernetes-apps | KubeDeploymentRolloutStuck | Alerts if a deployment's rollout stalls for more than 15 minutes. | warning |
kubernetes-apps | KubeStatefulSetReplicasMismatch | Alerts if a StatefulSet hasn't scaled to the desired number of replicas within 15 minutes. | warning |
kubernetes-apps | KubeStatefulSetGenerationMismatch | Alerts if a StatefulSet's observed generation does not match its metadata generation, suggesting a failed rollback. | warning |
kubernetes-apps | KubeStatefulSetUpdateNotRolledOut | Alerts if a StatefulSet's update hasn't finished rolling out completely. | warning |
kubernetes-apps | KubeDaemonSetRolloutStuck | Alerts if a DaemonSet rollout stalls or fails to progress within 15 minutes. | warning |
kubernetes-apps | KubeContainerWaiting | Alerts if a container within a pod is in a waiting state for over an hour. | warning |
kubernetes-apps | KubeDaemonSetNotScheduled | Alerts if one or more pods in a DaemonSet fail to be scheduled. | warning |
kubernetes-apps | KubeDaemonSetMisScheduled | Alerts if one or more pods in a DaemonSet are scheduled on ineligible nodes. | warning |
kubernetes-apps | KubeJobNotCompleted | Alerts if a Job takes longer than 12 hours (43200 seconds) to complete. | warning |
kubernetes-apps | KubeJobFailed | Alerts if a Job fails to complete (enters failed state). | warning |
kubernetes-apps | KubeHpaReplicasMismatch | Alerts if a HorizontalPodAutoscaler (HPA) hasn't scaled to the desired number of replicas within 15 minutes. | warning |
kubernetes-apps | KubeHpaMaxedOut | Alerts if an HPA persistently operates at its maximum replica count for over 15 minutes. | warning |
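To see the exact PromQL expressions and current states behind the kubernetes-apps alerts listed above, you can read them back from the Prometheus rules API. A small sketch, again assuming Python with `requests` and a placeholder Prometheus URL:

```python
# Sketch: print the alerting rules and their current states for the
# kubernetes-apps rule group. PROMETHEUS_URL is an assumed placeholder.
import requests

PROMETHEUS_URL = "http://localhost:9090"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/rules", timeout=10)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    if group["name"] != "kubernetes-apps":
        continue
    for rule in group["rules"]:
        if rule.get("type") != "alerting":
            continue  # skip recording rules
        # Each alerting rule reports its state: inactive, pending, or firing.
        print(f'{rule["name"]:40s} state={rule.get("state")}')
        print(f'  expr: {rule["query"]}')
```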
Container Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
container.highmemoryusage.rules | Container has a high memory usage | Alerts when a container uses more than 80% of its memory limit. Includes details about the pod, container, namespace, etc. | warning |
container.highcpuusage.rules | Container has a high CPU utilization rate | Alerts when a container uses more than 80% of its CPU limit. Includes details about the pod, container, namespace, etc. | warning |
container.restarted.rules | Container has multiple restarts | Alerts on containers with multiple restarts, usually indicating instability. Includes relevant details about the container. | warning |
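The 80%-of-limit conditions in this table can also be checked ad hoc with a PromQL instant query. The sketch below shows one plausible form of the memory check, built from the standard cAdvisor and kube-state-metrics metrics; the expression actually shipped with the platform may differ, and the Prometheus URL is a placeholder:

```python
# Sketch: find containers currently using more than 80% of their memory limit.
# The PromQL is an approximation of the condition described in the table above,
# not necessarily the platform's exact rule. PROMETHEUS_URL is a placeholder.
import requests

PROMETHEUS_URL = "http://localhost:9090"
QUERY = """
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  > 0.8
"""

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    m = sample["metric"]
    ratio = float(sample["value"][1])
    print(f'{m["namespace"]}/{m["pod"]}/{m["container"]}: {ratio:.0%} of memory limit')
```

The CPU variant has the same shape, with a `rate()` over `container_cpu_usage_seconds_total` in the numerator and the CPU resource limit in the denominator.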
GPU Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
pod.gpu.evicted | Pod Preempted Due To Inactivity | Alerts on GPU-requesting pods evicted due to exceeding the inactivity limit in their PriorityClass. Guides troubleshooting. | warning |
pod.gpu.pending | Pending Pods Due To GPU Requirement | Alerts when GPU-requesting pods are stuck in pending because of insufficient resources or scheduling constraints. | warning |
Storage Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
kubernetes-storage | KubePersistentVolumeFillingUp | Alerts when a PersistentVolume's free space falls below 3%. | critical |
kubernetes-storage | KubePersistentVolumeFillingUp | Warns when a PersistentVolume is predicted to fill up within 4 days, and currently has less than 15% space available. | warning |
kubernetes-storage | KubePersistentVolumeInodesFillingUp | Alerts when a PersistentVolume's free inodes fall below 3%. | critical |
kubernetes-storage | KubePersistentVolumeInodesFillingUp | Warns when a PersistentVolume is predicted to run out of inodes within 4 days, and has less than 15% of its inodes free. | warning |
kubernetes-storage | KubePersistentVolumeErrors | Triggers when a PersistentVolume enters a "Failed" or "Pending" state, indicating potential provisioning issues. | critical |
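The "filling up" alerts above combine a free-space threshold with a linear prediction. As a rough illustration, the warning condition can be approximated with `predict_linear()` over the kubelet volume-stats metrics; this is a sketch of the general shape rather than the exact rule the platform ships, and the Prometheus URL is a placeholder:

```python
# Sketch: PersistentVolumes below 15% free space that, by linear extrapolation
# of the last 6 hours, are predicted to run out of space within 4 days.
# Approximates the warning condition described above. PROMETHEUS_URL is a placeholder.
import requests

PROMETHEUS_URL = "http://localhost:9090"
QUERY = """
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15)
and
(predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0)
"""

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    m = sample["metric"]
    print(f'PVC {m.get("namespace")}/{m.get("persistentvolumeclaim")} is predicted to fill up')
```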
Kubelet Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
kubernetes-system-kubelet | KubeNodeNotReady | Alerts when a Kubernetes node has been in the "Not Ready" state for more than 15 minutes. | warning |
kubernetes-system-kubelet | KubeNodeUnreachable | Alerts when a Kubernetes node becomes unreachable, indicating potential workload rescheduling. | critical |
kubernetes-system-kubelet | KubeletTooManyPods | Warns when a Kubelet is approaching its maximum pod capacity (95%). | info |
kubernetes-system-kubelet | KubeNodeReadinessFlapping | Alerts when a node's readiness status frequently changes in a short period (more than twice in 15 minutes), suggesting instability. | warning |
kubernetes-system-kubelet | KubeletPlegDurationHigh | Alerts when the Kubelet's Pod Lifecycle Event Generator (PLEG) takes a significant time to relist pods (99th percentile duration exceeding 10 seconds). | warning |
kubernetes-system-kubelet | KubeletPodStartUpLatencyHigh | Alerts when pod startup latency becomes high (99th percentile exceeding 60 seconds). | warning |
kubernetes-system-kubelet | KubeletClientCertificateExpiration | Warns when a Kubelet's client certificate is about to expire within a week. | warning |
kubernetes-system-kubelet | KubeletClientCertificateExpiration | Alerts critically when a Kubelet's client certificate is about to expire within a day. | critical |
kubernetes-system-kubelet | KubeletServerCertificateExpiration | Warns when a Kubelet's server certificate is about to expire within a week. | warning |
kubernetes-system-kubelet | KubeletServerCertificateExpiration | Alerts critically when a Kubelet's server certificate is about to expire within a day. | critical |
kubernetes-system-kubelet | KubeletClientCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its client certificate. | warning |
kubernetes-system-kubelet | KubeletServerCertificateRenewalErrors | Alerts when a Kubelet encounters repeated errors while attempting to renew its server certificate. | warning |
kubernetes-system-kubelet | KubeletDown | Alerts critically when a Kubelet disappears from Prometheus' target discovery, potentially indicating a serious issue. | critical |
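Several of these alerts (KubeNodeNotReady, KubeNodeUnreachable, KubeNodeReadinessFlapping) ultimately reflect the node's Ready condition, which you can cross-check directly against the Kubernetes API. A minimal sketch, assuming the official `kubernetes` Python client and a kubeconfig with access to the cluster:

```python
# Sketch: list each node's Ready condition to cross-check node-readiness alerts.
# Assumes the official `kubernetes` Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in a pod

for node in client.CoreV1Api().list_node().items:
    ready = next((c for c in (node.status.conditions or []) if c.type == "Ready"), None)
    status = ready.status if ready else "Unknown"
    print(f"{node.metadata.name}: Ready={status}")
```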
Billing Alerts
Group Name | Title | Description |
---|---|---|
ezbilling.clusterstate.rules | Cluster is in disabled state | Alerts when the EzBilling cluster is disabled. Suggests contacting HPE Support. |
ezbilling.upload.rules | Billing usage records not uploaded | Alerts when billing usage records failed to upload for the past 24 hours. |
ezbilling.activation.code.grace.peroid.rules | Activation code grace period started | Alerts when the activation code grace period begins, providing the expiration date. |
Licensing Alerts
Group Name | Title | Description |
---|---|---|
ezlicense.license.rules | Cluster is in disabled state | Alerts when the EzLicense cluster enters a disabled state, suggesting the need to contact HPE Support. |
ezlicense.expiry.tenday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 10 days, providing the expiration date. |
ezlicense.expiry.thirtyday.rules | Activation key expiration | Alerts when an activation key is going to expire within the next 30 days, providing the expiration date. |
Licensing Capacity Alerts
Group Name | Title | Description |
---|---|---|
ezlicense.capacity.vCPU.rules | Worker node capacity has exceeded vCPU license capacity | Alerts when the vCPU capacity of worker nodes surpasses the available vCPU license limit. |
ezlicense.capacity.GPU.rules | Worker node capacity has exceeded GPU license capacity | Alerts when the GPU capacity of worker nodes surpasses the available GPU license limit. |
ezlicense.capacity.no.gpu.license.rules | GPU worker node found but no GPU license exists | Alerts when a GPU worker node is detected, but there's no corresponding GPU license available. |
Prometheus Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
prometheus | PrometheusBadConfig | Alerts when Prometheus fails to reload its configuration. | critical |
prometheus | PrometheusSDRefreshFailure | Alerts when Prometheus fails to refresh service discovery (SD) with a specific mechanism. | warning |
prometheus | PrometheusNotificationQueueRunningFull | Alerts when the Prometheus alert notification queue is predicted to reach full capacity soon. | warning |
prometheus | PrometheusErrorSendingAlertsToSomeAlertmanagers | Alerts when Prometheus encounters a significant error rate (> 1%) sending alerts to a specific Alertmanager. | warning |
prometheus | PrometheusNotConnectedToAlertmanagers | Alerts when Prometheus is not connected to any configured Alertmanagers. | warning |
prometheus | PrometheusTSDBReloadsFailing | Alerts when Prometheus encounters repeated failures (>0) during the loading of data blocks from disk. | warning |
prometheus | PrometheusTSDBCompactionsFailing | Alerts when Prometheus encounters repeated failures (>0) during block compactions. | warning |
prometheus | PrometheusNotIngestingSamples | Alerts when a Prometheus instance stops ingesting new metric samples. | warning |
prometheus | PrometheusDuplicateTimestamps | Alerts when Prometheus reports samples being dropped due to duplicate timestamps. | warning |
prometheus | PrometheusOutOfOrderTimestamps | Alerts when Prometheus reports samples being dropped due to arriving out of order. | warning |
prometheus | PrometheusRemoteStorageFailures | Alerts when Prometheus encounters a significant error rate (> 1%) when sending samples to configured remote storage. | critical |
prometheus | PrometheusRemoteWriteBehind | Alerts when Prometheus remote write operations fall behind significantly (> 2 minutes). | critical |
prometheus | PrometheusRemoteWriteDesiredShards | Alerts when the desired number of shards calculated for remote write exceeds the configured maximum. | warning |
prometheus | PrometheusRuleFailures | Alerts when Prometheus encounters repeated failures during rule evaluations. | critical |
prometheus | PrometheusMissingRuleEvaluations | Alerts when Prometheus misses rule group evaluations due to exceeding the allowed evaluation time. | warning |
prometheus | PrometheusTargetLimitHit | Alerts when Prometheus drops targets because the number of targets exceeds a configured limit. | warning |
prometheus | PrometheusLabelLimitHit | Alerts if Prometheus drops targets due to exceeding configured limits on label counts or label lengths. | warning |
prometheus | PrometheusScrapeBodySizeLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum scrape body size. | warning |
prometheus | PrometheusScrapeSampleLimitHit | Alerts if Prometheus fails scrapes due to targets exceeding the configured maximum sample count. | warning |
prometheus | PrometheusTargetSyncFailure | Alerts when Prometheus is unable to synchronize targets successfully due to configuration errors. | critical |
prometheus | PrometheusHighQueryLoad | Alerts when the Prometheus query engine reaches close to full capacity, with less than 20% remaining. | warning |
prometheus | PrometheusErrorSendingAlertsToAnyAlertmanager | Alerts when there's a persistent error rate (> 3%) while sending alerts from Prometheus to any configured Alertmanager. | critical |
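HPE Ezmeral Unified Analytics Software issues the following application alerts: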
Airflow Alerts
Group Name | Title | Description |
---|---|---|
airflow.scheduler.healthy.rules | Airflow Scheduler Unresponsive | Airflow Scheduler is not responding to health checks. |
airflow.dag.import.rules | Airflow DAG Import Errors | Errors detected during import of DAGs from the Git repository. |
airflow.tasks.queued.rules | Airflow Tasks Queued and Not Running | Airflow tasks are queued and unable to be executed. |
airflow.tasks.starving.rules | Airflow Tasks Starving for Resources | Airflow tasks cannot be scheduled due to lack of available resources in the pool. |
airflow.dags.gitrepo.rules | Airflow DAG Git Repository Inaccessible | Airflow cannot access the Git repository containing DAGs. |
Kubeflow Alerts
Group Name | Title | Description |
---|---|---|
kubeflow.katib.rules | Kubeflow katib stuck | Indicates a potential issue with Katib where it is not starting new experiments or trials, or not completing trials successfully. Suggests restarting the Katib controller.
MLflow Alerts
Group Name | Title | Description |
---|---|---|
mlflow_http_request_total | High MLflow HTTP Request Rate without status 200 | Alerts if more than 5% of HTTP requests to the MLflow server over a 5-minute window fail (don't have a status code of 200). |
mlflow_http_request_duration_seconds_bucket | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 95th percentile of MLflow HTTP request durations exceeds 5 seconds within a 5-minute window, indicating potential slowdowns. |
mlflow_http_request_duration_seconds_sum | Total duration in seconds of all incoming HTTP requests | Alerts if the total time spent handling all MLflow HTTP requests exceeds 600 seconds over a 5-minute period, suggesting overload. |
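Because the table above names the underlying metrics, you can evaluate the latency condition yourself with an instant query. A sketch of the 95th-percentile check, assuming Python with `requests` and a placeholder Prometheus URL (the label grouping may need adjusting for your scrape configuration):

```python
# Sketch: 95th percentile of MLflow HTTP request duration over the last 5 minutes,
# using the mlflow_http_request_duration_seconds_bucket histogram named above.
# PROMETHEUS_URL is a placeholder.
import requests

PROMETHEUS_URL = "http://localhost:9090"
QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(mlflow_http_request_duration_seconds_bucket[5m])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

result = resp.json()["data"]["result"]
if result:
    p95 = float(result[0]["value"][1])
    verdict = "above" if p95 > 5 else "within"
    print(f"p95 MLflow request duration: {p95:.2f}s ({verdict} the 5s threshold)")
else:
    print("No MLflow request samples found in the last 5 minutes")
```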
Ray Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
ray.object.store.memory.high.pressure.alert | Ray: High Pressure on Object Store Memory | Alerts when 90% of Ray object store memory is used consistently for 5 minutes. | warning |
ray.node.memory.high.pressure.alert | Ray: High Memory Pressure on Ray Nodes | Alerts when a Ray node's memory usage exceeds 90% of its capacity for 5 minutes. | warning |
ray.node.cpu.utilization.high.pressure.alert | Ray: High CPU Pressure on Ray Nodes | Alerts when CPU utilization across Ray nodes exceeds 95% for 5 minutes. | warning |
ray.autoscaler.failed.node.creation.alert | Ray: Autoscaler Failed to Create Nodes | Alerts when the Ray autoscaler has failed attempts at creating new nodes for 5 minutes. | warning |
ray.scheduler.failed.worker.startup.alert | Ray: Scheduler Failed Worker Startup | Alerts when the Ray scheduler encounters failures during worker startup for 5 minutes. | warning |
ray.node.low.disk.space.alert | Ray: Low Disk Space on Nodes | Alerts when a Ray node has less than 10% of disk space free for 5 minutes. | warning |
ray.node.network.high.usage.alert | Ray: High Network Usage on Ray Nodes | Alerts when network usage (receive + transmit) on Ray nodes exceeds a threshold for 5 minutes, indicating potential congestion. | warning |
Spark Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
spark.app.high.failed.rule | Spark Operator: High Failed App Count | Alerts when the number of failed Spark applications handled by the operator surpasses a threshold within a 5-minute window. | warning |
spark.app.high.latency.rule | Spark Operator: High Average Latency for App Starting | Alerts when the average latency (time to start) for Spark applications exceeds 120 seconds for a 5-minute period. | warning |
spark.app.submission.high.failed.percentage.rule | Spark Operator: High Percentage of Failed Spark App Submissions | Alerts when the failure rate of Spark application submissions exceeds 10% of total submissions for 15 minutes. | warning |
spark.app.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Applications | Alerts when the success rate of Spark applications drops below 80% of total submissions for a 20-minute period. | warning |
spark.app.executor.low.success.rate.rule | Spark Operator: Low Success Rate of Spark Application Executors | Alerts when the success rate of Spark executors drops below 90% of total executors for a 20-minute period. | warning |
spark.workload.high.memory.pressure.rule | Spark Workload: High Memory Pressure | Alerts when overall memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning |
spark.workload.high.heap.memory.pressure.rule | Spark Workload: High On-Heap Memory Pressure | Alerts when on-heap memory pressure on Spark's BlockManager exceeds a critical threshold for 1 minute. | warning |
spark.workload.high.cpu.usage.rule | Spark Workload: High JVM CPU Usage | Alerts when JVM CPU usage within Spark exceeds a critical threshold for 5 minutes. | warning |
Superset Alerts
Group Name | Title | Description | Severity |
---|---|---|---|
superset.http.request.duration | A histogram representation of the duration of the incoming HTTP requests | Alerts when the 99th percentile of HTTP request duration exceeds 3 seconds for 5 minutes, indicating slow responses. | critical |
superset.http.request.total | Superset total number of HTTP requests without status 200 | Alerts if more than 5% of HTTP requests to Superset within a 5-minute window fail (don't have a status code of 200). | critical |
superset.http.request.exceptions.total | Total number of HTTP requests which resulted in an exception | Alerts when more than 10 HTTP requests to Superset result in exceptions within a 5-minute window. | critical |
superset.gc.objects.collected | Objects collected during gc | Alerts if more than 100 objects (generation 0) are collected during garbage collection within a 5-minute window. | warning |
superset.gc.objects.uncollectable | Uncollectable objects found during GC | Alerts if more than 50 uncollectable objects (generation 0) are found during garbage collection within a 5-minute window. | warning |
supercet.gc.collections | Number of times this generation was collected | Alerts if the youngest generation (0) of garbage collection has run more than 100 times within a 5-minute window, indicating potential memory pressure. | warning |
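If any of the alerts on this page fires during planned maintenance, it can be silenced for a fixed window through the Alertmanager v2 API instead of being routed to notification channels. A sketch, assuming Python with `requests`; the Alertmanager URL, alert name, and contact details are placeholders:

```python
# Sketch: silence a specific alert for two hours via the Alertmanager v2 API.
# ALERTMANAGER_URL, the alert name, and createdBy are placeholders.
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://localhost:9093"
now = datetime.now(timezone.utc)

silence = {
    "matchers": [
        {"name": "alertname", "value": "NodeCPUUsageHigh", "isRegex": False, "isEqual": True}
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),
    "createdBy": "ops@example.com",
    "comment": "Planned maintenance window",
}

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("Created silence:", resp.json()["silenceID"])
```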