Troubleshooting Services

The Services tabs of the Platform Administrator and Kubernetes Administrator Dashboard screens display the status of each service. The Platform Administrator Dashboard screen also includes general HPE Ezmeral Runtime Enterprise services, such as monitoring.

See Dashboard - Kubernetes Administrator and Dashboard - Platform Administrator.

A service that is in one of the following degraded states may require troubleshooting, corrective action, or both:

Warning: Yellow
Critical: Red

Audit

This service audits all user access to the platform interface, and specifically CRUD operations on clusters, but does not audit requests sent to specific Kubernetes clusters. This service runs on the Controller only.

To troubleshoot this service, view the log file on the Controller host at:

/var/log/bluedata/bds-audit.log

This log file provides a comprehensive history of all interface-level user actions and is a subset of bds-mgmt.log. Contact Hewlett Packard Enterprise Support if you require assistance to resolve an issue with this service.

Caching Node

This service is a critical component for running Big Data jobs against the tenant storage, external DataTaps, or both. I/O pressure, memory issues, or incompatibility with a remote DataTap can cause issues.

In Kubernetes deployments of HPE Ezmeral Runtime Enterprise, the caching node (cnode) runs as a sidecar container that is always named dtap. When troubleshooting the caching node, you can use the standard kubectl logs commands. For example, to output the caching node log of mypod in mynamespace, enter the following command:

kubectl logs -f -n mynamespace mypod -c dtap

If this service continues to restart, or if it remains in a critical state, then contact Hewlett Packard Enterprise Support.
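When the dtap container keeps restarting, the log of the current instance may not show why the previous instance exited. The following is a minimal sketch, assuming the standard kubectl --previous flag and jsonpath output and assuming that dtap appears in the pod's container statuses. The function names and the KUBECTL override variable are illustrative only and are not part of the product:

```shell
# KUBECTL may point at a different kubectl binary or wrapper; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Print the log of the previous (terminated) dtap container instance.
# Usage: dtap_previous_logs <namespace> <pod>
dtap_previous_logs() {
  "$KUBECTL" logs --previous -n "$1" "$2" -c dtap
}

# Print how many times the dtap container in the pod has restarted.
# Usage: dtap_restart_count <namespace> <pod>
dtap_restart_count() {
  "$KUBECTL" get pod "$2" -n "$1" \
    -o 'jsonpath={.status.containerStatuses[?(@.name=="dtap")].restartCount}'
}
```

For example, dtap_restart_count mynamespace mypod followed by dtap_previous_logs mynamespace mypod gives a quick picture to include when contacting support.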

HA Engine

This service runs the HA process for the platform. If the status of this service is Critical, then contact Hewlett Packard Enterprise Support.

HA Engine logs are stored in /var/log/bluedata/pl_ha/ and /var/log/pacemaker.
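Before contacting support, it can help to see which HA Engine log files changed recently. The following is a minimal sketch that uses find with -mmin (available in GNU and BSD find). The function name and the 60-minute default are assumptions chosen for illustration:

```shell
# List log files under the given locations that changed in the last N minutes.
# Usage: recent_ha_logs [minutes] [path ...]
recent_ha_logs() {
  minutes="${1:-60}"
  if [ "$#" -gt 0 ]; then
    shift
  fi
  # Default to the HA Engine log locations named above.
  if [ "$#" -eq 0 ]; then
    set -- /var/log/bluedata/pl_ha /var/log/pacemaker
  fi
  find "$@" -type f -mmin "-$minutes" 2>/dev/null
}
```

For example, recent_ha_logs 30 lists the files in both default locations that were modified in the last 30 minutes.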

HA Proxy

This service runs on the Gateway hosts in the platform and is managed by the platform. If this service becomes Critical (red dot), then collect /var/log/bluedata/bds-mgmt.log and /var/log/messages on the affected Gateway host, and then contact Hewlett Packard Enterprise Support.
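One way to collect the two files is to bundle them into a single compressed archive. The following is a minimal sketch using standard tar. The function name and the archive name are hypothetical, and reading files under /var/log typically requires root privileges:

```shell
# Bundle log files into a gzip-compressed tar archive.
# Usage: collect_gateway_logs <archive> [file ...]
collect_gateway_logs() {
  archive="$1"
  shift
  # Default to the two files named above when no paths are given.
  if [ "$#" -eq 0 ]; then
    set -- /var/log/bluedata/bds-mgmt.log /var/log/messages
  fi
  tar -czf "$archive" "$@"
}
```

For example, collect_gateway_logs /tmp/gateway-logs.tar.gz run on the affected Gateway host produces one file to attach to the support case.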

Management

This service is a key component that manages the overall system, including:

Physical hosts
Job submission
UI and RESTful APIs

If this service is in a degraded state, then the web interface will not be accessible. In that case, you can access the Nagios interface directly by navigating to:

http://<controller-ip-address>:8085/nagios

The management service can fail for a variety of reasons, including:

Low availability of resources on the Controller host.
Disk failure on the root volume of the Controller host.

To restart this service, execute the following commands:

stop bds-controller
start bds-controller

If this service still fails, then contact Hewlett Packard Enterprise Support.

The /var/log/bluedata/bds-mgmt.log file contains detailed interface-based operations, including:

CRUD of various objects such as tenants, DataTaps, clusters, and flavors.
Errors related to cluster creation failures and network connectivity issues between containers.
Other related items.
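To narrow a large management log down to likely problems, you can filter it with grep. The following is a minimal sketch, assuming that problem entries contain the words "error" or "fail". The actual log format is not described here, so treat the pattern and the function name as illustrative:

```shell
# Print the last N lines of a log that mention "error" or "fail"
# (case-insensitive), each prefixed with its line number in the file.
# Usage: mgmt_log_problems [log_file] [max_lines]
mgmt_log_problems() {
  log_file="${1:-/var/log/bluedata/bds-mgmt.log}"
  max_lines="${2:-20}"
  grep -n -i -E 'error|fail' "$log_file" | tail -n "$max_lines"
}
```

Running mgmt_log_problems with no arguments on the Controller host shows up to 20 of the most recent matching lines.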

Restarting Services

After restarting the monitoring container, services might fail to start.

To restart the management service, see Management.

To restart gateway services, see Restarting Gateway Services.

To restart a service manually, on the Controller host, execute the following command:

docker exec <id-of-container-running-the-monitoring-image> service metricbeat restart
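The container ID in the command above must be looked up first. The following is a minimal sketch using the standard docker ps --format option. The name of the monitoring image is not given here, so the listing only prints each running container's ID, image, and name for you to pick from. The function names and the DOCKER override variable are illustrative:

```shell
# DOCKER may point at a different docker binary or wrapper; defaults to docker.
DOCKER="${DOCKER:-docker}"

# List running containers as "<id> <image> <name>" so that the container
# running the monitoring image can be identified.
list_running_containers() {
  "$DOCKER" ps --format '{{.ID}} {{.Image}} {{.Names}}'
}

# Restart metricbeat inside the container with the given ID.
# Usage: restart_metricbeat <container-id>
restart_metricbeat() {
  "$DOCKER" exec "$1" service metricbeat restart
}
```

Run list_running_containers on the Controller host, note the ID of the container running the monitoring image, and pass that ID to restart_metricbeat.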