Basic Troubleshooting

PROBLEM/IMPACT

PREPARATION/BEST PRACTICE

RECOVERY PROCEDURE

Remote DataTap HDFS storage failure

Follow all applicable best practices and other guidelines from your storage vendor to mitigate any storage failure.

Obtain any applicable recovery procedures from your storage vendor.

When certain HDFS errors occur, the Caching Node (cnode) service automatically retries the operation before propagating the error to the application/interface.
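The retry-then-propagate behavior described above can be illustrated with a minimal sketch. This is not the cnode implementation: the `TransientStorageError` class, the `call_with_retries` helper, the attempt count, and the backoff delays are all hypothetical and chosen only for illustration.

```python
import time


class TransientStorageError(Exception):
    """Hypothetical stand-in for an HDFS error that is worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); retry on TransientStorageError, then re-raise.

    The delay doubles after each failed attempt (exponential backoff).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientStorageError:
            if attempt == max_attempts:
                raise  # retries exhausted: propagate the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

An application that wraps its own storage calls this way sees an error only after every attempt has failed, which mirrors the behavior the service is described as providing.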

Controller host failure

When the Controller host fails:

If platform HA is enabled, then any pending virtual cluster and/or tenant creation will resume when the Shadow Controller takes over.
Virtual clusters that are already running will continue to run, and users can continue interacting with them normally, either directly (routable container network) or via the Gateway hosts (non-routable container network).
Running jobs may be interrupted, and the affected users may need to restart the affected jobs.

HPE recommends storing any critical system files (such as custom keytab files and TLS certificates) on a shared file server that is mounted on both the Primary and Shadow Controller hosts.

The Arbiter host will detect the primary Controller host failure and begin a failover transition to the Shadow Controller host. The deployment will then be running in a degraded state until the Shadow Controller becomes the primary Controller.

The interface will be in Lockdown mode during the transition, and no administration tasks will be possible during this period. Users may need to restart any running jobs that were interrupted as a result of the failure/transition.

Shadow Controller host failure

If the Shadow Controller host fails or crashes, then the primary Controller host will continue operating; however, the platform will be running in a degraded state and will not be protected against any failure of the primary Controller host. The interface displays a warning message when this occurs.

See Controller host failure, above.

HPE Ezmeral Runtime Enterprise analyzes the cause of the host failure and attempts to recover the failed host automatically. If recovery is possible, then the failed host will come back up, and HPE Ezmeral Runtime Enterprise will resume normal operation.

If the problem cannot be resolved, then the affected host will be left in a degraded state. You will need to manually diagnose and (if possible) repair the problem, and then reboot that host. If rebooting solves the problem, then the failed host will come back up, and HPE Ezmeral Runtime Enterprise will resume normal operation with High Availability protection enabled. HPE Ezmeral Runtime Enterprise does not currently support designating another Worker host as the new Shadow Controller.

Please contact HPE Technical Support for assistance if you are unable to resolve the issue.

Arbiter host failure

If the Arbiter host fails or crashes, then the Controller and Shadow Controller hosts will continue operating; however, the platform will be running in a degraded state and will not be protected against any failure of the Controller or Shadow Controller host. The interface displays a warning message when this occurs.

See Shadow Controller host failure, above.

Gateway host failure

The Gateway host may fail or crash while one or more users are connected to virtual clusters.

HPE highly recommends setting up multiple Gateway hosts to provide both High Availability and load balancing.

IT can either diagnose and repair the failed Gateway host, or provision a new Gateway host.

If the deployment has two or more Gateway hosts, then sessions connected through the failed host will be moved to the available hosts. In-flight TCP connections might need to be reset as they are moved to the backup host.

The deployment includes a load-balancer in front of the Gateway hosts, and users should therefore experience no performance impacts.

Expired TLS certificates or Keytab files

The underlying KDC keytab files that the virtual cluster uses to access HDFS may have expired. When this happens, Hadoop services cannot run because they cannot access the underlying HDFS.

Ensure that the expiration dates of all applicable TLS certificates and/or keytab files are sufficient for the lifespan of the cluster.
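As a rough illustration of the certificate part of this check, the sketch below compares a certificate's expiration timestamp with a required lifespan. It uses only the Python standard library; the helper names `days_until_expiry` and `needs_renewal` and the `required_days` parameter are hypothetical, and the sketch does not cover keytab files.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days left before a certificate's notAfter timestamp.

    not_after uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)


def needs_renewal(not_after, required_days, now=None):
    """True when the certificate expires before the required lifespan ends."""
    return days_until_expiry(not_after, now) < required_days
```

For example, `needs_renewal(not_after, required_days=365)` flags a certificate that would expire within a cluster lifespan assumed to be one year.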

HPE Ezmeral Runtime Enterprise can experience adverse operational impact caused by changes in system files or system settings.

Changes to various system settings may cause unpredictable behavior in HPE Ezmeral Runtime Enterprise. Some examples include:

/etc/sysconfig/iptables
/etc/sysconfig/network
SELinux context
/etc/rsyslog.d/bds
umask settings
ipforward settings
RPM package deletion/changes.
- Do not manually install Network Manager
- RHEL subscription becomes inactive
Do not delete or alter any service user accounts.

Coordinate with your system administration teams to ensure that Chef/Puppet or other configuration management systems do not modify these settings/files on the HPE Ezmeral Runtime Enterprise hosts.
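One possible way to notice when another tool has modified these files is to compare content digests against a known-good baseline. The sketch below is a hypothetical helper, not a feature of HPE Ezmeral Runtime Enterprise, and it detects changes to file contents only (not SELinux context, umask, or package changes).

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 of its contents, or None if it is missing."""
    result = {}
    for path in paths:
        p = Path(path)
        result[str(path)] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        )
    return result


def drifted(baseline, current):
    """Return the sorted paths whose digest differs from the baseline."""
    return sorted(
        path for path, digest in baseline.items() if current.get(path) != digest
    )
```

A baseline could be taken with `snapshot(["/etc/sysconfig/iptables", "/etc/sysconfig/network"])` right after installation, stored, and later passed to `drifted` together with a fresh snapshot of the same paths.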

Either on a regular basis (e.g. weekly) or whenever there is a significant configuration change in the environment (e.g. OS patch update, network configuration change), perform the HPE Ezmeral Runtime Enterprise configuration checks described in Config Checks Tab, and pay attention to any problems or warnings reported by these checks.

Contact HPE Technical Support for assistance.