HPE Ezmeral Data Fabric Issues

You can view the status of HPE Ezmeral Data Fabric services in the following locations:

  • Virtual clusters: Services tab of the Cluster Details screen, or the Services tab of the Training Cluster Details or Deployment Cluster Details screen, as appropriate.
  • Kubernetes virtual clusters: Services Status tab of the Kubernetes Cluster Details screen.

Checking Service Status

If the HPE Ezmeral Data Fabric (MapR) service does not appear in any Services tab, then it may not be running. You can determine the status of this service by executing the following commands:

  • Deployment Controller host: docker ps -a
  • Kubernetes Data Fabric Master node: kubectl get po -A (if the deployment includes a Kubernetes Data Fabric cluster)
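
On the Controller host, the docker ps -a listing can be narrowed to containers that are not running. The following is a minimal sketch, not a definitive procedure: the function name not_running is hypothetical, and the default name pattern "mapr" is an assumption about how the Data Fabric container is named in your deployment.

```shell
#!/bin/sh
# Print containers whose name matches a pattern and whose status is not "Up".
# Input (stdin): lines of "<name><TAB><status>", as produced by
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
# ASSUMPTION: the default pattern "mapr" is a guess; pass your own pattern.
not_running() {
    pattern="${1:-mapr}"
    awk -F '\t' -v pat="$pattern" \
        '$1 ~ pat && $2 !~ /^Up/ { print $1 ": " $2 }'
}

# Example usage (requires Docker on the Controller host):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | not_running mapr
```

An empty result means that every matching container reports an "Up" status. It does not prove that a matching container exists, so check the full docker ps -a listing if nothing is printed.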

Troubleshooting Errors

This article provides guidance in case any of the HPE Ezmeral Data Fabric services go into an ERROR state (red dot), or if you need to remove stale node IDs.

For each HPE Ezmeral Data Fabric service, the entries below give a description, followed by diagnostic steps and corrective actions.
Container Location Database (CLDB): Tracks critical metadata about every container in Data Fabric, cluster file servers, and node activity. Running the CLDB service on multiple nodes distributes lookup operations across those nodes for load balancing and also provides high availability.

Look at /opt/mapr/logs/cldb.log.

Restart CLDB services, as described here (link opens an external website in a new browser tab/window).

Warden: A lightweight Java application that runs on all the nodes in a cluster and coordinates cluster services. Warden's job on each node is to start, stop, or restart the appropriate services, and to allocate the correct amount of memory to them.

Get more context on the error by looking at the Warden logs located at /opt/mapr/logs/warden.log in the HPE Ezmeral Data Fabric container.

Refer to the troubleshooting steps here (link opens an external website in a new browser tab/window).

Consider restarting the ZooKeeper and Warden services, as described here (link opens an external website in a new browser tab/window).

Posix Clients: HPE Ezmeral Data Fabric POSIX clients allow Docker to read and write directly and securely on the filesystem exposed by HPE Ezmeral Data Fabric FUSE (Filesystem in Userspace).

Look at /opt/mapr/logs/posix-client-basic.log.

Turn on HPE Ezmeral Data Fabric tracing to collect more information, as described here (link opens an external website in a new browser tab/window).

AdminApp: The web application that allows users and administrators to control and configure an HPE Ezmeral Data Fabric cluster.

Look at /opt/mapr/apiserver/logs/apiserver.log.

The admin application is normally controlled by the Warden process, which should restart it if it fails. The primary repair action is to have Warden on the appropriate node restart this service.

Zookeeper: ZooKeeper is a coordination service for distributed applications. It provides a shared hierarchical namespace that is organized like a standard file system.

Look at /opt/mapr/zookeeper/zookeeper-3.4.11/logs/.

Fileserver: The mapr-fileserver service is the actual process that stores data on disks. This service needs to be running on every machine that is storing data. Running more file servers increases both failure tolerance and overall I/O bandwidth.

Look at /opt/mapr/logs/mfs.log*.

Warden tries three times to restart the service automatically. After an interval (30 minutes by default), Warden again tries three times to restart the service. The interval can be configured using the parameter services.retryinterval.time.sec in the warden.conf file. If Warden successfully restarts the File Server service, then the service should return to NORMAL (green) status. If Warden is unable to restart the File Server service, then you may need to contact HPE Technical Support.
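
The log checks and the retry-interval lookup described above can be sketched as two small helper functions. This is an illustration under stated assumptions rather than a supported tool: the function names are hypothetical, warden.conf is assumed to be a key=value file located at /opt/mapr/conf/warden.conf, and the log files are assumed to mark problem lines with the strings ERROR or FATAL.

```shell
#!/bin/sh
# Print the value of a key from a key=value properties file.
# Commented lines (for example "#key=value") do not match, because the
# text before the first "=" must equal the key exactly.
conf_value() {
    key="$1"; file="$2"
    awk -F '=' -v k="$key" \
        '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file"
}

# Print the last N lines (default 20) of a log file that contain
# ERROR or FATAL. ASSUMPTION: the log uses these level strings.
recent_errors() {
    file="$1"; count="${2:-20}"
    grep -E 'ERROR|FATAL' "$file" | tail -n "$count"
}

# Example usage inside the Data Fabric container (paths other than the
# log locations named in this article are assumptions):
#   conf_value services.retryinterval.time.sec /opt/mapr/conf/warden.conf
#   recent_errors /opt/mapr/logs/warden.log 50
```

If the interval is still at the 30-minute default, the first call would be expected to print 1800, assuming the value is stored in seconds as the parameter name suggests.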

Removing staleid node records from the MapR cluster

A staleid node record in the HPE Ezmeral Runtime Enterprise list of Data Fabric cluster nodes indicates that the host was removed from the deployment and then re-added within five minutes. This record will appear underneath a valid record for the same host. There is no problem with the cluster or deployment, and you can safely delete this record. To check for records in this state, execute the following command on the primary Controller host:

bdmapr maprcli node list -columns h,svc,id

Stale IDs will appear in the output as shown here:

host.enterprise.net    fileserver,mastgateway,hoststats,posixclientbasic    16.143.22.202    0    7640315902262614304
host.enterprise.net_staleid_4873757504540959328    16.143.22.202    4    4873757504540959328

Execute the following command to delete the staleid using the full hostname:

/usr/lib/python2.7/site-packages/bluedata/mapr/bds-mapr-config.py removeNode --host-name <hostname>_staleid_4873757504540959328

Do not delete the corresponding valid host entry, which does not have the staleid reference.
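
When several stale records exist, the check and the delete step above can be combined into a dry run that prints one removeNode command per stale hostname for review. This is a minimal sketch: the function names stale_hosts and print_remove_commands are hypothetical, and the filter assumes only that stale hostnames contain _staleid_ followed by digits, as in the output shown above. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Read `maprcli node list` output on stdin and print each stale
# hostname (anything ending in _staleid_<digits>) once.
stale_hosts() {
    grep -oE '[^[:space:]]+_staleid_[0-9]+' | sort -u
}

# Read stale hostnames on stdin and PRINT (not run) the removeNode
# command for each, so the list can be reviewed before anything is deleted.
print_remove_commands() {
    while IFS= read -r host; do
        printf '%s removeNode --host-name %s\n' \
            "/usr/lib/python2.7/site-packages/bluedata/mapr/bds-mapr-config.py" \
            "$host"
    done
}

# Example usage on the primary Controller host:
#   bdmapr maprcli node list -columns h,svc,id | stale_hosts | print_remove_commands
```

Because the filter matches only names that contain _staleid_, the valid host entry is never included in the printed commands.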