Debugging and Troubleshooting

NOTE
In this article, the term tenant refers to HPE Ezmeral Data Fabric tenants (formerly "MapR tenants") and not to Kubernetes tenants unless explicitly noted otherwise.

This article contains information to help you debug your environment if you run into errors or warnings during or after bootstrapping and applying CRs.

See also the external Troubleshooting Guide for Kubernetes Clusters.

Verifying Bootstrapping

  • Execute the following command to verify that the Data Fabric, tenant, and Spark operators are active by listing the namespaces:

    kubectl get ns

    The output should include the hpe-system and spark-operator namespaces in the Active state.

  • Execute the following command to verify that the Data Fabric and tenant operator pods are ready:

    kubectl get pods -n hpe-system
  • If any pod is not up and ready, then execute either of the following commands to check the pod State and Events information or the container logs for debugging:

    kubectl describe pod <pod-name> -n hpe-system
    kubectl logs <pod-name> -n hpe-system
  • If the deployment created multiple pods, then run kubectl describe against the deployments, replica sets (rs), and pods to look for any errors, as shown in the sketch below.
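
    A minimal sketch of that check, assuming the operators were installed into the hpe-system namespace shown above (the resource names are placeholders):

    kubectl get deployments,rs -n hpe-system
    kubectl describe deployment <deployment-name> -n hpe-system
    kubectl describe rs <replicaset-name> -n hpe-system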

Verifying Data Fabric CR Deployment

To verify that the pods in the Data Fabric cluster are up and running, complete the following steps:

  1. List the pods in the Data Fabric cluster namespace to determine whether pods are up and running by executing the following command:

    kubectl get pods -n <Data-Fabric-cluster-namespace>
  2. If any pod is not up and ready, then check the State and Events metrics for debugging by executing either of the following commands:

    kubectl describe pod <pod-name> -n <Data-Fabric-cluster-namespace>
    kubectl logs <pod-name> -n <Data-Fabric-cluster-namespace>

    Wait until all pods show as Running, Ready, or Completed.

  3. Confirm that the Data Fabric CR is working:
    1. Exec into the cldb and mfs pods.
    2. For example, execute the following command to open a shell in the cldb-0 pod:
      kubectl exec -it -n <cluster-name> cldb-0 -- /bin/bash
    3. Execute the following command to log in as the mapr user:
      su - mapr
    4. Generate a mapr ticket using the mapr user credentials for the Data Fabric cluster by executing the following command (see the optional check after these steps):

      maprlogin password
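
      After the ticket is generated, you can optionally confirm from inside the pod that the cluster reports its nodes and services. This is a sketch only, assuming the maprcli client is available in the CLDB container:

      maprcli node list -columns svc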

Verifying Tenant CR Deployment

List the pods in your tenant namespace by running the following command:

kubectl get pods -n <tenant-name>
Verify that all pods listed in the CR are Ready and Running with the expected number of instances. If any pod is not up and ready, run either of the following commands to check the State and Events information for debugging (see also the event check below):

kubectl describe pod <pod-name> -n <tenant-name>
kubectl logs <pod-name> -n <tenant-name>
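
You can also scan recent events in the tenant namespace for scheduling or mount problems. A minimal sketch; the --sort-by flag only orders the events by timestamp:

kubectl get events -n <tenant-name> --sort-by=.lastTimestamp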

Getting the IP addresses of MFS pods

The following command returns the IP addresses (internal and external) of the MFS pods:

maprcli node list -columns h
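
Alternatively, you can read the pod IP addresses directly from Kubernetes. A sketch, assuming the MFS pods run in your Data Fabric cluster namespace:

kubectl get pods -n <Data-Fabric-cluster-namespace> -o wide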

Applying RBAC Changes

The Tenant Operator supports RBAC authorization for any users or groups listed in the Tenant CR. To apply any changes to the RBAC settings in the CR, you must delete the deployment by running the kubectl delete -f <cr-tenant-xyz.yaml> command and then recreate it by running the kubectl apply -f <cr-tenant-xyz.yaml> command, as shown below.
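
For example, assuming the edited tenant CR is saved as cr-tenant-xyz.yaml (the file name used above):

kubectl delete -f cr-tenant-xyz.yaml
kubectl apply -f cr-tenant-xyz.yaml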

Troubleshooting

This section contains troubleshooting tips for the following issues:

FailedMount Warning

You may see the following warning when you run the describe command for pods:

Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
 for volume "client-secrets" : secret "mapr-client-secrets" not found
 Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
 for volume "server-secrets" : secret "mapr-server-secrets" not found
 Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
 for volume "ssh-secrets" : secret "mapr-ssh-secrets" not found

This is normal and expected. Pods cannot mount the secrets until the init job has run. If these warnings persist until the pods time out, then it is likely that resource constraints prevented the init job from being scheduled; you can check the job as shown below.
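
A quick way to do that is to list the jobs and their pods in the Data Fabric cluster namespace. This is a sketch only; the exact job and pod names depend on your release:

kubectl get jobs -n <Data-Fabric-cluster-namespace>
kubectl get pods -n <Data-Fabric-cluster-namespace>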

You may see the following warning when deploying the CR:

Warning FailedMount 51m (x7 over 51m) kubelet, aks-agentpool-34842125-0
 MountVolume.SetUp failed for volume "client-secrets" : secrets
 "mapr-client-secrets" not found
 Warning FailedMount 51m (x7 over 51m) kubelet, aks-agentpool-34842125-0
 MountVolume.SetUp failed for volume "server-secrets" : secrets
 "mapr-server-secrets" not found

You can ignore event messages like this because they do not prevent pods from launching.

FailedScheduling Warning

You may see the following warning when you execute the describe command for the pod:

Events:
Type    Reason            Age                From              Message
----    ------            ---                ----              -------
Warning FailedScheduling 30m (x22 over 31m) default-scheduler 0/5 nodes are available: 5
node(s) didn't have free ports for the requested pod ports.

This warning indicates a mismatch between the number of nodes in the cluster and the number of CLDB and Data Fabric filesystem instances. CLDB and filesystem pods require the same host ports and therefore cannot be deployed on the same node. The default installation requires three (3) CLDB nodes and two (2) Data Fabric filesystem nodes, for a total of five nodes. This problem can also occur when all five nodes are present but one or more of them is too small for the scheduler to place a CLDB or Data Fabric filesystem pod on it.
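
To compare the available nodes with the instances that must be scheduled, list the nodes and inspect the capacity and allocated resources of any node that looks too small. A sketch using standard kubectl commands:

kubectl get nodes
kubectl describe node <node-name>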

Objectstore Pod Not Ready

The objectstore pod may not be ready even after a long time. For example:

Warning FailedMount 5m57s (x65 over 151m) kubelet, atsqa8c145.qa.lab Unable
 to mount volumes for pod
 "objectstore-0_mycluster(23af1481-41e2-11e9-b693-40167e367edb)": timeout
 expired waiting for volumes to attach or mount for pod
 "mycluster"/"objectstore-0". list of unmounted
volumes=[objectstore-csi-volume]. list of unattached volumes=[cluster-cm status-cm replace-cm logs cores podinfo ldap-cm sssd-secrets ssh-secrets
 client-secrets server-secrets objectstore-csi-volume
 mapr-mycluster-cluster-token-f4krn]

If the objectstore pod remains stuck in the init state for more than 10 minutes, then manually delete the pod by executing the following command; the pod is then relaunched automatically:

kubectl delete pod -n <namespace> objectstore-0

The command output confirms the deletion:

pod "objectstore-0" deleted
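
To confirm that the relaunched pod comes up, you can watch the namespace until objectstore-0 shows Running and Ready. A minimal sketch:

kubectl get pods -n <namespace> -w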

CrashLoopBackOff or RunContainerError

  1. Execute the following command to get the pods in the kube-system namespace:

    kubectl get pod -n kube-system
  2. Check the pod Status and Events metrics by executing the following command:

    kubectl describe pod <pod-name> -n kube-system

    For example, if the pod named kube-flannel-ds-amd64-v7qdt failed, then execute the following command:

    kubectl describe pod kube-flannel-ds-amd64-v7qdt -n kube-system

If the pod is not ready because of Events errors or warnings, then recreate the cluster.
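
Before recreating the cluster, it can help to capture the logs of the previously crashed container for later analysis. A sketch using the standard kubectl flag for previous-container logs:

kubectl logs <pod-name> -n kube-system --previous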

CLDB Running but Other Pods Waiting

After you apply <cr-cluster-full.yaml>, the CLDB pod may show the Running state while other pods fail to initialize because their init containers (for example, the objectstore init container) are still waiting for CLDB to become available. Check the CLDB logs. If you see the message Setting up disk failed, complete the following steps:

  1. Execute the following command:

    kubectl get pods -n <cluster-name>

    You may notice that objectstore-0 has failed to initialize:

    objectstore-0 0/1 Init:0/1 0 1h
  2. Execute the following command to get pod information:

    kubectl describe pod objectstore-0 -n <cluster-name>
    You might see something similar to:
    Status: Pending 
    IP: 
    Controlled By: StatefulSet/objectstore 
    Init Containers: cldb-available: 
    Container ID: 
    Image: busybox 
    Image ID:
    Port: 
    Host Port: 
    Command: 
    sh -c avail='UNAVAILABLE'; 
    while [ $avail -ne 'AVAILABLE' ];
    do 
    echo waiting for CLDB; 
    sleep 10; 
    avail=`cat /opt/mapr/kubernetes/status-cm/CLDB_STATUS`;
    done; 
    State: Waiting
    Reason: PodInitializing
  3. Get CLDB pod logs by executing the following command:

    kubectl logs cldb-0 -n <cluster-name>
    You may see something similar to:
    2019/03/05 21:26:33 common.sh: [INFO] Setting up disk with:
     /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb
     failed. Error 16, Device or resource busy. Disk is used by some other
     module/process.
    2019/03/05 21:26:37 common.sh: [WARNING]
     /opt/mapr/server/disksetup failed with error code 1... Retrying in 10
     seconds
    2019/03/05 21:26:47 common.sh: [INFO] Setting up disk with:
     /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb
     failed. Error 16, Device or resource busy. Disk is used by some other
     module/process.
    2019/03/05 21:26:52 common.sh: [WARNING]
     /opt/mapr/server/disksetup failed with error code 1... Retrying in 10
     seconds
    2019/03/05 21:27:02 common.sh: [INFO] Setting up disk with:
     /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb
     failed. Error 16, Device or resource busy. Disk is used by some other
     module/process.
  4. SSH to the cluster nodes.
  5. Execute the following command to list the disks on the node:

    fdisk -l
  6. Identify a disk that is unused and change the disk values in the simpleDeploymentDisks list in the CR (a sketch of this fragment follows these steps). Then delete the CR, verify that the cluster namespace is gone, and reapply the CR. For example:
    kubectl delete -f cr-cluster-full.yaml
    kubectl get ns
    kubectl apply -f cr-cluster-full.yaml
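
    A minimal sketch of the kind of CR change involved; the exact location of simpleDeploymentDisks within your CR depends on your CR file, and the device names below are placeholders for the unused disks you identified:

    # illustrative fragment of cr-cluster-full.yaml only
    simpleDeploymentDisks:
      - /dev/sdc
      - /dev/sdd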