Debugging and Troubleshooting
This article contains the following information to help you debug your environment if you run into errors or warnings during or after bootstrapping and applying CRs:
- Verifying Bootstrapping
- Verifying Data Fabric CR Deployment
- Verifying Tenant CR Deployment
- Getting the IP addresses of MFS pods
- Applying RBAC Changes
- Troubleshooting
See also the Troubleshooting Guide for Kubernetes Clusters.
Verifying Bootstrapping
- Execute the following command to list namespaces and verify that the Data Fabric, tenant, and Spark operators are active:
  kubectl get ns
  The output should show the hpe-system and spark-operator namespaces as Active.
- Execute the following command to verify that the Data Fabric and tenant operator pods are ready:
  kubectl get pods -n hpe-system
- If any pod is not up and ready, execute either of the following commands to check its State and Events for debugging:
  kubectl describe pod <pod-name> -n hpe-system
  kubectl logs <pod-name> -n hpe-system
- If the deployment created multiple pods, use kubectl describe on the deployments, ReplicaSets (rs), and pods to look for any errors. A combined check is sketched after this list.
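As a combined check of the steps above, the following lists the operator namespaces and then any pods in hpe-system that are not in the Running phase (the grep pattern is only a convenience; pods from completed jobs, if any, can be ignored):
  kubectl get ns | grep -E 'hpe-system|spark-operator'
  kubectl get pods -n hpe-system --field-selector=status.phase!=Running
Any pods listed by the second command are candidates for the kubectl describe and kubectl logs checks above.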
Verifying Data Fabric CR Deployment
- List the pods in the Data Fabric cluster namespace to determine whether the pods are up and running:
  kubectl get pods -n <Data-Fabric-cluster-namespace>
  Wait until all pods show as Running, Ready, or Completed.
- If any pod is not up and ready, check its State and Events for debugging by executing either of the following commands:
  kubectl describe pod <pod-name> -n <Data-Fabric-cluster-namespace>
  kubectl logs <pod-name> -n <Data-Fabric-cluster-namespace>
- Confirm that the Data Fabric CR is working:
  - Exec into the cldb and mfs pods. For example, execute the following command to open a shell in the cldb-0 pod:
    kubectl exec -it -n <cluster-name> cldb-0 /bin/bash
  - Execute the following command to log in as the mapr user:
    su - mapr
  - Generate a mapr ticket using the mapr user credentials for the Data Fabric cluster namespace by executing the following command:
    maprlogin password
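The steps above are interactive. A condensed sketch of the same sequence, assuming the cldb-0 pod name and mapr user shown above (maprlogin password still prompts for the mapr user's password):
  kubectl exec -it -n <cluster-name> cldb-0 -- su - mapr
  # inside the pod, as the mapr user:
  maprlogin password
  maprcli node list -columns h    # optional: confirm the cluster responds; this command is also used later in this article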
Verifying Tenant CR Deployment
List the pods in your tenant namespace by running the following command:
kubectl get pods -n <tenant-name>
If any pod is not up and ready, check its State and Events by executing either of the following commands:
kubectl describe pod <pod-name> -n <tenant-name>
kubectl logs <pod-name> -n <tenant-name>
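When several tenant pods are affected, it can also help to review recent events for the whole namespace in one listing; --sort-by uses standard kubectl JSONPath syntax:
  kubectl get events -n <tenant-name> --sort-by=.lastTimestamp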
Getting the IP addresses of MFS pods
The following command returns the IP addresses (internal and external) of the MFS pods:
maprcli node list -columns h
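If you only need the pod IP addresses as Kubernetes sees them, a kubectl alternative is the wide pod listing; this assumes the MFS pods follow the mfs naming used earlier in this article:
  kubectl get pods -n <Data-Fabric-cluster-namespace> -o wide | grep mfs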
Applying RBAC Changes
The Tenant Operator supports RBAC authorization for any users or groups listed in
the Tenant CR. To apply any change to the RBAC settings through the CR, you must
delete the deployment by running the kubectl delete -f <cr-tenant-xyz.yaml>
command and then recreate it by running the kubectl apply -f <cr-tenant-xyz.yaml>
command.
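A minimal sketch of that sequence, assuming the tenant CR file is named cr-tenant-xyz.yaml and that you wait for the old tenant pods to terminate before reapplying:
  kubectl delete -f cr-tenant-xyz.yaml
  kubectl get pods -n <tenant-name>    # repeat until the old tenant pods are gone
  kubectl apply -f cr-tenant-xyz.yaml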
Troubleshooting
This section contains troubleshooting tips for the following issues:
- FailedMount Warning
- FailedScheduling Warning
- Objectstore Pod Not Ready
- CrashLoopBackOff or RunContainerError
- CLDB Running but Other Pods Waiting
FailedMount Warning
You may see the following warning when you run the describe command for pods:
Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
for volume "client-secrets" : secret "mapr-client-secrets" not found
Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
for volume "server-secrets" : secret "mapr-server-secrets" not found
Warning FailedMount 8m20s (x6 over 8m35s) kubelet, worker2 MountVolume.SetUp failed
for volume "ssh-secrets" : secret "mapr-ssh-secrets" not found
This is normal and expected. Pods cannot mount the secrets until the init
job has run. If these warnings lead to timeouts, it is likely that resource
constraints prevented the init job from being scheduled.
You may see the following warning when deploying the CR:
Warning FailedMount 51m (x7 over 51m) kubelet, aks-agentpool-34842125-0
MountVolume.SetUp failed for volume "client-secrets" : secrets
"mapr-client-secrets" not found
Warning FailedMount 51m (x7 over 51m) kubelet, aks-agentpool-34842125-0
MountVolume.SetUp failed for volume "server-secrets" : secrets
"mapr-server-secrets" not found
You can ignore event messages like this because they do not prevent pods from launching.
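Once the init job has run, the secrets named in these warnings should exist. A quick way to confirm this, assuming the Data Fabric cluster namespace placeholder used earlier in this article:
  kubectl get secrets -n <Data-Fabric-cluster-namespace> | grep mapr-
  kubectl get jobs -n <Data-Fabric-cluster-namespace>    # check whether the init job was scheduled and completed, if your deployment runs it as a Kubernetes Job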
FailedScheduling Warning
You may see the following warning when you execute the describe
command for the pod:
Events:
Type     Reason            Age                 From               Message
----     ------            ----                ----               -------
Warning  FailedScheduling  30m (x22 over 31m)  default-scheduler  0/5 nodes are available: 5
node(s) didn't have free ports for the requested pod ports.
This warning indicates a mismatch between the number of nodes in the cluster and the number of CLDB and Data Fabric filesystem instances. Both require the same host ports and therefore cannot be deployed on the same node. The default installation requires three (3) CLDB nodes and two (2) Data Fabric filesystem nodes, for a total of five nodes. The problem can also occur when all five nodes are present but one or more of them is too small for the scheduler to place a CLDB or Data Fabric filesystem pod on it.
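To see how many schedulable nodes the cluster actually has, and whether a given node has enough allocatable CPU and memory for a CLDB or filesystem pod, standard kubectl commands are enough (<node-name> is a placeholder):
  kubectl get nodes
  kubectl describe node <node-name> | grep -A 5 Allocatable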
Objectstore Pod Not Ready
The objectstore pod may not be ready even after a long time. For example:
Warning FailedMount 5m57s (x65 over 151m) kubelet, atsqa8c145.qa.lab Unable
to mount volumes for pod
"objectstore-0_mycluster(23af1481-41e2-11e9-b693-40167e367edb)": timeout
expired waiting for volumes to attach or mount for pod
"mycluster"/"objectstore-0". list of unmounted
volumes=[objectstore-csi-volume]. list of unattached volumes=[cluster-cm status-cm replace-cm logs cores podinfo ldap-cm sssd-secrets ssh-secrets
client-secrets server-secrets objectstore-csi-volume
mapr-mycluster-cluster-token-f4krn]
If the objectstore pod remains stuck in the init state for more than 10 minutes,
manually delete the pod so that it is relaunched, by executing the following command:
kubectl delete pod -n <namespace> objectstore-0
The output confirms the deletion:
pod "objectstore-0" deleted
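Because objectstore-0 is controlled by a StatefulSet, the deleted pod is recreated automatically. To watch the replacement pod come back up, a simple filtered watch is enough:
  kubectl get pods -n <namespace> -w | grep objectstore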
CrashLoopBackOff or RunContainerError
- Execute the following command to get the pods in the kube-system namespace:
  kubectl get pod -n kube-system
- Check the pod Status and Events by executing the following command:
  kubectl describe pod <pod-name> -n kube-system
  For example, if the pod named kube-flannel-ds-amd64-v7qdt failed, execute the following command:
  kubectl describe pod kube-flannel-ds-amd64-v7qdt -n kube-system
- If the pod is not ready because of errors or warnings in its Events, recreate the cluster. (A note on collecting the failed container's logs first follows this list.)
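Before recreating the cluster, it can be worth capturing the logs of the crashed container instance; --previous is a standard kubectl logs flag that shows the output of the container's previous run:
  kubectl logs <pod-name> -n kube-system --previous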
CLDB Running but Other Pods Waiting
The CLDB may be in the Running state but other pods are failing to initialize,
waiting for CLDB, after applying <cr-cluster-full.yaml>
. This
occurs because the ObjectStore init
container is unavailable and
the pod fails to initialize, waiting for CLDB. Check the CLDB logs. If you see the
message Setting up disk failed
:
- Execute the following command:
  kubectl get pods -n <cluster-name>
  You may notice that objectstore-0 has failed to initialize:
  objectstore-0   0/1   Init:0/1   0   1h
- Execute the following command to get pod information:
  kubectl describe pod objectstore-0 -n <cluster-name>
  You might see something similar to the following:
  Status: Pending
  IP:
  Controlled By: StatefulSet/objectstore
  Init Containers:
    cldb-available:
      Container ID:
      Image: busybox
      Image ID:
      Port:
      Host Port:
      Command:
        sh -c avail='UNAVAILABLE'; while [ $avail -ne 'AVAILABLE' ]; do echo waiting for CLDB; sleep 10; avail=`cat /opt/mapr/kubernetes/status-cm/CLDB_STATUS`; done;
      State: Waiting
      Reason: PodInitializing
- Get the CLDB pod logs by executing the following command:
  kubectl logs cldb-0 -n <cluster-name>
  You may see something similar to the following:
  2019/03/05 21:26:33 common.sh: [INFO] Setting up disk with: /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb failed. Error 16, Device or resource busy. Disk is used by some other module/process.
  2019/03/05 21:26:37 common.sh: [WARNING] /opt/mapr/server/disksetup failed with error code 1... Retrying in 10 seconds
  2019/03/05 21:26:47 common.sh: [INFO] Setting up disk with: /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb failed. Error 16, Device or resource busy. Disk is used by some other module/process.
  2019/03/05 21:26:52 common.sh: [WARNING] /opt/mapr/server/disksetup failed with error code 1... Retrying in 10 seconds
  2019/03/05 21:27:02 common.sh: [INFO] Setting up disk with: /opt/mapr/server/disksetup -F /opt/mapr/conf/disks.txt /dev/sdb failed. Error 16, Device or resource busy. Disk is used by some other module/process.
- SSH to the cluster nodes.
- Execute the following command to list the disks on each node:
  fdisk -l
- Identify a disk that is free or not in use, and change the disk values in the simpleDeploymentDisks list in the CR (a quick way to confirm that a disk is unused follows these steps). Then delete and reapply the cluster CR. For example:
  kubectl delete -f cr-cluster-full.yaml
  kubectl get ns
  kubectl apply -f cr-cluster-full.yaml
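Before changing the simpleDeploymentDisks list, it can help to confirm on the node that the candidate disk carries no filesystem and is not mounted; lsblk is a standard Linux utility, and <candidate-disk> is a placeholder device name:
  lsblk -f
  grep <candidate-disk> /proc/mounts    # should print nothing if the disk is not mounted anywhere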