Using fsck to Check for File System Inconsistencies
This procedure describes how to use the fsck utility to check for and repair file system inconsistencies in a disk storage pool on HPE Ezmeral Data Fabric on Kubernetes on HPE Ezmeral Runtime Enterprise.
Prerequisites
Required access rights:
-
Platform Administrator or Kubernetes Cluster Administrator access rights are required to download the admin kubeconfig file, which is needed to access Kubernetes cluster pods (see Downloading Admin Kubeconfig).
-
You must be logged on as the root user on the nodes that contain the disk and on which the Kubernetes cluster is running.
About this task
Most disk failures can be identified and possibly remedied by running the fsck utility, which scans the storage pool to which the disk belongs and reports errors. You can run the fsck utility on an offline storage pool after a node failure, a disk failure, or a filesystem process crash, or to verify the consistency of data when you suspect disk errors.
During this procedure, you place the pod in maintenance mode and take the storage pool offline. You restore operations at the end of the procedure.
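The order of operations matters here: maintenance mode and the offline state bracket the check, and both are undone afterward. As a minimal sketch, the hypothetical helper below (fsck_plan is not part of the product) only prints the in-pod commands used in this procedure, in order, for a given storage pool name and path. It executes nothing, so the output can serve as a checklist.

```shell
#!/bin/sh
# fsck_plan SP_NAME SP_PATH: print, in order, the commands that this
# procedure runs inside the CLDB or MFS pod. Nothing is executed.
# fsck_plan is a hypothetical helper name used only for illustration.
fsck_plan() {
    sp_name="$1"   # storage pool name, for example SP1
    sp_path="$2"   # storage pool path, for example /dev/drive0
    cat <<EOF
sudo touch /opt/mapr/kubernetes/maintenance
mrconfig sp offline $sp_path
/opt/mapr/server/fsck -n $sp_name
mrconfig sp online $sp_path
sudo rm -f /opt/mapr/kubernetes/maintenance
EOF
}

# Print the checklist for the example pool used in this procedure.
fsck_plan SP1 /dev/drive0
```

The verification steps (listing the storage pools before and after each state change) are left out of the sketch on purpose, because they require reading the output rather than running a fixed command.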
Procedure
-
Use the kubectl exec command to access the CLDB or MFS pod that contains the storage pool that you want to check. For example:
kubectl exec -it cldb-0 -n mycluster1 -- /bin/bash
If needed, use the kubectl get pods -n <cluster-name> command to get the list of pods, and then determine the CLDB or MFS pod in which you want to run the fsck tool.
-
Place the pod in maintenance mode by entering the following command:
sudo touch /opt/mapr/kubernetes/maintenance
-
Use the mrconfig sp list command to list the storage pools that are in the pod. In the following example, there is one storage pool, SP1, with path /dev/drive0:
mrconfig sp list
ListSPs resp: status 0:1
No. of SPs (1), totalsize 224491 MB, totalfree 221235 MB
SP 0: name SP1, Online, size 224491 MB, free 221235 MB, path /dev/drive0
-
Mark the storage pool as offline.
For example:
mrconfig sp offline /dev/drive0
-
Verify that the storage pool is offline by examining the output of the mrconfig sp list command. For example:
mrconfig sp list
ListSPs resp: status 0:1
No. of SPs (1), totalsize 0 MB, totalfree 0 MB
SP 0: name SP1, Offline, size 2575449 MB, free 0 MB, path /dev/drive0
-
Run the fsck utility on the storage pool, examine the output, and identify and resolve any errors. For information about fsck and resolving errors, see the HPE Ezmeral Data Fabric documentation. For example:
/opt/mapr/server/fsck -n SP1
Using logfile /opt/mapr/logs/fsck.log.2021-05-20.19:49:22.28795
tcmalloc: large alloc 26829914112 bytes == 0x55a10d184000 @ 0x55a10945a710 0x55a1095c537c 0x55a10938ee7a
fs/common/daremgr.cc:194: Failed to open the file /opt/mapr/conf/dare.master.key No such file or directory, err 2
tcmalloc: large alloc 26829922304 bytes == 0x55a74dd3c000 @ 0x55a10945a710 0x55a1095c50fc 0x55a109336572
FSCK start (initialize storage pool and replay log) ...
Allocator init: 2515g (329711616 blocks) in 5031 groups
1: SG: f 99%: 0 [n 4198 6%, r 0] --> 7 [n 65536 100%, r 0]
FSCK phase 1 (initialize cache and verify log) ...
FSCK phase 2 and 3 (verify all containers and inodes) ...
done with all containers 242 of 242 ...
FSCK phase 4 (verify namespace and orphanage) ...
FSCK phase 5 (verify allocation bitmap) ...
FSCK completed without errors.
-
Bring the storage pool online.
For example:
mrconfig sp online /dev/drive0
-
List the storage pools and verify the storage pool is online.
For example:
mrconfig sp list
ListSPs resp: status 0:1
No. of SPs (1), totalsize 2506499 MB, totalfree 2505357 MB
SP 0: name SP1, Online, size 2506499 MB, free 2505357 MB, path /dev/drive0
-
Bring the pod out of maintenance mode by entering the following command:
sudo rm -f /opt/mapr/kubernetes/maintenance
-
(Optional) Verify that the Data Fabric cluster pods are operational.
For example, you can execute the edf report ready command.
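The two verification steps in this procedure rely on reading the Online or Offline field in the mrconfig sp list output by eye. As a minimal sketch, the hypothetical helper below (sp_state is not part of the product) extracts that field for a named storage pool. It assumes the output layout shown in the examples above, with each "SP n: name ..." entry on its own line; other releases may format the listing differently.

```shell
#!/bin/sh
# sp_state NAME: read `mrconfig sp list` output on stdin and print the
# state field (for example Online or Offline) of the named storage pool.
# Returns non-zero if no pool with that name is listed.
# Assumes the "SP n: name NAME, STATE, size ..." layout shown in this
# procedure; sp_state is a hypothetical helper name.
sp_state() {
    awk -v name="$1" '
        /^SP [0-9]+: name / {
            # Split on ", " so f[1] is "SP 0: name SP1" and f[2] is the state.
            split($0, f, ", ")
            sub(/^SP [0-9]+: name /, "", f[1])
            if (f[1] == name) { print f[2]; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example using the offline listing from this procedure. Inside a pod,
# the equivalent would be: mrconfig sp list | sp_state SP1
sp_state SP1 <<'EOF'
ListSPs resp: status 0:1
No. of SPs (1), totalsize 0 MB, totalfree 0 MB
SP 0: name SP1, Offline, size 2575449 MB, free 0 MB, path /dev/drive0
EOF
```

With the sample listing, the helper prints Offline. Matching on the pool name rather than the path keeps the check aligned with the name passed to fsck.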