Using fsck to Check for File System Inconsistencies

This procedure describes how to use the fsck utility to check for and repair file system inconsistencies in a disk storage pool on HPE Ezmeral Data Fabric on Kubernetes on HPE Ezmeral Runtime Enterprise.

Prerequisites

Required access rights:

  • Platform Administrator or Kubernetes Cluster Administrator access rights are required to download the admin kubeconfig file, which is needed to access Kubernetes cluster pods (see Downloading Admin Kubeconfig).

  • You must be logged on as the root user on the nodes that contain the disk and on which the Kubernetes cluster is running.

About this task

Most disk failures can be identified and possibly remedied by running the fsck utility, which scans the storage pool to which the disk belongs and reports errors. You can run the fsck utility on an offline storage pool after a node failure, a disk failure, or a file system process crash, or to verify the consistency of data when you suspect disk errors.

During this procedure, you place the pod in maintenance mode and take the storage pool offline. You restore operations at the end of the procedure.

Procedure

  1. Use the kubectl exec command to access the CLDB or MFS pod that contains the storage pool that you want to check.

    For example:

    kubectl exec -it cldb-0 -n mycluster1 -- /bin/bash

    If needed, use the kubectl get pods -n <cluster-name> command to get the list of pods, and then determine the CLDB or MFS pod in which you want to run the fsck tool.

  2. Place the pod in maintenance mode by entering the following command:
    sudo touch /opt/mapr/kubernetes/maintenance
  3. Use the mrconfig sp list command to list the storage pools that are in the pod.

    In the following example, there is one storage pool, SP1, with path: /dev/drive0

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 224491 MB, totalfree 221235 MB
    
    SP 0: name SP1, Online, size 224491 MB, free 221235 MB, path /dev/drive0
    
  4. Mark the storage pool as offline.

    For example:

    mrconfig sp offline /dev/drive0
  5. Verify that the storage pool is offline by examining the output of the mrconfig sp list command.

    For example:

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 0 MB, totalfree 0 MB
    
    SP 0: name SP1, Offline, size 2575449 MB, free 0 MB, path /dev/drive0
    
  6. Run the fsck utility on the storage pool, examine the output, and identify and resolve any errors.

    For information about fsck and resolving errors, see the HPE Ezmeral Data Fabric documentation.

    For example:

    /opt/mapr/server/fsck -n SP1     
    
    Using logfile /opt/mapr/logs/fsck.log.2021-05-20.19:49:22.28795
    tcmalloc: large alloc 26829914112 bytes == 0x55a10d184000 @  0x55a10945a710 0x55a1095c537c 0x55a10938ee7a
    fs/common/daremgr.cc:194: Failed to open the file /opt/mapr/conf/dare.master.key No such file or directory, err 2
    tcmalloc: large alloc 26829922304 bytes == 0x55a74dd3c000 @  0x55a10945a710 0x55a1095c50fc 0x55a109336572
    
    FSCK start (initialize storage pool and replay log) ...
    Allocator init: 2515g (329711616 blocks) in 5031 groups
    1: SG: f 99%: 0 [n 4198 6%, r 0] --> 7 [n 65536 100%, r 0]
    
    FSCK phase 1 (initialize cache and verify log) ...
    
    FSCK phase 2 and 3 (verify all containers and inodes) ...
      done with all containers 242 of 242 ...
    
    FSCK phase 4 (verify namespace and orphanage) ...
    
    FSCK phase 5 (verify allocation bitmap) ...
    
    FSCK completed without errors.
    
  7. Bring the storage pool online.

    For example:

    mrconfig sp online /dev/drive0
  8. List the storage pools and verify that the storage pool is online.

    For example:

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 2506499 MB, totalfree 2505357 MB
    
    SP 0: name SP1, Online, size 2506499 MB, free 2505357 MB, path /dev/drive0
    
  9. Bring the pod out of maintenance mode by entering the following command:
    sudo rm -f /opt/mapr/kubernetes/maintenance
  10. (Optional) Verify that the Data Fabric cluster pods are operational.
    For example, you can execute the edf report ready command.
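
The in-pod steps above (2 through 9) can be sketched as a small bash helper. This is an illustrative sketch, not part of the product: the function names sp_state, run, sp_begin_check, and sp_end_check and the DRY_RUN variable are hypothetical, and the parser assumes that mrconfig sp list prints storage pool lines in the format shown in the examples above. The commands and paths are the ones used in the procedure. The work is split into two functions so that you can examine the fsck output and resolve any errors before bringing the storage pool back online.

```shell
#!/usr/bin/env bash
# Sketch only. Helper names and DRY_RUN are hypothetical; the commands and
# paths are the ones shown in the procedure above.

MAINT_FLAG=/opt/mapr/kubernetes/maintenance

# Print the state (Online or Offline) of the named storage pool, reading
# "mrconfig sp list" output on stdin. Assumes lines of the form:
#   SP 0: name SP1, Online, size ... MB, free ... MB, path /dev/drive0
sp_state() {
  awk -v sp="$1" '
    $1 == "SP" && $3 == "name" {
      name = $4;  sub(/,$/, "", name)
      state = $5; sub(/,$/, "", state)
      if (name == sp) { print state; exit }
    }'
}

# Echo a command and run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" != "1" ]; then "$@"; fi
}

# Steps 2 through 6: enter maintenance mode, take the pool offline, run fsck.
sp_begin_check() {
  local sp="$1" path="$2"
  run sudo touch "$MAINT_FLAG" || return 1
  run mrconfig sp offline "$path" || return 1
  if [ "${DRY_RUN:-1}" != "1" ]; then
    [ "$(mrconfig sp list | sp_state "$sp")" = "Offline" ] || {
      echo "storage pool $sp is not offline; not running fsck" >&2
      return 1
    }
  fi
  run /opt/mapr/server/fsck -n "$sp"
}

# Steps 7 through 9: bring the pool online, verify, leave maintenance mode.
sp_end_check() {
  local sp="$1" path="$2"
  run mrconfig sp online "$path" || return 1
  if [ "${DRY_RUN:-1}" != "1" ]; then
    [ "$(mrconfig sp list | sp_state "$sp")" = "Online" ] || {
      echo "storage pool $sp is not online; pod left in maintenance mode" >&2
      return 1
    }
  fi
  run sudo rm -f "$MAINT_FLAG"
}
```

With the functions loaded in a shell inside the pod, calling sp_begin_check SP1 /dev/drive0 prints the commands without running them. Setting DRY_RUN=0 runs them. After you review the fsck output, sp_end_check SP1 /dev/drive0 restores operations.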