Replacing a Failed Disk

This procedure describes using the mrconfig utility to replace a failed disk that is part of a storage pool in HPE Ezmeral Data Fabric on Kubernetes on HPE Ezmeral Runtime Enterprise.

Prerequisites

  • Required access rights:

    • Platform Administrator or Kubernetes Cluster Administrator access rights are required to download the admin kubeconfig file, which is needed to access Kubernetes cluster pods (see Downloading Admin Kubeconfig).

    • You must be logged on as the root user on the nodes that contain the disk and on which the Kubernetes cluster is running.

  • You have identified the disk that has failed and needs replacement.

About this task

During this procedure, you place the pod in maintenance mode and take the storage pool offline. After you replace the failed disk, you will use the mrconfig utility to recreate the storage pool, and then you will bring the storage pool and pod back online.

NOTE

You must use the mrconfig utility to perform this task. Using the equivalent maprcli commands is not supported.

Procedure

  1. Use the kubectl exec command to access the CLDB or MFS pod that contains the storage pool with the failed disk.

    For example:

    kubectl exec -it cldb-0 -n myclusternode1 -- /bin/bash

    If needed, use the kubectl get pods -n <cluster-name> command to list the pods, and then determine the CLDB or MFS pod in which you want to run the mrconfig utility.

  2. Place the pod in maintenance mode by entering the following command:
    sudo touch /opt/mapr/kubernetes/maintenance
  3. Use the mrconfig sp list command to list the storage pools that are in the pod:

    In the following example, there is one storage pool, SP1, with the path /dev/drive0:

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 224491 MB, totalfree 221235 MB
    
    SP 0: name SP1, Online, size 224491 MB, free 221235 MB, path /dev/drive0
    
  4. Make note of the other disk drives in the storage pool.

    Later in this procedure, you will remove and then add back the other disks in the storage pool that contains the failed disk. You can display the disks in the storage pool by entering the mrconfig dg list <path> command, where <path> is the path of the storage pool. In the command output, the drive paths of the disks in the group appear at the end of the lines that start with SubDG.
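    A small filter can capture the member disks for later reference. The sketch below is illustrative only: the SubDG line format is assumed from the description above, and a placeholder string stands in for real mrconfig dg list output.

```shell
# Sketch: keep only the lines that start with "SubDG"; their trailing
# fields name the member disks. Inside the pod you would pipe the real
# command instead:
#   /opt/mapr/server/mrconfig dg list /dev/drive0 | grep '^SubDG'
# The sample text below is a placeholder, not real mrconfig output.
sample_output='DG 0: layout raid0
SubDG 0: stripe /dev/drive0 /dev/drive1
SubDG 1: concat /dev/drive2'
printf '%s\n' "$sample_output" | grep '^SubDG'
```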

  5. Mark the storage pool as offline.

    For example:

    mrconfig sp offline /dev/drive0
  6. Verify that the storage pool is offline by examining the output of the mrconfig sp list command.

    For example:

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 0 MB, totalfree 0 MB
    
    SP 0: name SP1, Offline, size 2575449 MB, free 0 MB, path /dev/drive0
    
  7. Remove the failed disk from the configuration.
    CAUTION

    Removing a disk destroys the data on it, so ensure that all data on the disk is backed up and replicated before removing it.

    For example:

    mrconfig disk remove /dev/drive0
  8. Replace the disk hardware. Follow the instructions for the system and disk you are replacing to remove the disk from the system and install the replacement disk.
  9. Initialize the replaced disk by using the mrconfig disk init command.

    For example:

    mrconfig disk init -F /dev/drive0
    Disk guid: 7cc56e064fd1e1fe:60a6bfaa0693a2
    
  10. Load the replaced disk by using the mrconfig disk load command.

    For example:

    /opt/mapr/server/mrconfig disk load /dev/drive0
    guid FEE1D14F-066E-C57C-A293-06AABFA66000
    dgguid 00000000-0000-0000-0000-000000000000
    
  11. One disk at a time, use the mrconfig utility to remove, initialize, and load each of the other disks that were part of the storage pool that contained the replaced disk.
    After you finish this step, the replaced disk and the remaining disks in the storage pool have been initialized and loaded.
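    The per-disk sequence in this step can be sketched as a loop. This is a dry run that only prints each command (the drive paths /dev/drive1 and /dev/drive2 are illustrative); drop the echo prefix to run the commands for real inside the pod, one disk at a time.

```shell
# Dry-run sketch: print the remove/init/load sequence for each remaining
# member disk of the storage pool. Drive paths are illustrative.
MRCONFIG=/opt/mapr/server/mrconfig
for disk in /dev/drive1 /dev/drive2; do
    echo "$MRCONFIG disk remove $disk"
    echo "$MRCONFIG disk init -F $disk"
    echo "$MRCONFIG disk load $disk"
done
```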
  12. Use the mrconfig dg create raid0 command to create a disk group of type raid0 that includes the disks in the storage pool.

    For example:

    /opt/mapr/server/mrconfig dg create raid0 /dev/drive0 /dev/drive1 /dev/drive2
    CreateDG disks(3) stripeDepth(0) layout(3)
    
  13. Create a concatenated disk group with the mrconfig dg create concat command, specifying the primary drive.

    For example:

    mrconfig dg create concat /dev/drive0
    CreateDG disks(1) stripeDepth(0) layout(2)
    

    At this point, you can use the mrconfig dg list command to see the layout of the disk group and which disk is the primary disk. The primary disk can be used in other commands to refer to the disk group as a whole.

  14. Make the storage pool from the newly created disk group.

    For example:

    /opt/mapr/server/mrconfig sp make -F /dev/drive0
  15. Bring the storage pool online.

    For example:

    mrconfig sp online /dev/drive0
  16. List the storage pools and verify that the storage pool is online.

    For example:

    mrconfig sp list
    ListSPs resp: status 0:1
    No. of SPs (1), totalsize 2510595 MB, totalfree 2509693 MB
    
    SP 0: name SP2, Online, size 2510595 MB, free 2509693 MB, path /dev/drive0
    

    The storage pool is identified by its path. The name of the storage pool is generated automatically and is not necessarily retained when you recreate a storage pool for a given path.

  17. Bring the pod out of maintenance mode:
    sudo rm -f /opt/mapr/kubernetes/maintenance
  18. (Optional) Verify that the Data Fabric cluster pods are operational.
    For example, you can execute the edf report ready command.
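The disk-replacement and pool-rebuild commands above can be summarized as a dry-run script. It only prints the commands in order; the drive paths and the three-disk layout are illustrative, and in practice you would check each command's output before proceeding.

```shell
# Dry-run summary of the storage-pool rebuild. Prints the commands only;
# drive paths and the number of disks are illustrative.
MRCONFIG=/opt/mapr/server/mrconfig
PRIMARY=/dev/drive0
OTHERS="/dev/drive1 /dev/drive2"

print_rebuild_sequence() {
    echo "$MRCONFIG disk remove $PRIMARY"    # remove the failed disk
    # ... replace the disk hardware, then:
    echo "$MRCONFIG disk init -F $PRIMARY"   # initialize the new disk
    echo "$MRCONFIG disk load $PRIMARY"      # load the new disk
    for disk in $OTHERS; do                  # redo the remaining member disks
        echo "$MRCONFIG disk remove $disk"
        echo "$MRCONFIG disk init -F $disk"
        echo "$MRCONFIG disk load $disk"
    done
    echo "$MRCONFIG dg create raid0 $PRIMARY $OTHERS"  # striped disk group
    echo "$MRCONFIG dg create concat $PRIMARY"         # concatenated disk group
    echo "$MRCONFIG sp make -F $PRIMARY"               # recreate the storage pool
    echo "$MRCONFIG sp online $PRIMARY"                # bring the pool online
}

print_rebuild_sequence
```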