Recovering from Disk Failure
Lists the disk errors and their resolution.
Most software failures can be remedied by running the fsck utility, which scans the storage pool to which the disk belongs and reports errors. For hardware failures, remove the failed disk and replace it according to the procedure in Removing and Replacing Disks.
The following are the types of failures and the recommended courses of action:
- I/OTimeOut Error
- Failure Reason: The default value of the mfs.disk.io.timeout parameter is 60 seconds. An IO is declared stuck after 3 times the value of this parameter (3 x mfs.disk.io.timeout). The disk is taken offline if even a single IO has not completed within that time.
- No Such Device
- Failure Reason: The $INSTALL_DIR/conf/disktab file contains "/MissingDisk" or references a disk path not found in the /proc/partitions file.
- ENODEV: MissingDisk# Error: disktab file contains a /MissingDisk# entry
- Failure Reason: A disk corresponding to a GUID is missing, and the corresponding disk path in the disktab file belongs to another disk. When an attempt is made to automatically fix the disktab file, this entry is replaced with a /MissingDisk# path.
- EIO Error
- Failure Reason: I/O error. This could be due to a bad block or a bad disk. The system takes the storage pool (SP) offline after one final attempt to complete the IO.
- CRC Error
- Failure Reason: This could be due to a bad block or bit flip on the disk. The SP will be taken offline immediately.
- SlowDisk Error
- Failure Reason: The default value of the mfs.disk.io.timeout parameter is 60 seconds. An IO is declared slow when it takes as long as the value of this parameter (1 x mfs.disk.io.timeout). Thirty or more slow IO completions on the same disk within a short span of time (5 seconds) are recorded as a slow event. The SP is taken offline if 3 such events are recorded within an hour. NOTE: After an hour, the HPE Ezmeral Data Fabric filesystem resets tracking (to 0).
- GUID of the disk does not match the one in $INSTALL_DIR/conf/disktab
- Failure Reason: The disk names may have changed.
- Unknown Error
- Failure Reason: Any reason
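
The timing rules given for the I/OTimeOut and SlowDisk errors can be easier to follow as a small model. The following Python sketch is illustrative only and is not the filesystem's implementation. The names classify_io and SlowDiskTracker are hypothetical, the thresholds are the defaults quoted in the list, and two details are assumptions because the text does not specify them: an IO counts as slow or stuck once its elapsed time reaches the threshold, and the one-hour tracking period starts at the first recorded slow event.

```python
from collections import deque

# Defaults quoted in the list above.
IO_TIMEOUT_S = 60          # default value of mfs.disk.io.timeout
STUCK_FACTOR = 3           # an IO is stuck after 3 x timeout
SLOW_IOS_PER_EVENT = 30    # slow completions that make up one slow event
SLOW_WINDOW_S = 5          # "short span of time" for those completions
EVENTS_TO_OFFLINE = 3      # slow events within an hour before the SP goes offline
EVENT_WINDOW_S = 3600      # tracking resets to 0 after an hour


def classify_io(elapsed_s, timeout_s=IO_TIMEOUT_S):
    """Classify one IO by elapsed time (boundary handling is an assumption)."""
    if elapsed_s >= STUCK_FACTOR * timeout_s:
        return "stuck"
    if elapsed_s >= timeout_s:
        return "slow"
    return "ok"


class SlowDiskTracker:
    """Hypothetical per-disk model of slow-event tracking."""

    def __init__(self):
        self.slow_times = deque()   # completion times of recent slow IOs
        self.event_count = 0        # slow events in the current hour
        self.window_start_s = None  # time of the first event in this hour

    def record_slow_io(self, now_s):
        """Record a slow IO completion; return True if the SP would go offline."""
        # Assumption: the hour is measured from the first recorded event.
        if self.event_count and now_s - self.window_start_s >= EVENT_WINDOW_S:
            self.event_count = 0
        self.slow_times.append(now_s)
        # Keep only slow completions from the last SLOW_WINDOW_S seconds.
        while now_s - self.slow_times[0] > SLOW_WINDOW_S:
            self.slow_times.popleft()
        if len(self.slow_times) >= SLOW_IOS_PER_EVENT:
            if self.event_count == 0:
                self.window_start_s = now_s
            self.event_count += 1
            self.slow_times.clear()
        return self.event_count >= EVENTS_TO_OFFLINE


if __name__ == "__main__":
    tracker = SlowDiskTracker()
    offline = False
    for burst_start in (0, 100, 200):          # three bursts of slow IOs
        for i in range(SLOW_IOS_PER_EVENT):
            offline = tracker.record_slow_io(burst_start + i * 0.1)
    print(classify_io(75), classify_io(200), offline)
```

With the default timeout, an IO that takes 75 seconds is classified as slow and one that takes 200 seconds as stuck. In the demo, three bursts of 30 slow completions inside the same hour cause the tracker to report that the SP would be taken offline.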