Disk failure
Recently I received an email from a system process on one of my file servers. It politely informed me that one of my RAID5 devices had a disk failure, and that I really ought to do something about it. The email said:
Date: Tue, 25 Jan 2005 03:37:42 -0800
From: mdadm monitoring <root@aserver.ogre.com>
To: root@ogre.com
Subject: Fail event on /dev/md4:aserver.ogre.com
This is an automatically generated mail message from mdadm
running on aserver.ogre.com
A Fail event had been detected on md device /dev/md4.
Faithfully yours, etc.
Examining the logfiles clearly shows that something was wrong:
Jan 25 03:36:40 aserver kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 25 03:36:40 aserver kernel: hde: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=206384139, ...
Jan 25 03:36:41 aserver kernel: end_request: I/O error, dev hde, sector 206384136
Jan 25 03:36:41 aserver kernel: raid5: Disk failure on hde3, disabling device. Operation continuing ...
Jan 25 03:36:41 aserver kernel: disk 0, o:0, dev:hde3
I've had to deal with these things many times in the past, but this time I decided to dig deeper into the problem and find out exactly how severe this failure was. My system is fairly modern, running a slightly modified Fedora Core 3 Linux installation with SMART-enabled disk drives.
Time to get SMART
This is not going to document all of the SMART features; suffice it to say that SMART, and the set of SMART tools available on Linux (and many other OSs), is incredibly useful. I also found this very useful HOWTO about using these tools to "debug" a faulty disk.
First of all, I wanted to get some status information from the SMART-enabled drive, to see how bad the failure really was. This is really easy using the smartctl command:
root@aserver 282/0 # <b>smartctl -a /dev/hde</b>
smartctl version 5.21 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y120L0
Serial Number: Y41MB1XE
Firmware Version: YAR41VW0
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Sun Jan 30 16:12:57 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
.
.
.
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 252 250 000 Old_age Offline - 1
.
.
.
The disk evidently doesn't think it's in too bad a shape: it's still passing its overall health check. However, there is a bad sector on the disk, indicated by the Offline_Uncorrectable raw value of 1 above. smartctl prints many more counters than these; I'm only including the ones that were relevant to my problem.
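To pick those three counters out of the attribute table without scanning it by eye, a small filter can help. This is only a sketch: bad_sector_counters is a made-up helper name, and it assumes the attribute ID is the first field, the name the second, and the raw value the last field on each line, as in the excerpt above.

```shell
#!/bin/sh
# Read smartctl attribute lines on stdin and print "name=raw_value" for the
# three sector-health counters (IDs 196, 197, 198) whose raw value is non-zero.
# Assumption: ID is field 1, name is field 2, raw value is the last field.
bad_sector_counters() {
    awk '$1 == 196 || $1 == 197 || $1 == 198 {
        if ($NF + 0 > 0) print $2 "=" $NF
    }'
}

# Typical use (needs root and smartmontools installed):
#   smartctl -A /dev/hde | bad_sector_counters
```

No output means none of the three counters has a non-zero raw value; it does not by itself prove the disk is healthy.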
With this knowledge, the next step is to run a diagnostic on the disk itself. This can also be done with a simple smartctl command:
root@aserver 136/0 # <b>smartctl -t long /dev/hde</b>
This starts a self-test in the "background". It takes a while to complete, in my case about an hour or so. Once it has finished, you can get the results using yet another smartctl command:
root@aserver 159/0 # <b>smartctl -l selftest /dev/hde</b>
smartctl version 5.21 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 2258 0x0c4d2c0b
Buffer I/O error on device hde3, logical block 25154912
This tells us that there is a bad block at LBA 0x0c4d2c0b. The disk has not relocated it, since it still thinks there is hope of recovering the data from it. We know better than that, so we must somehow force the disk to reallocate this bad block.
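The self-test log prints the LBA in hexadecimal, while the kernel log earlier printed LBAsect in decimal. Converting one to the other is a quick way to confirm that both point at the same sector. A minimal sketch (lba_to_decimal is just an illustrative helper name):

```shell
#!/bin/sh
# Convert a hexadecimal LBA, as printed in the SMART self-test log,
# to decimal so it can be compared with the kernel's LBAsect value.
lba_to_decimal() {
    printf '%d\n' "$1"
}

lba_to_decimal 0x0c4d2c0b   # prints 206384139
```

206384139 is exactly the LBAsect value from the dma_intr line in the kernel log, so the self-test and the kernel agree on which sector is bad.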
Fixing the disk
To force the block to be reallocated, I decided to simply run a destructive "mkfs" on the affected partition. You can also use dd or some other tool to write to the entire disk partition.
root@aserver 268/0 # <b>mke2fs -c -c -j /dev/hde3</b>
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
14696448 inodes, 29372112 blocks
1468605 blocks (5.00%) reserved for the super user
First data block=0
897 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
Note the two -c options to mke2fs: given twice, -c makes it perform a destructive read-write test on the partition (a single -c runs only a read-only test). This will force the SMART disk to reallocate the bad block (and any other block that might also be bad). Be careful with this command; it will destroy any data you have on the partition.
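For the dd route mentioned earlier, something along these lines should have a similar effect, since it also writes to every block of the partition. Treat it as a sketch rather than what I actually ran: zero_fill is a made-up helper name, the optional block count exists only so it can be tried safely on a scratch file, and bs=1M is the GNU dd spelling.

```shell
#!/bin/sh
# Overwrite a target with zeros. DESTRUCTIVE: everything on the target is lost.
#   $1  target device or file (e.g. the failed RAID member partition)
#   $2  optional number of 1 MiB blocks to write; omit to write until the
#       end of a block device is reached
zero_fill() {
    target=$1
    blocks=$2
    # Without a count, dd stops with "No space left on device" at the end of
    # a block device, so a non-zero exit status is expected in that case.
    # Never omit the count when the target is a regular file.
    dd if=/dev/zero of="$target" bs=1M conv=notrunc ${blocks:+count="$blocks"}
}

# Typical use on the partition from this article (needs root):
#   zero_fill /dev/hde3
```

Unlike mke2fs -c -c, this does not read the data back and compare it, so rerunning the SMART self-test afterwards matters even more.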
To verify that the disk is now in an acceptable state, we rerun the SMART selftest (see above), and then look at the finished result:
root@aserver 286/0 # <b>smartctl -l selftest /dev/hde</b>
smartctl version 5.21 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2374 -
# 2 Extended offline Completed: read failure 10% 2289 0x0c4d2c0b
This is great news: no more errors on our drive! Now all we have to do is put the repaired partition back into the RAID5 device, with something like:
root@aserver 286/0 # <b>mdadm /dev/md4 -a /dev/hde3</b>
And that's it!
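Once the partition has been added back, the array has to rebuild onto it, and the kernel reports progress in /proc/mdstat. The helper below is a hedged sketch for pulling out just the percentage: rebuild_progress is a hypothetical name, and it assumes the progress line contains text of the form "recovery = NN.N%".

```shell
#!/bin/sh
# Print the recovery percentage found in an mdstat-style file, if any.
#   $1  file to read; defaults to /proc/mdstat
# Assumption: the progress line looks like "... recovery = 12.3% (...) ...".
rebuild_progress() {
    sed -n 's/.*recovery *= *\([0-9.]*%\).*/\1/p' "${1:-/proc/mdstat}"
}

# Typical use while the array is rebuilding:
#   rebuild_progress
```

If nothing is printed, no array is reporting a recovery in progress, either because the rebuild has finished or because it never started.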