All modern hard disks come with S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) management. This is a fault monitoring and testing interface that can be quite useful, both to monitor the health of your disks and to test them. This wiki entry describes the basics you need to check your drive. For details, check the man page and the smartctl homepage at http://smartmontools.sourceforge.net/.
SMART should work with IDE/ATA, SATA and SCSI drives, but the output might look different from the examples below. The examples here use an ATA/PATA drive as a reference. If you are in doubt, you can always probe the device with "smartctl -i /dev/sda".
Fedora Core ships with a "smartd" service that will email root if serious problems are detected on your disks. This wiki entry describes the output from the "smartctl" tool. If smartctl is not installed, you can install it with "yum install smartmontools". Also pay attention to the output from smartctl: some of the functions might not be supported if your drive is old.
smartctl accepts a lot of flags; I'll only deal with two of them: -a and -t. The -a option displays all available information. Here is some sample output, trimmed somewhat; an explanation follows below:
[root@balrog ~] # smartctl -a /dev/hdg
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     IC35L080AVVA07-0
User Capacity:    82,348,277,760 bytes
ATA Version is:   5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
[Deleted lots of SMART capability flags]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[some lines removed]
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       17426
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       195
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       35 (Lifetime Min/Max 14/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       30

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

Device does not support Selective Self Tests/Logging
The first "Information section" just prints some general information: manufacturer, serial number, microcode version. Check that SMART support is reported as both Available and Enabled. Then follows a SMART data section, which contains the two sections we care about: SMART Attributes Data Structure and SMART Self-test log structure. There is also a SMART Error Log section, but if you can read and understand that, you don't need this wiki entry.
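The check described above can be scripted. The following is a minimal sketch, not part of smartmontools: smart_ready is a hypothetical helper name, and the sample lines are copied from the output shown earlier so the snippet runs without touching a disk.

```shell
# Sample "smartctl -i" lines, copied from the output shown earlier.
sample_info='SMART support is: Available - device has SMART capability.
SMART support is: Enabled'

# smart_ready (hypothetical helper): reads smartctl -i output on stdin and
# succeeds only if SMART is reported as both Available and Enabled.
smart_ready() {
    info=$(cat)
    printf '%s\n' "$info" | grep -q '^SMART support is: *Available' &&
    printf '%s\n' "$info" | grep -q '^SMART support is: *Enabled'
}

if printf '%s\n' "$sample_info" | smart_ready; then
    echo "SMART is available and enabled"
else
    echo "SMART is missing or disabled; try: smartctl -s on /dev/sda"
fi
```

On a real system you would feed it live output instead, for example "smartctl -i /dev/hdg | smart_ready" as root. The exact wording of the "SMART support is:" lines is taken from the version 5.33 output above and may differ for SCSI drives.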
The SMART Attributes Data Structure section contains many useful parts. Reallocated_Sector_Ct is how many sectors have been reallocated due to errors. Some sector reallocations are OK, but if this number starts to grow, it is an indication that your disk is getting sick. Also take note of Reallocated_Event_Count and Current_Pending_Sector. The event count is how many times the drive had to reallocate sectors due to I/O errors; pending sectors is the number of sectors that have developed a problem but have not yet been moved. Sectors will usually only be moved when written to, so if a read error occurs, the sector will be marked as faulty and only reallocated at the next write to that sector. If you have pending sectors, you could use your manufacturer's tool for disk diagnostics and surface testing.
The SMART Self-test log shows you the test results. As I mentioned above, smartctl can also test the drive online. To run a test, use the -t flag:
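To keep an eye on just these three counters, the raw value (the tenth column) can be pulled out with awk. This is a sketch under assumptions: sector_health is a hypothetical helper name, the rows are copied from the sample output above, and the column layout is assumed to match that ATA output.

```shell
# Three attribute rows copied from the sample output above.
sample_attrs='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0'

# sector_health (hypothetical helper): reads attribute rows on stdin and
# prints the raw value (10th column) of the three sector counters,
# marking any counter that is greater than zero.
sector_health() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Reallocated_Event_Count" ||
         $2 == "Current_Pending_Sector" {
             note = ($10 + 0 > 0) ? "  (watch this)" : ""
             printf "%s=%s%s\n", $2, $10, note
         }'
}

printf '%s\n' "$sample_attrs" | sector_health
```

Against a real drive the input would come from "smartctl -A /dev/hdg | sector_health" (the -A flag prints only the attribute table). A single non-zero value is not necessarily a failure; as noted above, it is growth over time that matters.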
[root@balrog ~] # smartctl -t short /dev/hdg
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Mon Jul 4 14:48:17 2005

Use smartctl -X to abort test.
You can also use "-t long" to get an extended test. These tests will log any errors in the SMART logs and you can use "smartctl -a" to see the result.
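Since the drive reports how long the test will take, a script can wait for that long before reading the log. The sketch below only extracts the wait time; test_minutes is a hypothetical helper name, and the "Please wait" wording is assumed to match the version 5.33 output shown above.

```shell
# Line copied from the "smartctl -t short" output above.
sample_line='Please wait 1 minutes for test to complete.'

# test_minutes (hypothetical helper): reads "smartctl -t" output on stdin
# and prints the number of minutes the drive asked us to wait.
test_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\.$/\1/p'
}

printf '%s\n' "$sample_line" | test_minutes

# On a real system (as root, not run here) this could be combined as:
#   mins=$(smartctl -t short /dev/hdg | test_minutes)
#   sleep $((mins * 60))
#   smartctl -l selftest /dev/hdg
```

The "-l selftest" option prints only the self-test log, which is shorter to read than the full "smartctl -a" output when all you want is the test result.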
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

Here you can see three test runs, two successful and one failure. The failure occurred at LBA 158906040. If you get errors like that, I recommend getting your manufacturer's tools to test and repair the sector(s) if possible. Advanced users can also try to use dd, both to salvage the data in the sector and to write to the sector, which triggers the sector reallocation. Tools for this can be found here:
* Hitachi/IBM Drive Fitness Test: http://www.hitachigst.com/hdd/support/download.htm
* Seagate SeaTools: http://www.seagate.com/support/seatools/
For other manufacturers, please search the manufacturer's web page and update this document.
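For the dd approach mentioned above, the commands might look like the ones printed by the sketch below. It is deliberately a dry run: sector_dd_cmds is a hypothetical helper that only prints the commands and never touches the disk. It assumes 512-byte sectors, so that the LBA from the self-test log can be used directly as the dd block index; check your drive before relying on that.

```shell
# sector_dd_cmds (hypothetical helper): prints, but does not run, dd
# commands for one sector. Assumes 512-byte sectors, so LBA == block index.
sector_dd_cmds() {
    dev=$1
    lba=$2
    # Try to read the sector into a file (may fail with an I/O error).
    echo "dd if=$dev of=sector-$lba.bin bs=512 skip=$lba count=1"
    # Overwrite the sector with zeros; its old contents are lost.
    echo "dd if=/dev/zero of=$dev bs=512 seek=$lba count=1"
}

sector_dd_cmds /dev/hdg 158906040
```

The second command destroys whatever was stored in that sector, and a wrong device name or LBA will overwrite good data, so double-check both and make sure you have a backup before running anything like it by hand.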