Smartctl

All modern hard disks come with S.M.A.R.T. management, a fault monitoring and testing interface that is useful both for monitoring the health of your disks and for testing them. This wiki entry describes the basics you need to check your drive; for details, check the man page and the smartctl homepage at http://smartmontools.sourceforge.net/.

SMART should work with IDE/ATA, SATA and SCSI drives, but the output might look different from the examples below, which use an ATA/PATA drive as a reference. If you are in doubt, you can always probe the device with "smartctl -i /dev/sda".

Fedora Core ships with a "smartd" service that will email root if serious problems are detected on your disks. This wiki entry describes the output from the "smartctl" tool. If it is not installed, you can install it with "yum install smartmontools" (the package that provides both smartctl and smartd). Also pay attention to the output from smartctl; some functions might not be supported if your drive is old.

Smartctl takes a lot of flags; I'll only deal with two of them: -a and -t. The -a option displays all available information. Here is a somewhat trimmed sample output; an explanation follows below:

[root@balrog ~] # smartctl -a /dev/hdg
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

START OF INFORMATION SECTION
Device Model:     IC35L080AVVA07-0
User Capacity:    82,348,277,760 bytes
ATA Version is:   5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

START OF READ SMART DATA SECTION
SMART overall-health self-assessment test result: PASSED

General SMART Values:

[Deleted lots of SMART capability flags]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[some lines removed]
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       17426
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       195
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       35 (Lifetime Min/Max 14/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       30

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

Device does not support Selective Self Tests/Logging

The first section, the information section, just prints general information: manufacturer, serial number, microcode version. Check that SMART support is both Available and Enabled. Then follows a SMART data section and the two sections we care about: SMART Attributes Data Structure and SMART Self-test log structure. There is also a SMART Error Log section, but if you can read and understand that, you don't need this wiki entry.
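The check for SMART support can be scripted. Below is a minimal Python sketch, assuming the information section has already been captured as text (for example from "smartctl -i"). The function name smart_support is hypothetical, and only the two "SMART support is:" lines seen in the sample output are recognised; other drives or smartctl versions may word them differently.

```python
def smart_support(info_text):
    """Return (available, enabled) based on the 'SMART support is:' lines."""
    available = False
    enabled = False
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("SMART support is:"):
            continue
        value = line.split(":", 1)[1].strip()
        if value.startswith("Available"):
            available = True
        elif value.startswith("Enabled"):
            enabled = True
    return available, enabled


# Lines copied from the sample output above.
SAMPLE_INFO = """\
Device Model:    IC35L080AVVA07-0
User Capacity:   82,348,277,760 bytes
ATA Version is:  5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

if __name__ == "__main__":
    print(smart_support(SAMPLE_INFO))  # (True, True)
```

If a drive reports SMART as available but not enabled, "smartctl -s on /dev/hdg" should turn it on.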

The SMART Attributes Data Structure section contains many useful parts. Reallocated_Sector_Ct is how many sectors have been reallocated due to errors. Some sector reallocations are OK, but if this number starts to grow, it is an indication that your disk is getting sick. Also take note of Reallocated_Event_Count and Current_Pending_Sector. The event count is how many times the drive had to reallocate sectors due to I/O errors; pending sectors is the number of sectors that have developed a problem but have not yet been moved. Sectors will usually only be moved when written to, so if a read error occurs, the sector will be marked as faulty and only reallocated at the next write to that sector. If you have pending sectors, you could use your manufacturer's tool for disk diagnostics and surface testing.
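The three attributes discussed above can be pulled out of the attribute table automatically. The following is a rough Python sketch, assuming the whitespace-separated ten-column layout shown in the sample output. The names WATCHED, parse_attributes and sector_warnings are made up for this illustration, and RAW_VALUE is vendor specific, so treat the result as a hint rather than a diagnosis.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
)


def parse_attributes(text):
    """Map attribute name -> dict with the columns used below."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has a hex FLAG column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        raw = fields[9]
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "type": fields[6],
            # Keep RAW_VALUE only when it is a plain number.
            "raw": int(raw) if raw.isdigit() else None,
        }
    return attrs


def sector_warnings(attrs):
    """List the watched attributes whose raw count is above zero."""
    return [
        f"{name}: raw value {attrs[name]['raw']}"
        for name in WATCHED
        if name in attrs and attrs[name]["raw"]
    ]


# A few rows copied from the sample output above.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       17426
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(sector_warnings(parse_attributes(SAMPLE_ATTRS)))  # []
```

For the sample drive all three counters are zero, so the list of warnings is empty. Running the same check regularly and comparing the numbers over time is what reveals a growing count.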

The SMART Self-test log shows you the test results. As mentioned above, smartctl can also test the drive while it is online. To start a test, use the -t flag:

[root@balrog ~] # smartctl -t short /dev/hdg
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Mon Jul 4 14:48:17 2005

Use smartctl -X to abort test.

You can also use "-t long" to get an extended test. These tests will log any errors in the SMART logs and you can use "smartctl -a" to see the result.
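If you want to start a test from a script, something like the Python sketch below could be a starting point. It only builds the smartctl command line and reads the estimated duration out of the output; it does not run anything itself. The helper names selftest_command and wait_minutes are hypothetical, and the "Please wait N minutes" wording is taken from the smartctl 5.33 output above, so other versions may phrase it differently.

```python
import re

# The two test types covered in this wiki entry.
SELF_TESTS = ("short", "long")


def selftest_command(device, kind="short"):
    """Build the argument list for 'smartctl -t <kind> <device>'."""
    if kind not in SELF_TESTS:
        raise ValueError(f"unsupported self-test type: {kind}")
    return ["smartctl", "-t", kind, device]


def wait_minutes(output):
    """Extract the estimated duration from smartctl's output, or None."""
    match = re.search(r"Please wait (\d+) minutes", output)
    return int(match.group(1)) if match else None


# Text copied from the sample run above.
SAMPLE_RUN = """\
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Mon Jul 4 14:48:17 2005
"""

if __name__ == "__main__":
    print(" ".join(selftest_command("/dev/hdg", "short")))
    print(wait_minutes(SAMPLE_RUN))
```

The returned list can be passed to subprocess.run() as root. After the reported number of minutes has passed, "smartctl -a" should show the new entry in the self-test log.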

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

Here you can see three test runs, two successful and one failure. The failure occurred at LBA 158906040. If you get errors like that, I recommend getting your manufacturer's tools to test and, if possible, repair the sector(s). Advanced users can also try to use dd, both to salvage the data in the sector and to write to the sector in order to trigger the sector reallocation.
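For the dd approach, the Python sketch below shows what the two commands might look like. It only builds and prints them as a dry run. It assumes 512-byte sectors, and the function name sector_dd_commands and the backup file name sector.bin are made up for this example. The second command overwrites the sector with zeros, so whatever was stored there is lost unless the first command managed to read it; double-check the device name and the LBA before running anything like this by hand.

```python
def sector_dd_commands(device, lba, sector_size=512, backup="sector.bin"):
    """Return (read_cmd, write_cmd) as argument lists for one sector.

    read_cmd tries to copy the sector at `lba` into `backup`.
    write_cmd overwrites that sector with zeros, which is the write
    that should make the drive reallocate it. This is DESTRUCTIVE.
    """
    read_cmd = [
        "dd", f"if={device}", f"of={backup}",
        f"bs={sector_size}", f"skip={lba}", "count=1",
    ]
    write_cmd = [
        "dd", "if=/dev/zero", f"of={device}",
        f"bs={sector_size}", f"seek={lba}", "count=1",
    ]
    return read_cmd, write_cmd


if __name__ == "__main__":
    # Dry run only: print the commands for the failing LBA from the log.
    for cmd in sector_dd_commands("/dev/hdg", 158906040):
        print(" ".join(cmd))
```

With bs set to the sector size, the skip and seek values are simply the LBA. A quick sanity check is that the LBA times 512 stays below the User Capacity reported in the information section.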

Manufacturer tools can be found here:


 * Hitachi/IBM Drive Fitness Tool: http://www.hitachigst.com/hdd/support/download.htm
 * Seagate Seatools: http://www.seagate.com/support/seatools/

For other manufacturers, please search the manufacturer's web page and update this document.