Diagnosis of a defective hard disk

Requirements:

Smartmontools is a set of utilities for checking and monitoring a hard disk through the “SMART” standard (Self-Monitoring, Analysis and Reporting Technology).

It contains two parts:

  • smartd, a daemon that allows you to check your disks periodically
  • smartctl, a utility to display the status of your disk

It is able to check the majority of modern disks.

Check of a server with a single hard drive

On a server with a single hard drive, checking the status of the disk is easy. Run the following command:

smartctl -a /dev/sda

Here sda refers to your disk. Replace it with the name of the device you want to check.

Check of a Dell server with multiple disks

H200

On this type of server, the disks are exposed as sg* devices.

Run the following commands:

smartctl -a -T permissive /dev/sg0
smartctl -a -T permissive /dev/sg1
smartctl -a -T permissive /dev/sg2

Sometimes the disks appear under higher device numbers.

If you don't get a result, try up to sg5.
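
The manual probing described above can be scripted. The following is a minimal sketch, not a finished tool: the function name probe_devices is hypothetical, and it assumes that a device is usable when the smartctl exit status has neither bit 0 (command line did not parse) nor bit 1 (device open failed) set.

```shell
# Print every device from the argument list that smartctl can open.
# probe_devices is a hypothetical helper name, not part of smartmontools.
probe_devices() {
    for dev in "$@"; do
        # Skip device nodes that do not exist on this machine.
        [ -e "$dev" ] || continue
        rc=0
        smartctl -i -T permissive "$dev" >/dev/null 2>&1 || rc=$?
        # Bits 0 and 1 of the exit status mean "command line did not parse"
        # and "device open failed". If neither is set, the device answered.
        if [ $((rc & 3)) -eq 0 ]; then
            echo "$dev"
        fi
    done
}
```

For example, probe_devices /dev/sg0 /dev/sg1 /dev/sg2 /dev/sg3 /dev/sg4 /dev/sg5 prints the devices that responded, which you can then pass to smartctl -a -T permissive.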

H310

You have two options:

  • megaclisas-status
  • megacli

The first shows the status of the RAID. The second lists the physical disks, so that smartctl can read the SMART values of each one.

aptitude update
aptitude install megaclisas-status megacli
megaclisas-status

To get the SMART values, you need to use the following small script:

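# Logical drive exposed by the RAID controller; smartctl reaches the physical disks through it.
DEVICE=/dev/sda
# List the physical disks on adapter 0 and query each one by its device ID.
for i in $(megacli -pdlist -a0 | grep Id | cut -d":" -f2); do
    echo "============================== $i =============================="
    smartctl -s on -a -d megaraid,${i} ${DEVICE} -T permissive
done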

Check of an HP server with multiple disks

To check the status of an HP server, please refer to the ssacli documentation.


Using smartd to monitor the disks

As mentioned above, smartd monitors your disks and, depending on its configuration, alerts you by email in case of failure.

There is no guarantee that the SMART values will reveal a problem before the disk fails completely. In many cases, however, you will have enough time to make a backup and order a replacement disk.

The following example explains the configuration on a Debian-like server.

All commands have to be run as root or with sudo.

First, enable SMART (-s on), automatic offline testing (-o on) and attribute autosave (-S on) on the disk:

smartctl -s on -o on -S on /dev/sda

Now check that the disk does not report any errors:

smartctl -H /dev/sda
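
Besides the text it prints, smartctl reports its findings through its exit status, which the smartctl manual page documents as a bit mask. This is convenient in scripts. The following is a minimal sketch that translates the status into messages; the function name explain_smartctl_status is hypothetical and the wording of the messages is paraphrased from the manual page.

```shell
# Translate a smartctl exit status (a bit mask) into readable messages.
# explain_smartctl_status is a hypothetical helper name.
explain_smartctl_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then echo "no problem reported"; return 0; fi
    if [ $((rc & 1)) -ne 0 ]; then echo "command line did not parse"; fi
    if [ $((rc & 2)) -ne 0 ]; then echo "device open failed"; fi
    if [ $((rc & 4)) -ne 0 ]; then echo "a SMART or ATA command failed"; fi
    if [ $((rc & 8)) -ne 0 ]; then echo "SMART status check returned DISK FAILING"; fi
    if [ $((rc & 16)) -ne 0 ]; then echo "prefail attributes at or below threshold"; fi
    if [ $((rc & 32)) -ne 0 ]; then echo "attributes were at or below threshold in the past"; fi
    if [ $((rc & 64)) -ne 0 ]; then echo "device error log contains errors"; fi
    if [ $((rc & 128)) -ne 0 ]; then echo "self-test log contains errors"; fi
    return 0
}
```

A typical use is to run smartctl -H /dev/sda and then call explain_smartctl_status $? on the next line.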

To run the checks automatically, edit the file

/etc/smartd.conf

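On Debian, the file starts with the following line, which makes smartd scan for all disks:

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

smartd ignores every line that follows a DEVICESCAN entry. Add a line like the following above it, or comment the DEVICESCAN line out: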

/dev/sda -a -d sat -o on -S on -s (S/../.././01|L/../../1/03) -m root -M exec /usr/share/smartmontools/smartd-runner

The example above schedules the following tests on the disk:

  • A short test (S) every day at 01:00 (01)
  • A long test (L) every Monday (1) at 03:00 (03)
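
smartd compares the -s expression with a string of the form T/MM/DD/d/HH: test type, month, day of the month, day of the week (1 for Monday, 7 for Sunday) and hour. The following sketch imitates that comparison with grep, so that you can try out an expression before putting it in smartd.conf. The function name schedule_matches is hypothetical, and grep -E is only assumed to behave closely enough to smartd's own matching for simple expressions like this one.

```shell
# Succeed if the smartd -s expression $1 would select test type $2
# at month $3, day of month $4, day of week $5 (1 = Monday), hour $6.
# schedule_matches is a hypothetical helper name.
schedule_matches() {
    # Build the string smartd would test, for example "S/10/29/5/01",
    # and require the expression to match the whole line (-x).
    printf '%s/%s/%s/%s/%s\n' "$2" "$3" "$4" "$5" "$6" | grep -Eqx -- "$1"
}
```

For example, schedule_matches '(S/../.././01|L/../../1/03)' L 11 01 1 03 succeeds, because that date falls on a Monday at 03:00, while the same call with day of week 5 fails.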

Now enable the daemon. Uncomment the following line:

start_smartd=yes

in the file

/etc/default/smartmontools

To finish, start the daemon:

service smartmontools start

If a problem is detected, a mail will be sent to root (-m root). We recommend redirecting the mail of the root user to your own mailbox, or putting another address directly after -m.

Manual launch of SMART tests

To launch the SMART tests manually, run the following commands:

smartctl -t short /dev/sda
smartctl -t long /dev/sda

Once the tests have completed, you can view the results with the command:

smartctl -l selftest /dev/sda
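
On a long log it can help to filter out the entries that finished cleanly. The following is a minimal sketch, assuming the usual ATA layout in which each log entry starts with "#" followed by its number and a clean run has the status "Completed without error". The function name unclean_selftests is hypothetical, and other device types may format the log differently.

```shell
# Read "smartctl -l selftest" output on stdin and print only the log
# entries whose status is not "Completed without error".
# unclean_selftests is a hypothetical helper name.
unclean_selftests() {
    # Entry lines look like "# 1  Short offline  <status> ..."; header
    # lines do not start with "#" and are dropped.
    awk '/^# *[0-9]+/ && !/Completed without error/'
}
```

A possible use is smartctl -l selftest /dev/sda | unclean_selftests. Tests that were aborted or are still running are printed as well, so read the status column before concluding that the disk is faulty.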

If you detect errors on the disk

Open a ticket and ask for the replacement of the disk. Include the serial number, as well as the output of smartctl in the ticket:

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1KQ513170   <----------------------- serial number
Firmware Version: 1AA01113
User Capacity:    1 000 204 886 016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Fri Oct 29 11:20:27 2010 CEST
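
To copy the serial number into the ticket without retyping it, you can extract it from the information section. The following is a minimal sketch, assuming the "Serial Number:" label shown above; the function name extract_serial is hypothetical.

```shell
# Read "smartctl -i" output on stdin and print the serial number.
# extract_serial is a hypothetical helper name.
extract_serial() {
    # Split each line at the colon and the spaces that follow it, then
    # print the value part of the "Serial Number:" line.
    awk -F': *' '/^Serial Number:/ { print $2 }'
}
```

A possible use is smartctl -i /dev/sda | extract_serial.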