Backstory: I took my two Seagate 32000641AS drives out of an old DNS-323 disk array and put them in my Dell Precision T5600, since the DNS-323 was old and becoming a pain to manage. I then created two separate XFS filesystems on them and mounted them. The disks I replaced these with were smaller 500 GB drives; I copied the data from them onto a 300 GB encrypted USB thumb drive, and after I put in the Seagates I copied the data back. I use one for local backup and the other to run a VirtualBox VM, but both drives got the same data from the USB drive.

I noticed this today in the syslog for both drives:

smartd[809]: Device: /dev/sda [SAT], 19 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sda [SAT], 19 Offline uncorrectable sectors
smartd[809]: Device: /dev/sda [SAT], 19 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sda [SAT], 19 Offline uncorrectable sectors

However, there are no performance issues. Also, when I was copying the data back from the USB drive, one directory would not copy back over; it kept giving me an I/O error. I didn't need it, so I just didn't copy it back.

Is this the reason I'm getting the exact same errors on both drives, or is it a coincidence?

So I did mkfs.ext4 on /dev/sdb1

and I'm now running badblocks -s -v -n -f /dev/sda

Checking for bad blocks in non-destructive read-write mode
From block 0 to 1953514583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern:   1.43% done, 36:36 elapsed. (0/0/0 errors)

So far badblocks has reported no errors, but then I got this in the syslog:

smartd[809]: Device: /dev/sda [SAT], 19 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sda [SAT], 19 Offline uncorrectable sectors
smartd[809]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117
smartd[809]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
smartd[809]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
smartd[809]: Device: /dev/sdb [SAT], 35 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sdb [SAT], 35 Offline uncorrectable sectors
smartd[809]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 113
smartd[809]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 35 to 49
keith

3 Answers

Yup!

SMART data is produced by the hard drive firmware itself; it isn't really possible for you to get false positives there. If the SMART data says you have bad blocks on both your drives, then you have bad blocks. Some small chunks of the spinning rust have gone bad, and there's no way to fix them. This is a slow process; it happened as the drives were aging in their previous home. The fact that the numbers were identical at first is interesting, but not really shocking; the drives would have come from the same manufacturing lot, and thus have very similar properties. If you'd like to be sure, you can look into firmware updates; it's possible for bad firmware to make a drive falsely report errors. However, the likely explanation is the simple one: bad blocks on both.
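
If you want the drive to double-check itself, you can kick off its built-in extended self-test with smartctl, something like this (substitute your actual device name):

    # Start the drive's long (extended) self-test; it runs inside the drive's
    # firmware and does not touch the data on the disk.
    sudo smartctl -t long /dev/sda

    # Check progress and results later; the long test can take hours on a 2 TB drive.
    sudo smartctl -l selftest /dev/sda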

Now, it's not the end of the world; you've lost some data (on the drive), and the drives are likely to lose more, or fail outright. But you can keep using them, provided that said data also lives on another, likely-good drive. Depending on your RAID setup, it should maintain the two copies whenever a block comes up bad. Don't RAID the two aging drives into one, since when they go, they'll go together. And run xfs_scrub (or the equivalent for your preferred filesystem) over the data on a routine basis, to detect more bad blocks.
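
For the routine check, something like the following works on a mounted XFS filesystem, provided your kernel and xfsprogs support online scrubbing; the mount point below is just a placeholder for wherever the data actually lives:

    # Check-only metadata scrub of a mounted XFS filesystem; run it by hand
    # or put it in a weekly cron job. /mnt/backup is a placeholder.
    sudo xfs_scrub -n /mnt/backup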

First, regarding this output:

smartd[809]: Device: /dev/sda [SAT], 19 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sda [SAT], 19 Offline uncorrectable sectors
smartd[809]: Device: /dev/sda [SAT], 19 Currently unreadable (pending) sectors
smartd[809]: Device: /dev/sda [SAT], 19 Offline uncorrectable sectors

That doesn't look like either of your two Seagate 32000641AS drives, but rather like your boot drive, sda. In either case, because the Seagates are 2 TB drives, they should be partitioned with GPT, not MBR.
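
If you want to confirm how each drive is currently partitioned, parted will show it:

    # The "Partition Table:" line reports gpt or msdos (msdos = MBR).
    sudo parted /dev/sda print
    sudo parted /dev/sdb print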

Regarding:

badblocks -s -v -n -f /dev/sda

If you look at man badblocks, you'll see that it recommends not running badblocks directly...

       For this reason, it is strongly recommended that users not run badblocks
       directly, but rather use the -c option of the e2fsck and mke2fs programs.

The correct way to bad block a disk is:

sudo e2fsck -fccky /dev/sdXX # where sdXX is the partition (e.g. sdb1) you want to test

The -k is important, because it saves the previous bad block table and adds any new bad blocks to that table. Without -k, you lose all of the prior bad block information.

The -fccky parameters are:

    -f     Force checking even if the file system seems clean.

    -c     This option causes e2fsck to use badblocks(8) program to do a
           read-only scan of the device in order to find any bad blocks. If
           any bad blocks are found, they are added to the bad block inode to
           prevent them from being allocated to a file or directory. If this
           option is specified twice, then the bad block scan will be done
           using a non-destructive read-write test.

    -k     When combined with the -c option, any existing bad blocks in the
           bad blocks list are preserved, and any new bad blocks found by
           running badblocks(8) will be added to the existing bad blocks
           list.

    -y     Assume an answer of `yes' to all questions; allows e2fsck to be
           used non-interactively. This option may not be specified at the
           same time as the -n or -p options.
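
Putting that together, a sketch of the whole check on one of your new ext4 partitions (using /dev/sdb1 from your mkfs.ext4 step as the example; the filesystem must be unmounted first):

    sudo umount /dev/sdb1          # e2fsck needs the filesystem unmounted
    sudo e2fsck -fccky /dev/sdb1   # non-destructive read-write scan, keeps the old bad block list
    sudo dumpe2fs -b /dev/sdb1     # list the bad blocks now recorded in the filesystem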

heynnema

The SMART data indicates that the drives are going bad, but are not yet truly dead. I would not use them for anything important, but they are arguably still usable as temporary storage for data you don’t care about.

SMART data itself is in general not prone to false positives (the only realistic way to get one is a bug in the drive firmware). 'Current pending sectors' and 'offline uncorrectable sectors' refer to bad sectors on the physical media of the drive that are in different states of attempted repair (current pending means the drive still wants to try to fix them; offline uncorrectable means it can't, and will almost always be less than or equal to current pending). The drive will, however, automatically remap bad sectors to spare sectors that are not otherwise user-accessible, so it will still be mostly usable without issue, other than having higher latency when reading from the remapped sectors.
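
If you want to watch those counters directly rather than waiting for smartd to log a change, something like this works (adjust the device name as needed):

    # Show the raw SMART counters relevant here.
    sudo smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'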

As such, given the actual numbers, I would not expect badblocks to report any errors. All it does is read the whole drive (and, in some modes, write to the whole drive) and count the I/O errors it gets back. Decades ago, when hard drives were not intelligent enough to remap bad sectors themselves, this was a sensible way to check for bad sectors and avoid them. These days, if a drive actually throws I/O errors on read/write, it's almost certainly well beyond salvageable at that point (and, in fact, most modern filesystems other than ext4 don't have any support for ignoring known bad sectors).

The ‘normal’ test I use instead in cases like this is the following (there's a scripted sketch of the same loop after the list):

  1. Write every sector of the drive, usually using dd if=/dev/zero of=/dev/sda bs=65536. Obviously update /dev/sda to the drive being tested. The block size is not super important, but using 64k instead of the default usually speeds things up significantly. This command will exit with an error at the end regardless (because /dev/zero is effectively infinite), so you need to confirm that it wrote the correct number of blocks (with a 64k block size on your 2 TB drives, that's roughly 30.5 million blocks written).
  2. Read every sector of the drive, usually using dd if=/dev/sda of=/dev/null bs=65536. Again, update /dev/sda to the drive being tested. The block size is again not super important, but using 64k usually speeds things up significantly.
  3. Run smartctl -H -A /dev/sda to get the state of the SMART counters for the drive. You want to look for the lines that specify Self-Assessment (if that line says the self-assessment failed, scrap the drive, that means the drive itself has failed per manufacturer specifications), as well as Current Pending Sectors and Offline Uncorrectable Sectors. Note the values of Current Pending Sectors and Offline Uncorrectable Sectors.
  4. Repeat from step 1. If the values you see in step 3 have changed after the second pass, repeat from step 1 again; if not, the drive is probably fine for the time being. If they change a second time, the drive should probably be scrapped irrespective of what the firmware's self-assessment says. If they do not change between the second and third pass, the drive is at least stable and possibly usable, but only for unimportant data.
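
A minimal scripted sketch of that loop, run as root, assuming the placeholder /dev/sdX is the drive under test and nothing on it is mounted (the write passes destroy everything on the drive):

    #!/bin/sh
    # Sketch of the write/read/SMART-check loop above. DESTRUCTIVE: each write
    # pass overwrites the entire drive. DEV is a placeholder -- set it to the
    # drive under test and make sure nothing on it is mounted.
    DEV=/dev/sdX

    for pass in 1 2 3; do
        echo "=== pass $pass: write ==="
        dd if=/dev/zero of="$DEV" bs=65536 conv=fsync status=progress

        echo "=== pass $pass: read back ==="
        dd if="$DEV" of=/dev/null bs=65536 status=progress

        echo "=== pass $pass: SMART counters ==="
        smartctl -H -A "$DEV" | \
            grep -E 'overall-health|Current_Pending_Sector|Offline_Uncorrectable'
    done

Compare the counters printed after each pass, as described in step 4.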

As far as both drives showing the same numbers goes, that's unusual, but not wholly unexpected. They're probably from the same manufacturing lot, so it's entirely possible they have identical manufacturing defects. It's also possible the firmware is bad, but figuring that out requires actually updating the drive firmware (Seagate lets you search for firmware updates by drive serial number on their support site).
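
If you do go looking for a firmware update, the serial number and the currently installed firmware revision are both in the SMART identity data:

    sudo smartctl -i /dev/sda | grep -E 'Serial Number|Firmware Version'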