Matt Connolly's Blog

Monthly Archives: November 2013

Time Machine Backups and silent data corruptions

I’ve recently heard many folk talking about Time Machine backup strategies. To do it well, you really do need to back up your backup, as Time Machine can “eat itself”, especially when backing up over a network.

Regardless of whether your Time Machine backup is to a locally attached disk or a network drive, when you make a backup of your backup, you want to make sure it’s valid, otherwise you’re propagating a corrupt backup.

So how do you know if your backup is corrupt? You could read it from beginning to end, but that only catches corruption the drive itself can detect and report as a read error. Disk verify, fsck, and similar tools go further and validate the file system structures, but they still don’t check your actual data.

There are “silent corruptions”, which is where the data you wrote to the disk comes back corrupted (different data, not a read error). “That never happens”, you might say, but how would you know?

I have two servers running SmartOS using data stored on ZFS. I ran a data scrub on them, and both reported checksum errors. This is exactly the silent data corruption scenario.

ZFS checksums all data when it is stored, and if your data is in a RAIDZ or mirror configuration, it will also self-heal. This means that instead of returning an error, ZFS will fetch the data from a good drive (or reconstruct it from parity) and write a clean copy of that block, so that its durability matches your setup again.
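To make the self-healing idea concrete, here is a toy sketch in Python. It is not how ZFS is implemented: the `Mirror` class, its in-memory "disks" and the use of SHA-256 are all assumptions made for illustration. It only shows the principle of verifying a checksum on read and rewriting a bad copy from a good one.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of one block (SHA-256 chosen for illustration only)."""
    return hashlib.sha256(block).hexdigest()


class Mirror:
    """Hypothetical toy mirror: every block is written to all 'disks',
    and its checksum is kept separately from the data."""

    def __init__(self, n_disks: int = 2):
        self.disks = [dict() for _ in range(n_disks)]  # addr -> bytes
        self.sums = {}                                 # addr -> checksum
        self.repaired = 0                              # copies rewritten so far

    def write(self, addr: int, data: bytes) -> None:
        self.sums[addr] = checksum(data)
        for disk in self.disks:
            disk[addr] = data

    def read(self, addr: int) -> bytes:
        expected = self.sums[addr]
        good = None
        bad = []
        for disk in self.disks:
            if checksum(disk[addr]) == expected:
                good = disk[addr]
            else:
                bad.append(disk)  # silent corruption: data came back, but wrong
        if good is None:
            raise IOError(f"block {addr}: no copy matches its checksum")
        for disk in bad:
            disk[addr] = good     # self-heal: overwrite the bad copy
            self.repaired += 1
        return good


if __name__ == "__main__":
    m = Mirror()
    m.write(0, b"backup data")
    m.disks[1][0] = b"backup dat\x00"  # simulate a silently corrupted copy
    print(m.read(0), "repaired copies:", m.repaired)
```

Without the stored checksum, the read from the second disk would simply have returned the wrong bytes with no error, which is the whole problem.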

Here are the specifics of my corruptions:

On a Xeon system with ECC RAM, the affected drive is a Seagate 1TB Barracuda 7200rpm, ST31000524AS, approximately 1 year old.

  pool: zones
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
   
  scan: resilvered 72.4M in 0h48m with 0 errors on Mon Nov 18 13:28:16 2013
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE   2.61K  366k   635
            c1t4d0s1  ONLINE       0     0     0
        logs
          c1t2d0s0    ONLINE       0     0     0
        cache
          c1t2d0s1    ONLINE       0     0     0

errors: No known data errors

On a Celeron system with non-ECC RAM, the affected drive is a Samsung 2TB low power drive, approximately 2 years old.

  pool: zones
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 8K in 12h51m with 0 errors on Thu Nov 21 00:44:25 2013
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
            c0t2d0p2  ONLINE       0     0     2
        logs
          c0t0d0s0    ONLINE       0     0     0
        cache
          c0t0d0s1    ONLINE       0     0     0

errors: No known data errors
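If you want to check for this automatically rather than reading the table by eye, something like the following Python sketch could pick out the devices with non-zero counters. The function name `devices_with_errors` and the regular expression are my own assumptions, written against the two outputs above; other `zpool status` layouts may need adjusting. The counters are kept as the strings `zpool` printed (for example `2.61K`) rather than converted to numbers.

```python
import re

# Matches config rows such as "    c1t0d0s0  ONLINE   2.61K  366k   635".
# The list of states is an assumption based on common zpool output.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def devices_with_errors(status_text: str) -> list:
    """Return rows of the config table whose READ/WRITE/CKSUM are not all 0."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, "logs"/"cache" labels, status text, etc.
        name, _state, read, write, cksum = match.groups()
        if (read, write, cksum) != ("0", "0", "0"):
            flagged.append(
                {"name": name, "read": read, "write": write, "cksum": cksum}
            )
    return flagged
```

The text to feed in would be the output of `zpool status`, captured however you like (for example with `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`).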

Any error is scary, but checksum errors are even more so.

I had previously seen thousands of checksum errors on a Western Digital Green drive. I stopped using it and threw it in the bin.

I have other drives that are HFS-formatted, and I have no way of knowing whether they have any corrupted blocks.

So unless your data is being checksummed, you are not protected from data corruption, and a backup of a backup could easily be propagating corrupted data.
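On a file system without checksums, the nearest thing you can do yourself is keep a manifest of file checksums and compare against it later. The Python sketch below is one rough way to do that; `build_manifest` and `verify` are names I made up, and SHA-256 is just a reasonable choice. It is detection only: it cannot repair anything, and it cannot tell corruption apart from a file that was legitimately changed, so it suits data that is not supposed to change.

```python
import hashlib
import os


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> dict:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify(root: str, manifest: dict) -> list:
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

You would build the manifest once from a copy you trust, store it somewhere other than the disk being checked, and run `verify` before copying the backup anywhere else.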

I dream of a day when we can have ZFS natively on Mac. And if it can’t be done for whatever ‘reasons’, at least give us the features from ZFS that we can use to protect our data.
