I’ve recently heard many folk talking about Time Machine backup strategies. To do it well, you really do need to back up your backup, as Time Machine can “eat itself”, especially when backing up over a network.
Regardless of whether your Time Machine backup is to a locally attached disk or a network drive, when you make a backup of your backup, you want to make sure it’s valid. Otherwise you may be propagating a corrupt backup.
So how do you know if your backup is corrupt? You could read it from beginning to end. But this would only protect you from corruption that the drive itself can detect and report as a read error. Disk verify, fsck, and others go further and validate the file system structures, but still not your actual data.
There are also “silent corruptions”, where the data you wrote to the disk comes back different and no read error is reported. “That never happens”, you might say, but how would you know?
I have two servers running SmartOS using data stored on ZFS. I ran a data scrub on them, and both reported checksum errors. This is exactly the silent data corruption scenario.
ZFS features full checksumming of all data when stored, and if your data is in a RAIDZ or mirror configuration, it will also self-heal. This means that instead of returning an error, ZFS will go fetch the data from a good drive and also make another clean copy of that block so that its durability matches your setup.
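To make the self-healing idea concrete, here is a toy sketch in Python. It is not how ZFS is implemented; the `ToyMirror` class, its in-memory "drives", and the use of SHA-256 are all assumptions made purely for illustration. It only shows the principle: keep a checksum apart from the data, verify it on every read, and repair a bad copy from a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest for a block (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Toy model of a mirrored block store. Hypothetical, not ZFS code."""

    def __init__(self, copies: int = 2):
        # Each "drive" is just a dict mapping block id -> stored bytes.
        self.drives = [dict() for _ in range(copies)]
        # Checksums are stored separately from the blocks they describe.
        self.checksums = {}

    def write(self, block_id: int, data: bytes) -> None:
        self.checksums[block_id] = checksum(data)
        for drive in self.drives:
            drive[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        bad_drives = []
        for drive in self.drives:
            data = drive[block_id]
            if checksum(data) == expected:
                # Found a good copy: rewrite every bad copy seen so far.
                for bad in bad_drives:
                    bad[block_id] = data
                return data
            bad_drives.append(drive)
        # No copy matches its checksum, so the damage cannot be repaired.
        raise IOError(f"block {block_id}: all copies failed checksum")
```

With a single drive there is nothing to repair from, so the best this model could do is report the error. That is the difference between detecting corruption and healing it.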
Here are the specifics of my corruptions:
On a Xeon system with ECC RAM, the affected drive is a Seagate 1TB Barracuda 7200rpm, ST31000524AS, approximately 1 year old.
pool: zones
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: resilvered 72.4M in 0h48m with 0 errors on Mon Nov 18 13:28:16 2013
config:
NAME      STATE   READ   WRITE  CKSUM
zones     ONLINE  0      0      0
mirror-0  ONLINE  0      0      0
c1t1d0s0  ONLINE  0      0      0
c1t0d0s0  ONLINE  2.61K  366k   635
c1t4d0s1  ONLINE  0      0      0
logs
c1t2d0s0  ONLINE  0      0      0
cache
c1t2d0s1  ONLINE  0      0      0
errors: No known data errors
On a Celeron system with non-ECC RAM, the affected drive is a Samsung 2TB low power drive, approximately 2 years old.
pool: zones
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 8K in 12h51m with 0 errors on Thu Nov 21 00:44:25 2013
config:
NAME      STATE   READ   WRITE  CKSUM
zones     ONLINE  0      0      0
raidz1-0  ONLINE  0      0      0
c0t1d0    ONLINE  0      0      0
c0t3d0    ONLINE  0      0      0
c0t2d0p2  ONLINE  0      0      2
logs
c0t0d0s0  ONLINE  0      0      0
cache
c0t0d0s1  ONLINE  0      0      0
errors: No known data errors
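If you want to watch for this automatically, the listings above are simple enough to scan with a short script. The following Python sketch pulls out any row with a non-zero READ, WRITE, or CKSUM counter from text in the same shape as the output shown. The function names are my own, and treating the K/M/G suffixes as powers of 1024 is an assumption, so read the expanded numbers as approximate.

```python
import re

# Assumption: abbreviated counters such as "2.61K" use powers of 1024.
SUFFIX = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
COUNT = re.compile(r"(\d+(?:\.\d+)?)([KkMmGg]?)")


def parse_count(text: str) -> int:
    """Turn '0', '635', '2.61K' or '366k' into an (approximate) integer."""
    match = COUNT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a counter: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * SUFFIX[suffix.upper()])


def devices_with_errors(status_text: str):
    """Return (name, read, write, cksum) for every row with a non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows have exactly five columns: NAME STATE READ WRITE CKSUM.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        try:
            read, write, cksum = (parse_count(f) for f in fields[2:])
        except ValueError:
            # Five words but not counters, e.g. "errors: No known data errors".
            continue
        if read or write or cksum:
            flagged.append((fields[0], read, write, cksum))
    return flagged
```

Fed the text of either listing above, `devices_with_errors` should return only the one affected drive. How you capture that text in the first place (for example by running `zpool status` from a scheduled job) is left out of the sketch.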
Any error is scary, but checksum errors are even more so.
I had previously seen thousands of checksum errors on a Western Digital Green drive. I stopped using it and threw it in the bin.
I have other drives that are HFS formatted. I have no way of knowing if they have any corrupted blocks.
So unless your data is being checksummed, you are not protected from silent data corruption, and a backup of a backup could easily be propagating corrupted data.
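On a file system that doesn't checksum your data, one stopgap is to do it yourself at the file level. Below is a minimal Python sketch, assuming the files are not supposed to change between checks (for example, an archived copy of a backup). The function names and the choice of SHA-256 are mine. It can only detect corruption, not repair it, and it can't tell corruption apart from a legitimate edit.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Return (relative path, problem) for every file that no longer matches."""
    root = Path(root)
    problems = []
    for relative, expected in manifest.items():
        path = root / relative
        if not path.is_file():
            problems.append((relative, "missing"))
        elif file_digest(path) != expected:
            problems.append((relative, "checksum mismatch"))
    return problems
```

The idea is to call `build_manifest` once when the copy is made, keep the result somewhere other than the disk it describes, and run `verify` before copying that data onward. An empty list means every file still matches what was originally written.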
I dream of a day when we can have ZFS natively on Mac. And if it can’t be done for whatever ‘reasons’, at least give us the features from ZFS that we can use to protect our data.