Matt Connolly's Blog

my brain dumps here…


Happy Holidays

This is a quick happy holidays and thank you to all the people and companies that have done great things in 2013. In no particular order:

Podcasters:

I’ve enjoyed many a podcast episode this year. My favourites are Edge Cases, featuring Wolf Rentzsch and Andrew Pontious; Accidental Tech Podcast, featuring John Siracusa, Casey Liss and Marco Arment; and RailsCasts by Ryan Bates.
Thank you all for your hard work putting your respective shows together. Your efforts are greatly appreciated, and I hope you are getting enough out of it so that it’s worthwhile continuing in 2014!

Companies:

JetBrains, makers of RubyMine. These guys pump out great work. If you’re keen to get involved in the early access program you can get nightly or weekly builds. Twice this year I’ve submitted a bug and within a week had it verified by JetBrains, fixed, in a build and in my hands. Their CI system even updates the bug with the build number that includes the fix. Seriously impressive. They set the bar so high, I challenge any company (including myself) to match their effective communication and rapid turnaround on issues.

Joyent, for actually innovating in the cloud, and for your contributions to open source projects such as Node.js and SmartOS! Pretty impressive community engagement, not only in open source code, but in community events too… What a shame I don’t live in San Francisco to attend and thank you guys in person.

GitHub, for helping open source software and providing an awesome platform for collaboration. So many projects benefit from being on GitHub.

Apple, thanks for making great computers and devices. Well done on 64-bit ARM. The technology improvements in iOS 7 are great; however, my new iPhone 5S doesn’t feel a single bit faster than my previous iPhone 5, due to excessive use of ease-out animations, which have no place in a user interface. Too many of my bug reports have been closed as “works as intended”, when the problem is in the design, not the implementation. Oh well.

Products / Services:

Strava has helped me improve my cycling and fitness. The website and iPhone apps are shining examples of a great user experience: they work well and are easy to use, functional and good looking. Thanks for a great product.

Reveal App is a great way to break down the UI of an iOS app. Awesome stuff.

Twitter has been good, mostly because of how people use it. I suppose it’s more thanks to the people on Twitter whom I follow.

Black Star Coffee, it’s how I start my day! Great coffee.

Technologies:

ZeroMQ: This is awesome. Reading the ZeroMQ guide was simply fantastic. It has changed my approach to communications in computing. Say goodbye to mutexes and locks, and hello to messages and event-driven applications. Special thanks to Pieter Hintjens for his attention to the ZeroMQ mailing lists, and to all of the contributors to a great project.

SmartOS: Totally the best way to run a hypervisor stack. The web page says it all: ZFS + DTrace + Zones + KVM. Get into it. Use ZFS. You need a file system that can verify your data. Hard drives cannot be trusted. I repeat, use ZFS.

Using ZFS Snapshots on Time Machine backups.

I use Time Machine because it’s an awesome backup program. However, I don’t really trust hard drives that much, and I happen to be a bit of a file system geek, so I back up my laptop and an iMac to another machine that stores the data on ZFS.

I first did this using Netatalk on OpenSolaris, then OpenIndiana, and now on SmartOS. Netatalk is an open source project for running AFP (Apple Filing Protocol) services on unix operating systems. It has great support for the new features in the protocol required for Time Machine. As far as I’m aware, all embedded NAS devices use this software.
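For reference, sharing a volume for Time Machine in Netatalk 2.x is a one-line entry in its volumes file. A minimal sketch, assuming a default-ish install (the file’s location varies by platform, and the options are from memory; the tm option is what marks the share as a Time Machine volume):

# AppleVolumes.default: one share definition per line
/MacBackup/MattBookPro "MattBookPro Backup" options:usedots,upriv,tm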

Sometimes, Time Machine “eats itself”. A backup will fail with a message like “Verification failed”, and you’ll need to start a new backup from scratch. I’ve never managed to recover a disk from this point using Disk Utility.

My setup is a RAIDZ of 3 x 2TB drives, giving 4TB of usable storage (and 2TB of redundancy). In the four years I’ve been running this, I have had 3 drives go bad and replaced them. They’re cheap drives, but I’ve never lost data to a bad disk or its replacement. I’ve also seen silent data corruptions, and know that ZFS has corrected them for me.
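Setting up and maintaining a pool like that is only a couple of commands. A minimal sketch, with hypothetical pool and device names:

# Create a three-drive RAIDZ pool.
zpool create backup raidz c1t0d0 c1t1d0 c1t2d0
# Replace a failed drive with a new one; ZFS resilvers the data automatically.
zpool replace backup c1t1d0 c1t5d0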

Starting a new backup is a pain, so what do I do?

ZFS Snapshots

I have a script, which looks like this:

#!/bin/sh
# Take a dated ZFS snapshot of the Time Machine backup filesystem on the server.
ZFS=zones/MacBackup/MattBookPro
SERVER=vault.local
# An optional argument becomes a suffix on the snapshot name.
if [ -n "$1" ]; then
  SUFFIX=_"$1"
fi
SNAPSHOT=`date "+%Y%m%d_%H%M"`$SUFFIX
echo "Creating zfs snapshot: $SNAPSHOT"
ssh -x "$SERVER" zfs snapshot "$ZFS@$SNAPSHOT"

This uses the zfs snapshot command to create a snapshot of the backup filesystem. There’s another one for my iMac backup. I run the script manually after each backup, once for each machine’s ZFS file system (directory). I’m working on an automatic solution that listens to system logs to know when the backup has completed and the volume is unmounted, but it’s not finished yet (like many things). Running the script takes about a second.
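For example (the script name here is hypothetical):

$ ./snapshot-backup.sh pre_upgrade
Creating zfs snapshot: 20131212_0643_pre_upgrade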

Purging snapshots

My current list of snapshots looks like this:

matt@vault:~$ zfs list -r -t all zones/MacBackup/MattBookPro
NAME                                      USED  AVAIL  REFER  MOUNTPOINT
zones/MacBackup/MattBookPro               574G   435G   349G  /MacBackup/MattBookPro
...snip...
zones/MacBackup/MattBookPro@20131124_1344 627M      -   351G  -
zones/MacBackup/MattBookPro@20131205_0813 251M      -   349G  -
zones/MacBackup/MattBookPro@20131212_0643 0         -   349G  -

The USED value at the top shows the space consumed by this file system and all of its snapshots. The USED column for each snapshot shows how much space that snapshot consumes on its own.

Purging old snapshots is a manual process for now. One day I’ll get around to keeping snapshots on a schedule like Time Machine’s hourly, daily and weekly rules.
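For now it’s just `zfs list` and `zfs destroy` by hand, something like this on the server (the snapshot name is one from the list above):

# List snapshots oldest-first, then destroy one by name.
zfs list -t snapshot -o name -s creation -r zones/MacBackup/MattBookPro
zfs destroy zones/MacBackup/MattBookPro@20131124_1344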

Rolling back

So when Time Machine goes bad, it’s as simple as rolling back to the latest snapshot, which was a known good state.

My steps are (sketched as commands below):

  1. shut down the netatalk service
  2. zfs rollback
  3. delete the netatalk inode database files
  4. restart the netatalk service
  5. rescan the directory to recreate inode numbers (using Netatalk’s `dbd -r` command)

This process is a little more involved, but still much faster than making a whole new backup.
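In command form it looks roughly like this. This is a sketch for my SmartOS box; the service name and database location are from my setup and from memory, so treat them as assumptions:

svcadm disable netatalk                                 # 1. stop the AFP service (SMF name is an assumption)
zfs rollback zones/MacBackup/MattBookPro@20131212_0643  # 2. roll back to the latest known good snapshot
rm -rf /MacBackup/MattBookPro/.AppleDB                  # 3. delete the inode (CNID) database files
svcadm enable netatalk                                  # 4. restart the AFP service
dbd -r /MacBackup/MattBookPro                           # 5. rescan the volume to recreate inode numbers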

The main reason for this is that HFS uses an “inode” number to uniquely identify each file on a volume. This is one trick that Mac Aliases use to track a file even if it changes name or moves to another directory. This concept doesn’t exist in other file systems, so Netatalk has to maintain a database of which numbers to use for which files. There are some rules: inode numbers can’t be reused, and they must not change for a given file.

Unfortunately, a ZFS rollback, like any other operation on the server that changes files without Netatalk knowing, ends up with files that have no inode number. The bigger problem seems to be deleted files leaving their inodes behind in that database. This tends to make Time Machine quite unhappy about using that network share. So after a rollback, my rule is to nuke Netatalk’s database and recreate it.

This violates the rule that inode numbers shouldn’t change (unless they magically come out the same, which I highly doubt), but it hasn’t seemed to cause a problem for me. Imagine plugging a new computer into a Time Machine volume: it has no knowledge of what the inode numbers were, so it just uses them as they are. It’s more likely to be an issue for Netatalk itself, scanning a directory and seeing inodes for files that are no longer there.

Recreating the Netatalk inode database can take an hour or two, but it’s local to the server and much faster than a complete network backup, which also loses your history.

Conclusion

This used to happen a lot, say once every 3-4 months when I first started doing it. This may have been due to bugs in Time Machine, bugs in Netatalk, or incompatibilities between them. It certainly wasn’t due to data corruption.

Pros:

  • Time Machine, yay!
  • ZFS durability and integrity.
  • ZFS snapshots allow point in time recovery of my backup volume.
  • ZFS on disk compression to save backup space!
  • Netatalk uses the standard AFP protocol, so the Time Machine volume can be accessed from your restore partition or a new Mac – no extra software required on the Mac!

Cons:

  • Effort: complexity to install, configure and manage Netatalk, etc.
  • Rollback time.
  • Network backups are slow.

As time has gone on, both Time Machine and Netatalk have improved substantially. I’ve also added an SSD cache to the server, and it is now swimmingly fast and reliable, and, thanks to ZFS, durable and free of corruption. I think I’ve had this happen only twice in the last year, and both times were on Mountain Lion. I haven’t had to do a single rollback since starting to use the Mavericks betas back around June.

Where to from here?

I’d still like to see a faster solution, and I have a plan: a network block device.

This would, however, require some software to be installed on the Mac, so it may not be as easy to use in a disaster recovery scenario.

ZFS has a feature called a “volume”. When you create one, it appears to the system running ZFS as another block device, just like a physical hard disk or a file. A file system can be created on this volume, which can then be mounted locally. I use this for the disks in virtual machines, and can snapshot them and roll them back just as if they were a file system tree of files.
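Creating one is a one-liner. A sketch with a hypothetical name and size; on illumos the device node then appears under /dev/zvol:

# Create a 300GB ZFS volume (zvol); it appears as a block device.
zfs create -V 300G zones/MacBackup/MattBookProVol
ls -l /dev/zvol/rdsk/zones/MacBackup/MattBookProVol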

There’s an existing server module that’s been around for a while: http://nbd.sourceforge.net

If this volume could be mounted across the network on a Mac, it could be formatted as HFS+ and Time Machine could back up to it as if it were a local disk, skipping all the slow sparse image file system work. And there is a lot of that work: my Time Machine backup of a Mac with a 256GB disk creates a whopping 57206 files in the bands directory of the sparse image. It’s a lot of work to traverse those files, even locally on the server.

This is my next best solution to actually having ZFS on the Mac. Whatever “reasons” Apple has for ditching it are not good enough, simply because we don’t know what they are. ZFS is a complex beast. Apple is good at simplifying things. It could be the perfect solution.

Time Machine Backups and silent data corruptions

I’ve recently heard many folks talking about Time Machine backup strategies. To do it well, you really do need to back up your backup, as Time Machine can “eat itself”, especially when doing network backups.

Regardless of whether your Time Machine backup is to a locally attached disk or a network drive, when you make a backup of your backup, you want to make sure it’s valid; otherwise you’re propagating a corrupt backup.

So how do you know if your backup is corrupt? You could read it from beginning to end, but this would only protect you from corruptions that can be detected by the drive itself. Disk verify, fsck and friends go further and validate the file system structures, but still not your actual data.

There are “silent corruptions”, which is where the data you wrote to the disk comes back corrupted (different data, not a read error). “That never happens”, you might say, but how would you know?

I have two servers running SmartOS using data stored on ZFS. I ran a data scrub on them, and both reported checksum errors. This is exactly the silent data corruption scenario.

ZFS checksums all data as it is stored, and if your data is in a RAIDZ or mirror configuration it will also self-heal. This means that instead of returning an error, ZFS fetches the data from a good drive and writes another clean copy of that block, so that its durability again matches your setup.
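Verifying every block on demand is a single command. On my pool, named zones:

zpool scrub zones      # read every block and verify it against its checksum
zpool status -v zones  # watch progress and any read/write/checksum error counts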

Here are the specifics of my corruptions:

On a Xeon system with ECC RAM, the affected drive is a Seagate 1TB Barracuda 7200rpm (ST31000524AS), approximately 1 year old.

  pool: zones
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 72.4M in 0h48m with 0 errors on Mon Nov 18 13:28:16 2013
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE   2.61K  366k   635
            c1t4d0s1  ONLINE       0     0     0
        logs
          c1t2d0s0    ONLINE       0     0     0
        cache
          c1t2d0s1    ONLINE       0     0     0

errors: No known data errors

On a Celeron system with non-ECC RAM, the affected drive is a Samsung 2TB low power drive, approximately 2 years old.

  pool: zones
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 8K in 12h51m with 0 errors on Thu Nov 21 00:44:25 2013
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
            c0t2d0p2  ONLINE       0     0     2
        logs
          c0t0d0s0    ONLINE       0     0     0
        cache
          c0t0d0s1    ONLINE       0     0     0

errors: No known data errors

Any errors are scary, but the checksum errors even more so.

I had previously seen thousands of checksum errors on a Western Digital Green drive. I stopped using it and threw it in the bin.

I have other drives that are HFS formatted. I have no way of knowing if they have any corrupted blocks.

So unless your data is being checksummed, you are not protected from data corruption, and making a backup of a backup could easily be propagating data corruptions.

I dream of a day when we can have ZFS natively on Mac. And if it can’t be done for whatever ‘reasons’, at least give us the features from ZFS that we can use to protect our data.

ZFS = Data integrity

So, for a while now, I’ve been experiencing crappy performance from a Western Digital Green drive (WD15EARS) I have in a ZFS mirror storing my Time Machine backups (using OpenIndiana and Netatalk).

Yesterday, the drive started reporting errors. Unfortunately, the system hung – that’s not so cool – ZFS is supposed to keep working when a drive fails… Aside from that, when I rebooted, the system automatically started a scrub to verify data integrity, and after about 10 minutes:

  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Thu Mar 10 10:19:42 2011
    1.68G scanned out of 1.14T at 107M/s, 3h5m to go
    146K resilvered, 0.14% done
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            c8t1d0s0  DEGRADED     0     0    24  too many errors  (resilvering)
            c8t0d0s0  ONLINE       0     0     0
        cache
          c12d0s0     ONLINE       0     0     0

errors: No known data errors

Check it out. It’s found 24 checksum errors on the Western Digital drive, but so far no data errors have been found, because the data was correct on the other drive.

That’s obvious, right? But what other operating system can tell the difference between the right and wrong data when both copies are there? Most RAID systems only detect a total drive failure; they don’t deal with incorrect data coming off the drive!

Sure, backing up over a network (Time Machine’s sparse image stuff) is *way* slower than a directly connected FireWire drive, but in my opinion it’s well worth doing it this way for the data integrity that you don’t get on a single USB or FireWire drive.

Thank you ZFS for keeping my data safe. B*gger off Western Digital for making crappy drives. I’m off to get a replacement today… what will it be? Samsung or Seagate?

ZFS for Mac Coming soon…

A little birdy told me, that there might be a new version of ZFS ported to Mac OS X coming up soon…

It seems the guys at Ten’s Complement are working on a port of ZFS at a much more recent version than what was left behind by Apple and forked as a Google Code project: http://code.google.com/p/maczfs/

On my Mac, I have installed MacZFS from that Google Code project. (I don’t have any ZFS volumes; it’s installed because I wanted to know what version it was up to.)

bash-3.2# uname -prs
Darwin 10.6.0 i386
bash-3.2# zpool upgrade
This system is currently running ZFS pool version 8.

All pools are formatted using this version.

My backup server at home is running OpenIndiana oi-148:

root@vault:~# uname -prs
SunOS 5.11 i386
root@vault:~# zpool upgrade
This system is currently running ZFS pool version 28.

All pools are formatted using this version.

Pretty exciting that we can get the same zpool version as the latest OpenIndiana… think of the backup/restore possibilities, sending a snapshot over to a remote machine.
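Something like this (a sketch; the dataset and pool names are hypothetical):

# Replicate a snapshot to a remote pool with zfs send/receive.
zfs snapshot rpool/Data@monday
zfs send rpool/Data@monday | ssh vault.local zfs receive tank/Data
# Later, send only what changed between two snapshots.
zfs send -i rpool/Data@monday rpool/Data@tuesday | ssh vault.local zfs receive tank/Data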

ZFS – dud hard drive slowing whole system

I have a low-power server running OpenIndiana oi-148. It has 4GB RAM and three drives, like so:

matt@vault:~$ zpool status
  pool: rpool
 state: ONLINE
 scan: resilvered 588M in 0h3m with 0 errors on Fri Jan  7 07:38:06 2011
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
        cache
          c12d0s0     ONLINE       0     0     0

errors: No known data errors

I’m running Netatalk file sharing for the Mac, and using it as a Time Machine backup server for my Mac laptop.

When files are copying to the server, I often see periods of a minute or so where network traffic stops. I’m convinced there’s a bottleneck on the storage side, because when this happens I can still ping the machine, and if I have an ssh window open I can still see output from a `top` command running smoothly. However, if I try to do anything that touches disk (eg `ls`), that command stalls. When it comes good, everything comes good: file copies across the network continue, etc.

If I have an ssh terminal session open and run `iostat -xn 5`, I see something like this:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   36.0  153.6 4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
    0.0  113.4    0.0 7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
    0.2  106.4    4.1 7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   73.2   25.7 9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
    0.0  226.6    0.0 24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
    0.2  127.6    3.4 12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   44.2    0.0 5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
    0.2   76.0    4.8 9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
    0.0   16.6    0.0 2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0   25.6  0.0  0.0    0.3    2.3   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   11.0    0.0 1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1   25.4   0   1 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.6    0.0 2182.4  9.0  1.0  511.3   56.8 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   16.6    0.0 2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   15.8    0.0 1959.2  9.0  1.0  569.6   63.3 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.4    0.0 2157.6  9.0  1.0  517.2   57.4 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   18.2    0.0 2256.8  9.0  1.0  494.5   54.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   14.8    0.0 1835.2  9.0  1.0  608.1   67.5 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    1.4    0.0    0.6  0.0  0.0    0.0    0.2   0   0 c8t0d0
    0.0   49.0    0.0 6049.6  6.7  0.5  137.6   11.2 100  55 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   55.4    0.0 7091.2  1.9  0.6   34.9    9.9  27  28 c12d0
    0.2  126.0    8.6 9347.7  1.4  0.1   11.4    0.6  20   7 c8t0d0
    0.0  120.8    0.0 9340.4  4.9  0.2   40.5    1.5  77  18 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   57.0  153.6 7271.2  1.8  0.5   31.0    9.4  26  28 c12d0
    0.2  108.4   12.8 6498.9  0.3  0.1    2.5    0.6   6   5 c8t0d0
    0.2  104.8    5.2 6506.8  4.0  0.2   38.2    1.4  67  15 c8t1d0

The stall occurs when the drive c8t1d0 is 100% waiting and doing only slow I/O, typically writing about 2MB/s. Meanwhile the other drive is all zeros… doing nothing.

The drives are:
c8t1d0 – Western Digital Green – SATA_____WDC_WD15EARS-00Z_____WD-WMAVU2582242
c8t0d0 – Samsung Silencer – SATA_____SAMSUNG_HD154UI_______S1XWJDWZ309550

I’ve installed smartmontools and run short and long self-tests on both drives, all resulting in no errors found.
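For reference, the tests were along these lines, using smartmontools’ smartctl (the device path is from my setup and may need adjusting):

smartctl -t short /dev/rdsk/c8t1d0s0   # run a short self-test
smartctl -t long /dev/rdsk/c8t1d0s0    # run an extended self-test
smartctl -a /dev/rdsk/c8t1d0s0         # report test results and error logs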

I expect that the c8t1d0 WD Green is the lemon here, and for some reason it gets stuck in periods where it can write no faster than about 2MB/s. Why? I don’t know…

Secondly, I wonder why the whole file system seems to hang at this time. Surely if the other drive is doing nothing, a web page could be served by reading from the available drive (c8t0d0) while the slow drive (c8t1d0) is stuck writing. Is this a bug in ZFS?

If anyone has any ideas, please let me know!

ZFS saved my Time Machine backup

For a while now, I’ve been using Time Machine to back up to an AFP share hosted by Netatalk on a low-powered OpenIndiana home server.

Last night, Time Machine stopped, with an error message: “Time Machine completed a verification of your backups. To improve reliability, Time Machine must create a new backup for you“.

Periodically I create ZFS snapshots of the volume containing my Time Machine backup. I haven’t enabled any automatic snapshots yet (like OpenIndiana/Solaris’s Time Slider service), so I just do it manually every now and then.

So I shut down netatalk, rolled back the snapshot, rebuilt the netatalk database, restarted netatalk, and was back in business:

# /etc/init.d/netatalk stop
# zfs rollback rpool/MacBackup/TimeMachine@20100130
# /usr/local/bin/dbd -r /MacBackup/TimeMachine
# /etc/init.d/netatalk start

I lost only a day or two of incremental backups, which was much more palatable than having to do another complete backup of >250GB.

ZFS is certainly proving to be useful, even in a low powered home backup scenario.

Western Digital Green Lemon

I have an OpenSolaris backup machine with 2 x 1.5TB drives mirrored. One is a Samsung Silencer, the other is a Western Digital Green drive. The Silencer is, ironically, the noisier of the two, but it way outperforms the WD drive.

I’ve done some failure tests on the mirror by unplugging one drive while copying files to/from the backup server from my laptop.

First, I was copying from the server onto a single FireWire drive, writing at a solid 30MB/s. I disconnected the Samsung drive while it was running, and the file copy proceeded without fault at about 25MB/s off the single WD drive.

`zpool status` showed the drive was UNAVAIL and that the pool would continue to work in a degraded state. When I reconnected the drive, `cfgadm` showed it as connected but unconfigured. When I reconfigured the Samsung drive, the pool automatically resilvered the missing data (there wasn’t much, because I had only been reading over the network) in a matter of seconds.

Failure test #2 was to remove the WD drive. I copied data to the server from the laptop, and progress was intermittent: bursts of 30MB/s, then nothing for quite a few seconds, and so on. I disconnected the WD drive, and hey presto, the transfer rate instantly jumped to a solid 20MB/s. The Samsung drive definitely writes a whole stack faster than the WD drive. (A mirror writes only as fast as its slowest drive.)

And here’s the lemon part. When I reconnected the WD drive, its port showed up as disconnected (the Samsung had at least come back as connected, just unconfigured). To my frustration, I couldn’t reconnect the drive:

$ cfgadm
Ap_Id                          Type         Receptacle   Occupant     Condition
sata1/0                        sata-port    disconnected unconfigured failed
$ cfgadm -c connect sata1/0
cfgadm: Insufficient condition
I did a bit of searching and found this page: SolarisZfsReplaceDrive, which suggests using the -f force option:
$ pfexec cfgadm -f -c connect sata1/0
Activate the port: /devices/pci@0,0/pci8086,4f4d@1f,2:0
This operation will enable activity on the SATA port
Continue (yes/no)? yes
$ cfgadm
Ap_Id                          Type         Receptacle   Occupant     Condition
sata1/0                        disk         connected    unconfigured unknown
sata1/1::dsk/c8t1d0            disk         connected    configured   ok

So now OpenSolaris sees the drive as connected. Let’s configure it, and zpool should see it straight away…

$ pfexec cfgadm -c configure sata1/0
$ cfgadm
Ap_Id                          Type         Receptacle   Occupant     Condition
sata1/0::dsk/c8t0d0            disk         connected    configured   ok
sata1/1::dsk/c8t1d0            disk         connected    configured   ok
$ zpool status -x
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h0m, 0.00% done, 465h28m to go
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t0d0s0  ONLINE       0 1.14K     0  544K resilvered
            c8t1d0s0  ONLINE       0     0     0

Oh man… I have to resilver the whole drive. Why?! The other drive remembered it was a part of the pool and intelligently resilvered just the differences. This drive looks like it has to resilver the whole damn thing.

After a while:

$ zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h23m, 5.05% done, 7h20m to go
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0  12.3G resilvered
            c8t1d0s0  ONLINE       0     0     0

And here’s another interesting bit: the performance of the WD drive (c8t0d0) on my machine is really poor:

$ iostat -xn 5

                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   61.2    0.0 1056.2  0.0  9.1    0.0  148.1   0 100 c8t0d0
   79.0    0.0  978.7    0.0  0.0  0.0    0.0    0.6   0   3 c8t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c9t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   72.0    0.0  178.8  0.0  7.2    0.0   99.6   0 100 c8t0d0
  111.8    0.0  361.3    0.0  0.0  0.0    0.0    0.3   0   1 c8t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c9t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   51.6    0.0  120.4  0.0  7.5    0.0  145.9   0 100 c8t0d0
   79.4    0.0  143.7    0.0  0.0  0.0    0.0    0.2   0   1 c8t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c9t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   62.2    0.0 1968.5  0.0  8.3    0.0  133.7   0 100 c8t0d0
   81.8    0.0 2616.7    0.0  0.0  0.3    0.0    3.2   0   8 c8t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c9t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   34.6    0.0 1880.2  0.0  7.1    0.0  204.9   0  79 c8t0d0
   28.4   11.6 1413.5   41.7  0.0  0.1    0.0    3.1   0   7 c8t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c9t0d0

Check it out: 100% busy use of the drive, and it’s writing less than 2MB/s. Compare that to the %b busy for the Samsung (on c8t1d0) reading the same amount of data. And check out the average service time (asvc_t): that’s as bad as a CD-ROM! Yikes.

It won’t reconnect to the system cleanly, its service time is way slow, and its write performance stinks. This WD drive is a total lemon!

My first real Time Machine backup on a ZFS mirror

So, following my last post about the impact of compression on ZFS, I’ve created a ZFS file system with compression on, and am sharing it via Netatalk to my MacBook Pro.

I connected the Mac via gigabit ethernet for the original backup, and it backed up 629252 items (193.0 GB) in 7 hours, 23 minutes, 4 seconds, according to the backup log. That’s an average of 7.4MB/s. Nowhere near the maximum transfer rates that I’ve seen to the ZFS share, but acceptable nonetheless.

`zfs list` reports that the compression ratio is 1.11x. I would have expected more, but oh well.

And now my incremental backups are also working well over the wireless connection. Excellent.

ZFS performance networked from a Mac

Before I go ahead and do a full Time Machine backup to my OpenSolaris machine with a ZFS mirror, I thought I’d test what performance hit there might be when using compression. I also figured I’d test the impact of changing the recordsize. Optimising this to match the record size of the data seems to be best practice for databases, and since Time Machine stores data in a Mac disk image (sparse bundle), it probably writes data in 4k chunks matching the allocation size of the HFS file system inside the disk image.
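Each combination is just a property change on the file system under test. A sketch, with a hypothetical dataset name:

# Create a test file system and vary the properties between runs.
zfs create rpool/tmtest
zfs set recordsize=4k rpool/tmtest     # default is 128k
zfs set compression=on rpool/tmtest    # or compression=gzip, or compression=off
zfs get recordsize,compression,compressratio rpool/tmtest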

There were three copy tasks done:

  1. Copy a single large video file (1.57GB) to the Netatalk AFP share,
  2. Copy a single large video file (1.57GB) to a locally (mac) mounted disk image stored on the Netatalk AFP share,
  3. Copy a folder with 2752 files (117.3MB) to a locally (mac) mounted disk image stored on the Netatalk AFP share.

Here are the results:

ZFS recordsize,   To Netatalk AFP share:   To disk image on AFP      To disk image on AFP
compression       1 video file, 1.57GB     share: 1 video file,      share: 2752 files,
                                           1.57GB                    117.3MB

128k, off         0m29.826s (53.9MB/s)     2m5.889s (12.7MB/s)       1m45.809s (1.1MB/s)
128k, on          0m52.179s (30.9MB/s)     1m36.084s (16.7MB/s)      1m34.367s (1.24MB/s)
128k, gzip        0m31.290s (51.4MB/s)     2m32.485s (10.5MB/s)      2m29.141s (0.79MB/s)
4k, off           0m27.131s (59.3MB/s)     2m16.138s (11.8MB/s)      2m47.718s (0.70MB/s)
4k, on            0m25.651s (62.7MB/s)     1m59.459s (13.5MB/s)      1m41.551s (1.2MB/s)
4k, gzip          0m30.348s (53.0MB/s)     5m16.195s (5.08MB/s)      4m48.378s (0.41MB/s)

I think there was something else happening on the server during the 128k, compression=on test, impacting its data rate.

Conclusion:

The clear winner is compression=on with the default 128k record size. It must be that even my low-powered Atom processor can compress the data faster than it can be written to disk, reducing the bandwidth to disk and therefore increasing performance at the same time as saving space. Well done ZFS!