Matt Connolly's Blog

my brain dumps here…

ZFS – dud hard drive slowing whole system

I have a low-power server running OpenIndiana oi-148. It has 4GB RAM and with three drives in it, like so:

matt@vault:~$ zpool status
  pool: rpool
 state: ONLINE
 scan: resilvered 588M in 0h3m with 0 errors on Fri Jan  7 07:38:06 2011
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
        cache
          c12d0s0     ONLINE       0     0     0

errors: No known data errors

I’m running netatalk file sharing for mac, and using it as a time machine backup server for my mac laptop.

When files are copying to the server, I often see periods of a minute or so where network traffic stops. I’m convinced that there’s some bottleneck in the storage side of things because when this happens, I can still ping the machine and if I have an ssh window, open, I can still see output from a `top` command running smoothly. However, if I try and do anything that touches disk (eg `ls`) that command stalls. At the time it comes good, everything comes good, file copies across the network continue, etc.

If I have a ssh terminal session open and run `iostat -nv 5` I see something like this:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   36.0  153.6 4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
    0.0  113.4    0.0 7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
    0.2  106.4    4.1 7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   73.2   25.7 9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
    0.0  226.6    0.0 24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
    0.2  127.6    3.4 12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   44.2    0.0 5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
    0.2   76.0    4.8 9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
    0.0   16.6    0.0 2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0   25.6  0.0  0.0    0.3    2.3   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   11.0    0.0 1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1   25.4   0   1 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.6    0.0 2182.4  9.0  1.0  511.3   56.8 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   16.6    0.0 2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   15.8    0.0 1959.2  9.0  1.0  569.6   63.3 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.4    0.0 2157.6  9.0  1.0  517.2   57.4 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   18.2    0.0 2256.8  9.0  1.0  494.5   54.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   14.8    0.0 1835.2  9.0  1.0  608.1   67.5 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1    0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    1.4    0.0    0.6  0.0  0.0    0.0    0.2   0   0 c8t0d0
    0.0   49.0    0.0 6049.6  6.7  0.5  137.6   11.2 100  55 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   55.4    0.0 7091.2  1.9  0.6   34.9    9.9  27  28 c12d0
    0.2  126.0    8.6 9347.7  1.4  0.1   11.4    0.6  20   7 c8t0d0
    0.0  120.8    0.0 9340.4  4.9  0.2   40.5    1.5  77  18 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   57.0  153.6 7271.2  1.8  0.5   31.0    9.4  26  28 c12d0
    0.2  108.4   12.8 6498.9  0.3  0.1    2.5    0.6   6   5 c8t0d0
    0.2  104.8    5.2 6506.8  4.0  0.2   38.2    1.4  67  15 c8t1d0

The stall occurs when the drive c8t1d0 is 100% waiting, and doing only slow i/o, typically writing about 2MB/s. However, the other drive is all zeros… doing nothing.

The drives are:
c8t1d0 – Western Digital Green – SATA_____WDC_WD15EARS-00Z_____WD-WMAVU2582242
c8t0d0 – Samsung Silencer – SATA_____SAMSUNG_HD154UI_______S1XWJDWZ309550

I’ve installed smartmon and done a short and long test on both drives, all resulting in no found errors.

I expect that the c8t1d0 WD Green is the lemon here and for some reason is getting stuck in periods where it can write no faster than about 2MB/s. Why? I don’t know…

Secondly, what I wonder is why it is that the whole file system seems to hang up at this time. Surely if the other drive is doing nothing, a web page can be served by reading from the available drive (c8t0d0) while the slow drive (c8t1d0) is stuck writing slow. Is this a bug in ZFS?

If anyone has any ideas, please let me know!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: