[BBLISA] SunFire 4500: Linux + ZFS/FUSE ?

Edward Ned Harvey bblisa4 at nedharvey.com
Sat Jul 7 09:12:59 EDT 2012


> From: Peter Baer Galvin [mailto:pbg at cptech.com]
> Sent: Friday, July 06, 2012 10:50 AM
> 
> Hmm, resilvering performance has greatly increased over time Ned. With
> which
> version of ZFS did you have the never-completing problem?

I haven't had the problem myself, because I know enough to avoid it.  I
participate a lot in the zfs-discuss mailing list, which was formerly
extremely active and included zfs developers; since the Oracle takeover it's
mostly just other IT people offering advice to each other.

The root cause of the problem is this:

In a zfs resilver, they decided to be clever.  Whereas a hardware raid
resilver must rewrite the entire disk, including unused blocks, a ZFS
resilver touches only the used blocks.  Theoretically this should make
resilvering very fast, right?  Unfortunately, no.  Because the hardware
resilver walks the whole disk sequentially, its resilver time is easy to
calculate: total disk capacity divided by the sustained sequential speed of
the drive.  Something on the order of 2 hours, depending on your drive.  But
zfs has no way to *sort* the used blocks into disk-sequential order.  The
resilver order is approximated by temporal (creation) order.  And assuming
you have a mostly full pool (>50%) that's been in production for a while,
reading & writing, creating & destroying snapshots, temporal order is
approximated by random order.  So a zfs resilver is approximated by random
IO across all your used blocks.  This is very much dependent on your
individual usage patterns.
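
To put rough numbers on that, here's a back-of-the-envelope sketch in
Python.  The drive speed, access time, recordsize, and fill level are all
assumptions I picked for illustration, not measurements:

    # Sequential (hardware) resilver vs random-IO-bound (zfs) resilver.
    # All inputs are illustrative assumptions.
    DISK_BYTES   = 1e12          # 1 TB drive
    SEQ_BPS      = 150e6         # ~150 MB/s sustained sequential
    ACCESS_SEC   = 0.008         # ~8 ms average random access
    RECORD_BYTES = 128 * 1024    # zfs default recordsize
    USED_FRAC    = 0.5           # pool half full

    seq_hours  = DISK_BYTES / SEQ_BPS / 3600
    blocks     = DISK_BYTES * USED_FRAC / RECORD_BYTES
    rand_hours = blocks * ACCESS_SEC / 3600

    print("sequential resilver:   %.1f hours" % seq_hours)   # ~1.9
    print("random-bound resilver: %.1f hours" % rand_hours)  # ~8.5

With more data, smaller blocks, or a fuller pool, the random-bound estimate
balloons while the sequential one stays fixed.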

Resilvering is a per-vdev operation.  If we assume the size of the pool &
the size of the data are fixed by design & budget constraints, and you are
choosing between organizing your pool as one big raidz versus a bunch of
mirrors, the mirrors leave less data in each vdev to resilver.  Naturally,
for equal usable capacity, mirrors cost more.  For the sake of illustrating
my point I've assumed you're able to consider a big raidz versus an
equivalently sized (higher cost) bunch of mirrors.  The principle holds even
if you scale up or scale down...  For a set amount of data divided across a
configurable number of vdevs, you will have less to resilver per vdev if you
choose more vdevs.
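
As a quick illustration (made-up numbers), the per-vdev share is straight
division:

    data_tb  = 40.0              # hypothetical total data in the pool
    n_vdevs  = 22                # e.g. 44 disks as 22 mirror pairs
    per_vdev = data_tb / n_vdevs # ~1.8 TB to resilver per mirror,
                                 # vs all 40 TB on one big raidz vdev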

Also, random IO for a raidzN (or raid5 or raid6 or raid-DP) is approximated
by the worst-case access time of any individual disk (approx 2x the average
access time of a single disk), because every disk in the stripe must seek
before the stripe can be reconstructed.  Meanwhile, random IO for a mirror
is approximated by the average access time of an individual disk.
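
You can see the worst-case effect with a tiny simulation.  I'm assuming
access times uniformly distributed around an 8 ms average, which is a
simplification:

    import random

    AVG, N_DISKS, TRIALS = 0.008, 8, 100000   # all assumptions

    # Mirror: one disk's access time governs each random read.
    mirror = sum(random.uniform(0, 2 * AVG)
                 for _ in range(TRIALS)) / TRIALS

    # raidzN: the whole stripe waits for the slowest of N disks.
    raidz = sum(max(random.uniform(0, 2 * AVG) for _ in range(N_DISKS))
                for _ in range(TRIALS)) / TRIALS

    print("mirror: %.1f ms  raidz: %.1f ms" % (mirror * 1e3, raidz * 1e3))
    # mirror ~8 ms, raidz ~14 ms -- approaching the 2x worst case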

So if you break your pool into a bunch of mirrors rather than one large
raidzN, you get both faster random IO (a factor of 2x) and less random IO
that needs to be done (a factor of Mx, where M is how many times smaller the
mirror is compared to the raidz; if you obey the rule of thumb "limit raidz
to 8-10 disks per vdev," Mx is something like 8x).  End result: a factor of
~16x faster using mirrors instead of raid.

So in rough numbers, a 46-disk raidz2 (capacity of 44 disks) will be
approximately 88 times slower to resilver than a bunch of mirrors.
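
The arithmetic behind those factors, as a sketch:

    def resilver_slowdown(data_disks_per_raidz):
        # vs mirrors: Mx more data per vdev, times the ~2x
        # random-IO penalty of a wide stripe
        return 2 * data_disks_per_raidz

    print(resilver_slowdown(8))   # ~16x for an 8-10 disk raidz
    print(resilver_slowdown(44))  # ~88x for the 46-disk raidz2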

In systems that I support, I only deploy mirrors.  When I have a resilver, I
expect it to take 12 hours.  By comparison, if this were a hardware raid, it
would resilver in 2 hours...  And if it were one big raidz, it would
resilver in approx 6 weeks.
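
That 6-week figure is just the ~88x factor applied to the 12-hour mirror
resilver:

    mirror_hours = 12
    raidz_weeks  = mirror_hours * 88 / 24 / 7
    print("%.1f weeks" % raidz_weeks)   # ~6.3 weeks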


