[BBLISA] disk corruption recovery ideas?

Daniel Feenberg feenberg at nber.org
Wed Oct 5 08:44:29 EDT 2005



On Tue, 4 Oct 2005, Douglas Alan wrote:

> Eric Smith <esmithphoto at gmail.com> wrote:
> 
>... 
> I've had two drives die at the same time on a RAID.  Good thing the
> RAID was only used as our backup server.  I'd never trust RAID again to
> be any kind of security against disk failure.
> 
> |>oug
> 

From http://www.nber.org/sys-admin/linux-nas-raid.html :

Why do drive failures come in pairs?

Most of the drives in our NAS boxes and drive arrays claim an MTBF of
500,000 hours. That's about 2% per year. With three drives the chance of
at least one failing is a little less than 6% (1-.98^3). Our experience
is that such numbers are at least a reasonable approximation of reality.
(We especially like the 5400 RPM Maxtor 5A300J0 300GB drives for their
long life.)
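
In Python terms (a sketch of the back-of-the-envelope arithmetic above,
assuming independent failures):

    HOURS_PER_YEAR = 24 * 365

    def annual_failure_prob(mtbf_hours):
        # crude conversion: the fraction of an MTBF consumed in one year
        return HOURS_PER_YEAR / mtbf_hours

    def prob_at_least_one_fails(p, n):
        # chance that at least one of n independent drives fails
        return 1 - (1 - p) ** n

    print(annual_failure_prob(500000))       # ~0.0175, call it 2%/year
    print(prob_at_least_one_fails(0.02, 3))  # ~0.0588, a little under 6%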

Suppose you have three drives in a RAID 5. If it takes 24 hours to
replace and reconstruct a failed drive, one is tempted to calculate that
the chance of one of the two surviving drives failing before full
redundancy is established is about 2 x (.02/365), or roughly one in ten
thousand. The total probability of a double failure seems like it should
be about 6 in a million per year.
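
Made explicit (same sketch, with the independence assumption the next
paragraph knocks down):

    p_drive = 0.02                       # per-drive annual failure rate
    p_first = 1 - (1 - p_drive) ** 3     # ~5.9%: some drive fails this year
    p_second = 2 * p_drive / 365         # ~1.1e-4: a survivor dies in 24 hours
    print(p_first * p_second)            # ~6.4e-6, about 6 in a million/year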

Our double failure rate is about four orders of magnitude worse than
that - the majority of single drive failures are followed by a second
drive failure before redundancy is established. This prevents rebuilding
the array with a new drive replacing the original failed drive, though
you can probably recover most files by staying in degraded mode. It isn't
that failures are correlated because the drives are from the same batch,
or the controller is at fault, or the environment is bad (a common
electrical spike or heat problem). The fault lies with the Linux md
driver, which stops rebuilding parity after a drive failure at the first
point it encounters an uncorrectable read error on the remaining "good"
drives. Of course with two drives unavailable there isn't an unambiguous
reconstruction of the bad sector, so it might be best to go to the
backups instead of continuing. At least that is apparently the reasoning
behind the decision.

Alternatively, if the sector in question is still readable on the first
failed drive (even if other sectors on it are not), it should be possible
to recover all the data with a high degree of confidence even after a
second drive is failed. Since that is far from an unusual situation (a
drive will be failed for a single uncorrectable error even if further
reads succeed on other sectors), it isn't clear to us why that isn't
done. Even if that sector isn't readable, logging the bad block, writing
zeroes to the target, and going on might be better than simply giving up.
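
The reconstruction at stake is only XOR arithmetic. A toy illustration
in Python (ours, not the md driver's code) of a three-drive stripe,
where any one missing block is the XOR of the other two:

    def xor_blocks(*blocks):
        # XOR equal-length blocks together, byte by byte
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    d0 = b"sixteen bytes..."        # data block on drive 0
    d1 = b"sixteen more...."        # data block on drive 1
    parity = xor_blocks(d0, d1)     # parity block on drive 2

    # Drive 1 has been failed, but if this sector of it still reads,
    # nothing is lost; and even without it, d1 = d0 XOR parity:
    assert xor_blocks(d0, parity) == d1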

A single unreadable sector isn't unusual among the tens of millions of
sectors on a modern drive. If the sector has never been written to, there
is no occasion for the drive electronics or the OS even to know it is
bad. If the OS tried to write to it, the drive would automatically remap
the sector and no damage would be done - not even a log entry. But once
one other drive has been failed, that one bad sector, wherever on the
disk it sits, will render the entire array unrecoverable.

Let's repeat the reliability calculation with our new knowledge of the
situation. In our experience perhaps half of drives have at least one
unreadable sector in their first year. Again assume a 6 percent chance of
a single failure. The chance of at least one of the remaining two drives
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is
about 4.5%/year, which is half a percentage point MORE than the 4%
failure rate one would expect from a two-drive RAID 0 of the same
capacity. Alternatively, if you just had two drives with a partition on
each and no RAID of any kind, the chance of a failure would still be
4%/year but with only half the data lost per incident - considerably
better than the RAID 5 can even hope for under the current reconstruction
policy, even with the most expensive hardware.
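
The revised arithmetic, in the same sketch form:

    p_first = 0.06        # some drive in the 3-drive RAID 5 fails this year
    p_bad_sector = 0.5    # a given drive has an unreadable sector (our estimate)
    p_rebuild_blocked = 1 - (1 - p_bad_sector) ** 2   # 0.75 for the two survivors
    print(p_first * p_rebuild_blocked)   # 0.045: ~4.5%/year for the RAID 5
    print(1 - (1 - 0.02) ** 2)           # ~0.0396: ~4%/year for a 2-drive RAID 0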

We don't know what the reconstruction policy is for other RAID
controllers, drivers or NAS devices. None of the boxes we bought
acknowledged this "gotcha", but none promised to avoid it either. We
assume Netapp and ECCS have this under control, since we have had several
single drive failures on those devices with no difficulty resyncing. We
have not had a single drive failure yet in the MVD-based boxes, so we
really don't know what they will do.

Some mitigation of the danger is possible. You could read and write the
entire drive surface periodically, and replace any drive with even a
single uncorrectable block visible. A daemon, smartd
(http://sourceforge.net/projects/smartmontools/), is available for Linux
that will scan the disk in the background for errors and report them. We
had been running it, but ignored errors on unwritten sectors, because we
were used to such errors disappearing once the sector was written (and
the bad sector remapped).
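
Something as crude as the following would surface such sectors before a
rebuild trips over them (a Python sketch; the device name is a
placeholder, it needs root, and smartd's scheduled self-tests are the
more practical route):

    import os

    DEV = "/dev/sdX"        # placeholder: the member disk to scan
    CHUNK = 64 * 1024

    fd = os.open(DEV, os.O_RDONLY)
    pos = 0
    while True:
        try:
            data = os.read(fd, CHUNK)
        except OSError as e:
            print("unreadable region near byte %d: %s" % (pos, e))
            pos += CHUNK                    # skip past the bad spot
            os.lseek(fd, pos, os.SEEK_SET)
            continue
        if not data:
            break                           # end of device
        pos += len(data)
    os.close(fd)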

Our current inclination is to shift to a recent 3ware controller,
which we understand has a "continue on error" rebuild policy available as
an option in the array setup. But we would really like to know more about
just what that means. What do the apparently similar RAID controllers from
Mylex, LSI Logic and Adaptec do about this? A look at their web sites
reveals no information.


Daniel Feenberg
feenberg isat nber dotte org




