
I want to compare the reliability of different RAID systems with either consumer (URE/bit = 1e-14) or enterprise (URE/bit = 1e-15) drives. The formula for the probability that a rebuild hits a URE (ignoring mechanical problems, which I will take into account later) is simple:

error_probability = 1 - (1-per_bit_error_rate)^bit_read

It is important to remember that this is the probability of getting AT LEAST one URE, not necessarily only one.
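
As a sanity check, here is a minimal sketch of that formula in Python (expm1/log1p are used only to avoid floating-point cancellation with per-bit rates as small as 1e-14):

```python
import math

def rebuild_ure_probability(per_bit_error_rate: float, bits_read: float) -> float:
    """Probability of hitting AT LEAST one URE while reading bits_read bits."""
    # Mathematically 1 - (1 - p)^n; expm1/log1p keep precision for p ~ 1e-14.
    return -math.expm1(bits_read * math.log1p(-per_bit_error_rate))

# Example: reading back one 6 TB disk (6e12 bytes * 8 bits) on a consumer drive.
print(rebuild_ure_probability(1e-14, 6e12 * 8))  # ~0.38
```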

Let's suppose we want 6 TB usable space. We can get it with:

  • RAID1 with 1+1 disks of 6 TB each. During rebuild we read back 1 disk of 6TB and the risk is: 1-(1-1e-14)^(6e12*8)=38% for consumer or 4.7% for enterprise drives.

  • RAID10 with 2+2 disks of 3 TB each. During rebuild we read back only 1 disk of 3TB (the one paired with the failed one!) and the risk is lower: 1-(1-1e-14)^(3e12*8)=21% for consumer or 2.4% for enterprise drives.

  • RAID5/RAID Z1 with 2+1 disks of 3 TB each. During rebuild we read back 2 disks of 3 TB each and the risk is: 1-(1-1e-14)^(2*3e12*8)=38% for consumer or 4.7% for enterprise drives.

  • RAID5/RAID Z1 with 3+1 disks of 2 TB each (often used by users of SOHO products like Synology). During rebuild we read back 3 disks of 2 TB each and the risk is: 1-(1-1e-14)^(3*2e12*8)=38% for consumer or 4.7% for enterprise drives (all four figures are reproduced in the short sketch after this list).
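
The four scenarios above can be reproduced with a short script; this is only a sketch, and it assumes decimal terabytes (1 TB = 1e12 bytes), as in the calculations above:

```python
import math

def ure_risk(per_bit_rate: float, bits_read: float) -> float:
    # 1 - (1 - p)^bits_read, written with expm1/log1p for numerical stability
    return -math.expm1(bits_read * math.log1p(-per_bit_rate))

# Bits that must be read back during the rebuild in each scenario (1 TB = 1e12 bytes).
scenarios = {
    "RAID1  1+1 x 6 TB (reads 6 TB)": 6e12 * 8,
    "RAID10 2+2 x 3 TB (reads 3 TB)": 3e12 * 8,
    "RAID5  2+1 x 3 TB (reads 6 TB)": 2 * 3e12 * 8,
    "RAID5  3+1 x 2 TB (reads 6 TB)": 3 * 2e12 * 8,
}
for name, bits in scenarios.items():
    print(f"{name}: consumer {ure_risk(1e-14, bits):.1%}, "
          f"enterprise {ure_risk(1e-15, bits):.1%}")
```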

Calculating the error probability for single-disk-tolerant systems is easy; it is more difficult for systems that tolerate multiple disk failures (RAID6/Z2, RAID Z3).

If only the first redundant disk is used for the rebuild and the second one is read again from the beginning in case of a URE, then the error probability is the one calculated above squared (14.5% for consumer RAID5 2+1, 4.5% for consumer RAID1 1+2). However, I suppose that (at least in ZFS, which has full checksums!) the second parity/available disk is read only where needed, meaning that only a few sectors are involved: how many UREs can realistically occur on the first disk? Not many, otherwise the error probability for single-disk-tolerance systems would skyrocket even more than I calculated.

If I'm correct, a second parity disk would in practice lower the risk to extremely low values.
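
To make the two models concrete, here is a sketch for the consumer RAID5 2+1 case. It only restates the assumptions above: the "pessimistic" model squares the single-parity risk, and the expected-URE count supports the "only a few sectors" argument (I am not asserting this is exactly how any implementation behaves):

```python
import math

# Consumer rate, RAID5 2+1 with 3 TB data disks (same assumptions as above).
p_bit = 1e-14
bits_read = 2 * 3e12 * 8          # both surviving 3 TB disks are read in full

# Single-parity rebuild risk: at least one URE anywhere in the read.
p_single = -math.expm1(bits_read * math.log1p(-p_bit))   # ~0.38

# Pessimistic dual-redundancy model: the second parity/copy must also be read
# in full without a URE, and the two passes fail independently -> square it.
p_dual_pessimistic = p_single ** 2                        # ~0.145

# Expected number of UREs in the first pass (binomial mean). Well below 1, so
# a checksum-aware rebuild (e.g. ZFS) only needs the second copy for a handful
# of sectors, making the optimistic model far better still.
expected_ures = bits_read * p_bit                         # ~0.48

print(p_single, p_dual_pessimistic, expected_ures)
```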

As an aside, it is important to keep in mind that manufacturers overstate the quoted URE probability of consumer-class drives for marketing reasons (to sell more enterprise-class drives), so even consumer-class HDDs are expected to achieve about 1e-15 URE/bit read.

Some data: http://www.high-rely.com/hr_66/blog/why-raid-5-stops-working-in-2009-not/

The enterprise-drive values I provided above therefore realistically apply to consumer drives too, and real enterprise drives have even higher reliability (URE/bit = 1e-16).

Concerning mechanical failures, their probability is roughly proportional to the number of disks and to the time required for the rebuild.
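
As a rough illustration only (the MTBF and rebuild duration below are assumptions, not vendor figures), an exponential failure model shows that proportionality directly:

```python
import math

# Illustration only: MTBF and rebuild duration below are assumptions, not vendor data.
def second_failure_probability(surviving_disks: int, rebuild_hours: float,
                               mtbf_hours: float = 1_000_000) -> float:
    # Exponential failure model: risk grows with both disk count and rebuild time.
    return -math.expm1(-surviving_disks * rebuild_hours / mtbf_hours)

# Example: 3 surviving disks and a 24-hour rebuild.
print(second_failure_probability(3, 24))  # ~7e-5
```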

Comments:
  • Hi Olaf! As far as I'm concerned, this question seems a little too specific to computer hardware to be a good fit for Mathematics, but you could ask on their meta site if they'd like to have your question. If that's the case, flag again and we'll be happy to migrate it for you! – slhck (Dec 13, 2012)
  • How exactly do you arrive at 38% URE probability for RAID5 with 3 drives? Using URE = 10^14, HDD = 3.5*1024^4 bytes I get 3.8% URE per drive and 11.1% for URE while rebuilding, that is: 100 * (1 - (1 - (hdd/ure))^3). I think your numbers are a bit off (although the practical failure rate is higher than what is stated by manufacturers). Since the error rates are given per bits read per drive and not per bits read, I think the part where you use ^bit_read is wrong. Perhaps give more detail on how you calculated those odds? +1 for interesting question. cs.cmu.edu/~bianca/fast07.pdf (Mar 11, 2013)
  • Added info and checked calculations. – FarO (Nov 20, 2013)

2 Answers


This is the best answer, with the probability theory included too:

http://evadman.blogspot.com/2010/08/raid-array-failure-probabilities.html?showComment=1337533818123#c7465506102422346169


There are a number of sites and articles that attempt to address this question.

This site has calculators for RAID 0, 5, 10/50/60 levels.

The Wikipedia article on RAID levels has sections on RAID 0 and RAID 1 failure rates.

RAID 0:

Reliability of a given RAID 0 set is equal to the average reliability of each disk divided by the number of disks in the set:

That is, reliability (as measured by mean time to failure (MTTF) or mean time between failures (MTBF)) is roughly inversely proportional to the number of members – so a set of two disks is roughly half as reliable as a single disk. If there were a probability of 5% that the disk would fail within three years, in a two disk array, that probability would be increased to {P}(at least one fails) = 1 - {P}(neither fails) = 1 - (1 - 0.05)^2 = 0.0975 = 9.75%.

RAID 1:

As a simplified example, consider a RAID 1 with two identical models of a disk drive, each with a 5% probability that the disk would fail within three years. Provided that the failures are statistically independent, then the probability of both disks failing during the three-year lifetime is 0.25%. Thus, the probability of losing all data is 0.25% over a three-year period if nothing is done to the array.
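
A quick check of the two quoted figures, under the same assumptions as the quotes (5% per-disk failure probability over three years, statistically independent failures):

```python
# Same assumptions as the quoted text: 5% per-disk failure probability over
# three years, failures statistically independent.
p_disk = 0.05

p_raid0_loss = 1 - (1 - p_disk) ** 2   # any one of two disks failing kills RAID 0
p_raid1_loss = p_disk ** 2             # both disks must fail to kill RAID 1

print(f"RAID 0: {p_raid0_loss:.2%}, RAID 1: {p_raid1_loss:.2%}")  # 9.75%, 0.25%
```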



I've also found several blog articles about this subject, including this one, which reminds us that the independent drives in a system (the I in RAID) may not be that independent after all:

The naïve theory is that if hard disk 1 has probability of failure 1/1000 and so does disk 2, then the probability of both failing is 1/1,000,000. That assumes failures are statistically independent, but they’re not. You can’t just multiply probabilities like that unless the failures are uncorrelated. Wrongly assuming independence is a common error in applying probability, maybe the most common error.

Joel Spolsky commented on this problem in the latest StackOverflow podcast. When a company builds a RAID, they may grab four or five disks that came off the assembly line together. If one of these disks has a slight flaw that causes it to fail after say 10,000 hours of use, it’s likely they all do. This is not just a theoretical possibility. Companies have observed batches of disks all failing around the same time.
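
To make that point concrete, here is a toy model (the numbers and the "shared batch defect" probability are purely illustrative, not taken from the quoted article):

```python
# Toy model, purely illustrative numbers: with probability q the whole batch
# shares a latent defect and every disk fails; otherwise disks fail
# independently with probability p.
p = 1 / 1000      # per-disk failure probability
q = 1 / 10000     # probability of a common batch defect

p_both_naive = p * p                              # 1e-6 if independence held
p_both_with_common_cause = q + (1 - q) * p * p    # ~1e-4, dominated by the batch defect

print(p_both_naive, p_both_with_common_cause)
```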

