11

A recent issue with a Buffalo TeraStation NAS here in my office has me investigating RAID 5.

I've found a few different articles discussing the unsuitability of RAID 5 for large arrays, or with large disks.

Here is one example article that talks about problems with rebuilding an array with large consumer drives.

I'm trying to work out what counts as 'large'.

The NAS we have here is a four-drive RAID 5 setup, and each drive is 1TB. A drive failed and has been replaced, and the array is currently rebuilding.

Does this setup count as 'large', in the sense that it is likely to have a problem during the rebuild?

How reliable is this setup for day to day use?

6
  • 2
Given your usual system load, how long does the controller expect the rebuild to take? What is the MTBF of the HDDs? Once you have those two numbers, you know the chance of a second - and catastrophic - failure during RAID rebuild. Bear in mind that the HDDs are most stressed during rebuild, so the result above will be an underestimate of the chance of double failure.
    – MadHatter
    Commented Apr 28, 2014 at 14:22
  • 3
    As an aside, you know that RAID is not backup, right?
    – cjc
    Commented Apr 28, 2014 at 14:29
  • 5
    @cjc, do you add that pearl of wisdom to every single RAID question on SF, or does something about this one make you think the OP thinks RAID is a backup? Commented Apr 28, 2014 at 14:57
Yes, I'm aware of that. It's all backed up, I just don't want the hassle of having to restore it all because the RAID array didn't repair itself properly.
    – Rob
    Commented Apr 28, 2014 at 14:57
  • possible duplicate of What are the different widely used RAID levels and when should I consider them?
    – Basil
    Commented Apr 28, 2014 at 16:47

2 Answers

18

Designing the reliability of a disk array:

  1. Find the URE Rate of your drive (manufacturers don't like to talk about their drives failing, so you might have to dig to find this). It should be 1/10^X, where X is commonly between 12 and 18.
  2. Decide what is an acceptable risk rate for your storage needs†. Typically this is a <0.5% chance of failure, but it could be several percent for "scratch" storage, or <0.1% for critical data.
  3. 1 - (1 - [Drive Size] x [URE Rate]) ^ [Data Drives‡] = [Risk]
    For arrays with more than one parity disk, or mirrors with more than two disks, subtract all of the parity/mirror disks from the total drive count to get [Data Drives] (see ‡ below).

So I've got a set of four 1TB WD Green drives in an array. They have a URE Rate of 1/10^14, and I use them as scratch storage. 1 - (1 - 1TB x 1/10^14 per byte) ^ 3 => 3.3% risk of failure while rebuilding the array after one drive dies. These are great for storing my junk, but I'm not putting critical data on there.

†Determining acceptable failure is a long and complicated process. It can be summarized as Budget = Risk x Cost. So if a failure is going to cost $100 and has a 10% chance of happening, then you should have a budget of $10 to prevent it. This grossly simplifies the task of determining the risk, the costs of various failures, and the nature of potential prevention techniques - but you get the idea.

‡[Data Drives] = [Total Drives] - [Parity Drives]. A two-disk mirror (RAID 1) and RAID 5 each have 1 parity drive. A three-disk mirror (RAID 1) and RAID 6 each have 2 parity drives. It's possible to have more parity drives with RAID 1 and/or custom schemes, but that's atypical.
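
A minimal Python sketch of step 3, reproducing the 3.3% figure from the example above (the function and variable names are illustrative, not from the answer; the URE Rate is treated as per byte, matching the example and the comments below):

    # Sketch of the rebuild-risk formula above; names are illustrative.
    def rebuild_failure_risk(drive_size_bytes, ure_rate, data_drives):
        """P(at least one URE while reading the surviving data drives)."""
        return 1 - (1 - drive_size_bytes * ure_rate) ** data_drives

    TB = 2 ** 40  # 1TB treated as 2^40 bytes, as in the comments below
    print(f"{rebuild_failure_risk(TB, 1e-14, data_drives=3):.1%}")  # => 3.3%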


This statistical equation does come with its caveats, however:

  • That URE Rate is the advertised rate; most drives rolling off the assembly line do better. You might get lucky and buy a drive that is orders of magnitude better than advertised; similarly, you could get a drive that dies of infant mortality.
  • Some manufacturing lines have bad runs (where many disks in the run fail at the same time), so getting disks from different manufacturing batches helps to distribute the likelihood of simultaneous failure.
  • Older disks are more likely to die under the stress of a rebuild.
  • Environmental factors take a toll:
    • Disks that are heat-cycled often (e.g. powered on and off regularly) are more likely to die.
    • Vibration can cause all kinds of issues - see the YouTube video of an engineer yelling at a disk array.
  • "There are three kinds of lies: lies, damned lies, and statistics" - Benjamin Disraeli
11
  • The drive I took out of the device is a Samsung HD103SI 1TB drive. I believe the other three remaining drives are the same. The replacement drive is from a different manufacturer; I don't have the details to hand.
    – Rob
    Commented Apr 28, 2014 at 15:01
    It seems the rate for this drive is 1/10^15 according to this: comx-computers.co.za/HD103SI-specifications-28474.htm
    – Rob
    Commented Apr 28, 2014 at 15:06
  • 1
    I just corrected the equations, the example was correct, now they both are. Your array would be 1-(1-1099511627776*0.000000000000001)^3 => 0.00329. You have a bracket on the outside of the ^3 where it should be on the inside; and there should be one more zero in that 1/10^15 thing.
    – Chris S
    Commented Apr 28, 2014 at 15:53
  • 2
    A 1TB drive would be 1000000000000 bytes, so it works out slightly less than 3% or 0.3%, depending on your URE Rate.
    – user9517
    Commented Apr 28, 2014 at 15:58
  • 1
    @IanRingrose This is statistically valid. I already addressed your specific concerns. Do you have anything relevant to add besides what has already been stated?
    – Chris S
    Commented Apr 28, 2014 at 19:38
9

The reason that article exists is to draw attention to Unrecoverable Bit Error Rates on HDDs - specifically, on your cheap 'home PC' disks. They typically have a factory spec of 1/10^14. That is about 12.5TB of data read per error, which you hit quite quickly if you are doing RAID-5 with 2TB disks.
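
A quick sketch of the arithmetic behind that 12.5TB figure, reading the spec as one unrecoverable error per 10^14 bits read (my working, not the answerer's):

    # One URE per 10^14 bits read, converted to terabytes.
    bits_per_error = 10 ** 14
    tb_per_error = bits_per_error / 8 / 10 ** 12
    print(tb_per_error, "TB")  # => 12.5 TB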

This means you should either:

  • use smaller RAID groups, and accept the higher proportion of wasted space;
  • use RAID-6, and accept the additional write penalty (50% higher than RAID-5); or
  • buy more expensive disks - 'server grade' drives have a UBER spec of 1/10^16, which makes this a moot point (1.25PB is better than 12.5TB).

I would generally suggest that RAID-6 is the way forward, but it will cost you performance.
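
To put rough numbers on the trade-off, here is a small sketch estimating the chance of hitting a URE during a RAID-5 rebuild of four 2TB disks, comparing consumer and server-grade specs (the Poisson approximation and the specific drive counts and sizes are my assumptions, not from the answer):

    import math

    # P(at least one URE while reading every surviving drive in a rebuild),
    # using a Poisson approximation and a per-bit URE rate.
    def p_ure_during_rebuild(surviving_drives, drive_tb, ure_per_bit):
        bits_read = surviving_drives * drive_tb * 1e12 * 8
        return 1 - math.exp(-bits_read * ure_per_bit)

    print(f"{p_ure_during_rebuild(3, 2, 1e-14):.0%}")   # consumer:     ~38%
    print(f"{p_ure_during_rebuild(3, 2, 1e-16):.1%}")   # server grade: ~0.5%

Under this per-bit reading of the spec, the server-grade drives turn a near coin-flip rebuild into a fraction of a percent, which is the point of the third option above.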
