
With the space-to-cost ratio of hard drives leading to capacities that present an increasing challenge to parity RAID systems (URE probability, multi-day rebuilds stressing drives and risking a secondary failure), is there a case to be made that they also present an opportunity for multi-disk (>2) RAID 1 configurations for use at home and in small businesses? That is, RAID 1 'wastes' space compared to a parity scheme, but with cheap multi-terabyte drives, is space not arguably the commodity we can afford to waste, compared to the memory, bandwidth, CPU time, downtime etc. required by striping, rebuilds and so on?

If I understand the technology correctly, a Raid 1 (>2) configuration:

  • ..cannot fail a rebuild, inasmuch as, as long as one drive is accessible, so are the files.
  • ..provides greater security relative to the drive count. It's basically an inversion of the ratio right? Raid5 on three disks provides 66% usable capacity and can handle 33% (1 drive) failure. Raid1 provides only 33% capacity but 66% failure (2 drives).
  • ..has an easier and more robust recovery mechanism. In dire conditions, the Raid need not be rebuilt to access the data. Drives can be migrated between different systems.
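
The capacity/fault-tolerance inversion in the second point can be sketched numerically. This is a minimal illustration (the functions and the three 4 TB drives are hypothetical, not any RAID tool's API):

```python
# Compare usable capacity and fault tolerance for a 3-drive array.
# Illustrative only; assumes identical drive sizes and no hot spares.

def raid5(drives, size_tb):
    usable = (drives - 1) * size_tb   # one drive's worth goes to parity
    tolerated = 1                     # any single drive may fail
    return usable, tolerated

def raid1(drives, size_tb):
    usable = size_tb                  # every drive holds a full copy
    tolerated = drives - 1            # data survives until one drive is left
    return usable, tolerated

for name, fn in (("RAID5", raid5), ("RAID1", raid1)):
    usable, tolerated = fn(3, 4)      # three 4 TB drives
    print(f"{name}: {usable} TB usable, survives {tolerated} drive failure(s)")
```

With three 4 TB drives this reproduces the ratios above: RAID5 gives 8 of 12 TB (66%) usable but survives only one failure, while the three-way mirror gives 4 TB (33%) usable and survives two.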

That last point feels akin to the unRaid philosophy where files are saved across drives, not striped, and the second point suggests to me something more attractive to a small business: prioritizing data security over data capacity. What's the point of being able to store more data if it's at greater risk?

Questions:

  1. Would a strategy of 'manually mirroring' the content, as opposed to RAID1's copy-at-time-of-writing, avoid potential problems with non-ECC memory corruption? Since the job of writing to each disk would be a distinct process, the same memory fault that occurs in one write would not, in theory, occur in another?

  2. Is there a Raid1 software/hardware designed to work with more than two drives, and can therefore handle traditionally un-raid1-like behaviours such as bit-rot, URE etc? With a traditional disk mirror setup if disk A reports a 0 and B reports a 1, you have no idea who flipped, but in a three or more disk solution could you not 'correct' the odd man out disk? A=0, B=1, C=0 thus B must have flipped?

  3. At the same time, would such a Raid1 solution support parallel reads from more than two disks to provide accelerated read speeds? This is especially important since I imagine there are many businesses who read the same data many more times than they edit it.

  4. RAID is not a backup solution. So if you were to follow good practice and have two to three storage locations, is it intended that only one of them is protected from drive failure via a RAID scheme, or would you expect a second offsite location to also have multi-disk redundancy? Is it the case that even a single drive, stored in a different location, is thought to be protected from device failure because it's in an 'effective' mirror relationship with the drives in the other locations? If not, then does RAID1 not offer some redundancy for the lowest minimum drive count (2)?
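
The 'odd man out' correction asked about in question 2 is simple majority voting across mirrored copies. A minimal sketch (the `vote` function is my own illustration, not an existing RAID implementation's behaviour):

```python
from collections import Counter

def vote(copies):
    """Return the majority value among mirrored copies of one block.

    With three or more copies, a single corrupted copy is outvoted
    (A=0, B=1, C=0 -> B must have flipped). With only two copies,
    a mismatch is detectable but not correctable.
    """
    counts = Counter(copies)
    value, n = counts.most_common(1)[0]
    if n <= len(copies) // 2:
        raise ValueError("no majority - cannot tell which copy is wrong")
    return value

print(vote([0, 1, 0]))  # A and C outvote B -> 0
```

Note this only resolves *silent* disagreement; a URE is easier, since the failing drive identifies itself and the read can simply be retried on another mirror.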

Update:

Re point 1, as some have pointed out, this does rely on a data source free of errors. This is, however, the case for pretty much any storage/backup strategy. We may use sophisticated systems for data integrity in our NAS/SAN solutions, but the data we store in them is typically generated by workstations and devices without such measures. Most production PCs are built for either cost or speed. It is highly unlikely, especially in a small business, that the computer you do your CAD or finances or PowerPoint on uses ZFS-formatted drives and ECC memory.

In my specific example, one of the things I will be looking to store will be the output from a camera. I have to assume the photos and videos it saves onto the SD card are 'correct', and there's not much I can do if the corruption occurs at that point in the data chain.

Suggestion:

If this strategy is not supported at the RAID level, are there any software packages that can be used to manually perform some of the activities I'm talking about? Writing copies via queued, discrete rsync tasks? A bash script could perform a periodic bit-rot scrub: checksum all copies of a particular file across all disks, then overwrite the copy on the disk with the wrong checksum using a copy from a correct disk?
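
The suggested scrub could look roughly like this. It is a sketch under stated assumptions: each disk's mount point holds a copy of the same file at the same relative path, and a 2-of-3 checksum majority identifies the bad copy (function names and paths are illustrative):

```python
# Periodic bit-rot scrub across manually mirrored disks (sketch).
import hashlib
import shutil
from collections import Counter

def sha256(path):
    """Checksum a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(copies):
    """copies: paths to the same file on different disks.

    Overwrites any minority copy with a majority copy; raises if
    no majority exists (e.g. two disks, two different checksums).
    """
    sums = {p: sha256(p) for p in copies}
    majority, n = Counter(sums.values()).most_common(1)[0]
    if n <= len(copies) // 2:
        raise RuntimeError("no majority checksum - manual recovery needed")
    good = next(p for p, s in sums.items() if s == majority)
    for p, s in sums.items():
        if s != majority:
            shutil.copyfile(good, p)   # repair the odd man out
```

Run periodically (e.g. from cron) over every file present on all mirrors. Note it shares the limitation raised in question 1: it can only vote among the copies it has, not verify them against the original source.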

  • I may be missing the point/intent of your sub-question #1.  If data corruption occurs when you write to your main (first) disk, and then you copied the file to a second disk, wouldn’t that copy the corrupt data that was written to the first disk? Commented Jul 25, 2018 at 21:52
  • That there is a whole lot of questions.... See superuser.com/questions/489793/… for a potential duplicate though.... Commented Jul 25, 2018 at 21:56
  • RAID 6 also survives 2 drive failures. The overhead of RAID 5 is +1 drive and of RAID 6 is +2, so the relative overhead goes down as you increase the number of drives, since the number of extra drives stays the same.
    – cybernard
    Commented Jul 25, 2018 at 22:16
  • If you daisy-chained your writes you would be correct: writing from source to disk 1, then disk 1 to disk 2, etc. would perpetuate the error. However, if you read from the source and wrote to each disk separately you might avoid this issue. Obviously the source holding the original data has to be correct.
    – John S.
    Commented Jul 25, 2018 at 22:25
  • Apologies for the many questions. There are a few topics on here that are potential duplicates, but none that I saw addressed these particular questions. The best I have gathered is that no one so far has named a hardware controller that provides this functionality, and that Linux mdadm handles RAID1 with more than two drives but doesn't offer any added functionality when dealing with >2 configurations.
    – John S.
    Commented Jul 25, 2018 at 22:25

2 Answers


To answer the questions

  1. Manually mirroring the content would bypass some potential problems related to non-ECC memory, but could introduce others. It also assumes the source data has not been corrupted by an earlier failure.

  2. This depends on the hardware/software. I'd imagine most RAID1 implementations would not pick up bit-rot, but would handle a URE. Any given read is generally served from only one of the drives, but different reads can go to different drives, which allows faster reads overall.

  3. If the RAID1 implementation does use all 3 disks, then yes, parallel reads would be faster with more disks. AFAIK this is not supported in mdadm (i.e. Linux software RAID), as the 3rd drive is treated as a hot spare.

  4. Whether you RAID your offsite location is a question of robustness and reliability. It would be a best practice, but not an absolute necessity. In my mind hard disks are consumable parts: having RAID allows for replacement of failed drives without restores AND increases durability. This really comes down to a cost-benefit discussion.

  • A thought (but not an answer) - Have you considered alternative technologies to RAID? ZFS (great for snapshots/pitr and off-site replication), MooseFS or similar should allow bricks with distributed redundancy but less hardware, even LVM can provide mirroring - used in conjunction with MDADM it could provide a RAID1 type solution over 3 disks.
    – davidgo
    Commented Jul 26, 2018 at 0:20
  • Thank you for such a comprehensive answer. Can you elaborate on what types of problems could be introduced by manual mirroring, and how you imagine handling a URE would occur? Is it as simple as the controller attempting to read from one disk and automatically moving the read to another should it encounter a URE?
    – John S.
    Commented Jul 26, 2018 at 14:50
  • You're correct that in ZFS a mirrored vdev supports checksumming and multiple disk arrangements (zfsbuild.com/2010/05/26/zfs-raid-levels). Would be interesting to find out if this arrangement has a lower overhead (ZFS is very RAM intensive) than the far more common RaidZ, RaidZ2 and RaidZ3 arrangements.
    – John S.
    Commented Jul 26, 2018 at 15:20

Let's start with the assumptions:

If I understand the technology correctly, a Raid 1 (>2) configuration:

..cannot fail a rebuild, inasmuch as, as long as one drive is accessible, so are the files.

This is true, but it does not protect from lightning strikes, theft, flooding etc. Thus while you reduce one risk you still need off-site backups.

..provides greater security relative to the drive count. It's basically an inversion of the ratio right? RAID5 on three disks provides 66% usable capacity and can handle 33% (1 drive) failure. Raid1 provides only 33% capacity but 66% failure (2 drives).

That assumes that drive failures are independent. On SAS this might be the case (unless a drive fails spectacularly and also damages other drives), but it is not the case for PATA or SATA. Usually a hung disk means all drives on that controller will hang. You would still have your data, but you would also have downtime.

..has an easier and more robust recovery mechanism. In dire conditions, the RAID need not be rebuilt to access the data.

Most of the time the rebuild is not a problem, because nobody rebuilds RAID arrays. With terabyte disks it is faster to replace the disk, recreate the array and restore your data from backup.

From a business perspective the main goal of RAID is to keep the system running until 17:00; then you make sure that your daily backup works, followed by a new disk, a fresh RAID array and a restore from backup.

Drives can be migrated between different systems

This depends a lot on the RAID implementation. It may work. It may not.


Would a strategy of 'manually mirroring' the content, as opposed to RAID1's copy-at-time-of-writing, avoid potential problems with non-ECC memory corruption? Since the job of writing to each disk would be a distinct process, the same memory fault that occurs in one write would not, in theory, occur in another?

If you are lucky. But what would stop you from successfully and flawlessly copying a corrupted file?

Is there a RAID1 software/hardware designed to work with more than two drives, and can therefore handle traditionally un-raid1-like behaviours such as bit-rot, URE etc? With a traditional disk mirror setup if disk A reports a 0 and B reports a 1, you have no idea who flipped, but in a three or more disk solution could you not 'correct' the odd man out disk? A=0, B=1, C=0 thus B must have flipped?

There is this 'RAID1 flavour' called RAID5....

Seriously though, RAID3,4,5, and RAID6 come to mind. No need to force a mirror into something else.
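
For comparison with the three-way-mirror voting idea, the single-parity schemes named here recover a lost drive by XOR rather than by majority vote. A minimal sketch with two hypothetical data blocks (buffers and sizes are illustrative):

```python
# RAID5-style single parity: parity = XOR of all data blocks,
# so any ONE missing block can be rebuilt from the survivors.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1 = b"\x01\x02", b"\x0f\x0f"
parity = xor_blocks([d0, d1])        # stored on the third drive

# The drive holding d1 dies; rebuild its contents from the survivors.
rebuilt = xor_blocks([d0, parity])
print(rebuilt == d1)  # True
```

The trade-off versus the mirror: parity can rebuild a block whose drive has announced its failure, but with only one parity block it cannot tell *which* surviving copy is silently corrupted, whereas a 3-way mirror can outvote a single bad copy.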

At the same time, would such a RAID1 solution support parallel reads from more than two disks to provide accelerated read speeds? This is especially important since I imagine there are many businesses who read the same data many more times than they edit it.

RAID 5 would. And RAID 6, and ...

RAID is not a backup solution. So if you were to follow good practice and have between 2-3 storage locations, is it intended that only one of them is protected from drive failure via a RAID scheme or would you expect a second offsite location to also have multidisk redundancy?

I would want an off-line backup: one not accessible via the Internet and in a physically different location. A second backup which is reachable and updated daily is also nice, but mostly in case of fire and the like.

Monthly backups to really safe off-line storage and daily backups to tape/disk/cloud would be a nice addition to that.

Is it the case that even a single drive, stored in a different location is thought to be protected from device failure because it's in an 'effective' mirror relationship with the drives in the other locations?

RAID is not backup. A part of a RAID is not a proper backup. Not even a disk from a multidisk mirror. Depending on your RAID implementation it might work or it might not. A proper backup always works. Even after updating software (e.g. RAID drivers).

If not, then does RAID1 not offer some redundancy for the lowest minimum drive count (2)?

It offers redundancy against a non-readable sector or a broken disk.

And that is all that is typically needed. Enough redundancy to keep things up and running until you can do emergency maintenance.

