
I'm considering using a RAID 0 setup for one of our SQL Server clusters. I'll outline the situation and am looking for reasons why this may be a bad idea. Also, if you have use cases, white papers, or other documentation you can point me to on this topic, that would be great.

We have 3 servers in 2 datacenters that are part of an SQL cluster. They are all running SQL Server in an Availability Group. The primary has a replica sitting right next to it and another in the other datacenter. They are running synchronous replication with automatic failover. All drives are enterprise class SSDs. They will be running SQL Server 2017 or 2019.
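
For reference, the commit and failover modes described above can be confirmed with a DMV query along these lines (run on the primary; this is just a sketch, the column comments state what we expect to see):

```sql
-- Show each replica's commit mode and failover mode for the AG.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.availability_mode_desc,   -- expected: SYNCHRONOUS_COMMIT
       ar.failover_mode_desc        -- expected: AUTOMATIC where automatic failover is configured
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id;
```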

I'm thinking that there would be multiple benefits to running them on RAID 0 arrays over other methods, with few, if any, real drawbacks. The only negative I'm currently seeing is the lack of redundancy on the primary server, so the chance of it failing increases. As pros:

  1. If a drive fails, rather than running in a slowed, degraded state until someone receives a notice and manually acts on it, the server will immediately fail over to a secondary, maintaining full operational capability (see the configuration sketch after this list). This has the added benefit of notifying us of a failover, so we can investigate the cause sooner.

  2. It reduces the chance of failure overall per TB capacity. Since we don't need parity or mirror drives, we reduce the number of drives per array. With fewer drives there is less total chance of a drive failure.

  3. It is cheaper. Needing fewer drives for our required capacity obviously costs less.
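
For pro 1 to hold, database-level failover detection needs to be enabled on the AG, since by default automatic failover reacts to instance-level health rather than to a single database going offline. A minimal sketch (the AG name [AG1] is a placeholder):

```sql
-- Database-level health detection (SQL Server 2016+): if a database in the AG
-- goes offline, e.g. because its data array is lost, the AG fails over automatically.
ALTER AVAILABILITY GROUP [AG1] SET (DB_FAILOVER = ON);

-- FAILURE_CONDITION_LEVEL (default 3) separately controls which
-- instance-level conditions trigger failover.
```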

I know this isn't the conventional business thinking, but is there something I'm not considering? I'd love any input either pro or con.

I'm not trying to do this for query performance gains, though if there are meaningful ones feel free to point them out. My primary concern is failing to consider or address a reliability or redundancy issue that I haven't thought of.

The OS is on a separate mirrored drive, so the server itself should stay up. One of those drives can be replaced and mirrored again. It is small and there aren't any database files other than the system DBs on it. I can't imagine it taking more than minutes. If one of the data arrays fails, we replace the drive, rebuild the array, restore, and resync with the AG. In my personal experience, restoring has been MUCH faster than a RAID 5 drive rebuild. I haven't ever had a RAID 1 failure, so I don't know whether that rebuild would be faster or not. The restores would come from a backup and be rolled forward to match the primary, so the load increase on the primary server should be minimal, only syncing the last few minutes of logs with the recovered replica.
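
As a sketch of the recovery path I have in mind (the database, AG, and backup share names here are placeholders, not our real ones):

```sql
-- On the rebuilt replica: restore the latest full backup and logs WITH NORECOVERY,
-- then let the AG ship the remaining log from the primary.
RESTORE DATABASE [SalesDB]
    FROM DISK = N'\\backupshare\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

RESTORE LOG [SalesDB]
    FROM DISK = N'\\backupshare\SalesDB_log.trn'
    WITH NORECOVERY;

-- Rejoin the database to the AG; only the last few minutes of log then need to sync.
ALTER DATABASE [SalesDB] SET HADR AVAILABILITY GROUP = [AG1];
```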


4 Answers


There's one very important aspect I think you are missing in your assessment:

How do you plan to recover?

When RAID 5 loses a drive, it will run in a degraded state until it has recovered automatically. (At least if you have a hot spare at hand.)

When RAID 0 loses a drive, it cannot ever recover at all. This means you have lost redundancy, and to recover it, you need to rebuild your RAID 0 and copy all the data (not just the data on the broken drive) back from the secondary that is now under production load. That is, instead of the single degraded RAID 5 array, it's now your entire production setup that gets the performance hit.

If the RAID 5 (or RAID 6) degraded-state performance penalty isn't something you can cope with, you should probably do RAID 1+0 instead. Yes, it costs more, but disk prices being what they are, it's going to be money well spent.

Maybe "actively monitor the RAID 5 state, and transfer the load off the primary when a drive fails" is the solution that gives you most of the benefits without any drawbacks? (Apart from losing the coolness factor of running without any local redundancy, of course.) If your RAID 5 drive recovery is taking a lot longer than a complete database data sync, either your RAID software is acting strangely, or you have seriously oversized disks, I'd think.


Drive failure should be taken into consideration here.

Imagine for a second that our drives on any particular day have a 1/1000 failure rate. Imagine then that we have 20 drives in each of our 3 arrays.

The chance of a single drive failing in an array is therefore 20/1000 = 1/50. The chance of two drives failing within the same array is something close to 20/1000 * 20/1000 / 2 = 200/1000000 = 1/5000. So by switching from RAID 0 to RAID 5 we're already significantly less likely to kill one of our arrays.

So we can take this further - if the chance of a RAID 0 array failing on a given day is 1/50, then the chance of two RAID 0 arrays failing on the same day is 1/(50*50) = 1/2500. That is still twice as likely as a single RAID 5 array failing (1/5000), assuming the same disk set. This should concern you, because it means that with RAID 0 you are more likely to lose two entire arrays at once than you would be to lose even one RAID 5 array.
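
Written out under the same simplified assumptions (independent failures, a 1/1000 per-drive per-day failure rate, 20 drives per array), the per-day figures are roughly:

```latex
\begin{aligned}
P(\text{a RAID 0 array fails})              &\approx 20 \cdot \tfrac{1}{1000} = \tfrac{1}{50} \\
P(\text{a RAID 5 array fails})              &\approx \tbinom{20}{2} \cdot \left(\tfrac{1}{1000}\right)^{2} = \tfrac{190}{1\,000\,000} \approx \tfrac{1}{5000} \\
P(\text{two RAID 0 arrays fail in one day}) &\approx \left(\tfrac{1}{50}\right)^{2} = \tfrac{1}{2500}
\end{aligned}
```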

As these disks are likely to have a long lifetime, you can run the numbers as above and directly see what effect this will have on reliability - if you can post the drive specifications I can add that calculation to this post. Whether the risk is then acceptable or not is for your organisation to decide.

Another item to note is that the likelihood of correlated drive failures is increased by using SSDs manufactured within the same batch (same factory, same time). If you are not careful, you could end up with all 3 nodes going down because of this issue.

Disclaimer: The above calculations have been simplified - they are still relatively accurate.


"I'm thinking that there would be multiple benefits to running them on RAID 0 arrays over other methods, with few, if any, real drawbacks."

This is a pretty common configuration when running AGs with internal / direct-attached storage drives. Especially with NVMe or other PCI-based flash storage devices.

It simply amounts to treating a drive failure like a server failure. With a small number of solid-state drives, you don't really have a significantly lower MTBF for the drives than you do for the other solid-state components of the server, and so you simply treat each drive as a point of failure for the server, and replace/rebuild the server in the case of a drive failure.


I'm intrigued as to what you are trying to achieve. You mention yourself that you're not trying to get performance gains out of this setup, so what gain are you trying to get?

Note on the performance issue: if you're running Enterprise Class SSDs, is your RAID calculation really that much of a bottleneck that you need to improve it?

Taking your 3 pros, I don't think you've thought them through enough:

  1. Will SQL fail over straight away? What is going to cause the failover to trigger automatically? Will the server take the drive offline as soon as something hits it? What if it's just a bad sector on one disk? If SQL doesn't hit the bad sector, will it fail over? I'm not 100% sure on that.

  2. Does it really reduce the chance of failure overall per TB of capacity? Your thinking seems to be that fewer disks means fewer points of failure, but I don't think that's right. The odds of any one disk failing remain the same whether you have 1 disk or 10 disks (or 100 disks), but with RAID 0 a single disk failure also means a catastrophic failure of the array.

  3. Is one extra SSD going to cost too much more for you to get RAID 5? I get how RAID 1 or 1+0 could blow the budget, but 1 extra disk?

With no redundancy, if a disk fails and the RAID goes offline, that node will be offline until you rebuild the RAID and restore all of your databases from scratch. What process are you going to take to make that happen? You can't remove the database from the Availability Group, as that will stop the replication to DR, but if you don't take some action then the other two servers won't be able to truncate their log files. Is that OK? What happens if it fails on the Friday night of a long weekend? Is that still OK? Can your secondaries cope with that amount of log build-up?
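
If you do go down this road, at minimum script out how you'll deal with the log build-up while the node is gone. A rough sketch (database, AG, and server names are placeholders); note this drops the dead replica, not the database, from the AG:

```sql
-- On the surviving replicas: see what is preventing log truncation.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'SalesDB';

-- If the dead node will be out for days, one option is to remove that replica
-- from the AG (run on the primary) so the others can truncate their logs again.
ALTER AVAILABILITY GROUP [AG1] REMOVE REPLICA ON N'SQLNODE3';
```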

My last questions would be around the rebuild time that you mention will be quicker. Are you 100% sure it's going to be quicker? How much quicker?

Brent Ozar's server setup guide is still my go-to for setting up new SQL instances. The very first point in the guide is to validate that you're not using RAID 0 for any drives.

====UPDATE====

One extra thought: what happens when your secondary servers are out of sync with your primary? Even with synchronous replication, your secondaries can still automatically revert to async, and once they do you lose the ability to fail over automatically, as any failover would result in data loss. A couple of examples of when this could happen:

  1. Rebuilding a very large index - replication may fall behind on one or both of the secondaries.
  2. Disk failure on the RAID 0 while patching the secondary. The server you are patching may not be able to come back online due to the primary being offline.

They are edge cases, but they could be catastrophic depending on what is lost during those times.
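
Before trusting automatic failover, it's worth having a check like this in your monitoring, since a secondary that has fallen out of sync won't be failover-ready. A rough sketch using the standard AG DMVs:

```sql
-- Per database, per replica: synchronization state and whether this copy
-- is currently synchronized enough to fail over to without data loss.
SELECT ar.replica_server_name,
       dcs.database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       dcs.is_failover_ready
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.dm_hadr_database_replica_cluster_states AS dcs
    ON dcs.replica_id = drs.replica_id
   AND dcs.group_database_id = drs.group_database_id;
```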

  • Adding to your point on #3, if the cost of an extra disk (or three) is what makes or breaks the budget, then from where will the money come to replace it when one disk fails?
    – user
    Commented Aug 30, 2019 at 8:01
  • @Greg The fact that I might not have thought everything through is why I'm asking this question. I guess I would say I'm seeing where I can improve efficiency as a whole. To answer your questions: 1. Yes, the failure of the array will immediately cause the AG to fail over to a different node. A bad sector would depend on whether it was a recoverable bit error or not, but this would cause a failure whether the disk was in any kind of RAID or not. 2. Fewer disks would decrease the chance of failure IN the array. RAID 0 would increase the chance of failure OF the array. 3. No, the money savings is a perk.
    – zsqlman
    Commented Aug 30, 2019 at 14:24
  • @Greg Good follow-up questions and some I had not fully fleshed out. There are numerous layers of redundancy with the servers being tripled. Restoring all the databases can be easily scripted. If a node fails, we would kick that replica from the AG, removing the T-log backlog issue, and even if we don't remove the node, we have plenty of space to contain a few days' worth of log growth. Regarding recovery time, I only have one data point and don't have more spare hardware to test. We've only had 1 RAID failure; it took 2+ days to recover, and we can do the restores in 8-ish hours.
    – zsqlman
    Commented Aug 30, 2019 at 14:38
  • @zsqlman - I've added an extra example of when you might lose data because you don't have RAID. Also, I think the logic you apply to reduced failure is still flawed. The odds of one disk failing with fewer disks in the RAID are the same as one disk failing with redundancy in the RAID. Reducing the number of disks doesn't reduce the risk of any one disk failing - each disk is just as likely to fail as any other disk.
    – Greg
    Commented Aug 30, 2019 at 20:26
  • You are correct that each disk has the same odds of failure. Fewer disks mean fewer chances of failure.
    – zsqlman
    Commented Sep 5, 2019 at 18:50
