
Assume you need to make sure that an application relying on SQL Server 2012 as its database backend is available around the clock, even if one server machine fails.

As a developer rather than a DBA, I am struggling to understand which scenario to use for failover / high availability:

  • Two (or more) servers in a Windows Failover cluster, SQL Server as a clustered instance
  • Two (or more) SQL Server instances that are kept up to date with transactional replication
  • Two (or more) SQL Servers in a SQL Server Availability Group, configured in a synchronous commit mode

Which of these scenarios works for what kind of workload, and what kinds of failures / outages can each of them handle? Are they even comparable / interchangeable?

4 Answers


The way I always like to visualize high availability solutions is the following:

SQL Server Failover Cluster Instance (FCI)

What is highly available? The entire instance. That includes all server objects (logins, SQL Server Agent jobs, etc.), as well as the databases and the entities they contain. It's a great solution for highly available SQL Server instances, because the instance is the level of containment with this solution.

What about reporting? None, NULL, nonexistent. A failover cluster instance has an active node delivering the cluster group containing the instance, VNN, etc. and all other nodes are passive, sitting idle (as far as the current cluster group is concerned) and waiting for a failover.

What happens when there is failover? The downtime for an FCI is determined by the amount of time the passive node takes to grab the cluster resource and bring the SQL Server instance into a running state. This is typically brief.

Any client abstraction? Yes, this is going to be innately built in with the virtual network name for the failover cluster instance. This will always point to the active node that is currently delivering the SQL Server cluster resource.

AlwaysOn Availability Groups

What is highly available? The availability group is the logical unit of containment here: it consists of a number of databases and a virtual network name (the listener, an optional cluster resource). It is worth noting that server objects such as logins and SQL Server Agent jobs are not part of the HA solution, and special consideration needs to be taken to ensure that these are properly implemented alongside an availability group. Not an overly burdensome requirement, but it needs to be cared for.

What about reporting? This is a great solution for reporting, although I probably wouldn't use a synchronous replica as my reporting instance. There are two commit relationships: synchronous and asynchronous. In my opinion, and from what I've seen in practice, your synchronous secondary replica is there waiting for a disaster. Think of it as the replica that's ready to take a no-data-loss failover in the event of an issue. Then there are asynchronous replicas, which can handle the reporting workload. You aren't using such a replica as the aforementioned standby, but more for things like reporting. Reporting workloads can be pointed to this replica (either directly, or indirectly through read-only routing via the listener).

What happens when there is failover? For a synchronous-commit secondary replica that is paired with automatic failover, this will be the replica role state change from SECONDARY_NORMAL to PRIMARY_NORMAL. In order for there to be automatic failover, you need a synchronous secondary replica that is currently synchronized, and the Flexible Failover Policy determines when that failover should actually occur. That policy is indeed configurable.
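For reference, the flexible failover policy is exposed as a pair of availability group options in T-SQL; a minimal sketch, assuming a hypothetical availability group named [MyAG]:

```sql
-- Hypothetical AG name. FAILURE_CONDITION_LEVEL ranges from 1 (least
-- sensitive, failover only on instance down) to 5 (most sensitive).
ALTER AVAILABILITY GROUP [MyAG] SET (FAILURE_CONDITION_LEVEL = 3);

-- How long (in ms) the cluster waits on sp_server_diagnostics output
-- before declaring the instance unresponsive.
ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 60000);
```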

Any client abstraction? Yes, you can optionally configure an AlwaysOn Availability Group listener. This is basically just a virtual network name (visible through WSFC as a cluster resource in the AG's cluster group) that points to the current primary replica. This is a key part of shifting your reporting workload around, as is setting up a read-only routing list on any server where you want to redirect read-only traffic (this is requested through the connection string; with the .NET Framework Data Provider for SQL Server, it is the ApplicationIntent parameter, set to ReadOnly). You would also need to set a read-only routing URL for each replica that should receive this reporting workload while in the secondary replica role.
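As a sketch of that read-only routing configuration (the AG, replica, and host names here are hypothetical, not from the original post):

```sql
-- Each readable secondary advertises a read-only routing URL.
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2'
WITH (SECONDARY_ROLE (
    READ_ONLY_ROUTING_URL = N'TCP://sqlnode2.example.com:1433'));

-- The primary's routing list decides where ReadOnly-intent
-- connections arriving at the listener get redirected.
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE1'
WITH (PRIMARY_ROLE (
    READ_ONLY_ROUTING_LIST = (N'SQLNODE2')));
```

A reporting client would then connect to the listener with ApplicationIntent=ReadOnly in its connection string and be routed to SQLNODE2 while that replica holds the secondary role.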

Transactional Replication

What is highly available? This is arguable, but I'm going to say nothing. I don't see replication as a high availability solution whatsoever. Yes, data modifications are being pushed to subscribers, but we're talking at the publication/article level. This is going to be a subset of the data (it could include all the data, but that won't be enforced; e.g., if you create a new table in the publisher database, it will not automatically be pushed to the subscribers). As far as HA goes, this is bottom-of-the-barrel and I won't group it with the rock-solid HA solutions.

What about reporting? A great solution for reporting on a subset of data, no question about that. If you have a 1 TB database that is highly transactional and you want to keep that reporting workload off the OLTP database, transactional replication is a great way to push a subset of data to a subscriber (or subscribers) for the reporting workload. What happens if out of that 1 TB of data your reporting workload is only about 50 GB? This is a smart solution, and relatively configurable to meet your business needs.
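To illustrate that subset idea, here is a hedged sketch using the standard replication stored procedures (the database, publication, and table names are hypothetical, and a distributor is assumed to be configured already):

```sql
USE SalesDb;

-- Enable the database for publishing.
EXEC sp_replicationdboption
    @dbname = N'SalesDb', @optname = N'publish', @value = N'true';

-- Create a transactional publication.
EXEC sp_addpublication
    @publication = N'ReportingPub', @status = N'active';

-- Publish only the table(s) the reporting workload actually needs;
-- everything else in SalesDb stays off the subscriber.
EXEC sp_addarticle
    @publication = N'ReportingPub',
    @article = N'Orders',
    @source_object = N'Orders';
```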

Summary

What it boils down to is a handful of questions that need to be answered (partly by the business):

  1. What needs to be highly available?
  2. What does the SLA dictate for HA/DR?
  3. What kind of reporting will be taking place and what latencies are acceptable?
  4. What do we need to handle with geographically dispersed HA? (Storage replication is expensive, but a must with an FCI; AGs don't require shared storage between standalone instances, and you could use a file share witness for quorum, potentially eliminating the need for shared storage.)
  • Thanks for a great answer, Thomas! So if I understand correctly, FCI would automatically switch over to a "hot standby" server if the main machine goes down - right? What about AlwaysOn? Does that offer some kind of automatic "failover", too, or is it just a secondary copy of the database, but some admin needs to manually switch over, in case of a failure?
    – marc_s
    Commented Nov 22, 2013 at 20:10
  • @marc_s Glad to help! You are correct in your understanding about an FCI, provided that the WSFC itself doesn't go down (i.e. loses quorum) and that there is a passive node able to take the SQL Server cluster resource group in the event of failover. As for an AlwaysOn AG, yes, automatic failover is possible. I've edited my answer to include that information, but basically you need a synchronized secondary replica configured for automatic failover. You could also have a manual failover with no data loss to a synchronized secondary replica. Commented Nov 22, 2013 at 20:18
  • @ThomasStringer - this is very helpful. Thank you! I wonder if you could address making schema changes for each of the three options. We set up Transactional Replication only to find out that making schema changes on the publisher is really hard. What about AlwaysOn? Would we run into the same issue there as well? Commented Jan 2, 2018 at 17:04

two (or more) servers in a Windows Failover cluster, SQL Server as a clustered instance

  1. What kind of workload? "It depends" - but seriously, this is useful for an online application where you need local, in-data-center high availability. You are protected against the failure of one machine or one operating system. The logins, jobs, new databases, maintenance, etc. are all automatically kept in sync, because the cluster's two nodes are exactly the same and share the same storage, so they have the same system databases. Very fast failover, but there is still a hiccup that looks like a SQL Server restart when the failover occurs.

  2. Cons/Concerns - The single point of failure is your storage and all of its components. SAN vendors always say "SANs don't fail," but there are a lot of moving parts in a storage area network and, as I blogged about here, they can fail. Also - you are paying for a secondary server that can't do anything but hang around and wait. That said, you can do Active/Active/multi-node and have two active instances that can fail over in either direction, putting the second node to use.

  3. Automatic Failover? The "most" automatic. No witness needed, it's a cluster. This is the job of a cluster, to make it as seamless as possible. Now with any of these, when a failover happens you will "feel" it, because SQL has to start up or connections have to point. Here when it happens, you'll basically feel like a restart of SQL, DBs come back up and run recovery/etc.

If I have a client say "I want to be fully up with all databases, all logins, etc." in a high availability environment in my local data center, because they have an incredibly low tolerance for downtime, I would consider Failover Cluster Instances (though the last option you mention is a strong contender, save for some management overhead). I would probably do a local FCI plus an async AG secondary to protect against site failure or SAN failure.

two (or more) SQL Server instances that are kept up to date with transactional replication

  1. What kind of workload? I honestly wouldn't go here as a first choice for most high availability or disaster recovery needs. Not in SQL 2012, for sure. But basically this is good if you had to go to a data center that wasn't close and you couldn't use an AG (maybe a domain issue prevents you from using the Windows cluster required for an AG), or perhaps you wanted to stay on SQL Server Standard edition, which can do replication but not AGs, while still being able to read on the secondary side and remain asynchronous.
  2. Cons/Concerns - It's replication. It has overhead, it can get out of sync, you can develop problems with performance on the source side, etc.
  3. Automatic Failover - No. You have to manage it yourself, for example through CNAMEs that point to one instance or the other. You could theoretically write your own process to do this, but out of the box? Not here.

two (or more) SQL Servers in a SQL Server Availability Group, configured in a synchronous commit mode

This is what I've been helping people implement more and more lately, though sometimes I still go to clustering.

  1. What kind of workload? This is great when I have a manageable set of databases to keep in sync, plus the resources and time to make sure that jobs, logins, new databases, etc. stay in sync (though the team at SQL Skills has built a great add-on to automate some of this for you, making it an even stronger option). I like this when I want to keep things completely separate. I am protecting against hardware issues, OS issues, SQL installation issues, patching issues, and SAN/storage issues. I also get the benefit of having a secondary (if I want to pay for an Enterprise license for it) act as an active secondary that I can read from, take backups on, etc. Plus, in the future I can add a third, asynchronous secondary at a remote site and have failover/DR.
  2. Cons/Concerns - Licensing, the maximum number of replicas, the licensing cost to take advantage of some of the biggest benefits (active secondaries), the Enterprise edition requirement, and the need for twice as much storage as clustering.
  3. Automatic Failover - Yes. This can occur with a witness set up for quorum, and your app developers can connect to the listener instead of a node, so on failover the listener repoints to the new primary and you should be good there. So yes, you can do that here - and should - but of course you should test it well.
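As a rough sketch of what that synchronous, automatic-failover pair looks like in T-SQL (all names, IPs, and endpoint URLs are hypothetical, and the WSFC and database mirroring endpoints must already exist):

```sql
CREATE AVAILABILITY GROUP [MyAG]
FOR DATABASE [SalesDb]
REPLICA ON
    N'SQLNODE1' WITH (
        ENDPOINT_URL = N'TCP://sqlnode1.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC),
    N'SQLNODE2' WITH (
        ENDPOINT_URL = N'TCP://sqlnode2.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC);

-- The listener gives applications one stable name to connect to,
-- regardless of which replica currently holds the primary role.
ALTER AVAILABILITY GROUP [MyAG]
ADD LISTENER N'MyAGListener' (
    WITH IP ((N'10.0.0.50', N'255.255.255.0')), PORT = 1433);
```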

Summary

HA and DR are different. And these technologies help provide pieces of either. High Availability means (to me) that you can quickly recover if something bad happens to one machine, you are up with a short Recovery Point Objective and Recovery Time Objective. That is clustering and a synchronous AG.

Disaster Recovery is "you can get up when you have a failure even in your HA solution. To me that can be AGs when you go to another data center, mirroring or even replication.

  • Great stuff - thanks! So given the choice between failover clustering and AG - which do you choose when? Clustering seems to be more mature (been around longer), but AG seems to offer more flexibility (in that the "secondary" can be used for e.g. reporting etc.). So when and why would you still use failover clustering then?
    – marc_s
    Commented Nov 22, 2013 at 20:35
It depends, but where I see it more is when someone is either more comfortable with it, or there is a lot of churn on the server and they just want to cover (for HA) every database, every job, etc. It's "easier" in that regard. And then those folks may want to use an async AG to a DR site for that extra DR protection. I also see it when people balk at Enterprise licensing prices and don't need the features but still want HA - clustering a single instance (an FCI) works in Standard edition, etc.
    – Mike Walsh
    Commented Nov 22, 2013 at 20:47
  • @marc_s Clustering (FCI) and AGs are not mutually exclusive. You can have Node1 and Node2 clustered in the same data center (sharing storage) and run an AG to a third standalone instance in a remote data center (in the same cluster, but not sharing storage).
    – DaniSQL
    Commented Nov 22, 2013 at 20:48
  • I feel compelled to mention that if you have a local active-passive FCI and a remote passive server using AlwaysOn you will have 2 passive servers and will have to pay for a full license for one of them. :(
    – influent
    Commented Nov 26, 2014 at 21:31

It is also important to consider what is shared.

Failover Clustering uses two or more server nodes sharing one disk array. If the disk array goes down then you lose service, regardless of how many server nodes there are. If the server room where that disk array is located catches fire or floods then you lose service.

AlwaysOn Availability Groups and Database Mirroring are a "shared nothing" clustering technology. The database is present on multiple disk arrays in multiple servers. If you have good network links then the multiple serves can be in multiple server rooms, protecting you against fires and floods.


Just for completeness, there is the option of using plain old mirroring. The advantages here include having two copies of the database without the complexity of Availability Groups, and without needing the shared storage of Failover Clustering. The disadvantage, although slight, is that mirroring is deprecated.

Failover times with mirroring are on the order of 10 seconds, although the application code needs to be able to retry any transactions that are occurring at the time of failover.
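Setting up a mirror is comparatively simple; a hedged sketch (the database and host names are hypothetical, the mirroring endpoints must already exist, and the mirror copy must first be restored WITH NORECOVERY):

```sql
-- On the mirror server (after RESTORE DATABASE SalesDb ... WITH NORECOVERY):
ALTER DATABASE SalesDb SET PARTNER = N'TCP://principal.example.com:5022';

-- Then on the principal server:
ALTER DATABASE SalesDb SET PARTNER = N'TCP://mirror.example.com:5022';

-- Optional: a witness enables automatic failover in high-safety mode.
ALTER DATABASE SalesDb SET WITNESS = N'TCP://witness.example.com:5022';
```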

  • +1 for bringing it up separately and specifically :) That said - yes, you can certainly argue that mirroring is less complex and doesn't have the cluster requirements, or the domain requirements that come with those, the way AGs do. So there is still certainly complexity, and a need to keep logins, jobs, new databases, etc. in sync, just like with AGs. So it has some of those same costs and, as you said, is deprecated. But I still set up and deploy new mirrors today for folks :)
    – Mike Walsh
    Commented Nov 23, 2013 at 0:01
