This is the setup: three SQL Servers in Azure, all running SQL Server 2019 RTM CU18, with AlwaysOn configured between them:
- Servers 1 & 2 in one region (synchronous commit + manual failover mode)
- Server 3 in a separate region (asynchronous commit + manual failover mode)
- Failure Condition Level = 3
"On critical server error. Specifies that an automatic failover is initiated on critical SQL Server internal errors, such as orphaned spinlocks, serious write-access violations, or too many memory dumps being generated in a short period of time.
This is the default level."
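For reference, the AG-level health settings can be inspected (and adjusted, if you want to experiment) along these lines; `AGName` is a placeholder for the actual availability group name:

```sql
-- Check the configured failure condition level and health check timeout
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Example only, not a recommendation: relax the health check timeout
-- (the default is 30000 ms)
ALTER AVAILABILITY GROUP [AGName]
SET (HEALTH_CHECK_TIMEOUT = 60000);
```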
The ongoing problem I am facing is that backups of a very large DB (19 TB, 7 TB compressed) take extremely long compared to its SQL 2016 counterpart.
Context: I am migrating from 2016 to 2019. The 2016 environment uses database mirroring, so no AlwaysOn there.
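For context, the backup is issued roughly like this on the secondary (database name and paths here are hypothetical):

```sql
-- Hypothetical backup of the large DB, run on the secondary replica;
-- COPY_ONLY is required when backing up a database on a secondary
BACKUP DATABASE [BigDB]
TO DISK = N'F:\Backups\BigDB_1.bak',
   DISK = N'F:\Backups\BigDB_2.bak'   -- striping across files is common at this size
WITH COPY_ONLY, COMPRESSION, CHECKSUM, STATS = 5;
```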
So this is what happened:
Server 2 is the primary replica
I am testing backups on Server 1 (the secondary replica), COPY_ONLY
obviously. I observed these errors in the SQL log about an hour into the backup:
Process 0:0:0 (0x22c0) Worker 0x0000028AAE660160 appears to be non-yielding on Scheduler 6. Thread creation time: 13331887877376. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 15%. System Idle 83%. Interval: 70844 ms.
Long Sync IO: Scheduler 7 had 1 Sync IOs in nonpreemptive mode longer than 1000 ms
It was around this point that the AG went offline and came back online.
I generated the cluster logs and dug into them, and saw this:
The node 2 is slow! The gum handler took 1296 ms on that node, average is 1 ms, acceptable is 1003 ms, is witness: false
Saw many of the above messages.
Also saw this:
[RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
[RES] SQL Server Availability Group <AGName>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
[RES] SQL Server Availability Group <AGName>: [hadrag] Resource Alive result 0.
[RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
[RCM] Res AGName: Online -> ProcessingFailure( StateUnknown )
This message sticks out the most to me:
[RCM] rcm::RcmGroup::Failover: Group AGName has failed but has no other node it can move to; delay-restarting
My thoughts (keep in mind that the backup is running on the secondary replica):
The backup process caused the non-yielding scheduler condition. This triggered failure condition level 3 and the AG started taking corrective action; since all nodes are set to manual failover, the AG was restarted instead of failing over.
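To confirm which replica held which role at the time, and how each is configured, the replica states can be checked with something like:

```sql
-- Replica roles and failover/availability modes across the AG
SELECT ar.replica_server_name,
       ars.role_desc,
       ar.failover_mode_desc,
       ar.availability_mode_desc
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ar.replica_id = ars.replica_id;
```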
My concern:
- Why is the backup causing this?
- More importantly, why is it attempting a failover for a failure condition that is happening on a secondary replica?
- What would have happened if the AG had been configured with automatic failover? Would it have failed the AG over to the secondary node where the failure condition originates?
This makes no sense to me.
Any thoughts?
Thanks