This is the setup: three SQL Servers in Azure, all running SQL Server 2019 RTM CU18, with AlwaysOn configured between them:
- Servers 1 & 2 in one region (synchronous commit + manual failover mode)
- Server 3 in a separate region (asynchronous commit + manual failover mode)
- Failure Condition Level = 3
"On critical server error. Specifies that an automatic failover is initiated on critical SQL Server internal errors, such as orphaned spinlocks, serious write-access violations, or too many memory dumps being generated in a short period of time.
This is the default level."
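For reference, the AG-level health settings can be inspected (and adjusted, if you want to experiment) along these lines; `AGName` is a placeholder for the actual availability group name:

```sql
-- Check the configured failure condition level and health check timeout
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Example only, not a recommendation: relax the health check timeout
-- (the default is 30000 ms)
ALTER AVAILABILITY GROUP [AGName]
SET (HEALTH_CHECK_TIMEOUT = 60000);
```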
The ongoing problem I am facing is that backups of a very large DB (19 TB, 7 TB compressed) take extremely long compared to its SQL 2016 counterpart.
Context: I am migrating from 2016 to 2019. The 2016 environment uses database mirroring, so no AlwaysOn there.
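For context, the backup is issued roughly like this on the secondary (database name and paths here are hypothetical):

```sql
-- Hypothetical backup of the large DB, run on the secondary replica;
-- COPY_ONLY is required when backing up a database on a secondary
BACKUP DATABASE [BigDB]
TO DISK = N'F:\Backups\BigDB_1.bak',
   DISK = N'F:\Backups\BigDB_2.bak'   -- striping across files is common at this size
WITH COPY_ONLY, COMPRESSION, CHECKSUM, STATS = 5;
```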
So this is what happened:
Server 2 is the primary replica
I am testing backups on Server 1 (the secondary replica), COPY_ONLY
obviously. I observed these errors in the SQL log about an hour into the backup:
Process 0:0:0 (0x22c0) Worker 0x0000028AAE660160 appears to be non-yielding on Scheduler 6. Thread creation time: 13331887877376. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 15%. System Idle 83%. Interval: 70844 ms.
Long Sync IO: Scheduler 7 had 1 Sync IOs in nonpreemptive mode longer than 1000 ms
It was around this point that the AG went offline and came back online.
I generated the cluster logs and dug into them, and saw this:
The node 2 is slow! The gum handler took 1296 ms on that node, average is 1 ms, acceptable is 1003 ms, is witness: false
Saw many of the above messages.
Also saw this:
[RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
[RES] SQL Server Availability Group <AGName>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
[RES] SQL Server Availability Group <AGName>: [hadrag] Resource Alive result 0.
[RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
[RCM] Res AGName: Online -> ProcessingFailure( StateUnknown )
This message sticks out the most to me:
[RCM] rcm::RcmGroup::Failover: Group AGName has failed but has no other node it can move to; delay-restarting
My thoughts (keep in mind that the backup is running on the secondary replica):
The backup process caused the non-yielding scheduler condition. This triggered failure condition level 3 and the AG started taking corrective action; since all nodes are set to manual failover, the AG was restarted instead of failing over.
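To confirm which replica held which role at the time, and how each is configured, the replica states can be checked with something like:

```sql
-- Replica roles and failover/availability modes across the AG
SELECT ar.replica_server_name,
       ars.role_desc,
       ar.failover_mode_desc,
       ar.availability_mode_desc
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ar.replica_id = ars.replica_id;
```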
My concern:
- Why is the backup causing this?
- More importantly, why is it attempting a failover for a failure condition that is happening on a secondary replica?
- What would have happened if the AG had been configured with automatic failover? Would it have failed the AG over to the secondary node where the failure condition originates?
This makes no sense to me.
Any thoughts?
Thanks