Most Always On Availability Group deployments work fine — right up until the moment they're actually needed. The architecture looks complete on paper: replicas configured, listener registered, synchronization showing healthy. Then a real failover happens, and assumptions that were never tested turn into an extended outage.
The gap is almost always the same: teams configure for the happy path and never validate the failure path.
Synchronous vs. Asynchronous Commit Is a Business Decision, Not a Technical Default
Synchronous-commit replicas protect against data loss on failover as long as the secondary is in the SYNCHRONIZED state, but every transaction waits for the secondary to harden the log before the commit is acknowledged on the primary, which means network latency between replicas becomes part of every write's response time. Asynchronous-commit replicas remove that wait, at the cost of a data-loss window: any log the primary has not yet sent when it fails is lost, and that window grows with write volume and network lag rather than staying fixed. The right choice depends on what the business actually needs: a trading platform and an internal reporting database have very different tolerances for both latency and data loss, and the AG should be designed around that tolerance explicitly, not defaulted to whatever the wizard suggests.
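To put rough numbers on that trade-off, here is a minimal Python sketch. It is a back-of-the-envelope model, not a measurement tool: the `ReplicaLink` fields, the function names, and every figure in it are illustrative assumptions, and it simplifies by treating the local and remote log flushes as running in parallel.

```python
from dataclasses import dataclass


@dataclass
class ReplicaLink:
    """Hypothetical timings for one primary/secondary pair (all values assumed)."""
    local_log_flush_ms: float    # time to harden the log on the primary
    network_rtt_ms: float        # round trip between primary and secondary
    remote_log_flush_ms: float   # time to harden the log on the secondary


def sync_commit_latency_ms(link: ReplicaLink) -> float:
    """Rough commit latency under synchronous commit.

    Simplification: the local flush and the remote path overlap, so the
    commit waits for whichever of the two finishes last.
    """
    remote_path = link.network_rtt_ms + link.remote_log_flush_ms
    return max(link.local_log_flush_ms, remote_path)


def async_data_loss_seconds(send_queue_kb: float, log_rate_kb_per_s: float) -> float:
    """Approximate seconds of committed work at risk under asynchronous commit."""
    if log_rate_kb_per_s <= 0:
        return 0.0
    return send_queue_kb / log_rate_kb_per_s


same_site = ReplicaLink(local_log_flush_ms=1.0, network_rtt_ms=0.5, remote_log_flush_ms=1.0)
cross_region = ReplicaLink(local_log_flush_ms=1.0, network_rtt_ms=40.0, remote_log_flush_ms=1.0)

print(sync_commit_latency_ms(same_site))      # 1.5
print(sync_commit_latency_ms(cross_region))   # 41.0
print(async_data_loss_seconds(send_queue_kb=51200, log_rate_kb_per_s=2048))  # 25.0
```

With these made-up inputs, moving the synchronous secondary across regions takes each commit from about 1.5 ms to about 41 ms, while an asynchronous replica with 50 MB of unsent log puts roughly 25 seconds of committed work at risk. Those are the two numbers to bring to the business conversation.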
Quorum Strategy Determines Whether Automatic Failover Actually Happens
Windows Server Failover Clustering quorum models — node majority, node and file share majority, cloud witness — decide what happens when replicas lose contact with each other. Get this wrong and you get one of two failure modes: a split-brain scenario where two replicas both believe they're primary, or a cluster that simply won't fail over because it can't establish quorum. For two-replica configurations especially, a witness (file share or cloud) isn't optional — without one, losing a single node can take the entire cluster offline.
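The witness argument comes down to simple vote arithmetic, which a short Python sketch can make concrete. This assumes a plain strict-majority rule with one vote per node and one for the witness; it deliberately ignores the dynamic quorum and dynamic witness adjustments that Windows Server Failover Clustering can apply, and the function names are hypothetical.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Strict majority: more than half of all configured votes must be reachable."""
    return votes_online > total_votes / 2


def survives_single_node_loss(nodes: int, witness: bool) -> bool:
    """Does the cluster keep quorum if exactly one node fails?

    Assumes one vote per node, one vote for the witness, and that the
    witness itself stays reachable.
    """
    total = nodes + (1 if witness else 0)
    online = total - 1  # one node is down
    return has_quorum(online, total)


print(survives_single_node_loss(nodes=2, witness=False))  # False
print(survives_single_node_loss(nodes=2, witness=True))   # True
print(survives_single_node_loss(nodes=3, witness=False))  # True
```

The same rule is what prevents split-brain: when the network partitions, at most one side can hold more than half of the votes, so at most one side stays online.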
Readable Secondaries Need Their Own Capacity Plan
Offloading reporting and read-only workloads to a secondary replica is one of the most valuable features of Always On, and one of the most commonly mis-sized. A readable secondary still has to apply the redo log from the primary in real time. If reporting queries on that secondary saturate CPU or I/O, redo falls behind. The replica can still report as synchronized, because synchronous commit only waits for the log to be hardened, not redone. But readers see increasingly stale data, and a failover has to work through the redo backlog before the databases come online, so your recovery time quietly grows. Size readable secondaries for redo headroom first, reporting workload second.
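One way to reason about redo headroom is to turn the backlog into time. The Python sketch below assumes you have already sampled `redo_queue_size` (KB) and `redo_rate` (KB/sec) from `sys.dm_hadr_database_replica_states`; the function names and the sample figures are illustrative, not taken from a real system.

```python
def redo_catchup_seconds(redo_queue_kb: float, redo_rate_kb_per_s: float) -> float:
    """Estimated time for the secondary to drain its redo backlog.

    Assumes no new log arrives while it drains, so treat it as a lower bound.
    """
    if redo_queue_kb <= 0:
        return 0.0
    if redo_rate_kb_per_s <= 0:
        return float("inf")  # redo is stalled, the backlog never drains
    return redo_queue_kb / redo_rate_kb_per_s


def redo_headroom_ratio(redo_rate_kb_per_s: float, log_generation_kb_per_s: float) -> float:
    """How many times faster the secondary redoes log than the primary generates it.

    A ratio at or below 1.0 means the backlog cannot shrink under current load.
    """
    if log_generation_kb_per_s <= 0:
        return float("inf")
    return redo_rate_kb_per_s / log_generation_kb_per_s


# Illustrative numbers: a 600 MB backlog being redone at 5 MB/s.
print(redo_catchup_seconds(redo_queue_kb=614400, redo_rate_kb_per_s=5120))           # 120.0
print(redo_headroom_ratio(redo_rate_kb_per_s=5120, log_generation_kb_per_s=4096))    # 1.25
```

In this example the secondary would need about two minutes of redo before it could come online after a failover, and it has only 25% headroom over the primary's log generation rate, which a heavy reporting query could easily consume.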
The Listener Is a Single Point of Configuration Failure
The availability group listener is what makes failover transparent to applications, but only if every application actually connects through it, and only if MultiSubnetFailover=True is set in connection strings for multi-subnet configurations. A surprising number of "AG failover didn't work" incidents trace back to an application connecting directly to a replica's instance name instead of the listener, or to a missing multi-subnet flag. Without that flag, the client may try the listener's IP addresses one at a time and wait out a TCP connection timeout on the address that is offline, instead of attempting them in parallel and reconnecting quickly.
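Both mistakes are easy to catch with a configuration audit. The following Python sketch shows one possible check, under stated assumptions: the listener and server names are made up, the parser is deliberately naive (it does not handle quoted values that contain semicolons), and it accepts either `True` or `Yes` for the flag because different drivers spell it differently.

```python
SERVER_KEYS = ("server", "data source", "address", "addr", "network address")
TRUTHY = {"true", "yes"}


def parse_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict with lower-cased keys."""
    settings = {}
    for part in conn_str.split(";"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def target_host(settings: dict) -> str:
    """Return the host the client connects to, without 'tcp:' prefix or port."""
    for key in SERVER_KEYS:
        if key in settings:
            value = settings[key].lower()
            if value.startswith("tcp:"):
                value = value[4:]
            return value.split(",", 1)[0].strip()
    return ""


def audit_connection_string(conn_str: str, listener_names: set, multi_subnet: bool) -> list:
    """Return a list of findings; an empty list means the string passed both checks."""
    settings = parse_connection_string(conn_str)
    findings = []
    host = target_host(settings)
    if host not in {name.lower() for name in listener_names}:
        findings.append(f"connects to '{host}', not the listener")
    if multi_subnet and settings.get("multisubnetfailover", "").lower() not in TRUTHY:
        findings.append("MultiSubnetFailover is not enabled")
    return findings


listeners = {"ag-listener.corp.example"}  # hypothetical listener DNS name
good = "Server=tcp:ag-listener.corp.example,1433;Database=Sales;MultiSubnetFailover=True"
bad = "Server=SQLNODE1\\PROD;Database=Sales"

print(audit_connection_string(good, listeners, multi_subnet=True))  # []
print(audit_connection_string(bad, listeners, multi_subnet=True))
```

Run against every connection string in your application configuration, a check like this surfaces direct-to-replica connections before a failover does.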
Contained Availability Groups Solve a Real, Specific Problem
Standard Availability Groups replicate user databases but not the logins, SQL Agent jobs, and other instance-level objects those databases depend on, which is a common source of "the failover worked but the application can't log in" incidents. Contained Availability Groups (introduced in SQL Server 2022) close that gap by giving the group its own contained copies of the master and msdb system databases, so this metadata fails over with the group instead of having to be kept in sync with manual scripts or third-party tooling.
Test Failover Like You Mean It
A planned manual failover during a maintenance window tells you the happy path works. It does not tell you what happens when a replica fails mid-transaction under production load, when the witness becomes unreachable, or when an application's connection pool doesn't recycle correctly after a forced failover. If your failover testing has never included an unplanned, under-load scenario, you don't actually know your recovery time objective — you know your best case.
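Measuring the result of such a test does not need much tooling. Below is a minimal Python sketch that turns a log of connection probes into an observed outage duration. It assumes a separate probe loop (not shown) has recorded `(timestamp_seconds, succeeded)` pairs through the listener during the test; the function name and the sample data are hypothetical.

```python
def longest_outage_seconds(probes: list) -> float:
    """Longest gap between the first failed probe and the next successful one.

    `probes` is a time-ordered list of (timestamp_seconds, succeeded) tuples.
    An outage still in progress at the end of the log is measured up to the
    last probe, so the result is then a lower bound.
    """
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None and probes:
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Illustrative probe log: one probe per second, failing from t=2 through t=4.
probe_log = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
             (4.0, False), (5.0, True), (6.0, True)]

print(longest_outage_seconds(probe_log))  # 3.0
```

Compare the number from an unplanned, under-load test with the one from a planned failover. The difference is the part of your recovery time objective you had not measured yet.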
Monitor What Actually Predicts Failure
The Always On dashboard in SSMS shows current state, not trend. The DMVs that matter for catching problems before they become incidents are sys.dm_hadr_database_replica_states (for log_send_queue_size and redo_queue_size, which reveal a secondary falling behind before it becomes "unhealthy") and sys.dm_hadr_availability_replica_states (for replica connection state and synchronization health). Both return only a point-in-time snapshot, so sample them on a schedule and keep the history if you want to see a trend. Alert on growing redo queues, not just on a replica going fully offline: by the time the dashboard turns red, the gap has usually existed for a while.
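As a sketch of what "alert on growth" can look like, the Python below flags a queue that has risen across several consecutive samples. The polling query is included only as a string for reference; executing it needs a database driver and scheduler that are not shown. The window size, the growth threshold, and the function name are assumptions to tune for your own workload.

```python
# Query to run on a schedule; executing it requires a driver that is not shown here.
POLL_QUERY = """
SELECT database_id, replica_id, synchronization_state_desc,
       log_send_queue_size, redo_queue_size
FROM sys.dm_hadr_database_replica_states;
"""


def is_queue_growing(samples_kb: list, min_samples: int = 4,
                     min_growth_kb: float = 1024.0) -> bool:
    """True when the latest `min_samples` readings rise at every step
    and the total rise across that window is at least `min_growth_kb`.
    """
    if len(samples_kb) < min_samples:
        return False
    window = samples_kb[-min_samples:]
    rising = all(later > earlier for earlier, later in zip(window, window[1:]))
    return rising and (window[-1] - window[0]) >= min_growth_kb


# Illustrative redo_queue_size histories (KB), oldest sample first.
noisy_but_stable = [100, 90, 120, 80]
falling_behind = [200, 900, 2100, 4800]

print(is_queue_growing(noisy_but_stable))  # False
print(is_queue_growing(falling_behind))    # True
```

A check like this fires while the replica still shows as healthy, which is the point: the trend is the early warning, and the red dashboard is the late one.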
If you want a structured review of your own Always On configuration — including an actual unplanned-failover test — request a SQL Server Health Check.