What we've already done

Eliminating 600-Second Transaction Latencies in SQL Server AG

Client: Enterprise Customer

Industry: Retail Electronics (B2B/B2C)

Technology: SQL Server Always On Availability Groups (AG)

Primary Challenge: Critical performance degradation and extreme HADR_SYNC_COMMIT wait times.

1. Executive Summary

The client reported a massive spike in transaction wait times, specifically the HADR_SYNC_COMMIT wait type. Preliminary observations suggested that transactions on the Primary node were stalling while waiting for hardening confirmations from the Secondary Replica. Our mission was to identify the root cause. Whether it was a SQL Server internal bottleneck, resource contention, or network infrastructure failure.

2. The Problem

During peak nightly workloads, the system became virtually unresponsive. Key symptoms included:

Extreme Latency: 4-byte inserts (minimal transactions) were taking up to 595 seconds to complete.
Massive Wait Accumulation: Total HADR_SYNC_COMMIT waits ranged between 500k and 800k seconds per day.
Nightly Peaks: 50% of all daily wait times occurred between 01:00 and 03:00 AM.

3. Technical Diagnostic Methodology

To move beyond guesswork, we executed three structured tests:

TEST 1: AG Metrics & Log Throughput

Tooling: Performance Monitor (PerfMon) and custom T-SQL jobs logging to a control database (SMT_Operations).
Metrics: log_send_queue, redo_queue, and Log Bytes Flushed/sec.
Finding: Nightly procedures (Import/Export) were generating log data at 200 MB/s. This flooded the log send queue with gigabytes of data, forcing small transactions to wait behind massive log blocks.

TEST 2: Transaction Round-Trip Measurement

Method: A dedicated job (SMT_Adhoc_InsertTest) performed a 1-row insert every 10 seconds.
Metrics: Time elapsed between transaction start and ACK (Acknowledgement) from the secondary.
Finding: Confirmed regular daytime latencies of 500ms–4s, spiking to 6+ minutes at night. AG "Flow Gates" remained at 0, meaning the AG was not internally throttling the throughput.

TEST 3: Internal Module Latency (Extended Events)

Method: 10-minute high-granularity Extended Events session capturing 20+ internal SQL modules.
Finding: The longest waits occurred in the Primary-Send and Primary-RemoteHarden modules. This pointed directly to the network layer’s inability to pass data from the SQL capture log to the wire.

4. Root Cause Analysis

The investigation shifted to the physical and logical network layer. The findings were twofold:

Hardware Degradation: A physical audit revealed that 2 out of 4 paths to the secondary node had autonegotiated down to 100 Mbit instead of 1 Gbit.
Routing Misconfiguration: The SQL Server Endpoint URLs were configured via FQDNs that resolved to Public IP addresses. This forced all AG synchronization traffic over the "Public" interface (Team #1) instead of the intended high-speed "Private" interface (Team #2).

5. The Solution

We proposed and implemented a complete infrastructure overhaul:

Redesign: Transitioned to a 2x10Gbit Consolidated Interface.
Hardware: Replaced legacy core switches with dedicated DELL SQL Cluster switches.
Network Logic: Implemented an untagged VLAN 2 (Access Link) for internal cluster communication.
Optimization: While compression is default-on for synchronous AGs, the 10Gbit throughput rendered compression unnecessary for this workload to save CPU.

6. Results

Following the implementation of the new 10Gbit infrastructure and corrected routing:

Zero Latency Spikes: HADR_SYNC_COMMIT wait times returned to baseline (near-zero) levels.
Stability: Nightly export jobs no longer impact synchronous transaction processing.
Throughput: The system is now capable of sustaining over 230 MB/s of log throughput without affecting application response times.

7. Conclusion

This case study highlights that SQL Server performance is only as fast as its slowest underlying component. By using a data-driven approach with Extended Events and PerfMon, we were able to prove that the "SQL problem" was actually a "Network problem," saving the client from unnecessary software-level troubleshooting and resolving the core issue at the hardware level.

Based on this experience, we have updated our SMT monitoring (available from version 1.11) with the metrics used to resolve this issue, making them available as baseline data for all users.