-
SQL Server Replication Monitoring Scripts
In this post I’m sharing scripts and tips that will help you monitor, identify and troubleshoot SQL Server Replication latency and issues.
The following is included in this article:
1. Check Latency in Log Reader Agent.
2. Check Latency in Distribution Agent.
3. Check Undistributed Commands.
4. List All Publications and Subscriptions.
5. Troubleshooting Replication Issues.Having SQL scripts to hand for checking SQL Server Replication issues is key for your tool-belt as a DBA. It’s our job to ensure Replication is flowing without errors, and the data is accurate/up-to-date on the Replication Subscribers.
if you need more information and understanding on SQL Server Replication, I recommend you read this recent Microsoft article:
Effectively troubleshoot latency in SQL Server Transactional replication
Or watch the YouTube Video.1. Check Latency in Log Reader Agent
This script monitors the Log Reader Agent activity in SQL Server replication, providing details on its status, duration, errors, and performance metrics such as:
– Status: Current status of the agent (e.g., Start, Succeed, In Progress).
– Latency: Delivery latency (time delay in ms between transaction generation and delivery).
– Transaction Stats: Delivered transactions, commands, and average delivery rate.
– Error Information: Details of any replication errors encountered.It helps identify issues early in the replication process, such as delays in reading the transaction log, which could eventually affect the Distribution Agent and overall replication latency.
-- Query to monitor log reader agent activity and its status, duration, and errors in SQL Server replication. USE distribution; GO SELECT a.name AS agent_name, CASE [runstatus] WHEN 1 THEN 'Start' WHEN 2 THEN 'Succeed' WHEN 3 THEN 'In progress' WHEN 4 THEN 'Idle' WHEN 5 THEN 'Retry' WHEN 6 THEN 'Fail' END AS Status, [start_time], h.[time], -- Time the message is logged. [duration], -- Duration in seconds. [comments], h.[xact_seqno], -- Last processed transaction sequence number. [delivery_time], -- Time first transaction delivered. [delivered_transactions], -- Total transactions delivered. [delivered_commands], -- Total commands delivered. [average_commands], -- Avg. commands delivered. [delivery_rate], -- Avg. commands per second. [delivery_latency], -- Latency in ms. [error_id], -- ID of the error in MSrepl_error table. e.error_text -- Error text. FROM [distribution].[dbo].[MSlogreader_history] h JOIN MSlogreader_agents a ON a.id = h.agent_id LEFT JOIN MSrepl_errors e ON e.id = h.error_id ORDER BY h.time DESC;
2. Check Latency in Distribution Agent
This script monitors the Distribution Agent activity in SQL Server transactional replication, focusing on latency and delivery metrics. It provides the following details:
– Status: Current status of the Distribution Agent (e.g., Start, Succeed, In Progress).
– Latency: Delivery latency in milliseconds, including current delivery latency since the last entry and overall delivery latency.
– Transaction Stats: Delivered transactions, delivered commands, and average commands delivered.
– Delivery Rate: Average commands per second delivered by the agent.
– Error Information: Error ID and text from the MSrepl_errors table for any issues encountered.This script helps track the Distribution Agent’s performance, measure latency, and identify any issues or delays in delivering transactions to Subscribers.
-- Query to monitor distribution agent activity, status, delivery metrics, and errors in SQL Server replication. USE distribution; go SELECT a.name AS agent_name, CASE [runstatus] WHEN 1 THEN 'Start' WHEN 2 THEN 'Succeed' WHEN 3 THEN 'In progress' WHEN 4 THEN 'Idle' WHEN 5 THEN 'Retry' WHEN 6 THEN 'Fail' END AS Status, [start_time], h.[time], -- Time the message is logged. [duration], -- Duration in seconds. [comments], h.[xact_seqno], -- Last processed transaction sequence number. [current_delivery_rate], -- Avg. commands/sec since last entry. [current_delivery_latency], -- Latency in ms since last entry. [delivered_transactions], -- Total transactions delivered. [delivered_commands], -- Total commands delivered. [average_commands], -- Avg. commands delivered. [delivery_rate], -- Avg. commands/sec. [delivery_latency], -- Latency in ms. [total_delivered_commands], -- Total commands delivered since creation. [error_id], -- Error ID in MSrepl_error table. e.error_text -- Error text. FROM MSdistribution_history h JOIN MSdistribution_agents a ON a.id = h.agent_id LEFT JOIN MSrepl_errors e ON e.id = h.error_id ORDER BY h.time DESC;
3. Check Undistributed Commands
This script checks for undistributed commands in SQL Server replication by querying the MSdistribution_status table.
It provides the following details:
– Undistributed Commands: The count of commands that have not yet been distributed.
– Publication Name: The name of the publication.
– Subscriber Database: The database on the Subscriber.USE Distribution; GO SELECT publication AS PublicationName, subscriber_db AS SubscriberDatabase, COUNT(*) AS UndistributedCommands FROM dbo.MSdistribution_status GROUP BY publication, subscriber_db ORDER BY UndistributedCommands DESC;
A high number of undistributed commands may indicate issues with the distribution agent or network performance.
4. List All Publications and Subscriptions
If you want to see all active publications and subscriptions in your environment, this script helps track the replication topology by listing all publications and their subscribers.
-- List Publications USE [YourPublicationDatabase]; GO SELECT name AS PublicationName, publication_type AS PublicationType, allow_push AS PushEnabled, allow_pull AS PullEnabled, allow_anonymous AS AllowAnonymous FROM sys.publications; GO -- List Subscriptions USE Distribution; GO SELECT publication AS PublicationName, subscriber_db AS SubscriberDatabase, subscriber AS SubscriberServer, subscription_type AS SubscriptionType, status AS SubscriptionStatus FROM dbo.MSsubscriptions; GO
Alternatively, you can use this script that helps display your Replication Topology with much more info, including Subscription configuration options and Articles with size info. You run it against your Distribution Database (and, if applicable, the Publisher database when the Distributor also acts as the Publisher) to gather information about the replication topology. It helps us understand the replication structure and displays important details needed to diagnose and troubleshoot complex replication issues.
5. Troubleshooting Replication Issues
To resolve replication issues, follow these steps:
1. Use Replication Monitor: Start by checking Replication Monitor. It provides a real-time view of your replication environment, showing the status of agents, latency, and undistributed commands. This tool is invaluable for quickly identifying where issues may lie.
2. Check Agent Status: If the Log Reader Agent or Distribution Agent is failing or showing issues, you can stop and restart the agent jobs. Sometimes, simply retrying the jobs can resolve transient connection issues or allow replication to go through successfully.
3. Undistributed Commands: High numbers of undistributed commands may indicate a distribution issue. Check the MSdistribution_status table to see if commands are backing up. If this is the case, investigate the Distribution Agent’s performance and check for network or resource issues.
4. Tailor Replication Options: For some issues, you may need to adjust replication options, such as using the –SkipErrors option to bypass specific errors temporarily. However, be cautious with these settings as they might mask underlying issues.
5. Review System Resources: If replication is lagging, check system resources like CPU, memory, and disk usage. Resource contention on the Publisher, Distributor, or Subscriber can slow down replication and cause delays.
For more detailed troubleshooting, including specific steps for different error scenarios, refer to the Microsoft Troubleshooting Replication Tutorial I recommended above in this post. If you’ve made it this far, the tips in that article and video will equip you with all the knowledge you need to resolve this. Also, we should consider contacting Microsoft Support during this process.
-
Get Last Node Blip in SQL Server FCI
In a SQL Server Failover Cluster Instance (FCI) environment, a “node blip” refers to the event when the active SQL Server instance moves from one cluster node to another. This movement, typically triggered by a Failover or planned maintenance, causes a brief outage—usually 30 to 60 seconds.
While planned blips are expected during maintenance, unexpected ones should be investigated promptly to identify root causes and prevent recurrence. These events are especially critical in environments hosting high-impact SQL Server workloads.
Tracking SQL Node Blips Using PowerShell
I’m sharing a command that retrieves all recent failover events logged in the Failover Clustering Operational Log.
Replace
ListenerName
below with your Always On Availability Group Listener Name or the relevant cluster resource name:# Get last failover time Get-WinEvent -ComputerName ListenerName -FilterHashtable @{logname='Microsoft-Windows-FailoverClustering/Operational'; id=1641} | Format-Table -AutoSize -Wrap
This command filters the log for Event ID 1641, which tracks clustered role movements. The output lists all failover events, including timestamps and details about why the failover occurred. By reviewing these events, you can determine whether a failover was planned or unexpected.
Why Node Blips Matter in SQL Server FCI
Every node blip results in a temporary outage as the SQL Server instance transitions between nodes. While a 1 minute disruption might seem minor, it can have significant implications for critical databases or applications.
Monitoring and checking these failover events helps identify the point in time of last SQL Server restart, diagnose issues and prevent re-occurrence.
Alternative Methods to Retrieve Node Blip Data
In addition to PowerShell, you can use SQL Server’s error logs to identify failovers. The
xp_readerrorlog
stored procedure can be used to find messages related to failover events:-- check sql error log for failover events EXEC xp_readerrorlog 0, 1, 'failover';
This command filters the SQL Server error log for failover-related entries, providing another layer of insight into node blips.
The last thing I want to share on this is a small PowerShell script that is more useful for automation. It returns only the most recent node blip and includes a param for the AG Listener Name. If you ran this on a frequent schedule it would be easy to add alerting and/or log to a table.
# Define the cluster listener name $listenerName = "ListenerName" # Replace with your actual Listener Name # Get the latest failover events (Event ID 1641) $lastNodeBlip = Get-WinEvent -ComputerName $listenerName -FilterHashtable @{ logname = 'Microsoft-Windows-FailoverClustering/Operational'; id = 1641 } | Sort-Object TimeCreated -Descending | Select-Object -First 1 # Output the most recent node blip $lastNodeBlip | Format-Table -AutoSize -Wrap
-
SQL Script to Get AAG Latency
In this post, I’m sharing a script that I use frequently as a Production SQL DBA to check Always On Availability Group (AAG) latency.
Always On Availability Groups (AAG) are a high-availability and disaster recovery solution in SQL Server, allowing you to synchronize databases across multiple replicas. These replicas can serve as failover targets or provide read access to offload queries from the primary. Latency in an AAG refers to the delay between when a transaction is committed on the primary replica and when it is applied on the secondary replicas.
Having proper monitoring and alerting in place for AAG latency is critical because high latency means secondary replicas may lag behind, potentially leading to outdated or incomplete data being read by applications or users. The goal is always to keep latency as low as possible—ideally no latency or just a few seconds.
Get AAG Latency Using SQL Server Script
Why This Script Is Useful
This script provides insight into the health of your Availability Group(s). Specifically, it shows:
- Time Behind Primary: How far behind the secondary replica is from the primary.
- Redo Queue Size & Redo Rate: Metrics that help estimate how long it will take for the secondary to catch up.
- Estimated Completion Time: A projection of when synchronization will finish based on current redo rates.
High latency can mean that secondary reads are outdated, which is a problem for systems relying on accurate, real-time data.
Async vs. Sync and AAG Latency
It’s worth noting that latency is influenced by the availability mode of your AAG:
Synchronous Commit: Ensures data is committed to secondary replicas before the transaction is confirmed on the primary. This minimizes latency but may introduce slight delays in write operations due to network overhead.
Asynchronous Commit: Prioritizes primary performance by allowing transactions to commit without waiting for secondary synchronization. While this reduces write delays, it can lead to higher latency on secondary replicas, making it unsuitable for scenarios requiring real-time reads.
When failing over an AAG configured for Asynchronous Commit, it’s crucial to switch to Synchronous Commit beforehand to avoid potential data loss. Asynchronous mode doesn’t guarantee that all transactions have been applied to the secondary replica, which could result in missing data after the failover.
Resolving AAG Latency Issues
Here are some things to consider looking at if you’re experiencing AAG Latency or any other issues:
> Verify there’s not a long running transaction or big insert happening at the Primary. You can check out my other post which will help quickly review the current performance of your SQL Server.
> Review the official Microsoft Docs for Troubleshooting SQL Server Always On Issues.
> Consider contacting Microsoft Support. They’ll help you however it may come at a charge if it’s not a bug or feature issue. You’ll likely have a Microsoft Support Contract in-place if working for medium/larger company.