Peter Whyte - SQL Server DBA Blog

SQL DBA Blog

SQL Server Replication Monitoring Scripts
In this post I’m sharing scripts and tips that will help you monitor, identify and troubleshoot SQL Server Replication latency and issues.

The following is included in this article:
1. Check Latency in Log Reader Agent.
2. Check Latency in Distribution Agent.
3. Check Undistributed Commands.
4. List All Publications and Subscriptions.
5. Troubleshooting Replication Issues.

Having SQL scripts to hand for checking SQL Server Replication issues is key for your tool-belt as a DBA. It’s our job to ensure Replication is flowing without errors, and the data is accurate/up-to-date on the Replication Subscribers.

if you need more information and understanding on SQL Server Replication, I recommend you read this recent Microsoft article:
Effectively troubleshoot latency in SQL Server Transactional replication
Or watch the YouTube Video.

1. Check Latency in Log Reader Agent

This script monitors the Log Reader Agent activity in SQL Server replication, providing details on its status, duration, errors, and performance metrics such as:

– Status: Current status of the agent (e.g., Start, Succeed, In Progress).
– Latency: Delivery latency (time delay in ms between transaction generation and delivery).
– Transaction Stats: Delivered transactions, commands, and average delivery rate.
– Error Information: Details of any replication errors encountered.

It helps identify issues early in the replication process, such as delays in reading the transaction log, which could eventually affect the Distribution Agent and overall replication latency.
```
-- Query to monitor log reader agent activity and its status, duration, and errors in SQL Server replication.
USE distribution;
GO
SELECT a.name AS agent_name, 
       CASE [runstatus]  
              WHEN 1 THEN 'Start' 
              WHEN 2 THEN 'Succeed' 
              WHEN 3 THEN 'In progress' 
              WHEN 4 THEN 'Idle' 
              WHEN 5 THEN 'Retry' 
              WHEN 6 THEN 'Fail' 
       END AS Status,
      [start_time], 
      h.[time], -- Time the message is logged. 
      [duration], -- Duration in seconds. 
      [comments], 
      h.[xact_seqno], -- Last processed transaction sequence number. 
      [delivery_time], -- Time first transaction delivered. 
      [delivered_transactions], -- Total transactions delivered. 
      [delivered_commands], -- Total commands delivered. 
      [average_commands], -- Avg. commands delivered. 
      [delivery_rate], -- Avg. commands per second. 
      [delivery_latency], -- Latency in ms. 
      [error_id], -- ID of the error in MSrepl_error table. 
      e.error_text -- Error text. 
  FROM [distribution].[dbo].[MSlogreader_history] h 
  JOIN MSlogreader_agents a 
  ON a.id = h.agent_id 
  LEFT JOIN MSrepl_errors e 
  ON e.id = h.error_id 
ORDER BY h.time DESC;
```
2. Check Latency in Distribution Agent

This script monitors the Distribution Agent activity in SQL Server transactional replication, focusing on latency and delivery metrics. It provides the following details:

– Status: Current status of the Distribution Agent (e.g., Start, Succeed, In Progress).
– Latency: Delivery latency in milliseconds, including current delivery latency since the last entry and overall delivery latency.
– Transaction Stats: Delivered transactions, delivered commands, and average commands delivered.
– Delivery Rate: Average commands per second delivered by the agent.
– Error Information: Error ID and text from the MSrepl_errors table for any issues encountered.

This script helps track the Distribution Agent’s performance, measure latency, and identify any issues or delays in delivering transactions to Subscribers.
```
-- Query to monitor distribution agent activity, status, delivery metrics, and errors in SQL Server replication.
USE distribution;
go
SELECT a.name AS agent_name, 
       CASE [runstatus]  
            WHEN 1 THEN 'Start' 
            WHEN 2 THEN 'Succeed' 
            WHEN 3 THEN 'In progress' 
            WHEN 4 THEN 'Idle' 
            WHEN 5 THEN 'Retry' 
            WHEN 6 THEN 'Fail' 
       END AS Status,
      [start_time], 
      h.[time], -- Time the message is logged. 
      [duration], -- Duration in seconds. 
      [comments], 
      h.[xact_seqno], -- Last processed transaction sequence number. 
      [current_delivery_rate], -- Avg. commands/sec since last entry. 
      [current_delivery_latency], -- Latency in ms since last entry. 
      [delivered_transactions], -- Total transactions delivered. 
      [delivered_commands], -- Total commands delivered. 
      [average_commands], -- Avg. commands delivered. 
      [delivery_rate], -- Avg. commands/sec. 
      [delivery_latency], -- Latency in ms. 
      [total_delivered_commands], -- Total commands delivered since creation. 
      [error_id], -- Error ID in MSrepl_error table. 
      e.error_text -- Error text. 
  FROM MSdistribution_history h 
  JOIN MSdistribution_agents a 
  ON a.id = h.agent_id 
  LEFT JOIN MSrepl_errors e 
  ON e.id = h.error_id 
  ORDER BY h.time DESC;
```
3. Check Undistributed Commands

This script checks for undistributed commands in SQL Server replication by querying the MSdistribution_status table.

It provides the following details:
– Undistributed Commands: The count of commands that have not yet been distributed.
– Publication Name: The name of the publication.
– Subscriber Database: The database on the Subscriber.
```
USE Distribution;
GO

SELECT 
    publication AS PublicationName,
    subscriber_db AS SubscriberDatabase,
    COUNT(*) AS UndistributedCommands
FROM 
    dbo.MSdistribution_status
GROUP BY 
    publication, subscriber_db
ORDER BY 
    UndistributedCommands DESC;
```
A high number of undistributed commands may indicate issues with the distribution agent or network performance.

4. List All Publications and Subscriptions

If you want to see all active publications and subscriptions in your environment, this script helps track the replication topology by listing all publications and their subscribers.
```
-- List Publications
USE [YourPublicationDatabase];
GO

SELECT 
    name AS PublicationName,
    publication_type AS PublicationType,
    allow_push AS PushEnabled,
    allow_pull AS PullEnabled,
    allow_anonymous AS AllowAnonymous
FROM 
    sys.publications;
GO

-- List Subscriptions
USE Distribution;
GO

SELECT 
    publication AS PublicationName,
    subscriber_db AS SubscriberDatabase,
    subscriber AS SubscriberServer,
    subscription_type AS SubscriptionType,
    status AS SubscriptionStatus
FROM 
    dbo.MSsubscriptions;
GO
```
Alternatively, you can use this script that helps display your Replication Topology with much more info, including Subscription configuration options and Articles with size info. You run it against your Distribution Database (and, if applicable, the Publisher database when the Distributor also acts as the Publisher) to gather information about the replication topology. It helps us understand the replication structure and displays important details needed to diagnose and troubleshoot complex replication issues.

5. Troubleshooting Replication Issues

To resolve replication issues, follow these steps:

1. Use Replication Monitor: Start by checking Replication Monitor. It provides a real-time view of your replication environment, showing the status of agents, latency, and undistributed commands. This tool is invaluable for quickly identifying where issues may lie.

2. Check Agent Status: If the Log Reader Agent or Distribution Agent is failing or showing issues, you can stop and restart the agent jobs. Sometimes, simply retrying the jobs can resolve transient connection issues or allow replication to go through successfully.

3. Undistributed Commands: High numbers of undistributed commands may indicate a distribution issue. Check the MSdistribution_status table to see if commands are backing up. If this is the case, investigate the Distribution Agent’s performance and check for network or resource issues.

4. Tailor Replication Options: For some issues, you may need to adjust replication options, such as using the –SkipErrors option to bypass specific errors temporarily. However, be cautious with these settings as they might mask underlying issues.

5. Review System Resources: If replication is lagging, check system resources like CPU, memory, and disk usage. Resource contention on the Publisher, Distributor, or Subscriber can slow down replication and cause delays.

For more detailed troubleshooting, including specific steps for different error scenarios, refer to the Microsoft Troubleshooting Replication Tutorial I recommended above in this post. If you’ve made it this far, the tips in that article and video will equip you with all the knowledge you need to resolve this. Also, we should consider contacting Microsoft Support during this process.
December 1, 2024
Get Last Node Blip in SQL Server FCI
In a SQL Server Failover Cluster Instance (FCI) environment, a “node blip” refers to the event when the active SQL Server instance moves from one cluster node to another. This movement, typically triggered by a Failover or planned maintenance, causes a brief outage—usually 30 to 60 seconds.

While planned blips are expected during maintenance, unexpected ones should be investigated promptly to identify root causes and prevent recurrence. These events are especially critical in environments hosting high-impact SQL Server workloads.

Tracking SQL Node Blips Using PowerShell

I’m sharing a command that retrieves all recent failover events logged in the Failover Clustering Operational Log.

Replace ListenerName below with your Always On Availability Group Listener Name or the relevant cluster resource name:
```
# Get last failover time
Get-WinEvent -ComputerName ListenerName -FilterHashtable @{logname='Microsoft-Windows-FailoverClustering/Operational'; id=1641} | Format-Table -AutoSize -Wrap
```
This command filters the log for Event ID 1641, which tracks clustered role movements. The output lists all failover events, including timestamps and details about why the failover occurred. By reviewing these events, you can determine whether a failover was planned or unexpected.

Why Node Blips Matter in SQL Server FCI

Every node blip results in a temporary outage as the SQL Server instance transitions between nodes. While a 1 minute disruption might seem minor, it can have significant implications for critical databases or applications.

Monitoring and checking these failover events helps identify the point in time of last SQL Server restart, diagnose issues and prevent re-occurrence.

Alternative Methods to Retrieve Node Blip Data

In addition to PowerShell, you can use SQL Server’s error logs to identify failovers. The xp_readerrorlog stored procedure can be used to find messages related to failover events:
```
-- check sql error log for failover events
EXEC xp_readerrorlog 0, 1, 'failover';
```
This command filters the SQL Server error log for failover-related entries, providing another layer of insight into node blips.

The last thing I want to share on this is a small PowerShell script that is more useful for automation. It returns only the most recent node blip and includes a param for the AG Listener Name. If you ran this on a frequent schedule it would be easy to add alerting and/or log to a table.
```
# Define the cluster listener name
$listenerName = "ListenerName"  # Replace with your actual Listener Name

# Get the latest failover events (Event ID 1641)
$lastNodeBlip = Get-WinEvent -ComputerName $listenerName -FilterHashtable @{
    logname = 'Microsoft-Windows-FailoverClustering/Operational'; 
    id = 1641
} | Sort-Object TimeCreated -Descending | Select-Object -First 1

# Output the most recent node blip
$lastNodeBlip | Format-Table -AutoSize -Wrap
```
December 1, 2024
SQL Script to Get AAG Latency
In this post, I’m sharing a script that I use frequently as a Production SQL DBA to check Always On Availability Group (AAG) latency.

Always On Availability Groups (AAG) are a high-availability and disaster recovery solution in SQL Server, allowing you to synchronize databases across multiple replicas. These replicas can serve as failover targets or provide read access to offload queries from the primary. Latency in an AAG refers to the delay between when a transaction is committed on the primary replica and when it is applied on the secondary replicas.

Having proper monitoring and alerting in place for AAG latency is critical because high latency means secondary replicas may lag behind, potentially leading to outdated or incomplete data being read by applications or users. The goal is always to keep latency as low as possible—ideally no latency or just a few seconds.

Get AAG Latency Using SQL Server Script

Why This Script Is Useful

This script provides insight into the health of your Availability Group(s). Specifically, it shows:
- Time Behind Primary: How far behind the secondary replica is from the primary.
- Redo Queue Size & Redo Rate: Metrics that help estimate how long it will take for the secondary to catch up.
- Estimated Completion Time: A projection of when synchronization will finish based on current redo rates.
High latency can mean that secondary reads are outdated, which is a problem for systems relying on accurate, real-time data.

Async vs. Sync and AAG Latency

It’s worth noting that latency is influenced by the availability mode of your AAG:

Synchronous Commit: Ensures data is committed to secondary replicas before the transaction is confirmed on the primary. This minimizes latency but may introduce slight delays in write operations due to network overhead.

Asynchronous Commit: Prioritizes primary performance by allowing transactions to commit without waiting for secondary synchronization. While this reduces write delays, it can lead to higher latency on secondary replicas, making it unsuitable for scenarios requiring real-time reads.

When failing over an AAG configured for Asynchronous Commit, it’s crucial to switch to Synchronous Commit beforehand to avoid potential data loss. Asynchronous mode doesn’t guarantee that all transactions have been applied to the secondary replica, which could result in missing data after the failover.

Resolving AAG Latency Issues

Here are some things to consider looking at if you’re experiencing AAG Latency or any other issues:

> Verify there’s not a long running transaction or big insert happening at the Primary. You can check out my other post which will help quickly review the current performance of your SQL Server.
> Review the official Microsoft Docs for Troubleshooting SQL Server Always On Issues.
> Consider contacting Microsoft Support. They’ll help you however it may come at a charge if it’s not a bug or feature issue. You’ll likely have a Microsoft Support Contract in-place if working for medium/larger company.
November 30, 2024
SQL Server User Permissions Audit Script

As a Production SQL DBA, checking user permissions is a task I handle frequently. Whether it’s for proactive audits, troubleshooting access issues, or responding to unexpected requests, having a quick and reliable way to verify user permissions is needed.

One of the most useful tools I rely on for this task is a SQL script that provides a comprehensive view of user roles, permissions, and group memberships. It’s especially helpful for identifying permissions inherited from Active Directory (AD) groups, which can sometimes be tricky to track manually.

SQL Script to Check User Permissions

This SQL script audits user permissions across all databases, listing roles, SQL users, and Windows groups:

November 29, 2024
Creating SQL Logins on an Availability Group (AG) Environment
In an Always On Availability Group (AG) environment, SQL logins must be configured with consistent Security Identifiers (SIDs) across all replicas to avoid issues like orphaned users and ensure seamless authentication during failovers.

For Domain users in SQL, this sync’s automatically and we just need to create the SQL login on each SQL replica.

In this blog post we’ll demo creating a SQL login, replicating it to secondary replica with the same SID, and assigning the necessary permissions.

Step 1: Create the SQL Login on the Primary Replica

1. Connect to the Primary Replica
Use SQL Server Management Studio (SSMS) to connect to the primary replica of your AG environment.

2. Create the Login
Run the following T-SQL to create the login:
```
CREATE LOGIN [YourLoginName] WITH PASSWORD = 'YourStrongPassword'; 
```
3. Retrieve the Login SID
Retrieve the SID of the newly created login using one of these methods:
```
-- Using SUSER_SID function
SELECT SUSER_SID('YourLoginName') AS LoginSID; 

-- Alternatively, retrieve it from sys.server_principals 
SELECT sid AS LoginSID 
FROM sys.server_principals 
WHERE name = 'YourLoginName'; 
```
Save this SID for creating the login on the secondary replicas.

Step 2: Create the SQL Login on the Secondary Replica

1. Connect to the Secondary Replica
Use SSMS to connect to the secondary replica.

2. Create the Login with the Same SID
Use the SID retrieved from the primary replica to create the login on the secondary:
```
CREATE LOGIN [YourLoginName] WITH PASSWORD = 'YourStrongPassword', SID = 0xYourSIDValue; 
```
Replace 0xYourSIDValue with the exact SID retrieved from the primary replica.

3. Verify the Login SID
Confirm the SID matches by running:
```
SELECT SUSER_SID('YourLoginName') AS LoginSID; 
```
Step 3: Assign Permissions to the Login

1. Grant Database Access
Map the login to a database user in the relevant databases:
```
USE [YourDatabaseName]; 
CREATE USER [YourLoginName] FOR LOGIN [YourLoginName]; 
```
2. Assign Roles
Grant the necessary database roles to the user. For example:
```
USE [YourDatabaseName]; 
ALTER ROLE db_datareader ADD MEMBER [YourLoginName]; 
ALTER ROLE db_datawriter ADD MEMBER [YourLoginName]; 
```
3. Grant Additional Permissions
For server-level permissions, such as VIEW SERVER STATE, explicitly grant them on both the primary and secondary replicas:
```
GRANT VIEW SERVER STATE TO [YourLoginName];
```
Testing and Best Practices

To test the login, simply connect to the AG listener in SSMS using the credentials for the login. This verifies that the login was created correctly and resolves any potential orphaned user issues.

Tip: When connecting to AG environments, include the MultiSubnetFailover=True parameter in your connection string. This ensures optimal failover performance in multi-subnet deployments.

While you wouldn’t typically perform a failover just to test logins, regular failover testing as part of a disaster recovery plan can help confirm that logins remain functional and avoid orphaned SQL login issues during actual failovers.
November 28, 2024