Wednesday, April 25, 2012

Replication Troubleshooting - How to deal with out of sync publications

Transactional Replication and nasty errors that cause out of sync publications.

The other day we had an issue on our distributor that caused deadlocks on the Distribution database.  Several of the Log Reader Agents suffered fatal errors due to being chosen as the deadlock victim.  This caused the following error to occur:
  • The process could not execute 'sp_repldone/sp_replcounters' on 'MyPublisherServer'
When I drilled in to view the detail, I found this error:
  • The specified LSN (%value) for repldone log scan occurs before the current start of replication in the log (%newervalue)


After much searching on the error, I came across several forum posts that indicated I was pretty well up a creek.  I then found this post on SQLServerCentral.  Hilary Cotter's response was the most beneficial for devising a recovery plan and Stephen Cassady's response helped me refine that plan.

Hilary Cotter (Blog) is an expert when it comes to SQL replication.  He certainly knows his stuff!


The Recovery Plan
Recovering from this issue involves several steps.  

For small databases or publications where the snapshot to reinitialize the publication will be small and push quickly, it's simplest and best to just reinitialize the entire publication and generate/push a new snapshot.  

For larger publications (my publication contained almost 1,000 tables) and situations where pushing the snapshot will take an inordinate amount of time (24+ hours in my case) the following process can be used to skip the missing transactions and identify the tables that are now out of sync:
  • Recover the Log Reader Agent by telling it to skip the missing transactions
  • Recover the Distribution Agent by configuring it to ignore data consistency issues
  • Validate the publication to determine which tables are out of sync
  • Drop and republish out of sync tables


Log Reader Agent Recovery
The simplest way to recover the Log Reader Agent is to run the following command against the published database:
  • sp_replrestart
This effectively tells SQL to restart replication NOW, thus ignoring all transactions that have occurred between the time of the failure and the time you run the command.  The longer you wait to run this command, the more activity in the database that gets ignored, which likely results in more tables that fall out of sync.


Distribution Agent Recovery
Now that the Log Reader Agent is capturing transactions for replication, the Distribution Agent will likely get upset because there are transactions missing.  I specifically received the following error:
  • The row was not found at the Subscriber when applying the replicated command
This error causes the Distribution Agent to fail, but there is a system profile for the Distribution Agent that you can select to bypass the data consistency errors.
  • Launch Replication Monitor
  • In the left-hand column
    • Expand the DB server that contains the published database
    • Select the Publication 
  • In the right-hand pane
    • Double-click the Subscription
  • In the Subscription window
    • Go to the Action menu and select Agent Profile
    • Select the profile: Continue on data consistency errors. and click OK
      • Be sure to note which profile was selected before changing it so that you can select the appropriate option once recovery is complete
  • If the Distribution Agent is currently running (it's likely in a fail/retry loop), you'll need to:
    • Go to the Action menu and select Stop Distribution Agent
    • Go to the Action menu and select Start Distribution Agent
  • If there is more than one subscription, repeat these steps for any additional subscriptions


Subscription Validation
Validating the Subscription(s) is a fairly straightforward task.
  • Launch Replication Monitor
  • In the left-hand column of Replication Monitor
    • Expand the DB server that contains the published database
    • Right-click the Publication and select Validate Subscriptions...
    • Verify Validate all SQL Server Subscriptions is selected
    • Click the Validation Options... button and verify the validation options - I recommend selecting the following options:
      • Compute a fast row count: if differences are found, compute an actual row count
      • Compare checksums to verify row data (this process can take a long time)
    • Once you are satisfied with the validation options, click OK and then click OK to actually queue up the validation process
      • Please note: for large databases, this process may take a while (and the Validate Subscriptions window may appear as Not Responding)
For my publications (~1,000 tables and DB was ~100GB) the validation process took about 20 minutes, but individual results will vary.
If you wish to monitor the validation progress
  • In the right-hand pane of Replication Monitor
    • Double-click the Subscription
  • In the Subscription window:
    • Go to the Action menu and select Auto Refresh


Identify out of sync tables
I created the following script that will return the tables that failed validation:

-- This script will return out of sync tables after a Subscription validation has been performed
-- Set the isolation level to prevent any blocking/locking 
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT 
mda.publication [PublicationName],
mdh.start_time [SessionStartTime],
mdh.comments [Comments]

FROM distribution.dbo.MSdistribution_agents mda 
JOIN distribution.dbo.MSdistribution_history mdh ON mdh.agent_id = mda.id 

-- Update Publication name as appropriate
WHERE mda.publication = 'My Publication'
AND mdh.comments LIKE '%might be out of%'
-- This next line restricts results to the past 24 hours.
AND mdh.start_time > (GETDATE() - 1) 
-- Alternatively, you could specify a specific date/time: AND mdh.start_time > '2012-04-25 10:30'
-- View most recent results first
ORDER BY mdh.start_time DESC

The Comments column will contain the following message if a table is out of sync:
  • Table 'MyTable' might be out of synchronization.  Rowcounts (actual: %value, expected: %value).  Checksum values  (actual: -%value, expected: -%value).
Make a list of all tables that are returned by the aforementioned script.

Now the determination needs to be made as to the level of impact.
  • The Reinitialize All Subscriptions option should be used if the following is true:
    • Large number of tables affected (majority of published tables)
    • Unaffected tables are small in size (if the snapshot for the unaffected tables is going to be very small, it's much easier to just reinitialize everything)
  • Dropping and re-adding individual tables should be used if the following is true:
    • The number of tables affected is far less than the total number of tables
    • The tables that are unaffected are very large in size and will cause significant latency when pushing the snapshot
The latter was the case in my scenario (about 100 out of 1,000 tables were out of sync, and the ~900 tables that were in sync included some very large tables).


Reinitialize All Subscriptions
Follow this process if the determination has been made to use the Reinitialize All Subscriptions option:
  • In the left-hand column of Replication Monitor
    • Expand the DB server that contains the published database
    • Right-click the Publication and select Reinitialize All Subscriptions...
    • Verify Use a new snapshot is selected
    • Verify Generate the new snapshot now is NOT selected
    • Click the Mark For Reinitialization button
      • Please note: for large databases, this process may take a while (and the Replication Monitor window may appear as Not Responding)
  • In the right-hand pane of Replication Monitor
    • Select the Agents tab (in SQL 2005 select the Warnings and Agents tab)
    • Right click the Snapshot Agent and select Start Agent
      • The reason for performing this manually is that sometimes when you select the Generate the new snapshot now option, it kicks off the Snapshot Agent before the reinitialization is complete which causes blocking, deadlocks and major performance issues.

Recover out of sync tables
If the determination has been made to recover the individual tables, use the list of tables generated from the validation process and follow this process:
  • In the left-hand column of Replication Monitor
    • Expand the DB server that contains the published database
    • Right-click the Publication and select Properties
    • Select the Articles page in the left-hand column
    • Once the center page has populated, expand each table published to determine if the table is filtered (i.e. not all columns in the table are published).
      • If tables are filtered, make a note of the columns that are not pushed for each table
    • Once review of the tables is complete, click Cancel
      • If you click OK after expanding tables, it will invalidate the entire snapshot and you will end up reinitializing all articles in the publication
    • Right-click the Publication and select Properties
    • Select the Articles page in the left-hand column
    • Clear the check boxes for all out of sync tables and click OK
    • Right-click the Publication and select Properties
    • Select the Articles page in the left-hand column
    • Select the affected tables in the center pane 
      • If any tables were not completely replicated, be sure to reference your notes regarding which columns are replicated
    • Click OK when table selection is complete
      • Note: If you receive an error that the entire snapshot will be invalidated, close the Publication Properties window and try adding in a few tables at a time until all tables are selected.
    • In the right-hand pane of Replication Monitor
      • Select the Agents tab (in SQL 2005 select the Warnings and Agents tab)
      • Right click the Snapshot Agent and select Start Agent
    • Double-click the Subscription
    • Go to the Action menu and select Auto Refresh

Final cleanup
Once the snapshot has been delivered and replication has caught up on all queued transactions, perform the following to return replication to a normally running state.
    • In the left-hand column of Replication Monitor
      • Expand the DB server that contains the published database
      • Select the Publication 
    • In the right-hand pane of Replication Monitor
      • Double-click the Subscription
    • In the Subscription window
      • Go to the Action menu and select Agent Profile
      • Select the profile that was configured before you changed it (if unsure, the Default agent profile is typically the default) and click OK
    • If there is more than one subscription, repeat these steps for any additional subscriptions


I hope this helps if you run into the same situation.  I would like to especially thank Hilary Cotter for sharing his knowledge with the community as his forum and blog posts really helped me resolve the issue.

7 comments:

  1. Thank you thank you! We go to it too late.... but still helped out a ton!

    ReplyDelete
  2. Thanks for your informative post.I am searching post about SQL Server Replication. And i think i can get my topics here.

    ReplyDelete
  3. Thanks Matt.. I went through Hilary's articles and then stumbled upon your blog and really like the meticulous way to explain the problem and solution.

    ReplyDelete
  4. Very Informative, Thanks Matt for sharing this helpful post with us. I have found another helpful post see here: http://www.sqlserverlogexplorer.com/fixing-sql-server-transactional-replication-performance-issues/

    ReplyDelete
  5. This is so clear to understand, Thanks!

    ReplyDelete
  6. Thank You very much! Your article very helped me.

    ReplyDelete
  7. If Unaffected tables are small in size or The number of tables affected is far less than the total number of tables, another option would be to snapshot only affected tables.
    To do that, first make sure columns 'allow_anonymous' and 'immediate_sync' values are zero for all publications by running this script select immediate_sync,* from distribution.dbo.MSpublications. Then remove/add the articles and generate snapshot; snapshot will only be taken for the articles newly added.

    ReplyDelete