Monday, February 3, 2014

Replication Monitor Not Able to Display Publications - I Love The Smell of CPUs Burning In The Morning

Recently, I was working on load-balancing multiple replication Publications across two Distributors.  I had dropped the Publications and Subscriptions on one Publisher that was being migrated, and then registered the Publisher with a newer Distributor that had less load.

Quick refresher of replication terms:
  Publisher - Source DB server that contains the database you want to replicate
  Publication - Record of articles (tables, views, etc...) to be replicated
  Subscription - Specifies destination server (Subscriber) and database for the Publication
  Distributor - Intermediary DB server that collects the data that needs to be replicated and pushes it to the Subscriber
  Subscriber - Destination DB server, contains the database to which the replicated data is pushed

I was able to recreate the Publications and Subscriptions without issue and they appeared properly in SQL Server Management Studio.  However, none of the Publications appeared in Replication Monitor giving the impression that nothing was happening.  The Snapshot Agents did appear in Replication Monitor and I was able to successfully generate snapshots, but other than that, I was flying blind.

We ended up rolling back to the old, overloaded distributor until we could determine the issue.  Initially, I assumed it was something with the Publisher I was migrating as other Publications on other Publishers were working properly.  However, it worked perfectly with the old Distributor before the migration and after rolling it back, so the root cause was initially a mystery.

When I discussed the issue with another DBA, I found out that this has been an occasional problem with any new Publishers/Publications using the newer Distributor.  We then changed our focus to the Distributor and found that the newer Distributor server was experiencing CPU contention.  

We knew that there was some CPU contention, but we thought it was minor as average CPU utilization was only 50%.  After finding an excerpt from the excellent Red-Gate book Performance Tuning with SQL Server Dynamic Management Views, we were able to find that the signal waits time percentage was pretty high (~25%), clearly indicating that our CPU contention was a greater issue than we originally thought.  Ideally, this percentage should be less than 10-15%.  If you're pushing over 20% like we were, you've got CPU contention issues.
  SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) 
  AS NUMERIC(20,2)) AS [% Signal (CPU) Wait Time]
  FROM sys.dm_os_wait_stats;

These status are cumulative since instance startup, but you can run this script after you make your CPU changes to reset the stats and observe the effects (note, it will take some time for the stats to stabilize):
  DBCC SQLPERF('sys.dm_os_wait_stats'CLEAR);

We were able to add additional CPUs to the newer distributor and our signal wait time percentage dropped to about 8% and all issues with Replication Monitor are now a thing of the past.

Many thanks to Glenn Berry, Louis Davidson, and Tim Ford for your contributions to this fantastic book and to Red-Gate for publishing it.

No comments:

Post a Comment