Quick refresher of replication terms:
Publisher - Source DB server that contains the database you want to replicate
Publication - Record of articles (tables, views, etc...) to be replicated
Subscription - Specifies destination server (Subscriber) and database for the Publication
Distributor - Intermediary DB server that collects the data that needs to be replicated and pushes it to the Subscriber
Subscriber - Destination DB server, contains the database to which the replicated data is pushed
We ended up rolling back to the old, overloaded distributor until we could determine the issue. Initially, I assumed it was something with the Publisher I was migrating as other Publications on other Publishers were working properly. However, it worked perfectly with the old Distributor before the migration and after rolling it back, so the root cause was initially a mystery.
When I discussed the issue with another DBA, I found out that this has been an occasional problem with any new Publishers/Publications using the newer Distributor. We then changed our focus to the Distributor and found that the newer Distributor server was experiencing CPU contention.
We knew that there was some CPU contention, but we thought it was minor as average CPU utilization was only 50%. After finding an excerpt from the excellent Red-Gate book Performance Tuning with SQL Server Dynamic Management Views, we were able to find that the signal waits time percentage was pretty high (~25%), clearly indicating that our CPU contention was a greater issue than we originally thought. Ideally, this percentage should be less than 10-15%. If you're pushing over 20% like we were, you've got CPU contention issues.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS [% Signal (CPU) Wait Time]
These status are cumulative since instance startup, but you can run this script after you make your CPU changes to reset the stats and observe the effects (note, it will take some time for the stats to stabilize):
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
We were able to add additional CPUs to the newer distributor and our signal wait time percentage dropped to about 8% and all issues with Replication Monitor are now a thing of the past.
Many thanks to Glenn Berry, Louis Davidson, and Tim Ford for your contributions to this fantastic book and to Red-Gate for publishing it.