Thursday, February 20, 2014

Why Are My Database Restores So Slow? - How To Take Advantage Of Instant File Initialization

I like to keep a close eye on things, and one of my favorite scripts queries the sys.dm_exec_requests Dynamic Management View.  This query reports the status of backups, restores, and DBCC commands (like SHRINKFILE), including Wait Type and Estimated Time Remaining (I convert these values to seconds):

SELECT  session_id AS [SPID], 
wait_time/1000 AS [WaitTime(sec)], 
wait_type AS [WaitType], 
Command, 
percent_complete AS [PercentComplete], 
estimated_completion_time/1000 AS [TimeRemaining(sec)]
FROM sys.dm_exec_requests
WHERE Command LIKE '%RESTORE%'
OR Command LIKE '%BACKUP%'
OR Command LIKE '%DBCC%';

What's happening?
When SQL Server needs to create or expand a data file (AUTOGROWTH, or the CREATE, RESTORE, or ALTER DATABASE commands), it needs to write zeros to the entire contents of the file (or to the portion of the file that was expanded) before it can perform any other IO against that file.  If you change the default AUTOGROWTH increment (and you should, to prevent heavily fragmented data and log files), SQL Server will still need to zero-initialize all of the freshly allocated space before performing additional IO against that data file.
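As an illustration, here is a minimal sketch of how you might review the current growth settings and switch a data file to a fixed growth increment.  The database name SalesDB and logical file name SalesDB_Data are placeholders - substitute your own:

-- Check current size and growth settings for every database file.
SELECT DB_NAME(database_id) AS DatabaseName,
       name                 AS LogicalFileName,
       type_desc            AS FileType,
       size * 8 / 1024      AS CurrentSizeMB,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS VARCHAR(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS VARCHAR(10)) + ' MB'
       END                  AS GrowthSetting
FROM sys.master_files;

-- Change a data file from the default growth to a fixed 512 MB increment
-- (SalesDB and SalesDB_Data are hypothetical names).
ALTER DATABASE SalesDB
MODIFY FILE (NAME = N'SalesDB_Data', FILEGROWTH = 512MB);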

If you see a lot of ASYNC_IO_COMPLETION waits when you start to restore a database, you're waiting for SQL Server to zero out the data files on disk.

Why does this happen?
SQL Server best practice is to run the SQL Server service as a user that does not have full rights to the server that it is running on.  However, if this restricted user does not have the appropriate rights, you can't take advantage of instant file initialization when creating, restoring, or expanding database data files.  

Instant file initialization allows SQL Server to immediately start writing data to a file without having to write out the entire file.  This dramatically speeds database creation, restore and expansion processes by nearly eliminating the ASYNC_IO_COMPLETION waits (there is still a negligible amount of wait incurred while writing to the file allocation table, which typically takes mere milliseconds).
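If you want to confirm whether instant file initialization is actually in effect on an instance, one commonly used check (a sketch only - IFI_Test is a throwaway, placeholder database name) is to turn on trace flags 3004 and 3605 so that file-zeroing operations are written to the error log, create a test database, and see which files were zeroed:

-- Log file-zeroing operations (3004) to the SQL Server error log (3605).
DBCC TRACEON(3004, 3605, -1);

CREATE DATABASE IFI_Test;   -- throwaway database; the name is a placeholder

-- Search the current error log for zeroing messages.  If only the log file (.ldf)
-- was zeroed, instant file initialization is working for the data files.
EXEC sys.xp_readerrorlog 0, 1, N'Zeroing';

-- Clean up.
DROP DATABASE IFI_Test;
DBCC TRACEOFF(3004, 3605, -1);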

How do I fix it?
The account that the SQL Server service runs as needs only one right granted in the Group Policy Editor.

  • Launch the Services management console (Services.msc) and record the account that the SQL Server service is running as
  • Launch Group Policy Editor (GPEdit.msc)
  • Expand Computer Configuration -> Windows Settings -> Security Settings -> Local Policies
  • Select User Rights Assignment 
  • Double-click Perform volume maintenance tasks and click Add User or Group...
  • Type in the name of the account (you may need to click the Locations... button to select the correct source domain/server).
  • Click the Check Names button to be sure it is properly recognized and then click OK
You will need to restart the SQL Server service for it to recognize this change.  Once this has been enabled, your database creation, restoration, and expansion processes will no longer incur an IO penalty just to zero out the data files.

Note: This only affects data files; log files still need to be fully zeroed out on disk.

I've run across this many times with multiple clients as they are locking down user permissions.  It is very easy to overlook if you don't know about it.

Microsoft also has a blog post on this with some additional technical detail: http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx

Thursday, February 13, 2014

HELP - My Distribution Database Is HUGE, But I Don't Have A Lot Of Commands In Queue

I recently ran across an issue where the Distribution database on the Distributor was almost 150 GB, but when I looked at the Publications in Replication Monitor, I found that there were less than 50,000 commands in queue across all Publications.

I immediately thought, "Why on earth is the Distribution database bigger than this truck?"
I started my investigation by running this query to see how many Transactions were in the Distribution database (there were several million):
SELECT PD.Publisher_DB, COUNT(RT.xact_id) AS #TransactionsInDistributionDB
FROM distribution.dbo.MSrepl_transactions RT
JOIN distribution.dbo.MSpublisher_databases PD ON PD.id = RT.publisher_database_id
GROUP BY PD.Publisher_DB;

However, Replication Monitor tells you how many Commands are in queue (not Transactions), so I ran this query and found that there were almost 500 million commands total in queue (NOTE: Use this query with caution, it will take a long time to run if there are a lot of commands in queue - in my case it took almost 45 minutes):
SELECT PD.Publisher_DB, COUNT(RC.command_id) AS #CommandsInDistributionDB
FROM distribution.dbo.MSrepl_commands RC
JOIN distribution.dbo.MSpublisher_databases PD ON PD.id = RC.publisher_database_id
GROUP BY PD.Publisher_DB;

We also observed a lot of Disk IO on the Distributor, and the Distribution clean up: distribution job was taking a lot longer than normal (1-4 hours instead of 2-5 minutes).

After some Googling I found a great article by Paul Ibison (http://www.replicationanswers.com/TransactionalOptimisation.asp) that reveals that the immediate_sync option causes the Distributor to queue all Transactions for a Publication until the retention period is reached, regardless of whether the Transactions have already been delivered.

I ran this query at the Distributor to see what Publications had the immediate_sync option enabled:
SELECT SS.Name AS PublisherName, Pubs.Publisher_DB, Pubs.Publication, Pubs.Immediate_Sync
FROM distribution.dbo.MSpublications AS Pubs
JOIN master.sys.servers AS SS ON SS.server_id = Pubs.publisher_ID
WHERE Pubs.Immediate_Sync = 1
ORDER BY PublisherName, Publisher_DB, Publication;

This revealed 2 Snapshot Publications that had that option enabled.  Yes, you read that correctly: the immediate_sync option also causes the Distributor to queue all Transactions for Snapshot Publications, even though those Transactions are not necessary for the Publication.
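To see how long the Distributor will hold on to those Transactions, you can check the distribution retention settings at the Distributor.  A minimal sketch, assuming the default distribution database name (max_distretention is reported in hours):

-- Run at the Distributor; max_distretention (hours) is how long
-- immediate_sync transactions are retained before cleanup can remove them.
EXEC sp_helpdistributiondb @database = N'distribution';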

How Do I Fix It?
To correct the issue, you need to run a couple of scripts in the Published database (update PublicationName as appropriate; allow_anonymous has to be turned off first, because anonymous Subscriptions require immediate_sync):
EXEC sp_changepublication
@publication = 'PublicationName',
@property = 'allow_anonymous',
@value = 'FALSE';

EXEC sp_changepublication 
@publication = 'PublicationName',
@property = 'immediate_sync',
@value = 'FALSE';
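To confirm the change took effect, you can re-run the MSpublications query above at the Distributor, or check from the Published database with sp_helppublication, whose result set includes the allow_anonymous and immediate_sync columns (PublicationName is again a placeholder):

EXEC sp_helppublication @publication = 'PublicationName';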

The next execution of the Distribution clean up: distribution job will take much longer, but it will then clean up all the unnecessary transactions.  In my case, there is now less than 10 GB of data in the Distribution database.
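If you'd rather not wait for the next scheduled run (and can tolerate the extra IO while it churns), you can kick the cleanup job off manually at the Distributor.  This sketch assumes the default job name that SQL Server creates:

EXEC msdb.dbo.sp_start_job @job_name = N'Distribution clean up: distribution';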

These scripts can be executed on the fly without impact as they do not interrupt or affect the Publication (aside from telling the Distributor that it no longer needs to queue commands that it doesn't really need).

How Did It Happen?
This option is a result of checking a very innocent looking box in the New Publication Wizard:
[Screenshot of the New Publication Wizard checkbox in question - this setting is very BAD]
If you have scripted out the Publication creation, edit your scripts and look for this snippet:
EXEC sp_addpublication ... @immediate_sync = N'true'

This will need to be changed to:
EXEC sp_addpublication ... @immediate_sync = N'false'

This option can greatly impact the performance and storage of your Distributor.  If you suspect this is an issue on a Distributor you manage, I recommend you run the scripts above to check for and correct the issue.

Thanks again Paul Ibison for your invaluable assistance.

Tuesday, February 11, 2014

Red Gate Releases SQL Monitor v4 - One Small Step For Man, One Giant Leap for SQL Server Monitoring

When I attended Red Gate's SQL In The City in Charlotte, I spent a great deal of time chatting with the SQL Monitor project manager, Daniel Röthig.  He let me in on a little secret about SQL Monitor v4 that really excited me.  It's no longer a secret and I recently upgraded from SQL Monitor v3 to v4.  Overall, I was very happy with the upgrade process and the new features.  

So without further ado...

The Upgrade Process
The upgrade to v4 was very straightforward.  While the installer gave no indication that it had detected the existing v3 installation, it did carry over all of my settings, and at the end of the process it warned that a downgrade would not be possible due to the changes the upgrade makes to the structure of the repository database.

The upgrade was quick and painless.  My only wish here would be for it to check (or indicate) that it found another version already installed.  This would improve confidence during the install process (I ended up going back and forth a couple of times between the installer and the current version to verify the folder path and configured ports).

New Features
Within moments of launching the new version, I immediately noticed two new features.
  1. DBs list shortened to first 5 databases in SQL Instance view 
    • When you select an instance of SQL, SQL Monitor v4 only lists the first 5 databases on that server.  I have several servers with 500-2000 databases, and by the time SQL Monitor v3 had listed them all, it was time to refresh the page (making the SQL Instance page nearly useless, as it was constantly refreshing/loading).  This feature greatly improves the loading speed of the main page for SQL Instance monitoring.
      • The issue still exists if you expand all the DBs on the server, but you can just click the Pause button (or rewind time 1 minute) to suspend the page refreshes.  Once you've drilled into the specific database you're interested in, you can click the Return to present button to resume real-time monitoring (and page refreshes).
  2. SQL Waits Monitoring
    • Top 10 SQL Server wait stats (sorted by Wait time, Waiting tasks, Average Wait time, or Signal wait time) with a brief description of each wait state.
    • Each wait state breaks out the top 10 queries that generated those waits (sortable by Execution count, Duration, CPU time, Physical or Logical reads, or Logical writes)
    • This is the best new feature of SQL Monitor v4, and this is what Daniel and I spent so much time discussing.
      • Query resource utilization alone reveals only part of the story.  By monitoring the wait stats, you can better focus your tuning efforts and have a much clearer understanding of what SQL Server is doing.
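For comparison, here is a rough, hand-rolled version of that top-waits view - cumulative totals from sys.dm_os_wait_stats since instance startup, filtering out a few common benign wait types.  It obviously lacks the history, baselines, and per-query correlation that SQL Monitor layers on top:

SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
                        N'WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;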


Existing features that I love
SQL Monitor does a great job in multiple areas:

  • Analysis
    • Easy access to the performance graphs - just click on a metric on the main page to drill into the analysis graph
    • Baselines (Regions View) - Awesome for tracking problems that occur regularly or comparing performance over multiple time periods
    • Overlay additional stats (you can stack multiple metrics on the same graph to help pinpoint trouble areas)
  • Alerts
    • Many default alerts are active out of the box to give you a good idea of what your SQL Server is doing that might not be best practice
    • Deadlocks appear here and you can drill into the detail of the deadlock participants.

Wish list
SQL Monitor does a great job monitoring many aspects of SQL Server (and the underlying OS), but there are a few new features I'd love to see added to make the product even more robust:
  • Charts for SQL waits and query resource usage
    • You can currently view charts for server resource utilization (OS and SQL resources) with baselines on the Analysis tab, but charts for wait types and query waits/resources would also be beneficial.  
    • Charts should be clickable to drill into detail for a specific wait type, resource, or query.
    • Charts help to reveal queries that are more bursty in nature.  I would be more likely to tune a poorly performing query that completely pegs system resources for a couple minutes than one that consumes the same amount of resources over a larger timeframe.
  • # Blocked sessions (clickable to see the blocking tree - including query detail for all queries involved)
  • Additional Deadlock detail (graphical deadlock graph would be great, but I did find that the text is available under Alerts -> {specific alert} -> Output)
  • Additional Memory stats monitored
    • Free list stalls/sec
    • Lazy writes/sec
      • Page Life Expectancy only reveals so much information, and its value decreases greatly when running on multiple NUMA nodes (PLE is an average across all NUMA nodes; if one NUMA node is crushed but the others are fine, PLE will not reveal a true measure of memory pressure).  These stats are additional ways to measure memory utilization/pressure on an instance of SQL (a quick way to pull them yourself is sketched after this list).
  • Signal wait time % (total Signal Wait time/Total Wait Time)
    • This is the percentage of total wait time that SQL Server spends waiting for a CPU to become available to process a query (10% = low, 15% = medium, 20% = high, and anything over 20% indicates serious CPU resource contention)
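
As referenced above, here is a hedged sketch of how you could pull those memory counters yourself from sys.dm_os_performance_counters, including Page Life Expectancy per NUMA node.  Note that the */sec counters are cumulative values - sample them twice and take the difference to get an actual rate:

SELECT RTRIM(object_name)  AS object_name,
       RTRIM(counter_name) AS counter_name,
       instance_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE (object_name LIKE '%Buffer Manager%'
       AND counter_name IN (N'Page life expectancy', N'Free list stalls/sec', N'Lazy writes/sec'))
   OR (object_name LIKE '%Buffer Node%'
       AND counter_name = N'Page life expectancy');   -- PLE per NUMA node
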
In conclusion, SQL Monitor is maturing into a robust product and the v4 upgrade brings it one step closer to being the best SQL monitoring tool on the planet.

Monday, February 3, 2014

Replication Monitor Not Able to Display Publications - I Love The Smell of CPUs Burning In The Morning

Recently, I was working on load-balancing multiple replication Publications across two Distributors.  I had dropped the Publications and Subscriptions on one Publisher that was being migrated, and then registered the Publisher with a newer Distributor that had less load.

Quick refresher of replication terms:
  Publisher - Source DB server that contains the database you want to replicate
  Publication - Record of articles (tables, views, etc...) to be replicated
  Subscription - Specifies destination server (Subscriber) and database for the Publication
  Distributor - Intermediary DB server that collects the data that needs to be replicated and pushes it to the Subscriber
  Subscriber - Destination DB server, contains the database to which the replicated data is pushed

I was able to recreate the Publications and Subscriptions without issue and they appeared properly in SQL Server Management Studio.  However, none of the Publications appeared in Replication Monitor, giving the impression that nothing was happening.  The Snapshot Agents did appear in Replication Monitor and I was able to successfully generate snapshots, but other than that, I was flying blind.

We ended up rolling back to the old, overloaded Distributor until we could determine the issue.  Initially, I assumed it was something with the Publisher I was migrating, as Publications on other Publishers were working properly.  However, it worked perfectly with the old Distributor before the migration and after rolling it back, so the root cause was initially a mystery.

When I discussed the issue with another DBA, I found out that this has been an occasional problem with any new Publishers/Publications using the newer Distributor.  We then changed our focus to the Distributor and found that the newer Distributor server was experiencing CPU contention.  

We knew that there was some CPU contention, but we thought it was minor as average CPU utilization was only 50%.  Using a query from an excerpt of the excellent Red Gate book Performance Tuning with SQL Server Dynamic Management Views, we found that the signal wait time percentage was quite high (~25%), clearly indicating that our CPU contention was a greater issue than we originally thought.  Ideally, this percentage should be less than 10-15%.  If you're pushing over 20% like we were, you've got CPU contention issues.
  SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) 
  AS NUMERIC(20,2)) AS [% Signal (CPU) Wait Time]
  FROM sys.dm_os_wait_stats;

These stats are cumulative since instance startup, but you can run this script after you make your CPU changes to reset them and observe the effects (note: it will take some time for the stats to stabilize):
  DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

We added additional CPUs to the newer Distributor, our signal wait time percentage dropped to about 8%, and all issues with Replication Monitor are now a thing of the past.

Many thanks to Glenn Berry, Louis Davidson, and Tim Ford for your contributions to this fantastic book and to Red-Gate for publishing it.