Wednesday, December 17, 2014

SQL Server Clustered Instance Error 422 - Ambiguous Error With A (Hopefully) Simple Fix

Recently I observed an issue with bringing up SQL Server services in a clustered environment.  In this particular Windows Failover Cluster, there are two instances of SQL Server that typically run on the same physical node.  All of the clustered services ran without issue on the primary node.  When the instances were failed over to the second node, the SQL Server services would start, but the SQL Agent services would not.  Failover Cluster Manager reported Error 422 when bringing the resource online.  Our initial investigation was fruitless, as there was no additional detail reported in Failover Cluster Manager or the Event Log.  There was also no SQLAGENT.OUT log file generated for that particular service start attempt.

Without any additional detail into the issue, we decided to see what would happen if we tried to start the services manually through the Windows Services console.  This is when we discovered the issue.  Someone had decided to be “helpful” and had reset the services to Disabled.  I can only imagine that the thought process included: “The services don't normally run on this node in the cluster, so they don't ever need to run on this node, right?”

Once the SQL Agent services were reset to Manual (they should not be set to Automatic in a Windows Failover Cluster), we were able to bring the clustered SQL Agent services online through Failover Cluster Manager without issue and the cluster reported that it was completely healthy.
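If the SQL Server service itself comes online on the node, one quick way to spot a disabled Agent service is to ask the instance how its services are configured. This is only a sketch: it assumes SQL Server 2008 R2 SP1 or later (where sys.dm_server_services exists) and VIEW SERVER STATE permission, and I would expect it to reflect the configuration of the node the instance is currently running on:

```sql
-- Look for startup_type_desc = 'Disabled' on the SQL Server Agent row.
SELECT  servicename,
        startup_type_desc,   -- Manual is what you want in a failover cluster
        status_desc,
        service_account,
        is_clustered,
        cluster_nodename
FROM    sys.dm_server_services;
```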

There are a few other scenarios that can cause this issue, but hopefully this helps if you also have someone “helpful” reset your clustered services to Disabled.

Tuesday, November 25, 2014

Red Gate SQL In The City Seattle 2014 - Training The Red Gate Way

I've attended Red Gate's SQL In The City Seattle 2012 and SQL In The City Charlotte 2013.  This year, I not only had the privilege of attending the SQL In The City Seattle event, but I was thrilled to be selected to speak at the event!  It's a fantastic opportunity to share my passion for SQL Server with others.

The venue this year was McCaw Hall in Seattle Center. It was a fabulous multilevel facility that was a perfect fit for an event of this type.  McCaw Hall is not far from the Seattle Monorail and the iconic EMP Museum.


John Theron (President of Red Gate) and Steve Jones (SQL Server MVP) presented the keynote which focused on Database Lifecycle Management, "Ship often, ship safe."  You can catch a recording of the keynote here (from the London event).

After the keynote, I caught Steve Jones's session on Avoiding A DBA's Worst Days With Monitoring. It was a great session that highlighted the importance of monitoring SQL Server so that you can respond quickly to sudden issues and anticipate performance slowdowns before they become critical.  SQL Monitor v4 was featured and is a fantastic option for monitoring your database servers.

I followed Steve's session with my own, 101 Stupid Mistakes Your Colleagues Make When Setting Up A SQL Server (because, of course, no one who is reading this would ever make any of these mistakes). You can snag the slide deck here if you're interested in checking out the presentation.  The presentation went very well, with great questions and audience participation.

Lunch was a great time to hang out with members of the SQL community and chat with Red Gate product experts.

After lunch, I attended Kevin Boles's session on new SQL Server 2014 features (I'm in the middle of a SQL Server 2014 upgrade project, so this was a key session to attend).  I also attended Bob Pusateri's excellent session, Passive Security For Hostile Environments.  Bob blogged about the SQL In The City event here.

After that I needed to go check into the apartment that I rented for the week, so I reluctantly left the event early to make sure I had a place to sleep at night.  I made sure to get a picture of the Space Needle on my way by.


Overall, it is a fantastic, free event to get your SQL Server learning on and get ready for the main event, the PASS Summit.  This is a brilliant option if you can't get your company to pay for a pre-con at the PASS Summit, and I very easily convinced several friends to attend the event (it was an easy sell for their employers, as they only had to cover an additional day of food and lodging in exchange for another full day of SQL Server training).

Many thanks to Red Gate for selecting me as a speaker.  I had a blast and I look forward to attending (and hopefully speaking at) the event again next year.

Tuesday, October 21, 2014

What On Earth Is Consuming My Transaction Log - I've Got That Bloating Feeling

On more than one occasion, I've gotten the call that the log drive on one of my database servers has suddenly filled to capacity.  Some things may continue to work, but the situation gets worse as time goes on, and the possibility of a SQL Server crash increases.

Usually, a quick check of files in the log folder (sorted by size) reveals that the log growth is attributed to one database (in my experience, TempDB seems to be the most common database for this to happen in, but I have also seen other databases pop up with this issue). 

Massive log growth typically results from one rogue transaction that either has a really nasty execution plan, or is just incredibly inefficient.

Once you've identified which database is the problem child, check the free space on the log file.  I usually use this script:
-- size and FILEPROPERTY(..., 'SpaceUsed') are reported in 8 KB pages; dividing by 128 converts to MB
SELECT  DB_NAME() AS DBName, name AS LogicalFileName, filename AS PhysicalFileName,
        CONVERT(DECIMAL(12, 2), ROUND(size / 128.000, 2)) AS FileSizeMB,
        CONVERT(DECIMAL(12, 2), ROUND(FILEPROPERTY(name, 'SpaceUsed') / 128.000, 2)) AS DataSizeMB,
        CONVERT(DECIMAL(12, 2), ROUND((size - FILEPROPERTY(name, 'SpaceUsed')) / 128.000, 2)) AS FreeSpaceMB
FROM    dbo.sysfiles;


If there is plenty of free space in the log file, then the rogue transaction has completed.  If there is no (or precious little) free space, you'll want to run the following command in the context of that database to find the oldest active transaction in the database:
DBCC OPENTRAN

This will give you the SPID of the oldest active transaction in that database (along with the oldest replicated transactions, if any).  A quick sp_WhoIsActive SPID# (or a combination of EXEC sp_who2 SPID# and DBCC INPUTBUFFER (SPID#) if you have not yet implemented sp_WhoIsActive) will reveal what query is running and who is running it.
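Since DBCC OPENTRAN only reports the oldest active transaction, it can also help to list every open transaction in the database along with how much log each one has generated. A rough sketch using the transaction DMVs, assuming it is run in the context of the problem database with VIEW SERVER STATE permission:

```sql
-- Open transactions in the current database, largest log consumers first.
SELECT  st.session_id,
        dt.database_transaction_begin_time,
        dt.database_transaction_log_bytes_used / 1048576.0     AS LogUsedMB,
        dt.database_transaction_log_bytes_reserved / 1048576.0 AS LogReservedMB
FROM    sys.dm_tran_database_transactions AS dt
JOIN    sys.dm_tran_session_transactions  AS st
        ON st.transaction_id = dt.transaction_id
WHERE   dt.database_id = DB_ID()
ORDER BY dt.database_transaction_log_bytes_used DESC;

-- What SQL Server says is preventing log space from being reused.
SELECT  name, recovery_model_desc, log_reuse_wait_desc
FROM    sys.databases
WHERE   database_id = DB_ID();
```

The session_id values returned here can be fed to sp_WhoIsActive or DBCC INPUTBUFFER in the same way as the SPID from DBCC OPENTRAN.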

You can find Adam Machanic's awesome sp_WhoIsActive here (personally, I recommend v11.11): http://sqlblog.com/files/default.aspx

At this point, chances are you just want to kill the offending SPID(s) using a simple KILL SPID# command.  However, if it is a long-running UPDATE, INSERT, or DELETE, you may be in for quite a wait while a rollback is performed, and it may be better to free up or allocate additional space to allow the transaction to complete.

Once the transaction is killed (or completes), you should see the free space in the log file increase dramatically.  At this point you can shrink the log file back down to a reasonable, appropriate size (emphasis here on appropriate size: you do not want to shrink it to 0 MB, as it will just need to grow again to its normal working size).
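As an illustration of the shrink step, here is a minimal sketch. The database name, logical log file name, and 4096 MB target are all hypothetical placeholders; substitute the logical file name from the free-space query and a target that matches your normal working size:

```sql
-- Hypothetical names and sizes; adjust for your environment.
USE YourDatabase;

-- Confirm how full each log is before shrinking.
DBCC SQLPERF(LOGSPACE);

-- Shrink the log file to a target size specified in MB.
DBCC SHRINKFILE (N'YourDatabase_log', 4096);
```

If the file does not shrink as far as expected, the active portion of the log may still sit near the end of the file; in the FULL recovery model a log backup is usually needed before the space can be released.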

SQL Server should recover at this point and you're on your way to fight another fire.  However, before you chalk it up as complete, you should follow up on the cause of the rogue query.  If it was a user query, check with the user to see what they were trying to accomplish.  If it was an automated process that has worked fine in the past, check to see if the statistics are up-to-date.  It's amazing how bad of an execution plan SQL Server can pick/generate when the statistics are no good.
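For the statistics check, a simple starting point is to list when each statistic on user tables was last updated. This is a sketch, not a complete staleness test (it says nothing about how many rows have changed since the last update):

```sql
-- Oldest statistics first; a NULL LastUpdated means the statistic has never been updated.
SELECT  OBJECT_SCHEMA_NAME(s.object_id)     AS SchemaName,
        OBJECT_NAME(s.object_id)            AS TableName,
        s.name                              AS StatsName,
        STATS_DATE(s.object_id, s.stats_id) AS LastUpdated
FROM    sys.stats AS s
WHERE   OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY LastUpdated;
```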

By following up on the issue, you'll become more than "just a DBA"; you're well on your way to becoming a Rockstar DBA!  Many thanks to Thomas LaRock for publishing an awesome book (and making it a free download)!

Monday, March 10, 2014

SQL 2012 Installation Error - Error while enabling Windows feature NetFx3

While installing SQL Server 2012 on a server running Windows Server 2012, I occasionally run into an issue where the installer can't properly enable the .NET Framework 3.5 feature.  I am able to make it completely through all of the setup dialogs and configurations (including clicking Install to start the installation process), and then it fails shortly into the install process with this error message:
Error while enabling Windows feature : NetFx3, Error Code : -2146498298 , Please try enabling Windows feature : NetFx3 from Windows management tools and then run setup again. For more information on how to enable Windows features , see http://go.microsoft.com/fwlink/?linkid=227143

In order to get around this issue, you need to manually enable the .NET Framework 3.5 feature:
  • Launch Server Manager
  • Go to Manage and choose Add Roles and Features
  • Click Next at the Before you begin page
  • Verify Role-based or feature-based installation is selected and click Next
  • On the Select destination server page, verify the local server is selected and click Next
  • On the Select server roles page, just click Next
  • On the Features page, check the box for .NET Framework 3.5 Features and click Next
  • Click Install to perform the installation

If you have issues with the Add Roles and Features wizard, you can also run the following command from an elevated Command Prompt (update Z:\ to the appropriate drive letter where the Windows Server DVD is inserted or mounted):
DISM /online /enable-feature /featurename:NetFx3 /all /source:Z:\Sources\sxs

Once the .NET Framework 3.5 feature is enabled, you will no longer get the aforementioned error message and SQL Server 2012 installation will proceed.

You should not experience this issue if you check the box to have the SQL Server 2012 installer Include SQL Server product updates.  Note that this check-box only updates the setup files; it will not automatically slipstream any service packs.

Thursday, February 20, 2014

Why Are My Database Restores So Slow? - How To Take Advantage Of Instant File Initialization

I like to keep a close eye on things and one of my favorite scripts queries the sys.dm_exec_requests Dynamic Management View.  This query reports the status of backups, restores and DBCC commands (like SHRINKFILE) including Wait Type and Estimated Time Remaining (I convert these values to seconds):

SELECT  session_id AS [SPID], 
wait_time/1000 AS [WaitTime(sec)], 
wait_type AS [WaitType], 
Command, 
percent_complete AS [PercentComplete], 
estimated_completion_time/1000 AS [TimeRemaining(sec)]
FROM sys.dm_exec_requests
WHERE Command LIKE '%RESTORE%'
OR Command LIKE '%BACKUP%'
OR Command LIKE '%DBCC%';

What's happening?
When SQL Server needs to create or expand a data file (AUTOGROWTH, or CREATE, RESTORE, or ALTER DATABASE commands) it needs to write zeros to the entire contents of the file (or the portion of the file that has been expanded) before it can perform any IO to that file.  If you change the default AUTOGROWTH size (and you should change the default to prevent heavily fragmented data or log files), SQL Server will then need to initialize all of the freshly allocated space before performing additional IO to that data file.  

If you see a lot of ASYNC_IO_COMPLETION waits when you start to restore a database, you're waiting for SQL Server to write out the file to disk.  

Why does this happen?
SQL Server best practice is to run the SQL Server service as a user that does not have full rights to the server that it is running on.  However, if this restricted user does not have the appropriate rights, you can't take advantage of instant file initialization when creating, restoring, or expanding database data files.  

Instant file initialization allows SQL Server to immediately start writing data to a file without having to write out the entire file.  This dramatically speeds database creation, restore and expansion processes by nearly eliminating the ASYNC_IO_COMPLETION waits (there is still a negligible amount of wait incurred while writing to the file allocation table, which typically takes mere milliseconds).

How do I fix it?
The account that the SQL Server service runs as needs only one user right granted in Group Policy Editor.

  • Launch the Services management console (Services.msc) and record the account that the SQL Server service is running as
  • Launch Group Policy Editor (GPEdit.msc)
  • Expand Computer Configuration -> Windows Settings -> Security Settings -> Local Policies
  • Select User Rights Assignment 
  • Double-click Perform volume maintenance tasks and click Add User or Group...
  • Type in the name of the account (you may need to click the Locations... button to select the correct source domain/server).
  • Click the Check Names button to be sure it is properly recognized and then click OK
You will need to restart the SQL Server service to get it to recognize this change.  Once this has been enabled, your database creation, restoration, and expansion processes will no longer incur an IO penalty for just writing out the data file.

Note: This only affects the data files; log files still need to be fully written to disk.
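On newer builds you can confirm the change from inside SQL Server after the restart. This sketch assumes your build exposes the instant_file_initialization_enabled column in sys.dm_server_services (it is present in SQL Server 2016 SP1 and later, and as I understand it was back-ported to some later service packs of earlier versions); on older builds the query will fail because the column does not exist:

```sql
-- 'Y' means the database engine service account can use instant file initialization.
SELECT  servicename,
        service_account,
        instant_file_initialization_enabled
FROM    sys.dm_server_services
WHERE   servicename LIKE N'SQL Server (%';   -- database engine service only, not the Agent
```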

I've run across this many times with multiple clients as they are locking down user permissions.  It is very easy to overlook if you don't know about it.

Microsoft also has a blog post on this with some additional technical detail: http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx

Thursday, February 13, 2014

HELP - My Distribution Database Is HUGE, But I Don't Have A Lot Of Commands In Queue

I recently ran across an issue where the Distribution database on the Distributor was almost 150 GB, but when I looked at the Publications in Replication Monitor, I found that there were fewer than 50,000 commands in queue across all Publications.

I immediately thought, "Why on earth is the Distribution database bigger than this truck?"
I started my investigation by running this query to see how many Transactions were in the Distribution database (there were several million):
SELECT PD.Publisher_DB, COUNT(RT.xact_id) AS #TransactionsInDistributionDB
FROM distribution.dbo.MSrepl_transactions RT
JOIN distribution.dbo.MSpublisher_databases PD ON PD.id = RT.publisher_database_id
GROUP BY PD.Publisher_DB;

However, Replication Monitor tells you how many Commands are in queue (not Transactions), so I ran this query and found that there were almost 500 million commands total in queue (NOTE: Use this query with caution, it will take a long time to run if there are a lot of commands in queue - in my case it took almost 45 minutes):
SELECT PD.Publisher_DB, COUNT(RC.command_id) AS #CommandsInDistributionDB
FROM distribution.dbo.MSrepl_commands RC
JOIN distribution.dbo.MSpublisher_databases PD ON PD.id = RC.publisher_database_id
GROUP BY PD.Publisher_DB;

We also observed a lot of Disk IO on the Distributor, and the Distribution clean up: distribution job was taking a lot longer than normal (1-4  hours instead of 2-5 minutes).

After some Googling I found a great article by Paul Ibison (http://www.replicationanswers.com/TransactionalOptimisation.asp) that reveals that the immediate_sync option causes the Distributor to queue all Transactions for a Publication until the retention period is reached, regardless of whether the Transaction has already been delivered.
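To see what retention period applies on your Distributor, one option is the sp_helpdistributiondb system procedure. This is a minimal sketch, assuming it is run at the Distributor by a login with sufficient rights; the result set includes the minimum and maximum transaction retention for each distribution database:

```sql
-- Run at the Distributor; review the retention columns in the result set.
EXEC sp_helpdistributiondb;
```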

I ran this query at the Distributor to see what Publications had the immediate_sync option enabled:
SELECT SS.Name AS PublisherName, Pubs.Publisher_DB, Pubs.Publication, Pubs.Immediate_Sync
FROM distribution.dbo.MSpublications AS Pubs
JOIN master.sys.servers AS SS ON SS.server_id = Pubs.publisher_ID
WHERE Pubs.Immediate_Sync = 1
ORDER BY PublisherName, Publisher_DB, Publication;

This revealed 2 Snapshot Publications that had that option enabled.  Yes, you read that correctly: the immediate_sync option also causes the Distributor to queue all Transactions for Snapshot Publications, even though they're not necessary for the Publication.

How Do I Fix It?
To correct the issue, you need to run a couple of scripts in the Published database (update PublicationName as appropriate):
EXEC sp_changepublication
@publication = 'PublicationName',
@property = 'allow_anonymous',
@value = 'FALSE';

EXEC sp_changepublication
@publication = 'PublicationName',
@property = 'immediate_sync',
@value = 'FALSE';

The next execution of the Distribution clean up: distribution job will take much longer, but it will then clean up all the unnecessary transactions.  In my case, there is now less than 10 GB of data in the Distribution database.

These scripts can be executed on the fly without impact as they do not interrupt or affect the Publication (aside from telling the Distributor that it no longer needs to queue commands that it doesn't really need).

How Did It Happen?
This option is a result of checking a very innocent looking box in the New Publication Wizard:
This setting is very BAD

If you have scripted out the Publication creation, edit your scripts and look for this snippet:
EXEC sp_addpublication ... @immediate_sync = N'true'

This will need to be changed to:
EXEC sp_addpublication ... @immediate_sync = N'false'

This option can greatly impact the performance and storage of your Distributor.  If you suspect this is an issue on a Distributor you manage, I recommend you run the scripts above to check for and correct the issue.

Thanks again Paul Ibison for your invaluable assistance.

Tuesday, February 11, 2014

Red Gate Releases SQL Monitor v4 - One Small Step For Man, One Giant Leap for SQL Server Monitoring

When I attended Red Gate's SQL In The City in Charlotte, I spent a great deal of time chatting with the SQL Monitor project manager, Daniel Röthig.  He let me in on a little secret about SQL Monitor v4 that really excited me.  It's no longer a secret and I recently upgraded from SQL Monitor v3 to v4.  Overall, I was very happy with the upgrade process and the new features.  

So without further ado...

The Upgrade Process
The upgrade to v4 was very straightforward.  While it didn't give any initial indication that it had discovered that v3 was installed, it did carry over all of my settings, and at the end of the process it warned that a downgrade was not possible due to the changes that the upgrade made to the structure of the repository database.

The upgrade was quick and painless.  My only wish here would be for it to check (or indicate) that it found another version already installed.  This would improve confidence during the install process (I ended up going back and forth a couple of times between the installer and the current version to verify the folder path and configured ports).

New Features
Within moments of launching the new version, I immediately noticed two new features.
  1. DBs list shortened to first 5 databases in SQL Instance view 
    • When you select an instance of SQL, SQL Monitor v4 only lists the first 5 databases on that server.  I have several servers with 500-2000 databases, and by the time SQL Monitor v3 listed them all, it was time to refresh the page (making the SQL Instance page nearly useless, as it was constantly refreshing/loading).  This feature greatly improves the loading speed of the main page for SQL Instance monitoring.
      • The issue still exists if you expand all the DBs on the server, but you can just click the Pause button (or rewind time 1 minute) to suspend the page refreshes.  Once you've drilled into the specific database you're interested in, you can click the Return to present button to resume real-time monitoring (and page refreshes).
  2. SQL Waits Monitoring
    • Top 10 SQL Server wait stats (sorted by Wait time, Waiting tasks, Average Wait time, or Signal wait time) with a brief description of each wait state.
    • Each wait state breaks out the top 10 queries that generated those waits (sortable by Execution count, Duration, CPU time, Physical or Logical reads, or Logical writes)
    • This is the best new feature of SQL Monitor v4, and this is what Daniel and I spent so much time discussing.
      • Just knowing the query resource utilization only reveals part of the story.  By monitoring the wait stats, you can better focus your tuning efforts and have a much clearer understanding of what SQL Server is doing.


Existing features that I love
SQL Monitor does a great job in multiple areas:

  • Analysis
    • Easy access to the performance graphs: just click on a metric on the main page to drill into the analysis graph
    • Baselines (Regions View) - Awesome for tracking problems that occur regularly or comparing performance over multiple time periods
    • Overlay additional stats (you can stack multiple metrics on the same graph to help pinpoint trouble areas)
  • Alerts
    • Many default alerts are active out of the box to give you a good idea of what your SQL Server is doing that might not be best practice
    • Deadlocks appear here and you can drill into the detail of the deadlock participants.

Wish list
SQL Monitor does a great job monitoring many aspects of SQL Server (and the underlying OS), but there are a few new features I'd love to see added to make the product even more robust:
  • Charts for SQL waits and query resource usage
    • You can currently view charts for server resource utilization (OS and SQL resources) with baselines on the Analysis tab, but charts for wait types and query waits/resources would also be beneficial.  
    • Charts should be clickable to drill into detail for a specific wait type, resource, or query.
    • Charts help to reveal queries that are more bursty in nature.  I would be more likely to tune a poorly performing query that completely pegs system resources for a couple minutes than one that consumes the same amount of resources over a larger timeframe.
  • # Blocked sessions (clickable to see the blocking tree - including query detail for all queries involved)
  • Additional Deadlock detail (graphical deadlock graph would be great, but I did find that the text is available under Alerts -> {specific alert} -> Output)
  • Additional Memory stats monitored
    • Free list stalls/sec
    • Lazy writes/sec
      • Page Life Expectancy only reveals so much information, and its value decreases greatly when running on multiple NUMA nodes (PLE is an average across all NUMA nodes.  If one NUMA node is crushed, but the others are fine, PLE will not reveal a true measure of memory pressure).  These stats are additional ways to measure memory utilization/pressure on an instance of SQL.
  • Signal wait time % (total Signal Wait time/Total Wait Time)
    • This is the percentage of time that SQL Server is waiting on the CPU to be ready to process a query (10% = low, 15% = medium, 20% = high and anything over 20% indicates high CPU resource contention)
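Until something like this shows up in the product, a couple of these wish-list numbers can be pulled by hand. The queries below are a rough sketch: the wait stats are cumulative since the last restart (or since they were last cleared) and include benign background waits, so treat the percentage as a coarse indicator rather than a precise measurement:

```sql
-- Signal wait time as a percentage of total wait time.
SELECT  CAST(100.0 * SUM(signal_wait_time_ms)
             / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS SignalWaitPct
FROM    sys.dm_os_wait_stats;

-- Page Life Expectancy per NUMA node (Buffer Node) rather than the instance-wide value.
SELECT  object_name,
        instance_name AS NumaNode,
        cntr_value    AS PLE_Seconds
FROM    sys.dm_os_performance_counters
WHERE   counter_name = N'Page life expectancy'
  AND   object_name LIKE N'%Buffer Node%';
```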
In conclusion, SQL Monitor is maturing into a robust product and the v4 upgrade brings it one step closer to being the best SQL monitoring tool on the planet.