Recently I observed an issue with bringing up SQL Server services in a clustered environment. In this particular Windows Failover Cluster, there are two instances of SQL Server that typically run on the same physical node. All of the clustered services ran without issue on the primary node. When the instances were failed over to the second node, the SQL Server services would start, but the SQL Agent services would not. Failover Cluster Manager reported Error 422 bringing resource online. Our investigation was fruitless at first, as no additional detail was reported in Failover Cluster Manager or the Event Log. There was also no SQLAGENT.OUT log file generated for that particular service start attempt.
With no additional detail about the issue, we decided to see what would happen if we tried to start the services manually through the Windows Services console. This is when we discovered the cause. Someone had decided to be “helpful” and had set the startup type of the SQL Agent services on that node to Disabled. I can only imagine that the thought process included: “The services don't normally run on this node in the cluster, so they don't ever need to run on this node, right?”
Once the SQL Agent services were reset to Manual (they should not be set to Automatic in a Windows Failover Cluster), we were able to bring the clustered SQL Agent services online through Failover Cluster Manager without issue and the cluster reported that it was completely healthy.
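The same check can be scripted so it can be run on every node before a planned failover. The following is a minimal sketch, not the method we used at the time: it assumes English-language output from the built-in `sc qc` command, and the instance names `INST1` and `INST2` are hypothetical placeholders for your own. The helper function names are also made up for this illustration.

```python
import re
import subprocess
from typing import Callable, Iterable, List, Optional


def parse_start_type(sc_qc_output: str) -> Optional[str]:
    """Extract the START_TYPE name (e.g. 'DISABLED') from `sc qc` output.

    Assumes the English output format, where the line looks like:
        START_TYPE         : 4   DISABLED
    """
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def query_start_type(service_name: str) -> Optional[str]:
    """Run `sc qc <service>` on the local node and return its start type."""
    result = subprocess.run(
        ["sc", "qc", service_name],
        capture_output=True,
        text=True,
        check=False,  # a missing service simply yields no START_TYPE line
    )
    return parse_start_type(result.stdout)


def disabled_services(
    service_names: Iterable[str],
    query: Callable[[str], Optional[str]] = query_start_type,
) -> List[str]:
    """Return the services whose startup type is Disabled."""
    return [name for name in service_names if query(name) == "DISABLED"]


if __name__ == "__main__":
    # Hypothetical named instances; a default instance's agent service
    # is named SQLSERVERAGENT instead of SQLAgent$<instance>.
    agents = ["SQLAgent$INST1", "SQLAgent$INST2"]
    for name in disabled_services(agents):
        print(f"{name} is Disabled; set it back to Manual before failing over")
```

Any service the script flags can be set back to Manual in the Services console, or from an elevated command prompt with something like `sc config "SQLAgent$INST1" start= demand` (the space after `start=` is required, and the service name is again a placeholder).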
There are a few other scenarios that can cause this issue, but hopefully this helps if you also have someone “helpful” reset your clustered services to Disabled.