I had a little bit of excitement with one of my production servers couple of days ago, that I thought I would document (especially since it turned out well in the end). I have a little VB.NET WinForms application that calls a series of DMV stored procedures on the production database instances.
Below (with some sensitive information obscured) is a screenshot of Operations Monitor (during a normal time).
As I woke up and checked the system with Operations Monitor, I noticed several unusual readings that led me to believe that one of my database servers was not a happy camper. First, my Avg Task Count for the instance was running in the 25-30 range (where it normally is in the 4-7 range). Second, my SQL Server CPU utilization was in the 25-30% range (where it is normally in the 10-15% range). Third, My FullText Pending Changes were all very high (where they are usually close to zero). Finally, the Log Reuse Description for one of the databases on that instance was Active Transaction (which is unusual for my OLTP workload).
I initially thought that I had an issue with Integrated Full Text Search (iFTS), so I ran a command to resume population, which had no effect. Then, I tried to start a manual population, and that had no effect. Puzzled, I decided to look at the SQL Error log, where I saw a number of errors like you see below.
Process 0:0:0 (0x1470) Worker 0x000000000583E1A0 appears to be non-yielding on Scheduler 13. Thread creation time: 12903249354211. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 19%. System Idle 78%. Interval: 10704548 ms.
Oh-oh, that was not good. I remembered a relevant white paper on this subject: How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 by Bob Dorr and Sameer Tejani at Microsoft, which I took a quick look at. While I was doing this, I noticed that I still had Active Transaction for the Log Reuse Description on that OLTP database.
I decided to run DBCC OPENTRAN, and discovered a transaction that had been active for a couple of hours, which was very unusual. Then, I ran DBCC INPUTBUFFER with the SPID for that transaction, and discovered that it was for a stored procedure that was kicked off by a nightly SQL Agent job. I tried cancelling the Agent job, but that had no effect. Next, I ran KILL 290 to kill that SPID, which caused the transaction to to be rolled back (which took awhile). After that SPID was finally dead, the non-yielding worker errors stopped, and all of my other reading quickly returned to normal.
The moral of this story is to stay calm, and work the problem as you gather information about what is going on. I wasted some time chasing the iFTS problem before I ran DBCC OPENTRAN. The non-yielding worker was not bad enough to have caused a mini-dump by SQL Server, and the database server was not in acute distress, so I had time to keep working on it before trying something more drastic.
I have some of these basic but useful queries and commands shown below. They are very handy to have around and use to monitor your database servers. It is very important to know what your baseline values are, so that you will notice when something is different.
-- Get Avg Task Count and Avg Runnable Task Count SELECT AVG(current_tasks_count) AS [Avg Task Count], AVG(runnable_tasks_count) AS [Avg Runnable Task Count] FROM sys.dm_os_schedulers WITH (NOLOCK) WHERE scheduler_id < 255; -- Get Log reuse wait description SELECT [name], log_reuse_wait_desc FROM sys.databases WITH (NOLOCK) WHERE database_id > 4; -- Find out how many full-text changes are pending on a table SELECT OBJECTPROPERTY(OBJECT_ID('CurrentPostFullTextMonday'), 'TableFulltextPendingChanges') AS 'Monday Full Text Pending Changes'; -- Find oldest open transaction DBCC OPENTRAN -- Find command for a given SPID DBCC INPUTBUFFER(290) -- Kill a SPID KILL 290