handling sql agent job failures
estimated read time: 7-9 minutesHopefully you don't experience a lot of job failures. But when multiple high-frequency jobs start failing, your inbox can get overloaded. Besides being annoying, an inbox full of job failure emails can make it hard to sift out the signal from the noise. In this post we'll walk through a system that I have used to monitor job failures on 100's of instances running 150+ jobs a piece, many of which run every 5 minutes.
Overview
At it's core, the system has three components: a stored procedure, a SQL Agent job, and a table. The table stores job failure details. The agent job executes a procedure to copy job failure details to the table, and cleans up the job history table in MSDB.
The Table
The table is simple, and holds the information you would expect: :::sql USE [DBA] GO
CREATE TABLE [Maintenance].[JobFailureArchive] (
JobFailureArchiveID INT IDENTITY(1,1) NOT NULL,
JobID UNIQUEIDENTIFIER NOT NULL,
JobName SYSNAME NOT NULL,
RunDate DATETIME NOT NULL,
StepID INT NOT NULL,
OutcomeText NVARCHAR(4000) NOT NULL,
RunDurationSec INT NOT NULL,
LogDateGMT DATETIME NOT NULL, -- Date and time the record was added
CONSTRAINT PK_JobFailureArchive PRIMARY KEY CLUSTERED (
JobFailureArchiveID
) WITH ( FILLFACTOR = 100 ),
INDEX IX_JobFailureArchive_RunDate NONCLUSTERED (
RunDate,
JobName
) WITH ( FILLFACTOR = 100 )
);
GO
Throughout the rest of this post I will be using the
DBA
database. This is a database I create on any instance I manage. I use it to store any objects I need to perform monitoring and troubleshooting tasks. I also tend to break out my objects by schema, in these examples I will be using theMaintenance
schema.
The Procedure
The procedure copies job failure data from the dbo.sysjobhistory
table in MSDB
to the JobFailureArchive
table. Before it does this, the procedure scans the JobFailureArchive
table to determine the date of the most recent job failure it had captured.
USE [DBA]
GO
----------------------------------------------------------------------------------
-- Procedure Name: Maintenance.ArchiveFailedJobs
--
-- Desc: Archives failed job details before purging job history.
--
-- Parameters:
--
-- Auth: Mark Wilkinson (@m82labs)
-- Date: 2017.08.20
----------------------------------------------------------------------------------
CREATE OR ALTER PROCEDURE [Maintenance].[ArchiveFailedJobs]
AS
DECLARE @lastFailure DATETIME -- Last time a failure was captured
-- Get the most recent job failure recorded.
SET @lastFailure = (
SELECT
ISNULL(MAX(RunDate),'19000101') AS MaxRunDate
FROM
Maintenance.JobFailureArchive
)
-- Insert new job failures
INSERT INTO Maintenance.JobFailureArchive
( JobID, JobName, RunDate, StepID, OutcomeText, RunDuration )
SELECT
sj.job_id,
sj.name,
msdb.dbo.agent_datetime(jh.run_date,jh.run_time),
jh.step_id,
jh.message,
(jh.run_duration / 10000 * 60 * 60) +
(jh.run_duration / 100 % 100 * 60) +
(jh.run_duration % 100 ) As RunDurationSec,
GETUTCDATE()
FROM
msdb.dbo.sysjobhistory AS jh
INNER JOIN msdb.dbo.sysjobs AS sj
ON jh.job_id = sj.job_id
WHERE
jh.run_status = 0 -- Just get failures
AND jh.step_id <> 0 -- Skip the '0' step that gives us summary info
AND msdb.dbo.agent_datetime(jh.run_date,jh.run_time) > @lastFailure;
Alerting
How you approach alerting is up to you, but you have a lot of different options. In my environment I query the JobFailureArchive
table via a Nagios service check. You could do the same thing with any monitoring solution that allows you to add custom queries. No matter which direction you take, writing a query to get the job failure details is simple:
SELECT TOP(1) WITH TIES
JobName,
Rundate,
StepID,
OutcomeText,
RunDurationSec
FROM DBA.Maintenance.JobFailureArchive
ORDER BY LogDateGMT
Since we are storing the date the records are added to the table, this query will always return the latest set of failures. This is a simple example, but the possibilities are endless:
* Send the results of this query via database mail
* Join with dbo.sysjobs
and dbo.syscategories
, alerting on different thresholds per job category
* Extend the TOP (1)
to include multiple capture periods and alert on average failures per capture
Configuring the SQL Agent Job
A SQL Agent job is the last component that ties everything together. At it's simplest the job executes the procedure in the example above. This will get your job failures into the archive table, but there is more this job can do.
Since we are archiving job failures, we can clean up job history in MSDB more aggressively. To do this, add another step to the agent job that executes the following (which limits job history to a rolling 24 hours):
DECLARE @OldestJob DATETIME = DATEADD(day,-1,GETDATE())
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @OldestJob;
Reducing the size of the sysjobhistory
table can have a lot of benefits including: quicker restore times on MSDB, more responsive SSMS when viewing job history, and a reduction in contention when writing to a busy sysjobhistory
table.
You could also add a step to send an email alert when new job failures are inserted. Database mail is a rabbit hole we aren't going to explore in this post. If you do plan on using this job to alert, start by reading through the resources at the bottom of this page. Specifically the one by Jes Borland on sending query results using database mail.
The article on using tokens in agent jobs is also a good read. Using tokens you can find job failures inserted since the job started executing:
SELECT JobName,
Rundate,
StepID,
OutcomeText,
RunDurationSec
FROM DBA.Maintenance.JobFailureArchive
WHERE LogDateGMT >= msdb.dbo.agent_datetime($(ESCAPE_SQUOTE(STRTDT)), $(ESCAPE_SQUOTE(STRTTM)))
ORDER BY LogDateGMT
The system function msdb.dbo.agent_datetime
converts the date and time format used by SQL Agent into a standard datetime. The tokens $(ESCAPE_SQUOTE(STRTDT))
and $(ESCAPE_SQUOTE(STRTTM))
, will return the date and time the job started executing. Combined with Jes's post this could be all you need for job failure alerts.
If this job fails for any reason you will stop receiving notifications for any job failures. Make sure to configure this job to alert via email if it fails!
PowerShell (Bonus Feature!)
One nice thing about having job failure data in a dedicated table is that it is easy to query. You can also give normal users access to the data without having to put them in special agent roles or granting them access to system databases. Below is a PowerShell function you, or your users, could use to retrieve job failures based on several criteria:
function Get-SqlJobFailures
{
<#
.SYNOPSIS
Retrieves job failure details from the given instance
.PARAMETER SqlInstance (Accepts values from the pipeline)
Instance we are getting job failures from
.PARAMETER JobName
All or part of the job name you are interested in
.PARAMETER Newest
The number of results to return based on date
.PARAMETER Since
The oldest date to show in the results
.PARAMETER SearchString
This parameter can be used to search the error output
#>
[CmdletBinding()]
param(
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]
[string]$SqlInstance,
[string]$JobName = '',
[int]$Newest = 10000,
[datetime]$Since = '2017-01-01',
[string]$SearchString
)
BEGIN
{
[string]$GetFailures_Query = @"
SELECT TOP($($Newest))
@@SERVERNAME AS ServerName,
JobName,
Rundate,
StepID,
OutcomeText,
RunDurationSec
FROM DBA.Maintenance.JobFailureArchive
WHERE rundate >= '$($Since.ToString('yyyMMdd HH:mm'))'
AND JobName LIKE '%$($JobName)%'
$( if($SearchString){ "AND OutcomeText LIKE '%$($SearchString)%'" })
ORDER BY RunDate DESC;
"@
$data = @()
}
PROCESS
{
Write-Host "Testing connection to $($SqlInstance): " -NoNewline
try
{
Test-Connection -ComputerName $SqlInstance -Count 1 -Quiet -ErrorAction Stop | Out-Null
Write-Host "success" -ForegroundColor Green
}
catch
{
Write-Host "error - $($_.Exception.Message)" -ForegroundColor Red
throw
}
Write-Host "Getting job failure data: " -NoNewline
try
{
Invoke-Sqlcmd -Query $GetFailures_Query -ServerInstance $SqlInstance -Database master | % {
$data += $_
}
Write-Host "done" -ForegroundColor Green
}
catch
{
Write-Host "error - $($_.Exception.Message)" -ForegroundColor Red
throw
}
}
END
{
if ( $data )
{
return $($data | Sort-Object -Property ServerName, RunDate)
}
}
}
If you are not familiar with PowerShell I recommend giving it a try. It has become an indispensable tool for my day to day tasks. The power of this function comes when you run it across multiple instances:
PS> @('sql-instance-01', `
'sql-instance-02', `
'sql-instance-03' ) | Get-SqlJobFailures -Newest 15 | Out-Gridview
This will get the 15 most recent job failures for the three instances listed, and display them in a graphical grid view.
Final Thoughts
Writing job failures to a user table seems like a simple idea, but you'll be surprised how much you use it. We only scratched the surface in this post, I would love to hear what other uses you find for this.
Resources
- Use Tokens in Job Steps - Microsoft Docs
- Email Query Results Using a SQL Server Agent Job - Jes Borland via BrentOzar.com
- Nagios - Nagios Home Page