When you notice a performance issue on an instance, what is your first instinct? I would say the majority of DBAs would start looking at the user and application processes the were running against the instance. While it's a pretty safe bet to make, how do you know you're not a part of the problem?
When was the last time you looked at your SQL Agent jobs? Do you have any jobs with overlapping execution times? I'm not just talking about jobs using the same schedule, but jobs that overlap every tenth execution, or every fourth. If you have one or two instances to manage it can be a trivial task to check this out, there are even a few pieces of software that can help. But what if you manage 100+ instances, with 100 jobs each? In this post we'll go over a custom built solution developed just for such a case.
The general concept is straight forward: We will calculate and add a delay to the execution of each job in such a way that we get the least amount of overlapping executions. We do this by getting all job execution data for a 24 hour period and then we add a small (ever increasing) delay to each job until we reach the lowest level of overlaps possible. This delay information is then stored in a table on the instance and is used by a 'Delay' step added to each job to delay job execution by the amount specified.
There are a few moving parts to this solution:
JobDelay Table: This table stores the job name and an integer representing the number of seconds to delay each execution of the job.
AddJobDelay Stored Procedure: This procedure loops through all jobs on the instance and adds a new 'Delay' step as the first step of the job. This step executes the following code:
DECLARE @delay INT = NULL; DECLARE @waitfor CHAR(8); SELECT @delay = delay_sec FROM JobDelay WHERE job_name = '[job name goes here]' SET @waitfor = LEFT(DATEADD(second,ISNULL(@delay,0),CAST('00:00:00' AS TIME)),8); WAITFOR DELAY @waitfor;
Because I didn't want to have to keep track of job ids, this solution requires that you use unique job names. Which is arguably something you should be doing anyhow.
GetJobData Stored Procedure: This procedure returns one row for every job execution scheduled for the next 24 hours. Currently it does not pay attention to the time of day that a job can run (coming in a future version) it just assumes all jobs execute on a 24 hour schedule. As an example, if a job executes every 15 minutes, you would see 96 rows for this job. Along with executions, this proc also determines the average duration for the job (minus the delay step), this is very important later in the process.
Overlap Checker: This is a C# console app that takes the data from the
GetJobData stored procedure and calculates the optimal delay for each job. Here is an overview of what this program is doing (with as little detail as I can muster):
In order for this to work we have to operate under a few assumptions:
- Jobs don't need to execute exactly when they were defined to run
- There is currently enough job history data in
msdb to be considered representative of a typical day
- We don't care about reducing overlaps of jobs that execute every minute, or every twelve hours.
Limitations: - All jobs must be uniquely named - Jobs with multiple schedules are currently ignored
Overall this solution has worked great. We saw an overall CPU reduction and "smoothing" effect on CPU spikes. I did doubt the effects at first, but turning the delays off quickly proved me wrong. Most jobs never see a delay more than 10-100 seconds, and overall this process only takes ~1 minute to execute per instance.
There was a big focus on getting execution times down on the application portion of this because I wanted to be able to run it directly on the instance. The initial version of this solution relied on pure TSQL and took ~45 minutes to execute, subsequent iterations used PowerShell which got it down to ~15 miuntes, and finally C# which got the execution times down to 30-45 seconds on average. Because of the amount of looping that occurs, moving this into a C# app was the right thing to do.
Exclusions: The case has not yet come up where we need a job excluded from getting a delay assigned to it, but it is bound to happen. To accommodate this I will eventually be adding an exclusions table to each instance. It will likely be replicated or kept in sync between instances in some other fashion (I'm all about centralized management).
Move More Into The App: I may also look into moving even more of the processing work into the C# app, currently I am using some CTE magic to extrapolate the job execution times based on the start date and execution interval, but this could easily be done in the app, and with less CPU overhead.
Multiple Schedule Support: Eventually I want to add support for jobs with multiple schedules. To be honest it was something I didn't think about until the end, and in our environment we don't have many jobs using multiple schedules.
Support For Execution Periods:
If a job is set to run every 15 minutes between 9am and 5pm, the overlap checker still treats it like it executes every 15 minutes all day long. This isn't an issue in my current environment but I will be adding support for this in future versions. THIS FUNCTIONALITY HAS BEEN ADDED
Eventually this code will be incorporated into a bigger project to create a central job management system (similar features to the MSX/TSX framework currently available on EE, but not requiring EE). When that happens it will open up the option to run this across an entire environment, this would allow you to reduce overlap on IO/CPU/Network intensive jobs across all instances to reduce load on your SAN, and network. Stay tuned for updates!
I created a new GitHub repo for this project. All code is available there, along with a README that walks you through installation. Keep an eye on this repo for changes: https://github.com/m82labs/overlap_checker