SQL Server Query Store - Two Stories
This post is a part of the SQL Server community's T-SQL Tuesday event. This month is being hosted by Tracy Boggiano. Thanks Tracy!
When the Query Data Store (QDS) feature was announced for SQL Server 2016, we were excited about the prospect of having deep insight into any query running in our environment. I work for a company that deals heavily in the e-commerce space, and we have a large SQL Server footprint. Our environment is unique in that it is essentially a multi-tenant system, but the tenants can have wildly different workloads. It's really the kind of query execution scenario QDS was built for. We had the pleasure of working with the Microsoft SQLCAT team to get 2016 and QDS up and running in our production environment before it was GA.
In this post I'm going to share two stories about our QDS experience (from both pre- and post-GA of the feature): one from the perspective of the Database Developer, and one from the Database Administrator. For the most part this is not a technical post full of queries and code samples. It's just me talking about some things I have experienced using QDS in production.
DISCLAIMER - These are MY observations. Your story may be different, and in many respects I hope it is. We worked closely with Microsoft to roll out 2016 before it was GA, and afterwards to make sure things were running smoothly. It just turns out that our systems stretch the limitations of a lot of products, and SQL Server is no exception.
The Developer
Our initial interactions with QDS were nothing short of amazing. We now had full details about every query that was being executed against our instances. This came with some immediate benefits that created actionable tasks. We could start asking questions like "do we need to run this query 2 million times per minute?" and "when did this procedure start performing poorly?".
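For anyone who hasn't dug into the Query Store catalog views yet, here's a rough idea of how we could suddenly answer those kinds of questions. This is just a minimal sketch against the standard sys.query_store_* views, not one of our production queries, and the one-hour window and TOP (20) are arbitrary choices:

```sql
-- How often has each query run in the last hour, and at what average CPU cost?
-- Run inside the database you care about; Query Store data is per-database.
-- (Join sys.query_store_query_text on q.query_text_id if you want the SQL text too.)
SELECT TOP (20)
       q.query_id,
       OBJECT_NAME(q.object_id)   AS containing_object,   -- NULL for ad-hoc queries
       SUM(rs.count_executions)   AS executions_last_hour,
       AVG(rs.avg_cpu_time)       AS avg_cpu_time_us      -- Query Store stores microseconds
FROM sys.query_store_query                  AS q
JOIN sys.query_store_plan                   AS p   ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats          AS rs  ON rs.plan_id = p.plan_id
JOIN sys.query_store_runtime_stats_interval AS rsi ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE rsi.start_time >= DATEADD(HOUR, -1, SYSUTCDATETIME())
GROUP BY q.query_id, q.object_id
ORDER BY executions_last_hour DESC;
```

Group the same data by rsi.start_time instead and you get the "when did this procedure start performing poorly?" view.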
Since those early days, QDS has gotten more and more useful for us. The addition of Automatic Plan Regression Correction (APRC) was huge, showing us upwards of a 15% reduction in CPU usage in a very short period of time. Later they added wait information as well, which allowed us to see what waits specific plans typically accumulated. Overall it has given us all the information we could ever need to troubleshoot performance issues at the database level, but therein lies the rub: QDS is database specific, which is not super helpful when you are looking at the performance of thousands of databases.
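The wait information in particular is handy, even though it is still a per-database view. A quick sketch of the kind of per-plan wait breakdown you can pull out of a single database's Query Store (sys.query_store_wait_stats only exists from SQL Server 2017 onward, and the @query_id value below is just a placeholder):

```sql
-- Which wait categories do the plans for one query accumulate?
-- Placeholder query_id; substitute the query you're investigating.
DECLARE @query_id bigint = 42;

SELECT p.plan_id,
       ws.wait_category_desc,
       SUM(ws.total_query_wait_time_ms) AS total_wait_ms,
       MAX(ws.max_query_wait_time_ms)   AS max_single_wait_ms
FROM sys.query_store_plan       AS p
JOIN sys.query_store_wait_stats AS ws ON ws.plan_id = p.plan_id
WHERE p.query_id = @query_id
GROUP BY p.plan_id, ws.wait_category_desc
ORDER BY total_wait_ms DESC;
```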
Without a consolidated view of our total environment, or even a whole server, QDS data was starting to become less useful. Querying the data by hand also presented some challenges, as it seems the data was hyper-optimized for INSERTs, but not for SELECTs. Because of this, running huge SELECT statements across all databases on a server can be pretty expensive, especially if you are doing it ad-hoc. Enter CentralQDS.
CentralQDS
With no obvious communications from Microsoft about a future "full server QDS", a member of my team embarked on a project to create an environment-wide QDS system. With the help of our development team (a true DBA/Dev collaboration project that any manager would be proud of) we now have a fully centralized view of all QDS data gathered across our hundreds of servers and thousands of databases. With the help of a PowerBI front-end, anyone can now ask even more important questions than the original "when did this procedure start performing poorly?"; instead they can ask questions like "do we have any procedures that aren't executed anymore?" or "what is the most expensive procedure in production?". This system has revolutionized how we troubleshoot performance. In most cases it has even moved performance troubleshooting away from my team (the DBAs) and back to the developers themselves.
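I can't share CentralQDS itself, but conceptually the per-database piece of the collection is not magic. A heavily simplified, hypothetical sketch of the idea (the #qds_summary table is made up for the example, and the real system does far more, including shipping the results off-box to a central repository) might look like this:

```sql
-- Hypothetical sketch: walk every online user database and pull a small
-- summary out of its Query Store. Assumes every online database is readable;
-- real collection code needs more checks than this.
DECLARE @db sysname, @sql nvarchar(max);

CREATE TABLE #qds_summary
(
    database_name sysname,
    query_id      bigint,
    executions    bigint,
    avg_cpu_us    float
);

DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM sys.databases
    WHERE state_desc = 'ONLINE' AND database_id > 4;   -- skip system databases

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @db;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'USE ' + QUOTENAME(@db) + N';
        INSERT INTO #qds_summary (database_name, query_id, executions, avg_cpu_us)
        SELECT DB_NAME(), q.query_id,
               SUM(rs.count_executions), AVG(rs.avg_cpu_time)
        FROM sys.query_store_query         AS q
        JOIN sys.query_store_plan          AS p  ON p.query_id = q.query_id
        JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
        GROUP BY q.query_id;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM db_cursor INTO @db;
END
CLOSE db_cursor;
DEALLOCATE db_cursor;

SELECT TOP (50) * FROM #qds_summary ORDER BY avg_cpu_us DESC;
```

Doing this ad-hoc against a busy server is exactly the expensive SELECT problem described above, which is why centralizing and pre-aggregating the data pays off.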
Conclusions for the Developer
From a developer standpoint QDS is an outstanding feature (even if it has some database-centric shortcomings). It has changed how troubleshooting is done and also allows for more "big picture" planning and analysis. Without a full-environment view it can be less useful, but the fact that the data is there to aggregate and analyze is very, very useful.
The Administrator
The developer story of QDS is a good one. The administrator story is less so. If you are looking at an ancient map, and part of the map is labeled "there be dragons here", QDS is beyond that, in the part of the world folks hadn't dared go before. While QDS has been a boon for performance improvement projects, it has a lot of hidden impact on your instances. I struggled with how best to write this section and decided to first present the catastrophic failures we had with QDS, followed by a breakdown by issue.
The Perfect Storm
Going to production with a CTP version of SQL Server can be a bit nerve-wracking, to be honest. In testing, SQL Server 2016 was solid and we didn't see any issues except maybe the odd query regression. As anyone supporting production systems knows though, all the testing and planning in the world tends to crumble under the weight of production.
A month or so after upgrading production to SQL Server 2016 we started to see breathtaking (it's the only word clean enough to use here) amounts of tempdb contention on a number of our higher-volume instances. We struggled to find a root cause of the contention, but assumed there had been a workload change due to a recent release (there was a release the day before this started happening). We couldn't pin down the exact change, but figured it had to be related. When the contention would occur, the entire instance would become unresponsive, to the point where you couldn't even connect to it. Luckily an AG failover to the secondary would still work and seemed to clear the contention.
After further investigation we discovered that contention was happening on the base system tables in tempdb; it wasn't standard PFS contention. This made even less sense. Based on the tables involved it pointed at auto-stats potentially being the issue. Whenever we saw the issue we usually saw contention on the base table that stores stats objects for tempdb. This was the beginning of a chain of trial and error that led us to suspect a lot of different issues:
- Too much tempdb usage
- Large sorting operations spilling into tempdb
- Index maintenance operations using tempdb
- Large table variables and other objects that can use tempdb on the backend
In the end though, we finally discovered that it was none of that, and all of it. It turns out that in 2016 a change was made that resulted in ~100% MORE PAGELATCH waits when creating objects in tempdb. This combined with a few other issues to result in what I came to call "pagelatch storms". Eventually Microsoft released a fix for the PAGELATCH issue which brought things back to pre-2016 levels (KB4013999). So what were these other issues? Here's a list:
- Size-based cleanup - The size-based cleanup process in QDS is BAD. On systems with a lot of unique queries it can consume upwards of 70% CPU on an instance (this is on boxes with 8+ cores), especially if multiple cleanups are running. This is mostly because QDS is designed to handle lots of writes, but reads are expensive. Whenever size-based cleanup runs on a database with lots of unique queries, it will likely spill into tempdb. Combine that spilling with the increase in PAGELATCH waits in 2016 and you already have an issue. Combine it with the other issues below and it can bring down a server. Look for future posts about some custom QDS clean-up scripts we run to avoid the size-based clean-up operations (there's a rough sketch of the idea after this list).
- Lots of tempdb usage - We use a lot of tempdb, and when the issue would occur, anything that used tempdb was dead in its tracks.
- Auto-stats - While the issue was occurring, auto-stats updates would get blocked and then end up blocking other operations, adding to the chaos.
- Missing and misplaced information - When QDS was first released, the QDS operations were hidden from dm_exec_requests, so we had no way of telling what was happening. Beyond that, a lot of the resource consumption for QDS occurs in the Default resource pool, which again masked the issues.
- Timing - These issues didn't start until QDS started hitting max size, so that was over a month in some cases, and it didn't hit all instances and databases at the same time. This means the issue seemed "random" when it was happening. Not only was it random, in some cases (which we still see today) the clean-up would just happen to fire off for multiple databases at a time, further adding to contention.
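As a rough illustration of the clean-up idea mentioned in the first bullet, here is a minimal sketch (not our actual scripts) that checks how full Query Store is and removes single-use ad-hoc queries before the size-based cleanup has a reason to fire. The 80% threshold and the single-execution filter are arbitrary choices for the example:

```sql
-- Minimal sketch (not our production scripts): if Query Store in the current
-- database is more than 80% full, drop single-use ad-hoc queries ourselves
-- instead of waiting for the size-based cleanup to kick in.
DECLARE @current_mb bigint, @max_mb bigint;

SELECT @current_mb = current_storage_size_mb,
       @max_mb     = max_storage_size_mb
FROM sys.database_query_store_options;

IF @current_mb > @max_mb * 0.8
BEGIN
    DECLARE @query_id bigint;
    DECLARE qs_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT q.query_id
        FROM sys.query_store_query         AS q
        JOIN sys.query_store_plan          AS p  ON p.query_id = q.query_id
        JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
        WHERE q.object_id = 0                       -- ad-hoc, not part of a module
        GROUP BY q.query_id
        HAVING SUM(rs.count_executions) = 1;        -- executed exactly once

    OPEN qs_cursor;
    FETCH NEXT FROM qs_cursor INTO @query_id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC sys.sp_query_store_remove_query @query_id = @query_id;
        FETCH NEXT FROM qs_cursor INTO @query_id;
    END
    CLOSE qs_cursor;
    DEALLOCATE qs_cursor;
END
```

Run something like this on a schedule per database and the hope is the size-based cleanup rarely gets a chance to kick in.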
Conclusion
Overall QDS is an amazing feature worth using. Like any new technology though, you have to be on the lookout for unexpected issues. I wish Microsoft would release their own Central QDS system, as I think it would make it infinitely more useful for larger shops. When these issues first started cropping up it made me question installing CTPs in production, but honestly we would have been bitten by this issue regardless. If you are thinking about using QDS, you just need to remember a few things:
- Start with the default settings and adjust as needed (there's an example of setting the options explicitly after this list)
- Keep an eye on how much of the allocated QDS space is being consumed to avoid the size-based clean-up
- The more unique your queries, the quicker space will fill up
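And if you'd rather be explicit than rely on whatever defaults the database picked up, something like the following is a reasonable starting point. The values shown are purely illustrative, not a recommendation for any particular workload, and [YourDatabase] is obviously a placeholder:

```sql
-- Enable Query Store with explicit settings (values are only an example).
ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;
ALTER DATABASE [YourDatabase] SET QUERY_STORE
(
    OPERATION_MODE              = READ_WRITE,
    MAX_STORAGE_SIZE_MB         = 1024,               -- watch this vs. current usage
    CLEANUP_POLICY              = (STALE_QUERY_THRESHOLD_DAYS = 30),
    SIZE_BASED_CLEANUP_MODE     = AUTO,               -- the cleanup discussed above
    QUERY_CAPTURE_MODE          = AUTO,               -- skip trivial ad-hoc queries
    DATA_FLUSH_INTERVAL_SECONDS = 900,
    INTERVAL_LENGTH_MINUTES     = 60
);

-- Quick check for how much of that allocation is actually in use
-- (run inside the database itself).
SELECT actual_state_desc,
       current_storage_size_mb,
       max_storage_size_mb
FROM sys.database_query_store_options;
```

That last SELECT is the quick check for the second point above - how much of the allocated space is actually being consumed.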