This post is a part of the SQL Server community's T-SQL Tuesday event. This month is being hosted by Tracy Boggiano. Thanks Tracy!
When the Query Data Store (QDS) feature was announced for SQL Server 2016, we were excited about the prospect of being able to have deep insight on any query running in our environment. I work for a company that deals heavily in the e-commerce space, and we have a large SQL Server footprint. Our environment is unique in that is essentially a multi-tenant system, but all the tenants could have wildly different workloads. It's really the kind of query execution scenario QDS was built for. We had the pleasure of working with the Microsoft SQLCAT team to get 2016 and QDS up and running in our production environment before it was GA.
In this post I'm going to share two stories about our QDS experience (from pre and post GA of the feature). One from the perspective of the Database Developer, and one from the Database Administrator. For the most part this is not a technical post full of queries and code samples. It's just me talking about some things I have experienced using QDS in production.
DISCLAIMER - These are MY observations. Your story may be different, and in many respects I hope it was. We worked closely with Microsoft to roll out 2016 before it was GA, and after to make sure things were running smoothly. It just turns out that our systems stretch the limitations of a lot of products, and SQL Server is no exception.
Our initial interactions with QDS were nothing short of amazing. We now had full details about every query that was being executed against our instances. This came with some immediate benefits that created actionable tasks. We could start asking questions like "do we need to run this query 2 million times per minute?" and "when did this procedure start performing poorly?".
Since those early days, QDS has gotten more and more useful for us. The addition of Automatic Plan Regression Correction (APRC) was huge, showing us upwards of 15% CPU usage reduction in a very short period of time. Later they added wait information as well, which allowed us to see what waits specific plans typically accumulated. Overall it has given us all the information we could ever need to to troubleshoot performance issues at the database level, but there lyes the rub. QDS is database specific, which is not super helpful when you are looking at the performance of thousands of databases.
Without a consolidated view of our total environment, or even a whole server, QDS data was starting to become less useful. Querying the data by hand also presented some challenges, as it seems the data was hyper-optimized for INSERTs, but not for SELECTs. Because of this, running huge SELECT statements across all databases on a server can pretty expensive, especially if you are doing it ad-hoc. Enter CenteralQDS.
With no obvious communications from Microsoft about a future "full server QDS", a member of my team embarked on a project to create an environment-wide QDS system. With the help of our development team (a true DBA/Dev collaboration project that any manager would be proud of) we now have a fully centralized view of all QDS data gathered across our hundreds of servers and thousands of databases. With the help of a PowerBI front-end, anyone can now ask even more important questions than the original "when did this procedure start performing poorly?"; instead they can ask questions like "do we have any procedures that aren't executed anymore?" or "what is the most expensive procedure in production?". This system has revolutionized how we troubleshoot performance. In most cases it has even moved performance troubleshooting away from my team (the DBAs) and back to the developers themselves.
From a developer standpoint QDS is an outstanding feature (even if it has some database-centric short-comings). It has changed how troubleshooting is done and also allows for more "big picture" planning and analysis. Without having a full-environment view it can be less useful, but the fact that the data is there to aggregate and analyse is very very useful.
The developer story of QDS is a good one. The administrator story is less so. If you are looking at an ancient map, and part of the map is labeled "there be dragons here", QDS is beyond that, in the part of the world that folks hadn't dared go before. While QDS has been a boon for performance improvement projects, it has a lot of hidden impact on your instances. I struggled with how best to write this section and decided to first present the catastrophic failures we had with QDS, followed by a break down by issue.
Going to production with a CTP version of SQL Server can be a bit nerve wracking to be honest. In testing SQL Server 2016 was solid and we didn't see any issues except maybe the odd query regression. As anyone supporting production systems knows though, all the testing and planning in the world tend to crumble under the weight of production.
A month or so after upgrading production to SQL Server 2016 we started to see breathtaking (it's the only word clean enough to use here) amounts of tempdb contention on a number of our higher-volume instances. We struggled to find a root cause of the contention, but assumed there had been a workload change due to a recent release (there was a release the day before this started happening). We couldn't pin down the exact change, but figured it had to be related. When the contention would occur, the entire instance would become unresponsive, to the point where you couldn't even connect to it. Luckily an AG failover to the secondary would still work and seemed to clear the contention.
After further investigation we discovered that contention was happening on the base system tables in tempdb; it wasn't standard PFS contention. This made even less sense. Based on the tables involved it pointed at auto-stats potentially being the issue. Whenever we saw the issue we usually saw contention on the base table that stores stats objects for tempdb. This was the beginning of a chain of trial and error that lead us to suspect a lot of different issues:
In the end though, we finally discovered that it was none of that, and all of it. It turns out that in 2016 a change was made that resulted in ~100% MORE
PAGELATCH waits when creating objects in tempdb. This combined with a few other issues to result in what I came to call "pagelatch storms". Eventually Microsoft released a fix for the
PAGELATCH issue which brought things back to pre-2016 levels (KB4013999). So what were these other issues? Here's a list:
PAGELATCHwaits in 2016 and you already have an issue. Combine it with the others issues below and it can bring down a server.
Look for future posts about some custom QDS clean-up scripts we run to avoid the size-based clean-up operations.
dm_exec_requests, so we had no way of telling what was happening. Beyond that, a lot of the resource consumption for QDS occurs in the
Defaultresource pool, which again masked the issues.
Overall QDS is an amazing feature worth using. Like any new technology though, you have to be on the lookout for unexpected issues. I wish Microsoft would release their own Central QDS system, as I think it would make it infinitely more useful for larger shops. When these issues first started cropping up it made me question installing CTPs in production, but honestly we would have been bitten by this issue regardless. If you are thinking about using QDS, you just need to remember a few things: