exists optimizations
estimated read time: 4-5 minutesIf you've done any amount of SQL development, you've probably seen an EXISTS
clause. It might have been part of an IF
statement, or a WHERE
clause, but I'm sure you've seen it, and maybe used it yourself. One thing you'll notice is that all developers have their own way of writing it. Some developers swear by ... EXISTS( SELECT 1 ...
, another might use ... EXISTS( SELECT TOP(1) * ...
. So what's the deal? Which method is best?
The Setup
First things first, lets see some of the different variations of the EXISTS
clause, then we'll take a look at the execution plans and see what's going on behind the scenes.
For these demo scripts we'll be using the AdventureWorks database
Example 1: SELECT *
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT *
FROM HumanResources.Employee AS b
WHERE a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
);
Example 2: SELECT TOP(1) *
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT TOP(1) *
FROM HumanResources.Employee AS b
WHERE a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
);
Example 3: SELECT TOP(1) with defined column
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT TOP(1) b.BusinessEntityID
FROM HumanResources.Employee AS b
WHERE a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
);
Example 4: SELECT 1
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT 1
FROM HumanResources.Employee AS b
WHERE a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
);
I'm sure there are more examples, but this will be enough for our purposes here. Let's take a look at the plans and io stats for these queries now and see how each performed.
The Plans
Below is the plan for the first query in our list. Nothing too exciting here:
IO Stats
Scan Count | Logical Reads | Physical Reads |
---|---|---|
2 | 6 | 0 |
The interesting thing happens when we take a look at the other query plans. Because of optimizations to the EXISTS
clause, all 4 of these queries have an identical execution plan (other that the statement text included in the plan XML of course). From these tests it would seem that the optimizer doesn't care what you select, it's just looking at the query predicates to determine if something exists.
That sounds fine and good, but what if we make things a little more complicated? Here we are going to create a function that does some pointless work:
ALTER FUNCTION fnWaitForSomeTime( @num INT )
RETURNS INT
AS
BEGIN
DECLARE @count_num INT;
DECLARE @loop_i INT = 0;
WHILE ( @loop_i < @num )
BEGIN
SET @count_num = (SELECT COUNT(*) FROM sys.columns);
SET @loop_i += 1;
END
RETURN ABS(CHECKSUM(@count_num))
END
To test this function I ran the following:
SELECT TOP(1000)
dbo.fnWaitForSomeTime(10000)
FROM
HumanResources.Employee AS b
On my system this ran for 60 seconds before I stopped it. So what happens if we add this to our EXISTS
? What if we also add a CPU heavy ORDER BY
?
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT TOP(1000)
dbo.fnWaitForSomeTime(10000)
FROM
HumanResources.Employee AS b
WHERE
a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
ORDER BY
HASHBYTES('MD5',CAST(CHECKSUM(NEWID()) AS varbinary))
);
Lets see the plan:
Notice anything familiar? Again we have the exact same plan. If you run this script yourself you'll see that it also takes the same amount of time to execute as the rest.
To take this a little further, lets try something we know shouldn't work (Thanks Kenneth Fischer for this great example!):
SELECT
a.FirstName,
a.LastName
FROM Person.Person AS a
WHERE EXISTS (
SELECT 1/0 -- <---- What?!
FROM HumanResources.Employee AS b
WHERE a.BusinessEntityID = b.BusinessEntityID
AND a.LastName = 'Johnson'
);
Again, same plan, same execution time, and no error. This example just further illustrates that the optimizer doesn't care what you are selecting in an EXISTS
clause, it only cares about the predicate.
Conclusion
We looked at a few variations of the EXISTS
clause, and then saw how SQL Server handles them. Due to some nice optimizations SQL seems to do the least amount of work possible to check if something exists. This means that it is really up to you and your team to decide which is the most readable.
Personally I like to avoid *
whenever I can, even if it doesn't cause a performance hit, so I tend to use the syntax in example 4 ... EXISTS( SELECT 1 ...
.