Selectively Numb
The sort of wonderful thing about SQL is that it has many directives which are fairly easy to decipher and use appropriately.
The sort of awful thing about SQL is that it has just as many rules that are somewhat selectively applied as the English language itself.
I have my gripes and grievances with some of the choices, of course, and so will you as you delve further into the language. A petty example is that I sort of wish that SQL used GET instead of SELECT for retrieving data.
Very few people go to the store to select milk, eggs, steak, butter, salt, pepper, and scotch. Most of us just go get it. But enough about breakfast.
Let’s talk about two of the most overlooked and undervalued facilities in the SQL language: EXISTS and NOT EXISTS. Perhaps they’d get more traction is they were called THERE and NOT THERE, but but it would be perilously easy to start confusing your WHERE and THERE clause.
Often besmirched by someone who read a blog post about MySQL in 1998 as “subqueries, which are slower than joins”, they’re two of the most useful things you can grasp and use regularly.
Though they are a bit like subqueries, the columns that you select in an EXISTS or NOT EXISTS subquery can’t be used in the outer query. You can put whatever you want in the select list, from * to 1 to COUNT to 1/0 to the entire contents of the King James Bible, and it will never end up making even the dimmest difference in the world. Likewise, adding DISTINCT, TOP, or any other row-limiting device will do absolutely nothing to change the query plan or performance.
Get over yourself.
Both EXISTS and NOT EXISTS already set a row goal of 1, because all either one has to do is determine if a single row is there or not, just with the logic reversed for each.
Your First Mistakes
Let’s say someone asks you to gin up a list of Users who have Posted anything at all, but whose Reputation still floats at the dreaded 1.
Your first instinct would likely be to write a query that looks like this.
SELECT u.Id, u.DisplayName FROM dbo.Users AS u JOIN dbo.Posts AS p ON u.Id = p.OwnerUserId WHERE u.Reputation = 1 ORDER BY u.Id;
But you’d quickly find yourself confronted by many duplicate row values, because that’s what one-to-many joins produce. Duplicates.
Your next move, tongue hanging out, sweating profusely, knuckles creaking, nearly paralyzed by the uncertainty of your continued human functioning, would be to do something like this:
SELECT DISTINCT u.Id, u.DisplayName FROM dbo.Users AS u JOIN dbo.Posts AS p ON u.Id = p.OwnerUserId WHERE u.Reputation = 1 ORDER BY u.Id;
And, sure, with a small enough data set, this is an okay choice. You passed the pop quiz, hot shot. Your instinct to use DISTINCT was not wholly incorrect, but there’s a better way.
But as you start dealing with larger and more imposing sets of data, DISTINCT will no longer cut it.
What EXISTS Does Different
While EXISTS will still use a join to match rows between tables, the semantics are quite a bit different. It can move on once it has determined that a row is either there or not there.
You don’t need to add DISTINCT, grouping, or anything else to get the results you wanted in the first place.
SELECT u.Id, u.DisplayName FROM dbo.Users AS u WHERE u.Reputation = 1 AND EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE p.OwnerUserId = u.Id ) ORDER BY u.Id;
Once EXISTS locates a match, it moves on to the next value from the outer side (in this case the Users table), and attempts to find a match. If no match is found, the row is discarded, which is common to inner joins.
Where a lot of developers get hung up at first is in assuming that EXISTS and NOT EXISTS work like IN or NOT in, and they miss the inner where clause to tell the database which rows should match.
I’ve seen a lot of EXISTS queries written, quite incorrectly, like this:
SELECT u.Id, u.DisplayName FROM dbo.Users AS u WHERE u.Reputation = 1 AND EXISTS ( SELECT p.OwnerUserId FROM dbo.Posts AS p ) ORDER BY u.Id;
Which will, of course, return absolutely everything. Don’t do this.
The column you select inside of the EXISTS subquery does not infer any sort of matching logic.
Like I said before, it’s essentially discarded by the optimizer.
Your Second Mistakes
No half-assed SQL tutorial is complete without showing you the wrong way to find non-matching rows between two tables.
It will undoubtedly look something like this:
SELECT records = COUNT_BIG(u.Id) FROM dbo.Users AS u LEFT JOIN dbo.Posts AS p ON u.Id = p.OwnerUserId WHERE p.Id IS NULL;
It’s not that this pattern is never better, it’s just that it shouldn’t be your go-to for each and every query with this goal in mind.
You take two tables, you join them together, and you add a predicate to your where clause to find rows where an ordinarily not-NULL column returns NULLs.
The problem is that SQL Server’s query optimizer doesn’t contain any logic to turn this into the type of query plan that you’d get using NOT EXISTS instead.
You end up needing to fully join any tables involved together, and then later on use a filter to remove rows where no match was found. This can be incredibly inefficient, especially on large data sets.
One may even be dealing with “big data” when the follies of this paradigm become quite clear.
A generally better approach to writing this type of query is to tell the database you’re using exactly what you’re after and exactly what you expect:
SELECT records = COUNT_BIG(u.Id) FROM dbo.Users AS u WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE p.OwnerUserId = u.Id );
Your developer-life will be a whole lot less confusing and tiresome if you arm yourself with options and alternatives, which means you’ll have lots of mental energy left over to, like, learn 17 new frameworks and really impress your friends.
Think of the frameworks.
Gear Up
You should make good use of the EXISTS and NOT EXISTS patterns in your queries when you don’t require any rows from another table, and you only need to validate if something is there or not.
In cases where you need to get information from another table, joins are likely the most direct path to getting back the data you need.
But this all brings up an interesting question: what if you want to get back information in the select list without adding in join clauses, worrying about inner, outer, full, or cross, and wondering silently if one day things might go pear shaped.
We’ll talk about that in the next post, when we go over correlated subqueries.
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.