ramblings on PHP, SQL, the web, politics, ultimate frisbee and what else is on in my life
back

Bad smells are relative to where you are coming from

So today Roman, one of the developers on the Doctrine project, pointed me to an article that was discussing the misuse of DISTINCT. This article went so far as to say that "A SELECT DISTINCT query is frequently a "code smell". The article pointed to another article hosted at onlamp.com. That author was advocating the use of subqueries to more efficiently filter out redundant rows. Immediately I began to wonder if this is really a feasible approach since MySQL's subquery handling is very slow. I could imagine this being faster on RDBMS where the subquery implementation is more mature. That being said comments in the onlamp.com article point to the fact that even on Oracle things get slower with the subquery approach. The original article does however talk about how rewriting similar queries with a subquery improved performance on PostgreSQL.

Anyways, I briefly wrote up similar queries using the Sakila sample database for MySQL and ran them on my local system. I used version 0.8.0 of Sakila. I did execute them a few times to prime caches (but the query cache was of course disabled), but it was running on my desktop machine, so results could have been scewed, but with a few runs the results always come out consistently:

mysql> SELECT DISTINCT c.customer_id, c.last_name FROM customer c, rental r
> WHERE c.email LIKE '%sakilacustomer.org' AND r.rental_date >= '2005-08-23 20:44:36' AND c.customer_id = r.customer_id
> ORDER BY c.last_name;  
+-------------+-------------+
| customer_id | last_name   |
+-------------+-------------+
|         505 | ABNEY       | 
|         504 | ADAM        | 
..
|         107 | WOODS       | 
|         496 | WREN        | 
|          28 | YOUNG       | 
+-------------+-------------+
201 rows in set (0.01 sec)

mysql> SELECT c.customer_id, c.last_name FROM customer c
> WHERE c.email LIKE '%sakilacustomer.org' AND EXISTS
> (SELECT NULL FROM rental r WHERE r.rental_date >= '2005-08-23 20:44:36' AND c.customer_id = r.customer_id)
> ORDER BY c.last_name;
+-------------+-------------+
| customer_id | last_name   |
+-------------+-------------+
|         505 | ABNEY       | 
|         504 | ADAM        | 
..
|         107 | WOODS       | 
|         496 | WREN        | 
|          28 | YOUNG       | 
+-------------+-------------+
201 rows in set (0.04 sec)

mysql> SELECT customer_id, last_name FROM customer
> WHERE email LIKE '%sakilacustomer.org' AND customer_id IN
> (SELECT customer_id FROM rental WHERE rental_date >= '2005-08-23 20:44:36')
> ORDER BY last_name;
+-------------+-------------+
| customer_id | last_name   |
+-------------+-------------+
|         505 | ABNEY       | 
|         504 | ADAM        | 
..
|         107 | WOODS       | 
|         496 | WREN        | 
|          28 | YOUNG       | 
+-------------+-------------+
201 rows in set (0.04 sec)

I did not do much more digging, I did not check the EXPLAIN output or verify what indexes are being used. I guess I was satisfied with the answer, since I was expecting exactly this result anyways. Maybe if I get poked enough I will try to run this query again Pagila.

Comments



Re: Bad smells are relative to where you are coming from

Here's the thing -- the "bad smell" to me isn't that the query is slow, and I don't think the point was "use subqueries to eliminate DISTINCT".

To me, using SELECT DISTINCT says, to paraphrase the article, "get me this information but randomly throw away duplicates." As in, I want one item from each set, throw away duplicates, I don't care which.

Take a query like: "SELECT DISTINCT name FROM name_tbl". This will, say, give you all the names in a table. But it's more likely the query really wants:

"SELECT DISTINCT name FROM name_tbl ORDER BY name"

in which case

"SELECT name FROM name_tbl GROUP BY name"

is more useful.

MySQL gets chastised for allowing queries like

"SELECT name, address from name_tbl GROUP BY name"

because for duplicate names it will pick some random address. The SELECT DISTINCT "code smell" is the same chastisement. WHY don't you care about the duplicates? If you don't want the duplicates, why are you retrieving them in the first place?

I don't see it as a speed issue at all, I see it as a "code smell" -- a bad SQL query, because you're not getting what you want.

If your boss asks you for a set of reports every day and rips all but one up every week, your work was wasted. And it's obvious your boss doesn't want your reports......one random one per day is saved. If you ask him, you realize he just wants which day of the week it is, and if you give him 3 reports that say "today is Tuesday" then he knows "aha, it's Tuesday!"

So yeah, it's code smell. The rewrite is not a semantic one, it's a logical one. SQL is a declarative language, and if you say "get me a list, and throw away most of it" then yes, there's something wrong. It can be done more efficiently, and as SQL is declarative, not procedural, you have to figure out how to *declare* what you really want.

which is worse?

I agree it's a code smell.

If you have a list of possible values something could be, and you're using DISTINCT to find out what they are (to build a drop down list etc), then you are doing something wrong.

The better solution is to use some sort of definitions table. i.e. the smell is a non-normalized database.

Re: Bad smells are relative to where you are coming from

Right, in the comments the author of the second article says its about "doing the right thing" (tm) in relational terms. Then again in the article there is a fair bit of talk about performance.

Now one of my points was that MySQL needs to tweak its subquery performance, but then again according to the comments even Oracle does not provide good performance in the clean approach either (maybe someone else like validating this).

Now I do not really agree however that DISTINCT is ugly per se. It does carry a high potential of a "bad smell". However I think in this case the query with the DISTINCT seems just as, or even more readable than the other two. And if its runs faster all the merrier. As such I do not see however overzealous purity thinking gets you much further or otherwise we would all normalize to 5NF. So just because something usually tends to "smell bad", does not mean it has to stink in all use cases.

However I hope I have now made it more clear that I am not advocating using DISTINCT as a default approach.

Re: Bad smells are relative to where you are coming from

The main reason why I think DISTINCT smells is that I've seen far too many people write SQL by trial and error. They're joining tables, and getting duplicate rows, and rather than figuring out why they're getting them, they just slip in a DISTINCT and cross their fingers.

Re: Bad smells are relative to where you are coming from

Sheeri explained it very well IMO.

SELECT DISTINCT is redundant - you can achieve the same effect with GROUP BY, except that GROUP BY is more powerful and more explicit. To me, DISTINCT seems a kind of language pollution more than anything else.

Of course, DISTINCT is not categorically bad - in aggregate functions it's highly useful, such as in COUNT(DISTINCT ...).

As for the subqueries - it appears to be more subtle than "DISTINCT is faster than a correlated subquery". For example, change the date to something long ago, in the past - '2000-08-23 20:44:36', (or remove the rental_date criterion all together.) In my system, the subquery is significantly *faster* for this particular case. (faster than both DISTINCT and the equivalent GROUP BY)

Sometimes is just laziness

Sometimes, DISTINCT is a symptom of bad normalization, as explained in
this post.

Giuseppe

Re: Bad smells are relative to where you are coming from

Well telling people that DISTINCT is redundant to GROUP BY seems even more dangerous to me. But I am starting to see where you guys are all coming from. A quick test showed that indeed GROUP BY performs just as fast as DISTINCT.

@Roland: Like I said, I did not try at all to understand the performance of the queries and how indexes are leveraged. If I finish the chores for the day (Doctrine talk slides and TestFest preparations), I might play a bit more with these queries.