Understanding group bys


Yesterday, at work, it occurred to me, not for the first time, that something that comes so easily to me, does not to others.  I had tasked someone with, what seemed a simple task, to retrieve a list of users who have reached minimum payout.

The payouts were stored in a separate table from the users and users could have multiple payment records prior to reaching payout, thus we would need to use a SUM() on the amount.  Because we also required other data, we need to use a group by to properly sum the amounts a user has.

The task was handed back to me several hours later to review, it was quite evident that the main concept was “grasped”, but the understanding of group bys was not.

I began by asking the employee if they understood what a group by does.  They answered, yes, it groups related data together.  Correct, it does group related data together.  I then proceeded to review their code with them.  I said, re-read your SQL query and let me know if you see any problems.  They were unable to.

The query was as follows:

SELECT user_payments.id, user_payments.user_id, SUM(user_payments.amount) FROM user_payments INNER JOIN users ON users.id = user_payments.user_id GROUP BY user_payments.user_id HAVING SUM(user_payments.amount) >= users.min_payout ORDER BY user_payments.user_id

Can you spot the problem?

If not, let’s review by re-reading the description of what a group by is: it groups related data together.  The only data that is related in the above query is the user_id.  The user_payments.id is a unique key for each row.  The second this item is included in the query causes the group by to fail, because no rows of data are “related”.  The rows must be the exact same.

To solve the problem above, we simply need to remove the first column from our query.  Once we re-run it, it will successfully group the multiple user payments together.

The other key point to the above is, we don’t use a WHERE clause, instead we use a HAVING.  Let me clarify further.  You can use a where clause with a group by, but not on the field that is using the SQL function SUM(), COUNT(), etc… on it.

There are many other useful reasons for using group bys.  One of my favorite is to determine duplicate records.  Often times systems get built that accept, let’s say, a URL.  When first built, there was no condition to check for duplicate URLs.

Later on down the road, you find out you are having problems because you are querying by the URL and returning incorrect data or find weird problems with data not being saved properly.  The reasons could be endless.  To write a query to check for duplicate URLs, you could write a query with a group by.

SELECT url, COUNT(id) FROM websites GROUP BY url HAVING COUNT(id) > 1 ORDER BY url

The above query will return all URLs that exist more than once in your table.  We list the count, so we can see just how many times it exists.

Hopefully the above examples paint a good picture of how to use a group by.

About the author

By Jamie

My Books