Created 2024/12/08 at 10:44PM
Last Modified 2025/01/04 at 01:01AM
I have often seen people forget about a powerful SQL clause, PARTITION BY, and instead write queries using GROUP BY and joins, which are often slower and harder to read.
Let's do a quick run-through of GROUP BY and PARTITION BY first.
GROUP BY collapses multiple rows into a single row for each unique combination of the grouped columns, and is used with aggregate functions (such as COUNT(), SUM(), MIN(), etc.).
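For instance, a minimal sketch against a hypothetical trades table (the table and its columns are made up for illustration):
-- One output row per distinct ticker, however many input rows there are.
SELECT ticker, COUNT(*) AS num_trades, SUM(quantity) AS total_quantity
FROM trades
GROUP BY ticker;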
PARTITION BY does not collapse rows; it produces a value for each row based on a subset ("partition") of the data. It is used with window functions (such as SUM(...) OVER (PARTITION BY ...), AVG(...) OVER (PARTITION BY ...), etc.) which are applied over each partition.
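A minimal sketch of the same idea with a window function, against the same hypothetical trades table:
-- Every input row is returned; total_quantity repeats the ticker's sum on each row.
SELECT id, ticker, quantity,
    SUM(quantity) OVER (PARTITION BY ticker) AS total_quantity
FROM trades;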
Let's consider the following example to analyze the difference between the two and the power of the PARTITION BY clause.
We have an executions table with millions of records, and its schema looks like this:
> DESC executions;
+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int          | NO   | PRI | NULL    | auto_increment |
| exchange       | varchar(16)  | NO   |     | NULL    |                |
| ticker         | varchar(16)  | NO   |     | NULL    |                |
| execution_date | date         | NO   |     | NULL    |                |
| quantity       | int unsigned | NO   |     | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+
> SELECT * FROM executions LIMIT 5;
+----+----------+--------+----------------+----------+
| id | exchange | ticker | execution_date | quantity |
+----+----------+--------+----------------+----------+
|  1 | NYSE     | FSM    | 2024-09-01     |      637 |
|  2 | LSE      | RKLB   | 2024-09-01     |      880 |
|  3 | CSE      | APPF   | 2024-09-01     |      108 |
|  4 | NYSE     | DLB    | 2024-09-01     |        6 |
|  5 | JPX      | CHWY   | 2024-09-01     |      692 |
+----+----------+--------+----------------+----------+
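For reference, a CREATE TABLE statement reconstructed from the schema above would look roughly like this (the storage engine and any secondary indexes are unknown, so none are assumed):
-- Sketch reconstructed from the DESC output; only the primary key is shown.
CREATE TABLE executions (
    id             INT NOT NULL AUTO_INCREMENT,
    exchange       VARCHAR(16) NOT NULL,
    ticker         VARCHAR(16) NOT NULL,
    execution_date DATE NOT NULL,
    quantity       INT UNSIGNED NOT NULL,
    PRIMARY KEY (id)
);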
Requirement: for each ticker, calculate the daily total quantity traded and report it alongside the individual row details.
First, the GROUP BY approach: we calculate the aggregate in a subquery and join it back to our original table.
SELECT
    e.*,
    t.daily_ticker_quantity
FROM executions e
JOIN (
    SELECT ticker, execution_date, SUM(quantity) AS daily_ticker_quantity
    FROM executions
    GROUP BY ticker, execution_date
) t ON e.ticker = t.ticker AND e.execution_date = t.execution_date;
# 9540371 rows in set (42.51 sec) [average over 10 runs]
Now the same requirement with PARTITION BY: a single window function, no join needed.
SELECT *,
    SUM(quantity) OVER (
        PARTITION BY ticker, execution_date
    ) AS daily_ticker_quantity
FROM executions;
# 9540371 rows in set (41.67 sec) [average over 10 runs]
So we see that PARTITION BY is slightly faster, and the query is easier to read and understand.
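As a further sketch (this assumes MySQL 8.0's named-window syntax; the extra columns are just illustrations), the same partition can feed several window functions at once:
-- One pass, three window aggregates sharing the same partition.
SELECT *,
    SUM(quantity) OVER w AS daily_ticker_quantity,
    AVG(quantity) OVER w AS avg_execution_size,
    COUNT(*)      OVER w AS executions_that_day
FROM executions
WINDOW w AS (PARTITION BY ticker, execution_date);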
Database engines typically use either a sort-based or a hash-based approach to split data into groups/partitions. Let's say that our engine follows the hash-based approach.
For the GROUP BY + JOIN query, grouping takes O(N) on average to build and populate the hash table (assuming good hashing and enough memory), producing M aggregate rows. The join back to the original table then builds a hash table over the M grouped rows and scans the N original rows to probe for matches, which is O(M + N). The space complexity is O(N) for grouping and O(N) for joining.
The PARTITION BY query requires O(N) on average to build the partitions via hashing and compute the sums, since it's a single pass. The space complexity is again O(N). Both are linear, but the window-function version avoids materializing an intermediate result and joining it back, which is consistent with the slightly better timing above.
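Summing up, under the hash-based assumption:
+-----------------+-----------------+-------+
| Approach        | Time            | Space |
+-----------------+-----------------+-------+
| GROUP BY + JOIN | O(N) + O(M + N) | O(N)  |
| PARTITION BY    | O(N)            | O(N)  |
+-----------------+-----------------+-------+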