mteq.pro

April 16, 2026

How SQL Window Functions Add Context Without Losing Rows

Window functions let us calculate across related rows without collapsing them like GROUP BY does. This post explains how OVER(), PARTITION BY, ORDER BY, and common window functions work in real data engineering queries.

In other words, they perform calculations across a set of rows related to the current row while keeping every one of those rows in the output.

They are essential for:

  • analytics: "what percentage of total revenue did each product contribute?"
  • ranking: "who are the top five earners in each department?"
  • time-series processing: "what is the 7-day moving average of daily signups?"
  • deduplication: "which row do I keep when a customer appears twice with different timestamps?"
  • trend analysis: "did this customer spend more or less than last month?"
  • cumulative metrics: "what is the running total of sales up to today?"
  • comparisons between rows: "how many days passed between this order and the previous one?"

This is one of the most important SQL topics for data engineers because the same patterns appear in dashboards, ELT models, anomaly detection, and interview questions.

Let's use one mental model:

GROUP BY folds rows into one result per group.

Window functions keep every row visible and place a calculation frame around it.

GROUP BY customer_id

customer 1 orders -> one output row
customer 2 orders -> one output row

Window function PARTITION BY customer_id

customer 1 row 1 -> keep row 1 + calculate over customer 1 rows
customer 1 row 2 -> keep row 2 + calculate over customer 1 rows
customer 1 row 3 -> keep row 3 + calculate over customer 1 rows

That is the core difference.

The anatomy of OVER()

The OVER() clause is what activates window behavior. Without it, SUM() would still be a regular aggregate that collapses rows. When we add OVER(), we tell the database: keep every row visible, and apply this calculation across a related set of rows.

Most window functions follow this structure:

<function>() OVER (
    PARTITION BY <column(s)>
    ORDER BY <column(s)>
)

The angle brackets are placeholders. Replace <function> with the actual function name like SUM or ROW_NUMBER, and <column(s)> with real column names like customer_id. A real example looks like: ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC).

Each part has a job:

  • PARTITION BY splits the dataset into logical groups without collapsing rows
  • ORDER BY defines the sequence inside each group
  • the function defines what to calculate
  • an optional frame clause defines exactly which surrounding rows participate

Both PARTITION BY and ORDER BY are optional. Include only what the calculation needs. A plain SUM() OVER () with empty parentheses calculates over all rows. A RANK() OVER (ORDER BY salary DESC) needs ordering but no partitioning. A deduplication query needs both.

One subtle point matters here: the ORDER BY inside OVER() controls the calculation, not the final display order. If we want the final result sorted, we still need an outer ORDER BY.
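
A minimal sketch of that distinction, using a hypothetical table t with columns ts and val:

SELECT
    ts,
    val,
    SUM(val) OVER (ORDER BY ts) AS running_total  -- this ORDER BY drives the calculation
FROM t
ORDER BY val DESC;                                -- this ORDER BY drives the display

The running total is still accumulated in ts order even though the rows print in val order.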

Let's start with a small orders dataset.

orders
+-------------+------------+--------+
| customer_id | order_date | amount |
+-------------+------------+--------+
| 1           | 2026-01-01 | 100    |
| 1           | 2026-01-05 | 150    |
| 1           | 2026-01-09 | 120    |
| 2           | 2026-01-03 | 80     |
| 2           | 2026-01-10 | 120    |
+-------------+------------+--------+

If we write:

SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;

we get one row per customer:

+-------------+-------------+
| customer_id | SUM(amount) |
+-------------+-------------+
| 1           | 370         |
| 2           | 200         |
+-------------+-------------+

If we write:

SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders
ORDER BY customer_id, order_date;

we keep every order row and add the customer total beside it.

Result:

+-------------+------------+--------+----------------+
| customer_id | order_date | amount | customer_total |
+-------------+------------+--------+----------------+
| 1           | 2026-01-01 | 100    | 370            |
| 1           | 2026-01-05 | 150    | 370            |
| 1           | 2026-01-09 | 120    | 370            |
| 2           | 2026-01-03 | 80     | 200            |
| 2           | 2026-01-10 | 120    | 200            |
+-------------+------------+--------+----------------+
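
The empty-parentheses case from earlier behaves the same way. As a quick sketch on the same data, SUM(amount) OVER () ignores partitions entirely and attaches the grand total, 570, to every one of the five rows:

SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER () AS grand_total  -- no PARTITION BY, no ORDER BY: one frame over all rows
FROM orders
ORDER BY customer_id, order_date;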

That is the window-function mindset.

ROW_NUMBER() for deduplication

ROW_NUMBER() assigns a unique incremental number inside each partition.

This is one of the most common data-engineering patterns because many pipelines need to keep only the latest row per business key.

Use this when we need to:

  • deduplicate records and keep only one row per business key
  • keep the most recent row per customer, device, or entity
  • number events inside a group to process them step by step

Assume customers looks like this:

+-------------+-----------------------+------------+
| customer_id | email                 | updated_at |
+-------------+-----------------------+------------+
| 1           | alice-old@example.com | 2026-01-02 |
| 1           | alice-new@example.com | 2026-01-08 |
| 2           | bob-old@example.com   | 2026-01-03 |
| 2           | bob-new@example.com   | 2026-01-09 |
+-------------+-----------------------+------------+

We sort by updated_at DESC so the most recent row gets rn = 1. We cannot write WHERE rn = 1 in the same SELECT that defines rn: window functions are evaluated after WHERE has already filtered the rows, so the rn alias does not yet exist when WHERE runs. A CTE solves this by first computing the ranked result, then letting an outer query filter it.

WITH ranked AS (
    SELECT
        customer_id,
        email,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM customers
)
SELECT *
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

Inside the CTE, the ranked rows conceptually look like this:

+-------------+-----------------------+------------+----+
| customer_id | email                 | updated_at | rn |
+-------------+-----------------------+------------+----+
| 1           | alice-new@example.com | 2026-01-08 | 1  |
| 1           | alice-old@example.com | 2026-01-02 | 2  |
| 2           | bob-new@example.com   | 2026-01-09 | 1  |
| 2           | bob-old@example.com   | 2026-01-03 | 2  |
+-------------+-----------------------+------------+----+

After WHERE rn = 1, the final result is:

+-------------+-----------------------+------------+----+
| customer_id | email                 | updated_at | rn |
+-------------+-----------------------+------------+----+
| 1           | alice-new@example.com | 2026-01-08 | 1  |
| 2           | bob-new@example.com   | 2026-01-09 | 1  |
+-------------+-----------------------+------------+----+

This keeps the most recent row for each customer.

One production note matters here: if two rows can share the same updated_at, add a second tie-breaker to the ORDER BY; otherwise the "latest" row might not be deterministic. Any unique column works as the tie-breaker (customer_record_id below is a stand-in for whatever surrogate key the table actually carries):

ORDER BY updated_at DESC, customer_record_id DESC

RANK() and DENSE_RANK()

Row numbers give every row a unique sequence position. Sometimes we need relative standing instead, where two employees who earn the same salary should share the same rank. That is what RANK() and DENSE_RANK() are for.

RANK() and DENSE_RANK() both assign rank values inside a partition, but they handle ties differently.

  • RANK() leaves gaps after ties
  • DENSE_RANK() does not leave gaps

Use this when we need to:

  • rank employees by salary inside each department
  • find the top performer in each category
  • build leaderboards or scoring tables where ties must be handled explicitly

SELECT
    department,
    employee_name,
    salary,
    RANK() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS rank_in_dept,
    DENSE_RANK() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS dense_rank_in_dept
FROM employees
ORDER BY department, salary DESC, employee_name;

If salaries inside IT are 9000, 9000, 8000, the result looks like this:

+------------+---------------+--------+--------------+--------------------+
| department | employee_name | salary | rank_in_dept | dense_rank_in_dept |
+------------+---------------+--------+--------------+--------------------+
| IT         | Ana           | 9000   | 1            | 1                  |
| IT         | Leo           | 9000   | 1            | 1                  |
| IT         | Mira          | 8000   | 3            | 2                  |
+------------+---------------+--------+--------------+--------------------+

We use this pattern for top-N per category, leaderboard logic, and compensation analysis.

Use DENSE_RANK when downstream logic must step through every rank without gaps, for example awarding prizes to tiers 1, 2, and 3. Use RANK when the gaps are acceptable or meaningful, such as showing that no one finished second because two people tied for first.
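
The top-N pattern mentioned above combines this query with the CTE-then-filter shape from the deduplication section. A sketch, reusing the same employees table:

WITH ranked AS (
    SELECT
        department,
        employee_name,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS rnk
    FROM employees
)
SELECT *
FROM ranked
WHERE rnk <= 3
ORDER BY department, salary DESC, employee_name;

Note that DENSE_RANK() keeps every tied row, so a department can return more than three people when salaries tie; swap in ROW_NUMBER() if exactly three rows per department are required.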

LAG() and LEAD() for row-to-row comparisons

Ranking tells us where a row sits inside a group. Lag and lead tell us what happened in the rows immediately before and after it.

LAG() gives us a value from a previous row. Use it when we need to compare each row against the one before it: detecting growth, calculating differences, or spotting state changes.

LEAD() gives us a value from a following row. Use it when we need to look forward: finding the next event date, calculating time to the next action, or checking what comes after the current row.

Common use cases for both:

  • detecting whether sales grew or shrank period over period
  • calculating the number of days between consecutive events
  • comparing each transaction to the previous one in a sequence
  • building retention signals by flagging when activity drops

LAG(amount) is shorthand for LAG(amount, 1, NULL). It fetches the value from 1 row back and returns NULL if no prior row exists. The two optional arguments let us change both: LAG(amount, 2) looks back two rows, and LAG(amount, 1, 0) substitutes 0 instead of NULL for the first row in a partition.

SELECT
    customer_id,
    order_date,
    amount,
    LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS previous_amount,
    LEAD(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS next_order_date
FROM orders
ORDER BY customer_id, order_date;

Result:

+-------------+------------+--------+-----------------+-----------------+
| customer_id | order_date | amount | previous_amount | next_order_date |
+-------------+------------+--------+-----------------+-----------------+
| 1           | 2026-01-01 | 100    | NULL            | 2026-01-05      |
| 1           | 2026-01-05 | 150    | 100             | 2026-01-09      |
| 1           | 2026-01-09 | 120    | 150             | NULL            |
| 2           | 2026-01-03 | 80     | NULL            | 2026-01-10      |
| 2           | 2026-01-10 | 120    | 80              | NULL            |
+-------------+------------+--------+-----------------+-----------------+

We can also calculate the difference immediately:

SELECT
    customer_id,
    order_date,
    amount,
    amount - LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS diff_from_previous
FROM orders
ORDER BY customer_id, order_date;

Result:

+-------------+------------+--------+--------------------+
| customer_id | order_date | amount | diff_from_previous |
+-------------+------------+--------+--------------------+
| 1           | 2026-01-01 | 100    | NULL               |
| 1           | 2026-01-05 | 150    | 50                 |
| 1           | 2026-01-09 | 120    | -30                |
| 2           | 2026-01-03 | 80     | NULL               |
| 2           | 2026-01-10 | 120    | 40                 |
+-------------+------------+--------+--------------------+

For the first row in each partition, LAG() returns NULL, so the difference is also NULL.
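
If that leading NULL is inconvenient downstream, the three-argument form from earlier substitutes a default. A sketch (here 0 is an arbitrary stand-in value, so the first diff becomes the full amount):

SELECT
    customer_id,
    order_date,
    amount,
    amount - LAG(amount, 1, 0) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS diff_from_previous  -- first row in each partition now diffs against 0
FROM orders
ORDER BY customer_id, order_date;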

This is the pattern behind questions like:

  • did sales grow or shrink?
  • what was the previous purchase?
  • how many days passed before the next event?
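
The last question deserves its own sketch because date arithmetic is dialect-specific. Assuming order_date is a DATE and PostgreSQL-style subtraction, where date minus date yields an integer day count:

SELECT
    customer_id,
    order_date,
    order_date - LAG(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS days_since_previous_order  -- NULL for each customer's first order
FROM orders
ORDER BY customer_id, order_date;

In other dialects the subtraction becomes a function call, such as DATEDIFF in SQL Server or MySQL.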

Running totals

Row-to-row comparisons look one step at a time. Running totals look backward across every previous row and keep a growing sum.

Running totals are one of the clearest examples of why window functions matter.

Use this pattern for:

  • dashboards that need a growing cumulative total over time
  • finance reports tracking revenue or spend month by month
  • funnel analysis that counts users through successive stages

Assume monthly_sales looks like this:

+---------+-------+
| month   | sales |
+---------+-------+
| 2026-01 | 1000  |
| 2026-02 | 1200  |
| 2026-03 | 900   |
| 2026-04 | 1400  |
+---------+-------+

SELECT
    month,
    sales,
    SUM(sales) OVER (
        ORDER BY month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM monthly_sales
ORDER BY month;

Result:

+---------+-------+---------------+
| month   | sales | running_total |
+---------+-------+---------------+
| 2026-01 | 1000  | 1000          |
| 2026-02 | 1200  | 2200          |
| 2026-03 | 900   | 3100          |
| 2026-04 | 1400  | 4500          |
+---------+-------+---------------+

The frame clause matters here:

  • UNBOUNDED PRECEDING means start from the first row in the ordered set
  • CURRENT ROW means stop at the current row

ROWS counts physical row positions. The alternative keyword RANGE defines frame boundaries by value instead of position, and when ORDER BY is present without an explicit frame, the SQL-standard default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Writing ROWS explicitly avoids that surprise and produces consistent results across platforms.

This gives us a cumulative total that grows row by row.

That pattern appears constantly in dashboards, finance reporting, and funnel analysis.
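
Returning to the ROWS versus RANGE point, the same running total written with RANGE looks nearly identical, and on this data it even returns the same numbers because every month value is unique:

SELECT
    month,
    sales,
    SUM(sales) OVER (
        ORDER BY month
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM monthly_sales
ORDER BY month;

If two rows shared a month, RANGE would treat them as peers and include both in the frame at once, so both rows would show the same running total. That is exactly the kind of silent difference the explicit ROWS clause protects against.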

Moving averages

Running totals keep expanding. Moving averages keep a fixed window.

Use this pattern for:

  • smoothing noisy daily or weekly data
  • spotting trends while reducing random spikes
  • anomaly detection in time-series signals

Assume sales looks like this:

+------------+--------+
| date       | amount |
+------------+--------+
| 2026-01-01 | 100    |
| 2026-01-02 | 130    |
| 2026-01-03 | 90     |
| 2026-01-04 | 150    |
| 2026-01-05 | 120    |
+------------+--------+

SELECT
    date,
    amount,
    AVG(amount) OVER (
        ORDER BY date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3_days
FROM sales
ORDER BY date;

The frame ROWS BETWEEN 2 PRECEDING AND CURRENT ROW spans 2 rows before the current row plus the current row itself, making 3 rows total. That is where the name moving_avg_3_days comes from.

Result:

+------------+--------+-------------------+
| date       | amount | moving_avg_3_days |
+------------+--------+-------------------+
| 2026-01-01 | 100    | 100.00            |
| 2026-01-02 | 130    | 115.00            |
| 2026-01-03 | 90     | 106.67            |
| 2026-01-04 | 150    | 123.33            |
| 2026-01-05 | 120    | 120.00            |
+------------+--------+-------------------+

The early rows use a smaller frame because there are not yet two previous rows available.

Here, each row sees at most three rows:

  • the current row
  • the row one position back
  • the row two positions back

This is useful for smoothing noisy data, spotting trends, and anomaly detection.
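
If partial windows are unwanted, one common fix is to blank them out until enough history exists. A sketch, assuming the same sales table:

SELECT
    date,
    amount,
    CASE
        WHEN ROW_NUMBER() OVER (ORDER BY date) >= 3
        THEN AVG(amount) OVER (
            ORDER BY date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        )
    END AS moving_avg_3_days  -- NULL until a full 3-row frame is available
FROM sales
ORDER BY date;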

PERCENT_RANK() and NTILE()

Cumulative functions produce absolute numbers: totals, averages, differences. Sometimes the more useful question is relative: where does this value sit in the distribution compared to all others?

PERCENT_RANK() tells us where a row sits inside an ordered distribution.

Use PERCENT_RANK() for:

  • salary benchmarking to see where an employee stands relative to peers
  • identifying which percentile a value falls in
  • comparing an individual metric against a group baseline

Assume employees looks like this:

+-------------+--------+
| employee_id | salary |
+-------------+--------+
| 1           | 4000   |
| 2           | 6000   |
| 3           | 8000   |
| 4           | 10000  |
+-------------+--------+

SELECT
    employee_id,
    salary,
    PERCENT_RANK() OVER (ORDER BY salary) AS percentile
FROM employees
ORDER BY salary;

Result:

+-------------+--------+------------+
| employee_id | salary | percentile |
+-------------+--------+------------+
| 1           | 4000   | 0.00       |
| 2           | 6000   | 0.33       |
| 3           | 8000   | 0.67       |
| 4           | 10000  | 1.00       |
+-------------+--------+------------+

The percentile values here are rounded for readability. The formula is (rank − 1) / (total rows − 1). With 4 employees, the lowest earner scores 0 / 3 = 0.00 and the highest scores 3 / 3 = 1.00. That is why the first value is always 0.00 and the last is always 1.00.

NTILE() splits rows into equal-sized buckets.

Use NTILE() for:

  • segmenting customers into quartiles or deciles by spend
  • creating tiered groups for marketing or analysis
  • splitting any ranked list into score bands

Assume customers looks like this:

+-------------+--------------+
| customer_id | amount_spent |
+-------------+--------------+
| 1           | 900          |
| 2           | 800          |
| 3           | 700          |
| 4           | 600          |
| 5           | 500          |
| 6           | 400          |
+-------------+--------------+

SELECT
    customer_id,
    amount_spent,
    NTILE(4) OVER (ORDER BY amount_spent DESC) AS spend_quartile
FROM customers
ORDER BY amount_spent DESC;

Result:

+-------------+--------------+----------------+
| customer_id | amount_spent | spend_quartile |
+-------------+--------------+----------------+
| 1           | 900          | 1              |
| 2           | 800          | 1              |
| 3           | 700          | 2              |
| 4           | 600          | 2              |
| 5           | 500          | 3              |
| 6           | 400          | 4              |
+-------------+--------------+----------------+

Because 6 rows do not divide evenly into 4 buckets, the first two buckets receive one extra row each.

That is useful when we want quartiles, deciles, or tiered segmentation.

A real pipeline pattern

One of the most common real-world uses of window functions is deduplication inside raw event streams.

Assume raw_events looks like this:

+---------+------------+---------------------+
| user_id | event_type | event_timestamp     |
+---------+------------+---------------------+
| 1       | click      | 2026-01-01 10:00:00 |
| 1       | click      | 2026-01-01 10:05:00 |
| 1       | view       | 2026-01-01 10:07:00 |
| 2       | click      | 2026-01-02 09:00:00 |
| 2       | click      | 2026-01-02 09:30:00 |
+---------+------------+---------------------+

WITH events_ranked AS (
    SELECT
        user_id,
        event_type,
        event_timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, event_type
            ORDER BY event_timestamp DESC
        ) AS rn
    FROM raw_events
)
SELECT *
FROM events_ranked
WHERE rn = 1
ORDER BY user_id, event_type;

Result:

+---------+------------+---------------------+----+
| user_id | event_type | event_timestamp     | rn |
+---------+------------+---------------------+----+
| 1       | click      | 2026-01-01 10:05:00 | 1  |
| 1       | view       | 2026-01-01 10:07:00 | 1  |
| 2       | click      | 2026-01-02 09:30:00 | 1  |
+---------+------------+---------------------+----+

This pattern works because the window function adds a rank without collapsing rows. We then filter the ranked result in an outer query, with the ranked step wrapped in a CTE.

A CTE, or Common Table Expression, is a temporary named result set defined with WITH. It exists only for the query that follows it. We use one here because it makes the window result easier to filter and easier to read.

An interview pattern worth knowing

Window functions show up constantly in interviews because they test whether we can reason across rows without falling back to awkward self-joins.

WITH enriched AS (
    SELECT
        customer_id,
        order_date,
        amount,
        LAG(amount) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS prev_amount
    FROM orders
)
SELECT *
FROM enriched
WHERE amount > prev_amount
ORDER BY customer_id, order_date;

Result:

+-------------+------------+--------+-------------+
| customer_id | order_date | amount | prev_amount |
+-------------+------------+--------+-------------+
| 1           | 2026-01-05 | 150    | 100         |
| 2           | 2026-01-10 | 120    | 80          |
+-------------+------------+--------+-------------+

This query answers a practical question: which purchases increased compared to the previous purchase for the same customer?

That is not just interview material. It is also a real retention and behavior-analysis pattern.

Takeaway:

Window functions let us calculate across related rows without losing row-level detail.

If we remember only one contrast, let it be this:

  • GROUP BY collapses rows
  • window functions keep rows and add context

Keep this map in mind:

  • ROW_NUMBER() for deduplication and latest-row logic
  • RANK() and DENSE_RANK() for ranking within groups
  • LAG() and LEAD() for previous and next row comparisons
  • SUM() and AVG() with frames for cumulative and rolling metrics
  • PERCENT_RANK() and NTILE() for relative position and bucketing

And keep these two warnings next to it:

  • always define ORDER BY carefully inside OVER() because sequence changes the result
  • add explicit tie-breakers when deduplication must be deterministic

Challenge:

Take an orders table with customer_id, order_date, and amount, then write three queries:

  1. keep only the latest order per customer
  2. calculate a running total per customer over time
  3. flag rows where the amount is greater than the previous order for the same customer

Then explain which of those three problems would be awkward or lossy with GROUP BY alone. That is where window functions stop looking like syntax tricks and start looking like essential SQL.