Which Rows Survive a SQL Join?

A JOIN does not just combine tables. It decides which rows survive.

That is why one keyword can change row counts, introduce NULLs, or silently drop records we expected to keep. If a query returns too few rows, too many rows, or missing dimension values, the join type is usually the first thing to inspect.

Let's use one mental model:

the ON clause decides who matches whom
the join type decides what happens to rows that do not match

That second part is where most mistakes happen.

Here is the shortest possible cheat sheet:

INNER JOIN keeps matched rows only
LEFT JOIN keeps all rows from the left table
RIGHT JOIN keeps all rows from the right table
FULL OUTER JOIN keeps all rows from both sides
CROSS JOIN keeps every possible combination
SELF JOIN joins a table to itself

Example:

Let's start with one small dataset and reuse it for most join types.

customers                          orders
+----+-------+                    +----------+-------------+
| id | name  |                    | order_id | customer_id |
+----+-------+                    +----------+-------------+
|  1 | Alice |                    | 101      | 1           |
|  2 | Bob   |                    | 102      | 3           |
|  3 | Cara  |                    | 103      | 99          |
+----+-------+                    +----------+-------------+

Two pairs matter here:

Alice matches order 101
Cara matches order 102

Two rows do not match:

Bob has no order
order 103 points to customer_id = 99, which does not exist in customers

In a strictly constrained transactional database, that orphaned key might be impossible. In analytics pipelines, it is common enough. Late-arriving dimensions, weak source constraints, or partially loaded data can all create this shape.

That is enough to explain every join type.

INNER JOIN

INNER JOIN keeps only rows that match on both sides.

SELECT c.name, o.order_id
FROM customers c
INNER JOIN orders o
    ON c.id = o.customer_id;

Result:

+-------+----------+
| name  | order_id |
+-------+----------+
| Alice | 101      |
| Cara  | 102      |
+-------+----------+

Bob disappears because there is no matching order. Order 103 disappears because there is no matching customer.

This is the most restrictive join. It is correct when we only care about rows that fully match.

LEFT JOIN

LEFT JOIN keeps every row from the left table and fills the right side with NULL when there is no match.

SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o
    ON c.id = o.customer_id;

Result:

+-------+----------+
| name  | order_id |
+-------+----------+
| Alice | 101      |
| Bob   | NULL     |
| Cara  | 102      |
+-------+----------+

Bob survives because customers is on the left side.

This is the join we use when the left table is the grain we want to preserve.

RIGHT JOIN

RIGHT JOIN does the same thing in the other direction. It keeps every row from the right table.

SELECT o.order_id, c.name
FROM customers c
RIGHT JOIN orders o
    ON c.id = o.customer_id;

Result:

+----------+-------+
| order_id | name  |
+----------+-------+
| 101      | Alice |
| 102      | Cara  |
| 103      | NULL  |
+----------+-------+

Order 103 survives because orders is on the right side.

Technically this is fine. In practice, many teams prefer to swap table order and write LEFT JOIN because it is easier to read in one consistent direction.

FULL OUTER JOIN

FULL OUTER JOIN keeps all rows from both sides. Matched rows are combined. Unmatched rows survive with NULL on the missing side.

SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o
    ON c.id = o.customer_id;

Result:

+-------+----------+
| name  | order_id |
+-------+----------+
| Alice | 101      |
| Bob   | NULL     |
| Cara  | 102      |
| NULL  | 103      |
+-------+----------+

Tip: FULL JOIN and FULL OUTER JOIN mean the same thing.

One production note matters here: not every SQL engine supports FULL OUTER JOIN directly. For example, MySQL does not. In those engines, we usually emulate it with a LEFT JOIN, a RIGHT JOIN, and a UNION.

CROSS JOIN

CROSS JOIN does not use matching keys at all. It returns the Cartesian product.

That means every row on the left is paired with every row on the right.

SELECT p.product_name, r.region
FROM products p
CROSS JOIN regions r;

If products has 5 rows and regions has 10 rows, the result has 5 x 10 = 50 rows.

This is useful when we intentionally want all combinations, such as:

product x region matrices
date x category grids
test datasets
scaffold tables for missing combinations

This join is powerful, but it is also the fastest way to create accidental row explosion.

SELF JOIN

A SELF JOIN joins a table to itself. We use it when rows inside the same table relate to other rows in that same table.

A common example is employee to manager.

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m
    ON e.manager_id = m.id;

Result idea:

+----------+---------+
| employee | manager |
+----------+---------+
| Ana      | Sara    |
| Leo      | Sara    |
| Sara     | NULL    |
+----------+---------+

Sara has no manager, so LEFT JOIN preserves her row and fills the manager column with NULL.

This is why SELF JOIN is common in hierarchies, org charts, and parent-child relationships.

The join type is not enough

Even when the join type is correct, post-join filtering can still change the result.

The classic trap is LEFT JOIN followed by a WHERE filter on the right table.

-- Wrong if we want to keep all customers
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o
    ON c.id = o.customer_id
WHERE o.order_id IS NOT NULL;

That WHERE clause removes the NULL rows created by the LEFT JOIN, which makes the result behave like an INNER JOIN.

If we need to preserve left-side rows, filters on right-table columns usually belong in ON, not in WHERE.

put the filter in ON when we want to keep unmatched left rows and only limit which right-side rows can match
put the filter in WHERE when we want to filter the final joined result, even if that removes unmatched rows

Multiple joins

Now let's move from one join to a more realistic warehouse pattern.

When we write multiple joins, the useful mental model is this: SQL builds a working result, then joins the next table to that result, then joins the next table again.

Logically, that means each join produces an intermediate result set that becomes the input to the following join.

sales
  -> join products
      -> intermediate result
          -> join customers
              -> final result

That is the right model for reasoning about row counts and NULL propagation.

One precision note matters: the optimizer is free to change the physical execution plan. It might reorder joins or choose a different strategy internally. But for understanding what the query means, the intermediate-result model is the right place to start.

If sales is our fact table and products and customers are dimensions, we often want to preserve every sales row even when a dimension lookup is missing.

SELECT
    s.sales_id,
    s.product_id,
    s.customer_id,
    s.quantity,
    p.product_name,
    c.customer_name
FROM sales s
LEFT JOIN products p
    ON s.product_id = p.product_id
LEFT JOIN customers c
    ON s.customer_id = c.customer_id;

Both joins are LEFT JOINs because sales is the grain we want to keep.

That is the real question behind multiple joins: which table defines the grain, and which rows are non-negotiable?

WHERE vs HAVING after a join

Once a join produces rows, filtering still follows normal SQL rules.

WHERE filters rows before grouping
HAVING filters grouped results after aggregation

SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o
    ON c.id = o.customer_id
WHERE c.id > 0
GROUP BY c.name
HAVING COUNT(o.order_id) > 1;

If we put COUNT(o.order_id) > 1 in WHERE, the query fails because aggregates are not available yet.

Using a CTE before a join

Sometimes the best join is not a direct join from raw tables. Sometimes we first need to shape the data.

A CTE, short for Common Table Expression, is a temporary named result set defined at the start of a query with WITH.

It is not a permanent table. It only exists for the statement that follows it.

We use CTEs when we want to:

break a query into readable steps
name an intermediate result
reuse the same logic inside a larger query
separate aggregation logic from lookup logic

A common pattern is to aggregate first, then join the aggregated result.

WITH product_sales AS (
    SELECT
        product_id,
        SUM(quantity) AS total_quantity
    FROM sales
    GROUP BY product_id
    HAVING SUM(quantity) > 50
)
SELECT
    p.product_name,
    ps.total_quantity
FROM product_sales ps
JOIN products p
    ON ps.product_id = p.product_id;

This is where a CTE helps.

Here, product_sales is the named intermediate result. First we compute total quantity per product. Then we join that smaller shaped result to products.

That makes the query easier to read, easier to reuse, and easier to test mentally. We separate the aggregation step from the lookup step instead of burying everything inside one large statement.

Takeaway:

A JOIN is a row-retention decision, not just a table-combination feature.

If we remember only one rule, let it be this:

ON decides who matches
join type decides who survives

That one mental model explains most join behavior.

Keep this map in mind:

INNER JOIN keeps matched rows only
LEFT JOIN keeps the left side
RIGHT JOIN keeps the right side
FULL OUTER JOIN keeps both sides
CROSS JOIN multiplies rows
SELF JOIN compares rows inside one table

And keep these two warnings next to it:

a WHERE clause after a LEFT JOIN can quietly remove rows we meant to keep
joins should preserve the grain we care about, especially in fact-to-dimension models

Challenge:

Take the sales, products, and customers tables and write three versions of the same query:

an INNER JOIN version
a LEFT JOIN version that preserves all sales rows
a CTE version that first aggregates sales by product, then joins to products

Compare the row counts and explain which rows disappear in each version. That comparison is where join types stop feeling abstract and start feeling operational.