SQL Set Operators: UNION, INTERSECT, and EXCEPT

A set operator combines the results of two SELECT queries into one result set. Instead of joining two tables on a shared key, we stack their results vertically and let the operator decide which rows to keep.

SQL has four set operators: UNION ALL, UNION, INTERSECT, and EXCEPT. Each one answers a different question about the relationship between two datasets.

Before we use any of them, three rules must hold for both queries:

the same number of columns
compatible data types in each matching column position
the column order must match across both queries

The database uses the column names from the first query in the result. The second query's column names are ignored, so naming columns clearly in the first query matters.

Let's use one mental model.

Think of two overlapping circles, the classic Venn diagram. One circle is all the orders from 2025. The other is all the orders from 2026. Some customers appear in both. Some appear in only one. Set operators are the controls that decide which parts of those circles end up in our result.

 2025 only  │  customer 7
            │
 both years │  customer 42
            │  customer 15
            │  customer 99
            │
 2026 only  │  customer 33
            │  customer 88

UNION ALL  → all rows, duplicates kept
UNION      → unique rows only
INTERSECT  → overlap only (both years)
EXCEPT     → one side, minus the overlap

Set operators are useful for:

merging datasets from multiple time periods or sources without a shared key column
checking data consistency between two environments, such as production and staging
detecting records that are present in one dataset but missing in another
deduplicating results that come from more than one query

Example:

Let's build two tables and walk through each operator with the same data throughout.

orders_2025
+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
| 1        | 42          | 150    |
| 2        | 15          | 80     |
| 3        | 99          | 320    |
| 4        | 7           | 200    |
| 5        | 42          | 90     |
+----------+-------------+--------+

orders_2026
+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
| 6        | 42          | 410    |
| 7        | 33          | 450    |
| 8        | 15          | 175    |
| 9        | 88          | 60     |
| 10       | 99          | 280    |
+----------+-------------+--------+

Customers 42, 15, and 99 appear in both years. Customer 7 is only in 2025. Customers 33 and 88 are only in 2026. Every result below traces back to those relationships.

UNION ALL

UNION ALL merges both result sets and keeps every row, including duplicates. It adds no deduplication step, which makes it the fastest of the four operators.

Use UNION ALL when:

we need every row from both sources and duplicates are expected or acceptable
we are stacking time-partitioned tables, such as monthly or yearly datasets
we are building an incremental dataset where each source is a distinct batch
performance matters and we know the data is already clean or deduplication happens downstream

This query combines every row from both tables:

SELECT order_id, customer_id, amount
FROM orders_2025

UNION ALL

SELECT order_id, customer_id, amount
FROM orders_2026;

Result:

+----------+-------------+--------+
| order_id | customer_id | amount |
+----------+-------------+--------+
| 1        | 42          | 150    |
| 2        | 15          | 80     |
| 3        | 99          | 320    |
| 4        | 7           | 200    |
| 5        | 42          | 90     |
| 6        | 42          | 410    |
| 7        | 33          | 450    |
| 8        | 15          | 175    |
| 9        | 88          | 60     |
| 10       | 99          | 280    |
+----------+-------------+--------+

All ten rows appear. Customer 42 appears three times and customer 99 appears twice because they placed orders in both years. UNION ALL does not evaluate whether those rows are duplicates. It appends everything from the second query directly after the first.

UNION ALL is the default choice for stacking partitioned data. If we need to remove duplicates from the combined result, the next operator does that in one step.

UNION

UNION works exactly like UNION ALL, except it removes duplicate rows from the combined result. Internally, the database performs a DISTINCT over the entire output, which means it reads all rows from both queries before removing duplicates. That extra step makes UNION slower than UNION ALL.

Use UNION when:

we are merging two lists that may contain the same values
we need one clean list of unique values from multiple sources
deduplication correctness matters more than raw speed

This query returns a unique list of customer IDs seen across both years:

SELECT customer_id
FROM orders_2025

UNION

SELECT customer_id
FROM orders_2026;

Result (order not guaranteed without ORDER BY):

+-------------+
| customer_id |
+-------------+
| 7           |
| 15          |
| 33          |
| 42          |
| 88          |
| 99          |
+-------------+

Six unique customer IDs appear, even though the two tables together contain ten rows and five distinct customer IDs appear more than once. UNION collapses every repeated value down to one output row.

UNION tells us what exists on either side of the comparison. INTERSECT tells us what exists on both sides at the same time.

INTERSECT

INTERSECT returns only the rows that appear in both queries. A row must exist in both result sets to make it into the output.

Use INTERSECT when:

we need to find records that are confirmed in two systems or datasets
we are checking whether two tables share a common set of values
we are validating that a sync or migration transferred all expected rows

This query finds customers who placed orders in both 2025 and 2026:

SELECT customer_id
FROM orders_2025

INTERSECT

SELECT customer_id
FROM orders_2026;

Result:

+-------------+
| customer_id |
+-------------+
| 15          |
| 42          |
| 99          |
+-------------+

Only customers 42, 15, and 99 appear in both tables. Customer 7 is in 2025 only, and customers 33 and 88 are in 2026 only. Those four customer IDs are excluded because they do not satisfy the "exists in both" condition.

Notice that customer 42 appears twice in orders_2025 but only once in the result. Like UNION, INTERSECT removes duplicates from its output — if a value appears multiple times in one table, it still appears only once in the result.

INTERSECT finds the common territory. EXCEPT finds what one side has that the other does not.

EXCEPT

EXCEPT returns rows from the first query that do not appear in the second query. Think of it as subtraction: take the left side and remove everything that also exists on the right side.

One compatibility note: Oracle uses the keyword MINUS instead of EXCEPT. MySQL added native EXCEPT support in version 8.0.31. On older MySQL versions, the same result requires a LEFT JOIN ... WHERE b.id IS NULL pattern.

Use EXCEPT when:

we need to detect records that are present in one source but missing from another
we are comparing two snapshots of the same data to find what was added or removed
we are reconciling two systems and need to find the gap

This query finds customers who placed orders in 2026 but not in 2025:

SELECT customer_id
FROM orders_2026

EXCEPT

SELECT customer_id
FROM orders_2025;

Result:

+-------------+
| customer_id |
+-------------+
| 33          |
| 88          |
+-------------+

Customers 33 and 88 appear in orders_2026 but not in orders_2025. Those are customers acquired in 2026.

Reversing the query order gives us the other direction: customers who bought in 2025 but not in 2026, which means customers we lost.

SELECT customer_id
FROM orders_2025

EXCEPT

SELECT customer_id
FROM orders_2026;

Result:

+-------------+
| customer_id |
+-------------+
| 7           |
+-------------+

Customer 7 placed orders in 2025 but not in 2026. The order of the two queries in EXCEPT controls which direction the subtraction goes. Swapping them produces a different result.

With all four operators understood, the practical side is putting them together to answer a real engineering question.

Data reconciliation in practice

One of the most common tasks in data engineering is detecting drift between two copies of the same data: a production database and a staging replica, two versions of a table after a migration, or a source system and a data warehouse.

Imagine we are validating a migration. We have the original orders_2025 and a migrated copy called orders_2025_migrated. The migrated table is not part of our worked dataset above — this section shows the pattern rather than a computed result, because the value is in the structure, not the specific rows.

Find rows in the original that are missing from the migrated copy:

SELECT order_id, customer_id, amount
FROM orders_2025

EXCEPT

SELECT order_id, customer_id, amount
FROM orders_2025_migrated;
-- expected: zero rows if migration is complete

An empty result means every row in the original made it through. Any rows in the output are data that was lost during the migration.

Find rows in the migrated copy that do not exist in the original:

SELECT order_id, customer_id, amount
FROM orders_2025_migrated

EXCEPT

SELECT order_id, customer_id, amount
FROM orders_2025;
-- expected: zero rows if no phantom rows were introduced

An empty result means no unexpected rows were introduced. Any rows in the output appeared during or after the migration and need investigation.

Confirm the core dataset survived intact:

SELECT order_id, customer_id, amount
FROM orders_2025

INTERSECT

SELECT order_id, customer_id, amount
FROM orders_2025_migrated;
-- expected: same row count as orders_2025 if migration was lossless

Running all three queries in sequence gives a complete picture: what was lost, what was added, and what arrived intact. This three-query reconciliation pattern is one of the most practical uses of set operators in production data engineering.

Takeaway:

A set operator combines the results of two SELECT queries vertically. The operator controls which rows from that combined set appear in the final output.

One map to keep in mind:

UNION ALL: every row from both queries, duplicates included, no deduplication cost
UNION: every unique row from both queries, duplicates removed with an implicit DISTINCT
INTERSECT: only rows that appear in both queries
EXCEPT: only rows from the first query that do not appear in the second

Four rules apply to all four operators:

same number of columns in both queries
compatible data types in matching column positions
columns in the same order across both queries
column names come from the first query — the second query's column names are ignored

And one performance note: prefer UNION ALL when duplicates are acceptable or already handled upstream. Reserve UNION for when deduplication is required, since it adds an extra sorting or hashing pass over the entire combined result.

Challenge:

You have two tables: customers_prod and customers_staging, each with columns customer_id, email, status.

Write a query to find customers present in production but missing from staging. Those are records that were not migrated.
Write a query to produce one unified list of all customer IDs across both tables, with each ID appearing only once.
Write a query to find customers whose email or status differs between the two tables. Hint: include all three columns in the comparison. A mismatch means a row exists in one table that does not have an identical match in the other — which operator returns what is in one side but not both?

For each query, choose the operator that fits and write one sentence explaining why you chose it over the others.