mteq.pro

April 16, 2026

How SQL Indexes Work (And How to Optimize Them)

Without an index, every lookup reads the entire table. This post explains how indexes work, when to create them, how non-SARGable queries break index usage, and how to read a query plan to confirm performance improvements.

An index is a data structure the database builds and maintains alongside a table. When a query filters on an indexed column, the database jumps directly to the matching rows instead of reading every row from top to bottom.

Without an index on a column used in WHERE, the database performs a full table scan: every row is read, every row is evaluated, and only the matching rows are returned. On a table with a few thousand rows this is fast enough. On a table with ten million rows it becomes the difference between a millisecond query and a query that times out.

Let's use one mental model.

Think of a printed book with no back-of-book index. Finding every mention of "customer churn" means reading the entire book page by page. A proper index lets us jump directly to pages 43, 107, and 291. SQL indexes work the same way: the database builds the lookup structure once, and then uses it for every query that follows.

Without index on customer_id:

orders table: row 1, row 2, row 3 ... row 10,000,000
DB reads: all 10 million rows
Returns: rows where customer_id = 42

With index on customer_id:

index: customer 42 -> positions 3901, 7204, 9112
DB reads: 3 rows directly
Returns: rows where customer_id = 42

That is the core difference.
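This difference is easy to see for ourselves. Here is a small sketch using Python's built-in sqlite3 module, whose EXPLAIN QUERY PLAN output distinguishes a full SCAN from an index SEARCH (the toy table and the index name idx_orders_customer mirror this post; exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 100) for i in range(1000)])

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row describes the step
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)
print(before)  # e.g. "SCAN orders": full table scan, no index available

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
after = plan(query)
print(after)   # e.g. "SEARCH orders USING INDEX idx_orders_customer (customer_id=?)"
```

The same query text goes from reading every row to seeking directly into the index, with no change to the query itself.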

Indexes are useful for:

  • WHERE filters: "find all orders for customer 42" without scanning the full table
  • JOIN operations: matching rows across two tables on a shared key column
  • ORDER BY: returning pre-sorted results without an additional sort pass
  • large table lookups where a full scan grows slower as the data grows

Example:

Let's build one orders table and walk through index creation, covering indexes, SARGability, and query planning step by step.

orders
+----------+-------------+---------------------+--------+
| order_id | customer_id | order_timestamp     | amount |
+----------+-------------+---------------------+--------+
| 1        | 42          | 2026-01-01 08:00:00 | 150    |
| 2        | 15          | 2026-01-01 09:30:00 | 80     |
| 3        | 42          | 2026-01-02 11:00:00 | 220    |
| 4        | 99          | 2026-01-03 14:20:00 | 50     |
| 5        | 15          | 2026-01-04 10:00:00 | 310    |
+----------+-------------+---------------------+--------+
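To make the rest of the walkthrough reproducible, here is the five-row sample loaded into an in-memory SQLite database (ISO-8601 timestamp strings are used because they sort correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id        INTEGER PRIMARY KEY,
        customer_id     INTEGER,
        order_timestamp TEXT,   -- ISO-8601 strings compare in time order
        amount          INTEGER
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 42, "2026-01-01 08:00:00", 150),
     (2, 15, "2026-01-01 09:30:00", 80),
     (3, 42, "2026-01-02 11:00:00", 220),
     (4, 99, "2026-01-03 14:20:00", 50),
     (5, 15, "2026-01-04 10:00:00", 310)])

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # 5
```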

Creating an index

A single-column index on customer_id looks like this:

CREATE INDEX idx_orders_customer
ON orders(customer_id);

After running this, the database builds a B-Tree structure keyed on customer_id. A B-Tree is a sorted tree in which each node branches out to many children, so every step of a lookup narrows the search by a large factor; even on millions of rows, reaching any value takes only a handful of page reads rather than a scan of all of them. Every subsequent query that filters or joins on customer_id can use that structure to skip the full scan.

A composite index covers two or more columns together:

CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_timestamp);

This supports queries that filter on customer_id alone, or on customer_id together with order_timestamp. It does not help a query that filters on order_timestamp alone, because the index is sorted by customer_id first. This is the leading-column rule: the index is only usable when the query filters on a prefix of its column list, starting from the left.
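The leading-column rule can be checked directly. In this SQLite sketch, the composite index serves a filter on the leading column but not a filter on the trailing column alone (plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, "
             "order_timestamp TEXT, amount INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer_date "
             "ON orders(customer_id, order_timestamp)")

def plan(sql):
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

leading = plan("SELECT * FROM orders WHERE customer_id = 42")
trailing = plan("SELECT * FROM orders WHERE order_timestamp >= '2026-01-01'")
print(leading)   # SEARCH using idx_orders_customer_date: leading column present
print(trailing)  # plain SCAN: the index cannot be entered at its second column
```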

Use an index when:

  • the column appears frequently in WHERE conditions
  • the column is a join key between two tables
  • the dataset is large and full scans are visibly slow
  • the column has high cardinality, meaning many distinct values

Cardinality: when an index helps and when it does not

An index is most effective on columns with many distinct values. On columns with very few distinct values, the database may decide a full scan is faster than going through an index pointing to half the table.

High cardinality (index helps):      Low cardinality (index wastes space):
email                                 gender
------                                --------
user1@example.com                     M
user2@example.com                     F
user3@example.com                     M
... 10 million unique values          2 values total

An index on email lets the database find one specific row in milliseconds. An index on gender would point to five million rows for every lookup. Query planners detect this automatically and ignore the index when it is not efficient, but the index still occupies disk space and adds write overhead.
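Before creating an index, a quick way to gauge cardinality is to compare distinct values to total rows. This sketch uses toy data and an illustrative ratio, not a hard threshold:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, gender TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(f"user{i}@example.com", "MF"[i % 2]) for i in range(1000)])

def selectivity(column):
    # ratio of distinct values to rows: near 1.0 is a strong index candidate
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM users").fetchone()
    return distinct / total

print(selectivity("email"))   # 1.0   -> every value unique, index helps
print(selectivity("gender"))  # 0.002 -> two values, an index rarely pays off
```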

Cardinality tells us whether an index is worth creating. The next question is which columns to put inside it.

Covering indexes

A covering index includes every column a query needs. When an index covers the entire query, the database answers it without touching the main table at all, which removes the second lookup step.

Use a covering index when:

  • a query reads only a small subset of columns
  • the query runs very frequently and every millisecond matters
  • we need to eliminate the extra lookup from the index into the main table

This query reads two columns and filters on both:

SELECT customer_id, order_timestamp
FROM orders
WHERE customer_id = 42 AND order_timestamp >= '2026-01-01'
ORDER BY order_timestamp;

The index we already created covers it:

CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_timestamp);

Because both customer_id and order_timestamp are inside the index, the database reads the index and returns the result without opening the main table.

Result:

+-------------+---------------------+
| customer_id | order_timestamp     |
+-------------+---------------------+
| 42          | 2026-01-01 08:00:00 |
| 42          | 2026-01-02 11:00:00 |
+-------------+---------------------+
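SQLite makes the covering behavior visible in the plan itself: when the main table is never opened, EXPLAIN QUERY PLAN says COVERING INDEX. A sketch with the same index and query shape as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "order_timestamp TEXT, amount INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer_date "
             "ON orders(customer_id, order_timestamp)")

plan = " ".join(r[-1] for r in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, order_timestamp FROM orders "
    "WHERE customer_id = 42 AND order_timestamp >= '2026-01-01' "
    "ORDER BY order_timestamp"))
print(plan)  # SEARCH ... USING COVERING INDEX idx_orders_customer_date ...
```

Because every column the query touches lives inside the index, the plan never mentions the table's own rows, and the ORDER BY is satisfied by the index order with no separate sort step.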

Having the right index is only half the equation. A poorly written query can prevent the database from using it at all.

SARGability

A query is SARGable, short for Search ARGument Able, when the WHERE clause allows the database to use an index. A query is not SARGable when a function or transformation is applied to the indexed column directly, forcing the database to evaluate every row before it can compare values.

This is one of the most common performance mistakes in production SQL, and one of the most frequently tested concepts in interviews.

The rule is simple: never wrap an indexed column inside a function. Place any transformation on the literal value instead.

SARGable, uses the index:

SELECT *
FROM orders
WHERE order_timestamp >= '2026-01-01'
  AND order_timestamp < '2026-01-02';

Not SARGable, forces a full scan:

SELECT *
FROM orders
WHERE DATE(order_timestamp) = '2026-01-01';

The second query applies DATE() to every row in the table before comparing. The index on order_timestamp stores the full timestamp values. Because the indexed values and the transformed values are different things, the database cannot use the index. It reads all five rows, applies the function, and then checks the result.

Both queries return the same rows:

+----------+-------------+---------------------+--------+
| order_id | customer_id | order_timestamp     | amount |
+----------+-------------+---------------------+--------+
| 1        | 42          | 2026-01-01 08:00:00 | 150    |
| 2        | 15          | 2026-01-01 09:30:00 | 80     |
+----------+-------------+---------------------+--------+

On five rows the difference is invisible. On ten million rows the SARGable version takes milliseconds and the non-SARGable version may take minutes.
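The same contrast is observable in SQLite, which spells the function date(). Wrapping the column forces a scan; the half-open range seeks into the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "order_timestamp TEXT, amount INTEGER)")
conn.execute("CREATE INDEX idx_orders_timestamp ON orders(order_timestamp)")

def plan(sql):
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

wrapped = plan("SELECT * FROM orders "
               "WHERE date(order_timestamp) = '2026-01-01'")
ranged = plan("SELECT * FROM orders "
              "WHERE order_timestamp >= '2026-01-01' "
              "AND order_timestamp < '2026-01-02'")
print(wrapped)  # plain SCAN: the function hides the column from the index
print(ranged)   # SEARCH ... USING INDEX idx_orders_timestamp ...
```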

Other common non-SARGable patterns to avoid:

-- function on column breaks the index
WHERE UPPER(email) = 'USER@EXAMPLE.COM'

-- implicit type mismatch breaks the index (customer_id is INTEGER)
WHERE customer_id = '42'

-- arithmetic on the column breaks the index
WHERE amount * 1.1 > 100

Rewrite the last one by moving the math to the other side:

WHERE amount > 100.0 / 1.1  -- equivalent to amount > 90.909...
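A quick sanity check that the rewrite is equivalent: the multiplier is positive, so dividing both sides preserves the inequality for every amount.

```python
# spot-check the algebraic rewrite on a few values around the boundary
threshold = 100.0 / 1.1
for amount in [50, 90, 91, 100, 150]:
    assert (amount * 1.1 > 100) == (amount > threshold)
print(round(threshold, 3))  # 90.909
```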

Writing SARGable queries is the right move, but how do we confirm the database is actually using the index? That is what EXPLAIN is for.

Reading a query plan with EXPLAIN

EXPLAIN shows how the database intends to execute a query before it runs. We use it to confirm that an index is being used and to catch full table scans before they reach production.

EXPLAIN
SELECT *
FROM orders
WHERE customer_id = 42;

In PostgreSQL, the output looks roughly like this:

Index Scan using idx_orders_customer on orders
  (cost=0.29..8.31 rows=2 width=36)
  Index Cond: (customer_id = 42)

This confirms the planner is using our index. The cost=0.29..8.31 notation shows two numbers: the first is startup cost to return the first row, and the second is total estimated cost to return all rows. If we see Seq Scan instead, the database is reading every row. That is the signal to check whether an index exists and whether the WHERE clause is SARGable.

EXPLAIN ANALYZE runs the query and adds actual timing alongside the plan:

EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42;

Index Scan using idx_orders_customer on orders
  (cost=0.29..8.31 rows=2 width=36) (actual time=0.041..0.044 rows=2 loops=1)
  Index Cond: (customer_id = 42)
Planning Time: 0.08 ms
Execution Time: 0.06 ms

Use EXPLAIN to inspect the plan without running the query. Use EXPLAIN ANALYZE when we need to measure actual performance and confirm the fix worked.

Key things to look for in the output:

  • Seq Scan on a large table means the index is missing or the query is not SARGable
  • Index Scan confirms the index is being used for row lookup
  • Bitmap Index Scan is common when the planner combines multiple index conditions; it still uses an index but batches matching rows into a bitmap before fetching them, which reduces random I/O compared to individual row lookups
  • high cost values or high actual time point to where optimization effort belongs
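One practical use of the plan output is a regression guard: a test that fails if a critical query ever falls back to a full scan. This sketch uses SQLite's EXPLAIN QUERY PLAN; in PostgreSQL the same idea works by checking EXPLAIN output for "Seq Scan". The helper name assert_uses_index is ours, not a library function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

def assert_uses_index(conn, sql, index_name):
    # fail loudly if the plan for sql never mentions the expected index
    detail = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))
    if index_name not in detail:
        raise AssertionError(f"expected {index_name}, got plan: {detail}")

assert_uses_index(conn, "SELECT * FROM orders WHERE customer_id = 42",
                  "idx_orders_customer")
print("plan check passed")
```

Wired into a test suite, this catches a dropped index or a non-SARGable rewrite before it reaches production.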

Now let's apply all of this together: a broken query, a SARGable fix, and the index to seal it.

Optimizing a slow query end to end

Let's walk through a complete fix from problem to solution.

The slow query:

SELECT *
FROM orders
WHERE DATE(order_timestamp) = '2026-01-01';

Two problems here. The DATE() function makes it non-SARGable, and SELECT * fetches every column including ones the caller may not need. When a query requests more columns than the index holds, the database must open the main table for every matched row, which blocks any covering index optimization.

Step 1: fix the WHERE clause to make it SARGable.

SELECT order_id, customer_id, order_timestamp, amount
FROM orders
WHERE order_timestamp >= '2026-01-01'
  AND order_timestamp < '2026-01-02';

Step 2: add the index.

CREATE INDEX idx_orders_timestamp
ON orders(order_timestamp);

Result:

+----------+-------------+---------------------+--------+
| order_id | customer_id | order_timestamp     | amount |
+----------+-------------+---------------------+--------+
| 1        | 42          | 2026-01-01 08:00:00 | 150    |
| 2        | 15          | 2026-01-01 09:30:00 | 80     |
+----------+-------------+---------------------+--------+

The result is identical. The path the database took to produce it changed from a full table scan to a two-row index lookup. On large tables, this fix routinely reduces query time from minutes to milliseconds.
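We can confirm the two forms agree by running both against the five-row sample (SQLite again, where the function is date()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "order_timestamp TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 42, "2026-01-01 08:00:00", 150),
                  (2, 15, "2026-01-01 09:30:00", 80),
                  (3, 42, "2026-01-02 11:00:00", 220),
                  (4, 99, "2026-01-03 14:20:00", 50),
                  (5, 15, "2026-01-04 10:00:00", 310)])

slow = conn.execute("SELECT order_id FROM orders "
                    "WHERE date(order_timestamp) = '2026-01-01'").fetchall()
fast = conn.execute("SELECT order_id FROM orders "
                    "WHERE order_timestamp >= '2026-01-01' "
                    "AND order_timestamp < '2026-01-02'").fetchall()
print(slow == fast, fast)  # True [(1,), (2,)]
```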

Indexes solve read performance problems. On write-heavy tables, they can create new ones.

When indexes hurt

Indexes trade disk space and write overhead for read speed. Every index on a table must be updated every time a row is inserted, updated, or deleted. On write-heavy tables, too many indexes slow down every write operation.

Be cautious about adding indexes when:

  • the table is small and a full scan is already fast
  • the column has very low cardinality (boolean flags, status codes with two or three values)
  • the table receives significantly more writes than reads
  • the column never appears in WHERE, JOIN, or ORDER BY

The right number of indexes depends on the read-to-write ratio of the table and the actual query patterns hitting it.

With the trade-offs mapped out, let's close with a query that ties all of it together.

An interview pattern worth knowing

This pattern combines everything: a SARGable date filter, a composite index, aggregation, and window-based ranking. It is one of the most frequently asked optimization questions in data engineering interviews.

The question: find the top three customers by total purchase amount per month.

WITH totals AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', order_timestamp) AS month,
        SUM(amount) AS total_amount
    FROM orders
    WHERE order_timestamp >= '2026-01-01'
      AND order_timestamp < '2026-04-01'
    GROUP BY customer_id, DATE_TRUNC('month', order_timestamp)
),
ranked AS (
    SELECT
        customer_id,
        month,
        total_amount,
        RANK() OVER (
            PARTITION BY month
            ORDER BY total_amount DESC
        ) AS rn
    FROM totals
)
SELECT *
FROM ranked
WHERE rn <= 3
ORDER BY month, rn;

The index that makes the aggregation fast:

CREATE INDEX idx_orders_optim
ON orders(order_timestamp, customer_id, amount);

This composite index covers the WHERE filter on order_timestamp and includes customer_id and amount, so the database can serve the aggregation entirely from the index without reading the main table rows.

Result (using our five-row sample):

+-------------+------------+--------------+----+
| customer_id | month      | total_amount | rn |
+-------------+------------+--------------+----+
| 15          | 2026-01-01 | 390          | 1  |
| 42          | 2026-01-01 | 370          | 2  |
| 99          | 2026-01-01 | 50           | 3  |
+-------------+------------+--------------+----+

Customer 15 totals 390 (80 + 310), customer 42 totals 370 (150 + 220), and customer 99 totals 50. All five rows fall in January in our sample dataset, so there is one partition and three ranked results. On a real dataset with millions of rows across many months, the index makes the date filter fast enough that the aggregation and ranking steps never become the bottleneck.
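The same pattern can be run on the sample in SQLite, which has no DATE_TRUNC; strftime('%Y-%m', ...) stands in for the month bucket:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "order_timestamp TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 42, "2026-01-01 08:00:00", 150),
                  (2, 15, "2026-01-01 09:30:00", 80),
                  (3, 42, "2026-01-02 11:00:00", 220),
                  (4, 99, "2026-01-03 14:20:00", 50),
                  (5, 15, "2026-01-04 10:00:00", 310)])

rows = conn.execute("""
    WITH totals AS (
        SELECT customer_id,
               strftime('%Y-%m', order_timestamp) AS month,
               SUM(amount) AS total_amount
        FROM orders
        WHERE order_timestamp >= '2026-01-01'
          AND order_timestamp < '2026-04-01'
        GROUP BY customer_id, month
    ),
    ranked AS (
        SELECT customer_id, month, total_amount,
               RANK() OVER (PARTITION BY month
                            ORDER BY total_amount DESC) AS rn
        FROM totals
    )
    SELECT customer_id, month, total_amount, rn
    FROM ranked
    WHERE rn <= 3
    ORDER BY month, rn
""").fetchall()

for row in rows:
    print(row)
# (15, '2026-01', 390, 1)
# (42, '2026-01', 370, 2)
# (99, '2026-01', 50, 3)
```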

Takeaway:

An index is a lookup structure the database builds once and uses for every matching read. It trades disk space and write overhead for dramatically faster queries on large tables.

If we remember only one rule, let it be this:

  • without an index, every query reads every row
  • with a well-chosen index, the database reads only the rows it needs

Keep this map in mind:

  • single-column index for one frequently-filtered column
  • composite index when two or more columns are always filtered together
  • covering index when a query reads only a narrow column set and must run fast
  • no index when the table is small, the column has low cardinality, or writes dominate

And keep these three warnings next to it:

  • always make WHERE clauses SARGable: no functions or transforms applied to the indexed column
  • use EXPLAIN to confirm the database is actually using the index, not scanning the table
  • too many indexes on a write-heavy table will slow down inserts, updates, and deletes

Challenge:

Take an orders table with order_timestamp, customer_id, and amount. You have three slow queries:

  1. WHERE DATE(order_timestamp) = '2026-03-01'
  2. WHERE customer_id = 99
  3. SELECT customer_id, order_timestamp FROM orders WHERE customer_id = 99 ORDER BY order_timestamp

For each one: rewrite the query to be SARGable if it is not, decide what index to create, and use EXPLAIN to verify the plan changed from Seq Scan to Index Scan. Then decide whether a single covering index could serve all three queries. That is where query optimization stops being a checklist and starts being a design decision.