Designing the Right PostgreSQL Index Using Query Plans and Statistics
May 11, 2026

Originally published by Dev.to

PostgreSQL index design is often misunderstood.

Many developers think that creating a good index simply means:

“Create an index containing the columns from the WHERE clause.”

In reality, efficient index design is far more nuanced.

The order of columns inside a composite index matters enormously, and the best choice depends on:

  • Predicate types (=, >=, BETWEEN, LIKE, etc.)
  • Column selectivity
  • Table size
  • PostgreSQL planner statistics
  • Actual execution plans

This article explains the core principles behind efficient PostgreSQL index design before showing how pgAssistant automates this process using execution plans and database statistics.

Why Index Design Is Difficult

Consider the following query:

SELECT
    order_id,
    customer_id,
    employee_id,
    order_date,
    ship_country
FROM public.orders
WHERE customer_id = $1
  AND employee_id = $2
  AND order_date >= $3::date;

At first glance, several index definitions may appear reasonable:

(customer_id, employee_id, order_date)
(order_date, customer_id, employee_id)
(employee_id, customer_id, order_date)

But these indexes do not behave the same way.

Choosing the correct ordering requires understanding how PostgreSQL traverses B-Tree indexes.

The Fundamental Rule of B-Tree Indexes

For PostgreSQL B-Tree indexes:

  1. Equality predicates should come first
  2. Range predicates should come last

This is the single most important rule in multi-column index design.

Equality Predicates Are Highly Selective

Predicates such as:

=
IN (...)
IS NULL

allow PostgreSQL to navigate directly to a very precise section of the index tree.

In our query:

customer_id = $1
employee_id = $2

are equality predicates.

If the index begins with these columns:

(customer_id, employee_id, ...)

PostgreSQL can rapidly narrow the search space.

Conceptually:

customer_id = exact branch
employee_id = exact sub-branch

The planner can jump almost directly to the matching rows.

Why Range Predicates Should Come Last

Now consider:

order_date >= $3::date

This is a range predicate.

Once PostgreSQL enters a range scan inside a B-Tree index, the remaining columns become far less useful for navigation.

For example, with this index:

(order_date, customer_id, employee_id)

the planner must first scan all matching dates:

order_date >= $3::date

which may represent a very large portion of the table.

Only afterward can additional filtering occur.

This usually produces significantly more index scanning.

That is why range predicates are generally placed at the end of composite indexes.
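The effect of putting the range column first can be made concrete with a toy model: treat the index as a sorted list of key tuples, where an index scan reads one contiguous slice. This is an illustrative sketch only (real PostgreSQL B-Trees are page-based structures), with made-up row counts matching the article's statistics:

```python
import bisect
from datetime import date

# Synthetic orders: 89 customers x 9 employees x 28 order dates.
rows = [(c, e, date(2024, 1, d))
        for c in range(1, 90)
        for e in range(1, 10)
        for d in range(1, 29)]

target_date = date(2024, 1, 15)

# Index (customer_id, employee_id, order_date): equality columns first.
eq_first = sorted(rows)
lo = bisect.bisect_left(eq_first, (42, 5, target_date))
hi = bisect.bisect_right(eq_first, (42, 5, date.max))
scanned_eq_first = hi - lo  # contiguous slice for one customer/employee pair

# Index (order_date, customer_id, employee_id): range column first.
range_first = sorted((d, c, e) for c, e, d in rows)
lo = bisect.bisect_left(range_first, (target_date, 0, 0))
scanned_range_first = len(range_first) - lo  # every entry on or after the date

print(scanned_eq_first, scanned_range_first)  # 14 vs 11214
```

With equality columns first the "scan" touches 14 entries; with the range column first it touches more than 11,000, which must then be filtered further.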

Column Order Among Equality Predicates

Once equality predicates are identified, the next challenge is:

Which equality column should come first?

The answer depends on selectivity.

PostgreSQL exposes this information through planner statistics.

PostgreSQL Statistics Drive Good Index Design

For our query, PostgreSQL statistics are:

order_date [>=]:
    n_distinct=814
    null_frac=0.0000
    mcv_count=100
    histogram_bounds=101

customer_id [=]:
    n_distinct=89
    null_frac=0.0000
    mcv_count=89
    histogram_bounds=0

employee_id [=]:
    n_distinct=9
    null_frac=0.0000
    mcv_count=9
    histogram_bounds=0

The most important metric here is:

n_distinct

Understanding n_distinct

n_distinct estimates the number of distinct values in a column.

Higher n_distinct usually means:

  • higher selectivity
  • fewer matching rows
  • better filtering efficiency

In our example:

Column        n_distinct
customer_id   89
employee_id   9

customer_id is significantly more selective.

Therefore, PostgreSQL benefits more from filtering by customer_id first.

Why Selectivity Matters

Imagine the table contains 1 million rows.

Filtering by:

employee_id

may still leave:

1,000,000 / 9 ≈ 111,111 rows

Filtering by:

customer_id

may reduce the result to:

1,000,000 / 89 ≈ 11,236 rows

Starting with the most selective equality predicate drastically reduces the search space.

This improves:

  • index scan efficiency
  • cache locality
  • heap access reduction
  • execution time

The Correct Index Design

Applying these principles:

  1. Equality predicates first
  2. Most selective equality columns first
  3. Range predicates last

produces:

CREATE INDEX CONCURRENTLY
    pga_idx_orders_customer_id_employee_id_order_date
ON public.orders
    (customer_id, employee_id, order_date);

This ordering allows PostgreSQL to:

  1. Navigate efficiently using exact matches
  2. Reduce scanned rows as early as possible
  3. Apply the range scan only after narrowing the search space

Good Index Design Also Depends on Table Size

One of the biggest misconceptions about PostgreSQL optimization is:

“Indexes are always faster.”

This is false.

For small tables, PostgreSQL often prefers a Sequential Scan (Seq Scan) even when an index exists.

Why?

Because using an index has overhead:

  • traversing the B-Tree
  • reading index pages
  • performing heap lookups
  • random I/O access

For sufficiently small tables, scanning the entire table sequentially is cheaper.
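This trade-off can be sketched with a deliberately simplified cost model using PostgreSQL's default cost parameters (seq_page_cost=1.0, random_page_cost=4.0, cpu_tuple_cost=0.01). The formulas below ignore many real factors (correlation, effective_cache_size, index depth), so treat them as an illustration, not the planner's actual model:

```python
SEQ_PAGE_COST = 1.0      # default seq_page_cost
RANDOM_PAGE_COST = 4.0   # default random_page_cost
CPU_TUPLE_COST = 0.01    # default cpu_tuple_cost

def seq_scan_cost(pages: int, rows: int) -> float:
    # Read every page sequentially, touch every tuple.
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows: int, index_pages: int = 3) -> float:
    # Crude model: a few index page reads plus one random heap fetch per row.
    return index_pages * RANDOM_PAGE_COST + matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST)

# Tiny table: 5 pages, 400 rows, predicate matches 40 of them.
print(seq_scan_cost(5, 400), index_scan_cost(40))          # 9.0 vs 172.4
# Large table: 20,000 pages, 1M rows, predicate matches 100.
print(seq_scan_cost(20_000, 1_000_000), index_scan_cost(100))  # 30000.0 vs 413.0
```

On the tiny table the sequential scan is an order of magnitude cheaper; on the large one the index wins decisively.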

Query Plans Matter More Than Theory

A theoretically perfect index is useless if PostgreSQL never uses it.

That is why index recommendation engines should never rely only on SQL syntax.

They must also inspect:

  • execution plans
  • estimated costs
  • table statistics
  • row estimates
  • planner decisions

The Importance of Execution Plans

The execution plan reveals how PostgreSQL actually executes a query.

For example:

EXPLAIN ANALYZE
SELECT ...

may show:

Seq Scan on orders

or:

Index Scan using ...

This distinction is critical.

A query may contain filter predicates that look index-friendly, but PostgreSQL may correctly determine that:

  • the table is too small
  • selectivity is too low
  • too many rows would still be scanned

and therefore prefer a sequential scan.

How pgAssistant Recommends Indexes

pgAssistant does not simply parse SQL queries.

It combines multiple sources of information:

1. Query Plan

pgAssistant analyzes nodes in the query plan to identify candidate index columns.

2. Predicate Types

It classifies predicates into categories:

Equality predicates

=
IN
IS NULL

Range predicates

>
>=
<
<=
BETWEEN
LIKE 'prefix%'

Equality predicates are prioritized before range predicates.

3. Column Statistics

pgAssistant uses PostgreSQL planner statistics such as:

  • n_distinct
  • null_frac
  • most_common_vals
  • most_common_freqs
  • histogram_bounds

to estimate column selectivity.

Columns with higher selectivity are prioritized earlier in the index definition.

4. Table Statistics

pgAssistant also evaluates table-level statistics, including:

  • estimated row counts
  • table size
  • planner cost estimates

This is extremely important because some tables are simply too small to justify an index.

In these cases, recommending an index would create unnecessary maintenance overhead without improving performance.

Inside pgAssistant's Recommendation Engine

Why Query Syntax Alone Is Not Enough

A good recommendation depends on:

  • predicate types
  • column selectivity
  • table statistics
  • planner estimates
  • execution plans
  • existing indexes already used by PostgreSQL

This is precisely the approach implemented by pgAssistant.

pgAssistant Uses Execution Plans First

pgAssistant starts from the PostgreSQL execution plan.

It analyzes:

EXPLAIN (ANALYZE, FORMAT JSON)

This is extremely important because the execution plan reveals:

  • whether PostgreSQL uses a Seq Scan
  • whether an Index Scan already exists
  • whether residual filtering still occurs after index access
  • whether planner row estimations are inaccurate
  • whether a composite index could reduce heap filtering

This avoids many false-positive recommendations.

A query may look index-friendly while PostgreSQL is already using the optimal access path.

pgAssistant Analyzes Existing Access Paths

The advisor first determines how PostgreSQL currently accesses the table.

Examples:

Seq Scan
Index Scan
Index Only Scan
Bitmap Heap Scan

This distinction is critical.

Sequential Scan Case

If PostgreSQL performs a Seq Scan, pgAssistant evaluates whether an index could realistically improve performance.

Indexed Access Case

If PostgreSQL already uses an index, pgAssistant does not stop there.

It also analyzes:

  • Index Cond
  • Filter
  • Recheck Cond
  • Rows Removed by Filter

This allows pgAssistant to detect situations such as:

Index Scan using idx_customer on orders
  Index Cond: (customer_id = 42)
  Filter: (employee_id = 5)

In this case, PostgreSQL uses an index, but still visits many rows that are later discarded by the executor.

pgAssistant can therefore recommend a more selective composite index such as:

(customer_id, employee_id)

instead of considering the existing index “good enough”.
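The residual-filter check can be sketched from the plan node's JSON fields. The field names ("Node Type", "Index Cond", "Filter", "Actual Rows", "Rows Removed by Filter") are the ones PostgreSQL emits in EXPLAIN (ANALYZE, FORMAT JSON); the function itself is illustrative, not pgAssistant's actual code:

```python
from typing import Optional

def residual_filter_ratio(node: dict) -> Optional[float]:
    """Fraction of index-visited tuples kept after the residual Filter."""
    if "Filter" not in node or "Index Cond" not in node:
        return None  # no residual filtering on top of the index condition
    kept = node.get("Actual Rows", 0)
    removed = node.get("Rows Removed by Filter", 0)
    visited = kept + removed
    return kept / visited if visited else None

node = {
    "Node Type": "Index Scan",
    "Index Cond": "(customer_id = 42)",
    "Filter": "(employee_id = 5)",
    "Actual Rows": 12,
    "Rows Removed by Filter": 988,
}
ratio = residual_filter_ratio(node)
print(f"Residual filter kept {ratio:.1%} of tuples visited")  # kept 1.2%
```

A ratio this low signals that most tuples fetched through the index are thrown away, i.e. the index is missing a column.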

Predicate Classification

Once predicates are extracted from the execution plan, pgAssistant classifies them by operator type.

Internally, predicates are ranked according to B-Tree efficiency.

Equality Predicates

Highest priority:

=
IN
IS NULL

These predicates allow PostgreSQL to navigate directly to a very small portion of the index tree.

Prefix Search Predicates

Second priority:

LIKE 'abc%'

Prefix searches can still benefit efficiently from B-Tree traversal.

Range Predicates

Lowest priority:

>
>=
<
<=
BETWEEN

Once PostgreSQL enters a range scan, subsequent columns become far less effective for index navigation.

That is why range predicates are typically placed last in composite indexes.

How pgAssistant Orders Index Columns

After classifying predicates, pgAssistant computes candidate index ordering.

The internal ordering logic is:

1. Equality predicates first
2. Prefix predicates second
3. Range predicates last
4. Inside each category:
   highest cardinality first

This logic is implemented directly inside:

reorder_index_candidate_columns()

The advisor therefore builds indexes that align with PostgreSQL B-Tree traversal behavior.
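The ordering rule above (equality, then prefix, then range; highest cardinality first within each class) can be sketched as a single sort. This is an illustrative reimplementation, not pgAssistant's actual `reorder_index_candidate_columns()`:

```python
# Rank predicate classes by B-Tree friendliness.
PREDICATE_RANK = {"equality": 0, "prefix": 1, "range": 2}

def order_index_columns(candidates):
    """candidates: list of (column, predicate_class, n_distinct) tuples."""
    return [
        col
        for col, _, _ in sorted(
            candidates,
            # Sort by predicate class first, then by cardinality descending.
            key=lambda c: (PREDICATE_RANK[c[1]], -c[2]),
        )
    ]

cols = order_index_columns([
    ("order_date", "range", 814),
    ("employee_id", "equality", 9),
    ("customer_id", "equality", 89),
])
print(cols)  # ['customer_id', 'employee_id', 'order_date']
```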

Why Column Cardinality Matters

pgAssistant uses PostgreSQL statistics to estimate selectivity.

The most important metric is:

n_distinct

which estimates the number of distinct values in a column.

Higher cardinality usually means:

  • fewer matching rows
  • better filtering
  • smaller index scan ranges

For our example:

customer_id [=]: n_distinct=89
employee_id [=]: n_distinct=9
order_date [>=]: n_distinct=814

Although order_date has the highest cardinality, it is a range predicate and therefore placed last.

Among the equality predicates, customer_id is placed before employee_id because its cardinality is higher (89 distinct values versus 9).

The final ordering becomes:

(customer_id, employee_id, order_date)

pgAssistant Uses PostgreSQL Statistics

pgAssistant enriches every recommendation using planner statistics extracted from PostgreSQL.

Examples include:

  • n_distinct
  • null_frac
  • most_common_vals
  • most_common_freqs
  • histogram_bounds

The advisor even exposes these statistics in its recommendation reasoning.

Example:

customer_id [=]: n_distinct=89, null_frac=0.0000
employee_id [=]: n_distinct=9, null_frac=0.0000
order_date [>=]: n_distinct=814

This makes the recommendation transparent and explainable.
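These per-column statistics come from PostgreSQL's pg_stats view and can be inspected directly. A query along these lines (run against a database where ANALYZE has been executed) returns the numbers shown above:

```sql
SELECT attname,
       n_distinct,
       null_frac,
       most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'orders'
  AND attname IN ('customer_id', 'employee_id', 'order_date');
```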

pgAssistant Also Uses Table Statistics

An index is not always beneficial.

This is one of the most important concepts in PostgreSQL optimization.

For small tables, PostgreSQL often correctly prefers:

Seq Scan

instead of:

Index Scan

because:

  • sequential reads are cheap
  • index traversal has overhead
  • heap fetches introduce random I/O
  • scanning the entire table may cost less

This is why pgAssistant also evaluates:

  • estimated row counts
  • table size
  • planner costs
  • execution frequency
  • workload intensity

The advisor does not blindly recommend indexes whenever a sequential scan appears.

Detecting Inefficient Indexed Access

One particularly powerful aspect of pgAssistant is its ability to analyze residual filtering.

For indexed scans, the advisor evaluates:

Rows Removed by Filter

If PostgreSQL retrieves many tuples from the index only to discard them afterward, pgAssistant detects that the current index may be incomplete.

Internally, the advisor computes:

Residual filter kept X% of tuples visited

This helps identify situations where adding an additional column to a composite index could drastically reduce heap filtering.

Detecting Planner Estimation Problems

pgAssistant also compares:

plan_rows
vs
actual_rows

Large estimation gaps may indicate:

  • stale statistics
  • data skew
  • correlation issues
  • missing extended statistics

This additional analysis improves the reliability of recommendations.
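A minimal version of this check compares the two row counts and flags large gaps. The 10x threshold here is an illustrative choice, not a PostgreSQL or pgAssistant constant:

```python
def misestimated(plan_rows: int, actual_rows: int, factor: float = 10.0) -> bool:
    """True if estimate and reality differ by at least `factor` in either direction."""
    lo, hi = sorted((max(plan_rows, 1), max(actual_rows, 1)))
    return hi / lo >= factor

print(misestimated(plan_rows=120, actual_rows=98))     # False: close enough
print(misestimated(plan_rows=50, actual_rows=25_000))  # True: stats look stale
```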

The pgAssistant Recommendation Algorithm

Conceptually, pgAssistant follows this workflow:

1. Analyze EXPLAIN ANALYZE JSON plan
2. Detect access paths
3. Extract predicates from:
   - Index Cond
   - Filter
   - Recheck Cond
4. Classify predicates by operator type
5. Rank predicates for B-Tree efficiency
6. Use PostgreSQL statistics to estimate selectivity
7. Order columns by:
   - predicate class
   - cardinality
8. Evaluate table statistics
9. Evaluate execution costs
10. Detect residual filtering
11. Compare against existing indexes
12. Recommend index only if beneficial

Example: Final Recommendation

Given:

SELECT
    order_id,
    customer_id,
    employee_id,
    order_date,
    ship_country
FROM public.orders
WHERE customer_id = $1
  AND employee_id = $2
  AND order_date >= $3::date;

pgAssistant evaluates:

Column        Predicate  Priority  n_distinct
customer_id   =          Equality  89
employee_id   =          Equality  9
order_date    >=         Range     814

The advisor therefore generates:

CREATE INDEX CONCURRENTLY
    "pga_idx_orders_customer_id_employee_id_order_date"
ON "public"."orders"
    ("customer_id", "employee_id", "order_date");

This ordering follows PostgreSQL B-Tree optimization principles while also considering:

  • planner statistics
  • table characteristics
  • execution plan behavior
  • residual filtering
  • existing indexes already in use

Live demo

A public demo is available here:

https://ov-004f8b.infomaniak.ch/

Demo connection:

postgresql://postgres:demo@demo-db:5432/northwind

The public demo intentionally runs without AI.


Feedback welcome

The project is still evolving and many parts can certainly be improved.

If you work with PostgreSQL and have ideas, feedback, or criticisms, feel free to open an issue or discussion on GitHub.

Thanks for reading.
