Increase SQL performance with partial indices

6/19/16

Finding the right index to a SQL query is not only important for performance improvements, it is also relevant to reduce the memory on disc for every created index of every table of the project's databases. Sometimes there are multiple indices on one table, that can be covered with only one smart index, which can improve INSERTS too. The right index is a guarantee for fast reads without slowing down writes too much. Usually we come in contact with performance problems when the amount of data is rising. With more and more rows to read, a review of index-columns and queries can be as valuable as a code review, but sometimes also the index itself should be reviewed by the team.

The information and index we need


Let's assume we have a simple table containing payment status of a shop that we need to access to for further information about customer purchases. Every row contains the payment status from "pending", for an open invoice, up to "complete", when the payment was successful. For a simple example in PostgreSQL, the table looks like this:
CREATE TABLE payment_normal (
    id serial,
    name varchar(255),
    payment_status varchar(10)
);
In the backend of our shop, we want to see all rows of that table, that has status "pending", so we can observe how many open payments are still there. If we want to have fast select-queries to see all rows that are on "pending" status, we would create an index like this:
CREATE INDEX payment_status_normal ON payment_normal (payment_status);
Now we are able to get all rows we need fast, even when payment_normal will reach a high amount of rows because of the shop's past "complete" payments.
SELECT * FROM payment_normal WHERE payment_status = 'pending';
It's easy, right? We have millions of rows and still get a fast lookup on payment_normal.payment_status for the payment information we need.

When you think about it, we have millions of rows that are always on "complete" but only a handful of rows that are on pending or another status. For just these few interesting rows, the chosen index takes a lot of memory and even with a BTREE, the lookup through an index tree takes some time too. Every inserted or changed index relevant column will also trigger changes in the index tree, that takes some time on a huge amount of rows. Actually, we need an index only for the relevant payment status which are not on status "complete".

Introducing: partial indices


A partial index only affects data of a chosen condition, that was defined on an index creating alter statement. With only a subset of entries in the index, it will take less memory on your harddrive and the lookup for a query is going to be faster. This is not only an optimization for reading rows from a table, write processes also take less time because there are less index entries to update. Of course, if your query doesn't fit to the predicate of your partial index, your query will gain no advantages from it and a full table scan will be done.

Building a predicate to create a partial index is as simple as writing a where condition in your select-query. For a later comparison we create a new table:
CREATE TABLE payment_partial (
    id serial,
    name varchar(255),
    payment_status varchar(10)
);

CREATE INDEX payment_status_partial ON payment_partial (payment_status) WHERE payment_status != 'complete';
The chosen condition is, that only those status of payment_status will be added to the index, that don't contain "complete" as value. Any other payment_status will be fine.

Now we have two similar tables, one with a regular index and one with a partial index. It's time we compare them after inserting the same data into both tables: one million rows where the payment_status is "complete" and 20 rows having status "pending". We analyse a simple select-query with EXPLAIN for both tables selecting all the "pending" payment status:

Table with regular index
EXPLAIN SELECT * FROM foo_normal WHERE payment_status = 'pending';

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using payment_status_normal on foo_normal  (cost=0.42..9.00 rows=33 width=28)
   Index Cond: ((payment_status)::text = 'pending'::text)
Table with partial index
EXPLAIN SELECT * FROM foo_partial WHERE payment_status = 'pending';

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Index Scan using payment_status_partial on foo_partial  (cost=0.12..4.14 rows=1 width=29)
   Index Cond: ((payment_status)::text = 'pending'::text)
When we compare the cost of both execution plans, the partial index must be the faster one, but to get the real execution time, we have to execute both queries with EXPLAIN ANALYZE. This is like a simple explain-query that additionally also runs the query. It is a real time analysis that needs to be used with caution when it comes to row-changing queries like updates or deletes.

Table with regular index
EXPLAIN ANALYZE SELECT * FROM foo_normal WHERE payment_status = 'pending';

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using payment_status_normal on foo_normal  (cost=0.42..9.00 rows=33 width=28)
 (actual time=0.036..0.045 rows=20 loops=1)
   Index Cond: ((payment_status)::text = 'pending'::text)
Table with partial index
EXPLAIN ANALYZE SELECT * FROM foo_partial WHERE payment_status = 'pending';

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Index Scan using payment_status_partial on foo_partial  (cost=0.12..4.14 rows=1 width=29)
(actual time=0.029..0.039 rows=20 loops=1)
   Index Cond: ((payment_status)::text = 'pending'::text)
The difference in time isn't as big as in the cost, but one million rows aren't really that much for a database. Usually a data relevant project has much more rows and even with this small data amount, we can see how a partial index can improve your queries.

What about MySQL?


If you're a developer who is working with MySQL only, I have to disappoint you. Partial indices don't exist in the latest version of MySQL. Actually, there is a partial index named part in the documentation, but this refers to a completely different feature than the PostgreSQL one. In MySQL, a partial index is set on a varchar column with a limit number, so it won't put the whole value into the index, but only the first X characters. This can make LIKE queries faster on text-relevant columns, but only for its first characters. This seems to be kind of hidden in the documentation and I only found older entries for this feature during my research.

MySQL has not the benefit of partial indices like PostgreSQL.

If we need "complete" status


Okay, we have a benefit on a select query for status "pending" in payment_status, but what if we need to select all the rows that are status "complete"? In that case a full table scan will be executed by the database, because the partial index has no entries for that value. So is that an advantage for the regular index and a reason to ditch the partial index? Let's check how fast the regular index will be when we try to get one million rows of "complete" values of payment_status:
EXPLAIN ANALYZE SELECT * FROM foo_normal WHERE payment_status = 'complete';

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Seq Scan on foo_normal  (cost=0.00..20065.25 rows=999987 width=28)
(actual time=13.082..549.146 rows=1000000 loops=1)
   Filter: ((payment_status)::text = 'complete'::text)
   Rows Removed by Filter: 20
Wait, what? Even though we have an index, PostgreSQL does a full table scan for our query and removes the "pending" rows after reading all data. Why?

PostgreSQL knows from its execution plan, that it has to read nearly the whole table. If it would choose to go through the BTREE of the index "payment_status_normal" first, the execution time would be even higher. It would have to read the index and nearly all rows anyway. That makes the full table scan a more efficient choice. In conclusion, the regular index only occupies more memory on your harddrive with no use. Usually, you wouldn't search for the payment_status "complete" without a second condition like a date and/or time range, which would result in a choice for an additional composite index. An index always depends on query use cases, so be sure to regularly review your queries and indices in your code reviews.

Archive data instead


Don't be hasty and wait before you change your index structure for your tables. Another approach to improve performance would be to move older data rows to somewhere else. You have to ask yourself if and why you need all rows in a table. Does your project need to access all rows? Or is it more like that a few data scientists, marketeers or other groups need to access those information? Historical data doesn't have to stay in your database and can be archived into another storage or format where others can mine for their analysis. This way less users are allowed to work on the project's core, because the relevant parts were separated for them. This is also a separation of concern.

In the end handling less data is the best performance improvement a project can have.


About Claudio Zizza
I'm working as a software developer for more than 15 years and used many coding languages and frameworks during that time, like PHP (of course), ASP, .NET, Javascript, Symfony, Laminas and more. I also contribute to the open source communities and am a part of the PHP Usergroup in Karlsruhe, Germany.
Follow me on Twitter (@SenseException)