Optimize groupwise maximum query

Question

select * 
from records 
where id in ( select max(id) from records group by option_id )

This query works fine even on millions of rows. However as you can see from the result of explain statement:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
     ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
           ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Here it says it is scanning all rows and that is obviously inefficient.

I also tried reordering the query:

select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
     ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.

I tried with and without index on (option_id), (option_id, id), (option_id, id desc), none of them had any effect on the query plan.

Is there a way of executing a groupwise maximum query in Postgres without scanning all rows?

What I am looking for, programmatically, is an index which stores the maximum id for each option_id as they are inserted into the records table. That way, when I query for maximums of option_ids, I should only need to scan index records as many times as there are different option_ids.

I've seen select distinct on answers all over SO from high ranking users (thanks to @Clodoaldo Neto for giving me keywords to search for). Here's why it doesn't work:

create index index_name on records(option_id, id desc)

select distinct on (option_id) *
from records
order by option_id, id desc
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
  ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms

That's great, it's using an index. However using an index to scan all ids doesn't really make much sense. According to my executions, it is in fact slower than a simple sequential scan.

Interesting enough, MySQL 5.5 is able to optimize the query simply using an index on records(option_id, id)

mysql> select count(1) from records;

+----------+
| count(1) |
+----------+
|  1086342 |
+----------+

1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
                                                      on mr.max_id= records.id;

+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+

3 rows in set, 1 warning (0.02 sec)

"However using an index to scan all rows doesn't really make much sense" --- it does. Indexes are smaller than the whole dataset and it's more chance they are in a cache. It doesn't scan actual rows though, but the index. — zerkms, Commented Jun 23, 2014 at 5:00
What is the plan for the original query with index created? — zerkms, Commented Jun 23, 2014 at 5:02
@zerkms indexing option_id made no difference (as I stated in the question) Indexing option_id_id_desc or option_id_id also makes no difference in the query plan. — nurettin, Commented Jun 23, 2014 at 5:08
what if you add a (option_id, id desc) index and run ANALYZE against the given table? Btw, what posgtresql version are you running? — zerkms, Commented Jun 23, 2014 at 5:10
"I tried putting and removing index on option_id which had no effect on the query plan." --- Index on a single option_id will unlikely affect it in any way since you still need to retrieve MAX(id) hence iterate over all rows. — zerkms, Commented Jun 23, 2014 at 5:12

Erwin Brandstetter · Accepted Answer · 2020-08-05 21:28:09Z

Assuming relatively few rows in options for many rows in records.

Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest to create one to enforce referential integrity:

CREATE TABLE options (
  option_id int  PRIMARY KEY
, option    text UNIQUE NOT NULL
);

INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM   records;

Then there is no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).

SELECT option_id, (SELECT max(id)
                   FROM   records
                   WHERE  option_id = o.option_id) AS max_id
FROM   options o
ORDER  BY 1;

This includes options with no match in table records. You get NULL for max_id and you can easily remove such rows in an outer SELECT if needed.

Or (same result):

SELECT option_id, (SELECT id
                   FROM   records
                   WHERE  option_id = o.option_id
                   ORDER  BY id DESC NULLS LAST
                   LIMIT  1) AS max_id
FROM   options o
ORDER  BY 1;

May be slightly faster. The subquery uses the sort order DESC NULLS LAST - same as the aggregate function max() which ignores NULL values. Sorting just DESC would have NULL first:

Why do NULL values come first when ordering DESC in a PostgreSQL query?

The perfect index for this:

CREATE INDEX on records (option_id, id DESC NULLS LAST);

Index sort order doesn't matter much while columns are defined NOT NULL.

There can still be a sequential scan on the small table options, that's just the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via (bitmap) index scan or, if possible, index-only scan.

db<>fiddle here - showing two index-only scans for the simple case
_{Old sqlfiddle}

Or use LATERAL joins for a similar effect in Postgres 9.3+:

Optimize GROUP BY query to retrieve latest row per user

jjanes · Accepted Answer · 2014-06-23 19:59:49Z

You mention wanting an index that only indexes the max(id) for each option_id. This is not currently supported by PostgreSQL. If such a feature is added in the future, it would be probably done through the mechanism of making a materialized view on the aggregate query, and then indexing the materialized view. I wouldn't expect for at least a couple years, though.

What you can do now, though, is use a recursive query make it skip through the index to each unique value of option_id. See the PostgreSQL wiki page for a general description of technique.

The way you can use this for your case it write the recursive query to return the distinct values of option_id, and then for each one of those subselect the max(id):

with recursive dist as (
  select min(option_id) as option_id from records
union all
  select (select min(option_id) from records where option_id > dist.option_id) 
     from dist where dist.option_id is not null
) 

select option_id, 
  (select max(id) from records where records.option_id=dist.option_id)
from dist where option_id is not null;

It is ugly, but you can hide it behind a view.

In my hands this runs in 43ms, rather than 513ms for the on distinct variety.

It could probably be made about twice as fast if you can find a way to incorporate the max(id) into the recursive query, but I couldn't find a way to do that. The problem is that these queries have a rather restrictive syntax, you cannot use "limit" or "order by" in conjunction with the UNION ALL.

This query touches page widely scattered throughout the index, and if those pages don't fit in the cache, then you will be doing a lot of inefficient IO. However, if this type of query is popular, then the 1057 leaf index pages will have little problem staying in cache.

This is how a set up my test case:

create table records  as select floor(random()*1057)::integer as option_id, floor(random()*50000000)::integer as id from generate_series(1,1240315);
create index on records (option_id ,id);
explain analyze;

Quassnoi · Accepted Answer · 2014-06-23 20:27:06Z

PostgreSQL does not support loose scan which MySQL is able to use for queries like this. It's the Using index for group-by you're seeing on the MySQL plan.

Basically, it's returning the first or last entry in a range matching a subset of a composite key, then searching for the next or previous value of this subset.

In your case it first returns the last value of the whole index on (option_id, id) (which by definition happens to hold the MAX(id) for the greatest option_id), then searches for the last value with next to largest option_id and so on.

PostgreSQL's optimizer is not able to build such a plan, however, PostgreSQL lets you emulate it in SQL. If you have lots of records but few distinct option_id, it's worth doing.

To do this, first create the index:

CREATE INDEX ix_records_option_id ON records (option_id, id);

then run this query:

WITH RECURSIVE q (option_id) AS
        (
        SELECT  MIN(option_id)
        FROM    records
        UNION ALL
        SELECT  (
                SELECT  MIN(option_id)
                FROM    records
                WHERE   option_id > q.option_id
                )
        FROM    q
        WHERE   option_id IS NOT NULL
        )
SELECT  option_id,
        (
        SELECT  MAX(id)
        FROM    records r
        WHERE   r.option_id = q.option_id
        )
FROM    q
WHERE   option_id IS NOT NULL

See it on sqlfiddle.com: http://sqlfiddle.com/#!15/4d77d/4

Clodoaldo Neto · Accepted Answer · 2014-06-16 12:57:15Z

1

select distinct on (option_id) *
from records
order by option_id, id desc

Indexes will only be used if the cardinality is favorable. That said you can try a composite index

create index index_name on records(option_id, id desc)

answered Jun 16, 2014 at 12:57

Clodoaldo Neto

123k28 gold badges243 silver badges268 bronze badges

Add a comment |

Collectives™ on Stack Overflow

Optimize groupwise maximum query

4 Answers 4

Not the answer you're looking for? Browse other questions tagged
sql
postgresql
query-optimization
greatest-n-per-group
groupwise-maximum
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Not the answer you're looking for? Browse other questions tagged sqlpostgresqlquery-optimizationgreatest-n-per-groupgroupwise-maximum or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
sql
postgresql
query-optimization
greatest-n-per-group
groupwise-maximum
or ask your own question.