Setting up indexes for PostGIS distance queries

Question

I'm building an application that is supposed to query and return every Record in a table that is X kilometers away from PointX. Records and PointX's positions are determined from (long/lat) information provided by Google Geocode API.

I'm new to PostGIS. After a quick research, I've found this question. The answer seems to be along the lines of:

SELECT *
FROM your_table
WHERE ST_Distance_Sphere(the_geom, ST_MakePoint(your_lon,your_lat)) <= radius_mi * 1609.34

The problem is that even though I'm only getting started at GIS, when I look at the above query, I can't possibly imagine how this can use an index. There are 2 function calls. I imagine the table being scanned for every Record.

Does PostGIS have any index type capable of making the above query performant and, if not, what would be the recommended approach doing what I need?

Make sure you build the right index, on a cast to geography, and apply a ST_SetSRID() to the ST_MakePoint before casting to geography in the query. — Vince, Commented Jul 10, 2017 at 23:49
Side note: ST_Distance_Sphere is now ST_DistanceSphere postgis.net/docs/ST_DistanceSphere.html — Big_Boulard, Commented Jul 25, 2022 at 11:38

Vince · Accepted Answer · 2022-08-16 15:05:37Z

There are two keys to getting good geodetic query performance with large tables with geometry columns using WGS 1984 geographic data (SRID 4326):

Use the ST_DWithin function, which searches using an available spatial index, and will find geography features with a Cartesian distance
Build an index on the geography cast, so ST_DWithin can use it

So let's look at what happens in the real world. First we need to create and populate a table of one million random points:

DROP TABLE IF EXISTS example1
;

CREATE TABLE example1 (
    idcol   serial      NOT NULL,
    geomcol geometry        NULL,
    CONSTRAINT  example1_pk PRIMARY KEY (idcol),
    CONSTRAINT  enforce_srid CHECK (st_srid(geomcol) = 4326)
);

INSERT INTO example1(geomcol)
SELECT  ST_SetSRID(
            ST_MakePoint(
            (random()*360.0) - 180.0,
            (acos(1.0 - 2.0 * random()) * 2.0 - pi()) * 90.0 / pi()),
            4326) as geomcol
FROM  generate_series(1, 1000000) vtab;

CREATE INDEX example1_spx ON example1 USING GIST (geomcol);
-- (took about 22 sec)

If we execute the ST_Distance query, we get your expected full table scan:

EXPLAIN ANALYZE VERBOSE
SELECT  count(*)
FROM    example1
WHERE   ST_Distance(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography) < 30 * 1609.34
;

Aggregate  (cost=274167.33..274167.34 rows=1 width=0) (actual time=4940.531..4940.532 rows=1 loops=1)
  Output: count(*)
  ->  Seq Scan on bob.example1  (cost=0.00..273334.00 rows=333333 width=0) (actual time=592.766..4940.509 rows=11 loops=1)
        Output: idcol, geomcol
        Filter: (_st_distance((example1.geomcol)::geography, '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography, 0::double precision, true) < 48280.2::double precision)
        Rows Removed by Filter: 999989
Planning time: 2.137 ms
Execution time: 4940.568 ms

Now, if we use ST_DWithin, we still get a full table scan (albeit a faster one):

EXPLAIN ANALYZE VERBOSE
SELECT  count(*)
FROM    example1
WHERE   ST_DWithin(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography,30 * 1609.34)
;

Aggregate  (cost=405867.33..405867.34 rows=1 width=0) (actual time=908.716..908.716 rows=1 loops=1)
  Output: count(*)
  ->  Seq Scan on bob.example1  (cost=0.00..405834.00 rows=13333 width=0) (actual time=38.449..908.700 rows=7 loops=1)
        Output: idcol, geomcol
        Filter: (((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography) AND ('0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography && _st_expand((example1.geomcol)::geography, 48280.2::double precision) (...)
        Rows Removed by Filter: 999993
Planning time: 2.017 ms
Execution time: 908.763 ms

And this is the last piece -- Building the covering index (cast geography):

CREATE INDEX example1_gpx ON example1 USING GIST (geography(geomcol));
-- (Takes an extra 13 sec)

EXPLAIN ANALYZE VERBOSE
SELECT  count(*)
FROM    example1
WHERE   ST_DWithin(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography,30 * 1609.34)
;

Aggregate  (cost=96538.95..96538.96 rows=1 width=0) (actual time=0.775..0.775 rows=1 loops=1)
  Output: count(*)
  ->  Bitmap Heap Scan on bob.example1  (cost=8671.62..96505.62 rows=13333 width=0) (actual time=0.586..0.769 rows=19 loops=1)
        Output: idcol, geomcol
        Recheck Cond: ((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography)
        Filter: (('0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography && _st_expand((example1.geomcol)::geography, 48280.2::double precision)) AND _st_dwithin((example1.geomcol)::geography, '0101000020E61000005D6DC5FEB2BB1B40545227A089684740':: (...)
        Rows Removed by Filter: 14
        Heap Blocks: exact=33
        ->  Bitmap Index Scan on example1_gpx  (cost=0.00..8668.29 rows=200000 width=0) (actual time=0.384..0.384 rows=33 loops=1)
              Index Cond: ((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography)
Planning time: 2.572 ms
Execution time: 0.820 ms

Finally, the optimizer is using the spatial index, and it shows, but what's three orders of magnitude between friends?

Some caveats:

I'm a database nerd, so my home PC has got 16Gb RAM, six 3.3Ghz cores, and a 256Gb SSD for the database default tablespace; your mileage may vary
I re-ran the creation SQL before each query, to level the playing field with respect to "hot" pages in the cache, but this could produce slightly different results because the same random seed wasn't used for different runs

And a note:

I tweaked the original {-90,+90} latitude range to use arc-cosine for an equal-area distribution (less biased toward the poles)

Is there any reason why not to store the geomcol as geography? Both ST_Distance and ST_DWithin expect geographies. And if we did so, we wouldn't need the extra index casting geometry to geography. — andrerpena, Commented Jan 8, 2018 at 9:04
This is a different question, and if asked, might be closed as opinion-based. — Vince, Commented Jan 8, 2018 at 11:52
Came across this result in google and thank you @Vince for your answer. The smallest difference of forcefully casting a geom point to a geograhpy took my query time from 43 seconds on average to 10msec instead.. — Angry 84, Commented Aug 3, 2018 at 2:21
learning about optimization of spatial queries to postgres/postgis and this was an incredibly helpful and full example. Thanks a bunch @Vince — Mark Essel, Commented Jul 18, 2019 at 16:40
"Extra index" is based on the assumption of at least one index, probably more. example1_spx would be used when doing nominal mapping (extent envelope searching). It's hard to imagine a situation where only ST_DWithin searches would be necessary, and no primary key, attribute, or spatial index needs would be present. Obviously, fewer indexes are called for in frequently modified tables, but it's rare that a table wouldn't need any indexes. Still, if the "extra" is distracting I can remove it. — Vince, Commented Apr 15, 2020 at 17:05

Stack Exchange Network

Setting up indexes for PostGIS distance queries

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
postgis
postgresql
indexing
or ask your own question.

Linked

Hot Network Questions

Setting up indexes for PostGIS distance queries

1 Answer 1

Not the answer you're looking for? Browse other questions tagged postgispostgresqlindexing or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
postgis
postgresql
indexing
or ask your own question.