Efficient way to find nearest feature between huge postgres tables

Question

I have two tables (g1 and g2), both with around two million features. I need to know for each feature in g1, which is the closest feature from g2, with a condition. Since they are both so huge, I also restricted the distance to 150 meters. I'm using pgadmin, and I'm using this query:

SELECT DISTINCT ON (g1.gid) g1.gid As gref_gid, g2.gid as gnn_gid, g2.code_mun, g1.codigo_mun, g2.text, g2.via INTO igcvia1
FROM u_nomen_dom As g1, u_nomen_via As g2   
WHERE g1.codigo_mun = g2.codigo_mun AND ST_DWithin(g1.geom, g2.geom, 150)    
ORDER BY g1.gid, ST_Distance(g1.geom,g2.geom)

The thing is that it is taking more than 24 hours to finish the query, and I don't know how much longer it's gonna take. I am wondering, is this the most efficient query? Technically, this is measuring all distances and then just show the closest one. Is there a way it can just find the closest one and then stop measuring? It would be great if you guys could help me find a more efficient way of doing this. I also tried QGIS but since these tables are so big, it just shuts down.

make sure you have spatial indexes in place and VACUUM ANALYZE run on both tables, then implement a LATERAL JOIN with the KNN operator like this (possible duplicate?). conditions go into the sub-query. what type are your geometries? — geozelot, Commented Sep 27, 2018 at 9:15
btw., are your geometries projected (since you used meter)? a dead-sure index driven filter would be ... WHERE ST_Expand(g1.geom, 150) && g2.geom ...; this will expand the bbox instead of true radius search and should be faster than ST_DWithin. — geozelot, Commented Sep 27, 2018 at 9:33
I have read about spatial indexes but I'm not sure which fields I should apply the index to. Should I apply a index to all fields that are in WHERE? Or just to the ST_Distance? How so? The geometries are lines in both files, but I will also have to do another one very similar to this one with polygons in the future. Yes, they are projected (4326). I will try the ST_Expand instead of ST_DWithin — A.T., Commented Sep 27, 2018 at 9:36
ha! I see there is plenty of room for improvement...actually, and don´t take this as an offense, there is much to learn as well...I´ll try to come up with a compact answer, but you will need to dive into a lot of things yourself here, e.g. things like projections/CRS and their units and indexes... — geozelot, Commented Sep 27, 2018 at 9:52
Yes, I am indeed new to this world, and I am swimming in new information. I'm reading as fast as possible, but also, deadlines have to be met. Thank you very much for your help, ThingumaBob! — A.T., Commented Sep 27, 2018 at 10:00

geozelot · Accepted Answer · 2018-09-27 11:51:55Z

EXPLAIN ANALYZE:

The EXPLAIN ANALYZE <your_query> command will execute the query and returns the Execution plan (only, no table), as developed and applied by the Query planner, with details about cost (e.g. execution time) and strategy (e.g. usage of indexes).

To allow the Query planner to work with updated table statistics, run VACUUM ANALYZE [<table>] (<table> is optional, when omitted is run on the whole database) after each table operation (INSERT, DELETE, UPDATE), and, if unsure, before you run possibly expensive queries.

The Execution plan is the main pool of information about how your queries perform and it is a good idea to include the result here in your questions for all inquiries concerning query performance.

KNN:

The (K) Nearest Neighbor in PostGIS search is best applied via the KNN operator <-> as the ORDER BY parameter, and wrapped into a LATERAL JOIN; in your case:

SELECT g1.gid AS gref_gid,
       g2.gid AS gnn_gid,
       g2.code_mun,
       g1.codigo_mun,
       g2.text,
       g2.via AS igcvia1
FROM u_nomen_dom As g1
JOIN LATERAL (
  SELECT gid,
         code_mun,
         text,
         via
  FROM u_nomen_via AS g
  WHERE g1.codigo_mun = g.codigo_mun    
  ORDER BY g1.geom <-> g.geom
  LIMIT 1
) AS g2
ON true;

Indexes:

This is mainly efficient due to the spatial index in place on your tables geom column, as the <-> operator will include an index search (if used in ORDER BY); create a spatial index on both tables with:

CREATE INDEX u_nomen_dom_geom_idx
  ON u_nomen_dom
  USING gist (geom);

CREATE INDEX u_nomen_via_geom_idx
  ON u_nomen_via
  USING gist (geom);

In general (but not necessary), a column used in a JOIN, in a filter (WHERE) and in ORDER BY may benefit from an index...unused indexes, however, will rather decrease overall performance, and indexes do take considerable space on disk. A primary key column should always have an index.

You might want to try if the Query planner sees an index on codigo_mun as beneficial (if not, drop it):

CREATE INDEX u_nomen_dom_codigo_mun_idx
  ON u_nomen_dom
  USING btree (codigo_mun);

CREATE INDEX u_nomen_via_codigo_mun_idx
  ON u_nomen_via
  USING btree (codigo_mun);

Indexes and their proper usage are crucial for performance, but they are a complex topic, worth to dive in with a bit of free time at hand.

CRS:

Most functions in PostGIS will have an explicit note on their doc pages concerning the CRS, e.g. results will be in CRS units.

With this in mind, consider your usage of ST_DWithin: you intend to find geometries in a radius of 150 meter - but your CRS is EPSG:4326, which has degree as unit. Thus you are searching for geometries in a radius of 150 degrees, with the result of e.g. ST_Distance being useless due to it being in degrees as well.

The same applies to ST_Expand, and that´s why I asked if your geometries are projected - and I should have asked for cartesian projection, to differentiate from a geographical projection (not the official terminology I guess...).

If you want consistent results (one degree of Longitude doesn´t represent the same distance over different Latitudes), and to use meter as units for your calculations (and as return value from used functions) you will need to reproject your data into an appropriate cartesian projection for your AOI.

However, if that is undesirable, PostGIS implements a second data type for geometries that is worth getting to know better: geography. Many functions (e.g. ST_DWithin, but not ST_Expand) offer an own function signature to pass in geography types instead of geometry; if used, those functions will do their calculations based on spherical/spheroidal algebra (Haversine/Vincenty's formulae etc.) with significantly better precision (at the cost of performance) and will implicitly use meter as units. This data type assumes EPSG:4326 as CRS to transform from geometry, and can simply be cast on-the-fly.

This is all pretty loosely explained and some things are only the tip of the iceberg. With the KNN query you should get better speed, given that you set up indexes and make sure you know what units you are working with. If precision is of concern, use a cast to geography, at the cost of some speed (e.g. g1.geom::geography).

you could use the 'ST_Expand` filter with a value of 0.001, which represents about 110m at the equator. adjust carefully if needed. — geozelot, Commented Sep 27, 2018 at 12:02
Great answer. You made everything much more clear. KNN and indexes make it run faster. Solved! — A.T., Commented Sep 27, 2018 at 13:45
KNN operator <-> thanks for sharing this information, answers like this make GIS.SE a valuable resource. — Mapperz, Commented Sep 27, 2018 at 14:49

Stack Exchange Network

Efficient way to find nearest feature between huge postgres tables

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
postgis
postgresql
nearest-neighbor
knn
or ask your own question.

Linked

Hot Network Questions

Efficient way to find nearest feature between huge postgres tables

1 Answer 1

Not the answer you're looking for? Browse other questions tagged postgispostgresqlnearest-neighborknn or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
postgis
postgresql
nearest-neighbor
knn
or ask your own question.