I'm new to Postgres and PostGIS, but not to geospatial applications.
I have a table loaded with graph data in the form of links (edges). The link table has about 60,000,000 rows. I am trying to extract the nodes into their own table to allow for quicker searching. For some of my proof-of-concept work I was able to run the searches against the link table directly, but it contains a lot of duplicate nodes, and there's no guarantee that either the source column or the target column on its own contains every node.
This is Postgres & PostGIS, so I am also using the table to cache a geometry->geography conversion. Yes, I do need to use geography fields. I'm also copying the geometry information across "just in case".
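For reference, the data comes through a view, view_topo, shaped roughly like this (simplified; "links" is a stand-in name for the real link table, and the real view has more columns):

-- Simplified sketch of the source view; one row per link (edge)
CREATE VIEW view_topo AS
SELECT source,       -- integer id of the link's start node
       target,       -- integer id of the link's end node
       geom_source,  -- POINT geometry at the start node
       geom_target   -- POINT geometry at the end node
FROM links;          -- "links" stands in for the real 60M-row table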
Table creation and load SQL:
-- Recreate table and index
DROP TABLE IF EXISTS nodes;
CREATE TABLE nodes (node integer PRIMARY KEY, geog geography(POINT, 4326));
CREATE INDEX geogIndex ON nodes USING GIST(geog);
SELECT AddGeometryColumn('nodes', 'geom', 4326, 'POINT', 2);

-- Insert all unique nodes from the source column
INSERT INTO nodes (node, geog, geom)
SELECT DISTINCT ON (source) source, geography(ST_Transform(geom_source, 4326)), geom_source
FROM view_topo;

-- Insert any nodes in the target column that we don't have already
INSERT INTO nodes (node, geog, geom)
SELECT DISTINCT ON (target) target, geography(ST_Transform(geom_target, 4326)), geom_target
FROM view_topo
WHERE NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.node = view_topo.target);

VACUUM ANALYZE;
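For context, the kind of search the nodes table is meant to speed up looks roughly like this (illustrative only; the point and the 100 m radius are made up):

-- Illustrative: find candidate nodes within 100 m of a point.
-- ST_DWithin on geography works in metres and can use the GIST index.
SELECT node
FROM nodes
WHERE ST_DWithin(geog,
                 ST_GeographyFromText('SRID=4326;POINT(-0.1276 51.5072)'),
                 100);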
I left the first INSERT running overnight; it took about 2-3 hours and added about 40,000,000 unique nodes. I have just kicked off the second INSERT and the VACUUM ANALYZE, and I expect them to take until at least lunchtime.
Luckily this is only a batch job that has to be run once each time I load a new link table, but is there a better way? Is there a faster way?
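For example, would a single-pass version along these lines be quicker? This is a sketch I haven't tested: it gathers both columns with UNION ALL, deduplicates once, and defers the GiST index build until after the load (maintaining the index during a bulk insert is usually slower than building it at the end):

-- Untested sketch: collect source and target nodes in one pass,
-- deduplicate once, and build the spatial index after loading.
DROP TABLE IF EXISTS nodes;
CREATE TABLE nodes (node integer PRIMARY KEY, geog geography(POINT, 4326));
SELECT AddGeometryColumn('nodes', 'geom', 4326, 'POINT', 2);

INSERT INTO nodes (node, geog, geom)
SELECT DISTINCT ON (node) node, geography(ST_Transform(geom, 4326)), geom
FROM (
    SELECT source AS node, geom_source AS geom FROM view_topo
    UNION ALL
    SELECT target AS node, geom_target AS geom FROM view_topo
) AS all_nodes;

CREATE INDEX geogIndex ON nodes USING GIST(geog);
VACUUM ANALYZE;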