
I'm new to Postgres and PostGIS, but not to geospatial applications.

I have a table loaded with graph data in the form of links (edges). The link table has about 60,000,000 rows. I am trying to extract the nodes into their own table to allow for quicker searching. For some of my proof-of-concept work I was able to run the searches against the link table directly, but there are lots of duplicates, and there's no guarantee that either the source column or the target column contains all of the nodes.

This is Postgres & PostGIS, so I am also using the table to cache a geometry->geography conversion. Yes, I do need to use geography fields. I'm also copying the geometry information "just in case".

Table creation SQL:

-- Recreate table and index
DROP TABLE IF EXISTS nodes;

CREATE TABLE nodes (node integer PRIMARY KEY, geog geography(POINT,4326) );

CREATE INDEX geogIndex ON nodes USING GIST(geog);

SELECT AddGeometryColumn('nodes', 'geom', 4326, 'POINT', 2);

-- Insert all unique nodes from the source column
INSERT INTO nodes (node,geog,geom)
       SELECT DISTINCT ON (source) source,geography( ST_Transform(geom_source,4326)),geom_source 
       FROM view_topo;

-- Insert any nodes in the target column that we don't have already
INSERT INTO nodes (node,geog,geom)
       SELECT DISTINCT ON (target) target,geography( ST_Transform(geom_target,4326)),geom_target 
       FROM view_topo
       WHERE NOT EXISTS( SELECT 1 FROM nodes WHERE nodes.node = view_topo.target);

VACUUM ANALYZE;

I left the first INSERT running overnight, and it took about 2-3 hours to run. This resulted in about 40,000,000 unique nodes being added.

I have just started the second INSERT and the VACUUM ANALYZE. I expect it to take until at least lunchtime.

Luckily this is only a batch job that has to be executed once after I've loaded a new link table, but is there a better way? Is there a faster way?

  • Yes, Postgres doesn't like the VACUUM ANALYZE in a script like that. However, the rest appears to have worked okay.
    – winwaed
    Commented Jan 21, 2011 at 0:40

2 Answers


Check out PostgreSQL's tips on adding a lot of data into a table. In particular, are you sure that you need that index before INSERTing all that data? It might speed things up if you create the index after all the data has been added to the table.
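As a minimal sketch of that ordering (assuming the same nodes table, GiST index name, and INSERT statements from the question), the spatial index is only built once all rows are in place:

-- Create the table without the spatial index
DROP TABLE IF EXISTS nodes;

CREATE TABLE nodes (node integer PRIMARY KEY, geog geography(POINT,4326) );

SELECT AddGeometryColumn('nodes', 'geom', 4326, 'POINT', 2);

-- ... run the two INSERT ... SELECT statements from the question here ...

-- Build the spatial index once, after the bulk load, then refresh statistics
CREATE INDEX geogIndex ON nodes USING GIST(geog);

VACUUM ANALYZE nodes;

Note that the PRIMARY KEY constraint still builds a b-tree index up front; if that also proves to be a bottleneck, it can be added after the load with ALTER TABLE nodes ADD PRIMARY KEY (node).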

  • Thanks. Yesterday I found I needed an index on the geometry field, and retroactively adding it took quite a while to execute. This initially made me doubtful, but in conventional programming a create-store-access-destroy pattern on sorted data is usually much faster when implemented with an unsorted container as create-store-sort-access-destroy, i.e. essentially what you are suggesting: build it, then sort it in one big sort.
    – winwaed
    Commented Jan 27, 2011 at 14:48

WHERE NOT EXISTS is not fast SQL. How about a union of the two selects, inside a SELECT DISTINCT? I haven't got a PostgreSQL instance to test this on, but maybe something like this:

INSERT INTO nodes (node, geog, geom)
       SELECT DISTINCT ON (node) node, geom1, geom2
       FROM (SELECT source AS node,
                    geography(ST_Transform(geom_source,4326)) AS geom1,
                    geom_source AS geom2
             FROM view_topo
             UNION
             SELECT target AS node,
                    geography(ST_Transform(geom_target,4326)) AS geom1,
                    geom_target AS geom2
             FROM view_topo) AS combined;
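A note on the design: UNION (as opposed to UNION ALL) already removes rows that are exact duplicates across the two selects, at the cost of sorting the combined set, and the outer DISTINCT ON (node) then keeps a single row per node even where the source and target geometries differ.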

  • Thanks, I'm trying a UNION. I guess the NOT EXISTS relies on a nested SQL statement which would be executed for each row.
    – winwaed
    Commented Jan 27, 2011 at 15:17
  • Out of memory! It sounds like your approach should be faster for smaller datasets, but with 60M rows in my view_topo view and about 8-9M unique rows to be inserted, my PC can't hack it. (4 GB installed, but booted as 32-bit because PostGIS is currently only available for 32-bit.)
    – winwaed
    Commented Jan 27, 2011 at 15:35
