
I am using PostgreSQL and PostGIS to do some exploratory analysis and eventually spatial analysis. I have a table with 40M+ points (geolocated tweets) in the United States that has a spatial index. I also have multiple polygon tables: counties, census tracts, CSAs, and UACs - all with spatial indices. I want to compare various polygons based on the amount/type of tweeting activity.

I am in the process of region-tagging the points table with the various polygon types so that I can avoid always having to perform spatial queries, which, from my understanding, are expensive and, in my experience, do indeed take longer. I then put an index on each of the region-tag columns in the points table.

The problem is that this process takes a really long time. Each region tagging (I'm using 4) takes well over an hour. This isn't that big of a deal since it theoretically should only happen once, but as I add more points I will have to re-region-tag (perhaps just those where some_region_tag IS NULL?). This all seems cumbersome as I add more data. Further, my table updates take longer than I would like.
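The incremental re-tag I have in mind would be something like this sketch (assuming newly loaded points start with tract_geoid set to NULL):

UPDATE
    points
SET
    tract_geoid = r.geoid
FROM
    usa_census_tracts AS r
WHERE
    points.tract_geoid IS NULL -- only touch rows that haven't been tagged yet
    AND ST_Intersects(points.geom, r.geom);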

For example, I have these indices:

public | points_geom_idx            | index | matt  | points
public | usa_census_tracts_geom_idx | index | matt  | usa_census_tracts

And I region-tag the points with:

UPDATE
    points
SET
    tract_geoid = r.geoid
FROM
    usa_census_tracts as r
WHERE
    ST_Intersects(points.geom, r.geom);

-- Result
UPDATE 39638978
Time: 4242357.695 ms -- (70 minutes)

which has this explain result:

    QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Update on points  (cost=877.02..2000416249.03 rows=5628971170 width=466)
   ->  Nested Loop  (cost=877.02..2000416249.03 rows=5628971170 width=466)
         ->  Seq Scan on usa_census_tracts r  (cost=0.00..23394.33 rows=74133 width=8181)
         ->  Bitmap Heap Scan on points  (cost=877.02..26962.22 rows=2162 width=448)
               Recheck Cond: (geom && r.geom)
               Filter: _st_intersects(geom, r.geom)
               ->  Bitmap Index Scan on points_geom_idx  (cost=0.00..876.48 rows=6487 width=0)
                     Index Cond: (geom && r.geom)

I update a column in the census tracts table with a query like:

UPDATE
    usa_census_tracts 
SET
    total_points = ts.cnt
FROM
    (SELECT t.tract_geoid, COUNT(t.geom) AS cnt
    FROM points AS t
    GROUP BY t.tract_geoid) AS ts
WHERE
    usa_census_tracts.geoid = ts.tract_geoid;

-- Result
UPDATE 72526
Time: 330689.666 ms -- (about 6 minutes)

with this explain result:

 QUERY PLAN                                           
-----------------------------------------------------------------------------------------------

Update on usa_states  (cost=11091818.64..11091826.02 rows=49 width=249183)
   ->  Hash Join  (cost=11091818.64..11091826.02 rows=49 width=249183)
         Hash Cond: ((usa_states.statefp)::text = (ts.statefp)::text)
         ->  Seq Scan on usa_states  (cost=0.00..6.56 rows=56 width=249140)
         ->  Hash  (cost=11091818.03..11091818.03 rows=49 width=46)
               ->  Subquery Scan on ts  (cost=11091817.04..11091818.03 rows=49 width=46)
                     ->  HashAggregate  (cost=11091817.04..11091817.54 rows=49 width=35)
                           Group Key: t.statefp
                           ->  Seq Scan on points t  (cost=0.00..10767463.03 rows=64870803 width=35)

Six minutes is not an unbearable amount of time, but I want to make updates like this for 30ish other columns based on the attributes of the points table (30ish x 6 = 180ish minutes; one idea I have for cutting that down is sketched below, after the next example). I understand that this query does not take advantage of an index, so I sometimes subset the data by state, for example, with:

UPDATE
    usa_census_tracts 
SET
    total_points = ts.cnt
FROM
    (SELECT t.tract_geoid, COUNT(t.geom) AS cnt
    FROM points AS t
    WHERE statefp = '48' -- state fips code
    GROUP BY t.tract_geoid) AS ts
WHERE
    usa_census_tracts.geoid = ts.tract_geoid;

which is only marginally faster. Since I would like to be able to compare, for example, census tracts in Texas to those in California and Montana, or counties in Nebraska to counties in South Carolina, it makes sense to update all polygon tables at once.
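The idea I mention above for the 30ish columns is to compute several aggregates in a single pass over points rather than running one UPDATE per column. A sketch, where is_night and is_weekend stand in for whatever point attributes I actually use:

UPDATE
    usa_census_tracts
SET
    total_points   = ts.cnt,
    night_points   = ts.night_cnt,   -- hypothetical extra columns
    weekend_points = ts.weekend_cnt
FROM
    (SELECT t.tract_geoid,
            COUNT(*) AS cnt,
            SUM(CASE WHEN t.is_night THEN 1 ELSE 0 END)   AS night_cnt,
            SUM(CASE WHEN t.is_weekend THEN 1 ELSE 0 END) AS weekend_cnt
     FROM points AS t
     GROUP BY t.tract_geoid) AS ts
WHERE
    usa_census_tracts.geoid = ts.tract_geoid;

That way points is scanned once per polygon table instead of once per column, though I'm not sure how much it helps the underlying problem.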

The more I've analyzed this problem, the clearer it becomes that my biggest concern is adding more data: I will have to re-region-tag the points and re-update the polygon tables.

Am I properly using PostgreSQL/PostGIS for this task? Is there a way to improve this?

  • Not an answer, but quite relevant: lwn.net/Articles/653411. The link describes how newer versions of PostgreSQL than yours gain tools to address 'mid-sized data' (1 to 10 terabytes).
    – nickves
    Commented Apr 22, 2016 at 23:30
  • "It depends", which is why you are seeing votes to close because it's opinion-based. The answer depends on the operations you want to do, relative expertise in the technologies (e.g. do you know anything about map-reduce?), environment (e.g. do you have a cluster to support sharding), and probably things I don't know about (because I don't know your environment).
    – BradHards
    Commented Apr 23, 2016 at 0:03
  • @nickves Thanks for the link. That piece is very informative. I'll be upgrading to PostgreSQL 9.5 to test those concepts. However, something still doesn't make much sense. The 1 to 10 TB range is much larger than my current data set. My 40M points are 77 GB. I can't really imagine doing these operations on something 2, 3, or 4x this size. Are spatial calculations being considered here too? Or am I really mismanaging this data?
    – haff
    Commented Apr 23, 2016 at 6:09
  • I think this is the right place, but not in this shape. The question actually seems to be more of a beginner's question about how to make PostGIS work as expected, not a comparison of different software. So, relevant questions to @haff are: you don't mention indexing the polygons, why? What is your query? What is the result of EXPLAIN on the query? Commented Apr 23, 2016 at 7:47
  • Hi all, thanks for the comments and the direction for this question. I've heavily edited the question and decided to focus this one on PostGIS. I think I've addressed your questions in the edits, @Nicklas Avén. In the meantime I'll be building a cluster to test some alternative ideas (some of which are not related to the GIS components of the problem).
    – haff
    Commented Apr 25, 2016 at 19:11

1 Answer


There are two pieces to the performance pie here:

  • The time to calculate the tagging, which you should break out by doing some test runs with things like

    -- Time just the spatial join/aggregation, without any UPDATE:
    CREATE TABLE tract_pt_counts AS
    SELECT t.tract_id, Count(*) AS pt_count
    FROM tracts t
    JOIN pts p ON ST_Intersects(t.geom, p.geom)
    GROUP BY t.tract_id;

  • The time to update a table in place. This is actually probably a lot more expensive than you think, thanks to row-level concurrency and locking overheads: under MVCC each updated row is rewritten as a new row version, and its index entries may need updating too.

You cannot avoid the calculation time; that's just a given. However, there are a couple of schemes you could use to avoid the update overhead.

  • Rewrite your table every time, using the CREATE TABLE AS option. This is a big hammer, but not the end of the world. However, if you're planning on adding or updating column values regularly, it's maybe not the best approach, particularly the wider the table gets.
  • Rewrite a concordance table every time (sketched below). PostgreSQL is really good at doing joins on integer keys, so a structure where you have a master census table plus a "census point count" table, or a master points table plus a "points region" table, and join them together, is not the end of the world, as long as the kinds of queries you're going to run pull a "few" records at a time (not the whole table).
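A minimal sketch of that concordance-table scheme, reusing the hypothetical tracts/pts names from above and assuming pts has a primary key pt_id:

    -- Recompute the point-to-region mapping as its own narrow table...
    CREATE TABLE pt_tract AS
    SELECT p.pt_id, t.tract_id
    FROM pts p
    JOIN tracts t ON ST_Intersects(t.geom, p.geom);

    CREATE INDEX pt_tract_tract_id_idx ON pt_tract (tract_id);

    -- ...and derive per-tract figures with a plain join/aggregate instead of an UPDATE:
    SELECT tract_id, Count(*) AS total_points
    FROM pt_tract
    GROUP BY tract_id;

Adding new points then means rebuilding (or appending to) the small pt_tract table, without ever rewriting the wide points or census tables.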
