I am using PostgreSQL and PostGIS to do some exploratory analysis and eventually spatial analysis. I have a table with 40M+ points (geolocated tweets) in the United States that has a spatial index. I also have multiple polygon tables: counties, census tracts, CSAs, and UACs - all with spatial indices. I want to compare various polygons based on the amount/type of tweeting activity.
I am in the process of region tagging the points table with the various polygon types so that I can avoid having to always perform spatial queries, which from my understanding are expensive and from my experience do indeed take longer. I then put an index on each of the region tag columns in the points table.
The problem is that this process takes a really long time. Each region tagging (I'm using 4) takes well over an hour. This isn't that big of a deal since it theoretically should only happen once, but as I add more points I will have to re-region-tag (perhaps just those where some_region_tag IS NULL?). This all seems cumbersome as I add more data. Further, my table updates take longer than I would like.
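For the incremental case, what I have in mind is a sketch like the following, which re-tags only rows whose tag is still NULL (assuming newly loaded points start out untagged):

```sql
-- Sketch: re-tag only points that have not been tagged yet.
-- Assumes newly inserted rows have tract_geoid = NULL.
UPDATE points
SET tract_geoid = r.geoid
FROM usa_census_tracts AS r
WHERE points.tract_geoid IS NULL
  AND ST_Intersects(points.geom, r.geom);
```

but I'm not sure this is the right pattern as the table keeps growing.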
For example, I have these indices:
public | points_geom_idx | index | matt | points
public | usa_census_tracts_geom_idx | index | matt | usa_census_tracts
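Both are, if I recall the DDL correctly, plain GiST indices on the geometry columns, created along these lines:

```sql
-- Standard PostGIS spatial indices (GiST on the geometry column)
CREATE INDEX points_geom_idx ON points USING GIST (geom);
CREATE INDEX usa_census_tracts_geom_idx ON usa_census_tracts USING GIST (geom);
```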
and I region-tag the points with
UPDATE
points
SET
tract_geoid = r.geoid
FROM
usa_census_tracts as r
WHERE
ST_Intersects(points.geom, r.geom);
-- Result
UPDATE 39638978
Time: 4242357.695 ms -- (70 minutes)
which has this explain result:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Update on points (cost=877.02..2000416249.03 rows=5628971170 width=466)
-> Nested Loop (cost=877.02..2000416249.03 rows=5628971170 width=466)
-> Seq Scan on usa_census_tracts r (cost=0.00..23394.33 rows=74133 width=8181)
-> Bitmap Heap Scan on points (cost=877.02..26962.22 rows=2162 width=448)
Recheck Cond: (geom && r.geom)
Filter: _st_intersects(geom, r.geom)
-> Bitmap Index Scan on points_geom_idx (cost=0.00..876.48 rows=6487 width=0)
Index Cond: (geom && r.geom)
I update a column in the census tracts table with a query like:
UPDATE
usa_census_tracts
SET
total_points = ts.cnt
FROM
(SELECT t.tract_geoid, COUNT(t.geom) AS cnt
FROM points AS t
GROUP BY t.tract_geoid) AS ts
WHERE
usa_census_tracts.geoid = ts.tract_geoid;
-- Result
UPDATE 72526
Time: 330689.666 ms -- (about 6 minutes)
with this explain result (this plan is from the analogous update on my states table, grouping by statefp rather than tract_geoid):
QUERY PLAN
-----------------------------------------------------------------------------------------------
Update on usa_states (cost=11091818.64..11091826.02 rows=49 width=249183)
-> Hash Join (cost=11091818.64..11091826.02 rows=49 width=249183)
Hash Cond: ((usa_states.statefp)::text = (ts.statefp)::text)
-> Seq Scan on usa_states (cost=0.00..6.56 rows=56 width=249140)
-> Hash (cost=11091818.03..11091818.03 rows=49 width=46)
-> Subquery Scan on ts (cost=11091817.04..11091818.03 rows=49 width=46)
-> HashAggregate (cost=11091817.04..11091817.54 rows=49 width=35)
Group Key: t.statefp
-> Seq Scan on points t (cost=0.00..10767463.03 rows=64870803 width=35)
Six minutes is not an unbearable amount of time, but I want to run updates like this for 30ish other columns based on attributes of the points table (30ish x 6 = 180ish minutes). I understand that this query does not take advantage of an index, so I sometimes subset the data by state, for example, with:
UPDATE
usa_census_tracts
SET
total_points = ts.cnt
FROM
(SELECT t.tract_geoid, COUNT(t.geom) AS cnt
FROM points AS t
WHERE statefp = '48' -- state fips code
GROUP BY t.tract_geoid) AS ts
WHERE
usa_census_tracts.geoid = ts.tract_geoid;
which is only marginally faster. And since I would like to be able to compare, say, census tracts in Texas to those in California and Montana, or counties in Nebraska to counties in South Carolina, it makes sense to update all polygon tables at once.
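By updating several columns at once I mean something like this sketch, which computes multiple per-tract aggregates in a single pass over the points table (the avg_lon/avg_lat columns here are hypothetical examples, not columns I actually have):

```sql
-- Sketch: several per-tract aggregates in one scan of points,
-- written back in a single UPDATE.
-- total_points exists; avg_lon / avg_lat are hypothetical example columns.
UPDATE usa_census_tracts
SET total_points = ts.cnt,
    avg_lon      = ts.avg_lon,
    avg_lat      = ts.avg_lat
FROM (SELECT t.tract_geoid,
             COUNT(*)          AS cnt,
             AVG(ST_X(t.geom)) AS avg_lon,
             AVG(ST_Y(t.geom)) AS avg_lat
      FROM points AS t
      GROUP BY t.tract_geoid) AS ts
WHERE usa_census_tracts.geoid = ts.tract_geoid;
```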
The more I've analyzed this problem, the more my biggest concern has become adding more data: I will have to re-region-tag the points and re-update the polygon tables.
Am I properly using PostgreSQL/PostGIS for this task? Is there a way to improve this?