
I built an app that displays time series data. The display is very similar to a Yahoo Finance chart (e.g. https://finance.yahoo.com/quote/AMZN/). I am using Postgres to store the data because that was the database available to me when I was building it. The issue I am having is scaling the different time frames (1 day vs. month vs. max, etc.). The raw data is saved at minute resolution in the schema below. The app_id is an id the app manages, price is hopefully obvious, and time is milliseconds since epoch.

column | Type
-------+--------
app_id | bigint
price  | double
time   | bigint

The problem is that when someone clicks 5 days (5D in the Yahoo plot), for example, I can't load 7200 data points (the number of minutes in 5 days) into the front end; I need to scale the data down to, say, 60 data points. That would mean I need every 120th data point. To accomplish this I use MOD(app_id, 120), so my query would be something like

 select price from my_table 
 where app_id > (select count(*) from my_table) - 7200 
 and MOD(app_id,120) = 0

However, I am wondering if there is a more efficient or scalable way to handle this. In order to guarantee I have the first and last data points I have to grab those separately, because if there are 7201 data points in the db that query will not grab the first one. Also, if there are gaps in the saved data that I have to fill manually, for whatever reason, all the ids that come after the gap have to be updated. So my question boils down to:

Is there a way to do this with the timestamps themselves, or a simpler way than what I am currently doing with MOD?

  • If the data is static, there is no reason you can't store it in a faster data store, such as read-only files, that supports fast random access. Commented Jul 11, 2017 at 23:26

1 Answer


The query you gave us isn't very scalable because it still has to scan the entire table, or at least a large range of rows. You've reduced the amount of data returned, but I believe it still puts more load on your database than needed. Also, using MOD will not reflect the data accurately in certain cases: for example, if the data point that matches the MOD clause is an outlier, it would be misleading to show that value on the graph. Instead you should be using some kind of average of the raw values in the time period.

Materialized Views Approach

I would use a set of materialized views to aggregate the raw data into averages. Here's an example:

create materialized view
  my_table_hourly_agg
as
  select
    min(app_id) as app_id,
    avg(price)  as price,
    min(time)   as time
  from
    my_table
  group by
    floor(time / (60 * 60 * 1000)); -- bucket the raw minute data into hours

This solution has the benefit that at higher granularities you no longer have to query a large range of rows. Since the view is materialized it means the rows are stored and can be accessed like a table instead of a normal database view where the query is executed each time the view is used. You can build larger granularity materialized views from the lower granularity materialized views.
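For example, a coarser daily view could be built directly from the hourly view. This is only a sketch building on the names above; note that averaging the hourly averages gives an unweighted average, which is only exact when every hour contains the same number of minute samples:

create materialized view
  my_table_daily_agg
as
  select
    min(app_id) as app_id,
    avg(price)  as price,  -- unweighted average of the hourly averages
    min(time)   as time
  from
    my_table_hourly_agg
  group by
    floor(time / (24 * 60 * 60 * 1000)); -- bucket the hourly rows into days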

This solution has the downside that the materialized view will have some sort of update latency associated with it. You must call refresh materialized view to update the contents of the materialized view. You can do so on a schedule based on your requirements. The view update may take a while if the number of rows is large.
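The refresh itself is a single statement; the CONCURRENTLY variant keeps the view readable while it refreshes, but requires a unique index on the materialized view:

refresh materialized view my_table_hourly_agg;
-- or, if a unique index exists on the view:
refresh materialized view concurrently my_table_hourly_agg;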

You can experiment with using regular views instead (using the same query as above), but it won't be as scalable.
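A regular view is just the same aggregation query without the stored rows, something like:

create view my_table_hourly_view as
  select
    min(app_id) as app_id,
    avg(price)  as price,
    min(time)   as time
  from my_table
  group by floor(time / (60 * 60 * 1000));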

Aggregate Tables Approach

I think the ideal design, if possible, is to create tables for each of the aggregate periods and fill them as raw data enters your system. For example:

create table my_table_hourly_agg (
  first_app_id bigint primary key,
  avg_price double precision,
  start_time bigint,
  num_samples integer, -- we need the number of samples to maintain a running average
  constraint uniq_hour_agg unique (start_time) -- one row per hour
);

Then when a new row of raw data enters your system, you simultaneously insert the data point and update the aggregation tables.

begin;
  insert into
    my_table
  values ($app_id, $price, $time);

  insert into
    my_table_hourly_agg (first_app_id, avg_price, start_time, num_samples)
  values ($app_id, $price, floor($time / (60 * 60 * 1000)), 1)
  on conflict (start_time) -- a conflict means this is not the first data point for the given hour, so update the average price and the number of samples instead
  do update set
    avg_price   = (my_table_hourly_agg.avg_price * my_table_hourly_agg.num_samples + $price) / (my_table_hourly_agg.num_samples + 1),
    num_samples = my_table_hourly_agg.num_samples + 1;
commit;

With this design your aggregates are immediately up to date, and you get the benefit of lightning fast low-res queries (like the materialized views), but it potentially clutters your code, and it's very important that you protect yourself with transactions to prevent your aggregates from becoming incorrect. You could also write database triggers to update your aggregate tables so the application code stays simpler; a rough sketch is below.
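A minimal sketch of such a trigger, assuming the table and column names used above (the function and trigger names here are made up for illustration):

create or replace function update_hourly_agg() returns trigger as $$
begin
  insert into my_table_hourly_agg (first_app_id, avg_price, start_time, num_samples)
  values (new.app_id, new.price, floor(new.time / (60 * 60 * 1000)), 1)
  on conflict (start_time)
  do update set
    avg_price   = (my_table_hourly_agg.avg_price * my_table_hourly_agg.num_samples + new.price)
                  / (my_table_hourly_agg.num_samples + 1),
    num_samples = my_table_hourly_agg.num_samples + 1;
  return new;
end;
$$ language plpgsql;

create trigger my_table_hourly_agg_trg
  after insert on my_table
  for each row execute procedure update_hourly_agg();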

  • The group by approach is interesting. I think my users would complain I'm averaging away the real trends though.
    – achyrd
    Commented Jul 15, 2017 at 4:46
  • I was playing around with explicitly saving all the different time frames, but that won't work for the max option because the time between data points depends on the size of the table, which is constantly changing.
    – achyrd
    Commented Jul 15, 2017 at 4:50
