What I want to achieve:
I have several tables of the following schema:
+------+-------+
| date | val |
+------+-------+
| DATE | INT64 |
+------+-------+
I want to create the following:
+--------+---------+----------+----------------+--------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+--------+---------+----------+----------------+--------------+
| STRING | INT64 | INT64 | INT64 | INT64 |
+--------+---------+----------+----------------+--------------+
where table
is the current table, val_cur
is the sum of values of the current timeframe (e.g. '2021-12-01' - '2021-12-09'), val_prev
the sum of values of the previous timeframe (e.g. '2021-11-01' - '2021-11-09'), val_last_month
the sum of values of last month (relative to the current timeframe) and val_expected
the sum of the average values per day of the current timeframe times the number of days left of the current month. I'm using the Standard SQL dialect of Google BigQuery.
What I'm trying to optimize
I would like to make the query less verbose and less error-prone to changes. Furthermore, it would be nice to make the query as efficient as possible concerning memory and execution speed.
My Solution
This is what I have - first I create some useful date variables:
DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
DECLARE PREV_START_DATE DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE PREV_END_DATE DATE DEFAULT DATE_SUB(END_DATE, INTERVAL 1 MONTH);
DECLARE FIRST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE LAST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY);
DECLARE LAST_DAY_CUR_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY);
DECLARE DAYS_PASSED INT64 DEFAULT DATE_DIFF(END_DATE, START_DATE, DAY)+1;
DECLARE DAYS_LEFT INT64 DEFAULT (EXTRACT(DAY FROM LAST_DAY_LAST_MONTH) - DAYS_PASSED) + 1;
SELECT DISTINCT START_DATE, END_DATE, PREV_START_DATE, PREV_END_DATE, DAYS_PASSED, DAYS_LEFT, FIRST_DAY_LAST_MONTH, LAST_DAY_LAST_MONTH, LAST_DAY_CUR_MONTH
FROM (SELECT 0);
Output:
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| Row | START_DATE | END_DATE | PREV_START_DATE | PREV_END_DATE | DAYS_PASSED | DAYS_LEFT | FIRST_DAY_LAST_MONTH | LAST_DAY_LAST_MONTH |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| 1 | 2021-12-01 | 2021-12-09 | 2021-11-01 | 2021-11-09 | 9 | 22 | 2021-11-01 | 2021-11-30 |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
Then I create two example tables and execute my code:
WITH table_1 AS
(SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date),
table_2 AS
(SELECT date, CAST(RAND()*10 AS INT64) AS val
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS date)
(SELECT
'table_1' AS table,
(SELECT SUM(val)
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
(SELECT SUM(val)
FROM table_1
WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
(SELECT SUM(val)
FROM table_1
WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
(SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)
UNION ALL
(SELECT
'table_2' AS table,
(SELECT SUM(val)
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,
(SELECT SUM(val)
FROM table_1
WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,
(SELECT SUM(val)
FROM table_1
WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,
(SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
FROM table_1
WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)
Output:
-- Replicating the example on your machine will change values because of RAND()
+---------+---------+----------+----------------+--------------------+
| table | val_cur | val_prev | val_last_month | val_expected |
+---------+---------+----------+----------------+--------------------+
| table_1 | 64 | 50 | 168 | 107.55555555555557 |
| table_2 | 32 | 48 | 169 | 114.88888888888889 |
+---------+---------+----------+----------------+--------------------+
Thanks for any feedback!
FROM table_2
in subqueries of secondSELECT
inUNION
query? \$\endgroup\$