1
\$\begingroup\$

What I want to achieve:

I have several tables of the following schema:

+------+-------+
| date |  val  |
+------+-------+
| DATE | INT64 |
+------+-------+

I want to create the following:

+--------+---------+----------+----------------+--------------+
| table  | val_cur | val_prev | val_last_month | val_expected |
+--------+---------+----------+----------------+--------------+
| STRING | INT64   | INT64    | INT64          | INT64        |
+--------+---------+----------+----------------+--------------+

where table is the current table, val_cur is the sum of values of the current timeframe (e.g. '2021-12-01' - '2021-12-09'), val_prev the sum of values of the previous timeframe (e.g. '2021-11-01' - '2021-11-09'), val_last_month the sum of values of last month (relative to the current timeframe) and val_expected the sum of the average values per day of the current timeframe times the number of days left of the current month. I'm using the Standard SQL dialect of Google BigQuery.

What I'm trying to optimize

I would like to make the query less verbose and less error-prone to changes. Furthermore, it would be nice to make the query as efficient as possible concerning memory and execution speed.

My Solution

This is what I have - first I create some useful date variables:

DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01'); -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09'); -- End of current timeframe
DECLARE PREV_START_DATE DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE PREV_END_DATE DATE DEFAULT DATE_SUB(END_DATE, INTERVAL 1 MONTH);
DECLARE FIRST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(START_DATE, INTERVAL 1 MONTH);
DECLARE LAST_DAY_LAST_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY);
DECLARE LAST_DAY_CUR_MONTH DATE DEFAULT DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY);
DECLARE DAYS_PASSED INT64 DEFAULT DATE_DIFF(END_DATE, START_DATE, DAY)+1;
DECLARE DAYS_LEFT INT64 DEFAULT (EXTRACT(DAY FROM LAST_DAY_LAST_MONTH) - DAYS_PASSED) + 1;

SELECT DISTINCT START_DATE, END_DATE, PREV_START_DATE, PREV_END_DATE, DAYS_PASSED, DAYS_LEFT, FIRST_DAY_LAST_MONTH, LAST_DAY_LAST_MONTH, LAST_DAY_CUR_MONTH
FROM (SELECT 0);

Output:

+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
| Row | START_DATE |  END_DATE  | PREV_START_DATE | PREV_END_DATE | DAYS_PASSED | DAYS_LEFT | FIRST_DAY_LAST_MONTH | LAST_DAY_LAST_MONTH |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+
|   1 | 2021-12-01 | 2021-12-09 | 2021-11-01      | 2021-11-09    |           9 |        22 | 2021-11-01           | 2021-11-30          |
+-----+------------+------------+-----------------+---------------+-------------+-----------+----------------------+---------------------+

Then I create two example tables and execute my code:

WITH table_1 AS
  (SELECT date, CAST(RAND()*10 AS INT64) AS val
  FROM UNNEST(
      GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
  ) AS date),
  table_2 AS
  (SELECT date, CAST(RAND()*10 AS INT64) AS val
  FROM UNNEST(
      GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
  ) AS date)

(SELECT 

  'table_1' AS table,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,

  (SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
  FROM table_1
  WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)

UNION ALL 

(SELECT

  'table_2' AS table,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= START_DATE AND date <= END_DATE) AS val_cur,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= PREV_START_DATE AND date <= PREV_END_DATE) AS val_prev,

  (SELECT SUM(val)
  FROM table_1
  WHERE date >= FIRST_DAY_LAST_MONTH AND date <= LAST_DAY_LAST_MONTH) AS val_last_month,

  (SELECT (SUM(val)/DAYS_PASSED) * DAYS_LEFT
  FROM table_1
  WHERE date >= START_DATE AND date <= END_DATE) AS val_expected)

Output:

-- Replicating the example on your machine will change values because of RAND()
+---------+---------+----------+----------------+--------------------+
|  table  | val_cur | val_prev | val_last_month |    val_expected    |
+---------+---------+----------+----------------+--------------------+
| table_1 |      64 |       50 |            168 | 107.55555555555557 |
| table_2 |      32 |       48 |            169 | 114.88888888888889 |
+---------+---------+----------+----------------+--------------------+

Thanks for any feedback!

\$\endgroup\$
1
  • \$\begingroup\$ Did you mean FROM table_2 in subqueries of second SELECT in UNION query? \$\endgroup\$
    – Parfait
    Commented May 15, 2022 at 16:50

1 Answer 1

2
\$\begingroup\$

To reduce verbosity and the many subqueries, consider conditional aggregation where you move WHERE conditions into CASE statements nested within aggregate functions.

Additionally, run your dates query as another CTE to CROSS JOIN with main tables. Try also the BETWEEN operator for range conditions.

DECLARE START_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-01');   -- Start of current timeframe
DECLARE END_DATE DATE DEFAULT PARSE_DATE('%Y-%m-%d', '2021-12-09');     -- End of current timeframe

WITH dates AS (
    SELECT START_DATE, 
           END_DATE,
           DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS PREV_START_DATE,
           DATE_SUB(END_DATE, INTERVAL 1 MONTH) AS PREV_END_DATE,
           DATE_SUB(START_DATE, INTERVAL 1 MONTH) AS FIRST_DAY_LAST_MONTH,
           DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY) AS LAST_DAY_LAST_MONTH,
           DATE_SUB(DATE_TRUNC(
                DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), 
                MONTH), INTERVAL 1 DAY) AS LAST_DAY_CUR_MONTH,
           DATE_DIFF(END_DATE, START_DATE, DAY) + 1 AS DAYS_PASSED,
           (EXTRACT(DAY FROM DATE_SUB(DATE_TRUNC(END_DATE, MONTH), INTERVAL 1 DAY)) - 
            DATE_DIFF(END_DATE, START_DATE, DAY) + 1) + 1 AS DAYS_LEFT
    FROM (SELECT 0)
), table_1 AS (
    SELECT date, CAST(RAND()*10 AS INT64) AS val
    FROM UNNEST(
      GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
    ) AS date
), table_2 AS (
    SELECT date, CAST(RAND()*10 AS INT64) AS val
    FROM UNNEST(
      GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
    ) AS date
)

SELECT 
    'table_1' AS table,
    SUM(CASE 
           WHEN t.date BETWEEN d.START_DATE AND d.END_DATE 
           THEN t.val
        END) AS val_cur,
    SUM(CASE
           WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
           THEN t.val
        END) AS val_prev,
    SUM(CASE
           WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
           THEN t.val
        END) AS val_last_month,
    (SUM(CASE
           WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
           THEN t.val
        END) / 
     d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected

FROM table_1 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT

UNION ALL

SELECT 
    'table_2' AS table,
    SUM(CASE 
           WHEN t.date BETWEEN d.START_DATE AND d.END_DATE 
           THEN t.val
        END) AS val_cur,
    SUM(CASE
           WHEN t.date BETWEEN d.PREV_START_DATE AND d.PREV_END_DATE
           THEN t.val
        END) AS val_prev,
    SUM(CASE
           WHEN t.date BETWEEN d.FIRST_DAY_LAST_MONTH AND d.LAST_DAY_LAST_MONTH
           THEN t.val
        END) AS val_last_month,
    (SUM(CASE
           WHEN t.date BETWEEN d.START_DATE AND d.END_DATE
           THEN t.val
        END) / 
     d.DAYS_PASSED) * d.DAYS_LEFT AS val_expected

FROM table_2 t
CROSS JOIN dates d
GROUP BY d.DAYS_PASSED, d.DAYS_LEFT
ORDER BY 1

Output

Row table val_cur val_prev val_last_month val_expected
1 table_1 35 38 143 93.333333333333329
2 table_2 64 62 165 170.66666666666666
\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.