1
\$\begingroup\$

I have a query that can be seen on this fiddle and I'm relatively new to PSQL, having a few months of experience.

DB Fiddle

As mentioned in the comment I should put the query here and the code is perfectly working. The reason for this question is to find an improvement on it.

The link to the third party is to show the result of it otherwise would be difficult to visualize it here.

WITH statuses_flow AS (
    SELECT
        c.id,
        c.study_id,
        c.site_id,
        s.type AS status_type,
        s.timestamp AS status_from,
        sa.type,
        row_number() OVER (ORDER BY s.candidate_id,
            s.timestamp) AS row_no
    FROM
        public.candidates c
        JOIN public.statuses s ON s.candidate_id = c.id
        JOIN public.statuses sa ON sa.candidate_id = c.id
            AND sa.id in(
            SELECT
                max(id)
            FROM public.statuses
        GROUP BY
            candidate_id)
    WHERE
        c.study_id in('INIT1')
        AND c.site_id in('Test1')
        AND sa.type != 'ANONYMISED' ORDER BY
            row_no ASC
)
SELECT
    statuses_flow.id, statuses_flow.study_id AS "studyId", statuses_flow.site_id AS "siteId", statuses_flow.status_type AS "statusType", statuses_flow.status_from AS "statusFrom", next_status.status_from AS "statusTo", CASE WHEN next_status.status_from IS NULL THEN
        NULL
    ELSE
        (
        SELECT
            created_at AS first_contact
        FROM
            public.activities
        WHERE
            candidate_id = statuses_flow.id
            AND TYPE in('PHONE', 'SMS', 'EMAIL')
            AND created_at BETWEEN statuses_flow.status_from
            AND next_status.status_from ORDER BY
                created_at FETCH FIRST 1 ROWS ONLY)
    END AS "first_contact"
FROM
    statuses_flow
    LEFT JOIN statuses_flow next_status ON statuses_flow.id = next_status.id
        AND statuses_flow.row_no + 1 = next_status.row_no
WHERE
    statuses_flow.status_type in('PENDING_SITE', 'PENDING_CALLCENTER', 'INCOMPLETE', 'REJECTED_CALLCENTER', 'REJECTED_SITE', 'CONSENTED')
ORDER BY
    statuses_flow.id,
    statuses_flow.status_from

This query collects and returns some metrics joining 3 tables about candidates being in a specific status during a medical study process.

The query in the real life is very slow and needs some adjustments to be written better.

The indexes are already applied on the original DB but still would be useful to find a better way.

The section of the query which I'm not convinced at this point is lines 29 to 42 from the DB fiddle link above.

I was wondering if by any chance could be changed to use a left join or another method to make a better performance.

It is returning per every candidate something as for example for one candidate the table below

id studyId siteId statusType statusFrom statusTo first_contact
1 Study1 Site1 INCOMPLETE 2021-07-20 09:30:52.101055+00 2021-07-20 09:31:53.568346+00 NULL
1 Study1 Site1 PENDING_CALLCENTER 2021-07-20 09:31:53.568346+00 2021-07-20 09:35:34.171876+00 2021-07-20 09:31:55.849+00
1 Study1 Site1 PENDING_SITE 2021-07-20 09:35:34.171876+00 2021-07-20 09:52:42.185163+00 2021-07-20 09:35:56.642+00
1 Study1 Site1 REJECTED_SITE 2021-07-20 09:53:08.874271+00 NULL NULL

To explain, the same candidate, in a range of time can be in different statuses and for example, Candidate 1 (for simplicity C1) in the table above, was in 4 statuses.

C1 stayed in incomplete status from/to and on this status period was not contacted so that why we have null on last column first_contact.

C1 was contacted in the next 2 statuses and we determine the first_contact checking from the activities table the minimum activity timestamps of the types PHONE, EMAIL, and SMS in the range of time C1 was in that specific status from/to. The first_contact is the first time a candidate was contacted by EMAIL PHONE or SMS.

On the last status, we have statusTo = null so also firstContact is null.

The result I'm getting now is correct but seems not efficient to me and probably can get better but have no confidence in the way to manage the change in this situation.

\$\endgroup\$
1
  • 2
    \$\begingroup\$ I improved as per comment the question I added the code embedded \$\endgroup\$
    – Jakub
    Commented Apr 1, 2022 at 16:46

1 Answer 1

1
\$\begingroup\$

Thing number 1 is to convert your CTE to a subquery, since until recently, CTEs impose an optimisation fence. I'm unclear on whether this will affect the version of PostgreSQL that you use. Converting to a subquery will require that you repeat it, which has advantages and disadvantages: obviously the code will be longer, but you can tailor each of the copies so that it only selects columns that you care about. In the case of next_status this will only be these:

        c.id,
        s.type AS status_type,
        s.timestamp AS status_from,
        row_number() OVER (ORDER BY s.candidate_id,
            s.timestamp) AS row_no

Don't in against a tuple of one item, as here:

        c.study_id in('INIT1')
        AND c.site_id in('Test1')

Instead just =.

Your dual row_number() with join is likely to be a source of slowness. Instead consider using the PostgreSQL lead/lag support in windowing functions. You would set up a window grouping values of candidates.id; within each group next_status is the row lead()ing the current row.

\$\endgroup\$
3
  • \$\begingroup\$ That was very useful I tried to do something like this if you can check and tell me that better dbfiddle.uk/… last query. The in is because in reality, we can have various studies and sites. But do you think using '=' is still better? \$\endgroup\$
    – Jakub
    Commented Apr 2, 2022 at 13:13
  • \$\begingroup\$ Sort of? You still have CTEs. I recommend trying your query with subqueries instead of CTEs. \$\endgroup\$
    – Reinderien
    Commented Apr 2, 2022 at 16:48
  • \$\begingroup\$ Will you be able to show me in a Pastebin or similar what you mean I'm still in the learning process and not sure about it \$\endgroup\$
    – Jakub
    Commented Apr 2, 2022 at 17:31

Not the answer you're looking for? Browse other questions tagged or ask your own question.