0

i have this Postgres query where i left join a couple of tables. This query runs for hours and causes issues. When I run explain analyse I see that the most time is spent in one of the left joins, for which optimiser selects Right Hash Join. When I use inner join instead and run explain analyse, optimiser selects a different plan and query finishes in minutes. I have to use left join because with inner join some data will be excluded. How should i rewrite the query to avoid this hash right join? Many thanks in advance!

Links to the EXPLAIN ANALYSE results are attached above. I am using PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit

Query:

WITH memberships AS (
    SELECT customer_sk
         , membership_sk
         , membership_state
         , membership_b2b_type
         , membership_sml_type
         , membership_start_date
         , membership_end_date
         , membership_pause_from
         , membership_pause_to
         , covid_pause_start_date
         , covid_pause_end_date
         , city_sk AS membership_city_region_sk
         , sport_persona_current
         , membership_cancellation_reason
         , membership_sequence_nr_reverse
         , company_sk
         , company_name
    FROM dwh.fact_membership
    WHERE membership_is_urban_sports IS TRUE
),
-- Data preparation
     request_cancellation AS (
         SELECT membership_sk,
                requested_cancellation_last_date
         FROM staging.request_cancellation
     ),
     blacklisted_emails AS (
         SELECT customer_sk, email, 'blacklisted' AS blacklisted
         FROM dwh_userdata.blacklist_emails
     ),
     nonanon_customer AS (
         SELECT id
              , first_name
              , last_name
              , email
         FROM dwh_userdata.customer
     ),
     nonanon_customer_address_prep AS (
         SELECT customer_id
              , city
              , state
              , country
              , zip
              , row_number() over (partition by customer_id order by created_at desc) as row_number
         FROM dwh_userdata.customer_address
     ),
     nonanon_customer_address AS (
         SELECT *
         FROM nonanon_customer_address_prep
         WHERE row_number = 1
     ),
     favorite_sport_category_prep_1 AS (
         SELECT membership_sk
              , service_top_category_name
              , count(DISTINCT booking_sk) as cnt_booking
         FROM dwh.report_venue_visitors
         WHERE booking_is_valid
         GROUP BY 1, 2
     ),
     favorite_sport_category_prep_2 AS (
         SELECT membership_sk
              , service_top_category_name
              , cnt_booking
              , row_number()
                over (partition by membership_sk order by cnt_booking DESC,service_top_category_name ) AS row_number
         FROM favorite_sport_category_prep_1
     ),
     favorite_sport_category AS (
         SELECT membership_sk
              , service_top_category_name AS favourite_sport_category
              , cnt_booking
         FROM favorite_sport_category_prep_2
         WHERE row_number = 1
     ),
     free_trial AS (
         select distinct membership_sk
                       , customer_sk
                       , trial_status     AS free_trial_status
                       , trial            AS free_trial_length
                       , trial_start_date AS free_trial_start
                       , trial_end_date   AS free_trial_end
         FROM dwh.report_memberships
         WHERE trial_status IS NOT NULL
           and trial_start_date >= '2020-06-23'
     )
-- #### OUTOPUT TABLE
SELECT c.customer_sk                                                AS named_user
     , CASE WHEN c.gender IN ('M', 'F') THEN c.gender ELSE NULL END AS gender
     , nc.first_name
     , nc.last_name
     , customer_language
     , anss.state                                                   AS newsletter_status
     , dl.city_name                                                 AS membership_city_region
     , dl.country_code                                              AS membership_country_code
     , dl.country_name                                              AS membership_country_name
     , dl.admin1                                                    AS membership_administrative_state
     , m.membership_sk
     , m.membership_state
     , m.membership_b2b_type
     , m.company_sk
     , m.company_name
     , m.membership_sml_type
     , CASE
           WHEN m.membership_start_date IS NOT NULL
               THEN CONCAT(TO_CHAR(m.membership_start_date, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS membership_start_date
     , CASE
           WHEN m.membership_end_date IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_end_date, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS membership_end_date
     , ft.free_trial_status
     , ft.free_trial_length
     , CASE
           WHEN ft.free_trial_start IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_start, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS free_trial_start
     , CASE
           WHEN ft.free_trial_end IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_end, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS free_trial_end
     , CASE
           WHEN m.membership_pause_from IS NOT NULL
               THEN CONCAT(TO_CHAR(m.membership_pause_from, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS membership_pause_from
     , CASE
           WHEN m.membership_pause_to IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_pause_to, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS membership_pause_to
     , CASE
           WHEN m.covid_pause_start_date IS NOT NULL THEN CONCAT(TO_CHAR(m.covid_pause_start_date, 'YYYY-MM-DD'),
                                                                 'T00:00:00')
           ELSE NULL END                                            AS covid_pause_start_date
     , CASE
           WHEN m.covid_pause_end_date IS NOT NULL
               THEN CONCAT(TO_CHAR(m.covid_pause_end_date, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS covid_pause_end_date
     , CASE
           WHEN rc.requested_cancellation_last_date IS NOT NULL THEN CONCAT(
                   TO_CHAR(rc.requested_cancellation_last_date, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS requested_cancellation_last_date
     , membership_cancellation_reason
     , be.blacklisted                                               AS blacklist_email
     , fsc.favourite_sport_category                                 AS fav_sports_category
     , m.sport_persona_current
     , ambd.membership_months_active
     , ambd.membership_months_total
     , ambd.is_gm1_positive
     , ambd.cnt_bookings_total
     , ambd.cnt_bookings_last_30_days_total
     , ambd.cnt_bookings_last_30_days_onsite
     , ambd.cnt_bookings_onsite
     , ambd.cnt_bookings_online
     , ambd.cnt_bookings_last_30_days_online
     , CASE
           WHEN ambd.latest_booking_date IS NOT NULL
               THEN CONCAT(TO_CHAR(ambd.latest_booking_date, 'YYYY-MM-DD'), 'T00:00:00')
           ELSE NULL END                                            AS latest_booking_date
     , ambd.avg_bookings_active_month
     , ambd.last_checkin_type
     , ambd.fav_sports_category_onsite
     , ambd.fav_sports_category_online
     , ambd.fav_studio_last_30_days
     , ambd.fav_studio_group_website
FROM dwh.dim_customer c
         LEFT JOIN nonanon_customer nc
                   ON nc.id = c.customer_sk
         LEFT JOIN nonanon_customer_address nca
                   ON nca.customer_id = customer_sk
         LEFT JOIN memberships m
                   ON c.customer_sk = m.customer_sk
                       AND membership_sequence_nr_reverse = 1
         LEFT JOIN request_cancellation rc
                   ON m.membership_sk = rc.membership_sk
         LEFT JOIN dwh.dim_location dl
                   ON m.membership_city_region_sk = dl.city_sk
         LEFT JOIN blacklisted_emails be
                   ON be.email = nc.email
         LEFT JOIN favorite_sport_category fsc
                   ON fsc.membership_sk = m.membership_sk
         LEFT JOIN staging.airship_newsletter_subscription_status anss
                   ON anss.customer_id = c.customer_sk
         LEFT JOIN free_trial ft
                   ON ft.customer_sk = m.customer_sk
         LEFT JOIN staging.airship_membership_booking_details ambd
                   ON ambd.membership_sk = m.membership_sk
                       AND membership_sequence_nr_reverse = 1
WHERE be.blacklisted IS NULL
  AND nc.email NOT LIKE '%delete%'
  AND nc.email IS NOT NULL
  AND ((m.membership_sk IS NULL AND anss.state = 'subscribed') OR membership_state IS NOT NULL)

Results of EXPLAIN ANALYSE:

Hash Left Join  (cost=6667580.77..6764370.56 rows=3256 width=692) (actual time=4319030.909..4328353.358 rows=518825 loops=1)
  Hash Cond: (fact_membership.customer_sk = ft.customer_sk)
  ->  Hash Left Join  (cost=6663581.42..6759951.96 rows=3256 width=380) (actual time=4318059.369..4324841.032 rows=518825 loops=1)
        Hash Cond: (fact_membership.membership_sk = ambd.membership_sk)
        Join Filter: (fact_membership.membership_sequence_nr_reverse = 1)
        ->  Hash Left Join  (cost=6655261.78..6748793.03 rows=3256 width=242) (actual time=4317733.942..4323056.862 rows=518825 loops=1)
              Hash Cond: (c.customer_sk = anss.customer_id)
              Filter: (((fact_membership.membership_sk IS NULL) AND (anss.state = 'subscribed'::text)) OR (fact_membership.membership_state IS NOT NULL))
              Rows Removed by Filter: 129098
              ->  Merge Left Join  (cost=6642237.84..6733674.25 rows=3256 width=227) (actual time=4317378.943..4321020.832 rows=647923 loops=1)
                    Merge Cond: (fact_membership.membership_sk = favorite_sport_category_prep_2.membership_sk)
                    ->  Sort  (cost=167496.47..167504.61 rows=3256 width=218) (actual time=4146517.144..4147134.144 rows=647923 loops=1)
                          Sort Key: fact_membership.membership_sk
                          Sort Method: external merge  Disk: 82352kB
                          ->  Merge Left Join  (cost=150681.68..167306.50 rows=3256 width=218) (actual time=4142397.925..4145027.017 rows=647923 loops=1)
                                Merge Cond: (c.customer_sk = nonanon_customer_address_prep.customer_id)
                                ->  Sort  (cost=59476.20..59484.34 rows=3256 width=218) (actual time=4139725.733..4140241.833 rows=647923 loops=1)
                                      Sort Key: c.customer_sk
                                      Sort Method: external merge  Disk: 82344kB
                                      ->  Hash Right Join  (cost=52983.04..59286.23 rows=3256 width=218) (actual time=33403.336..4135281.108 rows=647923 loops=1)
                                            Hash Cond: (request_cancellation.membership_sk = fact_membership.membership_sk)
                                            ->  Seq Scan on request_cancellation  (cost=0.00..5128.40 rows=308340 width=8) (actual time=1.160..228.691 rows=308340 loops=1)
                                            ->  Hash  (cost=52942.34..52942.34 rows=3256 width=214) (actual time=30038.787..30048.670 rows=647923 loops=1)
                                                  Buckets: 65536 (originally 4096)  Batches: 131072 (originally 1)  Memory Usage: 10511kB
                                                  ->  Gather  (cost=1064.24..52942.34 rows=3256 width=214) (actual time=11.564..12621.194 rows=647923 loops=1)
                                                        Workers Planned: 2
                                                        Workers Launched: 2
                                                        ->  Hash Left Join  (cost=64.24..51616.74 rows=1357 width=214) (actual time=5.510..22450.906 rows=215974 loops=3)
                                                              Hash Cond: (fact_membership.city_sk = dl.city_sk)
                                                              ->  Nested Loop Left Join  (cost=59.79..51608.59 rows=1357 width=191) (actual time=5.239..22013.464 rows=215974 loops=3)
                                                                    ->  Nested Loop  (cost=59.37..50428.72 rows=1357 width=60) (actual time=4.923..6958.191 rows=215974 loops=3)
                                                                          ->  Hash Left Join  (cost=58.94..49440.62 rows=1357 width=55) (actual time=3.419..2000.407 rows=215976 loops=3)
                                                                                Hash Cond: ((customer.email)::text = blacklist_emails.email)
                                                                                Filter: (('blacklisted'::text) IS NULL)
                                                                                Rows Removed by Filter: 122
                                                                                ->  Parallel Seq Scan on customer  (cost=0.00..46660.28 rows=271334 width=46) (actual time=0.999..1668.668 rows=216091 loops=3)
                                                                                      Filter: ((email IS NOT NULL) AND ((email)::text !~~ '%delete%'::text))
                                                                                      Rows Removed by Filter: 3191
                                                                                ->  Hash  (cost=34.53..34.53 rows=1953 width=54) (actual time=2.222..2.226 rows=1953 loops=3)
                                                                                      Buckets: 2048  Batches: 1  Memory Usage: 144kB
                                                                                      ->  Seq Scan on blacklist_emails  (cost=0.00..34.53 rows=1953 width=54) (actual time=0.263..1.207 rows=1953 loops=3)
                                                                          ->  Index Scan using customer_pk on dim_customer c  (cost=0.42..0.73 rows=1 width=13) (actual time=0.020..0.020 rows=1 loops=647929)
                                                                                Index Cond: (customer_sk = customer.id)
                                                                    ->  Index Scan using dwh_fact_membership_3b307128 on fact_membership  (cost=0.42..0.86 rows=1 width=131) (actual time=0.066..0.067 rows=1 loops=647923)
                                                                          Index Cond: (customer_sk = c.customer_sk)
                                                                          Filter: ((membership_is_urban_sports IS TRUE) AND (membership_sequence_nr_reverse = 1))
                                                                          Rows Removed by Filter: 0
                                                              ->  Hash  (cost=3.09..3.09 rows=109 width=35) (actual time=0.148..0.214 rows=109 loops=3)
                                                                    Buckets: 1024  Batches: 1  Memory Usage: 16kB
                                                                    ->  Seq Scan on dim_location dl  (cost=0.00..3.09 rows=109 width=35) (actual time=0.031..0.098 rows=109 loops=3)
                                ->  Materialize  (cost=91205.48..107807.50 rows=2553 width=4) (actual time=2668.900..3946.682 rows=470415 loops=1)
                                      ->  Subquery Scan on nonanon_customer_address_prep  (cost=91205.48..107801.12 rows=2553 width=4) (actual time=2666.188..3647.463 rows=470415 loops=1)
                                            Filter: (nonanon_customer_address_prep.row_number = 1)
                                            Rows Removed by Filter: 40218
                                            ->  WindowAgg  (cost=91205.48..101418.18 rows=510635 width=148) (actual time=2664.902..3526.361 rows=510633 loops=1)
                                                  ->  Sort  (cost=91205.48..92482.07 rows=510635 width=12) (actual time=2664.083..2833.676 rows=510634 loops=1)
                                                        Sort Key: customer_address.customer_id, customer_address.created_at DESC
                                                        Sort Method: external merge  Disk: 13032kB
                                                        ->  Seq Scan on customer_address  (cost=0.00..34063.35 rows=510635 width=12) (actual time=4.596..1522.444 rows=510635 loops=1)
                    ->  Materialize  (cost=6474741.37..6566128.10 rows=13051 width=13) (actual time=170857.053..173215.019 rows=465703 loops=1)
                          ->  Subquery Scan on favorite_sport_category_prep_2  (cost=6474741.37..6566095.47 rows=13051 width=13) (actual time=170855.731..173002.743 rows=465703 loops=1)
                                Filter: (favorite_sport_category_prep_2.row_number = 1)
                                Rows Removed by Filter: 1343535
                                ->  WindowAgg  (cost=6474741.37..6533469.01 rows=2610117 width=29) (actual time=170854.901..172755.674 rows=1809238 loops=1)
                                      ->  Sort  (cost=6474741.37..6481266.67 rows=2610117 width=21) (actual time=170853.124..171205.257 rows=1809238 loops=1)
                                            Sort Key: report_venue_visitors.membership_sk, (count(DISTINCT report_venue_visitors.booking_sk)) DESC, report_venue_visitors.service_top_category_name
                                            Sort Method: external merge  Disk: 63696kB
                                            ->  GroupAggregate  (cost=5839877.44..6063400.07 rows=2610117 width=21) (actual time=154838.978..169250.761 rows=1809238 loops=1)
                                                  Group Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
                                                  ->  Sort  (cost=5839877.44..5889232.80 rows=19742146 width=21) (actual time=154835.761..158654.645 rows=19827987 loops=1)
                                                        Sort Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
                                                        Sort Method: external merge  Disk: 694120kB
                                                        ->  Seq Scan on report_venue_visitors  (cost=0.00..2233036.56 rows=19742146 width=21) (actual time=1.868..117392.591 rows=19827987 loops=1)
                                                              Filter: booking_is_valid
                                                              Rows Removed by Filter: 6441170
              ->  Hash  (cost=7199.42..7199.42 rows=317242 width=19) (actual time=352.386..352.386 rows=317242 loops=1)
                    Buckets: 65536  Batches: 8  Memory Usage: 2606kB
                    ->  Seq Scan on airship_newsletter_subscription_status anss  (cost=0.00..7199.42 rows=317242 width=19) (actual time=1.120..154.407 rows=317242 loops=1)
        ->  Hash  (cost=4207.06..4207.06 rows=121006 width=150) (actual time=320.770..320.771 rows=121006 loops=1)
              Buckets: 32768  Batches: 8  Memory Usage: 3111kB
              ->  Seq Scan on airship_membership_booking_details ambd  (cost=0.00..4207.06 rows=121006 width=150) (actual time=1.446..107.525 rows=121006 loops=1)
  ->  Hash  (cost=3993.93..3993.93 rows=434 width=26) (actual time=951.259..951.264 rows=26392 loops=1)
        Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1760kB
        ->  Subquery Scan on ft  (cost=3981.99..3993.93 rows=434 width=26) (actual time=857.944..888.163 rows=26392 loops=1)
              ->  Unique  (cost=3981.99..3989.59 rows=434 width=30) (actual time=857.288..878.098 rows=26392 loops=1)
                    ->  Sort  (cost=3981.99..3983.08 rows=434 width=30) (actual time=856.675..863.298 rows=26392 loops=1)
                          Sort Key: report_memberships.membership_sk, report_memberships.customer_sk, report_memberships.trial_status, report_memberships.trial, report_memberships.trial_start_date, report_memberships.trial_end_date
                          Sort Method: quicksort  Memory: 2830kB
                          ->  Bitmap Heap Scan on report_memberships  (cost=2256.96..3962.98 rows=434 width=30) (actual time=102.229..817.152 rows=26392 loops=1)
                                Recheck Cond: ((trial_start_date >= '2020-06-23'::date) AND (trial_status IS NOT NULL))
                                Heap Blocks: exact=1383
                                ->  BitmapAnd  (cost=2256.96..2256.96 rows=434 width=0) (actual time=99.478..99.479 rows=0 loops=1)
                                      ->  Bitmap Index Scan on dwh_report_memberships_bc76fe51  (cost=0.00..578.02 rows=31145 width=0) (actual time=7.497..7.497 rows=26392 loops=1)
                                            Index Cond: (trial_start_date >= '2020-06-23'::date)
                                      ->  Bitmap Index Scan on dwh_report_memberships_35525e76  (cost=0.00..1678.48 rows=90406 width=0) (actual time=91.704..91.704 rows=92029 loops=1)
                                            Index Cond: (trial_status IS NOT NULL)
Planning Time: 7.850 ms
Execution Time: 4328700.854 ms
6
  • Please consider reading this advice
    – mustaccio
    Commented Sep 7, 2022 at 11:46
  • Seeing the EXPLAIN ANALYZE would be helpful.
    – J.D.
    Commented Sep 7, 2022 at 12:06
  • 1
    @J.D. thank you for your message. it is attached to the links in the post. here is it again: explain.tensor.ru/archive/explain/…
    – Daria
    Commented Sep 7, 2022 at 12:26
  • @mustaccio thank you for your reply, I attached all the required info
    – Daria
    Commented Sep 7, 2022 at 12:27
  • Please don't post to multiple forums, especially without indicating you have done so. Now the conversation will be fragmented and disjointed.
    – jjanes
    Commented Sep 7, 2022 at 16:23

1 Answer 1

1

You can wrap your query in a function and do

ALTER PROCEDURE procname/funcname SET enable_hashjoin = off;

as suggested in Disabling hash join in a procedure to disable it for that function/procedure.

Not the answer you're looking for? Browse other questions tagged or ask your own question.