i have this Postgres query where i left join a couple of tables. This query runs for hours and causes issues. When I run explain analyse I see that the most time is spent in one of the left joins, for which optimiser selects Right Hash Join. When I use inner join instead and run explain analyse, optimiser selects a different plan and query finishes in minutes. I have to use left join because with inner join some data will be excluded. How should i rewrite the query to avoid this hash right join? Many thanks in advance!
Links to the EXPLAIN ANALYSE results are attached above. I am using PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Query:
WITH memberships AS (
SELECT customer_sk
, membership_sk
, membership_state
, membership_b2b_type
, membership_sml_type
, membership_start_date
, membership_end_date
, membership_pause_from
, membership_pause_to
, covid_pause_start_date
, covid_pause_end_date
, city_sk AS membership_city_region_sk
, sport_persona_current
, membership_cancellation_reason
, membership_sequence_nr_reverse
, company_sk
, company_name
FROM dwh.fact_membership
WHERE membership_is_urban_sports IS TRUE
),
-- Data preparation
request_cancellation AS (
SELECT membership_sk,
requested_cancellation_last_date
FROM staging.request_cancellation
),
blacklisted_emails AS (
SELECT customer_sk, email, 'blacklisted' AS blacklisted
FROM dwh_userdata.blacklist_emails
),
nonanon_customer AS (
SELECT id
, first_name
, last_name
, email
FROM dwh_userdata.customer
),
nonanon_customer_address_prep AS (
SELECT customer_id
, city
, state
, country
, zip
, row_number() over (partition by customer_id order by created_at desc) as row_number
FROM dwh_userdata.customer_address
),
nonanon_customer_address AS (
SELECT *
FROM nonanon_customer_address_prep
WHERE row_number = 1
),
favorite_sport_category_prep_1 AS (
SELECT membership_sk
, service_top_category_name
, count(DISTINCT booking_sk) as cnt_booking
FROM dwh.report_venue_visitors
WHERE booking_is_valid
GROUP BY 1, 2
),
favorite_sport_category_prep_2 AS (
SELECT membership_sk
, service_top_category_name
, cnt_booking
, row_number()
over (partition by membership_sk order by cnt_booking DESC,service_top_category_name ) AS row_number
FROM favorite_sport_category_prep_1
),
favorite_sport_category AS (
SELECT membership_sk
, service_top_category_name AS favourite_sport_category
, cnt_booking
FROM favorite_sport_category_prep_2
WHERE row_number = 1
),
free_trial AS (
select distinct membership_sk
, customer_sk
, trial_status AS free_trial_status
, trial AS free_trial_length
, trial_start_date AS free_trial_start
, trial_end_date AS free_trial_end
FROM dwh.report_memberships
WHERE trial_status IS NOT NULL
and trial_start_date >= '2020-06-23'
)
-- #### OUTOPUT TABLE
SELECT c.customer_sk AS named_user
, CASE WHEN c.gender IN ('M', 'F') THEN c.gender ELSE NULL END AS gender
, nc.first_name
, nc.last_name
, customer_language
, anss.state AS newsletter_status
, dl.city_name AS membership_city_region
, dl.country_code AS membership_country_code
, dl.country_name AS membership_country_name
, dl.admin1 AS membership_administrative_state
, m.membership_sk
, m.membership_state
, m.membership_b2b_type
, m.company_sk
, m.company_name
, m.membership_sml_type
, CASE
WHEN m.membership_start_date IS NOT NULL
THEN CONCAT(TO_CHAR(m.membership_start_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_start_date
, CASE
WHEN m.membership_end_date IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_end_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_end_date
, ft.free_trial_status
, ft.free_trial_length
, CASE
WHEN ft.free_trial_start IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_start, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS free_trial_start
, CASE
WHEN ft.free_trial_end IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_end, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS free_trial_end
, CASE
WHEN m.membership_pause_from IS NOT NULL
THEN CONCAT(TO_CHAR(m.membership_pause_from, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_pause_from
, CASE
WHEN m.membership_pause_to IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_pause_to, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_pause_to
, CASE
WHEN m.covid_pause_start_date IS NOT NULL THEN CONCAT(TO_CHAR(m.covid_pause_start_date, 'YYYY-MM-DD'),
'T00:00:00')
ELSE NULL END AS covid_pause_start_date
, CASE
WHEN m.covid_pause_end_date IS NOT NULL
THEN CONCAT(TO_CHAR(m.covid_pause_end_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS covid_pause_end_date
, CASE
WHEN rc.requested_cancellation_last_date IS NOT NULL THEN CONCAT(
TO_CHAR(rc.requested_cancellation_last_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS requested_cancellation_last_date
, membership_cancellation_reason
, be.blacklisted AS blacklist_email
, fsc.favourite_sport_category AS fav_sports_category
, m.sport_persona_current
, ambd.membership_months_active
, ambd.membership_months_total
, ambd.is_gm1_positive
, ambd.cnt_bookings_total
, ambd.cnt_bookings_last_30_days_total
, ambd.cnt_bookings_last_30_days_onsite
, ambd.cnt_bookings_onsite
, ambd.cnt_bookings_online
, ambd.cnt_bookings_last_30_days_online
, CASE
WHEN ambd.latest_booking_date IS NOT NULL
THEN CONCAT(TO_CHAR(ambd.latest_booking_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS latest_booking_date
, ambd.avg_bookings_active_month
, ambd.last_checkin_type
, ambd.fav_sports_category_onsite
, ambd.fav_sports_category_online
, ambd.fav_studio_last_30_days
, ambd.fav_studio_group_website
FROM dwh.dim_customer c
LEFT JOIN nonanon_customer nc
ON nc.id = c.customer_sk
LEFT JOIN nonanon_customer_address nca
ON nca.customer_id = customer_sk
LEFT JOIN memberships m
ON c.customer_sk = m.customer_sk
AND membership_sequence_nr_reverse = 1
LEFT JOIN request_cancellation rc
ON m.membership_sk = rc.membership_sk
LEFT JOIN dwh.dim_location dl
ON m.membership_city_region_sk = dl.city_sk
LEFT JOIN blacklisted_emails be
ON be.email = nc.email
LEFT JOIN favorite_sport_category fsc
ON fsc.membership_sk = m.membership_sk
LEFT JOIN staging.airship_newsletter_subscription_status anss
ON anss.customer_id = c.customer_sk
LEFT JOIN free_trial ft
ON ft.customer_sk = m.customer_sk
LEFT JOIN staging.airship_membership_booking_details ambd
ON ambd.membership_sk = m.membership_sk
AND membership_sequence_nr_reverse = 1
WHERE be.blacklisted IS NULL
AND nc.email NOT LIKE '%delete%'
AND nc.email IS NOT NULL
AND ((m.membership_sk IS NULL AND anss.state = 'subscribed') OR membership_state IS NOT NULL)
Results of EXPLAIN ANALYSE:
Hash Left Join (cost=6667580.77..6764370.56 rows=3256 width=692) (actual time=4319030.909..4328353.358 rows=518825 loops=1)
Hash Cond: (fact_membership.customer_sk = ft.customer_sk)
-> Hash Left Join (cost=6663581.42..6759951.96 rows=3256 width=380) (actual time=4318059.369..4324841.032 rows=518825 loops=1)
Hash Cond: (fact_membership.membership_sk = ambd.membership_sk)
Join Filter: (fact_membership.membership_sequence_nr_reverse = 1)
-> Hash Left Join (cost=6655261.78..6748793.03 rows=3256 width=242) (actual time=4317733.942..4323056.862 rows=518825 loops=1)
Hash Cond: (c.customer_sk = anss.customer_id)
Filter: (((fact_membership.membership_sk IS NULL) AND (anss.state = 'subscribed'::text)) OR (fact_membership.membership_state IS NOT NULL))
Rows Removed by Filter: 129098
-> Merge Left Join (cost=6642237.84..6733674.25 rows=3256 width=227) (actual time=4317378.943..4321020.832 rows=647923 loops=1)
Merge Cond: (fact_membership.membership_sk = favorite_sport_category_prep_2.membership_sk)
-> Sort (cost=167496.47..167504.61 rows=3256 width=218) (actual time=4146517.144..4147134.144 rows=647923 loops=1)
Sort Key: fact_membership.membership_sk
Sort Method: external merge Disk: 82352kB
-> Merge Left Join (cost=150681.68..167306.50 rows=3256 width=218) (actual time=4142397.925..4145027.017 rows=647923 loops=1)
Merge Cond: (c.customer_sk = nonanon_customer_address_prep.customer_id)
-> Sort (cost=59476.20..59484.34 rows=3256 width=218) (actual time=4139725.733..4140241.833 rows=647923 loops=1)
Sort Key: c.customer_sk
Sort Method: external merge Disk: 82344kB
-> Hash Right Join (cost=52983.04..59286.23 rows=3256 width=218) (actual time=33403.336..4135281.108 rows=647923 loops=1)
Hash Cond: (request_cancellation.membership_sk = fact_membership.membership_sk)
-> Seq Scan on request_cancellation (cost=0.00..5128.40 rows=308340 width=8) (actual time=1.160..228.691 rows=308340 loops=1)
-> Hash (cost=52942.34..52942.34 rows=3256 width=214) (actual time=30038.787..30048.670 rows=647923 loops=1)
Buckets: 65536 (originally 4096) Batches: 131072 (originally 1) Memory Usage: 10511kB
-> Gather (cost=1064.24..52942.34 rows=3256 width=214) (actual time=11.564..12621.194 rows=647923 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Left Join (cost=64.24..51616.74 rows=1357 width=214) (actual time=5.510..22450.906 rows=215974 loops=3)
Hash Cond: (fact_membership.city_sk = dl.city_sk)
-> Nested Loop Left Join (cost=59.79..51608.59 rows=1357 width=191) (actual time=5.239..22013.464 rows=215974 loops=3)
-> Nested Loop (cost=59.37..50428.72 rows=1357 width=60) (actual time=4.923..6958.191 rows=215974 loops=3)
-> Hash Left Join (cost=58.94..49440.62 rows=1357 width=55) (actual time=3.419..2000.407 rows=215976 loops=3)
Hash Cond: ((customer.email)::text = blacklist_emails.email)
Filter: (('blacklisted'::text) IS NULL)
Rows Removed by Filter: 122
-> Parallel Seq Scan on customer (cost=0.00..46660.28 rows=271334 width=46) (actual time=0.999..1668.668 rows=216091 loops=3)
Filter: ((email IS NOT NULL) AND ((email)::text !~~ '%delete%'::text))
Rows Removed by Filter: 3191
-> Hash (cost=34.53..34.53 rows=1953 width=54) (actual time=2.222..2.226 rows=1953 loops=3)
Buckets: 2048 Batches: 1 Memory Usage: 144kB
-> Seq Scan on blacklist_emails (cost=0.00..34.53 rows=1953 width=54) (actual time=0.263..1.207 rows=1953 loops=3)
-> Index Scan using customer_pk on dim_customer c (cost=0.42..0.73 rows=1 width=13) (actual time=0.020..0.020 rows=1 loops=647929)
Index Cond: (customer_sk = customer.id)
-> Index Scan using dwh_fact_membership_3b307128 on fact_membership (cost=0.42..0.86 rows=1 width=131) (actual time=0.066..0.067 rows=1 loops=647923)
Index Cond: (customer_sk = c.customer_sk)
Filter: ((membership_is_urban_sports IS TRUE) AND (membership_sequence_nr_reverse = 1))
Rows Removed by Filter: 0
-> Hash (cost=3.09..3.09 rows=109 width=35) (actual time=0.148..0.214 rows=109 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 16kB
-> Seq Scan on dim_location dl (cost=0.00..3.09 rows=109 width=35) (actual time=0.031..0.098 rows=109 loops=3)
-> Materialize (cost=91205.48..107807.50 rows=2553 width=4) (actual time=2668.900..3946.682 rows=470415 loops=1)
-> Subquery Scan on nonanon_customer_address_prep (cost=91205.48..107801.12 rows=2553 width=4) (actual time=2666.188..3647.463 rows=470415 loops=1)
Filter: (nonanon_customer_address_prep.row_number = 1)
Rows Removed by Filter: 40218
-> WindowAgg (cost=91205.48..101418.18 rows=510635 width=148) (actual time=2664.902..3526.361 rows=510633 loops=1)
-> Sort (cost=91205.48..92482.07 rows=510635 width=12) (actual time=2664.083..2833.676 rows=510634 loops=1)
Sort Key: customer_address.customer_id, customer_address.created_at DESC
Sort Method: external merge Disk: 13032kB
-> Seq Scan on customer_address (cost=0.00..34063.35 rows=510635 width=12) (actual time=4.596..1522.444 rows=510635 loops=1)
-> Materialize (cost=6474741.37..6566128.10 rows=13051 width=13) (actual time=170857.053..173215.019 rows=465703 loops=1)
-> Subquery Scan on favorite_sport_category_prep_2 (cost=6474741.37..6566095.47 rows=13051 width=13) (actual time=170855.731..173002.743 rows=465703 loops=1)
Filter: (favorite_sport_category_prep_2.row_number = 1)
Rows Removed by Filter: 1343535
-> WindowAgg (cost=6474741.37..6533469.01 rows=2610117 width=29) (actual time=170854.901..172755.674 rows=1809238 loops=1)
-> Sort (cost=6474741.37..6481266.67 rows=2610117 width=21) (actual time=170853.124..171205.257 rows=1809238 loops=1)
Sort Key: report_venue_visitors.membership_sk, (count(DISTINCT report_venue_visitors.booking_sk)) DESC, report_venue_visitors.service_top_category_name
Sort Method: external merge Disk: 63696kB
-> GroupAggregate (cost=5839877.44..6063400.07 rows=2610117 width=21) (actual time=154838.978..169250.761 rows=1809238 loops=1)
Group Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
-> Sort (cost=5839877.44..5889232.80 rows=19742146 width=21) (actual time=154835.761..158654.645 rows=19827987 loops=1)
Sort Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
Sort Method: external merge Disk: 694120kB
-> Seq Scan on report_venue_visitors (cost=0.00..2233036.56 rows=19742146 width=21) (actual time=1.868..117392.591 rows=19827987 loops=1)
Filter: booking_is_valid
Rows Removed by Filter: 6441170
-> Hash (cost=7199.42..7199.42 rows=317242 width=19) (actual time=352.386..352.386 rows=317242 loops=1)
Buckets: 65536 Batches: 8 Memory Usage: 2606kB
-> Seq Scan on airship_newsletter_subscription_status anss (cost=0.00..7199.42 rows=317242 width=19) (actual time=1.120..154.407 rows=317242 loops=1)
-> Hash (cost=4207.06..4207.06 rows=121006 width=150) (actual time=320.770..320.771 rows=121006 loops=1)
Buckets: 32768 Batches: 8 Memory Usage: 3111kB
-> Seq Scan on airship_membership_booking_details ambd (cost=0.00..4207.06 rows=121006 width=150) (actual time=1.446..107.525 rows=121006 loops=1)
-> Hash (cost=3993.93..3993.93 rows=434 width=26) (actual time=951.259..951.264 rows=26392 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1760kB
-> Subquery Scan on ft (cost=3981.99..3993.93 rows=434 width=26) (actual time=857.944..888.163 rows=26392 loops=1)
-> Unique (cost=3981.99..3989.59 rows=434 width=30) (actual time=857.288..878.098 rows=26392 loops=1)
-> Sort (cost=3981.99..3983.08 rows=434 width=30) (actual time=856.675..863.298 rows=26392 loops=1)
Sort Key: report_memberships.membership_sk, report_memberships.customer_sk, report_memberships.trial_status, report_memberships.trial, report_memberships.trial_start_date, report_memberships.trial_end_date
Sort Method: quicksort Memory: 2830kB
-> Bitmap Heap Scan on report_memberships (cost=2256.96..3962.98 rows=434 width=30) (actual time=102.229..817.152 rows=26392 loops=1)
Recheck Cond: ((trial_start_date >= '2020-06-23'::date) AND (trial_status IS NOT NULL))
Heap Blocks: exact=1383
-> BitmapAnd (cost=2256.96..2256.96 rows=434 width=0) (actual time=99.478..99.479 rows=0 loops=1)
-> Bitmap Index Scan on dwh_report_memberships_bc76fe51 (cost=0.00..578.02 rows=31145 width=0) (actual time=7.497..7.497 rows=26392 loops=1)
Index Cond: (trial_start_date >= '2020-06-23'::date)
-> Bitmap Index Scan on dwh_report_memberships_35525e76 (cost=0.00..1678.48 rows=90406 width=0) (actual time=91.704..91.704 rows=92029 loops=1)
Index Cond: (trial_status IS NOT NULL)
Planning Time: 7.850 ms
Execution Time: 4328700.854 ms
EXPLAIN ANALYZE
would be helpful.