
I recently upgraded a Postgres database from version 9.3 to 14.5 (yes, quite a jump). The upgrade seemed successful, but I am running into an issue with a particular query (see the end of this post).

The query returns in about 3 seconds on the old database, but seems to hang indefinitely on the new version of Postgres. In pg_stat_activity I see that 3 processes are spawned for this query, two of which are in the MessageQueueSend wait_event. After about 10 minutes, one of the processes in MessageQueueSend is gone.
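
For reference, this is roughly how I watched the backend and its workers (a minimal sketch; the leader_pid column only exists from version 13 onwards):

-- show the leader backend and any parallel workers, with their wait events
SELECT pid, leader_pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;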

I tried playing with the following settings (see the sketch after this list for how I checked them), but without any improvement:

  • dynamic_shared_memory_type
  • shared_memory_type
  • min_dynamic_shared_memory
  • effective_io_concurrency
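
For completeness, a sketch of how I inspected them; note that only effective_io_concurrency can be changed per session, the others require a postgresql.conf change and a restart:

-- current values
SHOW dynamic_shared_memory_type;
SHOW shared_memory_type;
SHOW min_dynamic_shared_memory;
SHOW effective_io_concurrency;
-- effective_io_concurrency can be changed per session for experimenting;
-- the value 200 here is just an example
SET effective_io_concurrency = 200;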

Any hints as to where to look would be much appreciated. I cannot find any other question that involves getting stuck on the MessageQueueSend wait_event, so I hope someone can help me here.

Edit:

At the suggestion of @jjanes, I set max_parallel_workers_per_gather to zero, but the query still got stuck (now without the two worker processes; just the master process, albeit in active state). After running strace on the pid of the master process, I found it was stuck doing lseek(41, 0, SEEK_END). File descriptor 41 corresponds to a table that the user running the query has no permissions on. Running the same query as an admin user returns immediately, but with 0 rows (I was expecting more than that). This makes me wonder whether the upgrade corrupted some tables.
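
For anyone wanting to reproduce that mapping: reading the target of /proc/<pid>/fd/41 gives a relation file path of the form base/<database_oid>/<filenode>, and the filenode can then be resolved back to a table name with something like the query below (a sketch; the filenode value 16423 is made up for illustration):

-- resolve a filenode (the last path component under base/<db_oid>/) to a relation;
-- the first argument is the tablespace OID, 0 meaning the database's default tablespace
SELECT pg_filenode_relation(0, 16423);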

I guess as a follow-up question: is there a way to validate the integrity of a database once it has been upgraded?

EXPLAIN output of old database version (returns in 3 seconds): https://pastebin.com/ybXMW4b4

EXPLAIN (ANALYZE, BUFFERS) output of old database version (returns in 3 seconds): https://pastebin.com/Yh3BXked

EXPLAIN output of new database version (does not return): https://pastebin.com/K51sRifF

Query in question:

SELECT trackpoint.device_info_serial,
       trackpoint.date_time,
       message.battery_voltage,
       message.memory_usage,
       trackpoint.latitude,
       trackpoint.longitude,
       sessionbirdspecies.project_id,
       sessionbirdspecies.track_session_id,
       sessionbirdspecies.individual_id,
       sessionbirdspecies.tracker_id
FROM
  (SELECT tracksession.key_name,
          tracksession.project_id,
          tracksession.tracker_id,
          tracksession.individual_id,
          tracksession.track_session_id,
          tracksession.start_date,
          tracksession.end_date,
          tracksession.device_info_serial,
          bird.ring_number,
          bird.sex,
          bird_s.english_name,
          bird_s.latin_name
   FROM gps.ee_track_session_limited AS tracksession
   LEFT JOIN gps.ee_individual_limited AS bird ON tracksession.individual_id = bird.individual_id
   LEFT JOIN gps.ee_species_limited AS bird_s ON bird.species_latin_name=bird_s.latin_name) AS sessionbirdspecies
INNER JOIN gps.ee_sms_position_limited AS trackpoint ON sessionbirdspecies.device_info_serial = trackpoint.device_info_serial
LEFT JOIN gps.ee_sms_message_limited AS message ON trackpoint.id=message.id
WHERE 1=1
  AND trackpoint.date_time >= '2022-10-01 00:00'
  AND trackpoint.date_time < '2022-11-01 14:01'
ORDER BY trackpoint.date_time;
  • Sounds like Parallel Query run amok. You can turn it off with set max_parallel_workers_per_gather TO 0;, or try to figure out what is going wrong with it using EXPLAIN.
    – jjanes
    Commented Dec 21, 2022 at 18:26
  • Thanks for the comment. I will update my post with my findings.
    – JimmyJumbo
    Commented Dec 21, 2022 at 23:28
  • @JimmyJumbo, I'm having the same issue after upgrading the Postgres database version from 10.6 to 13.9. Did you find a solution or workaround?
    – Jay Desai
    Commented Jan 20, 2023 at 23:36

2 Answers


I fixed the issue by running the following command on the server where I upgraded the Postgres database:

VACUUM VERBOSE ANALYZE;
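
A likely reason this helps: pg_upgrade does not carry over optimizer statistics, so the planner works without statistics until ANALYZE has run. If a database-wide VACUUM ANALYZE is too heavy, analyzing just the relations involved in the slow query is a lighter-weight alternative; a sketch using the names from the question (assuming they are tables rather than views):

-- refresh planner statistics only for the relations used in the slow query
ANALYZE VERBOSE gps.ee_track_session_limited;
ANALYZE VERBOSE gps.ee_individual_limited;
ANALYZE VERBOSE gps.ee_species_limited;
ANALYZE VERBOSE gps.ee_sms_position_limited;
ANALYZE VERBOSE gps.ee_sms_message_limited;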


We have a set of RDS PostgreSQL 12 instances that were recently upgraded to PostgreSQL 13, and right after the upgrades we noticed poor performance with some larger queries: some that used to take around 15 minutes to complete would time out after 3 hours of execution after the upgrade, on a 2xlarge instance (8 vCPUs and 32 GB of RAM).

After finding this post, I tested setting max_parallel_workers_per_gather to 0 (it was at the default value of 2), which proved to be the solution for us: the same report that previously took 15 minutes, and over 3 hours after upgrading, now executes in just over 4 minutes with the parallel setting disabled.
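
For reference, the change itself; a session-level SET is enough for testing, while on RDS the persistent value lives in the instance's parameter group (the ALTER DATABASE line below is just a sketch, with mydb as a placeholder name):

-- disable parallel query for the current session only
SET max_parallel_workers_per_gather = 0;
-- or persist it for a single database (mydb is a placeholder)
ALTER DATABASE mydb SET max_parallel_workers_per_gather = 0;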

A ticket was raised with AWS to see if they have any explanation for the issue, which I'll add here to help others.

Regards, Lauro Ojeda

  • Hi Lauro, did you get a response from the AWS team?
    – Rj_N
    Commented Dec 27, 2023 at 4:38
