
On a PostgreSQL 11 database server I have a large table (932 million rows, 2150 GB, not partitioned) into which I insert about 2-3 million rows every night with a batch job. The table was created with autovacuum_enabled=FALSE. After the insert I perform a VACUUM (ANALYZE) on the table, which usually takes about 30-40 seconds. However, once every few weeks this vacuum takes very long, such as today, when it took 2 hours 11 minutes.
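
For reference, the setup looks roughly like this (the column list and the staging source are simplified placeholders; the table name matches the log output further below):

-- Table created with autovacuum disabled at the table level
CREATE TABLE analytics.a_box_events (
    event_id   bigint,
    node_id    bigint,
    event_time timestamptz NOT NULL,
    payload    jsonb
) WITH (autovacuum_enabled = false);

-- Nightly batch step, simplified
INSERT INTO analytics.a_box_events
SELECT * FROM analytics.staging_box_events;   -- hypothetical staging table, 2-3 million rows per night

VACUUM (ANALYZE) analytics.a_box_events;      -- usually 30-40 seconds, occasionally hours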

I am now considering running a vacuum ahead of my nightly job so as to avoid these long run times during the job itself. I had a look at the options described in the PostgreSQL 11 VACUUM documentation page. I know that a VACUUM (FULL) is very expensive in terms of runtime and is thus probably out of the question. Will a VACUUM (FREEZE) be useful in my case? The other option, VACUUM (DISABLE_PAGE_SKIPPING), does not seem relevant to my needs.


Edit (2021-09-27): Looking back at the past 270 days (2021-01-01 to 2021-09-27), the vacuum took 16-60 seconds on most days. However, on 9 days it exceeded 1000 seconds, and on 4 of those it even exceeded 3500 seconds (see graph below). What is interesting is the gap between these outliers, which lies between 25 and 35 days. It looks like something builds up over the span of those 25-35 days, necessitating a longer vacuum. Quite possibly this is the transaction ID wraparound mentioned in the comments below.

VACUUM performance graph
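
If something really does build up over those 25-35 days, it should be visible in how old the table's relfrozenxid gets between the slow runs: once age(relfrozenxid) exceeds vacuum_freeze_table_age (150 million transactions by default), the next VACUUM performs an aggressive scan that has to visit every page that is not yet all-frozen. A rough ad-hoc check (not part of the batch job):

SELECT c.oid::regclass                                    AS table_name,
       age(c.relfrozenxid)                                AS xid_age,
       current_setting('vacuum_freeze_table_age')::bigint AS freeze_table_age
FROM pg_class AS c
WHERE c.oid = 'analytics.a_box_events'::regclass;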


Edit (2022-02-08): In September 2021 we modified our script by changing:

VACUUM (ANALYZE)

to:

ANALYZE (VERBOSE)
VACUUM (FREEZE, VERBOSE)

Unfortunately, the roughly once-monthly long runs of this job step still continue. However, by splitting the work into separate analyze and vacuum steps, and thanks to the verbose output, we are now much better able to tell what is going on.

On a fast day the output of the analyze and vacuum steps looks like this:

INFO:  analyzing "analytics.a_box_events"
INFO:  "a_box_events": scanned 30000 of 152881380 pages, containing 264294 live rows and 0 dead rows; 30000 rows in sample, 1346854382 estimated total rows
ANALYZE
INFO:  aggressively vacuuming "analytics.a_box_events"
INFO:  "a_box_events": found 0 removable, 3407161 nonremovable row versions in 406302 out of 152881380 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108425108
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 152475078 frozen pages.
0 pages are entirely empty.
CPU: user: 4.53 s, system: 3.76 s, elapsed: 25.77 s.
INFO:  aggressively vacuuming "pg_toast.pg_toast_1361926"
INFO:  "pg_toast_1361926": found 0 removable, 3700 nonremovable row versions in 680 out of 465143 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108425110
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 464463 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.05 s.
VACUUM

The analyze has no timing output, and the vacuum on the main table takes around 20-40 seconds. Analyze and vacuum together usually finish in under a minute.

Once every 35-45 days, the vacuum takes much longer, as several minutes are spent vacuuming each of the table's 39 indexes. There are two variations to this:

  1. No index row versions to be removed

Example output:

INFO:  index "idx_a_box_events_node_id" now contains 1347181817 row versions in 4038010 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 5.44 s, system: 13.45 s, elapsed: 230.59 s.

The entire vacuum's final output then looks like this:

INFO:  "a_box_events": found 0 removable, 2837554 nonremovable row versions in 340887 out of 153111143 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 108429069
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 152770256 frozen pages.
0 pages are entirely empty.
CPU: user: 669.94 s, system: 870.59 s, elapsed: 10501.17 s.
  2. Index row versions to be removed

Example output:

INFO:  scanned index "idx_a_box_events_node_id" to remove 2524 row versions
DETAIL:  CPU: user: 49.34 s, system: 11.42 s, elapsed: 198.63 s

and:

INFO:  index "idx_a_box_events_node_id" now contains 1228052362 row versions in 3478524 pages
DETAIL:  2524 index row versions were removed.
6 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

The entire vacuum's final output then looks like this:

INFO:  "a_box_events": found 56 removable, 3851482 nonremovable row versions in 461834 out of 139126225 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 97583006
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 138664391 frozen pages.
0 pages are entirely empty.
CPU: user: 2367.93 s, system: 761.34 s, elapsed: 8138.76 s.

Such long run times are currently more a nuisance than a show-stopper. An upgrade to the latest Postgres server version is planned for late 2023. At that time we will try the INDEX_CLEANUP option of VACUUM, running it once every weekend, to avoid the costly cleanup of the indexes in the daily VACUUM (see the sketch below). In the meantime we apparently have no choice but to be patient.
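
In sketch form, the split might look like this (INDEX_CLEANUP is available from PostgreSQL 12 onward; these statements are a plan, not yet in production):

-- Daily, after the bulk insert: freeze the newly inserted rows but skip the index passes
VACUUM (FREEZE, INDEX_CLEANUP OFF, VERBOSE) analytics.a_box_events;

-- Once every weekend: let VACUUM clean up the indexes as well
VACUUM (FREEZE, VERBOSE) analytics.a_box_events;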

Edit (2022-08-24): We have, in the meantime, partitioned our large table into monthly slices, giving us currently 56 partitions, with about 75-90 GB for the table and 45-55 GB for the indexes per partition. Once a night new data is added to the current month's partition, followed by an ANALYZE (VERBOSE) (we don't do any more daily VACUUM). On the first Sunday of a month we additionally do a VACUUM (FULL, ANALYZE) of the previous month's partition.
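
The layout is roughly as follows (the table was re-created as a partitioned table; the partition key column, partition names, and bounds shown here are illustrative):

-- Parent table, partitioned by month
CREATE TABLE analytics.a_box_events (
    event_id   bigint,
    node_id    bigint,
    event_time timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);

-- One partition per month, e.g. for August 2022
CREATE TABLE analytics.a_box_events_2022_08
    PARTITION OF analytics.a_box_events
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');

-- Nightly, after loading into the current month's partition
ANALYZE (VERBOSE) analytics.a_box_events_2022_08;

-- First Sunday of the month, on the previous month's partition
VACUUM (FULL, ANALYZE) analytics.a_box_events_2022_07;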

As can be expected, the daily ANALYZE (VERBOSE) on a single partition is pretty quick, averaging around 16-18 seconds per run, building from nearly 0 seconds on the first day of a month to a maximum toward the end of a month, after which the following INSERT and ANALYZE are applied to a new partition and are again faster:

ANALYZE performance graph

  • You normally do VACUUM (ANALYZE), presumably to update the table statistics. VACUUM (FREEZE) does not do that, so why do you think it might be useful?
    – mustaccio
    Commented Sep 15, 2021 at 13:21
  • If you do it at night in the background, why do you care if it occasionally takes 2 hours and 11 minutes? If it has 2 hours of work that needs to be done, when is it that you want that work to occur? Why would it be better if it instead took 1 hour before the bulk load and 1 hour after it? You can switch it to VACUUM (ANALYZE, VERBOSE) and then look through the log to see what step is taking a long time.
    – jjanes
    Commented Sep 15, 2021 at 18:45
  • @mustaccio I know all the other vacuum options, but didn't understand what VACUUM (FREEZE) does, hence my question. From your reply I take it that this is not relevant for me either.
    – Robert
    Commented Sep 16, 2021 at 13:41
  • @jjanes: My nightly job starts right after midnight UTC and usually takes around 6 hours, so that when we get to work at 9am (=0700 UTC) we have fresh data. Thus time is critical. A pre-step to do a vacuum could run at 2100 or 2200 UTC, not affecting the nightly job that starts at 0002 UTC.
    – Robert
    Commented Sep 16, 2021 at 13:46

2 Answers


My theory on what is happening here is that for an INSERT-only table, vacuum only needs to visit the parts of the table dirtied by the inserts, and can skip visiting the indexes altogether.

But if it finds even one dead tuple (in v11), it needs to scan the entirety of all the indexes, and that can take a long time. I'm surprised it takes over 2 hours, but that is a mighty big table, so the indexes would be mighty big as well. Successful INSERT and COPY don't generate any dead tuples, but even one UPDATE or DELETE does, as does any INSERT or COPY that was rolled back. Does this theory seem compatible with what you are seeing? Do you ever have any DELETE or UPDATE of even one tuple, or ever have an INSERT that fails?
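
You can check that without waiting for the next slow run: the cumulative statistics for the table give a rough picture of whether any UPDATEs, DELETEs, or dead tuples show up between your nightly loads, e.g. (using the table name from the log output in the question):

SELECT n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup,
       last_vacuum, last_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'analytics' AND relname = 'a_box_events';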

You could run a VACUUM at 9pm; if that one does have to scan all the indexes, then the one you run after the bulk load won't have to do it again, unless the bulk load itself is what generates the dead tuples.

But what if you instead just run ANALYZE, then VACUUM, separately and in that order, after the bulk load? Is the problem that a long VACUUM is using too much IO when you start work and slows things down, or is the problem that a long VACUUM means the ANALYZE has not yet finished and so you get bad plans at the start of the work day?

Another problem is that if you are not doing anything to freeze parts of the table, then once an anti-wraparound VACUUM does kick in, you could be in for a very bad time. The first VACUUM FREEZE you do is going to generate an enormous amount of IO and take a long time, but after that it will only have to freeze small parts of the table at a time and so should be manageable. The nice thing about VACUUM FREEZE, unlike the index-scanning part of a VACUUM, is that if it is interrupted it doesn't lose all the work it has done so far, as the pages recorded as frozen in the visibility map can be skipped next time. So you can start a VACUUM FREEZE and just cancel it if it seems to be causing problems, or if you are leaving your desk and won't be around to monitor it for a while. Then once you are caught up, you can change the morning post-bulk-load VACUUM to a VACUUM FREEZE.
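
Concretely, something along these lines (table name taken from the question's log output; adjust to your setup):

-- One-time catch-up, run whenever convenient; safe to cancel and restart later,
-- since pages already marked all-frozen in the visibility map are skipped next time
VACUUM (FREEZE, VERBOSE) analytics.a_box_events;

-- Afterwards, the nightly post-bulk-load step becomes:
ANALYZE analytics.a_box_events;
VACUUM (FREEZE) analytics.a_box_events;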

  • Indeed, this table has 35 indexes, ranging in size from 21 GB to 55 GB. The nightly job only does one INSERT, with 2-3 million rows, never UPDATE, never DELETE, so there are (or should be) no dead tuples. In recent weeks the job has been running very stably, without errors and thus without rollbacks. Very occasionally we perform manual UPDATEs (never DELETEs), but the last time we did this was many weeks ago.
    – Robert
    Commented Sep 17, 2021 at 11:21
  • Right now I have a VACUUM (FREEZE) running on this table. Assuming this finishes within today (there are still more than 12 hours till midnight UTC), I will make the following changes to the nightly job: 1. Change the post-INSERT VACUUM (ANALYZE) to ANALYZE followed by VACUUM (FREEZE). 2. Include the VERBOSE option in both in order to get diagnostic information that helps in determining what is going on. I will then observe the job behaviour for a few days.
    – Robert
    Commented Sep 17, 2021 at 11:23
  • The VACUUM (FREEZE) on this table finished in just over 5 hours. A subsequent VACUUM (ANALYZE) took only 18 seconds. All but one page (presumably the current one) are now frozen. I have made the above changes to the batch job (separate ANALYZE followed by VACUUM (FREEZE)) and will next observe performance over the coming few days.
    – Robert
    Commented Sep 17, 2021 at 16:55
  • @robert It always scans the last page of the table, even if that page is marked frozen. That is partially my "fault", see git.postgresql.org/gitweb/…
    – jjanes
    Commented Sep 17, 2021 at 19:35
  • I have now observed the performance of the nightly vacuum for the past 10 days, all of which were fast (24-45 seconds). It will be interesting to see what will happen when the 25-35 days are over, after which in the past it took longer than 1000 seconds (see edit incl. graph in my original post). Only if the vacuum then still remains fast can I claim that the problem is solved.
    – Robert
    Commented Sep 27, 2021 at 8:11

This may well be the periodic freezing needed to prevent transaction ID wraparound, which must scan and freeze all rows inserted since the last such run.

You should change your nightly run to VACUUM (FREEZE) to distribute that load, particularly if it is an insert-only table, where proactive freezing doesn't expend unnecessary resources. See the documentation for a detailed discussion of the necessity to freeze tuples.

The alternative is to upgrade to PostgreSQL v13, which has added autovacuum based on the number of INSERTs.
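
On v13 or later you could then re-enable autovacuum for the table and let the insert-based trigger take care of it; for example (the thresholds are only a starting point and need tuning):

ALTER TABLE analytics.a_box_events SET (
    autovacuum_enabled = true,
    autovacuum_vacuum_insert_threshold = 1000000,   -- run autovacuum after ~1 million inserted rows
    autovacuum_vacuum_insert_scale_factor = 0
);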

  • An upgrade to PostgreSQL 13 is out of the question, unfortunately. The table is indeed an insert-only table as it is part of an analytics system: during the nightly job run, rows are inserted, then during the day only reads happen on the table. I still haven't understood what a VACUUM (FREEZE) does, though. Also, would I do this before the job starts or after the inserts are finished?
    – Robert
    Commented Sep 16, 2021 at 13:50
  • I have added a documentation link. I'd run it after loading data, but as long as it runs regularly, that should be fine.
    Commented Sep 16, 2021 at 14:10
