
I have a PDF that has some blank pages inserted. These pages are the background colour (grey in this case). I would like to remove these pages using a bash script.

It has been suggested that we can scan for text using e.g. pdftotext, but in my case this finds no text even on the non-blank pages.

9 Answers


Thank you for the code, gmatht. I have modified it to check page coverage using Ghostscript and delete pages with coverage below a threshold (0.1%).

#!/bin/sh
IN="$1"
filename=$(basename "${IN}")
filename="${filename%.*}"
PAGES=$(pdfinfo "$IN" | grep ^Pages: | tr -dc '0-9')

non_blank() {
    for i in $(seq 1 $PAGES)
    do
        PERCENT=$(gs -o - -dFirstPage=${i} -dLastPage=${i} -sDEVICE=inkcov "$IN" | grep CMYK | awk 'BEGIN { sum=0; } { sum += $1 + $2 + $3 + $4; } END { printf "%.5f\n", sum }')
        if [ $(echo "$PERCENT > 0.001" | bc) -eq 1 ]
        then
            echo $i
            #echo $i 1>&2
        fi
        echo -n . 1>&2
    done | tee "$filename.tmp"
    echo 1>&2
}

set +x
pdftk "${IN}" cat $(non_blank) output "$filename.pdf"
  • I had to add a -a in the fifth line, between grep and ^Pages:, since on my system the pdfinfo output is for whatever reason considered binary. Commented May 12, 2020 at 13:40
  • The inkcov device gave me very small differences for scanned documents. The ink_cov device worked much better for me (similar question)
    – rbs
    Commented Mar 19, 2021 at 17:08
  • You may need to change the name of the output file, since currently it's the same as the input file and pdftk refuses to overwrite it Commented Dec 4, 2023 at 13:44
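To see what the script's threshold test is doing, here is the CMYK-summing step isolated on a fabricated coverage line (the sample values are made up for illustration; real lines come from gs -o - -sDEVICE=inkcov file.pdf, one per page):

```shell
# One inkcov-style line per page: four coverage values followed by "CMYK OK".
# Sum the channels and classify against the 0.001 (0.1%) threshold.
sample=' 0.00000  0.00000  0.00000  0.02170 CMYK OK'
printf '%s\n' "$sample" |
    awk '/CMYK/ { s = $1 + $2 + $3 + $4
                  print (s > 0.001) ? "non-blank" : "blank" }'
```

Using awk for the comparison also avoids the dependency on bc.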

There does not appear to be a utility to remove blank pages from PDFs, but we can create a histogram of colours using the convert command from ImageMagick. Blank pages will have only one entry, which can be detected with wc. Once we have a list of non-blank pages, we can feed it to pdftk.

Note that ImageMagick numbers pages starting from 0, so we need to adjust for this. We can use a low value for the -density flag to improve performance (though too low a value seems to make ImageMagick segfault).

If we call the following script pdf_rm_blank.sh, running pdf_rm_blank.sh A will create A.rm.pdf from A.pdf.

#!/bin/sh
IN="$1"
PAGES=$(pdfinfo "$IN.pdf" | grep ^Pages: | tr -dc '0-9')

non_blank() {
    for i in $(seq 1 $PAGES)
    do
        if [ $(convert -density 35 "$IN.pdf[$((i-1))]" -define histogram:unique-colors=true -format %c histogram:info:- | wc -l) -ne 1 ]
        then
            echo $i
            #echo $i 1>&2
        fi
        echo -n . 1>&2
    done | tee out.tmp
    echo 1>&2
}

set +x
pdftk "$IN.pdf" cat $(non_blank) output "$IN.rm.pdf"
  • I was just going to suggest using print-to-PDF from any number of tools, then making sure to unselect the blank pages. However, this seems nicer. Commented Sep 8, 2017 at 10:50

As most of the answers show, there doesn't seem to be one tool to get this done. I had to patch together some existing tools, taking inspiration from the use of gs in @Antony's answer.

I found I couldn't automate this from top to bottom; it needed some fine-tuning along the way. The scripts below take a directory and batch-operate on all PDF files. I ended up with 3 distinct steps:

  1. Gather some per-page metrics about each page in the PDFs using Ghostscript's ink_cov output device (I found ink_cov's mean percent values much more useful than the "percent of pixels with non-zero channels" values returned by inkcov):

    #!/usr/bin/env bash
    
    device="ink_cov"
    out="/tmp/pdf_trim/analysis.txt"
    
    [ "$#" -eq 1 ] || { echo "Target directory required as argument"; exit 1; }
    in="$(realpath "$1")"
    [ -f "$out" ] && rm "$out" || mkdir -p "$(dirname "$out")"
    pushd "$in"
    find . -name '*.pdf' | while read p; do
      gs -o - -sDEVICE="$device" "$p" | grep CMYK | grep -n '' | \
        sed 's/:/ /; s|^|'$in' '$(echo "$p" | sed 's|^\./||')' |' | \
        tee -a "$out"
    done
    

    Example usage:

    ./script_1 ./target_folder
    
  2. Fiddle around with different "blank-page" criteria based on the metrics reported by Ghostscript, until you find a "good one" for your PDFs. A good criterion should sort your pages from blank to non-blank.

    #!/usr/bin/env bash
    
    criteria='$4+$5+$6+$7'
    
    in="/tmp/pdf_trim/analysis.txt"
    out="/tmp/pdf_trim/criteria"
    
    tmp_over="/tmp/pdf_trim/over.pdf"
    tmp_under="/tmp/pdf_trim/under.pdf"
    
    # Apply the criteria to each line, and then sort
    with_criteria="$(cat "$in" | awk '{ print $0, '$criteria' }' | \
      sort -n -k 10 | tee "$out.txt")"
    
    # Create an overlay pdf with the criteria values printed
    echo "$with_criteria" | awk '{printf "%s: %s p%s\n", $10, $2, $3 }' | \
      enscript --no-header --font Courier-Bold18 --lines-per-page 1 -o - | \
      ps2pdf - "$tmp_over"
    
    # Create an underlay pdf with the sorted pages by generating PDFtk handle lists
    handles="$(paste -d ' ' \
      <(echo "$with_criteria" | grep -n '' | sed 's/:.*//' | tr '0-9' 'A-Z') \
      <(echo "$with_criteria"))"
    pushd "$1"
    pdftk $(echo "$handles" | awk '{ printf "%s=%s/%s ", $1, $2, $3 }') \
      cat $(echo "$handles" | awk '{ printf "%s%s ", $1, $4}') \
      output "$tmp_under"
    
    # Merge them into the final result & remove temporary files
    pdftk "$tmp_over" multibackground "$tmp_under" output "$out.pdf"
    rm "$tmp_over" "$tmp_under"
    

    Example usage (this will create a criteria.pdf file with pages sorted by your chosen criteria):

    ./script_2
    
  3. Batch re-generate each of the PDFs in a new directory, minus the blank pages:

    #!/usr/bin/env bash
    
    threshold=1.59
    
    input="/tmp/pdf_trim/criteria.txt"
    
    [ "$#" -eq 1 ] || { echo "Output directory required as argument"; exit 1; }
    out="$(realpath "$1")"
    
    in_list="$(cat $input)"
    out_list="$(cat "$input" | awk '$10 >'$threshold' {print}' | \
      sort -k 2,2 -k 3,3n)"
    
    in_files="$(echo "$in_list" | cut -d ' ' -f 1,2 | sort -u )"
    out_files="$(echo "$out_list" | cut -d ' ' -f 1,2 | sort -u)"
    
    echo "$out_files" | while read f; do
      dest="$(echo "$f" | sed 's|[^ ]* |'$out'/|; s/\.pdf$/_trimmed\.pdf/')"
      echo "$dest"
      mkdir -p "$(dirname "$dest")"
      pdftk "$(echo "$f" | sed 's| |/|')" \
        cat $(echo "$out_list" | grep "$f" | cut -d ' ' -f 3 | tr '\n' ' ' | \
          sed 's/ $//') \
        output "$dest"
    done
    
    printf "\nTrimmed %s pages with criteria value below %s\n" \
      "$(($(echo "$in_list" | wc -l) - $(echo "$out_list" | wc -l)))" "$threshold"
    printf "All pages were skipped from the following files:\n%s\n" \
      "$(comm -23 <(echo "$in_files") <(echo "$out_files") | sed 's/^/\t/; s| |/|')"
    

    Example usage:

    ./script_3 ./output_directory
    

I've written a detailed post about this, with more information and explanations about each of the steps.


Here is a simplified version of Antony's answer, targeting POSIX sh:

remove_blank_pages() {
    pdf="$1"

    # Render the PDF with Ghostscript, sending output to the ink coverage device.
    # For each page, the ink coverage device prints the coverage ratio for C, M, Y, and K.
    ink_coverage=$(gs -q -o - -sDEVICE=inkcov "$pdf")

    num_pages=$(printf '%s\n' "$ink_coverage" | wc -l)

    # If any of the four channels have a nonzero average, consider this page nonblank.
    non_blank_pages=$(printf '%s\n' "$ink_coverage" |
        awk '$1 + $2 + $3 + $4 > 0 {printf("%d ", NR) }')
    num_non_blank_pages=$(echo "$non_blank_pages" | wc -w)

    if [ "$num_pages" -ne "$num_non_blank_pages" ]; then
        # We will not be quoting the page parameter.
        # We do this not to permit globbing (hence disabling it),
        # but to allow splitting by IFS (hence leaving that alone).
        set -f
        # shellcheck disable=SC2086
        pdftk "$pdf" cat $non_blank_pages output temp.pdf verbose
        set +f
        mv temp.pdf "$pdf"
    fi
}

You could surely tweak the threshold in the awk statement; for my use case, looking for fully blank pages sufficed.

If you want to run pdftk unconditionally, you could simplify the code even further, since the page count would not need to be recorded.
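As an illustration of such a tweak, the zero test can be replaced with a small threshold. This sketch runs the filter on fabricated inkcov-style lines rather than real Ghostscript output, and the 0.001 cutoff is an assumption to tune per document:

```shell
# Three fabricated per-page coverage lines; keep only pages whose summed
# CMYK coverage exceeds the threshold. NR is the 1-based page number.
ink_coverage=' 0.00000  0.00000  0.00000  0.00000 CMYK OK
 0.10000  0.05000  0.00000  0.20000 CMYK OK
 0.00000  0.00000  0.00000  0.00050 CMYK OK'
non_blank_pages=$(printf '%s\n' "$ink_coverage" |
    awk '$1 + $2 + $3 + $4 > 0.001 { printf("%d ", NR) }')
echo "$non_blank_pages"    # only page 2 survives
```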


If you can assume that a page is empty when it contains no text, you can go with the following code. If you have PDFs whose pages contain only charts, images, etc., this won't work, I think.

First, use xpdf/pdftotext to extract the text of the PDF. Detect page breaks via the form-feed character 0x0C. To remove the empty pages, use pdftk to cat all non-empty pages into a new PDF.

  /** page break constant in pdf */
  private static final String PAGEBREAK = new String(new byte[] { 0x0C });
  /** dummy string for an empty page */
  private static final String EMPTY_PAGE = "EMPTY_PAGE";

  /**
   * @param pdfIn
   * @param pdfOut
   * @param txt --> contains pdf2txt output of tool xpdf/pdftotext.exe
   * @return
   * @throws Exception
   */
  private static byte[] removeEmptyPages(File pdfIn, File pdfOut, String txt) throws Exception {
    // replace "page break" with some dummytext+"page break"
    txt = txt.replace(PAGEBREAK, EMPTY_PAGE + PAGEBREAK);
    StringTokenizer tokenizer = new StringTokenizer(txt, PAGEBREAK);
    int pageCounter = 0;
    String pagesWithContent = "";
    boolean foundEmptyPage = false;
    String currentPage = null;
    while (tokenizer.hasMoreTokens()) {
      currentPage = tokenizer.nextToken();
      pageCounter++;
      if (currentPage.equals(EMPTY_PAGE)) {
        foundEmptyPage = true;
      } else {
        pagesWithContent += (pageCounter + " ");
      }
    }

    if (foundEmptyPage) {
      String pdfShellCmd = "..\\tools\\pdftk\\bin\\pdftk.exe \"$IN\" cat $PAGES output \"$OUT\"";
      String cmd = pdfShellCmd.replace("$IN", pdfIn.toString());
      cmd = cmd.replace("$OUT", pdfOut.toString());
      cmd = cmd.replace("$PAGES", pagesWithContent);

      int resultCode = executeShellCmd(cmd);
      if (0 == resultCode) {
        return FileTools.readFile(pdfOut).array();
      } else {
        throw new Exception("Result code for " + cmd + " was " + resultCode);
      }
    } else {
      // if no empty pages, return input file
      copyFile(pdfIn, pdfOut);
      return read(pdfOut);
    }
  }
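The same form-feed trick can be sketched in plain shell. Here a literal string stands in for the output of pdftotext file.pdf - (assumption: pdftotext separates pages with 0x0C, and the empty record below plays the role of a blank page):

```shell
# Split extracted text on form feeds (0x0C) and print the 1-based
# indices of the non-empty pages: exactly the list pdftk's cat expects.
extracted='first page text\fsecond page\f\ffourth page'
printf "$extracted" |
    awk 'BEGIN { RS = "\f" }
         NF > 0 { printf("%d ", NR) }'
# page 3 is empty, so pages 1, 2, and 4 are printed
```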

Below is an AppleScript that uses pdftk and file size to skip blank pages. Simply determine the file size of a blank grey page, then modify the byte count below to exclude those pages.

copy (choose file with prompt "Select files to combine" with multiple selections allowed) to myfiles
set mylist to {}
copy (do shell script "date +%Y_%m_%d-%H%M") to mydate
set mytime to time of (current date)

repeat with my1stfile in myfiles
    set my1stfilePOSIX to POSIX path of my1stfile
    set myquotedPOSIX to quoted form of POSIX path of my1stfile
    copy (getparentfolder(my1stfile)) to myfolder
    
    set mysize to (do shell script "stat -f%z " & myquotedPOSIX)
    if mysize as integer > 885 then
        copy (myquotedPOSIX & " ") to end of mylist
    end if
end repeat

set myname to "combined" -- base name for the output; adjust as needed (the original script never defines myname)
set myoutput to myfolder & myname & "-" & mydate & "-" & mytime & ".pdf"
set myscript to "/usr/local/bin/pdftk  " & mylist & " cat output " & quoted form of myoutput

copy (do shell script myscript) to myanswer

tell application "Finder"
    set theItem to (POSIX file myoutput) as alias
    reveal theItem
    activate
end tell

on getparentfolder(myfolderpath)
    tell application "Finder"
        set parentpath to (parent of (myfolderpath) as string)
    end tell
    return POSIX path of parentpath
end getparentfolder
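For those not on macOS, the size heuristic translates to plain shell. This is a sketch under the same assumption that blank pages land in files below a known byte count (885 here, taken from the script above); wc -c is used instead of stat because stat's flags differ between BSD and GNU:

```shell
# Keep only PDFs above the blank-page byte threshold; the survivors can
# then be passed to e.g. `pdftk ... cat output combined.pdf`.
threshold=885
keep=""
for f in *.pdf; do
    size=$(wc -c < "$f")
    if [ "$size" -gt "$threshold" ]; then
        keep="$keep $f"
    fi
done
echo "kept:$keep"
```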

None of the above solved my problem. This works:

https://github.com/nklb/remove-blank-pages

My environment was Kali Linux 2023.3

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review
    – Destroy666
    Commented Dec 30, 2023 at 0:29

My intention is to first slice the original PDF into small PDFs, then remove the blanks from them, and then merge them back together.

######++++++start blankRemover.sh ++++++++++

#!/bin/sh
IN="${1}"  #Input PDF File Name
OUT="${2}"   #Output PDF File Name
filename=$(basename "${IN}")
filename="${filename%.*}"
echo "filename: ${filename}"
PAGES=$(pdfinfo "$IN" | grep ^Pages: | tr -dc '0-9')

ranges="myout ; "   # dummy prefix; everything before the ';' is discarded below
for i in $(seq 1 $PAGES)
do
        pdftk "${1}" cat $i output "${1}-temp.pdf"
        pdftotext "${1}-temp.pdf";
        # blank.txt must contain the pdftotext output of a known blank page
        cmp -s blank.txt "${1}-temp.txt"; if test $? -ne 0; then ranges="${ranges}$i "; fi
        rm -f "${1}-temp.pdf" "${1}-temp.txt"
done
#echo "$ranges"
ranges=`echo ${ranges%?} | cut -d ';' -f 2`
pdftk "${1}" cat $ranges output "${2}"

#######++++++++ end blankRemover.sh +++++++++++++++++++++++++++

#######++++++++ start blankRemover-main.sh ++++++++++++++++++++

#!/bin/sh
IN="${1}"   #Input PDF File Name
OUT="${2}"  #Output PDF File Name 
sizeDefine=$(expr $3)    # pages per slice
filename=$(basename "${IN}")
filename="${filename%.*}"
echo "filename: ${filename}"
PAGES=$(pdfinfo "$IN" | grep ^Pages: | tr -dc '0-9')

actualSize=${PAGES}
start=1
currentMin=$start
currentMax=$sizeDefine
tempNumPdf=1
echo $actualSize
echo $currentMin
echo $currentMax
echo $tempNumPdf
while test  $currentMax -le $actualSize
do
        pdftk "${IN}" cat $(expr $currentMin)"-"$(expr $currentMax) output "temp-$tempNumPdf.pdf"
        currentMin=$(expr $currentMax + 1 )
        currentMax=$(expr $currentMax + $sizeDefine )
        tempNumPdf=$(expr $tempNumPdf + 1 )
        echo "actualSize: $actualSize"
        echo "currentMin: $currentMin"
        echo "currentMax: $currentMax"
        echo "tempNumPdf: $tempNumPdf"
done
echo "while done"

if test $currentMin -le $actualSize
then
        pdftk "${IN}" cat $currentMin-$actualSize  output "temp-$tempNumPdf.pdf"
else
        tempNumPdf=$(expr $tempNumPdf - 1 )
fi

for i in $(seq 1 $tempNumPdf)
do
        ./blankRemover.sh "temp-$i.pdf" "temp_update_$i.pdf"
done
echo "for done"
rm -f temp-*.pdf 
pdftk temp_update_*.pdf cat output "${2}"
rm -f temp_update_*.pdf

#######++++++++ end blankRemover-main.sh ++++++++++++++++++++++

On Linux I did this (with an old dictionary of more than 1500 pages, over a hundred of them blank):

Extract all pages as images or PDFs, note the very small file size of the blank ones, and remove them all. Then create a new PDF from the remaining files.

To extract pages there are many ways, for example as pdf:

pdftk FILE burst

as png:

pdftoppm INPUT OUTPUT -png

pdfimages INPUT OUTPUT -png

To join pages:

pdftk INPUT cat output OUTPUT.pdf
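Putting the size idea together after pdftk FILE burst (which names pages pg_0001.pdf, pg_0002.pdf, ...), here is a sketch with an assumed 2000-byte cutoff (calibrate it against one of your known blank pages before deleting anything):

```shell
# List burst pages smaller than the cutoff; -size with a 'c' suffix
# measures in bytes. Inspect the list, then swap -print for -delete
# and re-join the survivors:
find . -name 'pg_*.pdf' -size -2000c -print
# find . -name 'pg_*.pdf' -size -2000c -delete
# pdftk pg_*.pdf cat output OUTPUT.pdf
```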
