
Why do I not succeed in making this simple pdf → raster image → pdf round trip file-size-stable?

$ # Get original file (8KB, just extracted from my scanner).
$ curl "https://nextcloud.mbb.cnrs.fr/nextcloud/s/Rgd4qgmt5mGdifR/download?file=scan.pdf" -o scan.pdf
$ du scan.pdf
8       scan.pdf

$ # Extract image from the file.
$ pdfimages -list scan.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1653  2338  gray    1   8  jpeg   no         6  0   200   200 7012B 0.2%

$ pdfimages -all scan.pdf extract
$ du extract-000.jpg
56      extract-000.jpg # Much bigger than the 7012B stream embedded in the PDF.

$ # Convert back into a pdf.
$ magick extract-000.jpg back.pdf
$ du back.pdf
32      back.pdf # Much larger than the original.
$ pdfimages -list back.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1653  2338  gray    1   8  jpeg   no         8  0   200   200 28.2K 0.7%

What's happening? What are the determinants of file size in these three files?

  • scan.pdf: the original, straight from the scanner (8 KB).
  • extract-000.jpg: extracted with a vanilla pdfimages command (56 KB).
  • back.pdf: a vanilla conversion back with magick (32 KB).

Can I control them? Can I make back.pdf the same size as scan.pdf without losing image quality?

(The ultimate goal is to crop the image before converting back to PDF; my question originates from my attempts, where cropping surprisingly increased the result's size instead of decreasing it.)


1 Answer


The original scan.pdf (a Sharp "Scanned Image PDF") seems to contain a double-filter-encoded image (/Filter [/FlateDecode /DCTDecode]): it was natively scanned as a very poor-quality JPEG (20%) and then zipped, presumably to make it optimally small!
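
One way to check this yourself (not necessarily how it was worked out here) is to dump the image object that pdfimages -list reports as 6 0; with mupdf-tools installed, something like the following should print its dictionary, including the /Filter chain:

$ # Dump object 6, the image reported by pdfimages -list above (requires mupdf-tools).
$ mutool show scan.pdf 6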

Normally the aim when scanning is not to degrade the image at the scanning stage, so better lossless compression (LZW) OR the highest JPEG quality (100%) would be preferred.

During extraction the image stream is unzipped, yielding the original poor-quality JPEG, which at over 50 KB is itself noticeably large.
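
As a rough sanity check of that claim (the result will not match the embedded stream exactly, since the scanner's deflate settings are unknown), you can deflate the extracted JPEG again and compare against the 7012B stream size reported by pdfimages -list:

$ # Print the raw JPEG size and its zlib-recompressed size, side by side.
$ python3 -c "import zlib; d = open('extract-000.jpg', 'rb').read(); print(len(d), len(zlib.compress(d, 9)))"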

My extraction comes out at 52.4 KB (53,746 bytes); depending on how that is re-saved as a PDF, it could become 56.6 KB (58,006 bytes).

On re-saving that 20%-quality JPEG at 100%, it may thus be recoded in another editor to 34.8 KB (35,711 bytes)!

This seems close to your findings.
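
If you do rebuild the PDF with magick, you can at least pin the re-encode quality explicitly rather than relying on its default; this is still a lossy re-encode, and the quality value below is only illustrative:

$ # Force JPEG compression at a chosen quality when writing the PDF.
$ magick extract-000.jpg -compress jpeg -quality 40 back.pdf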

The problem seems to be at the source, where that image should have been better quality.

Depending on the scanner manufacturer and the user's choices, a similar scan can, as a PDF, be anywhere between 0.8 KB and 8.0 MB. So the right choices have drastic effects on size and performance.

If round-tripping is the aim, then a standard high-quality JPEG in a PDF will retain exactly the same data, byte for byte, without degradation.
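
One way to achieve exactly that, as an alternative to magick (assuming img2pdf is installed), is to wrap the extracted JPEG into a PDF container without re-encoding it, so the image bytes stay identical:

$ # Embed the JPEG stream as-is; only the PDF wrapper is added.
$ img2pdf extract-000.jpg -o back.pdf
$ du back.pdf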

So any edited version is no longer going to be the source image, but you can replace it with something better.

Presuming the aim is to trim the edges down to 1600 × 2300, the result will be much smaller, i.e. under 6 KB when converted to JPX:

/Subtype/Image
/Width 1600/Height 2300
/ColorSpace/DeviceRGB
/BitsPerComponent 8
/Filter/JPXDecode
/Length 5735

Here, as a temporary download, is that PDF: https://filetransfer.io/data-package/c7fZ73kl#link

NOTE: not all PDF readers may display that JPX image. Legacy SumatraPDF was the builder, and my current Chromium (PDFium) Edge v125 does not show the heart unless switched back to the legacy I.E. Acrobat Reader DC plug-in. However, my current "Powered by Acrobat" Edge v128 does.
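
If you would rather stay with plain JPEG than switch to JPX, a lossless crop is also possible: jpegtran (from libjpeg-turbo) can trim a JPEG without re-encoding the DCT data, and the result can then be wrapped back into a PDF as above. The crop geometry below is only illustrative, and jpegtran will snap it to the JPEG block grid:

$ # Crop without re-encoding, then re-embed without re-encoding.
$ jpegtran -crop 1600x2300+0+0 -outfile cropped.jpg extract-000.jpg
$ img2pdf cropped.jpg -o cropped.pdf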

  • I'm not sure I understand everything, but at least the picture is getting clearer. How were you able to figure out all that information? For instance, how could you tell how the original JPEG was compressed within the original scan?
    – iago-lito
    Commented Jun 13 at 15:02
