Continuing the question "What's the best way to view changes that emerged due to upgrades of TeX Live?", let's assume we have two files, old.pdf and new.pdf, obtained from the same LaTeX sources (a two-column paper with small fonts, some maths, and a few figures) but with slightly different installations of TeX Live 2020. Here's some data on the old file:
$ pdfinfo old.pdf
Title: …
Subject:
Keywords:
Author: …
Creator: LaTeX with hyperref
Producer: pdfTeX-1.40.21
CreationDate: Sat Jan 1 03:21:23 2022 CET
ModDate: Sat Jan 1 03:21:23 2022 CET
Custom Metadata: yes
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 41
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 873455 bytes
Optimized: no
PDF version: 1.5
Judging by the file date, the much earlier pdfTeX version, and some insider knowledge, old.pdf was generated by a stock Debian or Ubuntu TeX Live distribution (which lags behind the version from TUG at any time) on January 1, 2022. We no longer have these distributions and cannot easily regenerate the file old.pdf. (One option would be to try to install older Debian and/or Ubuntu versions somewhere, which would bring in a completely new set of issues – and even then it would be unknown whether we get the same TeX Live installations and outputs as on 2022-01-01. Is there anywhere to grab the latest Debian/Ubuntu live image that either already has stock TeX Live packages with pdfTeX-1.40.21 or allows for installing stock TeX Live packages with pdfTeX-1.40.21?)
The file new.pdf has just been produced from the same old sources using TeX Live 2020 final (which, following David's comment, is the last TeX Live version holding pdfTeX-1.40.21 -- please correct us if we're wrong here):
$ pdfinfo new.pdf
Title: …
Subject:
Keywords:
Author: …
Creator: LaTeX with hyperref
Producer: pdfTeX-1.40.21
CreationDate: Sun Dec 10 19:54:05 2023 CET
ModDate: Sun Dec 10 19:54:05 2023 CET
Custom Metadata: yes
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 41
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 1104321 bytes
Optimized: no
PDF version: 1.5
There are no differences between these two pieces of metadata except the dates and the file sizes. There are, however, differences in the fonts:
$ diff <(pdffonts old.pdf) <(pdffonts new.pdf)
3,60c3,59
< [none] Type 3 Custom yes no no 22 0
< [none] Type 3 Custom yes no no 23 0
< [none] Type 3 Custom yes no no 24 0
< [none] Type 3 Custom yes no no 25 0
< [none] Type 3 Custom yes no no 26 0
< [none] Type 3 Custom yes no no 27 0
< [none] Type 3 Custom yes no no 30 0
< [none] Type 3 Custom yes no no 31 0
< [none] Type 3 Custom yes no no 32 0
< [none] Type 3 Custom yes no no 33 0
< [none] Type 3 Custom yes no no 36 0
< HHFZSL+CMR10 Type 1 Builtin yes yes no 37 0
< AUZFDF+CMMI10 Type 1 Builtin yes yes no 70 0
< CHXOKT+CMSY10 Type 1 Builtin yes yes no 71 0
< EJRJQW+CMR7 Type 1 Builtin yes yes no 72 0
< [none] Type 3 Custom yes no no 74 0
< [none] Type 3 Custom yes no no 75 0
< [none] Type 3 Custom yes no no 76 0
< [none] Type 3 Custom yes no no 78 0
< [none] Type 3 Custom yes no no 113 0
< RQSHZS+CMTT10 Type 1 Builtin yes yes no 230 0
< UFOCKA+BBOLD10 Type 1 Builtin yes yes no 233 0
< WRFUCI+CMTI10 Type 1 Builtin yes yes no 235 0
< SCGGLO+CMMI7 Type 1 Builtin yes yes no 236 0
< [none] Type 3 Custom yes no no 247 0
< CYIPZU+MSAM10 Type 1 Builtin yes yes no 266 0
< NQPTHU+CMEX10 Type 1 Builtin yes yes no 268 0
< BLWZBZ+CMSY7 Type 1 Builtin yes yes no 269 0
< OYQULM+EUFM10 Type 1 Builtin yes yes no 270 0
< HCIFCV+CMR5 Type 1 Builtin yes yes no 271 0
< NLGAUZ+CMSS10 Type 1 Builtin yes yes no 272 0
< LUAFXG+rsfs10 Type 1 Builtin yes yes no 273 0
< HQDOIX+CMMI12 Type 1 Builtin yes yes no 283 0
< ZLOCQH+CMR8 Type 1 Builtin yes yes no 284 0
< LSMQVG+MSBM10 Type 1 Builtin yes yes no 305 0
< SFQIJW+CMSS9 Type 1 Builtin yes yes no 307 0
< ECGVGR+CMR9 Type 1 Builtin yes yes no 308 0
< BHDBNY+CMMI9 Type 1 Builtin yes yes no 309 0
< JIXOQS+CMSY9 Type 1 Builtin yes yes no 312 0
< [none] Type 3 Custom yes no no 332 0
< AUZEEC+TeX-mathb10 Type 1 Builtin yes yes no 377 0
< GZMRBQ+CMSS8 Type 1 Builtin yes yes no 410 0
< BNEURD+stmary10 Type 1 Builtin yes yes no 416 0
< LFVJTS+CMMI8 Type 1 Builtin yes yes no 425 0
< KKQPKP+CMMI5 Type 1 Builtin yes yes no 448 0
< MBBTWW+TeX-mathb7 Type 1 Builtin yes yes no 449 0
< GKIQMS+CMSY5 Type 1 Builtin yes yes no 450 0
< BHHCMF+CMSY8 Type 1 Builtin yes yes no 455 0
< ORHOJI+CMR6 Type 1 Builtin yes yes no 456 0
< [none] Type 3 Custom yes no no 617 0
< IPQGLW+BBOLD7 Type 1 Builtin yes yes no 658 0
< WWCTIW+CMEX7 Type 1 Builtin yes yes no 678 0
< [none] Type 3 Custom yes no no 745 0
< [none] Type 3 Custom yes no no 1013 0
< RRPOFI+MSBM7 Type 1 Builtin yes yes no 1078 0
< RKHRIA+CMEX8 Type 1 Builtin yes yes no 1086 0
< [none] Type 3 Custom yes no no 1135 0
< [none] Type 3 Custom yes no no 1136 0
---
> KYESCI+SFRM1728 Type 1 Custom yes yes no 22 0
> YHWNWE+SFCC1200 Type 1 Custom yes yes no 23 0
> UFXXMG+SFCC0800 Type 1 Custom yes yes no 24 0
> TQJZEA+SFTI0700 Type 1 Custom yes yes no 25 0
> OPFKQH+SFTI0900 Type 1 Custom yes yes no 26 0
> WXDDIU+SFBX0900 Type 1 Custom yes yes no 27 0
> KAKEVM+SFBX1000 Type 1 Custom yes yes no 30 0
> NVPASK+SFCC1000 Type 1 Custom yes yes no 31 0
> GXRPYC+SFRM1000 Type 1 Custom yes yes no 32 0
> URWOJO+SFTI1000 Type 1 Custom yes yes no 35 0
> HHFZSL+CMR10 Type 1 Builtin yes yes no 36 0
> AUZFDF+CMMI10 Type 1 Builtin yes yes no 69 0
> CHXOKT+CMSY10 Type 1 Builtin yes yes no 70 0
> EJRJQW+CMR7 Type 1 Builtin yes yes no 71 0
> CHUDXO+SFRM0700 Type 1 Custom yes yes no 73 0
> GXRPYC+SFRM1000 Type 1 Custom yes yes no 74 0
> ADNPKL+SFRM0600 Type 1 Custom yes yes no 75 0
> LMFIFZ+SFRM0800 Type 1 Custom yes yes no 77 0
> ZYIGPF+SFTT1000 Type 1 Custom yes yes no 112 0
> RQSHZS+CMTT10 Type 1 Builtin yes yes no 229 0
> UFOCKA+BBOLD10 Type 1 Builtin yes yes no 232 0
> WRFUCI+CMTI10 Type 1 Builtin yes yes no 234 0
> SCGGLO+CMMI7 Type 1 Builtin yes yes no 235 0
> RIDLKK+SFRM0900 Type 1 Custom yes yes no 246 0
> CYIPZU+MSAM10 Type 1 Builtin yes yes no 265 0
> NQPTHU+CMEX10 Type 1 Builtin yes yes no 267 0
> BLWZBZ+CMSY7 Type 1 Builtin yes yes no 268 0
> OYQULM+EUFM10 Type 1 Builtin yes yes no 269 0
> HCIFCV+CMR5 Type 1 Builtin yes yes no 270 0
> NLGAUZ+CMSS10 Type 1 Builtin yes yes no 271 0
> LUAFXG+rsfs10 Type 1 Builtin yes yes no 272 0
> HQDOIX+CMMI12 Type 1 Builtin yes yes no 282 0
> ZLOCQH+CMR8 Type 1 Builtin yes yes no 283 0
> LSMQVG+MSBM10 Type 1 Builtin yes yes no 304 0
> SFQIJW+CMSS9 Type 1 Builtin yes yes no 306 0
> ECGVGR+CMR9 Type 1 Builtin yes yes no 307 0
> BHDBNY+CMMI9 Type 1 Builtin yes yes no 308 0
> JIXOQS+CMSY9 Type 1 Builtin yes yes no 311 0
> ECVLCM+SFIT0900 Type 1 Custom yes yes no 331 0
> AUZEEC+TeX-mathb10 Type 1 Builtin yes yes no 376 0
> GZMRBQ+CMSS8 Type 1 Builtin yes yes no 409 0
> BNEURD+stmary10 Type 1 Builtin yes yes no 415 0
> LFVJTS+CMMI8 Type 1 Builtin yes yes no 424 0
> KKQPKP+CMMI5 Type 1 Builtin yes yes no 447 0
> MBBTWW+TeX-mathb7 Type 1 Builtin yes yes no 448 0
> GKIQMS+CMSY5 Type 1 Builtin yes yes no 449 0
> BHHCMF+CMSY8 Type 1 Builtin yes yes no 454 0
> ORHOJI+CMR6 Type 1 Builtin yes yes no 455 0
> QRAQFO+SFTI0800 Type 1 Custom yes yes no 616 0
> IPQGLW+BBOLD7 Type 1 Builtin yes yes no 657 0
> WWCTIW+CMEX7 Type 1 Builtin yes yes no 677 0
> JPVWTM+SFIT1000 Type 1 Custom yes yes no 744 0
> WEZJQK+SFTT0900 Type 1 Custom yes yes no 1012 0
> RRPOFI+MSBM7 Type 1 Builtin yes yes no 1077 0
> RKHRIA+CMEX8 Type 1 Builtin yes yes no 1085 0
> FYOJQA+SFTT0800 Type 1 Custom yes yes no 1134 0
> TBNRHQ+SFSS0800 Type 1 Custom yes yes no 1135 0
Let's make the diff a bit smaller:
$ diff <(pdffonts old.pdf | sed '1,2d' | cut -c-79 | cut -d '+' -f 2- | sort) <(pdffonts new.pdf | sed '1,2d'| cut -c-79 | cut -d '+' -f 2- | sort)
32,54d31
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
< [none] Type 3 Custom yes no
55a33,54
> SFBX0900 Type 1 Custom yes yes
> SFBX1000 Type 1 Custom yes yes
> SFCC0800 Type 1 Custom yes yes
> SFCC1000 Type 1 Custom yes yes
> SFCC1200 Type 1 Custom yes yes
> SFIT0900 Type 1 Custom yes yes
> SFIT1000 Type 1 Custom yes yes
> SFRM0600 Type 1 Custom yes yes
> SFRM0700 Type 1 Custom yes yes
> SFRM0800 Type 1 Custom yes yes
> SFRM0900 Type 1 Custom yes yes
> SFRM1000 Type 1 Custom yes yes
> SFRM1000 Type 1 Custom yes yes
> SFRM1728 Type 1 Custom yes yes
> SFSS0800 Type 1 Custom yes yes
> SFTI0700 Type 1 Custom yes yes
> SFTI0800 Type 1 Custom yes yes
> SFTI0900 Type 1 Custom yes yes
> SFTI1000 Type 1 Custom yes yes
> SFTT0800 Type 1 Custom yes yes
> SFTT0900 Type 1 Custom yes yes
> SFTT1000 Type 1 Custom yes yes
The old file uses a mixture of Type 1 and Type 3 fonts, whereas the new file uses Type 1 fonts only.
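A quick way to see that split at a glance is to count the fonts per type. The following is only a minimal sketch: the helper name font_type_counts is made up for illustration, and it assumes the fixed-width column layout that our pdffonts prints (font type in columns 38-54, consistent with the cut -c-79 used above) and font names of at most 36 characters.

```shell
# Hypothetical helper: reads `pdffonts` output on stdin and prints
# how many fonts there are of each type (e.g. "Type 1", "Type 3").
# Assumption: fixed-width columns, type field in columns 38-54.
font_type_counts() {
  sed '1,2d' |        # drop the two header lines
    cut -c38-54 |     # keep only the "type" column
    sed 's/ *$//' |   # strip the column padding
    sort |
    uniq -c           # count identical types
}

# Usage (needs poppler-utils):
#   pdffonts old.pdf | font_type_counts
#   pdffonts new.pdf | font_type_counts
```

If a font name is longer than 36 characters, the columns shift and the counts for that line would be off, so this is only meant as a quick check.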
Trying to compare the textual contents results in a nightmare, and here's an excerpt:
$ diff <(pdftotext old.pdf -) <(pdftotext new.pdf -) | head -12298 | tail -39
yields
10571c5813
< abstract interpretation. FPCA, pp. 170181. ACM Press,
---
> abstract interpretation. FPCA, pp. 170–181. ACM Press,
10573c5815
< [70] Jones, N. D. and Muchnik, S. S. (1981) Complexity of ow
---
> [70] Jones, N. D. and Muchnik, S. S. (1981) Complexity of flow
10583c5825
< [72] Perry, D. E., Jeey, R., and Notkin, D. (eds.) (1995)
---
> [72] Perry, D. E., Jeffrey, R., and Notkin, D. (eds.) (1995)
10585c5827
< Seattle, Washington, USA, April 2330, 1995, Proceedings. ACM.
---
> Seattle, Washington, USA, April 23–30, 1995, Proceedings. ACM.
10587c5829
< Softwaretechnik-Trends, .
---
> Softwaretechnik-Trends, 21.
10593c5835
< Université d'Aix-Marseille CNRS, UMR 7279.
---
> Université d’Aix-Marseille — CNRS, UMR 7279.
10597c5839
< Structures in Computer Science, , 329366.
---
> Structures in Computer Science, 14, 329–366.
10601c5843
< Distributed Computing, , 383409.
---
> Distributed Computing, 25, 383–409.
10609,10611c5851,5852
< Proceedings, Lecture Notes in Computer Science,
< ,
< pp. 391407. Springer.
---
> Proceedings, Lecture Notes in Computer Science, 5674,
> pp. 391–407. Springer.
As you see, the text layer of old.pdf apparently has no en dashes and no ligatures (ﬀ/ff, ﬂ/fl), unlike that of new.pdf, so examining the full output manually is not manageable:
$ diff <(pdftotext old.pdf -) <(pdftotext new.pdf -) | wc -l
15278
15278 lines is just way too many. The tool diffpdf is not better; here's the second page of the two files side by side, and wherever diffpdf senses a difference, it colo(u)rs the background in red:
- visual comparison marks everything as different
- comparison by characters marks most of the text as different
- comparison by words marks much of the text as different
Above, we blurred the images for privacy. When we actually looked for differences in the contents (we considered the first paragraph on the page), we found nothing except that the fonts in new.pdf are smoother than in old.pdf. Still, we are unsure about the rest of the document. We clearly don't wish to re-read every page of each document (41 pages here), considering every symbol, line, and space, merely to check whether any of the more important contents (actual letters and digits, references, citations, hyperlinks, tables, graphics, math symbols, self-drawn symbols, …) changed when TeX Live was upgraded. (We may re-read the document for other purposes in the distant future, but not for the purpose of comparison.)
Any help in better automating the comparison task? Can we somehow equalize the fonts in one or both PDF files before comparison? (Btw., these PDF files were produced from LaTeX via pdflatex, and we do NOT have PostScript or DVI versions of the old file.) Or can we massage the outputs of pdftotext before running diff? Or can we pass any non-default options to the tools used to make our task easier? Or, for our purposes, is the paid diffpdf now any better than its free version? Or are there any online tools good at this?
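To make the "massage the outputs of pdftotext" idea concrete, here is a minimal sketch of what we have in mind. The filter name normalize is made up, and the choice of replacements is an assumption based only on the excerpt above: ligature code points are expanded, en/em dashes and the curly apostrophe are mapped to ASCII, and the text is split into one word per line so that different line breaks no longer count as differences.

```shell
# Hypothetical filter: normalizes extracted text read from stdin.
normalize() {
  sed -e 's/ﬀ/ff/g' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g' \
      -e 's/ﬃ/ffi/g' -e 's/ﬄ/ffl/g' \
      -e 's/–/-/g' -e 's/—/-/g' \
      -e "s/’/'/g" |
    tr -s '[:space:]' '\n' |   # one word per line, whitespace squeezed
    sed '/^$/d'                # drop a possible empty first line
}

# Usage (bash, needs poppler-utils):
#   diff <(pdftotext old.pdf - | normalize) \
#        <(pdftotext new.pdf - | normalize) | wc -l
```

This cannot bring back characters that are absent from the text layer of old.pdf (such as the "fl" in "ow" or the missing volume numbers), so those would still show up as differences; it only removes the noise caused by line breaking and by different code points for the same glyph.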
\pdfglyphtounicode=0 (only for the tests, not for the final document!). Then probably the ff ligatures no longer copy in the new pdf too and the diff gets manageable.
\pdfgentounicode = 1.
diff <(pdftotext old.pdf -) <(pdftotext new.pdf -) | wc -l still yields 12387 lines.