I've created a simple PDF, hi.pdf, containing only the word "hi". When I open it in Notepad++, the encoding is shown as ANSI, which I assume is Notepad++'s best guess. If I use Save As to write it out as hiSaveAs.pdf, the copy opens successfully.

However, when I copy the contents of hi.pdf from Notepad++, paste them into a new file, and save that as hiANSI.pdf with the encoding set to ANSI, the file is corrupted and can't be opened:

Error, failed to load pdf document.
  • When I re-open hiANSI.pdf in Notepad++, the encoding is listed as UTF-8, and when I compare it to hi.pdf, I notice it has spaces where hi.pdf has NUL characters:
    • hi.pdf: Screenshot1
    • hiANSI.pdf: Screenshot2

  • If I change the encoding of hiANSI.pdf from UTF-8 to ANSI, the text differs from hi.pdf even more: Screenshot3


Can someone explain what is happening here?

  • Why does Save As work, while copying the exact same text into a new Notepad++ file produces a space in place of each NUL character?
  • Why does Notepad++ think hiANSI.pdf is UTF-8, but hi.pdf is ANSI?

This does not answer this question.

The MSB is not being stripped. Have a look at the hex comparison:

enter image description here

For example, why is 0A being added between 0D and 25 (first row, 10th byte)?
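To pin down where two files first diverge without reading a hex dump by eye, a small helper like the one below can be used. This is only a sketch: the function name `first_difference` and the sample byte strings are made up for illustration (a typical PDF header is assumed), they are not taken from the actual files.

```python
def first_difference(a: bytes, b: bytes):
    """Return the 0-based offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input is a prefix of the other: they differ where the shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical stand-ins for the first bytes of hi.pdf and hiANSI.pdf.
original = b"%PDF-1.4\r%\xe2\xe3\xcf\xd3"
pasted = b"%PDF-1.4\r\n%\xc3\xa2\xc3\xa3\xc3\x8f\xc3\x93"

offset = first_difference(original, pasted)
# Report the position 1-based, the way it is counted in the hex view.
print(f"first difference at byte {offset + 1}: "
      f"{original[offset]:02X} vs {pasted[offset]:02X}")
```

For real files, the two byte strings would come from `open(path, "rb").read()`, so that no text decoding gets in the way.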

UPDATE:

I noticed that Notepad did much less "helping" than Notepad++. For example, when I saved hi.pdf as hiANSI.pdf using Notepad instead of Notepad++, the only changes Notepad made were to add 0x0A (line feed) after 0x0D (carriage return) and to replace 0x00 (NUL) with 0x20 (space):

enter image description here

If I saved hi.pdf as hiANSI.bin, it did even less. It just replaced 0x00 with 0x20:

enter image description here
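The two substitutions described above can be reproduced on raw bytes to see exactly what changes. This is a rough sketch of the observed effect only, not Notepad's actual implementation; the function name `simulate_notepad_save` and its `fix_line_endings` flag are hypothetical.

```python
import re


def simulate_notepad_save(data: bytes, fix_line_endings: bool) -> bytes:
    """Mimic the byte changes observed after saving a binary file from Notepad."""
    # Observed in both cases: every NUL (0x00) becomes a space (0x20).
    out = data.replace(b"\x00", b"\x20")
    if fix_line_endings:
        # Observed for the .pdf save: a lone CR gains a following LF.
        # CRs that are already followed by LF are left alone.
        out = re.sub(rb"\r(?!\n)", b"\r\n", out)
    return out


sample = b"a\x00b\rc"
print(simulate_notepad_save(sample, fix_line_endings=True))   # like the .pdf save
print(simulate_notepad_save(sample, fix_line_endings=False))  # like the .bin save
```

Note that the NUL-to-space replacement loses information: afterwards there is no way to tell which spaces were originally NUL bytes.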

In the above two cases, it produced a valid PDF but with "hi" replaced with "IJ":

enter image description here

UPDATE:

If I replace the following 0x20 bytes in hiANSI.pdf with 0x00 to match hi.pdf, it displays "hi" instead of "IJ" but with a different font:

Left is hi.pdf, right is hiANSI.pdf

Here are the two bytes I changed (highlighted in yellow):

enter image description here

Why does changing these two bytes have this effect?

1 Answer

Notepad++ is a text editor, not a binary editor, so it "corrected" the text when pasting.

In your example, the 0D was interpreted as a carriage return, which is the first half of the Windows end-of-line sequence (0D 0A), but the 0A (line feed) that should follow it was missing. So Notepad++ has thoughtfully "corrected" your text by adding it.

For more information see Wikipedia:

For a freeware hex editor, see for example HxD.

  • That makes sense. But what about the next byte 25? I can see that's the End Of Medium char, but then why does it add the following 8 bytes, with every other byte being C3? C3 A2 C3 A3 C3 8F C3 93. And on line 15F, why does it add C3 9E after 68? 68 is just the letter "h". Commented Feb 21, 2021 at 10:10
  • The C3 is a very common first byte in Unicode UTF-8 character-encodings. Notepad++ has understood that your file is not simple ASCII text, so decided that it must be in UTF-8 and has therefore translated the non-ASCII characters to UTF-8. Notepad++ is only trying to be helpful.
    – harrymc
    Commented Feb 21, 2021 at 10:22
  • You're right. After 68, it has replaced DE with C3 9E, because DE (ANSI) = c3 9e (UTF-8) = LATIN CAPITAL LETTER THORN (Þ). It looks like it has saved the file as UTF-8 even though I selected ANSI in the Encoding dropdown. Commented Feb 21, 2021 at 11:03
  • Do you know why it changed "hi" to "IJ"? Commented Feb 26, 2021 at 12:06
  • I don't see it in your dumps.
    – harrymc
    Commented Feb 26, 2021 at 12:09
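The byte sequences discussed in the comments above can be checked directly. The sketch below assumes that "ANSI" means Windows-1252 (`cp1252`), which is the usual case on Western-locale Windows but is an assumption here; the function name `reencode_as_utf8` is hypothetical. Byte 25 itself is the ASCII percent sign, which UTF-8 leaves unchanged.

```python
def reencode_as_utf8(data: bytes, assumed: str = "cp1252") -> bytes:
    """Decode bytes as single-byte 'ANSI' text, then re-encode that text as UTF-8."""
    return data.decode(assumed).encode("utf-8")


# The four bytes that follow 25 ('%') in the original file.
print(reencode_as_utf8(bytes.fromhex("E2E3CFD3")).hex(" ").upper())

# DE is LATIN CAPITAL LETTER THORN in cp1252; in UTF-8 it needs two bytes.
print(reencode_as_utf8(bytes.fromhex("DE")).hex(" ").upper())

# Plain ASCII such as 68 ('h') or 25 ('%') is encoded identically in both.
print(reencode_as_utf8(b"h%").hex(" ").upper())
```

Every byte in the range 0xC0 to 0xFF turns into C3 followed by a second byte, which is why C3 shows up as every other byte in the saved file.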
