Why does copying text between Notepad++ files create files with different bytes?

Question

I've created a simple PDF [hi.pdf] containing the word "hi". When I open it in Notepad++, the encoding is shown as ANSI, which I assume is Notepad++'s best guess. If I use Save As to write it out as hiSaveAs.pdf, that copy opens successfully.

However, when I copy the contents of hi.pdf in Notepad++, paste them into a new file, and save it as hiANSI.pdf with the encoding set to ANSI, the resulting file is corrupted and can't be opened:

Error, failed to load pdf document.

When I re-open hiANSI.pdf in Notepad++, UTF-8 is listed as the encoding, and when I compare it to hi.pdf, I notice it has spaces where hi.pdf has the NUL character:
- hi.pdf:
- hiANSI.pdf:
If I change the encoding of hiANSI.pdf to ANSI instead of UTF-8, the text differs from hi.pdf even more:

Can someone explain what is happening here?

Why does Save As work, but copying the exact same text into a new Notepad++ file results in a space instead of the NUL character?
Why does Notepad++ think hiANSI.pdf is UTF-8, but hi.pdf is ANSI?

This does not answer this question.

The MSB is not being stripped. Have a look at the hex comparison:

For example, why is 0A being added between 0D and 25 (first row, 10th byte)?

UPDATE:

I noticed that Notepad did much less than Notepad++ in terms of "helping". For example, when I saved hi.pdf as hiANSI.pdf using Notepad instead of Notepad++, the only things Notepad did were to add 0x0A (line feed) after 0x0D (carriage return) and to replace 0x00 (NUL) with 0x20 (space):

If I saved hi.pdf as hiANSI.bin, it did even less. It just replaced 0x00 with 0x20:
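The two substitutions described in this update can be reproduced on an in-memory byte string. This is only a sketch of the observed behaviour, not Notepad's actual code: the function name `notepad_like_save`, the `fix_line_endings` flag, and the sample bytes (including the `1.4` version number) are made up for illustration.

```python
import re


def notepad_like_save(data: bytes, fix_line_endings: bool = True) -> bytes:
    """Mimic the substitutions observed above (hypothetical helper)."""
    # Every NUL byte (0x00) becomes a space (0x20).
    out = data.replace(b"\x00", b"\x20")
    if fix_line_endings:
        # A CR (0x0D) not already followed by LF (0x0A) gets an LF appended.
        out = re.sub(rb"\r(?!\n)", b"\r\n", out)
    return out


# Made-up sample: one lone CR, one CR LF pair, and one NUL byte.
sample = b"%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n<00\x00>"

as_pdf = notepad_like_save(sample)                          # the .pdf case
as_bin = notepad_like_save(sample, fix_line_endings=False)  # the .bin case

print(sample.hex(" "))
print(as_pdf.hex(" "))
print(as_bin.hex(" "))
```

Replacing NUL with a space keeps the file length the same, while the added line feed shifts every later byte by one position.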

In the above two cases, it produced a valid PDF but with "hi" replaced with "IJ":

UPDATE:

If I replace the following 0x20 bytes in hiANSI.pdf with 0x00 to match hi.pdf, it displays "hi" instead of "IJ" but with a different font:

Here are the two bytes I changed (highlighted in yellow):

Why does changing these two bytes have this effect?

Does this answer your question? Saving pdf from Notepad++ creates corrupted file — Toto, Commented Feb 21, 2021 at 9:49
@Toto No, I asked that question. It got closed and it said "If this question doesn’t resolve your question, ask a new one". So this is me asking a new one. — David Klempfner, Commented Feb 21, 2021 at 9:51

harrymc · Accepted Answer · 2021-02-21 10:16:44Z


Notepad++ is a text editor, not a binary editor, so it "corrected" the text when pasting.

In your example, the 0D was interpreted as a carriage return, which is the first half of the Windows end-of-line sequence (0D 0A), but the 0A (line feed) that should follow it was missing. So Notepad++ has thoughtfully "corrected" your text by adding it.

For more information see Wikipedia:

For a freeware hex editor, see for example HxD.
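If no hex editor is at hand, a byte-level comparison can also be scripted. The following is a minimal sketch with hypothetical helper names (`first_differences`, `report`) and made-up sample bytes; real files could be loaded with `pathlib.Path("hi.pdf").read_bytes()` instead.

```python
def first_differences(a: bytes, b: bytes, limit: int = 10):
    """Return up to `limit` (offset, byte_a, byte_b) tuples where a and b differ."""
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            diffs.append((offset, x, y))
            if len(diffs) == limit:
                break
    return diffs


def report(a: bytes, b: bytes) -> None:
    """Print differing offsets in hex, plus a note if the lengths differ."""
    for offset, x, y in first_differences(a, b):
        print(f"offset 0x{offset:04X}: {x:02X} -> {y:02X}")
    if len(a) != len(b):
        print(f"lengths differ: {len(a)} vs {len(b)} bytes")


# Made-up stand-ins for the start of the original and the pasted file.
original = b"%PDF\r%\xe2\xe3"
pasted = b"%PDF\r\n%\xc3\xa2"
report(original, pasted)
```

This comparison is purely positional, so once a byte has been inserted (such as the extra 0A), every later offset is reported as different. A hex editor with an alignment-aware compare is more convenient for that case.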

edited Feb 21, 2021 at 10:16

answered Feb 21, 2021 at 9:53

harrymc


That makes sense. But what about what follows the next byte, 25 (the % character)? Why does it add the following 8 bytes, with every other byte being C3: C3 A2 C3 A3 C3 8F C3 93? And at offset 15F, why does it add C3 9E after 68? 68 is just the letter "h".
– David Klempfner
Commented Feb 21, 2021 at 10:10
C3 is a very common lead byte in UTF-8: it begins the two-byte encodings of the characters U+00C0 through U+00FF. Notepad++ has understood that your file is not simple ASCII text, so it decided that the text must be in UTF-8 and has therefore translated the non-ASCII characters to UTF-8. Notepad++ is only trying to be helpful.
– harrymc
Commented Feb 21, 2021 at 10:22
You're right. After 68, it has replaced DE with C3 9E, because DE (ANSI) = C3 9E (UTF-8) = LATIN CAPITAL LETTER THORN (Þ). It looks like it has saved the file as UTF-8 even though I selected ANSI in the Encoding dropdown.
– David Klempfner
Commented Feb 21, 2021 at 11:03
Do you know why it changed "hi" to "IJ"?
– David Klempfner
Commented Feb 26, 2021 at 12:06
I don't see it in your dumps.
– harrymc
Commented Feb 26, 2021 at 12:09
