How do I recover my text file after changing the encoding a few times in Notepad++?

Question

I had an encoding problem in my text files. I assume the encoding was Windows-1252 initially. I then tried converting the file to another encoding in Notepad++, did so a couple of times, and ended up with a complete mess like ???A??a?s??A§???A??a?s??A ???A??a?s??Aµ???A??a?s??A®???A??a?s??A¤????????????. I do not remember the exact sequence of steps I took. The only thing I am sure about is that I switched between ANSI, UTF-8 and Windows-1251. None of this restored the text to the proper Cyrillic it was in before.

So, is there a way to get the information I had in this file back? Does the txt file still contain all the information, so that I just need to figure out which encoding to use, or was it overwritten and lost forever? Initially, I had some Cyrillic text in the file.

I have no backups of the file. I have Windows Recovery Points, but it seems like they won't help. I didn't know I could lose my data by changing the encoding of the file. Thank you @Giacomo1968 for your explanation. — serenetree249, Commented Sep 4, 2022 at 15:09
Before you give up, try ShadowExplorer. If you have recovery enabled for the drive, it could save your bacon, even if "Previous Versions" isn't showing anything for that file. — Xan, Commented Sep 4, 2022 at 22:36
What you see with the editor is probably not a good representation of what the data really is. In particular, those question marks probably indicate some byte that doesn't correspond to a printable character. I'd suggest using a hex editor or just doing a hex dump to see what the bytes really are. If you edit your question to show the first 100 bytes of the file in hex, someone may be able to help you. — Theodore Norvell, Commented Sep 4, 2022 at 22:45
@TheodoreNorvell I don't use NP++ at the moment but most unicode-aware editors (by default) don't show "?" unless there's a "?" in the file. They either show � (U+FFFD Replacement Character) or tofu (a rectangle, possibly with hex characters in). The latter is more common these days, but � is still used for encoding errors if the display is unicode-based — Chris H, Commented Sep 5, 2022 at 10:28
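
The hex dump suggested in the comments above could be produced with a few lines of Python. This is only a sketch: the helper name `hex_dump`, the sample string, and the file name `garbled.txt` are made up for illustration and are not taken from the asker's setup.

```python
def hex_dump(data: bytes, limit: int = 100, width: int = 16) -> str:
    """Return a hex dump of the first `limit` bytes, `width` bytes per row."""
    lines = []
    head = data[:limit]
    for offset in range(0, len(head), width):
        chunk = head[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {text_part}")
    return "\n".join(lines)


# Illustrative sample: a real "?" (3f) next to a two-byte UTF-8 sequence (c2 a7).
sample = "A?§".encode("utf-8")
print(hex_dump(sample))

# For an actual file (hypothetical name), read it in binary mode:
# with open("garbled.txt", "rb") as f:
#     print(hex_dump(f.read(100)))
```

If the dump shows literal `3f` bytes where the Cyrillic letters used to be, the question marks are really in the file and not just a display problem.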

Giacomo1968 · Accepted Answer · 2022-09-04 16:46:50Z

Sorry, but the text is unrecoverable at this point.

The problem is, not all character sets have the same range of characters.

When you switched between character sets, the system attempted to retain the characters in some way. But since not all character sets contain all characters, some characters were lost in the process, so the file is now permanently garbled.

In your example, if you went from Cyrillic (which should be UTF-8) to ANSI (aka Windows-1252) and then to Windows-1251 (an older Cyrillic encoding), each conversion lost data.
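
As a rough illustration of how a conversion can discard characters, here is a small Python sketch. It assumes the converter substitutes "?" for anything the target character set cannot represent; whether Notepad++ did exactly this for the file in question is not known, and the sample word is made up.

```python
# Hypothetical sample text, not taken from the asker's file.
original = "Привет"

# Stored as UTF-8, every Cyrillic letter takes two bytes.
utf8_bytes = original.encode("utf-8")

# Converting to Windows-1252: that character set has no Cyrillic letters,
# so each one is replaced by a literal "?" byte (0x3f).
as_cp1252 = original.encode("cp1252", errors="replace")

# Reading those bytes back as Windows-1251 (or anything else) only gives "?".
recovered = as_cp1252.decode("cp1251")

print(utf8_bytes.hex(" "))
print(as_cp1252)
print(recovered)
print(recovered == original)
```

After the second step every letter has become the same byte, so no later choice of encoding can tell them apart again.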

Sorry, but a backup of some kind is your only hope.

FWIW, this page — “Where Did These Funny Characters Come From?” — has an excellent explanation of how this happens and what those question marks (?) mean:

A byte is 8 bits, and the value can conveniently be represented in hexadecimal (usually abbreviated to "hex") or in decimal or, less conveniently, in octal or binary.

For example, the character "A" is represented in a single byte like this:

A
Binary               01000001
Hex                  41
Decimal              65
Octal                101
Unicode Code Point   U+0041

The character "A" is the same in UTF-8, ASCII, ISO/IEC 8859 and Windows 12xx, all our usual sources. So in this case we don't have to worry about any incompatibility because there isn't any.
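
The claim about "A" can be checked with a short Python sketch. The list of codec names below is an illustrative selection (Python's names for ASCII, UTF-8, ISO/IEC 8859-15, Windows-1252 and Windows-1251), not something taken from the quoted page.

```python
# Python codec names chosen for illustration.
encodings = ["ascii", "utf-8", "iso8859_15", "cp1252", "cp1251"]

# Encode the same character under each encoding and collect the bytes.
encoded = {name: "A".encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:<12} {data.hex()}")

# The same value in the notations used in the table above.
code_point = ord("A")
print(format(code_point, "08b"), format(code_point, "x"),
      code_point, format(code_point, "o"))
```

Every encoding in the list produces the single byte 41, which is why plain ASCII text survives this kind of conversion untouched.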

If we look at the Euro symbol (€) it's a completely different story:

€ - Euro currency symbol

Character Encoding   UTF-8 (3-byte sequence)      ISO/IEC 8859-15   Windows-1252
Binary               11100010 10000010 10101100   10100100          10000000
Hex                  e2 82 ac                     a4                80
Decimal              226 130 172                  164               128
Octal                342 202 254                  244               200
Unicode Code Point   U+20ac

Our commonly-used encoding systems all represent the Euro symbol differently. If we copy the bytes from a file encoded in ISO-8859-15 to a database running in Windows-1252, our Euro symbol (hex a4) will not look like a Euro symbol any more. In Windows-1252 hex a4 is "¤". Going from Windows-1252 to ISO-8859-15 we would get a question mark or a "◼" because in ISO-8859-15 hex 80 is undefined. 7-bit ASCII and EBCDIC do not have any way to represent a Euro symbol. These encoding systems were defined before the Euro existed, so that is not surprising.
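
The Euro example can be reproduced with Python's built-in codecs. This is a minimal sketch; the variable names are made up, and `errors="replace"` is just one way a converter might handle a character it cannot represent.

```python
euro = "€"

# The same character encoded three ways.
utf8 = euro.encode("utf-8")         # three bytes
latin9 = euro.encode("iso8859_15")  # one byte
cp1252 = euro.encode("cp1252")      # one byte, but a different one

print(utf8.hex(" "), latin9.hex(), cp1252.hex())

# Bytes written as ISO-8859-15 but read as Windows-1252: a4 is a
# different character there.
misread = latin9.decode("cp1252")
print(misread)

# 7-bit ASCII has no Euro sign at all, so a "replace" policy substitutes "?".
ascii_attempt = euro.encode("ascii", errors="replace")
print(ascii_attempt)
```

The misread case is still reversible because the byte a4 is intact. The ASCII case is not, because the original byte has been swapped for a question mark.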

While we could get away with using one consistent 8-bit code, everything was very simple, but we can't do that in the real world any more, so we need something better. UTF-8 is that something better, so we'll explain a bit about how that works.
