I had an encoding problem in my text files. I assume the file was Windows-1252 initially. I then tried to convert it to another encoding using Notepad++, did so a couple of times, and got a complete mess like ???A??a?s??A§???A??a?s??A ???A??a?s??Aµ???A??a?s??A®???A??a?s??A¤????????????. I do not remember the exact sequence of actions I took. The only thing I am sure of is that I switched between ANSI, UTF-8 and Windows-1251. None of this got my text back into proper Cyrillic, which is what it was before.

So, is there a way to get the information I had in this file back? Does the txt file still contain all the information, so that I just need to figure out which encoding to use, or was it replaced and the information lost forever? Initially, I had some Cyrillic text in the file.

  • Use your backup.
    – Toto
    Commented Sep 4, 2022 at 14:21
  • I have no backups of the file. I have Windows Recovery Points, but it seems they won't help. I didn't know I could lose my data by changing the encoding of the file. Thank you @Giacomo1968 for your explanation. Commented Sep 4, 2022 at 15:09
  • Before you give up, try ShadowExplorer. If you have recovery enabled for the drive, it could save your bacon, even if "Previous Versions" isn't showing anything for that file.
    – Xan
    Commented Sep 4, 2022 at 22:36
  • What you see in the editor is probably not a good representation of what the data really is. In particular, those question marks probably indicate bytes that don't correspond to a printable character. I'd suggest using a hex editor or just doing a hex dump to see what the bytes really are. If you edit your question to show the first 100 bytes of the file in hex, someone may be able to help you. Commented Sep 4, 2022 at 22:45
  • @TheodoreNorvell I don't use NP++ at the moment, but most Unicode-aware editors (by default) don't show "?" unless there's a "?" in the file. They either show � (U+FFFD Replacement Character) or tofu (a rectangle, possibly with hex digits inside). The latter is more common these days, but � is still used for encoding errors if the display is Unicode-based.
    – Chris H
    Commented Sep 5, 2022 at 10:28
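
The hex dump suggested in the comments above could be produced with a few lines of Python. This is only a sketch: the helper names hex_dump and dump_file_head are made up for illustration, and the 100-byte default simply follows the comment's suggestion.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable-ASCII preview."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Two hex digits per byte, padded so the preview column lines up.
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(width * 3)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part} {text_part}")
    return "\n".join(lines)


def dump_file_head(path: str, count: int = 100) -> str:
    """Hex-dump the first `count` bytes of a file, read in binary mode."""
    with open(path, "rb") as f:
        return hex_dump(f.read(count))
```

Calling print(dump_file_head("broken.txt")) on the damaged file (the file name here is hypothetical) shows the raw bytes regardless of which encoding the editor guesses. Long runs of 3f, the byte for a literal "?", would suggest the original characters have already been replaced.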

1 Answer

Sorry, but the text is unrecoverable at this point.

The problem is that not all character sets cover the same range of characters.

When you switched between character sets, the editor tried to retain the characters in some way. But since not every character set contains every character, characters were lost in the process, so the file is permanently garbled.

In your example, if you went from Cyrillic text (which should be UTF-8) to ANSI (aka: Windows-1252) and then to Windows-1251 (an older, single-byte Cyrillic encoding), each conversion lost data.
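
To illustrate the difference between merely viewing a file in the wrong encoding and actually converting it, here is a minimal Python sketch. The sample word and the helper names misread and convert are made up for illustration, and whether Notepad++ substitutes "?" in exactly the way Python's errors="replace" does is an assumption.

```python
original = "Привет"  # hypothetical sample text


def misread(text: str, stored_as: str, read_as: str) -> str:
    """Decode the stored bytes with the wrong codec; the bytes stay intact."""
    return text.encode(stored_as).decode(read_as)


def convert(text: str, target: str) -> bytes:
    """Re-encode into `target`, writing '?' for unrepresentable characters."""
    return text.encode(target, errors="replace")


# Case 1: UTF-8 bytes displayed as Windows-1252. Ugly, but reversible,
# because no byte in the file was changed.
mojibake = misread(original, "utf-8", "cp1252")
recovered = mojibake.encode("cp1252").decode("utf-8")

# Case 2: the text is converted to Windows-1252, which has no Cyrillic
# letters. Every letter becomes a literal '?' byte (hex 3f).
lossy = convert(original, "cp1252")

print(recovered == original)   # the misread text can be repaired
print(lossy)                   # only question marks remain
print(lossy.decode("cp1251"))  # no later reinterpretation brings letters back
```

In the second case the file itself contains question marks, so no choice of encoding afterwards can tell one lost letter from another.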

Sorry, but a backup of some kind is your only hope.

FWIW, this page — “Where Did These Funny Characters Come From?” — has an excellent explanation of how this happens and what those question marks (?) mean:

A byte is 8 bits, and the value can conveniently be represented in hexadecimal (usually abbreviated to "hex") or in decimal or, less conveniently, in octal or binary.

For example, the character "A" is represented in a single byte like this:

A
Binary 01000001
Hex 41
Decimal 65
Octal 101
Unicode Code Point U+0041

The character "A" is the same in UTF-8, ASCII, ISO/IEC 8859 and Windows 12xx, all our usual sources. So in this case we don't have to worry about any incompatibility because there isn't any.
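
The claim about "A" can be checked with a short Python sketch. The list of codecs below is a hypothetical selection standing in for "our usual sources" (Python calls Windows-1251 and Windows-1252 cp1251 and cp1252), and encode_everywhere is an illustrative helper name.

```python
ENCODINGS = ["ascii", "utf-8", "iso-8859-15", "cp1251", "cp1252"]


def encode_everywhere(ch: str) -> dict:
    """Encode one character with each codec in ENCODINGS."""
    return {name: ch.encode(name) for name in ENCODINGS}


a_bytes = encode_everywhere("A")

# The same number written in the notations used in the table above.
code = ord("A")
representations = {
    "binary": format(code, "08b"),
    "hex": format(code, "x"),
    "decimal": str(code),
    "octal": format(code, "o"),
    "code point": f"U+{code:04X}",
}

for name, raw in a_bytes.items():
    print(name, raw.hex())
print(representations)
```

Every codec in the list yields the single byte 41 for "A", which is why plain ASCII text survives being opened in the wrong encoding.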

If we look at the Euro symbol (€) it's a completely different story:

€ - Euro currency symbol

Character Encoding   UTF-8 (3-byte sequence)      ISO/IEC 8859-15   Windows-1252
Binary               11100010 10000010 10101100   10100100          10000000
Hex                  e2 82 ac                     a4                80
Decimal              226 130 172                  164               128
Octal                342 202 254                  244               200

Unicode Code Point U+20AC

Our commonly-used encoding systems all represent the Euro symbol differently. If we copy the bytes from a file encoded in ISO-8859-15 to a database running in Windows-1252, our Euro symbol (hex a4) will not look like a Euro symbol any more. In Windows-1252 hex a4 is "¤". Going from Windows-1252 to ISO-8859-15 we would get a question mark or a "◼" because in ISO-8859-15 hex 80 is undefined. 7-bit ASCII and EBCDIC do not have any way to represent a Euro symbol. These encoding systems were defined before the Euro existed, so that is not surprising.
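
The Euro example can be reproduced in Python as a rough check. This is a sketch rather than a statement about how any particular database or editor behaves: can_encode is an illustrative helper, and Python's iso-8859-15 codec happens to map byte 80 to the invisible control character U+0080 instead of raising an error, which is one way the "undefined" position shows up in practice.

```python
euro = "\u20ac"  # the Euro sign

utf8 = euro.encode("utf-8")
latin9 = euro.encode("iso-8859-15")
cp1252 = euro.encode("cp1252")


def can_encode(ch: str, codec: str) -> bool:
    """Return True if `codec` has a byte sequence for `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Copying bytes without converting them: same byte, different character.
as_cp1252 = latin9.decode("cp1252")       # byte a4 read as Windows-1252
as_latin9 = cp1252.decode("iso-8859-15")  # byte 80 read as ISO-8859-15

print(utf8.hex(), latin9.hex(), cp1252.hex())
print(list(utf8))                  # decimal values of the UTF-8 bytes
print(repr(as_cp1252), repr(as_latin9))
print(can_encode(euro, "ascii"))   # 7-bit ASCII has no Euro sign
```

The same character is one byte in two of the encodings and three bytes in UTF-8, and each of the single bytes means something else under the other single-byte encoding.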

As long as we could get away with using one consistent 8-bit code, everything was very simple, but we can't do that in the real world any more, so we need something better. UTF-8 is that something better, so we'll explain a bit about how that works.
