I have a question about encodings/charsets.

I ran a test: I typed the string "TEST Á" (without the quotes) in Notepad++ with the encoding set to ANSI.

"The ANSI charsets are all the same for ASCII characters such as digits 0-9, and English letters a-z and A-Z," (http://www.firstobject.com/convert-ansi-file-to-unicode.htm)

To my surprise, Notepad++ saved the file normally, and I can read it in Notepad++ with the accented character intact using the ANSI encoding.

As another test, I opened the same file with HxD (http://mh-nexus.de/en/hxd/), and the file is correct there too, with the Latin character "Á" at the end of the file. See the hex dump below:

54 45 53 54 20 C1 -> TEST Á

I thought I would have to use UTF-8 encoding for this to work, but that is not necessary.

Can anyone explain to me how this is possible?

2 Answers

You can use any character-set and any encoding to create a file and to view it.

You just have to be sure, when viewing, to use the same set and encoding as was used to write the file.

Most character sets actually have a large overlap. For example, most character sets (excluding EBCDIC and a few others) place the ASCII characters at the same positions (i.e. with the same code points) as ASCII itself. Therefore you could write a file in the Unicode character set with UTF-8 encoding and, so long as the file contained only characters that are in ASCII, you could view that file using a Windows Latin-1 encoding.
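To illustrate the overlap, here is a minimal Python sketch. The sample string and the list of encodings are my own choices for illustration, and `cp1252` is assumed to be what "Windows Latin-1" refers to here:

```python
# ASCII-only text produces identical bytes in several common encodings.
text = "TEST 0-9 a-z A-Z"
encodings = ["ascii", "utf-8", "cp1252", "latin-1"]

# Encode the same text with each encoding.
encoded = {name: text.encode(name) for name in encodings}

# If every encoding yields the same byte string, the set has one element.
all_same = len(set(encoded.values())) == 1
print(all_same)

# Bytes written as UTF-8 can therefore be read back as cp1252 unchanged.
round_trip = encoded["utf-8"].decode("cp1252")
print(round_trip == text)
```

This only holds while the text stays within ASCII; as soon as a character such as Á appears, the byte sequences can differ between encodings.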

Note: Microsoft are very sloppy with terms such as "ANSI" and "Unicode".


Update:

Firstly, you should pay attention to Jukka's Answer as Jukka is an expert in this subject.

As for your Á, see this extract from a code page comparison table:

Dec  Hex   ASC  PC  437  850  Win  Lat1  Uni
192  00C0       └   └    └    À    À     À
193  00C1       ┴   ┴    ┴    Á    Á     Á
194  00C2       ┬   ┬    ┬    Â    Â     Â
195  00C3       ├   ├    ├    Ã    Ã     Ã
196  00C4       ─   ─    ─    Ä    Ä     Ä
197  00C5       ┼   ┼    ┼    Å    Å     Å

Note that Á is at code point 193 (0xC1) in Windows Latin-1, in ISO 8859-1 (Latin-1) and in Unicode / ISO 10646. If you wrote Á in Windows Latin-1 you could view it as ISO 8859-1.

You would have problems if you tried to read it as UTF-8, because Unicode encodings such as UTF-8 use multiple bytes to represent that character:
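The table row for Á can be checked with a short Python sketch. The codec names `cp1252`, `latin-1` and `cp437` are Python's names, which I am assuming correspond to the Win, Lat1 and 437 columns:

```python
ch = "Á"

# Unicode code point of the character: 193 decimal, 0xC1 hex.
code_point = ord(ch)
print(code_point, hex(code_point))

# In both single-byte encodings the character is stored as the byte 0xC1.
win_bytes = ch.encode("cp1252")
iso_bytes = ch.encode("latin-1")
print(win_bytes.hex(), iso_bytes.hex())

# The same byte read with the old PC code page 437 is a box-drawing
# character (U+2534) rather than an accented letter.
as_cp437 = win_bytes.decode("cp437")
print(hex(ord(as_cp437)))
```

So the byte on disk is the same in both Latin-1 variants, and what you see depends only on which code page the viewer applies to it.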


# echo $LANG
en_US.UTF-8

# cat t
TEST Á

# hexdump -C t
00000000  54 45 53 54 20 c3 81 0a                           |TEST ...|
00000008

Note that Á (Unicode code point 00C1) is encoded in UTF-8 as the two bytes c3 81.
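The same difference can be reproduced in Python. This is only a sketch: the byte string is copied from the hex dump in the question, and `cp1252` is assumed to be the encoding Notepad++ used for "ANSI":

```python
# UTF-8 needs two bytes for this character.
utf8_bytes = "Á".encode("utf-8")
print(utf8_bytes.hex())

# The bytes Notepad++ wrote in "ANSI" mode, taken from the question.
ansi_bytes = bytes.fromhex("54 45 53 54 20 C1")

# Read with the matching single-byte encoding, the text comes back intact.
decoded_ok = ansi_bytes.decode("cp1252")

# Read as UTF-8, the lone byte 0xC1 is not a valid sequence.
try:
    ansi_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

A text editor that guesses UTF-8 for such a file would typically show a replacement character instead of raising an error, but the underlying problem is the same.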

1

The default encoding in Notepad++ is called “ANSI”, without clarification; it may mean windows-1252, or it may mean whatever 8-bit encoding is the system’s native 8-bit encoding (in your case, it’s probably windows-1252 anyway). “ANSI” is a Microsoft misnomer for its 8-bit encodings, one of which (now known as windows-1252) was long ago submitted to the American National Standards Institute for approval – and rejected.

There is no problem in entering “Á” in windows-1252 encoding. Naturally, Notepad++ also displays it OK. So do many, many other programs.

You would need UTF-8 if you wanted to enter “Ć” for example. Many people use UTF-8 even if they don’t need characters outside windows-1252 right now, to avoid any need to change the encoding later, if new characters are added.
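As a rough Python illustration of that limit (the codec name `cp1252` is Python's name for windows-1252):

```python
ch = "Ć"  # U+0106, not part of windows-1252

# windows-1252 has no byte for this character, so encoding fails.
try:
    ch.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False
print(fits_cp1252)

# UTF-8 can represent it, using two bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes.hex())
```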
