
I just installed some Arch Linux packages, which dumped this file onto my disk:

/etc/ssl/certs/EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.pem

Note that the file name seems to contain Turkish characters. Here are different commands with their output:

> cd /etc/ssl/certs
> echo EBG*
EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.pem
> ls -al EBG*
lrwxrwxrwx 1 root root 86 Nov  3 22:27 EBG_Elektronik_Sertifika_Hizmet_Sa??lay??c??s??.pem -> /usr/share/ca-certificates/mozilla/EBG_Elektronik_Sertifika_Hizmet_Sa??lay??c??s??.crt

Q1: Why do echo and ls produce different output?

So it's a symlink. If I dereference it:

> ls -alL EBG*
-rw-r--r-- 1 root root 2106 Sep 24 22:52 EBG_Elektronik_Sertifika_Hizmet_Sa??lay??c??s??.pem

Let's look at the target:

> cd /usr/share/ca-certificates/mozilla
> echo EBG*
EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.crt
> ls -al EBG*
-rw-r--r-- 1 root root 2106 Sep 24 22:52 EBG_Elektronik_Sertifika_Hizmet_Sa??lay??c??s??.crt

Q2: What is the encoding used for non-ASCII characters in a Linux file system (here: ext4)? Am I correct that the encoding is not captured anywhere, and if I give you some random hard drive without instructions, you need to guess which encoding I used?

I noticed there was a problem because pacman (the Arch Linux package manager) seemed to get confused about whether or not it had installed that file:

Q3: How do I prevent pacman, ls, or anything else from getting confused about files like that? What if next week some file name is in Arabic or Hebrew instead of Turkish?

1 Answer

  1. echo is a dumb program: it writes out whatever bytes it is given, whether or not they make sense. ls is a smart program: it tries to print only what is printable in the current locale, and when writing to a terminal it replaces everything else with "?". This results in ls producing the "wrong" output because you haven't set up your locale correctly. If you export LANG=en_US.UTF-8 (or any other UTF-8 locale), ls will display the name correctly; normally some system environment script does this for you.

  2. Linux filesystems do not enforce an encoding; file names are stored as plain bytes (though foreign filesystems may perform encoding transformations when mounted, e.g. from cp1252 for FAT). By strong convention, however, the encoding is always UTF-8. For the past several years it has been considered a severe bug if a package uses any other encoding.

  3. For ls, fix your environment. For pacman, file a bug.
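
The difference between echo and ls in point 1 can be reproduced in a scratch directory. This is only a sketch, assuming a POSIX shell and an ls that supports -q (GNU and BSD ls do); the temporary directory and the variable names are made up for illustration. The -q flag forces the "?" substitution that ls otherwise applies only when writing to a terminal, and LC_ALL=C stands in for the unset locale from the question.

```shell
#!/bin/sh
# Work in a throwaway directory (created by mktemp, not a path from the question).
demo_dir=$(mktemp -d)

# Build the file name from raw UTF-8 bytes: \304\237 is ğ, \304\261 is ı.
name=$(printf 'EBG_Elektronik_Sertifika_Hizmet_Sa\304\237lay\304\261c\304\261s\304\261.pem')
: > "$demo_dir/$name"

cd "$demo_dir" || exit 1

# echo passes the bytes through untouched, whatever the locale is.
raw=$(echo EBG*)

# In the C locale every byte above 0x7f counts as unprintable,
# so ls -q replaces each of those bytes with "?".
masked=$(LC_ALL=C ls -q)

printf 'echo : %s\n' "$raw"
printf 'ls -q: %s\n' "$masked"

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

Each of the four Turkish letters is two bytes in UTF-8, which is why the ls output in the question shows two question marks per letter.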

  • Ah! To add, the locale needs to exist, so on Arch, uncomment en_US.UTF-8 UTF-8 in /etc/locale.gen and run sudo locale-gen. Commented Nov 5, 2014 at 3:54
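
The uncommenting step from the comment above might look roughly like this. It is a sketch that edits a made-up three-line stand-in for /etc/locale.gen in a temporary file, so it can be tried without root; the sample contents and variable names are assumptions, and the real file is much longer.

```shell
#!/bin/sh
# Stand-in for /etc/locale.gen so the sketch can run without root.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#en_GB.UTF-8 UTF-8
#en_US.UTF-8 UTF-8
#en_US ISO-8859-1
EOF

# Uncomment only the en_US.UTF-8 line; every other line is left alone.
enabled=$(sed 's/^#\(en_US\.UTF-8 UTF-8\)/\1/' "$sample")
printf '%s\n' "$enabled"
rm -f "$sample"

# On the real system the same edit goes into /etc/locale.gen, followed by
# "sudo locale-gen". Afterwards "locale -a" should list the new locale.
```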
