495

What does the C value for LC_ALL do in Unix-like systems?

I know that it forces the same locale for all aspects but what does C do?

4
  • 4
    If you want to get rid of the xclock warning (Missing charsets in String to FontSet conversion), it is better to use LC_ALL=C.UTF-8 to avoid problems with Cyrillic. To set this environment variable, add the following line to the end of your ~/.bashrc file: export LC_ALL=C.UTF-8 Commented Jun 19, 2019 at 12:42
  • 2
    @fedotsoldier you should probably ask a separate question and answer it yourself; I don't think it's related to this question. It's an answer to a different problem you're having.
    – jcubic
    Commented Jun 19, 2019 at 13:20
  • Yeah, you are right, ok Commented Jun 19, 2019 at 13:22
  • 6
    legendary C locales rant github.com/mpv-player/mpv/commit/…
    – qwr
    Commented Sep 4, 2022 at 6:28

6 Answers

514

LC_ALL is the environment variable that overrides all the other localisation settings (except $LANGUAGE under some circumstances).

Different aspects of localisation (like the thousands separator or decimal point character, character set, sorting order, month and day names, language or application messages like error messages, currency symbol) can be set using a few environment variables.

You'll typically set $LANG to your preference with a value that identifies your region (like fr_CH.UTF-8 if you're in French speaking Switzerland, using UTF-8). The individual LC_xxx variables override a certain aspect. LC_ALL overrides them all. The locale command, when called without argument gives a summary of the current settings.

For instance, on a GNU system, I get:

$ locale
LANG=en_GB.UTF-8
LANGUAGE=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

I can override an individual setting with for instance:

$ LC_TIME=fr_FR.UTF-8 date
jeudi 22 août 2013, 10:41:30 (UTC+0100)

Or:

$ LC_MONETARY=fr_FR.UTF-8 locale currency_symbol
€

Or override everything with LC_ALL.

$ LC_ALL=C LANG=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 cat /
cat: /: Is a directory
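
The precedence described above (LC_ALL first, then the individual LC_xxx variable, then $LANG) can be sketched as a small shell function. This is only an illustration: effective_locale is a hypothetical helper, not a standard command, and the final fallback to C is an assumption (the default is left to the implementation). It also ignores the special handling of $LANGUAGE on GNU systems.

```shell
#!/bin/sh
# effective_locale CATEGORY
# Hypothetical helper: prints the locale name that would apply to CATEGORY
# (for example LC_TIME), following the LC_ALL > LC_xxx > LANG precedence.
# CATEGORY must be a plain variable name, since it is passed to eval.
effective_locale() {
    # Indirectly read the variable whose name is in $1 (empty if unset).
    eval "category_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: nothing is set, so fall back to the C locale.
        printf '%s\n' C
    fi
}

effective_locale LC_TIME
```

The locale command remains the authoritative way to see the settings; the function only mirrors the lookup order.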

In a script, if you want to force a specific setting, as you don't know what settings the user has forced (possibly LC_ALL as well), your best, safest and generally only option is to force LC_ALL.

The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes, the charset is ASCII (well, it is not required to be, but in practice it will be on the systems most of us will ever get to use) and the sorting order is based on the byte values¹. The language is usually US English (though for application messages, as opposed to things like month or day names or messages by system libraries, it's at the discretion of the application author), and things like currency symbols are not defined.

On some systems, there's a difference with the POSIX locale where for instance the sort order for non-ASCII characters is not defined.

You generally run a command with LC_ALL=C to prevent the user's settings from interfering with your script. For instance, if you want [a-z] to match the 26 ASCII characters from a to z, you have to set LC_ALL=C.
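
As a minimal sketch of that last point, the function below checks whether a string consists only of the 26 ASCII lowercase letters. is_ascii_lower is a hypothetical name chosen for illustration; the only thing taken from the text is running grep under LC_ALL=C so that [a-z] has a predictable meaning.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 is a non-empty, single-line string
# made only of the ASCII letters a to z.
is_ascii_lower() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '^[a-z][a-z]*$'
}

for word in hello Hello 'héllo'; do
    if is_ascii_lower "$word"; then
        echo "$word: only ASCII a-z"
    else
        echo "$word: contains something else"
    fi
done
```

Only the grep call runs in the C locale; the echo messages and the rest of the script keep the user's settings.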

On GNU systems, LC_ALL=C and LC_ALL=POSIX (or LC_MESSAGES=C|POSIX) override $LANGUAGE, while LC_ALL=anything-else wouldn't.

A few cases where you typically need to set LC_ALL=C:

  • sort -u or sort ... | uniq.... In many locales other than C, on some systems (notably GNU ones), some characters have the same sorting order. sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are bytes and all characters have different sorting order (which the C locale guarantees).

  • the same applies to the = operator of POSIX compliant expr or == operator of POSIX compliant awks (mawk and gawk are not POSIX in that regard), that don't check whether two strings are identical but whether they sort the same.

  • Character ranges like in grep. If you mean to match a letter in the user's language, use grep '[[:alpha:]]' and don't modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'². [a-z] matches the characters that sort after a and before z (though with many APIs it's more complicated than that). In other locales, you generally don't know what those are. For instance some locales ignore case for sorting so [a-z] in some APIs like bash patterns, could include [B-Z] or [A-Y]. In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the latin letters from a to y with diacritics but not those of z (since z sorts before them) which I can't imagine would be what you want (why would you want to include é and not ź?).

  • floating point arithmetic in ksh93. ksh93 honours the decimal_point setting in LC_NUMERIC. If you write a script that contains a=$((1.2/7)), it will stop working when run by a user whose locale has comma as the decimal separator:

     $ ksh93 -c 'echo $((1.1/2))'
     0.55
     $ LANG=fr_FR.UTF-8  ksh93 -c 'echo $((1.1/2))'
     ksh93: 1.1/2: arithmetic syntax error
    

Then you need things like:

    #! /bin/ksh93 -
    float input="$1" # get it as input from the user in his locale
    float output
    arith() { typeset LC_ALL=C; (($@)); }
    arith output=input/1.2 # use the dot here as it will be interpreted
                           # under LC_ALL=C
    echo "$output" # output in the user's locale

As a side note: the , decimal separator conflicts with the , arithmetic operator which can cause even more confusion.

  • When you need characters to be bytes. Nowadays, most locales are UTF-8 based which means characters can take up from 1 to 6 bytes³. When dealing with data that is meant to be bytes, with text utilities, you'll want to set LC_ALL=C. It will also improve performance significantly because parsing UTF-8 data has a cost.

  • a corollary of the previous point: when processing text where you don't know what character set the input is written in, but can assume it's compatible with ASCII (as virtually all charsets are). For instance grep '<.*>' to look for lines containing a <, > pair will not work if you're in a UTF-8 locale and the input is encoded in a single-byte 8-bit character set like iso8859-15. That's because . only matches characters, and non-ASCII characters in iso8859-15 are likely not to form a valid character in UTF-8. On the other hand, LC_ALL=C grep '<.*>' will work because any byte value forms a valid character in the C locale.

  • Any time where you process input data or output data that is not intended from/for a human. If you're talking to a user, you may want to use their convention and language, but for instance, if you generate some numbers to feed some other application that expects English style decimal points, or English month names, you'll want to set LC_ALL=C:

     $ printf '%g\n' 1e-2
     0,01
     $ LC_ALL=C printf '%g\n' 1e-2
     0.01
     $ date +%b
     août
     $ LC_ALL=C date +%b
     Aug
    

That also applies to things like case insensitive comparison (like in grep -i) and case conversion (awk's toupper(), dd conv=ucase...). For instance:

    grep -i i

is not guaranteed to match on I in the user's locale. In some Turkish locales for instance, it doesn't as upper-case i is İ (note the dot) there and lower-case I is ı (note the missing dot).
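
A few of the cases listed above can be wrapped in small helpers so that only the machine-facing step runs in the C locale while the rest of the script keeps the user's settings. The function names are made up for this sketch, and the behaviour for bytes above 0x7f relies on the C locale treating every byte as a character, as described above.

```shell
#!/bin/sh
# Hypothetical helpers; each one forces the C locale for a single command.

# Print the truly unique lines of stdin (byte-wise comparison).
unique_lines() {
    LC_ALL=C sort -u
}

# Format a number for another program: always "." as the decimal point.
machine_number() {
    LC_ALL=C printf '%g\n' "$1"
}

# Succeed if stdin has a line containing a <...> pair, whatever the
# (ASCII-compatible) encoding of the text in between.
has_angle_pair() {
    LC_ALL=C grep -q '<.*>'
}

printf '%s\n' b a b | unique_lines
machine_number 1e-2
# \351 is the byte for é in iso8859-15; on its own it is not valid UTF-8.
printf '<\351>\n' | has_angle_pair && echo "found a <...> pair"
```

Keeping the override on the individual command, rather than exporting LC_ALL for the whole script, means messages shown to the user are still localised.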


Notes

¹ again, only on ASCII based systems (the immense majority of systems). POSIX requires the collation order for the C locale to be that of the order of characters in the ASCII charset, even on EBCDIC systems which are not allowed to do the strcoll() === strcmp() optimisation in the C locale.


² Depending on the encoding of the text, that's not necessarily the right thing to do though. That's valid for UTF-8 or single-byte character sets (like iso-8859-1), but not necessarily non-UTF-8 multibyte character sets.

For instance, if you're in a zh_HK.big5hkscs locale (Hong Kong, using the Hong Kong variant of the BIG5 Chinese character encoding), and you want to look for English letters in a file encoded in that charset, doing either:

LC_ALL=C grep '[[:alpha:]]'

or

LC_ALL=C grep '[a-zA-Z]'

would be wrong, because in that charset (and many others, but hardly used since UTF-8 came out), a lot of characters contain bytes that correspond to the ASCII encoding of A-Za-z characters. For instance, all of A䨝䰲丕乙乜你再劀劈呸哻唥唧噀噦嚳坽 (and many more) contain the encoding of A. 䨝 is 0x96 0x41, and A is 0x41 like in ASCII. So our LC_ALL=C grep '[a-zA-Z]' would match on those lines that contain those characters as it would misinterpret those sequences of bytes.

LC_COLLATE=C grep '[A-Za-z]'

would work, but only if LC_ALL is not otherwise set (which would override LC_COLLATE). So you may end up having to do:

grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]'

if you wanted to look for English letters in a file encoded in the locale's encoding.
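
The false positive described in this note can be reproduced without installing a BIG5-HKSCS locale by feeding grep the two bytes quoted above (0x96 0x41). This is just a sketch; the byte values are taken from the text, and \226 and \101 are 0x96 and 0x41 written in octal.

```shell
#!/bin/sh
# The two bytes 0x96 0x41 form a single Chinese character in BIG5-HKSCS,
# but in the C locale grep sees two separate bytes, the second being "A".
if printf '\226\101\n' | LC_ALL=C grep -q '[a-zA-Z]'; then
    echo "matched: the trailing byte 0x41 was taken for an ASCII A"
fi
```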


³ some would argue it's rather 1 to 4 bytes these days now that Unicode code points (and the libraries that encode/decode UTF-8 data) have been arbitrarily restricted to code points U+0000 to U+10FFFF (0xD800 to 0xDFFF excluded) down from U+7FFFFFFF to accommodate the UTF-16 encoding, but some applications will still happily encode/decode 6-byte UTF-8 sequences (including the ones that fall in the 0xD800 .. 0xDFFF range).

11
  • 19
    +1, it's the best answer (for pointing out the overriding, etc). But lacks the (nice) examples of Ignacio's answer ^^ Commented Aug 22, 2013 at 11:08
  • 2
    A minor nitpick: The C locale is only required to support the "portable character set" (ASCII 0-127), and behavior for chars > 127 is technically unspecified. In practice, most programs will treat them as opaque data and pass them through as you described. But not all: in particular, Ruby may choke on char data with bytes > 127 if running in the C locale. I honestly don't know if that's technically "conformant", but we've seen it in the wild. Commented Dec 16, 2015 at 19:26
  • 3
    @AndrewJanke, yes. Note that portable character set does not even imply ASCII nor 0-127. There has been a lot of discussion on the Austin group mailing list on what the properties of the "C" locale character set would be and the general consensus (and that will be clarified in the next spec) is that that charset would be single-byte, and encompass the full 8bit range (with the properties described here). In the meantime, yes there can be some divergence (as bug or because the spec is not explicit enough). In any case LC_ALL=C is the closest you can get to a sane behaviour. Commented Dec 16, 2015 at 20:11
  • 3
    @12431234123412341234123, the original UTF-8 encoding covers up to U+7FFFFFFF (6 bytes, and there are some extensions to go up to 13 bytes like perl's \x{7FFFFFFFFFFFFFFF}) and while the range of Unicode code points has been arbitrarily restricted to U+10FFFF (due to UTF-16 design limitation), some tools still recognise/produce 6 byte characters. That's what I meant by 6 byte characters. In Unix semantics, one character is one codepoint. Your more than one codepoint "characters" are more generally referenced as grapheme clusters to disambiguate from characters. Commented Apr 18, 2017 at 17:42
  • 2
    @UlysseBN, that has nothing to do with bash. It's about the locale definition. See lists.gnu.org/archive/html/bug-bash/2019-12/msg00098.html for a more recent example. I used to use ①②③④⑤ as striking examples (see for example What is the difference between "sort -u" and "sort | uniq"?), but they've now been fixed. Still in current GNU locales (as of glibc 2.30 at least), over 99% of characters don't have a defined order. See those 🧝 🧜 🧙 🧛 🧝 🧚 Commented Dec 27, 2019 at 10:32
288

It forces applications to use the default language for output:

$ LC_ALL=es_ES man
¿Qué página de manual desea?

$ LC_ALL=C man
What manual page do you want?

and forces sorting to be byte-wise:

$ LC_ALL=en_US sort <<< $'a\nb\nA\nB'
a
A
b
B

$ LC_ALL=C sort <<< $'a\nb\nA\nB'
A
B
a
b
13
  • 33
    +1 for good examples, but it lacks the important info that is in Stephane's answer... Commented Aug 22, 2013 at 11:06
  • 14
    What do you mean by default language? Commented Sep 10, 2014 at 14:59
  • 3
    Yes, I understand the author can do whatever he likes including not do what it says on the tin. The thing is. US English is the only language that can be represented correctly with the charset in LC_ALL=C, the only language where the sorting order in LC_ALL=C (LC_COLLATE) makes sense, LC_ALL=C (LC_TIME) has English month and day names. I've never seen apps where LC_ALL=C returned message in a different language from LC_ALL=en LANGUAGE=en. So am I entitled to report a bug against a program if that's not the case? (not talking about apps not translated to English here). Commented Sep 10, 2014 at 19:55
  • 3
    The problem is "US English is the only language that can be represented correctly with the charset in LC_ALL=C". This is usually only true in C/C++ programs when using narrow characters, but even then there are exceptions (since there are several languages that only use characters and symbols found in ASCII). Reporting a bug when the default language is not English will make you seem... bigoted. Commented Sep 10, 2014 at 22:37
  • 3
    Note that in English (meaning LANG=en_US.utf8) the messages can (and should) use unicode characters such as “” for quoting strings. Whereas in LANG=C, it only has ASCII ones (double quotes, backquotes and apostrophes).
    – Ángel
    Commented Mar 10, 2015 at 16:55
12

C is the default locale, and "POSIX" is an alias of "C". I guess "C" is derived from ANSI C. Maybe ANSI C defines the "POSIX" locale.

5
  • Both C and UNIX by far predate ANSI C.
    – user
    Commented Aug 22, 2013 at 10:55
  • @MichaelKjörling: So? I've seen pre-ANSI documentation, and it didn't have locales. Internally at AT&T Bell Labs, everyone spoke English.
    – MSalters
    Commented Aug 22, 2013 at 14:50
  • @MSalters The fact that pre-ANSI documentation for the C language doesn't mention locales (which may or may not imply that pre-ANSI, C had no concept of locales; after all, I'm pretty sure the language still does not, but that's beside the point) does not imply that the C locale name derives from "ANSI C".
    – user
    Commented Aug 22, 2013 at 21:18
  • 6
    @MichaelKjörling: You're missing the point. When locales were introduced, "C" already meant "ANSI C". That it meant K&R C in the past is irrelevant.
    – MSalters
    Commented Aug 23, 2013 at 7:36
  • What is the "default locale"? Commented Jan 28, 2023 at 22:15
6

As far as I can tell, OS X uses code point collation order in UTF-8 locales, so it is an exception to some of the points mentioned in the answer by Stéphane Chazelas.

This prints 26 in OS X and 310 in Ubuntu:

export LC_ALL=en_US.UTF-8
printf %b $(printf '\\U%08x\\n' $(seq $((0x11)) $((0x10ffff))))|grep -a '[a-z]'|wc -l

The code below prints nothing in OS X, indicating that the input is sorted. The six surrogate characters that are removed cause an illegal byte sequence error.

export LC_ALL=en_US.UTF-8
for ((i=1;i<=0x1fffff;i++));do
  x=$(printf %04x $i)
  [[ $x = @(000a|d800|db7f|db80|dbff|dc00|dfff) ]]&&continue
  printf %b \\U$x\\n
done|sort -c

The code below prints nothing in OS X, indicating that there are no two consecutive code points (at least between U+000B and U+D7FF) that have the same collation order.

export LC_ALL=en_US.UTF-8
for ((i=0xb;i<=0xd7fe;i++));do
  printf %b $(printf '\\U%08x\\n' $((i+1)) $i)|sort -c 2>/dev/null&&echo $i
done

(The examples above use %b because printf \\U25 results in an error in zsh.)

Some characters and sequences of characters that have the same collation order in GNU systems do not have the same collation order in OS X. This prints ① first in OS X (using either OS X's sort or GNU sort) but ② first in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %s\\n ② ①|sort

This prints three lines in OS X (using either OS X's sort or GNU sort) but one line in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %b\\n \\u0d4c \\u0d57 \\u0d46\\u0d57|sort -u
1
  • Does anyone know why there is this difference?
    – 1.61803
    Commented Feb 23, 2019 at 9:49
6

It appears that LC_COLLATE controls the "alphabetical order" used by ls, as well. The US locale will sort as follows:

a.C
aFilename.C
aFilename.H
a.H

basically ignoring the periods. You might prefer:

a.C
a.H
aFilename.C
aFilename.H

I certainly do. Setting LC_COLLATE to C accomplishes this. Note that it will also sort lower case after all capitals:

A.C
A.H
AFilename.C
a.C
a.H
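
To try this without touching your own files, the sketch below creates the same names in a temporary directory and lists them with LC_COLLATE=C. LC_ALL is emptied for the ls call because, as explained in the other answers, a non-empty LC_ALL would override LC_COLLATE. The temporary directory and the list_bytewise name are only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the directory $1 in byte-value order.
list_bytewise() {
    ( cd "$1" && LC_ALL= LC_COLLATE=C ls )
}

demo_dir=$(mktemp -d) || exit 1
touch "$demo_dir/a.C" "$demo_dir/a.H" \
      "$demo_dir/aFilename.C" "$demo_dir/aFilename.H"
list_bytewise "$demo_dir"
rm -r "$demo_dir"
```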
2

In addition to Ignacio Vazquez-Abrams's answer: for some console output, you need to set the variable at the session level (with export) rather than just for a single command.

For example,

As he mentions, in most cases it does work in this way

$ man
What manual page do you want?
$ LC_ALL=es_ES man
Qupina de manual desea?

Yet it doesn't work in some cases (here the message comes from bash itself, not from cpio)

$ LC_ALL=es_ES cpio
-bash: /usr/bin/cpio: Permission denied

So it requires you to do this instead

$ export LC_ALL=es_ES
$ cpio
-bash: /usr/bin/cpio: Permiso denegado

Then switch back to English if needed

$ export LC_ALL=C
$ cpio
-bash: /usr/bin/cpio: Permission denied
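
The difference between the two forms is one of scope: a VAR=value prefix is only placed in the environment of that one command, whereas export changes the running shell and everything started from it afterwards. If the change should not outlive a few commands, a subshell is one way to contain it. A minimal sketch:

```shell
#!/bin/sh
# Per-command prefix: only the child process sees the value.
LC_ALL=C sh -c 'echo "child sees LC_ALL=$LC_ALL"'
echo "shell still has LC_ALL=${LC_ALL-<unset>}"

# export inside a subshell: in effect until the closing parenthesis.
(
    export LC_ALL=C
    echo "inside subshell LC_ALL=$LC_ALL"
)
echo "after subshell LC_ALL=${LC_ALL-<unset>}"
```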

Also note that for some non-alphabetic languages you should add ".UTF-8", as others mention.

For example, for the Japanese language

$ export LC_ALL=ja_JP
$ cpio
-bash: /usr/bin/cpio: Ĥ

$ export LC_ALL=ja_JP.UTF-8
$ cpio
-bash: /usr/bin/cpio: 許可がありません
