What does "LC_ALL=C" do?

Question

What does the C value for LC_ALL do in Unix-like systems?

I know that it forces the same locale for all aspects but what does C do?

If you want to resolve a problem with xclock warning(Missing charsets in String to FontSet conversion), it will be better if you will use LC_ALL=C.UTF-8 to avoid problems with cyrillic. To set this environment variable you must add the following line to the end of ~/.bashrc file - export LC_ALL=C.UTF-8 — fedotsoldier, Commented Jun 19, 2019 at 12:42
@fedotsoldier you should probably ask question and give the answer yourself, I don't think it's related to the question. It's just answer to different problem you're having. — jcubic, Commented Jun 19, 2019 at 13:20
legendary C locales rant github.com/mpv-player/mpv/commit/… — qwr, Commented Sep 4, 2022 at 6:28

Stéphane Chazelas · Accepted Answer · 2023-12-19 13:35:54Z

LC_ALL is the environment variable that overrides all the other localisation settings (except $LANGUAGE under some circumstances).

Different aspects of localisation (like the thousands separator or decimal point character, character set, sorting order, month and day names, language or application messages like error messages, currency symbol) can be set using a few environment variables.

You'll typically set $LANG to your preference with a value that identifies your region (like fr_CH.UTF-8 if you're in French speaking Switzerland, using UTF-8). The individual LC_xxx variables override a certain aspect. LC_ALL overrides them all. The locale command, when called without argument gives a summary of the current settings.
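
The precedence described above can be sketched as a small shell function. This is only an illustration: `effective_locale` is a hypothetical helper, not a standard command, and the final fallback to `C` is an assumption (POSIX leaves the default implementation-defined when LANG is unset or empty).

```shell
# Hypothetical helper: print the locale name that applies to one category
# (e.g. LC_TIME), following the precedence LC_ALL > LC_<category> > LANG.
effective_locale() {
  _category=$1
  # A non-empty LC_ALL wins over everything else.
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
    return 0
  fi
  # Otherwise the category's own variable applies, if set and non-empty.
  eval "_value=\${$_category:-}"
  if [ -n "$_value" ]; then
    printf '%s\n' "$_value"
  elif [ -n "${LANG:-}" ]; then
    # LANG is the fallback for every category not set individually.
    printf '%s\n' "$LANG"
  else
    # Assumption: the implementation-defined default behaves like C.
    printf '%s\n' C
  fi
}
```

For example, `LC_ALL=C LC_TIME=POSIX effective_locale LC_TIME` prints `C`, because LC_ALL takes priority over the category's own variable.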

For instance, on a GNU system, I get:

$ locale
LANG=en_GB.UTF-8
LANGUAGE=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

I can override an individual setting with for instance:

$ LC_TIME=fr_FR.UTF-8 date
jeudi 22 août 2013, 10:41:30 (UTC+0100)

Or:

$ LC_MONETARY=fr_FR.UTF-8 locale currency_symbol
€

Or override everything with LC_ALL.

$ LC_ALL=C LANG=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 cat /
cat: /: Is a directory

In a script, if you want to force a specific setting, as you don't know what settings the user has forced (possibly LC_ALL as well), your best, safest and generally only option is to force LC_ALL.
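
A minimal sketch of what forcing LC_ALL at the top of a script could look like. The `sorted` variable and the sample input are made up for illustration.

```shell
#!/bin/sh
# Force the C locale for this script and everything it runs, regardless of
# what the caller has set in LANG, LC_* or LC_ALL.
LC_ALL=C
export LC_ALL

# Sorting is now byte-wise for every user: upper case before lower case.
sorted=$(printf '%s\n' b A a B | sort | tr -d '\n')
printf '%s\n' "$sorted"
```

If only one command needs it, prefixing that command with `LC_ALL=C` keeps the rest of the script in the user's locale.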

The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes, the charset is ASCII (well, it is not required to be, but in practice it will be on the systems most of us will ever get to use), the sorting order is based on the byte values¹, the language is usually US English (though for application messages, as opposed to things like month or day names or messages by system libraries, it's at the discretion of the application author) and things like currency symbols are not defined.
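
Some of these properties can be checked with the `locale` utility, in the same way as the `locale currency_symbol` call shown earlier. A small sketch, assuming `locale` is installed; the variable names are arbitrary, and the charmap name reported for the C locale varies between systems.

```shell
# Query a few settings of the C locale with the standard locale utility.
c_decimal_point=$(LC_ALL=C locale decimal_point)
c_currency_symbol=$(LC_ALL=C locale currency_symbol)
c_charmap=$(LC_ALL=C locale charmap)

printf 'decimal point:   "%s"\n' "$c_decimal_point"
printf 'currency symbol: "%s"\n' "$c_currency_symbol"
printf 'charmap:         "%s"\n' "$c_charmap"
```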

On some systems, there's a difference with the POSIX locale where for instance the sort order for non-ASCII characters is not defined.

You generally run a command with LC_ALL=C to keep the user's settings from interfering with your script. For instance, if you want [a-z] to match the 26 ASCII characters from a to z, you have to set LC_ALL=C.
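
As a quick illustration of the [a-z] case, here is a sketch with made-up input. The letter é is written as its two UTF-8 bytes so that the example does not depend on how this text is encoded.

```shell
# Count the lines that contain at least one ASCII lower-case letter.
# "é" is given as its two UTF-8 bytes (octal 303 251).
count=$(printf 'a\nZ\n5\n\303\251\n' | LC_ALL=C grep -c '[a-z]')
printf '%s\n' "$count"
```

Only the line containing `a` is counted: under LC_ALL=C the two bytes of é are two separate characters, and neither falls between a and z.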

On GNU systems, LC_ALL=C and LC_ALL=POSIX (or LC_MESSAGES=C|POSIX) override $LANGUAGE, while LC_ALL=anything-else wouldn't.

A few cases where you typically need to set LC_ALL=C:

sort -u or sort ... | uniq.... In many locales other than C, on some systems (notably GNU ones), some characters have the same sorting order. sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are bytes and all characters have different sorting order (which the C locale guarantees).
The same applies to the = operator of POSIX compliant expr or the == operator of POSIX compliant awks (mawk and gawk are not POSIX in that regard), which don't check whether two strings are identical but whether they sort the same.
Character ranges like in grep. If you mean to match a letter in the user's language, use grep '[[:alpha:]]' and don't modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'². [a-z] matches the characters that sort after a and before z (though with many APIs it's more complicated than that). In other locales, you generally don't know what those are. For instance some locales ignore case for sorting so [a-z] in some APIs like bash patterns, could include [B-Z] or [A-Y]. In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the latin letters from a to y with diacritics but not those of z (since z sorts before them) which I can't imagine would be what you want (why would you want to include é and not ź?).
floating point arithmetic in ksh93. ksh93 honours the decimal_point setting in LC_NUMERIC. If you write a script that contains a=$((1.2/7)), it will stop working when run by a user whose locale has comma as the decimal separator:
```
 $ ksh93 -c 'echo $((1.1/2))'
 0.55
 $ LANG=fr_FR.UTF-8  ksh93 -c 'echo $((1.1/2))'
 ksh93: 1.1/2: arithmetic syntax error
```

Then you need things like:

    #! /bin/ksh93 -
    float input="$1" # get it as input from the user in his locale
    float output
    arith() { typeset LC_ALL=C; (($@)); }
    arith output=input/1.2 # use the dot here as it will be interpreted
                           # under LC_ALL=C
    echo "$output" # output in the user's locale

As a side note: the , decimal separator conflicts with the , arithmetic operator which can cause even more confusion.

When you need characters to be bytes. Nowadays, most locales are UTF-8 based which means characters can take up from 1 to 6 bytes³. When dealing with data that is meant to be bytes, with text utilities, you'll want to set LC_ALL=C. It will also improve performance significantly because parsing UTF-8 data has a cost.
a corollary of the previous point: when processing text where you don't know what character set the input is written in, but can assume it's compatible with ASCII (as virtually all charsets are). For instance grep '<.*>' to look for lines containing a <, > pair will not work if you're in a UTF-8 locale and the input is encoded in a single-byte 8-bit character set like iso8859-15. That's because . only matches characters, and non-ASCII characters in iso8859-15 are likely not to form a valid character in UTF-8. On the other hand, LC_ALL=C grep '<.*>' will work because any byte value forms a valid character in the C locale.
Any time where you process input data or output data that is not intended from/for a human. If you're talking to a user, you may want to use their convention and language, but for instance, if you generate some numbers to feed some other application that expects English style decimal points, or English month names, you'll want to set LC_ALL=C:
```
 $ printf '%g\n' 1e-2
 0,01
 $ LC_ALL=C printf '%g\n' 1e-2
 0.01
 $ date +%b
 août
 $ LC_ALL=C date +%b
 Aug
```

That also applies to things like case insensitive comparison (like in grep -i) and case conversion (awk's toupper(), dd conv=ucase...). For instance:

    grep -i i

is not guaranteed to match on I in the user's locale. In some Turkish locales for instance, it doesn't as upper-case i is İ (note the dot) there and lower-case I is ı (note the missing dot).
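
Three of the cases listed above can be tried out with a short script. This is a sketch with made-up input, using only standard utilities; non-ASCII characters are written as octal byte escapes so that the result does not depend on the encoding of the script.

```shell
# 1. Characters are bytes: "é" in UTF-8 (octal 303 251) counts as two.
chars=$(printf '\303\251' | LC_ALL=C wc -m | tr -d ' ')

# 2. sort -u keeps every distinct line: ① and ② (UTF-8 bytes given in
#    octal) differ byte-wise, so both survive.
unique=$(printf '\342\221\240\n\342\221\241\n' | LC_ALL=C sort -u | wc -l | tr -d ' ')

# 3. grep -i pairs ASCII "i" with ASCII "I".
matches=$(printf 'I\n' | LC_ALL=C grep -ci 'i')

printf 'chars=%s unique=%s matches=%s\n' "$chars" "$unique" "$matches"
```

On a system whose C locale is byte-based, this should print `chars=2 unique=2 matches=1`.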

Notes

¹ again, only on ASCII based systems (the immense majority of systems). POSIX requires the collation order for the C locale to be that of the order of characters in the ASCII charset, even on EBCDIC systems which are not allowed to do the strcoll() === strcmp() optimisation in the C locale.

² Depending on the encoding of the text, that's not necessarily the right thing to do though. That's valid for UTF-8 or single-byte character sets (like iso-8859-1), but not necessarily non-UTF-8 multibyte character sets.

For instance, if you're in a zh_HK.big5hkscs locale (Hong Kong, using the Hong Kong variant of the BIG5 Chinese character encoding), and you want to look for English letters in a file encoded in that charset, doing either:

LC_ALL=C grep '[[:alpha:]]'

or

LC_ALL=C grep '[a-zA-Z]'

would be wrong, because in that charset (and many others, but hardly used since UTF-8 came out), a lot of characters contain bytes that correspond to the ASCII encoding of A-Za-z characters. For instance, all of A䨝䰲丕乙乜你再劀劈呸哻唥唧噀噦嚳坽 (and many more) contain the encoding of A. 䨝 is 0x96 0x41, and A is 0x41 like in ASCII. So our LC_ALL=C grep '[a-zA-Z]' would match on those lines that contain those characters as it would misinterpret those sequences of bytes.

LC_COLLATE=C grep '[A-Za-z]'

would work, but only if LC_ALL is not otherwise set (which would override LC_COLLATE). So you may end up having to do:

grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]'

if you wanted to look for English letters in a file encoded in the locale's encoding.
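
The false positive described in this note can be reproduced from the raw bytes alone, without a BIG5 locale installed. A sketch, taking the 0x96 0x41 byte pair from the text above; the variable names are arbitrary.

```shell
# Bytes 0x96 0x41 (octal 226 101): the BIG5-HKSCS encoding of 䨝 according
# to the note above. The second byte is also ASCII "A".
hex=$(printf '\226\101' | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"

# Under LC_ALL=C each byte is its own character, so the line "matches"
# even though it contains no English letter.
false_positive=$(printf '\226\101\n' | LC_ALL=C grep -c '[a-zA-Z]')
printf '%s\n' "$false_positive"
```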

³ some would argue it's rather 1 to 4 bytes these days now that Unicode code points (and the libraries that encode/decode UTF-8 data) have been arbitrarily restricted to code points U+0000 to U+10FFFF (0xD800 to 0xDFFF excluded) down from U+7FFFFFFF to accommodate the UTF-16 encoding, but some applications will still happily encode/decode 6-byte UTF-8 sequences (including the ones that fall in the 0xD800 .. 0xDFFF range).

+1, it's the best answer (for pointing out the overriding, etc). But lacks the (nice) examples of Ignacio's answer ^^ — Olivier Dulac, Commented Aug 22, 2013 at 11:08
A minor nitpick: The C locale is only required to support the "portable character set" (ASCII 0-127), and behavior for chars > 127 is technically unspecified. In practice, most programs will treat them as opaque data and pass them through as you described. But not all: in particular, Ruby may choke on char data with bytes > 127 if running in the C locale. I honestly don't know if that's technically "conformant", but we've seen it in the wild. — Andrew Janke, Commented Dec 16, 2015 at 19:26
@AndrewJanke, yes. Note that portable character set does not even imply ASCII nor 0-127. There has been a lot of discussion on the Austin group mailing list on what the properties of the "C" locale character set would be and the general consensus (and that will be clarified in the next spec) is that that charset would be single-byte, and encompass the full 8bit range (with the properties described here). In the meantime, yes there can be some divergence (as bug or because the spec is not explicit enough). In any case LC_ALL=C is the closest you can get to a sane behaviour. — Stéphane Chazelas, Commented Dec 16, 2015 at 20:11
@12431234123412341234123, the original UTF-8 encoding covers up to U+7FFFFFFF (6 bytes, and there are some extensions to go up to 13 bytes like perl's \x{7FFFFFFFFFFFFFFF}) and while the range of Unicode code points has been arbitrarily restricted to U+10FFFF (due to UTF-16 design limitation), some tools still recognise/produce 6 byte characters. That's what I meant by 6 byte characters. In Unix semantics, one character is one codepoint. Your more than one codepoint "characters" are more generally referenced as grapheme clusters to disambiguate from characters. — Stéphane Chazelas, Commented Apr 18, 2017 at 17:42
@UlysseBN, that has nothing to do with bash. It's about the locale definition. See lists.gnu.org/archive/html/bug-bash/2019-12/msg00098.html for a more recent example. I used to use ①②③④⑤ as striking examples (see for example What is the difference between "sort -u" and "sort | uniq"?), but they've now been fixed. Still in current GNU locales (as of glibc 2.30 at least), over 99% of characters don't have a defined order. See those 🧝 🧜 🧙 🧛 🧝 🧚 — Stéphane Chazelas, Commented Dec 27, 2019 at 10:32

slm · Accepted Answer · 2018-09-09 02:19:37Z

It forces applications to use the default language for output:

$ LC_ALL=es_ES man
¿Qué página de manual desea?

$ LC_ALL=C man
What manual page do you want?

and forces sorting to be byte-wise:

$ LC_ALL=en_US sort <<< $'a\nb\nA\nB'
a
A
b
B

$ LC_ALL=C sort <<< $'a\nb\nA\nB'
A
B
a
b

Answered by Ignacio Vazquez-Abrams, Aug 22, 2013 at 7:41 · edited by slm, Sep 9, 2018 at 2:19

+1 for good examples, but lacks the important info that is in Stephane's answer...
– Olivier Dulac, Commented Aug 22, 2013 at 11:06

What do you mean by default language?
– Stéphane Chazelas, Commented Sep 10, 2014 at 14:59

Yes, I understand the author can do whatever he likes including not do what it says on the tin. The thing is. US English is the only language that can be represented correctly with the charset in LC_ALL=C, the only language where the sorting order in LC_ALL=C (LC_COLLATE) makes sense, LC_ALL=C (LC_TIME) has English month and day names. I've never seen apps where LC_ALL=C returned message in a different language from LC_ALL=en LANGUAGE=en. So am I entitled to report a bug against a program if that's not the case? (not talking about apps not translated to English here).
– Stéphane Chazelas, Commented Sep 10, 2014 at 19:55

The problem is "US English is the only language that can be represented correctly with the charset in LC_ALL=C". This is usually only true in C/C++ programs when using narrow characters, but even then there are exceptions (since there are several languages that only use characters and symbols found in ASCII). Reporting a bug when the default language is not English will make you seem... bigoted.
– Ignacio Vazquez-Abrams, Commented Sep 10, 2014 at 22:37

Note that in English (meaning LANG=en_US.utf8) the messages can (and should) use unicode characters such as “” for quoting strings. Whereas in LANG=C, it only has ASCII ones (double quotes, backquotes and apostrophes).
– Ángel, Commented Mar 10, 2015 at 16:55

GAD3R · Accepted Answer · 2017-10-06 08:04:41Z

C is the default locale; "POSIX" is an alias of "C". I guess "C" is derived from ANSI C. Maybe ANSI C defines the "POSIX" locale.

Answered by Edward Shen, Aug 22, 2013 at 7:37 · edited by GAD3R, Oct 6, 2017 at 8:04

Both C and UNIX by far predate ANSI C.
– user, Commented Aug 22, 2013 at 10:55

@MichaelKjörling: So? I've seen pre-ANSI documentation, and it didn't have locales. Internally at AT&T Bell Labs, everyone spoke English.
– MSalters, Commented Aug 22, 2013 at 14:50

@MSalters The fact that pre-ANSI documentation for the C language doesn't mention locales (which may or may not imply that pre-ANSI, C had no concept of locales; after all, I'm pretty sure the language still does not, but that's beside the point) does not imply that the C locale name derives from "ANSI C".
– user, Commented Aug 22, 2013 at 21:18

@MichaelKjörling: You're missing the point. When locales were introduced, "C" already meant "ANSI C". That it meant K&R C in the past is irrelevant.
– MSalters, Commented Aug 23, 2013 at 7:36

What is the "default locale"?
– robertspierre, Commented Jan 28, 2023 at 22:15

nisetama · Accepted Answer · 2016-05-07 15:22:48Z

As far as I can tell, OS X uses code point collation order in UTF-8 locales, so it is an exception to some of the points mentioned in the answer by Stéphane Chazelas.

This prints 26 in OS X and 310 in Ubuntu:

export LC_ALL=en_US.UTF-8
printf %b $(printf '\\U%08x\\n' $(seq $((0x11)) $((0x10ffff))))|grep -a '[a-z]'|wc -l

The code below prints nothing in OS X, indicating that the input is sorted. The six surrogate characters that are removed cause an illegal byte sequence error.

export LC_ALL=en_US.UTF-8
for ((i=1;i<=0x1fffff;i++));do
  x=$(printf %04x $i)
  [[ $x = @(000a|d800|db7f|db80|dbff|dc00|dfff) ]]&&continue
  printf %b \\U$x\\n
done|sort -c

The code below prints nothing in OS X, indicating that there are no two consecutive code points (at least between U+000B and U+D7FF) that have the same collation order.

export LC_ALL=en_US.UTF-8
for ((i=0xb;i<=0xd7fe;i++));do
  printf %b $(printf '\\U%08x\\n' $((i+1)) $i)|sort -c 2>/dev/null&&echo $i
done

(The examples above use %b because printf \\U25 results in an error in zsh.)

Some characters and sequences of characters that have the same collation order in GNU systems do not have the same collation order in OS X. This prints ① first in OS X (using either OS X's sort or GNU sort) but ② first in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %s\\n ② ①|sort

This prints three lines in OS X (using either OS X's sort or GNU sort) but one line in Ubuntu:

export LC_ALL=en_US.UTF-8;printf %b\\n \\u0d4c \\u0d57 \\u0d46\\u0d57|sort -u

Does anyone know why there is this difference?
– 1.61803, Commented Feb 23, 2019 at 9:49

HalosGhost · Accepted Answer · 2016-10-12 01:14:53Z

It appears that LC_COLLATE controls the "alphabetical order" used by ls, as well. The US locale will sort as follows:

a.C
aFilename.C
aFilename.H
a.H

basically ignoring the periods. You might prefer:

a.C
a.H
aFilename.C
aFilename.H

I certainly do. Setting LC_COLLATE to C accomplishes this. Note that it will also sort lower case after all capitals:

A.C
A.H
AFilename.C
a.C
a.H
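
The byte-wise ordering can be reproduced in a scratch directory. A sketch only: `Z.C` is an extra file name added for illustration, and LC_ALL=C is used rather than LC_COLLATE=C so that the result does not depend on whether the caller has LC_ALL set (as explained in the first answer, LC_ALL overrides LC_COLLATE).

```shell
# Build a scratch directory with the file names from the example, plus one
# upper-case name (Z.C, an addition for illustration).
dir=$(mktemp -d)
touch "$dir/a.C" "$dir/aFilename.C" "$dir/aFilename.H" "$dir/a.H" "$dir/Z.C"

# LC_ALL=C is used instead of LC_COLLATE=C so that a caller's LC_ALL
# cannot override the collation setting.
listing=$(cd "$dir" && LC_ALL=C ls)
printf '%s\n' "$listing"

rm -r "$dir"
```

`Z.C` is listed first, followed by `a.C`, `a.H`, `aFilename.C` and `aFilename.H`, since in byte order upper-case letters and the period all come before lower-case letters.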

Answered by SteveInCO, Oct 12, 2016 at 0:41 · edited by HalosGhost, Oct 12, 2016 at 1:14

ー PupSoZeyDe ー · Accepted Answer · 2021-11-07 02:53:46Z

As an addition to Ignacio Vazquez-Abrams's answer: for some console output, you need to set the variable for the whole session (with export) rather than for a single command.

For example,

As he mentions, in most cases it does work in this way

$ man
What manual page do you want?
$ LC_ALL=es_ES man
¿Qué página de manual desea?

Yet it doesn't work in some cases

$ LC_ALL=es_ES cpio
-bash: /usr/bin/cpio: Permission denied

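That message is printed by bash itself (note the -bash: prefix), not by cpio, so the per-command assignment does not affect it. You need to export the variable in the session instead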

$ export LC_ALL=es_ES
$ cpio
-bash: /usr/bin/cpio: Permiso denegado

Then switch back to English if needed

$ export LC_ALL=C
$ cpio
-bash: /usr/bin/cpio: Permission denied
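
The difference in scope between the two forms can be seen without any translations installed. A minimal sketch; the variable names `before`, `child` and `after` are arbitrary.

```shell
# Remember the shell's own setting before the experiment.
before=${LC_ALL-unset}

# Per-command assignment: only the child process sees it.
child=$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')

# The current shell's LC_ALL is unchanged afterwards.
after=${LC_ALL-unset}

printf 'child saw: %s\nshell before: %s\nshell after: %s\n' "$child" "$before" "$after"
```

The child reports `C` while the shell's own value is the same before and after, which is why messages produced by the shell itself only change once the variable is exported.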

Also note that for some non-alphabetical languages you'd better add ".UTF-8" as others mention.

For example, for the Japanese language

$ export LC_ALL=ja_JP
$ cpio
-bash: /usr/bin/cpio: Ĥ

$ export LC_ALL=ja_JP.UTF-8
$ cpio
-bash: /usr/bin/cpio: 許可がありません
