Consider the following example:

IFS=:
x="a   :b"   # three spaces
echo ["$x"]  # no word splitting
# [a   :b]   # as is
echo [$x]    # word splitting 
# [a    b]   # four spaces

Word splitting identifies the words "a   " (three spaces) and "b", separated by the colon, then echo joins the words with a space in the middle.
However, when using the value of $x as a function argument, I find it difficult to interpret the results.

args(){ echo ["$*"];}
args a   :b  # three spaces
# [a::b]

and:

args(){ echo [$*];}
args a   :b  # three spaces
# [a  b]     # two spaces

$* expands to the value of all the positional parameters combined. Also, "$*" is equivalent to "$1c$2", where c is the first character of the value of the IFS variable.

args(){ echo ["$1"]["$2"]; }
args a   :b  # three spaces
# [a][:b]

and:

args(){ echo [$1][$2]; }
args a   :b  # three spaces
# [a][ b]   

Word splitting should always occur when there are unquoted expansions. Here "$1" and $1 are the same and in both cases they do not use the : delimiter. [$2] -> [ b] is also unclear.

Probably, before applying IFS-splitting, other tokenization rules are used, but I was unable to find them.

1 Answer

Word splitting only applies to unquoted expansions (parameter expansion, arithmetic expansion and command substitution) in modern Bourne-like shells (in zsh, only command substitution unless you use an emulation mode).

When you do:

args a    :b

Word splitting is not involved at all.

It's the shell parsing that tokenises those, finds the first one is not one of its keywords and so it's a simple command with 3 arguments: args, a and :b. The amount of space won't make any difference there. Note that it's not only spaces, but also tabs, and in some shells (like yash or bash) any character considered as blank in your locale (though in the case of bash, not the multibyte ones)¹.
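A minimal sketch of that point, assuming a hypothetical helper show_args (not part of the original question) that prints each argument it receives inside angle brackets. Even with IFS set to a colon, the literal words on the command line are delimited by the parser, so the number of spaces and the colon in :b play no role:

```shell
# Hypothetical helper: print every argument on its own line, wrapped in <>.
show_args() {
  printf '<%s>\n' "$@"
}

IFS=:
# No expansion is involved here, so IFS is irrelevant: the parser itself
# delimits the words "a" and ":b", however many blanks separate them.
out=$(show_args a   :b)
unset IFS

printf '%s\n' "$out"
```

Using printf with one format per argument, rather than echo, makes the argument boundaries visible.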

Even in the Bourne shell, where word splitting also applied to unquoted arguments of commands regardless of whether they were the result of expansions or not, that would be done on top of (long after) the tokenising and syntax parsing.

In the Bourne shell, in

IFS=i
while bib=did edit foo

The shell would not parse that as:

"wh" "le b" "b=d" "d ed" "t foo"

Instead, it is first parsed as a while with a simple command. Then the edit word of that simple command (as it's an argument, unlike the bib=did word, which is an assignment) would be further split into ed and t, so that the ed command with the 3 arguments ed, t and foo would be run as the condition of that while loop.

Word splitting is not part of the syntax parsing. It's like an operator that is applied implicitly to arguments (also to for loop words, arrays and, in some shells, the target of redirections and a few other contexts) for the parts of them that are not quoted. What's confusing is that it's done implicitly. You don't do cmd split($x), you do cmd $x and the split() (actually glob(split())) is implied. In zsh, you have to request it explicitly for parameter expansions (split($x) is $=x there ($= looking like a pair of scissors)).
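To see the implicit split in isolation, here is a small sketch for a POSIX-like shell such as bash or dash (not zsh in its native mode). The helper show_args is a hypothetical name used only for illustration, and set -f is there to switch off the glob() half so that only the split() half is observed:

```shell
# Hypothetical helper: print every argument on its own line, wrapped in <>.
show_args() {
  printf '<%s>\n' "$@"
}

IFS=:
x='a   :b'                  # three spaces, as in the question
set -f                      # disable globbing: only splitting is of interest
unquoted=$(show_args $x)    # split on ":" into "a   " and "b"
quoted=$(show_args "$x")    # quoted: a single argument, no splitting
set +f
unset IFS

printf '%s\n' "$unquoted" "$quoted"
```

Since the space is not in IFS here, the three spaces stay part of the first field.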

So, now, for your examples:

args(){ echo ["$*"];}
args a   :b  # three spaces
# [a::b]

The a and :b arguments of args are joined with the first character of $IFS, which gives a::b (note that it's a bad idea to use [...] here as it's a globbing operator).
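The joining rule can be checked on its own with a short sketch. The function name join_args and the second IFS value are assumptions made for illustration; only the first character of IFS is used as the joiner:

```shell
# Hypothetical helper: print the quoted "$*" expansion of its arguments.
join_args() {
  printf '%s\n' "$*"
}

IFS=:
colon_joined=$(join_args a   :b)   # "a" + ":" + ":b"

IFS=-:
dash_joined=$(join_args a   :b)    # first character of IFS is now "-"
unset IFS

printf '%s\n' "$colon_joined" "$dash_joined"
```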

args(){ echo [$*];}
args a   :b  # three spaces
# [a  b]     # two spaces

$* (which contains a::b) is split into a, the empty string and b. So it's:

echo '[a' '' 'b]'

args(){ echo ["$1"]["$2"]; }
args a   :b  # three spaces
# [a][:b]

No surprise here, as there is no word splitting.

args(){ echo [$1][$2]; }
args a   :b  # three spaces
# [a][ b]   

That's like:

echo '[a][' 'b]'

as $2 (:b) would be split into the empty string and b: the empty string stays attached to the preceding [a][ and b to the following ].
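How an expansion in the middle of a word gets split can be verified with a sketch like the one below. The helpers show_args, demo_word and demo_param are hypothetical names, and set -f is used so that the [...] characters are not treated as glob patterns:

```shell
# Hypothetical helper: print every argument on its own line, wrapped in <>.
show_args() {
  printf '<%s>\n' "$@"
}

# The expansions are left unquoted on purpose.
demo_word()  { show_args [$1][$2]; }   # expansion embedded in a larger word
demo_param() { show_args $2; }         # expansion on its own

IFS=:
set -f                                 # keep [ and ] literal
word_fields=$(demo_word a   :b)
param_fields=$(demo_param a   :b)
set +f
unset IFS

printf '%s\n' "$word_fields" "$param_fields"
```

The leading colon in :b yields an empty first field; inside a larger word that empty field is glued to the literal text before it.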

One case where you will see variations between implementations is when $IFS is empty.

In:

set a b
IFS=
printf '<%s>\n' $*

In some shells (most nowadays), you see

<a>
<b>

And not <ab> even though "$*" would expand to ab. Those shells still separate the a and b positional parameters, and that has now been made a POSIX requirement in the latest version of the standard.
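A sketch for comparing the two forms in whichever shell is at hand. The variable names are illustrative; the quoted result is the same everywhere, while the unquoted one is exactly the case that varies between implementations, so no output is claimed for it here:

```shell
set -- a b
IFS=
joined="$*"                       # joined with nothing, since IFS is empty
unquoted=$(printf '<%s>\n' $*)    # shell-dependent: separate fields or one
unset IFS

printf 'quoted "$*": <%s>\n' "$joined"
printf 'unquoted $*:\n%s\n' "$unquoted"
```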

If you did:

set a b
IFS=
var="$*" # note that the behaviour for var=$* is unspecified
printf '<%s>\n' $var

you'd see <ab> as the information that a and b were 2 separate arguments was lost when assigned to $var.


¹ Of course, it's not only blanks that delimit words. Special tokens in the shell syntax do as well, the list of which depends on the context. In most contexts, |, ||, &, ;, newline, <, >, >>... delimit words. In ksh93 for instance, you can write a blank-less command like:

while({([[(:)]])})&&((1||1))do(:);uname<&2|tee>(rev)file;done
  • Thank you. So I should have word splitting, given IFS=: and x="foo :bar", in args $x. What would be the sequence here? For example, after the token args and $x are identified, the latter is expanded and retokenised.
    – antonio
    Commented Dec 4, 2017 at 12:29
  • @antonio, there's no retokenising. There's the tokenising that is done when the shell parses its syntax, and there's word splitting, a special operator applied to unquoted expansions. With args $x, it's as if you had written run_simple_command("args", split($x)) in another language (well, run_simple_command(glob("args"), glob(split($x))), as globbing applies as well). For the Bourne shell, it would have been like run_simple_command(glob(split("args")), glob(split($x))). See also: Security implications of forgetting to quote a variable in bash/POSIX shells Commented Dec 4, 2017 at 12:43
  • @StéphaneChazelas I think there is an inaccuracy here (second example): $* (which contains a::b) is split into a, the empty string and b. Why is $* a::b? I think it goes like this: $1 = a, $2 = :b, and $* passes a through unchanged, then splits :b into two parts, where the left is the empty string and the right is b. So, we get {a}{}{b}. Try args :b. You also get an empty string: {}{b}.
    – MiniMax
    Commented Dec 4, 2017 at 12:50
  • @StéphaneChazelas I used this code for testing: args(){ echo '$1 = '"$1"; echo '$2 = '"$2"; printf '{%s}' $*; echo ""; } args :b; args a :b;
    – MiniMax
    Commented Dec 4, 2017 at 12:56
  • @MiniMax, that boils down to my empty IFS case. Whether $* becomes split into "a", "" and "b" because it's separate "a" and ":b", with the latter split into "" and "b" (like in some shells), or because it's $* first joined into "a::b" and then split into "a", "" and "b" (like in some other shells), produces the same result in all cases except the empty IFS case. Commented Dec 4, 2017 at 13:04
