Replace any number of tabs and spaces with single new line in Linux?

Question

Suppose I have a (potentially very large) text file that contains a word list with whitespace interjected. For example, it might look like this:

Cat                           Dog
Soup                          Rat
Cass                          Audrey

I want each word on a separate line (with no whitespace), like this:

Cat
Dog
Soup
Rat
Cass
Audrey

I can do a simple tr -d " " to make that into:

CatDog
SoupRat
CassAudrey

(but that is not what I want).

I do not know what type of blank space separates those words, so assume that it's some combination of ordinary ASCII spaces and tabs. (We can assume that there are no invisible Unicode characters like em spaces and zero-width thingies.) Naturally, the words do not contain whitespace, so "à la", "alma mater", "apple pie", "at large" and "ice cream" are not valid words.

Assume that words may contain (non-blank) non-alphabetic characters, such as "AC/DC", "add-on", "AT&T", "audio-visual", "can't", "carbon-14", "jack-o'-lantern", "mother-in-law", "o'clock", "O'Reilly", "RS-232" and "3-D". Ideally the solution should tolerate non-ASCII characters, as in "Ångström", "Gödel", "naïve", "résumé" and "smörgåsbord".

How do I get rid of all those spaces while preserving (and isolating) the indented words using common Unix/Linux tools like tr, sed or awk?

It would be great if the solution would also work for more general cases of the stated problem; i.e., not just two-column text, but also random arrangements like:

          Once    upon
    a   midnight
                    dreary
while                     I pondered
       weak    and weary
           Over                many
a   quaint  and     curious     volume
 of forgotten lore

set -f; printf '%s\n' $(<file); set +f. This is halfway a joke, because there are other types of expansion in the shell besides globs, but in some hackish cases it might be a very simple solution. — kojiro, Commented Nov 14, 2017 at 12:25
This is not a question "describing a problem that can't be reproduced and seemingly went away on its own (or went away when a typo was fixed)". This question describes a reproducible problem, whose solution(s) are likely to help future readers. The fact that the OP didn't actually have the problem they described does not invalidate the question, per se. — G-Man Says 'Reinstate Monica', Commented Apr 13, 2021 at 22:27
@G-Man it looks to me like the OP said in version 3 that "So it was looking like some words were appearing right-justified. I went through the same file more slowly with vim and there were no right-justified words." which sounds to me like they realized that there wasn't actually a problem to solve. If we want to reopen this question for the existing answers, I'd suggest editing the Q down to focus on the problem that they solve. — Jeff Schaller, Commented Apr 14, 2021 at 1:04
@Jeff: Well, I acknowledged that "the OP didn't actually have the problem they described". So, what, exactly, are you suggesting? That I delete the OP's edit (i.e., the last paragraph)? Or should I purge all references to the “back story” of how a person might land in the situation of having a file like the one described in the question? — G-Man Says 'Reinstate Monica', Commented Apr 14, 2021 at 5:19
@G-ManSays'ReinstateMonica' I personally think we should keep this question closed, since the OP is in no position to accept an answer. I'll abstain from voting in the reopen queue, though. If we think that this is the best question we have on removing spaces, then I would say to edit the Q to focus on that, removing the backstory and "I didn't actually have this problem" parts. — Jeff Schaller, Commented Apr 14, 2021 at 12:43
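The word-splitting trick from kojiro's comment can be sketched roughly as follows. This is only an illustration: the file name words.txt and the temporary directory are made up for the demo, and cat is used instead of $(<file) so that it also runs in plain sh. Note that the whole file is read into memory.

```shell
# Hypothetical sample file; the name words.txt is an assumption for this demo.
tmpdir=$(mktemp -d)
printf '  Once \t upon\n\ta   *  midnight\n' > "$tmpdir/words.txt"

# set -f turns off pathname expansion, so the literal * is kept as a word
# instead of being replaced by file names. The unquoted $(...) is then
# split on the default IFS characters: space, tab and newline.
set -f
split_words=$(printf '%s\n' $(cat "$tmpdir/words.txt"))
set +f

printf '%s\n' "$split_words"
rm -r "$tmpdir"
```

Leading whitespace and runs of mixed blanks produce no empty words here, because the shell's field splitting discards them.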

G-Man Says 'Reinstate Monica' · Accepted Answer · 2017-11-14 07:05:26Z

14

etopylight was almost right:

tr -s ' \t' '\n'

because the question asks to replace tabs, too.
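A small run on the question's sample shape may help to see what the command does. The input below is invented for the demo (it mixes tabs and spaces on purpose), and the behavior of a one-character second set is as in GNU tr; the '[\n*]' spelling is the one to prefer when portability matters.

```shell
# Sample input mixing tabs and spaces (printf expands \t to a real tab).
sample=$(printf 'Cat \t  Dog\nSoup\t\tRat\nCass      Audrey')

# tr maps space and tab to newline; -s then squeezes every run of
# newlines (including the original line breaks) into a single newline.
words=$(printf '%s\n' "$sample" | tr -s ' \t' '\n')
printf '%s\n' "$words"
```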

answered Nov 14, 2017 at 7:05


2

The POSIX equivalent would be tr -s ' \t' '[\n*]'. See also tr -s '[:space:]' '[\n*]' or tr -s '[:blank:]' '[\n*]'
– Stéphane Chazelas
Commented Nov 14, 2017 at 18:45
This fails on "AC/DC", "add-on", "AT&T". You need something like tr -s ' \t,."' '\n' <file
– user232326
Commented Apr 20, 2021 at 22:56
@Isaac Well, that’s debatable. The question says, “Assume that words may contain (non-blank) non-alphabetic characters”. (Disclosure: I edited the question to say that, with the intent of keeping my answer correct.) If words may contain non-alphabetic characters, then "AC/DC", "add-on" and  "AT&T" are all words. And the OP didn’t give us any clue how they want “Mr.”, “Mrs.”, “Ph.D” or “Q.E.D.” to be handled. While comma (and semicolon) maybe should always be separators, people with 20th century technology sometimes used " to denote umlaut / dieresis; e.g., na"ive for naïve.
– G-Man Says 'Reinstate Monica'
Commented Apr 20, 2021 at 23:48
Fair enough. @G-ManSays'ReinstateMonica'
– user232326
Commented Apr 20, 2021 at 23:56
That would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces).
– Ed Morton
Commented Dec 24, 2021 at 8:26


thecarpy · Accepted Answer · 2017-11-14 07:46:26Z

10

Basically, you could do it in GNU sed:

sed 's/\s\+/\n/g'

There you go...
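For input with leading or trailing whitespace or blank lines, one possible extension (not part of the original one-liner) is to trim each line and drop empty ones before splitting. This sketch assumes GNU sed, since \s, \+ and \n in the replacement are GNU extensions; the sample text is invented.

```shell
# Requires GNU sed: \s, \+ and \n in the replacement are GNU extensions.
sample=$(printf '   Once    upon\n\n \t \na\tmidnight   ')

# 1) strip leading whitespace, 2) strip trailing whitespace,
# 3) delete lines that are now empty, 4) split the remaining runs.
clean=$(printf '%s\n' "$sample" |
  sed -e 's/^\s\+//' -e 's/\s\+$//' -e '/^$/d' -e 's/\s\+/\n/g')
printf '%s\n' "$clean"
```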

edited Nov 14, 2017 at 7:46

answered Nov 14, 2017 at 6:44


That would output multiple blank lines given the OPs 2nd set of input (the one that starts with spaces).
– Ed Morton
Commented Dec 24, 2021 at 8:27


Ulrich Schwarz · Accepted Answer · 2017-11-14 06:36:25Z

6

You should be able to use

sed -e 's/[[:space:]]\{1,\}/\n/'

to replace any sequence of one or more whitespace characters (including oddities like formfeed and vertical tabs) with a single newline.
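Note that a substitution without the g flag only touches the first match on each line. The following comparison, on an invented one-line sample and assuming GNU sed for \n in the replacement, shows the difference the flag makes:

```shell
# Requires GNU sed for \n in the replacement.
line='Once upon a midnight'

# Without g, only the first run of whitespace on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/[[:space:]]\{1,\}/\n/')
# With g, every run of whitespace on the line is replaced.
every_run=$(printf '%s\n' "$line" | sed -e 's/[[:space:]]\{1,\}/\n/g')

printf 'without g:\n%s\n\nwith g:\n%s\n' "$first_only" "$every_run"
```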

answered Nov 14, 2017 at 6:36


7

Almost portable, but most sed versions will insert a backslash and an n, because \n in the replacement is undefined by the standard. Use a literal newline instead (typically by typing backslash, Ctrl-V, Ctrl-J).
– Philippos
Commented Nov 14, 2017 at 7:29
That would output multiple blank lines and lines containing spaces given the OPs 2nd set of input (the one that starts with spaces).
– Ed Morton
Commented Dec 24, 2021 at 8:28


JJoao · Accepted Answer · 2017-11-14 14:41:47Z

2

If GNU grep is available:

grep -Po '\S+'

answered Nov 14, 2017 at 14:41


1

That would work, though you don't need -P (which even in new versions of GNU grep is still considered "experimental" in combination with other grep options so I personally avoid), it'd work the same with -E.
– Ed Morton
Commented Dec 24, 2021 at 8:32


G-Man Says 'Reinstate Monica' · Accepted Answer · 2021-12-26 22:42:46Z

As the default behavior for awk already is to split on any number of blanks (spaces, tabs), one could as well use that feature, just setting the output field separator to "\n" and rebuilding $0. An open question for the task, however, is: How do you want empty lines to be handled?

To just print them as they are:

awk -v OFS='\n' '{$1 = $1; print}' file

To additionally filter out empty lines:

awk -v OFS='\n' 'NF {$1 = $1; print}' file

Beware of Windows line endings (containing \r) in a Linux setting, however: awk does not necessarily regard lines with \r as empty and in that case would output them too. So filter text files with CRLF endings through dos2unix first.
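Where dos2unix is not installed, deleting the carriage returns with tr is one possible stand-in. This is a swapped-in technique, not what the answer prescribes, and it removes every \r in the data, not only those at line ends. The sample function crlf_sample is invented for the demo.

```shell
# Sample with Windows (CRLF) line endings, including one "empty" line
# that actually holds a lone carriage return.
crlf_sample() { printf 'Cat \t Dog\r\n\r\nSoup Rat\r\n'; }

# tr -d '\r' removes every carriage return before awk sees the data,
# so the NF test can recognise the middle line as empty.
words=$(crlf_sample | tr -d '\r' | awk -v OFS='\n' 'NF {$1 = $1; print}')
printf '%s\n' "$words"
```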

score 2 · Accepted Answer · 2022-01-02 01:54:30Z

It could be done in a long list of ways:

tr:  tr -s ' \t' '\n' <file  (tabs and spaces only)
tr:  tr -s ' \t,."' '\n' <file  (also splits on comma, period and double quote)
tr:  tr -s '[:blank:]' '\n' <file  (tabs and spaces only)
tr:  tr -s '[:space:]' '\n' <file  (space and \t\n\v\f\r)
sed: sed -e 's/[ \t]/\n/g' -e 's/\n\n*/\n/g' file  (GNU sed, for \n)
sed: sed 's/[ \t".,]\+/\n/g' file | tr -s '\n'  (GNU sed, for \n)

Description

tr

The most basic tool is tr, but it works character by character and has no notion of lines or words. So, a basic

tr -s ' ' '\n' <file

would convert each run of (repeated) spaces to one newline. Because the original newlines are squeezed together with the new ones, runs in the middle of the file cause no trouble, but an empty line is still generated at the very start of the output if the file starts with spaces. There is no way to correct that in tr alone. It can be done by adding a filter that removes empty lines, like sed '/^$/d':

tr -s ' ' '\n' <file | sed '/^$/d'

Additional characters (like tabs) could be added:

tr -s ' \t' '\n' <file | sed '/^$/d'

Other characters that might come from a text paragraph (like this one), such as commas and periods, could also be added. That changes the definition of what a word is.

tr -s ' \t,.()' '\n' <file | sed '/^$/d'
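To see the effect of the extra filter, here is a small before-and-after check on an invented sample that starts with spaces. It assumes a tr that pads the one-character second set the way GNU tr does.

```shell
# Sample whose first line starts with spaces.
sample=$(printf '      Once    upon\n    a   midnight')

raw=$(printf '%s\n' "$sample" | tr -s ' \t' '\n')
filtered=$(printf '%s\n' "$sample" | tr -s ' \t' '\n' | sed '/^$/d')

# grep -c '^$' counts the empty lines in each result.
printf 'empty lines before filter: %s\n' "$(printf '%s\n' "$raw" | grep -c '^$')"
printf 'empty lines after filter:  %s\n' "$(printf '%s\n' "$filtered" | grep -c '^$')"
```

Only the leading run yields an empty line; the run that spans the line break between "upon" and "a" is squeezed into a single newline.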

sed

sed is more capable than tr (and may be slower). It can change runs of certain characters to one newline. The basic idea (in GNU sed, which accepts \n in the replacement) would be:

sed 's/[ \t]\{1,\}/\n/g'

In other seds that are not able to use \n in the right side of a replacement we need to use an actual newline (and, sometimes, an actual tab):

sed 's/[   ]\{1,\}/\
/g'

So, sed helps with some issues but makes others unnecessarily complex.
Read 4.1. How do I insert a newline into the RHS of a substitution?
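Typing a literal tab inside the bracket expression is easy to get wrong when copying and pasting. One way around that, sketched here with an invented sample, is to keep the tab in a shell variable (the name tab is arbitrary) and splice it into the script, while still using the backslash plus literal newline form in the replacement:

```shell
# Keep a real tab in a variable so the script survives copy and paste.
tab=$(printf '\t')
sample=$(printf 'Cat \t Dog\nSoup\t\tRat')

# The backslash at the end of the first script line escapes a literal
# newline, which is the portable way to put a newline in the replacement.
words=$(printf '%s\n' "$sample" | sed 's/[ '"$tab"']\{1,\}/\
/g')
printf '%s\n' "$words"
```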

grep

It could be done in grep as well. Just match sequences of non-whitespace characters:

grep -o '[^         ]\{1,\}'     ## explicit space-tab.

In GNU grep, the equivalent \S+ could be used:

grep -Eo '\S+'     ## or grep -o '\S\+'
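A variant that avoids both the literal tab and the GNU-only \S is the POSIX character class [^[:space:]]. The sketch below uses an invented sample and assumes a grep that supports -o, which is not in POSIX but is available in GNU, BSD and BusyBox grep.

```shell
# Sample with leading whitespace, an empty line and a whitespace-only line.
sample=$(printf '      Once    upon\n\n \t \n    a   midnight')

# -o prints each match on its own line; a match is a run of
# non-whitespace characters, so blank lines produce no output at all.
words=$(printf '%s\n' "$sample" | grep -o '[^[:space:]]\{1,\}')
printf '%s\n' "$words"
```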

awk

In awk we get still more power from the tool. It gets quite simple:

awk '{for(i=1;i<=NF;i++) {print $i}}' file

That just prints every field of each line. If a line has no fields (NF is 0), nothing is printed.

That is using the default FS which delimits fields on repeated spaces, tabs or newlines.

A similar solution could be done with RS (for awks that allow regex separators):

awk -v RS="[ \t\n]+" 'NF'

That tells awk to split records on runs of spaces, tabs or newlines and to print a record only if it has at least one non-empty field (the NF condition).
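As a quick cross-check, the two awk approaches can be run on the same input and compared. The sample is invented, and the second command assumes an awk that accepts a regular expression as RS (for example GNU awk or mawk).

```shell
# Sample with leading, trailing and whitespace-only lines.
sample=$(printf '   Once \t upon\n\n \t \na   midnight   ')

# Field loop: works in any POSIX awk.
by_field=$(printf '%s\n' "$sample" | awk '{ for (i = 1; i <= NF; i++) print $i }')
# Regex RS: each word becomes a record; NF skips the empty ones.
by_record=$(printf '%s\n' "$sample" | awk -v RS='[ \t\n]+' 'NF')

printf '%s\n' "$by_field"
[ "$by_field" = "$by_record" ] && echo "both approaches agree"
```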

I believe all of those would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces). — Ed Morton, Commented Dec 24, 2021 at 8:30
@EdMorton Yes, technically, that is correct for the initial ways I posted. It is quite simple to remove empty lines, anyway, so, I am not very worried for this issue. However, I extended the description to solve that from different points of view. I believe that I have clarified that issue. — user232326, Commented Jan 2, 2022 at 1:30

etopylight · Accepted Answer · 2017-11-14 06:36:53Z

1

You can use the -s option of tr to squeeze repeated characters into one and replace them with a newline:

tr -s " " "\n"

answered Nov 14, 2017 at 6:36


That wouldn't handle tabs in the input and would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces).
– Ed Morton
Commented Dec 24, 2021 at 8:29


Ed Morton · Accepted Answer · 2021-12-24 08:34:33Z

1

With GNU awk or any other awk that supports multi-char RS (e.g. newer versions of mawk):

$ awk -v RS='[[:space:]]+' '$0!=""' file
Cat
Dog
Soup
Rat
Cass
Audrey

$ awk -v RS='[[:space:]]+' '$0!=""' file2
Once
upon
a
midnight
dreary
while
I
pondered
weak
and
weary
Over
many
a
quaint
and
curious
volume
of
forgotten
lore

edited Dec 24, 2021 at 8:34

answered Dec 24, 2021 at 8:24


Should awk -v RS='[[:space:]]+' 'NF' file2 not work ?
– user232326
Commented Jan 2, 2022 at 1:27
@ImHere yes, I think that would work too.
– Ed Morton
Commented Jan 2, 2022 at 13:33


nezabudka · Accepted Answer · 2021-12-27 06:16:02Z

1

fmt (the "simple optimal text formatter") with a maximum width of 1 puts each word on its own line, and column -t then removes the leading spaces:

fmt -1 file | column -t

In awk, the NF condition prevents a possible empty first line from being output:

awk 'NF' RS='[[:space:]]+' file

edited Dec 27, 2021 at 6:16

answered Dec 27, 2021 at 5:32

