7

Suppose I have a (potentially very large) text file that contains a word list with whitespace interjected.  For example, it might look like this:

Cat                           Dog
Soup                          Rat
Cass                          Audrey

I want each word on a separate line (with no whitespace), like this:

Cat
Dog
Soup
Rat
Cass
Audrey

I can do a simple tr -d " " to make that into:

CatDog
SoupRat
CassAudrey

(but that is not what I want).

I do not know what type of blank space separates those words, so assume that it's some combination of ordinary ASCII spaces and tabs.  (We can assume that there are no invisible Unicode characters like em spaces and zero-width thingies.)  Naturally, the words do not contain whitespace, so "à la", "alma mater", "apple pie", "at large" and "ice cream" are not valid words.

Assume that words may contain (non-blank) non-alphabetic characters, such as "AC/DC", "add-on", "AT&T", "audio-visual", "can't", "carbon-14", "jack-o'-lantern", "mother-in-law", "o'clock", "O'Reilly", "RS-232" and "3-D".  Ideally the solution should tolerate non-ASCII characters, as in "Ångström", "Gödel", "naïve", "résumé" and "smörgåsbord".

How do I get rid of all those spaces while preserving (and isolating) the indented words using common Unix/Linux tools like tr, sed or awk?

It would be great if the solution would also work for more general cases of the stated problem; i.e., not just two-column text, but also random arrangements like:

          Once    upon
    a   midnight
                    dreary
while                     I pondered
       weak    and weary
           Over                many
a   quaint  and     curious     volume
 of forgotten lore
5
  • set -f; printf '%s\n' $(<file); set +f. This is halfway a joke, because there are other types of expansion in the shell besides globs, but in some hackish cases it might be a very simple solution.
    – kojiro
    Commented Nov 14, 2017 at 12:25
  • 1
    This is not a question "describing a problem that can't be reproduced and seemingly went away on its own (or went away when a typo was fixed)". This question describes a reproducible problem, whose solution(s) are likely to help future readers.  The fact that the OP didn't actually have the problem they described does not invalidate the question, per se. Commented Apr 13, 2021 at 22:27
  • @G-Man it looks to me like the OP said in version 3 that "So it was looking like some words were appearing right-justified. I went through the same file more slowly with vim and there were no right-justified words." which sounds to me like they realized that there wasn't actually a problem to solve. If we want to reopen this question for the existing answers, I'd suggest editing the Q down to focus on the problem that they solve.
    – Jeff Schaller
    Commented Apr 14, 2021 at 1:04
  • @Jeff: Well, I acknowledged that "the OP didn't actually have the problem they described".  So, what, exactly, are you suggesting?  That I delete the OP's edit (i.e., the last paragraph)?  Or should I purge all references to the “back story” of how a person might land in the situation of having a file like the one described in the question? Commented Apr 14, 2021 at 5:19
  • @G-ManSays'ReinstateMonica' I personally think we should keep this question closed, since the OP is in no position to accept an answer. I'll abstain from voting in the reopen queue, though. If we think that this is the best question we have on removing spaces, then I would say to edit the Q to focus on that, removing the backstory and "I didn't actually have this problem" parts.
    – Jeff Schaller
    Commented Apr 14, 2021 at 12:43

9 Answers

14

etopylight was almost right:

tr -s ' \t' '\n'

because the question asks to replace tabs, too.

5
  • 2
    The POSIX equivalent would be tr -s ' \t' '[\n*]'. See also tr -s '[:space:]' '[\n*]' or tr -s '[:blank:]' '[\n*]' Commented Nov 14, 2017 at 18:45
  • This fails on "AC/DC", "add-on", "AT&T". You need something like tr -s ' \t,."' '\n' <file
    – user232326
    Commented Apr 20, 2021 at 22:56
  • @Isaac Well, that’s debatable. The question says, “Assume that words may contain (non-blank) non-alphabetic characters”. (Disclosure: I edited the question to say that, with the intent of keeping my answer correct.) If words may contain non-alphabetic characters, then "AC/DC", "add-on" and  "AT&T" are all words. And the OP didn’t give us any clue how they want “Mr.”, “Mrs.”, “Ph.D” or “Q.E.D.” to be handled. While comma (and semicolon) maybe should always be separators, people with 20th century technology sometimes used " to denote umlaut / dieresis; e.g., na"ive for naïve. Commented Apr 20, 2021 at 23:48
  • Fair enough. @G-ManSays'ReinstateMonica'
    – user232326
    Commented Apr 20, 2021 at 23:56
  • That would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces).
    – Ed Morton
    Commented Dec 24, 2021 at 8:26
10

Basically, you could do it in GNU sed:

sed 's/\s\+/\n/g'

There you go...

1
  • That would output multiple blank lines given the OPs 2nd set of input (the one that starts with spaces).
    – Ed Morton
    Commented Dec 24, 2021 at 8:27
6

You should be able to use

sed -e 's/[[:space:]]\{1,\}/\n/'

to replace any sequence of one or more whitespace characters (including oddities like formfeed and vertical tabs) with a single newline.
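
Note that, as written, the command has no g flag, so only the first run of whitespace on each line is replaced. A minimal sketch with g added, assuming GNU sed (other seds may not accept \n in the replacement); the sample input and the variable name out are made up for illustration:

```shell
# Assumes GNU sed, which accepts \n in the replacement text.
# The sample input is hypothetical: two lines mixing spaces and tabs.
out=$(printf 'Cat \t Dog\nSoup\t\tRat\n' | sed -e 's/[[:space:]]\{1,\}/\n/g')

# Print one word per line.
printf '%s\n' "$out"
```

Leading whitespace on a line would still turn into an empty line, so an extra filter for empty lines may be needed.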

2
  • 7
    Almost portable, but most sed versions will insert a backslash and an n, because \n in the replacement is undefined by the standard. Use a literal newline instead (typically by typing backslash, Ctrl-V, Ctrl-J).
    – Philippos
    Commented Nov 14, 2017 at 7:29
  • That would output multiple blank lines and lines containing spaces given the OPs 2nd set of input (the one that starts with spaces).
    – Ed Morton
    Commented Dec 24, 2021 at 8:28
2

If GNU grep is available:

grep -Po '\S+'
1
  • 1
    That would work, though you don't need -P (which even in new versions of GNU grep is still considered "experimental" in combination with other grep options so I personally avoid), it'd work the same with -E.
    – Ed Morton
    Commented Dec 24, 2021 at 8:32
2

As the default behavior for awk already is to split on any number of blanks (spaces, tabs), one could as well use that feature, just setting the output field separator to "\n" and rebuilding $0.  An open question for the task, however, is: How do you want empty lines to be handled?

To just print them as they are:

awk -v OFS='\n' '{$1 = $1; print}' file

To additionally filter out empty lines:

awk -v OFS='\n' 'NF {$1 = $1; print}' file

Beware of Windows line endings (containing \r) in a Linux setting, however: awk does not necessarily regard lines that contain only \r as empty, and in that case would output them too.  So filter text files with CRLF endings through dos2unix first.
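
If dos2unix is not at hand, deleting the carriage returns with tr -d '\r' is a common substitute. A small sketch under that assumption; the CRLF sample input and the variable name out are hypothetical:

```shell
# Hypothetical CRLF input: the middle line holds only a carriage return.
# tr -d '\r' stands in for dos2unix here (an assumption, not part of the answer above).
out=$(printf 'Cat  Dog\r\n\r\nSoup  Rat\r\n' | tr -d '\r' |
      awk -v OFS='\n' 'NF {$1 = $1; print}')

# Print one word per line; the \r-only line is now truly empty and gets filtered.
printf '%s\n' "$out"
```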

0
2
+50

It could be done in a long list of ways:

tr -s ' \t' '\n' <file                          # for tabs and spaces only.
tr -s ' \t,."' '\n' <file                       # for your strings.
tr -s '[:blank:]' '\n' <file                    # for tabs and spaces only.
tr -s '[:space:]' '\n' <file                    # for space and \t \n \v \f \r
sed -e 's/[ \t]/\n/g' -e 's/\n\n*/\n/g' file    # needs GNU sed for \n.
sed 's/[ \t".,]\+/\n/g' file | tr -s '\n'       # needs GNU sed for \n.


Description

tr

The most basic tool is tr, but it works on single characters and has no notion of lines or words. So, a basic

tr -s ' ' '\n' <file

would convert each run of spaces to one newline. That can still produce an empty first line if the file starts with spaces; everywhere else, -s squeezes adjacent newlines into one. tr alone cannot correct that. It takes an additional filter that removes empty lines, such as sed '/^$/d':

tr -s ' ' '\n' <file | sed '/^$/d'

Additional characters (like tabs) could be added:

tr -s ' \t' '\n' <file | sed '/^$/d'

Other characters that show up in running text (like this paragraph), such as commas and periods, could also be added. Doing so changes the definition of what a word is.

tr -s ' \t,.()' '\n' <file | sed '/^$/d'
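
A short sketch of the effect of the extra filter, using the tab-and-space variant above. The input is a made-up fragment modeled on the question's second example, and the variable names input, raw and clean are illustrative only:

```shell
# Hypothetical input that starts with spaces, like the question's second example.
input='   Once    upon
 a   midnight'

# tr alone: the leading spaces turn into one empty first line.
raw=$(printf '%s\n' "$input" | tr -s ' \t' '\n')

# With the filter: empty lines are dropped.
clean=$(printf '%s\n' "$input" | tr -s ' \t' '\n' | sed '/^$/d')

printf '%s\n' "$clean"
```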

sed

sed is more capable than tr (and may be slower). It can replace each run of chosen characters with one newline. The basic idea (GNU sed is needed for \n in the replacement) would be:

sed 's/[ \t]\{1,\}/\n/g' 

In other seds that are not able to use \n in the right side of a replacement we need to use an actual newline (and, sometimes, an actual tab):

sed 's/[   ]\{1,\}/\
/g'

So sed helps with some issues but makes others unnecessarily complex.
Read 4.1. How do I insert a newline into the RHS of a substitution?

grep

It could be done in grep as well. Just match sequences of non-blank characters:

grep -o '[^         ]\{1,\}'     ## explicit space-tab.

In GNU grep, the equivalent \S+ could be used:

grep -Eo '\S+'     ## or grep -o '\S\+'
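
A variant of the first command that uses the [:blank:] character class, so that no literal tab has to be typed. This is only a sketch: it assumes a grep that supports -o (GNU and BSD grep do), and the sample input and the variable name out are hypothetical:

```shell
# Hypothetical input with leading spaces and a tab between words.
# [^[:blank:]] matches any character that is neither a space nor a tab.
out=$(printf '   Once \t upon\na   midnight\n' | grep -o '[^[:blank:]]\{1,\}')

# Each match is printed on its own line.
printf '%s\n' "$out"
```

Because only the matches are printed, leading whitespace and empty lines produce no output at all.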

awk

awk gives us still more power, and the solution becomes quite simple:

awk '{for(i=1;i<=NF;i++) {print $i}}' file

That is simply: print every field of every line. If a line has no fields (NF is 0), nothing is printed for it.

That is using the default FS which delimits fields on repeated spaces, tabs or newlines.
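
A quick check of the field loop on input that contains an empty line and a line made only of blanks. The sample text and the variable name out are made up for illustration:

```shell
# Hypothetical input: leading spaces, an empty line, a blank-only line, and a tab.
out=$(printf '   Once    upon\n\n \t \nweak\tand weary\n' |
      awk '{for (i = 1; i <= NF; i++) print $i}')

# Lines without fields contribute nothing, so no empty lines appear.
printf '%s\n' "$out"
```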

A similar solution could be done with RS (for awks that allow regex separators):

awk -v RS="[ \t\n]+" 'NF'

This tells awk to split records on runs of spaces, tabs or newlines, and to print a record only if it has at least one non-empty field (the NF condition).

2
  • I believe all of those would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces).
    – Ed Morton
    Commented Dec 24, 2021 at 8:30
  • @EdMorton Yes, technically, that is correct for the initial ways I posted. It is quite simple to remove empty lines, anyway, so, I am not very worried for this issue. However, I extended the description to solve that from different points of view. I believe that I have clarified that issue.
    – user232326
    Commented Jan 2, 2022 at 1:30
1

You can use the -s option of tr to squeeze each run of repeated characters into one and replace it with a newline:

tr -s " " "\n"
1
  • That wouldn't handle tabs in the input and would output a blank line at the start of the OPs 2nd set of input (the one that starts with spaces).
    – Ed Morton
    Commented Dec 24, 2021 at 8:29
1

With GNU awk or any other awk that supports multi-char RS (e.g. newer versions of mawk):

$ awk -v RS='[[:space:]]+' '$0!=""' file
Cat
Dog
Soup
Rat
Cass
Audrey

$ awk -v RS='[[:space:]]+' '$0!=""' file2
Once
upon
a
midnight
dreary
while
I
pondered
weak
and
weary
Over
many
a
quaint
and
curious
volume
of
forgotten
lore
2
  • Should awk -v RS='[[:space:]]+' 'NF' file2 not work ?
    – user232326
    Commented Jan 2, 2022 at 1:27
  • @ImHere yes, I think that would work too.
    – Ed Morton
    Commented Jan 2, 2022 at 13:33
1

fmt (the "simple optimal text formatter") with a maximum width of 1 puts each word on its own line, and column -t then removes the leading spaces:

fmt -1 file | column -t

In awk, the NF condition prevents a possible empty first line from being output:

awk 'NF' RS='[[:space:]]+' file
0
