8

One of the most common typos is to repeat the same word twice, as as here. I need an automatic procedure to remove all the repeated words in a text file. This should not be an unusual feature for a modern editor or spell-checker; I remember that MS Word introduced it several years ago! Apparently, the default spell-checker on my OS (Hunspell) can't do this, as it only finds words that are not in the dictionary.

It would be OK to have a solution for a specific Linux text editor (pluma/gedit2 or Sublime Text) and a solution based on a bash script.

  • Is perl an acceptable alternative to bash? Because that'd be my first port of call.
    – Sobrique
    Commented Nov 22, 2014 at 23:01
  • @Sobrique Please, feel free to add it! I would favor bash-based answers though
    – altroware
    Commented Nov 23, 2014 at 1:13
  • You asked for a script to remove repeated words but you accepted an answer that just prints them and only recognizes even repetitions (it'd fail on abc foo foo foo def for example). If you still need to know how to do what you originally asked for then please do post a new question and tag it with awk.
    – Ed Morton
    Commented Feb 9, 2020 at 14:53
  • @EdMorton I’m actually happy with the solution, I still use it to find words repeated twice in a line.
    – altroware
    Commented Feb 9, 2020 at 15:09
  • sounds good, if you ever need more than that, just ask again and tag with awk.
    – Ed Morton
    Commented Feb 9, 2020 at 15:29

3 Answers

17

With GNU grep:

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' | grep -Eo '(\b.+) \1\b'

Output:

twice twice
as as
here here
123 123

Options:

-E: Interpret (\b.+) \1\b as an extended regular expression.

-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Regex:

\b: A zero-width word boundary.

.+: Matches one or more characters.

\1: The parentheses () mark a capturing group, and \1 is a back-reference that matches the same text the first capturing group matched.
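
A side note on the pattern: because .+ matches any characters, including spaces, it can also report repeated multi-word phrases such as a b a b. If only single repeated words are wanted, one possible tightening is to capture \w+ instead. This is a sketch that assumes GNU grep (\w, \b and back-references with -E are GNU extensions); the function name find_doubled_words is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): print each adjacent repeated
# single word on its own line. Reads stdin or the files given as arguments.
# Assumes GNU grep: \w, \b and back-references in -E patterns are extensions.
find_doubled_words() {
    grep -Eo '\b(\w+) \1\b' "$@"
}

# "the thesis" and the phrase "a b a b" are not reported; "is is" is.
echo 'the thesis is is a b a b' | find_doubled_words
# prints: is is
```

Adding -n to the grep call should also print the line number of each match, which helps when checking a whole file.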


Reference: The Stack Overflow Regular Expressions FAQ

  • Your grep command fails for the following type of example: echo "the thesis" | grep -Eo '(\b.+\b) \1' outputs: the the. grep -Eo '(\b.+) \1\b' seems to work though. Any idea why?
    – el_tenedor
    Commented Mar 24, 2015 at 14:38
  • I was wondering whether this answer could be improved to cover the case where the repeated words are on separate lines rather than on the same line, as in: same word twice\n twice
    – altroware
    Commented Aug 29, 2016 at 11:56
  • @altroware did you find a solution for when the repeated words are not on the same line?
    – om-ha
    Commented Feb 9, 2020 at 9:38
  • @om-ha no I haven’t found it!
    – altroware
    Commented Feb 9, 2020 at 10:02
  • @altroware Done! You can see the solution here. I've edited an already-existing answer so you'll see the changes when they're approved.
    – om-ha
    Commented Feb 9, 2020 at 12:30
2

It sounds like something like this is what you want (using any awk in any shell on every UNIX box):

$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    head = prev = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}

$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back

$ awk -f tst.awk file
the quick  brown
fox jumped
 over the lazy
 dogs back

If you need more than that, please ask a new question with more truly representative sample input and expected output that covers punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences and various other non-trivial cases.
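
As one illustration of the capitalization case mentioned above, here is a hedged variation of the same awk script that compares words case-insensitively and keeps the first spelling it saw. Whether "The the" should really be collapsed is an assumption about the requirements, and the wrapper name dedupe_words is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around a variation of tst.awk: words are compared
# via tolower(), so "The the" counts as a repetition. Paragraph mode
# (RS="") lets repetitions that straddle a line break be caught as well.
dedupe_words() {
    awk '
    BEGIN { RS = ""; ORS = "\n\n" }
    {
        head = prev = ""
        tail = $0
        while (match(tail, /[[:alpha:]]+/)) {
            word = substr(tail, RSTART, RLENGTH)
            key  = tolower(word)    # case-folded form used for comparison
            # Keep the text before the word; drop the word if it repeats prev.
            head = head substr(tail, 1, RSTART - 1) (key == prev ? "" : word)
            tail = substr(tail, RSTART + RLENGTH)
            prev = key
        }
        print head tail
    }' "$@"
}

printf 'The the quick\nQuick fox\n' | dedupe_words
```

As in the original script, the whitespace that surrounded a removed word is left in place, so doubled spaces and leading spaces can remain in the output.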

1

Perlishly, I'd be thinking:

use strict;
use warnings;

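# Undefine the input record separator so <DATA> is read in one go.
local $/;

my $slurp = <DATA>;
# Collapse a word followed by one or more copies of itself, each separated
# by a single non-word character. The trailing \b keeps "the thesis" intact.
$slurp =~ s/\b(\w+)(?:\W\1\b)+/$1/g;
print $slurp;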

__DATA__
Hi! Hi, same same? word twice twice, as as here here! ! ,123 123 need
need as here 

Bear in mind though - a lot of pattern matching is line oriented, so you've got to be careful if you cross line boundaries. If you can exclude that case, then you've got an easier job because you can parse one line at a time. I'm not doing that, so you'll end up reading the whole file into memory.

  • That's great. I would have preferred a bash-based answer, but this is OK as well.
    – altroware
    Commented Dec 15, 2014 at 20:32
  • Perl is in nearly as many places as bash, and is more fully featured as a programming language.
    – Sobrique
    Commented Dec 15, 2014 at 20:47
