Find repeated words in a text

Question

One of the most common typos is to repeat the same word twice, as as here. I need an automatic procedure to remove all the repeated words in a text file. This should not be a strange feature for a modern editor or spell-checker; for example, I remember that MS Word introduced this feature several years ago! Apparently, the default spell-checker on my OS (hunspell) can't do this, as it only finds words that are not in the dictionary.

It would be OK to have a solution valid for a specific text editor for Linux (pluma/gedit2 or Sublime Text) and a solution based on a bash script.

Is perl an acceptable alternative to bash? Because that'd be my first port of call. — Sobrique, Commented Nov 22, 2014 at 23:01
@Sobrique Please, feel free to add it! I would favor bash-based answers though — altroware, Commented Nov 23, 2014 at 1:13
You asked for a script to remove repeated words but you accepted an answer that just prints them and only recognizes even repetitions (it'd fail on abc foo foo foo def for example). If you still need to know how to do what you originally asked for then please do post a new question and tag it with awk. — Ed Morton, Commented Feb 9, 2020 at 14:53
@EdMorton I’m actually happy with the solution, I still use it to find words repeated twice in a line. — altroware, Commented Feb 9, 2020 at 15:09
sounds good, if you ever need more than that, just ask again and tag with awk. — Ed Morton, Commented Feb 9, 2020 at 15:29

Community · Accepted Answer · 2017-05-23 12:41:46Z

With GNU grep:

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' | grep -Eo '(\b.+) \1\b'

Output:

twice twice
as as
here here
123 123

Options:

-E: Interpret (\b.+) \1\b as an extended regular expression.

-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Regex:

\b: Is a zero-width word boundary.

.+: Matches one or more characters.

\1: The parentheses () mark a capturing group, and \1 is a backreference that matches the same text the first capturing group matched.

Reference: The Stack Overflow Regular Expressions FAQ
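
Where the trailing \b sits matters. The sketch below compares two placements on a made-up input; it assumes GNU grep (\b and backreferences in -E patterns are GNU extensions), and the function names loose and strict are only illustrative:

```shell
#!/bin/sh
# Assumes GNU grep: \b and backreferences with -E are GNU extensions.

# Boundary only after the first copy: the backreference is free to match
# the beginning of a longer word, so "the thesis" is reported as "the the".
loose() { grep -Eo '(\b.+\b) \1'; }

# Boundary after the backreference: the second copy must end at a word
# boundary, so a word that merely starts with the first one is rejected.
strict() { grep -Eo '(\b.+) \1\b'; }

echo 'the thesis' | loose                      # prints: the the
echo 'the thesis' | strict || echo 'no match'  # prints: no match
echo 'the the thesis' | strict                 # prints: the the
```

Because .+ is not limited to word characters, both patterns can also match repeated multi-word phrases, not just single words.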

edited May 23, 2017 at 12:41 by CommunityBot

answered Nov 23, 2014 at 9:30 by Cyrus

Your grep command fails for the following type of example: echo "the thesis" | grep -Eo '(\b.+\b) \1' outputs: the the. grep -Eo '(\b.+) \1\b' seems to work though. Any idea why?
– el_tenedor
Commented Mar 24, 2015 at 14:38

I was wondering whether this answer could be improved to cover the case where the repeated words are on separate lines rather than on the same line, as in: same word twice\n twice
– altroware
Commented Aug 29, 2016 at 11:56

@altroware did you find a solution for when the repeated words are not on the same line?
– om-ha
Commented Feb 9, 2020 at 9:38

@om-ha no I haven’t found it!
– altroware
Commented Feb 9, 2020 at 10:02

@altroware Done! You can see the solution here. I've edited an already-existing answer so you'll see the changes when they're approved.
– om-ha
Commented Feb 9, 2020 at 12:30

Ed Morton · Accepted Answer · 2020-02-09 16:17:05Z

It sounds like something like this is what you want (using any awk in any shell on every UNIX box):

$ cat tst.awk
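BEGIN { RS=""; ORS="\n\n" }    # paragraph mode: records are separated by blank lines
{
    head = prev = ""
    tail = $0
    # Walk the record one alphabetic word at a time.
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        # Keep the text before the word; drop the word if it repeats the previous one.
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}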

$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back

$ awk -f tst.awk file
the quick  brown
fox jumped
 over the lazy
 dogs back

but please ask a new question with more truly representative sample input and expected output, covering punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences, and various other non-trivial cases.
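
The sample output keeps the spaces that surrounded each removed word. If that matters, a follow-up filter along these lines might tidy it up. This is only a sketch: tidy_spaces is a hypothetical helper name, and it also collapses any runs of spaces that were intentional (alignment, indentation).

```shell
#!/bin/sh
# Collapse runs of spaces into one, then trim a single leading or trailing
# space on each line. Uses only basic (POSIX) sed syntax.
tidy_spaces() {
    sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Input mimics the awk output shown above, including leftover spaces.
printf 'the quick  brown\nfox jumped \n over the lazy\n dogs back\n' | tidy_spaces
```

It would be used as: awk -f tst.awk file | tidy_spaces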

Sobrique · Accepted Answer · 2014-11-23 13:25:46Z

Perlishly, I'd be thinking:

use strict;
use warnings;

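# Undefine the input record separator so <DATA> is read in one go.
local $/;

my $slurp = <DATA>;
# Collapse a run of the same word into one copy; the \b after \1 keeps
# "the thesis" from being treated as "the the".
$slurp =~ s/\b(\w+)(?:\W\1\b)+/$1/g;
print $slurp;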

__DATA__
Hi! Hi, same same? word twice twice, as as here here! ! ,123 123 need
need as here

Bear in mind though - a lot of pattern matching is line oriented, so you've got to be careful if you cross line boundaries. If you can exclude that case, then you've got an easier job because you can parse one line at a time. I'm not doing that, so you'll end up reading the whole file into memory.
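
If the goal is only to find (not remove) repeats that may straddle a line break, one way to sidestep line orientation is to put one word per line first. A minimal sketch using only tr and uniq; adjacent_dups is a hypothetical name, and note that it behaves differently from the regex approaches: any punctuation between the two copies is ignored, words are split at apostrophes and hyphens, and the comparison is case-sensitive.

```shell
#!/bin/sh
# Print each word that is immediately followed by another copy of itself,
# even when the two copies are on different lines.
adjacent_dups() {
    # tr -cs: replace every run of non-alphanumeric characters with a single
    # newline, leaving one word per line. uniq -d then prints one copy of
    # each group of identical consecutive lines.
    tr -cs '[:alnum:]' '\n' | uniq -d
}

printf 'same word twice\ntwice, as as here\n' | adjacent_dups
# prints:
# twice
# as
```

This streams the input, so nothing has to be held in memory beyond the current and previous word.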

answered Nov 23, 2014 at 13:25 by Sobrique

That's great. I would have preferred a bash-based answer, but this is OK as well.
– altroware
Commented Dec 15, 2014 at 20:32
Perl is in nearly as many places as bash, and is more fully featured as a programming language.
– Sobrique
Commented Dec 15, 2014 at 20:47
