How to remove all the duplicated words on every line using Notepad++?

Question

I'm working on a file containing lines with keywords and some lines contain duplicated keywords.

For example:

dangerous,dangerous,hazardous,perilous

I want Notepad++ to remove every duplicated word on each line. In this example, the repeated dangerous would be removed:

dangerous,hazardous,perilous

I have a bunch of lines like that and that's why I'm looking for an automated way of doing this.

What about dangerous,hazardous,dangerous,perilous? In other words, are duplicated words always next to each other? — Daniel Beck, Commented Jul 26, 2012 at 21:24

amiregelz · Accepted Answer · 2012-07-27 06:57:25Z

13

You can use a regular expression to remove consecutive duplicated words in a line. However, I don't think it's possible to remove duplicated words that are not consecutive (e.g. dangerous, hazardous, dangerous).

Use this regex in the replace window in Notepad++, and don't forget to select "Regular expression" as the Search Mode option below:

The regex \b(\w+)(?:,\s+\1\b)+ removes all consecutive duplicated words, whether there are 2 or 10 of them in a row. Note that ,\s+ requires at least one whitespace character after each comma (word, word); for words separated by a bare comma, as in the question's example, use ,\s* instead.

The equivalent regex without commas is \b(\w+)(?:\s+\1\b)+ (might be useful for other users).

For a regex that handles only two duplicated words (doubles), use (\b\w+\b)\W+\1 instead.

To keep one occurrence of the word, put ${1} in the Replace with box (otherwise all repeated words will be removed).

These regular expressions handle a situation like the example in your question. The first regex works for any number of consecutive duplicated words (e.g. dangerous, dangerous, dangerous, dangerous, hazardous), while the doubles version only works for two duplicated words (e.g. dangerous, dangerous, hazardous).
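As a rough illustration outside Notepad++, the consecutive-duplicates pattern can be tried with Python's re module. This is only a sketch: Python is an assumption here (the answer targets Notepad++), the name collapse_consecutive is made up for the example, and \s* is used instead of \s+ so that words separated by a bare comma also match.

```python
import re

# Same idea as the first regex in the answer, with \s* instead of \s+ so that
# "word,word" (no space after the comma) is matched as well.
CONSECUTIVE = re.compile(r"\b(\w+)(?:,\s*\1\b)+")


def collapse_consecutive(line: str) -> str:
    """Collapse a run of the same comma-separated word into one occurrence."""
    # \1 in the replacement keeps a single copy of the repeated word.
    return CONSECUTIVE.sub(r"\1", line)


print(collapse_consecutive("dangerous,dangerous,hazardous,perilous"))
print(collapse_consecutive("dangerous, dangerous, dangerous, hazardous"))
# Non-consecutive duplicates are left alone by this pattern.
print(collapse_consecutive("dangerous,hazardous,dangerous,perilous"))
```

The last call returns its input unchanged, which matches the limitation described above: only duplicates that sit next to each other are collapsed.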

Note: The regular expressions only apply to the format described in the question. Formats like the following won't be changed, because the regex doesn't match them:

two words, two words, anotherword
two-words, two-words, anotherword
three words expression, three words expression, anotherword

edited Jul 27, 2012 at 6:57

answered Jul 26, 2012 at 20:03


Thanks for the help! However, I'm getting 0 occurrences. I tried doing this with separated keywords as you suggested and it didn't work; I also tried with them as they were before, and nothing. Please check my screen capture: goo.gl/eZ7Kp
– Gabriel
Commented Jul 26, 2012 at 20:28

This regex should work: (\b\w+\b)\W+\1 for two duplicated words. I'll edit my answer. The commas are why the other regex doesn't work.
– amiregelz
Commented Jul 26, 2012 at 20:40
I tried every possible combination: no commas, only spaces, no space and comma, and still nothing. Please enlighten me; here's the text file: goo.gl/sP20z
– Gabriel
Commented Jul 26, 2012 at 21:59
The problem is that the regular expression in my answer only applies to the format (I thought) you asked for: word, word, anotherword. However, you have many instances that are a little different, like came across, came across, and some with 3 or 4 words. There are also words containing ' such as don't, which makes things more complicated in the Notepad++ regex system. That regex system is also pretty annoying and limited, so the solution is either to use regex in Python (or another language) or to write format-specific regular expressions for Notepad++.
– amiregelz
Commented Jul 26, 2012 at 23:48
Another problem is that most of the duplicated words also appear in the previous line, which makes it difficult to achieve your goal. If you wanted to remove all duplicated words, it wouldn't be that difficult. You could do something like this & this. I suggest you use specific regular expressions in Notepad++ (I can help you, just tell me all the formats of the duplicated words) or consider a different approach to your problem.
– amiregelz
Commented Jul 27, 2012 at 0:16
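The comments above suggest falling back to Python when the keywords are not simple single words. A minimal sketch of that idea follows, without any regex: it splits each line on the separator and keeps the first occurrence of each keyword. The function names dedupe_keywords and dedupe_text are hypothetical, and the sketch assumes a comma separator and that surrounding spaces are not significant.

```python
def dedupe_keywords(line: str, sep: str = ",") -> str:
    """Remove repeated keywords from one line, keeping the first occurrence."""
    # Strip surrounding whitespace so "came across" and " came across" compare equal.
    keywords = [k.strip() for k in line.split(sep)]
    # dict.fromkeys preserves insertion order (Python 3.7+) and drops later repeats.
    unique = dict.fromkeys(k for k in keywords if k)
    return sep.join(unique)


def dedupe_text(text: str, sep: str = ",") -> str:
    """Apply dedupe_keywords to every line independently."""
    return "\n".join(dedupe_keywords(line, sep) for line in text.splitlines())


sample = "dangerous,hazardous,dangerous,perilous\ncame across, came across, don't, don't"
print(dedupe_text(sample))
```

Because whole keywords are compared, multi-word entries and words containing an apostrophe need no special handling, and duplicates do not have to be adjacent. The output is joined with the bare separator, so any spaces after commas in the input are not preserved.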


Toto · Accepted Answer · 2019-02-27 14:56:34Z

Here is a way to do the job. It removes duplicate words even when they are not contiguous:

Ctrl+H
Find what: (?:^|\G)(\b\w+\b),?(?=.*\1)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all

Explanation:

(?:^|\G)    : non-capturing group, beginning of line or position of the last match
(\b\w+\b)   : group 1, 1 or more word characters (i.e. [a-zA-Z0-9_]), surrounded by word boundaries
,?          : optional comma
(?=.*\1)    : positive lookahead, checks whether the same word (contained in group 1) appears somewhere later on the line

Given an input like: dangerous,dangerous,hazardous,perilous,dangerous,dangerous,hazardous,perilous

We get:

dangerous,hazardous,perilous
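The lookahead idea can also be tried outside Notepad++. The sketch below uses Python's re module and is not the same pattern: the \G anchor is dropped because the standard re module does not support it, and \b is added around the backreference so that a word is not treated as a duplicate of a longer word containing it. The name drop_earlier_duplicates is hypothetical, and the input is assumed to be comma-separated without spaces.

```python
import re

# Assumed simplification of the pattern above: no \G, and \b around the
# backreference so "cat" is not matched inside "category".
EARLIER_DUPLICATE = re.compile(r"\b(\w+)\b,?(?=.*\b\1\b)")


def drop_earlier_duplicates(text: str) -> str:
    """Delete every word that occurs again later on the same line."""
    # "." does not match a newline by default, so the lookahead never
    # looks past the end of the current line.
    return EARLIER_DUPLICATE.sub("", text)


line = "dangerous,dangerous,hazardous,perilous,dangerous,dangerous,hazardous,perilous"
print(drop_earlier_duplicates(line))
```

Note that this variant keeps the last occurrence of each word rather than the first, so a,b,a becomes b,a.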

Awesome answer! Really helpful. Thanks for sharing :)
– Chaity
Commented Sep 19, 2020 at 6:31

Just Me · Accepted Answer · 2021-06-28 07:11:45Z

0

Try this:

Ctrl+H
Find what: \b(\w+)\s+\1\b
Replace with: $1 (leaving this box empty would delete both copies of the word)
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all

answered Jun 28, 2021 at 7:11

