Removing duplicate paragraphs with Edit Pad Pro or Notepad++

Question

I have a .docx file that contains MCQs in the format shown below. The problem is that there are many duplicate MCQs, so I would like to know whether a regex can be created to detect all of them.

I have EditPad Pro 7, Notepad++, PowerGREP and Sublime Text. All the regexes I have used so far removed duplicates on a line-by-line basis, thereby deleting options from other questions even though the questions don't match.

So basically, I need a regex that deletes a duplicate MCQ only if the whole MCQ matches, not individual lines or sentences.

I am a novice with respect to regex, so please excuse any inadequacies.

Lichen planus occurs most frequently on the?
A.  buccal mucosa.
B.  tongue.
C.  floor of the mouth.
D.  gingiva.

In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed tooth?
A.  Saliva.
B.  Milk.
C.  Saline.
D.  Tap water.

Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease?
A.  Uncontrolled diabetes.
B.  Systemic corticosteroid therapy.
C.  Chronic renal failure.
D.  Prolonged NSAID therapy.
E.  Malabsorption syndrome.

Lichen planus occurs most frequently on the?
A. buccal mucosa.
B. tongue.
C. floor of the mouth.
D. gingiva.

Expected result:

Lichen planus occurs most frequently on the?
A.  buccal mucosa.
B.  tongue.
C.  floor of the mouth.
D.  gingiva.

In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed tooth?
A.  Saliva.
B.  Milk.
C.  Saline.
D.  Tap water.

Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease?
A.  Uncontrolled diabetes.
B.  Systemic corticosteroid therapy.
C.  Chronic renal failure.
D.  Prolonged NSAID therapy.
E.  Malabsorption syndrome.

I don't see any duplicates in your example. Please edit your question and give the expected result. — Toto, Commented May 10, 2018 at 16:31

Toto · Accepted Answer · 2018-05-10 17:45:21Z


Ctrl+H
Find what: (([^?]+\?\R(?:.+\.\R)+)[\s\S]+?)\2
Replace with: $1
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all

Explanation:

(           : start group 1
  (         : start group 2
    [^?]+   : 1 or more any character that is not "?"
    \?      : a question mark
    \R      : any kind of line break
    (?:     : start non capture group
      .+    : 1 or more any character but newline
      \.    : a dot
      \R    : any kind of line break
    )+      : end group, must appear 1 or more times
  )         : end group 2
  [\s\S]+?  : 1 or more any character, not greedy
)           : end group 1
\2          : another occurrence of group 2

Replacement:

$1          : content of group 1
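
If the editor-based replace is awkward for a large file, the same goal (keep the first copy of each whole MCQ, drop later copies) can be sketched in a short script. This is not the regex above but a different technique: split the text into blank-line-separated blocks and remember which ones have already been seen. The function name `dedupe_blocks` and the decision to ignore differences in spacing when comparing blocks are assumptions made for illustration.

```python
import re


def dedupe_blocks(text: str) -> str:
    """Keep the first occurrence of each blank-line-separated block."""
    # Split on one or more blank lines (works for \n and \r\n line endings).
    blocks = re.split(r"(?:\r?\n[ \t]*){2,}", text.strip())
    seen = set()
    kept = []
    for block in blocks:
        # Comparison key: collapse runs of spaces/tabs so "A.  x" equals "A. x".
        key = "\n".join(
            re.sub(r"[ \t]+", " ", line).strip() for line in block.splitlines()
        )
        if key and key not in seen:
            seen.add(key)
            kept.append(block)  # keep the block exactly as first written
    return "\n\n".join(kept) + "\n"


sample = "Q1?\nA.  x.\nB.  y.\n\nQ2?\nA.  p.\n\nQ1?\nA. x.\nB. y.\n"
print(dedupe_blocks(sample))
```

Because only later copies are skipped, one copy of every MCQ always survives, whatever the order of the duplicates.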


@den: You're welcome, glad it helps.
– Toto
Commented May 12, 2018 at 8:17
@den: Feel free to mark the answer as accepted, see: superuser.com/help/someone-answers
– Toto
Commented May 12, 2018 at 8:17
Is there a way to mark each individual MCQ as a paragraph permanently, so that even MS Word recognises it as a paragraph?
– den
Commented May 12, 2018 at 14:22
It worked, but the problem is that it removes even the original. For example: input = paragraph1, paragraph2, paragraph3, paragraph1; result = paragraph3, paragraph3. Paragraph 1 gets deleted completely. I want code that deletes only the duplicate and keeps one copy of everything. I hope I am clear, and thanks in advance.
– den
Commented May 12, 2018 at 16:27
@den: That's strange, it works fine for me, I have Notepad++ v7.5.6.
– Toto
Commented May 12, 2018 at 16:30


wlod · Accepted Answer · 2018-05-23 14:26:22Z

Technically, there are no duplicates in the given input, as ‘A.  buccal mucosa.’ and ‘A. buccal mucosa.’ differ in the number of spaces after 'A.'.

However, intuition suggests that such cases should still be detected.

Since you mentioned in a comment that you are using https://regex101.com/, I will use that site for the matches and replacements.
I selected the JavaScript flavor and set two flags in the regular expression section: g (global) and s (single line).

I will use 3 patterns to deal with this string.

The first pattern searches for identical question_and_answer occurrences. If there are discrepancies between them, they won't be treated as duplicates.
If there is more than one duplicate, all of them will be captured.

(?<=^|\n)([^\n]+)(\n)(\D{1}\.\s+[^^]+?)(\n{2})(?=[^^]*\1\n\3)

Input (TEST STRING):

Question 1?
A. SomeA1.
B. SomeB1.

Question 1?
A.     SomeA1.
B.     SomeB1.

Question 1?
A. SomeA1.
B. SomeB1.

Output (SUBSTITUTION): // one duplicate removed

Question 1?
A.     SomeA1.
B.     SomeB1.

Question 1?
A. SomeA1.
B. SomeB1.
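
The first pattern can also be tried outside regex101. The sketch below is an adaptation to Python's `re` module, not the exact JavaScript pattern: Python rejects the variable-width lookbehind `(?<=^|\n)`, so it is swapped for `^` in multiline mode, `[^^]` is swapped for `[\s\S]`, and the newline-only groups are dropped (so the options are group 2 here rather than group 3). The names `DUPLICATE_MCQ` and `sample` are made up for the example.

```python
import re

# Python adaptation of the first pattern (an assumption, not the original regex).
DUPLICATE_MCQ = re.compile(
    r"^([^\n]+)\n"          # group 1: the question line
    r"(\D\.\s+[\s\S]+?)"    # group 2: the options, starting with e.g. "A. "
    r"\n{2}"                # the blank line that ends the block
    r"(?=[\s\S]*\1\n\2)",   # match only if the same block appears again later
    re.MULTILINE,
)

sample = (
    "Question 1?\nA. SomeA1.\nB. SomeB1.\n\n"
    "Question 1?\nA.     SomeA1.\nB.     SomeB1.\n\n"
    "Question 1?\nA. SomeA1.\nB. SomeB1.\n"
)

# Replacing each match with nothing removes the earlier copy of a duplicate.
result = DUPLICATE_MCQ.sub("", sample)
print(result)
```

As in the output above, the block with the extra spaces is not treated as a duplicate, and it is the last copy of a duplicated block that survives, because the lookahead only fires when a later copy exists.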

If we only want to find duplicates based on the question line, this pattern should work, but it should be used for information only.

(?<=^|\n)([^\n]+)(\n)(\D{1}\.\s+[^^]+?)(\n{2})(?=[^^]*\1)

Input (TEST STRING):

Question 1?
A. SomeA1.
B. SomeB1.

Question 1?
A.     SomeA1.
B.     SomeB1.

Question 1?
A. SomeA1.
B. SomeB1.

Output (SUBSTITUTION): // it looks ok, but it's a trick and must be used with caution

Question 1?
A. SomeA1.
B. SomeB1.

Ideally, if we know what kind of deviations we might find in the data, we can clean the data before the first pattern is applied, as in the following example, where multiple spaces are replaced with a single space.

Find: (?<=\n)(\D{1}\.)(\s+)([^^]+?\n)
Substitution: \1 \3

Input (TEST STRING):

Lichen planus occurs most frequently on the?
A.  buccal mucosa.
B.             tongue.
C.                        floor of the mouth.
D.  gingiva.

Output (SUBSTITUTION):

Lichen planus occurs most frequently on the?
A. buccal mucosa.
B. tongue.
C. floor of the mouth.
D. gingiva.
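
As a rough sketch of this clean-up step in a script, the snippet below collapses the whitespace after an option label to a single space. It is an adaptation rather than the pattern above: `^` in multiline mode replaces the lookbehind, and `[ \t]+` replaces `\s+` so that a line break is never swallowed. The function name `normalize_options` is hypothetical.

```python
import re


def normalize_options(text: str) -> str:
    """Collapse the whitespace after an option label such as 'A.' to one space."""
    # ^ with re.MULTILINE anchors at every line start; (\D\.) is the label.
    return re.sub(r"^(\D\.)[ \t]+", r"\1 ", text, flags=re.MULTILINE)


raw = (
    "Lichen planus occurs most frequently on the?\n"
    "A.  buccal mucosa.\n"
    "B.             tongue.\n"
    "C.                        floor of the mouth.\n"
    "D.  gingiva.\n"
)
print(normalize_options(raw))
```

Running this before any duplicate detection means that blocks differing only in the spacing after the label compare as equal.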

These regex don't work in Npp. What do you want to match with [^^]*? — Toto, Commented May 23, 2018 at 17:34
@Toto, indeed those patterns do not work in N++ but I got the impression that @den is presumably interested in a solution rather than a solution purely in N++. I also tried to solve it in N++ and those two patterns should do the job meaning that it's possible to remove multiple consecutive and non-consecutive duplicates. Just put \n in the replace textbox: (1) (^[^\e]+?$)(\R)(\D{1}\.\s+[^\e]+?)(\R{2})(?=[^\e]*\1\2\3); (2) (^[^\e]+?)(\R)(\R)(?=[^^]*\1(\R|$)). There will be too many new lines left but the data itself will be nicely structured and easy to work for humans and machines. — wlod, Commented May 24, 2018 at 7:35
OK, but what are you trying to match with [^^]* and [^\e]*? — Toto, Commented May 24, 2018 at 9:04
I often switch between different systems with many regex flavors and don't want to think each time what will capture for example a dot in a given context so I pick some character which I know for sure is not going to appear in the text like BEL (\a) or ESC (\e) and use it for matching as many characters as possible until limiting group is reached. In single line mode there's only one ^ 'start of a line' so I often use it for greedy capture. [^^]* means 0 or as many characters as possible. — wlod, Commented May 24, 2018 at 9:23
^ inside a character class doesn't mean "start of line" but it negates the character class when in first position else it means simply a caret ^ , so [^^]* means not a caret 0 or more times. — Toto, Commented May 24, 2018 at 10:22
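
The point about the caret in the last comment can be checked with a small experiment. This is only an illustrative sketch in Python (the thread itself uses Notepad++ and regex101), and the sample string is invented: `[^^]*` stops at the first literal caret, while `[\s\S]*` matches every character.

```python
import re

text = "first line\nsecond ^ line"

# [^^]* means "any character except a literal caret", so it stops at the caret.
not_caret = re.match(r"[^^]*", text).group()

# [\s\S]* matches every character, including newlines and carets.
anything = re.match(r"[\s\S]*", text).group()

print(repr(not_caret))  # 'first line\nsecond '
print(repr(anything))   # 'first line\nsecond ^ line'
```

So `[^^]*` behaves like "match anything" only as long as the text contains no caret; `[\s\S]*` has no such restriction.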
