1

Suppose I have a pipe-delimited text file. I suspect that one of the columns might have an embedded pipe character ('|'). I know there are 8 columns in the file, and each row should have 8-1 = 7 pipe characters. Hence, I need to find all lines that have 8 or more '|' characters.

The following regex should find all such cases, but takes too long to return on my 200,000 record file:

^\(.*|.*\)\{8,}$

Is there a faster regex I should use instead? By too long, I mean longer than I would expect - at least several minutes. This is not so large a file (200K records), so I'm assuming the regex itself is just not efficient.


Some sample data:

SAMPLE_ID|GROUPS|ADDRESSSTRING|LATITUDE|LONGITUDE|COUNTRYCODE|LANGUAGECODE|ISO_2_LTR_CODE
7304094||Rhein-Galerie;Baden-Württemberg|49.48334|8.45007|DEU|ger|DE
7303851||Steigenberger Insel;Baden-Württemberg|47.69005|9.18812|DEU|ger|DE
7303850||Si-Suites;Baden-Württemberg|48.72309|9.16138|DEU|ger|DE

(I am running gVim on WinXP)

1
  • Chris Johnsen's regex returns matches in 1 second. celebdor's regex returns matches in 1 minute. My original regex returned significantly later.
    – drapkin11
    Commented Mar 30, 2011 at 14:49

2 Answers 2

2

Your regex is prone to running into some O(N^2) behavior of the “backtracking” regex engine used in Vim (and many other languages and environments).

Luckily, there are ways to write equivalent expressions that do not cause excessive backtracking. For example:

/^\([^|]*|\)\{8}.*$

In general, you do not need to match “eight or more”, since if you already know the line is problematic if it has eight (whether it has more or not).

If you actually need to match the entire line (e.g. because it is part of a :s operation), then you will need to keep the last part (.*$); if you are just using the regex to find the “eight or more” lines, then you can leave .*$ off the end.

Also, I advise only trying to match one “side” of the pipe inside the group that that you repeat. This simplifies both thinking about how the regex matches lines, and how the regex engine itself executes (it eliminates a source of backtracking).


Now, to explain the bit about “backtracking”. Consider you have a line that does have eight pipe characters on it:

aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh

The following passage describes how the regex engine tries to match your expression against the above line (I added extra whitespace to the regex lines to show (approximately) where parts of the regex match the characters of the line itself).

The first .* is greedy and will match everything to the end of the line, leaving the pipe character unmatchable.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                                            |

The most recent “shrinkable” match gives up bits of its match and tries the rest of the regex again. This happens one character at a time in this case (since . will match any single character). This backtracking proceeds until the rest of the expression can match (or until it backtracks to the beginning—which is the only way it knows the line does not match the expression!).

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                                     |.*    )(.*|

So, the first .* backtracked enough to let the rest of the group match, but there was nothing for the second group to match. Time to backtrack some more.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                              |.*           )(.*|

The backtracking found a new “stopping” point, but now the second .* in the first group is doing its greedy matching. The second group fails to match. Backtracking of the second .* in the first group starts.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                              |.*)(.*|.*    )(.*|

The second group found a match, but the third group did not match. Backtrack again, starting with the more recent match. The second .* of the second group backtracks to back to nothing. The first .* of the second group backtracks to nothing. The second .* of the first group backtracks to nothing. The first .* of the first group backtracks successfully.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                       |.*                  )(.*|

But again, the second .* is greedy, so it leaves nothing for the second group.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                       |.*       )(.*|.*    )(.*|
  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                       |.*)(.*|.*)(.*|.*    )(.*|

Eventually, all three groups match, but the fourth instance of the group fails. Start backtracking.

  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                |.*                         )(.*|
  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                |.*              )(.*|.*    )(.*|
  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                |.*       )(.*|.*)(.*|.*    )(.*|
  aaaaaa|bbbbbb|cccccc|dddddd|eeeeee|ffffff|  gg  |  gg  |hhhhhh
^(.*                                |.*)(.*|.*)(.*|.*)(.*|.*    )(.*|

You can see how this burns a lot of time (the diagrams even skip over the character-by-character backtracking which actually occurs; only the “high points” are shown above). The problem comes from having an earlier bit of the regex greedily match something that the later part of the regex will eventually have to match to get the proper number of repetitions of the group.

In my expression, each repetition ([^|]*) never matches anything that the following element (|) would match, so the backtracking is purely linear. Once backtracking starts for each “shrinkable” match, it will (in linear time) find that there are no earlier places where the following expression can match; this forces backtracking to continue with the previous “shrinkable” match until nothing matches and the whole line is decided to be non-matching.

Instead of “zero or more non-pipe, then pipe” ([^|]*|), it is also possible to use . with an explicitly non-greedy repetition (\{-} in Vim, but it varies; other regex languages use *?).

^\(.\{-}|\)\{8}.*$
0
1

Well, in my computer this is faster:

:%s/\(|.\{-}\)\{8,}//n
1
  • Yes, same on my computer. This regex does a phenomenally better job.
    – drapkin11
    Commented Mar 30, 2011 at 14:30

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .