I have a number of CSV files whose lines contain differing numbers of commas. I would like to remove every comma past the sixth in a line, but if a line has only 6 commas, leave it alone.

This regex removes extra commas if there are more than 6:

^([^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*),(.*)$

and replace with

\1\2

The problem I am running into is that when I run this against all the files and it finds a line with only six commas, the match spills over into the next line and includes it. How do I restrict the match to a single line?

Thanks everyone.

Example:

c1,c2,c3,c4,c5,c6,c7
asdf,asdf,asdf,asdf,asdf,asdf,asdf
asdf,asdf,asdf,asdf,asdf,asdf,asdf,,,asdf,asdf
asdf,asdf,asdf,asdf,asdf,asdf,asdf,,

I would like to end up with this:

c1,c2,c3,c4,c5,c6,c7
asdf,asdf,asdf,asdf,asdf,asdf,asdf
asdf,asdf,asdf,asdf,asdf,asdf,asdfasdfasdf
asdf,asdf,asdf,asdf,asdf,asdf,asdf
  • Where's Toto at when you need him!! Commented Oct 11, 2019 at 19:32
  • Please paste part of your source file here as an example. Commented Oct 11, 2019 at 20:04
  • See updated example. Commented Oct 11, 2019 at 20:19

2 Answers

The problem is that [^,] matches every character other than a comma, including newlines, so a field can run on into the next line. Replace it with [^,\r\n] and it should work.

You can shorten the regex by using a numeric repetition count: ^((?:[^,\r\n]*,){6}[^,\r\n]*),(.*)$
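To see the difference outside the editor, here is a minimal sketch using Python's re module. That is an assumption on my part (the question uses an editor's find/replace, and the sample text is made up), but the behavior of [^,] is the same: it happily matches a line break.

```python
import re

# Hypothetical sample: the first line has exactly six commas, the second has seven.
text = "a,b,c,d,e,f,g\nh,i,j,k,l,m,n,o\n"

# Original pattern: [^,] also matches "\n", so the seventh field
# can run on into the following line.
leaky = re.compile(
    r"^([^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*),(.*)$", re.MULTILINE
)

# Corrected, shortened pattern: a field may contain neither a comma
# nor a line break, so a match can never cross lines.
fixed = re.compile(r"^((?:[^,\r\n]*,){6}[^,\r\n]*),(.*)$", re.MULTILINE)

leaky_result = leaky.sub(r"\1\2", text)
fixed_result = fixed.sub(r"\1\2", text)

print(repr(leaky_result))  # the six-comma line swallows the start of line two
print(repr(fixed_result))  # only the seven-comma line is changed
```

With the original pattern the seventh "field" becomes g, a newline, and h, and the comma removed is the first one on the second line. With the corrected pattern the first line is left alone.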

Note that your regex will break on CSV files that contain quoted commas within fields. Fixing this is ugly and depends on the exact CSV dialect you're using. (RFC 4180 describes a common format, but unfortunately many tools deviate from it.)
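If quoted commas are a real concern, a CSV parser is probably safer than a regex. The following is only a sketch, assuming the files follow the default dialect of Python's csv module; the function name merge_extra_fields and the sample data are made up for illustration.

```python
import csv
import io

def merge_extra_fields(text, keep=7):
    """Join every field past the keep-th into the keep-th field."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if len(row) > keep:
            # Concatenate the surplus fields onto the last field we keep.
            row = row[:keep - 1] + ["".join(row[keep - 1:])]
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample: a quoted comma in line one, two surplus fields in line two.
sample = 'a,"b,1",c,d,e,f,g\na,b,c,d,e,f,g,,h\n'
merged = merge_extra_fields(sample)
print(merged)
```

The quoted comma in "b,1" is treated as part of the field rather than as a separator, so the first line keeps its seven fields. Note that the writer re-quotes fields by its own rules, so the output quoting may differ from the input.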

Note also that replacing the match with \1 alone will delete everything after the seventh field. If you really want to just delete the commas and concatenate all of the later fields, as suggested by your example output, you should use \1\2 as the replacement. Each pass removes only one extra comma per line, so do "replace all" repeatedly until it no longer finds any matches.
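The "repeat until nothing matches" step can be scripted. Here is a rough sketch in Python, which is my choice and not something the question asked for; the helper name strip_extra_commas is made up, and the sample is the example from the question.

```python
import re

# Same pattern as above: six comma-terminated fields, a seventh field,
# one extra comma, then the rest of the line.
pattern = re.compile(r"^((?:[^,\r\n]*,){6}[^,\r\n]*),(.*)$", re.MULTILINE)

def strip_extra_commas(text):
    """Apply the replacement until a pass makes no substitutions."""
    while True:
        text, count = pattern.subn(r"\1\2", text)
        if count == 0:
            return text

sample = (
    "c1,c2,c3,c4,c5,c6,c7\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf,,,asdf,asdf\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf,,\n"
)
cleaned = strip_extra_commas(sample)
print(cleaned)
```

Each pass drops the seventh comma from every line that still has one, so the loop runs once per surplus comma on the worst line, plus a final pass that finds nothing.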

  • Thanks! (I had removed the \2 accidentally on an edit.) Commented Oct 11, 2019 at 20:31

Here is a way to do the job in a single pass:

  • Ctrl+H
  • Find what: (?:^(?:.*?,){6}|\G(?!^)).*?\K,
  • Replace with: LEAVE EMPTY
  • CHECK Wrap around
  • CHECK Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

(?:             # non capture group
    ^           # beginning of line
    (?:         # non capture group
        .*?     # 0 or more any character, not greedy
        ,       # a comma
    ){6}        # end group, must appear 6 times
  |             # OR
    \G          # restart from last match position
    (?!^)       # negative lookahead, make sure we are not at the beginning of a line
)               # end group
.*?             # 0 or more any character, not greedy
\K              # forget all we have seen until this position
,               # a comma
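For anyone doing this in a script rather than in the editor: Python's built-in re module supports neither \G nor \K, so the sketch below swaps in a different technique, str.split with a split limit, to get the same single-pass result. The function name and the keep parameter are made up for illustration.

```python
def strip_extra_commas_once(line, keep=6):
    """Keep the first `keep` commas of a line and delete any later ones."""
    # Split at most `keep` times; everything after the keep-th comma
    # (further commas included) ends up in the last piece.
    *head, tail = line.split(",", keep)
    return ",".join(head + [tail.replace(",", "")])

sample = (
    "c1,c2,c3,c4,c5,c6,c7\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf,,,asdf,asdf\n"
    "asdf,asdf,asdf,asdf,asdf,asdf,asdf,,\n"
)

# keepends=True leaves each line ending in place, so it survives unchanged.
single_pass = "".join(
    strip_extra_commas_once(line) for line in sample.splitlines(keepends=True)
)
print(single_pass)
```

Lines with six or fewer commas come back unchanged, because the last piece then contains no commas to remove. Like the regex, this treats every comma as a separator and ignores quoting.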
