I am struggling with coming up with a regexp in notepad++ that finds and replaces x number of bytes with nothing. Carriage return (0D) counts, line feed counts (0A).

This is the regex I am trying: (0C is my begin, I am removing 318 bytes after 0C along with the 0C)

\x0C(.{318})

This regex doesn't find anything; it says no match found. I can find \x0C, and I can find ., but I can't find .{318}. Also, . skips over 0x0A and 0x0D.

-wrap around is checked.

-regular expression is checked.

Here is part of the file in hex with ascii:

0C 30 31 32 27 34 35 36 0D 0A 30 61 32 0D 33 34 0A [snip] 0C 32 0A 0D 35 [etc..]
<ff>0  1  2  '  4  5  6<cr><lf>0  a  2<cr> 3  4<lf>[snip]<ff> 2<lf><cr>5 [etc..]
  • So what's your problem and what doesn't work about it? What does your input and output actually look like?
    – Seth
    Commented May 26, 2017 at 16:28
  • One thing you could try is converting the file to hex and running the regex on the hex, so the file will look a little bit like the one you show, but then you don't search for \x0C, you search for 0C literally. Your way, looking for the hex escape (e.g. \x0C), may work too if the file is ASCII, since every character is then one byte anyway. But include the file here: upload it to ge.tt and include a link in your question. And regarding your concern about whether or not the dot matches newlines, you can toggle that: superuser.com/questions/481276/…
    – barlop
    Commented May 26, 2017 at 17:03
  • The round brackets are superfluous so you can remove them. Also, try changing 318 to a much smaller number like 3, see if that matches anything. Then troubleshoot, find at what point it doesn't match.
    – barlop
    Commented May 26, 2017 at 17:57
  • @barlop I didn't have that option for . so I updated and now everything works great... I don't really know what to do with my question now though.
    – UpTide
    Commented May 26, 2017 at 18:01
  • @UpTide doesn't matter you could just leave it. It's good that you found the issue and cause of the problem you had.
    – barlop
    Commented May 26, 2017 at 18:08

1 Answer

Since you mentioned the encoding is us-ascii, we can assume each character is one byte. In regex, the '.' matches any character, except newlines, and you want each individual part of a CR/LF newline to be matched separately, since they are two bytes.

I'm also going to make the assumption that you are processing actual text data, and not a binary file that can contain bytes outside of the us-ascii character mapping.

If all of the above is true, you can use the following regex:

\x0C[^\xFF]{318}

The reason the '.' didn't work in your attempt is that the '.' does not match newlines. You also can't use \x0C[.\r\n]{318}, because the '.' wildcard is not available within a character class (square bracket group); inside the brackets it is just a literal dot. The hex value FF does not map to any valid codepoint in the us-ascii character set, so when you look for "any character that is not the FF character", you end up matching every byte the file can contain, CR and LF included.
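The difference between the dot and the negated set can be reproduced outside Notepad++. The following is a minimal sketch using Python's re module on bytes, not Notepad++'s own engine; the sample data and the shortened count of 8 are made up for illustration. Note that in Python the dot excludes only LF by default, which is still enough to make the dot version fail here.

```python
import re

# Hypothetical sample: a form feed followed by 8 bytes that include CR and LF.
data = b"\x0c012\r\n456tail"

# Dot version: fails, because '.' stops at the LF (0x0A).
dot_match = re.search(rb"\x0c.{8}", data)

# Negated-set version: any byte except 0xFF, so CR and LF are matched too.
neg_match = re.search(rb"\x0c[^\xff]{8}", data)

print(dot_match)          # no match object
print(neg_match.group())  # the form feed plus the 8 bytes after it
```

The count is 8 instead of 318 only to keep the sample readable; the pattern has the same shape as the one above.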

Keep in mind that this method counts a Windows newline (CR LF) as two characters/bytes (per your request).

Hope this is what you were looking for...

EDIT - Regex explained

Full expression

\x0C[^\xFF]{318}

Let's break this down.

\x0C

This matches the single character whose hexadecimal code is 0C, which is the form feed. The \x escape followed by two hex digits always names one specific character by its code; it is not a wildcard like the dot. Such escapes can also name line-break characters, for example \x0A or \x0D (this is important, more on this later).

But, since you also used this, I'm guessing you're already partly familiar with this.

[^\xFF]

Everything between [] is called a Character Set (not to be confused with the same concept in Character encoding). You can read more about it on Regexp Tutorial, but in summary, it serves as an "OR" statement. [ab] simply means, "a or b". When ^ is used inside a character set, it serves as a negation. So [^a] means "not a". In our use-case, we look for any character that is not the HEX value FF.

{318}

And we look for this kind of character exactly 318 times. The {} syntax always applies to the regex element just in front of it, so in this case the [^\xFF] character set.
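Put together, the three parts match the form feed plus the 318 bytes after it, 319 bytes in total. Here is a small sketch of the same find-and-replace-with-nothing in Python's re module (an assumption for illustration; Notepad++ does this through its Replace dialog). The record contents and the surrounding "keep" markers are invented test data.

```python
import re

# Hypothetical record: a form feed followed by exactly 318 bytes,
# built from 53 copies of a 6-byte line that ends in CR LF.
record = b"\x0c" + b"line\r\n" * 53
data = b"keep-before" + record + b"keep-after"

# Replace the form feed and the 318 bytes after it with nothing.
cleaned, replacements = re.subn(rb"\x0c[^\xff]{318}", b"", data)

print(len(record))    # 319
print(replacements)   # 1
print(cleaned)        # b'keep-beforekeep-after'
```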

Why \xFF?

In hexadecimal notation, the us-ascii character set goes from 00 up to 7F. Any higher value cannot be mapped to a us-ascii codepoint. This means that any file encoded (correctly) in us-ascii can only contain hex values between 00 and 7F. As a result, it can't contain FF.

So, we can cleverly make use of this to search for any character including newline characters, since a negated character set, unlike the dot, also matches line-break characters such as \x0A and \x0D. When we search for any character that is not FF, we end up finding every character.

Keep in mind that this solution depends on your file being encoded in us-ascii, and not UTF-8.
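If you are unsure whether a file really is pure us-ascii, you can check the bytes before relying on the trick. This is a rough sketch in Python; the helper name is_us_ascii and both samples are hypothetical, and the count is shortened to 4 for readability.

```python
import re

def is_us_ascii(data: bytes) -> bool:
    """Return True if every byte is in the us-ascii range 0x00..0x7F."""
    return all(byte <= 0x7F for byte in data)

pattern = re.compile(rb"\x0c[^\xff]{4}")

ascii_data = b"\x0cab\r\n"     # only us-ascii bytes
binary_data = b"\x0cab\xff\n"  # contains an FF byte

print(is_us_ascii(ascii_data), pattern.search(ascii_data) is not None)
print(is_us_ascii(binary_data), pattern.search(binary_data) is not None)
```

The second sample shows the failure mode: as soon as an FF byte falls inside the counted range, the pattern no longer matches.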

  • while your regex works great, I would love a walkthrough on what each part of it does. For some reason I have not been able to wrap my mind around regex statements.
    – UpTide
    Commented Jun 9, 2017 at 13:17
  • there you go :)
    – Wouter
    Commented Jun 9, 2017 at 14:05
  • Oh and, it's normal that you can't wrap your mind around regex statements. regex.info/blog/2006-09-15/247 mastery of regex takes a decade :)
    – Wouter
    Commented Jun 9, 2017 at 14:07
  • Your explanation is superb. If I understood this correctly, then this finds x0C, selects it, then selects the next 318 bytes (even if it is a x0C). This selects 319 bytes, including the x0C. Thanks! I feel like I need to make more accounts to uptick you more.
    – UpTide
    Commented Jun 9, 2017 at 15:33
  • Haha, thanks :) And yes, you understood this correctly.
    – Wouter
    Commented Jun 9, 2017 at 15:36
