On https://www.emacswiki.org/emacs/MultilineRegexp one finds the hint to use

[\0-\377[:nonascii:]]*\n

instead of the standard

.*\n

to match any character up to a newline, in order to avoid a stack overflow on huge texts (37 KB). Is the overflow the only concern here, or does the former also match faster than the latter?

1 Answer

In Emacs's regexps, . does not match all characters. It is a synonym of [^\n]. So the reason to use [\0-\377[:nonascii:]] is to match "any char, even a newline".
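A quick way to check this is sketched below with string-match-p. The sample string "a\nb" is only an illustration, and the alternation form \\(?:.\\|\n\\)* is used here as a stand-in for "any char, even a newline":

```elisp
;; "." is a synonym of [^\n], so ".*" anchored at both ends of the
;; string cannot span the newline in "a\nb".
(string-match-p "\\`.*\\'" "a\nb")               ; no match (nil)

;; A pattern that explicitly accepts newline as well spans the whole string.
(string-match-p "\\`\\(?:.\\|\n\\)*\\'" "a\nb")  ; matches at index 0
```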

With respect to overflowing the stack, .*\n should be handled very efficiently, i.e. without backtracking and without eating up the stack. In contrast, [\0-\377[:nonascii:]]*\n is handled rather inefficiently by Emacs's regexp engine, because it eats up a bit of the stack for every character matched, so on "huge" texts it will tend to overflow the stack.

Note that the emacswiki suggests [\0-\377[:nonascii:]]* and not [\0-\377[:nonascii:]]*\n.
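To try this on your own Emacs, a minimal sketch follows. The helper name my-regexp-probe and the size 100000 are hypothetical choices for illustration, and the alternation pattern stands in for any "match anything" regexp. Whether the second call reports an error depends on the Emacs version and its stack limits, so no result is claimed for it here.

```elisp
(defun my-regexp-probe (regexp size)
  "Hypothetical helper: match REGEXP against SIZE characters plus a newline.
Return the end position of the match, nil if there is no match, or the
error data if the matcher signals an error such as a stack overflow."
  (with-temp-buffer
    (insert (make-string size ?x) "\n")
    (goto-char (point-min))
    (condition-case err
        (and (looking-at regexp) (match-end 0))
      (error err))))

;; "." and "\n" are disjoint, so this should not consume the stack.
(my-regexp-probe ".*\n" 100000)

;; "Match anything" followed by "\n": the two sets overlap, so the engine
;; has to record intermediate steps for every character it consumes.
(my-regexp-probe "\\(?:.\\|\n\\)*\n" 100000)
```

Comparing the two return values for growing sizes shows where, if at all, the matcher gives up on a particular build.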

  • Thanks for the clarification. However, regarding the stack overflow, are you sure that [\0-\377[:nonascii:]]*\n will cause an overflow? This is the opposite of what the wiki claims. Is this because of the \n at the end? What use would a pattern like [\0-\377[:nonascii:]]* without an ending character be then? Commented Nov 21, 2016 at 15:37
  • Any regexp which matches "anything" will eat up stack space (with Emacs's regexp engine, I mean), and I don't see why [\0-\377[:nonascii:]]* would do so less than \\(.\\|\n\\)*. So I think the emacswiki is wrong on this one.
    – Stefan
    Commented Nov 21, 2016 at 15:46
  • Any way (or anyone) to authoritatively clarify on this issue? Commented Nov 21, 2016 at 16:01
  • @Vroomfondel test it and see. I can imagine that the regexp with | might need more backtracking, but whether it actually does depends on how it's compiled.
    – npostavs
    Commented Nov 21, 2016 at 20:05
  • That is true only if the regexp ends with [\0-\377[:nonascii:]]* (which is rather unusual, since you might as well use point-max rather than search for it via such a regexp). For the curious: the crux of the matter is whether the set of chars that can match after the * is disjoint from the set of chars that can match within the *. If they are disjoint, the regexp engine skips recording intermediate steps and hence avoids eating up stack space. So .*\n and [^a]*a don't consume the stack, whereas .*a does.
    – Stefan
    Commented Nov 23, 2016 at 18:30
