7

I have some programmatically assembled huge regex, like this

(A)|(B)|(C)|...

Each sub-pattern is in its capturing group. When I get a match, how do I figure out which group matches without linearly testing each group(i) to see it returns a non-null string?

2
  • Do you want to find which group matches or the contents of the group?
    – Ben Lings
    Commented Jul 24, 2009 at 16:23
  • I am not aware of a regex system that does what you are asking, and I am pretty sure that the one in core Java does it's system linearly. See @Thomas' post for better details.
    – aperkins
    Commented Jul 24, 2009 at 17:35

5 Answers 5

4

If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.

Update: I just looked through the Sun Java source, in particular, java.util.regex.Pattern$Branch.match(), and that does also simply do a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.

1
  • Yes they might share prefixes etc. Commented Jul 24, 2009 at 16:23
1

You can use non-capturing groups, instead of:

(A)|(B)|(C)|...

replace with

((?:A)|(?:B)|(?:C))

The non-capturing groups (?:) will not be included in the group count, but the result of the branch will be captured in the outer () group.

0

Break up your regex into three:

String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };

for(int i = 0; i < regexes.length; i++) {
  Pattern pattern = Pattern.compile(regexes[i]);

  Matcher matcher = pattern.matcher(inputStr);
  if(matcher.matches()) {
     //process, optionally break out of loop
  }
}

public int getMatchedGroupIndex(Matcher matcher) { 
  int index = -1;  

  for(int i = 0; i < matcher.groupCount(); i++) {
    if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
      index = i;
    }
  }

  return index;
}

The alternative is:

for(int i = 0; i < matcher.groupCount(); i++) {
  if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
     //process, optionally break out of loop
  }
}
2
  • I don't want to do a linear search. I'm asking if I can get functionality of this non-existent method Matcher.getMatchedGroupIndex() which will magically tell me which group is matched without me wading through each group to test it. Commented Jul 24, 2009 at 16:37
  • I added the getMatchedGroupIndex() method, but under the covers it will still use a FOR loop to iterate through the group contents.
    – OMG Ponies
    Commented Jul 24, 2009 at 16:57
0

I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int).

static int getMatchedGroupIndex(Matcher m)
{ 
  int index = -1;
  for (int i = 1, n = m.groupCount(); i <= n; i++)
  {
    if ( (index = m.start(i)) != -1 )
    {
      break;
    }
  }
  return index;
}

This way, instead of generating a substring for every group, you just query an int value representing its starting index.

0

From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to figure out the common pattern prefixes when you generate them, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see if the performance is acceptable for realistic input data, and if not the profile it to see where the real bottlenecks are.

Not the answer you're looking for? Browse other questions tagged or ask your own question.