Regular Expression Evaluator

Question

Task to be accomplished:

Write a function that takes two strings s and p as arguments and returns a boolean denoting whether s matches p.

p is a sequence of any number of the following:

a-z - which stands for itself
. - which matches any character
* - which matches 0 or more occurrences of the previous single character

Code:

#include <iostream>
using namespace std;

bool eval(string s, string p) { 
    int j = 0;
    for(int i = 0; i < p.length(); i++) {
        // is next character is wildcard
        if((int)p[i+1] == 42) 
        {
            // is it preceded by a dot
            if((int)p[i] == 46) 
            {
                char charToMatch = s[j];

                // keep moving forward in string until the repetition of the 'any character' is over
                while(s[j] == charToMatch) j++; 
            }
            // it's preceded by a-z
            else 
            {
                // keep moving forward in string until repetition of the letter is over
                while(s[j] == p[i]) j++; 
            }
        }
        // is current character a dot
        else if((int)p[i] == 46) 
        {
            // move forward in string as it'll match no matter what
            j++; 
        }
        // it's not '.' || '*'
        else 
        {
            // is it a-z
            if((int)p[i] >= 97 && (int)p[i] <= 122) { 

                // if current character in string != character mentioned in pattern
                if(s[j] != p[i]) 
                {
                    // pattern doesnt match
                    return false; 
                }
                else
                {
                    // it matches, move forward
                    j++; 
                }
            } 
        }
    }

    // if pattern is ended but string is still left. pattern didnt match the whole string.
    if(j < s.length()) return false;
    // gone through whole string, pattern also finished. pattern matches.
    else return true; 
}

int main() {
    cout << eval("a", "b.*"); // false
}

Constraints: Pattern is always valid and non-empty.

I have just started to learn about time complexity of algos and I believe its complexity should be \$\mathcal{O}(nm)\$ where \$n\$ and \$m\$ are the lengths of p and the s respectively.

If it's not please correct me. Along with that, what improvements and optimizations could be made to lower the complexity?

Incomputable · Accepted Answer · 2019-05-13 10:48:39Z

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

I agree with KonradRudolph that comments do not contribute anything useful.

// is next character is wildcard

This comment is misleading. * is called Kleene star. Without looking at ASCII table one would assume there is a bug.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

+1. Since you mention redundant comments let me emphasise that all comments in OP’s code are redundant, as they don’t add a single jot of clarity. By contrast, there’s no comment describing the algorithm. — Konrad Rudolph, Commented May 13, 2019 at 10:02
Great answer and very helpful. Thanks alot! One thing tho, I used (int)p[i+1] == 42 because the compiler was giving me an error due to non equal data type comparison. And, why is the complexity \$\mathcal{O}(n)\$ when there's a while loop inside the outer loop which in the worse case would pass over the whole string no? Is it because this causes the complexity to become a linear combination and hence just \$n\$ ? — Mohammad Areeb Siddiqui, Commented May 14, 2019 at 14:59
@MohammadAreebSiddiqui, what compiler are you using? Wandbox doesn't show any errors. About being linear time: each token in the regex will match some portion of the string. The sum will give the whole length in case the check reached the end, and in other cases it will be less than the length. Do not let loops to deceive you. — Incomputable, Commented May 14, 2019 at 15:36
@MohammadAreebSiddiqui, if you are on windows, you might want to download visual studio community, or at least the compiler. Using it with visual studio code and cmake is pretty easy. Clangd (aka intellisense) doesn't seem to work properly for me, but with some tinkering I'm sure you can get it to work. The compiler is more or less up to date. If you still want gcc (bundled with g++), there is mingw64 project, which offers more up to date compilers: mingw-w64.org/doku.php . I recommend deleting tdm-gcc 4.9.2 and forgetting it ever existed. — Incomputable, Commented May 16, 2019 at 9:22

Stack Exchange Network

Regular Expression Evaluator

1 Answer 1

Bug

Style

Runtime complexity

Theory

Not the answer you're looking for? Browse other questions tagged
c++
algorithm
regex
complexity
or ask your own question.

Hot Network Questions

Regular Expression Evaluator

1 Answer 1

Bug

Style

Runtime complexity

Theory

Not the answer you're looking for? Browse other questions tagged c++algorithmregexcomplexity or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
c++
algorithm
regex
complexity
or ask your own question.