Revisions to Regular Expression Evaluator - Code Review Stack Exchange

added 356 characters in body

Source Link

edited May 13, 2019 at 10:48

9.5k
3
33
72

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

I agree with KonradRudolph that comments do not contribute anything useful.

// is next character is wildcard

This comment is misleading. * is called Kleene star. Without looking at ASCII table one would assume there is a bug.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

I agree with KonradRudolph that comments do not contribute anything useful.

// is next character is wildcard

This comment is misleading. * is called Kleene star. Without looking at ASCII table one would assume there is a bug.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

added 56 characters in body

Source Link

edited May 13, 2019 at 9:06

Incomputable

9.5k
3
33
72

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of DFA to recognize string:

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of simple DFA to recognize string (of course it is possible to convert NFA to DFA):

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

added 6 characters in body

Source Link

edited May 13, 2019 at 8:03

Incomputable

9.5k
3
33
72

Bug

There is a case that requires NFA instead of DFA to recognize string:

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of DFA to recognize string:

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

Bug

There is a case that requires NFA instead of DFA to recognize string:

eval("aaab", "a*ab");

Gives false, even though the string matches the regex. Disallowing same character after * fixes the problem too.

Style

why-is-using-namespace-std-considered-bad-practice.

Pass by const reference for read-only access. In the case of eval and availability of C++17, one could use std::string_view. I would advise taking the input range as input iterator, that would allow matching files, input from console, etc.

(int)p[i+1] == 42

I do not see why that is preferred to

p[i + 1] == '*'

more readable version makes comment obsolete.

if(j < s.length()) return false;
    else return true;

Could be rewritten as

return j == s.length();

Note that for this a bug about running over the string should be fixed, the bug is described below. >= will work too.

There are also no checks if string is already fully traversed in conditions with *. Although all implementations that I know of store null terminator at the end, this will not work on arbitrary ranges of characters.

Runtime complexity

I believe runtime is \$\mathcal{O}(n)\$ in the worst case, where \$n\$ is length of input string. Think about it: the code does a single pass over a string, and doesn't reroll.

Theory

I would advise learning method of transforming NFA to DFA. It's great to learn parsing techniques, but there tools for that like flex, bison, etc exist, if you just want to hammer the nail into it's place.

I would assume .* means match any sequence of characters, as by logic in the question even doing | for all characters in the alphabet will make it impossible to describe just any combination of letters in the alphabet. Note that such a language is in the set of regular languages, as it is possible to construct DFA for it. May be I am wrong though, as my assumption stems from javascript regex.

added 226 characters in body

Source Link

edited May 13, 2019 at 5:54

Incomputable

9.5k
3
33
72

Loading

Source Link

created May 13, 2019 at 5:47

Incomputable

9.5k
3
33
72

Loading

Stack Exchange Network

Return to Answer

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory

Bug

Style

Runtime complexity

Theory