I'm working on a project where I want to evaluate how Behavior Driven Development is being used. I want to extract Gherkin steps from .feature files and match the underlying step definition code. For example, let's say I have the following Gherkin step:
Given I have an "index.md" page that contains "{{ site.title }}"
I want to find the underlying step definition that might look like this:
Given(%r!^I have an? "(.*)" page(?: with (.*) "(.*)")? that contains "(.*)"$!) do |file, key, value, text|
File.write(file, <<~DATA)
---
#{key || "layout"}: #{value || "none"}
---
#{text}
DATA
end
and match it and create a JSON file with the format like the following:
"step_name": "I have an \"index.md\" page that contains \"{{ site.title }}\"",
"step_definition": "do\nFile.write(file, <<-HEREDOC)\n---\n#{key || \"layout\"}: #{value || \"none\"}\n---\n\n#{text}\nHEREDOC\nend",
I'm using the behave.parser library in python to extract the Gherkin steps from the feature files. I'm wondering what would be the best way to extract the step definitions would be? I initially tried using regex but it's difficult to come up with an expression that is robust enough to capture all the different code variations (for example nested functions, etc). Do I need to use a parsing library to parse the underlying code into an Abstract Syntax Tree (AST) and then parse the underlying code? Is this overkill? Is there a better way to do it?