What is a more elegant way to parse this string?

Question

I have a task where I need to parse C# scripts, look for a certain method attribute and extract parts from it, and I wonder if there is a more elegant way than how I do it now:

[Info("Title", "Author", "5.2.5", ResourceId = 819)]

Here is what I do:

// foreach line in script
if (line.Contains("[Info(") && line.Contains("ResourceId"))
{
    var _attributes = line
        .Replace(" ", "")
        .Replace("\"", "")
        .Replace("[Info(", "")
        .Replace(")]", "")
        .Replace("ResourceId=", "")
        .Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
    // Do stuff with _attributes[0] _attributes[1] etc..
    break;
}

Text parsing is best done with regular expressions. Creating a regular expression automatically from a string can be achieved with this site: txt2re.com — Siderite Zackwehdex, Commented Apr 11, 2016 at 9:31
This task is ideal for regular expressions. But I disagree with Siderite that 'Text parsing is best done with regular expressions'. People tend to use regular expressions when they are not appropriate. Always try to use simpler methods (like string methods) rather than regular expressions if possible. Regular expressions are slow and use lots of memory. — jdweng, Commented Apr 11, 2016 at 9:51

Neuron · Accepted Answer · 2018-04-18 16:36:37Z

The easiest solution nowadays would be to use Roslyn. You can parse the code, find actual attributes (rather than things that look like the attribute you're looking for), and handle them all in a way that's C#-proper.

Here's a simple example:

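// Requires the Microsoft.CodeAnalysis.CSharp NuGet package.
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Parse the script text into a syntax tree; no compilation happens at this stage.
var infoAttributes = CSharpSyntaxTree.ParseText(@"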
namespace MyNamespace
{
    public class SomeClass
    {
        const string SomeConstant = ""Hi!"";

        [Info(""Some book"", ""Ray Brandenburg"", ""5.2.5"", ResourceId = 819)]
        public void SomeMethod()
        {

        }

        [InfoAttribute(SomeConstant, 42, ""Banana"")]
        public void SomeMethod2()
        {

        }

        // [Info(""Not going to happen"", ""Hilary Clinton"", ""1.2.0"")]
        public void SomeMethod3()
        {

        }
    }
}
")
.GetRoot()
.DescendantNodes()
.OfType<AttributeSyntax>()
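.Where(i => i.Name.ToString() == "Info" || i.Name.ToString() == "InfoAttribute")
// Skip attributes written without an argument list, e.g. [Info]; ArgumentList is null for those.
.Where(i => i.ArgumentList != null)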
.Where
(
  i => 
    i.ArgumentList.Arguments.Count(j => j.NameEquals == null) == 3 
    && i.ArgumentList.Arguments[0].GetFirstToken().IsKind(SyntaxKind.StringLiteralToken)
    && i.ArgumentList.Arguments[1].GetFirstToken().IsKind(SyntaxKind.StringLiteralToken)
    && i.ArgumentList.Arguments[2].GetFirstToken().IsKind(SyntaxKind.StringLiteralToken)
)
.Select
(
  i =>
  new 
  {
    Title = (string)i.ArgumentList.Arguments[0].GetFirstToken().Value,
    Author = (string)i.ArgumentList.Arguments[1].GetFirstToken().Value,
    Version = (string)i.ArgumentList.Arguments[2].GetFirstToken().Value,
    ResourceId = 
      i.ArgumentList.Arguments
       .Where(j => j.NameEquals != null && j.NameEquals.Name.ToString() == "ResourceId")
       .Select(j => j.ChildNodes().Skip(1).First().GetFirstToken().Value.ToString())
       .FirstOrDefault()
  }
);

// Dump() is a LINQPad extension method; outside LINQPad, enumerate infoAttributes instead.
infoAttributes.Dump();

At this level, this only parses the source code. To keep things simple, I added defensive clauses so that it only works with literal values - you'll probably want to turn those into warnings to be handled manually or something. The code correctly handles any trivia (e.g. whitespace), code that looks like an attribute declaration but isn't, comments and plenty of other possible issues. There's still a simplifying assumption - the values must be literals (string or otherwise). The example will only find one Info attribute, the one on SomeMethod - the one on SomeMethod2 uses a constant and a different constructor overload, and the one on SomeMethod3 is commented out.

Another level is creating a compilation from this syntax tree. That's a bit more involved, but allows you to make everything work as if it were real C# code - for example, the attribute on SomeMethod2 will resolve SomeConstant correctly. Of course, if you really want to be 100% correct, this requires gathering all the dependencies etc., which sounds like overkill. Unless this is a real problem in your code, warnings should do fine for the outliers. If local constants are used often in your code, expanding the code to handle a local literal constant is still pretty easy.
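
A minimal sketch of that second level, assuming the Microsoft.CodeAnalysis.CSharp package and a script that only needs the core library. The names ConstantResolver, ResolveAttributeArguments and "AttributeScan" are made up for illustration; the semantic model is asked for the constant value of each positional attribute argument, so a reference to a const field can be resolved where the compiler is able to bind it:

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class ConstantResolver
{
    // Returns the constant value of every positional attribute argument in the source,
    // or null where the compiler cannot determine a constant value.
    public static object[] ResolveAttributeArguments(string source)
    {
        SyntaxTree tree = CSharpSyntaxTree.ParseText(source);

        // Assumption: only the core library is referenced. A real script would
        // need references to all of its own dependencies as well.
        CSharpCompilation compilation = CSharpCompilation.Create(
            "AttributeScan",
            new[] { tree },
            new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) });

        SemanticModel model = compilation.GetSemanticModel(tree);

        return tree.GetRoot()
            .DescendantNodes()
            .OfType<AttributeArgumentSyntax>()
            .Where(a => a.NameEquals == null)                    // positional arguments only
            .Select(a => model.GetConstantValue(a.Expression))   // Optional<object>
            .Select(v => v.HasValue ? v.Value : null)
            .ToArray();
    }
}
```

Arguments that come back as null are the outliers mentioned above, and are natural candidates for a warning rather than a hard failure.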

As a disclaimer, this surely isn't the best way to do the parsing using Roslyn. It's just the first thing that came to mind and took only a little while to get going. I'm still finding better ways of dealing with Roslyn pretty much every day :)

This is very interesting indeed, could you possibly provide an example of how to get the attribute and values? — Dan-Levi Tømta, Commented Apr 11, 2016 at 9:48
@Dan-LeviTømta Added sample code. Note that it can be made more or less complex depending on your exact requirements - I went the cautious way for the most part, with a few simplifying assumptions. — Luaan, Commented Apr 11, 2016 at 10:53
This is truly very interesting, I did not know of Roslyn, thank you for letting me know of it, I will definitely have this as a reference for later projects. — Dan-Levi Tømta, Commented Apr 11, 2016 at 15:23

npinti · Accepted Answer · 2016-04-11 10:56:27Z

If for some reason what @Luaan suggests cannot be done, you can use an expression such as this: \[Info\("(.+?)", "(.+?)", "([\d.]+)", ResourceId\s*=\s*(\d+)\)\] to match and extract the values you are after.

EDIT: As pointed out by @Evk, this expression will also match commented attributes. If this is not something which you are after, please let me know.

EDIT: As per your query, you would need to use something like this: \[Info\("(.+?)", "(.+?)", "?([\d.]+)"?, ResourceId\s*=\s*(\d+)\)\]. In this pattern, each quotation mark around the 3rd argument is followed by the ? quantifier, which tells the engine that the quotation mark might not be there.
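
A minimal sketch of how the second expression could be used from C#. The class and method names (InfoAttributeParser, Parse) are made up for illustration. The pattern is written as a verbatim string, so backslashes stay as they are and each double quote is doubled; note that any space typed inside the pattern is matched literally, so the spacing has to be copied exactly:

```csharp
using System;
using System.Text.RegularExpressions;

public static class InfoAttributeParser
{
    // Verbatim string (@"..."): backslashes are literal, a double quote is written as "".
    private static readonly Regex InfoPattern = new Regex(
        @"\[Info\(""(.+?)"", ""(.+?)"", ""?([\d.]+)""?, ResourceId\s*=\s*(\d+)\)\]");

    // Returns { title, author, version, resourceId }, or null when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = InfoPattern.Match(line);
        if (!m.Success)
        {
            return null;
        }

        return new[]
        {
            m.Groups[1].Value, // title
            m.Groups[2].Value, // author
            m.Groups[3].Value, // version, with or without surrounding quotes
            m.Groups[4].Value  // ResourceId digits
        };
    }
}
```

Calling InfoAttributeParser.Parse on each line of the script replaces the chain of Replace calls from the question. As noted above, commented-out attributes still match.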

edited Apr 11, 2016 at 10:56

answered Apr 11, 2016 at 9:35

npinti


Don't forget to handle comments (//, /* ... */) in a proper way. If you just use this regex - it will match all commented attributes too.
– Evk
Commented Apr 11, 2016 at 9:38
@Evk: Yeah but the OP uses .contains, and commented code does not seem to be an issue. I'll add a note just in case.
– npinti
Commented Apr 11, 2016 at 9:40
Well, that is more of a comment to the author so he doesn't forget about this, since he might not have realized they can be commented out.
– Evk
Commented Apr 11, 2016 at 9:44
Thanks! What if, for some reason, the third attribute value appears without quotes, as 5.2.5 instead of "5.2.5"? Would I need a separate expression that matches this, or could I make one regular expression that also takes care of it? Sorry, I am not that familiar with regular expressions.
– Dan-Levi Tømta
Commented Apr 11, 2016 at 9:56
I'm trying to give the last regular expression a go, but I'm having difficulties getting a match. Should I not escape the slashes and quotes in the expression? Regex re = new Regex("\\[Info\\(\"(.+?)\", \"(.+?)\", \" ? ([\\d.] +)\"?, ResourceId\\s*=\\s*(\\d+)\\)\\]");
– Dan-Levi Tømta
Commented Apr 11, 2016 at 13:51
