Compact command line argument parser : Revisited

Question

This question is a follow up to my previous one which can be found here. The user dfhwze suggested me to look into compiler construction and recommended me to write a lexer and a parser that would process the input step by step. I am very grateful that he pointed me into this direction, because I have the feeling everything is much more robust now. As this is my first time implementing a lexer and a parser I am convinced there are still things that can be optimized a lot.

A few things things that come to my mind:

Are my naming conventions fine? Are all identifiers self-descriptive?
Can I abstract the project more? I'd like it to be as flexible as possible.
Are there performance optimizations that can be made?

Notes:

Comma separated lists are intentionally processed as one argument.

To test the code run the unit tests (xUnit).

CommandLineLexer.cs

public class CommandLineLexer
{
    /// <summary>
    /// To read a stream if characters
    /// </summary>
    private readonly TextReader _reader;

    /// <summary>
    /// The current token that is processed
    /// </summary>
    private CommandLineToken? _currentToken;

    /// <summary>
    /// Create a new lexer for an incoming character stream
    /// </summary>
    /// <param name="reader">The text reader that processes the data</param>
    public CommandLineLexer(TextReader reader)
    {
        _reader = reader;
    }

    /// <summary>
    /// Gets the next character in the stream
    /// </summary>
    /// <returns>Read the next character</returns>
    private char ReadCharacter()
    {
        char c = (char) _reader.Read();
        return c;
    }

    /// <summary>
    /// Reads next CommandLineToken
    /// </summary>
    /// <returns>The next lexed token</returns>
    public CommandLineToken Next()
    {
        var nextToken = Peek();
        _currentToken = null;
        return nextToken;
    }

    /// <summary>
    /// Check next token but doesn't read it yet
    /// </summary>
    /// <returns>The next token</returns>
    public CommandLineToken Peek()
    {
        if (_currentToken == null)
            _currentToken = ReadNextToken();
        return _currentToken.Value;
    }

    /// <summary>
    /// Verifies if there are more character is the inputstream
    /// </summary>
    /// <returns>true if there are more characters, false if end of inputstream</returns>
    public bool HasNext()
    {
        if (_currentToken == null)
        {
            SkipWhitespaces();
            return _reader.Peek() != -1;
        }

        return true;
    }

    /// <summary>
    /// Do not process whitespaces in the input unless they are part of an argument
    /// </summary>
    private void SkipWhitespaces()
    {
        while (true)
        {
            int c = _reader.Peek();
            if (c == -1 || !char.IsWhiteSpace((char) c))
                break;
            ReadCharacter();
        }
    }

    /// <summary>
    /// Read the next token
    /// </summary>
    /// <returns>The next lexed token</returns>
    /// <exception cref="EndOfStreamException"></exception>
    private CommandLineToken ReadNextToken()
    {
        SkipWhitespaces();

        int peakedChar = _reader.Peek();
        if (peakedChar == -1)
            throw new EndOfStreamException(nameof(_reader));
        char character = (char) peakedChar;

        // Parsing Logic
        switch (character)
        {
            case '-': return ReadSwitch();
            case '"': return ReadQuotedArg();
            case ',': return ReadCommaSeparator();
            default:
                return ReadArg();
        }
    }

    /// <summary>
    /// Reads arguments that start and end with a quotionmark
    /// </summary>
    /// <returns>The lexed argument token</returns>
    private CommandLineToken ReadQuotedArg()
    {
        var stringBuilder = new StringBuilder();

        while (true)
        {
            stringBuilder.Append(ReadCharacter());
            int chr = _reader.Peek();
            if (chr == -1 || chr == '"')
            {
                stringBuilder.Append("\"");
                ReadCharacter();
                break;
            }
        }

        return new CommandLineToken(CommandLineTerminal.Argument, stringBuilder.ToString());
    }

    /// <summary>
    /// Reads a comma separator token
    /// </summary>
    /// <returns>The lexed comma token</returns>
    private CommandLineToken ReadCommaSeparator()
    {
        return new CommandLineToken(CommandLineTerminal.Comma, ReadCharacter().ToString());
    }

    /// <summary>
    /// Reads an argument token
    /// </summary>
    /// <returns>The lexed comma token</returns>
    private CommandLineToken ReadArg()
    {
        var stringBuilder = new StringBuilder();
        var allowedChars = "abcdefghijklmonopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!:!?.".ToList();
        while (true)
        {
            int chr = _reader.Peek();
            if (chr == -1)
                break;
            if (chr == ',' || chr == ' ')
                break;
            if (!allowedChars.Contains((char) chr))
                throw new FormatException($"Illegal character in argument");

            stringBuilder.Append(ReadCharacter());
        }

        return new CommandLineToken(CommandLineTerminal.Argument, stringBuilder.ToString());
    }

    /// <summary>
    /// Reads an argument token 
    /// </summary>
    /// <returns>The lexed switch token</returns>
    private CommandLineToken ReadSwitch()
    {
        var stringBuilder = new StringBuilder();
        var allowedChars = "abcdefghijklmonopqrstuvxyz-".ToList();
        while (true)
        {
            int chr = _reader.Peek();
            if (chr == -1 || chr == ' ')
                break;
            if (!allowedChars.Contains((char) chr))
                throw new FormatException($"Illegal character in switch: {(char) chr}");

            stringBuilder.Append(ReadCharacter());
        }

        if (stringBuilder.ToString().All(x => x == '-'))
            throw new FormatException("Switch does not have a name");

        return new CommandLineToken(CommandLineTerminal.Switch, stringBuilder.ToString());
    }
}

CommandLineToken.cs

public struct CommandLineToken
{
    public CommandLineTerminal Terminal { get; }
    public string Text { get; }

    public CommandLineToken(CommandLineTerminal terminal, string text)
    {
        Terminal = terminal;
        Text = text;
    }
}

CommandLineTerminal.cs

public enum CommandLineTerminal
{
    /// <summary>
    /// Switch
    /// </summary>
    Switch,

    /// <summary>
    /// Argument of a switch
    /// </summary>
    Argument,

    /// <summary>
    /// Separator for a list of arguments
    /// </summary>
    Comma,
}

CommandLineParser.cs

public class CommandLineParser
{
    /* Grammar:
        * 
        * switches <- switch+
        * switch <- SWITCH args
        * args <- ARGUMENT (COMMA ARGUMENT)*
        */

    private readonly CommandLineLexer _lexer;

    public CommandLineParser(CommandLineLexer lexer)
    {
        _lexer = lexer ?? throw new ArgumentNullException(nameof(lexer));
    }

    public CommandLineParser(TextReader reader)
        : this(new CommandLineLexer(reader))
    {
    }

    public CommandLineParser(string input)
        : this(new StringReader(input))
    {
    }


    public IEnumerable<CommandLineExpression> ParseAll()
    {
        var parsed = new List<CommandLineExpression>();
        while (_lexer.HasNext())
            parsed.Add(Parse());

        return parsed;
    }

    private CommandLineExpression Parse()
    {
        var @switch = ExpectOneOf(CommandLineTerminal.Switch);

        // Switch without args
        if (!_lexer.HasNext())
            return new CommandLineExpression(@switch.Text, null);

        // Verify if there are more args after switch
        while (true)
        {
            var next = _lexer.Peek();
            switch (next.Terminal)
            {
                case CommandLineTerminal.Switch:
                    break;
                case CommandLineTerminal.Argument:
                {
                    var allArgs = ParseAllArgs();
                    return new CommandLineExpression(@switch.Text, allArgs);
                }

                default:
                    throw new FormatException("Invalid character");
            }
        }
    }

    private IList<IArgument> ParseAllArgs()
    {
        var allArgs = new List<IArgument>();
        while (true)
        {
            if (!_lexer.HasNext())
                return allArgs;

            var next = _lexer.Peek();
            switch (next.Terminal)
            {
                case CommandLineTerminal.Switch:
                    return allArgs;
                case CommandLineTerminal.Argument:
                {
                    // Check if we are dealing with an ArgList
                    var token = _lexer.Next();

                    if (!_lexer.HasNext())
                    {
                        allArgs.Add(new CommandLineArgument(token.Text));
                        return allArgs;
                    }

                    var next2 = _lexer.Peek();

                    if (next2.Terminal == CommandLineTerminal.Comma)
                    {
                        var argList = ParseArgList(token);
                        allArgs.Add(new CommandLineArgumentList(argList));

                        break;
                    }

                    // Add arg normally - its not part of a list
                    allArgs.Add(new CommandLineArgument(token.Text));
                    break;
                }

                default:
                    throw new FormatException("Invalid character");
            }
        }
    }

    private List<CommandLineArgument> ParseArgList(CommandLineToken token)
    {
        bool commaExpected = true;
        var argList = new List<CommandLineArgument>() {new CommandLineArgument(token.Text)};
        while (true)
        {
            if (!_lexer.HasNext())
                return argList;

            var next = _lexer.Peek();

            switch (@next.Terminal)
            {
                case CommandLineTerminal.Switch:
                {
                    return argList; // kk, new swithc starts we are done processing the arglist
                }

                case CommandLineTerminal.Argument:
                {
                    if (commaExpected)
                    {
                        // end of arg list but there is more args that do not belong to the list
                        return argList;
                    }

                    argList.Add(new CommandLineArgument(_lexer.Next().Text));
                    commaExpected = true;

                    break;
                }

                case CommandLineTerminal.Comma:
                {
                    if (commaExpected)
                    {
                        commaExpected = false;
                        // consume comma
                        _lexer.Next(); // ??
                        break;
                    }

                    throw new FormatException(); // two commas after each other?
                }
            }
        }
    }

    private CommandLineToken ExpectOneOf(params CommandLineTerminal[] terminals)
    {
        var token = _lexer.Next();
        if (!terminals.Contains(token.Terminal))
            throw new FormatException($"Expected {string.Join(",", "terminals")}");
        return token;
    }
}

CommandLineExpression.cs

public class CommandLineExpression
{
    public string Switch { get; }
    public IList<IArgument> Args { get; }

    public CommandLineExpression(string @switch, IList<IArgument> args)
    {
        Switch = @switch;
        Args = args;
    }
    // Can this be optimized?
    public override bool Equals(object obj)
    {
        var cmp = obj as CommandLineExpression ?? throw new ArgumentNullException(nameof(obj));
        if (Switch != cmp.Switch)
            return false;

        if (Args == null ^ cmp.Args == null)
            return false;

        if (Args == null && cmp.Args == null)
            return true;

        if (Args.Count != cmp.Args.Count)
            return false;

        for (var index = 0; index < Args.Count; index++)
        {
            // Verify if both args are arglists
            if (Args[index] is CommandLineArgumentList)
            {
                // Compare args and arglists too
                if (cmp.Args[index] is CommandLineArgumentList)
                {
                    // Iterate arg lists of both args
                    for (var index2 = 0; index2 < ((CommandLineArgumentList) Args[index]).Arg.Count; index2++)
                    {
                        var argListItem1 = ((CommandLineArgumentList) Args[index]).Arg[index2];
                        var argListItem2 = ((CommandLineArgumentList) cmp.Args[index]).Arg[index2];
                        if (argListItem1.Argument != argListItem2.Argument)
                            return false;
                    }
                }
                else
                {
                    return false;
                }

                continue;
            }

            if (cmp.Args[index] is CommandLineArgumentList)
            {
                // Compare args and arglists too
                if (Args[index] is CommandLineArgumentList)
                {
                    // Compare args and arglists too
                    for (var index2 = 0; index2 < ((CommandLineArgumentList) Args[index]).Arg.Count; index2++)
                    {
                        var argListItem1 = ((CommandLineArgumentList) Args[index]).Arg[index2];
                        var argListItem2 = ((CommandLineArgumentList) cmp.Args[index]).Arg[index2];
                        if (argListItem1.Argument != argListItem2.Argument)
                            return false;
                    }
                }
                else
                {
                    return false;
                }

                continue;
            }

            // If argument is not a list do the normal comparison
            var arg = (CommandLineArgument) Args[index];
            var arg2 = (CommandLineArgument) cmp.Args[index];
            if (arg.Argument != arg2.Argument)
                return false;
        }

        return true;
    }
}

CommandLineArgumentList.cs

public class CommandLineArgumentList : IArgument
{
    public IList<CommandLineArgument> Arg { get; }

    public CommandLineArgumentList(IList<CommandLineArgument> arg)
    {
        Arg = arg;
    }
}

CommandLineArgument.cs

public class CommandLineArgument : IArgument
{
    public string Argument { get; }

    public CommandLineArgument(string argument)
    {
        Argument = argument;
    }
}

IArgument.cs

public interface IArgument
{
}

Unit tests for verification:

CommandLineParserTest.cs

using System.Collections.Generic;
using System.Linq;
using System.Text;
using Xunit;

namespace TinyCommandLineParser.Core.Tests
{
    public class CommandLineParserTest
    {
        [Fact]
        public void ParseSwitchNoArgumentTest()
        {
            var parser = new CommandLineParser("--verbose");
            var actual = parser.ParseAll().ToList()[0];
            var expected = new CommandLineExpression("--verbose", null);

            Assert.Equal(actual, expected);
        }

        [Fact]
        public void ParseShit()
        {
            var parser = new CommandLineParser("--test --values file1 file2 --data A,B,C");
            var actual = parser.ParseAll().ToList();
            var expected = new List<CommandLineExpression>
            {
                new CommandLineExpression("--verbose", null),
                new CommandLineExpression("--values", new List<IArgument>()
                {
                    new CommandLineArgument("file1"),
                    new CommandLineArgument("file2")
                }),
                new CommandLineExpression("--data", new List<IArgument>()
                {
                    new CommandLineArgumentList(new List<CommandLineArgument>()
                    {
                        new CommandLineArgument("A"),
                        new CommandLineArgument("B"),
                        new CommandLineArgument("C")
                    }),
                })
            };

            Assert.All(actual, x => Assert.Equal(actual, expected));
        }

        [Fact]
        public void ParseSwitchMultipleArgumentTest()
        {
            var parser = new CommandLineParser("--data file1.txt file2.txt file3.txt");
            var actual = parser.ParseAll().ToList();
            var expected = new List<CommandLineExpression>
            {
                new CommandLineExpression("--data", new List<IArgument>()
                {
                    new CommandLineArgument("file1.txt"),
                    new CommandLineArgument("file2.txt"),
                    new CommandLineArgument("file3.txt"),
                })
            };

            Assert.All(actual, x => Assert.Equal(actual, expected));
        }

        [Fact]
        public void ParseSwitchesWithArgumentListsTest()
        {
            var stringBuilder = new StringBuilder();
            stringBuilder.Append("--data start.txt file1.txt,file2.txt,file3.txt end.txt ");
            stringBuilder.Append("--output-dir \"/home/user/my docs/\"");
            stringBuilder.Append("--more-data start2.txt file4.txt,file5.txt end2.txt ");
            stringBuilder.Append("--verbose");

            var parser = new CommandLineParser(stringBuilder.ToString());
            var actual = parser.ParseAll().ToList();

            var expected = new List<CommandLineExpression>
            {
                new CommandLineExpression("--data", new List<IArgument>()
                {
                    new CommandLineArgument("start.txt"),
                    new CommandLineArgumentList(new List<CommandLineArgument>()
                    {
                        new CommandLineArgument("file1.txt"),
                        new CommandLineArgument("file2.txt"),
                        new CommandLineArgument("file3.txt")
                    }),
                    new CommandLineArgument("end.txt"),
                }),
                new CommandLineExpression("--output-dir", new List<IArgument>()
                {
                    new CommandLineArgument("\"/home/user/my docs/\"")
                }),
                new CommandLineExpression("--more-data", new List<IArgument>()
                {
                    new CommandLineArgument("start2.txt"),
                    new CommandLineArgumentList(new List<CommandLineArgument>()
                    {
                        new CommandLineArgument("file4.txt"),
                        new CommandLineArgument("file5.txt"),
                    }),
                    new CommandLineArgument("end2.txt"),
                }),
                new CommandLineExpression("--verbose", null)
            };

            Assert.All(actual, x => Assert.Equal(actual, expected));
        }
    }
}

CommandLineLexerTest.cs

using System;
using System.Collections.Generic;
using System.IO;
using Xunit;

namespace TinyCommandLineParser.Core.Tests
{
    public class CommandLineLexerTest
    {

        [Fact]
        public void LexIncorrectlyFormattedSwitchTest()
        {
            Assert.Throws<FormatException>(() =>
            {
                var lexer = new CommandLineLexer(new StringReader("--ver´bose"));
                lexer.Next();
            });

            Assert.Throws<FormatException>(() =>
            {
                var lexer = new CommandLineLexer(new StringReader("--"));
                lexer.Next();
            });

            Assert.Throws<FormatException>(() =>
            {
                var lexer = new CommandLineLexer(new StringReader("-"));
                lexer.Next();
            });
        }

        [Fact]
        public void LexQuotedArgTest()
        {
            var input = "--phrase \"this is a test\" --info \"this is cool\"";
            var lexer = new CommandLineLexer(new StringReader(input));
            var tokens = new List<CommandLineToken>();

            while (lexer.HasNext())
                tokens.Add(lexer.Next());

            var expected = new List<CommandLineToken>()
            {
                new CommandLineToken(CommandLineTerminal.Switch, "--phrase"),
                new CommandLineToken(CommandLineTerminal.Argument, "\"this is a test\""),
                new CommandLineToken(CommandLineTerminal.Switch, "--info"),
                new CommandLineToken(CommandLineTerminal.Argument, "\"this is cool\"")
            };
            Assert.Equal(expected, tokens);
        }

        [Fact]
        public void LexMultipleArgsTest()
        {
            var input = "--load valueA valueB valueC 0x0600001";
            var lexer = new CommandLineLexer(new StringReader(input));
            var tokens = new List<CommandLineToken>();

            while (lexer.HasNext())
                tokens.Add(lexer.Next());

            var expected = new List<CommandLineToken>()
            {
                new CommandLineToken(CommandLineTerminal.Switch, "--load"),
                new CommandLineToken(CommandLineTerminal.Argument, "valueA"),
                new CommandLineToken(CommandLineTerminal.Argument, "valueB"),
                new CommandLineToken(CommandLineTerminal.Argument, "valueC"),
                new CommandLineToken(CommandLineTerminal.Argument, "0x0600001")
            };
            Assert.Equal(expected, tokens);
        }

        [Fact]
        public void LexLongSwitchesTest()
        {
            var input = "--output-directory --verbose -i -rt";
            var lexer = new CommandLineLexer(new StringReader(input));
            var tokens = new List<CommandLineToken>();

            while (lexer.HasNext())
                tokens.Add(lexer.Next());

            var expected = new List<CommandLineToken>()
            {
                new CommandLineToken(CommandLineTerminal.Switch, "--output-directory"),
                new CommandLineToken(CommandLineTerminal.Switch, "--verbose"),
                new CommandLineToken(CommandLineTerminal.Switch, "-i"),
                new CommandLineToken(CommandLineTerminal.Switch, "-rt")
            };
            Assert.Equal(expected, tokens);
        }

        [Fact]
        public void LexCommaSeparatedArgsTest()
        {
            var input = "--data here,is,some,random,data,123,\"more stuff\",cool";
            var lexer = new CommandLineLexer(new StringReader(input));
            var tokens = new List<CommandLineToken>();

            while (lexer.HasNext())
                tokens.Add(lexer.Next());

            var expected = new List<CommandLineToken>()
            {
                new CommandLineToken(CommandLineTerminal.Switch, "--data"),
                new CommandLineToken(CommandLineTerminal.Argument, "here"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "is"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "some"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "random"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "data"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "123"),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "\"more stuff\""),
                new CommandLineToken(CommandLineTerminal.Comma, ","),
                new CommandLineToken(CommandLineTerminal.Argument, "cool"),
            };
            Assert.Equal(expected, tokens);
        }

    }
}

Please be nitpicky in the review :)

I quickly glanced over the code and I'm impressed :) Could you include some unit tests to see how this parser/lexer works? — dfhwze, Commented Jun 19, 2019 at 4:42
@dfhwze I totally forgot to attach the tests :D I've added unit tests for the lexer and for the parser. — 766F6964, Commented Jun 19, 2019 at 7:42
I don't have time to review it today, but I will get to it asap. I do have a question for you. Do you want quoted arguments to remain quoted after being parsed into CommandLineArgument? I can image the token keeping the quotes, but the parsed results should not. Also think about literals that have quotes inside them. — dfhwze, Commented Jun 19, 2019 at 7:50
Yeah, I guess it would make sense to remove the quotes. I left them because the lexer should process every character, however it would make sense to remove them in the parser as they could be seen as pollution of the output.. — 766F6964, Commented Jun 19, 2019 at 7:55

Pieter Witvoet · Accepted Answer · 2019-06-19 11:57:16Z

API

Calling code has to pass input to CommandLineParser's constructor, but do the actual parsing with ParseAll. Calling ParseAll a second time then returns an empty output. A static CommandLineParser.Parse(input) method that creates that instance internally would be more sensible.
It's not clear what syntax this parser supports. Both "/?" and "--file C:\test.txt" result in a FormatException: Illegal character in argument. It's a good idea to document this for the users of your API.
Likewise, it's not clear what constructs are supported. It looks like every switch must have one or more values? Except when it's the last switch? And switches can have multiple groups of multiple values?
"-switch arg" results in a FormatException: Illegal character in switch: w. "-h1 arg" fails in a similar manner, and so do "-a=b" and "-a:b". Not to mention other languages like "-号 123".
The API is relatively low-level, with callers having to search through a list of switches and arguments and having to remove hyphens and quotes. A higher-level approach that lets callers describe the options that they support would be more useful. It may also be a good idea to support multiple input formats, such as -f, /f and --file, and have them all map to the same file option.
Switch arguments are not very intuitive due to their IArgument type. Why not use a simple array of strings instead?

Lexer

It's clear that a lot of care went into the lexer. Good first impression.
I'd remove some of the field comments - names like _reader and _currentToken are descriptive enough on their own.
_currentToken should probably be named _nextToken or _peekedToken.
ReadCharacter doesn't check if _reader is exhausted (_reader.Read() == -1).
Next and Peek can throw an EndOfStreamException if there's nothing left. You may want to document that.
ReadArg and ReadSwitch create a list of allowed characters on every call. Those lists should be static, but Linq's Contains method also allows you to work with just strings. Still, a whitelist approach is very restrictive. I'd go for blacklisting specific characters or perhaps specific Unicode categories.
TextReader should be disposed after use.

Parser

I'd rename parsed to expressions and Parse to ParseExpression.
Parse gets stuck in its while loop when a switch is followed by another switch. Parsing "-a -b" never ends.
ExpectOneOf joins the string "terminals", instead of the parameter terminals. This results in a not very helpful exception message.

Arguments

CommandLineExpression, CommandLineArgumentList and CommandLineArgument look like you intended them to be immutable. That's a good idea. There's one problem though: those IList properties may not be settable, but they are mutable. IReadOnlyList is better.
Regarding CommandLineExpression.Equals:
- Why do you need an equality check for this? If it's useful, why not also implement IEquatable<CommandLineExpression>?
- If you override Equals, you're also supposed to override GetHashCode.
- I don't expect Equals to throw, and throwing an ArgumentNullException when obj is of a different type is misleading.
- This method can indeed be simplified a lot. Implement Equals in both CommandLineArgumentList and CommandLineArgument, so you can use Enumerable.SequenceEqual to compare the Args lists.
- Instead of if (condition) { ... } else { return ..; }, you can use early-out returns to reduce nesting depth: if (!condition) return ..; .... This often makes code easier to read.
IArgument and the classes that implement it seem more complicated than necessary. What's the use of "-a 1,2 3,4" returning a list of argument-lists? How do callers know that they won't have to process a tree of arbitrary depth?

Tests

In ParseSwitchNoArgumentTest, parser.ParseAll().ToList()[0] can be simplified to parser.ParseAll().First(). However, what if the result is empty, or what if it contains extra unexpected items? It's better to compare the whole result instead of picking the first item.
The next test is poorly named. It can also be simplified by writing a few helper methods that can create (lists of) expressions and arguments. params is useful here.
I don't have XUnit here to verify, but in that test it looks like you're comparing each expression against the full list of expected expressions. Also, the names of the first switch items don't match. Are these tests actually passing?

Thank you for the in-depth feedback. I'll definitely apply some of these suggestions. Can you again elaborate on the optimization of the Equals() method? I am not so sure if I got that. Also I definitely have to write a unit test for that bug you found! — 766F6964, Commented Jun 19, 2019 at 22:21
bool Equals(object obj) => obj is CommandLineExpression other && Switch == other.Switch && ((Args == null) == (other.Args == null)) && (Args == null || Args.SequenceEqual(other.Args)); - a bit too long for => notation, but you get the idea. The other IArgument classes also need to implement Equals for the SequenceEqual call here to work as intended. — Pieter Witvoet, Commented Jun 20, 2019 at 7:19

dfhwze · Accepted Answer · 2019-06-20 04:36:42Z

Follow-up

In your previous post, I described some design issues I found. I'm happy to see your new design is cleaner (specially the lexer) and no longer depends on an already parsed array of tokens!

Pieter Witvoet has already went through your code and detected many edge cases your API falls short. (No need for me to re-iterate them) This is mainly because you still have a "lack of clear specification". I can't stress enough how important this is, specially since you state yourself

you want to provide several layers of abstraction and allow for a flexible design.

Without going into much detail (I'm mostly using pseudo-code), I will walk you through the steps required to create a compiler, reflecting back to your code.

But first, we need a clear specification.

Specification

We need to establish a specification. And since we are creating a compiler from scratch, why not be ambitious about it? As starting point, we have the following snippet with cmd_line_args being the command line arguments string and cmd the object graph representing the compiled string.

In pseudo-code:

var cmd = compile(cmd_line_args);

Context-bound language

Consider the following command line: cmd/ioc:\temp\

It's written in "compact form", a form with the highest density. It could be normalized to "friendly form", a form that has optimal readability.

But how should we interpret this? In other words, what is our friendly form? This brings us to our first design decision. Do we require a "context" or is our language "context-free"?

If our language is context-free, the command line above is ill-defined. The compact form would be the same as the friendly form: cmd /io c:\temp\
If on the other hand, our language is context-bound, the command line above would have a different friendly form depending on the context. The context could specify the known switches, which would allow us to combine switches.

Some possibilities include:

If the context specifies a verb "cmd" with switches "i" and "o" with the former having an argument "path", the friendly form would be: cmd /o /i c:\temp\
If the context specifies a verb "cmd" with switches "i" and "o" with the latter having an argument "path", the friendly form would be: cmd /i /o c:\temp\
If the context specifies a verb "cmd" with switch "io" having an argument "path", the friendly form would be: cmd /io c:\temp\

Let's make sure our compiler is context-free, but can be augmented with an optional context.

In pseudo-code:

var cmd = compile(cmd_line_args, context = null);

Lexicon

Next up, we need to determine which delimiters and other keywords are allowed. The command line cmd /o c:\temp\ could be formatted in different styles. Note that the "system path seperator" influences the delimiters.

Some possibilities include:

win style: cmd /o c:\temp\
win posix style: cmd -o c:\temp\
win posix long style: cmd --output c:\temp\
unix posix style: cmd -o /c/temp/
unix posix long style: cmd --output /c/temp/

Furthermore, a switch and its arguments could be formatted in different styles.

Some possibilities include:

cmd /o:c:\temp\
cmd /o=c:\temp\
cmd /o c:\temp\
cmd /o c:\temp\out1\ c:\temp\out2\
cmd /o c:\temp\out1\,c:\temp\out2\

Let's make sure our compiler uses a "lexicon", based on style preference and system path separator.

In pseudo-code:

var cmd = compile(cmd_line_args, lexicon = default, context = null);

Features

There is no universal set of features a command line tool must comprise. This means the compiler can be as simple or complex as we decide. The more complex compilers (like Powershell) allow for expressions, piping, and more exotic stuff. Perhaps this is a bridge too far for our use case.

I propose to use the a superset of the most common features found across compilers.

Feature list:

verbs: cmd get-logs
flags: cmd /a -q --verbose
options: cmd /input c:\in\ -o=c:\out\
arguments: cmd -o c:\logs\ c:\out\
operands: cmd -o c:\logs\ -- readme.txt
combined switches: cmd /aqo c:\out\
repeating options: cmd -o c:\in\ -o c:\in\nested\
help: cmd get-logs -? /h --help
about: cmd -! --version
escape sequence: cmd a\r\nb ~ a[newline]b
unicode escape sequence: cmd get-logs \u002Dq ~ cmd get-logs -q
literal unicode escape sequence: cmd get-logs c:\temp\\x69\x6E\ ~ cmd get-logs c:\temp\in\
quoted literal: cmd "my \"quoted\" literal"
alt quoted literal: cmd 'my "quoted" literal'

Definitions:

Verb: defines a group of shared functionality and operations.
Switch: the union of flags and options with their arguments.
Flag: a switch that does not have an argument. It's considered a boolean.
Option: a switch that takes 0..* arguments. Some arguments may be mandatory, others optional.
Argument: the value or one of the values linked to a parent option.
Operand: the value or one of the values linked to the verb, or default verb is none specified.

Syntax:

Escaping unicode: \u[n,4] or \U[n,8] -> \u002D, \U00020B20
Escaping unicode in literal: \x[n,1-4] -> \x0, \x01, \x001, \x0001
Quoting literal: "A string with whitespace, and other delimiters and \"escaped\" quotes"
Alt quoting literal: 'A string with whitespace, and other delimiters and "no need to escape" quotes'
Operand delimiter: cmd -o c:\logs\ -- readme.txt -> -- forces all remaining tokens to be operands

Compiler

Having our specification, we should let a command line go through a set of layers to get it compiled. Ideally, we would like to end up with our compiler doing:

In pseudo-code:

// input
var cmd_line_args = "cmd get-logs \u002Dq -ab c:\temp\in\ -- out.txt";

// compiler
var cmd = compile(cmd_line_args, lexicon = default, context = null);

// print command line back to string, using some style
cmd.print(win, short) -> "cmd get-logs -q -a -b c:\temp\in\ -- out.txt"
cmd.print(posix, long) -> "cmd get-logs --quiet --all --binary -c /c/temp/in/ -- out.txt""

let compile(cmd_line_args, lexicon = default, context = null) = 
{
    var cmd_line_sanitized = preprocess(cmd_line_args);
    var tokens = lex(cmd_line_sanitized, lexicon, context);
    var astTree = parse(tokens, lexicon, context).optmize();
    var graph = materialize(astTree);
}

1. Pre-processor

unescape unicode escape sequences: get-logs -q -ab c:\temp\in\ -- out.txt

Your API does not have a pre-processor defined.

2. Lexer

create tokens from pre-processed command line string

Your API provides a set of tokens.

 public enum CommandLineTerminal
    {
        Switch,
        Argument,
        Comma,
    }

Given our specification, we should extend this:

public enum CommandLineTerminal
{
    Verb,
    Switch, // could be flag, combined flags, option (lexer might not know the difference)
    Flag,
    Option, 
    Argument,
    Operand,
    // keyword terminals (many lexers include these, but keep them hidden from higher layers)
    Whitespace, // contextual
    SwitchPrefix // '-' '/' '--'
    OptionArgumentSeparator, // ':' '=' 
    ArgumentDelimiter, // ','
    OperandDelimiter, // '--' (without an option attached)
}

yielding us:

- verb: get-logs
- whitespace
- switch prefix: -
- switch: q
- whitespace
- switch prefix: -
- switch: ab
- whitespace
- argument: c:\temp\in\
- whitespace
- operand delimiter: --
- whitespace
- operand: out.txt

Your API stores tokens as follow:

public struct CommandLineToken
{
    public CommandLineTerminal Terminal { get; }
    public string Text { get; }

    public CommandLineToken(CommandLineTerminal terminal, string text)
    {
        Terminal = terminal;
        Text = text;
    }
}

I would extend this, and keep track of:

line number -> allows for better exception output to consumer
token type (hidden or normal) -> hidden: white space, delimiters, ..

3. AST Parser

create an abstract syntax tree from the tokens
could use the context of a tree to further refine tokens (switch -> flag or option)
not all tokens from the lexer end up in the AST

Your API does not include this step, instead goes on to materialize directly.
 private IList<IArgument> ParseAllArgs()
 {
    // impl ..
 }

An AST might look like this:

In pseudo-code:

 // `get-logs -q -ab c:\temp\in\ -- out.txt`
 Node->verb: name=get-logs
    child: Node->flag: name=q longname=quiet
    child: Node->combined flag: name=ab longname=all
        child: Node->argument: name=path value=c:\temp\in\
    child: Node->operand delimiter
    child: Node->operand: name=logfile value=out.txt

In fact, by not using the AST parser, you are working yourself a bit in trouble. This next quote by you makes me think you try to have a flattened parser, rather than a tree parser.

Comma separated lists are intentionally processed as one argument.

AST Node

You were struggling to build a tree structure. I suggest a class in the likes of:

class AstNode 
{
    internal AstNode Parent;
    internal List<AstNode> Children;
    internal CommandLineToken Token;
    internal CommandLineTerminal Terminal;
}

Building the AST from a flattened list of lexed tokens requires a common parsing technique shift-reduce. See links for parsing and examples.

Links:

4. AST Optimizer

A set of predefined optimizers should be run on the AST to normalize the graph.

In our example:

Combined flag ab can be uncombined. The context might show us that the argument belongs to b.

child: Node->flag: name=a longname=all
child: Node->option: name=b longname=binary
    child: Node->argument: name=path value=c:\temp\in\

5. Parser / Materializer

Map the AST to a concrete object graph, usable by consumers of the API.

Your API has such classes as CommandLineArgument.

6. Printer

the materialized graph can be printed back to a command line string
using tree walkers the graph can be transformed to a string

Thanks for the review. A lot I need to change ;) Regarding testing, what cases would you recommend me to cover? I've the feeling that I am missing a lot of cases, especially for the parser? — 766F6964, Commented Jun 19, 2019 at 23:08
Another things that I struggle with is to come up with a good tree-like output. As already mentioned in a previous reply the argument lists makes me end up with a very weird data structure. I'd like to encapsulate the output in a class object that allows me for easy access to every switch as well as all the corresponding arguments / argument lists (if any). The main issue with the lists if I don't nest them is, that I can't decide where the list starts. Is the previous element part of the list or not? Can you maybe suggest me a rough class outline, how a parse result class could look like? — 766F6964, Commented Jun 20, 2019 at 1:18
I have always wondered which genius has invented the double-dash argument prefix for full names and single-dash for short ones. I guess he must have hated people because there is nothing more annoying than -- ;-] @766F6964 be nice to the users of your command-line interpreter and save them from typing something that has absolutely no value. It looks like its only purpose is to allow combining flags which you do like never but always trip up on that damn -- :-] — t3chb0t, Commented Jun 20, 2019 at 4:28
I have included a template for an AstNode and some links describing how to parse AST using shift-reduce. Regarding unit tests, I would make a list of all possible friendly, compact an exotic command lines I can think of that the lexer should allow to handle. For each lexer unit test, include a parser unit test. — dfhwze, Commented Jun 20, 2019 at 4:39
Besides, there are 2 main groups of people that use command lines. System administrators that love compact forms cmd/abcc:\temp and normal humans that want readable command lines cmd -a -b -c c:\temp. Make sure to please both :) — dfhwze, Commented Jun 20, 2019 at 4:41

Stack Exchange Network

Compact command line argument parser : Revisited

2 Answers 2

API

Lexer

Parser

Arguments

Tests

Follow-up

Specification

Context-bound language

Lexicon

Features

Compiler

1. Pre-processor

2. Lexer

3. AST Parser

4. AST Optimizer

5. Parser / Materializer

6. Printer

Not the answer you're looking for? Browse other questions tagged
c#
parsing
reinventing-the-wheel
console
lexical-analysis
or ask your own question.

Linked

Hot Network Questions

Compact command line argument parser : Revisited

2 Answers 2

API

Lexer

Parser

Arguments

Tests

Follow-up

Specification

Context-bound language

Lexicon

Features

Compiler

1. Pre-processor

2. Lexer

3. AST Parser

4. AST Optimizer

5. Parser / Materializer

6. Printer

Not the answer you're looking for? Browse other questions tagged c#parsingreinventing-the-wheelconsolelexical-analysis or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
c#
parsing
reinventing-the-wheel
console
lexical-analysis
or ask your own question.