24

I have a string that I am reading from another system. It's basically a long string that represents a list of key value pairs that are separated by a space in between. It looks like this:

 key:value[space]key:value[space]key:value[space]

So I wrote this code to parse it:

string myString = ReadinString();
string[] tokens = myString.Split(' ');
foreach (string token in tokens) {
     string key = token.Split(':')[0];
     string value = token.Split(':')[1];
     .  . . . 
}

The issue now is that some of the values have spaces in them, so my "simplistic" split at the top no longer works. How can I still parse out the list of key-value pairs (with space as the separator character) now that the value field can also contain spaces? Split on its own doesn't seem able to handle this anymore.

NOTE: I now confirmed that KEYs will NOT have spaces in them so I only have to worry about the values. Apologies for the confusion.

11
  • 1
    do you have control over the input format?
    – Stefan
    Commented May 31, 2011 at 11:01
  • @Jason - i am trying to get control, in which case i will change the separator character but i am still concerned that whatever character i use as a separator could also be in the value field.
    – leora
    Commented May 31, 2011 at 11:02
  • 3
    Is it at least enforced that there is no : possible inside values? if not, you are stuck. If you generate the long string, then you have the possibility to escape the characters to avoid the trouble, but then you'll need something better than Split to read the input.
    – jdehaan
    Commented May 31, 2011 at 11:03
  • @jdehaan - good point. I know there is definitely no ":" in the key but i can possibly imagine that showing up in the value one day (even though i can't find an example now). I obviously want to be a bit future proof
    – leora
    Commented May 31, 2011 at 11:04
  • 12
    this cannot be done. How do you know if a word belongs to the value or to the next key?
    – vidstige
    Commented May 31, 2011 at 11:09

9 Answers 9

22

Use this regular expression:

\w+:[\w\s]+(?![\w+:])

I tested it on

test:testvalue test2:test value test3:testvalue3

It returns three matches:

test:testvalue
test2:test value
test3:testvalue3

You can change \w to any character set that can occur in your input.

Code for testing this:

var regex = new Regex(@"\w+:[\w\s]+(?![\w+:])");
var test = "test:testvalue test2:test value test3:testvalue3";

foreach (Match match in regex.Matches(test))
{
    var key = match.Value.Split(':')[0];
    var value = match.Value.Split(':')[1];

    Console.WriteLine("{0}:{1}", key, value);
}
Console.ReadLine();

As Wonko the Sane pointed out, this regular expression fails on values that contain a colon. If you expect that situation, use \w+:[\w: ]+?(?![\w+:]) as the regular expression. It will still fail when a colon in a value is preceded by a space, though... I'll think about a solution to this.
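
One possible way to tolerate colons inside values is a lazy match that stops only where the next " key:" begins. This is a sketch, not the answerer's final solution: it assumes keys consist only of \w characters and that every new pair starts with whitespace followed by key and a colon. The names KeyValueRegex and Parse are made up for illustration. A value that itself contains a space, a word and a colon (for example "see note: here") would still be misread as the start of a new pair.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueRegex
{
    // Group 1: the key, one or more word characters.
    // Group 2: the value, as few characters as possible, ending right
    //          before the next " key:" or before the end of the input.
    private static readonly Regex Pair =
        new Regex(@"(\w+):(.*?)(?=\s+\w+:|\s*$)", RegexOptions.Singleline);

    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Called on "test:testvalue test2:test:withcolon value test3:testvalue3", Parse should return three pairs, with test2 mapped to "test:withcolon value".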

6
  • 15
    "any sufficiently advanced regex is indistinguishable from magic" ;)
    – Alex
    Commented May 31, 2011 at 12:24
  • Note: \s will also match tab and newline, so if you think they can occur in values change [\w\s] to [\w ]
    – Episodex
    Commented May 31, 2011 at 12:28
  • Note - "test:testvalue test2:test:withcolon value test3:testvalue3" fails test2. Commented May 31, 2011 at 13:27
  • @Wonko the Sane: you're right. I added solution to this. Still not perfect though.
    – Episodex
    Commented May 31, 2011 at 13:45
  • Right - the problem is that to use a parser on a pattern, the pattern must always be predictable. Commented May 31, 2011 at 14:49
5

This cannot work without changing your split from a space to something else such as a "|".

Consider this:

Alfred Bester:Alfred Bester Alfred:Alfred Bester

  • Is this key "Alfred Bester" & value "Alfred", or key "Alfred" & value "Bester Alfred"?
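
If the separator can be changed to "|", parsing becomes unambiguous again. The following is a minimal sketch under two assumptions that are not in the question: "|" never appears inside a key or a value, and only the first colon in each token separates key from value. PipeSeparated and Parse are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class PipeSeparated
{
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (string token in input.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Limit the split to two parts so that colons inside the value survive.
            string[] parts = token.Split(new[] { ':' }, 2);
            if (parts.Length != 2)
                throw new FormatException("Missing ':' in token: " + token);

            // Dictionary.Add throws if the same key occurs twice.
            result.Add(parts[0], parts[1]);
        }
        return result;
    }
}
```

With this format, "Alfred Bester:Alfred|Alfred:Bester Alfred" yields the key "Alfred Bester" with value "Alfred" and the key "Alfred" with value "Bester Alfred", so spaces in keys and values no longer matter.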
0
4
string input = "foo:Foobarius Maximus Tiberius Kirk bar:Barforama zap:Zip Brannigan";

foreach (Match match in Regex.Matches(input, @"(\w+):([^:]+)(?![\w+:])"))
{
   Console.WriteLine("{0} = {1}", 
       match.Groups[1].Value, 
       match.Groups[2].Value
      );
}

Gives you:

foo = Foobarius Maximus Tiberius Kirk
bar = Barforama
zap = Zip Brannigan
1
  • Nice! I like this solution better because it groups the key and value rather than relying on a split. This puts more logic in the regex and allows for more customization, such as the potential for using Google-style groupings to encapsulate value strings. e.g. key1:(this:is:a funky value)
    – Jason
    Commented May 7, 2013 at 22:41
2

You could try to URL-encode the content between the spaces (the keys and the values, not the : symbol), but this would require that you have control over the input method.
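
A rough sketch of the URL-encoding idea, assuming you control both the producer and the consumer of the string. It uses Uri.EscapeDataString and Uri.UnescapeDataString from the .NET base library; the class name EncodedPairs and its two methods are hypothetical. Because spaces and colons inside keys and values are escaped (as %20 and %3A), the plain splits on ' ' and ':' become safe again.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EncodedPairs
{
    // Producer side: escape each key and value, then join with ':' and ' '.
    public static string Serialize(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        return string.Join(" ", pairs.Select(p =>
            Uri.EscapeDataString(p.Key) + ":" + Uri.EscapeDataString(p.Value)));
    }

    // Consumer side: split as in the original code, then unescape each part.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string token in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = token.Split(':');
            if (parts.Length != 2)
                throw new FormatException("Expected exactly one ':' in token: " + token);

            result.Add(new KeyValuePair<string, string>(
                Uri.UnescapeDataString(parts[0]),
                Uri.UnescapeDataString(parts[1])));
        }
        return result;
    }
}
```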

Or you could simply use another format (like XML or JSON), but again you will need control over the input format.

If you can't control the input format, you could always use a regular expression that searches for single spaces followed by a word plus a colon.

Update (Thanks Jon Grant) It appears that you can have spaces in the key and the value. If this is the case you will need to seriously rethink your strategy as even Regex won't help.

5
  • As much as I hate Regex, I think it is the way to go in this instance. Commented May 31, 2011 at 11:07
  • 1
    That's why I use it. Not because I can, but because I have to. :D Commented May 31, 2011 at 11:08
  • 2
    The question says there can be spaces in the key AND the value... even regexes can't solve that problem.
    – Jon Grant
    Commented May 31, 2011 at 11:13
  • Ah, I missed that bit of the original question. That being the case, you're right and the OP is likely to have to resort to best guesses... Maybe if the keys are from a predefined list of possibilities? Commented May 31, 2011 at 11:21
  • Well it's possible. You could probably then just scan the string and search for the location of the {key}: and then substring it from that location to the next : and then check for any key that is still in the string and replace it. But this seems very "Ugly". Can't you change the Input Format ? Or is it a third party lib? Commented May 31, 2011 at 11:25
1
string input = "key1:value key2:value key3:value";
Dictionary<string, string> dic = input.Split(' ').Select(x => x.Split(':')).ToDictionary(x => x[0], x => x[1]);

The first Split produces an array:

"key:value", "key:value"

Then an array of arrays:

{ "key", "value" }, { "key", "value" }

And then a dictionary:

"key" => "value", "key" => "value"

Note that Dictionary<K,V> doesn't allow duplicate keys; ToDictionary will throw an exception in that case. If such a scenario is possible, use ToLookup() instead.
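
For the duplicate-key case, the ToLookup() variant might look like the sketch below. LookupExample and Parse are illustrative names, and the Where filter that skips tokens without a colon is an added assumption about how malformed input should be treated. Like the one-liner above, it still assumes values contain no spaces.

```csharp
using System;
using System.Linq;

public static class LookupExample
{
    public static ILookup<string, string> Parse(string input)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Split on the first colon only, so "k:a:b" gives key "k", value "a:b".
            .Select(token => token.Split(new[] { ':' }, 2))
            // Assumption: tokens without a colon are silently skipped.
            .Where(parts => parts.Length == 2)
            // Unlike ToDictionary, ToLookup keeps every value for a repeated key.
            .ToLookup(parts => parts[0], parts => parts[1]);
    }
}
```

Indexing the result with a key returns all values recorded for it, and a key that never occurred returns an empty sequence rather than throwing.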

1

Using a regular expression can solve your problem:

private void DoSplit(string str)
{
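    // Normalize to exactly one trailing space; the pattern needs a
    // terminating character after the last value.
    str = str.Trim() + " ";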
    string patterns = @"\w+:([\w+\s*])+[^!\w+:]";
    var r = new System.Text.RegularExpressions.Regex(patterns);
    var ms = r.Matches(str);
    foreach (System.Text.RegularExpressions.Match item in ms)
    {
        string[] s = item.Value.Split(new char[] { ':' });
        //Do something
    }
}
0

This code will do it (given the rules below). It parses the keys and values and returns them in a Dictionary<string, string> data structure. I have added some code at the end that assumes, given your example, that the last value of the entire string/stream will be followed by a [space]:

private Dictionary<string, string> ParseKeyValues(string input)
        {
            Dictionary<string, string> items = new Dictionary<string, string>();

            string[] parts = input.Split(':');

            string key = parts[0];
            string value;

            int currentIndex = 1;

            while (currentIndex < parts.Length-1)
            {
                int indexOfLastSpace=parts[currentIndex].LastIndexOf(' ');
                value = parts[currentIndex].Substring(0, indexOfLastSpace);
                items.Add(key, value);
                key = parts[currentIndex].Substring(indexOfLastSpace + 1);
                currentIndex++;
            }
            // The last part holds only the final value; drop the trailing [space] if present.
            value = parts[parts.Length - 1].TrimEnd(' ');

            items.Add(key, value);

            return items;

        }

Note: this algorithm assumes the following rules:

  1. No spaces in the keys (values may contain spaces)
  2. No colons in the keys
  3. No colons in the values
0

Without any regex or string concatenation, and as an enumerable (it assumes keys don't contain spaces, but values can):

    public static IEnumerable<KeyValuePair<string, string>> Split(string text)
    {
        if (text == null)
            yield break;

        int keyStart = 0;
        int keyEnd = -1;
        int lastSpace = -1;
        for(int i = 0; i < text.Length; i++)
        {
            if (text[i] == ' ')
            {
                lastSpace = i;
                continue;
            }

            if (text[i] == ':')
            {
                if (lastSpace >= 0)
                {
                    yield return new KeyValuePair<string, string>(text.Substring(keyStart, keyEnd - keyStart), text.Substring(keyEnd + 1, lastSpace - keyEnd - 1));
                    keyStart = lastSpace + 1;
                }
                keyEnd = i;
                continue;
            }
        }
        if (keyEnd >= 0)
            yield return new KeyValuePair<string, string>(text.Substring(keyStart, keyEnd - keyStart), text.Substring(keyEnd + 1));
    }
0

I guess you could take your method and expand upon it slightly to deal with this stuff...

Kind of pseudocode:

List<string> parsedTokens = new List<string>();
string[] tokens = myString.Split(' ');
foreach (string token in tokens)
{
    // A token with a colon starts a new key:value pair (this assumes
    // values never contain colons). The first token is always added as-is.
    if (parsedTokens.Count == 0 || token.Contains(":"))
    {
        parsedTokens.Add(token);
    }
    else
    {
        // Otherwise the token is another word of the previous value,
        // so append it; this also copes with values with multiple spaces.
        parsedTokens[parsedTokens.Count - 1] += " " + token;
    }
}

Another approach would be to split on the colon... That way, your first array item would be the name of the first key, second item would be the value of the first key and then name of the second key (can use LastIndexOf to split it out), and so on. This would obviously get very messy if the values can include colons, or the keys can contain spaces, but in that case you'd be pretty much out of luck...
