I can see how to tokenise a string in the traditional manner (e.g. the answers to "How do I tokenize a string in C++?"), but how can I split a string by its tokens while also keeping the separators between them?

For example, given a date/time picture such as yyyy\MMM\dd HH:mm:ss, I would like to split it into an array containing the following:

"yyyy", "\", "MMM", "\", "dd", " " , "HH", ":", "mm", ":", "ss"

The "tokens" are yyyy, MMM, dd, HH, mm, ss in this example. I don't know what the separators are, only what the tokens are. The separators need to appear in the final result, however. The complete list of tokens is:

        "yyyy"  // - four-digit year, e.g. 1996
        "yy"    // - two-digit year, e.g. 96
        "MMMM"  // - month spelled out in full, e.g. April
        "MMM"   // - three-letter abbreviation for month, e.g. Apr
        "MM"    // - two-digit month, e.g. 04
        "M"     // - one-digit month for months below 10, e.g. 4
        "dd"    // - two-digit day, e.g. 02
        "d"     // - one-digit day for days below 10, e.g. 2
        "ss"    // - two-digit second
        "s"     // - one-digit second for seconds below 10
        "mm"    // - two-digit minute
        "m"     // - one-digit minute for minutes below 10
        "tt"    // - AM/PM designator
        "t"     // - first character of AM/PM designator
        "hh"    // - two-digit hour, 12-hour clock
        "h"     // - one-digit hour for hours below 10, 12-hour clock
        "HH"    // - two-digit hour, 24-hour clock
        "H"     // - one-digit hour for hours below 10, 24-hour clock

I've noticed the standard library std::string isn't very strong on parsing and tokenising and I can't use boost. Is there a tight, idiomatic solution? I'd hate to break out a C-style algorithm for doing this. Performance isn't a consideration.

  • Are your tokens always going to be special characters, or could they be letters too? Commented Apr 25, 2016 at 10:47
  • The tokens are only those above. The characters between tokens could be some Japanese character (I have to use wchar_t as I'm on Windows so that's potentially an issue too). Basically whatever a date/time picture string looks like for any given locale. All I know for sure about them is their components are defined in the picture with the above listed tokens.
    – Robinson
    Commented Apr 25, 2016 at 10:59
  • Regular expressions are part of C++11: cplusplus.com/reference/regex . Unless performance is an absolute requirement, use regex.
    – Dummy00001
    Commented Apr 25, 2016 at 11:10
  • The trouble with regex is I have to spend months learning an almost incomprehensible syntax to solve this one, somewhat simpler problem.
    – Robinson
    Commented Apr 25, 2016 at 11:14

1 Answer

Perhaps strtok (http://www.cplusplus.com/reference/cstring/strtok/) is what you're looking for; the reference page includes a useful example.

However, it eats the delimiters. You could work around that by comparing each returned pointer with the base pointer and adding the length of the returned string, which gives the index of the delimiter that followed it.

#include <iostream>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

int main() 
{
    char data[] = "yyyy\\MMM\\dd HH:mm:ss";
    std::vector<std::string> tokens;

    char* pch = strtok (data,"\\:");                                        // pch holds 'yyyy'
    while (pch != NULL)
    {
        tokens.push_back(pch);

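        // Offset of the token from the start of the buffer plus its length
        // is the position of the delimiter that strtok overwrote with '\0'.
        int delimiterIndex = static_cast<int>(pch - data + strlen(pch));    // delimiter index: 4, 8, ...
        std::stringstream ss;
        ss << delimiterIndex;
        tokens.push_back(ss.str());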

        pch = strtok (NULL,"\\:");                                          // pch holds 'MMM', 'dd', ...
    }

    for (const auto& token : tokens)
    {
        std::cout << token << ", ";
    }
}

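This gives the following output, where each number is the index of the delimiter after the preceding piece. The space is not in the delimiter set, so "dd HH" stays together: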

yyyy, 4, MMM, 8, dd HH, 14, mm, 17, ss, 20, 
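
Since the question defines the tokens rather than the separators, another option is to scan the picture from left to right, match the longest known token at each position, and collect everything else as separator text. The following is only a minimal sketch of that idea, not a tested production routine. The function name splitPicture is made up for illustration, it uses std::string rather than the wchar_t strings mentioned in the comments, and it assumes the picture has no quoting or escaping rules that would let a token letter appear inside a separator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for illustration): splits a date/time
// picture into known tokens and the literal separator runs between them.
std::vector<std::string> splitPicture(const std::string& picture)
{
    // Longer tokens are listed before their shorter prefixes, so the first
    // match found at a position is also the longest one.
    static const std::vector<std::string> known = {
        "yyyy", "yy", "MMMM", "MMM", "MM", "M", "dd", "d",
        "ss", "s", "mm", "m", "tt", "t", "hh", "h", "HH", "H"
    };

    std::vector<std::string> parts;
    std::string separator;      // characters seen since the last token
    std::size_t pos = 0;

    while (pos < picture.size())
    {
        const std::string* match = nullptr;
        for (const auto& token : known)
        {
            // compare() clips at the end of the string, so a token that
            // would run past the end simply does not match.
            if (picture.compare(pos, token.size(), token) == 0)
            {
                match = &token;
                break;
            }
        }

        if (match)
        {
            // Flush any pending separator before storing the token.
            if (!separator.empty())
            {
                parts.push_back(separator);
                separator.clear();
            }
            parts.push_back(*match);
            pos += match->size();
        }
        else
        {
            separator += picture[pos++];
        }
    }

    if (!separator.empty())
        parts.push_back(separator);

    return parts;
}
```

Calling splitPicture("yyyy\\MMM\\dd HH:mm:ss") should give the eleven pieces listed in the question. Note that adjacent separator characters are merged into one element, so ", " would come back as a single entry. The same loop should carry over to std::wstring by changing the string type and using wide literals.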
  • That's an interesting approach.
    – Robinson
    Commented Apr 25, 2016 at 11:26
