20

I'm tokenizing with the following function, but I'm unsure how to include the delimiters in the output.

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters)
{
    // find_first_of returns string::size_type; an int cannot reliably hold string::npos
    string::size_type startpos = 0;
    string::size_type pos = str.find_first_of(delimiters, startpos);

    while (string::npos != pos || string::npos != startpos)
    {
        // Token between the previous run of delimiters and the next delimiter
        tokens.push_back(str.substr(startpos, pos - startpos));

        // Skip the delimiters, then find the end of the next token
        startpos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, startpos);
    }
}

5 Answers

17

The C++ String Toolkit Library (StrTk) has the following solution:

std::string str = "abc,123 xyz";
std::vector<std::string> token_list;
strtk::split(";., ",
             str,
             strtk::range_to_type_back_inserter(token_list),
             strtk::include_delimiters);

This should leave token_list with the following elements:

Token0 = "abc,"
Token1 = "123 "
Token2 = "xyz"

More examples can be found Here

4

I know this is a little sloppy, but this is what I ended up with. I did not want to use Boost, since this is a school assignment and my instructor wanted me to use find_first_of to accomplish this.

Thanks for everyone's help.

vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;
    string::size_type startpos = 0;

    while (startpos < strInput.size())
    {
        string::size_type pos = strInput.find_first_of(strDelims, startpos);

        // No delimiter left: the rest of the input is the last token
        if (pos == string::npos)
        {
            vS.push_back(strInput.substr(startpos));
            break;
        }

        // Push the word in front of the delimiter, unless it is empty
        if (pos > startpos)
            vS.push_back(strInput.substr(startpos, pos - startpos));

        // If the delimiter is a new line (\n), push it in escaped form
        if (strInput[pos] == '\n')
            vS.push_back("\\n");
        // Otherwise push the delimiter itself, unless it is a space
        else if (strInput[pos] != ' ')
            vS.push_back(string(1, strInput[pos]));

        startpos = pos + 1;
    }

    return vS;
}
2

I can't really follow your code. Could you post a working program?

Anyway, here is a simple tokenizer; I have not tested the edge cases:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

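void tokenize(vector<string>& tokens, const string& text, const string& del)
{
    // An empty delimiter matches everywhere, so treat the text as one token
    if (del.empty())
    {
        tokens.push_back(text);
        return;
    }

    string::size_type startpos = 0,
        currentpos = text.find(del, startpos);

    // Each token runs from startpos through the end of the delimiter
    while (currentpos != string::npos)
    {
        tokens.push_back(text.substr(startpos, currentpos - startpos + del.size()));

        startpos = currentpos + del.size();
        currentpos = text.find(del, startpos);
    }

    // Whatever follows the last delimiter is the final token
    if (startpos < text.size())
        tokens.push_back(text.substr(startpos));
}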

Example input, delimiter = $$:

Hello$$Stack$$Over$$$Flow$$$$!

Tokens:

Hello$$
Stack$$
Over$$
$Flow$$
$$
!

Note: I would never use a tokenizer I wrote without testing it! Please use boost::tokenizer!

5
  • I edited my post to include the whole function. I see what you did, but the delimiters are passed as a single string in which each character is a delimiter, like so: " ,.!\n". A comma, period, exclamation mark, or new line should be pushed into the vector as well, but not a space. That way I can join the vector back together with a space between the items and rebuild the string.
    – Jeremiah
    Commented Oct 2, 2009 at 18:54
  • To be clear: the comma, period, exclamation mark, and new line, as well as the space, are all delimiters.
    – Jeremiah
    Commented Oct 2, 2009 at 18:54
  • Aha :) I think I misunderstood the question. I thought you wanted to include the delimiters with the tokens. Why don't you use boost::tokenizer? It does exactly what you want. Commented Oct 2, 2009 at 19:00
  • Can I get the tokenizer without the entire library?
    – Jeremiah
    Commented Oct 2, 2009 at 19:25
  • You could use boost::bcp to extract the required headers. It is not that simple but you could try. Commented Oct 2, 2009 at 19:34
2

If the delimiters are single characters and not strings, then you can use strtok.

2
  • Thanks .. I had almost forgotten about this function :P
    – poorva
    Commented Sep 5, 2013 at 11:09
  • 1
    strtok consumes the delimiter tokens, I believe.
    – Santa
    Commented Aug 28, 2015 at 4:24
0

It depends on whether you want the preceding delimiters, the following delimiters, or both, and what you want to do with strings at the beginning and end of the string that may not have delimiters before/after them.

I'm going to assume you want each word, with its preceding and following delimiters, but NOT any strings of delimiters by themselves (e.g. if there's a delimiter following the last string).

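template <class iter>
void tokenize(std::string const &str, std::string const &delims, iter out) {
    // Positions are std::string::size_type so comparisons with npos are well defined
    std::string::size_type pos = 0;
    do {
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;  // only delimiters (or nothing) left
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        // Leading delimiters + word + trailing delimiters
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}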

For the moment, I've written it more like an STL algorithm, taking an iterator for its output instead of assuming it's always pushing onto a collection. Since it depends (for the moment) on the input being a string, it doesn't use iterators for the input.

2
  • I want the string "Test string, on the web.\nTest line one." to be split into tokens, with a space, a comma, a period, and \n as delimiters, like so: Test string , on the web . \n Test line one .
    – Jeremiah
    Commented Oct 2, 2009 at 19:38
  • Sorry, it didn't post correctly. After the word "delimiters", each token was supposed to be on its own line.
    – Jeremiah
    Commented Oct 2, 2009 at 19:39
