Tokenize a string and include delimiters in C++

Question

I'm tokenizing with the following, but I'm unsure how to include the delimiters in the output.

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters)
{
    // Positions are string::size_type; an int cannot hold string::npos
    // without a narrowing conversion.
    string::size_type startpos = 0;
    string::size_type pos = str.find_first_of(delimiters, startpos);

    while (string::npos != pos || string::npos != startpos)
    {
        // Token between the previous delimiter run and the next delimiter.
        tokens.push_back(str.substr(startpos, pos - startpos));

        // Skip the run of delimiters, then find the end of the next token.
        startpos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, startpos);
    }
}

score 17 · Accepted Answer · 2010-12-13 05:39:05Z


The C++ String Toolkit Library (StrTk) has the following solution:

std::string str = "abc,123 xyz";
std::vector<std::string> token_list;
strtk::split(";., ",
             str,
             strtk::range_to_type_back_inserter(token_list),
             strtk::include_delimiters);

This should leave token_list with the following elements:

Token₀ = "abc,"
Token₁ = "123 "
Token₂ = "xyz"

More examples can be found Here

edited Dec 13, 2010 at 5:39

answered Oct 17, 2009 at 21:59

Matthieu N.


Jeremiah · Accepted Answer · 2009-10-03 15:50:42Z

I know this is a little sloppy, but this is what I ended up with. I did not want to use Boost, since this is a school assignment and my instructor wanted me to use find_first_of to accomplish this.

Thanks for everyone's help.

vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;

    string::size_type startpos = 0;
    string::size_type pos = strInput.find_first_of(strDelims, startpos);

    while (startpos < strInput.length())
    {
        // Text between the previous delimiter and this one (or the end of input).
        const string word = strInput.substr(startpos, pos - startpos);
        if (!word.empty())
            vS.push_back(word);

        // No delimiter left: the word just pushed was the last token.
        if (pos == string::npos)
            break;

        // A newline delimiter is stored as the two characters '\' and 'n'.
        if (strInput[pos] == '\n')
            vS.push_back("\\n");
        // Any other delimiter except a space is stored as its own token.
        else if (strInput[pos] != ' ')
            vS.push_back(string(1, strInput[pos]));

        // Continue right after the delimiter.
        startpos = pos + 1;
        pos = strInput.find_first_of(strDelims, startpos);
    }

    return vS;
}

Khaled Alshaya · Accepted Answer · 2009-10-02 18:38:19Z

2

I can't really follow your code. Could you post a working program?

Anyway, this is a simple tokenizer, without testing edge cases:

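#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Splits text on the multi-character delimiter del; each token keeps the
// delimiter that follows it. del must not be empty.
void tokenize(vector<string>& tokens, const string& text, const string& del)
{
    string::size_type startpos = 0,
        currentpos = text.find(del, startpos);

    // A while loop (rather than do/while) avoids doing arithmetic on
    // string::npos when the delimiter never occurs in text.
    while (currentpos != string::npos)
    {
        tokens.push_back(text.substr(startpos, currentpos - startpos + del.size()));

        startpos = currentpos + del.size();
        currentpos = text.find(del, startpos);
    }

    // Whatever follows the last delimiter (empty if text ends with del).
    tokens.push_back(text.substr(startpos));
}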

Example input, delimiter = $$:

Hello$$Stack$$Over$$$Flow$$$$!

Tokens:

Hello$$
Stack$$
Over$$
$Flow$$
$$
!

Note: I would never use a tokenizer I wrote without testing it! Please use boost::tokenizer!

answered Oct 2, 2009 at 18:38

Khaled Alshaya


I edited my post to include all of the function. I see what you did, but the delimiters will be a string, and each char in the string will be a delimiter. It is passed like so: " ,.!\n". So a comma, period, exclamation mark, and newline will be pushed into the vector as well, but not the space. This way I can join the vector back together, with a space between the items, and rebuild the string.
– Jeremiah
Commented Oct 2, 2009 at 18:54
Comma, period, exclamation mark, and newline, including the space, will be the delimiters. Sorry, I wanted to make that clear.
– Jeremiah
Commented Oct 2, 2009 at 18:54
Aha :) I think I misunderstood the question. I thought you wanted to include the delimiters with the tokens. Why don't you use boost::tokenizer? It does exactly what you want.
– Khaled Alshaya
Commented Oct 2, 2009 at 19:00
Can I get the tokenizer without the entire library?
– Jeremiah
Commented Oct 2, 2009 at 19:25
You could use boost::bcp to extract the required headers. It is not that simple but you could try.
– Khaled Alshaya
Commented Oct 2, 2009 at 19:34
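
The rebuild step described in the comments above (joining the token vector back together with a space between items) could be sketched roughly as follows. The helper name join_with_spaces is hypothetical and not part of any answer here. Note that the result is not identical to the original input: punctuation tokens end up surrounded by spaces.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: joins the tokens with a single space between them.
std::string join_with_spaces(const std::vector<std::string>& tokens)
{
    std::string result;
    for (std::vector<std::string>::size_type i = 0; i < tokens.size(); ++i)
    {
        if (i != 0)
            result += ' ';      // separator goes between items only
        result += tokens[i];
    }
    return result;
}
```

For the tokens Test, string, "," and on, this yields the string "Test string , on".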


sean riley · Accepted Answer · 2009-10-02 20:17:16Z

2

If the delimiters are characters and not strings, then you can use strtok.
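
A minimal sketch of how strtok might be used here, with some assumptions: std::strtok overwrites each delimiter it finds with '\0', so the delimiter has to be read back from an untouched copy of the input. The helper name tokenize_keep_delims is hypothetical. Only the first delimiter after each token is kept, and std::strtok keeps internal state, so it is not safe to use from several threads at once.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits input with std::strtok and appends to each
// token the delimiter character that terminated it (if there was one).
std::vector<std::string> tokenize_keep_delims(const std::string& input,
                                              const char* delims)
{
    std::vector<std::string> tokens;

    // strtok modifies its buffer, so work on a NUL-terminated copy.
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');

    for (char* tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        std::string token(tok);
        // Index in the original input of the character right after the token.
        const std::size_t end =
            static_cast<std::size_t>(tok - buf.data()) + token.size();
        if (end < input.size())
            token += input[end];   // the delimiter strtok replaced with '\0'
        tokens.push_back(token);
    }
    return tokens;
}
```

Called with "abc,123 xyz" and the delimiters ";., ", this produces the tokens "abc,", "123 " and "xyz".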

answered Oct 2, 2009 at 20:17

sean riley


Thanks .. I had almost forgotten about this function :P
– poorva
Commented Sep 5, 2013 at 11:09
1

strtok consumes the delimiter tokens, I believe.
– Santa
Commented Aug 28, 2015 at 4:24


Jerry Coffin · Accepted Answer · 2009-10-02 19:04:06Z

It depends on whether you want the preceding delimiters, the following delimiters, or both, and what you want to do with strings at the beginning and end of the string that may not have delimiters before/after them.

I'm going to assume you want each word, with its preceding and following delimiters, but NOT any strings of delimiters by themselves (e.g. if there's a delimiter following the last string).

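#include <string>

template <class iter>
void tokenize(std::string const &str, std::string const &delims, iter out) {
    // Positions are std::string::size_type so that they can hold npos.
    std::string::size_type pos = 0;
    do {
        // Start of the next word; stop if only delimiters remain.
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        // Emit the word together with its preceding and following delimiters.
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}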

For the moment, I've written it more like an STL algorithm, taking an iterator for its output instead of assuming it's always pushing onto a collection. Since it depends (for the moment) on the input being a string, it doesn't use iterators for the input.
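
As a rough illustration of what the output iterator buys you, the sketch below sends the same tokens either into a vector or into a stream. It repeats the function above (with std::string::size_type for the positions) so that the block is self-contained. The two wrapper functions, tokens_in_order and tokens_one_per_line, are hypothetical names introduced only for this example.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Copy of the function above, using std::string::size_type for positions.
template <class Iter>
void tokenize(std::string const &str, std::string const &delims, Iter out) {
    std::string::size_type pos = 0;
    do {
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}

// Hypothetical wrapper: collect the tokens in a vector via back_inserter.
std::vector<std::string> tokens_in_order(std::string const &text,
                                         std::string const &delims) {
    std::vector<std::string> result;
    tokenize(text, delims, std::back_inserter(result));
    return result;
}

// Hypothetical wrapper: write each token followed by a newline to a stream.
std::string tokens_one_per_line(std::string const &text,
                                std::string const &delims) {
    std::ostringstream os;
    tokenize(text, delims, std::ostream_iterator<std::string>(os, "\n"));
    return os.str();
}
```

With the input "abc,123 xyz" and the delimiters ";., ", tokens_in_order returns "abc,", ",123 " and " xyz": each word carries both its preceding and its following delimiters, so a delimiter between two words appears in both neighbouring tokens.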

I want the string "Test string, on the web.\nTest line one." to be tokenized like so. I want a space, a comma, a period, and \n to be delimiters. Test string , on the web . \n Test line one .
– Jeremiah
Commented Oct 2, 2009 at 19:38
Sorry, it didn't post correctly. After the word "delimiters", each item was supposed to be on its own line.
– Jeremiah
Commented Oct 2, 2009 at 19:39
