I forget regular expressions faster than my mother's birthday, which is a major PITA. Anyhow, I wanted an RE for parsing the HTTP response status line, with the sub-elements properly captured. I got this working:
#include <boost/regex.hpp>
#include <boost/foreach.hpp>
#include <iostream>
#include <string>

// version major, version minor, status code, reason phrase
const boost::regex status_line("HTTP/(\\d+?)\\.(\\d+?) (\\d+?) (.*)\r\n");
std::string status_test1("HTTP/1.1 200 hassan ali\r\n");
boost::smatch what;
std::cout << boost::regex_match(status_test1, what, status_line, boost::match_extra) << std::endl;
std::cout << what.size() << std::endl;
BOOST_FOREACH(std::string s, what)
{
    std::cout << s << std::endl;
}
The 4th capture group is what I was fussing about, particularly tokenising its words. But I don't need it, so my job is done. However, I'd still like to know how to tokenise a space-separated sentence (one that ends with a '\0') into a vector/array of stripped words.
I can't get the following fragment to work:
const boost::regex sentence_re("(.+?)( (.+?))*");
boost::smatch sentence_what;
std::string sentence("hassan ali syed ");
std::cout << boost::regex_match(sentence, sentence_what, sentence_re, boost::match_extra) << std::endl;
BOOST_FOREACH(std::string s, sentence_what)
{
    std::cout << s << std::endl;
}
It shouldn't match "hassan ali syed " (with the trailing space), but it should match "hassan ali syed", and the capture groups should print hassan, ali and syed on separate lines; instead it outputs hassan, syed, syed (note the space in the third one: <space>syed). I suppose capture groups can't deal with recursive entities?
So, is there a clean way of specifying a tokenising task in PCRE syntax that results in a clean token vector (without repetition -- i.e., I don't want the nested group to have to strip the whitespace)?
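For what it's worth, the closest I've gotten to a clean token vector inside Boost.Regex itself is the token iterator in "split" mode (submatch index -1), which sidesteps the nested-group problem entirely. A rough sketch of what I mean, assuming whitespace is the only delimiter (not tested beyond this toy input):

#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string sentence("hassan ali syed ");
    boost::regex delim_re("\\s+"); // describes the delimiter, not the token

    // Submatch index -1 asks for the pieces *between* matches of the
    // delimiter, i.e. a whitespace split.
    boost::sregex_token_iterator it(sentence.begin(), sentence.end(), delim_re, -1), end;
    std::vector<std::string> tokens(it, end);

    for (std::size_t i = 0; i < tokens.size(); ++i)
        std::cout << tokens[i] << std::endl;
    return 0;
}

I haven't checked what the trailing space does to the last token there, so that bit would need verifying.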
I know this isn't the right tool for the job (Spirit/Lex or boost::tokenizer would be better), and I know it isn't the right way to go about it. In .NET, when doing screen scraping, I'd find tokens in bodies of text by repeatedly applying a regular expression to the body until it ran out of tokens.
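In Boost terms, the nearest thing I can think of to that repeated-application trick is regex_search in a loop, advancing past each match until nothing is left. A sketch of the idea, assuming a token is just a run of non-whitespace (not something I've battle-tested):

#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string body("hassan ali syed ");
    boost::regex token_re("\\S+"); // a token is a run of non-whitespace
    std::vector<std::string> tokens;

    std::string::const_iterator start = body.begin(), stop = body.end();
    boost::smatch m;
    // Keep re-applying the expression to what's left of the input,
    // like the .NET loop, until regex_search finds nothing more.
    while (boost::regex_search(start, stop, m, token_re))
    {
        tokens.push_back(m.str());
        start = m[0].second; // resume just past the current match
    }

    for (std::size_t i = 0; i < tokens.size(); ++i)
        std::cout << tokens[i] << std::endl;
    return 0;
}

That still hard-codes whitespace as the separator, but it matches the "apply until it runs out" approach without any nested capture groups.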