Introduction to Boost regex


Transcript of Introduction to Boost regex

Page 1: Introduction to Boost regex

Introduction to Boost.Regex

Yongqiang Li

Page 2: Introduction to Boost regex

Boost Libs
• Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications.
• Boost works on almost any modern operating system, including UNIX and Windows variants.
• Latest version is 1.34.1.
• Boost.Regex is a C++ library which can be used to parse text or strings and decide whether they match the regular expression we defined.
• Boost.Regex was written by Dr. John Maddock.

Page 3: Introduction to Boost regex

Installation
• Step 1: Download boost_1_34_1.zip from http://sourceforge.net/project/showfiles.php?group_id=7586
• Step 2: Unzip the files to a suitable directory.
• Step 3: Use “Visual Studio .NET 2003 Command Prompt” to open a command line window.
• Step 4: Go to %BOOST%/libs/regex/build
• Step 5: Compile and install the lib:
    nmake -fvc71.mak
    nmake -fvc71.mak install
• Step 6: Add the Boost include directory to Visual Studio.

Page 4: Introduction to Boost regex

• Note:
• If you want the “repeated captures” feature, uncomment BOOST_REGEX_MATCH_EXTRA in boost/regex/user.hpp before compiling.
• If the version you downloaded is 1.34.1, you may need to rename the library files after installing. The filename should be “***34_1.lib”, not “***34.lib”. The default lib directory of VC is “partition_you_install/Program Files/Microsoft Visual Studio .NET 2003/Vc7/lib”.

Page 5: Introduction to Boost regex

Main classes and typedefs
• boost::basic_regex
    • It stores a regular expression.
    • It is very closely modeled on std::string.
    • typedef basic_regex<char> regex;
    • typedef basic_regex<wchar_t> wregex;
• boost::match_results
    • It stores the matching result.
    • typedef match_results<const char*> cmatch;
    • typedef match_results<const wchar_t*> wcmatch;
    • typedef match_results<string::const_iterator> smatch;
    • typedef match_results<wstring::const_iterator> wsmatch;

Note: all of them are available through <boost/regex.hpp>.

Page 6: Introduction to Boost regex

• boost::regex_iterator
    typedef regex_iterator<const char*> cregex_iterator;
    typedef regex_iterator<std::string::const_iterator> sregex_iterator;
    typedef regex_iterator<const wchar_t*> wcregex_iterator;
    typedef regex_iterator<std::wstring::const_iterator> wsregex_iterator;

• boost::regex_token_iterator
    typedef regex_token_iterator<const char*> cregex_token_iterator;
    typedef regex_token_iterator<std::string::const_iterator> sregex_token_iterator;
    typedef regex_token_iterator<const wchar_t*> wcregex_token_iterator;
    typedef regex_token_iterator<std::wstring::const_iterator> wsregex_token_iterator;

Page 7: Introduction to Boost regex

How to define a regular expression?
• boost::basic_regex constructor:
    explicit basic_regex(const basic_string<charT, ST, SA>& p, flag_type f = regex_constants::normal);
• Example (every backslash of the pattern is doubled inside a C++ string literal):
    boost::regex ip_re("^(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
                       "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
                       "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
                       "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])$");

    boost::regex credit_re("(\\d{4}[- ]){3}\\d{4}");
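As a minimal sketch of putting such a definition to work, the helper below wraps the IP address expression in a function. The name is_ipv4 is hypothetical, and std::regex from the C++11 standard library is assumed as a stand-in for Boost.Regex so that the sketch builds without Boost; the two interfaces are closely related for the calls used here.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex so the sketch builds
// without Boost. With Boost installed, include <boost/regex.hpp> and
// write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: true if s looks like a dotted-quad IPv4 address.
bool is_ipv4(const std::string& s) {
    // Each "\\" in the C++ literal becomes a single backslash in the pattern,
    // so "\\." is the regex for a literal dot.
    static const rx::regex ip_re(
        "^(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
        "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
        "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."
        "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])$");
    return rx::regex_match(s, ip_re);
}
```

Calling is_ipv4("192.168.4.1") should accept the address, while is_ipv4("256.1.1.1") should reject it because no alternative covers 256.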

Page 8: Introduction to Boost regex

• Boost.Regex supports many different ways to interpret the regular expression string. The type syntax_option_type is an implementation-specific bitmask type that controls which syntax we want to use, for example:
    static const syntax_option_type normal;
    static const syntax_option_type ECMAScript = normal;
    static const syntax_option_type JavaScript = normal;
    static const syntax_option_type JScript = normal;
    static const syntax_option_type perl = normal;
    static const syntax_option_type basic;
    static const syntax_option_type sed = basic;
    …
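To illustrate how the syntax option changes the meaning of a pattern, the sketch below compiles the same string "a+" twice: once with the default syntax, where + is a repeat operator, and once with the basic (POSIX basic) syntax, where + is assumed to be an ordinary character. The function names are hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Default syntax: "a+" means "one or more 'a' characters".
bool matches_default_syntax(const std::string& s) {
    static const rx::regex re("a+");
    return rx::regex_match(s, re);
}

// Basic syntax: '+' is not an operator, so "a+" means
// "the letter 'a' followed by a literal plus sign".
bool matches_basic_syntax(const std::string& s) {
    static const rx::regex re("a+", rx::regex::basic);
    return rx::regex_match(s, re);
}
```

With these definitions, "aaa" is accepted only by the default-syntax version and "a+" only by the basic-syntax version.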

Page 9: Introduction to Boost regex

How to do the match?
• bool boost::regex_match(…)
    template <class BidirectionalIterator, class Allocator, class charT, class traits>
    bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                     match_results<BidirectionalIterator, Allocator>& m,
                     const basic_regex<charT, traits>& e,
                     match_flag_type flags = match_default);

Page 10: Introduction to Boost regex

• What to give:
    • What is to be matched (strings, char*, or an iterator range)
    • Where the result is to be put (cmatch, smatch)
    • The regular expression defined (regex, wregex)
    • How the expression is matched (some match flags)

• Note that regex_match’s result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search.
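The difference between the two algorithms can be sketched with one pattern and two hypothetical helpers. std::regex is assumed here as a stand-in for Boost.Regex so that the code builds without Boost; regex_match and regex_search take the same arguments in both libraries for this simple form.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// True only if the WHOLE string consists of digits.
bool whole_is_number(const std::string& s) {
    static const rx::regex digits("\\d+");
    return rx::regex_match(s, digits);
}

// True if a run of digits occurs ANYWHERE in the string.
bool contains_number(const std::string& s) {
    static const rx::regex digits("\\d+");
    return rx::regex_search(s, digits);
}
```

For the input "abc123", whole_is_number returns false while contains_number returns true.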

Page 11: Introduction to Boost regex

• Sample:
    std::string credit_num("1111-2222-3333-4444");
    boost::regex credit_re("(\\d{4}[- ]){3}\\d{4}");
    boost::smatch what;
    …
    if (boost::regex_match(credit_num, what, credit_re, boost::match_default))
        …
    else
        …

Page 12: Introduction to Boost regex

Understanding Captures
• Captures are the iterator ranges that are "captured" by marked sub-expressions as a regular expression gets matched.
• Each marked sub-expression can result in more than one capture, if it is matched more than once.

Page 13: Introduction to Boost regex

Marked sub-expression
• Every time a Perl regular expression contains a parenthesis group (), it spits out an extra field, known as a marked sub-expression. For example, in the expression
    (\w+)\W+(\w+)
the first (\w+) is $1, the second (\w+) is $2, and the whole match is $&.

Page 14: Introduction to Boost regex

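In the IP address expression, each of the four parenthesis groups is a marked sub-expression:
    "^(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."    $1
    "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."     $2
    "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])\\."     $3
    "(\\d{1,2}|1\\d\\d|2[0-4]\\d|25[0-5])$"       $4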

Page 15: Introduction to Boost regex

• So if the expression (\w+)\W+(\w+) is searched for within "@abc def--":

Perl    Boost.Regex    Text found
$`      m.prefix()     "@"
$&      m[0]           "abc def"
$1      m[1]           "abc"
$2      m[2]           "def"
$'      m.suffix()     "--"
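The table can be reproduced with a short sketch. The helper name describe_match is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost; prefix(), suffix() and operator[] are used the same way in both.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns {prefix, $&, $1, $2, suffix} for the first
// match of (\w+)\W+(\w+) in text, or an empty vector if nothing matches.
std::vector<std::string> describe_match(const std::string& text) {
    static const rx::regex re("(\\w+)\\W+(\\w+)");
    std::vector<std::string> parts;
    rx::smatch m;
    if (rx::regex_search(text, m, re)) {
        parts.push_back(m.prefix().str());  // text before the match ($`)
        parts.push_back(m[0].str());        // the whole match       ($&)
        parts.push_back(m[1].str());        // first sub-expression  ($1)
        parts.push_back(m[2].str());        // second sub-expression ($2)
        parts.push_back(m.suffix().str());  // text after the match  ($')
    }
    return parts;
}
```

For the input "@abc def--" the five elements correspond to the rows of the table.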

Page 16: Introduction to Boost regex

Unmatched Sub-Expressions
• When a regular expression match is found, not all of the marked sub-expressions need to have participated in the match. For example, the expression
    (abc)|(def)
can match either $1 or $2, but never both at the same time.
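A small sketch of checking which sub-expression took part in a match, using the matched member of each sub-match. The helper name which_alternative is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns 1 if $1 participated in the match,
// 2 if $2 did, and 0 if the string did not match at all.
int which_alternative(const std::string& s) {
    static const rx::regex re("(abc)|(def)");
    rx::smatch m;
    if (!rx::regex_match(s, m, re)) {
        return 0;
    }
    // Exactly one of the two alternatives can have matched.
    return m[1].matched ? 1 : 2;
}
```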

Page 17: Introduction to Boost regex

Repeated Captures
• When a marked sub-expression is repeated, the sub-expression gets "captured" multiple times; however, normally only the final capture is available. For example, if
    (?:(\w+)\W+)+
is matched against
    one fine day.
then $1 will contain the string "day", and all the previous captures will have been forgotten. (The trailing "." matters: the final \W+ needs at least one non-word character after "day".)
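The "only the final capture survives" behaviour can be observed with a minimal sketch. The helper name last_capture is hypothetical, the input ends with a non-word character so that the last \W+ can match, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns what $1 holds after (?:(\w+)\W+)+ has been
// searched for in text, or an empty string if there is no match.
std::string last_capture(const std::string& text) {
    static const rx::regex re("(?:(\\w+)\\W+)+");
    rx::smatch m;
    if (!rx::regex_search(text, m, re)) {
        return "";
    }
    // m[1] reflects only the last repetition of the group.
    return m[1].str();
}
```

The earlier words "one" and "fine" are matched along the way but are not retrievable through m[1].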

Page 18: Introduction to Boost regex

What can we get from match_results?

• If the function “regex_match” returns true:

Element                  Value
what.size()              e.mark_count()
what.empty()             false
what.prefix().first      first
what.prefix().second     first
what.prefix().matched    false
what.suffix().first      last

Page 19: Introduction to Boost regex

what.suffix().second     last
what.suffix().matched    false
what[0].first            first
what[0].second           last
what[0].matched          true if a full match was found, and false if it was a partial match.

Page 20: Introduction to Boost regex

what[n].first      For all integers n < what.size(), the start of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
what[n].second     For all integers n < what.size(), the end of the sequence that matched sub-expression n. Alternatively, if sub-expression n did not participate in the match, then last.
what[n].matched    For all integers n < what.size(), true if sub-expression n participated in the match, false otherwise.

Page 21: Introduction to Boost regex

• Note: If the function returns false, then the effect on parameter what is undefined.

• Example:

Page 22: Introduction to Boost regex

• Method
• Use a for loop over the sub-matches.
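The for-loop method might look like the following sketch, which walks every element of the match results from index 0 (the whole match) up to size() - 1. The helper name sub_matches and the credit card pattern with four separate groups are illustrative assumptions, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns what[0], what[1], ... as strings,
// or an empty vector if the input does not match.
std::vector<std::string> sub_matches(const std::string& s) {
    static const rx::regex credit_re(
        "(\\d{4})[- ](\\d{4})[- ](\\d{4})[- ](\\d{4})");
    std::vector<std::string> out;
    rx::smatch what;
    if (rx::regex_match(s, what, credit_re)) {
        for (std::size_t i = 0; i < what.size(); ++i) {
            // str() yields an empty string for a sub-expression
            // that did not participate in the match.
            out.push_back(what[i].str());
        }
    }
    return out;
}
```

The results are only inspected after regex_match has returned true, in line with the note that they are undefined otherwise.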

Page 23: Introduction to Boost regex

What about repeated captures?
• Unfortunately, enabling this feature has an impact on performance (even if you don't use it), and a much bigger impact if you do use it. Therefore, to use this feature you need to:
    • Define BOOST_REGEX_MATCH_EXTRA for all translation units, including the library source (the best way to do this is to uncomment this define in boost/regex/user.hpp and then rebuild everything).
    • Pass the match_extra flag to the particular algorithms where you actually need the captures information (regex_search, regex_match, or regex_iterator).

Page 24: Introduction to Boost regex

• Example:
    boost::regex e("^(?:(\\w+)|(?>\\W+))*$");
    std::string text("now is the time for all good men to come to the aid of the party");
    …
    if (boost::regex_match(text, what, e, boost::match_extra))
        // get all the captures information
    else
        …

Page 25: Introduction to Boost regex

• Method
    • Find out how many repeated captures there are.
    • Get them out!

Page 26: Introduction to Boost regex
Page 27: Introduction to Boost regex

Other match flags…

• There are many match flags which control how a regular expression is matched against a character sequence.

• Some examples:

Element           Effect if set
match_not_bob     Specifies that the expressions "\A" and "\`" should not match against the sub-sequence [first,first).
match_not_eob     Specifies that the expressions "\'", "\z" and "\Z" should not match against the sub-sequence [last,last).
match_not_null    Specifies that the expression cannot be matched against an empty sequence.
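As a hedged illustration of one of these flags, the sketch below passes match_not_null to regex_match for the pattern a*, which would otherwise accept the empty string. The helper name accepts is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost; match_not_bob and match_not_eob have no direct counterpart in std::regex, so they are not shown.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: matches s against "a*" using the given match flags.
bool accepts(const std::string& s,
             rx::regex_constants::match_flag_type flags) {
    static const rx::regex re("a*");
    return rx::regex_match(s, re, flags);
}
```

With match_default the empty string is accepted because a* can match zero characters; with match_not_null that empty match is ruled out, while a non-empty input such as "aaa" is unaffected.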

Page 28: Introduction to Boost regex

Partial Matches
• The match flag match_partial can be passed to the algorithms regex_match and regex_search, and used with the iterator regex_iterator.
• When used, it indicates that partial as well as full matches should be found. A partial match is one that matched one or more characters at the end of the text input, but did not match all of the regular expression.
• Partial matches are typically used either when validating data input, or when searching texts that are too long to load into memory.
• We can use match_default | match_partial.

Page 29: Introduction to Boost regex

                 Result    M[0].matched    M[0].first                M[0].second
No match         False     Undefined       Undefined                 Undefined
Partial match    True      False           Start of partial match    End of partial match
Full match       True      True            Start of full match       End of full match

Page 30: Introduction to Boost regex

Others…
• bool boost::regex_search(…)
    template <class BidirectionalIterator, class Allocator, class charT, class traits>
    bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                      match_results<BidirectionalIterator, Allocator>& m,
                      const basic_regex<charT, traits>& e,
                      match_flag_type flags = match_default);

Page 31: Introduction to Boost regex

It’s almost the same as regex_match(). The difference is that regex_search does not require the expression to match the whole of the input sequence, like this:

    std::string regstr = "(\\d+)";
    boost::regex expression(regstr);
    std::string testString = "192.168.4.1";
    boost::smatch what;
    if (boost::regex_search(testString, expression))
    {
        std::cout << "Have digit" << std::endl;
    }

Page 32: Introduction to Boost regex

• To find every match, search repeatedly, restarting just after the previous match:
    std::string regstr = "(\\d+)";
    boost::regex expression(regstr);
    std::string testString = "192.168.4.1";
    boost::smatch what;
    std::string::const_iterator start = testString.begin();
    std::string::const_iterator end = testString.end();
    while (boost::regex_search(start, end, what, expression))
    {
        std::cout << "Have digit : ";
        std::string msg(what[1].first, what[1].second);
        std::cout << msg.c_str() << std::endl;
        start = what[0].second;
    }
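The same search loop can be packaged as a function that returns the matches instead of printing them. This is a sketch only: the name find_numbers is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: collects every run of digits found in text.
std::vector<std::string> find_numbers(const std::string& text) {
    static const rx::regex expression("(\\d+)");
    std::vector<std::string> found;
    rx::smatch what;
    std::string::const_iterator start = text.begin();
    const std::string::const_iterator end = text.end();
    while (rx::regex_search(start, end, what, expression)) {
        found.push_back(what[1].str());
        start = what[0].second;  // resume just past the previous match
    }
    return found;
}
```

For "192.168.4.1" the loop runs four times, once per octet.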

Page 33: Introduction to Boost regex

• boost::regex_replace()
The algorithm regex_replace searches through a string finding all the matches to the regular expression: for each match it then calls match_results::format to format the string and sends the result to the output iterator.

    template <class OutputIterator, class BidirectionalIterator, class traits, class charT>
    OutputIterator regex_replace(OutputIterator out,
                                 BidirectionalIterator first, BidirectionalIterator last,
                                 const basic_regex<charT, traits>& e,
                                 const basic_string<charT>& fmt,
                                 match_flag_type flags = match_default);

Page 34: Introduction to Boost regex

Example:
    static const boost::regex e("\\A(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
    const std::string machine_format("\\1\\2\\3\\4");
    const std::string human_format("\\1-\\2-\\3-\\4");
    …
    std::string machine_readable_card_number(const std::string& s)
    {
        return boost::regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
    }

    std::string human_readable_card_number(const std::string& s)
    {
        return boost::regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
    }

Page 35: Introduction to Boost regex

• Result:
    string s[2] = { "0000111122223333",
                    "0000 1111 2222 3333" };

    machine_format:
    0000111122223333
    0000111122223333

    human_format:
    0000-1111-2222-3333
    0000-1111-2222-3333
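A self-contained sketch of the card number example follows. It assumes std::regex as a stand-in for Boost.Regex so that the code builds without Boost; because std::regex has no \A and \z anchors, ^ and $ are swapped in, and the helper card_regex is hypothetical. The sed-style format strings and the format_sed flag are used as in the slide.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: the shared card number expression.
// ^ and $ replace the \A and \z anchors of the original example.
const rx::regex& card_regex() {
    static const rx::regex e(
        "^(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
    return e;
}

// Strips separators: "\1\2\3\4" in sed syntax joins the four groups.
std::string machine_readable_card_number(const std::string& s) {
    return rx::regex_replace(
        s, card_regex(), std::string("\\1\\2\\3\\4"),
        rx::regex_constants::match_default | rx::regex_constants::format_sed);
}

// Inserts dashes between the four groups.
std::string human_readable_card_number(const std::string& s) {
    return rx::regex_replace(
        s, card_regex(), std::string("\\1-\\2-\\3-\\4"),
        rx::regex_constants::match_default | rx::regex_constants::format_sed);
}
```

Input that does not match the expression is returned unchanged, since regex_replace copies unmatched text to the output.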

Page 36: Introduction to Boost regex

• boost::regex_iterator
The iterator type regex_iterator will enumerate all of the regular expression matches found in some sequence: dereferencing a regex_iterator yields a reference to a match_results object.

• Example:
    …
    boost::sregex_iterator m1(text.begin(), text.end(), expression);
    boost::sregex_iterator m2;
    std::for_each(m1, m2, &regex_callback);
    …
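Instead of std::for_each with a callback, the iterator pair can also drive an ordinary loop. The sketch below collects every word of a text; the name all_words is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns every run of word characters in text.
std::vector<std::string> all_words(const std::string& text) {
    static const rx::regex word("\\w+");
    std::vector<std::string> words;
    rx::sregex_iterator it(text.begin(), text.end(), word);
    const rx::sregex_iterator end;  // default-constructed: end of sequence
    for (; it != end; ++it) {
        // *it is a match_results object; str() is the whole match.
        words.push_back(it->str());
    }
    return words;
}
```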

Page 37: Introduction to Boost regex

• boost::regex_token_iterator
The template class regex_token_iterator is an iterator adapter; that is to say, it represents a new view of an existing iterator sequence, by enumerating all the occurrences of a regular expression within that sequence, and presenting one or more character sequences for each match found.

• regex_token_iterator is almost like regex_iterator, but it can also be used to list every piece of text that doesn’t match the regular expression, that is, the text between matches.

Page 38: Introduction to Boost regex

• Example 1:
    boost::regex re("\\s+");
    boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
    boost::sregex_token_iterator j;
    unsigned count = 0;
    while (i != j)
    {
        cout << *i++ << endl;
        count++;
    }
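Example 1 amounts to splitting a string on whitespace: the sub-match index -1 selects the text between matches rather than the matches themselves. A sketch as a reusable function follows; the name split_on_whitespace is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns the fields of s separated by whitespace.
std::vector<std::string> split_on_whitespace(const std::string& s) {
    static const rx::regex re("\\s+");
    std::vector<std::string> fields;
    // -1 asks for the text that lies BETWEEN matches of the separator.
    rx::sregex_token_iterator i(s.begin(), s.end(), re, -1);
    const rx::sregex_token_iterator j;
    for (; i != j; ++i) {
        fields.push_back(i->str());
    }
    return fields;
}
```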

Page 39: Introduction to Boost regex
Page 40: Introduction to Boost regex

• Example 2:
    boost::regex e("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
                   boost::regex::normal | boost::regbase::icase);
    …
    const int subs[] = {1, 0,};
    boost::sregex_token_iterator i(s.begin(), s.end(), e, subs);
    boost::sregex_token_iterator j;
    while (i != j)
    {
        std::cout << *i++ << std::endl;
    }
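A reduced variant of Example 2 is sketched below: it asks the token iterator for sub-expression 1 only, so each token is the URL inside an href attribute (the slide's version requests {1, 0} and therefore also receives the whole match). The name extract_links is hypothetical, and std::regex is assumed as a stand-in for Boost.Regex so that the code builds without Boost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; with Boost, include
// <boost/regex.hpp> and write "namespace rx = boost;" instead.
namespace rx = std;

// Hypothetical helper: returns the href value of every <a ...> tag in html.
std::vector<std::string> extract_links(const std::string& html) {
    // icase makes both the tag name and "href" case-insensitive.
    static const rx::regex e(
        "<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
        rx::regex::ECMAScript | rx::regex::icase);
    std::vector<std::string> links;
    // Sub-match 1 is the quoted URL captured by ([^"]*).
    rx::sregex_token_iterator i(html.begin(), html.end(), e, 1);
    const rx::sregex_token_iterator j;
    for (; i != j; ++i) {
        links.push_back(i->str());
    }
    return links;
}
```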

Page 41: Introduction to Boost regex
Page 42: Introduction to Boost regex

What’s more?
• Thread Safety
• Performance

Page 43: Introduction to Boost regex

References
• http://www.boost.org
• Beyond the C++ Standard Library: An Introduction to Boost -- Library 5.2 Usage

Page 44: Introduction to Boost regex

Thank you!