REGEX. Problems Have big text file, want to extract data – Phone numbers 1-503-123-1234...

Post on 13-Jan-2016

REGEX

Problems

• Have big text file, want to extract data
  – Phone numbers
    • 1-503-123-1234
    • 503-123-1234
    • (503) 123-1234
    • 123-1234
    • 503.123.1234

Regular Expressions

• Regular Expressions
  – Format for specifying patterns

• Pattern consists of
  – Literals
  – Ranges
  – Special values
  – Quantity indicators
  – Groupings

Literals

• Characters without special meaning are interpreted literally

1 Look for 1

123 Look for 123

12A Look for 12A

Ranges

• [ ] enclose a group of options

[123] Look for 1, 2, or 3

[AB] Look for A or B

2[BC] Look for 2 followed by B or C
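A minimal sketch of a bracketed set in C++, assuming std::regex from <regex> with its default ECMAScript grammar. The function name is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` contains a '2' immediately
// followed by either 'B' or 'C' (the pattern 2[BC]).
bool has_2_then_b_or_c(const std::string& text) {
    static const std::regex pattern("2[BC]");
    return std::regex_search(text, pattern);  // search anywhere in text
}
```

With this sketch, "x2Cy" is accepted, while "2A" and "B2" are not.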

Ranges

• [a-b] indicates a range

[0-9] Look for 0-9

[1-3] Look for 1-3

[a-zA-Z] Look for lowercase a-z or uppercase A-Z

[0-9A-Z] digit or uppercase letter
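To see which characters a range picks out, one option is to collect every match in a string. The sketch below assumes std::regex with std::sregex_iterator; find_all is a hypothetical helper, not a library function.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern`
// in `text`, in left-to-right order.
std::vector<std::string> find_all(const std::string& text,
                                  const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}
```

For example, find_all("a1-B7-z3", "[1-3]") collects "1" and "3" but skips the 7, and find_all("ab-C9", "[0-9A-Z]") collects "C" and "9".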

Ranges

• [^ ] says not any of these

[^123] Look for anything but 1,2,3

AA[^A] Look for 2 A's followed by anything not an A
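A negated set still has to consume one character, so AA[^A] needs something after the two A's. A small sketch, assuming std::regex and a hypothetical helper name:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: index where the first match of AA[^A] starts,
// or -1 if the text contains no match.
std::ptrdiff_t find_aa_then_other(const std::string& text) {
    static const std::regex pattern("AA[^A]");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) return -1;
    return m.position(0);  // offset of the whole match
}
```

In "AAAB" the match is "AAB" starting at index 1, because the first two A's are followed by another A. Plain "AA" gives -1: there is no third character for [^A] to match.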

Special Characters

• . Means any character but newline

A.C Matches ABC, ADC, A_C, A+C…
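The dot stands for exactly one character. A sketch that checks whole strings against A.C, assuming std::regex_match (which requires the entire string to match) and a made-up function name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole string is 'A', then any one
// character other than a newline, then 'C'.
bool matches_a_dot_c(const std::string& text) {
    static const std::regex pattern("A.C");
    return std::regex_match(text, pattern);
}
```

"ABC", "A_C" and "A+C" pass. "AC" fails because nothing sits between the letters, and "A\nC" fails because the dot does not match a newline.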

Special Characters

• ^ at start means nothing can be before
• $ at end means nothing else after
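The effect of the anchors shows up when the same pattern is searched with and without them. A sketch assuming std::regex_search; both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if three digits in a row appear anywhere.
bool contains_three_digits(const std::string& text) {
    static const std::regex pattern("[0-9][0-9][0-9]");
    return std::regex_search(text, pattern);
}

// Hypothetical helper: true only if the text is exactly three digits,
// because ^ and $ pin the match to the start and end of the text.
bool is_exactly_three_digits(const std::string& text) {
    static const std::regex pattern("^[0-9][0-9][0-9]$");
    return std::regex_search(text, pattern);
}
```

"ab123cd" contains three digits but is not exactly three digits; "123" satisfies both checks.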

Special Characters

• \s any whitespace
  – Tab, space, etc…

• \d any digit
  – Same as [0-9]

• \w any word character
  – Same as [a-zA-Z0-9_]

Special Characters

• \S anything BUT whitespace
• \D anything BUT digit
• \W anything BUT word character
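The negated shorthands are handy for cleaning input. The sketch below assumes std::regex_replace and raw string literals (so the backslashes do not need doubling); the helper names are made up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: deletes every character that is not a digit.
std::string strip_non_digits(const std::string& text) {
    static const std::regex non_digit(R"(\D)");
    return std::regex_replace(text, non_digit, "");
}

// Hypothetical helper: true if the text is non-empty and contains no
// non-word character, i.e. only letters, digits and underscores.
bool is_word(const std::string& text) {
    static const std::regex non_word(R"(\W)");
    return !text.empty() && !std::regex_search(text, non_word);
}
```

strip_non_digits("(503) 123-1234") leaves "5031231234". is_word accepts "user_42" but rejects "user-42", since the hyphen is not a word character.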

Quantity Indicators

• {n} Must have n copies of whatever came before

\d{5} Match 5 digits

A{3}B Match 3 A's followed by a B

Quantity Indicators

• {n,m} n to m copies

\d{2,5} Match 2 to 5 digits

• {n,} n or more copies

\d{3,} Match any sequence of 3 or more digits
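Counted quantifiers take as many copies as they can, up to the upper bound. A sketch assuming std::regex_search with std::smatch; first_match is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match of `pattern` in `text`,
// or an empty string if nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With \d{2,5}, the text "a1234567" yields "12345": the match stops at five digits even though more follow. With \d{3,}, "12 3456 78" yields "3456", because the runs "12" and "78" are too short.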

Quantity Indicators

• ? indicates 0 or 1
• + indicates 1 or more
• * indicates 0 or more

A?B+C* could be:

BBBB, AB, ABBBC, ABCCCCC, B, BCCCC,…
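The listed strings can be checked against A?B+C* directly. A sketch assuming std::regex_match, with a made-up function name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole string is an optional 'A',
// then one or more 'B', then zero or more 'C'.
bool matches_a_b_c(const std::string& text) {
    static const std::regex pattern("A?B+C*");
    return std::regex_match(text, pattern);
}
```

All six strings above pass. "AC" fails because at least one B is required, and "AAB" fails because at most one A is allowed.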

\

• \ to escape chars

\[ Find a [

\. Find a .

\\ Find a \
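In C++ source the backslash is also the string-literal escape character, so a regex \. has to be written "\\." or, more readably, as a raw string literal R"(\.)". A sketch assuming std::regex; both helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only for the exact text 3.14, because the
// escaped dot matches a literal '.' rather than any character.
bool is_literal_3_dot_14(const std::string& text) {
    static const std::regex pattern(R"(3\.14)");  // same as "3\\.14"
    return std::regex_match(text, pattern);
}

// Hypothetical helper: true if the text contains a backslash.
bool contains_backslash(const std::string& text) {
    static const std::regex pattern(R"(\\)");  // regex \\ is one literal '\'
    return std::regex_search(text, pattern);
}
```

is_literal_3_dot_14 accepts "3.14" and rejects "3x14", which the unescaped pattern 3.14 would have accepted.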

Grouping

• ( ) groups sequences
  – Apply options to whole group
  – Can extract each group from results
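Both uses of parentheses can be sketched with std::regex: a quantifier applied to a whole group, and groups read back from a std::smatch (index 0 is the whole match, 1 and up are the groups). The helper names are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: true if the text is "AB" repeated one or more
// times; the + applies to the whole group, not just the 'B'.
bool is_repeated_ab(const std::string& text) {
    static const std::regex pattern("(AB)+");
    return std::regex_match(text, pattern);
}

// Hypothetical helper: finds the first NNN-NNNN number in `text` and
// returns its two parts; both strings are empty when nothing matches.
std::pair<std::string, std::string> split_local_number(const std::string& text) {
    static const std::regex pattern(R"((\d{3})-(\d{4}))");
    std::smatch m;
    if (!std::regex_search(text, m, pattern)) return {};
    return {m[1].str(), m[2].str()};
}
```

split_local_number("call 123-1234 now") returns the pair ("123", "1234").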

|

• | gives multiple options
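A short sketch of alternation, assuming std::regex. On its own, | splits the whole pattern; inside parentheses it splits only the group. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the text contains "cat" or "dog".
bool mentions_pet(const std::string& text) {
    static const std::regex pattern("cat|dog");
    return std::regex_search(text, pattern);
}

// Hypothetical helper: true for either spelling, "gray" or "grey".
bool is_gray(const std::string& text) {
    static const std::regex pattern("gr(a|e)y");
    return std::regex_match(text, pattern);
}
```

is_gray accepts "gray" and "grey" but not "groy" or "graey", since the group stands for exactly one of its options.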

Testing

http://www.debuggex.com/

QT Creator:

In C++

• Part of C++11
  – Only partially implemented in current GCC
  – Available in Boost.Xpressive library
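Putting the pieces together, the phone-number problem from the start could be approached roughly as follows. This is a sketch, not a definitive pattern: it assumes a standard library with a working <regex> (GCC 4.9 or later, for example), covers only the five formats listed under Problems, and also lets mixed separators such as 503-123.1234 through. The function name is made up.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every phone number found in `text`.
std::vector<std::string> extract_phone_numbers(const std::string& text) {
    // Optional "1-", then an optional area code written as "(503) ",
    // "503-" or "503.", then three digits, a separator, four digits.
    // (?: ) groups without capturing.
    static const std::regex phone(
        R"((?:1-)?(?:\(\d{3}\) |\d{3}[-.])?\d{3}[-.]\d{4})");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), phone), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

Called on "Home: (503) 123-1234, work: 503.123.1234, old: 123-1234." this collects the three numbers in order, each in its original formatting.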