REGEX. Problems Have big text file, want to extract data – Phone numbers 1-503-123-1234...

REGEX

Problems

• Have big text file, want to extract data
  – Phone numbers
    • 1-503-123-1234
    • 503-123-1234
    • (503) 123-1234
    • 123-1234
    • 503.123.1234

Regular Expressions

• Regular Expressions
  – Format for specifying patterns

• Pattern consists of
  – Literals
  – Ranges
  – Special values
  – Quantity indicators
  – Groupings

Literals

• Characters without special meaning are interpreted literally

1 Look for 1

123 Look for 123

12A Look for 12A
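A minimal C++ sketch of literal matching, assuming the C++11 `<regex>` header that the last slide mentions. The function name `contains_12A` is illustrative and not from the slides.

```cpp
#include <regex>
#include <string>

// Returns true if `text` contains the literal sequence "12A" anywhere.
// None of the three characters has a special meaning, so each one
// must appear exactly as written.
bool contains_12A(const std::string& text) {
    static const std::regex pattern("12A");
    return std::regex_search(text, pattern);
}
```

`std::regex_search` looks for the pattern anywhere in the text, so `contains_12A("xx12Ayy")` is true, while `contains_12A("12a")` is false because matching is case-sensitive by default.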

Ranges

• [ ] enclose a group of options

[123] Look for 1, 2, or 3

[AB] Look for A or B

2[BC] Look for 2 followed by B or C

Ranges

• [a-b] indicates a range

[0-9] Look for 0-9

[1-3] Look for 1-3

[a-zA-Z] Look for a lowercase (a-z) or uppercase (A-Z) letter

[0-9A-Z] Look for a digit or an uppercase letter
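The bracket examples from these two slides can be tried out as follows. This is a sketch assuming C++11 `<regex>`; both function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// True if the whole string is '2' followed by 'B' or 'C' (pattern 2[BC]).
bool is_2_then_B_or_C(const std::string& s) {
    static const std::regex pattern("2[BC]");
    return std::regex_match(s, pattern);
}

// True if the whole string is exactly one digit or one uppercase letter
// (pattern [0-9A-Z]).
bool is_digit_or_upper(const std::string& s) {
    static const std::regex pattern("[0-9A-Z]");
    return std::regex_match(s, pattern);
}
```

Unlike `std::regex_search`, `std::regex_match` succeeds only if the pattern covers the entire string, which is why "2BC" and "77" are rejected here. A bracket expression always stands for exactly one character.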

Ranges

• [^ ] says not any of these

[^123] Look for anything but 1,2,3

AA[^A] Look for 2 A's followed by anything not an A
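A short sketch of the negated set, assuming C++11 `<regex>`; the function name is hypothetical. Note that `[^A]` still has to consume one character, so "AA" at the very end of the text does not match.

```cpp
#include <regex>
#include <string>

// True if `text` contains two A's followed by one character that is not 'A'.
bool has_AA_then_non_A(const std::string& text) {
    static const std::regex pattern("AA[^A]");
    return std::regex_search(text, pattern);
}
```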

Special Characters

• . Means any character but newline

A.C Matches ABC, ADC, A_C, A+C…
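The `A.C` example can be checked with a sketch like this one, assuming C++11 `<regex>` with its default ECMAScript grammar; the function name is illustrative.

```cpp
#include <regex>
#include <string>

// True if the whole string is 'A', then any single character except a
// line break, then 'C'.
bool matches_A_any_C(const std::string& s) {
    static const std::regex pattern("A.C");
    return std::regex_match(s, pattern);
}
```

The dot stands for exactly one character, so "AC" (nothing in between) and "ABBC" (two characters in between) are both rejected.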

Special Characters

• ^ at start means nothing can come before the pattern

• $ at end means nothing else can come after it
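A sketch of the two anchors using the literal `abc` as a stand-in pattern, assuming C++11 `<regex>`. The function names and the choice of `abc` are illustrative only.

```cpp
#include <regex>
#include <string>

// True if `s` begins with "abc" (nothing may come before it).
bool starts_with_abc(const std::string& s) {
    static const std::regex pattern("^abc");
    return std::regex_search(s, pattern);
}

// True if `s` ends with "abc" (nothing may come after it).
bool ends_with_abc(const std::string& s) {
    static const std::regex pattern("abc$");
    return std::regex_search(s, pattern);
}

// Both anchors together: the string must be exactly "abc".
bool is_exactly_abc(const std::string& s) {
    static const std::regex pattern("^abc$");
    return std::regex_search(s, pattern);
}
```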

Special Characters

• \s any whitespace
  – Tab, space, etc…

• \d any digit
  – Same as [0-9]

• \w any word character
  – Same as [a-zA-Z0-9_]

Special Characters

• \S anything BUT whitespace

• \D anything BUT digit

• \W anything BUT word character
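One way to get a feel for the shorthand classes and their uppercase opposites is to count how many characters of a sample string each one matches. The helper below is a hypothetical sketch assuming C++11 `<regex>`. In the ECMAScript grammar that `std::regex` uses by default, `\w` matches letters, digits, and the underscore.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    auto first = std::sregex_iterator(text.begin(), text.end(), re);
    auto last = std::sregex_iterator();
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the 7-character string `"a1 b22_"`, `count_matches(text, R"(\d)")` gives 3 and `count_matches(text, R"(\D)")` gives 4; each lowercase class and its uppercase opposite together account for every character.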

Quantity Indicators

• {n} Must have n copies of whatever came before

\d{5} Match 5 digits

A{3}B Match 3 A's followed by a B
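Both `{n}` examples can be tested with whole-string matching. This is a sketch assuming C++11 `<regex>`; the function names are illustrative.

```cpp
#include <regex>
#include <string>

// True if the whole string is exactly five digits (pattern \d{5}).
bool is_five_digits(const std::string& s) {
    static const std::regex pattern(R"(\d{5})");
    return std::regex_match(s, pattern);
}

// True if the whole string is exactly three A's followed by a B (A{3}B).
bool is_AAAB(const std::string& s) {
    static const std::regex pattern("A{3}B");
    return std::regex_match(s, pattern);
}
```

The raw string literal `R"(...)"` avoids having to double every backslash in the C++ source. With `std::regex_search` instead of `std::regex_match`, "123456" would also be accepted, because it contains five digits in a row.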

Quantity Indicators

• {n,m} n to m copies

\d{2,5} Match 2 to 5 digits

• {n,} n or more copies

\d{3,} Match any sequence of 3 or more digits
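Range quantifiers are greedy by default: they take as many copies as they are allowed. The hypothetical helper below (a sketch assuming C++11 `<regex>`) returns the first match so the effect is visible.

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text`,
// or an empty string if there is no match.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str();
    }
    return std::string();
}
```

With the pattern `\d{2,5}`, the text "id 1234567" yields "12345" (the upper bound of 5 is reached), and "a1b22" yields "22" because the lone "1" falls short of the minimum of 2.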

Quantity Indicators

• ? indicates 0 or 1

• + indicates 1 or more

• * indicates 0 or more

A?B+C* could be:

BBBB, AB, ABBBC, ABCCCCC, B, BCCCC,…
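The listed strings can be verified against `A?B+C*` with a sketch like this (assuming C++11 `<regex>`; the function name is illustrative).

```cpp
#include <regex>
#include <string>

// True if the whole string is: an optional A, then one or more B's,
// then zero or more C's.
bool matches_optA_Bs_Cs(const std::string& s) {
    static const std::regex pattern("A?B+C*");
    return std::regex_match(s, pattern);
}
```

Only the B is mandatory, so "AC" and the empty string are rejected, and "AAB" fails because `?` allows at most one A.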

\

• \ to escape chars

\[ Find a [

\. Find a .

\\ Find a \
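In C++ source the backslash is also the escape character of ordinary string literals, so the regex `\.` has to be written either as `"\\."` or as the raw string `R"(\.)"`. A sketch of the three escapes, assuming C++11 `<regex>`, with illustrative function names:

```cpp
#include <regex>
#include <string>

// True if `text` contains a literal '.' (unescaped, '.' would match anything).
bool contains_literal_dot(const std::string& text) {
    static const std::regex pattern(R"(\.)");  // same regex as "\\."
    return std::regex_search(text, pattern);
}

// True if `text` contains a literal '['.
bool contains_open_bracket(const std::string& text) {
    static const std::regex pattern(R"(\[)");
    return std::regex_search(text, pattern);
}

// True if `text` contains a literal backslash.
bool contains_backslash(const std::string& text) {
    static const std::regex pattern(R"(\\)");  // same regex as "\\\\"
    return std::regex_search(text, pattern);
}
```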

Grouping

• ( ) groups sequences
  – Apply options to whole group
  – Can extract each group from results
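Both uses of parentheses can be sketched as follows, assuming C++11 `<regex>`. The function names and the `(AB){2}` pattern are illustrative and not from the slides.

```cpp
#include <regex>
#include <string>
#include <utility>

// Quantifier applied to a whole group: (AB){2} means "AB" twice.
bool is_repeated_AB(const std::string& s) {
    static const std::regex pattern("(AB){2}");
    return std::regex_match(s, pattern);
}

// Extracting groups: splits "123-1234" into its 3-digit and 4-digit parts.
// Returns two empty strings if the input does not have that shape.
std::pair<std::string, std::string> split_local_number(const std::string& s) {
    static const std::regex pattern(R"((\d{3})-(\d{4}))");
    std::smatch m;
    if (std::regex_match(s, m, pattern)) {
        // m[0] is the whole match; m[1] and m[2] are the two groups.
        return {m[1].str(), m[2].str()};
    }
    return {std::string(), std::string()};
}
```

Groups are numbered from 1 in the order of their opening parentheses, so `split_local_number("123-1234")` returns the pair ("123", "1234").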

|

• | gives multiple options (match the pattern on its left or the one on its right)
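A sketch of alternation, on its own and inside a group, assuming C++11 `<regex>`. The example words and function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// True if the whole string is "cat" or "dog".
bool is_cat_or_dog(const std::string& s) {
    static const std::regex pattern("cat|dog");
    return std::regex_match(s, pattern);
}

// Parentheses limit the alternation to one position: "gray" or "grey".
bool is_gray(const std::string& s) {
    static const std::regex pattern("gr(a|e)y");
    return std::regex_match(s, pattern);
}
```

Without the parentheses, `gra|ey` would mean "gra" or "ey", because `|` splits everything on either side of it.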

Testing

http://www.debuggex.com/

QT Creator:

In C++

• Part of C++11 (the <regex> header)
  – Only partially implemented in GCC releases before 4.9
  – Available in the Boost.Xpressive library
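Putting the pieces together for the phone-number problem from the first slides, here is one possible sketch using C++11 `<regex>`. The pattern is assembled for this example from the constructs covered above and is not taken from the slides; the function name is hypothetical. It does no boundary checking, so it can also pick up phone-shaped fragments inside longer digit runs.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every substring of `text` that looks like one of the five
// formats from the Problems slide:
//   1-503-123-1234, 503-123-1234, (503) 123-1234, 123-1234, 503.123.1234
std::vector<std::string> extract_phone_numbers(const std::string& text) {
    // (1-)?                        optional leading "1-"
    // (\(\d{3}\) |\d{3}[-.])?      optional area code: "(503) " or "503-" / "503."
    // \d{3}[-.]\d{4}               local number: "123-1234" or "123.1234"
    static const std::regex pattern(
        R"((1-)?(\(\d{3}\) |\d{3}[-.])?\d{3}[-.]\d{4})");

    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

`std::sregex_iterator` walks through all non-overlapping matches from left to right, which suits the "big text file" use case: read the file into a string and pass it in.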