Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

35
Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012

Transcript of Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Page 1: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Introduction to Regular Expressions

Ben Brumfield

THATCamp DigitalFrontiers 2012

Page 2: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

What are Regular Expressions?

• Very small language for describing text.

• Not a programming language.

• Incredibly powerful tool for search/replace operations.

• Arcane art.

• Ubiquitous.

Page 3: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Why Use Regular Expressions?

• Finding every instance of a string in a file – i.e. every mention of “chickens” in a farm diary

• How many times does “sing” appear in a text in all tenses and conjugations?

• Reformatting dirty data• Validating input.• Command line work – listing files,

grepping log files

Page 4: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.
Page 5: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

The Basics

• A regex is a pattern enclosed within delimiters.

• Most characters match themselves.

• /THATCamp/ is a regular expression that matches “THATCamp”.– Slash is the delimiter enclosing the

expression.– “THATCamp” is the pattern.

Page 6: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/at/

• Matches strings with “a” followed by “t”.

at hat

that atlas

aft Athens

Page 7: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/at/

• Matches strings with “a” followed by “t”.

at hat

that atlas

aft Athens

Page 8: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Some Theory

• Finite State Machine for the regex /at/

Page 9: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Characters

• Matching is case sensitive.

• Special characters: ( ) ^ $ { } [ ] \ | . + ? *

• To match a special character in your text, precede it with \ in your pattern:– /snarky [sic]/ does not match “snarky [sic]”– /snarky \[sic\]/ matches “snarky [sic]”

• Regular expressions can support Unicode.

Page 10: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Character Classes

• Characters within [ ] are choices for a single-character match.

• Think of a set operation, or a type of or.

• Order within the set is unimportant.

• /x[01]/ matches “x0” and “x1”.

• /[10][23]/ matches “02”, “03”, “12” and “13”.

• Initial^ negates the class: – /[^45]/ matches all characters except 4 or 5.

Page 11: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/[ch]at/

• Matches strings with “c” or “h”, followed by “a”, followed by “t”.

that at

chat cat

fat phat

Page 12: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/[ch]at/

• Matches strings with “c” or “h”, followed by “a”, followed by “t”.

that at

chat cat

fat phat

Page 13: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Ranges

• Ranges define sets of characters within a class.– /[1-9]/ matches any non-zero digit.– /[a-zA-Z]/ matches any letter.– /[12][0-9]/ matches numbers between 10 and

29.

Page 14: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Shortcuts

Shortcut Name Equivalent Class

\d digit [0-9]

\D not digit [^0-9]

\w word [a-zA-Z0-9_]

\W not word [^a-zA-Z0-9_]

\s space [\t\n\r\f\v ]

\S not space [^\t\n\r\f\v ]

. everything [^\n] (depends on mode)

Page 15: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/\d\d\d[- ]\d\d\d\d/

• Matches strings with:– Three digits– Space or dash– Four digits

501-1234 234 1252

652.2648 713-342-7452

PE6-5000 653-6464x256

Page 16: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

/\d\d\d[- ]\d\d\d\d/

• Matches strings with:– Three digits– Space or dash– Four digits

501-1234 234 1252

652.2648 713-342-7452

PE6-5000 653-6464x256

Page 17: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Repeaters

• Symbols indicating that the preceding element of the pattern can repeat.

• /runs?/ matches runs or run

• /1\d*/ matches any number beginning with “1”.

Repeater Count

? zero or one

+ one or more

* zero or more

{n} exactly n

{n,m} between n and m times

{,m} no more than m times

{n,} at least n times

Page 18: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Repeaters

Strings:

1: “at” 2: “art”

3: “arrrrt” 4: “aft”

Patterns:

A: /ar?t/B: /a[fr]?t/

C: /ar*t/ D: /ar+t/

E: /a.*t/ F: /a.+t/

Repeater Count

? zero or one

+ one or more

* zero or more

{n} exactly n

{n,m} between n and m times

{,m} no more than m times

{n,} at least n times

Page 19: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Repeaters

• /ar?t/ matches “at” and “art” but not “arrrt”.

• /a[fr]?t/ matches “at”, “art”, and “aft”.

• /ar*t/ matches “at”, “art”, and “arrrrt”

• /ar+t/ matches “art” and “arrrt” but not “at”.

• /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’.

Page 20: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Lab Session I

• http://gskinner.com/RegExr/

• https://gist.github.com/922838

• Match the titles “Mr.” and “Ms.”.

• Find all conjugations and tenses of “sing”.

• Find all places where more than one space follows punctuation. (Add a space if necessary.)

Page 21: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Lab Reference

Repeater Count

? zero or one

+ one or more

* zero or more

{n} exactly n times

{n,m} between n and m times

{,m} no more than m times

{n,} at least n times

Shortcut Name

\d digit

\D not digit

\w word

\W not word

\s space

\S not space

. everything

Page 22: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Anchors

• Anchors match between characters.

• Used to assert that the characters you’re matching must appear in a certain place.

• /\bat\b/ matches “at work” but not “batch”.

Anchor Matches

^ start of line

$ end of line

\b word boundary

\B not boundary

\A start of string

\Z end of string

\z raw end of string (rare)

Page 23: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Alternation

• In Regex, | means “or”.

• You can put a full expression on the left and another full expression on the right.

• Either can match.

• /seeks?|sought/ matches “seek”, “seeks”, or “sought”.

Page 24: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Grouping

• Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation.

• The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”.

• /schema(ta)?/ matches “schema” and “schemata” but not “schematic”.

Page 25: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Grouping Example

• What regular expression matches “eat”, “eats”, “ate” and “eaten”?

Page 26: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Grouping Example

• What regular expression matches “eat”, “eats”, “ate” and “eaten”?

• /eat(s|en)?|ate/

• Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/

Page 27: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Replacement

• Regex most often used for search/replace

• Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/ .

• s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”.

• s/\bsheeps\b/sheep/ converts – “sheepskin is made from sheeps” to– “sheepskin is made from sheep”

Page 28: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Capture

• During searches, ( … ) groups capture patterns for use in replacement.

• Special variables $1, $2, $3 etc. contain the capture.

• /(\d\d\d)-(\d\d\d\d)/ “123-4567”– $1 contains “123”– $2 contains “4567”

Page 29: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Capture

• How do you convert – “Smith, James” and “Jones, Sally” to – “James Smith” and “Sally Jones”?

Page 30: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Capture

• How do you convert – “Smith, James” and “Jones, Sally” to – “James Smith” and “Sally Jones”?

• s/(\w+), (\w+)/$2 $1/

Page 31: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Capture

• Given a file containing URLs, create a script that wgets each URL:– http://bit.ly/DHapiTRANSCRIBE

• becomes:

– wget “http://bit.ly/DHapiTRANSCRIBE”

Page 32: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Capture

• Given a file containing URLs, create a script that wgets each URL:– http://bit.ly/DHapiTRANSCRIBE

• becomes

– wget “http://bit.ly/DHapiTRANSCRIBE”

• s/^(.*)$/wget “$1”/

Page 33: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Lab Session II

• Convert all Miss and Mrs. to Ms.

• Convert infinitives to gerunds – “to sing” -> “singing”

• Extract last name, first name from (title first name last name)– Dr. Thelma Dunn– Mr. Clay Shirky– Dana Gray

Page 34: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Caveats

• Do not use regular expressions to parse (complicated) XML!

• Check the language/application-specific documentation: some common shortcuts are not universal.

Page 35: Introduction to Regular Expressions Ben Brumfield THATCamp DigitalFrontiers 2012.

Acknowledgments

• James Edward Gray II and Dana Gray– Much of the structure and some of the

wording of this presentation comes from– http://www.slideshare.net/JamesEdwardGrayII

/regular-expressions-7337223