Post on 20-Dec-2015
Administrivia
• Class:
Major
Math
Linguistics
Creative Writing
Civil Engineering
MIS
Accounting
Economics
Computer Science
Psychology
German
Administrivia
• Did you bring your laptop?
Administrivia
• Did you install Perl yet?
  – ActiveState Perl
  – http://www.activestate.com
  – install free version
Regular Expressions
• regular expressions are used in string pattern-matching
  – important tool in automated searching
  – formally equivalent to finite-state automata (FSA) and regular grammars
• popular implementations
  – Unix grep
    • command line program
    • returns lines matching a regular expression
    • standard part of all Unix-based systems
      – including MacOS X (command-line interface in Terminal)
    • many shareware/freeware implementations available for Windows XP
      – just Google and see...
  – grep functionality is built into many programming languages
    • e.g. Perl
  – wildcard search in Microsoft Word
    • limited version of regular expressions (not full power)
    • with differences in notation
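As a rough illustration of what grep does, here is a minimal sketch in Python rather than Perl (Python's re module accepts the same notation for this pattern). The function name grep_lines and the sample lines are made up for the example.

```python
import re

def grep_lines(pattern, lines):
    """Hypothetical helper: return the lines in which the regexp matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample input, one string per line of a text file.
sample = [
    "the guppy swam away",
    "two guppies followed",
    "no fish here",
]

for line in grep_lines(r"gupp(y|ies)", sample):
    print(line)
```

Like grep, the sketch reports whole lines, not just the matching substring.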
Regular Expressions
• Historical note
  – grep:
    • name comes from Unix ed command g/re/p
      – “search globally for lines matching the regular expression, and print them”
      – [Source: http://en.wikipedia.org/wiki/Grep]
  – ed
    • is an obscure and difficult-to-use text edit program on Unix systems
      – doesn’t need a screen display
      – would work on an ancient teletype
Regular Expressions
• Formally
  – a regular expression (regexp) is formed from:
    • an alphabet (= set of characters)
    • operators
  – a regexp is shorthand for a set of strings
    • (possibly infinite set)
    • (strings are of finite length)
• Formally
  – a set of strings is called a language
  – a language that can be defined by a regular expression is called a regular language
    • (not all languages are regular)
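To make "a regexp is shorthand for a set of strings" concrete, the sketch below lists the members of a regular language up to a length limit. It is written in Python, not Perl, and the function name language_up_to, the pattern (ab)* and the alphabet {a, b} are choices made for this example only.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """Hypothetical helper: strings over `alphabet`, of length <= max_len,
    that belong to the language defined by `pattern`."""
    regex = re.compile(pattern)
    members = []
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            # fullmatch: the whole string must be described by the regexp
            if regex.fullmatch(candidate):
                members.append(candidate)
    return members

print(language_up_to(r"(ab)*", "ab", 4))
```

The full language of (ab)* is infinite; the length limit only cuts the listing short, and every listed string is of finite length.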
Regular Expressions
• alphabet
  – e.g. {a,b,c,...,z}
  – set of lower case English letters
  – Note: case is important
• operators
  – a     a single symbol
  – aⁿ    exactly n occurrences of a, n a positive integer
    • e.g. a³ = aaa
  – a*    zero or more occurrences of a
  – a+    one or more occurrences of a
  – concatenation
    • two regexps may be concatenated; the result is also a regexp
    • e.g. abc
  – disjunction
    • infix operator: | (vertical bar)
    • e.g. a|b
  – parentheses may be used for disambiguation
    • e.g. gupp(y|ies)
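The operators above can be tried out with a short sketch. It uses Python's re module as a stand-in for Perl (the notation for these operators is the same); the helper name whole_match and the test strings are invented for illustration.

```python
import re

def whole_match(pattern, text):
    """Hypothetical helper: True if the regexp describes the entire string."""
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"a*", ""),                 # zero occurrences of a are allowed
    (r"a*", "aaa"),
    (r"a+", ""),                 # a+ needs at least one a
    (r"a+", "aaa"),
    (r"abc", "abc"),             # concatenation of a, b and c
    (r"a|b", "b"),               # disjunction
    (r"gupp(y|ies)", "guppy"),
    (r"gupp(y|ies)", "guppies"),
    (r"guppy|ies", "guppies"),   # no parentheses: means "guppy" or "ies"
]

for pattern, text in checks:
    print(f"{pattern!r:16} {text!r:10} {whole_match(pattern, text)}")
```

The last row shows why the parentheses matter: without them the disjunction takes the whole of guppy as its left-hand side.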
Regular Expressions
• Technically, a+ is not necessary
  – aa* = a+
  – “a concatenated with a* (zero or more occurrences of a)” = “one or more occurrences of a”
• Disjunction
  – [set of characters]
  – set of characters enclosed in square brackets means match one of the characters
  – e.g. [aeiou]
    • matches any of the vowels a, e, i, o or u but not d
  – dash (-) shorthand for a range
  – e.g. [a-e]
    • matches a, b, c, d or e
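A quick way to convince yourself of these equivalences is to compare the patterns on many strings. The sketch below does that in Python (a stand-in for Perl here); it tests a finite sample only, so it is a sanity check and not a proof. The helper name accepts is made up.

```python
import re
from itertools import product

def accepts(pattern, text):
    """Hypothetical helper: True if the regexp describes the entire string."""
    return re.fullmatch(pattern, text) is not None

# Every string over {a, b} of length 0 to 4.
candidates = ["".join(p) for n in range(5) for p in product("ab", repeat=n)]
same_plus = all(accepts(r"aa*", s) == accepts(r"a+", s) for s in candidates)

letters = "abcdefghijklmnopqrstuvwxyz"
# [aeiou] behaves like the disjunction a|e|i|o|u on single letters.
same_class = all(accepts(r"[aeiou]", c) == accepts(r"a|e|i|o|u", c) for c in letters)
# Letters covered by the range shorthand [a-e].
range_members = [c for c in letters if accepts(r"[a-e]", c)]

print(same_plus, same_class, range_members)
```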
Regular Expressions
• Range defined over a computer character set
  – typically ASCII
  – originally a 7 bit character set
  – 2^7 = 128 (0-127) different characters
  – ASCII = American Standard Code for Information Interchange
  – e.g. [A-z] [0-9A-Za-z]
• http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml
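Because a range is defined over character codes, its contents can be listed by walking through the 128 ASCII codes. This is a minimal Python sketch (Python standing in for Perl); the helper name ascii_members is hypothetical.

```python
import re

def ascii_members(char_class):
    """Hypothetical helper: the ASCII characters (codes 0-127) that a
    bracketed character class matches."""
    regex = re.compile(char_class)
    return "".join(chr(code) for code in range(128) if regex.fullmatch(chr(code)))

wide = ascii_members(r"[A-z]")
alnum = ascii_members(r"[0-9A-Za-z]")
# Characters that [A-z] admits but that are neither letters nor digits.
extras = "".join(c for c in wide if c not in alnum)

print(len(wide), len(alnum), extras)
```

[A-z] runs from code 65 (A) to code 122 (z), so it also picks up the six punctuation characters that sit between Z and a in the ASCII table.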
Regular Expressions
• Perl uses the same notation as grep (textbook also uses grep notation)
• More shorthand
  – question mark (?) means the previous regexp is optional
  – e.g. colou?r
    • matches color or colour
  – metacharacters or operators like ? have a function
  – to match a question mark, escape it using a backslash (\)
  – e.g. why\?
  – ? in Microsoft Word means match any character
• More shorthand
  – period (.) stands for any character (except newline)
  – e.g. e.t
    • matches eat as well as eet
  – caret sign (^) as the first character of a range of characters [^set of characters] means don’t match any of the characters mentioned (after the caret)
  – e.g. [^aeiou]
    • any character except for one of the vowels listed
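The shorthand above can be checked with a small sketch. It is written in Python, whose re module treats ?, ., the backslash escape and [^...] the same way for these patterns; the helper name found and the test words are invented for the example.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the regexp matches somewhere in the text."""
    return re.search(pattern, text) is not None

# ? makes the preceding u optional
print(found(r"colou?r", "color"), found(r"colou?r", "colour"))
# \? is a literal question mark
print(found(r"why\?", "why?"), found(r"why\?", "why not"))
# . needs exactly one character between e and t
print(found(r"e.t", "eat"), found(r"e.t", "eet"), found(r"e.t", "et"))
# [^aeiou] needs at least one character that is not a listed vowel
print(found(r"[^aeiou]", "aeiou"), found(r"[^aeiou]", "audio"))
```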
Regular Expressions
• Text files in Unix consist of sequences of lines separated by a newline character (LF = line feed)
• Typically, text files are read a line at a time by programs
• Matching in Perl and grep is line-oriented (can be changed in Perl)
• Differences in platforms for line breaking:
  – Unix: LF
  – Windows (DOS): CR LF
  – classic MacOS (before MacOS X): CR
  – MacOS X: LF, as on other Unix-based systems
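A program that reads text from several platforms has to cope with all three line-break conventions. The sketch below splits a string into lines with a single regexp; it is in Python (standing in for Perl), and the function name split_lines and the sample text are assumptions made for the example.

```python
import re

def split_lines(text):
    """Hypothetical helper: split text on LF, CR LF or CR line breaks."""
    # CR LF is listed first so that it counts as one break, not two.
    return re.split(r"\r\n|\r|\n", text)

# Made-up sample mixing the three conventions.
mixed = "lf line\ncrlf line\r\ncr line\rlast line"
print(split_lines(mixed))
```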
Regular Expressions
• Line-oriented metacharacters:
  – caret (^) at the beginning of a regexp string matches the “beginning of a line”
  – e.g. ^The
    • matches lines beginning with the sequence The
  – Note: the caret is very overloaded...
    • [^ab]
    • a^b
  – dollar sign ($) at the end of a regexp string matches the “end of the line”
  – e.g. end\.$
    • matches lines ending in the sequence end.
  – e.g. ^$
    • matches blank lines only
  – e.g. ^ $
    • matches lines consisting of exactly one space
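The line anchors can be seen at work on a handful of lines. This is a minimal Python sketch (Python standing in for Perl and grep); the helper name select and the sample lines are made up, and each string plays the role of one line with its line break already removed.

```python
import re

# Made-up sample lines, including an empty line and a single-space line.
lines = ["The end.", "In the end", "", " ", "a b", "At The end."]

def select(pattern):
    """Hypothetical helper: the sample lines in which the regexp matches."""
    return [line for line in lines if re.search(pattern, line)]

print(select(r"^The"))     # lines beginning with The
print(select(r"end\.$"))   # lines ending in end.
print(select(r"^$"))       # blank lines
print(select(r"^ $"))      # lines that are exactly one space
```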
Regular Expressions
• Word-oriented metacharacters:
  – a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z]
  – (historical reasons for this)
  – \b matches a word boundary, e.g. a space or beginning or end of a line or a non-word character
  – e.g. the
    • matches the, they, breathe and other
    • but \bthe will only match the and they
    • the\b will match the and breathe
    • \bthe\b will only match the
    • (in grep, \< and \> can also be used to match the beginning and end of a word)
  – e.g. \b99
    • matches 99 but not 299
    • also matches $99
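The word-boundary examples above can be reproduced with a short sketch. It uses Python's re module in place of Perl (both support \b); the helper name hits is hypothetical.

```python
import re

words = ["the", "they", "breathe", "other"]

def hits(pattern, candidates):
    """Hypothetical helper: the candidates in which the regexp matches somewhere."""
    return [c for c in candidates if re.search(pattern, c)]

print(hits(r"the", words))        # no boundary required
print(hits(r"\bthe", words))      # the must start a word
print(hits(r"the\b", words))      # the must end a word
print(hits(r"\bthe\b", words))    # the must be a whole word
print(hits(r"\b99", ["99", "299", "$99"]))
```

$99 is matched because $ is not a word character, so there is a word boundary between $ and the first 9.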
Regular Expressions
• Range abbreviations:
  – \d (digit) = [0-9]
  – \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
  – \w (word character) = [0-9a-zA-Z_]
• uppercase versions denote negation
  – e.g. \W means a non-word character
  – \D means a non-digit
• Repetition abbreviations:
  – a?      a optional
  – a*      zero or more a’s
  – a+      one or more a’s
  – a{n,m}  between n and m a’s
  – a{n,}   at least n a’s
  – a{n}    exactly n a’s
  – e.g. \d{7,}
    • matches numbers with at least 7 digits
  – e.g. \d{3}-\d{4}
    • matches 7 digit telephone numbers with a separating dash
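The abbreviations above can be tried out as follows. The sketch is in Python, standing in for Perl; the re.ASCII flag is passed so that \d, \w and \s are limited to the ASCII sets listed on the slide. The helper name whole and the sample numbers are made up.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True if the regexp describes the entire string.
    re.ASCII restricts \\d, \\w and \\s to their ASCII definitions."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# at least 7 digits
print(whole(r"\d{7,}", "1234567"), whole(r"\d{7,}", "123456"))
# 3 digits, a dash, 4 digits
print(whole(r"\d{3}-\d{4}", "555-1234"), whole(r"\d{3}-\d{4}", "5551234"))
# how many a's does a{2,3} accept?
print([n for n in range(6) if whole(r"a{2,3}", "a" * n)])
# underscore is a word character, so \W rejects it
print(whole(r"\w+", "perl_5"), whole(r"\W", "_"), whole(r"\s", "\t"), whole(r"\D", "7"))
```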
Reading
• Perl Quick Intro
  – http://perldoc.perl.org/perlintro.html
• Perl Regular Expressions (RE)
  – perlrequick - Perl regular expressions quick start
    • http://perldoc.perl.org/perlrequick.html
  – perlretut - Perl regular expressions tutorial
    • http://perldoc.perl.org/perlretut.html