Regex Intro
Uploaded by: jason-noble
![Page 1: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/1.jpg)
^[Rr]egular [Ee]xpressions$
Introduction
![Page 2: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/2.jpg)
Vocabulary
• Regular expression / Regex / Regexp
  – Regex is pronounced Reg (as in register) Ex (as in FedEx)
• Matching
  – Saying a regex “matches a string” means it matches somewhere in the string, not necessarily the whole string
![Page 3: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/3.jpg)
Regular Expressions
• Composed of two types of characters
  – Metacharacters / Special characters
    • * ? ^ $ . [ ]
  – Literal characters
    • a b c d
![Page 4: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/4.jpg)
Egrep tool
• Allows you to use Regular Expressions to find words that match
• Available for Macs, PCs and Linux
• cat /usr/share/dict/words | egrep '…'
• See http://regex.info/egrep.html if you don’t have it preinstalled
![Page 5: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/5.jpg)
My first regex
• cat /usr/share/dict/words | egrep 'cat'
  – Matches any word containing a ‘c’ followed by an ‘a’ followed by a ‘t’
    • bobcat
    • cat
    • catwalk
    • scatter
• Simple regex, uses only literal characters
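The egrep pipeline above can be approximated in Python. This is a rough sketch rather than egrep itself: the `egrep` helper and the short `words` list are stand-ins invented for illustration (the real /usr/share/dict/words is far longer). Python's `re.search` is used because, like egrep, it reports a match anywhere in the line.

```python
import re

def egrep(pattern, lines):
    """Return the lines in which `pattern` matches anywhere (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# A tiny illustrative stand-in for /usr/share/dict/words.
words = ["bobcat", "cat", "catwalk", "scatter", "dog", "act", "coat"]

print(egrep("cat", words))  # ['bobcat', 'cat', 'catwalk', 'scatter']
```

Note that ‘act’ and ‘coat’ contain the same letters but not the sequence c, a, t, so they are not reported.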
![Page 6: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/6.jpg)
Metacharacters: ^ and $
• ^ matches the beginning of a line
• $ matches the end of a line
  – ^cat (start of line followed by ‘c’ then ‘a’ then ‘t’)
    • cat
    • catwalk
  – cat$ (‘c’ followed by ‘a’ then ‘t’ followed by end of line, EOL)
    • bobcat
    • cat
  – ^cat$ (start of line followed by ‘c’ then ‘a’ then ‘t’ then EOL)
    • cat
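As a quick check of the three anchored patterns, here is a small Python sketch. The `matching` helper is hypothetical, and the word list simply reuses the sample words from the slides.

```python
import re

words = ["bobcat", "cat", "catwalk", "scatter"]  # sample words from the slides

def matching(pattern):
    """Words in which `pattern` matches somewhere (hypothetical helper)."""
    return [w for w in words if re.search(pattern, w)]

starts_with_cat = matching(r"^cat")   # ['cat', 'catwalk']
ends_with_cat = matching(r"cat$")     # ['bobcat', 'cat']
exactly_cat = matching(r"^cat$")      # ['cat']

print(starts_with_cat, ends_with_cat, exactly_cat)
```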
![Page 7: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/7.jpg)
How to read regex
• Read each character one at a time
• ^bat
  – Start of line followed by ‘b’ then ‘a’ then ‘t’
• rat$
  – ‘r’ then ‘a’ then ‘t’ followed by end of line
• ^dog$
  – Start of line followed by ‘d’ then ‘o’ then ‘g’ then EOL
![Page 8: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/8.jpg)
More simple regexes
• ^
  – Start of line (so it matches every line)
• ^$
  – Start of line followed by end of line, i.e. an empty line
• $
  – End of line (so it also matches every line)
• ^foot$
  – Start of line followed by ‘f’ then ‘o’ then ‘o’ then ‘t’ followed by EOL
![Page 9: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/9.jpg)
Character Classes [ ]
• Matches exactly one of the characters listed inside the [ ]
  – [ae]
    • Matches ‘a’ or ‘e’
  – [aeiouy]
    • Matches any lowercase vowel (counting ‘y’)
  – ^gr[ae]y$
    • Start of line followed by ‘g’ then ‘r’ then ‘a’ or ‘e’ then ‘y’ followed by end of line
    • grey or gray
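To see that a character class consumes exactly one character, the ^gr[ae]y$ pattern can be tried against a few near misses. This is a minimal Python sketch; the `candidates` list is made up for illustration.

```python
import re

# Illustrative candidates: two real spellings plus some near misses.
candidates = ["gray", "grey", "griy", "graey", "greyhound", "gravy"]

pattern = re.compile(r"^gr[ae]y$")
hits = [w for w in candidates if pattern.search(w)]

print(hits)  # ['gray', 'grey']
```

‘graey’ fails because [ae] matches one character only, and ‘greyhound’ fails because of the $ anchor.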
![Page 10: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/10.jpg)
Character Classes cont.
• [Ss]
  – Matches upper or lower case ‘S’
• [123456]
  – Matches any one of the digits listed
• [Hh][123456]
  – Matches H1, h2, h3, H4, etc.
![Page 11: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/11.jpg)
Special characters in [ ]’s
• - (dash) indicates a range
  – [1-6] is the same as [123456]
  – [a-f] is the same as [abcdef]
• Ranges can be mixed with literals
  – [0-9a-fA-F_!.?]
    • Any digit, upper or lower case ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, underscore, exclamation mark, period or question mark
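The equivalence of a range and its spelled-out form can be checked mechanically. The sketch below uses Python's `re` module. The test strings are arbitrary, chosen only to include characters both inside and outside each class.

```python
import re

digits = "0123456789"

# [1-6] and [123456] should pick out exactly the same characters.
short_form = [c for c in digits if re.search(r"[1-6]", c)]
long_form = [c for c in digits if re.search(r"[123456]", c)]

# Ranges mixed with literals: hex digits plus _ ! . ?
mixed = re.findall(r"[0-9a-fA-F_!.?]", "Fg_9z!x.?-")

print(short_form == long_form)  # True
print(mixed)                    # ['F', '_', '9', '!', '.', '?']
```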
![Page 12: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/12.jpg)
Negated character class [^ ]
• ^ inside of [ ] means “not any of these”
  – [^1-6]
    • Any character other than 1, 2, 3, 4, 5, 6
  – [^a-fA-F]
    • Any character other than A-F (upper or lower)
  – The ^ must be the first character inside [ ]
    • [^c] (Matches any character but ‘c’)
    • [c^] (Matches a ‘c’ or ‘^’)
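The position of ^ inside the brackets changes its meaning, which a short Python sketch can show. The `sample` string is arbitrary, picked so that it contains both a ‘c’ and a literal ‘^’.

```python
import re

sample = "abc^"

# ^ first: negation, so every character except 'c' is matched.
not_c = re.findall(r"[^c]", sample)

# ^ not first: it is just a literal caret in the class.
c_or_caret = re.findall(r"[c^]", sample)

# A negated range: everything except the digits 1 through 6.
not_1_to_6 = re.findall(r"[^1-6]", "0123456789")

print(not_c)        # ['a', 'b', '^']
print(c_or_caret)   # ['c', '^']
print(not_1_to_6)   # ['0', '7', '8', '9']
```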
![Page 13: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/13.jpg)
Translating regex practice
• List of words that have ‘q’ followed by a character other than ‘u’
  – q[^u]
• List of words with ‘f’ followed by an ‘i’ or ‘o’ followed by ‘r’ then ‘e’
  – f[io]re
• Line starts with ‘Qu’ or ‘qu’ followed by an ‘e’ followed by any letter between ‘p’ and ‘t’
  – ^[Qq]ue[p-t]
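The three translations can be tried on a handful of words. This is a sketch under assumptions: the `words` list and the `matching` helper are invented for illustration, not taken from a real dictionary file.

```python
import re

# Illustrative word list (not from a real dictionary file).
words = ["Iraq", "Iraqi", "quit", "Qantas", "fire", "fore",
         "fare", "queue", "Quest", "question"]

def matching(pattern):
    """Words in which `pattern` matches somewhere (hypothetical helper)."""
    return [w for w in words if re.search(pattern, w)]

q_not_u = matching(r"q[^u]")          # ['Iraqi']
fire_fore = matching(r"f[io]re")      # ['fire', 'fore']
que_p_to_t = matching(r"^[Qq]ue[p-t]")  # ['Quest', 'question']

print(q_not_u, fire_fore, que_p_to_t)
```

‘Iraq’ is not reported by q[^u]: a negated class still has to match some character, and nothing follows the ‘q’.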
![Page 14: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/14.jpg)
Metacharacter: . (dot)
• Matches any single character
• c.t
  – ‘c’ followed by any character followed by ‘t’
    • cat
    • cot
    • c8t
• Period inside of [ ]’s matches a literal period
  – [a.c]
    • Matches ‘a’, ‘.’ or ‘c’
![Page 15: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/15.jpg)
Periods cont.
• 03.19.76
  – Matches ‘03’ followed by any char then ‘19’ then any char then ‘76’
    • 03-19-76
    • 03/19/76
    • 03.19.76
    • 03 19 76
    • 03319876
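The list above shows the dot accepting more than a date separator. The Python sketch below reproduces that and, as one possible tightening that is not in the slides, swaps each dot for a character class of allowed separators. The `strict` pattern and the extra ‘031976’ test string are assumptions added for illustration.

```python
import re

dates = ["03-19-76", "03/19/76", "03.19.76", "03 19 76", "03319876", "031976"]

loose = re.compile(r"03.19.76")              # pattern from the slide
strict = re.compile(r"03[-./ ]19[-./ ]76")   # assumed alternative, not from the slides

loose_hits = [d for d in dates if loose.search(d)]
strict_hits = [d for d in dates if strict.search(d)]

print(loose_hits)   # first five entries, including '03319876'
print(strict_hits)  # only the four with a real separator
```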
![Page 16: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/16.jpg)
Alternatives: | (pipe)
• Pipes allow you to specify alternatives
• grey|gray
  – Matches on grey or gray
• Use parentheses to constrain alternatives
  – gr(e|a)y
• Within [ ]’s, | is a normal character
  – [a|b]
    • Matches ‘a’ or ‘|’ or ‘b’
![Page 17: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/17.jpg)
Pipes (cont.)
• Use parentheses to constrain
  – gre|ay
    • matches ‘gre’ or ‘ay’
  – gr(e|a)y
    • matches ‘gr’ followed by ‘e’ or ‘a’ then ‘y’
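The difference between the two patterns shows up as soon as the input contains words like ‘green’ or ‘day’. A minimal Python sketch, with a made-up `words` list:

```python
import re

words = ["grey", "gray", "green", "day", "gravy"]  # illustrative input

# Without parentheses the alternatives are the whole of 'gre' and the whole of 'ay'.
unconstrained = [w for w in words if re.search(r"gre|ay", w)]

# With parentheses only the single letter alternates.
constrained = [w for w in words if re.search(r"gr(e|a)y", w)]

print(unconstrained)  # ['grey', 'gray', 'green', 'day']
print(constrained)    # ['grey', 'gray']
```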
![Page 18: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/18.jpg)
Regex practice
• Match “First Street” or “1st street”
  – (First|1st) [Ss]treet
  – (Fir|1)st [Ss]treet
    • These are equivalent; which is better?
• Match “toothbrush” or “hairbrush”
  – (tooth|hair)brush
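One way to convince yourself that the two street patterns agree is to run both over the same inputs. The sketch below does that in Python with an invented `samples` list. Agreement on a sample is evidence, not proof, and it says nothing about which version is easier to read.

```python
import re

samples = ["First Street", "1st street", "First street", "1st Street",
           "Second Street", "first street", "Fist Street"]

readable = re.compile(r"(First|1st) [Ss]treet")
factored = re.compile(r"(Fir|1)st [Ss]treet")

hits_readable = [s for s in samples if readable.search(s)]
hits_factored = [s for s in samples if factored.search(s)]

brushes = [w for w in ["toothbrush", "hairbrush", "paintbrush"]
           if re.search(r"(tooth|hair)brush", w)]

print(hits_readable == hits_factored)  # True
print(brushes)                         # ['toothbrush', 'hairbrush']
```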
![Page 19: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/19.jpg)
^ or $ and alternation
• Be careful when using ^ or $ with alternation
• ^From|Subject|Date:
  – Start of line followed by ‘From’, OR
  – ‘Subject’ anywhere in the line, OR
  – ‘Date:’ anywhere in the line
• ^(From|Subject|Date):
  – Start of line followed by ‘From’ or ‘Subject’ or ‘Date’ followed by ‘:’
• Safer to use ()’s to group your alternatives
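The effect of the missing parentheses is easiest to see on mail-like lines. In this Python sketch the `lines` list is invented for illustration. The ungrouped pattern accepts any line that merely contains ‘Subject’ or ‘Date:’.

```python
import re

lines = [
    "From: alice@example.com",
    "Re: Subject matter",
    "Subject: hello",
    "Due Date: Friday",
    "Date: today",
    "From the desk of",
    "To: bob@example.com",
]

ungrouped = re.compile(r"^From|Subject|Date:")
grouped = re.compile(r"^(From|Subject|Date):")

ungrouped_hits = [l for l in lines if ungrouped.search(l)]
grouped_hits = [l for l in lines if grouped.search(l)]

print(len(ungrouped_hits))  # 6: everything except the 'To:' line
print(grouped_hits)         # only the three real header lines
```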
![Page 20: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/20.jpg)
Case insensitive match
• Matches are case sensitive by default
  – [Ff]rom will match From but not FRom
• Use egrep’s -i option to do a case insensitive match
• Most languages offer a case insensitive match option as well
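In Python, for example, the counterpart of egrep’s -i option is the `re.IGNORECASE` flag. A minimal sketch:

```python
import re

# Case sensitive by default: only the first letter may vary here.
matches_From = bool(re.search(r"[Ff]rom", "From"))
matches_FRom = bool(re.search(r"[Ff]rom", "FRom"))

# With the IGNORECASE flag every letter may be upper or lower case.
matches_FRom_ignorecase = bool(re.search(r"from", "FRom", re.IGNORECASE))

print(matches_From, matches_FRom, matches_FRom_ignorecase)  # True False True
```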
![Page 21: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/21.jpg)
Quantifiers: ?
• ? metacharacter means the previous item is optional
  – colou?r
    • matches color or colour
    • ‘c’ then ‘o’ then ‘l’ then ‘o’ then optionally ‘u’ then ‘r’
• Match July or Jul and fourth, 4th or 4
  – (July|Jul) (fourth|4th|4)
  – July? (fourth|4th|4)
  – July? (fourth|4(th)?)
![Page 22: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/22.jpg)
Quantifiers: + and *
• + (plus)
  – One or more of the previous item
• * (star)
  – Zero or more of the previous item
• b[0-9]*a
  – ba
  – b9999a
  – b999999999999999a
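The only difference between * and + is whether zero repetitions are allowed, which the string ‘ba’ exposes. A short Python sketch, with the ‘b9x9a’ string added as an illustrative non-match:

```python
import re

star = re.compile(r"b[0-9]*a")  # zero or more digits
plus = re.compile(r"b[0-9]+a")  # at least one digit

tests = ["ba", "b9999a", "b999999999999999a", "b9x9a"]

star_hits = [t for t in tests if star.fullmatch(t)]
plus_hits = [t for t in tests if plus.fullmatch(t)]

print(star_hits)  # ['ba', 'b9999a', 'b999999999999999a']
print(plus_hits)  # ['b9999a', 'b999999999999999a']
```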
![Page 23: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/23.jpg)
Summary of Quantifiers
| Quantifier | Minimum required | Maximum to try | Meaning |
| --- | --- | --- | --- |
| ? | none | 1 | zero or one occurrence |
| * | none | no limit | zero or more occurrences |
| + | 1 | no limit | one or more occurrences |
![Page 24: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/24.jpg)
Escaping metacharacters
• Use \ (backslash) to escape metacharacters
  – \. matches ‘.’
  – . matches any character
• c.t matches cat
• c\.t does not match cat
• \(cat\) matches ‘(cat)’ not ‘cat’
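The escaping rules above behave the same way in Python’s `re` module, which also provides `re.escape` to add the backslashes for you. A minimal sketch using raw strings, so that Python itself leaves the backslashes alone:

```python
import re

dot_matches_cat = bool(re.search(r"c.t", "cat"))           # True: . is a wildcard
escaped_matches_cat = bool(re.search(r"c\.t", "cat"))      # False: needs a literal '.'
escaped_matches_dot = bool(re.search(r"c\.t", "c.t"))      # True

parens_literal = bool(re.search(r"\(cat\)", "(cat)"))      # True
parens_missing = bool(re.search(r"\(cat\)", "cat"))        # False

# re.escape builds the escaped pattern from a literal string.
escaped = re.escape("(cat)")

print(escaped)  # \(cat\)
```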
![Page 25: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/25.jpg)
More practice
• Match chat, cat, chart
  – ch?ar?t
  – c[h]?a[r]?t
• Start of line then ‘M’ then one or more ‘a’ followed by ‘st’ and zero or more ‘b’
  – ^M[a]+st[b]*
• Lines ending with one or more ‘c’ followed by a ‘t’ then zero or one ‘e’
  – [c]+t[e]?$
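These practice patterns can be spot-checked in Python. The sketch below is illustrative only, with made-up test strings. It also shows how “zero or one” (?) differs from “zero or more” (*) at the end of a line, since only the latter accepts a doubled ‘e’.

```python
import re

# ch?ar?t accepts chat, cat and chart (and, as a side effect, cart).
chat_hits = [w for w in ["chat", "cat", "chart", "cart", "chaat"]
             if re.fullmatch(r"ch?ar?t", w)]

mast_hits = [w for w in ["Mast", "Maaastbb", "Mst", "mast"]
             if re.search(r"^M[a]+st[b]*", w)]

zero_or_one_e = [w for w in ["ct", "accte", "ctee"]
                 if re.search(r"[c]+t[e]?$", w)]
zero_or_more_e = [w for w in ["ct", "accte", "ctee"]
                  if re.search(r"[c]+t[e]*$", w)]

print(chat_hits)       # ['chat', 'cat', 'chart', 'cart']
print(mast_hits)       # ['Mast', 'Maaastbb']
print(zero_or_one_e)   # ['ct', 'accte']
print(zero_or_more_e)  # ['ct', 'accte', 'ctee']
```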
![Page 26: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/26.jpg)
More practice
• ^[Mm][^a-np-z]ney$
  – Start of line then ‘M’ or ‘m’ then any character that is not in a-n or p-z, then ‘ney’ followed by end of line
  – Money, money, m3ney
• ^be.*(bob|ted)$
  – Start of line followed by ‘be’ followed by zero or more characters followed by ‘bob’ or ‘ted’ followed by end of line
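A short Python sketch of both patterns, using invented test strings. Because the negated class only excludes lowercase a-n and p-z, anything else is accepted in that position, including digits and uppercase letters such as the ‘O’ in ‘MOney’.

```python
import re

money = re.compile(r"^[Mm][^a-np-z]ney$")
money_hits = [w for w in ["Money", "money", "m3ney", "MOney", "Maney", "mney"]
              if money.search(w)]

be_pattern = re.compile(r"^be.*(bob|ted)$")
be_hits = [s for s in ["bebob", "be kind to ted", "beted", "bob", "be bobby"]
           if be_pattern.search(s)]

print(money_hits)  # ['Money', 'money', 'm3ney', 'MOney']
print(be_hits)     # ['bebob', 'be kind to ted', 'beted']
```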
![Page 27: Regex Intro](https://reader037.fdocuments.in/reader037/viewer/2022102613/549a3aa4b47959564d8b5961/html5/thumbnails/27.jpg)
More practice
• Match truck, firetruck but not dumptruck
  – ^(fire)?truck$
• $0.99, $599.95, $1000.45, $5000
  – \$[0-9]+(\.[0-9][0-9])?$
• 404-555-1212, 404.555.1212, (404) 555-1212
  – ^[()0-9]+.[0-9]+.[0-9]+$
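The last three patterns can be exercised the same way. This Python sketch uses the examples from the slide plus a few extra strings that are assumptions added for illustration. It also shows that the phone pattern is quite permissive, because each unescaped dot accepts any character: a plain run of digits such as ‘12345’ gets through.

```python
import re

truck_hits = [w for w in ["truck", "firetruck", "dumptruck"]
              if re.search(r"^(fire)?truck$", w)]

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?$")
price_hits = [p for p in ["$0.99", "$599.95", "$1000.45", "$5000", "$5.5"]
              if price.search(p)]

phone = re.compile(r"^[()0-9]+.[0-9]+.[0-9]+$")
phone_hits = [n for n in ["404-555-1212", "404.555.1212", "(404) 555-1212",
                          "12345", "phone"]
              if phone.search(n)]

print(truck_hits)  # ['truck', 'firetruck']
print(price_hits)  # ['$0.99', '$599.95', '$1000.45', '$5000']
print(phone_hits)  # the three phone numbers, plus '12345'
```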