Regular expressions
Uploaded by srgrn
Transcript of Regular expressions
1
Regular Expressions
How do they work
2
Several important Facts
1. Everything in computing was discovered in one form or another in the 70-80’s and was probably thought about during the 60’s.
2. The easiest way to become a great computer engineer in the 80’s was to work for Bell Labs and have a beard.
3
Back to the subject at hand
4
What are regular expressions?
From Wikipedia:
In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for "regular expression" include regex and regexp.
5
Why do we need regular expressions (in programming)?
Many reasons, but at their base most of them come down to finding strings in text, preferably without reading it.
^(?("")(""[^""]+?""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9]{2,17}))$
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$
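As a rough illustration, the second pattern above can be tried out in Python. It uses three lookaheads to require at least one non-letter, one lowercase letter and one uppercase letter, and then insists on eight or more non-whitespace characters. The function name `looks_like_strong_password` is made up for this sketch, and the first pattern is left out because its conditional syntax is not accepted by Python's `re` module in that form.

```python
import re

# The second pattern from the slide, copied unchanged.
PASSWORD_RE = re.compile(r"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$")

def looks_like_strong_password(candidate):
    """Return True if the candidate satisfies every constraint in the pattern."""
    return PASSWORD_RE.match(candidate) is not None

if __name__ == "__main__":
    for word in ["Passw0rd!", "password1", "Abcdefgh", "Pass w0rd1", "Ab1"]:
        print(repr(word), looks_like_strong_password(word))
```

Only the first candidate passes: the others lack an uppercase letter, lack a non-letter, contain a space, or are too short.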
6
Regular Expressions Syntax meta characters
Grouping
. – matches any single character
[ ] – character class, matches a single character that is inside the brackets
[^ ] – negated character class, matches a single character that is not inside the brackets
( ) – subexpression; in Perl it can be recalled later from special variables
Quantifier
{m,n} – specifies that the preceding character/subexpression needs to be matched at least m times and no more than n times
* – derived from the Kleene star in formal language theory, matches 0 or more of the preceding element
? – matches zero or one of the preceding element
+ – derived from the Kleene plus in formal language theory, matches 1 or more of the preceding element
Location
^ – marks the start of a line
$ – marks the end of a line
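The metacharacters above can be tried one at a time. The following is a small sketch in Python (the deck itself refers to Perl, so this is a swapped-in language); the helper name `full` is made up for the example and simply reports whether a pattern matches an entire string.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

if __name__ == "__main__":
    print(full(r"c.t", "cut"))         # . stands for any single character
    print(full(r"[bc]at", "bat"))      # [ ] one character from the set
    print(full(r"[^bc]at", "rat"))     # [^ ] one character not in the set
    print(full(r"a{2,3}", "aaaa"))     # {m,n} fails here: four is more than three
    print(full(r"ab*", "a"))           # * zero or more b
    print(full(r"ab+", "a"))           # + needs at least one b
    print(full(r"colou?r", "color"))   # ? zero or one u
    # ^ and $ anchor a search to the start and end of the line.
    print(re.search(r"^foo", "foobar") is not None)
    print(re.search(r"foo$", "foobar") is not None)
    # ( ) captures a subexpression that can be recalled after the match.
    print(re.search(r"(\w+)@(\w+)", "user@host").group(1))
```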
7
Regular Expressions Syntax Character groups
[:alpha:] – Any alphabetical character - [A-Za-z]
[:alnum:] – Any alphanumeric character - [A-Za-z0-9]
[:ascii:] – Any character in the ASCII character set.
[:blank:] – A GNU extension, equal to a space or a horizontal tab ("\t")
[:cntrl:] – Any control character
[:digit:] – Any decimal digit - [0-9], equivalent to "\d"
[:graph:] – Any printable character, excluding a space
[:lower:] – Any lowercase character - [a-z]
[:print:] – Any printable character, including a space
[:punct:] – Any graphical character excluding "word" characters
[:space:] – Any whitespace character. "\s" plus the vertical tab ("\cK")
[:upper:] – Any uppercase character - [A-Z]
[:word:] – A Perl extension - [A-Za-z0-9_], equivalent to "\w"
[:xdigit:] – Any hexadecimal digit - [0-9a-fA-F].
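Not every engine accepts these bracketed names. Python's standard `re` module, for instance, does not understand the `[:name:]` syntax, so one workaround is to expand the names into the explicit ranges listed above before compiling. The sketch below covers only the classes whose ranges the slide spells out (plus `[:blank:]`); `POSIX_CLASSES` and `expand_posix` are names invented for this example.

```python
import re

# Explicit ASCII equivalents, taken from the list above.
POSIX_CLASSES = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
    "blank": r" \t",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with its explicit range."""
    for name, body in POSIX_CLASSES.items():
        pattern = pattern.replace(f"[:{name}:]", body)
    return pattern

if __name__ == "__main__":
    identifier = expand_posix("[[:alpha:]][[:alnum:]_]*")
    print(identifier)
    print(re.fullmatch(identifier, "var_1") is not None)
    print(re.fullmatch(identifier, "1var") is not None)
```

A plain string replacement like this is deliberately naive: it does not check that the name really sits inside a bracket expression.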
8
What is a regular expression engine
A regular expression engine is a program that takes a set of constraints specified in a mini-language, and then applies those constraints to a target string, and determines whether or not the string satisfies the constraints.
In less grandiose terms, the first part of the job is to turn a pattern into something the computer can efficiently use to find the matching point in the string, and the second part is performing the search itself.
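The two parts of the job are visible in most regex APIs. Here is a minimal sketch using Python's `re` module (an assumption of this example, since the deck goes on to discuss Perl): `re.compile` turns the pattern into an internal form, and `search` runs it against the target string. The function name `find_first` is hypothetical.

```python
import re

def find_first(pattern_text, target):
    """Compile the pattern, then search the target; return (position, text) or None."""
    compiled = re.compile(pattern_text)   # part 1: pattern -> internal program
    match = compiled.search(target)       # part 2: run the program over the string
    if match is None:
        return None
    return match.start(), match.group()

if __name__ == "__main__":
    print(find_first(r"\d+", "order 66 shipped"))
    print(find_first(r"\d+", "no digits here"))
```

Keeping the compiled object around and reusing it avoids repeating the first step for every search.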
9
Famous Regex Engines
10
Part 2
11
How the Perl Regex engine works
• Unlike the army, only two steps
– Compilation
  • Parsing (Size, Construction)
  • Peep-hole optimization and analysis
– Execution
  • Start position and no-match optimizations
  • Program execution
12
DFA
13
DFA
14
NFA
Equal in strength to DFA
Smaller in size
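Both claims can be checked on a small case with the textbook subset construction, which turns any NFA into a DFA that accepts the same strings. The NFA below is a hand-made example (not from the slides) for "the third symbol from the end is a" over the alphabet {a, b}. It has 4 states, while the equivalent DFA built from it ends up with 8.

```python
from collections import deque

# Hypothetical example NFA: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2},    (1, "b"): {2},
    (2, "a"): {3},    (2, "b"): {3},
}
NFA_START, NFA_ACCEPT = 0, {3}

def subset_construction(nfa, start, alphabet):
    """Build DFA transitions whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                nxt for state in current for nxt in nfa.get((state, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, seen

def dfa_accepts(dfa, start, accept, text):
    """Run the DFA; accept if the final set contains an accepting NFA state."""
    current = frozenset([start])
    for ch in text:
        current = dfa[(current, ch)]
    return bool(current & accept)

if __name__ == "__main__":
    dfa, dfa_states = subset_construction(NFA, NFA_START, "ab")
    print("NFA states: 4, DFA states:", len(dfa_states))
    for word in ["abb", "babb", "bbb", "ab"]:
        print(word, dfa_accepts(dfa, NFA_START, NFA_ACCEPT, word))
```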
15
Ken Thompson
16
Thompson NFA method
• In 1968 Thompson wrote an article on how to convert a regular expression to a still unnamed automaton (NFA)
• The article included code to explain the point
17
Thompson NFA method
1. Check the regex and inject . for the concat action: a(b|c)*d becomes a.(b|c)*.d
2. Convert to reverse Polish notation: abc|*.d.
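These two steps can be sketched in a few lines of Python. This is not Thompson's original code: it is an illustrative version that uses the shunting-yard approach for the conversion, supports only literals, `|`, `*` and parentheses, and treats `.` purely as the injected concat operator. The names `insert_concat` and `to_postfix` are made up for the example.

```python
# Binary operators and their precedence; * is a postfix operator and is
# emitted directly, like a literal.
PRECEDENCE = {"|": 1, ".": 2}

def insert_concat(regex):
    """Step 1: inject '.' wherever two adjacent pieces are concatenated."""
    out = []
    for i, ch in enumerate(regex):
        out.append(ch)
        if i + 1 < len(regex):
            nxt = regex[i + 1]
            if ch not in "(|" and nxt not in ")|*":
                out.append(".")
    return "".join(out)

def to_postfix(regex):
    """Step 2: convert the explicit-concat form to reverse Polish notation."""
    output, stack = [], []
    for ch in insert_concat(regex):
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        elif ch in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]):
                output.append(stack.pop())
            stack.append(ch)
        else:  # a literal character or the postfix *
            output.append(ch)
    while stack:
        output.append(stack.pop())
    return "".join(output)

if __name__ == "__main__":
    print(insert_concat("a(b|c)*d"))
    print(to_postfix("a(b|c)*d"))
```

For the slide's example this gives `a.(b|c)*.d` and then `abc|*.d.`, where the trailing `.` is the last concat operator and not a full stop.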
18
Thompson NFA method cont.
[Figure: NFA fragments for a single character (char), OR of two expressions (exp, exp), and Kleene star (exp)]
19
Thompson NFA method cont.
• 3. Build the NFA
[Figure: NFA diagram with labels A, B, C and D]
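Step 3 can be sketched as a stack machine over the postfix string: each literal pushes a two-state fragment, and each operator pops fragments and wires them together with empty (epsilon) edges. The code below is an illustrative Python version in the spirit of the method, not Thompson's original code. The class and function names are invented here, and matching is done by tracking the set of states the NFA could be in.

```python
from collections import defaultdict

class NFA:
    def __init__(self):
        self.char_edge = {}                 # state -> (character, target state)
        self.eps_edges = defaultdict(list)  # state -> targets reachable for free
        self.count = 0

    def new_state(self):
        self.count += 1
        return self.count - 1

def build_nfa(postfix):
    """Build an NFA from a postfix regex using '.', '|' and '*' as operators."""
    nfa, stack = NFA(), []
    for ch in postfix:
        if ch == ".":                       # concat: first fragment flows into second
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            nfa.eps_edges[a1].append(s2)
            stack.append((s1, a2))
        elif ch == "|":                     # OR: new start branches to both fragments
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            start, accept = nfa.new_state(), nfa.new_state()
            nfa.eps_edges[start] += [s1, s2]
            nfa.eps_edges[a1].append(accept)
            nfa.eps_edges[a2].append(accept)
            stack.append((start, accept))
        elif ch == "*":                     # Kleene star: skip the fragment or loop
            s1, a1 = stack.pop()
            start, accept = nfa.new_state(), nfa.new_state()
            nfa.eps_edges[start] += [s1, accept]
            nfa.eps_edges[a1] += [s1, accept]
            stack.append((start, accept))
        else:                               # single character
            start, accept = nfa.new_state(), nfa.new_state()
            nfa.char_edge[start] = (ch, accept)
            stack.append((start, accept))
    start, accept = stack.pop()
    return nfa, start, accept

def closure(nfa, states):
    """All states reachable from the given ones through epsilon edges only."""
    todo, seen = list(states), set(states)
    while todo:
        for target in nfa.eps_edges.get(todo.pop(), ()):
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen

def matches(postfix, text):
    """True if the whole text is accepted by the NFA built from the postfix regex."""
    nfa, start, accept = build_nfa(postfix)
    current = closure(nfa, {start})
    for ch in text:
        moved = {nfa.char_edge[s][1] for s in current
                 if s in nfa.char_edge and nfa.char_edge[s][0] == ch}
        current = closure(nfa, moved)
    return accept in current

if __name__ == "__main__":
    for word in ["ad", "abcbd", "a", "abca"]:
        print(word, matches("abc|*.d.", word))
```

Because the simulation advances every possible state at once, it never backtracks, whatever the input looks like.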
20
Problems for regex
• NLP
• Unicode vs. ASCII
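One concrete face of the Unicode vs. ASCII problem is that the same shorthand can mean different things depending on the mode. The sketch below uses Python's `re` module as an example engine, where `\w` is Unicode-aware for text patterns unless the `re.ASCII` flag is given. The helper name `is_word` is hypothetical.

```python
import re

def is_word(text, ascii_only=False):
    """True if the whole text consists of \\w characters in the chosen mode."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w+", text, flags) is not None

if __name__ == "__main__":
    print(is_word("café"))                             # Unicode mode
    print(is_word("café", ascii_only=True))            # ASCII mode
    print(re.fullmatch(r"[a-z]+", "café") is not None) # explicit ASCII range
```

The accented letter counts as a word character in the first call only, so a pattern written with ASCII text in mind can silently reject or accept more than intended.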
21
Some examples of Regex
• ([^\s]+(\.(?i)(jpg|png|gif|bmp))$) – Match file with specific extensions
• ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$ – Match URL
• /^#?([a-f0-9]{6}|[a-f0-9]{3})$/ – Match a hex value
• [ -~] – An interesting one: the range from space to tilde, which matches any printable ASCII character.
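A few of these can be tried directly in Python. In this sketch the hex pattern is used as written, minus the surrounding `/` delimiters. The file extension pattern is adapted, because recent Python versions reject an inline `(?i)` in the middle of a pattern, so the `re.IGNORECASE` flag is used instead. The `+` and full-string match around `[ -~]` are also an addition for the example. The constant and function names are made up.

```python
import re

HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
# Adapted: case-insensitivity moved from an inline (?i) to a flag.
IMAGE_FILE = re.compile(r"[^\s]+\.(jpg|png|gif|bmp)$", re.IGNORECASE)
PRINTABLE_ASCII = re.compile(r"[ -~]+")

def is_hex_color(text):
    return HEX_COLOR.match(text) is not None

def is_image_file(name):
    return IMAGE_FILE.match(name) is not None

def is_printable_ascii(text):
    return PRINTABLE_ASCII.fullmatch(text) is not None

if __name__ == "__main__":
    print(is_hex_color("#1a2b3c"), is_hex_color("#abcd"))
    print(is_image_file("photo.JPG"), is_image_file("notes.txt"))
    print(is_printable_ascii("Hello, world!"), is_printable_ascii("tab\there"))
```

Note that the hex pattern as given accepts lowercase digits only, so `#ABC` is rejected unless a case-insensitive flag is added.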