Regular expressions

A lecture from my software engineering seminar on regular expression engines.

Transcript of Regular expressions

Page 1: Regular expressions

Regular Expressions

How do they work?

Page 2: Regular expressions

Several important facts

1. Everything in computing was discovered in one form or another in the 1970s and 80s, and was probably thought about during the 60s.
2. The easiest way to become a great computer engineer in the 80s was to work for Bell Labs and have a beard.

Page 3: Regular expressions

Back to the subject at hand

Page 4: Regular expressions

What are regular expressions?

From Wikipedia: In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for "regular expression" include regex and regexp.

Page 5: Regular expressions

Why do we need regular expressions (in programming)?

Many reasons, but at their base most of them come down to finding strings in text, preferably without reading it.

^(?("")(""[^""]+?""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9]{2,17}))$

^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$
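The second pattern above can be tried directly. This is a minimal sketch, assuming Python's `re` module (the slides do not say which engine the patterns were written for); the name `is_strong` is made up for the illustration. The three lookaheads each scan the whole string for one required kind of character, and `\S{8,}` then demands at least eight non-whitespace characters.

```python
import re

# Second pattern from the slide: at least 8 non-whitespace characters,
# containing one lowercase letter, one uppercase letter and one non-letter.
PASSWORD = re.compile(r"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$")


def is_strong(candidate: str) -> bool:
    """Hypothetical helper: True when the candidate satisfies the pattern."""
    return PASSWORD.match(candidate) is not None


print(is_strong("Passw0rd!"))   # has all three kinds of character
print(is_strong("password"))    # no uppercase letter, no non-letter
print(is_strong("Pass w0rd1"))  # contains whitespace
```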

Page 6: Regular expressions

Regular Expressions Syntax: Metacharacters

Grouping
• . – matches any single character (in most engines, any character except a newline)
• [ ] – character class: matches a single character that is inside the brackets
• [^ ] – negated character class: matches a single character that is not inside the brackets
• ( ) – sub-expression; in Perl the matched text can be recalled later from special variables

Quantifiers
• {m,n} – the preceding character or sub-expression must be matched at least m times and no more than n times
• * – derived from the Kleene star of formal language theory; matches zero or more of the preceding element
• ? – matches zero or one of the preceding element
• + – derived from the Kleene cross (also called the Kleene plus); matches one or more of the preceding element

Location
• ^ – marks the start of a line
• $ – marks the end of a line
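Each metacharacter can be checked with one matching and one non-matching string. The sketch below assumes Python's `re` module; the `EXAMPLES` table and the `check` helper are invented for the illustration and are not part of the lecture.

```python
import re

# Each entry: (pattern, text that should match, text that should not).
EXAMPLES = [
    (r"^c.t$", "cat", "ct"),            # .     any single character
    (r"^[bc]at$", "bat", "rat"),        # [ ]   one character from the set
    (r"^[^bc]at$", "rat", "cat"),       # [^ ]  one character not in the set
    (r"^ab{2,3}c$", "abbc", "abc"),     # {m,n} between m and n repetitions
    (r"^ab*c$", "ac", "adc"),           # *     zero or more
    (r"^colou?r$", "color", "colouur"), # ?     zero or one
    (r"^ab+c$", "abbbc", "ac"),         # +     one or more
]


def check(pattern: str, text: str) -> bool:
    """Hypothetical helper: True when the pattern is found in the text."""
    return re.search(pattern, text) is not None


for pattern, good, bad in EXAMPLES:
    print(pattern, check(pattern, good), check(pattern, bad))

# A sub-expression in parentheses is captured and can be recalled later.
m = re.search(r"(\w+)@(\w+)", "user@example")
print(m.group(1), m.group(2))
```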

Page 7: Regular expressions

Regular Expressions Syntax: Character groups

• [:alpha:] – any alphabetical character, [A-Za-z]
• [:alnum:] – any alphanumeric character, [A-Za-z0-9]
• [:ascii:] – any character in the ASCII character set
• [:blank:] – a GNU extension, equal to a space or a horizontal tab ("\t")
• [:cntrl:] – any control character
• [:digit:] – any decimal digit, [0-9], equivalent to "\d"
• [:graph:] – any printable character, excluding a space
• [:lower:] – any lowercase character, [a-z]
• [:print:] – any printable character, including a space
• [:punct:] – any graphical character excluding "word" characters
• [:space:] – any whitespace character: "\s" plus the vertical tab ("\cK")
• [:upper:] – any uppercase character, [A-Z]
• [:word:] – a Perl extension, [A-Za-z0-9_], equivalent to "\w"
• [:xdigit:] – any hexadecimal digit, [0-9a-fA-F]
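Not every engine understands these names. As far as I know, Python's `re` module does not accept POSIX classes such as `[[:alpha:]]`, so the sketch below rewrites them into the explicit ASCII ranges listed above before compiling. The `POSIX_CLASSES` table covers only a few classes and `expand_posix` is a made-up helper, not a library function.

```python
import re

# ASCII-only equivalents for a few of the classes listed above.
POSIX_CLASSES = {
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "word": "A-Za-z0-9_",
    "xdigit": "0-9a-fA-F",
}


def expand_posix(pattern: str) -> str:
    """Replace each [:name:] inside a bracket expression with explicit ranges."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)


# An identifier: a letter or underscore, followed by word characters.
identifier = expand_posix(r"^[[:alpha:]_][[:word:]]*$")
print(identifier)
print(bool(re.match(identifier, "foo_1")), bool(re.match(identifier, "1foo")))
```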

Page 8: Regular expressions

What is a regular expression engine?

A regular expression engine is a program that takes a set of constraints specified in a mini-language, and then applies those constraints to a target string, and determines whether or not the string satisfies the constraints.

In less grandiose terms, the first part of the job is to turn a pattern into something the computer can efficiently use to find the matching point in the string, and the second part is performing the search itself.
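The two parts of the job are visible in most regex APIs. A minimal sketch, assuming Python's `re` module: `re.compile` turns the pattern into a reusable internal form, and `search` on the compiled object performs the actual scan. The date pattern and the sample lines are invented for the illustration.

```python
import re

# Part one: turn the pattern into an internal, reusable form.
engine = re.compile(r"\d{4}-\d{2}-\d{2}")

# Part two: run the search against each target string.
for line in ["released 2014-03-21", "no date here"]:
    m = engine.search(line)
    print(line, "->", m.group(0) if m else None)
```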

Page 9: Regular expressions

Famous Regex Engines

Page 10: Regular expressions

Part 2

Page 11: Regular expressions

How the Perl Regex engine works

• Unlike the army, only two steps:
– Compilation
  • Parsing (size, construction)
  • Peephole optimization and analysis
– Execution
  • Start position and no-match optimizations
  • Program execution

Page 12: Regular expressions

DFA

Page 13: Regular expressions

DFA

Page 14: Regular expressions

NFA

• Equal in strength to a DFA
• Smaller in size
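Both claims can be seen on a small example. The sketch below uses a hand-written NFA for "the third character from the end is an a", which is my own example and not one from the slides. Simulating the NFA means tracking the set of states it could be in, and each such set is exactly one state of the equivalent DFA. That is why the two are equally powerful, and also why the DFA can be larger: here 4 NFA states turn into 8 DFA states.

```python
# Hand-written NFA: (state, character) -> set of next states.
# State 0 loops on anything; an 'a' may also start the final three characters.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
    (2, "a"): {3},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3


def step(states, ch):
    """All NFA states reachable from `states` by reading `ch`."""
    return frozenset().union(*(NFA.get((s, ch), set()) for s in states))


def nfa_accepts(text: str) -> bool:
    current = frozenset({START})
    for ch in text:
        current = step(current, ch)
    return ACCEPT in current


def dfa_states(alphabet="ab"):
    """Subset construction: every reachable set of NFA states is one DFA state."""
    start = frozenset({START})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for ch in alphabet:
            nxt = step(state, ch)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


print(nfa_accepts("abb"), nfa_accepts("bbb"))
print(len(dfa_states()))
```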

Page 15: Regular expressions

Ken Thompson

Page 16: Regular expressions

Thompson NFA method

• In 1968 Thompson wrote an article on how to convert a regular expression into a kind of automaton that the article itself did not name (what we now call an NFA)

• The article included code to explain the point

Page 17: Regular expressions

Thompson NFA method

1. Scan the regex and insert an explicit . operator for concatenation: a(b|c)*d becomes a.(b|c)*.d
2. Convert to reverse Polish notation: abc|*.d.
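These two steps can be sketched in a few lines. This is an illustration, not Thompson's original code: the function names are made up, the conversion uses a standard shunting-yard loop, and the sketch assumes a toy syntax with only literals, parentheses, |, * (plus + and ? as postfix operators), with no escapes and no literal dot.

```python
# Binary operators and their precedence; postfix operators (* + ?) need none
# because they are emitted as soon as they are read.
PRECEDENCE = {"|": 1, ".": 2}


def insert_concat(regex: str) -> str:
    """Step 1: make concatenation explicit by inserting '.' between operands."""
    out = []
    for i, ch in enumerate(regex):
        out.append(ch)
        if i + 1 < len(regex):
            nxt = regex[i + 1]
            # A '.' goes between "something that ends an operand" and
            # "something that starts one".
            if ch not in "(|" and nxt not in "|)*+?":
                out.append(".")
    return "".join(out)


def to_postfix(regex: str) -> str:
    """Step 2: convert the explicit form to reverse Polish notation."""
    output, stack = [], []
    for ch in insert_concat(regex):
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        elif ch in PRECEDENCE:
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]:
                output.append(stack.pop())
            stack.append(ch)
        else:
            output.append(ch)  # literal or postfix operator
    while stack:
        output.append(stack.pop())
    return "".join(output)


print(insert_concat("a(b|c)*d"))
print(to_postfix("a(b|c)*d"))
```

Run on the slide's example, the second function reproduces the postfix string shown above, including the trailing concatenation operator.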

Page 18: Regular expressions

Thompson NFA method cont.

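[Diagram: the NFA building blocks. A char node checks a single character; two exp fragments are joined by OR; one exp fragment is wrapped by the Kleene star.]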

Page 19: Regular expressions

Thompson NFA method cont.

• 3. Build the NFA

[Diagram: the resulting NFA; the labels A, B, C and D survive from the figure.]
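The construction step can be sketched as a stack machine over the postfix string: each literal pushes a two-state fragment, and each operator pops fragments and wires them together with epsilon transitions. This is a simplified illustration rather than Thompson's original code; the `State` class and function names are invented here, and only the operators . | * are handled.

```python
class State:
    def __init__(self):
        self.edges = {}  # character -> next State
        self.eps = []    # epsilon (free) transitions


def build_nfa(postfix: str):
    """Return (start, accept) for a postfix regex using '.', '|' and '*'."""
    stack = []
    for ch in postfix:
        if ch == ".":    # concatenation: link first fragment to the second
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            a1.eps.append(s2)
            stack.append((s1, a2))
        elif ch == "|":  # alternation: new start branches into both fragments
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.eps += [s1, s2]
            a1.eps.append(accept)
            a2.eps.append(accept)
            stack.append((start, accept))
        elif ch == "*":  # Kleene star: skip the fragment or loop through it
            s, a = stack.pop()
            start, accept = State(), State()
            start.eps += [s, accept]
            a.eps += [s, accept]
            stack.append((start, accept))
        else:            # literal: one edge that consumes the character
            start, accept = State(), State()
            start.edges[ch] = accept
            stack.append((start, accept))
    return stack.pop()


def closure(states):
    """Add every state reachable through epsilon transitions alone."""
    seen, todo = set(states), list(states)
    while todo:
        for nxt in todo.pop().eps:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def matches(postfix: str, text: str) -> bool:
    """Simulate the NFA on the whole text, tracking all live states at once."""
    start, accept = build_nfa(postfix)
    current = closure({start})
    for ch in text:
        current = closure({s.edges[ch] for s in current if ch in s.edges})
    return accept in current


# "abc|*.d." is the postfix form of a(b|c)*d.
print(matches("abc|*.d.", "abcbd"), matches("abc|*.d.", "abcb"))
```

Because the simulation advances every live state in lockstep, it reads each input character once and never backtracks.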

Page 20: Regular expressions

Problems for regex

• NLP

• Unicode vs. ASCII

Page 21: Regular expressions

Some examples of Regex

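• ([^\s]+(\.(?i)(jpg|png|gif|bmp))$) – Match a file name with one of the listed extensions

• ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$ – Match a URL

• /^#?([a-f0-9]{6}|[a-f0-9]{3})$/ – Match a hex value (such as a color code)

• [ -~] – An interesting one: it matches any printable ASCII character, since the range runs from space to tilde.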
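Three of these can be tried as follows. This is a sketch that assumes Python's `re` module, with two adaptations of my own: the surrounding / delimiters are dropped, and the inline (?i) is replaced by the `re.IGNORECASE` flag, because recent Python versions reject an inline global flag in the middle of a pattern.

```python
import re

HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
IMAGE_FILE = re.compile(r"([^\s]+(\.(jpg|png|gif|bmp))$)", re.IGNORECASE)
PRINTABLE = re.compile(r"[ -~]")

print(bool(HEX_VALUE.match("#1a2b3c")), bool(HEX_VALUE.match("#12345g")))
print(bool(IMAGE_FILE.search("photo.JPG")), bool(IMAGE_FILE.search("notes.txt")))

# [ -~] is a range from space (0x20) to tilde (0x7E): the printable ASCII set.
printable = [chr(c) for c in range(128) if PRINTABLE.fullmatch(chr(c))]
print(len(printable), repr(printable[0]), repr(printable[-1]))
```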