Regular Expression

Minh Hoang TO, Portal Team

Provides fundamental knowledge of regular expressions.

Page 1: Regular Expression

Regular Expression

Minh Hoang TO, Portal Team

Page 2: Regular Expression

Agenda

» Finite State Machine

» Pattern Parser

» Java Regex

» Parsers in GateIn

» Advanced Theory

Page 3: Regular Expression

Finite State Machine

Page 4: Regular Expression

State Diagram

Page 5: Regular Expression

JIRA Issue Lifecycle

Page 6: Regular Expression

Java Thread Lifecycle

Page 7: Regular Expression

Java Compilation Flow

Page 8: Regular Expression

Finite State Machine - FSM

» Behavioral model describing the workflow of a system: a finite set of states and the transitions between them

Page 9: Regular Expression

Finite State Machine - FSM

» Represented as a directed graph with labeled edges: nodes are states, edge labels are the inputs that trigger a transition

Page 10: Regular Expression

Pattern Parser

Page 11: Regular Expression

Classic Problem

» A: a finite set of symbols

Ex:

A = {a, b, c, d, ..., z} or A = {a, b, c, ..., z, public, class, extends, implements, while, if, ...}

» A pattern P and an input sequence INPUT, both made of elements of A

Ex:

P = “a.*b” or P = “class.*extends.*”

INPUT = “aaabbbcc” or INPUT = a Java source file

→ The parser reads INPUT character by character and recognizes all subsequences matching pattern P

Page 12: Regular Expression

Classic Problem - Samples

» Split a sequence of characters into an array of subsequences

String path = "/portal/en/classic/home";
String[] segments = path.split("/");

» Handle comment blocks encountered in a file

» Override readLine() in BufferedReader

» Extract data from REST response

» Write an XML parser from scratch
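The first sample on this slide can be run as is. The sketch below only wraps it in a class (the name `PathSplitDemo` is made up for illustration) to show one detail that is easy to miss: the leading `/` produces an empty first segment.

```java
import java.util.Arrays;

class PathSplitDemo {
    // Splits a URL path on '/'. String.split treats its argument as a regex.
    static String[] segments(String path) {
        return path.split("/");
    }

    public static void main(String[] args) {
        // The text before the first '/' is empty, so segment 0 is "".
        System.out.println(Arrays.toString(segments("/portal/en/classic/home")));
        // prints [, portal, en, classic, home]
    }
}
```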

Page 13: Regular Expression

Finite State Machine & Classic Problem

» Acceptor FSM?

» How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?

Finding pattern occurrences ↔ Traversing a directed graph with labeled edges

Page 14: Regular Expression

FSM – Word Accepting

» Consider a word W: a sequence of characters from character set A

W = “abcd...xyz”

An FSM whose graph edges are labeled with characters from A accepts W if there exists a path from the START node to one of the END nodes whose edge labels, read in order, spell W

START = S1 → S2 → … → Sn = END

1. Intermediate nodes may appear more than once on the path

2. The transition S_i → S_(i+1) is determined by the i-th character of W
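The acceptance rule above can be sketched as a small table-driven walk. This is a minimal illustration, not a reference implementation: it assumes a deterministic machine (at most one edge per node and character), integer node ids, and Java 9+ for `Set.of` and `Map.of`. The class `WordAcceptor` and the sample machine in `main` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class WordAcceptor {
    // transitions.get(node).get(ch) is the node reached by the edge labeled ch.
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final int start;
    private final Set<Integer> ends;

    WordAcceptor(int start, Set<Integer> ends) {
        this.start = start;
        this.ends = ends;
    }

    // Adds a labeled edge and returns this, so edges can be chained.
    WordAcceptor edge(int from, char label, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
        return this;
    }

    // Follows one edge per character of w; accepts if the walk stops on an END node.
    boolean accepts(String w) {
        int node = start;
        for (int i = 0; i < w.length(); i++) {
            Integer next = transitions.getOrDefault(node, Map.of()).get(w.charAt(i));
            if (next == null) {
                return false; // no edge labeled with the i-th character
            }
            node = next;
        }
        return ends.contains(node);
    }

    public static void main(String[] args) {
        // Hypothetical machine for a, ab, abb, ...: node 1 loops on 'b',
        // so it is visited repeatedly on the path for "abbb".
        WordAcceptor fsm = new WordAcceptor(0, Set.of(1)).edge(0, 'a', 1).edge(1, 'b', 1);
        System.out.println(fsm.accepts("abbb")); // true
        System.out.println(fsm.accepts("ba"));   // false
    }
}
```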

Page 15: Regular Expression

FSM – Word Accepting

Page 16: Regular Expression

Acceptor FSM

» Consider a pattern P. An FSM is called an Acceptor FSM of P if it accepts exactly the words matching pattern P.

Ex:

The Acceptor FSM of “a[0-9]b” accepts exactly the elements of the word set

{ “a0b”, “a1b”, “a2b”, “a3b”, “a4b”, “a5b”, “a6b”, “a7b”, “a8b”, “a9b”}

Page 17: Regular Expression

How Does a Pattern Parser Work?

By traversing the directed graph associated with the Acceptor FSM:

1. Start from the root node

2. Read the next character from INPUT, then move according to the transition rules

3. Repeat the second step until reaching a leaf node or until INPUT is exhausted

4. Return OK if the leaf node denotes a successful match

Page 18: Regular Expression

Example One

» Recognize pattern

eXo.*er

in:

AAAeXo123erBBBeXoerCCCeXoeXoerDDD

Page 19: Regular Expression

Example One

» Acceptor FSM with 8 states:

START – start reading input sequence

e – encountered e

eX – encountered eX

eXo – encountered eXo

eXo.* – encountered eXo.*

eXo.*e – encountered eXo.*e

END – subsequence matching eXo.*er found

FAILURE
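One way to hand-code this machine is an enum plus a switch, as in the sketch below. It simplifies the slide's state list under stated assumptions: `eXo` and `eXo.*` are merged into a single state `EXO_ANY` (since `.*` may match nothing), FAILURE is not a separate state but simply "input exhausted before END", and the input is assumed to contain no line terminators (which `.` would not match in `java.util.regex`). All names are illustrative.

```java
class ExoScanner {
    enum State { START, E, EX, EXO_ANY, EXO_ANY_E, END }

    // Returns true as soon as some subsequence of input matches eXo.*er.
    static boolean containsMatch(String input) {
        State s = State.START;
        for (int i = 0; i < input.length() && s != State.END; i++) {
            char c = input.charAt(i);
            switch (s) {
                case START:
                    s = (c == 'e') ? State.E : State.START;
                    break;
                case E: // an 'e' seen here may itself start a new "eXo"
                    s = (c == 'X') ? State.EX : (c == 'e') ? State.E : State.START;
                    break;
                case EX:
                    s = (c == 'o') ? State.EXO_ANY : (c == 'e') ? State.E : State.START;
                    break;
                case EXO_ANY: // inside ".*": wait for an 'e' that could begin "er"
                    s = (c == 'e') ? State.EXO_ANY_E : State.EXO_ANY;
                    break;
                case EXO_ANY_E:
                    s = (c == 'r') ? State.END : (c == 'e') ? State.EXO_ANY_E : State.EXO_ANY;
                    break;
                default:
                    break;
            }
        }
        return s == State.END;
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("AAAeXo123erBBBeXoerCCCeXoeXoerDDD")); // true
        System.out.println(containsMatch("eXoe"));                              // false
    }
}
```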

Page 20: Regular Expression

Page 21: Regular Expression

Example Two

» Recognize comment block

/* */

in:

/* Don't ask */ final int innerClassVariable;

Page 22: Regular Expression

Example Two

» Acceptor FSM with 5 states:

START – start reading input sequence

OUT – stay away from comment blocks

ENTERING – at the beginning of comment block

IN – stay inside a comment block

LEAVING – at the end of comment block
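The states listed above map naturally onto a character loop. The sketch below removes `/* ... */` blocks from a string. It is a simplification under stated assumptions: START is folded into OUT, string literals and `//` line comments get no special treatment, and an unterminated comment is silently dropped. `CommentStripper` is a made-up name.

```java
class CommentStripper {
    enum State { OUT, ENTERING, IN, LEAVING }

    // Returns source with every /* ... */ block removed.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State s = State.OUT;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (s) {
                case OUT:
                    if (c == '/') {
                        s = State.ENTERING; // hold the '/' until the next character is known
                    } else {
                        out.append(c);
                    }
                    break;
                case ENTERING:
                    if (c == '*') {
                        s = State.IN;
                    } else if (c == '/') {
                        out.append('/'); // held '/' was ordinary; this one may still open a comment
                    } else {
                        out.append('/').append(c);
                        s = State.OUT;
                    }
                    break;
                case IN:
                    if (c == '*') {
                        s = State.LEAVING;
                    }
                    break;
                case LEAVING:
                    if (c == '/') {
                        s = State.OUT;
                    } else if (c != '*') {
                        s = State.IN; // the '*' was part of the comment body
                    }
                    break;
            }
        }
        if (s == State.ENTERING) {
            out.append('/'); // input ended right after a held '/'
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("/* Don't ask */final int innerClassVariable;"));
        // prints final int innerClassVariable;
    }
}
```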

Page 23: Regular Expression

Page 24: Regular Expression

Finite State Machine With Stack

» Example Two is slightly harder than Example One because the transition decision depends on past information → we must keep something in memory

» FSM with Stack = ordinary FSM + stack structure storing past information

A contextual transition is determined by the pair

(next input character, stack state)
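To make the (next input character, stack state) pair concrete, here is a minimal sketch of a stack-based acceptor for a language chosen purely for illustration: properly nested `()` and `[]`. It has a single control state, so the whole decision rests on the input character and the top of the stack. `BracketAcceptor` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketAcceptor {
    // Accepts words over ( ) [ ] whose brackets are properly nested.
    static boolean accepts(String w) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < w.length(); i++) {
            char c = w.charAt(i);
            if (c == '(' || c == '[') {
                stack.push(c); // remember the opener: this is the "past information"
            } else if (c == ')' || c == ']') {
                char expected = (c == ')') ? '(' : '[';
                // The move depends on the pair (c, top of stack).
                if (stack.isEmpty() || stack.pop() != expected) {
                    return false;
                }
            } else {
                return false; // character outside the alphabet
            }
        }
        return stack.isEmpty(); // every opener must have been closed
    }

    public static void main(String[] args) {
        System.out.println(accepts("([])()")); // true
        System.out.println(accepts("([)]"));   // false
    }
}
```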

Page 25: Regular Expression

Java Regex

Page 26: Regular Expression

Model

» Pattern: Acceptor Finite State Machine

» Matcher: Parser

Page 27: Regular Expression

java.util.regex.Pattern

» Constructs the FSM accepting a pattern

Pattern p = Pattern.compile("a.*b");

FSM states are instances of java.util.regex.Pattern$Node

» Generates a parser working on an input sequence

Matcher matcher = p.matcher("aaabbbb");

Page 28: Regular Expression

java.util.regex.Matcher

» Find next subsequence matching pattern

find()

» Get capturing groups from latest match

group()
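A typical `find()` / `group()` loop looks like the sketch below, run on the input from Example One. It also shows a point the slides do not spell out: `.*` is greedy, so `eXo.*er` yields a single long match, while the reluctant form `eXo.*?er` (introduced here only for comparison) yields three short ones. `FindAllDemo` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Collects every subsequence of input matched by regex, in order.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group()); // group() with no argument is the whole match
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "AAAeXo123erBBBeXoerCCCeXoeXoerDDD";
        System.out.println(findAll("eXo.*er", input));  // [eXo123erBBBeXoerCCCeXoeXoer]
        System.out.println(findAll("eXo.*?er", input)); // [eXo123er, eXoer, eXoeXoer]
    }
}
```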

Page 29: Regular Expression

Capturing Group

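Two Pattern objects

Pattern p = Pattern.compile("abcd.*efgh");
Pattern q = Pattern.compile("abcd(.*)efgh");
String text = "abcd12345efgh";
Matcher pM = p.matcher(text);
Matcher qM = q.matcher(text);

» pM.find() == qM.find(): both patterns match the same text

» qM.group(1) returns "12345", whereas pM.group(1) throws IndexOutOfBoundsException because p has no capturing group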
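The difference between the two patterns can be checked with `Matcher.groupCount()` before asking for a group. The helper below is a hypothetical convenience, not part of `java.util.regex`: it returns `null` instead of letting `group(index)` throw when the pattern has no such group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns group(index) of the first match, or null when nothing matches
    // or the pattern defines fewer than index capturing groups.
    static String groupOfFirstMatch(String regex, String text, int index) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find() || index > m.groupCount()) {
            return null;
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        String text = "abcd12345efgh";
        System.out.println(groupOfFirstMatch("abcd(.*)efgh", text, 1)); // 12345
        System.out.println(groupOfFirstMatch("abcd.*efgh", text, 1));   // null
        System.out.println(groupOfFirstMatch("abcd.*efgh", text, 0));   // abcd12345efgh
    }
}
```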

Page 30: Regular Expression

Capturing Group

» Hold additional information on each match

while (matcher.find()) {
    matcher.group(index);
}

» Pattern P = (A)(B(C))

matcher.group(0) = the whole sequence ABC
matcher.group(1) = A
matcher.group(2) = BC
matcher.group(3) = C
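Groups are numbered by the position of their opening parenthesis, counted from left to right. A quick way to see this is to dump all groups of a concrete instance of the pattern, here `(a)(b(c))` against `abc` (both chosen for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns groups 0..groupCount() if regex matches the whole text, else an empty list.
    static List<String> groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("(a)(b(c))", "abc")); // [abc, a, bc, c]
    }
}
```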

Page 31: Regular Expression

Capturing Group

» Pattern.compile("abc(defgh");
Pattern.compile("abcdef)gh");

→ PatternSyntaxException (unbalanced parenthesis)

» Pattern.compile("abc\\(defgh");
Pattern.compile("abcdef\\)gh");

→ Success: the escape character '\' turns each parenthesis into a literal
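The four cases on this slide can be verified with a small probe that catches `PatternSyntaxException`. The sketch also uses `Pattern.quote`, a standard method not mentioned on the slide, which escapes a whole literal at once. `EscapeDemo` and `compiles` are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // Reports whether regex is syntactically valid for java.util.regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));    // false: group never closed
        System.out.println(compiles("abcdef)gh"));    // false: nothing to close
        System.out.println(compiles("abc\\(defgh"));  // true: '(' is a literal
        System.out.println(compiles("abcdef\\)gh"));  // true: ')' is a literal
        // Pattern.quote escapes every metacharacter in the literal.
        System.out.println(Pattern.matches(Pattern.quote("abc(defgh"), "abc(defgh")); // true
    }
}
```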

Page 32: Regular Expression

Operators

» Union

[a-zA-Z-0-9]

» Negation

[^abc]

[^X]
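A short sketch of both operators in use. The negated class `[^a-zA-Z0-9]` and the nested union `[a-d[m-p]]` (the form shown in the `java.util.regex.Pattern` documentation) are examples picked for illustration, as is the sample string.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Negation: delete every character that is NOT a letter or digit.
    static String alphanumericOnly(String s) {
        return s.replaceAll("[^a-zA-Z0-9]", "");
    }

    // Union: true if c lies in a..d or in m..p.
    static boolean inUnion(char c) {
        return Pattern.matches("[a-d[m-p]]", String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(alphanumericOnly("GateIn-3.2 (eXo)")); // GateIn32eXo
        System.out.println(inUnion('n')); // true
        System.out.println(inUnion('e')); // false
    }
}
```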

Page 33: Regular Expression

Contextual Match

» X(?=Y)

Positive lookahead: match X only if Y follows

» X(?!Y)

Negative lookahead: match X only if Y does not follow

» (?<=Y)X

Positive lookbehind: match X only if Y precedes it

» (?<!Y)X

Negative lookbehind: match X only if Y does not precede it
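The four contextual forms can be compared on one input by recording where each match starts. The sample string and the helper `starts` are made up for this sketch. Note that the lookaround part is never included in the match itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // "eXo" starts at 0 and 12; "Cloud" starts at 15 and 27.
        String input = "eXoPlatform eXoCloud GateInCloud";
        System.out.println(starts("eXo(?=Cloud)", input));  // [12]
        System.out.println(starts("eXo(?!Cloud)", input));  // [0]
        System.out.println(starts("(?<=eXo)Cloud", input)); // [15]
        System.out.println(starts("(?<!eXo)Cloud", input)); // [27]
    }
}
```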

Page 34: Regular Expression

Tips

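» A compiled Pattern is immutable and safe to share between threads → maximize reuse

We often see:

static final Pattern p = Pattern.compile("a*b");

» Be careful with String.split: its argument is a regular expression

For a single-character delimiter on a hot path, compare String.split with a plain Java loop + String.charAt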
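Both tips can be illustrated in one sketch: a `Pattern` compiled once and reused, and a hand-written splitter built on `String.charAt`. The splitter is an assumption about what "Java loop + String.charAt" means in practice. Unlike `String.split`, it treats the delimiter literally and keeps empty segments. No performance numbers are claimed here; measure before choosing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitTips {
    // Compiled once; a Pattern is immutable, so it can be shared freely.
    static final Pattern SLASH = Pattern.compile("/");

    static String[] splitWithPattern(String path) {
        return SLASH.split(path);
    }

    // Splits on a literal delimiter character, with no regex involved.
    static List<String> splitOnChar(String s, char delimiter) {
        List<String> parts = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == delimiter) {
                parts.add(s.substring(begin, i));
                begin = i + 1;
            }
        }
        parts.add(s.substring(begin)); // last segment, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitWithPattern("a/b/c").length); // 3
        System.out.println(splitOnChar("a.b", '.'));          // [a, b]
        // Pitfall: "." is a regex matching any character, so every character
        // becomes a delimiter and all the resulting empty strings are dropped.
        System.out.println("a.b".split(".").length);          // 0
    }
}
```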

Page 35: Regular Expression

Parsers in GateIn

Page 36: Regular Expression

Parsers in GateIn

» JavaScript Compressor

» CSS Compressor

» Groovy Template Optimizer

» Navigation Controller

Extracting URL parameters = regex matching + backtracking algorithm

» StaxNavigator (Nice XML parser based on StAX)

Page 37: Regular Expression

Advanced Theory

Page 38: Regular Expression

Grammar & Language

» Any word matching pattern eXo.*er is produced by a combination of transforms, starting from S

S → eXoQer
Q → RQT
Q → ''
R → {a,b,c,d,...}
T → {a,b,c,d,...}

» Language of a Grammar = set of words generated by a finite combination of transforms, starting from S

Ex: Any valid Java source file is generated by a finite number of transforms listed in the Java grammar (JLS)
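The idea of generating words by repeated transforms can be tried out mechanically. The sketch below reads the transforms on this slide as S → eXoQer, Q → RQT, Q → '', with R and T each producing one character, and shrinks the alphabet to {a, b} to keep the output small (an assumption made only for this demo). Under this reading Q always has even length, so the demo checks one direction only: every derived word matches `eXo.*er`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class GrammarDemo {
    // Hypothetical two-letter stand-in for {a,b,c,d,...}.
    static final char[] ALPHABET = {'a', 'b'};

    // All expansions of Q using at most `depth` nested applications of Q → RQT.
    static List<String> expandQ(int depth) {
        List<String> words = new ArrayList<>();
        words.add(""); // Q → ''
        if (depth > 0) {
            for (String inner : expandQ(depth - 1)) {
                for (char r : ALPHABET) {
                    for (char t : ALPHABET) {
                        words.add(r + inner + t); // Q → RQT
                    }
                }
            }
        }
        return words;
    }

    // S → eXoQer
    static List<String> expandS(int depth) {
        List<String> words = new ArrayList<>();
        for (String q : expandQ(depth)) {
            words.add("eXo" + q + "er");
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(expandS(1)); // [eXoer, eXoaaer, eXoaber, eXobaer, eXobber]
        Pattern p = Pattern.compile("eXo.*er");
        System.out.println(expandS(2).stream().allMatch(w -> p.matcher(w).matches())); // true
    }
}
```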

Page 39: Regular Expression

Finite State Machine & Language

» The language accepted by an FSM with Stack can be generated by a context-free grammar

» Conversely, the language of a context-free grammar is accepted by an FSM with Stack

Both directions come with explicit constructions. They are the stack-based analogue of Kleene's theorem, which relates regular expressions to plain Finite State Machines