Natural Language Processing Lecture 4 : Regular Expressions and Automata.

62
Natural Language Processing Lecture 4 : Regular Expressions and Automata

Transcript of Natural Language Processing Lecture 4 : Regular Expressions and Automata.

Page 1: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

Natural Language Processing

Lecture 4 : Regular Expressions and Automata

Page 2: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

2

• A language is a set of strings• String: A sequence of letters

Examples: “cat”, “dog”, “house”, …

Defined over an alphabet:

Definitions

zcba ,,,,

Page 3: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

3

Alphabets and Strings

• We will use small alphabets:• Strings

abbaw

bbbaaav

abu

ba,

baaabbbaaba

baba

abba

ab

a

Page 4: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

Regular Expressions

• In computer science, RE is a language used for specifying text search string.

• A regular expression is a formula in a special language that is used for specifying a simple class of string.

• Formally, a regular expression is an algebraic notation for characterizing a set of strings.

• RE search requires a pattern that we want to search for, and a corpus of texts to search through.

Page 5: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

5

Basic Regular Expression Patterns

• The use of the brackets [] to specify a disjunction of characters.

• The use of the brackets [] plus the dash - to specify a range.

Page 6: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

6

Basic Regular Expression Patterns

• Uses of the caret ^ for negation or just to mean ^

• The question-mark ? marks optionality of the previous expression.

• The use of period . to specify any character

Page 7: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

7

Disjunction, Grouping, and Precedence

• Disjunction• /cat|dog• Precedence• /gupp(y|ies)• To find the English article the• /the/• /[tT]he/• /[^a-zA-Z][tT]he[^a-zA-Z]/

Page 8: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

8

Aliases for common sets of characters

Page 9: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

9

Regular expression operators for counting

Page 10: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

10

Some characters that need to be backslashed

Page 11: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

11

Finite State Automata

• FSAs recognize the regular languages represented by regular expressions SheepTalk: /baa+!/

• Directed graph with labeled nodes and arc transitions

•Five states: q0 the start state, q4 the final state, 5 transitions

q0q0 q4q4q1q1 q2q2 q3q3

b aa

a !

Page 12: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

12

Formally

• FSA is a 5-tuple consisting of Q: set of states {q0,q1,q2,q3,q4}: an alphabet of symbols {a,b,!} q0: A start state F: a set of final states in Q {q4} (q,i): a transition function mapping Q

x to Q

q0q0 q4q4q1q1 q2q2 q3q3

b a

a

a !

Page 13: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

13

• FSA recognizes (accepts) strings of a regular language baa! baaa! baaaa! …

• Tape Input: a rejected input

a b a ! b

q0q0 q4q1q1 q2q2 q3b a

aa !

Page 14: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

14

State Transition Table for SheepTalk

StateInput

b a !

0 1 Ø Ø

1 Ø 2 Ø

2 Ø 3 Ø

3 Ø 3 4

4 Ø Ø Ø

baa! baaa! baaaa! baaaaa! ...

q0q0 q4q4q1q1 q2q2 q3q3

b aa

a !

Page 15: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

15

Non-Deterministic FSAs for SheepTalk

q0q0 q4q4q1q1 q2q2 q3q3

b a a a !

q0q0 q4q4q1q1 q2q2 q3q3

b a a !

Page 16: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

16

Finite Accepter

• Input

“Accept” or“Reject”

String

FiniteAutomata

Output

Page 17: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

17

Transition Graph

initialstate

final state“accept”state

transition

abba -Finite Accepter

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

Page 18: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

18

Initial Configuration

1q 2q 3q 4qa b b a

5q

a a bb

ba,

Input Stringa b b a

ba,

0q

Page 19: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 19

Reading the Input

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

Page 20: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 20

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

Page 21: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 21

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

Page 22: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 22

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b b a

ba,

Page 23: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 23

0q 1q 2q 3q 4qa b b a

Output: “accept”

5q

a a bb

ba,

a b b a

ba,

Page 24: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 24

Rejection

1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

0q

Page 25: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 25

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

Page 26: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 26

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

Page 27: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 27

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b a

ba,

Page 28: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 28

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

Output:“reject”

a b a

ba,

Page 29: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 29

Another Example

a

b ba,

ba,

0q 1q 2q

a ba

Page 30: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 30

a

b ba,

ba,

0q 1q 2q

a ba

Page 31: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 31

a

b ba,

ba,

0q 1q 2q

a ba

Page 32: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 32

a

b ba,

ba,

0q 1q 2q

a ba

Page 33: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 33

a

b ba,

ba,

0q 1q 2q

a ba

Output: “accept”

Page 34: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 34

Rejection

a

b ba,

ba,

0q 1q 2q

ab b

Page 35: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 35

a

b ba,

ba,

0q 1q 2q

ab b

Page 36: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 36

a

b ba,

ba,

0q 1q 2q

ab b

Page 37: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 37

a

b ba,

ba,

0q 1q 2q

ab b

Page 38: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 38

a

b ba,

ba,

0q 1q 2q

ab b

Output: “reject”

Page 39: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 39

Formalities

• Deterministic Finite Accepter (DFA)

FqQM ,,,, 0Q

0q

F

: set of states

: input alphabet

: transition function

: initial state

: set of final states

Page 40: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 40

About Alphabets

• Alphabets means we need a finite set of symbols in the input.

• These symbols can and will stand for bigger objects that can have internal structure.

Page 41: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 41

Input Aplhabet

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

ba,

Page 42: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 42

Set of States

Q

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

543210 ,,,,, qqqqqqQ

ba,

Page 43: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 43

Initial State

0q

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

Page 44: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 44

Set of Final States

F

0q 1q 2q 3qa b b a

5q

a a bb

ba,

4qF

ba,

4q

Page 45: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 45

Transition Function

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

QQ :

ba,

Page 46: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 46

10 , qaq

2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q 1q

Page 47: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 47

50 , qbq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

Page 48: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 48

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

32 , qbq

Page 49: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 49

Transition Function

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

a b

0q

1q

2q

3q

4q

5q

1q 5q

5q 2q

2q 3q

4q 5q

ba,5q5q5q5q

Page 50: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 50

Extended Transition Function(Reads the entire string) *

QQ *:*

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

Page 51: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 51

20 ,* qabq

3q 4qa b b a

5q

a a bb

ba,

ba,

0q 1q 2q

Page 52: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 52

40 ,* qabbaq

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

Page 53: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 53

50 ,* qabbbaaq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

Page 54: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 54

50 ,* qabbbaaq

1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

0q

Observation: There is a walk from to with label

0q 5qabbbaa

Page 55: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 55

Example

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

abbaML M

accept

Page 56: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 56

Another Example

0q 1q 2q 3q 4qa b b a

5q

a a bb

ba,

ba,

abbaabML , M

acceptacceptaccept

Page 57: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 57

More Examples

a

b ba,

ba,

0q 1q 2q

}0:{ nbaML n

accept trap state

Page 58: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 58

ML = { all substrings with prefix }ab

a b

ba,

0q 1q 2q

accept

ba,3q

ab

Page 59: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 59

ML = { all strings without substring }001

0 00 001

1

0

1

10

0 1,0

Page 60: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 60

Regular Languages

• A language is regular if there is a DFA such that

• All regular languages form a language family

LM MLL

Page 61: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 61

Example

• The language• is regular:

*,: bawawaL

a

b

ba,

a

b

ba

0q 2q 3q

4q

Page 62: Natural Language Processing Lecture 4 : Regular Expressions and Automata.

04/21/23 62

Dollars and Cents