Natural Language Processing, Lecture 4: Regular Expressions and Automata

Posted on 18-Jan-2016


Natural Language Processing

Lecture 4 : Regular Expressions and Automata

Definitions

• A language is a set of strings.

• String: a sequence of letters.

Examples: “cat”, “dog”, “house”, …

Defined over an alphabet: $\Sigma = \{a, b, c, \ldots, z\}$

Alphabets and Strings

• We will use small alphabets: $\Sigma = \{a, b\}$

• Strings: a, ab, abba, baba, baaabbbaaba

$u = ab$, $v = bbbaaa$, $w = abba$

Regular Expressions

• In computer science, a regular expression (RE) is a language for specifying text search strings.

• A regular expression is a formula in a special language that specifies a simple class of strings.

• Formally, a regular expression is an algebraic notation for characterizing a set of strings.

• An RE search requires a pattern that we want to search for and a corpus of texts to search through.
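As a minimal sketch of the pattern-plus-corpus idea, Python's built-in re module can search a small corpus line by line. The three corpus lines and the pattern are made up for illustration.

```python
import re

# Hypothetical three-line corpus; any list of strings would do.
corpus = [
    "The cat sat on the mat.",
    "A dog barked.",
    "Cats and dogs rarely agree.",
]

pattern = re.compile(r"cat")  # the pattern we want to search for

# Keep every corpus line in which the pattern occurs somewhere.
matches = [line for line in corpus if pattern.search(line)]
print(matches)
```

The search is case-sensitive, so the line starting with "Cats" is not returned.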


Basic Regular Expression Patterns

• The use of the brackets [] to specify a disjunction of characters.

• The use of the brackets [] plus the dash - to specify a range.
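A short sketch of both bracket forms using Python's re module; the sample text is invented for this example.

```python
import re

text = "Woodchuck 7 met woodchuck 42"

# [wW] is a disjunction: it matches either 'w' or 'W'.
either_case = re.findall(r"[wW]oodchuck", text)

# [0-9] is a range: it matches any single digit.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches any single uppercase letter.
uppercase = re.findall(r"[A-Z]", text)

print(either_case, digits, uppercase)
```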

Basic Regular Expression Patterns

• The caret ^ negates a bracketed set when it is the first symbol inside [], as in [^a-z]; anywhere else inside the brackets it simply means a literal ^.

• The question mark ? marks the previous expression as optional.

• The period . matches any single character.
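The three operators can be tried out with Python's re module. This is only a sketch; the sample strings are made up, and in Python the period does not match a newline unless the DOTALL flag is set.

```python
import re

# Caret first inside brackets: match anything that is NOT a lowercase letter.
not_lower = re.findall(r"[^a-z]", "aB3c")

# Caret not first inside brackets: it is just a literal '^'.
literal_caret = re.findall(r"[e^]", "look up ^ here")

# '?' makes the preceding 'u' optional.
optional_u = re.findall(r"colou?r", "color and colour")

# '.' stands for any single character between "beg" and "n".
any_char = re.findall(r"beg.n", "begin began begun beg3n")

print(not_lower, literal_caret, optional_u, any_char)
```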

Disjunction, Grouping, and Precedence

• Disjunction

/cat|dog/

• Precedence

/gupp(y|ies)/

• To find the English article "the"

/the/

/[tT]he/

/[^a-zA-Z][tT]he[^a-zA-Z]/
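The patterns above can be compared side by side in Python. This is a sketch on a made-up sentence; the group is written as (?:...) only so that re.findall returns whole matches instead of the group contents.

```python
import re

# Disjunction: either alternative may match.
pets = re.findall(r"cat|dog", "my cat chased a dog")

# Grouping: the parentheses limit the scope of '|' to the suffix.
guppies = re.findall(r"gupp(?:y|ies)", "one guppy, two guppies")

text = "The other day the cat sat there."

# /the/ misses the capitalized "The" and also fires inside "other" and "there".
naive = re.findall(r"the", text)

# /[tT]he/ finds "The" too, but still fires inside longer words.
either_case = re.findall(r"[tT]he", text)

# Requiring a non-letter on both sides removes the embedded hits.
# It also misses "The" at the very start, where no character precedes it.
bounded = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

print(pets, guppies, len(naive), len(either_case), bounded)
```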


Aliases for common sets of characters
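As an illustration, here are a few of the shorthand classes that Python's re module supports, applied to a made-up sample string. This is a sketch and not necessarily the exact set of aliases listed on the slide.

```python
import re

sample = "Room 42, floor 3"

digits = re.findall(r"\d", sample)     # \d: any digit, like [0-9] for ASCII text
words = re.findall(r"\w+", sample)     # \w: a letter, digit, or underscore
spaces = re.findall(r"\s", sample)     # \s: whitespace (space, tab, newline, ...)
non_word = re.findall(r"\W", sample)   # \W: anything that \w does not match

print(digits, words, spaces, non_word)
```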


Regular expression operators for counting
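A brief sketch of the counting operators as Python's re module implements them, tried on a made-up line of sheep sounds.

```python
import re

sheep = "ba! baa! baaa! baaaa!"

star = re.findall(r"baa*!", sheep)        # a*    : zero or more 'a' after "ba"
plus = re.findall(r"baa+!", sheep)        # a+    : one or more 'a' after "ba"
exact = re.findall(r"ba{3}!", sheep)      # a{3}  : exactly three 'a'
between = re.findall(r"ba{2,3}!", sheep)  # a{2,3}: two or three 'a'
at_least = re.findall(r"ba{3,}!", sheep)  # a{3,} : three or more 'a'

print(star, plus, exact, between, at_least)
```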


Some characters that need to be backslashed
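A small sketch of why the backslash matters, using Python's re module on made-up strings. The helper re.escape is a real function of that module; it adds the backslashes for us.

```python
import re

text = "Is 3.5*2 = 7? (yes)"

# An unescaped '.' matches any character; '\.' matches only a literal period.
any_dot = re.findall(r"3.5", "3.5 345")
literal_dot = re.findall(r"3\.5", "3.5 345")

# '*', '?', '(' and ')' are operators, so they need a backslash to be literal.
literal_star = re.findall(r"\*", text)
literal_question = re.findall(r"\?", text)
literal_parens = re.findall(r"\(yes\)", text)

# re.escape backslashes every special character in a plain string.
escaped = re.escape("3.5*2")

print(any_dot, literal_dot, literal_star, literal_question, literal_parens)
```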

Finite State Automata

• FSAs recognize the regular languages represented by regular expressions.

SheepTalk: /baa+!/

• An FSA is a directed graph with labeled nodes (states) and arcs (transitions).

• Five states: q0 is the start state, q4 is the final state; there are 5 transitions.

[Diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a loop on q3 labeled a]

Formally

• An FSA is a 5-tuple consisting of:

$Q$: a set of states, here $\{q_0, q_1, q_2, q_3, q_4\}$

$\Sigma$: an alphabet of symbols, here $\{a, b, !\}$

$q_0$: a start state

$F$: a set of final states, $F \subseteq Q$, here $\{q_4\}$

$\delta(q, i)$: a transition function mapping $Q \times \Sigma$ to $Q$

• An FSA recognizes (accepts) the strings of a regular language: baa!, baaa!, baaaa!, …

• Tape input: a rejected input

a b a ! b

State Transition Table for SheepTalk

         Input
State    b    a    !
0        1    Ø    Ø
1        Ø    2    Ø
2        Ø    3    Ø
3        Ø    3    4
4        Ø    Ø    Ø

Accepted strings: baa! baaa! baaaa! baaaaa! ...
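The transition table can drive a recognizer directly. The following is a minimal sketch: the names SHEEP_TABLE and recognize are made up here, and a missing dictionary entry stands in for the Ø cells of the table.

```python
# Transition table for SheepTalk; a missing entry plays the role of Ø.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START_STATE = 0
FINAL_STATES = {4}


def recognize(tape):
    """Return True if the table-driven FSA accepts the whole tape."""
    state = START_STATE
    for symbol in tape:
        next_state = SHEEP_TABLE[state].get(symbol)
        if next_state is None:       # Ø: no legal transition, so reject
            return False
        state = next_state
    return state in FINAL_STATES     # accept only if we stop in a final state


for tape in ["baa!", "baaaa!", "ba!", "aba!b", "baa"]:
    print(tape, recognize(tape))
```

Only the first two tapes are accepted: "ba!" has too few a's, "aba!b" fails on its first symbol, and "baa" ends before reaching the final state.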

Non-Deterministic FSAs for SheepTalk

[Diagram: first NFA, states q0 to q4, arcs labeled b, a, a, a, !]

[Diagram: second NFA, states q0 to q4, arcs labeled b, a, a, !]
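A non-deterministic automaton can be simulated by tracking every state it might be in. The sketch below makes an assumption about the first diagram, since the arc layout is not fully recoverable: q2 is taken to have both a loop on a and an arc to q3 on a. The names NFA_TABLE and nd_recognize are made up for illustration.

```python
# Assumed NFA: from q2, reading 'a' may either stay in q2 or move on to q3.
NFA_TABLE = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the non-deterministic choice
    (3, "!"): {4},
}
FINAL_STATES = {4}


def nd_recognize(tape):
    """Simulate the NFA by following all possible states in parallel."""
    current = {0}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in NFA_TABLE.get((state, symbol), set())
        }
        if not current:              # every possible path has died
            return False
    return bool(current & FINAL_STATES)


for tape in ["baa!", "baaaa!", "ba!", "baa"]:
    print(tape, nd_recognize(tape))
```

This set-of-states simulation is one of several ways to handle non-determinism; a backtracking search over individual choices would accept the same strings.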

Finite Accepter

Input string → Finite Automaton → Output: “Accept” or “Reject”

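Transition Graph

abba - Finite Accepter

[Diagram: q0 is the initial state; transitions q0 -a-> q1 -b-> q2 -b-> q3 -a-> q4; q4 is the final (“accept”) state. Every other symbol (b from q0, a from q1, a from q2, b from q3, and a, b from q4) leads to q5, which loops to itself on a, b.]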

Initial Configuration

• Input string: a b b a

• The automaton starts in the initial state q0.

04/21/23 19

Reading the Input

• Input string: a b b a

• The automaton reads one symbol per step: q0 -a-> q1 -b-> q2 -b-> q3 -a-> q4

• Output: “accept”

Rejection

• Input string: a b a

• The automaton starts in q0 and reads q0 -a-> q1 -b-> q2 -a-> q5

• q5 is not a final state. Output: “reject”
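The accepting run on a b b a and the rejecting run on a b a can be replayed with a short simulation. This is a sketch: the transition table is an assumed reading of the diagram (the path q0, q1, q2, q3, q4 spells abba and every other move falls into q5), and the names DELTA and run are made up for illustration.

```python
# Transition function of the abba accepter; q5 is the trap state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q5",
    ("q1", "a"): "q5", ("q1", "b"): "q2",
    ("q2", "a"): "q5", ("q2", "b"): "q3",
    ("q3", "a"): "q4", ("q3", "b"): "q5",
    ("q4", "a"): "q5", ("q4", "b"): "q5",
    ("q5", "a"): "q5", ("q5", "b"): "q5",
}
INITIAL = "q0"
FINAL = {"q4"}


def run(tape):
    """Return the list of visited states and the verdict for the tape."""
    state = INITIAL
    visited = [state]
    for symbol in tape:
        state = DELTA[(state, symbol)]   # one move per input symbol
        visited.append(state)
    verdict = "accept" if state in FINAL else "reject"
    return visited, verdict


print(run("abba"))  # (['q0', 'q1', 'q2', 'q3', 'q4'], 'accept')
print(run("aba"))   # (['q0', 'q1', 'q2', 'q5'], 'reject')
```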

Another Example

[Diagram: q0 loops to itself on a; q0 -b-> q1; q1 -a,b-> q2; q2 loops to itself on a, b]

• Input string: a a b

• The automaton reads q0 -a-> q0 -a-> q0 -b-> q1

• Output: “accept”

Rejection

• Input string: ab b

• After reading the whole input, the automaton is in a non-final state.

• Output: “reject”

Formalities

• Deterministic Finite Accepter (DFA)

$M = (Q, \Sigma, \delta, q_0, F)$

$Q$: set of states

$\Sigma$: input alphabet

$\delta$: transition function

$q_0$: initial state

$F$: set of final states

About Alphabets

• An alphabet means that we need a finite set of symbols in the input.

• These symbols can and will stand for bigger objects that can have internal structure.

Input Alphabet $\Sigma$

$\Sigma = \{a, b\}$

Set of States $Q$

$Q = \{q_0, q_1, q_2, q_3, q_4, q_5\}$

Initial State $q_0$

Set of Final States $F$

$F = \{q_4\}$

Transition Function $\delta$

$\delta : Q \times \Sigma \to Q$

$\delta(q_0, a) = q_1$

$\delta(q_0, b) = q_5$

$\delta(q_2, b) = q_3$

Transition Function $\delta$

δ      a     b
q0     q1    q5
q1     q5    q2
q2     q5    q3
q3     q4    q5
q4     q5    q5
q5     q5    q5

Extended Transition Function $\delta^*$ (reads the entire string)

$\delta^* : Q \times \Sigma^* \to Q$

$\delta^*(q_0, ab) = q_2$

$\delta^*(q_0, abba) = q_4$

$\delta^*(q_0, abbbaa) = q_5$

Observation: there is a walk from $q_0$ to $q_5$ with label $abbbaa$.
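The extended transition function can be sketched recursively with the usual inductive definition, $\delta^*(q, \lambda) = q$ and $\delta^*(q, wa) = \delta(\delta^*(q, w), a)$, which the slides use implicitly. The names PATH, delta and delta_star are made up, and the table assumes that every move off the path q0, q1, q2, q3, q4 leads to the trap state q5.

```python
TRAP = "q5"

# Only the four arcs spelling a-b-b-a avoid the trap state.
PATH = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "b"): "q3",
    ("q3", "a"): "q4",
}


def delta(state, symbol):
    """One-step transition function of the abba accepter."""
    return PATH.get((state, symbol), TRAP)


def delta_star(state, word):
    """State reached from `state` after reading the entire `word`."""
    if word == "":
        return state                 # reading nothing changes nothing
    # Read all but the last symbol, then take one ordinary step.
    return delta(delta_star(state, word[:-1]), word[-1])


print(delta_star("q0", "ab"))      # q2
print(delta_star("q0", "abba"))    # q4
print(delta_star("q0", "abbbaa"))  # q5
```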

Example

$L(M) = \{abba\}$

[Diagram: the abba accepter $M$ with q4 as its accept state]

Another Example

$L(M) = \{\lambda, ab, abba\}$

[Diagram: the same transition graph $M$, now with three accept states]

More Examples

$L(M) = \{a^n b : n \ge 0\}$

[Diagram: q0 loops to itself on a; q0 -b-> q1 (accept); q1 -a,b-> q2 (trap state); q2 loops to itself on a, b]
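This language is also described by the regular expression a*b, which gives a way to check the automaton. The sketch below assumes the reading of the diagram in which q0 loops on a, b leads to the accept state q1, and everything after that falls into the trap state q2; the function names are made up for illustration.

```python
import itertools
import re

DELTA = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q2", ("q1", "b"): "q2",
    ("q2", "a"): "q2", ("q2", "b"): "q2",   # trap state
}


def accepts(word):
    """Run the DFA on `word` and report whether it stops in q1."""
    state = "q0"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state == "q1"


def agrees_with_regex(max_len=6):
    """Compare the DFA with the regular expression a*b on all short strings."""
    for n in range(max_len + 1):
        for letters in itertools.product("ab", repeat=n):
            word = "".join(letters)
            if accepts(word) != bool(re.fullmatch(r"a*b", word)):
                return False
    return True


print(accepts("aab"), accepts("b"), accepts("aba"), agrees_with_regex())
```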

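$L(M)$ = { all strings with prefix $ab$ }

[Diagram: q0 -a-> q1 -b-> q2; q2 is the accept state and loops to itself on a, b; all other transitions lead to the trap state q3, which loops to itself on a, b]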

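$L(M)$ = { all strings without substring $001$ }

[Diagram: DFA whose states are labeled with the portion of 001 matched so far (0, 00, 001); arcs are labeled 0 and 1, with a loop on 0, 1 at state 001]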
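One way to build such an automaton is to let each state remember how much of 001 has just been read. The transition table below is a sketch under that assumption (the state name "start" and the function names are made up), checked against a direct substring test.

```python
import itertools

# Each state records the longest prefix of "001" that ends at the current symbol.
DELTA = {
    ("start", "0"): "0",   ("start", "1"): "start",
    ("0", "0"): "00",      ("0", "1"): "start",
    ("00", "0"): "00",     ("00", "1"): "001",
    ("001", "0"): "001",   ("001", "1"): "001",   # trap: 001 has occurred
}


def avoids_001(word):
    """Return True if the DFA accepts, i.e. `word` never contains 001."""
    state = "start"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state != "001"   # every state except the trap is accepting


def matches_definition(max_len=8):
    """Compare the DFA with a plain substring test on all short binary strings."""
    return all(
        avoids_001("".join(bits)) == ("001" not in "".join(bits))
        for n in range(max_len + 1)
        for bits in itertools.product("01", repeat=n)
    )


print(avoids_001("0101"), avoids_001("10010"), matches_definition())
```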

Regular Languages

• A language $L$ is regular if there is a DFA $M$ such that $L = L(M)$.

• All regular languages form a language family.

Example

• The language $L = \{awa : w \in \{a, b\}^*\}$ is regular:

[Diagram: DFA with states q0, q2, q3, q4 and arcs labeled a, b]
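To see why this language is regular, it is enough to exhibit one DFA that accepts it. The arcs below are an assumption, since the diagram's layout is not fully recoverable: q0 moves to q2 on a and to the trap state q4 on b, q2 waits for an a, and q3 is the accept state reached whenever the last symbol read is a.

```python
import re

# Assumed transitions; only the state names q0, q2, q3, q4 come from the slide.
DELTA = {
    ("q0", "a"): "q2", ("q0", "b"): "q4",
    ("q2", "a"): "q3", ("q2", "b"): "q2",
    ("q3", "a"): "q3", ("q3", "b"): "q2",
    ("q4", "a"): "q4", ("q4", "b"): "q4",   # trap: the word did not start with a
}


def accepts(word):
    """Return True if `word` has the form a w a with w over {a, b}."""
    state = "q0"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state == "q3"


# The same language written as a regular expression.
AWA = re.compile(r"a[ab]*a")

for word in ["aa", "aba", "abba", "a", "ab", "baa", ""]:
    print(repr(word), accepts(word), bool(AWA.fullmatch(word)))
```

On every sample word the automaton and the regular expression give the same answer.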


Dollars and Cents