1 Regular Expressions and Automata September 3 2009 Lecture #2.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
-
Upload
shana-patrick -
Category
Documents
-
view
224 -
download
0
Transcript of Natural Language Processing Lecture 4 : Regular Expressions and Automata.
Natural Language Processing
Lecture 4 : Regular Expressions and Automata
2
• A language is a set of strings• String: A sequence of letters
Examples: “cat”, “dog”, “house”, …
Defined over an alphabet:
Definitions
zcba ,,,,
3
Alphabets and Strings
• We will use small alphabets:• Strings
abbaw
bbbaaav
abu
ba,
baaabbbaaba
baba
abba
ab
a
Regular Expressions
• In computer science, RE is a language used for specifying text search string.
• A regular expression is a formula in a special language that is used for specifying a simple class of string.
• Formally, a regular expression is an algebraic notation for characterizing a set of strings.
• RE search requires a pattern that we want to search for, and a corpus of texts to search through.
5
Basic Regular Expression Patterns
• The use of the brackets [] to specify a disjunction of characters.
• The use of the brackets [] plus the dash - to specify a range.
6
Basic Regular Expression Patterns
• Uses of the caret ^ for negation or just to mean ^
• The question-mark ? marks optionality of the previous expression.
• The use of period . to specify any character
7
Disjunction, Grouping, and Precedence
• Disjunction• /cat|dog• Precedence• /gupp(y|ies)• To find the English article the• /the/• /[tT]he/• /[^a-zA-Z][tT]he[^a-zA-Z]/
8
Aliases for common sets of characters
9
Regular expression operators for counting
10
Some characters that need to be backslashed
11
Finite State Automata
• FSAs recognize the regular languages represented by regular expressions SheepTalk: /baa+!/
• Directed graph with labeled nodes and arc transitions
•Five states: q0 the start state, q4 the final state, 5 transitions
q0q0 q4q4q1q1 q2q2 q3q3
b aa
a !
12
Formally
• FSA is a 5-tuple consisting of Q: set of states {q0,q1,q2,q3,q4}: an alphabet of symbols {a,b,!} q0: A start state F: a set of final states in Q {q4} (q,i): a transition function mapping Q
x to Q
q0q0 q4q4q1q1 q2q2 q3q3
b a
a
a !
13
• FSA recognizes (accepts) strings of a regular language baa! baaa! baaaa! …
• Tape Input: a rejected input
a b a ! b
q0q0 q4q1q1 q2q2 q3b a
aa !
14
State Transition Table for SheepTalk
StateInput
b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4 Ø Ø Ø
baa! baaa! baaaa! baaaaa! ...
q0q0 q4q4q1q1 q2q2 q3q3
b aa
a !
15
Non-Deterministic FSAs for SheepTalk
q0q0 q4q4q1q1 q2q2 q3q3
b a a a !
q0q0 q4q4q1q1 q2q2 q3q3
b a a !
16
Finite Accepter
• Input
“Accept” or“Reject”
String
FiniteAutomata
Output
17
Transition Graph
•
initialstate
final state“accept”state
transition
abba -Finite Accepter
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
18
Initial Configuration
•
1q 2q 3q 4qa b b a
5q
a a bb
ba,
Input Stringa b b a
ba,
0q
04/21/23 19
Reading the Input
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/21/23 20
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/21/23 21
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/21/23 22
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b b a
ba,
04/21/23 23
0q 1q 2q 3q 4qa b b a
Output: “accept”
5q
a a bb
ba,
a b b a
ba,
04/21/23 24
Rejection
•
1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
0q
04/21/23 25
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/21/23 26
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/21/23 27
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b a
ba,
04/21/23 28
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
Output:“reject”
a b a
ba,
04/21/23 29
Another Example
a
b ba,
ba,
0q 1q 2q
a ba
04/21/23 30
a
b ba,
ba,
0q 1q 2q
a ba
04/21/23 31
a
b ba,
ba,
0q 1q 2q
a ba
04/21/23 32
a
b ba,
ba,
0q 1q 2q
a ba
04/21/23 33
a
b ba,
ba,
0q 1q 2q
a ba
Output: “accept”
04/21/23 34
Rejection
a
b ba,
ba,
0q 1q 2q
ab b
04/21/23 35
a
b ba,
ba,
0q 1q 2q
ab b
04/21/23 36
a
b ba,
ba,
0q 1q 2q
ab b
04/21/23 37
a
b ba,
ba,
0q 1q 2q
ab b
04/21/23 38
a
b ba,
ba,
0q 1q 2q
ab b
Output: “reject”
04/21/23 39
Formalities
• Deterministic Finite Accepter (DFA)
FqQM ,,,, 0Q
0q
F
: set of states
: input alphabet
: transition function
: initial state
: set of final states
04/21/23 40
About Alphabets
• Alphabets means we need a finite set of symbols in the input.
• These symbols can and will stand for bigger objects that can have internal structure.
04/21/23 41
Input Aplhabet
•
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
ba,
04/21/23 42
Set of States
Q
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
543210 ,,,,, qqqqqqQ
ba,
04/21/23 43
Initial State
0q
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/21/23 44
Set of Final States
F
0q 1q 2q 3qa b b a
5q
a a bb
ba,
4qF
ba,
4q
04/21/23 45
Transition Function
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
QQ :
ba,
04/21/23 46
10 , qaq
2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q 1q
04/21/23 47
50 , qbq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/21/23 48
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
32 , qbq
04/21/23 49
Transition Function
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
a b
0q
1q
2q
3q
4q
5q
1q 5q
5q 2q
2q 3q
4q 5q
ba,5q5q5q5q
04/21/23 50
Extended Transition Function(Reads the entire string) *
QQ *:*
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
04/21/23 51
20 ,* qabq
3q 4qa b b a
5q
a a bb
ba,
ba,
0q 1q 2q
04/21/23 52
40 ,* qabbaq
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
04/21/23 53
50 ,* qabbbaaq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
04/21/23 54
50 ,* qabbbaaq
1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
0q
Observation: There is a walk from to with label
0q 5qabbbaa
04/21/23 55
Example
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
abbaML M
accept
04/21/23 56
Another Example
0q 1q 2q 3q 4qa b b a
5q
a a bb
ba,
ba,
abbaabML , M
acceptacceptaccept
04/21/23 57
More Examples
a
b ba,
ba,
0q 1q 2q
}0:{ nbaML n
accept trap state
04/21/23 58
ML = { all substrings with prefix }ab
a b
ba,
0q 1q 2q
accept
ba,3q
ab
04/21/23 59
ML = { all strings without substring }001
0 00 001
1
0
1
10
0 1,0
04/21/23 60
Regular Languages
• A language is regular if there is a DFA such that
• All regular languages form a language family
LM MLL
04/21/23 61
Example
• The language• is regular:
*,: bawawaL
a
b
ba,
a
b
ba
0q 2q 3q
4q
04/21/23 62
Dollars and Cents