Inductive and statistical learning of formal grammars
Pierre Dupont
2002 Grammar Induction
Outline
• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work
Pierre Dupont 1
Machine Learning
Goal: to give the learning ability to a machine
Design programs whose performance improves over time
Inductive learning is a particular instance of machine learning
• Goal: to find a general law from examples
• Subproblem of theoretical computer science, artificial intelligence or pattern recognition
Grammar Induction or Grammatical Inference
Grammar induction is a particular case of inductive learning
The general law is represented by a formal grammar or an equivalent machine
The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet
A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process
Example: Data {aaabbb, ab} → Induction → Grammar: S → aSb, S → λ
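The induced grammar S → aSb, S → λ generates exactly the strings aⁿbⁿ. As a minimal illustration (the function name is ours, not from the slides), membership in this language can be checked directly:

```python
def in_anbn(s: str) -> bool:
    """Check membership in the language of S -> aSb | lambda, i.e. a^n b^n."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

# Both example strings from the slide belong to the language:
assert in_anbn("ab") and in_anbn("aaabbb")
assert not in_anbn("ba")
```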
Examples
• Natural language sentence
• Speech
• Chronological series
• Successive actions of a WEB user
• Successive moves during a chess game
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, . . .)
Pattern Recognition
[Figure: two handwritten digit contours ('3.4cont', '3.8cont') plotted on a grid; each form is characterized by a chain code over 8 directions (8dC), e.g. 000077766676666555545444443211000710112344543311001234454311]
Chromosome classification
String of Primitives
"=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="
[Figure: grey density and smoothed grey density derivative profiles along the median axis (position 0-600) of chromosome 2a; the centromere is marked]
A modeling hypothesis
[Diagram: an unknown grammar G0 generates the observed data; induction produces a grammar G]
• Find G as close as possible to G0
• The induction process does not prove the existence of G0
It is a modeling hypothesis
Learning paradigms
How to characterize learning?
• which concept classes can or cannot be learned?

• what is a good example?

• is it possible to learn in polynomial time?
Identification in the limit
[Diagram: G0 generates data d1, d2, . . . , dn; induction produces a sequence of grammars G1, G2, . . . converging to G∗]
• convergence in finite time to G∗
• G∗ is a representation of L(G0) (exact learning)
PAC Learning
[Diagram: G0 generates data d1, d2, . . . , dn; induction produces a sequence of grammars G1, G2, . . . converging to G∗]
• convergence to G∗
• G∗ is close enough to G0 with high probability
⇒ Probably Approximately Correct learning
• polynomial time complexity
Identification in the limit: good and bad news
The bad one. . .
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only
The good one. . .
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data
Other learnability results
• Identification in the limit in polynomial time

– DFAs cannot be efficiently identified in the limit
– unless we can ask equivalence and membership queries to an oracle

• PAC learning

– DFAs are not PAC learnable (under some cryptographic limitation assumption)
– unless we can ask membership queries to an oracle
• PAC learning with simple examples, i.e. examples drawn according to the conditional Solomonoff-Levin distribution

Pc(x) = λc 2^(−K(x|c))

where K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned

– regular languages are PACS learnable with positive examples only
– but Kolmogorov complexity is not computable!
Cognitive relevance of learning paradigms
A largely unsolved question
Learning paradigms seem irrelevant to model human learning:
• Gold’s identification in the limit framework has been criticized as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration
However learning models show that:
• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy
Regular Inference from Positive and Negative Data
Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton
Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA)
Example: L = (ba*a)*

[Automaton: the minimal DFA A(L), with states 0, 1, 2]
A few definitions
Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:

• every transition of A is used at least once

• every final state is used as accepting state of at least one string
Example: {ba, baa, baba, λ}

[Automaton: the minimal DFA for (ba*a)*, with states 0, 1, 2]
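The definition above can be checked mechanically. The sketch below assumes a reconstruction of the slide's three-state DFA for (ba*a)* (transitions 0 →b 1, 1 →a 2, 2 →a 2, 2 →b 1, with 0 and 2 final); function and variable names are illustrative:

```python
def structurally_complete(sample, delta, start, finals):
    """Check that, when accepting every string of the positive sample,
    every transition is used at least once and every final state
    serves as accepting state of at least one string."""
    used_transitions, used_finals = set(), set()
    for w in sample:
        q = start
        for a in w:
            used_transitions.add((q, a))
            q = delta[(q, a)]
        used_finals.add(q)  # the state where each string is accepted
    return used_transitions == set(delta) and used_finals == set(finals)

# Assumed DFA for (ba*a)*: states 0 (start, final), 1, 2 (final)
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "b"): 1}
print(structurally_complete(["ba", "baa", "baba", ""], delta, 0, {0, 2}))  # True
```

Dropping "baba" from the sample leaves the transition 2 →b 1 unused, so the check fails.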
Merging is fun
[Figure: a three-state automaton A1 and the automata A1/π obtained by merging states {0,1}, {1,2}, {0,2}, and {0,1,2}]
• Merging ⇔ definition of a partition π on the set of states
Example: {{0,1}, {2}}

• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalizing the language
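A quotient automaton A/π can be computed by relabeling every state with its block of π. The sketch below (the helper names and the toy automaton are ours, not from the slides) also shows that merging only ever grows the language:

```python
def quotient(transitions, finals, partition):
    """Merge states according to a partition pi of the state set.
    Returns transitions/finals of A/pi (merging may create non-determinism)."""
    block = {q: frozenset(b) for b in partition for q in b}
    return ({(block[p], a, block[q]) for (p, a, q) in transitions},
            {block[q] for q in finals})

def accepts(transitions, starts, finals, w):
    """Non-deterministic acceptance test."""
    states = set(starts)
    for a in w:
        states = {q for (p, x, q) in transitions if p in states and x == a}
    return bool(states & finals)

# Hypothetical automaton: 0 -a-> 1 -a-> 2, with 2 final (accepts only "aa")
trans, finals = {(0, "a", 1), (1, "a", 2)}, {2}
qt, qf = quotient(trans, finals, [{0, 1}, {2}])

# Merging states 0 and 1 generalizes the language: "a" becomes accepted
assert not accepts(trans, {0}, finals, "a")
assert accepts(qt, {frozenset({0, 1})}, qf, "a")
assert accepts(qt, {frozenset({0, 1})}, qf, "aa")  # the original string survives
```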
A theorem
The positive data can be represented by a prefix tree acceptor (PTA)
Example: {aa, abba, baa}

[Figure: the PTA of this sample, a tree-shaped automaton with states 0 to 8]
Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0) then there exists a partition π of the state set of the PTA such that PTA/π = A(L0)
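Building the PTA is a straightforward prefix walk: one state per distinct prefix of the sample. The sketch below is an illustrative implementation (not the lecture's code) and reproduces the 9-state PTA of the slide's sample:

```python
def build_pta(positive_sample):
    """Build the prefix tree acceptor (PTA): one state per distinct prefix,
    the empty prefix being the start state 0."""
    delta, finals = {}, set()
    states, next_id = {"": 0}, 1
    for w in positive_sample:
        for i, a in enumerate(w):
            prefix = w[:i + 1]
            if prefix not in states:
                states[prefix] = next_id
                next_id += 1
            delta[(states[w[:i]], a)] = states[prefix]
        finals.add(states[w])
    return delta, finals

# S+ = {aa, abba, baa} yields the 9-state PTA of the slide
delta, finals = build_pta(["aa", "abba", "baa"])
print(len(set(delta.values()) | {0}), sorted(finals))  # prints: 9 [2, 5, 8]
```

By construction the PTA is deterministic and accepts exactly S+.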
Summary
[Diagram: L0 generates the data; from S+ we build the PTA and look for a partition π such that PTA/π = A(L0)]
We observe some positive and negative data
The positive sample S+ comes from a regular language L0
The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0 (not an additional hypothesis but a way to restrict the search to reasonable generalizations!)
We build the Prefix Tree Acceptor of S+. By construction L(PTA) = S+
Merging states⇔ generalize S+
The negative sample S− helps to control over-generalization
Note: finding the minimal DFA consistent with S+, S− is NP-complete!
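The summary above suggests a simple search procedure: start from the PTA and merge states as long as no negative string becomes accepted. The following is a simplified sketch in the spirit of RPNI; the greedy pair order and the set-of-triples representation are ours, not the actual RPNI algorithm:

```python
from itertools import combinations

def accepts(trans, start, finals, w):
    """Acceptance test for a (possibly non-deterministic) automaton."""
    states = {start}
    for a in w:
        states = {q for (p, x, q) in trans if p in states and x == a}
    return bool(states & finals)

def merge_loop(trans, start, finals, negatives):
    """Greedily merge state pairs of an automaton built from S+ (e.g. the PTA),
    keeping a merge only if no negative string becomes accepted."""
    changed = True
    while changed:
        changed = False
        states = sorted({p for (p, _, _) in trans} | {q for (_, _, q) in trans}
                        | finals | {start})
        for p, q in combinations(states, 2):
            # tentatively merge state q into state p (a quotient: the language
            # can only grow, so every positive string stays accepted)
            t2 = {(p if u == q else u, x, p if v == q else v) for (u, x, v) in trans}
            f2 = {p if s == q else s for s in finals}
            s2 = p if start == q else start
            if not any(accepts(t2, s2, f2, w) for w in negatives):
                trans, finals, start = t2, f2, s2
                changed = True
                break
    return trans, start, finals

# PTA of S+ = {ba, baa}: 0 -b-> 1 -a-> 2 -a-> 3, finals {2, 3}
t, s, f = merge_loop({(0, "b", 1), (1, "a", 2), (2, "a", 3)}, 0, {2, 3},
                     ["a", "b", "ab"])
assert accepts(t, s, f, "ba") and accepts(t, s, f, "baa")       # positives kept
assert not any(accepts(t, s, f, w) for w in ["a", "b", "ab"])   # negatives rejected
```

The negative sample S− is what stops the loop from collapsing everything into a single universal state.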
Bayesian learning
Find a model M̂ which maximizes the likelihood of the data P(X|M) and the prior probability of the model P(M):

M̂ = argmax_M P(X|M) · P(M)
The PPTA (probabilistic prefix tree acceptor) maximizes the data likelihood
A smaller model (number of states) is a priori assumed more likely
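One concrete way to encode "smaller is a priori more likely" is a prior that decays with the number of states; the choice P(M) ∝ α^(number of states) below is purely illustrative, not from the slides:

```python
import math

def map_score(log_likelihood, num_states, alpha=0.5):
    """MAP objective: log P(X|M) + log P(M), with a size prior
    P(M) proportional to alpha**num_states (alpha < 1 favors small automata)."""
    return log_likelihood + num_states * math.log(alpha)

# A 3-state model with slightly lower likelihood can beat a 9-state PTA:
print(map_score(-10.0, 9) < map_score(-12.0, 3))  # True
```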
Links with Markov chains
A subclass of regular languages: the k-testable languages in the strict sense. A k-TSS language is generated by an automaton such that all subsequences sharing the same last k − 1 symbols lead to the same state
[Automaton: a k-TSS example whose states correspond to the last k − 1 symbols read]

p̂(a|bb) = C(bba) / C(bb)
A probabilistic k-TSS language is equivalent to a Markov chain of order k − 1
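The conditional estimate p̂(a|bb) = C(bba)/C(bb) can be computed directly from a sample; the sketch below (names and the toy sample are ours) counts overlapping occurrences, as Markov-chain counts require:

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, allowing overlaps (unlike str.count)."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

def ktss_estimate(sample, context, symbol):
    """Empirical estimate p(symbol | context) = C(context+symbol) / C(context),
    i.e. a k-TSS / order-(k-1) Markov estimate with k = len(context) + 1."""
    num = sum(count_overlapping(w, context + symbol) for w in sample)
    den = sum(count_overlapping(w, context) for w in sample)
    return num / den if den else 0.0

sample = ["bba", "abba", "bbb"]
print(ktss_estimate(sample, "bb", "a"))  # 0.5: C(bba) = 2, C(bb) = 4
```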
There exist probabilistic regular languages not reducible to Markov chains of any finite order

[Automaton: a two-state probabilistic automaton generating such a language]
Equivalence between PNFA and HMM
Probabilistic non-deterministic automata (PNFA), with no end-of-string probabilities, are equivalent to Hidden Markov Models (HMMs)
[Figure: a two-state PNFA and an equivalent two-state HMM with emission on transitions]
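The probability a PNFA assigns to a string is computed by a forward pass, exactly as for an HMM. The automaton below is a hypothetical example (the figure's numbers are not recoverable), and this variant includes end-of-string probabilities, as used for probabilistic automata defining distributions over finite strings:

```python
def pnfa_string_prob(transitions, initial, final, w):
    """Forward algorithm for a PNFA.
    transitions: dict (state, symbol) -> list of (next_state, prob)
    initial: dict state -> initial prob; final: dict state -> end prob."""
    alpha = dict(initial)  # forward probabilities over current states
    for a in w:
        nxt = {}
        for q, p in alpha.items():
            for r, t in transitions.get((q, a), []):
                nxt[r] = nxt.get(r, 0.0) + p * t
        alpha = nxt
    return sum(p * final.get(q, 0.0) for q, p in alpha.items())

# Hypothetical 2-state PNFA (all numbers chosen for illustration only)
trans = {(1, "a"): [(1, 0.3), (2, 0.2)], (2, "b"): [(1, 0.4)],
         (1, "b"): [(2, 0.1)]}
init, fin = {1: 1.0}, {1: 0.2, 2: 0.4}
print(pnfa_string_prob(trans, init, fin, "ab"))  # ≈ 0.028
```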
The smoothing problem
A probabilistic DFA defines a probability distribution over a set of strings
Some strings are not observed in the training sample but could be observed ⇒ their probability should be strictly positive

The smoothing problem: how to assign a reasonable probability to (yet) unseen random events?
Highly optimized smoothing techniques exist for Markov chains
How to adapt these techniques to more general probabilistic automata?
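A minimal instance of such a technique is additive (Lidstone) smoothing, shown here on symbol counts; the discount value is an illustrative choice, not from the lecture:

```python
from collections import Counter

def smoothed_prob(counts, symbol, alphabet, delta=0.5):
    """Additive (Lidstone) smoothing: every symbol of the alphabet gets a
    strictly positive probability, even if it was never observed."""
    total = sum(counts.values())
    return (counts.get(symbol, 0) + delta) / (total + delta * len(alphabet))

counts = Counter({"a": 3, "b": 1})
alphabet = {"a", "b", "c"}
print(smoothed_prob(counts, "c", alphabet))  # ≈ 0.0909, although "c" was never seen
```

The smoothed values still sum to 1 over the alphabet, so the result remains a probability distribution.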
Related problems and approaches
I did not talk about
• other induction problems (NFA, CFG, tree grammars, . . .)
• heuristic approaches such as neural nets or genetic algorithms
• how to use prior knowledge
• smoothing techniques
• how to parse natural language without a grammar (decision trees)
• how to learn transducers
• benchmarks, applications