Inductive and statistical learning of formal grammars
Pierre Dupont
2002 Grammar Induction
Outline
• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work
Pierre Dupont 1
Machine Learning
Goal: to give the learning ability to a machine
Design programs whose performance improves over time
Inductive learning is a particular instance of machine learning
• Goal: to find a general law from examples
• Subproblem of theoretical computer science, artificial intelligence or pattern recognition
Grammar Induction or Grammatical Inference
Grammar induction is a particular case of inductive learning
The general law is represented by a formal grammar or an equivalent machine
The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet
A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process
Example: Data {aaabbb, ab} → Induction → Grammar: S → aSb, S → λ
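The induced grammar S → aSb, S → λ generates exactly the strings aⁿbⁿ. As a minimal illustration (the function name is ours, not from the slides), membership in this language can be checked directly:

```python
def in_anbn(s: str) -> bool:
    """Check membership in the language of S -> aSb | lambda, i.e. a^n b^n."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

# Both example strings from the slide belong to the language:
assert in_anbn("ab") and in_anbn("aaabbb")
assert not in_anbn("ba")
```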
Examples
• Natural language sentence
• Speech
• Chronological series
• Successive actions of a WEB user
• Successive moves during a chess game
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, . . .)
Pattern Recognition
[Figure: two handwritten digit contours ('3.4cont', '3.8cont') plotted on a grid; each form is characterized by a chain code over 8 directions (8dC), e.g. 000077766676666555545444443211000710112344543311001234454311]
Chromosome classification
String of Primitives
"=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="
[Figure: grey density and smoothed grey density derivative profiles along the median axis (position 0-600) of chromosome 2a; the centromere is marked]
A modeling hypothesis
[Diagram: an unknown grammar G0 generates the observed data; induction produces a grammar G]
• Find G as close as possible to G0
• The induction process does not prove the existence of G0
It is a modeling hypothesis
Learning paradigms
How to characterize learning?
• which concept classes can or cannot be learned?

• what is a good example?

• is it possible to learn in polynomial time?
Identification in the limit
[Diagram: G0 generates data d1, d2, . . . , dn; induction produces a sequence of grammars G1, G2, . . . converging to G∗]
• convergence in finite time to G∗
• G∗ is a representation of L(G0) (exact learning)
PAC Learning
[Diagram: G0 generates data d1, d2, . . . , dn; induction produces a sequence of grammars G1, G2, . . . converging to G∗]
• convergence to G∗
• G∗ is close enough to G0 with high probability
⇒ Probably Approximately Correct learning
• polynomial time complexity
Identification in the limit: good and bad news
The bad one. . .
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only
The good one. . .
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data
Other learnability results
• Identification in the limit in polynomial time

– DFAs cannot be efficiently identified in the limit
– unless we can ask equivalence and membership queries to an oracle

• PAC learning

– DFAs are not PAC learnable (under some cryptographic limitation assumption)
– unless we can ask membership queries to an oracle
• PAC learning with simple examples, i.e. examples drawn according to the conditional Solomonoff-Levin distribution

Pc(x) = λc 2^(−K(x|c))

where K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned

– regular languages are PACS learnable with positive examples only
– but Kolmogorov complexity is not computable!
Cognitive relevance of learning paradigms
A largely unsolved question
Learning paradigms seem irrelevant to model human learning:
• Gold’s identification in the limit framework has been criticized as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration
However learning models show that:
• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy
Regular Inference from Positive and Negative Data
Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton
Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA)
Example: L = (ba*a)*

[Automaton: the minimal DFA A(L), with states 0, 1, 2]
A few definitions
Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:

• every transition of A is used at least once

• every final state is used as accepting state of at least one string
Example: {ba, baa, baba, λ}

[Automaton: the minimal DFA for (ba*a)*, with states 0, 1, 2]
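The definition above can be checked mechanically. The sketch below assumes a reconstruction of the slide's three-state DFA for (ba*a)* (transitions 0 →b 1, 1 →a 2, 2 →a 2, 2 →b 1, with 0 and 2 final); function and variable names are illustrative:

```python
def structurally_complete(sample, delta, start, finals):
    """Check that, when accepting every string of the positive sample,
    every transition is used at least once and every final state
    serves as accepting state of at least one string."""
    used_transitions, used_finals = set(), set()
    for w in sample:
        q = start
        for a in w:
            used_transitions.add((q, a))
            q = delta[(q, a)]
        used_finals.add(q)  # the state where each string is accepted
    return used_transitions == set(delta) and used_finals == set(finals)

# Assumed DFA for (ba*a)*: states 0 (start, final), 1, 2 (final)
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "b"): 1}
print(structurally_complete(["ba", "baa", "baba", ""], delta, 0, {0, 2}))  # True
```

Dropping "baba" from the sample leaves the transition 2 →b 1 unused, so the check fails.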
Merging is fun
[Figure: a three-state automaton A1 and the automata A1/π obtained by merging states {0,1}, {1,2}, {0,2}, and {0,1,2}]
• Merging ⇔ definition of a partition π on the set of states
Example: {{0,1}, {2}}

• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalizing the language
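A quotient automaton A/π can be computed by relabeling every state with its block of π. The sketch below (the helper names and the toy automaton are ours, not from the slides) also shows that merging only ever grows the language:

```python
def quotient(transitions, finals, partition):
    """Merge states according to a partition pi of the state set.
    Returns transitions/finals of A/pi (merging may create non-determinism)."""
    block = {q: frozenset(b) for b in partition for q in b}
    return ({(block[p], a, block[q]) for (p, a, q) in transitions},
            {block[q] for q in finals})

def accepts(transitions, starts, finals, w):
    """Non-deterministic acceptance test."""
    states = set(starts)
    for a in w:
        states = {q for (p, x, q) in transitions if p in states and x == a}
    return bool(states & finals)

# Hypothetical automaton: 0 -a-> 1 -a-> 2, with 2 final (accepts only "aa")
trans, finals = {(0, "a", 1), (1, "a", 2)}, {2}
qt, qf = quotient(trans, finals, [{0, 1}, {2}])

# Merging states 0 and 1 generalizes the language: "a" becomes accepted
assert not accepts(trans, {0}, finals, "a")
assert accepts(qt, {frozenset({0, 1})}, qf, "a")
assert accepts(qt, {frozenset({0, 1})}, qf, "aa")  # the original string survives
```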
A theorem
The positive data can be represented by a prefix tree acceptor (PTA)
Example: {aa, abba, baa}

[Figure: the PTA of this sample, a tree-shaped automaton with states 0 to 8]
Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0) then there exists a partition π of the state set of the PTA such that PTA/π = A(L0)
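Building the PTA is a straightforward prefix walk: one state per distinct prefix of the sample. The sketch below is an illustrative implementation (not the lecture's code) and reproduces the 9-state PTA of the slide's sample:

```python
def build_pta(positive_sample):
    """Build the prefix tree acceptor (PTA): one state per distinct prefix,
    the empty prefix being the start state 0."""
    delta, finals = {}, set()
    states, next_id = {"": 0}, 1
    for w in positive_sample:
        for i, a in enumerate(w):
            prefix = w[:i + 1]
            if prefix not in states:
                states[prefix] = next_id
                next_id += 1
            delta[(states[w[:i]], a)] = states[prefix]
        finals.add(states[w])
    return delta, finals

# S+ = {aa, abba, baa} yields the 9-state PTA of the slide
delta, finals = build_pta(["aa", "abba", "baa"])
print(len(set(delta.values()) | {0}), sorted(finals))  # prints: 9 [2, 5, 8]
```

By construction the PTA is deterministic and accepts exactly S+.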
Summary
[Diagram: L0 generates the data; from S+ we build the PTA and look for a partition π such that PTA/π = A(L0)]
We observe some positive and negative data
The positive sample S+ comes from a regular language L0
The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0 (not an additional hypothesis but a way to restrict the search to reasonable generalizations!)
We build the Prefix Tree Acceptor of S+. By construction L(PTA) = S+
Merging states⇔ generalize S+
The negative sample S− helps to control over-generalization
Note: finding the minimal DFA consistent with S+, S− is NP-complete!
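The summary above suggests a simple search procedure: start from the PTA and merge states as long as no negative string becomes accepted. The following is a simplified sketch in the spirit of RPNI; the greedy pair order and the set-of-triples representation are ours, not the actual RPNI algorithm:

```python
from itertools import combinations

def accepts(trans, start, finals, w):
    """Acceptance test for a (possibly non-deterministic) automaton."""
    states = {start}
    for a in w:
        states = {q for (p, x, q) in trans if p in states and x == a}
    return bool(states & finals)

def merge_loop(trans, start, finals, negatives):
    """Greedily merge state pairs of an automaton built from S+ (e.g. the PTA),
    keeping a merge only if no negative string becomes accepted."""
    changed = True
    while changed:
        changed = False
        states = sorted({p for (p, _, _) in trans} | {q for (_, _, q) in trans}
                        | finals | {start})
        for p, q in combinations(states, 2):
            # tentatively merge state q into state p (a quotient: the language
            # can only grow, so every positive string stays accepted)
            t2 = {(p if u == q else u, x, p if v == q else v) for (u, x, v) in trans}
            f2 = {p if s == q else s for s in finals}
            s2 = p if start == q else start
            if not any(accepts(t2, s2, f2, w) for w in negatives):
                trans, finals, start = t2, f2, s2
                changed = True
                break
    return trans, start, finals

# PTA of S+ = {ba, baa}: 0 -b-> 1 -a-> 2 -a-> 3, finals {2, 3}
t, s, f = merge_loop({(0, "b", 1), (1, "a", 2), (2, "a", 3)}, 0, {2, 3},
                     ["a", "b", "ab"])
assert accepts(t, s, f, "ba") and accepts(t, s, f, "baa")       # positives kept
assert not any(accepts(t, s, f, w) for w in ["a", "b", "ab"])   # negatives rejected
```

The negative sample S− is what stops the loop from collapsing everything into a single universal state.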
Bayesian learning
Find a model M̂ which maximizes the likelihood of the data P(X|M) and the prior probability of the model P(M):

M̂ = argmax_M P(X|M) · P(M)
The PPTA (probabilistic prefix tree acceptor) maximizes the data likelihood
A smaller model (number of states) is a priori assumed more likely
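One concrete way to encode "smaller is a priori more likely" is a prior that decays with the number of states; the choice P(M) ∝ α^(number of states) below is purely illustrative, not from the slides:

```python
import math

def map_score(log_likelihood, num_states, alpha=0.5):
    """MAP objective: log P(X|M) + log P(M), with a size prior
    P(M) proportional to alpha**num_states (alpha < 1 favors small automata)."""
    return log_likelihood + num_states * math.log(alpha)

# A 3-state model with slightly lower likelihood can beat a 9-state PTA:
print(map_score(-10.0, 9) < map_score(-12.0, 3))  # True
```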
Links with Markov chains
A subclass of regular languages: the k-testable languages in the strict sense. A k-TSS language is generated by an automaton such that all subsequences sharing the same last k − 1 symbols lead to the same state
[Automaton: a k-TSS example whose states correspond to the last k − 1 symbols read]

p̂(a|bb) = C(bba) / C(bb)
A probabilistic k-TSS language is equivalent to a Markov chain of order k − 1
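The conditional estimate p̂(a|bb) = C(bba)/C(bb) can be computed directly from a sample; the sketch below (names and the toy sample are ours) counts overlapping occurrences, as Markov-chain counts require:

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, allowing overlaps (unlike str.count)."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

def ktss_estimate(sample, context, symbol):
    """Empirical estimate p(symbol | context) = C(context+symbol) / C(context),
    i.e. a k-TSS / order-(k-1) Markov estimate with k = len(context) + 1."""
    num = sum(count_overlapping(w, context + symbol) for w in sample)
    den = sum(count_overlapping(w, context) for w in sample)
    return num / den if den else 0.0

sample = ["bba", "abba", "bbb"]
print(ktss_estimate(sample, "bb", "a"))  # 0.5: C(bba) = 2, C(bb) = 4
```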
There exist probabilistic regular languages not reducible to Markov chains of any finite order

[Automaton: a two-state probabilistic automaton generating such a language]
Equivalence between PNFA and HMM
Probabilistic non-deterministic automata (PNFA), with no end-of-string probabilities, are equivalent to Hidden Markov Models (HMMs)
[Figure: a two-state PNFA and an equivalent two-state HMM with emission on transitions]
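The probability a PNFA assigns to a string is computed by a forward pass, exactly as for an HMM. The automaton below is a hypothetical example (the figure's numbers are not recoverable), and this variant includes end-of-string probabilities, as used for probabilistic automata defining distributions over finite strings:

```python
def pnfa_string_prob(transitions, initial, final, w):
    """Forward algorithm for a PNFA.
    transitions: dict (state, symbol) -> list of (next_state, prob)
    initial: dict state -> initial prob; final: dict state -> end prob."""
    alpha = dict(initial)  # forward probabilities over current states
    for a in w:
        nxt = {}
        for q, p in alpha.items():
            for r, t in transitions.get((q, a), []):
                nxt[r] = nxt.get(r, 0.0) + p * t
        alpha = nxt
    return sum(p * final.get(q, 0.0) for q, p in alpha.items())

# Hypothetical 2-state PNFA (all numbers chosen for illustration only)
trans = {(1, "a"): [(1, 0.3), (2, 0.2)], (2, "b"): [(1, 0.4)],
         (1, "b"): [(2, 0.1)]}
init, fin = {1: 1.0}, {1: 0.2, 2: 0.4}
print(pnfa_string_prob(trans, init, fin, "ab"))  # ≈ 0.028
```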
The smoothing problem
A probabilistic DFA defines a probability distribution over a set of strings
Some strings are not observed in the training sample but could be observed ⇒ their probability should be strictly positive

The smoothing problem: how to assign a reasonable probability to (yet) unseen random events?
Highly optimized smoothing techniques exist for Markov chains
How to adapt these techniques to more general probabilistic automata?
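A minimal instance of such a technique is additive (Lidstone) smoothing, shown here on symbol counts; the discount value is an illustrative choice, not from the lecture:

```python
from collections import Counter

def smoothed_prob(counts, symbol, alphabet, delta=0.5):
    """Additive (Lidstone) smoothing: every symbol of the alphabet gets a
    strictly positive probability, even if it was never observed."""
    total = sum(counts.values())
    return (counts.get(symbol, 0) + delta) / (total + delta * len(alphabet))

counts = Counter({"a": 3, "b": 1})
alphabet = {"a", "b", "c"}
print(smoothed_prob(counts, "c", alphabet))  # ≈ 0.0909, although "c" was never seen
```

The smoothed values still sum to 1 over the alphabet, so the result remains a probability distribution.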
Related problems and approaches
I did not talk about
• other induction problems (NFA, CFG, tree grammars, . . .)
• heuristic approaches such as neural nets or genetic algorithms
• how to use prior knowledge
• smoothing techniques
• how to parse natural language without a grammar (decision trees)
• how to learn transducers
• benchmarks, applications