
A Survey of Learning Methods In Learning Finite Automata (FA)

Submitted By: Priscilla Chia
Supervised By: Dr. Suresh K. Manandhar

Final Year Project Report 1998

Computer Science Department
University Of York


Abstract

This report is a survey of learning methods used in learning Finite Automata (FA). The learning issues in machine learning are highlighted, and the surveyed methods are analysed according to how these issues are dealt with. The report also looks at how additional information is learnt from the information given by the teacher. We survey six algorithms with respect to the learning methods employed in the two parts of the learning process: building a hypothesis and evaluating the hypothesis. The methods are categorised into probabilistic and non-probabilistic. We conclude with a discussion of the ability of a hypothesis to rectify errors from past experience instead of only learning from new ones.

Page 3: Figure 2.2.doc.doc

Acknowledgement

I am very grateful to my supervisor, Dr. Suresh Manandhar, for his invaluable help and advice throughout this project. I would also like to thank my friends and family for their support, especially mum and dad at home.


CONTENTS

1. INTRODUCTION

2. LEARNING
2.1 Learning in General
2.2 The Issues in Learning
2.3 Learning Finite Automata (FA)
2.4 Learning Framework
2.4.1 Identification in the Limit
2.4.2 PAC View
2.4.3 Comparison
2.4.4 Other Variations of Learning Framework
2.5 Results on Learning Finite Automata

3. NON-PROBABILISTIC LEARNING FOR FA
3.1 Learning with Queries
3.1.1 L1: by Dana Angluin [Angluin 87]
3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]
3.1.3 Discussion
3.2 Learning without Queries
3.2.1 L3: by Porat and Feldman [Porat and Feldman 91]
3.2.2 Running L3 on Worked Examples
3.2.3 Discussion
3.3 Homing Sequences in Learning FA
3.3.1 Homing Sequence
3.3.2 L4: No-Reset Learning Using Homing Sequences
3.3.3 Discussion
3.4 Summary (Motivation Forward)

4. PROBABILISTIC LEARNING
4.1 PAC Learning Using Membership Queries Only
4.1.1 L5: A Variation of the Algorithm L1 [Angluin 87; Natarajan 90]
4.1.2 Discussion
4.2 Learning through Model Merging
4.2.1 Hidden Markov Model (HMM)
4.2.2 Learning FA: Revisited
4.2.3 Bayesian Model Merging
4.2.4 L6: by Stolcke and Omohundro [Stolcke et al 94]
4.2.5 Running of L6 on Worked Examples
4.2.6 Discussion
4.3 Summary
4.4 Chapter Appendix

5. CONCLUSION AND RELATED WORK

REFERENCES



1. Introduction

The class of finite state automata (FA) is studied from a machine learning perspective, which involves both the issues of learning and the properties of the class itself. This report is a survey of the learning methods studied and employed in learning FA. We give an overview of learning in general in Section 2.1 and of the issues in learning in Section 2.2, with their application to learning FA in Section 2.3. The two important frameworks employed extensively in machine learning, learning in the limit and PAC learning, are explained in Section 2.4. The complexity of learning FA itself has been studied, and the results are given in Section 2.5.

This report, which concerns the learning methods employed, is divided into two main chapters in which various learning algorithms are studied and compared. The usual non-probabilistic learning is discussed in Chapter 3, with the motivation towards probabilistic learning given in Section 3.4, before probabilistic learning is discussed in Chapter 4.

The results of the survey are given in every chapter, and the conclusion together with related work in machine learning is in Chapter 5. Six algorithms are discussed; each is referred to as L1-L6 throughout this report, corresponding to the following algorithms:

• L1: [Angluin 87]
• L2: [Kearns et al 94]
• L3: [Porat and Feldman 91]
• L4: [Rivest et al 87]
• L5: [Angluin 87; Natarajan 90]
• L6: [Stolcke et al 94]

We follow the standard definition of FA as studied in automata theory [Hopcroft et al 79; Trakhtenbrot 73] and give the following terminology and notation, used for any FA M:

set of states, Q : the finite set of states q in the FA
final state : the state reached after an input string has been processed by M
initial state, q0 : the start state for all input strings
accepting state : a final state which accepts a string (the string is recognised by M)
rejecting state : a final state which rejects a string (the string is not recognised by M)
transition, δ(q,a) : the move from a state q on a symbol a from the alphabet set
alphabet set, A : the finite set of symbols; here the binary set {0,1} is used
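To make this notation concrete, the following is an illustrative sketch (not part of the original report) of an FA in code; the DFA class and the two-state machine accepting strings with an even number of 1s are chosen purely as an example:

```python
# Illustrative sketch of the FA notation above (Q, q0, delta, accepting states)
# over the binary alphabet A = {0, 1}. The DFA class and the example machine
# are assumptions for illustration, not definitions taken from the report.

class DFA:
    def __init__(self, states, start, delta, accepting, alphabet=("0", "1")):
        self.states = states        # Q: the finite set of states
        self.start = start          # q0: the initial (start) state
        self.delta = delta          # delta(q, a): dict (state, symbol) -> state
        self.accepting = accepting  # the accepting states among the final states
        self.alphabet = alphabet    # A: here the binary set {0, 1}

    def accepts(self, string):
        """Run M on the string from q0; the state reached is the final state."""
        q = self.start
        for a in string:
            q = self.delta[(q, a)]
        return q in self.accepting  # accepting final state: string recognised

# Example: a DFA whose language is the set of strings with an even number of 1s.
even_ones = DFA(
    states={"even", "odd"},
    start="even",
    delta={("even", "0"): "even", ("even", "1"): "odd",
           ("odd", "0"): "odd", ("odd", "1"): "even"},
    accepting={"even"},
)
```

Here `even_ones.accepts("0110")` holds, while `even_ones.accepts("1")` does not.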


Chapter 2: Learning

2. Learning

2.1 Learning in General

Learning in general means the ability to carry out a task with improvement from previous experience. It involves a teacher and a learner. The learning process usually takes place in an environment which constrains both the communication between the learner and teacher (how the teacher is to teach or train the learner, and how the learner is to receive input from the teacher) and the elements or tokens of information communicated between them: a class of objects (i.e. a concept) and a description of a subclass (i.e. an object).

Example 1(a): Environment for learning a class of vehicles
The environment in which the learning process takes place involves a teacher giving descriptions of ships and a learner drawing a conclusion about what a ship looks like from the descriptions received. The teacher describes ships (i.e. the subclass of vehicles) by providing pictures (i.e. using pictorial means), and the learner responds (i.e. communicates with the teacher) through some form of visual mechanism (i.e. by detecting shapes or colours of objects in pictures) to analyse the pictures received from the teacher. This environment only allows the teacher and learner to communicate through pictures, whereas in another environment other forms of encoding of descriptions may be used (e.g. tables of attributes: width, length, windows, engine capacity, etc.)

Figure 2.1: Finite representation of a possibly infinite class A of m elements cs for 1 ≤ s ≤ m, where m may be finite or infinite, by another finite class B with p elements ri for 1 ≤ i ≤ p, where p is some finite number.

The learner is to learn an unknown subclass from the class with the help (i.e. some form of training) of the teacher, who provides descriptions of the unknown subclass. Since the subclass to be learnt may be infinite in size, a finite representation is needed for the probable subclasses hypothesised during learning. The task of the learner is to hypothesise a (finite) representation of the unknown subclass, as shown in Figure 2.1, where a class A of possibly infinite size is represented by a finite class B. Thus learning the class A is to learn its class B representation.

In Example 1(a) above, the learner is to produce a hypothesis of the ship subclass. A finite representation for the hypothesis is necessary for the unknown subclass chosen to be learnt, as not every subclass can be finitely presented or described (i.e. by presenting all elements of the unknown subclass to the learner) by the teacher, as shown in the class of vehicles above, where the subclass (i.e. ships) is infinite. Note that there are finitely many different ‘types’ of ships, just as there are finitely many different ‘types’ of vehicles: in both cases the ‘ships’ and ‘vehicles’ are infinite but the ‘types’ are finite.

Instances from the class used to describe a particular subclass are called examples. These examples are usually classified by a teacher (i.e. a human, or some operation or program available in the environment), with respect to the particular subclass being learnt, as positive examples (members of the subclass) or negative examples (non-members of the subclass). A set of classified or labelled examples is called the example space.

Only a subset of the example space is used by the learner each time an unknown subclass is to be learnt. This subset, used in training the learner, is known as the training set. Each example space contains information (implicit properties or rules that may be infinite) relevant to distinguishing one subclass from another in the given class. The constraints in the environment also determine which type of examples (i.e. positive only, negative only or both) can be provided by the teacher to form the example space. For instance, it may not be possible to collect negative examples, so the teacher is restricted to positive examples only, which may form only a partial set (i.e. not all members of the unknown subclass are known, even to the teacher).

The learner or learning algorithm is therefore required to learn the implicit properties or rules of a particular subclass from the information given (built into what is called experience). The properties learnt are stored in the learner’s hypothesis (i.e. the conclusion or explanation) drawn of the subclass.

An infinite number of hypotheses in any form of representation (i.e. decision trees, propositional logic expressions, finite automata, etc.) could be produced that hold the properties obtained from the information received. This results in searching a large hypothesis space. It should be noted that the hypothesis space could be expressed in the same descriptive language used to describe the unknown subclass: in Example 1(a), if the class of vehicles is represented in the form of propositional logic expressions, then the hypothesis may be the exact propositional logic expression that represents the unknown subclass chosen (i.e. ships), or some other form of representation that is equivalent to the propositional logic expressions used.

A set of criteria is necessary to limit (reduce) the size of the search space. Given a reduced hypothesis space that satisfies the set of criteria, the learning goal is then needed for selecting and justifying a hypothesis from the hypothesis space as the finite representation of the unknown subclass. Together with other knowledge about the rules for manipulating the descriptive language, the set of criteria and the learning goal form what is called background knowledge, which guides the learner in the learning process.

Example 1(b): Learning process of Example 1(a)
Suppose that the hypothesis for ships takes the form of a collection of a finite number of attributes for a ship (i.e. size, engine capacity, shape, weight, anchor and other properties of a ship). The criteria for the hypothesis space could include hypotheses that fulfil, say, five out of six attributes used, and the learning goal could be to select a hypothesis that satisfies the criteria with the simplest data structure of some form and can successfully identify, say, the subsequent ten examples correctly. There could be an infinite number of attributes used, but the criteria in the background knowledge reduce the hypothesis space.


Thus, the learning scenario (Figure 2.2) consists of a given class, C, of subclasses and an example space, T, from which the training set, t, is drawn. Examples in T are used to describe an unknown subclass, c, in C. The aim of the learning algorithm, L, is to produce a hypothesis, h, from a hypothesis space, H, using information from t and satisfying the conditions set out in the background knowledge. L is to build an h that is equivalent to c. Ideally, h is exactly the same as c, or h is the exact representation of c. Due to the incompleteness of the t received (i.e. the teacher usually does not have complete information regarding c), h is usually taken to be equivalent to c only to some extent expressed in the background knowledge. In both cases, learning relies on information contained in t and given by the teacher.

L : T → H, where T is the example space: the set of training sets t for a subclass, c, in C.
L(t) = h (h ≡R c), where t ∈ T, h ∈ H, c ∈ C.
≡R : the equivalence relation specified by the learning goal used in selecting the hypothesis, h, using t. The selected h contains the learnt properties or rules of c, obtained through information from t.

Figure 2.2: The learning scenario of a learner or learning algorithm, L, with a given environment.

[Figure 2.2 diagram: within the environment, the teacher draws t from the example space T and presents it to L, which produces h from the hypothesis space H, guided by background knowledge: a set of criteria, a learning goal and the type of representation (the descriptive language for H).]
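The mapping L(t) = h above can be illustrated with a deliberately trivial learner. The `rote_learner` below is a hypothetical stand-in for L (not one of the surveyed algorithms L1-L6): its hypothesis simply memorises t, so h agrees with c on exactly the examples seen.

```python
# A hypothetical, deliberately trivial learner L illustrating L(t) = h.
# t is a training set of (example, label) pairs; the returned hypothesis h
# classifies seen examples from memory and rejects everything unseen.

def rote_learner(t):
    memory = dict(t)  # memorise the training set verbatim
    def h(example):
        # Agree with the teacher's label where one was given; reject otherwise.
        return memory.get(example, False)
    return h

# A tiny training set t for some unknown subclass c (labels are illustrative).
t = [("01", True), ("11", True), ("1", False)]
h = rote_learner(t)
```

Such an h satisfies only the weakest possible equivalence relation ≡R, namely agreement with c on t itself; the algorithms surveyed in later chapters aim for much stronger guarantees.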

2.2 The Issues in Learning

The algorithms used in learning are ‘ways’ of achieving the learning goal under the set of criteria in the background knowledge. ‘Ways’ here are methods of constructing a hypothesis from information in the set of examples. As shown in Figure 2.2, the learning algorithm, L, has two distinct phases in the learning process:

Phase 1: forming the hypothesis, h, from the set of examples, t. [Shown as the arrow from T to H in Figure 2.2.]

Phase 2: selecting and justifying h as a finite representation of the unknown subclass, c. [Shown as the arrow from H to C in Figure 2.2.]

The nature (or design) of L, and the feasibility of the learning problem itself, are determined by the following factors:

1. Example space, T
Usually considered arbitrary, where various kinds of information (training sets) can be used to describe c.

2. Classification of the training set, t, usually by a teacher or an operation carried out in the environment with respect to a particular subclass, c
• Noisy examples are considered, where the teacher may classify instances wrongly.
• The type of examples to be presented (i.e. positive only, negative only or both).

3. Presentation of t to L
Whether elements from t are fed into L one by one, in small groups or as a whole batch, and whether the elements are presented in any particular order (i.e. in lexicographic order or shortest length first).

4. The size of t
Intuitively, a small t is needed for learning by an efficient and ‘intelligent’ learner or learning algorithm. In machine learning, the size of t contributes to the computational complexity of a learning algorithm: the larger t is, the longer (or more complicated) the computation.

5. The choice of representation for the hypothesis space, H
This involves the issues of how much information should be, and can be, captured by a particular choice of representation. A rich descriptive language, while ideally required as the representation, means more complex computation and larger resource (i.e. memory storage) requirements, whereas a simple form of representation may not capture sufficient information to learn.

6. The selection criteria for a hypothesis, h, and its justification as an equivalent of c.

All of the above factors except the last constitute a major part of the design of an algorithm in machine learning, exhibited in Phase 1 of L. The last factor, together with the choice of representation for H, is usually vital in Phase 2 of the learning process, where evaluation is carried out by human experts or some known mechanism such as statistical confirmation or analysis.

The learner, L, is said to be able to learn a class in the given environment if it can learn any subclass chosen from the class (i.e. by producing a hypothesis that satisfies both the criteria and the learning goal in the background knowledge provided a priori).


2.3 Learning Finite Automata (FA)

This report investigates the learning process in a particular environment setting (Figure 2.3):
- Teacher: the source of the example space, T, where the description of the unknown subclass, c, is in the form of labelled strings.
- Learner: learns by receiving information in the form of labelled strings drawn from T, following the rules set out in the environment constraints.
- c: the unknown regular language or FA.

Two almost similar environments for learning are shown in Figure 2.3, differing in the class contents. The first environment (Figure 2.3(a)) consists of:
- C1: the class of all languages.
- H1: the hypothesis space, consisting of finite automata (FAs) as the finite representation for regular languages (i.e. a subclass of C1).
- T: the examples, which are labelled sets of strings.
- Criteria: the FA accepts all examples (i.e. strings which may or may not be only positive strings) received from the training set, t.
- Goal: to produce an FA (i.e. the selected hypothesis) that is equivalent to (i.e. that accepts) c.

The other learning environment (Figure 2.3(b)) can be obtained by refining C1 to the class of regular languages only, with the hypothesis space, H2, being the class of minimum deterministic finite automata (DFAs). The environment shown in Figure 2.3(b) has more constraints, as the teacher is to provide descriptions using only regular languages, whereas with C1 the teacher is able to provide descriptions using other languages as well (i.e. context-free languages).

This report concerns the learnability of finite automata (FA) using minimum DFAs as the hypothesis space. Both environments, with C1 and C2 as classes, use the same set of examples, T, which is a set of strings; the training set, t, is a set of classified strings with respect to a particular subclass of languages, c. For consistency throughout the report, the alphabet, A, for FAs will be set to the binary set {0,1}.

Figure 2.3: (a) c’ is the subclass of regular languages, c, and H1 is the class of FAs, with the criteria for H1 being deterministic and minimal in size (number of states). (b) c is a particular subset of regular languages and H2 is the class of minimum DFAs itself, where no criteria are needed.


Chapter 2: Learning

2.4 Learning Framework

Given an environment with a class of objects which describes ‘what is to be learnt’, the two phases, Phase 1 and Phase 2, of the learning process raise two fundamental questions:
- ‘how do we learn?’
- ‘when do we know we have learnt?’
The former is dealt with in Phase 1. The latter, in Phase 2, was studied by Gold [Gold 67] and Valiant [Valiant 84], resulting in two major learning frameworks: identification in the limit, proposed by Gold, and probably approximately correct (PAC) learning, proposed by Valiant.

2.4.1 Identification in the limit

[Gold 67] states that learning should be a continuous process, with the learner (or learning algorithm), L, having the possibility of changing or refining his guess (i.e. hypothesis) each time new information from the training set, t, is presented. The learner, L, is only required to have all his guesses after some finite time be the same and correct with respect to the information seen so far. Hence, the hypothesis, h, obtained after a finite time will remain the same and correct on subsequent information. The hypothesis, h, is then said to represent the unknown sub-class, c, described by t in the limit, completing Phase 2 of the learning process. This learning framework, identification in the limit, consists of three items as formulated by Gold:
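The process can be illustrated with a small sketch (illustrative only, not from the report: the hypothesis enumeration, the labelled stream and all names are invented for the example). The learner keeps its current guess as long as it stays consistent with the labelled strings seen so far, and otherwise moves to the next hypothesis in a fixed enumeration:

```python
# Toy sketch of identification in the limit by enumeration.
# All hypotheses, data and names are illustrative, not from the report.

def identify_in_the_limit(hypotheses, stream):
    """hypotheses: list of predicates h(s) -> bool, in a fixed enumeration.
    stream: iterable of (string, label) pairs describing the unknown c."""
    seen = []
    guess = 0                      # index of the current hypothesis
    guesses = []
    for s, label in stream:
        seen.append((s, label))
        # Refine the guess only when it contradicts the data seen so far.
        while any(hypotheses[guess](x) != y for x, y in seen):
            guess += 1
        guesses.append(guess)
    return guesses

# Unknown c: strings over {0,1} containing a '1'.
hyps = [lambda s: True,             # h0: accept everything
        lambda s: s.endswith('1'),  # h1: accept strings ending in 1
        lambda s: '1' in s]         # h2: accept strings containing 1 (= c)
data = [('1', True), ('0', False), ('10', True), ('11', True)]
print(identify_in_the_limit(hyps, data))   # guesses stabilise on h2
```

Note how the final guesses are all the same and consistent with the data, which is exactly Gold's success criterion.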

1. A class of objects
A class, C, is specified (or given) to the learner in the environment, where the form of communication between the teacher and learner is also specified. An object, c, from C is chosen for the learner to identify. [In the context of this report, the unknown object (or sub-class), c, is an FA and the class C consists of FAs.]

2. A method of information presentation
Information about the unknown chosen object is presented to the learner. The training set, t, consists of positive-only, positive and negative, or noisy examples as information describing c. [t is just a set of labelled strings drawn from the example space, T, provided by the teacher, and the type of t depends on T: all positive strings, all negative strings, or a combination of both.]

3. A naming relation
This enables the learner to identify the unknown object, c, by specifying its name1, h. There is a function, f, for L to map the names to the objects in C. Here, an object, c, can have several names (hypotheses), and guesses (or hypotheses) are made under f. [L is to build an FA as the hypothesis, h, for an unknown regular language, and h could be any of the several DFAs (or TMs) that accept the unknown regular language.]

1 In [Gold 67], a name is defined as a Turing Machine (TM). Since the language identified by an FA is also identifiable by a TM, it is sufficient to say that every FA has an equivalent TM.


2.4.2 PAC view

Another learning framework is Probably Approximately Correct (PAC) learning, first proposed by Valiant [Valiant 84], which uses a stochastic setting in the learning process. The learner is required to build an (approximately correct) hypothesis that has a small error probability after being trained on the training set, t, constituting Phase 1 of the learning process. Phase 2 under this framework requires the learner to have a high level of confidence that the hypothesis, h, is approximately correct as a representation of the sub-class, c. The training set, t, is considered ‘good enough’ with a high confidence level. This is appropriate because t generally does not consist of all the positive examples needed to learn c.

The PAC framework relies on two parameters: accuracy (ε) and confidence (δ). A fixed but unknown distribution is assumed over the example space, T, from which training sets, t, are drawn at random. Intuitively, PAC learning seems like a passive type of learning, with the learner learning only through observation of the given data. However, [Angluin 88] and [Natarajan 91] showed that PAC learning can be made active using queries – equivalence, membership, subset, superset, exhaustiveness and disjointness queries [Angluin 87].

Given real numbers δ and ε, both between 0 and 1, there is a minimum sample size (i.e. size of the training set, t) such that for any unknown sub-class, c, with a fixed but unknown distribution on the example space, T:

with probability at least (1 − δ), the hypothesis h misclassifies at most a fraction ε of a test set, where the test set is another subset of T, different from t, used to test the validity of h.

PAC learning settles for a good approximation to c because, in most cases, it is computationally difficult to build an accurate (exact) hypothesis; moreover, [Angluin 88] and [Natarajan 91] have shown that PAC learning can easily be applied to other non-stochastic learning frameworks.
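For a finite hypothesis space H, a standard textbook bound (not derived in the report) states that m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice for any hypothesis consistent with the sample to be probably approximately correct. A small sketch with illustrative numbers:

```python
# Standard PAC sample-size bound for a finite hypothesis space H
# (a textbook result, not stated in the report; numbers illustrative).
import math

def pac_sample_size(h_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. DFAs with up to n states over {0,1}: a crude upper bound on |H|
# is (2n)^(2n) transition choices times 2^n accepting-state sets.
n = 5
h_size = (2 * n) ** (2 * n) * 2 ** n
print(pac_sample_size(h_size, eps=0.1, delta=0.05))
```

The bound grows only logarithmically in |H| and 1/δ, which is why even crude counts of the hypothesis space still give usable sample sizes.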

2.4.3 Comparison

Both frameworks have distinct criteria and goals for learning, which deal with Phase 2 of the learning process (Table 2.1). However, both suggest learning by building tentative hypotheses from pieces of information in the form of strings from the training set, t (Figure 2.4). Each tentative hypothesis is a new ‘experience’ (i.e. a modified hypothesis with slight changes, or a totally new hypothesis) as new information is received from t. The final hypothesis, h’, taken to represent the sub-class, c, may be totally different from the previous hypotheses.

Learning framework | Identification in the limit | PAC learning
Goal               | The same hypothesis (or guess) after a finite time, for all subsequent information received. | P(error(h, c) ≤ ε) ≥ 1 − δ on a sufficiently large sample, t, where P is the probability function, h the hypothesis, c the unknown sub-class, and δ and ε the parameters needed.
Criteria           | Each hypothesis (guess) made must be consistent (correct) with the information seen so far. | The hypothesis, h, has error at most ε with respect to T.

Table 2.1: Comparison between identification in the limit and PAC learning frameworks.


Figure 2.4: A learning scenario with the learning algorithm, L, making several tentative hypotheses (i.e. h1, h2, h3) in H from a sequence of labelled examples (i.e. t1, t2, t3).

Recent studies [Kearns et al 94; Rivest et al 88; Porat 91] have been done under Gold’s proposed learning framework, as it is closer to human learning. We can always change our perception (hypothesis) each time new information is received and still remain consistent with the previous information, and we never know (or predict) when we have finished learning (which is a perpetual process in humans).

2.4.4 Other variations of the learning framework

There are two other learning frameworks mentioned by Gold in [Gold 67]:

1. Finite identification
The learner stops the presentation of information after a finite number of examples and identifies the sub-class, c. The learner must know when he has acquired a sufficient number of examples and is therefore able to identify c.

2. Fixed-time identification
A fixed finite time2 is specified a priori (i.e. usually as background knowledge), independently of the unknown object presented, at which the learner stops learning and identifies the unknown object.

These two frameworks seem to ask too much of the learner, who is ‘forced’ to identify the sub-class, c, by outputting a hypothesis, h, once some predicted factor or condition is met. In finite identification, the learner must be able to predict the number of examples needed to learn, and stop learning once that number of examples has been presented. Similarly, fixed-time identification requires the learner to know in some way ‘when’ he is able to stop learning.

Learning, as mentioned earlier, is to identify or distinguish the ambiguous lines separating the sub-classes in a learning environment. Being able to tell exactly when to stop learning (i.e. being able to predict those lines) would mean that there is no need for learning to start in the first place.

2 Throughout the report, time is taken to correspond to the computational complexity and the termination of a successful learning algorithm.


2.5 Results on learning finite automata

The complexity and learnability of finite automaton identification have received extensive research [Gold 67; Angluin 87; Vazirani et al 88]. The computational complexity is considered here with respect to the size of the hypothesis space (minimum DFAs) searched and the size of the training set (examples) required.

Other complexity results that have dealt with the computational efficiency are as follows:

1. Identification in the limit and learnability model [Gold 67] – Gold classifies the classes of languages that are learnable in the limit according to three categories of information presentation (Table 2.2). Learning from positive-only examples is shown to be insufficient for any superfinite class of languages.

2. Inferring a consistent DFA or NFA within a factor (1 + 1/8) of the size of the minimum consistent DFA is NP-complete, given positive and negative examples [Li et al 87].

3. There is an efficient learning algorithm that finds the minimum DFA consistent with given positive and negative data, using access to membership and equivalence queries [Angluin 87] and an observation table as the representation of the FA.

4. FAs can be learnt by experimentation3 (cf. 3 above) [Kearns et al 94], using a classification tree as the representation of the FA, in polynomial time.

5. State characterisation and data matrix agreement are introduced for the problem of automaton identification [Gold 78].

6. Inferring minimum DFAs and regular sets from positive and negative examples only is NP-complete [Gold 67, 78; Angluin 78].

Learnability model                                        | Class of languages
Anomalous text                                            | Recursively enumerable; Recursive
Informant (using positive and negative examples/instances)| Primitive recursive; Context sensitive; Context free; Regular; Superfinite
Text (using positive-only examples/instances)             | Finite cardinality

Table 2.2: Learnability and non-learnability of languages [Gold 67], where a superfinite class contains all finite languages and at least one infinite regular language.

These results show that inferring a DFA directly from examples alone is NP-hard, so other learning methods are employed to learn FAs successfully. The methods used in successfully learning FAs are surveyed in the following chapters.

3 Experimentation – a form of learning where the learner is able to experiment with chosen strings (i.e. selected by the learner rather than taken from the provided training set) during training.


Chapter 3: Non-Probabilistic Learning

3. Non-Probabilistic Learning for FA

In building a hypothesis, h, for an unknown FA, c, the learning algorithm, L, receives information (i.e. labelled strings) describing c from a training set, t. L is to build an h that is equivalent to c given the information received so far. Ideally, h is exactly the same as c. In practice, however, as c is unknown, the teacher may not have the complete information required to build the exact FA, and h is then taken to be an approximation to c, to an extent specified in the background knowledge (i.e. approximately equivalent to c, or a probably approximately correct h, rather than the usual exact h).

Learning relies on L making several guesses based on information provided by the teacher, in the following ‘ways’ to be discussed in this chapter:

a) learning with queries, section 3.1
b) learning without queries, section 3.2
c) learning with homing sequences, section 3.3

L makes guesses about c through a number of tentative hypotheses (i.e. tentative FAs), M’, built from the information received. Each guess is a refinement or modification of the previous guess (hypothesis), as new properties of the FA (i.e. its characteristics and elements) are discovered. A guess made by L is also called a conjecture. The learner produces several conjectures until the learning goal is achieved, that is, until a final conjecture is accepted as the FA equivalent to c.

All information received and properties learnt through the modifications are kept in a data structure. A modification to the data structure is called an update, and a new hypothesis is built based on the updated data structure. Hence, the data structure has several roles:

a) a representation of the properties (to be learnt) of an FA:
• the finite number of states
• the transitions (representing the transition function)
• the set of distinguishing strings
• the accepting and rejecting states

b) a record of the modifications made (i.e. updates):
• incorporating more information received: strings in t
• updating more properties learnt

c) a reference for building the next tentative FA, M’, after each update

The data structures used by the learner in this chapter are briefly explained below; detailed explanations of the updates are given in the sections indicated in brackets:

1. observation table (see section 3.1.1)
A two-dimensional table, shown in Figure 3.5, where the rows correspond to the states and the columns correspond to the set of distinguishing strings of the FA. The entries in the table are values ‘0’ and ‘1’, corresponding to the transition function of the FA reaching a rejecting or an accepting state respectively.


     e1   e2   …
s1    1    0
s2    0    0
s3    :    :

Figure 3.5: Observation table representing elements of FA: states (rows), distinguishing strings (columns) and transition function (table entries in shaded section)

2. classification tree (see section 3.1.2)
A binary classification tree where the leaves correspond to the states of the FA and the distinguishing strings are represented by the internal nodes (and root) of the tree, as shown in Figure 3.6. The left and right paths from an internal node correspond to the transition function of the FA reaching a rejecting or an accepting state respectively.

Figure 3.6: Classification tree representing elements of an FA: states (leaves), distinguishing strings (internal nodes including root) and transition function (the right and left paths).

3. minword(q) (see section 3.2.1)
A string used to reach a state q of an FA from the initial state q0. The set of minword(q) strings thus corresponds to the states of an FA, as shown in Figure 3.7.

Figure 3.7: The set of minword(q) for FAs. (a) four minword(q) representing the states in the FA that accepts all strings with even 0’s and 1’s. (b) two minword(q) representing the states in the FA that accept all non-empty strings.
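A minimal sketch of computing the set of minword(q) strings by breadth-first search over the transition function (the dictionary-based FA representation is an assumption for illustration, not the report's):

```python
# Sketch: minword(q) = shortest (here, lexicographically least) string
# reaching q from q0. FA representation (delta as a dict) is assumed.
from collections import deque

def minwords(states, alphabet, delta, q0):
    """delta: dict mapping (state, symbol) -> state."""
    found = {q0: ''}            # minword(q0) is the null string
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in sorted(alphabet):          # lexicographic tie-break
            nxt = delta[(q, a)]
            if nxt not in found:
                found[nxt] = found[q] + a
                queue.append(nxt)
    return found

# FA of Figure 3.7(b): q0 rejects the empty string, q1 accepts the rest.
delta = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
         ('q1', '0'): 'q1', ('q1', '1'): 'q1'}
print(minwords({'q0', 'q1'}, {'0', '1'}, delta, 'q0'))
```

For the FA of Figure 3.7(b) this yields minword(q0) = λ and minword(q1) = 0, matching the caption.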


3.1 Learning with queries

Additional information regarding the unknown c can be requested by L by asking queries [Angluin 88]. The queries can be equivalence, membership, subset, superset, disjointness and exhaustiveness queries. Two of the six queries are used by the following two algorithms, L1 and L2 (see sections 3.1.1 and 3.1.2), in learning c:

1. Membership queries
The learner presents an input string, x, of its choice, and the teacher returns a yes/no answer depending upon whether x is accepted by the unknown FA, c.

2. Equivalence queries
The teacher returns a yes answer if the conjecture, M’, is equivalent to c; otherwise, it returns a counterexample, y, which is a string in the symmetric difference of M’ and c.
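The two oracles can be sketched as follows for a target c given as a DFA (an illustrative sketch: the DFA representation and the bounded-length search in the equivalence tester are assumptions, since a real teacher would decide equivalence exactly):

```python
# Sketch of the teacher's two oracles; representation is illustrative.
from itertools import product

def run(dfa, s):
    """dfa = (delta, q0, accepting); returns True iff s is accepted."""
    delta, q, accepting = dfa
    for a in s:
        q = delta[(q, a)]
    return q in accepting

def membership_query(c, x):
    return run(c, x)            # yes/no answer for the string x

def equivalence_query(c, m, alphabet='01', max_len=8):
    # Search strings in length order for one in the symmetric
    # difference of M' and c (bounded search: an assumption here).
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            s = ''.join(tup)
            if run(c, s) != run(m, s):
                return s        # counterexample
    return None                 # treated as a 'yes' answer

# c accepts all non-empty strings; the conjecture m accepts everything.
c = ({**{('q0', a): 'q1' for a in '01'},
      **{('q1', a): 'q1' for a in '01'}}, 'q0', {'q1'})
m = ({('p', a): 'p' for a in '01'}, 'p', {'p'})
print(repr(equivalence_query(c, m)))   # the empty string separates them
```

The counterexample returned here is λ (the empty string), the shortest string on which the conjecture and c disagree.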

Hence, L has access to an oracle (which could be the teacher, or some operation available in the environment), creating an active interaction between the learner and teacher in the learning process (Figure 3.8). The two queries form a pair of oracles, each used in a separate stage of learning:

a) Phase 1 of learning: updating the data structure used to construct the conjecture, M’

b) Phase 2 of learning: to confirm M’ as a finite representation of c (i.e. when to stop learning)

Figure 3.8: Learning with additional information obtained through access to oracle in the environment.


3.1.1 L1: by Dana Angluin [Angluin 87]

The observation table (e.g. Figure 3.5) is the data structure used to store the information and learnt properties about the unknown FA, c. The rows and columns are represented by strings: the rows are based on information from the training set, t, and the columns on the set of distinguishing strings learnt.

Each row is viewed as a vector with attribute values ‘0’ and ‘1’ (i.e. the ‘0’ and ‘1’ table entries in the row, one per column), representing a state in c. Thus, the string representing each row also represents a state in c: a string is said to represent a state q when it can be used to reach q from the initial state q0. The vectors are used to distinguish the rows, and thus to distinguish the states in c.

Alternatively, each row can be viewed as a set of distinguishing strings e, where each e represents a column of the table. The table entry ‘1’ or ‘0’ in a row depends on whether or not e (for the corresponding column) is a distinguishing string for the row (state) represented.

There may be rows with the same vector (i.e. with the same set of distinguishing strings); by the Myhill-Nerode theorem of equivalence classes, such rows are said to be equivalent to each other, that is, to represent the same equivalence class x. Thus, we use the alternative view of a row above in referring to the distinct states represented by these rows: the distinct state, that is, the equivalence class x, is represented by the distinct row vector.

In Figure 3.9, there are only two distinct rows, s1 and s2, with vectors (0) and (1) and strings λ and 0 respectively. The remaining rows have the same vector (1) as row s2. Thus, there are only two distinct states, represented by the sets of strings {λ} and {0, 1, 00, 01} respectively; the sets of distinguishing strings are ∅ and {λ} for the two distinct states.

          e1 = λ
s1 = λ      0
s2 = 0      1
s3 = 1      1
s4 = 00     1
s5 = 01     1

(training set t = {-λ, +0, +1, +00, +01}; distinguishing strings = {λ})

Figure 3.9: Observation table with five rows, representing two distinct states, with strings from t.

We now specify the three main elements of the observation table O, as shown in Figure 3.10, used by the learner L1 during learning to represent the properties of and information about c:

1. A non-empty prefix-closed* set of strings, S.
This set starts with the null string, λ. The rows of the observation table are each represented by a string in S ∪ S.A. There are two distinct divisions of rows in O: the upper division (i.e. the shaded rows in Figure 3.10) is represented by the strings in S and the lower division by the strings in S.A. Each row in the upper division represents the state reachable through some s ∈ S from the initial state q0. The

* A prefix-closed set is where every prefix of each member is also an element of the set.


rows in the lower division of O therefore represent the next-states reached through transitions a ∈ A from the rows in the upper division. Thus, S represents the states discovered (learnt) by the learner in the course of learning.

2. A non-empty suffix-closed** set of strings, E.
This set also starts with the null string, λ. The columns of the observation table are represented by the strings in this set. The vector for each row picks out a subset of E, and these distinct subsets of E, represented by the distinct row vectors, are used to identify the distinct states represented by the strings in S ∪ S.A. In Figure 3.10, the subsets ∅ and {λ} of E (represented by the vectors (0) and (1) respectively) identify the two distinct states represented by {λ} and {0, 1, 00, 01} in S ∪ S.A. Thus, E represents the characteristics of the states, learnt through subsets of strings in E.

3. A mapping function, T: (S ∪ S.A).E → {0,1}, where T(x.e) = ‘1’ if the string x.e ∈ c and ‘0’ otherwise, for x ∈ S ∪ S.A. This mapping function thus represents the transition function of the FA, δ(q0, x.e).

             λ
S:    λ      0
      0      1
S.A:  1      1
      00     1
      01     1

(S = {λ, 0}; E = {λ}; S ∪ S.A = {λ, 0, 1, 00, 01}; table entries are T(x.e), x ∈ S ∪ S.A, e ∈ E)

Figure 3.10: Observation table O with upper division (shaded section) and lower division of rows from the set S∪S.A.

Two properties of O, closed and consistent, are used by L1 as a guide to carry out updates (i.e. the extension of rows and columns) during learning:

a) closed
As the rows in the lower division of O are next-states of states in the upper division, reached by taking transitions on symbols in A, the row vectors in the lower division must also exist in the upper division: the closed property of O. Thus, for every string s’ in S.A there is an s in S such that both strings have the same vector. As shown in Figure 3.10, every vector in the lower division of O exists in the upper division, so all next-states are existing states.

b) consistent
Each pair of rows with the same vector (i.e. with the same subset of distinguishing strings) should represent the same state, and so the next-state vectors obtained from this pair should likewise be the same vector, representing the same next-state reached: the consistent property of O. Thus, for any pair of strings s1, s2 in S with row(s1) = row(s2), we require row(s1.a) = row(s2.a) for all a in A. As shown in Figure 3.11, the rows represented by the strings λ and 11 are consistent: both represent the same distinct state, with vector (1 0), and move into next-states having the same vectors.

** A suffix-closed set is where every suffix of each member is also an element of the set.
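The two tests above can be sketched in a few lines (a minimal sketch: the table representation, with entries keyed by concatenated strings and rows as tuples, is an assumption for illustration, not the report's code):

```python
# Sketch of the closed and consistent tests on an observation table.
# T: dict mapping a string x.e to 0/1; S: upper-division strings;
# A: alphabet; E: distinguishing strings. Representation assumed.

def row(T, s, E):
    return tuple(T[s + e] for e in E)

def is_closed(T, S, A, E):
    # Every lower-division row vector must appear in the upper division.
    upper = {row(T, s, E) for s in S}
    return all(row(T, s + a, E) in upper for s in S for a in A)

def is_consistent(T, S, A, E):
    # Equal upper rows must lead to equal next-state rows on every a.
    for s1 in S:
        for s2 in S:
            if s1 != s2 and row(T, s1, E) == row(T, s2, E):
                if any(row(T, s1 + a, E) != row(T, s2 + a, E) for a in A):
                    return False
    return True

# Table T1 of the 'not closed' example: S = {λ}, E = {λ}.
T = {'': 0, '0': 1, '1': 1}
print(is_closed(T, S=[''], A='01', E=['']))   # vector (1) missing above
```

Here the lower-division vector (1) has no counterpart in the upper division, so the closed test fails and a row would have to be promoted into S.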

O      λ  0
λ      1  0   ← previous state
0      0  1
1      0  0   ← next-state
11     1  0   ← previous state
00     1  0
01     0  0
10     0  0
110    0  1
111    0  0   ← next-state

Figure 3.11: A consistent observation table, where the two rows λ and 11, represented by the same row vector (1 0), lead to rows with the same vector (0 0), representing the same next-state reached from both rows of the upper division (shaded region).

The observation table O is updated by extending the rows and columns (discovering more states and the characteristics of each state) using membership queries and equivalence queries, as shown in Figure 3.12 and Figure 3.13. An update is carried out in two circumstances:

Figure 3.12: (a) Observation table T1, which is not closed. (b) A closed T2, extending T1 by adding a new row to the table representing the newly discovered state.

a) when one of the closed and consistent properties of O does not hold:

• O is not closed when some vector in the lower division is not represented in the upper division. A new state is then said to be discovered (learnt), as it is a next-state that is not an existing state. In Figure 3.12(a), the rows with vector (1) in the lower division are not represented in the upper division, indicating that the next-state is not an existing state. O is then updated by

S ∪ {s’} where s’ ∈ S.A

Figure 3.12(b) shows the updated O with a new string (row) in S, giving a new row in the upper division representing the newly learnt state.

(a) T1 (not closed)              (b) T2 (closed)
      λ                                λ
λ     0                          λ     0
0     1  ← vectors not in        0     1   ← newly added upper row
1     1    the upper division    1     1
                                 00    1
                                 01    1

(a) S = {λ}, E = {λ}, S ∪ S.A = {λ, 0, 1}; T(λ.λ) = 0, T(0.λ) = 1, T(1.λ) = 1
(b) make closed: S ∪ {0}, giving S = {λ, 0}, S ∪ S.A = {λ, 0, 1, 00, 01}


[Adding s’ to S maintains the prefix-closed property of the set, as s’ is an element of S with an input symbol from the alphabet appended.]

Note: membership queries are used to complete the table entries whenever E or S is extended. The queries are made on the strings in (S ∪ S.A).E, where a yes answer from the teacher gives a ‘1’ entry in O and a no answer a ‘0’ entry.

• O is inconsistent when two rows with the same vector have a pair of different next-state vectors. This indicates that one of the pair of strings s, s’ in S actually represents a different (newly discovered) state not among the existing states (rows). In Figure 3.13(a), a pair of rows with the same vector leads to different next-states on transition ‘1’ in O1. O is then updated by

E ∪ {a.e} where a is the transition symbol which brought the two states to different next-states and e is the element of E at which the next-state vectors differ (i.e. at one of the attributes).

Thus, Figure 3.13(b) shows the updated O2 with an extra column represented by the string ‘1’, the transition symbol which brought the pair of rows to different rows; the element e of the previous E at which the difference is seen is λ. All the table entries in this additional column are filled in using membership queries on the new (S ∪ S.A).E. [The suffix-closed property of E is also maintained when ‘a.e’ is added to E, since e, its suffix, was added to the set before ‘a.e’.]

O1      λ  0
λ       1  0
0       0  1
01      0  0   ← same vector: current states
010     0  0   ← same vector: current states
1       0  0
00      1  0
011     0  1   ← different next-state vectors
0100    0  0
0101    1  0   ← different next-state vectors

O2      λ  0  1   (make consistent: E ∪ {1.λ})
λ       1  0  0
0       0  1  0
01      0  0  0
010     0  0  1   ← new row vector
1       0  0  1
00      1  0  0
011     0  1  0
0100    0  0  0
0101    1  0  0

Figure 3.13: (a) O1 is inconsistent, with different next-state vectors for a pair of rows with the same vector representing the same state. (b) The updated O2, with the newly learnt state represented by a new row with a new vector in the upper division of the new table.

b) when a counterexample y is returned from an equivalence query:
S is extended to include all the prefixes of y. The upper division of the table is thus extended with new strings, and membership queries are used to fill in all new entries.

We now face the questions of “when to build a tentative M’ from the data structure?” and “how is a tentative M’ built from the data structure?”. A tentative M’ (Figure 3.14(b)) is built only when the observation table O has both the closed and consistent properties, as in Figure 3.14(a), where all upper rows with the same vector lead to rows with the same vector and all vectors in the lower division are represented in the upper division.


The latter question is answered by a closed and consistent O: such a table is used to build a tentative deterministic FA (DFA), M’, with each distinct vector (i.e. distinct row) in the upper division representing a state of M’. M’ is then completed by adding transitions on all symbols in A from every state. The next-state is determined by looking up the row represented by the string s.a (i.e. the string resulting from taking transition a from row s) in the table, and taking the vector corresponding to that string.

In Figure 3.14(b), the conjecture M’ is built from the closed and consistent observation table O of Figure 3.14(a). The states of M’ are the distinct vectors in the upper division, each shared among the strings representing rows of O. M’ is the minimum DFA accepting all non-empty strings, and the equivalence query on M’ returns a yes answer.

Figure 3.14: (a) The observation table, O, for the unknown FA, c, recognising the set of all non-empty strings. The rows are elements of S ∪ S.A and the columns are elements of E. (b) The conjecture, M’, constructed from the closed and consistent O. The final state is the row having vector (1) (bold arrow); λ is always the initial state, being the first row in the table, a non-accepting state in this case. The next-state transitions are given by the strings {0, 1, 00, 01}.

Each conjecture, M’, is then presented to the teacher in the form of an equivalence query. If the guess is correct, no counterexample is returned and M’ is the minimum DFA equivalent to c, as in Figure 3.14(b), where an equivalence query on M’ returns a yes. L1 then stops learning and outputs M’ as its hypothesis: a minimum DFA representing the unknown FA.

However, if a counterexample is returned, an update is carried out on the observation table (i.e. all prefixes of the counterexample are added to S), followed by further updates if the resulting table is not closed and/or not consistent. The next conjecture is built when both properties are satisfied. Membership queries are used to fill in the new entries for the rows obtained from the extended S; the counterexample and its prefixes are the learner’s choice of strings to present in membership queries.

The minimality of the number of states in the conjectured DFAs is maintained by the closed and consistent properties of the observation table. Through the closed and consistent tests on every updated table, two rows that have the same vector are considered as belonging to the same equivalence class by the Myhill-Nerode theorem (i.e. a class x with the same behaviour on the set of distinguishing strings). Thus, building a conjecture only from a closed and consistent observation table, and taking only the distinct vectors as representing the distinct states, always results in a minimum DFA.
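The overall control loop of L1 can be sketched as a compact, runnable approximation (a sketch under assumptions, not the report's code: the oracle interfaces mq/eq, the bounded-search equivalence tester and all helper names below are illustrative):

```python
# Sketch of L1's control loop: fill the table, repair closed/consistent,
# conjecture, and extend S with counterexample prefixes. Illustrative.
from itertools import product

def lstar(alphabet, mq, eq):
    S, E, T = {''}, {''}, {}
    def fill():                              # membership queries fill entries
        for s in set(S) | {s + a for s in S for a in alphabet}:
            for e in E:
                if s + e not in T:
                    T[s + e] = mq(s + e)
    def row(s):
        return tuple(T[s + e] for e in sorted(E))
    fill()
    while True:
        while True:                          # repair until closed and consistent
            fill()
            upper = {row(s) for s in S}
            bad = next((s + a for s in S for a in alphabet
                        if row(s + a) not in upper), None)
            if bad is not None:
                S.add(bad)                   # not closed: promote the missing row
                continue
            clash = next((a + e for s1 in S for s2 in S
                          if s1 != s2 and row(s1) == row(s2)
                          for a in alphabet for e in sorted(E)
                          if T[s1 + a + e] != T[s2 + a + e]), None)
            if clash is not None:
                E.add(clash)                 # not consistent: add column a.e
                continue
            break
        delta = {(row(s), a): row(s + a) for s in S for a in alphabet}
        dfa = (delta, row(''), {row(s) for s in S if T[s]})
        y = eq(dfa)
        if y is None:                        # equivalence query answers 'yes'
            return dfa
        S |= {y[:i] for i in range(len(y) + 1)}   # add all prefixes of y

def run(dfa, s):
    delta, q, accepting = dfa
    for a in s:
        q = delta[(q, a)]
    return q in accepting

# Target c: the set of all non-empty strings (the example of Figure 3.14).
in_c = lambda s: len(s) > 0

def eq(dfa):                                 # bounded counterexample search
    for n in range(6):
        for tup in product('01', repeat=n):
            s = ''.join(tup)
            if run(dfa, s) != in_c(s):
                return s
    return None

dfa = lstar('01', in_c, eq)
print(len({v for v in dfa[0].values()} | {dfa[1]}))   # number of states
```

On the non-empty-strings language this terminates with the two-state minimum DFA, matching the example.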

“How to start learning?” This question brings us to the important role of the null string, λ, with which both S and E start as their first element. This string not only leads to

O (S = {λ, 0}, E = {λ}):
s1 = λ     T(λ) = 0
s2 = 0     1
s3 = 1     1
s4 = 00    1
s5 = 01    1


discovering the initial state q0 (being the first row of the table) but also serves as the distinguishing string used to decide which of the distinct vectors are accepting or rejecting states. Being the first element of E allows the string of every row to be queried by the learner, via membership queries, as to whether it is accepted or rejected by c. Thus, a row which has λ in its set of distinguishing strings, indicated by a ‘1’ entry in the λ column of its vector, must represent an accepting state, as the string represented by that row is accepted.

In Figure 3.14(a), the vector (1) for the row with string ‘0’ represents an accepting state, as ‘0’ is accepted at the column represented by λ, which is also in the set of distinguishing strings for row ‘0’, indicated by the ‘1’ entry. The learning process thus starts with S and E each having only one element (i.e. the null string) and an initial table with one column and three rows (one row for λ in the upper division and two next-state rows in the lower division).

Another illustration is shown in Figure 3.15, with the learner trying to learn the FA that accepts all strings with an even number of 0’s and 1’s. The initial table constructed, O0, is not closed. L1 updates the table until an equivalence query initiates termination by a yes answer for conjecture M1, after five updates (i.e. five observation tables) and two conjectures.

The examples required by the learner are obtained through membership queries and counterexamples, both drawn from the training set t consisting of positive and negative examples (i.e. the ‘1’ and ‘0’ entries for accepted and rejected strings).

(Sequence of observation tables for this run: O0, with S = {λ} and E = {λ}, is not closed, so ‘0’ is added to S to give the closed table O1; the equivalence query on the first conjecture M0 returns ‘no’ with counterexample y = 010; O2 and O3 are made consistent by extending E with ‘0’ and then ‘1’; O4 is closed and consistent with S = {λ, 0, 01, 010} and E = {λ, 0, 1}, and the equivalence query on conjecture M1 returns ‘yes’.)


Figure 3.15: Running example of learning the unknown FA that accepts the set of all strings with an even number of 0’s and 1’s.


3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]

This algorithm uses the same principles as L1 (i.e. membership and equivalence queries over positive and negative examples), but the data structure used to construct the tentative FA, M’, is a classification tree, as shown in Figure 3.16. The leaves of the classification tree represent the states learnt (known) so far in c, and the nodes represent the distinguishing strings required to distinguish (discover) the states in c. Nodes and leaves are each represented by a string, based on the information received from counterexamples and from membership queries on chosen strings.

Figure 3.16: Classification tree, T1, with 3 nodes representing 3 distinguishing strings and 4 leaves, each representing an equivalence class.

The Myhill-Nerode Theorem is also adopted by L2, that is, maintaining the set of distinguishing strings that distinguishes between the equivalent states represented as leaves in the tree. Each leaf can be viewed as an equivalence class x containing a set of (representative) strings having the same behaviour (distinguishability) with respect to c and the set of distinguishing strings. Each node is thus the distinguishing string separating the leaves in its right subtree (accepting with respect to that string) from those in its left subtree (rejecting). In Figure 3.16, the node d3, represented by string 1, distinguishes between the leaves 1 and 01 in its right and left subtrees respectively, with respect to the FA that accepts all strings with an even number of 0’s and 1’s.

The next state which a string x reaches on transition symbol a is determined by traversing the tree with the string xa, starting from the root, until a leaf s is reached. At each node d visited, the next path to take depends on whether the string xad is accepted or rejected by c: the right path is taken if xad is accepted by the unknown FA, and the left path otherwise. The leaf s reached is the equivalence class to which x belongs; thus, xa is said to represent the state represented by s. Membership queries are used here to decide which path to take, with xad being a string of the learner’s choice.
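This traversal (a sift through the tree) can be sketched as follows, assuming a small Node/Leaf representation of the classification tree and a membership oracle; the class layout and the tree used below are our reconstruction, not code from the report.

```python
# Sketch of sifting a string through a classification tree with
# membership queries. Node/Leaf layout is an illustrative assumption.

class Node:
    def __init__(self, d, left, right):
        self.d = d          # distinguishing string at this node
        self.left = left    # subtree of strings rejected with d appended
        self.right = right  # subtree of strings accepted with d appended

class Leaf:
    def __init__(self, s):
        self.s = s          # access string representing a state

def sift(tree, x, member):
    """Return the leaf (equivalence class) that string x belongs to."""
    t = tree
    while isinstance(t, Node):
        # go right iff x concatenated with the node's string is accepted
        t = t.right if member(x + t.d) else t.left
    return t
```

For a tree shaped like T1 of Figure 3.16 and the even-0’s-and-1’s FA, sifting 01 lands in the leaf 01, while sifting 0·1 determines the next state of the state 0 on symbol 1.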

As in Figure 3.16, the string 01, when traversed through the tree, ends up in leaf s4: the string 011 is rejected at node d3 in T1, so the left path to s4 is taken. In T2 from Figure 3.17, however, it is accepted, as the traversal reaches the right leaf s1 from the root d1. Thus, the string 01 represents state s4 in T1 and state s1 in T2 respectively, with respect to the FAs being learnt.

Figure 3.17: Classification tree, T2, with one node and two leaves, represented by the strings in D = {λ} and S = {λ, 01} respectively.

(T1: nodes d1 = λ, d2 = 0, d3 = 1; leaves s1 = λ, s2 = 0, s3 = 1, s4 = 01; distinguishing strings = {λ, 0, 1}; training set t = {+λ, -0, -1, -01}. T2: node d1 = λ; leaves s1 = 01, s2 = λ; distinguishing strings = {λ}; training set t = {-λ, +01}.)


The classification tree, T, maintains two main elements to represent the properties learnt of c and the information received from the training set. The elements are specified as follows:

1. a set of access strings, S. The initial set contains only one string, the null string λ. The leaves in T are each represented by a string in S, so the leaves represent the states of the unknown FA discovered so far. The leaves in the left subtree of the root contain all s in S that are rejected by c, and the leaves in the right subtree of the root are strings that are accepted by c. Thus, S is subdivided into two subsets of rejecting and accepting states (i.e. the leaves).

From Figure 3.17, S is the set of strings representing the leaves, and the two subsets for T2 are {λ} and {01}, which are the negative and positive examples from t respectively.

2. a suffix-closed set of distinguishing strings, D. The initial D starts with the null string, λ, and is used to distinguish each pair of access strings in S. The strings in D represent the nodes of T. Each node, d, has exactly two children, distinguishing a pair of strings in S such that the right subtree consists of the strings s for which s.d is accepted by the unknown FA, and the left subtree those for which it is rejected.

As in Figure 3.17, the root is the distinguishing string for the leaves in its right and left subtrees, where the strings λ and 01 representing the leaves also represent the rejecting and accepting states respectively.

The classification tree is used to build every conjecture M’ except the initial conjecture (i.e. the learner’s first guess), M0. There are only two different initial conjectures to choose from as M0, shown in Figure 3.18(a) and Figure 3.18(b), each having a single start state with all transitions to itself. The choice depends on a membership query on the null string λ: M0 either accepts or rejects the set of all strings depending on whether λ is accepted by the unknown M, that is, the initial state is an accepting state if λ is accepted by M, and a rejecting state otherwise.

Figure 3.18: (a) M0 accepting all strings, as t provides the positive example +λ. (b) M0 accepting the empty set, as t provides the negative example -λ. The corresponding tree T0 is an incomplete tree, to be completed with the counterexample returned after an equivalence query on M0.


As in L1, for every conjecture M’ produced, an equivalence query on M’ is presented by the learner, and L2 terminates if no counterexample is returned. The first guess from the learner is thus whether all strings are accepted or rejected by the unknown M. The first counterexample represents the remaining unrepresented leaf y in the incomplete T0 for the initial conjecture, as in Figure 3.18, where y is one of the leaves of each tree. Each subsequent counterexample y returned is analysed using the divergence concept (see below) and the current classification tree, T.

From Figure 3.19(a), the unknown FA accepting all non-empty strings results in the initial M0 being the DFA accepting the empty set. An equivalence query on M0 returns the counterexample string +01. As λ is rejected at the root, the first classification tree, T0, shown in Figure 3.19(b), has λ as its left child and the counterexample 01 as its right child.

Figure 3.19: An unknown FA that accepts the set of all non-empty strings is being learnt. (a) The initial conjecture, M0; (b) the classification tree, T0, with 2 leaves from S and a node (root) from D; (c) the conjecture, M’, constructed using the classification tree in (b), with the first leaf, λ, as the initial state and the second leaf, represented by the string 01, as the other state in M’. As 01 is a leaf in the right subtree from the root, the final state is represented by the leaf 01.


In Figure 3.19(b), T0 is then used to build a tentative deterministic FA (DFA), M’ (Figure 3.19(c)), using the leaves λ and 01 to represent the states of M’, where all states in M’ are labelled with the leaves of T0. The next-state transitions for every state in M’ are found by traversing T0 with the string representing each state appended with a transition symbol a; the strings used for the next-state transitions are {011, 010, 0, 1}. The next equivalence query returns a ‘yes’, which terminates learning, as in Figure 3.19(c).

From every counterexample y, each prefix yi of y is analysed to determine the prefix that leads to different states when it is tested on both the current T and the conjecture M’ that returned y in the equivalence query. Each test results in a pair of states: a leaf from T and a state from M’. Since M’ is built by taking all the leaves in T to represent the states of the conjecture, the pair of states from the tests should point to the same state for a string if T and M’ are equivalent. Thus, there must be a node and transition symbol (path) that yi takes leading to the first differing pair of states. This is called the divergence point.

Thus, a counterexample indicates that somewhere along the string y = y1…yn of n input symbols, at one of the prefixes, ym, M’ and T diverge onto different paths leading to different states. The divergence point is ym-1, the immediate prefix before ym, where the pair of differing states sM and sT is obtained from M’ and T respectively in the test, ym being the prefix at which divergence occurs.

The current tree, T, is used to trace the common ancestor of sM and sT, that is, the node d that distinguishes the leaves represented by sM and sT. Both d and ym-1 are used to update the classification tree. Figure 3.20 shows how the divergence point is found from a counterexample y.
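The prefix analysis can be sketched as follows: walk M’ (given as a transition table) over each prefix of y and, in parallel, ask the tree which state that prefix belongs to. Here `tree_state` is a stand-in for sifting through T, states of M’ are named by their access strings, and all names and the stub values below are assumptions consistent with the run in Figure 3.20, not the report's code.

```python
# Sketch of locating the divergence point of a counterexample y between
# the conjecture M' (transition dict `delta`) and the tree T (`tree_state`).

def divergence_point(y, delta, q0, tree_state):
    """Return the least m at which M' and T disagree on the prefix y[:m]."""
    q = q0
    for m in range(1, len(y) + 1):
        q = delta[(q, y[m - 1])]      # state of M' after reading y[:m]
        if tree_state(y[:m]) != q:    # state according to the tree T
            return m                  # divergence occurs at y_m
    return None                       # no divergence found
```

For y = 11 with the values reported in Figure 3.20 (sM = sT = 01 after y1 = 1, but sM = 01 and sT = λ after y2 = 11), the function returns m = 2, so the divergence point is the prefix y[:m-1] = 1.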

Figure 3.20: The unknown FA accepts all strings with an even number of 0’s and 1’s. The conjecture M’ is returned with counterexample y = 11 in the equivalence query. Each prefix of y is traversed in T’ and also in M’ to find the divergence point, y1 = 1, where divergence occurs at y2. The common ancestor of 01 and λ, to which y2 diverged, is 0 in T’.

(Traversal of the prefixes of y = 11: for y1 = 1, sM = 01 and sT = 01; for y2 = 11, sM = 01 but sT = λ. The divergence point is y1, and the common ancestor is d = 0.)


As the learner learns more information and properties of c, the new information and properties (regarding new states) are recorded in the tree by extending the nodes and leaves. These updates are carried out using only the information from the equivalence query, that is, the counterexample. The counterexample is analysed for the divergence point, and the results of the divergence analysis are the common ancestor d and the prefix ym-1 representing the divergence point. The tree is therefore updated using d and ym-1 as follows:

a) a new access string, ym-1 (i.e. a prefix of y), to add to S. A new state represented by ym-1 is discovered, and a new leaf is added to T as S is extended. [S is extended to include the prefix of a counterexample representing a newly discovered state of c, as shown by the shaded leaf 1 in Figure 3.21.]

b) a new distinguishing string, a.d, to add to D, where
d: the common ancestor of sT and sM
a: the input symbol that leads ym-1 to sT and sM (i.e. ym = ym-1.a)
ym = y1…ym: the prefix of y ending at its mth symbol
sT and sM: the states reached in T and M’ respectively
[D is extended when the counterexample is returned and a prefix of the counterexample (to be included in the extension of S) is identified as the point of divergence. D is extended to include the new distinguishing string, a.d, where d is the common ancestor node and a is the input symbol leading from the divergence point to the two different states reached. The new node is shaded in Figure 3.21.]

The new extension replaces the leaf sT by the new string a.d, forming a new internal node. The leaves sT and ym-1 become the children of the new node, and their positions depend on whether each, concatenated with a.d, is accepted, determined through membership queries. The suffix-closed property of D maintains reachability from other states to the final state(s) each time a new state (leaf) is discovered (i.e. added to S). Figure 3.21 shows how the tree T’ of Figure 3.20 is updated and used to build a new conjecture M”.
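The leaf-splitting step can be sketched as follows, assuming a small Node/Leaf representation of the tree; the layout, names and the example values used in the usage note are illustrative assumptions, not the report's exact run.

```python
# Sketch of replacing the leaf for s_T by a node labelled with the new
# distinguishing string a.d, whose two children are s_T and y_{m-1};
# each child's side is fixed by a membership query on s + a.d.

class Node:
    def __init__(self, d, left, right):
        self.d, self.left, self.right = d, left, right

class Leaf:
    def __init__(self, s):
        self.s = s

def split_leaf(s_T, y_prefix, new_d, member):
    """Build the node new_d with leaves s_T and y_prefix as children."""
    child = {True: None, False: None}
    child[bool(member(s_T + new_d))] = Leaf(s_T)
    child[bool(member(y_prefix + new_d))] = Leaf(y_prefix)
    # new_d distinguishes the two strings by construction, so the two
    # queries give opposite answers and both children are filled
    return Node(new_d, child[False], child[True])
```

For instance, with the even-0’s-and-1’s oracle, splitting the leaf λ against the new access string 1 on the distinguishing string 1 places λ on the left (λ·1 rejected) and 1 on the right (1·1 accepted); these values are hypothetical, chosen only to show the placement rule.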

Figure 3.21: The updated tree from Figure 3.20, T”, has a new node and a new leaf (shaded) added once the divergence point is found in the previous counterexample. The new conjecture M” is presented in an equivalence query and no counterexample is returned.


We illustrate another example of L2 learning the FA that accepts the strings with an even number of 0’s and 1’s in Figure 3.22 below. The divergence point found at each step is represented by ym.

Conjecture   Equivalence query        Classification tree
M0           no, y = 01               T0: S = {λ, 01}, D = {λ}
M1           no, y = 00, ym = 00      T1: S = {λ, 01, 0}, D = {λ, 0}
M2           no, y = 11, ym = 11      T2: S = {λ, 01, 0, 1}, D = {λ, 0, 1}
M3           yes

Figure 3.22: Running example of learning the unknown FA that accepts the set of strings with an even number of 0’s and 1’s.

Therefore, the access strings in S are prefixes of counterexamples, and the size of S grows with the number of counterexamples returned (or equivalence queries made). L2 maintains an S in which each string represents a distinct state of the minimum DFA for M; the size of S is at most the size of that minimum DFA at any point during learning. Hence, each counterexample produces a new access string, which immediately yields a new conjecture with a newly discovered state.


3.1.3 Discussion

L1 and L2 have data structures that capture the properties of the unknown FA M during learning:
a) the finiteness of M, represented by the finite number of rows in the observation table and of leaves in the classification tree;
b) the distinct states of the minimum DFA representing M;
c) the transitions to next states.
The principles and characteristics of L1 and L2 are summarised in Table 3.3.

Both algorithms are consistent with the data received (or seen) each time a conjecture, M’, is made: M’ classifies all positive and negative examples correctly if they are presented again to the learner. M’ is constructed by L1 and L2 as the minimum DFA representation consistent with the training set and the additional information received through queries, equivalent to the unknown FA, M. Both L1 and L2 are said to learn in the limit as proposed by Gold, since each terminates after a finite number of counterexamples and queries with a consistent and correct conjecture (or hypothesis), M’, indicated by the ‘yes’ answer to an equivalence query.

However, the use of queries seems to place certain requirements and assumptions on the teacher during learning:

a) to have a substantially large amount of knowledge about the unknown FA, M. This requirement is exhibited through the equivalence query, where the counterexamples returned by the teacher are arbitrary, suggesting that the teacher knows a great deal about what is accepted and rejected by M. The teacher has to look into the symmetric difference between the conjectured M’ and M, which is not possible when the teacher does not know exactly what M is, or does not have the complete set of examples describing M.

b) to be able to classify, and classify correctly, any membership query presented by the learner. As the choice of string to query is the learner’s, the teacher does not know what to expect and is instead expected to classify any string presented correctly (i.e. without error or misclassification) and instantly (i.e. to know already whether the string belongs to M), suggesting a very knowledgeable teacher, which is not true in most actual machine learning settings.

Both L1 and L2 make the above assumptions about the teacher: the teacher is to provide a correct answer to every query and to be able to provide examples (or counterexamples) until learning is complete (i.e. the algorithm halts when equivalence is achieved).


Algorithms: L1 (using observation table) vs L2 (using classification tree)

Goal
  L1: Discovery of new states while maintaining reachability from the initial state and correspondence with the final states (connectivity of states preserved).
  L2: Discovery of new states corresponding to the final state, but without reachability from the initial state (a conjecture is a set of disconnected states of an FA).

State representation
  L1: Obtained from the prefix-closed set, where each distinct element (i.e. distinct row vector of attributes) represents a state in M.
  L2: Only one prefix from each counterexample returned is taken as a new access string, and thus a newly discovered state of M.

Initial conjecture, M’
  L1: Constructed after the data structure (a consistent and closed observation table) is built. M’ accepts strings from the training set that are also accepted by M.
  L2: Constructed before the data structure is built. M’ accepts or rejects all strings, A* (depending on whether the null string is accepted by M).

Transitions representation
  L1: From the next-state representation in the S.A set.
  L2: From membership queries along the nodes of the updated tree, for all leaves.

Final state(s) representation
  L1: The row(s) with a mapping value of 1 in the first (null string) column (rows also represent the strings used to reach each state).
  L2: The leaves in the right subtree of the root.

Counterexample role
  L1: To create a new live state (one with transitions leading to a final state). Unlike in L2, reachability from the initial state is preserved, and all prefixes of the counterexample are added to the prefix-closed set S.
  L2: To provide a new prefix as the access string for a new state. Only one prefix is required, which has transitions leading to a final state but may not be reachable from the initial state, instead of all prefixes being considered as in L1.

Queries (both)
  The equivalence query provides counterexamples to improve the conjecture; the membership query provides the means for acquiring transitions between states and determines the next state.

Distinguishing strings (both)
  Closely related to the Myhill-Nerode Theorem: each state consists of a set of strings representing it, with the same behaviour or property.

Prefix-closed property
  L1: Observed in the set S, which maintains reachability from the initial state.
  L2: Absent, except that a prefix from the counterexample is needed to represent a new state, as in L1.

Suffix-closed property (both)
  A very important property for distinguishing states by relating each state to the accepting/rejecting state(s). It also appears to be an important property for the minimality constraint in learning a DFA.

Examples / training set (both)
  Positive and negative examples in a training set provided by the teacher as the source of examples.

Table 3.3: A summary of characteristics and properties of L1 and L2 where the unknown FA is M.


3.2 Learning without queries

The learning algorithms L1 and L2 use queries to obtain more information when learning an unknown FA, M, and both represent all their guesses in the form of a DFA. This section discusses another learning algorithm, L3, that learns successfully without the use of any queries; unlike L1 and L2, each of its guesses is represented as a tentative non-deterministic FA (NFA). L3 learns only from examples presented by the teacher and conjectures a DFA for the learnt M after some finite time.

The information needed for learning is provided passively through a complete training set of labelled strings presented to the learner in lexicographic order4. A complete set of labelled strings consists of all strings from the null string, λ, up to some string x. This complete lexicographically ordered training set therefore consists of strings t = {+λ, +0, …, +x, …}, where the alphabet A is the ordered set {0, 1} with 0 <l 1. M’ is the FA, equivalent to M, from which the labelled strings in t are generated.
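This complete lexicographic order (shorter strings first, equal lengths in dictionary order, as defined in the footnote) can be sketched as a generator; the function name and binary alphabet are illustrative assumptions.

```python
from itertools import count, product

# Sketch of enumerating A* in the complete lexicographic order used for
# the training set t, assuming the alphabet {0, 1} with 0 <l 1.

def lex_strings(alphabet=("0", "1")):
    """Yield λ, 0, 1, 00, 01, 10, 11, 000, ...: shorter strings first,
    equal lengths in dictionary order."""
    for n in count(0):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)
```

Pairing each generated string with its label under M would reproduce a training set of the form t = {-λ, +0, +1, …} used in the runs below.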

The data structure used in L3 to hold the information received and learnt is a set of minimum-length strings, known as minword(q). The minword(q) for a given state q is the shortest string required to reach q from the initial state q0. A new minword(q) is added to the data structure during learning to represent a newly discovered distinct state q of c. The number of minword(q) learnt is therefore the number of states in the minimum DFA representing M.

Learning always starts with λ, as shown in Figures 3.23(a) and 3.23(b). Note that λ is always the minword(q0) of the initial state q0 for any DFA. The set of minword(q) is also kept in lexicographic order: given two shortest strings x and y of the same length reaching a state q, x is chosen as the minword(q) if and only if x <l y. Figure 3.23(a) shows the string 0 as the minword(q) of the accepting state, as 0 <l 1.
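For a known DFA, the set of minword(q) can be computed by a breadth-first search that visits symbols in order, so ties between equal-length strings resolve to the lexicographically smaller one; the transition-dict encoding and names are illustrative assumptions, not part of L3 itself.

```python
from collections import deque

# Sketch of computing minword(q) for every state of a known DFA by BFS,
# visiting symbols in order (0 before 1, since 0 <l 1).

def minwords(delta, q0, alphabet=("0", "1")):
    """Return {state: its minword}, the shortest (and among equals,
    lexicographically least) string reaching each state from q0."""
    mw = {q0: ""}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            nxt = delta[(q, a)]
            if nxt not in mw:       # first arrival fixes the minword
                mw[nxt] = mw[q] + a
                queue.append(nxt)
    return mw
```

Encoding the FA of Figure 3.23(b) as states tracking the parities of 0’s and 1’s, this yields the set {λ, 0, 1, 01} given in the figure.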

Figure 3.23: The transition arcs are shown as bold and dashed arrows, where the bold arcs represent the shortest path (string) needed to reach each state in both FAs. (a) The FA accepting all non-empty strings, with two minword(q), λ and 0, representing the rejecting and accepting states respectively; (b) the FA accepting strings with an even number of 1’s and 0’s, with four minword(q) representing the four states of c. (Dashed and bold arrows represent mutable and permanent links respectively; see the following section.)

L3 conforms to the learning in the limit framework as it makes continuous guesses by constructing an NFA for each input string received from the training set, t, and stops learning

4 Given the lexicographic sequence of strings over A*, with x = x1 x2 … xm and y = y1 y2 … yn strings in that sequence, x <l y if (i) m < n, or (ii) m = n and there exists an s, 1 ≤ s ≤ m, such that xi = yi for i = 1, …, s-1 and xs <l ys. (E.g. the lexicographic sequence over A* is {λ, 0, 1, 00, …, x, y, …}, where y is the successor of x.)


when a consistent and correct guess is conjectured (i.e. after seeing a finite number of strings in t). The training set presents labelled strings {λ, …, x, …} in complete lexicographic order, and all the guesses made on input strings received after some x are the same consistent DFA (i.e. for all y >l x, My = Mx); Mx is then output as the equivalent of M, with a set of minword(q) representing the set of states discovered during learning.

All the states are represented by {minword(q)}, and each state is discovered in lexicographic order, according to the order of discovery of each minword(q). Thus, the set of minword(q) obtained after some finite time is a finite set of prefixes from t. The number of minword(q) discovered is the size of the minimum DFA: for a minimum DFA with n states, only n minword(q) are required to infer it.

The learning algorithm L3 also uses two other elements besides minword(q) in learning M: permanent links and mutable links. L3 is discussed in the next section using the FA in Figure 3.23(a). Its best guesses (i.e. the same, consistent guesses) are constructed using the permanent and mutable links shown in Figure 3.23 (represented in the following diagrams as bold and dashed arcs respectively) and a set of minword(q) of the size of M (i.e. its number of states).

3.2.1 L3: by Porat and Feldman [Porat and Feldman 91]

The transitions in an NFA built by L3 during learning are called links (i.e. multiple next-state transitions on each input symbol of the alphabet), and they are categorised into two types: permanent links and mutable links. L3 uses the stored set of minword(q) and the links in its previous guess to construct a new NFA upon receiving an input string from the training set in lexicographic order. Figures 3.24-3.26 show how L3 learns through the insertion and deletion (of mutable links only) of the two types of links, and Figure 3.27 shows the complete learning process of all guesses made during learning.

permanent link

Only one incoming permanent link (transition) can be established for each state q in c. This link forms a new minword(q’) by inserting a new permanent link with transition a from an existing state q to a new state q’. The new state is then represented by minword(q).a, where minword(q) represents state q. The first permanent link is formed when the first example is received from the training set t. Figure 3.24 shows that, upon seeing -λ as the first example, the initial state q0 is discovered and is the only state with an incoming arc (transition) labelled λ, a symbol not from the alphabet.

Figure 3.24: The initial state, q0, is discovered and a permanent link represented by the null string, λ, points towards q0.

This property maintains reachability of each state from the initial state, q0 (i.e. the state represented by the null string λ, which is minword(q0)). Thus, the shortest path from q0 to a state q is a sequence of permanent links (a concatenation of permanent-link transition symbols) forming the minword(q) for q. A new state, q’, is added if and only if there is no existing path (sequence of links) that brings the example x from q0 to a final state (i.e. no right path for x).

(The first permanent link is the transition λ reaching the initial state q0; minword(q0) = λ represents a non-accepting state, as the first example is the negative example -λ.)


mutable link

There may be as many as size(M) (i.e. the number of states in M) transitions from a state in M’ for each symbol of the alphabet. The role played by membership queries in L1 and L2 for next-state transitions is replaced by mutable links (transitions). A mutable link with transition a from an existing state q to another existing state q’ is established (inserted) if and only if two conditions are satisfied:
a) minword(q).a >l minword(q’)
b) there is no existing permanent link (i.e. outgoing transition) from q for the symbol a.

The insertion of mutable links eliminates the use of queries for strings yet to be seen (i.e. strings further up in the lexicographic order of examples in the training set, t). Figure 3.25 shows the strings 0 and 1, which are examples further up in t and also the next-state transitions from q0. There is no permanent link with symbol ‘0’ or ‘1’ from q0, and minword(q0).a >l minword(q0) for every a in the alphabet. Thus, a mutable link is inserted for each a from q0 to itself, as there is no other state satisfying the two insertion conditions specified above.

Figure 3.25: An NFA Mλ is constructed upon receiving the first example from t, -λ, with dashed lines indicating mutable links.

The mutable links established can also be deleted as more examples are presented to the learner in lexicographic order from t. Upon receiving an example x, the first mutable link along a wrong path, that is, a path on which x ends in a rejecting state when it is a positive example (and vice versa), is deleted. Deletion of mutable links has one of the following consequences:

a) Insertion of a new state, q’, new permanent link, a, and new mutable links

Deletion of all wrong paths for x may leave no right path to bring x to a final state (e.g. there is no right path for the example ‘+0’ in Figure 3.26(a), where all mutable links on ‘0’ from q0 are deleted in Mλ). A new state, q’, is then added, along with a new permanent link, labelled with the deleted transition symbol a, from the state q where the first mutable link(s) were deleted. The newly discovered state q’ is represented by minword(q’) = minword(q).a, and this string is added to the current set of minword(q) maintained by L3. Figure 3.26(b) shows the insertion of a new state q1 and a new permanent link with symbol ‘0’ from q0 to q1. The new state q1 is an accepting state, as minword(q1) (= minword(q0).0 = 0) is accepted by c. More mutable links are then inserted from both q0 and q1. A new guess, M0, is obtained in Figure 3.26(c), with the set of minword(q) = {λ, 0} representing q0 and q1 respectively.


Figure 3.26: Mλ is presented with the next example, +0. (a) Deletion of the wrong paths for ‘+0’ (i.e. deletion of all mutable links labelled ‘0’, which lead to a rejecting state for ‘+0’); (b) addition of the new state q1, with a second permanent link ‘0’ from q0 and a new minword(q1) = 0; (c) M0, with newly inserted mutable links, is output as the new best guess for the examples {-λ, +0}.

b) Amortised NFA with deletion of wrong path(s) (i.e. mutable links)

If, after the deletion of the wrong paths for an example x, some right paths remain, then the previous NFA becomes an amortised NFA: the same automaton with fewer mutable links, the wrong paths having been deleted relative to the examples seen. Wrong paths are also checked, when new mutable links are inserted (i.e. only when a new state is inserted), against the paths remaining after deletion with respect to x in the previous NFA. The amortised NFA is then output as the best guess consistent with the examples seen so far, {λ, …, x}.

Figure 3.27 shows the new guess M1, built upon seeing the example string +1 (i.e. the amortised M0 with the first mutable link(s) along the wrong path(s) for ‘+1’ deleted), which is consistent with the examples seen so far, {-λ, +0, +1}. Each conjecture is amortised through the deletion of links as subsequent examples are presented to the learner from t, until a consistent DFA is obtained in which all links are deterministic, as represented by the conjecture M01, where t = {-λ, +0, +1, +00, +01, +11, …}. The final tentative hypothesis M01 remains the same, correct conjecture for all examples from ‘+01’ onwards presented in complete lexicographic order. Thus, M01 is learnt in the limit by L3 as the representation of M, equivalent to the FA in Figure 3.23(a).


Figure 3.27: L3 learning the unknown FA M that accepts the set of all non-empty strings. The conjecture at input +11, M11, is consistent for all subsequent inputs and is deterministic in nature. Thus, in the limit, the earlier conjectures, which are NFAs, have become a DFA in the final conjecture.

Examples from t / Conjecture (NFA) / minword(q) / Links (permanent and mutable):

-λ: Mλ; state q0; minword(q0) = λ; permanent link = {λ} (to q0); mutable links from q0 to q0 on ‘0’ and ‘1’.

+0: M0; states q0, q1; minwords λ, 0; permanent links = {λ, 0}; insertion of the new state q1; insertion of mutable links for all next-state transitions on ‘0’ and ‘1’, except for transition ‘0’ from q0, because a permanent link exists for ‘0’ from q0.

+1: M1; states q0, q1; minwords λ, 0; no permanent link necessary, as there is a right path for ‘+1’; deletion of the wrong paths for ‘+1’ (i.e. the mutable link ‘1’ from q0).

+00: M00; states q0, q1; minwords λ, 0; no permanent link necessary; deletion of the wrong paths with respect to ‘+00’.

+01: M01; states q0, q1; minwords λ, 0; no permanent link necessary; deletion of the mutable link ‘1’ from q1 that is incorrect for ‘+01’.

+11: M11 = M01; states q0, q1; minwords λ, 0; all links are deterministic and there exist right paths for every example seen.



3.2.2 Running L3 on worked examples

Figure 3.24 shows several of the guesses made while L3 learns another unknown FA, which accepts all strings with an even number of 1's and 0's. The final guess M0110 is equivalent to M in Figure 3.19(b). Compared with Mλ in Figure 3.23, Mλ in Figure 3.24 accepts all strings in the first guess, because the initial state q0 is the accepting state, as minword(q0) (= λ) is a positive example received from t. The final conjecture M0110 is a DFA with four distinct states, represented by the strings λ, 0, 1 and 01 respectively. M0110 can be taken as an amortised NFA with only the states q0, q1, q2 and q3 and deterministic next-state transitions, as the result of mutable link deletion.
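minword(q) is just the breadth-first, lexicographically least access string of a state. A minimal sketch of computing it (our own illustration; the dictionary encoding of the four-state DFA for even numbers of 0's and 1's is an assumption):

```python
from collections import deque

def minwords(delta, q0, alphabet=('0', '1')):
    """Shortest (breadth-first, then lexicographic) access string per state."""
    mw = {q0: ''}          # minword(q0) = λ, written here as ''
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in alphabet:                 # alphabet taken in lexicographic order
            q2 = delta.get((q, a))
            if q2 is not None and q2 not in mw:
                mw[q2] = mw[q] + a
                queue.append(q2)
    return mw

# Assumed encoding of the target DFA: q0 accepting, transitions flip parities.
delta = {('q0', '0'): 'q1', ('q0', '1'): 'q2', ('q1', '0'): 'q0',
         ('q1', '1'): 'q3', ('q2', '0'): 'q3', ('q2', '1'): 'q0',
         ('q3', '0'): 'q2', ('q3', '1'): 'q1'}
print(minwords(delta, 'q0'))
# {'q0': '', 'q1': '0', 'q2': '1', 'q3': '01'} -- i.e. λ, 0, 1, 01
```

The four access strings λ, 0, 1 and 01 match the minword(q) values of the final conjecture M0110.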

Figure 3.28: The sequence of guesses Mλ, M0, M1 are the conjectures made upon receiving the examples +λ, -0, -1 respectively, in learning the unknown FA that has the complete lexicographic t = {+λ, -0, -1, +00, -01, -10, +11, …, +0111, …}, where +0111 and every subsequent example in t produces the same and correct conjecture as M0111.

(cont.)

+λ (conjecture Mλ; states: q0 with minword λ):
• permanent link = {λ} (i.e. to q0)
• insertion of the initial state q0, which is also the accepting state
• insertion of mutable links: from q0 to q0 for all transitions '0' and '1'

-0 (conjecture M0; states: q0, q1 with minwords λ, 0):
• permanent link = {λ, 0}
• insertion of new state, q1 (i.e. a rejecting state)
• insertion of mutable links: all the next-state transitions '0' and '1' for q0 and q1, except for transition '0' from q0, because a permanent link exists for transition '0'

-1 (conjecture M1; states: q0, q1 with minwords λ, 0):
• permanent link: not necessary, as there is a right path for '-1'
• mutable link: deletion of wrong paths for the example -1 (i.e. the mutable link '1' from q0)


Figure 3.24: L3 discovered state q2 upon seeing another example, '-10'; minword(q2) = 1 represents the newly discovered state in the conjecture M001, which is consistent with the examples seen from λ to 001.

+00 (conjecture M00; states: q0, q1 with minwords λ, 0):
• permanent link: not necessary
• mutable links: deletion of transition '0' (the first mutable link along the wrong path for '+00') from q1 to q1

-01 (conjecture M01; states: q0, q1 with minwords λ, 0):
• permanent link: not necessary
• mutable link: deletion of transition '1' from q1 to q0

-10 (conjecture M10; states: q0, q1, q2 with minwords λ, 0, 1):
• permanent link: new transition '1' from q0 to q2
• insertion of new state q2 and insertion of mutable links
• mutable link deletion of wrong paths for x:
a) x = '-10' (i.e. mutable link '1' from q1 to q0 and '0' from q2 to q0)
b) x = '-0' (i.e. mutable link '0' from q2 to q0)

+11, …, -001 (conjectures M11, …, M000, M001 = M11; states: q0, q1, q2 with minwords λ, 0, 1):
• permanent link: not necessary
• mutable link: deletion of wrong paths with respect to '+11' (i.e. transition '1' from q2 to q2 and q1)
[upon seeing each example in {+11, …, -001} the learner produces a conjecture equivalent to M11]


(cont.)

Figure 3.24 (cont.): Upon seeing example '-011', L3 discovered the fourth state q3, represented by the new minword(q3) = 01, and made a new guess M011 (i.e. an amortised NFA after mutable link deletion) as the best guess to M.

-010 (conjecture M010; states: q0, q1, q2 with minwords λ, 0, 1):
• permanent link: not necessary
• mutable link: deletion of wrong paths for -010 (i.e. mutable link '1' from q1 to q1)

-011 (conjectures M010 then M011, the amortisation of M010 through mutable link deletion; states: q0, q1, q2, q3 with minwords λ, 0, 1, 01):
• permanent link: insert transition '1' from q1 to q3
• insertion of new state q3
• insertion of mutable links between two states, established subject to the two conditions:
a) minword(q').a >l minword(q)
b) no existing permanent link 'a' from q'
where a is the alphabet symbol
• mutable link deletion of wrong paths with respect to previously seen strings (paths) in M010:
a) '+11' (i.e. transition '1' from q2 to q3)
b) '-010' (i.e. transition '0' from q3 to q0)
c) '-011' (i.e. transition '1' from q3 to q0)


(cont.)

Figure 3.24 (cont.): L3 successfully learns c by producing a conjecture M0110 that is consistent with the examples from t = {+λ, …, +0110, …}, where all the guesses for each example seen after '+0110' are equivalent to M0110.

-100 (conjecture M100; states: q0, q1, q2, q3 with minwords λ, 0, 1, 01):
• permanent link: not necessary
• deletion of mutable links along wrong paths for the example '-100' (i.e. transition '0' from q2 to q1 and q3)

-110, …, +0011, …, +0110, y (conjectures M110 = M101, …, M0011 = M0010, …, M0110, the amortisation of M100 through mutable link deletion as more examples are presented by the teacher; states: q0, q1, q2, q3 with minwords λ, 0, 1, 01):
• mutable link deletion of wrong paths for the example succ(x) with respect to the paths in the previous conjecture Mx:
a) succ(x) = '-101' (i.e. transition '1' from q2 to q3 in M100)
b) succ(x) = '-0100' (i.e. transition '0' from q3 to q1 in M0011)
c) succ(x) = '+0101' (i.e. transition '0' from q3 to q3 in M0100)
d) succ(x) = '+0110' (i.e. transition '1' from q3 to q3 in M0101)


For all y ∈ A* with y >l 0110, My = M0110.


3.2.3 Discussion

The learner L3 learns passively, only receiving and storing the information from the teacher in an ordered sequence (i.e. the complete lexicographically ordered training set, t), without the additional queries available to both L1 and L2. The guesses (i.e. the conjectured NFAs) made by the learner are consistent with the data (i.e. the labelled strings) seen so far, but may be non-deterministic in nature.

The mutable and permanent links in a guess have several characteristics, which follow from the conditions specified for inserting new links:

• only one permanent link per transition symbol 'a' from any one state, with no mutable links for that symbol 'a': satisfying the condition for inserting a new permanent link

• all links for a transition symbol 'a' which has no permanent link from a state are mutable: satisfying the condition for inserting new mutable links

• each state can have only one incoming permanent link, which is the last transition in the shortest path from q0 to q, forming a unique shortest path for q (i.e. the minword(q) property is preserved throughout learning)

The constraint that the training set be in complete lexicographic order is necessary to satisfy the conditions specified for inserting new links (permanent and mutable). Together with those conditions, this constraint has a twofold effect:

a) Eliminating the use of queries
The permanent links established represent both the state and the 'fixed' path of a seen string (i.e. a deterministic path from a state that can only exist if the string is known, that is, seen by the learner). The mutable links, on the other hand, provide temporary links between existing states in the form of next-state transitions. Their insertion is constrained by the conditions requiring minword(q).a >l minword(q') and no permanent link 'a' from q, so that the added links remain consistent with the data seen so far. Figure 3.23 shows how M0 is constructed with the insertion of the new state: the mutable link '0' from q0 to q1 is not inserted, because minword(q0).0 = minword(q1) and a permanent link '0' already exists as the 'fixed' (i.e. deterministic) transition for '0'.

If there are multiple transitions for a symbol 'a' from q, they are guaranteed to be reduced (i.e. deleted) by examples further along in t, when minword(q).a is found to lie along a wrong path for a particular example x. Thus, the learner is safe to assume plausible paths for unseen strings by establishing mutable links, with the guarantee that only the right path will be maintained as more data is seen; this eliminates the membership queries used in L1 and L2, which play the same role as the mutable links and their insertion conditions. This can be seen in the conjecture M011 in Figure 3.24: the mutable links are added after the insertion of the new state q3, satisfying the insertion conditions above, but the mutable links that do not preserve the paths in the previous conjecture M010 are then deleted. The mutable links of a non-deterministic nature in M011 are deleted gradually (amortisation) as more examples are presented, until a DFA is produced upon seeing example '+0110'.
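The two insertion conditions for a mutable link can be stated compactly. The sketch below is our own illustration, with assumed encodings (minword values stored as strings, permanent links keyed by (state, symbol)) and with >l read as the shortlex order of the complete lexicographic enumeration:

```python
def may_insert_mutable(mw, permanent, q, a, q2):
    """Mutable link a: q -> q2 may be inserted only if minword(q).a >l
    minword(q2) and there is no permanent link on symbol a leaving q."""
    def lex_gt(u, v):                      # shortlex: shorter first, then lex
        return (len(u), u) > (len(v), v)
    return lex_gt(mw[q] + a, mw[q2]) and (q, a) not in permanent

mw = {'q0': '', 'q1': '0'}                 # minword(q0) = λ, minword(q1) = 0
permanent = {('q0', '0'): 'q1'}            # the fixed '0' transition from q0
print(may_insert_mutable(mw, permanent, 'q0', '0', 'q1'))
# False: minword(q0).0 = minword(q1) and a permanent '0' link already exists
print(may_insert_mutable(mw, permanent, 'q1', '0', 'q0'))
# True: '00' >l λ and q1 has no permanent '0' link
```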

b) Always producing a minimum DFA representing the unknown FA
The mutable links are gradually deleted, resulting in an amortised conjecture derived from the current one upon seeing a new example. The deletion and insertion of links also preserve the right paths from the previous conjecture, thus keeping the new guess consistent with the examples seen before. The whole learning process is iterative, with a new state inserted as constrained by the condition for permanent link insertion, which requires that there be no path leading an example x to a final state (i.e. there is a state from which there is no transition 'a' occurring in the string x, as the transition 'a' must lie along a wrong path for x). The number of states in a guess at any one time during learning is at most the size of M (i.e. the unknown FA), as the set of minword(q) is extended


gradually, in lexicographic order, according to the order of examples from t. Thus the conjecture M' obtained after some finite time (i.e. after seeing some finite number of strings up to x) is deterministic, with no multiple mutable links from any state, and minimal in size. M' is then taken as the learner's hypothesis of M: the final conjecture is an NFA reduced, in terms of its mutable links and number of states, to be equivalent to M.

However, the constraint that t be in complete lexicographic order also assumes that the teacher has a fair amount of knowledge about the unknown FA being learnt, as the strings between λ and x must be classified correctly by the teacher. The teacher is expected to provide examples from the shortest string (i.e. the null string, λ) up to at least the string x for which, for all y >l x, My = Mx; the final hypothesis equivalent to M is then Mx. A missing or misclassified string may cause the learner to make an incorrect conjecture, which could then lead to an incorrect final guess (i.e. not the exact minimum DFA representing c), and there is no way of testing for this, as no equivalence query is used.

In Figure 3.25, L3 received an incomplete t in which '+0' is missing (not provided by the teacher), and thus conjectured an incorrect guess with the wrong permanent link attached to q2, as compared with M01 in Figure 3.23. Though M11 is the same and correct conjecture learnt in the limit for the unknown M, it has more than two states, which violates the minimality property maintained by a complete t, and the final conjecture is also not a DFA.

Figure 3.29: L3 learning the FA that accepts all non-empty strings, with a missing example '+0' (i.e. the '?' in t) in the incomplete t = {-λ, ?, +1, …}. This results in a different set of minword(q), which excludes '0' as the learner has not seen (and will not see) it.

-λ (conjecture Mλ; states: q0 with minword λ):
• permanent link = {λ} (i.e. to q0)
• insertion of the initial state q0, which is also the rejecting state
• insertion of mutable links: from q0 to q0 for all transitions '0' and '1'

+1 (conjecture M1; states: q0, q1 with minwords λ, 1):
• permanent link = {λ, 1}
• insertion of new state, q1 (i.e. an accepting state)
• insertion of mutable links: all the next-state transitions for q0 and q1, except for transition '1' from q0, because a permanent link exists for transition '1'


(cont.)

Figure 3.25 (cont.): L3 produces an NFA, M11, as the conjectured FA equivalent to M: not exactly M (i.e. not a minimum DFA), but accepting the same set of strings as M does.

Note that although the example shown in Figure 3.25 does produce an FA which accepts the same set of strings (i.e. the same regular language), this may not be the case when an example is misclassified by the teacher or when more than one example is missing. This constraint on t is also, indirectly, a constraint on the teacher, as it is the teacher on whom the learner L3 relies for examples (information) through t. Such a constraint on the teacher is highly unrealistic and undesirable: the teacher may not have the complete set of information at hand, and it implies an unnatural learning process in which the learner cannot make a single mistake in its guesses, or rather, in which the teacher must produce a highly accurate and complete set of examples for the learner.

+00 (conjecture M00; states: q0, q1, q2 with minwords λ, 1, 0):
• permanent link: new transition '0' from q0 to q2
• insertion of new state q2 and insertion of mutable links
• mutable link deletion of wrong paths for x:
a) x = '+00' (i.e. mutable link '0' from q0 to q0 and from q2 to q0)

+01, …, +11, …, y (conjectures M01, …, M11, …, My = M11; states: q0, q1, q2 with minwords λ, 1, 0):
• permanent link: not necessary
• mutable link deletion of wrong paths for x:
• x = '+01' (i.e. mutable link '1' from q2 to q0)
• x = '+11' (i.e. mutable link '1' from q1 to q0)
[upon seeing each example in {+11, …} the learner produces a conjecture equivalent to M11]


For all y ∈ A* with y >l 11, My = M11.


3.3 Homing Sequences in Learning FA

Two papers [Kearns et al 94; Rivest et al 93] look at learning FA from a different view. They show that there is another method of learning FA, also using queries, which is 'continuous': the learner is not restricted to accessing the unknown FA from a fixed initial state q0 every time a membership or equivalence query is made (i.e. having to 'pause' and return to the fixed initial state). Instead, the learner is able to 'continue' from where it last stopped (i.e. the final state reached in the previous query). This enables the learner to start from different initial states in M, with a sense of 'orientation', while learning M. This ability to learn without having to start from q0 is known as 'no-reset' learning, in contrast to the 'reset' learning of the two algorithms discussed previously, L1 and L2. We shall call the learning algorithm with the 'no-reset' ability L4.

As there is no fixed start state to return to after every query as in L1 and L2, learning in L4 is done through observation of the output behaviour of M on an input string x = x1…xn, where n is the length of x. The output behaviour observed is a sequence of '0' and '1' symbols, representing rejecting and accepting states q respectively. The output sequence q⟨x⟩ is then the sequence of accepting and rejecting states reached on each input symbol of x, starting from state q as the initial state. The final state qx, reached on the last input symbol xn, has a single value of '0' or '1' depending on whether x is rejected or accepted by M. The state qx then becomes the next initial state for the next input string y, as shown in Figure 3.26.

q⟨λ⟩ = ⟨q⟩
q⟨x1…xn⟩ = ⟨q⟩⟨qx1⟩…⟨qx⟩

where ⟨q⟩ = the accepting/rejecting symbol for state q,
x = x1…xn and xi is the i-th input symbol of x, for 1 ≤ i ≤ n
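The definition of q⟨x⟩ above can be sketched directly (our own illustration; the dictionary encoding of the two-state FA of Figure 3.30 is assumed):

```python
def output_sequence(delta, accepting, q, x):
    """q<x>: the accept(1)/reject(0) symbols of the states visited from q
    on input x, together with the final state qx reached."""
    out = ['1' if q in accepting else '0']
    for a in x:
        q = delta[(q, a)]
        out.append('1' if q in accepting else '0')
    return ''.join(out), q

# Two-state FA for all non-empty strings: q0 rejecting, q1 accepting,
# every symbol leads to q1.
delta = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
         ('q1', '0'): 'q1', ('q1', '1'): 'q1'}
print(output_sequence(delta, {'q1'}, 'q0', '0'))    # ('01', 'q1')
print(output_sequence(delta, {'q1'}, 'q0', '10'))   # ('011', 'q1')
```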

Figure 3.30: Illustration of output sequences for a given input string of an FA with two states


Example of output sequences on a given input string:

i) input string x = 0: q0⟨0⟩ = ⟨q0⟩⟨q00⟩ = 01
ii) input string x = 10: q0⟨10⟩ = ⟨q0⟩⟨q01⟩⟨q010⟩ = ⟨q0⟩⟨q1⟩⟨q1⟩ = 011


3.3.1 Homing sequence

A property of finite automata called the homing sequence has been studied extensively by Kohavi [Kohavi 78]. Every finite automaton is said to have a homing sequence h: an input string whose output sequence q⟨h⟩, when h is applied from any state q of the FA, uniquely determines the final destination state reached by h from q:

(∀q1 ∈ Q) (∀q2 ∈ Q) q1⟨h⟩ = q2⟨h⟩ ⇒ q1h = q2h

where Q = the set of states in the unknown FA, M
h = a homing sequence of M
q1h, q2h = the states reached from q1 and q2 by h
q1⟨h⟩, q2⟨h⟩ = the output sequences on applying h from q1 and q2
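This definition gives an immediate, if brute-force, test: h is a homing sequence iff no two states producing the same output sequence on h reach different final states. The sketch below is our own; the state labels (i, j), recording the parities of 0's and 1's, are an assumed encoding of the four-state parity FA used in Section 3.3.2.

```python
def output_sequence(delta, accepting, q, x):
    """q<x>: accept(1)/reject(0) symbols of the states visited, plus qx."""
    out = ['1' if q in accepting else '0']
    for a in x:
        q = delta[(q, a)]
        out.append('1' if q in accepting else '0')
    return ''.join(out), q

def is_homing(delta, accepting, states, h):
    """h is homing iff equal output sequences force equal final states."""
    final_for = {}
    for q in states:
        out, final = output_sequence(delta, accepting, q, h)
        if final_for.setdefault(out, final) != final:
            return False
    return True

# Parity FA: state (i, j) = (number of 0's mod 2, number of 1's mod 2);
# the accepting state is (0, 0).
delta = {((i, j), a): (i ^ (a == '0'), j ^ (a == '1'))
         for i in (0, 1) for j in (0, 1) for a in '01'}
states = [(i, j) for i in (0, 1) for j in (0, 1)]
print(is_homing(delta, {(0, 0)}, states, '10'))  # True: all outputs distinct
print(is_homing(delta, {(0, 0)}, states, '1'))   # False: two states share '00'
```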

This property was first applied to learning FA by Rivest and Schapire [Rivest et al 93] as an extension to the learning algorithm L1 suggested by Angluin [Angluin 87], followed by a similar effort by Kearns and Vazirani [Kearns et al 94], in which the homing sequence is applied to the classification tree of L2. These 'no-reset' variations of L1 and L2 are what we refer to as L4.

We shall use L4 to refer collectively to both no-reset variants of L1 and L2. L4 conforms to the same learning framework as L1 and L2 since:

a) it learns in the limit using membership and equivalence queries:
• the teacher is assumed to provide answers to queries
• the learner is able to choose a string x at its own will and request a membership query on x
b) the training set consists of positive and negative examples
c) it has a choice of data structure: observation table or classification tree

The learning algorithm L4 differs from L1 and L2 only in the following aspects:

a) no-reset learning: using homing sequences without a fixed initial state
b) having access to the unknown FA M: the learner is able to observe the behaviour of M through the output sequence on a membership query of a string x, instead of receiving just a 'yes/no' answer from the teacher as in L1 and L2.

3.3.2 L4: no-reset learning using homing sequences

L4 relies on a homing sequence, h, to learn M. The output sequence observed whenever h is applied provides the sense of orientation to the learner while it moves about the states of M, without having to return to a fixed starting state before every query. Figure 3.27 shows the use of a homing sequence, h = 10, in uniquely determining the states of the FA that accepts the set of strings with an even number of '0's and '1's (i.e. the FA also used for L1 and L2).



Figure 3.31: A homing sequence, h = 10, uniquely determines the final destination reached from each state in M through the unique output sequence. (a) the state diagram of M; (b) the output sequence from each (initial) state q and the final state qh reached after applying h.

As the result in Figure 3.27 only trivially satisfies the definition of a homing sequence, Figure 3.28 shows two results where the definition above is satisfied.

State q | Output sequence q⟨h⟩ (h = 10) | Final state qh
A | 100 | C
B | 010 | D
C | 001 | A
D | 000 | B

State q | Output sequence q⟨h⟩ (h = 010) | Final state qh
A | 0000 | A
B | 0001 | C
C | 1100 | B
D | 0100 | B

State q | Output sequence q⟨h⟩ (h = 10) | Final state qh
A | 001 | C
B | 000 | A
C | 100 | B
D | 001 | C


Figure 3.32: The different results of applying two different homing sequences h to an FA. (a) an FA with four distinct states; (b) the results of applying h = 010 from every state; (c) the results of the output sequences with h = 10.


Both papers claim that L4 is an efficient algorithm (i.e. one that terminates in polynomial time) able to learn any FA without a reset (i.e. a no-reset learning algorithm) using homing sequences. However, an assumption is made about the unknown FA being learnt by L4: the learner is always presented with a strongly connected unknown FA. By strongly connected we mean a digraph in which, for any pair of nodes (i.e. states) in the graph (i.e. the FA), each node is reachable from the other. This assumption was made to prevent L4 from being trapped in a strongly connected component (i.e. a loop) of an otherwise not strongly connected FA. By the definition of strong connectivity below,

(∀q1 ∈ Q) (∀q2 ∈ Q) (∃a ∈ A*) (q1a = q2)

where q1a = the final state reached from state q1 with string a,

the class of FAs to be learnt is restricted to those having a cyclic path from each state: there must be a path leading from the original state back to itself, with all other states visited at least once along the way. Figure 3.29 shows two FAs, M and M', learnt by L1 and L2, with M satisfying the strongly connected property assumed by L4, and M' not, as its rejecting state is not reachable from its accepting state.
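Strong connectivity of the transition digraph is easy to test by forward and reverse reachability from any one state. A minimal sketch (dictionary encoding assumed), applied to M' of Figure 3.29(b) and to a two-state FA that is strongly connected:

```python
def strongly_connected(delta, states, alphabet=('0', '1')):
    """True iff every state is reachable from every other state."""
    def reach(start, step):
        seen, stack = {start}, [start]
        while stack:
            q = stack.pop()
            for q2 in step(q):
                if q2 not in seen:
                    seen.add(q2)
                    stack.append(q2)
        return seen
    fwd = lambda q: [delta[(q, a)] for a in alphabet]
    rev = lambda q: [p for p in states for a in alphabet if delta[(p, a)] == q]
    s0 = states[0]
    return reach(s0, fwd) == set(states) and reach(s0, rev) == set(states)

# M' of Figure 3.29(b): every symbol leads to q1, so q0 is never re-entered.
M_prime = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
           ('q1', '0'): 'q1', ('q1', '1'): 'q1'}
# A strongly connected two-state FA: every symbol toggles the state.
M_cycle = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
           ('q1', '0'): 'q0', ('q1', '1'): 'q0'}
print(strongly_connected(M_prime, ['q0', 'q1']))   # False
print(strongly_connected(M_cycle, ['q0', 'q1']))   # True
```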

Figure 3.29: (a) a strongly connected FA, M, in which both states of every pair are reachable from one another, with a 'cyclic' path traversing all four states. (b) a connected FA, M', which is not strongly connected, as the rejecting state is not reachable from the accepting state.

Note that no start states need be shown in Figure 3.29. L4 will not be able to learn M', as it will fall into a strongly connected component (i.e. the accepting state) and be trapped there. The FAs learnable by L4 are clearly only those without a dead state, that is, a state whose transitions on all input symbols lead back to itself, and only those with at least one cyclic path from a start state through all the other states and back to itself.

We provide a proof by contradiction that there is no strongly connected DFA recognising the regular language L = A* - {λ}.

Proof:


Given the infinite regular language L = A* - {λ}, the set of all non-empty strings: if there were a strongly connected DFA for every infinite regular language, then there would be a strongly connected DFA M that accepts L. As λ ∉ L, λ is not accepted by M, and the DFA M that accepts L must take one of the forms M' shown in Figure 3.30. Since M is a strongly connected DFA, there must be at least one incoming arc into each state, and the initial state q0 is a rejecting final state, because λ must reach a non-accepting final state.

Figure 3.30: All possible forms of the DFA M, where M = M'. Each M' is shown to reject a non-empty string whose path ends back at q0.

Thus, in all four cases, there is a non-empty string which is not accepted by M', a contradiction.
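The case analysis can also be checked mechanically. The following is our own verification sketch, not part of the original report: it enumerates every DFA over A = {0, 1} with up to three states and confirms that none of the strongly connected ones accepts A* - {λ}. Agreement on all strings of length ≤ 5 suffices here, because two inequivalent DFAs with at most 3 and 2 states respectively are distinguished by some string of length < 5.

```python
from itertools import product

def accepts(delta, acc, x):
    q = 0                                  # state 0 is the start state
    for a in x:
        q = delta[(q, a)]
    return q in acc

def strongly_connected(delta, n):
    def reach(start, step):
        seen, stack = {start}, [start]
        while stack:
            q = stack.pop()
            for q2 in step(q):
                if q2 not in seen:
                    seen.add(q2)
                    stack.append(q2)
        return seen
    fwd = lambda q: [delta[(q, a)] for a in '01']
    rev = lambda q: [p for p in range(n) for a in '01' if delta[(p, a)] == q]
    return reach(0, fwd) == set(range(n)) and reach(0, rev) == set(range(n))

words = [''] + [''.join(w) for k in range(1, 6) for w in product('01', repeat=k)]
found = False
for n in (1, 2, 3):
    for trans in product(range(n), repeat=2 * n):
        delta = {(q, a): trans[2 * q + (a == '1')]
                 for q in range(n) for a in '01'}
        for bits in product((0, 1), repeat=n):
            acc = {q for q in range(n) if bits[q]}
            if 0 in acc:                   # λ must be rejected
                continue
            if (all(accepts(delta, acc, w) == (w != '') for w in words)
                    and strongly_connected(delta, n)):
                found = True
print(found)   # False: no strongly connected DFA here accepts A* - {λ}
```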

3.3.3 Discussion

The strongly connected property assumed by L4 restricts the class of FAs the learner can learn to only a subset of the set of all infinite regular languages. In the proof sketched in the previous section, we showed that there are infinite regular languages for which no strongly connected DFA accepts them. The set of finite regular languages likewise has no strongly connected DFA accepting any of its members.

This restriction clearly works against the original goal of L4: an efficient algorithm that can learn any finite state automaton. The question of 'how much assumption' must be weighed as a tradeoff between the class of FAs learnable by the learner and the time taken to learn. As seen here, L4 places restrictions on the properties and characteristics of the FA to be learnt. A learning algorithm for FAs should be able to learn an unknown FA with a dead state or a strongly connected component. Both of these characteristics are handled by L1 and L2 (i.e. both are able to learn the FA that accepts all non-empty strings), where the learner is able to continue learning even after being trapped in a strongly connected component. This is not the case for L4, whose additional constraints restrict the class of learnable FAs to the subset satisfying the strongly connected assumption about the class of FAs.


[Figure 3.30, state diagrams not reproduced: in each of the four forms of M', some transition returns to the rejecting start state q0, giving in the four cases:
• ∃x ∈ A*: δ(q0, 0x0) = q0 ∧ 0x0 ∉ L(M')
• ∃x ∈ A*: δ(q0, 0x1) = q0 ∧ 0x1 ∉ L(M')
• ∃x ∈ A*: δ(q0, 1x0) = q0 ∧ 1x0 ∉ L(M')
• ∃x ∈ A*: δ(q0, 1x1) = q0 ∧ 1x1 ∉ L(M')]


The no-reset learning employed by L4 does not seem to have any practical advantage for learning apart from the 'orientation' ability, which is inherent in all machine learning. If L4's no-reset learning is modelled on human learning, then having to reset (return to a fixed start state) seems just as natural as the 'recall' in human learning. Thus, with the homing sequence as the driving factor behind the no-reset learning in L4, there is little advantage in, or motivation for, either further use of homing sequences or learning without reset (without a fixed initial state).

3.4 Summary (Motivation forward)

The learning algorithms L1, L2, L3 and L4 successfully learn an unknown FA M by producing a hypothesis M' that not only accepts the same regular language as M but is also minimal in size; that is, the conjectured M' is a minimum DFA that is exactly equivalent to M. All the learning algorithms learn in the limit, as shown in Table 3.1.

From Table 3.1, the equivalence query plays an important role in determining the successful termination of L1, L2 and L4 with the conjecture M'. As the teacher is assumed to possess the ability to answer queries, the teacher ultimately decides to stop the learning process when no counterexample is found in the symmetric difference between M' and M (i.e. the equivalence query returns 'yes'). Learning is successful because the conjectured M' upon termination accepts exactly the same regular language as M. The teacher is therefore required to have a fair amount of knowledge about the unknown M in order to search for a counterexample in the symmetric difference of M and M' and so guarantee successful termination. If the teacher 'overlooked' a counterexample within the symmetric difference set, then M' may accept a larger or a smaller set of strings than M, depending on whether the overlooked counterexample belongs only to M' or only to M respectively.
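The teacher's search of the symmetric difference can be pictured as a breadth-first search over the product automaton of M and M'. The sketch below is our own illustration, with assumed state names and example DFAs; it returns the shortest counterexample, or None when the two DFAs are equivalent.

```python
from collections import deque

def counterexample(d1, acc1, d2, acc2, alphabet=('0', '1')):
    """Shortest string in the symmetric difference of two DFAs, else None."""
    start = ('q0', 'q0')                   # both DFAs assumed to start at q0
    seen, queue = {start}, deque([(start, '')])
    while queue:
        (p, q), w = queue.popleft()
        if (p in acc1) != (q in acc2):     # accepted by exactly one DFA
            return w
        for a in alphabet:
            nxt = (d1[(p, a)], d2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, w + a))
    return None

# Unknown M: all non-empty strings. Conjecture M': strings ending in '1'.
M = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
     ('q1', '0'): 'q1', ('q1', '1'): 'q1'}
Mp = {('q0', '0'): 'q0', ('q0', '1'): 'q1',
      ('q1', '0'): 'q0', ('q1', '1'): 'q1'}
print(counterexample(M, {'q1'}, Mp, {'q1'}))   # '0': in L(M) but not in L(M')
print(counterexample(M, {'q1'}, M, {'q1'}))    # None: equivalent
```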

In L3, neither equivalence queries nor membership queries are used. However, the teacher is required to have the same amount of knowledge about M as in the case where equivalence queries are used: here, the teacher must provide the training set t in complete lexicographic order. The teacher is assumed to be accurate, in that every example can be classified with respect to M (i.e. M is known to the teacher), and very knowledgeable, in that the examples are presented one after another in complete lexicographic order as defined in Section 2.2. A partially complete t, that is, one with string(s) missing from the lexicographic order or with unclassified (unknown) string(s), may result in the conjecture M' not being exactly equivalent to M.

The learning environment of all the algorithms relies heavily on the teacher, both as the source of t and especially for terminating successfully. In a practical situation, the teacher usually does not have the amount of knowledge about M that these algorithms require for learning M.


Algorithm: L1 | L2 | L3 | L4
• Learning framework: identification in the limit (all four).
• Source of training set, t: the teacher (all four).
• Type of t: positive and negative examples (all four).
• Examples received: L1, L2, L4: upon request, and as counterexamples; L3: given by the teacher in lexicographic order.
• Type of queries used: L1, L2, L4: membership and equivalence; L3: none.
• Role of teacher: L1, L2, L4: answers queries; L3: provides a complete lexicographically ordered t.
• Stopping criterion: L1, L2, L4: the equivalence query returns 'yes'; L3: the same consistent M' is conjectured after a finite number of examples has been seen.
• Data structures maintained to build the hypothesis: L1: observation table; L2: classification tree; L3: a set of minword(q); L4: homing sequences with an observation table / classification tree.
• Fixed initial state: L1, L2, L3: yes; L4: no.
• Conjectured M': a minimum DFA (all four).

Table 3.4: Summary of the learning environment for algorithms L1, L2, L3 and L4

Another property consistent across all of the algorithms is the minimality of the conjecture M'. This property is maintained in the construction of M' using the different data structures shown in Table 3.4, where the data structure used also depends on whether learning has a reset or no-reset nature (i.e. with or without a fixed initial state respectively). In maintaining this property, the learner may be restricted to learning only a subset of the regular languages, by employing an element which can only be used under the additional assumption that M possesses a certain property. This is the case with L4, where the use of homing sequences imposes an additional constraint on the learner, as only a strongly connected M can be learnt successfully. The minimality property is also the selection criterion for choosing one DFA from among the infinitely many DFAs (i.e. those with some equivalent states) which accept the same regular language as M, since there is a unique minimal DFA for each regular language. The learners build the conjecture M', with its finite states and transitions, from the examples seen. The teacher plays a further role here in ensuring that the minimality property is preserved, by providing correct information either when requested through membership queries (i.e. as in L1, L2 and L4), when presenting counterexamples (equivalence queries), or when presenting ordered examples (i.e. as in L3).


There seems to be too much expected of the teacher in all the algorithms, as discussed above. It can be seen clearly that the teacher is only allowed to classify each example with absolute certainty (i.e. with probability 1 for every classification) and to accept or reject a conjectured M' as the equivalent of M using two methods:

a) equivalence query
used in L1, L2 and L4, searching for a counterexample to evaluate the current guess; M' is the final guess when it remains the same and correct for all subsequent guesses with no counterexample found

b) observing the same and correct M' after some finite time
used in L3, where an M' is accepted as the equivalent of M when it is the same and correct for all subsequent guesses after a finite number of strings has been seen (i.e. after some finite time)

The success of learning therefore depends highly on the teacher having the required amount of knowledge, with no probabilistic measures used. This is highly undesirable in practical applications, where the teacher may not have complete knowledge of M and therefore would be unable to classify examples (i.e. when presented with a membership query) or to evaluate and select the exact M’ with absolute certainty.

This constraint on the part of the teacher provides the motivation for the probabilistic learning of the following chapter, where the teacher may employ some form of probabilistic measure in answering equivalence queries. Some probability distribution may also be applied to the examples instead of providing a completely lexicographically ordered set of examples. Probabilistic learning therefore provides a more desirable learning environment, as the resulting conjecture M’ does not have to depend entirely on the teacher’s knowledge and absolute judgement.


Chapter 4: Probabilistic Learning

4. Probabilistic learning

4.1 PAC learning using membership queries only

The learning process using both membership queries and equivalence queries, as used by the algorithms discussed in the previous chapter (i.e. L1, L2 and L4), can be reduced to using membership queries only within a PAC learning framework. We refer to the learning algorithm using only membership queries as L5. Each time a conjecture M’ is put to an equivalence query, the teacher has to search the (infinite) symmetric difference set between M and M’ for counterexample(s). Angluin [Angluin 87] has proposed a modification to her learning algorithm L1 in which the equivalence queries are replaced by a sampling oracle that generates a precalculated number of labelled examples, ri, as a test set each time an equivalence query would be called to test the conjecture M’. Thus, L5 is the modified version of L1 that learns M within the PAC framework, differing from L1 only in having a sampling oracle in place of the oracle answering equivalence queries. The learner still has a pair of oracles:

a) one providing answers to membership queries
b) one providing the test set instead of answering equivalence queries

4.1.1 L5: a variation of the algorithm L1 [Angluin 87; Natarajan 90]

The learner L5 is provided with ri examples by the sampling oracle to test the ith conjecture made during learning; a counterexample is the first example that belongs to only one of M and M’. The size of the test set is precalculated to be sufficiently large with respect to two parameters, accuracy (ε) and confidence limit (δ), so that the following property, defined in Section 2.4.2, holds when applied to L5:

given a real number δ between 0 and 1 and a real number ε, also between 0 and 1, there is a minimum sample size (i.e. the size of the test set, ri, in this case) such that for any unknown M with a fixed but unknown distribution on the example space T:

with probability at least (1 − δ), at most an ε fraction of the test set is classified wrongly by the hypothesis M’, where the test set of size ri is another subset of T used to test the validity of M’.

A counterexample found while the sampling oracle is generating the test set is treated as in L1, and L5 is said to learn within the PAC framework as the teacher is required to specify the two PAC parameters, accuracy ε and confidence limit δ. The precalculated test set of ri examples is computed by the sampling oracle as follows:

ri = (1/ε) [ln(1/δ) + 2 ln(i+1)]


where ε, δ : real numbers between 0 and 1
i : the ith iteration of L5 (i.e. the number of conjectures M’ made)

The computation of ri above relies only on the number of iterations, with no reference to the DFA M being learnt or to the conjecture M’ being built. This computation still conforms to the PAC framework for learning: as the number of iterations increases, the test set becomes larger, implying a larger set from which to provide a counterexample if there is one. As the parameter ε is constant during learning and bounds the fraction of errors allowed, the error tolerated shrinks relative to the growing test set. This also gives higher confidence in the conjecture M’, as it is consistent with a larger set of examples. Thus, the conjectured M’ can be viewed as a close approximation to M with a high confidence level, that is, PAC learning.

Figure 4.1 shows L5 learning the same unknown FA that L1 learns, with the only modification being that the counterexample y is found within a test set z of size ri, with the parameter values set to ε = 0.1 and δ = 0.01. As two conjectures are made, there are two different test sets, z1 and z2, of sizes r1 = 60 and r2 = 69 respectively. Thus, the last conjecture M’’ is taken to be equivalent to the unknown M with respect to the tests done using the test sets generated by the sampling oracle, with (1 − δ) confidence that M’’ differs from M by at most ε, that is,

there is a 99% chance that P(M’’ ∆ M) < 0.1 when M’’ is compared to M using a test set of 69 examples generated to represent M.

If the parameter values are set to ε = 0.01 and δ = 0.001, the sizes of the test sets z1 and z2 become r1 = 8995 and r2 = 9104 respectively. Thus, the larger the number of examples presented to the learner, the smaller the probability that the conjecture M’ is incorrect, as shown by the large test sets generated when the accuracy parameter is decreased.

4.1.2 Discussion

The equivalence query is replaced by the sampling oracle in L5, but the teacher may still be required to act as the sampling oracle generating the test set of size ri. The teacher is therefore still required to have a fair amount of knowledge of M to generate the specified test set, in which the examples are compared against the conjectured M’ to find a counterexample, if any.

L5 relies heavily on the computed ri with respect to the number of iterations made (i.e. the number of conjectures i) and has the following drawbacks:

a) the test sets generated may contain very long strings (examples)

b) the examples generated by the sampling oracle must be correctly classified for a correct counterexample to be found; L5 may end up with a totally wrong M’ from a misclassified set of examples in the test set [up to this point, the algorithms do not explicitly deal with noisy examples]

c) the test sets may be too large; as shown in Figure 4.1, a change in the parameter values may result in a test set that is too large, since the counterexamples needed may be few and short compared to the time the sampling oracle takes to generate the test set, and it may take a long time before the learner discovers that there is no counterexample.


All the results are still computationally feasible in terms of complexity, but the last seems to require more information in learning than may be needed, and the sampling oracle still requires the role of the teacher, though with only a restricted test set instead of an arbitrary set as in the equivalence query.

Figure 4.1: L5 learning the FA that accepts strings of even 0’s and 1’s.

As L1 and L2 can easily be implemented within the PAC framework, using the sampling oracle in place of the oracle answering equivalence queries, there is a question of whether L3 could also be implemented within the PAC framework: “could L3 be a variant of PAC learning, with the complete lexicographically ordered constraint on the training set and no queries used?” This seems to require some method of generating, and measuring the reliability of, the test set to which the accuracy and confidence parameters of the PAC framework are applied. As PAC learning does not require the probability distribution on the example space to be known, only fixed, this seems to fit the constraint on the training set of L3. This remains uninvestigated.

[Figure 4.1 graphic: L5 runs L1’s observation-table construction through tables O0 to O4, ending with O4 closed and consistent, S = {λ, 0, 01, 010} and E = {λ, 0, 1}; the first conjecture M0 is refuted by the counterexample y = 010 found in test set z1 (r1 = 59.90, i.e. 60 examples), while the second conjecture M1 survives test set z2 (r2 = 68.02, i.e. 69 examples) with no counterexample.]

Although L5 has a stochastic setting, in which an unknown but fixed distribution is assumed for all the examples in the training set, and is thus said to PAC learn M with membership queries, it has not offered much improvement over the non-probabilistic learning algorithms discussed previously (i.e. L1-L4). The teacher is still assumed to have a fair amount of knowledge to provide a test set of ri strings, with ri lying within a large range of roughly 60 to 9000 (as shown by the different parameter values used in Figure 4.1). This setback brings us to a truly stochastic setting for probabilistic learning in the next section, where the entire conjectured M’ can be evaluated with some form of probability measure on M’ itself.


4.2 Learning through model merging

4.2.1 Hidden Markov Model (HMM)

“HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can be observed through another set of stochastic processes that produce the sequence of observed symbols” [Rabiner et al 86]. We are concerned with the use of HMMs in learning FA and give only a brief introduction to HMMs here; for a full introduction to HMMs and their applications, we refer the reader to [Rabiner et al 86; Stolcke et al 94].

A HMM consists of hidden states with an unobservable (hidden) transition process, and a set of observable output symbols that are emitted from each (hidden) state on a given input sequence (string), as in Figure 4.2. The emission symbols are the input symbols of the input string fed into the HMM (i.e. the observable emission process). The two processes are governed by two independent probability distributions, hence the stochastic nature of both processes (hidden and observable).

Figure 4.2: (a) A HMM fed with an input string x = x1…xn, starting at S and ending at the final state F; (b) the hidden stochastic process, where the probability t of reaching q’ is based on the transition probability distribution depending on the previous state, and the probability of emitting a symbol xi, for 1 ≤ i ≤ n, depends on the emission distribution at the current state q.

The elements of a HMM are:

a) a finite set of states, Q


There is a fixed start state and a final state, which do not emit any symbols, shown in Figure 4.2(a) as states S and F respectively. Only these two states are known (not hidden); the rest of the states in Q are unknown in terms of number and associated transitions.

b) transition probability distribution
The transition into a new state is based on a probability distribution that depends on the previous state visited (i.e. a Markovian property), as shown in Figure 4.2(b), where the probability of entering q’ depends on its previous state (i.e. the state emitting symbol xn-1).

c) emission probability distribution
A symbol (i.e. the input symbol) is emitted after a transition according to the emission probability distribution of the present state, as shown in Figure 4.2(a), where the observed output sequence x1…xn is the input string x and the emission probability for x1 depends on the current state q.

From Figure 4.2(b), it can be seen that each distribution is a multinomial distribution over a finite number n of choices whose probabilities sum to 1: there are a finite number of transitions to take from a state in Q, and a finite number of symbols (i.e. usually the alphabet set) to emit in the current state. The transition and emission probabilities are the parameters of the HMM.

4.2.2 Learning FA: revisited

In learning FA, the HMM is viewed as a stochastic generalisation of an FA [Stolcke et al 93, 94], where each transition and symbol output has a probability value attached (i.e. the probabilities 0.5 labelling each arc in Figure 4.3) according to the probability distributions. We will refer to the FA as this generalised HMM throughout this chapter, and now briefly specify the HMM terminology5 used in the following table (Table 4.1):

Terminology              Notation    Definition
start state              S           a fixed initial state for all input strings
final state              F           a fixed final state reached by all strings accepted by the FA
transition probability   p(qq’)      the probability that state q’ follows state q
emission probability     p(q↑a)      the probability that symbol a is emitted in state q
path for string x        Sq1…qlF     a sequence of states from S to F with non-zero transition probabilities, such that each symbol a of x is output with non-zero emission probability
probability of path x    P(x|M)      the product of all transition and emission probabilities along the path for x in a given FA M

Table 4.1: Terminology for HMM

Every string generated by an FA M also has a probability value: the probability of a string x in M is the probability of its path, P(x|M). From Figure 4.3, the probability of the string bb generated by M is

P(bb|M) = p(Sq1) p(q1↑b) p(q1q3) p(q3↑b) p(q3F) = (0.5)^3 = 0.125

5 The notation and definitions given here are taken from [Stolcke et al 94], which gives more detailed and formal definitions with respect to HMMs.
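The path-probability computation above can be sketched as a simple product; the function below is our own illustration of the P(x|M) definition, not part of the cited algorithms.

```python
def path_probability(transitions, emissions):
    """P(x|M): the product of the transition and emission
    probabilities along the path for string x."""
    p = 1.0
    for t in transitions:
        p *= t
    for e in emissions:
        p *= e
    return p

# Path S q1 q3 F for the string bb in Figure 4.3: three transitions
# of probability 0.5 and two emissions of probability 1.
p_bb = path_probability([0.5, 0.5, 0.5], [1.0, 1.0])   # 0.125
```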


Figure 4.3: An FA M where all transitions are labelled with probability 0.5 and the emission probabilities are 1 for all symbols.

Looking back at the task of learning an unknown FA (i.e. the generalised HMM), the learner now has to take into account the parameters of the unknown FA M. As before, the learner is presented with information regarding M in the form of examples (strings) and is expected to produce a hypothesis regarding M. As M now has probabilistic properties from the probability distributions on its parameters, the problem of learning M becomes the task of the learner:

a) to find the best HMM M’ with a finite set of states that maintains connectivity between states (i.e. has all non-zero parameters), with some parameters of M’ learnt. This search for the best M’ requires a method of evaluating M’ and other candidate HMMs during learning using a probabilistic measure directly on M’, in contrast to the previous algorithms, which evaluated using example sets and produced an M’ with no stochastic properties. Thus, the best HMM is the best with respect to the probabilistic measure used.

b) to build a tentative FA (HMM model) M’ that is able to produce unseen strings with high probability after seeing some finite set of strings (i.e. strings from the training set t). The learner needs to adjust, or be able to specify, the parameters used in building the (stochastic) M’, as removing or adding states is constrained by the probability distributions.

c) to maintain consistency with the information seen while generalising (i.e. the generalised M’ must always agree with the training set t, with a high probability P(x|M’) for every string x in t).

The role of the teacher in this probabilistic learning is still the same as before, that is, to provide examples; but there is no need to answer any queries, nor is the teacher restricted to presenting t in any particular form.

Since the conjectured M’ is a probabilistic network, the entire M’ can be evaluated against M using some known probabilistic measure on M’ itself, instead of being compared on some sample as in the previous chapter using equivalence queries, which require far too much knowledge on the part of the teacher. A method called model merging by Bayesian rule induction [Stolcke et al 94] provides a means of building M’ by merging two states at a time.


As a result of merging, there will be several plausible merged models M’’ for the several pairs of states merged; these are compared by evaluating each resulting merged model M’’ using the Bayesian posterior rule explained in the following section.

The final hypothesis M’ that the learner outputs is a close approximation to the unknown M with some degree of probability based on the Bayesian posterior rule.

4.2.3 Bayesian Model Merging

The Bayesian posterior rule states that, given two events A and B, the following holds:

P(A|B) = P(B|A)P(A) / P(B)

where P(A|B) is the posterior probability of A happening given that B has occurred
P(B|A) is the conditional probability of B occurring given that A has
P(A) and P(B) are the probabilities of A and B occurring, also known as the priors.

When applied to learning FA, the two events are M, the event of obtaining a conjecture M, and X, the event that the examples X from the training set are seen. Thus, according to the Bayesian rule,

P(M|X) = P(M)P(X|M) / P(X)

where P(M|X) = the probability of obtaining the conjecture M given that X has occurred
P(M) = the probability measure of M occurring, which relies on some prior measure to be set by the teacher
P(X) = the probability of X occurring, i.e. the likelihood of X occurring in t
P(X|M) = the probability of the strings in X being generated from M, easily obtained as the product of the path probabilities P(x|M) of each string x in X.

Each M’ obtained using the state-merging method has a different topology and parameters. Each merge changes the topology of the current M, as shown in Figure 4.4: when the shaded pair of states in M is merged, the merged M’ has changes to its outgoing and incoming transitions, and the merged FA is smaller by one state. Thus, the topology of M’ is said to differ in its states and transitions, and also in the parameters attached to each arc. The learner may modify the parameters attached to the arcs, and also the emission symbols from each state (not shown in Figure 4.4), provided some probabilistic rule or measure is obeyed.

State merging also results in the generalisation of M’, as shown in Figure 4.4. A merge may produce a loop, or strings with possibly different paths may be merged into a single path, producing a merged M’ that accepts more unseen strings, as in Figure 4.4(i).


Figure 4.4: The plausible M’ from different topologies of a pair of states in M (shaded states), q1 and q2, and the effect of merging on the outgoing and incoming transitions from q1 and q2 to the newly merged state q3 (shaded in M’).

The Bayesian rule in learning FA simply provides the means for the learner to adjust M through merging and to evaluate the result of each adjustment (i.e. learning). The relationship between the information seen, P(X), and the topology of the conjectured M’, P(M), is expressed in P(M|X), which incorporates the information seen into M through P(X|M). Every plausible M’ after a merge can be compared to decide which merge is the most consistent with the entire set of data seen, X, that is, the one resulting in the highest P(M|X).

In comparing merged models, P(X) is always fixed, as all merged models share the same X, leaving the learner with P(X|M) and P(M). The merged models are distinct in topology and parameters but may have the same P(X|M), as this relies only on the parameters of each model, as shown in Figure 4.5. A bias towards a preferred M is required to guide the learner when the P(X|M) are the same but the topologies and parameters differ. This bias is incorporated into P(M) and must be specified by the teacher prior to learning. Thus, P(M), together with the bias, is the driving force for the generalisation of M, influencing the posterior probability P(M|X) beyond P(X|M).
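Since P(X) is fixed, comparing merged models reduces to comparing P(M) P(X|M). A minimal sketch, with purely hypothetical prior values (the equal likelihoods of 0.25 echo the Figure 4.5 situation where all three merges tie on P(X|M)):

```python
# Hypothetical candidates: equal likelihoods P(X|M), so the teacher's
# prior P(M) breaks the tie; the posterior is proportional to
# P(M) * P(X|M).  Names M1..M3 and the prior values are illustrative.
candidates = {
    "M1": {"prior": 0.5, "likelihood": 0.25},
    "M2": {"prior": 0.3, "likelihood": 0.25},
    "M3": {"prior": 0.2, "likelihood": 0.25},
}

def best_merge(cands):
    """Return the candidate with the highest P(M) * P(X|M)."""
    return max(cands, key=lambda m: cands[m]["prior"] * cands[m]["likelihood"])
```

With these hypothetical priors, best_merge picks "M1", the candidate the bias prefers.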

[Figure 4.4 graphic: five merge cases are illustrated: (i) outgoing arcs to the same state converge into a single outgoing arc from the merged state; (ii) outgoing arcs to two distinct states diverge from the merged state as two distinct arcs; (iii) incoming arcs from the same state become a single arc into the merged state; (iv) incoming arcs from two distinct states enter the merged state as distinct arcs; (v) a loop is obtained when merging two sequential states.]

As M consists not only of the usual topology but also of parameter distributions, the bias on M is based on both the topology of M and its parameters, called the structure prior and the parameters prior respectively. Intuitively, an M with a smaller topology and larger parameter values is to have a higher probability (preference) than one with many states and transitions (a large topology) and small parameter values (i.e. resulting in a small P(X|M)). As P(M) is to be known prior to calculating P(M|X), the teacher is to specify the bias (preference) by specifying the structure prior and the parameters prior before learning. Thus, the prior on M, P(M), is also

P(M) = P(MS, θM)

where MS : the topology of M
θM : the parameters of M
P(MS, θM) : the prior on the topology and parameters of M

The knowledge of what priors to use in learning is known a priori and is thus embedded in the background knowledge of the learner; but the learner does not have any a priori knowledge regarding the unknown M to be learnt, just as in all the algorithms discussed in Chapter 3.

Figure 4.5: Three different merges, one after another, resulting in the same P(X|M) for the examples seen, X = {ab, abb}.

A learning algorithm using Bayesian state merging has been proposed by Stolcke and Omohundro [Stolcke et al 94], which successfully learns an unknown FA from positive-only examples by producing the hypothesis with the maximum posterior probability P(M|X). The learner, which we refer to as L6 hereafter, is guided mainly by the Bayesian rule, which forms the learner’s background knowledge. The learning setting, with the learning criteria and goal (in the background knowledge) of L6, is summarised in Table 4.2, and the specific details of L6 with respect to the structure and parameter priors chosen by the teacher are discussed in the following section using examples of unknown FAs. The reader is referred to [Stolcke et al 93, 94] for full details on the choices of priors available to a learner using Bayesian model merging.

[Figure 4.5 graphic: three successive merged models M’, M” and M”’ over the examples X = {ab, abb}; in each case P(X|M) = P(ab|M) P(abb|M) = (0.5)^2 = 0.25.]

Note that the merging is local (pairs of states), though the Bayesian rule used is global over the FA as a whole. As most FAs that accept infinite regular languages have strongly connected component(s), the learner may fall into a local sub-optimum during a merge. Thus, a lookahead may be employed to recover from a local sub-optimum, enhancing the chances of obtaining a global optimum instead. This lookahead is specified as part of the learner’s learning goal (see Table 4.2 below); it is the number of consecutive times the learner is allowed to obtain, as its best new guess (merge), a lower posterior probability P(M|X) than the previous best guess before stopping the learning process.
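The stopping rule can be sketched in isolation; the function below is our own abstraction (not the authors’ implementation), taking only the sequence of best-merge posteriors P(M|X) at each stage.

```python
def accept_with_lookahead(posteriors, lookahead=1):
    """Greedy merging with lookahead: track the index of the best
    posterior seen so far, and stop once more than `lookahead`
    consecutive best merges fail to improve on it.  Returns the
    index of the accepted hypothesis."""
    best = 0
    worse_run = 0
    for i in range(1, len(posteriors)):
        if posteriors[i] > posteriors[best]:
            best = i
            worse_run = 0
        else:
            worse_run += 1
            if worse_run > lookahead:
                break
    return best
```

With lookahead 1, the illustrative run [0.2, 0.5, 0.4, 0.3, 0.9] stops after the two consecutive drops and accepts the model scoring 0.5, never reaching the later 0.9: a local sub-optimum of the kind described above.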

Algorithm L6

Teacher:
• provides the training set
• specifies the structure prior and the parameters prior

Training set t: positive-only examples

Criteria (i.e. the method and priors used to produce a tentative hypothesis/FA):
• use the structure prior and parameters prior given by the teacher to calculate P(M)
• build the hypothesis using the Bayesian state-merging method, consistent with the positive examples X drawn from t
• select the best guess (hypothesis) H among the merged models M’ obtained from the possible mergings of pairs of states in the previous hypothesis, based on the highest P(M’|X) (i.e. selecting only one pair of states as the best merge)

Goal (i.e. the stopping criterion of the learner): stop learning when subsequent merge(s) (i.e. the lookahead) produce P(H’|X) < P(H|X), where H’ is the best merged model (guess) after the previous best guess H

Hypothesis H: a generalised (by merging) hypothesis that is also able to generate unseen examples of the unknown M with high probability P(Y|H), where Y is the set of unseen examples accepted by M, as shown in Figure 4.6(b), where the loop obtained from a merge generates more (an infinite set of) strings that the previous hypothesis, Figure 4.6(a), could not generate.

Table 4.2: The learning setting of L6 using Bayesian state merging, with M’ the possible merged models in each merging stage, H the best guess from each stage, and H’ the next best guess to the unknown M being learnt.

4.2.4 L6: by Stolcke and Omohundro [Stolcke et al 94]

The learning process of L6 is guided by the two priors below, which must be set by the teacher prior to learning:

a) structure prior, P(MS)
This prior is based on the description length of M, where a standard encoding scheme is adopted to ‘measure’ M in terms of its code length l(M). Thus, the structure prior penalises M by assigning a lower P(MS) as l(M) gets larger, according to

P(MS) ∝ e^(-l(M))

where l(M) is the code length needed to encode all the states of M and the transition and emission arcs associated with each state.

Thus, to be precise, the structure prior P(MS) in L6 is obtained from the prior associated with each state q in M:

P(Mqs) ∝ (|Q| + 1)^(-nt) (|Σ| + 1)^(-ne)

where nt = the number of transitions from state q in M
ne = the number of emissions from state q in M

Hence,


P(MS) ∝ the product of the structure priors P(Mqs) over all states q, where the earlier penalisation by e^(-l(M)) corresponds to penalising M explicitly through nt and ne here.
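The per-state form of the structure prior can be sketched as follows (up to the normalisation implied by ∝); the function name and argument layout are our own.

```python
def structure_prior(n_states, alphabet_size, per_state):
    """P(M_S), up to a constant: the product over states q of
    (|Q| + 1) ** (-n_t) * (|Sigma| + 1) ** (-n_e), where per_state
    is a list of (n_transitions, n_emissions) pairs, one per state."""
    p = 1.0
    for n_t, n_e in per_state:
        p *= (n_states + 1) ** (-n_t) * (alphabet_size + 1) ** (-n_e)
    return p

# A single-state model with one transition and one emission over a
# one-symbol alphabet: (1+1)**-1 * (1+1)**-1 = 0.25
```

Adding a state or an arc multiplies the prior by a further factor less than 1, which is exactly the penalty on larger topologies described above.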

b) parameters prior, P(θ|MS)
The Dirichlet distribution for multinomials is adopted for both parameter distributions, as each is a multinomial distribution with n choices of probabilities θi, for 1 ≤ i ≤ n:

P(θ) = (1 / B(α1, …, αn)) Π θi^(αi − 1)

where B(α1, …, αn) = the multivariate Beta function in n variables (see Box 1 in the Chapter Appendix)
αi = prior weights set by the teacher, where Σ αi = some constant a0

As with the structure prior, the parameters prior for each state q is obtained from the independent parameter distributions associated with q, such that

P(θq|MS) = P(θt)P(θe)

where P(θt) = the probability distribution for transitions from q
P(θe) = the probability distribution for emissions from q

Hence,

P(θ|MS) = the product of the parameter priors P(θq|MS) over all q ∈ Q

[Note: The Dirichlet distribution in L6 is used to achieve two learning goals:
a) obtaining a hypothesis M’ that is the best guess (representation) of the unknown M, by incorporating information from the examples seen, X;
b) obtaining a hypothesis M’ that represents a class of HMMs (with small structural differences) by varying the parameters of M’.
The posterior probability below holds for the former; as the latter is out of the scope of this report, that is, learning an unknown FA from examples, the interested reader is referred to [Stolcke et al 94].]

Thus, the prior probability of M using the above specified priors is

P(M) = P(MS, θ) = P(MS) P(θ|MS)

Throughout the learning process, L6 either is presented with a new example or merges states for a best guess (or a new guess after a new string is seen). The former is taken into account (i.e. a new string x added to X) by assigning a state to each input symbol in x, with a transition connecting each new state, as shown in Figure 4.6(a), where M’ is the initial model after the first example aa and all parameters associated with the new states are set to 1.

Figure 4.6: Using the Viterbi algorithm to obtain P(X|M) for the examples X = {aa}: (a) the initial model M’, with path Sq1q2F and all parameters set to 1, giving P(X|M’) = 1; (b) the merged model M”, giving P(X|M”) = (1)(0.5)^3 = 0.125.

Page 70: Figure 2.2.doc.doc

Chapter 4: Probabilistic Learning

Next, merging is carried out until the learning goal is reached or another string is seen. The merging method here assumes that all paths for the examples seen so far, X, are preserved during a merge, so that P(X|M) can be obtained using the Viterbi algorithm [Viterbi 67; Rabiner et al 88]. Instead of taking the product over all possible paths for each string x in X, this algorithm keeps an update of the most likely path for each x, that is, the path with the highest P(x|M), producing an approximation to P(X|M) that accounts only for the most likely P(x|M) for each x in X. The update relies on two counts, c(qq’) and c(q↑a), for each transition (qq’) and emission (q↑a) in M; the counts are updated in two circumstances:

a) when a new string is seen
Each count is set to 1 for each transition and emission made when a string x is added to X (i.e. seen for the first time), as in Figure 4.6(a), where the string aa is seen for the first time and a count of 1 is set for every transition and emission made along the path Sq1q2F.

b) when a merge occurs
The new count associated with the new merged state is the sum of the counts associated with the pair of states being merged, as in Figure 4.6(b), where the count for emission a from the new merged state q1 is the sum of the counts c(q↑a) from the merged pair q1 and q2. The transition counts remain unchanged, as no two transitions are merged into a single transition.

Thus, the Viterbi counts are
c(qq’) = the number of times the transition from state q to q’ is made
c(q↑a) = the number of times symbol a is emitted at state q

[Note: These counts are different from the nt and ne used in the structure prior calculation: nt and ne refer to the number of distinct transitions and emissions associated with a state q, whereas the Viterbi counts refer to the number of times a particular transition or emission is made (visited).]

Using the Viterbi counts above for all the transitions and emissions of M corresponding to all strings x in X,

P(X|M) = Π p(qq’)^c(qq’) p(q↑a)^c(q↑a) = Π P(cq|M)

Hence, the posterior probability of M is

P(M|X) = P(MS) P(θ|MS) Π P(cq|M) / P(X)
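The Viterbi-count form of P(X|M) is a product of parameters raised to their counts; a minimal sketch of our own, using the (0.5)^3 (0.5)^4 = (0.5)^7 value shown in Figure 4.7(b):

```python
def viterbi_likelihood(params):
    """P(X|M): the product over transition and emission parameters
    of p ** c, where params is a list of (probability, count) pairs."""
    p = 1.0
    for prob, count in params:
        p *= prob ** count
    return p

# Parameters of probability 0.5 used 3 times and 4 times respectively,
# as for M''' in Figure 4.7(b): (0.5)**3 * (0.5)**4 = (0.5)**7.
p_x = viterbi_likelihood([(0.5, 3), (0.5, 4)])
```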

We have shown how L6 evaluates a merged M’ based on the priors and P(X|M) based on the Viterbi algorithm but have not shown how to get the merged M’, that is, “what happens during a merge?”, which is dealt by L6 in the following aspect:

a) selecting pairs of states to mergeThe merging is carried out for states with the same emission symbols (as shown by the merge in Figure 4.6) exhaustively before merging in general.

b) making the changes to the current parameters: an estimation procedure is needed for the new parameters in M’, as only the way M’ is constructed is known (i.e. through the merging procedure) but the topology of M’ is not known (i.e. being the hidden process of a HMM). Thus, L6 adopts the following estimations for the transition and emission parameters respectively:

p’(q→q’) = [c(q→q’) + α − 1] / Σs∈Q [c(q→s) + α − 1]

p’(q↑a) = [c(q↑a) + α − 1] / Σr∈Σ [c(q↑r) + α − 1]

where the α are the prior weights, assumed here to be uniformly distributed.

The estimations are used in two circumstances during learning:

i) when a merge occurs

From Figure 4.7(a), which is the resulting merge in Figure 4.6(b), the following estimations associated with the new merged state q1 are made for the transition parameters from q1 to q1 and from q1 to F, and for the emission parameter for symbol a:

p’(q1→q1) = [c(q1→q1) + α − 1] / Σs∈Q [c(q1→s) + α − 1] = (1 + 0.5 − 1) / 2(1 + 0.5 − 1) = 0.5

p’(q1→F) = [c(q1→F) + α − 1] / Σs∈Q [c(q1→s) + α − 1] = (1 + 0.5 − 1) / 2(1 + 0.5 − 1) = 0.5

p’(q1↑a) = [c(q1↑a) + α − 1] / Σr∈Σ [c(q1↑r) + α − 1] = (2 + 1 − 1) / (2 + 1 − 1) = 1
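The estimation formula can be written as one small helper applied to either the transition counts or the emission counts of a state. The helper name and dictionary layout below are our own; the numbers reproduce the worked example for q1, with α = 0.5 for each of the two transition choices and α = 1 for the single emission choice.

```python
def reestimate(counts, alpha):
    """p'(i) = (c_i + alpha - 1) / sum_j (c_j + alpha - 1), the estimation
    used by L6 for both transition and emission parameters of a state."""
    denom = sum(c + alpha - 1 for c in counts.values())
    return {k: (c + alpha - 1) / denom for k, c in counts.items()}
```

Applied to q1's transition counts {q1: 1, F: 1} with α = 0.5 this yields 0.5 for each transition, and to its emission count {a: 2} with α = 1 it yields 1, as in the calculation above.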

ii) when a path exists for a new string seen: for example, a new string aaa is presented to M”, Figure 4.7(a). As there exists a path for aaa, no new states are incorporated, but the Viterbi counts are updated and the resulting parameter estimations are exhibited in M’”, Figure 4.7(b).

Figure 4.7: (a) The resulting merge M”, from Figure 4.6(b), with new parameter estimations; (b) M’” with new estimations for the new string aaa.

[Figure 4.7 shows: (a) M” with X = {aa}, Viterbi counts in brackets, and P(X|M) = (1)(0.5)^3 = 0.125; (b) M’” with X = {aa, aaa} and P(X|M) = (0.5)^3 (0.5)^4 = (0.5)^7.]

4.2.5 Running of L6 on worked examples

We illustrate how L6 learns an unknown FA using a lookahead of 1 (i.e. the learner stops learning after 2 consecutive best guesses with P(Mn|X) lower than the previous best guess P(Mn-1|X)) through the first several merges from the initial model M0 (Figure 4.8 and Figure 4.9). The unknown M used here is the FA that accepts all strings with an even number of 0’s and 1’s. The training set t = {00, 11, 0101, 1111} is used, with the associated calculations towards the posterior probability of every merge result (see the summarised calculations in Box 2 of the Chapter Appendix).

Throughout learning, L6 assumes the total weight α0 = 1; thus the priors αi are distributed uniformly over the parameters. L6 builds an initial model M0 for the first example 00 seen, with X = {00}, a state for every input symbol, and all transition and emission parameters set to 1, as shown in Figure 4.8(a).


Figure 4.8: Resulting merges in the running of L6 with examples from t = {00, 11, 0101, 1111}. States to be merged are shaded, all transitions and emissions not labelled have probability 1, and the Viterbi counts are in brackets.

In the first merging there is only one pair of states to merge (and with the same emission symbol ‘0’). The merged M1 is accepted as the best guess, with P(M1|X) > P(M0|X), in Figure 4.8(b). As there can be no more merging, the next string (example) ‘11’ is shown to the learner. New states are incorporated into M1, producing M2 with (modified) new parameters and Viterbi counts, Figure 4.8(c).

The second merging takes place between the only pair of states in M2 with the same emission symbol ‘1’, resulting in M3 in Figure 4.8(d), with the appropriate parameter and Viterbi count adjustments. However, M3 has a lower posterior probability compared to M2, but since we are adopting a lookahead of 1 here, learning will continue and stop only if the next merge results in another decreased posterior probability.

[Figure 4.8 shows models M0 to M4 with their Viterbi counts and parameters: (a) M0 with X = {00}, P(X|M0) = 1, P(M0|X) = 0.0123; (b) M1 with P(X|M1) = (1)(0.5)^3 = 0.1250, P(M1|X) = 0.0133; (c) M2 with X’ = {00, 11}, P(X’|M2) = (0.5)^4 (0.5) = 0.0313, P(M2|X’) = (2.8782)10^-6; (d) M3 with P(X’|M3) = (0.5)^4 (0.5)^4 = (3.9063)10^-3, P(M3|X’) = (2.1717)10^-6; (e) M4 with P(X’|M4) = (0.5)^4 (0.5)^4 = (3.9063)10^-3, P(M4|X’) = (4.3797)10^-5.]

From Figure 4.8(e), the subsequent merge following M3 is a general merging, as there are no more pairs with the same emission symbols. The resulting posterior probability for M4 is higher than the previous one for M3; thus learning continues, and stops when two consecutive decreases in the posterior probability are obtained.
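The best-first control loop with its lookahead stopping rule can be sketched abstractly as follows. The function and parameter names are ours; the model representation, the posterior evaluation and the generation of candidate merges are left abstract, since only the control structure is being illustrated.

```python
def best_first_merge(model, posterior, candidate_merges, lookahead=1):
    """Best-first merging: repeatedly take the candidate merge with the
    highest posterior, and stop once more than `lookahead` consecutive
    merges have failed to improve on the best posterior seen so far."""
    best, best_p = model, posterior(model)
    current, drops = model, 0
    while True:
        merges = candidate_merges(current)
        if not merges:                      # no more pairs to merge
            break
        current = max(merges, key=posterior)  # highest-posterior merge
        p = posterior(current)
        if p > best_p:
            best, best_p, drops = current, p, 0
        else:
            drops += 1
            if drops > lookahead:           # e.g. 2 consecutive decreases
                break
    return best
```

With lookahead 1 the loop survives a single posterior decrease (as with M3 above) and stops only after two in a row.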

The learning process continues iteratively until there are no more examples from t and the learning goal is achieved (when the lookahead has detected 2 consecutive decreased posterior probabilities). In Figure 4.9, the next example ‘0101’ is shown when there are no more pairs to merge (i.e. M4 in Figure 4.8(e)). There are several possible merges, as there are 6 pairs of states with the same emission symbols; L6 will try all 6 pairs and take only the merge with the highest P(M|X) to be compared with the posterior probability P(M5|X) = (6.9863)10^-13.

Figure 4.9: The resulting new model incorporating the new string ‘0101’ into M4 from Figure 4.8(e), and the subsequent 6 possible pairs of states to merge.

[Figure 4.9 shows M5 with X” = {00, 11, 0101}, P(X”|M5) = (0.67)^2 (0.5)^10 (0.33) = (1.4467)10^-4 and P(M5|X”) = (6.9863)10^-13, together with the 6 pairs of states that are possible merges: q2 and q4, q3 and q5, and the 4 pairs of q1 with each of q2, q3, q4, q5.]

4.2.6 Discussion

L6 is provided with the ability to build a conjecture of an unknown FA that is entirely stochastic, with parameters on the transitions and emissions. The teacher specifies the appropriate a priori knowledge through the bias in the structure and parameter priors used to guide learning. The example set used contains positive only examples, which the teacher knows and can provide; that is, the teacher may provide the learner with only a small t. The learner thus learns as it sees examples X from t, through state merging guided by the Bayesian posterior probability P(M|X), conjecturing the M’ with the maximum P(M|X) as its best guess (best HMM).

As the prior P(M) has the greater influence over P(X|M) in the resulting P(M|X), the likelihood of X with respect to a given M’ might become insignificant, since P(X|M) decreases as M’ is generalised more (able to generate more strings than seen in X). Therefore, in cases where a loop results from a merge, the likelihood of X against the larger (possibly infinite) set of strings that M can generate decreases significantly, as shown in Figure 4.8 and Figure 4.9, where the likelihood decreased from 0.125 to (1.4467)10^-4 after only 3 examples seen in X.

The generalised M’ could therefore result in a low likelihood for X, despite being consistent (i.e. non-zero likelihood), where M’ does not fit X best (i.e. does not represent X with relatively high probability) during learning. A control factor is employed by L6 to balance the generalisation, which is driven by P(M), against the high likelihood of X that L6 is to maintain during learning. This control factor is taken to be

λ^-1 = C / |X|

where C is the number of effective samples (a constant) and |X| is the number of samples seen so far. Applied to the posterior rule, this gives

P(M|X) = P(M) P(X|M)^f / P(X), where f = λ^-1 (the control factor)

so that more weight is given to P(X|M) initially, when X is small, and less as X gets larger. Thus, the constant indicating the effective sample size is also known a priori and given by the teacher, according to the number of samples to be considered effective (constant throughout learning despite the actual samples seen in X).
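The weighted posterior rule (ignoring the normalising term P(X)) can be sketched in a few lines. This is our own illustration: the function name is hypothetical, and the effective sample size of 50 is taken from the worked comparison below.

```python
def weighted_posterior(prior, likelihood, n_seen, effective=50):
    """Posterior with the likelihood-weighting control factor
    f = lambda^-1 = effective / |X|; the normalising P(X) is omitted."""
    f = effective / n_seen          # large when few samples have been seen
    return prior * likelihood ** f, f
```

With C = 50, the control factor after 8 samples is 50 / 8 = 6.25, matching the value at which [Stolcke et al 94] report merging begins.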

The calculation using this control factor is compared with the learning without the control factor illustrated in Figure 4.8 and Figure 4.9. Table 4.3 below shows the posterior probability calculated at each stage, using an effective sample size of 50. The corresponding posterior probabilities using the control factor do not allow a merge until a larger set X is obtained; it has been analysed in [Stolcke et al 94] that merging using the control factor starts only after 8 samples, giving λ^-1 = 6.25.


Effective sample size C = 50

Model  P(M|X)          P(X|M)          P(M|X) with control factor   λ^-1 = C/|X|   |X|
M0     0.0123          1               0.0123                       50             1
M1     0.0133          0.125           (7.455)10^-47                50             1
M2     (2.878)10^-6    0.0313          (2.250)10^-42                25             2
M3     (2.172)10^-6    (3.906)10^-3    (3.355)10^-64                25             2
M4     (4.379)10^-5    (3.906)10^-3    (6.9654)10^-63               25             2
M5     (6.986)10^-13   (1.447)10^-4    (3.661)10^-73                16.7           3

Table 4.3: Comparison of the posterior probability with and without the control factor, using the sample set X in the learning process of Figure 4.8 and Figure 4.9.

The control factor therefore controls the merging stage so that merging does not occur too early in learning, when the number of examples seen is very small. This is again just another prior to be specified by the teacher, or even set by the learner. The learning algorithm using Bayes’ rule with prior measures has the following advantages:

• the learner is only required to conjecture a DFA with uncertainties: parameter estimations
• the teacher is only required to provide examples it knows, that is, positive only examples
• the training size required is relatively small.

4.3 Summary

The probabilistic learning in PAC and Bayesian model merging has an evaluation method where some degree of error is accounted for, and the conjecture can be accepted as the best guess with some certainty that is not necessarily absolute (i.e. probability 1). A comparison between L5 and L6, where PAC learning and Bayesian model merging are used respectively, is summarised in Table 4.4.

Neither probabilistic learning approach has accounted for the accuracy of the examples given by the teacher. In building the conjecture, both L5 and L6 have assumed that the information received is correctly classified by the teacher, namely:

• membership queries and counterexamples
• the training set t

L6 seems able to handle this last issue with the stochastic settings in the conjecture M’, whereas in L5, and also in the previous non-probabilistic learning, any error made in the conjecture is inherently there, with no way of rectifying it. In L6, the estimations are used to modify the parameters representing the information received, while also keeping some statistics on unseen strings. Thus, if an example was previously misclassified, it could be rectified by modifying the Viterbi counts, and hence producing new estimations for the path likelihood with respect to the misclassified example.
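Such a rectification could, for instance, subtract the misclassified string's Viterbi-path counts, reversing the update made when the string was first seen, before re-estimating the parameters. The sketch below is our own; the helper name and the count layout keyed by transition and emission are hypothetical and not from L6's actual implementation.

```python
def retract(trans_c, emit_c, path, symbols):
    """Retract a previously (mis)classified string by subtracting its
    Viterbi-path counts: one per transition along the path, and one per
    symbol emitted at each state entered."""
    for q, q2 in zip(path, path[1:]):
        trans_c[(q, q2)] -= 1
    for q, a in zip(path[1:], symbols):
        emit_c[(q, a)] -= 1
    return trans_c, emit_c
```

The corrected counts would then feed the usual estimation formulas for p’(q→q’) and p’(q↑a).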


Background knowledge known a priori:
  L5: a) parameters: accuracy ε; confidence limit δ
      b) samples oracle, with a calculation for the test set size ri
  L6: a) priors: total weights α0; structure prior P(MS); parameter priors P(θ|MS) for transitions and emissions
      b) effective sample size
      c) lookahead
      d) parameter estimations

Training set t:
  L5: positive and negative examples
  L6: positive only examples

Criteria to build M’: M’ to represent the equivalent state and the next state based on examples from t

Goal:
  L5: the samples oracle produces no counterexamples from the test set generated based on the parameters and test set size
  L6: when the number of consecutive P(M’|X) < P(M|X) exceeds the lookahead, where M is the previous conjecture and M’ the current conjecture

The resulting hypothesis M’:
  L5: a non-stochastic DFA: the DFA has transitions with no probability value other than 1, and the probability of reaching a state is either 1 or 0 for any string, depending on whether the string is in the equivalence class of strings for that state
  L6: a stochastic FA viewed as a generalised HMM: the FA has probabilities on all transitions, and the probability of reaching a state can be measured for different strings, which also indicates how many times the state is visited

Information received:
  L5: membership queries and counterexamples
  L6: only strings from the training set t

Data structures:
  L5: observation table
  L6: Viterbi counts

Table 4.4: Summary of L5 and L6 in probabilistic learning of FA.

The strings X originally received from the teacher can be generated again by the conjectured M’ in L6 with a likelihood probability attached, reflecting the topology of M’, in contrast to the strings generated by the conjecture in L5. The conjecture M’ is also built based on X and on a bias towards topology, where the number of states and transitions is kept down by the structure prior P(M). Thus, the following analyses could be done to further evaluate the resulting hypothesis M’:

a) approximation of M’ to the unknown FA on a set of samples: the likelihood probability on the strings in the samples may be used to measure how close M’ is to c, based on recognition of the samples and also on the topology of M’

b) accuracy of M’ in generating unseen examples: some form of measure using, again, the likelihood P(Y|M) for unseen Y

c) performance of M’ receiving noise in the examples: examples with probabilities, and examples misclassified by the teacher.

4.4 Chapter Appendix

[Note: the product of each of the priors below is calculated over all states in Q to obtain P(MS) and P(θ|MS).]

Box 1: Beta Function

Beta function of m variables αi, for 1 ≤ i ≤ m:

B(α1,…, αm) = Γ(α1)…Γ(αm) / Γ(α1+…+αm)

where Γ(n) = ∫0^∞ x^(n-1) e^-x dx for n > 0 is the Gamma function.
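The Beta function of Box 1 can be evaluated directly from the Gamma function in the standard library; a minimal sketch (the function name is ours):

```python
import math

def beta(*alphas):
    """Beta function of m variables via the Gamma function (Box 1):
    B(a1,...,am) = Gamma(a1)...Gamma(am) / Gamma(a1+...+am)."""
    num = math.prod(math.gamma(a) for a in alphas)
    return num / math.gamma(sum(alphas))
```

For example, B(1, 1) = 1 and B(0.5, 0.5) = Γ(0.5)^2 = π.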

Box 2: calculations during learning

P(MSq) ∝ (|Q| + 1)^-nt (|Σ| + 1)^-ne

P(θq|MS) = P(θt) P(θe)

P(X|M) = Π p(q→q’)^c(q→q’) p(q↑a)^c(q↑a)
       = Π P(cq|M)

Hence, the posterior probability on M is

P(M|X) = P(MS) P(θ|MS) Π P(cq|M) / P(X)

and the estimations are

p’(q→q’) = [c(q→q’) + α − 1] / Σs∈Q [c(q→s) + α − 1]

p’(q↑a) = [c(q↑a) + α − 1] / Σr∈Σ [c(q↑r) + α − 1]

where Σαi = α0 (a constant) and θi = αi / α0 for 1 ≤ i ≤ n, with n the number of discrete choices in the multinomial parameter distributions.


5. Conclusion and Related Work

The issues in learning highlighted in Section 2.2, as applied to the learning of FA, are dealt with by all the algorithms discussed, using various learning methods. The learning methods employed by each algorithm in the learning process result in categorisations of the algorithms with respect to each learning phase, as shown in Table 5.1.

Phase 1: the resulting hypothesis M’
  non-probabilistic: DFA with no inherent probabilistic property in its topology (L1-L5)
  probabilistic: an HMM-modelled FA with inherent probabilistic properties (L6)

Phase 2: when to stop
  equivalence oracle: counterexample from an arbitrary test set (L1, L2, L4)
  sampling oracle: counterexample from a precalculated test set based on parameters (L5)
  in the limit: stopping after every input string is consistent and correct with the examples seen (L3)
  Bayes’ posterior rule: using posterior probabilities (L6)

Table 5.1: Categorisation of the algorithms with respect to the learning phases.

From Table 5.1, both phases have some probabilistic methods employed during learning:

a) the sampling oracle produces the test set satisfying some probabilities with respect to the accuracy and confidence limit parameters given
b) the posterior probabilities from Bayes’ rule
c) the estimations used in constructing the HMM

Thus, the learning methods used in learning FA can be grouped into two distinct categories, probabilistic and non-probabilistic, as shown in Table 5.2.

Probabilistic:
  • in constructing the HMM
  • to evaluate the posterior probability of a hypothesis, to determine when to stop
  • to generate the test set

Non-probabilistic:
  • in constructing the DFA
  • in answering equivalence queries

Table 5.2: The categories of learning methods.

The choice of which learning methods to adopt clearly depends on two factors:

a) the choice of hypothesis representation in the hypothesis space


The choice of representation using minimal DFAs has shown that the following properties of FA are captured:

• the equivalence class representing each state
• identifying next-states from transitions
• the set of distinguishing strings

These properties captured during learning result in a hypothesis that is always guaranteed to be a minimum DFA. Note that a tentative hypothesis may be an NFA, as in L3, or a DFA without reachability from the initial state to every state; the resulting guess is the final guess made after some finite time and is a minimum DFA. This is because the properties above are captured during learning and kept in the data structures through updates. Construction of the hypothesis here is more rigid, as every state and transition has to be certain: a path once established (the permanent link in L3) cannot be changed if an example is misclassified.

Modelling FA with a HMM, however, does not guarantee minimality, but instead has the following advantages over a DFA:

• able to evaluate strings generated by the hypothesis
• able to generate unseen strings and measure the accuracy or likelihood of the strings generated
• may update the HMM to accommodate noisy data through the estimations, compared with the non-probabilistic DFA, which only updates the hypothesis with new properties learnt and does not correct any errors; this seems to allow room for errors.

b) the task of evaluating the hypothesis

Using the probabilistic method in L6, the evaluation is done entirely by the learner, compared with the evaluation done by the teacher in all the other algorithms. Note that even in L5, though a sample oracle is used, the teacher is still needed at least to provide the required test set.

In conclusion, the probabilistic learning methods used by L6 in both learning phases seem to be more robust in most of the issues in learning an FA, as compared to non-probabilistic learning, as in Table 5.3. However, the question of how reliable the information received from the teacher is, is not dealt with here, though L6 seems able to incorporate errors and rectify them far better than the non-probabilistic learning methods. There has been no rigorous analysis of information with probabilities (i.e. examples with probability values) received from the teacher in any of the algorithms, as all the algorithms hypothesise by representing the information and properties learnt using different data structures, with the criteria and learning goal specified in the background knowledge of the learner assuming no probability values attached to the information received.


Examples space:
  probabilistic (L6): arbitrary (teacher)
  non-probabilistic: arbitrary (teacher)

Classification of training set t:
  probabilistic (L6): only needs to provide positive only examples; the teacher can seemingly misclassify
  non-probabilistic: the teacher is assumed to have answers when queried, and to provide a lexicographically ordered set

Presentation of t:
  probabilistic (L6): at random
  non-probabilistic: in some order (L3), or when requested (queries)

The size of t:
  probabilistic (L6): small
  non-probabilistic: may be quite large (sampling oracle); at least one string representing each equivalent state

The choice of hypothesis space representation:
  probabilistic (L6): concerns only finite states, as parameters are used
  non-probabilistic: has to be relative to the minimum DFA

Selection and justification of hypothesis (i.e. evaluation):
  probabilistic (L6): done by the learner
  non-probabilistic: mostly by the teacher, though in PAC learning the learner generates a sample set to evaluate the hypothesis

Table 5.3: Comparison between probabilistic and non-probabilistic learning in handling the issues of learning FA.

There are also related works on learning FA: an FA characterised in a diversity-based representation [Rivest et al 87], as compared to the state-based characterisation [Gold 72] surveyed here; other works using the methods discussed [Angluin 90]; methods dealing with the noisy examples issue [Angluin et al 88], which have been applied not to FA but to k-CNF formulas; and the random walks used in [Freund et al 93].

The direction of learning FA seems to have moved along the probabilistic learning line, with the statistical analysis for evaluation that it offers, as in the learning algorithm using Bayes’ rule; the learning capabilities also seem more ‘powerful’, with the ability to somehow deal with inconsistency in the information provided, instead of merely being consistent with the data provided, which seems to be the case with all the learning algorithms so far.


References

[Angluin 81] D. Angluin, “A Note on the Number of Queries Needed to Identify Regular Languages”, Information and Control 51, pp. 76-87 (1981)

[Angluin 87] D. Angluin, “Learning Regular Sets from Queries and Counterexamples”, Information Computation 75, pp. 87-106 (1987)

[Angluin 88] D. Angluin, “Queries and Concept Learning”, Machine Learning 2, pp. 319-342 (1988)

[Angluin 90] D. Angluin, “Negative results for Equivalence Queries”, Machine Learning 5, pp. 121-150 (1990)

[Angluin et al 88] D. Angluin and P. Laird, “Learning From Noisy Examples”, Machine Learning 2, pp. 343-370 (1988)

[Freund et al 93] Y. Freund, M. Kearns, D. Ron, R. Rubinfeld, R. Schapire and L. Sellie, “Efficient Learning of Typical Finite Automata from Random Walks”, Proceedings of 25th ACM Symposium of Computational Learning Theory, pp. 315-324 (1993)

[Gold 67a] E.M. Gold, “Complexity of Automaton Identification from Given Data”, Information and Control 37, pp. 302-320 (1978)

[Gold 67b] E.M. Gold, “Language Identification in the Limit”, Information and Control 10, pp. 447-474 (1967)

[Gold 72] E.M. Gold, “System Identification via State Characterization”, Automatica 8, pp. 621- 626 (1972)

[Hopcroft et al 79] J.E. Hopcroft and J.D. Ullman, “Introduction to Automata Theory, Languages and Computation”, Addison-Wesley (1979)

[Kearns et al 94] M. Kearns and U. Vazirani, “An introduction to Computational Learning Theory”, MIT Press Cambridge, Massachussets, London, England (1994)

[Kohavi 78] Z. Kohavi, “Switching and Finite Automata”, McGraw-Hill, 2nd ed. (1978)

[Li et al 87] M. Li and U. Vazirani, “On the Learnability of Finite Automata”, Proceedings of the 1988 Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, Inc. (1988)

[Natarajan 91] B.K. Natarajan, “Machine Learning: A Theoretical Approach”, Morgan Kaufmann Publishers, Inc. (1991)

[Porat et al 91] S. Porat and J. Feldman, ” Learning automata from ordered examples”, Machine Learning 7, pp.109-138 (1991)

[Rabiner et al 86] L.R. Rabiner and B.H. Juang, “An introduction to Hidden Markov Models”, IEEE ASSP Magazine, pp. 4-16 (1986)

[Rivest et al 87] R. Rivest and R. Schapire, “Diversity-based Inference of Finite Automata”, Proceedings of the 28th IEEE Symposium of Computer Science, pp. 78-87 (1987)

[Rivest et al 93] R. Rivest and R. Schapire, “Inference in Finite Automata Using Homing Sequences”, Information and Computation 103, pp. 299-347 (1993)

[Stolcke et al 93] A. Stolcke and S. Omohundro, “Hidden Markov Model Induction by Bayesian Model Merging”, Advances in Neural Information Processing Systems 6 (1993)

[Stolcke et al 94] A. Stolcke and S. Omohundro, “Best-first Model of Merging for Hidden Markov Model Induction”, Technical Report TR-93-003, International Computer Science Institute, Berkeley, Ca. (1994)

[Trakhtenbrot 73] B.A. Trakhtenbrot and Ya M. Barzdin’,” Finite Automata: Behaviour and Synthesis”, North-Holland/American Elsevier (1973)

[Valiant 84] L. Valiant, “A Theory of the Learnable”, Comm. ACM 27(11), pp. 1134-1142 (1984)


[Viterbi 67] A. Viterbi, “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm”, IEEE Transactions on Information Theory, pp. 260-269 (1967)
