Induction as Knowledge Integration


Transcript of Induction as Knowledge Integration

University of Southern California

Induction as Knowledge Integration

Benjamin Douglas Smith

USC/Information Sciences Institute

December 1995 ISI/RR-95-431

Approved for public release; distribution unlimited.

INFORMATION SCIENCES INSTITUTE

310/822-1511

4676 Admiralty Way / Marina del Rey / California 90292-6695


REPORT DOCUMENTATION PAGE (FORM APPROVED OMB NO. 0704-0188)

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE: December 1995

3. REPORT TYPE AND DATES COVERED: Research Report

4. TITLE AND SUBTITLE: Induction as Knowledge Integration

5. FUNDING NUMBERS: NASA-Ames: Cooperative Agreement NCC 2-538; ARPA/NCCOSC: N66001-95-C-6013

6. AUTHOR(S): Benjamin Douglas Smith

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): USC INFORMATION SCIENCES INSTITUTE, 4676 ADMIRALTY WAY, MARINA DEL REY, CA 90292-6695

8. PERFORMING ORGANIZATION REPORT NUMBER: ISI/RR-95-431

9. SPONSORING/MONITORING AGENCY NAMES(S) AND ADDRESS(ES): ARPA, 3701 N. Fairfax Dr., Arlington, VA 22203-1714; NASA-Ames Research Center, Moffett Field, CA 94035-1000; NCCOSC RDT&E DIV CODE 02211, 53560 Hull Street, San Diego, CA 92152-5410

10. SPONSORING/MONITORING AGENCY REPORT NUMBER:

11. SUPPLEMENTARY NOTES:

12A. DISTRIBUTION/AVAILABILITY STATEMENT: UNCLASSIFIED/UNLIMITED

12B. DISTRIBUTION CODE:

13. ABSTRACT (Maximum 200 words): Accuracy and efficiency are the two main evaluation criteria for induction algorithms. One of the most powerful ways to improve performance along these dimensions is by integrating additional knowledge into the induction process. However, integrating knowledge that differs significantly from the knowledge already used by the algorithm usually requires rewriting the algorithm. This dissertation presents KII, a Knowledge Integration framework for Induction, that provides a straightforward method for integrating knowledge into induction, and provides new insights into the effects of knowledge on the accuracy and complexity of induction. The idea behind KII is to express all knowledge uniformly as constraints and preferences on hypotheses. Knowledge is integrated by conjoining constraints and disjoining preferences. A hypothesis is induced from the integrated knowledge by finding a hypothesis consistent with all of the constraints and maximally preferred by the preferences. Theoretically, just about any knowledge can be expressed in this manner. In practice, the constraint and preference languages determine both the knowledge that can be expressed and the complexity of identifying a consistent hypothesis. RS-KII, an instantiation of KII based on a very expressive set representation, is described. RS-KII can utilize the knowledge of at least two disparate induction algorithms, AQ-11 and CEA ("version spaces"), in addition to knowledge neither algorithm can utilize. It seems likely that RS-KII can utilize knowledge from other induction algorithms, as well as novel kinds of knowledge, but this is left for future work. RS-KII's complexity is comparable to these algorithms when using only the knowledge of a given algorithm, and in some cases RS-KII's complexity is dramatically superior. KII also provides new insights into the effects of knowledge on induction that are used to derive classes of knowledge for which induction is not computable.

14. SUBJECT TERMS: bias, induction, knowledge intensive induction, machine learning

15. NUMBER OF PAGES: 183

16. PRICE CODE:

17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED

18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED

19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED

20. LIMITATION OF ABSTRACT: UNLIMITED

NSN 7540-01-280-5500. Standard Form 298 (Rev. 2-89), Prescribed by ANSI Std. Z39-18, 298-102


INDUCTION AS KNOWLEDGE INTEGRATION

by

Benjamin Douglas Smith

A Dissertation Presented to the

FACULTY OF THE GRADUATE SCHOOL

UNIVERSITY OF SOUTHERN CALIFORNIA

In Partial Fulfillment of the

Requirements for the Degree

DOCTOR OF PHILOSOPHY

(Electrical Engineering)

December 1995

Copyright 1996 Benjamin Douglas Smith


Dedication

To my wife Christina, for all her love, support, and patience.


Acknowledgments

This work benefited from discussions with several people, particularly Haym

Hirsh, who helped me to shape this work in its early stages. I would also like

to thank the other members of my guidance committee, Yolanda Gil, Shankhar

Rajamoney, Jean-Luc Gaudiot, and my advisor Paul Rosenbloom. The Soar group

at USC provided many useful comments throughout the evolution of this work, and

made the graduate program fun as well.

Thanks are also due to my parents for their support and their assurances that

there was indeed a light at the end of the tunnel. Finally, I would not be where I am

today without my wife Christina, who persevered through the ups and downs of my

graduate school career, and provided me with love and encouragement throughout.

This work was partially supported by the National Aeronautics and Space Ad-

ministration (NASA Ames Research Center) under cooperative agreement num-

ber NCC 2-538, and by the Information Systems Office of the Advanced Research

Projects Agency (ARPA/ISO) and the Naval Command, Control and Ocean Surveil-

lance Center RDT&E Division (NRaD) under contract number N66001-95-C-6013.


Contents

Dedication iii

Acknowledgments iv

List Of Tables ix

List Of Figures x

Abstract xii

1 Introduction 1 1.1 Research Goals 2 1.2 Organization of the Dissertation 4

2 KII 6 2.1 Knowledge Representation 8 2.2 Operations 10

2.2.1 Translators 11 2.2.1.1 Examples of Translators 12 2.2.1.2 Independent and Dependent Knowledge Translation . 14

2.2.2 Integration 16 2.2.3 Enumeration 17 2.2.4 Solution-set Queries 18 2.2.5 Incremental versus Batch Processing 20

2.3 KII Solves an Induction Task 22 2.3.1 Hypothesis Space 22 2.3.2 Instance Space 23 2.3.3 Available Knowledge 23 2.3.4 Translators 23 2.3.5 Integration and Enumeration 25

3 Set Representations 27 3.1 Grammars as Set Representations 28

3.2 Closure of C and P under Integration 29 3.3 Computability of Induction 29

3.3.1 Closure Properties of Set Operations 30 3.3.2 Computability of Solution Set 33

3.3.2.1 Expressiveness Bounds on (C×C)∩P 34 3.3.2.2 Expressiveness Bounds on C and P 34

4 RS-KII 37 4.1 The Regular Set Representation 37

4.1.1 Definition of DFAs 37 4.1.2 Definitions of Set Operations 38

4.1.2.1 Union 39 4.1.2.2 Intersection 39 4.1.2.3 Complement 40 4.1.2.4 Cartesian Product . 40 4.1.2.5 Projection 45 4.1.2.6 Transitive Closure 49

4.1.3 DFA Implementation 50 4.1.3.1 Recursive DFAs 51 4.1.3.2 Primitive DFAs 52

4.2 RS-KII Operator Implementations 53 4.2.1 Integration 54

4.2.1.1 Intersection 54 4.2.1.2 Union 56 4.2.1.3 Minimizing after Integration 58

4.2.2 Enumeration 60 4.2.2.1 The Basic Branch-and-Bound Algorithm 62 4.2.2.2 Branch-and-Bound with Partially Ordered Hypotheses 65 4.2.2.3 Fully Modified Branch-and-Bound Algorithm .... 73 4.2.2.4 Efficient Selection and Pruning 75

5 RS-KII and AQ11 77 5.1 AQ-11 Algorithm 78

5.1.1 The Hypothesis Space 78 5.1.2 Instance Space 79 5.1.3 The Search Algorithm 80

5.2 Knowledge Sources and Translators 81 5.2.1 Examples 81 5.2.2 Lexicographic Evaluation Function 84

5.2.2.1 Constructing? 89 5.2.3 Domain Theory 96

5.3 An Induction Task 99 5.3.1 Iris Task 99


5.3.1.1 Translators for Task Knowledge 100 5.3.1.2 Results of Learning 101

5.3.2 The CUP Task 101 5.3.2.1 Results of Learning 102

5.4 Complexity Analysis 102 5.4.1 Complexity of AQ-11 103

5.4.1.1 Cost of AQ-11's Main Loop 103 5.4.1.2 Cost of LearnTerm 103 5.4.1.3 Total Time Complexity for AQ-11 106

5.4.2 Complexity of RS-KII when Emulating AQ-11 107 5.5 Summary 112

6 RS-KII and IVSM 113 6.1 The IVSM and CEA Algorithms 114

6.1.1 The Candidate Elimination Algorithm 114 6.1.2 Incremental Version Space Merging 115 6.1.3 Convex Set Representation 116 6.1.4 Equivalence of CS-KII and IVSM 116

6.2 Subsumption of IVSM by RS-KII 120 6.2.1 Expressiveness of Regular and Convex Sets 120 6.2.2 Spaces for which RS-KII Subsumes CS-KII 121

6.2.2.1 Conjunctive Feature Languages 122 6.2.2.2 Conjunctive Languages where RS-KII Subsumes CS-

KII 124 6.2.3 RS-KII Translators for IVSM Knowledge 126

6.2.3.1 Noise-free Examples 126 6.2.3.2 Noisy Examples with Bounded Inconsistency .... 127 6.2.3.3 Domain Theory 129

6.3 Complexity of Set Operations 131 6.3.1 Complexity of Regular Set Intersection 132 6.3.2 Complexity of Regular Set Enumeration 132 6.3.3 Complexity of Convex Set Intersection 133 6.3.4 Complexity of Enumerating Convex Sets 135 6.3.5 Complexity Comparison 136

6.3.5.1 Equating Regular and Convex sets 136 6.3.5.2 Tree Structured Hierarchies . 138 6.3.5.3 Lattice Structured Features 140

6.4 Exponential Behavior in CS-KII and RS-KII 143 6.4.1 Haussler's Task 144 6.4.2 Performance of CS-KII and RS-KII 145

6.4.2.1 RS-KII's Performance on Haussler's Task 146 6.4.3 Complexity Analysis 151 6.4.4 Summary 154


6.5 Discussion 155

7 Related Work 156 7.1 IVSM 156 7.2 Grendel 157 7.3 Bayesian Paradigms 158 7.4 Declarative Specification of Biases 159 7.5 PAC learning 160

8 Future Work 162 8.1 Long Term Vision 162 8.2 Immediate Issues 163

9 Conclusions 165

Reference List 166


List Of Tables

2.1 Examples 23 2.2 Translated Examples 25

3.1 Language Families 29 3.2 Closure under Union and Intersection 29 3.3 Closure Properties of Languages under Union and Intersection 31 3.4 Closure Under Operations Needed to Compute the Solution Set. ... 33 3.5 Summary of Expressiveness Bounds 36

5.1 Parameters for VL1 100 5.2 Examples for the CUP Task 101

6.1 Translations of Haussler's Examples 146


List Of Figures

2.1 Translators 24 2.2 Computing the Deductive Closure of (C, P) 26

3.1 Projection Defined as a Homomorphism 33

4.1 Definition of Union 39 4.2 Definition of Intersection 40 4.3 Definition of Complement 40 4.4 Regular Grammar for {shuffle(x,y) | x,y ∈ (a|b)*$ s.t. x < y} 43 4.5 Definition of Cartesian Product 45 4.6 NDFA for Projection (First) 47 4.7 DFA for Projection (First) 48 4.8 DFA for Projection (Second) 48 4.9 Intersection Implementation 55 4.10 Union Implementation 57 4.11 Branch-and-Bound where < is a Total Order 63 4.12 Branch-and-Bound where < is a Partial Order 67 4.13 Parameters to BranchAndBound-2 68 4.14 Branch-and-Bound That Returns n Solutions Also in A 74 4.15 Parameters to BranchAndBound-3 for Implementing Enumerate 75

5.1 The VL1 Concept Description Language 79 5.2 The Outer Loop of the AQ-11 Algorithm 81 5.3 The Inner Loop of the AQ-11 Algorithm 82 5.4 Example Translator for VL1 83 5.5 Regular Expression for the Set of VL1 Hypotheses Covering an Instance 84 5.6 Mapping Function from Hypotheses onto Strings 88 5.7 Translator for the LEF 88 5.8 DFA Recognizing {shuffle(x,y) ∈ (0|1)^2k | x < y} 90 5.9 Moore Machine for the Mapping 92 5.10 CUP Domain Theory 97 5.11 Grammar for VL1 Hypotheses Satisfying the CUP Theory Bias 98 5.12 Grammar for Instantiated VL1 Language 100

6.1 Classify Instances Against the Version Space 119 6.2 RS-KII Translator for Noise-free Examples 128 6.3 RS-KII Translator for Examples with Bounded Inconsistency 129 6.4 RS-KII Translator for an Overspecial Domain Theory 131 6.5 Intersection Implementation 132 6.6 Explicit DFA for the Intersection of Two DFAs 133 6.7 Intersection of Convex Sets, Phase One 134 6.8 Haussler's Negative Examples 145 6.9 DFA for C0, the Version Space Consistent with p0 147 6.10 DFA for C1, the Version Space Consistent with n1 147 6.11 DFA for C2, the Version Space Consistent with n2 147 6.12 DFA for C3, the Version Space Consistent with n3 148 6.13 DFA for C0∩C1 149 6.14 DFA for C0∩C1 After Empty Test 149 6.15 DFA for C0∩C1∩C2 149 6.16 DFA for C0∩C1∩C2 After Empty Test 150 6.17 DFA for C0∩C1∩C2∩C3 150 6.18 DFA for C0∩C1∩C2∩C3 After Empty Test 151 6.19 DFA for C0∩C1∩...∩Ci-1 After Empty Test 153 6.20 DFA for C∩Ci 153


Abstract

Accuracy and efficiency are the two main evaluation criteria for induction algorithms.

One of the most powerful ways to improve performance along these dimensions is by

integrating additional knowledge into the induction process. However, integrating

knowledge that differs significantly from the knowledge already used by the algorithm

usually requires rewriting the algorithm.

This dissertation presents KII, a Knowledge Integration framework for Induction,

that provides a straightforward method for integrating knowledge into induction, and

provides new insights into the effects of knowledge on the accuracy and complexity of

induction. The idea behind KII is to express all knowledge uniformly as constraints

and preferences on hypotheses. Knowledge is integrated by conjoining constraints

and disjoining preferences. A hypothesis is induced from the integrated knowledge

by finding a hypothesis consistent with all of the constraints and maximally preferred

by the preferences.

Theoretically, just about any knowledge can be expressed in this manner. In prac-

tice, the constraint and preference languages determine both the knowledge that can

be expressed and the complexity of identifying a consistent hypothesis. RS-KII, an

instantiation of KII based on a very expressive set representation, is described. RS-

KII can utilize the knowledge of at least two disparate induction algorithms—AQ-11

and CEA ("version spaces")—in addition to knowledge neither algorithm can utilize.

It seems likely that RS-KII can utilize knowledge from other induction algorithms,

as well as novel kinds of knowledge, but this is left for future work. RS-KII's com-

plexity is comparable to these algorithms when using only the knowledge of a given

algorithm, and in some cases RS-KII's complexity is dramatically superior. KII also

provides new insights into the effects of knowledge on induction that are used to

derive classes of knowledge for which induction is not computable.


Chapter 1

Introduction

Induction is the process of going beyond what is known in order to reach a conclusion

that is not deductively implied by the knowledge. One form of induction commonly

studied in machine learning is classifier learning (e.g., [Mitchell, 1982, Breiman et

al., 1984, Quinlan, 1986, Pazzani and Kibler, 1992]). In this form of induction, the

objective is to learn an operational description of a classifier that maps objects from

some predefined universe of instances into classes. The induction process is provided

with knowledge about the classes, but there is usually not enough knowledge to

derive the classifier from the knowledge deductively. It is necessary to go beyond

what is known in order to induce the classifier. The desired classifier is usually

referred to as the target concept, and a classifier is often referred to as a hypothesis.

The classifier induced from the knowledge is not necessarily the target concept.

In order to go beyond what is known deductively, it is necessary to make a number

of assumptions and guesses, and these are not always correct. The induced classifier

may classify some instances incorrectly with respect to the target concept. The

degree to which the induced classifier makes the same classifications as the target

concept is a measure of its accuracy. Hopefully, the induced hypothesis is fairly

accurate, and ideally it is equivalent to the target concept.

The more knowledge there is to induce from, the more accurate the induced

classifier can be. The assumptions and guesses made in the inductive process can

be based on a broader base of knowledge, so there is less opportunity for making

an incorrect decision. In the extreme case, there is enough knowledge to deduce

the classifier, in which case no mistakes are made, and the induced classifier exactly

matches the target concept.

An ideal induction algorithm would be able to utilize all of the available knowl-

edge in order to maximize the accuracy of the induced classifier. In practice, most

existing algorithms are written with certain kinds of knowledge in mind. If some of

the available knowledge is not of these types, then the algorithm cannot use it, or

can only use it with difficulty.

One way an algorithm can make use of such knowledge is to express it in terms

of the kinds of knowledge already used by the algorithm. For instance, Quinlan

suggested casting knowledge as pseudo-examples and providing the pseudo-examples

to the induction algorithm as input [Quinlan, 1990]. However, it is not always

possible to express new knowledge in terms of the existing knowledge, in which case

the knowledge still cannot be used.

If the first approach fails, a second approach is to rewrite the algorithm to make

use of the new knowledge. Rewriting an algorithm to utilize a new kind of knowledge

is difficult. It also fails to solve the underlying problem. If yet another kind of

knowledge is made available, the algorithm may have to be modified once again.

Existing induction algorithms therefore fall short of the ideal algorithm. Existing

algorithms each use varying ranges of knowledge, but none can utilize all of the

available knowledge in every learning scenario. The accuracy of the hypotheses

induced by these algorithms is less than it could be since not all of the knowledge is

being utilized. The methods mentioned above for integrating additional knowledge

into the induction algorithm can be used to extend the algorithm, but these methods

are of limited practicality.

1.1 Research Goals

The goal of this dissertation is to construct an induction algorithm that can utilize

arbitrary knowledge, and therefore maximize the accuracy of the learned hypoth-

esis. Utilizing all of the available knowledge may either increase or decrease the

computational complexity compared to using less knowledge. The induction pro-

cess is guided by the knowledge, so additional knowledge may reduce the amount

of computational effort. However, there is more knowledge to utilize, and the cost

of utilizing that knowledge may overwhelm any computational benefits it provides.

Thus there is a potential trade-off between accuracy and computational complexity.

An algorithm that utilizes all of the knowledge may be prohibitively costly, or even

uncomputable. It may be necessary to accept a reduction in accuracy in order to

reduce the computational complexity. The second goal of this research is to inves-

tigate the ways in which expressiveness (breadth of utilizable knowledge) can be

exchanged for computational cost.

The idea is to express all knowledge about the target concept uniformly in terms

of constraints and preferences on the hypothesis space. Knowledge is integrated

by conjoining the constraints and computing the union of the preferences. The

integrated constraints and preferences specify a constrained optimization problem

(COP). Solutions to the COP are the hypotheses identified by the knowledge as

possible target concepts. The knowledge cannot discriminate further among these

hypotheses, so one is selected arbitrarily as the induced hypothesis. This idea is

embodied in KII, a Knowledge Integration framework for Induction. KII is described

in Chapter 2.

Theoretically, any knowledge that can be expressed in terms of constraints and

preferences can be utilized by KII. However, it may be computationally intractable

to solve the COP resulting from these constraints and preferences. It may even be

undecidable whether a given hypothesis is a solution to the COP. The constraint and

preference languages determine what constraints and preferences can be expressed,

and the computational complexity of solving the resulting COP. Thus, these lan-

guages provide a way of trading expressiveness for computability. Each choice of

languages yields an operationalization of KII that makes a different trade off be-

tween expressiveness and complexity.

KII provides a method for integrating arbitrary knowledge into induction and

formalizes the trade-offs that can be made between expressiveness and complexity

in terms of the constraint and preference languages. This relationship is used to

determine the most expressive languages for which induction is computable. These

analyses, and the space of possible constraint and preference languages, appear in

Chapter 3.

KII can generate algorithms at practical and interesting trade-off points. One

such trade-off is demonstrated by RS-KII, an instantiation of KII with expressive

constraint and preference languages described in Chapter 4. RS-KII is expressive,

utilizing the knowledge used by at least two disparate induction algorithms, AQ-

11 [Michalski, 1978] and, for certain hypothesis spaces, the Candidate Elimination

Algorithm (CEA) [Mitchell, 1982]. In conjunction with this knowledge, RS-KII can

also utilize knowledge that these algorithms can not, such as a domain theory and

noisy examples with bounded inconsistency [Hirsh, 1990]. It seems likely that RS-

KII can utilize the knowledge of other induction algorithms as well, although this is

left for future work. This raises the possibility of combining knowledge from several

induction algorithms in order to form hybrid algorithms (as in adding a domain

theory to AQ-11). RS-KII's apparent expressiveness also raises the possibility of

improving the accuracy of the induction by utilizing additional knowledge.

RS-KII is expressive, but it can also be computationally complex. In the worst

case it is exponential in the number of knowledge fragments it utilizes. However,

when RS-KII utilizes only the knowledge of AQ-11, it has a computational complex-

ity that is slightly higher than that of AQ-11, but still polynomial in the number

of examples. When RS-KII uses only the knowledge used by CEA, RS-KII's worst-

case complexity is the same as that of CEA, but there are problems where RS-KII's

complexity is O(n^2) when the complexity of CEA is O(2^n).

Emulations of these algorithms by RS-KII, and analyses of RS-KII's complexity

with respect to AQ-11 and CEA, are discussed in Chapter 5 (AQ-11) and Chapter 6

(CEA).

1.2 Organization of the Dissertation

The remainder of this dissertation is organized as follows. The KII framework and an

illustration of KII solving an induction problem are discussed in Chapter 2. Chap-

ter 3 lays out the space of possible set representations and identifies the most expres-

sive representations for which induction is computable. RS-KII, an instantiation of

KII based on the regular set representation, is described in Chapter 4. Emulations of

AQ-11 and IVSM by RS-KII are described in Chapter 5 and Chapter 6, respectively.

These chapters demonstrate that RS-KII can utilize the knowledge used by these

algorithms, and that RS-KII can also utilize this knowledge while using knowledge

these algorithms cannot. RS-KII's computational complexity when utilizing this

knowledge is compared to the complexity of the original algorithms. Related work

is discussed in Chapter 7. Future work is discussed in Chapter 8, and conclusions appear in Chapter 9.

Chapter 2

KII

KII is a Knowledge Integration framework for Induction. It can integrate arbitrary

knowledge into induction, and provides a foundation for understanding the effects

of knowledge on the complexity and accuracy of induction. Knowledge comprises

examples, biases, domain theories, meta-knowledge, implicit biases such as those

found in the search control of induction algorithms, and any other information that

would justify the selection of one hypothesis over another as the induced hypothesis.

Knowledge can be defined more precisely as the examples plus all of the biases.

Mitchell [Mitchell, 1980] defines bias to be any factor that influences the selection

of one hypothesis over the other as the target concept, other than strict consis-

tency with the examples. The distinction between biases and examples is somewhat

arbitrary, since they both influence selection of the target concept. More recent

definitions consider consistency with the examples as a form of bias [Gordon and

desJardins, 1995]. The latter definition of bias is used here. The terms knowledge

and bias will be used interchangeably.

By definition, the biases are the only factors that determine which hypothesis

is selected as the target concept. Induction is then a matter of determining which

hypotheses are identified (preferred for selection) by the biases. The biases may

prefer several hypotheses equally. In this case, the best that can be done is to select

one of the hypotheses at random, since there are no other factors upon which to

base a selection of one hypothesis over another. If there were such a factor, it would

be one of the biases by definition.

The hypotheses identified by the biases are termed the deductive closure of the

knowledge, since these are the hypotheses deductively implied by the biases. Induc-

ing a hypothesis by selecting a concept from the deductive closure of the knowledge

may seem more like deduction than induction. However, there is still plenty of room

for inductive leaps within this formulation. Inductive leaps come from two main

sources. One inductive leap occurs when there is more than one hypothesis in the

deductive closure, so that one of them must be selected arbitrarily as the induced

hypothesis. The second place where inductive leaps occur is in the knowledge itself.

The knowledge can include unsupported biases, assumptions, and guesses—that is,

inductive leaps. Even if there is only one hypothesis in the deductive closure of the

knowledge, it may not be the target concept if the knowledge is not firmly grounded

in fact.

In the knowledge integration paradigm, a hypothesis is induced from a collection

of knowledge (biases) as follows. Each bias is represented explicitly as an expression

in some representation language. An expression representing the biases collectively

is constructed by composing the expressions for each individual bias. The composite

expression is then used to generate the hypotheses identified by the biases. Ideally,

the composite expression is the set of hypotheses identified by the biases, but real-

istically it may be necessary to do some processing in order to extract the identified

hypotheses.

One example of an algorithm in the knowledge integration paradigm is incremen-

tal version space merging (IVSM) [Hirsh, 1990]. Biases are translated into version

spaces [Mitchell, 1982] consisting of the hypotheses identified by the bias (i.e., con-

sistent with the knowledge). The version spaces for each of the biases are intersected

to yield a new version space consistent with all of the knowledge. This composite

version space is the set of hypotheses identified by the biases. A hypothesis can be

selected arbitrarily from the composite version space as the target concept.

The version space representation for biases in IVSM allows the identified hy-

potheses to be easily extracted, but it is somewhat limited in the knowledge it can

represent. It can only utilize biases that reject hypotheses from consideration, but

it cannot utilize knowledge that prefers one hypothesis over another. Also, version

spaces are represented as convex sets, and the expressiveness of this representation

further limits the kinds of knowledge that can be represented. KII can utilize both

constraint knowledge and preference knowledge, and allows a range of representation

languages. These are discussed further in Section 2.1.

KII consists of three operations: translation, integration, and enumeration. The

first step is translation. The form in which knowledge occurs in the learning scenario

is usually not the form in which KII represents knowledge. Translators [Cohen,

1992] convert knowledge into the form used by KII. Once translated, the knowledge

is then integrated into a composite representation from which the deductive closure

can be computed. At any time, one or more hypotheses can be enumerated from

the deductive closure of the integrated knowledge. A hypothesis is induced from a

collection of knowledge by translating the individual knowledge fragments (biases)

in the collection, integrating the fragments together, and enumerating one of the

hypotheses identified by the biases.

Other information about the set of hypotheses identified by the biases is also

relevant to induction, such as whether the set is empty, or whether it contains a

single hypothesis. KII provides this information via queries. These are discussed

further in Section 2.2.4.

The rest of this chapter is organized as follows. KII's knowledge representation

is described in Section 2.1. The translation, integration, enumeration, and query

operations are described in Section 2.2. An example of KII solving an induction

task is given in Section 2.3.

2.1 Knowledge Representation

KII's knowledge representation is subject to a number of constraints. First, it must

be expressive. Knowledge that can not be expressed can not be utilized, and the goal

of KII is to utilize all of the available knowledge. Second, the representation must

be composable. This facilitates integration by allowing a collection of knowledge

fragments to be represented as a composition of individual fragments. Finally, the

representation must be operational. That is, it must be possible both to enumerate

hypotheses from the deductive closure of the integrated knowledge and to answer

queries about the deductive closure.

KII satisfies these three criteria on the representation language by expressing

knowledge in terms of constraints and preferences on the hypothesis space. The

constraints correspond to representational biases, and the preferences correspond to

procedural biases. For instance, a positive example might be expressed as a con-

straint that hypotheses in the deductive closure must cover the example. A negative

example can be expressed as a constraint requiring hypotheses in the deductive clo-

sure not to cover the example. An information gain metric, used by such algorithms

as ID3[Quinlan, 1986] and FOCL[Pazzani and Kibler, 1992], would be expressed as

a preference for hypotheses with a higher information gain. There is more than one

way to express a given knowledge fragment in terms of constraints and preferences.

These issues are discussed further in Section 2.2.1.

Each fragment of knowledge is translated into constraints and preferences. The

constraints and preferences of a knowledge fragment are represented in KII by a tuple

of three sets, (H, C, P), where H is the hypothesis space, C is the set of hypotheses

in H that satisfy the constraints, and P is a set of hypothesis pairs, (a, b), such that

a is less preferred than b (a < b). When the hypothesis space is obvious, H can be omitted from the tuple.

The biases encoded by tuple (H,C,P) identify the most preferred hypotheses

among those satisfying the constraints. Hypotheses that do not satisfy the con-

straints can not be the target concept, so these are eliminated. Among the remain-

ing hypotheses, P indicates that some of these hypotheses should be selected over

others. The less preferred hypotheses are eliminated, leaving a set of hypotheses

that satisfy the constraints and are preferred by the preferences. The knowledge

can not discriminate further among these hypotheses. This is the set of hypotheses

identified by the biases in (H, C, P). This set is also referred to as the deductive closure of the knowledge in (H, C, P).

Specifically, the deductive closure of (H, C, P) is the set {h ∈ C | ∀h′ ∈ C, (h, h′) ∉ P}. This is equivalent to Equation 2.1, below. In this equation, Ā is the complement of set A with respect to universe H, and first projects a set of pairs onto its first elements. For example, first({(x1, y1), (x2, y2), ...}) returns the set {x1, x2, ...}.

$$\overline{\mathit{first}((C \times C) \cap P)} \cap C \qquad (2.1)$$

The deductive closure of (H,C,P) tuple can also be thought of as the solution

set of a constrained optimization problem (COP), where the domain of the COP is

the hypothesis space (H), the constraints are specified by C, and the optimization

criteria are specified by P. For this reason, an (H, C, P) tuple is also called a COP,

and the deductive closure of (H,C,P) is also referred to as the solution set.
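To make the (H, C, P) representation and Equation 2.1 concrete, the following sketch computes the solution set extensionally for a tiny finite hypothesis space. It is purely illustrative: the names (COP, solutions) and the extensional frozensets are assumptions introduced for this example, not KII's actual set representations, which are deliberately left unspecified.

```python
# Minimal extensional sketch of an (H, C, P) tuple and of Equation 2.1.
# All names here (COP, solutions) are illustrative, not part of KII itself.
from itertools import product
from typing import FrozenSet, NamedTuple, Tuple


class COP(NamedTuple):
    H: FrozenSet[str]              # hypothesis space
    C: FrozenSet[str]              # hypotheses satisfying the constraints
    P: FrozenSet[Tuple[str, str]]  # (a, b) means a is less preferred than b


def solutions(cop: COP) -> FrozenSet[str]:
    """Deductive closure: complement of first((C x C) ∩ P), intersected with C."""
    dominated = {a for (a, b) in product(cop.C, cop.C) if (a, b) in cop.P}
    return frozenset((cop.H - dominated) & cop.C)


# Example: h3 violates a constraint; h1 is less preferred than h2.
cop = COP(H=frozenset({"h1", "h2", "h3"}),
          C=frozenset({"h1", "h2"}),
          P=frozenset({("h1", "h2")}))
assert solutions(cop) == {"h2"}
```

An extensional representation like this is only workable for very small hypothesis spaces; Chapter 3 examines grammar-based set representations that make the same operations feasible for large or infinite spaces.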

The (H, C, P) representation is expressive and composable, but since KII does

not specify particular set representations for H, C, and P, it is not operational.

It is expressive, since just about any knowledge relevant to induction can be ex-

pressed in terms of constraints and preferences. Integration is defined for every pair

of {H, C, P) tuples, so the representation is also composable. Constraints are inte-

grated by conjoining them into a single conjunctive constraint, and preferences are

integrated by computing their union. Specifically, the integration of (#, Ci, Pi) and

<#,C2,P2)is(#,CinC2,PiUP2).

In order for the (H, C, P) representation to be operational, there must be an

algorithm for enumerating hypotheses from the deductive closure of (H, C, P). The

deductive closure of (H,C,P) is always defined in terms of set operations over H,

C, and P, but since KII does not specify any particular set representations, the set

operations are not operational. The (H,C,P) representation is operationalized by

providing set representations for H, C, and P. In theory, any knowledge can be

represented in terms of constraints and preferences, but in practice the set represen-

tation determines the constraints and preferences that can be represented, and the

complexity of integrating knowledge and extracting hypotheses from the deductive

closure of the knowledge. Specifying a set representation operationalizes KII, and produces an induction

algorithm (or family of algorithms) at a particular level of complexity and expres-

siveness. This provides a useful handle for investigating the effects of knowledge on

induction, by relating these effects to properties of set representations. Set repre-

sentations and their properties are discussed further in Chapter 3. An instantiation

of KII based on the regular-set representation is discussed in Chapter 4.

2.2 Operations

This section formally describes the operations provided by KII. The four families of

operations are translators for expressing knowledge as (H, C, P) tuples, an integrator

for integrating (H, C, P) tuples, an enumerator for enumerating hypotheses from the


deductive closure of a (H, C, P) pair, and queries about properties of the deductive

closure.

An induction task consists of an unknown target concept and knowledge from

which the target concept is to be induced. Knowledge provided by the induction

task is expressed in some representation, but probably not the COP representation

expected by KII. Translators convert knowledge from its task representation into

COPs.

COPs for each knowledge fragment are integrated by KII's integrate operator

into a composite COP representing all of the knowledge. A hypothesis is induced

by selecting a hypothesis arbitrarily from the deductive closure of the integrated

knowledge. This is achieved by the enumerate operator. The deductive closure of

the integrated knowledge is exactly the solution set to the COP resulting from inte-

grating all the translated knowledge. The enumerate operator returns a requested

number of hypotheses from the solution set of this COP.

Also relevant to induction are certain properties of the solution set, such as

whether the solution set is empty, whether the induced hypothesis is the only one

in the solution set, and whether a given hypothesis is in the solution set. These

properties, and others, are determined by solution-set queries.

The remainder of this section discusses the KII operations in more detail.

2.2.1 Translators

An induction task consists of an unknown target concept and some collection of

knowledge from which the target concept is to be induced. In order for KII to use

this knowledge, it must first be translated from whatever form it has in the task

(its naturalistic representation [Rosenbloom et al, 1993]) into the representation

expected by KII—namely, a COP. This operation is performed by translators [Cohen,

1992].

A translator converts knowledge from some naturalistic representation into the

constraints and preferences that make up a COP. The translators are highly task-

dependent. They depend on the kind of knowledge being translated, the naturalistic

representation of that knowledge, and the hypothesis space. The hypothesis space

is a necessary input to the translator since the output of the translator consists of


constraints and preferences over the hypothesis space. It is necessary to know the

domain being constrained in order to generate meaningful constraints, and likewise

for preferences.

The dependence of the translator on the hypothesis space means that all trans-

lators for a given induction task must take the same hypothesis space as input, and

output constraints and preferences over the same domain. Effectively, all of the

knowledge fragments are dependent on the hypothesis space bias, which is also a

knowledge fragment.

These restrictions mean that each induction task will usually need its own set

of translators. However, this does not mean that reuse is impossible. If there is

some knowledge used by several tasks, and the knowledge has the same naturalistic

representation in all of the tasks, and the tasks use the same hypothesis space, then

the same translator can be used for that knowledge in all of the tasks.

2.2.1.1 Examples of Translators

Specifications of translators for common knowledge sources are shown below. In the

following, H is the hypothesis space, pos is a positive example, neg is a negative

example, and T is a domain theory. A domain theory is a collection of horn-clause

rules that explain why an instance is an example of the target concept.

• PosExample(H, pos) → (H, C, {}) where

C = {h ∈ H | h covers pos}

(e.g., FOCL [Pazzani and Kibler, 1992], AQ11 [Michalski, 1978],

CN2 [Clark and Niblett, 1989], CEA [Mitchell, 1982])

Every noise-free positive example is covered by the target concept. No hypoth-

esis that fails to cover a noise-free positive example can be the target concept.

A noise-free positive example is therefore translated as a constraint that is only

satisfied by hypotheses that cover the example. This bias is known as strict

consistency with the examples, and is used by many existing algorithms.


• NegExample(H, neg) → (H, C, {}) where

C = {h ∈ H | h does not cover neg}

(e.g., FOCL, AQ11, CN2, CEA)

The target concept does not cover noise-free negative examples. The transla-

tor for noise-free negative examples is similar to the translator for noise-free

positive examples, except that the constraint is satisfied by hypotheses that do not cover the example.

• NoisyExamples(H, examples) → (H, {}, P) where

P = {(x, y) ∈ H × H | x is consistent with fewer examples than y}

In the real world, examples can contain errors, or noise. Demanding strict

consistency with noisy examples could cause the target concept to be rejected,

or there may be no hypotheses consistent with all of the examples.

One simple approach is to assume that the examples are mostly correct, and

therefore hypotheses that are consistent with most of the examples are more

preferred than those consistent with only a few examples. It is difficult to find

a translation for individual examples that would yield this preference when

their translations are integrated. Instead, all of the examples are translated collectively into a single set of preferences.

This translation is somewhat naive, but provides a simple illustration of how

preferences are used, and how noisy examples can be handled. More sophisti-

cated translators for noisy examples are discussed in Section 6.2.3.2.

• CorrectDomainTheory(H, T) → (H, C, {}) where

C = {h ∈ H | h covers exactly those instances explained by T}

(e.g., EBL [DeJong and Mooney, 1986, Mitchell et al., 1986])

A complete and correct domain theory exactly describes the target concept.

The theory is translated into a constraint that is satisfied only by the target

concept, which is described by the theory. There is very little induction here


since the target concept has been provided. This kind of theory is more often

used in speed-up learning, where the objective is not to learn a new classifier,

but to make a known classifier more efficient (e.g., [DeJong and Mooney, 1986]).

• OverSpecialDomainTheory(H, T) → (H, C, {}) where

C = {h ∈ H | h is a generalization of domain theory T}

(e.g., IOE [Flann and Dietterich, 1989])

An overspecial domain theory can only describe why some of the instances

covered by the target concept are examples of the target. The theory is a

specialization of the target concept. The theory is translated as a constraint

satisfied by all generalizations of the concept described by the theory. See

Section 5.2.3 and Section 6.2.3.3 for more detailed domain theory translators.

A given fragment of knowledge can have several translations, depending on how

the knowledge is to be used. One can think of the intended use as knowledge

that is provided implicitly to the translator. For example, there are at least two

translations for domain theories, depending on what assumptions are made about

the correctness of the theory. Likewise, examples also have different translations,

depending on whether the examples are assumed to be noisy or error free.
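As a concrete illustration of the translator specifications above, the sketch below builds extensional (H, C, P) tuples for noise-free positive and negative examples and for the collective noisy-examples preference, reusing the COP class from the sketch in Section 2.1. The covers and consistent_count predicates are assumed helpers supplied by the induction task, and the preference-only fragment is encoded here with C = H so that it contributes no constraints and behaves neutrally under intersection; these are illustrative assumptions, not KII's own translators.

```python
# Hedged sketches of the example translators, assuming the extensional COP
# class from Section 2.1 and task-supplied helpers covers(h, instance) and
# consistent_count(h, examples).
def pos_example(H, covers, pos):
    """Noise-free positive example: keep hypotheses that cover it."""
    return COP(H=frozenset(H),
               C=frozenset(h for h in H if covers(h, pos)),
               P=frozenset())


def neg_example(H, covers, neg):
    """Noise-free negative example: keep hypotheses that do not cover it."""
    return COP(H=frozenset(H),
               C=frozenset(h for h in H if not covers(h, neg)),
               P=frozenset())


def noisy_examples(H, consistent_count, examples):
    """All noisy examples translated collectively into one preference set:
    x is less preferred than y when x is consistent with fewer examples."""
    P = {(x, y) for x in H for y in H
         if consistent_count(x, examples) < consistent_count(y, examples)}
    # No constraint knowledge is contributed, so the constraint set is all of H.
    return COP(H=frozenset(H), C=frozenset(H), P=frozenset(P))
```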

2.2.1.2 Independent and Dependent Knowledge Translation

A collection of knowledge fragments that can be translated independently of each

other are said to be independent. The only inputs to a translator for an independent

fragment is the fragment itself. There are no truly independent knowledge fragments

in KII, since each fragment is translated into constraints and preferences over the

hypothesis space, so both the fragment and the hypothesis space are necessary inputs

to the translator. However, if each fragment is paired with the hypothesis space to

form a new unit, then these units can all be translated independently.

Knowledge fragments that can only be translated in conjunction with each other

are said to be dependent. This can occur either because the knowledge fragments are

inherently dependent, or because the set representations for C and P cannot express

the constraints and preferences for the individual fragments, but can express the


constraints and preferences imposed by the fragments collectively. A translator for

a collection of dependent knowledge fragments takes the whole collection as input

and produces a single {H, C, P) tuple. The inputs are termed a dependent unit.

Ideally, the dependencies among the knowledge are minimal, so that each depen-

dent unit consists of only a few knowledge fragments. Every knowledge fragment

is dependent on the hypothesis space, so there is at least one unit per knowledge

fragment, each of which contains the fragment and the hypothesis space. Additional

dependencies can decrease the number of units and increase the number of frag-

ments in each, until at maximum dependence there is a single unit containing all

of the knowledge. Independence among the knowledge leads to greater flexibility in

deciding what knowledge to utilize, and ensures that knowledge integration occurs

within KII's integration operator and not within the translators.

Independence leads to greater flexibility in using knowledge. When a collection

of knowledge fragments is dependent, it is not possible to translate only some of the

fragments in the collection (unless they participate in other dependent units as well).

It is often an all or none proposition. This constrains the choices of what knowledge

to utilize. The greater the independence among the knowledge, the smaller the

dependent units tend to be, and the fewer constraints there are. At maximum

independence, each unit consists of a knowledge fragment and the hypothesis space,

so individual fragments can be utilized or omitted as desired.

When the knowledge fragments are independent, they can be translated and

integrated incrementally. However, this does not mean that induction in KII is

necessarily incremental. In order to induce a hypothesis from a body of knowledge

in KII, the knowledge is translated and integrated, and a hypothesis is enumerated

from the deductive closure of the knowledge. The translation and integration steps

may be incremental when the knowledge is independent, but the enumeration step

may not be. The enumeration algorithm may have to start over when new knowledge

is integrated, instead of picking up where it left off. This subject is discussed further

in Section 2.2.5.

Independence among the knowledge ensures that the integration work occurs

within KII, and not within the translators. Independent knowledge fragments are

translated independently, and the (H, C, P) tuples for the individual fragments are

composed by the Integrate operator into a single (H, C, P) tuple. The integration


work occurs within KII. Collections of dependent knowledge fragments are translated

directly into a single (H, C, P) tuple. The integration of the knowledge occurs within

the translator, and not within KII.

When most of the induction process occurs outside of KII, the power of KII

cannot be brought to bear in facilitating the integration of new knowledge. In the

extreme case there is a single translator for all of the knowledge, and all of the in-

tegration takes place within that translator. The translator is effectively a special

purpose integration algorithm replacing KII's integration operator. In order to uti-

lize additional knowledge, a new translator has to be constructed. KII's knowledge

integration capabilities are circumvented.

Independence of the knowledge is clearly desirable, but not always easy to obtain.

Some knowledge is inherently dependent, and must always be translated as a unit.

Other knowledge is dependent in one representation, but independent in another.

Independence tends to increase with the expressiveness of the C and P represen-

tations, since it is more likely that translations of individual knowledge fragments

can be expressed. However, complexity also increases with expressiveness, so it

may be necessary to trade some independence among the knowledge for improved

complexity.

2.2.2 Integration

Two COPs can be integrated, yielding a new COP whose solution set is the de-

ductive closure of the knowledge represented by the two COPs being integrated.

The integration operation is defined as follows. Both COPs must have the same

hypothesis space, since it makes no sense to integrate COPs whose constraints and

preferences are over entirely different universes of hypotheses.

$$\mathit{Integrate}((H, C_1, P_1), (H, C_2, P_2)) = (H,\; C_1 \cap C_2,\; P_1 \cup P_2) \qquad (2.2)$$

The C and P sets of each COP represent reject knowledge and preference knowl-

edge respectively. If a hypothesis is rejected by either COP, then it cannot be the

target concept. Thus the constraints are conjoined, which corresponds to intersect-

ing the C sets. Preference knowledge, by contrast, is unioned. If one COP has no


preference between hypotheses a and b, but the other COP prefers a to b, then a

should be preferred over b. This corresponds to computing the union of the prefer-

ences in the two P sets.
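A sketch of Equation 2.2 over the extensional COP class from Section 2.1 is given below; the check that both COPs share a hypothesis space mirrors the requirement stated above. This is an illustrative rendering, not KII's implementation.

```python
# Sketch of the Integrate operator (Equation 2.2) for the extensional COP class.
def integrate(cop1: COP, cop2: COP) -> COP:
    """Conjoin constraints (intersect the C sets); union the preference sets."""
    assert cop1.H == cop2.H, "both COPs must be over the same hypothesis space"
    return COP(H=cop1.H, C=cop1.C & cop2.C, P=cop1.P | cop2.P)
```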

2.2.3 Enumeration

The enumeration operator returns a requested number of hypotheses from the de-

ductive closure of the knowledge represented by an (H, C, P) tuple. The solution

set to (H, C, P) consists of the undominated elements of C, and is defined formally

in Equation 2.3. In this definition, the function first({(x1, y1), (x2, y2), ...}) is a

projection returning the set of tuple first-elements, namely {x1, x2, ...}.

$$\begin{aligned}
\mathit{Solutions}((H,C,P)) &= \{x \in C \mid \forall y \in C\ (x,y) \notin P\} \\
&= \overline{\{x \in H \mid (x \in C \text{ and } \exists y \in C\ (x,y) \in P) \text{ or } x \notin C\}} \\
&= \overline{\{x \in H \mid x \in C \text{ and } \exists y \in C\ (x,y) \in P\} \cup \overline{C}} \\
&= \overline{\mathit{first}(\{(x,y) \in C \times C \mid (x,y) \in P\})} \cap C \\
&= \overline{\mathit{first}((C \times C) \cap P)} \cap C \qquad (2.3)
\end{aligned}$$

Enumerate takes three arguments, (H, C, P), A, and n, where (H, C, P) is a COP,

A is an arbitrary set from the same universe as C and P, and n is a non-negative

integer. Enumerate((H, C, P), A, n) returns an extensional list of up to n hypotheses

that are in both A and the deductive closure of (H,C,P). If there are fewer than

n hypotheses in the intersection of A and the deductive closure of (H,C,P), then

Enumerate((H,C,P),A,n) returns all of the hypotheses in the intersection. The

argument A is necessary to implement some of the solution-set queries, as shown in

Section 2.2.4. It can be effectively eliminated by setting A to C, or any superset of

C, since the solution set is always a subset of C.

Enumerate((H, C, P), A, n) → {h1, h2, ..., hm}

where m = min(n, |SolnSet((H, C, P)) ∩ A|)
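The following sketch implements this signature over the extensional COP class from Section 2.1. It naively materializes the whole solution set before filtering by A, which is exactly what a practical instantiation would try to avoid; it is shown only to pin down the semantics of the A and n arguments.

```python
# Sketch of Enumerate((H, C, P), A, n): up to n hypotheses in both A and the
# solution set, using the extensional solutions() helper from Section 2.1.
def enumerate_cop(cop: COP, A, n: int):
    found = []
    for h in solutions(cop):
        if h in A:
            found.append(h)
            if len(found) == n:
                break
    return found
```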


The tuple (H, C, P) is the result of integrating the tuples generated by the trans-

lators. C is the intersection of several constraint sets, and P is a union of preference

sets. Equation 2.3 assumes that P is transitively closed and a partial order. How-

ever, neither of these conditions is guaranteed by the integration operator. The

preference sets produced by the translators must be transitively closed partial or-

ders, but these conditions are not preserved by union. If P1 has the preference a < b

and P2 has the preference b < a, then P1 ∪ P2 contains a cycle (a < b and b < a).

A partial ordering is antisymmetric and irreflexive by definition, so the union is not

a partial ordering. Transitive closure is not preserved by union either. If a < b and

c < d are in P1, and b < c is in P2, P1 ∪ P2 will not contain a < d, even though this

is in the transitive closure.

The lack of transitive closure is dealt with by transitively closing P prior to

calling Enumerate. The implementation of the closure operation depends on the

set representation for P. Cycles in P are more problematic. Currently, KII simply

assumes that the preferences are consistent, so that P has no cycles. If P does

contain cycles, then hypotheses participating in the cycles are all dominated, and not

part of the solution set. In other words, hypotheses about which there is conflicting

knowledge are rejected, but the remaining hypotheses are considered normally for

membership in the solution set.
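For an extensional preference set, transitively closing P before enumeration can be sketched as below (a simple fixpoint computation assumed for illustration; real set representations would implement closure quite differently). Note that if P contains a cycle, every hypothesis on the cycle is dominated by another member of the cycle and therefore drops out of the solution set, matching the behavior described above.

```python
# Sketch: transitively close an extensional preference set P prior to Enumerate.
def transitive_closure(P):
    closed = set(P)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closed):
            for (c, d) in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return frozenset(closed)
```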

Contradictions can also occur among the constraint knowledge. If the constraints

are mutually exclusive, then C is empty, and so is the solution set. KII tolerates

such conflicts, but does not have any sophisticated strategy for dealing with them.

The solution space simply collapses, and it is up to the user to try a different set of

knowledge.

A more sophisticated approach would be to detect mutually exclusive constraints

and cycles in the preferences, either as part of integration or prior to enumeration,

and resolve the conflicts according to a conflict resolution strategy. This is an area

for future research.

2.2.4 Solution-set Queries

Solution-set queries return information about the deductive closure of an (H, C, P)

tuple. There are four solution-set queries, based on those Hirsh proposed for version


spaces [Hirsh, 1992]. As discussed earlier, version spaces represent deductive closures

of knowledge, so these queries should also be appropriate for KII.

There are four solution-set queries, defined as follows. In these definitions,

(H, C, P) is a COP, and h is some hypothesis in H.

• Member(h, (H, C, P)) -* Boolean

Returns true if h is a member of the solution set to (H, C, P). Returns false

otherwise. Member(h, (H, C, P)) is equivalent to h ∈ C and ({h} × C) ∩ P = ∅.

• Empty((H,C,P)) ->• Boolean

Returns true if the solution set to (H, C, P) is empty. Returns false otherwise.

• Unique((H,C, P)) -»• Boolean

Returns true if the solution set to (H, C, P) contains exactly one member.

Returns false otherwise.

• Subset((H,C,P),A)-> Boolean

A is a set of hypotheses. Returns true if the solution set to (H, C, P) is a subset of A, and returns false

otherwise. Subset((H, C, P), A) ⇔ SolutionSet((H, C, P)) ∩ Ā = ∅.

The queries take (H, C, P) as an argument instead of the solution set itself.

This is because it may be possible to answer the query without computing the entire

solution set, which could yield significant computational savings. Taking (H, C, P) as

an argument also allows the queries to be implemented in terms of Enumerate, which

facilitates both implementation of the queries and analysis of their computational

complexity.

The computational complexity of the queries is essentially the complexity of

enumerating one or two hypotheses. Whether enumerating a few hypotheses is

significantly cheaper than computing the entire solution set depends on the set

representation and on the COP itself. The computational complexity of enumeration

as a function of set representations is discussed in Chapter 3, and the question of

whether enumeration is cheaper than computing the whole solution set is discussed

in Chapter 4.

The queries are defined in terms of Enumerate as follows.

• Member(h, (H, C, P)) ⇔ Enumerate((H, C, P), {h}, 1) ≠ ∅

• Empty((H, C, P)) ⇔ Enumerate((H, C, P), H, 1) = ∅

• Unique((H, C, P)) ⇔ |Enumerate((H, C, P), H, 2)| = 1

• Subset((H, C, P), A) ⇔ Enumerate((H, C, P), Ā, 1) = ∅
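To make the reduction to Enumerate concrete, here is a Python sketch over extensional sets; enumerate_solutions and the (H, C, P) triple below are simplified stand-ins for the real operator and representation, not RS-KII's implementation.

    # Sketch: a COP is a tuple (H, C, P) of extensional sets; enumerate_solutions
    # returns up to n undominated elements of C that also lie in A.

    def enumerate_solutions(cop, A, n):
        H, C, P = cop
        undominated = [h for h in C if not any((h, g) in P for g in C)]
        return [h for h in undominated if h in A][:n]

    def member(h, cop):
        return len(enumerate_solutions(cop, {h}, 1)) != 0

    def empty(cop):
        return len(enumerate_solutions(cop, cop[0], 1)) == 0

    def unique(cop):
        return len(enumerate_solutions(cop, cop[0], 2)) == 1

    def subset(cop, A):
        complement_of_A = cop[0] - A       # the complement of A within H
        return len(enumerate_solutions(cop, complement_of_A, 1)) == 0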

It is conjectured that these four queries plus the enumeration operator are suf-

ficient for the vast majority of induction tasks. Most existing induction algorithms

involve only the enumeration operator and perhaps an Empty or Unique query.

The candidate elimination algorithm [Mitchell, 1982] and incremental version space

merging (IVSM) [Hirsh, 1990] use all four queries, but do not use the enumeration

operator.

2.2.5 Incremental versus Batch Processing

In an induction algorithm, a hypothesis is induced from all of the knowledge inte-

grated so far. Inducing the hypothesis involves some amount of work. When new

knowledge is integrated, the old induced hypothesis may no longer be valid in light

of the new knowledge. A new hypothesis should be induced. In an incremental

algorithm, the work done in inducing a hypothesis from the old knowledge can be

applied towards inducing the new hypothesis.

For example, in the candidate elimination algorithm [Mitchell, 1982], the version

space represents all of the hypotheses consistent with the examples. Any of these

hypotheses can be selected as the induced hypothesis. The work done in computing

the version space is conserved in computing a new version space consistent with the

old examples plus a new example. The new version space is computed by removing

inconsistent hypotheses from the old version space. The new version space does not

need to be computed from scratch.

In a batch algorithm, the old work cannot be applied to inducing a new hy-

pothesis. The algorithm must start over again from scratch. For example, in AQ-11

[Michalski, 1978], the hypothesis is found by a beam search guided by an information

theoretic measure, which is derived from the examples. If a new example is added,

the metric changes. The search must start from the beginning with the new metric.


KII is neither inherently batch nor inherently incremental. Rather, individual

instantiations of KII can be incremental or batch, depending on the set represen-

tation and the translators. These factors can be manipulated and analyzed, which

makes KII a potentially useful tool for experimenting with these factors, and may

suggest ways for making algorithms more incremental.

A hypothesis is induced from a collection of knowledge by translating the frag-

ments into (H, C, P) tuples, integrating the tuples into a single (H, C, P) tuple, and

then enumerating a hypothesis from the solution set of this composite tuple. When

new knowledge is added, it is translated into (H, C′, P′), integrated with the old (H, C, P) tuple to yield (H, C ∩ C′, P ∪ P′), and a new hypothesis is enumerated from (H, C ∩ C′, P ∪ P′).

There are two places where work done in processing the old knowledge can be saved or lost when new knowledge becomes available. The work done in enumerating a hypothesis from the solution set of the (H, C, P) tuple can be applied to enumerating a hypothesis from the solution set of (H, C ∩ C′, P ∪ P′). Work done in integrating the old knowledge can either be saved or lost in integrating the new knowledge fragment, (H, C′, P′).

Whether any of the work involved in enumerating a hypothesis from the solution

set of (H, C, P) can be applied to enumerating a hypothesis from the solution set of (H, C ∩ C′, P ∪ P′) depends largely on the set representations for C and P, and the extent of the changes introduced by C′ and P′. As an extreme example, if P and P′ are both empty, then the solution set to (H, C, P) is just C, and the solution set to (H, C ∩ C′, P ∪ P′) is just C ∩ C′. A hypothesis can be enumerated from C ∩ C′ by continuing the enumeration of C until a hypothesis is found that is also in C′. There is no need to start the enumeration of C over from the beginning, since any hypothesis not in C is not in C ∩ C′ either. All of the previous work is saved.
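The following Python sketch illustrates that extreme case with hypothetical generator-based helpers: the enumeration of C is a lazy generator, and when C′ arrives it is applied as an extra filter, so the generator is never restarted. This illustrates the idea only; it is not RS-KII's enumeration algorithm.

    # Sketch: C is enumerated lazily; newly arrived constraint sets are applied
    # as filters on the same, still-running enumeration.

    def enumerate_constraint(C):
        for h in C:
            yield h

    def next_consistent(enumeration, new_constraints):
        """Resume an existing enumeration and return the next hypothesis that
        also satisfies every newly added constraint set (or None)."""
        for h in enumeration:
            if all(h in C_new for C_new in new_constraints):
                return h
        return None

    # Usage: the same generator persists across knowledge updates.
    stream = enumerate_constraint(["s?c", "s??", "??c", "???"])
    first_guess = next_consistent(stream, [])                  # before C' arrives
    next_guess = next_consistent(stream, [{"s??", "??c"}])     # resumed after C'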

The second place where KII can either save or lose previous work is in integrating

knowledge. Work is lost only if the translation of the existing knowledge depends

on the new knowledge. This usually occurs in translators where one of the inputs

takes all knowledge of a given type (e.g., all the examples). When a new knowledge

fragment of this type is added, the old fragments must be retranslated to take the new

fragment into account. However, the old translation has already been integrated.

If this translation is not retracted before the new translation is integrated, then


the old and new translations may conflict. Since KII has no facility for retracting

translations, the only way to retract the old translation is to re-integrate all of the

knowledge from scratch, omitting the offending translation. The previous integration

work is lost, as is any enumeration work.

Not all translators of this type necessarily require the old translation to be re-

tracted when new knowledge arrives. Consider a translator, tran(E), that takes as input the set of all available examples. Let E be a collection of examples, and e a new example. Furthermore, let tran(E) be (H, C, P) and tran(E ∪ {e}) be (H, C*, P*). If (H, C, P) has already been integrated, then it usually must be retracted before integrating the tuple (H, C*, P*). However, consider what happens if (H, C*, P*) can be expressed as (H, C ∩ Ce, P ∪ Pe), where (H, Ce, Pe) is derived from e (and possibly E). The translator can simply output (H, Ce, Pe). This will be integrated with the previously integrated (H, C, P), producing (H, C*, P*), which is the correct translation of E ∪ {e}. There is no need to retract (H, C, P) first.
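A minimal Python sketch of such a translator, under the assumption that each example contributes an independent constraint, so that the translation of E ∪ {e} really does factor as (C ∩ Ce, P ∪ Pe); the function names and the coverage predicate passed in are illustrative, not KII's translator interface.

    # Sketch: translating one new classified example at a time; integration is
    # intersection of constraint sets and union of preference sets.

    def translate_example(H, example, positive, covers):
        """Return (Ce, Pe) for one example, given covers(hypothesis, instance)."""
        if positive:
            Ce = {h for h in H if covers(h, example)}
        else:
            Ce = {h for h in H if not covers(h, example)}
        return Ce, set()                    # a single example adds no preferences

    def integrate(pair1, pair2):
        """Conjoin constraints, disjoin preferences."""
        (C1, P1), (C2, P2) = pair1, pair2
        return C1 & C2, P1 | P2

Because each example is translated on its own, the previously integrated (C, P) never has to be retracted; the new pair is simply folded in with integrate.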

2.3 KII Solves an Induction Task

An example of how KII can solve a simple induction task is given below. Sets have

been represented extensionally in this example for illustrative purposes. This is not

the only possible set representation, and is generally a very poor one. The space of

possible set representations is investigated more fully in Chapter 3.

2.3.1 Hypothesis Space

The target concept is a member of a hypothesis space in which hypotheses are de-

scribed by conjunctive feature vectors. There are three features: size, color, and shape. The values for these features are size ∈ {small, large, any-size}, color ∈ {black, white, any-color}, and shape ∈ {circle, rectangle, any-shape}. A hypothesis is some assignment of values to features, for a total of 27 distinct hypotheses. Hypotheses are described as 3-tuples from size×color×shape. For shorthand identification, a value is specified by the first character of its name, except for the "any-x" values, which are represented by a "?". So the hypothesis (any-size, white, circle) would be written as (?, w, c) or just ?wc.


2.3.2 Instance Space

Instances are the "ground" hypotheses in the hypothesis space. An instance is a

tuple (size, color, shape) where size ∈ {small, large}, color ∈ {black, white}, and shape ∈ {circle, rectangle}.

A hypothesis covers an instance if the instance is a member of the hypothesis. Re-

call that a hypothesis is a subset of the instance space, as described in some language.

In this learning scenario, an instance is a member of (covered by) all hypotheses that

are more general than the instance. Specifically, an instance (z, c, s) is covered by every hypothesis in the set {z, any-size} × {c, any-color} × {s, any-shape}.
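As a concrete rendering of this coverage relation, here is a small Python sketch for the shorthand encoding used below (hypotheses and instances written as three-character strings such as "s?c"); the function name is illustrative.

    # Sketch: a feature of the hypothesis matches if it equals the instance's
    # value or is the 'any' value '?'.

    def covers(hypothesis, instance):
        return all(h == i or h == '?' for h, i in zip(hypothesis, instance))

    # For example, covers("s??", "swc") is True, while covers("?b?", "swc") is False.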

2.3.3 Available Knowledge

The available knowledge consists of three examples (classified instances), and an

assumption that accuracy increases with generality. There are three examples, two

positive and one negative, as shown in Table 2.1. The target concept is (s,?,?).

That is, size = small, and color and shape are irrelevant.

identifier   class   example
e-1          +       (small, white, circle)
e-2          +       (small, black, circle)
e-3          -       (large, white, rectangle)

Table 2.1: Examples.

2.3.4 Translators

The first step is to translate the knowledge into constraints and preferences. Three

translators are constructed, one for each type of knowledge: the positive examples,

negative examples, and the generality preference. These translators are shown in

Figure 2.1. As with all translators in KII, each of these translators takes the hy-

pothesis space, H, as one of its inputs. Since the hypothesis space is understood,

(H, C, P) tuples will generally be referred to as just (C, P) tuples for the remainder

of this illustration.


• TranPosExample(H, (z, c, s)) → (C, {}) where

C = {x ∈ H | x covers (z, c, s)}
  = {z, any-size} × {c, any-color} × {s, any-shape}

• TranNegExample(H, (z, c, s)) → (C, {}) where

C = {x ∈ H | x does not cover (z, c, s)}
  = complement of {z, any-size} × {c, any-color} × {s, any-shape}

• TranPreferGeneral(H) → ({}, P) where

P = {(x, y) ∈ H×H | x is more specific than y}
  = {(sbr, ?br), (sbr, s?r), (sbr, ??r), (swr, ?wr), ...}

Figure 2.1: Translators.

The examples are translated in this scenario under the assumption that they

are correct. Under this assumption, hypotheses that do not cover all of the positive

examples, or that cover any of the negative examples, can be eliminated from consid-

eration. Beyond these constraints, examples do not express preferences for certain

hypotheses over others. Thus, positive examples are translated into (C, P) pairs in

which P is empty and C is satisfied only by hypotheses that cover the example.

Negative examples are translated similarly, except that C is satisfied by hypotheses

that do not cover the example. The translated examples are shown in Table 2.2.

The hypothesis space is omitted for brevity.

Example    Class   C
e1: swc    +       swc, sw?, s?c, ?wc, s??, ?w?, ??c, ???
e2: sbc    +       sbc, sb?, s?c, ?bc, s??, ?b?, ??c, ???
e3: lwr    -       swc, swr, sw?, sbc, sbr, sb?, s?c, s?r, s??, lwc, lbc, lbr, lb?, l?c, ?wc, ?bc, ?br, ?b?, ??c

Table 2.2: Translated Examples.

The assumption that general hypotheses are more accurate is represented as a preference. Specifically, the assumption is translated into a (C, P) pair where C is empty (it rejects nothing), and P = {(x, y) ∈ H×H | x is more specific than y}, where H is the hypothesis space. Hypothesis x is more specific than hypothesis y if x is equivalent to y except that some of the values in y have been replaced by "any" values. For example, (s, w, r) is more specific than (?, w, r), but there is no ordering between (?, w, c) and (s, w, r). There are too many preferences to list explicitly, but the first few are shown in the definition for the TranPreferGeneral translator in Figure 2.1.
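The three translators can be prototyped directly over this 27-hypothesis space. The Python below is a sketch with illustrative names that builds the C and P sets extensionally, exactly as this example does; it is not the translator interface KII prescribes.

    # Sketch: hypotheses are strings over (size, color, shape), e.g. "s?c";
    # '?' means "any". Sizes: s, l; colors: b, w; shapes: c, r.

    from itertools import product

    H = {"".join(t) for t in product("sl?", "bw?", "cr?")}   # all 27 hypotheses

    def covers(h, x):
        return all(a == b or a == '?' for a, b in zip(h, x))

    def tran_pos_example(H, x):
        return {h for h in H if covers(h, x)}, set()

    def tran_neg_example(H, x):
        return {h for h in H if not covers(h, x)}, set()

    def more_specific(x, y):
        """x equals y except where y generalizes a value to '?'."""
        return x != y and all(a == b or b == '?' for a, b in zip(x, y))

    def tran_prefer_general(H):
        return set(), {(x, y) for x in H for y in H if more_specific(x, y)}

Applying tran_pos_example(H, "swc") reproduces the first row of Table 2.2.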

2.3.5 Integration and Enumeration

The COPs are integrated into a single COP, (C, P). C is the intersection of the C sets

for each knowledge fragment, and P is the union of preferences for each knowledge

fragment. This is shown in the following equation. The elements of each (C, P) pair

are subscripted with the knowledge fragment from which the pair was translated:

e1, e2, and e3 for the examples, and mg for the "more general" preference. In the following, (C1, P1) ⊕ (C2, P2) is shorthand for Integrate((C1, P1), (C2, P2)).

(C, P) = (Ce1, {}) ⊕ (Ce2, {}) ⊕ (Ce3, {}) ⊕ (H, Pmg)
       = (Ce1 ∩ Ce2 ∩ Ce3 ∩ H, {} ∪ {} ∪ {} ∪ Pmg)
       = ({s??, ??c, s?c}, {(sbr, ?br), (sbr, s?r), ...})

A hypothesis is induced by making a call to Enumerate((C, P), H, 1), which

returns a hypothesis selected arbitrarily from the deductive closure of (C,P). The

deductive closure consists of the undominated elements of C with respect to the

dominance relation P. One way to compute this set is to partially order C according

to P and find the elements at the top of the order.
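Computed extensionally for this example (a sketch; the literals below are the integrated C and the only preferences in P that relate members of C):

    C = {"s??", "??c", "s?c"}
    P = {("s?c", "s??"), ("s?c", "??c")}    # s?c is dominated by both others

    closure = {h for h in C if not any((h, g) in P for g in C)}
    print(closure)                          # {'s??', '??c'} (in some order)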

C contains three elements, (s,?,?), (?,?,c) and (s,?,c). P prefers both (s,?,?)

and (?,?,c) to (s,?,c), but there is no preference ordering between (s, ?, ?) and

(?,?,c). The deductive closure contains the undominated elements of C, namely

(s, ?, ?) and (?, ?, c). Figure 2.2 illustrates this computation.

Figure 2.2: Computing the Deductive Closure of (C, P).

The arrows in Figure 2.2 indicate the partial ordering imposed by P, and the highlighted hypotheses are the elements of C.

Enumerate((C, P), H, 1) returns one hypothesis selected arbitrarily from the deductive closure of (C, P) as the induced hypothesis. If it selects (s, ?, ?), then it has correctly induced the target concept. However, it may just as easily select (?, ?, c),

the other element of the deductive closure. The selection of a hypothesis is an in-

ductive leap, and not guaranteed to be correct. The amount of ambiguity in the leap

can be reduced by utilizing more examples or other knowledge, thus reducing the

number of hypotheses in the deductive closure.

Additional information about the deductive closure of (C,P) that is relevant

to induction can be obtained through the solution-set queries. For example, we

may wish to know whether or not the induced hypothesis was the only one in the

deductive closure. This and other queries are shown below.

• Unique({(s, ?, ?), (?, ?, c)}) = false

• Empty({(s, ?, ?), (?, ?, c)}) = false

• Member((?, ?, c), {(s, ?, ?), (?, ?, c)}) = true


Chapter 3

Set Representations

The KII framework discussed in Chapter 2 is not operational. Knowledge in KII

is represented as (C, P) pairs, where C is a set of hypotheses and P is a set of

hypothesis pairs. The integration, enumeration, and query operators are all defined

in terms of set operations on (C, P) pairs. KII does not specify any particular set

representation for C and P—in fact, these sets may be arbitrarily expressive. This

expressiveness allows KII to utilize any knowledge that can be expressed in terms of

constraints and preferences. However, arbitrarily expressive sets are not operational,

and so neither is KII.

In order to operationalize KII, operational set representations must be provided

for C and P. These representations determine which C and P sets can be expressed,

and therefore the kinds of knowledge that KII can utilize. The more expressive

the representation, the more knowledge that KII can use. However, set operations

in more expressive representations tend to have higher computational complexities.

Since KII's integration, enumeration, and query operators are defined in terms of set

operations, their computational complexity is determined by the set representation.

KII can be operationalized with many different set representations. The expres-

siveness of the representation, and the computational complexity of its set opera-

tions, determine the knowledge that the operationalization can utilize, and the cost

of its integration, enumeration, and query operations. If expressiveness is at a pre-

mium, one could select a very expressive set representation in exchange for increased

complexity. If speed is of the essence, an inexpressive yet low complexity represen-

tation may be more appropriate. There may also be a representation that blends


the best of both worlds: expressive enough to represent most knowledge, yet with a

reasonable computational cost.

In order to understand the space of operationalizations, it is necessary to under-

stand the space of possible set representations. Section 3.1 casts the space of set

representations in terms of the well understood space of grammars.

The set representations for C and P should also be closed under integration,

although this is not strictly necessary. This condition is discussed in Section 3.2,

and languages that are closed under integration are identified.

The complexity of induction also restricts the choice of set representations for C

and P. Operationalizations of KII with expressive representations tend to have high

induction costs, and for some representations, induction may even be undecidable.

Clearly, such representations cannot be used to operationalize KII. The most ex-

pressive representations for which induction is decidable within the KII framework

are identified in Section 3.3.

3.1 Grammars as Set Representations

Every computable set is the language of some grammar. Similarly, every com-

putable set representation is equivalent to some family of grammars. These families

include, but are not limited to, the families of the Chomsky hierarchy [Chomsky,

1959]—regular, context free, context sensitive, and recursively enumerable. Table 3.1

lists the language families in the Chomsky hierarchy in addition to other important

language families. The list is in order of decreasing expressiveness, with the most

expressive languages at the top and the least expressive at the bottom. The com-

plexity of set operations generally increases with the expressiveness of the language.

Each family in the list properly contains all of the families below it.

Every computable set representation either corresponds to one of the families in

the list of Table 3.1, or is subsumed by one of them. Mapping set representations

onto families of grammars clarifies the space of possible representations, and makes it

possible to use results from automata and formal language theory to analyze relevant

properties of a set representation, such as expressiveness, and the decidability and

computational complexity of set operations.


Recursively Enumerable (r.e.)
Recursive
Context Sensitive (CSL)
Context Free (CFL)
Deterministic Context Free (DCFL)
Regular
Finite

Table 3.1: Language Families.

3.2 Closure of C and P under Integration

The integration operator is defined in terms of intersection and union, as shown below.

Integrate((C1, P1), (C2, P2)) = (C1 ∩ C2, P1 ∪ P2)    (3.1)

The representation for C must be closed under intersection, and the representation

for P must be closed under union. This ensures that the (C, P) pair resulting from

integration can be expressed in these representations. Table 3.2 indicates which of

the languages in Table 3.1 are closed under intersection and union. Some languages are not closed under intersection, but are closed under intersection with regular languages. Intersection with a regular language is often denoted ∩R.

Operation    Regular   DCFL   CFL   CSL   recursive   r.e.
∩            ✓         —      —     ✓     ✓           ✓
∪            ✓         —      ✓     ✓     ✓           ✓

Table 3.2: Closure under Union and Intersection.

3.3 Computability of Induction

In order to induce a hypothesis, or to answer the solution set queries, it is necessary

to enumerate one or more hypotheses from the deductive closure, or solution set,


of (C, P). This is only possible if the solution set is at most recursively enumer-

able. Otherwise, the solution set cannot be enumerated by any Turing machine, and

therefore the enumeration and query operations are uncomputable.

Whether or not the solution set is recursively enumerable depends on the set

representations for C and P. Recall that the solution set is computed from C and

P according to the following equation:

complement(first((C×C) ∩ P)) ∩ C    (3.2)

A set representation is a class of grammars, and a specific set corresponds to a

grammar in the class. Closing the set representations for C and P under Equa-

tion 3.2 must yield at most the recursively enumerable languages. Otherwise, the

representations can express C and P sets for which the solution set is not recursively

enumerable.

It is possible to determine the most expressive set representations for C and

P that will yield a recursively enumerable solution set for every C and P set ex-

pressible in those representations. The closure properties of the set operations in

Equation 3.2 can be used to express the solution set representation as a function of

the representations for C and P. Inverting this function yields the most expressive

representations for which the solution set is recursively enumerable.

3.3.1 Closure Properties of Set Operations

The closure properties of intersection and complement are well known for most lan-

guage classes, although it is an open problem whether the context sensitive languages

are closed under complementation [Hopcroft and Ullman, 1979]. These properties

are summarized in Table 3.3. Cartesian product and projection (first) present more

of a problem. A Cartesian product can be represented in a number of different ways,

each with different closure properties and expressive power. The representations dif-

fer in the way that a pair, (x, y), in the Cartesian product is mapped onto a string

in the grammar that represents the product. Projection is the inverse of Cartesian

product, and its definition depends on the mapping used to represent Cartesian

products.


Operation    Regular   DCFL   CFL   CSL   recursive   r.e.
∩            ✓         —      —     ✓     ✓           ✓
complement   ✓         ✓      —     ?     ✓           —

Table 3.3: Closure Properties of Languages under Intersection and Complement.

The most straightforward mapping is concatenation. That is, a tuple (x, y)

in the Cartesian product is represented as the string xy in the grammar for the

product. Under this mapping, the set A×B is represented as A·B. Other mappings interleave the symbols in xy in various ways in order to represent the pair (x, y). For example, the symbols in x and y could be strictly alternated, or shuffled, so that (x, y) is represented as x1y1x2y2...xnyn, where x = x1x2...xn and y = y1y2...yn. An interleaving maintains the order of the symbols in each of the component strings, so that if xi comes before xi+1 in x, then xi comes before xi+1 in the interleaving as well. However, there may be an arbitrary number of symbols from y between xi and xi+1.

For a given Cartesian product, A×B, not all mappings of A×B can be represented in the same language class. For example, context free grammars are closed under concatenation, but not under shuffling [Hopcroft and Ullman, 1979]. If A and B are context free grammars, then under the concatenation mapping, A×B is represented as AB, which is also a context free grammar. Under the shuffle mapping, A×B is represented by the shuffle of A and B, which could be a context sensitive grammar. Whether a language class is closed under Cartesian product depends on the mapping. The expressive power of the various mappings is discussed in more detail in Section 4.1.2.4.

Although the closure properties of languages under Cartesian product depend on the mapping, the closure properties of languages under projection are fixed. Let X be some subset of A×B. Regardless of the mapping, X will be represented as some interleaving of the strings in A and B. Let ΣA be the alphabet for A and let ΣB be the alphabet for B. The string in X corresponding to (x, y) is some interleaving of the symbols in x and y; that is, (x, y) maps onto a string in (ΣA ∪ ΣB)*. Call this string w. Projecting w onto its first element extracts x from the interleaved string. If ΣA and ΣB are disjoint, this can be done by deleting from the string the symbols that belong to ΣB. Mappings preserve the order of the symbols in x and y, so the remaining symbols in the string are the symbols of x in the correct order.

If the alphabets of A and B are not disjoint, they can be made disjoint by

applying appropriate homomorphisms [Hopcroft and Ullman, 1979]. Applying a

homomorphism, h, to a language, L, replaces each symbol in the alphabet of L

with a string from some other alphabet (each symbol can be replaced by a different

string). The strings in h(L) are the same as in L, except that the symbols in each string have been substituted according to h. For example, if h replaces a by 1 and

replaces b by 2, then h({ab,bb}) is {12,22}.
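A string homomorphism of this kind takes only a couple of lines of Python (a sketch; h is given as a symbol-to-string table):

    # Sketch: apply a homomorphism h (dict from symbols to replacement strings)
    # to every string of a finite language L.

    def apply_homomorphism(h, L):
        return {"".join(h[symbol] for symbol in word) for word in L}

    # The example from the text: replace a by "1" and b by "2".
    print(apply_homomorphism({"a": "1", "b": "2"}, {"ab", "bb"}))   # {'12', '22'}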

Let ΣA be the alphabet of A and let ΣB be the alphabet of B. If ΣA and ΣB are not disjoint, define homomorphisms h1 and h2 such that h1 maps symbols in ΣA into corresponding symbols in ΣA′, and h2 maps symbols in ΣB to corresponding symbols in ΣB′. The alphabets ΣA′ and ΣB′ are selected so as to be disjoint. The Cartesian product of A and B is defined to be some interleaving of the strings in h1(A) and h2(B). In the case of the solution set formula, the only Cartesian product is C×C, which is represented by some interleaving of the strings in h1(C) and h2(C).

Let X be a subset of h1(A)×h2(B), where h1 and h2 are defined as described above. The projection first(X) is computed by erasing from strings in X the symbols that belong to ΣB′, where ΣB′ is the alphabet of h2(B). This is accomplished by a homomorphism that maps every symbol in ΣB′ to the empty string, ε. The resulting strings are members of ΣA′*, where ΣA′ is the alphabet of h1(A). However, strings in first(X) should be composed of symbols from ΣA, not symbols from ΣA′. A second homomorphism is applied that maps symbols in ΣA′ to corresponding symbols in ΣA. This is the inverse of h1. The last two homomorphisms can be accomplished by a single homomorphism, h′, that maps symbols in ΣB′ onto ε, and maps symbols in ΣA′ onto symbols in ΣA. This definition of projection in terms of homomorphisms is summarized in Figure 3.1.

The closure properties of languages under arbitrary homomorphisms are well

known—specifically, the regular, context free, and recursively enumerable languages

are closed under arbitrary homomorphisms, and the rest are not.1

¹More precisely, full trios and full AFLs are closed under arbitrary homomorphisms [Hopcroft and Ullman, 1979].


Let A" be a subset of hi(A)xh2(B) where

h\{<7i) = the ith symbol in EAi

^2(^1) = the ith symbol in E#'

E^/DEB' =0

first(X) = h'(X) where

,,, » f the ith symbol in E^ if ai G E^/ v ' [ e if a* G EB>

Figure 3.1: Projection Defined as a Homomorphism.

The closure properties of languages under projection, intersection, intersection

with a regular grammar, and complement are summarized in Table 3.4. The closure

properties of Cartesian product depend on the mapping, as discussed earlier.

Operation                     Regular   DCFL   CFL   CSL   recursive   r.e.
∩                             ✓         —      —     ✓     ✓           ✓
∩R                            ✓         ✓      ✓     ✓     ✓           ✓
complement                    ✓         ✓      —     ?     ✓           —
projection (homomorphisms)    ✓         —      ✓     —     —           ✓

Table 3.4: Closure Under Operations Needed to Compute the Solution Set.

3.3.2 Computability of Solution Set

The closure information in Table 3.4 is sufficient to determine the most expressive

class of languages for C and P for which the solution set is recursively enumerable

(computable). First, expressiveness bounds on (C×C) ∩ P are established from the r.e. bound on the solution set, complement(first((C×C) ∩ P)) ∩ C, using the known closure properties of projection, intersection, and complement. The next step is to establish expressiveness bounds on C and P from the bounds on (C×C) ∩ P. The dependence


between the closure properties of languages under Cartesian product and the map-

ping used to represent the product makes this somewhat imprecise, but an effective

bound can still be obtained on the expressiveness of C and P.

3.3.2.1 Expressiveness Bounds on (C×C) ∩ P

The first step is to establish expressiveness bounds on (C×C) ∩ P. This is accomplished with the following theorem.

Theorem 1 complement(first((C×C) ∩ P)) ∩ C is computable (r.e.) if and only if (C×C) ∩ P is at most context free and C is at most r.e.

Proof. The proof of this theorem follows from the closure properties of the lan-

guage representing (C×C) ∩ P under projection, complement, and intersection. If (C×C) ∩ P is context free, then first((C×C) ∩ P) is also context free. Closing the context free languages under complement yields the recursive languages, so the set complement(first((C×C) ∩ P)) could require a recursive language to express. Intersecting this set

with C yields the solution set. Both recursive and r.e. languages are closed under

intersection, so the solution set is r.e. as long as C is at most r.e. (i.e., C is not

uncomputable).

If (C×C) ∩ P is not context free, then first((C×C) ∩ P) can be recursively enumerable, since this is the next most expressive language that is closed under projection. Recursively enumerable sets are not closed under complement. The complement of a recursively enumerable set is not recursively enumerable.² Therefore, complement(first((C×C) ∩ P)) is not computable. Intersecting this complement with C yields the

solution set. The intersection of an uncomputable set with any other set is also

uncomputable, so the solution set is uncomputable. □

3.3.2.2 Expressiveness Bounds on C and P

The expressiveness bounds on C and P can be derived from the bounds on (C×C) ∩ P established in the previous subsection. (C×C) ∩ P can be at most context free.

²For every r.e. language L, the complement of L is not r.e. [Hopcroft and Ullman, 1979]. This is stronger than simply stating that the r.e. languages are not closed under complement. If only the latter were true, it might be possible to find r.e. languages whose complements are also r.e. However, the first statement says that there are no such languages.


Context free sets are not closed under intersection, so C×C and P cannot both be context free. Table 3.4 indicates that, at most, one of C×C and P can be regular and the other can be context free. This follows from the closure of context free

grammars under intersection with regular grammars.

The expressiveness bound on C depends on what mapping is used to represent

C×C, since different mappings yield different closure properties for languages under

Cartesian product. Regular grammars are closed under Cartesian product for the

concatenation mapping, and for arbitrary interleavings (e.g., shuffling). This follows

from the closure of regular grammars under concatenation and interleaving [Hopcroft

and Ullman, 1979]. Context free grammars are closed under concatenation, but

not under arbitrary interleavings [Hopcroft and Ullman, 1979]. C can be at most

context free if the concatenation mapping is used, or at most regular if an interleaving

mapping such as shuffle is used.

Another possible set representation is the null representation. The only set ex-

pressible in this representation is the empty set. An instantiation of KII that rep-

resents P with the null representation cannot use any preference information—the

P set is always empty. When P is represented this way, the solution set equation

reduces to just C. This follows from the closure of the null representation under

intersection with any set (even an uncomputable one). The intersection of any set

with the empty set is just the empty set. Since the solution set reduces to C when P

is represented by the null representation, C can be at most recursively enumerable.

It is also possible to choose the null representation for C instead of for P. How-

ever, choosing this representation for C means that C is always empty. When C is

always empty, so is the solution set, so this is not a very useful representation for C.

Table 3.5 summarizes the expressiveness bounds on (C×C) ∩ P, C, and P, under

the assumption that the representations for C and P are closed under the given

mapping for Cartesian product.

The requirement that the solution set be computable restricts the choice of lan-

guages for C and P to at most context free for one of them, and regular for the

other. This holds when the mapping for Cartesian product is concatenation, but for

interleaving-type mappings, C can be at most regular, and P can be at most context

free.


C            P            (C×C) ∩ P    complement(first((C×C) ∩ P)) ∩ C
≤ regular    ≤ regular    ≤ regular    ≤ regular
≤ regular    ≤ DCFL       ≤ DCFL       ≤ recursive
≤ regular    ≤ CFL        ≤ CFL        ≤ recursive
≤ CFL        ≤ regular    ≤ CFL       ≤ recursive
> CFL        > CFL        > CSL        uncomputable
                          > CFL        uncomputable
null         ≤ r.e.       null         null
≤ r.e.       null         null         ≤ r.e.

Table 3.5: Summary of Expressiveness Bounds.

The integration operator further requires that the language for C be closed under

intersection, and that the language for P be closed under union. Both the regular

and context free languages are closed under union, so both of these are appropriate

representations for P. Of these two languages, only the regular languages are closed

under intersection, so C is limited to the regular languages. The only choice allowed

by the restrictions of the integration operator is to represent P as at most a context

free grammar, and C as at most a regular grammar.

It is possible for C to be context free and P to be regular under certain limited

conditions. Context free languages are not closed under intersection, but they are

closed under intersection with regular sets. This means that at most one of the

(C, P) pairs being integrated can have a context free C set as long as the C sets of

the remaining pairs are regular. Integrating all of these pairs results in a (C, P) pair

in which C is context free and P is regular.


Chapter 4

RS-KII

RS-KII is an implementation of KII in which C and P are represented as regular sets.

This section describes the regular set representation, and implementations of all the

KII operations for regular sets: integrate, enumerate, and the solution-set queries.

Translators are not discussed since translator specifications and implementations are

not fixed, but are provided by the user for each knowledge source and hypothesis

space.

4.1 The Regular Set Representation

A regular set is specified by a regular grammar, and contains all strings recognized

by the grammar. Regular sets are closed under intersection, union, complement,

concatenation, and a number of other useful operations [Aho et al., 1974]. Since the

solution set is defined in terms of these operations, the solution set is also expressible

as a regular grammar.

For every regular grammar there are equivalent deterministic finite automata

(DFAs). Since DFAs are easier to work with than regular grammars, regular sets are

implemented as DFAs. DFAs are discussed in Section 4.1.1, and the set operations

are defined in Section 4.1.2.

4.1.1 Definition of DFAs

A DFA is defined as a 5-tuple (Q, s, δ, F, Σ), where Q is a finite set of states, s ∈ Q is the start state, δ is a state transition function from Q×Σ to Q, F is a set of final (accept) states, and Σ is an alphabet of input symbols. A second move function, δ*, from Q×Σ* to Q, that is defined over strings instead of single symbols, can be derived from δ. For example, δ*(q, w) returns the state reached from q by path w ∈ Σ*. A string w ∈ Σ* is accepted by a DFA iff δ*(s, w) ∈ F.

Every regular set can be represented by several equivalent DFAs. Among these

DFAs there is exactly one minimal DFA[Hopcroft and Ullman, 1979]. A minimal

DFA has the fewest states needed to recognize a given regular set. A non-minimal

DFA can be minimized as follows.

1. Remove all states that can not be reached from the start state.

2. Remove all states that can not reach an accept state (these are dead states).

3. Merge equivalent states.

Two states, p and q, are equivalent if for every input string w, δ*(p, w) is in F if and only if δ*(q, w) is in F. That is, the DFAs (Q, p, δ, F, Σ) and (Q, q, δ, F, Σ) recognize exactly the same strings. There are well known algorithms for minimizing DFAs, but we will not describe them here. See [Hopcroft and Ullman, 1979] for more information on this topic. In general, minimizing a DFA takes time O(|Σ||Q|²).
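For concreteness, a DFA of this kind can be prototyped in a few lines of Python; this is a sketch only, and RS-KII's actual DFA implementation (Section 4.1.3) also carries dead-state and accept-all-state information.

    # Sketch: a DFA as a 5-tuple with a dictionary-based transition function.

    class DFA:
        def __init__(self, states, start, delta, finals, alphabet):
            self.states, self.start = states, start
            self.delta, self.finals, self.alphabet = delta, finals, alphabet

        def accepts(self, word):
            """Run delta* from the start state and test for an accept state."""
            q = self.start
            for symbol in word:
                q = self.delta[(q, symbol)]
            return q in self.finals

    # A two-state DFA over {a, b} accepting strings with an odd number of a's.
    odd_a = DFA({0, 1}, 0,
                {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1},
                {1}, {'a', 'b'})
    print(odd_a.accepts("ab"))   # True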

4.1.2 Definitions of Set Operations

The set operations used by KII for integrating (H, C, P) tuples are intersection

and union. The representation for C must be closed under intersection, and the

representation for P must be closed under union. The solution set is defined in

terms of complement, Cartesian product, projection, intersection and union. The

representations for C and P do not need to be closed under these operations, since

the solution set may require a more expressive representation than that used for

either C or P. However, regular grammars happen to be closed under all of these

operations. This guarantees that the solution set can also be expressed as a regular

grammar. These operations are defined below for regular grammars.

Finally, an operation is needed for computing the transitive closure of P. The

enumeration operator assumes that P is transitively closed, but this condition is not

preserved by integration. An operator is needed for transitively closing P prior to


enumerating the solution set. This operator is described below, along with the other

set operations.

4.1.2.1 Union

The union of two DFAs, A and B, results in a DFA that recognizes a string if the

string is recognized by either A or B. The DFA for the union can be constructed

from A and B as shown in Figure 4.1. A state in this DFA is a pair of states,

(qA, qB), one from A and one from B. The move function for the union is computed from the move functions of A and B. On input σ, A ∪ B simulates a move in A from qA and a move in B from qB, both on input σ. A state (qA, qB) in the union is a

final state if qA is a final state in A or if qB is a final state in B.

(Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) ∪ (Q2, s2, δ2, F2, Σ2)

where

Q = Q1 × Q2
s = (s1, s2)
δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))
F = (F1 × Q2) ∪ (Q1 × F2)
Σ = Σ1 ∪ Σ2

Figure 4.1: Definition of Union.

4.1.2.2 Intersection

A string is accepted by the intersection of two DFAs, A and B, if it is accepted by

both A and B. The DFA for the intersection of A and B can be constructed from

A and B as shown in Figure 4.2. As with union, a state in the DFA for A ∩ B is a pair of states, (qA, qB), one from A and one from B. On input σ, A ∩ B simulates a move in A from qA and a move in B from qB, both on input σ. If both qA and qB

are accept states in their respective DFAs, then (qA,qB) is an accept state in the

intersection.


(Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) ∩ (Q2, s2, δ2, F2, Σ2)

where

Q = Q1 × Q2
s = (s1, s2)
δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))
F = F1 × F2
Σ = Σ1 ∩ Σ2

Figure 4.2: Definition of Intersection.
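Both figures are instances of the same product construction, differing only in which pair-states are accepting. A Python sketch (dictionary-based DFAs over a shared alphabet; the tuple layout and function names are illustrative):

    # Sketch: product construction; union accepts when either component accepts,
    # intersection only when both do.

    def product_dfa(A, B, alphabet, accept_pair):
        """A and B are (start, delta, finals); delta maps (state, symbol) -> state."""
        (sA, dA, fA), (sB, dB, fB) = A, B
        start = (sA, sB)
        delta, finals = {}, set()
        frontier, seen = [start], {start}
        while frontier:
            qA, qB = frontier.pop()
            if accept_pair(qA in fA, qB in fB):
                finals.add((qA, qB))
            for a in alphabet:
                nxt = (dA[(qA, a)], dB[(qB, a)])
                delta[((qA, qB), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return start, delta, finals

    def union(A, B, alphabet):
        return product_dfa(A, B, alphabet, lambda a, b: a or b)

    def intersection(A, B, alphabet):
        return product_dfa(A, B, alphabet, lambda a, b: a and b)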

4.1.2.3 Complement

The complement of set A with respect to the universe Σ* is written Ā. A string is in Ā if and only if it is not accepted by the DFA for set A. The DFA for Ā is exactly the same as the DFA for A, except that accept states in A's DFA correspond to the reject states in Ā's DFA, and vice versa. A reject state is any state that is not an

accept state. The construction for the complement of a DFA is shown in Figure 4.3.

Ā = (Q, s, δ, Q − F, Σ), where A = (Q, s, δ, F, Σ)

Figure 4.3: Definition of Complement.

4.1.2.4 Cartesian Product

The Cartesian product of two sets, A and B, is a set of tuples {(a, b) | a ∈ A and b ∈ B}. A set of tuples cannot be directly represented as a regular grammar, or indeed as

the language of any grammar. This is because the language of a grammar consists of

strings, not tuples per se. A mapping between tuples and strings is required. Under

mapping M, the set A×B is represented by a regular grammar whose language is {M(x, y) | (x, y) ∈ A×B}. This language is also denoted as M(A×B). The choice

of mapping greatly determines which sets of tuples can be expressed by regular

grammars.


One obvious mapping is concatenation. That is, the Cartesian product of two reg-

ular sets is represented by the concatenation of the two sets. Formally, M((x, y)) = xy, and M(A×B) = AB. Regular grammars are closed under concatenation, so

regular grammars are also closed under Cartesian product for this mapping.

A less obvious mapping is to map (x, y) onto a string w where w is an interleav-

ing of the strings x and y. For example, the tuple (x1x2...xn, y1y2...yn) would be represented by the string "x1y1x2y2...xnyn", where the symbols of x and y alternate. This mapping is called shuffling. Under this mapping, A×B is represented by shuffle(A, B), which is the set {shuffle(x, y) | x ∈ A and y ∈ B}. Regular grammars

are closed under shuffling. A DFA for the shuffle of any two DFAs is specified in

Section 4.1.2.5. This proves by construction that regular grammars are closed under

shuffling.

Differences in Expressiveness. The Cartesian product of any two regular sets

can be expressed as a regular set under both the concatenation and shuffle mappings.

This follows from the closure of regular sets under both concatenation and shuffling.

Every subset of a Cartesian product that can be expressed as a regular grammar

under the concatenation mapping can also be expressed as a regular grammar under

the shuffle mapping. However, some subsets can be expressed as regular grammars

under the shuffle mapping, but not under the concatenation mapping. The shuffle

mapping is strictly more expressive than the concatenation mapping in this sense.

Let S be a subset of A×B, where A and B are sets, but not necessarily regular sets. If S can be expressed as a regular grammar under the concatenation mapping, then S can also be expressed as a regular grammar under the shuffle mapping. To see why this is so, consider that S can be written as ∪i Ai×Bi, where Ai ⊆ A and Bi ⊆ B. S can be expressed as a regular grammar under the concatenation mapping if and only if S can be written as a finite union of Ai×Bi pairs, where each Ai and Bi is a regular set. Under the concatenation mapping, this union is represented as A1B1 ∪ A2B2 ∪ ... ∪ AkBk, where k is finite. Regular grammars are closed under concatenation and finite union, so A1B1 ∪ ... ∪ AkBk is a regular grammar. Under the shuffle mapping, S is represented as shuffle(A1, B1) ∪ ... ∪ shuffle(Ak, Bk). Since regular sets are closed under shuffling, this is also a regular grammar. Therefore, every subset of A×B expressible as a regular grammar under the concatenation mapping is also expressible as a regular grammar

under the shuffle mapping.

The reverse is not necessarily true. Some subsets of a Cartesian product can be

expressed as regular grammars under the shuffle mapping, but not under the con-

catenation mapping. For example, consider the set {(w, w) | w ∈ (a|b)*}. Under the concatenation mapping, M({(w, w) | w ∈ (a|b)*}) = {ww | w ∈ (a|b)*}, which is a context sensitive language [Hopcroft and Ullman, 1979]. Under the shuffle mapping, M({(w, w) | w ∈ (a|b)*}) = {shuffle(w, w) | w ∈ (a|b)*}, which is ((aa)|(bb))*.¹

An Illustration. The following is a simple example that will hopefully provide

some intuition for the differences between these representations. Let H be a set of

strings recognized by the regular grammar (a|b)*$, where $ indicates the end of a string. Let P be the preference {(x, y) ∈ H×H | x < y}, where < is the standard dictionary ordering (e.g., "aab$ < ab$").

P can be expressed under the shuffled mapping, but not under the concatenation

mapping. In the shuffled mapping, the tuple (x = x1x2..., y = y1y2...) maps onto the string x1y1x2y2.... If x1 = a and y1 = b, then x < y, so (x, y) is in P. If x1 = b and y1 = a, then y < x, so (x, y) is not in P. If x1 = y1, then a similar comparison is made between x2 and y2. If x2 and y2 are equal, then x3 and y3 are compared, and so on until the compared symbols differ, or one of the symbols is the "$" string termination symbol. If x terminates before y, but the two strings are equal up to that point, then x comes before y in dictionary order, so (x, y) is also in P. If both terminate at the same time, or x is longer than y, then x ≥ y, so (x, y) is not in P. A regular grammar that recognizes {shuffle(x, y) | (x, y) ∈ (a|b)*$ × (a|b)*$ s.t. x < y} is shown in Figure 4.4.

This grammar recognizes a shuffled string, x1y1x2y2..., if adjacent pairs of symbols, xiyi, in the string are the same up to some pair xkyk. This is recognized by the production SAME. In the next pair, xk+1yk+1, xk+1 must be lexicographically less than yk+1, or xk+1 must be the end-of-string symbol ($). These conditions are recognized by X_LESS_THAN_Y and $(a|b), respectively. In the first case, the remaining symbols alternate between symbols from x and y until one of the strings has no

¹shuffle(x1x2...xn, x1x2...xn) = x1x1x2x2...xnxn. Symbol xi can be either a or b, so x1x1x2x2...xnxn is a member of ((aa)|(bb))*.


S → SAME X_LESS_THAN_Y ANY_XY
S → SAME ($(a|b)) ANY_Y
SAME → ((aa) | (bb))*
X_LESS_THAN_Y → ab
ANY_XY → (a|b)(a|b) ANY_XY
ANY_XY → (a|b)$ ANY_X
ANY_XY → $(a|b) ANY_Y
ANY_X → (a|b)*$
ANY_Y → (a|b)*$

Figure 4.4: Regular Grammar for {shuffle(x, y) | x, y ∈ (a|b)*$ s.t. x < y}

more symbols. After this, the remaining symbols are from the longer string. Strings

of this form are recognized by ANY_XY. In the second case, the string x has termi-

nated, so the remaining symbols are all from y. Strings of this form are recognized by ANY_Y.

P cannot be recognized by a regular grammar under the concatenation mapping.

Under the concatenation mapping, (x, y) maps onto xy. Every symbol in x is seen

by the DFA for the grammar before any symbol of y is seen. The DFA for the

grammar would have to store enough information about x to determine whether

x < y. However, in order to make this determination, all the symbols of x must be

saved. The string x can be arbitrarily long, but by definition a DFA can have only a

finite number of states. Therefore, the DFA cannot store enough information about

x to make the determination. The shuffled grammar gets around this problem by

changing the order in which the symbols are seen, so that all the information needed

to make the determination is available locally. Very little state has to be saved, if any.

More formally, P cannot be expressed as a regular grammar under the concate-

nation mapping because P would not be a regular language. That this language is

not regular can be proven by the pumping lemma, which says that a language L is

not regular if for all n it is possible to select an integer i and a string z from L such that for every way of breaking z into strings u, v, and w, the string uv^i w is not in L. The length of uv must be less than n, and v cannot be the empty string.


P is a set of strings of the form wαwβ, where w is a string in (a|b)*, and either α is in a(a|b)*$ and β is in b(a|b)*$, or α = $ and β is a string in (a|b)+$. For an arbitrary but fixed n, let z be the string b^n a$ b^n b$. For all ways of breaking z into u, v, and w such that |uv| < n and v is not empty, uv is the string b^k where 1 ≤ k < n, and w is the string b^(n−k) a$ b^n b$. The string uv^i w is b^(k−|v|) b^(|v|i) b^(n−k) a$ b^n b$, which is equivalent to b^n b^(|v|(i−1)) a$ b^n b$. This string is not in P when |v|(i−1) ≥ 1, since b^n b^(|v|(i−1)) a$ comes after b^n b$ in dictionary order under this condition. Since v must be chosen to be non-empty, |v|(i−1) ≥ 1 when i > 1. Therefore, there exists at least one i for which uv^i w is not in P when uvw is in P. Therefore, according to the pumping lemma, P is not regular.

Definition of Cartesian Product as Shuffling. Because of the superior expressive power of the shuffle mapping, RS-KII defines the Cartesian product of two regular sets, A and B, as shuffle(A, B). The DFA for shuffle(A, B) is constructed from

DFAs A and B as shown in Figure 4.5.

The inputs to the DFA are strings of the form shuffle(x, y). As mentioned above,

this returns a string in which the symbols from x and y alternate. This is defined here

a little differently than it was above in order to address some subtle points. First,

each symbol is preceded by an identifier that indicates whether the symbol belongs

to x or y. This helps the DFA sort out the symbols. For example, shuffle(abc, xyz) = 1a2x1b2y1c2z. Second, x may have fewer symbols than y, or vice versa. If one

string has fewer symbols than the other, then the symbols alternate until the shorter

string is exhausted, after which the remaining symbols are all from the longer string.

For example, shuffle(abcd, xy) = 1a2x1b2y1c1d.
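A direct transcription of this encoding into Python (a sketch; the '1' and '2' identifier convention follows the description just given):

    # Sketch: interleave x and y, tagging each symbol with '1' (from x) or '2'
    # (from y); once the shorter string runs out, the rest of the longer follows.

    def shuffle(x, y):
        out = []
        for cx, cy in zip(x, y):
            out.append('1' + cx)
            out.append('2' + cy)
        shorter = min(len(x), len(y))
        tag, rest = ('1', x[shorter:]) if len(x) > len(y) else ('2', y[shorter:])
        out.extend(tag + c for c in rest)
        return "".join(out)

    print(shuffle("abc", "xyz"))    # 1a2x1b2y1c2z
    print(shuffle("abcd", "xy"))    # 1a2x1b2y1c1d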

The DFA for shuffle(A,B) processes the input string shuffle(x,y) by simulating

moves in A on the symbols from x and simulating moves in B on the symbols from

y. The shuffled DFA accepts the string shuffle(x, y) iff x is accepted by A and y is

accepted by B.

More formally, a state in the shuffled DFA is a pair (qA, qB) where qA is a state in A and qB is a state in B. On input σ, a move is simulated in A if σ is from ΣA, and simulated in B if σ is from ΣB. The shuffled DFA accepts when both A and B are

in accept states.


(Q, s, δ, F, Σ) = (Q1, s1, δ1, F1, Σ1) × (Q2, s2, δ2, F2, Σ2)

where

Q = Q1 × Q2
s = (s1, s2)
F = F1 × F2
Σ = Σ1 ∪ Σ2 ∪ {1, 2}, where 1 and 2 are not in Σ1 or Σ2

Figure 4.5: Definition of Cartesian Product.

4.1.2.5 Projection

The function first (A) projects a set of pairs, A, onto the set of first elements of the

pairs. For example, first({(x1, y1), (x2, y2), ..., (xn, yn)}) returns {x1, x2, ..., xn}. As discussed in Section 4.1.2.4, A is a set of pairs represented by a regular set of the form {M(x, y) | (x, y) ∈ A}, where M is a mapping from (x, y) pairs onto strings. Given this representation for A, the regular set representing first(A) is of the form {x | ∃y M(x, y) ∈ A}, where M(x, y) is the same mapping used in A.

The DFA for first(A) recognizes a string x if there is a string y such that M(x, y)

is recognized by the DFA for A. The DFA is provided with the string x as input,

but it must make a guess as to what y should be. It can then test whether M(x, y)

is recognized by the DFA for A.

If the mapping M is concatenation, then the DFA for first (A) is relatively

straightforward. It simulates moves in the DFA for A on input x. After consuming

x, the DFA is in state q. If there is a path from q to an accept state, then there

exists a y such that M(x, y) is accepted by the DFA for A. The string y corresponds

to the path. If there is no such path, so that q is a dead state, then x is not accepted

by the DFA for first (A).

RS-KII does not use the concatenation mapping, however. It uses the shuffle

mapping instead, due to its greater expressiveness (see Section 4.1.2.4). The shuffle

mapping interleaves the symbols from x and y, which complicates the DFA for


first(A). The DFA is provided with x as input, but must guess a value for y. If

shuffle(x, y) is recognized by the DFA for A, then x is in first(A). The symbols of x

and y are interleaved by shuffling, so that a symbol from x is followed by one from

y. After simulating a move in A on symbol xi from x, a symbol is guessed for yi, the next symbol of y, and a move is simulated in A on yi.

A DFA that makes "guesses" of this sort is most easily described by a non-

deterministic DFA (NDFA). An NDFA is just like a DFA, except that it can have

a choice of several next-states on any given input. On any given move, one of the

possible next-states is selected arbitrarily, and the NDFA moves to this state. The

possible next-states are shown as a set of states in the definition of the NDFA's δ

function. Every NDFA can be converted into an equivalent DFA. The NDFA for the

projection will be described first, for clarity, and then this grammar will be converted

into an equivalent DFA.

Constructing the NDFA. The NDFA for first(A) is constructed as follows. The

input to the NDFA is a string x. The NDFA guesses a string y, such that shuffle(x, y)

is in A. On each input σx from x, the NDFA simulates a move in A on σx. This is followed by a move in A on a symbol σy. This second symbol is the NDFA's guess for the next symbol in y. The set of possible guesses for σy is the alphabet for y, Σy. Thus from a state q in A, the NDFA can move on input σx to any state in {δA(q′, σy) | q′ = δA(q, σx) and σy ∈ Σy}. The actual next state is selected non-

deterministically on every move. The NDFA effectively creates an input string for

A composed of alternating symbols from x and its guess for y.

An NDFA accepts a string x if there is some sequence of non-deterministic moves

on input x that will lead to an accept state. In other words, the NDFA described

above accepts x if there is some way to select symbols for y such that shuffle(x,y)

is recognized by A.

Modifying the NDFA. This NDFA works fine as long as x and y are the same

length, but must be modified slightly to deal with cases where x and y are of different

lengths. Let (x,y) be a pair such that shuffle(x,y) is in A. If x is shorter than y,

then shuffle(x,y) alternates symbols from x and y until x is exhausted, after which

the remaining symbols are all from y. After the NDFA sees the last symbol in x,


it will have to guess the remaining symbols in y, and make moves in A on each of

those symbols. The NDFA described above does not do this—it guesses only one y

symbol for each symbol in x.

A similar condition holds when x is longer than y. In this case, shuffle(x,y)

alternates symbols from x and y until y ends, after which the remaining symbols are

all from x. After y terminates, the NDFA should not make any guesses for y. This

is equivalent to guessing the empty string for y on each move. However, the NDFA

described above does not guess a string for y after each symbol from x, but instead

guesses a single symbol for y.

For both cases, the NDFA must be able to guess a string of symbols for y after

each symbol from x instead of guessing only a single symbol. The NDFA can be made

to do this by changing the next-state function as follows. From state q on input σx, the NDFA can move to any state in {δA(q′, w) | w ∈ Σy* and q′ = δA(q, σx)}. Thus on every input σx, the NDFA can move to any state in A reachable by a string in σxΣy*. In most cases, the only states reachable in A from q are of the form σxσy. In this case the modified NDFA behaves just like the NDFA described above. However, when x and y are of differing lengths, and all of the symbols in one of the strings have been exhausted, there will be states reachable from q of the form σxΣy*.

One final modification is also needed. A expects symbols in x to be preceded by

the identifier "1", and symbols in y to be preceded by the identifier "2". The NDFA's

move function must be modified to provide these additional identifier symbols to A.

The NDFA with all of the above modifications is shown in Figure 4.6.

(Q, s, δ′, F, Σ) = first((Q, s, δ, F, Σ)) where

δ′(q, σ) = {δ(q′, w) | w ∈ (2Σ)* and q′ = δ(q, 1σ)}

Figure 4.6: NDFA for Projection (First).

Converting the NDFA to a DFA. The NDFA in Figure 4.6 can be converted

into an equivalent DFA. Each state in the DFA corresponds to a set of states in the


NDFA. The state moved to from state q in the DFA on input σ is the set of NDFA states that can be reached non-deterministically on input σ from any of the NDFA

states in q. A set of NDFA states is an accept state in the DFA if the set contains

at least one state that is an accept state in the NDFA. The equivalent DFA for the

NDFA of Figure 4.6 is shown in Figure 4.7.

(Q′, s′, δ′, F′, Σ) = first((Q, s, δ, F, Σ))

where

Q′ = 2^Q (the power set of Q)
s′ = {s}
δ′({q1, q2, ..., qn}, σ) = ∪ᵢ {δ(q′, w) | w ∈ (2Σ)* and q′ = δ(qi, 1σ)}
F′ = {q ∈ 2^Q | q ∩ F ≠ ∅}

Figure 4.7: DFA for Projection (First).
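The conversion used here is the standard subset construction. A generic Python sketch of that construction (for an NDFA whose move function already returns sets of states; the projection-specific guessing of Figure 4.6 is assumed to be folded into that move function):

    # Sketch: powerset (subset) construction from an NDFA to an equivalent DFA.
    # ndfa_delta(q, a) returns the SET of states reachable from q on symbol a.

    def subset_construction(start, ndfa_delta, finals, alphabet):
        dfa_start = frozenset({start})
        dfa_delta, dfa_finals = {}, set()
        frontier, seen = [dfa_start], {dfa_start}
        while frontier:
            S = frontier.pop()
            if S & finals:                   # accept if any member state accepts
                dfa_finals.add(S)
            for a in alphabet:
                T = frozenset(t for q in S for t in ndfa_delta(q, a))
                dfa_delta[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    frontier.append(T)
        return dfa_start, dfa_delta, dfa_finals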

DFA for Second. Although it is not strictly necessary, the function second (A) is

occasionally useful. This function projects a set of pairs onto their second elements.

The DFA for second(A) is essentially the same as the DFA for first(A), except that

guesses are made for the first element instead of the second. The DFA is shown in

Figure 4.8.

(Q′, s′, δ′, F′, Σ) = second((Q, s, δ, F, Σ))

where

Q′ = 2^Q
s′ = {s}
δ′({q1, q2, ..., qn}, σ) = ∪ᵢ {δ(q′, 2σ) | q′ = δ(qi, w) and w ∈ (1Σ)*}
F′ = {q ∈ 2^Q | q ∩ F ≠ ∅}

Figure 4.8: DFA for Projection (Second).


4.1.2.6 Transitive Closure

The preference set, P, is assumed to be transitively closed. However, this condition

is not preserved by integration (union). An explicit operator is therefore defined for

the purpose of re-establishing the transitive closure of a preference set.

Let tc(P) be the transitive closure of P. The regular grammar for tc(P) accepts (x, y) if and only if (x, y) ∈ P, or there is a finite sequence of one or more strings, z1, z2, ..., zn, such that (x, z1) ∈ P and (z1, z2) ∈ P and (z2, z3) ∈ P and ... and (zn, y) ∈ P. That is, (x, y) is in the transitive closure of P either if (x, y) is in P, or

if there is a finite sequence of relations in P connecting x to y.

There are a number of algorithms for computing the transitive closure of a binary

relation defined over a finite set of objects (e.g., Warshall's algorithm [Warshall,

1962]). Every preference set, P, expressible as a regular grammar can be mapped

onto a finite relation. The finite relation can be transitively closed using one of the

standard algorithms, and the result mapped back into a regular grammar for the

transitive closure of P. This is simplest to see when P is expressed as a regular grammar under the concatenation mapping. Under this mapping, the regular grammar for P can be expressed as A1B1 ∪ A2B2 ∪ ... ∪ AkBk, where Ai and Bi are regular subsets of the hypothesis space, H (see Section 4.1.2.4). Every hypothesis in Ai is less preferred than every hypothesis in Bi. That is, Ai < Bi. P can be mapped onto a finite relation, R, over the Ai and Bi sets. Each of these sets is a single object, Xi, in R. The finite relation is then transitively closed. For every pair of relations X1 < X2 and X2 < X3 in R, the relation X1 < X3 is added to R if it is not already there. At most m such relations are added, where m < k². The set tc(P) is P ∪ R1 ∪ R2 ∪ ... ∪ Rm, where Ri is one of the relations added to R. Each relation is of the form Xi < Xj, and is represented by the regular grammar XiXj.

If P is represented with the shuffle mapping, then P is a finite union of the form shuffle(X1, Y1) ∪ shuffle(X2, Y2) ∪ ... ∪ shuffle(Xk, Yk), where Xi and Yi are regular subsets of the hypothesis space. Every hypothesis in Xi is less preferred than every hypothesis in Yi. The transitive closure of P is computed just as it was for the concatenation mapping. A relation R is defined where (Xi, Yi) is in R if shuffle(Xi, Yi) is in the finite union that specifies P. R is transitively closed using any standard algorithm (e.g., Warshall's algorithm). The set tc(P) is P ∪ R1 ∪ R2 ∪ ... ∪ Rm, where Ri is one of the relations added to R. Each of these relations is of the form A < B, where A and B are elements of {X1, X2, ..., Xk, Y1, Y2, ..., Yk}. The relation A < B is represented by the regular grammar shuffle(A, B).
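To make the finite-relation step concrete, the sketch below (hypothetical names; not part of RS-KII) transitively closes a finite relation given as a set of ordered pairs, in the spirit of Warshall's algorithm. Each object stands for one of the Ai/Bi (or Xi/Yi) sets above, and each pair added by the closure corresponds to one additional regular grammar AiBj or shuffle(Xi, Yj) in tc(P).

    def transitive_closure(pairs):
        """Transitively close a finite binary relation.  `pairs` is a set of
        (a, b) tuples meaning a < b; the result adds every pair implied by
        transitivity (Warshall-style: each object is tried as an intermediate)."""
        closure = set(pairs)
        objects = {x for pair in pairs for x in pair}
        for k in objects:
            for a in objects:
                for b in objects:
                    if (a, k) in closure and (k, b) in closure:
                        closure.add((a, b))
        return closure

    # Example: from A1 < B1 and B1 < B2, the closure also contains A1 < B2.
    tc = transitive_closure({("A1", "B1"), ("B1", "B2")})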

4.1.3 DFA Implementation

A DFA is implemented in RS-KII in terms of the following components.

• A start state, s.

• An alphabet of input symbols, Σ.

• A next-state function, δ : (Q × Σ) → Q.

• A final-state function, F : Q → Boolean.

• A dead-state function, Dead : Q → {t, f, ?}.

• An accept-all-state function, AcceptAll : Q → {t, f, ?}.

With the exception of the last two functions, these components correspond to obvious elements of the standard ⟨Q, s, δ, F, Σ⟩ 5-tuple that specifies a DFA. The last two functions allow for fast identification of at least some dead states and accept-all states, respectively. A dead state is a reject (non-accept) state from which there is no path to an accept state. An accept-all state is an accept state from which every path leads to an accept state (i.e., no path leads to a reject state). The dead-state and accept-all-state functions can only identify some dead states and accept-all states. If a state cannot be identified by the appropriate function, the function returns "?" or "unknown".
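As a rough illustration of this interface (a minimal sketch under assumed names, not RS-KII's actual code), the six components might be packaged as follows, with None standing for the "unknown" answer of the Dead and AcceptAll functions.

    from dataclasses import dataclass
    from typing import Any, Callable, Optional, Set

    @dataclass
    class DFA:
        start: Any                                   # start state s
        alphabet: Set[str]                           # input symbols, Sigma
        move: Callable[[Any, str], Any]              # next-state function delta(q, a)
        is_final: Callable[[Any], bool]              # final-state function F(q)
        dead: Callable[[Any], Optional[bool]]        # Dead(q): True / False / None ("unknown")
        accept_all: Callable[[Any], Optional[bool]]  # AcceptAll(q): True / False / None

        def accepts(self, string: str) -> bool:
            """Run the DFA by repeatedly applying the move function; the set of
            states is never materialized."""
            q = self.start
            for symbol in string:
                q = self.move(q, symbol)
            return self.is_final(q)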

Determining whether a state in a non-minimal DFA is a dead state or an accept-

all state can require an exhaustive search of the DFA. However, many set operations

preserve, or partially preserve, this information. These two functions provide a way

to save such information. The ability to identify dead-states and accept-all states

inexpensively is central to the solution-set enumeration algorithm, which is described

in Section 4.2.2.


Notably absent from the list of DFA components is Q, the set of states. States

are generated as needed by applying the next-state function to the current state.

In order to find a path through the DFA from start state to accept state, or to

determine whether a string is accepted by the DFA, only a few of the states in the

DFA are usually visited. The complexity of these searches is proportional to the

number of states visited. However, if Q is represented extensionally, then the time

needed to create the DFA far exceeds the cost of searching it.

By not representing the states explicitly, the time and space complexity of search-

ing the DFA can be kept proportional to the number of states visited. In order for

this to work, the time and space complexity of the move function and other components of the DFA must be significantly less than O(|Q|). A DFA that meets these criteria is called an intensional DFA. If the time or space complexity of the components is proportional to O(|Q|), then the DFA is said to be extensional.

This occurs, for example, when the next-state function is implemented as an explicit

look-up table.

DFAs constructed from set operations on other DFAs can be represented inten-

sionally. Let R be the DFA resulting from a set operation over DFAs A and B. The

idea is to implement the next-state function of R in terms of the next-state func-

tions of A and B. The other components of R are implemented similarly. The space

complexity of R's components is O(1), and their time complexity is proportional to

that of A's components and B's components. DFAs represented this way are called

recursive, since they are defined "recursively" in terms of more primitive DFAs. All

of the DFAs resulting from set operations in RS-KII are represented as recursive

DFAs. Recursive DFAs are discussed further in Section 4.1.3.1.

The DFAs for the H, C, and P sets are called primitive DFAs. These DFAs

can be either intensional or extensional. Primitive DFAs are discussed further in

Section 4.1.3.2.

4.1.3.1 Recursive DFAs

Recursive DFAs are the results of set operations. A DFA resulting from an expression

of set operations is essentially an expression tree of DFAs, where the root is the

DFA representing the result of the expression, the nodes are DFAs resulting from


intermediate set operations, and the leaves are the primitive DFAs input to the

expression. The components of the DFA at each node are defined in terms of the

components of the DFAs at the node's child nodes. The DFA at each node requires

only enough space for pointers to its children, and definitions for the components.

This is essentially constant space for each DFA.

The total space for the expression is the sum of the space consumed by each of

the primitive DFAs at the leaves, plus a small constant amount of space for each

node in the tree. If the tree is balanced and binary, there are about as many internal

nodes as leaves. If there are n primitive DFAs, each consuming k units of space,

then the total space cost is O(nk). The root DFA could have as many as k^n states,

depending on the set operations in the expression tree, so considerable space is saved

by not representing the DFA's states explicitly.

DFAs represented as expression trees are called recursive DFAs. The components

of these DFAs are defined recursively in terms of the components of other DFAs.

Eventually, this regression must ground out in DFAs in which the components are

extensionally defined. These are called primitive DFAs. The C and P sets specifying

the integration of two (C, P) tuples are represented in RS-KII by recursive DFAs.

The DFAs for the C and P sets generated by translators are primitive DFAs, by

definition.

4.1.3.2 Primitive DFAs

Primitive DFAs are DFAs in which the components are implemented extensionally.

The start state is represented extensionally, as is the alphabet. In practice, all of

the translators for a given hypothesis space can be made to generate DFAs that

use the same alphabet. Thus the alphabet is stored once, and each of the primitive DFAs has a pointer to it. There are several ways to implement

the functions of a primitive DFA, but the most straightforward way is with explicit

lookup tables.

It is generally a good idea to reduce the space complexity of primitive DFAs, since

the complexity of recursive DFAs—such as the solution set—is proportional to the

space complexity of the primitive DFAs from which the recursive DFA is constructed.

The time complexities of the next-state and other functions are similarly related, so


it is also worth using implementations of these functions for primitive DFAs with as

low a time complexity as possible.

A simple lookup table can consume up to O(|Q||Σ|) units of space for the next-state function, and O(|Q|) units for each of the other functions. Table lookup is a constant-time operation, so the DFA's functions can all be executed in constant time. Compressing the lookup tables can reduce their space complexity, but may increase the time complexity of table lookup. Good implementations of compressed tables have lookup times of at most log(n), where n is the size of the table. In some DFAs, the next state can be computed from the current state and the input symbol in time proportional to the size of the state. This is an effectively constant time and space implementation as long as the size of the state is not a function of some relevant scale-up variable.

Using minimal DFAs is another way to keep the space complexity down without increasing the time complexity. Translators should be able to generate minimal DFAs directly. If it is not possible to generate a minimal DFA directly, then the translator can create a non-minimal DFA and then minimize it. However, the space savings from using a minimal DFA may not be worth the computational complexity of explicit minimization.

One way to dramatically reduce the space complexity of any primitive DFA in exchange for increased time complexity is to explicitly store an equivalent NDFA, and use the NDFA to simulate the DFA's functions, such as next-state and final-state. An NDFA can have exponentially fewer states than the equivalent DFA, and the DFA move function can be simulated from the NDFA move function in O(n) time, where n is the number of states in the NDFA [Aho et al., 1974]. Testing whether a state is final is also O(n), as are the other functions.
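A sketch of this simulation (illustrative only; it assumes the NDFA is available as a move function and an accepting predicate, with hypothetical names) shows why each simulated DFA operation costs time proportional to the number of NDFA states currently possible.

    def dfa_move_via_ndfa(dfa_state, symbol, ndfa_move):
        """One simulated DFA move: the DFA state is the set of NDFA states that
        are currently possible, and the move follows every NDFA alternative."""
        return frozenset(t for q in dfa_state for t in ndfa_move(q, symbol))

    def dfa_is_final_via_ndfa(dfa_state, ndfa_accepting):
        """The simulated DFA state is accepting iff some underlying NDFA state is."""
        return any(ndfa_accepting(q) for q in dfa_state)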

4.2 RS-KII Operator Implementations

Recall that the operations defined by KII are translation, integration, enumeration,

and queries. This section discusses RS-KII's implementation of the enumeration and

integration operations. Translator and query implementations are not discussed.

Translators depend on the hypothesis space and the knowledge being translated, so

the translators are provided by the user for each hypothesis space and knowledge


source rather than being a fixed part of RS-KII. Translators for specific induction

tasks are discussed in Chapter 5 and Chapter 6. The queries are implemented in

terms of the enumeration operator, as described in Section 2.2.4 of Chapter 2, and do

not require further discussion. The integration operator is discussed in Section 4.2.1,

and the enumeration operator is discussed in Section 4.2.2.

4.2.1 Integration

As discussed in Chapter 2, integration is defined as follows.

Integrate(⟨H, C1, P1⟩, ⟨H, C2, P2⟩) = ⟨H, C1 ∩ C2, P1 ∪ P2⟩     (4.1)

This requires intersection and union. Implementations of these set operations are

described below.

4.2.1.1 Intersection

The intersection of two DFAs, A and B, is implemented as a recursive DFA, as

shown in Figure 4.9. The components of this DFA are derived from the components

of A and B according to the definition of intersection given in Figure 4.2. Tuples

in the intersection are implemented as pairs of pointers to states in A and B. The

alphabet, Σ, is Σ_A ∩ Σ_B, the intersection of the alphabets for A and B. If Σ_A and Σ_B are different alphabets, then their intersection is computed and stored extensionally in the DFA for A∩B. If Σ_A and Σ_B are the same alphabet, then Σ_A ∩ Σ_B is just Σ_A (or equivalently, Σ_B). In this case, the DFA for A∩B just stores a pointer to Σ_A instead of storing an explicit copy. This is the most common case, since translators with the same domain can usually be made to output DFAs with the same alphabet.

The Dead function takes as input a state, (q1, q2), in A∩B, and determines whether or not the state is dead. A state is dead if there is no path from the state to an accept state. Determining this fact can require an exhaustive search of the states reachable from (q1, q2). However, there are cases where it can be quickly determined whether (q1, q2) is dead from information about q1 and q2. If the status of (q1, q2) can be quickly determined, then Dead((q1, q2)) returns true or false, appropriately.


⟨s1, δ1, F1, Dead1, AccAll1, Σ1⟩ ∩ ⟨s2, δ2, F2, Dead2, AccAll2, Σ2⟩ = ⟨s, δ, F, Dead, AcceptAll, Σ⟩
where

    s = (s1, s2)

    δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))

    F((q1, q2)) = F1(q1) ∧ F2(q2)

    Dead((q1, q2)) = true if Dead1(q1) = true or Dead2(q2) = true
                     else unknown

    AcceptAll((q1, q2)) = true if AccAll1(q1) = true and AccAll2(q2) = true
                          false if AccAll1(q1) = false or AccAll2(q2) = false
                          else unknown

    Σ = Σ1 ∩ Σ2        if Σ1 ≠ Σ2
        pointer to Σ1  if Σ1 = Σ2

Figure 4.9: Intersection Implementation.

Otherwise, it returns unknown, and it is up to the caller to decide whether to perform

the expensive search or to do without the information.

It can be quickly determined that a state, (q1, q2), in A∩B is dead if q1 is a known dead state in A or if q2 is a known dead state in B. This information can be obtained from the Dead functions for A and B, respectively. In all other cases, it is necessary to perform an exhaustive search of the states reachable from (q1, q2) to determine whether it is a dead state. If q1 and q2 are known to be non-dead states in A and B, then there is at least one path from q1 to an accept state in A, and at least one path from q2 to an accept state in B. However, if there is no path in common between q1 and q2, then the state (q1, q2) is a dead state in A∩B. The Dead function for A∩B returns unknown in this case. It also returns unknown if A's Dead function returns unknown for q1, or if B's Dead function returns unknown for q2.

The AcceptAll function is essentially the dual of the Dead function. A state, (q1, q2), in A∩B, is an accept-all state if every path from (q1, q2) leads to an accept state ((q1, q2) need not be an accept state, though, unless there is a path from the state to itself). The function AcceptAll((q1, q2)) returns true if it can be quickly determined that (q1, q2) is an accept-all state, returns false if it can be quickly determined that it is not an accept-all state, and unknown otherwise.

The state (q1, q2) is an accept-all state when q1 is an accept-all state in A and q2 is an accept-all state in B. The state is not an accept-all state if either q1 or q2 is not an accept-all state in its respective DFA. Both of these cases can be quickly determined from the AcceptAll functions of A and B, respectively. The accept-all status of (q1, q2) is unknown in A∩B only if the accept-all status of q1 is unknown in A or the accept-all status of q2 is unknown in B.

The AcceptAll function "preserves" accept-all state information from A and B, in the sense that if the accept-all status of every state in A and B is known (i.e., either true or false), then the accept-all status of every state in A∩B is also known. That is, the AcceptAll function for A∩B never returns unknown for any state in A∩B unless the AcceptAll function for A or B returns unknown for some state in A or B, respectively.

The Dead function for A∩B does not preserve dead state information from A and B. Given two DFAs, A and B, in which all of the dead states are known, A∩B can contain states for which the Dead function returns unknown, as was described above.
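The rules of Figure 4.9 translate almost directly into code. The sketch below is illustrative only; it reuses the hypothetical DFA record sketched in Section 4.1.3 and builds the intersection lazily, so no product state is ever enumerated, and the Dead and AcceptAll rules return None ("unknown") when they give no verdict.

    def intersect(A, B):
        """Recursive DFA for A ∩ B.  States are pairs (q1, q2); every component
        is defined in terms of A's and B's components, as in Figure 4.9."""
        def dead(q):
            q1, q2 = q
            if A.dead(q1) is True or B.dead(q2) is True:
                return True
            return None                          # cannot tell cheaply

        def accept_all(q):
            q1, q2 = q
            a1, a2 = A.accept_all(q1), B.accept_all(q2)
            if a1 is True and a2 is True:
                return True
            if a1 is False or a2 is False:
                return False
            return None

        return DFA(
            start=(A.start, B.start),
            # Share the alphabet by "pointer" when it is literally the same object.
            alphabet=A.alphabet if A.alphabet is B.alphabet
                     else set(A.alphabet) & set(B.alphabet),
            move=lambda q, a: (A.move(q[0], a), B.move(q[1], a)),
            is_final=lambda q: A.is_final(q[0]) and B.is_final(q[1]),
            dead=dead,
            accept_all=accept_all,
        )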

4.2.1.2 Union

The union of two DFAs, A and B, is implemented similarly to intersection. The

main differences are that the components of the union are derived from A and B

according to the definition of union in Figure 4.1, and the alphabet Σ is Σ_A ∪ Σ_B instead of Σ_A ∩ Σ_B. If Σ_A and Σ_B are the same alphabet, as is usually the case, then Σ is just a pointer to one of these alphabets. Otherwise, Σ is a pointer to a new alphabet containing the symbols in both Σ_A and Σ_B. The implementation of union

is shown in Figure 4.10.

A state in A∪B is of the form (q1, q2), where q1 is a state in A and q2 is a state in B. The function Dead((q1, q2)) returns true or false, accordingly, when it can be quickly determined whether (q1, q2) is a dead state in A∪B. When this determination cannot be made quickly, the function returns unknown. A state, (q1, q2), in A∪B is


⟨s1, δ1, F1, Dead1, AccAll1, Σ1⟩ ∪ ⟨s2, δ2, F2, Dead2, AccAll2, Σ2⟩ = ⟨s, δ, F, Dead, AcceptAll, Σ⟩
where

    s = (s1, s2)

    δ((q1, q2), σ) = (δ1(q1, σ), δ2(q2, σ))

    F((q1, q2)) = F1(q1) ∨ F2(q2)

    Dead((q1, q2)) = true if Dead1(q1) = true and Dead2(q2) = true
                     false if Dead1(q1) = false or Dead2(q2) = false
                     else unknown

    AcceptAll((q1, q2)) = true if AccAll1(q1) = true or AccAll2(q2) = true
                          else unknown

    Σ = Σ1 ∪ Σ2        if Σ1 ≠ Σ2
        pointer to Σ1  if Σ1 = Σ2

Figure 4.10: Union Implementation.

a dead state if and only if there is no path from (q1, q2) to an accept state in A∪B. This only occurs when there is no path from q1 to an accept state in A, and no path from q2 to an accept state in B. Otherwise, (q1, q2) is not a dead state. Both of these conditions can be quickly determined from the Dead functions of A and B, respectively. The Dead function for A∪B only returns unknown for (q1, q2) when the dead state status of one of q1 and q2 is unknown, and the status of the other state is either unknown or false.

The Dead state function for the union of two DFAs, A and B, preserves the dead

state information of A and B, in that the dead state status of every state in A∪B

is known if the dead state status of every state in A and B is known. Union is said

to preserve dead state information.

A state (q1, q2) in A∪B is an accept-all state if every path from (q1, q2) leads to an accept state. The function AcceptAll returns true or false, accordingly, when this determination can be made cheaply, and unknown when it cannot. It can be cheaply determined that (q1, q2) is an accept-all state if q1 is an accept-all state in A or if q2 is an accept-all state in B. Otherwise, the determination cannot be made cheaply, and AcceptAll((q1, q2)) returns unknown.

If neither q1 nor q2 is an accept-all state in A and B, respectively, then it is still possible that (q1, q2) is an accept-all state in A∪B. This occurs when the paths that do not lead to accept states from q1 in A do lead to accept states in B from q2, and vice-versa. In general, ascertaining this fact requires an exhaustive search of the states reachable from (q1, q2) in A∪B to determine whether they are all accept states. The function AcceptAll((q1, q2)) returns unknown in this case. It also returns unknown if the accept-all status of one of q1 and q2 is unknown, and the accept-all status of the other state is either false or unknown.

The AcceptAll function for the union of two DFAs, A and B, does not preserve

the accept-all information of A and B. Even if the accept-all status of every state

in A and B is known, there can still be states in A∪B for which AcceptAll returns

unknown. Union does not preserve accept-all information.
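Only the Dead and AcceptAll rules differ from intersection. A corresponding sketch of the union's two three-valued functions (same hypothetical DFA record and caveats as before) is:

    def union_dead(A, B, q):
        """Dead rule for A ∪ B: dead only if both halves are dead; known non-dead
        if either half is known non-dead (Figure 4.10)."""
        q1, q2 = q
        d1, d2 = A.dead(q1), B.dead(q2)
        if d1 is True and d2 is True:
            return True
        if d1 is False or d2 is False:
            return False
        return None

    def union_accept_all(A, B, q):
        """AcceptAll rule for A ∪ B: known accept-all if either half is accept-all;
        a negative answer can never be given cheaply, so otherwise return None."""
        q1, q2 = q
        if A.accept_all(q1) is True or B.accept_all(q2) is True:
            return True
        return None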

4.2.1.3 Minimizing after Integration

The integration of several (H, C, P) tuples is accomplished by intersecting all of

the C sets and computing the union of the P sets. The result is a single tuple. The

intersection and union operations in RS-KII return recursive DFAs. These opera-

tions are constant time, and the resulting DFAs have a very low space complexity.

However, the resulting DFAs have not been minimized, which means they can have

far more states than the corresponding minimal DFAs, and that they may have

states for which the Dead and AcceptAll functions return unknown.

The combination of extraneous states in C and P, and the presence of dead

states that cannot be identified by the DeadState function can increase the cost of

enumerating a hypothesis from the solution set by introducing similar states into

the DFA for the solution set. A hypothesis in the solution set can be generated

by finding a path from the start state of the DFA to one of its accept states—a

simple graph search. The presence of dead states that cannot be detected by the

DFA's DeadState function can cause backtracking. All of the states reachable from

the unidentified dead state must be visited before it is obvious that the state cannot


reach an accept state, and is therefore dead. If the DFA is non-minimal, there may

be many such states reachable from each unidentifiable dead state.

If the DFA's DeadState function can detect every dead state in the DFA, then the

search will never backtrack, and the extraneous states are not detrimental. Back-

tracking only occurs when there is no path from the current state to any of the accept

states—that is, when the current state is a dead state. If the DFA's DeadState func-

tion can always determine whether a given state is dead, then the states reached

by each edge out of the current state can be tested by this function. One of the

non-dead states is visited next. The current state always has at least one such child

state. If it did not, then the current state would also be dead, and it would never

have been visited in the first place.

The DFA for C is the intersection of several DFAs, and since intersection does

not preserve dead state information, the DFA may have several states for which the

DeadState function returns unknown. The problem is that a state, (q1, q2), in the intersection is dead if q1 or q2 is dead in its respective DFA, or if neither q1 nor q2 is dead, but there is no path (string) from q1 to an accept state in the first DFA that is also a path from q2 to an accept state in the second DFA. The first case is easy to detect, but the second requires an exhaustive search of the states reachable from q1 and q2, which can be expensive.

One way to reduce this cost is to minimize the DFAs for C and P after each

integration, or at least do an exhaustive search of the states in C to identify all of

the dead states. However, minimization takes time proportional to O(|Σ||Q|²), and identifying dead states takes time proportional to O(|Q|). This cost generally negates

any benefit from storing the DFAs intensionally. The motivation for representing

DFAs intensionally is that only a few of the states are visited while searching the

DFA for a path from the start state to an accept state. Only in the worst case are

all of the states visited. By minimizing the DFA, or by identifying the dead states,

the cost is proportional to the number of states in the average case as well as in the

worst case. Therefore, RS-KII does not minimize after integration, or try to identify

dead states in C. It is generally better to identify and remove only those dead states

encountered during the search, since this usually involves only some of the states in

the DFA.


Although identifying and removing dead states from C after integration is not

generally cost-effective, there is at least one case where it can be beneficial. If

⟨Q, s, δ, F, Σ⟩ is the intersection of ⟨Q1, s1, δ1, F1, Σ1⟩ and ⟨Q2, s2, δ2, F2, Σ2⟩, then the size of Q is at most |Q1||Q2|. However, many of the states in Q may be dead states. If after removing these states, the size of Q is bounded by the size of the larger of the two original DFAs, then the cost of intersecting and removing dead states from k such DFAs is only k|Q*|², where |Q*| is the number of states in the

largest of the k DFAs. Removing dead states after each intersection can be much

cheaper in this case than removing them after all of the DFAs have been intersected.

RS-KII does not have an explicit operation for removing dead states, but the

enumeration and query operations do identify any dead states they happen to en-

counter. Applying a query, such as Empty, after each (H, C, P) tuple is integrated

will identify and remove many of the dead states in C and P. This approach is used

in the RS-KII emulation of the Candidate Elimination algorithm [Mitchell, 1982] in

Section 6.4 in order to achieve benefits similar to those described in the previous

paragraph.

4.2.2 Enumeration

The enumeration operator returns hypotheses from the deductive closure of (C, P).

It is used both to select the induced hypothesis and to implement solution-set queries.

Enumerate((C, P), A, n) returns a list of n hypotheses that are both in the solution

set of (C, P) and in the regular set A. If there are only m < n hypotheses in the

intersection of A and the solution set, Enumerate returns all m hypotheses. The

A and n arguments are necessary to implement the queries. In order to induce

a hypothesis, it is only necessary to select a single hypothesis from the deductive

closure. This can be done by setting n = 1 and A = Σ*.

The deductive closure of (C,P) consists of the most preferred hypotheses in

C, according to the preference relation P. Specifically, this is the set shown in

Equation 4.2.

C − first((C × C) ∩ P)     (4.2)

In this equation, P is assumed to be transitively closed. However, P is the union

of several DFAs, and transitive closure is not preserved by union (see Section 2.2.3).


The first action of Enumerate is to reestablish the transitive closure of P by applying

the TransitiveClosure operator (see Section 4.1.2.6). In the remainder of this section,

P is assumed to be transitively closed unless stated otherwise.

One way to enumerate the solution set is to compute the DFA for the solution

set, and generate strings from this DFA using a graph search. The solution set can

be expressed as a DFA since regular grammars are closed under all of the operations

in Equation 4.2.

Although RS-KII could compute the solution set directly using the set opera-

tions defined in Section 4.1.2, this option is not used for efficiency reasons. The

DFAs resulting from the set operations are non-minimal, possibly containing a large

number of dead states. When a graph search enters a region of dead states, it must

eventually backtrack, since there is no way to reach an accept state from such a

region. In the worst case, all the states in the region must be visited before the

search discovers that it has to backtrack. The problem is that information about

which states are dead is not preserved by intersection, and the only way to tell that

a state is dead is to visit all of the states reachable from the state, and determine

that none of them are accept states.

The amount of backtracking can be reduced by using certain regularities in the

structure of the DFA to increase the amount of dead-state information available

to the search. For example, P is a partial order, and therefore irreflexive, anti-

symmetric, and transitive. The solution set equation also contains a C×C term.

This introduces a certain amount of symmetry into the DFA. These regularities can

be easily exploited by a more sophisticated search algorithm, but are not captured

in the DFA in such a way that a graph search can easily make use of them. These

regularities are meta-information true of all solution sets.

For this reason, RS-KII enumerates the solution set using a modified branch-

and-bound search, into which the regularities mentioned above are integrated. The

solution set consists of the undominated hypotheses in C, where the domination

relation is determined by P. This maps very well onto branch-and-bound, which

also finds the undominated elements of a set. However, a few modifications are

required.

The basic branch-and-bound algorithm takes a set of elements, X, and an evalu-

ation function, f : X → ℝ, that assigns a real number evaluation to each hypothesis.


The algorithm returns an element of the set {x ∈ X | ∀y ∈ X, f(x) ≮ f(y)}. This

algorithm is described in Section 4.2.2.1.

Two modifications to the basic branch-and-bound algorithm are required in order to implement the enumeration operator. Enumerate((C, P), A, n) returns n hypotheses from the set {x ∈ X | ∀y ∈ X, x ⊀_P y} ∩ A, where ≺_P is the partial ordering imposed by P. The first modification is to evaluate hypotheses according to the partial ordering ≺_P instead of the total ordering imposed by f. This modification, described in Section 4.2.2.2, returns a single element of the set {x ∈ X | ∀y ∈ X, x ⊀_P y}, which is the deductive closure of (C, P). This is sufficient to implement Enumerate when n = 1 and A = Σ* (i.e., A is not used).

In order to implement Enumerate((C,P),A,n) for arbitrary values of n and A,

the modified branch-and-bound algorithm must be extended to find n elements that

are in both A and the deductive closure of (C, P). These modifications are described

in Section 4.2.2.3.

In order to perform efficiently, the modified branch-and-bound algorithm must

implement certain data structures efficiently. These issues are discussed briefly in

Section 4.2.2.4.

4.2.2.1 The Basic Branch-and-Bound Algorithm

Branch-and-Bound takes a set of elements, X, and an evaluation function f, and finds an undominated element of X. An element x is dominated if and only if there is an element y in X such that x <x y, where <x is a domination relation over elements of X. The domination relation <x can be any ordering relation on X. In the basic branch-and-bound algorithm, <x is a total ordering on X derived from the evaluation function f, where x <x y if and only if f(x) < f(y).

The idea behind branch-and-bound is to find an undominated element of X by

pruning regions of the search space, X, in which every element in the region is

dominated by the best element of another region. This is the "bound" part of the

algorithm. It is followed by the "branch", which splits the un-pruned regions into

several sub-regions. This splitting and pruning process continues until all but one

of the regions have been pruned, and the remaining region contains only a single

element, x. This element is an undominated element of X.


A version of the basic branch-and-bound algorithm based on Kumar's formulation

[Kumar and Laveen, 1983, Kumar, 1992] is shown in Figure 4.11. The search space

is a set of elements, X, and regions of the search space are subsets of X. The

algorithm maintains a collection of subsets, L, which initially contains only X. In

each iteration, a subset Xs is selected from this collection by the Select function, and

split into smaller subsets with the Split function. Dominated subsets are then pruned

from the collection. Subset Xi is dominated by subset Xj if for every hypothesis in

Xi there exists a preferable hypothesis in Xj. Formally, Xi is dominated by Xj if

and only if (VheXi)(Bh'eXj)h <x h'. This process continues until there is only one

subset left in the collection, and this subset contains exactly one element, x. This

element is returned as the undominated element found by the search.

BranchAndBound(X, D<) returns x ∈ X s.t. ∀y ∈ X, x ⊀ y
    X is a set of elements.
    D< is a domination relation on subsets of X derived from <.
    < is a total ordering over elements of X.
BEGIN
    L ← {X}
    WHILE L is not empty DO
        IF L is a singleton, {Xi}, and Xi is a singleton, {x}, THEN
            RETURN x
        ELSE
            Xs ← Select(L)
            Replace Xs in L with Split(Xs).
            Dominated ← {Xi ∈ L | ∃Xj ∈ L (Xi D< Xj)}
            L ← L − Dominated
    END WHILE
    RETURN failure
END BranchAndBound.

Figure 4.11: Branch-and-Bound where D< is a Total Order.
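The figure translates directly into code. The sketch below is illustrative only: regions are represented as frozensets, the Select and Split choices are simple placeholders, domination is derived from an evaluation function f, and f is assumed to attain its maximum at a unique element (the total-order requirement of the figure).

    def branch_and_bound(X, f):
        """Basic branch-and-bound in the style of Figure 4.11, over a finite set X
        ranked by an evaluation function f.  Region R1 is dominated by R2 when
        every element of R1 scores strictly below some element of R2, i.e.
        max f(R1) < max f(R2)."""
        def dominated(R1, R2):
            return max(f(x) for x in R1) < max(f(y) for y in R2)

        def split(R):                        # Split: cut a region into two halves
            items = sorted(R, key=f)
            mid = len(items) // 2
            return [frozenset(items[:mid]), frozenset(items[mid:])]

        L = [frozenset(X)]
        while L:
            if len(L) == 1 and len(L[0]) == 1:
                return next(iter(L[0]))      # the undominated element of X
            Xs = max(L, key=len)             # Select: split a largest region next
            L.remove(Xs)
            L.extend(r for r in split(Xs) if r)
            # Bound: prune every region dominated by some other region in L.
            L = [Ri for Ri in L
                 if not any(dominated(Ri, Rj) for Rj in L if Rj is not Ri)]
        return None                          # "failure" (unreachable for non-empty X)

    # Example: the undominated (maximal) element of 0..9 under f(x) = -(x-3)^2 is 3.
    best = branch_and_bound(range(10), lambda x: -(x - 3) ** 2)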

The Domination Relation among Subsets. The domination relation among

subsets is derived from the domination relation among individual hypotheses, <x,

and is expressed as a binary relation, D<, where (Xi, Xj) is an element of D< if and only if subset Xi is dominated by subset Xj. This is written Xi D< Xj. If (Xi, Xj) is not an element of D<, then Xi is not dominated by Xj. This is written Xi D≮ Xj.

In order for branch-and-bound to be cost-effective, the cost of testing whether a

subset is dominated and should thus be pruned must be significantly cheaper than

the cost of the search avoided by pruning the subset. It may only be possible to

cheaply determine the domination relation between some pairs of subsets. For this reason, the basic branch-and-bound algorithm uses the incomplete but inexpensive domination relation, D̂<, instead of the complete but potentially expensive relation D<.

A pair of subsets, (Xi, Xj), is in D̂< if and only if Xi D< Xj and this relation can be determined by an inexpensive test. If Xi D< Xj, but this fact cannot be determined inexpensively, Xi D̂< Xj will be false. Xi D̂< Xj is also false if Xi is not dominated by Xj. It is always possible to cheaply determine the domination relation between singleton sets, {x} and {y}, since this is a trivial matter of determining whether x <x y, and <x is defined between every pair of elements in X.

Even though D̂< is incomplete, this does not invalidate the branch-and-bound algorithm. Eventually, the subsets in the collection, L, will be split to the point where the subsets in L can be compared by D̂<. This requires that the Split function be defined so that X is eventually split into a finite number of subsets that can be compared by D̂<. If X is finite, then the branch-and-bound algorithm is always guaranteed to halt, since in the worst case, X will be split into a finite number of singleton subsets, and singleton subsets can always be compared by D̂<.

Total and Partial Orderings. The basic branch and bound algorithm assumes

that D< is a total ordering over the subsets of X. When D< is a total order, there

is at most one dominant subset among any collection of subsets. This guarantees

that after splitting X into subsets that can be compared by D<, all but one subset

will be pruned from the collection. The remaining subset can be split further until

a singleton subset is identified that dominates the remaining subsets. The single

element in this set is returned as the undominated element of X. If D< is a partial

order, then in any collection of subsets there may be more than one undominated

subset, in which case the algorithm will not halt.


The relation D< is derived from <x. If <x is a total order, and Split(Xs) par-

titions Xs into disjoint subsets, then D< is also a total order. Given any set of

elements ordered by <x, there is exactly one maximal element. Therefore, between

two disjoint subsets, Xi and Xj, one subset is guaranteed to contain the maximal

element of Xi ∪ Xj, and this subset dominates the other. If the subsets are not disjoint, then both Xi and Xj may contain the maximal element, in which case neither

dominates the other. However, this is somewhat irrelevant since the dominant el-

ement is the same in both subsets. One subset can be arbitrarily selected as the

dominant one without fear of pruning away the dominant element.

If <x is a partial ordering, then so is D<. For example, let a and b be the undominated elements of Xi, and let y and z be the undominated elements of Xj. If a <x y and z <x b, but there is no relation between a and z nor between b and y, then Xi does not dominate Xj, nor does Xj dominate Xi. There is no domination

relation between Xi and Xj.

4.2.2.2 Branch-and-Bound with Partially Ordered Hypotheses

The deductive closure of (C, P) consists of the undominated hypotheses of C, ac-

cording to P, the partially ordered dominance relation over individual hypotheses.

The basic branch-and-bound algorithm finds an undominated element of a set X

according to a totally ordered dominance relation, <x. In order to enumerate an

element of the deductive closure of (C,P), the basic branch-and-bound algorithm

must be modified to accept a partially ordered dominance relation over the individual

hypotheses instead of requiring a totally ordered relation.

The problem with using a partial order in the basic branch-and-bound algorithm

is that the termination condition assumes the dominance relation <x is a total

order. The basic algorithm terminates when the collection contains only a single

subset, and this subset is a singleton. In a total ordering, there is at most one

undominated element. Eventually, all of the other elements will be pruned leaving

only the undominated element in the collection. In a partial ordering, there can be

several undominated elements. Eventually, all of the dominated elements will be

pruned leaving only the undominated elements, but since there can be more than

one undominated element, the algorithm may not halt.


One solution is to modify the termination condition so that the algorithm halts

as soon as one of the subsets in the collection contains nothing but undominated

elements. An arbitrary element of this subset is returned by the search as the

undominated element. Other subsets in the collection may also contain undominated

elements, but since only one element is needed, the termination condition described

above is sufficient. One advantage of this approach is that it can be easily extended

to return multiple hypotheses, as will be seen in Section 4.2.2.3.

A subset Xi in the collection, L, contains only undominated elements if Xi is undominated by all of the other subsets in L, and if no hypothesis in Xi dominates any other hypothesis in Xi. The first condition is true if (∀Xj ∈ L, Xj ≠ Xi) Xi D≮ Xj. The second condition is true if Xi D≮ Xi. That is, no hypothesis in Xi dominates any other hypothesis in Xi. These two conditions can be combined into one test, namely (∀Xj ∈ L) Xi D≮ Xj. If Xi contains only undominated hypotheses, then any

one of them can be returned by the search as the undominated element.

In order to evaluate this termination test, it must be possible to determine whether or not (Xs, Xj) ∈ D≮, where Xs is the selected subset, and Xj is an element of the collection. If the test is true, Xs contains only undominated hypotheses. Unfortunately, the complete dominance relation, D<, is not available to the branch-and-bound algorithm. Only the less expensive, but incomplete, relation D̂< is available. Since this relation is incomplete, (Xi, Xj) ∉ D̂< could mean either that (Xi, Xj) ∉ D<, or that the domination relation between Xi and Xj cannot be determined cheaply. In the latter case, (Xi, Xj) may or may not be a member of D<. In order to discriminate between these two cases, the relation D̂≮ is defined.

Xi D̂≮ Xj is true if and only if Xi D≮ Xj and this relation can be determined cheaply for Xi and Xj. The relation Xi D̂≮ Xj is false either if the dominance relation between Xi and Xj cannot be determined cheaply, or if Xi D< Xj. The two cases cannot be distinguished. The D̂≮ relation requires an inexpensive test for determining when one subset does not dominate another.

The modified algorithm, BranchAndBound-2, is shown in Figure 4.12. It uses the D̂≮ relation to determine when one of the subsets in the collection contains only undominated elements. An element of this subset is selected arbitrarily and returned as the undominated element found by the search. The relation D̂< is used to prune dominated subsets. The relations D̂< and D̂≮ are derived from the domination relation <x over individual hypotheses. The relation <x can be a partial order.

BranchAndBound-2(X, D̂<, D̂≮) returns x ∈ {x ∈ X | ∀y ∈ X, x ⊀ y}
    X is a set of elements
    D̂< is a domination relation on subsets of X derived from <
    D̂≮ is a not-dominated relation on subsets of X derived from <
    < is a partial ordering over the elements in X
BEGIN
    L ← {X}
    WHILE L is not empty DO
        Xs ← Select(L)
        IF (Xs D̂≮ Xi) for every Xi in L and (Xs D̂≮ Xs) THEN
            RETURN any element of Xs
        ELSE
            Replace Xs in L with Split(Xs).
            Dominated ← {Xi ∈ L | ∃Xj ∈ L (Xi D̂< Xj)}
            L ← L − Dominated
    END WHILE
    RETURN failure
END BranchAndBound-2.

Figure 4.12: Branch-and-Bound where <x is a Partial Order.

Given appropriate arguments, BranchAndBound-2 can find a single element of the deductive closure of (C, P), which is sufficient to implement the function Enumerate((C, P), Σ*, 1). The arguments passed to BranchAndBound-2 in order to implement Enumerate((C, P), Σ*, 1) are shown in Figure 4.13, and described in detail below.

The arguments to BranchAndBound-2 consist of X, and the dominance relations D̂< and D̂≮ over subsets of X, as derived from <x. The Split and Select functions are also parameters to BranchAndBound-2, albeit implicit ones. The algorithm BranchAndBound-2 finds a single element of the set {x ∈ X | ∀y ∈ X (x ⊀ y)}. In order to implement Enumerate((C, P), Σ*, 1), the arguments must be set so that BranchAndBound-2 finds a single element of the deductive closure of (C, P), namely {x ∈ C | ∀y ∈ C (x ⊀_P y)}.


Enumerate((C, P), Σ*, 1) = BranchAndBound-2(C, D̂<, D̂≮) where

• A subset of C is denoted (w, q) where w ∈ Σ* and q = δ*_C(s_C, w). C itself is (ε, s_C), where ε is the empty string and s_C is the start state of C.

• (w1, q1) D̂≮ (w2, q2) iff ((w1, q1) × (w2, q2)) ∩ P = ∅, where
      ((w1, q1) × (w2, q2)) ∩ P = ∅ if
          Dead_C(q1) = true or Dead_C(q2) = true or
          (q = δ*_P(s_P, shuffle(w1, w2)) and Dead_P(q) = true)

• (w1, q1) D̂< (w2, q2) iff ((w1, q1) × (w2, q2)) ⊆ P and (w2, q2) ≠ ∅, where
      ((w1, q1) × (w2, q2)) ⊆ P if
          q = δ*_P(s_P, shuffle(w1, w2)) and AcceptAll_P(q) = true
      (w2, q2) ≠ ∅ if Dead_C(q2) = false

• Split((w, q)) = {(w·σ, q′) | σ ∈ Σ and q′ = δ_C(q, σ)}

• Select(L) returns Xs ∈ L s.t. (∀Xi ∈ L) (Xs, Xi) ∉ D′<, where
      (w1, q1) D′< (w2, q2) iff q = δ*_P(s_P, shuffle(w1, w2)) and AcceptAll_P(q) = true

Figure 4.13: Parameters to BranchAndBound-2.

Specifying Subsets. The set of elements, X, is set to C. The relations D̂< and D̂≮, and the Split function all depend on the way in which subsets of X are represented. There are several ways to partition C into subsets, but one that seems to work well is to specify a subset by the tuple (w, q), where w is a string and q is the state in C reached on input w from the start state. This subset contains all strings in C of the form ww′, where w′ is a string leading from q to an accept state. Since C is a DFA, there is at most one state, q, reached by a given input, so (w, q) contains all strings in C that begin with w.

The notation (w, q) is in fact shorthand for the concatenation of two DFAs, one that recognizes w, and one that recognizes all strings w′ such that δ*_C(q, w′) reaches an accept state in C. The latter DFA can be derived from the DFA for C by simply making q the start state. Because of this simple derivation, all of the subsets of C can share the DFA for C, so new DFAs do not need to be created for each subset.

Subset (w, q) is a singleton if and only if q is an accept state and every path

from q leads to a dead state. That is, w is in C, but no extension of w is in C.

Subset (w, q) is empty if and only if q is a dead state in C, since no extension of w

is recognized by C.

The Split Function. The Split function takes a subset of C, (w, q), and partitions

it into subsets by extending w with each of the symbols in the alphabet, Σ:

    Split((w, q)) = {(w·σ, q′) | σ ∈ Σ and q′ = δ_C(q, σ)}

Alternately, w could be extended by strings instead of by individual symbols.

One possibility is to extend w by a string w' such that ww' is a hypothesis in H,

though perhaps not one recognized by C. In this case a subset (w, q) contains all hypotheses that are extensions of hypothesis w, and also recognized by C. This version of the split function would be defined as follows:

    Split((w, q)) = {(w·w′, q′) | q′ = δ*_C(q, w′) and δ*_H(s_H, ww′) ∈ F_H}
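A sketch of this subset representation (hypothetical code, assuming the DFA record from Section 4.1.3 for C, with single-character symbols) keeps only the prefix w and the state q it reaches, splits by extending the prefix with each alphabet symbol, and tests emptiness only when the Dead function gives a cheap answer.

    def make_subset(C, w):
        """Build the pair (w, q) for DFA C: q is the state reached on prefix w."""
        q = C.start
        for symbol in w:
            q = C.move(q, symbol)
        return (w, q)

    def split_subset(C, subset):
        """Split((w, q)): extend w by every alphabet symbol, as in Figure 4.13."""
        w, q = subset
        return [(w + a, C.move(q, a)) for a in C.alphabet]

    def subset_known_empty(C, subset):
        """(w, q) is empty iff q is dead; True/False only when cheaply known."""
        _, q = subset
        return C.dead(q)          # True, False, or None ("unknown")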

The Dominance Relation. The dominance relations, D̂< and D̂≮, are incomplete versions of D< and D≮ that are defined only on subsets for which the dominance relation can be determined inexpensively. The complete dominance relation over subsets of the form (w, q) will be defined first, followed by inexpensive definitions for D̂< and D̂≮.

A subset (w1, q1) is dominated by subset (w2, q2) if and only if (∀h ∈ (w1, q1))(∃h′ ∈ (w2, q2)) (h, h′) ∈ P. That is, every hypothesis in (w1, q1) is less preferred than some hypothesis in (w2, q2), according to P.

This relation can be expressed in terms of set operations on P and the two subsets as follows. The set ((w1, q1) × (w2, q2)) ∩ P is {(x, y) | x < y and x ∈ (w1, q1) and y ∈ (w2, q2)}. Projecting this set onto its first elements yields the set of hypotheses in (w1, q1) that are dominated by one or more elements of (w2, q2). If the projection is exactly (w1, q1), then every hypothesis in (w1, q1) is dominated by elements of (w2, q2), so (w1, q1) D< (w2, q2). Otherwise, some elements of (w1, q1) are not dominated, so (w1, q1) D≮ (w2, q2). The relation D< is therefore defined as shown in Equation 4.3.

    (w1, q1) D< (w2, q2) iff first(((w1, q1) × (w2, q2)) ∩ P) = (w1, q1)     (4.3)

Approximating the Dominance Relation. This test is expensive to compute. A less expensive, though possibly overspecific, test is needed to define D̂<. As a first approximation, a sufficient but not necessary condition for (w1, q1) D< (w2, q2) is ((w1, q1) × (w2, q2)) ⊆ P. When this is true, every hypothesis in (w1, q1) is dominated by every hypothesis in (w2, q2). To this condition a test must be added to ensure that (w2, q2) is not empty, since D< does not allow a subset to be dominated by an empty set. The complete test is summarized in Equation 4.4.

    (w1, q1) D̂< (w2, q2) iff ((w1, q1) × (w2, q2)) ⊆ P and (w2, q2) ≠ ∅     (4.4)

This definition is a step in the right direction, but it is still too expensive to compute. The second condition of this test determines whether (w2, q2) is empty. This is equivalent to testing whether q2 is a dead state, which can be expensive if this information is not already known by the DFA. The first condition is also expensive. Specifically, it determines whether ((w1, q1) × (w2, q2)) is a subset of P, which is equivalent to testing whether ((w1, q1) × (w2, q2)) ∩ P̄ is empty, where P̄ is the complement of P. This is also an emptiness test, and therefore potentially expensive.

The two tests of Equation 4.4 are expensive to compute in general, but for some

subsets they can be computed cheaply. A sufficient condition for the first test,

((w1, q1) × (w2, q2)) ⊆ P, is δ*_P(s_P, shuffle(w1, w2)) = q and AcceptAll_P(q) = true. That is, the string shuffle(w1, w2) leads to an accept-all state in P, which means that every pair of strings in {(w1x, w2y) | x ∈ Σ* and y ∈ Σ*} is a member of P. This certainly includes all the strings in (w1, q1) × (w2, q2), so (w1, q1) × (w2, q2) is a subset of P.

Computing q = δ*_P(s_P, shuffle(w1, w2)) takes time proportional to the lengths of w1 and w2. Determining whether q is an accept-all state is accomplished by the AcceptAll function, which only returns true or false when the answer can be


computed cheaply, and returns unknown when it cannot. This test can only compare

subsets for which AcceptAll does not return unknown.

The second test in Equation 4.4 determines whether (w2, q2) is empty, i.e., whether q2 is a dead state. If it can be determined cheaply whether q2 is a dead state, then the function Dead_C(q2) returns true or false, according to whether or not the state is dead. Otherwise, the function returns unknown. This leads to the definition for D̂< shown in Equation 4.6.

    (w1, q1) D̂< (w2, q2) iff
        AcceptAll_P(δ*_P(s_P, shuffle(w1, w2))) = true and     (4.5)
        Dead_C(q2) = false     (4.6)

Defining the Not-Dominated Relation. The definition of D̂≮ is derived similarly. The complete and correct test for determining whether one subset is not dominated by another is shown in Equation 4.7. This is the complement of the dominance relation, D<.

    (w1, q1) D≮ (w2, q2) iff first(((w1, q1) × (w2, q2)) ∩ P) ≠ (w1, q1)     (4.7)

This relation is expensive to determine. It must be replaced by a less expensive, though possibly incomplete, relation. A sufficient condition for this relation is ((w1, q1) × (w2, q2)) ∩ P = ∅. When this condition is true, no hypothesis in (w1, q1) is dominated by any hypothesis in (w2, q2). The approximation of the not-dominated relation based on this test is shown in Equation 4.8.

    (w1, q1) D̂≮ (w2, q2) if ((w1, q1) × (w2, q2)) ∩ P = ∅     (4.8)

This test is still too expensive to compute, since it involves testing whether

the intersection of two sets is empty. However, a sufficient condition for this test

can be computed inexpensively. Namely, ((w1, q1) × (w2, q2)) ∩ P = ∅ if the string shuffle(w1, w2) leads to a dead state in P, or if either q1 or q2 is a dead state in

C. The Dead functions for C and P return unknown if it is expensive to determine

whether a given state is dead, and otherwise return true if it is dead, and false if


it is not. Fortunately, P is a union of several DFAs, so if the dead state information

is known for all of the states in these DFAs, then the Dead function for P never

returns unknown (see Section 4.2.1.2). The definition for D̂≮ is given formally in Equation 4.10.

    (w1, q1) D̂≮ (w2, q2) iff
        Dead_C(q1) = true or Dead_C(q2) = true or     (4.9)
        Dead_P(δ*_P(s_P, shuffle(w1, w2))) = true     (4.10)

The Selection Criteria. Because of the way Split and D< are defined, it is

possible for a subset, (w,q), in the collection to be empty. Subset (w,q) is empty

if q is a dead state. This is expensive to determine, so empty subsets are not

eliminated automatically by the Split function, nor are they identified as dominated

subsets by DK. Eventually, (w,q) will be split into child subsets, (wwi,qi) through

(wwk, qk), whose emptiness can be cheaply determined. In the worst case, q1 through

qk comprise all of the states reachable from q. At least one of these is a known dead

state, and this information can then be propagated to the other states. However, the

cost of splitting (w,q) to this point can be quite expensive since it is proportional

to the number of states that are reachable from q.

This expense can be mitigated by being intelligent about which subset is selected

in each iteration. Given a choice between (wi,qi) and (w2,q2), choose (w2,q2) if

shuffle(wi,w2) leads to an accept-all state in P, since this means that every string in

(w2,q2) is preferred to every string in (wi,qi). The subset (wi,qi) could be pruned

as dominated at this point, except that it is not known whether (w2,q2) is empty.

Recall that the domination relation D< does not allow a subset to be dominated

by an empty subset. If the same selection criteria is applied consistently, the child

subsets of (w2,q2) will also be selected over the subset (wi,qi). Eventually, (w2,q2)

will be split into enough child subsets that it can either be determined that all of

the children are empty, or that one of them is non-empty. In the former case, the

empty children are pruned from the collection, since all empty sets are dominated,

which frees up the selection criteria to select (wi,qi). In the latter case, there is at

least one non-empty child subset of (w2,q2). Call this subset (w*,q*). Since it is

known that this subset is not empty, and it is also known that every extension of w*


is preferred to every extension of wi, the subset (wi,qi) is dominated by (w*,q*),

and can be pruned from the collection.

Whether or not the selected subset eventually turns out to be empty, the selection

criteria described above is often cheaper and never worse than the inverse policy. If

(wi,qi) and its children were preferred over (w2,q2) in the above example, then the

search cost would be higher. Whether (wi,qi) is empty or not, if (w2,q2) is non-empty, then the child subsets of (wi,qi) are searched, followed by the children of (w2,q2); under the selection criterion described above, only the children of (w2,q2) would be searched. If

(wi,qi) were non-empty, and (w2,q2) were empty, then the children of both subsets

must be searched using either selection criteria. The costs are the same in this case.

Subsets are therefore selected from the collection according to the first policy, since it can reduce the amount of search required. The selection policy prefers subset (w2, q2) over (w1, q1) when shuffle(w1, w2) leads to an accept-all state in P. This is the same as the definition of D̂<, except that the emptiness test for (w2, q2) is omitted. Call this new relation D′<. The subsets in the collection are partially ordered according to D′<, and one of the subsets at the top of the order is selected. The relation D′< is defined formally in Equation 4.11, and the selection criterion is defined in Equation 4.12.

    (w1, q1) D′< (w2, q2) iff
        q = δ*_P(s_P, shuffle(w1, w2)) and AcceptAll_P(q) = true     (4.11)

    Select(L) returns Xs ∈ L s.t. (∀Xi ∈ L) (Xs, Xi) ∉ D′<     (4.12)

4.2.2.3 Fully Modified Branch-and-Bound Algorithm

The full implementation of Enumerate((C,P),A,n) must be able to enumerate up

to n hypotheses that are both in the deductive closure of (C,P) and in the set

A. This is accomplished with a few simple modifications, as shown in Figure 4.14.

If the selected subset, Xs, contains only undominated hypotheses, then instead of returning a single element of Xs as is done in BranchAndBound-2, Xs is intersected with A, and elements are enumerated from Xs ∩ A until all n hypotheses have been

enumerated or there are no more hypotheses in the intersection. Xs is marked as


enumerated, and remains in the collection. If additional hypotheses are needed, the

search continues until another undominated subset is found, and its elements are

enumerated as mentioned above. When all n hypotheses have been enumerated,

or all of the subsets in the collection have been marked, the search halts and the

list of hypotheses enumerated so far is returned. This list contains no more than

n hypotheses, but may contain fewer if the intersection of A and the undominated

elements of X contains fewer than n hypotheses. The algorithm incorporating these

modifications is called BranchAndBound-3 and is shown in Figure 4.14.

BranchAndBound-3(X, D̂<, D̂≮, n, A) returns n elements of {x ∈ X | ∀y ∈ X, x ⊀ y} ∩ A
    X is a set of hypotheses
    D̂< is a domination relation on subsets of X derived from <
    D̂≮ is a not-dominated relation on subsets of X derived from <
    n is the number of solutions requested
    < is a partial ordering over the elements of X
BEGIN
    L ← {X}
    Solutions ← {}
    Enumerated ← {}
    WHILE |Solutions| < n and L − Enumerated is not empty DO
        Xs ← Select(L − Enumerated)
        IF (Xs D̂≮ Xi) for every Xi in L THEN
            Add hypotheses from Xs ∩ A to Solutions until |Solutions| = n
                or there are no more hypotheses in Xs ∩ A
            Add Xs to Enumerated
        ELSE
            Replace Xs in L with Split(Xs)
            Dominated ← {Xi ∈ L | ∃Xj ∈ L (Xi D̂< Xj)}
            L ← L − Dominated
        END IF
    END WHILE
    RETURN Solutions
END BranchAndBound-3.

Figure 4.14: Branch-and-Bound That Returns n Solutions Also in A.

RS-KII implements Enumerate((C, P), A, n) by calling BranchAndBound-3 with

the arguments shown in Figure 4.15. These are the same arguments passed to


BranchAndBound-2 in order to implement Enumerate((C, P), Σ*, 1), with the addi-

tion of the n and A arguments.

Enumerate((C, P), A, n) = BranchAndBound-3(C, D̂<, D̂≮, n, A) where

• A subset of C is denoted (w, q) where w ∈ Σ* and q = δ*_C(s_C, w).

• (w1, q1) D̂≮ (w2, q2) iff ((w1, q1) × (w2, q2)) ∩ P = ∅, where
      ((w1, q1) × (w2, q2)) ∩ P = ∅ if
          Dead_C(q1) = true or Dead_C(q2) = true or
          (q = δ*_P(s_P, shuffle(w1, w2)) and Dead_P(q) = true)

• (w1, q1) D̂< (w2, q2) iff ((w1, q1) × (w2, q2)) ⊆ P and (w2, q2) ≠ ∅, where
      ((w1, q1) × (w2, q2)) ⊆ P if
          q = δ*_P(s_P, shuffle(w1, w2)) and AcceptAll_P(q) = true
      (w2, q2) ≠ ∅ if Dead_C(q2) = false

• Split((w, q)) = {(w·σ, q′) | σ ∈ Σ and q′ = δ_C(q, σ)}

• Select(L) returns Xs ∈ L s.t. (∀Xi ∈ L) (Xs, Xi) ∉ D′<, where
      (w1, q1) D′< (w2, q2) iff
          q = δ*_P(s_P, shuffle(w1, w2)) and AcceptAll_P(q) = true

Figure 4.15: Parameters to BranchAndBound-3 for Implementing Enumerate.

4.2.2.4 Efficient Selection and Pruning

The collection L must be represented intelligently in order to minimize the complex-

ity of selecting subsets and pruning dominated subsets. In RS-KII, L is maintained

as a lattice representing the known dominance relations among the subsets in L.

Each node in L is a subset, and an edge from Xi to Xj indicates that Xj D′< Xi is true. Selecting a subset is a matter of selecting an element from the top of the

lattice. The nodes at the top are maintained as a list, with each top-node pointing to the next top-node in the list.

A subset X{ is dominated if there is some subset Xj in the collection such that

Xi £>'< Xj and Xj is not empty. Thus, if the selected subset is discovered to be

■75

non-empty, every subset below it in the lattice is pruned. If the selected subset is

discovered to be empty, then it is removed from the top of the lattice. The subsets it

used to dominated are promoted to the top of the lattice if they are not dominated

by other top-nodes. When the selected subset, Xs, is split, it is removed from the lattice and replaced

with its children. If some subset Xi is below Xs in the lattice, then Xi is below each child of Xs as well. This is because Xi D'< Xs is true if and only if every hypothesis in Xi is less preferred than every hypothesis in Xs. Thus, the relation holds for every subset of Xs. However, D'< is not known for every pair of subsets. Thus, D'< may be known between some subset Xi and a child of Xs where it was not known between Xi and Xs. The children of Xs are tested against the other top-nodes, and

edges are added between the children and these nodes accordingly.
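A small sketch of this bookkeeping, assuming a Node class with an explicit list of dominated subsets and a known_dominates test; the names and structure are illustrative, not RS-KII's actual representation:

class Node:
    """A subset of hypotheses; `below` lists the subsets this node is known to dominate."""
    def __init__(self, subset):
        self.subset = subset
        self.below = []

def prune_after_nonempty(collection, node):
    """The selected node was found to be non-empty, so every subset below it in the
    lattice is dominated and can be dropped from the collection (a set of Nodes)."""
    for dominated in node.below:
        collection.discard(dominated)
    node.below = []
    return collection

def promote_after_empty(top, empty_node, known_dominates):
    """The selected node turned out to be empty: remove it from the top of the lattice
    and promote the subsets it used to dominate, unless another top node dominates them."""
    top = [t for t in top if t is not empty_node]
    for child in empty_node.below:
        if not any(known_dominates(t.subset, child.subset) for t in top):
            top.append(child)
    return top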


Chapter 5

RS-KII and AQ-11

AQ-11 [Michalski, 1978] is an induction algorithm that induces a hypothesis in the VL1 hypothesis space language [Michalski, 1974] from examples and a user-defined evaluation function. RS-KII can emulate the AQ-11 algorithm by utilizing this knowledge and the biases implicit in AQ-11. When RS-KII utilizes only this knowledge, the computational complexity of RS-KII is a little worse than that of AQ-11.

RS-KII can also utilize knowledge that AQ-11 cannot, such as a domain theory. By using this knowledge in conjunction with the AQ-11 knowledge sources described above, RS-KII effectively integrates this knowledge into AQ-11, forming a hybrid induction algorithm.

RS-KII's ability to emulate AQ-11 and to integrate additional knowledge is demonstrated in this chapter. The AQ-11 algorithm is described in Section 5.1. This is followed in Section 5.2 by a description of the knowledge used by AQ-11, and implementations of RS-KII translators for that knowledge. This section also describes RS-KII translators for domain theories, which AQ-11 cannot use.

In Section 5.3, RS-KII uses these translators to solve a synthetic induction task. The knowledge made available by this task includes the knowledge used by AQ-11, plus a domain theory. It is demonstrated that when RS-KII uses only the knowledge used by AQ-11, both RS-KII and AQ-11 induce the same hypothesis. When RS-KII uses the domain theory in addition to the AQ-11 knowledge, RS-KII induces a more accurate hypothesis than the one induced by AQ-11.

In Section 5.4, the computational complexity of RS-KII is analyzed and compared to that of AQ-11. When using only the knowledge available to AQ-11, RS-KII's complexity is a little worse than that of AQ-11.


5.1 AQ-11 Algorithm

The AQ-11 induction algorithm [Michalski, 1978, Michalski and Chilausky, 1980]

induces a hypothesis from a list of noise-free positive and negative examples, and

a user-defined evaluation function that Michalski calls a "lexicographic evaluation

function," or LEF [Michalski, 1978].

Since the examples are assumed to be noise-free, AQ-11 induces a hypothesis

that is strictly consistent with the examples. That is, the induced hypothesis covers

all of the positive examples and none of the negative examples. There may be several

hypotheses consistent with the examples, in which case the LEF is used to select

among them. Ideally, the selected hypothesis is one that is most preferred by the

LEF. However, finding a global maximum of the LEF is intractable, so AQ-11 settles

for a local maximum. A locally optimal hypothesis is found by a beam search, which

is guided by the LEF. The hypothesis space for AQ-11 is one described by the VL1 hypothesis space

language. This language is described in Section 5.1.1. The instance space, from

which the examples are selected, is described in Section 5.1.2. The algorithm itself

is described in Section 5.1.3.

5.1.1 The Hypothesis Space

The hypothesis space consists of sentences in the VL1 concept description language [Michalski, 1974]. This is actually a family of languages, parameterized by a list of features and feature-values. Specific hypothesis space languages are obtained by instantiating VL1 with these parameters.

Figure 5.1 describes a regular grammar for the VL1 family of languages. This grammar is parameterized by a list of (fi, Vi) pairs, where fi is a feature name, and Vi is the set of values that the feature fi can take. The set of values is expressed as a regular grammar, which simplifies the specification of the VL1 grammar. Any finite or countably infinite set of values can be expressed as a regular grammar by mapping each value onto a binary integer in 1(0|1)* (i.e., the set of binary numbers greater than zero).


Parameters: (f1, V1), (f2, V2), ..., (fk, Vk)

VL1 → TERM | TERM or VL1
TERM → SELECTOR | SELECTOR TERM
SELECTOR → "[" f1 (< | ≤ | = | ≠ | ≥ | >) V1 "]" |
           "[" f2 (< | ≤ | = | ≠ | ≥ | >) V2 "]" |
           ...
           "[" fk (< | ≤ | = | ≠ | ≥ | >) Vk "]"

Figure 5.1: The VL1 Concept Description Language.

Sentences (hypotheses) in VL1 languages are disjunctive normal form (DNF) expressions of selectors. A sentence in this language is a disjunct of terms, and a term is a conjunct of selectors. A selector is an expression of the form [fi # v], where fi is a feature, # is a relation from the set {=, ≠, <, ≤, >, ≥}, and v is one of the values that feature fi can take. For example, the following is a hypothesis in one possible VL1 language:

[color = red][size > 20] or [size < 5].

5.1.2 Instance Space

An instance in AQ-11 is an ordered tuple, (v1, v2, ..., vk), in which vi is the value of feature fi for that instance. For example, if there are two features, size and color, then instances would be tuples such as (50, red) and (25, green).

Examples are classified instances. An example is positive if it is covered by the target concept, and negative if it is not. An instance is covered by a VL1 hypothesis if it satisfies the hypothesis. That is, the hypothesis must contain at least one term in which every selector is satisfied by the instance. A selector [fi # xi] is satisfied by instance (v1, v2, ..., vk) if vi # xi. Recall that "#" is a relation from {<, ≤, =, ≠, >, ≥}. For example, the hypothesis [color = red][size > 20] or [size < 5] covers the instance (50, red) since red = red and 50 > 20. The hypothesis does not cover (25, green), since neither term is satisfied by the instance.
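A minimal Python sketch of this coverage test, assuming hypotheses are lists of terms and each term is a list of (feature index, relation, value) selectors; this tuple encoding is an illustrative shorthand, not the regular-set representation used by RS-KII:

import operator

RELATIONS = {'=': operator.eq, '!=': operator.ne,
             '<': operator.lt, '<=': operator.le,
             '>': operator.gt, '>=': operator.ge}

def selector_satisfied(selector, instance):
    i, rel, value = selector                 # e.g. (0, '>', 20) means [f1 > 20]
    return RELATIONS[rel](instance[i], value)

def covers(hypothesis, instance):
    """A VL1 hypothesis covers an instance iff some term has all selectors satisfied."""
    return any(all(selector_satisfied(s, instance) for s in term)
               for term in hypothesis)

# [color = red][size > 20] or [size < 5], with features (size, color):
h = [[(1, '=', 'red'), (0, '>', 20)], [(0, '<', 5)]]
print(covers(h, (50, 'red')))    # True
print(covers(h, (25, 'green')))  # False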


5.1.3 The Search Algorithm

The AQ-11 algorithm searches the hypothesis space for a hypothesis that is con-

sistent with all of the examples, and locally maximizes the LEF. The algorithm is

shown in two parts in Figure 5.2 and Figure 5.3.

AQ-11 is essentially a separate-and-conquer [Pagallo and Haussler, 1990] algo-

rithm. The positive examples are placed in a "pot." A term is found that covers at

least one of the positive examples in the pot and none of the negative examples. The

positive examples covered by the term are removed from the pot, and the process

repeats on the remaining examples in the pot. Eventually, every positive example is

covered by at least one of the terms, and no negative example is covered by any of

the terms. The terms are disjoined, and returned as the induced hypothesis. This

main loop of the algorithm is shown in Figure 5.2.

The second part of the algorithm is an inner loop that searches the space of terms

for a term that covers a positive example called the seed, but none of the negative

examples. The seed is selected in each iteration of the main loop from the pot of

uncovered positive examples. There are many terms that satisfy these criteria. The

term returned to the main loop by the search is ideally one that maximizes the LEF.

In practice, the space of terms is so huge that searching it for the best term would

be intractable. Instead, a beam search with width b is used to find a locally optimal

term. The width of the beam is determined by the user.

The beam initially contains the term true. On each iteration, each term in

the beam is replaced with its children. Child terms are generated from the parent

term by conjoining a selector to the end of the parent. This selector must cover the

positive seed, and fail to cover at least one new negative example that was covered

by the parent. This ensures progress towards a term that covers the seed but none

of the negative examples. The best b terms in the beam are retained, as determined

by the LEF. When every term in the beam covers the seed but none of the negative

examples, the best term in the beam is returned to the main loop. The beam search

is shown in Figure 5.3.
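A compact Python sketch of the separate-and-conquer loop just described, under the assumption that candidate_selectors, covers, and lef are supplied as helpers and that lower LEF values are treated as better (see Figures 5.2 and 5.3 for the actual pseudocode):

def aq11(pos, neg, lef, candidate_selectors, covers, b=1):
    """Sketch of AQ-11: disjoin terms until every positive example is covered."""
    hypothesis = []                                  # list of terms (a disjunction)
    pot = list(pos)
    while pot:
        seed = pot[0]
        term = learn_term(seed, list(neg), lef, candidate_selectors, covers, b)
        hypothesis.append(term)
        pot = [p for p in pot if not covers([term], p)]
    return hypothesis

def learn_term(seed, neg_pot, lef, candidate_selectors, covers, b):
    beam = [[]]                                      # the empty term "true"
    while neg_pot:
        neg_seed = neg_pot[0]
        selectors = candidate_selectors(seed, neg_seed)
        children = [t + [s] for t in beam for s in selectors]
        beam = sorted(children, key=lef)[:b]         # keep the b best children
        # keep only negatives still covered by some term in the beam
        neg_pot = [x for x in neg_pot if any(covers([t], x) for t in beam)]
    return min(beam, key=lef)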


Algorithm AQ-11(PosList, NegList, LEF, b) returns h
  PosList is a list of positive examples
  NegList is a list of negative examples
  LEF is a term evaluation function
  b is the beam width

BEGIN
  h = false
  LOOP
    Select seed from PosList
    term = LearnTerm(seed, NegList, LEF, b)
    h = h ∨ term
    Remove from PosList every example covered by term
  UNTIL PosList is empty
  RETURN h
END AQ-11

Figure 5.2: The Outer Loop of the AQ-11 Algorithm.

5.2 Knowledge Sources and Translators

The knowledge used by AQ-11 [Michalski, 1978] consists of positive and negative

examples, a user-defined lexicographic evaluation function (LEF), a list of features,

and a set of values for each feature. The features and values parameterize the VL1

concept description language, and the remaining knowledge drives the search.

In order for RS-KII to use this knowledge, it must be expressed in terms of

constraints and preferences. This is accomplished by translators, which are de-

scribed below. In addition to the knowledge used by AQ-11, RS-KII can utilize any

other knowledge for which a translator can be constructed. A translator for one

such knowledge source—a domain theory—is described below. Translators for other

knowledge sources are also possible.

5.2.1 Examples

AQ-11 induces hypotheses that are strictly consistent with the examples. That is,

the induced hypothesis must cover all of the positive examples and none of the


Algorithm LearnTerm(seed, NegList, LEF, b) returns term
  seed is a positive example
  NegList is a list of negative examples
  LEF is a term evaluation function
  b is the beam width

BEGIN
  Terms = {true}
  LOOP
    (1) Select neg-seed from NegList
    (2) S = set of selectors covering seed but not neg-seed
    (3) Children = {t · selector | t ∈ Terms and selector ∈ S}
    (4) Terms = best b terms in Children according to LEF
    (5) Remove from NegList every example covered by none of the terms in Terms
  UNTIL NegList is empty
  RETURN the best element of Terms according to LEF
END LearnTerm

Figure 5.3: The Inner Loop of the AQ-11 Algorithm.

negative examples. A positive example is translated into a constraint satisfied only

by hypotheses that cover the example. A negative example is translated into a

constraint satisfied only by hypotheses that do not cover the example.

The implementation of this example translator is shown in Figure 5.4. It takes

as input a hypothesis space and an example. The hypothesis space is an instantia-

tion of VL1 with appropriate features and values. The function Covers(H, i) returns

the set of hypotheses in H that cover instance i, and the function Excludes(H,i)

returns the set of hypotheses in H that do not cover instance i. Both sets are rep-

resented as regular expressions (i.e., regular sets). The regular expression returned

by Covers(H,i) is shown in Figure 5.5. Excludes(H,i) returns the complement of

Covers(H,i).

The regular grammar returned by Covers(H, i) recognizes all hypotheses in H that are satisfied by instance i. H is an instantiation of VL1, and i is an ordered vector of feature values, (v1, v2, ..., vk). A hypothesis covers an instance if the hypothesis contains at least one term in which every selector is satisfied by the instance. In general, selector [fi # v] covers (v1, v2, ..., vk) if vi # v, where "#" is


TranAQExample(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), ((v1, v2, ..., vk), class)) → (C, {})

where

  C = Covers(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), (v1, v2, ..., vk))     if class = positive
  C = Excludes(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), (v1, v2, ..., vk))   if class = negative

Figure 5.4: Example Translator for VL1.

a relation in {<, ≤, =, ≠, >, ≥}. For instance, the selector [f3 < 12] covers instance (red, 18, 9) since 9 < 12.

Figure 5.5 shows the regular expression returned by Covers(H,i). At the top

level, the expression is (ANY-TERM or)* COVERING-TERM (or ANY-TERM)*. This

says that at least one term in the hypothesis must cover the instance, namely the

term recognized by COVERING-TERM.

A term covers the instance if every selector in it covers the instance. COVERING-TERM expands to COVERING-SELECTOR+, which is a conjunction of covering selectors. COVERING-SELECTOR is the set of selectors that are satisfied by the instance. The instance is an ordered vector of feature values, (v1, v2, ..., vk). A selector covers (v1, v2, ..., vk) if it is of the form [fi # v], where vi # v and # is a relation in {<, ≤, =, ≠, >, ≥}. This set of selectors is recognized by the regular expression COVERING-SELECTOR.

The definition of COVERING-SELECTOR is a little obscure. An instance is covered by a selector of the form [fi < v] if the value of the instance on feature fi is less than v. That is, the instance is covered by all selectors in the set {[fi < v] | v > vi}, where vi is the value of the instance on feature fi. A similar analysis holds for the other relations, ≤, =, ≠, >, and ≥. COVERING-SELECTOR expands into a union of regular expressions for each of these sets. Each of these sets can be written as a regular expression. For example, the regular expression for {[f2 < v] | v > 10} is [f2 < (1-9)(0-9)+].

The function Excludes(H,i) returns the complement of the grammar returned

by Covers (H,i).


Covers(VL1((f1, V1), (f2, V2), ..., (fk, Vk)), (v1, v2, ..., vk)) → G where

G = (ANY-TERM or)* COVERING-TERM (or ANY-TERM)*
ANY-TERM = SELECTOR+
COVERING-TERM = COVERING-SELECTOR+
SELECTOR = "[" f1 (< | ≤ | = | ≠ | ≥ | >) V1 "]" |
           "[" f2 (< | ≤ | = | ≠ | ≥ | >) V2 "]" |
           ...
           "[" fk (< | ≤ | = | ≠ | ≥ | >) Vk "]"
COVERING-SELECTOR = {[fi # v] | vi # v and # ∈ {<, ≤, =, ≠, >, ≥}}
    = "[" f1 < {v ∈ V1 | v > v1} "]" | "[" f1 ≤ {v ∈ V1 | v ≥ v1} "]" |
      "[" f1 = v1 "]" | "[" f1 ≠ (V1 - {v1}) "]" |
      "[" f1 ≥ {v ∈ V1 | v ≤ v1} "]" | "[" f1 > {v ∈ V1 | v < v1} "]" |
      ...
      "[" fk < {v ∈ Vk | v > vk} "]" | "[" fk ≤ {v ∈ Vk | v ≥ vk} "]" |
      "[" fk = vk "]" | "[" fk ≠ (Vk - {vk}) "]" |
      "[" fk ≥ {v ∈ Vk | v ≤ vk} "]" | "[" fk > {v ∈ Vk | v < vk} "]"

Figure 5.5: Regular Expression for the Set of VL1 Hypotheses Covering an Instance.

5.2.2 Lexicographic Evaluation Function

AQ-11 searches the hypothesis space in a particular order, and returns the first

hypothesis in the order that is consistent with the examples. The search order

therefore expresses a preference for hypotheses that are visited earlier in the search over those that are visited later. The search order is partly determined by the search algorithm, and partly determined by the lexicographic evaluation function (LEF), which guides the search. The LEF and the search algorithm are translated conjointly

into a preference for hypotheses that occur earlier in the search.

Expressing this preference as a regular grammar can be somewhat difficult. Given

two hypotheses, h1 and h2, determining which of them is visited first by the search is often difficult to extract just from the information available in h1 and h2. Often, the only way to determine their relative ordering is to run the search and see which one is visited first. This is clearly impractical. However, for some searches, the ordering can be determined directly from h1 and h2. For example, in best first search, h1 is generally visited before h2 if h1 is preferred by the evaluation function, although the

topology of the search tree also affects the visitation order.


The beam search used by AQ-11 falls into the class of search orderings that are

difficult to express. However, when the beam width is one, beam search devolves into

hill climbing search, which can be expressed as a regular grammar, though somewhat

awkwardly. Although the biases imposed by some search algorithms are difficult to

represent, it should be remembered that these biases are themselves approximations

of intractable biases. For example, AQ-11 uses a beam search to find a locally

optimal approximation to the globally optimal term. Although the beam search is

difficult to encode in RS-KII, it may be possible to find some other approximation

of the intractable bias that can be expressed more naturally in RS-KII. This is an

area for future work.

The remainder of this subsection describes the search ordering used by AQ-11,

and how a restriction on this ordering can be expressed as a regular grammar in

RS-KII. AQ-11 uses a two-tiered search. At the bottom level, it conducts a beam

search through the space of terms for a term that covers a specified positive example

and none of the negative examples. This search is guided by the LEF. Let the search

order over the terms be specified by ≺t, where t1 ≺t t2 if t1 and t2 are terms, and t1 is visited before t2. The LEF may assign equivalent evaluations to some terms, in

which case it does not matter which term is visited first. In this case, there is no

ordering between the two terms.

The top level is a search through the space of hypotheses. In each iteration, the

term found by the beam search is disjoined to the end of the current hypothesis.

This can be viewed as a non-backtracking hill climbing search, where children are

generated by disjoining a term to the parent hypothesis, and the terms are ordered

according to ≺t, with terms occurring earlier in ≺t being selected first.

Let ≺h be the order in which AQ-11 visits hypotheses, where h1 ≺h h2 if h1 and h2 are both hypotheses, and h1 is visited before h2. The LEF and the fixed search algorithm are translated into a preference defined in terms of ≺h. Namely, the preference set P is {(x, y) ∈ H×H | y ≺h x}, which says that hypothesis x is

less preferred than y if x is visited after y.

P is essentially the preference set for ≺h, {(a, b) ∈ H×H | a ≺h b}, except that the order of the tuples is reversed. If ≺h can be expressed as a regular grammar, it is likely that P can also be expressed as a regular grammar. To express ≺h as

a regular grammar, a decision procedure is first defined that determines whether


h1 ≺h h2 for arbitrary hypotheses h1 and h2. This procedure is then expressed as a

regular grammar.

A hill climbing search first visits the root of the search tree, followed by the best

child of that root. The subtree rooted at the child is searched recursively in the

same fashion until a goal node is found. If no goal is found after searching to the

bottom of the tree, the search would backtrack from the leaf node, and search the

next best child of the leaf's parent. This is a pre-order traversal of the search tree.

The root is visited first, and then the subtrees rooted at each child node are searched

recursively from best child to worst child, as determined by the evaluation function.

The top-level search of AQ-11 is a hill-climbing search. Let hypothesis h1 be t1,1 ∨ t1,2 ∨ ... ∨ t1,n, and let hypothesis h2 be t2,1 ∨ t2,2 ∨ ... ∨ t2,m, where ti,j is a term. The top-level search visits h1 before h2 if h1 comes before h2 in pre-order. This can be determined by comparing the terms of each hypothesis from left to right. If t1,1 ≺t t2,1, then h1 comes first, and if t2,1 ≺t t1,1, then h2 comes first. If the first terms of both hypotheses are the same, then the next pair of terms, t1,2 and t2,2, are

compared in the same fashion. This continues until the terms differ, or until one

hypothesis runs out of terms. In this case, the longer hypothesis is an extension of

the shorter one, and therefore a descendent of the shorter one. Hill climbing visits

descendents after visiting the parent, so the shorter one is visited first.

This decision procedure is based on ≺t, the order in which the beam search visits

the terms, as guided by the LEF. In a beam search, the terms in the beam can be

visited in any order. There is one beam for each iteration of the search, and the terms

in each beam are visited after the terms in all of the beams for prior iterations. Terms

that participate in none of the beams are never visited, and therefore least preferred.

Determining whether term t1 is visited before t2 is difficult, since it requires knowing which beams the terms are in, which in turn requires knowing the beams for each of the previous iterations. This information is difficult to compute from t1 and t2, other than by running a beam search to determine whether t1 or t2 is visited first.

However, when the beam width is set to one, a beam search becomes a hill

climbing search, for which it is easier to specify the search order. A hill climbing

search visits the search tree in pre-order, as described above for the top-level search.

Nodes in this search tree are terms. A term is a conjunction of selectors, s1 s2 ... sn.

To determine which of two terms is visited first, compare their selectors from left


to right. Let t1 = s1,1 s1,2 ... s1,n and let t2 = s2,1 s2,2 ... s2,m. Term t1 is visited first if the LEF assigns a better evaluation to s1,1 than to s2,1. If s2,1 is better, then t2 comes first. If both s1,1 and s2,1 are equally good, then compare s1,2 and s2,2 in the

same fashion. Compare from left to right until one term has a better selector. If one

term has no more selectors, then the shorter term is preferred.

One caveat for this search is that the LEF is an evaluation function for terms, not for selectors. For a term s1 s2 ... sn, the evaluation of selector sn is LEF(s1 s2 ... sn), not

LEF(sn).

To summarize, the preference order between two hypotheses, h1 and h2, is determined as follows. Compare the terms of the two hypotheses from left to right according to ≺t. Terms are themselves compared by ≺t in a similar fashion: the selectors of each term are compared from left to right until one term runs out of selectors, or the selector of one term has a lower evaluation than the corresponding selector of the other term. Recall that the evaluation of the ith selector in a term is LEF(s1 s2 ... si), where s1 through si-1 are the preceding selectors in the term. This is the procedure for determining whether h1 ≺h h2.
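A sketch of this decision procedure in Python, assuming hypotheses are lists of terms, terms are lists of selectors, and lef evaluates a selector prefix; all names are illustrative:

def term_order(t1, t2, lef):
    """Return -1 if t1 is visited before t2 by the width-1 (hill-climbing) beam
    search, +1 if after, 0 if the LEF cannot distinguish them."""
    for i in range(min(len(t1), len(t2))):
        e1, e2 = lef(t1[:i + 1]), lef(t2[:i + 1])   # evaluation of the i-th selector
        if e1 != e2:
            return -1 if e1 < e2 else 1             # lower evaluation = better
    if len(t1) != len(t2):
        return -1 if len(t1) < len(t2) else 1       # the shorter (parent) term first
    return 0

def hypothesis_order(h1, h2, lef):
    """Pre-order comparison of hypotheses: compare terms left to right with term_order."""
    for u, v in zip(h1, h2):
        c = term_order(u, v, lef)
        if c != 0:
            return c
    if len(h1) != len(h2):
        return -1 if len(h1) < len(h2) else 1       # the shorter hypothesis is visited first
    return 0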

The ordering ≺h is essentially a lexicographic ordering. A lexicographic ordering is an alphabetical ordering over strings in some alphabet, where the symbols in the alphabet are ranked according to some total ordering. For example, a lexicographic ordering over strings in (a-z)* is the standard dictionary order.

The ≺h ordering can be mapped onto a lexicographic ordering over strings in (0-9)*, where the symbols in the alphabet (0-9) have the usual ordering for digits. This ordering can be easily expressed as a regular grammar, as can the mapping from hypotheses onto digit strings. These two grammars can be composed into a single regular grammar according to a construction that will be explained later. This is the regular grammar for ≺h.

A hypothesis is mapped onto a digit string as follows. Each selector is replaced by

the evaluation assigned to it by the LEF. Evaluations are assumed to be fixed-length

integers of length l, where lower valued integers correspond to better evaluations. The ∨ symbols between terms are replaced by a zero. This mapping is shown in Figure 5.6. Hypothesis h1 is preferred to hypothesis h2 if M(h1) ≺lex M(h2), where ≺lex is the standard lexicographic ordering over strings of digits.


M(t1 ∨ t2 ∨ ... ∨ tn) = Mt(t1) · 0 · Mt(t2) · 0 · ... · 0 · Mt(tn)

Mt(s1 s2 ... sm) = LEF(s1) · LEF(s1 s2) · ... · LEF(s1 s2 ... sm)

Figure 5.6: Mapping Function from Hypotheses onto Strings.

TranLEF(H, LEF, examples) → (C, P) where

C = H (i.e., no constraints)

P = {(x, y) ∈ H×H | M(y) ≺lex M(x)}

(M has access to H, the LEF, and the examples)

Figure 5.7: Translator for the LEF.

The reason for placing a zero between terms can be seen in the following example.

Let h1 be t1∨t2∨s1 and let h2 be t1∨t2s2∨t3, where t2s2 is the conjunction of term t2 and selector s2. Without the zeros between terms, M(h1) is Mt(t1)·Mt(t2)·LEF(s1), and M(h2) is Mt(t1)·Mt(t2)·LEF(t2s2). If LEF(t2s2) is less than LEF(s1), then h2 will be preferred to h1, which is not correct. The evaluation of t2s2 is being improperly compared to the evaluation of t2∨s1 instead of t2, because the presence of the disjunction (∨) is not being taken into account. Appending a zero to the end

of a term's evaluation assigns a better evaluation to a terminated term than to any

extension of that term, thereby producing the correct comparison.
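A small numeric check of this point, using made-up one-digit LEF values (the particular numbers are hypothetical):

# Suppose LEF(t1) = 3, LEF(t2) = 5, LEF(s1) = 4, and LEF(t2 s2) = 2.
# h1 = t1 v t2 v s1,  h2 = t1 v t2 s2 v t3.
without_zeros_h1 = "3" + "5" + "4"          # Mt(t1) Mt(t2) LEF(s1)
without_zeros_h2 = "3" + "52"               # Mt(t1) Mt(t2)·LEF(t2 s2), prefix shown
print(without_zeros_h2 < without_zeros_h1)  # True: h2 wrongly precedes h1

with_zeros_h1 = "3" + "0" + "5" + "0" + "4"
with_zeros_h2 = "3" + "0" + "52" + "0"      # the 0 after "5" in h1 beats the "2" in h2
print(with_zeros_h1 < with_zeros_h2)        # True: h1 correctly precedes h2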

Given this mapping, the preference ordering over the hypotheses imposed by the

LEF can be expressed as a preference set P. A hypothesis is preferred if its mapped

string comes earlier in the lexicographic ordering, and less preferred if it comes later.

The preference set P is therefore {(x, y) ∈ H×H | M(y) ≺lex M(x)}, where H is the hypothesis space and ≺lex is the standard lexicographic ordering over strings of digits. This leads to the translator for the LEF shown in Figure 5.7. This translator

takes as input the hypothesis space (H), and the LEF. Since the LEF often depends

on additional knowledge, this information is also provided to the translator. In most

cases, this information is the examples covered and not covered by the term being

evaluated, so the translator takes as input the list of examples.


The next step is to express C and P as regular sets. C is an instantiation of

the VL1 hypothesis space, all of which can be expressed as regular grammars as

shown in Figure 5.1. Representing P as a regular set is a little more difficult. This

is discussed in Section 5.2.2.1, below.

5.2.2.1 Constructing P

A pair of hypotheses, (x, y), is a member of P if M(x) comes after M(y) lexicographically. This can be expressed as a composition of two simpler DFAs, one for the

lexicographic ordering, and one for the mapping, M.

The Lexicographic Ordering. The DFA for the standard lexicographic ordering

over strings of digits, ≺lex, is fairly simple. The DFA takes as input a pair of shuffled strings, w = shuffle(w1, w2), where w1 and w2 are strings of digits from (0-9)*. The symbols in w alternate between w1 and w2, so that the first two digits in w are the first digit from w1 and the first digit from w2. These two digits are compared, and if the digit from w1 is lower, then w is accepted (w1 ≺lex w2). If the digit from w1 is higher, then w1 comes after w2, and w is rejected. Otherwise, the next pair of symbols are compared in the same manner. If one string runs out of symbols, and the two strings are equivalent up to this point, the shorter string is preferred. If w1 is shorter, then w is accepted, and if w1 is longer, w is rejected. This DFA is shown

in Figure 5.8.
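The same acceptance test written directly in Python over the shuffled string (a sketch of the behavior of Figure 5.8, not the DFA construction itself):

def accepts_shuffle(w):
    """w = shuffle(w1, w2) over digits; accept iff w1 precedes w2 lexicographically.
    Positions 0, 2, 4, ... come from w1; positions 1, 3, 5, ... come from w2."""
    w1, w2 = w[0::2], w[1::2]
    for d1, d2 in zip(w1, w2):
        if d1 < d2:
            return True
        if d1 > d2:
            return False
    return len(w1) < len(w2)     # equal prefix: the shorter string is preferred

print(accepts_shuffle("1203"))   # w1 = "10", w2 = "23": accepted
print(accepts_shuffle("2103"))   # w1 = "20", w2 = "13": rejected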

The Mapping. The mapping, M, is expressed as a kind of DFA known as a

Moore machine [Hopcroft and Ullman, 1979]. A Moore machine is a DFA that has

an output string associated with each state. The output string for each state is

composed of symbols from an output alphabet, A. The output string for a state is

emitted by the machine whenever it enters that state. The Moore machine for M

takes as input a hypothesis, h, and emits a string w such that w = M(h). Formally,

a Moore machine is specified by the tuple (Q, s, δ: Q×Σ → Q, out: Q → Δˡ, Σ, Δ), where l is a (usually small) fixed integer.

The Moore machine for M is constructed as follows. Since the mapping depends

on the LEF, the Moore machine is parameterized by the LEF. The machine

takes a hypothesis as input. When it recognizes a selector, it emits the string


Figure 5.8: DFA Recognizing {shuffle(x, y) | x, y ∈ (0-9)*, x ≺lex y}.

LEF(s1 s2 ... sn), where sn is the selector it has just recognized, and s1 s2 ... sn-1 are

the preceding selectors in the term.

Since each state in the machine can have only one output string, there must be

at least one state in the machine for each possible term. This is clearly impossible.

However, most evaluation functions evaluate a term based on the number of positive

and negative examples covered by the term. Instead of having one state for each

term, the machine has one state for each way of partitioning the examples into

covered and uncovered subsets. Each partitioning corresponds to a set of terms that

are all assigned the same evaluation by the LEF. The list of covered and uncovered

examples is passed to a modified version of the LEF, which returns an evaluation of

the term based on this information rather than by looking at the term itself. The

evaluation returned by the LEF is the output string for this state.


There are 2ⁿ distinct ways to partition a list of n examples into sets of covered and uncovered examples, so the machine can have at least 2ⁿ states. Fortunately, the DFA is represented as an implicit DFA, so not all of these states have to be

kept in memory.

Even though this Moore machine has O(2ⁿ) states, the cost of induction can be considerably less than O(2ⁿ) (it is in fact polynomial, when using only the knowledge

used by AQ-11, as is demonstrated in Section 5.4). Enumerating a hypothesis from

the deductive closure of (C, P) involves visiting some of the states in C and P.

Only in the worst case (e.g., when the deductive closure is empty) are all of the

states visited. Since the DFAs for C and P are represented implicitly, the cost of

enumeration is proportional to the number of states visited, not to the total number

of states in each DFA. The Moore machine for the mapping function is specified in

Figure 5.9.

Behavior of the Moore Machine. The machine takes as input a hypothesis,

h, and outputs M(h), where M is the mapping described previously in Figure 5.6.

Recall that M(t1∨t2∨...∨tk), where t1 through tk are terms, maps onto the string Mt(t1)·0·Mt(t2)·0·...·0·Mt(tk). The mapping Mt maps a term, s1 s2 ... sm, where s1 through sm are selectors, onto the string LEF(s1) · LEF(s1 s2) · ... · LEF(s1 s2 ... sm).

The LEF assigns an l-digit integer to each term. As was discussed previously, this value depends on which examples are covered by the term, and on which positive examples are covered by the previous terms in the hypothesis. Recall that a term is any sequence of selectors, so that s1, s1 s2, and so on are all terms, not just s1 s2 ... sm.

The machine keeps track of which examples are covered by the current term, and

which examples are covered by the previous terms in the hypothesis. This is done

by maintaining two binary vectors, Et and Eh, of length n, where n is the number

of examples. The vector Et indicates which examples are covered by the current

term, and Eh indicates which examples are covered by the current hypothesis (all

terms but the current one). A one in position i of the vector indicates that example

i is covered, and a zero indicates that example i is not covered. The vector for the

hypothesis is initially all zeros, since the initial hypothesis is false. The vector for

the current term is all ones, since the term is initially true.


Parameters:
  E: an ordered list of n examples
  LEF: (Eh, Et) → (1-9)ˡ, where
    Eh is a binary vector of length n indicating which examples are covered by the hypothesis,
    Et is a binary vector of length n indicating which examples are covered by the current term,
    and l is a small, fixed integer.
  SEL: a DFA that recognizes selectors
  f1 through fk are the features

A Moore machine for M is (Q, s, δ: Q×Σ → Q, out: Q → Δˡ, Σ, Δ) where:

  Q = {(Eh, Et, end) | Eh, Et ∈ (0|1)ⁿ, end ∈ {0, 1}} ∪ {d}
      end is a flag indicating whether the machine has just seen the end of a term (1)
      or is in the middle of a term (0); d is the terminal state.

  s = (0ⁿ, 1ⁿ, 0)

  δ((Eh, Et, end), w):
    if w ∈ SEL:  Clear bit i of Et if example i is not covered by selector w.
                 Set end to 0. Go to state (Eh, Et, 0).
    if w = "∨":  ∨ marks the end of a term. Set end to 1. Set bit i of Eh to 1
                 if bit i of Et is 1, else leave it unchanged. Set every bit in Et
                 to 1. Go to state (Eh, Et, 1).
    if w = "$":  Go to state d, the terminal state.
  δ(d, w) = d

  out((Eh, Et, end)) = LEF(Eh, Et)  if end = 0
                       0ˡ           if end = 1
  out(d) = ε, the empty string.

  Σ = {"[", f1, f2, ..., fk, <, ≤, =, ≠, >, ≥, 0-9, "]", ∨, $}
  Δ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Figure 5.9: Moore Machine for the Mapping.

When a selector is seen, Et is updated by clearing bit i of the vector if the selector

does not cover example i. Since a term is a conjunction of selectors, if any selector in

the term fails to cover an example, there is no way to cover the example by adding

more selectors to the term. Therefore, seeing a selector never causes a bit in Et to be

turned on. After seeing the selector, the machine outputs the value assigned by the

LEF to the current term. This value depends on the examples covered by the current

hypothesis and the current term. Specifically, the machine outputs LEF(Eh,Et).

The machine continues in this fashion until the symbol V is seen. This symbol

marks the completion of the current term and the beginning of a new term. The

completed term is now considered part of the current hypothesis, and a new current

term is started. The new current hypothesis covers every example covered by the

completed term, since the term is a disjunct of the hypothesis. Bit i of Eh is

set to one if bit i of Et is one, and is left unchanged otherwise. This is equivalent

to computing the logical or of Eh and Et. Since a new current term has started,

the vector Et is set to all ones. The machine outputs a 0 to mark the end of the

term. More specifically, it outputs a string of l zeros, since the output strings of the machine must all be of the same length, and the strings output by the LEF are of length l.

The machine continues to process symbols as described above until the $ symbol

is seen, signifying the end of the hypothesis, at which point the machine halts.

Specifically, it moves into a state that has an edge to itself on every input symbol,

and has no output string associated with it.

The only memory available to a Moore machine is its states. Therefore, the vectors Eh and Et correspond to states in the machine. The vectors are of finite length, so there are a finite number of states in the machine. When a selector is seen, the vectors are recomputed as described above, and the machine moves to this state. Each state has an output string associated with it, which is emitted when the machine moves to that state. Specifically, the machine emits the string LEF(Eh, Et) after seeing each selector, and the string 0ˡ after seeing each ∨ symbol. Since the tuple (Eh, Et) does not indicate whether a selector or a ∨ symbol was just seen, an

additional bit, labeled end, is added to the state. A state in the machine is therefore

of the form (Eh, Et, end). When end is one, indicating that a ∨ symbol has just


been seen, the output for the state is 0ˡ. When end is zero, the output for the state is LEF(Eh, Et).
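A Python sketch of this state-update behavior, carrying (Eh, Et) explicitly; covers_example and lef are assumed helpers, and the tokenized-hypothesis input is an illustrative simplification of the implicit DFA actually used:

def run_mapping_machine(hypothesis_tokens, covers_example, lef, n_examples, l):
    """Emit M(h) for a tokenized hypothesis. Tokens are selectors, 'v' (disjunction),
    or '$' (end of hypothesis). lef(Eh, Et) is assumed to return an l-digit string."""
    Eh = [0] * n_examples          # examples covered by the completed terms
    Et = [1] * n_examples          # examples covered by the current term
    out = []
    for tok in hypothesis_tokens:
        if tok == '$':
            break
        if tok == 'v':             # end of term: fold Et into Eh, reset Et, emit 0^l
            Eh = [a | b for a, b in zip(Eh, Et)]
            Et = [1] * n_examples
            out.append('0' * l)
        else:                      # a selector: clear bits for examples it excludes
            Et = [bit if covers_example(tok, i) else 0 for i, bit in enumerate(Et)]
            out.append(lef(Eh, Et))
    return ''.join(out)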

Composing the Grammars. The DFA for P = {(x, y) ∈ H×H | M(y) ≺lex M(x)} is computed from M (the Moore machine for the mapping), and from the DFA for ≺lex (the standard lexicographic ordering over strings of digits). The idea is to use two copies of M, and pass x to one of them, and y to the other. The output from the copies of M is passed to the DFA for ≺lex, which determines whether M(y) ≺lex M(x).

Input strings to P are of the form shuffle(a, b) = a1 b1 a2 b2 ... an bn, where a and b are hypotheses, and ai and bi are symbols in a and b. P maintains two copies of M, Ma and Mb. On odd numbered input symbols (i.e., symbols from a), P simulates a move on Ma, and on even numbered symbols (from b) P simulates a move on Mb. When one move has been made in both Ma and Mb, the output strings, wa and wb, from each Moore machine are shuffled together. Moves are then simulated in the DFA for ≺lex on the shuffled input string, shuffle(wb, wa)—that is, the DFA tests whether wb ≺lex wa. P is in an accept state iff the DFA for ≺lex is in an accept state. P recognizes the set {(a, b) ∈ H×H | M(b) ≺lex M(a)}.

The above construction for P is essentially a substitution of the regular language

generated by the Moore machine into the regular language accepted by the DFA for

≺lex. Regular languages are closed under substitution [Hopcroft and Ullman, 1979],

and all Moore machines generate regular languages, so this construction always yields

a regular grammar for P.

An Information Gain Metric. One common way to compare terms is with an

information gain metric (e.g., [Quinlan, 1986]). The information of a term is a

measure of how well it distinguishes between the positive and negative examples.

Terms with higher information are preferred.

Let p0 be the number of positive examples covered by the term, let n0 be the number of negative examples covered by the term, and let p1 and n1 be the number of positive and negative examples that are not covered by the term. Let p and n be


the total number of positive and negative examples, respectively. The information

of the term is shown in Equation 5.1.

(p0 + n0)/(p + n) · [ p0/(p0 + n0) · log2(p0/(p0 + n0)) + n0/(p0 + n0) · log2(n0/(p0 + n0)) ]
  + (p1 + n1)/(p + n) · [ p1/(p1 + n1) · log2(p1/(p1 + n1)) + n1/(p1 + n1) · log2(n1/(p1 + n1)) ]     (5.1)

When this information metric is used to evaluate terms, the LEF function is

defined as follows. The LEF takes as input the vectors Et and Eh, which indicate

the examples covered by the current term and the previous terms in the hypothesis.

The evaluation is the information of the current term with respect to the negative

examples and the positive examples that have not yet been covered by the previous

terms. If the positive examples covered by the previous terms are not excluded, the

same term gets selected in each call to LearnTerm.

Bit i of Et is on if the term covers example i, and is off if the example is not covered. The LEF is assumed to know which bits in each vector correspond to positive and negative examples. The total number of negative examples covered and not covered by the term (n0 and n1) can be computed from Et by counting the

number of ones and zeros in the appropriate positions.

To compute the number of positive examples in the pot, a bit-wise complement of Eh is first performed. Bit i in the resulting vector is on if example i is not covered by any of the previous terms. The number of ones in this vector corresponding to positive examples is computed. This is p, the total number of positive examples in

the pot—the set of positive examples not yet covered by the current hypothesis.

The number of positive examples in the pot that are covered by the term (p0)

can be determined by computing the logical and of Et with the complement of Eh. Bit i in this vector is on if example i is covered by the term, but not by any of the previous terms. To compute p1, the number of positive examples in the pot that are not covered by the term, subtract p0 from p.

The values for n0, n1, p0, p1, p, and n are then substituted into Equation 5.1, which returns a real number between -1 and 1. The first l digits of this number constitute the digit string returned by LEF(Eh, Et).
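A sketch of this LEF in Python. The conversion of the information value into an l-digit string is not fully specified in the text, so the final two lines are only one plausible reading (higher information is mapped to a lower, and therefore better, digit string); all names are illustrative:

from math import log2

def information_lef(Eh, Et, is_positive, l=3):
    """Evaluate the current term from the coverage vectors Eh (previous terms) and
    Et (current term). Returns an l-digit string; lower strings are treated as better."""
    pot = [i for i, h in enumerate(Eh) if not h and is_positive[i]]   # uncovered positives
    negs = [i for i in range(len(Eh)) if not is_positive[i]]
    p, n = len(pot), len(negs)
    if p + n == 0:
        return '0' * l
    p0 = sum(1 for i in pot if Et[i])          # pot positives covered by the term
    n0 = sum(1 for i in negs if Et[i])         # negatives covered by the term
    p1, n1 = p - p0, n - n0

    def weighted_entropy(a, b):
        total = a + b
        if total == 0:
            return 0.0
        h = sum((x / total) * log2(x / total) for x in (a, b) if x)
        return (total / (p + n)) * h

    info = weighted_entropy(p0, n0) + weighted_entropy(p1, n1)   # Equation 5.1
    # Hypothetical digitization: higher information maps to a lower (better) string.
    score = int(round(-info * (10 ** l - 1)))
    return str(score).zfill(l)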


5.2.3 Domain Theory

A domain theory encodes background knowledge about the target concept as a

collection of horn-clause inference rules that explain why an instance is a member

of the target concept. The way in which this knowledge biases induction depends

on assumptions about the correctness and completeness of the theory. Each of

these assumptions requires a different translator, since the bias maps onto different

constraints and preferences.

A complete and correct theory can explain why every instance covered by the

target concept is a member of the concept. The theory exactly describes the target

concept. This is a very strong form of bias, since the theory identifies the target

concept uniquely and accurately. No other knowledge is necessary. The bias can be

expressed as a constraint that is satisfied only by the target concept. This bias is

used in algorithms such as Explanation Based Learning [DeJong and Mooney, 1986,

Mitchell et al., 1986].

Domain theories are not always complete nor correct. An incomplete theory

can explain some instances, but not all of them. This occurs, for example, when

the theory is overspecific. In an overspecific theory, the target concept is some

generalization of the theory. The bias imposed by an overspecific theory can be

translated as a constraint that is satisfied only by generalizations of the theory.

In addition to being incomplete, a theory can also be incorrect. An incorrect

theory misclassifies some instances. One common kinds of incorrectness is overgen-

erality. An overgeneral theory explains all of the examples covered by the target

concept, and some that are not. An overgeneral theory can be expressed as a con-

straint that is satisfied only by hypotheses that are specializations of the theory.

Overgeneral theories are used by algorithms such as IOE [Flann and Dietterich,

1989] and OCCAM [Pazzani, 1988].

A translator for a particular overgeneral domain theory is described below. The

theory being translated is derived from the classic "cup" theory [Mitchell et al., 1986, Winston et al., 1983], as shown in Figure 5.10. Since the theory is assumed to be

overgeneral, the target concept must be a specialization of the theory. This bias is

expressed as a constraint, which is satisfied only by specializations of the theory.

More precisely, the theory is a disjunction of several sufficient conditions for CUP,


cup(X) :- hold_liquid(X), liftable(X), stable(X), drink_from(X).
hold_liquid(X) :- plastic(X) | china(X) | metal(X).
liftable(X) :- small(X), graspable(X).
graspable(X) :- small(X), cylindrical(X) | small(X), has_handle(X).
stable(X) :- flat_bottom(X).
drink_from(X) :- open_top(X).

Figure 5.10: CUP Domain Theory.

and it is assumed that some disjunction of these conditions is the correct definition

for CUP.

The domain theory is a union of several sufficient conditions that can be expressed

in terms of the operational predicates. The target concept is assumed to be equivalent

to a disjunction of one or more of these conditions. The operational predicates are

those that meet user-defined criteria for ease-of-evaluation and comprehensibility,

among others. For this theory, the only operationality criterion for predicates is that

they can be evaluated on an instance (e.g., small(X) can be determined for instance

X, but cup(X) cannot). For this theory, the leaf predicates are the operational

predicates. The sufficient conditions for the theory are as follows:

1. cup(X) :- plastic(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).

2. cup(X) :- china(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).

3. cup(X) :- metal(X), small(X), cylindrical(X), flat_bottom(X), open_top(X).

4. cup(X) :- plastic(X), small(X), has_handle(X), flat_bottom(X), open_top(X).

5. cup(X) :- metal(X), small(X), has_handle(X), flat_bottom(X), open_top(X).

6. cup(X) :- china(X), small(X), has_handle(X), flat_bottom(X), open_top(X).

The target concept is assumed to be a disjunction of one or more of these con-

ditions. This bias is translated as a constraint satisfied only by hypotheses that are equivalent to a disjunction of one or more of these conditions. The regular grammar for the constraint is computed

by mapping each condition onto an equivalent hypothesis (or set of hypotheses), and

computing the union of these hypotheses.
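Concretely, the constraint can be pictured as the finite family of allowed disjunctions, as in the following sketch, which uses plain Python sets rather than the regular-grammar encoding actually used by RS-KII (all names are illustrative):

from itertools import combinations

MATERIALS = ['plastic', 'china', 'metal']
GRASP = ['cylindrical', 'has_handle']
COMMON = ['small', 'flat_bottom', 'open_top']

# The six sufficient conditions, each as the set of features that must be true.
SUFFICIENT = [frozenset([m, g] + COMMON) for m in MATERIALS for g in GRASP]

# Every specialization of the theory: a disjunction of one or more conditions.
SPECIALIZATIONS = {frozenset(combo)
                   for r in range(1, len(SUFFICIENT) + 1)
                   for combo in combinations(SUFFICIENT, r)}

def satisfies_theory_bias(hypothesis_terms):
    """hypothesis_terms: an iterable of terms, each a frozenset of features required true."""
    return frozenset(hypothesis_terms) in SPECIALIZATIONS

print(len(SUFFICIENT))        # 6 sufficient conditions
print(len(SPECIALIZATIONS))   # 63 allowed disjunctions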


In order to map the conditions, the hypothesis space language is assumed to

have features corresponding to each of the operational predicates in the theory. The

operational predicates are all Boolean valued (e.g., small(X) is either true or false),

so the corresponding feature is also Boolean valued. This leads to selectors such as

[small = true] and [small = false]. The features are listed below. Each feature

has the same name as its corresponding predicate.

plastic, china, metal, small, cylindrical, has_handle, flat_bottom, open_top

The regular grammar for the set of hypotheses corresponding to disjunctions of

sufficient conditions (specializations) of the domain theory is expressed as shown in

Figure 5.11. This grammar is a little less general than it could be, since it does not

allow all permutations of the selectors within each term. However, the more general

grammar contains considerably more rules, and permuting the selectors does not

change the semantics of a hypothesis.

C → TERM | C or TERM
TERM → SUFFICIENT-CONDITION
SUFFICIENT-CONDITION → (PLASTIC | CHINA | METAL) (HAS_HANDLE | CYLINDRICAL)
                        SMALL FLAT_BOTTOM OPEN_TOP

PLASTIC → [plastic = true]
CHINA → [china = true]
METAL → [metal = true]
HAS_HANDLE → [has_handle = true]
CYLINDRICAL → [cylindrical = true]
SMALL → [small = true]
FLAT_BOTTOM → [flat_bottom = true]
OPEN_TOP → [open_top = true]

Figure 5.11: Grammar for VL1 Hypotheses Satisfying the CUP Theory Bias.


5.3 An Induction Task

This section provides two concrete examples of AQ-11 and RS-KII solving simple

induction tasks. In the first task, both AQ-11 and RS-KII use only the knowledge

available to AQ-11. Both algorithms learn the same hypothesis. The second task is

a synthetic one designed to show the effects of making a domain theory available.

The theory is the CUP theory described in the previous section. AQ-11 cannot use

the theory, and so is forced to learn from examples and the LEF alone. RS-KII

has access to the domain theory as well, and this improves the accuracy of the

hypothesis.

5.3.1 Iris Task

The first induction task is taken from the Iris domain [Fisher, 1936]. The data set

contains three classes of fifty instances each. Each class corresponds to a different

kind of iris. The three classes are as follows.

1. Iris Setosa

2. Iris Versicolour

3. Iris Virginica

The goal is to learn a VL1 hypothesis that identifies one of the classes as distinct

from the other two. For this task, the goal is to learn the first class, Iris Setosa.

Instances are 4-tuples consisting of the values of four objective measurements

taken from a given plant. Specifically, these features are as follows (in order):

1. Sepal length in millimeters

2. Sepal width in millimeters

3. Petal length in millimeters

4. Petal width in millimeters

The values for these features are integers between 0 and 300.


5.3.1.1 Translators for Task Knowledge

The VL1 hypothesis space is parameterized with a list of features and a set of values for each feature. There are four features, each of which can be a positive integer. This leads to the following list of parameters for VL1, shown in Table 5.1. Instantiating VL1 with these parameters yields the grammar shown in Figure 5.12.

feature    values
f1         (1-9)(0-9)*
f2         (1-9)(0-9)*
f3         (1-9)(0-9)*
f4         (1-9)(0-9)*

Table 5.1: Parameters for VL1.

VL1 → TERM | TERM or VL1
TERM → SELECTOR | SELECTOR TERM
SELECTOR → "[" f1 RELATION (1-9)(0-9)* "]" |
           "[" f2 RELATION (1-9)(0-9)* "]" |
           "[" f3 RELATION (1-9)(0-9)* "]" |
           "[" f4 RELATION (1-9)(0-9)* "]"
RELATION → < | ≤ | = | ≠ | ≥ | >

Figure 5.12: Grammar for the Instantiated VL1 Language.

Translators are needed for the positive and negative examples. An example is

positive if it is an instance of the target class (e.g., Iris Setosa), and negative other-

wise. Examples are translated by the TranAQExample translator shown previously

in Figure 5.4.

The LEF is not specified by the Iris task. The information gain metric described earlier is a good general-purpose LEF, and is the one used in this example.

The translator for the LEF is the one described in Section 5.2.2.


5.3.1.2 Results of Learning

The concept definition learned by both AQ-11 and RS-KII for Iris Setosa is the

one-selector hypothesis [f3 < 30].

5.3.2 The CUP Task

The second task is a synthetic one based on the CUP domain. The knowledge consists

of a list of examples, an information-gain metric for evaluating hypotheses, and the

CUP domain theory of Figure 5.10. RS-KII can make use of all of this knowledge,

but AQ-11 cannot use the domain theory.

The features for this task are the operational predicates of the CUP theory, as

shown below. All of these features are Boolean valued.

f1: plastic      f4: cylindrical     f7: flat_bottom
f2: china        f5: has_handle      f8: open_top
f3: metal        f6: small

There are also two irrelevant features, color (f9) and percent-full (f10). The

color feature can take the values {red, blue, green}, and the percent-full feature

can take integer values between 0 and 100 indicating the amount of liquid in the

cup.

The target concept is "plastic cups with handles", which is a specialization of the CUP domain theory. The concept is represented by the VL1 hypothesis [f1 = true][f5 = true][f6 = true][f7 = true][f8 = true]. There are four examples of this concept, as shown in Table 5.2.

ID   class   f1  f2  f3  f4  f5  f6  f7  f8  f9     f10
e1   +       t   f   f   f   t   t   t   t   blue   10
e2   +       t   f   f   f   t   t   t   t   red    50
e3   -       f   t   f   f   t   t   t   t   green  60
e4   -       t   f   f   t   f   t   t   t   blue   20

Table 5.2: Examples for the CUP Task.

The other available knowledge sources are the CUP domain theory shown previ-

ously in Figure 5.10, and an information gain metric for evaluating hypotheses.


The examples are translated according to TranAQExample, as shown previously

in Figure 5.4. The domain theory is translated into (C, {}), where the regular

grammar for C is as shown in Figure 5.11. The information-gain metric is translated

according to the translator described in Section 5.2.2.

5.3.2.1 Results of Learning

AQ-11 has no access to the domain theory. It learns a hypothesis that is consistent

with the examples, and preferred by the information gain metric. It learns the

following hypothesis:

[f2 = false][f4 = false].

This hypothesis covers all objects that are neither china nor cylindrical, even if they are not cups.

In order to learn a more accurate concept, considerably more examples are necessary.

RS-KII can learn the correct hypothesis from the same number of examples. This

is because RS-KII has access to the domain theory, which provides a very strong bias

on the hypothesis space. It only considers hypotheses that are specializations of the

domain theory. This drastically reduces the search space, increasing the accuracy of

the induced concept. RS-KII learns the following concept:

[f1 = true][f5 = true][f6 = true][f7 = true][f8 = true]

This is the only hypothesis that is both consistent with the examples, and a special-

ization of the domain theory.

5.4 Complexity Analysis

The computational complexity of RS-KII depends on the number of knowledge frag-

ments it utilizes as well as the nature of those fragments. A general analysis of

RS-KII's complexity would depend on the nature of the knowledge, which is im-

possible to quantify meaningfully. However, it is possible to obtain a complexity

analysis for particular classes of knowledge. This is the approach taken in this sec-

tion to compare the complexity of RS-KII to AQ-11.


5.4.1 Complexity of AQ-11

In the following complexity analysis, e denotes the number of examples, k indicates

the number of features, and b is the width of the beam search.

5.4.1.1 Cost of AQ-11's Main Loop

The AQ-11 algorithm consists of a main loop that calls LearnTerm once per iteration.

On each iteration, the term found by LearnTerm is disjoined to the end of the current

hypothesis, and the positive examples covered by the term are removed from the pot.

Removing covered examples from the pot takes time proportional to e. The main

loop iterates until the pot is empty—i.e., every positive example is covered by at

least one of the terms. Since each term is guaranteed to cover at least one positive

example that the other terms do not cover, the main loop makes at most one call to

LearnTerm for each positive example. The worst-case complexity of the main loop

is therefore O(e(x + e)), where x is the complexity of LearnTerm.

5.4.1.2 Cost of LearnTerm

LearnTerm conducts a beam search through the space of terms to find a term that

covers a given positive example (the seed) and none of the negative examples. The

search maintains a "pot" of negative examples that are covered by at least one

term in the beam. A set of selectors is computed that covers the positive seed but

not the negative examples. Child terms are generated from the terms in the beam

by extending them with the selectors in this set. The child terms are evaluated

according to the LEF, and the best b terms are retained in the beam.

The cost of LearnTerm depends partly on the number of selectors that cover the

positive seed but not the negative example. In theory, this set can be quite large, or

even infinite, which would make AQ-11 intractable. In practice, the set can be made

much smaller with a simple bias. The generation of this set of selectors is discussed

below, and an upper bound on its size is derived. The cost of LearnTerm is then

computed using this bound.

Generating the Set of Selectors. The complete set of selectors that cover the

positive seed but not the negative example is generated as follows. Let vp,i be the value of the positive seed on feature fi, and let vn,i be the value of the negative example on feature fi. The union of the following sets of selectors over all k features is the set of selectors that cover the positive seed but not the negative example:

[fi = vp,i]                              if vp,i ≠ vn,i
[fi ≠ vn,i]                              if vp,i ≠ vn,i
{[fi < v] | vp,i < v ≤ vn,i}             if vp,i < vn,i
{[fi ≤ v] | vp,i ≤ v < vn,i}             if vp,i < vn,i
{[fi > v] | vn,i ≤ v < vp,i}             if vp,i > vn,i
{[fi ≥ v] | vn,i < v ≤ vp,i}             if vp,i > vn,i

The last four sets in this list can contain a large or even infinite number of selectors,

depending on whether the feature values map onto the integers or the reals. How-

ever, many of these selectors are redundant, in that they can be partitioned into

classes of "equivalent" selectors. For any hypothesis, one selector can be substituted

for any other in the same class, and the resulting hypothesis will still cover the

same examples. Of course, the two hypotheses may assign different classes to new

instances. This suggests a way to bias the set of selectors—namely, include only one

selector from each class in the set. As will be argued below, the number of classes

for a given feature is bounded by the number of examples. If there are k features,

the biased set of selectors contains at most O(ek) elements.

Consider the set of selectors {[fi < v] | vp,i < v ≤ vn,i} where vp,i < vn,i. A

selector partitions the examples into a set of covered and uncovered examples. The

selectors in this set can induce at most e different partitions on the examples. To

see why this is so, take the value of each example on feature fi, and order the values

from smallest to largest. If there are e examples, there are at most e partitions that

can be induced by selectors of the form [fi < v]: the first example in the list is

covered, the first two are covered, and so on up to all of the examples being covered.

For each of these partitions, the characteristic selector for the equivalence class is

[fi < v] where v is one of the values in the list of examples. The set of characteristic

selectors is {[fi < v] | vp,i < v ≤ vn,i and v is the value of some example on feature fi}.


The characteristic selectors for the other sets of selectors are computed similarly.

Let E be the set of positive and negative examples, and let πi(E) be the projection of the examples onto their values for feature fi—that is, πi(E) is the set of values held by one or more of the examples on feature fi. The set of characteristic selectors that cover the positive seed but not the negative seed is the union of the following sets for i between one and k:

{[fi = vp,i]}                                      if vp,i ≠ vn,i
{[fi ≠ vn,i]}                                      if vp,i ≠ vn,i
{[fi < v] | vp,i < v ≤ vn,i and v ∈ πi(E)}         if vp,i < vn,i
{[fi ≤ v] | vp,i ≤ v < vn,i and v ∈ πi(E)}         if vp,i < vn,i
{[fi > v] | vn,i ≤ v < vp,i and v ∈ πi(E)}         if vp,i > vn,i
{[fi ≥ v] | vn,i < v ≤ vp,i and v ∈ πi(E)}         if vp,i > vn,i

Each of these sets contains at most e elements, since πi(E) contains at most e = |E| values. The union of these sets over all k features contains at most O(ke) selectors. The time needed to generate this set is also O(ke).
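A sketch of this characteristic-selector computation, assuming numeric feature values and the (feature index, relation, value) selector encoding used earlier; names are illustrative:

def characteristic_selectors(pos_seed, neg_example, examples):
    """Return the biased set of selectors that cover pos_seed but not neg_example,
    keeping only one representative per equivalence class of example partitions."""
    selectors = []
    for i in range(len(pos_seed)):
        vp, vn = pos_seed[i], neg_example[i]
        if vp == vn:
            continue                        # no selector on fi separates the two
        values = {e[i] for e in examples}   # the projection pi_i(E)
        selectors.append((i, '=', vp))
        selectors.append((i, '!=', vn))
        if vp < vn:
            selectors += [(i, '<', v) for v in values if vp < v <= vn]
            selectors += [(i, '<=', v) for v in values if vp <= v < vn]
        else:
            selectors += [(i, '>', v) for v in values if vn <= v < vp]
            selectors += [(i, '>=', v) for v in values if vn < v <= vp]
    return selectors
# At most O(e) selectors per feature, so O(ke) in total.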

Cost of One Iteration. Each iteration of LearnTerm consists of six steps:

1. Select a negative example from the pot. This takes time 0(1).

2. Compute S, the set of selectors that covers the seed, but does not cover the

negative example selected in line (1). Computing S takes time 0(ek), and

contains 0{ek) selectors. The derivation of these results is given below.

3. Generate the children of the terms in the beam: {U • s \ U is a term in the

beam, and s G S}. Computing the children takes time 0(6150 = 0(bek).

4. Evaluate the children with the LEF. In most cases, the evaluation of a term

depends on the number of positive and negative examples the term covers.

The time to evaluate all the child terms is therefore bounded by O(bke^2).


5. Sort the children according to the LEF, and select the first b children as the new beam. This takes time O(b|S| log(b|S|)) = O(bek log(bek)). If the beam width is one, the new beam is computed by selecting the best child, which only takes time O(b|S|) = O(bek).

6. Remove from the pot of negative examples every example covered by none of

the terms in the beam. This takes time O(eb).

The total time for one iteration is bounded by the sum of these costs, or O(|S| + b|S| + b|S|e + b|S| log(b|S|) + eb), where |S| is the number of selectors that cover the seed but not the selected negative example. Since there are at most ek such selectors, substituting ek for |S| yields O(ek + bek + bke^2 + bek log(bek) + eb), or O(bke^2 + bek log(bek)).

When the beam width is one, the term bek log(bek) can be replaced by bek, in which case the time cost for one iteration is bounded by O(bke^2 + bek) = O(bke^2). Since the beam width (b) is one, this can be rewritten as O(ke^2).

Total Cost of LearnTerm. LearnTerm makes at most one iteration per negative

example. The number of negative examples is bounded by e, so the time complexity of one call to LearnTerm is bounded by O(bke^2 log(bke) + bke^3) when b > 1, or by O(ke^3) when b = 1. These costs are summarized in Equation 5.2 and Equation 5.3.

O(ke^3)    if b = 1    (5.2)

O(bke^2 log(bke) + bke^3)    if b > 1    (5.3)

5.4.1.3 Total Time Complexity for AQ-11.

The total cost of AQ-11 is O(e(x + e)), where x is the time complexity of LearnTerm.

This yields the following computational complexity for AQ-11:

O(bke^3 log(bke) + bke^4)    if b > 1    (5.4)

O(bke^4)    if b = 1    (5.5)

The bke^3 log(bke) term is the cost of expanding and pruning the beam, and the bke^4

term is the cost of evaluating the terms with the LEF.


5.4.2 Complexity of RS-KII when Emulating AQ-11

When emulating AQ-11, RS-KII induces a hypothesis by enumerating a single hy-

pothesis from the solution set of (C,P). This is accomplished by the branch-and-

bound algorithm described in Figure 4.14 of Chapter 4. The parameters to the

algorithm are shown in Figure 4.15 of the same chapter.

The branch-and-bound algorithm maintains L, a collection of subsets of C. On

each iteration, one of these subsets, Xs, is selected. Xs is compared to all of the

subsets in L. If Xs is not dominated by any of the subsets, then it contains only solu-

tions, and an element of Xs is enumerated. Otherwise, Xs is replaced by Split(Xs) in

the collection, and dominated subsets are pruned from the collection. This continues

until a hypothesis is enumerated, or L is empty.

The cost of one iteration depends on the number of subsets in L. An iteration

consists of four steps, the costs of which are analyzed below.

1. Select Xs from L. This can be done in constant time if L is maintained

according to the efficient schemes mentioned in Section 4.2.2.4.

2. Determine whether Xs is dominated by any of the subsets in L. This takes time $O(|L|\,t_{\prec})$, where $t_{\prec}$ is the cost of comparing two subsets with the dominance relation $D_{\prec}$.

3. Split Xs. A subset of C is of the form $w_i \cdot q_i$, where $w_i$ is some string and $q_i = \delta_C(\mathit{start}_C, w_i)$, the state reached by $w_i$ in C. Split($w_i \cdot q_i$) generates a child subset of $w_i \cdot q_i$ by extending $w_i$ with a symbol $\sigma$ from $\Sigma_C$ and changing $q_i$ to be the state reached by the new string, $w_i\sigma$. There is at most one child subset for each symbol, for a total cost of $O(|\Sigma_C|)$.

4. Replace Xs with Split(Xs) in L, and prune dominated subsets from L. None

of the existing subsets in L dominate each other. The child subsets need to be

compared to each other, and to each of the existing subsets in L.

There are at most $|\Sigma_C|$ child subsets, and a − 1 existing subsets, where a is the number of subsets in L prior to replacing Xs with its children. The children are added to the collection as follows. Each child is compared to the a − 1 existing subsets, and each of the a − 1 existing subsets is compared to the child. If an


existing subset is dominated, it is removed from the collection. If the child is

not dominated by any of the existing subsets, the child is added to the list.

Otherwise the child is discarded.

For the first child, there are 2(a − 1) comparisons. If the child is added to

the collection, the collection contains a subsets. Determining whether the

second child should be added requires 2a comparisons in this case. If each

child is added to the collection, the total number of comparisons required is $\sum_{i=0}^{|\Sigma_C|-1} 2((a-1)+i)$, for a total cost of $O((|\Sigma_C|^2 + |\Sigma_C||L|)\,t'_{\prec})$, where $t'_{\prec}$ is the cost of comparing two subsets with the domination relation $D'_{\prec}$.
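A rough Python sketch of this insertion step may make the comparison count easier to see. The dominates predicate stands in for the $D'_{\prec}$ test and the list L for the collection; both are assumptions made for illustration only.

# Sketch: add the children of Xs to L, pruning dominated subsets.
def add_children(L, children, dominates):
    for child in children:
        # remove existing subsets that the child dominates
        L = [x for x in L if not dominates(child, x)]
        # keep the child only if no remaining subset dominates it
        if not any(dominates(x, child) for x in L):
            L.append(child)
    return L

Each child is compared against roughly |L| subsets in each direction, which gives the $O((|\Sigma_C|^2 + |\Sigma_C||L|)\,t'_{\prec})$ cost derived above.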

The total cost of one iteration is the sum of these costs. The costs of the two dominance relations, $t_{\prec}$ and $t'_{\prec}$, are of the same order, so these values can be replaced

by a single variable, t. The resulting computational complexity for one iteration of

BranchAndBound-3 is shown in Equation 5.6.

$O((|\Sigma_C|^2 + |\Sigma_C||L|)\,t)$    (5.6)

The Size of L. L initially contains a single subset that is equivalent to C. In

each iteration, a subset is selected and split into at most $|\Sigma_C|$ subsets. L can grow

geometrically unless subsets are pruned aggressively.

The pattern of growth for L when RS-KII emulates AQ-11 is a cyclic one. It grows

geometrically for a few iterations, and is then pruned back to a single subset. Call the initial single subset $X_1 = w_1 \cdot q_1$. This is split into subsets by adding symbols from $\Sigma_C$ to the end of $w_1$, the prefix string for $X_1$. Symbols in $\Sigma_C$ are not selectors, but rather components of selectors (i.e., feature names, the six relations, and digits from

which to construct the value field). These subsets cannot be meaningfully compared

since their prefix strings end in partially constructed selectors. Additional symbols

are needed to form a complete selector, at which point the subsets can be compared.

Selectors are strings of up to l symbols, where l depends on the number of digits that can appear in the value field of a selector. Subsets with the shortest prefix strings are selected first, so the subsets are expanded breadth-first until all of the subsets have

complete selectors. L grows geometrically for these iterations.

At the end of this growth cycle, all of the subsets in L have complete selectors.

The subsets can be compared with the dominance relation $D_{\prec}$, and the dominated


subsets pruned from L. The relation prunes a subset $w_i \cdot q_i$ if there is another subset in the collection, $w_j \cdot q_j$, such that shuffle($w_i$, $w_j$) leads to an accept-all state in P and $w_j \cdot q_j$ is not empty (i.e., $q_j$ is not a dead state in C). Omitting the empty test yields the relation $D'_{\prec}$, which is used to order the subsets and to select the subset Xs from the collection.

Since P is a lexicographic ordering, if $w_i$ comes before $w_j$, then every extension of $w_i$ also comes before $w_j$ and its extensions. This means that shuffle($w_i$, $w_j$) leads to an accept-all state in P. Since a lexicographic ordering is a total ordering, for every pair of subsets, $w_i \cdot q_i$ and $w_j \cdot q_j$, in the collection, either $w_i$ comes before $w_j$, or $w_j$ comes before $w_i$, or $w_i$ and $w_j$ are equivalent. In all but the last case, one of

the two subsets is dominated, assuming the other is not empty. In a total ordering

there can be at most one maximal element, so all but one subset can potentially be

pruned, assuming the dominant subset can be shown to be non-empty.

However, it is not usually possible to tell whether a subset is empty, since C's

Dead function returns unknown for most states. Therefore, L is not pruned. However,

the way in which subsets are selected mitigates against this. The subsets are ordered

according to $D'_{\prec}$, and the best subset is selected as Xs in the next iteration. The $D'_{\prec}$ relation is exactly $D_{\prec}$ without the empty test. The branch-and-bound algorithm sorts the subsets in L once in each iteration according to $D'_{\prec}$. L is stored as a lattice reflecting the known dominance relations among the subsets. The elements in the top of the order are the undominated elements according to $D'_{\prec}$. After splitting a subset

sufficiently, it may be identified as either empty or non-empty instead of unknown.

If it is non-empty, all the subsets dominated by it in L are pruned. Otherwise, the

empty subset is removed from L.

The representation for L effectively allows the subsets to be compared according to $D'_{\prec}$ instead of $D_{\prec}$. The subsets at the top of the lattice are the undominated elements according to $D'_{\prec}$. Since $D'_{\prec}$ does not check for emptiness, it allows a subset

to be dominated by an empty subset. However, dominated subsets are not pruned.

They are only moved lower in the lattice. If the dominating subset later turns out

to be empty, the dominated subsets may become new top elements. This allows

incorrectly "pruned" subsets to be reinstated in L. Since $D'_{\prec}$ is a total ordering,

and defined over all subsets that have prefix strings with complete selectors, there is

at most one subset at the top of the lattice after the growth stage. When children

are added, they are only compared to the subsets at the top of the lattice, since if


they dominate these elements, they certainly dominate those below them. In effect, |L| is always one after each growth stage.

The size of L grows $|\Sigma_C|$-fold until |L| is the number of selectors (i.e., L contains all one-selector extensions of the initial single subset, $X_1$). There is one iteration where $|L| = 1$, $|\Sigma_C|$ iterations where $|L| = O(|\Sigma_C|)$, $|\Sigma_C|^2$ iterations where $|L| = O(|\Sigma_C|^2)$, and so on up to $|\Sigma_C|^l$ iterations where $|L| = O(|\Sigma_C|^l)$. The cost of the last tier dominates the cost of all the other tiers. The cost of this tier is $|\Sigma_C|^l$ times the cost of an iteration in which $|L| = |\Sigma_C|^l$. The cost of the growth stage is bounded by this cost, which is $O(|\Sigma_C|^{2l}\,t)$.
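Summing the tiers explicitly makes the bound concrete. Assuming, as the per-tier bounds above imply, that a tier-j iteration costs $O(|\Sigma_C|^{j}\,t)$:

$$\sum_{j=0}^{l} |\Sigma_C|^{j} \cdot O(|\Sigma_C|^{j}\,t) = O\Big(t \sum_{j=0}^{l} |\Sigma_C|^{2j}\Big) = O(|\Sigma_C|^{2l}\,t),$$

since the geometric sum is dominated by its largest term.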

The maximum size to which L can grow is the number of selectors with which the

initial single subset, $X_1$, can be extended. There are at most O(ek) such selectors, as was shown in Section 5.4.1. Since the maximum size of L is $|\Sigma_C|^l$, this term can be replaced with ek in $O(|\Sigma_C|^{2l}\,t)$. This yields the cost complexity bound for the growth stage

shown in Equation 5.7. This cost includes pruning L back to a single element.

$O((ek)^2\,t)$    (5.7)

A Special Case. The dominance relation $D'_{\prec}$ is not entirely accurate, in that it allows a subset $w_1 \cdot q_1$ to be dominated by $w_2 \cdot q_2$ if $q_2$ is a dead state. This means a subset can be dominated by an empty subset, which is not correct. This approximation is necessitated by the expense of determining whether $q_2$ is dead. Dominated subsets are removed from L, but since $D'_{\prec}$ is not entirely correct, some subsets may be removed because $D'_{\prec}$ thinks they are dominated by subsets that are in fact empty.

This is mitigated by maintaining L in such a way that a dominated subset can be

reinstated to L if the subsets that supposedly dominate it later turn out to be empty.

The selected subset, Xs, is empty if for each of its child subsets, $w_i \cdot q_i$, $q_i$ is a

recognizable dead state. In this case, Xs is marked as empty, and the subsets that

Xs was thought to dominate are reinstated to L if they are not otherwise dominated.

The emptiness of Xs is propagated to the supersets of Xs, since some of these may

also be empty. If these are in fact empty, then the subsets they dominated may also

be reinstated to L. All of the reinstated subsets are then compared to the other

subsets of L as the children of Xs have been. The complexity of this step depends

on the number of subsets dominated by Xs and its empty parent subsets.


However, Xs is never empty when RS-KII is emulating AQ-11. This is because

all of the dead states in C are immediately recognizable. C is an intersection of

constraint sets from each example. A constraint set for an example contains the

hypotheses that are consistent with the example, and an intersection of two of these

sets, $C_1 \cap C_2$, contains the hypotheses consistent with both examples. A partial hypothesis, w, leads to a state $w \cdot (q_1, q_2)$ in the intersection. The state $(q_1, q_2)$ is recognizably dead if either $q_1$ or $q_2$ is dead; that is, if there is no extension of hypothesis w that is consistent with one of the examples. The state is dead but not recognizably dead if and only if there is one extension of w that is consistent with the first example, and a second extension of w that is consistent with the second example, but no extension that is consistent with both examples. This never happens, since any $VL_1$ hypothesis can

examples are not consistent, then C is immediately recognizable as empty.

Since Xs is never empty unless C is empty, none of the dominated subsets ever

need to be reinstated. The cost of the reinstatement operation is therefore omitted

from the computational complexity of RS-KII when emulating AQ-11.

Number of Iterations. The search algorithm makes one iteration for every se-

lector in the induced hypothesis. The hypothesis induced by RS-KII is consistent

with the positive and negative examples. Furthermore, each selector in the hypoth-

esis makes progress towards consistency with the examples, and no selector or term

is added unnecessarily (e.g., once a term is consistent with the negative examples,

additional selectors are not appended to it). These latter conditions are guaranteed

by the LEF. By the same arguments used in the analysis of AQ-11, the induced

hypothesis contains at most one term per positive example, and each term contains

at most one selector per negative example. If there are p positive examples, and n

negative examples, the hypothesis contains at most pn selectors.

Total Computational Complexity. The search algorithm makes at most pn

iterations. The cost to induce a hypothesis is therefore pn times the cost of the

growth stage, in which all of the selectors are generated (Equation 5.7). This yields


the cost complexity for RS-KII shown in Equation 5.8 when RS-KII uses only the

knowledge used by AQ-11.

$O(pn(ek)^2\,t)$    (5.8)

The cost of comparison, t, is proportional to the cost of computing the next-state

function in P. A state in P is a vector of length e, so the cost of computing the

next-state is O(e). Substituting e for t, and e^2 for pn, in Equation 5.8 yields the

following computational complexity for RS-KII when emulating AQ-11:

$O(e^5 k^2)$    (5.9)
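The substitution is straightforward arithmetic; writing it out once:

$$O(pn(ek)^2\,t) = O(e^2 \cdot e^2 k^2 \cdot e) = O(e^5 k^2).$$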

The complexity of AQ-11 with a beam size of one is O(e^4 k). The complexity

of RS-KII is worse by a factor of ek. This comes from the comparisons between

incomplete selectors performed during the growth stage. If the Split function is

defined so that the string w in subset (w, q) is extended by a full selector instead of

by a single symbol, then the complexity is O(ke^4), the same as for AQ-11.

5.5 Summary

RS-KII can utilize all of the knowledge used by AQ-11 when the beam size is one,

and thus functionally subsumes AQ-11 for this beam width. The complexity of RS-

KII when utilizing only the knowledge of AQ-11 is a little worse than that of AQ-11,

but still polynomial in the number of examples and features. RS-KII can also utilize

a domain theory, which AQ-11 cannot. RS-KII can utilize a domain theory at the

same time it is using AQ-11 knowledge, effectively integrating this knowledge into

AQ-11.


Chapter 6

RS-KII and IVSM

Incremental version space merging (IVSM) [Hirsh, 1990] is an extension of Mitchell's

candidate elimination algorithm (CEA) [Mitchell, 1982] that can utilize noisy exam-

ples, domain theories, and other knowledge sources in addition to the noise-free

examples to which CEA is limited. IVSM strictly subsumes CEA. Each fragment of

knowledge is translated into a version space of hypotheses that are consistent with

the knowledge. The version spaces for each knowledge fragment are intersected,

yielding a new version space consistent with all of the knowledge.

IVSM is exactly an instantiation of KII in which constraints are represented as

version spaces, or convex sets, and preferences are represented by the null represen-

tation, which can express only the empty set (i.e., no preference information can

be utilized). This instantiation is called CS-KII, for convex-set KII. The IVSM and

CEA algorithms are described further in Section 6.1, and the equivalence of IVSM

and CS-KII is demonstrated.

RS-KII can utilize the same knowledge as IVSM, and thereby subsume both

IVSM and CEA, but only for some hypothesis spaces. IVSM expresses knowledge

as a set of hypotheses that are consistent with the knowledge. For some hypothesis

spaces, every subset of the hypothesis space that can be expressed as a convex set

can also be expressed as a regular set. In these spaces, every knowledge fragment

that can be expressed as a convex set in IVSM can be expressed as an equivalent

regular set in RS-KII. These are the hypothesis spaces for which RS-KII subsumes

IVSM. For other spaces, there are at least some subsets that can be expressed as

a convex set but not as a regular set. For these spaces, RS-KII and IVSM overlap

in the knowledge they can use, or the knowledge utilizable by IVSM and RS-KII is


disjoint. There are no non-trivial spaces for which IVSM subsumes RS-KII, since

for every non-trivial hypothesis space there is at least one regular set that cannot be

expressed as a convex set. The expressiveness of RS-KII and IVSM is investigated

further in Section 6.2.1, and sufficient conditions are identified for hypothesis spaces

for which RS-KII subsumes IVSM.

One class of hypothesis spaces for which RS-KII subsumes IVSM is the class

of spaces described by conjunctive feature languages. These are the hypothesis

spaces commonly used by CEA and IVSM. This class is defined more precisely

in Section 6.2.2.1. The subsumption of IVSM by RS-KII for this class of spaces is

investigated in this section as well. This includes both a general proof, and empirical

support in the form of translators for a number of common knowledge sources that

are also used by IVSM.

RS-KII also subsumes IVSM in terms of computational complexity for this class

of hypothesis spaces. Section 6.3 compares the complexity of set operations in the

regular and convex set representations for subsets of these hypothesis spaces. For

these spaces, the complexity of RS-KII is up to a squared factor less than that

of IVSM, and in all cases RS-KII's worst case complexity is bounded by IVSM's

worst case complexity. For at least one space, there is a set of examples for which

the complexity of IVSM is exponential in the number of examples, but for which

the complexity of RS-KII is only polynomial. These examples are the well known

examples demonstrated by Haussler [Haussler, 1988] to cause exponential behavior

in CEA. The behavior of RS-KII for these examples is investigated in Section 6.4.

Concluding remarks are made in Section 6.5.

6.1 The IVSM and CEA Algorithms

6.1.1 The Candidate Elimination Algorithm

The candidate elimination algorithm (CEA) [Mitchell, 1982] induces a hypothesis

from positive and negative examples of the target concept. The implicit assumption

of the algorithm is that the target concept is both a member of the hypothesis space

and consistent with all of the examples. CEA maintains a set of hypotheses, called

a version space, that consists of the hypotheses consistent with all of the examples


seen so far. Examples are processed incrementally. Each example is processed by

removing hypotheses from the version space that are inconsistent with the example.

If the example is positive, then all hypotheses not covering the example are removed.

Conversely, a negative example is processed by removing the hypotheses that do

cover the example.

If after processing an example the version space is empty, then the algorithm

halts. No hypothesis in the hypothesis space is consistent with all of the examples,

and the version space is said to have collapsed. If the version space contains only

a single hypothesis, then the version space has converged. This single hypothesis is

uniquely identified by all of the examples seen so far. If the implicit assumptions of

CEA are correct, then this hypothesis is the target concept.

If there are multiple hypotheses in the version space, then it is assumed that one

of them is the target concept. However, the target concept can not be discriminated

from the other candidates without observing additional examples. If more examples

are available, they can be processed in the hopes that the version space will con-

verge. Otherwise, a hypothesis can be selected arbitrarily from the version space as

the induced hypothesis, or the entire version space can be used to classify unseen

instances as follows. If the instance is covered by all of the hypotheses in the version

space, then the instance is also covered by the target concept and is classified as

a positive instance. Likewise, if none of the hypotheses in the version space cover

the example, neither does the target concept. The instance is classified as negative.

Otherwise, the instance may or may not be covered by the target concept. In this

case, the instance is classified as unknown.

6.1.2 Incremental Version Space Merging

Incremental version-space merging (IVSM) [Hirsh, 1990] is an extension of the candi-

date elimination algorithm that learns from non-example knowledge as well as from

examples. Each knowledge fragment is translated into a constraint on the hypothesis

space, and this constraint is represented as a version space of hypotheses that satisfy

the constraint. The version spaces for each fragment are intersected to produce a

single version space consistent with all of the knowledge. Every hypothesis in this


set satisfies all of the constraints, and is therefore equally preferred by the knowledge

for being the target concept.

When the knowledge consists solely of noise-free examples, IVSM computes ex-

actly the same version space as CEA, and with essentially the same time and space

complexities [Hirsh, 1990]. That is, IVSM subsumes CEA.

A hypothesis can be selected arbitrarily from the final version space as the target

concept, or instances can be classified against all of the hypotheses in the version

space, as was described for CEA. Other queries about the version space are also

possible. The queries that have generally proven useful are membership, emptiness

(collapse), uniqueness (convergence), and subset [Hirsh, 1992].

6.1.3 Convex Set Representation

In both IVSM and CEA, the version space is represented as a convex set. A convex

set consists of all hypotheses "between" two boundary sets, S and G, where S is the

set of most specific hypotheses in the version space and G is the set of most general

hypotheses in the version space. A hypothesis is in the version space if it is more

general than or equivalent to some hypothesis in S, and is more specific than or

equivalent to some hypothesis in G.

Formally, a convex set is specified by the tuple $\langle H, \prec, S, G\rangle$, where H is the hypothesis space, $\prec$ is a partial ordering of generality over H, and S and G are the boundary sets. When the values of H and $\prec$ are obvious, a convex set can be written as just $\langle S, G\rangle$. The convex set $\langle H, \prec, S, G\rangle$ consists of all hypotheses h such that $s \preceq h \preceq g$ for some s in S and g in G. The relation $\preceq$ is derived from $\prec$, such that $x \preceq y$ iff $x \prec y$ or $x = y$.

$\langle H, \preceq, S, G\rangle = \{h \in H \mid \exists s \in S\ \exists g \in G\ (s \preceq h \preceq g)\}$    (6.1)
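A small Python sketch may help make Equation 6.1 concrete. The boundary sets are ordinary collections and more_general_or_equal(x, y) is an assumed helper implementing $y \preceq x$; this is an illustration, not IVSM's implementation.

# Sketch: membership test for a convex set (version space) given its boundary sets.
def in_version_space(h, S, G, more_general_or_equal):
    # h is in <S, G> iff it lies above some s in S and below some g in G
    bounded_below = any(more_general_or_equal(h, s) for s in S)
    bounded_above = any(more_general_or_equal(g, h) for g in G)
    return bounded_below and bounded_above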

6.1.4 Equivalence of CS-KII and IVSM

IVSM essentially provides a collection of operations on convex sets: translation of

knowledge into constraints expressed as convex sets, intersection of convex sets,

enumeration of hypotheses from a convex set, and queries on convex sets such as


membership, emptiness, uniqueness and subset. These correspond to the operations

provided by KII.

IVSM is equivalent to CS-KII, an instantiation of KII in which constraints are

represented as convex sets, and preferences are represented by the null represen-

tation, which can represent only the empty set (i.e., preference information is not

allowed). Both CS-KII and IVSM represent knowledge as convex sets, and provide

the same operations on convex sets.

In IVSM, each knowledge fragment is translated into a convex set containing

all of the hypotheses that are consistent with the knowledge. The fragments are

integrated by intersecting their corresponding convex sets into a single set contain-

ing the hypotheses consistent with all of the fragments. This set corresponds to

the version space in CEA. A hypothesis can be selected from this set as the target

concept, or instances can be classified against the entire set, in the same way that

CEA classifies instances against the version space. IVSM also supports other op-

erations on the convex set that represents all of the integrated knowledge, namely

membership, emptiness (collapse), and uniqueness (convergence).

CS-KII represents knowledge in essentially the same way, and provides the same

operations. In CS-KII, a knowledge fragment is translated into an (H, C, P) tuple,

where H and C are convex sets, and P is the empty set. The representation for P

can represent only the empty set, so P is always empty, regardless of the knowledge.

The C set contains all of the hypotheses that are consistent with the knowledge

fragment. This is exactly the same convex set that IVSM would use to represent the

knowledge fragment. IVSM and CS-KII both represent a given knowledge fragment

in terms of constraints on the hypothesis space, which is represented as a convex set

of the hypotheses satisfying the constraints. Neither IVSM nor CS-KII can represent

knowledge in terms of preference information. In CS-KII, this lack of knowledge is

expressed explicitly by an empty P set. In IVSM, the lack of preference knowledge

is implicit, so the P set can be omitted. The two representations are equivalent.

Knowledge is integrated in CS-KII by intersecting (H, C, P) tuples. The C sets

of the pairs are intersected, and the P sets are unioned. Since the P sets are

always empty, the union of two P sets is also empty. The result of integrating

several knowledge fragments is (H, C, ∅), where C is the convex set containing the

hypotheses consistent with all of the integrated knowledge fragments. This C set is


the same convex set that IVSM would use to represent the same collection of

knowledge fragments. Each fragment is represented by the same convex set in both

IVSM and CS-KII, and both IVSM and CS-KII integrate knowledge by intersecting

these convex sets.

Finally, CS-KII provides the same operations on convex sets as IVSM. In IVSM

the queries are applied to the version space, and in CS-KII they are applied to the

solution set of (C, P). Since P is always empty, the solution set of (C, P) is just C.

This is the same convex set computed by IVSM. The queries in IVSM and CS-KII

therefore both apply to C, the convex set of hypotheses consistent with all of the

knowledge. The provided queries are membership, emptiness, and uniqueness. Both

IVSM and CS-KII can also test whether C is a subset of another convex set. CS-KII

also provides an explicit operator for enumerating hypotheses from C. IVSM does

not provide an explicit enumeration operator, but its existence is implied by IVSM's

ability to select an arbitrary hypothesis from the version space as the target concept.

In both IVSM and CS-KII, hypotheses are enumerated from a convex set, so they

can both use the same implementation of the enumeration operator.

IVSM provides one additional operation that CS-KII does not provide directly.

This is the classification of instances against the entire version space (see Sec-

tion 6.1.1). CS-KII does not provide this classification operation directly, but it

can be implemented in terms of two operations that CS-KII already does provide:

translation and the subset query. The implementation is shown in Figure 6.1. The

idea is to translate the instance as if it were a positive example. This yields the

set of hypotheses that cover the instance. If the version space (solution set) is a

subset of this set, then every hypothesis in the version space covers the instance.

The instance is assigned the positive classification. A similar procedure determines

whether every hypothesis fails to cover the instance: translate the instance as if it

were negative, and determine whether the solution set is a subset of the resulting set.

If the version space is a subset of neither translation, then the instance is assigned

the unknown classification.

CS-KII and IVSM are equivalent. Both represent knowledge as convex sets of

hypotheses consistent with the knowledge, and both provide the same operations

on this representation. These operations are implemented the same way in both


IVSM and CS-KII, except that CS-KII represents the lack of preference knowledge explicitly with an empty P set.

Algorithm Classify(i, (C, P))
    i: unclassified instance
    (C, P): a COP
BEGIN
    (C+, P+) <- TranExample(H, ≺, (i, positive))
    (C-, P-) <- TranExample(H, ≺, (i, negative))
    IF Subset((C, P), (C+, P+)) THEN
        RETURN positive
    ELSE IF Subset((C, P), (C-, P-)) THEN
        RETURN negative
    ELSE
        RETURN unknown
    END IF
END Classify

Figure 6.1: Classify Instances Against the Version Space.

The representation for P can express only the empty set. One could imagine

an instantiation of KII in which both C and P were represented as convex sets.

However, convex sets are not closed under union [Hirsh, 1990], so integration would

not be defined in this instantiation of KII.

The candidate elimination algorithm is subsumed by both IVSM and CS-KII.

This follows from the equivalence of CS-KII and IVSM, and the fact that IVSM

subsumes CEA [Hirsh, 1990]. One might ask whether AQ-11 is also subsumed

by CS-KII and IVSM, at least to the extent that AQ-11 is subsumed by RS-KII

(see Chapter 5). The answer is a qualified no. AQ-11 finds a $VL_1$ hypothesis that

is strictly consistent with the examples and preferred by the LEF. The ability to

express preference information is required in order to utilize the LEF, but CS-KII

cannot utilize preference information. The only P set that can be expressed in CS-

KII is the empty set. Extending the P representation to convex sets does not help

much either, since convex sets are not closed under union, which makes it impossible

to integrate (C, P) tuples.


However, CS-KII as it stands can at least find a $VL_1$ hypothesis consistent with the examples. The generality ordering over $VL_1$ hypotheses is well defined, so convex subsets of $VL_1$ can certainly be expressed. Each example is translated as a convex

set of hypotheses consistent with the example, the convex sets are intersected, and

a hypothesis is selected arbitrarily from the intersection as the target concept.

6.2 Subsumption of IVSM by RS-KII

RS-KII subsumes IVSM, but only for hypothesis spaces for which every convex subset

of the space is also expressible as a regular set. In general, the convex set repre-

sentation and the regular set representation overlap in expressiveness, but neither

subsumes the other.

Section 6.2.1 discusses the relative expressiveness of the two representations in

general terms, and identifies conditions for which every convex subset of hypothesis

space can also be expressed as a regular set. These are the hypothesis spaces for

which RS-KII subsumes IVSM.

A specific class of hypothesis spaces for which RS-KII subsumes IVSM is iden-

tified in Section 6.2.2.1. Since RS-KII subsumes IVSM for these spaces, it should

be possible to construct RS-KII translators for the knowledge used by IVSM. Sec-

tion 6.2.3 provides such translators as additional empirical support for the subsump-

tion of IVSM by RS-KII.

6.2.1 Expressiveness of Regular and Convex Sets

Convex sets and regular sets overlap in expressiveness. There is some knowledge

that can be expressed in both representations, and some that can be expressed in

one representation but not the other.

The convex set representation places no restrictions on either the hypothesis

space or the partial ordering of generality. The hypothesis space and the generality

relation could require arbitrarily powerful Turing machines to represent them. This

means that convex sets can express many sets that regular sets cannot, at least in

theory. In practice, operations on convex sets—such as finding the most specific


common generalization of two hypotheses—may be computationally infeasible for

sufficiently expressive hypothesis spaces or generality orderings.

Although the convex set representation is very expressive, it does not subsume

the regular set representation. For every partial ordering having transitive chains of

length three or more (e.g., $w \prec x \prec y$), there are regular subsets of the hypothesis

space that cannot be expressed as convex sets with that ordering. For example,

consider a set containing two elements, g and s, where g is a very general hypothesis,

and s is a very specific hypothesis. If there is at least one hypothesis between these

two hypotheses in the partial ordering, such that $s \prec x \prec g$, then the set {s, g}

cannot be represented by a convex set since every convex set containing s and g

must also contain x. However, the set {s, g} can be represented by a regular set.

There are only two elements, and any finite set can be expressed as a regular set.

Another way of stating this phenomenon is that convex sets cannot have "holes",

but regular sets can. A convex set must contain every hypothesis between the

boundary sets. It cannot exclude hypotheses unless all of the remaining hypotheses

are between some other pair of boundary sets. Regular sets, on the other hand, can

represent sets with holes.

Whether a given subset of the hypothesis space has a "hole" depends on the

generality ordering. Under one ordering, $\prec$, a given set may have a hole, but in another ordering, $\prec'$, the same set may not have a hole. However, there will be other subsets of the hypothesis space that do have holes in the $\prec'$ ordering. At least some of these

can be expressed as regular sets.

6.2.2 Spaces for which RS-KII Subsumes CS-KII

For some hypothesis spaces, every convex subset of the space can be expressed as

a regular set. These are the hypothesis spaces for which RS-KII subsumes CS-KII.

A convex set, $\langle S, G, \preceq, H\rangle$, contains all hypotheses that are more specific than or

equal to some element of G, and more general than or equivalent to some element

of S. S is the set of maximally specific elements in the set, and G is the set of

maximally general elements, where every element of S is more specific than one or


more elements of G. A convex set $\langle S, G, \preceq, H\rangle$ is therefore equivalent to the following set:

$S \cup G \cup \bigcup_{s \in S,\, g \in G} \left(\{h \in H \mid s \preceq h\} \cap \{h \in H \mid h \preceq g\}\right)$    (6.2)

Every convex subset of H can be expressed as a regular grammar if for every

hypothesis x in the hypothesis space, the sets $\{h \in H \mid x \preceq h\}$ and $\{h \in H \mid h \preceq x\}$

can be expressed as regular sets. This follows from closure of regular sets under

intersection and union, and the finiteness of the S and G sets.

It is difficult to determine whether an arbitrary hypothesis space satisfies this

condition, and is therefore a hypothesis space for which RS-KII subsumes CS-KII.

However, there is a generalization that is easier to test, namely that both the hypoth-

esis space (H) and the generalization ordering ($\preceq$) can be expressed as regular sets. The set $\{h \in H \mid h \preceq x\}$ is equivalent to $\mathit{first}((H \times \{x\}) \cap R_{\preceq})$, where $R_{\preceq}$ is a grammar encoding the generality ordering $\preceq$; that is, $R_{\preceq} = \{(x, y) \in H \times H \mid x \preceq y\}$. Since regular grammars are closed under intersection, Cartesian product, and projection, $\{h \in H \mid h \preceq x\}$ is expressible as a regular set for every hypothesis x. Similarly, the set $\{h \in H \mid x \preceq h\}$ is equivalent to $\mathit{second}((\{x\} \times H) \cap R_{\preceq})$, and is therefore expressible as a regular grammar as long as both H and $R_{\preceq}$ are expressible as regular

grammars.

6.2.2.1 Conjunctive Feature Languages

One class of languages for which RS-KII subsumes CS-KII is a subset of the con-

junctive feature languages. A hypothesis in a conjunctive feature language is a

conjunct of feature values, with one value for each of the features. For example, in

a language with k features, hypotheses are of the form $(V_1, V_2, \ldots, V_k)$, where $V_i$ is an abstract value corresponding to a set of ground values. An instance is a vector of ground values, $(v_1, v_2, \ldots, v_k)$. An instance is covered by a hypothesis if $v_i \in V_i$ for every i between one and k. If the "single representation trick" is used (e.g., [Dietterich et al., 1982]), there is a corresponding abstract value for each ground value v.

Feature values consist of both ground values and abstract values. The ground

values are the specific values that the feature can take, and the abstract values

correspond to sets of ground values. For example, the feature color may have

ground values such as green and red, and abstract values such as dark-color and


light-color. The abstract values correspond to sets of ground values. Features may

be continuously valued or discrete. If the ground values are continuously valued

(e.g., the real numbers) then the abstract values are sets of ground values, such as

the set of real numbers between one and five, inclusive.

The abstract values for each feature can be partially ordered by generality. Value $V_1$ is more general than value $V_2$ if $V_2$ is a subset of $V_1$. If neither value is a subset of the other, then there is no generality ordering between them. The generality ordering over the abstract values is called a generalization tree. A value in the tree is more general than its descendants. Typically, the leaves of the tree are abstract

values that cover only a single ground value.

The conjunctive feature languages are a family of languages. The family is pa-

rameterized by the number of features, and a generalization tree for each of the

features. A generalization tree is a tuple (F, -<), where F is a set of abstract values,

and x -< y means that abstract value x is more specific than abstract value y.

The grammar for the language is a concatenation of the grammars for the values

of each feature. $G(F_i)$ is the grammar for the abstract values of feature $F_i$.

$\mathit{ConjunctiveFeatureLanguage}((F_1, \prec_1), \ldots, (F_k, \prec_k)) \rightarrow G(F_1) \cdot G(F_2) \cdot \ldots \cdot G(F_k)$    (6.3)

The generality relation among hypotheses is derived from the relations on in-

dividual features. A hypothesis, $(x_1, x_2, \ldots, x_k)$, is more general than or equal to $(y_1, y_2, \ldots, y_k)$ if $y_i \preceq_i x_i$ for every i between one and k. Hypothesis $(x_1, x_2, \ldots, x_k)$ is strictly more general than $(y_1, y_2, \ldots, y_k)$ if there is at least one feature, $f_i$, such that $y_i \prec_i x_i$, and for the remaining features, $f_{j \ne i}$, $y_j \preceq_j x_j$. The strict relation is stated formally in Equation 6.4, and $\preceq$ is defined formally in Equation 6.5.

$(x_1, x_2, \ldots, x_k) \prec (y_1, y_2, \ldots, y_k)$ iff $(\exists i \in \{1, \ldots, k\}\ x_i \prec_i y_i)$ and $(\forall j \in \{1, \ldots, k\}\ x_j \preceq_j y_j)$    (6.4)

$(x_1, x_2, \ldots, x_k) \preceq (y_1, y_2, \ldots, y_k)$ iff $\forall j \in \{1, \ldots, k\}\ x_j \preceq_j y_j$    (6.5)
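To make the derived ordering concrete, here is a minimal Python sketch in which each abstract value is represented simply as a set of ground values; this representation is an illustrative assumption, not the one used in the dissertation.

# Sketch: coverage and generality for conjunctive feature hypotheses.
def covers(hypothesis, instance):
    # (V_1, ..., V_k) covers (v_1, ..., v_k) iff v_i is in V_i for every i
    return all(v in V for V, v in zip(hypothesis, instance))

def more_general_or_equal(x, y):
    # x is more general than or equal to y (Equation 6.5, read as y preceq x)
    return all(Yi <= Xi for Xi, Yi in zip(x, y))

def strictly_more_general(x, y):
    # Equation 6.4: componentwise containment, strict on at least one feature
    return more_general_or_equal(x, y) and any(Yi < Xi for Xi, Yi in zip(x, y))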


6.2.2.2 Conjunctive Languages where RS-KII Subsumes CS-KII

RS-KII subsumes CS-KII for hypothesis spaces in which both the hypothesis space

and the generality ordering over the hypotheses can be expressed as regular gram-

mars. This is true of conjunctive feature languages in which the values for each

feature can be expressed as a regular grammar, and the generalization tree for each

feature can be expressed as a regular grammar. The grammar for the hypothesis

space is the concatenation of the grammars for each feature.

The grammar for the generality ordering, $\prec$, over hypotheses is defined as follows. Let $R_{\prec,i}$ be the regular set $\{(x, y) \in F_i \times F_i \mid x \prec_i y\}$, which represents $\prec_i$, the generalization hierarchy for feature $f_i$. Let $R_{\preceq,i}$ be the regular set $\{(x, y) \in F_i \times F_i \mid x \preceq_i y\}$, where $x \preceq_i y$ if $x \prec_i y$ or $x = y$. This is the union of $R_{\prec,i}$ and $R_{=,i}$, where $R_{=,i} = \{(w, w) \mid w \in F_i\}$. The generalization hierarchy, $\prec$, over hypotheses is constructed from the regular sets for the individual feature hierarchies as follows:

$R_{\prec} = \bigcup_{i=1}^{k} R_{\preceq,1} \cdot \ldots \cdot R_{\preceq,i-1} \cdot R_{\prec,i} \cdot R_{\preceq,i+1} \cdot \ldots \cdot R_{\preceq,k}$    (6.6)

If both the hypothesis space and the generality orderings for each feature, $R_{\prec,i}$ and $R_{\preceq,i}$, can be expressed as regular grammars, then every convex subset of the

hypothesis space is expressible as a regular set. When this is true, any knowledge

that can be expressed by CS-KII can also be expressed by RS-KII.

What generality orderings over features can be expressed as regular grammars?

A generality ordering, $R_{\prec,i}$ or $R_{\preceq,i}$, is a subset of $F_i \times F_i$. If $F_i$ can be expressed as a regular grammar, then so can $F_i \times F_i$, since regular grammars are closed under Cartesian product. However, not every subset of $F_i \times F_i$ can be expressed as a regular grammar (see Section 4.1.2.4), so not all generality orderings over $F_i$ are expressible

as regular grammars.

It is difficult to specify necessary and sufficient conditions for which a given

subset of $F_i \times F_i$ can be expressed as a regular grammar. However, it is possible to

identify some sufficient conditions. For example, any finite set can be expressed as

a regular grammar, so any generality ordering over $F_i$ can be expressed as a regular grammar if $F_i$ is finite. Likewise, the equality relation, $R_=$, can also be expressed as a regular grammar if $F_i$ is finite.


When $F_i$ is infinite, some but not all orderings can be expressed. $R_=$ can always be expressed as a regular set under the shuffle mapping if $F_i$ is a regular set (see Section 4.1.2.4). However, only some generality orderings can be expressed as regular grammars. Among the subsets of $F_i \times F_i$ that can be expressed as regular grammars under the shuffle mapping are those of the form $\{(x, y) \in F_i \times F_i \mid x < y\}$, where $<$ is a lexicographic (dictionary) ordering. This provides a way to represent total orderings over $F_i$.

In one common generality ordering over features with ordinal values (e.g., inte-

gers), each abstract value is a range of values, and one abstract value is more general

than another if its range properly contains the other. For example, a range might be

of the form [x..y], where x < y, and one range would be more general than another

if the first range contained the second (e.g., $[5..10] \prec [0..20]$).

To represent the set of abstract values, $F_i$, as a regular grammar, the range [x..y] is represented as the string $(x \circ y)$, where $\circ$ is a special symbol that indicates the end of x and the beginning of y. $F_i$ is the set of all possible ranges, $\{x \circ y \mid x, y \in \mathit{INT}$ and $x \le y\}$, where INT is a regular expression for the set of integers, $(0\text{-}9) \mid (1\text{-}9)(0\text{-}9)^{+}$. $F_i$ can be expressed as the regular set $\{\mathit{shuffle}(x, y) \mid x, y \in \mathit{INT}$ and $x \le y\}$.

The generality ordering over $F_i$ is $\{([x_1..y_1], [x_2..y_2]) \in F_i \times F_i \mid x_2 \le x_1$ and $y_1 \le y_2\}$. This is the set of all pairs of ranges such that the first range is contained in the second. The regular set for this ordering takes as input strings of the form $\mathit{shuffle}(w_1, w_2)$, where $w_1$ and $w_2$ are strings encoding ranges in $F_i$. A range in $F_i$ is encoded as a string of the form $\mathit{shuffle}(x, y)$, where x and y are ground values for the feature and $x \le y$. An input to the grammar is therefore an interleaving of four strings, a, b, c, and d, where the input is accepted if the range [a..b] is contained within the range [c..d], and both of these are valid ranges (i.e., $a \le b$ and $c \le d$).

The grammar for the generality ordering, $\preceq$, over $F_i$ is constructed from four simpler grammars. Create four copies of a regular grammar that recognizes the set $\{\mathit{shuffle}(x, y) \mid x, y \in \mathit{INT}$ and $x \le y\}$. This is a simple modification of the DFA described in Section 4.1.2.4 of Chapter 4. Call these grammars $M_1$ through $M_4$. For each quartet of symbols in the input string, $a_i b_i c_i d_i$, pass $a_i b_i$ to $M_1$, $c_i d_i$ to $M_2$, $c_i a_i$ to $M_3$, and $b_i d_i$ to $M_4$. The machines $M_1$ and $M_2$ ensure that the ranges [a..b] and [c..d] are valid, in that $a \le b$ and $c \le d$. The machines $M_3$ and $M_4$ ensure that [a..b]


is contained within [c..d]; that is, $c \le a$ and $b \le d$. If all four machines accept their

input, the input string is accepted by the composite regular grammar.
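The condition the four machines jointly check can be stated directly on decoded endpoints. The following Python sketch checks the same range-containment ordering; it operates on integer pairs rather than shuffled strings, so it is only an illustration of the relation, not of the automaton.

# Sketch: [a..b] is at most as general as [c..d] iff both ranges are valid
# and [a..b] is contained in [c..d], i.e. c <= a and b <= d.
def range_less_general_or_equal(specific, general):
    a, b = specific
    c, d = general
    return a <= b and c <= d and c <= a and b <= d

# Example: range_less_general_or_equal((5, 10), (0, 20)) returns True,
# matching [5..10] preceq [0..20].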

6.2.3 RS-KII Translators for IVSM Knowledge

RS-KII and CS-KII (IVSM) overlap in expressiveness. Some knowledge can be

utilized by both RS-KII and CS-KII, and some knowledge can be used by one but

not the other. For some hypothesis spaces, RS-KII subsumes CS-KII—that is, all

knowledge that can be utilized by CS-KII can also be utilized by RS-KII. This

is true of hypothesis spaces described by conjunctive feature languages, where the

generality ordering for each feature can be expressed as a regular grammar.

As empirical support for this fact, RS-KII translators will be demonstrated for

a number of knowledge sources used by IVSM. The hypothesis space for all of these

translators is assumed to consist of conjuncts of feature values, where the general-

ization ordering for each feature can be expressed as a regular grammar.

These translators generally take as input both the hypothesis space (H) and the

generality ordering ($R_{\preceq}$). $R_{\preceq}$ is the union of $R_{\prec}$ and $R_=$, and is therefore a regular set whenever $R_{\prec}$ and $R_=$ are regular. The ordering $R_{\preceq}$ is used in place of $R_{\prec}$ in the translators because it is easier to construct the necessary C sets from operations on $R_{\preceq}$ than it is to construct them from set operations on $R_{\prec}$. The P sets produced

by the translators are always empty, since the set representation that CS-KII uses

for P can only express the empty set.

A translator for noise-free examples is described in Section 6.2.3.1, a transla-

tor for noisy examples with bounded inconsistency [Hirsh, 1990] is described in

Section 6.2.3.2, and a translator for overspecial domain theories is described in Sec-

tion 6.2.3.3.

6.2.3.1 Noise-free Examples

The target concept is consistent with all of the noise-free examples, by definition.

That is, the target concept covers all of the positive examples and none of the

negative examples. A hypothesis that is not consistent with the noise-free examples

cannot be the target concept.


A noise-free example is translated as a constraint that is satisfied by all of the

hypotheses consistent with the example. A positive example is translated as (C, ∅), where C is the set of hypotheses that cover the example. Similarly, a negative example is translated as (C, ∅), where C is the set of hypotheses that do not cover

the example.

For the conjunctive feature languages, an instance $(x_1, x_2, \ldots, x_k)$ is covered by a hypothesis, $(V_1, V_2, \ldots, V_k)$, if and only if for every feature $f_i$ from one to k, $x_i \preceq_i V_i$.

The set of hypotheses covering an instance can be computed as follows. Let $F_i$ be the set of abstract values that feature i can take. Let $R_{\preceq,i}$ be a regular grammar encoding the generality ordering over $F_i$: $R_{\preceq,i} = \{(x, y) \in F_i \times F_i \mid x \preceq_i y\}$. The set of feature values covering $x_i$ is given by the following formula:

$\{v \in F_i \mid x_i \preceq_i v\} = \mathit{second}(R_{\preceq,i} \cap (\{x_i\} \times F_i))$    (6.7)

This set is expressible as a regular grammar if $\{x_i\}$, $R_{\preceq,i}$, and $F_i$ are regular sets, since

regular sets are closed under union, intersection, and projection.

The set of hypotheses covering the instance is the Cartesian product of these

sets. The regular grammar for this set is the concatenation of the regular grammars

for the set of values covering the instance on each feature. A translator for noise-free

examples based on this construction is given in Figure 6.2. This translator takes as input the hypothesis space, expressed as a list of value sets for each feature, $F_1$ through $F_k$; the generality ordering $R_{\preceq}$, expressed as a list of generality orderings for each feature, $R_{\preceq,1}$ through $R_{\preceq,k}$; and an example of the form $(v_1, \ldots, v_k, c)$, where $v_i$ is a value for feature $f_i$ and c is a classification (e.g., positive or negative).

6.2.3.2 Noisy Examples with Bounded Inconsistency

Bounded inconsistency [Hirsh, 1990] is a kind of noise in which each feature of the

example can be wrong by at most a fixed amount. For example, if the width value

for each instance is measured by an instrument with a maximum error of ±0.3mm,

then the width values for these instances have bounded inconsistency.

The idea for translating examples with bounded inconsistency is to use the error

margin to work backwards from the noisy example to compute the set of possible

noise-free examples. One of these examples is the correct noise-free version of the


observed example, into which noise was introduced to produce the observed noisy example. The target concept is strictly consistent with this noise-free example.

TranExample(F_1, F_2, ..., F_k, R_⪯,1, R_⪯,2, ..., R_⪯,k, ((x_1, x_2, ..., x_k), class)) -> (C, ∅)

where
    F_1 × F_2 × ... × F_k is the hypothesis space
    ((x_1, x_2, ..., x_k), class) is an example

    C = V'_1 × V'_2 × ... × V'_k              if class = positive
        H − (V'_1 × V'_2 × ... × V'_k)        if class = negative

    V'_i = {v ∈ F_i | x_i ⪯_i v} = second(R_⪯,i ∩ ({x_i} × F_i))

Figure 6.2: RS-KII Translator for Noise-free Examples.

Let e be the noisy observed example, E be the set of noise-free examples from

which e could have been generated, and let e' be the correct noise-free example from

which e was in fact generated. Ideally, e is translated into (C, ∅), where C is the set

of hypotheses consistent with e'. However, it is unknown which example in E is e'.

Because of this uncertainty, a noisy example is translated as (C, ∅), where C is the

set of hypotheses that are strictly consistent with at least one of the examples in E.

Hypotheses that are consistent with none of the examples in E are also not consistent

with e', and therefore not the target concept. This is the approach used by Hirsh

[Hirsh, 1990] in IVSM to translate noisy examples with bounded inconsistency.

This suggests the following RS-KII translator for examples with bounded inconsistency. The set of possible noise-free examples, E, is computed from the noisy example and the error margins for each feature. Each example, $e_i$, in this set is translated using the RS-KII translator for noise-free examples, TranExample(H, $R_{\preceq}$, $e_i$) (Figure 6.2), which translates example $e_i$ into $(C_i, \emptyset)$. $C_i$ is the set of hypotheses that are strictly consistent with $e_i$. The translator for the bounded-inconsistency example returns $(C = \bigcup_i C_i, \emptyset)$. C is the set of hypotheses consistent with at least

one of the examples in E.


TranExampleBI(H, R_⪯, (δ_1, δ_2, ..., δ_k), ((x_1, x_2, ..., x_k), class)) -> (C, P)

where
    E = [x_1, ±δ_1] × [x_2, ±δ_2] × ... × [x_k, ±δ_k] × {class}
        where [x_i, ±δ_i] = {v | x_i − δ_i ≤ v ≤ x_i + δ_i}
    C = ⋃_{e_i ∈ E} C_i such that (C_i, ∅) = TranExample(H, R_⪯, e_i)
    P = ∅

Figure 6.3: RS-KII Translator for Examples with Bounded Inconsistency.

The set E is computed from the observed example, $(x_1, x_2, \ldots, x_k, \mathit{class})$, and the error margins for each feature, $\pm\delta_1$ through $\pm\delta_k$, as follows. If the observed value for feature i is $x_i$, and the error margin is $\pm\delta_i$, then the correct value for feature i is in $\{v \mid x_i - \delta_i \le v \le x_i + \delta_i\}$. Call this set $[x_i, \pm\delta_i]$ for short. Since examples are ordered lists of feature values and a class, E is $[x_1, \pm\delta_1] \times [x_2, \pm\delta_2] \times \ldots \times [x_k, \pm\delta_k] \times \{\mathit{class}\}$.

A translator for examples with bounded inconsistency based on this approach is shown in Figure 6.3. It takes as input the hypothesis space (H), a regular grammar encoding the generality ordering ($R_{\preceq}$), the error margin for each feature ($\pm\delta_1$ through $\pm\delta_k$), and an example. The hypothesis space input is represented in factored form as an ordered list of regular sets, $F_1$ through $F_k$, where $F_i$ is the set of abstract values for feature $f_i$. The input $R_{\preceq}$ is also represented in factored form as an ordered list of regular sets, $R_{\preceq,1}$ through $R_{\preceq,k}$, where $R_{\preceq,i}$ represents the generality ordering for feature $f_i$. The factored representation is used since the translator makes a call to TranExample(H, $R_{\preceq}$, e), which expects H and $R_{\preceq}$ to be represented in this fashion.
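For illustration, the whole translation can be sketched in Python for integer-valued features. The helper tran_example, standing in for the noise-free translator of Figure 6.2, and the use of plain Python sets for the C sets are assumptions made only for this sketch.

from itertools import product

# Sketch: translate a noisy example with per-feature error bounds +/- delta_i.
def tran_example_bi(deltas, example, tran_example):
    *values, cls = example
    # [x_i, +/-delta_i]: candidate correct values for feature i
    ranges = [range(x - d, x + d + 1) for x, d in zip(values, deltas)]
    C = set()
    for candidate in product(*ranges):        # each possible noise-free example
        C |= tran_example((*candidate, cls))  # union of the per-example C sets
    return C, set()                           # the P set is always empty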

6.2.3.3 Domain Theory

A domain theory is a set of rules that explain why an instance is a member of the

target concept. For instance, a particular object is a cup because it is liftable, and

has a stable bottom and an open top. There are many ways to use a domain theory,

depending on assumptions about the completeness and correctness of the theory.


For example, if the theory is overspecial, then there are instances which cannot be

explained as members of the target concept. The target concept is a generalization

of the domain theory. The theory provides an initial guess at the target concept,

and additional knowledge indicates which generalization of the theory has the most

support for being the target concept. Other assumptions about the correctness and

completeness of the theory are also possible (see Section 6.2.3.3). For the sake of

brevity, only a translator for overspecial theories will be described.

If the theory is overspecial, then the target concept is some generalization

of the domain theory. The theory could therefore be translated as a constraint

satisfied only by generalizations of the theory. Determining which hypotheses are

generalizations of the theory is somewhat difficult. One approach is to map the

theory onto an equivalent hypothesis, and then use the generalization ordering to

determine the generalizations. However, it is not clear how to perform this mapping.

A second approach is to translate both the theory and a positive example at

the same time. An explanation of the example by the theory corresponds to a

sufficient condition for the concept described by the theory. The target concept is a

generalization of the explanation, and an explanation can be easily mapped onto

the hypothesis space. The explanation is a conjunction of operational predicates, or

features, and exactly corresponds to a member of the hypothesis space. Once the

mapping has been done, it is a simple matter to compute the set of hypotheses more

general than the explanation.

Only a positive example can be used for this translation, since the theory can

explain why an instance is a member of the target concept, and therefore a positive

example, but cannot explain why an instance is not a member of the target concept.

To utilize negative examples in this way, a theory is needed for the complement of

the target concept. Complementary theories for positive and negative examples are

also used in IVSM [Hirsh, 1990].

A translator for an overspecial domain theory is shown in Figure 6.4. In this translator, H is the hypothesis space, $R_{\preceq}$ is a regular grammar for the generalization

ordering, T is a domain theory, and e is a positive example. The explanation of an

example by the theory is essentially a generalized example. This example is passed

to the translator for noise-free examples described in Section 6.2.3.1.


TranOverSpecialDT(H, R_⪯, T, e = (inst, positive)) -> (C, P)

where
    (C, P) = TranExample(H, R_⪯, (explain(inst, T), positive))

Figure 6.4: RS-KII Translator for an Overspecial Domain Theory.

6.3 Complexity of Set Operations

The computational complexity of IVSM (CS-KII) and RS-KII are determined by the

complexity of integrating knowledge, and the complexity of enumerating hypotheses

from the solution set. The complexity of these operations is determined in turn by

the computational complexity of the set operations that define them.

When the P set is empty, the only set operation that determines the computa-

tional complexity of integration is intersection. Integration involves intersecting the

C sets and computing the union of the P sets, but since the P sets are empty, their

union is always empty. The union operation can be computed in constant time, and

is dominated by the complexity of intersecting the C sets. Since P is empty, the

deductive closure of (C, ∅) is just C. The complexity of enumerating a hypothesis

from the deductive closure of (C, P) therefore depends on the size of the DFA for C.

In CS-KII, the P set is always empty since only the empty set is expressible in

the representation for P. Therefore, the costs of integration and enumeration in

CS-KII depend on the cost of intersecting convex sets and the cost of enumerating

hypotheses from an intersection of convex sets, respectively. In RS-KII, the P set can

be non-empty. However, when only the knowledge expressible in CS-KII is utilized,

the P set is always empty. The costs of integration and enumeration in RS-KII are

determined by the costs of intersecting regular sets and the cost of enumerating

hypotheses from an intersection of regular sets.

The computational complexity of intersection and enumeration for regular sets

is derived in Section 6.3.1 and Section 6.3.2, respectively. The computational com-

plexity of intersection and enumeration for convex sets is derived in Section 6.3.3

and Section 6.3.4. A comparison between the complexity equations for regular and

convex sets is made in Section 6.3.5.


6.3.1 Complexity of Regular Set Intersection

Recall that the intersection of two DFAs is implemented in RS-KII by constructing
an intensional DFA as shown in Figure 6.5. Constructing this DFA only requires
pointers to the two original DFAs, and definitions of the DFA's components in terms
of calls to the corresponding components of the two original DFAs. The intensional
DFA for the intersection of two DFAs can be constructed in constant time.

(s₁, δ₁, F₁, Dead₁, AccAll₁, Σ₁) ∩ (s₂, δ₂, F₂, Dead₂, AccAll₂, Σ₂) = (s, δ, F, Dead, AcceptAll, Σ) where

    s = (s₁, s₂)

    δ((q₁, q₂), σ) = (δ₁(q₁, σ), δ₂(q₂, σ))

    F((q₁, q₂)) = F₁(q₁) ∧ F₂(q₂)

    Dead((q₁, q₂)) = true if Dead₁(q₁) = true or Dead₂(q₂) = true
                     else unknown

    AcceptAll((q₁, q₂)) = true if AccAll₁(q₁) = true and AccAll₂(q₂) = true
                          false if AccAll₁(q₁) = false or AccAll₂(q₂) = false
                          else unknown

    Σ = Σ₁ ∩ Σ₂ if Σ₁ ≠ Σ₂
        a pointer to Σ₁ if Σ₁ = Σ₂

Figure 6.5: Intersection Implementation.
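The sketch below is a minimal illustration of this intensional construction; it is not the dissertation's implementation. A DFA is represented by its start state plus a handful of functions, and the intersection simply composes those functions, so building the product costs O(1) regardless of how many states the component DFAs have.

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class LazyDFA:
        start: Any
        delta: Callable[[Any, str], Any]      # next-state function
        is_accept: Callable[[Any], bool]      # F
        is_dead: Callable[[Any], bool]        # conservative Dead test: only reports
                                              # dead when a component is known dead

    def intersect(a: LazyDFA, b: LazyDFA) -> LazyDFA:
        """Constant-time intersection: the product DFA is defined by composing the
        component functions; no product states are materialized here."""
        return LazyDFA(
            start=(a.start, b.start),
            delta=lambda q, sym: (a.delta(q[0], sym), b.delta(q[1], sym)),
            is_accept=lambda q: a.is_accept(q[0]) and b.is_accept(q[1]),
            is_dead=lambda q: a.is_dead(q[0]) or b.is_dead(q[1]),
        )

The product's states are generated only later, when a search over the intersection asks the composed functions for them.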

6.3.2 Complexity of Regular Set Enumeration

Although constructing an implicit DFA for the intersection of two DFAs is a constant

time operation, enumerating hypotheses from this DFA is not. A hypothesis is

enumerated by finding a path from the start state to an accept state. This can take

both time and space proportional to the number of states in the DFA.

The implicit DFA for the intersection represents the explicit DFA shown in Fig-

ure 6.6. The implicit DFA generates states and edges in the explicit DFA as they


are needed by the search. Thus the space cost is bounded by the number of states

actually searched instead of by the total number of states in the explicit DFA. This

saves space on average, but the worst case complexities are still the same.

(Q, s, δ, F, Σ) = (Q₁, s₁, δ₁, F₁, Σ) ∩ (Q₂, s₂, δ₂, F₂, Σ)

where

    Q = Q₁ × Q₂
    s = (s₁, s₂)
    δ((q₁, q₂), σ) = (δ₁(q₁, σ), δ₂(q₂, σ))
    F = F₁ × F₂

Figure 6.6: Explicit DFA for the Intersection of Two DFAs.

The cost of enumerating a hypothesis from this DFA is proportional to the number
of states in the DFA and the cost of computing the next-state function, δ. The
explicit DFA has |Q₁||Q₂| states, and the cost of computing δ is cost(δ₁) + cost(δ₂).
The time and space needed to enumerate a hypothesis from the intersection of two
DFAs is bounded by Equation 6.8.

O(|Q₁||Q₂|[cost(δ₁) + cost(δ₂)])   (6.8)
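A small sketch of the enumeration step (my own illustration, not the dissertation's code): starting from the product start state, a breadth-first search expands product states on demand until it reaches an accepting state, and returns the string labeling that path as the enumerated hypothesis. The toy component DFAs below are assumptions chosen only to make the sketch runnable.

    from collections import deque

    ALPHABET = "tfd"  # feature values, as in the conjunctive t/f/d language

    def enumerate_hypothesis(start, delta, is_accept):
        """BFS over an on-the-fly product DFA; time and space are proportional to the
        number of product states actually generated (Equation 6.8 in the worst case)."""
        frontier = deque([(start, "")])
        seen = {start}
        while frontier:
            state, path = frontier.popleft()
            if is_accept(state):
                return path
            for sym in ALPHABET:
                nxt = delta(state, sym)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + sym))
        return None  # the intersection is empty

    def delta1(q, s):
        # toy DFA 1: accepts length-3 strings over {t, d}
        return q + 1 if isinstance(q, int) and q < 3 and s in "td" else "dead"

    def delta2(q, s):
        # toy DFA 2: accepts strings whose first symbol is t
        if q == 0 and s == "t":
            return "ok"
        return "ok" if q == "ok" else "dead"

    product_delta = lambda q, s: (delta1(q[0], s), delta2(q[1], s))
    product_accept = lambda q: q[0] == 3 and q[1] == "ok"
    print(enumerate_hypothesis((0, 0), product_delta, product_accept))  # -> "ttt"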

6.3.3 Complexity of Convex Set Intersection

The intersection of two convex sets is computed in two phases. In the first phase, a

convex set is computed that represents the intersection of the two sets, but whose

boundary sets contain extraneous hypotheses. In the second phase, these hypotheses

are eliminated from the boundary sets, yielding a minimal convex set. This analysis

follows [Hirsh, 1990].

Phase One. The non-minimal intersection of two convex sets, (H, ≺, S₁, G₁) and
(H, ≺, S₂, G₂), is defined in Figure 6.7. In this definition, LUB(a, b, ≺) returns the
least upper bounds of a and b in the partial order ≺, and GLB(a, b, ≺) returns the
greatest lower bounds of a and b in ≺. That is, LUB(a, b, ≺) returns the most specific
common generalizations of a and b, and GLB(a, b, ≺) returns the most general
common specializations of a and b.

(S, G, ≺, H) = (S₁, G₁, ≺, H) ∩ (S₂, G₂, ≺, H) where

    S = {s | ∃ s₁∈S₁, s₂∈S₂ such that s ∈ LUB(s₁, s₂, ≺)}
    G = {g | ∃ g₁∈G₁, g₂∈G₂ such that g ∈ GLB(g₁, g₂, ≺)}

Figure 6.7: Intersection of Convex Sets, Phase One.

Computing the non-minimal intersection involves finding LUB(a, b, ≺) for every
pair of hypotheses (a, b) in S₁×S₂, and GLB(x, y, ≺) for every pair of hypotheses
(x, y) in G₁×G₂. There are |S₁||S₂| + |G₁||G₂| such pairs. The cost of computing
the non-minimal intersection is proportional to the number of pairs and the cost of
computing the GLBs and LUBs. Assuming that the cost of computing the LUB
of two hypotheses is t_lub and the cost of computing the GLB is t_glb, the cost of
computing the non-minimal intersection is given by Equation 6.9.

|S₁||S₂| t_lub + |G₁||G₂| t_glb   (6.9)

Phase Two. In the second phase, the boundary sets are minimized by removing
from S hypotheses that are more general than other elements of S, or that are
not more specific than some element of G. The first test requires |S|² comparisons.
Removing elements of S that are not more specific than or equivalent to some element
of G requires |S|(|G₁| + |G₂|) comparisons. This makes use of the observation that
since G contains the GLBs of every pair of elements in G₁ and G₂, an element
s ∈ S is more specific than some element of G if and only if s is more specific
than an element of G₁ and an element of G₂. The cost of minimizing S is therefore
[|S|² + |S|(|G₁| + |G₂|)] t_<, where t_< is the cost of comparing two hypotheses to
determine which is more general.

S contains the LUBs of all pairs of elements from S₁ and S₂, so |S| ≤ |S₁||S₂|.
The cost of minimizing the S set is therefore bounded by [(|S₁||S₂|)² + |S₁||S₂|(|G₁| +
|G₂|)] t_<. A symmetric analysis holds for minimizing G. The total cost of minimizing
the intersection of two convex sets is given by Equation 6.10.

[(|S₁||S₂|)² + |S₁||S₂|(|G₁| + |G₂|) + (|G₁||G₂|)² + |G₁||G₂|(|S₁| + |S₂|)] t_<   (6.10)

Total Cost. The cost of intersecting two convex sets is the sum of Equation 6.9
and Equation 6.10. The values of t_<, t_lub, and t_glb depend on both the hypothesis
space and the generality ordering (≺). The time cost of computing the intersection
of two convex sets, (S₁, G₁) and (S₂, G₂), is bounded by Equation 6.11:

|S₁||S₂| t_lub + |G₁||G₂| t_glb +
[(|S₁||S₂|)² + |S₁||S₂|(|G₁| + |G₂|) +
 (|G₁||G₂|)² + |G₁||G₂|(|S₁| + |S₂|)] t_<   (6.11)
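As a concrete illustration of the two phases (a sketch for the simple t/f/d conjunctive language, not the dissertation's implementation), the code below computes pairwise LUBs and GLBs and then minimizes the boundary sets. In this language each feature value is t, f, or the more general d, so LUBs are unique and GLBs may fail to exist; the helper names are my own.

    D = "d"  # don't-care: more general than both "t" and "f"

    def more_general_or_equal(x, y):
        """x is at least as general as y (covers everything y covers)."""
        return all(a == D or a == b for a, b in zip(x, y))

    def lub(a, b):
        """Most specific common generalization (unique in the t/f/d language)."""
        return tuple(x if x == y else D for x, y in zip(a, b))

    def glb(a, b):
        """Most general common specialization, or None if a feature has t meeting f."""
        out = []
        for x, y in zip(a, b):
            if x == y or y == D:
                out.append(x)
            elif x == D:
                out.append(y)
            else:
                return None
        return tuple(out)

    def intersect_convex(S1, G1, S2, G2):
        # Phase one: pairwise LUBs of the S sets and GLBs of the G sets (Figure 6.7).
        S = {lub(s1, s2) for s1 in S1 for s2 in S2}
        G = {g for g in (glb(g1, g2) for g1 in G1 for g2 in G2) if g is not None}
        # Phase two: minimization (Equation 6.10).
        S = {s for s in S
             if not any(t != s and more_general_or_equal(s, t) for t in S)   # keep most specific
             and any(more_general_or_equal(g, s) for g in G)}                # below some g in G
        G = {g for g in G
             if not any(h != g and more_general_or_equal(h, g) for h in G)   # keep most general
             and any(more_general_or_equal(g, s) for s in S)}                # above some s in S
        return S, G

    # Intersecting the version spaces of two positive examples, (t,t) and (t,f):
    # the result is S = {(t,d)}, G = {(d,d)}.
    print(intersect_convex({("t", "t")}, {("d", "d")}, {("t", "f")}, {("d", "d")}))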

6.3.4 Complexity of Enumerating Convex Sets

Enumerating a hypothesis from an intersection of convex sets is inexpensive. If only

one or two hypotheses are needed, the first hypotheses in the S and G sets can be

returned. These are both constant time operations. This is sufficient to induce a

hypothesis, and to answer the emptiness and uniqueness queries.

If additional hypotheses are needed, they can be enumerated by starting with an

element of the S set and climbing the generalization tree. This is done by generating

the immediate parents of the hypothesis in the generalization tree, and enumerating

only those parents that are also covered by at least one G set element. The complex-

ity of generating an immediate parent depends on the representation of the generality

ordering. For conjunctive languages with tree-structured and lattice-structured fea-
tures, it is a constant time operation. Verifying that a hypothesis is covered by
at least one G set element takes time proportional to |G| t_<. The time to generate
each element is therefore O(|G| t_<), at least for conjunctive languages with tree or
lattice-structured features.
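A sketch of this enumeration step for the t/f/d conjunctive language (my own illustration, with the coverage check written out explicitly): the immediate parents of a hypothesis are obtained by generalizing one feature to d, and a parent is kept only if it is still covered by some element of G.

    D = "d"  # don't-care value, the only generalization of "t" or "f"

    def more_general_or_equal(x, y):
        return all(a == D or a == b for a, b in zip(x, y))

    def immediate_parents(h):
        """Generalize exactly one non-d feature of h; constant time per parent."""
        return [h[:i] + (D,) + h[i + 1:] for i, v in enumerate(h) if v != D]

    def parents_in_version_space(h, G):
        """Keep only parents still below some maximally general element (O(|G| t_<) per parent)."""
        return [p for p in immediate_parents(h)
                if any(more_general_or_equal(g, p) for g in G)]

    # Climbing one level up from (t,t) inside the version space with G = {(t,d), (d,t)}:
    print(parents_in_version_space(("t", "t"), [("t", D), (D, "t")]))  # -> [('d', 't'), ('t', 'd')]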


6.3.5 Complexity Comparison

The intersection and enumeration costs for convex sets and regular sets are in terms

of entirely different quantities. The costs for convex sets are in terms of the S and

G set sizes, whereas the costs for regular sets are in terms of the number of states in

the DFA. In order to compare these costs we need to know the relationship between

these different quantities for equivalent sets.

It is difficult to find a general equation relating these quantities, but it is possible

to relate them when the sets are assumed to be subsets of a specific hypothesis space.

The idea is to devise an algorithm for translating an arbitrary convex subset of the

hypothesis space to an equivalent regular set, such that the size of the resulting

regular set can be expressed in terms of |S| and |G|. This provides the necessary
relation between the sizes of equivalent regular and convex subsets of the hypothesis

space. This relation is only valid for the given hypothesis space, or class of hypothesis

spaces, and the approach requires that every convex subset of the hypothesis space

can be expressed as an equivalent regular set.

The computational complexities of CS-KII and RS-KII will be compared using

this approach for the class of hypothesis spaces described by a conjunctive language

with tree-structured features, where the generalization tree has a fixed depth bound.

A similar comparison will be made for conjunctive languages in which the features

are lattice structured, where the generalization tree for each feature has a fixed

depth bound. These hypothesis spaces are commonly used with IVSM and CEA,

and every convex subset of these hypothesis spaces can be expressed as a regular set

(see Section 6.2.2.1).

6.3.5.1 Equating Regular and Convex sets

A hypothesis in a conjunctive language can be written as a vector of feature values,
(v₁, v₂, ..., v_k), where v_i is one of the values in the generalization hierarchy for feature
f_i. A hypothesis is generalized by replacing at least one feature value, v_i, with a
more general value. A hypothesis is specialized by replacing at least one feature
value with a more specific value. The generality ordering over the hypotheses has
a maximally general hypothesis, U, that classifies every instance as positive, and a
maximally specific element, ∅, that classifies every instance as negative.


A convex set in this language can be written as shown in Equation 6.12, where
x ≼ y iff x and y are hypotheses in the language and x is either more specific than
y, or x = y.

(S, G) = {h | ∃s∈S, ∃g∈G such that s ≼ h ≼ g}
       = ⋃_{s∈S} {h | s ≼ h ≼ U}  ∩  ⋃_{g∈G} {h | ∅ ≼ h ≼ g}   (6.12)

In a conjunctive language, a hypothesis can be written as a vector of feature
values, (v₁, v₂, ..., v_k). Equation 6.12 can therefore be rewritten as shown in Equa-
tion 6.13. In this definition, f_i(h) is the value of hypothesis h on feature f_i.

(S, G) = ⋃_{s∈S} {(v₁, ..., v_k) | f₁(s) ≼ v₁ ≼ f₁(U) ∧ ... ∧ f_k(s) ≼ v_k ≼ f_k(U)}  ∩
         ⋃_{g∈G} {(v₁, ..., v_k) | f₁(∅) ≼ v₁ ≼ f₁(g) ∧ ... ∧ f_k(∅) ≼ v_k ≼ f_k(g)}

       = ⋃_{s∈S} {v₁ | f₁(s) ≼ v₁ ≼ f₁(U)} × ... × {v_k | f_k(s) ≼ v_k ≼ f_k(U)}  ∩
         ⋃_{g∈G} {v₁ | f₁(∅) ≼ v₁ ≼ f₁(g)} × ... × {v_k | f_k(∅) ≼ v_k ≼ f_k(g)}   (6.13)

Let A_i(s) be the regular set {v_i | f_i(s) ≼ v_i ≼ f_i(U)}, and let B_i(g) be the
regular set {v_i | f_i(∅) ≼ v_i ≼ f_i(g)}. The convex set described in Equation 6.13
can be expressed as the regular set shown in Equation 6.14. Regular sets are
closed under concatenation, intersection, and finite union, so the resulting set is
regular.

(S, G) = ⋃_{s∈S} A₁(s)A₂(s)...A_k(s)  ∩  ⋃_{g∈G} B₁(g)B₂(g)...B_k(g)   (6.14)

The number of states in the regular set described in Equation 6.14 depends on

the number of states in the Ai(s) and Bi(g) grammars for each s and g. The sizes

of these grammars depend on the depth of the generalization hierarchies (trees) for

each feature, and whether these hierarchies are tree structured or lattice structured.
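To make the factoring concrete, the sketch below (mine) builds the per-feature sets A_i(s) and B_i(g) for the simple t/f/d language, where the per-feature interval between two values is tiny. The factored representation stores only the per-feature sets X_i = A_i ∩ B_i, while expanding their product enumerates every hypothesis in the set.

    from itertools import product

    VALUES = ("t", "f", "d")
    D = "d"

    def leq(x, y):
        """Value x is at most as general as value y (x = y, or y is don't-care)."""
        return x == y or y == D

    def A(s_val):
        """Values between f_i(s) and f_i(U) = d for one feature."""
        return [v for v in VALUES if leq(s_val, v)]

    def B(g_val):
        """Values between the bottom and f_i(g) for one feature."""
        return [v for v in VALUES if leq(v, g_val)]

    def factored(s, g):
        """Per-feature sets X_i(s,g) = A_i(s) ∩ B_i(g): the regular-set factoring."""
        return [[v for v in A(sv) if v in B(gv)] for sv, gv in zip(s, g)]

    s, g = ("t", "t", "t"), ("t", "d", "d")
    X = factored(s, g)
    print(X)                   # [['t'], ['t', 'd'], ['t', 'd']]  -- 5 stored values
    print(list(product(*X)))   # 4 hypotheses when expanded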


6.3.5.2 Tree Structured Hierarchies

If the generalization hierarchy for each feature is tree structured, then there is at

most one hypothesis in the S set [Bundy et al, 1985]. Equation 6.14 reduces to

Equation 6.15.

({s}, G) = ⋃_{g∈G} (A₁(s)∩B₁(g))(A₂(s)∩B₂(g))...(A_k(s)∩B_k(g))   (6.15)

For brevity, the set A_i(s)∩B_i(g) will also be referred to as X_i(s,g). The set
X_i(s,g) is equivalent to {v_i ∈ F_i | f_i(s) ≼ v_i ≼ f_i(g)}, where F_i is the set of values
for feature f_i. It contains all values between f_i(s) and f_i(g) in the generalization
hierarchy for f_i. In a tree structured generalization hierarchy with depth d, X_i(s,g)
contains at most d values. This is because f_i(s) and f_i(g) must be on the same branch
of the tree in order for s to be less general than g, and no branch of the tree has
more than d nodes. The DFA for X_i(s,g) therefore has at most O(d) states.

The DFA for X₁(s,g)X₂(s,g)...X_k(s,g) has at most O(kd) states. Call this
DFA X(s,g) for short. The DFA for ⋃_{g∈G} X(s,g) has at most O(|G|kd) states.

For hypothesis spaces described by conjunctive languages in which the general-

ization hierarchies for each feature are tree structured and have a maximum depth

of d, every convex subset of the space can be expressed as a regular set as shown in

Equation 6.14. The DFA for this set has at most 0(\G\kd) states.

This relation between the sizes of equivalent convex and regular sets allows the
time complexity of intersecting two regular sets and enumerating a hypothesis from
their intersection (Equation 6.8) to be rewritten in terms of the boundary set sizes
of two equivalent convex sets, as shown in Equation 6.16. In this equation, it is
assumed that the regular set (Q₁, s₁, δ₁, F₁, Σ) represents the same set of hypotheses
as the convex set (S₁, G₁), and that (Q₂, s₂, δ₂, F₂, Σ) represents the same set of
hypotheses as (S₂, G₂). Equation 6.16 is derived by substituting |G₁|kd for |Q₁| and
|G₂|kd for |Q₂| in Equation 6.8.

O(|Q₁||Q₂|[cost(δ₁) + cost(δ₂)]) = O(|G₁||G₂|(kd)²[cost(δ₁) + cost(δ₂)])   (6.16)

For most grammars, cost(δ₁) and cost(δ₂) are O(1).


The cost of enumerating a hypothesis from the intersection of two regular sets

has been expressed in terms of the sizes of S and G. This cost can now be compared

to the cost of enumerating a hypothesis from the intersection of two equivalent

convex sets, (S₁, G₁) and (S₂, G₂). The cost of intersecting two convex sets is shown

in Equation 6.11. For conjunctive languages with tree structured features, the S

set never has more than one element, and the cost of determining the generality

relation between two hypotheses, t_<, is O(kd) [Hirsh, 1990]. The cost of finding the
least upper bound of two hypotheses (t_lub) is also O(kd) when the generalization
hierarchies for each feature are tree structured. The cost of finding the greatest
lower bound of two hypotheses (t_glb) is also O(kd). Substituting these values into
Equation 6.11 yields Equation 6.17.

kd + |G₁||G₂|kd + (|G₁| + |G₂| + (|G₁||G₂|)² + 2|G₁||G₂|)kd   (6.17)

This cost is bounded by Equation 6.18:

O((|G₁||G₂|)²kd)   (6.18)

The cost of enumerating a hypothesis from the intersection of two regular sets
is O(|G₁||G₂|(kd)²), and the cost of enumerating a hypothesis from two equivalent
convex sets is O((|G₁||G₂|)²kd). If kd ≪ |G₁||G₂|, then the cost for regular sets is
significantly less than the cost for convex sets. Since the size of the G sets can
grow exponentially in the number of examples, this is often a valid assumption.

When a convex set is represented as an equivalent regular set, and the generalization

hierarchies are tree structured, the regular set is effectively a factored representation

of the convex set. This allows the set to be represented more compactly, and improves

the computational complexity. Similar complexity improvements can be obtained for

convex sets by maintaining them in factored form [Subramanian and Feigenbaum,

1986]. However, there are still expressive differences between convex and regular

sets. In particular, regular sets can represent sets with "holes", whereas convex sets

cannot. Also, convex sets are not closed under union [Hirsh, 1990], but regular sets

are [Hopcroft and Ullman, 1979].


The computational complexity of enumerating a hypothesis from an intersection
of n convex sets in IVSM is n(|G*|⁴)kd, where |G*| is the maximum size attained by
the G set. The size of G* can be as high as 2ⁿ [Haussler, 1988]. The computational
complexity of enumerating a hypothesis in RS-KII from the same n sets represented
as regular sets is bounded by O(|G|ⁿ(kd)ⁿ), where |G| is the size of the largest G
set among the convex sets being intersected.

and RS-KII are equivalent. Although RS-KII can be exponential, this is a very

loose upper bound. As will be seen in Section 6.4, the complexity of RS-KII can be

polynomial when the complexity of IVSM is exponential.

6.3.5.3 Lattice Structured Features

When the generalization hierarchies for the features are lattice structured rather than
tree structured, the relation between the sizes of convex and regular sets changes
somewhat. The S set is not guaranteed to be a singleton in this language [Bundy et al, 1985].

Convex subsets are therefore expressed as shown in Equation 6.19.

(S, G) = ⋃_{s∈S} {v₁ | f₁(s) ≼ v₁ ≼ f₁(U)} × ... × {v_k | f_k(s) ≼ v_k ≼ f_k(U)}  ∩
         ⋃_{g∈G} {v₁ | f₁(∅) ≼ v₁ ≼ f₁(g)} × ... × {v_k | f_k(∅) ≼ v_k ≼ f_k(g)}   (6.19)

Let A_i(s) be the regular set {v_i | f_i(s) ≼ v_i ≼ f_i(U)}, and let B_i(g) be the regular
set {v_i | f_i(∅) ≼ v_i ≼ f_i(g)}. The regular set equivalent to the one in Equation 6.19
is shown in Equation 6.21, where X_i(s,g) = A_i(s) ∩ B_i(g).

(S, G) = ⋃_{s∈S} A₁(s)A₂(s)...A_k(s)  ∩  ⋃_{g∈G} B₁(g)B₂(g)...B_k(g)   (6.20)
       = ⋃_{(s,g)∈S×G} X₁(s,g)X₂(s,g)...X_k(s,g)   (6.21)

X_i(s,g) is the set {v_i ∈ F_i | f_i(s) ≼ v_i ≼ f_i(g)}, where F_i is the set of values in
the generalization hierarchy for feature f_i, and x ≼ y means that either value x is
less general than value y in the generalization hierarchy for F_i, or that x = y. When
the hierarchy for F_i is lattice structured, and has a maximum depth of d, there can
be O(wd) values between f_i(s) and f_i(g), where w is the width (branching factor) of
the lattice. The DFA for X_i(s,g) therefore has O(wd) states.

Let X{s, g) stand for Xi(s, g)X2{s, g)... Xk(s, g). The DFA for X(s, g) is a con-

catenation of the DFAs for each of the features, Xi(s,g), and has at most 0(kwd)

states. The DFA for ü/Sjg)eSXGX(s,g) has at most 0{\S\\G\kwd) states, by argu-

ments similar to the ones used for tree structured features.

This relation between the number of states in a regular set and the boundary
set sizes of an equivalent convex set can be used to compare the complexities of
intersecting and enumerating hypotheses from both regular and convex sets. Let
(Q₁, s₁, δ₁, F₁, Σ₁) be a regular set that represents the same set of hypotheses as
the convex set (S₁, G₁), and let (Q₂, s₂, δ₂, F₂, Σ₂) be a regular set that represents
the same set of hypotheses as the convex set (S₂, G₂). The cost of intersecting the
two regular sets and enumerating a hypothesis from the intersection is bounded by
O(|Q₁||Q₂|(cost(δ₁) + cost(δ₂))) (Equation 6.8), where cost(δ₁) and cost(δ₂) are the
costs of computing the next-state functions, δ₁ and δ₂. These costs can generally be
assumed to be constant unless they are proportional to a relevant scale-up variable.
Substituting O(|S₁||G₁|kwd) for the size of Q₁, and O(|S₂||G₂|kwd) for the size of Q₂,
yields the cost for intersecting and enumerating a hypothesis from the two regular
grammars in terms of the boundary set sizes of the equivalent convex sets. This cost
is shown in Equation 6.22.

O(|S₁||S₂||G₁||G₂|(kwd)²)   (6.22)

This cost of intersecting two regular sets and enumerating a hypothesis from their

intersection can now be compared to the cost of performing the same operation

on two convex sets. Let (S₁, G₁) and (S₂, G₂) be two convex sets that represent,
respectively, the same sets of hypotheses as the two regular sets (Q₁, s₁, δ₁, F₁, Σ₁)
and (Q₂, s₂, δ₂, F₂, Σ₂). Since the cost of enumerating a hypothesis from a convex set
is O(1) (see Section 6.3.4), the cost of intersection and enumeration is just the cost of
intersecting two convex sets, as given in Equation 6.11. This equation depends on the
boundary set sizes of both convex sets, the cost of determining the generality relation
between two hypotheses (t_<), and the costs of finding the least upper bound (t_lub)
and greatest lower bound (t_glb) of two hypotheses in the generalization hierarchy.


For lattice structured feature hierarchies, the cost of determining whether one

hypothesis is more general than another is O(kwd) [Hirsh, 1990]. The costs of finding

the least upper bound or greatest lower bound of two hypotheses in the generalization

hierarchy are also bounded by O(kwd). The intuition behind this bound is that all

three of these operations are essentially searches in the generalization hierarchy,

starting from the hypotheses in question. The hierarchy for each feature can be

searched independently. There are k features, each with a generalization hierarchy

of width w and depth d. This leads to k independent searches, each visiting at most

wd nodes in the hierarchy. Substituting O(kwd) for t_lub, t_glb, and t_< in Equation 6.11

yields the cost bound shown in Equation 6.23.

|S₁||S₂|kwd + |G₁||G₂|kwd +
[(|S₁||S₂|)² + |S₁||S₂|(|G₁| + |G₂|) +
 (|G₁||G₂|)² + |G₁||G₂|(|S₁| + |S₂|)]kwd   (6.23)

This is in turn bounded by Equation 6.24.

O([(|S₁||S₂|)² + |S₁||S₂|(|G₁| + |G₂|) + (|G₁||G₂|)² + |G₁||G₂|(|S₁| + |S₂|)]kwd)   (6.24)

When the feature generalization hierarchies are lattice structured, the cost of
intersecting two regular sets and enumerating a hypothesis from their intersection
(Equation 6.22) is roughly equivalent to the cost of performing the same operation
on convex sets (Equation 6.24), depending on the relative sizes of the boundary sets
and the feature hierarchies.

If the boundary sets are all about the same size, x, then the cost for regular
sets becomes O(x⁴(kwd)²) and the cost for convex sets becomes O(x⁴kwd). The

convex sets are a little more efficient, by a factor of kwd, the cost of comparing two

hypotheses. The regular sets are less efficient, since the comparison is encoded in the

states of the DFA, and these extra states combine multiplicatively when the DFAs

are intersected. If the size of the boundary sets is significantly larger than kwd,

then this additional factor does not have much of an impact, and the complexities of

regular and convex sets are about the same with respect to the combined intersection

and enumeration operation.


When the boundary set sizes differ, so that the S sets are much smaller or larger
than the G sets, regular sets are more efficient. Let the sizes of S₁ and S₂ each be
s, and the sizes of G₁ and G₂ each be g. The complexity of the intersection and
enumeration operation for regular sets is O(s²g²(kwd)²), and the complexity of the
same operation on convex sets is O([s⁴ + g⁴ + s²g + g²s]kwd). This difference is most
likely attributable to the fact that convex set intersection involves a minimization
step that requires O(s⁴ + g⁴) comparisons among the hypotheses of the boundary
sets, whereas regular set intersection has no corresponding minimization step. The
complexity for convex sets is a squared factor greater than the complexity for regular
sets if (s + g)² ≫ kwd. If kwd dominates (s + g)², then convex sets are more efficient
than regular sets.

6.4 Exponential Behavior in CS-KII and RS-KII

For both RS-KII and CS-KII, the worst case complexity for inducing a hypothesis
from n examples is O(2ⁿ). This exponential behavior occurs in CS-KII and RS-KII
for different reasons. The computational cost of inducing a hypothesis in RS-KII
is proportional to the size of the DFA for the solution set. When P is empty,
the solution set is just C. The C set is the intersection of n DFAs, one for each
knowledge fragment. The intersection of two DFAs, A and B, results in a new DFA
with |Q_A||Q_B| states, where |Q_A| is the number of states in A, and |Q_B| is the number
of states in B. Intersecting n DFAs, each of size r, can result in a final DFA with
up to rⁿ states.

For CS-KII, the cost of integrating an example is proportional to the size of the

boundary sets for the current version space. The worst case complexity occurs when

integrating an example causes the boundary sets to grow geometrically. When a new

negative example is processed, there may be more than one way to specialize each

hypothesis in the G set in order to exclude the example. If each hypothesis has two

specializations, and none of these hypotheses need to be pruned from G, then the size

of the G set doubles. Recall that a hypothesis is pruned from G if it is more specific

than some other element of G, or if it is not more general than any element of S. The

S set grows in the same fashion when there is more than one way to generalize the

hypotheses in S in order to cover a new positive example. This kind of growth can


also occur when non-example knowledge fragments are integrated. This geometric

increase in the size of a boundary set is known as fragmentation. After processing n

fragmenting examples or knowledge fragments, the affected boundary set contains 2ⁿ

hypotheses. In a geometric series, the cost of processing the last example dominates

the cost of processing all of the previous examples, so the computational complexity

is O(2ⁿ).

Although both RS-KII and CS-KII can be exponential in the worst case, the com-

plexity of RS-KII is bounded by that of CS-KII for some hypothesis space languages,

such as the conjunctive languages described in Section 6.2.2.1. In one case, the com-

plexity of RS-KII is considerably less than that of CS-KII. This is the hypothesis

space and set of examples described by Haussler [Haussler, 1988], which induces ex-

ponential behavior in CEA (and in IVSM and CS-KII by extension). For this same

task, the complexity of RS-KII is only polynomial in the number of examples.

6.4.1 Haussler's Task

Haussler [Haussler, 1988] describes a hypothesis space and sequence of examples that

cause the boundary sets of CEA to grow geometrically. Both the space complexity

and time complexity of CEA are exponential in the number of examples.

Hypotheses in Haussler's task are conjuncts of feature values, with one value

for each of k features. The possible values for each feature are true, false, and

don't-care. These values are abbreviated as t, f, and d, respectively. For example,

a hypothesis for k = 4 might be (t,f,d,t). The don't-care value is more general

than the true and false values, and neither true nor false is more general than

the other.

The examples for this task consist of k/2 negative examples and a single positive
example. The positive example is (t, t, t, ..., t). The negative examples have two
adjacent features with the value false, and the remaining features have the value
true. For the ith negative example, features 2i − 1 and 2i are false. See Figure 6.8.

The positive example is presented first, followed by the k/2 negative examples in

order.


(f, f, t, t, ..., t, t)
(t, t, f, f, ..., t, t)
        ...
(t, t, t, t, ..., f, f)

Figure 6.8: Haussler's Negative Examples.
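A small sketch (my own) that generates this example sequence for an arbitrary even k, so the behavior of the two representations can be reproduced:

    def haussler_examples(k):
        """One positive example of k 'true' features, followed by k/2 negative examples;
        the i-th negative example has 'false' at features 2i-1 and 2i (1-indexed)."""
        assert k % 2 == 0
        positive = ("t",) * k
        negatives = []
        for i in range(k // 2):
            ex = ["t"] * k
            ex[2 * i] = ex[2 * i + 1] = "f"
            negatives.append(tuple(ex))
        return positive, negatives

    # For k = 6 this yields the positive example and the three negatives of Figure 6.8.
    print(haussler_examples(6))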

6.4.2 Performance of CS-KII and RS-KII

For each negative example in Haussler's task there are two ways to specialize each

element of the G set in order to exclude the negative example, one for for each of the

example's false values. None of the specializations can be pruned from G, so the

size of the G set doubles after processing each negative example. After processing all

k/2 examples, the G set contains 2fc/2 hypotheses. The cost of induction in CS-KII

is proportional to the sizes of the S and G sets, so the cost of inducing a hypothesis

from n = k/2 of these examples is 0(2n).

The cost of inducing a hypothesis from these same examples in RS-KII is only
O(n²). This dramatic change is due to the representational differences between
convex sets and DFAs. Processing one of Haussler's negative examples doubles the
number of most general hypotheses in the version space. This doubles the size of the
convex set representation of the version space, but does not double the size of the
equivalent DFA representation, at least not in this hypothesis space. The size of the
DFA may grow geometrically in other hypothesis spaces, or for other (non-example)
knowledge sources. In this hypothesis space, processing one of Haussler's negative
examples only adds a single state to the DFA. Thus, the size of the DFA grows only
linearly in the number of examples. The initial DFA has k states, so the cost of
processing k/2 negative examples is Σ_{i=1}^{k/2} (k + i) = O(k²). Substituting n for k/2
yields O(n²).
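The sketch below reproduces this behavior end to end. It is my own illustration, not the dissertation's RS-KII code, and it builds the product DFA explicitly rather than intensionally. Hypotheses are strings over {t, f, d}; a positive example constrains every position to d or the observed value, a negative example requires at least one position to hold the complement of its value (the only way not to cover it), and the trim step plays the role of the empty test by discarding states that cannot reach an accept state.

    from collections import deque

    ALPHABET = "tfd"

    def positive_dfa(inst):
        """Accepts hypotheses that cover the positive instance: every position is 'd'
        or equals the instance's value."""
        k, delta = len(inst), {}
        for i, v in enumerate(inst):
            for sym in ALPHABET:
                delta[(i, sym)] = i + 1 if sym in ("d", v) else "dead"
        return {"start": 0, "accept": {k}, "delta": delta}

    def negative_dfa(inst):
        """Accepts hypotheses that exclude the negative instance: at least one position
        must hold the complement of the instance's value."""
        k, comp, delta = len(inst), {"t": "f", "f": "t"}, {}
        for i, v in enumerate(inst):
            for sym in ALPHABET:
                delta[((i, False), sym)] = (i + 1, sym == comp[v])
                delta[((i, True), sym)] = (i + 1, True)
        return {"start": (0, False), "accept": {(k, True)}, "delta": delta}

    def intersect(a, b):
        """Explicit product construction (Figure 6.6), expanded from the start state."""
        start = (a["start"], b["start"])
        delta, states, frontier = {}, {start}, deque([start])
        while frontier:
            q = frontier.popleft()
            for sym in ALPHABET:
                r = (a["delta"].get((q[0], sym), "dead"),
                     b["delta"].get((q[1], sym), "dead"))
                delta[(q, sym)] = r
                if r not in states:
                    states.add(r)
                    frontier.append(r)
        accept = {q for q in states if q[0] in a["accept"] and q[1] in b["accept"]}
        return {"start": start, "accept": accept, "delta": delta}

    def trim(dfa):
        """The empty test's side effect: drop states that cannot reach an accept state."""
        rev = {}
        for (q, _), r in dfa["delta"].items():
            rev.setdefault(r, set()).add(q)
        live, frontier = set(dfa["accept"]), deque(dfa["accept"])
        while frontier:
            q = frontier.popleft()
            for p in rev.get(q, ()):
                if p not in live:
                    live.add(p)
                    frontier.append(p)
        delta = {(q, s): r for (q, s), r in dfa["delta"].items()
                 if q in live and r in live}
        return {"start": dfa["start"], "accept": dfa["accept"],
                "delta": delta, "size": len(live)}

    k = 6
    dfa = trim(positive_dfa(("t",) * k))
    print("after p0:", dfa["size"], "states")
    for i in range(k // 2):
        neg = ["t"] * k
        neg[2 * i] = neg[2 * i + 1] = "f"
        dfa = trim(intersect(dfa, negative_dfa(neg)))
        print(f"after n{i + 1}:", dfa["size"], "states")  # linear growth, not 2^i

Running this prints state counts of roughly k + i + 1 after the i-th negative example, which matches the count derived in Section 6.4.3, while the G set of the convex representation would double at every step.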

An example of RS-KII solving Haussler's task for six features (k = 6) is given

below, followed by a more detailed complexity analysis motivated by this exam-

ple. The detailed analysis includes some additional costs that do not appear in the

simplified analysis above, but the overall complexity is still 0(n2).


6.4.2.1 RS-KII's Performance on Haussler's Task

For this instantiation of Haussler's task there are six features. The examples there-

fore consist of one positive example, and three (k/2) negative examples:

• p₀ = (t,t,t,t,t,t)

• n₁ = (f,f,t,t,t,t)

• n₂ = (t,t,f,f,t,t)

• n₃ = (t,t,t,t,f,f)

Each example is translated into a (C, P) pair, where C is the set of hypotheses

consistent with the example, and P is the empty set. C corresponds to the version

space of hypotheses consistent with the example. Translations of the four examples

are shown in Table 6.1. The corresponding DFAs for the C sets are shown in

Figure 6.9 for the positive example, and in Figure 6.10, Figure 6.11, and Figure 6.12

for the three negative examples. The unshaded nodes are dead states.

Example                  C
p₀ = (t,t,t,t,t,t)       C₀ = (d|t)(d|t)(d|t)(d|t)(d|t)(d|t)
n₁ = (f,f,t,t,t,t)       C₁ = complement of (d|f)(d|f)(d|t)(d|t)(d|t)(d|t)
n₂ = (t,t,f,f,t,t)       C₂ = complement of (d|t)(d|t)(d|f)(d|f)(d|t)(d|t)
n₃ = (t,t,t,t,f,f)       C₃ = complement of (d|t)(d|t)(d|t)(d|t)(d|f)(d|f)

Table 6.1: Translations of Haussler's Examples.

The four examples are integrated into RS-KII one at a time. After integrating

each example, the solution set of the resulting (C, P) pair is tested for emptiness.

If the solution set to (C, P) is empty, then there are no hypotheses consistent with

the examples seen so far, much less with all of the examples, so no more examples

are processed. These actions emulate the behavior of CEA, which integrates each

example one at a time, and after integrating each example tests the resulting version

space for emptiness. If it is empty, then the version space has collapsed, and the

remaining examples are not integrated.


Figure 6.9: DFA for C₀, the Version Space Consistent with p₀.

Figure 6.10: DFA for C₁, the Version Space Consistent with n₁.

The test for emptiness is important for the polynomial behavior of RS-KII. The

empty test identifies and eliminates dead states from the DFA for C in the process

of ascertaining whether C is empty. If these dead states are not removed, then when

the C set for the next example is intersected, those dead states combine multiplica-

tively with the states in C, so that the number of dead states in the intersection is

proportional to the product of the states in C and the number of original dead states.

Figure 6.11: DFA for C₂, the Version Space Consistent with n₂.

Figure 6.12: DFA for C₃, the Version Space Consistent with n₃.

This geometric growth means that the DFA for C₀∩C₁∩...∩C_{k/2} has O(2^{k/2}) states,
most of which are dead states. Intersecting these DFAs takes time proportional to

O(k/2), since intersection is a constant time operation in RS-KII, but enumerating a
hypothesis from the intersection can take time proportional to the number of states
in the DFA, or O(2^{k/2}). The complexity of CEA on this task is also O(2^{k/2}).

However, if the dead states are eliminated after each intersection, then the DFA
for C₀∩C₁∩...∩C_i has only a few more states than the DFA for C₀∩C₁∩...∩C_{i−1}.
This linear growth leads to a polynomial time complexity for RS-KII on this task,
as will be seen below. The final DFA for the intersection of the C sets for all k/2
examples has only O(k) states, and the time spent in eliminating dead states is only
O(k²).

The positive example, p₀, is seen first. It is translated into (C₀, P₀) and integrated
into RS-KII. This is the first knowledge seen, so (C, P) = (C₀, P₀). The negative
examples are then processed one at a time, starting with n₁. Example n₁ is translated
into (C₁, P₁) and integrated with (C, P), yielding (C₀∩C₁, P₀∪P₁).

Constructing the implicit DFAs for C₀∩C₁ and P₀∪P₁ are constant time oper-
ations. The solution set for (C₀∩C₁, P₀∪P₁) is then tested for emptiness. The P
sets are empty for all of the examples in this task, so P₀∪P₁ = ∅, and the solution
set is just C₀∩C₁. Testing whether C₀∩C₁ is empty takes time proportional to the
number of states in the DFA representing the intersection. The DFA for C₀∩C₁ is
shown in Figure 6.13. The unshaded states are dead states.

The empty test tries to find a path from the start state of the DFA to an accept

state. Along the way, it eliminates any dead states it finds. If all of the states are

identified as dead states and eliminated, then the DFA is empty. Removing the dead

states from the DFA for C₀∩C₁ results in the DFA shown in Figure 6.14.


Figure 6.13: DFA for C₀∩C₁.

Figure 6.14: DFA for C₀∩C₁ After Empty Test.

The same process is repeated on the remaining negative examples. Example
n₂ is seen next, and translated into (C₂, P₂). Integrating (C₂, P₂) with (C₀∩C₁, ∅)
yields (C₀∩C₁∩C₂, ∅). The DFA for C₀∩C₁∩C₂ is shown in Figure 6.15. This DFA
is tested for emptiness, and its dead states are removed as a side effect. This results
in the DFA of Figure 6.16.

Figure 6.15: DFA for C₀∩C₁∩C₂.

Figure 6.16: DFA for C₀∩C₁∩C₂ After Empty Test.

Finally, n₃ is seen. Translating and integrating n₃ yields (C = C₀∩C₁∩C₂∩C₃, ∅).
The DFA for C is shown in Figure 6.17. After the empty test removes the dead states,
the DFA for C is as shown in Figure 6.18.

Figure 6.17: DFA for C₀∩C₁∩C₂∩C₃.

Figure 6.18: DFA for C₀∩C₁∩C₂∩C₃ After Empty Test.

C₀∩C₁∩C₂∩C₃ corresponds to the version space consistent with all four examples.
This is the same version space computed by IVSM from these examples, though

represented as a DFA instead of as a convex set. The equivalent convex set is shown

below.

S = {(t,t,t,t,t,t)}

G = {(t,d,t,d,t,d), (t,d,t,d,d,t), (t,d,d,t,t,d), (t,d,d,t,d,t),
     (d,t,t,d,t,d), (d,t,t,d,d,t), (d,t,d,t,t,d), (d,t,d,t,d,t)}

The S set contains one hypothesis, and the G set contains eight hypotheses, each

with six features. This consumes 54 units of space. The equivalent regular set has

eleven states, with three edges per state, for a total of 33 units of space. Edges that

lead directly to a dead state have been omitted from the figures for clarity.

6.4.3 Complexity Analysis

The costs involved in processing the examples are the cost of translating each exam-

ple, the cost of integrating each example, and the cost of applying the empty test to

the solution set of the resulting C set.

The cost of translating example i into (C_i, P_i) is the cost of constructing the
DFAs for C_i and P_i. The cost of constructing one of these DFAs is proportional to
the number of states and edges it has. P_i is always empty, so the cost of constructing


P_i is O(1). The DFA for the C set of the positive example has k + 2 states and
(k + 2)|Σ| edges. The DFA for the C set of each negative example has (2k + 1) states
and (2k + 1)|Σ| edges. The cost of translating the examples is the sum of the sizes
for each DFA:

[(k + 2) + (k + 2)|Σ|] + (k/2)[(2k + 1) + (2k + 1)|Σ|] = (k² + 3k/2 + 2)(|Σ| + 1)   (6.25)

Each example is processed by integrating its translation, (C_i, P_i), into the (C, P)
pair that represents all of the examples integrated so far, and then testing whether
the solution set of (C∩C_i, P∪P_i) is empty. Integration in RS-KII is a constant time
operation, so the cost of integrating all 1 + k/2 examples is O(k).

The cost of Empty((C∩C_i, P∪P_i)) is proportional to the number of states in the
DFA for the solution set of (C∩C_i, P∪P_i). Since P and P_i are always empty in
this task, the solution set is just C∩C_i. The cost of the empty test is therefore
proportional to the number of states in the DFA for C∩C_i.

The positive example is the first one integrated, so the empty test is applied to
(C₀, P₀), the translation of p₀. Since P₀ is empty, this is equivalent to testing whether
C₀ is empty. The DFA for C₀ is minimal, as are the DFAs for all of the examples.
Determining the emptiness of a minimal DFA is a constant time operation, so the
emptiness of C₀ can be determined in constant time.

For a negative example, n_i, the empty test is applied to C∩C_i, where (C_i, ∅)
is the translation for n_i, and C is the set of hypotheses consistent with all of the
previously observed examples (p₀ through n_{i−1}). Determining the emptiness of C∩C_i
is proportional to the number of states in the DFA for C∩C_i. This number can be
expressed as a function of i and the number of features, k. After processing examples
p₀ through n_{i−1}, the DFA for C is of the form shown in Figure 6.19.

The DFA for C consists of i − 1 "diamonds", each with three states, and a chain
of k − 2(i − 1) states. Intersecting the DFA for C with the DFA for C_i results in a
DFA of the form shown in Figure 6.20. This DFA is identical to the DFA for C up to
state i − 1. After that, it has one more "diamond", two chains of length k − 2i, and
a dead state. There are i diamonds, each with three states, for a total of 3i states.


Figure 6.19: DFA for C₀∩C₁∩...∩C_{i−1} After Empty Test.

This does not include the state labeled i. There are two branches of k − 2i states,
and a dead state. The total number of states in C∩C_i is therefore:

3i + 1 + 2(k − 2i) + 1 = 2k − i + 2   (6.26)

Figure 6.20: DFA for C∩C_i.

The lower branch in Figure 6.20 (unfilled circles) consists entirely of dead states,
and is pruned by the empty test. The dead branch has (k − 2i + 1) states, so after
applying the empty test, the number of states in C∩C_i is given by the following
equation:

(2k − i + 2) − (k − 2i + 1) = k + i + 1   (6.27)


The cost of applying the empty test to C∩C_i for all k/2 negative examples is given
by Equation 6.28.

Σ_{i=1}^{k/2} (2k − i + 2) = k² + k − Σ_{i=1}^{k/2} i
                          = k² + k − (k/2)(k/2 + 1)/2
                          = k² + k − (k² + 2k)/8
                          = (7/8)k² + (3/4)k   (6.28)

The cost for the entire induction task is the translation cost (Equation 6.25) plus
the integration cost (O(k)) plus the cost of the empty test (Equation 6.28). The
sum of these costs is given by the following equation:

(k² + 3k/2 + 2)(|Σ| + 1) + O(k) + (7/8)k² + (3/4)k = O(k²|Σ|)

In this task, Σ is {t, f, d} regardless of the number of features, so |Σ| can be treated
as a constant. Substituting n for k/2, we get a total cost of O(n²).

6.4.4 Summary

The improvements of RS-KII over CS-KII on this task are due to representational

differences between convex and regular sets. A convex set stores the entire G set and

S set. In Haussler's task, the size of the G set doubles with each negative example,

so that it contains 0(2n) elements after processing n examples. Representing the

same version space as a DFA requires only a constant number of states. The DFA

effectively stores the version space in factored form. Factored convex-set representa-

tions have been shown to have similar complexity improvements [Subramanian and

Feigenbaum, 1986].

The regular set representation has similar complexity results for a range of hy-

pothesis spaces, including the conjunctive languages with tree-structured features,

and to a lesser degree for conjunctive languages with lattice-structured features. For


the latter hypothesis space, the cross-feature fragmentation is controlled for by the

DFA representation, but there is also within-feature fragmentation that increases the

complexity. However, the complexity is still less than it would be for a completely

unfactored representation.

6.5 Discussion

Regular and convex sets overlap in the knowledge they can represent, but neither

representation strictly subsumes the other. There are many knowledge fragments

that can be expressed as either a regular set or a convex set, but there are also

fragments that can be expressed in only one of the representations and not the

other. The relative computational complexities of RS-KII and CS-KII depend on the

knowledge being utilized, but for conjunctive languages with tree and lattice struc-

tured features, both RS-KII and CS-KII have the same worst-case complexity. RS-

KII effectively represents the version space in factored form [Subramanian and

Feigenbaum, 1986], which makes the cost of enumerating a hypothesis from the

intersection of regular sets cheaper than the cost of intersecting equivalent convex

sets by about a square factor. The factored representation also allows RS-KII to

learn from Haussler's examples [Haussler, 1988] in polynomial time, whereas IVSM

and CEA both take exponential time.


Chapter 7

Related Work

7.1 IVSM

Incremental Version Space Merging (IVSM) [Hirsh, 1990] was one of the first knowl-

edge integration systems for induction, and provided much of the motivation for KII.

IVSM integrates knowledge by translating each knowledge fragment into a version

space of hypotheses consistent with the knowledge, and then intersecting the version

spaces for the fragments to get a version space consistent with all of the knowledge.

This theoretically allows IVSM to utilize any knowledge that can be expressed as a

constraint on the hypothesis space. In practice, IVSM represents version spaces as

convex sets, and this limits the constraints that can be expressed, and therefore the

knowledge that can be used.

In later work, Hirsh suggests using representations other than convex sets for

version spaces, and identifies representations that guarantee polynomial time bounds

on the cost of induction [Hirsh, 1992]. KII expands on this work by extending the

space of set representations for the version space (i.e., C) from the few suggested by

Hirsh to the space of all possible set representations. KII also expands on IVSM by

allowing knowledge to be expressed in terms of preferences as well as constraints,

thereby increasing the kinds of knowledge that can be utilized. Finally, KII facilitates

formalization of the space of set representations by mapping them onto the space of

grammars, and using results from automata and formal language theory to establish

upper bounds on the expressiveness of set representations. KII strictly subsumes

IVSM, in that IVSM can be cast as an instantiation of KII with convex sets (CS-

KII).


7.2 Grendel

Grendel [Cohen, 1992] is another cognitive ancestor of KII. The motivation behind

Grendel is to better understand the effects of inductive biases on learning by express-

ing the biases explicitly in the form of a grammar. Grendel expresses the biases as

hard constraints on the hypothesis space, and as preferences on the order in which

the space is searched. The constraints are expressed as an antecedent description

grammar1, and the preferences are represented by marking productions in the gram-

mar as preferred or deferred. The grammar representing a given collection of biases

is constructed by a translator that takes as input all of the biases and outputs the

grammar. The language of the grammar is the biased hypothesis space. The gram-

mar is then searched for a hypothesis (string) consistent with the examples. The

search is guided by an information gain metric, and the preference markings on the

grammar productions.

Grendel can utilize a wide range of knowledge (biases), but it cannot integrate

knowledge. The integration work is done by the translator, not by Grendel itself. A

translator takes as input all of the biases, and outputs a single grammar. It is not

possible to translate the biases independently and integrate the grammars, as is done

in KII, because the grammar is essentially context free and not closed under inter-

section. In Grendel, each new combination of biases requires a new translator. This

is in contrast to KII, where knowledge fragments can be translated and integrated

independently, so that it is possible to have a single translator for each knowledge

fragment. This allows knowledge to be added or omitted much more flexibly than

in Grendel. This independence also means that the knowledge integration effort

in KII occurs primarily within KII's integration operator, as opposed to occurring

primarily within a single translator, as is the case with Grendel.

KII's greater flexibility in integrating knowledge comes from two sources. First,

constraints can be expressed in languages that are closed under intersection, which

allows constraints to be specified independently and composed via set intersection.

Second, preferences are represented as a separate grammar instead of as orderings on

the productions of the constraint grammar, as is the case in Grendel. This removes

¹ An antecedent description grammar is essentially a context-free grammar.


a source of dependence between preferences and the biases encoded in the constraint

grammar, and provides a potentially more expressive language for the preferences.

KII also removes the somewhat arbitrary distinction between biases and exam-

ples. Grendel treats these as separate entities, but KII treats them both equally.

This uniformity facilitates certain analyses—such as determining an upper bound

on the expressiveness of set representations for which induction is computable—that

would be more difficult if examples were treated differently than other forms of

knowledge. Grendel treats examples in a fixed way, assuming that they are noise-

free and using an information gain metric to select among the strictly consistent

hypotheses. KII allows examples to be translated in more than one way, which per-

mits other assumptions about the examples, such as that they have noise conforming

to the bounded inconsistency model.

7.3 Bayesian Paradigms

Buntine [Buntine, 1991] describes a knowledge integration system in which knowl-

edge is represented as prior probability distributions over the hypotheses. The prior

probability for a hypothesis is its prior probability of being the target concept. The

hypothesis space is then searched for a hypothesis with the highest prior probability.

This is similar to KII, except that knowledge is expressed as probability distribu-

tions instead of as constrained optimization problems. Each of these representations

makes different trade-offs between expressiveness, efficiency, and ease of constructing

translators. Whether the constraint paradigm or the Bayesian paradigm is more

appropriate depends on the available knowledge.

In the Bayesian paradigm, it may be difficult to find priors that adequately

express a piece of knowledge. Solving the probability equations to find the most

probable concept may also be difficult, depending on the distribution. KII is most

appropriate when the knowledge can be easily expressed as constraints and pref-

erences in set representations for which induction is computable, and preferably

polynomial.


7.4 Declarative Specification of Biases

Russell and Grosof [Russell and Grosof, 1987] describe a knowledge integration sys-

tem for induction in which knowledge, in the form of inductive biases, is translated

into determinations. The determinations and the examples deductively imply the

target concept. It is also possible to search for a desired set of determinations, such

as those that do not overconstrain the solution, and deduce the target concept from

these determinations. This yields the ability to dynamically shift the bias.

This system shares with KII the idea that induction is the process of identifying

a hypothesis that is deductively implied by the knowledge. The inductive leaps come

from unsupported assumptions made by the biases (knowledge), and from having to

select a hypothesis arbitrarily when more than one hypothesis is deductively implied

by the knowledge.

A determination expresses a bias, that is, a preference for selecting certain hy-

potheses over others as the target concept. This can certainly be expressed in terms

of constraints and preferences over the hypothesis space. For example, a deter-

mination of the form nationality(?x, ?n) ::> language(?x, ?l) can be trans-
lated along with an example, nationality(Fritz, German), language(Fritz,
German), into a preference for hypotheses that imply language(?x, German) is
true whenever nationality(?x, German) is true. The exact form of the preference

would depend on the hypothesis space and the preference set representation. Deter-

minations are no more expressive than (H, C, P) tuples, but it may be more natural

to express some knowledge in terms of determinations than in terms of constraints

and preferences, and vice versa. As with the Bayesian paradigm, the appropriate-

ness of a given framework depends on how naturally the knowledge at hand can be

expressed in that framework.

KII can have instantiations at various trade-offs points between expressiveness

and complexity. This is useful for investigating the effects of knowledge on induction,

and for generating induction algorithms that guarantee certain time complexities.

The framework of Russell and Grosof cannot make such trade-offs directly. It may

be possible to find restricted determination languages with desirable complexity

properties, but this is not supported in any principled fashion by the framework.

It can, however, shift the bias by selecting which knowledge to utilize. KII has


no equivalent capability, although the choice of set representation does determine

which biases can be utilized, albeit at a much coarser grain than Russell and Grosof's

system.

7.5 PAC learning

The PAC learning literature (e.g., [Vapnik and Chervonenkis, 1971], [Valiant, 1984])

investigates, in part, the conditions under which a concept can be learned in poly-

nomial time from a polynomial number of examples. One of the main results from

this literature is that all concepts in a family of concepts are polynomially learnable

if any given concept in the family can be identified within a given error margin

and confidence level by a polynomial number of examples, and the time needed to

identify a hypothesis consistent with the examples is polynomial in the number of

examples [Valiant, 1984].

The KII framework is concerned with the hypothesis identification half of this

result. KII provides operations for identifying hypotheses consistent with a collection

of knowledge fragments, not just examples. The set representation determines the

complexity of identification, and determines the knowledge that can be expressed. If

the representations for C and P are such that a hypothesis can be enumerated from

the solution set in polynomial time, then the cost of identification is polynomially

bounded for any knowledge expressible in these representations. This complements

the PAC results, which deal with the cost of identifying a hypothesis from examples

only.

The accuracy of a hypothesis induced from examples depends on the Vapnik-

Chervonenkis dimension (VC-dimension) [Vapnik and Chervonenkis, 1971] of the

family of concepts to which the target concept belongs. The VC-dimension deter-

mines how many examples are needed in order to guarantee that the hypotheses

consistent with the examples will have a given level of accuracy. KII has nothing

formal to say as yet about the number of knowledge fragments needed to guarantee

a certain level of accuracy. This is an area for future research. However, accuracy

is generally correlated with the number of correct knowledge fragments, and this

depends in turn on the expressiveness of the set representation. More expressive set


representations can represent more of the available knowledge, and thus the accu-

racy of the learned hypothesis tends to increase with the expressiveness of the set

representation.


Chapter 8

Future Work

8.1 Long Term Vision

The ultimate vision for this work is to provide a general framework for integrating

knowledge into induction. I envision a library of translators for different knowledge

sources, hypothesis spaces, and set representations, and a library of set representa-

tions with which to instantiate KII. This would provide the user maximum flexibility

in deciding which hypothesis space to use, which of the available knowledge to use,

and in what set representation to express the knowledge.

The ability to choose a set representation is particularly important, since it allows

the user to make trade-offs between expressiveness and complexity. If it is impor-

tant to utilize all of the knowledge, a very expressive representation may be most

appropriate. If complexity is more of an issue, an inexpressive representation with

low complexity bounds may be more desirable. The representation also helps the

user evaluate the utility of available knowledge fragments. If a knowledge fragment

requires an expressive representation, but the remaining knowledge can be expressed

in a more restricted representation, then the benefits of using the more expressive

knowledge may not be worth the added complexity.

Instead of omitting overly expensive knowledge altogether, it may be possible

to use an approximation of the knowledge that is easier to express. For example, a

preference for the globally shortest hypothesis consistent with the knowledge may

be difficult to express, but a preference for the locally shortest hypothesis consistent

with the knowledge may be easier to express. The set representation provides a way

to evaluate the cost of various approximations.


8.2 Immediate Issues

The near term goals are to take steps in this direction by developing RS-KII trans-

lators for the knowledge used by additional induction algorithms, and to identify set

representations that guarantee polynomial-time identification.

Developing RS-KII translators for additional knowledge sources is needed to un-

derstand the full scope of RS-KII's expressiveness, and its practicality as an induc-

tion algorithm. The expressiveness of regular sets makes it seem likely that RS-KII

can subsume a number of existing algorithms, but this same expressiveness suggests

that RS-KII's complexity could be worse than that of the original, more specialized

algorithms. RS-KII's complexity with respect to AQ-11 and CEA is close to that

of the original algorithms, and in some cases much better. The expressiveness and

complexity of RS-KII, and thus its ultimate practicality as an induction algorithm,

is an area for future research.

Another key issue is identifying instantiations of KII that can integrate n knowl-

edge fragments and enumerate a hypothesis from the solution set in time polynomial

in n. This would provide a tractable induction algorithm that can potentially utilize

a range of knowledge other than examples. Additionally, the set representation for

the instantiation effectively defines a class of knowledge from which hypotheses can

be induced in polynomial time. This would complement the results in the PAC lit-

erature, which deal with polynomial-time learning from examples only (e.g.,[Vapnik

and Chervonenkis, 1971], [Valiant, 1984], [Blumer et al., 1989]).

Finally, the order in which an induction algorithm searches the hypothesis space

is an implicit bias of the algorithm. Hypotheses that occur earlier in the search are

preferred over those that come later, since an induction algorithm usually selects the

first hypothesis it finds that also satisfies the goal conditions. It is often difficult to

express search orderings in terms of binary preferences between hypotheses. In order

to determine whether one hypothesis comes before another in the search, it is often

necessary to emulate the search. In a few cases, such as best first or hill climbing

in certain hypothesis spaces, it is possible to extract the search order from the

hypotheses themselves. In a best first search, this is a simple matter of determining

which hypothesis has the better evaluation. In a hill climbing search, it is more

awkward, as evidenced by the LEF translator for AQ-11 described in Section 5.2.2.


The issue of expressing search order biases needs to be investigated in more detail.

One way to circumvent this problem is to replace the search-order bias with a bias

that can be more naturally expressed as an (H, C, P) tuple. Since the search order is

often an approximation of a more restrictive bias, an alternate approximation may

be well justified. In the case of AQ-11, the strict bias is to prefer hypotheses that

maximize the LEF. The cost of finding such a hypothesis is prohibitive, so AQ-11

uses a beam search to find a locally maximal hypothesis. It may be possible to find

some other approximation of the LEF that can be expressed more naturally as an

(H, C, P) tuple.


Chapter 9

Conclusions

KII is a framework for integrating arbitrary knowledge into induction. This theo-

retically allows all of the available knowledge to be utilized by induction, thereby

increasing the accuracy of the induced hypothesis. It also allows hybrid induction

algorithms to be constructed by "mixing and matching" the knowledge and implicit

biases of various algorithms.

Knowledge is expressed uniformly in terms of constraints and preferences on

the hypothesis space, which are expressed as sets. Theoretically, just about any

knowledge can be expressed this way, but in practice the set representation deter-

mines what knowledge can be expressed, and the cost of inducing a hypothesis from

that knowledge. This reflects what seems to be an inherent trade-off between the

computational complexity of induction and the breadth of utilizable knowledge.
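
As an illustration only, the sketch below shows the conjoin/disjoin semantics over explicit finite sets of hypotheses; RS-KII itself represents these sets with automata for regular languages, and the hypotheses, constraint, and preference used here are hypothetical:

    # A minimal sketch of the KII solution set over explicit finite sets.
    # Constraints are sets of admissible hypotheses (conjoined by
    # intersection); preferences are binary relations (disjoined: h1 beats
    # h2 if any preference fragment prefers it). The solution set holds the
    # consistent hypotheses that no other consistent hypothesis strictly beats.

    def solution_set(hypothesis_space, constraints, preferences):
        consistent = set(hypothesis_space)
        for c in constraints:
            consistent &= c                                   # conjoin constraints
        def beats(h1, h2):
            return any(p(h1, h2) for p in preferences)        # disjoin preferences
        return {h for h in consistent
                if not any(beats(g, h) and not beats(h, g) for g in consistent)}

    # Toy usage: one constraint keeps hypotheses consistent with a
    # (hypothetical) example; the preference favors shorter hypotheses.
    H = {"a", "ab", "abc", "b"}
    consistent_with_example = {"ab", "abc", "b"}
    shorter = lambda h1, h2: len(h1) < len(h2)
    print(solution_set(H, [consistent_with_example], [shorter]))   # {'b'}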

Instantiations of KII at various trade-off points between complexity and expres-

siveness can be generated by selecting an appropriate set representation. The space

of possible set representations can be mapped onto the space of grammars. This

provides a principled way to investigate the space of possible trade-offs, and to es-

tablish the trade-off limits. One such limit is that (C × C) ∩ P can be at most context

free, which effectively limits C to the regular languages, and P to the context free

languages. If the ability to integrate knowledge is sacrificed, C can be context free

and P can be regular. Otherwise, the solution set is not computable, so it is not

possible to induce a hypothesis from the knowledge.
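
This bound rests on standard closure properties of the Chomsky hierarchy (see [Hopcroft and Ullman, 1979]); the sketch below states only those standard properties, and how C × C is encoded as a language is the dissertation's construction, assumed rather than reproduced here. Emptiness of the intersection of two context-free languages is undecidable, which is consistent with the claim that otherwise the solution set is not computable.

    % Standard closure facts behind the bound (Hopcroft and Ullman, 1979).
    \[
      L_r \cap L_{cf} \in \mathrm{CFL}
        \quad \text{for all } L_r \in \mathrm{REG},\ L_{cf} \in \mathrm{CFL},
      \qquad \text{but} \qquad
      L_1 \cap L_2 \notin \mathrm{CFL} \ \text{in general for } L_1, L_2 \in \mathrm{CFL}.
    \]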

This expressiveness bound also applies to search in general, and by extension

to other induction algorithms. The C set corresponds to the goal conditions, and

the P set to the relative "goodness" of goals. The solution set consists of the best


goals. If C and P are too expressive, then it is not possible to find the best goal.

The goal conditions must be relaxed, or the goodness ordering must be altered to

allow sub-optimal goals. To the extent that other induction algorithms use search

to identify the target concept, they are also limited by these bounds.

The vision motivating KII is the desire to integrate arbitrary knowledge into

induction. The reality is that complexity tends to increase with expressiveness, and

places an ultimate upper bound on expressiveness. RS-KII, an expressive instanti-

ation of KII, was developed to test the range of knowledge that can be practically

expressed and integrated within these bounds.

RS-KII can utilize the knowledge used by two disparate induction algorithms,

AQ-11 and (for some hypothesis spaces) CEA. RS-KII can also utilize noisy examples

with bounded inconsistency, and can utilize domain theories. This knowledge can

be integrated with the knowledge from AQ-11 and/or CEA, thereby forming hybrid

induction algorithms. It is likely that RS-KII can utilize the knowledge utilized

by many other induction algorithms as well. This would allow RS-KII not only

to subsume these algorithms, but also to form hybrid algorithms by "mixing-and-

matching" knowledge from different algorithms.

Since RS-KII is expressive, it can also be computationally expensive. However,

when RS-KII uses only the knowledge used by AQ-11, its computational complexity

is only slightly worse than that of AQ-11. When using only the knowledge of CEA,

the complexity of RS-KII is comparable to that of CEA. For at least one collection

of knowledge (Haussler's examples), the complexity of CEA is exponential, but the

complexity of RS-KII is only polynomial. Similar results may obtain when using the

knowledge of other induction algorithms, but developing translators for additional

knowledge sources is an area for future work.


Reference List

[Aho et al., 1974] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.

[Blumer et al., 1989] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965, 1989.

[Breiman et al., 1984] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.

[Bundy et al., 1985] A. Bundy, B. Silver, and D. Plummer. An analytical comparison of some rule-learning programs. Artificial Intelligence, 27, 1985.

[Buntine, 1991] W. Buntine. Classifiers: A theoretical and empirical study. In 12th International Joint Conference on Artificial Intelligence, pages 638-644, Sydney, 1991.

[Chomsky, 1959] N. Chomsky. On certain formal properties of grammars. Information and Control, 2, 1959.

[Clark and Niblett, 1989] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(?):261-283, 1989.

[Cohen, 1992] W. W. Cohen. Compiling prior knowledge into an explicit bias. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Workshop, pages 102-110, Aberdeen, 1992.

[DeJong and Mooney, 1986] G. F. DeJong and R. Mooney. Explanation-based learning: An alternative view. Machine Learning, 1(2):145-176, 1986.

[Dietterich et al., 1982] T. Dietterich, B. London, K. Clarkson, and G. Dromey. Learning and inductive inference. In P. Cohen and E. Feigenbaum, editors, The Handbook of Artificial Intelligence, Volume III. William Kaufmann, Los Altos, CA, 1982.

[Fisher, 1936] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.


[Flann and Dietterich, 1989] N. S. Flann and T. G. Dietterich. A study of explanation-based methods for inductive learning. Machine Learning, 4:187-226, 1989.

[Gordon and desJardins, 1995] D. Gordon and M. desJardins. Evaluation and selection of biases in machine learning. Machine Learning, 20(1/2):5-22, 1995.

[Haussler, 1988] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221, 1988.

[Hirsh, 1990] H. Hirsh. Incremental Version Space Merging: A General Framework for Concept Learning. Kluwer Academic Publishers, Boston, MA, 1990.

[Hirsh, 1992] H. Hirsh. Polynomial-time learning with version spaces. In AAAI-92: Proceedings, Tenth National Conference on Artificial Intelligence, pages 117-122, San Jose, CA, 1992.

[Hopcroft and Ullman, 1979] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.

[Kumar and Laveen, 1983] V. Kumar and K. Laveen. A general branch and bound formulation for understanding and synthesizing and/or tree search procedures. Artificial Intelligence, 21:179-198, 1983.

[Kumar, 1992] V. Kumar. Search, branch and bound. In Encyclopedia of Artificial Intelligence, pages 1000-1004. John Wiley & Sons, Inc., second edition, 1992.

[Michalski and Chilausky, 1980] R.S. Michalski and R.L. Chilausky. Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Policy Analysis and Information Systems, 4(3):219-244, September 1980.

[Michalski, 1974] R.S. Michalski. Variable-valued logic: System VL1. In Proceedings of the Fourth International Symposium on Multiple-Valued Logic, Morgantown, West Virginia, May 1974.

[Michalski, 1978] R.S. Michalski. Selection of most representative training examples and incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of programs ESEL and AQ11. Technical Report 877, Department of Computer Science, University of Illinois, Urbana, Illinois, May 1978.

[Mitchell et al., 1986] T. Mitchell, R. Keller, and S. Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1:47-80, 1986.


[Mitchell, 1980] T.M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, 1980.

[Mitchell, 1982] T.M. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203-226, March 1982.

[Pagallo and Haussler, 1990] G. Pagallo and D. Haussler. Boolean feature discovery in empirical learning. Machine Learning, 5:71-99, 1990.

[Pazzani and Kibler, 1992] M. J. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Machine Learning, 9:57-94, 1992.

[Pazzani, 1988] M. Pazzani. Learning causal relationships: An integration of empirical and explanation-based learning methods. PhD thesis, University of California, Los Angeles, CA, 1988.

[Quinlan, 1986] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.

[Quinlan, 1990] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239-266, 1990.

[Rosenbloom et al., 1993] P. S. Rosenbloom, H. Hirsh, W. W. Cohen, and B. D. Smith. Two frameworks for integrating knowledge in induction. In K. Krishen, editor, Seventh Annual Workshop on Space Operations, Applications, and Research (SOAR '93), pages 226-233, Houston, TX, 1993. Space Technology Interdependency Group. NASA Conference Publication 3240.

[Russell and Grosof, 1987] S. Russell and B. Grosof. A declarative approach to bias in concept learning. In Sixth National Conference on Artificial Intelligence, pages 505-510, Seattle, WA, 1987. AAAI.

[Subramanian and Feigenbaum, 1986] D. Subramanian and J. Feigenbaum. Factorization in experiment generation. In Proceedings of the National Conference on Artificial Intelligence, pages 518-522, Philadelphia, PA, August 1986.

[Valiant, 1984] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[Vapnik and Chervonenkis, 1971] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

[Warshall, 1962] S. Warshall. A theorem on Boolean matrices. Journal of the Association for Computing Machinery, 9(1):11-12, 1962.


[Winston et al., 1983] P. Winston, T. Binford, B. Katz, and M. Lowry. Learning physical descriptions from functional definitions, examples, and precedents. In Proceedings of the National Conference on Artificial Intelligence, pages 433-439, Washington, D.C., August 1983. AAAI, Morgan-Kaufmann.
