Towards Representation Independence in PAC Learning

Manfred K. Warmuth*

Department of Computer and Information Sciences

University of California

Santa Cruz, California 95064

Abstract

In the recent development of various models of learning inspired by the PAC learning model (introduced by Valiant) there has been a trend towards models which are as representation independent as possible. We review this development and discuss the advantages of representation independence. Motivated by the research in learning, we propose a framework for studying the combinatorial properties of representations.

1 Introduction

In Complexity Theory one always tries to find definitions that are independent of the particular representational details. We overview recent research in Computational Learning Theory in view of the objective of obtaining representation independent notions of learnability. We restrict our discussion to one family of learning models (called PAC learning models in [3]) whose original version was introduced by Valiant [39] (see Section 2 for definitions). In the PAC learning models the learning algorithm is given only a polynomial amount of resources (such as examples and running time) to achieve the learning task.

Usually there are two representations required for specifying a learning problem. The first one is the representation of the hidden concept that is to be learned and the second one is the hypothesis the learning algorithm produces. Restricting the hypothesis of the algorithm to be of a certain form frequently leads to infeasibility results that are very dependent on the particular restrictions used [32] (reviewed in Section 3). This has motivated the introduction of a definition of learning (called prediction) in which the hypothesis is not required to be of any particular form [20] (Section 4). (See [4,5,6,8,30,34,35] for related models of prediction.) The definition of prediction from [20] naturally leads to a notion of reduction (reviewed in Section 5) between representation classes which preserves predictability with a polynomial amount of resources [33]. Various complete problems with respect to this definition of reduction have been shown to be hard to predict modulo cryptographic assumptions [27,33] (Section 6). In Section 7 we discuss dropping restrictions on the representation of the hidden concept to be learned.

In cases where no reduction has been found from one prediction problem to another, we would like to be able to prove that no such reduction is possible. In Section 8 we introduce a methodology that leads to such results and to a study of the combinatorial properties of representations in their own right. We conclude with a discussion of the usefulness of this methodology and give a large number of open problems (Section 9).

*Supported in part by ONR grant N00014-86-K-0454.

2 The definition of PAC learning

In this section we describe the PAC learning model which was introduced by Valiant [39]. By now many variants of the original model have appeared in the literature ([20] gives a comparison of the most common variants). We define a particular variant below and motivate our choice.

Definition 2.1 An alphabet Σ is any set. Elements of alphabets are called symbols. A word w over an alphabet Σ is any finite length sequence of symbols of Σ (|w| denotes its length). Σ* denotes all such words and λ stands for the empty sequence (word), which has length zero. A concept or language c over an alphabet Σ is any subset of Σ*. A concept class C over an alphabet Σ is any set of concepts, i.e. C ⊆ 2^{Σ*}.

Throughout this paper we use Σ to denote the alphabet over which concepts are defined.

The goal of the PAC learning model is to characterize those concept classes that are "learnable"

with a polynomial amount of resources. (The acronym PAC abbreviates "probably approximately

correct" [3].) The purpose of this paper is to study representation independence in the PAC learning

model. Observe that concepts correspond to {0, 1}-valued functions. In some recent work [18]

Haussler generalizes the PAC model to learning real-valued functions. Many interesting issues arise

already in the case of learning concepts and we choose to restrict ourselves to that case.

Intuitively we say that a concept class C is "polynomially learnable" if there is a polynomial time (possibly randomized) algorithm that, given polynomially many randomly generated elements of Σ*, and told for each word whether or not the word is in some unknown target concept c ∈ C, produces a "hypothesis" h ∈ C that approximates c accurately with high probability.

We will formalize the above intuitive definition. First, we would like the amount of resources (such as the number of examples and running time) used by the learning algorithm to grow polynomially in the "complexity" of the unknown target concept c ∈ C. Concepts are possibly infinite languages and thus a reasonable measure of complexity is the length (number of letters) of the description (representation) of the concept in some given representation language. The question of whether a concept class is polynomially learnable will depend on what representation language for the concepts we have chosen. A concept class might be polynomially learnable with respect to one representation but not with respect to another. For example, the question "Are regular languages polynomially learnable?" is not well specified. A suitable question would be "Are DFAs polynomially learnable?". Observe that learning NFAs should be harder than learning DFAs since there are regular languages whose NFA representation is exponentially more concise than their DFA representation. This discussion motivates the following definition.

Definition 2.2 A representation class is a four-tuple (R, Γ, c, Σ) such that

Γ is an alphabet and R a language over Γ, called the representation language.

Σ is an alphabet for the concepts represented by R and c is a mapping from R to concepts over Σ*. For each representation r ∈ R, the concept c(r) ⊆ Σ* is the concept represented by r.

The concept class represented by (R, Γ, c, Σ) is the set {c(r) : r ∈ R}.

For r ∈ R, the complexity of the representation r is max({1, |r|}).

Throughout this paper the letter Γ denotes the alphabet used for representations. Both the concept alphabet Σ and the representation alphabet Γ may be infinite. For example, if concepts are subsets of the Euclidean n-dimensional real space ℝ^n, then Σ and Γ might contain the set of all reals ℝ. However, note that in our definition of complexity of representations, each real in a word of Σ* or Γ* accounts for one unit. This corresponds to using the unit-cost model of computation [2]. For syntactical reasons we assume that there is a special symbol "$" such that $ ∉ Σ ∪ Γ always holds.

In most cases, given a representation class (R, Γ, c, Σ), the alphabets Γ and Σ and the concept mapping c will be implicitly understood; hence we use R as an abbreviation for the whole class represented by (R, Γ, c, Σ).

Associated with any representation class (R, Γ, c, Σ) is an evaluation problem, which is that of determining, given an arbitrary r ∈ R and w ∈ Σ*, whether or not w ∈ c(r). This is defined formally as a language:

Definition 2.3 The evaluation problem for a representation class (R, Γ, c, Σ) is the language {r$w : w ∈ c(r)}, where $ is the special symbol that is not in Σ ∪ Γ. We denote the defined language as E(R, Γ, c, Σ) and use E(R) as shorthand.

Throughout this paper we assume that for all representation classes considered the evaluation problem is in P (polynomial time).
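The following is a minimal sketch, not taken from the paper, of Definitions 2.2 and 2.3 in code: a concept mapping c together with a membership test w ∈ c(r). The class of monomials over n Boolean variables is used as a toy example, and all function and literal names are illustrative choices of this sketch.

```python
def monomial_evaluate(r, w):
    """Evaluation problem for monomials: r is a list of literals such as
    ['x1', '-x3'], w is a word over {0,1}; decide whether w is in c(r)."""
    for lit in r:
        negated = lit.startswith('-')
        index = int(lit.lstrip('-x')) - 1      # variables are 1-based
        value = w[index] == '1'
        if value == negated:                   # this literal is not satisfied
            return False
    return True

if __name__ == '__main__':
    r = ['x1', '-x3']                          # the monomial x1 AND NOT x3
    print(monomial_evaluate(r, '101'))         # False (x3 is set to 1)
    print(monomial_evaluate(r, '100'))         # True
```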

In the following definitions of particular representation classes, we use the word "encodes" to abbreviate "encodes with respect to some fixed, standard encoding scheme." As usual, if integers are used to define the concepts, then we must explicitly mention whether they are to be encoded in unary or binary.

Conventions: For all Boolean representation classes given below, the concept alphabet is {0,1}, and for simplicity we assume that each Boolean concept b is a subset of {0,1}^n for some positive integer n. Each representation r such that c(r) = b is assumed to contain n in binary. The number n indicates the number of variables of the concept. These variables have indices in the range from one to n which are encoded in binary, and a bit pattern in {0,1}^n encodes an assignment to the variables.

For representation classes where each concept is a language over some finite alphabet (such as DFAs, NFAs, CFGs), the concept alphabet Σ is the infinite "universal set of symbols" which is assumed to contain all finite alphabets (except for the symbol "$"). The representation of a language specifies the finite subalphabet of Σ used for the particular language.

Below we define the representation classes used in this paper.

• R_BF = {r : r encodes a Boolean formula}. For a formula r of n variables the concept c(r) consists of all assignments in {0,1}^n that satisfy the formula.

• R_DNF = {r : r encodes a Boolean formula in disjunctive normal form}.

• R_CNF = {r : r encodes a Boolean formula in conjunctive normal form}.

• R_{k-term DNF} = {r : r encodes a DNF formula with at most k terms}.

• R_{k-clause CNF} = {r : r encodes a CNF formula with at most k clauses}.

• R_{k-DNF} = {r : r encodes a DNF formula where each term has at most k literals}.

• R_{k-CNF} = {r : r encodes a CNF formula where each clause has at most k literals}.

• R_monomials = R_{1-term DNF} = R_{1-CNF} and R_clauses = R_{1-clause CNF} = R_{1-DNF}.

• R_XORsum = {r : r encodes the exclusive OR of a set of literals}.

• R_DT = {r : r encodes a Boolean decision tree}. A Boolean decision tree is a binary tree where each internal node is labeled with a variable and each leaf with either 1 or 0. The two branches at each internal node represent the two possible settings of the variable at that node. For any assignment of the variables there is a unique path to a leaf in the tree. The concept c(r) consists of all assignments leading to leaves labeled with 1.

• R_CIRC = {r : r encodes an acyclic Boolean circuit}, where if r has n inputs, then c(r) is the set of assignments in {0,1}^n accepted by the circuit encoded by r.

• R_DFA = {r : r encodes a DFA} is a set of representations (for the concept class of regular languages) with the implicit mapping c such that for any r, c(r) is the concept (language) accepted by the DFA encoded by r.

• R_NFA = {r : r encodes an NFA}.

• R_CFG = {r : r encodes a CFG in Chomsky normal form}.

• R_SINGLE = {r : r encodes one bit pattern}. Let u be the bit pattern represented by r. Then c(r) = {0,1}^{|u|} − {u}.

• R_PAIRS = {r : r encodes two distinct bit patterns of the same length}. Let u and v be the bit patterns represented by r. Then c(r) = {0,1}^{|u|} − {u, v}.

We conclude this list of representation classes with a class for which the concept alphabet and the representation alphabet consist of the set of reals ℝ. The defined concepts are isothetic boxes in ℝ^n, i.e. each facet of such a box is parallel to a hyperplane spanned by a subset of n − 1 of the n coordinate axes.

• R_BOXES = {r : r is a sequence (word) of an even number of reals}. If r = x_1 x_2 ... x_n, then the concept c(r) is the following cross product of closed intervals of ℝ: [x_1, x_2] × [x_3, x_4] × ... × [x_{n−1}, x_n]. Note that c(r) ⊆ ℝ^{n/2}.
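As a small illustration (names assumed, not the paper's), the concept mapping of the box class above can be evaluated coordinate-wise; each real counts as one symbol, matching the unit-cost convention.

```python
def box_contains(r, point):
    """r = [x1, x2, x3, x4, ...] encodes [x1,x2] x [x3,x4] x ...;
    point is a word of n/2 reals."""
    assert len(r) % 2 == 0 and len(point) == len(r) // 2
    intervals = zip(r[::2], r[1::2])              # consecutive (lo, hi) pairs
    return all(lo <= p <= hi for p, (lo, hi) in zip(point, intervals))

if __name__ == '__main__':
    r = [0.0, 1.0, -2.0, 2.0]                     # the box [0,1] x [-2,2]
    print(box_contains(r, [0.5, 1.5]))            # True
    print(box_contains(r, [1.5, 0.0]))            # False
```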

A learning algorithm for a representation class is given words labeled according to an unknown target representation in the class, which are called examples:

Definition 2.4 Let (R, Γ, c, Σ) be a representation class. For any representation r ∈ R, and for any word w ∈ Σ*, let label_{c(r)}(w) = "+" if w ∈ c(r) and "−" if w ∉ c(r). An example of the concept c(r) (respectively the representation r) is a pair (w, label_{c(r)}(w)). A sequence of examples is consistent with c(r) (respectively r) if all examples in the sequence are examples of c(r). An unlabeled example is just a word w.

We have already mentioned that the resources of the learning algorithm are allowed to grow with the complexity of the representation from which it receives examples. Formally this is done by giving the learning algorithm a parameter s, which is an upper bound on the length of the target representation r according to which the examples are labeled. The number of examples is allowed to grow polynomially in this parameter s.

The example words given to the learning algorithm are drawn at random. However, when learning a representation class whose concepts may be infinite sets of words of unbounded length, it is also reasonable to allow the number of examples required by the learning algorithm to grow polynomially with the length of the longest example word seen (here let the empty word have length one). Note that if the concept alphabet contains the set of reals ℝ, then a point in ℝ^n may be encoded by a word of n symbols of ℝ, and thus not every infinite concept contains words of unbounded length. However, the regular language a*b* over the alphabet {a, b} is a set of words of unbounded length.


Since the example words are drawn at random it is difficult to specify the dependence on the length of the longest example. One might let the number of examples grow polynomially in the expectation and the variance of the longest length, where these two parameters are explicit inputs to the learning algorithm. We choose a simpler convention which is motivated in more detail in [33]. We assume that the learning algorithm only receives example words up to a given maximum length n (the probability of all longer words is zero), and the algorithm is given n as a parameter.

Definition 2.5 For any language L and n ≥ 1, let L^{[n]} = {w ∈ L : |w| ≤ n}.

The example words are drawn according to an arbitrary but fixed distribution on Σ^{[n]} and the number of examples of the learning algorithm is allowed to grow polynomially in the input parameters n and s (the upper bound on the complexity of the representation described above). In the high level definition of a learning algorithm given at the beginning of this section we required that the hypothesis produced by the learning algorithm must be "accurate with high probability". This is made precise using the following definitions.

Definition 2.6 Let (R, Γ, c, Σ) be a representation class. For two representations r, r′ ∈ R, r Δ r′ denotes the set (c(r) − c(r′)) ∪ (c(r′) − c(r)), i.e. the symmetric difference of the concepts defined by r and r′. Let n be a maximum word length and P be a probability distribution on Σ^{[n]}. Then the error of a hypothesis representation r′ with respect to a target representation r is defined to be the probability P(r Δ r′).

Intuitively, the error of the hypothesis is the probability that it labels a random word differently than the target representation. The learning algorithm is given two additional parameters ε and δ, both in the interval (0, 1), and is required to produce a hypothesis whose error is at most ε with probability at least 1 − δ. The number of examples is allowed to grow polynomially in 1/ε and 1/δ.
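The next sketch is not part of the formal model; it merely estimates the error of Definition 2.6 by Monte Carlo sampling, with the target concept, hypothesis concept, and distribution all chosen for illustration.

```python
import random

def estimate_error(target, hypothesis, sample_word, trials=10000):
    """Estimate P(r delta r'): the probability that the hypothesis labels a
    random word differently from the target."""
    disagreements = sum(target(w) != hypothesis(w)
                        for w in (sample_word() for _ in range(trials)))
    return disagreements / trials

if __name__ == '__main__':
    # target: the monomial x1; hypothesis: x1 AND x2; uniform P on {0,1}^3
    target = lambda w: w[0] == '1'
    hypothesis = lambda w: w[0] == '1' and w[1] == '1'
    sample = lambda: ''.join(random.choice('01') for _ in range(3))
    print(estimate_error(target, hypothesis, sample))   # about 0.25
```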

Below we give a formal summary of the model discussed.

Definition 2.7 A learning algorithm A is an algorithm that takes as input four parameters s, n ∈ ℕ, ε, δ ∈ (0, 1) and a collection of examples of Σ^{[n]} × {+, −}, for some concept alphabet Σ. A outputs a representation of a hypothesis, and it is a polynomial time learning algorithm if there exists a polynomial t such that the run time of A is at most t(s, n, 1/ε, 1/δ, l), where l is the total length of the input of A.

Definition 2.8 The representation class (R_1, Γ_1, c_1, Σ_1) is polynomially learnable in terms of the representation class (R_2, Γ_2, c_2, Σ_2) iff there exists a polynomial time learning algorithm A and a polynomial p such that for all input parameters s, n ∈ ℕ and ε, δ ∈ (0, 1), for all r_1 ∈ R_1^{[s]}, and for all probability distributions¹ on Σ_1^{[n]}, if A is given at least p(s, n, 1/ε, 1/δ) randomly generated examples of (the unknown target concept) c_1(r_1), then A produces a hypothesis in R_2 which with probability at least 1 − δ has error at most ε. (The class (R_1, Γ_1, c_1, Σ_1) is called the target class and (R_2, Γ_2, c_2, Σ_2) the hypothesis class.)

¹The allowed distributions must fulfill some relatively benign measure-theoretic assumptions discussed in the Appendix of [9].

In the original definition of polynomial learnability the target class and the hypothesis class of the algorithm were required to be identical [39]. A number of polynomial learning algorithms have been found which show that a class R is learnable in terms of itself [1,9,12,23,26,37,39]. However, for many useful, known learning algorithms, the target class and the hypothesis class are not the same and the learning algorithm uses that fact [9,19,30,32]. Recall that in this paper we only consider representation classes whose evaluation problem is in P.
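As a concrete illustration of a class that is learnable in terms of itself, the following is a simplified sketch in the spirit of the monomial algorithm of [39]: start with the monomial containing all 2n literals and delete every literal falsified by a positive example. Error and confidence bookkeeping is omitted, and the names below are illustrative.

```python
def learn_monomial(n, examples):
    """examples: iterable of (assignment, label) with assignment in {0,1}^n
    and label '+' or '-'.  Returns a hypothesis monomial as a set of literals."""
    hypothesis = {f'x{i}' for i in range(1, n + 1)} | \
                 {f'-x{i}' for i in range(1, n + 1)}
    for w, label in examples:
        if label == '+':                     # keep only literals w satisfies
            hypothesis -= {f'-x{i+1}' if bit == '1' else f'x{i+1}'
                           for i, bit in enumerate(w)}
    return hypothesis                        # negative examples are ignored

if __name__ == '__main__':
    exs = [('110', '+'), ('100', '+'), ('001', '-')]
    print(sorted(learn_monomial(3, exs)))    # ['-x3', 'x1'] remain
```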

3 Hardness results for learning with particular hypothesis classes

In this section we review the background leading to infeasibility results of the following type: A representation class R is not polynomially learnable in terms of the hypothesis class R_1 modulo the very likely complexity-theoretic assumption² that RP ≠ NP, but at the same time R is polynomially learnable in terms of a different hypothesis class R_2. Thus learnability very much depends on how the hypotheses are represented.

Definition 3.1 [20] A random polynomial time hypothesis finder (r-poly hy-fi) for R_1 in terms of R_2 is a randomized polynomial time algorithm A that for all input parameters s, n ∈ ℕ, all representations r_1 ∈ R_1^{[s]}, and all sets of examples of c_1(r_1) with words in Σ^{[n]}, produces with probability at least γ (for some fixed γ > 0) a hypothesis representation in R_2 that is consistent with the examples.

Theorem 3.2 [20,32] If R_1 is polynomially learnable in terms of R_2, then there exists an r-poly hy-fi for R_1 in terms of R_2.

Proof sketch: Let A be a polynomial learning algorithm for R_1 in terms of R_2. We construct an r-poly hy-fi B from the learning algorithm A. B receives the input parameters s, n, and an example sequence U. B poses a particular learning problem to A which uses the same parameters s and n. All examples not in U have probability zero and the examples in U all have equal probability. The parameter δ (which corresponds to γ) is set to 1/2 and ε is set to 1/(q+1), where q is the number of distinct examples in U. Note that ε is smaller than the probability of any example in U. It is easy to see that when A is applied to this learning problem then it must produce a consistent hypothesis (since ε is small) with probability at least 1/2. □
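The following sketch mirrors the construction in the proof sketch above under stated assumptions: `learner` is any learning algorithm that requests labeled examples through a callback, and B turns it into a randomized finder of hypotheses consistent with a fixed sample U. The interface is hypothetical, chosen only for illustration.

```python
import random

def hypothesis_finder(learner, s, n, examples):
    """Turn a learner into a consistent-hypothesis finder for the sample
    'examples' (the sequence U of the proof sketch)."""
    q = len(set(examples))                  # number of distinct examples in U
    eps = 1.0 / (q + 1)                     # below the weight 1/q of any example
    delta = 0.5                             # corresponds to the success prob. gamma
    draw = lambda: random.choice(examples)  # uniform distribution on U, zero elsewhere
    # In the full construction the learner sees p(s, n, 1/eps, 1/delta) examples.
    return learner(s, n, eps, delta, draw)  # consistent with probability >= 1/2
```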

²RP consists of all languages accepted by probabilistic polynomial time Turing machines with one-sided error bounded away from 0, as defined in [15].


An example (from [32]) of how this lemma can be applied is the following: for each constant k ≥ 2, it is NP-complete to decide for a given sequence of examples whether there is a consistent k-term DNF formula. Thus if the class R_{k-term DNF} was learnable by the same class of representations, then by the above lemma there would exist a randomized polynomial algorithm for an NP-complete problem and thus RP = NP. However, R_{k-term DNF} can be learned in terms of R_{k-CNF} [32]. Similarly, it can be shown that the Boolean threshold functions³ cannot be learned by the same class unless RP = NP [32]. Again it is easy to learn Boolean threshold functions in terms of halfspaces using linear programming [9]. (The algorithm Winnow of [30] can also be used to learn Boolean threshold functions in terms of halfspaces.)

Other infeasibility results for learning with particular hypothesis classes which are based on the assumption that RP ≠ NP are given in [1,7,19,29].

4 Learning independent of the representation of the hypothesis

Motivated by the infeasibility results discussed in the previous section, a notion of learnability (called predictability) is defined in [20] that is independent of the representation of the hypothesis.

Definition 4.1 A representation class R_1 is polynomially predictable iff there exists a representation class R_2 (whose evaluation problem is in P) such that R_1 is polynomially learnable in terms of R_2.

From now on we use the following definition of polynomially predictable (which is shown to be equivalent in [20]). In this definition, the prediction algorithm is not required to output a hypothesis.

Definition 4.2 A prediction algorithm A is an algorithm that takes as input three parameters s, n ∈ ℕ, ε ∈ (0, 1), a collection of elements of Σ^{[n]} × {+, −} (for some concept alphabet Σ), and an element w ∈ Σ^{[n]}. The output of A is either "+" or "−", indicating its prediction for w. A is a polynomial time prediction algorithm if there exists a polynomial t such that the run time of A is at most t(s, n, 1/ε, l), where l is the total length of the input of A.

Definition 4.3 The representation class (R, Γ, c, Σ) is polynomially predictable iff there exists a polynomial time prediction algorithm A and a polynomial p such that for all input parameters s, n, and ε > 0, for all r ∈ R^{[s]}, and for all probability distributions on Σ^{[n]}, if A is given at least p(s, n, 1/ε) randomly generated examples of (the unknown target concept) c(r), and a randomly generated unlabeled word w ∈ Σ^{[n]}, then the probability that A incorrectly predicts label_{c(r)}(w) is at most ε.

³Such a function is given by a clause and a threshold k. The function is true for all assignments that make at least k literals in the clause true.


Note that in all definitions of learnability and predictability given in this paper the parameters s, n, ε, and in the case of learnability the parameter δ as well, are explicitly given as inputs to the algorithms. However, the notions of polynomial learnability and predictability are not weakened if these parameters are not inputs of the algorithm but the number of examples and the running time are still allowed to depend polynomially on the parameters. (See [20,33] for a more formal treatment.)

The above notion of prediction has various advantages over learnability. First, there is a natural notion of reductions among representation classes that preserve polynomial predictability [33]. (See next section.) The reductions lead to complete problems for sets of representation classes, and for some of these complete problems infeasibility results have been shown [27,33]. (These results are reviewed in Section 6.) Recall that if a representation class is not polynomially predictable, then this implies that it is not polynomially learnable in terms of any representation class whose evaluation problem is in P. Thus infeasibility results for polynomial predictability are very strong and useful.

5 Prediction preserving reductions

Motivated by the reductions given in [30], Pitt and Warmuth [33] introduced a general notion of reduction among representation classes that preserves polynomial predictability. Thus if there is a reduction from R_1 to R_2, then the existence of a polynomial prediction algorithm A_2 for R_2 implies the existence of a polynomial prediction algorithm A_1 for R_1. The algorithm A_1 is constructible from A_2 and from the reduction from R_1 to R_2.

We would like to motivate the definition of reduction by appealing to our intuition of how one can predict R_1 given a prediction algorithm for R_2. When A_1 is given examples of R_1 then it should be able to transform these examples to examples of R_2. Thus our definition of a reduction provides a function f (called the word transformation) which is employed as follows: when A_1 receives the example (x, label) then it forwards (f(x), label) to A_2. After receiving a number of labeled examples, A_1 is given an unlabeled word w. Again A_1 passes the corresponding word f(w) to A_2. Hopefully A_1 will be able to simply use the prediction of A_2 on f(w). For this to work there must exist a second transformation g (called the representation transformation) which maps any representation r_1 ∈ R_1 to a related representation g(r_1) ∈ R_2 with the property that for all words w_1 ∈ Σ_1*,

w_1 ∈ c_1(r_1) iff f(w_1) ∈ c_2(g(r_1)),

where c_i is the implicit concept mapping of R_i. We outlined how a prediction algorithm A_2 for R_2 might be used to obtain a prediction algorithm A_1 for R_1. However, the constructed algorithm A_1 is a polynomial prediction algorithm only if certain requirements are met for the two transformations f and g. The word transformation clearly must be computable in polynomial time. For the representation transformation much weaker conditions apply. We only need that g is length preserving within a polynomial. The representation transformation g does not have to be computable at all.

Below we give the formal definition of reduction. Since the running time of a prediction algorithm is allowed to grow polynomially in s and n, we allow f to grow polynomially in these parameters as well. There are many reductions [33] where the word transformation makes use of the parameters s and n. Similarly, we let g also depend on the additional parameter n. (It is not necessary to provide g with the parameter s, which is a bound on |r_1|, since r_1 is already a parameter of g.) Note that we did not let f and g grow polynomially in 1/ε. The reason for this is that so far we have found no reductions that would make use of such a dependence.

Definition 5.1 A function h, all of whose inputs and outputs are either words or natural numbers (encoded in unary notation), is called polynomially length preserving if there is a fixed polynomial q such that all inputs to the function h of length at most l produce outputs of length at most q(l).

Definition 5.2 Let (R_1, Γ_1, c_1, Σ_1) and (R_2, Γ_2, c_2, Σ_2) be a pair of representation classes. Then, with respect to this pair, a word transformation is a function f : Σ_1* × ℕ × ℕ → Σ_2* and a representation transformation is a function g : R_1 × ℕ → R_2.

Definition 5.3 For two representation classes (R_1, Γ_1, c_1, Σ_1) and (R_2, Γ_2, c_2, Σ_2), the first one reduces to the second one (denoted by (R_1, Γ_1, c_1, Σ_1) ⊴ (R_2, Γ_2, c_2, Σ_2)) iff there is a polynomially length preserving word transformation f and a polynomially length preserving representation transformation g such that for all s, n ∈ ℕ, r_1 ∈ R_1^{[s]}, and w_1 ∈ Σ_1^{[n]},

(1) w_1 ∈ c_1(r_1) iff f(w_1, s, n) ∈ c_2(g(r_1, n));

(2) f is computable in time t(|w_1|, s, n), for some fixed polynomial t.
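The sketch below shows how a reduction (f, g) in the sense of Definition 5.3 is used: a prediction algorithm A_1 for R_1 forwards transformed examples to a prediction algorithm A_2 for R_2 and returns A_2's prediction unchanged. The names `predictor2` and `word_transform` are assumptions of this illustration, not the paper's interfaces.

```python
def make_predictor1(predictor2, word_transform):
    """Compose a predictor for R_2 with a word transformation f to obtain a
    predictor for R_1."""
    def predictor1(s, n, eps, labeled_examples, w):
        transformed = [(word_transform(x, s, n), label)
                       for x, label in labeled_examples]
        # Requirement (1) guarantees the transformed examples are consistent
        # with g(r1, n), so A2's prediction for f(w, s, n) can be used for w.
        # (In the full construction A2 is run with the complexity bound
        # q(s, n), where q is the polynomial bounding the length of g.)
        return predictor2(s, n, eps, transformed, word_transform(w, s, n))
    return predictor1
```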

In [33] it was shown that the above notion of reducibility fulfills the basic requirements (the two lemmas below) that make it useful.

Lemma 5.4 [33] For all representation classes R_1 and R_2, if R_1 ⊴ R_2 and R_2 is predictable, then R_1 is predictable.

Lemma 5.5 [33] The relation ⊴ is transitive, i.e., if R_1 ⊴ R_2 ⊴ R_3, then R_1 ⊴ R_3.

Definition 5.6 Let R_1 ⋬ R_2 denote the fact that R_1 ⊴ R_2 does not hold.


5.1 Example reductions

Every k-term DNF formula over n variables can be rewritten as a k-CNF formula over the same n variables of size O(n^k). For example, xy + rs = (x + r)(x + s)(y + r)(y + s) (here k = 2). Thus for each k, R_{k-term DNF} ⊴ R_{k-CNF} [32]: the word transformation f is the projection function onto its first argument (i.e., f is the identity on w_1), and the representation transformation g maps a k-term DNF expression to the equivalent k-CNF expression.

We next give a reduction from [25,30]: for any fixed k, R_{k-CNF} ⊴ R_monomials. (Thus by transitivity, for any fixed k, R_{k-term DNF} ⊴ R_monomials.) Observe that in a k-CNF formula there are at most u = O(n^k) distinct clauses, since each clause has at most k literals. Let f map each n-bit assignment into a u-bit assignment, the ith bit of which is 1 iff the ith k-literal clause is 1. Given this transformation of the assignment, the mapping g simply expresses each k-CNF with v clauses as a monomial of size v over the enlarged variable set of size u. Since k is a constant, f is computable in time polynomial in n, and the size of the image of g is bounded by a polynomial in n.
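A sketch of this word transformation f follows: an n-bit assignment is expanded into one bit per possible clause of at most k literals, set to 1 iff that clause is satisfied. The clause enumeration order is an arbitrary choice of this sketch.

```python
from itertools import combinations

def all_clauses(n, k):
    literals = [(i, positive) for i in range(1, n + 1) for positive in (True, False)]
    for size in range(1, k + 1):
        for clause in combinations(literals, size):
            yield clause                    # a clause is a tuple of literals

def word_transform(assignment, n, k):
    bits = [assignment[i] == '1' for i in range(n)]
    satisfied = lambda clause: any(bits[i - 1] == positive for i, positive in clause)
    return ''.join('1' if satisfied(c) else '0' for c in all_clauses(n, k))

if __name__ == '__main__':
    # the u = O(n^k) new "variables" are the clauses over x1, x2 with <= 2 literals
    print(word_transform('10', n=2, k=2))
```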

One of the first representation classes shown to be polynomially predictable was R_monomials [39]. The above reductions can be used to show that R_{k-term DNF} and R_{k-CNF} are polynomially predictable as well [25,30,32,39]. As discussed in Section 3, finding a consistent 2-term DNF expression for a given set of examples (each consisting of a truth assignment and a label) is NP-hard [32]. It is easy to find a consistent 2-CNF or a consistent monomial for a given set of examples, or to determine that there is no consistent one. So since R_{2-term DNF} ⊴ R_{2-CNF} ⊴ R_monomials, can't we use these reductions to solve an NP-complete problem? When given a set of examples that is consistent with a 2-term DNF formula, apply the word transformation f of the reduction R_{2-term DNF} ⊴ R_{2-CNF} to the assignments and leave the labels unchanged. Now Requirement (1) implies that there is a 2-CNF formula consistent with the new set of examples. Not surprisingly, the opposite direction is false: if there is a 2-CNF formula consistent with the new set of examples, this does not imply that there is a 2-term DNF formula consistent with the original set of examples.

Two of the main open problems in computational learning theory are the questions of whether arbitrary DNF formulas (R_DNF) and arbitrary decision trees (R_DT) are polynomially predictable. (In [12] decision trees of fixed rank are shown to be learnable by the same class.)

It is easy to construct a logically equivalent DNF formula for a given decision tree: take the disjunction of the terms corresponding to the leaves labeled with 1, where each such term is the conjunction of the literals along the path to its leaf. Thus R_DT ⊴ R_DNF (f is again the identity on w_1).
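The conversion just described can be sketched as follows; the tuple encoding of trees is an assumption of this sketch, not the paper's representation.

```python
def tree_to_dnf(tree, path=()):
    """tree is either 0, 1, or (variable, left_subtree, right_subtree), where
    the left branch means the variable is 0 and the right branch means 1.
    Returns a DNF as a list of terms, each term a list of literals."""
    if tree == 1:
        return [list(path)]                  # one term per leaf labeled 1
    if tree == 0:
        return []
    var, left, right = tree
    return (tree_to_dnf(left, path + (f'-x{var}',)) +
            tree_to_dnf(right, path + (f'x{var}',)))

if __name__ == '__main__':
    tree = (1, (2, 0, 1), 1)                 # if x1 then 1 else test x2
    print(tree_to_dnf(tree))                 # [['-x1', 'x2'], ['x1']]
```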

An obvious question is whether R_DNF ⊴ R_DT. It can be shown [21] that the smallest logically equivalent decision tree for the formula x_1 x_{n/2+1} + x_2 x_{n/2+2} + ... + x_{n/2} x_n, for n even, must have at least 2^{n/2} negative leaves. This shows that if f is restricted to be the identity on w_1, then R_DNF ⋬ R_DT. However, if the word transformation is allowed to be an arbitrary polynomial function, then there still might be a reduction from R_DNF to R_DT. In Section 8 we give techniques for proving results of the form R_1 ⋬ R_2.

As another example, observe that the language consisting of the satisfying assignments (encoded as bit patterns of length n) of any DNF formula over n variables with s terms is accepted by an NFA of size O(sn). (The NFA guesses which of the s terms is satisfied, and branches to a chain of O(n) states to verify that the n-bit input satisfies the appropriate literals.) Thus predicting DNF trivially reduces to predicting NFAs, where again f is the identity on w_1 and g maps the DNF expression to the corresponding NFA.

We now give a case where there is no reduction between two representation classes when f is restricted to be the identity on w_1, but there is one when this restriction is dropped. Observe that the set of satisfying assignments of the DNF formula x_1 x_{n/2+1} + x_2 x_{n/2+2} + ... + x_{n/2} x_n, for n even, is a language whose smallest DFA has size at least 2^{n/2}. Thus R_DNF ⊴ R_DFA does not hold if f is the identity on w_1. This is contrasted by the following reduction from R_DNF to R_DFA. We include the proof here since it is instructive.

Theorem 5.7 [33] R_DNF ⊴ R_DFA.

Proof: Let r_1 ∈ R_DNF encode a DNF expression over n variables. Then each example assignment is a word of length n. The parameter s is a bound on the length of the target DNF expression r_1. In particular, s is an upper bound on the number of terms of r_1. For all assignments w_1, f(w_1, s, n) = (w_1)^s, i.e. f simply replicates w_1 exactly s times. For a given DNF expression r_1 with at most s terms it is easy to design a DFA A with O(sn) states such that r_1 is true on w_1 iff A accepts (w_1)^s. In particular, for each of the at most s terms, the DFA uses a chain of O(n) states to read n input symbols and check whether the term is satisfied. If not, it moves on to the next copy of w_1 in the input, and the next set of states, to test whether the next term is satisfied. Thus g simply needs to map r_1 and n into a representation r_2 of such an automaton A. For any reasonable encoding of DFAs we have that |r_2| is polynomial in sn. □
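The sketch below illustrates this reduction: the word transformation replicates the assignment s times, and the automaton's behaviour is simulated directly (checking the i-th term against the i-th copy) rather than writing out its O(sn)-state transition table; the encoding of terms is an assumption of the sketch.

```python
def word_transform(w, s, n):
    return w * s                               # f(w, s, n) = w^s

def dfa_accepts(terms, s, n, word):
    """terms: at most s terms, each a list of literals like 'x3' or '-x3'.
    Accept iff some term i is satisfied by the i-th replica of the assignment,
    which is exactly the language of the DFA built in the proof."""
    assert len(word) == s * n and len(terms) <= s
    for i, term in enumerate(terms):
        copy = word[i * n:(i + 1) * n]         # the i-th copy of the assignment
        if all((copy[int(l.lstrip('-x')) - 1] == '1') != l.startswith('-')
               for l in term):
            return True                        # term i satisfied: accept
    return False

if __name__ == '__main__':
    terms = [['x1', 'x2'], ['-x3']]            # the DNF  x1 x2 + not-x3
    print(dfa_accepts(terms, 2, 3, word_transform('101', 2, 3)))  # False
    print(dfa_accepts(terms, 2, 3, word_transform('110', 2, 3)))  # True
```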

Reductions of the form (R, Γ, c, Σ) ⊴ (R′, Γ, c, Σ) where R′ ⊆ R are particularly useful for determining the "hard core" of the representation language R. Below we give two such reductions. The goal is to make R′ as small and restricted as possible.

Definition 5.8 [33] R_BFtree = {r : r encodes a permutation π of n = 2^k elements, for some k}. For any k, let T^(k) be the complete ordered Boolean tree of height k, where the gates at even height are ∨-gates and the gates at odd height are ∧-gates. Then c(r) consists of exactly those bit patterns w of length 2^k such that if the leaves of T^(k) are labeled (in order) with the inputs π(w), then T^(k) evaluates to 1.

Note that the representations of R_BFtree are very restricted Boolean formulas, and thus in some sense R_BFtree is a subset of R_BF.


Theorem 5.9 [33] R_BF ⊴ R_BFtree.

By fixing the labels of the gates differently than the level-by-level alternation of R_BFtree, one can get a normal form for R_DNF.

Definition 5.10 R_DNFtree = {r : r encodes a permutation π of n = 2^k elements, for some k}. For any k, let T^(k) be the complete ordered Boolean tree of depth k, where the gates at depth less than k/2 are labeled OR and the gates at depth at least k/2 are labeled AND. Then c(r) consists of exactly those strings w of length 2^k such that if the leaves of T^(k) are labeled (in order) with the inputs π(w), then T^(k) evaluates to 1.

Theorem 5.11 R_DNF ⊴ R_DNFtree.

Proof: Let D be an arbitrary DNF formula with n variables and at most k terms. We first show that there exists a complete ordered binary tree T that computes D and has the property that all gates at depth less than half of the total depth of the tree compute ∨ and the remaining gates compute ∧. Let z = 1 + ⌈log(2n + k)⌉ and y = 2^z. Our basic building blocks will be complete binary trees of depth z (with y leaves). Call a tree with the above specifications an ∨-tree (∧-tree) if it has only ∨-gates (∧-gates) at the internal nodes. The final tree T consists of an ∨-tree where each of the y leaves of the tree is merged with the root of a different ∧-tree. Let T_{∧,i} be the ∧-tree of the i-th leaf (1 ≤ i ≤ y). Let k′ ≤ k be the number of terms in D. We label the leaves with inputs such that the first k′ ∧-trees compute the k′ terms of D and the remaining trees compute the constant zero. Note that this shows that T computes D.

Let D_j be the j-th (1 ≤ j ≤ k′) term of D. Label the first |D_j| leaves of T_{∧,j} with the literals of D_j and the remaining ones with one. For each of the remaining trees T_{∧,l} with k′ + 1 ≤ l ≤ y, label one of its leaves with zero and the other leaves arbitrarily. Observe that each such T_{∧,l} computes zero. This completes the description of T, which computes D. Note that the number of nodes in T is polynomial in k and n.

We now use the above construction to reduce R_DNF to R_DNFtree. For simplicity we assume that all inputs are of length m = n. Let k = s and define y and z as above. The word transformation f maps an n-bit assignment w_1 into the assignment (w_1 ¬w_1)^s 1^{sy} 0^{y² − 2sn − sy}, where ¬w_1 denotes the bitwise complement of w_1. The transformation g maps the representation r_1 (|r_1| ≤ s) of a DNF formula D into a representation r_2 of a complete binary tree B of depth 2z that computes D and whose leaves are labeled with a permutation of the following input sequence: I = ((x_1, ..., x_n, ¬x_1, ..., ¬x_n))^s (⟨1⟩)^{sy} (⟨0⟩)^{y² − 2sn − sy}. We use the above construction to show that such a tree exists.

Observe that since |r_1| ≤ s, the formula D represented by r_1 has at most s terms. So we can apply the above construction with k = s to produce the required tree B from D. The requirement that the leaves of B must be labeled with a permutation of I can be easily fulfilled because of the following. There are s occurrences of each literal in I, and thus there are enough literals for the leaves of the ∧-trees that compute a term. Similarly, the sy ones in I suffice to fill the remaining leaves of these ∧-trees with ones. Also, if needed, there is at least one zero for each of the y ∧-trees, since the number of zeros in I is at least as large as y, the total number of ∧-trees. □

6 Complete problems and infeasibility results for prediction

Definition 6.1 If R is a representation class, and 𝓡 is a set of representation classes, then R is prediction-hard for 𝓡 iff for all representation classes R′ ∈ 𝓡, R′ ⊴ R. If R ∈ 𝓡 also, then R is prediction-complete for 𝓡.

Thus if a representation class R is prediction-hard for a class 𝓡, then the polynomial predictability of R implies the polynomial predictability of every representation class in 𝓡.

In [33] Pitt and Warmuth propose to classify representation classes according to the complexity of their evaluation problem. Intuitively, the higher the complexity of the language E(R) (Definition 2.3), the harder it is to predict a hidden target representation of R.

Definition 6.2 For a complexity class 𝓛, let 𝓡_𝓛 = {R : E(R) ∈ 𝓛}.

Besides P, NP and RP, the complexity classes used in this section are:

LOG, the class of languages accepted by deterministic log-space bounded Turing machines;

NLOG, defined as above except that deterministic is replaced by non-deterministic;

NC¹, the class of languages accepted by log-depth circuits of standard fan-in two Boolean gates.

Note that for each circuit in NC¹ there is an equivalent polynomially sized Boolean formula [27], and for each Boolean formula there is an equivalent polynomially sized circuit in NC¹. Thus R_BF is prediction-complete for 𝓡_{NC¹}.

Other sets of representation classes for which prediction-complete problems have been found [33] include 𝓡_LOG, 𝓡_NLOG, and 𝓡_P.

Theorem 6.3 [33] R_DFA, R_NFA, and R_CIRC are prediction-complete for 𝓡_LOG, 𝓡_NLOG, and 𝓡_P, respectively, and R_DFA is prediction-hard for 𝓡_{NC¹}.

The following infeasibility results were shown with respect to a weaker model of polynomial predictability which only requires that the algorithm predicts correctly for the unlabeled example with probability slightly better than 1/2. Clearly, predictability implies weak predictability. Surprisingly, Schapire shows that the converse also holds [38].

Definition 6.4 The representation class (R, Γ, c, Σ) is weakly polynomially predictable iff there exists a polynomial time prediction algorithm A and polynomials p and q such that for all input parameters s, n ∈ ℕ, for all r ∈ R^{[s]}, and for all probability distributions on Σ^{[n]}, if A is given at least p(s, n) randomly generated examples of (the unknown target concept) c(r), and a randomly generated unlabeled word w ∈ Σ^{[n]}, then the probability that A incorrectly predicts label_{c(r)}(w) is at most 1/2 − 1/q(s, n).

Theorem 6.5 [38] If a representation class is weakly polynomially predictable then it also is polynomially predictable.

It can be shown that Lemma 5.4 still holds if predictable is replaced by weakly predictable, and thus the above definitions of completeness are the same for both variants of polynomial predictability.

Theorem 6.6 [33] If there exists a one-way function that is hard on its iterates [17,28], then no prediction-complete problem for 𝓡_P is weakly polynomially predictable.

Thus in particular R_CIRC and all the other complete problems for 𝓡_P given in [33] are not weakly predictable. The proof of the above theorem relies on the fact that the existence of a one-way function that is one-way on its iterates [17,28] is equivalent to the existence of one-way permutations [41]. From one-way permutations [41] a cryptographically secure pseudorandom bit generator can be constructed, and using a construction of Goldreich, Goldwasser, and Micali [16] it can be shown that R_CIRC, and thus any prediction-complete problem for 𝓡_P, is not weakly polynomially predictable [33].

At this point a method for proving unpredictability results based on cryptographic assumptions becomes evident. If cryptographic functions that are hard to invert can be shown to have easy evaluation problems in some "lower complexity class" 𝓛, then any representation class that is prediction-hard for 𝓡_𝓛 is not predictable.

Kearns and Valiant [27] have recently taken this approach, and have shown that, based on certain cryptographic assumptions (the intractability of inverting the RSA cryptosystem, factoring Blum integers, or deciding quadratic residuosity), there are some representation classes in 𝓡_{NC¹} that are not weakly predictable. Consequently, any representation class that is prediction-hard for 𝓡_{NC¹} (see Theorem 6.3) is not weakly predictable, based on the same cryptographic assumptions.

Theorem 6.7 If R_BF or R_DFA is weakly polynomially predictable then there exists a randomized polynomial algorithm for inverting the RSA cryptosystem, for factoring Blum integers, and for deciding quadratic residuosity.


7 Independence of the target representation

In Section 4 it was shown that there are advantages to letting the representation class of the hypothesis be any arbitrary class in 𝓡_P (i.e. any representation class whose evaluation problem is in P). Why not choose the same approach for the target representation class? A canonical representation class would be R_CIRC (Boolean circuits) (or any other representation class derived from a "universal" computational model). However, the corresponding prediction problem R_CIRC is already prediction-complete for 𝓡_P (Theorem 6.3) and thus, modulo cryptographic assumptions, unpredictable (Theorem 6.6).

In practice, one would like to take advantage of the fact that the target representation is required to be of a particular form. This limits the space of possible target concepts to those concepts over words in Σ^{[n]} that have a representation in R^{[s]}, and simplifies the "search" of the prediction algorithm for the target concept. For a fixed concept c ⊆ Σ^{[n]}, restricting the representation language amounts to increasing the smallest size s such that there is a representation for the concept c in R^{[s]}. Note that s is a parameter of the prediction algorithm and the number of examples and the running time are allowed to grow polynomially with s. Thus increasing the size of the smallest representation s amounts to giving the prediction algorithm more examples and time for predicting the concept c. It seems that only by restricting the representation language of the targets are we able to construct polynomial prediction algorithms.

A very useful way of producing restricted representation classes is the approach taken in [33]. Representation classes are defined by the complexity of their evaluation problem. Now all the well studied standard complexity classes contained in P lead to corresponding classes of representations (see Definition 6.2).

8 Techniques for proving non-reductions

In Section 5.1 we have seen several cases of pairs of representation classes R_1 and R_2 for which the following property holds: for any r_1 ∈ R_1, the size of the smallest r_2 ∈ R_2 which represents the same concept as r_1 (i.e. c_1(r_1) = c_2(r_2)) grows in the worst case exponentially in the size of r_1. Expressed differently, R_1 ⋬ R_2, provided that the word transformation is the identity on w_1. We give a number of such non-reducible pairs of representation classes (see [14] for additional examples):

Theorem 8.1 If the word transformation is restricted to be the identity on w_1, then R_DNF ⋬ R_DT, R_DNF ⋬ R_DFA, R_NFA ⋬ R_DFA, R_XORsum ⋬ R_DNF, R_XORsum ⋬ R_CNF, R_DNF ⋬ R_CNF, and R_CNF ⋬ R_DNF.


Proof sketch: Assume for the rest of this proof that f must be the identity on w_1. The fact that R_DNF ⋬ R_DT was shown in [21]. (In Section 5.1 we gave a DNF expression for which there only exist exponentially sized equivalent decision trees.) In Section 5.1 we already showed that R_DNF ⋬ R_DFA. The remaining non-reductions are trivial for the case that f is the identity on w_1. □

Note that for a modified version of the notion of reduction given in this paper (Definition 5.3), it is easy to show [30,33] that R_DNF ⊴ R_CNF and R_CNF ⊴ R_DNF even if f is the identity on w_1. Simply replace Requirement (1) by

(1′) w_1 ∈ c_1(r_1) iff f(w_1, s, n) ∉ c_2(g(r_1, n)).

In Requirement (1′), the word transformation maps examples to examples with the opposite labels, whereas in the case of Requirement (1), the labels of the examples must be preserved by the word transformation. For the sake of simplicity of the presentation we only use Requirement (1) for the definition of reduction (Definition 5.3) and for the rest of this section.

Even though R_DNF ⋬ R_DFA holds if f is restricted to be the identity on w_1, one can still show that R_DNF ⊴ R_DFA (see Theorem 5.7) by choosing the polynomial time computable word transformation f judiciously. However, in other cases, such as the pair R_NFA, R_DFA, this seems to be hard to show.⁴

Our goal is to obtain proofs of the fact that R_1 ⋬ R_2 that hold even if f is allowed to be an arbitrary polynomially length preserving, polynomially computable word transformation (these are the conditions imposed on f in the definition of ⊴). We achieve this by relaxing the notion of reduction and proving non-reducibility with respect to the less restricted notion.

Definition 8.2 Let R_1 and R_2 be representation classes. Then R_1 is freely reducible to R_2 (denoted by R_1 ⊑ R_2) iff there exist a polynomially length preserving word transformation f and a polynomially length preserving representation transformation g such that for all s, n ∈ ℕ, r_1 ∈ R_1^{[s]}, and w_1 ∈ Σ_1^{[n]}, Requirement (1) holds. Let R_1 ⋢ R_2 denote the fact that R_1 ⊑ R_2 does not hold.

Note that R_1 ⊑ R_2 iff R_1 ⊴ R_2 holds when Requirement (2) is dropped (see Definition 5.3), i.e. the word transformation f does not have to be computable in polynomial time.

Clearly, R_1 ⊴ R_2 implies R_1 ⊑ R_2, and R_1 ⋢ R_2 implies R_1 ⋬ R_2.

The reason that we dropped the requirement that f be polynomially computable in the definition of ⊑ is that we can develop a purely combinatorial characterization that is necessary and sufficient for the fact that R_1 ⋢ R_2. Restricting f to be polynomial time computable would give additional leverage for showing that R_1 ⋬ R_2. However, it seems hard to make use of that leverage.

⁴Note that if NLOG = LOG, then by Theorem 6.3, R_NFA ⊴ R_DFA. It is unclear whether the opposite implication holds.


We now introduce the tools that lead to a combinatorial characterization of ⊑ and then give a few examples of how to apply the characterization. Further research needs to be done to investigate the most common representation classes used in Artificial Intelligence and Computational Learning Theory with respect to this characterization.

We first introduce an equivalence relation that has the property that all words in the same equivalence class always receive the same label.

Definition 8.3 For all representation classes (R, Γ, c, Σ), all s, n ∈ ℕ and all v, w ∈ Σ^{[n]}, the word v is equivalent to w (denoted by v ≡_{R,s,n} w) if for all r ∈ R^{[s]}, label_{c(r)}(v) = label_{c(r)}(w). Let P(≡_{R,s,n}) be any subset of Σ^{[n]} which contains one element (representative) from each equivalence class of the relation ≡_{R,s,n}. Let N(≡_{R,s,n}) denote the number of equivalence classes of the relation ≡_{R,s,n}.

All words of Σ^{[n]} that lie in the same equivalence class of the relation ≡_{R,s,n} are labeled identically for each representation in R^{[s]}. Thus f(·, s, n) could simply map all words of Σ^{[n]} that lie in the same class to the same image word.
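For finite toy cases the equivalence classes of Definition 8.3 can be computed by brute force, as the following sketch shows: two words are equivalent iff every representation labels them the same, so the classes are the groups of words with identical label vectors. Here a finite list of concepts, given directly as sets, stands in for R^{[s]}.

```python
from collections import defaultdict

def equivalence_classes(concepts, words):
    classes = defaultdict(list)
    for w in words:
        signature = tuple(w in concept for concept in concepts)   # label vector
        classes[signature].append(w)
    return list(classes.values())

if __name__ == '__main__':
    # the concepts of the monomials x1 and x1*x2 over {0,1}^2
    concepts = [{'10', '11'}, {'11'}]
    words = ['00', '01', '10', '11']
    print(equivalence_classes(concepts, words))
    # [['00', '01'], ['10'], ['11']]  -- so N(equiv) = 3 here
```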

Definition 8.4 For any set Q, ⟨Q⟩ denotes some fixed sequence containing exactly the elements of Q. Let ⟨Q⟩ = (x_1, ..., x_q) and let h be a function on the elements of Q; then h(⟨Q⟩) denotes the sequence (h(x_1), ..., h(x_q)).

For a given sequence D of words in Σ* and a representation r, we are interested in the sequence of plus labels and minus labels induced by r on D.

Definition 8.5 Let R be a representation class and D be a finite sequence of words of Σ*. For r ∈ R, the sequence label_{c(r)}(D) is called the dichotomy induced by r on D. Similarly, we define the set of all dichotomies induced by R on the sequence D (denoted by Π_R(D)) as the set {label_{c(r)}(D) : r ∈ R}.

The set of dichotomies Π_R(D) induced by a representation class R on a sequence D has been used in pattern recognition [10,40] and in Computational Learning Theory [9] to measure the "richness" of the concept class associated with a representation class. (A more commonly used definition is Π_R(D) = {D ∩ c(r) : r ∈ R}, where D is a subset of the domain Σ* [9,10,22,40].) Intuitively, the larger the set of dichotomies, the harder the learning task. (See [9] for how this intuition is formalized.) It is easy to obtain bounds on the number of dichotomies in Π_R(D) using the following parameter of a representation class called the Vapnik-Chervonenkis (VC) dimension.⁵

⁵Usually the VC dimension is defined for concept classes rather than representation classes. The same parameter is called the capacity of a concept class C in [40] (named after a similar notion in [10]) and is denoted by S(C) in [11].


Definition 8.6 A set Q ⊆ Σ* is shattered by R if |Π_R(⟨Q⟩)| = 2^{|Q|}. The VC dimension of a representation class R is the cardinality of the largest set that is shattered by R.

Lemma 8.7 [22,40] Let R be a representation class with VC dimension d and let D be any sequence of m words in Σ*. Then |Π_R(D)| ≤ m^d + 1.
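For tiny finite cases these quantities can again be computed by exhaustive search, as sketched below (concepts are given directly as sets; the example family is chosen for illustration).

```python
from itertools import combinations

def dichotomies(concepts, D):
    """Pi_R(D): the set of label sequences induced on the sequence D."""
    return {tuple('+' if w in c else '-' for w in D) for c in concepts}

def vc_dimension(concepts, domain):
    d = 0
    for size in range(1, len(domain) + 1):
        if any(len(dichotomies(concepts, Q)) == 2 ** size      # Q is shattered
               for Q in combinations(domain, size)):
            d = size
    return d

if __name__ == '__main__':
    domain = ['00', '01', '10', '11']
    concepts = [set(), {'01'}, {'10'}, {'01', '10'}]
    D = ['01', '10']
    print(len(dichotomies(concepts, D)))   # 4: the set {'01','10'} is shattered
    print(vc_dimension(concepts, domain))  # 2, and |Pi_R(D)| <= m^d + 1 holds
```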

If the concept and the representation alphabets contain the set of reals ℝ, and if the concepts are subsets of some higher dimensional Euclidean domain, then the VC dimension (and, more precisely, |Π_R(D)|) is especially useful for estimating the number of examples necessary and sufficient for learning [9,13]. The methodology of the VC dimension has also been generalized to the case of learning real-valued function classes [18,31,36,40].

Note that the set of dichotomies Π_{R^{[s]}}(⟨P(≡_{R,s,n})⟩) is independent of the choice of the representative set P(≡_{R,s,n}). We now give a purely combinatorial characterization of ⊑.

Theorem 8.8 Let R_1 and R_2 be representation classes. Then R_1 ⋢ R_2 iff for all polynomials p and q, there exist s, n ∈ ℕ such that for all sequences D_{s,n} of N(≡_{R_1,s,n}) words of Σ_2^{[p(s,n)]}, we have that Π_{R_1^{[s]}}(⟨P(≡_{R_1,s,n})⟩) ⊈ Π_{R_2^{[q(s,n)]}}(D_{s,n}).

Proof: We prove the following equivalent restatement of the theorem: R1 E R2 iff there exist

polynomials p and q, such that for all s, n E IN, there exists a sequence Ds,n of N(=-R~,s,n) words

of ~(.,n)], for which IIR?l((P(=/h,.,n)) ) C IIR[q(.,.)l(Ds,n) holds.

For the forward direction assume that R1 E Rz. Then there is a polynomially length pre-

serving word transformation f and polynomially length preserving representation transforma-

tion g, such that for all s,n E IN, ra E R[ '] and w E ~ ] , Requirement (1) holds. Since g

is polynomially length preserving there exists a polynomial q, such that [g(rl,n)l < q(Irll,n)-

Requirement (1) implies that the dichotomy induced by rl on the sequence (P(=n~,.,.)) is the

same as the dichotomy induced by g(ra, n) E R [q(''")] on the sequence f ((P(-R, , , , . ) ) , s, n). Let D.,,~ = f((P(=Rx,.,.)), s, n), Since f is polynomially length preserving, there exists a polynomial p

such that for all wl e Z~ hI, lf(wx,s,n)[ < p(s,n). We follow that for all s ,n E IN and the sequence

Ds,n of ~[p(s,n)]~ , IIR?]((P(=R~,s,n)) ) -C II~q(,,.)l(Ds,.)° This completes the proof of the forward

direction of restatement of the theorem.

For the opposite direction, let $p$ and $q$ be two polynomials such that for all $s, n \in \mathbb{N}$ there exists a sequence $D_{s,n}$ of $N(\equiv_{R_1,s,n})$ words of $\Sigma^{[p(s,n)]}$ for which $\Pi_{R_1^{[s]}}(\langle P(\equiv_{R_1,s,n}) \rangle) \subseteq \Pi_{R_2^{[q(s,n)]}}(D_{s,n})$. From $p$, $q$ and $D_{s,n}$, we will construct a polynomially length preserving word transformation $f$ and a polynomially length preserving representation transformation $g$ witnessing the fact that $R_1 \sqsubseteq R_2$ holds.

First define $f(\cdot, s, n)$ as a mapping from the set of representatives $P(\equiv_{R_1,s,n})$ to the elements of the sequence $D_{s,n}$: the $i$th element of the sequence $\langle P(\equiv_{R_1,s,n}) \rangle$ is mapped to the $i$th element of $D_{s,n}$. Then extend $f(\cdot, s, n)$ to the domain $\Sigma^{[n]}$ by mapping all words in each equivalence class of the relation $\equiv_{R_1,s,n}$ to the same word of $D_{s,n}$ as the representative of the class. Since $D_{s,n}$ consists of words of $\Sigma^{[p(s,n)]}$, the defined word transformation $f$ is polynomially length preserving.

Let $g(r_1, n)$ consist of a representation $r_2 \in R_2^{[q(s,n)]}$ such that the dichotomy induced by $r_2$ on the sequence $D_{s,n}$ is the same as the dichotomy induced by $r_1 \in R_1^{[s]}$ on the sequence $\langle P(\equiv_{R_1,s,n}) \rangle$. Note that by the above assumption $r_2$ exists, and observe that the defined representation transformation $g$ is also polynomially length preserving (with the polynomial $q$). From the definition of $f$ and $g$ it follows that they fulfill Requirement (1), and this completes the proof of the opposite direction of the restatement of the theorem given at the beginning of the proof. □
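The restatement used in the proof can be checked by brute force on small finite examples. The sketch below (toy model, exponential search, illustrative names only) looks for a sequence $D$ over the image domain whose induced dichotomy set covers every dichotomy that the first class induces on a sequence of representatives; when no such $D$ exists, the combinatorial condition of Theorem 8.8 is witnessed.

```python
# Brute-force illustration of the restatement in the proof of Theorem 8.8:
# search for a sequence D over the second class's domain that reproduces,
# via Pi_{R2}(D), every dichotomy R1 induces on its representatives.
from itertools import product

def dichotomies(reps, D):
    return {tuple(w in r for w in D) for r in reps}

def find_witness_sequence(reps1, P1, reps2, domain2):
    """Look for a sequence D over domain2, of the same length as the
    representative sequence P1, with Pi_{R1}(P1) contained in Pi_{R2}(D)."""
    target = dichotomies(reps1, P1)
    for D in product(domain2, repeat=len(P1)):
        if target <= dichotomies(reps2, list(D)):
            return list(D)
    return None        # no witness: the condition of Theorem 8.8 holds here

# Toy classes: R1 = "exactly one word" concepts over {0,1}^2,
#              R2 = "first bit is 0" / "first bit is 1" concepts.
words = ["00", "01", "10", "11"]
reps1 = [{w} for w in words]
reps2 = [{"00", "01"}, {"10", "11"}]
print(find_witness_sequence(reps1, words, reps2, words))   # None expected
```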

The following corollaries give sufficient conditions for $R_1 \not\sqsubseteq R_2$. The first one follows from the above theorem and the observation that the inclusion $\Pi_{R_1^{[s]}}(\langle P(\equiv_{R_1,s,n}) \rangle) \subseteq \Pi_{R_2^{[q(s,n)]}}(D)$ implies $|\Pi_{R_1^{[s]}}(\langle P(\equiv_{R_1,s,n}) \rangle)| \le |\Pi_{R_2^{[q(s,n)]}}(D)|$.

Corollary 8.9 Let $R_1$ and $R_2$ be representation classes. Then $R_1 \not\sqsubseteq R_2$ if for all polynomials $p$ and $q$ there exist $s, n \in \mathbb{N}$ such that for all sequences $D$ of $N(\equiv_{R_1,s,n})$ words of $\Sigma^{[p(s,n)]}$, the inequality $|\Pi_{R_1^{[s]}}(\langle P(\equiv_{R_1,s,n}) \rangle)| > |\Pi_{R_2^{[q(s,n)]}}(D)|$ holds.

Corollary 8.10 Let $R_1$ and $R_2$ be representation classes. Then $R_1 \not\sqsubseteq R_2$ if for all polynomials $p$ and $q$ there exist $s, n \in \mathbb{N}$ and $Q_{s,n} \subseteq \Sigma^{[n]}$ such that for all sequences $D$ of $|Q_{s,n}|$ words of $\Sigma^{[p(s,n)]}$, we have that $\Pi_{R_1^{[s]}}(\langle Q_{s,n} \rangle) \not\subseteq \Pi_{R_2^{[q(s,n)]}}(D)$.

Corollary 8.11 Let $R_1$ and $R_2$ be representation classes. Then $R_1 \not\sqsubseteq R_2$ if the VC dimension of $R_1$ is larger than the VC dimension of $R_2$.

Proof: We apply the previous corollary. Let $Q$ be any maximum cardinality subset of $\Sigma^*$ that is shattered by $R_1$. Choose $s$ and $n$ such that $Q \subseteq \Sigma^{[n]}$ and $|\Pi_{R_1^{[s]}}(\langle Q \rangle)| = 2^{|Q|}$. Since the VC dimension of $R_1$ is larger than the VC dimension of $R_2$, we have that for all sequences $D$ of $|Q|$ words of $\Sigma^*$, $\Pi_{R_1^{[s]}}(\langle Q \rangle) \not\subseteq \Pi_{R_2}(D)$. The proof now follows from the previous corollary: for all polynomials $p$ and $q$, choose $s$ and $n$ as above and $Q_{s,n} = Q$. □

It is easy to see that the VC dimension of $R_{\mathrm{PAIRS}}$ is two and the VC dimension of $R_{\mathrm{SINGLE}}$ is one. From the previous corollary it follows that

Theorem 8.12 $R_{\mathrm{PAIRS}} \not\sqsubseteq R_{\mathrm{SINGLE}}$.

The above theorem implies that $R_{\mathrm{PAIRS}} \not\trianglelefteq R_{\mathrm{SINGLE}}$. It is easy to show that $R_{\mathrm{SINGLE}} \trianglelefteq R_{\mathrm{PAIRS}}$.
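The claim about the two VC dimensions can be checked numerically for small $n$. In the sketch below, the $R_{\mathrm{SINGLE}}$-style concepts are complements of a single word (the reading used in the proof of Theorem 8.13), while the $R_{\mathrm{PAIRS}}$-style concepts are taken, purely as an assumption for this illustration, to be complements of two-word sets; under these readings the computed dimensions are 1 and 2.

```python
# Hedged illustration: toy analogues of R_SINGLE / R_PAIRS over {0,1}^n,
# with VC dimensions 1 and 2 as claimed before Theorem 8.12.
from itertools import combinations, product

def dichotomies(reps, D):
    return {tuple(w in r for w in D) for r in reps}

def vc_dimension(reps, domain):
    d = 0
    for k in range(len(domain) + 1):
        if any(len(dichotomies(reps, list(Q))) == 2 ** k
               for Q in combinations(domain, k)):
            d = k
    return d

n = 3
words = ["".join(w) for w in product("01", repeat=n)]
single = [set(words) - {w} for w in words]                       # complement of one word
pairs = [set(words) - {u, v} for u, v in combinations(words, 2)] # complement of a pair (assumption)
print(vc_dimension(single, words), vc_dimension(pairs, words))   # 1 2
```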

The converse of the previous corollary clearly does not hold. We will show that $R_{\mathrm{SINGLE}} \not\sqsubseteq R_{\Box}$, even though the VC dimension of $R_{\mathrm{SINGLE}}$ is one and the VC dimension of $R_{\Box}$ is infinite. (More precisely, the VC dimension of $R_{\Box}$ is $2n$ when restricted to boxes in $\mathbb{R}^n$ [9,23].)


Theorem 8.13 $R_{\mathrm{SINGLE}} \not\sqsubseteq R_{\Box}$ and $R_{\Box} \not\sqsubseteq R_{\mathrm{SINGLE}}$.

Proof: Using the previous corollary, the second half of the theorem follows from the fact that the VC dimension of $R_{\Box}$ is larger than the VC dimension of $R_{\mathrm{SINGLE}}$. The proof of the fact that $R_{\mathrm{SINGLE}} \not\sqsubseteq R_{\Box}$ is more involved. Let $c$ be a constant such that all representations of $R_{\mathrm{SINGLE}}$ that represent concepts of the form $\{0,1\}^n - \{w\}$, where $w \in \{0,1\}^n$, have length at most $cn$. Clearly such a constant exists. We will use Corollary 8.10 to prove the theorem. Arbitrarily choose polynomials $p$ and $q$. Choose $n$ such that $p(cn, n) < 2^{n-1}$. Furthermore, set $s$ to $cn$ and $Q_{s,n}$ to the set of all $2^n$ bit patterns of length $n$. From the choice of $s$ we have that $\Pi_{R_{\mathrm{SINGLE}}^{[s]}}(\langle Q_{s,n} \rangle)$ contains all dichotomies of length $2^n$ with exactly one minus. Denote the set $\Pi_{R_{\mathrm{SINGLE}}^{[s]}}(\langle Q_{s,n} \rangle)$ by $T$.

The concept alphabet of $R_{\Box}$ is the set of all reals $\mathbb{R}$. Let $D$ be any sequence of $2^n$ points of $\mathbb{R}^{[p(cn,n)]}$. To apply Corollary 8.10 we still have to show that $T \not\subseteq \Pi_{R_{\Box}^{[q(s,n)]}}(D)$.

Let $d = p(cn, n)$ and let $[x_1, x_2] \times [x_3, x_4] \times \cdots \times [x_{2d-1}, x_{2d}]$ be the smallest box in $\mathbb{R}^d$ that contains all points of $D$. Assume there is a point $y \in D$ that is in the interior of the box, i.e. in $(x_1, x_2) \times (x_3, x_4) \times \cdots \times (x_{2d-1}, x_{2d})$. Clearly, there is no concept (box) in $R_{\Box}$ that labels $y$ as minus and the remaining points of $D$ as plus. Thus if there is a point $y$ in the interior, then $T \not\subseteq \Pi_{R_{\Box}^{[q(s,n)]}}(D)$.

We are left with the case that all points of $D$ lie on the faces of the smallest box. Each $x_i$ ($1 \le i \le 2d$) corresponds to one of the $2d$ faces of the box. We say a point of $D$ owns a face if it is the only point of $D$ that lies on that face. Since $2d = 2p(cn, n) < 2^n$ and $D$ has length $2^n$, there must exist a point $y \in D$ that does not own any of the $2d$ faces of the smallest box. As in the previous case, there is no box in $R_{\Box}$ that labels $y$ as minus and the remaining points of $D$ as plus, and thus $T \not\subseteq \Pi_{R_{\Box}^{[q(s,n)]}}(D)$ holds. □
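The counting step of the second half of this argument can be replayed numerically: with more than $2d$ points, at most $2d$ of them can own a face of the bounding box, so some point lies inside the smallest box spanned by the others and cannot be the unique minus of any box dichotomy. The sketch below (illustrative names, random data) exhibits such a point.

```python
# Numeric illustration of the face-ownership argument in Theorem 8.13:
# among more than 2d points in R^d, some point lies in the bounding box of
# the remaining points, so no axis-parallel box isolates it as the only minus.
import random

def bounding_box(points):
    """Smallest axis-parallel box [min_i, max_i] per coordinate."""
    return [(min(p[i] for p in points), max(p[i] for p in points))
            for i in range(len(points[0]))]

def unseparable_point(D):
    """Return some y in D contained in the bounding box of D - {y}, if any."""
    for j, y in enumerate(D):
        rest = D[:j] + D[j + 1:]
        box = bounding_box(rest)
        if all(lo <= y[i] <= hi for i, (lo, hi) in enumerate(box)):
            return y
    return None

random.seed(0)
d = 3
D = [tuple(random.random() for _ in range(d)) for _ in range(2 * d + 1)]  # 2d+1 > 2d points
print(unseparable_point(D) is not None)     # True, by the counting argument
```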

Once we have shown for two representation classes that $R_1 \not\sqsubseteq R_2$, more results of this type may be obtained by applying the following lemma.

Lemma 8.14 If $R_1 \not\sqsubseteq R_2$, $R_1 \sqsubseteq R_a$ and $R_b \sqsubseteq R_2$, then $R_a \not\sqsubseteq R_b$.

Corollary 8.15 For all classes $R_1 \in \{R_{\mathrm{PAIRS}}, R_{\text{2-clause-CNF}}, R_{\mathrm{DT}}, R_{\mathrm{DFA}}\}$ and $R_2 \in \{R_{\mathrm{SINGLE}}, R_{\mathrm{monomials}}\}$, we have $R_1 \not\sqsubseteq R_2$.

Proof sketch: It is easy to show the following reductions: $R_{\mathrm{PAIRS}} \trianglelefteq R_{\text{2-clause-CNF}}$, $R_{\mathrm{PAIRS}} \trianglelefteq R_{\mathrm{DT}}$, $R_{\mathrm{PAIRS}} \trianglelefteq R_{\mathrm{DFA}}$ and $R_{\mathrm{monomials}} \trianglelefteq R_{\Box}$. Now the corollary follows from the previous lemma and theorem. □

By Lemma 8.14 and Theorem 8.13 we also have $R_{\mathrm{SINGLE}} \not\sqsubseteq R_{\mathrm{monomials}}$. Recall that in our definition of $R_{\mathrm{monomials}}$ we assumed (as done for all Boolean functions in this paper) that assignments to $n$ variables are encoded as bit patterns of length $n$. For a somewhat non-standard way of encoding assignments (which we only use in the next theorem) we can show that $R_{\mathrm{SINGLE}} \trianglelefteq R_{\mathrm{monomials}}$.

Theorem 8.16 If assignments of $n$ variables are encoded by a list that contains (all in binary) the number $n$ and all indices of variables whose values are false, then $R_{\mathrm{SINGLE}} \trianglelefteq R_{\mathrm{monomials}}$.

Proof: For any $w_1 \in \{0,1\}^*$, let $N(w_1)$ denote the number encoded by the binary pattern $w_1$. Let $f(w_1, s, n)$ be a list containing $2^n$ in binary and the bit pattern $w_1$. Note that $f(w_1, s, n)$ encodes an assignment of $2^n$ variables, all of which are set to true except the one with index $N(w_1)$. For $w_1 \in \{0,1\}^n$, let $r_{w_1} \in R_{\mathrm{SINGLE}}$ be a representation of the concept $\{0,1\}^{2^n} - \{B(w_1)\}$, where $B(w_1)$ is the bit pattern of length $2^n$ with exactly one 0 in position $N(w_1)$ and all other bits 1. Let $g(r_{w_1}, n)$ be the encoding of the monomial over $2^n$ variables which consists of only one unnegated literal, the one with index $N(w_1)$. Since all indices and the number of variables $2^n$ are encoded in binary, $f$ and $g$ are polynomially length preserving. It is easy to see that Requirement (1) holds, which completes the reduction. □
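A toy version of the two transformations can be spelled out as follows. The sketch reads the $R_{\mathrm{SINGLE}}$ concept simply as the complement of a single word (as in the proof of Theorem 8.13) and identifies a one-literal monomial with the index of its literal, so all names and encodings here are illustrative rather than the paper's exact ones.

```python
# Toy walk-through of the reduction in Theorem 8.16 (sketch only).
# Assignments of 2^n variables are encoded as the pair
# (2^n in binary, list of false indices in binary), so f and g stay short.
def f(w1, n):
    """Word w1 in {0,1}^n -> encoded assignment of 2^n variables in which
    only the variable with index N(w1) is false."""
    return (bin(2 ** n)[2:], [w1])            # (number of variables, false indices)

def g(u, n):
    """Representation of 'all words except u' -> the monomial over 2^n
    variables consisting of the single unnegated literal with index N(u)."""
    return u                                  # the monomial is identified by that index

def single_label(u, w1):
    """Label of w1 under the concept 'all words of {0,1}^n except u'."""
    return w1 != u

def monomial_label(monomial_index, encoded_assignment):
    """A one-literal monomial is satisfied iff its variable is not among
    the false indices of the encoded assignment."""
    _, false_indices = encoded_assignment
    return monomial_index not in false_indices

n = 3
words = [format(i, "03b") for i in range(2 ** n)]
assert all(single_label(u, w) == monomial_label(g(u, n), f(w, n))
           for u in words for w in words)     # Requirement (1) on this toy domain
print("labels agree on all", len(words) ** 2, "pairs")
```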

9 Conclusion

Our notion of non-reducibility between representation classes is based on the definition of reduction (the relation $\trianglelefteq$ of Definition 5.3) between representation classes introduced in [33]. This definition of reduction follows the spirit of the standard many-one reducibilities used in Complexity Theory [24]. One may define a version of a Turing reduction [24] between representation classes that preserves polynomial predictability. Useful notions of reduction should have the following property: if $R_1$ reduces to $R_2$ and $R_2$ is polynomially predictable, then $R_1$ is polynomially predictable as well. It would be interesting to develop notions of non-reducibility which are based on a Turing reduction between representation classes. Is there also a combinatorial characterization of the fact that $R_1$ is not reducible to $R_2$ (as in Theorem 8.8)?

Observe that there are polynomially predictable representation classes $R_1$ and $R_2$ such that both $R_1 \not\sqsubseteq R_2$ and $R_2 \not\sqsubseteq R_1$ hold (see Theorem 8.13). However, we think that, even independent of the objectives of Computational Learning Theory, the definition of $\sqsubseteq$ and its combinatorial characterization are a useful combinatorial tool for studying representation classes.

There are any number of open problems of the type: is $R_1 \not\sqsubseteq R_2$? We first give a few conjectures that involve representation classes used in this paper.

Conjectures: $R_{\mathrm{DNF}} \not\sqsubseteq R_{\mathrm{DT}}$, $R_{\mathrm{NFA}} \not\sqsubseteq R_{\mathrm{DFA}}$, $R_{\mathrm{XORsum}} \not\sqsubseteq R_{\mathrm{DNF}}$, $R_{\mathrm{XORsum}} \not\sqsubseteq R_{\mathrm{CNF}}$, $R_{\mathrm{DNF}} \not\sqsubseteq R_{\mathrm{CNF}}$, and $R_{\mathrm{CNF}} \not\sqsubseteq R_{\mathrm{DNF}}$.

Boolean terms and clauses over $n$ variables can be interpreted as special halfspaces in $\mathbb{R}^n$. Furthermore, DNF expressions correspond to unions and CNF expressions to intersections of such halfspaces. Do the above conjectures in which $R_{\mathrm{DNF}}$ and $R_{\mathrm{CNF}}$ appear on the right-hand side still hold if $R_{\mathrm{DNF}}$ and $R_{\mathrm{CNF}}$ are replaced by their geometric generalizations, unions and intersections of arbitrary halfspaces, respectively? (Let $\mathbb{R}$ be the concept and representation alphabet and represent a halfspace by the list of the coefficients of the bordering hyperplane.)

We conclude with a simple conjecture involving halfspaces.

Conjecture: Let $R_1$ be the representation class defining halfspaces of $\mathbb{R}^n$ (where $n$ depends on the concept) and let $R_2$ be the representation class defining intersections of two such halfspaces. Then $R_2 \not\sqsubseteq R_1$. More strongly, we conjecture that $R_{2\mathrm{Bool}} \not\sqsubseteq R_1$, where $R_{2\mathrm{Bool}}$ is the same as $R_2$, except that the intersecting halfspaces are restricted to halfspaces of the Boolean hypercube $\{0,1\}^n$ (where $n$ depends on the concept).

The representation class $R_1$ is learnable in terms of itself using linear programming [9]. However, we know of no polynomial prediction algorithm for $R_2$ or for $R_{2\mathrm{Bool}}$.⁶

⁶In [7] it is shown that $R_{2\mathrm{Bool}}$ is not learnable in terms of $R_{2\mathrm{Bool}}$ unless $\mathcal{RP} = \mathcal{NP}$. (See Section 3 for a discussion of results of this type.)
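As a hedged illustration of the linear-programming remark above (not the algorithm of [9] itself), a halfspace consistent with a finite, strictly separable sample can be obtained as a feasible point of a linear program. The helper below uses scipy.optimize.linprog; the names and the margin of 1 are illustrative choices.

```python
# Sketch: find a halfspace (w, b) consistent with a strictly separable
# labeled sample by solving a feasibility linear program.
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: (m, d) array of points, y: +/-1 labels.  Find (w, b) with
    y_i (w . x_i + b) >= 1 for all i, assuming strict separability."""
    m, d = X.shape
    # Variables z = (w_1, ..., w_d, b).  Constraint: -y_i (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return (res.x[:d], res.x[d]) if res.success else None

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
y = np.array([-1, -1, -1, 1])
w, b = consistent_halfspace(X, y)
print(np.sign(X @ w + b))                     # matches y on the sample
```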

The above conjecture is particularly interesting in view of the fact that $R_2' \trianglelefteq R_1$, where the concepts of $R_2'$ are symmetric differences of halfspaces instead of intersections of halfspaces as for $R_2$ [39].

Acknowledgements

The notion of reduction used extensively in this paper was developed together with Lenny Pitt [33]. The basic definitions given here and many example reductions are from, or are very similar to, the ones given in our joint paper [33]. Thanks to him for giving me feedback on early drafts of this paper and for lively discussions of the material. Thanks also to David Haussler, David Helmbold and Phil Long for stimulating conversations.

References

[1] N. Abe. Polynomial learnability as a formal model of natural language acquisition. Ph.D. Thesis, in preparation, Department of Computer and Information Science, University of Pennsylvania, 1989.

[2] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, London, 1974.

[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1987.



[4] D. Angluin and C. H. Smith. Inductive inference: theory and methods. Computing Surveys, 15(3):237-269, 1983.

[5] J. M. Barzdin. Prognostication of automata and functions, pages 81-84. Elsevier North-Holland, New York, 1972.

[6] J. M. Barzdin and R. V. Freivald. On the prediction of general recursive functions. Soviet Mathematics Doklady, 13:1224-1228, 1972.

[7] A. Blum and R. Rivest. Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, August 1988.

[8] L. Blum and M. Blum. Toward a mathematical theory of inductive inference. Inform. Contr., 28:125-155, 1975.

[9] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Technical Report UCSC-CRL-87-20, Department of Computer and Information Sciences, University of California, Santa Cruz, November 1987. To appear, J. ACM.

[10] T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition. IEEE Trans. Elect. Comp., 14:326-334, 1965.

[11] R. M. Dudley. A Course on Empirical Processes. Springer-Verlag, New York, 1984. Lecture Notes in Mathematics No. 1097.

[12] A. Ehrenfeucht and D. Haussler. Learning Decision Trees from Random Examples. Technical Report UCSC-CRL-87-15, Department of Computer and Information Sciences, University of California, Santa Cruz, October 1987.

[13] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 139-154, Morgan Kaufmann, San Mateo, CA, August 1988.

[14] S. I. Gallant. Representability theory. Unpublished manuscript, 1988.

[15] J. Gill. Probabilistic Turing machines. SIAM J. Computing, 6(4):675-695, 1977.

[16] O. Goldreich, S. Goldwasser, and S. Micali. How to construct random functions. J. ACM, 33(4):792-807, 1986.

[17] O. Goldreich, H. Krawczyk, and M. Luby. On the existence of pseudorandom generators. In Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages 12-24, IEEE Computer Society Press, Washington, D.C., October 1988.


[18] D. Haussler. Generalizing the pac model: sample size bounds from metric dimension-based uniform convergence results. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., October 1989.

[19] D. Haussler. Learning Conjunctive Concepts in Structural Domains. Technical Report UCSC-CRL-87-01, Department of Computer and Information Sciences, University of California, Santa Cruz, February 1987. To appear in Machine Learning.

[20] D. Haussler, M. Kearns, N. Littlestone, and M. K. Warmuth. Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 42-55, Morgan Kaufmann, San Mateo, CA, August 1988.

[21] D. Haussler and G. Pagallo. Feature discovery in empirical learning. Technical Report UCSC-CRL-88-08, Department of Computer and Information Sciences, University of California, Santa Cruz, October 1987. Revised May 1, 1989.

[22] D. Haussler and E. Welzl. Epsilon-nets and simplex range queries. Discrete and Computational Geometry, 2:127-151, 1987.

[23] D. Helmbold, R. Sloan, and M. K. Warmuth. Learning nested differences of intersection closed concept classes. In Proceedings of the 1989 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, August 1989.

[24] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, London, 1979.

[25] M. Kearns, M. Li, L. Pitt, and L. G. Valiant. On the learnability of boolean formulae. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, Assoc. Comp. Mach., New York, May 1987.

[26] M. Kearns and L. Pitt. A polynomial-time algorithm for learning k-variable pattern languages from examples. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., October 1989.

[27] M. Kearns and L. G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing, pages 433-444, Assoc. Comp. Mach., New York, May 1989.

[28] L. A. Levin. One-way functions and pseudorandom generators. Combinatorica, 7(4):357-363, 1987.

[29] J. Lin and J. Vitter. Complexity issues in learning neural nets. In Proceedings of the 1989 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, July 1989.

[30] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1987.


[31] B. K. Natarajan. Some Results on Learning. Technical Report CMU-RI-TR-89-6, Robotics Institute, Carnegie-Mellon University, 1989.

[32] L. Pitt and L. G. Valiant. Computational limitations on learning from examples. J. ACM, 35(4):965-984, 1988.

[33] L. Pitt and M. K. Warmuth. Reductions among prediction problems: On the difficulty of predicting automata. In Proceedings of the 3rd Annual IEEE Conference on Structure in Complexity Theory, pages 60-69, IEEE Computer Society Press, Washington, D.C., June 1988.

[34] K. M. Podnieks. Comparing various concepts of function prediction, part 1. Latv. Gosudarst. Univ. Uch. Zapiski, 210:68-81, 1974. (In Russian.)

[35] K. M. Podnieks. Comparing various concepts of function prediction, part 2. Latv. Gosudarst. Univ. Uch. Zapiski, 233:33-44, 1975. (In Russian.)

[36] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.

[37] R. L. Rivest. Learning decision lists. Machine Learning, 2:229-246, 1987.

[38] R. Schapire. The strength of weak learnability. In Proceedings of the 1989 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, August 1989.

[39] L. Valiant and M. K. Warmuth. Predicting symmetric differences of two halfspaces reduces to predicting halfspaces. Unpublished manuscript, 1989.

[40] L. G. Valiant. A theory of the learnable. Comm. Assoc. Comp. Mach., 27(11):1134-1142, 1984.

[41] V. N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.

[42] A. C. Yao. Theory and applications of trapdoor functions. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, pages 80-91, IEEE Computer Society Press, Washington, D.C., November 1982.