Learning under the Uniform Distribution – Toward DNF –
Ryan O'Donnell, Microsoft Research, January 2006

Transcript of the slide deck (21 pages).

Page 1

Learning under the Uniform Distribution

– Toward DNF –

Ryan O’Donnell

Microsoft Research

January, 2006

Page 2

Re: How to make $1000!

A Grand of George W.'s: [image: a joke $1000 "George W. Bush" bill]

A Hundred Hamiltons: [image: one hundred $10 bills]

A Cool Cleveland: [image: a $1000 bill]

Page 3

The “junta” learning problem

f : {−1,+1}^n → {−1,+1} is an unknown Boolean function.

f depends on only k ≪ n bits.

May generate "examples", ⟨x, f(x)⟩, where x is generated uniformly at random.

Task: Identify the k relevant variables.

⇔ Identify f exactly. ⇔ Identify one relevant variable.

(Knowing the relevant variables, a small number of further examples determines f exactly; and finding relevant variables one at a time recovers all k, so these tasks are equivalent up to small overhead.)

[Image: DNA]
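To make the setup concrete, here is a minimal Python sketch of the example oracle; the particular junta (k = 3, f = majority of the hidden bits) is a hypothetical stand-in, not from the talk.

```python
import random

n, k = 20, 3
relevant = random.sample(range(n), k)   # hidden relevant coordinates

def f(x):
    # Hypothetical 3-junta: majority of the three hidden bits.
    return 1 if sum(x[i] for i in relevant) > 0 else -1

def example():
    # One labeled example <x, f(x)> with x uniform on {-1,+1}^n.
    x = [random.choice([-1, 1]) for _ in range(n)]
    return x, f(x)
```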

Page 4

Run-time efficiency

Information-theoretically: need only ≈ 2^k log n examples.

Algorithmically: seem to need n^{Ω(k)} time steps.

Naive algorithm: time n^k.

Best known algorithm: time n^{.704 k} [Mossel-O-Servedio '04].
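A sketch of the naive n^k-time approach, under my illustrative reading that it exhaustively tests every size-k subset of coordinates for consistency with the sample:

```python
from itertools import combinations

def naive_junta(examples, n, k):
    # Enumerate all (n choose k) ~ n^k candidate sets S of k coordinates;
    # keep S if the restriction x|_S determines f(x) consistently.
    for S in combinations(range(n), k):
        table, ok = {}, True
        for x, y in examples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, y) != y:
                ok = False   # same restriction, two different labels
                break
        if ok:
            return S         # a consistent candidate for the junta
    return None
```

With ≈ 2^k log n examples, a wrong subset will typically be caught in a contradiction, leaving the true relevant set.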

Page 5

How to get the money

Learning log n-juntas in poly(n) time gets you $1000.

Learning log log n-juntas in poly(n) time gets you $1000.

Learning ω(1)-juntas in poly(n) time gets you $200.

The case k = log n is a subproblem of "Learning polynomial-size DNF under the uniform distribution." (A k-junta has a DNF of size at most 2^k, so log n-juntas are a special case of poly(n)-size DNF.)

http://www.thesmokinggun.com/archive/bushbill1.html

Page 6

Algorithmic attempts

• For each xi, measure the empirical "correlation" with f(x): E[f(x)·xi]. (Time: n)

Different from 0 ⇒ xi must be relevant.

Converse false: xi can be influential but uncorrelated.

(e.g., k = 4, f = "exactly 2 out of 4 bits are +1")

• Try measuring f's correlation with pairs of variables: E[f(x)·xi·xj]. (Time: n^2)

Different from 0 ⇒ both xi and xj must be relevant.

Still might not work. (e.g., k ≥ 3, f = "parity on k bits")

• So try measuring correlation with all triples of variables… (Time: n^3) (These estimators are sketched below.)
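A minimal sketch of these empirical tests, written so that an oracle like the example() sketched after page 3 is passed in; m controls the sampling error, which is ≈ 1/√m:

```python
from itertools import combinations

def correlations(example, n, m=10000):
    # Empirical estimates of E[f(x) * x_i] for each coordinate i.
    sums = [0.0] * n
    for _ in range(m):
        x, y = example()
        for i in range(n):
            sums[i] += y * x[i]
    return [s / m for s in sums]

def pair_correlations(example, n, m=10000):
    # Empirical estimates of E[f(x) * x_i * x_j] for each pair (i, j).
    sums = {p: 0.0 for p in combinations(range(n), 2)}
    for _ in range(m):
        x, y = example()
        for (i, j) in sums:
            sums[(i, j)] += y * x[i] * x[j]
    return {p: s / m for p, s in sums.items()}
```

Estimates bounded away from 0 certify relevance; a parity junta on k ≥ 3 bits drives all of these estimates to 0, which is exactly the obstacle the slide describes.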

Page 7

A result

In time n^d, you can check correlation with all d-bit functions.

What kind of Boolean function on k bits could be uncorrelated with all functions on d or fewer bits?? (Well, parities on > d bits, e.g.…)

[Mossel-O-Servedio '04]:

• Proves a structure theorem about such functions. (They must be expressible as parities of ANDs of small size.)

• Can apply a parity-learning algorithm in that case.

• End result: an algorithm running in time n^{.704 k}.

Uniform-distribution learning results are often implied by structural results about Boolean functions.

Page 8

PAC Learning

There is an unknown f : {−1,+1}^n → {−1,+1}.

Algorithm gets i.i.d. "examples", ⟨x, f(x)⟩, with x drawn from the uniform distribution. (In the original PAC model the examples come from an unknown distribution; this talk fixes it to uniform.)

Task: "Learn." Given ε, find a "hypothesis" function h which is (w.h.p.) ε-close to f.

Goal: running-time efficiency.

[Image: cover of "Circuits of the Mind"]

Page 9

Running-time efficiency

The more "complex" f is, the more time it's fair to allow.

Fix some measure of "complexity" or "size", s = s(f); e.g., the size of the smallest DNF formula for f.

Goal: run in time poly(n, 1/ε, s).

Often focus on fixing s = poly(n), learning in poly(n) time.

Page 10

The “junta” problem

Fits into the formulation (slightly strangely):

• ε is fixed to 0. (Equivalently, any ε < 2^{−k}, since distinct k-juntas differ on at least a 2^{−k} fraction of inputs.)

• Measure of "size" is 2^(# of relevant variables), i.e., s = 2^k.

[Mossel-O-Servedio '04] had running time essentially n^{.704 log₂ s}.

Even under this extremely conservative notion of "size", we don't know how to learn in poly(n) time for s = poly(n).

Page 11

complexity measure s            fastest known algorithm
DNF size                        n^{O(log s)}          [V '90]
2^(# of relevant variables)     n^{.704 log₂ s}       [MOS '04]
depth-d circuit size            n^{O(log^{d−1} s)}    [LMN '93, H '02]
Decision Tree size              n^{O(log s)}          [EH '89]

For depth-d circuits: assuming factoring is hard, n^{log^{Ω(d)} s} time is necessary, even with "queries". [K '93]

Any algorithm that works in the "Statistical Query" model requires time n^k. [BF '02]

Page 12

What to do?

1. Give Learner extra help:

• “Queries”: Learner can ask for f(x) for any x.

⇒ Can learn DNF in time poly(n, s). [Jackson '94]

• "More structured data":

  • Examples are not i.i.d., but are generated by a standard random walk.

  • Examples come in pairs, ⟨x, f(x)⟩, ⟨x′, f(x′)⟩, where x, x′ share a > ½ fraction of coordinates. (A sketch of such a pair oracle follows this list.)

⇒ Can learn DNF in time poly(n, s). [Bshouty-Mossel-O-Servedio '05]
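A sketch of the second kind of structured-data oracle; the re-randomization probability delta is my illustrative assumption for how the correlated pair ⟨x, x′⟩ could be generated (any delta < ½ makes the pair agree on more than half the coordinates in expectation):

```python
import random

def pair_example(f, n, delta=0.25):
    # Draw x uniformly; form x' by re-randomizing each coordinate
    # independently with probability delta.  The pair then agrees on a
    # 1 - delta/2 > 1/2 fraction of coordinates in expectation.
    x = [random.choice([-1, 1]) for _ in range(n)]
    xp = [random.choice([-1, 1]) if random.random() < delta else xi
          for xi in x]
    return (x, f(x)), (xp, f(xp))
```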

Page 13

What to do? (rest of the talk)

2. Give up on trying to learn all functions.

• Rest of the talk: focus on learning just monotone functions.

• f is monotone ⇔ changing a −1 to a +1 in the input can only make f go from −1 to +1, not the reverse.

• Long history in PAC learning [HM '91, KLV '94, KMP '94, B '95, BT '96, BCL '98, V '98, SM '00, S '04, JS '05, …]

• f has DNF size s and is monotone ⇒ f has a size-s monotone DNF.

Page 14

Why does monotonicity help?

1. More structured.

2. You can identify relevant variables.

Fact: If f is monotone, then f depends on xi iff xi has correlation with f; i.e., E[f(x)·xi] ≠ 0.

Proof: If f is monotone, its variables have only nonnegative correlations; indeed, the correlation of xi equals xi's "influence", which is nonzero exactly when f depends on xi.
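A worked instance (my example, not from the slides): let f = AND(x1, x2) on n = 3 bits, i.e., f = +1 iff x1 = x2 = +1. Over the four settings of (x1, x2):

(+1,+1): f = +1, f·x1 = +1;   (+1,−1): f = −1, f·x1 = −1;
(−1,+1): f = −1, f·x1 = +1;   (−1,−1): f = −1, f·x1 = +1.

So E[f(x)·x1] = (1 − 1 + 1 + 1)/4 = 1/2 ≠ 0, while the irrelevant x3 has E[f(x)·x3] = 0.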

Page 15

Monotone case:

complexity measure s            fastest known algorithm (monotone f)
DNF size                        poly(n, s^{log s})    [Servedio '04]
2^(# of relevant variables)     poly(n, 2^k) = poly(n, s)
depth-d circuit size            —
Decision Tree size              poly(n, s)            [O-Servedio '06]

(Compare the table for arbitrary functions on page 11.)

Page 16

Learning Decision Trees

Non-monotone (general) case:

Structural result:

Every size-s decision tree (# of leaves = s) is ε-close to a decision tree with depth d := log₂(s/ε).

Proof: Truncate to depth d. The probability that an input would use any particular longer path is ≤ 2^{−d} = ε/s. There are at most s such paths. Use the union bound.

[Figure: an example decision tree querying x1, …, x5, with ±1 leaves.]
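The truncation step in code, a minimal sketch assuming a hypothetical tree representation: tuples (var, left, right) for internal nodes and ±1 for leaves; truncated subtrees get an arbitrary fixed label.

```python
def truncate(tree, d):
    # Cut the tree off at depth d; any deeper subtree is replaced by a
    # fixed leaf (+1).  Each of the <= s leaves deeper than d is reached
    # with probability <= 2^{-d}, so with d = log2(s/eps) the truncated
    # tree disagrees with the original on <= s * 2^{-d} = eps fraction.
    if tree in (-1, 1):       # leaf
        return tree
    if d == 0:                # out of depth budget: arbitrary leaf
        return 1
    var, left, right = tree
    return (var, truncate(left, d - 1), truncate(right, d - 1))
```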

Page 17

Learning Decision Trees

Structural result:

Any depth-d decision tree can be expressed as a degree-d (multilinear) polynomial over ℝ.

Proof: Given a path in the tree, e.g., "x1 = +1, x3 = −1, x6 = +1, output +1", there is a degree-d expression in the variables which is: 0 if the path is not followed, path-output if the path is followed. Now just add these terms, one per leaf.
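Concretely (my rendering of the slide's example path), the term for "x1 = +1, x3 = −1, x6 = +1, output +1" is

(+1) · (1 + x1)/2 · (1 − x3)/2 · (1 + x6)/2,

a degree-3 multilinear polynomial equal to +1 exactly when the path is followed and 0 otherwise; summing one such term per leaf reproduces the tree everywhere.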

Page 18

Learning Decision Trees

Cor: Every size-s decision tree is ε-close to a degree-log₂(s/ε) multilinear polynomial.

Least-squares polynomial regression ("Low-Degree Algorithm"):

• Draw a bunch of data.

• Try to fit it with a degree-d multilinear polynomial over ℝ.

• Minimizing the L2 error is a linear least-squares problem over the n^d many variables (the unknown coefficients).

⇒ learn size-s DTs in time poly(n^d) = poly(n^{log s}).
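A minimal numpy sketch of the Low-Degree Algorithm, not the talk's implementation; it fits coefficients for all monomials of degree ≤ d by linear least squares and sign-rounds the result:

```python
import numpy as np
from itertools import combinations

def low_degree_fit(X, y, d):
    # X: m-by-n array of +/-1 inputs; y: length-m array of +/-1 labels.
    # One feature per monomial prod_{i in S} x_i with |S| <= d
    # (about n^d of them), then ordinary least squares.
    m, n = X.shape
    monomials = [S for r in range(d + 1)
                 for S in combinations(range(n), r)]
    A = np.column_stack([np.prod(X[:, list(S)], axis=1)
                         for S in monomials])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

    def h(x):
        # Hypothesis: sign of the fitted polynomial at x.
        x = np.asarray(x)
        val = sum(c * np.prod(x[list(S)])
                  for c, S in zip(coeffs, monomials))
        return 1 if val >= 0 else -1
    return h
```

Since the tree is ε-close to a low-degree polynomial, a good L2 fit exists, and sign-rounding that fit gives a hypothesis with small disagreement.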

Page 19

Learning monotone Decision Trees

[O-Servedio '06]:

1. Structural theorem on DTs: For any size-s decision tree (not necessarily monotone), the sum of the n degree-1 correlations, Σᵢ E[f(x)·xi], is at most √(log s).

2. Easy fact we've seen: For monotone functions, variable correlations = variable "influences".

3. Theorem of [Friedgut '96]: If the "total influence" of f is at most t, then f essentially has at most 2^{O(t)} relevant variables.

4. Folklore "Fourier analysis" fact: If the total influence of f is at most t, then f is close to a degree-O(t) polynomial.

Page 20

Learning monotone Decision Trees

Conclusion: If f is monotone and has a size-s decision tree, then it has essentially only 2^{O(√(log s))} relevant variables and essentially only degree O(√(log s)).

Algorithm:

• Identify the essentially relevant variables (by correlation estimation).

• Run the Polynomial Regression algorithm up to degree O(√(log s)), but only using those relevant variables.

Total time: poly(n, s). (A rough sketch of the combined algorithm follows.)
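Putting the pieces together, a rough composition of the earlier sketches (correlations from page 6's sketch, low_degree_fit from page 18's); the threshold tau and sample size m are illustrative stand-ins, not the paper's settings:

```python
import numpy as np

def learn_monotone_dt(example, n, degree, m=20000, tau=0.01):
    # Step 1: estimate E[f(x) * x_i] for each i.  For monotone f this
    # equals x_i's influence, so coordinates above the threshold are
    # the (essentially) relevant ones.
    corr = correlations(example, n, m)
    relevant = [i for i in range(n) if corr[i] >= tau]

    # Step 2: run low-degree polynomial regression using only the
    # relevant coordinates.
    data = [example() for _ in range(m)]
    X = np.array([[x[i] for i in relevant] for x, _ in data])
    y = np.array([label for _, label in data], dtype=float)
    g = low_degree_fit(X, y, degree)

    # Hypothesis: project onto the relevant coordinates, then apply g.
    return lambda x: g([x[i] for i in relevant])
```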

Page 21

Open problem

Learn monotone DNF under uniform in polynomial time!

A source of help: There is a poly-time algorithm for learning almost all randomly chosen monotone DNF of size up to n^3. [Servedio-Jackson '05]

Structured monotone DNF – monotone DTs – are efficiently learnable. "Typical-looking" monotone DNF are efficiently learnable (at least up to size n^3). So… all monotone DNF are efficiently learnable?

I think this problem is great because it is:

a) Possibly tractable. b) Possibly true.

c) Interesting to complexity theory people.

d) And solving it would close the book on learning monotone functions under uniform!