The Power of Selective Memory


Shai Shalev-Shwartz

Joint work with Ofer Dekel and Yoram Singer

Hebrew University, Jerusalem

Slide 2: Outline

• Online learning, loss bounds, etc.
• Hypothesis space: prediction suffix trees (PSTs)
• Margin of prediction and hinge loss
• An online learning algorithm
• Trading margin for depth of the PST
• Automatic calibration
• A self-bounded online algorithm for learning PSTs

Slide 3: Online Learning

• For t = 1, 2, …
• Get an instance x_t
• Predict a target ŷ_t based on x_t
• Get the true target y_t and suffer a loss ℓ(ŷ_t, y_t)
• Update the prediction mechanism
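
A minimal sketch of this protocol in Python; the learner interface, stream, and loss are generic placeholders rather than anything from the slides:

```python
def online_learning(learner, stream, loss):
    """Run the generic online protocol: predict, observe, suffer loss, update."""
    total_loss = 0.0
    for x_t, y_t in stream:             # round t: an instance and its true target
        y_hat = learner.predict(x_t)    # predict based on x_t alone
        total_loss += loss(y_hat, y_t)  # suffer loss once the truth is revealed
        learner.update(x_t, y_t)        # adjust the prediction mechanism
    return total_loss
```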

Slide 4: Analysis of Online Algorithms

• Relative loss bounds (external regret):

For any fixed hypothesis h, the cumulative loss of the online algorithm exceeds the cumulative loss of h by at most an additive regret term.

Slide 5: Prediction Suffix Tree (PST)

Each hypothesis is parameterized by a triplet; among its components is a context function, which assigns a real weight to each context (a node of the tree).
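
A minimal Python sketch of a PST hypothesis. Two assumptions, since the triplet itself is not spelled out in the transcript: the context function is stored as a map from suffix strings to weights, and the prediction on a history is the sign of the summed weights of the suffixes present in the tree (the reading consistent with the worked examples on slides 23 and 24).

```python
class PST:
    """Prediction suffix tree: contexts (strings over {+,-}) mapped to weights."""

    def __init__(self):
        self.g = {"": 0.0}   # context function; the empty context is the root

    def score(self, history: str) -> float:
        # Sum g over every suffix of the history that exists in the tree.
        return sum(self.g[history[i:]]
                   for i in range(len(history) + 1)
                   if history[i:] in self.g)

    def predict(self, history: str) -> int:
        return 1 if self.score(history) >= 0 else -1
```

Storing the tree as a suffix-to-weight map keeps both prediction and update linear in the length of the context, which is all the algorithms below need.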

Slide 6: PST Example

[Figure: an example PST; root weight 0, node weights -3, 1, -1, 4, -2, 7]

Slide 7: Margin of Prediction

• Margin of prediction: y h(x), the true target times the predicted score
• Hinge loss: max{0, 1 - y h(x)}

[Plot: the 0-1 loss and the hinge loss as functions of the margin, over margins from -3 to 3]
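
The two curves in the plot as a small Python sketch; the unit margin threshold in the hinge loss is an assumption, consistent with the worked examples later in the deck:

```python
def zero_one_loss(margin: float) -> float:
    """1 for a wrong prediction (non-positive margin), else 0."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin: float) -> float:
    """max(0, 1 - margin): upper-bounds the 0-1 loss, zero only beyond margin 1."""
    return max(0.0, 1.0 - margin)
```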

Slide 8: Complexity of a Hypothesis

• Define the complexity of a hypothesis as …

• We can also extend g s.t. … and get …
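
The definition itself is not legible in the transcript; the squared Euclidean norm of the context function is the reading consistent with the numbers on slides 23 and 24 (for example, the depth-1 tree with weights ±1.41 has complexity about 4):

```latex
\[
  \|g\|^2 \;=\; \sum_{s} g(s)^2 ,
  \qquad \text{e.g.}\quad 1.41^2 + (-1.41)^2 \approx 4 .
\]
```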

Slide 9: Algorithm I: Learning an Unbounded-Depth PST

• Init: start from the all-zero context function (a single root with weight 0)
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer loss ℓ_t
  • Set the update step size
  • Update the weight vector
  • Update the tree
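
A sketch of one plausible reading of Algorithm I in Python, since the slide's formulas are missing. Assumptions: a Perceptron-style additive update is applied only when the hinge loss is positive, the correction is spread over the suffixes of the current history with a 2^(-d/2) damping at depth d (the geometric decay visible in the node weights of the example slides), the root is left at 0 as in those figures, and the step size tau is a placeholder.

```python
def algorithm_one(stream):
    """Online learning of an unbounded-depth PST (hedged reconstruction)."""
    g = {"": 0.0}          # context function; the empty context is the root
    history = ""
    for y_t in stream:     # y_t in {+1, -1}
        # Predict with the summed weights of all suffixes stored in the tree.
        score = sum(g.get(history[i:], 0.0) for i in range(len(history) + 1))
        y_hat = 1 if score >= 0 else -1      # predicted target
        loss = max(0.0, 1.0 - y_t * score)   # hinge loss
        if loss > 0:                         # update only on a margin error
            tau = 1.0                        # placeholder step size
            for i in range(len(history)):    # every non-empty suffix
                s = history[i:]
                g[s] = g.get(s, 0.0) + y_t * tau * 2.0 ** (-len(s) / 2.0)
        history += "+" if y_t > 0 else "-"   # the tree can grow with t: unbounded depth
    return g
```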

Slide 10: Example

y =
ŷ = ?
[Figure: the initial PST, a single root node with weight 0]

Slide 11: Example

y = +
ŷ = ?
[Figure: PST with root weight 0]

Slide 12: Example

y = +
ŷ = ? ?
[Figure: PST with root weight 0]

Slide 13: Example

y = + -
ŷ = ? ?
[Figure: PST with root weight 0 and a depth-1 node on the '+' branch with weight -.23]

Slide 14: Example

y = + -
ŷ = ? ? ?
[Figure: unchanged PST; root weight 0, '+' branch weight -.23]

Slide 15: Example

y = + - +
ŷ = ? ? ?
[Figure: PST with root weight 0 and node weights -.23, .23, .16 along the '+'/'-' branches]

Slide 16: Example

y = + - +
ŷ = ? ? ? -
[Figure: unchanged PST; root weight 0, node weights -.23, .23, .16]

Slide 17: Example

y = + - + -
ŷ = ? ? ? -
[Figure: PST with root weight 0 and node weights -.42, .23, .16, -.14, -.09 along the '+'/'-' branches]

Slide 18: Example

y = + - + -
ŷ = ? ? ? - +
[Figure: unchanged PST; root weight 0, node weights -.42, .23, .16, -.14, -.09]

Slide 19: Example

y = + - + - +
ŷ = ? ? ? - +
[Figure: PST with root weight 0 and node weights -.42, .41, .29, -.14, -.09, .09, .06 along the '+'/'-' branches]

Slide 20: Analysis

• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples, and assume that …
• Let h be an arbitrary hypothesis with context function g
• Let L be the cumulative loss of h on the sequence of examples. Then the number of mistakes of Algorithm I is bounded in terms of L and ‖g‖²

Slide 21: Proof Sketch

• Define a per-round progress measure
• Upper bound the total progress
• Lower bound the progress of each round
• The upper and lower bounds together give the bound in the theorem
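
The slide's formulas are missing; the standard potential-function skeleton for proofs of this shape, which the projection language on the next slide supports, runs as follows (the exact constants are an assumption):

```latex
% Per-round progress toward a fixed competitor g:
\[
  \Delta_t \;=\; \|g_t - g\|^2 \;-\; \|g_{t+1} - g\|^2 .
\]
% Summing over t telescopes, and g_1 = 0 gives the upper bound
\[
  \sum_t \Delta_t \;=\; \|g_1 - g\|^2 - \|g_{T+1} - g\|^2 \;\le\; \|g\|^2 .
\]
% A per-round lower bound on \Delta_t in terms of the algorithm's loss and
% the competitor's loss, combined with this upper bound, gives the theorem.
```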

Slide 22: Proof Sketch (Cont.)

Where does the lower bound come from?
• For simplicity, assume that …
• Define a Hilbert space of context functions
• The context function g_{t+1} is the projection of g_t onto a half-space determined by the current example, where f is the function representing that example in the Hilbert space
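
One concrete form of the projection statement; the unit margin in the constraint is carried over from the hinge loss and should be read as an assumption:

```latex
\[
  g_{t+1} \;=\; \mathop{\mathrm{argmin}}_{g'} \;\|g' - g_t\|^2
  \quad \text{s.t.} \quad y_t \,\langle g', f \rangle \;\ge\; 1 ,
\]
% i.e., g_{t+1} is the Euclidean projection of g_t onto the half-space of
% context functions that attain margin at least 1 on the current example,
% where f represents the example in the Hilbert space.
```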

Slide 23: Example Revisited

• The following hypothesis has cumulative loss of 2 and complexity of 2. Therefore, the number of mistakes is bounded above by 12.

y = + - + - + - + -

Slide 24: Example Revisited

• The following hypothesis has cumulative loss of 1 and complexity of 4. Therefore, the number of mistakes is bounded above by 18. But this tree is very shallow.

[Figure: a depth-1 PST with root weight 0 and leaf weights 1.41 and -1.41 on its two branches]

y = + - + - + - + -

Problem: the tree we learned is much deeper!
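
The theorem's formula is not legible in the transcript, but both numeric examples fit a mistake bound of the form M ≤ 2L + 4‖g‖²; treat the constants as an inference from the slides, not as the paper's verbatim statement:

```latex
\[
  M \;\le\; 2L + 4\|g\|^2 :
  \qquad 2 \cdot 2 + 4 \cdot 2 = 12 ,
  \qquad 2 \cdot 1 + 4 \cdot 4 = 18 .
\]
```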

Slide 25: Geometric Intuition

[Figure: geometric illustration of the projection update]

Slide 26: Geometric Intuition (Cont.)

Let's force g_{t+1} to be sparse by "canceling" the new coordinate.

Slide 27: Geometric Intuition (Cont.)

Now we can show that: …

Slide 28: Trading Margin for Sparsity

• We got that …
• If … is much smaller than …, we can still get a loss bound!
• Problem: what happens if … is very small and therefore …? Solution: tolerate small margin errors!
• Conclusion: if we tolerate small margin errors, we can get a sparser tree.

Slide 29: Automatic Calibration

• Problem: the value of … is unknown
• Solution: use the data itself to estimate it! More specifically:
• Denote …
• If we keep … then we get a mistake bound

Slide 30: Algorithm II: Learning a Self-Bounded-Depth PST

• Init: start from the all-zero context function
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer loss ℓ_t
  • If …, do nothing! Otherwise:
    • Set …
    • Set …
    • Set the depth bound d_t
    • Update w and the tree as in Algorithm I, up to depth d_t
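
A matching sketch of the self-bounded variant. The three "Set" formulas are missing from the transcript, so the margin test and the depth rule below are loud placeholders; the code only illustrates the control flow: skip rounds where the margin is already comfortable, otherwise update just the top d_t levels of the suffix path.

```python
import math

def algorithm_two(stream, theta=0.5):
    """Self-bounded-depth PST learning (control-flow sketch only)."""
    g = {"": 0.0}                        # context function; root = empty context
    history = ""
    for t, y_t in enumerate(stream, start=1):
        score = sum(g.get(history[i:], 0.0) for i in range(len(history) + 1))
        if y_t * score >= theta:         # placeholder margin test: do nothing
            pass
        else:                            # margin error: bounded-depth update
            tau = 1.0                                # placeholder step size
            d_t = int(math.ceil(math.log2(t + 1)))   # placeholder depth rule
            for d in range(1, min(d_t, len(history)) + 1):
                s = history[-d:]                     # suffix of depth d
                g[s] = g.get(s, 0.0) + y_t * tau * 2.0 ** (-d / 2.0)
        history += "+" if y_t > 0 else "-"
    return g
```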

Slide 31: Analysis: Loss Bound

• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples, and assume that …
• Let h be an arbitrary hypothesis with context function g
• Let L be the cumulative loss of h on the sequence of examples. Then Algorithm II satisfies a relative loss bound in terms of L and ‖g‖²

Slide 32: Analysis: Bounded Depth

• Under the previous conditions, the depth of all the trees learned by the algorithm is bounded above by …

Slide 33: Example Revisited: Performance of Algorithm II

• y = + - + - + - + - …
• Only 3 mistakes
• The last PST is of depth 5
• The margin is 0.61 (after normalization)
• The margin of the max-margin tree (of infinite depth) is 0.7071

[Figure: the final PST; root weight 0, node weights -.55, .55, .39, -.22, -.07, .07, .05, .03, -.05 along the '+'/'-' branches]
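
A small sanity check on the 0.7071 figure: assuming the normalized margin is the prediction score divided by the norm of the context function, the shallow tree from slide 24 attains exactly this value:

```latex
\[
  \frac{1.41}{\sqrt{1.41^2 + (-1.41)^2}}
  \;\approx\; \frac{1.41}{2}
  \;\approx\; 0.7071
  \;=\; \tfrac{1}{\sqrt{2}} .
\]
```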

Slide 34: Conclusions

• Discriminative online learning of PSTs
• Loss bound
• Trading margin for sparsity
• Automatic calibration

Future work
• Experiments
• Feature selection and extraction
• Support vector selection