Transcript: Multi-Task Averaging: Theory and Practice
Maya R. Gupta, Google Research, Univ. Washington
with Sergey Feldman (Univ. Washington) and Bela Frigyik (Univ. Pecs)
Aristotle

The idea of a mean is old:

"By the mean of a thing I denote a point equally distant from either extreme..." - Aristotle

$v = \frac{y_{\min} + y_{\max}}{2}, \qquad v - y_{\min} = y_{\max} - v$
Tycho Brahe (16th century)

Averaged to reduce measurement error:

$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
Legendre (1805)

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$
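A quick check of Legendre's observation (standard calculus, not on the original slide): setting the derivative of the squared error to zero recovers the sample mean.

$\frac{d}{d\mu} \sum_{i=1}^{N} (y_i - \mu)^2 = -2 \sum_{i=1}^{N} (y_i - \mu) = 0 \;\Longrightarrow\; \mu = \frac{1}{N} \sum_{i=1}^{N} y_i = \bar{y}$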
The result generalizes:

Banerjee et al. 2005: the mean minimizes any Bregman divergence:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} \psi(y_i, \mu)$

Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.
Gauss (1809)

The average was central to Gauss's construction of the normal distribution. His goals:
- a smooth distribution
- whose likelihood peak was at the sample mean.
Fisher 1922

"...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." - R. A. Fisher
Stein's Paradox 1956

Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means.

Stein Estimation: One Sample Case
Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables.
Given: a random sample $Y_t \sim N(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.
Maximum Likelihood Estimate:

$\hat{\mu}_t = Y_t$

James-Stein Estimate:

$\hat{\mu}^{JS}_t = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
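A minimal NumPy sketch (not from the slides) comparing the two estimates above; T, sigma, and the simulated means are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
T, sigma = 50, 1.0                  # illustrative values
mu = rng.normal(0.0, 1.0, size=T)   # true (hidden) means
Y = rng.normal(mu, sigma)           # one sample per task

mle = Y                             # maximum likelihood estimate
shrink = 1.0 - (T - 2) * sigma**2 / np.sum(Y**2)
js = shrink * Y                     # basic James-Stein estimate

print("MLE total squared error:", np.sum((mle - mu) ** 2))
print("JS  total squared error:", np.sum((js - mu) ** 2))

On most draws the James-Stein total squared error comes out smaller, which is Stein's paradox in action.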
James-Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
- $\mu_t \sim N(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim N(\mu_t, \sigma^2)$, with $\sigma^2$ known

Under these assumptions the posterior mean is

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t,$

and the unknown shrinkage factor can be estimated from the data because

$E\left[\frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}.$

Substituting this estimate into the posterior mean yields the James-Stein estimate:

$\hat{\mu}^{JS}_t = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
A More General JSE (Bock, 1972)

Key assumptions and notation:
- $\mu_t \sim N(\xi, \tau^2)$, with $\tau^2$ and $\xi$ unknown
- $Y_{ti} \sim N(\mu_t, \sigma_t^2)$ for $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown
- $\xi = \frac{1}{T} \sum_{r=1}^{T} \bar{Y}_r$
- positive part: $(x)_+ = \max(x, 0)$
- diagonal $\Sigma$ with $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$
- $\bar{Y}$ is a T-length vector with tth entry $\bar{Y}_t$

General James-Stein Estimate:

$\hat{\mu}^{JS}_t = \xi + \left(1 - \frac{T - 3}{(\bar{Y} - \xi)^T \Sigma^{-1} (\bar{Y} - \xi)}\right)_+ \left(\bar{Y}_t - \xi\right)$
James-Stein Dominates

James's and Stein's theorem (1961): for T > 3, the general JSE dominates the sample average (MLE):

$E[\|\mu - \hat{\mu}^{JS}\|_2^2] \le E[\|\mu - \bar{Y}\|_2^2]$

for every choice of $\mu$. This is also written as $R(\hat{\mu}^{JS}) \le R(\bar{Y})$.
James-Stein Estimation ↔ Empirical Bayes

Multi-Task Averaging (MTA) ↔ Empirical Loss Minimization with Regularization (empirical Vapnik; Tikhonov regularization)
Multi-Task Averaging (Feldman et al. 2012)

Problem: estimate the means $\{\mu_t\}$ of T random variables.
Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable.
Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$.
Building the MTA Objective

"Single-task" averaging:

$\bar{y}_t = \arg\min_{\mu_t} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Adding across tasks:

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Normalizing each task's squared error by $\sigma_t^2$ (a Mahalanobis distance):

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2}$
The MTA Objective

"Multi-task" averaging couples the tasks with a pairwise regularizer:

$\{\hat{\mu}^{MTA}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\mu_r - \mu_s)^2$

The first term is the Mahalanobis distance to the samples from task t; $A_{rs}$ is the similarity between task r and task s.

Example tasks: Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp. Or a seemingly unrelated Task 2: estimate the price of tea in China?

[Figures: for two tasks, the MTA estimates $\hat{\mu}^{MTA}_1$ and $\hat{\mu}^{MTA}_2$ are pulled toward each other relative to the sample averages $\bar{y}_1$ and $\bar{y}_2$.]

The empirical loss term lowers bias; the regularizer lowers estimation variance.
MTA Closed-Form Solution

For non-negative A:

$\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$

where $\hat{\mu}^{MTA}$ is the vector of T MTA solutions, $\bar{y}$ is the vector of T sample averages, L is the graph Laplacian of $A + A^T$, and $\Sigma$ is the diagonal matrix of sample-mean variances, $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$.
Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.

Writing $W = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1}$, the MTA estimates $\hat{\mu}^{MTA} = W \bar{y}$ are a linear combination of the T sample averages.

Theorem: W is right-stochastic, so the MTA estimates are in fact a convex combination of the sample averages.
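A minimal NumPy sketch of the closed form above (a sketch under the slide's definitions, not reference code); the inputs ybar, sigma2, N, A, and gamma are assumed given.

import numpy as np

def mta(ybar, sigma2, N, A, gamma=1.0):
    """Closed-form MTA estimate: (I + (gamma/T) Sigma L)^{-1} ybar."""
    T = len(ybar)
    Sigma = np.diag(sigma2 / N)       # sample-mean variances sigma_t^2 / N_t
    S = A + A.T                       # symmetrize the similarity matrix
    L = np.diag(S.sum(axis=1)) - S    # graph Laplacian of A + A^T
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ ybar                   # W is right-stochastic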
When is MTA Better than the Sample Means?

Consider two tasks, $Y_{1i} = \mu_1 + \epsilon_1$ ($N_1$ samples, variance $\sigma_1^2$) and $Y_{2i} = \mu_2 + \epsilon_2$ ($N_2$ samples, variance $\sigma_2^2$); say, Task 1: estimate the average movie ticket price, and Task 2: estimate the mean age of kids at summer camp.

The MTA estimate $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$ works out to

$\hat{\mu}^{MTA}_1 = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_2.$
The MTA estimate is biased, but has smaller error variance than the sample averages. Specifically,

$\mathrm{Risk}[\hat{\mu}^{MTA}_1] < \mathrm{Risk}[\bar{Y}_1] \quad \text{if} \quad (\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2},$

or, rearranged,

$\mathrm{Risk}[\hat{\mu}^{MTA}] < \mathrm{Risk}[\bar{Y}] \quad \text{if} \quad (\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}.$
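A quick Monte-Carlo check of this condition (an illustrative sketch, not from the slides; all parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])               # hypothetical true means
sigma2 = np.array([1.0, 1.0])           # task variances
N = np.array([10, 10])                  # samples per task
A12, gamma, T = 1.0, 1.0, 2

# Closed form W = (I + (gamma/T) Sigma L)^{-1}, L = Laplacian of A + A^T
S = np.array([[0.0, A12], [A12, 0.0]])  # A + A^T, with only A_12 set in A
L = np.diag(S.sum(axis=1)) - S
W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(sigma2 / N) @ L)

err_single = err_mta = 0.0
for _ in range(20_000):
    ybar = rng.normal(mu, np.sqrt(sigma2 / N))  # simulate sample averages
    est = W @ ybar
    err_single += (ybar[0] - mu[0]) ** 2
    err_mta += (est[0] - mu[0]) ** 2

# Here (mu1 - mu2)^2 = 1 < 4/A12 + sigma1^2/N1 + sigma2^2/N2 = 4.2,
# so task 1's MTA risk should come out lower:
print(err_mta / err_single)             # expect a ratio below 1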
Optimal A for T = 2

Example: two tasks with true means $\mu_1$ and $\mu_2$. The optimal task similarity in terms of MSE is

$A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

Estimated optimal A for T = 2: since the true means are unknown, plug in the sample averages:

$\hat{A}^*_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$

[Figure: risk as a function of $A_{12}$ for $\sigma_1^2 = \sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$, with the optimal similarity $A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$ marked.]
Optimal A for T > 2

- Bad news: no simple analytical minimization of the risk for T > 2.
- One solution: use the pairwise estimate to populate A: $\hat{A}^*_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$.
- Better solution: constrain $A = a\mathbf{1}\mathbf{1}^T$ and optimize over the scalar a.
- Analyzable: the optimal similarity is 2 divided by the average squared distance between the T means:

$a^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\mu_r - \mu_s)^2}$

- In practice, we estimate (as sketched in the code below):

$\hat{a}^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\bar{y}_r - \bar{y}_s)^2}$
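A sketch of this constant-similarity estimator (assumed setup, not from the slides): estimate a from the sample averages, then apply the closed form.

import numpy as np

def mta_constant(ybar, sigma2, N, gamma=1.0):
    """MTA with A = a 1 1^T, a = 2 / (average squared distance)."""
    T = len(ybar)
    diffs = (ybar[:, None] - ybar[None, :]) ** 2
    # diagonal terms are zero; assumes the ybar are not all identical
    a_hat = 2.0 / (diffs.sum() / (T * (T - 1)))
    A = a_hat * np.ones((T, T))
    L = np.diag(A.sum(axis=1)) - A               # Laplacian (A is symmetric)
    W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(sigma2 / N) @ L)
    return W @ ybar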
How to Set the Similarity Matrix A?

- Choose A to minimize expected total squared error.
- Choose A to minimize worst-case total squared error.
Minimax A for T = 2

To find the minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.
2. A least favorable prior (LFP):
$p(\mu_1, \mu_2) = \begin{cases} \frac{1}{2} & \text{if } (\mu_1, \mu_2) = (b_l, b_u) \\ \frac{1}{2} & \text{if } (\mu_1, \mu_2) = (b_u, b_l) \\ 0 & \text{otherwise} \end{cases}$
3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A^{mm}_{12} = \frac{2}{(b_u - b_l)^2}$.

In practice, set $b_l = \min_t \bar{y}_t$ and $b_u = \max_t \bar{y}_t$.

Minimax A for T > 2: the same construction gives MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$.
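In code, the minimax choice is a one-line change from the constant-similarity sketch above (again an assumed setup):

import numpy as np

def minimax_similarity(ybar):
    """Constant minimax similarity a_mm = 2 / (b_u - b_l)^2, with the
    range [b_l, b_u] estimated from the sample averages."""
    b_l, b_u = float(np.min(ybar)), float(np.max(ybar))
    return 2.0 / (b_u - b_l) ** 2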
Estimator Summary

[Summary table of the estimators not captured in the transcript.]
Simulations

Gaussian simulations:
  $\mu_t \sim N(0, \sigma_\mu^2)$
  $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$
  $N_t \sim U\{2, \ldots, 100\}$
  $y_{ti} \sim N(\mu_t, \sigma_t^2)$

Uniform simulations:
  $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$
  $\sigma_t^2 \sim U(0.1, 2.0)$
  $N_t \sim U\{2, \ldots, 100\}$
  $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$
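A sketch of the Gaussian simulation setup above (the value of $\sigma_\mu^2$ is an arbitrary illustrative choice here):

import numpy as np

rng = np.random.default_rng(2)
T, sigma_mu2 = 25, 1.0                       # sigma_mu^2 is illustrative

mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)
sigma2 = rng.gamma(shape=0.9, scale=1.0, size=T) + 0.1
N = rng.integers(2, 101, size=T)             # N_t in {2, ..., 100}

# per-task samples and their averages
ybar = np.array([rng.normal(mu[t], np.sqrt(sigma2[t]), N[t]).mean()
                 for t in range(T)])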
5-Fold Randomized Cross-Validation

[Slide details not captured in the transcript.]

[Results figures, lower is better: Gaussian simulation for T = 5, T = 25, and T = 500; uniform simulation for T = 5, T = 25, and T = 500.]
Scales O(T)

Constant and minimax MTA weight matrices can be rewritten using the Sherman-Morrison formula:

$W = (I + \Sigma L(a\mathbf{1}\mathbf{1}^T))^{-1}$
$\;\;= (I + \Sigma(aTI - a\mathbf{1}\mathbf{1}^T))^{-1}$
$\;\;= (I + aT\Sigma - a\Sigma\mathbf{1}\mathbf{1}^T)^{-1}$
$\;\;= (Z - z\mathbf{1}^T)^{-1}$
$\;\;= Z^{-1} + \frac{Z^{-1} z \mathbf{1}^T Z^{-1}}{1 - \mathbf{1}^T Z^{-1} z}$

where $Z = I + aT\Sigma$ and $z = a\Sigma\mathbf{1}$. Z is diagonal, so $Z^{-1}$, $Z^{-1}z$, and $\mathbf{1}^T Z^{-1}$ can all be computed in O(T), and hence $W\bar{Y}$ is O(T).
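A sketch of that O(T) computation (assumed setup; the factor gamma/T is taken to be absorbed into a, as in the derivation above):

import numpy as np

def mta_constant_fast(ybar, sigma, a):
    """O(T) MTA estimate for A = a 1 1^T via Sherman-Morrison.
    sigma holds the diagonal of Sigma, i.e. sigma_t^2 / N_t."""
    T = len(ybar)
    Zdiag = 1.0 + a * T * sigma      # Z = I + a T Sigma (diagonal)
    z = a * sigma                    # z = a Sigma 1
    Zinv_y = ybar / Zdiag
    Zinv_z = z / Zdiag
    # W ybar = Z^{-1} ybar + Z^{-1} z (1^T Z^{-1} ybar) / (1 - 1^T Z^{-1} z)
    return Zinv_y + Zinv_z * Zinv_y.sum() / (1.0 - Zinv_z.sum())

For small T this agrees with forming W by a dense matrix inverse, but it needs only O(T) time and memory.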
Application: Class Grades

Problem: estimate the final grades $\{\mu_t\}$ of T students.
Given: N homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

- 16 classrooms → 16 datasets.
- Uncurved grades, normalized to be between 0 and 100.
- Pooled variance used for all tasks in a dataset.
- Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.

[Results figures: percent change in risk vs. single-task. Lower is better.]
Application: Product Sales

Exp 1: How much will the tth customer spend on their next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the tth customer spent on $N_t$ orders.
- T = 477
- $y_{ti}$ ranged from $15 to $480.
- $N_t$ ranged from 2 to 17.

Exp 2: If you bought the tth puzzle, how much will you spend on your next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the tth puzzle.
- T = 77
- $y_{ti}$ ranged from $0 to $480.
- $N_t$ ranged from 8 to 348.

No ground truth → use the sample means of all the data as the $\mu_t$, and use a random half of the data to get the $\bar{y}_t$.

[Results figures: percent change in risk vs. single-task, averaged over 1000 random splits. Lower is better.]
Model Mismatch: 2008 Election

Problem: what percent of the tth state's vote will go to Obama and to McCain on election day?
Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

[Results figure: percent change in average risk vs. single-task. Lower is better.]
MTA Applied to Kernel Density Estimation

KDE: given that events $\{x_i\}$ happened, estimate the probability of event z as

$p(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$

Equivalently,

$p(z) = \arg\min_{y(z)} \sum_{i=1}^{N} (K(x_i, z) - y(z))^2$

Use MTA to form a multi-task KDE:

$\arg\min_{\{y_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (K_t(x_{ti}, z_t) - y_t(z_t))^2 + \gamma \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (y_r(z_r) - y_s(z_s))^2$
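One way this could look in code (a speculative sketch, not the authors' implementation: a Gaussian kernel is assumed, and the per-task kernel-value variances stand in for $\Sigma$ in the MTA closed form):

import numpy as np

def mt_kde(query, events, A, gamma=1.0, bandwidth=1.0):
    """Multi-task KDE at one query point z_t per task.
    query:  length-T array of query points
    events: list of T 1-D arrays of observed events
    """
    T = len(events)
    kde, var = np.empty(T), np.empty(T)
    for t in range(T):
        k = np.exp(-0.5 * ((events[t] - query[t]) / bandwidth) ** 2)
        k /= bandwidth * np.sqrt(2.0 * np.pi)
        kde[t], var[t] = k.mean(), k.var() / len(k)  # KDE value, variance of its mean
    # shrink the per-task KDE values toward each other with the MTA closed form
    S = A + A.T
    L = np.diag(S.sum(axis=1)) - S
    W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(var) @ L)
    return W @ kde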
MT-KDE for Terrorism Risk Assessment

Problem: estimate the probability of terrorist events at 40,000 locations in Jerusalem, each location $z, x_i \in \mathbb{R}^{74}$, for T = 7 terrorist groups.

Task similarity matrix A from terrorism expert Mohammed Hafez [matrix not captured in the transcript].

Mean reciprocal rank of a left-out event:

                 Suicides (T = 17)   Bombings (T = 11)
  Single task         .145               .1096
  James-Stein         .145               .1096
  MTA constant        .1897              .1096
  MTA minimax         .1897              .1096
  Expert sim          .1292              .0089
MTA is an intuitive, simple, accurate approach to estimating multiple means jointly.

Open questions: When can you estimate multiple means at once? Can you estimate the task similarities better?

Learn more: see our 2012 NIPS paper, or email me for the journal paper ([email protected]).
Last Slide

MTA: $\hat{\mu} = W\bar{y}$, with right-stochastic $W = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}$, diagonal $\Sigma$ with $\Sigma_{tt} \ge 0$, $A_{rs} \ge 0$, $\gamma \ge 0$.

Compare the general shrinkage form $\hat{\mu}_t = \lambda\bar{y}_t + (1 - \lambda)\sum_{r=1}^{T} \alpha_r \bar{y}_r$, with $0 < \lambda \le 1$, $\sum_{r=1}^{T} \alpha_r = 1$, $\alpha_r \ge 0$ for all r (James-Stein, and more).
Bayesian Analysis: IGMRFs

Recall that

$\frac{1}{2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (y_r - y_s)^2 = \frac{1}{2} y^T L_S y$

where $L_S$ is the graph Laplacian ($L = D - A$) of the symmetrized similarities.

The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05):

$p(y) = (2\pi)^{-T/2} |L_S|^{1/2} \exp\left(-\frac{1}{2} y^T L_S y\right)$

usually used for graphical models when $L_S$ is sparse.
Bayesian Analysis

Assuming the differences are independent:

$p(y) \propto \prod_{r=1}^{T} \prod_{s=1}^{T} e^{-\gamma A_{rs}(y_r - y_s)^2}$

For T = 3 this says

$Y_1 - Y_2 \sim N(0, 1/(2\gamma A_{12})), \quad Y_2 - Y_3 \sim N(0, 1/(2\gamma A_{23})), \quad Y_1 - Y_3 \sim N(0, 1/(2\gamma A_{13}))$

But the differences are linearly dependent, so (adding variances of independent terms) each decomposition imposes a constraint:

$Y_1 - Y_3 = (Y_1 - Y_2) + (Y_2 - Y_3) \;\Rightarrow\; \frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$
$Y_1 - Y_2 = (Y_1 - Y_3) + (Y_3 - Y_2) \;\Rightarrow\; \frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$
$Y_2 - Y_3 = (Y_2 - Y_1) + (Y_1 - Y_3) \;\Rightarrow\; \frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$

Impossible to satisfy all three right-hand sides with any finite A!
Related Multi-Task Regularizers

- $\sum_{r=1}^{T} \|\beta_r - \frac{1}{T}\sum_{s=1}^{T}\beta_s\|_2^2$ : distance to the mean (Evgeniou and Pontil, 2004)
- $\|\beta\|_*$ : trace norm (Abernethy et al., 2009)
- $\mathrm{tr}(\beta^T D^{-1} \beta)$ : learned, shared feature covariance matrix (Argyriou et al., 2008)
- $\mathrm{tr}(\beta \Sigma^{-1} \beta^T)$ : learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010)
- $\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}\|\beta_r - \beta_s\|_2^2$ : pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007)
[Backup results figures, lower is better: Gaussian simulation, T = 2; uniform simulation, T = 2; pairwise T = 5 results; oracle T = 5 results.]
Stein's Unbiased Risk Estimate

- The true $A^*_{12}$ depends on the unknown $\mu_t$. We plugged in $\bar{y}_t$ to get $\hat{A}^*_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.
- Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy Q such that E[Q] = risk.
- Result:

$A^{SURE}_{12} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2}}\right)_+$
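In code, the positive-part formula above is direct (a hypothetical helper with assumed inputs):

def sure_similarity(ybar1, ybar2, s2_1, n1, s2_2, n2):
    """Positive part of 2 / ((ybar1 - ybar2)^2 - s2_1/n1 - s2_2/n2)."""
    denom = (ybar1 - ybar2) ** 2 - s2_1 / n1 - s2_2 / n2
    return 2.0 / denom if denom > 0 else 0.0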
[Results figure: SURE T = 2 experiments. Lower is better.]
Alternative Formulation

MTA is $\left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{Y}$, with optimal similarity

$A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

An MTA variant is $\Sigma^{1/2}\left(I + \frac{\gamma}{T}L\right)^{-1}\Sigma^{-1/2}\bar{Y}$, with optimal similarity

$A^*_{12} = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$

These are different notions of distance! What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$? Then $\mu_1/\sigma_1 = \mu_2/\sigma_2 = 2$, so the variant's optimal similarity blows up and it treats the tasks as identical, even though the means differ.

[Results figure: alternative formulation, T = 2. Lower is better.]
MTA Closed-Form Solution (backup)

The MTA weight matrix $W = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}$ is a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).