Transcript: Multi-Task Averaging: Theory and Practice
Maya R. Gupta, Google Research, Univ. Washington
with Sergey Feldman (Univ. Washington) and Bela Frigyik (Univ. Pecs)
Aristotle

The idea of a mean is old:

"By the mean of a thing I denote a point equally distant from either extreme..." - Aristotle

$v = \frac{y_{\min} + y_{\max}}{2}, \qquad v - y_{\min} = y_{\max} - v$
Tycho Brahe (16th century)

Averaged to reduce measurement error:

$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
Legendre (1805)

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$
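A quick check of Legendre's observation (standard calculus, not on the original slide): setting the derivative of the squared error to zero recovers the sample mean.

$\frac{d}{d\mu} \sum_{i=1}^{N} (y_i - \mu)^2 = -2 \sum_{i=1}^{N} (y_i - \mu) = 0 \;\Longrightarrow\; \mu = \frac{1}{N} \sum_{i=1}^{N} y_i = \bar{y}$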
The result generalizes:

Banerjee et al. 2005: the mean minimizes any Bregman divergence:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} \psi(y_i, \mu)$

Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.
Gauss (1809)

The average was central to Gauss's construction of the normal distribution. His goals:
- a smooth distribution
- whose likelihood peak was at the sample mean.
Fisher 1922

"...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." - R. A. Fisher
Stein's Paradox 1956

Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means.

Stein Estimation: One Sample Case
Problem: estimate the means $\{\mu_t\}$ of T Gaussian random variables.
Given: a random sample $Y_t \sim N(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.
Maximum Likelihood Estimate:

$\hat{\mu}_t = Y_t$

James-Stein Estimate:

$\hat{\mu}^{JS}_t = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
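A minimal NumPy sketch (not from the slides) comparing the two estimates above; T, sigma, and the simulated means are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
T, sigma = 50, 1.0                  # illustrative values
mu = rng.normal(0.0, 1.0, size=T)   # true (hidden) means
Y = rng.normal(mu, sigma)           # one sample per task

mle = Y                             # maximum likelihood estimate
shrink = 1.0 - (T - 2) * sigma**2 / np.sum(Y**2)
js = shrink * Y                     # basic James-Stein estimate

print("MLE total squared error:", np.sum((mle - mu) ** 2))
print("JS  total squared error:", np.sum((js - mu) ** 2))

On most draws the James-Stein total squared error comes out smaller, which is Stein's paradox in action.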
James-Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
- $\mu_t \sim N(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim N(\mu_t, \sigma^2)$, with $\sigma^2$ known

Under these assumptions the posterior mean is

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t,$

and the unknown shrinkage factor can be estimated from the data because

$E\left[\frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}.$

Substituting this estimate into the posterior mean yields the James-Stein estimate:

$\hat{\mu}^{JS}_t = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
A More General JSE (Bock, 1972)

Key assumptions and notation:
- $\mu_t \sim N(\xi, \tau^2)$, with $\tau^2$ and $\xi$ unknown
- $Y_{ti} \sim N(\mu_t, \sigma_t^2)$ for $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown
- $\xi = \frac{1}{T} \sum_{r=1}^{T} \bar{Y}_r$
- positive part: $(x)_+ = \max(x, 0)$
- diagonal $\Sigma$ with $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$
- $\bar{Y}$ is a T-length vector with tth entry $\bar{Y}_t$

General James-Stein Estimate:

$\hat{\mu}^{JS}_t = \xi + \left(1 - \frac{T - 3}{(\bar{Y} - \xi)^T \Sigma^{-1} (\bar{Y} - \xi)}\right)_+ \left(\bar{Y}_t - \xi\right)$
James-Stein Dominates

James's and Stein's theorem (1961): for T > 3, the general JSE dominates the sample average (MLE):

$E[\|\mu - \hat{\mu}^{JS}\|_2^2] \le E[\|\mu - \bar{Y}\|_2^2]$

for every choice of $\mu$. This is also written as $R(\hat{\mu}^{JS}) \le R(\bar{Y})$.
James-Stein Estimation ↔ Empirical Bayes

Multi-Task Averaging (MTA) ↔ Empirical Loss Minimization with Regularization (empirical Vapnik; Tikhonov regularization)
Multi-Task Averaging (Feldman et al. 2012)

Problem: estimate the means $\{\mu_t\}$ of T random variables.
Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable.
Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$.
Building the MTA Objective

"Single-task" averaging:

$\bar{y}_t = \arg\min_{\mu_t} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Adding across tasks:

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Normalizing each task's squared error by $\sigma_t^2$ (a Mahalanobis distance):

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2}$
The MTA Objective

"Multi-task" averaging couples the tasks with a pairwise regularizer:

$\{\hat{\mu}^{MTA}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\mu_r - \mu_s)^2$

The first term is the Mahalanobis distance to the samples from task t; $A_{rs}$ is the similarity between task r and task s.

Example tasks: Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp. Or a seemingly unrelated Task 2: estimate the price of tea in China?

[Figures: for two tasks, the MTA estimates $\hat{\mu}^{MTA}_1$ and $\hat{\mu}^{MTA}_2$ are pulled toward each other relative to the sample averages $\bar{y}_1$ and $\bar{y}_2$.]

The empirical loss term lowers bias; the regularizer lowers estimation variance.
MTA Closed-Form Solution

For non-negative A:

$\hat{\mu}^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$

where $\hat{\mu}^{MTA}$ is the vector of T MTA solutions, $\bar{y}$ is the vector of T sample averages, L is the graph Laplacian of $A + A^T$, and $\Sigma$ is the diagonal matrix of sample-mean variances, $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$.
Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.

Writing $W = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1}$, the MTA estimates $\hat{\mu}^{MTA} = W \bar{y}$ are a linear combination of the T sample averages.

Theorem: W is right-stochastic, so the MTA estimates are in fact a convex combination of the sample averages.
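A minimal NumPy sketch of the closed form above (a sketch under the slide's definitions, not reference code); the inputs ybar, sigma2, N, A, and gamma are assumed given.

import numpy as np

def mta(ybar, sigma2, N, A, gamma=1.0):
    """Closed-form MTA estimate: (I + (gamma/T) Sigma L)^{-1} ybar."""
    T = len(ybar)
    Sigma = np.diag(sigma2 / N)       # sample-mean variances sigma_t^2 / N_t
    S = A + A.T                       # symmetrize the similarity matrix
    L = np.diag(S.sum(axis=1)) - S    # graph Laplacian of A + A^T
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ ybar                   # W is right-stochastic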
When is MTA Better than the Sample Means?

Consider two tasks, $Y_{1i} = \mu_1 + \epsilon_1$ ($N_1$ samples, variance $\sigma_1^2$) and $Y_{2i} = \mu_2 + \epsilon_2$ ($N_2$ samples, variance $\sigma_2^2$); say, Task 1: estimate the average movie ticket price, and Task 2: estimate the mean age of kids at summer camp.

The MTA estimate $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$ works out to

$\hat{\mu}^{MTA}_1 = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_2.$
The MTA estimate is biased, but has smaller error variance than the sample averages. Specifically,

$\mathrm{Risk}[\hat{\mu}^{MTA}_1] < \mathrm{Risk}[\bar{Y}_1] \quad \text{if} \quad (\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2},$

or, rearranged,

$\mathrm{Risk}[\hat{\mu}^{MTA}] < \mathrm{Risk}[\bar{Y}] \quad \text{if} \quad (\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}.$
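A quick Monte-Carlo check of this condition (an illustrative sketch, not from the slides; all parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])               # hypothetical true means
sigma2 = np.array([1.0, 1.0])           # task variances
N = np.array([10, 10])                  # samples per task
A12, gamma, T = 1.0, 1.0, 2

# Closed form W = (I + (gamma/T) Sigma L)^{-1}, L = Laplacian of A + A^T
S = np.array([[0.0, A12], [A12, 0.0]])  # A + A^T, with only A_12 set in A
L = np.diag(S.sum(axis=1)) - S
W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(sigma2 / N) @ L)

err_single = err_mta = 0.0
for _ in range(20_000):
    ybar = rng.normal(mu, np.sqrt(sigma2 / N))  # simulate sample averages
    est = W @ ybar
    err_single += (ybar[0] - mu[0]) ** 2
    err_mta += (est[0] - mu[0]) ** 2

# Here (mu1 - mu2)^2 = 1 < 4/A12 + sigma1^2/N1 + sigma2^2/N2 = 4.2,
# so task 1's MTA risk should come out lower:
print(err_mta / err_single)             # expect a ratio below 1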
Optimal A for T = 2

Example: two tasks with true means $\mu_1$ and $\mu_2$. The optimal task similarity in terms of MSE is

$A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

Estimated optimal A for T = 2: since the true means are unknown, plug in the sample averages:

$\hat{A}^*_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$

[Figure: risk as a function of $A_{12}$ for $\sigma_1^2 = \sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$, with the optimal similarity $A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$ marked.]
Optimal A for T > 2

- Bad news: no simple analytical minimization of the risk for T > 2.
- One solution: use the pairwise estimate to populate A: $\hat{A}^*_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$.
- Better solution: constrain $A = a\mathbf{1}\mathbf{1}^T$ and optimize over the scalar a.
- Analyzable: the optimal similarity is 2 divided by the average squared distance between the T means:

$a^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\mu_r - \mu_s)^2}$

- In practice, we estimate (as sketched in the code below):

$\hat{a}^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\bar{y}_r - \bar{y}_s)^2}$
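A sketch of this constant-similarity estimator (assumed setup, not from the slides): estimate a from the sample averages, then apply the closed form.

import numpy as np

def mta_constant(ybar, sigma2, N, gamma=1.0):
    """MTA with A = a 1 1^T, a = 2 / (average squared distance)."""
    T = len(ybar)
    diffs = (ybar[:, None] - ybar[None, :]) ** 2
    # diagonal terms are zero; assumes the ybar are not all identical
    a_hat = 2.0 / (diffs.sum() / (T * (T - 1)))
    A = a_hat * np.ones((T, T))
    L = np.diag(A.sum(axis=1)) - A               # Laplacian (A is symmetric)
    W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(sigma2 / N) @ L)
    return W @ ybar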
How to Set the Similarity Matrix A?

- Choose A to minimize expected total squared error.
- Choose A to minimize worst-case total squared error.
Minimax A for T = 2

To find the minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.
2. A least favorable prior (LFP):
$p(\mu_1, \mu_2) = \begin{cases} \frac{1}{2} & \text{if } (\mu_1, \mu_2) = (b_l, b_u) \\ \frac{1}{2} & \text{if } (\mu_1, \mu_2) = (b_u, b_l) \\ 0 & \text{otherwise} \end{cases}$
3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A^{mm}_{12} = \frac{2}{(b_u - b_l)^2}$.

In practice, set $b_l = \min_t \bar{y}_t$ and $b_u = \max_t \bar{y}_t$.

Minimax A for T > 2: the same construction gives MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$.
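In code, the minimax choice is a one-line change from the constant-similarity sketch above (again an assumed setup):

import numpy as np

def minimax_similarity(ybar):
    """Constant minimax similarity a_mm = 2 / (b_u - b_l)^2, with the
    range [b_l, b_u] estimated from the sample averages."""
    b_l, b_u = float(np.min(ybar)), float(np.max(ybar))
    return 2.0 / (b_u - b_l) ** 2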
Estimator Summary

[Summary table of the estimators not captured in the transcript.]
Simulations

Gaussian simulations:
  $\mu_t \sim N(0, \sigma_\mu^2)$
  $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$
  $N_t \sim U\{2, \ldots, 100\}$
  $y_{ti} \sim N(\mu_t, \sigma_t^2)$

Uniform simulations:
  $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$
  $\sigma_t^2 \sim U(0.1, 2.0)$
  $N_t \sim U\{2, \ldots, 100\}$
  $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$
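A sketch of the Gaussian simulation setup above (the value of $\sigma_\mu^2$ is an arbitrary illustrative choice here):

import numpy as np

rng = np.random.default_rng(2)
T, sigma_mu2 = 25, 1.0                       # sigma_mu^2 is illustrative

mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)
sigma2 = rng.gamma(shape=0.9, scale=1.0, size=T) + 0.1
N = rng.integers(2, 101, size=T)             # N_t in {2, ..., 100}

# per-task samples and their averages
ybar = np.array([rng.normal(mu[t], np.sqrt(sigma2[t]), N[t]).mean()
                 for t in range(T)])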
5-Fold Randomized Cross-Validation

[Slide details not captured in the transcript.]

[Results figures, lower is better: Gaussian simulation for T = 5, T = 25, and T = 500; uniform simulation for T = 5, T = 25, and T = 500.]
Scales O(T)

Constant and minimax MTA weight matrices can be rewritten using the Sherman-Morrison formula:

$W = (I + \Sigma L(a\mathbf{1}\mathbf{1}^T))^{-1}$
$\;\;= (I + \Sigma(aTI - a\mathbf{1}\mathbf{1}^T))^{-1}$
$\;\;= (I + aT\Sigma - a\Sigma\mathbf{1}\mathbf{1}^T)^{-1}$
$\;\;= (Z - z\mathbf{1}^T)^{-1}$
$\;\;= Z^{-1} + \frac{Z^{-1} z \mathbf{1}^T Z^{-1}}{1 - \mathbf{1}^T Z^{-1} z}$

where $Z = I + aT\Sigma$ and $z = a\Sigma\mathbf{1}$. Z is diagonal, so $Z^{-1}$, $Z^{-1}z$, and $\mathbf{1}^T Z^{-1}$ can all be computed in O(T), and hence $W\bar{Y}$ is O(T).
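A sketch of that O(T) computation (assumed setup; the factor gamma/T is taken to be absorbed into a, as in the derivation above):

import numpy as np

def mta_constant_fast(ybar, sigma, a):
    """O(T) MTA estimate for A = a 1 1^T via Sherman-Morrison.
    sigma holds the diagonal of Sigma, i.e. sigma_t^2 / N_t."""
    T = len(ybar)
    Zdiag = 1.0 + a * T * sigma      # Z = I + a T Sigma (diagonal)
    z = a * sigma                    # z = a Sigma 1
    Zinv_y = ybar / Zdiag
    Zinv_z = z / Zdiag
    # W ybar = Z^{-1} ybar + Z^{-1} z (1^T Z^{-1} ybar) / (1 - 1^T Z^{-1} z)
    return Zinv_y + Zinv_z * Zinv_y.sum() / (1.0 - Zinv_z.sum())

For small T this agrees with forming W by a dense matrix inverse, but it needs only O(T) time and memory.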
Application: Class Grades

Problem: estimate the final grades $\{\mu_t\}$ of T students.
Given: N homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

- 16 classrooms → 16 datasets.
- Uncurved grades, normalized to be between 0 and 100.
- Pooled variance used for all tasks in a dataset.
- Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.

[Results figures: percent change in risk vs. single-task. Lower is better.]
Application: Product Sales

Exp 1: How much will the tth customer spend on their next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the tth customer spent on $N_t$ orders.
- T = 477
- $y_{ti}$ ranged from $15 to $480.
- $N_t$ ranged from 2 to 17.

Exp 2: If you bought the tth puzzle, how much will you spend on your next order?
Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the tth puzzle.
- T = 77
- $y_{ti}$ ranged from $0 to $480.
- $N_t$ ranged from 8 to 348.

No ground truth → use the sample means of all the data as the $\mu_t$, and use a random half of the data to get the $\bar{y}_t$.

[Results figures: percent change in risk vs. single-task, averaged over 1000 random splits. Lower is better.]
Model Mismatch: 2008 Election

Problem: what percent of the tth state's vote will go to Obama and to McCain on election day?
Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

[Results figure: percent change in average risk vs. single-task. Lower is better.]
MTA Applied to Kernel Density Estimation

KDE: given that events $\{x_i\}$ happened, estimate the probability of event z as

$p(z) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, z)$

Equivalently,

$p(z) = \arg\min_{y(z)} \sum_{i=1}^{N} (K(x_i, z) - y(z))^2$

Use MTA to form a multi-task KDE:

$\arg\min_{\{y_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (K_t(x_{ti}, z_t) - y_t(z_t))^2 + \gamma \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (y_r(z_r) - y_s(z_s))^2$
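One way this could look in code (a speculative sketch, not the authors' implementation: a Gaussian kernel is assumed, and the per-task kernel-value variances stand in for $\Sigma$ in the MTA closed form):

import numpy as np

def mt_kde(query, events, A, gamma=1.0, bandwidth=1.0):
    """Multi-task KDE at one query point z_t per task.
    query:  length-T array of query points
    events: list of T 1-D arrays of observed events
    """
    T = len(events)
    kde, var = np.empty(T), np.empty(T)
    for t in range(T):
        k = np.exp(-0.5 * ((events[t] - query[t]) / bandwidth) ** 2)
        k /= bandwidth * np.sqrt(2.0 * np.pi)
        kde[t], var[t] = k.mean(), k.var() / len(k)  # KDE value, variance of its mean
    # shrink the per-task KDE values toward each other with the MTA closed form
    S = A + A.T
    L = np.diag(S.sum(axis=1)) - S
    W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(var) @ L)
    return W @ kde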
MT-KDE for Terrorism Risk Assessment

Problem: estimate the probability of terrorist events at 40,000 locations in Jerusalem, each location $z, x_i \in \mathbb{R}^{74}$, for T = 7 terrorist groups.

Task similarity matrix A from terrorism expert Mohammed Hafez [matrix not captured in the transcript].

Mean reciprocal rank of a left-out event:

                 Suicides (T = 17)   Bombings (T = 11)
  Single task         .145               .1096
  James-Stein         .145               .1096
  MTA constant        .1897              .1096
  MTA minimax         .1897              .1096
  Expert sim          .1292              .0089
MTA is an intuitive, simple, accurate approach to estimating multiple means jointly.

Open questions: When can you estimate multiple means at once? Can you estimate the task similarities better?

Learn more: see our 2012 NIPS paper, or email me for the journal paper ([email protected]).
Last Slide

MTA: $\hat{\mu} = W\bar{y}$, with right-stochastic $W = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}$, diagonal $\Sigma$ with $\Sigma_{tt} \ge 0$, $A_{rs} \ge 0$, $\gamma \ge 0$.

Compare the general shrinkage form $\hat{\mu}_t = \lambda\bar{y}_t + (1 - \lambda)\sum_{r=1}^{T} \alpha_r \bar{y}_r$, with $0 < \lambda \le 1$, $\sum_{r=1}^{T} \alpha_r = 1$, $\alpha_r \ge 0$ for all r (James-Stein, and more).
Bayesian Analysis: IGMRFs

Recall that

$\frac{1}{2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (y_r - y_s)^2 = \frac{1}{2} y^T L_S y$

where $L_S$ is the graph Laplacian ($L = D - A$) of the symmetrized similarities.

The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05):

$p(y) = (2\pi)^{-T/2} |L_S|^{1/2} \exp\left(-\frac{1}{2} y^T L_S y\right)$

usually used for graphical models when $L_S$ is sparse.
Bayesian Analysis

Assuming the differences are independent:

$p(y) \propto \prod_{r=1}^{T} \prod_{s=1}^{T} e^{-\gamma A_{rs}(y_r - y_s)^2}$

For T = 3 this says

$Y_1 - Y_2 \sim N(0, 1/(2\gamma A_{12})), \quad Y_2 - Y_3 \sim N(0, 1/(2\gamma A_{23})), \quad Y_1 - Y_3 \sim N(0, 1/(2\gamma A_{13}))$

But the differences are linearly dependent, so (adding variances of independent terms) each decomposition imposes a constraint:

$Y_1 - Y_3 = (Y_1 - Y_2) + (Y_2 - Y_3) \;\Rightarrow\; \frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$
$Y_1 - Y_2 = (Y_1 - Y_3) + (Y_3 - Y_2) \;\Rightarrow\; \frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$
$Y_2 - Y_3 = (Y_2 - Y_1) + (Y_1 - Y_3) \;\Rightarrow\; \frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$

Impossible to satisfy all three right-hand sides with any finite A!
Related Multi-Task Regularizers

- $\sum_{r=1}^{T} \|\beta_r - \frac{1}{T}\sum_{s=1}^{T}\beta_s\|_2^2$ : distance to the mean (Evgeniou and Pontil, 2004)
- $\|\beta\|_*$ : trace norm (Abernethy et al., 2009)
- $\mathrm{tr}(\beta^T D^{-1} \beta)$ : learned, shared feature covariance matrix (Argyriou et al., 2008)
- $\mathrm{tr}(\beta \Sigma^{-1} \beta^T)$ : learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010)
- $\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}\|\beta_r - \beta_s\|_2^2$ : pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007)
[Backup results figures, lower is better: Gaussian simulation, T = 2; uniform simulation, T = 2; pairwise T = 5 results; oracle T = 5 results.]
Stein's Unbiased Risk Estimate

- The true $A^*_{12}$ depends on the unknown $\mu_t$. We plugged in $\bar{y}_t$ to get $\hat{A}^*_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$.
- Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy Q such that E[Q] = risk.
- Result:

$A^{SURE}_{12} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2}}\right)_+$
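In code, the positive-part formula above is direct (a hypothetical helper with assumed inputs):

def sure_similarity(ybar1, ybar2, s2_1, n1, s2_2, n2):
    """Positive part of 2 / ((ybar1 - ybar2)^2 - s2_1/n1 - s2_2/n2)."""
    denom = (ybar1 - ybar2) ** 2 - s2_1 / n1 - s2_2 / n2
    return 2.0 / denom if denom > 0 else 0.0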
[Results figure: SURE T = 2 experiments. Lower is better.]
Alternative Formulation

MTA is $\left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{Y}$, with optimal similarity

$A^*_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

An MTA variant is $\Sigma^{1/2}\left(I + \frac{\gamma}{T}L\right)^{-1}\Sigma^{-1/2}\bar{Y}$, with optimal similarity

$A^*_{12} = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$

These are different notions of distance! What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$? Then $\mu_1/\sigma_1 = \mu_2/\sigma_2 = 2$, so the variant's optimal similarity blows up and it treats the tasks as identical, even though the means differ.

[Results figure: alternative formulation, T = 2. Lower is better.]
MTA Closed-Form Solution (backup)

The MTA weight matrix $W = \left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}$ is a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).