Large-Scale Optimization
Sanjiv Kumar, Google Research, NY
EECS-6898, Columbia University - Fall, 2010
Learning and Optimization

In most cases, learning from data reduces to optimizing a function with respect to model parameters.

Given training data D = {x_i, y_i}, i = 1,...,n (e.g., x \in R^d, y \in R or y \in {-1,+1}), we want to learn a model for prediction: y = f(x; w).

  \hat{w} = \arg\min_w J(w) = \arg\min_w [ L(D, w) + \lambda R(w) ]
                                            loss        regularizer

The objective may be:
• Convex vs non-convex
• Constrained vs unconstrained
• Smooth vs non-differentiable

[Figure: plots of J(w) versus w illustrating a convex, twice-differentiable function with a global minimum; a non-convex function with local minima; a constrained problem; and a non-differentiable function.]
Examples

Linear regression: y = w^T x + w_0,   x \in R^d, y \in R
  J(w) = \sum_i (w^T x_i - y_i)^2 + \lambda w^T w
  convex, smooth, unconstrained

Linear SVM: y = \mathrm{sgn}(w^T x + w_0),   x \in R^d, y \in {-1,+1}
  J(w) = \sum_i \max(0, 1 - y_i w^T x_i) + \lambda w^T w
  convex, non-differentiable, unconstrained

Logistic regression (probabilistic): p(y = 1 | x) = \sigma(w^T x + w_0), with \sigma(x) = 1/(1 + e^{-x}),   x \in R^d, y \in {-1,+1}
  J(w) = -\sum_i \log \sigma(y_i w^T x_i) + \lambda w^T w
  convex, smooth, unconstrained

Absorb w_0 into w by adding a dummy feature x_0 = 1 to x.

Nonlinear (kernelized) versions: replace w^T x with \sum_i w_i k(x, x_i)
• The 2-norm regularizer changes to w^T K w, where K is the kernel matrix
• Same properties of the functions as their linear versions

[Figure: a 1-D regression fit of y on x, and 2-D classification examples in the (x_1, x_2) plane.]
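To make the three objectives concrete, here is a minimal NumPy sketch (not from the slides) that evaluates each one; X is an n x d data matrix with a dummy all-ones column absorbing w_0, y holds the targets or labels, and lam is the regularization weight lambda. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linreg_objective(w, X, y, lam):
    # J(w) = sum_i (w^T x_i - y_i)^2 + lam * w^T w
    r = X @ w - y
    return r @ r + lam * w @ w

def svm_objective(w, X, y, lam):
    # J(w) = sum_i max(0, 1 - y_i w^T x_i) + lam * w^T w   (hinge loss)
    margins = y * (X @ w)
    return np.maximum(0.0, 1.0 - margins).sum() + lam * w @ w

def logreg_objective(w, X, y, lam):
    # J(w) = -sum_i log sigma(y_i w^T x_i) + lam * w^T w
    margins = y * (X @ w)
    return -np.log(sigmoid(margins)).sum() + lam * w @ w
```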
Popular Optimization Methods

First Order Methods
– Use the gradient of the function to iteratively find a local minimum
– Easy to compute and run, but slow convergence
– Gradient (steepest) descent
– Coordinate descent
– Conjugate Gradient
– Stochastic Gradient Descent

Second Order Methods
– Use gradients and the Hessian (curvature) iteratively
– Computationally more demanding, but fast convergence
– Newton's method
– Quasi-Newton methods (BFGS and variants)
– Stochastic Quasi-Newton methods

Other than line-search based methods: Trust-region
Conditions for Local Minimum

If \hat{w} is a local minimizer of J(w), then:
– First-order necessary condition: the gradient vanishes,
    \nabla J(\hat{w}) = \partial J(w) / \partial w |_{\hat{w}} = 0   (stationary point)
– Second-order necessary condition: the Hessian
    \nabla^2 J(\hat{w}) = \partial^2 J(w) / \partial w \partial w^T |_{\hat{w}}   is positive semi-definite (PSD)
– Sufficient conditions: \nabla J(\hat{w}) = 0 and \nabla^2 J(\hat{w}) is positive definite (strictly PD)

For a smooth convex function, any stationary point is a global minimizer.

[Figure: J(w) versus w, marking a local minimum, a local maximum, and a saddle point; all are stationary points.]
Gradient Descent

Iteratively estimates new parameter values:
  w_{t+1} = w_t - \eta_t \nabla J(w_t)

We want J(w_{t+1}) \le J(w_t), but this alone is not sufficient to reach a local minimum; one also has to make sure \nabla J(w_t) \to 0.

The step-length \eta_t along the search direction p_t must satisfy the Wolfe condition of sufficient decrease:
  J(w_t + \eta_t p_t) \le J(w_t) + c_1 \eta_t \nabla J(w_t)^T p_t,   0 < c_1 < 1

Gradient descent converges to a local minimum if the step size is sufficiently small.

Rate of convergence: linear,
  ||w_{t+1} - \hat{w}|| / ||w_t - \hat{w}|| \le r

Painfully slow in practice; many heuristics are used (e.g., momentum).

[Figure: one gradient step from w_t to w_{t+1} on J(w).]
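A minimal NumPy sketch of gradient descent with a backtracking line search enforcing the sufficient-decrease condition above; the least-squares test problem, c1, and the shrink factor are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(J, grad_J, w0, c1=1e-4, shrink=0.5, eta0=1.0,
                     tol=1e-6, max_iter=1000):
    """Steepest descent; backtrack until the sufficient-decrease condition holds."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:          # gradient -> 0: stationary point
            break
        p = -g                               # search direction (steepest descent)
        eta = eta0
        # Backtrack until J(w + eta p) <= J(w) + c1 * eta * grad^T p
        while J(w + eta * p) > J(w) + c1 * eta * g @ p:
            eta *= shrink
        w = w + eta * p
    return w

# Example: least-squares objective J(w) = ||Xw - y||^2
X = np.random.randn(50, 5)
y = np.random.randn(50)
J = lambda w: np.sum((X @ w - y) ** 2)
grad_J = lambda w: 2 * X.T @ (X @ w - y)
w_hat = gradient_descent(J, grad_J, np.zeros(5))
```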
Coordinate Descent

Updates one coordinate at a time, keeping the others fixed:
  w^i_{t+1} = w^i_t - \eta_t \nabla_i J(w_t)

• Cycles through all the coordinates sequentially

Properties
• Very simple and easy to implement; reasonable when variables are loosely coupled
• Can be inefficient in practice (may take a long time to converge)
• Convergence not guaranteed, in general

Block Coordinate Descent
• Update a small set of coordinates simultaneously
• Has been used successfully for training large SVMs

[Figure: zig-zag coordinate-wise descent path in the (w_1, w_2) plane.]
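A minimal sketch of cyclic coordinate descent for a smooth objective; the per-coordinate step size and the least-squares test problem are illustrative assumptions.

```python
import numpy as np

def coordinate_descent(grad_J, w0, eta, n_epochs=100):
    """Cyclic coordinate descent: update one coordinate at a time, others fixed."""
    w = w0.astype(float)
    d = w.size
    for _ in range(n_epochs):
        for i in range(d):                    # cycle through coordinates sequentially
            g = grad_J(w)
            w[i] -= eta * g[i]                # move only along coordinate i
    return w

# Example: J(w) = ||Xw - y||^2; step bounded by the largest diagonal Hessian entry
X = np.random.randn(50, 5)
y = np.random.randn(50)
grad_J = lambda w: 2 * X.T @ (X @ w - y)
eta = 1.0 / (2 * (X ** 2).sum(axis=0).max())
w_hat = coordinate_descent(grad_J, np.zeros(5), eta)
```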
Conjugate Gradient

Parameters are searched along conjugate directions:
  w_{t+1} = w_t + \alpha_t p_t,   starting with p_0 = -\nabla J(w_0)

Conjugacy: two vectors p_1 and p_2 are conjugate with respect to H if p_1^T H p_2 = 0.

Successive conjugate directions are formed as a linear combination of the gradient direction and the previous conjugate direction:
  p_t = -\nabla J(w_t) + \beta_t p_{t-1},   t = 1, 2, ...

  \alpha_t = -\nabla J(w_t)^T p_t / (p_t^T H_t p_t)   (exact step; in practice, a step satisfying the Wolfe conditions is usually picked)

  \beta_t = \nabla J(w_t)^T (\nabla J(w_t) - \nabla J(w_{t-1})) / (p_{t-1}^T (\nabla J(w_t) - \nabla J(w_{t-1})))   (Hestenes-Stiefel form)

Properties
• For quadratic functions, guaranteed to return the optimum in at most d iterations
• Use preconditioning to increase the convergence rate: make the condition number of the Hessian as small as possible

[Figure: descent paths on a quadratic; green: gradient descent, red: conjugate gradient.]
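A minimal sketch of nonlinear conjugate gradient with the Hestenes-Stiefel beta and a simple backtracking line search standing in for a full Wolfe-condition search; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def conjugate_gradient(J, grad_J, w0, max_iter=200, tol=1e-8):
    """Nonlinear CG: p_t = -grad + beta_t * p_{t-1}, Hestenes-Stiefel beta."""
    w = w0.astype(float)
    g = grad_J(w)
    p = -g                                     # p_0 = -grad J(w_0)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        eta = 1.0                              # backtracking stand-in for a Wolfe search
        while J(w + eta * p) > J(w) + 1e-4 * eta * g @ p:
            eta *= 0.5
        w_new = w + eta * p
        g_new = grad_J(w_new)
        dg = g_new - g                         # change in gradient
        beta = (g_new @ dg) / (p @ dg)         # Hestenes-Stiefel form
        p = -g_new + beta * p
        w, g = w_new, g_new
    return w

# Example: quadratic J(w) = 0.5 w^T A w - b^T w
# (with exact line searches, CG would finish in at most d steps)
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
J = lambda w: 0.5 * w @ A @ w - b @ w
grad_J = lambda w: A @ w - b
w_hat = conjugate_gradient(J, grad_J, np.zeros(3))
```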
Newton's Method

Approximates the function using a second-order Taylor (quadratic) expansion:
  J(w_t + p) \approx J(w_t) + \nabla J(w_t)^T p + (1/2) p^T \nabla^2 J(w_t) p

Minimizing with respect to p gives the Newton step:
  p = -(\nabla^2 J(w_t))^{-1} \nabla J(w_t),   i.e.,   w_{t+1} = w_t - H_t^{-1} \nabla J(w_t)     (H_t: Hessian)

– No explicit step-length parameter
– If the Hessian is not positive definite, the Newton step may be undefined or may even diverge
– For quadratic functions, converges in a single step!

Rate of convergence: quadratic,
  ||w_{t+1} - \hat{w}|| / ||w_t - \hat{w}||^2 \le k

Fast convergence, but each iteration is slow: O(d^2) space and O(d^3) time.

[Figure: quadratic approximation of J(w) around w_t and the Newton step to w_{t+1}.]
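A minimal sketch of the Newton iteration; solving a linear system instead of forming an explicit inverse, and the ridge-regularized least-squares test problem, are illustrative choices.

```python
import numpy as np

def newton_method(grad_J, hess_J, w0, max_iter=50, tol=1e-10):
    """Newton's method: w_{t+1} = w_t - H_t^{-1} grad J(w_t)."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:
            break
        H = hess_J(w)                      # d x d Hessian: O(d^2) space
        p = np.linalg.solve(H, -g)         # Newton step: O(d^3) time; assumes H is PD
        w = w + p
    return w

# Example: ridge-regularized least squares (quadratic, so one Newton step suffices)
X = np.random.randn(50, 5)
y = np.random.randn(50)
lam = 0.1
grad_J = lambda w: 2 * X.T @ (X @ w - y) + 2 * lam * w
hess_J = lambda w: 2 * X.T @ X + 2 * lam * np.eye(5)
w_hat = newton_method(grad_J, hess_J, np.zeros(5))
```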
Quasi-Newton Methods

Do not require computation of the Hessian, but still achieve superlinear convergence.

Key idea: approximate the Hessian locally using the change in gradients.

Secant equation:
  B_{t+1} s_t = y_t,   where s_t = w_{t+1} - w_t,  y_t = \nabla J(w_{t+1}) - \nabla J(w_t)     (B_t: approximate Hessian)

If the step satisfies the Wolfe conditions, then s_t^T y_t > 0 and B_t stays positive definite.

– Additional conditions are imposed on B, e.g., symmetry or diagonal structure
– The difference between successive B's is low-rank
– BFGS (Broyden-Fletcher-Goldfarb-Shanno) uses rank-2 (inverse) Hessian updates:
    \tilde{B}_{t+1} = \arg\min_{\tilde{B}} ||\tilde{B} - \tilde{B}_t||_F   subject to   \tilde{B}_{t+1} = \tilde{B}_{t+1}^T,  \tilde{B}_{t+1} y_t = s_t,   where \tilde{B}_t = B_t^{-1}

• The inverse of the approximate Hessian is updated directly:
    \tilde{B}_{t+1} = (I - \rho_t s_t y_t^T) \tilde{B}_t (I - \rho_t y_t s_t^T) + \rho_t s_t s_t^T,   \rho_t = 1/(y_t^T s_t),   \tilde{B}_0 = I

O(d^2) cost per iteration, superlinear rate of convergence, self-correcting properties.
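A minimal sketch of BFGS using the inverse-Hessian update above, with a backtracking line search standing in for a full Wolfe search; the smooth test function is an illustrative assumption.

```python
import numpy as np

def bfgs(J, grad_J, w0, max_iter=100, tol=1e-8):
    """BFGS: maintain an approximation B of the inverse Hessian."""
    w = w0.astype(float)
    d = w.size
    B = np.eye(d)                                # B_tilde_0 = I
    g = grad_J(w)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                               # quasi-Newton direction
        eta = 1.0                                # backtracking in place of a Wolfe search
        while J(w + eta * p) > J(w) + 1e-4 * eta * g @ p:
            eta *= 0.5
        s = eta * p                              # s_t = w_{t+1} - w_t
        w_new = w + s
        g_new = grad_J(w_new)
        y = g_new - g                            # y_t = grad J(w_{t+1}) - grad J(w_t)
        if s @ y > 1e-12:                        # curvature condition s^T y > 0
            rho = 1.0 / (s @ y)
            V = np.eye(d) - rho * np.outer(s, y)
            B = V @ B @ V.T + rho * np.outer(s, s)   # rank-2 inverse-Hessian update
        w, g = w_new, g_new
    return w

# Example: smooth non-quadratic function (regularized log-sum-exp)
J = lambda w: np.log(np.exp(w).sum()) + 0.1 * w @ w
grad_J = lambda w: np.exp(w) / np.exp(w).sum() + 0.2 * w
w_hat = bfgs(J, grad_J, np.array([1.0, -2.0, 3.0]))
```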
Limited-Memory BFGS (L-BFGS)

Quasi-Newton methods produce dense Hessian approximations even when the true Hessian is sparse, giving high storage cost for large-scale problems.

Key idea: represent the approximate (inverse) Hessian using a few d-dimensional vectors from the most recent iterations; store at most the m most recent pairs (s_t, y_t).

  w_{t+1} = w_t - \alpha_t \tilde{B}_t \nabla J(w_t)

– Iteratively estimate the product \tilde{B}_t \nabla J(w_t) using the most recent (s_t, y_t) pairs
– Can be computed efficiently with two loops in O(md)
– A different initialization is possible for each inner loop, e.g., \tilde{B}^0_t = \gamma_t I with \gamma_t = s_{t-1}^T y_{t-1} / (y_{t-1}^T y_{t-1})

Similar to conjugate gradient; the rate of convergence is linear instead of superlinear as in BFGS.

How about sparse, sampling-based, or diagonal approximations of the Hessian?
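A minimal sketch of the L-BFGS two-loop recursion, which computes the product of the approximate inverse Hessian with the current gradient in O(md) from the stored (s, y) pairs; the driver loop, memory size m = 5, and the quadratic test problem are illustrative assumptions.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: returns B_tilde_t @ g using the m most recent (s, y) pairs."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial scaling: B^0 = gamma * I, gamma = s^T y / y^T y from the latest pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    # Second loop: oldest to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r    # approximates B_tilde_t @ g; the step is w <- w - eta * r

# Minimal driver: L-BFGS on a quadratic, m = 5 pairs, backtracking line search
A = np.diag(np.arange(1.0, 11.0)); b = np.ones(10)
J = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w, m, s_list, y_list = np.zeros(10), 5, [], []
for _ in range(50):
    g = grad(w)
    p = -lbfgs_direction(g, s_list, y_list)
    eta = 1.0
    while J(w + eta * p) > J(w) + 1e-4 * eta * g @ p:
        eta *= 0.5
    s = eta * p
    s_list.append(s); y_list.append(grad(w + s) - g)
    s_list, y_list = s_list[-m:], y_list[-m:]        # keep only the m most recent pairs
    w = w + s
```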
Comparison: Logistic Regression

n = 1500, d = 100 (Minka [5])

[Figure: convergence of different optimizers on logistic regression, with independent features and with correlated features; Hessian-based methods do better.]

Trade-off between the number of iterations and the cost per iteration!
Batch vs Online

Most objective functions treat the training data as i.i.d. and sum over examples:

  Linear Regression:    J(w) = \sum_i [ (w^T x_i - y_i)^2 + \lambda w^T w ]
  SVM:                  J(w) = \sum_i [ \max(0, 1 - y_i w^T x_i) + \lambda w^T w ]
  Logistic Regression:  J(w) = \sum_i [ -\log \sigma(y_i w^T x_i) + \lambda w^T w ]

Batch methods
– compute gradients using all the training data
– each iteration needs a full pass over the data: slow for large datasets

Online (Stochastic) methods
– use gradients from a small subset of the data, usually just a single point
– do not decrease the function value monotonically
– typically have worse convergence properties than batch methods
– quite appropriate for large-scale applications, where data is usually redundant
– can converge even before seeing all the data once: very fast
– popular conjecture: may also avoid bad local minima
Stochastic Gradient Descent (SGD)

At each iteration, use a small training subset to estimate the gradient:
  w_{t+1} = w_t - \eta_t \nabla J(w_t, x_t)     (x_t: a single point or a small subset of the training data)

Converges to the optimal value if the step-length satisfies
  \sum_t \eta_t^2 < \infty   (step size decreases fast enough)
  \sum_t \eta_t = \infty     (parameters may move arbitrary distances)

A common schedule: \eta_t = \eta_0 \tau / (\tau + t), where \eta_0, \tau > 0 are tuning parameters.

Properties
– O(d) space and time per iteration instead of O(nd) for gradient descent
– Outperforms batch gradient methods on large datasets
– Slow convergence on ill-conditioned problems
– Stochastic Meta-Descent: adjust a different gain for each parameter
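A minimal sketch of SGD with the eta_t = eta_0 * tau / (tau + t) schedule, applied to the regularized logistic loss written per example as above; eta_0, tau, and the synthetic data are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logreg(X, y, lam=0.01, eta0=1.0, tau=10.0, n_epochs=5, seed=0):
    """SGD on J(w) = sum_i [ -log sigma(y_i w^T x_i) + lam * w^T w ], one point per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            eta = eta0 * tau / (tau + t)                 # decaying step size
            margin = y[i] * (X[i] @ w)
            # gradient of the per-example term: -(1 - sigma(margin)) y_i x_i + 2 lam w
            grad = -(1.0 - sigmoid(margin)) * y[i] * X[i] + 2 * lam * w
            w -= eta * grad
            t += 1
    return w

# Example on synthetic, nearly separable data
X = np.random.randn(1000, 20)
w_true = np.random.randn(20)
y = np.sign(X @ w_true + 0.1 * np.random.randn(1000))
w_hat = sgd_logreg(X, y)
```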
Experiment - SGD

SVM classification task: d = 47K (~80 nonzero features per example), n = ~800K.
Bottou & Bousquet [8]
SGD for Sparse Data

Suppose the training vectors x_i \in R^d, i = 1,...,n are sparse:
– Only a small fraction s << 1 of the d elements are non-zero
– Common in many applications: text, images, biological data, ...

Objective function:
  J(w) = \sum_{i=1}^n [ L(y_i, w^T x_i) + \lambda R(w) ]

Stochastic gradient:
  \nabla J(w_t, x_t) = \nabla L(y_t, w_t^T x_t) y_t x_t + \lambda \nabla R(w_t)
                       (sparse)                            (dense)

The regularizer makes this a dense vector update, O(d) per iteration. How to make sparse updates, i.e., O(sd) per iteration? Break the update into two parts:

1. Update using gradients of L(.) alone, ignoring the regularizer:
     w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t)
2. Every k-th iteration, adjust w using the gradient of the regularizer:
     w_{t+1} \leftarrow w_{t+1} - k \eta_{t+1} \nabla R(w_{t+1})
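A minimal sketch of the two-part update for sparse data with an L2 regularizer: the loss gradient touches only the non-zero coordinates of x_t, and the dense regularizer shrinkage is applied only every k iterations. The hinge-loss choice, the (indices, values) representation, and all parameter values are illustrative assumptions.

```python
import numpy as np

def sparse_sgd(examples, y, d, lam=1e-4, eta0=0.1, tau=100.0, k=16, n_epochs=3):
    """SGD with sparse loss updates and a regularizer step every k iterations.

    examples[i] = (idx, val): non-zero feature indices and values of x_i.
    Uses the hinge loss L = max(0, 1 - y w^T x) as an illustrative choice.
    """
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for (idx, val), yi in zip(examples, y):
            eta = eta0 * tau / (tau + t)
            margin = yi * (w[idx] @ val)             # touches O(sd) coordinates only
            if margin < 1.0:                         # subgradient of the hinge loss
                w[idx] += eta * yi * val             # 1. sparse update, regularizer ignored
            if (t + 1) % k == 0:
                w -= k * eta * 2 * lam * w           # 2. dense regularizer step, every k-th iter
            t += 1
    return w

# Example: 3 sparse points in d = 6 dimensions
examples = [(np.array([0, 3]), np.array([1.0, -2.0])),
            (np.array([1, 4]), np.array([0.5, 1.5])),
            (np.array([2, 5]), np.array([-1.0, 1.0]))]
y = np.array([1, -1, 1])
w_hat = sparse_sgd(examples, y, d=6)
```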
Perceptron

Originally proposed to learn a linear binary classifier f(x) = \mathrm{sgn}(w^T x), with y \in {-1, +1}.

Update Rule:
  w_{t+1} = w_t + y x   if x is misclassified, i.e., y \ne f(x)
  w_{t+1} = w_t         otherwise

Properties
– Stochastic gradient descent on a non-differentiable loss function
– Guaranteed to converge if the data is linearly separable
– What happens for inseparable data? It will not converge but oscillate!
– Use heuristics such as voting with various parameter vectors

[Figure: a linear decision boundary in the (x_1, x_2) plane.]
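A minimal sketch of the perceptron update; the synthetic data, epoch count, and returning the final (rather than voted) weights are illustrative choices.

```python
import numpy as np

def perceptron(X, y, n_epochs=10):
    """Perceptron: w <- w + y_i x_i whenever example i is misclassified."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            if np.sign(X[i] @ w) != y[i]:    # misclassified (sign(0) counts as an error here)
                w += y[i] * X[i]
    return w

# Example: linearly separable data (convergence is guaranteed in this case)
X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
y = np.array([1] * 50 + [-1] * 50)
w_hat = perceptron(X, y)
```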
Natural Gradient Descent

Incorporates a Riemannian metric into stochastic descent:
  w_{t+1} = w_t - \eta_t \tilde{G}_t^{-1} \nabla J(w_t, x_t),   \eta_t = \eta_0 \tau / (\tau + t)
  G_t = E_x[ \nabla J(w_t, x_t) \nabla J(w_t, x_t)^T ]

Metric update (assuming the Hessian to be constant):
  \tilde{G}_{t+1} = (1 - 1/t) \tilde{G}_t + (1/t) \nabla J(w_t, x_t) \nabla J(w_t, x_t)^T

Properties
– Update the matrix inverse directly: O(d^2) space and time per iteration
– More stable; has good theoretical properties
– Still expensive for large-scale problems
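A minimal sketch of natural gradient descent with the running metric update above. For simplicity it solves a linear system each step (with a small ridge eps to keep the early solves well-posed), which is O(d^3); a rank-one Sherman-Morrison update of the inverse would give the O(d^2) per-step cost mentioned on the slide. The loss function, initialization G_0 = I, and eps are illustrative assumptions.

```python
import numpy as np

def natural_gradient_descent(grad_J, samples, w0, eta0=0.5, tau=10.0, eps=1e-6):
    """Natural gradient: precondition the stochastic gradient by the inverse metric G_t."""
    w = w0.astype(float)
    d = w.size
    G = np.eye(d)                                            # G_tilde_0 = I (illustrative)
    for t, x in enumerate(samples, start=1):
        g = grad_J(w, x)                                     # stochastic gradient for sample x
        G = (1.0 - 1.0 / t) * G + (1.0 / t) * np.outer(g, g) # running metric update
        eta = eta0 * tau / (tau + t)
        w -= eta * np.linalg.solve(G + eps * np.eye(d), g)   # approx G^{-1} g
    return w

# Example: per-sample squared loss; each sample is packed as (x, y)
data = [(np.random.randn(3), np.random.randn()) for _ in range(500)]
grad_J = lambda w, xy: 2 * (w @ xy[0] - xy[1]) * xy[0]
w_hat = natural_gradient_descent(grad_J, data, np.zeros(3))
```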
Online-BFGS

The change in gradients is estimated using a small subset of the training data:

  w_{t+1} = w_t - \eta_t \tilde{B}_t \nabla J(w_t, x_t)     (\tilde{B}_t: stochastic approximation of the inverse Hessian)
  \tilde{B}_{t+1} y_t = s_t,   s_t = w_{t+1} - w_t
  (in the batch case, the Wolfe conditions guarantee s_t^T y_t > 0 and B_t \succ 0)

Modifications for the online setting:
1. To maintain positive curvature, replace y_t = \nabla J(w_{t+1}) - \nabla J(w_t) by
     y_t = \nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t) + \lambda s_t,   \lambda \ge 0
   (compute both gradients on the same sample!)
2. Initialize \tilde{B} with a small coefficient to restrict the initial parameter update: \tilde{B}_0 = \epsilon I, \epsilon \approx 10^{-10}
3. Modify the update of \tilde{B}_{t+1}:
     \tilde{B}_{t+1} = (I - \rho_t s_t y_t^T) \tilde{B}_t (I - \rho_t y_t s_t^T) + c \rho_t s_t s_t^T,   \rho_t = 1/(y_t^T s_t),   0 < c \le 1,   and use \eta_t / c in place of \eta_t

– Test of convergence: the last k stochastic gradients have all been below a threshold
– An online version of L-BFGS is possible: O(md) cost instead of O(d^2)
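A minimal sketch of the modified online BFGS update following the three modifications listed above; the per-sample gradient function, lambda, c, the step schedule, and the curvature guard are illustrative assumptions, not a faithful reproduction of Schraudolph et al. [7].

```python
import numpy as np

def online_bfgs(grad_J, samples, w0, eta0=0.1, tau=10.0, lam=1.0, c=0.1, eps=1e-10):
    """Online BFGS: BFGS-style inverse-Hessian updates from stochastic gradients."""
    w = w0.astype(float)
    d = w.size
    B = eps * np.eye(d)                    # modification 2: B_0 = eps * I
    for t, x in enumerate(samples):
        g = grad_J(w, x)
        eta = (eta0 * tau / (tau + t)) / c # modification 3: step eta_t / c
        s = -eta * B @ g                   # s_t = w_{t+1} - w_t
        w_new = w + s
        # change of gradients on the SAME sample, plus lam*s for positive curvature (mod. 1)
        y = grad_J(w_new, x) - g + lam * s
        sy = s @ y
        if sy > 1e-16:                     # curvature guard
            rho = 1.0 / sy
            V = np.eye(d) - rho * np.outer(s, y)
            B = V @ B @ V.T + c * rho * np.outer(s, s)   # modification 3: scale s s^T by c
        w = w_new
    return w

# Example: per-sample squared loss; each sample is packed as (x, y)
data = [(np.random.randn(4), np.random.randn()) for _ in range(2000)]
grad_J = lambda w, xy: 2 * (w @ xy[0] - xy[1]) * xy[0]
w_hat = online_bfgs(grad_J, data, np.zeros(4))
```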
Experiments

Synthetic quadratic function: # parameters d = 5, n = ~1M.
Schraudolph et al. [7]
Experiments

Conditional Random Field (CRF) applied to text chunking: # parameters d = 100K, n = ~9K.
Schraudolph et al. [7]
Stochastic Gradient Descent (Quasi-Newton)

Approximate the inverse Hessian by a diagonal matrix; inspired by online-BFGS.

With s_t = w_{t+1} - w_t and y_t = \nabla J(w_{t+1}) - \nabla J(w_t) (Wolfe conditions: s_t^T y_t > 0, B_t \succ 0), the secant equation \tilde{B}_{t+1} y_t = s_t becomes, on a single sample,
  w_{t+1} - w_t \approx \tilde{B}_{t+1} ( \nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t) )
and, since \tilde{B} is a diagonal matrix, it decouples per coordinate:
  [w_{t+1} - w_t]_i \approx \tilde{B}_{ii} [ \nabla J(w_{t+1}, x_t) - \nabla J(w_t, x_t) ]_i
In practice:
– Update the matrix entries only every k iterations
– Do a leaky average of B to get stable updates:
    r_i = [ \nabla J(w_{t+1}, x_t) - \nabla J(w_{t+1-k}, x_t) ]_i / [ w_{t+1} - w_t ]_i
    r_i \leftarrow \max\{ \lambda, \min\{ 100\lambda, r_i \} \}     (clamp the curvature estimate)
    \tilde{B}_{ii} \leftarrow \tilde{B}_{ii} [ 1 + k \tilde{B}_{ii} r_i ]^{-1}
– Has a flavor of a partially 'fixed-Hessian' method
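A minimal sketch of a diagonally preconditioned SGD in the spirit of this slide: per-coordinate curvature ratios estimated on the same sample, clamped, and folded into the diagonal gains every k iterations. This is an illustrative variant under those assumptions, not the exact SGD-QN algorithm of Bordes et al. [9].

```python
import numpy as np

def diag_qn_sgd(grad_J, samples, d, eta0=0.05, tau=10.0, k=8, lam=1e-2):
    """SGD with a diagonal inverse-Hessian estimate (illustrative, not exact SGD-QN)."""
    w = np.zeros(d)
    B = np.ones(d)                        # diagonal of B_tilde (per-coordinate gains)
    for t, x in enumerate(samples):
        g = grad_J(w, x)
        eta = eta0 * tau / (tau + t)
        w_new = w - eta * B * g           # per-coordinate scaled SGD step
        if (t + 1) % k == 0:              # update the matrix entries only every k iterations
            g_new = grad_J(w_new, x)      # gradient change measured on the same sample
            dw = w_new - w
            with np.errstate(divide="ignore", invalid="ignore"):
                r = np.where(np.abs(dw) > 1e-12, (g_new - g) / dw, lam)
            r = np.clip(r, lam, 100 * lam)            # clamp the curvature estimates
            B = B / (1.0 + k * B * r)                 # leaky update of the diagonal
        w = w_new
    return w

# Example: per-sample squared loss; each sample is packed as (x, y)
data = [(np.random.randn(10), np.random.randn()) for _ in range(5000)]
grad_J = lambda w, xy: 2 * (w @ xy[0] - xy[1]) * xy[0]
w_hat = diag_qn_sgd(grad_J, data, d=10)
```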
Stochastic Gradient Descent (Quasi-Newton)

Alpha dataset of the Pascal Challenge (2008): # parameters d = 500 (dense), n = ~100K.
Sparse SGD on RCV1: d = 47K, s = 0.0016.
Bordes et al. [9]
References

1. H. E. Robbins and S. Monro, "A stochastic approximation method," Annals of Mathematical Statistics, 1951.
2. J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, 1999.
3. N. Schraudolph, "Local gain adaptation in stochastic gradient descent," Int. Conf. on Artificial Neural Networks, 1999.
4. S. Amari, H. Park, and K. Fukumizu, "Adaptive method of realizing natural gradient learning for multilayer perceptrons," Neural Computation, 2000.
5. T. Minka, "A comparison of numerical optimizers for logistic regression," 2003. http://research.microsoft.com/en-us/um/people/minka/papers/logreg/
6. L. Bottou and Y. LeCun, "Online learning for very large datasets," Applied Stochastic Models in Business and Industry, 2005.
7. N. Schraudolph, J. Yu, and S. Günter, "A stochastic quasi-Newton method for online convex optimization," AISTATS, 2007.
8. L. Bottou and O. Bousquet, "Learning using large datasets," in Mining Massive DataSets for Security, NATO ASI Workshop Series, IOS Press, Amsterdam, 2008.
9. A. Bordes, L. Bottou, and P. Gallinari, "SGD-QN: Careful quasi-Newton stochastic gradient descent," JMLR, 2009.
10. A. Bordes, L. Bottou, P. Gallinari, J. Chang, and S. A. Smith, "Erratum: SGD-QN is less careful than expected," 2010. http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
11. J. Yu, S. Vishwanathan, S. Günter, and N. N. Schraudolph, "A quasi-Newton approach to nonsmooth convex optimization problems in machine learning," Journal of Machine Learning Research, 11:1145-1200, 2010.
Logistic Regression

Given: a labeled training set {x_i, y_i}, i = 1,...,n, with x_i \in R^d, y_i \in {-1,+1}.
Goal: learn a predictive function.

  p(y = 1 | x) = \sigma(w^T x)        (absorb w_0 into w)
  p(y_i | x_i) = \sigma(y_i w^T x_i)

(Negative) log-posterior = log-likelihood term + log-prior (regularizer):
  L(w) = \sum_i \log(1 + \exp(-y_i w^T x_i)) + \lambda w^T w

Convex problem; Newton's method gives updates of the form
  w_{t+1} = (X^T A X + \lambda I)^{-1} X^T A z
(the IRLS form, with A a diagonal weight matrix and z the working response).

With n ~ O(100M) and d ~ O(100K): the multiplication costs O(n d^2) and the inversion O(d^3), versus O(nd) per iteration for first-order methods.

[Figure: the sigmoid p(y = 1 | x) as a function of w^T x, and a linear decision boundary in the (x_1, x_2) plane.]
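A minimal NumPy sketch of the Newton iteration for L2-regularized logistic regression, written via the gradient and the Hessian X^T A X + 2*lam*I rather than the explicit working-response form; forming the Hessian and solving the system reflect the O(n d^2) and O(d^3) costs noted above. The synthetic data and lam are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_newton(X, y, lam=1.0, n_iter=10):
    """Newton's method for L(w) = sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        m = y * (X @ w)                                     # margins y_i w^T x_i
        p = sigmoid(m)
        grad = -X.T @ ((1.0 - p) * y) + 2 * lam * w         # O(nd)
        a = p * (1.0 - p)                                   # diagonal of the weight matrix A
        H = X.T @ (a[:, None] * X) + 2 * lam * np.eye(d)    # X^T A X + 2 lam I: O(n d^2)
        w -= np.linalg.solve(H, grad)                       # O(d^3) solve, no explicit inverse
    return w

# Example
X = np.random.randn(200, 5)
y = np.sign(X @ np.random.randn(5))
w_hat = logreg_newton(X, y, lam=0.1)
```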