CS60021: Scalable Data Mining
Large Scale Machine Learning
Sourangshu Bhattacharya

(Slides based on: J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets, http://www.mmds.org, and J. Duchi, Convex Optimization for Machine Learning, UC Berkeley, Fall 2009.)
- Example: Spam filtering
- Instance space x ∈ X (|X| = n data points)
  - Binary or real-valued feature vector x of word occurrences
  - d features (words + other things, d ~ 100,000)
- Class y ∈ Y
  - y: Spam (+1), Ham (-1)
- Goal: Estimate a function f(x) so that y = f(x)
- Would like to do prediction: estimate a function f(x) so that y = f(x)
- Where y can be:
  - Real number: Regression
  - Categorical: Classification
  - Complex object: ranking of items, parse tree, etc.
- Data is labeled:
  - Have many pairs {(x, y)}
  - x ... vector of binary, categorical, real-valued features
  - y ... class ({+1, -1}, or a real number)
- Task: Given data (X, Y), build a model f() to predict Y' based on X'
- Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
  - This "hope" is called generalization
  - Overfitting: f(x) predicts Y well but is unable to predict Y'
  - We want to build a model that generalizes well to unseen data
  - But Jure, how can we do well on data we have never seen before?!
[Diagram: (X, Y) is the training data; (X', Y') is the test data]
- Idea: Pretend we do not know the data/labels we actually do know
  - Build the model f(x) on the training data
  - See how well f(x) does on the test data
  - If it does well, then apply it also to X'
- Refinement: Cross-validation
  - Splitting into a single training/validation set is brutal
  - Let's split our data (X, Y) into 10 folds (buckets)
  - Take out 1 fold for validation, train on the remaining 9
  - Repeat this 10 times, report average performance
[Diagram: (X, Y) is split into a training set and a validation set; X' is the test set]
- Binary classification:
- Input: Vectors x_j and labels y_j
  - Vectors x_j are real-valued, with ‖x‖₂ = 1
- Goal: Find a vector w = (w^(1), w^(2), ..., w^(d))
  - Each w^(i) is a real number

f(x) = +1 if w^(1) x^(1) + w^(2) x^(2) + ... + w^(d) x^(d) ≥ θ, and -1 otherwise
[Figure: positive and negative points separated by the hyperplane w · x = 0, with normal vector w. Note: the decision boundary is linear.]
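As a quick illustration of the decision rule above, here is a minimal Python sketch; the weight vector, data point, and threshold are made-up values, not from the lecture.

import numpy as np

def predict(w, x, theta):
    """Linear classifier: +1 if w . x >= theta, else -1."""
    return 1 if np.dot(w, x) >= theta else -1

# Toy example with made-up numbers:
w = np.array([0.5, -1.0, 2.0])   # one weight per feature
x = np.array([1.0, 0.0, 1.0])    # one data point
theta = 1.0
print(predict(w, x, theta))      # w . x = 2.5 >= 1.0, so this prints 1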
- Want to estimate w and b!
  - Standard way: use a solver!
  - Solver: software for finding solutions to "common" optimization problems
- Use a quadratic solver:
  - Minimize a quadratic function
  - Subject to linear constraints
- Problem: solvers are inefficient for big data!
\min_{w,b} \;\; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t. } \forall i:\; y_i (x_i \cdot w + b) \ge 1 - \xi_i
- Want to estimate w, b!
- Alternative approach:
  - Want to minimize f(w, b):
- Side note:
  - How to minimize convex functions g(z)?
  - Use gradient descent: min_z g(z)
  - Iterate: z_{t+1} ← z_t - η ∇g(z_t)
The function to minimize is

f(w, b) = \frac{1}{2} \sum_{j=1}^{d} \big(w^{(j)}\big)^2 + C \sum_{i=1}^{n} \max\Big\{0,\; 1 - y_i \Big(\sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b\Big)\Big\}

which is the unconstrained form of the problem we just had:

\min_{w,b} \;\; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t. } \forall i:\; y_i (x_i \cdot w + b) \ge 1 - \xi_i
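This hinge-loss objective translates directly into code. A minimal Python/NumPy sketch; the function name svm_objective and the toy data are illustrative assumptions, not from the lecture.

import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w,b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))."""
    margins = y * (X @ w + b)                  # y_i (w . x_i + b) for every example i
    hinge = np.maximum(0.0, 1.0 - margins)     # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Toy data (made up): 4 points in 2-D with labels +/-1
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(svm_objective(np.zeros(2), 0.0, X, y, C=1.0))   # all margins are 0, so this prints 4.0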
[Figure: a convex function g(z)]
Find the minimum or maximum of an objective function given a set of constraints:

\arg\min_x f_0(x)
\quad \text{s.t. } f_i(x) \le 0,\; i = 1, \dots, k;
\quad h_j(x) = 0,\; j = 1, \dots, l
Examples: Maximum Likelihood, K-Means, Linear Classification

Maximum likelihood:
\arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(x_i)

K-Means:
\arg\min_{\mu_1, \dots, \mu_k} J(\mu) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2

Linear classification (SVM):
\arg\min_{w} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t. } 1 - y_i x_i^T w \le \xi_i,\; \xi_i \ge 0
What is Optimization?

But generally speaking... we're screwed:
- Local (non-global) minima of f_0
- All kinds of constraints (even restricting to continuous functions), e.g. h(x) = sin(2πx) = 0

[Figure: a surface with many local (non-global) minima and maxima]
Convex Sets

Definition: A set C ⊆ R^n is convex if for x, y ∈ C and any α ∈ [0, 1],
αx + (1 - α)y ∈ C.

[Figure: the segment between any two points x, y of a convex set stays inside the set]

Convex Functions

Definition: A function f : R^n → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],
f(αx + (1 - α)y) ≤ αf(x) + (1 - α)f(y).

[Figure: the chord αf(x) + (1 - α)f(y) lies above the graph of f]
Convex Functions: Examples

Important examples in machine learning:
- SVM loss: f(w) = [1 - y_i x_i^T w]_+
- Binary logistic loss: f(w) = log(1 + exp(-y_i x_i^T w))

[Figure: plots of the hinge loss [1 - x]_+ and the logistic loss log(1 + e^x)]
Convex Optimization Problems

Definition: An optimization problem is convex if its objective f_0 is a convex function, the inequality constraints f_i are convex, and the equality constraints h_j are affine:

minimize_x  f_0(x)        (convex function)
s.t.        f_i(x) ≤ 0    (convex sets)
            h_j(x) = 0    (affine)
Optimization Algorithms: Gradient Methods

Gradient Descent

The simplest algorithm in the world (almost). Goal:

minimize_x f(x)

Just iterate
x_{t+1} = x_t - η_t ∇f(x_t)
where η_t is the step size.
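A minimal sketch of this iteration in Python/NumPy, assuming a fixed step size and a caller-supplied gradient function; the names and the example function are illustrative, not from the slides.

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, iters=100):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) with a fixed step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad_f(x)
    return x

# Example: minimize f(x) = ||x - a||^2, whose gradient is 2 (x - a).
a = np.array([3.0, -1.0])
print(gradient_descent(lambda x: 2.0 * (x - a), x0=np.zeros(2)))   # approaches [3, -1]

Too large a step size makes the iteration diverge; this is the same step-size issue that shows up later when tuning SGD.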
[Figure: single-step illustration, showing the point (x_t, f(x_t)) on the graph of f(x) and the step taken along the linear model f(x_t) - η ∇f(x_t)^T (x - x_t)]
Newton's Method

Idea: use a second-order approximation to the function:

f(x + Δx) ≈ f(x) + ∇f(x)^T Δx + (1/2) Δx^T ∇²f(x) Δx

Choose Δx to minimize the above:

Δx = -(∇²f(x))^{-1} ∇f(x)        (inverse Hessian times gradient)

This is a descent direction:

∇f(x)^T Δx = -∇f(x)^T (∇²f(x))^{-1} ∇f(x) < 0.
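A hedged sketch of a pure Newton iteration (no line search or damping, which a practical implementation would add); the test function is made up for illustration.

import numpy as np

def newton(grad_f, hess_f, x0, iters=10):
    """Iterate x_{t+1} = x_t - (Hessian)^{-1} gradient (pure Newton step)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        dx = np.linalg.solve(hess_f(x), grad_f(x))   # solve H dx = grad instead of inverting H
        x = x - dx
    return x

# Example: f(x) = exp(x1 + x2) + x1^2 + x2^2, a smooth convex function.
def grad(x):
    e = np.exp(x.sum())
    return np.array([e + 2 * x[0], e + 2 * x[1]])

def hess(x):
    e = np.exp(x.sum())
    return np.array([[e + 2, e], [e, e + 2]])

print(newton(grad, hess, x0=np.zeros(2)))   # converges to the minimizer in a few steps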
[Figure: Newton step picture, moving from (x, f(x)) to (x + Δx, f(x + Δx)); f̂ is the 2nd-order approximation, f is the true function.]
Convex Functions
Actually, more general than that:

Definition: The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x) = {g : f(y) ≥ f(x) + g^T (y - x) for all y}.

Theorem: f : R^n → R is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: a linear lower bound f(x) + g^T (y - x) touching the graph of f at (x, f(x))]
Optimization Algorithms: Subgradient Methods

Why subgradient descent?
- Lots of non-differentiable convex functions are used in machine learning:
  f(x) = [1 - a^T x]_+ ,   f(x) = ‖x‖_1 ,   f(X) = \sum_{r=1}^{k} σ_r(X)
  where σ_r is the r-th singular value of X.
- Easy to analyze
- Do not even need a true subgradient: it is enough that E[g_t] ∈ ∂f(x_t).
Subgradient Descent

Really, the simplest algorithm in the world. Goal:

minimize_x f(x)

Just iterate
x_{t+1} = x_t - η_t g_t
where η_t is a step size and g_t ∈ ∂f(x_t).
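A minimal sketch of subgradient descent on a non-differentiable function; the decaying step size η_t = η_0 / sqrt(t) and the example function are illustrative assumptions, not from the slides.

import numpy as np

def subgradient_descent(subgrad_f, x0, eta0=1.0, iters=500):
    """Iterate x_{t+1} = x_t - eta_t * g_t, with g_t any subgradient and eta_t = eta0 / sqrt(t)."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, iters + 1):
        x = x - (eta0 / np.sqrt(t)) * subgrad_f(x)
    return x

# Example: f(x) = ||x - a||_1 is convex but not differentiable at its minimizer x = a.
# sign(x - a) is a valid subgradient everywhere (it is 0 where x_i = a_i).
a = np.array([1.0, -2.0])
print(subgradient_descent(lambda x: np.sign(x - a), x0=np.zeros(2)))   # ends up close to [1, -2]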
- Goal of machine learning:
  - Minimize expected loss given samples
- This is Stochastic Optimization
  - Assume the loss function is convex
- Process all examples together in each step
- The entire training set is examined at each step
- Very slow when n is very large
- "Optimize" one example at a time
- Choose examples randomly (or reorder and choose in order)
  - Learning is then representative of the example distribution
- Equivalent to online learning (the weight vector w changes with every example)
- Convergence guaranteed for convex functions (to a local minimum, which for a convex function is also the global minimum)
- Gradient descent:

  Iterate until convergence:
    For j = 1 ... d
      Evaluate: ∇f^(j)
      Update: w^(j) ← w^(j) - η ∇f^(j)

- Problem:
  - Computing ∇f^(j) takes O(n) time!
  - n ... size of the training dataset
\nabla f^{(j)} = \frac{\partial f(w, b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}

η ... learning rate parameter
C ... regularization parameter
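A sketch of this full-batch update, assuming L is the hinge loss so that ∂L(x_i, y_i)/∂w^(j) = -y_i x_i^(j) when y_i (w · x_i + b) < 1 and 0 otherwise; the bias update, step size, and toy data are illustrative additions.

import numpy as np

def batch_gd_svm(X, y, C=1.0, eta=0.01, iters=200):
    """Full-batch gradient descent on f(w,b) = 0.5 ||w||^2 + C sum_i max(0, 1 - y_i (w . x_i + b)).

    Every step touches all n examples, which is the O(n) cost noted above.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                          # examples whose hinge loss is non-zero
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy separable data (made up):
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(batch_gd_svm(X, y))   # prints the learned (w, b)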
- Stochastic Gradient Descent
  - Instead of evaluating the gradient over all examples, evaluate it for each individual training example:

\nabla f^{(j)}(x_i) = w^{(j)} + C \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}

(We just had: \nabla f^{(j)} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}})

- Stochastic gradient descent:

  Iterate until convergence:
    For i = 1 ... n
      For j = 1 ... d
        Compute: ∇f^(j)(x_i)
        Update: w^(j) ← w^(j) - η ∇f^(j)(x_i)

  Notice: no summation over i anymore.
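The same update with the summation over i removed, i.e. one example per step. A minimal sketch; the per-epoch random shuffling and the toy data are illustrative choices.

import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=20, seed=0):
    """SGD for the SVM objective: one example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit the examples in random order
            if y[i] * (X[i] @ w + b) < 1:         # hinge loss is active for this example
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:
                w -= eta * w                      # only the regularizer contributes
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(sgd_svm(X, y))   # prints the learned (w, b)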
- Convergence is very sensitive to the learning rate η (oscillations near the solution due to the probabilistic nature of sampling)
  - Might need to decrease η with time to ensure the algorithm converges eventually
- Basically: SGD is good for machine learning with large datasets!
- Stochastic: 1 example per iteration
- Batch: all the examples!
- Sample Average Approximation (SAA):
  - Sample m examples at each step and perform SGD on them (see the mini-batch sketch below)
- Allows for parallelization, but the choice of m is based on heuristics
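A sketch of this mini-batch variant, assuming the m per-example gradients are averaged before each step; the batch size, data, and parameter values are made up.

import numpy as np

def minibatch_sgd_svm(X, y, C=1.0, eta=0.01, m=2, epochs=20, seed=0):
    """SGD with mini-batches of m examples: sample m points, average their gradients, take one step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs * (n // m)):
        idx = rng.choice(n, size=m, replace=False)        # one mini-batch
        Xb, yb = X[idx], y[idx]
        active = yb * (Xb @ w + b) < 1
        grad_w = w - C * (yb[active, None] * Xb[active]).sum(axis=0) / m
        grad_b = -C * yb[active].sum() / m
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(minibatch_sgd_svm(X, y))   # the m gradient evaluations inside each step can be parallelized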
- Example by Leon Bottou:
  - Reuters RCV1 document corpus
    - Predict a category of a document
    - One vs. the rest classification
  - n = 781,000 training examples (documents)
  - 23,000 test examples
  - d = 50,000 features
    - One feature per word
    - Remove stop-words
    - Remove low-frequency words
- Questions:
  - (1) Is SGD successful at minimizing f(w,b)?
  - (2) How quickly does SGD find the minimum of f(w,b)?
  - (3) What is the error on a test set?

[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD SVM]

(1) SGD-SVM is successful at minimizing the value of f(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable
[Figure: optimization quality |f(w,b) - f(w_opt, b_opt)| versus training time for a conventional SVM and SGD-SVM]

For optimizing f(w,b) to within reasonable quality, SGD-SVM is super fast.
- SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples

Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times.

Theory says: gradient descent converges in a number of iterations linear in the condition number κ, while conjugate gradient converges in √κ iterations.
- Need to choose the learning rate η and t_0
- Leon suggests:
  - Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
  - Choose η:
    - Select a small subsample
    - Try various rates η (e.g., 10, 1, 0.1, 0.01, ...)
    - Pick the one that most reduces the cost (see the sketch after the update rule below)
    - Use that η for the next 100k iterations on the full dataset
w_{t+1} \leftarrow w_t - \frac{\eta_0}{t + t_0} \left( w_t + C \, \frac{\partial L(x_i, y_i)}{\partial w} \right)
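A sketch of the rate-selection recipe: run one cheap SGD pass over a small subsample for each candidate η and keep the rate that gives the lowest cost. The helper names and toy data are assumptions; a real run would use an actual subsample of the training set.

import numpy as np

def svm_cost(w, b, X, y, C):
    return 0.5 * np.dot(w, w) + C * np.maximum(0, 1 - y * (X @ w + b)).sum()

def one_sgd_pass(X, y, C, eta, seed=0):
    """One SGD pass over a (sub)sample with a fixed rate eta; returns the resulting cost."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for i in rng.permutation(len(y)):
        if y[i] * (X[i] @ w + b) < 1:
            w, b = w - eta * (w - C * y[i] * X[i]), b + eta * C * y[i]
        else:
            w = w - eta * w
    return svm_cost(w, b, X, y, C)

# Try a few rates on a small (made-up) subsample and keep the one with the lowest cost.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
rates = [10, 1, 0.1, 0.01, 0.001]
best_eta = min(rates, key=lambda eta: one_sgd_pass(X, y, C=1.0, eta=eta))
print(best_eta)   # the winning rate would then be used on the full dataset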
- Sparse Linear SVM:
  - The feature vector x_i is sparse (contains many zeros)
    - Do not store x_i = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,...]
    - But represent x_i as a sparse vector x_i = [(4,1), (9,5), ...]
  - Can we do the SGD update

      w ← w - η (w + C ∂L(x_i, y_i)/∂w)

    more efficiently?
  - Approximate it in 2 steps:

      (1) w ← w - η C ∂L(x_i, y_i)/∂w     cheap: x_i is sparse, so only a few coordinates j of w will be updated
      (2) w ← w (1 - η)                   expensive: w is not sparse, all coordinates need to be updated
- Solution 1: w = s · v
  - Represent the vector w as the product of a scalar s and a vector v
  - Then the update procedure is:
    - (1) v = v - η C ∂L(x_i, y_i)/∂w
    - (2) s = s (1 - η)
- Solution 2:
  - Perform only step (1) for each training example
  - Perform step (2) with lower frequency and higher η
Two-step update procedure:
  (1) w ← w - η C ∂L(x_i, y_i)/∂w
  (2) w ← w (1 - η)
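A sketch of Solution 1. One detail: to keep w = s · v exact, the gradient added to v below is divided by the current scale s, a small refinement of step (1) as written above. The class name and the numbers are illustrative assumptions.

import numpy as np

class ScaledVector:
    """Represent w = s * v so that the shrink step w <- (1 - eta) w costs O(1)."""
    def __init__(self, d):
        self.s = 1.0
        self.v = np.zeros(d)

    def shrink(self, eta):
        self.s *= (1.0 - eta)              # step (2): touches one scalar, not all d coordinates

    def add_sparse(self, coef, idx, vals):
        # step (1): w <- w + coef * x_i for a sparse x_i given as (indices, values);
        # since w = s * v, add (coef / s) * x_i to v.
        self.v[idx] += (coef / self.s) * np.asarray(vals)

    def dense(self):
        return self.s * self.v

# One SGD step on a sparse example x_i = [(4, 1), (9, 5)] with label y_i = +1 (made-up values):
w = ScaledVector(d=10)
eta, C, y_i = 0.1, 1.0, 1.0
w.add_sparse(eta * C * y_i, idx=[4, 9], vals=[1.0, 5.0])   # cheap: only 2 coordinates touched
w.shrink(eta)                                              # regularization step, O(1) thanks to s
print(w.dense())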
- Stopping criterion: how many iterations of SGD?
  - Early stopping with cross-validation
    - Create a validation set
    - Monitor the cost function on the validation set
    - Stop when the loss stops decreasing
  - Early stopping
    - Extract two disjoint subsamples A and B of the training data
    - Train on A, stop by validating on B
    - The number of epochs is an estimate of k
    - Train for k epochs on the full dataset
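A sketch of early stopping against a validation set, combining the SGD update from earlier with a simple patience rule; the patience parameter and the toy split are illustrative assumptions.

import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, C=1.0, eta=0.01,
                              max_epochs=100, patience=3, seed=0):
    """Run SGD epochs on (X_tr, y_tr); stop when the validation cost stops improving."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X_tr.shape[1]), 0.0

    def cost(X, y):
        return 0.5 * np.dot(w, w) + C * np.maximum(0, 1 - y * (X @ w + b)).sum()

    best_cost, bad_epochs = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        for i in rng.permutation(len(y_tr)):          # one SGD epoch
            if y_tr[i] * (X_tr[i] @ w + b) < 1:
                w, b = w - eta * (w - C * y_tr[i] * X_tr[i]), b + eta * C * y_tr[i]
            else:
                w = w - eta * w
        val_cost = cost(X_val, y_val)
        if val_cost < best_cost:
            best_cost, bad_epochs = val_cost, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                # validation loss stopped decreasing
                return w, b, epoch                    # epoch plays the role of k above
    return w, b, max_epochs

# Toy split (made up): two points for training, two for validation.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(train_with_early_stopping(X[::2], y[::2], X[1::2], y[1::2]))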
- Reference: http://alex.smola.org/teaching/10-701-15/math.html
- Given a dataset D = {(x_1, y_1), ..., (x_n, y_n)}
- Loss function: L(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} l(\theta; x_i, y_i)
- For linear models: l(\theta; x_i, y_i) = l(y_i, \theta^T \phi(x_i))
- Assumption: D is drawn IID from some distribution \mathcal{P}
- Problem: \min_{\theta} L(\theta, D)
- Input: D
- Output: \bar{\theta}

Algorithm:
- Initialize \theta_0
- For t = 0, 1, ..., T - 1:
    \theta_{t+1} = \theta_t - \eta_t \nabla_{\theta} l(y_t, \theta_t^T \phi(x_t))
- Return the weighted average \bar{\theta} = \frac{\sum_{t=0}^{T-1} \eta_t \theta_t}{\sum_{t=0}^{T-1} \eta_t}
- Expected loss: s(\theta) = E_{\mathcal{P}}[\, l(y, \theta^T \phi(x)) \,]
- Optimal expected loss: s^* = s(\theta^*) = \min_{\theta} s(\theta)
- Convergence:
    E[\, s(\bar{\theta}) \,] - s^* \le \frac{R^2 + L^2 \sum_{t=0}^{T-1} \eta_t^2}{2 \sum_{t=0}^{T-1} \eta_t}
- Where R = \|\theta_0 - \theta^*\| and L = \max \|\nabla l(y, \theta^T \phi(x))\|
Proof sketch:
- Define r_t = \|\theta_t - \theta^*\| and g_t = \nabla_{\theta} l(y_t, \theta_t^T \phi(x_t))
- Then r_{t+1}^2 = r_t^2 + \eta_t^2 \|g_t\|^2 - 2 \eta_t (\theta_t - \theta^*)^T g_t
- Taking expectation over the random samples, and using convexity of s (so that s^* - s(\theta_t) \ge \nabla s(\theta_t)^T (\theta^* - \theta_t)) together with E[g_t \mid \theta_t] = \nabla s(\theta_t) and \|g_t\| \le L, we get:
    E[r_{t+1}^2 - r_t^2] \le \eta_t^2 L^2 + 2 \eta_t (s^* - E[s(\theta_t)])
- Summing over t = 0, ..., T - 1 (the left-hand side telescopes):
    E[r_T^2 - r_0^2] \le L^2 \sum_{t=0}^{T-1} \eta_t^2 + 2 \sum_{t=0}^{T-1} \eta_t (s^* - E[s(\theta_t)])
- Rearranging, using E[r_T^2] \ge 0, r_0 = R, and Jensen's inequality s(\bar{\theta}) \le \frac{\sum_t \eta_t s(\theta_t)}{\sum_t \eta_t}, gives the convergence bound above.