CS60021: Scalable Data Mining

Large Scale Machine Learning

Sourangshu Bhattacharya

Slides based on: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

¡ Example: Spam filtering

¡ Instance space x ∈ X (|X| = n data points)
§ Binary or real-valued feature vector x of word occurrences
§ d features (words + other things, d ~ 100,000)

¡ Class y ∈ Y
§ y: Spam (+1), Ham (-1)

¡ Goal: Estimate a function f(x) so that y = f(x)

¡ Would like to do prediction: estimate a function f(x) so that y = f(x)

¡ Where y can be:
§ Real number: Regression
§ Categorical: Classification
§ Complex object: ranking of items, parse tree, etc.

¡ Data is labeled:
§ Have many pairs {(x, y)}
§ x … vector of binary, categorical, real-valued features
§ y … class ({+1, -1}, or a real number)

¡ Task: Given data (X, Y), build a model f() to predict Y' based on X'

¡ Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
§ The "hope" is called generalization
§ Overfitting: f(x) predicts Y well but is unable to predict Y'
§ We want to build a model that generalizes well to unseen data
§ But Jure, how can we do well on data we have never seen before?!?

[Figure: training data (X, Y) and test data (X', Y')]

¡ Idea: Pretend we do not know the data/labels we actually do know
§ Build the model f(x) on the training data
§ See how well f(x) does on the test data
§ If it does well, then apply it also to X'

¡ Refinement: Cross-validation
§ Splitting into training/validation set is brutal
§ Let's split our data (X, Y) into 10 folds (buckets)
§ Take out 1 fold for validation, train on the remaining 9
§ Repeat this 10 times, report average performance


[Figure: data (X, Y) split into a training set and a validation set, plus a separate test set X']
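A minimal sketch of the 10-fold procedure above (pure NumPy; `train_fn` and `eval_fn` are hypothetical stand-ins for whatever model-fitting and scoring routines are used):

```python
import numpy as np

def cross_validate(X, Y, train_fn, eval_fn, k=10, seed=0):
    """k-fold cross-validation: train_fn(X_tr, Y_tr) -> model,
    eval_fn(model, X_va, Y_va) -> score."""
    n = len(X)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)            # split data into k buckets
    scores = []
    for i in range(k):                        # hold out fold i, train on the rest
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[tr], Y[tr])
        scores.append(eval_fn(model, X[val], Y[val]))
    return float(np.mean(scores))             # report average performance
```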

¡ Binary classification:

¡ Input: Vectors x_j and labels y_j
§ Vectors x_j are real valued, with ‖x‖₂ = 1

¡ Goal: Find a vector w = (w(1), w(2), ..., w(d))
§ Each w(i) is a real number

f(x) = { +1  if w(1)·x(1) + w(2)·x(2) + ... + w(d)·x(d) ≥ θ
       { -1  otherwise

[Figure: points labeled + and − separated by the hyperplane w·x = θ. Note: the decision boundary is linear.]
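A minimal sketch of this decision rule in NumPy (the names `predict`, `w`, `theta` are illustrative, not from the slides):

```python
import numpy as np

def predict(w, x, theta=0.0):
    """Linear classifier: +1 if w.x >= theta, else -1."""
    return 1 if np.dot(w, x) >= theta else -1
```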

¡ Want to estimate w and b!
§ Standard way: use a solver!
§ Solver: software for finding solutions to "common" optimization problems

¡ Use a quadratic solver:
§ Minimize a quadratic function
§ Subject to linear constraints

¡ Problem: Solvers are inefficient for big data!

min_{w,b}  (1/2)·w·w + C·Σ_{i=1}^{n} ξ_i
s.t.  ∀i:  y_i·(x_i·w + b) ≥ 1 − ξ_i

¡ Want to estimate w, b!
¡ Alternative approach:
§ Want to minimize f(w,b):

f(w,b) = (1/2)·w·w + C·Σ_{i=1}^{n} max{ 0, 1 − y_i·(Σ_{j=1}^{d} w(j)·x_i(j) + b) }

(this is the unconstrained form of the SVM problem above)

¡ Side note:
§ How to minimize convex functions g(z)?
§ Use gradient descent: min_z g(z)
§ Iterate: z_{t+1} ← z_t − η·∇g(z_t)

[Figure: a convex function g(z), with gradient descent stepping toward the minimizer z]
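A minimal sketch of evaluating this objective f(w,b) in NumPy (hypothetical helper; `X` is the n×d data matrix, `y` the vector of ±1 labels):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w,b) = 0.5*w.w + C * sum_i max(0, 1 - y_i*(x_i.w + b))."""
    margins = y * (X @ w + b)                  # y_i * (x_i.w + b) for all i
    hinge = np.maximum(0.0, 1.0 - margins)     # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```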

Find the minimum or maximum of an objective function given a set of constraints:

argmin_x  f_0(x)
s.t.  f_i(x) ≤ 0,  i = 1, . . . , k
      h_j(x) = 0,  j = 1, . . . , l

Examples: Linear Classification, Maximum Likelihood, K-Means

Maximum Likelihood:
argmax_θ  Σ_{i=1}^{n} log p_θ(x_i)

K-Means:
argmin_{μ_1, μ_2, ..., μ_k}  J(μ) = Σ_{j=1}^{k} Σ_{i ∈ C_j} ||x_i − μ_j||²

Linear Classification (SVM):
argmin_w  ||w||² + C·Σ_{i=1}^{n} ξ_i
s.t.  1 − y_i·x_iᵀ·w ≤ ξ_i,   ξ_i ≥ 0

What is Optimization

But generally speaking... we're screwed.
! Local (non-global) minima of f_0
! All kinds of constraints (even restricting to continuous functions):
  h(x) = sin(2πx) = 0

[Figure: a surface with many local (non-global) minima and maxima]

(This and the following optimization slides are adapted from John Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009.)

Convex Sets

Definition: A set C ⊆ Rⁿ is convex if for x, y ∈ C and any α ∈ [0, 1],
αx + (1 − α)y ∈ C.

[Figure: a convex set containing the whole segment between any two points x, y]

Convex Functions

Definition: A function f : Rⁿ → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

[Figure: the chord αf(x) + (1 − α)f(y) lies above the graph of f between x and y]

Convex Functions: Examples

Important examples in Machine Learning:

! SVM loss:  f(w) = [1 − y_i·x_iᵀ·w]_+

! Binary logistic loss:  f(w) = log(1 + exp(−y_i·x_iᵀ·w))

[Figure: the hinge loss [1 − x]_+ and the logistic loss log(1 + eˣ)]

Convex Optimization Problems

Definition: An optimization problem is convex if its objective is a convex function, the inequality constraints f_j are convex, and the equality constraints h_j are affine.

minimize_x  f_0(x)        (convex function)
s.t.  f_i(x) ≤ 0          (convex sets)
      h_j(x) = 0          (affine)

Optimization Algorithms: Gradient Methods

Gradient Descent

The simplest algorithm in the world (almost). Goal:

minimize_x  f(x)

Just iterate
x_{t+1} = x_t − η_t·∇f(x_t)

where η_t is the stepsize.
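A minimal sketch of this iteration (pure NumPy; `grad_f` is any function returning ∇f(x), and the η and T defaults are arbitrary):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=100):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) for T steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```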


Optimization Algorithms: Gradient Methods

Single Step Illustration

[Figure: the graph of f(x), the point (x_t, f(x_t)), and the line f(x_t) − η·∇f(x_t)ᵀ(x − x_t) illustrating a single gradient step]

Optimization Algorithms: Gradient Methods

Newton's Method

Idea: use a second-order approximation to the function:

f(x + Δx) ≈ f(x) + ∇f(x)ᵀ·Δx + (1/2)·Δxᵀ·∇²f(x)·Δx

Choose Δx to minimize the above:

Δx = −[∇²f(x)]⁻¹·∇f(x)      (inverse Hessian times gradient)

This is a descent direction:

∇f(x)ᵀ·Δx = −∇f(x)ᵀ·[∇²f(x)]⁻¹·∇f(x) < 0.
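A minimal sketch of one Newton step under these formulas (hypothetical `grad_f`/`hess_f` callbacks; a linear solve replaces forming the inverse explicitly):

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton update: x + dx with dx = -[H(x)]^{-1} grad_f(x)."""
    g = grad_f(x)
    H = hess_f(x)
    dx = -np.linalg.solve(H, g)   # solve H dx = -g instead of inverting H
    return x + dx
```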

Optimization Algorithms: Gradient Methods

Newton Step Picture

[Figure: the true function f, the point (x, f(x)), and the step to (x + Δx, f(x + Δx)); f̂ is the 2nd-order approximation, f is the true function.]

Convex Functions

Actually, more general than that:

Definition: The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x) = { g : f(y) ≥ f(x) + gᵀ(y − x) for all y }.

Theorem: f : Rⁿ → R is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: a supporting line f(x) + gᵀ(y − x) touching the graph of f at (x, f(x)) and lying below f(y)]

Optimization Algorithms: Subgradient Methods

Why subgradient descent?

! Lots of non-differentiable convex functions are used in machine learning:

f(x) = [1 − aᵀx]_+ ,   f(x) = ‖x‖₁ ,   f(X) = Σ_{r=1}^{k} σ_r(X)

where σ_r is the r-th singular value of X.

! Easy to analyze
! Do not even need a true subgradient: just have E[g_t] ∈ ∂f(x_t).


Optimization Algorithms: Subgradient Methods

Subgradient Descent

Really, the simplest algorithm in the world. Goal:

minimize_x  f(x)

Just iterate
x_{t+1} = x_t − η_t·g_t

where η_t is a stepsize and g_t ∈ ∂f(x_t).
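A minimal sketch of subgradient descent applied to the non-differentiable hinge loss f(w) = Σ_i [1 − y_i·x_iᵀ·w]_+, picking one valid subgradient per step (illustrative names, not from the slides):

```python
import numpy as np

def hinge_subgradient(w, X, y):
    """A subgradient of sum_i max(0, 1 - y_i * x_i.w): each example with
    margin below 1 contributes -y_i * x_i (0 is a valid choice at the kink)."""
    active = (y * (X @ w)) < 1.0
    return -(X[active] * y[active, None]).sum(axis=0)

def subgradient_descent(X, y, eta=0.01, T=1000):
    w = np.zeros(X.shape[1])
    for t in range(T):
        w = w - eta * hinge_subgradient(w, X, y)   # x_{t+1} = x_t - eta_t * g_t
    return w
```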


¡ Goal of machine learning:
§ Minimize expected loss given samples

¡ This is Stochastic Optimization
§ Assume the loss function is convex

¡ Process all examples together in each step

¡ The entire training set is examined at each step
¡ Very slow when n is very large

¡ "Optimize" one example at a time
¡ Choose examples randomly (or reorder and choose in order)
§ Learning is representative of the example distribution

¡ Equivalent to online learning (the weight vector w changes with every example)

¡ Convergence guaranteed for convex functions (to local minimum)

¡ Gradient descent:

Iterate until convergence:
• For j = 1 … d
  • Evaluate: ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w(j)
  • Update: w(j) ← w(j) − η·∇f(j)

η … learning rate parameter
C … regularization parameter

¡ Problem:
§ Computing ∇f(j) takes O(n) time!
§ n … size of the training dataset
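A minimal sketch of this batch gradient for the hinge-loss objective f(w,b); the sum over all n examples is what makes each step O(n):

```python
import numpy as np

def full_gradient(w, b, X, y, C):
    """Gradient of f(w,b) = 0.5*w.w + C*sum_i max(0, 1 - y_i*(x_i.w + b))."""
    margins = y * (X @ w + b)
    active = margins < 1.0                       # examples with non-zero hinge loss
    grad_w = w - C * (X[active] * y[active, None]).sum(axis=0)
    grad_b = -C * y[active].sum()
    return grad_w, grad_b
```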

¡ Stochastic Gradient Descent
§ Instead of evaluating the gradient over all examples, evaluate it for each individual training example:

∇f(j)(x_i) = w(j) + C·∂L(x_i, y_i)/∂w(j)

We just had: ∇f(j) = w(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w(j)

¡ Stochastic gradient descent:

Iterate until convergence:
• For i = 1 … n
  • For j = 1 … d
    • Compute: ∇f(j)(x_i)
    • Update: w(j) ← w(j) − η·∇f(j)(x_i)

Notice: no summation over i anymore
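A minimal sketch of one SGD epoch for the same objective, visiting one example at a time in random order as described above:

```python
import numpy as np

def sgd_epoch(w, b, X, y, C, eta, rng=np.random.default_rng(0)):
    """One pass over the data: update (w, b) using one example at a time."""
    for i in rng.permutation(len(y)):
        margin = y[i] * (X[i] @ w + b)
        if margin < 1.0:                       # hinge loss is active for this example
            w = w - eta * (w - C * y[i] * X[i])
            b = b + eta * C * y[i]
        else:
            w = w - eta * w                    # only the regularizer contributes
    return w, b
```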

¡ Convergence is very sensitive to the learning rate η (oscillations near the solution due to the probabilistic nature of sampling)
§ Might need to decrease η with time to ensure the algorithm converges eventually

¡ Basically – SGD is good for machine learning with large datasets!

¡ Stochastic – 1 example per iteration
¡ Batch – all the examples!
¡ Sample Average Approximation (SAA):
§ Sample m examples at each step and perform SGD on them

¡ Allows for parallelization, but the choice of m is based on heuristics
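A minimal sketch of this mini-batch (SAA) variant: sample m examples per step and use the average of their per-example gradients (the averaging and the default m are illustrative choices):

```python
import numpy as np

def minibatch_sgd(X, y, C, eta, m=32, T=1000, rng=np.random.default_rng(0)):
    """SAA / mini-batch SGD on the hinge-loss SVM objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(T):
        idx = rng.choice(n, size=m, replace=False)   # sample m examples
        Xb, yb = X[idx], y[idx]
        active = yb * (Xb @ w + b) < 1.0
        grad_w = w - (C / m) * (Xb[active] * yb[active, None]).sum(axis=0)
        grad_b = -(C / m) * yb[active].sum()
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b
```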


¡ Example by Leon Bottou:
§ Reuters RCV1 document corpus
§ Predict a category of a document
§ One vs. the rest classification

§ n = 781,000 training examples (documents)
§ 23,000 test examples
§ d = 50,000 features
§ One feature per word
§ Remove stop-words
§ Remove low-frequency words

¡ Questions:
§ (1) Is SGD successful at minimizing f(w,b)?
§ (2) How quickly does SGD find the min of f(w,b)?
§ (3) What is the error on a test set?

[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD-SVM]

(1) SGD-SVM is successful at minimizing the value of f(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable

[Figure: optimization quality |f(w,b) − f(w_opt, b_opt)| vs. training time for Conventional SVM and SGD-SVM. For optimizing f(w,b) within reasonable quality, SGD-SVM is super fast.]

¡ SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples

Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times

Theory says: gradient descent converges in linear time k; conjugate gradient converges in √k.
k … condition number

¡ Need to choose the learning rate η and t_0:

w_{t+1} ← w_t − (η / (t + t_0)) · ( w_t + C·∂L(x_i, y_i)/∂w )

¡ Leon suggests:
§ Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
§ Choose η:
  § Select a small subsample
  § Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
  § Pick the one that most reduces the cost
  § Use that η for the next 100k iterations on the full dataset
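A minimal sketch of the rate-selection heuristic above: try several η on a small subsample and keep the one that most reduces the cost (reuses the `sgd_epoch` and `svm_objective` sketches from earlier; all defaults are illustrative):

```python
import numpy as np

def pick_eta(X, y, C, candidate_etas=(10, 1, 0.1, 0.01, 0.001),
             sample_size=1000, rng=np.random.default_rng(0)):
    """Return the learning rate that most reduces f(w,b) on a small subsample."""
    idx = rng.choice(len(y), size=min(sample_size, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in candidate_etas:
        w, b = sgd_epoch(np.zeros(X.shape[1]), 0.0, Xs, ys, C, eta)
        cost = svm_objective(w, b, Xs, ys, C)
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```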

¡ Sparse Linear SVM:
§ The feature vector x_i is sparse (contains many zeros)
§ Do not do: x_i = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
§ But represent x_i as a sparse vector x_i = [(4,1), (9,5), …]

§ Can we do the SGD update more efficiently?

w ← w − η·( w + C·∂L(x_i, y_i)/∂w )

§ Approximated in 2 steps:
(1)  w ← w − η·C·∂L(x_i, y_i)/∂w     (cheap: x_i is sparse, so few coordinates j of w will be updated)
(2)  w ← w·(1 − η)                    (expensive: w is not sparse, all coordinates need to be updated)

¡ Solution 1: w = s·v
§ Represent the vector w as the product of a scalar s and a vector v
§ Then the update procedure is:
§ (1) v = v − η·C·∂L(x_i, y_i)/∂w
§ (2) s = s·(1 − η)

¡ Solution 2:
§ Perform only step (1) for each training example
§ Perform step (2) with lower frequency and higher η
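A minimal sketch of Solution 1: keep w = s·v, so the per-example step touches only the nonzero coordinates of x_i and the (1 − η) shrink is an O(1) rescaling of s. Here `v` is a dense array of length d and `x_i` a {index: value} dict, mirroring the sparse representation above:

```python
def sparse_sgd_update(s, v, x_i, y_i, C, eta):
    """One SGD update on w = s*v for a sparse example x_i = {index: value}."""
    # compute the margin y_i * (w . x_i), touching only the nonzeros of x_i
    margin = y_i * s * sum(v[j] * x_ij for j, x_ij in x_i.items())
    if margin < 1.0:                          # hinge loss active: dL/dw = -y_i * x_i
        for j, x_ij in x_i.items():           # step (1): update only nonzero coords
            v[j] += eta * C * y_i * x_ij      # (variants divide by s to keep w exact)
    s *= (1.0 - eta)                          # step (2): rescale the scalar, O(1)
    return s, v
```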

¡ Stopping criteria: how many iterations of SGD?
§ Early stopping with cross-validation
  § Create a validation set
  § Monitor the cost function on the validation set
  § Stop when the loss stops decreasing

§ Early stopping
  § Extract two disjoint subsamples A and B of the training data
  § Train on A, stop by validating on B
  § The number of epochs is an estimate of k
  § Train for k epochs on the full dataset
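A minimal sketch of the first variant: run SGD epoch by epoch and stop once the cost on a held-out validation set stops decreasing (reuses the earlier `sgd_epoch` and `svm_objective` sketches):

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_va, y_va, C, eta, max_epochs=100):
    """Run SGD epochs until the validation cost stops decreasing."""
    w, b = np.zeros(X_tr.shape[1]), 0.0
    best_cost, best_wb = np.inf, (w, b)
    for epoch in range(max_epochs):
        w, b = sgd_epoch(w, b, X_tr, y_tr, C, eta)
        cost = svm_objective(w, b, X_va, y_va, C)
        if cost >= best_cost:          # validation loss stopped decreasing
            break
        best_cost, best_wb = cost, (w, b)
    return best_wb
```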


¡ Reference: http://alex.smola.org/teaching/10-701-15/math.html

¡ Given dataset D = { (x_1, y_1), …, (x_n, y_n) }
¡ Loss function: L(θ, D) = (1/n)·Σ_{i=1}^{n} l(θ; x_i, y_i)

¡ For linear models: l(θ; x_i, y_i) = l(y_i, θᵀφ(x_i))
¡ Assumption: D is drawn IID from some distribution 𝒫.

¡ Problem: min_θ L(θ, D)

¡ Input: D
¡ Output: θ̄

Algorithm:
¡ Initialize θ_0
¡ For t = 0, …, T−1:
θ_{t+1} = θ_t − η_t·∇_θ l(y_t, θ_tᵀφ(x_t))

¡ θ̄ = ( Σ_{t=0}^{T−1} η_t·θ_t ) / ( Σ_{t=0}^{T−1} η_t )
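A minimal sketch of this procedure: plain SGD on a generic per-example gradient, returning the η-weighted average iterate θ̄ used in the bound that follows (names are illustrative):

```python
import numpy as np

def averaged_sgd(grad_l, data, theta0, etas):
    """theta_{t+1} = theta_t - eta_t * grad_l(theta_t, x_t, y_t);
    returns theta_bar = sum_t eta_t*theta_t / sum_t eta_t."""
    theta = np.asarray(theta0, dtype=float)
    weighted_sum, eta_sum = np.zeros_like(theta), 0.0
    for eta, (x, y) in zip(etas, data):
        weighted_sum += eta * theta          # accumulate eta_t * theta_t
        eta_sum += eta
        theta = theta - eta * grad_l(theta, x, y)
    return weighted_sum / eta_sum
```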

¡ Expected loss: s(θ) = E_𝒫[ l(y, θᵀφ(x)) ]
¡ Optimal expected loss: s* = s(θ*) = min_θ s(θ)

¡ Convergence:

E[ s(θ̄) ] − s*  ≤  ( R² + L²·Σ_{t=0}^{T−1} η_t² ) / ( 2·Σ_{t=0}^{T−1} η_t )

¡ Where: R = ‖θ_0 − θ*‖
¡ L = max ‖∇l(y, θᵀφ(x))‖

¡ Define r_t = ‖θ_t − θ*‖ and g_t = ∇_θ l(y_t, θ_tᵀφ(x_t))

¡ r_{t+1}² = r_t² + η_t²·‖g_t‖² − 2·η_t·(θ_t − θ*)ᵀ·g_t

¡ Taking expectation w.r.t. 𝒫 and using s* − s(θ_t) ≥ g_tᵀ·(θ* − θ_t) (convexity of s), we get:

E[ r_{t+1}² − r_t² ]  ≤  η_t²·L² + 2·η_t·( s* − E[ s(θ_t) ] )

¡ Taking the sum over t = 0, …, T−1:

E[ r_T² − r_0² ]  ≤  L²·Σ_{t=0}^{T−1} η_t²  +  2·Σ_{t=0}^{T−1} η_t·( s* − E[ s(θ_t) ] )

¡ Using convexity of s:

( Σ_{t=0}^{T−1} η_t )·E[ s(θ̄) ]  ≤  E[ Σ_{t=0}^{T−1} η_t·s(θ_t) ]

¡ Substituting into the expression above:

E[ r_T² − r_0² ]  ≤  L²·Σ_{t=0}^{T−1} η_t²  +  2·Σ_{t=0}^{T−1} η_t·( s* − E[ s(θ̄) ] )

¡ Since r_T² ≥ 0 and r_0 = R, rearranging the terms proves the result.