Expected Tensor Decomposition with Stochastic Gradient Descent


Transcript of Expected Tensor Decomposition with Stochastic Gradient Descent

Page 1:

Expected Tensor Decomposition with Stochastic Gradient Descent

Takanori Maehara (Shizuoka Univ) Kohei Hayashi (NII -> AIST) Ken-ichi Kawarabayashi (NII)

Presented at AAAI 2016

Page 2:

Information Overload

• Too many texts to catch up with

• How can we know what’s going on? --- Machine learning!

http://blogs.transparent.com/

Page 3:

How can we analyze this?

http://tweetfake.com/

Page 4:

Word Co-occurrence

[Figure: word × word co-occurrence matrix and its matrix decomposition]

Page 5:

Higher-order Information

• The idea generalizes to tensors

[Figure: a word × word × (hashtag / URL / user) tensor and its tensor decomposition]

Page 6:

Scalability Issue

• To compute the decomposition, we need the tensor beforehand.

• Sometimes the tensor is too huge to fit in memory.

Page 7:

Questions

• Can we get the decomposition without the tensor?

• If possible, when and how?

YES

• When: the tensor is given by a sum
• How: stochastic optimization

Page 8:

Revisit: Word Co-occurrence

• 𝑋: “heavy” tensor (e.g. dense); this is what we want to decompose
• 𝑋𝑡: “light” tensor (e.g. sparse), one per tweet (tweet 1, tweet 2, …, tweet 𝑇), so that 𝑋 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑇
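As a concrete illustration (not from the slides; the function name and the word-triplet encoding are my assumptions, loosely matching the "word triplets" used in the experiments later), one "light" tensor 𝑋𝑡 could be built from a single tweet as sparse co-occurrence counts:

```python
from collections import Counter
from itertools import permutations

def tweet_to_Xt(word_ids):
    """Sparse 3rd-order co-occurrence tensor of one tweet,
    stored as {(i, j, k): count}; no dense tensor is formed."""
    return Counter(permutations(word_ids, 3))

# Toy usage: word ids 0..3 standing in for a 4-word tweet.
print(tweet_to_Xt([0, 1, 2, 3]))
```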

Page 9:

CP Decomposition

min_{𝑈,𝑉,𝑊} 𝐿(𝑋; 𝑈, 𝑉, 𝑊)

• 𝐿(𝑋; 𝑈, 𝑉, 𝑊) = ||𝑋 − ∑_{𝑟=1}^{𝑅} 𝑢𝑟 ∘ 𝑣𝑟 ∘ 𝑤𝑟||²
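A minimal NumPy sketch of this objective (my illustration, not the authors' code; names like cp_loss are assumptions):

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """Sum of R rank-1 terms u_r ∘ v_r ∘ w_r (the CP model)."""
    return np.einsum('ir,jr,kr->ijk', U, V, W)

def cp_loss(X, U, V, W):
    """Squared Frobenius loss ||X - sum_r u_r ∘ v_r ∘ w_r||^2."""
    return np.sum((X - cp_reconstruct(U, V, W)) ** 2)

# Toy usage: a random 10x10x10 tensor and rank R = 3 factors.
rng = np.random.default_rng(0)
X = rng.random((10, 10, 10))
U, V, W = (rng.random((10, 3)) for _ in range(3))
print(cp_loss(X, U, V, W))
```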

Page 10:

Batch Optimization

• CP decomposition is non-convex
• Use alternating least squares (ALS)

ALS
Repeat until convergence:
• 𝑈 ← 𝑈 − 𝜂 𝛻𝑈𝐿(𝑋)
• 𝑉 ← 𝑉 − 𝜂 𝛻𝑉𝐿(𝑋)
• 𝑊 ← 𝑊 − 𝜂 𝛻𝑊𝐿(𝑋)
(𝜂: step size)
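A hedged sketch of one such batch pass in NumPy (illustrative only: the slide applies the three updates in turn, while this version computes all gradients from the same residual for brevity; the value of 𝜂 is an arbitrary choice):

```python
import numpy as np

def batch_grad_step(X, U, V, W, eta=1e-3):
    """One batch gradient step on L(X; U, V, W)."""
    E = X - np.einsum('ir,jr,kr->ijk', U, V, W)        # residual tensor
    # dL/dU[i,r] = -2 * sum_{j,k} E[i,j,k] V[j,r] W[k,r]; likewise for V, W.
    grad_U = -2 * np.einsum('ijk,jr,kr->ir', E, V, W)
    grad_V = -2 * np.einsum('ijk,ir,kr->jr', E, U, W)
    grad_W = -2 * np.einsum('ijk,ir,jr->kr', E, U, V)
    return U - eta * grad_U, V - eta * grad_V, W - eta * grad_W
```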

Page 11:

Key Fact: Loss is Decomposable

𝐿(𝑋) = 𝐿(𝑋1 + 𝑋2 + ⋯ + 𝑋𝑇) = 𝐿(𝑋1) + 𝐿(𝑋2) + ⋯ + 𝐿(𝑋𝑇) + const

where 𝐿(𝑋) = ||𝑋 − ∑_{𝑟=1}^{𝑅} 𝑢𝑟 ∘ 𝑣𝑟 ∘ 𝑤𝑟||²

Intuition: for a stochastic variable 𝑥 and a non-stochastic variable 𝑎,

E[(𝑥 − 𝑎)²] = E[𝑥²] − 2𝑎E[𝑥] + 𝑎² = (E[𝑥])² − 2𝑎E[𝑥] + 𝑎² + E[𝑥²] − (E[𝑥])² = (E[𝑥] − 𝑎)² + Var[𝑥]
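A quick numerical check of this intuition (my sketch, not from the paper): taking 𝑋 as the mean of the 𝑋𝑡 (the "expected tensor" of the title) and making the factor 𝑇 explicit, the gap between the summed per-slice losses and 𝑇 times the loss at the mean is a constant (a variance term) that does not depend on the model:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5
Xs = [rng.random((4, 4, 4)) for _ in range(T)]  # the "light" tensors X_t
Xbar = sum(Xs) / T                              # the expected ("heavy") tensor

def loss(X, M):
    """Squared loss against a fixed model tensor M."""
    return np.sum((X - M) ** 2)

for _ in range(3):                              # try several model tensors M
    M = rng.random((4, 4, 4))
    gap = sum(loss(Xt, M) for Xt in Xs) - T * loss(Xbar, M)
    print(round(gap, 10))  # the same constant each time
```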

Page 12:

Stochastic Optimization

• If 𝐿 = 𝐿1 + 𝐿2 + ⋯+ 𝐿𝑇, we can use 𝛻𝐿𝑡 instead of 𝛻𝐿

• Stochastic ALS (SALS):

SALS
for 𝑡 = 1, …, 𝑇:
• 𝑈 ← 𝑈 − 𝜂𝑡 𝛻𝑈𝐿(𝑋𝑡)
• 𝑉 ← 𝑉 − 𝜂𝑡 𝛻𝑉𝐿(𝑋𝑡)
• 𝑊 ← 𝑊 − 𝜂𝑡 𝛻𝑊𝐿(𝑋𝑡)

ALS
Repeat until convergence:
• 𝑈 ← 𝑈 − 𝜂 𝛻𝑈𝐿(𝑋)
• 𝑉 ← 𝑉 − 𝜂 𝛻𝑉𝐿(𝑋)
• 𝑊 ← 𝑊 − 𝜂 𝛻𝑊𝐿(𝑋)
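A minimal SALS sketch in NumPy under the setup above (not the authors' implementation; the initialization scale and the 1/𝑡 step-size schedule are illustrative assumptions):

```python
import numpy as np

def sals(Xs, R, eta0=0.1, seed=0):
    """One stochastic pass over the light tensors X_t."""
    rng = np.random.default_rng(seed)
    I, J, K = Xs[0].shape
    U, V, W = (0.1 * rng.standard_normal((n, R)) for n in (I, J, K))
    for t, Xt in enumerate(Xs, start=1):
        eta = eta0 / t                                # decaying step size eta_t
        E = Xt - np.einsum('ir,jr,kr->ijk', U, V, W)  # residual on X_t only
        gU = -2 * np.einsum('ijk,jr,kr->ir', E, V, W)
        gV = -2 * np.einsum('ijk,ir,kr->jr', E, U, W)
        gW = -2 * np.einsum('ijk,ir,jr->kr', E, U, V)
        U, V, W = U - eta * gU, V - eta * gV, W - eta * gW
    return U, V, W

# Toy usage on random slices:
rng = np.random.default_rng(2)
Xs = [rng.random((6, 6, 6)) for _ in range(50)]
U, V, W = sals(Xs, R=3)
```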

Page 13:

Observations

• ALS and SALS converge to the same solutions
– (𝐿ALS − 𝐿SALS) = const

• SALS does not require precomputing 𝑋
– Batch gradient: 𝛻𝐿(𝑋)
– Stochastic gradient: 𝛻𝐿(𝑋𝑡)

• One step of SALS is lighter (see the sketch below)
– 𝑋𝑡 is sparse, whereas 𝑋 is not
– 𝛻𝐿(𝑋𝑡) is also sparse
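The sketch referenced above illustrates why a SALS step stays cheap when 𝑋𝑡 is sparse: the gradient with respect to 𝑈 splits into a small R × R model part plus a part touching only the non-zeros, so the dense tensor is never formed (the function name and the (i, j, k, value) encoding are my assumptions):

```python
import numpy as np

def grad_U_sparse(nnz, U, V, W):
    """Gradient of L(X_t) w.r.t. U for X_t given as sparse entries.

    nnz: iterable of (i, j, k, value) non-zeros of X_t.
    """
    grad = 2 * U @ ((V.T @ V) * (W.T @ W))   # model term: only R x R work
    for i, j, k, val in nnz:                 # data term: O(nnz * R) work
        grad[i] -= 2 * val * (V[j] * W[k])
    return grad

# Toy usage: a tensor with a single non-zero entry.
rng = np.random.default_rng(3)
U, V, W = (rng.random((5, 2)) for _ in range(3))
print(grad_U_sparse([(0, 1, 2, 1.0)], U, V, W))
```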

Page 14:

Related Work

• Element-wise SGD: a special case of SALS

Page 15:

Experiment

• Amazon Review dataset (word triplets)
• 10M reviews, 280K vocabulary words, 500M non-zeros

SGD almost converges before the batch method even starts

Page 16:

Experiment (Cont’d)

• 34M reviews, 520K vocabulary words, 1G non-zeros

ALS failed (exhausting 256 GB of memory)

Page 17:

More on Paper

• 2nd-order SALS
– Uses the block diagonal of the Hessian
– Faster convergence, low sensitivity to 𝜂

• With an l2-norm regularizer (see the sketch after this list):
– min 𝐿(𝑋; 𝑈, 𝑉, 𝑊) + 𝑅(𝑈, 𝑉, 𝑊), where 𝑅(𝑈, 𝑉, 𝑊) = 𝜆(||𝑈||² + ||𝑉||² + ||𝑊||²)

• Theoretical analysis of convergence
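A hedged sketch of how the l2 regularizer changes one SALS update (my illustration; 𝜆 and 𝜂 values are arbitrary). The gradient of 𝜆||𝑈||² is 2𝜆𝑈, so each step just gains a weight-decay term:

```python
import numpy as np

def sals_step_l2(Xt, U, V, W, eta=0.01, lam=1e-3):
    """One SALS update on X_t with the l2 regularizer R(U, V, W)."""
    E = Xt - np.einsum('ir,jr,kr->ijk', U, V, W)
    gU = -2 * np.einsum('ijk,jr,kr->ir', E, V, W) + 2 * lam * U
    gV = -2 * np.einsum('ijk,ir,kr->jr', E, U, W) + 2 * lam * V
    gW = -2 * np.einsum('ijk,ir,jr->kr', E, U, V) + 2 * lam * W
    return U - eta * gU, V - eta * gV, W - eta * gW
```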

Page 18:

Further Applications

• Online Learning

[Figure: topics tracked online over time, e.g. {news, stocks, market}, {train, line, delay, 遅 (Japanese: "delayed")}, {AKB, artists, image}, {earthquake, M5.3}, {drama, CSI, TV}; Hayashi et al., KDD 2015]

Page 19:

Take Home Message

If the tensor is given by a sum …

• Don't use the batch gradient

• Use the stochastic gradient!