Parallel asynchronous inference of word senses with Microsoft Azure


Sergey Bartunov, MSU

Learning as optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

where $r(\theta)$ is the regularizer, $f_i(x_i;\theta)$ is the loss on object $x_i$, and $\theta$ are the parameters.

Learning as optimization

• N can be huge

• regularizer and loss can be complex

• parameters' dimensionality can be very large


Commodity PC is not enough!

Learning word embeddings

For each word find its embedding such that similar words have close embeddings

[Embedding-space illustration: word clusters {Java, Platform, .NET, Mono}, {Railways, Ticket, Train}, {Politics, Party, Socialism}]

Learning word embeddings

…compiled for a specific hardware platform, since different central processor…

object: a word and its context

loss: $-\log p(v \mid w)$, where
$$p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$$

parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$

Skip-gram (Mikolov et al., 2013)
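A minimal sketch of this loss in Julia, under simplifying assumptions: hypothetical D×V matrices A and B hold the input and output embeddings column-wise, and the softmax is computed densely over the whole vocabulary (real skip-gram implementations replace it with hierarchical softmax or negative sampling).

# Toy skip-gram loss -log p(v | w) for one (word, context word) pair.
# A, B: D x V matrices of input and output embeddings (one column per word).
# The dense softmax is fine for a sketch, far too slow for a 400k vocabulary.
function skipgram_loss(A::Matrix{Float64}, B::Matrix{Float64}, w::Int, v::Int)
    scores = B' * A[:, w]                   # scores[v′] = A_w ⋅ B_v′
    m = maximum(scores)
    logZ = m + log(sum(exp.(scores .- m)))  # log Σ_v′ exp(A_w ⋅ B_v′), stabilized
    return logZ - scores[v]                 # -log p(v | w)
end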

Gradient optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

gradient descent: $\theta_{t+1} = \theta_t - \gamma_t \nabla F(\theta_t)$

Stochastic optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

stochastic gradient descent: $\theta_{t+1} = \theta_t - \gamma_t G(\theta_t)$, where $\mathbb{E}\,G(\theta) = \nabla F(\theta)$

for example: $G(\theta) = \nabla\bigl[r(\theta) + N f_j(x_j; \theta)\bigr]$, $j \sim \mathrm{Uniform}(1, N)$
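A minimal sketch of this update rule in Julia; grad_r and grad_f are hypothetical callbacks standing in for ∇r(θ) and ∇f_j(x_j; θ), and γ is a constant step size for simplicity.

# One pass of stochastic gradient descent with the unbiased estimator above.
function sgd!(θ::Vector{Float64}, xs::Vector, grad_r, grad_f; γ = 0.01, steps = length(xs))
    N = length(xs)
    for t in 1:steps
        j = rand(1:N)                            # j ~ Uniform(1, N)
        G = grad_r(θ) .+ N .* grad_f(xs[j], θ)   # E[G(θ)] = ∇F(θ)
        θ .-= γ .* G                             # θ_{t+1} = θ_t - γ_t G(θ_t)
    end
    return θ
end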

Learning word embeddings

• 400k word vocabulary, 300-dimensional embeddings

• 240 million parameters to train! (see the check below)

• 1 GB memory snapshot
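A quick check of these figures, assuming one input and one output 300-dimensional vector per word, stored as 32-bit floats:

$$2 \times 400{,}000 \times 300 = 2.4 \times 10^8 \ \text{parameters}, \qquad 2.4 \times 10^8 \times 4\ \text{bytes} \approx 0.96\ \text{GB} \approx 1\ \text{GB}$$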

Stochastic parallel optimization

[Diagram: K worker cores (core 1, core 2, …, core K) each consume their own stream of data and read/update a single set of shared parameters]

no synchronization!! (see e.g. the Hogwild! paper)
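A minimal Hogwild-style sketch in Julia (assumes Julia is started with several threads, e.g. julia -t 8, and reuses the hypothetical grad_r / grad_f callbacks from the SGD sketch above; updates to the shared vector θ are deliberately left unsynchronized, as in the lock-free scheme just described):

using Base.Threads

# Hogwild-style parallel SGD: every thread samples objects and updates the
# same shared parameter vector θ with no locking.
function hogwild!(θ::Vector{Float64}, xs::Vector, grad_r, grad_f; γ = 0.01, steps = 10_000)
    N = length(xs)
    @threads for t in 1:steps
        j = rand(1:N)
        G = grad_r(θ) .+ N .* grad_f(xs[j], θ)
        θ .-= γ .* G          # racy update on purpose: no synchronization
    end
    return θ
end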

Stochastic parallel optimization

Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)

• My laptop (2 cores, 8 GB RAM): 22 hours

• On Azure: 2 hours

Learning polysemic word embeddings

[Embedding-space illustration: "platform" now gets one vector per sense, one in each cluster: Platform (1) with {Java, .NET, Mono}, Platform (2) with {Railways, Ticket, Train}, Platform (3) with {Politics, Party, Socialism}]

Learning polysemic word embeddings

…compiled for a specific hardware platform, since different central processor… (computer meaning)
loss: $-\log p(v \mid w, z = 1)$

…as the safe distance from the platform edge increases with the speed… (railway meaning)
loss: $-\log p(v \mid w, z = 2)$

… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning)
loss: $-\log p(v \mid w, z = 3)$

$$p(v \mid w, z = k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$$

word meanings are unobserved

Learning polysemic word embeddings

$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \;\to\; \max_{A, B}$$

word meanings are unobserved, hence an EM algorithm must be employed

• How to choose a prior that can automatically increase the number of word meanings when necessary? Answer: Bayesian nonparametrics (Orbanz, 2014)

• How to fit the EM procedure into the stochastic optimization framework? Answer: stochastic variational inference (Blei et al., 2012)
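One standard nonparametric prior of this kind is a per-word stick-breaking (Dirichlet process) prior over senses; that this exact construction is the one used here is an assumption on my part:

$$\beta_{wk} \sim \mathrm{Beta}(1, \alpha), \qquad p(z = k \mid w) = \beta_{wk} \prod_{r=1}^{k-1} (1 - \beta_{wr}), \quad k = 1, 2, \dots$$

A priori the number of senses is unbounded, but the prior mass decays with k, so only senses supported by the data keep appreciable posterior weight (compare the expected_pi output in the Results below, where most of the 30 allowed senses get negligible probability).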

EM algorithm

E-step: disambiguate the word given its context

… Socialist Party; the Socialist Workers Platform and the Committee for a…

p(z = politics) = 0.96,  p(z = transport) = 0.01,  p(z = computer) = 0.03

M-step: update word embeddings by weighted gradient

$$\theta_{t+1} = \theta_t + \gamma_t \nabla \Bigl[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta_t) \Bigr]$$
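A minimal sketch of one such E/M update in Julia, under toy assumptions: A is a D×K×V array with one input vector per (sense, word), B is a D×V output matrix, prior is a hypothetical K×V matrix of per-word sense probabilities, grad_loglik_A is a hypothetical callback returning ∂ log p(v | w, z = k) / ∂ A[:, k, w], and the posterior is computed from a single context word (the real E-step combines the whole context). This only illustrates the computation above, not the AdaGram.jl implementation.

# E-step: posterior over the K senses of word w given one context word v:
#   p(z = k | w, v) ∝ p(z = k | w) · p(v | w, z = k)
function sense_posterior(A, B, prior, w::Int, v::Int)
    K = size(A, 2)
    logp = [log(prior[k, w]) + loglik(A, B, w, k, v) for k in 1:K]
    logp .-= maximum(logp)                 # stabilize before exponentiating
    p = exp.(logp)
    return p ./ sum(p)
end

# log p(v | w, z = k) with a dense softmax over the vocabulary (toy version)
function loglik(A, B, w::Int, k::Int, v::Int)
    scores = B' * A[:, k, w]
    m = maximum(scores)
    return scores[v] - (m + log(sum(exp.(scores .- m))))
end

# M-step (one stochastic step): move each sense vector of w along the
# posterior-weighted gradient of its log-likelihood, as in the update above.
function em_step!(A, B, prior, w::Int, v::Int, grad_loglik_A; γ = 0.01)
    p = sense_posterior(A, B, prior, w, v)
    for k in eachindex(p)
        A[:, k, w] .+= γ .* p[k] .* grad_loglik_A(A, B, w, k, v)
    end
    return p
end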

Learning polysemic word embeddings

• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word

• 7.2 billion parameters to train!

• 18 GB memory snapshot

Learning polysemic word embeddings

Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)

• My laptop (2 cores, 8 GB RAM): 6 days!

• On Azure: 16 hours

Results

julia> expected_pi(vm, dict.word2id["cloud"])
30-element Array{Float64,1}:
 0.404964
 0.134444
 0.0987207
 0.361865
 5.70338e-6
 5.18419e-7
 4.7129e-8
 4.28446e-9
 3.89496e-10
 3.54087e-11
 ⋮

Results

julia> nearest_neighbors(vm, dict, "cloud", 1)
10-element Array{(Any,Any,Any),1}:
 ("clouds",1,0.791538f0)
 ("haze",2,0.6702103f0)
 ("nimbostratus",1,0.653774f0)
 ("altostratus",1,0.6300289f0)
 ("noctilucent",1,0.6294991f0)
 ("cumulonimbus",1,0.6289225f0)
 ("stratocumulus",1,0.6274564f0)
 ("cumulus",2,0.6273055f0)
 ("clouds",2,0.6201524f0)
 ("cirrostratus",1,0.6146165f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 2)
10-element Array{(Any,Any,Any),1}:
 ("louis",5,0.5705162f0)
 ("vrain",1,0.55054826f0)
 ("lucie",1,0.52579653f0)
 ("clair",1,0.52284604f0)
 ("johns",2,0.5215208f0)
 ("marys",1,0.5036709f0)
 ("nazianz",1,0.4979607f0)
 ("lawrence",2,0.49513188f0)
 ("missouri",3,0.49284995f0)
 ("joseph",2,0.4928328f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 3)
10-element Array{(Any,Any,Any),1}:
 ("computing",1,0.7052178f0)
 ("middleware",1,0.68975633f0)
 ("cloud-based",1,0.6546666f0)
 ("context-aware",1,0.6417114f0)
 ("enterprise",1,0.63958025f0)
 ("virtualization",1,0.6359488f0)
 ("soa",1,0.6349716f0)
 ("distributed",1,0.6310058f0)
 ("unicore",1,0.62737936f0)
 ("client-server",1,0.6239226f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 4)
10-element Array{(Any,Any,Any),1}:
 ("mist",1,0.56100917f0)
 ("clouds",3,0.54695433f0)
 ("fire",5,0.53125167f0)
 ("flame",3,0.52561617f0)
 ("dragon",1,0.5224602f0)
 ("sorceror",1,0.5199405f0)
 ("shining",2,0.5165066f0)
 ("shadow",1,0.516233f0)
 ("mysterious",2,0.5153119f0)
 ("smoke",3,0.51471066f0)
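Reading the neighbor lists, the four senses of "cloud" that received non-negligible weight in expected_pi above apparently correspond to (1) weather clouds, (2) "St. Cloud"-style place and saint names, (3) cloud computing, and (4) a fiction/fantasy usage.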

Results

julia> disambiguate(vm, dict, "cloud", split("weather forecast cold rainy"))
30-element Array{Float64,1}:
 0.999278
 9.49993e-7
 1.52921e-8
 0.000720983
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮

Results

julia> disambiguate(vm, dict, "cloud", split("multi-core virtual machine"))
30-element Array{Float64,1}:
 0.000243637
 6.16926e-5
 0.998918
 0.000776869
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
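As expected from the neighbor lists, the weather-related context puts almost all posterior mass on sense 1 (the meteorological sense), while the computing-related context selects sense 3 (the cloud-computing sense).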

and thanks to Microsoft Research and the Microsoft Azure team!

Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov

project page: bayesgroup.ru/adagram
sources: github.com/sbos/AdaGram.jl