Parallel asynchronous inference of word senses with Microsoft Azure


Sergey Bartunov, MSU

Learning as optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

where $r(\theta)$ is the regularizer, $f_i(x_i;\theta)$ is the loss on object $x_i$, and $\theta$ are the parameters.

Learning as optimization

• N can be huge

• regularizer and loss can be complex

• parameters' dimensionality can be very large


Commodity PC is not enough!

Learning word embeddings

For each word find its embedding such that similar words have close embeddings

[Embedding-space illustration: word clusters {Java, Platform, .NET, Mono}, {Railways, Ticket, Train}, {Politics, Party, Socialism}]

Learning word embeddings

…compiled for a specific hardware platform, since different central processor…

object: a word and its context

loss: $-\log p(v \mid w)$, where
$$p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$$

parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$

Skip-gram (Mikolov et al., 2013)
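A minimal sketch of this loss in Julia, under simplifying assumptions: hypothetical D×V matrices A and B hold the input and output embeddings column-wise, and the softmax is computed densely over the whole vocabulary (real skip-gram implementations replace it with hierarchical softmax or negative sampling).

# Toy skip-gram loss -log p(v | w) for one (word, context word) pair.
# A, B: D x V matrices of input and output embeddings (one column per word).
# The dense softmax is fine for a sketch, far too slow for a 400k vocabulary.
function skipgram_loss(A::Matrix{Float64}, B::Matrix{Float64}, w::Int, v::Int)
    scores = B' * A[:, w]                   # scores[v′] = A_w ⋅ B_v′
    m = maximum(scores)
    logZ = m + log(sum(exp.(scores .- m)))  # log Σ_v′ exp(A_w ⋅ B_v′), stabilized
    return logZ - scores[v]                 # -log p(v | w)
end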

Gradient optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

gradient descent: $\theta_{t+1} = \theta_t - \gamma_t \nabla F(\theta_t)$

Stochastic optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \;\to\; \min_{\theta}$$

stochastic gradient descent: $\theta_{t+1} = \theta_t - \gamma_t G(\theta_t)$, where $\mathbb{E}\,G(\theta) = \nabla F(\theta)$

for example: $G(\theta) = \nabla\bigl[r(\theta) + N f_j(x_j; \theta)\bigr]$, $j \sim \mathrm{Uniform}(1, N)$
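A minimal sketch of this update rule in Julia; grad_r and grad_f are hypothetical callbacks standing in for ∇r(θ) and ∇f_j(x_j; θ), and γ is a constant step size for simplicity.

# One pass of stochastic gradient descent with the unbiased estimator above.
function sgd!(θ::Vector{Float64}, xs::Vector, grad_r, grad_f; γ = 0.01, steps = length(xs))
    N = length(xs)
    for t in 1:steps
        j = rand(1:N)                            # j ~ Uniform(1, N)
        G = grad_r(θ) .+ N .* grad_f(xs[j], θ)   # E[G(θ)] = ∇F(θ)
        θ .-= γ .* G                             # θ_{t+1} = θ_t - γ_t G(θ_t)
    end
    return θ
end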

Learning word embeddings

• 400k word vocabulary, 300-dimensional embeddings

• 240 million parameters to train! (see the check below)

• 1 GB memory snapshot
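A quick check of these figures, assuming one input and one output 300-dimensional vector per word, stored as 32-bit floats:

$$2 \times 400{,}000 \times 300 = 2.4 \times 10^8 \ \text{parameters}, \qquad 2.4 \times 10^8 \times 4\ \text{bytes} \approx 0.96\ \text{GB} \approx 1\ \text{GB}$$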

Stochastic parallel optimization

[Diagram: K worker cores (core 1, core 2, …, core K) each consume their own stream of data and read/update a single set of shared parameters]

no synchronization!! (see e.g. the Hogwild! paper)
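A minimal Hogwild-style sketch in Julia (assumes Julia is started with several threads, e.g. julia -t 8, and reuses the hypothetical grad_r / grad_f callbacks from the SGD sketch above; updates to the shared vector θ are deliberately left unsynchronized, as in the lock-free scheme just described):

using Base.Threads

# Hogwild-style parallel SGD: every thread samples objects and updates the
# same shared parameter vector θ with no locking.
function hogwild!(θ::Vector{Float64}, xs::Vector, grad_r, grad_f; γ = 0.01, steps = 10_000)
    N = length(xs)
    @threads for t in 1:steps
        j = rand(1:N)
        G = grad_r(θ) .+ N .* grad_f(xs[j], θ)
        θ .-= γ .* G          # racy update on purpose: no synchronization
    end
    return θ
end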

Stochastic parallel optimization

Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)

• My laptop (2 cores, 8 GB RAM): 22 hours

• On Azure: 2 hours

Learning polysemic word embeddings

[Embedding-space illustration: "platform" now gets one vector per sense, one in each cluster: Platform (1) with {Java, .NET, Mono}, Platform (2) with {Railways, Ticket, Train}, Platform (3) with {Politics, Party, Socialism}]

Learning polysemic word embeddings

…compiled for a specific hardware platform, since different central processor… (computer meaning)
loss: $-\log p(v \mid w, z = 1)$

…as the safe distance from the platform edge increases with the speed… (railway meaning)
loss: $-\log p(v \mid w, z = 2)$

… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning)
loss: $-\log p(v \mid w, z = 3)$

$$p(v \mid w, z = k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$$

word meanings are unobserved

Learning polysemic word embeddings

$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \;\to\; \max_{A, B}$$

word meanings are unobserved, hence an EM algorithm must be employed

• How to choose a prior that can automatically increase the number of word meanings when necessary? Answer: Bayesian nonparametrics (Orbanz, 2014)

• How to fit the EM procedure into the stochastic optimization framework? Answer: stochastic variational inference (Blei et al., 2012)
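One standard nonparametric prior of this kind is a per-word stick-breaking (Dirichlet process) prior over senses; that this exact construction is the one used here is an assumption on my part:

$$\beta_{wk} \sim \mathrm{Beta}(1, \alpha), \qquad p(z = k \mid w) = \beta_{wk} \prod_{r=1}^{k-1} (1 - \beta_{wr}), \quad k = 1, 2, \dots$$

A priori the number of senses is unbounded, but the prior mass decays with k, so only senses supported by the data keep appreciable posterior weight (compare the expected_pi output in the Results below, where most of the 30 allowed senses get negligible probability).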

EM algorithm

E-step: disambiguate the word given its context

… Socialist Party; the Socialist Workers Platform and the Committee for a…

p(z = politics) = 0.96,  p(z = transport) = 0.01,  p(z = computer) = 0.03

M-step: update word embeddings by weighted gradient

$$\theta_{t+1} = \theta_t + \gamma_t \nabla \Bigl[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta_t) \Bigr]$$
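A minimal sketch of one such E/M update in Julia, under toy assumptions: A is a D×K×V array with one input vector per (sense, word), B is a D×V output matrix, prior is a hypothetical K×V matrix of per-word sense probabilities, grad_loglik_A is a hypothetical callback returning ∂ log p(v | w, z = k) / ∂ A[:, k, w], and the posterior is computed from a single context word (the real E-step combines the whole context). This only illustrates the computation above, not the AdaGram.jl implementation.

# E-step: posterior over the K senses of word w given one context word v:
#   p(z = k | w, v) ∝ p(z = k | w) · p(v | w, z = k)
function sense_posterior(A, B, prior, w::Int, v::Int)
    K = size(A, 2)
    logp = [log(prior[k, w]) + loglik(A, B, w, k, v) for k in 1:K]
    logp .-= maximum(logp)                 # stabilize before exponentiating
    p = exp.(logp)
    return p ./ sum(p)
end

# log p(v | w, z = k) with a dense softmax over the vocabulary (toy version)
function loglik(A, B, w::Int, k::Int, v::Int)
    scores = B' * A[:, k, w]
    m = maximum(scores)
    return scores[v] - (m + log(sum(exp.(scores .- m))))
end

# M-step (one stochastic step): move each sense vector of w along the
# posterior-weighted gradient of its log-likelihood, as in the update above.
function em_step!(A, B, prior, w::Int, v::Int, grad_loglik_A; γ = 0.01)
    p = sense_posterior(A, B, prior, w, v)
    for k in eachindex(p)
        A[:, k, w] .+= γ .* p[k] .* grad_loglik_A(A, B, w, k, v)
    end
    return p
end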

Learning polysemic word embeddings

• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word

• 7.2 billion parameters to train!

• 18 GB memory snapshot

Learning polysemic word embeddings

Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)

• My laptop (2 cores, 8 GB RAM): 6 days!

• On Azure: 16 hours

Results

julia> expected_pi(vm, dict.word2id["cloud"])
30-element Array{Float64,1}:
 0.404964
 0.134444
 0.0987207
 0.361865
 5.70338e-6
 5.18419e-7
 4.7129e-8
 4.28446e-9
 3.89496e-10
 3.54087e-11
 ⋮

Results

julia> nearest_neighbors(vm, dict, "cloud", 1)
10-element Array{(Any,Any,Any),1}:
 ("clouds",1,0.791538f0)
 ("haze",2,0.6702103f0)
 ("nimbostratus",1,0.653774f0)
 ("altostratus",1,0.6300289f0)
 ("noctilucent",1,0.6294991f0)
 ("cumulonimbus",1,0.6289225f0)
 ("stratocumulus",1,0.6274564f0)
 ("cumulus",2,0.6273055f0)
 ("clouds",2,0.6201524f0)
 ("cirrostratus",1,0.6146165f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 2)
10-element Array{(Any,Any,Any),1}:
 ("louis",5,0.5705162f0)
 ("vrain",1,0.55054826f0)
 ("lucie",1,0.52579653f0)
 ("clair",1,0.52284604f0)
 ("johns",2,0.5215208f0)
 ("marys",1,0.5036709f0)
 ("nazianz",1,0.4979607f0)
 ("lawrence",2,0.49513188f0)
 ("missouri",3,0.49284995f0)
 ("joseph",2,0.4928328f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 3)
10-element Array{(Any,Any,Any),1}:
 ("computing",1,0.7052178f0)
 ("middleware",1,0.68975633f0)
 ("cloud-based",1,0.6546666f0)
 ("context-aware",1,0.6417114f0)
 ("enterprise",1,0.63958025f0)
 ("virtualization",1,0.6359488f0)
 ("soa",1,0.6349716f0)
 ("distributed",1,0.6310058f0)
 ("unicore",1,0.62737936f0)
 ("client-server",1,0.6239226f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 4)
10-element Array{(Any,Any,Any),1}:
 ("mist",1,0.56100917f0)
 ("clouds",3,0.54695433f0)
 ("fire",5,0.53125167f0)
 ("flame",3,0.52561617f0)
 ("dragon",1,0.5224602f0)
 ("sorceror",1,0.5199405f0)
 ("shining",2,0.5165066f0)
 ("shadow",1,0.516233f0)
 ("mysterious",2,0.5153119f0)
 ("smoke",3,0.51471066f0)
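Reading the neighbor lists, the four senses of "cloud" that received non-negligible weight in expected_pi above apparently correspond to (1) weather clouds, (2) "St. Cloud"-style place and saint names, (3) cloud computing, and (4) a fiction/fantasy usage.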

Results

julia> disambiguate(vm, dict, "cloud", split("weather forecast cold rainy"))
30-element Array{Float64,1}:
 0.999278
 9.49993e-7
 1.52921e-8
 0.000720983
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮

Results

julia> disambiguate(vm, dict, "cloud", split("multi-core virtual machine"))
30-element Array{Float64,1}:
 0.000243637
 6.16926e-5
 0.998918
 0.000776869
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
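As expected from the neighbor lists, the weather-related context puts almost all posterior mass on sense 1 (the meteorological sense), while the computing-related context selects sense 3 (the cloud-computing sense).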

and thanks to Microsoft Research and the Microsoft Azure team!

Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov

project page: bayesgroup.ru/adagram
sources: github.com/sbos/AdaGram.jl