Parallel asynchronous inference of word senses with Microsoft Azure
Transcript of Parallel asynchronous inference of word senses with Microsoft Azure
Learning as optimization
• N can be huge
• regularizer and loss can be complex
• parameters' dimensionality can be very large

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$$

where $r(\theta)$ is the regularizer, $f_i(x_i; \theta)$ the loss on object $x_i$, and $\theta$ the parameters.
Commodity PC is not enough!
Learning word embeddings
For each word, find an embedding such that similar words have close embeddings.
(Figure: example clusters in embedding space: {Java, Platform, .NET, Mono}, {Railways, Ticket, Train}, {Politics, Party, Socialism})
Learning word embeddings
…compiled for a specific hardware platform, since different central processor…

object: word and its context
loss: $-\log p(v|w)$, where $p(v|w) = \dfrac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$
parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$
Skip-gram (Mikolov et al, 2013)
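The loss above can be computed directly. Below is a minimal Python sketch with toy dimensions; `softmax_loss` and the log-sum-exp trick are illustrative choices, not the actual skip-gram implementation:

```python
import math

def softmax_loss(A_w, B, v):
    """-log p(v|w), where p(v|w) = exp(A_w . B_v) / sum_v' exp(A_w . B_v')."""
    scores = [sum(a * b for a, b in zip(A_w, B_v)) for B_v in B]
    m = max(scores)  # log-sum-exp with the max subtracted for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[v]

A_w = [1.0, 0.0]                           # input embedding of the word w
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # output embeddings, 3-word vocabulary
loss = softmax_loss(A_w, B, 0)             # -log p(v=0 | w)
```

Exponentiating the negative losses over all context words recovers a proper probability distribution, which is a quick sanity check on the normalizer.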
Stochastic optimization

$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$$

stochastic gradient descent: $\theta_{t+1} = \theta_t - \lambda_t G(\theta_t)$, where $\mathbb{E}\, G(\theta) = \nabla F(\theta)$

for example: $G(\theta) = \nabla \left[ r(\theta) + N f_j(x_j; \theta) \right]$, $j \sim \mathrm{Uniform}(1, N)$
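The update above can be sketched in a few lines. This toy Python example minimizes $F$ with a quadratic per-object loss and an optional quadratic regularizer; the function name and the $1/\sqrt{t}$ step-size schedule are illustrative assumptions, not prescribed by the slide:

```python
import random

def sgd(xs, steps=5000, lr=0.05, reg=0.0, seed=0):
    """Minimize F(theta) = r(theta) + sum_i f_i(x_i; theta) with the toy
    choices f_i = 0.5*(theta - x_i)^2 and r = 0.5*reg*theta^2.
    Each step uses the unbiased estimate G = r'(theta) + N*f_j'(theta),
    j ~ Uniform(1, N)."""
    rng = random.Random(seed)
    N = len(xs)
    theta = 0.0
    for t in range(1, steps + 1):
        j = rng.randrange(N)                  # pick one object at random
        G = reg * theta + N * (theta - xs[j])  # stochastic gradient
        theta -= (lr / (N * t ** 0.5)) * G     # decaying step size
    return theta

xs = [1.0, 2.0, 3.0, 4.0]
theta_hat = sgd(xs)  # with reg=0 the minimizer is the mean of xs, 2.5
```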
Learning word embeddings
• 400k word vocabulary, 300-dimensional embeddings
• 240 million parameters to train!
• 1 GB memory snapshot
Stochastic parallel optimization
(Diagram: cores 1…K each stream data and update one set of shared parameters)
no synchronization!! (see e.g. the Hogwild paper)
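The lock-free scheme can be mimicked in Python, with the caveat that CPython threads do not run truly in parallel because of the GIL; this sketch only illustrates the Hogwild-style pattern of several workers updating shared parameters with no lock, and all names and numbers in it are illustrative:

```python
import random
import threading

# shared parameters: a scalar stored in a list so every thread
# reads and writes the same storage; no lock protects it
theta = [0.0]
xs = [1.0, 2.0, 3.0, 4.0]

def worker(seed, steps=3000, lr=0.02):
    rng = random.Random(seed)
    for t in range(1, steps + 1):
        x = xs[rng.randrange(len(xs))]
        g = theta[0] - x                   # gradient of 0.5*(theta - x)^2
        theta[0] -= (lr / t ** 0.5) * g    # racy read-modify-write, no lock

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
# theta[0] ends up near the mean of xs despite the unsynchronized updates
```

The point of the Hogwild result is that for sparse problems these races cost little: occasional lost updates only slow convergence slightly instead of breaking it.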
Stochastic parallel optimization
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 22 hours
On Microsoft Azure: 2 hours
Learning polysemic word embeddings
(Figure: "Platform" splits into three sense embeddings: Platform (1) near {Java, .NET, Mono}; Platform (2) near {Railways, Ticket, Train}; Platform (3) near {Politics, Party, Socialism})
Learning polysemic word embeddings
…compiled for a specific hardware platform, since different central processor… (computer meaning), loss: $-\log p(v|w, z=1)$
…as the safe distance from the platform edge increases with the speed… (railway meaning), loss: $-\log p(v|w, z=2)$
… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning), loss: $-\log p(v|w, z=3)$

$$p(v|w, z=k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$$

word meanings are unobserved
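The sense-conditional softmax above can be sketched by giving each word one input embedding per sense while sharing the output embeddings; the toy dictionary layout here is an assumption for illustration, not the AdaGram data structure:

```python
import math

def sense_log_prob(A, B, w, k, v):
    """log p(v | w, z=k): the same softmax as before, but with a
    separate input embedding A[w][k] for each sense k of word w."""
    a = A[w][k]
    scores = [sum(x * y for x, y in zip(a, b)) for b in B]
    m = max(scores)  # log-sum-exp for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[v] - log_z

# toy model: one word with two senses, 3-word context vocabulary, D = 2
A = {"platform": [[1.0, 0.0], [0.0, 1.0]]}  # one input embedding per sense
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # shared output embeddings
```

Each sense defines its own distribution over context words, so the two senses of "platform" above favor different contexts.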
Learning polysemic word embeddings

$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A, B}$$

word meanings are unobserved, hence an EM algorithm must be employed
• How to choose a prior that allows the number of word meanings to grow automatically when necessary? Bayesian nonparametrics (Orbanz, 2014)
• How to fit the EM procedure into the stochastic optimization framework? Stochastic variational inference (Blei et al, 2012)
EM algorithm

E-step: disambiguate the word given its context
… Socialist Party; the Socialist Workers Platform and the Committee for a…
p(z = politics) = 0.96, p(z = transport) = 0.01, p(z = computer) = 0.03

M-step: update word embeddings by a weighted gradient step

$$\theta_{t+1} = \theta_t + \lambda_t \nabla \left[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta_t) \right]$$
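One EM iteration can be sketched as follows; `e_step`, `m_step`, the scalar parameter, and all numbers are illustrative assumptions, not the actual AdaGram update:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def e_step(log_prior, loglik_per_sense):
    """E-step: posterior over senses given the context,
    p(z=k | context) proportional to p(z=k) * prod_j p(v_j | w, z=k)."""
    return softmax([lp + ll for lp, ll in zip(log_prior, loglik_per_sense)])

def m_step(theta, grads_per_sense, posterior, lr):
    """M-step: ascend the posterior-weighted gradient,
    theta += lr * sum_k p(z=k) * grad_k  (scalars here for brevity)."""
    return theta + lr * sum(p * g for p, g in zip(posterior, grads_per_sense))

# toy numbers: three senses; the second one explains the context best
post = e_step([math.log(1 / 3)] * 3, [-8.0, -1.0, -6.0])
theta = m_step(0.0, [0.5, 2.0, -1.0], post, lr=0.1)
```

The posterior concentrates on the sense whose softmax best explains the observed context, so its gradient dominates the weighted update, exactly the disambiguation-then-update loop on the slide.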
Learning polysemic word embeddings
• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word
• 7.2 billion parameters to train!
• 18 GB memory snapshot
Learning polysemic word embeddings
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 6 days!
On Microsoft Azure: 16 hours
Results

julia> expected_pi(vm, dict.word2id["cloud"])
30-element Array{Float64,1}:
 0.404964
 0.134444
 0.0987207
 0.361865
 5.70338e-6
 5.18419e-7
 4.7129e-8
 4.28446e-9
 3.89496e-10
 3.54087e-11
 ⋮

Results

julia> nearest_neighbors(vm, dict, "cloud", 1)
10-element Array{(Any,Any,Any),1}:
 ("clouds",1,0.791538f0)
 ("haze",2,0.6702103f0)
 ("nimbostratus",1,0.653774f0)
 ("altostratus",1,0.6300289f0)
 ("noctilucent",1,0.6294991f0)
 ("cumulonimbus",1,0.6289225f0)
 ("stratocumulus",1,0.6274564f0)
 ("cumulus",2,0.6273055f0)
 ("clouds",2,0.6201524f0)
 ("cirrostratus",1,0.6146165f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 2)
10-element Array{(Any,Any,Any),1}:
 ("louis",5,0.5705162f0)
 ("vrain",1,0.55054826f0)
 ("lucie",1,0.52579653f0)
 ("clair",1,0.52284604f0)
 ("johns",2,0.5215208f0)
 ("marys",1,0.5036709f0)
 ("nazianz",1,0.4979607f0)
 ("lawrence",2,0.49513188f0)
 ("missouri",3,0.49284995f0)
 ("joseph",2,0.4928328f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 3)
10-element Array{(Any,Any,Any),1}:
 ("computing",1,0.7052178f0)
 ("middleware",1,0.68975633f0)
 ("cloud-based",1,0.6546666f0)
 ("context-aware",1,0.6417114f0)
 ("enterprise",1,0.63958025f0)
 ("virtualization",1,0.6359488f0)
 ("soa",1,0.6349716f0)
 ("distributed",1,0.6310058f0)
 ("unicore",1,0.62737936f0)
 ("client-server",1,0.6239226f0)

Results

julia> nearest_neighbors(vm, dict, "cloud", 4)
10-element Array{(Any,Any,Any),1}:
 ("mist",1,0.56100917f0)
 ("clouds",3,0.54695433f0)
 ("fire",5,0.53125167f0)
 ("flame",3,0.52561617f0)
 ("dragon",1,0.5224602f0)
 ("sorceror",1,0.5199405f0)
 ("shining",2,0.5165066f0)
 ("shadow",1,0.516233f0)
 ("mysterious",2,0.5153119f0)
 ("smoke",3,0.51471066f0)
Results

julia> disambiguate(vm, dict, "cloud", split("weather forecast cold rainy"))
30-element Array{Float64,1}:
 0.999278
 9.49993e-7
 1.52921e-8
 0.000720983
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮

Results

julia> disambiguate(vm, dict, "cloud", split("multi-core virtual machine"))
30-element Array{Float64,1}:
 0.000243637
 6.16926e-5
 0.998918
 0.000776869
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
and thanks to Microsoft Research and the Microsoft Azure team!
Dmitry Kondrashkin Anton Osokin Dmitry P. Vetrov
project page: bayesgroup.ru/adagram sources: github.com/sbos/AdaGram.jl