Query Expansion with Locally-Trained Word Embeddings (ACL 2016)

Posted on 15-Apr-2017


Query Expansion with Locally-Trained Word Embeddings

Fernando Diaz, Bhaskar Mitra, Nick Craswell (Microsoft)

[Figure: a distribution p(d) over documents d, and the distribution p(d|q) for a query q]

cut

global      local*
cutting     tax
squeeze     deficit
reduce      vote
slash       budget
reduction   reduction
spend       house
lower       bill
halve       plan
soften      spend
freeze      billion

global: trained using full corpus

local: trained using topically-

*gas

[Figure, panels: global, local] t-SNE projection: top words by p̃(d|q) (blue: query; red: top words by p(d|q))

• local term clustering [Lesk, 1968; Attar and Fraenkel, 1977]

• local latent semantic analysis [Hull, 1994; Hull, 1995; Schutze et al., 1995; Singhal et al., 1997]

• local document clustering [Tombros and van Rijsbergen, 2001; Tombros et al., 2002; Willett, 1985]

• one sense per discourse [Gale et al., 1992]

[Diagram: query, target corpus, results]

q = [gas:1.0 tax:1.0 petroleum:0.0 tariff:0.0 …]

query = gas tax


d = [gas:0.0 tax:0.0 petroleum:0.7 tariff:0.5 …]


W = [ …
      gas:  petroleum:0.9  indigestion:0.6  …
      tax:  tariff:0.7     strain:0.4       …
      … ]

expanded q = [gas:1.0 tax:1.0 petroleum:0.8 tariff:0.6 …]

d = [gas:0.0 tax:0.0 petroleum:0.7 tariff:0.5 …]

W = UUᵀ

U: m × k embedding matrix
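A minimal sketch of how a term-term matrix W = UUᵀ could be used to expand the "gas tax" query. The vocabulary, the values in U, the row normalization, and the `expand` helper are all illustrative assumptions; the slides do not specify how expansion weights are derived from W.

```python
import numpy as np

# Toy vocabulary and a hypothetical m x k embedding matrix U (values invented).
vocab = ["gas", "tax", "petroleum", "tariff", "indigestion"]
U = np.array([
    [1.0, 0.0],    # gas
    [0.0, 1.0],    # tax
    [0.9, 0.1],    # petroleum
    [0.1, 0.9],    # tariff
    [0.3, -0.2],   # indigestion
])
# Assumption: unit-length rows, so entries of W are cosine similarities.
U = U / np.linalg.norm(U, axis=1, keepdims=True)

W = U @ U.T  # m x m term-term similarity matrix

q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # query = "gas tax"
scores = W @ q  # similarity each vocabulary term receives from the query terms


def expand(q, scores, n_new=2):
    """Hypothetical helper: add the n_new best non-query terms, weighted by score."""
    candidates = np.where(q == 0)[0]
    top = candidates[np.argsort(-scores[candidates])[:n_new]]
    q_exp = q.copy()
    q_exp[top] = scores[top]
    return q_exp


q_exp = expand(q, scores)
for term, weight in zip(vocab, q_exp):
    print(f"{term}: {weight:.2f}")
```

With these toy values, "petroleum" and "tariff" receive non-zero weight while "indigestion" stays at zero, so the expanded query overlaps with a document that mentions only petroleum and tariff.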

[Figure: distributions p(d), p(d|q), and p̃(d|q) over documents d for a query q]

[Diagram: query, target corpus, results; query, external corpus, results]

\[
U =
\begin{cases}
\text{uniform } p(d) \text{ on the target corpus} \\
\text{uniform } p(d) \text{ on an external corpus} \\
p(d \mid q) \text{ on the target corpus} \\
p(d \mid q) \text{ on an external corpus}
\end{cases}
\]
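The four training conditions differ only in which corpus is used and how documents are sampled from it. The sketch below contrasts a uniform p(d) with a query-dependent p(d|q). The softmax over retrieval scores, the score values, and all function names are assumptions made for illustration; the slides do not state how p(d|q) is computed.

```python
import numpy as np


def uniform_p(n_docs):
    """Uniform p(d): every document is equally likely to be sampled."""
    return np.full(n_docs, 1.0 / n_docs)


def query_p(scores):
    """Hypothetical p(d|q): softmax over retrieval scores (an assumption)."""
    z = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return z / z.sum()


def sample_training_docs(p, n_samples, seed=0):
    """Draw document indices with replacement according to p."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(p), size=n_samples, replace=True, p=p)


scores = np.array([5.0, 4.0, 0.5, 0.1])  # made-up retrieval scores for four documents
p_global = uniform_p(len(scores))
p_local = query_p(scores)

# Document indices that would form the training sample for a local embedding.
local_ids = sample_training_docs(p_local, n_samples=1000)
```

Under these assumptions the local sample is dominated by the highest-scoring documents, whereas the uniform distribution treats all documents alike regardless of the query.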

collection   docs         words        queries
trec12       469,949      438,338      150
robust       528,155      665,128      250
web          50,220,423   90,411,624   200

global                 local
target                 target
wikipedia+gigaword*    gigaword†
google news*           wikipedia†

*publicly available embedding; †publicly available external corpus


[Figure: local vs global. NDCG@10 (axis 0.0 to 0.5) on trec12, robust, and web, by expansion: none, global, local]

[Figure: local embedding. NDCG@10 (axis 0.0 to 0.5) on trec12, robust, and web, by corpus: target, gigaword, wikipedia]
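For reference, a minimal sketch of NDCG@10, the metric reported in the results. It assumes the common exponential-gain form (2^rel - 1) with a log2 rank discount, and it assumes the input list holds the graded relevance labels of all judged documents in ranked order. The slides do not say which NDCG variant was used.

```python
import math


def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels, k=10):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering of the labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# A ranking that places the most relevant documents first scores higher.
good_ranking = ndcg_at_k([3, 2, 1, 0])
poor_ranking = ndcg_at_k([0, 1, 2, 3])
```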

• local embedding provides a stronger representation than global embedding

• potential impact for other topic-specific natural language processing tasks

• future work

    • effectiveness improvements

    • efficiency improvements