Hierarchical Dirichlet Trees for Information Retrieval
Gholamreza Haffari, Simon Fraser University
Yee Whye Teh, University College London
NAACL talk, Boulder, June 2009
2
Outline
• Information Retrieval (IR) using Language Models
  – Hierarchical Dirichlet Document (HDD) model
• Hierarchical Dirichlet Trees
  – The model, Inference
  – Learning the Tree and Parameters
• Experiments
  – Experimental Results, Analysis
• Conclusion
4
Ad-hoc Information Retrieval
• In ad-hoc information retrieval, we are given
  – a collection of documents d1, ..., dn
  – a query q
• The task is to return a list of documents sorted by their relevance to q
5
A Language Modeling approach to IR
• Build a language model (LM) for each document di
• Sort documents based on the probability of generating the query q using their language model P(q|LM(di) )
• For simplicity, we can make the bag-of-words assumption, i.e. the terms are generated independently and identically
– So our LM is a multinomial distribution whose dimension is the size of the dictionary
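The ranking rule above can be sketched as follows; `rank_documents` and the toy corpus are names invented for this sketch, and the unsmoothed maximum-likelihood estimate shown here is exactly the estimator whose sparsity problem the next slides address:

```python
from collections import Counter

def rank_documents(docs, query):
    """Rank documents by query likelihood under a per-document
    unigram (bag-of-words) language model with MLE estimates."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        total = sum(counts.values())
        # P(q | LM(d)) = product of per-term probabilities
        p = 1.0
        for term in query.split():
            p *= counts[term] / total  # zero if term unseen: the sparsity problem
        scores[doc_id] = p
    return sorted(scores, key=scores.get, reverse=True)
```

A query term that never occurs in a document drives its score to zero, which is why the next slide introduces collection-wide smoothing.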
6
The Collection Model
• Because of data sparsity, training a LM on a single document gives poor performance
• We should smooth the document LMs using a collection-wide model
– A Dirichlet distribution can be used to summarize collection-wide information
– This Dirichlet distribution is used as a prior for documents’ language models
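A minimal sketch of this smoothing idea, using the standard Dirichlet-prior form (c(w,d) + mu * P(w|C)) / (|d| + mu); the function name and the choice of mu are illustrative, not the talk's own notation:

```python
def dirichlet_smoothed_lm(doc_counts, collection_counts, mu=2000.0):
    """Return P(w|d) under a Dirichlet prior centred on the collection
    model: (c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())
    def prob(w):
        p_coll = collection_counts.get(w, 0) / coll_len
        return (doc_counts.get(w, 0) + mu * p_coll) / (doc_len + mu)
    return prob
```

Terms unseen in the document now get a small, collection-derived probability instead of zero.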
7
The Hierarchical Dirichlet Distribution
Uniform distribution at the top level
[Figure taken from Phil Cowans, 2004; remaining detail lost in transcription]
8
The Hierarchical Dirichlet Distribution
• The HDD model is intuitively appealing
– It is reasonable to assume that the LMs of individual documents vary (to some extent) around a common model
• By making the common mean a random variable, instead of fixing it beforehand, information is shared across documents
– It leads to an inverse document frequency effect (Cowans, 2004)
• But the model has a deficiency
– There is no way to tell it that a pair of words should be positively or negatively correlated in the learned LM (an effect similar to query expansion)
10
Injecting Prior Knowledge
• Represent the word inter-dependencies with a binary tree
  – Correlated words are placed nearby in the tree, at the leaf level
  – We can use WordNet or a word clustering algorithm
• The tree can represent a multinomial distribution over words (we will see this shortly), which we call the multinomial-tree distribution
  – In the model, we replace the flat multinomial distributions with these multinomial-tree distributions
• Instead of the Dirichlet distribution, we use a prior called the Dirichlet-tree distribution
11
Multinomial-Tree Distribution
• Given a tree and the probability of choosing each of a node's children
  – The probability of reaching a particular leaf is the product of the probabilities on the unique path from the root to that leaf
  – We call the resulting distribution over the leaves the multinomial-tree distribution
• Example: the root branches with probabilities .2 and .8; the .2 branch splits .7/.3, giving leaf probabilities .2 × .7 = .14 and .2 × .3 = .06, so the resulting leaf multinomial is (.14, .06, .8)
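The leaf-probability computation in the example can be written as a short recursion; the nested-list tree encoding is an assumption of this sketch:

```python
def leaf_probabilities(tree):
    """Multiply branch probabilities along each root-to-leaf path.
    `tree` is either a leaf label (str) or a list of
    (branch_probability, subtree) pairs."""
    if isinstance(tree, str):
        return {tree: 1.0}
    probs = {}
    for branch_prob, subtree in tree:
        for leaf, p in leaf_probabilities(subtree).items():
            probs[leaf] = branch_prob * p
    return probs

# The slide's example: the root splits .2/.8, the .2 branch splits .7/.3
tree = [(0.2, [(0.7, "w1"), (0.3, "w2")]), (0.8, "w3")]
```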
12
Dirichlet-Tree Distribution
• Put a Dirichlet distribution over each node's probability distribution for selecting its children
  – e.g. Dirichlet(.2, .8) at the root and Dirichlet(.3, .7) at an internal node
• The resulting prior over the multinomial distribution at the leaf level (p1, p2, p3) is called the Dirichlet-tree distribution
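One way to see the Dirichlet-tree prior is by sampling from it: draw branch probabilities from each node's Dirichlet, then multiply down the paths. This sketch assumes a hypothetical encoding where each branch carries its Dirichlet pseudo-count, and uses the standard normalized-Gamma construction of a Dirichlet draw:

```python
import random

def sample_leaf_multinomial(tree, rng):
    """Draw one multinomial over leaves from a Dirichlet-tree prior.
    `tree` is a leaf label (str) or a list of (alpha, subtree) pairs,
    where alpha is the Dirichlet pseudo-count for that branch."""
    if isinstance(tree, str):
        return {tree: 1.0}
    # A Dirichlet(alpha_1, ..., alpha_m) draw is a vector of independent
    # Gamma(alpha_i, 1) variates normalized by their sum.
    gammas = [rng.gammavariate(alpha, 1.0) for alpha, _ in tree]
    total = sum(gammas)
    probs = {}
    for g, (_, subtree) in zip(gammas, tree):
        for leaf, p in sample_leaf_multinomial(subtree, rng).items():
            probs[leaf] = (g / total) * p
    return probs
```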
13
Hierarchical Dirichlet Tree Model
[Model definition lost in transcription: for each node k in the tree, an equation defined over the parent-child node pairs (k, l) on the path from the root of the tree to a word at the leaf level.]
14
Inference in the HDT Model
• We do approximate inference by making the minimum oracle assumption
  – Each time a word is seen in a document, increment the counts of the nodes on its path to the root (in the local tree) by one
  – The nodes on the path from the root to a word (in the global tree) are asked only the first time the term is seen in each document, and never asked subsequently
[Figure: the global tree and a document's local tree, each showing the path from the root to a word w]
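The counting scheme described above can be sketched as follows, with a `parent` map standing in for the tree (all names are illustrative):

```python
def path_to_root(parent, node):
    """Nodes on the path from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def accumulate_counts(parent, document):
    """Local-tree counts: every occurrence of a word increments all
    nodes on its path to the root. Global-tree counts: a word's path
    is counted only on its first occurrence in the document."""
    local, global_ = {}, {}
    seen = set()
    for word in document:
        for node in path_to_root(parent, word):
            local[node] = local.get(node, 0) + 1
        if word not in seen:
            seen.add(word)
            for node in path_to_root(parent, word):
                global_[node] = global_.get(node, 0) + 1
    return local, global_
```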
15
The Document Score
• Our document score can be thought of as fixing the top-level hyper-parameters to good values beforehand and then integrating out the document-specific parameters
• Hence the score for document d, i.e. the relevance score, is a product over tree nodes involving the counts nk and nl [equation lost in transcription]
16
Learning the Tree Structure
• We used three agglomerative clustering algorithms to build the tree structure over the vocabulary words
  – Brown clustering (BCluster)
  – Distributional clustering (DCluster)
  – Probabilistic hierarchical clustering (PCluster)
• Since inference involves visiting the nodes on the path from a word to the root, we would like the leaf nodes to have low average depth
  – We introduced a tree simplification operator that changes the structure of the tree without losing much information
  – It contracts a subset of nodes in the tree that have a particular property
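As an illustration of agglomerative tree building (a toy greedy clusterer, not any of the three specific algorithms above), repeatedly merging the two closest clusters yields a binary word tree:

```python
def agglomerative_tree(vectors):
    """Greedy bottom-up clustering: repeatedly merge the two clusters
    whose centroids are closest (squared Euclidean distance).
    Returns a nested tuple: a leaf is a word, an internal node a pair."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    # Each cluster is (tree, centroid, size)
    clusters = [(word, vec, 1) for word, vec in vectors.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (t1, c1, n1), (t2, c2, n2) = clusters[i], clusters[j]
        # Weighted centroid of the merged cluster
        merged = tuple((a * n1 + b * n2) / (n1 + n2) for a, b in zip(c1, c2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), merged, n1 + n2))
    return clusters[0][0]
```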
17
Simplification of the Tree
• For each node k, consider the length of the path from k to its closest leaf [the symbol for this quantity was lost in transcription]
  – A value of 1 denotes nodes just above a leaf node
  – Larger values denote nodes further up the tree
• If we have long branches in the tree, we can keep the root and make the leaves its immediate children
  – This is achieved algorithmically by contracting every internal node other than the root
• Suppose the structure near the leaves is important, but the other internal nodes are less so
  – Nodes whose distance to the closest leaf exceeds a threshold can be contracted
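The contraction operator can be sketched as below; the nested-tuple tree encoding and the `should_contract` predicate are assumptions of this sketch:

```python
def depth_to_leaf(tree):
    """Length of the shortest path from this node down to a leaf."""
    if isinstance(tree, str):
        return 0
    return 1 + min(depth_to_leaf(c) for c in tree)

def contract(tree, should_contract):
    """Remove internal nodes for which `should_contract` (applied to the
    already-contracted subtree) is true, splicing their children into
    the parent; the root itself is never removed."""
    if isinstance(tree, str):
        return tree
    children = []
    for child in tree:
        child = contract(child, should_contract)
        if not isinstance(child, str) and should_contract(child):
            children.extend(child)   # promote grandchildren
        else:
            children.append(child)
    return tuple(children)
```

Contracting every internal node, for example, flattens the tree into the root with all leaves as immediate children.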
18
Learning the Parameters
• We constrain the HDT model to be centred on the flat HDD model
  – We only learn the hyper-parameters [their symbol was lost in transcription]
  – A Gamma prior is placed over them, and the MAP estimate is found by optimizing the objective function with L-BFGS
• We set the centering values so that the tree induces the same distribution over the document LMs as the flat HDD model
[Figure: the global tree, annotated with the hyper-parameter values at its nodes; detail lost in transcription]
20
Experiments
• We present results on two datasets
• The baseline methods: (1) The flat HDD model (Cowans 2004), (2) Okapi BM25, (3) Okapi BM25 with query expansion
• The comparison criteria
  – Top-10 precision
  – Average precision
Dataset     # docs   # queries   dictionary size
Cranfield   1400     225         4227
Medline     1033     30          8800
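The two comparison criteria can be computed as follows; these are the standard definitions of precision at rank 10 and (non-interpolated) average precision:

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each
    relevant document, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```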
21
Results
[Results table lost in transcription]
22
Precision-Recall graph for Cranfield
[Figure lost in transcription; x-axis: recall, y-axis: precision]
23
Analysis
• When the learned value at node k exceeds its centering value times the value at parent(k) [the symbols were lost in transcription], there is positive correlation in selecting the children of parent(k)
  – If a child has been selected, it is likely that more children will be selected
• This coincides with intuition
  – BCluster and DCluster produce trees that put similar-meaning words nearby
  – PCluster tends to put words with high co-occurrence nearby
[Accompanying table/figure lost; surviving values: 0.9044, 0.7977, 0.3344]
24
Examples of the Learned Trees
25
Conclusion
• We presented a hierarchical Dirichlet tree model for information retrieval
  – It can inject (semantic or syntactic) word relationships as domain knowledge into a probabilistic model
• The model uses a tree that captures the relationships among words
  – We investigated the effect of different tree-building algorithms and their simplification
• Future research includes
  – Scaling the method up to larger datasets
  – Using WordNet or Wikipedia to build the tree
26
Merci
Thank You
27
Inference in HDD Model
• Let the common mean be m [definition lost in transcription]
  – The oracle is asked the first time each term is seen in each document, and never asked subsequently
• After integrating out document j's LM parameters, the score is [equation lost in transcription]
  – One part of the equation gives an effect similar to term-frequency inverse-document-frequency (TF-IDF)
  – The other part normalizes for the document length
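For reference, the classic TF-IDF score that this effect resembles can be computed as follows (this is the textbook formula, not the model's own score):

```python
import math

def tf_idf_score(query, doc, docs):
    """Classic TF-IDF: term frequency in the document times the log
    inverse document frequency over the collection."""
    n = len(docs)
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)
        if tf and df:
            score += tf * math.log(n / df)
    return score
```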
28
Learning the Parameters
• We put a Gamma(b·mk + 1, b) prior over each hyper-parameter, where mk is the mode of that Gamma distribution [the original symbols were lost in transcription]
  – Setting the modes appropriately reduces the model to HDD at the mode of the Gamma distribution
• We used L-BFGS to find the MAP values; the derivative of the objective function with respect to each hyper-parameter is [equation lost in transcription]
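As a check on this parameterization: a Gamma density with shape b·m + 1 and rate b has its mode at m, so centering the prior this way makes the flat-model value the MAP in the absence of data. A numerical sketch, with a simple grid search standing in for L-BFGS:

```python
import math

def log_gamma_prior(x, m, b):
    """Unnormalised log-density of Gamma(shape=b*m + 1, rate=b).
    d/dx [(b*m)*log(x) - b*x] = b*m/x - b = 0  at  x = m."""
    return (b * m) * math.log(x) - b * x

def argmax_on_grid(f, lo, hi, steps=10000):
    """Crude 1-D maximiser: evaluate f on a grid and take the best point."""
    xs = [lo + (hi - lo) * i / steps for i in range(1, steps)]
    return max(xs, key=f)
```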