Hierarchical Dirichlet Trees for Information Retrieval
Gholamreza Haffari, Simon Fraser University
Yee Whye Teh, University College London
NAACL talk, Boulder, June 2009
2
Outline
• Information Retrieval (IR) using Language Models
  – Hierarchical Dirichlet Document (HDD) model
• Hierarchical Dirichlet Trees
  – The model, Inference
  – Learning the Tree and Parameters
• Experiments
  – Experimental Results, Analysis
• Conclusion
4
Ad-hoc Information Retrieval
• In ad-hoc information retrieval, we are given
  – a collection of documents d1, ..., dn
  – a query q
• The task is to return a list of documents sorted by their relevance to q
5
A Language Modeling approach to IR
• Build a language model (LM) for each document di
• Sort documents based on the probability of generating the query q using their language model P(q|LM(di) )
• For simplicity, we can make the bag-of-words assumption, i.e. the terms are generated independently and identically
– So our LM is a multinomial distribution whose dimension is the size of the dictionary
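The ranking rule above can be sketched as follows; `rank_documents` and the toy corpus are names invented for this sketch, and the unsmoothed maximum-likelihood estimate shown here is exactly the estimator whose sparsity problem the next slides address:

```python
from collections import Counter

def rank_documents(docs, query):
    """Rank documents by query likelihood under a per-document
    unigram (bag-of-words) language model with MLE estimates."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        total = sum(counts.values())
        # P(q | LM(d)) = product of per-term probabilities
        p = 1.0
        for term in query.split():
            p *= counts[term] / total  # zero if term unseen: the sparsity problem
        scores[doc_id] = p
    return sorted(scores, key=scores.get, reverse=True)
```

A query term that never occurs in a document drives its score to zero, which is why the next slide introduces collection-wide smoothing.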
6
The Collection Model
• Because of data sparsity, training a LM on a single document gives poor performance
• We should smooth the document LMs using a collection-wide model
– A Dirichlet distribution can be used to summarize collection-wide information
– This Dirichlet distribution is used as a prior for documents’ language models
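A minimal sketch of this smoothing idea, using the standard Dirichlet-prior form (c(w,d) + mu * P(w|C)) / (|d| + mu); the function name and the choice of mu are illustrative, not the talk's own notation:

```python
def dirichlet_smoothed_lm(doc_counts, collection_counts, mu=2000.0):
    """Return P(w|d) under a Dirichlet prior centred on the collection
    model: (c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())
    def prob(w):
        p_coll = collection_counts.get(w, 0) / coll_len
        return (doc_counts.get(w, 0) + mu * p_coll) / (doc_len + mu)
    return prob
```

Terms unseen in the document now get a small, collection-derived probability instead of zero.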
7
The Hierarchical Dirichlet Distribution
Uniform distribution at the top level
[Figure taken from Phil Cowans, 2004; remaining detail lost in transcription]
8
The Hierarchical Dirichlet Distribution
• The HDD model is intuitively appealing
– It is reasonable to assume that the LMs of individual documents vary (to some extent) around a common model
• By making the common mean a random variable, instead of fixing it beforehand, information is shared across documents
– It leads to an inverse document frequency effect (Cowans, 2004)
• But the model has a deficiency
– There is no way to tell it that a pair of words should be positively or negatively correlated in the learned LM (an effect similar to query expansion)
10
Injecting Prior Knowledge
• Represent the word inter-dependencies with a binary tree
  – Correlated words are placed nearby in the tree, at the leaf level
  – We can use WordNet or a word clustering algorithm
• The tree can represent a multinomial distribution over words (we will see this shortly), which we call the multinomial-tree distribution
  – In the model, we replace the flat multinomial distributions with these multinomial-tree distributions
• Instead of the Dirichlet distribution, we use a prior called the Dirichlet-tree distribution
11
Multinomial-Tree Distribution
• Given a tree and the probability of choosing each of a node's children
  – The probability of reaching a particular leaf is the product of the probabilities on the unique path from the root to that leaf
  – We call the resulting distribution over the leaves the multinomial-tree distribution
• Example: the root branches with probabilities .2 and .8; the .2 branch splits .7/.3, giving leaf probabilities .2 × .7 = .14 and .2 × .3 = .06, so the resulting leaf multinomial is (.14, .06, .8)
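The leaf-probability computation in the example can be written as a short recursion; the nested-list tree encoding is an assumption of this sketch:

```python
def leaf_probabilities(tree):
    """Multiply branch probabilities along each root-to-leaf path.
    `tree` is either a leaf label (str) or a list of
    (branch_probability, subtree) pairs."""
    if isinstance(tree, str):
        return {tree: 1.0}
    probs = {}
    for branch_prob, subtree in tree:
        for leaf, p in leaf_probabilities(subtree).items():
            probs[leaf] = branch_prob * p
    return probs

# The slide's example: the root splits .2/.8, the .2 branch splits .7/.3
tree = [(0.2, [(0.7, "w1"), (0.3, "w2")]), (0.8, "w3")]
```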
12
Dirichlet-Tree Distribution
• Put a Dirichlet distribution over each node's probability distribution for selecting its children
  – e.g. Dirichlet(.2, .8) at the root and Dirichlet(.3, .7) at an internal node
• The resulting prior over the multinomial distribution at the leaf level (p1, p2, p3) is called the Dirichlet-tree distribution
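One way to see the Dirichlet-tree prior is by sampling from it: draw branch probabilities from each node's Dirichlet, then multiply down the paths. This sketch assumes a hypothetical encoding where each branch carries its Dirichlet pseudo-count, and uses the standard normalized-Gamma construction of a Dirichlet draw:

```python
import random

def sample_leaf_multinomial(tree, rng):
    """Draw one multinomial over leaves from a Dirichlet-tree prior.
    `tree` is a leaf label (str) or a list of (alpha, subtree) pairs,
    where alpha is the Dirichlet pseudo-count for that branch."""
    if isinstance(tree, str):
        return {tree: 1.0}
    # A Dirichlet(alpha_1, ..., alpha_m) draw is a vector of independent
    # Gamma(alpha_i, 1) variates normalized by their sum.
    gammas = [rng.gammavariate(alpha, 1.0) for alpha, _ in tree]
    total = sum(gammas)
    probs = {}
    for g, (_, subtree) in zip(gammas, tree):
        for leaf, p in sample_leaf_multinomial(subtree, rng).items():
            probs[leaf] = (g / total) * p
    return probs
```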
13
Hierarchical Dirichlet Tree Model
[Model definition lost in transcription: for each node k in the tree, an equation defined over the parent-child node pairs (k, l) on the path from the root of the tree to a word at the leaf level.]
14
Inference in the HDT Model
• We do approximate inference by making the minimum oracle assumption
  – Each time a word is seen in a document, increment the counts of the nodes on its path to the root (in the local tree) by one
  – The nodes on the path from the root to a word (in the global tree) are asked only the first time the term is seen in each document, and never asked subsequently
[Figure: the global tree and a document's local tree, each showing the path from the root to a word w]
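The counting scheme described above can be sketched as follows, with a `parent` map standing in for the tree (all names are illustrative):

```python
def path_to_root(parent, node):
    """Nodes on the path from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def accumulate_counts(parent, document):
    """Local-tree counts: every occurrence of a word increments all
    nodes on its path to the root. Global-tree counts: a word's path
    is counted only on its first occurrence in the document."""
    local, global_ = {}, {}
    seen = set()
    for word in document:
        for node in path_to_root(parent, word):
            local[node] = local.get(node, 0) + 1
        if word not in seen:
            seen.add(word)
            for node in path_to_root(parent, word):
                global_[node] = global_.get(node, 0) + 1
    return local, global_
```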
15
The Document Score
• Our document score can be thought of as fixing the top-level hyper-parameters to good values beforehand and then integrating out the document-specific parameters
• Hence the score for document d, i.e. the relevance score, is a product over tree nodes involving the counts nk and nl [equation lost in transcription]
16
Learning the Tree Structure
• We used three agglomerative clustering algorithms to build the tree structure over the vocabulary words
  – Brown clustering (BCluster)
  – Distributional clustering (DCluster)
  – Probabilistic hierarchical clustering (PCluster)
• Since inference involves visiting the nodes on the path from a word to the root, we would like the leaf nodes to have low average depth
  – We introduced a tree simplification operator that changes the structure of the tree without losing much information
  – It contracts a subset of nodes in the tree that have a particular property
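As an illustration of agglomerative tree building (a toy greedy clusterer, not any of the three specific algorithms above), repeatedly merging the two closest clusters yields a binary word tree:

```python
def agglomerative_tree(vectors):
    """Greedy bottom-up clustering: repeatedly merge the two clusters
    whose centroids are closest (squared Euclidean distance).
    Returns a nested tuple: a leaf is a word, an internal node a pair."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    # Each cluster is (tree, centroid, size)
    clusters = [(word, vec, 1) for word, vec in vectors.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (t1, c1, n1), (t2, c2, n2) = clusters[i], clusters[j]
        # Weighted centroid of the merged cluster
        merged = tuple((a * n1 + b * n2) / (n1 + n2) for a, b in zip(c1, c2))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), merged, n1 + n2))
    return clusters[0][0]
```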
17
Simplification of the Tree
• For each node k, consider the length of the path from k to its closest leaf [the symbol for this quantity was lost in transcription]
  – A value of 1 denotes nodes just above a leaf node
  – Larger values denote nodes further up the tree
• If we have long branches in the tree, we can keep the root and make the leaves its immediate children
  – This is achieved algorithmically by contracting every internal node other than the root
• Suppose the structure near the leaves is important, but the other internal nodes are less so
  – Nodes whose distance to the closest leaf exceeds a threshold can be contracted
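The contraction operator can be sketched as below; the nested-tuple tree encoding and the `should_contract` predicate are assumptions of this sketch:

```python
def depth_to_leaf(tree):
    """Length of the shortest path from this node down to a leaf."""
    if isinstance(tree, str):
        return 0
    return 1 + min(depth_to_leaf(c) for c in tree)

def contract(tree, should_contract):
    """Remove internal nodes for which `should_contract` (applied to the
    already-contracted subtree) is true, splicing their children into
    the parent; the root itself is never removed."""
    if isinstance(tree, str):
        return tree
    children = []
    for child in tree:
        child = contract(child, should_contract)
        if not isinstance(child, str) and should_contract(child):
            children.extend(child)   # promote grandchildren
        else:
            children.append(child)
    return tuple(children)
```

Contracting every internal node, for example, flattens the tree into the root with all leaves as immediate children.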
18
Learning the Parameters
• We constrain the HDT model to be centred on the flat HDD model
  – We only learn the hyper-parameters [their symbol was lost in transcription]
  – A Gamma prior is placed over them, and the MAP estimate is found by optimizing the objective function with L-BFGS
• We set the centering values so that the tree induces the same distribution over the document LMs as the flat HDD model
[Figure: the global tree, annotated with the hyper-parameter values at its nodes; detail lost in transcription]
20
Experiments
• We present results on two datasets
• The baseline methods: (1) The flat HDD model (Cowans 2004), (2) Okapi BM25, (3) Okapi BM25 with query expansion
• The comparison criteria
  – Top-10 precision
  – Average precision
Dataset     # docs   # queries   dictionary size
Cranfield   1400     225         4227
Medline     1033     30          8800
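The two comparison criteria can be computed as follows; these are the standard definitions of precision at rank 10 and (non-interpolated) average precision:

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each
    relevant document, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```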
21
Results
[Results table lost in transcription]
22
Precision-Recall graph for Cranfield
[Figure lost in transcription; x-axis: recall, y-axis: precision]
23
Analysis
• When the learned value at node k exceeds its centering value times the value at parent(k) [the symbols were lost in transcription], there is positive correlation in selecting the children of parent(k)
  – If a child has been selected, it is likely that more children will be selected
• This coincides with intuition
  – BCluster and DCluster produce trees that put similar-meaning words nearby
  – PCluster tends to put words with high co-occurrence nearby
[Accompanying table/figure lost; surviving values: 0.9044, 0.7977, 0.3344]
24
Examples of the Learned Trees
25
Conclusion
• We presented a hierarchical Dirichlet tree model for information retrieval
  – It can inject (semantic or syntactic) word relationships as domain knowledge into a probabilistic model
• The model uses a tree that captures the relationships among words
  – We investigated the effect of different tree-building algorithms and their simplification
• Future research includes
  – Scaling the method up to larger datasets
  – Using WordNet or Wikipedia to build the tree
26
Merci
Thank You
27
Inference in HDD Model
• Let the common mean be m [definition lost in transcription]
  – The oracle is asked the first time each term is seen in each document, and never asked subsequently
• After integrating out document j's LM parameters, the score is [equation lost in transcription]
  – One part of the equation gives an effect similar to term-frequency inverse-document-frequency (TF-IDF)
  – The other part normalizes for the document length
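For reference, the classic TF-IDF score that this effect resembles can be computed as follows (this is the textbook formula, not the model's own score):

```python
import math

def tf_idf_score(query, doc, docs):
    """Classic TF-IDF: term frequency in the document times the log
    inverse document frequency over the collection."""
    n = len(docs)
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)
        if tf and df:
            score += tf * math.log(n / df)
    return score
```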
28
Learning the Parameters
• We put a Gamma(b·mk + 1, b) prior over each hyper-parameter, where mk is the mode of that Gamma distribution [the original symbols were lost in transcription]
  – Setting the modes appropriately reduces the model to HDD at the mode of the Gamma distribution
• We used L-BFGS to find the MAP values; the derivative of the objective function with respect to each hyper-parameter is [equation lost in transcription]
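As a check on this parameterization: a Gamma density with shape b·m + 1 and rate b has its mode at m, so centering the prior this way makes the flat-model value the MAP in the absence of data. A numerical sketch, with a simple grid search standing in for L-BFGS:

```python
import math

def log_gamma_prior(x, m, b):
    """Unnormalised log-density of Gamma(shape=b*m + 1, rate=b).
    d/dx [(b*m)*log(x) - b*x] = b*m/x - b = 0  at  x = m."""
    return (b * m) * math.log(x) - b * x

def argmax_on_grid(f, lo, hi, steps=10000):
    """Crude 1-D maximiser: evaluate f on a grid and take the best point."""
    xs = [lo + (hi - lo) * i / steps for i in range(1, steps)]
    return max(xs, key=f)
```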