ISchool, Cloud Computing Class Talk, Oct 6 th 20081 Computing Pairwise Document Similarity in Large...
-
date post
20-Dec-2015 -
Category
Documents
-
view
214 -
download
0
Transcript of ISchool, Cloud Computing Class Talk, Oct 6 th 20081 Computing Pairwise Document Similarity in Large...
iSchool, Cloud Computing Class Talk, Oct 6th 2008 1
Computing Pairwise Document Computing Pairwise Document Similarity in Large Collections:Similarity in Large Collections:
A MapReduce PerspectiveA MapReduce Perspective
Tamer Elsayed, Jimmy Lin, and Douglas W. OardTamer Elsayed, Jimmy Lin, and Douglas W. Oard
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 2
Overview
Abstract Problem Trivial Solution MapReduce Solution Efficiency Tricks Identity Resolution in Email
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 3
Abstract Problem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0.200.300.540.210.000.340.340.130.74
0.200.300.540.210.000.340.340.130.74
0.200.300.540.210.000.340.340.130.74
0.200.300.540.210.000.340.340.130.74
0.200.300.540.210.000.340.340.130.74
Applications: Clustering Coreference resolution “more-like-that” queries
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 4
Similarity of Documents
Simple inner product Cosine similarity Term weights
Standard problem in IR tf-idf, BM25, etc.
di
dj
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 5
Trivial Solution
load each vector o(N) times load each term o(dft
2) times
scalable and efficient solutionfor large collections
Goal
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 6
Better Solution
Load weights for each term once Each term contributes o(dft
2) partial scores Allows efficiency tricks
Each term contributes only if appears in
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 7
Decomposition MapReduce
Load weights for each term once Each term contributes o(dft
2) partial scores
Each term contributes only if appears in
mapmap
indexindex
reducereduce
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 8
MapReduce Framework
mapmap
mapmap
mapmap
mapmap
reducereduce
reducereduce
reducereduce
input
input
input
input
output
output
output
ShufflingShuffling
group values group values by: by: [[keyskeys]]
(a) Map(a) Map (b) Shuffle(b) Shuffle (c) Reduce(c) Reduce
handles low-level details transparentlytransparently
(k2, [v2])(k1, v1)
[(k3, v3)][k2, v2]
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 9
Standard Indexing
tokenizetokenize
tokenizetokenize
tokenizetokenize
tokenizetokenize
combinecombine
combinecombine
combinecombine
doc
doc
doc
doc
posting list
posting list
posting list
ShufflingShuffling
group values group values by: by: termsterms
(a) Map(a) Map (b) Shuffle(b) Shuffle (c) Reduce(c) Reduce
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 10
Indexing (3-doc toy collection)
Clinton
Barack
Cheney
Obama
Indexing
2
1
1
1
1
ClintonObamaClinton 1
1
ClintonCheney
ClintonBarackObama
ClintonObamaClinton
ClintonCheney
ClintonBarackObama
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 11
Pairwise Similarity(a) Generate pairs(a) Generate pairs (b) Group pairs(b) Group pairs (c) Sum pairs(c) Sum pairs
Clinton
Barack
Cheney
Obama
2
1
1
1
1
1
1
22
22
11
1111
22
22 22
22
11
1133
11
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 12
Pairwise Similarity (abstract)(a) Generate pairs(a) Generate pairs (b) Group pairs(b) Group pairs (c) Sum pairs(c) Sum pairs
multiplymultiply
multiplymultiply
multiplymultiply
multiplymultiply
sumsum
sumsum
sumsum
term postings
term postings
term postings
term postings
similarity
similarity
similarity
ShufflingShuffling
group values group values by: by: pairspairs
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 13
Experimental Setup
0.16.0 Open source MapReduce implementation
Cluster of 19 machines Each w/ two processors (single core)
Aquaint-2 collection 906K documents
Okapi BM25 Subsets of collection
Elsayed, Lin, and Oard, ACL 2008Elsayed, Lin, and Oard, ACL 2008
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 14
Efficiency (disk space)
0
1,000
2,000
3,000
4,000
5,000
6,000
7,000
8,000
9,000
0 10 20 30 40 50 60 70 80 90 100
Corpus Size (%)
Inte
rme
dia
te P
air
s (
bill
ion
s)
8 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~ 906k docs
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 15
Terms: Zipfian Distribution
term rank
do
c fr
eq (
df)
each term t contributes o(dft2) partial results
very few terms dominate the computations
most frequent term (“said”) 3%
most frequent 10 terms 15%
most frequent 100 terms 57%
most frequent 1000 terms 95%
~0.1% of total terms(99.9% df-cut)
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 16
Efficiency (disk space)
0
1,000
2,000
3,000
4,000
5,000
6,000
7,000
8,000
9,000
0 10 20 30 40 50 60 70 80 90 100
Corpus Size (%)
Inte
rmed
iate
Pai
rs (
bil
lio
ns)
no df-cutdf-cut at 99.999%df-cut at 99.99%df-cut at 99.9%df-cut at 99%
8 trillionintermediate pairs
0.5 trillion intermediate pairs
Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~ 906k doc
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 17
Effectiveness (recent work) Effect of df-cut on effectiveness
Medline04 - 909k abstracts- Ad-hoc retrieval
50
55
60
65
70
75
80
85
90
95
100
99.00 99.10 99.20 99.30 99.40 99.50 99.60 99.70 99.80 99.90 100.00df-cut (%)
Re
lati
ve
P5
(%
)
Drop 0.1% of terms“Near-Linear” Growth
Fit on diskCost 2% in Effectiveness
Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 18
Implementation Issues
BM25s Similarity Model
TF, IDF Document length
DF-Cut Build a histogram Pick the absolute df for the % df-cut
5.0
5.0log*
5.15.0*
5.15.0 11
1
11
1
df
dfN
dlavgdl
tf
tf
dlavgdl
tf
tf
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 19
Other Approximation Techniques ?
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 20
Other Approximation Techniques
(2) Absolute df
Consider only terms that appear in at least n (or %) documents An absolute lower bound on df, instead of just
removing the % most-frequent terms
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 21
Other Approximation Techniques
(3) tf-Cut
Consider only documents (in posting list) with tf > T ; T=1 or 2
OR: Consider only the top N documents based on tf for each term
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 22
Other Approximation Techniques
(4) Similarity Threshold
Consider only partial scores > SimT
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 23
Other Approximation Techniques:
(5) Ranked List
Keep only the most similar N documents In the reduce phase
Good for ad-hoc retrieval and “more-like this” queries
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 24
Space-Saving Tricks
(1) Stripes
Stripes instead of pairs Group by doc-id not pairs
11
222211
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 25
Space-Saving Tricks
(2) Blocking
No need to generate the whole matrix at once Generate different blocks of the matrix at
different steps limit the max space required for intermediate results
Similarity Matrix
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 26
Identity Resolution in Email Topical Similarity Social Similarity Joint Resolution of Mentions
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 27
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be
rescheduled
Did Sheila want Scott to participate? Looks like the
call will be too late for him.
Sheila
Basic Problem
WHO?WHO?WHO?WHO?
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 28
Enron Collection
-----Original Message-----From: [email protected]@ENRONSent: Monday, July 30, 2001 2:24 PMTo: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !
Message-ID: <1494.1584620.JavaMail.evans@thyme>Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>X-To: '[email protected]@ENRON'
Hope all is well.Count me in for the group present.See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager713-853-6349
Hi Shari
Thanks!
Shari
55 Sheila’s !!55 Sheila’s !!weisman
pardoglover
richjones
breedenhuckaby
tweedmcintyrechadwick
birminghamkahanekforakertasmanfisherpetitt
DomboRobbinschang
jarnotkirby
knudsenboehringer
lutzgloverwollamjortnerneylon
whangernagel
gravesmclaughlin
venvillerappazzo
millerswatekhollis
maynesnacey
ferrarinidey
macleodhowarddarlingwatsonperlickadvanihesterkennerlewis
waltonwhitmanberggrenosowski
kelly
Rank Rank CandidatesCandidates
Rank Rank CandidatesCandidates
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 29
Generative Model
1. Choose “personperson” c to mention
p(c)
2. Choose appropriate “contextcontext” X to mention c
p(X | c)
3. Choose a “mentionmention” l
p(l | X, c) ““sheila”sheila”
GEGEconferenceconference
callcall
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 30
3-Step Solution(1) Identity(1) Identity ModelingModeling
Posterior DistributionPosterior Distribution
(3) Mention Resolution(3) Mention Resolution
(2) Context Reconstruction(2) Context Reconstruction
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 31
Contextual Space
LocalLocalContextContext
Conversational Conversational ContextContext
Topical ContextTopical Context
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 32
Topical Context
Date: Fri Dec 15 05:33:00 EST 2000From: [email protected]: vince j kaminski <[email protected]>Cc: sheila walton <[email protected]>Subject: Re: Grant Masson
Great news. Lets get this moving along. Sheila, can you work out GE letter?
Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
Sheila call
Sheila
call
GE
GE
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 33
Contextual Space
Social ContextSocial Context
LocalLocalContextContext
Conversational Conversational ContextContext
Topical ContextTopical Context
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 34
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
Social Context
Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)From: [email protected]: [email protected] Subject: ESA Option Execution
KayCan you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.Thanks, Rebecca
Sheila Tweed
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 35
Contextual Space (mentions)
“Sheila”
social
conversational
social
topical
social
topical
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
Joint Resolution of MentionsJoint Resolution of Mentions
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 36
Topical Expansion
Each email is a document Index all (bodies of) emails
remove all signature and salutation lines Use temporal constraints
Need an email-to-date/time mapping Check for each pair of documents
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 37
Social Expansion Can we use the same technique?
For each email: list of participating email addresses comprises the document
MessageID: 3563Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
2563 [email protected] [email protected]
Index the new “social documents” and apply same topical expansion process
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 38
Social Similarity Models
Intersection size Jaccard Coefficent Boolean
All given temporal constraints
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 39
Joint Resolution
“Sheila”
social
conversational
social
topical
social
topical
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 40
Joint Resolution
SpreadSpreadCurrent ResolutionCurrent Resolution
CombineCombineContext InfoContext Info
UpdateUpdateResolutionResolution
MentionGraph
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 41
Joint Resolution
mapmap shuffleshuffle reducereduce
MentionGraph
MapReduce!MapReduce!
Work in Progress!Work in Progress!
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 42
System DesignEmailsThreads Identity Models
Social Expansion
Conv. Expansion
Topical Expansion
Local Expansion
LocalContext
Social Context
TopicalContext
Conv.Context
Mention Recognition
Context-Free Resolution
Mentions
Merging Contexts
Context-Free Resolution
Joint Resolution
Posterior Resolution
Prior Resolution
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 43
Iterative Joint Resolution Input: Context Graph + Prior Resolution Mapper
Consider one mention Takes:
1. out-edges and context info2. prior resolution
Spread context info and prior resolution to all mentions in context
Reducer Consider one mention Takes:
1. in-edges and context info2. prior resolution
Compute posterior resolution Multiple Iterations
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 44
Conclusion
Simple and efficient MapReduce solution applied to both topical and social expansion in
“Identity Resolution in Email” different tricks for approximation
Shuffling is critical df-cut controls efficiency vs. effectiveness tradeoff 99.9% df-cut achieves 98% relative accuracy
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 45
Thank You!
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective 46
Algorithm
Matrix must fit in memory Works for small collections
Otherwise: disk access optimization