Effective Phrase Prediction
![Page 1: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/1.jpg)
Effective Phrase Prediction
VLDB 2007 : Text Databases
Presented By Arnab Nandi, H. V. Jagadish
University of Michigan
2008-03-07
Summarized By Jaeseok Myung
![Page 2: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/2.jpg)
Copyright 2008 by CEBT
Motivation
Pervasiveness of Autocompletion
Typical autocompletion is still at word level
Phrase Prediction
Words provide much more information to exploit for prediction
– Context, Phrase Structures
Most text is predictable and repetitive in many applications
– Email Composition
Prob(“Thank you very much” | “Thank”) ~= 1
Center for E-Business Technology IDS Lab. Seminar – 2/13
![Page 3: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/3.jpg)
Challenges
Number of phrases is large
n(vocabulary) >> n(alphabet)
n(phrases) = O(vocabulary ^ phrase length)
=> FussyTree structure
Length of phrase is unknown
“word” has a well-defined boundary
=> Significance
How to evaluate a suggestion mechanism?
=> Total Profit Metric (TPM)
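To make the first challenge concrete, the short sketch below compares how fast the candidate space grows for character sequences and for word sequences. The alphabet and vocabulary sizes are hypothetical values chosen only for illustration; they do not come from the slides.

```python
# Hypothetical sizes, chosen only for illustration.
ALPHABET_SIZE = 26
VOCABULARY_SIZE = 50_000


def candidate_count(symbols: int, length: int) -> int:
    """Number of distinct sequences of `length` items drawn from `symbols` choices."""
    return symbols ** length


for length in (1, 2, 3, 4):
    chars = candidate_count(ALPHABET_SIZE, length)
    phrases = candidate_count(VOCABULARY_SIZE, length)
    print(f"length={length}: {chars} character sequences, {phrases} word sequences")
```

Even at length 2, the word-level space is several orders of magnitude larger, which is why a structure that stores every possible phrase is not practical.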
![Page 4: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/4.jpg)
Problem Definition
R = query(p)
Need data structure that can
Store completions efficiently
Support fast querying
![Page 5: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/5.jpg)
An n-gram Data Model
R = query(p): for each r ∈ R, prob(p, r) is maximized
m-th order Markov model
m: # of previous states that we are using to predict the next state
n-gram model is equivalent to an (n-1)th order Markov model
$P(t, i, p) = P(X_1 = t)\,P(X_2 = i \mid X_1 = t)\,P(X_3 = p \mid X_2 = i) = 1.0 \times 0.3 \times 0.6 = 0.18$
[Figure: candidate completions w7,1, w7,2, w7,3 with frequency values 10, 20, 30 used for ranking; prefix length p = 5]
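The n-gram model on this slide can be sketched with a bigram (first-order Markov) model. This is a minimal illustration, not the authors' implementation: the function names `train_bigrams`, `completion_prob`, and `query`, and the tiny corpus, are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word-to-next-word transitions (a first-order Markov model)."""
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


def next_word_prob(transitions, current, following):
    """Estimate P(following | current) from the transition counts."""
    counts = transitions.get(current)
    if not counts:
        return 0.0
    return counts[following] / sum(counts.values())


def completion_prob(transitions, last_prefix_word, completion):
    """Probability of a completion as a product of bigram probabilities."""
    prob, current = 1.0, last_prefix_word
    for word in completion:
        prob *= next_word_prob(transitions, current, word)
        current = word
    return prob


def query(transitions, last_prefix_word, candidates):
    """Return the candidate completion with the highest estimated probability."""
    return max(candidates, key=lambda c: completion_prob(transitions, last_prefix_word, c))


# Hypothetical toy corpus.
corpus = ["thank you very much", "thank you very much", "thank you for coming"]
model = train_bigrams(corpus)
best = query(model, "thank", [("you", "very", "much"), ("you", "for", "coming")])
print(best)
```

Here only the last word of the prefix is used as context; a higher-order model would condition on more of the prefix at the cost of sparser counts.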
![Page 6: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/6.jpg)
Fundamental Data Structures
Basic data structures for “completion” problems
TRIE or Suffix Tree
Phrase Version
Every node = word
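As an illustration of the phrase version, here is a minimal sketch of a trie in which each node holds a whole word rather than a character. The class and method names (`PhraseTrie`, `insert`, `completions`) are hypothetical and chosen for this example only.

```python
class PhraseTrieNode:
    def __init__(self):
        self.children = {}  # word -> PhraseTrieNode
        self.count = 0      # number of inserted phrases passing through this node


class PhraseTrie:
    def __init__(self):
        self.root = PhraseTrieNode()

    def insert(self, phrase):
        """Add a phrase word by word, incrementing counts along the path."""
        node = self.root
        for word in phrase.split():
            node = node.children.setdefault(word, PhraseTrieNode())
            node.count += 1

    def completions(self, prefix):
        """Return (completion, count) pairs below the prefix, most frequent first."""
        node = self.root
        for word in prefix.split():
            node = node.children.get(word)
            if node is None:
                return []
        results = []
        stack = [(node, [])]
        while stack:
            current, path = stack.pop()
            for word, child in current.children.items():
                new_path = path + [word]
                results.append((" ".join(new_path), child.count))
                stack.append((child, new_path))
        return sorted(results, key=lambda item: (-item[1], item[0]))


trie = PhraseTrie()
for phrase in ["please call me", "please call me asap", "please call back"]:
    trie.insert(phrase)
print(trie.completions("please call"))
```

Ranking by raw count is only one option; the later slides replace counts with significance marks.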
[Figure: TRIE and suffix tree]
![Page 7: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/7.jpg)
Pruned Count Suffix Tree (PCST)
Construct a frequency-based phrase tree
Prune all nodes with frequency < threshold τ
Problems
The full tree, including infrequent phrases, is built as an intermediate result before pruning, so PCST does not scale well to large data sets
[16] Estimating alphanumeric selectivity in the presence of wildcards
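The build-then-prune idea behind PCST, and the reason the intermediate tree is costly, can be sketched as follows. This is a simplified illustration using nested dictionaries; the helper names and the `max_len` cut-off are assumptions, not details taken from the slides or from [16].

```python
def build_count_tree(sentences, max_len):
    """Count every word n-gram (up to max_len words) in a nested-dict tree."""
    root = {"count": 0, "children": {}}
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node["children"].setdefault(word, {"count": 0, "children": {}})
                node["count"] += 1
    return root


def prune(node, threshold):
    """Drop every subtree whose root count is below the threshold."""
    node["children"] = {
        word: child
        for word, child in node["children"].items()
        if child["count"] >= threshold
    }
    for child in node["children"].values():
        prune(child, threshold)


def size(node):
    """Number of nodes below `node`."""
    return sum(1 + size(child) for child in node["children"].values())


# Hypothetical toy input.
sentences = ["please call me asap", "please call if you can"]
tree = build_count_tree(sentences, max_len=4)
nodes_before = size(tree)
prune(tree, threshold=2)
nodes_after = size(tree)
print(nodes_before, nodes_after)
```

Most of the nodes exist only to be thrown away by the pruning step, which is the overhead the slide points out.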
![Page 8: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/8.jpg)
FussyTree Construction
Filter out infrequent phrases even before adding to the tree
N = 2 (training sentence size), τ = 2 (threshold)
Tokenizing window size = 4 (the size of the largest frequent phrase)
Example windows: (please, call, me, asap), (call, me, asap, -end-)
[Figure: resulting tree, with ignored phrases marked]
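One possible way to filter infrequent phrases before they reach the tree is a level-wise check: a phrase is only counted if its shorter prefix has already reached the threshold. The sketch below uses that strategy as an assumption; it is not necessarily the construction algorithm from the paper, and the names `windows`, `frequent_phrases`, and `build_tree` are hypothetical.

```python
from collections import Counter


def windows(sentences, window_size):
    """Yield sliding windows of up to window_size words, ending each sentence with -end-."""
    for sentence in sentences:
        words = sentence.split() + ["-end-"]
        for start in range(len(words) - 1):
            yield tuple(words[start:start + window_size])


def frequent_phrases(sentences, window_size, threshold):
    """Collect window prefixes that occur at least `threshold` times, level by level."""
    frequent = set()
    for length in range(1, window_size + 1):
        counts = Counter()
        for window in windows(sentences, window_size):
            if len(window) < length:
                continue
            phrase = window[:length]
            # Only count a phrase whose shorter prefix already proved frequent.
            if length == 1 or phrase[:-1] in frequent:
                counts[phrase] += 1
        level = {phrase for phrase, count in counts.items() if count >= threshold}
        if not level:
            break
        frequent |= level
    return frequent


def build_tree(phrases):
    """Insert the surviving phrases into a nested-dict tree (one word per node)."""
    root = {}
    for phrase in sorted(phrases):
        node = root
        for word in phrase:
            node = node.setdefault(word, {})
    return root


# Hypothetical toy input.
sentences = ["please call me asap", "please call if you can"]
kept = frequent_phrases(sentences, window_size=4, threshold=2)
tree = build_tree(kept)
print(tree)
```

This variant trades extra passes over the text for never materializing infrequent phrases in the tree.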
![Page 9: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/9.jpg)
Significance
A node in the FussyTree is “significant” if it marks a phrase boundary
Example : “please call”
z and y are considered tuning parameters
Assume z = 2, y = 3
“please call”(3) > “please”(0) * “call”(1)
“please call”(3) > ½ * “please”(0)
“please call”(3) > 3 * “please call me”(1)…
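The three inequalities above can be written as a small predicate. This sketch mirrors the comparisons as they appear on the slide; how the frequencies are normalized is not stated here, so the function simply takes them as given numbers. The function name, argument names, and the example values are hypothetical.

```python
def is_significant(phrase_freq, prefix_freq, last_word_freq, child_freqs, z=2.0, y=3.0):
    """Check the three slide conditions for a phrase node.

    phrase_freq    -- frequency of the phrase itself (e.g. "please call")
    prefix_freq    -- frequency of its prefix (e.g. "please")
    last_word_freq -- frequency of its last word (e.g. "call")
    child_freqs    -- frequencies of its one-word extensions (e.g. "please call me")
    """
    exceeds_product = phrase_freq > prefix_freq * last_word_freq
    comparable = phrase_freq > prefix_freq / z  # governed by z
    unique = all(phrase_freq > y * child for child in child_freqs)  # governed by y
    return exceeds_product and comparable and unique


# Hypothetical relative frequencies, not taken from the slide.
print(is_significant(0.03, 0.05, 0.04, [0.005]))
print(is_significant(0.03, 0.05, 0.04, [0.02]))
```

In the second call the extension is too frequent relative to the phrase, so the phrase is not treated as a boundary.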
![Page 10: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/10.jpg)
Significance – cont.
All leaves are significant
due to END node (frequency = 0)
Some internal nodes are significant too
Intuitively, suggestions ending on significant nodes will be better
No need to store counts
![Page 11: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/11.jpg)
Online Significance Marking
(Offline) Significance requires an additional pass
Compare against tree generated by FussyTree with offline significance
[Figure: a tree containing the path A, B, C, D, E; after adding “ABCXY”, a new branch X, Y is attached below C]
The branch point is considered for promotion
The immediate descendant significant nodes are considered for demotion
![Page 12: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/12.jpg)
Evaluation Metrics
Precision & Recall
Refer to the quality of the suggestions themselves
For ranked results :
![Page 13: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/13.jpg)
Total Profit Metric (TPM)
Total Profit Metric
TPM measures the effectiveness of a suggestion mechanism
It counts the number of keystrokes saved by suggestions
d is the distraction parameter
TPM(0) corresponds to a user who does not mind distraction at all
TPM(1) is the extreme case in which every suggestion (right or wrong) is treated as a blocking factor that costs one keystroke
The distraction value is expected to be closer to 0 than to 1
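The description above can be turned into a rough calculation. The exact TPM formula is not reproduced on this slide, so the sketch below is a simplified reading of the bullets only: accepted suggestions save their length in keystrokes, every suggestion shown costs d keystrokes, and the result is divided by the document length. The normalization, the omission of any rank term, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    text: str       # the suggested completion that was shown
    accepted: bool  # whether the user accepted it


def total_profit(events, document_length, d):
    """Simplified profit: keystrokes saved minus distraction cost, per document character."""
    saved = sum(len(event.text) for event in events if event.accepted)
    distraction_cost = d * len(events)
    return (saved - distraction_cost) / document_length


# Hypothetical session: two suggestions shown, one accepted, in a 40-character document.
events = [
    SuggestionEvent("you very much", accepted=True),
    SuggestionEvent("your", accepted=False),
]
print(total_profit(events, document_length=40, d=0))
print(total_profit(events, document_length=40, d=1))
```

With d = 0 only the saved keystrokes count; with d = 1 each of the two suggestions costs one keystroke, so the profit drops.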
![Page 14: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/14.jpg)
Total Profit Metric – An Example
![Page 15: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/15.jpg)
Experiments
Multiple Corpora
Enron Small: one user’s “sent” folder (366 emails, 250 KB)
Enron Large: multiple users (20,842 emails, 16 MB)
Wikipedia (40,000 documents, 53 MB)
Data Structures
(1) PCST, (2) FussyTree with Count, (3) FussyTree with Significance
Parameters
Significance: z (comparability) = 2, y (uniqueness) = 2
Training Sentence Size N = 8
Prefix Size P = 2
![Page 16: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/16.jpg)
Prediction Quality
![Page 17: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/17.jpg)
Tuning Parameters (1)
![Page 18: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/18.jpg)
Tuning Parameters (2)
![Page 19: Effective Phrase Prediction](https://reader035.fdocuments.in/reader035/viewer/2022062518/568146a2550346895db3be78/html5/thumbnails/19.jpg)
Conclusion
Phrase-level autocompletion is challenging, but it can provide much greater savings than word-level autocompletion
A technique to accomplish this based on “significance”
New evaluation metrics for ranked autocompletion
Possible Extensions
Part of Speech Reranking
Semantic Reranking
– Using WordNet
Query Completion for structured data
– XML, ..