Towards a Query Optimizer for Text-Centric Tasks
Transcript of "Towards a Query Optimizer for Text-Centric Tasks"
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano
Presenter: Avinandan Sengupta
2
Session Outline
Text-Centric Tasks
Methods Employed
A More Disciplined Approach
Proposed Algorithm
Experimental Setup
Results
Conclusion
3
Scenario I
Construction of a table of disease outbreaks from a newspaper archive
sample tuples
Task 1
Information Extraction
4
Scenario II
Word        Frequency
Samsung     2900
Nokia       2500
Blackberry  2000
Apple       1900
...         ...
Tabulating the number of times an organization’s name appears on a particular web site
Task 2
Content Summary Construction
5
Scenario III
Task 3
Discovering pages on Botany on the Internet
Focused Resource Discovery
6
Text-centric tasks
Types
Information Extraction
Content Summary Construction
Focused Resource Discovery
7
Performing Text-Centric Tasks
8
Recall in Text-Centric Tasks

Tokens: the set of tokens that the document processor P can extract from the corpus. For a strategy S that has processed a set of documents from the corpus, recall is the fraction of these tokens that S has actually extracted:

Recall(S, D) = |Tokens_retr| / |Tokens|
9
General flow

Start → Document Retrieval: retrieve documents from the corpus
→ Document Classifier (optional): is the document relevant? If yes (Y), continue
→ Document Processor: process the document (token extraction)
→ Check: recall ≥ target recall? If yes (Y), done; otherwise retrieve more documents
10
What are the available methods for retrieval?

Execution Strategies

Crawl-based:
Scan (SC)
Filtered Scan (FS)

Query-based:
Iterative Set Expansion (ISE)
Automatic Query Generation (AQG)
11
Execution Time – Generic Model

Time(S, D) = |Qsent| · tQ + |Dretr| · (tR + tF + tP)

where S is the strategy, D the corpus, tQ the time to issue a query, tR the time to retrieve a document, tF the time to filter (classify) it, and tP the time to process it; the terms a given strategy does not use drop out.
12
Execution Time – Simplified
13
Scan (SC)

Time(SC, D) = |Dretr| · (tR + tP)
14
Filtered Scan (FS)

Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP)

Cσ: selectivity of C, the fraction of database documents that the classifier C judges useful

(training the classifier is a one-time, offline cost)
15
Iterative Set Expansion (ISE)

Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP)
16
Automatic Query Generation (AQG)

Time(AQG, D) = |Qsent| · tQ + |Dretr| · (tR + tP)
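As a sketch, the four cost formulas can be written directly as functions of the timing parameters and document/query counts; all input values here are hypothetical, not measurements from the paper:

```python
def time_scan(d_retr, t_r, t_p):
    """Time(SC, D) = |Dretr| * (tR + tP)"""
    return d_retr * (t_r + t_p)

def time_filtered_scan(d_retr, t_r, t_f, t_p, c_sigma):
    """Time(FS, D) = |Dretr| * (tR + tF + Csigma * tP);
    c_sigma is the classifier selectivity."""
    return d_retr * (t_r + t_f + c_sigma * t_p)

def time_query_based(q_sent, d_retr, t_q, t_r, t_p):
    """Time(ISE, D) = Time(AQG, D) = |Qsent| * tQ + |Dretr| * (tR + tP)"""
    return q_sent * t_q + d_retr * (t_r + t_p)
```

Note that ISE and AQG share the same cost expression; they differ in how |Qsent| and |Dretr| grow with the target recall.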
17
Which strategy to use?

Crawling vs. Querying

For text-centric tasks, a strategy is typically selected based on heuristics/intuition.
18
A More Disciplined Approach
19
Can we do better?

Define execution models
Estimate costs
Select the appropriate technique based on cost
Revisit technique selection during execution
Scan
Filtered Scan
AQG
ISE
20
Formalizing the problem

Given a target recall value τ, the goal is to identify an execution strategy S among S1, ..., Sn such that:

Recall(S, D) ≥ τ
Time(S, D) ≤ Time(Sj, D) for every Sj with Recall(Sj, D) ≥ τ
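The selection rule above can be sketched as a small helper; the triple format and the estimate values are hypothetical:

```python
def choose_strategy(strategies, tau):
    """Return the fastest strategy whose estimated recall meets the target tau.

    strategies: hypothetical list of (name, est_recall, est_time) triples.
    Returns None if no strategy reaches the target recall.
    """
    feasible = [s for s in strategies if s[1] >= tau]
    return min(feasible, key=lambda s: s[2]) if feasible else None
```

For example, with estimates [("SC", 0.9, 100), ("ISE", 0.95, 40), ("AQG", 0.5, 10)] and τ = 0.8, the rule picks ISE: AQG is cheapest but misses the recall target.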
21
Degrees

g(d), degree of a document: # of distinct tokens extracted from d using P
g(t), degree of a token: # of distinct documents in D from which P can extract t
g(q), degree of a query: # of documents from D retrieved by query q

Duseful: the documents in D from which P can extract tokens
Duseless: the remaining documents
22
Cost of Scan - 1

Time(SC, D) = |Dretr| · (tR + tP)

SC retrieves documents in no particular order and does not retrieve the same document twice. SC effectively performs multiple token samplings from a finite population, in parallel over D. The probability of observing a token t exactly k times in a sample of size S follows the hypergeometric distribution.
23
Cost of Scan - 2

Probability that token t does not appear in a sample of S documents (# of ways to select S documents from the |D| - g(t) documents in which the token does not appear, over the # of ways to select S documents from |D| docs):

Pr{t not retrieved} = C(|D| - g(t), S) / C(|D|, S)

Probability that token t appears in at least one retrieved document:

Pr{t retrieved} = 1 - C(|D| - g(t), S) / C(|D|, S)

Expected number of tokens retrieved after processing S documents:

E[|Tokens_retr|] = Σ_t (1 - C(|D| - g(t), S) / C(|D|, S))
24
Cost of Scan - 3

We do not know the exact g(t) for each token, but we know the form of the token degree distribution (a power-law distribution). Thus, using estimates for the probabilities Pr{g(t) = i}:

E[|Tokens_retr|] = |Tokens| · Σ_i Pr{g(t) = i} · [1 - C(|D| - i, S) / C(|D|, S)]

Inverting this expression yields the estimated # of documents that must be retrieved to achieve a target recall.
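The hypergeometric argument above can be sketched in a few lines; the degree-probability mapping is a hypothetical input standing in for the estimated Pr{g(t) = i}:

```python
from math import comb

def p_token_seen(D, g_t, S):
    """Probability that a token with degree g(t) = g_t appears in at least
    one of S documents sampled without replacement from |D| = D documents:
    1 - C(D - g_t, S) / C(D, S)."""
    return 1.0 - comb(D - g_t, S) / comb(D, S)

def expected_tokens(D, S, degree_probs, n_tokens):
    """E[|Tokens_retr|] = |Tokens| * sum_i Pr{g(t)=i} * (1 - C(D-i,S)/C(D,S)).

    degree_probs: hypothetical mapping degree i -> Pr{g(t) = i}."""
    return n_tokens * sum(p * p_token_seen(D, i, S)
                          for i, p in degree_probs.items())
```

For instance, in a 4-document corpus a token appearing in 2 documents is seen in a 1-document sample with probability 0.5; a token appearing in every document is seen with probability 1.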
25
Cost of Filtered Scan

Classifier selectivity Cσ: the fraction of database documents that the classifier C judges useful.

Classifier recall Cr: the fraction of useful documents in D that are also classified as useful by the classifier. A uniform recall is assumed across tokens.

Cr · g(t): # of times each token appears (on average) among the documents that pass the classifier.
26
Cost of Filtered Scan
Estimated # of documents retrieved to achieve a target recall
When Cσ is high, almost all documents in D are processed by P, and the behavior tends towards that of Scan
27
Cost of ISE - Random Graph ModelA random graph is a collection of points, or vertices, with lines, or edges, connecting pairs of them at random
The presence or absence of an edge between two vertices is independent of the presence or absence of any other edge, so that each edge may be considered to be present with independent probability p.
28
Cost of ISE – Querying Graph

Querying Graph: a bipartite graph (V, E) with
V = {tokens t} ∪ {documents d}
E1 = {edges d → t, such that token t can be extracted from d}
E2 = {edges t → d, such that a query with t retrieves document d}
E = E1 ∪ E2
29
Cost of ISE – With Generating Functions

Degree distribution of a randomly chosen document: Gd(x) = Σ_k pdk x^k
Degree distribution of a randomly chosen token: Gt(x) = Σ_k ptk x^k

pdk is the probability that a randomly chosen document d contains k tokens
ptk is the probability that a randomly chosen token t retrieves k documents
30
Cost of ISE – With Generating Functions

Degree distribution for a document chosen by following a random edge (such a document is reached with probability proportional to its degree, k · pdk):

Gd1(x) = x · Gd′(x) / Gd′(1)

Degree distribution for a token chosen by following a random edge:

Gt1(x) = x · Gt′(x) / Gt′(1)
31
Cost of ISE – Properties of Generating Functions

Moments: E[X] = G′(1)
Power: the sum of n i.i.d. variables, each with generating function G(x), has generating function [G(x)]^n
Composition: if N has generating function GN(x) and each of N i.i.d. variables has generating function G(x), their total has generating function GN(G(x))
32
Cost of ISE - Evaluation

Consider: ISE has sent a set Q of tokens as queries.

These tokens were discovered by following random edges on the graph, so their degree distribution is Gt1(x). By the Power property, the distribution of the total number of retrieved documents (which are pointed to by these tokens) is:

Gd2(x) = [Gt1(x)]^|Q|

In Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP), this implies that |Dretr| is a random variable whose distribution is given by Gd2(x).

The retrieved documents are reached by following random edges on the graph; hence, their degree distribution is described by Gd1(x).
33
Cost of ISE - Evaluation

By the Composition property, the distribution of the total number of tokens |Tokens_retr| retrieved by the Dretr documents is Gd2(Gd1(x)).

Using the Moments property, we obtain the expected values of |Dretr| and |Tokens_retr| after ISE sends Q queries, and from these the number of queries |Qsent| that Iterative Set Expansion must send to reach the target recall τ.
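A minimal sketch of the three generating-function properties used here (Moments, Power, Composition), representing a generating function by its coefficient list p[k] = Pr{degree = k}; the distributions are toy examples:

```python
def gf_mean(p):
    """Moments property: E[X] = G'(1) = sum_k k * p[k] for G(x) = sum_k p[k] x^k."""
    return sum(k * pk for k, pk in enumerate(p))

def power_mean(p, n):
    """Power property: the sum of n i.i.d. draws has GF [G(x)]^n,
    so its mean is n * G'(1)."""
    return n * gf_mean(p)

def composition_mean(outer, inner):
    """Composition property: if N ~ G_outer and each of N i.i.d. draws ~ G_inner,
    the total has GF G_outer(G_inner(x)), with mean G_outer'(1) * G_inner'(1)."""
    return gf_mean(outer) * gf_mean(inner)
```

This mirrors the ISE analysis: the mean of |Dretr| follows from the Power property over |Q| queried tokens, and the mean of |Tokens_retr| from composing the document-count and tokens-per-document distributions.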
34
Cost of AQG
35
Algorithms
36
Global Optimization
37
Local Optimization
38
Probability, Distributions, Parameter Estimation
39
Scan - Parameter Estimation

This relies on the characteristics of the token and document degree distributions. After retrieving and processing a few documents, we can estimate the distribution parameters from the frequencies of the initially extracted tokens and documents. Specifically, we can use a maximum-likelihood fit to estimate the parameters of the document degree distribution. For example, the document degrees for Task 1 tend to follow a power-law distribution, with probability mass function:

Pr{g(d) = i} = i^(-β) / ζ(β)
Goal: Estimate the most likely value of β, for a given sample of document degrees g(d1), . . . , g(ds)
ζ (β) is the Riemann zeta function (serves as a normalizing factor)
Use MLE to identify the value of β that maximizes the likelihood function:
40
Scan - Parameter Estimation
Find the maxima:

Setting the derivative of the log-likelihood to zero, we can estimate the value of β using numeric approximation.
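The MLE step can be sketched with a crude grid search; the truncated zeta normalizer and the search range are assumptions of this sketch, not the paper's exact numeric method:

```python
import math

def log_zeta(beta, terms=10000):
    """Truncated Riemann zeta (crude numeric stand-in for the normalizer)."""
    return math.log(sum(i ** -beta for i in range(1, terms + 1)))

def log_likelihood(beta, degrees):
    """log L(beta) = -beta * sum(log g(d_i)) - n * log zeta(beta)."""
    return (-beta * sum(math.log(g) for g in degrees)
            - len(degrees) * log_zeta(beta))

def fit_beta(degrees, lo=1.1, hi=5.0, steps=200):
    """Grid-search MLE for beta over a sample of observed document degrees."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda b: log_likelihood(b, degrees))
```

A sample dominated by degree-1 documents yields a steep (large) β, while a sample with many high-degree documents yields a flatter (small) β, as expected for a power law.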
41
Scan – Token Distribution Estimation
To maximize the above, we take the log, eliminate factorials using Stirling's approximation, and equate the derivative to zero to find the maximum.
42
Filtered Scan – Parameter Estimation
43
ISE – Parameter Estimation
44
AQG – Parameter Estimation
45
Experimental Setting and Results
46
Details of the Experiments

• Tuple extraction from New York Times archives
• Categorized word frequency computation for Usenet newsgroups
• Document retrieval on Botany from the Internet
47
Task 1a, 1b – Information Extraction
Document Processor: Snowball
1a: Extracting a Disease-Outbreaks relation, tuple (DiseaseName, Country)
1b: Extracting a Headquarters relation, tuple (Organization, Location)
Token: a single tuple of the target relation
Document: a news article from The New York Times archive
Corpus: newspaper articles from The New York Times, published in 1995 (NYT95) and 1996 (NYT96)
NYT95 documents for training
NYT96 documents for evaluation of the alternative execution strategies
NYT96 features: 182,531 documents; 16,921 tokens (Task 1a); 605 tokens (Task 1b)
Document Classifier: RIPPER
g(d): power-law distribution; g(t): power-law distribution
48
Task 1a, 1b – Information Extraction
FS: Rule Based Classifier (RIPPER)
RIPPER trained with a set of 500 useful documents and 1500 not useful documents from the NYT95 data set
AQG: 2000 documents from the NYT95 data set as a training set to create the queries required by Automatic Query Generation
ISE: construct queries using the AND operator over the attributes of each tuple (tuple (typhus, Belize) → [typhus AND Belize])
ISE/AQG: maximum # of returned documents: 100
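The ISE query construction described above amounts to:

```python
def tuple_to_query(attrs):
    """ISE query construction from the slides: AND together the attributes
    of an extracted tuple, e.g. ('typhus', 'Belize') -> '[typhus AND Belize]'."""
    return "[" + " AND ".join(attrs) + "]"
```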
49
Task 2 - Content Summary Construction
• Separate documents into topics based on the high-level name of the newsgroup (comp, sci)
• Train a rule-based classifier using RIPPER; it creates rules to assign documents into categories
• Final queries contain the antecedents of the rules, across all categories
Document Processor: Simple Tokenizer
Token: a word and its frequency
Document: A Usenet message
Corpus: 20 Newsgroups data set from the UCI KDD Archive; contains 20,000 messages
g(d): lognormal distribution; g(t): power-law distribution
Extracting words and their frequency from newsgroup
FS: not applicable (all documents useful)
ISE: queries are constructed using words that appear in previously retrieved documents
ISE/AQG: maximum # of returned documents: 100
AQG Modus operandi
50
Task 3 – Focused Resource Discovery
Document Processor: Multinomial Naïve Bayes Classifier
Token: URL of a page on Botany
Document: Web page
Corpus: 800,000 pages with 12,000 relevant to Botany
g(d): lognormal distribution; g(t): power-law distribution
Retrieving document on Botany from the Internet
ISE/AQG: maximum # of returned documents - 100
51
Task 3 – Database Building

• Retrieve 8,000 pages listed in Open Directory under: Top -> Science -> Biology -> Botany
• Select 1,000 documents as training documents
• Create a multinomial Naive Bayes classifier that decides whether a Web page is about Botany
• For each of the downloaded Botany pages:
– extract backlinks with Google
– classify retrieved pages
– for each page classified as "Botany", repeat backlink extraction
• Stop when none of the backlinks is classified under Botany
52
Task 3 – Database Attributes
• Around 12,000 pages on Botany, pointed to by approximately 32,000 useless documents
• Augment the useless documents:
– picked 10 more random topics from the third level of the Open Directory hierarchy
– downloaded all the Web pages listed under these topics, for a total of approximately 100,000 pages
• Final data set: around 800,000 pages in total, of which 12,000 are relevant to Botany
53
Task 3 – Modus Operandi
• SC: a classifier decides whether each retrieved page belongs to the category of choice
• FS: a focused crawler starts from a few Botany Web pages and visits a Web page only when at least one of the documents that points to it is useful
• AQG: train a RIPPER classifier using the training set and create a set of rules that assign documents into the Botany category
Evaluation – Model Accuracy
54
Task 1a
Task 1b
Task 2
Task 3
55
Evaluation – Global vs. ActualTask 1a
Task 1b
Task 2
Task 3
56
Evaluation – Global vs. LocalTask 1a
Task 1b
Task 2
Task 3
57
Conclusion

• Introduced a rigorous cost model for several query- and crawl-based execution strategies that underlie the implementation of many text-centric tasks
• Developed principled cost-estimation approaches for the introduced models
• Analyzed the models to predict the execution time and output completeness of important query- and crawl-based algorithms, and to select a strategy accordingly; until now these were only evaluated empirically, with limited theoretical justification
• Demonstrated that the suggested modeling can be successfully used to create optimizers for text-centric tasks
• Showed that the optimizers help build efficient execution plans to achieve a target recall, resulting in executions that can be orders of magnitude faster than alternate choices
58
References

• Generating functions
• Sampling from a finite population: http://blog.data-miners.com/2008/05/agent-problem-sampling-from-finite.html
• Random graphs with arbitrary degree distributions and their applications
• Probability distributions:
– http://en.wikipedia.org/wiki/Hypergeometric_distribution
– http://en.wikipedia.org/wiki/Pareto_distribution
– http://en.wikipedia.org/wiki/Power_law
– http://en.wikipedia.org/wiki/Zipf's_law
• MLE: http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html
59
Thanks!
60
Backup Slides
61
Probability, Distributions and Estimations
62
Distributions, Models, and Likelihood
63
MLE
Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: given a data set and a model, MLE selects the parameter values that maximize the likelihood of the observed data.
64
Zipf's Law

Given some corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
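As a toy illustration (the top-rank frequency is a hypothetical input):

```python
def zipf_frequency(rank, top_frequency):
    """Zipf's law: the frequency of the rank-r word is approximately
    (frequency of the most frequent word) / r."""
    return top_frequency / rank
```

So if the most frequent word occurs 1000 times, the second most frequent occurs about 500 times and the fourth about 250 times.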
65
Power Law
An example power-law graph demonstrating ranking of popularity: to the right is the long tail, and to the left are the few that dominate (also known as the 80-20 rule).
66
Hypergeometric Distribution
Discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement
67
Binomial Distribution
Describes the number of successes in a sequence of n draws with replacement.
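Both distributions can be sketched with exact combinatorics, highlighting the without-replacement vs. with-replacement contrast:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws WITHOUT replacement from a
    population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(X = k): k successes in n independent draws WITH replacement,
    each succeeding with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)
```

For large populations the hypergeometric PMF approaches the binomial PMF with p = K/N, which is why sampling with replacement is often an acceptable approximation.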