Towards a Query Optimizer for Text-Centric Tasks

Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano
Presenter: Avinandan Sengupta

Transcript of Towards a Query Optimizer for Text-Centric Tasks

Page 1: Towards a Query Optimizer for Text-Centric Tasks

Towards a Query Optimizer for Text-Centric Tasks
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano
Presenter: Avinandan Sengupta

Page 2: Towards a Query Optimizer for Text-Centric Tasks

2

Session Outline

Text-Centric Tasks

Methods Employed

A More Disciplined Approach

Proposed Algorithm

Experimental Setup

Results

Conclusion

Page 3: Towards a Query Optimizer for Text-Centric Tasks

3

Scenario I

Task 1: Information Extraction

Construction of a table of disease outbreaks from a newspaper archive (sample tuples shown on the slide)

Page 4: Towards a Query Optimizer for Text-Centric Tasks

4

Scenario II

Task 2: Content Summary Construction

Tabulating the number of times an organization's name appears on a particular web site:

Word        Frequency
Samsung     2900
Nokia       2500
Blackberry  2000
Apple       1900
...         ...

Page 5: Towards a Query Optimizer for Text-Centric Tasks

5

Scenario III

Task 3: Focused Resource Discovery

Discovering pages on Botany on the Internet

Page 6: Towards a Query Optimizer for Text-Centric Tasks

6

Text-centric tasks

Types

Information Extraction

Content Summary Construction

Focused Resource Discovery

Page 7: Towards a Query Optimizer for Text-Centric Tasks

7

Performing Text-Centric Tasks

Page 8: Towards a Query Optimizer for Text-Centric Tasks

8

Recall – In Text-Centric Tasks

Tokens: the set of tokens that the document processor P extracts from the corpus

Recall: the fraction of those tokens that a strategy extracts from the documents it processes (the slide's diagram shows a strategy retrieving documents from the corpus; the documents processed determine the tokens extracted)

Page 9: Towards a Query Optimizer for Text-Centric Tasks

9

General flow

Start → Document Retrieval: retrieve documents from the corpus

→ (optional) Document Classifier: is the document relevant? If not, skip it

→ Document Processor: process the document (Token Extraction)

→ Check: Recall ≥ Target Recall? If yes, done; otherwise, retrieve more documents

Page 10: Towards a Query Optimizer for Text-Centric Tasks

10

What are the available methods for retrieval?

Execution Strategies

Crawl-based: Scan (SC), Filtered Scan (FS)

Query-based: Iterative Set Expansion (ISE), Automatic Query Generation (AQG)

Page 11: Towards a Query Optimizer for Text-Centric Tasks

11

Execution Time – Generic Model

The slide's diagram shows a strategy issuing queries to and retrieving documents from the corpus; the generic model charges a query time tQ for each query sent, a retrieval time tR for each document retrieved, a filtering time tF for each document passed through the classifier (when one is used), and a processing time tP for each document processed by P.

Page 12: Towards a Query Optimizer for Text-Centric Tasks

12

Execution Time – Simplified

Time(S, D) = |Qsent| · tQ + |Dretr| · tR + |Dproc| · tP

(tQ: time per query sent; tR: time per document retrieved; tP: time per document processed)

Page 13: Towards a Query Optimizer for Text-Centric Tasks

13

Scan (SC)

Time(SC, D) = |Dretr| · (tR + tP)

Page 14: Towards a Query Optimizer for Text-Centric Tasks

14

Filtered Scan (FS)

Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP)

Cσ: selectivity of the classifier C, i.e., the fraction of database documents that C judges useful

The classifier C is trained once, offline.

Page 15: Towards a Query Optimizer for Text-Centric Tasks

15

Iterative Set Expansion (ISE)

Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP)

Page 16: Towards a Query Optimizer for Text-Centric Tasks

16

Automatic Query Generation (AQG)

Time(AQG, D) = |Qsent| · tQ + |Dretr| · (tR + tP)

Page 17: Towards a Query Optimizer for Text-Centric Tasks

17

Which strategy to use?

Crawling or querying? For text-centric tasks, a strategy is typically selected based on heuristics and intuition.

Page 18: Towards a Query Optimizer for Text-Centric Tasks

18

A More Disciplined Approach

Page 19: Towards a Query Optimizer for Text-Centric Tasks

19

Can we do better?

Define execution models

Estimate costs

Select the appropriate technique (Scan, Filtered Scan, ISE, or AQG) based on cost

Revisit the technique selection during execution

Page 20: Towards a Query Optimizer for Text-Centric Tasks

20

Formalizing the problem

Given a target recall value τ, the goal is to identify an execution strategy S among S1, . . . , Sn such that:

Recall(S, D) ≥ τ

Time(S, D) ≤ Time(Sj, D) for every strategy Sj with Recall(Sj, D) ≥ τ
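To make the selection rule concrete, here is a minimal Python sketch of how an optimizer could apply it, assuming the recall and time estimates for each strategy have already been produced by the cost models described later; the function and variable names, and all numbers, are illustrative assumptions rather than code from the paper.

```python
# Minimal sketch of the strategy-selection rule: among the strategies whose
# estimated recall meets the target tau, pick the one with the lowest
# estimated execution time.  Names and numbers are illustrative assumptions.

def choose_strategy(estimates, tau):
    """estimates: dict mapping strategy name -> (estimated_recall, estimated_time)."""
    feasible = {name: time for name, (recall, time) in estimates.items()
                if recall >= tau}
    if not feasible:
        return None  # no strategy is predicted to reach the target recall
    return min(feasible, key=feasible.get)

# Example usage with made-up estimates (recall, execution time in seconds):
estimates = {
    "Scan": (0.95, 90_000),
    "Filtered Scan": (0.80, 30_000),
    "ISE": (0.45, 5_000),
    "AQG": (0.70, 12_000),
}
print(choose_strategy(estimates, tau=0.60))  # prints: AQG
```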

Page 21: Towards a Query Optimizer for Text-Centric Tasks

21

Degrees

g(d): degree of a document, the # of distinct tokens extracted from d using P

g(t): degree of a token, the # of distinct documents in D from which P can extract t

g(q): degree of a query, the # of documents from D retrieved by query q

Duseful: the documents in D from which P extracts at least one token; Duseless: the remaining documents

Page 22: Towards a Query Optimizer for Text-Centric Tasks

22

Cost of Scan - 1

Time(SC, D) = |Dretr| · (tR + tP)

SC retrieves documents in no particular order and does not retrieve the same document twice.

SC effectively performs multiple token samplings from a finite population, in parallel, over D.

The probability of observing a token t exactly k times in a sample of size S follows a hypergeometric distribution.
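Spelling out the hypergeometric form referred to here (a standard definition, stated in the slide's notation, with S the number of documents Scan has processed):

```latex
\Pr\{\text{token } t \text{ appears in exactly } k \text{ of the } S \text{ sampled documents}\}
  = \frac{\binom{g(t)}{k}\,\binom{|D|-g(t)}{S-k}}{\binom{|D|}{S}}
```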

Page 23: Towards a Query Optimizer for Text-Centric Tasks

23

Cost of Scan - 2

Probability that token t does not appear in the sample of S documents:

Pr{t not in sample} = C(|D| - g(t), S) / C(|D|, S)

where |D| - g(t) is the # of documents in which the token does not appear, C(|D| - g(t), S) is the # of ways to select S documents from those |D| - g(t) docs, and C(|D|, S) is the # of ways to select S documents from all |D| docs.

Probability that token t appears in at least one document of the sample:

Pr{t in sample} = 1 - C(|D| - g(t), S) / C(|D|, S)

Expected number of tokens retrieved after processing S documents:

E[|Tokensretr|] = Σ over all tokens t of [ 1 - C(|D| - g(t), S) / C(|D|, S) ]

Page 24: Towards a Query Optimizer for Text-Centric Tasks

24

Cost of Scan - 3

We do not know the exact g(t) for each token. But we know the form of the token degree distribution (a power law). Thus, using estimates for the probabilities Pr{g(t) = i}:

E[|Tokensretr|] = |Tokens| · Σ over i ≥ 1 of Pr{g(t) = i} · [ 1 - C(|D| - i, S) / C(|D|, S) ]

From this, we can estimate the number of documents that Scan must retrieve to achieve a target recall.
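As a rough illustration of how such an estimate can be computed, here is a Python sketch that evaluates the expected number of distinct tokens after Scan processes S documents, and inverts it to find the documents needed for a target recall. The power-law exponent, the truncation of the infinite sum, and the search step are illustrative assumptions, not values from the paper.

```python
# Sketch: E[|Tokens_retr|] after Scan processes S of |D| documents, using
# Pr{token absent} = C(|D| - g, S) / C(|D|, S) and a truncated power-law prior
# Pr{g(t) = i} proportional to i^(-beta).  Exact binomial coefficients are used
# for clarity; for large corpora one would switch to log-gamma arithmetic.
from math import comb

def expected_tokens_scan(num_tokens, D, S, beta=2.0, max_degree=1000):
    Z = sum(i ** -beta for i in range(1, max_degree + 1))   # normalization
    expected_fraction = 0.0
    for i in range(1, max_degree + 1):
        p_i = (i ** -beta) / Z                               # Pr{g(t) = i}
        p_absent = comb(D - i, S) / comb(D, S) if D - i >= S else 0.0
        expected_fraction += p_i * (1.0 - p_absent)
    return num_tokens * expected_fraction

def scan_docs_for_recall(num_tokens, D, tau, step=1000, **kwargs):
    # Smallest S (up to |D|, in increments of `step`) whose expected token
    # yield reaches tau * |Tokens|; a simple linear search for readability.
    S = 0
    while S < D and expected_tokens_scan(num_tokens, D, S, **kwargs) < tau * num_tokens:
        S += step
    return min(S, D)
```

The linear search and exact combinatorics keep the sketch readable; a production estimator would work in log space and use a binary search over S.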

Page 25: Towards a Query Optimizer for Text-Centric Tasks

25

Cost of Filtered Scan

Classifier selectivity Cσ: the fraction of documents in D that the classifier C judges useful.

Classifier recall Cr: the fraction of useful documents in D that are also classified as useful by C. A uniform recall across tokens is assumed.

Cr · g(t): the number of documents (on average) in which each token is observed after filtering.

Page 26: Towards a Query Optimizer for Text-Centric Tasks

26

Cost of Filtered Scan

From this, we can estimate the number of documents that Filtered Scan must retrieve to achieve a target recall.

When Cσ is high, almost all documents in D are processed by P, and the behavior tends towards that of Scan
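Continuing the hypothetical sketch given for Scan, Filtered Scan can be approximated by replacing each token degree g(t) with the effective degree Cr · g(t), since only the documents the classifier accepts are processed. The rounding and default parameter values below are simplifications for illustration, not the paper's exact derivation.

```python
# Sketch: Filtered Scan only processes documents accepted by the classifier C,
# so each token is seen, on average, in about Cr * g(t) of its documents
# (uniform classifier recall assumed, as on the slide).  Illustrative only.
from math import comb

def expected_tokens_filtered_scan(num_tokens, D, S, Cr=0.85,
                                  beta=2.0, max_degree=1000):
    Z = sum(i ** -beta for i in range(1, max_degree + 1))
    expected_fraction = 0.0
    for i in range(1, max_degree + 1):
        p_i = (i ** -beta) / Z
        g_eff = round(Cr * i)                 # effective degree after filtering
        if g_eff == 0:
            p_absent = 1.0                    # token effectively never observed
        elif D - g_eff >= S:
            p_absent = comb(D - g_eff, S) / comb(D, S)
        else:
            p_absent = 0.0
        expected_fraction += p_i * (1.0 - p_absent)
    return num_tokens * expected_fraction
```

The corresponding time estimate would use Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP), charging the full processing cost tP only for the Cσ fraction of retrieved documents that the classifier accepts.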

Page 27: Towards a Query Optimizer for Text-Centric Tasks

27

Cost of ISE - Random Graph Model

A random graph is a collection of points, or vertices, with lines, or edges, connecting pairs of them at random.

The presence or absence of an edge between two vertices is independent of the presence or absence of any other edge, so that each edge may be considered to be present with independent probability p.

Page 28: Towards a Query Optimizer for Text-Centric Tasks

28

Cost of ISE – Querying Graph

The querying graph is a bipartite graph (V, E) with:

V = {tokens t} U {documents d}

E1 = {edges d -> t such that token t can be extracted from d}

E2 = {edges t -> d such that a query with t retrieves document d}

E = E1 U E2

Page 29: Towards a Query Optimizer for Text-Centric Tasks

29

Cost of ISE – With Generating Functions

Degree distribution of a randomly chosen document

Degree distribution of a randomly chosen token

p^d_k is the probability that a randomly chosen document d contains k tokens

p^t_k is the probability that a randomly chosen token t retrieves k documents (when used as a query)

Page 30: Towards a Query Optimizer for Text-Centric Tasks

30

Cost of ISE – With Generating Functions

degree distribution for a document chosen by following a random edge

degree distribution for a token chosen by following a random edge
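The generating functions themselves appeared as images on the original slides; following the random-graph framework of Newman, Strogatz, and Watts cited in the references, they take the standard forms below (a reconstruction, not a verbatim copy of the slides):

```latex
% Degree distributions of a randomly chosen document / token:
G_d(x) = \sum_{k} p^{d}_{k}\, x^{k},
\qquad
G_t(x) = \sum_{k} p^{t}_{k}\, x^{k}

% Degree distributions of a document / token reached by following a random
% edge: degrees are size-biased (weighted by k) and renormalized:
G_{d1}(x) = \frac{\sum_{k} k\, p^{d}_{k}\, x^{k}}{\sum_{k} k\, p^{d}_{k}}
          = x\,\frac{G_d'(x)}{G_d'(1)},
\qquad
G_{t1}(x) = x\,\frac{G_t'(x)}{G_t'(1)}
```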

Page 31: Towards a Query Optimizer for Text-Centric Tasks

31

Cost of ISE – Properties of Generating Functions
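The properties invoked on the following slides (Moments, Power, and Composition) are standard facts about probability generating functions; stated briefly, with F and G generating functions of non-negative integer random variables:

```latex
% Moments property: the mean of the distribution generated by G is
\langle k \rangle = G'(1)

% Power property: the sum of m independent draws from the distribution
% generated by G(x) has generating function
\left[G(x)\right]^{m}

% Composition property: if the number of draws is itself random with
% generating function F(x), the total has generating function
F\!\left(G(x)\right)
```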

Page 32: Towards a Query Optimizer for Text-Centric Tasks

32

Cost of ISE - Evaluation

Consider: ISE has sent a set Q of tokens as queries.

These tokens were discovered by following random edges on the graph, so their degree distribution is Gt1(x).

By the Power property, the distribution of the total number of retrieved documents (the documents pointed to by these tokens) is Gd2(x) = [Gt1(x)]^|Q|.

In Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP), this means |Dretr| is a random variable whose distribution is given by Gd2(x).

The retrieved documents are reached by following random edges on the graph, so their degree distribution is described by Gd1(x).

Page 33: Towards a Query Optimizer for Text-Centric Tasks

33

Cost of ISE - Evaluation

By the Composition property, the distribution of the total number of tokens |Tokensretr| retrieved by the Dretr documents is generated by Gd2(Gd1(x)).

Using the Moments property, we obtain the expected values of |Dretr| and |Tokensretr| after ISE sends |Q| queries, and from these the number of queries |Qsent| that Iterative Set Expansion must send to reach the target recall τ.
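A minimal numeric sketch of this last step, using the Moments property on the construction above: the expected yield per query is approximated by the mean degrees of edge-reached tokens and documents, and queries are added until the expected token count crosses the target. Overlap between queries (duplicate documents and tokens) is ignored, and all concrete numbers are illustrative assumptions.

```python
# Sketch: expected documents and tokens retrieved by Iterative Set Expansion
# after |Q| queries, via the Moments property (mean = G'(1)).
#   mean_t1 = G_t1'(1): mean # of documents retrieved by an edge-reached token
#   mean_d1 = G_d1'(1): mean # of tokens extracted from an edge-reached document
# Duplicates are not modeled, so this is an optimistic estimate.

def ise_expectations(num_queries, mean_t1, mean_d1):
    exp_docs = num_queries * mean_t1        # E[|D_retr|]      ~ |Q| * G_t1'(1)
    exp_tokens = exp_docs * mean_d1         # E[|Tokens_retr|] via composition
    return exp_docs, exp_tokens

def ise_queries_for_recall(num_tokens, tau, mean_t1, mean_d1, max_queries=100_000):
    # Smallest |Q_sent| whose expected token yield reaches tau * |Tokens|.
    for q in range(1, max_queries + 1):
        _, exp_tokens = ise_expectations(q, mean_t1, mean_d1)
        if exp_tokens >= tau * num_tokens:
            return q
    return None  # target recall not reachable under this estimate

# Example with made-up mean degrees for a Task 1a-sized token set:
print(ise_queries_for_recall(num_tokens=16_921, tau=0.3, mean_t1=8.0, mean_d1=2.5))
```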

Page 34: Towards a Query Optimizer for Text-Centric Tasks

34

Cost of AQG

Page 35: Towards a Query Optimizer for Text-Centric Tasks

35

Algorithms

Page 36: Towards a Query Optimizer for Text-Centric Tasks

36

Global Optimization

Choose a single execution strategy up front, based on the cost estimates for the entire execution.

Page 37: Towards a Query Optimizer for Text-Centric Tasks

37

Local Optimization

Re-estimate the costs as documents are processed and revisit the choice of strategy during execution, switching to a cheaper strategy when the estimates warrant it.

Page 38: Towards a Query Optimizer for Text-Centric Tasks

38

Probability, Distributions, and Parameter Estimation

Page 39: Towards a Query Optimizer for Text-Centric Tasks

39

Scan - Parameter Estimation

This relies on the characteristics of the token and document degree distributions. After retrieving and processing a few documents, we can estimate the distribution parameters from the frequencies of the initially extracted tokens and documents. Specifically, we can use a maximum-likelihood fit to estimate the parameters of the document degree distribution. For example, the document degrees for Task 1 tend to follow a power-law distribution, with probability mass function:

Pr{g(d) = k} = k^(-β) / ζ(β)

Goal: Estimate the most likely value of β, for a given sample of document degrees g(d1), . . . , g(ds)

ζ (β) is the Riemann zeta function (serves as a normalizing factor)

Use MLE to identify the value of β that maximizes the likelihood function:

L(β) = Π from i = 1 to s of [ g(di)^(-β) / ζ(β) ]

Page 40: Towards a Query Optimizer for Text-Centric Tasks

40

Scan - Parameter Estimation

Find the maximum: setting the derivative of the log-likelihood to zero gives

ζ′(β) / ζ(β) = -(1/s) · Σ from i = 1 to s of ln g(di)

and we can estimate the value of β from this equation using numeric approximation.
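A small Python sketch of such a numeric approximation, assuming the sample of document degrees is at hand; the grid search and the truncated zeta sum are simplifications chosen for readability (a root finder on the derivative condition above would be the faster route).

```python
# Sketch: maximum-likelihood fit of the power-law exponent beta from a sample
# of document degrees g(d_1), ..., g(d_s), by maximizing
#   log L(beta) = -beta * sum(ln g_i) - s * ln zeta(beta)
# over a grid of candidate values.  The zeta function is truncated.
import math

def log_zeta(beta, terms=10_000):
    # Truncated Riemann zeta; the truncation level is an illustrative choice.
    return math.log(sum(k ** -beta for k in range(1, terms + 1)))

def fit_power_law_beta(degrees, betas=None):
    betas = betas or [1.05 + 0.01 * i for i in range(400)]   # beta in (1, 5]
    s = len(degrees)
    sum_logs = sum(math.log(g) for g in degrees)
    def log_likelihood(beta):
        return -beta * sum_logs - s * log_zeta(beta)
    return max(betas, key=log_likelihood)

# Example with a made-up degree sample:
sample_degrees = [1, 1, 2, 1, 3, 1, 1, 5, 2, 1, 1, 2]
print(round(fit_power_law_beta(sample_degrees), 2))
```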

Page 41: Towards a Query Optimizer for Text-Centric Tasks

41

Scan – Token Distribution Estimation

To maximize the above, we take the logarithm, eliminate the factorials using Stirling's approximation, and equate the derivative to zero to find the maximum.

Page 42: Towards a Query Optimizer for Text-Centric Tasks

42

Filtered Scan – Parameter Estimation

Page 43: Towards a Query Optimizer for Text-Centric Tasks

43

ISE – Parameter Estimation

Page 44: Towards a Query Optimizer for Text-Centric Tasks

44

AQG – Parameter Estimation

Page 45: Towards a Query Optimizer for Text-Centric Tasks

45

Experimental Setting and Results

Page 46: Towards a Query Optimizer for Text-Centric Tasks

46

Details of the Experiments

• Tuple extraction from New York Times archives
• Categorized word frequency computation for Usenet newsgroups
• Document retrieval on Botany from the Internet

Page 47: Towards a Query Optimizer for Text-Centric Tasks

47

Task 1a, 1b – Information Extraction

Document Processor: Snowball

1a: Extracting a Disease-Outbreaks relation, tuple (DiseaseName, Country)

1b: Extracting a Headquarters relation, tuple (Organization, Location)

Token: a single tuple of the target relation

Document: a news article from The New York Times archive

Corpus: Newspaper articles from The New York Times, published in 1995 (NYT95) and 1996 (NYT96)

NYT95 documents for training

NYT96 documents for evaluation of the alternative execution strategies

NYT96 features: 182,531 documents; 16,921 tokens (Task 1a); 605 tokens (Task 1b)

Document Classifier: RIPPER

g(d): power-law distribution; g(t): power-law distribution

Page 48: Towards a Query Optimizer for Text-Centric Tasks

48

Task 1a, 1b – Information Extraction

FS: Rule Based Classifier (RIPPER)

RIPPER trained with a set of 500 useful documents and 1500 not useful documents from the NYT95 data set

AQG: 2000 documents from the NYT95 data set as a training set to create the queries required by Automatic Query Generation

ISE: construct queries using the AND operator over the attributes of each tuple (e.g., the tuple (typhus, Belize) becomes the query [typhus AND Belize])

ISE/AQG: maximum # of returned documents - 100

Page 49: Towards a Query Optimizer for Text-Centric Tasks

49

Task 2 - Content Summary Construction

• Separate documents into topics based on the high-level name of the newsgroup (comp, sci)
• Train a rule-based classifier using RIPPER; creates rules to assign documents into categories
• Final queries contain the antecedents of the rules, across all categories

Document Processor: Simple Tokenizer

Token: a word and its frequency

Document: A Usenet message

Corpus: 20 Newsgroups data set from the UCI KDD Archive. Contains 20,000 messages

g(d): lognormal distribution; g(t): power-law distribution

Extracting words and their frequencies from newsgroups

FS: not applicable (all documents are useful)

ISE: queries are constructed using words that appear in previously retrieved documents

ISE/AQG: maximum # of returned documents - 100

AQG Modus operandi

Page 50: Towards a Query Optimizer for Text-Centric Tasks

50

Task 3 – Focused Resource Discovery

• Train a RIPPER classifier using the Botany training set
• Create a set of rules that assign documents to the Botany category
• Use the rule antecedents as queries

Document Processor: Multinomial Naïve Bayes Classifier

Token: URL of page on Botany

Document: Web page

Corpus: 800,000 pages with 12,000 relevant to Botany

g(d): lognormal distribution; g(t): power-law distribution

Retrieving documents on Botany from the Internet

ISE/AQG: maximum # of returned documents - 100

AQG Modus operandi

Page 51: Towards a Query Optimizer for Text-Centric Tasks

51

Task 3 – Database Building

• Retrieve 8,000 pages listed in Open Directory under: Top -> Science -> Biology -> Botany
• Select 1,000 documents as training documents
• Create a multinomial Naive Bayes classifier that decides whether a Web page is about Botany
• For each of the downloaded Botany pages:
  – extract backlinks with Google
  – classify the retrieved pages
  – for each page classified as "Botany", repeat the backlink extraction
• Continue until none of the backlinks is classified under Botany

Page 52: Towards a Query Optimizer for Text-Centric Tasks

52

Task 3 – Database Attributes

• Around 12,000 pages on Botany, pointed to by approximately 32,000 useless documents
• Augment the useless documents:
  – pick 10 more random topics from the third level of the Open Directory hierarchy
  – download all the Web pages listed under these topics, for a total of approximately 100,000 pages
• Final data set: around 800,000 pages in total, of which 12,000 are relevant to Botany

Page 53: Towards a Query Optimizer for Text-Centric Tasks

53

Task 3 – Modus Operandi

• SC: a classifier decides whether each retrieved page belongs to the category of choice
• FS: a focused crawler starts from a few Botany Web pages and visits a Web page only when at least one of the documents that points to it is useful
• AQG: train a RIPPER classifier using the training set and create a set of rules that assign documents to the Botany category

Page 54: Towards a Query Optimizer for Text-Centric Tasks

Evaluation – Model Accuracy

54

(Figures: results for Task 1a, Task 1b, Task 2, and Task 3)

Page 55: Towards a Query Optimizer for Text-Centric Tasks

55

Evaluation – Global vs. Actual

(Figures: results for Task 1a, Task 1b, Task 2, and Task 3)

Page 56: Towards a Query Optimizer for Text-Centric Tasks

56

Evaluation – Global vs. Local

(Figures: results for Task 1a, Task 1b, Task 2, and Task 3)

Page 57: Towards a Query Optimizer for Text-Centric Tasks

57

Conclusion

• Introduced a rigorous cost model for several query- and crawl-based execution strategies that underlie the implementation of many text-centric tasks
• Developed principled cost estimation approaches for the introduced model
• Analyzed the models to predict the execution time and output completeness of important query- and crawl-based algorithms, and to select a strategy accordingly; until now these algorithms were only evaluated empirically, with limited theoretical justification
• Demonstrated that the suggested modeling can be successfully used to create optimizers for text-centric tasks
• Showed that the optimizers help build efficient execution plans for a target recall, resulting in executions that can be orders of magnitude faster than alternate choices

Page 58: Towards a Query Optimizer for Text-Centric Tasks

58

References

• Generating functions
• Sampling from a finite population: http://blog.data-miners.com/2008/05/agent-problem-sampling-from-finite.html
• Random graphs with arbitrary degree distributions and their applications (Newman, Strogatz, Watts)
• Probability distributions:
  – http://en.wikipedia.org/wiki/Hypergeometric_distribution
  – http://en.wikipedia.org/wiki/Pareto_distribution
  – http://en.wikipedia.org/wiki/Power_law
  – http://en.wikipedia.org/wiki/Zipf's_law
• MLE: http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html

Page 59: Towards a Query Optimizer for Text-Centric Tasks

59

Thanks!

Page 60: Towards a Query Optimizer for Text-Centric Tasks

60

Backup Slides

Page 61: Towards a Query Optimizer for Text-Centric Tasks

61

Probability, Distributions and Estimations

Page 62: Towards a Query Optimizer for Text-Centric Tasks

62

Distributions, Models, and Likelihood

Page 63: Towards a Query Optimizer for Text-Centric Tasks

63

MLE

Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: given a data set and a model, MLE selects the parameter values that maximize the likelihood of the observed data.

Page 64: Towards a Query Optimizer for Text-Centric Tasks

64

Zipf’s Law

Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Page 65: Towards a Query Optimizer for Text-Centric Tasks

65

Power Law

An example power-law graph demonstrating ranking of popularity: to the right is the long tail, and to the left are the few that dominate (also known as the 80-20 rule). A power-law distribution has probability mass proportional to k^(-β).

Page 66: Towards a Query Optimizer for Text-Centric Tasks

66

Hypergeometric Distribution

Discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement
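In symbols (the standard definition, with N the population size, K the number of successes in the population, and n the number of draws):

```latex
\Pr\{X = k\} = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}}
```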

Page 67: Towards a Query Optimizer for Text-Centric Tasks

67

Binomial Distribution

Describes the number of successes in a sequence of n draws with replacement (equivalently, n independent Bernoulli trials with the same success probability).