Towards a Query Optimizer for Text-Centric Tasks
Transcript of "Towards a Query Optimizer for Text-Centric Tasks"
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano
Presenter: Avinandan Sengupta
2
Session Outline
Text-Centric Tasks
Methods Employed
A More Disciplined Approach
Proposed Algorithm
Experimental Setup
Results
Conclusion
3
Scenario I
Construction of a table of disease outbreaks from a newspaper archive
sample tuples
Task 1
Information Extraction
4
Scenario II
Word        Frequency
Samsung     2900
Nokia       2500
Blackberry  2000
Apple       1900
...         ...
Tabulating the number of times an organization’s name appears on a particular web site
Task 2
Content Summary Construction
5
Scenario III
Task 3
Discovering pages on Botany on the Internet
Focused Resource Discovery
6
Text-centric tasks
Types
Information Extraction
Content Summary Construction
Focused Resource Discovery
7
Performing Text-Centric Tasks
8
Recall in Text-Centric Tasks

Tokens: the set of tokens that the document processor P can extract from the corpus. For a strategy S that has processed a set of documents from the corpus, recall is the fraction of these tokens that S has actually extracted:

Recall(S, D) = |Tokens_retr| / |Tokens|
9
General flow

Start → Document Retrieval: retrieve documents from the corpus
→ Document Classifier (optional): is the document relevant? If yes (Y), continue
→ Document Processor: process the document (token extraction)
→ Check: recall ≥ target recall? If yes (Y), done; otherwise retrieve more documents
10
What are the available methods for retrieval?

Execution Strategies

Crawl-based:
Scan (SC)
Filtered Scan (FS)

Query-based:
Iterative Set Expansion (ISE)
Automatic Query Generation (AQG)
11
Execution Time – Generic Model

Time(S, D) = |Qsent| · tQ + |Dretr| · (tR + tF + tP)

where S is the strategy, D the corpus, tQ the time to issue a query, tR the time to retrieve a document, tF the time to filter (classify) it, and tP the time to process it; the terms a given strategy does not use drop out.
12
Execution Time – Simplified
13
Scan (SC)

Time(SC, D) = |Dretr| · (tR + tP)
14
Filtered Scan (FS)

Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP)

Cσ: selectivity of C, the fraction of database documents that the classifier C judges useful

(training the classifier is a one-time, offline cost)
15
Iterative Set Expansion (ISE)

Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP)
16
Automatic Query Generation (AQG)

Time(AQG, D) = |Qsent| · tQ + |Dretr| · (tR + tP)
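As a sketch, the four cost formulas can be written directly as functions of the timing parameters and document/query counts; all input values here are hypothetical, not measurements from the paper:

```python
def time_scan(d_retr, t_r, t_p):
    """Time(SC, D) = |Dretr| * (tR + tP)"""
    return d_retr * (t_r + t_p)

def time_filtered_scan(d_retr, t_r, t_f, t_p, c_sigma):
    """Time(FS, D) = |Dretr| * (tR + tF + Csigma * tP);
    c_sigma is the classifier selectivity."""
    return d_retr * (t_r + t_f + c_sigma * t_p)

def time_query_based(q_sent, d_retr, t_q, t_r, t_p):
    """Time(ISE, D) = Time(AQG, D) = |Qsent| * tQ + |Dretr| * (tR + tP)"""
    return q_sent * t_q + d_retr * (t_r + t_p)
```

Note that ISE and AQG share the same cost expression; they differ in how |Qsent| and |Dretr| grow with the target recall.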
17
Which strategy to use?

Crawling vs. Querying

For text-centric tasks, a strategy is typically selected based on heuristics/intuition.
18
A More Disciplined Approach
19
Can we do better?

Define execution models
Estimate costs
Select the appropriate technique based on cost
Revisit technique selection during execution
Scan
Filtered Scan
AQG
ISE
20
Formalizing the problem

Given a target recall value τ, the goal is to identify an execution strategy S among S1, ..., Sn such that:

Recall(S, D) ≥ τ
Time(S, D) ≤ Time(Sj, D) for every Sj with Recall(Sj, D) ≥ τ
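The selection rule above can be sketched as a small helper; the triple format and the estimate values are hypothetical:

```python
def choose_strategy(strategies, tau):
    """Return the fastest strategy whose estimated recall meets the target tau.

    strategies: hypothetical list of (name, est_recall, est_time) triples.
    Returns None if no strategy reaches the target recall.
    """
    feasible = [s for s in strategies if s[1] >= tau]
    return min(feasible, key=lambda s: s[2]) if feasible else None
```

For example, with estimates [("SC", 0.9, 100), ("ISE", 0.95, 40), ("AQG", 0.5, 10)] and τ = 0.8, the rule picks ISE: AQG is cheapest but misses the recall target.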
21
Degrees

g(d), degree of a document: # of distinct tokens extracted from d using P
g(t), degree of a token: # of distinct documents in D from which P can extract t
g(q), degree of a query: # of documents from D retrieved by query q

Duseful: the documents in D from which P can extract tokens
Duseless: the remaining documents
22
Cost of Scan - 1

Time(SC, D) = |Dretr| · (tR + tP)

SC retrieves documents in no particular order and does not retrieve the same document twice. SC effectively performs multiple token samplings from a finite population, in parallel over D. The probability of observing a token t exactly k times in a sample of size S follows the hypergeometric distribution.
23
Cost of Scan - 2

Probability that token t does not appear in a sample of S documents (# of ways to select S documents from the |D| - g(t) documents in which the token does not appear, over the # of ways to select S documents from |D| docs):

Pr{t not retrieved} = C(|D| - g(t), S) / C(|D|, S)

Probability that token t appears in at least one retrieved document:

Pr{t retrieved} = 1 - C(|D| - g(t), S) / C(|D|, S)

Expected number of tokens retrieved after processing S documents:

E[|Tokens_retr|] = Σ_t (1 - C(|D| - g(t), S) / C(|D|, S))
24
Cost of Scan - 3

We do not know the exact g(t) for each token, but we know the form of the token degree distribution (a power-law distribution). Thus, using estimates for the probabilities Pr{g(t) = i}:

E[|Tokens_retr|] = |Tokens| · Σ_i Pr{g(t) = i} · [1 - C(|D| - i, S) / C(|D|, S)]

Inverting this expression yields the estimated # of documents that must be retrieved to achieve a target recall.
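The hypergeometric argument above can be sketched in a few lines; the degree-probability mapping is a hypothetical input standing in for the estimated Pr{g(t) = i}:

```python
from math import comb

def p_token_seen(D, g_t, S):
    """Probability that a token with degree g(t) = g_t appears in at least
    one of S documents sampled without replacement from |D| = D documents:
    1 - C(D - g_t, S) / C(D, S)."""
    return 1.0 - comb(D - g_t, S) / comb(D, S)

def expected_tokens(D, S, degree_probs, n_tokens):
    """E[|Tokens_retr|] = |Tokens| * sum_i Pr{g(t)=i} * (1 - C(D-i,S)/C(D,S)).

    degree_probs: hypothetical mapping degree i -> Pr{g(t) = i}."""
    return n_tokens * sum(p * p_token_seen(D, i, S)
                          for i, p in degree_probs.items())
```

For instance, in a 4-document corpus a token appearing in 2 documents is seen in a 1-document sample with probability 0.5; a token appearing in every document is seen with probability 1.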
25
Cost of Filtered Scan

Classifier selectivity Cσ: the fraction of database documents that the classifier C judges useful.

Classifier recall Cr: the fraction of useful documents in D that are also classified as useful by the classifier. A uniform recall is assumed across tokens.

Cr · g(t): # of times each token appears (on average) among the documents that pass the classifier.
26
Cost of Filtered Scan
Estimated # of documents retrieved to achieve a target recall
When Cσ is high, almost all documents in D are processed by P, and the behavior tends towards that of Scan
27
Cost of ISE - Random Graph ModelA random graph is a collection of points, or vertices, with lines, or edges, connecting pairs of them at random
The presence or absence of an edge between two vertices is independent of the presence or absence of any other edge, so that each edge may be considered to be present with independent probability p.
28
Cost of ISE – Querying Graph

Querying Graph: a bipartite graph (V, E) with
V = {tokens t} ∪ {documents d}
E1 = {edges d → t, such that token t can be extracted from d}
E2 = {edges t → d, such that a query with t retrieves document d}
E = E1 ∪ E2
29
Cost of ISE – With Generating Functions

Degree distribution of a randomly chosen document: Gd(x) = Σ_k pdk x^k
Degree distribution of a randomly chosen token: Gt(x) = Σ_k ptk x^k

pdk is the probability that a randomly chosen document d contains k tokens
ptk is the probability that a randomly chosen token t retrieves k documents
30
Cost of ISE – With Generating Functions

Degree distribution for a document chosen by following a random edge (such a document is reached with probability proportional to its degree, k · pdk):

Gd1(x) = x · Gd′(x) / Gd′(1)

Degree distribution for a token chosen by following a random edge:

Gt1(x) = x · Gt′(x) / Gt′(1)
31
Cost of ISE – Properties of Generating Functions

Moments: E[X] = G′(1)
Power: the sum of n i.i.d. variables, each with generating function G(x), has generating function [G(x)]^n
Composition: if N has generating function GN(x) and each of N i.i.d. variables has generating function G(x), their total has generating function GN(G(x))
32
Cost of ISE - Evaluation

Consider: ISE has sent a set Q of tokens as queries.

These tokens were discovered by following random edges on the graph, so their degree distribution is Gt1(x). By the Power property, the distribution of the total number of retrieved documents (which are pointed to by these tokens) is:

Gd2(x) = [Gt1(x)]^|Q|

In Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP), this implies that |Dretr| is a random variable whose distribution is given by Gd2(x).

The retrieved documents are reached by following random edges on the graph; hence, their degree distribution is described by Gd1(x).
33
Cost of ISE - Evaluation

By the Composition property, the distribution of the total number of tokens |Tokens_retr| retrieved by the Dretr documents is Gd2(Gd1(x)).

Using the Moments property, we obtain the expected values of |Dretr| and |Tokens_retr| after ISE sends Q queries, and from these the number of queries |Qsent| that Iterative Set Expansion must send to reach the target recall τ.
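A minimal sketch of the three generating-function properties used here (Moments, Power, Composition), representing a generating function by its coefficient list p[k] = Pr{degree = k}; the distributions are toy examples:

```python
def gf_mean(p):
    """Moments property: E[X] = G'(1) = sum_k k * p[k] for G(x) = sum_k p[k] x^k."""
    return sum(k * pk for k, pk in enumerate(p))

def power_mean(p, n):
    """Power property: the sum of n i.i.d. draws has GF [G(x)]^n,
    so its mean is n * G'(1)."""
    return n * gf_mean(p)

def composition_mean(outer, inner):
    """Composition property: if N ~ G_outer and each of N i.i.d. draws ~ G_inner,
    the total has GF G_outer(G_inner(x)), with mean G_outer'(1) * G_inner'(1)."""
    return gf_mean(outer) * gf_mean(inner)
```

This mirrors the ISE analysis: the mean of |Dretr| follows from the Power property over |Q| queried tokens, and the mean of |Tokens_retr| from composing the document-count and tokens-per-document distributions.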
34
Cost of AQG
35
Algorithms
36
Global Optimization
37
Local Optimization
38
Probability, Distributions, Parameter Estimation
39
Scan - Parameter Estimation

This relies on the characteristics of the token and document degree distributions. After retrieving and processing a few documents, we can estimate the distribution parameters from the frequencies of the initially extracted tokens and documents. Specifically, we can use a maximum-likelihood fit to estimate the parameters of the document degree distribution. For example, the document degrees for Task 1 tend to follow a power-law distribution, with probability mass function:

Pr{g(d) = i} = i^(-β) / ζ(β)
Goal: Estimate the most likely value of β, for a given sample of document degrees g(d1), . . . , g(ds)
ζ (β) is the Riemann zeta function (serves as a normalizing factor)
Use MLE to identify the value of β that maximizes the likelihood function:
40
Scan - Parameter Estimation
Find the maxima:

Setting the derivative of the log-likelihood to zero, we can estimate the value of β using numeric approximation.
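The MLE step can be sketched with a crude grid search; the truncated zeta normalizer and the search range are assumptions of this sketch, not the paper's exact numeric method:

```python
import math

def log_zeta(beta, terms=10000):
    """Truncated Riemann zeta (crude numeric stand-in for the normalizer)."""
    return math.log(sum(i ** -beta for i in range(1, terms + 1)))

def log_likelihood(beta, degrees):
    """log L(beta) = -beta * sum(log g(d_i)) - n * log zeta(beta)."""
    return (-beta * sum(math.log(g) for g in degrees)
            - len(degrees) * log_zeta(beta))

def fit_beta(degrees, lo=1.1, hi=5.0, steps=200):
    """Grid-search MLE for beta over a sample of observed document degrees."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda b: log_likelihood(b, degrees))
```

A sample dominated by degree-1 documents yields a steep (large) β, while a sample with many high-degree documents yields a flatter (small) β, as expected for a power law.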
41
Scan – Token Distribution Estimation
To maximize the above, we take the log, eliminate factorials using Stirling's approximation, and equate the derivative to zero to find the maximum.
42
Filtered Scan – Parameter Estimation
43
ISE – Parameter Estimation
44
AQG – Parameter Estimation
45
Experimental Setting and Results
46
Details of the Experiments

• Tuple extraction from New York Times archives
• Categorized word frequency computation for Usenet newsgroups
• Document retrieval on Botany from the Internet
47
Task 1a, 1b – Information Extraction
Document Processor: Snowball
1a: Extracting a Disease-Outbreaks relation, tuple (DiseaseName, Country)
1b: Extracting a Headquarters relation, tuple (Organization, Location)
Token: a single tuple of the target relation
Document: a news article from The New York Times archive
Corpus: newspaper articles from The New York Times, published in 1995 (NYT95) and 1996 (NYT96)
NYT95 documents for training
NYT96 documents for evaluation of the alternative execution strategies
NYT96 features: 182,531 documents; 16,921 tokens (Task 1a); 605 tokens (Task 1b)
Document Classifier: RIPPER
g(d): power-law distribution; g(t): power-law distribution
48
Task 1a, 1b – Information Extraction
FS: Rule Based Classifier (RIPPER)
RIPPER trained with a set of 500 useful documents and 1500 not useful documents from the NYT95 data set
AQG: 2000 documents from the NYT95 data set as a training set to create the queries required by Automatic Query Generation
ISE: construct queries using the AND operator over the attributes of each tuple (tuple (typhus, Belize) → [typhus AND Belize])
ISE/AQG: maximum # of returned documents: 100
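The ISE query construction described above amounts to:

```python
def tuple_to_query(attrs):
    """ISE query construction from the slides: AND together the attributes
    of an extracted tuple, e.g. ('typhus', 'Belize') -> '[typhus AND Belize]'."""
    return "[" + " AND ".join(attrs) + "]"
```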
49
Task 2 - Content Summary Construction
• Separate documents into topics based on the high-level name of the newsgroup (comp, sci)
• Train a rule-based classifier using RIPPER; it creates rules to assign documents into categories
• Final queries contain the antecedents of the rules, across all categories
Document Processor: Simple Tokenizer
Token: a word and its frequency
Document: A Usenet message
Corpus: 20 Newsgroups data set from the UCI KDD Archive; contains 20,000 messages
g(d): lognormal distribution; g(t): power-law distribution
Extracting words and their frequency from newsgroup
FS: not applicable (all documents useful)
ISE: queries are constructed using words that appear in previously retrieved documents
ISE/AQG: maximum # of returned documents: 100
AQG Modus operandi
50
Task 3 – Focused Resource Discovery
Document Processor: Multinomial Naïve Bayes Classifier
Token: URL of a page on Botany
Document: Web page
Corpus: 800,000 pages with 12,000 relevant to Botany
g(d): lognormal distribution; g(t): power-law distribution
Retrieving document on Botany from the Internet
ISE/AQG: maximum # of returned documents - 100
51
Task 3 – Database Building

• Retrieve 8,000 pages listed in Open Directory under: Top -> Science -> Biology -> Botany
• Select 1,000 documents as training documents
• Create a multinomial Naive Bayes classifier that decides whether a Web page is about Botany
• For each of the downloaded Botany pages:
– extract backlinks with Google
– classify retrieved pages
– for each page classified as "Botany", repeat backlink extraction
• Stop when none of the backlinks is classified under Botany
52
Task 3 – Database Attributes
• Around 12,000 pages on Botany, pointed to by approximately 32,000 useless documents
• Augment the useless documents:
– picked 10 more random topics from the third level of the Open Directory hierarchy
– downloaded all the Web pages listed under these topics, for a total of approximately 100,000 pages
• Final data set: around 800,000 pages in total, of which 12,000 are relevant to Botany
53
Task 3 – Modus Operandi
• SC: a classifier decides whether each retrieved page belongs to the category of choice
• FS: a focused crawler starts from a few Botany Web pages and visits a Web page only when at least one of the documents that points to it is useful
• AQG: train a RIPPER classifier using the training set and create a set of rules that assign documents into the Botany category
Evaluation – Model Accuracy
54
Task 1a
Task 1b
Task 2
Task 3
55
Evaluation – Global vs. ActualTask 1a
Task 1b
Task 2
Task 3
56
Evaluation – Global vs. LocalTask 1a
Task 1b
Task 2
Task 3
57
Conclusion

• Introduced a rigorous cost model for several query- and crawl-based execution strategies that underlie the implementation of many text-centric tasks
• Developed principled cost-estimation approaches for the introduced models
• Analyzed the models to predict the execution time and output completeness of important query- and crawl-based algorithms, and to select a strategy accordingly; until now these were only evaluated empirically, with limited theoretical justification
• Demonstrated that the suggested modeling can be successfully used to create optimizers for text-centric tasks
• Showed that the optimizers help build efficient execution plans to achieve a target recall, resulting in executions that can be orders of magnitude faster than alternate choices
58
References

• Generating functions
• Sampling from a finite population: http://blog.data-miners.com/2008/05/agent-problem-sampling-from-finite.html
• Random graphs with arbitrary degree distributions and their applications
• Probability distributions:
– http://en.wikipedia.org/wiki/Hypergeometric_distribution
– http://en.wikipedia.org/wiki/Pareto_distribution
– http://en.wikipedia.org/wiki/Power_law
– http://en.wikipedia.org/wiki/Zipf's_law
• MLE: http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html
59
Thanks!
60
Backup Slides
61
Probability, Distributions and Estimations
62
Distributions, Models, and Likelihood
63
MLE
Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: given a data set and a model, MLE selects the parameter values that maximize the likelihood of the observed data.
64
Zipf's Law

Given some corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
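As a toy illustration (the top-rank frequency is a hypothetical input):

```python
def zipf_frequency(rank, top_frequency):
    """Zipf's law: the frequency of the rank-r word is approximately
    (frequency of the most frequent word) / r."""
    return top_frequency / rank
```

So if the most frequent word occurs 1000 times, the second most frequent occurs about 500 times and the fourth about 250 times.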
65
Power Law
An example power-law graph demonstrating ranking of popularity: to the right is the long tail, and to the left are the few that dominate (also known as the 80-20 rule).
66
Hypergeometric Distribution
Discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement
67
Binomial Distribution
Describes the number of successes in a sequence of n draws with replacement.
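Both distributions can be sketched with exact combinatorics, highlighting the without-replacement vs. with-replacement contrast:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws WITHOUT replacement from a
    population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(X = k): k successes in n independent draws WITH replacement,
    each succeeding with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)
```

For large populations the hypergeometric PMF approaches the binomial PMF with p = K/N, which is why sampling with replacement is often an acceptable approximation.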