Tamer Elsayed
Qatar University

On Large-Scale Retrieval Tasks with Ivory and MapReduce

Nov 7th, 2012




My Field …

Information Retrieval (IR) is …
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections

● Quite effective (at some things)
● Highly visible (mostly)
● Commercially successful (some of them)


IR is not just “Document Retrieval”
● Clustering and Classification
● Question answering
● Filtering, tracking, routing
● Recommender systems
● Leveraging XML and other Metadata
● Text mining
● Novelty identification
● Meta-search (multi-collection searching)
● Summarization
● Cross-language mechanisms
● Evaluation techniques
● Multimedia retrieval
● Social media analysis
● …


My Research …

[Figure: large-scale text processing, from emails to web pages. Emails: Enron (~500,000), with identity resolution as the user application. Web pages: ClueWeb (~1,000,000,000), with web search as the user application.]


Back in 2009 …

Before 2009, only small text collections were available
● Largest: ~1M documents

ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments

MapReduce/Hadoop seems like a promising framework


Ivory (http://ivory.cc)
● E2E search toolkit using MapReduce
● Completely designed for the Hadoop environment
● Experimental platform for research
● Supports common text collections, plus ClueWeb09
● Open source release
● Implements state-of-the-art retrieval models


MapReduce Framework

(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by keys
(c) Reduce: (k2, [v2]) → [(k3, v3)]

[Figure: four inputs feed four map tasks; shuffling groups values by key; three reduce tasks each write one output.]

Framework handles “everything else”!
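The three stages can be sketched in a few lines of Python. This is only an in-memory illustration of the (k1, v1) → [(k2, v2)] and (k2, [v2]) → [(k3, v3)] signatures, not Hadoop's API; `run_mapreduce`, `count_mapper`, `count_reducer`, and the toy documents are hypothetical names chosen for the example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory (a sketch, not Hadoop's API)."""
    # (a) Map: each (k1, v1) record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    # Emit (term, 1) for every token in the document.
    return [(term, 1) for term in text.split()]

def count_reducer(term, counts):
    # Sum the counts collected for one term.
    return [(term, sum(counts))]

# Toy input, assumed for illustration only.
docs = [("A", "Clinton Obama Clinton"),
        ("B", "Clinton Romney"),
        ("C", "Clinton Barack Obama")]
term_counts = run_mapreduce(docs, count_mapper, count_reducer)
```

In a real cluster the framework would also handle scheduling, data distribution, and fault tolerance, which is the “everything else” the slide refers to.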


The IR Black Box

[Figure: a query and documents go into a black box; hits come out.]


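Inside the IR Black Box

[Figure: a representation function turns the query into a query representation (online); another representation function turns the documents into a document representation stored in an index (offline); a comparison function matches the query representation against the index to produce hits.]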


Indexing

Collection (documents, IDs):
A: Clinton Obama Clinton
B: Clinton Cheney
C: Clinton Barack Obama

Inverted index (terms, posting lists):
Clinton → (A, 2), (C, 1), (B, 1)
Obama → (A, 1), (C, 1)
Cheney → (B, 1)
Barack → (C, 1)


Indexing

Collection (documents, IDs):
A: Clinton Obama Clinton
B: Clinton Romney
C: Clinton Barack Obama

Inverted index (terms, posting lists):
Clinton → (A, 2), (C, 1), (B, 1)
Obama → (A, 1), (C, 1)
Romney → (B, 1)
Barack → (C, 1)


Indexing: (a) Map (b) Shuffle (c) Reduce

[Figure: each map task reads one document (A: Clinton Obama Clinton; B: Clinton Romney; C: Clinton Barack Obama) and emits its terms with the document ID and count; shuffling groups the postings by term; each reduce task writes the posting list for one term (Clinton, Obama, Romney, Barack).]
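A minimal sketch of indexing as map, shuffle, and reduce, assuming whitespace tokenization and in-memory data. The function names and the posting order (decreasing term frequency, then document ID) are assumptions made for this example and are not taken from Ivory.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    # Map: emit one (term, (doc_id, tf)) pair per distinct term in the document.
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(term, postings):
    # Reduce: order the postings to form the term's posting list.
    # Sorting by decreasing tf, then doc ID, is an assumption of this sketch.
    return term, sorted(postings, key=lambda p: (-p[1], p[0]))

def build_index(docs):
    groups = defaultdict(list)  # Shuffle: group postings by term.
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            groups[term].append(posting)
    return dict(index_reduce(term, postings) for term, postings in groups.items())

docs = {"A": "Clinton Obama Clinton",
        "B": "Clinton Romney",
        "C": "Clinton Barack Obama"}
index = build_index(docs)
```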


Retrieval Directly from HDFS!

Cute hack: use Hadoop to launch partition servers
● Embed an HTTP server inside each mapper
● Mappers start up, initialize servers, enter into infinite service loop!

Why do this?
● Unified Hadoop ecosystem
● Simplifies data management issues

[Figure: a search client talks to a retrieval broker, which queries the partition servers; the partition servers read index partitions from HDFS (one namenode, several datanodes) or from local disk; the two setups are labeled TREC’10 and TREC’09.]


Roadmap (Ivory)

Indexing & Retrieval
● Batch Retrieval (TREC 2009, TREC 2010)
● Approx. Pos. Indexes (CIKM 2011)

Pairwise Similarity
● Monolingual (ACL 2008)
● Cross-Lingual (SIGIR 2011)

Pseudo Test Collection
● Training L2R (SIGIR 2011)

Iterative Process
● iHadoop (CloudCom 2011)


Roadmap: Pairwise Similarity
● Monolingual (ACL 2008)
● Cross-Lingual (SIGIR 2011)


Abstract Problem

[Figure: a collection of documents is turned into a matrix of pairwise similarity scores (e.g., 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74).]

Applications:
● Clustering
● Coreference resolution
● “more-like-that” queries


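Decomposition

Each term contributes only if it appears in both documents of a pair
● map: compute each term's partial contribution to a document pair
● reduce: sum the partial contributions for each pair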
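The formula on this slide did not survive extraction. A hedged reconstruction, assuming the similarity is an inner product of term weights (the symbols $w_{t,d}$ for the weight of term $t$ in document $d$ and $V$ for the vocabulary are assumptions, not taken from the slide):

```latex
\mathrm{sim}(d_i, d_j)
  \;=\; \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  \;=\; \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

Under this reading, the products $w_{t,d_i}\, w_{t,d_j}$ are what the map step generates per term, and the outer sum is what the reduce step computes per document pair.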


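Pairwise Similarity: (a) Generate pairs (b) Group pairs (c) Sum pairs

[Figure: for each term's posting list (Clinton, Barack, Romney, Obama), every pair of documents in the list is generated with the product of the two term weights; the partial scores are grouped by document pair and summed.]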
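The three steps can be sketched over a toy inverted index. This is a minimal in-memory illustration that assumes raw term frequencies as weights; `pairwise_similarity` is a hypothetical name, and a real job would distribute steps (a) to (c) across mappers and reducers.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(index):
    """Sum per-term weight products for every document pair sharing a term."""
    scores = defaultdict(int)
    for term, postings in index.items():
        # (a) Generate pairs: every two documents in the same posting list.
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            # (b) Group by the key (d1, d2) and (c) sum the partial scores.
            scores[(d1, d2)] += w1 * w2
    return dict(scores)

# Toy index with term frequencies as weights (an assumption of this sketch).
index = {
    "Clinton": [("A", 2), ("B", 1), ("C", 1)],
    "Obama": [("A", 1), ("C", 1)],
    "Romney": [("B", 1)],
    "Barack": [("C", 1)],
}
similarity = pairwise_similarity(index)
```

Terms that occur in a single document (here “Romney” and “Barack”) generate no pairs at all, which is why the computation only touches document pairs that share at least one term.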


Terms: Zipfian Distribution

[Plot: doc freq (df) vs. term rank]

Each term t contributes O(df_t^2) partial results

Very few terms dominate the computations
● most frequent term (“said”): 3%
● most frequent 10 terms: 15%
● most frequent 100 terms: 57%
● most frequent 1000 terms: 95%

~0.1% of total terms (99.9% df-cut)
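A rough sketch of why a df-cut helps, assuming a term with document frequency df generates df·(df−1)/2 document pairs. The function names and the toy document frequencies are made up for illustration.

```python
def pair_count(df):
    # A term that occurs in df documents generates df*(df-1)/2 document pairs.
    return df * (df - 1) // 2

def apply_df_cut(doc_freqs, cut=0.999):
    """Keep the `cut` fraction of terms with the lowest df and drop the rest."""
    ranked = sorted(doc_freqs, key=doc_freqs.get)  # rarest terms first
    keep = round(len(ranked) * cut)
    return {term: doc_freqs[term] for term in ranked[:keep]}

# Hypothetical document frequencies, for illustration only.
doc_freqs = {"said": 1000, "year": 400, "clinton": 50, "obama": 20, "ivory": 2}
kept = apply_df_cut(doc_freqs, cut=0.8)  # drops only the most frequent term

pairs_before = sum(pair_count(df) for df in doc_freqs.values())
pairs_after = sum(pair_count(df) for df in kept.values())
```

In this toy example, dropping the single most frequent term removes most of the intermediate pairs, which mirrors the slide's point that very few terms dominate the computation.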


Efficiency (disk space)

[Plot: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100), with one curve each for no df-cut and for df-cut at 99.999%, 99.99%, 99.9%, and 99%. Annotations: 8 trillion intermediate pairs; 0.5 trillion intermediate pairs.]

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~906k docs


Effectiveness: effect of df-cut on effectiveness

Medline04: 909k abstracts, ad-hoc retrieval

[Plot: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]

Drop 0.1% of terms
● “Near-linear” growth
● Fit on disk
● Cost 2% in effectiveness

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

ACL’08


Cross-Lingual Pairwise Similarity
● Find similar document pairs in different languages
● Multilingual text mining, Machine Translation
● Application: automatic generation of potential “interwiki” language links

More difficult than monolingual!


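Vocabulary Space Matching

MT:
● Doc A (German) → MT translate → Doc A (English) → doc vector A
● Doc B (English) → doc vector B

CLIR:
● Doc A (German) → doc vector A (German) → CLIR project → doc vector A (English)
● Doc B (English) → doc vector B

CLIR projection:
$\mathrm{df}(e) = \sum_{f} p(e \mid f)\,\mathrm{df}(f)$
$\mathrm{tf}(e) = \sum_{f} p(e \mid f)\,\mathrm{tf}(f)$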
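A minimal sketch of the CLIR projection step, assuming each English count is the sum over source terms f of p(e|f) times the source count. The translation table and its probabilities are invented for illustration; `project` is a hypothetical helper, not Ivory's API.

```python
from collections import defaultdict

def project(vector_f, trans_prob):
    """Project source-language counts into the target vocabulary:
    count(e) = sum over f of p(e|f) * count(f)."""
    vector_e = defaultdict(float)
    for f, count in vector_f.items():
        # Source terms missing from the table contribute nothing.
        for e, p in trans_prob.get(f, {}).items():
            vector_e[e] += p * count
    return dict(vector_e)

# Hypothetical translation probabilities p(e|f); the values are made up.
trans_prob = {
    "preis": {"prize": 0.5, "price": 0.5},
    "buch": {"book": 1.0},
}
tf_german = {"preis": 2, "buch": 1}
tf_english = project(tf_german, trans_prob)
```

The same function can be applied to document frequencies, so that both tf and df live in the English vocabulary before the document vectors are weighted.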


Locality-Sensitive Hashing (LSH)
● Cosine score is a good similarity measure but expensive!
● LSH is a method for effectively reducing the search space when looking for similar pairs
● Each vector is converted into a compact representation, called a signature
● A sliding window-based algorithm uses these signatures to search for similar articles in the collection
● Vectors close to each other are likely to have similar signatures


Solution Overview

[Figure: Nf German articles and Ne English articles are preprocessed, with CLIR projection applied to the German side, giving Ne+Nf English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>); signature generation (Random Projection/Minhash/Simhash) turns them into Ne+Nf signatures (e.g., 0111000010111100001010); the sliding window algorithm then outputs similar article pairs.]
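Of the three signature types named on the slide, random projection is the simplest to sketch: each bit records on which side of a random hyperplane the document vector falls. The code below is a minimal illustration with hypothetical function names, a fixed seed, and a three-term vocabulary; it is not the implementation used in the talk.

```python
import random

def make_hyperplanes(vocab, num_bits, seed=0):
    # One random Gaussian vector over the vocabulary per signature bit.
    rng = random.Random(seed)
    return [{term: rng.gauss(0.0, 1.0) for term in vocab} for _ in range(num_bits)]

def signature(vector, hyperplanes):
    # Bit i is 1 if the vector lies on the non-negative side of hyperplane i.
    bits = []
    for plane in hyperplanes:
        dot = sum(weight * plane[term] for term, weight in vector.items())
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def hamming(a, b):
    # Number of positions at which two signatures differ.
    return sum(x != y for x, y in zip(a, b))

vocab = ["nobel", "prize", "book"]
planes = make_hyperplanes(vocab, num_bits=16)
doc = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
sig = signature(doc, planes)
```

Because only the sign of each dot product matters, rescaling a vector leaves its signature unchanged, and vectors separated by a small angle tend to differ in few bits, so Hamming distance between signatures can stand in for cosine similarity.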


MapReduce 1: Table Generation Phase

[Figure: the signatures are permuted with Q permutations p1 … pQ, giving lists S1 … SQ; each list is then sorted, giving the tables S1’ … SQ’.]


MapReduce 2: Detection Phase

[Figure: the sorted tables of signatures are split into table chunks for detection.]
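The two phases can be sketched together in a single process: permute the bits of every signature, sort the permuted signatures into a table, then slide a window over each table and compare neighbors by Hamming distance. The parameter names (`num_tables`, `window`, `max_dist`) and the toy signatures are assumptions for this example, and the chunking across reducers is left out.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def sliding_window_pairs(signatures, num_tables=4, window=2, max_dist=1, seed=0):
    """Return document pairs whose signatures are within max_dist bits,
    comparing only neighbors inside a window of each sorted table."""
    rng = random.Random(seed)
    num_bits = len(next(iter(signatures.values())))
    found = set()
    for _ in range(num_tables):
        # Table generation: apply one random bit permutation, then sort.
        perm = list(range(num_bits))
        rng.shuffle(perm)
        table = sorted(("".join(sig[i] for i in perm), doc)
                       for doc, sig in signatures.items())
        # Detection: compare each entry with the next `window` entries.
        for i, (sig_i, doc_i) in enumerate(table):
            for sig_j, doc_j in table[i + 1:i + 1 + window]:
                if hamming(sig_i, sig_j) <= max_dist:
                    found.add(tuple(sorted((doc_i, doc_j))))
    return found

# Toy signatures, made up for illustration.
signatures = {"de1": "01110000", "en1": "01110001", "en2": "10001111"}
pairs = sliding_window_pairs(signatures)
```

Sorting brings signatures that agree on their leading (permuted) bits next to each other, so a small window finds most close pairs without comparing every pair; using several permutations gives pairs that differ in an early bit another chance to become neighbors.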


Evaluation

Ground truth:
● Sample 1064 German articles
● Cosine score >= 0.3

Compare sliding window with brute force approach
● Required for exact solution
● Good reference as an upper bound for recall and running time


Evaluation

95% recall at 39% cost

99% recall at 62% cost

No Free Lunch!


Contribution to Wikipedia

Identify links between German and English Wikipedia articles
● “Metadaten” → “Metadata”, “Semantic Web”, “File Format”
● “Pierre Curie” → “Marie Curie”, “Pierre Curie”, “Helene Langevin-Joliot”
● “Kirgisistan” → “Kyrgyzstan”, “Tulip Revolution”, “2010 Kyrgyzstani uprising”, “2010 South Kyrgyzstan riots”, “Uzbekistan”

Bad results when there is a significant difference in length.

SIGIR’11


Roadmap: Indexing & Retrieval
● Approx. Pos. Indexes (CIKM 2011)


Approximate Positional Indexes

“Learning to Rank” models learn effective ranking functions
● Proximity features rely on term positions
● Term positions: large index (X), slow query evaluation (X)
● Approximate positions: smaller index (√), faster query evaluation (√)

Close Enough is Good Enough?


Variable-Width Buckets

5 buckets / document

[Figure: documents d1 and d2 have different lengths; each is split into 5 buckets (1 to 5), so the bucket width varies with document length.]


Fixed-Width Buckets

Buckets of length W

[Figure: the shorter document d2 is split into 3 buckets (1 to 3) and the longer document d1 into 5 buckets (1 to 5), all of the same width W.]
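Both bucketing schemes replace an exact term position with a bucket number. A minimal sketch, assuming 0-based token positions and 1-based bucket IDs; the function names and the example document lengths are hypothetical.

```python
def variable_width_bucket(pos, doc_len, num_buckets=5):
    # Every document gets num_buckets buckets, so width scales with doc_len.
    return pos * num_buckets // doc_len + 1

def fixed_width_bucket(pos, width):
    # Every bucket covers `width` positions, so longer documents get more buckets.
    return pos // width + 1

# Last token of a 100-token document and of a 50-token document.
long_doc_var = variable_width_bucket(99, doc_len=100)
short_doc_var = variable_width_bucket(49, doc_len=50)
long_doc_fixed = fixed_width_bucket(99, width=20)
short_doc_fixed = fixed_width_bucket(49, width=20)
```

With variable-width buckets both documents end in bucket 5; with fixed-width buckets of W = 20 the longer document ends in bucket 5 and the shorter one in bucket 3. In either case the index stores a small bucket number in place of an exact position.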


Effectiveness

CIKM’11


Roadmap: Pseudo Test Collections (SIGIR ’11)
● Training L2R
● Evaluation


Test Collections
● Documents, queries, and relevance judgments
● Important driving force behind IR innovation
● Without test collections, it’s impossible to:
  • Evaluate search systems
  • Tune ranking functions / train models

Traditional
● Exhaustive
● Pooling

Recent Methodologies
● Behavioral logging (query logs, click logs, etc.)
● Minimal test collections
● Crowdsourcing


Web Graph

[Figure: a web graph of pages P1 to P7; several links carry the anchor text “web search”, and others carry “SIGIR 2012” and “Google”.]


Queries and Judgments?

[Figure: a web graph of pages P1 to P7 whose links carry anchor texts such as “web search”, “SIGIR 2012”, “Bing”, and “Google”.]

● anchor text lines ≈ pseudo queries
● target pages ≈ relevant candidates
● noise reduction?
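A toy sketch of the idea: group links by anchor text, treat each anchor text as a pseudo query, and treat the pages it points to as candidate relevant documents. The noise filter used here (keep a target only if at least `min_sources` distinct pages link to it with that anchor text) is purely an assumption for illustration, since the slide leaves noise reduction as an open question; the link data is made up.

```python
from collections import defaultdict

def build_pseudo_collection(links, min_sources=2):
    """Map each anchor text (pseudo query) to its candidate relevant pages."""
    sources = defaultdict(lambda: defaultdict(set))
    for source, anchor, target in links:
        # Normalize the anchor text so that case differences are merged.
        sources[anchor.lower().strip()][target].add(source)
    collection = {}
    for anchor, by_target in sources.items():
        # Assumed noise filter: require several distinct linking pages.
        relevant = sorted(t for t, srcs in by_target.items()
                          if len(srcs) >= min_sources)
        if relevant:
            collection[anchor] = relevant
    return collection

# Hypothetical (source page, anchor text, target page) triples.
links = [
    ("P1", "web search", "P5"),
    ("P2", "web search", "P5"),
    ("P3", "Web Search", "P5"),
    ("P4", "Google", "P5"),
    ("P6", "SIGIR 2012", "P7"),
]
pseudo_collection = build_pseudo_collection(links)
```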


SIGIR’11


Roadmap: Iterative Process
● iHadoop (CloudCom 2011)


Iterative MapReduce Applications
● Many machine learning and data mining applications
  • PageRank, k-means, HITS, …
● Every iteration has to wait until the previous iteration has written its output completely to the DFS (unnecessary waiting time)
● Every iteration starts by reading from the DFS what has just been written by the earlier iteration (wastes CPU time, I/O, bandwidth)

MapReduce is not designed to run iterative applications efficiently
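The pattern can be seen in a small PageRank sketch, where every iteration is one map-and-reduce pass that reads the complete output of the previous pass. The dictionary `dfs` is only a stand-in for the distributed file system, and the graph, damping factor, and iteration count are assumptions for this example.

```python
from collections import defaultdict

def pagerank_iteration(graph, ranks, damping=0.85):
    # Map: each page sends an equal share of its rank to its out-links.
    contributions = defaultdict(float)
    for page, out_links in graph.items():
        for target in out_links:
            contributions[target] += ranks[page] / len(out_links)
    # Reduce: each page combines the shares it received.
    n = len(graph)
    return {page: (1 - damping) / n + damping * contributions[page]
            for page in graph}

# Toy graph: a three-page cycle with no dangling pages.
graph = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
dfs = {"iter-0": {page: 1 / len(graph) for page in graph}}  # stand-in for the DFS

for i in range(1, 11):
    previous = dfs[f"iter-{i - 1}"]  # read what the earlier iteration wrote
    dfs[f"iter-{i}"] = pagerank_iteration(graph, previous)  # write before the next job
final_ranks = dfs["iter-10"]
```

Each pass through the loop cannot start until the previous result has been fully written and read back, which is the waiting and re-reading cost the slide describes.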


Goal


Asynchronous Pipeline

CloudCom’11


Conclusion

MapReduce allows large-scale processing over web data

Ivory
● E2E open-source IR engine for research
● Completely on Hadoop
  • even retrieval: from HDFS

Efficiency-effectiveness tradeoff
● Cross-Lingual Pairwise Similarity
  • Efficient implementation using MapReduce
  • Efficiency-effectiveness tradeoff
● Approx. Positional Indexes
  • Efficient and as effective as exact positions
● Pseudo Test Collections
  • Possible!
  • Effective for training L2R models

MapReduce is not good for iterative algorithms

http://ivory.cc


Collaborators
● Jimmy Lin
● Don Metzler
● Doug Oard
● Ferhan Ture
● Nima Asadi
● Lidan Wang
● Eslam Elnikety
● Hany Ramadan


Thank You!

Questions?