Parallel and Distributed Information Retrieval System

Special Topics in Computer Science

Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):

Parallel and Distributed IR

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

How to accelerate search? (Same results as sequential search.) Ideas:
- Quick-and-dirty rejection of bad objects, 100% recall
- Fast data structure for search (based on clustering)
- Careful check of all found candidates

Solution: mapping into a lower-dimensional feature space
- Condition: lower-bounding of the distance
- Assumption: skewed spectrum distribution
  - Few coefficients concentrate energy, rest are less important

Previous Chapter: Research topics

- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data
  - What features to select? How to determine them?
  - Mixed-type data (e.g., webpages, or images with sound and description)
- What clustering/IR methods are better suited for what features? (What features for what methods?)
- Similar methods in data mining, ...

The problem

- Very large document collections
  - Google: 4,000,000,000 pages
  - Slow response?
- Solution: parallel computing
  - Google: 10,000 computers

Parallel architectures

                        Data stream
Instruction stream      Single              Multiple
Single                  SISD (classical)    SIMD (simple)
Multiple                MISD (rare)         MIMD (many SISD)

MIMD architecture

- The most common
- Can be
  - tightly coupled
  - loosely coupled
- Distributed
  - Many computers interacting via network
  - PC clusters
  - Similar to MIMD computers, but with greater cost of communication: very loosely coupled
  - More coarse-grained programs

Performance improvement

- Time: speedup S
  - Ideally, N times (N = number of processors)
  - In practice impossible
    - The problem does not decompose into N equal parts
    - Communication and control overhead
    - S < 1 / f, where f is the fraction of the problem that must be executed sequentially (Amdahl's law)
- Cost
  - Per processor (efficiency): S / N
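The speedup bound can be checked numerically. The sketch below assumes Amdahl's law in its usual form, S = 1 / (f + (1 - f) / N), taking f to be the fraction of the work that must run sequentially; the function names are illustrative and not from the lecture.

```python
def amdahl_speedup(f_seq: float, n: int) -> float:
    """Speedup on n processors when a fraction f_seq of the work is sequential."""
    if not 0.0 <= f_seq <= 1.0:
        raise ValueError("f_seq must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f_seq + (1.0 - f_seq) / n)


def efficiency(f_seq: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return amdahl_speedup(f_seq, n) / n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.1, n), 2), round(efficiency(0.1, n), 3))
```

With f = 0.1 the speedup stays below 1 / f = 10 no matter how many processors are added, while the efficiency S / N keeps falling.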

Two approaches to parallelism

- Build new algorithms
  - E.g., neural nets
  - Naturally parallel
  - Problem: to define the retrieval task
- Adapt the existing techniques to parallelism
  - Allows relying on well-studied approaches
  - We will consider this option

Ways to use parallelism

- Multitasking
  - N search engines
  - Good for processing many queries
  - Problems:
    - A single query is not sped up
    - Bottleneck: disk access (index)
    - Possible solution: replicating (part of) the data. RAIDs
- Parallel algorithms
  - IR = data. Main question: how to partition the data
  - Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)

Possible partitionings

- Horizontal: document partitioning. Union of results
- Vertical: term partitioning. Basically, intersect results
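The two partitionings can be illustrated on a toy document/term matrix. This is a minimal sketch, not the book's algorithm: the matrix, the round-robin assignment of documents and terms to processors, and the function names are all assumptions, and the "processors" are simulated by a plain loop. Queries are conjunctive (AND).

```python
# Toy document/term matrix: each document maps to the set of terms it contains.
MATRIX = {
    "d1": {"parallel", "retrieval"},
    "d2": {"parallel", "index"},
    "d3": {"retrieval", "index"},
    "d4": {"parallel", "retrieval", "index"},
}


def search_document_partitioned(query, n_parts):
    """Horizontal: each processor owns some documents; results are united."""
    docs = sorted(MATRIX)
    parts = [docs[i::n_parts] for i in range(n_parts)]
    result = set()
    for part in parts:  # each iteration stands for one processor
        result |= {d for d in part if query <= MATRIX[d]}
    return result


def search_term_partitioned(query, n_parts):
    """Vertical: each processor owns some terms; results are intersected."""
    vocabulary = sorted(set().union(*MATRIX.values()))
    owner = {term: i % n_parts for i, term in enumerate(vocabulary)}
    if any(term not in owner for term in query):
        return set()  # a term that occurs nowhere matches no document
    result = set(MATRIX)
    for p in range(n_parts):  # each iteration stands for one processor
        local_terms = {t for t in query if owner[t] == p}
        result &= {d for d, terms in MATRIX.items() if local_terms <= terms}
    return result


if __name__ == "__main__":
    q = {"parallel", "retrieval"}
    print(sorted(search_document_partitioned(q, 2)))
    print(sorted(search_term_partitioned(q, 2)))
```

Both functions return the same documents as a sequential scan; they differ only in which slice of the matrix each processor has to look at.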

Inverted files: Logical partitioning

- Logical vs. physical document partitioning
- Logical: for each term, use pointers into the inverted file data for each processor, to indicate its portion
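One way to picture the per-processor pointers is as slice boundaries into a single posting list. The sketch below assumes that posting lists are sorted by document id and that each processor owns a contiguous interval of document ids; the `boundaries` layout and the function name are hypothetical.

```python
from bisect import bisect_left


def partition_pointers(postings, boundaries):
    """Return one (start, end) slice of `postings` per processor.

    postings   -- sorted list of document ids for one term
    boundaries -- [b0, b1, ..., bP]; processor p owns ids in [b_p, b_{p+1})
    """
    cuts = [bisect_left(postings, b) for b in boundaries]
    return list(zip(cuts[:-1], cuts[1:]))


if __name__ == "__main__":
    postings = [1, 4, 5, 9, 12, 20]
    for p, (start, end) in enumerate(partition_pointers(postings, [0, 5, 10, 25])):
        print("processor", p, "scans", postings[start:end])
```

The inverted file itself stays in one piece; only the small table of pointers is specific to a processor.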

Inverted files: Logical partitioning. Construction and updating

- Also parallel
- Construction
  - Assign docs to processors
  - Order docs such that each processor has an interval
  - Process in parallel
  - Merge. Each piece is ordered already
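The construction steps can be sketched as follows. This is an illustration under assumptions: documents are (id, text) pairs, worker threads stand in for processors, and terms are whitespace-separated tokens. Because each processor handles an interval of document ids and the intervals are in order, merging reduces to concatenating the local posting lists.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_local_index(docs):
    """Index one interval of documents; docs are (doc_id, text), ids ascending."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for term in set(text.split()):
            index[term].append(doc_id)
    return index


def build_index(intervals):
    """Index all intervals in parallel, then merge the already ordered pieces."""
    with ThreadPoolExecutor(max_workers=len(intervals)) as pool:
        # map() returns the results in the order of the intervals
        local_indexes = list(pool.map(build_local_index, intervals))
    merged = defaultdict(list)
    for local in local_indexes:
        for term, postings in local.items():
            merged[term].extend(postings)  # no re-sorting is needed
    return dict(merged)


if __name__ == "__main__":
    intervals = [[(1, "a b"), (2, "b c")], [(3, "a c"), (4, "a")]]
    print(build_index(intervals))
```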

Inverted files: Physical document partitioning

- Several separate collections, one per processor
- Separate indices
- Then the lists are merged (they are already ordered)
- A priority queue is used
  - The queue is not fully sorted, so insertion is quick
  - The maximal element can be found quickly
  - First k elements can be found rather quickly
  - Details in the book
- Consistent scores are needed
  - Global statistics are needed. They can be computed at index time
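Merging the per-processor result lists with a priority queue can be sketched with Python's standard `heapq` module. The sketch assumes each processor returns (score, doc_id) pairs sorted by decreasing score and that the scores are already comparable across processors (i.e., computed with global statistics); `merge_top_k` is an illustrative name.

```python
import heapq
from itertools import islice


def merge_top_k(ranked_lists, k):
    """Merge ranked lists (best first) and return the k best hits overall."""
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))  # lazy: only k elements are pulled


if __name__ == "__main__":
    per_processor = [
        [(0.9, "a"), (0.4, "b")],
        [(0.8, "c"), (0.7, "d")],
        [(0.5, "e")],
    ]
    print(merge_top_k(per_processor, 3))
```

`heapq.merge` keeps only one pending element per input list in its heap, so the first k results are available without sorting the full union.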

Logical or physical partitioning?

- Logical requires less communication
  - Faster
- Physical is more flexible
  - Simpler implementation
  - Simpler conversion of existing systems

Inverted files: Term partitioning

- Each processor processes a part of the inverted file
- The results are intersected (for AND), or combined as appropriate for the other Boolean operations (OR and NOT)
- When term distribution in user queries is skewed, document partitioning is better
- When uniform, term partitioning is better: by a factor of two for long queries, 5 to 10 for short (Web-like) queries

Suffix arrays

- Array construction can be parallelized
  - Merges are parallel
- Document partitioning is applied straightforwardly
  - Each processor maintains its own suffix array
- Term partitioning can be applied
  - Each processor owns a branch of the tree (lexicographic interval)
  - Bottleneck: all processors need access to the entire text
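Document partitioning with suffix arrays can be sketched as one small suffix array per processor, searched independently. This is a toy version under assumptions: the suffix array is built by naive sorting (fine for short texts only), each partition is a single string, and the function names are illustrative.

```python
def build_suffix_array(text):
    """Naive construction: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def search_partitions(partitions, pattern):
    """Each processor searches its own text; results are keyed by partition."""
    results = {}
    for part_id, text in enumerate(partitions):  # one iteration per processor
        sa = build_suffix_array(text)  # built once at index time in practice
        results[part_id] = find_occurrences(text, sa, pattern)
    return results


if __name__ == "__main__":
    print(search_partitions(["banana", "cabana"], "ana"))
```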

Signature files

- Document partitioning: straightforward
  - Create query signature, distribute to each processor
  - Merge results (using Boolean operations if needed)
- Term partitioning: shorter signatures
  - Merging and eliminating false drops is slow
  - This method is not recommended
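The document-partitioned scheme can be sketched as follows. The signature construction here (superimposed coding with 64-bit signatures and 3 bits per term derived from SHA-256) is an assumption made for the example, not a parameter choice from the lecture; the partitions are scanned in a loop that stands in for the processors.

```python
import hashlib

SIG_BITS = 64       # assumed signature width
BITS_PER_TERM = 3   # assumed number of bits set per term


def term_signature(term):
    """Set BITS_PER_TERM pseudo-random bits chosen by hashing the term."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig


def text_signature(text):
    """Superimpose (OR together) the signatures of all terms in the text."""
    sig = 0
    for term in text.split():
        sig |= term_signature(term)
    return sig


def search(partitions, query):
    """Send the query signature to every partition and merge the hits."""
    query_terms = set(query.split())
    query_sig = text_signature(query)
    hits = []
    for partition in partitions:  # one iteration per processor
        for doc_id, text in partition:
            # In a real system the document signature is stored, not recomputed.
            if text_signature(text) & query_sig == query_sig:  # candidate
                if query_terms <= set(text.split()):  # eliminate false drops
                    hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    partitions = [
        [(1, "parallel retrieval"), (2, "signature files")],
        [(3, "parallel signature files"), (4, "inverted files")],
    ]
    print(search(partitions, "signature files"))
```

A signature match can be a false drop but never a miss, so the final check against the text only removes candidates.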

SIMD computers

- Single Instruction, Multiple Data
- Uncommon
- Good for simple operations
  - Bit operations in signature files
  - Details in the book
- Ranking is supported in hardware in some computers
- If the signature file does not fit into memory, it can be processed in batches
  - I/O overhead
  - Use multiple queries with the same batch
  - This improves throughput, but not response time
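The batching idea can be sketched in ordinary sequential code: each batch of signatures is loaded once and then matched against every pending query before the next batch is read. The inner loop over a batch stands in for what a SIMD machine would do in one step on all elements; signatures are plain integers and the function name is illustrative.

```python
def process_in_batches(signatures, query_sigs, batch_size):
    """Return, per query, the indices of candidate documents.

    A document is a candidate if its signature contains all bits of the
    query signature; candidates may still include false drops.
    """
    results = [[] for _ in query_sigs]
    for start in range(0, len(signatures), batch_size):
        batch = signatures[start:start + batch_size]  # one load from disk
        for q, query_sig in enumerate(query_sigs):    # reuse the loaded batch
            for offset, sig in enumerate(batch):
                if sig & query_sig == query_sig:
                    results[q].append(start + offset)
    return results


if __name__ == "__main__":
    sigs = [0b1011, 0b0110, 0b1111, 0b0001, 0b0111]
    print(process_in_batches(sigs, [0b0011, 0b0110], batch_size=2))
```

Every query still waits for all batches to pass, which is why the throughput improves while the response time of a single query does not.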

… SIMD computers

- Inverted files are difficult to adapt to SIMD
  - The inverted file is restructured
  - Details in the book

Distributed IR

- MIMD with
  - Slow communication
  - Not all nodes are used for a given query
  - Encryption issues
- Document partitioning is usually used
  - Term partitioning imposes greater communication overhead
- Document clustering can be useful (to distribute docs by processors)
  - Index clusters and then search only the best ones
  - Another approach: use training queries, then similarity of the user query to these
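Searching only the best clusters can be sketched as ranking the nodes by the similarity between the query and a per-node centroid. The sketch assumes each node is summarized by a term-frequency centroid and uses cosine similarity; the node names, the centroids, and `select_nodes` are made up for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(query, node_centroids, m):
    """Return the m nodes whose centroids are most similar to the query."""
    q = Counter(query.split())
    ranked = sorted(node_centroids,
                    key=lambda node: cosine(q, node_centroids[node]),
                    reverse=True)
    return ranked[:m]


if __name__ == "__main__":
    centroids = {
        "sports": Counter({"football": 5, "goal": 3}),
        "cooking": Counter({"recipe": 4, "oven": 2}),
        "tech": Counter({"parallel": 4, "index": 3}),
    }
    print(select_nodes("parallel index retrieval", centroids, 1))
```

Only the selected nodes receive the query, which trades some recall for lower communication cost.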

Research topics

- How to evaluate the speedup
- New algorithms
- Adaptation of existing algorithms
- Merging the results is a bottleneck
- Meta search engines
- Creating large collections with judgements
  - Is recall important?

Conclusions

- Parallel computing can improve
  - response time for each query and/or
  - throughput: number of queries processed with the same speed
- Document partitioning is simple
  - good for distributed computing
- Term partitioning is good for some data structures
- Distributed computing is MIMD computing with slow communication
- SIMD machines are good for signature files
  - Both are out of favor now

Thank you!

Till May 17? 18?, 6 pm
