Multimedia Information Retrieval in Networked Digital Libraries Norbert Fuhr University of Duisburg-Essen Germany


Multimedia Information Retrieval in Networked Digital Libraries

Norbert Fuhr

University of Duisburg-Essen

Germany


MIND Project

• Task: Development of methods for accessing large numbers of multimedia digital libraries through a single system
• Funded by the EC under FP5 (2/01-7/03)
• Participants:
  – Carnegie-Mellon University
  – University of Duisburg-Essen
  – University of Florence
  – University of Sheffield
  – University of Strathclyde


MIND Architecture

[Figure: architecture diagram. Components shown: User Interface and Data fuser (each appears twice), a Dispatcher, Proxy1 to Proxy4, Wrapper1 to Wrapper4, and the digital libraries DL1 to DL4, which hold fact, text, image and speech data.]


Query Processing

1. Query Transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
2. Resource Selection: determine relevant libraries (more cost-effective than querying all libraries)
3. Database Query: run the query on each selected library (task of the “wrappers”)
4. Document Transformation: map between heterogeneous schemas (Dublin Core, MARC 21)
5. Data Fusion: fuse library results into one single ranked list or several clusters


Retrieval in Federated Digital Libraries

Tasks:
• Extract DL metadata (“resource description”)
• Select relevant libraries (“resource selection”)
• Communicate with libraries (“wrapper”)
• Combine results (“data fusion”)


Resource Description


Resource Description: Query-based Sampling

• For non-cooperative DLs
• Generate sequence of queries
• Collect answer documents (~ 300)
• Generate resource description from sampled documents
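The sampling loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `search` stands in for whatever query interface a non-cooperative library exposes, documents are plain strings, the next query term is drawn at random from the documents seen so far, and the resource description is reduced to document frequencies. None of these choices are taken from the slides.

```python
import random
from typing import Callable, Dict, List, Optional

def query_based_sample(
    search: Callable[[str], List[str]],   # hypothetical library query interface
    seed_term: str,
    target_docs: int = 300,
    max_queries: int = 1000,
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Sample a library by querying it; return term -> document frequency."""
    rng = rng or random.Random(0)
    sampled: List[str] = []      # distinct answer documents collected so far
    seen = set()
    vocabulary = [seed_term]     # candidate terms for the next query
    for _ in range(max_queries):
        if len(sampled) >= target_docs:
            break
        term = rng.choice(vocabulary)
        for doc in search(term):
            if doc not in seen:
                seen.add(doc)
                sampled.append(doc)
                vocabulary.extend(doc.lower().split())
    # Resource description built from the sample only (assumed: document frequencies).
    description: Dict[str, int] = {}
    for doc in sampled:
        for term in set(doc.lower().split()):
            description[term] = description.get(term, 0) + 1
    return description

# Toy library used only for this example.
LIBRARY = ["digital library retrieval", "image retrieval features", "speech recognition errors"]

def toy_search(term: str) -> List[str]:
    return [d for d in LIBRARY if term in d.split()]

description = query_based_sample(toy_search, "retrieval", target_docs=3)
```

In this toy run the third document shares no term with the other two, so the sampler never reaches it. That is a known limitation of sampling purely through the query interface.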


Resource description for images

• Feature vectors are clustered at different levels of granularity
• For each cluster we consider:
  – Cluster centre c_i
  – Cluster radius r_i
  – Cluster population P_i
• The smaller the cluster radii, the finer the approximation of feature vector density.
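As a rough sketch, the three per-cluster quantities could be computed as below. The slide does not define the radius, so taking it as the largest Euclidean distance from the centre to a member is an assumption, and `describe_cluster` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def describe_cluster(members: List[Vector]) -> Tuple[List[float], float, int]:
    """Return (centre c_i, radius r_i, population P_i) for one cluster."""
    population = len(members)
    dims = len(members[0])
    # Centre: component-wise mean of the member feature vectors.
    centre = [sum(v[k] for v in members) / population for k in range(dims)]
    # Radius (assumed definition): largest distance from the centre to a member.
    radius = max(math.dist(centre, v) for v in members)
    return centre, radius, population

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
centre, radius, population = describe_cluster(cluster)
```

Running the same function on clusterings of different granularity yields the multi-level description: more, smaller clusters give smaller radii and a finer picture of where the feature vectors lie.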


Demo: Resource descriptions for images

• Java application demonstrating the resource description process for images:
  – Selection of resource to be described
  – Acquisition of individual document descriptors
  – Processing of document descriptors and extraction of resource descriptors at several granularity levels


Resource selection


Resource selection

• Querying all accessible DLs is too expensive
  – Thus route query only to “best” libraries
• Two competing approaches:
  – Resource ranking:
    • Compute similarity of DL to query (heuristically)
    • Select fixed number of top-ranked DLs
  – Decision-theoretic framework:
    • Assign costs to retrieval (money, time, retrieval quality)
    • Compute selection which minimises costs
    • Number of selected DLs not fixed
    • Described in this talk


Multimedia retrieval model

• Query conditions c (attribute, predicate, comparison value), weight Pr(q ← c)
  – E.g. term, year number, colour histogram
• Probabilistic indexing weights Pr(c ← d)
  – E.g. BM25 for text, histogram similarity for images
• Linear retrieval function:

$$\Pr(q \leftarrow d) = \sum_{c_i \in q} \underbrace{\Pr(q \leftarrow c_i)}_{\text{condition weight}} \cdot \underbrace{\Pr(c_i \leftarrow d)}_{\text{indexing weight}}$$

• Mapping onto Pr(rel|q,d):
  – Linear/logistic function
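A minimal sketch of the linear retrieval function: the score Pr(q ← d) is the sum, over the query's conditions, of condition weight times indexing weight. The dictionary representation and the function name are illustrative assumptions. A condition that the document does not satisfy is given indexing weight 0.

```python
from typing import Dict

def linear_retrieval_score(condition_weights: Dict[str, float],
                           indexing_weights: Dict[str, float]) -> float:
    """Pr(q <- d) = sum over conditions c in q of Pr(q <- c) * Pr(c <- d)."""
    return sum(weight * indexing_weights.get(condition, 0.0)
               for condition, weight in condition_weights.items())

# Toy query with two equally weighted conditions; the document matches only one.
query = {"retrieval": 0.5, "image": 0.5}
document = {"retrieval": 0.4, "library": 0.3}
score = linear_retrieval_score(query, document)
```

Because the function only needs a weight per condition, the same code covers text terms, facts and image features, as long as each media type supplies its own indexing weights.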


Probability of Relevance

• Probability of relevance computed based on probability of implication (score):

$$\Pr(\mathrm{rel} \mid q, d) = f(\Pr(q \leftarrow d))$$

  – Linear function with constant Pr(rel|q ← d)
  – New approach: logistic function

$$f(x) := \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$$

• Evaluation: better approximation quality than linear function

[Figure: logistic functions f(x) = Pr(rel|q,d) plotted against x = Pr(q ← d) for (b0=-20, b1=50), (b0=-10, b1=100) and (b0=-4, b1=10); both axes run from 0 to 1.]
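The logistic mapping from score to probability of relevance is short enough to write out. The sketch below evaluates f(x) = exp(b0 + b1·x) / (1 + exp(b0 + b1·x)) in a numerically stable way. The parameter values in the example are the ones shown in the plot legend, and the function name is an assumption.

```python
import math

def logistic_relevance(score: float, b0: float, b1: float) -> float:
    """f(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), with x = Pr(q <- d)."""
    z = b0 + b1 * score
    # Both branches compute the same function; the split avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# With b0 = -4 and b1 = 10 the curve crosses 0.5 at a score of 0.4.
p_mid = logistic_relevance(0.4, b0=-4.0, b1=10.0)
p_low = logistic_relevance(0.2, b0=-4.0, b1=10.0)
p_high = logistic_relevance(0.8, b0=-4.0, b1=10.0)
```

The ratio -b0/b1 sets the score at which the probability reaches 0.5, and b1 controls how sharply the curve rises around that point.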


Cost Sources

• Computation and communication time
  – Affine linear function
• Charges for delivery
  – Linear function
• Retrieval quality (most interesting for IR)
  – Costs C+ < C- for relevant (non-relevant) documents
  – Estimating retrieval quality: for a result set of s_i documents, estimate the number r_i of relevant documents contained
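To make the cost-based selection concrete, here is a small sketch. It assumes the retrieval-quality cost of taking s_i documents from a library is r_i·C+ + (s_i − r_i)·C-, adds an optional affine time cost, and searches for the allocation that minimises the total subject to retrieving exactly n documents overall. The exact cost formula, the fixed total n, and all function names are assumptions made for illustration; the slides state only the cost sources.

```python
from typing import Callable, List, Sequence, Tuple

def expected_cost(s: int, r: float, c_rel: float, c_nonrel: float,
                  setup: float = 0.0, per_doc: float = 0.0) -> float:
    """Cost of retrieving s documents, r of which are expected to be relevant."""
    if s == 0:
        return 0.0                      # library not selected: no cost at all
    quality = r * c_rel + (s - r) * c_nonrel
    time = setup + per_doc * s          # affine linear time cost
    return quality + time

def select_libraries(estimators: Sequence[Callable[[int], float]], n: int,
                     c_rel: float = 0.0, c_nonrel: float = 1.0,
                     setup: float = 0.0) -> Tuple[float, List[int]]:
    """Minimise total expected cost subject to sum(s_i) == n (dynamic programming)."""
    best = {0: (0.0, [])}               # documents allocated so far -> (cost, allocation)
    for estimate_r in estimators:
        nxt = {}
        for used, (cost, alloc) in best.items():
            for s in range(n - used + 1):
                total = cost + expected_cost(s, estimate_r(s), c_rel, c_nonrel, setup)
                if used + s not in nxt or total < nxt[used + s][0]:
                    nxt[used + s] = (total, alloc + [s])
        best = nxt
    return best[n]

# Two hypothetical libraries: A has five good documents, B is uniformly mediocre.
r_a = lambda s: 0.8 * min(s, 5)
r_b = lambda s: 0.3 * s
cost, allocation = select_libraries([r_a, r_b], n=8)
```

Unlike a fixed top-k resource ranking, the number of selected libraries falls out of the optimisation: a library whose best allocation is zero documents is simply not queried.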


Estimating Retrieval Quality – Method 1

Relevant Documents in DL
• Resource description:
  – Expectation of indexing weights for terms (images, facts: vector clusters, retrieval function)
• Resource selection:
  – Estimate number of relevant documents in DL_i
  – Estimate number r_i of relevant documents in the result set
    • Apply recall-precision function
    • Approximated by linear function (defined by starting point P(0))

[Figure: linear recall-precision functions, precision plotted against recall (both from 0 to 1), for P(0)=0.9 and P(0)=0.5.]
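A worked sketch of Method 1, assuming the linear recall-precision function has the form P(R) = P(0)·(1 − R), so that precision falls to zero at full recall. That form is an assumption consistent with a line defined only by its starting point. With E relevant documents in the library, precision r/s and recall r/E, solving r/s = P(0)·(1 − r/E) for r gives r = P(0)·E·s / (E + P(0)·s).

```python
def expected_relevant_in_result(s: int, total_relevant: float, p0: float) -> float:
    """Solve r/s = p0 * (1 - r/E) for r, where E = total_relevant."""
    if s == 0 or total_relevant == 0:
        return 0.0
    return p0 * total_relevant * s / (total_relevant + p0 * s)

# Library with an estimated 10 relevant documents, P(0) = 0.5, 20 documents retrieved.
r = expected_relevant_in_result(s=20, total_relevant=10.0, p0=0.5)
precision = r / 20
recall = r / 10.0
```

In this example r is 5, so precision is 0.25 and recall is 0.5, which indeed lies on the line P(R) = 0.5·(1 − R).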


Estimating Retrieval Quality – Method 2

Simulated Retrieval on Sample
• Resource description:
  – Store complete index of sample
• Resource selection:
  – Simulate retrieval on indexed sample (for all media types)
  – Derive distribution of probabilities of relevance
    • Assumed to be representative for whole collection
  – Estimate number r_i of relevant documents in result set

[Figure: two plots of the density p(x) of the probability of relevance over x from 0 to 1; annotations |DL|, s and a.]
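The extrapolation step of Method 2 might look like the sketch below. It assumes retrieval has already been simulated on the sample, giving one probability of relevance per sample document, and that the sample is representative, so each sample document stands for |DL|/n library documents. Retrieving s documents from the library then corresponds to the top s/|DL| fraction of the ranked sample. The function name and the fractional handling of the cut-off are illustrative assumptions.

```python
from typing import Sequence

def estimate_relevant_from_sample(sample_probs: Sequence[float],
                                  library_size: int, s: int) -> float:
    """Expected number of relevant documents among the top s documents of the library."""
    n = len(sample_probs)
    if n == 0 or s <= 0 or library_size <= 0:
        return 0.0
    ranked = sorted(sample_probs, reverse=True)
    # Number of sample documents (possibly fractional) matching the retrieved share.
    k = min(s, library_size) * n / library_size
    whole = int(k)
    total = sum(ranked[:whole])
    if whole < n:
        total += (k - whole) * ranked[whole]   # partial weight for the cut-off document
    # Scale from sample counts back up to library counts.
    return total * library_size / n

sample_probs = [0.9, 0.7, 0.4, 0.2, 0.1]
r_hat = estimate_relevant_from_sample(sample_probs, library_size=1000, s=400)
```

Here 400 of 1000 documents correspond to the top two of five sample documents, whose probabilities sum to 1.6, and each stands for 200 library documents, giving an estimate of 320 relevant documents.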


Estimating Retrieval Quality – Method 3

Modelling Indexing Weights
• Resource description:
  – Expectation/variance of indexing weights
• Resource selection:
  – Approximate indexing weight distribution
    • New: normal distribution
  – Document score distribution
    • Also normal distribution
  – Proceed as for M2
  – Todo: other media types

[Figure: histogram of frequency against indexing weight for the collection, with a normal distribution fit.]

[Figure: histogram of frequency against document score for query 51 (collection), with the fitted normal distribution.]
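Why the document scores also come out normal can be shown in a few lines. If each indexing weight is modelled as a normal variable with stored expectation and variance, and the weights are treated as independent (an assumption), then a linear retrieval function with condition weights a_i yields a normal score with mean Σ a_i·μ_i and variance Σ a_i²·σ_i². The sketch uses `statistics.NormalDist` from the Python standard library; the numbers are invented.

```python
import math
from statistics import NormalDist
from typing import Dict, Tuple

def score_distribution(condition_weights: Dict[str, float],
                       weight_stats: Dict[str, Tuple[float, float]]) -> NormalDist:
    """Normal approximation of the document score from per-term (mean, variance)."""
    mean = 0.0
    variance = 0.0
    for term, a in condition_weights.items():
        mu, var = weight_stats.get(term, (0.0, 0.0))
        mean += a * mu              # expectation of a weighted sum
        variance += a * a * var     # variance of a weighted sum of independent terms
    return NormalDist(mean, math.sqrt(variance))

# Hypothetical resource description: (expectation, variance) per term.
stats = {"retrieval": (0.02, 0.0004), "image": (0.04, 0.0004)}
dist = score_distribution({"retrieval": 0.5, "image": 0.5}, stats)

# Share of documents expected to score above a threshold, scaled to a library of 1000.
threshold = dist.mean
expected_above = 1000 * (1.0 - dist.cdf(threshold))
```

From this score distribution one can proceed as in Method 2, with the fitted normal density taking the place of the distribution obtained by simulated retrieval.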


Estimating retrieval quality

• Methods:
  1. Estimate number of relevant documents in DL, apply linear recall-precision function
  2. Simulate retrieval on sample, extrapolate to DL
  3. Normal distribution for indexing weights → normal distribution for retrieval scores
• Results:
  – Use logistic mapping for retrieval score → probability of relevance
  – 3 ~ CORI > 1 > 2


Data fusion


Data fusion

[Figure: ranked list of results by media type: 1 Text, 2 Speech, 3 Image, 4 Text, 5 Speech, 6 Text, 7 Text, 8 Image, 9 Image, 10 Image.]

• Rank data
  – Rank
  – Score
  – Surrogates
  – Duplicates across ranks
• Full document
• Broad information
  – Collection information
  – User preferences


Text data fusion

• Using a combination of evidence
  – Original rank position of documents
  – Re-ranking based on similarity of surrogate to query
    • Surrogate could be title or text summary
  – Promotion of documents found to be similar to others in the rankings
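One way to picture the combination of evidence is the sketch below, which merges per-library rankings using two of the three signals: original rank position and surrogate-to-query similarity. Everything specific here is an assumption for illustration: the reciprocal-rank score, the Jaccard token overlap as the similarity measure, the mixing weight `alpha`, and the function names. The third signal (promoting documents similar to others in the rankings) is left out to keep the example short.

```python
from typing import Dict, List, Tuple

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two short texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def fuse(query: str, result_lists: List[List[str]],
         alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Merge ranked lists of surrogates (e.g. titles) into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in result_lists:
        for position, surrogate in enumerate(ranking):
            rank_evidence = 1.0 / (position + 1)        # original rank position
            text_evidence = jaccard(query, surrogate)   # surrogate-to-query similarity
            combined = alpha * rank_evidence + (1 - alpha) * text_evidence
            # A surrogate returned by several libraries keeps its best score.
            scores[surrogate] = max(scores.get(surrogate, 0.0), combined)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

lists = [["speech archives", "image retrieval methods"],
         ["image retrieval", "digital libraries"]]
fused = fuse("image retrieval", lists)
titles = [title for title, _ in fused]
```

In the toy data the second-ranked "image retrieval methods" overtakes the first-ranked "speech archives" because its surrogate matches the query, which is the re-ranking effect the slide describes.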


Speech data fusion

• SpeechBot has 50% Word Error Rate on average
  – 17.56% in 300 top ranked documents
  – No need to treat speech differently for fusion

Example of recognition errors:
  Spoken: “Speech is spoken and when recognised word errors occur”
  Recognised: “Peaches broken and when recognised word errors occur.”


Image data fusion – score normalisation

[Figure: the data fuser (DF) sends the query to the search engine of each DL and receives results. Each result image is shown with two scores, s and σ; the figure labels them as un-normalized and normalized scores.]

s=0.8, σ=0.9
s=0.6, σ=0.8
s=0.5, σ=0.8
s=0.3, σ=0.5
s=0.2, σ=0.4
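The slide does not say how the normalisation is obtained, so the following is only one plausible sketch: fit a linear map from a library's own score to a common scale by least squares on a handful of example images. It assumes, for illustration, that s is the library's un-normalized score and σ the normalized one, and it reuses the five score pairs shown in the figure as training data. `fit_linear_normaliser` is a hypothetical name.

```python
from typing import List, Tuple

def fit_linear_normaliser(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of sigma ~= slope * s + intercept from (s, sigma) pairs."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_sigma = sum(sigma for _, sigma in pairs) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in pairs)
    sxy = sum((s - mean_s) * (sigma - mean_sigma) for s, sigma in pairs)
    slope = sxy / sxx
    intercept = mean_sigma - slope * mean_s
    return slope, intercept

# (s, sigma) pairs as they appear in the figure; their roles are assumed as stated above.
pairs = [(0.8, 0.9), (0.6, 0.8), (0.5, 0.8), (0.3, 0.5), (0.2, 0.4)]
slope, intercept = fit_linear_normaliser(pairs)
normalised = [slope * s + intercept for s, _ in pairs]
```

Once every library's scores are mapped onto the same scale in this way, results from different libraries can be merged by sorting on the normalised score.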


Image data fusion


Presentation of retrieval results


2D Information Display Space (IDS)

• Presenting simultaneously multiple properties about documents
  – Each document is represented through a visual object (VO)
  – Visual features of each VO encode relevant properties of documents
    • Position, size, shape, colour
  – Documents sharing the same value of one relevant property (selected by user) are displayed with the same value of the corresponding visual feature


Sample query session



MMIR in Networked DLs – Open Issues

• Vague schema mappings for heterogeneous environments
• Cross-media searches
• Resource selection for non-textual media
• Decentralized (P2P) architectures


JXTA Search Architecture

Two-level architecture:

1. Search hubs

2. Provider peers


Conclusions and Outlook

• MIND provides methods for
  – resource description,
  – resource selection and
  – result fusion
  in federated multimedia DLs
• Open Issues:
  – Heterogeneous, cross-media environments
  – Time-dependent media
  – Decentralized (P2P) architectures
