The Future of P2P Audio/Visual Search Fausto Rabitti ISTI-CNR, Pisa, Italy P2PIR Workshop - ACM CIKM 2006


The Future of P2P Audio/Visual Search

Fausto Rabitti, ISTI-CNR, Pisa, Italy

P2PIR Workshop - ACM CIKM 2006


Outline of the talk

1. “Guessing” the future of technologies
2. Today's outlook
   1. Peer-to-Peer applications
   2. Image and video search on the Web
3. Improving effectiveness by similarity search
4. Scalability issue: P2P solution


Future of Audio/Visual Search

• Everything we write, see, or hear can now be in digital form (93% of produced data is digital).
• In the next three years, we will create more data than has been produced in all of human history, most of it in audio/visual form.
• New trend in MM content production: personal producers vs. professional producers.
• Dimensions of the search problem:
  • Effectiveness
  • Efficiency (scalability is the key issue)


Future of Audio/Visual Search

• Economic dimension of the problem (e.g., personal journalism, cultural tourism)
• Social impact of solutions (e.g., community networks)
• Scientific research activities
• Results on innovation
• Is P2P a solution?


What is Peer-to-Peer?

• P2P is a class of applications that takes advantage of resources (storage, cycles, content, human presence) available at the edges of the Internet (Shirky).
• A P2P system is a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a networked environment, avoiding central services (Steinmetz).
• P2P is about overcoming the barriers to the formation of ad hoc communities, whether of people, of programs, of devices, or of distributed resources (O’Reilly).


Peer-to-Peer systems

• Traditional client-server approaches require a tremendous amount of effort and resources to meet today's challenges.
• Scalability, security, and flexibility are the main requirements of future Internet-based applications.
• P2P systems are characterized by decentralized resource usage and decentralized self-organization.


Peer-to-Peer systems: Searching

One of the major problems of P2P systems is how to find a data item stored at some dynamic set of nodes in the system.

Three basic strategies can be used:
• centralized servers (first-generation P2P: Napster)
• flooding (second-generation P2P: Gnutella)
• distributed indexing (DHTs, e.g. Kademlia as used in eMule): structured P2P systems


Structured Peer-to-Peer systems

• Inspired by the significant possibilities of decentralized self-organizing systems, researchers focused on approaches for distributed indexing structures.
• Distributed Hash Tables (DHTs) were developed to provide scalability, reliability, and fault tolerance.
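As a rough illustration of distributed indexing, the sketch below assigns a key to the node whose identifier is closest under the XOR metric, which is the distance Kademlia uses. The peer names, the use of SHA-1 to derive 160-bit identifiers, and the function names are assumptions made for illustration; a real DHT routes the lookup over several hops instead of scanning a complete node list.

```python
import hashlib


def node_id(name: str) -> int:
    """Derive a 160-bit identifier from a name (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """XOR metric between two identifiers."""
    return a ^ b


def responsible_node(key: str, nodes: list[str]) -> str:
    """Return the node whose identifier is XOR-closest to the key's identifier."""
    key_id = node_id(key)
    return min(nodes, key=lambda n: xor_distance(node_id(n), key_id))


peers = ["peer-a", "peer-b", "peer-c", "peer-d"]  # hypothetical peers
owner = responsible_node("holiday-video.avi", peers)
```

Because the assignment depends only on identifiers, any peer that knows the node set computes the same owner for a given key, which is what removes the need for a central directory.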


Searching in the World of Peers

• Peer-to-peer systems are mostly used for file sharing. This task, which made the fortune of P2P, was not achievable by centralized servers.
• Structured P2P networks, such as DHTs, have produced a considerable amount of research, but their usage is still limited.
• Today, centralized servers are still largely used for searching the data of (often illegal) file-sharing communities.


eMule: Servers


eMule: Kademlia


eMule: Search


eMule: Searching from a web server


BitTorrent: web servers


Image and Video search on the Web

• Today, image and video search engines are trivial applications of Web search engines.
• Examples: Google, Yahoo, Ask, etc.
• Search is performed on the MM object context (i.e. the Web page) or on manually associated text.
• Limits of this approach: who is going to manually tag all A/V material produced by personal devices?


Searching for “sea”: flickr


Improving effectiveness by Similarity Search on MM Content

Exploiting automatically extracted metadata representing MM content:

• MM features (e.g. MPEG-7)
• Automatic context information (e.g., GPS-generated info)

Solutions based on the combination of “traditional” information (manual text, Web pages) with automatically generated information (i.e. MM features, context) representing MM content

New searching paradigm based on Similarity


Searching for “sea”: MILOS PhotoBook

[Figure label: Query]


Detecting “Coat of arms” by components


Face Recognition (TV)

[Figure labels: video frame, query, 2nd result]


The Importance of Similarity

Quotation:

“An ability to assess similarity lies close to the core of cognition. The sense of sameness is the very keel and backbone of our thinking. An understanding of problem solving, categorization, memory retrieval, inductive reasoning, and other cognitive processes require that we understand how humans assess similarity.”

MIT Encyclopedia of the Cognitive Sciences, Cambridge, MA, MIT Press 2006, pp. 763-765


Feature-based Approach

[Figure: a query image compared against an image database ("similar?")]


Feature-based Approach

[Figure: image layer (R, G, B channels) and feature layer]


Specific Similarity concepts and definitions

Similarity in Chemistry
In order to assess the similarity between two molecules A and B we need to:
• first describe the molecules according to some scheme, and
• choose an appropriate measure to compare the descriptions of the molecules.

Similarity in Social Psychology
• Similarity refers to how closely attitudes, values, interests, and personality match between people.
• Similarity leads to interpersonal attraction, i.e. the attraction between people which leads to friendship and relationships.
• Similarity forms social networks of individuals, with ties mirrored as friends and acquaintances.


Requirements of New Applications

• Medicine: Magnetic Resonance Images (MRI)
• Finance: stocks with similar time behavior
• Digital library: text retrieval, multimedia information retrieval


Similarity Searching

• Effectiveness: the way of formulating the similarity measures (a model of human perception)
• Efficiency: the way of achieving the required performance over huge volumes of data (index structure)


Metric Space: an Abstraction of Similarity

Metric space: M = (D, d)
• D: domain
• d(x,y): distance function

For all x, y, z ∈ D:
• d(x,y) ≥ 0 (non-negativity)
• d(x,y) = 0 ⇔ x = y (identity)
• d(x,y) = d(y,x) (symmetry)
• d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
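The four postulates can be checked empirically on a finite sample. The sketch below is only an illustration under stated assumptions: the function names, the sample points, and the tolerance are hypothetical, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance, a standard example of a metric."""
    return math.dist(x, y)


def squared_euclidean(x, y):
    """Squared Euclidean distance; it breaks the triangle inequality."""
    return math.dist(x, y) ** 2


def satisfies_metric_postulates(points, d, tol=1e-9):
    """Check the metric postulates for distance d on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                       # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):      # identity
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True


sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 4.0)]
euclidean_ok = satisfies_metric_postulates(sample, euclidean)
squared_ok = satisfies_metric_postulates(sample, squared_euclidean)
```

On this sample the squared distance fails because going from (0, 0) to (2, 2) directly costs 8, while the detour through (1, 1) costs only 2 + 2.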


Similarity Search Problem

For X ⊆ D in metric space M, pre-process X so that the similarity queries are executed efficiently.

Similarity queries: range search
R(q,r) = { x ∈ X | d(q,x) ≤ r }, with q ∈ D, r ≥ 0

[Figure: query object q with a ball of radius r around it]
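Without any pre-processing, the range query R(q,r) can be answered by a sequential scan that evaluates the distance to every object. The sketch below shows this baseline; the function name and the sample data are assumptions for illustration.

```python
import math


def range_query(X, d, q, r):
    """Return every x in X with d(q, x) <= r, scanning X sequentially."""
    if r < 0:
        raise ValueError("the query radius must be non-negative")
    return [x for x in X if d(q, x) <= r]


points = [(0, 0), (1, 1), (3, 4), (6, 8)]  # hypothetical 2-D objects
result = range_query(points, math.dist, (0, 0), 5.5)
```

The scan costs one distance computation per stored object, which is the cost that index structures try to reduce.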


Similarity Queries

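• k-nearest neighbours: NN(q,k) = A, where q ∈ D, k > 0, A ⊆ X, |A| = k, and ∀x ∈ A, y ∈ X − A: d(q,x) ≤ d(q,y)
• similarity join: for X = {x1, x2, …, xN} and Y = {y1, y2, …, yM}, the result is {(xi,yj) | d(xi,yj) < μ}
• similarity “self” join: X = Y

[Figure: a k-NN query around q with k = 5; a similarity join with threshold μ]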
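The k-nearest-neighbour query and the similarity join can also be written as naive scans, which is useful as a correctness baseline for any index. This is a minimal sketch; the function names and data are hypothetical, and ties in the k-NN query are broken by input order, which is an assumption rather than part of the definition.

```python
import heapq
import math


def knn_query(X, d, q, k):
    """Return the k objects of X closest to q (ties broken by input order)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))


def similarity_join(X, Y, d, mu):
    """Return all pairs (x, y) with d(x, y) < mu, comparing every pair."""
    return [(x, y) for x in X for y in Y if d(x, y) < mu]


X = [(0, 0), (2, 0), (10, 10)]
Y = [(0, 1), (9, 9)]
nearest = knn_query(X, math.dist, (1, 0), 2)
pairs = similarity_join(X, Y, math.dist, 2.0)
```

The join evaluates |X| · |Y| distances, so it becomes impractical for large collections well before the single-query scans do.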


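Basic Partitioning Principles

• ball partitioning: { x ∈ X | d(p,x) ≤ r } and { x ∈ X | d(p,x) > r }
• generalised hyperplane: { x ∈ X | d(p1,x) ≤ d(p2,x) } and { x ∈ X | d(p1,x) > d(p2,x) }

[Figure: ball partitioning around pivot p with radius r; generalised hyperplane partitioning between pivots p1 and p2]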
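Both partitioning principles need only the distance function and one or two pivots. In the sketch below, objects at distance exactly r from the pivot are placed in the inner ball, an assumption made so that every object lands in exactly one part; the function names and data are hypothetical.

```python
import math


def ball_partition(X, d, p, r):
    """Split X into objects within distance r of pivot p and the rest."""
    inner = [x for x in X if d(p, x) <= r]
    outer = [x for x in X if d(p, x) > r]
    return inner, outer


def hyperplane_partition(X, d, p1, p2):
    """Split X by which of the two pivots is closer (ties go to p1)."""
    near_p1 = [x for x in X if d(p1, x) <= d(p2, x)]
    near_p2 = [x for x in X if d(p1, x) > d(p2, x)]
    return near_p1, near_p2


data = [(0, 0), (1, 0), (4, 0), (5, 0)]
inner, outer = ball_partition(data, math.dist, (0, 0), 2.0)
left, right = hyperplane_partition(data, math.dist, (0, 0), (5, 0))
```

Applying either split recursively to each part yields a tree, which is the idea behind vantage point trees and generalised hyperplane trees.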


The M-tree (an example)

• inherently dynamic structure
• disk-oriented (fixed-size nodes)
• built in a bottom-up fashion
• inspired by R-trees and B-trees
• all data in the leaf nodes
• internal nodes: pointers to subtrees and additional information
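In the M-tree, the additional information kept with each subtree pointer includes a routing object and a covering radius that bounds the distance from the routing object to everything below it. The sketch below shows how that radius lets a range query skip subtrees. It is a simplified two-level, in-memory illustration, not the disk-based M-tree; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RoutingEntry:
    pivot: tuple                 # routing object of the subtree
    covering_radius: float       # max distance from pivot to any object below
    objects: list = field(default_factory=list)  # leaf contents


def build_entry(pivot, objects, d):
    """Create a routing entry whose radius covers all given objects."""
    radius = max((d(pivot, o) for o in objects), default=0.0)
    return RoutingEntry(pivot, radius, list(objects))


def range_search(entries, d, q, r):
    """Return objects within r of q and the number of leaves actually scanned."""
    hits, visited = [], 0
    for e in entries:
        # By the triangle inequality, if d(q, pivot) > r + covering_radius,
        # no object under this entry can lie within distance r of q.
        if d(q, e.pivot) > r + e.covering_radius:
            continue
        visited += 1
        hits.extend(o for o in e.objects if d(q, o) <= r)
    return hits, visited


entries = [
    build_entry((0, 0), [(0, 0), (1, 1), (0, 2)], math.dist),
    build_entry((10, 10), [(10, 10), (11, 9)], math.dist),
]
hits, visited = range_search(entries, math.dist, (1, 0), 1.5)
```

Here the second entry is pruned with a single distance computation to its routing object, so its objects are never compared with the query.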


M-tree: Example

[Figure: an example M-tree indexing objects o1 to o11, shown as a tree of node entries next to the layout of the objects in the space]


Scalability: CPU Costs

• labels: radius or k + D (D-index), M (M-tree), SEQ (sequential scan)
• data: from 100,000 to 600,000 objects
• M-tree and D-index are faster (D-index slightly better)
• linear trends

[Figure panels: range query, r = 1,000 and 2,000; k-NN query, k = 1 and 100]


Scalability: I/O Costs

the same trends as for CPU costs


Similarity Search Scalability

• Similarity search is expensive.
• The scalability of centralized indexes is linear, so they:
  • cannot be applied to huge data archives
  • become inefficient after a certain point
• Possible solutions:
  • Sacrifice some precision: approximate techniques
  • Use more storage and computational power: distributed data structures


Similarity Search in the World of Peers

With P2P systems able to perform similarity search:
• similarity search becomes scalable
• P2P communities have new search capabilities

While preserving all the benefits of structured P2P, they will give new search capabilities not available from current centralized servers.


Implementation Postulates of Distributed Indexes

scalability – nodes (computers) can be added (removed)

no hot-spots – no centralized nodes, no flooding by messages

update independence – network update at one site does not require an immediate change propagation to all the other sites


Distributed Similarity Search Structures

• Native metric structures:
  • GHT* (Generalized Hyperplane Tree)
  • VPT* (Vantage Point Tree)
• Transformation approaches (based on DHTs):
  • M-CAN (Metric Content Addressable Network)
  • M-Chord (Metric Chord)


M-CAN: Range Query Execution

Range query R(q,r):
1. Map q to F(q).
2. Route the query towards F(q).
3. Reach the regions with candidate objects x, i.e. those with L∞(F(x),F(q)) ≤ r.
4. Propagate the query over the candidate regions using a multicast algorithm of CAN.
5. Check the candidate objects using d.
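The filter-and-refine logic of these steps can be sketched on a single machine. The sketch assumes that F maps an object to the vector of its distances to a fixed set of pivots; the slide does not define F, so this choice, the pivots, and all names below are assumptions. With such a mapping, the triangle inequality gives L∞(F(x),F(q)) ≤ d(x,q), so the filter never discards a true answer. The routing and multicast over CAN are not modelled here.

```python
import math


def pivot_map(x, pivots, d):
    """Assumed mapping F: the vector of distances from x to each pivot."""
    return tuple(d(x, p) for p in pivots)


def l_inf(u, v):
    """L-infinity (maximum) distance between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def filtered_range_query(X, pivots, d, q, r):
    """Filter candidates in the mapped space, then verify them with d."""
    fq = pivot_map(q, pivots, d)
    candidates = [x for x in X if l_inf(pivot_map(x, pivots, d), fq) <= r]
    answers = [x for x in candidates if d(q, x) <= r]
    return answers, len(candidates)


data = [(5, 2), (5, -3), (0, 0), (9, 0)]   # hypothetical objects
pivots = [(0, 0), (10, 0)]                 # hypothetical pivots
answers, n_candidates = filtered_range_query(data, pivots, math.dist, (5, 3), 2.0)
```

In this example (5, -3) has the same pivot distances as the query and therefore survives the filter, but the final check with d removes it, which is why the last step is required.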


Scalability comparison (INFOSCALE 2006)

Compared 4 distributed similarity search structures:
• Query size scalability
• Dataset size scalability
• Capability of simultaneous query processing

Structure   Single query    Multiple queries
GHT*        excellent       poor
VPT*        good            satisfactory
M-CAN       satisfactory    good
M-Chord     satisfactory    very good


Further Research Challenges: Complex Similarity Search

Problems:
• different types of queries, involving different features and different similarity measures,
• multiple overlays over the same physical network,
• distributed incremental similarity search,
• high communication costs of naïve implementations,
• collaboration with the load balancing mechanism.


Further Research Challenges: Load Balancing

Problems:
• one node contains data of different features,
• load balancing cost models, to measure the load and estimate the reorganization costs,
• the postulates of distributed processing must be strictly respected,
• performance tuning.


P2P Solutions for A/V Search

P2P-based solution to solve the fundamental scalability issue, concerning not only:
• Distributed similarity search structures
but also:
• Cooperative A/V feature extraction
• Support of highly dynamic applications (e.g. videoblogs, photoblogs, etc.)
• Push-based/cooperative crawling


Technological requirements for MM Search Engines

• Media-specific analysis and feature extraction: e.g. Music Information Retrieval

• Scalable, dynamic and distributed index structures supporting similarity search

• Complex/multi-feature query processing: combining evidence from different media indexes, using the similarity paradigm (together with the traditional Web search)

• Support of distributed push-based crawling, where containers are asked to publish and “push” information to the search engine (together with the traditional pull-based crawling)

• Scalable dynamic caching techniques to enhance performance

• Context based support (based on user location, activity, etc.) and Multi device support (search from PC, mobile phone, PDAs).


P2P and push-based crawling

• Conventional pull-based crawling techniques face the high refresh rates and huge size of the Web with increasing difficulty, and have limitations in dealing with multimedia information.
• In a distributed push-based crawling model, content providers (both professional and personal) are asked to publish and “push” information to the P2P indexing nodes.
• A collaborative crawling model can effectively deal with important multimedia content that is hidden from traditional crawlers, either because it is not directly hyperlinked from any HTML page or because it is stored in specialized on-line A/V repositories that cannot be visited by crawling agents.


P2P and push-based crawling (2)

• Multimedia content providers can be helped by the P2P infrastructure in the heavy process of multimedia feature extraction.
• A collaborative and participatory P2P environment can let providers keep control of the contributed material (publish what you want, when you want): IPR-protected material is indexed and searched for, but its delivery is controlled directly by the owner.
• New collaborative business models?


Dynamic combination of crawling and feature extraction modes


P2P and Dynamic Caching

• Caching and replication are routinely used on the Web, since they reduce bandwidth consumption and improve the user-perceived quality of service.
• In decentralized P2P systems, caching and replication make it possible to achieve better load distribution, shorter latency, and higher availability.
• These techniques can be applied to contents, query results, and index entries.
• The literature proposes several solutions in which caching and replication strategies are managed locally, at the peer or super-peer level, or globally, by deploying a distributed cache over several peers.


P2P and Dynamic Caching (2)

• New dimension of the problem, due to the peculiarities of multimedia content (size, dynamicity, DRM constraints).
• It is necessary to enhance search responsiveness and to save computational and communication resources (e.g. by exploiting self-similarities among submitted queries, which follow a Zipfian distribution).
• The main problem to deal with is dynamicity: it is not clear how long cached information will remain valid. The variability of the data and the dynamicity of the network itself make it hard to predict the freshness of cache entries.
• To design an on-line caching algorithm, it is necessary to investigate the trade-off between caching time and validity time, and to explore whether, and in which cases, the popularity of a cached entry correlates with how long it stays valid.
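One simple way to experiment with the trade-off between caching time and validity time is a query-result cache that combines least-recently-used eviction with a fixed time-to-live. The sketch below is a single-peer illustration only, not a method proposed in the talk: the class name, the fixed TTL policy, and the injected clock are all assumptions.

```python
import time
from collections import OrderedDict


class TTLQueryCache:
    """LRU cache of query results whose entries expire after ttl seconds."""

    def __init__(self, capacity, ttl, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl
        self.clock = clock               # injectable so expiry can be simulated
        self._entries = OrderedDict()    # query -> (expiry time, result)

    def get(self, query):
        """Return the cached result, or None if it is absent or expired."""
        item = self._entries.get(query)
        if item is None:
            return None
        expires_at, result = item
        if self.clock() >= expires_at:   # entry is considered stale
            del self._entries[query]
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return result

    def put(self, query, result):
        """Store a result and evict least recently used entries if over capacity."""
        self._entries[query] = (self.clock() + self.ttl, result)
        self._entries.move_to_end(query)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


now = [0.0]                                       # simulated clock, in seconds
cache = TTLQueryCache(capacity=2, ttl=10.0, clock=lambda: now[0])
cache.put("sea", ["img-1", "img-2"])
fresh = cache.get("sea")
now[0] = 11.0
stale = cache.get("sea")
```

A popular query keeps being served from the cache until its TTL elapses, so a TTL that is too long returns outdated results and one that is too short wastes the benefit of the skewed query distribution.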


Music Information Retrieval (MIR)

• Search
  • Mainly based on melody, modeling possible mismatch between the query and the documents
  • Retrieve a song given an excerpt, sung or recorded
  • Retrieve songs similar to some query excerpts/songs
• Classification
  • Mainly based on timbre, timing, and long-term features
  • Identify author, performers, artist, genre, style, orchestration
• Recommendation
  • Based on collaborative filtering mixed with content-based techniques
  • Suggest a number of items to purchase or to organize in a playlist; organize programs for Web radios
• Visualization
  • Represent large personal music collections, for music browsing and audio preview


MIR: Basic Techniques

MIR tasks require audio processing for:
• Tempo identification
• Transcription of the main melody
• Recognition of harmonic structures
• Timbre characterization

Similarity is computed using:
• String matching, e.g. Dynamic Time Warping
• Statistical modeling, e.g. Hidden Markov Models
• Geometric approaches, e.g. Earth Mover's Distance

Techniques to visualize, classify, and recommend: k-Nearest Neighbor, Gaussian Mixture Models, Self-Organizing Maps, Markov Models
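Of the similarity techniques listed, Dynamic Time Warping is the easiest to show compactly. The sketch below is the textbook dynamic-programming formulation with an absolute-difference local cost; representing melodies as lists of MIDI pitch numbers, and the example sequences themselves, are assumptions for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two numeric sequences.

    Uses |x - y| as the local cost and allows each element to be
    matched with one or more consecutive elements of the other sequence.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier element of b
                cost[i][j - 1],      # b[j-1] also matches an earlier element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


query = [60, 62, 64, 65]                      # hypothetical sung excerpt
slow = [60, 60, 62, 62, 64, 64, 65, 65]       # same melody at half speed
other = [60, 59, 57, 55]                      # a different melody
same_melody_cost = dtw_distance(query, slow)
different_cost = dtw_distance(query, other)
```

The half-speed rendition has cost zero because warping absorbs the tempo difference. Note that DTW does not in general satisfy the triangle inequality, so it cannot be plugged directly into indexes that rely on the metric postulates.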


MIR: Evaluation

• Problems with copyright issues
  • Researchers have difficulty obtaining large music collections
• Music Information Retrieval Evaluation eXchange (MIREX)
  • Common effort for a TREC-like evaluation framework
  • Participants propose tasks and provide test collections
  • Experiments are carried out by the organizers
• Main focus on preprocessing techniques
  • Effectiveness of feature extraction, no real need for relevance judgments
  • Initial efforts also for typical retrieval tasks


MIR: Digital Rights Management

• Audio fingerprinting
  • To recognize copyrighted material
  • Can be exploited for retrieval tasks too
• Audio watermarking
  • To include copyright ownership and to track users' sharing behaviors
  • With watermarks, retrieval can be based on metadata
• Song similarity
  • To identify plagiarism
  • Artist identification


MIR: P2P and Portable Devices

The increasing number of large personal collections of digital music allows for:
• Music retrieval with distributed music indexes, stored on different peers
• Music recommendation using collaborative filtering based on the analysis of personal collection content
• Computation of music similarity based on users’ listening behaviors

The audio channel is more suitable for interaction with portable devices:
• Music retrieval, classification, and recommendation through aural interaction
• Techniques for extracting music “snapshots” and “snippets”


Conclusions

• Starting from today's situation:
  • Peer-to-Peer applications
  • Image and video search on the Web
• In order to improve effectiveness by adopting the Similarity Search paradigm
• We need a highly scalable and dynamic solution
• P2P solution is feasible and promising, also with respect to:
  • Cooperative A/V feature extraction
  • Push-based/cooperative crawling