Research Trends in Multimedia Content Services

Data Mining and Web Search Group

Computer and Automation Research Institute

Hungarian Academy of Sciences

András A. Benczúr

A Benczur – Research Trends in Multimedia Content Services – FuturICT 28 April 2008

Web 2.0, 3.0 …?

• Platform convergence (Web, PC, mobile, television) – information vs. recreation

• Emphasis on social content (blogs, Wikipedia, photo and video sharing)

• From search towards recommendation (query free, profile based, personalized)

• From text towards multimedia

• Glocalization (language, geography)

• Spam

A sample service

RSS

Web 2.0

• Small screen browsing

• Recommendation based on user profile (avoid query typing)

• Read blogs, view media, …

client software

Recommender engine

The user profile

• History stored for each user:
  • Known ratings, preferences, opinion – scarce!
  • Items read, weighted by time spent
    • details seen, scrolling, back button
  • Terms in documents read, tf.idf weighted top list
  • User language, region, current location and known sociodemographic data

• Multimedia!
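
The term part of such a profile can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the function name `user_term_profile`, the input layout (token list plus seconds spent per document) and the smoothed idf formula are all assumptions made for the example.

```python
import math
from collections import Counter


def user_term_profile(read_docs, doc_freq, n_docs, top_k=5):
    """Return the top_k (term, weight) pairs for one user.

    read_docs: list of (tokens, seconds_spent) for documents the user read.
    doc_freq:  term -> number of corpus documents containing the term.
    n_docs:    total number of documents in the corpus.
    """
    scores = Counter()
    for tokens, seconds in read_docs:
        if not tokens:
            continue  # nothing to count in an empty document
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            # Smoothed idf (an assumed variant); rare terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
            # Time spent on the document scales its contribution.
            scores[term] += seconds * tf * idf
    return scores.most_common(top_k)


profile = user_term_profile(
    read_docs=[(["spam", "web", "spam"], 60), (["web", "search"], 10)],
    doc_freq={"web": 90, "spam": 5, "search": 40},
    n_docs=100,
)
print([term for term, _ in profile])
```

Weighting by reading time means that a long visit to one article can outweigh several quick glances, which matches the "items read, weighted by time spent" idea on the slide.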

Same item, multiple sources

Information vs recreation: Do not mix the two?

Spam is increasingly annoying

Distribution of categories

Reputable 70.0%

Spam 16.5%

Weborg 0.8%

Ad 3.7%

Non-existent 7.9%

Empty 0.4%

Alias 0.3%

Unknown 0.4%

Effect of search result position

[Figure; axis labels: time spent viewing the result position; time of arrival at the result]

Multimedia Information Retrieval

Similar objects

Segmentation

Class of Query Image

Pre-classified Images

VOC2007

Original Training Set

Query Images

ImageCLEF Object Retrieval Task

Networked relation

• spam

• social network analysis

• churn

Social networks

Individual and business customers

[Figure legend: individual (home), individual ADSL, business, business ADSL]

Insurance fraud – in a network

Stacked Graphical Learning

1. Predict churn p(v) of node v
2. For target node u, aggregate p(v) for neighbors to form new feature f(u)
3. Rerun classification by adding feature f(.)
4. Iterate

[Figure: target node u with unknown label (?) and neighbors v1, v2, v7]
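
The four steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the method as implemented by the authors: the base classifier is a tiny hand-written logistic regression, the aggregate f(u) is assumed to be the mean of the neighbors' predicted probabilities, and all names (`train_logreg`, `neighbor_feature`, `stacked_graphical_learning`) and the toy data are hypothetical. For brevity it reuses in-sample predictions for the training nodes, where a careful implementation would use cross-validated predictions.

```python
import math


def train_logreg(X, y, epochs=200, lr=0.5):
    """Tiny batch-gradient logistic regression (stand-in base classifier)."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict_proba(w, x):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def neighbor_feature(p, neighbors, u):
    """f(u): mean predicted churn over u's neighbors (0.0 if isolated)."""
    nbrs = neighbors.get(u, [])
    return sum(p[v] for v in nbrs) / len(nbrs) if nbrs else 0.0


def stacked_graphical_learning(features, labels, neighbors, rounds=2):
    """features: node -> list of floats; labels: node -> 0/1 (training nodes only)."""
    nodes = sorted(features)
    train = [v for v in nodes if v in labels]
    y = [labels[v] for v in train]
    X = {v: list(features[v]) for v in nodes}
    for _ in range(rounds + 1):
        # Steps 1 and 3: (re)train and predict p(v) for every node.
        w = train_logreg([X[v] for v in train], y)
        p = {v: predict_proba(w, X[v]) for v in nodes}
        # Step 2: append the aggregated neighbor prediction f(u).
        X = {v: list(features[v]) + [neighbor_feature(p, neighbors, v)]
             for v in nodes}
    return p


# Toy graph: two cliques; a4 and b4 are the unlabeled target nodes.
features = {f"{c}{i}": [x] for c, x in (("a", 1.0), ("b", 0.0)) for i in range(1, 5)}
labels = {"a1": 1, "a2": 1, "a3": 1, "b1": 0, "b2": 0, "b3": 0}
neighbors = {v: [u for u in features if u != v and u[0] == v[0]] for v in features}
p = stacked_graphical_learning(features, labels, neighbors)
print(round(p["a4"], 3), round(p["b4"], 3))
```

Any classifier that outputs probabilities could replace the logistic regression; the stacking loop only needs `p(v)` for every node.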

Why social networks are hard to analyze

Subgraphs of social networks

Medium size dense communities attract much algorithmic work

Tentacles induce noise

Mapping into 2D

plain spectral

semidefinite

Research Highlights

Recommenders: KDD Cup 2007 Task 1 First Prize

• Predict the probability that a user rated a movie in 2006, based on training data from years up to 2005

Spam filtering: Web Spam Challenge 1 first place

Churn prediction: method presented at KDD Cup 2009 Workshop

Task XXXX

Netflix: lessons and differences learned

• Ratings 1–5 stars
• Predict an unseen rating
• Evaluation: RMSE
• 0.8572: $1,000,000
• Current leader: 0.8650
• Oct/07: 0.8712

KDD Cup 2007
• same data set
• predict existence of a rating

Results of two separate tasks

BellKor team report [Bell, Koren 2007]:
• Low rank approximation
• Restricted Boltzmann Machine
• Nearest neighbor

KDD Cup 2007: Predict probability that a user rated a movie in 2006:
• Given list of 100,000 user–movie pairs
• Users and movies drawn from Netflix Prize data set

Winner report [K, B, and our colleagues 2007]

Evaluation and Issue 1

For a given user i and movie j:

$$\mathrm{RMSE}^2 = \sum_{i,j} \left(w_{ij} - \hat{w}_{ij}\right)^2, \qquad w_{ij} = \begin{cases} 0 & \text{if no rating given} \\ 1 & \text{otherwise} \end{cases}$$

where $\hat{w}_{ij}$ is the predicted value

KDD Cup example:
• Our RMSE: 0.256
• First runner up: 0.263
• All zeroes prediction: 0.279 (Place 10-13)

But why do we use RMSE and not precision/recall?

• RMSE prefers correct probability guesses for the majority of infrequently visited items

• The presence of the recommender changes usage
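
A short sketch of how RMSE behaves on 0/1 "was a rating given" targets. Since the reported values are below 1, the sketch assumes the squared errors are averaged over the evaluated user–movie pairs; the function name `rmse` and the toy data are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between 0/1 targets and predicted probabilities."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((w - w_hat) ** 2 for w, w_hat in zip(actual, predicted))
    return math.sqrt(total / len(actual))


actual = [1, 0, 0, 0]                       # one of four pairs was rated
all_zero = rmse(actual, [0.0] * 4)          # sqrt(share of rated pairs)
base_rate = rmse(actual, [0.25] * 4)        # sqrt(p * (1 - p)) with p = 0.25
print(all_zero, base_rate)
```

With this normalization, the all-zero prediction scores the square root of the share of rated pairs, so a value of 0.279 would correspond to roughly 7.8% of the listed pairs being rated. That is why a trivial estimate is hard to beat by a wide margin when most pairs are unrated.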

Method Overview

• Probability by naive user-movie independence
  • Item frequency estimation (Time Series)
  • User frequency estimation
  • Reaches RMSE 0.260 in itself (still first place)

• Data Mining
  • SVD
  • Item-item similarities
  • Association Rules

• Combination (we used linear regression)
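
One possible reading of the independence estimate is sketched below. The slides do not give the formula, so this is an assumption: each of the user's expected ratings is taken to fall on the movie independently, with probability equal to the movie's share of all ratings. The function name and argument names are hypothetical, and the frequency estimates are simply passed in.

```python
def naive_rating_probability(user_ratings, movie_ratings, total_ratings):
    """Probability that a user rates a movie at least once under independence.

    user_ratings:  estimated number of ratings the user gives in the period.
    movie_ratings: estimated number of ratings the movie receives in the period.
    total_ratings: estimated total number of ratings in the period.
    """
    if total_ratings <= 0:
        raise ValueError("total_ratings must be positive")
    share = min(1.0, movie_ratings / total_ratings)
    # Complement of "none of the user's ratings lands on this movie".
    return 1.0 - (1.0 - share) ** user_ratings


print(naive_rating_probability(user_ratings=10, movie_ratings=100, total_ratings=1000))
```

In this reading, the item and user frequency estimates from the first bullet group are the only inputs, and no interaction between a particular user and a particular movie is modeled.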

Time series prediction

Interest remains for long time range (several years)

Short lifetime of online items

Origo

Very different behavior in time: news articles

http://www.origo.hu/filmklub/20060124kiolte.html

Publication day

Next day usage peak

Third day

and gone …

SVD

K-dim SVD: Noise filtering – the essence of the matrix – optimizes

$$\mathrm{RMSE}^2 = \sum_{i,j} \left(A_{ij} - \hat{A}_{ij}\right)^2$$

• SVD explains ratings as effect of few linear factors

• RMSE (ℓ2 error) 10-30 dim: 0.93

Issue: too many news items

18K Netflix movies vs. potentially infinite set of items

-> may recommend data source but not the item

[Figure: user × item matrix; items are movies or news items]
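
A dependency-free sketch of a rank-k approximation in the spirit of the K-dim SVD above, using power iteration with deflation. It treats the matrix as dense, whereas rating matrices are mostly missing entries and practical systems fit only the observed ones. The function names and the fixed starting vector are assumptions made for the example.

```python
import math


def top_singular_triplet(A, iters=100):
    """Leading singular value and vectors of A (list of rows) by power iteration."""
    n, m = len(A), len(A[0])
    # Deterministic start; assumed not orthogonal to the leading factor.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]  # A v
        w = [sum(A[i][j] * u[i] for i in range(n)) for j in range(m)]  # A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def low_rank_approx(A, k, iters=100):
    """Sum of the k leading factors sigma * u * v^T, found one at a time."""
    n, m = len(A), len(A[0])
    residual = [row[:] for row in A]
    approx = [[0.0] * m for _ in range(n)]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual, iters)
        for i in range(n):
            for j in range(m):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term  # deflation: remove the factor just found
    return approx


def matrix_rmse(A, B):
    cells = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return math.sqrt(sum(cells) / len(cells))


ratings = [[3.0, 4.0], [6.0, 8.0]]  # rank 1: every row is a multiple of (3, 4)
print(matrix_rmse(ratings, low_rank_approx(ratings, k=1)))
```

Each factor pairs one direction in user space with one in item space, which is the "few linear factors" explanation of ratings. It also shows why an unbounded item set is a problem: every new item needs its own column.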

Lessons learned

• Content similarity might be the key feature

• Relative success of trivial estimates on KDD Cup!

• Data mining techniques overlap, apparently catch similar patterns

• Precision/recall is more important than RMSE

• Solution must make heavy use of time

Future plans and ideas

• New partners and application fields: network infrastructure, new generation services, bioinformatics, …?

• Scaling our solutions to multi-core architectures

• Use our search (cross-lingual, multimedia etc) and recommender system capabilities in major solutions; mobile, new generation platforms etc.

• Expand the means of our European-level collaboration, e.g. KIC (Knowledge and Innovation Communities) participation

Questions?

András A. Benczúr

benczur@sztaki.hu

http://datamining.sztaki.hu