Web Usage Mining with Semantic Analysis

Laura Hollink, VU University Amsterdam; Peter Mika, Yahoo! Labs Barcelona; Roi Blanco, Yahoo! Labs Barcelona


Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. In proceedings of the International World Wide Web Conference, Rio de Janeiro, Brazil, May 2013.

Transcript of: Web Usage Mining with Semantic Analysis

Page 1

Web Usage Mining with Semantic Analysis

Laura Hollink, VU University Amsterdam

Peter Mika, Yahoo! Labs Barcelona

Roi Blanco, Yahoo! Labs Barcelona

Page 2

Analysis of web user behavior

What are typical use cases? Are these carried out in a particular order?

Which use cases are not satisfied? And to which other sites do users go?

Page 3

Analysis of web user behavior


oakland as bradd pitt movie | moneyball | movies.yahoo.com | oakland as | wikipedia.org

captain america | movies.yahoo.com | moneyball trailer | movies.yahoo.com

money | moneyball | movies.yahoo.com

moneyball | movies.yahoo.com | movies.yahoo.com | en.wikipedia.org | movies.yahoo.com | peter brand | peter brand oakland | nymag.com | moneyball the movie | www.imdb.com

moneyball trailer | movies.yahoo.com | moneyball trailer

brad pitt | brad pitt moneyball | brad pitt moneyball movie brad pitt moneyball | brad pitt moneyball oscar | www.imdb.com

relay for life calvert ocunty | www.relayforlife.org | trailer for moneyball | movies.yahoo.com | moneyball.movie-trailer.com

moneyball | en.wikipedia.org | movies.yahoo.com | map of africa | www.africaguide.com

money ball movie | www.imdb.com | money ball movie trailer | moneyball.movie-trailer.com

brad pitt new | www.zimbio.com | www.usaweekend.com | www.ivillage.com | www.ivillage.com | brad pitt news | news.search.yahoo.com | moneyball trailer | moneyball trailer | www.imdb.com | www.imdb.com

Transaction logs: sessions of queries and clicks

Page 4

Analysis of web user behavior


Are these use cases typical for all movies? Recent movies? Only for Moneyball?

Page 5

Why are these questions difficult to answer?

Sparsity of the event space

‣ 64% of queries are unique within a year

‣ even the most frequent patterns have extremely low support

To illustrate: top 12 most frequent sessions observed in our data:

Page 6

Tasks

Question 1: What are typical use cases?

‣ Task 1: find sequences of events in the data that are more frequent (have a higher support) than a threshold.

Question 2: Which use cases are not satisfied?

‣ Task 2: learn to predict website abandonment from queries and clicks.
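
Task 1 can be illustrated with a minimal sketch. It assumes that a pattern is a contiguous run of events and that support is the fraction of sessions containing the pattern at least once; the slides do not fix either choice, and the function name, parameters, and toy sessions are hypothetical.

```python
from collections import Counter

def frequent_sequences(sessions, min_support, max_len=3):
    """Return contiguous event sequences whose support exceeds min_support.

    Support is taken to be the fraction of sessions that contain the
    sequence at least once (an assumption, not the authors' definition).
    """
    counts = Counter()
    for session in sessions:
        seen = set()  # count each pattern at most once per session
        for n in range(1, max_len + 1):
            for i in range(len(session) - n + 1):
                seen.add(tuple(session[i:i + n]))
        counts.update(seen)
    total = len(sessions)
    return {seq: c / total for seq, c in counts.items() if c / total > min_support}

# Toy sessions (hypothetical): each event is a query or a clicked site.
sessions = [
    ["moneyball", "movies.yahoo.com"],
    ["moneyball trailer", "movies.yahoo.com"],
    ["moneyball", "movies.yahoo.com", "www.imdb.com"],
]
patterns = frequent_sequences(sessions, min_support=0.5)
```

On raw query strings almost every pattern falls below any useful threshold, which is the sparsity problem from the previous slide; the same counting becomes informative once queries are replaced by entity types.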

Page 7

Approach

oakland as bradd pitt movie | moneyball | movies.yahoo.com | oakland as | wikipedia.org

Applied to the movie domain

Connect queries to entities in the linked open data cloud and use properties of these entities to generalize and categorize queries.

Page 8

Data processing and linking steps

1. link queries to entities

2. select types of entities (classes)

3. detect modifier words (download, trailer, cast, date, etc.)

4. identify navigational queries

5. identify ‘losing’ queries

oakland as bradd pitt movie | moneyball | movies.yahoo.com | oakland as | wikipedia.org

Page 9

1. Linking queries to entities in the LOD cloud

• We link one entity to each query.

• The intent of about 40% of unique Web queries is to find a particular entity [Pound, WWW2008].

• We link to Freebase (has a lot of movie-related info) and DBpedia (Wikipedia is widely used).

Page 10

2. Select one type per entity

• We use the Freebase API to get the semantic “types” of each query URI

• Freebase ‘Notable types API’ is not official and not documented.

• For repeatability and transparency, we have created our own heuristics to select one type for each entity:

1. no internal or administrative types,

2. prefer established domains (‘Commons’) over user-defined schemas (‘Bases’)

3. aggregate specific types into more general types

a) subtypes of location -> location

b) subtypes of award winners and nominees -> award_winner_nominee

c) prefer movie-related types over other types: film, actor, artist, tv_program, tv_actor and location (order of decreasing preference).
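
The heuristics above could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the prefix lists for internal types and user schemas, the full Freebase identifiers, and the two award type names are all assumed for the example.

```python
# Assumed prefixes for internal or administrative types (illustrative only).
INTERNAL_PREFIXES = ("/common/", "/type/", "/freebase/")
# Assumed prefixes for user-defined schemas ("Bases") as opposed to "Commons".
USER_SCHEMA_PREFIXES = ("/base/", "/user/")
# Movie-related types in decreasing order of preference, as listed above.
# The full Freebase identifiers are assumptions.
PREFERRED = ["/film/film", "/film/actor", "/music/artist",
             "/tv/tv_program", "/tv/tv_actor", "location"]

def aggregate(type_id):
    """Heuristic 3a/3b: map specific types to more general ones."""
    if type_id.startswith("/location/"):
        return "location"
    if type_id in ("/award/award_winner", "/award/award_nominee"):
        return "award_winner_nominee"
    return type_id

def select_type(type_ids):
    """Pick one type for an entity, or None if nothing usable remains."""
    # Heuristic 1: no internal or administrative types.
    candidates = [t for t in type_ids if not t.startswith(INTERNAL_PREFIXES)]
    # Heuristic 2: prefer established domains over user-defined schemas.
    commons = [t for t in candidates if not t.startswith(USER_SCHEMA_PREFIXES)]
    candidates = commons or candidates
    # Heuristic 3a/3b: aggregate, keeping first-seen order without duplicates.
    candidates = list(dict.fromkeys(aggregate(t) for t in candidates))
    # Heuristic 3c: prefer movie-related types in the given order.
    for preferred in PREFERRED:
        if preferred in candidates:
            return preferred
    return candidates[0] if candidates else None

chosen = select_type(["/common/topic", "/base/moneyball/topic",
                      "/film/film", "/award/award_nominee"])
```

Writing the rules down as an ordered pipeline makes the selection repeatable, which is the stated reason for not relying on the undocumented notable-types service.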

[Figure: one entity with several candidate types]

Page 11

3. Detect modifier words in queries

Top 100 most frequent words that appear in the query log before or after entity names [Mika ISWC2009, Pantel WWW2012].

movie, movies, theater, cast, quotes, free, theaters, watch, 2011, new, tv, show, dvd, online, sex, video, cinema, trailer, list, theatre . . .
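
A minimal sketch of how such a word list might be collected, assuming each query has already been linked to an entity whose name occurs verbatim in the query. The function name and the toy pairs are hypothetical; the cited work may use a different procedure.

```python
from collections import Counter

def modifier_counts(linked_queries):
    """Count words that appear before or after the entity name in a query.

    linked_queries: iterable of (query, entity_name) pairs, both lowercase.
    Queries that do not contain the entity name verbatim are skipped
    (a simplification).
    """
    counts = Counter()
    for query, name in linked_queries:
        q_words, n_words = query.split(), name.split()
        n = len(n_words)
        for i in range(len(q_words) - n + 1):
            if q_words[i:i + n] == n_words:
                # Everything outside the matched name is a candidate modifier.
                counts.update(q_words[:i] + q_words[i + n:])
                break
    return counts

pairs = [("moneyball trailer", "moneyball"),
         ("watch moneyball online", "moneyball"),
         ("captain america trailer", "captain america"),
         ("brad pitt", "brad pitt")]
counts = modifier_counts(pairs)
top = counts.most_common(1)
```

Taking the 100 most frequent words from such a count would give a list like the one on this slide.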

Page 12

4. Identifying navigational queries

• A navigational query is a query entered with the intention of navigating to a particular website.

• A common heuristic is to consider a query navigational when it matches the domain name of a clicked result.

• The “official homepage” of an entity is the value of dbpedia:homepage, dbpedia:url, and foaf:homepage.

netflix login -> www.netflix.com

banana -> www.bananas.org

European Parliament -> europarl.europa.eu
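
The two ideas on this slide can be compared in a short sketch. It assumes a strict reading of the string heuristic (the query, with spaces removed, must equal one label of the clicked host) and assumes the homepage values have already been fetched for the linked entity; the helper names and the homepage URL in the example are hypothetical.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def matches_domain_name(query, clicked_url):
    """String heuristic: the query, without spaces, is a label of the host."""
    labels = host_of(clicked_url).split(".")
    return query.replace(" ", "").lower() in labels

def clicks_official_homepage(clicked_url, homepages):
    """Semantic variant: the click goes to a known homepage of the entity.

    homepages: URLs taken from dbpedia:homepage, dbpedia:url and
    foaf:homepage of the entity linked to the query.
    """
    return host_of(clicked_url) in {host_of(h) for h in homepages}

# Hypothetical homepage value for the entity linked to the query.
homepages = ["http://www.europarl.europa.eu/"]
by_string = matches_domain_name("European Parliament", "europarl.europa.eu")
by_homepage = clicks_official_homepage("europarl.europa.eu", homepages)
```

In this example the string heuristic misses the click while the homepage lookup catches it, which is the kind of case the semantic definition is meant to cover.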

Page 13

5. Identify ‘losing’ queries

• A ‘losing’ query is a query that leads a user to abandon a service in favor of another service.

• Common definition: A user repeats the same query and clicks on another result in the list.

• Our broader, semantic definition:

Page 14

Evaluation

1. Linking to entities and types

2. Detection of frequent usage patterns

3. Prediction of website abandonment

Applied to the movie domain

• Sample of server logs of Yahoo! Search in the US from June 2011, split into sessions.

• Only sessions that contain at least one visit to any of 16 popular movie sites.

• 1.7 million sessions, containing over 5.8 million queries and over 6.8 million clicks.

Page 15

Evaluation of links to entities and types

• Compare manually created <query, entity> and <entity, type> pairs to automatically created links.

• 2 samples: the 50 most frequent queries and 50 random queries.

Examples:

• Ambiguous query: “Green Lantern” - the movie or the fictional character?

• Wrong type: Oil peak is a serious game subject?

Page 16

Evaluation of links to entities and types

[Figure: three panels (Queries, Entities, Types); y-axis: frequency of occurrence]

Page 17

Frequent usage patterns I

• Freebase:release_date property of entities.

[Figure: two panels, Recent movies and Older movies]

Page 18

Frequent usage patterns II

• Sequences of consecutive query types.
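
As a small illustration, consecutive query types can be counted as pairs once every query in a session has been mapped to one type. The sessions below are invented toy data and the function name is hypothetical.

```python
from collections import Counter

def type_transitions(typed_sessions):
    """Count pairs of consecutive query types across sessions."""
    counts = Counter()
    for types in typed_sessions:
        # zip pairs each type with its successor in the same session
        counts.update(zip(types, types[1:]))
    return counts

# Toy data: each session is the list of types of its queries, in order.
typed_sessions = [
    ["film", "actor", "film"],
    ["film", "actor"],
    ["tv_program"],
]
transitions = type_transitions(typed_sessions)
```

Because there are far fewer types than distinct query strings, such pairs repeat often enough to reach meaningful support.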

Page 19

Frequent usage patterns III

• A comparison of websites.

• most frequent query types that lead to a click on a website.

[Figure: four bar charts (Website 1 to Website 4), each showing per Freebase type of the queried entity the proportion of queries that lead to a click on the website (y-axis from 0.0 to 0.6), followed by a two-panel comparison of Website A and Website B (y-axis: proportion of queries)]

Page 20

Predicting website abandonment

• 3 classification tasks: given (part of) a session in which a user is lost/gained, predict...

1. ...whether a user will be gained for a given website.

2. ...given that the session includes a given website, whether this website is in the losing or gaining position.

3. ...given that the session includes two given websites, which one is in the gaining position.

• Gradient Boosted Decision Trees.

Page 21

Discussion and future work

• Mining patterns of entire queries gives problems with sparsity of data.

• We interpret the structure and semantics of the queries, using openly available, up-to-date information on the Web.

‣ give a “semantic” definition of navigational and ‘losing’ queries

‣ find patterns of user behavior

‣ predict website abandonment

• This is the beginning:

‣ Use more properties of entities, more features.

‣ Detect more complex patterns.

‣ Explore other linked open datasets.

Page 22

Thank you!

Questions?