Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 ·...

42
Mining di Dati Web Lezione 4: Query Logs

Transcript of Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 ·...

Page 1: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Mining di Dati WebLezione 4: Query Logs

Page 2: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Introduction

Web

Search Server

Client Browser

Query Logs

Web Search Engine

Page 3: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

What’s Stored into Query Logs?

• The query

• The requested page of results

• The searcher ID (usually IP address)

• The clicked result

• The clicked URL

Page 4: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

What’s Really Stored into Query Logs?

• Our tastes

• Our preferences

• Our desires

• Our curiosities

• ...

Page 5: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

So... What’s REALLY REALLY there?

User #17988817 searched on 'how to kill someone' found http://www.43things.com at 2006-04-26 20:41:06 User #17988817 searched on 'how to kill someone' found http://www.everything2.com at 2006-04-26 20:41:06 User #17988817 searched on 'how to kill someone' found http://quizilla.com at 2006-04-26 20:41:06 User #17988817 searched on 'how to kill someone' found http://www.linkbase.org at 2006-04-26 20:41:06 User #17988817 searched on 'the perfect murder' found http://aperfectmurder.warnerbros.com at 2006-04-26 20:40:16 User #17988817 searched on 'the perfect mudred' at 2006-04-26 20:39:57 User #17988817 searched on 'the real perfect crime' found http://www.wired.com at 2006-04-26 20:37:38 User #17988817 searched on 'the perfect crime' found http://www.perfect-crime.com at 2006-04-26 20:35:58 User #17988817 searched on 'real serial killers' found http://www.murraymoffatt.com at 2006-04-26 20:34:57 User #17988817 searched on 'real serial killers' found http://members.firstinter.net at 2006-04-26 20:27:34 User #17988817 searched on 'real serial killers' found http://www.geocities.com at 2006-04-26 20:27:34 User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched on 'killers' at 2006-04-26 20:23:41 User #17988817 searched on 'google'

Page 6: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Uhmmm!!!!

User #2732442 searched on 'how do i know if my friend is a lesbian' found http://www.avert.org at 2006-03-07 01:31:52

User #2732442 searched on 'do lesbians have loud voices' at 2006-03-07 01:29:30

User #2732442 searched on 'what do you do if a girl kisses you and they are the same sex as you' at 2006-03-07 01:27:15

User #2732442 searched on 'can you kiss your friends' at 2006-03-07 01:22:55

Page 7: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

WTF!?!

User #582088 searched for 'can you hear me out there i can hear you i got you i can hear you over i really feel strange i wanna wish for something new this is the scariest thing ive ever done in my life who do we think we are angels and airwaves im gonna count down till 10 52 i can' at 2006-05-30 19:31:54

Page 8: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

I Think I’m Becoming Addicted

User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'what is obsessive compulsive disorder' at 2006-05-08 08:37:14 User #11486558 searched on 'cat wiping butt on carpet' at 2006-05-08 08:28:18 User #11486558 searched on 'cat wiping butt on carpet' at 2006-05-08 08:28:18 User #11486558 searched on 'cat wiping butt on carpet' at 2006-05-08 08:28:18 User #11486558 searched on 'my cat keeps rubbing her butt on the carpet medical problem' at 2006-05-08 08:27:27 User #11486558 searched on 'anxiety and fasciculations' at 2006-05-08 08:18:52 User #11486558 searched on 'twitching fasciculations at the base of thumb' at 2006-05-08 08:18:17 User #11486558 searched on 'twitching fasciculations at the base of thumb' at 2006-05-08 08:13:15 User #11486558 searched on 'twitching fasciculations at the base of thumb' at 2006-05-08 08:13:15 User #11486558 searched on 'twitching fasciculations at the base of thumb' at 2006-05-08 08:13:15 User #11486558 searched on 'twitching fasciculations at the base of thumb' at 2006-05-08 08:13:15

Page 9: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

The Last One!!User #6416389 searched on 'cannibals eating fat girls' at 2006-05-11 18:48:05 User #6416389 searched on 'cannibals eating fat girls' found http://www.gymmuenchenstein.ch at 2006-05-11 18:44:05 User #6416389 searched on 'cannibals eating fat girls' found http://www.pervscan.com at 2006-05-11 18:40:57 User #6416389 searched on 'cannibals eating fat girls' found http://www.mayhem.net at 2006-05-11 18:40:57 User #6416389 searched on 'pictures of women being spit roasted' at 2006-05-11 15:06:41 User #6416389 searched on 'pictures of women being spit roasted' found http://www.villagevoice.com at 2006-05-11 15:01:04 User #6416389 searched on 'dolcett roasted girls' found http://www.geocities.com at 2006-05-10 15:47:03 User #6416389 searched on 'dolcett roasted girls' found http://www.geocities.com at 2006-05-10 15:47:03 User #6416389 searched on 'dolcett roasted girls' found http://www.sickestsites.com at 2006-05-10 15:47:03 User #6416389 searched on 'dolcett roasted girls' found http://www.sickestsites.com at 2006-05-10 15:47:03 User #6416389 searched on 'roasting girls in a barbeque pit' at 2006-05-10 15:46:16 User #6416389 searched on 'roasting girls in a barbeque pit' at 2006-05-10 15:45:47 User #6416389 searched on 'muki' found http://www.mukiskitchen.com at 2006-05-10 10:59:01 User #6416389 searched on 'cannibal horror stories' found http://movies.monstersandcritics.com at 2006-05-10 10:57:30 User #6416389 searched on 'cannibal horror stories' found http://www.locusmag.com at 2006-05-10 10:51:45 User #6416389 searched on 'cannibal horror stories' found http://daha.best.vwh.net at 2006-05-10 10:51:45 User #6416389 searched on 'cooking the buttocks of women' at 2006-05-10 10:50:55 User #6416389 searched on 'cooking the buttocks of women' at 2006-05-10 10:50:21 User #6416389 searched on 'fat buttocks of girls ready for the pot' at 2006-05-10 10:49:16 User #6416389 searched on 'raping and cooking girl flesh' at 2006-05-10 10:48:32

Page 10: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Serious Studies

• Broder, A. 2002. A taxonomy of web search. SIGIR Forum 36, 2 (Sep. 2002), 3-10.

• Jansen, B. J. and Spink, A. 2006. How are we searching the world wide web?: a comparison of nine search engine transaction logs. Inf. Process. Manage. 42, 1 (Jan. 2006), 248-263.

• Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D., and Frieder, O. 2004. Hourly analysis of a very large topically categorized web query log. In Proceedings of the 27th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Sheffield, United Kingdom, July 25 - 29, 2004). SIGIR '04. ACM, New York, NY, 321-328.

Page 11: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Classical IR Query Formulation

Info Need Query

Matching Rules

Corpus

ResultsQuery

Refinement

Page 12: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Web IR Query Formulation

Info Need Query

Matching Rules

Corpus

ResultsQuery

Refinement

Verbal Form

Task

Page 13: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Verbal Form

• Means put in words what we want to search.

• Put in “simple” words like:

• “real serial killer”

• “cat wiping butt on carpet” instead of “my cat keeps rubbing her butt on the carpet medical problem”

Page 14: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Tasks

• Users have in mind something when search the Web

• User intent is driven by a task to accomplish

Page 15: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

A Taxonomy of Web Searches

• Broder, A. 2002. A taxonomy of web search. SIGIR Forum 36, 2 (Sep. 2002), 3-10.

• Web queries can be classified according to their intent into 3 classes:

• Navigational

• Informational

• Transactional

Page 16: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Navigational Queries

• Greyhound Bus

• Probable target http://www.greyhound.com

• compaq

• Probable target http://www.compaq.com

• national car rental

• Probable target http://www.nationalcar.com

• american airlines home

• Probable target http://www.aa.com

• Don Knuth

• Probable target http://www-cs-faculty.stanford.edu/~knuth

Page 17: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Informational Queries

• Wide

• cars

• San Francisco

• Narrow

• normocytic anemia

• Scoville heat units

Page 18: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Transactional Queries

• Web search engines are used to look for a more specialized provider

• find music

• where is the yellow page DB?

• find servers for gaming

Page 19: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

User Survey

!"#$%!"#%&'#('%)!!*!&+#%!,-)(+'%*.!#('!*!*)/%-*.+,-'0%122,(+*.3%!,%455678%9:;%,<%!"#%&'#('%=*#-%'&2"%>,>?&>'%)'%*.!#(<#(#.2#0%%

!%%%%%%%%%1'%)%2,.'#@&#.2#%,<%'#/<%'#/#2!*,.A%!"#%>#(2#.!)3#%,<%@&#(*#'%-*!"%'#B&)/%2,.!#.!%*'%&.+#(%7;0%C.%2,.!()'!A%!"#%>#(2#.!)3#%,<%'&2"%@&#(*#'%*.%!"#%/,3%*'%)D,&!%7E;0%%

!%%%%%%%%%FE%*'%&'#+%!,%+*'!*.3&*'"%D#!-##.%.)=*3)!*,.)/%).+%.,.?.)=*3)!*,.)/%@&#(*#'0%G"#%>#(2#.!)3#%,<%@&#(*#'%*+#.!*<*#+%)'%.)=*3)!*,.)/%-)'%EH0I;A%.,.?.)=*3)!*,.)/%@&#(*#'%)22,&.!#+%<,(%J90H;A%).+%K07;%,<%!"#%'&(=#L'%+*+%.,!%).'-#(%F70%G"&'%)$,.3%(#'>,.+#.!'%!,%FEA%!"#%>#(2#.!)3#%,<%.)=*3)!*,.)/%@&#(*#'%-)'%EJ0H;0%%

!%%%%%%%%%M,-#=#(A%-#%2,&/+%.,!%<*.+%)%'*$>/#%@&#'!*,.%!,%+*'!*.3&*'"%D#!-##.%!().')2!*,.)/%).+%*.<,($)!*,.)/%@&#(*#'0%C.'!#)+%-#%*+#.!*<*#+%',$#%,<%!"#%$,'!%>,>&/)(%!().')2!*,.)/%@&#(*#'A%.)$#/L%'",>>*.3%,.%!"#%-#D%NF:07O%).+%+,-./,)+*.3%+)!)%NF:0:O0%NP,!#%!")!%*<%!"#%*.!#.!%*'%'",>>*.3%!""#$%&#'&(%!"#%@&#(L%*'%>(,D)D/L%*.<,($)!*,.)/%()!"#(%!").%!().')2!*,.)/O0%1$,.3%>#,>/#%!")!%).'-#(#+%D,!"%FE%).+%F:%!"#'#%!L>#%,<%@&#(*#'%)22,&.!%<,(%:E0:;%,<%!"#%.,.?.)=*3)!*,.)/%@&#(*#'0%NQ",>>*.3%,.%!"#%-#D%K0JI;%).+%+,-./,)+%EH0JI;0O%G"&'%!"#%!,!)/%.&$D#(%,<%!().')2!*,.)/%@&#(*#'%*'%)!%/#)'!%E:09;0%%%G"#%'&(=#L%")+%',$#%)++*!*,.)/%@&#'!*,.'A%*.%>)(!*2&/)(%FK%-)'%)*#+!,-#!'*#'!-./0#12&3/&#.&/4-5(&#$%&#&634$#15&4&#!"#5*"!-73$5!*#+!,#3-&#/&&85*90%R)'#+%,.%')$>/#%,<%E66%@&#(*#'A%!"#%@&#(L%!#B!A%).+%!"#%#B>/).)!*,.%>(,=*+#+%)!%FKA%-#%#'!*$)!#%",-#=#(%!")!%!"#%.&$D#(%,<%!().')2!*,.)/%@&#(*#'%)$,.3%'&(=#L%(#'>,.+#.!'%*'%)D,&!%:J;0%%

!%%%%%%%%%F&#(*#'%!")!%)(#%.#*!"#(%!().')2!*,.)/A%.,(%.)=*3)!*,.)/A%)(#%)''&$#+%!,%D#%*.<,($)!*,.)/0%%!%%%%%%%%%G"#%,=#()//%(#'&/!'%,<%!"#%'&(=#L%)(#%>(#'#.!#+%*.%<*3&(#%H0%%

!"#$%&'()'*$%+&,'-./0&%/)''

12#'-.-3,/"/'

Q*.2#%*.<#((*.3%!"#%&'#(%*.!#.!%<(,$%!"#%@&#(L%*'%)!%D#'!%).%*.#B)2!%'2*#.2#A%D&!%&'&)//L%)%-*/+%3&#''A%!"#%+)!)%,D!)*.#+%<(,$%/,3%).)/L'*'%*'%=#(L%S',<!S0%T#%'#/#2!#+%)!%().+,$%)%'#!%,<%7666%@&#(*#'%<(,$%!"#%+)*/L%1/!)U*'!)%/,30%V(,$%!"*'%'#!%-#%(#$,=#+%.,.?W.3/*'"%@&#(*#'%).+%'#B&)//L%,(*#.!#+%@&#(*#'0%NG"#%/)!#(%D#*.3%)D,&!%76;%,<%!"#%W.3/*'"%@&#(*#'O0%V(,$%!"#%(#$)*.*.3%'#!%!"#%<*('!%H66%@&#(*#'%-#(#%*.'>#2!#+0%F&#(*#'%!")!%-#(#%.#*!"#(%!().')2!*,.)/A%.,(%.)=*3)!*,.)/A%-#(#%)''&$#+%!,%D#%

4)'56"76'28'96&'823320".#':&/7%";&/';&/9'06-9',2$'-%&'9%,".#'92':2<%%

%%%%%%%4()=>?%%C%-).!%!,%3#!%!,%)%'>#2*<*2%-#D'*!#%!")!%C%)/(#)+L%")=#%*.%$*.+%%

%%%%%%%@A)(B?%%C%-).!%)%3,,+%'*!#%,.%!"*'%!,>*2A%D&!%C%+,.X!%")=#%)%'>#2*<*2%'*!#%*.%$*.+%%%% % %% %% % % % % % % % % %% %%

>)'56"76'28'96&'823320".#';&/9':&/7%";&/'06,',2$'72.:$79&:'96"/'/&-%76<%%

%%%%%%%A)B@?%%C%)$%'",>>*.3%<,(%',$#!"*.3%!,%D&L%,.%!"#%C.!#(.#!%%

%%%%%%%=)(@?%%C%)$%'",>>*.3%<,(%',$#!"*.3%!,%D&L%#/'#-"#(#%!").%,.%!"#%C.!#(.#!%%

%%%%%%%44)==?%%C%-).!%!,%+,-./,)+%)%<*/#%N#030A%$&'*2A%*$)3#'A%>(,3()$'A%#!20O%%

%%%%%%%=C)BD?%%P,.#%,<%!"#'#%(#)',.'%%%% % %% %% % % % % % % % % %% %%

()'56"76'28'96&'823320".#':&/7%";&/';&/9'06-9',2$'-%&'322E".#'82%<%%

%%%%%%%B()A>?%%1%'*!#%-"*2"%*'%)%2,//#2!*,.%,<%/*.Y'%!,%,!"#(%'*!#'%(#3)(+*.3%!"*'%!,>*2%%

%%%%%%%C@)@4?%%G"#%D#'!%'*!#%(#3)(+*.3%!"*'%!,>*2%%%% % %% %% % % % % % % % % %% %%

Page 20: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Query Classification!"#$%&'(!$"')*!"*!"(+"(,**

!"#$%&'()'*$&%+',-.//"0",.1"23)''

()'45&'&62-$1"23'20'/&.%,5'&3#"3&/'

-"*.!+/*$#*(0+*('1$"$&2*3!45644+3*4$*#'%*/+*!3+"(!#2*(0%++*4('7+4*!"*(0+*+.$)6(!$"*$#*/+8*4+'%50*+"7!"+49**

!*********:!%4(*7+"+%'(!$"*;;*64+4*&$4()2*$";<'7+*3'('*=(+1(*'"3*#$%&'((!"7>*'"3*!4*.+%2*5)$4+*($*5)'44!5*-?,*@6<<$%(4*&$4()2*!"#$%&'(!$"')*A6+%!+4,*B0!4*/'4*4('(+;$#;(0+*'%(*'%$6"3*CDDE;CDDF*'"3*/'4*+1+&<)!#!+3*82*G)('H!4('I*J15!(+I*K+8L%'/)+%I*+(5,**

!*********@+5$"3*7+"+%'(!$"*;;*64+*$##;<'7+I*/+8;4<+5!#!5*3'('*4650*'4*)!"M*'"')24!4I*'"50$%;(+1(I*'"3*5)!5M;(0%$670*3'(',*B0!4*7+"+%'(!$"*46<<$%(4*8$(0*!"#$%&'(!$"')*'"3*"'.!7'(!$"')*A6+%!+4*'"3*4('%(+3*!"*CDDN;CDDD,*O$$7)+*/'4*(0+*#!%4(*+"7!"+*($*64+*)!"M*'"')24!4*'4*'*<%!&'%2*%'"M!"7*#'5($%*'"3*P!%+5(Q!(*5$"5+"(%'(+3*$"*5)!5M;(0%$670*3'(',*R2*"$/I*'))*&'S$%*+"7!"+4*64+*'))*(0+4+*(2<+4*$#*3'(',*T!"M*'"')24!4*'"3*'"50$%(+1(*4++&4*5%65!')*#$%*"'.!7'(!$"')*A6+%!+4,**

!*********B0!%3*7+"+%'(!$"*;;*+&+%7!"7*"$/I*'((+&<(4*($*8)+"3*3'('*#%$&*&6)(!<)+*4$6%5+4*!"*$%3+%*($*(%2*($*'"4/+%*UU(0+*"++3*8+0!"3*(0+*A6+%2VV,*:$%*!"4('"5+*$"*'*A6+%2*)!M+*!"#$%&"#'()'**(0+*

+"7!"+*&!70(*<%+4+"(*3!%+5(*)!"M4*($*'*0$(+)*%+4+%.'(!$"*<'7+*#$%*@'"*:%'"5!45$I*'*&'<*4+%.+%I*'*/+'(0+%*4+%.+%I*+(5,*B064*(0!%3*7+"+%'(!$"*+"7!"+4*7$*8+2$"3*(0+*)!&!('(!$"*$#*'*#!1+3*5$%<64I*.!'*4+&'"(!5*'"')24!4I*5$"(+1(*3+(+%&!"'(!$"I*32"'&!5*3'('*8'4+*4+)+5(!$"I*+(5,*B0+*'!&*!4*($*46<<$%(*!"#$%&'(!$"')I*"'.!7'(!$"')I*'"3*(%'"4'5(!$"')*A6+%!+4,*B0!4*!4*'*%'<!3)2*50'"7!"7*)'"345'<+,**

7)'8&-.1&9':2%;'

B0+%+*!4*'"*#'!%)2*%!50*)!(+%'(6%+*3+')!"7*/!(0*"'.!7'(!$"*$"*(0+*/+8I*86(*!(*!4*&$4()2*5$"5+%"+3*/!(0*"'.!7'(!$"*.!'*)!"M4*$"*<'7+I*8$$M&'%M4I*'"3*(0+*8%$/4+%*W8'5MW*86(($",*@++*B'6450+%*'"3*O%++"8+%7*XBODFYI*L$5M86%"*'"3*Z$"+4*XLZD[YI*'"3*L'()+37+*'"3*\!(M$/*XL\DEY,*]'.!7'(!$"*=WO$*($*<'7+W>*!4*!3+"(!#!+3*'4*$"*$#*(0+*('4M4*!"*(0+*WB'4M$"$&2W*=4!5>*$#*KKK*64+*!".+"(+3*82*R2%"+*'(*'),*XRZKLDDY*Q$/+.+%I*4+'%50*'4*'*"'.!7'(!$"*($$)*0'4*"$(*8++"*45%6(!"!^+3*&650*4$*#'%I*&'28+*8+5'64+*4+'%50*+"7!"+4*0'.+*8+5$&+*<%$#!5!+"(*'(*(0!4*$")2*!"*%+5+"(*2+'%4,**

P'.!3*Q'/M!"7*'"3*0!4*5$))+'76+4*0'.+*5$"365(+3*'*4+%!+4*$#*/+8*4+'%50*+"7!"+4*+.')6'(!$"4I*!"*<'%(!56)'%*WK0!50*4+'%50*+"7!"+*!4*8+4(*'(*#!"3!"7*'!%)!"+*0$&+*<'7+4_W*XLQO`CY*/0!50*!4*$#*5$6%4+*'*%+<%+4+"('(!.+*"'.!7'(!$"')*('4MI*'"3*WK0!50*4+'%50*+"7!"+*!4*8+4(*'(*#!"3!"7*$")!"+*4+%.!5+4_W*XQLO`CY*/0!50*!4*'*(%'"4'5(!$"')*('4M,*Q'/M!"7*+(*'),*')4$*40$/*(0'(*'"50$%*(+1(*<%$.!3+4*'*.+%2*+##+5(!.+*0+)<*#$%*"'.!7'(!$"')*A6+%!+4*XLQ?`CY**

<)'=23,-$/"23/'.39'0$1$%&'9"%&,1"23/'

a"*(0+*/+8*W(0+*"++3*8+0!"3*(0+*A6+%2W*&!70(*8+*

4+>&'20'?$&%+ @/&%'A$%6&+**$&%+'B2#'C3.-+/"/*

]'.!7'(!$"')* bc,Ed* b`d*

-"#$%&'(!$"')*__*=+4(!&'(+3*

eDd>**cNd*

B%'"4'5(!$"')*f*bbd*=+4(!&'(+3*e[d>*

e`d*

Page 21: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Comparison of Nine Search Engines

• Jansen, B. J. and Spink, A. 2006. How are we searching the world wide web?: a comparison of nine search engine transaction logs. Inf. Process. Manage. 42, 1 (Jan. 2006), 248-263.

• a 1997 study of Excite Web search engine• a 1998 study of the Fireball Web search• a 1998 study of the AltaVista Web search engine• a 1999 study of the Excite Web search engine• a 2000 study of the BWIE Web search service• a 2001 study of Excite Web search engine• a 2002 study of AlltheWeb.com• a 2002 study of AltaVista

Page 22: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Characteristics of the Studies

Table 1Aggregate data from Web search engine studies from 1997 through 2002

Study no. 1 2 3 4 5 6 7 8 9

Excite Fireball AltaVista Excite BWIE AlltheWeb.com Excite AlltheWeb.com AltaVista

Region US European US US European European US European USData collection Tuesday 16

September 19971–31 July 1998 2 August–13

September 1998Wednesday 1December 1999

3–18 May2000

Tuesday 6February 2001

Monday 30April 2001

Tuesday 28May 2002

Sunday 8September2002

Sessions 211,063 Not reported 285,474,117 325,711 83,232 153,297 262,025 345,093 369,350Queries 1,025,908 16,252,902 993,208,159 1,025,910 71,810 451,551 1,025,910 957,303 1,073,388Terms 1,277,763 Not reported Not reported 1,500,500 116,953 1,350,619 1,538,120 2,225,141 1,073,388

B.J.

Jansen,A.Spink

/Inform

ationProcessing

andManagem

ent42

(2006)248–263

253

Page 23: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Characteristics of the Studies

Table 1Aggregate data from Web search engine studies from 1997 through 2002

Study no. 1 2 3 4 5 6 7 8 9

Excite Fireball AltaVista Excite BWIE AlltheWeb.com Excite AlltheWeb.com AltaVista

Region US European US US European European US European USData collection Tuesday 16

September 19971–31 July 1998 2 August–13

September 1998Wednesday 1December 1999

3–18 May2000

Tuesday 6February 2001

Monday 30April 2001

Tuesday 28May 2002

Sunday 8September2002

Sessions 211,063 Not reported 285,474,117 325,711 83,232 153,297 262,025 345,093 369,350Queries 1,025,908 16,252,902 993,208,159 1,025,910 71,810 451,551 1,025,910 957,303 1,073,388Terms 1,277,763 Not reported Not reported 1,500,500 116,953 1,350,619 1,538,120 2,225,141 1,073,388

B.J.

Jansen,A.Spink

/Inform

ationProcessing

andManagem

ent42

(2006)248–263

253

Table 1Aggregate data from Web search engine studies from 1997 through 2002

Study no. 1 2 3 4 5 6 7 8 9

Excite Fireball AltaVista Excite BWIE AlltheWeb.com Excite AlltheWeb.com AltaVista

Region US European US US European European US European USData collection Tuesday 16

September 19971–31 July 1998 2 August–13

September 1998Wednesday 1December 1999

3–18 May2000

Tuesday 6February 2001

Monday 30April 2001

Tuesday 28May 2002

Sunday 8September2002

Sessions 211,063 Not reported 285,474,117 325,711 83,232 153,297 262,025 345,093 369,350Queries 1,025,908 16,252,902 993,208,159 1,025,910 71,810 451,551 1,025,910 957,303 1,073,388Terms 1,277,763 Not reported Not reported 1,500,500 116,953 1,350,619 1,538,120 2,225,141 1,073,388

B.J.

Jansen,A.Spink

/Inform

ationProcessing

andManagem

ent42

(2006)248–263

253

Page 24: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Terminology• Session length

• is the number of queries that a searcher submits in one episode with a Web search engine

• Query length

• is the number of terms in a query

• Term

• Operator usage

• whether or not the query includes boolean operators like AND, OR, MUST, APPEAR, PHRASE

• Result page

• Result page viewed

• It usually is the clicked result

Page 25: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Percentage of Single Query Sessions

All figures in this paper follow a similar layout. The x-axis is the year of the study. The y-axis is the mea-sured percentage for a particular metric. The dark bar columns show the data points for the European stud-ies. The light bar columns show the data points for the US studies. There is a label on the columnsidentifying each study (i.e., ATW—AlltheWeb.com, AV—AltaVista, BWIE—BWIE, EX—Excite, FB—Fireball).

Fig. 1 shows that for the US Web search engines, it does not appear that the complexity of interactionsis increasing as indicated by longer sessions (i.e., users submitting more Web queries). We conducted aChi-Square goodness of fit procedure to evaluate whether or not the percentage of one query sessionacross Web search engines was significantly di!erent (i.e., homogeneity of proportions). A Chi-Squaretest indicated only marginally significance di!erence among the Web search engines in terms of percent-age of one query sessions (Chi-Square (6) = 11.09, p = 0.086). However, if the 1998 AltaVista dataset isremoved, there is no significant di!erence among the remaining search engine datasets. This would indi-cate that the temporal cut-o! used for analysis in the 1998 AltaVista study (Silverstein et al., 1999) wastoo short.

In 2002, approximately 47% of searchers on AltaVista submitted only one query, down from 77% in1998. In the 1998 study, however, a session was artificially limited to 5 min. Subsequent research has shownthat the typical Web session is about fifteen minutes (He et al., 2002; Jansen & Spink, 2003). Therefore, the1998 AltaVista study probably over estimates the number of one query sessions. The downward trend alsoappears with Excite users from 1999 to 2002, dropping from 60% to 55%, although not a significant de-crease. The data analysis methods were similar for all Excite studies and did not impose a session time limit.

The session data for European users is available from two Web search engines, BWIE and Allthe-Web.com. For these European Web search engines, there is also no significant change in one query sessions.So, for session length, the trend appears to be one of stability, with no di!erences among search engines.

5.2. Queries

At the query length level, we analyze the percentage of queries with only one term. The percentage ofone-term queries will inform us whether or not the length of queries is increasing or decreasing. Fig. 2 dis-plays the results for the analysis of Web query lengths.

A Chi-Square test did indicate a significant di!erence among the Web search engines in terms of percent-age of one-term queries (Chi-Square (7) = 26.43, p = 0.01). However, if the 1998 Fireball dataset is re-moved, there is no significant di!erence among the remaining search engine data. This would signify

EX

AV

EX EX

AVATW

ATW

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

1997 1998 1999 2000 2002

Year of Study

Perc

enta

ge o

f One

Que

ry S

essi

ons

2001

Fig. 1. Percentage of single query sessions.

B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263 255

Page 26: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Percentage of One-Term Queries

that something in the Fireball user base, content, or system di!erentiates it from users of the other Websearch engines.

For the US-based Web-search engines the percentage of one-term queries is holding steady, within arange of 20–29% of all queries. Using data from 1999 onward, the trend with US-based Web-search enginesappears to be of one-term queries declining as a percentage of all queries, dropping from 30% to 20%.

For the Europe-based Web-search-engine users, the trend appears to be one of little change, althoughthere is a spike in 2002 with AlltheWeb.com users. Otherwise, we see a percentage of one-term querieson these European-based Web-search engines within a range of about 25–35%, excluding the 1998 Fireballstudy.

5.3. Query operators

We also analyze the percentage of Web queries containing searching operators. The trend in the percent-age of queries with searching operators will inform us whether or not the complexity of query structure isincreasing or decreasing.

Based on the use of advanced operators, the complexity of interaction appears to be at least remainingstable. Fig. 3 shows the results for query operator usage on the various Web search engines.

The usage of query operators appears to be search-engine dependent, and there is a notable regional dif-ference. A Chi-Square test indicated significant di!erence among the US Web search engines in terms ofpercentage of usage of query operators (Chi-Square (4) = 16.383, p = 0.01). A Chi-Square test indicatedno significant di!erence among the three Excite search-engine datasets in terms of percentage of usageof query. A Chi-Square test indicated no significant di!erence among the two AltaVista search-engine data-sets in terms of percentage of usage of query operators. This indicates that there is a search engine depen-dency in terms of the use of query operators with a particular search-engine system.

For the AltaVista Web search engine, the usage of query operators has held steady at approximately20%. For the Excite Web search engine, the usage increased steadily from 1997 to 2001, although not a sta-tistically significant variation between datasets.

For the European-based Web search engines, the usage also varied among the three Web search engines,but these searchers seldom use advanced operators. A Chi-Square test indicated no significant di!erenceamong the four European search datasets in terms of percentage of usage of query operators, with the usageextremely low on all.

EXAVEX EX

AV

FB

ATW

ATW

0%

10%

20%

30%

40%

50%

60%

1997 1998 1999 2000 2001 2002

Year of Study

Perc

enta

ge o

f Que

ries

Fig. 2. Percentage of one-term queries.

256 B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263

Page 27: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Operator Usage

The most notable feature of operator usage is the rather large gap between usage on the US and Euro-pean-based Web search engines. The usage of query operators on the US-based Web search engines variedfrom 11% to 20%. The usage on the European-based Web search engines varied from 2% to 10% and heldfairly stable at under 5% from 1998 to 2001.

5.4. Results pages

We analyze the percentage of users viewing only one results page. This trend will inform us how persis-tent searchers are when locating information or services on the Web. Overall, it appears that Web searchersare tending to view fewer documents per Web query, which might indicate a move to less complex interac-tions. Fig. 4 presents results-page-viewing findings.

We see that the percentage of searchers viewing only one results page is increasing for users of both USand European-based Web search engines. The percentage of searchers viewing only the first results page hasincreased from 29% in 1997 to 73% in 2002 for US-based Web search engines users. Again, the 1998

EX

AV

EXEX

AV

FBBWIE

ATWATW

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

1997 1998 1999 2000 2001 2002Year of Study

Perc

enta

ge o

f Res

ult P

ages

Vie

wed

Fig. 4. Percentage of single result page viewing.

EXEX

EXAV AV

FBBWIE

ATW

ATW

0%

5%

10%

15%

20%

25%

1997 1998 1999 2000 2001 2002

Year of Study

Perc

enta

ge o

f Que

ries

Fig. 3. Percentage of operator usage.

B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263 257

Page 28: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Single Result Page Views

The most notable feature of operator usage is the rather large gap between usage on the US and Euro-pean-based Web search engines. The usage of query operators on the US-based Web search engines variedfrom 11% to 20%. The usage on the European-based Web search engines varied from 2% to 10% and heldfairly stable at under 5% from 1998 to 2001.

5.4. Results pages

We analyze the percentage of users viewing only one results page. This trend will inform us how persis-tent searchers are when locating information or services on the Web. Overall, it appears that Web searchersare tending to view fewer documents per Web query, which might indicate a move to less complex interac-tions. Fig. 4 presents results-page-viewing findings.

We see that the percentage of searchers viewing only one results page is increasing for users of both USand European-based Web search engines. The percentage of searchers viewing only the first results page hasincreased from 29% in 1997 to 73% in 2002 for US-based Web search engines users. Again, the 1998

EX

AV

EXEX

AV

FBBWIE

ATWATW

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

1997 1998 1999 2000 2001 2002Year of Study

Perc

enta

ge o

f Res

ult P

ages

Vie

wed

Fig. 4. Percentage of single result page viewing.

EXEX

EXAV AV

FBBWIE

ATW

ATW

0%

5%

10%

15%

20%

25%

1997 1998 1999 2000 2001 2002

Year of Study

Perc

enta

ge o

f Que

ries

Fig. 3. Percentage of operator usage.

B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263 257

Page 29: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

AllTheWeb Topics Queried

AltaVista study limited sessions to five minutes, which probably increased the percentage of sessions withonly one page result. For European searchers, the variability ranged from 60% to 83%, although there wasa dip to 76% in 2002.

A Chi-Square test indicated significant di!erence among the Web search engines in terms of percentageof single result page viewing (Chi-Square (8) = 45.743, p = 0.01). A Chi-Square test indicated a significancedi!erence among the three Excite Web search-engine datasets in terms of percentage of single result pageviewing (Chi-Square (2) = 6.049, p = 0.05). A Chi-Square test indicated no significance di!erence among thetwo AltaVista search-engine datasets in terms of percentage of single result page viewing (Chi-Square(1) = 0.911, p = 0.34). A Chi-Square test indicated no significance di!erence among the four Europeansearch datasets in terms of percentage of percentage of single result page viewing (Chi-Square(3) = 4.136, p = 0.247). Therefore, there was trend among Excite users to view fewer result pages. Exciteusers viewed more result pages than users of other Web search engines. However, as time processed, thetendency was to view fewer.

5.5. Topical classification

For the six Web query datasets that we had access to, we qualitatively analyzed a random sample ofapproximately 2600 queries from each in order to determine trends in the type of information peopleare searching for on the Web. We classified each query into eleven non-mutually exclusive, general topiccategories developed by Spink et al. (2002a). At least two independent evaluators manually classified que-ries from each dataset independently. The evaluators then met and resolved discrepancies.

Tables 2 and 3 display the topical evaluation results for European and US-based Web search engines,respectively.

For searching on AlltheWeb.com, People, Places or Things category remained the top ranked categorywith a large percentage increase from 2001 to 2002, accounting for over forty percent of queries. Commerce,Travel, Employment or Economy and Computers, Internet or Technology accounted approximately 25% ofthe queries. Noticeably percentage decreases occurred in Computers or Internet, Entertainment or Recrea-tion, and Sex or Pornography. A Chi-square goodness of fit test indicates a significant di!erence betweenthe Web search-engine datasets based on category of People, Place or Things (Chi-Square (3) = 5.554,p = 0.05).

Table 2Distribution of AlltheWeb.com general topic categories

Categories 2001 (2503 English queries) (%) 2002 (2525 English queries) (%)

1 People, places or things 22.5 41.52 Computers or Internet 21.8 16.33 Commerce, travel, employment, or economy 12.3 12.74 Sex or pornography 10.8 9.55 Entertainment or recreation 9.1 4.96 Health or sciences 7.8 4.57 Society, culture, ethnicity or religion 4.8 2.68 Performing or fine arts 4.7 2.59 Education or humanities 2.9 2.310 Government 2.7 2.111 Unknown or Other 0.6 1.1

100.0 100.0

Note: Bolded percentages indicate the highest ranked topic in a given year.

258 B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263

Page 30: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Excite AltaVista Topics

On the US-based Web search engines. Queries for People, Place or Things account for nearly half of thequeries in 2002, with Commerce, Travel, Employment or Economy and Computers, Internet or Technologyaccounting for another 25% of the queries. There appears to be a steady rise in searching for People, Placeor Things and Commerce, Travel, Employment or Economy, with decreased searching for Sex and Porno-graphy and Entertainment or Recreation. A Chi-squared goodness of fit test indicated significant di!erencesamong the Web search-engines datasets based on distribution of queries among categories in the areas ofPeople, Places, or Things (Chi-Square (3) = 39.317, p = 0.01), Entertainment or Recreation (Chi-Square(3) = 13.80, p = 0.01), and sex and pornography (Chi-Square (3) = 10.892, p = 0.05). There was a marginallysignificant di!erence with the category of Commerce, Travel, Employment, or Economy (Chi-Square(3) = 4.136, p = 0.06). There was no significant di!erence among the datasets in the other categories. Thepercentages of People, Places, or Things and Commerce, Travel, Employment, or Economy are increasingat the expense of Entertainment or Recreation and Sex and Pornography.

6. Discussion

As the Web is becoming a worldwide phenomenon, we need to understand better the emerging trends inWeb searching given the tremendous influence Web search engines have on directing tra"c to online infor-mation and services. Our findings indicate that the interactions between Web search engines and searchersare not becoming more complex, and in some respects, are becoming less complex. Our comparative ana-lysis also indicates that finding from a study focusing on one Web search engine cannot be applied whole-sale to all Web search engines.

Sessions lengths are not increasing as measured by number of queries. The percentage of one-term ses-sions is remaining stable over time and across Web search engines. There was a di!erence with the 1998AltaVista study, but this appears to be caused by an artificially short session duration that the researchersused. Queries lengths are also not increasing as measured by number of terms. There was a statistical dif-ference in the percentage of one-term queries on the German Fireball Web search engine, which may be dueto linguistic di!erences with the other Web search engines. The percentage of single-term queries is holdingsteady, and the use of query operators is also remaining stable. Web search engines in the future may betterleverage the implicit feedback from this interaction to provide more personalized results (Callan &

Table 3Distribution of Excite and AltaVista general topic categories

Categories 1997 Excite(2414 queries) (%)

1999 Excite(2539 queries) (%)

2001 Excite(2453 queries) (%)

2002 AltaVista(2603 queries) (%)

1 People, places, or things 6.7 20.3 19.7 49.32 Commerce, travel, employment, or economy 13.3 24.5 24.7 12.53 Computers or Internet 12.5 10.9 9.7 12.44 Health or sciences 9.5 7.8 7.5 7.55 Education or humanities 5.6 5.3 4.6 5.06 Entertainment or recreation 19.9 7.5 6.7 4.67 Sex and pornography 16.8 7.5 8.6 3.38 Society, culture, ethnicity, or religion 5.7 4.2 3.9 3.19 Government 3.4 1.6 2.0 1.610 Performing or fine arts 5.4 1.1 1.2 0.711 Non-English or unknown 4.1 9.3 11.4 0.0

102.9 100.0 100.0 100.0

Note: Bolded percentages indicate the highest ranked topic in a given year.

B.J. Jansen, A. Spink / Information Processing and Management 42 (2006) 248–263 259

Page 31: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Tracking Down Query Analysis

• Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D., and Frieder, O. 2004. Hourly analysis of a very large topically categorized web query log. In Proceedings of the 27th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Sheffield, United Kingdom, July 25 - 29, 2004). SIGIR '04. ACM, New York, NY, 321-328.

• Analyze a query log of hundred of millions of queries

• An entire week of query traffic to AOL search service

• Analysis made on a hourly basis

Page 32: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Query Distribution

!

"#$"!"#%!&'("!)')*+$,!--./0!1*%,2%(!,%),%(%3"!'3+4!566!7+*("%,(!'8!

1*%,2%(!*(239!:288%,239!(%"(!'8!1*%,4!"%,&(.!

;$34!<%=!(%$,7#!(%,>27%(!#$>%!=%9*3!"'!'88%,!>2%<(!'8!"#%!&'("!

)')*+$,! $3:?',! 7#$39239! @=%7'&239! :,$("27$++4! &',%! ',! +%((!

)')*+$,A! 1*%,2%(B! !CDE!F!!"#$"%&'%"()*G!H$#''! I!+,--& .()"/G!

E47'(! I! '0"& 1234*& 56& 7890& :;%4(& <30;9-G! J''9+%! F! ="89>"8*9G!

C+"$K2("$! I!'4?&@,"%8"*G!C(L!M%%>%(G!N$("! @C++O#%P%=A.! !O#%(%!

>2%<(! 3%7%(($,2+4! 237',)',$"%! $! "%&)',$+! $()%7"G! '8"%3! (#'<239!

)')*+$,! 1*%,2%(! 8',! "#%! 7*,,%3"! "2&%! )%,2':! $3:! "#'(%! "#$"! $,%!

7'3(2("%3"+4!)')*+$,.!!Q'&%!$+('!=,%$L!:'<3!)')*+$,2"4!=4!"')27$+!

7$"%9',2%(.! ! Q4("%&(! (%%L239! "'! :2()+$4! 7#$39239! 1*%,2%(! &*("!

$::,%((! "#%! 2((*%!'8! ,%+$"2>%!>%,(*(!$=('+*"%!7#$39%! 23!$!1*%,4R(!

8,%1*%374! "'! 823:! 1*%,2%(! <#'(%! 7#$39%! 2(! S23"%,%("239TG! 3'"!

(2&)+4! $! 1*%,4! "#$"! <%3"! 8,'&! 8,%1*%374! '3%! "'! "<'! @$! -660!

U*&)AG!',!'3%! "#$"!<%3"! 8,'&!V6G666! "'!VVG666!@$!V666!$=('+*"%!

7#$39%A.!!!

!"# $%&'())#*+&',#-'(../0#

P%! %W$&23%! $! (%$,7#! +'9! 7'3(2("239! '8! #*3:,%:(! '8!&2++2'3(! '8!

1*%,2%(! 8,'&!$!&$U',! 7'&&%,72$+! (%$,7#!(%,>27%!'>%,! "#%! (%>%3I

:$4! )%,2':! 8,'&! V-?-X?6Y! "#,'*9#! V?V?65.! ! O#2(! +'9! ,%),%(%3"(!

1*%,2%(!8,'&!$)),'W2&$"%+4!/6!&2++2'3!*(%,(.!!P%!),%),'7%((!"#%!

1*%,2%(! "'! 3',&$+2Z%! "#%! 1*%,4! (",239(! =4! ,%&'>239! $34! 7$(%!

:288%,%37%(G!,%)+$7239!$34!)*37"*$"2'3!<2"#!<#2"%!()$7%!@(",2))239!

$:>$37%:!(%$,7#!')%,$"',(!8,'&!"#%!$)),'W2&$"%+4!-0!'8!1*%,2%(!

7'3"$23239! "#%&AG!$3:!7'&),%((239!<#2"%! ()$7%! "'! (239+%! ()$7%(.!!

O#%!$>%,$9%!1*%,4!+%39"#!2(!V.[!"%,&(!8',!)')*+$,!1*%,2%(!$3:!-.-!

"%,&(!'>%,!$++!1*%,2%(.! !D3!$>%,$9%G!*(%,(!>2%<!'3+4!'3%!)$9%!'8!

,%(*+"(!\V0!'8!"#%!"2&%G!"<'!)$9%(!V\0!$3:!"#,%%!',!&',%!V0!'8!

"#%!"2&%.!!N2,("G!<%!%W$&23%!",%3:(!23!"#%!1*%,4!(",%$&!$(!$!<#'+%G!

$3:! "#%3!8'7*(!'3! ",%3:(!,%+$"%:! "'!1*%,2%(!&$3*$++4!7$"%9',2Z%:!

23"'!"')27$+!7$"%9',2%(.!

P%!=%923!'*,!$3$+4(2(!'8!"#%!'>%,$++!(",%$&!=4!%W$&23239!#'<!"#%!

>'+*&%!'8! 1*%,4! ",$8827! 7#$39%(! $(!<%!&'>%! 8,'&!)%$L! "'!3'3I

)%$L! #'*,(.! ! P%! (#'<! "#%! )%,7%3"$9%! '8! "#%! :$4R(! "'"$+! $3:!

:2("237"! 3*&=%,! '8! 1*%,2%(! 8',! %$7#! #'*,! 23! "#%! :$4! '3! $>%,$9%!

'>%,!'*,!(%>%3I:$4!)%,2':!23!N29*,%!V!@$++!"2&%(!23!'*,!1*%,4!+'9!

$,%!]$("%,3!Q"$3:$,:!O2&%A.!!D3+4!6.[/0!'8!"#%!:$4R(!"'"$+!1*%,2%(!

$))%$,! 8,'&! /IXC;G!<#%,%$(! X.[0! '8! "#%! :$4R(! 1*%,2%(! $))%$,!

8,'&!^IV6_;.!!_%,#$)(!&',%!23"%,%("239+4G!"#%!,$"2'!'8!:2("237"!"'!

"'"$+!1*%,2%(!23!$!92>%3!#'*,!2(!3%$,+4!7'3("$3"!"#,'*9#'*"!"#%!:$4.!!

O#2(!(#'<(!"#$"!"#%!$>%,$9%!3*&=%,!'8!"2&%(!$!1*%,4!2(!,%)%$"%:!2(!

>2,"*$++4! 7'3("$3"! '>%,! "#%! #'*,(! 23! $! :$4G! ,%&$23239! 3%$,! -.V5!

<2"#!'3+4!$!6.V-!("$3:$,:!:%>2$"2'3.!

C+"#'*9#! "#%! $>%,$9%! ,%)%"2"2'3! '8! 1*%,2%(! ,%&$23(! 3%$,+4!

7'3("$3"G!<%!7$3!%W$&23%! "#2(! 23!9,%$"%,!:%"$2+!=4!&%$(*,239! "#%!

8,%1*%374! :2(",2=*"2'3! '8! 1*%,2%(! $"! >$,2'*(! #'*,(! 23! "#%! :$4G! $(!

(%%3! 23! N29*,%! -.! ! N,'&! "#2(! $3$+4(2(! 2"! 2(! 7+%$,! "#$"! "#%! >$("!

&$U',2"4!'8!1*%,2%(! 23!$3!#'*,!$))%$,!'3+4!'3%! "'!82>%! "2&%(!$3:!

"#$"! "#%(%! ,$,%! 1*%,2%(! 7'3(2("%3"+4! $77'*3"! 8',! +$,9%! )',"2'3(! '8!

"#%!"'"$+!1*%,4!>'+*&%!"#,'*9#'*"!"#%!7'*,(%!'8!"#%!:$4.!

.12345#6#

C+"#'*9#! <%! #$>%! (#'<3! "#$"! "#%! 1*%,4! :2(",2=*"2'3! :'%(! 3'"!

7#$39%! (*=("$3"2$++4! '>%,! "#%! 7'*,(%! '8! $! :$4G! "#2(! :'%(! 3'"!

),'>2:%!23(29#"!23"'!#'<!"#%!(%"(!'8!1*%,2%(!>$,4!8,'&!'3%!#'*,!"'!

"#%!3%W".! !O'!%W$&23%! "#2(G!<%!&%$(*,%! "#%!'>%,+$)!=%"<%%3! "#%!

(%"(!'8!1*%,2%(!%3"%,%:!:*,239!"#'(%!#'*,(.!!P%!*(%!",$:2"2'3$+!(%"!

$3:!=$9!'>%,+$)!&%$(*,%(!$(!92>%3!23!]1*$"2'3!V!$3:!]1*$"2'3!-G!

,%()%7"2>%+4.!!`2("237"!'>%,+$)!&%$(*,%(!"#%!(2&2+$,2"4!=%"<%%3!"#%!

(%"(!'8!*321*%!1*%,2%(!8,'&!%$7#!#'*,G!<#2+%!'>%,$++!@=$9A!'>%,+$)!

&%$(*,%(! "#%! (2&2+$,2"4! '8! "#%2,! 8,%1*%374! :2(",2=*"2'3(! =4!

237',)',$"239!"#%!3*&=%,!'8!"2&%(!%$7#!1*%,4!$))%$,(!23!$3!#'*,G!

Aa@ :AB 8.!!P#2+%!"#%(%!&%$(*,%(!%W$&23%!"#%!(2&2+$,2"4!'8!"#%!(%"(!

'8!1*%,2%(! ,%7%2>%:! 23!$3!#'*,!$3:! "#%!3*&=%,!'8! "2&%(! "#%4!$,%!

%3"%,%:G!"#%4!:'!3'"!237',)',$"%!"#%!,%+$"2>%!)')*+$,2"4!',!,$3L239!

'8! 1*%,2%(! <2"#23! "#%! 1*%,4! (%"(.! ! O'! %W$&23%! "#2(G! <%! $+('!

&%$(*,%! "#%!_%$,('3! 7',,%+$"2'3!'8! "#%!1*%,2%(R! 8,%1*%372%(.! !C(!

7$3!=%!(%%3!8,'&!]1*$"2'3!Y!@<#%,%! Aa@ :AB !2(!"#%!&%$3!3*&=%,!

'8! 1*%,4! ,%)%"2"2'3(! 23! )%,2':! C! $3:! Aa@ :AB* ! 2(! "#%! ("$3:$,:!

:%>2$"2'3!'8!$++!"#%!1*%,4!8,%1*%372%(!23!)%,2':!CAG!"#2(!&%$(*,%(!

"#%! :%9,%%! '8! +23%$,! 7',,%+$"2'3! =%"<%%3! "#%! 8,%1*%372%(! '8! "#%!

1*%,2%(! 23! %$7#! #'*,G! ('! "<'! #'*,(! "#$"! #$:! %W$7"+4! "#%! ($&%!

1*%,2%(! <2"#! %W$7"+4! "#%! ($&%! 8,%1*%372%(! <'*+:! #$>%! $!

7',,%+$"2'3! '8! '3%.! ! b'"%! "#$"! "#2(! 3',&$+2Z%(! 8',! "#%! %88%7"! '8!

:288%,239! 1*%,4! >'+*&%G! 2.%.G! "#%! 7',,%+$"2'3! '8! "<'! #'*,(! <2"#!

%W$7"+4!"#%!($&%!*3:%,+4239!1*%,4!:2(",2=*"2'3(!(2&)+4!(7$+%:!=4!$!

7'3("$3"!<'*+:!$+('!#$>%!$!7',,%+$"2'3!'8!'3%.!

.12345#7#

!"#$"%&'(")*+),-"#'(").'/01)23"#1)4#'++/$)'&)5'$6)7*3#

60

V0

-0

Y0

50

/0

X0

[0

\0

6 X V- V\

7*3#)*+).'1

!"#$"%&'(")*+).'/01)23"#1)4#'++/$

!"#$%&#'()*%+',-#$.#/

!"#$%&#'0./*.12*',-#$.#/

.458359:;#<1=>41?3>1@9#@A#B5C5:>5D#E@34=#A4@F#67G7HGI!

60

/0

V60

V/0

-60

-/0

Y60

Y/0

560

VG66VIV6G666

/6VIVG666

-6VI/66

V6VI-66

/VIV66

-XI/6

-VI-/

VXI-6

VVIV/

V6 ^ \ [ X / 5 Y - V

.458359:;#'J925=

K54:59>J25#@A#-@>JC#*35415=

V-C;IVC;

XC;I[C;

V-_;IV_;

X_;I[_;

323

Page 33: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Hourly Frequency Distribution

!

"#$"!"#%!&'("!)')*+$,!--./0!1*%,2%(!,%),%(%3"!'3+4!566!7+*("%,(!'8!

1*%,2%(!*(239!:288%,239!(%"(!'8!1*%,4!"%,&(.!

;$34!<%=!(%$,7#!(%,>27%(!#$>%!=%9*3!"'!'88%,!>2%<(!'8!"#%!&'("!

)')*+$,! $3:?',! 7#$39239! @=%7'&239! :,$("27$++4! &',%! ',! +%((!

)')*+$,A! 1*%,2%(B! !CDE!F!!"#$"%&'%"()*G!H$#''! I!+,--& .()"/G!

E47'(! I! '0"& 1234*& 56& 7890& :;%4(& <30;9-G! J''9+%! F! ="89>"8*9G!

C+"$K2("$! I!'4?&@,"%8"*G!C(L!M%%>%(G!N$("! @C++O#%P%=A.! !O#%(%!

>2%<(! 3%7%(($,2+4! 237',)',$"%! $! "%&)',$+! $()%7"G! '8"%3! (#'<239!

)')*+$,! 1*%,2%(! 8',! "#%! 7*,,%3"! "2&%! )%,2':! $3:! "#'(%! "#$"! $,%!

7'3(2("%3"+4!)')*+$,.!!Q'&%!$+('!=,%$L!:'<3!)')*+$,2"4!=4!"')27$+!

7$"%9',2%(.! ! Q4("%&(! (%%L239! "'! :2()+$4! 7#$39239! 1*%,2%(! &*("!

$::,%((! "#%! 2((*%!'8! ,%+$"2>%!>%,(*(!$=('+*"%!7#$39%! 23!$!1*%,4R(!

8,%1*%374! "'! 823:! 1*%,2%(! <#'(%! 7#$39%! 2(! S23"%,%("239TG! 3'"!

(2&)+4! $! 1*%,4! "#$"! <%3"! 8,'&! 8,%1*%374! '3%! "'! "<'! @$! -660!

U*&)AG!',!'3%! "#$"!<%3"! 8,'&!V6G666! "'!VVG666!@$!V666!$=('+*"%!

7#$39%A.!!!

!"# $%&'())#*+&',#-'(../0#

P%! %W$&23%! $! (%$,7#! +'9! 7'3(2("239! '8! #*3:,%:(! '8!&2++2'3(! '8!

1*%,2%(! 8,'&!$!&$U',! 7'&&%,72$+! (%$,7#!(%,>27%!'>%,! "#%! (%>%3I

:$4! )%,2':! 8,'&! V-?-X?6Y! "#,'*9#! V?V?65.! ! O#2(! +'9! ,%),%(%3"(!

1*%,2%(!8,'&!$)),'W2&$"%+4!/6!&2++2'3!*(%,(.!!P%!),%),'7%((!"#%!

1*%,2%(! "'! 3',&$+2Z%! "#%! 1*%,4! (",239(! =4! ,%&'>239! $34! 7$(%!

:288%,%37%(G!,%)+$7239!$34!)*37"*$"2'3!<2"#!<#2"%!()$7%!@(",2))239!

$:>$37%:!(%$,7#!')%,$"',(!8,'&!"#%!$)),'W2&$"%+4!-0!'8!1*%,2%(!

7'3"$23239! "#%&AG!$3:!7'&),%((239!<#2"%! ()$7%! "'! (239+%! ()$7%(.!!

O#%!$>%,$9%!1*%,4!+%39"#!2(!V.[!"%,&(!8',!)')*+$,!1*%,2%(!$3:!-.-!

"%,&(!'>%,!$++!1*%,2%(.! !D3!$>%,$9%G!*(%,(!>2%<!'3+4!'3%!)$9%!'8!

,%(*+"(!\V0!'8!"#%!"2&%G!"<'!)$9%(!V\0!$3:!"#,%%!',!&',%!V0!'8!

"#%!"2&%.!!N2,("G!<%!%W$&23%!",%3:(!23!"#%!1*%,4!(",%$&!$(!$!<#'+%G!

$3:! "#%3!8'7*(!'3! ",%3:(!,%+$"%:! "'!1*%,2%(!&$3*$++4!7$"%9',2Z%:!

23"'!"')27$+!7$"%9',2%(.!

P%!=%923!'*,!$3$+4(2(!'8!"#%!'>%,$++!(",%$&!=4!%W$&23239!#'<!"#%!

>'+*&%!'8! 1*%,4! ",$8827! 7#$39%(! $(!<%!&'>%! 8,'&!)%$L! "'!3'3I

)%$L! #'*,(.! ! P%! (#'<! "#%! )%,7%3"$9%! '8! "#%! :$4R(! "'"$+! $3:!

:2("237"! 3*&=%,! '8! 1*%,2%(! 8',! %$7#! #'*,! 23! "#%! :$4! '3! $>%,$9%!

'>%,!'*,!(%>%3I:$4!)%,2':!23!N29*,%!V!@$++!"2&%(!23!'*,!1*%,4!+'9!

$,%!]$("%,3!Q"$3:$,:!O2&%A.!!D3+4!6.[/0!'8!"#%!:$4R(!"'"$+!1*%,2%(!

$))%$,! 8,'&! /IXC;G!<#%,%$(! X.[0! '8! "#%! :$4R(! 1*%,2%(! $))%$,!

8,'&!^IV6_;.!!_%,#$)(!&',%!23"%,%("239+4G!"#%!,$"2'!'8!:2("237"!"'!

"'"$+!1*%,2%(!23!$!92>%3!#'*,!2(!3%$,+4!7'3("$3"!"#,'*9#'*"!"#%!:$4.!!

O#2(!(#'<(!"#$"!"#%!$>%,$9%!3*&=%,!'8!"2&%(!$!1*%,4!2(!,%)%$"%:!2(!

>2,"*$++4! 7'3("$3"! '>%,! "#%! #'*,(! 23! $! :$4G! ,%&$23239! 3%$,! -.V5!

<2"#!'3+4!$!6.V-!("$3:$,:!:%>2$"2'3.!

C+"#'*9#! "#%! $>%,$9%! ,%)%"2"2'3! '8! 1*%,2%(! ,%&$23(! 3%$,+4!

7'3("$3"G!<%!7$3!%W$&23%! "#2(! 23!9,%$"%,!:%"$2+!=4!&%$(*,239! "#%!

8,%1*%374! :2(",2=*"2'3! '8! 1*%,2%(! $"! >$,2'*(! #'*,(! 23! "#%! :$4G! $(!

(%%3! 23! N29*,%! -.! ! N,'&! "#2(! $3$+4(2(! 2"! 2(! 7+%$,! "#$"! "#%! >$("!

&$U',2"4!'8!1*%,2%(! 23!$3!#'*,!$))%$,!'3+4!'3%! "'!82>%! "2&%(!$3:!

"#$"! "#%(%! ,$,%! 1*%,2%(! 7'3(2("%3"+4! $77'*3"! 8',! +$,9%! )',"2'3(! '8!

"#%!"'"$+!1*%,4!>'+*&%!"#,'*9#'*"!"#%!7'*,(%!'8!"#%!:$4.!

.12345#6#

C+"#'*9#! <%! #$>%! (#'<3! "#$"! "#%! 1*%,4! :2(",2=*"2'3! :'%(! 3'"!

7#$39%! (*=("$3"2$++4! '>%,! "#%! 7'*,(%! '8! $! :$4G! "#2(! :'%(! 3'"!

),'>2:%!23(29#"!23"'!#'<!"#%!(%"(!'8!1*%,2%(!>$,4!8,'&!'3%!#'*,!"'!

"#%!3%W".! !O'!%W$&23%! "#2(G!<%!&%$(*,%! "#%!'>%,+$)!=%"<%%3! "#%!

(%"(!'8!1*%,2%(!%3"%,%:!:*,239!"#'(%!#'*,(.!!P%!*(%!",$:2"2'3$+!(%"!

$3:!=$9!'>%,+$)!&%$(*,%(!$(!92>%3!23!]1*$"2'3!V!$3:!]1*$"2'3!-G!

,%()%7"2>%+4.!!`2("237"!'>%,+$)!&%$(*,%(!"#%!(2&2+$,2"4!=%"<%%3!"#%!

(%"(!'8!*321*%!1*%,2%(!8,'&!%$7#!#'*,G!<#2+%!'>%,$++!@=$9A!'>%,+$)!

&%$(*,%(! "#%! (2&2+$,2"4! '8! "#%2,! 8,%1*%374! :2(",2=*"2'3(! =4!

237',)',$"239!"#%!3*&=%,!'8!"2&%(!%$7#!1*%,4!$))%$,(!23!$3!#'*,G!

Aa@ :AB 8.!!P#2+%!"#%(%!&%$(*,%(!%W$&23%!"#%!(2&2+$,2"4!'8!"#%!(%"(!

'8!1*%,2%(! ,%7%2>%:! 23!$3!#'*,!$3:! "#%!3*&=%,!'8! "2&%(! "#%4!$,%!

%3"%,%:G!"#%4!:'!3'"!237',)',$"%!"#%!,%+$"2>%!)')*+$,2"4!',!,$3L239!

'8! 1*%,2%(! <2"#23! "#%! 1*%,4! (%"(.! ! O'! %W$&23%! "#2(G! <%! $+('!

&%$(*,%! "#%!_%$,('3! 7',,%+$"2'3!'8! "#%!1*%,2%(R! 8,%1*%372%(.! !C(!

7$3!=%!(%%3!8,'&!]1*$"2'3!Y!@<#%,%! Aa@ :AB !2(!"#%!&%$3!3*&=%,!

'8! 1*%,4! ,%)%"2"2'3(! 23! )%,2':! C! $3:! Aa@ :AB* ! 2(! "#%! ("$3:$,:!

:%>2$"2'3!'8!$++!"#%!1*%,4!8,%1*%372%(!23!)%,2':!CAG!"#2(!&%$(*,%(!

"#%! :%9,%%! '8! +23%$,! 7',,%+$"2'3! =%"<%%3! "#%! 8,%1*%372%(! '8! "#%!

1*%,2%(! 23! %$7#! #'*,G! ('! "<'! #'*,(! "#$"! #$:! %W$7"+4! "#%! ($&%!

1*%,2%(! <2"#! %W$7"+4! "#%! ($&%! 8,%1*%372%(! <'*+:! #$>%! $!

7',,%+$"2'3! '8! '3%.! ! b'"%! "#$"! "#2(! 3',&$+2Z%(! 8',! "#%! %88%7"! '8!

:288%,239! 1*%,4! >'+*&%G! 2.%.G! "#%! 7',,%+$"2'3! '8! "<'! #'*,(! <2"#!

%W$7"+4!"#%!($&%!*3:%,+4239!1*%,4!:2(",2=*"2'3(!(2&)+4!(7$+%:!=4!$!

7'3("$3"!<'*+:!$+('!#$>%!$!7',,%+$"2'3!'8!'3%.!

.12345#7#

!"#$"%&'(")*+),-"#'(").'/01)23"#1)4#'++/$)'&)5'$6)7*3#

60

V0

-0

Y0

50

/0

X0

[0

\0

6 X V- V\

7*3#)*+).'1

!"#$"%&'(")*+).'/01)23"#1)4#'++/$

!"#$%&#'()*%+',-#$.#/

!"#$%&#'0./*.12*',-#$.#/

.458359:;#<1=>41?3>1@9#@A#B5C5:>5D#E@34=#A4@F#67G7HGI!

60

/0

V60

V/0

-60

-/0

Y60

Y/0

560

VG66VIV6G666

/6VIVG666

-6VI/66

V6VI-66

/VIV66

-XI/6

-VI-/

VXI-6

VVIV/

V6 ^ \ [ X / 5 Y - V

.458359:;#'J925=

K54:59>J25#@A#-@>JC#*35415=

V-C;IVC;

XC;I[C;

V-_;IV_;

X_;I[_;

323

Page 34: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Query Categories

!

!"

!"!"#$%&'()*+,-

!

"="#$% !

!"#$%&'()*+)),&-%&(.%)/0123$4)'5)6#127)81%-)52'9):'#2-);)

$(<)=)

!

!!!

!

!###

!#

$+=

!".

++

!.

+

".

+

!".

++

+++

+

!./"./!./"./

!./"./

!"#$%&'()""&$"#&$'()$"&$"&$

""&$"#&$'()$

"#$

!

!"#$%&'()>+))/012$33)/0123$4)'5)6#127)81%-)52'9):'#2-);)$(<)

=)

!

"&$"&$

*

#

""&$"&$"$"&$"&$$*

*

!./"./

0

+

++

!",,

!./!./"./"./0

&

!=

$$$

= !

!"#$%&'()?+))@1$2-'()A'2213$%&'()'5)6#127)B21"#1(.&1-)52'9)

:'#2-);)$(<)=)

!

B&C#21)?))

+)! ,(-./0! 1! 20! 034'()0! 560! 470/4-0! 80708! 9:! 970/84;! 4)<!

=9//0845(9)!>05200)!560!?.0/@!A05A!/0=0(70<!<./()-!560!A4'0!69./!

:9/!04=6!<4@!970/!9./!200B%!!CA!'04A./()-!970/84;!970/!560!A05!9:!

488! ?.0/(0A! 4;;04/()-! ()! 9./! 200B! 29.8<! >0! =9';.545(9)488@!

03;0)A(70#!20!.A0!560!A05!9:!488!560!50)A!9:!'(88(9)A!9:!?.0/(0A!()!

560!<4@!4:50/!9./!A070)D<4@!;0/(9<!4A!4)!()<0;0)<0)5!A4';80!4)<!

'04A./0!970/84;!45!04=6!69./!()!9./!200B!9:!560!?.0/(0A!'45=6()-!

569A0! ()! 5645! A4';80%! ! C8569.-6! 20! ;/07(9.A8@! A42! 5645! 560!

:/0?.0)=@! <(A5/(>.5(9)! 9:! ?.0/(0A! <90A! )95! A.>A54)5(488@! =64)-0!

4=/9AA!69./A!9:!560!<4@#!,(-./0!1!A692A!5645!560!A('(84/(5@!>05200)!

560!4=5.48!?.0/(0A!5645!4/0!/0=0(70<!<./()-!04=6!69./!<90A!()!:4=5!

=64)-0%! ! E6(A! 5/0)<! A00'A! 59! :98892! ?.0/@! 798.'0#! 26(=6! (A!

4;;4/0)5! (:!20!A9/5! 560!A4'0!970/84;!<454!>@!?.0/@!798.'0!4A! (A!

<9)0!()!,(-./0!F%!!G804/8@#!4A!?.0/@!798.'0!()=/04A0A!560!?.0/(0A!

5645! =9';9A0! 5645! 5/4::(=! 4/0! '9/0! 8(B08@! 59! >0! A('(84/! 4=/9AA!

A4';80A!9:!569A0!;04B!5('0!;0/(9<A%!

E6(A!:()<()-!(A!=9)A(A50)5!2(56!;/(9/!4)48@A0A!9:!20>!?.0/@!=4=60A!

A692()-! 560@! A(-)(:(=4)58@! (';/970! ;0/:9/'4)=0! .)<0/! 6047@!

894<%! ! E60! '9/0! /0<.)<4)=@! 560@! 4/0! 4>80! 59! <050=5#! 560! '9/0!

=4=6()-!48-9/(56'A!4/0!4>80!59!0)64)=0!56/9.-6;.5%!!C8569.-6!560!

;/(9/! 29/B! ;/('4/(8@! '04A./0A! 560! 0::0=5! 9:! 56(A! /0<.)<4)=@! ()!

=4=60!;0/:9/'4)=0#! (5! (A!9>7(9.A! 5645! /0<.)<4)=@!'.A5!03(A5! 4)<!

>0! <050=50<! :9/! =4=6()-! 59! A.==00<%! ! H@! 034'()()-! 560! 970/488!

?.0/@! A5/04'! >@! 69./! 20! 4/0! 4>80! 59! ():0/! 560! 0::0=5(70)0AA! 9:!

-0)0/48!=4=6()-!48-9/(56'A!45!569A0!5('0A%!!!

B&C#21)D)

DE) 6F!GH)A;I!J/GK!8)+)! I0=5(9)! 1! 20! 4)48@J0<! 560! 0)5(/0! ?.0/@! 89-%! ! K92070/#! 56(A!

>84)B05!7(02!9:!560!?.0/@!5/4::(=!<90A!)95!;/97(<0!()A(-65!()59!560!

=64/4=50/(A5(=A! 9:! ;4/5(=.84/! =450-9/(0A! 9:! ?.0/(0A! 5645! '(-65! >0!

03;89(50<!:9/!0)64)=0<!0::(=(0)=@!9/!0::0=5(70)0AA%!!,9/!034';80#!4!

A04/=6!;/97(<0/!269! /05./)A! A;0=(48(J0<! /0A.85A! :9/!0)50/54()'0)5!

?.0/(0A!=4))95!<050/'()0!:/9'!-0)0/48!?.0/@!5/4::(=!489)0!260560/!

4! -(70)! ?.0/@! (A! '9/0! 8(B08@! 59! >0! /0:0//()-! 59! 0)50/54()'0)5!

/08450<!=9)50)5!9/!692!59!>0A5!;/9=0AA!4)<!=4=60!5645!?.0/@%!

E60!/0'4()<0/!9:!9./!4)48@A(A!:9=.A0A!9)!5/0)<A!/0845()-!59!59;(=48!

=450-9/@! 9:! ?.0/(0A%! ! L./! ?.0/@! A05! (A! =450-9/(J0<! A(';8@! >@!

034=58@!'45=6()-!?.0/(0A!59!9)0!9:!560!8(A5A!=9//0A;9)<()-!59!04=6!

=450-9/@%! ! E60A0! 8(A5A! 4/0! '4).488@! =9)A5/.=50<! >@! 0<(59/A! 269!

=450-9/(J0!/048!.A0/AM!?.0/(0A#!-0)0/450!8(B08@!?.0/(0A#!4)<!(';9/5!

8(A5A!9:!;6/4A0A!8(B08@!59!>0!?.0/(0A!()!4!=450-9/@!$0%-%#!=(5(0A!()!560!

NI! :9/! 560!NI!I(50A! =450-9/@"%! !O.0/(0A! 5645!'45=6! 45! 804A5! 9)0!

=450-9/@! 8(A5! =9';/(A0!*1P!9:! 560! 59548!?.0/@! 5/4::(=!9)!470/4-0%!!

E6(A!/0;/0A0)5A!'(88(9)A!9:!?.0/(0A!;0/!<4@%!

B&C#21)L)

E9! 70/(:@! 5645! 9./! <0:()0<! =450-9/@! 8(A5A! A.::(=(0)58@! =970/! 560!

59;(=A! ()! 560! ?.0/@! A5/04'#! 20! '4).488@! =84AA(:(0<! 4! /4)<9'!

A4';80! 9:! ?.0/(0A#! 4AA(-)()-! 560'! 59! QL560/R! (:! 560@! <(<! )95!

()5.(5(708@!:(5!()59!4)!03(A5()-!=450-9/@#!4A!=4)!>0!A00)!()!,(-./0!S%!!

E9! <050/'()0! 560! ).'>0/! 9:! ?.0/(0A! /0?.(/0<! 59! 4=6(070! 4!

/0;/0A0)545(70! A4';80#!20! =48=.8450! 560!)0=0AA4/@! A4';80! A(J0! ()!

?.0/(0A#!,,121345%567&5#!260/0!4!(A!560!=9):(<0)=0!80708!748.0#!%!(A!

8'2%1<);012$C1)/0123$4)AM$2$.%12&-%&.-)52'9)*N>NOD)%M$%)P$%.M1<)!$.M):'#2

*TP

*SP

UTP

USP

1TP

1SP

FTP

FSP

STP

SSP

VTP

S V F W 1 U X * Y *T T ** *U *1 *F U1 *S *V *W *X UU *Y U* UT

:'#2)'5),$7

@12.1(%$C1

L70/84;

Z(A5()=5!L70/84;

[04/A9)

8$9431<)A$%1C'2&Q1<)6#127)8%21$9)=21$R<'S(

[0/A9)48!

,()4)=0

1P

G9';.5()-

YP

\0A04/=6!]!

^04/)

YP

_)50/54()'0)5

*1P

4̀'0A

SP

K98(<4@A

*P

K9'0

SP

NI!I(50A

1P

[9/)

*TP

I69;;()-

*1P

I;9/5A

1P

E/4708

SP

L560/

*VP

K04856

SP

;012$C1)/0123$4)AM$2$.%12&-%&.-)'5)P$%.M&(C)6#12&1-)52'9)*N>NOD

*TP

*SP

UTP

USP

1TP

1SP

FTP

FSP

STP

SSP

T V *U *X:'#2)'5),$7

L70/84;Z(A5()=5!L70/84;[04/A9)

324

Page 35: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Hourly Categories

!

"#$!%&'()$! %"&*+&,+!+$-.&"./*0!&*+!!! .%! "#$!$,,/,!,&"$1!23!%$"".*4!/5,! 6/*7.+$*6$! )$-$)! "/! 889! &*+! $,,/,! ,&"$! "/! :90!;$! ,$<5.,$! &!

%&'()$!/7!=>>!<5$,.$%1!!?#$!,$)&".-$!($,6$*"&4$%!7/,!$&6#!6&"$4/,3!

/7! "#$! &((,/@.'&"$)3! AB9! /7! <5$,3! -/)5'$! "#&"! '&"6#! &*3!

6&"$4/,3!).%"!/-$,!/5,!;$$C!D%$$!E.45,$!8F!&,$!;."#.*!"#$!$,,/,!,&"$!

/7!"#/%$!7,/'!/5,!'&*5&))3!6&"$4/,.G$+!%&'()$1!!?#.%!%#/;%!"#&"!

/5,!).%"%!&,$!&!,$&%/*&H)$!,$(,$%$*"&"./*!/7!"#$%$!"/(.6&)!6&"$4/,.$%1!

I$!7/65%!/*!&!%5H%$"!/7!"#$%$!6&"$4/,.$%!&*+!$@&'.*$!'5%.6!&*+!

'/-.$%! .*+$($*+$*"!/7!/"#$,!$*"$,"&.*'$*"!<5$,.$%1! !?#$!,$)&".-$!

%.G$!/7!$&6#!6&"$4/,3!).%"!;$!5%$+!.%!4.-$*!.*!E.45,$!=1!!JH-./5%)30!

*/"! &))! <5$,.$%! ).%"$+! &6"5&))3! '&"6#! "#/%$! $*"$,$+! H3! 5%$,%0!

$%($6.&))3! ;#$*! "#$! 6&"$4/,3! 6/*"&.*%! )&,4$! .'(/,"$+! ).%"%! /7!

(#,&%$%1!!

!"#$%&'('

K)"#/54#!;$!#&-$!%#/;*!"#&"!/5,!).%"%!&,$!&!7&.,!,$(,$%$*"&"./*!/7!

"#$!"/(.6%!.*!"#$!<5$,3!%",$&'0!"#.%!+/$%!*/"!.*+.6&"$!;#&"!(/,"./*!

/7! "#$! 7,$<5$*63! +.%",.H5"./*! /7! "#&"! %",$&'! "#$3! ,$(,$%$*"1! ! ?/!

+$"$,'.*$! "#.%0! ;$! '$&%5,$+! "#$! &-$,&4$! (,/(/,"./*! /7! <5$,.$%!

'&"6#.*4!&*3!6&"$4/,3!).%"!"#&"!&(($&,!&"!-&,./5%!7,$<5$*6.$%!$&6#!

#/5,!&*+!6/'(&,$+!"#$'!"/!"#$!&-$,&4$!/-$,&))!#/5,)3!7,$<5$*63!

+.%",.H5"./*!/7! "#$! <5$,3! %",$&'! D%$$! E.45,$! LF1! !M*%5,(,.%.*4)30!

"#.%!6/'(&,.%/*!%#/;%!"#&"!<5$,.$%!.*!"#$!6&"$4/,3!).%"%!,$(,$%$*"!

'/,$!(/(5)&,0!,$($&"$+!<5$,.$%!"#&*!&-$,&4$0!&)"#/54#!"#$!4$*$,&)!

%#&($!/7!"#$!+.%",.H5"./*%!.%!%.'.)&,1!

!"#$%&')'

*+,' -%&./0'".'123&#4%5'647$82%"35'I$! H$4.*! /5,! "$'(/,&)! &*&)3%.%! /7! "/(.6&)! 6&"$4/,.$%! H3!

'$&%5,.*4!"#$.,!,$)&".-$!(/(5)&,."3!/-$,!"#$!#/5,%!.*!&!+&31!!E.,%"0!

;$!$@&'.*$!"#$!($,6$*"!/7!"/"&)!<5$,3!-/)5'$!'&"6#.*4!&!%$)$6"$+!

4,/5(!/7!6&"$4/,3!).%"%0!&%!6&*!H$!%$$*!.*!E.45,$!N1!!O"!.%!6)$&,!"#&"!

+.77$,$*"!"/(.6&)!6&"$4/,.$%!&,$!'/,$!&*+!)$%%!(/(5)&,!&"!+.77$,$*"!

".'$%!/7! "#$!+&31! !P$,%/*&)! 7.*&*6$0! 7/,! $@&'()$0! H$6/'$%!'/,$!

(/(5)&,!7,/'!LQA>KR0!;#.)$!'5%.6!<5$,.$%!H$6/'$!)$%%!(/(5)&,1!!

K)"#/54#!."!.%!+.77.65)"!"/!6/'(&,$!"#$!,$)&".-$!)$-$)!/7!(/(5)&,."3!

%#.7"!7,/'!/*$!6&"$4/,3!"/!&*/"#$,!+5$!"/!"#$!+.77$,$*6$%!.*!%6&)$!

/7! $&6#! /7! "#$.,! ($,6$*"&4$%! /7! "#$! <5$,3! %",$&'0! ."! .%! 6)$&,! "#&"!

%/'$!6&"$4/,.$%S!(/(5)&,."3! 6#&*4$%!'/,$!+,&%".6&))3! "#,/54#/5"!

"#$!+&3!"#&*!/"#$,%1!!!

!"#$%&'9''

O*! /,+$,! "/! <5&*".73! "#.%0! ;$! 6&)65)&"$+! "#$! TUQ+.-$,4$*6$!

DV<5&"./*!WF! H$";$$*! "#$! ).C$).#//+!/7! ,$6$.-.*4! &*3!<5$,3! &"! &!

(&,".65)&,! ".'$! &*+! "#$! ).C$).#//+! /7! ,$6$.-.*4! &! <5$,3! .*! &!

(&,".65)&,!6&"$4/,30!&%!6&*!H$!%$$*!.*!E.45,$!81! !?#.%!,$-$&)%!"#&"!

"#$! "/(! "#,$$! 6&"$4/,.$%! .*! "$,'%! /7! (/(5)&,."3! &,$! (/,*/4,&(#30!

$*"$,"&.*'$*"0!&*+!'5%.61!

F0XD

FXD)/4FXDFF0XDFXDD

!"#$

!#$!#$!"#$!#$%

#

!= !

:;$23"4.'*<''=>?@"A&%#&.B&'4C'D$&%5'EBB$%%&.B&'>"F&8"G44/'

C4%'123&#4%5'" '2./'-4328'H3%&2I'23'-"I&' ! '

!"#$%&'J'

Y/'(&,.*4! "#$%$! +.-$,4$*6$%! "/! "#$! (,/(/,"./*! /7! 6&"$4/,.G$+!

<5$,.$%! .*! $&6#! 6&"$4/,3! .*! E.45,$! =! <5.6C)3! .))5%",&"$%! "#&"!

+.-$,4$*6$! .%! */"! 6/,,$)&"$+! ;."#! "#$! *5'H$,! /7! <5$,.$%!

6&"$4/,.G$+! .*! $&6#! 6&"$4/,31! ! K)%/! %#/;*! .*! E.45,$! 8! .%! "#$!

&-$,&4$!($,6$*"&4$!/7!"#$!$*".,$!<5$,3!-/)5'$!&*+!+.%".*6"!<5$,.$%!

"#&"!'&"6#!$&6#!6&"$4/,31!!K)"#/54#!"#$!6&"$4/,.$%!"#&"!6/-$,!"#$!

)&,4$%"! (/,"./*%! /7! "#$! <5$,3! %",$&'! &)%/! #&-$! "#$!'/%"! ,$)&".-$!

(/(5)&,."3! 7)56"5&"./*0! "#.%! 6/,,$)&"./*! +/$%! */"! 6/*".*5$!

"#,/54#/5"!&))!6&"$4/,.$%1!

K&823"A&'6&%B&.32#&'4C'123&#4%"L&/'D$&%"&0

>9

:9

A>9

A:9

Z>9

Z:9

B>9

B:9

W>9

W:9

:>9

[#/((.*4

Y/'(5".*4

?,&-$)

\/'$

\$&)"#

]/-$,*'$*"

]&'$%

^$%$&,6#!_!U$&,*.*4P/,*

\/).+&3%

[(/,"%

R/-.$%

P$,%/*&)!E.*&*6$

V*"$,"&.*'$*"

M[![."$%

R5%.6

P$,6$*"&4$!/7!Y&"$4/,.G$+!`5$,.$%

M4$%85'!%&;$&.B5'@"03%"N$3"4.'4C'O23BG".#'D$&%"&0'A0+'P88'D$&%"&0'PA&%2#&/'4A&%'

)'@250'2./',('123&#4%"&0

>9

:9

A>9

A:9

Z>9

Z:9

B>9

B:9

a!A0>>>

Z>AQ:>>

:AQA>>

ZAQZ:

AAQA: 8 L : B A

E,$<5$*63!^&*4$%

P$,6$*"&4$!/7!K-$,&4$!?/"&

R&"6#.*4!`5$,.$%

K-41!R&"6#.*4!`5$,.$%

K-41!`5$,.$%

123&#4%5'6&%B&.32#&'4C':.3"%&'D$&%5'H3%&2I'2./'@"A&%#&.B&'C%4I'

>"F&8"G44/'4C'2.5'D$&%5'23':2BG'M4$%

>9

A9

Z9

B9

W9

:9

=9

14I7$3".#

H74%30

M48"/250

K&0&2%BG'2./'>&2%.".#

M&283G

Q2I&0

RH'H"3&0

HG477".#

Q4A&%.I&.3

O4A"&0

-%2A&8

6&%04.28'!".2.B&

M4I&

O$0"B

:.3&%32".I&.3

64%.

123&#4%5

TUQb.-$,4$*6$

9!/7!<5$,3!%",$&'

b.%".*6"!9!/7!<5$,3!%",$&'

123&#4%"B28'6&%B&.3'4A&%'-"I&

>9

A9

Z9

B9

W9

A Z B W : = L N 8 A> AA AZ AB AW A: A= AL AN A8 Z> ZA ZZ ZB ZW

\/5,!/7!b&3

V*"$,"&.*'$*"]&'$%\$&)"#P$,%/*&)!E.*&*6$[#/((.*4R5%.6M[[."$%P/,*

325

Page 36: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

The Real Popular Categories

!

"#$!%&'()$! %"&*+&,+!+$-.&"./*0!&*+!!! .%! "#$!$,,/,!,&"$1!23!%$"".*4!/5,! 6/*7.+$*6$! )$-$)! "/! 889! &*+! $,,/,! ,&"$! "/! :90!;$! ,$<5.,$! &!

%&'()$!/7!=>>!<5$,.$%1!!?#$!,$)&".-$!($,6$*"&4$%!7/,!$&6#!6&"$4/,3!

/7! "#$! &((,/@.'&"$)3! AB9! /7! <5$,3! -/)5'$! "#&"! '&"6#! &*3!

6&"$4/,3!).%"!/-$,!/5,!;$$C!D%$$!E.45,$!8F!&,$!;."#.*!"#$!$,,/,!,&"$!

/7!"#/%$!7,/'!/5,!'&*5&))3!6&"$4/,.G$+!%&'()$1!!?#.%!%#/;%!"#&"!

/5,!).%"%!&,$!&!,$&%/*&H)$!,$(,$%$*"&"./*!/7!"#$%$!"/(.6&)!6&"$4/,.$%1!

I$!7/65%!/*!&!%5H%$"!/7!"#$%$!6&"$4/,.$%!&*+!$@&'.*$!'5%.6!&*+!

'/-.$%! .*+$($*+$*"!/7!/"#$,!$*"$,"&.*'$*"!<5$,.$%1! !?#$!,$)&".-$!

%.G$!/7!$&6#!6&"$4/,3!).%"!;$!5%$+!.%!4.-$*!.*!E.45,$!=1!!JH-./5%)30!

*/"! &))! <5$,.$%! ).%"$+! &6"5&))3! '&"6#! "#/%$! $*"$,$+! H3! 5%$,%0!

$%($6.&))3! ;#$*! "#$! 6&"$4/,3! 6/*"&.*%! )&,4$! .'(/,"$+! ).%"%! /7!

(#,&%$%1!!

!"#$%&'('

K)"#/54#!;$!#&-$!%#/;*!"#&"!/5,!).%"%!&,$!&!7&.,!,$(,$%$*"&"./*!/7!

"#$!"/(.6%!.*!"#$!<5$,3!%",$&'0!"#.%!+/$%!*/"!.*+.6&"$!;#&"!(/,"./*!

/7! "#$! 7,$<5$*63! +.%",.H5"./*! /7! "#&"! %",$&'! "#$3! ,$(,$%$*"1! ! ?/!

+$"$,'.*$! "#.%0! ;$! '$&%5,$+! "#$! &-$,&4$! (,/(/,"./*! /7! <5$,.$%!

'&"6#.*4!&*3!6&"$4/,3!).%"!"#&"!&(($&,!&"!-&,./5%!7,$<5$*6.$%!$&6#!

#/5,!&*+!6/'(&,$+!"#$'!"/!"#$!&-$,&4$!/-$,&))!#/5,)3!7,$<5$*63!

+.%",.H5"./*!/7! "#$! <5$,3! %",$&'! D%$$! E.45,$! LF1! !M*%5,(,.%.*4)30!

"#.%!6/'(&,.%/*!%#/;%!"#&"!<5$,.$%!.*!"#$!6&"$4/,3!).%"%!,$(,$%$*"!

'/,$!(/(5)&,0!,$($&"$+!<5$,.$%!"#&*!&-$,&4$0!&)"#/54#!"#$!4$*$,&)!

%#&($!/7!"#$!+.%",.H5"./*%!.%!%.'.)&,1!

!"#$%&')'

*+,' -%&./0'".'123&#4%5'647$82%"35'I$! H$4.*! /5,! "$'(/,&)! &*&)3%.%! /7! "/(.6&)! 6&"$4/,.$%! H3!

'$&%5,.*4!"#$.,!,$)&".-$!(/(5)&,."3!/-$,!"#$!#/5,%!.*!&!+&31!!E.,%"0!

;$!$@&'.*$!"#$!($,6$*"!/7!"/"&)!<5$,3!-/)5'$!'&"6#.*4!&!%$)$6"$+!

4,/5(!/7!6&"$4/,3!).%"%0!&%!6&*!H$!%$$*!.*!E.45,$!N1!!O"!.%!6)$&,!"#&"!

+.77$,$*"!"/(.6&)!6&"$4/,.$%!&,$!'/,$!&*+!)$%%!(/(5)&,!&"!+.77$,$*"!

".'$%!/7! "#$!+&31! !P$,%/*&)! 7.*&*6$0! 7/,! $@&'()$0! H$6/'$%!'/,$!

(/(5)&,!7,/'!LQA>KR0!;#.)$!'5%.6!<5$,.$%!H$6/'$!)$%%!(/(5)&,1!!

K)"#/54#!."!.%!+.77.65)"!"/!6/'(&,$!"#$!,$)&".-$!)$-$)!/7!(/(5)&,."3!

%#.7"!7,/'!/*$!6&"$4/,3!"/!&*/"#$,!+5$!"/!"#$!+.77$,$*6$%!.*!%6&)$!

/7! $&6#! /7! "#$.,! ($,6$*"&4$%! /7! "#$! <5$,3! %",$&'0! ."! .%! 6)$&,! "#&"!

%/'$!6&"$4/,.$%S!(/(5)&,."3! 6#&*4$%!'/,$!+,&%".6&))3! "#,/54#/5"!

"#$!+&3!"#&*!/"#$,%1!!!

!"#$%&'9''

O*! /,+$,! "/! <5&*".73! "#.%0! ;$! 6&)65)&"$+! "#$! TUQ+.-$,4$*6$!

DV<5&"./*!WF! H$";$$*! "#$! ).C$).#//+!/7! ,$6$.-.*4! &*3!<5$,3! &"! &!

(&,".65)&,! ".'$! &*+! "#$! ).C$).#//+! /7! ,$6$.-.*4! &! <5$,3! .*! &!

(&,".65)&,!6&"$4/,30!&%!6&*!H$!%$$*!.*!E.45,$!81! !?#.%!,$-$&)%!"#&"!

"#$! "/(! "#,$$! 6&"$4/,.$%! .*! "$,'%! /7! (/(5)&,."3! &,$! (/,*/4,&(#30!

$*"$,"&.*'$*"0!&*+!'5%.61!

F0XD

FXD)/4FXDFF0XDFXDD

!"#$

!#$!#$!"#$!#$%

#

!= !

:;$23"4.'*<''=>?@"A&%#&.B&'4C'D$&%5'EBB$%%&.B&'>"F&8"G44/'

C4%'123&#4%5'" '2./'-4328'H3%&2I'23'-"I&' ! '

!"#$%&'J'

Y/'(&,.*4! "#$%$! +.-$,4$*6$%! "/! "#$! (,/(/,"./*! /7! 6&"$4/,.G$+!

<5$,.$%! .*! $&6#! 6&"$4/,3! .*! E.45,$! =! <5.6C)3! .))5%",&"$%! "#&"!

+.-$,4$*6$! .%! */"! 6/,,$)&"$+! ;."#! "#$! *5'H$,! /7! <5$,.$%!

6&"$4/,.G$+! .*! $&6#! 6&"$4/,31! ! K)%/! %#/;*! .*! E.45,$! 8! .%! "#$!

&-$,&4$!($,6$*"&4$!/7!"#$!$*".,$!<5$,3!-/)5'$!&*+!+.%".*6"!<5$,.$%!

"#&"!'&"6#!$&6#!6&"$4/,31!!K)"#/54#!"#$!6&"$4/,.$%!"#&"!6/-$,!"#$!

)&,4$%"! (/,"./*%! /7! "#$! <5$,3! %",$&'! &)%/! #&-$! "#$!'/%"! ,$)&".-$!

(/(5)&,."3! 7)56"5&"./*0! "#.%! 6/,,$)&"./*! +/$%! */"! 6/*".*5$!

"#,/54#/5"!&))!6&"$4/,.$%1!

K&823"A&'6&%B&.32#&'4C'123&#4%"L&/'D$&%"&0

>9

:9

A>9

A:9

Z>9

Z:9

B>9

B:9

W>9

W:9

:>9

[#/((.*4

Y/'(5".*4

?,&-$)

\/'$

\$&)"#

]/-$,*'$*"

]&'$%

^$%$&,6#!_!U$&,*.*4P/,*

\/).+&3%

[(/,"%

R/-.$%

P$,%/*&)!E.*&*6$

V*"$,"&.*'$*"

M[![."$%

R5%.6

P$,6$*"&4$!/7!Y&"$4/,.G$+!`5$,.$%

M4$%85'!%&;$&.B5'@"03%"N$3"4.'4C'O23BG".#'D$&%"&0'A0+'P88'D$&%"&0'PA&%2#&/'4A&%'

)'@250'2./',('123&#4%"&0

>9

:9

A>9

A:9

Z>9

Z:9

B>9

B:9

a!A0>>>

Z>AQ:>>

:AQA>>

ZAQZ:

AAQA: 8 L : B A

E,$<5$*63!^&*4$%

P$,6$*"&4$!/7!K-$,&4$!?/"&

R&"6#.*4!`5$,.$%

K-41!R&"6#.*4!`5$,.$%

K-41!`5$,.$%

123&#4%5'6&%B&.32#&'4C':.3"%&'D$&%5'H3%&2I'2./'@"A&%#&.B&'C%4I'

>"F&8"G44/'4C'2.5'D$&%5'23':2BG'M4$%

>9

A9

Z9

B9

W9

:9

=9

14I7$3".#

H74%30

M48"/250

K&0&2%BG'2./'>&2%.".#

M&283G

Q2I&0

RH'H"3&0

HG477".#

Q4A&%.I&.3

O4A"&0

-%2A&8

6&%04.28'!".2.B&

M4I&

O$0"B

:.3&%32".I&.3

64%.

123&#4%5

TUQb.-$,4$*6$

9!/7!<5$,3!%",$&'

b.%".*6"!9!/7!<5$,3!%",$&'

123&#4%"B28'6&%B&.3'4A&%'-"I&

>9

A9

Z9

B9

W9

A Z B W : = L N 8 A> AA AZ AB AW A: A= AL AN A8 Z> ZA ZZ ZB ZW

\/5,!/7!b&3

V*"$,"&.*'$*"]&'$%\$&)"#P$,%/*&)!E.*&*6$[#/((.*4R5%.6M[[."$%P/,*

325

Page 37: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

What’s the OVERALL Distribution of Queries?Boosting the Performance of Web Search Engines • 59

Fig. 2. (a) Number of submissions of the most popular queries contained in the three query logs(log-log scale). (b) Distances (in number of queries) between subsequent submissions of the samequery.

between them. The distance is measured in number of queries. In particular, foreach distance d, we plotted the cumulative number of repeated queries occur-ring at a distance less than or equal to d . The results were encouraging: in morethan 350,000 cases the distance between successive submissions of the samequery was less than 100 in the Tiscali log; this distance is slightly smaller in

ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

Page 38: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Distance Among Repetitions

Boosting the Performance of Web Search Engines • 59

Fig. 2. (a) Number of submissions of the most popular queries contained in the three query logs(log-log scale). (b) Distances (in number of queries) between subsequent submissions of the samequery.

between them. The distance is measured in number of queries. In particular, foreach distance d, we plotted the cumulative number of repeated queries occur-ring at a distance less than or equal to d . The results were encouraging: in morethan 350,000 cases the distance between successive submissions of the samequery was less than 100 in the Tiscali log; this distance is slightly smaller in

ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

Page 39: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Page Request BreakdownBoosting the Performance of Web Search Engines • 61

Fig. 3. Probability of the occurrence of a request for the ith page given a previous request for page(i ! 1).

query is submitted, it is necessary to browse many pages in order to find therelevant information.

3.3 Theoretical Upper Bounds on the Cache Hit Ratio

From the data reported in Table I it is easy to devise the theoretical upperbounds on the hit ratios achievable on the three query logs when prefetching isnot used. To this end, let us suppose the availability of a cache of infinite size,so that only compulsory misses have to be taken into account, that is, cachemisses corresponding to the first reference to each distinct query. The rate ofcompulsory misses is thus

m = DQ

, (1)

where D is the number of distinct queries, and Q is the total number of queriesin the log. The value m is the minimum miss ratio, while the maximum hitratio, H, is obviously given by

H = 1 ! m = 1 ! DQ

. (2)

The analysis becomes a little more complicated when we introduce prefetch-ing [Lempel and Moran 2003]. It is worth recalling that a general user query hasthe form (keywords, page no), and that, for this analysis, we are interested inconsidering queries characterized by identical keywords and distinct page no.In particular, if we retrieve each time k successive pages of results starting fromthe one requested by the user (hereinafter we will call k the prefetching factor),we have to distinguish among queries requesting pages in the first block of kpages (1 " page no " k), in the second block of k pages (k+1 " page no " 2 · k),

ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

Page 40: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Distance Among Repetitions74 • T. Fagni et al.

Fig. 11. Distances, expressed as the number of distinct queries received, between successive sub-missions of each one of the 128,000 most popular queries present in the Alta Vista log.

We conducted some tests on the Alta Vista log aimed at estimating how manytimes a frequent query q should be evicted from a pure LRU cache because thedistance between two successive requests for q (in terms of distinct queries) isgreater than the size of the cache itself. To this end, we extracted the 128,000most popular queries present in the log, and we measured the number of dis-tinct queries received by Alta Vista in the interval between one submissionand the next submission of each of these popular queries. Figure 11 plots thecumulative number of occurrences of each distance, measured as the numberof distinct queries received by Alta Vista in the interval between two succes-sive submissions of each frequent query. As expected, many popular querieswere resubmitted after long intervals. These queries would surely cause cachemisses in an LRU cache having a size smaller than the distance plotted on thex axis. In particular, we measured that 698,852 repetitions of the most pop-ular queries occurred at distances less than 128,000, while 164,813 occurredat distances greater than 128,000.2 Thus, the adoption of an LRU cache of128,000 blocks should surely result in 164,813 misses for these most popu-lar queries. The distribution of the distances plotted in Figure 11, thus ex-plains the higher hit ratio achieved by our hybrid caching policy with respectto a purely dynamic policy like LRU which is based on recency of referencesonly.

2It is worth noting that due to the log-scale, in Figure 11, the black area corresponding to x !128,000 appears to be very small. It is instead about a quarter of the one corresponding tox <128,000.

ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

Page 41: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

Stability and FreshnessBoosting the Performance of Web Search Engines • 73

Fig. 10. Static set hit ratio on the Alta Vista query log as a function of time. The size of the staticset has been set to 128,000 elements, while the training time was varied between 1 h and 2 days.

exploits also the locality of these “burst” queries due to the presence of its dy-namic set.

5.3.2 Daily Patterns. From Figure 10 we can see that all the curves plottedhave a common behavior: there is a sort of periodicity in the hit ratio that resultsin a smooth sinusoidal trend of the curves.

Around each 24 h the hit ratio reaches a local peak and starts to degradeuntil a local minimum is reached after about 12 h. We looked at the query logand we observed that the peak time was around midnight. This particular trenddemonstrates the presence in the log of groups of frequent queries which aremore or less popular in specific hours of each day. The benefits to be derivedfrom the exploitation of the knowledge of such daily patterns merits furtherinvestigation. One promising source in this regard are the results presented inBeitzel et al. [2004].

5.4 Effectiveness of the Static Set

The rationale of introducing our hybrid, static and dynamic, cache, relies on thehypothesis that some popular queries may not be requested for relatively longtime intervals, and might be unprofitably evicted from a purely dynamic cache.On the other hand, we saw that there are queries that are popular only withinrelatively short time intervals, and thus do not benefit from the use of a purelystatic cache filled with most popular queries, but can be effectively handled bythe dynamic portion of our cache.

ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

Page 42: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione4.pdf · 2008-11-05 · User #17988817 searched on 'serial killers' at 2006-04-26 20:26:23 User #17988817 searched

The Lesson is Over