referensi TF-IDF



Transcript of referensi TF-IDF

  • 7/24/2019 referensi TF-IDF

    1/37

    Prasad L08VSM-tfd 1

    Vector Space Model : TF - IDF

    Adapted from Lectures by

    Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

  • 7/24/2019 referensi TF-IDF

    2/37

    Recap of the last lecture

    Collection and vocabulary statistics: Heaps' and Zipf's laws

    Dictionary compression for Boolean indexes: dictionary string, blocks, front coding

    Postings compression: gap encoding using prefix-unique codes

    Variable-byte and gamma codes


  • 7/24/2019 referensi TF-IDF

    3/37

    This lecture: Sections 6.2-6.4.3

    Scoring documents

    Term frequency

    Collection statistics

    Weighting schemes

    Vector space scoring


  • 7/24/2019 referensi TF-IDF

    4/37

    Ranked retrieval

    Thus far, our queries have all been Boolean. Documents either match or don't.

    Good for expert users with a precise understanding of their needs and the collection (e.g., library search).

    Also good for applications: applications can easily consume 1000s of results.

    Not good for the majority of users.

  • 7/24/2019 referensi TF-IDF

    5/37

    Problem with Boolean search: feast or famine

    Boolean queries often result in either too few (= 0) or too many (1000s) results.

    Query 1: "standard user dlink 650" → 200,000 hits

    Query 2: "standard user dlink 650 no card found": 0 hits

    It takes skill to come up with a query that produces a manageable number of hits.

    With a ranked list of documents, it does not matter how large the retrieved set is.


  • 7/24/2019 referensi TF-IDF

    6/37

    Scoring as the basis of ranked retrieval

    We wish to return, in order, the documents most likely to be useful to the searcher.

    How can we rank-order the documents in the collection with respect to a query?

    Assign a score, say in [0, 1], to each document.

    This score measures how well document and query "match".


  • 7/24/2019 referensi TF-IDF

    7/37

    Query-document matching scores

    We need a way of assigning a score to a query/document pair.

    Let's start with a one-term query.

    If the query term does not occur in the document: the score should be 0.

    The more frequent the query term in the document, the higher the score (should be).

    We will look at a number of alternatives for this.


  • 7/24/2019 referensi TF-IDF

    8/37

    Take 1: Jaccard coefficient

    Recall: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B.

  • 7/24/2019 referensi TF-IDF

    9/37

    Jaccard coefficient: scoring example

    What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?

    Query: ides of march

    Document 1: caesar died in march

    Document 2: the long march
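The scoring example can be worked through with a minimal sketch, assuming the standard definition of the Jaccard coefficient, |A ∩ B| / |A ∪ B|, and whitespace tokenization. The function name `jaccard` and the choice to return 0.0 for two empty sets are illustrative assumptions, not part of the lecture.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets.

    Returning 0.0 when both sets are empty is an assumption made here
    to avoid dividing by zero.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Query and documents from the scoring example, split on whitespace.
query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

score1 = jaccard(query, doc1)  # 1 shared term, 6 distinct terms overall
score2 = jaccard(query, doc2)  # 1 shared term, 5 distinct terms overall
print(score1, score2)
```

Both documents share only the term "march" with the query, so the shorter document gets the higher score purely because its union with the query is smaller.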


  • 7/24/2019 referensi TF-IDF

    10/37

    Issues with Jaccard for scoring

    It doesn't consider term frequency (how many times a term occurs in a document).

    It doesn't consider document/collection frequency (rare terms in a collection are more informative than frequent terms).

    We need a more sophisticated way of normalizing for length.

    Later in this lecture, we'll use $|A \cap B| / \sqrt{|A \cup B|}$ instead of $|A \cap B| / |A \cup B|$ (Jaccard) for length normalization.


  • 7/24/2019 referensi TF-IDF

    11/37

    Recall (Lecture 1): Binary term-document incidence matrix

    Each document is represented by a binary vector $\in \{0,1\}^{|V|}$.


  • 7/24/2019 referensi TF-IDF

    12/37

    Term-document count matrices

    Consider the number of occurrences of a term in a document:

    Each document is a count vector in $\mathbb{N}^{|V|}$: a column of the count matrix.


  • 7/24/2019 referensi TF-IDF

    13/37

    Bag of words model

    Vector representation doesn't consider the ordering of words in a document.

    "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.

    This is called the bag of words model.

    In a sense, this is a step back: the positional index was able to distinguish these two documents.

    We will look at "recovering" positional information later in this course.
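A small sketch of the bag of words idea, assuming lowercasing and whitespace tokenization (the helper name `bag_of_words` is hypothetical): the two sentences "John is quicker than Mary" and "Mary is quicker than John" map to the same term counts.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a text to its term counts; word order is discarded."""
    return Counter(text.lower().split())


d1 = bag_of_words("John is quicker than Mary")
d2 = bag_of_words("Mary is quicker than John")

# Counter equality compares term counts only, so the two bags match.
same_bag = d1 == d2
print(same_bag)
```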


  • 7/24/2019 referensi TF-IDF

    14/37

    Term frequency tf

    The term frequency $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d.

    We want to use tf when computing query-document match scores. But how?

    Raw term frequency is not what we want:

    A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term. But not 10 times more relevant.

    Relevance does not increase proportionally with term frequency.


  • 7/24/2019 referensi TF-IDF

    15/37

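    Log-frequency weighting

    The log frequency weight of term t in d is

    $$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

    0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

    Score for a document-query pair: sum over terms t in both q and d:

    $$\mathrm{score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$$

    The score is 0 if none of the query terms is present in the document.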
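The log-frequency weight and the overlap score can be sketched as follows. The function names and the example term counts in `doc_tf` are hypothetical; only the formula 1 + log10(tf) for tf > 0 (and 0 otherwise) comes from the slide.

```python
import math


def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def overlap_score(query_terms, doc_tf) -> float:
    """Sum the log-frequency weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in set(query_terms))


# Hypothetical document term counts, for illustration only.
doc_tf = {"car": 10, "insurance": 2}
score = overlap_score(["best", "car", "insurance"], doc_tf)
print(score)
```

Here "best" contributes nothing, "car" contributes 2 and "insurance" roughly 1.3, which matches the mapping 10 → 2 and 2 → 1.3 listed on the slide.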


  • 7/24/2019 referensi TF-IDF

    16/37

    Document frequency

    Rare terms are more informative than frequent terms.

    Recall stop words.

    Consider a term in the query that is rare in the collection (e.g., arachnocentric).

    A document containing this term is very likely to be relevant to the query arachnocentric.

    → We want a higher weight for rare terms like arachnocentric.


  • 7/24/2019 referensi TF-IDF

    17/37

    Document frequency, continued

    Consider a query term that is frequent in the collection (e.g., high, increase, line).

    A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.

    → For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.

    We will use document frequency (df) to capture this in the score.

    df (≤ N) is the number of documents that contain the term.


  • 7/24/2019 referensi TF-IDF

    18/37

    idf weight

    $\mathrm{df}_t$ is the document frequency of t: the number of documents that contain t.

    $\mathrm{df}_t$ is an inverse measure of the informativeness of t.

    We define the idf (inverse document frequency) of t by

    $$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$$

    We use $\log(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to "dampen" the effect of idf.

    It will turn out that the base of the log is immaterial.
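One way to see why the base does not matter is the change-of-base identity. This is a short sketch, with b standing for an arbitrary alternative base greater than 1 (a symbol introduced here for illustration):

```latex
\log_b \frac{N}{\mathrm{df}_t}
  = \frac{\log_{10}(N/\mathrm{df}_t)}{\log_{10} b}
  = c \cdot \mathrm{idf}_t,
\qquad c = \frac{1}{\log_{10} b} > 0 \quad (b > 1)
```

Switching the base multiplies every idf value by the same positive constant c, so the relative weights of terms, and therefore the ordering of documents by score, stay the same.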


  • 7/24/2019 referensi TF-IDF

    19/37

    idf example, suppose N = 1 million

    term         df_t        idf_t
    calpurnia            1       6
    animal             100       4
    sunday           1,000       3
    fly             10,000       2
    under          100,000       1
    the          1,000,000       0

    There is one idf value for each term t in a collection.
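The idf values in this example can be reproduced with a short sketch, assuming idf_t = log10(N / df_t) with N = 1,000,000 as stated on the slide. The function name `idf` and the error handling for df ≤ 0 are illustrative choices.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency log10(N / df); df must be positive."""
    if df <= 0:
        raise ValueError("df must be positive")
    return math.log10(n_docs / df)


N = 1_000_000  # collection size from the slide
dfs = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

idf_values = {term: idf(df, N) for term, df in dfs.items()}
for term, value in idf_values.items():
    print(f"{term:10s} {value:.0f}")
```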


  • 7/24/2019 referensi TF-IDF

    20/37

    Collection vs. document frequency

    The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

    Which word is a better search term (and should get a higher weight)?

    Word         Collection frequency    Document frequency
    insurance                   10440                  3997
    try                         10422                  8760
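The difference between the two statistics can be illustrated on a tiny, entirely hypothetical collection (the four strings in `docs` are made up for this sketch). As in the table, two words can have the same collection frequency while one of them is concentrated in fewer documents.

```python
from collections import Counter

# Hypothetical toy collection, for illustration only.
docs = [
    "insurance insurance insurance claim",
    "try again",
    "try to try",
    "nothing relevant here",
]

collection_freq = Counter()  # total occurrences across the collection
document_freq = Counter()    # number of documents containing the term

for text in docs:
    tokens = text.split()
    collection_freq.update(tokens)     # every occurrence counts
    document_freq.update(set(tokens))  # each document counts once per term

print(collection_freq["insurance"], document_freq["insurance"])
print(collection_freq["try"], document_freq["try"])
```

Both words occur three times overall, but "insurance" appears in a single document, so document frequency separates them where collection frequency does not.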


  • 7/24/2019 referensi TF-IDF

    21/37

    tf-!df e!ght!ng

    The tf-idf weight of a term is the product of its tf weight and its idf weight:

    $$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$

    Best known weighting scheme in information retrieval.

    Note: the "-" in tf-idf is a hyphen, not a minus sign!

    Alternative names: tf.idf, tf x idf

    Increases with the number of occurrences within a document.

    Increases with the rarity of the term in the collection.
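A minimal sketch of the tf-idf weight, assuming the log-frequency tf weight and the log10 idf defined earlier. The function name `tf_idf` and the convention of returning 0 when tf or df is not positive are assumptions made for this example.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


N = 1_000_000  # same collection size as in the idf example

w_rare = tf_idf(tf=10, df=1_000, n_docs=N)        # frequent in the document, rare in the collection
w_common = tf_idf(tf=10, df=1_000_000, n_docs=N)  # occurs in every document, so idf is 0
print(w_rare, w_common)
```

The weight grows with the count inside the document and with the rarity of the term, and it drops to zero for a term that occurs in every document.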


  • 7/24/2019 referensi TF-IDF

    22/37

    Binary → count → weight matrix

    Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.


  • 7/24/2019 referensi TF-IDF

    23/37

    Documents as vectors

    So we have a |V|-dimensional vector space.

    Terms are axes of the space.

    Documents are points or vectors in this space.

    Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.

    Each vector is very sparse: most entries are zero.


  • 7/24/2019 referensi TF-IDF

    24/37

    Queries as vectors

    Key idea 1: Do the same for queries: represent them as vectors in the space.

    Key idea 2: Rank documents according to their proximity to the query in this space.

    proximity = similarity of vectors

    proximity ≈ inverse of distance

    Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.

    Instead: rank more relevant documents higher than less relevant documents.


  • 7/24/2019 referensi TF-IDF

    25/37

    Formalizing vector space proximity

    First cut: distance between two points (= distance between the end points of the two vectors)

    Euclidean distance?

    Euclidean distance is a bad idea . . .

    . . . because Euclidean distance is large for vectors of different lengths.


  • 7/24/2019 referensi TF-IDF

    26/37

    Why distance is a bad idea

    The Euclidean distance between q and $d_2$ is large even though the distribution of terms in the query q and the distribution of terms in the document $d_2$ are very similar.


  • 7/24/2019 referensi TF-IDF

    27/37

    Use angle instead of distance

    Thought experiment: take a document d and append it to itself. Call this document d'.

    "Semantically" d and d' have the same content.

    The Euclidean distance between the two documents can be quite large.

    The angle between the two documents is 0, corresponding to maximal similarity.

    Key idea: Rank documents according to angle with query.
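The thought experiment can be checked numerically with a small sketch. It assumes raw term-count vectors, so appending a document to itself doubles every component; the vector `d` and both helper functions are hypothetical and chosen only for illustration.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle_degrees(u, v) -> float:
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cos))


d = [3, 0, 4]                 # hypothetical raw term counts
d_prime = [2 * x for x in d]  # d appended to itself doubles every count

dist = euclidean(d, d_prime)
ang = angle_degrees(d, d_prime)
print(dist, ang)
```

The distance equals the length of d itself, so it grows with document length, while the angle stays at zero because the two vectors point in the same direction.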


  • 7/24/2019 referensi TF-IDF

    28/37

    From angles to cosines

    The following two notions are equivalent:

    Rank documents in increasing order of the angle between query and document

    Rank documents in decreasing order of cosine(query, document)

    Cosine is a monotonically decreasing function on the interval [0°, 180°].


  • 7/24/2019 referensi TF-IDF

    29/37

    Length normalization

    A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

    $$\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$$

    Dividing a vector by its L2 norm makes it a unit (length) vector.

    Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
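A short derivation of that last claim, assuming raw term-count vectors so that appending d to itself doubles every component (with log-frequency weights the components would not double exactly, so this is a sketch for the raw-count case only):

```latex
\vec{d'} = 2\vec{d}
\quad\Rightarrow\quad
\|\vec{d'}\|_2 = \sqrt{\sum_i (2 d_i)^2} = 2\,\|\vec{d}\|_2
\quad\Rightarrow\quad
\frac{\vec{d'}}{\|\vec{d'}\|_2}
  = \frac{2\vec{d}}{2\,\|\vec{d}\|_2}
  = \frac{\vec{d}}{\|\vec{d}\|_2}
```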


  • 7/24/2019 referensi TF-IDF

    30/37

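    cosine(query,document)

    $$\cos(\vec{q},\vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

    (First form: dot product divided by the vector lengths. Second form: dot product of unit vectors.)

    $q_i$ is the tf-idf weight of term i in the query.

    $d_i$ is the tf-idf weight of term i in the document.

    $\cos(\vec{q},\vec{d})$ is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.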
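The cosine formula can be sketched for sparse vectors stored as dictionaries from term to weight, which fits the observation that most entries are zero. The names `l2_normalize` and `cosine`, the example weights, and the convention of returning an all-zero vector unchanged are assumptions of this sketch.

```python
import math


def l2_normalize(vec: dict) -> dict:
    """Divide every weight by the vector's L2 norm; an all-zero vector is returned as is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def cosine(q: dict, d: dict) -> float:
    """Cosine similarity as the dot product of the two unit vectors."""
    qn, dn = l2_normalize(q), l2_normalize(d)
    # Only terms present in both vectors contribute to the dot product.
    return sum(w * dn.get(term, 0.0) for term, w in qn.items())


# Hypothetical weights, for illustration only.
q = {"car": 1.0, "insurance": 1.0}
d = {"car": 1.0, "auto": 1.0}

sim = cosine(q, d)
print(sim)
```

Iterating over the query's terms keeps the cost proportional to the number of non-zero query entries rather than to the vocabulary size.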


  • 7/24/2019 referensi TF-IDF

    31/37

    Cosine similarity amongst 3 documents

    term         SaS    PaP    WH
    affection    115     58    20

  • 7/24/2019 referensi TF-IDF

    32/37

    3 documents example contd.

    Log frequency weighting

    term         SaS     PaP     WH
    affection    3.06    2.76    2.30


  • 7/24/2019 referensi TF-IDF

    33/37

    Computing cosine scores

  • 7/24/2019 referensi TF-IDF

    34/37

    tf-idf weighting has many variants

    Columns headed 'n' are acronyms for weight schemes.

    Why is the base of the log in idf immaterial?

  • 7/24/2019 referensi TF-IDF

    35/37

    Weighting may differ in queries vs documents

    Many search engines allow for different weightings for queries vs documents.

    To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.

    Example: ltn.lnc means:

    Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization

    Document: logarithmic tf, no idf and cosine normalization

    Is this a bad idea?


  • 7/24/2019 referensi TF-IDF

    36/37

    tf-idf example: ltn.lnc

    Document: car insurance auto insurance
    Query: best car insurance

    Term       | Query                            | Document                      | Prod
               | tf-raw  tf-wt  df     idf   wt   | tf-raw  tf-wt  wt    n'lized |
    auto       | 0       0      5000   2.3   0    | 1       1      1     0.52    | 0
    best       | 1       1      50000  1.3   1.3  | 0       0      0     0       | 0
    car        | 1       1      10000  2.0   2.0  | 1       1      1     0.52    | 1.04
    insurance  | 1       1      1000   3.0   3.0  | 2       1.3    1.3   0.68    | 2.04

    Exercise: what is N, the number of docs?
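The ltn.lnc computation for the query "best car insurance" and the document "car insurance auto insurance" can be sketched as below. The idf values 2.3, 1.3, 2.0 and 3.0 for auto, best, car and insurance are taken from the example rather than recomputed, so N is not needed; all variable and function names are hypothetical.

```python
import math


def log_tf(tf: int) -> float:
    """Logarithmic tf weight (the 'l' component)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}
idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}  # from the example

# Query side, ltn: log tf, times idf, no normalization.
query_wt = {t: log_tf(tf) * idf[t] for t, tf in query_tf.items()}

# Document side, lnc: log tf, no idf, cosine (L2) normalization.
doc_wt = {t: log_tf(tf) for t, tf in doc_tf.items()}
doc_len = math.sqrt(sum(w * w for w in doc_wt.values()))
doc_norm = {t: w / doc_len for t, w in doc_wt.items()}

# Per-term products and the final score (sum over shared terms).
products = {t: w * doc_norm.get(t, 0.0) for t, w in query_wt.items()}
score = sum(products.values())
print(products, score)
```

Because the code keeps full floating-point precision, its per-term products can differ in the last digit from values obtained by multiplying already rounded table entries.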

  • 7/24/2019 referensi TF-IDF

    37/37

    Summary: vector space ranking

    Represent the query as a weighted tf-idf vector

    Represent each document as a weighted tf-idf vector

    Compute the cosine similarity score for the query vector and each document vector

    Rank documents with respect to the query by score

    Return the top K (e.g., K = 10) to the user
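The summary steps can be put together in one small end-to-end sketch. It assumes whitespace tokenization, the same log-tf times log10-idf weighting for queries and documents, and a three-document toy collection invented for this example; a real engine would use an inverted index rather than scoring every document.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and document frequencies."""
    doc_tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for tf in doc_tfs:
        df.update(tf.keys())  # each document counts once per term
    return doc_tfs, df


def tf_idf_vector(tf, df, n_docs) -> dict:
    """Sparse tf-idf vector; terms unseen in the collection are skipped."""
    return {
        term: (1.0 + math.log10(count)) * math.log10(n_docs / df[term])
        for term, count in tf.items()
        if df.get(term, 0) > 0
    }


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query: str, docs, k: int = 10):
    """Return the top-k (score, doc_index) pairs, best first."""
    doc_tfs, df = build_index(docs)
    n = len(docs)
    q_vec = tf_idf_vector(Counter(tokenize(query)), df, n)
    scored = [(cosine(q_vec, tf_idf_vector(tf, df, n)), i)
              for i, tf in enumerate(doc_tfs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return scored[:k]


# Hypothetical toy collection, for illustration only.
docs = [
    "car insurance auto insurance",
    "best car wash in town",
    "the long march",
]
top = rank("best car insurance", docs, k=2)
print(top)
```

Sorting by descending cosine puts the document that shares the rarer term "insurance" with the query ahead of the one that only shares "best" and "car".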
