CAIM: Cerca i Anàlisi d’Informació Massiva
FIB, Grau en Enginyeria Informàtica

Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldà

Department of Computer Science, UPC

Fall 2018
http://www.cs.upc.edu/~caim


2. Information Retrieval Models

Information Retrieval Models, I
Setting the stage to think about IR

What is an Information Retrieval Model?

We need to clarify:
- a proposal for a logical view of documents (what info is stored/indexed about each document?),
- a query language (what kinds of queries will be allowed?),
- and a notion of relevance (how to handle each document, given a query?).


Information Retrieval Models, II
A couple of IR models

Focus for this course:
- Boolean model,
  - Boolean queries, exact answers;
  - extension: phrase queries.
- Vector model,
  - weights on terms and documents;
  - similarity queries, approximate answers, ranking.


Boolean Model of Information Retrieval
Relevance assumed binary

Documents:
A document is completely identified by the set of terms that it contains.
- Order of occurrence considered irrelevant,
- number of occurrences considered irrelevant
  (but a closely related model, called bag-of-words or BoW, does consider the number of occurrences relevant).

Thus, for a set of terms T = {t_1, ..., t_T}, a document is just a subset of T. Each document can be seen as a bit vector of length T, d = (d_1, ..., d_T), where
- d_i = 1 if and only if t_i appears in d, or, equivalently,
- d_i = 0 if and only if t_i does not appear in d.
(See the sketch below.)
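As an illustration (not part of the original slides), a minimal Python sketch of this representation; the helper names are our own, and the vocabulary ordering is the one used in the example slides later on:

```python
# Minimal sketch: documents as term sets and as bit vectors over a fixed vocabulary.

def to_term_set(text):
    """Boolean model: keep only which terms occur, not their order or counts."""
    return set(text.split())

def to_bit_vector(term_set, vocabulary):
    """Bit vector d = (d_1, ..., d_T) over a fixed, ordered vocabulary."""
    return [1 if t in term_set else 0 for t in vocabulary]

vocabulary = ["five", "four", "one", "six", "three", "two"]
d1 = to_term_set("one three")
print(to_bit_vector(d1, vocabulary))   # [0, 0, 1, 0, 1, 0]
```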


Queries in the Boolean Model, I
Boolean queries, exact answers

Atomic query: a single term.

The answer is the set of documents that contain it.

Combining queries:

- OR, AND: operate as union or intersection of answers;
- set difference, t1 BUTNOT t2 ≡ t1 AND NOT t2;
- motivation: avoid unmanageably large answer sets.

In Lucene: +/− signs on query terms, Boolean operators.
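A minimal sketch of these operators as set operations, assuming an in-memory inverted index that maps each term to the set of ids of documents containing it (the index layout and function names are our own, not Lucene’s):

```python
# Boolean operators as set operations over an inverted index (term -> set of doc ids).

def answer(index, term):
    """Atomic query: the set of documents that contain the term."""
    return index.get(term, set())

def q_and(a, b):    return a & b    # intersection of answer sets
def q_or(a, b):     return a | b    # union of answer sets
def q_butnot(a, b): return a - b    # set difference: a AND NOT b

index = {"keith": {1, 2}, "richards": {1, 3}, "emerson": {2}}
print(q_and(answer(index, "keith"), answer(index, "richards")))    # {1}
print(q_butnot(answer(index, "keith"), answer(index, "emerson")))  # {1}
```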


Queries in the Boolean Model, II
A close relative of propositional logic

Analogy:

- Terms act as propositional variables;
- documents act as propositional models;
- a document is relevant for a term if it contains the term, that is, if, as a propositional model, it satisfies the variable;
- queries are propositional formulas (with a syntactic condition of avoiding global negation);
- a document is relevant for a query if, as a propositional model, it satisfies the propositional formula.


Example, I
A very simple toy case

Consider 7 documents with a vocabulary of 6 terms:

d1 = one three
d2 = two two three
d3 = one three four five five five
d4 = one two two two two three six six
d5 = three four four four six
d6 = three three three six six
d7 = four five


Example, II
Our documents in the Boolean model

       five  four  one  six  three  two
d1 = [   0     0    1    0     1     0 ]
d2 = [   0     0    0    0     1     1 ]
d3 = [   1     1    1    0     1     0 ]
d4 = [   0     0    1    1     1     1 ]
d5 = [   0     1    0    1     1     0 ]
d6 = [   0     0    0    1     1     0 ]
d7 = [   1     1    0    0     0     0 ]

(Invent some queries and compute their answers!)
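For example, here is a small sketch (ours, not the slides’) that builds a term-to-documents index for this collection and evaluates the query (one AND three) BUTNOT two:

```python
# The 7 toy documents as term sets; evaluate (one AND three) BUTNOT two.
docs = {
    "d1": {"one", "three"},
    "d2": {"two", "three"},
    "d3": {"one", "three", "four", "five"},
    "d4": {"one", "two", "three", "six"},
    "d5": {"three", "four", "six"},
    "d6": {"three", "six"},
    "d7": {"four", "five"},
}

# Build the inverted index: term -> set of document ids.
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)

answer = (index["one"] & index["three"]) - index["two"]
print(sorted(answer))   # ['d1', 'd3']
```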


Queries in the Boolean Model, III
No ranking of answers

Answers are not quantified: a document either

- matches the query (is fully relevant),
- or does not match the query (is fully irrelevant).

Depending on user needs and application, this feature may be good or may be bad.


Phrase Queries, I
Slightly beyond the Boolean model

Phrase queries: conjunction plus adjacency
Ability to answer with the set of documents that have the terms of the query consecutively.

- A user querying “Keith Richards” may not want a document that mentions both Keith Emerson and Emil Richards.
- Requires extending the notion of “basic query” to include adjacency.


Phrase Queries, II
Options to “hack them in”

Options:

- Run as a conjunctive query, then double-check the whole answer set to filter out non-adjacent cases.
  This option may be very slow when there are large amounts of “false positives”.
- Keep in the index dedicated information about the adjacency of any two terms in a document (e.g. positions); see the sketch after this list.
- Keep in the index dedicated information about a choice of “interesting pairs” of words.
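A minimal sketch of the second option for a two-term phrase, assuming a positional index of the form term -> {doc id -> sorted list of positions} (the layout and names are our illustration, not a prescribed structure):

```python
# Phrase query "t1 t2": candidates contain both terms; keep those documents where
# some occurrence of t1 is immediately followed by t2 (position + 1).

def phrase_answer(pos_index, t1, t2):
    p1, p2 = pos_index.get(t1, {}), pos_index.get(t2, {})
    result = set()
    for doc in p1.keys() & p2.keys():            # conjunctive candidates
        follows = set(p2[doc])
        if any(p + 1 in follows for p in p1[doc]):
            result.add(doc)
    return result

pos_index = {
    "keith":    {"dA": [0], "dB": [5]},
    "richards": {"dA": [1], "dB": [9]},
}
print(phrase_answer(pos_index, "keith", "richards"))   # {'dA'}
```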


Vector Space Model of Information Retrieval, I
Basis of all successful approaches

- Order of words still irrelevant.
- Frequency is relevant.
- Not all words are equally important.
- For a set of terms T = {t_1, ..., t_T}, a document is a vector d = (w_1, ..., w_T) of floats instead of bits.
- w_i is the weight of t_i in d.


Vector Space Model of Information Retrieval, II
Moving to vector space

- A document is now a vector in R^T.
- The document collection conceptually becomes a terms × documents matrix, but we never compute the matrix explicitly (see the sparse-representation sketch below).
- Queries may also be seen as vectors in R^T.
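A short sketch of one way to avoid materializing the matrix: store, per document, only its nonzero weights. The layout and the placeholder weights below are our own assumptions, not the slides’ implementation:

```python
# Sparse view of the terms × documents matrix: per document, keep only nonzero weights.
# A dense matrix would need T × D floats, almost all of them zero for real collections.

sparse_docs = {
    "doc1": {"one": 0.4, "three": 0.1},                  # placeholder weights
    "doc2": {"two": 1.8, "three": 0.1, "six": 0.6},
}

def weight(doc_id, term):
    """Entries that are not stored are implicitly zero."""
    return sparse_docs.get(doc_id, {}).get(term, 0.0)

print(weight("doc1", "one"), weight("doc1", "two"))      # 0.4 0.0
```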


The tf-idf scheme
A way to assign weight vectors to documents

Two principles:

- The more frequent t is in d, the higher weight it should have.
- The more frequent t is in the whole collection, the less it discriminates among documents, so the lower its weight should be in all documents.


The tf-idf scheme, II
The formula

A document is a vector of weights

    d = [w_{d,1}, ..., w_{d,i}, ..., w_{d,T}].

Each weight is a product of two terms:

    w_{d,i} = tf_{d,i} · idf_i.

The term frequency tf is

    tf_{d,i} = f_{d,i} / max_j f_{d,j},   where f_{d,j} is the frequency of t_j in d.

And the inverse document frequency idf is

    idf_i = log2(D / df_i),   where D = number of documents
                              and df_i = number of documents that contain term t_i.
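A minimal sketch of this weighting, assuming raw term counts per document are already available; with the document frequencies of the toy collection on the next slides, it reproduces the weights computed there for d3 (function and variable names are ours):

```python
import math

def tf_idf_weights(freqs, doc_freq, num_docs):
    """freqs: {term: raw count in this document}; doc_freq: {term: df_i}; num_docs: D."""
    max_f = max(freqs.values())                     # max_j f_{d,j}
    weights = {}
    for term, f in freqs.items():
        tf = f / max_f                              # tf_{d,i}
        idf = math.log2(num_docs / doc_freq[term])  # idf_i = log2(D / df_i)
        weights[term] = tf * idf
    return weights

# d3 = "one three four five five five" from the toy collection, D = 7 documents.
df = {"five": 2, "four": 3, "one": 3, "six": 3, "three": 6, "two": 2}
print(tf_idf_weights({"five": 3, "four": 1, "one": 1, "three": 1}, df, 7))
# ≈ {'five': 1.81, 'four': 0.41, 'one': 0.41, 'three': 0.07}
```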


Example, I

       five  four  one  six  three  two    maxf
d1 = [   0     0    1    0     1     0 ]    1
d2 = [   0     0    0    0     1     2 ]    2
d3 = [   3     1    1    0     1     0 ]    3
d4 = [   0     0    1    2     1     4 ]    4
d5 = [   0     3    0    1     1     0 ]    3
d6 = [   0     0    0    2     3     0 ]    3
d7 = [   1     1    0    0     0     0 ]    1

df =     2     3    3    3     6     2


Example, II

df =     2     3    3    3     6     2
d3 = [   3     1    1    0     1     0 ]  →

d3 = [ (3/3)·log2(7/2)  (1/3)·log2(7/3)  (1/3)·log2(7/3)  (0/3)·log2(7/3)  (1/3)·log2(7/6)  (0/3)·log2(7/2) ]
   = [ 1.81  0.41  0.41  0  0.07  0 ]

d4 = [   0     0    1    2     1     4 ]  →

d4 = [ (0/4)·log2(7/2)  (0/4)·log2(7/3)  (1/4)·log2(7/3)  (2/4)·log2(7/3)  (1/4)·log2(7/6)  (4/4)·log2(7/2) ]
   = [ 0  0  0.31  0.61  0.06  1.81 ]


Similarity of Documents in the Vector Space Model
The cosine similarity measure

- “Similar vectors” may happen to have very different sizes.
- We better compare only their directions.
- Equivalently, we normalize them to the same Euclidean length before comparing them.

    sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (d1 / |d1|) · (d2 / |d2|),

where

    v · w = Σ_i v_i · w_i,   and   |v| = sqrt(v · v) = sqrt(Σ_i v_i²).

- Our weights are all nonnegative.
- Therefore, all cosines / similarities are between 0 and 1.


Cosine similarity, Example

d3 = [ 1.81  0.41  0.41  0     0.07  0    ]
d4 = [ 0     0     0.31  0.61  0.06  1.81 ]

Then

    |d3| = 1.898,   |d4| = 1.933,   d3 · d4 = 0.13,

and sim(d3, d4) = 0.035 (i.e., small similarity).
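A minimal sketch of this computation on the (rounded) weight vectors shown above; with the rounded entries it prints roughly 0.036, agreeing with the ≈ 0.035 obtained from the unrounded weights:

```python
import math

def cosine_similarity(v, w):
    """Cosine of the angle between two dense weight vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

d3 = [1.81, 0.41, 0.41, 0, 0.07, 0]
d4 = [0, 0, 0.31, 0.61, 0.06, 1.81]
print(round(cosine_similarity(d3, d4), 3))   # ~0.036: a small similarity
```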


Query Answering

- Queries can be transformed to vectors too.
- Sometimes tf-idf weights; often, binary weights.
- sim(doc, query) ∈ [0, 1].
- Answer: list of documents sorted by decreasing similarity (a ranking sketch follows below).
- We will find uses for comparing sim(d1, d2) too.
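A minimal end-to-end sketch of this ranking step, assuming binary query weights and sparse tf-idf document vectors (the vector values come from the worked example above; the function names and code are our own sketch):

```python
import math

def cosine_sparse(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(docs, query_terms):
    """Binary query vector; return (doc id, similarity) sorted by decreasing similarity."""
    q = {t: 1.0 for t in query_terms}
    scored = [(doc_id, cosine_sparse(vec, q)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d3": {"five": 1.81, "four": 0.41, "one": 0.41, "three": 0.07},
    "d4": {"one": 0.31, "six": 0.61, "three": 0.06, "two": 1.81},
}
print(rank(docs, ["one", "three"]))   # d3 ranks above d4 for this query
```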
