LSI latent (by HATOUM Saria and DONGO ESCALANTE Irvin Franco)


Transcript of LSI latent (by HATOUM Saria and DONGO ESCALANTE Irvin Franco)

 Prepared by:

Presented to:

Latent Semantic Indexing

Bayonne/2013

HATOUM Saria

DONGO Irvin

Prof. CHBEIR Richard

2

• Introduction
– Information Retrieval
– Vector Space Model

• Problems

• Latent Semantic Indexing
– Algorithm

– Example

– Advantages

– Disadvantages

Overview

3

• Many documents are available.

• There is a need to extract information.

• Information must be sorted and classified.

• Users query the information.

Introduction

4

Information Retrieval

• Before LSI: literal text matching.
– A text corpus with many documents.
– Given a query, find the relevant documents.
– Some terms in a user's query will literally match terms in irrelevant documents.

5

• Set-Theoretic
– Fuzzy Set

• Algebraic
– Vector Space
– Generalised Vector Space
– Latent Semantic Indexing

• Probabilistic
– Binary Interdependence

Some Methods for IR

6

• An algebraic model for representing text documents.

• Documents and queries are both represented as vectors of term weights:

dj = (w1,j , w2,j , …, wt,j)

q = (w1,q , w2,q , …, wt,q)

Vector Space Model

7

Vector space method
– Term (rows) by document (columns) matrix, based on term occurrences.
– One vector is associated with each document.
– The cosine measures the similarity between vectors (documents):
• small angle = large cosine = similar
• large angle = small cosine = dissimilar

8

• Sim(di, dj) = 1, if di and dj point in the same direction (e.g. di = dj).

• Sim(di, dj) = 0, if di and dj are orthogonal (share no terms).

Cosine Similarity Measure
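The measure above can be sketched in a few lines of numpy; the document vectors here are hypothetical, chosen only to illustrate the two boundary cases:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two term-weight vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical term-count vectors over a 3-word vocabulary.
d1 = np.array([1.0, 0.0, 1.0])
d2 = np.array([2.0, 0.0, 2.0])   # same direction as d1, different length
d3 = np.array([0.0, 1.0, 0.0])   # shares no terms with d1

print(cosine_sim(d1, d2))  # ≈ 1.0: same direction, maximally similar
print(cosine_sim(d1, d3))  # ≈ 0.0: orthogonal, no term overlap
```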

9

Document vector space

[Figure: documents and a query plotted as vectors in a two-term space, with axes Word 1 and Word 2.]

10

Problem Introduction
• The traditional term-matching method does not work well in information retrieval.

• We want to capture concepts instead of words. Concepts are reflected in the words; however:
– One term may have multiple meanings.
– Different terms may have the same meaning.

11

The Problems
• Two problems arise when using the vector space model:
– Synonymy: there are many ways to express a given concept, e.g. “automobile” when querying on “car”.
• Leads to poor recall (the percentage of all relevant documents that are retrieved).
– Polysemy: words have multiple meanings, e.g. “surfing”.
• Leads to poor precision (the percentage of retrieved documents that are relevant).

• Meaning depends on the context of the documents.
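The recall and precision quoted above are simple set ratios; a minimal sketch, with hypothetical document IDs and relevance judgments:

```python
# Hypothetical relevance judgments and retrieval result for one query.
relevant = {"d1", "d3", "d4"}    # all documents truly relevant to the query
retrieved = {"d1", "d2", "d3"}   # documents returned by the system

hits = relevant & retrieved      # relevant documents that were retrieved

recall = len(hits) / len(relevant)       # share of relevant docs retrieved
precision = len(hits) / len(retrieved)   # share of retrieved docs relevant
print(recall, precision)
```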

12

Polysemy and Context
• Document similarity at the single-word level is confounded by polysemy and context.

[Figure: the ambiguous word “saturn” sits between two word clusters: meaning 2 {car, company, dodge, ford} and meaning 1 {ring, jupiter, space, planet}; the word contributes to similarity if used in the 1st meaning, but not if in the 2nd.]

13

Problem Statement
• Allow users to retrieve information on the basis of the conceptual topic or meaning of a document.

14

Latent Semantic Indexing
• Overcomes these problems of lexical matching:
– Uses a statistical information retrieval method capable of retrieving text based on the concepts it contains, not just by matching specific keywords.

15

Characteristics of LSI
• Documents are represented as “bags of words”: the order of the words in a document is not important, only how many times each word appears.

• LSI is a technique that projects queries and documents into a space with “latent” semantic dimensions.

• It converts a high-dimensional space into a lower-dimensional space.

16

• Concepts are represented as patterns of words that usually appear together in documents.
– For example, “jaguar”, “car”, and “speed” might usually appear in documents about sports cars, whereas “jaguar”, “animal”, and “hunting” might refer to the concept of jaguar the animal.

• LSI is based on the principle that words that are used in the same contexts tend to have similar meanings.

• LSI uses Singular Value Decomposition (SVD) to map terms to concepts.

Characteristics of LSI

17

• The number of words is huge.

• Throw out noise words: ‘and’, ‘is’, ‘at’, ‘the’, etc.

• Select and use a smaller set of words that are of interest.

• Stemming: remove endings, e.g. learning, learned → learn.

Generate matrix
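The preprocessing steps above can be sketched as follows; the stop list and suffix rules are deliberately crude stand-ins for a real stemmer such as Porter's:

```python
# Minimal sketch of matrix preprocessing: drop noise words, strip endings.
STOP_WORDS = {"and", "is", "at", "the", "a", "of", "in"}

def crude_stem(word: str) -> str:
    """Very rough suffix stripping (a real system would use e.g. Porter stemming)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lowercase, drop stop words, stem what remains."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("learning and learned"))  # ['learn', 'learn']
```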

18

“Semantic” Space

[Figure: words clustered in the semantic space: {House, Home, Domicile} form one cluster; {Kumquat, Apple, Orange, Pear} form another.]

19

Information Retrieval
• Represent each document as a word vector.

• Represent the corpus as a term-document matrix (T-D matrix) and analyse it with a linear algebra method called SVD.

• A classical method:
– Create a new vector from the query terms.
– Find the documents with the highest cosine similarity.

20

• We decompose the term-document matrix A into three matrices: A = USVᵀ, where U holds the term vectors, S the singular values, and V the document vectors.

Singular Value Decomposition (SVD)
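In numpy the decomposition and its rank-k truncation look like this; the 4×3 count matrix is hypothetical:

```python
import numpy as np

# Hypothetical term (rows) by document (columns) count matrix.
A = np.array([[2., 0., 1.],
              [0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])

# A = U @ diag(s) @ Vt, with the singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the k largest singular values projects terms and
# documents into a k-dimensional "concept" space.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```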

21

• d1: Shipment of gold damaged in a fire.

• d2: Delivery of silver arrived in a silver truck.

• d3: Shipment of gold arrived in a truck.

• q: Gold silver truck

Example

22

Example

[Slides 22–24 showed the worked matrices of this example (the term-document matrix and its SVD factors) as figures; they are not reproduced in this transcript.]

25

New vectors

• d1 = [-0.4945, 0.6492]

• d2 = [-0.6458, -0.7194]

• d3 = [-0.5817, 0.2469]

Example

26

Example

27

sim(q, di) = cos θ

• sim(q,d1) = -0.0541

• sim(q,d2) = 0.9910

• sim(q,d3) = 0.4478

Example
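The similarity scores above can be reproduced end to end; the term-document matrix below is rebuilt by counting the words of d1–d3 given earlier in the example, and the alphabetical term order is an assumption:

```python
import numpy as np

# Terms, alphabetically: a, arrived, damaged, delivery, fire, gold,
# in, of, shipment, silver, truck. Columns are d1, d2, d3 (raw counts).
A = np.array([
    [1, 1, 1],   # a
    [0, 1, 1],   # arrived
    [1, 0, 0],   # damaged
    [0, 1, 0],   # delivery
    [1, 0, 0],   # fire
    [1, 0, 1],   # gold
    [1, 1, 1],   # in
    [1, 1, 1],   # of
    [1, 0, 1],   # shipment
    [0, 2, 0],   # silver
    [0, 1, 1],   # truck
], dtype=float)
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], dtype=float)  # "gold silver truck"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                       # keep the two strongest concepts
Uk, sk, Dk = U[:, :k], s[:k], Vt[:k, :].T   # Dk: one row per document

qk = (q @ Uk) / sk                          # fold the query into concept space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cos(qk, d) for d in Dk]
print(np.round(sims, 4))                    # ≈ [-0.0541, 0.9910, 0.4478]
```

d2 ranks first, matching the intuition that the query “gold silver truck” is closest to the delivery document.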

28

Advantages
• LSI overcomes two of the most problematic constraints of queries:
– Synonymy
– Polysemy

• True (latent) dimensions: the new dimensions are a better representation of documents and queries.

• Term dependence: the traditional vector space model assumes term independence, whereas LSI captures strong associations between terms, as natural language does.

29

Disadvantages
• Storage
– Many documents have more than 150 unique terms, so the term-document matrix is very sparse; the reduced matrices produced by LSI, however, are dense and expensive to store.

• Efficiency
– With LSI, the query must be compared to every document in the collection.

• Static matrix
– If we have new documents, we need to compute a new SVD of the main matrix.

30

References
• [Furnas et al., 1988] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '88, pages 465–480, New York, NY, USA. ACM.

• [Hull, 1994] Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '94, pages 282-291, New York, NY, USA. Springer-Verlag New York, Inc.

31

• [Atreya and Elkan, 2011] Atreya, A. and Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. SIGKDD Explor. Newsl., 12(2):5–10.

• [Deerwester et al., 1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

• [Littman et al., 1998] Littman, M., Dumais, S. T., and Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. In Cross-Language Information Retrieval, chapter 5, pages 51–62. Kluwer Academic Publishers.

References

32

Thank you for your attention!

Milesker anitz! (Basque: “Many thanks!”)

33