Concepts and Challenges of Text Retrieval for Search Engine

PRE-CONFERENCE TUTORIAL

by Gan Keng Hoon

16th August 2016

THIS TUTORIAL

Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific
Case: Expert Search Engine

Search

What Do People Search for?

Fu Yuanhui

How to get free Pokeball ?

How to write thesis in three month ?

keynote speaker ICAICTA 2016

What Do People Expect?

How to get free Pokeball

Behind the Click?

Quiz: Which one is not a Search Engine?

Type of Search Engine

Web Search Engine: Google, Yahoo, Bing

Domain Specific Search Engine: Medline/Pubmed, Microsoft Academic

Desktop Search Engine: Copernic

Connecting Two Ends

Search Collection

Web, Domain Specific, Personal, Enterprise, etc.

Information Needs

I want to know more about the keynote speeches of ICAICTA

I need more Pokeballs. Free of charge…

What’s so funny about Fu Yuan

Scholarship ending soon, three months left to submit my thesis….

Web sites, journal articles, news, images, videos, audio, scanned documents, tweets, posts, reviews, etc.

A Conceptual Model for Text Retrieval

Information Needs

Search Collection

Document Representation

Retrieved Documents

Indexing

Formulation

Retrieval Function

Relevance Feedback

Natural Language Content Analysis

Search Collection (Retrieval Unit)

Web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, etc.

A retrieval unit can be:
Part of a document, e.g. a paragraph, a slide, a page, etc.
In different structural forms, e.g. HTML, XML, plain text, etc.
Of different sizes/lengths.

Document Representation

Full Text Representation
Keep everything. Complete. Requires huge resources. Too much may not be good.

Reduced (partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers, etc.

Document Representation

Think of representation as a way of storing the document.

Bag of Words Model

Represent a document as the bag (multiset) of its words, disregarding grammar and even word order.

Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"

From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }

The list has 8 distinct words. Each document is then represented by its word counts:
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
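The bag-of-words example above can be reproduced with a minimal Python sketch. The helper names (`bag_of_words`, `to_vector`) are illustrative assumptions, and splitting on whitespace after lowercasing is only a simplification of real tokenization.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

def to_vector(text, vocabulary):
    """Return the word counts of `text` in the order given by `vocabulary`."""
    counts = bag_of_words(text)
    return [counts[word] for word in vocabulary]

doc1 = "The cat sat on the hat"
doc2 = "The dog ate the cat and the hat"

# Word list in the same order as in the example.
vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

vec1 = to_vector(doc1, vocabulary)
vec2 = to_vector(doc2, vocabulary)
print(vec1)
print(vec2)
```

Word order is lost: any permutation of the words in a document yields the same vector.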

Information Needs & Query

Information Needs != Query

Recall the information needs.

Query: icaicta 2016 keynote

Information Need: I want to know more about the keynote speeches of ICAICTA 2016

Query: free pokeball

Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes.

Retrieved Documents

From the original collection, a subset of documents is obtained.

What factors determine which documents to return?

Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query, based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. The output is a ranked list displayed to the user; the top documents are the more relevant ones, as judged by the system.
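The four steps can be sketched as follows. This is a minimal illustration, assuming that "similarity" is simply the number of distinct terms shared by query and document; the function names and the three-document collection are hypothetical.

```python
def term_overlap(query, document):
    """Steps 1-2: count the distinct terms shared by query and document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms)

def rank(query, collection):
    """Steps 3-4: return document ids sorted by decreasing overlap."""
    scored = [(term_overlap(query, text), doc_id)
              for doc_id, text in collection.items()]
    # Drop documents that share no term with the query.
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

collection = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
ranking = rank("straits times", collection)
print(ranking)
```

D1 matches both query terms and is ranked first; D2 and D3 each match one term, so plain term matching cannot separate them.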

Indexing

Convert documents into a representation or data structure that improves the efficiency of retrieval.

Generate a set of useful terms called indexes.

Why?
A wide variety of words is used in texts, but not all are important.
Among the important words, some are more contextually relevant.

Some basic processes involved:
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File

Indexing (Tokenization)

Convert a sequence of characters into a sequence of tokens with some basic meaning.

“The cat chases the mouse.”

the | cat | chases | the | mouse

“Bigcorp's 2007 bi-annual report showed profits rose 10%.”

bigcorp | 2007 | biannual | report | showed | profits | rose | 10%

Indexing (Tokenization)

A token can be a single term or multiple terms.

“Samsung Galaxy S7 Edge, redefines what a phone can do.”

Single-term tokens: samsung | galaxy | s7 | edge | redefines | what | a | phone | can | do

Multi-term tokens: samsung galaxy | s7 edge | redefines | what | a | …

Indexing (Tokenization)

Common Issues

1. Capitalized words can have a different meaning from lower-case words

Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire

2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake

rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Indexing (Tokenization)

3. Numbers can be important, including decimals

nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations

I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents
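A minimal tokenizer sketch makes the note concrete: one function, applied to both documents and queries. The regular expression is an assumption chosen for brevity. It lowercases everything and keeps only runs of letters and digits, so it deliberately runs into the issues listed above (case, apostrophes, periods).

```python
import re

# Assumed rule: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return TOKEN_PATTERN.findall(text.lower())

document_tokens = tokenize("The cat chases the mouse.")
query_tokens = tokenize("Cat Mouse")

print(document_tokens)
print(query_tokens)
print(tokenize("I.B.M. can't"))  # periods and apostrophes split the words
```

Because the same function processes both sides, the query terms "cat" and "mouse" match the document tokens; the price is that "Bush" and "bush" become indistinguishable.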

Indexing (Stopping)

Top 50 Words from AP89 News Collection

Recall: index terms should be useful links to a document.

Are the terms in the figure on the right useful?

Indexing (Stopping)

Stopword list can be created from high-frequency words or based on a standard list

Lists are customized for applications, domains, and even parts of documents

e.g., “click” is a good stopword for anchor text

Best policy: index all words in documents, and decide which words to use at query time?
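A minimal sketch of the frequency-based approach: build the stopword list from the most frequent words of a collection, then filter tokens with it. The function names, the tiny two-document collection, and the cut-off `top_k=1` are illustrative assumptions; a real list would be built from a large collection or taken from a standard list.

```python
from collections import Counter

def build_stopword_list(documents, top_k):
    """Return the `top_k` most frequent words across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return {word for word, _ in counts.most_common(top_k)}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in stopwords]

documents = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]
stopwords = build_stopword_list(documents, top_k=1)
filtered = remove_stopwords("the cat sat on the hat".split(), stopwords)

print(stopwords)
print(filtered)
```

Applying `remove_stopwords` at query time rather than at indexing time corresponds to the policy of indexing all words and deciding later which ones to use.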

Indexing (Stemming)

Many morphological variations of words:
inflectional (plurals, tenses)
derivational (making verbs into nouns, etc.)

In most cases, these have the same or very similar meanings

Stemmers attempt to reduce morphological variations of words to a common stem

usually involves removing suffixes

Can be done at indexing time or as part of query processing (like stopwords)

Indexing (Stemming)

Porter Stemmer

Algorithmic stemmer used in IR experiments since the 1970s

Consists of a series of rules designed to strip the longest possible suffix at each step

Produces stems, not words

Example: Step 1 (right figure)
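As an illustration of "longest possible suffix first", here is a sketch of Step 1a of the original Porter algorithm (SSES → SS, IES → I, SS → SS, S → nothing). The figure on the slide may show a different variant of Step 1, so treat this as an assumption; the function name is hypothetical and the remaining steps are omitted.

```python
# Rules are ordered from longest to shortest suffix, so the first
# match is always the longest suffix that applies.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def porter_step_1a(word):
    """Apply the first matching Step 1a rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "->", porter_step_1a(example))
```

The output "poni" for "ponies" shows why the result is a stem rather than a word.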

Indexing (Phrases)

Recall that meaningful tokens, e.g. phrases, make better index terms.

Text processing issue – how are phrases recognized?

Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries

Indexing (Phrases)

Example: Noun Phrases

* Other methods, such as n-grams
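Word n-grams are the simplest of these approaches to sketch: every run of n consecutive tokens becomes a candidate phrase. The function name and the sample tokens are illustrative assumptions.

```python
def word_ngrams(tokens, n):
    """Return all runs of `n` consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "samsung galaxy s7 edge".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Unlike a POS tagger, this produces many n-grams that are not meaningful phrases, so in practice they are usually filtered, e.g. by frequency.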

Indexing (Inverted Index)

Recall, indexes are designed to support search.

Each index term is associated with an inverted list.
It contains lists of documents, or lists of word occurrences in documents, and other information.

Each entry is called a posting.
The part of the posting that refers to a specific document or location is called a pointer.
Each document in the collection is given a unique number.
Lists are usually document-ordered (sorted by document number).

Indexing (Inverted Index)

Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish

Indexing (Inverted Index)

Simple inverted index.

Indexing (Inverted Index)

Inverted index with counts.

Support better ranking algorithms.

Indexing (Inverted Index)

Inverted index with positions.

Support proximity matching.
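The three variants (simple, with counts, with positions) can all be derived from one positional index. The sketch below is an assumption-laden illustration: it uses the two bag-of-words example documents instead of the Tropical Fish sentences, numbers word positions from 1, and splits on whitespace.

```python
from collections import defaultdict

def build_positional_index(collection):
    """Map each term to {document number: [word positions]}."""
    index = defaultdict(dict)
    # Visit documents in increasing number so each list is document-ordered.
    for doc_id in sorted(collection):
        tokens = collection[doc_id].lower().split()
        for position, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

collection = {
    1: "The cat sat on the hat",
    2: "The dog ate the cat and the hat",
}
index = build_positional_index(collection)

# Simple inverted index: only the document numbers.
simple_postings = list(index["cat"])
# Inverted index with counts: occurrences per document.
count_postings = {doc_id: len(positions) for doc_id, positions in index["the"].items()}

print(simple_postings)
print(count_postings)
print(index["cat"])
```

With positions available, a proximity query such as "cat" directly followed by "sat" reduces to checking for consecutive position numbers within the same document.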

Retrieval Function

Ranking

Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.

Retrieval Function (Vector Space Model)

Rank-based method.

Documents and query represented by a vector of term weights.

Collection represented by a matrix of term weights.

Retrieval Function (Vector Space Model)

      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1

D1: new straits times
D2: new straits daily
D3: north borneo times

Vector of useful terms

      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176

idf (borneo) = log(3/1) = 0.477
idf (daily) = log(3/1) = 0.477
idf (new) = log(3/2) = 0.176
idf (north) = log(3/1) = 0.477
idf (straits) = log(3/2) = 0.176
idf (times) = log(3/2) = 0.176
then multiply by tf
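The weights above can be reproduced with a short sketch. It assumes a base-10 logarithm (which matches log(3/1) = 0.477) and the raw term count as tf; the function names are illustrative.

```python
import math

documents = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
vocabulary = ["borneo", "daily", "new", "north", "straits", "times"]

def idf(term, documents):
    """log10(N / n), where n is the number of documents containing the term.

    Assumes the term occurs in at least one document.
    """
    doc_freq = sum(1 for text in documents.values() if term in text.split())
    return math.log10(len(documents) / doc_freq)

def tfidf_vector(text, documents, vocabulary):
    """Raw term count multiplied by idf, for every vocabulary term."""
    tokens = text.split()
    return [tokens.count(term) * idf(term, documents) for term in vocabulary]

weights = {
    doc_id: [round(w, 3) for w in tfidf_vector(text, documents, vocabulary)]
    for doc_id, text in documents.items()
}
for doc_id, vector in weights.items():
    print(doc_id, vector)
```

Terms that occur in fewer documents (borneo, daily, north) receive the larger weight, which is the purpose of the idf factor.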

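tf.idf weight

Term frequency weight measures importance in the document (the example above uses the raw term count).

Inverse document frequency measures importance in the collection: idf(t) = log(N / n_t), where N is the number of documents in the collection and n_t is the number of documents containing term t (base-10 logarithm in the example above).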

Note: Doc Length, Term Location, Term Semantic Meaning

Documents ranked by distance between points representing query and documents

A similarity measure is more common than a distance or dissimilarity measure, e.g. cosine correlation.

Retrieval Function (Vector Space Model)

Consider two documents D1, D2 and a query Q

Q = “straits times”

Compare against collection, D1 = “new straits times”

(borneo, daily, new, north, straits, times)

Q = (0, 0, 0, 0, 0.176, 0.176)

D1 = (0, 0, 0.176, 0, 0.176, 0.176)

D2 = (0, 0.477, 0.176, 0, 0.176, 0)

\[
\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816
\]

Find Cosine(D2, Q). Which document is more relevant?
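The worked example, and the exercise for D2, can be checked with a short script. This is a minimal sketch using the vectors given above; the function name `cosine` and the zero-vector convention are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_u * norm_v)

# Term order: borneo, daily, new, north, straits, times
Q = [0, 0, 0, 0, 0.176, 0.176]
D1 = [0, 0, 0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]

score_d1 = cosine(D1, Q)
score_d2 = cosine(D2, Q)
print(round(score_d1, 3))
print(round(score_d2, 3))
```

The document with the higher cosine is ranked first for the query "straits times".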

Evaluation

Evaluating the retrieval function, preprocessing steps, etc. is a must.

Standard Collection
Task specific
Human experts are used to judge relevant results.

Performance Metrics
Precision
Recall

Evaluation (Collection)

Test collections consisting of documents, queries, and relevance judgments, e.g.,

Evaluation (Collection)

Example query and narrative for the gold standard.

Evaluation (Effectiveness Measures)

A is the set of relevant documents; B is the set of retrieved documents
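With A and B as defined here, the standard set-based definitions of the two measures are:

```latex
\[
\mathrm{Recall} = \frac{|A \cap B|}{|A|}
\qquad
\mathrm{Precision} = \frac{|A \cap B|}{|B|}
\]
```

Recall is the fraction of the relevant documents that were retrieved; precision is the fraction of the retrieved documents that are relevant.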

Evaluation (Ranking Effectiveness)

Recall@4 = 3/4 Precision@4 = 3/4

Recall@2 = 2/4 Precision@2 = 2/2
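The values above can be reproduced with a small sketch. The ranking and the relevance judgments below are hypothetical: they are chosen only so that four documents are relevant in total and three of them appear in the top four results, which is consistent with the numbers shown.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical judgments: four relevant documents, one never retrieved.
relevant = {"d1", "d2", "d4", "d9"}
ranking = ["d1", "d2", "d3", "d4", "d5"]

for k in (2, 4):
    print(k, precision_at_k(ranking, relevant, k), recall_at_k(ranking, relevant, k))
```

Moving down the ranking, recall can only stay the same or rise, while precision typically falls as non-relevant documents are included.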

Challenges

Social texts, e.g. tweets

Hard question. Hard Disk?

Named Entity

Various levels and aspects of annotations

Challenges

Small Data
Specific search
Improve semantics extensively

Big Data

Multimodal retrieval

Connecting many media

Case: Adding Semantics Bibliography

Improve Search Results Display

Facet-based semantic

Useful Terms

Demo: ir.cs.usm.my

THANK YOU

khgan@usm.my
