LAST WEEK


LAST WEEK. Retrieval evaluation: Why? How? Recall and precision – Venn’s Diagram & Contingency Table. WMES3103 INFORMATION RETRIEVAL. WEEK 5: QUERY LANGUAGES AND OPERATION. QUERY LANGUAGES. Will cover the different kinds of queries sent to text retrieval systems.

Transcript of LAST WEEK

Page 1: LAST WEEK

LAST WEEK

Retrieval evaluation
  Why?
  How?
  Recall and precision – Venn’s Diagram & Contingency Table

Page 2: LAST WEEK

WMES3103 INFORMATION RETRIEVAL

WEEK 5

QUERY LANGUAGES AND OPERATION

Page 3: LAST WEEK

QUERY LANGUAGES

Will cover the different kinds of queries sent to text retrieval systems.

Will show the different types of query that a user can formulate.

Normally, the main and most popularly used type of user query is keyword-based retrieval.

Page 4: LAST WEEK

Different queries are continuously sent to an information retrieval system (IRS).

Most query languages use the content (semantics) and the structure of the text (text syntax) to find the relevant documents.

At times, the IRS may fail to trace and retrieve the relevant documents.

Therefore, we need a number of techniques that enhance the query so that an acceptable level of relevant documents can be retrieved.

Page 5: LAST WEEK

QUERY LANGUAGES

e.g. use of a thesaurus, synonyms, stemming, stopwords, etc.

A keyword is a word that can be retrieved by an IRS.

The retrieval unit is the basic element which can be retrieved by the system as an answer to a query; retrieval units are also known as documents.

A retrieval unit can be a file, document, Web page, paragraph, or some other structural unit which contains the answer to the query.

Page 6: LAST WEEK

Example: Keyword

Keyword used is: “artificial intelligence”

Page 7: LAST WEEK

Example: Retrieval unit

website

Page 8: LAST WEEK

Example: Retrieval unit

document

Page 9: LAST WEEK

TYPES OF QUERY LANGUAGES

Keyword-based querying
  Single-word
  Context
  Boolean
  Natural language
Pattern matching
Structural queries
  Form-like fixed
  Hypertext
  Hierarchical

Page 10: LAST WEEK

KEYWORD-BASED QUERYING

Query = formulation of a user information need.
Query = a keyword or a number of keywords = a basic query.
Documents containing such keywords are searched for in the IRS.
Keyword-based queries are popular because:
  Intuitive
  Easy to express
  Allow for fast ranking

Page 11: LAST WEEK

Single-word query

Simplest form of query that can be formulated in an IRS.
Text document = long sequence of words.
The IRS will look at the text and search for the word.
Result of a word query = a set of documents containing at least one of the words of the query.
Set of documents will be ranked according to the degree of similarity to the query.

Page 12: LAST WEEK
Page 13: LAST WEEK
Page 14: LAST WEEK
Page 15: LAST WEEK

Single-word query

Ranking is done via word occurrences inside the text.

Most popularly used = term frequency = counts the number of times a word appears inside a document.
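
Term-frequency ranking can be sketched in a few lines of Python. This is only an illustration: the tokenizer (lowercase, split on whitespace), the function names and the three sample documents are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def rank_by_term_frequency(word, documents):
    """Return (doc_id, tf) pairs for documents containing `word`, highest tf first."""
    word = word.lower()
    scores = {}
    for doc_id, text in documents.items():
        tf = Counter(tokenize(text))[word]  # number of times the word appears
        if tf > 0:
            scores[doc_id] = tf
    # Sort by descending term frequency; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical mini-collection, used only for illustration.
documents = {
    "d1": "robot arms and robot vision",
    "d2": "a robot in the factory",
    "d3": "marine pollution report",
}
print(rank_by_term_frequency("robot", documents))  # [('d1', 2), ('d2', 1)]
```

Real systems usually combine term frequency with other evidence (for example document length), which this sketch leaves out.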

Page 16: LAST WEEK

Context query

Single-word queries are complemented with the ability to search for words in a given context = near other words.

Words which appear near other words may indicate a higher possibility of relevance than if they appear apart.

2 types of queries:
  phrase
  proximity

Page 17: LAST WEEK

Phrase – sequence of single-word queries.
Proximity – more relaxed version of the phrase query.
The words are given together with a maximum allowed distance between them.
Distance is measured in characters or words, depending on the system.
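
A phrase query can be answered from word positions. The sketch below assumes a simple positional index (word -> list of positions) built for one document; the function names and the sample sentence are hypothetical and only meant to make the idea concrete.

```python
from collections import defaultdict

def build_positional_index(text):
    """Map each word of the text to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return index

def phrase_positions(phrase, index):
    """Return the start positions at which the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    starts = index.get(words[0], [])
    # A start position matches if every following word sits at the next offset.
    return [s for s in starts
            if all(s + offset in index.get(w, []) for offset, w in enumerate(words))]

index = build_positional_index("research in artificial intelligence and artificial life")
print(phrase_positions("artificial intelligence", index))  # [2]
print(phrase_positions("intelligence artificial", index))  # []
```

The second call returns nothing because a phrase query requires the words in the given order.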

Page 18: LAST WEEK

Example: ABI-INFORM (CD)

W/n – first keyword must be within n words of the second keyword.
computer w/1 data = the word computer must be within 1 word of the word data = computer generated data, computer simulated data, data mining computer

PRE/n – first keyword precedes second keyword by up to n words.
European pre/1 community = the word European must precede the word community by up to 1 word = European economic community, European flavoured community
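
The two operators can be imitated in Python. This is a sketch under an assumption read off the examples above: "within n words" is taken to mean at most n words between the two keywords. The function names are hypothetical, and the actual ABI-INFORM implementation may count distance differently.

```python
def _positions(word, words):
    """Positions at which `word` occurs in the list `words`."""
    return [i for i, w in enumerate(words) if w == word]

def within(first, second, n, text):
    """W/n: both keywords occur, in either order, with at most n words between them."""
    words = text.lower().split()
    return any(0 < abs(i - j) <= n + 1
               for i in _positions(first.lower(), words)
               for j in _positions(second.lower(), words))

def precedes(first, second, n, text):
    """PRE/n: `first` occurs before `second` with at most n words between them."""
    words = text.lower().split()
    return any(0 < j - i <= n + 1
               for i in _positions(first.lower(), words)
               for j in _positions(second.lower(), words))

print(within("computer", "data", 1, "computer generated data"))             # True
print(within("computer", "data", 1, "data mining computer"))                # True
print(precedes("European", "community", 1, "European economic community"))  # True
print(precedes("European", "community", 1, "community of European states")) # False
```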

Page 19: LAST WEEK

Example: COMPENDEX

Search for a phrase = type in each keyword separated by a space = will search for the phrase with the 2 keywords next to each other and in the specified order
artificial intelligence

Desired proximity of keywords specified with full stops between keywords
back..basics = back to basics, back to the basics

Keywords must appear in the same sentence = type in the keywords separated by an underscore
computer_medicine

Page 20: LAST WEEK

Boolean query

Oldest form of keyword query = use of Boolean operators.
Typical Boolean query = words + operators.
Given 2 basic keyword queries, A and B:
  A or B – selects all documents with the word A or the word B.
  A and B – selects all documents with both A and B.
  A not B – selects all documents with the word A but without the word B.
Represented by a Venn diagram.

Page 21: LAST WEEK

Boolean operator: AND

[Venn diagram: robotics, Malaysia]

Each document in this set will contain both the words robotics and Malaysia.

Page 22: LAST WEEK

Boolean operator: OR

[Venn diagram: water pollution, marine pollution]

Each document in this set will contain one or both of the keywords: water pollution, marine pollution.

Page 23: LAST WEEK

Boolean operator: NOT

[Venn diagram: digital, watches]

Each document in this set will contain the word digital but will not have the word watches in it.
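
The three operators map directly onto set operations over an inverted index (word -> set of document ids). The sketch below is illustrative only; the three-document collection and the helper name are assumptions made for the example.

```python
def build_inverted_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical mini-collection, used only for illustration.
documents = {
    1: "robotics research in malaysia",
    2: "robotics and digital control",
    3: "digital watches from malaysia",
}
index = build_inverted_index(documents)

robotics = index.get("robotics", set())
malaysia = index.get("malaysia", set())
digital = index.get("digital", set())
watches = index.get("watches", set())

print(sorted(robotics & malaysia))  # AND: intersection -> [1]
print(sorted(robotics | malaysia))  # OR: union -> [1, 2, 3]
print(sorted(digital - watches))    # NOT: difference -> [2]
```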

Page 24: LAST WEEK

Natural Language

User determines the keywords that should be eliminated because they are not useful for searching.

Ranking for documents with these keywords would be very low.

Page 25: LAST WEEK

TARGET - Dialog

? target
Input search terms separated by spaces (e.g. DOG CAT FOOD). You can enhance your TARGET search with the following options:
  -- PHRASES are enclosed in single quotes (e.g. ‘DOG FOOD’)
  -- SYNONYMS are enclosed in parentheses (e.g. (DOG CANINE))
  -- SPELLING variations are indicated with a ? (e.g. DOG? To search for DOG, DOGS)
  -- Terms that MUST be present are flagged with an asterisk (e.g. DOG *FOOD)
  Q = QUIT   H = HELP
? komodo dragon food diet nutrition
Your TARGET search request will retrieve up to 50 of the statistically relevant records.
Searching 1997 – 1998 records only
… Processing Complete

Your search retrieved 50 records
Press ENTER to browse results   C = Customize display   Q = Quit   H = Help

Page 26: LAST WEEK

Pattern Matching

More specific query formulation.
Retrieves pieces of text that have some property.
Used in the retrieval of text statistics, data extraction, etc.
A pattern is a set of syntactic features that must occur in a text segment.
Segments that fulfil the pattern specifications = pattern match.

Page 27: LAST WEEK

Pattern Matching

Interested in documents containing segments which match the given search pattern.
Each IRS will allow some degree of search pattern.
Patterns can be very simple or very complex.
The more powerful the set of patterns allowed, the more involved are the queries that can be formulated by the user, and the more complex is the implementation of the search.

Page 28: LAST WEEK

Pattern Matching

Words – a word in the text, the most basic pattern.
Prefixes – the beginning of a text word, e.g. the prefix comput will retrieve all documents containing words such as computers, computing, computation, computational, etc.
Suffixes – the termination of a text word, e.g. the suffix ters will retrieve all documents containing words such as monsters, posters, potters, painters, etc.

Page 29: LAST WEEK

Pattern Matching

Substrings – can appear within a text word, e.g. tal will retrieve all documents containing words such as coastal, talk, metallic, pedestal, etc.
Ranges – a pair of strings which matches any word lying between them in lexicographical order, e.g. the range between the words held and hold will retrieve strings such as hoax, hissing, helm, help, etc.
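
The four simple patterns (prefix, suffix, substring, range) can be tried out against a small vocabulary. This is a minimal sketch: the function names and the word list are assumptions, and treating the range endpoints as inclusive is a choice made here, not something the slides specify.

```python
def match_prefix(prefix, vocabulary):
    """Words that begin with `prefix`."""
    return sorted(w for w in vocabulary if w.startswith(prefix))

def match_suffix(suffix, vocabulary):
    """Words that end with `suffix`."""
    return sorted(w for w in vocabulary if w.endswith(suffix))

def match_substring(substring, vocabulary):
    """Words that contain `substring` anywhere."""
    return sorted(w for w in vocabulary if substring in w)

def match_range(low, high, vocabulary):
    """Words lying between `low` and `high` in lexicographical order (endpoints included)."""
    return sorted(w for w in vocabulary if low <= w <= high)

# Hypothetical vocabulary built from the words used on the slides.
vocabulary = {"computers", "computing", "monsters", "posters", "coastal",
              "talk", "metallic", "held", "helm", "help", "hissing", "hoax", "hold"}

print(match_prefix("comput", vocabulary))       # ['computers', 'computing']
print(match_suffix("ters", vocabulary))         # ['computers', 'monsters', 'posters']
print(match_substring("tal", vocabulary))       # ['coastal', 'metallic', 'talk']
print(match_range("held", "hold", vocabulary))  # ['held', 'helm', 'help', 'hissing', 'hoax', 'hold']
```

Note that the suffix ters also matches computers, which shows how loosely a suffix pattern can select words.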

Page 30: LAST WEEK

Pattern Matching

Allowing errors – a word together with an error threshold
  will retrieve all text words which are similar to the given word.
  errors are caused by typing, spelling, etc.
  most accepted model is the Levenshtein distance or edit distance.

Page 31: LAST WEEK

Pattern Matching

Example: the edit distance between COLOR and COLOUR is 1, and between SURVEY and SURGERY it is 2. Therefore, in the query, we must specify the maximum number of allowed errors for a word to match the pattern.
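
The two distances can be checked with the standard dynamic-programming computation of the Levenshtein distance. The code is a sketch; the helper `matches_with_errors` and its sample word list are hypothetical additions showing how an error threshold might be applied.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

def matches_with_errors(query, vocabulary, max_errors):
    """Words whose edit distance to `query` does not exceed `max_errors`."""
    return sorted(w for w in vocabulary if edit_distance(query, w) <= max_errors)

print(edit_distance("COLOR", "COLOUR"))    # 1
print(edit_distance("SURVEY", "SURGERY"))  # 2
print(matches_with_errors("color", ["colour", "cooler"], 1))  # ['colour']
```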

Page 32: LAST WEEK

Structural Queries

Based on the structure of the text.
3 structures – fixed, hypertext, hierarchical.
The user will query the text based on the structure.
Query languages nowadays integrate both content and structural queries.

Page 33: LAST WEEK

Example: UM Library OPAC records
Example of query: fi au ali and subject malaysia

Page 34: LAST WEEK
Page 35: LAST WEEK
Page 36: LAST WEEK

Query Protocols

Protocol: a strict set of rules that govern the exchange of information between computer devices.
Query languages used automatically by software applications to query text databases.
Some are standards for querying CD-ROMs or serve as intermediate languages to query library systems.
Not intended for human use = referred to as protocols and not languages.

Page 37: LAST WEEK

Query Protocols

Z39.50 – queries bibliographical information using a standard interface between the client and the host database manager which is independent of the client user interface and of the query database language at the host. Originally used for bibliographical information based on the MARC format.

WAIS – Wide Area Information Service – popular before the Web – a network publishing protocol that can query databases through the Internet.

Page 38: LAST WEEK

www.ukoln.ac.uk/dlis/z3950/

Page 39: LAST WEEK
Page 40: LAST WEEK

Protocols for CD-ROM

Allow for flexibility in data communication between primary information providers and end users.
Significant cost savings – allow access to a variety of information without the need to buy, install, and train users for different data retrieval applications.
3 protocols have been recommended:
  CCL (Common Command Language)
  CD-RDx (Compact Disk Read only Data exchange)
  SFQL (Structured Full-text Query Language)

Page 41: LAST WEEK

QUERY OPERATIONS

Users find it difficult to formulate queries which are well designed for retrieval purposes because they do not know the collection make-up and the retrieval environment.
Web search engines – users spend a lot of time reformulating their queries to get effective retrieval.
First query formulation – retrieve documents and examine for relevance – construct new improved query formulations – retrieve documents and examine for relevance – process is repeated until the user is satisfied.

Page 42: LAST WEEK

QUERY OPERATIONS

2 processes involved:
  expanding the original query with new terms
  reweighting the terms in the expanded query
2 ways of improving initial query formulation:
  approaches based on feedback information from the user
  approaches based on information derived from the set of documents initially retrieved (called the local set of documents)

Page 43: LAST WEEK

User Relevance Feedback

Most popular query formulation strategy.
User is presented with a list of retrieved documents, examines them, and marks those which are relevant.
Only the top 10 or 20 ranked documents need to be examined.
The retrieved documents are separated into relevant and non-relevant.
Select important terms attached to the retrieved and relevant documents only, and enhance the importance of these terms in the new query formulation.

Page 44: LAST WEEK

User Relevance Feedback

Expect the new query to move towards the relevant documents and away from the non-relevant ones.
Advantages:
  Protects the user from the details of the query reformulation process because all the user has to do is reuse the terms.
  Breaks down the entire search process into a sequence of small steps which are easier to grasp.
  Provides a controlled process designed to emphasise some terms and de-emphasise others.
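
The slides do not give a formula for the reweighting step. One classic formulation is the Rocchio method, used here as a stand-in: the new query is the old query plus a share of the average relevant document minus a share of the average non-relevant document. The weights alpha, beta and gamma and the toy term vectors below are illustrative assumptions, not values from the lecture.

```python
from collections import Counter

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return reweighted query terms; terms whose weight is not positive are dropped.

    `query` and each document are dicts mapping a term to its weight.
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:  # move towards the relevant documents
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:  # move away from the non-relevant documents
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    return {term: weight for term, weight in new_query.items() if weight > 0}

# Hypothetical example: the user marked one document relevant and one non-relevant.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 2.0}]
non_relevant = [{"jaguar": 1.0, "car": 3.0}]
print(rocchio(query, relevant, non_relevant))
```

In this example the expanded query gains the term cat from the relevant document, while car, which only appears in the non-relevant document, ends up with a negative weight and is dropped.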

Page 45: LAST WEEK

Automatic Local Analysis

Documents retrieved for a given query are examined immediately to determine terms for query expansion.
Similar to the relevance feedback cycle but done without the assistance of the user – automatic.
Local feedback strategies are based on expanding the query with terms correlated to the query terms = local clusters built from the local document set.
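
A very simple form of local analysis can be sketched as follows. It is an assumption-laden illustration rather than the clustering method itself: correlation is approximated by counting how many local documents a term shares with a query term, and the function name, parameters and sample documents are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, local_documents, extra_terms=2, stopwords=frozenset()):
    """Add to the query the terms that most often co-occur with query terms locally."""
    query = [term.lower() for term in query_terms]
    cooccurrence = Counter()
    for text in local_documents:
        words = set(text.lower().split())
        if words & set(query):  # document contains at least one query term
            for word in words - set(query) - stopwords:
                cooccurrence[word] += 1
    # Most frequent co-occurring terms first; ties are broken alphabetically.
    ranked = sorted(cooccurrence.items(), key=lambda item: (-item[1], item[0]))
    return query + [term for term, _ in ranked[:extra_terms]]

# Hypothetical local set: the documents initially retrieved for the query "pollution".
local_documents = [
    "marine pollution oil spill",
    "marine pollution oil tanker",
    "river pollution sewage",
]
print(expand_query(["pollution"], local_documents))  # ['pollution', 'marine', 'oil']
```

No user judgement is involved: the expansion terms come only from the documents retrieved by the first query.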