LIS618 lecture 6 Vector Model and ProQuest Thomas Krichel 2011-11-01


Page 1: LIS6 18 lecture  6 Vector Model and  ProQuest

LIS618 lecture 6

Vector Model and ProQuest

Thomas Krichel, 2011-11-01

Page 2: LIS6 18 lecture  6 Vector Model and  ProQuest

advantages of Boolean model

• supposedly easy to grasp by the user
• precise semantics of queries
• implemented in the majority of commercial systems

Page 3: LIS6 18 lecture  6 Vector Model and  ProQuest

problems of Boolean model

• sharp distinction between relevant and irrelevant documents
• no ranking possible
• users find it difficult to formulate Boolean queries
• users find it difficult to resolve Boolean queries

Page 4: LIS6 18 lecture  6 Vector Model and  ProQuest

vector model

• associates weights with each index term appearing in the query and in each database document.

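• relevance can be calculated as the cosine of the angle between the two vectors, i.e. their dot product divided by the product of their lengths, where the length of a vector is the square root of the sum of its squared weights. Because the weights are non-negative, this measure varies between 0 and 1.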
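The cosine measure can be sketched in a few lines of Python. This is only an illustration: the function name `cosine` and the two weight vectors are made up for the example, and treating a zero vector as matching nothing is a convention chosen here, not something the slides specify.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights for three index terms
query = [1.0, 0.0, 1.0]
document = [0.5, 0.2, 0.0]
score = cosine(query, document)
```

With non-negative weights the score stays between 0 (no shared terms) and 1 (same direction).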

Page 5: LIS6 18 lecture  6 Vector Model and  ProQuest

tf/idf

• stands for term frequency / inverse document frequency

• This refers to a technique that gives a term a high rank in a document if
– the term appears frequently in the document
– the term does not appear frequently in other documents
• We will look at each component one at a time.

Page 6: LIS6 18 lecture  6 Vector Model and  ProQuest

absolute & maximum term frequency

• Let F_t_d be the number of times term t appears in the document d. This is its absolute term frequency in the document.

• Let m_d be the maximum absolute term frequency achieved by any term in document d. Examples
– Document 1: a b a a b c c d
m_1 = 3, because "a" appears 3 times
– Document 2: a b a f f f e d f a a
m_2 = 4, because "a" and "f" each appear 4 times

Page 7: LIS6 18 lecture  6 Vector Model and  ProQuest

relative document term frequency

• The relative term frequency f_t_d is given by
f_t_d = F_t_d / m_d
that is, the absolute term frequency of term t in document d divided by the maximum absolute term frequency of document d.

• This completes the "term frequency" part of the tf/idf formula.

• Let us look at this part through an example.

Page 8: LIS6 18 lecture  6 Vector Model and  ProQuest

main example, part I

• Consider three documents
– 1: a b c a f o n l p o f t y x
– 2: a m o e e e n n n a n p l
– 3: r a e e f n l i f f f f x l

• First, look at the maximum frequency achieved by any term in a given document.
m_1 = 2 ("a", "f" and "o" are there twice)
m_2 = 4 ("n" is there four times)
m_3 = 5 ("f" is there five times)

Page 9: LIS6 18 lecture  6 Vector Model and  ProQuest

main example part II

• Now look at some examples of absolute term frequency
F_a_1 = 2
F_e_2 = 3
F_x_3 = 1

• and some examples of relative term frequency
f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
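The relative term frequencies of the main example can be checked with a short Python sketch. The names `docs` and `relative_tf` are invented for this illustration, and splitting each document on spaces is an assumption about how terms are separated.

```python
from collections import Counter

# The three documents of the main example, one term per token
docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def relative_tf(term, doc_terms):
    """f_t_d = F_t_d / m_d for one term in one document."""
    counts = Counter(doc_terms)      # absolute term frequencies F_t_d
    m_d = max(counts.values())       # maximum absolute term frequency
    return counts[term] / m_d

f_a_1 = relative_tf("a", docs[1])
f_e_2 = relative_tf("e", docs[2])
f_x_3 = relative_tf("x", docs[3])
```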

Page 10: LIS6 18 lecture  6 Vector Model and  ProQuest

inverse document frequency

• Let N be the number of documents in the database. N = 3 in our example.

• Let n_t be the number of documents where the term t appears. In our example
n_a = 3
n_e = 2
n_x = 2

• N/n_t is an indication of the inverse document frequency of a term. It is larger the fewer documents in the database contain the term.
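As a rough check, the document counts n_t and the ratio N/n_t for the main example can be computed as follows. The helper name `doc_frequency` is made up for this sketch, and the ratio is only computed for terms that occur in at least one document.

```python
# The three documents of the main example, reduced to their sets of terms
docs = {
    1: set("a b c a f o n l p o f t y x".split()),
    2: set("a m o e e e n n n a n p l".split()),
    3: set("r a e e f n l i f f f f x l".split()),
}

def doc_frequency(term):
    """n_t: the number of documents in which the term appears."""
    return sum(1 for terms in docs.values() if term in terms)

N = len(docs)
ratios = {t: N / doc_frequency(t) for t in ("a", "e", "x")}
```

A term such as "a" that appears in every document gets the smallest possible ratio, 1.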

Page 11: LIS6 18 lecture  6 Vector Model and  ProQuest

intermezzo: the logarithm

• The logarithm, written log(), is a mathematical function. You should know that
– log() is an increasing function, i.e. the bigger x is, the bigger log(x) is.
– log(1) = 0
– log(x) > 0 if x > 1

• Your calculator will tell you what the logarithm of a number is.

Page 12: LIS6 18 lecture  6 Vector Model and  ProQuest

tf/idf formula

• Term frequency and inverse document frequency have to be combined.

• The final formula for the weight combines the terms as follows

w_t_d = f_t_d * log( N / n_t )

Page 13: LIS6 18 lecture  6 Vector Model and  ProQuest

main example part III

N = 3
w_a_1 = 1 * log(3/3) = log(1) = 0 !
w_e_2 = 0.75 * log(3/2) = 0.132, approximately
w_x_3 = 0.2 * log(3/2) = 0.035, approximately

where log(3/2) = 0.176, approximately (using the base-10 logarithm)
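The three weights can be reproduced with a small Python sketch of the formula w_t_d = f_t_d * log(N / n_t). The function name `tfidf` is invented here, the base-10 logarithm is used because it matches log(3/2) = 0.176, and returning 0 for a term that appears nowhere is an assumption made to avoid dividing by zero.

```python
import math
from collections import Counter

docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def tfidf(term, doc_id):
    """w_t_d = f_t_d * log10(N / n_t)."""
    counts = Counter(docs[doc_id])
    f_td = counts[term] / max(counts.values())            # relative term frequency
    n_t = sum(1 for terms in docs.values() if term in terms)
    if n_t == 0:
        return 0.0                                        # assumed: unknown term gets weight 0
    return f_td * math.log10(len(docs) / n_t)

w_a_1 = tfidf("a", 1)
w_e_2 = tfidf("e", 2)
w_x_3 = tfidf("x", 3)
```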

Page 14: LIS6 18 lecture  6 Vector Model and  ProQuest

practical operation

• The computer will search the documents for the query term and return the documents where the weight of the term in the index for that document is strictly positive, in order of weight, highest to lowest.

• If there are several query terms the computer will perform a more complicated operation that we will not further study here, so we limit ourselves to the case of one query term.
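The one-term case can be sketched in Python as follows. The function names `weight` and `search` are made up for this illustration, and the query term "o" is chosen so that the practical tests on the next slide remain open.

```python
import math
from collections import Counter

docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def weight(term, doc_id):
    """tf/idf weight of a term in a document, using the base-10 logarithm."""
    counts = Counter(docs[doc_id])
    n_t = sum(1 for terms in docs.values() if term in terms)
    if n_t == 0:
        return 0.0
    return counts[term] / max(counts.values()) * math.log10(len(docs) / n_t)

def search(term):
    """Documents with strictly positive weight, highest weight first."""
    scored = [(doc_id, weight(term, doc_id)) for doc_id in docs]
    hits = [(doc_id, w) for doc_id, w in scored if w > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

results = search("o")   # document 1 ranks ahead of document 2
```

A term that occurs in every document, such as "n", has weight 0 everywhere, so `search("n")` returns nothing.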

Page 15: LIS6 18 lecture  6 Vector Model and  ProQuest

practical tests

• You ask the computer to query the term "a" in our example. What documents are being returned? – Compare with the result of the Boolean model.

• You ask the computer to query the term "e". What documents are being returned, and in what order?

Page 16: LIS6 18 lecture  6 Vector Model and  ProQuest

advantages of vector model

• term weighting improves performance
• sorting is possible
• easy to compute, therefore fast
• results are difficult to improve without
– query expansion
– user feedback circle

Page 17: LIS6 18 lecture  6 Vector Model and  ProQuest

ProQuest search targets

• ProQuest searches “citations” and “documents”.

• “citations” are descriptions of documents such as author names, titles, journal etc.

• “documents” contain the full-text of documents.

• Target differences imply different behavior of an expression when matched against a candidate.

Page 18: LIS6 18 lecture  6 Vector Model and  ProQuest

ProQuest search

• If you enter two search terms, they will be used as one phrase.

• If you use three terms, ProQuest searches for them appearing in proximity to each other.

• You can force phrase interpretation by placing the search expressions into double quotes.

Page 19: LIS6 18 lecture  6 Vector Model and  ProQuest

terms

• A search term is something you type and that has a meaning on its own.

• For example: house, or krichel.
• Terms have a regular expression interpretation.

Page 20: LIS6 18 lecture  6 Vector Model and  ProQuest

regular expressions

• ‘*’ is used as a right-handed truncation character only; it will find all forms of a word. For example, searching for “econom*”.

• ‘?’ is used to replace any single character, either inside the word or at the right end of the word. For example, searching for “wom?n”.

• ‘?’ cannot be used to begin a word.
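To see what the two wildcard characters mean, they can be translated into an ordinary regular expression. The Python sketch below only illustrates the rules on this slide; it is not how ProQuest implements them. The function names are invented, and the assumptions that a wildcard stands for word characters and that matching ignores case are choices made for the example.

```python
import re

def wildcard_to_regex(term):
    """Translate a search term with '*' and '?' into a compiled regex."""
    if term.startswith("?") or term.startswith("*"):
        raise ValueError("a wildcard cannot begin a word")
    if "*" in term[:-1]:
        raise ValueError("'*' is only allowed at the right end")
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")        # truncation: any run of word characters
        elif ch == "?":
            parts.append(r"\w")         # exactly one word character (assumed)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word fits the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

hits = [w for w in ["economy", "economics", "ecology"] if matches("econom*", w)]
```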

Page 21: LIS6 18 lecture  6 Vector Model and  ProQuest

operators: and

• AND finds documents that contain all of the words.
• When searching for keywords in "Citation and Document Text," AND finds documents in which the words occur in the same paragraph (within approx. 1000 characters) or the words appear in any citation field.

Page 22: LIS6 18 lecture  6 Vector Model and  ProQuest

operator: and not, or

• “and not” is the same as “not” in Dialog.
• “or” is a normal Boolean or.

Page 23: LIS6 18 lecture  6 Vector Model and  ProQuest

proximity operators

• W/number Find documents where these words are within the given number of words of each other (either before or after). Use when searching for keywords within "Citation and Document Text" or "Document Text."
Example: computer W/3 careers

• NOT W/number does the opposite.
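The idea behind W/number can be sketched as a distance check on word positions. This Python fragment is only an illustration with invented names; how ProQuest splits text into words and exactly how it counts the distance are assumptions here.

```python
import re

def tokens(text):
    """Lower-cased words of a text, in order."""
    return re.findall(r"\w+", text.lower())

def within(text, first, second, n):
    """True if the two words occur at most n word positions apart, in either order."""
    words = tokens(text)
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= n for i in pos_first for j in pos_second)

# Mimics: computer W/3 careers
close = within("Careers in computer science", "computer", "careers", 3)
far = within("Computer labs are open late for careers advice", "computer", "careers", 3)
```

NOT W/number would then correspond to `not within(...)`.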

Page 24: LIS6 18 lecture  6 Vector Model and  ProQuest

proximity operators

• W/PARA Finds documents where these words are within the same paragraph (within approx. 1000 characters). Use when searching for keywords within "Document Text."
Example: internet W/PARA web

Page 25: LIS6 18 lecture  6 Vector Model and  ProQuest

proximity operators

• W/DOC Find documents where all the words appear within the document text. Use W/DOC in place of AND when searching for keywords within "Citation and Document Text" or "Document Text" to retrieve more comprehensive results.
Example: Internet W/DOC education

Page 26: LIS6 18 lecture  6 Vector Model and  ProQuest

proximity operators

• PRE/number Find documents where the first word appears within the given number of words before the second word.

• Use when searching for keywords within "Citation and Document Text" or "Document Text."
Example: world pre/3 web
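PRE/number differs from W/number only in that the order of the words matters. A minimal Python sketch, with an invented function name and the same caveat that ProQuest's exact word-counting rules are assumed rather than known:

```python
import re

def precedes(text, first, second, n):
    """True if `first` occurs before `second`, at most n word positions earlier."""
    words = re.findall(r"\w+", text.lower())
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [j for j, w in enumerate(words) if w == second.lower()]
    return any(0 < j - i <= n for i in pos_first for j in pos_second)

# Mimics: world pre/3 web
ordered = precedes("the world wide web", "world", "web", 3)
reversed_order = precedes("a web that spans the world", "world", "web", 3)
```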

Page 27: LIS6 18 lecture  6 Vector Model and  ProQuest

field syntax

• It is possible to limit a search for a term to a field.

• This is done by writing field(term)

Page 28: LIS6 18 lecture  6 Vector Model and  ProQuest

abstract

• ABS() searches article abstracts for your terms.
• Examples:
ABS(customer delight)
ABS(ozone)

Page 29: LIS6 18 lecture  6 Vector Model and  ProQuest

appendix

• APX() searches the appendix of a document. The appendix usually comes at the end of the document, identified by a header.
• Use Keywords to search this field.
• Example: APX(Michigan)

Page 30: LIS6 18 lecture  6 Vector Model and  ProQuest

author

• AU() is used to find articles written by an author or reviewer.

• Example: AU(Thomas Krichel)

Page 31: LIS6 18 lecture  6 Vector Model and  ProQuest

Classification code (ABI)

• Use Classification Codes when searching business topics. Classification Codes are a fast way to precisely target a search by topic, industry or market, geographical area, or article type.

• Examples: CC(1120) for Economic Policy & Planning

• This only applies to the subset of data that comes from ABI/INFORM, which has these codes.

Page 32: LIS6 18 lecture  6 Vector Model and  ProQuest

Coden

• This is used to search the coden index. A coden is an alphanumeric code used for shelving/ordering books and journals in libraries, often based on a publication’s title.

• Example: CODEN(EDUSBI)

Page 33: LIS6 18 lecture  6 Vector Model and  ProQuest

Column / Document Column Head

• The title of a column in a periodical or newspaper, such as “The Week in Review”. This search field finds all articles where the search words are in the column head.

• Examples: COL(futures) COL("The Week In Review")

Page 34: LIS6 18 lecture  6 Vector Model and  ProQuest

company / organization

• CO() searches for an organization featured prominently in an article, such as
– associations and cooperatives
– companies and their divisions
– governmental organizations and political parties
– sports teams, music bands and churches
– Native American tribes

• Comes with LCO({}) option for full matches.

Page 35: LIS6 18 lecture  6 Vector Model and  ProQuest

publication date

• PDN() searches the publication date in numeric format (mm/dd/yyyy).

• You can use the < and > signs to indicate dates before and after a date, or between specific dates.

• For example, PDN(>1/1/2002) AND PDN(<1/5/2002) will find results from publications with numeric dates between January 1 2002 and January 5 2002.
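The date range logic can be illustrated with Python's standard `datetime` module. The function names are invented for this sketch, and reading < and > as strict inequalities (so that the two boundary dates themselves are excluded) is an assumption; the slide does not say how ProQuest treats the end points.

```python
from datetime import date, datetime

def parse_pdn(text):
    """Parse a numeric date in mm/dd/yyyy form, e.g. '1/5/2002'."""
    return datetime.strptime(text, "%m/%d/%Y").date()

def in_range(pub_date, after, before):
    """Mimics PDN(>after) AND PDN(<before), with strict bounds (assumed)."""
    return parse_pdn(after) < pub_date < parse_pdn(before)

inside = in_range(date(2002, 1, 3), "1/1/2002", "1/5/2002")
outside = in_range(date(2002, 1, 5), "1/1/2002", "1/5/2002")
```

Note that the month comes first: 1/5/2002 is January 5, not May 1.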

Page 36: LIS6 18 lecture  6 Vector Model and  ProQuest

dateline

• DLN() searches article Datelines. The dateline occurs frequently in newspapers, just after the article title, giving the date and place of the article's origin. You can use Boolean, proximity and truncation operators.

• Example: DLN(lebanon pre/1 ohio)

Page 37: LIS6 18 lecture  6 Vector Model and  ProQuest

document features

• SF() is used to search document features, such as an index or auxiliary materials, that may be included in or accompany a document.

• The document features indexed are:
– Graphs and Illustrations
– Maps
– References
– Tables

Page 38: LIS6 18 lecture  6 Vector Model and  ProQuest

search by proquest handle

• ID() Searches the unique database ID for articles and documents in ProQuest.

• Examples: ID(356894)

Page 39: LIS6 18 lecture  6 Vector Model and  ProQuest

document language

• LA() is used to search the Language index. This field contains the language in which the document was published originally.

• Examples: LA(french) LN(french or english)

Page 40: LIS6 18 lecture  6 Vector Model and  ProQuest

document text

• Searches only the full text of articles for your search terms. Article abstracts are not included in this search. AND, OR, and other search operators are treated as such unless enclosed in quotes.

• Examples: TEXT(Kofi Annan) TEXT("North Sea oil")

Page 41: LIS6 18 lecture  6 Vector Model and  ProQuest

title searches

• TI() searches the title of a document, such as “Seigniorage, Taxation and Myopia in EMU”

Page 42: LIS6 18 lecture  6 Vector Model and  ProQuest

document type

• DT() is used to look for search words or phrases in documents of a certain type.

• Examples:
DT(commentary)
DT(editorial cartoon)
DT(review)
DT(arts/exhibits review)
DT(television review-no opinion)

Page 43: LIS6 18 lecture  6 Vector Model and  ProQuest

company number

• DUNS() searches the Dun & Bradstreet trading partner identification number. These numbers provide a universal system for computer identification of companies.

• Examples: DUNS(00 695 7856) DUN(03 575 3920)

Page 44: LIS6 18 lecture  6 Vector Model and  ProQuest

footnote

• FOOT() searches the article footnotes for your terms.

• Examples: FOOT(326 U.S. 465)

Page 45: LIS6 18 lecture  6 Vector Model and  ProQuest

volume

• Volume() searches the volume.
• Examples: VO(100)

Page 46: LIS6 18 lecture  6 Vector Model and  ProQuest

word count

• WC() restricts the number of words in the article text. Use this search field to locate articles under (<) or over (>) a certain length.

• Examples:
– WC(<1000)
– WC(>500)
– WC(>750 AND <1000)

Page 47: LIS6 18 lecture  6 Vector Model and  ProQuest

year

• YR() searches the publication year.
• Examples:
YR(1986)
YR(1986-1987)
YR(>1998)
YR(<1998)

Page 48: LIS6 18 lecture  6 Vector Model and  ProQuest

location

• GEO() is used to look for articles in which a geographical area or location figures prominently in the text.

• Examples: GEO(Midwest) GN(UK) GEO(New South Wales) GN(Black Forest)

• Comes with LGEO({})

Page 49: LIS6 18 lecture  6 Vector Model and  ProQuest

headnote

• HEAD() looks for words that occur in the headnotes of an article. Headnotes are short introductions, explanations, or comments at the beginning of an article. They are different from abstracts in that they do not attempt to summarize the content of the article.

• Examples: HEAD(escalator accidents) HDN(digital tv) HEAD(Global Economy)

Page 50: LIS6 18 lecture  6 Vector Model and  ProQuest

caption texts

• CAP() looks for occurrences of search words in the caption text accompanying article illustrations, graphs, and photographs.

• Examples: CAP(Chart)

Page 51: LIS6 18 lecture  6 Vector Model and  ProQuest

(additional) index

• INDEX() locates all occurrences of search words in any searchable index field. It does not find occurrences in the text of the articles.

• Examples: INDEX(starcore)

Page 52: LIS6 18 lecture  6 Vector Model and  ProQuest

ISSN

• ISSN() looks for the eight-digit International Standard Serial Number (ISSN), where available. Hyphens are optional.

• Examples: ISSN(0011-4664) SN(00916358)
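Since hyphens are optional, the two ways of writing an ISSN can be brought to one form before comparing them. The Python sketch below is only an illustration with an invented function name; it checks the shape of the number (eight characters, the last of which may be the check character X) but does not verify the check digit.

```python
def normalize_issn(text):
    """Return an ISSN in the hyphenated form NNNN-NNNN, whether or not a hyphen was typed."""
    compact = text.replace("-", "").strip().upper()
    last_ok = len(compact) == 8 and (compact[7].isdigit() or compact[7] == "X")
    if not last_ok or not compact[:7].isdigit():
        raise ValueError("an ISSN has eight characters: seven digits plus a check character")
    return compact[:4] + "-" + compact[4:]

same = normalize_issn("00916358") == normalize_issn("0091-6358")
```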

Page 53: LIS6 18 lecture  6 Vector Model and  ProQuest

issue()

• ISSUE() is used to search the issue number.
• Valid forms: ISSUE, IS
• Examples: IS(10)

Page 54: LIS6 18 lecture  6 Vector Model and  ProQuest

NAICS / SIC

• NAICS() or SIC() searches for industry codes. The NAICS/SIC code defines the economic activity of a business as defined by the US Census Bureau.

• Examples: SIC(4911) SIC(514210)

Page 55: LIS6 18 lecture  6 Vector Model and  ProQuest

start page

• PAGE() is used for specific pages of a publication. Useful for finding front page articles.

• Example: PAG(A.1) AND PUB(wall street journal) AND PDN(1/10/2003)

Page 56: LIS6 18 lecture  6 Vector Model and  ProQuest

person

• NAME() finds articles about a person. When the Personal Name field is displayed in an article citation, the life spans of historical figures follow their names.

• You can enter the name in any format. Searching for NA(John A Smith) will return the same results as NA(Smith, John A).

Page 57: LIS6 18 lecture  6 Vector Model and  ProQuest

product name

• PROD() finds articles about a specific product.
• Examples:
PROD(TiVo)
PR(harley-davidson)

Page 58: LIS6 18 lecture  6 Vector Model and  ProQuest

journal

• JN() is used to search by a specific publication or publications.

• Examples:
JN(Forbes)
JN(New York Times or Washington Post)
JN(computing), which retrieves all periodicals with "computing" in their titles

Page 59: LIS6 18 lecture  6 Vector Model and  ProQuest

section

• SECTION() finds articles that appear in a specific section of a publication. Use the SOURCE search field to specify a publication. You must specify the section name exactly as it appears in the publication.

• Examples:
SOURCE(New York Times) AND SECTION(editorial) AND AU(Gore Vidal)
SEC(sports) AND NA(Florence Griffith Joyner)

Page 60: LIS6 18 lecture  6 Vector Model and  ProQuest

source type

• STYPE() is used to include or exclude the following source types from your search: dissertations, newspapers, periodicals and wire feeds.

• Examples:
NA(Winston Churchill) AND STYPE(periodical)
GEO(Japan) AND STYPE(wire feed)

Page 61: LIS6 18 lecture  6 Vector Model and  ProQuest

subject terms

• SU() is used to look for articles about a specific subject. When searching Hoover's, this contains information on company type.

• Examples: SU(Music) SU(venture capital companies) SU(Health Care) SU(nonprofit)

• Comes with LSU({}) facility

Page 62: LIS6 18 lecture  6 Vector Model and  ProQuest

combined search

• When you select “Citations and abstracts” from the drop-down menu, ProQuest searches the following fields: AU(), NAME(), ABS(), PN(), TI(), SU(), CO(), SO(), GEO()

Page 63: LIS6 18 lecture  6 Vector Model and  ProQuest

http://openlib.org/home/krichel

Please shut down the computers when you are done.

Thank you for your attention!