Automatically Building a Stopword List for an Information Retrieval System


University of Glasgow

Rachel Tsz-Wai Lo, Ben He, Iadh Ounis

Outline

- Stopwords
- Investigation of two approaches
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion

What is a Stopword?

- Common words in a document, e.g. the, is, and, am, to, it
- Carries no information about the documents
- Low discrimination value in terms of IR: meaningless, no contribution
- Searching with stopwords will usually result in retrieving irrelevant documents

Objective

- Different collections contain different contents and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list

Objective (cont.)

1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF
   - Normalised TF
   - IDF
   - Normalised IDF
2. New proposed approach: based on how informative a term is

Fox's Classical Stopword List and Its Weaknesses

- Contains 733 stopwords
- More than 20 years old, hence outdated; lacks potentially new words
- Defined for general purpose, but different collections require different stopword lists

Zipf's Law

- Rank the terms according to their term frequencies: the term with the highest TF has rank = 1, the next highest has rank = 2, etc.
- Zipf's Law relates a term's frequency to its rank:

  F(r) = C / r

  where F(r) is the frequency of the term at rank r and C is a constant for the collection.

[Plot of term frequency vs. rank, illustrating Zipf's Law]

Baseline Approach Algorithm

- Generate a list of term frequencies from the corpus
- Sort the frequencies in descending order
- Rank the terms according to their frequencies: the highest frequency gets rank = 1, the next highest rank = 2, etc.
- Draw a graph of frequency vs. rank

Baseline Approach Algorithm (cont.)

- Choose a threshold; any term whose frequency lies above the threshold is treated as a stopword
- Run the queries with the resulting stopword list; all stopwords in the queries are removed
- Evaluate the system with Average Precision
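A minimal sketch of this baseline in Python (the corpus format, tokenisation, and the threshold value here are illustrative assumptions, not the exact setup used in the experiments):

```python
from collections import Counter

def zipf_stopwords(documents, threshold):
    """Rank terms by collection frequency (Zipf-style) and return those
    whose frequency exceeds the chosen threshold as stopwords."""
    # documents: iterable of already-tokenised documents (lists of terms)
    freq = Counter()
    for doc in documents:
        freq.update(doc)
    # most_common() sorts by descending frequency; rank 1 = most frequent term
    ranked = freq.most_common()
    return [term for term, f in ranked if f > threshold]

# Toy example (real runs use a full TREC collection)
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "and", "the", "cat"]]
print(zipf_stopwords(docs, threshold=2))   # ['the']
```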

Baseline Approach - Variants

- Term Frequency: TF
- Normalised Term Frequency: TF_Norm = log(TF / v)
- Inverse Document Frequency: idf_k = log(NDoc / D_k)
- Normalised IDF: idf_Norm,k = log((NDoc - D_k + 0.5) / (D_k + 0.5))

where NDoc is the number of documents in the collection and D_k is the number of documents containing term k.
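For concreteness, a small sketch of the two IDF-based variants (the document-frequency counting and the 0.5 smoothing follow the formulas above; the tokenised-corpus input format is an assumption):

```python
import math
from collections import Counter

def idf_variants(documents):
    """Compute idf and normalised idf for every term in a tokenised corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count each term once per document
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    norm_idf = {t: math.log((n_docs - df + 0.5) / (df + 0.5))
                for t, df in doc_freq.items()}
    return idf, norm_idf

# Terms with the *lowest* idf (i.e. appearing in almost every document)
# are the stopword candidates under these two variants.
```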

Baseline Approach – Choosing Threshold

- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the difference in frequency between two consecutive ranks, F(r+1) - F(r)
- A big difference (i.e. a sudden jump) marks a candidate cut-off point, so it is important to choose an appropriate threshold
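A sketch of one way to pick the cut-off, assuming the "sudden jump" heuristic means cutting where the gap between consecutive ranked frequencies is largest (the exact rule is not spelled out on the slide):

```python
def choose_threshold(sorted_freqs):
    """Given frequencies sorted in descending order, return the frequency just
    before the largest drop between two consecutive ranks."""
    gaps = [sorted_freqs[r] - sorted_freqs[r + 1]
            for r in range(len(sorted_freqs) - 1)]
    cut_rank = max(range(len(gaps)), key=lambda r: gaps[r])
    return sorted_freqs[cut_rank]   # terms with frequency >= this are stopwords
```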

Term-Based Random Sampling Approach (TBRSA)

- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion

Kullback-Leibler Divergence Measure

- Used to measure the distance between two distributions
- In our case, the distribution of a term in a sampled document set (retrieved by a randomly chosen term) versus its distribution in the whole collection
- The weight of a term t in the sampled document set is given by:

  w(t) = P_x * log2(P_x / P_c)

  where P_x = tf_x / l_x and P_c = F / token_c
  (tf_x is the frequency of t in the sampled document set, l_x the number of tokens in the sample, F the frequency of t in the whole collection, and token_c the number of tokens in the collection)
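A minimal sketch of this weight computation, with the sample and collection statistics passed in as plain counts (the argument names are illustrative):

```python
import math

def kl_weight(tf_x, sample_len, coll_freq, coll_tokens):
    """Kullback-Leibler weight of a term in the sampled document set.

    tf_x        : frequency of the term in the sampled documents
    sample_len  : total number of tokens in the sampled documents
    coll_freq   : frequency of the term in the whole collection
    coll_tokens : total number of tokens in the collection
    """
    p_x = tf_x / sample_len
    p_c = coll_freq / coll_tokens
    return p_x * math.log2(p_x / p_c)
```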

TBRSA Algorithm

Repeat Y times:
- Choose a random term and retrieve the documents containing it
- Weight every term in the sampled document set with the KL divergence measure
- Normalise the weights by the maximum weight
- Rank the terms by normalised weight in ascending order and keep the top X ranked (least weighted, i.e. least informative) terms

TBRSA Algorithm (cont.)

- Merge the top X ranked terms from all Y repetitions
- Sort the merged terms by their normalised weights
- Extract the top L ranked terms as the stopword list
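A sketch of the whole sampling loop under the reading above; the inverted-index structure, the parameter values Y, X and L, and the choice to keep the smallest weight seen for a term when merging are all illustrative assumptions:

```python
import math
import random
from collections import Counter

def tbrsa_stopwords(index, lexicon, coll_freq, coll_tokens, Y=200, X=200, L=400):
    """Term-based random sampling: repeatedly sample documents via a random
    term, weight terms by KL divergence, and keep the least informative ones.

    index       : dict term -> list of documents (each a list of tokens)
    lexicon     : list of all terms in the collection
    coll_freq   : dict term -> frequency in the whole collection
    coll_tokens : total number of tokens in the collection
    """
    pool = {}
    for _ in range(Y):
        seed = random.choice(lexicon)
        sample = [tok for doc in index.get(seed, []) for tok in doc]
        if not sample:
            continue                      # random seed term retrieved nothing
        tf = Counter(sample)
        weights = {t: (tf[t] / len(sample)) *
                      math.log2((tf[t] / len(sample)) / (coll_freq[t] / coll_tokens))
                   for t in tf}
        max_w = max(weights.values())
        if max_w <= 0:
            continue
        norm = {t: w / max_w for t, w in weights.items()}
        # Least informative terms = smallest normalised weights
        for t, w in sorted(norm.items(), key=lambda x: x[1])[:X]:
            pool[t] = min(w, pool.get(t, float("inf")))
    # Merge, sort ascending, take the top L as the stopword list
    return [t for t, _ in sorted(pool.items(), key=lambda x: x[1])[:L]]
```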

Advantages / Disadvantages

Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress

Disadvantages:
- The first term is generated randomly, so a repetition could retrieve a small data set
- The experiment has to be repeated Y times

Experimental Setup

- Four TREC collections: http://trec.nist.gov/data/docs_eng.html
- Each collection is indexed and stemmed with no pre-defined stopwords removed, i.e. no assumption of stopwords at the start
- Long queries were used (Title, Description and Narrative) to maximise our chances of using the new stopword lists

Experimental Platform

- Terrier (TERabyte RetrIEveR), developed by the IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework, which derives parameter-free probabilistic models
- PL2 model used
- http://ir.dcs.gla.ac.uk/terrier/

PL2 Model

- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:

  score(d, Q) = Σ_{t ∈ Q} w(t, d)

  where

  w(t, d) = (1 / (tfn + 1)) * ( tfn * log2(tfn / λ) + (λ + 1/(12 * tfn) - tfn) * log2(e) + 0.5 * log2(2π * tfn) )

  λ = F / NDoc is the mean of the assumed Poisson distribution (the term's frequency in the collection divided by the number of documents), and tfn is the normalised term frequency (Normalisation 2):

  tfn = tf * log2(1 + c * avg_l / l),   (c > 0)

  where l is the document length and avg_l is the average document length in the collection.
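A direct transcription of the PL2 scoring above into Python; the term statistics are passed in explicitly, and the query-term weighting details of a full Terrier run are omitted:

```python
import math

def pl2_term_score(tf, doc_len, avg_doc_len, coll_freq, n_docs, c=2.0):
    """PL2 weight of a single query term in a document (DFR framework)."""
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)    # Normalisation 2
    lam = coll_freq / n_docs                                  # Poisson mean
    log2e = math.log2(math.e)
    return (1.0 / (tfn + 1.0)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1.0 / (12.0 * tfn) - tfn) * log2e
        + 0.5 * math.log2(2.0 * math.pi * tfn)
    )

def pl2_score(query_terms, doc_tf, doc_len, avg_doc_len, coll_freq, n_docs, c=2.0):
    """score(d, Q): sum of PL2 weights over the query terms present in the document."""
    return sum(pl2_term_score(doc_tf[t], doc_len, avg_doc_len,
                              coll_freq[t], n_docs, c)
               for t in query_terms if doc_tf.get(t, 0) > 0)
```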

Collections

disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs  | # Tokens | c value
disk45     | 2GB  | 528155  | 801397   | 2.13
WT2G       | 2GB  | 247491  | 1020277  | 2.75
WT10G      | 10GB | 1692096 | 3206346  | 2.43
DOTGOV     | 18GB | 1247753 | 2821821  | 2.00

Queries

Collection | Query Sets                   | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks | 100
WT2G       | TREC8                        | 50
WT10G      | TREC10                       | 50
DOTGOV     | TREC11 and TREC12 merged     | 100

Merging Stopword Lists

- Merge the classical list with the best list generated by the baseline and the novel approach respectively
- Add the two lists together, removing duplicates
- The merged list might be stronger in terms of effectiveness
- Follows from the classical IR technique of combining evidence
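The merge itself is just a duplicate-free union, e.g.:

```python
def merge_stopword_lists(classical, generated):
    """Union of two stopword lists with duplicates removed."""
    return sorted(set(classical) | set(generated))
```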

Results and Analysis

- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results obtained against Fox's classical stopword list, based on average precision

Baseline Approach – Overall Results

* indicates significant difference at the 0.05 level
Normalised IDF is the best-performing variant for every collection

Collection | Classical | TF     | Norm TF | IDF    | Norm IDF | p-value
disk45     | 0.2123    | 0.2130 | 0.2123  | 0.2113 | 0.2130   | 0.8845
WT2G       | 0.2569    | 0.2650 | 0.2676  | 0.2682 | 0.2700   | 0.001508*
WT10G      | 0.2000    | 0.2049 | 0.2076  | 0.2079 | 0.2079   | 0.1231
DOTGOV     | 0.1223    | 0.1212 | 0.1208  | 0.1227 | 0.1227   | 0.55255

Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document |           |
html      | year     |           |

TBRSA – Overall Results

* indicates significant difference at the 0.05 level
disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     | 0.2123    | 0.2129        | 0.868
WT2G       | 0.2569    | 0.2668        | 0.07544
WT10G      | 0.2000    | 0.1900        | 0.4493
DOTGOV     | 0.1223    | 0.1180        | 0.002555*

TBRSA – Additional Terms Produced

disk45    | WT2G        | WT10G     | DOTGOV
column    | advance     | copyright | server
general   | beach       | friend    | modify
califonia | company     | memory    | length
industry  | environment | mountain  | content
month     | garden      | problem   | accept
director  | industry    | science   | inform
desk      | material    | special   | connect
economic  | pollution   | internet  | gov
business  | school      | document  |
byte      |             |           |

Refinement - Merging

- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective despite its old age, so it is worth using
- The queries were quite "conservative"

Merging – Baseline Approach

* indicates significant difference at the 0.05 level
Produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     | 0.2123    | 0.2130   | 0.2130 | 0.8845
WT2G       | 0.2569    | 0.2700   | 0.2712 | 0.00746*
WT10G      | 0.2000    | 0.2079   | 0.2109 | 0.03854*
DOTGOV     | 0.1223    | 0.1227   | 0.1241 | 0.6775

Merging – TBRSA

* indicates significant difference at the 0.05 level
Produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     | 0.2123    | 0.2129        | 0.2129 | 0.868
WT2G       | 0.2569    | 0.2668        | 0.2703 | 0.008547*
WT10G      | 0.2000    | 0.1900        | 0.2066 | 0.4451
DOTGOV     | 0.1223    | 0.1180        | 0.1228 | 0.5085

Conclusion & Future Work

- Proposed a novel approach for automatically generating a stopword list, focusing on effectiveness and robustness
- Compared it to 4 baseline variants based on Zipf's Law
- Merging the classical stopword list with the best generated list produces a more effective stopword list

Conclusion & Future Work (cont.)

- Investigate other divergence metrics, e.g. a Poisson-based approach
- Handle verb vs. noun ambiguity: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: some occurrences of a term may have to be kept while others are removed

Thank you! Any questions?

Thank you for your attention