Web Search And Mining (Ntuim)


Page 1: Web Search And Mining (Ntuim)

05/2004 L. F. Chien

Opportunities and Challenges of Web Search and Mining

Lee-Feng Chien ( 簡立峰 )

Academia Sinica & National Taiwan University

Page 2: Web Search And Mining (Ntuim)

Outline

Web SE
  Inside SE
  Google’s Business Models
  Google’s Impacts
  Recent Development
  Next-Generation WSE

Web Mining

Page 3: Web Search And Mining (Ntuim)

WSE = Google

Globalization!

Page 4: Web Search And Mining (Ntuim)

WSE = Google

Page 5: Web Search And Mining (Ntuim)

Problems of WSE

Inside WSE: Fast, Coverage, Accuracy

Page 6: Web Search And Mining (Ntuim)

Problems of WSE

Inside WSE: Fast, Coverage, Accuracy
Business Models: Profitable, Competition

Page 7: Web Search And Mining (Ntuim)

Problems of WSE

Inside WSE: Fast, Coverage, Accuracy
Business Models: Profitable, Competition
Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization

Page 8: Web Search And Mining (Ntuim)

I. Some Must-Know Statistics

Page 9: Web Search And Mining (Ntuim)

Online Language Populations

Source: Global Reach (global-reach.biz/globstats)

Page 10: Web Search And Mining (Ntuim)

Top Ten Languages in the Web

TOP TEN LANGUAGES IN THE INTERNET

Language                Internet Users,   Average       World Population        Language as % of
                        by Language       Penetration   Estimate for Language   Total Internet Users
English                 287,369,520       26.2 %        1,098,654,265           35.9 %
Chinese                 105,484,112        8.0 %        1,321,669,200           13.2 %
Japanese                 66,548,060       52.1 %          127,853,600            8.3 %
German                   54,035,201       56.3 %           95,893,300            6.8 %
Spanish                  53,670,063       13.9 %          386,413,200            6.7 %
French                   35,034,269        9.3 %          375,164,185            4.4 %
Korean                   30,670,000       41.0 %           74,730,000            3.8 %
Italian                  28,610,000       49.3 %           57,987,100            3.6 %
Portuguese               23,058,254       10.3 %          224,664,100            2.9 %
Dutch                    13,657,170       56.6 %           24,125,950            1.7 %
TOP TEN LANGUAGES       698,353,773       18.4 %        3,787,154,900           87.3 %
Rest of the Languages   101,686,725        3.9 %        2,602,992,587           12.7 %
WORLD TOTAL             800,040,498       12.5 %        6,390,147,487          100.0 %

Source: Internet World Stats

More and more non-English users!

Page 11: Web Search And Mining (Ntuim)

Web Content

[Figure: Internet hosts in millions (log scale, 0.1 to 100.0) by language, estimated by domain: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish.]

Source: Network Wizards Jan 99 Internet Domain Survey

More and more non-English pages

Page 12: Web Search And Mining (Ntuim)

Web Users and Pages (5 years ago)

Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99

Challenge of Scalability !

Chinese Users: 110M

Including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others.

Source: Global Reach, 2004

Page 13: Web Search And Mining (Ntuim)

573,000,000 pages

Scalability Problem !

Number of Chinese Web Pages

Page 14: Web Search And Mining (Ntuim)

Number of Web Pages

The world’s largest search engine ?

4,285,199,774 pages (Google) 4.28 billion Web pages, 880 million images, and other documents

Billions of Textual Documents Indexed, as of Sept 2, 2003

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch

Page 15: Web Search And Mining (Ntuim)

The top 10 Internet trends 2004 predicted by eOneNet.com

1. World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia with more than 100 million Internet users.

2. Broadband Internet penetration will continue to grow with China and US in the lead with an expected growth rate exceeding 30% each.

3. Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion.

4. Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.

Page 16: Web Search And Mining (Ntuim)

The top 10 Internet trends 2004 predicted by eOneNet.com

5. Spam will increase at least 20% despite the new US anti-spam law. US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law.

6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers.

7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media contents.

Page 17: Web Search And Mining (Ntuim)

The top 10 Internet trends 2004 predicted by eOneNet.com

8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel.

9. Online entertainment will grow at a rapid pace, with more sites offering videos and digital music download services.

10. The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring.

Page 18: Web Search And Mining (Ntuim)

II. Inside WSE

Page 19: Web Search And Mining (Ntuim)

Components

Crawler/Spider Index Server Query Server Document Delivery

Page 20: Web Search And Mining (Ntuim)

Architecture

[Figure: search engine architecture with numbered components (1) to (5). Component labels: Web, Spider, Indexer, Archive, Index (replicated), SE (replicated), Browser. Annotations: 5B pages, 1B queries/day, quality results, log, spam, freshness, scalable.]

Page 21: Web Search And Mining (Ntuim)

Spider

Get all pages from the Web
Web traversal

Challenges
  Performance, e.g., #pages per PC
  Coverage
  Currency
  Spam filtering
  Hidden Web
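The "Web traversal" step can be illustrated with a breadth-first walk over a link graph. This is only a minimal sketch under stated assumptions: the `crawl` function, the `fetch_links` callback, and the `TOY_WEB` graph are hypothetical names invented here, and a real spider would also need politeness rules, deduplication of content, and re-crawl scheduling for currency.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of the link graph, starting from seed_urls.

    fetch_links(url) is a caller-supplied function (hypothetical here)
    that returns the URLs linked from the given page.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy in-memory "Web"; page "e" has no in-links from the seed's component,
# so link-following alone never discovers it (a coverage problem).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
    "e": ["a"],
}
pages = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

The `max_pages` cap stands in for the per-machine performance budget listed among the challenges.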

Page 22: Web Search And Mining (Ntuim)

Index Server

Index occurrences of all words in the pages
Data cleanness

Challenges
  Space overhead, #pages per PC
  Incremental
  Scalability & distributed processing
  Multiple languages

Page 23: Web Search And Mining (Ntuim)

System Anatomy

Page 24: Web Search And Mining (Ntuim)

Data Structure

Lexicon: fits in memory; two different forms
Hit list: accounts for most of the space; uses 2 bytes per hit to save space
Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
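The relationship between the forward index and the inverted index can be sketched as follows. This is an illustrative toy, not the barrel layout of any real engine: the function names, the whitespace tokenizer, and the choice of word position as the only "hit" information are assumptions made here.

```python
from collections import defaultdict

def build_forward_index(docs, lexicon):
    """Map docID -> list of (wordID, position) hits, in document order.

    lexicon is filled in as a word -> wordID dictionary.
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = []
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        forward[doc_id] = hits
    return forward

def invert(forward):
    """Re-sort the same hits by wordID; each doc list is sorted by docID."""
    postings = defaultdict(dict)
    for doc_id, hits in forward.items():
        for word_id, pos in hits:
            postings[word_id].setdefault(doc_id, []).append(pos)
    return {w: sorted(d.items()) for w, d in sorted(postings.items())}

docs = {2: "web search engine", 1: "web mining and web search"}
lexicon = {}
inverted = invert(build_forward_index(docs, lexicon))
web_postings = inverted[lexicon["web"]]
```

Both structures hold the same hits; only the sort key changes, which is why a query server can answer a word lookup with one read of the inverted index.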

Page 25: Web Search And Mining (Ntuim)

Query Server

Search relevant URLs for queries by looking up the indices

Challenges
  Speed, check #queries per second
  Functions supported
  Localization

Page 26: Web Search And Mining (Ntuim)

PageRank

Page 27: Web Search And Mining (Ntuim)

PageRank (Cont.)

Let $B_u$ be the set of pages that point to $u$, let $N_u$ be the number of links from $u$, and let $c$ be a factor used for normalization. Then a simplified version of PageRank is:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
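The simplified PageRank can be computed by repeated substitution: each page's rank is the sum, over the pages pointing to it, of that page's rank divided by its number of out-links, rescaled by a normalization factor. The sketch below assumes a small graph in which every page has at least one out-link and every link target is itself a key of the graph; the function name and the toy graph are illustrative only.

```python
def simplified_pagerank(links, iterations=50):
    """Iterate R(u) = c * sum of R(v) / N_v over the pages v that point to u.

    links maps each page to the list of pages it links to. The factor c
    is applied by rescaling the ranks to sum to 1 after every pass.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                # v passes an equal share of its rank to each page it links to
                new_rank[u] += rank[v] / len(outlinks)
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = simplified_pagerank(links)
```

On this graph the ranks settle near 0.4 for "a" and "c" and 0.2 for "b": page "b" receives only half of "a"'s rank, while "c" receives the other half plus all of "b"'s.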

Page 28: Web Search And Mining (Ntuim)

Search Functions

Phrase search, e.g. "petite galerie"
Truncation, e.g. librar*, wom*n
Constraining search, e.g. title:"The Wall Street Journal"
Proximity search, e.g. gold near silver
Boolean, e.g. +noir +film -"pinot noir"
Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
Limit search, e.g. limit by date range
Capitalization, e.g. turkey vs. Turkey
Ranking fields and refine search
LiveTopics
Translate Service
Other
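To make the Boolean and phrase examples concrete, here is a small sketch of how a query such as +noir +film -"pinot noir" could be evaluated against a single piece of text. It is a simplification written for illustration: the `matches` function is hypothetical, punctuation is not handled, and real engines evaluate such queries against the index rather than by scanning text.

```python
import shlex

def matches(query, text):
    """Return True if text satisfies every +required and -excluded term.

    Quoted phrases are treated as exact word sequences. Matching is
    case-insensitive and based on whitespace-separated words.
    """
    padded = " " + " ".join(text.lower().split()) + " "

    def contains(term):
        return " " + term.lower() + " " in padded

    # shlex.split turns -"pinot noir" into the single token "-pinot noir"
    for token in shlex.split(query):
        if token.startswith("+") and not contains(token[1:]):
            return False
        if token.startswith("-") and contains(token[1:]):
            return False
    return True

query = '+noir +film -"pinot noir"'
hit = matches(query, "classic film noir of the 1940s")
miss = matches(query, "a film about pinot noir wine")
```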

Page 29: Web Search And Mining (Ntuim)

Document Delivery Bottleneck of Bandwidth Presentation Caching

Queries, Search Results Aakman Model

Page 30: Web Search And Mining (Ntuim)

III. Business

Page 31: Web Search And Mining (Ntuim)

What is Google?

Specialized web search engine
Founded in 1998 by two graduate students at Stanford University (Larry Page and Sergey Brin)
Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free)

Google’s features: fast, unbiased, and accurate results; access to over 4 billion web pages and over 800 million images (most important: valid web pages)

Page 32: Web Search And Mining (Ntuim)

Company Facts

Employees: 1,300+

Languages spoken: 34

Worldwide Offices: 21

(Mostly in US & Europe)

Annual Revenues: $900m

Page 33: Web Search And Mining (Ntuim)

Google Revenue

Revenue (an e-business):

½ from selling relevant text-based ads (sponsored links near search results)

½ from licensing its search technology to companies like Yahoo

Source: Eric Schmidt Interview, PCWorld.com (January 30, 2002)

Page 34: Web Search And Mining (Ntuim)

Sources of Revenue

Adwords (150,000 advertisers)
“Sponsored links” ads with cost-per-click pricing: advertisers pay only when people click on the link
-- Advertising is extremely cheap and effective, e.g., Edmunds.com spent “$250,000 a month in advertising” because each $1 spent generated $1.70.

Google Search Appliance
An integrated hardware/software solution that extends the power of Google to corporate intranets and web servers
-- Customers include: Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc.

Page 35: Web Search And Mining (Ntuim)

Challenges (cont.)

Easy entry into the Search Engine Industry

Lack of customer lock-in (vs. Microsoft);

Google will focus on creating services to voluntarily draw in customers

Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon)

Customers are becoming competitors (Yahoo, AOL)

Page 36: Web Search And Mining (Ntuim)

Competitors: Ebay and Amazon

Ebay (www.ebay.com), e-commerce
Web-based marketplace in which a community of buyers and sellers are brought together to browse, buy and sell various items
-- Business revenue: charges on proceeds (fees)
   (5%) $0.01-$25
   (2.5%) $25-$1000
   (1.25%) over $1000

Amazon (www.amazon.com), e-commerce
A customer-centric company that sells a range of products that it purchases from manufacturers and distributors

Page 37: Web Search And Mining (Ntuim)

Competitors: Microsoft and Yahoo

Microsoft is developing its own search engine
-- Can “lasso” users into its search engine through its operating system
-- Has the “brainiacs” to implement top-of-the-line search engine technology

Yahoo was a customer of Google (may now become Google’s biggest competitor)
-- Offers placement under sponsored links and within actual results (“unethical”)

Page 38: Web Search And Mining (Ntuim)

IV. Impacts

Page 39: Web Search And Mining (Ntuim)

Impacts Web Computing Knowledge Windows New Web OS

Page 40: Web Search And Mining (Ntuim)

Web Computing

Faster than local search
Very-large scale of computing systems
Realize global users’ behaviors
Acquire global information sources

Page 41: Web Search And Mining (Ntuim)

Web Computing

Local disc or global disc?
Personal information management?
  Gmail
  Photo search

Page 42: Web Search And Mining (Ntuim)

Knowledge Windows

Windows of Information Search
  Alliance with online databases
Windows of Personal Knowledge Management
  Knowledge Windows

Page 43: Web Search And Mining (Ntuim)

New Web OS

Merged with Linux OS
Software download from end-users
Information Service OS

Page 44: Web Search And Mining (Ntuim)

V. New Gen. of WSE

Page 45: Web Search And Mining (Ntuim)

Advanced Google Is Google good enough?

“Takano” “Takano NII” “Takano NII Japan”

More about Google Services http://www.google.com/options/

Page 46: Web Search And Mining (Ntuim)

New Features in Google

Google Labs: http://labs.google.com/
Google Desktop Search
  Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger
Google SMS
  Searching phone book, dictionary, product prices, …
Google Print
  Searching books

Page 47: Web Search And Mining (Ntuim)

Page 48: Web Search And Mining (Ntuim)

Other Search Tools

A9.com (by Amazon)
  Bookmark, history, discover, diary
  Books, movies, …
Clusty.com (by Vivisimo)
  Clustering engine
Snap.com (by Idealab)
  Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …
Alexa.com (by Amazon)
  Average user review ratings, …
Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, …

Page 49: Web Search And Mining (Ntuim)

Clusty.com

Page 50: Web Search And Mining (Ntuim)

Example on Vivisimo

Page 51: Web Search And Mining (Ntuim)

Vivisimo (cont.)

Page 52: Web Search And Mining (Ntuim)

New Directions

Personalization
  Photo search, email search & filtering
Information Extraction
  EX: Scholar search
Information Agent
Deep Web Search

Page 53: Web Search And Mining (Ntuim)

VI. Web Mining

Page 54: Web Search And Mining (Ntuim)

Web Search/Information Retrieval

Web Search Engine

Information Seeking

Millions of Users

Page 55: Web Search And Mining (Ntuim)

Improving Search via Mining

Webtexts, images, logs …

Search Engine

Knowledge Discovery

Millions of Users

Page 56: Web Search And Mining (Ntuim)

Valuable Web Resources

Weblogs, texts, images, …

Knowledge Discovery

Millions of Users

Hyper Links, Anchor Texts, Search Result Pages, Query Logs, Query Session Logs, Clicked Stream Logs, Deep Web, …

Page 57: Web Search And Mining (Ntuim)

Discovered Knowledge

Weblogs, texts, images, …

Knowledge Discovery

Millions of Users

Users’ Preferences/Need: Topic, Location, Timing, …
Authority/Popularity: Site, File, People, Company, Product
Clusters/Associations/Relations: Site, Page, People, Company, Product, Query

Page 58: Web Search And Mining (Ntuim)

Web Mining for IR

Weblogs, texts, images, …

Knowledge Discovery

Millions of Users

Search, Classification, Clustering, Cross-language IR, Information Extraction, Text Mining, Filtering

Page 59: Web Search And Mining (Ntuim)

CS 276 / LING 239I: Information Retrieval and Web Mining

Prabhakar Raghavan and Hinrich Schütze

Course Description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval.

Page 60: Web Search And Mining (Ntuim)

Computational Linguistics, 29 , Issue 3, September 2003 .

Page 61: Web Search And Mining (Ntuim)

Research at Web Knowledge Discovery Lab

Web resources: Query Log, Query Session Log
Discovered knowledge: Classified/Relevant Queries, User’s Interests Finding
IR application: Query Classification, Term Suggestion, Thesaurus Construction
Publications: SIGIR’00, WWW’00, WI’01, ICDM’01, JASIST’02, …

Web resources: Anchor Text, Hyper Link
Discovered knowledge: Translation Pairs, Translation Lexicon
IR application: Cross-language IR, Cross-Language Web Search
Publications: SIGIR’01, ICDM’02, TOIS’04

Web resources: Search Result Pages
Discovered knowledge: Translation Pairs, Translation Lexicon
IR application: Cross-language IR, Cross-Language Web Search
Publications: ACL’04, SIGIR’04

Web resources: Search Result Pages
Discovered knowledge: Clustered Queries/Term Taxonomy
IR application: Search Result Clustering, Taxonomy Generation
Publications: ICDM’02, CIKM’04, TOIS’04

Web resources: Search Result Pages
Discovered knowledge: Online Corpus, Generated Metadata
IR application: Online Classifier, Class-based Web Search
Publications: ICDE’04, WWW’04, TALIP’04

Page 62: Web Search And Mining (Ntuim)

Research at Web Knowledge Discovery Lab

Live series

LiveTrans
• SIGIR’04, ACL’04, JCDL’04
• ACM Trans. on Information Systems, 2004
• Online translation of unknown queries via the Web

LiveClassifier
• WWW’04, IJCNLP’04
• ACM Trans. on ALIP, 2004
• Training classifiers and classifying short text via the Web

Page 63: Web Search And Mining (Ntuim)

Research at Web Knowledge Discovery Lab

LiveCluster
• CIKM’04
• ACM Trans. on Information Systems, 2004
• Generating taxonomy from terms or documents

Page 64: Web Search And Mining (Ntuim)

LiveTrans: Cross-language Web Search

Page 65: Web Search And Mining (Ntuim)

LiveClassifier: Classifying search results into user-defined classification tree

Page 66: Web Search And Mining (Ntuim)

LiveClassifier : Paper Title Categorization

Note: no labeled training data

Page 67: Web Search And Mining (Ntuim)

LiveCluster: Taxonomy Generation

Page 68: Web Search And Mining (Ntuim)

Terms Clustering

Page 69: Web Search And Mining (Ntuim)

Query Clustering

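[Figure: dendrogram of clustered query terms, cut into five clusters (labeled 1 to 5 in the figure).]

Cluster: 武俠 (martial-arts fiction), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)

Cluster: 補帖, 大補帖, 泡麵 (instant noodles), dbt

Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)

Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name analysis), 心理測驗 (psychological tests), 星座 (star signs), 愛情 (love)

Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (seeking talent)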

Page 70: Web Search And Mining (Ntuim)

Page 71: Web Search And Mining (Ntuim)

Outline

• Translating Unknown Queries (SIGIR’04)
• Training Text Classifiers (WWW’04)
• Generating Taxonomy/Topic Hierarchies (TOIS’04)

Page 72: Web Search And Mining (Ntuim)

Translating Unknown Queries

I. Anchor Text Mining
   1. Probabilistic Modeling (ACM TALIP’02)
   2. Transitive Translation (ACM TOIS’04)
II. Search-Result Page Mining
   1. Translation Extraction & Selection (JCDL’04)
   2. CLIR & Other Applications (SIGIR’04, ACL’04)

Note: First work dealing with online translation

Page 73: Web Search And Mining (Ntuim)

Introduction (cont.)

Bottleneck of CLIR service
  Real queries are often short, out-of-dictionary terms and might have local variations
  • Ex: proper nouns, new terminologies, …

Need for a powerful query translation engine
  Up-to-date dictionary

English Terminologies

Chinese Translation

Digital library 數位圖書館 /數字圖書館

Banff 班夫 / 班芙

Ishikawa 石川県

NII Japan 国立情報学研究所

louvre museum 羅浮宮

SARS 嚴重急性呼吸道症候群 / 非典 / 沙士

Clinton 柯林頓 / 克林頓

Bill Gates 比爾蓋茲

Page 74: Web Search And Mining (Ntuim)

Web Mining of Query Translations

Different problems for different resources

[Figure: a source term passes through term translation by Web mining (anchor-text mining and search-result mining) to yield target translations for OOD terms, e.g. Yahoo <-> 雅虎.]

Page 75: Web Search And Mining (Ntuim)

Anchor Text (Yahoo <-> 雅虎 )

Applies to most languages Translation candidates are likely to appear in the same anchor-text-set

Page 76: Web Search And Mining (Ntuim)

Search Result Page (National Palace Museum vs. 故宮博物院 )

Mixed-language characteristic in Chinese pages

Page 77: Web Search And Mining (Ntuim)

Problems

Term extraction
Translation selection & noise reduction
Language pairs with limited corpora
Processing speed
Data cleanness (language identification)
Language independence

Page 78: Web Search And Mining (Ntuim)

Term Extraction: SCPCD

Page 79: Web Search And Mining (Ntuim)

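Term Selection: Probabilistic Inference Model

Integrating anchor texts and link structures into a probabilistic inference model, based on co-occurrence and page authority. Assuming $T_s$ and $T_t$ are independent given $U_i$:

$$
P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)}
= \frac{\sum_{i} P(T_s \cap T_t \mid U_i)\,P(U_i)}{\sum_{i} P(T_s \cup T_t \mid U_i)\,P(U_i)}
= \frac{\sum_{i} P(T_s \mid U_i)\,P(T_t \mid U_i)\,P(U_i)}{\sum_{i} \left[ P(T_s \mid U_i) + P(T_t \mid U_i) - P(T_s \mid U_i)\,P(T_t \mid U_i) \right] P(U_i)}
$$

where

$$P(U_i) = \frac{\#\text{ of in-links of } U_i}{\#\text{ of total in-links}}$$

Annotations: co-occurrence (the $P(T_s \mid U_i)\,P(T_t \mid U_i)$ terms); page authority ($P(U_i)$, compare PageRank).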
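A small numeric sketch may help show how co-occurrence and page authority combine in this model. The assumptions are stated loudly: the anchor-text sets below are invented toy data, the estimate of P(T | U_i) as the fraction of U_i's in-link anchor texts that contain T is a choice made here for illustration, and substring matching stands in for proper term extraction.

```python
def anchor_similarity(s, t, anchor_sets):
    """Estimate P(s <-> t) from anchor-text sets.

    anchor_sets maps a URL to the anchor texts of its in-links (one
    string per in-link). P(U_i) is the URL's share of all in-links.
    """
    total_inlinks = sum(len(anchors) for anchors in anchor_sets.values())
    numerator = 0.0
    denominator = 0.0
    for anchors in anchor_sets.values():
        p_u = len(anchors) / total_inlinks              # page authority
        p_s = sum(s in a for a in anchors) / len(anchors)
        p_t = sum(t in a for a in anchors) / len(anchors)
        numerator += p_s * p_t * p_u                    # co-occurrence
        denominator += (p_s + p_t - p_s * p_t) * p_u    # union
    return numerator / denominator if denominator else 0.0

# Hypothetical anchor-text sets, loosely modeled on the Yahoo example.
anchor_sets = {
    "www.yahoo.com": ["Yahoo", "Yahoo", "雅虎", "Yahoo search"],
    "www.yahoo.com.tw": ["雅虎", "Yahoo 台灣", "雅虎 搜尋引擎"],
    "www.example.org": ["搜尋引擎", "example"],
}
score_good = anchor_similarity("Yahoo", "雅虎", anchor_sets)
score_weak = anchor_similarity("Yahoo", "搜尋引擎", anchor_sets)
```

With this toy data the candidate 雅虎 scores higher than 搜尋引擎, because it shares anchor-text sets with "Yahoo" on the pages that carry most of the in-links.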

Page 80: Web Search And Mining (Ntuim)

Observation of Anchor Text

Source Term (Ts) => Translation (Tt), e.g. 雅虎 => Yahoo

Page 81: Web Search And Mining (Ntuim)

Observation of Anchor Text

[Figure: anchor texts containing the source query “Yahoo” (fragments shown: “- in USA”, “Taiwan -”), pointing to www.yahoo.com and www.yahoo.com.tw.]

Page 82: Web Search And Mining (Ntuim)

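Observation of Anchor Text

[Figure: the anchor-text set of www.yahoo.com and www.yahoo.com.tw. Anchor texts contain “Yahoo” and 雅虎 (fragments shown: “- in USA”, “Taiwan -”, 台灣, 搜尋引擎); the co-occurring terms are the translation candidates.]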

Page 83: Web Search And Mining (Ntuim)

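Observation of Anchor Text

[Figure: the same anchor-text set for www.yahoo.com and www.yahoo.com.tw, annotated with in-link counts (#in-link = 187 and #in-link = 21). Page authority comes from the in-link counts; co-occurrence comes from “Yahoo” and 雅虎 appearing in the same anchor-text set.]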

Page 84: Web Search And Mining (Ntuim)

Search Result Mining

[Figure: search-result mining pipeline. Source query -> search engine -> Web pages -> term extraction -> term selection -> target translations.]

Term extraction: PAT-tree based term extraction method [Chien, SIGIR ‘97]

Page 85: Web Search And Mining (Ntuim)

Term Selection

How to decide the ranking?

S, Ti: frequently co-occur in the same pages
  Not necessarily true for synonyms and antonyms
S, Ti: the result pages contain similar co-occurring context terms, used as feature vectors

[Figure: the query S compared with each candidate T1, T2, …, Tn.]

Page 86: Web Search And Mining (Ntuim)

Chi-Square Test

Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]

$$
S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \qquad (3)
$$

a: # of pages containing both terms s and t

b: # of pages containing term s but not t

c: # of pages containing term t but not s

d: # of pages containing neither term s nor t

N: the total number of pages, i.e., N = a+b+c+d
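The chi-square score is straightforward to compute once the four page counts are known. The sketch below assumes the counts are derived from search-engine hit counts (pages containing s, pages containing t, pages containing both, and an estimate of the total number of pages); the helper names and the example numbers are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square co-occurrence score from a 2x2 contingency table.

    a: pages with both s and t, b: s but not t,
    c: t but not s, d: neither term.
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def counts_from_hits(hits_s, hits_t, hits_both, total_pages):
    """Turn raw hit counts into the (a, b, c, d) cells."""
    a = hits_both
    b = hits_s - hits_both
    c = hits_t - hits_both
    d = total_pages - a - b - c
    return a, b, c, d

# Hypothetical hit counts: s on 100 pages, t on 80, both on 60, of 1000.
score = chi_square(*counts_from_hits(100, 80, 60, 1000))
independent = chi_square(5, 5, 5, 5)
```

A score of zero means the two terms co-occur exactly as often as chance predicts; larger values indicate stronger association.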

Page 87: Web Search And Mining (Ntuim)

Context Vector Analysis

Context Vector Analysis: co-occurring context terms as feature vectors

Similarity measure: cosine measure

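$$
w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \qquad (4)
$$

where $f(t_i, d)$ is the frequency of term $t_i$ in search-result page $d$, $N$ is the total number of Web pages, and $n$ is the number of pages containing $t_i$.

$$
S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \qquad (5)
$$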
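Context vector analysis can be sketched in a few lines: build a tf-idf weighted vector of context terms for the source term and for each candidate, then compare the vectors with the cosine measure. Everything concrete below is an assumption for illustration: the function names, the context-term counts, the document frequencies, and the page total are all invented.

```python
import math

def context_vector(term_counts, doc_freq, total_pages):
    """Weight each context term by (f / max f) * log(N / n)."""
    max_f = max(term_counts.values())
    return {
        term: (f / max_f) * math.log(total_pages / doc_freq[term])
        for term, f in term_counts.items()
    }

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical context-term counts taken from search-result pages.
doc_freq = {"network": 100, "router": 10, "switch": 10, "fortune": 10}
N = 1000
source = context_vector({"network": 4, "router": 2, "switch": 1}, doc_freq, N)
cand_1 = context_vector({"network": 2, "router": 2}, doc_freq, N)
cand_2 = context_vector({"fortune": 3, "network": 1}, doc_freq, N)
sim_1 = cosine(source, cand_1)
sim_2 = cosine(source, cand_2)
```

Unlike raw co-occurrence, this comparison can still rank a candidate highly when it rarely appears on the same page as the source term, as long as the two appear in similar contexts.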

Page 88: Web Search And Mining (Ntuim)

Indirect Association Problem

[Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. Source terms: s = 思科 (Cisco), s1 = 系統 (system); target terms: t = Cisco, t1 = system.]

Page 89: Web Search And Mining (Ntuim)

Competitive Linking Algorithm

[Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. Node labels: s, t1, t2, St1, St2; terms shown: system, Cisco, 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer).]
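One common formulation of competitive linking is greedy one-to-one matching: sort all candidate pairs by score and accept a pair only if neither side is already linked. The sketch below follows that formulation; it is an assumption that this matches the slide's Algorithm 2, which is not reproduced in the transcript, and the scores are invented.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking of source terms to target terms.

    scores maps (source_term, target_term) to a similarity value.
    Pairs are visited from highest to lowest score; a pair is linked
    only when both of its terms are still free.
    """
    links = {}
    used_targets = set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _score in ranked:
        if source not in links and target not in used_targets:
            links[source] = target
            used_targets.add(target)
    return links

# Hypothetical scores; (系統, Cisco) is an indirect association that
# arises only because 系統 and 思科 often appear together.
scores = {
    ("思科", "Cisco"): 0.9,
    ("思科", "system"): 0.5,
    ("系統", "system"): 0.8,
    ("系統", "Cisco"): 0.6,
}
links = competitive_linking(scores)
```

Because "Cisco" is claimed by 思科 first, the weaker indirect pair (系統, Cisco) can no longer be selected, which is the effect the algorithm is meant to achieve.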

Page 90: Web Search And Mining (Ntuim)

Combined Method To take advantage of both methods

Anchor-text-based: higher precision Search-result-based: higher coverage

$$
S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \qquad (6)
$$

$\alpha_m$: weight assigned to method $m$

$R_m(s,t)$: ranking of the score of $t$ for $s$ under method $m$
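Rank-based combination can be sketched as a weighted sum of reciprocal ranks: each method contributes its weight divided by the rank it gave the candidate. The method labels, candidate lists, and equal weights below are hypothetical, and treating a candidate missing from a method's list as contributing nothing is an assumption made here.

```python
def combine_rankings(rankings, weights):
    """Order candidates by the sum over methods of weight / rank.

    rankings maps a method name to its candidates in rank order
    (best first). Candidates absent from a method's list receive
    no contribution from that method.
    """
    scores = {}
    for method, ranked in rankings.items():
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + weights[method] / rank
    ordered = sorted(scores, key=lambda cand: scores[cand], reverse=True)
    return ordered, scores

# Hypothetical rankings from three methods for one source term.
rankings = {
    "anchor_text": ["雅虎", "奇摩"],
    "context_vector": ["奇摩", "雅虎", "搜尋"],
    "chi_square": ["雅虎", "搜尋", "奇摩"],
}
weights = {"anchor_text": 1.0, "context_vector": 1.0, "chi_square": 1.0}
ordered, scores = combine_rankings(rankings, weights)
```

Using ranks instead of raw scores sidesteps the fact that the methods produce scores on different scales.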

Page 91: Web Search And Mining (Ntuim)

Experiments

Performance on Query Translation

Test bed: real query terms from the Dreamer search engine log in Taiwan
  228,566 unique terms, collected during a period of 3 months in 1998
Random-query test set:
• 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log
• 40 of them were out-of-dictionary

Page 92: Web Search And Mining (Ntuim)

Random Query Test Set

Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set.

Method Top-1 Top-3 Top-5 Coverage

CV 40.0% 54.0% 54.0% 68%

χ² 36.0% 50.0% 52.0% 68%

AT 20.0% 32.0% 32.0% 32%

Combined 44.0% 64.0% 66.0% 72%

Many query terms didn’t appear in anchor-text sets (coverage)

Page 93: Web Search And Mining (Ntuim)

Other Experiments

430 popular Chinese queries, 67.4% top-1 inclusion rate

Common terms: randomly selected 100 common nouns and 100 common verbs from general-purpose Chinese dictionary

Page 94: Web Search And Mining (Ntuim)

Transitive Translation

[Fig. 3. An abstract diagram showing the concepts of direct translation and indirect translation. s: source term, 新力 (Traditional Chinese); m: intermediate translation, Sony (English); t: target translation, 索尼 (Simplified Chinese).]

Top-n inclusion rates obtained with different models.

Model                    Top1    Top2    Top3    Top4    Top5
Direct Translation       35.7%   43.0%   46.9%   49.6%   51.2%
Indirect Translation     44.2%   55.1%   58.0%   59.7%   60.5%
Transitive Translation   49.2%   58.1%   60.9%   61.6%   62.0%

Page 95: Web Search And Mining (Ntuim)

Transitive Translation Model

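$$
P_{direct}(s,t) = P(s \leftrightarrow t) \qquad (3)
$$

$$
P_{indirect}(s,t) = \sum_{m} P(s \leftrightarrow m \cap m \leftrightarrow t)\,P(m) = \sum_{m} P(s \leftrightarrow m)\,P(m \leftrightarrow t)\,P(m) \qquad (4)
$$

$$
P_{trans}(s,t) =
\begin{cases}
P_{direct}(s,t), & \text{if } P_{direct}(s,t) > \theta \\
P_{indirect}(s,t), & \text{otherwise}
\end{cases} \qquad (5)
$$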
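The transitive model can be sketched as a fallback rule: use the direct similarity when it clears a threshold, and otherwise sum the evidence routed through intermediate terms. The function signature, the similarity values, the constant prior for the intermediate term, and the threshold are all hypothetical choices made for this illustration.

```python
def transitive_score(s, t, direct, intermediates, prior, theta):
    """Score a translation pair, pivoting through intermediates if needed.

    direct(x, y) returns the direct similarity P(x <-> y).
    prior(m) returns the weight P(m) of an intermediate term.
    """
    p_direct = direct(s, t)
    if p_direct > theta:
        return p_direct
    # Indirect translation: sum over intermediate terms m.
    return sum(direct(s, m) * direct(m, t) * prior(m) for m in intermediates)

# Hypothetical similarities: Traditional Chinese 新力 and Japanese ソニー
# rarely share anchor-text sets, but both co-occur with English "Sony".
similarity = {
    ("新力", "ソニー"): 0.01,
    ("新力", "Sony"): 0.6,
    ("Sony", "ソニー"): 0.7,
}

def direct(x, y):
    return similarity.get((x, y), 0.0)

score = transitive_score("新力", "ソニー", direct, ["Sony"], lambda m: 1.0, theta=0.05)
```

Here the direct evidence (0.01) falls below the threshold, so the score comes from the English pivot, which is how a language pair with little shared anchor text can still be translated.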

Page 96: Web Search And Mining (Ntuim)

Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.

Model Top1 Top2 Top3 Top4 Top5

Direct 10.5% 12.8% 14.3% 15.1% 15.1%

Indirect 40.2% 49.4% 56.6% 58.6% 59.6%

Transitive 42.9% 51.4% 58.6% 61.3% 61.9%

Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.

Source term (Traditional Chinese)   English       Simplified Chinese   Japanese
新力                                Sony          索尼                 ソニー
耐吉                                Nike          耐克                 ナイキ
史丹佛                              Stanford      斯坦福               スタンフォード
雪梨                                Sydney        悉尼                 シドニー
網際網路                            internet      互联网               インターネット
網路                                network       网络                 ネットワーク
首頁                                homepage      主页                 ホームページ
電腦                                computer      计算机               コンピューター
資料庫                              database      数据库               データベース
資訊                                information   信息                 インフォメーション

Chinese-Japanese Translation

Page 97: Web Search And Mining (Ntuim)

Translation Lexicons with

Regional Variations

(a) Taiwan (b) Mainland China (c) Hong Kong

Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google.

Page 98: Web Search And Mining (Ntuim)

Summary

A work dealing with live translation of unknown queries

Anchor-text-based
  High precision for high-frequency terms
  Effective for proper nouns in multiple languages
  Not applicable if the anchor-text set is too small
Search-result-based
  Exploits rich Web resources
  High coverage for the English-Chinese language pair

Page 99: Web Search And Mining (Ntuim)

LiveCluster: Generating Taxonomy from

terms or documents

Page 100: Web Search And Mining (Ntuim)

Taxonomy Generation from Terms

Page 101: Web Search And Mining (Ntuim)

Hierarchical Query Clustering

Page 102: Web Search And Mining (Ntuim)

The Steps

Feature Extraction
  Use co-occurring seed terms extracted from the retrieved top pages
Term Vector
  Each query term is assigned a term vector
  • Record the co-occurring feature terms and their frequency values in the retrieved documents
Term Similarity
  tf*idf-based cosine measurement
Hierarchical Term Clustering
  Cluster popular query terms in the log into initial categories
  Query terms with similar features are grouped into clusters

Page 103: Web Search And Mining (Ntuim)

Feature Extraction Use co-occurred seed terms extracted from

retrieved top pages

Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...

Nude Places... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...

A Brave Nude World... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...

Co-occurring feature terms for the query “nude”:

term                 tf/df
photography          2/2
art                  3/2
………
naked                1/1
erotic photography   3/2

Page 104: Web Search And Mining (Ntuim)

Term Weighting

Page 105: Web Search And Mining (Ntuim)

Extraction of Basic Feature Terms

Performance of different features: randomly selected, high-frequency, and seed terms
Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc.
More expressive and distinguishable in describing a particular category
Two logs were compared, and 9,709 overlapping top query terms were extracted as feature terms

Overlap between the two logs (G-1999 vs. D-1998):

                   Top 1,000 terms   Top 20,000 terms   ALL
Top 1,000 terms    583/58.30%        977/97.70%         992/99.20%
Top 20,000 terms   914/91.40%        9,709/50.71%       14,721/76.89%

Page 106: Web Search And Mining (Ntuim)

Task I: Query Clustering (Cont.)

Feature Extraction
  Use co-occurring seed terms extracted from the retrieved top pages
Term Vector
  Each query term is assigned a term vector
  • Record the co-occurring feature terms and their frequency values in the retrieved documents
Term Similarity
  TF*IDF-based cosine measurement
Hierarchical Term Clustering
  Cluster popular query terms in the log into initial categories
  Query terms with similar features are grouped into clusters

Page 107: Web Search And Mining (Ntuim)

Term Similarity

Page 108: Web Search And Mining (Ntuim)

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC):
1. Compute the similarity between all pairs of clusters
   • Estimate the similarity between all pairs of composed terms
   • Use the lowest term similarity value as the cluster similarity value
2. Merge the most similar (closest) two clusters
   • Complete linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
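The agglomerative procedure with complete linkage can be sketched directly from the description: the similarity of two clusters is the lowest similarity between any pair of their member terms, and the closest two clusters are merged until one remains. This is a naive illustration with invented similarity values; instead of maintaining a cluster vector, it recomputes cluster similarity from pairwise term similarities, which is a simplification chosen here.

```python
def complete_linkage_ahc(items, sim):
    """Agglomerative clustering; returns the merge history.

    sim(a, b) is a symmetric term similarity. Each history entry is
    (cluster_a, cluster_b, linkage), where linkage is the lowest
    pairwise similarity between the two merged clusters.
    """
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the weakest pair defines the similarity.
                linkage = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or linkage > best[0]:
                    best = (linkage, i, j)
        linkage, i, j = best
        merges.append((clusters[i], clusters[j], linkage))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical similarities between four query terms.
pair_sim = {
    frozenset(("求職", "求才")): 0.9,
    frozenset(("算命", "星座")): 0.8,
}

def sim(a, b):
    return pair_sim.get(frozenset((a, b)), 0.1)

merges = complete_linkage_ahc(["求職", "求才", "算命", "星座"], sim)
```

The merge history is the dendrogram: cutting it where the linkage value drops sharply (here from 0.8 to 0.1) separates the job-seeking terms from the fortune-telling terms.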

Page 109: Web Search And Mining (Ntuim)

Page 110: Web Search And Mining (Ntuim)

Clustering Results

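[Figure: dendrogram of clustered query terms, cut into five clusters (labeled 1 to 5 in the figure).]

Cluster: 武俠 (martial-arts fiction), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)

Cluster: 補帖, 大補帖, 泡麵 (instant noodles), dbt

Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)

Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name analysis), 心理測驗 (psychological tests), 星座 (star signs), 愛情 (love)

Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (seeking talent)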

Page 111: Web Search And Mining (Ntuim)

Cluster Partition

Page 112: Web Search And Mining (Ntuim)

Quality Function

Page 113: Web Search And Mining (Ntuim)

Quality Function (Cont.)

Page 114: Web Search And Mining (Ntuim)

Quality Function (Cont.)

Page 115: Web Search And Mining (Ntuim)

Preliminary Experiment

Test queries
• Two sets: top 1k queries and random 1k queries
• Each of the test queries has been manually assigned to its corresponding classes

Evaluation metrics
• F-Measure

Page 116: Web Search And Mining (Ntuim)

Evaluation: F-Measure

Page 117: Web Search And Mining (Ntuim)

Obtained F-Measures

Page 118: Web Search And Mining (Ntuim)

Page 119: Web Search And Mining (Ntuim)

Results of Hierarchical Structure Generation