Web Search And Mining (Ntuim)

05/2004 L. F. Chien

Opportunities and Challenges of Web Search and Mining

Lee-Feng Chien ( 簡立峰 )

Academia Sinica & National Taiwan University

Outline

Web SE
• Inside SE
• Google’s Business Models
• Google’s Impacts
• Recent Development
• Next-Generation WSE

Web Mining

WSE = Google

Globalization!

Problems of WSE

Inside WSE
• Fast
• Coverage
• Accuracy

Business
• Profitable
• Models
• Competition

Impacts
• Web Computing
• Knowledge Windows
• New Paradigm of Civilization

I. Some Must-Know Statistics

Online Language Populations

Source: Global Reach (global-reach.biz/globstats)

Top Ten Languages in the Web

TOP TEN LANGUAGES IN THE INTERNET

Language | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
English | 287,369,520 | 26.2 % | 1,098,654,265 | 35.9 %
Chinese | 105,484,112 | 8.0 % | 1,321,669,200 | 13.2 %
Japanese | 66,548,060 | 52.1 % | 127,853,600 | 8.3 %
German | 54,035,201 | 56.3 % | 95,893,300 | 6.8 %
Spanish | 53,670,063 | 13.9 % | 386,413,200 | 6.7 %
French | 35,034,269 | 9.3 % | 375,164,185 | 4.4 %
Korean | 30,670,000 | 41.0 % | 74,730,000 | 3.8 %
Italian | 28,610,000 | 49.3 % | 57,987,100 | 3.6 %
Portuguese | 23,058,254 | 10.3 % | 224,664,100 | 2.9 %
Dutch | 13,657,170 | 56.6 % | 24,125,950 | 1.7 %
TOP TEN LANGUAGES | 698,353,773 | 18.4 % | 3,787,154,900 | 87.3 %
Rest of the Languages | 101,686,725 | 3.9 % | 2,602,992,587 | 12.7 %
WORLD TOTAL | 800,040,498 | 12.5 % | 6,390,147,487 | 100.0 %

Source: Internet World Stats

More and more non-English users!

Web Content

[Chart: Internet hosts (million, log scale from 0.1 to 100.0) by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]

Source: Network Wizards Jan 99 Internet Domain Survey

More and more non-English pages

Web Users and Pages (5 years ago)

Area | Users | Web Pages | Time
World-wide | 150M | 800M | 7/99
China | 4M | 2.5~3M | 7/99
Taiwan | 4M | 3M | 7/99

Challenge of Scalability !

Chinese Users: 110M

Including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others.

Source: Global Reach, 2004

Number of Chinese Web Pages

573,000,000 pages

Scalability Problem !

Number of Web Pages

The world’s largest search engine ?

4,285,199,774 pages (Google) 4.28 billion Web pages, 880 million images, and other documents

Billions Of Textual Documents Indexed, As of Sept 2, 2003

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch

The top 10 Internet trends 2004 predicted by eOneNet.com

1. World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia having more than 100 million Internet users.

2. Broadband Internet penetration will continue to grow with China and US in the lead with an expected growth rate exceeding 30% each.

3. Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion.

4. Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.

5. Spam will increase at least 20% despite the new US anti-spam law. The US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law.

6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers.

7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media contents.

8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel.

9. Entertainment online will grow at a rapid pace, with more sites offering videos and digital music download services.

10. The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring.

II. Inside WSE

Components

• Crawler/Spider
• Index Server
• Query Server
• Document Delivery

Architecture

[Diagram: search engine architecture linking the Web, Spider, Indexer, Archive, Index, SE servers, and Browser, with steps numbered (1) to (5). Annotations: 5B pages; 1B queries/day; scalable index; quality results; log, spam, freshness.]

Spider

• Get all pages from the Web
• Web traversal
• Challenges
  • Performance, e.g., #pages per PC
  • Coverage
  • Currency
  • Spam filtering
  • Hidden Web

Index Server

• Index occurrences of all words in the pages
• Data cleanness
• Challenges
  • Space overhead, #pages/PC
  • Incremental
  • Scalability & distributed processing
  • Multiple languages

System Anatomy

Data Structure

• Lexicon: fits in memory; two different forms
• Hit list: accounts for most of the space; uses 2 bytes per hit to save space
• Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
• Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
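The relationship between the lexicon, the forward index, and the inverted index can be illustrated with a small sketch. It is a toy in-memory version under stated assumptions: the function names `build_forward_index` and `invert` are hypothetical, hits are plain word positions rather than 2-byte encoded hits, and there are no barrels.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Return (lexicon, forward): word -> wordID, and docID -> [(wordID, position)]."""
    lexicon = {}
    forward = {}
    for doc_id, text in docs.items():
        hits = []
        for pos, word in enumerate(text.lower().split()):
            # Assign the next free wordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        forward[doc_id] = hits
    return lexicon, forward

def invert(forward):
    """Re-sort the same hits by wordID; each doc list is sorted by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):            # visiting docs in docID order
        by_word = defaultdict(list)
        for word_id, pos in forward[doc_id]:
            by_word[word_id].append(pos)
        for word_id, positions in by_word.items():
            inverted[word_id].append((doc_id, positions))
    return dict(sorted(inverted.items()))     # keys in wordID order

docs = {2: "web search engine", 1: "web mining"}   # toy documents (assumed)
lexicon, forward = build_forward_index(docs)
inverted = invert(forward)
print(inverted[lexicon["web"]])
```

A query server would look up each query word in the lexicon and then intersect or merge the corresponding doc lists.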

Query Server

• Search relevant URLs for queries by looking up indices
• Challenges
  • Speed, check #queries per second
  • Functions supported
  • Localization

PageRank

PageRank (Cont.)

Let u be a web page, and let $B_u$ be the set of pages that point to u. Let $N_u$ be the number of links from u, and let c be a factor used for normalization. Then a simplified version of PageRank is:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
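The simplified PageRank definition above can be computed by repeated substitution. The following is a minimal sketch, assuming a small hand-made link graph; the function name `simplified_pagerank` and the choice of normalizing so that ranks sum to 1 (playing the role of c) are illustrative assumptions, and no damping factor is used.

```python
def simplified_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v on a dict page -> outgoing links."""
    pages = sorted(set(links) | {u for targets in links.values() for u in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            if not targets:
                continue                       # pages without out-links pass on nothing
            share = rank[v] / len(targets)     # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())
        if total == 0:
            return rank                        # no links at all: keep the uniform ranks
        # Normalization (the factor c): rescale so that all ranks sum to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

# Toy graph (assumed): a -> b, c; b -> c; c -> a
ranks = simplified_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

In this toy graph, pages a and c end up with equal rank and b with half of that, because b receives only half of a's rank.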

Search Functions

• Phrase search, e.g. "petite galerie"
• Truncation, e.g. librar*, wom*n
• Constraining search, e.g. title:"The Wall Street Journal"
• Proximity search, e.g. gold near silver
• Boolean, e.g. +noir +film -"pinot noir"
• Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
• Limit search, e.g. limit by date range
• Capitalization, e.g. turkey vs. Turkey
• Ranking fields and refine search
• LiveTopics
• Translate service
• Other

Document Delivery

• Bottleneck of bandwidth
• Presentation
• Caching
  • Queries, search results
  • Aakman Model

III. Business

What is Google?

• Specialized web search engine
• Founded in 1998 by 2 graduate students at Stanford University (Larry Page and Sergey Brin)
• Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free)
• Google’s features: fast, unbiased, and accurate results; allows access to over 4 billion web pages and over 800 million images (most important: valid web pages)

Company Facts

Employees: 1,300+

Languages spoken: 34

Worldwide Offices: 21

(Mostly in US & Europe)

Annual Revenues: $900m

Google Revenue

Revenue (an e-business):
• ½ from selling relevant text-based ads (sponsored links near search results)
• ½ from licensing its search technology to companies like Yahoo

Source: Eric Schmidt Interview, PCWorld.com (January 30, 2002)

Sources of Revenue

AdWords (150,000 advertisers)
• “Sponsored links” ads
• Cost-per-click pricing: advertisers pay only when people click on the link
• Advertising is extremely cheap and effective, e.g., Edmunds.com spent “$250,000 a month in advertising” because $1 spent generated $1.70.

Google Search Appliance
• An integrated hardware/software solution that extends the power of Google to corporate intranets and web servers
• Customers include: Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc.

Challenges (cont.)

Easy entry into the Search Engine Industry

Lack of customer lock-in (vs. Microsoft);

Google will focus on creating services to voluntarily draw in customers

Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon)

Customers are becoming competitors (Yahoo, AOL)

Competitors: eBay and Amazon

eBay (www.ebay.com), e-commerce
• Web-based marketplace in which a community of buyers and sellers are brought together to browse, buy and sell various items
• Business revenue: charges fees on proceeds: 5% on $0.01-$25, 2.5% on $25-$1000, 1.25% over $1000

Amazon (www.amazon.com), e-commerce
• A customer-centric company that sells a range of products that it purchases from manufacturers and distributors

Competitors: Microsoft and Yahoo

Microsoft is developing its own search engine
• Can “lasso” users into its search engine through its operating system
• Has the “brainiacs” to implement top-of-the-line search engine technology

Yahoo was a customer of Google (may now become Google’s biggest competitor)
• Offers placement under sponsored links and within actual results (“unethical”)

IV. Impacts

Impacts

• Web Computing
• Knowledge Windows
• New Web OS

Web Computing

• Faster than local search
• Very-large scale of computing systems
• Realize global users’ behaviors
• Acquire global information sources

Web Computing

• Local disc or global disc?
• Personal information management?
  • Gmail
  • Photo search

Knowledge Windows

• Windows of Information Search
• Alliance with online databases
• Windows of Personal Knowledge Management
• Knowledge Windows

New Web OS

• Merged with Linux OS
• Software download from end-users
• Information Service OS

V. New Gen. of WSE

Advanced Google

• Is Google good enough?
  • “Takano”, “Takano NII”, “Takano NII Japan”
• More about Google Services: http://www.google.com/options/

New Features in Google

• Google Labs: http://labs.google.com/
• Google Desktop Search
  • Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger
• Google SMS
  • Searching phone book, dictionary, product prices, …
• Google Print
  • Searching books

Other Search Tools

• A9.com (by Amazon)
  • Bookmark, history, discover, diary
  • Books, movies, …
• Clusty.com (by Vivisimo)
  • Clustering engine
• Snap.com (by Idealab)
  • Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …
• Alexa.com (by Amazon)
  • Average user review ratings, …
• Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, …

Clusty.com

Example on Vivisimo

Vivisimo (cont.)

New Directions

• Personalization
  • Photo search, email search & filtering
• Information Extraction
  • Ex: scholar search
• Information Agent
• Deep Web Search

VI. Web Mining

Web Search/Information Retrieval

Web Search Engine

Information Seeking

Millions of Users

Improving Search via Mining

Webtexts, images, logs …

Search Engine

Knowledge Discovery

Millions of Users

Valuable Web Resources

Web: logs, texts, images, …

Knowledge Discovery

Millions of Users

• Hyperlinks
• Anchor texts
• Search result pages
• Query logs
• Query session logs
• Click-stream logs
• Deep Web, …

Discovered Knowledge

Web: logs, texts, images, …

Knowledge Discovery

Millions of Users

• Users’ preferences/needs: topic, location, timing, …
• Authority/popularity: site, file, people, company, product
• Clusters/associations/relations: site, page, people, company, product, query

Web Mining for IR

Web: logs, texts, images, …

Knowledge Discovery

Millions of Users

• Search
• Classification
• Clustering
• Cross-language IR
• Information extraction
• Text mining
• Filtering

CS 276 / LING 239I: Information Retrieval and Web Mining

Prabhakar Raghavan and Hinrich Schütze

Course Description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval.

Computational Linguistics, Volume 29, Issue 3, September 2003.

Research at Web Knowledge Discovery Lab

Web Resources | Discovered Knowledge | IR Application | Publications
Query Log, Query Session Log | Classified/Relevant Queries | User’s Interests Finding, Query Classification, Term Suggestion, Thesaurus Construction | SIGIR’00, WWW’00, WI’01, ICDM’01, JASIST’02, …
Anchor Text, Hyper Link | Translation Pairs, Translation Lexicon | Cross-language IR, Cross-Language Web Search | SIGIR’01, ICDM’02, TOIS’04
Search Result Pages | Translation Pairs, Translation Lexicon | Cross-language IR, Cross-Language Web Search | ACL’04, SIGIR’04
Search Result Pages | Clustered Queries/Terms, Taxonomy | Search Result Clustering, Taxonomy Generation | ICDM’02, CIKM’04, TOIS’04
Search Result Pages | Online Corpus, Generated Metadata | Online Classifier, Class-based Web Search | ICDE’04, WWW’04, TALIP’04

Research at Web Knowledge Discovery Lab

Live series

LiveTrans
• SIGIR’04, ACL’04, JCDL’04
• ACM Trans. on Information Systems, 2004
• Online translation of unknown queries via the Web

LiveClassifier
• WWW’04, IJCNLP’04
• ACM Trans. on ALIP, 2004
• Training classifiers and classifying short text via the Web

LiveCluster
• CIKM’04
• ACM Trans. on Information Systems, 2004
• Generating taxonomy from terms or documents

LiveTrans: Cross-language Web Search

LiveClassifier: Classifying search results into user-defined classification tree

LiveClassifier : Paper Title Categorization

Note: no labeled training data

LiveCluster: Taxonomy Generation

Terms Clustering

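Query Clustering

[Figure: dendrogram of query terms from a search log, cut into five clusters]

• Job seeking: 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)
• Fortune telling: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate analysis), 姓名學 (name analysis), 心理測驗 (psychological test), 星座 (zodiac signs), 愛情 (love)
• Airlines: eva長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• Pirated software: 補帖, 大補帖 (pirated software compilations), 泡麵 (instant noodles), dbt
• Martial-arts fiction: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)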

Outline

• Translating Unknown Queries (SIGIR’04)
• Training Text Classifiers (WWW’04)
• Generating Taxonomy/Topic Hierarchies (TOIS’04)

Translating Unknown Queries

I. Anchor Text Mining
   1. Probabilistic Modeling (ACM TALIP’02)
   2. Transitive Translation (ACM TOIS’04)

II. Search-Result Page Mining
   1. Translation Extraction & Selection (JCDL’04)
   2. CLIR & Other Applications (SIGIR’04, ACL’04)

Note: First work dealing with online translation

Introduction (cont.)

• Bottleneck of CLIR service
  • Real queries are often short
  • Out-of-dictionary terms, which might have local variations
    • Ex: proper nouns, new terminologies, …
• Need for a powerful query translation engine
  • Up-to-date dictionary

English Terminologies | Chinese Translation
Digital library | 數位圖書館 / 數字圖書館
Banff | 班夫 / 班芙
Ishikawa | 石川県
NII Japan | 国立情報学研究所
louvre museum | 羅浮宮
SARS | 嚴重急性呼吸道症候群 / 非典 / 沙士
Clinton | 柯林頓 / 克林頓
Bill Gates | 比爾蓋茲

Web Mining of Query Translations

Different problems for different resources

[Diagram: an out-of-dictionary (OOD) source term goes through term translation by Web mining (anchor-text mining and search-result mining) to produce target translations, e.g., Yahoo <-> 雅虎]

Anchor Text (Yahoo <-> 雅虎)

• Applies to most languages
• Translation candidates are likely to appear in the same anchor-text set

Search Result Page (National Palace Museum vs. 故宮博物院 )

Mixed-language characteristic in Chinese pages

Problems

• Term extraction
• Translation selection & noise reduction
• Language pairs with limited corpora
• Processing speed
• Data cleanness (language identification)
• Language independence

Term Extraction: SCPCD

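Term Selection: Probabilistic Inference Model

• Integrating anchor texts and link structures into a probabilistic inference model
• Based on co-occurrence & page authority

$$P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} = \frac{\sum_{i} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{i} P(T_s \cup T_t \mid U_i)\, P(U_i)} = \frac{\sum_{i} P(T_s \mid U_i)\, P(T_t \mid U_i)\, P(U_i)}{\sum_{i} \left[ P(T_s \mid U_i) + P(T_t \mid U_i) - P(T_s \mid U_i)\, P(T_t \mid U_i) \right] P(U_i)}$$

$$\text{where } P(U_i) = \frac{\#\text{ of in-links of } U_i}{\#\text{ of total in-links}}$$

Labels on the slide: co-occurrence (the $P(T \mid U_i)$ terms), page authority ($P(U_i)$), PageRank.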
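The probabilistic inference model above combines co-occurrence in anchor-text sets with page authority. Here is a minimal sketch under stated assumptions: the function name `anchor_text_similarity` is hypothetical, $P(T \mid U_i)$ is estimated as the fraction of the page's sampled anchor texts that contain the term (an assumption, not confirmed by the slides), and the data are toy values.

```python
def anchor_text_similarity(source, target, anchor_sets, in_links):
    """Estimate P(Ts <-> Tt) from anchor-text sets weighted by page authority."""
    total_in_links = sum(in_links.values())
    numerator = denominator = 0.0
    for url, texts in anchor_sets.items():
        p_u = in_links[url] / total_in_links                  # page authority P(U_i)
        p_s = sum(source in t for t in texts) / len(texts)    # assumed estimate of P(Ts | U_i)
        p_t = sum(target in t for t in texts) / len(texts)    # assumed estimate of P(Tt | U_i)
        numerator += p_s * p_t * p_u
        denominator += (p_s + p_t - p_s * p_t) * p_u
    return numerator / denominator if denominator else 0.0

# Toy anchor-text sets and in-link counts (illustrative values only).
anchor_sets = {
    "www.yahoo.com": ["Yahoo", "雅虎", "Yahoo - in USA", "雅虎 Yahoo"],
    "www.yahoo.com.tw": ["Yahoo Taiwan", "雅虎"],
}
in_links = {"www.yahoo.com": 187, "www.yahoo.com.tw": 21}

score = anchor_text_similarity("Yahoo", "雅虎", anchor_sets, in_links)
unrelated = anchor_text_similarity("Yahoo", "Google", anchor_sets, in_links)
print(score, unrelated)
```

A candidate that never shares an anchor-text set with the source term scores 0, which is why coverage depends on the size of the anchor-text set.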

Observation of Anchor Text

[Figure: anchor texts pointing to the pages www.yahoo.com and www.yahoo.com.tw, with in-link counts of 187 and 21 shown on the slide. The anchor-text set mixes the source query “Yahoo” with the translation candidate “雅虎” and with context such as “- in USA”, “Taiwan -”, “台灣 -” (Taiwan), and “- 搜尋引擎” (search engine).]

• Source term (Ts) and translation (Tt): Yahoo and 雅虎
• Co-occurrence: the source query and its translation candidates appear in the same anchor-text set
• Page authority: the number of in-links of the page that the anchor texts point to

Search Result Mining

[Diagram: Source Query, Search Engine, Web Pages, Term Extraction, Term Selection, Target Translations]

PAT-tree based term extraction method [Chien, SIGIR ‘97]

Term Selection

How to decide the ranking?

• S, Ti: frequently co-occur in the same pages
  • Not necessarily true for synonyms and antonyms
• S, Ti: the result pages contain similar co-occurring context terms, used as feature vectors

[Diagram: query S compared against candidate translations T1, T2, …, Tn]

Chi-Square Test

Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91]

$$S_{\chi^2}(s,t) = \frac{N \times (a \times d - b \times c)^2}{(a+b) \times (a+c) \times (b+d) \times (c+d)} \tag{3}$$

a: # of pages containing both terms s and t

b: # of pages containing term s but not t

c: # of pages containing term t but not s

d: # of pages containing neither term s nor t

N: the total number of pages, i.e., N = a+b+c+d
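The chi-square score defined above needs only the four page counts a, b, c, and d. A minimal sketch follows; the function name `chi_square_score` is hypothetical, and in practice the counts would come from search-engine hit counts, which is an assumption here.

```python
def chi_square_score(a, b, c, d):
    """Chi-square association between terms s and t from a 2x2 page-count table.

    a: pages with both s and t; b: pages with s only;
    c: pages with t only;       d: pages with neither.
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0            # degenerate table: one term is in all pages or in none
    return n * (a * d - b * c) ** 2 / denominator

perfect = chi_square_score(10, 0, 0, 10)      # s and t always appear together
independent = chi_square_score(5, 5, 5, 5)    # no association
print(perfect, independent)
```

Terms that always appear together score N, and statistically independent terms score 0.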

Context Vector Analysis

Context Vector Analysis: co-occurring context terms as feature vectors

Similarity measure: cosine measure

$$w_{t_i} = \frac{f(t_i, d)}{\max_j f(t_j, d)} \times \log\left(\frac{N}{n}\right) \tag{4}$$

$$S_{cv}(s,t) = \frac{\sum_{i=1}^{m} w_{s_i} \times w_{t_i}}{\sqrt{\sum_{i=1}^{m} (w_{s_i})^2 \times \sum_{i=1}^{m} (w_{t_i})^2}} \tag{5}$$
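The context-vector similarity can be sketched as a tf-idf style weighting followed by a cosine measure. This is an illustrative sketch only: the function names, the dict-based sparse vectors, and the toy frequencies are assumptions, and `total_pages` and `doc_freqs` stand for N and n in the weighting formula.

```python
import math

def context_weights(term_freqs, doc_freqs, total_pages):
    """Weight each context term: normalized frequency times log(N / n)."""
    max_freq = max(term_freqs.values())
    return {
        term: (freq / max_freq) * math.log(total_pages / doc_freqs[term])
        for term, freq in term_freqs.items()
    }

def cosine(ws, wt):
    """Cosine measure between two sparse weight vectors stored as dicts."""
    dot = sum(w * wt.get(term, 0.0) for term, w in ws.items())
    norm = math.sqrt(sum(w * w for w in ws.values())) * math.sqrt(sum(w * w for w in wt.values()))
    return dot / norm if norm else 0.0

# Toy context-term frequencies for a source term and one candidate (assumed values).
doc_freqs = {"museum": 100, "palace": 50, "taipei": 200}
source_vec = context_weights({"museum": 4, "palace": 2}, doc_freqs, total_pages=10_000)
candidate_vec = context_weights({"museum": 3, "taipei": 1}, doc_freqs, total_pages=10_000)
print(cosine(source_vec, candidate_vec))
```

Candidates whose search-result pages share no context terms with the source term get a similarity of 0.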

Indirect Association Problem

[Figure: source term s = 思科 (Cisco) with translation t = Cisco, and source term s1 = 系統 (system) with translation t1 = system]

Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors.

Competitive Linking Algorithm

[Figure: bipartite graph linking the source terms 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), and 電腦 (computer) to the target terms t1 = system and t2 = Cisco]

Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2.

Combined Method

To take advantage of both methods:
• Anchor-text-based: higher precision
• Search-result-based: higher coverage

$$S_{all}(s,t) = \sum_{m} \frac{\alpha_m}{R_m(s,t)} \tag{6}$$

$\alpha_m$: weight assigned to method m

$R_m(s,t)$: ranking of the score of t under method m
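The combined score sums, over the methods, a weight divided by the candidate's rank under each method. A minimal sketch, assuming each method returns a dict of candidate scores and that a candidate missing from a method simply contributes nothing for it; the function names and toy scores are hypothetical.

```python
def rank_of(scores):
    """Map each candidate to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {candidate: position + 1 for position, candidate in enumerate(ordered)}

def combine(method_scores, weights):
    """Sum weight / rank over all methods that ranked the candidate."""
    combined = {}
    for method, scores in method_scores.items():
        for candidate, rank in rank_of(scores).items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return combined

# Toy scores from three methods for the source term "Yahoo" (assumed values).
method_scores = {
    "chi_square": {"雅虎": 12.0, "搜尋": 3.0},
    "context_vector": {"雅虎": 0.8, "搜尋": 0.9},
    "anchor_text": {"雅虎": 0.4},
}
weights = {"chi_square": 1.0, "context_vector": 1.0, "anchor_text": 1.0}
combined = combine(method_scores, weights)
print(max(combined, key=combined.get))
```

Working on ranks rather than raw scores avoids having to put chi-square values and cosine values on a common scale.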

Experiments

Performance on Query Translation
• Test bed: real query terms from the Dreamer search engine log in Taiwan
• 228,566 unique terms, during a period of 3 months in 1998
• Random-query test set:
  • 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log
  • 40 of them were out-of-dictionary

Random Query Test Set

Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set.

Method | Top-1 | Top-3 | Top-5 | Coverage
CV (context vector) | 40.0% | 54.0% | 54.0% | 68%
χ² (chi-square) | 36.0% | 50.0% | 52.0% | 68%
AT (anchor text) | 20.0% | 32.0% | 32.0% | 32%
Combined | 44.0% | 64.0% | 66.0% | 72%

Many query terms didn’t appear in anchor-text sets (coverage)

Other Experiments

430 popular Chinese queries, 67.4% top-1 inclusion rate

Common terms: randomly selected 100 common nouns and 100 common verbs from general-purpose Chinese dictionary

Transitive Translation

[Figure: s = 新力 (Traditional Chinese), m = Sony (English), t = 索尼 (Simplified Chinese)]

s: source term
t: target translation
m: intermediate translation

Fig. 3. An abstract diagram showing the concepts of direct translation and indirect translation.

Top-n inclusion rates obtained with different models.

Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct Translation | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
Indirect Translation | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%

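Transitive Translation Model

$$P_{direct}(s,t) = P(s \leftrightarrow t) \tag{3}$$

$$P_{indirect}(s,t) = \sum_{m} P(s \leftrightarrow m,\, m \leftrightarrow t) \times P(m) \approx \sum_{m} P(s \leftrightarrow m) \times P(m \leftrightarrow t) \times P(m) \tag{4}$$

$$P_{trans}(s,t) = \begin{cases} P_{direct}(s,t), & \text{if } P_{direct}(s,t) > \theta \\ P_{indirect}(s,t), & \text{otherwise} \end{cases} \tag{5}$$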
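The transitive model falls back to translation through an intermediate language when the direct estimate is too weak. The sketch below assumes the pairwise probabilities are already available in a dict; the function name `transitive_score`, the threshold values, and all probabilities are illustrative assumptions.

```python
def transitive_score(s, t, p_pair, p_term, intermediates, theta):
    """Use the direct estimate if it exceeds theta, else sum over intermediate terms m."""
    direct = p_pair.get((s, t), 0.0)
    if direct > theta:
        return direct
    # Indirect translation: P(s <-> m) * P(m <-> t) * P(m), summed over m.
    return sum(
        p_pair.get((s, m), 0.0) * p_pair.get((m, t), 0.0) * p_term.get(m, 0.0)
        for m in intermediates
    )

# Toy estimates (assumed): Traditional Chinese -> English -> Simplified Chinese.
p_pair = {("新力", "Sony"): 0.6, ("Sony", "索尼"): 0.5, ("新力", "索尼"): 0.01}
p_term = {"Sony": 0.2}

via_english = transitive_score("新力", "索尼", p_pair, p_term, ["Sony"], theta=0.05)
direct_only = transitive_score("新力", "索尼", p_pair, p_term, ["Sony"], theta=0.005)
print(via_english, direct_only)
```

With the higher threshold the weak direct estimate is rejected and the score comes from the English intermediate term; with the lower threshold the direct estimate is used as is.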

Chinese-Japanese Translation

Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.

Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
Indirect | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%

Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.

Source term (Traditional Chinese) | English | Simplified Chinese | Japanese
新力 | Sony | 索尼 | ソニー
耐吉 | Nike | 耐克 | ナイキ
史丹佛 | Stanford | 斯坦福 | スタンフォード
雪梨 | Sydney | 悉尼 | シドニー
網際網路 | internet | 互联网 | インターネット
網路 | network | 网络 | ネットワーク
首頁 | homepage | 主页 | ホームページ
電腦 | computer | 计算机 | コンピューター
資料庫 | database | 数据库 | データベース
資訊 | information | 信息 | インフォメーション

Translation Lexicons with Regional Variations

[Figure panels: (a) Taiwan, (b) Mainland China, (c) Hong Kong]

Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google.

Summary

A work dealing with live translation of unknown queries

Anchor-text-based
• High precision for high-frequency terms
• Effective for proper nouns in multiple languages
• Not applicable if the anchor-text set is not large enough

Search-result-based
• Exploits rich Web resources
• High coverage for the English-Chinese language pair

LiveCluster: Generating Taxonomy from terms or documents

Taxonomy Generation from Terms

Hierarchical Query Clustering

The Steps

• Feature Extraction
  • Use co-occurring seed terms extracted from the retrieved top pages
• Term Vector
  • Each query term is assigned a term vector
  • Record the co-occurring feature terms and their frequency values in the retrieved documents
• Term Similarity
  • tf*idf-based cosine measure
• Hierarchical Term Clustering
  • Cluster popular query terms in the log into initial categories
  • Query terms with similar features are grouped into clusters

Feature Extraction

Use co-occurring seed terms extracted from the retrieved top pages

[Example: top search-result snippets retrieved for the query term “nude”]

Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...

Nude Places... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...

A Brave Nude World... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...

Co-occurring feature terms for the query term “nude”:

term | tf/df
photography | 2/2
art | 3/2
… | …
naked | 1/1
erotic photography | 3/2

Term Weighting

Extraction of Basic Feature Terms

• Performance of different features: randomly selected, hi-frequency, and seed terms
• Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc.
• More expressive and distinguishable in describing a particular category
• Two logs compared and extracted 9,709 overlapping top query terms as feature terms

G-1999 / D-1998 | Top 1,000 terms | Top 20,000 terms | ALL
Top 1,000 terms | 583/58.30% | 977/97.70% | 992/99.20%
Top 20,000 terms | 914/91.40% | 9,709/50.71% | 14,721/76.89%

Term Similarity

Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)

1. Compute the similarity between all pairs of clusters
   • Estimate similarity between all pairs of composed terms
   • Use the lowest term similarity value as the cluster similarity value
2. Merge the most similar (closest) two clusters
   • Complete linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
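The agglomerative procedure above can be sketched in a few lines. This is a simplified illustration, not the lab's implementation: the function names and the toy similarity table are assumptions, and instead of updating a cluster vector the sketch recomputes the complete-linkage similarity from pairwise term similarities on every pass.

```python
def complete_linkage(cluster_a, cluster_b, sim):
    """Cluster similarity is the lowest similarity between any pair of member terms."""
    return min(sim(x, y) for x in cluster_a for y in cluster_b)

def agglomerate(terms, sim):
    """Merge the two most similar clusters until one remains; return the merge history."""
    clusters = [(term,) for term in terms]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = complete_linkage(clusters[i], clusters[j], sim)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy pairwise similarities (assumed); unlisted pairs default to 0.1.
table = {frozenset(["求職", "找工作"]): 0.9, frozenset(["華航", "長榮"]): 0.8}
sim = lambda x, y: table.get(frozenset([x, y]), 0.1)

merges = agglomerate(["求職", "找工作", "華航", "長榮"], sim)
for left, right, score in merges:
    print(left, right, score)
```

Cutting the merge history at a similarity threshold yields flat clusters; here a cut anywhere between 0.1 and 0.8 separates the job-seeking terms from the airline terms.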

Clustering Results

[Figure: dendrogram of the clustered query terms, cut into five clusters (job seeking, fortune telling, airlines, pirated software, and martial-arts fiction); the same figure appears under “Query Clustering”]

Cluster Partition

Quality Function

Quality Function (Cont.)

Quality Function (Cont.)

Preliminary Experiment

Test queries
• Two sets: top 1k queries and random 1k queries
• Each of the test queries has been manually assigned to the corresponding classes

Evaluation metrics
• F-Measure

Evaluation: F-Measure

Obtained F-Measures

Results of Hierarchical Structure Generation