Natural Language Processing based on and for Information Explosion on the Web
Sadao Kurohashi, Kyoto University / NICT
(TCS NLP Winter School 2008, 2008/1/5, IIIT, Hyderabad, India)
Search
• The web influences:
– People's daily life
– Enterprise management
– Governmental policy decisions
• 75% of people would rather use the web to answer their questions than ask their own family members.
• Service-industry workers spend 30% of their time on search.
• 50% of complex queries go unanswered.
High-Performance Computing Environment
800 CPU cores, 100 TB storage
[Figure: deep NLP applied to two conflicting statements about minke whales]
– ミンククジラの数は増えている (the number of minke whales is increasing)
– 問題はミンク鯨だ。絶滅しかかっている (the problem is the minke whale; it is facing extinction)
Word segmentation and identification, predicate-argument structure analysis, anaphora resolution, and flexible matching together detect the conflict between "the number is increasing" and "facing extinction".
Deep NLP ⇒ Information Credibility
NLP based on Information Explosion on the Web
NLP for Information Explosion on the Web
• Compilation of a basic lexicon and robust morphological analysis
• Case frame acquisition and predicate-argument structure analysis
• Synonymous expression acquisition and flexible matching
• Open search engine infrastructure
• Information organization system
• Information credibility analysis system
Japanese
サッカーのカメルーン代表が、ケニアで大統領選をめぐり暴動が発生していることを受け、アフリカ選手権(20日開幕、ガーナ)に備えて同国内で行う予定だった10日間の練習合宿を取りやめたことが2日、分かった。AFP通信が伝えた。合宿中に予定されていたケニア代表との強化試合も中止となった。
(Sample news text: It emerged on the 2nd that Cameroon's national soccer team cancelled a planned 10-day training camp in Kenya, ahead of the African Cup of Nations opening on the 20th in Ghana, because riots had broken out there over the presidential election; a planned friendly against the Kenyan team was also cancelled. Reported by AFP.)
http://www.asahi.com/sports/update/0103/JJT200801030002.html
Characteristics of Japanese
• No space between words ⇒ segmentation
• Four sets of characters ⇒ synonyms
– HIRAGANA e.g., いんど
– KATAKANA e.g., インド
– Chinese characters e.g., 印度(KANJI)
– English alphabet e.g., India
a. Head final
b. Free word order
c. Postpositions function as case markers
d. Hidden case markers
e. Omission of case components
Characteristics of Japanese
a. Head final
b. Free word order
c. Postpositions function as case markers
Characteristics of Japanese
Kare-ga  Deutschgo-wo  hanasu.
he-NOM   German-ACC    speak
(He speaks German.)
d. Hidden case markers

Characteristics of Japanese

Kare-wa  Deutschgo-wo  hanasu.
he-TOP   German-ACC    speak
(He speaks German. Does the topic marker wa hide ga or wo?)

Deutschgo-wo  hanasu  sensei …
German-ACC    speak   teacher
(the teacher who speaks German: sensei fills the hidden ga slot)
Characteristics of Japanese

e. Omission of case components

φ-ga   Deutschgo-wo  hanasu  sensei-wo    yatotta.
φ-NOM  German-ACC    speak   teacher-ACC  hired
(φ hired a teacher who speaks German.)
Compilation of a basic lexicon and robust morphological analysis
Basic Lexicon
• Dictionaries for humans: 200,000 entries
• EDR: 200,000 entries
→ Side effects for segmentation; hard to maintain
⇒ 30,000 words (97% coverage for news texts)
Spelling Variation
蟹 (kanji) / かに (hiragana) / カニ (katakana)  (crab)
→ representative form (ID): 蟹/かに
Spelling Variation
落ちる / 落る / おちる (drop) → 落ちる/おちる
綺麗だ / 奇麗だ / きれいだ (beautiful) → 綺麗だ/きれいだ
子供 / 子ども / こども (child) → 子供/こども
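The normalization above can be sketched as a simple lookup from each spelling variant to its representative form. The table below only contains the slide's own examples; the names `VARIANTS` and `representative_form` are illustrative, not the actual lexicon's API:

```python
# Hypothetical variant table distilled from the basic lexicon's
# spelling-variation entries (examples from the slides only).
VARIANTS = {
    "蟹": "蟹/かに", "かに": "蟹/かに", "カニ": "蟹/かに",                      # crab
    "落ちる": "落ちる/おちる", "落る": "落ちる/おちる", "おちる": "落ちる/おちる",  # drop
    "子供": "子供/こども", "子ども": "子供/こども", "こども": "子供/こども",      # child
}

def representative_form(word: str) -> str:
    """Return the representative form (ID) of a word, or the word itself."""
    return VARIANTS.get(word, word)
```

With such a mapping, all variants of "child" index to the same ID, so matching and indexing see them as one word.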
Other Information for Basic Lexicon
• Possibility Form– 書ける(can-write) → 書く(write)
• Honorific Form– 召し上がる(eat) → 食べる(eat)
• Category (22 classes, e.g., <human> <organization> …)
• Domain (12 classes, e.g., <business> <education> …)
Robust Morphological Analysis
上海ガニを       ばくばく           食べた
Shanghai crab-ACC  in big mouthfuls  ate
(BAKU-BAKU: onomatopoeia; the voiced GANI must be identified with KANI (カニ))
Case frame acquisition and predicate-argument structure analysis
Language Understanding and Common sense
Mary ate the salad with a fork.
Mary ate the salad with mushrooms.

クロールで      泳いでいる  女の子を   見た
crawl-with      swimming    girl-ACC   saw
(I saw a girl swimming the crawl.)

望遠鏡で        泳いでいる  女の子を   見た
telescope-with  swimming    girl-ACC   saw
(I saw, with a telescope, a girl swimming.)
Case frame
泳ぐ swim:  ga {人 person, 子 child, …},  de {クロール crawl, 平泳ぎ breaststroke, …},  wo {海 sea, 大海 ocean, …}
見る see:   ga {人 person, 者 person, …},  de {望遠鏡 telescope, 双眼鏡 binoculars, …},  wo {姿 figure, 人 person, …}
[Figure: case frame compilation pipeline on a PC cluster (350 CPUs)]
WEB → 500M sentences (20M pages) → Parsing (KNP) + Filtering (1 day) → Predicate-argument structures → Clustering (7 days) → Case frames for 90K predicates
Parsing accuracy: 86.7% for all PAs; 97.3% for the 18.1% of PAs kept by filtering; the acquired case frames improve parsing from 86.7% to 87.4%
[Kawahara and Kurohashi, HLT2001, COLING2002, LREC2006]
Building a web corpus
1. Crawl the web
2. Extract Japanese page candidates using encoding information
   • charset attribute, perl Encode::guess_encoding()
3. Judge Japanese pages using linguistic information (20M pages)
   • Japanese postpositions (ga, wo, ni, …) > 0.5%
4. Extract sentences from each page
5. Extract Japanese sentences
   • HIRAGANA, KATAKANA, KANJI > 60%
6. Delete duplicate sentences
→ 500M Japanese sentences (Japanese: 995 / 1,000)
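Step 5's character-ratio filter can be sketched as follows. This is a minimal sketch under the assumption that "HIRAGANA, KATAKANA, KANJI > 60%" means the fraction of such characters in the sentence; the exact Unicode ranges used by the real system are not given in the slides:

```python
# Minimal sketch of the Japanese-sentence filter from step 5:
# keep a sentence if more than 60% of its characters are
# hiragana, katakana, or kanji (CJK unified ideographs).
JP_RANGES = [
    ("\u3040", "\u309F"),  # hiragana
    ("\u30A0", "\u30FF"),  # katakana
    ("\u4E00", "\u9FFF"),  # kanji
]

def is_japanese_sentence(s: str, threshold: float = 0.6) -> bool:
    if not s:
        return False
    jp = sum(1 for ch in s if any(lo <= ch <= hi for lo, hi in JP_RANGES))
    return jp / len(s) > threshold
```

Sentences of mostly Latin characters (boilerplate, code, English text) fall below the threshold and are discarded.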
もれなくプレゼント！ (A present for everyone!)
でも僕はTシャツの上に長袖のシャツ。 (But I wear a long-sleeved shirt over a T-shirt.)
今回は某アイドルの高橋一也も参加したので客が若い。 (Since Kazuya Takahashi, who is an idol, joined this time, the audience was young.)
団体Aが「まちづくり」をテーマにインターネット上で公開講座を開催しようとしている。 (Organization A is trying to hold an open class about "city planning" on the Internet.)
htaccessを置いたとたんそのディレクトリ以下で. (As soon as you put htaccess, under that directory.)
昨年の没後400年祭を機に復元した井戸を紹介する木下さん (This is Mr. Kinoshita, who introduces a well restored last year marking the fourth centennial of the death.)
恋は、真剣勝負。 (Love is a game played in earnest.)
ほめ言葉が多くって嬉しいですね。 (I'm glad to receive many compliments.)
いまだに言うでしょう。 (You still say that.)
「買いパラ」を見たと伝えれば、お買い上げ合計金額より5%引きいたします。 (If you say that you saw "Kaipara", we offer a 5% discount off your total bill.)
政治も危機的状況ですし、物資も不足しています。 (Politics is in a state of crisis, and commodities are scarce.)
思いやりのある優しい子に育ってネ。 (Grow up to be a considerate and kind person.)
Compiling case frames from the web corpus
• Collect reliable parse results (predicate-argument structures) from the web corpus
  Accuracy: 86.7% (all) → 97.3% (for the 18.1% of PAs kept)
• Semantic ambiguity, scrambling, omission → a verb and its closest argument are coupled

望遠鏡で        泳いでいる  女の子を   見た
telescope-with  swimming    girl-ACC   saw

泳いでいる  女の子を   望遠鏡で        見た
swimming    girl-ACC   telescope-with  saw
[Figure: clustering of predicate-argument examples for tsumu]
Collected couplings include:
– nimotsu-wo tsumu (load baggage), busshi-wo tsumu (load supplies)
– kuruma-ni tsumu (load onto a car), hikouki-ni tsumu (load onto an airplane), truck-ni tsumu (load onto a truck)
– jugyoin-ga tsumu (worker loads), sagyosya-ga tsumu (operator loads)
– keiken-wo tsumu (accumulate experience), sensyu-ga tsumu (player accumulates), kare-ni/ga (he)
Clustering by the closest argument separates the "load" sense (baggage, supplies onto vehicles) from the "accumulate" sense (experience).
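The coupling step illustrated above can be sketched as follows: pair each predicate occurrence with its closest preceding argument, then group instances by that pair before clustering. The instances and field layout below are illustrative, not the actual system's data format:

```python
from collections import defaultdict

# Instances are (argument, case, position, predicate, predicate_position);
# toy data following the tsumu (load / accumulate) example.
instances = [
    ("nimotsu", "wo", 3, "tsumu", 4),   # load baggage
    ("busshi",  "wo", 2, "tsumu", 3),   # load supplies
    ("keiken",  "wo", 5, "tsumu", 6),   # accumulate experience
    ("kuruma",  "ni", 1, "tsumu", 4),   # onto a car (farther from the verb)
]

def group_by_closest_argument(instances):
    """Keep only the argument closest to each predicate occurrence,
    and group occurrences by (predicate, argument, case)."""
    by_occurrence = defaultdict(list)
    for arg, case, pos, pred, ppos in instances:
        by_occurrence[(pred, ppos)].append((ppos - pos, arg, case))
    groups = defaultdict(list)
    for (pred, _), args in by_occurrence.items():
        _, arg, case = min(args)        # the closest argument wins
        groups[(pred, arg, case)].append(args)
    return groups
```

Similarity-based clustering would then merge the {nimotsu, busshi} groups into one "load" frame while keeping {keiken} as a separate "accumulate" frame.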
Case frame examples (CS: case slot; numbers are example frequencies)

yaku (1) (bake):            ga {maker:1, distributor:1, …}        wo {bread:2484, meat:1521, cake:1283, …}   de {oven:1630, frying pan:1311, …}
yaku (2) (have difficulty): ga {I:18, person:15, craftsman:10, …}  wo {hand:2950}                             ni {attack:18, action:15, son:15, …}
yaku (3) (copy):            ga {teacher:3, government:3, person:3, …}  wo {data:178, file:107, copy:9, …}     ni {R:1583, CD:664, CDR:3, …}
Statistics of the acquired case frames

                                       news      web
# of predicates                      18,246   89,243
  verb                               12,641   40,860
  adjective                             991    4,121
  noun+copula                         4,614   44,262
Average # of case frames for a verb    17.5     34.3
Average # of CS for a case frame        2.4      3.2
Average # of examples for CS           29.8     72.9
Average # of unique examples for CS     4.2     26.9
[Graph: coverage (bi-lexical dependency) vs. corpus size, for 31M, 62M, 125M, 250M, 500M, and 1G sentences; ●: similar match, ■: exact match]
(cf. Penn treebank based lexical parser: 1.5% [Bikel 04])
Case frame search is available
Related Work
• Subcategorization frame acquisition[Brent, 1993] [Ushioda et al., 1993] [Manning, 1993] [Briscoe and Carroll, 1997]…
• FrameNet [Baker et al., 1998]
• PropBank [Palmer et al., 2005]
• Unsupervised learning for English [McClosky et al., 2006]
Integrated probabilistic model for syntactic and case structure analysis
[Kawahara and Kurohashi, HLT-NAACL2006]

bangohan-wa tabe-te kaet-ta  (dinner-wa eat-te go_home-ta)

Two dependency candidates for "dinner-wa":

P(dinner-wa eat-te | go_home-ta) × P(go_home-ta | EOS) = 0.002 × 0.005
  >
P(eat-te | go_home-ta) × P(dinner-wa go_home-ta | EOS) = 0.000001 × 0.003
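The preference for the correct attachment can be checked with a toy computation. The probability values are the slide's illustrative numbers as reconstructed here, not actual model outputs:

```python
# Toy comparison of the two dependency candidates for "dinner-wa"
# in "bangohan-wa tabe-te kaet-ta" (dinner-wa eat-te go_home-ta).
p_dinner_eat_given_gohome = 0.002   # P(dinner-wa eat-te | go_home-ta)
p_gohome_given_eos = 0.005          # P(go_home-ta | EOS)
p_eat_given_gohome = 0.000001       # P(eat-te | go_home-ta): object slot unfilled, so tiny
p_dinner_gohome_given_eos = 0.003   # P(dinner-wa go_home-ta | EOS)

score_correct = p_dinner_eat_given_gohome * p_gohome_given_eos
score_wrong = p_eat_given_gohome * p_dinner_gohome_given_eos
assert score_correct > score_wrong  # "dinner" attaches to "eat"
```

The case-structure probabilities penalize "eat" left without its object, which is exactly how the case frames disambiguate the attachment.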
Integrated model for syntactic and case structure analysis

Input sentence S, dependency structure T, case structure L:

(T_best, L_best) = argmax_{(T,L)} P(T, L | S)
                 = argmax_{(T,L)} P(T, L, S) / P(S)
                 = argmax_{(T,L)} P(T, L, S)

P(T, L, S) decomposes into a product over clauses C_i, each generated from its modifiee's head b_{h_i}:

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(C_i | b_{h_i})
For the example above, the two candidates are scored as

P(CS(dinner-wa eat) | go_home, te) × P(te | ta) × P(CS(go_home) | EOS, ta) × P(ta | EOS)
  vs.
P(CS(eat) | go_home, te) × P(te | ta) × P(CS(dinner-wa go_home) | EOS, ta) × P(ta | EOS)

corresponding to

P(go_home-ta | EOS) × P(dinner-wa eat-te | go_home-ta)
  vs.
P(dinner-wa go_home-ta | EOS) × P(eat-te | go_home-ta)

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(C_i | b_{h_i})
[Figure: case frame candidates for "eat" and assignments of "dinner": eat1 {ga, wo}, eat2 {ga, wo, ni}, …, with "dinner" assigned to different slots]

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(CS_i | f_i, w_{h_i}) × P(f_i | f_{h_i})
Generative probability of case structure

P(CS_i | f_i, w_{h_i}) ≈ P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i)

– P(v_i | w_{h_i}): probability of generating predicate v_i
– P(CF_l | v_i): probability of generating case frame CF_l from predicate v_i
– P(CA_k | CF_l, f_i): probability of generating case assignment CA_k from case frame CF_l

e.g., P(eat | go_home) × P(CF_eat1 | eat)

Case frame CF_eat1:  ga {person, student, …},  wo {dinner, lunch, …}
Case assignment CA_k: dinner-wa ↔ wo slot; ga slot: no correspondence
Generative probability of case assignment

P(CA_k | CF_l, f_i) = ∏_{s_j : A(s_j)=1} P(A(s_j)=1, n_j, f_j | CF_l, f_i, s_j)
                    × ∏_{s_j : A(s_j)=0} P(A(s_j)=0 | CF_l, f_i, s_j)

e.g., P(A(wo)=1, dinner, wa | CF_eat1, te, wo) × P(A(ga)=0 | CF_eat1, te, ga)

– n_j: content word, f_j: type, s_j: case slot

Case frame CF_eat1:  ga {person, student, …},  wo {dinner, lunch, …}
Case assignment CA_k: dinner-wa ↔ wo slot; ga slot: no correspondence
argmax_{(T,L)} P(T, L, S)
  = argmax_{(T,L)} ∏_{C_i ∈ T} P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i) × P(f_i | f_{h_i})
Resources for parameter estimation

Supervised (Kyoto Text Corpus):
  – surface case: P(c_j | s_j)
  – predicate type: P(t_j | f_j, p_j)
  – topic marker, punctuation mark
Unsupervised (web corpus):
  – predicate: P(v_i | w_{h_i})  (from parse results)
  – case frame: P(CF_l | v_i)  (from the case frames)
  – case slot: P(A(s_j) ∈ {0,1} | CF_l, s_j)  (from CS analysis results)
  – words: P(n_j | CF_l, A(s_j)=1, s_j)  (from CS analysis results)
Experiments
• Resources for parameter estimation
  – Case frames: constructed from 500M web sentences
  – Parse, CS analysis results: analysis results of 6M web sentences
• Experiment for syntactic structure (675 web sentences)
  – Evaluate the head of each bunsetsu (except the last and second-to-last bunsetsu)
• Experiment for case structure (215 web sentences)
  – Evaluate case interpretation of TM phrases (~wa) and clausal modifiees
Experimental results

Dependency structure      Our method          Mere parsing
all                       0.874 (3477/3976)   0.867 (3447/3976)
NB→VB                     0.858 (1328/1547)   0.847 (1310/1547)
TM (~wa)                  0.812 (242/298)     0.819 (244/298)
others                    0.869 (1086/1249)   0.853 (1066/1249)
NB→NB                     0.946 (526/556)     0.944 (525/556)
VB→VB                     0.791 (601/760)     0.780 (593/760)
VB→NB                     0.920 (457/497)     0.911 (453/497)

Case structure            Our method          Sim-based baseline
TM phrase                 0.781 (82/105)      0.686 (72/105)
Clausal modifiee          0.781 (121/155)     0.690 (107/155)
Improved examples

水が     高い    ところから   低い   ところへ   流れる。
(water)  (high)  (place-from) (low)  (place-to) (flow)
"Water flows from a high place to a low place."

すぐに  標識用の     エビを    同港に        停泊した       当港所属調査船「おやしお丸」に                              搬送し、…
(soon)  (for tagging) (shrimp)  (same port)   (cast anchor)  (investigation ship "Oyashiomaru" belonging to this port)  (transfer)
"The shrimp for tagging were soon transferred to the investigation ship 'Oyashiomaru', belonging to this port, which had cast anchor in the same port, …"
Synonymous expression acquisition and flexible matching
Flexible Matching
• Many expressions convey almost the same meaning
  – a source of great difficulty in many NLP tasks
• Automatic extraction of synonymous expressions from a dictionary and a Web corpus [Shibata et al. IJCNLP2008]
• Flexible matching using SYNGRAPH data structure [Shibata et al. IJCNLP2008]
Automatic Acquisition of Synonymous Expressions
[Figure: acquisition sources]
Web → parenthetic expressions (e.g., BSE = bovine spongiform encephalitis), pattern-based and distributional similarity
Dictionary → husband = she, dog = spy, buy = purchase
Synonym and Hypernym Extraction from a Dictionary
• Using the definition sentence patterns
  – Hypernym
    • dinner: yugata (evening) no (of) syokuji (meal)
  – Synonym
    • ice: "ice cream" no (of) ryaku (abbreviation)
    • purchase: kau (buy) koto (matter) (one phrase)
• Wide coverage, but includes exceptional or idiosyncratic usages
  – dog: 1/2 → animal
  – dog: 2/2 = spy
  – tap water: 2/2 = strait
Distributional Similarity
• "Two terms are similar if they appear in similar contexts"
  – If two terms have similar co-occurrence words, the two terms are similar
• Calculate the distributional similarity using a web corpus (500M sentences)
  – co-occurrence in the dependency relation
  – keep co-occurrence words whose PMI (Pointwise Mutual Information) is positive
  – similarity is defined as the overlap of co-occurrence words, calculated with the Simpson coefficient
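The similarity computation described above can be sketched as follows. The dependency co-occurrence counts are toy values, and the function names are illustrative:

```python
import math
from collections import Counter

# Toy dependency co-occurrence counts (word, co-occurring word) -> frequency.
pairs = Counter({
    ("doctor", "see"): 8, ("doctor", "examine"): 6,
    ("veterinary", "see"): 3, ("veterinary", "examine"): 2,
    ("strait", "cross"): 4,
})

def pmi(w, c):
    """Pointwise mutual information of a co-occurrence pair."""
    total = sum(pairs.values())
    wc = sum(v for (a, _), v in pairs.items() if a == w)
    cc = sum(v for (_, b), v in pairs.items() if b == c)
    return math.log(pairs[(w, c)] * total / (wc * cc))

def context_set(w):
    """Co-occurrence words kept only when their PMI is positive."""
    return {c for (a, c) in pairs if a == w and pmi(a, c) > 0}

def simpson(w1, w2):
    """Simpson (overlap) coefficient of the two context sets."""
    s1, s2 = context_set(w1), context_set(w2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))
```

Here "doctor" and "veterinary" share their PMI-positive contexts, so their Simpson coefficient is high, while unrelated terms score zero.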
e.g., co-occurrence words and similar words of "doctor"

Co-occurrence word   PMI       Similar word   Simpson coefficient
see                  12.173    ENT doctor     0.754
be pronounced        11.589    veterinary     0.742
be examined          11.277    midwife        0.664
want to consult      11.024    teacher        0.613
turn white           10.506    eye doctor     0.573
be stopped by        10.281    DOCTOR         0.565
…                              …
Synonym and Hypernym Extraction from a Dictionary
• Exceptional or idiosyncratic pairs from the definition patterns can be filtered using distributional similarity
  – dog: 1/2 → animal, dog: 2/2 = spy, tap water: 2/2 = strait
    (distributional similarities: 0.419, 0.119, 0.338)
Synonym Extraction from a Web Corpus
• Extract from symmetric parenthetic expressions
  – ..A(B).., ..B(A).. → A = B
• Can extract synonyms between NEs / terminologies / neologisms, which cannot be extracted from a dictionary
  – 国際連合教育科学文化機関 = ユネスコ (UNESCO)
  – 放射性同位元素 = RI (radioisotope)
  – 携帯電話 = ケータイ (cellular phone)
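The symmetric-parenthesis idea can be sketched with a regular expression: accept A = B only if both ..A(B).. and ..B(A).. are attested. This is a simplification; the real system would need compound-noun boundary detection (here the term is assumed to start the matched span):

```python
import re
from collections import Counter

# Full-width parentheses, as used in Japanese text; \w matches
# Japanese characters in Python 3 str patterns.
PAREN = re.compile(r"(\w+)（(\w+)）")

def extract_synonyms(sentences):
    """Return synonym pairs attested in both directions: A(B) and B(A)."""
    seen = Counter()
    for s in sentences:
        for a, b in PAREN.findall(s):
            seen[(a, b)] += 1
    return {frozenset(p) for p in seen if (p[1], p[0]) in seen}
```

For example, 携帯電話（ケータイ） and ケータイ（携帯電話） together yield the pair 携帯電話 = ケータイ, while a one-directional parenthesis alone does not.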
Acquisition results (accuracy judged on 100 randomly selected terms)

                           Synonym #   Acc.   Hypernym #   Acc.
Dic                        6,867       99%    17,207       98%
Dic & Web (Similarity)     23,292      98%    9,274        96%
Web (parenthetic expr.)    5,225       94%
NLP based on Information Explosion on the Web
NLP for Information Explosion on the Web
• Compilation of a basic lexicon and robust morphological analysis
• Case frame acquisition and predicate-argument structure analysis
• Synonymous expression acquisition and flexible matching
• Open search engine infrastructure
• Information organization system
• Information credibility analysis system
Two Governmental Projects
• 情報爆発 (Cyber Infrastructure for the Information-explosion Era)
  – Grants-in-Aid for Scientific Research, MEXT (Ministry of Education, Culture, Sports, Science and Technology)
• 情報分析 (Information Analysis Project)
  – Ministry of Internal Affairs and Communications / NICT
Open search engine infrastructure
[Diagram: Yahoo/Google vs. Next-Generation Search, built on the Search Engine Infrastructure TSUBAKI, which runs on a grid computing environment and huge storage servers]

Next-Generation Search
• Reproducible search results
  – Fixed set of 100 million Japanese web pages (crawled May - July 2007)
• Web standard format for advanced NLP
  – Available via the TSUBAKI API
• Deep NLP indexing
  – Spelling variations, dependency relations and synonymous expressions
• Open search algorithm
• API without any restriction
Web Standard Format
• Problems of using web pages in NLP
  – Unclear sentence boundaries
  – Various meta-data in various tag formats
  – Spam
• A simple XML-styled data format for annotating meta-data and text-data of a web page
  – Meta-data: URL, crawl date, character encoding, title and anchor texts (in-links/out-links)
  – Text-data: sentences in a web page, and their analysis results by NLP tools
しかしさすがに電池の持ちが悪くなってきたのと、<br>たまたまキャンペーンをやっていて無料で機種変できるみたいだったから<br>
愛着のわいた携帯を手放すことにした。</p>
<p>で、折角変えるならまた長く使えるのがいいじゃない?<br>すごいいいデザインのがあって(しかもロゴがsoftbank!!)<br>これに決めた!!</p>
<p>と思ったら…</p>
<p>なんと品切れ。他の店舗に問い合わせてもどこも品切れ。<br>唯一あったのが電車で40分かかる所。</p>
<p>もちろん行きました。</p>
<p>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますw<br>また5年間使い続けるぞい!</p>
</div></div><p class="entry-footer">
<span class="post-footers">投稿者: KN006 日時: 2006年10月16日 22:05
</span>
(Issues illustrated above: sentence boundaries, layout-adjustment tags such as <br>, and meta-data such as the post footer)
<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/" OriginalEncoding="utf8" Time="2007-06-18 18:13:38">
<Text>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="10">
<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="11">
<RawString>で、折角変えるならまた長く使えるのがいいじゃない?</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="12">
<RawString>すごいいいデザインのがあって(しかもロゴがsoftbank!!)これに決めた!!と思ったら…</RawString>
</S>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="18">
<RawString>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますwまた5年間使い続けるぞい!</RawString>
</S>
</Text>
</StandardFormat>
<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/" OriginalEncoding="utf8" Time="2007-06-18 18:13:38">
<Text>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="10">
<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>
<Annotation Scheme="Knp"><![CDATA[* 14D <BGH:しかし/しかし><文頭><接続詞><係:連用>
しかし しかし しかし 接続詞 10 * 0 * 0 * 0 "代表表記:しかし/しかし" <自立><文節始>
* 4D <BGH:流石/さすが><助詞><体言><修飾><係:ニ格><格要素><連用要素>
さすが さすが さすが 副詞 8 * 0 * 0 * 0 "代表表記:流石/さすが" <自立><文節始>
に に に 助詞 9 格助詞 1 * 0 * 0 NIL <付属>
* 3D <BGH:電池/でんち><助詞><連体修飾><体言><係:ノ格>
電池 でんち 電池 名詞 6 普通名詞 1 * 0 * 0 "ドメイン:家庭・暮らし カテゴリ:人工物-その他 代表表記:電池/でんち" <名詞相当語><自立><文節始>
… 中略 …
した した する 動詞 2 * 0 サ変動詞 16 タ形 10 NIL <連体修飾><活用語><付属>
。 。 。 特殊 1 句点 1 * 0 * 0 NIL <文末><英記号><記号><付属>
EOS]]></Annotation>
</S>
(Annotated information includes part-of-speech, domain, category, and representative form)
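Extracting the raw sentences from a standard-format file can be sketched with the standard library. The tag and attribute names follow the example above; the URL and sentence content in the shortened sample below are placeholders:

```python
import xml.etree.ElementTree as ET

# A shortened standard-format sample following the structure shown above.
sample = """<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://example.com/" OriginalEncoding="utf8"
                Time="2007-06-18 18:13:38">
  <Text>
    <S Offset="0" Length="10" is_Japanese_Sentence="1" Id="1">
      <RawString>これは文です。</RawString>
    </S>
    <S Offset="10" Length="8" is_Japanese_Sentence="1" Id="2">
      <RawString>次の文です。</RawString>
    </S>
  </Text>
</StandardFormat>"""

def extract_sentences(xml_text: str):
    """Collect RawString contents of sentences marked as Japanese."""
    root = ET.fromstring(xml_text)
    return [s.findtext("RawString").strip()
            for s in root.iter("S")
            if s.get("is_Japanese_Sentence") == "1"]
```

The same traversal can be extended to read the <Annotation> elements when the parse results are needed.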
Deep NLP Indexing

Inverted Index

Page1: language, computer, problem, of
Page2: computer, problem, of
Page3: language, problem, information, of, and

information → Page3
and → Page3
of → Page1, Page2, Page3
problem → Page1, Page2, Page3
computer → Page1, Page2
language → Page1, Page3
Items in index data

Index type                       Doc. ids   Freq. in a doc.   String   DF   Sent. IDs   Position
Word                             O          O                 O        O    O           O
Dep. of words                    O          O                 O        O    X           X
Synonymous expressions           O          O                 O        X    X           O
Dep. of synonymous expressions   O          O                 O        X    X           O

DF: document frequency
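The word index above (frequencies, sentence IDs, and positions per document) can be sketched as a positional inverted index. The data layout here is a simplification for illustration:

```python
from collections import defaultdict

def build_word_index(docs):
    """Build word -> doc_id -> [freq, sentence_ids, positions].

    docs: {doc_id: [[w1, w2, ...], ...]}  (list of tokenized sentences)
    """
    index = defaultdict(lambda: defaultdict(lambda: [0, set(), []]))
    for doc_id, sentences in docs.items():
        pos = 0
        for sid, sent in enumerate(sentences, start=1):
            for w in sent:
                pos += 1
                entry = index[w][doc_id]
                entry[0] += 1         # frequency in the document
                entry[1].add(sid)     # sentence ids containing the word
                entry[2].append(pos)  # positions in the document
    return index
```

Dependency and synonymous-expression indices would be built the same way, keyed on relation pairs or synonym-set IDs instead of surface words.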
Sentences in a document (toy example)

Two sentences built from こども (child), と (to), 一緒に (together), 服 (clothes), を (wo), せんたくしました (washed / selected; the ambiguous reading splits its count in half), and 。(.)

Word index

Word       Freq.   SIDs   Positions in the document
CHILD      2.0     1,2    1, 6
CLOTH      1.0     1      2
WO         1.0     1      3
SELECT     0.5     1      4
WASH       0.5     1      4
.          2.0     1,2    5, 9
TO         1.0     2      7
TOGETHER   1.0     2      8
Dependency relation index (same example)

Dependency relation   Freq.
CHILD→CLOTH           1.0
CLOTH→WASH            0.5
CLOTH→SELECT          0.5
CHILD→TOGETHER        1.0
Synonymous expression index

Synonymous expression set                            Freq.   Position
S11412:こども {A LITTLE CHILD, CHILD}                 2.0     1, 6
S55:服 {CLOTHES, CLOTHING, GARMENTS, WEAR, DRESS}     1.0     2
S10184:選択 {SELECT, CHOOSE}                          0.5     4
S17250:洗濯 {WASH, CLEAN}                             0.5     4
S15355:一緒 {TOGETHER}                                1.0     8
Dependency index of synonymous expressions

Dependency relation between synonymous expression sets   Freq.   Position
S11412:こども → S55:服                                    1.0     1
S55:服 → S17250:洗濯                                      0.5     2
S55:服 → S10184:選択                                      0.5     2
S11412:こども → S15355:一緒                               1.0     6
Query syntax
• Natural language sentence: 京都大学への行き方 (京都大学 = Kyoto Univ., 行き方 = access)
• Phrase search: "京都大学"
• Proximity search (word): 京都大学~5W (京都 and 大学 co-occur within 5 words, in that order)
• Proximity search (sentence): 京都大学~5S (京都 and 大学 co-occur within 5 sentences, in that order)
• Combination of the above notations: 京都大学への行き方 "市バス" (city bus)
[Figure: matching patterns of query words w1 w2 w3 against documents d1 and d2, including the "within N words" proximity constraint]
Scoring method
• Score calculated from a query Q for a document d:

score(Q, d) = rel_w(Q_w, d) + rel_d(Q_d, d)

  – rel_w: score calculated from words in Q
  – rel_d: score calculated from dependency relations in Q
• e.g., Q = 子供(child) の(no) 体力(strength) 低下(decrement)
  ⇒ Q_w = {子供, 体力, 低下}, Q_d = {子供→体力, 体力→低下}
Scoring method
• Score calculated from the word indices Q_w in Q for a document d (OKAPI BM25):

rel_w(Q_w, d) = Σ_{q ∈ Q_w} qfq × (3 × fq) / (K + fq) × log( (N − n + 0.5) / (n + 0.5) )

K = 2 × ( (1 − b) + b × l / l_ave )

  – fq: the frequency of the expression q in d
  – qfq: the frequency of q in Q
  – n: the document frequency of q in 100 million pages
  – N: 1 × 10^8
  – l: the document length of d
  – l_ave: the average document length over all the pages
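The word score can be sketched directly from the formula above. The factor 3 corresponds to k1 + 1 with k1 = 2, matching the constants in K; the value of b and the default l_ave below are assumptions for illustration:

```python
import math

def okapi_weight(fq, qfq, n, N, l, l_ave, b=0.75):
    """One term of rel_w: BM25-style weight of expression q for document d."""
    K = 2 * ((1 - b) + b * l / l_ave)
    return qfq * (3 * fq) / (K + fq) * math.log((N - n + 0.5) / (n + 0.5))

def rel_w(query_words, doc_words, df, N=10**8, l_ave=500.0):
    """Sum the BM25 weights of the query words over a tokenized document."""
    l = len(doc_words)
    return sum(
        okapi_weight(doc_words.count(q), query_words.count(q),
                     df.get(q, 0), N, l, l_ave)
        for q in set(query_words))
```

A word absent from the document contributes nothing (fq = 0), and rare words (small n) get a large IDF factor.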
Scoring method
• Score calculated from the dependency relation indices Q_d in Q for a document d:

rel_d(Q_d, d) = Σ_{q ∈ Q_d} h(q, d)

h(q, d) = f(q, d)   if d includes q
        = g(q, d)   otherwise

• e.g., Q_d = { 子供(child) → 体力(strength), 体力(strength) → 低下(decrement) }
Scoring method (d includes q)
• The dependency relation q is scored like a word (OKAPI BM25):

f(q, d) = qfq × (3 × fq) / (K + fq) × log( (N − n + 0.5) / (n + 0.5) )

K = 2 × ( (1 − b) + b × l / l_ave )

  – fq: the frequency of the expression q in d
  – qfq: the frequency of q in Q
  – n: the document frequency of q in 100 million pages
  – N: 1 × 10^8
  – l: the document length of d
  – l_ave: the average document length over all the pages
Scoring method (d does not include q)
• A pseudo frequency w(q), based on word proximity, replaces fq:

g(q, d) = qfq × (3 × w(q)) / (K + w(q)) × log( (N − n + 0.5) / (n + 0.5) )

w(q) = (D − min(l(q), r(q))) / D   if min(l(q), r(q)) < D
     = 0                           otherwise

  – l(q): parent of dependency relation q
  – r(q): child of dependency relation q
  – min(q1, q2): minimum distance between q1 and q2 (# of words)
  – D: threshold of distance (D = 30)
  – n: DF value of dependency relation q1→q2
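The proximity-based pseudo frequency w(q) can be sketched over the positional word index: even when the exact dependency is not indexed for d, its parent and child words may occur near each other. The argument layout is illustrative:

```python
def w_q(parent_positions, child_positions, D=30):
    """Pseudo frequency of a dependency relation whose parent and child
    words occur in the document at the given positions."""
    if not parent_positions or not child_positions:
        return 0.0
    dist = min(abs(p - c) for p in parent_positions
                          for c in child_positions)
    return (D - dist) / D if dist < D else 0.0
```

The closer the two words co-occur, the closer w(q) gets to 1, so g(q, d) degrades gracefully toward the exact-match score f(q, d).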
Contribution of deep NLP indices (NTCIR 10M web test set)

Method                                   R-precision   P@10
Baseline                                 0.155         0.232
Word (representative form)               0.162         0.257
Dependency relation (f(q,d) + g(q,d))    0.168         0.230
Dependency relation (only f(q,d))        0.170         0.253
[Figure: TSUBAKI architecture]
A load balance server receives the user's query and parses it, then forwards it to 27 search servers, each holding index data generated from a million web pages together with the corresponding web standard format data. Each search server retrieves and ranks pages; 4 master servers merge the retrieved pages and create the search result; 16 snippet creation servers build the snippets returned to the user.
Data sizes per 100 million pages
  Title DB:   9.3 GB
  URL DB:     7.9 GB
  DF DB:      115 GB
  total:      132 GB

Index data sizes per a million pages (gzipped)
  Word:                                       11 GB
  Dep.:                                       8.9 GB
  Synonymous expressions (word and phrase):   18 GB
  Dep. of synonymous expressions:             48 GB
  total:                                      85.9 GB

Web standard format data: 31 GB (gzipped, per a million pages)
Required time per a query

Pipeline: the load balance server sends the query → query parsing → search servers retrieve pages, calculate scores and rank → master server gets titles & URLs, merges and re-ranks → snippet creation servers build snippets

Get hit count by API:                                         7.9 seconds
Get document IDs, titles and URLs of top 100 pages by API:    9.7 seconds
Ordinary search (50 pages are shown):                         32.6 seconds
Get document IDs, titles and URLs of top 1000 pages by API:   12.7 seconds

* Document IDs are necessary for obtaining cached web pages and web standard format data
TSUBAKI API
http://tsubaki.ixnlp.nii.ac.jp/api.cgi
• No user registration
• No limit on the number of API calls a day
• Provides all pages in a search result
  – cf. Yahoo! API: top 1000 pages in a search result; Google AJAX Search API: top 8 pages; (previous) Google API: top 1000 pages
• Provides web standard format data
Request parameters

Parameter          Value      Description
query              string     The query to search for (UTF-8 encoded). Required for obtaining search results.
start              integer    The starting result position to return.
results            integer    The number of results to return.
logical_operator   AND/OR     The logical operation to search for.
only_hitcount      0/1        Set to 1 to obtain a query's hit count only.
id                 integer    The document ID to obtain a cached web page or standard format data corresponding to the ID.
format             html/xml   The document type to return.
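Building a request from these parameters can be sketched with the standard library. The endpoint is the one given in the slides and may no longer be live, so this only illustrates URL construction:

```python
from urllib.parse import urlencode

# Endpoint from the slides; the service may no longer be available.
BASE = "http://tsubaki.ixnlp.nii.ac.jp/api.cgi"

def build_request(query, start=1, results=20, only_hitcount=0,
                  logical_operator="AND"):
    """Return a TSUBAKI API request URL with percent-encoded parameters."""
    params = {
        "query": query, "start": start, "results": results,
        "only_hitcount": only_hitcount, "logical_operator": logical_operator,
    }
    return BASE + "?" + urlencode(params)
```

urlencode percent-encodes the UTF-8 bytes of a Japanese query, producing the same escaping as the example request shown on the next slide.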
Example requests:
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?query=%E4%BA%AC%E9%83%BD%E8%A6%B3%E5%85%89&start=1&results=20
(the query parameter is the URI-encoded string of "京都観光" (Kyoto sightseeing))
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=html&id=06832381
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=xml&id=06832381
Conclusion
• Search engine infrastructure TSUBAKI
  – reproducible search results,
  – Web standard format for sharing pre-processed web pages,
  – indices generated by deep NLP,
  – open search algorithm, and
  – APIs without any restriction
• Available from http://tsubaki.ixnlp.nii.ac.jp/index.cgi
Information organization system
[Diagram: grid computing environment and huge storage servers → Search Engine Infrastructure TSUBAKI → Next-Generation Search]
Search result clustering
• Advantages
  – Provides a bird's-eye view of a search result
  – Provides efficient access to necessary pages
  – Surfaces low-ranked pages in a search result
• Requirements
  – Quick cluster construction
  – High-quality cluster labels
    • These affect access to necessary pages
Characteristics of our system
• Cooperation with the search engine infrastructure TSUBAKI
  – Full text data of web pages and their analyzed data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions
Distillation of labels

[Figure: candidate compound nouns are distilled in three steps]
1. Assimilate expressive divergence: 得点力アップ / 得点力UP → 得点力アップ (score improvement); 新教育課程 / 新カリキュラム → 新教育課程 (new curriculum); 教育基本法改正 / 教育基本法の改正案 / 教育基本法改正案 (amendment (bill) of the Fundamental Law of Education)
2. Eliminate inappropriate compound nouns: サイトマップ (site map) is discarded; fragments such as 教育基本 and 法改正 are discarded
3. Merge substrings: 教育基本法改正 and 教育基本法改正案 are merged
Overview of our clustering system
[Figure]
Step 1. Label acquisition (e.g., 国際捕鯨委員会 (IWC), 調査捕鯨 (scientific whaling), …)
Step 2. Cluster generation
Step 3. Cluster organization
Step 4. Display
Architecture
[Figure]
Query → Search Engine TSUBAKI → Search & page ID gathering → Web standard format collection → Compound noun extraction → Label selection & clustering → Clusters
Clustering result for the query "whaling problem"

• IWC (357 pages) △ The explanation of IWC, criticism of IWC, and others
  – 科学委員会 (Scientific Committee)
  – 年次総会 (Annual Meeting)
  – IWC総会 (IWC meeting)
  – 原住民生存捕鯨 (Aboriginal Subsistence Whaling scheme)
  – 鯨種 (species of whales), …
• 調査捕鯨 (Scientific whaling) (145 pages) ○ The explanation of, and positive or negative opinions on, scientific whaling
  – 日本の調査捕鯨 (Scientific whaling in Japan)
• 捕鯨船 (Whaling ship) (65 pages) ○ Accidents and history of whaling ships
• 南極海 (Antarctic Ocean) (51 pages)

The ranks of the web pages carrying these labels spread widely through the search result (1st, 6th, 19th, 32nd, 37th, 41st, 44th, 72nd, 94th, …), so users can find web pages that are low-ranked in a search result.
Conclusion
• Label-based search result clustering system
• Cooperation with the search engine infrastructure TSUBAKI
  – Full text data of web pages and their analyzed data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions
Information credibility analysis system
Information Credibility Analysis
1. Credibility of information contents
2. Credibility of information sender
3. Credibility estimated from document style and superficial characteristics
4. Credibility based on social evaluation of information contents/sender
1. Credibility of information contents
• Sentences in the related documents are classified into opinions, events, and facts, and opinion sentences are classified into positive opinions and negative opinions.
• Documents in each cluster should be summarized, by using multi-document summarization techniques and their extensions.
• Several relations such as similarities, oppositions, causal relations, supporting relations are detected among inner- and inter-cluster statements, which leads to the detection of logical consistency and contradiction.
[Figure: deep NLP applied to two conflicting statements about minke whales]
– ミンククジラの数は増えている (the number of minke whales is increasing)
– 問題はミンク鯨だ。絶滅しかかっている (the problem is the minke whale; it is facing extinction)
Word segmentation and identification, predicate-argument structure analysis, anaphora resolution, and flexible matching together detect the conflict between "the number is increasing" and "facing extinction".
Deep NLP ⇒ Information Credibility
2. Credibility of information sender
• Information sender:
  – individuals
    • expert or not
    • individuals identified by handle-name, and others
  – organizations
    • public organizations (administrative organs, academic associations, universities),
    • media,
    • commercial companies, and others
• Distinguished by:
  – meta-information such as URLs, page titles, anchor texts, and RSS
  – NE extraction
2. Credibility of information sender
• Check the quantity and quality of information the sender has produced so far.
• Information quality can be evaluated based on the other three criteria.
• The speciality of an individual or organization is important; it can be detected by topic detection.
3. Credibility estimated from document style and superficial characteristics
• Guessed by integrating many criteria, such as sentential style (formal or informal, written or spoken language), page layout, appropriateness of links in the page, and so on.
• cf. Persuasive technology at Stanford University, and Google News automatic assembling criteria.
4. Credibility based on social evaluation of information contents/sender
• How they are evaluated by others.
• One way is to perform opinion mining from the web based on NLP, and collect and count positive and negative evaluations for the information content/sender.
• Another way is to directly use rankings and comments of others, as in social network frameworks.
Information Credibility Analysis System WISDOM (2006~)
[Screenshots: analysis of the query "Agaricus": page clustering, sender classification, opinion distribution, ontology, Q&A]
Summary
• Much linguistic and extra-linguistic knowledge can be acquired from the web corpus using a high-performance computing environment.
• Deep NLP, especially accurate predicate-argument structure analysis and flexible matching, provides key technologies for next-generation search.
• Automatic information credibility evaluation is not easy, but an information organization system and a multi-faceted information analysis system greatly help users' own evaluation.
References
• D. Kawahara and S. Kurohashi. Case frame compilation from the web using high-performance computing. In Proceedings of LREC2006, 2006.
• D. Kawahara and S. Kurohashi. A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. In Proceedings of HLT-NAACL2006, pages 176-183, 2006.
• H. Miyamori, S. Akamine, Y. Kato, K. Kaneiwa, K. Sumi, K. Inui, and S. Kurohashi. Evaluation data and prototype system WISDOM for information credibility analysis. In Proceedings of the First International Symposium on Universal Communication, 2007.
• T. Shibata, M. Odani, J. Harashima, T. Oonishi, and S. Kurohashi. SYNGRAPH: A flexible matching method based on synonymous expression extraction from an ordinary dictionary and a web corpus. In Proceedings of IJCNLP2008, 2008.
• K. Shinzato, T. Shibata, D. Kawahara, C. Hashimoto, and S. Kurohashi. TSUBAKI: An open search engine infrastructure for developing new information access methodology. In Proceedings of IJCNLP2008, 2008.