Natural Language Processing based on and for Information Explosion on the Web
Sadao Kurohashi, Kyoto University / NICT
(TCS NLP Winter School 2008, 2008/1/5, IIIT, Hyderabad, India)
Search
• The web influences:
– People's daily life
– Enterprise management
– Governmental policy decisions
• 75% of people would rather use the web to answer their questions than ask their own family members.
• Service-industry workers spend 30% of their time on search.
• 50% of complex queries go unanswered.
High-Performance Computing Environment
800 CPU cores, 100 TB storage
[Figure: deep NLP applied to two conflicting statements about minke whales]
– ミンククジラの数は増えている (the number of minke whales is increasing)
– 問題はミンク鯨だ。絶滅しかかっている (the problem is the minke whale; it is facing extinction)
Word segmentation and identification, predicate-argument structure analysis, anaphora resolution, and flexible matching together detect the conflict between "the number is increasing" and "facing extinction".
Deep NLP ⇒ Information Credibility
NLP based on Information Explosion on the Web
NLP for Information Explosion on the Web
• Compilation of a basic lexicon and robust morphological analysis
• Case frame acquisition and predicate-argument structure analysis
• Synonymous expression acquisition and flexible matching
• Open search engine infrastructure
• Information organization system
• Information credibility analysis system
Japanese
サッカーのカメルーン代表が、ケニアで大統領選をめぐり暴動が発生していることを受け、アフリカ選手権(20日開幕、ガーナ)に備えて同国内で行う予定だった10日間の練習合宿を取りやめたことが2日、分かった。AFP通信が伝えた。合宿中に予定されていたケニア代表との強化試合も中止となった。
(Sample news text: It emerged on the 2nd that Cameroon's national soccer team cancelled a planned 10-day training camp in Kenya, ahead of the African Cup of Nations opening on the 20th in Ghana, because riots had broken out there over the presidential election; a planned friendly against the Kenyan team was also cancelled. Reported by AFP.)
http://www.asahi.com/sports/update/0103/JJT200801030002.html
Characteristics of Japanese
• No space between words ⇒ segmentation
• Four sets of characters ⇒ synonyms
– HIRAGANA e.g., いんど
– KATAKANA e.g., インド
– Chinese characters e.g., 印度(KANJI)
– English alphabet e.g., India
a. Head final
b. Free word order
c. Postpositions function as case markers
d. Hidden case markers
e. Omission of case components
Characteristics of Japanese
a. Head final
b. Free word order
c. Postpositions function as case markers
Characteristics of Japanese
Kare-ga  Deutschgo-wo  hanasu.
he-NOM   German-ACC    speak
(He speaks German.)
d. Hidden case markers

Characteristics of Japanese

Kare-wa  Deutschgo-wo  hanasu.
he-TOP   German-ACC    speak
(He speaks German. Does the topic marker wa hide ga or wo?)

Deutschgo-wo  hanasu  sensei …
German-ACC    speak   teacher
(the teacher who speaks German: sensei fills the hidden ga slot)
Characteristics of Japanese

e. Omission of case components

φ-ga   Deutschgo-wo  hanasu  sensei-wo    yatotta.
φ-NOM  German-ACC    speak   teacher-ACC  hired
(φ hired a teacher who speaks German.)
Compilation of a basic lexicon and robust morphological analysis
Basic Lexicon
• Dictionaries for humans: 200,000 entries
• EDR: 200,000 entries
→ Side effects for segmentation; hard to maintain
⇒ 30,000 words (97% coverage for news texts)
Spelling Variation
蟹 (kanji) / かに (hiragana) / カニ (katakana)  (crab)
→ representative form (ID): 蟹/かに
Spelling Variation
落ちる / 落る / おちる (drop) → 落ちる/おちる
綺麗だ / 奇麗だ / きれいだ (beautiful) → 綺麗だ/きれいだ
子供 / 子ども / こども (child) → 子供/こども
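The normalization above can be sketched as a simple lookup from each spelling variant to its representative form. The table below only contains the slide's own examples; the names `VARIANTS` and `representative_form` are illustrative, not the actual lexicon's API:

```python
# Hypothetical variant table distilled from the basic lexicon's
# spelling-variation entries (examples from the slides only).
VARIANTS = {
    "蟹": "蟹/かに", "かに": "蟹/かに", "カニ": "蟹/かに",                      # crab
    "落ちる": "落ちる/おちる", "落る": "落ちる/おちる", "おちる": "落ちる/おちる",  # drop
    "子供": "子供/こども", "子ども": "子供/こども", "こども": "子供/こども",      # child
}

def representative_form(word: str) -> str:
    """Return the representative form (ID) of a word, or the word itself."""
    return VARIANTS.get(word, word)
```

With such a mapping, all variants of "child" index to the same ID, so matching and indexing see them as one word.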
Other Information for Basic Lexicon
• Possibility Form– 書ける(can-write) → 書く(write)
• Honorific Form– 召し上がる(eat) → 食べる(eat)
• Category (22 classes, e.g., <human> <organization> …)
• Domain (12 classes, e.g., <business> <education> …)
Robust Morphological Analysis
上海ガニを       ばくばく           食べた
Shanghai crab-ACC  in big mouthfuls  ate
(BAKU-BAKU: onomatopoeia; the voiced GANI must be identified with KANI (カニ))
Case frame acquisition and predicate-argument structure analysis
Language Understanding and Common sense
Mary ate the salad with a fork.
Mary ate the salad with mushrooms.

クロールで      泳いでいる  女の子を   見た
crawl-with      swimming    girl-ACC   saw
(I saw a girl swimming the crawl.)

望遠鏡で        泳いでいる  女の子を   見た
telescope-with  swimming    girl-ACC   saw
(I saw, with a telescope, a girl swimming.)
Case frame
泳ぐ swim:  ga {人 person, 子 child, …},  de {クロール crawl, 平泳ぎ breaststroke, …},  wo {海 sea, 大海 ocean, …}
見る see:   ga {人 person, 者 person, …},  de {望遠鏡 telescope, 双眼鏡 binoculars, …},  wo {姿 figure, 人 person, …}
[Figure: case frame compilation pipeline on a PC cluster (350 CPUs)]
WEB → 500M sentences (20M pages) → Parsing (KNP) + Filtering (1 day) → Predicate-argument structures → Clustering (7 days) → Case frames for 90K predicates
Parsing accuracy: 86.7% for all PAs; 97.3% for the 18.1% of PAs kept by filtering; the acquired case frames improve parsing from 86.7% to 87.4%
[Kawahara and Kurohashi, HLT2001, COLING2002, LREC2006]
Building a web corpus
1. Crawl the web
2. Extract Japanese page candidates using encoding information
   • charset attribute, perl Encode::guess_encoding()
3. Judge Japanese pages using linguistic information (20M pages)
   • Japanese postpositions (ga, wo, ni, …) > 0.5%
4. Extract sentences from each page
5. Extract Japanese sentences
   • HIRAGANA, KATAKANA, KANJI > 60%
6. Delete duplicate sentences
→ 500M Japanese sentences (Japanese: 995 / 1,000)
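Step 5's character-ratio filter can be sketched as follows. This is a minimal sketch under the assumption that "HIRAGANA, KATAKANA, KANJI > 60%" means the fraction of such characters in the sentence; the exact Unicode ranges used by the real system are not given in the slides:

```python
# Minimal sketch of the Japanese-sentence filter from step 5:
# keep a sentence if more than 60% of its characters are
# hiragana, katakana, or kanji (CJK unified ideographs).
JP_RANGES = [
    ("\u3040", "\u309F"),  # hiragana
    ("\u30A0", "\u30FF"),  # katakana
    ("\u4E00", "\u9FFF"),  # kanji
]

def is_japanese_sentence(s: str, threshold: float = 0.6) -> bool:
    if not s:
        return False
    jp = sum(1 for ch in s if any(lo <= ch <= hi for lo, hi in JP_RANGES))
    return jp / len(s) > threshold
```

Sentences of mostly Latin characters (boilerplate, code, English text) fall below the threshold and are discarded.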
もれなくプレゼント！ (A present for everyone!)
でも僕はTシャツの上に長袖のシャツ。 (But I wear a long-sleeved shirt over a T-shirt.)
今回は某アイドルの高橋一也も参加したので客が若い。 (Since Kazuya Takahashi, who is an idol, joined this time, the audience was young.)
団体Aが「まちづくり」をテーマにインターネット上で公開講座を開催しようとしている。 (Organization A is trying to hold an open class about "city planning" on the Internet.)
htaccessを置いたとたんそのディレクトリ以下で. (As soon as you put htaccess, under that directory.)
昨年の没後400年祭を機に復元した井戸を紹介する木下さん (This is Mr. Kinoshita, who introduces a well restored last year marking the fourth centennial of the death.)
恋は、真剣勝負。 (Love is a game played in earnest.)
ほめ言葉が多くって嬉しいですね。 (I'm glad to receive many compliments.)
いまだに言うでしょう。 (You still say that.)
「買いパラ」を見たと伝えれば、お買い上げ合計金額より5%引きいたします。 (If you say that you saw "Kaipara", we offer a 5% discount off your total bill.)
政治も危機的状況ですし、物資も不足しています。 (Politics is in a state of crisis, and commodities are scarce.)
思いやりのある優しい子に育ってネ。 (Grow up to be a considerate and kind person.)
Compiling case frames from the web corpus
• Collect reliable parse results (predicate-argument structures) from the web corpus
  Accuracy: 86.7% (all) → 97.3% (for the 18.1% of PAs kept)
• Semantic ambiguity, scrambling, omission → a verb and its closest argument are coupled

望遠鏡で        泳いでいる  女の子を   見た
telescope-with  swimming    girl-ACC   saw

泳いでいる  女の子を   望遠鏡で        見た
swimming    girl-ACC   telescope-with  saw
[Figure: clustering of predicate-argument examples for tsumu]
Collected couplings include:
– nimotsu-wo tsumu (load baggage), busshi-wo tsumu (load supplies)
– kuruma-ni tsumu (load onto a car), hikouki-ni tsumu (load onto an airplane), truck-ni tsumu (load onto a truck)
– jugyoin-ga tsumu (worker loads), sagyosya-ga tsumu (operator loads)
– keiken-wo tsumu (accumulate experience), sensyu-ga tsumu (player accumulates), kare-ni/ga (he)
Clustering by the closest argument separates the "load" sense (baggage, supplies onto vehicles) from the "accumulate" sense (experience).
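The coupling step illustrated above can be sketched as follows: pair each predicate occurrence with its closest preceding argument, then group instances by that pair before clustering. The instances and field layout below are illustrative, not the actual system's data format:

```python
from collections import defaultdict

# Instances are (argument, case, position, predicate, predicate_position);
# toy data following the tsumu (load / accumulate) example.
instances = [
    ("nimotsu", "wo", 3, "tsumu", 4),   # load baggage
    ("busshi",  "wo", 2, "tsumu", 3),   # load supplies
    ("keiken",  "wo", 5, "tsumu", 6),   # accumulate experience
    ("kuruma",  "ni", 1, "tsumu", 4),   # onto a car (farther from the verb)
]

def group_by_closest_argument(instances):
    """Keep only the argument closest to each predicate occurrence,
    and group occurrences by (predicate, argument, case)."""
    by_occurrence = defaultdict(list)
    for arg, case, pos, pred, ppos in instances:
        by_occurrence[(pred, ppos)].append((ppos - pos, arg, case))
    groups = defaultdict(list)
    for (pred, _), args in by_occurrence.items():
        _, arg, case = min(args)        # the closest argument wins
        groups[(pred, arg, case)].append(args)
    return groups
```

Similarity-based clustering would then merge the {nimotsu, busshi} groups into one "load" frame while keeping {keiken} as a separate "accumulate" frame.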
Case frame examples (CS: case slot; numbers are example frequencies)

yaku (1) (bake):            ga {maker:1, distributor:1, …}        wo {bread:2484, meat:1521, cake:1283, …}   de {oven:1630, frying pan:1311, …}
yaku (2) (have difficulty): ga {I:18, person:15, craftsman:10, …}  wo {hand:2950}                             ni {attack:18, action:15, son:15, …}
yaku (3) (copy):            ga {teacher:3, government:3, person:3, …}  wo {data:178, file:107, copy:9, …}     ni {R:1583, CD:664, CDR:3, …}
Statistics of the acquired case frames

                                       news      web
# of predicates                      18,246   89,243
  verb                               12,641   40,860
  adjective                             991    4,121
  noun+copula                         4,614   44,262
Average # of case frames for a verb    17.5     34.3
Average # of CS for a case frame        2.4      3.2
Average # of examples for CS           29.8     72.9
Average # of unique examples for CS     4.2     26.9
[Graph: coverage (bi-lexical dependency) vs. corpus size, for 31M, 62M, 125M, 250M, 500M, and 1G sentences; ●: similar match, ■: exact match]
(cf. Penn treebank based lexical parser: 1.5% [Bikel 04])
Case frame search is available
Related Work
• Subcategorization frame acquisition[Brent, 1993] [Ushioda et al., 1993] [Manning, 1993] [Briscoe and Carroll, 1997]…
• FrameNet [Baker et al., 1998]
• PropBank [Palmer et al., 2005]
• Unsupervised learning for English [McClosky et al., 2006]
Integrated probabilistic model for syntactic and case structure analysis
[Kawahara and Kurohashi, HLT-NAACL2006]

bangohan-wa tabe-te kaet-ta  (dinner-wa eat-te go_home-ta)

Two dependency candidates for "dinner-wa":

P(dinner-wa eat-te | go_home-ta) × P(go_home-ta | EOS) = 0.002 × 0.005
  >
P(eat-te | go_home-ta) × P(dinner-wa go_home-ta | EOS) = 0.000001 × 0.003
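The preference for the correct attachment can be checked with a toy computation. The probability values are the slide's illustrative numbers as reconstructed here, not actual model outputs:

```python
# Toy comparison of the two dependency candidates for "dinner-wa"
# in "bangohan-wa tabe-te kaet-ta" (dinner-wa eat-te go_home-ta).
p_dinner_eat_given_gohome = 0.002   # P(dinner-wa eat-te | go_home-ta)
p_gohome_given_eos = 0.005          # P(go_home-ta | EOS)
p_eat_given_gohome = 0.000001       # P(eat-te | go_home-ta): object slot unfilled, so tiny
p_dinner_gohome_given_eos = 0.003   # P(dinner-wa go_home-ta | EOS)

score_correct = p_dinner_eat_given_gohome * p_gohome_given_eos
score_wrong = p_eat_given_gohome * p_dinner_gohome_given_eos
assert score_correct > score_wrong  # "dinner" attaches to "eat"
```

The case-structure probabilities penalize "eat" left without its object, which is exactly how the case frames disambiguate the attachment.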
Integrated model for syntactic and case structure analysis

Input sentence S, dependency structure T, case structure L:

(T_best, L_best) = argmax_{(T,L)} P(T, L | S)
                 = argmax_{(T,L)} P(T, L, S) / P(S)
                 = argmax_{(T,L)} P(T, L, S)

P(T, L, S) decomposes into a product over clauses C_i, each generated from its modifiee's head b_{h_i}:

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(C_i | b_{h_i})
For the example above, the two candidates are scored as

P(CS(dinner-wa eat) | go_home, te) × P(te | ta) × P(CS(go_home) | EOS, ta) × P(ta | EOS)
  vs.
P(CS(eat) | go_home, te) × P(te | ta) × P(CS(dinner-wa go_home) | EOS, ta) × P(ta | EOS)

corresponding to

P(go_home-ta | EOS) × P(dinner-wa eat-te | go_home-ta)
  vs.
P(dinner-wa go_home-ta | EOS) × P(eat-te | go_home-ta)

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(C_i | b_{h_i})
[Figure: case frame candidates for "eat" and assignments of "dinner": eat1 {ga, wo}, eat2 {ga, wo, ni}, …, with "dinner" assigned to different slots]

argmax_{(T,L)} P(T, L, S) = argmax_{(T,L)} ∏_{C_i ∈ T} P(CS_i | f_i, w_{h_i}) × P(f_i | f_{h_i})
Generative probability of case structure

P(CS_i | f_i, w_{h_i}) ≈ P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i)

– P(v_i | w_{h_i}): probability of generating predicate v_i
– P(CF_l | v_i): probability of generating case frame CF_l from predicate v_i
– P(CA_k | CF_l, f_i): probability of generating case assignment CA_k from case frame CF_l

e.g., P(eat | go_home) × P(CF_eat1 | eat)

Case frame CF_eat1:  ga {person, student, …},  wo {dinner, lunch, …}
Case assignment CA_k: dinner-wa ↔ wo slot; ga slot: no correspondence
Generative probability of case assignment

P(CA_k | CF_l, f_i) = ∏_{s_j : A(s_j)=1} P(A(s_j)=1, n_j, f_j | CF_l, f_i, s_j)
                    × ∏_{s_j : A(s_j)=0} P(A(s_j)=0 | CF_l, f_i, s_j)

e.g., P(A(wo)=1, dinner, wa | CF_eat1, te, wo) × P(A(ga)=0 | CF_eat1, te, ga)

– n_j: content word, f_j: type, s_j: case slot

Case frame CF_eat1:  ga {person, student, …},  wo {dinner, lunch, …}
Case assignment CA_k: dinner-wa ↔ wo slot; ga slot: no correspondence
argmax_{(T,L)} P(T, L, S)
  = argmax_{(T,L)} ∏_{C_i ∈ T} P(v_i | w_{h_i}) × P(CF_l | v_i) × P(CA_k | CF_l, f_i) × P(f_i | f_{h_i})
Resources for parameter estimation

Supervised (Kyoto Text Corpus):
  – surface case: P(c_j | s_j)
  – predicate type: P(t_j | f_j, p_j)
  – topic marker, punctuation mark
Unsupervised (web corpus):
  – predicate: P(v_i | w_{h_i})  (from parse results)
  – case frame: P(CF_l | v_i)  (from the case frames)
  – case slot: P(A(s_j) ∈ {0,1} | CF_l, s_j)  (from CS analysis results)
  – words: P(n_j | CF_l, A(s_j)=1, s_j)  (from CS analysis results)
Experiments
• Resources for parameter estimation
  – Case frames: constructed from 500M web sentences
  – Parse, CS analysis results: analysis results of 6M web sentences
• Experiment for syntactic structure (675 web sentences)
  – Evaluate the head of each bunsetsu (except the last and second-to-last bunsetsu)
• Experiment for case structure (215 web sentences)
  – Evaluate case interpretation of TM phrases (~wa) and clausal modifiees
Experimental results

Dependency structure      Our method          Mere parsing
all                       0.874 (3477/3976)   0.867 (3447/3976)
NB→VB                     0.858 (1328/1547)   0.847 (1310/1547)
TM (~wa)                  0.812 (242/298)     0.819 (244/298)
others                    0.869 (1086/1249)   0.853 (1066/1249)
NB→NB                     0.946 (526/556)     0.944 (525/556)
VB→VB                     0.791 (601/760)     0.780 (593/760)
VB→NB                     0.920 (457/497)     0.911 (453/497)

Case structure            Our method          Sim-based baseline
TM phrase                 0.781 (82/105)      0.686 (72/105)
Clausal modifiee          0.781 (121/155)     0.690 (107/155)
Improved examples

水が     高い    ところから   低い   ところへ   流れる。
(water)  (high)  (place-from) (low)  (place-to) (flow)
"Water flows from a high place to a low place."

すぐに  標識用の     エビを    同港に        停泊した       当港所属調査船「おやしお丸」に                              搬送し、…
(soon)  (for tagging) (shrimp)  (same port)   (cast anchor)  (investigation ship "Oyashiomaru" belonging to this port)  (transfer)
"The shrimp for tagging were soon transferred to the investigation ship 'Oyashiomaru', belonging to this port, which had cast anchor in the same port, …"
Synonymous expression acquisition and flexible matching
Flexible Matching
• Many expressions convey almost the same meaning
  – a source of great difficulty in many NLP tasks
• Automatic extraction of synonymous expressions from a dictionary and a Web corpus [Shibata et al. IJCNLP2008]
• Flexible matching using SYNGRAPH data structure [Shibata et al. IJCNLP2008]
Automatic Acquisition of Synonymous Expressions
[Figure: acquisition sources]
Web → parenthetic expressions (e.g., BSE = bovine spongiform encephalitis), pattern-based and distributional similarity
Dictionary → husband = she, dog = spy, buy = purchase
Synonym and Hypernym Extraction from a Dictionary
• Using the definition sentence patterns
  – Hypernym
    • dinner: yugata (evening) no (of) syokuji (meal)
  – Synonym
    • ice: "ice cream" no (of) ryaku (abbreviation)
    • purchase: kau (buy) koto (matter) (one phrase)
• Wide coverage, but includes exceptional or idiosyncratic usages
  – dog: 1/2 → animal
  – dog: 2/2 = spy
  – tap water: 2/2 = strait
Distributional Similarity
• "Two terms are similar if they appear in similar contexts"
  – If two terms have similar co-occurrence words, the two terms are similar
• Calculate the distributional similarity using a web corpus (500M sentences)
  – co-occurrence in the dependency relation
  – keep co-occurrence words whose PMI (Pointwise Mutual Information) is positive
  – similarity is defined as the overlap of co-occurrence words, calculated with the Simpson coefficient
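The similarity computation described above can be sketched as follows. The dependency co-occurrence counts are toy values, and the function names are illustrative:

```python
import math
from collections import Counter

# Toy dependency co-occurrence counts (word, co-occurring word) -> frequency.
pairs = Counter({
    ("doctor", "see"): 8, ("doctor", "examine"): 6,
    ("veterinary", "see"): 3, ("veterinary", "examine"): 2,
    ("strait", "cross"): 4,
})

def pmi(w, c):
    """Pointwise mutual information of a co-occurrence pair."""
    total = sum(pairs.values())
    wc = sum(v for (a, _), v in pairs.items() if a == w)
    cc = sum(v for (_, b), v in pairs.items() if b == c)
    return math.log(pairs[(w, c)] * total / (wc * cc))

def context_set(w):
    """Co-occurrence words kept only when their PMI is positive."""
    return {c for (a, c) in pairs if a == w and pmi(a, c) > 0}

def simpson(w1, w2):
    """Simpson (overlap) coefficient of the two context sets."""
    s1, s2 = context_set(w1), context_set(w2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))
```

Here "doctor" and "veterinary" share their PMI-positive contexts, so their Simpson coefficient is high, while unrelated terms score zero.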
e.g., co-occurrence words and similar words of "doctor"

Co-occurrence word   PMI       Similar word   Simpson coefficient
see                  12.173    ENT doctor     0.754
be pronounced        11.589    veterinary     0.742
be examined          11.277    midwife        0.664
want to consult      11.024    teacher        0.613
turn white           10.506    eye doctor     0.573
be stopped by        10.281    DOCTOR         0.565
…                              …
Synonym and Hypernym Extraction from a Dictionary
• Exceptional or idiosyncratic pairs from the definition patterns can be filtered using distributional similarity
  – dog: 1/2 → animal, dog: 2/2 = spy, tap water: 2/2 = strait
    (distributional similarities: 0.419, 0.119, 0.338)
Synonym Extraction from a Web Corpus
• Extract from symmetric parenthetic expressions
  – ..A(B).., ..B(A).. → A = B
• Can extract synonyms between NEs / terminologies / neologisms, which cannot be extracted from a dictionary
  – 国際連合教育科学文化機関 = ユネスコ (UNESCO)
  – 放射性同位元素 = RI (radioisotope)
  – 携帯電話 = ケータイ (cellular phone)
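The symmetric-parenthesis idea can be sketched with a regular expression: accept A = B only if both ..A(B).. and ..B(A).. are attested. This is a simplification; the real system would need compound-noun boundary detection (here the term is assumed to start the matched span):

```python
import re
from collections import Counter

# Full-width parentheses, as used in Japanese text; \w matches
# Japanese characters in Python 3 str patterns.
PAREN = re.compile(r"(\w+)（(\w+)）")

def extract_synonyms(sentences):
    """Return synonym pairs attested in both directions: A(B) and B(A)."""
    seen = Counter()
    for s in sentences:
        for a, b in PAREN.findall(s):
            seen[(a, b)] += 1
    return {frozenset(p) for p in seen if (p[1], p[0]) in seen}
```

For example, 携帯電話（ケータイ） and ケータイ（携帯電話） together yield the pair 携帯電話 = ケータイ, while a one-directional parenthesis alone does not.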
Acquisition results (accuracy judged on 100 randomly selected terms)

                           Synonym #   Acc.   Hypernym #   Acc.
Dic                        6,867       99%    17,207       98%
Dic & Web (Similarity)     23,292      98%    9,274        96%
Web (parenthetic expr.)    5,225       94%
NLP based on Information Explosion on the Web
NLP for Information Explosion on the Web
• Compilation of a basic lexicon and robust morphological analysis
• Case frame acquisition and predicate-argument structure analysis
• Synonymous expression acquisition and flexible matching
• Open search engine infrastructure
• Information organization system
• Information credibility analysis system
Two Governmental Projects
• 情報爆発 (Cyber Infrastructure for the Information-explosion Era)
  – Grants-in-Aid for Scientific Research, MEXT (Ministry of Education, Culture, Sports, Science and Technology)
• 情報分析 (Information Analysis Project)
  – Ministry of Internal Affairs and Communications / NICT
Open search engine infrastructure
[Diagram: Yahoo/Google vs. Next-Generation Search, built on the Search Engine Infrastructure TSUBAKI, which runs on a grid computing environment and huge storage servers]

Next-Generation Search
• Reproducible search results
  – Fixed set of 100 million Japanese web pages (crawled May - July 2007)
• Web standard format for advanced NLP
  – Available via the TSUBAKI API
• Deep NLP indexing
  – Spelling variations, dependency relations and synonymous expressions
• Open search algorithm
• API without any restriction
Web Standard Format
• Problems of using web pages in NLP
  – Unclear sentence boundaries
  – Various meta-data in various tag formats
  – Spam
• A simple XML-styled data format for annotating meta-data and text-data of a web page
  – Meta-data: URL, crawl date, character encoding, title and anchor texts (in-links/out-links)
  – Text-data: sentences in a web page, and their analysis results by NLP tools
しかしさすがに電池の持ちが悪くなってきたのと、<br>たまたまキャンペーンをやっていて無料で機種変できるみたいだったから<br>
愛着のわいた携帯を手放すことにした。</p>
<p>で、折角変えるならまた長く使えるのがいいじゃない?<br>すごいいいデザインのがあって(しかもロゴがsoftbank!!)<br>これに決めた!!</p>
<p>と思ったら…</p>
<p>なんと品切れ。他の店舗に問い合わせてもどこも品切れ。<br>唯一あったのが電車で40分かかる所。</p>
<p>もちろん行きました。</p>
<p>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますw<br>また5年間使い続けるぞい!</p>
</div></div><p class="entry-footer">
<span class="post-footers">投稿者: KN006 日時: 2006年10月16日 22:05
</span>
(Issues illustrated above: sentence boundaries, layout-adjustment tags such as <br>, and meta-data such as the post footer)
<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/" OriginalEncoding="utf8" Time="2007-06-18 18:13:38">
<Text>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="10">
<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="11">
<RawString>で、折角変えるならまた長く使えるのがいいじゃない?</RawString>
</S>
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="12">
<RawString>すごいいいデザインのがあって(しかもロゴがsoftbank!!)これに決めた!!と思ったら…</RawString>
</S>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="18">
<RawString>そこまでして手に入れた携帯だから前以上に既に愛着がわいてますwまた5年間使い続けるぞい!</RawString>
</S>
</Text>
</StandardFormat>
<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://nlp.kuee.kyoto-u.ac.jp/blog/KUNTT_blog/2006/10/" OriginalEncoding="utf8" Time="2007-06-18 18:13:38">
<Text>
…中略…
<S Offset="3490" Length="1070" is_Japanese_Sentence="1" Id="10">
<RawString>しかしさすがに電池の持ちが悪くなってきたのと、たまたまキャンペーンをやっていて無料で機種変できるみたいだったから愛着のわいた携帯を手放すことにした。</RawString>
<Annotation Scheme="Knp"><![CDATA[* 14D <BGH:しかし/しかし><文頭><接続詞><係:連用>
しかし しかし しかし 接続詞 10 * 0 * 0 * 0 "代表表記:しかし/しかし" <自立><文節始>
* 4D <BGH:流石/さすが><助詞><体言><修飾><係:ニ格><格要素><連用要素>
さすが さすが さすが 副詞 8 * 0 * 0 * 0 "代表表記:流石/さすが" <自立><文節始>
に に に 助詞 9 格助詞 1 * 0 * 0 NIL <付属>
* 3D <BGH:電池/でんち><助詞><連体修飾><体言><係:ノ格>
電池 でんち 電池 名詞 6 普通名詞 1 * 0 * 0 "ドメイン:家庭・暮らし カテゴリ:人工物-その他 代表表記:電池/でんち" <名詞相当語><自立><文節始>
… 中略 …
した した する 動詞 2 * 0 サ変動詞 16 タ形 10 NIL <連体修飾><活用語><付属>
。 。 。 特殊 1 句点 1 * 0 * 0 NIL <文末><英記号><記号><付属>
EOS]]></Annotation>
</S>
(Annotated information includes part-of-speech, domain, category, and representative form)
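Extracting the raw sentences from a standard-format file can be sketched with the standard library. The tag and attribute names follow the example above; the URL and sentence content in the shortened sample below are placeholders:

```python
import xml.etree.ElementTree as ET

# A shortened standard-format sample following the structure shown above.
sample = """<?xml version="1.0" encoding="utf-8"?>
<StandardFormat Url="http://example.com/" OriginalEncoding="utf8"
                Time="2007-06-18 18:13:38">
  <Text>
    <S Offset="0" Length="10" is_Japanese_Sentence="1" Id="1">
      <RawString>これは文です。</RawString>
    </S>
    <S Offset="10" Length="8" is_Japanese_Sentence="1" Id="2">
      <RawString>次の文です。</RawString>
    </S>
  </Text>
</StandardFormat>"""

def extract_sentences(xml_text: str):
    """Collect RawString contents of sentences marked as Japanese."""
    root = ET.fromstring(xml_text)
    return [s.findtext("RawString").strip()
            for s in root.iter("S")
            if s.get("is_Japanese_Sentence") == "1"]
```

The same traversal can be extended to read the <Annotation> elements when the parse results are needed.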
Deep NLP Indexing

Inverted Index

Page1: language, computer, problem, of
Page2: computer, problem, of
Page3: language, problem, information, of, and

information → Page3
and → Page3
of → Page1, Page2, Page3
problem → Page1, Page2, Page3
computer → Page1, Page2
language → Page1, Page3
Items in index data

Index type                       Doc. ids   Freq. in a doc.   String   DF   Sent. IDs   Position
Word                             O          O                 O        O    O           O
Dep. of words                    O          O                 O        O    X           X
Synonymous expressions           O          O                 O        X    X           O
Dep. of synonymous expressions   O          O                 O        X    X           O

DF: document frequency
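The word index above (frequencies, sentence IDs, and positions per document) can be sketched as a positional inverted index. The data layout here is a simplification for illustration:

```python
from collections import defaultdict

def build_word_index(docs):
    """Build word -> doc_id -> [freq, sentence_ids, positions].

    docs: {doc_id: [[w1, w2, ...], ...]}  (list of tokenized sentences)
    """
    index = defaultdict(lambda: defaultdict(lambda: [0, set(), []]))
    for doc_id, sentences in docs.items():
        pos = 0
        for sid, sent in enumerate(sentences, start=1):
            for w in sent:
                pos += 1
                entry = index[w][doc_id]
                entry[0] += 1         # frequency in the document
                entry[1].add(sid)     # sentence ids containing the word
                entry[2].append(pos)  # positions in the document
    return index
```

Dependency and synonymous-expression indices would be built the same way, keyed on relation pairs or synonym-set IDs instead of surface words.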
Sentences in a document (toy example)

Two sentences built from こども (child), と (to), 一緒に (together), 服 (clothes), を (wo), せんたくしました (washed / selected; the ambiguous reading splits its count in half), and 。(.)

Word index

Word       Freq.   SIDs   Positions in the document
CHILD      2.0     1,2    1, 6
CLOTH      1.0     1      2
WO         1.0     1      3
SELECT     0.5     1      4
WASH       0.5     1      4
.          2.0     1,2    5, 9
TO         1.0     2      7
TOGETHER   1.0     2      8
Dependency relation index (same example)

Dependency relation   Freq.
CHILD→CLOTH           1.0
CLOTH→WASH            0.5
CLOTH→SELECT          0.5
CHILD→TOGETHER        1.0
Synonymous expression index

Synonymous expression set                            Freq.   Position
S11412:こども {A LITTLE CHILD, CHILD}                 2.0     1, 6
S55:服 {CLOTHES, CLOTHING, GARMENTS, WEAR, DRESS}     1.0     2
S10184:選択 {SELECT, CHOOSE}                          0.5     4
S17250:洗濯 {WASH, CLEAN}                             0.5     4
S15355:一緒 {TOGETHER}                                1.0     8
Dependency index of synonymous expressions

Dependency relation between synonymous expression sets   Freq.   Position
S11412:こども → S55:服                                    1.0     1
S55:服 → S17250:洗濯                                      0.5     2
S55:服 → S10184:選択                                      0.5     2
S11412:こども → S15355:一緒                               1.0     6
Query syntax
• Natural language sentence: 京都大学への行き方 (京都大学 = Kyoto Univ., 行き方 = access)
• Phrase search: "京都大学"
• Proximity search (word): 京都大学~5W (京都 and 大学 co-occur within 5 words, in that order)
• Proximity search (sentence): 京都大学~5S (京都 and 大学 co-occur within 5 sentences, in that order)
• Combination of the above notations: 京都大学への行き方 "市バス" (city bus)
[Figure: matching patterns of query words w1 w2 w3 against documents d1 and d2, including the "within N words" proximity constraint]
Scoring method
• Score calculated from a query Q for a document d:

score(Q, d) = rel_w(Q_w, d) + rel_d(Q_d, d)

  – rel_w: score calculated from words in Q
  – rel_d: score calculated from dependency relations in Q
• e.g., Q = 子供(child) の(no) 体力(strength) 低下(decrement)
  ⇒ Q_w = {子供, 体力, 低下}, Q_d = {子供→体力, 体力→低下}
Scoring method
• Score calculated from the word indices Q_w in Q for a document d (OKAPI BM25):

rel_w(Q_w, d) = Σ_{q ∈ Q_w} qfq × (3 × fq) / (K + fq) × log( (N − n + 0.5) / (n + 0.5) )

K = 2 × ( (1 − b) + b × l / l_ave )

  – fq: the frequency of the expression q in d
  – qfq: the frequency of q in Q
  – n: the document frequency of q in 100 million pages
  – N: 1 × 10^8
  – l: the document length of d
  – l_ave: the average document length over all the pages
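The word score can be sketched directly from the formula above. The factor 3 corresponds to k1 + 1 with k1 = 2, matching the constants in K; the value of b and the default l_ave below are assumptions for illustration:

```python
import math

def okapi_weight(fq, qfq, n, N, l, l_ave, b=0.75):
    """One term of rel_w: BM25-style weight of expression q for document d."""
    K = 2 * ((1 - b) + b * l / l_ave)
    return qfq * (3 * fq) / (K + fq) * math.log((N - n + 0.5) / (n + 0.5))

def rel_w(query_words, doc_words, df, N=10**8, l_ave=500.0):
    """Sum the BM25 weights of the query words over a tokenized document."""
    l = len(doc_words)
    return sum(
        okapi_weight(doc_words.count(q), query_words.count(q),
                     df.get(q, 0), N, l, l_ave)
        for q in set(query_words))
```

A word absent from the document contributes nothing (fq = 0), and rare words (small n) get a large IDF factor.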
Scoring method
• Score calculated from the dependency relation indices Q_d in Q for a document d:

rel_d(Q_d, d) = Σ_{q ∈ Q_d} h(q, d)

h(q, d) = f(q, d)   if d includes q
        = g(q, d)   otherwise

• e.g., Q_d = { 子供(child) → 体力(strength), 体力(strength) → 低下(decrement) }
Scoring method (d includes q)
• The dependency relation q is scored like a word (OKAPI BM25):

f(q, d) = qfq × (3 × fq) / (K + fq) × log( (N − n + 0.5) / (n + 0.5) )

K = 2 × ( (1 − b) + b × l / l_ave )

  – fq: the frequency of the expression q in d
  – qfq: the frequency of q in Q
  – n: the document frequency of q in 100 million pages
  – N: 1 × 10^8
  – l: the document length of d
  – l_ave: the average document length over all the pages
Scoring method (d does not include q)
• A pseudo frequency w(q), based on word proximity, replaces fq:

g(q, d) = qfq × (3 × w(q)) / (K + w(q)) × log( (N − n + 0.5) / (n + 0.5) )

w(q) = (D − min(l(q), r(q))) / D   if min(l(q), r(q)) < D
     = 0                           otherwise

  – l(q): parent of dependency relation q
  – r(q): child of dependency relation q
  – min(q1, q2): minimum distance between q1 and q2 (# of words)
  – D: threshold of distance (D = 30)
  – n: DF value of dependency relation q1→q2
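The proximity-based pseudo frequency w(q) can be sketched over the positional word index: even when the exact dependency is not indexed for d, its parent and child words may occur near each other. The argument layout is illustrative:

```python
def w_q(parent_positions, child_positions, D=30):
    """Pseudo frequency of a dependency relation whose parent and child
    words occur in the document at the given positions."""
    if not parent_positions or not child_positions:
        return 0.0
    dist = min(abs(p - c) for p in parent_positions
                          for c in child_positions)
    return (D - dist) / D if dist < D else 0.0
```

The closer the two words co-occur, the closer w(q) gets to 1, so g(q, d) degrades gracefully toward the exact-match score f(q, d).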
Contribution of deep NLP indices (NTCIR 10M web test set)

Method                                   R-precision   P@10
Baseline                                 0.155         0.232
Word (representative form)               0.162         0.257
Dependency relation (f(q,d) + g(q,d))    0.168         0.230
Dependency relation (only f(q,d))        0.170         0.253
[Figure: TSUBAKI architecture]
A load balance server receives the user's query and parses it, then forwards it to 27 search servers, each holding index data generated from a million web pages together with the corresponding web standard format data. Each search server retrieves and ranks pages; 4 master servers merge the retrieved pages and create the search result; 16 snippet creation servers build the snippets returned to the user.
Data sizes per 100 million pages
  Title DB:   9.3 GB
  URL DB:     7.9 GB
  DF DB:      115 GB
  total:      132 GB

Index data sizes per a million pages (gzipped)
  Word:                                       11 GB
  Dep.:                                       8.9 GB
  Synonymous expressions (word and phrase):   18 GB
  Dep. of synonymous expressions:             48 GB
  total:                                      85.9 GB

Web standard format data: 31 GB (gzipped, per a million pages)
Required time per a query

Pipeline: the load balance server sends the query → query parsing → search servers retrieve pages, calculate scores and rank → master server gets titles & URLs, merges and re-ranks → snippet creation servers build snippets

Get hit count by API:                                         7.9 seconds
Get document IDs, titles and URLs of top 100 pages by API:    9.7 seconds
Ordinary search (50 pages are shown):                         32.6 seconds
Get document IDs, titles and URLs of top 1000 pages by API:   12.7 seconds

* Document IDs are necessary for obtaining cached web pages and web standard format data
TSUBAKI API
http://tsubaki.ixnlp.nii.ac.jp/api.cgi
• No user registration
• No limit on the number of API calls a day
• Provides all pages in a search result
  – cf. Yahoo! API: top 1000 pages in a search result; Google AJAX Search API: top 8 pages; (previous) Google API: top 1000 pages
• Provides web standard format data
Request parameters

Parameter          Value      Description
query              string     The query to search for (UTF-8 encoded). Required for obtaining search results.
start              integer    The starting result position to return.
results            integer    The number of results to return.
logical_operator   AND/OR     The logical operation to search for.
only_hitcount      0/1        Set to 1 to obtain a query's hit count only.
id                 integer    The document ID to obtain a cached web page or standard format data corresponding to the ID.
format             html/xml   The document type to return.
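Building a request from these parameters can be sketched with the standard library. The endpoint is the one given in the slides and may no longer be live, so this only illustrates URL construction:

```python
from urllib.parse import urlencode

# Endpoint from the slides; the service may no longer be available.
BASE = "http://tsubaki.ixnlp.nii.ac.jp/api.cgi"

def build_request(query, start=1, results=20, only_hitcount=0,
                  logical_operator="AND"):
    """Return a TSUBAKI API request URL with percent-encoded parameters."""
    params = {
        "query": query, "start": start, "results": results,
        "only_hitcount": only_hitcount, "logical_operator": logical_operator,
    }
    return BASE + "?" + urlencode(params)
```

urlencode percent-encodes the UTF-8 bytes of a Japanese query, producing the same escaping as the example request shown on the next slide.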
Example requests:
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?query=%E4%BA%AC%E9%83%BD%E8%A6%B3%E5%85%89&start=1&results=20
(the query parameter is the URI-encoded string of "京都観光" (Kyoto sightseeing))
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=html&id=06832381
http://tsubaki.ixnlp.nii.ac.jp/se/api.cgi?format=xml&id=06832381
Conclusion
• Search engine infrastructure TSUBAKI
  – reproducible search results,
  – Web standard format for sharing pre-processed web pages,
  – indices generated by deep NLP,
  – open search algorithm, and
  – APIs without any restriction
• Available from http://tsubaki.ixnlp.nii.ac.jp/index.cgi
Information organization system
[Diagram: grid computing environment and huge storage servers → Search Engine Infrastructure TSUBAKI → Next-Generation Search]
Search result clustering
• Advantages
  – Provides a bird's-eye view of a search result
  – Provides efficient access to necessary pages
  – Surfaces low-ranked pages in a search result
• Requirements
  – Quick cluster construction
  – High-quality cluster labels
    • These affect access to necessary pages
Characteristics of our system
• Cooperation with the search engine infrastructure TSUBAKI
  – Full text data of web pages and their analyzed data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions
Distillation of labels

[Figure: candidate compound nouns are distilled in three steps]
1. Assimilate expressive divergence: 得点力アップ / 得点力UP → 得点力アップ (score improvement); 新教育課程 / 新カリキュラム → 新教育課程 (new curriculum); 教育基本法改正 / 教育基本法の改正案 / 教育基本法改正案 (amendment (bill) of the Fundamental Law of Education)
2. Eliminate inappropriate compound nouns: サイトマップ (site map) is discarded; fragments such as 教育基本 and 法改正 are discarded
3. Merge substrings: 教育基本法改正 and 教育基本法改正案 are merged
Overview of our clustering system
[Figure]
Step 1. Label acquisition (e.g., 国際捕鯨委員会 (IWC), 調査捕鯨 (scientific whaling), …)
Step 2. Cluster generation
Step 3. Cluster organization
Step 4. Display
Architecture
[Figure]
Query → Search Engine TSUBAKI → Search & page ID gathering → Web standard format collection → Compound noun extraction → Label selection & clustering → Clusters
Clustering result for the query "whaling problem"

• IWC (357 pages) △ The explanation of IWC, criticism of IWC, and others
  – 科学委員会 (Scientific Committee)
  – 年次総会 (Annual Meeting)
  – IWC総会 (IWC meeting)
  – 原住民生存捕鯨 (Aboriginal Subsistence Whaling scheme)
  – 鯨種 (species of whales), …
• 調査捕鯨 (Scientific whaling) (145 pages) ○ The explanation of, and positive or negative opinions on, scientific whaling
  – 日本の調査捕鯨 (Scientific whaling in Japan)
• 捕鯨船 (Whaling ship) (65 pages) ○ Accidents and history of whaling ships
• 南極海 (Antarctic Ocean) (51 pages)

The ranks of the web pages carrying these labels spread widely through the search result (1st, 6th, 19th, 32nd, 37th, 41st, 44th, 72nd, 94th, …), so users can find web pages that are low-ranked in a search result.
Conclusion
• Label-based search result clustering system
• Cooperation with the search engine infrastructure TSUBAKI
  – Full text data of web pages and their analyzed data
  – High-performance computing environment
• Label acquisition based on deep NLP
  – Assimilates expressive divergence
    • Spelling variations
    • Synonymous expressions
Information credibility analysis system
Information Credibility Analysis
1. Credibility of information contents
2. Credibility of information sender
3. Credibility estimated from document style and superficial characteristics
4. Credibility based on social evaluation of information contents/sender
1. Credibility of information contents
• Sentences in the related documents are classified into opinions, events, and facts, and opinion sentences are classified into positive opinions and negative opinions.
• Documents in each cluster should be summarized, by using multi-document summarization techniques and their extensions.
• Several relations such as similarities, oppositions, causal relations, supporting relations are detected among inner- and inter-cluster statements, which leads to the detection of logical consistency and contradiction.
[Figure: deep NLP applied to two conflicting statements about minke whales]
– ミンククジラの数は増えている (the number of minke whales is increasing)
– 問題はミンク鯨だ。絶滅しかかっている (the problem is the minke whale; it is facing extinction)
Word segmentation and identification, predicate-argument structure analysis, anaphora resolution, and flexible matching together detect the conflict between "the number is increasing" and "facing extinction".
Deep NLP ⇒ Information Credibility
2. Credibility of information sender
• Information sender:
  – individuals
    • expert or not
    • individuals identified by handle-name, and others
  – organizations
    • public organizations (administrative organs, academic associations, universities),
    • media,
    • commercial companies, and others
• Distinguished by:
  – meta-information such as URLs, page titles, anchor texts, and RSS
  – NE extraction
2. Credibility of information sender
• Check the quantity and quality of information the sender has produced so far.
• Information quality can be evaluated based on the other three criteria.
• The speciality of an individual or organization is important; it can be detected by topic detection.
3. Credibility estimated from document style and superficial characteristics
• Guessed by integrating many criteria, such as sentential style (formal or informal, written or spoken language), page layout, appropriateness of links in the page, and so on.
• cf. Persuasive technology at Stanford University, and Google News automatic assembling criteria.
4. Credibility based on social evaluation of information contents/sender
• How they are evaluated by others.
• One way is to perform opinion mining from the web based on NLP, and collect and count positive and negative evaluations for the information content/sender.
• Another way is to directly use rankings and comments of others, as in social network frameworks.
Information Credibility Analysis System WISDOM (2006~)
[Screenshots: analysis of the query "Agaricus": page clustering, sender classification, opinion distribution, ontology, Q&A]
Summary
• Much linguistic and extra-linguistic knowledge can be acquired from the web corpus using a high-performance computing environment.
• Deep NLP, especially accurate predicate-argument structure analysis and flexible matching, provides key technologies for next-generation search.
• Automatic information credibility evaluation is not easy, but an information organization system and a multi-faceted information analysis system greatly help users' own evaluation.
References
• D. Kawahara and S. Kurohashi. Case frame compilation from the web using high-performance computing. In Proceedings of LREC2006, 2006.
• D. Kawahara and S. Kurohashi. A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. In Proceedings of HLT-NAACL2006, pages 176-183, 2006.
• H. Miyamori, S. Akamine, Y. Kato, K. Kaneiwa, K. Sumi, K. Inui, and S. Kurohashi. Evaluation data and prototype system WISDOM for information credibility analysis. In Proceedings of the First International Symposium on Universal Communication, 2007.
• T. Shibata, M. Odani, J. Harashima, T. Oonishi, and S. Kurohashi. SYNGRAPH: A flexible matching method based on synonymous expression extraction from an ordinary dictionary and a web corpus. In Proceedings of IJCNLP2008, 2008.
• K. Shinzato, T. Shibata, D. Kawahara, C. Hashimoto, and S. Kurohashi. TSUBAKI: An open search engine infrastructure for developing new information access methodology. In Proceedings of IJCNLP2008, 2008.