Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia...
-
Upload
flora-kelly -
Category
Documents
-
view
246 -
download
0
Transcript of Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia...
![Page 1: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/1.jpg)
Tomaž Erjavec1, Adam Kilgarriff2, Irena Srdanović Erjavec3
1Jožef Stefan Institute, Slovenia 2Lexical Computing Ltd. and University of Leeds, UK 3Tokyo Institute of Technology, Japan
A large public-access Japanese corpus and its query tool- JapWaC and Sketch Engine -
![Page 2: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/2.jpg)
2 COJAS, March 2007
Overview
1. The case for corpora
2. The case for web corpora
3. How JapWaC was created
4. Sketch Engine (SkE)
5. Demo-ing JapWaC & SkE
6. Future work
7. Access to JapWaC & SkE
![Page 3: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/3.jpg)
3 COJAS, March 2007
Corpora A sample of a language Useful for studying the language Language is diverse
Big samples needed, to catch everything Good tools needed, for large amounts of data
Last 15 years Big samples are easier to gather Tools are better Rapid growth in corpus methods
![Page 4: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/4.jpg)
4 COJAS, March 2007
Web corpora Web is huge, free, easily accessible (Non-)linguists use it for lang. check/research Skewed?
Keller and Lapata 03: • web results match human judgements well• the large amount of data outweighs the “noise” problem
Web importance as a resource is growing David Crystal “Language and the Internet” 06:
• “new linguistic medium that we cannot ignore” Web-corpus expertise is growing (WaCky etc.)
![Page 5: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/5.jpg)
5 COJAS, March 2007
Steps to compile web corpora(Sharoff, Baroni)
1. Get URL list for required language ~500 most frequent word forms
not function words; for general-purpose corpora, words that do not belong to a spec. domain
5000-6000 queries, 4 words, top 10 URLs2. Download HTML pages3. Normalize encoding (to UTF-8)4. HTML clean-up
boilerplate removal: HTML tags, Java code, navigation frames,…
5. Extract meta-data (URL, title, date,…)6. Linguistic annotation
![Page 6: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/6.jpg)
6 COJAS, March 2007
Steps for JapWaC
URL list of pages in Japanese provided by Serge Sharoff
• word, lemma and PoS frequency lists for Japanese, c.f. http://corpus.leeds.ac.uk/list.html
Files downloaded and cleaned with BootCat • by Marco Baroni and others from the WaCky
project, c.f. http://wacky.sslmit.unibo.it/ Segmented, tokenised, tagged with Chasen Translated Chasen tags to English Converted to Sketch Engine format and
loaded
![Page 7: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/7.jpg)
7 COJAS, March 2007
Example fileThe file size is 7038669 kB, showing first 1 kB.
<doc id="http://www.0start-hp.com/voice/index.php"><s>月々 月々 N.Adv2 2 N.Num6 6 N.Num3 3 N.Num円 円 N.Suff.msrで だ Aux、 、 Sym.cあなた あなた N.Pron.gも も P.bindブログデビュー ブログデビュー Unknownし する V.freeて て P.Conjみ みる V.bndませ ます Auxん ん Auxか か P.advcoordfin? ? Sym.g</s>
![Page 8: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/8.jpg)
8 COJAS, March 2007
Basic corpus statistics
49,554 URLs (i.e. HTML files) 16,072 sites (2 domains) 12,759,201 sentences (Chasen) 409,384,411 tokens (Chasen) 7.3 GB filesize tokens/file: 8,263 Average 5,001 Median 3 Min 170,693 Max
![Page 9: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/9.jpg)
9 COJAS, March 2007
URL statistics(top ranking domains, sites and keywords)
![Page 10: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/10.jpg)
10 COJAS, March 2007
Chasen POS statistics
![Page 11: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/11.jpg)
11 COJAS, March 2007
The Sketch Engine Leading corpus query system Any corpus, any language Web-based
No software to install
Concordance Word sketches
“one-page, corpus based account of a word’s grammatical and collocational behaviour”
Thesaurus Word Sketch Difference
![Page 12: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/12.jpg)
12 COJAS, March 2007
Use of Sketch Engine
Macmillan English DictionaryFor Advanced Learners
Ed: Rundell, 2002 Lexicography
Language learning
Linguistic research
.
.
.
![Page 13: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/13.jpg)
13 COJAS, March 2007
![Page 14: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/14.jpg)
14 COJAS, March 2007
Creating SkE for Japanese
1. Load JapWaC into SkE
2. Write gram relations for Japanese Chasen POS (as used for jaSlo)
3. Compile word sketches
4. Recompute scores in WS
5. Compile thesaurus
![Page 15: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/15.jpg)
15 COJAS, March 2007
![Page 16: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/16.jpg)
16 COJAS, March 2007
![Page 17: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/17.jpg)
17 COJAS, March 2007
![Page 18: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/18.jpg)
18 COJAS, March 2007
![Page 19: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/19.jpg)
19 COJAS, March 2007
![Page 20: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/20.jpg)
20 COJAS, March 2007
![Page 21: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/21.jpg)
21 COJAS, March 2007
![Page 22: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/22.jpg)
22 COJAS, March 2007
Word Sketch examples
WS for 女の子 (noun )
WS for 冷たい (adjective)
WS for 書く (verb)
![Page 23: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/23.jpg)
23 COJAS, March 2007
![Page 24: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/24.jpg)
24 COJAS, March 2007
Thesaurus, WS Diff example(1)
WS Diff for 女の子 and 男の子
![Page 25: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/25.jpg)
25 COJAS, March 2007
Thesaurus, WS Diff example (2)
WS Diff for 寒い and 冷たい
![Page 26: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/26.jpg)
26 COJAS, March 2007
「温泉」 example
WS for 温泉
![Page 27: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/27.jpg)
27 COJAS, March 2007
More metadata in the corpus:• date, title, author; text typology
More data cleaning Japanese corpus for HLT research:
• sampling only 10 consecutive sentences, 100M• would be available for download with Creative Commons
license For native speakers’ and learners’ use:
• original Chasen tags, Chasen kana• Ruby romaji, furigana in examples
Connecting to jaSlo, Natsume system More advanced relations (MWU etc.), Cabocha? Load other corpora into SkE (Kotonoha, AB)
Future work
![Page 28: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/28.jpg)
28 COJAS, March 2007
Access to JapWaC & SkE
http://www.sketchengine.co.uk Free 30-day trial Self-registration Japanese, Chinese, English, French,
German, Italian, Spanish, Portuguese, Slovene
Also gives access to WebBootCaT “instant web corpora”
![Page 29: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3.](https://reader033.fdocuments.in/reader033/viewer/2022061614/56649de55503460f94add891/html5/thumbnails/29.jpg)
29 COJAS, March 2007
Thank you for your attention!