Hansaem Kim The National Institute of the Korean Language

2009. 6. 18. ISO/TC37/SC4/WG2 Word Segmentation Project Editorial Meeting Word Segmentation in Korean Hansaem Kim The National Institute of the Korean Language


Transcript of Hansaem Kim The National Institute of the Korean Language

Page 1: Hansaem Kim  The National Institute of the Korean Language

2009. 6. 18.

ISO/TC37/SC4/WG2 Word Segmentation Project Editorial Meeting

Word Segmentation in Korean

Hansaem Kim The National Institute of the Korean Language

Page 2: Hansaem Kim  The National Institute of the Korean Language

-2- Word Segmentation Project Hansaem Kim

Contents for further work (09.4.24.)

Part 1
1. WU, WSU: check up
2. Figure 1 -> change and check
3. Figure 4
   1) lemma: delete
   2) other lexical items -> other character strings
   3) word forms -> lexical items
   4) bound morpheme: delete

Part 2
1. Terms & definitions: added (e.g. bunsetsu, eojeol, character, etc.)
2. Properties of CJK: add to introductory part
3. In "Scope": Chinese scripts -> Chinese characters
4. Application of Chinese general rules for JK in combinations of Chinese characters
5. Add examples of agglutinative units in JK

Page 3: Hansaem Kim  The National Institute of the Korean Language

Table of contents (Part 2)

Foreword
1. Introduction: Kim
   1) difference of CJK
   2) interaction of CJK (nouns w/ Chinese characters)
2. Scope: Choi
   Application oriented
   refer to MAF, SynAF, etc.
   linguistic layer & processing (vertical)
3. Terms and definitions
   Bunsetsu: Kanzaki
   Eojeol: Kim
4. Overview and motivation: Kanzaki (main), Sun, Kim
   Mapping table of CJK POS scheme (+ examples and definition)
5. Chinese word segmentation
   5.1. General rules for identifying WUs in Chinese text
   5.2. Typology of WUs in Chinese
6. Japanese
7. Korean

Page 4: Hansaem Kim  The National Institute of the Korean Language

Basic concepts and general principles (Part 1)

Page 5: Hansaem Kim  The National Institute of the Korean Language

Terms and definitions

Word unit (WU)

Distinction between ‘word unit’ and ‘word segmentation unit’
– Y: Terms and definition of WSU +
– N: Correcting the definition of WU

MWE (phrasal compound, fragment of sentence, …) ⊂ lexical item?
– Y: No change, or changing ‘lexical items’ into ‘lexical items including MWEs’
– N: Changing ‘lexical items’ into ‘lexical items, MWEs’

Page 6: Hansaem Kim  The National Institute of the Korean Language


Essential concept systems (Figure 1)

Page 7: Hansaem Kim  The National Institute of the Korean Language

Essential concept systems (Figure 4), changed

[Figure 4. Labels recoverable from the diagram: Miscellaneous character strings; Word segmentation unit; Word forms]

Page 8: Hansaem Kim  The National Institute of the Korean Language

Word segmentation for CJK (Part 2)

Page 9: Hansaem Kim  The National Institute of the Korean Language

Introduction

See the document.
1) difference of CJK
2) interaction of CJK (nouns w/ Chinese characters)

Page 10: Hansaem Kim  The National Institute of the Korean Language

Terms and definitions

Eojeol: a linguistic unit separated by white space in Korean text, consisting of a word followed by either particle(s) or ending(s), or just a word.

Example: Given the sentence “나는 점심을 먹었다.”, “나 (I)” is a pronoun, “는” is a particle, “점심 (lunch)” is a noun, “을” is a particle, and “먹 (eat)” is a verbal stem followed by the endings “었” and “다”. The sentence contains 3 eojeols: “나는”, “점심을”, and “먹었다”.
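The eojeol definition can be illustrated with a minimal Python sketch. It only performs the white-space split and strips a few sentence punctuation marks; the function name `split_eojeols` and the hand-written morpheme table are assumptions made for illustration, with the table simply copying the analysis given in the example sentence. No morphological analyzer is involved.

```python
def split_eojeols(sentence: str) -> list[str]:
    """Split a sentence into eojeols: units separated by white space.

    Punctuation attached to an eojeol is stripped (assumed set: . ? ! ,).
    """
    eojeols = []
    for token in sentence.split():
        token = token.strip(".?!,")
        if token:
            eojeols.append(token)
    return eojeols


# Hand-written analysis copied from the example sentence (not computed).
MORPHEMES = {
    "나는": ["나", "는"],          # pronoun + particle
    "점심을": ["점심", "을"],      # noun + particle
    "먹었다": ["먹", "었", "다"],  # verbal stem + endings
}

eojeols = split_eojeols("나는 점심을 먹었다.")
for eojeol in eojeols:
    print(eojeol, "->", " + ".join(MORPHEMES[eojeol]))
```

Splitting an eojeol further into word, particle, and ending requires lexical knowledge, which is why the table above is written by hand rather than derived.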

Page 11: Hansaem Kim  The National Institute of the Korean Language

Overview and motivation

Mapping table of CJK POS scheme

Page 12: Hansaem Kim  The National Institute of the Korean Language

7.1. General rules for identifying WUs in Korean text

Page 13: Hansaem Kim  The National Institute of the Korean Language

7.1.1. Punctuation

Blank spaces and punctuation marks are the separation marks of word segmentation units in computer processing. The punctuation marks used as separation marks include the full stop (.), question mark (?), exclamation mark (!), comma (,), middle dot (․), colon (:), slash (/), quotation marks (“ ”, ‘ ’), brackets (( ), { }, [ ]), dash (―), hyphen (-), swung dash (~), ellipsis dots (……), etc. Korean punctuation marks are listed in the “Korean language regulations”.
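As a rough sketch of this rule, the Python fragment below treats runs of white space and the listed punctuation marks as boundaries. The name `split_on_separation_marks` is hypothetical, and the exact code points chosen for the middle dot, dash, and ellipsis are assumptions, since the slide gives only the glyphs. The sketch discards the separators and does not handle cases where a mark belongs inside a unit.

```python
import re

# Separation marks named on the slide. The code points used for the
# middle dot, dash, and ellipsis are assumptions.
SEPARATION_MARKS = ".?!,·․:/\"'“”‘’(){}[]―-~…"

# A boundary is one or more separation marks or white-space characters.
_BOUNDARY = re.compile("[" + re.escape(SEPARATION_MARKS) + r"\s]+")


def split_on_separation_marks(text: str) -> list[str]:
    """Return the character strings found between separation marks."""
    return [piece for piece in _BOUNDARY.split(text) if piece]


print(split_on_separation_marks("나는 점심을 먹었다."))
```

A fuller implementation would probably keep the punctuation marks as units of their own rather than dropping them; that choice is left open here.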

Page 14: Hansaem Kim  The National Institute of the Korean Language

7.1.2. Combination of characters

7.1.2.1. Numeric character strings: 1984, 2009
7.1.2.2. Foreign character strings: GPS, EU, 同意
7.1.2.3. Hangeul (Korean alphabet) characters (C & V): ㄱㄴㄷ, 가
7.1.2.4. Combination of character strings or other symbols: [abc], {라}
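The four categories of 7.1.2 can be approximated by looking at Unicode code points, as in the sketch below. The function names and category labels are hypothetical, and the mapping from the slide's categories to Unicode ranges is an assumption: Hangeul is taken to be the Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo blocks, and any other letter (Latin, Chinese characters, and so on) is treated as foreign.

```python
def char_class(ch: str) -> str:
    """Classify one character as numeric, hangeul, foreign, or symbol."""
    cp = ord(ch)
    if "0" <= ch <= "9":
        return "numeric"
    if (0xAC00 <= cp <= 0xD7A3       # Hangul Syllables, e.g. 가
            or 0x1100 <= cp <= 0x11FF    # Hangul Jamo
            or 0x3131 <= cp <= 0x318E):  # Compatibility Jamo, e.g. ㄱ
        return "hangeul"
    if ch.isalpha():
        return "foreign"             # Latin letters, Chinese characters, ...
    return "symbol"


def classify_string(s: str) -> str:
    """Return the single class shared by all characters, else 'combination'."""
    classes = {char_class(ch) for ch in s}
    return classes.pop() if len(classes) == 1 else "combination"


for example in ["1984", "GPS", "同意", "ㄱㄴㄷ", "가", "[abc]", "{라}"]:
    print(example, "->", classify_string(example))
```

With these assumptions the slide's own examples fall into the expected categories, with `[abc]` and `{라}` reported as combinations because they mix symbols with letters.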

Page 15: Hansaem Kim  The National Institute of the Korean Language

7.1.3. Word

7.1.3.1. Simplex: 사자, 밥
7.1.3.2. Compound: 농목장, 검붉다
7.1.3.3. Derivation: 풋사과, 신사적, 동의하다
7.1.3.4. Abbreviation: 건교위, 노찾사
7.1.3.5. Idiomatic expression w/ Chinese characters: 와신상담 (臥薪嘗膽), 오십보백보 (五十步百步)

Page 16: Hansaem Kim  The National Institute of the Korean Language

7.1.4. Combination of words (MWEs)

7.1.4.1. Phrasal compound
   1) General phrasal compound: 주민 번호
   2) Terminology: 민주 국가, 계급 사회
   3) Expressions related to proper nouns: 예술의 전당
7.1.4.2. Idiom
   1) Lexical idiom: 무릎을 꿇다
   2) Grammatical idiom: ~로 인해, ~을 위해
7.1.4.3. Fixed expression (proverb, motto, etc.): 낫 놓고 기역 자도 모른다
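One simple way to treat such multiword expressions as single units is a longest-match lookup over white-space tokens, sketched below. This is an illustrative technique, not a method stated on the slide. The lexicon holds only examples from the list above, the names `MWE_LEXICON` and `group_mwes` are hypothetical, and matching is on exact surface forms, so inflected or particle-bearing variants are not recognized.

```python
# Small lexicon built from the slide's examples (exact surface forms only).
MWE_LEXICON = {
    ("주민", "번호"),
    ("민주", "국가"),
    ("계급", "사회"),
    ("예술의", "전당"),
    ("무릎을", "꿇다"),
    ("낫", "놓고", "기역", "자도", "모른다"),
}
MAX_MWE_LEN = max(len(entry) for entry in MWE_LEXICON)


def group_mwes(tokens: list[str]) -> list[str]:
    """Merge the longest lexicon match starting at each position."""
    units = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_MWE_LEN, len(tokens) - i)
        for n in range(longest, 1, -1):      # try longer matches first
            if tuple(tokens[i:i + n]) in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:                                # no MWE starts at position i
            units.append(tokens[i])
            i += 1
    return units


print(group_mwes("우리는 예술의 전당 앞에서 만났다".split()))
```

Grammatical idioms such as ~로 인해 attach to a preceding word and would need pattern matching rather than a plain token lookup, so they are left out of this sketch.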

Page 17: Hansaem Kim  The National Institute of the Korean Language

Typology of WUs in Korean

Page 18: Hansaem Kim  The National Institute of the Korean Language

Overall typology (See the document.)

1. Noun
   1.1 Common noun
   1.2 Proper noun
   1.3 Bound noun
2. Pronoun
3. Numeral
4. Verb
5. Auxiliary verb
6. Copula
7. Adjective
8. Auxiliary adjective
9. Adnoun
10. Adverb
11. Exclamation
12. Particle
   12.1 Case particle
   12.2 Auxiliary particle