Extracting of Translation Units and Their Target Language Equivalents from a Chinese-English...

Extracting of Translation Units and Their Target Language Equivalents from a Chinese-English Parallel Corpus. CHANG Baobao (Institute of Computational Linguistics, Peking University, China), Pernilla DANIELSSON and Wolfgang TEUBERT (Centre for Corpus Linguistics, Birmingham University)


Extracting of Translation Units and Their Target Language Equivalents from a Chinese-English Parallel Corpus

CHANG Baobao

Institute of Computational Linguistics

Peking University, China


Pernilla DANIELSSON

Wolfgang TEUBERT

Centre for Corpus Linguistics

Birmingham University



Major approaches to Machine Translation

• Traditional Rule-Based Machine Translation (RBMT)

• Statistics-Based Machine Translation (SBMT) (Brown, 1993)

• Example-Based Machine Translation (EBMT) (Nagao, 1984)

Corpus-Based Translation Platform

• Corpus-based: parallel corpora contain solutions to many translation problems

• Aims to increase the productivity of professional translators and to make translation easier and more efficient

• Translators keep control over the translation process

• Task-oriented management of the translation workflow

• More than a translation memory: idiom and terminology support

• Customizable and self-improving


Database of Translation Equivalent Pairs

• A Translation Equivalent Pair (TEP) is composed of a Translation Unit (TU) and one of its target-language equivalents

• Translation Units include not only single words but also multi-word units

• Every Translation Equivalent Pair has two context profiles, which describe the contexts in which the equivalence holds
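One way to picture a database record for a TEP is sketched below. The slides do not specify a storage layout, so the class name, the field names, and the choice of a bag of co-occurring words as the "context profile" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TranslationEquivalentPair:
    """Hypothetical record: one TU plus one target-language equivalent."""
    source_unit: Tuple[str, ...]   # Chinese TU, one or more words
    target_unit: Tuple[str, ...]   # English equivalent, one or more words
    # Assumption: a context profile is a bag of words seen around the unit.
    source_context: Counter = field(default_factory=Counter)
    target_context: Counter = field(default_factory=Counter)

    def observe(self, source_sentence: List[str], target_sentence: List[str]) -> None:
        """Add the other words of one aligned sentence pair to both profiles."""
        self.source_context.update(w for w in source_sentence if w not in self.source_unit)
        self.target_context.update(w for w in target_sentence if w not in self.target_unit)


# Toy usage with words taken from the corpus excerpt.
tep = TranslationEquivalentPair(("发还",), ("release",))
tep.observe(["裁判官", "命令", "发还", "处所"], ["magistrate", "order", "release", "premises"])
```

Keeping the two profiles separate mirrors the statement that every TEP has two context profiles, one per language.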


Steps Towards Construction of a Database of Chinese-English TEPs

(1) Chinese Text segmentation

(2) English Text lemmatization and POS-tagging

(3) Chinese Text POS-tagging

(4) Sentence Alignment

(5) Statistics-based Lexical Alignment (word type alignment)

(6) Lexicon-based Lexical Alignment (word token alignment)

(7) Chinese Multi-Word Unit(MWU) Identification

(8) English MWU Identification

(9) MWU Alignment

(10) Extraction of Translation Examples

(11) Context learning for every TEP in the database


The Hong Kong Legal Documents Corpus (HKLDC)

• Size in words: 6,833,762 Chinese words and 6,391,919 English words (tokens)

• Text type: laws and amendments issued by the Hong Kong Special Administrative Region (published during 1997-1999, although some regulations date back much earlier, e.g. to 1921)

• The corpus contains 430 ordinances in each language

• Language variety: special-purpose language
  Relatively restricted vocabulary of about 19,400 words (types)
  Many legal terms
  By comparison, a smaller Chinese corpus (one month of the Chinese People's Daily, 886,351 words) has a vocabulary of 56,829 words
  Frequency lists show that the two corpora have different word distributions, e.g. 法律 (law) is the 37th most frequent word in the HKLDC but only the 615th most frequent word in the People's Daily corpus
  The HKLDC also contains some words that are not used in mainland China
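The token counts, type counts and frequency ranks quoted above are the kind of figures a simple frequency list yields. The following is a minimal sketch on a toy token list; the function name `vocabulary_stats` is hypothetical and the data is not taken from the HKLDC.

```python
from collections import Counter


def vocabulary_stats(tokens):
    """Return (number of tokens, number of types, rank of each word).

    Rank 1 is the most frequent word; ties keep first-seen order.
    """
    counts = Counter(tokens)
    ranks = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
    return len(tokens), len(counts), ranks


# Toy segmented text, for illustration only.
toy_tokens = ["法律", "条例", "法律", "处所", "条例", "法律"]
n_tokens, n_types, ranks = vocabulary_stats(toy_tokens)
```

Comparing `ranks` for the same word in two corpora gives the kind of contrast reported for 法律.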



…. The magistrate may, at any time, if any premises have been taken into the possession of the Commissioner of Police under section 2, order the release of the premises. The closure of an inquiry or the release of premises under this Ordinance shall be without prejudice to any proceedings against any person for an offence against the law. For the purposes of this Ordinance, and in relation to all investigations held thereunder, and for the summoning of witnesses, and for all proceedings in connection with such investigations, the magistrate shall have all the powers possessed by a magistrate in relation to cases of indictable offences, and the Commissioner of Police shall render him all proper and necessary assistance.….

Excerpt of an English Text of the Hong Kong Legal Documents Corpus


… 裁判官可随时命令将警务处处长根据第 2 条接管的任何处所发还。 根据本条例结束研讯或发还遭接管的处所,并不影响就任何人的罪行而对其进行的任何法律程序。 为施行本条例,并在根据本条例进行的所有调查方面,以及为传召证人并处理与上述调查有关的一切法律程序,裁判官具有处理可公诉罪行的裁判官所具有的一切权力,而警务处处长须向该裁判官提供一切适当及需要的协助。 

Excerpt of a Chinese Text of the Hong Kong Legal Documents Corpus


English Text After Lemmatization and POS-tagging

The<DT the> magistrate<NN magistrate> may<MD may> ,<, ,> at<IN at> any<DT any> time<NN time> ,<, ,> if<IN if> any<DT any> premises<NNS premise> have<VBP have> been<VBN be> taken<VBN take> into<IN into> the<DT the> possession<NN possession> of<IN of> the<DT the> Commissioner<NP Commissioner> of<IN of> Police<NP Police> under<IN under> section<NP section> 2<CD 2> ,<, ,> order<NN order> the<DT the> release<NN release> of<IN of> the<DT the> premises<NNS premise> .<SENT .> The<DT the> closure<NN closure> of<IN of> an<DT an> inquiry<NN inquiry> or<CC or> the<DT the> release<NN release> of<IN of> premises<NNS premise> under<IN under> this<DT this> Ordinance<NN ordinance> shall<MD shall> be<VB be> without<IN without> prejudice<NN prejudice> to<TO to> any<DT any> proceedings<NNS proceeding> against<IN against> any<DT any> person<NN person> for<IN for> an<DT an> offence<NN offence> against<IN against> the<DT the> law<NN law> .<SENT .>

Format: word<TAG lemma>, where the first field in angle brackets is the English POS tag and the second is the English lemma
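The annotated text above can be read back into (word, tag, lemma) triples with a small regular expression. This is only a sketch: the function name `parse_tagged` is hypothetical, and the pattern assumes that words, tags and lemmas never contain whitespace, which holds for the excerpt but is not guaranteed by the slides.

```python
import re

# word, optional spaces, "<", optional spaces, tag, spaces, lemma, ">"
TOKEN = re.compile(r"(\S+?)\s*<\s*(\S+)\s+(\S+?)>")


def parse_tagged(text):
    """Return a list of (word, pos_tag, lemma) triples."""
    return TOKEN.findall(text)


sample = "The <DT the> premises<NNS premise> have<VBP have> been<VBN be> .< SENT .>"
triples = parse_tagged(sample)
```

The optional whitespace in the pattern tolerates stray spaces around the opening angle bracket.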


Chinese Text After segmentation and POS-tagging

裁判官/n 可/d 随时/d 命令/v 将/p 警务处/n 处长/n 根据/p 第/m 2/m 条/q 接管/v 的/u 任何/b 处所/n 发还/v 。/w

根据/p 本/r 条例/n 结束/v 研讯/v 或/c 发还/v 遭/v 接管/v 的/u 处所/n ,/w 并不/d 影响/v 就/d 任何/b 人/n 的/u 罪行/n 而/c 对/p 其/r 进行/v 的/u 任何/b 法律/n 程序/n 。/w
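The segmented and tagged Chinese text can be split into (word, tag) pairs in the same spirit. A minimal sketch, assuming each token is a word followed by a slash and an alphabetic tag; `parse_segmented` is a hypothetical helper name.

```python
import re

# word, optional spaces, "/", optional spaces, alphabetic tag
WORD_TAG = re.compile(r"(\S+?)\s*/\s*([A-Za-z]+)")


def parse_segmented(text):
    """Return a list of (word, tag) pairs from word/tag annotated text."""
    return WORD_TAG.findall(text)


pairs = parse_segmented("裁判官 /n 可 /d 第 /m 2/m 条 /q 罪行/n 。 /w")
nouns = [word for word, tag in pairs if tag == "n"]
```

Filtering on the tag, as in `nouns`, is how candidate lists such as a Noun-Noun lexicon could be restricted to particular word classes.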



Comparison of Chinese and English tagsets (1)

The Penn Treebank English tagset | The ICL/PKU Chinese tagset

CC Coordinating conjunction | c Conjunction

CD Cardinal number | m Cardinal number, Ordinal number

FW Foreign word; SYM Symbol | x Foreign word, Non-morpheme character

IN Preposition, subordinating conjunction | p Preposition

JJ Adjective; JJR Adjective, comparative; JJS Adjective, superlative | a Adjective; b Distinctive; z Status

NN Noun, singular or mass; NNS Noun, plural; NP Proper noun, singular; NPS Proper noun, plural | n Noun; s Location

PP Personal pronoun; PP$ Possessive pronoun; WP Wh-pronoun; WP$ Possessive wh-pronoun; WRB Wh-adverb | r Pronoun


Comparison of Chinese and English tagsets (2)

The Penn Treebank English tagset | The ICL/PKU Chinese tagset

RB Adverb; RBR Adverb, comparative; RBS Adverb, superlative | d Adverb

UH Interjection | e Interjection

VB Verb, base form; VBD Verb, past tense; VBG Verb, gerund or present participle; VBN Verb, past participle; VBP Verb, non-3rd person singular present; VBZ Verb, 3rd person singular present; MD Modal | v Verb

Other tags for punctuation, including: , Comma; . Sentence-final punctuation; : Colon, semi-colon; ( Left bracket character; ) Right bracket character; " Straight double quote; “ Left open double quote; ” Right close double quote; ` Left open single quote; ' Right close single quote | w Punctuation


Comparison of Chinese and English tagsets (3)

The Penn Treebank English tagset:
  EX Existential there
  LS List item marker
  RP Particle
  TO to
  DT Determiner
  PDT Predeterminer
  WDT Wh-determiner
  POS Possessive ending

The ICL/PKU Chinese tagset:
  f Direction
  o Onomatopoeia
  q Classifier
  t Time
  u Auxiliary
  y Final particle
  i Idiom
  g Morpheme
  h Prefix
  j Abbreviation
  k Suffix
  l Frequently used fixed expression


Chinese-English Texts Aligned

English | Chinese

<s id=10009> The magistrate may, at any time, if any premises have been taken into the possession of the Commissioner of Police under section 2, order the release of the premises. | <s id=10009> 裁判官可随时命令将警务处处长根据第 2 条接管的任何处所发还。

<s id=10010> The closure of an inquiry or the release of premises under this Ordinance shall be without prejudice to any proceedings against any person for an offence against the law. | <s id=10010> 根据本条例结束研讯或发还遭接管的处所,并不影响就任何人的罪行而对其进行的任何法律程序。


Chinese-English Texts Aligned

English | Chinese Characters | Chinese Pinyin

<s id=10009> The magistrate may, at any time, if any premises have been taken into the possession of the Commissioner of Police under section 2, order the release of the premises. | <s id=10009> 裁判官 可 随时 命令 将 警务处 处长 根据 第 2 条 接管 的 任何 处所 发还 。 | <s id=10009> cai2pan4guan1 ke3 sui2shi2 ming4ling4 jiang1 jing3wu4chu3 chu4zhang3 gen1ju4 di4 2 tiao2 jie1guan3 de5 ren4he2 chu4suo3 fa1huan2 .

<s id=10010> The closure of an inquiry or the release of premises under this Ordinance shall be without prejudice to any proceedings against any person for an offence against the law. | <s id=10010> 根据 本 条例 结束 研讯 或 发还 遭 接管 的 处所 , 并不 影响 就 任何 人 的 罪行 而 对 其 进行 的 任何 法律 程序 。 | <s id=10010> gen1ju4 ben3 tiao2li4 jie2shu4 yan2xun4 huo4 fa1huan2 zao1 jie1guan3 de5 chu4suo3 , bing4bu4 ying3xiang3 jiu4 ren4he2 ren2 de5 zui4xing2 er2 dui4 qi2 jin4xing2 de5 ren4he2 fa3lv4 cheng2xu4 .
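Because aligned sentences share an `<s id=...>` marker, sentence pairs can be recovered by joining the two sides on that id. The sketch below assumes one sentence per line in each language file; the helper names and the shortened toy sentences are illustrative only.

```python
import re

SENTENCE = re.compile(r"<s id=(\d+)>\s*(.*)")


def read_sentences(lines):
    """Map sentence id -> sentence text for lines of the form '<s id=N> text'."""
    sentences = {}
    for line in lines:
        match = SENTENCE.match(line.strip())
        if match:
            sentences[int(match.group(1))] = match.group(2).strip()
    return sentences


def pair_by_id(english, chinese):
    """Return (id, english_text, chinese_text) for ids present on both sides."""
    shared = sorted(english.keys() & chinese.keys())
    return [(i, english[i], chinese[i]) for i in shared]


# Shortened toy lines; id 10011 has no English counterpart and is dropped.
en = read_sentences(["<s id=10009> The magistrate may order the release .",
                     "<s id=10010> The closure of an inquiry ."])
zh = read_sentences(["<s id=10009> 裁判官 可 随时 命令",
                     "<s id=10011> 处所"])
beads = pair_by_id(en, zh)
```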


Statistics-based Chinese-English Lexicon Extraction

• Based on the following observation: words that are translations of each other are more likely to appear in corresponding bitext regions (e.g. aligned sentence pairs, or beads) than other word pairs

• Hypothesis-testing approach

• Association measures used:

Pointwise Mutual Information

Dice coefficient

Log Likelihood

χ² score or φ² score
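The association measures listed above can all be computed from a 2x2 contingency table over aligned sentence pairs. The sketch below uses the standard textbook definitions, which may differ in detail from the variants used by the authors; reading the two garbled score names as chi-square and phi-square is itself an assumption. The cell names `a`, `b`, `c`, `d` are hypothetical: `a` counts pairs containing both words, `b` only the Chinese word, `c` only the English word, `d` neither.

```python
import math


def association_scores(a, b, c, d):
    """Standard association measures for a 2x2 table (all margins must be > 0)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    pmi = math.log2(a * n / (row1 * col1)) if a else float("-inf")
    dice = 2 * a / (row1 + col1)
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Log-likelihood ratio: 2 * sum(observed * ln(observed / expected)).
    cells = ((a, row1 * col1 / n), (b, row1 * col2 / n),
             (c, row2 * col1 / n), (d, row2 * col2 / n))
    llr = 2 * sum(obs * math.log(obs / exp) for obs, exp in cells if obs)
    return {"pmi": pmi, "dice": dice, "chi2": chi2, "phi2": chi2 / n, "llr": llr}


# Toy counts: the two words always co-occur in 10 of 100 aligned pairs.
example = association_scores(10, 0, 0, 90)
```

Unlike φ², which is bounded by 1, χ² grows with corpus size.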


Statistics-based Chinese-English Lexicon Extraction (a sample of the resulting Noun-Noun lexicon)

CHINESE | ENGLISH | χ² SCORE

强制令 | injunction | 9895.36

空缺 | vacancy | 9883.18

文告 | proclamation | 9820.43

母亲 | mother | 9820.43

父亲 | father | 9820.43

速记员 | writer | 9802.76

量度 | measurement | 9802.76

准确性 | accuracy | 9802.76

来源 | source | 9802.76

附注 | remark | 9800.45

职级 | rank | 9781.23

女儿 | daughter | 9754.41

监狱 | prison | 9754.41

儿子 | son | 9754.41


Lexicon used for lexical alignment

爱称 [ai4 cheng1] /term of endearment/pet name/diminutive/
爱戴 [ai4 dai4] /love and esteem/
爱德华 [ai4 de2 hua2] /Edward/
艾德蕾德 [ai4 de2 lei3 de2] /Adelaide/
爱德玲 [ai4 de2 ling2] /Adeline/
艾迪生 [ai4 di2 sheng1] /Addison/
爱尔兰 [ai4 er3 lan2] /Ireland/
爱抚 [ai4 fu3] /to show tender care for/to care for (affectionately)/
爱国 [ai4 guo2] /patriotic/love of country/patriotism/
爱国者 [ai4 guo2 zhe3] /patriot/
爱好 [ai4 hao4] /to like/to be fond of/to be keen on/interest/hobby/
爱护 [ai4 hu4] /cherish/treasure/take good care of/
艾姬 [ai4 ji1] /Aggie/
爱克斯光 [ai4 ke4 si1 guang1] /X-ray/Roentgen ray/
隘口 [ai4 kou3] /(mountain) pass/
爱理不理 [ai4 li3 bu4 li3] /look cold and indifferent/be standoffish/
爱怜 [ai4 lian2] /show tender affection for/

Format: Chinese entry [pinyin] /English translation/.../ (67,135 Chinese entries)
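Entries in the layout shown above (headword, bracketed pinyin, slash-separated glosses) are easy to load into a lookup table. A minimal sketch, assuming one entry per line; `parse_entry` is a hypothetical helper, not part of any named tool.

```python
import re

# headword, "[pinyin syllables]", then "/gloss/gloss/.../"
ENTRY = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_entry(line):
    """Return (headword, pinyin syllables, glosses), or None if the line does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    headword, pinyin, glosses = match.groups()
    return headword, pinyin.split(), [g for g in glosses.split("/") if g]


entry = parse_entry("爱抚 [ai4 fu3] /to show tender care for/to care for (affectionately)/")
```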



Lexicon-based Chinese-English Lexical Alignment (the online Hong Kong English-Chinese glossary)

ab initio // 一 开始
abandon // 放弃
abandonment // 放弃 申请 通知
abate // 中止
abatement // 减少
abbreviation // 缩写
abduction // 诱拐
abet // 教唆
abeyance // 搁置
abide by // 遵守
ability // 能力
abnormality of mind // 神志 失常
abode // 居留
abolish // 废去
abolition // 免除
abortion // 堕胎
……

Format: English legal term // Chinese translation (manually segmented); about 6000 entries


Lexicon-based Chinese-English Lexical Alignment (after automatic application of the lexicon and glossary) (1)

Chinese Sentence

<s id=10009>

0: 裁判官 ( 1:magistrate )

1: 可 ( 2:may )

2: 随时 ( 5:any 6:time )

3: 命令 ( 26:order )

4: 将 ( )

5: 警务处 ( 21:Police )

6: 处长 ( 19:Commissioner )

7: 根据 ( )

8: 第 ( )

9:2 ( 24:2 )

10: 条 ( 23:section )

11: 接管 ( 16:possession )

12: 的 ( )

13: 任何 ( 9:any )

14: 处所 ( 31:premises )

15: 发还 ( 28:release )

16: 。 ( 32:. )

English Sentence

<s id=10009>

0:The ( )

1:magistrate ( 0: 裁判官 )

2:may ( 1: 可 )

3:, ( )

4:at ( )

5:any ( 2: 随时 )

6:time ( 2: 随时 )

7:, ( )

8:if ( )

9:any ( 13: 任何 )

10:premises ( )

11:have ( )

12:been ( )

13:taken ( )

14:into ( )

15:the ( )

16:possession ( 11: 接管 )


Lexicon-based Chinese-English Lexical Alignment (after automatic application of the lexicon and glossary) (2)

English Sentence (continued)

17:of ( )

18:the ( )

19:Commissioner ( 6: 处长 )

20:of ( )

21:Police ( 5: 警务处 )

22:under ( )

23:section ( 10: 条 )

24:2 ( 9:2 )

25:, ( )

26:order ( 3: 命令 )

27:the ( )

28:release ( 15: 发还 )

29:of ( )

30:the ( )

31:premises ( 14: 处所 )

32:. ( 16: 。 )
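A listing like the one above could be produced, in a much simplified form, by looking up each Chinese token in a bilingual lexicon and linking it to a matching, still unlinked span of English lemmas. This sketch is not the authors' procedure: it is a greedy first-match strategy, so it would not reproduce the disambiguation visible on the slide (for example 处所 being linked to the second occurrence of "premises"). The lexicon entries and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Lexicon = Dict[str, List[Tuple[str, ...]]]


def find_unused_span(en_lemmas: Sequence[str], translation: Tuple[str, ...],
                     used: Set[int]) -> List[int]:
    """Positions of the first contiguous, not yet linked occurrence of translation."""
    n = len(translation)
    for start in range(len(en_lemmas) - n + 1):
        span = list(range(start, start + n))
        if tuple(en_lemmas[start:start + n]) == translation and not used.intersection(span):
            return span
    return []


def align_tokens(zh_words: Sequence[str], en_lemmas: Sequence[str],
                 lexicon: Lexicon) -> Dict[int, List[int]]:
    """Map each Chinese token position to the English positions it is linked to."""
    used: Set[int] = set()
    links: Dict[int, List[int]] = {}
    for i, zh in enumerate(zh_words):
        links[i] = []
        for translation in lexicon.get(zh, []):
            span = find_unused_span(en_lemmas, translation, used)
            if span:
                links[i] = span
                used.update(span)
                break
    return links


toy_lexicon = {"裁判官": [("magistrate",)], "可": [("may",)], "随时": [("any", "time")]}
links = align_tokens(["裁判官", "可", "随时", "将"],
                     ["the", "magistrate", "may", ",", "at", "any", "time"],
                     toy_lexicon)
```

Tokens without a lexicon match, such as 将 here, keep an empty link list, like the empty brackets in the listing.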


Multi-Word Unit Identification (1)

• An MWU is (1) a word group with strong coherence, whose words appear together more often than expected by chance, and (2) a word group that conforms to certain syntactic patterns

• Some Chinese MWUs extracted from the Chinese part:
  中华_人民_共和国 (3326.64)
  太平_绅士 (4173.51)
  香港_特别_行政区 (1895.82)
  成文_法则 (4173.51)
  永久性_居民 (3977.13)
  附属_法例 (3641.09)

• Some English MWUs extracted from the English part:
  Hong_Kong (4159.44)
  postal_address (1896.68)
  Special_Administrative_region (3166.2)
  subsidiary_legislation (4058.09)
  Foreign_Affairs (3785.75)
  chief_executive (1588.8)
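The two criteria for an MWU, statistical coherence and a syntactic pattern, can be combined as in the sketch below: adjacent word pairs are kept only if their POS tags match an allowed pattern, and each surviving pair is scored with chi-square. The slide does not say which measure produced the scores in parentheses, so chi-square and the noun+noun pattern are assumptions here, as are the function names and the toy sentences.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table; returns 0.0 when a margin is empty."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def score_candidates(tagged_sentences, patterns=(("n", "n"),)):
    """Score adjacent word pairs whose tag sequence matches one of the patterns."""
    pair_counts, first_counts, second_counts = Counter(), Counter(), Counter()
    candidates = set()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
            second_counts[w2] += 1
            if (t1, t2) in patterns:
                candidates.add((w1, w2))
    total = sum(pair_counts.values())
    scores = {}
    for w1, w2 in candidates:
        a = pair_counts[(w1, w2)]          # w1 followed by w2
        b = first_counts[w1] - a           # w1 followed by something else
        c = second_counts[w2] - a          # something else followed by w2
        d = total - a - b - c              # neither
        scores[(w1, w2)] = chi_square(a, b, c, d)
    return scores


toy_corpus = [
    [("太平", "n"), ("绅士", "n"), ("可", "d"), ("命令", "v")],
    [("太平", "n"), ("绅士", "n"), ("须", "v"), ("命令", "v")],
    [("裁判官", "n"), ("可", "d"), ("命令", "v")],
]
mwu_scores = score_candidates(toy_corpus)
```

Units longer than two words, such as 中华_人民_共和国, would need an extension of this pairwise scheme, for instance repeated merging.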


Multi-Word Unit Identification (2) (English Noun+Noun Combinations)

N1 | N2 | χ² score

shorthand | writer | 17339.3

lo | mei | 15433.5

cast | iron | 15054.2

hydrogen | sulphide | 14849.4

plot | ratio | 14207.4

pendente | lite | 14174.9

zinc | oxide | 14174.9

remoter | descendants | 14174.9

mechanics | theory | 14174.9

credit | union | 13996.3

balance | sheet | 12457.9

mail | bag | 12382.4

herb | tea | 12371.7

methyl | alcohol | 12347.5

immersion | cooler | 11879.1

ear | protector | 11359.1


Multi-Word Unit Identification (3) (Chinese Noun+Noun Combinations)

N1 | N2 | χ² score

羁 | 留者 | 113972

国务 | 大臣 | 112980

芽子 | 碱 | 107326

庄重 | 气氛 | 103945

胸膜 | 肺炎 | 103945

前线 | 伤兵 | 103945

太平 | 绅士 | 103497

电信 | 联盟 | 98861.7

厂商 | 联合会 | 94710.6

鸡眼 | 涂料 | 94710.6

盲人 | 读物 | 94710.6

罗马 | 教廷 | 94710.6

文法 | 变体 | 94710.6

面粉 | 改良剂 | 94710.6


Multi-Word TEP Extraction

Chinese | English | χ² score

成人_图书馆 | adult_library | 68620.5

影子_董事 | shadow_director | 68469.8

幕_墙 | curtain_wall | 68469.8

卤味_店 | lo_mei | 68282.1

橡胶_手套 | rubber_glove | 68041.9

橡胶_围裙 | rubber_apron | 67723.5

疾病_津贴 | sickness_allowance | 67433.1

计算机_软件 | computer_software | 67281.6

软_雪糕 | ice_cream | 67281.6

污水_隧道 | sewage_tunnel | 66626.8

工程_原理 | engineering_principle | 66626.8
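One plausible route to multi-word TEPs like those above is to merge each identified MWU into a single underscore-joined token and then count co-occurrences over aligned sentence pairs, so that an association score can be computed exactly as for single words. The slides do not describe the actual procedure, so this is an assumption; the helper names and toy data are hypothetical.

```python
def merge_mwus(words, mwus, max_len=3):
    """Greedy longest-match merge of known MWUs into underscore-joined tokens."""
    merged, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in mwus:
                merged.append("_".join(words[i:i + n]))
                i += n
                break
        else:  # no MWU starts at position i
            merged.append(words[i])
            i += 1
    return merged


def contingency(pairs, zh_unit, en_unit):
    """2x2 counts over aligned pairs: (both, Chinese only, English only, neither)."""
    a = b = c = d = 0
    for zh_tokens, en_tokens in pairs:
        in_zh, in_en = zh_unit in zh_tokens, en_unit in en_tokens
        if in_zh and in_en:
            a += 1
        elif in_zh:
            b += 1
        elif in_en:
            c += 1
        else:
            d += 1
    return a, b, c, d


zh_mwus = {("橡胶", "手套"), ("橡胶", "围裙")}
merged = merge_mwus(["橡胶", "手套", "及", "橡胶", "围裙"], zh_mwus)
toy_pairs = [
    (["橡胶_手套"], ["rubber_glove"]),
    (["橡胶_手套", "及"], ["rubber_glove", "and"]),
    (["处所"], ["premises"]),
]
table = contingency(toy_pairs, "橡胶_手套", "rubber_glove")
```

The resulting counts can then be fed to any of the association measures named earlier in the talk.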


Further work to be done and other applications

• Further work:
  Chinese Multi-Word Unit (MWU) Identification
  English MWU Identification
  MWU Alignment
  Extraction of Translation Examples
  Context learning for every TEP in the database

• Other applications of the database of TEPs:
  Target word selection in Machine Translation (MT) and Machine Aided Translation (MAT)
  Cross Language Information Retrieval (CLIR)
  Bilingual Lexicography
  Online Bilingual Intelligent Dictionary
  Automatic Acquisition of Bilingual Legal Terms


The End

Thank you very much!