New Language Resources for Cantonese Linguistics Research: A Linguistic Corpus of Mid-20th Century Hong Kong Cantonese

Andy C. Chin, The Hong Kong Institute of Education

PNC2013, Kyoto University, December 10-11, 2013


Outline

• Why “Cantonese”?
• Research on early Cantonese (19th to mid-20th C)
  – Diachronic development
• The corpus
  – Source of data
  – Demonstration of search engine


Cantonese in Hong Kong


Cantonese

• One of the dialects of the Chinese language family

• In spite of being a dialect, Cantonese serves as a lingua franca in Hong Kong, Macau and most parts of Guangdong Province, China


Use of Cantonese


“Cantonese” in early Hong Kong

• A fishing village
• Population: 1851: ~33,000
  – Four major ethnic groups:
    • Guangfu 廣府 (本地, Punti)
    • Danjia 蛋家 (seafaring people)
    • Hakka 客家
    • Min 閩語 (鶴佬 Hoklo / 潮州 Chiuchow)
• Their languages are mutually unintelligible


Given the long history of Cantonese in HK

• We are interested in understanding its development in the past 200 years

• Are there any differences between early Cantonese and modern Cantonese?

• How can we capture these differences?


Diachronic studies of Cantonese

• Two approaches
  – Apparent time approach
  – Real time approach


Apparent time approach

• Age-stratified variation in a linguistic form is often indicative of a change in progress
  – 75 vs. 50 vs. 25 y/o: changes over 50 years
  – Language of 200 years ago?
  – Language change: can we assume a speaker still speaks the language of his time?
• If two speakers show no difference with respect to a linguistic feature, does it mean that there has been no change?


Real time approach

• Samples the population over an extended period of time (a longitudinal study)

• Requires collecting data produced in the period concerned


Limitations on Research in Cantonese

• Cantonese is a vernacular language

• Spoken data is needed

• Are there any records of early 19th-century Cantonese?
  – Spoken data vs. written records


With these early materials,

• We are able to reconstruct the early stage of the Cantonese language (about 200 years ago)

• Some of the linguistic features are very different from those in modern Cantonese


Previous research on Cantonese

• Neutral Qs
• Directional complements
• Aspect markers
• Demonstratives
• Phonology
• Verb complement
• Comparative construction
• Lexicon (sociolinguistics)
• Dative verb GIVE
• Sentence final particles
• Grammar of the late Qing period


Furthermore,

• Some linguistic changes took place or were completed around the mid-20th century
  – Dative marker: 過 → 畀 (送本書過/畀佢, "give him a book")
  – Neutral Q: 你去睇戲唔呀 → 你去唔去睇戲呀 ("Are you going to see a movie?")
  – …

• New and old features might co-exist in mid-20th C
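The two neutral-question patterns above can be told apart mechanically, at least roughly. The following is a minimal sketch, assuming simple surface heuristics: the regular expressions and the function name `classify_neutral_question` are illustrative assumptions, not the annotation scheme used in the corpus.

```python
import re

# Heuristic surface patterns (assumptions for illustration only).
# V-Neg-V: the same character on both sides of 唔 (e.g. 去唔去),
# or the fused forms 係咪 (from 係唔係) and 有冇.
A_NOT_A = re.compile(r"(.)唔\1|係咪|有冇")
# VP-Neg: a negator at the very end, optionally followed by a
# sentence-final particle and a question mark (e.g. ...唔呀, ...冇呀?).
VP_NEG = re.compile(r"(唔|冇|未)[呀啊]?[?？]?$")


def classify_neutral_question(utterance: str) -> str:
    """Return 'V-Neg-V', 'VP-Neg', or 'other' for one utterance."""
    text = utterance.strip()
    if A_NOT_A.search(text):
        return "V-Neg-V"
    if VP_NEG.search(text):
        return "VP-Neg"
    return "other"


for sentence in ["你去睇戲唔呀", "你去唔去睇戲呀"]:
    print(sentence, classify_neutral_question(sentence))
```

A character-level heuristic like this will miss disyllabic verbs and will misfire on accidental repetitions, so counts obtained this way would need manual checking.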


Morrison (1828) → Chao (1947): 120 years

Chao (1947) → 2013: ~66 years


Existing Cantonese corpora

1. The Hong Kong Cantonese Child Language Corpus

2. The Hong Kong Bilingual Child Language Corpus

3. Hong Kong Cantonese Corpus

4. The Hong Kong Cantonese Adult Language Corpus

5. 19th Century Cantonese Corpus


Source of corpus data

– Real time vs. apparent time
– Naturally occurring data
– HK Cantonese movies (粵語長片)


http://corpus.ied.edu.hk/hkcc/


HK Movie Industry in mid-20th C.

Year          No. of Cantonese movies   No. of PTH (Putonghua) movies
1952 - 1955   627                       222
1956 - 1960   963                       314
1961 - 1965   928                       206
1966 - 1970   361                       286
Total         2879                      1028

Source of data: Chung (2004:177)
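The proportions behind these counts are easy to check. The sketch below copies the figures from the table and computes the Cantonese share of output per period; the variable and function names are illustrative assumptions.

```python
# Counts copied from the table (Chung 2004:177): (Cantonese, PTH) per period.
movies = {
    "1952-1955": (627, 222),
    "1956-1960": (963, 314),
    "1961-1965": (928, 206),
    "1966-1970": (361, 286),
}


def cantonese_share(cantonese: int, pth: int) -> float:
    """Fraction of movies in a period that were Cantonese."""
    return cantonese / (cantonese + pth)


total_cantonese = sum(c for c, _ in movies.values())
total_pth = sum(p for _, p in movies.values())

for period, (c, p) in movies.items():
    print(f"{period}: {cantonese_share(c, p):.1%} Cantonese")
print(f"Total: {total_cantonese} Cantonese, {total_pth} PTH, "
      f"{cantonese_share(total_cantonese, total_pth):.1%} Cantonese")
```

The column sums reproduce the stated totals, and the Cantonese share is noticeably lower in 1966-1970 than in the three earlier periods.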


About the corpus

• 21 movies have been transcribed with Chinese characters: ~200k characters
• Word segmentation
• Search engine (14 movies, since Apr 2012)
  – http://corpus.ied.edu.hk/hkcc/
  – 350+ registered users


Search criteria

• Characters or words (segmented units)
• Cantonese pronunciation
• Movie names
• Names of speakers
• Gender of speakers
• …
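To make these criteria concrete, here is a minimal sketch of how such a filter could work over segmented utterances. It is not the search engine's actual implementation: the `Utterance` record, its field names, the `search` function and the sample data (movie title, speaker labels, genders, segmentation) are all hypothetical, and the pronunciation criterion is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Utterance:
    """One transcribed line; every field name here is hypothetical."""
    movie: str
    speaker: str
    gender: str              # "F" or "M"
    words: Tuple[str, ...]   # segmented units

    @property
    def text(self) -> str:
        return "".join(self.words)


def search(utterances: Iterable[Utterance], *,
           word: Optional[str] = None,
           characters: Optional[str] = None,
           movie: Optional[str] = None,
           speaker: Optional[str] = None,
           gender: Optional[str] = None) -> Iterator[Utterance]:
    """Yield utterances that satisfy every criterion that was supplied."""
    for u in utterances:
        if word is not None and word not in u.words:
            continue  # must match a whole segmented unit
        if characters is not None and characters not in u.text:
            continue  # plain character-string match
        if movie is not None and u.movie != movie:
            continue
        if speaker is not None and u.speaker != speaker:
            continue
        if gender is not None and u.gender != gender:
            continue
        yield u


# Invented sample records for illustration only.
SAMPLE = [
    Utterance("Movie A", "Speaker 1", "F", ("你", "去", "唔", "去", "睇戲", "呀")),
    Utterance("Movie A", "Speaker 2", "M", ("你", "去", "睇戲", "唔", "呀")),
]

for hit in search(SAMPLE, word="睇戲", gender="F"):
    print(hit.speaker, hit.text)
```

Keeping word search and character search separate reflects the distinction in the first bullet: a word query matches only whole segmented units, while a character query can cut across segment boundaries.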


契爺艷史 (1952)

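• Yes-No question
  – VP-Neg: 你位千金有讀書冇呀? ("Has your daughter had any schooling?")
  – V-Neg-VO: 呢道係咪有位黃小姐? ("Is there a Miss Wong here?")
• Dative marker
  – 重要畀錢過人? ("Still have to give money to people?")
  – 咪可以快啲還清啲債畀人? ("Then couldn't the debts be paid back to people sooner?")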
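The co-existence of the older dative marker 過 and the newer 畀 can be tallied with a simple pattern. The sketch below is a rough heuristic under stated assumptions: it only looks at utterances that end in a marker plus a short, hand-picked list of recipients, and the names `RECIPIENT`, `DATIVE` and `dative_marker` are hypothetical.

```python
import re
from collections import Counter

# Hand-picked recipients (an assumption): pronouns and generic 人.
RECIPIENT = r"(?:佢哋|我哋|你哋|人|佢|我|你)"
# A dative marker immediately followed by an utterance-final recipient.
DATIVE = re.compile(rf"(過|畀){RECIPIENT}$")


def dative_marker(utterance: str):
    """Return '過' or '畀' if the utterance ends in marker + recipient, else None."""
    text = utterance.strip().rstrip("?？!！。.")
    match = DATIVE.search(text)
    return match.group(1) if match else None


examples = ["送本書過佢", "送本書畀佢", "重要畀錢過人?", "咪可以快啲還清啲債畀人?"]
counts = Counter(dative_marker(s) for s in examples)
print(counts)
```

Because 畀 is also the verb "give", an utterance such as 畀佢 with no preceding verb would be miscounted as a dative marker, so a real tally would need part-of-speech information or manual review.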


Some challenges

• Quality of speech
• Overlap of speech
• Representations of colloquial vocabulary
• Parts-of-speech: How many types?
• Discourse features
• …


Acknowledgments

• ECS research grants, RGC:
  – Linguistic Analysis of Mid-20th Century Hong Kong Cantonese by Constructing an Annotated Spoken Corpus (2013/2015)
• HKIEd Internal Research Grants:
  – RG41/2010-2011: Spoken Corpus Construction and Linguistic Analysis of Mid-20th Century Cantonese
  – RG62/12-13R: A Preliminary Linguistic Analysis of Mid-20th Century Cantonese from a Corpus-based Approach


Demonstration

• http://corpus.ied.edu.hk/hkcc/
