Compiling a Spoken Chinese Corpus of Situated Discourse
Gu Yueguo
The Institute of Linguistics
The Chinese Academy of Social Sciences
Corpora Overview
Spoken Chinese Corpora
• A corpus of situated discourse
• A corpus of major dialects
• A corpus of speech
Written Chinese Corpora
• A corpus of contemporary written Chinese
• A corpus of Pre-Qing written Chinese
Main headings
Components of the compiling process
1. Real world discourse – what is it?
2. Recording
3. Encoding
   1. Transcription (a)
   2. Transcription (b)
   3. Mark-up
   4. Tagging
4. Application
(0) ‘Real world’ spoken discourse
(1) Recording
(2a) Character transcription
(2b) Transcription for a special purpose
(3) Mark-up
(4) Coding
(5) Application
0
Discourse in the Real World
Spoken Chinese
No preparation
• Single speaker: e.g. talk to oneself
• Two or more speakers: *e.g. everyday talks
Topics pre-set with no preparation
• Single speaker: e.g. narrate a personal story
• Two or more speakers: *e.g. sports salon
Topics pre-set with no written preparation
• Single speaker: e.g. oral exam
• Two or more speakers: *e.g. press interview
Talking based on a written script
• Single speaker: e.g. soliloquy, 1-person cross talk
• Two or more speakers: e.g. acting, cross talk
Reading a written script
• Single speaker: e.g. news reading, reading practice
• Two or more speakers: e.g. collective reciting
Real world situated discourse
(1) It is situated to an actual social situation;
(2) It is situated to actual users;
(3) It is situated to an inter-subjective world of discourse;
(4) It is situated to actual goals;
(5) It is situated to spatial and temporal setting;
(6) It is situated to the cognitive capacity of actual users;
(7) It is situated to performance contingencies of actual users who are engaged in spontaneous talking with little pre-planning.
[Figure: one academic's weekly round of discourse settings. Recoverable labels, grouped by setting:]
Building X: clerks, colleagues, F-staff, visitors, ZWF; staff meeting, phone calls
Building Y (Mon, Thurs): students, other colleagues, Colleague 1, Colleague 2, visitors; phone calls
Building Z (Tues, Wed, Fri): project teams 1, 2 and 3
Residential building (Mon-Fri): wife, son, kindergarten, markets, neighbours, swimming pool
Summer resort (Sat, Sun): senior managers, research center staff, conference organizers, hotel staff, sports playmates
1. Talking is the task, e.g. meeting, seminar (it is task-oriented, task-goal-directed, and segmented on the basis of the goal-attaining process; note that turn-taking rules are based on this type of talking-task relation)
2. Talking is the main constitutive part of the task, e.g. some classroom discourse, doctor-patient discourse (it is task-oriented, task-goal-directed, and segmented on the basis of the goal-attaining process)
3. Talking is a constitutive part of the task, e.g. giving instructions from time to time (task performance is dominant, talking tends to be fragmented)
4. Talking and task run in conflicting parallel, the achievement of the latter serving as a means to the goal of the former, e.g. business dinner (business table talk) (note that segmenting this kind of talk can be based on the task)
5. Talking is an embedded social part of the task, e.g. talking over a meal (talking has no specific goal to reach)
6. Talking is a decorative part of the task, e.g. talking accompanying tea-making
7. Talking is a hindrance to the task, e.g. talking during a written exam
8. Talking and task are independent of each other
Talking and Doing Interwoven in the Real World
Micro performance analysis of five minute activities
Columns: spatial-temporal span; relation between acts; doing; relation between doing and talking; talking

00:00-1:15
• Relation between acts: parallel and independent
• Doing: X helps himself to noodles; Y sorts out the things on the table
• Relation between doing and talking: conflictive; parallel and independent
• Talking: X and Y gossip

1:27-2:06
• Relation between acts: parallel and independent
• Doing: X sorts out the bowl and the chopsticks; Y switches on the computer
• Relation between doing and talking: parallel and relevant
• Talking: X and Y talk about the journal editing

2:11-3:06
• Relation between acts: parallel and independent
• Doing: X sorts out the things on the table; Y continues to sort out the things
• Relation between doing and talking: parallel and independent
• Talking: Y talks to X about a politician

3:19-4:25
• Relation between acts: parallel and independent
• Doing: X starts to reinstall his computer; Y starts to do the layout on computer
• Relation between doing and talking: parallel and relevant
• Talking: X talks to Y about the journal layout

4:34-4:40
• Relation between acts: parallel and independent
• Doing: X continues reinstalling; Y continues doing the layout
• Relation between doing and talking: parallel and relevant
• Talking: X continues talking to Y about the journal editing
Sampling: Whose job?
Sinclair (1991:13) writes:
The specification of a corpus --- the types and proportions of material in it --- is hardly a job for linguists at all, but more appropriate to the sociology of culture. The stance of the linguist should be a readiness to describe and analyse any instances of language placed before him or her. In the infancy of the discipline of corpus linguistics, the linguists have to do the text selection as well; when the impact of the work is felt more widely, it may be possible to hand over this responsibility to language-oriented social scientists.
The standard variety approach
It is arguable that Putonghua should be chosen as the target language, ruling other dialects out of the picture. There are at least two major reasons for doing so. First, Putonghua serves as the standard language used by the media and in education. Second, other spoken corpora have also adopted the standard variety.
Criticisms of the standard variety approach
The approach is subject to serious criticisms relating to the preservation of the naturalness of language use. The standard variety is given its identity before the corpus is compiled. The corpus cannot be used to represent its naturalness, nor be used to establish or demonstrate its identity. … what the compilers believe Putonghua looks like. Subjective judgment is also involved in sampling Putonghua speakers by filtering non-standard speakers out. … Unless they are ‘commissioned’ to talk among themselves, the activities the standard and non-standard interactants are engaged in have to be properly filtered as well.
The sampling: The workplace approach
It is true that situated discourses are unlimited in number. However, the types of social situations to which they are situated can in theory be listed exhaustively. According to the Beijing Yellow Book 1999, there are 67783 social work units, which we divide into 6 major categories and 31 sub-categories.

6 major categories of social work units
Category                                          Units    Share
01 Government, parties and other social bodies     4823    7.12%
02 Economic organizations                         53838   79.43%
03 Education, research and arts                    6840   10.09%
04 Health, sports and social welfare               1365    2.01%
05 Public welfare                                   890    1.31%
06 Military                                          27    0.04%
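The shares in the table follow directly from the unit counts, and the same counts could in principle drive a proportional allocation of recordings across categories. The sketch below recomputes the shares and then splits a hypothetical sample size across the six categories with largest-remainder rounding. The allocation step is an illustrative assumption, not a procedure described in the slides, and the function names are invented for the example.

```python
from math import floor

# Unit counts per major category, copied from the table above.
UNIT_COUNTS = {
    "01 Government, parties and other social bodies": 4823,
    "02 Economic organizations": 53838,
    "03 Education, research and arts": 6840,
    "04 Health, sports and social welfare": 1365,
    "05 Public welfare": 890,
    "06 Military": 27,
}


def shares(counts):
    """Return each category's percentage share of all units."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


def allocate(counts, sample_size):
    """Split sample_size across categories in proportion to their counts.

    Hypothetical helper: uses largest-remainder rounding so that the
    integer quotas always add up to sample_size.
    """
    total = sum(counts.values())
    exact = {name: sample_size * n / total for name, n in counts.items()}
    quota = {name: floor(x) for name, x in exact.items()}
    leftover = sample_size - sum(quota.values())
    # Hand the remaining slots to the categories with the largest fractions.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


if __name__ == "__main__":
    for name, pct in shares(UNIT_COUNTS).items():
        print(f"{name}: {pct:.2f}%")
    print(allocate(UNIT_COUNTS, 100))
```

A strictly proportional design would leave the smallest categories with almost no recordings, which is one reason a compiler might deliberately over-sample them.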
No.  Descriptive title                              MP3 files   Total size (bytes)
1    accident mediation 1                                   5           23,369,326
2    accident mediation 2                                   8           30,944,114
3    Administrative meetings                              107          561,000,000
4    assessment meeting                                     6           68,500,000
5    auction                                               30          158,000,000
6    BFSU meeting                                          14           66,200,000
7    Birthday celebration                                  10           43,100,000
8    BTVU seminar                                          26          138,000,000
9    bus talk                                              60          294,392,298
10   business negotiation 1                                27          143,285,178
11   business negotiation 2                                26          140,260,744
12   business negotiation 3                                54          284,761,458
13   business negotiation 4                                 9           44,767,134
14   child discourse                                      163        1,115,063,560
15   Chinese and Korean first contact                       7           34,708,716
16   Chinese New Year celebration                          11          126,323,484
17   Classmates get-together                               14           73,063,728
18   Classroom discourse-teach Chinese to Koreans         125          574,000,000
19   commercial house key-handling procedure               16           84,512,806
20   community talks                                      322        1,734,865,326
21   end year celebration                                  17           78,310,716
22   fortune telling                                       33          390,741,362
23   Gu Yueguo a week record                              248        1,235,679,186
24   house allocation meeting                              44          239,388,838
25   house decoration team talks                           36          181,660,952
26   Jiangsu TVU review meeting                            11           49,675,918
27   kindergarten meeting                                  28          146,741,690
28   Lan Baochun family talks                              22          285,975,640
29   lawsuit                                               93          508,628,422
30   lovers conversation                                   11           59,845,160
31   medical discourse                                    156          764,274,198
32   ministry education meeting                            99          522,992,404
33   office talk ministry of communication                114          577,889,242
34   peasant family                                        73          373,917,094
35   Peking Univ ceremony                                   7           46,894,312
36   play mah-jong                                         28          145,754,884
37   private conversation                                  77          401,858,424
38   Radio Communication interviews                        24          919,456,512
39   sell and buy                                         296        1,150,000,000
40   seventy-eighty yrs old peasant talks                  22          125,624,138
41   street market shopping                                37          190,887,972
42   student dormitory talks                               66          345,920,582
43   table talks                                           89          529,995,698
44   visit blood donors                                    14           71,655,104
45   Zhu Rongji press conference                           20           97,984,672
     total (1 second = 15.6503 KB)                       2705       15,180,870,992 = 970005.11 seconds / 269.44 hrs
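The total row converts file size into duration using the quoted rate of 15.6503 KB per second of MP3 audio. A minimal sketch of that conversion follows, assuming decimal kilobytes (1 KB = 1000 bytes), which is the reading that reproduces the quoted figure in seconds. The function names are illustrative.

```python
KB_PER_SECOND = 15.6503  # rate quoted in the total row of the table
BYTES_PER_KB = 1000      # assumption: decimal kilobytes


def mp3_seconds(size_bytes):
    """Estimate playing time in seconds from an MP3 file size in bytes."""
    return size_bytes / (KB_PER_SECOND * BYTES_PER_KB)


def mp3_hours(size_bytes):
    """Estimate playing time in hours from an MP3 file size in bytes."""
    return mp3_seconds(size_bytes) / 3600


if __name__ == "__main__":
    total_bytes = 15_180_870_992  # total size from the table
    print(f"{mp3_seconds(total_bytes):.2f} seconds")
    print(f"{mp3_hours(total_bytes):.2f} hours")
```

The same function can be applied to any single row, for instance to estimate the duration of the 322 community-talk files from their combined size.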
1
Recording
Recording
1. Who does the recording?
2. What role does the person assume while recording?
3. What is the quality of the recording?
4. In what manner is the recording to be made?
5. How are the ethics of recording to be properly taken care of?
6. What details are to be noted while recording?
7. How are the recordings to be kept safe?
What role does the person assume while recording?
The recording person as a legitimate observer: s/he is allowed by the authority to take non-active part in the activity and record the talk. S/he is an outsider. The party is aware of her or his presence and of her or his purpose of being there.
The recording person as a genuine participant: s/he is an insider.
The recording person as a surreptitious observer: s/he is one of the public members, and her or his presence draws no particular attention from anyone else.
In what manner is the recording to be made?
• With the approval of all the participants
• With the approval of the key participant
• With the approval of the unit authority
• Open recording which can be noticed by anyone
• Surreptitiously
Recording Record Card
Recorder's name: ________________ Sex: ______________ Occupation: _______________________
Recording start date: _____ year ____ month ____ day
Recording end date: _____ year ____ month ____ day
Recording start time: morning _____ o'clock / afternoon _____ o'clock / evening _____ o'clock
Recording end time: morning _____ o'clock / afternoon _____ o'clock / evening _____ o'clock
Place of talk: _____ province _____ city ____ county ____ township _____ village
Work unit: ______________________________________________
Setting of talk: e.g. office, friend's home, restaurant, meeting room, supermarket, on a train, workshop, at home, shopping mall, hospital, courtroom, hotel, on the street, at an evening party, ___________
Where were you while recording this side of the tape? 1. ________________ 2. __________________ 3. ___________________
Recording manner: open / secret / secret at first, then open / some participants knew and consented / all participants knew and consented
Please fill in the table below with details of the speakers on this side of the tape (the more detailed the better):
• Name
• Occupation, professional title, post
• Age
• Sex
• Education level
• Accent
• Relationship to you and to the other speakers
Purpose and occasion of the talk: ________________________________________________
Reminder: when this side is finished, check whether the tape needs to be turned over!
(The following is to be filled in by corpus staff)
Original sound file name: _____________________ Character transcription file name: ____________________________
CD number of original sound file: ______________ Segmented sound file name: __________________________
Classification folder name: ______________________ Other: ______________________________________
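The record card is in effect a metadata schema for each tape side. One way such a card could be held in machine-readable form is sketched below. The class and field names are hypothetical, chosen to mirror the card's entries, and nothing here is taken from the project's actual tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RecordingManner(Enum):
    """The five recording manners offered on the card."""
    OPEN = "open"
    SECRET = "secret"
    SECRET_THEN_OPEN = "secret at first, then open"
    SOME_CONSENTED = "some participants knew and consented"
    ALL_CONSENTED = "all participants knew and consented"


@dataclass
class Speaker:
    """One row of the speaker table on the card."""
    name: str
    occupation: str = ""
    age: str = ""
    sex: str = ""
    education: str = ""
    accent: str = ""
    relationships: str = ""  # relation to the recorder and other speakers


@dataclass
class RecordingCard:
    """Hypothetical record for one side of a tape."""
    recorder_name: str
    place: str
    setting: str
    manner: RecordingManner
    purpose: str = ""
    speakers: List[Speaker] = field(default_factory=list)

    def missing_fields(self):
        """List the card-level entries that were left blank."""
        required = {
            "recorder_name": self.recorder_name,
            "place": self.place,
            "setting": self.setting,
            "purpose": self.purpose,
        }
        return [name for name, value in required.items() if not value.strip()]


if __name__ == "__main__":
    # Invented example values, for illustration only.
    card = RecordingCard(
        recorder_name="Recorder A",
        place="Beijing",
        setting="office",
        manner=RecordingManner.ALL_CONSENTED,
        speakers=[Speaker(name="Speaker 1", accent="Beijing")],
    )
    print(card.missing_fields())
```

A check such as `missing_fields` reflects the card's own request that entries be as detailed as possible, since blank metadata is hard to recover once the recording is archived.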
How are the recordings to be kept safe?
The recordings on the 74-minute mini disks are all converted into WAV files by using the recording function of the sound card. The format is 16 bits, stereo, 44100 Hz. The WAV files are then stored on 640 MB recordable compact discs. They are further backed up by being converted into MP3 format (to economize on storage space) and saved again on separate 640 MB recordable compact discs. Furthermore, all the MP3 files are stored on a portable 20 GB USB hard disk.
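The storage demands of the stated format can be worked out from its three parameters. The sketch below computes the uncompressed audio data rate for 16-bit, stereo, 44100 Hz sound and the size of a given number of minutes. It assumes plain PCM encoding and ignores the small WAV file header. The function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100  # samples per second per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2             # stereo


def wav_bytes_per_second():
    """Uncompressed PCM data rate in bytes per second."""
    return SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS


def wav_bytes(minutes):
    """Audio data size in bytes for the given number of minutes."""
    return wav_bytes_per_second() * minutes * 60


if __name__ == "__main__":
    print(wav_bytes_per_second())  # bytes per second of audio
    print(wav_bytes(74))           # bytes for one full 74-minute disk
```

Under these assumptions the WAV data rate is 176,400 bytes per second, roughly eleven times the 15.6503 KB per second quoted for the MP3 files, which is the saving the MP3 backup exploits.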
2
Transcription
The encoding process
1. Transcription in Chinese characters
2. Transcription in Pinyin/IPA symbols
3. Transcription by using Praat
4. Mark-up by XML
5. Tagging
Issues in segmentation
Segmenting sound streams into orthographic and phonetic linear units is the first major concern of the present project. It proves to be theoretically significant and practically difficult. The only natural unit boundaries are speaker-turns (turn defined in terms of the speaker’s presence of phonation). The other units either larger or smaller than turns tend to be more like theoretical constructs than otherwise.
Basic unit?
Acoustically speaking, a spontaneous talk is a sequence of strings of sounds uttered by two or more speakers. Prosodic or intonational units seem to be natural segments of the sequence. They are treated as basic units of talk and seem to have the same status as the sentence does in written text. The weaknesses of such segmentation are (1) segments larger than intonational units are assumed to be the mere stacking of these basic units, which is untrue and hence misleading; and (2) talk is treated as a self-contained product waiting to be sliced into intonational units, thus ignoring the dynamic aspect of talk and its intrinsic relation with the social activities at large.
Multiple level segmentation 1
The first-level segment: the activity boundary
(segmenting talk from other social activities)
• Schedule boundary, e.g. a two-hour meeting, classroom discourse
• Visit boundary, e.g. a patient’s visit to a doctor
• Case boundary, e.g. an accident settlement
• Appointment boundary, e.g.
• Business boundary, e.g. buy something
Multiple level segmentation 2
The second-level segment: goal-oriented segmentation
(segmenting talk into goal-attaining chunks)
The segmentation is made on the basis of the goal-attaining process (goal-attainment structure)
• E.g. opening, negotiating, closing of a meeting
• E.g. examine-diagnose-prescribe-recommend
• E.g. the presentation of a speaker
Multiple level segmentation 3
The third-level segment: turn-oriented segmentation
(segmenting goal-attaining chunks into turn-taking chunks)
The segmentation is made on the basis of turn-boundary
Multiple level segmentation 4
The fourth-level segment: functional units
(segmenting turn-taking chunks into functional units)
The segmentation is made on the basis of functional markers or clues.
• A meaningful cluster with a clear forward function
• A meaningful cluster with a clear backward function
• A meaningful cluster with a clear downward function
• A meaningful cluster having a clear cognitive function: planning or searching for words
Multiple level segmentation 5
The fifth level segment: linear character and phonetic units
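The five levels nest inside one another, so a segmented recording can be pictured as a tree: activity, goal-attaining chunks, turns, functional units, and finally the character and phonetic strings. The sketch below is one hypothetical way to model that nesting. The class names, field names and the sample content are all invented for illustration and are not the project's own encoding scheme.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FunctionalUnit:
    """Level 4 unit carrying its level 5 linear transcriptions."""
    function: str        # "forward", "backward", "downward" or "cognitive"
    characters: str      # level 5: character transcription
    phonetic: str = ""   # level 5: phonetic transcription, if available


@dataclass
class Turn:
    """Level 3: one speaker turn."""
    speaker: str
    units: List[FunctionalUnit] = field(default_factory=list)


@dataclass
class GoalChunk:
    """Level 2: one step in the goal-attaining process."""
    label: str
    turns: List[Turn] = field(default_factory=list)


@dataclass
class Activity:
    """Level 1: a stretch of talk delimited by an activity boundary."""
    boundary_type: str   # e.g. "schedule", "visit", "case", "business"
    chunks: List[GoalChunk] = field(default_factory=list)


def flatten(activity: Activity) -> Iterator[Tuple[str, str, str, str]]:
    """Walk the tree and yield (chunk label, speaker, function, characters)."""
    for chunk in activity.chunks:
        for turn in chunk.turns:
            for unit in turn.units:
                yield chunk.label, turn.speaker, unit.function, unit.characters


if __name__ == "__main__":
    # Invented example: a patient's visit to a doctor.
    visit = Activity(
        boundary_type="visit",
        chunks=[
            GoalChunk(
                label="examine",
                turns=[
                    Turn("doctor", [FunctionalUnit("forward", "<question>")]),
                    Turn("patient", [
                        FunctionalUnit("cognitive", "<hesitation>"),
                        FunctionalUnit("backward", "<answer>"),
                    ]),
                ],
            ),
            GoalChunk(label="diagnose", turns=[]),
        ],
    )
    for row in flatten(visit):
        print(row)
```

Keeping the levels as separate types makes it explicit that larger segments are defined by activity and goal structure rather than by stacking the smaller units.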
[Figure: "Natural growth and development of language". Repeated panels each show several "Trajectories of life path" leading to "Internalized language out of life path trajectories".]
Linguistic theory
• as reconstruction
• as modeling
• as description
• as standardization