
Copyright © 2001 NISC Pty Ltd
SOUTHERN AFRICAN LINGUISTICS AND APPLIED LANGUAGE STUDIES
ISSN 1607–3614

Southern African Linguistics and Applied Language Studies 2001, 19: 111–131
Printed in South Africa. All rights reserved.

Corpus applications for the African languages, with special reference to research, teaching, learning and software¹

DJ Prinsloo* and Gilles-Maurice de Schryver
Department of African Languages, University of Pretoria, Pretoria 0002, South Africa

*Corresponding author, e-mail: prinsloo@postino.up.ac.za

Abstract: The point of departure of the present article is the realisation that more and more serious contemporary linguistic applications are based on electronic corpora. If African linguistics is to take its rightful place in the new millennium, the active compilation, querying and application of corpora should therefore become an absolute priority. In order to illustrate the feasibility of corpus applications for the African languages at present, the article first considers ‘fundamental linguistic research’ in the fields of phonetics and question particles. It is shown how that research was boosted as a result of the utilisation of corpora. In a second section, ‘language teaching and learning’ is given due attention by means of the corpus-aided compilation of pronunciation guides and textbooks, and the teaching of morpho-syntactic and contrastive structures. Finally, in the field of ‘language software’, a series of first-generation spellcheckers based on corpora is reviewed. All applications are exemplified with reference to one or more of the following African languages: Cilubà, Sepedi, isiXhosa, isiZulu, and Setswana.

Introduction

The present article focuses on various corpus applications in the broad field of linguistics, with special reference to the African languages. In a previous article we argued that ‘[c]ompiling and querying electronic corpora has become a sine qua non as an empirical basis for contemporary linguistic research. As a result, around the world, corpus applications now abound in all fields of linguistics’ (De Schryver & Prinsloo, 2000:89). The compilation of African-language corpora and corpus query tools are the subjects of that previous article.

Present-day scholars are unanimous when it comes to the crucial role corpora play in the broad field of modern linguistic applications:

    ‘As a consequence of the growing global interest in large electronic text corpora in the past few years, th[e present-day Dutch] corpus will be a component of a multifunctional collection of electronic texts, rather than used for lexicographical purposes only’ (Kruyt, 1995:19)

    ‘Carefully constructed, large written and spoken corpora are essential sources of linguistic knowledge if we hope to provide extensive and adequate descriptions of the concrete use of the language in real text. These types of descriptions certainly remain impossible if we only rely on introspection and native speaker intuition’ (Calzolari, 1996:4)

    ‘It is now almost inconceivable that worthwhile and comprehensive lexical descriptions can be undertaken without a corpus’ (Kennedy, 1998:91)

According to Kruyt, firstly: ‘At the level of word form, [...] analysis of corpus data may be supported by statistical tools’, secondly: ‘The analysis at other language levels than word form requires a corpus encoded for linguistic features’, and thirdly: ‘Statistic devices can be applied on encoded linguistic features as well’ (1995:126–127). As indicated in De Schryver and Prinsloo (2000:95–96), African-language corpus linguistics goes a long way with corpora clear of any codes (cf. also Hurskainen, 1998, § 2). The present article will therefore deal with corpus applications for the African languages on Kruyt’s ‘first level’, which means that ‘raw’ corpora are used which have not been supplemented by a series of so-called ‘standard corpus pre-processing’ annotations.

In contrast to the different levels suggested by Kruyt, Calzolari stresses the complexity of the mutual interactions between lexicon and corpus:

    ‘We can summarise, without claiming to be exhaustive, the lexicon (L) – corpus (C) interactions in the following list, where an arrow from L to C means in general the projection/mapping of some lexical data on the corpus, while an arrow from C to L means acquisition of lexical information from corpora:
    L → C  tagging
    C → L  frequencies (of different linguistic “objects”)
    C → L  proper nouns
    L → C  parsing
    C → L  updating
    C → L  “collocational” data (MW, idioms, gram. patterns)
    C → L  “nuances” of meanings & semantic clustering
    C → L  lexical (syntactic/semantic) knowledge acquisition
    L → C  semantic tagging
      ↓
    C → L  more semantic information on the lexical entry
    L → C  semantic disambiguation
    C → L  corpus based computational lexicography
    C → L  validation of lexical models’
    (Calzolari, 1996:7–8)

As we are operating on Kruyt’s first level, we see that, within Calzolari’s framework, we are essentially dealing with ‘acquisition of lexical information from corpora’ (C → L). In future, when corpora for African languages will also be pre-processed linguistically, ‘projection/mapping of some lexical data on the corpus’ (L → C) will also become possible. At that point we will be able to implement Calzolari’s ‘bootstrapping methodology’ (1996:14), which implies that a continuous projection/mapping of C on L and vice versa results in successive analyses of the corpus which increase in richness.

Corpus uses in the broad field of linguistics are virtually unlimited, and are even found outside linguistics (such as in anthropology, history, sociology, etc.; cf. Hurskainen, 1998, § 2). Within linguistics, Calzolari (1996:4–5) mentions NLP (Natural Language Processing) and Speech systems, the evaluation of syntactic theories, a variety of phenomena occurring in ‘real’ texts (such as underestimated/underdiscussed structures, linguistically uninteresting phenomena, and deviations from linguistic models), the construction of stochastic models, the identification and characterisation of sublanguages, language teaching and learning, literary surveys, sociolinguistic considerations, lexicographic compilations, stylistic studies, etc.

Due to space restrictions, we can obviously only discuss a fraction of all potential linguistic applications of corpora for African languages. As such, the present article will touch upon three different linguistic facets: (a) fundamental linguistic research, (b) language teaching and learning, and (c) language software. The first facet (fundamental linguistic research) is exemplified by means of a thorough discussion of Cilubà phonetics on the one hand, and an overview of question particles in Sepedi on the other hand. The second facet (language teaching and learning) is illustrated with compilations of a pronunciation guide for Cilubà and a textbook for Sepedi on the one hand, and the teaching of morpho-syntactic and contrastive structures in Sepedi on the other hand. Finally, for the third facet (language software), the first-generation spellcheckers developed for isiXhosa, isiZulu, Sepedi and Setswana are reviewed.

Corpus applications in the field of fundamental linguistic research, Part 1: phonetics²

Formulation of the basic aim: ‘corpus-based phonetics from below’

Traditionally, phonetic research has been undertaken only in order to proceed to phonology. More recently, phonetic research has been carried out in order to frame the results in a global perspective by cross-comparing occurrence frequencies of different phone categories in various languages. The utilisation of corpora in the field of phonetics, however, opens new and even more exciting doors. We will illustrate this with reference to some phonetic aspects of Cilubà.

If we look at phonetic studies that have been undertaken throughout the world, we see that the great majority of them are based on a ‘translation’ of a story, very often an English one at that, or worse, a ‘translation’ of randomly sampled (English) words. In order to avoid an ethnocentric approach, different so-called ‘every-day non-cultural lists of words’ have been assembled over the years (cf. Swadesh, 1952, 1953:349, 1955). Even though the label for such (English) lists was later changed to ‘basic vocabulary’ (cf. Bastin, Coupez & De Halleux, 1983:174), one will always recognise a ‘foreign bias’ for as long as one does not take the language being studied as one’s point of departure. The minority of scholars who did take the language itself as their point of departure would randomly sample a small selection of recordings and/or texts from that language from which to work. Here, of course, one strongly doubts the representativeness of such random samples.

In order to pursue truly modern phonetics, one should therefore do away with the ethnocentric approach on the one hand, and eliminate the random factor on the other hand. Formulated differently, one needs to arrive at a phonetic description which emanates solely from the language itself (hence a ‘phonetics from below’-approach), and this must be a description with well-founded claims. To comply with both these points of departure at the same time, we argue that one can simply turn to top-frequency counts derived from a corpus of the language under study. The amazing thing is that such a corpus does not even need to be large, while the actual number of words one works on can, moreover, be ridiculously small. The methodology we suggest is therefore a ‘corpus-based phonetics from below’-approach.

Previous ‘traditional’ scholarship in the field of Cilubà phonetics

As far as previous ‘traditional’ scholarship is concerned, only three publications explicitly discuss phonetic aspects of Cilubà. The first attempt is the one by Gabriël, and dates from the 1920s. His phone inventory (originally a running text) has been summarised in Table 1.

As can be seen from Table 1, Gabriël does not use any phonetic symbols. He rather describes the “sounds” and/or uses the roman alphabet, patterned on French and Flemish. In addition, one huge omission in his description concerns the tonal dimension of Cilubà.

The first scholar to stress the crucial role of tone in Cilubà is Burssens, in his Tonologische schets van het Tshiluba (1939). His phone inventory (originally also a running text) has been summarised in Table 2.

Table 1: Cilubà phone inventory according to Gabriël (s.d.⁴ [(1921³)]:7–11)

ingressive sounds
    • monosyllabic affirmation en
    • dental click
egressive sounds
    vowels
        a e i o u
        • they can be short, medium or long
        • they can be nasalised
    consonants
        voiced
            soft: b d j v z
            semivowel: w y
            trill: l
            nasal: n m velar-n
        voiceless
            hard: p f t k s sh tsh
            • p = voiceless bilabial affricate
            • mp = explosive
    combinations
        vowels
            V+V: ai ei au eu io iu
        consonants
            nasal+C
                • mb mp mf mv mm
                • nd ng nk ns nz nj nsh ntsh nt nn nw ny
                • preceding vowels are slightly nasalised
            C+semivowel
                • bw fw kw lw nw pw sw tw vw
                • by dy ky my ny py

Table 2: Cilubà phone inventory according to Burssens (1939:6–12)
[The phonetic symbols in the cells of this table are not recoverable from the extracted text; only its labels and notes survive.]
    VOWELS
        positions: front, central, back
        heights: close, half-close, half-open, open
        • there are no true diphthongs
        • all vowels can be short, long or extra-long
    CONSONANTS
        places: bilabial, labio-dental, dental, alveolar, palato-alveolar, (pre-)palatal, palatal, velar
        manners: explosive, affricate, fricative, nasal, bilateral, semivocal
        • only in the combination mp
        • labial-velar semivocal

In contrast to Gabriël, Burssens does use a phonetic alphabet, albeit the one popular among Africanists. As far as Cilubà is concerned, this implies that the symbol [ ] is used instead of [ ] to represent the voiceless bilabial fricative, and the symbol [ y ] instead of [ j ] for the voiced palatal consonantal resonant.³

The phonetically most explicit description is found in Muyunga (1979:47–64). This explicitness is not so much a result of an improved classification of the Lubà phones, but rather due to the fact that Muyunga is the only one to take quantitative aspects into consideration in his phonetic description. Muyunga summarises (all) the information found in the literature as shown in Table 3.⁴

Compared to Burssens’ inventory, it can be observed that Gabriël’s diphthongs and prenasalised consonants have reappeared. This procedure can only be considered as a move backwards. Phonetically, the diphthongs suggested by Muyunga (and Gabriël) are more accurately analysed as vocalic resonants preceded by the voiced labial-velar consonantal resonant [ w ] or the voiced palatal consonantal resonant [ j ]. While Muyunga includes these diphthongs in his quantitative study, in that same study he fortunately dissociates the prenasalised consonants, for this is without question a more accurate phonetic analysis.

Despite such inaccuracies in his analysis, Muyunga’s prime contribution to the phonetic description of Cilubà is his discussion of the frequencies of occurrence of the Lubà phones. He calculated these frequencies on the basis of 11 texts: four familiar-style letters, four pastoral letters and three radio broadcasts. In this way he obtained ‘a total of 2 333 words containing 10 726 sounds’ (1979:60). Although he is not explicit in this regard, the 2 333 words are not 2 333 different words. Anyhow, by randomly choosing 11 short texts, he hopes to arrive at a representative sample of the Lubà language.

Table 3: Cilubà phone inventory according to Muyunga (1979:48–49, 52)
[The phonetic symbols in the cells of this table are not recoverable from the extracted text; only its labels and remarks survive.]
    Simple Consonants
        places: bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal
        manners: plosive, nasal, lateral, fricative, affricate, semivowel
    Prenasalised Consonants
        places: bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal
        manners: plosive, fricative, affricate
    Simple Vowels and Diphthongs (we are puzzled by Muyunga’s notation of diphthongs)
        positions: front, back
        height labels: close, half-close, close to half-close, open, close to open
    Remarks
        • vowels can be short, environmentally lengthened or inherently long
        • [ ] is always prenasalised

Corpus-based fieldwork and the Cilubà Phonetic Database (CPD)

While Muyunga’s method can be regarded as a ‘phonetics from below’-approach, one still needs to eliminate the random factor. It is precisely here that our suggestion to utilise a corpus comes in. To that end, Recall’s Cilubà Corpus (RCC), a small-size structured corpus of just 300 000 running words (tokens), was queried (cf. De Schryver & Prinsloo, 2000:98–102), and a corpus of that size turned out to contain approximately 35 000 different words (types). Now, the top ONE PERCENT of the types with an even distribution across the different sub-corpora (cf. De Schryver & Prinsloo, forthcoming), or thus just 350 words, not only turned out to provide enough data for a thorough phonetic analysis, but also to complement all existing phonetic descriptions. In other words, although this study only deals with the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.
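The selection step described above (the top one percent of types, evenly distributed across the sub-corpora) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not specify how ‘even distribution’ was operationalised, so the sketch uses the hypothetical criterion that a type must occur in every sub-corpus, and the function name `top_even_types` and the toy data are invented for the example.

```python
from collections import Counter

def top_even_types(subcorpora, fraction=0.01):
    """Return the most frequent types that occur in every sub-corpus."""
    # Count tokens per sub-corpus and over the corpus as a whole.
    per_sub = {name: Counter(tokens) for name, tokens in subcorpora.items()}
    overall = Counter()
    for counts in per_sub.values():
        overall.update(counts)
    # Size of the selection: a fraction of all types, at least one.
    n_keep = max(1, round(len(overall) * fraction))
    # Assumed evenness criterion: the type is attested in every sub-corpus.
    even = [t for t in overall if all(t in c for c in per_sub.values())]
    # Rank by overall frequency; ties are broken alphabetically.
    even.sort(key=lambda t: (-overall[t], t))
    return even[:n_keep]

# Toy data, invented for illustration only (not taken from RCC).
toy = {
    "letters": "a b a c a d".split(),
    "radio": "a b e a b f".split(),
}
selection = top_even_types(toy, fraction=0.5)
print(selection)  # ['a', 'b']
```

With roughly 35 000 types and `fraction=0.01`, the same function would keep at most 350 types, which matches the size of the word list used for the fieldwork.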

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified against the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up: the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone’s phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà
[Three columns of entries numbered 153–167, 180–194 and 207–221; the phonetic transcriptions themselves are not recoverable from the extracted text.]

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD)
[The phonetic symbols in the cells of this table are not recoverable from the extracted text; only its labels survive.]
    CONSONANTS
        places: bilabial, labiodental, alveolar, palato-alveolar, palatal, velar
        manners: oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant
    VOCALIC RESONANTS
        positions: front, central, back
        heights: close, close-mid, open-mid, open
    OTHER SYMBOLS
        voiced labial-velar consonantal resonant
        voiceless palato-alveolar affricate

Comparison between phone frequencies in the literature and those in CPD

On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

Table 6: Cilubà phone frequencies in Muyunga (1979:58, 62–63) compared to those in CPD
[Columns: symbol, Muyunga-%, CPD-%, given separately for consonants and for vocalic resonants. The phone symbols are not recoverable from the extracted text; only the rows identified in the running text and the totals are listed here.]
    [ j ]                       Muyunga  1.36    CPD  2.52
    [ w ]                       Muyunga  1.04    CPD  5.21
    long [ a ]                  Muyunga  0.48    CPD  4.80
    long [ e ]                  Muyunga  0.88    CPD  3.80
    diphthongs, short           Muyunga  2.41
    diphthongs, long            Muyunga  2.59
    Total consonants            Muyunga 52.53    CPD 53.90
    Total vocalic resonants     Muyunga 47.47    CPD 46.11

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs in his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% − 1.04%) + (2.52% − 1.36%)) more [ w ] and [ j ] in CPD.

To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ], and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of long [ a ] and [ e ] come from demonstratives. Yet, for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.
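The diphthong arithmetic in the comparison above can be checked mechanically. The following sketch assumes that the figures quoted for [ w ], [ j ] and the diphthongs are percentages with two decimals (the decimal points are missing in some extracted versions of the text); the variable names are chosen for the example only.

```python
# Percentages as quoted in the running text and in Table 6
# (decimal placement is an assumption about the extracted figures).
muyunga = {"w": 1.04, "j": 1.36, "diphthongs_short": 2.41, "diphthongs_long": 2.59}
cpd = {"w": 5.21, "j": 2.52}

# Share of diphthongs counted by Muyunga.
diphthongs = round(muyunga["diphthongs_short"] + muyunga["diphthongs_long"], 2)
# Additional [ w ] and [ j ] found in CPD relative to Muyunga.
surplus = round((cpd["w"] - muyunga["w"]) + (cpd["j"] - muyunga["j"]), 2)
# Remaining difference between the two quantities, in percentage points.
gap = round(surplus - diphthongs, 2)

print(diphthongs, surplus, gap)  # 5.0 5.33 0.33
```

The remaining gap of about a third of a percentage point is what the text means by ‘roughly correspond’.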

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
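For readers who wish to reproduce this kind of comparison on their own data, a minimal sketch of the calculation follows. It assumes that r is the Pearson product-moment coefficient, which the article does not state explicitly. The two input lists are hypothetical percentage frequencies, not the values of Table 6, so the resulting r is not the 0.98 reported above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical percentage frequencies for five phones in two studies;
# these are NOT the values of Table 6.
study_a = [0.5, 2.0, 4.0, 6.5, 7.5]
study_b = [0.4, 2.4, 3.6, 6.0, 7.8]
r = pearson_r(study_a, study_b)
print(round(r, 2))
```

Only phones counted in the same way in both studies should be passed in, which is why the vocalic resonants, [ w ], [ j ] and the diphthongs are left out of the article's own calculation.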

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.
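Since Kabuta reports a two-way split while CPD distinguishes six tonal categories, one way to put the two sets of figures on the same footing is to restrict CPD to its high and low tones and renormalise. The sketch below does this; the renormalisation is an assumption made for the illustration, not a step described in the article, and the figures are read as percentages.

```python
# Tone shares (%) in CPD as reported in the text.
cpd_tones = {"high": 61.04, "low": 35.28, "falling": 3.30,
             "rising": 0.13, "middle": 0.13, "voiceless": 0.13}
# Kabuta (1998b:57) reports only a two-way split.
kabuta = {"high": 62.0, "low": 38.0}

# Assumption: keep only high and low tones in CPD and rescale them to 100%.
level_total = cpd_tones["high"] + cpd_tones["low"]
cpd_two_way = {t: round(100 * cpd_tones[t] / level_total, 2) for t in ("high", "low")}
# Difference in percentage points between the rescaled CPD shares and Kabuta.
diffs = {t: round(cpd_two_way[t] - kabuta[t], 2) for t in kabuta}

print(cpd_two_way)  # {'high': 63.37, 'low': 36.63}
print(diffs)        # {'high': 1.37, 'low': -1.37}
```

On this reading the two studies differ by less than one and a half percentage points.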

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to both the researchers and the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words representing the most frequent section of a language’s lexicon is sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson, 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence Cilubà here does not follow the general pattern seen in the world’s languages.
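The two comparisons with UPSID can be made explicit with a small sketch. Several things here are assumptions rather than statements from the article: the Cilubà shares are read off the bar labels of Figure 1 with the decimal placement inferred, the mapping of the Cilubà place labels onto UPSID's labels is hypothetical, and comparing the three most frequent places is merely one simple way of expressing ‘broadly follows the general pattern’.

```python
# UPSID shares (%) of places of articulation in stops, as quoted in the text.
upsid = {"labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
         "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
         "velar": 19.63, "uvular": 2.01, "glottal": 3.46}
# Cilubà shares (%) read off Figure 1 (decimal placement assumed).
ciluba = {"bilabial": 38.69, "alveolar": 38.22, "palato-alveolar": 2.55,
          "palatal": 1.60, "velar": 18.95}
# Hypothetical mapping of the Cilubà labels onto UPSID's labels.
to_upsid = {"bilabial": "labial", "alveolar": "dental-alveolar",
            "palato-alveolar": "postalveolar", "palatal": "palatal",
            "velar": "velar"}

def top_places(shares, n=3):
    """Return the n most frequent places, most frequent first."""
    return sorted(shares, key=shares.get, reverse=True)[:n]

# Place of articulation: do the same three places dominate in both?
ciluba_top = {to_upsid[p] for p in top_places(ciluba)}
upsid_top = set(top_places(upsid))
front_share = round(ciluba["bilabial"] + ciluba["alveolar"], 2)

# Voicing: 75% voiced stops in Cilubà versus 52.49% in UPSID.
voicing_gap = round(75.0 - 52.49, 2)

print(ciluba_top == upsid_top, front_share, voicing_gap)  # True 76.91 22.51
```

The same three places head both lists, whereas the share of voiced stops differs by more than 22 percentage points, which is in line with the contrast drawn in the text.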

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditory comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ], and between e and [ ], in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule to the matter, in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51).

[Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà. Horizontal bar chart titled ‘Places of articulation in stops’, axis from 0 to 50%; bar labels: bilabial 38.69, alveolar 38.22, palato-alveolar 2.55, palatal 1.6, velar 18.95.]

His study brings him to the conclusion thatlsquo[t]he degree of openness of these vowels [eand o] is conditioned by the final vowel of thewordrsquo (197949) a phenomenon he calls lsquoa kindof retrogressive vowel harmonyrsquo Unfortunatelyupon scrutinising CPD this suggested harmo-ny cannot be supported This has an importantconsequence Even though Stappersrsquo observa-tion still holds the occurrence of a particularvocalic-resonant value not being predictable ina specific environment one should seriouslyreconsider the many different orthographiesused for Cilubagrave for they are all restricted to justfive lsquovowel symbolsrsquo

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.6 As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

Table 7: Vocalic resonants attested in CPD

[ ] CV1;  [ ] CV3;  [ ] CV8
[ ] CV2;  [ ] CV4, somewhat retracted;  [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered;  ([ ]) (IPA symbol for schwa);  [ ] CV6, somewhat raised


Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà (panel title: Tones for the vocalic resonant [ e ]; quantity: short, long; tonality: low, high, falling; scale 0–50)

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (panel title: Tones for the vocalic resonant [ a ]; quantity: short, long; tonality: low, high, falling; scale 0–50)


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches, a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):
(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’
In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:
(3) Afa go hwile mang
(4) Afa ke mang
(5) Afa o ya kae
(6) Afa ke ngwana ofe yô a llago
From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [ s]legs in die inisiële sinsposisie [op] (3ii) O tšwa ka gae ge o etla fa kagore o šetše o fela pelo afa’ (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:
(7) Mokgalabje wa mereba ge Naa e ka ba kgomolekokoto ye e mo hlotšeng afa E kaba ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens, in principle, suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.
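How such co-occurrences can be pulled out of running text may be illustrated with a minimal keyword-in-context sketch. It assumes a plain-text corpus and a simple letter-based tokenisation; the helper names `tokenize` and `concordance` are hypothetical and do not reflect the software actually used to query PSC. The two text fragments are taken from corpus lines 3 and 22 of Table 8.

```python
import re

def tokenize(text):
    """Split running text into word tokens (Unicode letters, inner hyphens kept)."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)

def concordance(tokens, phrase, width=5):
    """Return (left, hit, right) triples for each case-insensitive match of `phrase`."""
    target = [p.lower() for p in phrase]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            hit = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, hit, right))
    return lines

# Toy "corpus": two fragments quoted from Table 8 (illustration only).
fragments = [
    "Sešane sa basadi Na afa o tloga o sa re tswe Peba",
    "Hei thaka na afa yola morwedi wa Lenkwang",
]

hits = [line
        for fragment in fragments
        for line in concordance(tokenize(fragment), ["na", "afa"], width=3)]

for left, hit, right in hits:
    print(f"{left:>20}  {hit}  {right}")
```

Searching for the reverse order, `["afa", "na"]`, with the same function is how the absence of that sequence could be checked on a full corpus.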

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English, consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect, Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [ a]n approach which immediately suggested itself was to identify the words, and uses of words, which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11, a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
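Retrieving such runs of identical word-forms is a mechanical task once the text is tokenised. The sketch below is a hypothetical illustration (the function name `repeated_runs` and the tokenisation are assumptions, not the query tool used for PSC); it reports where a given form occurs at least three times in a row. The two test strings are taken from corpus line 29 of Table 10 and corpus line 259 of Table 11.

```python
import re

def repeated_runs(text, word, min_len=3):
    """Return (start_index, run_length) for each run of `word` of at least `min_len` tokens."""
    tokens = re.findall(r"[^\W\d_]+", text)
    runs = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() == word:
            j = i
            # Extend the run for as long as the same word-form repeats.
            while j < len(tokens) and tokens[j].lower() == word:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs

le_line = "Letsogo le le le bonago le golofetše"
ba_line = "ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase"

print(repeated_runs(le_line, "le"))
print(repeated_runs(ba_line, "ba"))
```

The search only locates the sequences; assigning each le or ba its grammatical function, as in the legends of Tables 10 and 11, remains the analytical task for teacher and learner.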

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S
M bona ‘see’
M Re bona tau ‘We see a lion’ S
M bona ‘they/them’
M Re thuša bona ‘We help them’ S
M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S
M tla ‘come, shall/will’
M Tla mo ‘Come here’ S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots, and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
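The second approach, comparison of typed words with a stored list of top-frequency word-forms, can be sketched in a few lines. The code below is an assumption-laden illustration and not the WordPerfect implementation: the functions `build_wordlist` and `unrecognised` are hypothetical, the toy corpus is built from the Sepedi sentences of Table 9, and tao is an invented misspelling of tau.

```python
import re
from collections import Counter

def build_wordlist(corpus_text, top_n):
    """Store the `top_n` most frequent word-forms of the corpus as a set."""
    tokens = re.findall(r"[^\W\d_]+", corpus_text.lower())
    return {word for word, _ in Counter(tokens).most_common(top_n)}

def unrecognised(text, wordlist):
    """Return the typed word-forms that are absent from the stored list."""
    typed = re.findall(r"[^\W\d_]+", text)
    return [word for word in typed if word.lower() not in wordlist]

# Toy corpus (illustration only; a real list would hold thousands of forms).
toy_corpus = "Re bona tau. Re thuša bona. Tla mo. Re bona tau."
wordlist = build_wordlist(toy_corpus, top_n=10)

print(unrecognised("Re bona tao", wordlist))
print(unrecognised("Re thuša bona", wordlist))
```

Because whole orthographic words are looked up, a conjunctively written form that combines several morphemes must itself be in the list to be recognised, which is consistent with the lower effectiveness reported for isiXhosa and isiZulu.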

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.
(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)
Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).
(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)
Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
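The two success rates follow directly from the counts given above, taking the success rate as the share of word-forms that were recognised. The small check below reproduces the arithmetic; the function name `success_rate` is hypothetical.

```python
def success_rate(total_forms, unrecognised_forms):
    """Percentage of word-forms in a test paragraph that the spellchecker recognised."""
    return 100 * (total_forms - unrecognised_forms) / total_forms

# Counts as reported for paragraphs (8) and (9).
isizulu = success_rate(total_forms=41, unrecognised_forms=12)
sepedi = success_rate(total_forms=46, unrecognised_forms=2)

print(f"isiZulu: {isizulu:.1f}%")
print(f"Sepedi:  {sepedi:.1f}%")
```

Rounded to whole percentages, these values correspond to the 71% and 96% quoted in the text.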

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:
• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.
• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.
• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.
• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme except [ ] and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’; De Clercq & Willems 1960:37), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.

Southern African Linguistics and Applied Language Studies 2001, 19: 111–131

References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.



mutual interactions between lexicon and corpus:

‘We can summarise, without claiming to be exhaustive, the lexicon (L) – corpus (C) interactions in the following list, where an arrow from L to C means in general the projection/mapping of some lexical data on the corpus, while an arrow from C to L means acquisition of lexical information from corpora:

L → C  tagging
C → L  frequencies (of different linguistic “objects”)
C → L  proper nouns
L → C  parsing
C → L  updating
C → L  “collocational” data (MW, idioms, gram. patterns)
C → L  “nuances” of meanings & semantic clustering
C → L  lexical (syntactic/semantic) knowledge acquisition
L → C  semantic tagging
  ↓
C → L  more semantic information on the lexical entry
L → C  semantic disambiguation
C → L  corpus based computational lexicography
C → L  validation of lexical models’
(Calzolari 1996:7–8)

As we are operating on Kruyt’s first level, we see that, within Calzolari’s framework, we are essentially dealing with ‘acquisition of lexical information from corpora’ (C → L). In future, when corpora for African languages are also pre-processed linguistically, ‘projection/mapping of some lexical data on the corpus’ (L → C) will also become possible. At that point we will be able to implement Calzolari’s ‘bootstrapping methodology’ (1996:14), which implies that a continuous projection/mapping of C on L, and vice versa, results in successive analyses of the corpus which increase in richness.

The uses of corpora in the broad field of linguistics are virtually unlimited, and are even found outside linguistics (such as in anthropology, history, sociology, etc.; cf. Hurskainen 1998: §2). Within linguistics, Calzolari (1996:4–5) mentions NLP (Natural Language Processing) and Speech systems; the evaluation of syntactic theories; a variety of phenomena occurring in ‘real’ texts (such as underestimated/underdiscussed structures, linguistically uninteresting phenomena, and deviations from linguistic models); the construction of stochastic models; the identification and characterisation of sublanguages; language teaching and learning; literary surveys; sociolinguistic considerations; lexicographic compilations; stylistic studies; etc.

Due to space restrictions, we can obviously only discuss a fraction of all potential linguistic applications of corpora for African languages. As such, the present article will touch upon three different linguistic facets: (a) fundamental linguistic research, (b) language teaching and learning, and (c) language software. The first facet (fundamental linguistic research) is exemplified by means of a thorough discussion of Cilubà phonetics on the one hand, and an overview of question particles in Sepedi on the other hand. The second facet (language teaching and learning) is illustrated with compilations of a pronunciation guide for Cilubà and a textbook for Sepedi on the one hand, and the teaching of morpho-syntactic and contrastive structures in Sepedi on the other hand. Finally, for the third facet (language software), the first-generation spellcheckers developed for isiXhosa, isiZulu, Sepedi and Setswana are reviewed.

Corpus applications in the field of fundamental linguistic research, Part 1: phonetics²

Formulation of the basic aim: ‘corpus-based phonetics from below’

Traditionally, phonetic research has been undertaken only in order to proceed to phonology. More recently, phonetic research has been carried out in order to frame the results in a global perspective, by cross-comparing occurrence frequencies of different phone categories in various languages. The utilisation of corpora in the field of phonetics, however, opens new and even more exciting doors. We will illustrate this with reference to some phonetic aspects of Cilubà.

If we look at phonetic studies that have been undertaken throughout the world, we see that the great majority of them are based on a ‘translation’ of a story, very often an English one at that, or worse, a ‘translation’ of randomly sampled (English) words. In order to avoid an ethnocentric approach, different so-called ‘every-day non-cultural lists of words’ have been assembled over the years (cf. Swadesh 1952, 1953:349, 1955). Even though the label for such (English) lists was later changed to ‘basic vocabulary’ (cf. Bastin, Coupez & De Halleux 1983:174), one will always recognise a ‘foreign bias’ for as long as one does not take the language being studied as one’s point of departure. The minority of scholars who did take the language itself as their point of departure would randomly sample a small selection of recordings and/or texts from that language from which to work. Here, of course, one strongly doubts the representativeness of such random samples.

In order to pursue truly modern phonetics, one should therefore do away with the ethnocentric approach on the one hand, and eliminate the random factor on the other hand. Formulated differently, one needs to arrive at a phonetic description which emanates solely from the language itself (hence a ‘phonetics from below’-approach), and this must be a description with well-founded claims. To comply with both these points of departure at the same time, we argue that one can simply turn to top-frequency counts derived from a corpus of the language under study. The amazing thing is that such a corpus does not even need to be large, while the number of words one actually works on can moreover be surprisingly small. The methodology we suggest is therefore a ‘corpus-based phonetics from below’-approach.

Previous ‘traditional’ scholarship in the field of Cilubà phonetics

As far as previous ‘traditional’ scholarship is concerned, only three publications explicitly discuss phonetic aspects of Cilubà. The first attempt is the one by Gabriël and dates from the 1920s. His phone inventory (originally a running text) has been summarised in Table 1.

As can be seen from Table 1, Gabriël does not use any phonetic symbols. He rather describes the “sounds” and/or uses the roman alphabet patterned on French and Flemish. In addition, one huge omission in his description concerns the tonal dimension of Cilubà.

The first scholar to stress the crucial role of tone in Cilubà is Burssens, in his Tonologische schets van het Tshiluba (1939). His phone inventory (originally also a running text) has been summarised in Table 2.

Table 1: Cilubà phone inventory according to Gabriël (s.d.⁴ [(1921³)]:7–11)

Ingressive sounds
  • monosyllabic affirmation en
  • dental click

Egressive sounds
  Vowels: a e i o u
    • they can be short, medium or long
    • they can be nasalised
  Consonants
    Voiced
      soft: b d j v z
      semivowel: w y
      trill: l
      nasal: n m velar-n
    Voiceless
      hard: p f t k s sh tsh
      • p = voiceless bilabial affricate
      • mp = explosive

Combinations
  Vowels (V+V): ai ei au eu io iu
  Consonants
    nasal+C
      • mb mp mf mv mm
      • nd ng nk ns nz nj nsh ntsh nt nn nw ny
      • preceding vowels are slightly nasalised
    C+semivowel
      • bw fw kw lw nw pw sw tw vw
      • by dy ky my ny py


In contrast to Gabriël, Burssens does use a phonetic alphabet, albeit the one popular among Africanists. As far as Cilubà is concerned, this implies that the symbol [ ] is used instead of [ ɸ ] to represent the voiceless bilabial fricative, and the symbol [ y ] instead of [ j ] for the voiced palatal consonantal resonant.³

The phonetically most explicit description is found in Muyunga (1979:47–64). This explicitness is not so much a result of an improved classification of the Lubà phones, but rather due to the fact that Muyunga is the only one to take quantitative aspects into consideration in his phonetic description. Muyunga summarises (all) the information found in the literature as shown in Table 3.⁴

Compared to Burssens’ inventory, it can be observed that Gabriël’s diphthongs and prenasalised consonants have reappeared. This procedure can only be considered as a move backwards. Phonetically, the diphthongs suggested by Muyunga (and Gabriël) are more accurately analysed as vocalic resonants preceded by the voiced labial-velar consonantal resonant [ w ] or the voiced palatal consonantal resonant [ j ]. While Muyunga includes these diphthongs in his quantitative study, in that same study he fortunately dissociates the prenasalised consonants, for this is without question a more accurate phonetic analysis.

Table 2: Cilubà phone inventory according to Burssens (1939:6–12)
[Vowels: arranged by front, central, back and by close, half-close, half-open, open; there are no true diphthongs; all vowels can be short, long or extra-long. Consonants: arranged by place (bilabial, labio-dental, dental, alveolar, palato-alveolar, (pre-)palatal, palatal, velar) and manner (explosive, affricate, fricative, nasal, bilateral, semivocal), with the remarks ‘only in the combination mp’ and ‘labial-velar semivocal’.]

Despite such inaccuracies in his analysis, Muyunga’s prime contribution to the phonetic description of Cilubà is his discussion of the frequencies of occurrence of the Lubà phones. He calculated these frequencies on the basis of 11 texts: four familiar-style letters, four pastoral letters and three radio broadcasts. In this way he obtained ‘a total of 2 333 words containing 10 726 sounds’ (1979:60). Although he is not explicit in this regard, the 2 333 words are running words, not 2 333 different words. Anyhow, by randomly choosing 11 short texts, he hopes to arrive at a representative sample of the Lubà language.

Table 3: Cilubà phone inventory according to Muyunga (1979:48–49, 52)
[Simple consonants: arranged by place (bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal) and manner (plosive, nasal, lateral, fricative, affricate, semivowel). Prenasalised consonants: arranged by the same places and by plosive, fricative, affricate. Simple vowels: front and back; close, half-close, open. Diphthongs: front and back; close, close to half-close, close to open (we are puzzled by Muyunga’s notation of diphthongs). Remarks: vowels can be short, environmentally lengthened or inherently long; [ ] is always prenasalised.]

Corpus-based fieldwork and the Cilubà Phonetic Database (CPD)

While Muyunga’s method can be regarded as a ‘phonetics from below’-approach, one still needs to eliminate the random factor. It is precisely here that our suggestion to utilise a corpus comes in. To that end, Recall’s Cilubà Corpus (RCC), a small-size structured corpus of just 300 000 running words (tokens), was queried (cf. De Schryver & Prinsloo 2000:98–102), and a corpus of that size turned out to contain approximately 35 000 different words (types). Now, the top ONE PERCENT of the types with an even distribution across the different sub-corpora (cf. De Schryver & Prinsloo forthcoming), or thus just 350 words, not only turned out to provide enough data for a thorough phonetic analysis, but also to complement all existing phonetic descriptions. In other words, although this study only deals with the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.
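The selection step described above can be sketched in a few lines. This is a hedged illustration only: the function name `top_types`, the `min_share` parameter and the toy sub-corpora are hypothetical, and ‘even distribution’ is approximated here by requiring a type to occur in a given share of the sub-corpora, which is not necessarily the criterion applied to RCC.

```python
from collections import Counter

def top_types(subcorpora, fraction=0.01, min_share=1.0):
    """Return the most frequent types that are spread across sub-corpora.

    subcorpora: dict mapping a sub-corpus name to its list of tokens.
    fraction:   share of all types to keep (0.01 = top one percent).
    min_share:  share of sub-corpora a type must occur in (assumed
                stand-in for "even distribution").
    """
    total = Counter()
    for tokens in subcorpora.values():
        total.update(tokens)

    type_sets = [set(tokens) for tokens in subcorpora.values()]
    needed = min_share * len(type_sets)
    # Number of types to keep, e.g. 1% of 35 000 types gives 350.
    cutoff = max(1, round(fraction * len(total)))

    # Rank by overall frequency, keeping only well-distributed types.
    ranked = [t for t, _ in total.most_common()
              if sum(t in s for s in type_sets) >= needed]
    return ranked[:cutoff]

# Hypothetical toy sub-corpora, not RCC data.
toy = {"letters": ["x", "y", "x", "z"], "radio": ["x", "y", "w"]}
print(top_types(toy, fraction=0.5))
```

With a 35 000-type corpus and `fraction=0.01`, the cutoff works out to the 350 words mentioned in the text.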

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words, he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified against the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up: the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone’s phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà
[Three column pairs of rank and phonetic transcription, covering ranks 153–167, 180–194 and 207–221.]

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Comparison between phone frequencies in the literature and those in CPD

On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD)
[Consonants: arranged by place (bilabial, labiodental, alveolar, palato-alveolar, palatal, velar) and manner (oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant). Vocalic resonants: arranged by front, central, back and by close, close-mid, open-mid, open. Other symbols: voiced labial-velar consonantal resonant; voiceless palato-alveolar affricate. Phones in brackets are attested solely in CPD.]

Table 6: Cilubà phone frequencies in Muyunga (1979:58, 62–63) compared to those in CPD
[Columns: symbol, Muyunga-%, CPD-%, given separately for consonants and for vocalic resonants. Rows that can be identified from the discussion:]
  [ j ]: Muyunga 1.36, CPD 2.52
  [ w ]: Muyunga 1.04, CPD 5.21
  long [ a ]: Muyunga 0.48, CPD 4.80
  long [ e ]: Muyunga 0.88, CPD 3.80
  Diphthongs (Muyunga only): short 2.41, long 2.59
  Total consonants: Muyunga 52.53, CPD 53.90
  Total vocalic resonants: Muyunga 47.47, CPD 46.11

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs in his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% − 1.04%) + (2.52% − 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ] and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, the two studies agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of long [ a ] and [ e ] come from demonstratives. Yet for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants, and for [ w ], [ j ] and diphthongs, cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
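For readers who wish to repeat such a comparison, the correlation coefficient can be computed as sketched below. The sketch assumes that r is the ordinary Pearson product-moment coefficient over paired percentage frequencies, which the text does not state explicitly; the function name `pearson_r` and the two short lists are purely illustrative placeholders, not the Table 6 values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Placeholder percentages for a handful of phones in two studies
# (illustrative only; substitute the paired consonant frequencies).
study_a = [1.0, 2.0, 4.0, 8.0]
study_b = [1.5, 2.5, 3.5, 8.5]
print(round(pearson_r(study_a, study_b), 2))
```

Only phones counted on the same basis in both studies should be paired, which is why the vocalic resonants, [ w ], [ j ] and the diphthongs are left out of the calculation in the text.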

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as the number of words is concerned, the largest study was undertaken by Kabuta, who transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] came as a real surprise to the researchers as well as to the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words representing the most frequent section of a language’s lexicon is sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence Cilubà here does not follow the general pattern seen in the world’s languages.
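The two comparisons with UPSID can be laid side by side with a few lines of arithmetic. The sketch below takes the UPSID figures quoted in the text as percentages and the Cilubà voicing split as 75% versus 25%; the variable names and the helper `point_gaps` are hypothetical and only serve to make the size of the differences explicit.

```python
# UPSID place-of-articulation shares for stops, in percent, as quoted in the text.
upsid_places = {
    "labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
    "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
    "velar": 19.63, "uvular": 2.01, "glottal": 3.46,
}

# Voicing of stops, in percent: UPSID versus Cilubà (three quarters voiced).
upsid_voicing = {"voiced": 52.49, "voiceless": 47.51}
ciluba_voicing = {"voiced": 75.0, "voiceless": 25.0}

def point_gaps(a, b):
    """Difference a - b in percentage points for every category of a."""
    return {key: round(a[key] - b[key], 2) for key in a}

# Sanity check: the quoted place shares should add up to about 100%.
print(round(sum(upsid_places.values()), 2))

# How far the Cilubà voicing split departs from the UPSID average.
print(point_gaps(ciluba_voicing, upsid_voicing))
```

A gap of more than twenty percentage points in voicing is what underlies the statement that Cilubà does not follow the general pattern here.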

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of every phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditory comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD (cf. Table 7).

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and ɔ, and e and ɛ, in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ɛ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ɔ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51).

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà
[Bar chart, places of articulation in stops (%): bilabial 38.69, alveolar 38.22, palato-alveolar 2.55, palatal 1.6, velar 18.95.]


His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value is not predictable in a specific environment, so one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from belowapproachWe must note that a method based on top-fre-quencies of occurrence will not mdash by definitionmdash show the rather rare phenomena of a lan-guage In this respect Gabrieumll is the onlyauthor to mention the presence of two ingres-sive phones namely the lsquomonosyllabic affirma-tion enrsquo and the lsquodental clickrsquo (cf Table 1)These facets are certainly crucial if one pur-sues an exhaustive phonetic description ofCilubagrave

Rather accidentally what Gabrieumll calls the

lsquomonosyllabic affirmation enrsquo was recordedduring the sessions with the informant Indeedin one of the utterances to illustrate [ si ] (theparticle used to confirm a statement) the infor-mant starts off with a phone we could tenta-tively pinpoint as [ K ] a breathy voiced glottalfricative pronounced on an indrawn breath Asit stands there the lsquoconfirmation particlersquo [ si ] ispreceded by the lsquoaffirmation particlersquo [ K ] It ishowever not simply a pleonasm to strengthenthe ensuing statement even more Rather [ K ]seems to be a paralinguistic use of the pul-monic ingressive airstream mechanism in orderto express sympathy

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The ‘corpus-based phonetics from below’-approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.
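The starting point of such an approach, a top-frequency word list, can be sketched in a few lines. The following is a minimal illustration only, not the procedure used for CPD: the function names and the toy token list are hypothetical, and a real corpus would first need tokenisation suited to the orthography of the language.

```python
from collections import Counter


def top_frequency_words(tokens, n):
    """Return the n most frequent word-forms together with their counts."""
    return Counter(tokens).most_common(n)


def coverage(tokens, n):
    """Share of all running words accounted for by the n most frequent word-forms."""
    if not tokens:
        return 0.0
    covered = sum(count for _, count in top_frequency_words(tokens, n))
    return covered / len(tokens)


# Toy token list (hypothetical; not taken from any Cilubà corpus).
toy_tokens = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(top_frequency_words(toy_tokens, 2))
print(coverage(toy_tokens, 2))
```

The coverage figure makes the ‘maximum number of claims from a minimum number of words’ idea concrete: it shows how much of the running text a short top-frequency list already accounts for.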

[ ] CV1; [ ] CV3; [ ] CV8
[ ] CV2; [ ] CV4, somewhat retracted; [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered; ([ ]) (IPA symbol for schwa); [ ] CV6, somewhat raised

Table 7: Vocalic resonants attested in CPD


Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà (panel title: ‘Tones for the vocalic resonant [ e ]’; axes: quantity (short, long), tone (low, high, falling) and frequency, scale 0–50; plotted values: 0, 7.41, 0, 25.93, 44.44, 22.22)

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (panel title: ‘Tones for the vocalic resonant [ a ]’; axes: quantity (short, long), tone (low, high, falling) and frequency, scale 0–50; plotted values: 26.39, 45.14, 0, 9.03, 18.06, 1.39)


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):
(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’
In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:
(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?
From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op] (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa?’ [‘Certain question particles [...] occur only in the sentence-initial position’] (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:
(7) Mokgalabje wa mereba ge! Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man! Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence.’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.
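How concordance lines for a sequence such as na afa can be culled from running text may be sketched as follows. This is a simplified keyword-in-context routine written for illustration; it is not the query software used on PSC, and the function name, the whitespace tokenisation and the context width are assumptions. The sample tokens are a fragment of corpus line 4 in Table 8.

```python
def kwic(tokens, sequence, width=5):
    """Return (left context, node, right context) for each occurrence of `sequence`.

    Matching ignores case, so 'Na afa' and 'na afa' are both found.
    """
    target = [w.lower() for w in sequence]
    if not target:
        return []
    lowered = [t.lower() for t in tokens]
    k = len(target)
    hits = []
    for i in range(len(tokens) - k + 1):
        if lowered[i:i + k] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + k])
            right = " ".join(tokens[i + k:i + k + width])
            hits.append((left, node, right))
    return hits


# Fragment of a PSC concordance line, split on whitespace.
sample = "ke a go botša Na afa o kwele gore mmotong wa Lekokoto".split()
for left, node, right in kwic(sample, ["na", "afa"], width=3):
    print(left, "|", node, "|", right)
```

Running the same query with the particles reversed, `kwic(sample, ["afa", "na"])`, returns an empty list for this fragment, which is the kind of check that underlies the observation about the order of the two particles.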

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect, Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus.’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account.’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook, used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi, in context.

Here, the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
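The selection of a core vocabulary together with authentic example sentences can be sketched as below. This is an illustrative outline under simplifying assumptions, not the procedure behind the Laboratory Textbook: the function name is hypothetical, sentences are split on whitespace, and the shortest sentence containing a word is (arbitrarily) taken as its example. The four sentences are those given in Table 9.

```python
from collections import Counter


def core_vocabulary(sentences, n):
    """Return (word-form, frequency, example sentence) for the n most frequent word-forms."""
    tokenised = [s.split() for s in sentences]
    counts = Counter(w.lower() for words in tokenised for w in words)
    result = []
    for word, freq in counts.most_common(n):
        # Every sentence in which the word-form occurs is a candidate example.
        examples = [s for s, words in zip(sentences, tokenised)
                    if word in (w.lower() for w in words)]
        # Assumption: prefer the shortest candidate for a first lesson.
        result.append((word, freq, min(examples, key=len)))
    return result


sentences = ["Re bona tau", "Re thuša bona", "Tla mo", "Ke nyaka gore o nthuše"]
for word, freq, example in core_vocabulary(sentences, 2):
    print(word, freq, example)
```

On a real corpus one would raise `n` to the desired size of the core vocabulary (say 1 000) and screen the retrieved sentences by hand before using them in teaching material.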

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures, and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures, and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists.’ (Kruyt 1995:126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s, or up to four ba’s, are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
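Retrieving runs of identical word-forms, such as three or more le’s in a row, is a simple corpus query. The sketch below is one possible way of doing so; the function name and the whitespace tokenisation are assumptions made for illustration, and the sample tokens are taken from corpus line 8 in Table 10.

```python
def find_runs(tokens, word, min_len=2):
    """Return (start index, run length) for each run of `word` of at least min_len tokens.

    Matching ignores case, so a run such as 'ba ba Ba' is found as well.
    """
    runs = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() == word:
            j = i
            # Extend the run for as long as the same word-form is repeated.
            while j < len(tokens) and tokens[j].lower() == word:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


tokens = "le lebe le ge e le le le botse Rebeka".split()
print(find_runs(tokens, "le", min_len=3))
```

The routine only locates the runs; assigning a grammatical function to each le or ba within a run remains the analytical task for teacher and learner.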

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information.’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S
M bona ‘see’
M Re bona tau ‘We see a lion’ S
M bona ‘they/them’
M Re thuša bona ‘We help them’ S
M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S
M tla ‘come; shall/will’
M Tla mo ‘Come here’ S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa [le le le] bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e [le le le] botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa [le le le le] le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto [le le le] swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu [le le le] golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo [le le le] bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana [le le le] hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi [ba ba ba] badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile [ba ba ba] tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la [ba ba ba] amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi [ba ba ba ba] bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se [ba ba ba] hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma [ba ba ba] mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo [ba ba Ba] bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
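The relative frequency of the two locative strategies for a given noun can be estimated with a query along the following lines. This is a rough sketch only: the function name is hypothetical, the suffixed form is supplied by hand rather than derived by rule, and the token list is an artificial sequence rather than authentic Sepedi text. Since go has several other functions in Sepedi, every ‘go + noun’ hit would still need to be inspected in context.

```python
def count_locative_strategies(tokens, noun, suffixed_form):
    """Count 'go <noun>' bigrams versus occurrences of the suffixed locative form."""
    lowered = [t.lower() for t in tokens]
    prefixed = sum(1 for first, second in zip(lowered, lowered[1:])
                   if first == "go" and second == noun)
    suffixed = lowered.count(suffixed_form)
    return {"go " + noun: prefixed, suffixed_form: suffixed}


# Artificial token sequence for demonstration only (X and Y stand for other words).
toy_tokens = "go monna X go monna Y monneng".split()
print(count_locative_strategies(toy_tokens, "monna", "monneng"))
```

The rarer strategy can then be retrieved exhaustively and studied in context, as is done for monneng and mosading in Tables 12 and 13.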

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots, and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
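The word-list approach just described can be sketched as follows. This is a minimal illustration and in no way the WordPerfect 9 implementation: the function names are hypothetical, tokens are simply taken to be runs of letters and digits, and the toy corpus is a handful of Sepedi word-forms strung together for demonstration.

```python
import re
from collections import Counter


def build_word_list(corpus_text, size):
    """Return the `size` most frequent word-forms in the corpus as a set."""
    words = re.findall(r"\w+", corpus_text.lower())
    return {word for word, _ in Counter(words).most_common(size)}


def unrecognised(text, word_list):
    """Return the word-forms in `text` that are absent from the stored list."""
    return [word for word in re.findall(r"\w+", text)
            if word.lower() not in word_list]


# Toy corpus for demonstration only.
toy_corpus = "ke a go bona ke a tla ke bona"
stored = build_word_list(toy_corpus, 3)
print(unrecognised("Ke a tla", stored))
```

Because the check is a plain membership test, a longer word-form in a conjunctively written language is less likely to be on the list than a short word-form in a disjunctively written one, which is the effect noted above.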

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.
(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)
Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), so that 29 were recognised, this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).
(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)
Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
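The two success rates follow directly from the counts given for (8) and (9): the share of recognised word-forms, (total minus unrecognised) divided by total, expressed as a rounded percentage. As a quick check (the function name is ours, for illustration only):

```python
def success_rate(total_word_forms, unrecognised_count):
    """Percentage of word-forms recognised, rounded to the nearest whole number."""
    recognised = total_word_forms - unrecognised_count
    return round(100 * recognised / total_word_forms)


print(success_rate(41, 12))  # isiZulu paragraph in (8): 29 of 41 recognised -> 71
print(success_rate(46, 2))   # Sepedi paragraph in (9): 44 of 46 recognised -> 96
```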

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:
• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study, hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.
• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.
• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.
• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

¹ This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

² A different approach to the research presented in this section can be found in De Schryver (1999).

³ Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

⁴ Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme, except [ɸ] and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37)), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

⁵ High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

⁶ Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References
(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu. Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages. North Sotho English Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


ethnocentric approach, different so-called ‘every-day non-cultural lists of words’ have been assembled over the years (cf. Swadesh 1952, 1953:349, 1955). Even though the label for such (English) lists was later changed to ‘basic vocabulary’ (cf. Bastin, Coupez & De Halleux 1983:174), one will always recognise a ‘foreign bias’ for as long as one does not take the language being studied as one’s point of departure. The minority of scholars who did take the language itself as their point of departure would randomly sample a small selection of recordings and/or texts from that language, from which to work. Here, of course, one strongly doubts the representativeness of such random samples.

In order to pursue truly modern phonetics, one should therefore do away with the ethnocentric approach on the one hand, and eliminate the random factor on the other hand. Formulated differently, one needs to arrive at a phonetic description which emanates solely from the language itself (hence a ‘phonetics from below’-approach), and this must be a description with well-founded claims. To comply with both these points of departure at the same time, we argue that one can simply turn to top-frequency counts derived from a corpus of the language under study. The amazing thing is that such a corpus does not even need to be large, while the number of words one actually works on can moreover be ridiculously small. The methodology we suggest is therefore a ‘corpus-based phonetics from below’-approach.

Previous ‘traditional’ scholarship in the field of Cilubà phonetics
As far as previous ‘traditional’ scholarship is concerned, only three publications explicitly discuss phonetic aspects of Cilubà. The first attempt is the one by Gabriël and dates from the 1920s. His phone inventory (originally a running text) has been summarised in Table 1.

As can be seen from Table 1, Gabriël does not use any phonetic symbols. He rather describes the “sounds” and/or uses the roman alphabet patterned on French and Flemish. In addition, one huge omission in his description concerns the tonal dimension of Cilubà.

The first scholar to stress the crucial role of tone in Cilubà is Burssens, in his Tonologische schets van het Tshiluba (1939). His phone inventory (originally also a running text) has been summarised in Table 2.

Ingressive sounds
  • monosyllabic affirmation en
  • dental click
Egressive sounds
  Vowels: a e i o u
    • they can be short, medium or long
    • they can be nasalised
  Consonants
    voiced: soft b d j v z; semivowel w y; trill l
    nasal: n m velar-n
    voiceless: hard p f t k s sh tsh
      • p = voiceless bilabial affricate
      • mp = explosive
  Combinations
    vowels (V+V): ai ei au eu io iu
    consonants (nasal+C):
      • mb mp mf mv mm
      • nd ng nk ns nz nj nsh ntsh nt nn nw ny
      • preceding vowels are slightly nasalised
    consonants (C+semivowel):
      • bw fw kw lw nw pw sw tw vw
      • by dy ky my ny py

Table 1: Cilubà phone inventory according to Gabriël (s.d.⁴ [(1921³)]:7–11)


In contrast to Gabriël, Burssens does use a phonetic alphabet, albeit the one popular among Africanists. As far as Cilubà is concerned, this implies that the symbol [ ] is used instead of [ ɸ ] to represent the voiceless bilabial fricative, and the symbol [ y ] instead of [ j ] for the voiced palatal consonantal resonant.³

The phonetically most explicit description is found in Muyunga (1979:47–64). This explicitness is not so much a result of an improved classification of the Lubà phones, but rather due to the fact that Muyunga is the only one to take quantitative aspects into consideration in his phonetic description. Muyunga summarises (all) the information found in the literature as shown in Table 3.⁴

Compared to Burssens’ inventory, it can be observed that Gabriël’s diphthongs and prenasalised consonants have reappeared. This procedure can only be considered as a move backwards. Phonetically, the diphthongs suggested by Muyunga (and Gabriël) are more accurately analysed as vocalic resonants preceded by the voiced labial-velar consonantal resonant [ w ] or the voiced palatal consonantal resonant [ j ]. While Muyunga includes these diphthongs in his quantitative study, in that same study he fortunately dissociates the prenasalised consonants, for this is without question a more accurate phonetic analysis.

Despite such inaccuracies in his analysis, Muyunga’s prime contribution to the phonetic description of Cilubà is his discussion of the frequencies of occurrence of the Lubà phones. He calculated these frequencies on the basis of 11 texts: four familiar-style letters, four pastoral letters and three radio broadcasts. In this way he obtained ‘a total of 2 333 words containing 10 726 sounds’ (1979:60). Although he

VOWELS (phone symbols not reproduced)
  columns: front, central, back
  rows: close, half-close, half-open, open
  • there are no true diphthongs
  • all vowels can be short, long or extra-long
CONSONANTS (phone symbols not reproduced)
  columns: bilabial, labio-dental, dental, alveolar, palato-alveolar, (pre-)palatal, palatal, velar
  rows: explosive, affricate, fricative, nasal, bilateral, semivocal
  • only in the combination mp
  • labial-velar semivocal

Table 2: Cilubà phone inventory according to Burssens (1939:6–12)


is not explicit in this regard, the 2 333 words are not 2 333 different words. Anyhow, by randomly choosing 11 short texts he hopes to arrive at a representative sample of the Lubà language.

Corpus-based fieldwork and the Cilubà Phonetic Database (CPD)
While Muyunga’s method can be regarded as a ‘phonetics from below’-approach, one still needs to eliminate the random factor. It is precisely here that our suggestion to utilise a corpus comes in. To that end Recall’s Cilubà Corpus (RCC), a small-size structured corpus of just 300 000 running words (tokens), was queried (cf. De Schryver & Prinsloo 2000:98–102), and a corpus of that size turned out to contain approximately 35 000 different words (types). Now, the top ONE PERCENT of the types with an even distribution across the different sub-corpora (cf. De Schryver & Prinsloo forthcoming), or thus just 350 words, not only turned out to provide enough data for a thorough phonetic analysis, but also to complement all existing phonetic descriptions. In other words, although this study only deals with

Simple Consonants (phone symbols not reproduced)
  columns: bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal
  rows: plosive, nasal, lateral, fricative, affricate, semivowel
Prenasalised Consonants (phone symbols not reproduced)
  columns: bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal
  rows: plosive, fricative, affricate
Simple Vowels (phone symbols not reproduced)
  columns: front, back
  rows: close, half-close, open
Diphthongs (We are puzzled by Muyunga’s notation of diphthongs) (phone symbols not reproduced)
  columns: front, back
  rows: close, close to half-close, close to open
Remarks
  • vowels can be short, environmentally lengthened or inherently long
  • [ ] is always prenasalised

Table 3: Cilubà phone inventory according to Muyunga (1979:48–49, 52)


the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.
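The selection step described above (the top one percent of the types, restricted to those spread evenly over the sub-corpora) can be sketched as follows. This is only a minimal illustration, not the procedure actually used for RCC: the function name `top_types`, the integer `percent` parameter and the `min_share` threshold standing in for ‘even distribution’ are all assumptions, and the two tiny sub-corpora are invented toy data.

```python
from collections import Counter


def top_types(subcorpora, percent=1, min_share=1.0):
    """Return the most frequent types that are spread over enough sub-corpora.

    subcorpora: dict mapping a sub-corpus name to a list of tokens.
    percent:    share of all types to keep (1 = top one percent).
    min_share:  hypothetical stand-in for 'even distribution': the minimum
                share of sub-corpora in which a type must occur.
    """
    total = Counter()   # overall frequency of each type
    spread = Counter()  # number of sub-corpora each type occurs in
    for tokens in subcorpora.values():
        counts = Counter(tokens)
        total.update(counts)          # add the per-type frequencies
        spread.update(counts.keys())  # add 1 for each type present
    n_keep = max(1, len(total) * percent // 100)
    needed = min_share * len(subcorpora)
    ranked = [t for t, _ in total.most_common() if spread[t] >= needed]
    return ranked[:n_keep]


# Invented toy data: 4 types in total, so percent=50 keeps 2 of them.
toy = {"letters": "a b a c".split(), "radio": "a b d".split()}
print(top_types(toy, percent=50))
```

With 35 000 types and `percent=1`, `n_keep` comes to 350, which matches the number of words mentioned in the text.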

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified with the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up, the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone’s phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a

[Table 4 lists, in three column groups, entries 153–167, 180–194 and 207–221 together with their phonetic transcriptions; the transcriptions are not reproduced here.]

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà


number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Comparison between phone frequencies in the literature and those in CPD
On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

CONSONANTS (phone symbols not reproduced)
  columns: bilabial, labiodental, alveolar, palato-alveolar, palatal, velar
  rows: oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant
VOCALIC RESONANTS (phone symbols not reproduced)
  columns: front, central, back
  rows: close, close-mid, open-mid, open
OTHER SYMBOLS
  voiced labial-velar consonantal resonant; voiceless palato-alveolar affricate

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD)

[Table 6 sets out, per phone, the percentage in Muyunga and the percentage in CPD, for consonants, vocalic resonants and diphthongs; the phone symbols identifying the rows are not reproduced here. Rows identified in the running text: [ w ] 1.04 versus 5.21; [ j ] 1.36 versus 2.52; diphthongs (Muyunga only) short 2.41, long 2.59. Totals: consonants 52.53 versus 53.90; vocalic resonants and diphthongs 47.47 versus 46.11.]

Table 6: Cilubà phone frequencies in Muyunga (1979:58, 62–63) compared to those in CPD

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs in his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% − 1.04%) + (2.52% − 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ] and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.
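The bookkeeping behind the diphthong explanation can be checked with a few lines of Python. The sketch below only re-adds the figures quoted in the paragraph above, read as percentages with two decimal places (the placement of the decimal point is an assumption where it is not printed); the dictionary names are illustrative.

```python
# Percentages quoted in the running text (Muyunga versus CPD).
muyunga = {"w": 1.04, "j": 1.36, "short_diphthongs": 2.41, "long_diphthongs": 2.59}
cpd = {"w": 5.21, "j": 2.52}

# Diphthongs counted by Muyunga, which CPD analyses as [ w ] or [ j ] + vocalic resonant.
diphthongs = muyunga["short_diphthongs"] + muyunga["long_diphthongs"]

# Surplus of [ w ] and [ j ] in CPD relative to Muyunga.
surplus = (cpd["w"] - muyunga["w"]) + (cpd["j"] - muyunga["j"])

print(f"diphthongs in Muyunga: {diphthongs:.2f}%")
print(f"extra [w] and [j] in CPD: {surplus:.2f}%")
```

The two sums differ by only a third of a percentage point, which is why the text can treat them as roughly corresponding.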

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
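For readers who wish to repeat such a comparison, a correlation coefficient of this kind can be computed as below. The sketch assumes that r is the ordinary Pearson product-moment coefficient, which the text does not state explicitly; the function name `pearson_r` is illustrative, and the two short lists are invented placeholder percentages, not the values of Table 6.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Placeholder percentages per phone (hypothetical, for illustration only).
literature = [0.5, 6.4, 5.3, 7.3, 1.2]
database = [0.4, 6.6, 5.1, 7.2, 0.9]
print(round(pearson_r(literature, database), 2))
```

Each list holds one percentage per phone, in the same phone order, after leaving out the vocalic resonants, [ w ], [ j ] and the diphthongs as described above.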

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.

Complementing existing phone inventories for Cilubà
Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus, phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] came as a real surprise to the researchers as well as to the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced (as for one reason or another it was thought that this was the way it had to be pronounced), but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective
Once one realises that a minimum number of words representing the most frequent section of a language’s lexicon are sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence Cilubà here does not follow the general pattern seen in the world’s languages.

Towards a sound treatment of the Cilubà vocalic resonants
The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditory comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ] and e and [ ] in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51).

[Figure 1 is a horizontal bar chart titled ‘Places of articulation in stops’, with bars for bilabial, alveolar, palato-alveolar, palatal and velar on a scale from 0 to 50.]

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà

His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach
We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ ɦ ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ ɦ ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ ɦ ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ ɦ ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool
To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1; [ ] CV3; [ ] CV8
[ ] CV2; [ ] CV4, somewhat retracted; [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered; ([ ə ]) (IPA symbol for schwa); [ ] CV6, somewhat raised

Table 7: Vocalic resonants attested in CPD

[Figure 2 is a bar chart titled ‘Tones for the vocalic resonant [ e ]’, showing low, high and falling tones for short and long vocalic resonants on a scale from 0 to 50.]

Figure 2: Three-dimensional approach to the vocalic resonant [ e ] for Cilubà

[Figure 3 is a bar chart titled ‘Tones for the vocalic resonant [ a ]’, showing low, high and falling tones for short and long vocalic resonants on a scale from 0 to 50.]

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà

Corpus applications in the field of fundamental linguistic research, Part 2: question particles
Question particles in Sepedi: introspection-based and informant-based approaches
As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93).
(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’
In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively.
(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?
From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]. (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ [Certain question particles occur only in the sentence-initial position] (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach
Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position.
(7) Mokgalabje wa mereba ge! Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man! Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.
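Claims (b) and (c) lend themselves to a simple mechanical check once a corpus has been split into sentences. The sketch below is not the query actually run on PSC, whose tools are not described here: the names `QUESTION_WORDS`, `tokenize` and `afa_report` are illustrative, the list of question words contains only the forms seen in examples (3)–(6), and the three test sentences are short strings adapted from the examples above.

```python
import re

# Question-word forms taken from examples (3)-(6); not an exhaustive list.
QUESTION_WORDS = {"mang", "kae", "ofe"}


def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def afa_report(sentences):
    """Return (sentences with afa plus a question word,
               sentences with afa as the last word)."""
    with_question_word, sentence_final = [], []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if "afa" not in tokens:
            continue
        if QUESTION_WORDS & set(tokens):
            with_question_word.append(sentence)
        if tokens[-1] == "afa":
            sentence_final.append(sentence)
    return with_question_word, sentence_final


sample = [
    "Afa o tseba go beša nama?",
    "Afa ke mang?",
    "Naa e ka ba kgomolekokoto ye e mo hlotšeng afa?",
]
both, final = afa_report(sample)
print(both)
print(final)
```

An empty first list over a whole corpus would correspond to the corpus finding for (b), while any hit in the second list, such as example (7), counts against (c).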

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens, in principle, suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.
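Concordance lines of the kind listed in Table 8 can be produced with a basic keyword-in-context (KWIC) routine. The following is a minimal sketch under stated assumptions: the function name `kwic` and the `width` parameter are illustrative, the text is assumed to be tokenised already, and the sample tokens are taken from one of the Table 8 lines.

```python
def kwic(tokens, phrase, width=5):
    """Return (left context, node, right context) for every occurrence of
    the token sequence `phrase` in `tokens`, ignoring case."""
    n = len(phrase)
    lowered = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    lines = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines


tokens = "Sešane sa basadi Na afa o tloga o sa re tswe".split()
for left, node, right in kwic(tokens, ["na", "afa"], width=3):
    print(f"{left:>25}  {node}  {right}")

# The reverse order yields no hits in this sample.
print(kwic(tokens, ["afa", "na"]))
```

Running the same routine with the phrase reversed is a direct way of testing the observation that the particles co-occur only in the order na afa.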

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.
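By way of illustration, the kind of query that surfaces such co-occurrences can be sketched in a few lines of Python. This is a minimal sketch only, not the software with which PSC was actually queried: the function name `kwic`, its parameters and the whitespace tokenisation are assumptions made here for illustration, and the sample text is corpus line 2 of Table 8.

```python
def kwic(tokens, phrase, width=5):
    """Return (left, node, right) tuples for every occurrence of `phrase`.

    `tokens` is a list of word-forms, `phrase` a list of word-forms to look
    for (matched case-insensitively), `width` the number of context words.
    """
    phrase = [p.lower() for p in phrase]
    n = len(phrase)
    lines = []
    for i in range(len(tokens) - n + 1):
        window = [t.lower() for t in tokens[i:i + n]]
        if window == phrase:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines


# Hypothetical toy input: one concordance line from Table 8, split on spaces.
sample = ("Bjale gona bothata bo agetše Na afa Kgoteledi "
          "mohla a di kwa o tla di thabela").split()

for left, node, right in kwic(sample, ["na", "afa"]):
    print(f"{left:>35}  {node}  {right}")
```

Querying for the reverse order, `kwic(sample, ["afa", "na"])`, returns an empty list for this sample, which mirrors the way the absence of a sequence can be checked on a full corpus.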

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition, or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.
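The step from corpus to core vocabulary can be sketched as follows. This is a hedged illustration under simplifying assumptions, not the procedure actually used for the textbook: the function name `core_vocabulary`, the regular-expression tokenisation and the tiny sample text (built from the Sepedi sentences in Table 9) are all choices made here for the example.

```python
from collections import Counter
import re


def core_vocabulary(text, size):
    """Return the `size` most frequent word-forms in `text`, most frequent first."""
    # Lower-case the text and split it into runs of letters/digits;
    # in Python 3, \w also matches letters such as š.
    tokens = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(size)]


# Hypothetical toy corpus; a real course would use millions of running words.
toy_corpus = "Ke nyaka gore o nthuše. Re bona tau. Re thuša bona. Tla mo."

print(core_vocabulary(toy_corpus, 2))
```

With a real corpus, `size` would simply be set to the number of words the course compiler wants learners to master, for instance 1 000.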

Here, the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1 the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
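Retrieving runs of identical word-forms, such as several le’s in a row, is a simple mechanical task once a text is tokenised. The sketch below is one possible way of doing so and makes no claim about how the PSC queries were actually run; the function name `runs_of`, the `min_len` threshold and the sample (the stretch around the node in corpus line 13 of Table 10, split on spaces) are assumptions for illustration.

```python
from itertools import groupby


def runs_of(tokens, word, min_len=3):
    """Return (start_index, run_length) for each run of `word` of at least `min_len`."""
    runs = []
    position = 0
    # groupby clusters consecutive identical (lower-cased) tokens.
    for key, group in groupby(tokens, key=str.lower):
        length = len(list(group))
        if key == word and length >= min_len:
            runs.append((position, length))
        position += length
    return runs


# Hypothetical toy input taken from corpus line 13 of Table 10.
sample = "se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba".split()

print(runs_of(sample, "le"))
```

Each hit gives the position and length of a run, from which a concordance line can then be cut for classroom use.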

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M  gore ‘that, so that’
M  Ke nyaka gore o nthuše ‘I want you to help me’  S
M  bona ‘see’
M  Re bona tau ‘We see a lion’  S
M  bona ‘they/them’
M  Re thuša bona ‘We help them’  S
M  bego ‘which was busy’
M  Batho ba ba bego ba reka ‘The people who were busy buying’  S
M  tla ‘come; shall/will’
M  Tla mo ‘Come here’  S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

Line    Left context    Node    Right context
1    Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa    le le le    bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8    Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e    le le le    botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13    go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa    le le le le    le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16    go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto    le le le    swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18    yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu    le le le    golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29    ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo    le le le    bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32    seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana    le le le    hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

Line    Left context    Node    Right context
22    meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi    ba ba ba    badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127    mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile    ba ba ba    tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185    ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la    ba ba ba    amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259    ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi    ba ba ba ba    bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272    tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se    ba ba ba    hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312    itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma    ba ba ba    mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317    etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo    ba ba Ba    bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
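How often each of the two locative strategies occurs for a given noun can be counted directly, which gives the teacher a first quantitative handle on the contrast. The following is a minimal sketch under stated assumptions: the function name `count_locatives` is hypothetical, tokenisation is by whitespace, and the toy token list merely joins the phrase go monna to corpus line 5 of Table 12, so the counts it prints say nothing about real frequencies in PSC.

```python
def count_locatives(tokens, noun, suffixed):
    """Count the prefixing strategy ('go' + noun) against the suffixed form."""
    lowered = [t.lower() for t in tokens]
    # Prefixing strategy: the particle 'go' immediately followed by the noun.
    prefixed = sum(
        1 for first, second in zip(lowered, lowered[1:])
        if first == "go" and second == noun
    )
    return {"go " + noun: prefixed, suffixed: lowered.count(suffixed)}


# Hypothetical toy input, for illustration only.
toy_tokens = ("go monna gagwe a ikgafa go sepela le nna "
              "go nkiša monneng yoo wa gabo").split()

print(count_locatives(toy_tokens, "monna", "monneng"))
```

The rare hits for the suffixed form are the ones worth indexing exhaustively, since they are the hardest to find by reading and marking.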

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created, he did not come out of a woman; it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
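The list-comparison approach described above can be sketched in a few lines. This is a minimal sketch of the general technique only and in no way the code behind the WordPerfect 9 spellcheckers: the function names `build_wordlist` and `unrecognised`, the regular-expression tokenisation and the toy corpus are assumptions made for illustration.

```python
from collections import Counter
import re


def build_wordlist(corpus_text, top_n):
    """Store the `top_n` most frequent word-forms of a corpus as a set."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    return {word for word, _ in Counter(tokens).most_common(top_n)}


def unrecognised(text, wordlist):
    """Return the typed word-forms that do not occur in the stored list."""
    return [w for w in re.findall(r"\w+", text) if w.lower() not in wordlist]


# Hypothetical toy corpus; a real list would hold tens of thousands of forms.
toy_corpus = "Re bona tau. Re thuša bona. Tla mo."
wordlist = build_wordlist(toy_corpus, top_n=10)

print(unrecognised("Re bona tua", wordlist))
```

Because the check is a plain set lookup, any inflected or conjunctively written form absent from the stored list is flagged, which is why list size matters so much more for isiXhosa and isiZulu than for Sepedi and Setswana.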

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
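The two success rates follow directly from the counts given for (8) and (9): the share of recognised word-forms, rounded to the nearest whole percentage. The short check below spells out that arithmetic; the function name `success_rate` is simply a label chosen here.

```python
def success_rate(total_forms, unrecognised_forms):
    """Percentage of word-forms recognised, rounded to the nearest integer."""
    return round(100 * (total_forms - unrecognised_forms) / total_forms)


print(success_rate(41, 12))  # isiZulu paragraph in (8): 29 of 41 recognised
print(success_rate(46, 2))   # Sepedi paragraph in (9): 44 of 46 recognised
```

The unrounded values are roughly 70.7% and 95.7% respectively.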

The four available first-generation spellcheckers were tested by Corel’s Beta Partners, and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research - Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban. June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <httpwwwhduibnoAcoHumabshurskhtm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <httpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtml>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg. November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


In contrast to Gabriël, Burssens does use a phonetic alphabet, albeit the one popular among Africanists. As far as Cilubà is concerned, this implies that the symbol [ ] is used instead of [ ɸ ] to represent the voiceless bilabial fricative, and the symbol [ y ] instead of [ j ] for the voiced palatal consonantal resonant.³

The phonetically most explicit description is found in Muyunga (1979:47–64). This explicitness is not so much a result of an improved classification of the Lubà phones, but rather due to the fact that Muyunga is the only one to take quantitative aspects into consideration in his phonetic description. Muyunga summarises (all) the information found in the literature as shown in Table 3.⁴

Compared to Burssens’ inventory, it can be observed that Gabriël’s diphthongs and prenasalised consonants have reappeared. This procedure can only be considered as a move backwards. Phonetically, the diphthongs suggested by Muyunga (and Gabriël) are more accurately analysed as vocalic resonants preceded by the voiced labial-velar consonantal resonant [ w ] or the voiced palatal consonantal resonant [ j ]. While Muyunga includes these diphthongs in his quantitative study, in that same study he fortunately dissociates the prenasalised consonants, for this is without question a more accurate phonetic analysis.

Despite such inaccuracies in his analysis, Muyunga’s prime contribution to the phonetic description of Cilubà is his discussion of the frequencies of occurrence of the Lubà phones. He calculated these frequencies on the basis of 11 texts: four familiar-style letters, four pastoral letters and three radio broadcasts. In this way he obtained ‘a total of 2 333 words containing 10 726 sounds’ (1979:60). Although he

front

central

back

close

half-close

half-open

V O W E L S

open

$ there are no true diphthongs $ all vowels can be short long or extra-long

bilabial

labio-dental

dental

alveolar

palato-alveolar

(pre-) palatal

palatal

velar

explosive

affricate

fricative

nasal

bilateral

semivocal

C O N S O N A N T S

$ only in the combination mp

$ labial-velar semivocal

Table 2 Cilubagrave phone inventory according to Burssens (19396ndash12)

Southern African Linguistics and Applied Language Studies 2001 19 111ndash131 115

is not explicit in this regard the 2 333 wordsare not 2 333 different words Anyhow by ran-domly choosing 11 short texts he hopes toarrive at a representative sample of the Lubagravelanguage

Corpus-based fieldwork and the Cilubà Phonetic Database (CPD)

While Muyunga’s method can be regarded as a ‘phonetics from below’-approach, one still needs to eliminate the random factor. It is precisely here that our suggestion to utilise a corpus comes in. To that end, Recall’s Cilubà Corpus (RCC), a small-size structured corpus of just 300 000 running words (tokens), was queried (cf. De Schryver & Prinsloo 2000: 98–102), and a corpus of that size turned out to contain approximately 35 000 different words (types). Now, the top ONE PERCENT of the types with an even distribution across the different sub-corpora (cf. De Schryver & Prinsloo forthcoming), or thus just 350 words, not only turned out to provide enough data for a thorough phonetic analysis, but also to complement all existing phonetic descriptions. In other words, although this study only deals with the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.

Table 3: Cilubà phone inventory according to Muyunga (1979: 48–49, 52)
[Phone symbols not reproduced.]
Simple consonants: columns bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal; rows plosive, nasal, lateral, fricative, affricate, semivowel.
Prenasalised consonants: columns bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal; rows plosive, fricative, affricate.
Simple vowels: columns front, back; rows close, half-close, open.
Diphthongs (We are puzzled by Muyunga’s notation of diphthong d.): columns front, back; rows close, close to half-close, close to open.
Remarks: vowels can be short, environmentally lengthened or inherently long; [ ] is always prenasalised.
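The selection of top-frequency types described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `top_types`, the toy sub-corpora and, in particular, the `min_share` criterion for an ‘even distribution’ are hypothetical, as the actual criterion is set out in the cited forthcoming work and not in this article.

```python
from collections import Counter


def top_types(subcorpora, fraction=0.01, min_share=1.0):
    """Return the most frequent types that are spread across sub-corpora.

    subcorpora: dict mapping a sub-corpus name to a list of tokens.
    fraction:   share of all types to keep (0.01 = top one percent).
    min_share:  share of sub-corpora in which a type must occur
                (an assumed stand-in for 'even distribution').
    """
    total = Counter()   # token counts over the whole corpus
    spread = Counter()  # number of sub-corpora each type occurs in
    for tokens in subcorpora.values():
        counts = Counter(tokens)
        total.update(counts)
        spread.update(counts.keys())

    n_keep = max(1, round(len(total) * fraction))
    needed = min_share * len(subcorpora)
    evenly_spread = [(w, c) for w, c in total.most_common() if spread[w] >= needed]
    return evenly_spread[:n_keep]


# Toy data, not taken from RCC.
toy = {"letters": ["x", "x", "y"], "radio": ["x", "z", "z", "z"]}
print(top_types(toy))
```

With roughly 35 000 types, `fraction=0.01` yields the 350 words mentioned in the text; a stricter or looser `min_share` changes which of the frequent types qualify.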

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified with the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up: the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone’s phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà
[Three columns of numbered phonetic transcriptions, covering items 153–167, 180–194 and 207–221; transcriptions not reproduced.]

Comparison between phone frequencies in the literature and those in CPD

On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs in his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% − 1.04%) + (2.52% − 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ], and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet for this part of speech Muyunga (1979: 150–152) consistently (and wrongly) writes short vocalic resonants.

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD)
[Phone symbols not reproduced.]
CONSONANTS: columns bilabial, labiodental, alveolar, palato-alveolar, palatal, velar; rows oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant.
VOCALIC RESONANTS: columns front, central, back; rows close, close-mid, open-mid, open.
OTHER SYMBOLS: voiced labial-velar consonantal resonant; voiceless palato-alveolar affricate.

Table 6: Cilubà phone frequencies in Muyunga (1979: 58, 62–63) compared to those in CPD
[Per-phone percentages for consonants and vocalic resonants; phone symbols not reproduced. Values discussed in the text: [ w ] 1.04% (Muyunga) versus 5.21% (CPD); [ j ] 1.36% versus 2.52%; diphthongs, counted by Muyunga only, 2.41% short and 2.59% long. Totals: consonants 52.53% (Muyunga) and 53.90% (CPD); vocalic resonants 47.47% (Muyunga) and 46.11% (CPD).]
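The reconciliation of Muyunga’s diphthong counts with the surplus of [ w ] and [ j ] in CPD is simple arithmetic, and can be checked as follows. The percentages are those quoted in the running text; the variable names are illustrative only.

```python
# Percentages quoted in the text for Muyunga's study and for CPD.
muyunga = {"w": 1.04, "j": 1.36, "diph_short": 2.41, "diph_long": 2.59}
cpd = {"w": 5.21, "j": 2.52}

# Diphthongs counted by Muyunga (short + long).
diphthongs = muyunga["diph_short"] + muyunga["diph_long"]

# Additional [ w ] and [ j ] found in CPD relative to Muyunga.
surplus = (cpd["w"] - muyunga["w"]) + (cpd["j"] - muyunga["j"])

print(f"{diphthongs:.2f} {surplus:.2f}")  # prints: 5.00 5.33
```

The two figures differ by only a third of a percentage point, which is why the text treats Muyunga’s diphthongs and CPD’s extra consonantal resonants as the same material analysed in two different ways.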

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). Doubtless, this clearly supports a corpus-based phonetics from below approach.
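For readers who wish to repeat such a comparison on their own frequency lists, a minimal sketch of a correlation coefficient follows. It assumes that r is the Pearson product-moment coefficient, which the article does not state explicitly, and the sample values are toy numbers, not the per-phone percentages of Table 6.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of frequencies."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy percentages for three hypothetical phones in two studies.
study_a = [1.0, 2.0, 3.0]
study_b = [2.0, 4.0, 6.0]
print(round(pearson_r(study_a, study_b), 2))
```

In an actual replication, each list would hold one percentage per phone, in the same phone order, with the vocalic resonants, [ w ], [ j ] and the diphthongs left out as explained above.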

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs 38% L [low tones]’ (1998b: 57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.
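To make the two tone counts directly comparable, one option is to rescale the CPD figures to high and low tones only, since Kabuta reports a two-way split. The rescaling is our own assumption for illustration, not a step taken in the study; the percentages are those quoted above.

```python
# CPD tone percentages as quoted in the text.
cpd_tones = {
    "high": 61.04, "low": 35.28, "falling": 3.30,
    "rising": 0.13, "middle": 0.13, "voiceless": 0.13,
}

# Share of high tones when only high and low tones are considered.
level_total = cpd_tones["high"] + cpd_tones["low"]
high_share = 100 * cpd_tones["high"] / level_total

print(f"{high_share:.1f}% high vs {100 - high_share:.1f}% low")
```

On this reading CPD has about 63% high against 37% low tones, close to the 62% versus 38% reported by Kabuta.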

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus, phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to both the researchers and the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words, representing the most frequent section of a language’s lexicon, are sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence, one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence, Cilubà here does not follow the general pattern seen in the world’s languages.

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949: xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ] and e and [ ] in Cilubà. Kabuta (1998a: 14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979: 49–51). His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979: 49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà
[Bar chart ‘Places of articulation in stops’, axis 0–50%: bilabial 38.69%, alveolar 38.22%, palato-alveolar 2.55%, palatal 1.6%, velar 18.95%.]

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.
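Such a three-dimensional profile amounts to a cross-tabulation of quantity by tone, expressed as percentages, for one vocalic resonant at a time. A minimal sketch follows; the function name and the toy observations are hypothetical and do not reproduce the coding scheme used in CPD.

```python
from collections import Counter


def tone_quantity_profile(observations):
    """Percentage of each (quantity, tone) pair for ONE vocalic resonant.

    observations: iterable of (quantity, tone) tuples,
                  e.g. ("short", "high").
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: round(100 * n / total, 2) for pair, n in counts.items()}


# Toy observations for a single, unspecified vocalic resonant.
toy = [("short", "high"), ("short", "high"), ("long", "low"), ("short", "low")]
print(tone_quantity_profile(toy))
```

Running the function once per vocalic resonant gives one profile per resonant, which is the kind of information the bar charts in Figures 2 and 3 display.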

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1                      [ ] CV3                          [ ] CV8
[ ] CV2                      [ ] CV4, somewhat retracted      [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered    ([ ə ]) (IPA symbol for schwa)   [ ] CV6, somewhat raised

Table 7: Vocalic resonants attested in CPD

Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà
[Bar chart ‘Tones for the vocalic resonant [ e ]’: tone (low, high, falling) by quantity (short, long), frequency axis 0–50%. Bar values: 0%, 7.41%, 0%, 25.93%, 44.44%, 22.22%.]

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà
[Bar chart ‘Tones for the vocalic resonant [ a ]’: tone (low, high, falling) by quantity (short, long), frequency axis 0–50%. Bar values: 26.39%, 45.14%, 0%, 9.03%, 18.06%, 1.39%.]

Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985: 93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991: 140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985: 93) and Louwrens (1991: 143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]: (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ [Certain question particles occur only in the sentence-initial position] (Prinsloo 1985: 91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991: 140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge. Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.
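A check for claim (c) can be automated by looking for sentences that end in the particle. The sketch below is a simplification under stated assumptions: it presumes that the corpus text has sentence-final punctuation, the function name is hypothetical, and the sample string is adapted from examples (2) and (7) with punctuation assumed.

```python
import re


def sentence_final_hits(text, particle="afa"):
    """Return the sentences in `text` whose last word is `particle`."""
    # Split after ., ? or ! followed by whitespace (a rough sentence splitter).
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    # The particle as a whole word, followed only by closing punctuation.
    pattern = re.compile(rf"\b{re.escape(particle)}\s*[.?!]*$", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Sample adapted from examples (2) and (7); punctuation is assumed.
sample = ("Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? "
          "Afa o tseba go beša nama?")
for hit in sentence_final_hits(sample):
    print(hit)
```

Only the first sentence is returned, since the second has afa in initial position. The same function with `particle="na"` would retrieve sentence-final na.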

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa, and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991: 144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996: 9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.
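Concordance lines of the kind listed in Table 8 are produced by a keyword-in-context (KWIC) query. The following is a minimal sketch of such a query over a tokenised text; it is not the software used to query PSC, the function name is hypothetical, and the sample tokens are a fragment of the first concordance line.

```python
def kwic(tokens, phrase, width=5):
    """Return (left context, match, right context) for each hit of `phrase`.

    tokens: list of word tokens in running order.
    phrase: list of tokens to search for, matched case-insensitively.
    width:  number of context tokens to keep on either side.
    """
    if not phrase:
        raise ValueError("phrase must contain at least one token")
    target = [p.lower() for p in phrase]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, " ".join(tokens[i:i + n]), right))
    return lines


tokens = "Ruri re paletšwe Na afa Kgoteledi o tla be a gomela gae".split()
for left, match, right in kwic(tokens, ["na", "afa"], width=3):
    print(f"{left:>25}  {match}  {right}")
```

Searching for the reverse sequence, `["afa", "na"]`, is the corresponding check for the observation that the particles only co-occur in the order na afa.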

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999: 55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect, Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987: 169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987: 68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook, used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi, in context.

Here, the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
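Isolating a core vocabulary from frequency counts can be sketched as follows. The tokenisation rule (runs of letters, optionally joined by a hyphen or apostrophe) and the function name are assumptions made for illustration; a real course compiler would also have to decide how to treat inflected forms, proper names and the like.

```python
import re
from collections import Counter


def core_vocabulary(text, size=1000):
    """Return the `size` most frequent word forms in `text`, most frequent first."""
    # Letters only (any script), optionally joined by a hyphen or apostrophe.
    tokens = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())
    return [word for word, _ in Counter(tokens).most_common(size)]


# Toy text built from the two sample sentences in Table 9.
toy = "Ke nyaka gore o nthuše. Re bona tau. Re bona tau."
print(core_vocabulary(toy, size=3))
```

With `size=1000` applied to a full corpus, the result corresponds to the 1 000-word core vocabulary discussed above, and each selected word can then be looked up in the corpus to find authentic sentences for the exercises.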

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples taken from a variety of written and oral sources are used, rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995: 126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11, a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. By retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le's and ba's, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on 'real' language can only be welcomed.
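How concordance lines of this kind might be retrieved can be illustrated with a short Python sketch. It is not the query tool used for PSC; the function name `kwic_runs` and its parameters are hypothetical, and the input is assumed to be an already tokenised text. The function looks for runs of at least `min_run` consecutive occurrences of one word-form and returns each run with some left and right context.

```python
def kwic_runs(tokens, word, min_run=3, context=4):
    """Return (left, run, right) for every run of `word` of length >= min_run."""
    hits = []
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i].lower() == word:
            j = i
            # Extend j to the end of the run of identical word-forms.
            while j < n and tokens[j].lower() == word:
                j += 1
            if j - i >= min_run:
                hits.append((
                    " ".join(tokens[max(0, i - context):i]),
                    " ".join(tokens[i:j]),
                    " ".join(tokens[j:j + context]),
                ))
            i = j
        else:
            i += 1
    return hits


# Fragment of corpus line 1 in Table 10, used here as toy input.
line = "gwa agiwa le le le bitšwago Donsa Lona le ile la".split()
for left, run, right in kwic_runs(line, "le"):
    print(f"{left:>20}  {run}  {right}")
```

Setting `word="ba"` would retrieve candidate lines of the kind listed in Table 11 in the same way.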

Teaching contrasting structures

Top-frequency words and top-frequency grammatical structures singled out from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

'we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information' (Renouf 1987:169).

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year's Sepedi Laboratory Textbook

M gore 'that, so that'
M Ke nyaka gore o nthuše 'I want you to help me' S
M bona 'see'
M Re bona tau 'We see a lion' S
M bona 'they/them'
M Re thuša bona 'We help them' S
M bego 'which was busy'
M Batho ba ba bego ba reka 'The people who were busy buying' S
M tla 'come; shall/will'
M Tla mo 'Come here' S


Table 10: Morpho-syntactic analysis of up to five le's in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1    Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa    le le le    bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8    Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e    le le le    botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13   go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa    le le le le    le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16   go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto    le le le    swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18   yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu    le le le    golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29   ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo    le le le    bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32   seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana    le le le    hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba's in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22   meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi    ba ba ba    badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127  mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile    ba ba ba    tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185  ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la    ba ba ba    amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259  ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi    ba ba ba ba    bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272  tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se    ba ba ba    hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312  itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma    ba ba ba    mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317  etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo    ba ba Ba    bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: 'prefixing of go' versus 'suffixing of -ng'. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna 'at the man' and go mosadi 'at the woman' as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

'There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning "the neighbourhood where the chief lives", whereas go kgôši clearly implies "to the particular chief in person"' (Louwrens 1991:121).

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as 'at' with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens' semantic distinction between these strategies goes a long way in pointing out the difference. However, once again careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

Table 12: Corpus lines for monneng (suffixing the locative -ng to 'a human being' in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to 'a human being' in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng 'Because when man was created, he did not come out of a woman; it is the woman who came out of the man' (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is 'a computer PROGRAM that checks what you have written and makes your spelling correct' (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary's definition of a spellchecker: 'a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words' (1996).

While such a 'stored list of words' is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8), the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of 'only' 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
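The list-comparison approach and the two success rates can be restated as a small Python sketch. It makes no claim about how the WordPerfect 9 spellcheckers are implemented: the function names and the toy stored list are hypothetical, and only the counts (12 of 41 and 2 of 46 word-forms unrecognised) come from the tests in (8) and (9).

```python
def unrecognised(words, stored_list):
    """Return the typed word-forms that do not occur in the stored list."""
    known = {w.lower() for w in stored_list}  # case-insensitive comparison
    return [w for w in words if w.lower() not in known]


def success_rate(total_words, unrecognised_count):
    """Percentage of word-forms recognised by the spellchecker."""
    return 100 * (total_words - unrecognised_count) / total_words


# Toy stored list, for illustration only; the real lists hold tens of
# thousands of top-frequency word-forms.
toy_list = ["ka", "go", "ya", "le", "di", "a"]
print(unrecognised("Dikarata di a hwetšagala ka go".split(), toy_list))

print(round(success_rate(41, 12)))  # isiZulu paragraph in (8)
print(round(success_rate(46, 2)))   # Sepedi paragraph in (9)
```

The two rounded values reproduce the 71% and 96% reported for isiZulu and Sepedi respectively.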

The four available first-generation spellcheckers were tested by Corel's Beta Partners, and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have at present become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a 'corpus-based phonetics from below'-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language's lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called 'traditional approaches' of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver's phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that 'Each simple consonant represents a phoneme except and h which belong to a same phoneme' (1979:47). Here, however, Muyunga is mixing different dialects. While [ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as 'standard Cilubà' (De Clercq & Willems 1960:37), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as 'words in isolation' are concerned. The implications of such an approach for 'words in context', however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l'Académie Royale des Sciences d'Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex '96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St. Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a 'corpus-based phonetics from below'-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l'École Professionnelle St. Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


is not explicit in this regard, the 2 333 words are not 2 333 different words. Anyhow, by randomly choosing 11 short texts, he hopes to arrive at a representative sample of the Lubà language.

Corpus-based fieldwork and the Cilubà Phonetic Database (CPD)

While Muyunga's method can be regarded as a 'phonetics from below'-approach, one still needs to eliminate the random factor. It is precisely here that our suggestion to utilise a corpus comes in. To that end, Recall's Cilubà Corpus (RCC), a small-size structured corpus of just 300 000 running words (tokens), was queried (cf. De Schryver & Prinsloo 2000:98–102), and a corpus of that size turned out to contain approximately 35 000 different words (types). Now, the top ONE PERCENT of the types with an even distribution across the different sub-corpora (cf. De Schryver & Prinsloo forthcoming), or thus just 350 words, not only turned out to provide enough data for a thorough phonetic analysis, but also to complement all existing phonetic descriptions. In other words, although this study only deals with

Table 3: Cilubà phone inventory according to Muyunga (1979:48–49, 52)

Simple consonants, by place of articulation (bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal) and manner of articulation (plosive, nasal, lateral, fricative, affricate, semivowel)

Prenasalised consonants, by place of articulation (bilabial, labio-dental, dental-alveolar, palato-alveolar, palatal, velar, glottal) and manner of articulation (plosive, fricative, affricate)

Simple vowels (front, back) and diphthongs (front, back), with the height labels close, half-close, close to half-close, open and close to open. (We are puzzled by Muyunga's notation of diphthongs.)

Remarks
• vowels can be short, environmentally lengthened or inherently long
• is always prenasalised


the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.
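The selection of the top one percent of the types can be sketched as follows. This is a minimal illustration and not the procedure actually applied to RCC: the function name `top_types` is hypothetical, and 'an even distribution across the different sub-corpora' is simplified here to the assumption that a type must be attested in every sub-corpus.

```python
from collections import Counter


def top_types(subcorpora, fraction=0.01):
    """Return the most frequent types that occur in every sub-corpus.

    subcorpora: dict mapping a sub-corpus name to its list of tokens.
    fraction:   share of all types to keep (0.01 = top one percent).
    """
    type_sets = [set(tokens) for tokens in subcorpora.values()]
    totals = Counter()
    for tokens in subcorpora.values():
        totals.update(tokens)
    keep = max(1, round(len(totals) * fraction))
    # Assumption: 'evenly distributed' means attested in all sub-corpora.
    even = [t for t, _ in totals.most_common() if all(t in s for s in type_sets)]
    return even[:keep]


# Toy data; with about 35 000 types, one percent corresponds to 350 words.
toy = {"fiction": ["x", "y", "x", "z"], "news": ["x", "y", "w"]}
print(top_types(toy, fraction=0.5))
print(round(35_000 * 0.01))
```

A stricter notion of even distribution, for instance a minimum relative frequency per sub-corpus, could replace the simple membership test without changing the rest of the sketch.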

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words, he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings, an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified with the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up: the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone's phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà (entries 153–167, 180–194 and 207–221, each with its phonetic transcription)


number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine 'standard Cilubà', so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Comparison between phone frequencies in the literature and those in CPD

On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs into his frequency counts. As a result, CPD's [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% - 1.04%) + (2.52% - 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD's [ ] was added to [ e ] and CPD's [ ] was added to [ o ]. Also, as Muyunga distinguishes between 'short, environmentally lengthened and inherently long vowels' (cf. Table 3), while CPD is based on 'words in isolation' (thus excluding environmentally lengthened vocalic resonants), Muyunga's short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD)

CONSONANTS, by place of articulation (bilabial, labiodental, alveolar, palato-alveolar, palatal, velar) and manner of articulation (oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant)
VOCALIC RESONANTS, by position (front, central, back) and height (close, close-mid, open-mid, open)
OTHER SYMBOLS: voiced labial-velar consonantal resonant; voiceless palato-alveolar affricate

Table 6: Cilubà phone frequencies (in %) in Muyunga (1979:58, 62–63) compared to those in CPD

Totals: consonants 52.53 (Muyunga) versus 53.90 (CPD); vocalic resonants 47.47 (Muyunga, including 2.41 short and 2.59 long diphthongs) versus 46.11 (CPD)

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a 'corpus-based phonetics from below' approach.
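For readers who wish to repeat such a comparison, the correlation coefficient can be computed as in the Python sketch below. We assume that r is the Pearson product-moment coefficient, which the text does not state explicitly; the function name `pearson_r` is ours, and the two lists of percentages are invented toy values, not the figures of Table 6.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy percentages for four phones in two studies (illustration only).
study_a = [6.0, 5.0, 1.0, 0.5]
study_b = [6.5, 5.2, 0.9, 0.4]
print(round(pearson_r(study_a, study_b), 2))
```

A value close to 1 indicates that phones that are frequent in one study are also frequent in the other, which is the sense in which the two frequency studies are said to agree.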

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that '[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]' (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a 'corpus-based phonetics from below' approach.

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus, phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine 'standard Cilubà', should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which 'isn't it' might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to the researchers as well as to the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: "No. People do not speak like that." The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words, representing the most frequent section of a language’s lexicon, is sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence, one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence, Cilubà here does not follow the general pattern seen in the world’s languages.
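The comparison with UPSID can be sketched as a few lines of arithmetic. The figures below are the percentages quoted in the text and in Figure 1, read with two decimals; pairing Cilubà ‘bilabial’ with UPSID ‘labial’ and ‘alveolar’ with ‘dental-alveolar’ is an assumption made only for this illustration, as are all function and variable names.

```python
# Percentages as quoted in the text and Figure 1 (two decimals assumed).
CILUBA_PLACES = {
    "bilabial": 38.69, "alveolar": 38.22, "palato-alveolar": 2.55,
    "palatal": 1.6, "velar": 18.95,
}
UPSID_PLACES = {
    "labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
    "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
    "velar": 19.63, "uvular": 2.01, "glottal": 3.46,
}
CILUBA_VOICING = {"voiced": 75.0, "voiceless": 25.0}
UPSID_VOICING = {"voiced": 52.49, "voiceless": 47.51}


def share(distribution, categories):
    """Summed percentage of the given categories, rounded to two decimals."""
    return round(sum(distribution[c] for c in categories), 2)


def gap(first, second, key):
    """Difference in percentage points between two distributions for one key."""
    return round(first[key] - second[key], 2)


# Hypothetical pairing of categories across the two inventories.
front_ciluba = share(CILUBA_PLACES, ["bilabial", "alveolar"])
front_upsid = share(UPSID_PLACES, ["labial", "dental-alveolar"])
velar_gap = gap(CILUBA_PLACES, UPSID_PLACES, "velar")
voiced_gap = gap(CILUBA_VOICING, UPSID_VOICING, "voiced")

print(f"front places: Cilubà {front_ciluba}% vs UPSID {front_upsid}%")
print(f"velar gap: {velar_gap} points; voiced gap: {voiced_gap} points")
```

On these figures the velar share and the voicing split are what carry the two conclusions drawn above: the velar shares are within one percentage point of each other, whereas the voiced shares differ by more than twenty points.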

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditory comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949: xi) devotes just one sentence to the observation that there is no phonological opposition between o and ɔ, and e and ɛ, in Cilubà. Kabuta (1998a: 14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ɛ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.5 In addition, one is at a total loss when it comes to the phones [ o ] versus [ ɔ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979: 49–51).

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà (bilabial 38.69%, alveolar 38.22%, palato-alveolar 2.55%, palatal 1.6%, velar 18.95%)


His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979: 49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value is not predictable in a specific environment, and one should therefore seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.6 As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just like [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

Table 7: Vocalic resonants attested in CPD

[ ] CV1                      [ ] CV3                          [ ] CV8
[ ] CV2                      [ ] CV4, somewhat retracted      [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered    ([ ə ]) (IPA symbol for schwa)   [ ] CV6, somewhat raised


Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà (panel title: Tones for the vocalic resonant [ e ]; quantity: short, long; tone: low, high, falling; vertical axis: 0 to 50; plotted values: 0, 7.41, 0, 25.93, 44.44, 22.22)

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (panel title: Tones for the vocalic resonant [ a ]; quantity: short, long; tone: low, high, falling; vertical axis: 0 to 50; plotted values: 26.39, 45.14, 0, 9.03, 18.06, 1.39)


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985: 93):
(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’
In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat but observes that he/she is not performing well. Louwrens (1991: 140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985: 93) and Louwrens (1991: 143) emphasise that afa cannot be used with question words, and give the examples shown in (3) – (4) and (5) – (6) respectively:
(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?
From (3) – (6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [… s]legs in die inisiële sinsposisie [op] (3ii) O tšwa ka gae ge o etla fa kagore o šetše o fela pelo afa’ [Certain question particles occur only in the initial sentence position] (Prinsloo 1985: 91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991: 140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:
(7) Mokgalabje wa mereba ge! Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba … ‘The cheeky old man! Can it be something big, immense and strong that created him? It can be …’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens, in principle, suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991: 144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.
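A concordance query of this kind can be sketched in a few lines of Python. This is a minimal keyword-in-context (KWIC) routine, not the software used to query PSC; the function name, the context width and the three-fragment sample text (two fragments taken from Table 8, one from example (2)) are assumptions made for illustration.

```python
import re


def kwic(text, phrase, width=40):
    """Return keyword-in-context lines for a phrase, ignoring case."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the node phrase lines up.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


# Toy text only: two fragments from Table 8 and one from example (2).
sample = (
    "Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa "
    "Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo "
    "Afa o tseba go beša nama"
)

hits = kwic(sample, "na afa")
for line in hits:
    print(line)
```

Only the two occurrences where na is directly followed by afa are returned; the sentence-initial Afa of the last fragment is not, since it is not preceded by na.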

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996: 9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999: 55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [… a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987: 169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987: 68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora […] contain sentence and word usage information that was difficult to collect until recently and consequently was largely ignored by linguists’ (Kruyt 1995: 126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11, a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
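Retrieving such sequences from a corpus amounts to searching for runs of one repeated word-form. The sketch below is one possible way to do so in Python and is not the query tool used on PSC; the function name, the minimum run length and the two sample fragments (reassembled here from corpus line 13 of Table 10 and corpus line 317 of Table 11) are assumptions.

```python
from itertools import groupby


def repeated_runs(tokens, target, min_len=3):
    """Return (start_index, run_length) for each run of `target`, ignoring case."""
    runs = []
    index = 0
    for form, group in groupby(tokens, key=str.lower):
        length = len(list(group))
        if form == target and length >= min_len:
            runs.append((index, length))
        index += length
    return runs


# Fragments reassembled from Table 10 (line 13) and Table 11 (line 317).
le_tokens = "se se bego se le mokgahlo ga lapa le le le le le latelago".split()
ba_tokens = "ka mokgwa woo ba ba Ba bea marumo fase".split()

le_runs = repeated_runs(le_tokens, "le")
ba_runs = repeated_runs(ba_tokens, "ba")
print(le_runs, ba_runs)
```

Each hit gives the token position at which a run starts, so the surrounding context can be printed as a concordance line for classroom use.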

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987: 169)

Formulated differently, in using a corpus certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M  gore  ‘that, so that’
M  Ke nyaka gore o nthuše  ‘I want you to help me’  S
M  bona  ‘see’
M  Re bona tau  ‘We see a lion’  S
M  bona  ‘they/them’
M  Re thuša bona  ‘We help them’  S
M  bego  ‘which was busy’
M  Batho ba ba bego ba reka  ‘The people who were busy buying’  S
M  tla  ‘come, shall/will’
M  Tla mo  ‘Come here’  S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991: 121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
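How often each locative strategy occurs can be checked with a simple count. The sketch below is a rough illustration only: the function name is hypothetical, the toy string is stitched together from the forms cited in the text and fragments of Tables 12 and 13 rather than being running Sepedi text, and a real query would need to handle words intervening between go and the noun.

```python
import re


def count_locatives(text, noun, locative_form):
    """Count 'go + noun' versus the suffixed locative form in `text`."""
    lowered = text.lower()
    prefixed = len(re.findall(rf"\bgo\s+{re.escape(noun)}\b", lowered))
    suffixed = len(re.findall(rf"\b{re.escape(locative_form)}\b", lowered))
    return {"go + noun": prefixed, "noun + -ng": suffixed}


# Toy string only: cited forms plus fragments from Tables 12 and 13.
toy = " ".join([
    "go monna",
    "nka go iša monneng e mongwe wa gešo",
    "o fihlile monneng wa banna",
    "go mosadi",
    "go fihla mosading",
])

monna_counts = count_locatives(toy, "monna", "monneng")
mosadi_counts = count_locatives(toy, "mosadi", "mosading")
print(monna_counts, mosadi_counts)
```

The word-boundary anchors keep go mosadi from also matching inside go mosading, so the two strategies are counted separately.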

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules together with a stored list of word-roots, and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9 within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000: 114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000: 24)

Dikarata tša mogala di a hwetšagala ka go fapafapana, goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng, o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
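The two success rates follow directly from the counts reported for (8) and (9). The short Python sketch below merely reproduces that arithmetic; the function name `success_rate` is ours, chosen for illustration, and is not part of the WordPerfect software.

```python
def success_rate(total_word_forms: int, unrecognised: int) -> float:
    """Percentage of word-forms in a test paragraph that were recognised."""
    if total_word_forms <= 0:
        raise ValueError("total_word_forms must be positive")
    if not 0 <= unrecognised <= total_word_forms:
        raise ValueError("unrecognised must lie between 0 and total_word_forms")
    recognised = total_word_forms - unrecognised
    return 100.0 * recognised / total_word_forms


# Counts reported for the isiZulu paragraph (8) and the Sepedi paragraph (9).
isizulu = success_rate(41, 12)   # 29 of 41 word-forms recognised
sepedi = success_rate(46, 2)     # 44 of 46 word-forms recognised
print(f"isiZulu: {isizulu:.0f}%  Sepedi: {sepedi:.0f}%")
```

Rounded to whole percentages, this gives the 71% and 96% quoted in the text.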

The four available first-generation spellcheckers were tested by Corel’s Beta Partners, and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.
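The word-list approach described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the WordPerfect implementation: the tokenisation rule, the function names `build_word_list` and `unrecognised`, and the toy corpus are all our own assumptions.

```python
import re
from collections import Counter

# Assumption: a word-form is a run of letters, optionally joined by hyphens
# or apostrophes; tokens containing digits (e.g. amounts like R15) are skipped.
WORD = re.compile(r"\b[^\W\d_]+(?:[-'][^\W\d_]+)*\b")


def tokenise(text: str) -> list[str]:
    """Split running text into lower-cased word-forms."""
    return [token.lower() for token in WORD.findall(text)]


def build_word_list(corpus_text: str, top_n: int) -> set[str]:
    """Return the top_n most frequent word-forms found in the corpus."""
    counts = Counter(tokenise(corpus_text))
    return {word for word, _ in counts.most_common(top_n)}


def unrecognised(text: str, word_list: set[str]) -> list[str]:
    """Return the word-forms of text that are absent from the stored list."""
    return [token for token in tokenise(text) if token not in word_list]


# Toy corpus, made up purely for illustration.
toy_corpus = "ka go ka moka ka go di a di ka"
stored_list = build_word_list(toy_corpus, top_n=3)
print(unrecognised("Di ka moka", stored_list))
```

Because every orthographic word is looked up as a whole, a list of this kind covers a disjunctively written language far more readily than a conjunctively written one, where a single word-form packs in many morphemes.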


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have at present become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study, hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

¹ This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

² A different approach to the research presented in this section can be found in De Schryver (1999).

³ Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

⁴ Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979: 47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò, their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³: 37), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949: xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

⁵ High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

⁶ Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics, Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed) Looking Up, An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


the top one percent of the types in RCC, the results are far-reaching. Indeed, all claims about the frequencies of occurrence of certain phones imply that these claims are valid for those words that are most frequent in Cilubà.

Fieldwork was carried out with a male native speaker of standard Cilubà. For each of the 350 words, he was asked to pronounce a short sentence chosen from the concordance lines extracted from RCC. After repeating this sentence a second time, the word was pronounced two more times in isolation. With this procedure we hoped to obtain a pronunciation as close to natural spoken language as possible. During the recordings an initial transcription was made. In order to complete the purely auditory and visual cues, the informant was often asked to describe, in his own words, the articulation of this or that phone. In addition, we read out our own transcriptions time and again.

Following the fieldwork, our initial transcriptions were verified with the recordings. Samples of the resulting (detailed) transcriptions are shown in Table 4.

The phonetic transcriptions of the 350 most frequent Lubà words constitute the backbone of the statistical database which was subsequently set up: the Cilubà Phonetic Database (CPD). In total, CPD contains 1 709 phones. Each phone’s phonetic description was coded in various ways to enable a thorough distributional analysis. An overview of the different phones attested in CPD is shown in Table 5.

Table 4: Samples of the phonetic transcriptions of the 350 most frequent words in Cilubà [three columns of phonetic transcriptions, for items 153–167, 180–194 and 207–221; the transcriptions themselves are not recoverable]

Compared to the inventories presented in Tables 1, 2 and 3, the CPD inventory reveals a number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD) [phone symbols not recoverable; consonants are charted by place (bilabial, labiodental, alveolar, palato-alveolar, palatal, velar) and manner (oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant), vocalic resonants by front, central and back position; other symbols: voiced labial-velar consonantal resonant, voiceless palato-alveolar affricate]

Comparison between phone frequencies in the literature and those in CPD

On a first level, a comparison can be made between the phone frequencies found in Muyunga and those derived from CPD. The results of this comparison are summarised in Table 6.

Table 6: Cilubà phone frequencies in Muyunga (1979: 58, 62–63) compared to those in CPD [columns: symbol, Muyunga-%, CPD-%, for consonants and for vocalic resonants by length; phone symbols not recoverable, so only the rows identified in the running text are listed]
[ w ]: Muyunga 1.04%, CPD 5.21%
[ j ]: Muyunga 1.36%, CPD 2.52%
long [ a ]: Muyunga 0.48%, CPD 4.80%
long [ e ]: Muyunga 0.88%, CPD 3.80%
diphthongs (Muyunga only): short 2.41%, long 2.59%
Total consonants: Muyunga 52.53%, CPD 53.90%
Total vocalic resonants: Muyunga 47.47%, CPD 46.11%

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs into his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% − 1.04%) + (2.52% − 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ], and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened, and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet, for this part of speech Muyunga (1979: 150–152) consistently (and wrongly) writes short vocalic resonants.
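The diphthong argument rests on two small sums. Reading the figures as percentages of all phones (Muyunga: 1.04% [ w ], 1.36% [ j ], 2.41% short and 2.59% long diphthongs; CPD: 5.21% [ w ], 2.52% [ j ]), they can be re-checked with the Python sketch below. The dictionary keys are labels of our own choosing.

```python
# Percentages as read from the running text (Muyunga 1979 versus CPD).
muyunga = {"w": 1.04, "j": 1.36, "short diphthongs": 2.41, "long diphthongs": 2.59}
cpd = {"w": 5.21, "j": 2.52}

# Diphthongs counted by Muyunga.
diphthongs_muyunga = muyunga["short diphthongs"] + muyunga["long diphthongs"]

# Additional [ w ] and [ j ] found in CPD relative to Muyunga.
surplus_cpd = (cpd["w"] - muyunga["w"]) + (cpd["j"] - muyunga["j"])

print(f"Muyunga diphthongs: {diphthongs_muyunga:.2f}%")
print(f"CPD surplus of [w] and [j]: {surplus_cpd:.2f}%")
```

The two totals, 5.00% and 5.33%, differ by only a third of a percentage point, which is what the text means by ‘roughly correspond’.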

As far as the phones in brackets in Table 6 are concerned, we can note that besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
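For readers who wish to repeat this kind of comparison on their own frequency counts, a correlation coefficient can be computed as sketched below in Python. We assume here that r is the ordinary Pearson product-moment coefficient, which the text does not state explicitly; the function name `pearson_r` and the toy percentages are illustrative only and are not taken from Table 6.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Toy phone percentages for two hypothetical studies (not the Table 6 data).
study_a = [1.0, 2.0, 3.0, 4.0]
study_b = [2.0, 4.0, 6.0, 8.0]
print(round(pearson_r(study_a, study_b), 2))
```

Each pair of values stands for the percentage of one phone in the two studies; phones excluded from the comparison, such as [ w ], [ j ] and the vocalic resonants above, are simply left out of both lists.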

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that: ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]’ (1998b: 57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus, phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to both the researchers and the informant. This very particle was recorded very often, and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i̥ ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words, representing the most frequent section of a language’s lexicon, is sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence, one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence, Cilubà here does not follow the general pattern seen in the world’s languages.
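The comparison with UPSID can be retraced with a few lines of Python. The sketch below assumes that the Cilubà values are the percentages plotted in Figure 1 and that the UPSID values are the percentages listed above; the helper `top_places` is our own, and the assignment of the two smallest Cilubà values to palato-alveolar and palatal is an assumption that does not affect the result.

```python
# Stops by place of articulation, in percent.
ciluba = {
    "bilabial": 38.69,
    "alveolar": 38.22,
    "palato-alveolar": 2.55,  # assumed assignment of the two small values
    "palatal": 1.6,
    "velar": 18.95,
}
upsid = {
    "labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
    "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
    "velar": 19.63, "uvular": 2.01, "glottal": 3.46,
}


def top_places(distribution: dict[str, float], n: int = 3) -> list[str]:
    """Return the n most frequent places of articulation, most frequent first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [place for place, _ in ranked[:n]]


forward_share = ciluba["bilabial"] + ciluba["alveolar"]
print(top_places(ciluba), top_places(upsid), round(forward_share, 2))
```

In both data sets the three most frequent places are a labial, an alveolar and the velar place, which is the sense in which the Cilubà distribution broadly follows the general pattern.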

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949: xi) devotes just one sentence to the observation that there is no phonological opposition between o and ɔ, and e and ɛ in Cilubà. Kabuta (1998a: 14) devotes only one short obscure rule in which he argues that e is pronounced [ ɛ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ɔ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979: 49–51).

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà [bar chart ‘Places of articulation in stops’, percentage axis 0–50: bilabial 38.69, alveolar 38.22, palato-alveolar 2.55, palatal 1.6, velar 18.95]

His study brings him to the conclusion that: ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979: 49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach with a quantity level, a tonality level, and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ ɦ ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ ɦ ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ ɦ ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ ǀ ], the voiceless dental click. Just as [ ɦ ] (made on a pulmonic ingressive airstream), [ ǀ ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1                        [ ] CV3                           [ ] CV8
[ ] CV2                        [ ] CV4, somewhat retracted       [ ] CV7, somewhat lowered
[ ] CV2, somewhat lowered      ([ ə ]) (IPA symbol for schwa)    [ ] CV6, somewhat raised

Table 7: Vocalic resonants attested in CPD

Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà [3-D bar chart ‘Tones for the vocalic resonant [ e ]’; axes: quantity (short, long), tone (low, high, falling), frequency (0–50%); plotted percentages in extraction order: 0, 7.41, 0, 25.93, 44.44, 22.22]

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà [3-D bar chart ‘Tones for the vocalic resonant [ a ]’; axes: quantity (short, long), tone (low, high, falling), frequency (0–50%); plotted percentages in extraction order: 26.39, 45.14, 0, 9.03, 18.06, 1.39]

Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985: 93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991: 140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985: 93) and Louwrens (1991: 143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]. (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa?’ (Prinsloo 1985: 91) [‘Certain question particles occur only in the sentence-initial position.’]

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991: 140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge. Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991: 144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empir-ical basis for a challenging semanticpragmaticanalysis in terms of the theoretical assumptions(and rigid distinction between na and afa espe-cially) made in Prinsloo (1985) and Louwrens(1991) As far as the relation between thisempirical basis and the theoretical assumptionsis concerned one would be well-advised totake heed of Calzolarirsquos suggestion

lsquoIn fact corpus data cannot be used ina simplistic way In order to becomeusable they must be analysed accord-ing to some theoretical hypothesisthat would model and structure whatwould be otherwise an unstructured

set of data The best mixture of theempirical and theoretical approachesis the one in which the theoreticalhypothesis is itself emerging from andis guided by successive analyses ofthe data and is cyclically refined andadjusted to textual evidencersquo(Calzolari 19969)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1  tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2  16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3  ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4  molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5  ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6  mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7  a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8  hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9  ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta
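A concordance query of the kind that produced the na afa lines in Table 8 can be sketched as follows. This is only a minimal illustration and not the software actually used to query PSC: the function name `kwic`, the context width and the short sample string (assembled from two lines of Table 8) are assumptions made for the example.

```python
import re


def kwic(text, phrase, width=40):
    """Return keyword-in-context lines for every occurrence of `phrase`.

    Hypothetical helper: matches the words of `phrase` as whole words,
    case-insensitively, separated by any amount of whitespace.
    """
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(
        r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)", re.IGNORECASE
    )
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the node forms a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Sample built from concordance lines 2 and 10 of Table 8.
sample = (
    "Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa "
    "o tla di thabela Na afa le a di kwa Ruri re tlo inama"
)

for line in kwic(sample, "na afa"):
    print(line)

# The reverse order is not attested in this sample.
print(len(kwic(sample, "afa na")))
```

Querying both orders in this way is what allows one to state that a sequence occurs in one order only; on a real corpus the text would of course be read from file rather than from a string.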

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’ (and hence to ‘acquaint themselves with’) the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language, by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169).

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68).

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler, and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
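Isolating a core vocabulary from frequency counts can be illustrated with a few lines of code. The sketch below is an assumption-laden illustration rather than the procedure followed for the Sepedi textbook: the function name `core_vocabulary`, the simple letter-based tokenisation and the toy input (three sentences taken from Table 9) are all chosen for the example only.

```python
import re
from collections import Counter


def core_vocabulary(text, size=1000):
    """Return the `size` most frequent word-forms with their counts.

    Hypothetical helper: tokens are runs of letters (including letters
    with diacritics), optionally joined by hyphens or apostrophes.
    """
    tokens = re.findall(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*", text.lower())
    return Counter(tokens).most_common(size)


# Toy 'corpus' made up of three example sentences from Table 9.
toy_corpus = "Re bona tau. Re thuša bona. Tla mo."

for word, count in core_vocabulary(toy_corpus, size=2):
    print(word, count)
```

On a real corpus the `size` parameter would be set to the intended size of the core vocabulary, for instance 1 000, and the resulting list would still need to be checked by the course compiler.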

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126).

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1 the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
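Candidate lines such as those in Tables 10 and 11 could be retrieved by scanning a tokenised corpus for runs of the same word-form. The following is a minimal sketch under the assumption that the corpus is available as a list of tokens; the function name `repeated_runs` and the threshold are hypothetical, and the sample tokens come from corpus line 8 of Table 10.

```python
from itertools import groupby


def repeated_runs(tokens, target, min_len=3):
    """Yield (start_index, run_length) for every run of `target`
    that is at least `min_len` tokens long (case-insensitive)."""
    position = 0
    for form, group in groupby(tokens, key=str.lower):
        length = len(list(group))
        if form == target and length >= min_len:
            yield position, length
        position += length


# Tokens from corpus line 8 of Table 10.
tokens = "go botša le lebe le ge e le le le botse Rebeka".split()

for start, length in repeated_runs(tokens, "le"):
    print(start, length, " ".join(tokens[start:start + length]))
```

Such a query only locates the sequences; assigning a grammatical function to each le or ba remains the analytical task for teacher and learner.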

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169).

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M: gore ‘that, so that’
M: Ke nyaka gore o nthuše ‘I want you to help me’ S:
M: bona ‘see’
M: Re bona tau ‘We see a lion’ S:
M: bona ‘they/them’
M: Re thuša bona ‘We help them’ S:
M: bego ‘which was busy’
M: Batho ba ba bego ba reka ‘The people who were busy buying’ S:
M: tla ‘come, shall/will’
M: Tla mo ‘Come here’ S:


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1  Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa | le le le | bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8  Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e | le le le | botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa | le le le le | le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto | le le le | swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu | le le le | golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo | le le le | bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana | le le le | hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22  meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi | ba ba ba | badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile | ba ba ba | tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la | ba ba ba | amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi | ba ba ba ba | bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se | ba ba ba | hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma | ba ba ba | mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo | ba ba Ba | bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121).

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1  seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2  thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3  a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4  o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5  a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6  a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7  lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8  ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9  tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots, and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)
Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)
Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200 Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala) Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
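The list-comparison approach and the success rates quoted above can be sketched in a few lines. This is not the WordPerfect implementation, which is not described here; the function name `check_spelling` and the tiny word list are hypothetical, and only the counts 12 out of 41 and 2 out of 46 are taken from the tests reported above.

```python
def check_spelling(tokens, word_list):
    """Compare tokens with a stored list of word-forms.

    Returns the unrecognised tokens and the success rate in per cent.
    Hypothetical helper: matching is case-insensitive and exact.
    """
    if not tokens:
        raise ValueError("no tokens to check")
    known = {w.lower() for w in word_list}
    unrecognised = [t for t in tokens if t.lower() not in known]
    rate = 100 * (len(tokens) - len(unrecognised)) / len(tokens)
    return unrecognised, rate


# Toy illustration: an invented mini word list, not the real stored list.
stored = ["tša", "mogala", "di", "a", "hwetšagala"]
tokens = "Dikarata tša mogala di a hwetšagala".split()
missing, rate = check_spelling(tokens, stored)
print(missing, round(rate, 1))

# Success rates recomputed from the counts reported for (8) and (9).
print(round(100 * (41 - 12) / 41))  # isiZulu
print(round(100 * (46 - 2) / 46))   # Sepedi
```

Because the comparison is purely by word-form, any correct form missing from the stored list is flagged, which is why longer conjunctively written forms fare worse under this design.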

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides, and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme, except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³:7), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


number of striking differences, such as the presence of the voiced alveolar trilled stop [ r ] and the high number of vocalic resonants. The voiced alveolar trilled stop [ r ], for instance, is a phone not to be found in genuine ‘standard Cilubà’, so phoneticians have tended to overlook its importance. From a frequency point of view, however, it is clear that this phone rightfully deserves its place on phonetic charts of Cilubà. Yet, in order for such and similar claims to be valid, one must be sure that there is a good correlation between the overall distribution of the phones mentioned in the literature and those in CPD.

Comparison between phone frequen-cies in the literature and those inCPDOn a first level a comparison can be made

between the phone frequencies found inMuyunga and those derived from CPD Theresults of this comparison are summarised inTable 6

Table 5: Cilubà phone inventory derived from the Cilubà Phonetic Database (CPD) [phone chart, symbols not reproduced. Consonants by place of articulation (bilabial, labiodental, alveolar, palato-alveolar, palatal, velar) and by manner (oral stop, nasal stop, trilled stop, fricative, resonant, lateral resonant); vocalic resonants by position (front, central, back) and by height (close, close-mid, open-mid, open); other symbols: voiced labial-velar consonantal resonant, voiceless palato-alveolar affricate; phones in brackets are peripheral.]

Table 6: Cilubà phone frequencies in Muyunga (1979:58, 62–63) compared to those in CPD [values in %, in printed order; phone symbols and cell alignment not reproduced]

CONSONANTS VOCALIC RESONANTS
Muyunga-% CPD-% symbol Muyunga-% CPD-%
0.31 0.35 8.47
6.46 6.61 ( ) 0.81
9.28 7.20
3.77 2.22
0.39
1.17
5.29 5.15
4.03
5.25 5.68
( ) 0.80
4.83 3.86
7.28 7.25
0.88
3.80
7.51 6.55 11.54
BB 0.63 0.59 ( ) 1.36
12.90 12.05
2.34 1.29
0.48
4.80
( ) C 0.12 10.62
BB 1.24 0.88 ( ) 0.86
11.48 9.13
0.43 0.23
0.22
1.11
0.81 1.23 1.46
1.38 1.40 ( ) 0.46
1.92 2.05
0.48 0.70
0.09
0.82
0.98 0.82 ( )
C
0.06
0.73 0.53 ( )
C
0.06
1.36 2.52 DIPHTHONGS
4.48 3.63 length
Muyunga-%
CPD-%
1.04 5.21 short
2.41
C
0.76 0.94 long
2.59
C
Total 52.53
53.90 Total
47.47
46.11

From Table 6 it is clear that there is excellent agreement between the two frequency studies. There are only a few discrepancies, and even these can be explained. As far as the consonants are concerned, there is just one phone for which there is a big difference between the studies, namely the voiced labial-velar consonantal resonant [ w ], for which Muyunga shows 1.04% while CPD has 5.21%. To a much smaller extent, an analogous difference can be observed for the voiced palatal consonantal resonant [ j ], for which Muyunga shows 1.36% while CPD has 2.52%. The reason for this is obvious once one realises that Muyunga includes diphthongs in his frequency counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% – 1.04%) + (2.52% – 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ] and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet, for this part of speech, Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.
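The reconciliation of the diphthong counts with the surplus of [ w ] and [ j ] can be retraced as a short calculation. The sketch below only uses the percentages quoted in the text; the variable names are illustrative and not part of the original study.

```python
# Percentages as reported in the text for Table 6 (Muyunga vs. CPD).
muyunga = {"w": 1.04, "j": 1.36, "short_diphthongs": 2.41, "long_diphthongs": 2.59}
cpd = {"w": 5.21, "j": 2.52}

# Share that Muyunga books as diphthongs ([ua], [ue], [ia], ...).
diphthong_share = muyunga["short_diphthongs"] + muyunga["long_diphthongs"]

# Extra [w] and [j] that CPD records because it transcribes the same
# sequences as consonantal resonant + vocalic resonant ([wa], [we], [ja], ...).
glide_surplus = (cpd["w"] - muyunga["w"]) + (cpd["j"] - muyunga["j"])

print(f"diphthongs in Muyunga: {diphthong_share:.2f}%")
print(f"extra [w] and [j] in CPD: {glide_surplus:.2f}%")
print(f"remaining gap: {glide_surplus - diphthong_share:.2f} percentage points")
```

The remaining gap of about a third of a percentage point is what the text means by ‘roughly correspond’.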

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
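For readers who want to retrace this kind of comparison, a minimal sketch of the Pearson correlation coefficient follows. We assume here that r denotes the Pearson coefficient, which the text does not state explicitly. The two frequency lists are hypothetical and only illustrate the calculation; they are not the Table 6 values, and the function name is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either list is constant.
    return cov / sqrt(var_x * var_y)

# Hypothetical percentage frequencies for five phones in two studies.
study_a = [9.3, 6.5, 5.3, 3.8, 0.4]
study_b = [7.2, 6.6, 5.2, 2.2, 0.3]
print(round(pearson_r(study_a, study_b), 2))
```

A value close to 1 indicates that phones that are frequent in one study are also frequent in the other, which is the sense in which the two distributions ‘correspond’.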

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.
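One way to put the two tone counts on the same footing is sketched below. Kabuta's count distinguishes only high and low tones, so the sketch rescales the CPD shares for high and low to a two-way split. This rescaling is our own assumption for illustration, not a step reported in the study.

```python
# Tone shares (%) in CPD as reported in the text.
cpd_tones = {"high": 61.04, "low": 35.28, "falling": 3.30,
             "rising": 0.13, "middle": 0.13, "voiceless": 0.13}

# Kabuta's count only distinguishes high and low, so rescale CPD's
# high and low shares to a two-way split before comparing.
two_way_total = cpd_tones["high"] + cpd_tones["low"]
cpd_high = 100 * cpd_tones["high"] / two_way_total
cpd_low = 100 * cpd_tones["low"] / two_way_total

kabuta = {"high": 62.0, "low": 38.0}
print(f"CPD (H/L only): {cpd_high:.2f}% H vs {cpd_low:.2f}% L")
print(f"difference from Kabuta's H share: {cpd_high - kabuta['high']:.2f} points")
```

On either reading, the two studies differ by less than two percentage points for the high tones.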

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to the researchers as well as to the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words, representing the most frequent section of a language’s lexicon, is sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence, one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.
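The UPSID figures quoted above can be checked and ranked with a few lines of code. The sketch below uses only the percentages given in the text; the dictionary and variable names are our own.

```python
# UPSID shares (%) of places of articulation in stops, as quoted in the text.
upsid_places = {
    "labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
    "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
    "velar": 19.63, "uvular": 2.01, "glottal": 3.46,
}

# Sanity check: the quoted shares should add up to about 100%.
total = sum(upsid_places.values())

# Rank the places from most to least frequent.
ranked = sorted(upsid_places.items(), key=lambda item: item[1], reverse=True)

# Combined share of the two front places that also dominate in Cilubà.
front_share = upsid_places["labial"] + upsid_places["dental-alveolar"]

for place, share in ranked[:3]:
    print(f"{place:16s}{share:6.2f}%")
print(f"total {total:.2f}%, labial + dental-alveolar {front_share:.2f}%")
```

In UPSID, as in Cilubà, the two front places rank first and second and the velar place third, although the front places are less dominant worldwide than the roughly four fifths reported for Cilubà.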

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence, Cilubà here does not follow the general pattern seen in the world’s languages.

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD (cf. Table 7).

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and ɔ and e and ɛ in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51).

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà [horizontal bar chart, axis 0–50%; values in printed order: velar 18.95%, palatal 1.6%, palato-alveolar 2.55%, alveolar 38.22%, bilabial 38.69%]

His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ ɦ ], a breathy voiced glottal fricative, pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ ɦ ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ ɦ ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ ɦ ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’ approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1 [ ] CV3 [ ] CV8

[ ] CV2 [ ] CV4 somewhat retracted [ ] CV7 somewhat lowered

[ ] CV2 somewhat lowered ([ ]) (IPA symbol for schwa) [ ] CV6 somewhat raised

Table 7 Vocalic resonants attested in CPD

Figure 2: Three-dimensional approach to the vocalic resonant [ e ] for Cilubà [bar chart of tones for the vocalic resonant [ e ]: low, high and falling tones, each for short and long quantity; frequency axis 0–50%; values in printed order: 0, 7.41, 0, 25.93, 44.44, 22.22]

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà [bar chart of tones for the vocalic resonant [ a ]: low, high and falling tones, each for short and long quantity; frequency axis 0–50%; values in printed order: 26.39, 45.14, 0, 9.03, 18.06, 1.39]

Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op] [English: Certain question particles occur only in the initial sentence position]
(3ii) O tšwa ka gae ge o etla fa kagore o šetše o fela pelo afa’ (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge Naa e ka ba kgomolekokoto ye e mo hlotšeng afa E ka ba ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.
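The two corpus checks discussed here, afa together with a question word and afa in sentence-final position, can be sketched as simple queries. The code below is a minimal illustration on a toy corpus built from examples quoted in this article. It is not the query tool used on PSC, and the function names and the short list of question words are our own assumptions.

```python
import re

# Assumed subset of question words; forms in -fe would need a suffix match.
QUESTION_WORDS = {"mang", "kae"}

def tokens(sentence):
    """Lower-cased word tokens, punctuation stripped."""
    return re.findall(r"\w+", sentence.lower())

def afa_with_question_word(sentence):
    """True if afa and one of the listed question words co-occur."""
    words = tokens(sentence)
    return "afa" in words and any(w in QUESTION_WORDS for w in words)

def afa_sentence_final(sentence):
    """True if afa is the last word of the sentence."""
    words = tokens(sentence)
    return bool(words) and words[-1] == "afa"

# Toy corpus built from examples quoted in the article.
corpus = [
    "Na o tseba go beša nama?",
    "Afa o tseba go beša nama?",
    "Naa e ka ba kgomolekokoto ye e mo hlotšeng afa?",
]
for sentence in corpus:
    print(afa_with_question_word(sentence), afa_sentence_final(sentence), sentence)
```

On a real corpus the same two predicates would be applied to every sentence, so that claim (b) is supported when the first never returns true, and claim (c) is refuted as soon as the second returns true once.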

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, as they are ‘hidden’ in 4 million words of running text.
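Concordance lines such as those in Table 8 are produced by a keyword-in-context (KWIC) search. A minimal sketch of such a search is given below. It is an illustration only, not the concordancer used for PSC; the function name, the bracket notation for the node and the toy text (assembled from two lines of Table 8) are our own choices.

```python
def kwic(tokens, phrase, width=5):
    """Return keyword-in-context lines for every occurrence of `phrase`.

    `tokens` is a list of words, `phrase` a list of words to match
    (case-insensitively), `width` the number of context words per side.
    """
    target = [w.lower() for w in phrase]
    lowered = [w.lower() for w in tokens]
    lines = []
    for i in range(len(tokens) - len(target) + 1):
        if lowered[i:i + len(target)] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + len(target)])
            right = " ".join(tokens[i + len(target):i + len(target) + width])
            lines.append(f"{left} [{node}] {right}".strip())
    return lines

# Toy text assembled from two of the concordance lines in Table 8.
text = ("Ruri re paletšwe Na afa Kgoteledi o tla be a gomela gae "
        "Hei thaka na afa yola morwedi wa Lenkwang").split()
for line in kwic(text, ["na", "afa"], width=3):
    print(line)
```

Searching for the reverse order, `kwic(text, ["afa", "na"])`, returns an empty list for this toy text, which mirrors the observation that only the order na afa is attested.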

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’ (and hence to ‘acquaint themselves with’) the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
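Isolating a core vocabulary from frequency counts amounts to counting word forms and keeping the top of the list. The sketch below shows the principle on a toy text made up of the sentences in Table 9. A real application would run on the full corpus, and the function name is our own.

```python
import re
from collections import Counter

def core_vocabulary(text, size):
    """Return the `size` most frequent word forms in `text` with counts."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(size)

# Toy corpus assembled from the sentences in Table 9.
sample = ("Ke nyaka gore o nthuše. Re bona tau. Re thuša bona. "
          "Batho ba ba bego ba reka. Tla mo.")
for word, count in core_vocabulary(sample, 3):
    print(f"{word}\t{count}")
```

Note that such a count works on orthographic word forms, so homographs such as bona ‘see’ and bona ‘they/them’ are counted together unless the corpus is annotated.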

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s, or up to four ba’s, are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
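Sequences such as the runs of le and ba in Tables 10 and 11 can be retrieved with a pattern search for repeated word forms. The sketch below is one possible way of doing so with a regular expression. It is an assumption on our part and not the query actually run on PSC; the sample strings are taken from lines 13 and 127 of the two tables.

```python
import re

def runs_of(word, text, min_run=3):
    """Find runs of at least `min_run` consecutive occurrences of `word`.

    Returns (run_length, start_offset) pairs, one per run.
    """
    pattern = re.compile(
        rf"\b{re.escape(word)}(?:\s+{re.escape(word)}){{{min_run - 1},}}\b",
        flags=re.IGNORECASE,
    )
    return [(len(m.group().split()), m.start()) for m in pattern.finditer(text)]

# Fragments of Table 10, line 13 and Table 11, line 127.
le_sample = "se bego se le mokgahlo ga lapa le le le le le latelago"
ba_sample = "Ba ile ba ba ba tlogela le tšona dijo"

print(runs_of("le", le_sample))
print(runs_of("ba", ba_sample))
```

The word boundaries in the pattern keep forms such as lebe or bao from being counted as part of a run, so that only free-standing le’s and ba’s are retrieved.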

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently, in using a corpus certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S
M bona ‘see’
M Re bona tau ‘We see a lion’ S
M bona ‘they/them’
M Re thuša bona ‘We help them’ S
M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S
M tla ‘come, shall/will’
M Tla mo ‘Come here’ S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa [le le le] bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e [le le le] botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa [le le le le] le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto [le le le] swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu [le le le] golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo [le le le] bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana [le le le] hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi [ba ba ba] badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile [ba ba ba] tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la [ba ba ba] amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi [ba ba ba ba] bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se [ba ba ba] hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma [ba ba ba] mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo [ba ba Ba] bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
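The relative frequency of the two locative strategies for a given noun can be estimated by counting both forms in the corpus. The sketch below illustrates this on a toy text: two fragments are taken from Table 12, while the last sentence with go monna is constructed for the purpose of the example. The function name and the simple string matching are our own assumptions.

```python
import re

def locative_counts(text, noun, suffixed):
    """Count 'go <noun>' (prefixal) versus '<suffixed>' (suffixal) locatives."""
    lowered = text.lower()
    prefixal = len(re.findall(rf"\bgo\s+{re.escape(noun)}\b", lowered))
    suffixal = len(re.findall(rf"\b{re.escape(suffixed)}\b", lowered))
    return {"prefixal": prefixal, "suffixal": suffixal}

# Toy text: two fragments from Table 12 plus one constructed 'go monna' phrase.
sample = ("a tseba gore o fihlile monneng wa banna. "
          "nka go iša monneng e mongwe wa gešo. "
          "Ke ya go monna.")
print(locative_counts(sample, "monna", "monneng"))
```

Such counts only show how often each form occurs. The semantic difference between the two strategies still has to be established by reading the retrieved lines in context, as is done for Tables 12 and 13.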

Louwrensrsquo semantic distinction betweenthese strategies goes a long way in pointing outthe difference However once again carefulanalysis of corpus data reveals semantic con-notations other than those described byresearchers who solely rely on introspectionand informants So for example the meaning

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.
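Corpus lines such as those in Tables 12 and 13 are keyword-in-context (KWIC) concordance lines. The following is a minimal sketch of how such lines could be retrieved, assuming whitespace-tokenised text; the function name `kwic` and the `width` parameter are illustrative choices and do not describe the software actually used to query PSC.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens kept on each side of the keyword.
    Matching is case-insensitive and exact (no morphological analysis).
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Sample text: corpus line 2 of Table 12, split on whitespace.
sample = ("a napa a ineela a tseba gore o fihlile monneng "
          "wa banna yo a tlago mo khutšiša maima a").split()

for left, kw, right in kwic(sample, "monneng"):
    # Right-align the left context so that the keyword forms a column.
    print(f"{left:>30}  {kw}  {right}")
```

Sorting the resulting lines on the first word to the right or left of the keyword is a common next step when looking for recurrent patterns.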

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9 within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
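The word-list approach can be sketched in a few lines. The example below is only an illustration under stated assumptions: the helper names `build_wordlist` and `unrecognised` are hypothetical, the tokeniser is a simple regular expression, and the toy corpus stands in for a real corpus; it does not describe how the WordPerfect 9 spellcheckers are implemented.

```python
import re
from collections import Counter


def build_wordlist(corpus_text, top_n):
    """Return the set of the `top_n` most frequent word-forms in a corpus."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    return {word for word, _ in Counter(tokens).most_common(top_n)}


def unrecognised(text, wordlist):
    """Return the typed word-forms that do not occur in the stored list."""
    return [w for w in re.findall(r"\w+", text) if w.lower() not in wordlist]


# Toy corpus (assumption, for illustration only): 'ka' x3, 'go' x2, 'ya' x1.
toy_corpus = "ka go ka go ka ya"
wordlist = build_wordlist(toy_corpus, top_n=2)   # keeps 'ka' and 'go'

print(unrecognised("Ka go ya", wordlist))        # 'ya' falls outside the top 2
```

Because the check is a plain set lookup, a long conjunctively written word-form is recognised only if that exact form is in the list, which is consistent with the lower effectiveness noted above for isiXhosa and isiZulu.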

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.
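The two success rates follow directly from the counts reported for (8) and (9). As a small worked check (the function name `success_rate` is an illustrative choice):

```python
def success_rate(total_words, unrecognised_words):
    """Percentage of word-forms that the spellchecker recognised."""
    return 100 * (total_words - unrecognised_words) / total_words


# Counts as reported in the text for paragraphs (8) and (9).
print(round(success_rate(41, 12)))  # isiZulu: 29 of 41 recognised -> 71
print(round(success_rate(46, 2)))   # Sepedi:  44 of 46 recognised -> 96
```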


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have at present become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research - Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme, except ɸ and h, which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37)), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. sd⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


Table 6: Cilubà phone frequencies in Muyunga (1979:58, 62–63) compared to those in CPD

CONSONANTS VOCALIC RESONANTS

Muyunga- CPD- symbol Muyunga- CPD-

031 035 847

646 661 ( ) 081

928 720

377 222

039

117

529 515

403

525 568

( ) 080

483 386

728 725

088

380

751 655 1154

BB 063 059 ( ) 136

1290 1205

234 129

048

480

( ) C 012 1062

BB 124 088 ( ) 086

1148 913

043 023

022

111

081 123 146

138 140 ( ) 046

192 205

048 070

009

082

098 082 ( )

C

006

073 053 ( )

C

006

136 252 DIPHTHONGS

448 363 length

Muyunga-

CPD-

104 521 short

241

C

076 094 long

259

C

Total 5253

5390 Total

4747

4611


cy counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 5.00% (= 2.41% + 2.59%) diphthongs Muyunga counts roughly correspond to the 5.33% (= (5.21% - 1.04%) + (2.52% - 1.36%)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ] and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ a ] (0.48% versus 4.80%) and [ e ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ a ] and [ e ] come from demonstratives. Yet for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a ‘corpus-based phonetics from below’-approach.
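For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of a correlation coefficient between two lists of phone percentages. It assumes that r is the Pearson product-moment coefficient, which the text does not state explicitly; the two lists are invented placeholder values, not the figures from Table 6.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a list with zero variance would make the denominator zero.
    return covariance / (spread_x * spread_y)


# Hypothetical percentages for five phones in two studies (placeholders only).
study_a = [9.0, 6.5, 5.0, 2.0, 0.5]
study_b = [7.5, 6.5, 5.5, 1.5, 1.0]

print(f"r = {pearson_r(study_a, study_b):.2f}")
```

Each position in the two lists must refer to the same phone, so phones counted differently in the two studies (here the vocalic resonants, [ w ], [ j ] and the diphthongs) have to be left out before the lists are built.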

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs. 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a ‘corpus-based phonetics from below’-approach.

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to both the researchers and the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced (as, for one reason or another, it was thought that this was the way it had to be pronounced), but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words representing the most frequent section of a language’s lexicon are sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence Cilubà here does not follow the general pattern seen in the world’s languages.

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ] and e and [ ] in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule, in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51).

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà (bar chart, scale 0–50%: bilabial 38.69%, alveolar 38.22%, palato-alveolar 2.55%, palatal 1.6%, velar 18.95%)


His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just like [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1 [ ] CV3 [ ] CV8

[ ] CV2 [ ] CV4 somewhat retracted [ ] CV7 somewhat lowered

[ ] CV2 somewhat lowered ([ ]) (IPA symbol for schwa) [ ] CV6 somewhat raised

Table 7 Vocalic resonants attested in CPD


Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà (bar chart titled ‘Tones for the vocalic resonant [ e ]’; quantity: short, long; tonality: low, high, falling; frequency scale 0–50%; bar values in extraction order: 0, 7.41, 0, 25.93, 44.44, 22.22)

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (bar chart titled ‘Tones for the vocalic resonant [ a ]’; quantity: short, long; tonality: low, high, falling; frequency scale 0–50%; bar values in extraction order: 26.39, 45.14, 0, 9.03, 18.06, 1.39)


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3) – (4) and (5) – (6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3) – (6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op] ... (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ (Prinsloo 1985:91) [Certain question particles occur only in the sentence-initial position.]

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position.

(7) Mokgalabje wa mereba ge Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally contrary to claim (a) numerousexamples are found in the corpus of na in com-bination with afa but only in the order na afaand not vice versa At least Louwrens in princi-pal suggests that lsquo[n]a and afa may in certain

Prinsloo and de Schryver Corpus applications for the African languages124

instances co-occur in the same questionrsquo(1991144) yet the only example he givesshows the co-occurrence of a and naaLouwrens gives no actual examples of naoccurring with afa especially not when theseparticles follow one another directly Table 8lists the concordance lines culled from PSC forthe sequence of question particles na afa

The lines listed in Table 8 provide the empir-ical basis for a challenging semanticpragmaticanalysis in terms of the theoretical assumptions(and rigid distinction between na and afa espe-cially) made in Prinsloo (1985) and Louwrens(1991) As far as the relation between thisempirical basis and the theoretical assumptionsis concerned one would be well-advised totake heed of Calzolarirsquos suggestion

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence.’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1  tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2  16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3  ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4  molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5  ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6  mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7  a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8  hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9  ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10  lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11  ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12  boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13  Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14  le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15  iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16  a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17  le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18  yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19  se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20  tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21  mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22  ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23  le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24  be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.
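Isolating a core vocabulary from frequency counts can be sketched as follows. This is only an illustration under stated assumptions: the names `core_vocabulary` and `coverage` are hypothetical, the token list is invented rather than drawn from PSC, and a real course compiler would start from a properly tokenised corpus.

```python
from collections import Counter


def core_vocabulary(tokens, size):
    """Return the `size` most frequent word-forms, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(size)]


def coverage(tokens, vocabulary):
    """Share of running words in `tokens` that belong to `vocabulary`."""
    vocab = set(vocabulary)
    return sum(1 for t in tokens if t.lower() in vocab) / len(tokens)


# Invented toy token list (not real corpus data).
toy_tokens = ["gore", "bona", "tla", "gore", "bona", "gore"]

core = core_vocabulary(toy_tokens, 2)
print(core)                        # the two most frequent word-forms
print(coverage(toy_tokens, core))  # fraction of running words they cover
```

The `coverage` figure gives a rough indication of how much running text a learner can recognise once the selected core vocabulary has been mastered.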

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples taken from a variety of written and oral sources are used, rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
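Retrieving sequences such as several le’s or ba’s in a row amounts to searching a token stream for runs of the same word-form. The sketch below shows one way this could be done; the function name `repeated_runs` and its parameters are assumptions for illustration, and the two token fragments are taken from corpus line 1 of Table 10 and corpus line 259 of Table 11.

```python
def repeated_runs(tokens, word, min_len=2):
    """Return (start_index, run_length) for each maximal run of `word`
    that is at least `min_len` tokens long (case-insensitive)."""
    word = word.lower()
    runs = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() == word:
            j = i
            # Extend the run for as long as the same word-form repeats.
            while j < len(tokens) and tokens[j].lower() == word:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


# Fragments taken from Table 10 (line 1) and Table 11 (line 259).
le_line = "gwa agiwa le le le bitšwago Donsa Lona le ile la".split()
ba_line = "ka tše pedi ba ba ba ba bea fase".split()

print(repeated_runs(le_line, "le", min_len=3))
print(repeated_runs(ba_line, "ba", min_len=2))
```

Such a query only locates the sequences; assigning a grammatical function to each le or ba remains the analytical task for teacher and learner.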

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M  gore ‘that, so that’
M  Ke nyaka gore o nthuše ‘I want you to help me’  S
M  bona ‘see’
M  Re bona tau ‘We see a lion’  S
M  bona ‘they/them’
M  Re thuša bona ‘We help them’  S
M  bego ‘which was busy’
M  Batho ba ba bego ba reka ‘The people who were busy buying’  S
M  tla ‘come, shall/will’
M  Tla mo ‘Come here’  S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1  Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa [le le le] bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8  Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e [le le le] botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13  go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa [le le le le] le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16  go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto [le le le] swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18  yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu [le le le] golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29  ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo [le le le] bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32  seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana [le le le] hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22  meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi [ba ba ba] badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127  mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile [ba ba ba] tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185  ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la [ba ba ba] amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259  ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi [ba ba ba ba] bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272  tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se [ba ba ba] hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312  itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma [ba ba ba] mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317  etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo [ba ba Ba] bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
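The relative frequency of the two locative strategies can be estimated with a simple count. The sketch below is a rough illustration only: the function name `locative_counts` is hypothetical, the suffixed form is supplied explicitly rather than derived by rule, the toy token list is an artificial string of forms rather than real Sepedi text, and a bare count of the sequence go + noun will also pick up uses of go that are not locative.

```python
def locative_counts(tokens, noun, suffixed_form):
    """Count 'go <noun>' sequences versus occurrences of the suffixed
    locative form (case-insensitive)."""
    lowered = [t.lower() for t in tokens]
    prefixed = sum(
        1 for first, second in zip(lowered, lowered[1:])
        if first == "go" and second == noun
    )
    suffixed = lowered.count(suffixed_form)
    return {"go " + noun: prefixed, suffixed_form: suffixed}


# Artificial string of forms for illustration (not corpus data).
toy_tokens = "a ya mosading yo mongwe go mosadi go mosadi".split()

print(locative_counts(toy_tokens, "mosadi", "mosading"))
```

The counts only show which strategy is more frequent; the semantic difference between the two still has to be read off the concordance lines themselves.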

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1  e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2  a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3  gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4  bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5  gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6  ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7  botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8  ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9  a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1  seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2  thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3  a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4  o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5  a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6  a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7  lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8  ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9  tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10  dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11  ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12  Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13  lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka
14  le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15  ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots. On the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
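The word-list approach described above can be sketched in a few lines. This is a minimal illustration and in no way the WordPerfect implementation: the names `build_wordlist` and `unrecognised` are hypothetical, the toy corpus is invented, and the tokenisation rule (runs of letters, optionally hyphenated) is an assumption.

```python
import re
from collections import Counter


def build_wordlist(corpus_tokens, top_n):
    """Store the `top_n` most frequent word-forms of a corpus as a set."""
    counts = Counter(t.lower() for t in corpus_tokens)
    return {word for word, _ in counts.most_common(top_n)}


def unrecognised(text, wordlist):
    """Return the word-forms in `text` that are not in the stored list."""
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)
    return [w for w in words if w.lower() not in wordlist]


# Invented toy corpus; a real list would hold thousands of word-forms.
toy_corpus = "ke a go a ke a bona".split()
stored = build_wordlist(toy_corpus, top_n=2)

print(unrecognised("Ke a bona", stored))
```

Because every orthographic word is looked up as a whole, a language whose orthography packs many morphemes into one word needs a far larger stored list to reach the same recognition rate, which is consistent with the difference between conjunctively and disjunctively written languages noted above.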

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
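The two success rates follow directly from the counts reported for (8) and (9). As a worked check, assuming the rate is simply the share of recognised word-forms rounded to the nearest whole percentage (the function name `success_rate` is ours):

```python
def success_rate(total_words, unrecognised_words):
    """Percentage of word-forms that the spellchecker recognised."""
    return 100 * (total_words - unrecognised_words) / total_words


# Counts as reported in the text for examples (8) and (9).
print(round(success_rate(41, 12)))  # isiZulu paragraph
print(round(success_rate(46, 2)))   # Sepedi paragraph
```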

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³:7), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.

References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban. June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics, Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID_interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg. November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


cy counts. As a result, CPD’s [ wa ], [ we ] and [ ja ], for instance, are considered [ ua ], [ ue ] and [ ia ] respectively by Muyunga. The 500 (= 241 + 259) diphthongs Muyunga counts roughly correspond to the 533 (= (521 - 104) + (252 - 136)) more [ w ] and [ j ] in CPD. To be able to compare the vocalic resonants, CPD’s [ ] was added to [ e ] and CPD’s [ ] was added to [ o ]. Also, as Muyunga distinguishes between ‘short, environmentally lengthened and inherently long vowels’ (cf. Table 3), while CPD is based on ‘words in isolation’ (thus excluding environmentally lengthened vocalic resonants), Muyunga’s short and environmentally lengthened vocalic resonants had to be counted together in order to compare the two studies. As far as the short vocalic resonants are concerned, they agree rather well. For the long vocalic resonants, however, [ aː ] (0.48% versus 4.80%) and [ eː ] (0.88% versus 3.80%) seem too incongruous. Upon consulting our transcriptions, we noted that the majority of [ aː ] and [ eː ] come from demonstratives. Yet for this part of speech Muyunga (1979:150–152) consistently (and wrongly) writes short vocalic resonants.

As far as the phones in brackets in Table 6 are concerned, we can note that, besides being attested solely in CPD, they are extremely infrequent. They therefore do not distort the inventory.

In order to calculate the correlation coefficient r between the two frequency studies, it is clear from the foregoing that counts for vocalic resonants and for [ w ], [ j ] and diphthongs cannot be included. For the remaining phones one obtains a near-perfect correlation, as r = 0.98. On the whole, we must conclude that the proportional distribution of the phones in the small-scale CPD (1 709 phones) corresponds to the distribution found in Muyunga, which is as much as six times larger (10 726 phones). This clearly supports a corpus-based phonetics from below approach.
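For readers who wish to repeat this kind of comparison, the correlation coefficient between two sets of proportional frequencies can be computed as sketched below. We assume here that r is Pearson’s product-moment coefficient; the function name `pearson_r` is ours, and the two short lists are invented percentages used purely for illustration, not the CPD or Muyunga counts.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented percentages for five phones in two hypothetical studies.
study_a = [12.0, 8.5, 6.1, 3.2, 0.9]
study_b = [11.4, 9.0, 5.8, 3.5, 1.2]

print(round(pearson_r(study_a, study_b), 2))
```

Each position in the two lists must refer to the same phone, which is why phones that are counted differently in the two studies have to be left out before r is calculated.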

On a second level, the proportional occurrence of the different tones in vocalic resonants can also be considered. As far as number of words is concerned, the largest study was undertaken by Kabuta, as he transcribed one and a half hours of unscripted conversation and concluded that ‘[c]ounts carried out on a 90-minute ordinary conversation recorded on cassette revealed [...] that there are 62% of H [high tones] vs 38% L [low tones]’ (1998b:57). The detailed analysis stored in CPD attests 61.04% high and 35.28% low tones (together with 3.30% falling, 0.13% rising, 0.13% middle and 0.13% voiceless). The fact that the tonal dimension in just 350 top-frequency words corresponds extremely well with the tonal dimension in a one-and-a-half-hour-long natural conversation once more clearly supports a corpus-based phonetics from below approach.

Complementing existing phone inventories for Cilubà

Accepting the validity of a corpus-based approach instantly implies that one must also seriously consider the peripheral phenomena attested by means of such an approach. Thus, phones like the voiced alveolar trilled stop [ r ] and the vocalic resonant schwa [ ə ], hence phones that do not belong to genuine ‘standard Cilubà’, should nonetheless be mentioned on future phonetic charts of Cilubà, precisely because they too presently belong to the frequent phones of the language.

Surprisingly enough, one word, [ si ] (a particle used to confirm a statement, and for which ‘isn’t it?’ might be a close equivalent), contained a phone never mentioned in the literature so far. The fact that the vocalic resonant [ i ] showed up as voiceless in the particle [ si ] was really surprising to both the researchers and the informant. This very particle was recorded very often and in many different contexts. At times the informant even forced himself to make it voiced, as for one reason or another it was thought that this was the way it had to be pronounced, but in the end the informant was bound to conclude about the voiced attempt: “No. People do not speak like that.” The voiceless vocalic resonant [ i ] should therefore also be mentioned on future phonetic charts of Cilubà.

Framing Cilubà phonetics in a global perspective

Once one realises that a minimum number of words, representing the most frequent section of a language’s lexicon, are sufficient as a basis for a phonetic description, one can easily take existing research one step further and frame the results in a global perspective. The largest database for which systematic data is readily available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence Cilubà here does not follow the general pattern seen in the world’s languages.
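The size of such a departure from the cross-linguistic pattern can be made explicit as a difference in percentage points. The sketch below is only an illustration: it reads the UPSID voicing figures quoted above as 52.49% and 47.51%, takes "three quarters" as exactly 75%, and the function name `percentage_point_gaps` is hypothetical.

```python
def percentage_point_gaps(language, reference):
    """Per-category difference (language minus reference) in percentage points."""
    categories = sorted(set(language) | set(reference))
    return {
        category: round(language.get(category, 0.0) - reference.get(category, 0.0), 2)
        for category in categories
    }


# Shares of voiced and voiceless stops, in percent.
ciluba_voicing = {"voiced": 75.0, "voiceless": 25.0}
upsid_voicing = {"voiced": 52.49, "voiceless": 47.51}

gaps = percentage_point_gaps(ciluba_voicing, upsid_voicing)
for category, gap in gaps.items():
    print(f"{category}: {gap:+.2f} percentage points")
```

The same function could be applied to the place-of-articulation shares, provided the two category sets are first mapped onto each other; that mapping is a linguistic decision the code does not make.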

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession only three authors mention the existence of more than five vocalic resonants. Stappers (1949:xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ] and e and [ ] in Cilubà. Kabuta (1998a:14) devotes only one short, obscure rule in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979:49–51). His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà (bar chart ‘Places of articulation in stops’, axis 0–50%: bilabial 38.69, alveolar 38.22, palato-alveolar 2.55, palatal 1.6, velar 18.95)

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ ɦ ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ ɦ ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ ɦ ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do they use [ ǀ ], the voiceless dental click. Just as [ ɦ ] (made on a pulmonic ingressive airstream), [ ǀ ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1 [ ] CV3 [ ] CV8

[ ] CV2 [ ] CV4 somewhat retracted [ ] CV7 somewhat lowered

[ ] CV2 somewhat lowered ([ ]) (IPA symbol for schwa) [ ] CV6 somewhat raised

Table 7 Vocalic resonants attested in CPD

Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà (bar chart ‘Tones for the vocalic resonant [ e ]’; quantity: short, long; tonality: low, high, falling; frequency axis 0–50%)

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (bar chart ‘Tones for the vocalic resonant [ a ]’; quantity: short, long; tonality: low, high, falling; frequency axis 0–50%)

Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]: (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ (Prinsloo 1985:91) [‘Certain question particles occur only in the initial sentence position’]

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge. Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens, in principle, suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.
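A query for a sequence of particles such as na afa amounts to a keyword-in-context (KWIC) search over the tokenised corpus. The following is a minimal sketch of that idea, not the software used to query PSC: the functions `tokenize` and `concordance` are hypothetical names, the tokenisation is deliberately naive, and the sample text is stitched together from fragments quoted in this article purely for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (Unicode letters included)."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, node, width=5):
    """Return (left context, node, right context) for every match of the token sequence."""
    size = len(node)
    hits = []
    for i in range(len(tokens) - size + 1):
        if tokens[i:i + size] == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + size:i + size + width]
            hits.append((" ".join(left), " ".join(node), " ".join(right)))
    return hits


# Illustrative sample only; the punctuation is an assumption.
sample = (
    "Bjale gona bothata bo agetše. Na afa Kgoteledi mohla a di kwa "
    "o tla di thabela? Afa o tseba go beša nama?"
)
tokens = tokenize(sample)

hits = concordance(tokens, ["na", "afa"], width=3)
for left, node, right in hits:
    print(f"{left:>25} | {node} | {right}")

# The reverse order can be checked with the same function.
reverse_hits = concordance(tokens, ["afa", "na"], width=3)
```

Matching on whole tokens rather than on substrings matters here: a word such as gona ends in na but is not an occurrence of the particle.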

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence.’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus.’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account.’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
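Isolating a core vocabulary from frequency counts can be sketched as follows. This is an illustration under simplifying assumptions rather than the procedure behind the textbook: `core_vocabulary` is a hypothetical helper, every orthographic word counts as one item (so homographs such as bona ‘see’ and bona ‘they/them’ are not separated), and the five sample sentences merely stand in for a corpus of millions of words.

```python
import re
from collections import Counter


def core_vocabulary(texts, size):
    """Return the `size` most frequent word-forms with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(size)


# Stand-in for a corpus: the example sentences quoted in Table 9.
lesson_sentences = [
    "Ke nyaka gore o nthuše",
    "Re bona tau",
    "Re thuša bona",
    "Batho ba ba bego ba reka",
    "Tla mo",
]

top = core_vocabulary(lesson_sentences, 3)
for word, count in top:
    print(f"{word}\t{count}")
```

For a course of 1 000 words one would simply call the function with `size=1000` on the full corpus; separating homographs would require annotation beyond raw frequency counting.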

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples taken from a variety of written and oral sources are used, rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists.’ (Kruyt 1995:126)

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1 the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix; while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord; etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
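Retrieving rows of identical word-forms, such as several le’s or ba’s in succession, is a simple scan over the token stream. The sketch below assumes a naive tokenisation and a hypothetical helper `repeated_runs`; it finds the runs but leaves the grammatical analysis of each item to the teacher. The two sample strings are taken from the concordance lines in Tables 10 and 11.

```python
import re


def repeated_runs(text, word, min_run=3):
    """Return (token index, run length) for each run of `word` of at least `min_run`."""
    tokens = re.findall(r"\w+", text.lower())
    runs = []
    i = 0
    while i < len(tokens):
        if tokens[i] != word:
            i += 1
            continue
        j = i
        while j < len(tokens) and tokens[j] == word:
            j += 1  # extend the run as long as the same word-form repeats
        if j - i >= min_run:
            runs.append((i, j - i))
        i = j
    return runs


le_line = "Ga se fela lehu le le le golomago dimpa tša ba motse"
ba_line = "ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase"

le_runs = repeated_runs(le_line, "le", min_run=3)
ba_runs = repeated_runs(ba_line, "ba", min_run=4)
print(le_runs, ba_runs)
```

Lowering `min_run` to 2 would also return the shorter sequence ba ba in the second line, which may be useful when grading exercises from easy to difficult.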

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information.’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S
M bona ‘see’
M Re bona tau ‘We see a lion’ S
M bona ‘they/them’
M Re thuša bona ‘We help them’ S
M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S
M tla ‘come, shall/will’
M Tla mo ‘Come here’ S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
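How often each locative strategy occurs for a given noun can be estimated with a simple count. The sketch below is a rough heuristic, not a description of the PSC queries: `locative_counts` is a hypothetical helper, it treats any go immediately followed by the noun as the prefixal strategy, and the sample text combines two fragments from Table 12 with one constructed sentence (O ile go monna) that is not corpus data.

```python
import re
from collections import Counter


def locative_counts(text, noun, suffixed_form):
    """Count 'go + noun' sequences versus occurrences of the suffixed -ng form."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == suffixed_form:
            counts["suffix -ng"] += 1
        elif token == "go" and i + 1 < len(tokens) and tokens[i + 1] == noun:
            counts["prefix go"] += 1
    return counts


# Illustrative sample: the first sentence is constructed, the rest echo Table 12.
sample = (
    "O ile go monna. O fihlile monneng wa banna. "
    "Nka go iša monneng e mongwe wa gešo."
)
counts = locative_counts(sample, noun="monna", suffixed_form="monneng")
print(dict(counts))
```

Because go has several other functions in Sepedi, the adjacency test only approximates the prefixal strategy; the hits would still have to be read in context, which is exactly what concordance lines allow.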

Louwrensrsquo semantic distinction betweenthese strategies goes a long way in pointing outthe difference However once again carefulanalysis of corpus data reveals semantic con-notations other than those described byresearchers who solely rely on introspectionand informants So for example the meaning

1

e eja mabele tšhemong ya mosadi wa bobedi

monneng

wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12 Corpus lines for monneng (suffixing the locative -ng to lsquoa human beingrsquo in Sepedi)

Table 13 Corpus lines for mosading (suffixing the locative -ng to lsquoa human beingrsquo in Sepedi)

1

seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
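The word-list approach described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the software shipped with WordPerfect 9: the function names, the tokenisation rule and the toy corpus are all hypothetical, and a real stored list would hold the top few thousand word-forms of a multi-million-word corpus.

```python
import re
from collections import Counter

# Assumed tokenisation rule: a word-form is a run of letters (including
# characters such as š or ô), optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def tokenize(text):
    """Split text into lower-cased word-forms."""
    return [m.group(0).lower() for m in WORD.finditer(text)]


def build_word_list(corpus_text, top_n):
    """Return the top_n most frequent word-forms of the corpus as a set."""
    counts = Counter(tokenize(corpus_text))
    return {word for word, _ in counts.most_common(top_n)}


def unrecognised(text, word_list):
    """Return the word-forms of text that are absent from the stored list."""
    return [w for w in tokenize(text) if w not in word_list]


# Toy corpus made of placeholder tokens; it is not real corpus data.
toy_corpus = "ya ya ya ya ka ka ka go go moka"
stored = build_word_list(toy_corpus, top_n=3)
print(unrecognised("Ka moka ya gago", stored))  # ['moka', 'gago']
```

Storing the list as a set keeps each look-up cheap, which matters once the list grows towards the 100 000 word-forms envisaged for later versions.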

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000: 114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000: 24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
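The two success rates quoted for (8) and (9) follow directly from the counts given in the text. The short sketch below reproduces the arithmetic; the function name is ours and purely illustrative.

```python
def success_rate(total_words, unrecognised_words):
    """Percentage of word-forms that the spellchecker recognised."""
    recognised = total_words - unrecognised_words
    return 100 * recognised / total_words


# isiZulu paragraph (8): 12 of 41 word-forms not recognised.
print(f"isiZulu: {success_rate(41, 12):.1f}%")
# Sepedi paragraph (9): 2 of 46 word-forms not recognised.
print(f"Sepedi:  {success_rate(46, 2):.1f}%")
```

Rounded to whole percentages, these values give the 71% and 96% reported above.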

The four available first-generation spellcheckers were tested by Corel’s Beta Partners, and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme, except [ ] and h which belong to a same phoneme’ (1979: 47). Here, however, Muyunga is mixing different dialects. While [ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³: 7), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949: xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho English Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


available is UPSID (an acronym for UCLA Phonological Segment Inventory Database). This database was compiled under Maddieson’s supervision at the University of California, Los Angeles, and contains the phonemic inventories of 317 languages (Maddieson 1984). By way of example, we can consider the distribution of the different places of articulation in stops for Cilubà, as shown in Figure 1.

From Figure 1 it can be seen that the most frequent places of articulation in stops for Cilubà are located forward in the oral cavity, viz. bilabial and alveolar, which together account for roughly four fifths of the places. The velar place of articulation roughly accounts for the remaining fifth. The UPSID database has 23.50% labial, 0.13% labiodental, 33.48% dental-alveolar, 7.00% postalveolar, 5.09% retroflex, 5.70% palatal, 19.63% velar, 2.01% uvular and 3.46% glottal. Hence, one must conclude that here the Cilubà distribution broadly follows the general pattern seen in the world’s languages.
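The comparison made in this paragraph can be checked with a few lines of arithmetic. The sketch below rests on two assumptions of ours: the figures are read as percentages with the decimal point restored (each set then sums to roughly 100), and the Cilubà bilabial and alveolar places are set against the UPSID labial and dental-alveolar categories. The variable and function names are illustrative only.

```python
# UPSID percentages for places of articulation in stops, as quoted in the text
# (assumption: decimal points restored, e.g. "2350" read as 23.50).
upsid = {
    "labial": 23.50, "labiodental": 0.13, "dental-alveolar": 33.48,
    "postalveolar": 7.00, "retroflex": 5.09, "palatal": 5.70,
    "velar": 19.63, "uvular": 2.01, "glottal": 3.46,
}

# Cilubà percentages read off Figure 1 (same assumption about decimal points).
ciluba = {
    "bilabial": 38.69, "alveolar": 38.22, "palato-alveolar": 2.55,
    "palatal": 1.6, "velar": 18.95,
}


def share(distribution, places):
    """Summed percentage of the given places of articulation."""
    return round(sum(distribution[p] for p in places), 2)


print(share(ciluba, ["bilabial", "alveolar"]))       # forward places, Cilubà
print(share(upsid, ["labial", "dental-alveolar"]))   # forward places, UPSID
print(ciluba["velar"], upsid["velar"])               # velar in both sources
```

On these readings the two forward places are the most frequent in both sources and the velar share is close to a fifth in each, which is the sense in which the text speaks of Cilubà broadly following the general pattern.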

On the other hand, exactly three quarters of the Cilubà stops are voiced, the remaining quarter being voiceless. The UPSID database has 52.49% voiced versus 47.51% voiceless. Hence, Cilubà here does not follow the general pattern seen in the world’s languages.

Towards a sound treatment of the Cilubà vocalic resonants

The study of CPD also reveals the lackadaisical approach of any phonetic description of Cilubà thus far when it comes to the vocalic resonants. Firstly, through a purely auditive comparison with the taped pronunciation of the Cardinal Vowels (CVs) by Daniel Jones himself, we have come to the conclusion that a total of nine vocalic-resonant values are attested in CPD, cf. Table 7.

Nine vocalic-resonant values is a high number for a language traditionally considered as having only five vocalic resonants. In the entire literature in our possession, only three authors mention the existence of more than five vocalic resonants. Stappers (1949: xi) devotes just one sentence to the observation that there is no phonological opposition between o and [ ] and e and [ ] in Cilubà. Kabuta (1998a: 14) devotes only one short, obscure rule in which he argues that e is pronounced [ ] whenever the preceding syllable contains e or o. He gives only two examples, [ kupna ] and [ kukma ], which are not really helpful.⁵ In addition, one is at a total loss when it comes to the phones [ o ] versus [ ], for nothing is mentioned about them. The only serious attempt to clarify the matter is found in Muyunga (1979: 49–51).

[Figure 1 is a bar chart of the places of articulation in stops, on a percentage axis running from 0 to 50: velar 18.95, palatal 1.6, palato-alveolar 2.55, alveolar 38.22, bilabial 38.69.]

Figure 1: Proportional occurrence of each place of articulation in stops for Cilubà


His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979: 49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the ‘corpus-based phonetics from below’-approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The ‘corpus-based phonetics from below’-approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology for obtaining the phonetic description of a language, one in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1 [ ] CV3 [ ] CV8

[ ] CV2 [ ] CV4 somewhat retracted [ ] CV7 somewhat lowered

[ ] CV2 somewhat lowered ([ ]) (IPA symbol for schwa) [ ] CV6 somewhat raised

Table 7: Vocalic resonants attested in CPD


[Figure 2 is a bar chart titled ‘Tones for the vocalic resonant [ e ]’: percentage of occurrence (axis 0 to 50) of low, high and falling tones, for short and long quantity.]

Figure 2: Three-dimensional approach to the vocalic resonant [ ] for Cilubà

[Figure 3 is a bar chart titled ‘Tones for the vocalic resonant [ a ]’: percentage of occurrence (axis 0 to 50) of low, high and falling tones, for short and long quantity.]

Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985: 93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991: 140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985: 93) and Louwrens (1991: 143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [… s]legs in die inisiële sinsposisie [op]
(3ii) O tšwa ka gae ge o etla fa kagore o šetše o fela pelo afa’ (Prinsloo 1985: 91) [in English: ‘Certain question particles occur … only in the initial sentence position’]

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991: 140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge. Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba … ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be …’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.
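A check of claim (c) can be approximated by classifying every occurrence of a particle according to its position in the sentence. The following is a minimal sketch, not the query tool used on PSC: it assumes that the corpus text retains sentence-final punctuation, and the function name and the two sample sentences (examples (2) and (7), with punctuation supplied by us) serve only as illustration.

```python
import re
from collections import Counter


def particle_positions(text, particle):
    """Count sentence-initial, sentence-final and medial uses of a particle."""
    positions = Counter()
    # Naive sentence split on ., ? and !; assumes punctuation is preserved.
    for sentence in re.split(r"[.?!]+", text):
        words = [w.strip(",;:()\"'").lower() for w in sentence.split()]
        for i, word in enumerate(words):
            if word != particle:
                continue
            if i == 0:
                positions["initial"] += 1
            elif i == len(words) - 1:
                positions["final"] += 1
            else:
                positions["medial"] += 1
    return positions


sample = ("Afa o tseba go beša nama? "
          "Naa e ka ba kgomolekokoto ye e mo hlotšeng afa?")
print(particle_positions(sample, "afa"))
```

A single "final" hit, as in (7), is enough to show that a claim of the form "afa never occurs sentence-finally" is too strong.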

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991: 144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.
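Concordance lines of the kind listed in Table 8 can be produced with a simple keyword-in-context (KWIC) routine. The sketch below is one possible way of doing so and makes no claim about the software actually used to query PSC; the function name and the context width are assumptions, and the sample tokens are taken from corpus line 2 of Table 8.

```python
def kwic(tokens, phrase, width=5):
    """Return (left, node, right) context triples for each match of phrase."""
    target = [w.lower() for w in phrase]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        # Case-insensitive match, so that both "Na afa" and "na afa" are found.
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines


sample = "Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa".split()
for left, node, right in kwic(sample, ["na", "afa"], width=3):
    print(f"{left:>30}  {node}  {right}")

# The reverse order is queried in the same way and returns no lines here.
print(kwic(sample, ["afa", "na"], width=3))
```

Running the same query with the two particles reversed is how the claim "only in the order na afa and not vice versa" would be tested against the full corpus.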

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996: 9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, ‘hidden’ as they are in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English, consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999: 55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect, Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [… a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987: 169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987: 68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
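Isolating a core vocabulary from frequency counts, as described above, amounts to ranking the word-forms of the corpus and cutting the list at the desired size. The sketch below illustrates this under simple assumptions: the function name is ours, the corpus is taken to be already tokenised, and the toy tokens are placeholders rather than Sepedi data. The cumulative coverage column shows how much of the running text the selected words account for.

```python
from collections import Counter


def core_vocabulary(tokens, size):
    """Return (word, count, cumulative coverage in %) for the top `size` words."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    rows, covered = [], 0
    for word, count in counts.most_common(size):
        covered += count
        rows.append((word, count, round(100 * covered / total, 1)))
    return rows


# Placeholder tokens standing in for a tokenised corpus.
toy_tokens = "a a a a a b b b c c d".split()
for word, count, coverage in core_vocabulary(toy_tokens, size=2):
    print(word, count, coverage)
```

For a course built around, say, 1 000 words, the same call with size=1000 on the full corpus would yield the list, and the coverage figure gives the compiler a rough measure of how far that vocabulary takes the learner in authentic text.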

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix; while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
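Concordance lines of this kind can be retrieved mechanically. The following is a minimal sketch, assuming a whitespace-tokenised text; the function `repeated_token_lines` and its parameters are hypothetical names introduced here for illustration and do not describe the query software actually used on PSC. The sample is a fragment of corpus line 16 in Table 10.

```python
def repeated_token_lines(tokens, target, min_run=3, width=5):
    """Yield (left, run, right) for each maximal run of `target`
    that is at least `min_run` tokens long."""
    i = 0
    while i < len(tokens):
        if tokens[i].lower() == target:
            # Extend j to the end of the current run of the target token.
            j = i
            while j < len(tokens) and tokens[j].lower() == target:
                j += 1
            if j - i >= min_run:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[j:j + width])
                yield left, " ".join(tokens[i:j]), right
            i = j
        else:
            i += 1


# Fragment of corpus line 16 in Table 10, used here as toy input.
sample = "leeto le le le swereng ge nako ya lona".split()
hits = list(repeated_token_lines(sample, "le"))
print(hits)
```

Raising `min_run` narrows the output to the longer, rarer sequences; the grammatical labelling of each le or ba remains a task for the teacher or learner.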

Teaching contrasting structures

Top-frequency words and top-frequency grammatical structures, singled out from a corpus, should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently: in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9 Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S
M bona ‘see’
M Re bona tau ‘We see a lion’ S
M bona ‘they/them’
M Re thuša bona ‘We help them’ S
M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S
M tla ‘come, shall/will’
M Tla mo ‘Come here’ S


Table 10 Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend
relative pronoun class 5
copulative verb stem
conjunctive particle
prefix class 5
subject concord 2nd person plural
object concord 2nd person plural
subject concord class 5
object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11 Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend
relative pronoun class 2
auxiliary verb stem
copulative verb stem
subject concord class 2
object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12 Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

Table 13 Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
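The word-list approach described above can be sketched in a few lines. This is a minimal illustration only, assuming whitespace tokenisation and lower-casing; the function names `build_word_list` and `unrecognised` are hypothetical, and the code makes no claim about how the WordPerfect 9 modules are implemented.

```python
from collections import Counter


def build_word_list(corpus_tokens, size):
    """Keep the `size` most frequent word-forms of the corpus as the stored list."""
    counts = Counter(token.lower() for token in corpus_tokens)
    return {word for word, _ in counts.most_common(size)}


def unrecognised(typed_tokens, word_list):
    """Return the typed word-forms that are absent from the stored list."""
    return [token for token in typed_tokens if token.lower() not in word_list]


# Toy corpus for illustration only (sentences from Table 9, not PSC).
toy_corpus = "Re bona tau Re thuša bona Tla mo".split()
stored = build_word_list(toy_corpus, size=2)
flagged = unrecognised("Re bona tau".split(), stored)
print(flagged)
```

In a conjunctively written language a single orthographic word packs in many morphemes, so far more distinct word-forms fall outside any fixed-size list, which is why the same design performs worse there.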

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200 Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala) Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
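The two success rates follow directly from the counts reported for (8) and (9). As a small worked check, with `success_rate` being a hypothetical helper name introduced here:

```python
def success_rate(total_forms, unrecognised_forms):
    """Percentage of word-forms that the spellchecker recognised."""
    return 100 * (total_forms - unrecognised_forms) / total_forms


# Counts taken from the text: 12 of 41 (isiZulu) and 2 of 46 (Sepedi) not recognised.
zulu_rate = success_rate(41, 12)    # 29 / 41
sepedi_rate = success_rate(46, 2)   # 44 / 46
print(round(zulu_rate), round(sepedi_rate))
```

Rounded to whole percentages this gives 71 for the isiZulu paragraph and 96 for the Sepedi one, in agreement with the figures quoted.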

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have at present become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study: hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research - Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme, except and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37)), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. sd⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <httpwwwhduibnoAcoHumabshurskhtm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <httpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtml>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans, Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


His study brings him to the conclusion that ‘[t]he degree of openness of these vowels [e and o] is conditioned by the final vowel of the word’ (1979:49), a phenomenon he calls ‘a kind of retrogressive vowel harmony’. Unfortunately, upon scrutinising CPD, this suggested harmony cannot be supported. This has an important consequence. Even though Stappers’ observation still holds, the occurrence of a particular vocalic-resonant value not being predictable in a specific environment, one should seriously reconsider the many different orthographies used for Cilubà, for they are all restricted to just five ‘vowel symbols’.

Secondly, even more lackadaisical throughout the literature is the treatment of the tonal dimension of vocalic resonants, and this despite the fact that tones are used to make both semantic and grammatical distinctions. We are convinced that if one is to expound on the real nature of the vocalic resonants in Cilubà, one needs a three-dimensional approach, with a quantity level, a tonality level and a frequency level, and this for each vocalic resonant.⁶ As an illustration, two such three-dimensional approaches are shown in Figures 2 and 3.

Rare phenomena and the corpus-based phonetics from below approach

We must note that a method based on top-frequencies of occurrence will not, by definition, show the rather rare phenomena of a language. In this respect, Gabriël is the only author to mention the presence of two ingressive phones, namely the ‘monosyllabic affirmation en’ and the ‘dental click’ (cf. Table 1). These facets are certainly crucial if one pursues an exhaustive phonetic description of Cilubà.

Rather accidentally, what Gabriël calls the ‘monosyllabic affirmation en’ was recorded during the sessions with the informant. Indeed, in one of the utterances to illustrate [ si ] (the particle used to confirm a statement), the informant starts off with a phone we could tentatively pinpoint as [ K ], a breathy voiced glottal fricative pronounced on an indrawn breath. As it stands there, the ‘confirmation particle’ [ si ] is preceded by the ‘affirmation particle’ [ K ]. It is, however, not simply a pleonasm to strengthen the ensuing statement even more. Rather, [ K ] seems to be a paralinguistic use of the pulmonic ingressive airstream mechanism in order to express sympathy.

The Balubà rarely swear, but whenever they do, they use [ | ], the voiceless dental click. Just as [ K ] (made on a pulmonic ingressive airstream), [ | ] (made on a velaric ingressive airstream) is only used in a paralinguistic function.

The corpus-based phonetics from below approach as a powerful tool

To summarise this section on corpus applications in the field of fundamental phonetic research, one can safely claim that a ‘corpus-based phonetics from below’-approach is a powerful tool. Specifically for Cilubà, it has revealed previously underestimated phones, led to the discovery of one new phone, enabled framing the phonetic inventory in a global perspective, and pointed out some serious lacunae in the literature. For any language, one can claim that this approach entails a new methodology in terms of which the phonetic description of a language is obtained, in which one starts from the language itself and eliminates the random factor. In addition, this methodology makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

[ ] CV1 [ ] CV3 [ ] CV8

[ ] CV2 [ ] CV4 somewhat retracted [ ] CV7 somewhat lowered

[ ] CV2 somewhat lowered ([ ]) (IPA symbol for schwa) [ ] CV6 somewhat raised

Table 7 Vocalic resonants attested in CPD

Figure 2 Three-dimensional approach to the vocalic resonant [ ] for Cilubà (panel title: Tones for the vocalic resonant [ e ]; axes: quantity (short, long), tone (low, high, falling), frequency scale from 0 to 50)

Figure 3 Three-dimensional approach to the vocalic resonant [ a ] for Cilubà (panel title: Tones for the vocalic resonant [ a ]; axes: quantity (short, long), tone (low, high, falling), frequency scale from 0 to 50)

Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection, or in utilising informants, are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’
(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]: (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge! Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man! Can it be something big, immense and strong that created him? It can be ...’

Here, we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. At least Louwrens in principle suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9)

The corpus is indispensable in highlightingthe co-occurrences of na and afa Noresearcher would have persevered in readingthe equivalent of 90 Sepedi literary works andmagazines to find such empirical examples Infact heshe would probably have missed themanyway being lsquohiddenrsquo in 4 million words of run-ning text

To summarise this section we see that thecorpus comes in handy when pursuing funda-mental linguistic research into African lan-guages When a corpus-based approach iscontrasted with the so-called lsquotraditionalapproachesrsquo of introspection and informant elic-

1

tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše

2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela

3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di

4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša

5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako

6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo

7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo

8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se

9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela

10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re

11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke

12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya

13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane

14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu

15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong

16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di

17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše

18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka

19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI

20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI

21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna

22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo

23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O

24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Table 8 Concordance lines for the sequence of question particles na afa in Sepedi

Southern African Linguistics and Applied Language Studies 2001 19 111ndash131 125

itation corpora reveal both correct and incor-rect traditional findings
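The kind of query that produces concordance lines such as those in Table 8 can be illustrated with a minimal keyword-in-context sketch. This is not the software used to query PSC; the function name `kwic`, the whitespace tokenisation and the context width are assumptions made for illustration, and the toy token list is simply two fragments copied from Table 8.

```python
from typing import List, Sequence


def kwic(tokens: Sequence[str], pattern: Sequence[str], width: int = 6) -> List[str]:
    """Return keyword-in-context lines for every occurrence of `pattern`.

    Matching is case-insensitive; `width` is the number of context tokens
    shown on each side (a hypothetical default, not a PSC setting).
    """
    wanted = [p.lower() for p in pattern]
    n = len(wanted)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == wanted:
            left = " ".join(tokens[max(0, i - width):i])
            hit = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append(f"{left} [{hit}] {right}".strip())
    return lines


# Toy "corpus": two fragments from Table 8, split on whitespace.
sample = (
    "Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa "
    "La pele e be e le gore na afa yola monna wa gagwe"
).split()

na_afa = kwic(sample, ["na", "afa"])   # the attested order
afa_na = kwic(sample, ["afa", "na"])   # the reverse order

for line in na_afa:
    print(line)
print("afa na hits:", len(afa_na))
```

Running the same pair of queries over a full corpus is what allows the claim that only the order na afa occurs to be checked exhaustively rather than by reading.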

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language's lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to 'retrieve', 'pronounce' and 'learn' (and hence to 'acquaint themselves with') the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called 'naturally-occurring' or 'real') language use. In this respect Renouf, reporting on the compilation of a 'lexical syllabus', writes:

'With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus' (Renouf 1987:169).

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition, or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

'There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of 'useful' or 'important' words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account' (Renouf 1987:68).

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year's Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language's basic or core vocabulary.
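Isolating a core vocabulary from frequency counts can be sketched in a few lines. The following is an illustrative assumption about how such a list might be derived, not the procedure used for the Sepedi textbook; the function name `core_vocabulary` is hypothetical, and the toy tokens are the example sentences of Table 9 rather than a real corpus.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def core_vocabulary(tokens: Iterable[str], size: int) -> List[Tuple[str, int]]:
    """Return the `size` most frequent word-forms with their counts."""
    counts = Counter(t.lower() for t in tokens)
    # Sort by descending frequency, then alphabetically so that ties
    # always come out in the same order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:size]


# Toy data: the Sepedi example sentences from Table 9.
tokens = "Ke nyaka gore o nthuše Re bona tau Re thuša bona Tla mo".split()

top_two = core_vocabulary(tokens, 2)
print(top_two)
```

With a corpus of millions of running words, raising `size` to 1 000 would yield the kind of core vocabulary discussed above; the course compiler would still need to decide how to present each item in context.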

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

'Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists' (Kruyt 1995:126).

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le's, or up to four ba's, are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le's in Table 10 can, for example, be pointed out. In corpus line 1 the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le's and ba's, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on 'real' language can only be welcomed.
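Retrieving sequences such as the repeated le's and ba's of Tables 10 and 11 amounts to finding runs of the same word-form in the token stream. The sketch below is an assumption about how such a query could be written, not the tool used on PSC; `repeated_runs` is a hypothetical name, and the two toy fragments are corpus line 8 of Table 10 and corpus line 259 of Table 11, read with the keyword column rejoined to its context.

```python
from typing import List, Sequence, Tuple


def repeated_runs(tokens: Sequence[str], word: str, min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start_index, run_length) for each maximal run of `word`
    that is at least `min_len` tokens long (case-insensitive)."""
    word = word.lower()
    runs = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() == word:
            j = i
            while j < len(tokens) and tokens[j].lower() == word:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j  # continue after the run
        else:
            i += 1
    return runs


le_fragment = "le lebe le ge e le le le botse Rebeka".split()
ba_fragment = "ka tše pedi ba ba ba ba bea fase".split()

le_runs = repeated_runs(le_fragment, "le")
ba_runs = repeated_runs(ba_fragment, "ba")
print(le_runs, ba_runs)
```

The query only locates the sequences; assigning each le or ba its grammatical function, as in the legends of Tables 10 and 11, remains the analytical task for teacher and learner.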

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

'we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information' (Renouf 1987:169).

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year's Sepedi Laboratory Textbook

M: gore 'that, so that'
M: Ke nyaka gore o nthuše 'I want you to help me' S:

M: bona 'see'
M: Re bona tau 'We see a lion' S:

M: bona 'they/them'
M: Re thuša bona 'We help them' S:

M: bego 'which was busy'
M: Batho ba ba bego ba reka 'The people who were busy buying' S:

M: tla 'come, shall/will'
M: Tla mo 'Come here' S:

Table 10: Morpho-syntactic analysis of up to five le's in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa [le le le] bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga

8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e [le le le] botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa

13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa [le le le le] le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba

16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto [le le le] swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale

18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu [le le le] golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge

29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo [le le le] bonago le golofetše le e sa le le gobala mohlang woo Banna ba

32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana [le le le] hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba's in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi [ba ba ba] badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi

127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile [ba ba ba] tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg

185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la [ba ba ba] amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke

259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi [ba ba ba ba] bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen

272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se [ba ba ba] hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo

312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma [ba ba ba] mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge

317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo [ba ba Ba] bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example, we can consider two different locative strategies used in Sepedi: 'prefixing of go' versus 'suffixing of -ng'. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna 'at the man' and go mosadi 'at the woman' as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

'There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning 'the neighbourhood where the chief lives', whereas go kgôši clearly implies 'to the particular chief in person'' (Louwrens 1991:121).

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as 'at' with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens' semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng 'Because when man was created, he did not come out of a woman; it is the woman who came out of the man' (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12: Corpus lines for monneng (suffixing the locative -ng to 'a human being' in Sepedi)

Table 13: Corpus lines for mosading (suffixing the locative -ng to 'a human being' in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a
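The contrast between the two locative strategies can also be quantified before the individual corpus lines are read. The sketch below counts, for each noun, how often the prefixal form (go followed by the noun) and the suffixal form occur. It is an illustration only: `locative_counts` is a hypothetical name, the mapping from noun to suffixed form is supplied by hand because the -ng form is not a plain concatenation (monna, monneng), and the toy token list combines fragments of Tables 12 and 13 with the bare form go monna cited in the text.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence


def locative_counts(tokens: Sequence[str], suffixed: Mapping[str, str]) -> Dict[str, Dict[str, int]]:
    """Count 'go + noun' bigrams and '-ng' forms for each noun in `suffixed`."""
    lowered = [t.lower() for t in tokens]
    unigrams = Counter(lowered)
    bigrams = Counter(zip(lowered, lowered[1:]))
    return {
        noun: {
            "go + noun": bigrams[("go", noun)],
            "noun + -ng": unigrams[ng_form],
        }
        for noun, ng_form in suffixed.items()
    }


# Toy data, not a real corpus sample.
tokens = "o fihlile monneng wa banna a ya mosading yo mongwe go monna".split()
forms = {"monna": "monneng", "mosadi": "mosading"}

counts = locative_counts(tokens, forms)
print(counts)
```

Counts of this kind show which strategy dominates; the semantic difference between the two still has to be read off the concordance lines themselves.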

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is 'a computer PROGRAM that checks what you have written and makes your spelling correct' (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary's definition of a spellchecker: 'a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words' (1996).

While such a 'stored list of words' is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
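The second approach, comparison against a stored list of top-frequency word-forms, can be sketched as follows. This is a minimal illustration under stated assumptions and not the WordPerfect implementation: the function names `build_word_list` and `unrecognised` are hypothetical, tokens containing digits (such as prices) are simply skipped, and the tiny "corpus" consists of the Table 9 sentences.

```python
import string
from collections import Counter
from typing import Iterable, List, Set


def build_word_list(corpus_tokens: Iterable[str], top_n: int) -> Set[str]:
    """Store the `top_n` most frequent word-forms of a corpus."""
    counts = Counter(t.lower() for t in corpus_tokens)
    return {word for word, _ in counts.most_common(top_n)}


def unrecognised(text: str, word_list: Set[str]) -> List[str]:
    """Return the word-forms in `text` that are not in the stored list."""
    flagged = []
    for raw in text.split():
        word = raw.strip(string.punctuation)
        # Skip empty strings and anything that is not purely alphabetic.
        if word.isalpha() and word.lower() not in word_list:
            flagged.append(word)
    return flagged


corpus = "Ke nyaka gore o nthuše Re bona tau Re thuša bona Tla mo".split()
word_list = build_word_list(corpus, top_n=1000)

flagged = unrecognised("Re bona tau, mo Telkom.", word_list)
print(flagged)
```

Because the check is a plain lookup of whole word-forms, its coverage depends on how many distinct forms a language's orthography produces, which is the point made above about conjunctively and disjunctively written languages.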

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of 'only' 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
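The two success rates follow directly from the counts given in the text: the share of recognised word-forms, expressed as a percentage and rounded. A short worked check, with `success_rate` as a hypothetical helper name:

```python
def success_rate(total_words: int, unrecognised_words: int) -> float:
    """Percentage of word-forms that the spellchecker recognised."""
    return 100 * (total_words - unrecognised_words) / total_words


isizulu = round(success_rate(41, 12))  # 29 of 41 word-forms recognised
sepedi = round(success_rate(46, 2))    # 44 of 46 word-forms recognised
print(isizulu, sepedi)
```

Both figures rest on a single short paragraph per language, so they indicate the size of the gap rather than a stable estimate.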

The four available first-generation spellcheckers were tested by Corel's Beta Partners, and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.

Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study (hence a 'corpus-based phonetics from below'-approach). Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language's lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called 'traditional approaches' of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research - Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver's phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: 'Each simple consonant represents a phoneme, except [ɸ] and h which belong to a same phoneme' (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as 'standard Cilubà' (De Clercq & Willems 196037), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as 'words in isolation' are concerned. The implications of such an approach for 'words in context', however, definitely need further research.

References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l'Académie Royale des Sciences d'Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe, Durban. June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex '96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St. Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a 'corpus-based phonetics from below'-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l'École Professionnelle St. Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>.

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>.

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages, Johannesburg. November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


[Figure 2: Three-dimensional approach to the vocalic resonant [ e ] for Cilubà. Chart panel 'Tones for the vocalic resonant [ e ]': tone (low, high, falling) by quantity (short, long), vertical axis 0 to 50. Bar labels in extraction order: 0; 7.41; 0; 25.93; 44.44; 22.22.]

[Figure 3: Three-dimensional approach to the vocalic resonant [ a ] for Cilubà. Chart panel 'Tones for the vocalic resonant [ a ]': tone (low, high, falling) by quantity (short, long), vertical axis 0 to 50. Bar labels in extraction order: 26.39; 45.14; 0; 9.03; 18.06; 1.39.]


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native-speaker intuition or, as a non-mother-tongue speaker, on the opinions of one (or more) mother-tongue speaker(s) of the language. If conclusions which were made by means of introspection, or in utilising informants, are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985) for example made an in-depth study of the interrogative particles na andafa in Sepedi in which he analysed the differ-ent types of questions marked by these parti-cles He concluded that na is used to ask ques-tions of which the speaker does not know theanswer while afa is used if the speaker is of theopinion that the addressee knows the answerCompare (1) and (2) respectively (adaptedfrom Prinsloo 198593)(1) Na o tseba go beša nama lsquoDo you know

how to roast meatrsquo(2) Afa o tseba go beša nama lsquoDo you know

how to roast meatrsquoIn terms of Prinsloo (1985) the first questionwill be asked if the speaker does not knowwhether or not the addressee is capable ofroasting meat and the second if the speaker isunder the impression that the addressee iscapable of roasting meat but observes thatheshe is not performing well Louwrens(1991140) in turn states that the use of nademands an answer but that the use of afaindicates a rhetorical question

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?
(4) Afa ke mang?
(5) Afa o ya kae?
(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op]: (3ii) O tšwa ka gae ge o etla fa kagore o šetše o fela pelo afa?’ [Certain question particles occur only in the initial sentence position.] (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge! Naa e ka ba kgomolekokoto ye e mo hlotšeng afa? E ka ba ... ‘The cheeky old man! Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa and not vice versa. Louwrens at least suggests in principle that ‘[n]a and afa may, in certain instances, co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence.’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, as they are ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta
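Concordance lines of the kind listed in Table 8 can be produced with a simple keyword-in-context (KWIC) query. The following is only a minimal sketch in Python, not the software actually used to query PSC: the function names, the tokenisation rule and the context width are assumptions made for illustration, and the sample sentence is adapted from line 2 of Table 8 with punctuation assumed.

```python
import re

def tokenize(text):
    """Split running text into word tokens (letters incl. diacritics, optional internal hyphens)."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)

def concordance(tokens, query, width=5):
    """Return (left context, matched sequence, right context) for every
    case-insensitive occurrence of the token sequence `query`."""
    q = [w.lower() for w in query]
    n = len(q)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == q:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, " ".join(tokens[i:i + n]), right))
    return lines

# Hypothetical mini-corpus (adapted from Table 8, line 2; punctuation assumed).
sample = "Bjale gona bothata bo agetše. Na afa Kgoteledi mohla a di kwa o tla di thabela?"
tokens = tokenize(sample)
na_afa = concordance(tokens, ["na", "afa"], width=3)
afa_na = concordance(tokens, ["afa", "na"], width=3)
```

Querying both orders (na afa and afa na) in this way is what makes it possible to state that only one of the two orders is attested in a given corpus.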

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English, consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’ (and hence to ‘acquaint themselves with’) the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus.’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation. This was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and, on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account.’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here, the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
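Isolating a core vocabulary from frequency counts can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used for the textbook: the function name `core_vocabulary`, the simple tokenisation rule and the toy text (three short sentences of the kind shown in Table 9) are all hypothetical, and a real course compiler would run such a count over the full corpus.

```python
import re
from collections import Counter

def core_vocabulary(text, n=1000):
    """Return the n most frequent word-forms in `text` with their counts.
    Ties are returned in order of first occurrence."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(n)

# Toy text standing in for a corpus (assumption: sentences modelled on Table 9).
toy_corpus = "Re bona tau. Re thuša bona. Tla mo."
top_two = core_vocabulary(toy_corpus, n=2)
```

With `n=1000` and a multi-million-word corpus as input, the same call would yield the 1 000-word core vocabulary discussed above, ranked by actual frequency of use instead of by intuition.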

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples taken from a variety of written and oral sources are used, rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures, and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists.’ (Kruyt 1995:126)

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11, a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun, and class 5 prefix; while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord, and class 5 object concord; etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. By retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
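Retrieving runs of identical word-forms, such as the le le le and ba ba ba ba sequences in Tables 10 and 11, is a straightforward corpus query. The sketch below is a hypothetical Python illustration (the function name `repeated_runs` and its parameters are assumptions, not part of any tool mentioned in this article); the two token lists are taken from corpus line 1 of Table 10 and corpus line 259 of Table 11.

```python
from itertools import groupby

def repeated_runs(tokens, word, min_len=3):
    """Return (start index, run length) for every run of at least `min_len`
    consecutive occurrences of `word` (case-insensitive) in `tokens`."""
    hits = []
    pos = 0
    for is_match, group in groupby(tokens, key=lambda t: t.lower() == word):
        length = len(list(group))
        if is_match and length >= min_len:
            hits.append((pos, length))
        pos += length
    return hits

# Fragments of Table 10, line 1 and Table 11, line 259.
le_tokens = "a mantši a thewa gwa agiwa le le le bitšwago Donsa".split()
ba_tokens = "ka tše pedi ba ba ba ba bea fase".split()
le_runs = repeated_runs(le_tokens, "le")
ba_runs = repeated_runs(ba_tokens, "ba")
```

The query only locates the sequences; assigning each le or ba its grammatical function remains the analytical task for teacher and learner.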

Teaching contrasting structures

Top-frequency words and top-frequency grammatical structures, singled out from a corpus, should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information.’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore <that, so that>
M Ke nyaka gore o nthuše <I want you to help me> S
M bona <see>
M Re bona tau <We see a lion> S
M bona <they/them>
M Re thuša bona <We help them> S
M bego <which was busy>
M Batho ba ba bego ba reka <The people who were busy buying> S
M tla <come, shall/will>
M Tla mo <Come here> S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created, he did not come out of a woman; it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a
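The relative frequency of the two locative strategies can be checked with a simple count. The following Python sketch is purely illustrative: the function name `count_strategies` is hypothetical, and the token list is a toy sequence assembled for the test, not corpus evidence. Note also that a bare match on go followed by a noun is only a first approximation, since go has other functions in Sepedi (for instance as infinitive marker), so hits would still need to be inspected in context.

```python
def count_strategies(tokens, noun, locative_form):
    """Count 'go + noun' (prefixal strategy) and the -ng form (suffixal strategy)."""
    lowered = [t.lower() for t in tokens]
    prefixed = sum(
        1 for first, second in zip(lowered, lowered[1:])
        if first == "go" and second == noun
    )
    suffixed = lowered.count(locative_form)
    return {"prefixed": prefixed, "suffixed": suffixed}

# Toy token list (assumption: for illustration only, not taken from PSC).
toy_tokens = "o ya go monna gomme a fihla monneng wa banna go monna".split()
counts = count_strategies(toy_tokens, "monna", "monneng")
```

Run over a full corpus, such counts show at a glance which strategy dominates, while the exhaustive list of the rarer form supplies the contrasting examples for classroom use.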

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; and, on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers, compiled by DJ Prinsloo, are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
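The second approach, comparing typed words with a stored list of top-frequency word-forms, can be sketched as follows. This is a minimal Python illustration of the principle only, not the WordPerfect 9 implementation: the function names, the tokenisation rule and the tiny stand-in corpus are assumptions.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including letters with diacritics

def build_word_list(corpus_text, top_n):
    """Return the set of the `top_n` most frequent word-forms in the corpus."""
    counts = Counter(WORD.findall(corpus_text.lower()))
    return {word for word, _ in counts.most_common(top_n)}

def unrecognised(text, word_list):
    """Return the word-forms in `text` that are absent from the stored list."""
    return [w for w in WORD.findall(text) if w.lower() not in word_list]

# Tiny stand-in corpus (assumption); a real list would hold thousands of word-forms.
stored = build_word_list("ka moka ka go ka moka go di", top_n=3)
flagged = unrecognised("Di ka šomišwa go megala", stored)
```

In a disjunctively written language a short list of frequent word-forms covers much running text, whereas in a conjunctively written language many long orthographic words are unlikely to be in the list, which is why the same technique performs differently across the two groups.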

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
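The two success rates follow directly from the counts reported for (8) and (9): the share of word-forms that were recognised, expressed as a percentage. The short Python check below reproduces the arithmetic; the function name `success_rate` is merely illustrative.

```python
def success_rate(total_word_forms, unrecognised):
    """Percentage of word-forms recognised by the spellchecker."""
    return 100 * (total_word_forms - unrecognised) / total_word_forms

isizulu_rate = success_rate(41, 12)  # 29 of 41 recognised, approx. 70.7
sepedi_rate = success_rate(46, 2)    # 44 of 46 recognised, approx. 95.7
```

Rounded to the nearest whole number, these values give the 71% and 96% quoted above.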

The four available first-generation spellcheckers were tested by Corel’s Beta Partners, and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

¹ This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

² A different approach to the research presented in this section can be found in De Schryver (1999).

³ Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

⁴ Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ɸ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37)), the glottal variant [h] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

⁵ High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [kuna].

⁶ Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu. Imagazini Yesizwe, Durban. June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St. Paul.

De Schryver G-M. 1999. Cilubà Phonetics, Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St. Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J. Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho, English, Afrikaans Information Pages. Johannesburg. November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


Corpus applications in the field of fundamental linguistic research, Part 2: question particles

Question particles in Sepedi: introspection-based and informant-based approaches

As a second example of how the corpus can revolutionise fundamental linguistic research into African languages, more specifically for the interpretation and description of problematic linguistic issues, we can look at how the corpus adds a new dimension to the traditional introspection-based and informant-based approaches. In these approaches a researcher had to rely on his/her own native speaker intuition, or, as a non-mother tongue speaker, on the opinions of one (or more) mother tongue speaker(s) of the language. If conclusions which were made by means of introspection or in utilising informants are reviewed against corpus-query results, quite a number of these conclusions can be confirmed, whilst others, however, are proven incorrect.

Prinsloo (1985), for example, made an in-depth study of the interrogative particles na and afa in Sepedi, in which he analysed the different types of questions marked by these particles. He concluded that na is used to ask questions of which the speaker does not know the answer, while afa is used if the speaker is of the opinion that the addressee knows the answer. Compare (1) and (2) respectively (adapted from Prinsloo 1985:93):

(1) Na o tseba go beša nama? ‘Do you know how to roast meat?’

(2) Afa o tseba go beša nama? ‘Do you know how to roast meat?’

In terms of Prinsloo (1985), the first question will be asked if the speaker does not know whether or not the addressee is capable of roasting meat, and the second if the speaker is under the impression that the addressee is capable of roasting meat, but observes that he/she is not performing well. Louwrens (1991:140), in turn, states that the use of na demands an answer, but that the use of afa indicates a rhetorical question.

Both Prinsloo (1985:93) and Louwrens (1991:143) emphasise that afa cannot be used with question words, and give the examples shown in (3)–(4) and (5)–(6) respectively:

(3) Afa go hwile mang?

(4) Afa ke mang?

(5) Afa o ya kae?

(6) Afa ke ngwana ofe yô a llago?

From (3)–(6) it is clear that, according to Prinsloo and Louwrens, the occurrence of afa with question words such as mang, kae, -fe, etc. is not possible in Sepedi.

Furthermore, they agree that afa cannot be used in sentence-final position:

‘Sekere vraagpartikels tree [... s]legs in die inisiële sinsposisie [op] (3ii) O tšwa ka gae ge o etla fa ka gore o šetše o fela pelo afa’ [‘Certain question particles occur only in the sentence-initial position’] (Prinsloo 1985:91)

‘the particle na may appear in either the initial or the final sentence position, or in both these positions simultaneously, whereas afa may appear in the initial sentence position only’ (Louwrens 1991:140)

Thus, Prinsloo’s and Louwrens’ presentation of the data suggests that (a) na and afa mark different types of questions, (b) afa will not occur with question words such as mang, kae, -fe, etc., and (c) afa cannot be used in the sentence-final position.

Question particles in Sepedi: corpus-based approach

Querying the large, structured Pretoria Sepedi Corpus (PSC), when it stood at 4 million running words, confirms the semantic analysis of Prinsloo and Louwrens in respect of (b). The fact that not a single example is found where afa occurs with question words such as mang, kae, -fe, etc. validates their finding regarding the interrogative character of afa.
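A check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query tool actually used on PSC: the function name `cooccurs` and the small set `QUESTION_WORDS` are hypothetical, and the stem -fe is left out because it would need pattern matching rather than a plain word lookup. The test sentences are examples (2), (3) and (5) quoted above.

```python
# Illustrative subset of question words (assumption); the stem -fe is omitted.
QUESTION_WORDS = {"mang", "kae"}

def cooccurs(sentence, particle="afa", question_words=QUESTION_WORDS):
    """Return True if the particle and any question word occur in the same sentence."""
    tokens = [t.lower() for t in sentence.split()]
    return particle in tokens and any(t in question_words for t in tokens)

sentences = [
    "Afa o tseba go beša nama",  # example (2): no question word
    "Afa go hwile mang",         # example (3): claimed to be impossible
    "Afa o ya kae",              # example (5): claimed to be impossible
]

# One flag per sentence; a corpus that confirms claim (b) would yield no True values.
flags = [cooccurs(s) for s in sentences]
print(flags)
```

Run over every sentence of a corpus, an empty result set would correspond to the finding reported here for claim (b).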

As for (c), however, compare the example found in PSC and shown in (7), where afa is, contrary to Prinsloo’s and Louwrens’ claim, used in sentence-final position:

(7) Mokgalabje wa mereba ge Naa e ka ba kgomolekokoto ye e mo hlotšeng afa E ka ba ... ‘The cheeky old man. Can it be something big, immense and strong that created him? It can be ...’

Here we must conclude that the corpus indicates that the analysis of both Prinsloo and Louwrens was too rigid.

Finally, contrary to claim (a), numerous examples are found in the corpus of na in combination with afa, but only in the order na afa, and not vice versa. At least Louwrens, in principle, suggests that ‘[n]a and afa may in certain instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.
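How concordance lines for a two-word sequence can be culled from running text may be sketched as follows. This is a minimal keyword-in-context routine written for illustration, assuming a plain whitespace-and-punctuation tokeniser; the names `tokenize` and `concordance` are hypothetical and do not describe the software used for PSC. The sample string is the text of concordance line 2 in Table 8.

```python
import re

def tokenize(text):
    """Split text into word tokens; \\w also matches letters with diacritics such as š."""
    return re.findall(r"\w+", text)

def concordance(tokens, sequence, width=6):
    """Return (left, node, right) triples for every occurrence of a token sequence."""
    target = [w.lower() for w in sequence]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines

sample = "Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela"
hits = concordance(tokenize(sample), ["na", "afa"], width=3)
for left, node, right in hits:
    print(f"{left:>30}  {node}  {right}")
```

Because the match is on the ordered sequence, querying for afa na instead would return nothing for this sample, which is how the one-directional ordering noted above shows up in practice.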

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from and is guided by successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9)

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, being ‘hidden’ in 4 million words of running text.

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

1 tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše

2 16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela

3 ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di

4 molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša

5 ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako

6 mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo

7 a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo

8 hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se

9 ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela

10 lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re

11 ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke

12 boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya

13 Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane

14 le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu

15 iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong

16 a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di

17 le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše

18 yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka

19 se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI

20 tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI

21 mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna

22 ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo

23 le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O

24 be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’ (and hence to ‘acquaint themselves with’) the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition, or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook, used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
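Isolating a core vocabulary from frequency counts can be sketched as below. This is an illustrative routine only: the function name `core_vocabulary` is hypothetical, and the token list is a toy sample built from the Sepedi sentences of Table 9, not a corpus. A real selection of the top 1 000 words would be run over the full corpus.

```python
from collections import Counter

def core_vocabulary(tokens, size):
    """Return the `size` most frequent word-forms, case-folded, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(size)]

# Toy sample (assumption): the Sepedi sentences of Table 9, run together.
tokens = "Ke nyaka gore o nthuše Re bona tau Re thuša bona Tla mo".split()

top = core_vocabulary(tokens, 2)
print(top)
```

Note that a raw count of word-forms treats homographs as one item: in Table 9, bona ‘see’ and bona ‘they/them’ are counted together, so a course compiler would still have to separate such cases by hand or with annotation.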

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126)

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1 the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix, while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
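Retrieving runs of identical word-forms, such as several le’s or ba’s in a row, is a simple query to automate. The sketch below is an assumption-laden illustration (the function name `find_runs` is hypothetical, and it says nothing about how PSC was actually queried); the sample tokens are taken from corpus line 13 of Table 10.

```python
from itertools import groupby

def find_runs(tokens, word, min_len=3):
    """Return (start_index, run_length) for each run of `word` at least `min_len` long."""
    runs = []
    position = 0
    # groupby collects consecutive tokens that are equal after case-folding
    for key, group in groupby(tokens, key=str.lower):
        length = sum(1 for _ in group)
        if key == word and length >= min_len:
            runs.append((position, length))
        position += length
    return runs

line = "ga lapa le le le le le latelago O be a tseba gabotse".split()
runs = find_runs(line, "le")
print(runs)
```

The query only locates the runs; assigning each le its grammatical function (relative pronoun, subject concord, and so on) remains the analytical task that the teacher sets for the learner.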

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’

M Ke nyaka gore o nthuše ‘I want you to help me’ S

M bona ‘see’

M Re bona tau ‘We see a lion’ S

M bona ‘they/them’

M Re thuša bona ‘We help them’ S

M bego ‘which was busy’

M Batho ba ba bego ba reka ‘The people who were busy buying’ S

M tla ‘come, shall/will’

M Tla mo ‘Come here’ S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa [le le le] bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga

8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e [le le le] botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa

13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa [le le le le] le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba

16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto [le le le] swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale

18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu [le le le] golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge

29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo [le le le] bonago le golofetše le e sa le le gobala mohlang woo Banna ba

32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana [le le le] hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi [ba ba ba] badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi

127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile [ba ba ba] tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg

185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la [ba ba ba] amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke

259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi [ba ba ba ba] bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen

272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se [ba ba ba] hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo

312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma [ba ba ba] mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge

317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo [ba ba Ba] bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe O tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created, he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots, and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers, compiled by DJ Prinsloo, are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
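The second approach, comparison against a stored list of top-frequency word-forms, can be sketched as follows. This is a minimal illustration and not the WordPerfect 9 implementation, whose internals are not described here; the names `build_wordlist` and `unrecognised` are hypothetical, and the eight-token ‘corpus’ is a toy sample taken from paragraph (9).

```python
import re
from collections import Counter

def build_wordlist(corpus_tokens, size):
    """Keep the `size` most frequent word-forms of a corpus as the stored list."""
    counts = Counter(t.lower() for t in corpus_tokens)
    return {word for word, _ in counts.most_common(size)}

def unrecognised(text, wordlist):
    """Return the word-forms in `text` that are not in the stored list."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, diacritics included
    return [w for w in words if w.lower() not in wordlist]

# Toy corpus (assumption): a few word-forms from paragraph (9).
corpus_tokens = "di ka šomišwa go megala ya Telkom ka moka".split()
wordlist = build_wordlist(corpus_tokens, 100)

flagged = unrecognised("Dikarata di ka šomišwa", wordlist)
print(flagged)
```

Here Dikarata is flagged only because the toy list is tiny; the sketch also makes plain why conjunctively written languages fare worse, since each long orthographic word must be present in the list as a whole.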

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200 Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala) Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
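The two success rates follow directly from the counts given for (8) and (9). As a worked check (the function name `success_rate` is ours, introduced only for illustration):

```python
def success_rate(total_words, unrecognised):
    """Percentage of word-forms that the spellchecker recognised."""
    return 100 * (total_words - unrecognised) / total_words

zulu = success_rate(41, 12)    # isiZulu paragraph (8): 29 of 41 recognised
sepedi = success_rate(46, 2)   # Sepedi paragraph (9): 44 of 46 recognised

print(round(zulu), round(sepedi))
```

Rounded to whole percentages, 29/41 gives 71 and 44/46 gives 96, matching the figures quoted above.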

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research we have seen that, in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study, hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1. This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2. A different approach to the research presented in this section can be found in De Schryver (1999).

3. Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4. Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960:37)), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5. High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6. Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.

Southern African Linguistics and Applied Language Studies 2001 19 111ndash131 131

(URLs last accessed on 16 April 2001)

Bastin Y Coupez A amp De Halleux B 1983Classification lexicostatistique des languesbantoues (214 releveacutes) Bulletin desSeacuteances de lrsquoAcadeacutemie Royale desSciences drsquoOutre-Mer 27(2) 173ndash199

Bona Zulu Imagazini Yesizwe Durban June2000

Burssens A 1939 Tonologische schets vanhet Tshiluba (Kasayi Belgisch Kongo)Antwerp De Sikkel

Calzolari N 1996 Lexicon and Corpus a Multi-faceted Interaction In Gellerstam M et al(eds) Euralex rsquo96 Proceedings I GothenburgGothenburg University pp 3ndash16

Concise Oxford Dictionary Ninth Edition OnCD-ROM 1996 Oxford Oxford UniversityPress

De Clercq A amp Willems E 19603 DictionnaireTshilubagrave-Franccedilais Leacuteopoldville Imprimeriede la Socieacuteteacute Missionnaire de St Paul

De Schryver G-M 1999 Cilubagrave PhoneticsProposals for a lsquocorpus-based phoneticsfrom belowrsquo-approach Ghent Recall

De Schryver G-M amp Prinsloo DJ 2000 Thecompilation of electronic corpora with spe-cial reference to the African languagesSouthern African Linguistics andApplied Language Studies 18 89ndash106

De Schryver G-M amp Prinsloo DJ forthcomingElectronic corpora as a basis for the compi-lation of African-language dictionaries Part1 The macrostructure South AfricanJournal of African Languages 21

Gabrieumll [Vermeersch] sd4 [(19213)] Etude deslangues congolaises bantoues avec applica-tions au tshiluba Turnhout Imprimerie delrsquoEacutecole Professionnelle St Victor

Hurskainen A 1998 Maximizing the(re)usability of language data Available atlthttpwwwhduibnoAcoHumabshurskhtmgt

Kabuta NS 1998a Inleiding tot de structuurvan het Cilubagrave Ghent Recall

Kabuta NS 1998b Loanwords in CilubagraveLexikos 8 37ndash64

Kennedy G 1998 An Introduction to CorpusLinguistics London Longman

Kruyt JG 1995 Technologies in ComputerizedLexicography Lexikos 5 117ndash137

Laver J 1994 Principles of PhoneticsCambridge Cambridge University Press

Louwrens LJ 1991 Aspects of NorthernSotho Grammar Pretoria Via AfrikaLimited

Maddieson I 1984 Patterns of SoundsCambridge Cambridge University PressSee alsolthttpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtmlgt

Morrison WM Anderson VA McElroy WF amp

McKee GT 1939 Dictionary of the TshilubaLanguage (Sometimes known as theBuluba-Lulua or Luba-Lulua) Luebo JLeighton Wilson Press

Muyunga YK 1979 Lingala and CilubagraveSpeech Audiometry Kinshasa PressesUniversitaires du Zaiumlre

Pretoria White Pages North Sotho EnglishAfrikaans Information PagesJohannesburg November 1999ndash2000

Prinsloo DJ 1985 Semantiese analise vandie vraagpartikels na en afa in Noord-Sotho South African Journal of AfricanLanguages 5(3) 91ndash95

Renouf A 1987 Moving On In Sinclair JM(ed) Looking Up An account of the COBUILDProject in lexical computing and the devel-opment of the Collins COBUILD EnglishLanguage Dictionary London Collins ELTpp 167ndash178



instances co-occur in the same question’ (1991:144), yet the only example he gives shows the co-occurrence of a and naa. Louwrens gives no actual examples of na occurring with afa, especially not when these particles follow one another directly. Table 8 lists the concordance lines culled from PSC for the sequence of question particles na afa.

The lines listed in Table 8 provide the empirical basis for a challenging semantic/pragmatic analysis in terms of the theoretical assumptions (and rigid distinction between na and afa especially) made in Prinsloo (1985) and Louwrens (1991). As far as the relation between this empirical basis and the theoretical assumptions is concerned, one would be well-advised to take heed of Calzolari’s suggestion:

‘In fact, corpus data cannot be used in a simplistic way. In order to become usable, they must be analysed according to some theoretical hypothesis, that would model and structure what would be otherwise an unstructured set of data. The best mixture of the empirical and theoretical approaches is the one in which the theoretical hypothesis is itself emerging from, and is guided by, successive analyses of the data, and is cyclically refined and adjusted to textual evidence’ (Calzolari 1996:9).

The corpus is indispensable in highlighting the co-occurrences of na and afa. No researcher would have persevered in reading the equivalent of 90 Sepedi literary works and magazines to find such empirical examples. In fact, he/she would probably have missed them anyway, ‘hidden’ as they are in 4 million words of running text.
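The kind of query that yields concordance lines such as those in Table 8 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the software actually used to query PSC: the function names `tokenize` and `concordance` and the `width` parameter are introduced here for the example, and the sample string merely joins two fragments taken from Table 8.

```python
import re

def tokenize(text):
    """Split running text into word tokens (Unicode letters such as š are kept)."""
    return re.findall(r"\w+", text)

def concordance(tokens, sequence, width=5):
    """Return (left context, match, right context) for each occurrence of `sequence`."""
    target = [w.lower() for w in sequence]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        # Compare case-insensitively so that both "Na afa" and "na afa" are found.
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            match = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, match, right))
    return lines

# Toy input: two fragments from the Table 8 concordance lines, joined into one string.
sample = ("Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa "
          "Hei thaka na afa yola morwedi wa Lenkwang")

for left, match, right in concordance(tokenize(sample), ["na", "afa"], width=3):
    print(f"{left:>25} | {match} | {right}")
```

On a real corpus the same loop would simply run over all 4 million tokens; a production tool would normally use an index rather than a linear scan.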

Table 8: Concordance lines for the sequence of question particles na afa in Sepedi

1  tša Dikgoneng Ruri re paletšwe (Letl 47) Na afa Kgoteledi o tla be a gomela gae a hweditše
2  16) dedio Bjale gona bothata bo agetše Na afa Kgoteledi mohla a di kwa o tla di thabela
3  ba go forolle MOLOGADI Sešane sa basadi Na afa o tloga o sa re tswe Peba Ke eng tše o di
4  molamo Sa monkgwana se gona ke a go botša Na afa o kwele gore mmotong wa Lekokoto ga go mpša
5  ya tšewa ke badimo ya ba gona ge e felela Na afa o sa gopola ka lepokisana la gagwe ka nako
6  mpušeletša matšatšing ale a bjana bja gago Na afa o a tseba ka mo o kilego wa re hlomola pelo
7  a tla ba a gopola Bohlapamonwana gaMashilo Na afa bosola bjo bja poso Yeo ya ba potšišo
8  hwetša ba re ga a gona MODUPI Aowa ge Na afa rena re tla o gotša wa tuka MOLOGADI Se
9  ka mabaka a mabedi La pele e be e le gore na afa yola monna wa gagwe o be a sa fo ithomela
10  lebeletše gomme ka moka ba gagabiša mahlo) Na afa le a di kwa Ruri re tlo inama sa re
11  ba a hlamula) MELADI (o a hwenahwena) Na afa o di kwele NAPŠADI (o a mmatamela) Ke
12  boa mokatong (Go kwala khwaere ya Kekwele) Na afa le kwa bošaedi bjo bo dirwago ke khwaere ya
13  Aowa fela ga re kgole le kgole KEKWELE Na afa matšatši a le ke le bone Thellenyane
14  le baki yela ya go aparwa mohla wa kgoro Na afa dikobo tšeo e be e se sutu Sutu Ee sutu
15  iša pelo kgole e fo ba metlae KOTENTSHO Na afa le ke le hlole mogwera wa rena bookelong
16  a go loba ga morwa Letanka e be e le Na afa ruri ke therešo Tša ditsotsi tšona ga di
17  le boNadinadi le boMatonya MODUPI Na afa o a bona gore o a itahlela O re sentše
18  yoo wa gago NTLABILE Kehwile mogatšaka na afa o na le tlhologelo le lerato bjalo ka
19  se nnete Gape go lebala ga go elwe mošate Na afa baisana bale ba ile ba bonagala lehono MDI
20  tlogele tšeo tša go hlaletšana (Setunyana) Na afa o ile wa šogašoga taba yela le Mmakoma MDI
21  mo ke lego gona ge o ka mpona o ka sola Na afa e ka be e le Dio Goba ke Lata Aowa monna
22  ba oretše wo o se nago muši Hei thaka na afa yola morwedi wa Lenkwang o ile wa mo
23  le mmagwe ba ka mo feleletša THOMO Na afa o a lemoga gore motho yo ga se wa rena O
24  be re hudua dijanaga tša rena kua tseleng Na afa o lemoga gore mathaka a thala a feta

To summarise this section, we see that the corpus comes in handy when pursuing fundamental linguistic research into African languages. When a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’, and hence to ‘acquaint themselves with’, the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general, one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild [… a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus’ (Renouf 1987:169).

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past, the compiler had to select these 1 000 words on the basis of his/her intuition or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account’ (Renouf 1987:68).

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler, and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
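Isolating a core vocabulary from frequency counts, as described above, amounts to counting word-forms and keeping the top of the list. The sketch below is a hedged illustration only: `core_vocabulary` and its `size` parameter are names assumed for this example, and the toy corpus consists of the four example sentences from Table 9 rather than a real corpus such as PSC.

```python
from collections import Counter
import re

def core_vocabulary(texts, size=1000):
    """Return the `size` most frequent word-forms with their counts."""
    counts = Counter()
    for text in texts:
        # Lower-case so that "Re" and "re" are counted as one word-form.
        counts.update(w.lower() for w in re.findall(r"\w+", text))
    return counts.most_common(size)

# Toy corpus (assumption): the example sentences quoted in Table 9.
toy_corpus = [
    "Ke nyaka gore o nthuše",
    "Re bona tau",
    "Re thuša bona",
    "Tla mo",
]

for word, freq in core_vocabulary(toy_corpus, size=2):
    print(word, freq)
```

With a full corpus, `size=1000` would give the 1 000-word core vocabulary mentioned above; note that a raw count like this does not distinguish homographs such as bona ‘see’ and bona ‘they/them’.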

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures, and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora […] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists’ (Kruyt 1995:126).

As an illustration, we can look at the rather complex and intricate situation in Sepedi where up to five le’s or up to four ba’s are used in a row. In Tables 10 and 11, a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun and class 5 prefix; while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as for workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where the learner is expected to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
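Retrieving runs of identical particles, such as the sequences of le’s and ba’s discussed above, is a simple matter of grouping consecutive tokens. The following sketch assumes a hypothetical helper `repeated_runs` with a `min_run` threshold (both names invented for this illustration); the two test strings are taken from corpus line 13 of Table 10 and corpus line 259 of Table 11.

```python
import re
from itertools import groupby

def repeated_runs(text, word, min_run=3):
    """Return (start_index, run_length) for each run of `word` of at least `min_run` tokens."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    runs = []
    position = 0
    # groupby collapses consecutive identical tokens into one group.
    for token, group in groupby(tokens):
        length = len(list(group))
        if token == word and length >= min_run:
            runs.append((position, length))
        position += length
    return runs

# Fragments from Table 10 (corpus line 13) and Table 11 (corpus line 259).
line_13 = ("se se bego se le mokgahlo ga lapa le le le le le latelago "
           "O be a tseba gabotse gore")
line_259 = "ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase"

print(repeated_runs(line_13, "le"))   # [(8, 5)]
print(repeated_runs(line_259, "ba"))  # [(8, 4)]
```

The search only locates the sequences; assigning each le or ba its grammatical function remains the analytical task for teacher and learner.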

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information’ (Renouf 1987:169).

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M  gore ‘that, so that’
M  Ke nyaka gore o nthuše ‘I want you to help me’  S
M  bona ‘see’
M  Re bona tau ‘We see a lion’  S
M  bona ‘they/them’
M  Re thuša bona ‘We help them’  S
M  bego ‘which was busy’
M  Batho ba ba bego ba reka ‘The people who were busy buying’  S
M  tla ‘come, shall/will’
M  Tla mo ‘Come here’  S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 | Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa | le le le | bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 | Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e | le le le | botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 | go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa | le le le le | le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 | go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto | le le le | swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 | yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu | le le le | golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 | ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo | le le le | bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 | seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana | le le le | hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 | meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi | ba ba ba | badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 | mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile | ba ba ba | tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 | ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la | ba ba ba | amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 | ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi | ba ba ba ba | bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 | tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se | ba ba ba | hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 | itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma | ba ba ba | mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 | etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo | ba ba Ba | bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’’ (Louwrens 1991:121).

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created, he did not come out of a woman; it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1  e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2  a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3  gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4  bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5  gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6  ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7  botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8  ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9  a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1  seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2  thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3  a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4  o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5  a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6  a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7  lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8  ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9  tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10  dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11  ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12  Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13  lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14  le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15  ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Because the conjunctive orthography of isiXhosa and isiZulu writes strings of morphemes as single orthographic words, and so yields far more distinct word-forms, the software is less effective for these languages than for the disjunctively written Sepedi and Setswana.
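The second approach, comparing typed words with a stored list of top-frequency word-forms, can be sketched as follows. This is a minimal illustration and not the WordPerfect 9 implementation: `build_word_list`, `unrecognised` and `top_n` are names assumed for the example, and the toy corpus is an invented string of Sepedi word-forms standing in for a real frequency list.

```python
from collections import Counter
import re

WORD = re.compile(r"\w+")

def build_word_list(corpus_texts, top_n):
    """Store the `top_n` most frequent word-forms of the corpus as a set."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(w.lower() for w in WORD.findall(text))
    return {word for word, _ in counts.most_common(top_n)}

def unrecognised(text, word_list):
    """Return the typed word-forms that are absent from the stored list."""
    return [w for w in WORD.findall(text) if w.lower() not in word_list]

# Toy corpus (assumption): invented for illustration, not real corpus data.
toy_corpus = ["ka moka ka moka ka go", "go ya ka moka"]
word_list = build_word_list(toy_corpus, top_n=3)   # {"ka", "moka", "go"}

print(unrecognised("Telkom ka moka ya go", word_list))  # ['Telkom', 'ya']
```

A spellchecker of this kind flags every word-form outside the list, whether misspelt or merely infrequent, which is why the size and the frequency basis of the stored list matter so much.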

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200 Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala) Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
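The two success rates follow directly from the counts given for (8) and (9): the share of word-forms that the spellchecker did recognise, rounded to the nearest per cent. The helper name `success_rate` below is an assumption made for this worked example.

```python
def success_rate(total_word_forms, unrecognised):
    """Percentage of word-forms that were recognised by the spellchecker."""
    return 100 * (total_word_forms - unrecognised) / total_word_forms

# isiZulu paragraph (8): 12 of 41 word-forms not recognised.
print(round(success_rate(41, 12)))  # 71

# Sepedi paragraph (9): 2 of 46 word-forms not recognised.
print(round(success_rate(46, 2)))   # 96
```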

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1. This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2. A different approach to the research presented in this section can be found in De Schryver (1999).

3. Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4. Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme, except and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ɸ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³:7), the glottal variant [h] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5. High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6. Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu. Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho English Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp 167–178.

Stappers L 1949 Tonologische bijdrage tot destudie van het werkwoord in het TshilubaBrussels Koninklijk Belgisch KoloniaalInstituut

Summers D (director) 19953 LongmanDictionary of Contemporary English ThirdEdition Harlow Longman Dictionaries

Swadesh M 1952 Lexicostatistic Dating ofPrehistoric Ethnic Contacts Proceedingsof the American Philosophical Society96 452ndash463

Swadesh M 1953 Archeological andLinguistic Chronology of Indo-EuropeanGroups American Anthropologist 55349ndash352

Swadesh M 1955 Towards Greater Accuracyin Lexicostatistic Dating InternationalJournal of American Linguistics 21121ndash137

References

Page 15: Corpus applications for the African languages, with ... · erary surveys, sociolinguistic considerations, lexicographic compilations, stylistic studies, etc. Due to space restrictions,

Southern African Linguistics and Applied Language Studies 2001 19 111ndash131 125

itation, corpora reveal both correct and incorrect traditional findings.

Corpus applications in the field of language teaching and learning

Compiling pronunciation guides

The corpus-based approach to the phonetic description of a language’s lexicon that was described above has, in addition, a first important application in the field of language acquisition. For Cilubà, the described study instantly led to the compilation of two concise corpus-based pronunciation guides: a Phonetic frequency-lexicon Cilubà-English consisting of 350 entries, and a converted Phonetic frequency-lexicon English-Cilubà (De Schryver 1999:55–68, 69–87). Provided that the target users know the conventions of the International Phonetic Alphabet (IPA), these two pronunciation guides give them the possibility to ‘retrieve’, ‘pronounce’ and ‘learn’ (and hence to ‘acquaint themselves with’) the 350 most frequent words from the Lubà language.

Compiling modern textbooks, syllabi, workbooks, manuals, etc.

Pronunciation guides are but one instance of the manifold contributions corpora can make to the field of language teaching and learning. In general one can say that learners are able to master a target language faster if they are presented with the most frequently used words, collocations, grammatical structures and idioms in the target language, especially if the quoted material represents authentic (also called ‘naturally-occurring’ or ‘real’) language use. In this respect Renouf, reporting on the compilation of a ‘lexical syllabus’, writes:

‘With the resources and expertise which were available to us at Cobuild, [... a]n approach which immediately suggested itself was to identify the words and uses of words which were most central to the language by virtue of their high rate of occurrence in our Corpus.’ (Renouf 1987:169)

The consultation of corpora is therefore crucial in compiling modern textbooks, syllabi, workbooks, manuals, etc.

The compiler of a specific language course for scholars or students may decide, for example, that a basic or core vocabulary of, say, 1 000 words should be mastered. In the past the compiler had to select these 1 000 words on the basis of his/her intuition, or through informant elicitation, which was not really satisfactory since, on the one hand, many highly used words were accidentally left out, and on the other hand, such a selection often included words of which the frequency of use was questionable. According to Renouf:

‘There has also long been a need in language-teaching for a reliable set of criteria for the selection of lexis for teaching purposes. Generations of linguists have attempted to provide lists of ‘useful’ or ‘important’ words to this end, but these have fallen short in one way or another, largely because empirical evidence has not been sufficiently taken into account.’ (Renouf 1987:68)

With frequency counts derived from a corpus at his/her disposal, the basic or core vocabulary can easily and accurately be isolated by the course compiler and presented in various useful ways to the scholar or student, for example by means of full sentences in language laboratory exercises. Compare Table 9, which is an extract from the first lesson in the First Year’s Sepedi Laboratory Textbook, used at the University of Pretoria and the Technikon Pretoria, reflecting the five most frequently used words in Sepedi in context.

Here the corpus allows learners of Sepedi, from the first lesson onwards, to be provided with naturally-occurring text revolving around the language’s basic or core vocabulary.
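The isolation of a core vocabulary from frequency counts can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `core_vocabulary`, the tokenisation rule (every unbroken run of letters is one word-form) and the toy corpus are not taken from the original study, and the toy corpus simply reuses the five example sentences quoted in Table 9.

```python
import re
from collections import Counter


def core_vocabulary(corpus_text: str, size: int) -> list[tuple[str, int]]:
    """Return the `size` most frequent word-forms with their counts."""
    # Assumption: every unbroken run of letters is one word-form. A real
    # project would need a tokeniser tuned to the orthography in question.
    tokens = re.findall(r"[^\W\d_]+", corpus_text.lower())
    return Counter(tokens).most_common(size)


# Toy "corpus": the five example sentences quoted in Table 9.
SAMPLE = (
    "Ke nyaka gore o nthuše. Re bona tau. Re thuša bona. "
    "Batho ba ba bego ba reka. Tla mo."
)

if __name__ == "__main__":
    for word_form, count in core_vocabulary(SAMPLE, 3):
        print(word_form, count)
```

Note that a plain count of orthographic word-forms conflates homographs such as bona ‘see’ and bona ‘they/them’, which Table 9 lists as separate items; keeping them apart would require annotation that this sketch does not attempt.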

Teaching morpho-syntactic structures

It is well-known that African-language teachers have a hard time teaching morpho-syntactic structures, and getting learners to master the required analysis and description. This task is much easier when authentic examples, taken from a variety of written and oral sources, are used rather than artificial ones made up by the teacher. This is especially applicable to cases where the teacher has to explain more advanced or complicated structures, and will have difficulty in thinking up suitable examples. According to Kruyt, such structures were largely ignored in the past:

‘Very large electronic text corpora [...] contain sentence and word usage information that was difficult to collect until recently, and consequently was largely ignored by linguists.’ (Kruyt 1995:126)

As an illustration we can look at the rather complex and intricate situation in Sepedi where up to five le’s, or up to four ba’s, are used in a row. In Tables 10 and 11 a selection of concordance lines extracted from PSC is listed for both these instances.

The relation between grammatical function and meaning of the different le’s in Table 10 can, for example, be pointed out. In corpus line 1, the first le is a conjunctive particle, followed by the class 5 relative pronoun and the class 5 subject concord. The sequence in corpus line 8 is: copulative verb stem, class 5 relative pronoun, and class 5 prefix; while in corpus line 29 it is: class 5 relative pronoun, 2nd person plural subject concord, and class 5 object concord, etc.

As the concordance lines listed in Tables 10 and 11 are taken from the living language, they represent excellent material for morpho-syntactic analysis in the classroom situation, as well as workbook exercises, homework, etc. In retrieving such examples in abundance from the corpus, the teacher can focus on the daunting task of guiding the learner in distinguishing between the different le’s and ba’s, instead of trying to come up with such examples on the basis of intuition and/or through informant elicitation. In addition, in an educational system where it is expected from the learner to perform a variety of exercises/tasks on his/her own, basing such exercises/tasks on ‘real’ language can only be welcomed.
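Retrieving runs of identical particles from a corpus can be sketched as a small keyword-in-context query. The sketch below is an illustration only: the function name `kwic_runs`, its parameters and the regular-expression approach are assumptions rather than the tool used to query PSC, and the sample text is pieced together from corpus lines 8 and 13 of Table 10.

```python
import re


def kwic_runs(text: str, word: str, min_run: int,
              width: int = 40) -> list[tuple[str, str, str]]:
    """Find runs of at least `min_run` consecutive occurrences of `word`.

    Returns (left context, matched run, right context) triples.
    """
    w = re.escape(word)
    # (min_run - 1) or more repetitions of "word + whitespace", then a final
    # "word"; the look-arounds stop matches inside longer words such as "lebe".
    pattern = re.compile(
        rf"(?<!\w)(?:{w}\s+){{{min_run - 1},}}{w}(?!\w)", re.IGNORECASE
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Fragments of corpus lines 8 and 13 from Table 10.
SAMPLE = (
    "kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona. "
    "se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba"
)

if __name__ == "__main__":
    for left, run, right in kwic_runs(SAMPLE, "le", 3):
        print(f"{left:>40} | {run} | {right}")
```

Such a query only locates the sequences; assigning a grammatical function to each le or ba remains the analytical task of teacher and learner.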

Teaching contrasting structures

Singling out top-frequency words and top-frequency grammatical structures from a corpus should obviously receive most attention for language teaching and learning purposes. Conversely, rather infrequent and rare structures are often needed in order to be contrasted with the more common ones. For both these extremes, where one needs to be selective when it comes to frequent instances and exhaustive when it comes to infrequent ones, the corpus can successfully be queried. Renouf argues:

‘we could seek help from the computer, which would accelerate the search for relevant data on each word, allow us to be selective or exhaustive in our investigation, and supplement our human observations with a variety of automatically retrieved information.’ (Renouf 1987:169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly indexed and studied in context, and contrasted with their more frequently used counterparts.

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M  gore ‘that, so that’
M  Ke nyaka gore o nthuše ‘I want you to help me’  S
M  bona ‘see’
M  Re bona tau ‘We see a lion’  S
M  bona ‘they/them’
M  Re thuša bona ‘We help them’  S
M  bego ‘which was busy’
M  Batho ba ba bego ba reka ‘The people who were busy buying’  S
M  tla ‘come, shall/will’
M  Tla mo ‘Come here’  S

Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend: relative pronoun class 5; copulative verb stem; conjunctive particle; prefix class 5; subject concord 2nd person plural; object concord 2nd person plural; subject concord class 5; object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga
8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa
13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba
16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale
18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge
29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba
32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend: relative pronoun class 2; auxiliary verb stem; copulative verb stem; subject concord class 2; object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi
127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg
185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke
259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen
272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo
312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge
317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991:121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; and on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).
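The second approach, comparison with a stored list of word-forms, can be sketched as follows. This is a minimal illustration and not the WordPerfect implementation: the function name `unrecognised_word_forms`, the tokenisation rule and the tiny stored list (a handful of word-forms from Table 9) are assumptions made for the example.

```python
import re


def unrecognised_word_forms(text: str, stored_list: set[str]) -> list[str]:
    """Return the word-forms in `text` that are absent from `stored_list`."""
    # Assumption: every unbroken run of letters is one word-form, and the
    # comparison ignores case.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [token for token in tokens if token.lower() not in stored_list]


# Hypothetical stored list, far smaller than any real one.
STORED_LIST = {"re", "bona", "tau", "thuša", "tla", "mo"}

if __name__ == "__main__":
    # "thusa" lacks its diacritic and is therefore flagged.
    print(unrecognised_word_forms("Re bona tau. Re thusa bona.", STORED_LIST))
```

A checker of this kind flags every word-form that is missing from the list, whether misspelled or merely unlisted, which is why the size and composition of the stored list matter so much.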

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000:114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000:24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15, R20 (R2 ke mahala), R50, R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
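The two success rates follow directly from the counts given for (8) and (9). As a quick check, assuming that ‘success rate’ simply means the percentage of word-forms in the test paragraph that the spellchecker recognises (the function name `success_rate` is introduced here for illustration only):

```python
def success_rate(total_word_forms: int, unrecognised: int) -> float:
    """Percentage of word-forms in a test paragraph that were recognised."""
    if total_word_forms <= 0:
        raise ValueError("total_word_forms must be positive")
    if not 0 <= unrecognised <= total_word_forms:
        raise ValueError("unrecognised must lie between 0 and total_word_forms")
    return 100 * (total_word_forms - unrecognised) / total_word_forms


if __name__ == "__main__":
    print(round(success_rate(41, 12)))  # isiZulu paragraph in (8): 29 of 41 recognised
    print(round(success_rate(46, 2)))   # Sepedi paragraph in (9): 44 of 46 recognised
```

Rounded to whole percentages, these give the 71% and 96% reported for the isiZulu and Sepedi paragraphs respectively.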

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1. This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research - Flanders (Belgium).

2. A different approach to the research presented in this section can be found in De Schryver (1999).

3. Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4. Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme except ɸ and h which belong to a same phoneme’ (1979:47). Here, however, Muyunga is mixing different dialects. While [ ɸ ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³:7)), the glottal variant [ h ] is used by, for instance, some Bakwà Kalonji (Stappers 1949:xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5. High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6. Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.

References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu. Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. Forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at: <httpwwwhduibnoAcoHumabshurskhtm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also: <httpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtml>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages. North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.

Page 16: Corpus applications for the African languages, with ... · erary surveys, sociolinguistic considerations, lexicographic compilations, stylistic studies, etc. Due to space restrictions,

Prinsloo and de Schryver Corpus applications for the African languages126

until recently and consequently waslargely ignored by linguistsrsquo (Kruyt1995126)

As an illustration we can look at the rather com-plex and intricate situation in Sepedi where upto five lersquos or up to four barsquos are used in a rowIn Tables 10 and 11 a selection of concordancelines extracted from PSC is listed for both theseinstances

The relation between grammatical functionand meaning of the different lersquos in Table 10can for example be pointed out In corpus line1 the first le is a conjunctive particle followedby the class 5 relative pronoun and the class 5subject concord The sequence in corpus line8 is copulative verb stem class 5 relative pro-noun and class 5 prefix while in corpus line29 it is class 5 relative pronoun 2nd personplural subject concord and class 5 object con-cord etc

As the concordance lines listed in Tables 10and 11 are taken from the living language theyrepresent excellent material for morpho-syntac-tic analysis in the classroom situation as wellas workbook exercises homework etc Inretrieving such examples in abundance fromthe corpus the teacher can focus on the daunt-ing task of guiding the learner in distinguishingbetween the different lersquos and barsquos instead oftrying to come up with such examples on thebasis of intuition andor through informant elici-tation In addition in an educational systemwhere it is expected from the learner to perform

a variety of exercisestasks on hisher ownbasing such exercisestasks on lsquorealrsquo languagecan only be welcomed

Teaching contrasting structuresSingling out top-frequency words and top-fre-quency grammatical structures from a corpusshould obviously receive most attention for lan-guage teaching and learning purposesConversely rather infrequent and rare struc-tures are often needed in order to be contrast-ed with the more common ones For both theseextremes where one needs to be selectivewhen it comes to frequent instances andexhaustive when it comes to infrequent onesthe corpus can successfully be queried Renoufargues

lsquowe could seek help from the comput-er which would accelerate the searchfor relevant data on each word allowus to be selective or exhaustive in ourinvestigation and supplement ourhuman observations with a variety ofautomatically retrieved informationrsquo(Renouf 1987169)

Formulated differently, in using a corpus, certain related grammatical structures can easily be contrasted and studied, especially in those cases where the structures in question are rare and hard, if not impossible, to find by reading and marking. Following exhaustive corpus queries, these structures can be instantly

Table 9: Extract from the First Year’s Sepedi Laboratory Textbook

M gore ‘that, so that’
M Ke nyaka gore o nthuše ‘I want you to help me’ S

M bona ‘see’
M Re bona tau ‘We see a lion’ S

M bona ‘they/them’
M Re thuša bona ‘We help them’ S

M bego ‘which was busy’
M Batho ba ba bego ba reka ‘The people who were busy buying’ S

M tla ‘come, shall/will’
M Tla mo ‘Come here’ S


Table 10: Morpho-syntactic analysis of up to five le’s in a row in Sepedi

Legend:
relative pronoun class 5
copulative verb stem
conjunctive particle
prefix class 5
subject concord 2nd person plural
object concord 2nd person plural
subject concord class 5
object concord class 5

1 Go ile gwa direga mola malokeišene a mantši a thewa gwa agiwa le le le bitšwago Donsa Lona le ile la thongwa ke ba municipality ka go aga

8 Taba ye e tšwa go Morena rena ga re kgone go go botša le lebe le ge e le le le botse Rebeka šo o a mmona Mo tšee o sepele e be mosadi wa morwa wa

13 go tšwa ka sefero a ngaya sethokgwa se se bego se le mokgahlo ga lapa le le le le le latelago O be a tseba gabotse gore barwa ba Rre Hau o tlo ba hwetša ba

16 go tlošana bodutu ga rena go tlamegile go fela ga ešita le lona leeto le le le swereng ge nako ya lona ya go fela e fihla le swanetše go fela Bjale

18 yeo e lego gona ke ya gore a ka ba a bolailwe ke motho Ga se fela lehu le le le golomago dimpa tša ba motse e šetše e le a mmalwa Mabakeng ohle ge

29 ke yena monna yola wa mohumi le bego le le ka gagwe maabane Letsogo le le le bonago le golofetše le e sa le le gobala mohlang woo Banna ba

32 seo re ka se dirago Ga se ka ka ka le bona letšatši la madi go swana le le le hlabago le Bona mahlasedi a mahubedu a lona a tsotsometša dithaba

Table 11: Morpho-syntactic analysis of up to four ba’s in a row in Sepedi

Legend:
relative pronoun class 2
auxiliary verb stem
copulative verb stem
subject concord class 2
object concord class 2

22 meloko ya bona ba tlišitše dineo e be e le bagolo bala ba meloko balaodi ba ba ba badilwego Ba tlišitše dineo tša bona pele ga Morena e be e le dikoloi

127 mediro ya bobona yeo ba bego ba sešo ba e phetha malapeng a bobona Ba ile ba ba ba tlogela le tšona dijo tšeo ba bego ba dutše ba dija A ešita le bao ba beg

185 ya Modimo 6 Le le ba go dirišana le Modimo re Le eletša gore le se ke la ba ba ba amogetšego kgaugelo ya Modimo mme e se ke ya le hola selo Gobane o re Ke

259 ba be ba topa tša fase baeng bao bona ba ile ba ba amogela ka tše pedi ba ba ba ba bea fase ka a mabedi ka gore lešago la moeng le bewa ke mongwotse gae Baen

272 tle go ya go hwetša tšela di bego di di kokotela BoPoromane le bona ga se ba ba ba hlwa ba laela motho Sa bona e ile ya fo ba go tšwa ba tlemolla makaba a bo

312 itlela go itiša le koma le legogwa fela ka tsebo ya gagwe ya go tsoma ba ba ba mmea yo mongwe wa baditi ba go laola lesolo O be a fela a re ka a mangwe ge

317 etšega go mpolediša Mola go bago bjalo le ditaba di emago ka mokgwa woo ba ba Ba bea marumo fase Ga se ba no a lahlela fase sesolo Ke be ke thathankg


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991: 121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.
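Retrieving every occurrence of a rare form together with its context is a standard keyword-in-context (KWIC) query. The sketch below is illustrative only: the function `kwic`, its `width` parameter and the bracket notation for the keyword are assumptions for the example and do not describe the concordancer actually used on PSC. The two sample lines are corpus line 2 of Table 12 and corpus line 5 of Table 13.

```python
from typing import List


def kwic(lines: List[str], keyword: str, width: int = 5) -> List[str]:
    """Return keyword-in-context lines, with up to `width` tokens on either
    side of each occurrence of `keyword` (matched case-insensitively)."""
    out = []
    for line in lines:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                out.append(f"{left} [{tok}] {right}".strip())
    return out


# Two authentic lines copied from Tables 12 and 13.
corpus_lines = [
    "a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a",
    "a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le",
]

monneng_hits = kwic(corpus_lines, "monneng", width=3)
mosading_hits = kwic(corpus_lines, "mosading", width=3)
print(monneng_hits)
print(mosading_hits)
```

Because the query is exhaustive, the number of hits for the suffixal forms can be set directly against the number of hits for the prefixal strategy when deciding how much classroom attention each deserves.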

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who rely solely on introspection and informants. So, for example, the meaning

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a

2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a

3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena

4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare

5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka

6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe

7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa

8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme

9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege

2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana

3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le

4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe

5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le

6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla

7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo

8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go

9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto

10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le

11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola

12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona

13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka

14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e

15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a


of phrases such as Gobane monna ge a bopša ga a tšwa mosading, ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9, within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
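The second, list-based approach can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the WordPerfect 9 implementation: the function names `build_wordlist` and `unrecognised`, the whitespace tokenisation and the punctuation stripping are all hypothetical, and the toy corpus merely reuses a handful of tokens from the Sepedi paragraph in (9), whereas a real stored list would hold thousands of word-forms.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_wordlist(corpus_tokens: Iterable[str], top_n: int) -> Set[str]:
    """Keep the `top_n` most frequent word-forms found in the corpus."""
    counts = Counter(tok.lower() for tok in corpus_tokens)
    return {form for form, _ in counts.most_common(top_n)}


def unrecognised(text: str, wordlist: Set[str]) -> List[str]:
    """Return the word-forms of `text` that are absent from the stored list."""
    words = [w.strip(".,;:!?()\"'") for w in text.split()]
    return [w for w in words if w and w.lower() not in wordlist]


# Toy corpus for illustration only.
corpus = "ka moka ka go ya ka moka di ka go ye karata".split()
stored = build_wordlist(corpus, top_n=3)   # the three most frequent forms
flagged = unrecognised("Dikarata di ka šomišwa go megala", stored)
print(sorted(stored))
print(flagged)
```

The sketch also hints at why orthography matters: in a disjunctively written language many short, highly frequent forms recur as separate words and are covered by a modest list, while a conjunctively written language fuses the same material into long, rarely repeated word-forms.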

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000: 114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela. Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo. Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni.

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000: 24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200. Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala). Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng.

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46 the success rate is as high as 96%.
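The two success rates follow directly from the counts given for (8) and (9): the share of word-forms that were recognised, expressed as a percentage. The helper below simply reproduces that arithmetic; the function name `success_rate` is chosen here for illustration.

```python
def success_rate(total_words: int, unrecognised: int) -> float:
    """Percentage of word-forms that the spellchecker recognised."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - unrecognised) / total_words


zulu = success_rate(41, 12)    # paragraph (8): 29 of 41 recognised
sepedi = success_rate(46, 2)   # paragraph (9): 44 of 46 recognised
print(f"isiZulu: {zulu:.0f}%  Sepedi: {sepedi:.0f}%")  # isiZulu: 71%  Sepedi: 96%
```

With only one short paragraph per language, these figures indicate a tendency rather than a measured average.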

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have, at present, become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1. This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2. A different approach to the research presented in this section can be found in De Schryver (1999).

3. Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4. Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that: ‘Each simple consonant represents a phoneme except [ɸ] and h which belong to a same phoneme’ (1979: 47). Here, however, Muyunga is mixing different dialects. While [ɸ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³: 37), the glottal variant [h] is used by, for instance, some Bakwà Kalonji (Stappers 1949: xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5. High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6. Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban. June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St. Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St. Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho English Afrikaans Information Pages. Johannesburg. November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In Sinclair JM (ed) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


3Laverrsquos phonetic taxonomy (1994) is used as atheoretical framework throughout this section

4Strangely enough Muyunga seems to feel theneed to combine the different phone invento-ries into one new inventory In this respect hedistinguishes the voiceless bilabial fricativeand the voiceless glottal fricative claiming thatlsquoEach simple consonant represents aphoneme except and h which belong to asame phonemersquo (1979 47) Here howeverMuyunga is mixing different dialects While[ ] is used by for instance the BakwagraveDiishograve mdash their dialect giving rise to what ispresently known as lsquostandard Cilubagraversquo (DeClercq amp Willems 196037) mdash the glottal vari-ant [ h ] is used by for instance some BakwagraveKalonji (Stappers 1949xi) The glottal variantnot being the standard is seldom found in theliterature A rare example is the dictionary byMorrison Anderson McElroy amp McKee(1939)

5High tones being more frequent than low onesKabuta restricts the tonal diacritics to low tonefalling tone and rising tone The first exampleshould have been [ kuna ]

6Considering tone (and quantity) as an integralpart of vocalic-resonant identity does notseem far-fetched as long as lsquowords in isola-tionrsquo are concerned The implications of suchan approach for lsquowords in contextrsquo howeverdefinitely need further research

Southern African Linguistics and Applied Language Studies 2001 19 111ndash131 131

(URLs last accessed on 16 April 2001)

Bastin Y Coupez A amp De Halleux B 1983Classification lexicostatistique des languesbantoues (214 releveacutes) Bulletin desSeacuteances de lrsquoAcadeacutemie Royale desSciences drsquoOutre-Mer 27(2) 173ndash199

Bona Zulu Imagazini Yesizwe Durban June2000

Burssens A 1939 Tonologische schets vanhet Tshiluba (Kasayi Belgisch Kongo)Antwerp De Sikkel

Calzolari N 1996 Lexicon and Corpus a Multi-faceted Interaction In Gellerstam M et al(eds) Euralex rsquo96 Proceedings I GothenburgGothenburg University pp 3ndash16

Concise Oxford Dictionary Ninth Edition OnCD-ROM 1996 Oxford Oxford UniversityPress

De Clercq A amp Willems E 19603 DictionnaireTshilubagrave-Franccedilais Leacuteopoldville Imprimeriede la Socieacuteteacute Missionnaire de St Paul

De Schryver G-M 1999 Cilubagrave PhoneticsProposals for a lsquocorpus-based phoneticsfrom belowrsquo-approach Ghent Recall

De Schryver G-M amp Prinsloo DJ 2000 Thecompilation of electronic corpora with spe-cial reference to the African languagesSouthern African Linguistics andApplied Language Studies 18 89ndash106

De Schryver G-M amp Prinsloo DJ forthcomingElectronic corpora as a basis for the compi-lation of African-language dictionaries Part1 The macrostructure South AfricanJournal of African Languages 21

Gabrieumll [Vermeersch] sd4 [(19213)] Etude deslangues congolaises bantoues avec applica-tions au tshiluba Turnhout Imprimerie delrsquoEacutecole Professionnelle St Victor

Hurskainen A 1998 Maximizing the(re)usability of language data Available atlthttpwwwhduibnoAcoHumabshurskhtmgt

Kabuta NS 1998a Inleiding tot de structuurvan het Cilubagrave Ghent Recall

Kabuta NS 1998b Loanwords in CilubagraveLexikos 8 37ndash64

Kennedy G 1998 An Introduction to CorpusLinguistics London Longman

Kruyt JG 1995 Technologies in ComputerizedLexicography Lexikos 5 117ndash137

Laver J 1994 Principles of PhoneticsCambridge Cambridge University Press

Louwrens LJ 1991 Aspects of NorthernSotho Grammar Pretoria Via AfrikaLimited

Maddieson I 1984 Patterns of SoundsCambridge Cambridge University PressSee alsolthttpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtmlgt

Morrison WM Anderson VA McElroy WF amp

McKee GT 1939 Dictionary of the TshilubaLanguage (Sometimes known as theBuluba-Lulua or Luba-Lulua) Luebo JLeighton Wilson Press

Muyunga YK 1979 Lingala and CilubagraveSpeech Audiometry Kinshasa PressesUniversitaires du Zaiumlre

Pretoria White Pages North Sotho EnglishAfrikaans Information PagesJohannesburg November 1999ndash2000

Prinsloo DJ 1985 Semantiese analise vandie vraagpartikels na en afa in Noord-Sotho South African Journal of AfricanLanguages 5(3) 91ndash95

Renouf A 1987 Moving On In Sinclair JM(ed) Looking Up An account of the COBUILDProject in lexical computing and the devel-opment of the Collins COBUILD EnglishLanguage Dictionary London Collins ELTpp 167ndash178

Stappers L 1949 Tonologische bijdrage tot destudie van het werkwoord in het TshilubaBrussels Koninklijk Belgisch KoloniaalInstituut

Summers D (director) 19953 LongmanDictionary of Contemporary English ThirdEdition Harlow Longman Dictionaries

Swadesh M 1952 Lexicostatistic Dating ofPrehistoric Ethnic Contacts Proceedingsof the American Philosophical Society96 452ndash463

Swadesh M 1953 Archeological andLinguistic Chronology of Indo-EuropeanGroups American Anthropologist 55349ndash352

Swadesh M 1955 Towards Greater Accuracyin Lexicostatistic Dating InternationalJournal of American Linguistics 21121ndash137

References

Page 18: Corpus applications for the African languages, with ... · erary surveys, sociolinguistic considerations, lexicographic compilations, stylistic studies, etc. Due to space restrictions,


indexed and studied in context, and contrasted with their more frequently used counterparts.

As an example, we can consider two different locative strategies used in Sepedi: ‘prefixing of go’ versus ‘suffixing of -ng’. Teachers often err in regarding these two strategies as mutually exclusive, especially in the case of human beings. Hence, they regard go monna ‘at the man’ and go mosadi ‘at the woman’ as the accepted forms, while not giving any attention to, or even rejecting, forms such as monneng and mosading. This is despite the fact that Louwrens attempts to point out the difference between them:

‘There exists a clear semantic difference between the members of such pairs of examples: kgôšing has the general meaning ‘the neighbourhood where the chief lives’, whereas go kgôši clearly implies ‘to the particular chief in person’.’ (Louwrens 1991: 121)

Although it is clear that prefixing go is by far the most frequently used strategy, some examples are found in PSC substantiating the use of the suffixal strategy. Even more important is the fact that these authentic examples clearly indicate that there is indeed a semantic difference between the two strategies. Compare the general meaning of go as ‘at’ with the specific meanings which can be retrieved from the corpus lines shown in Tables 12 and 13.

Louwrens’ semantic distinction between these strategies goes a long way in pointing out the difference. However, once again, careful analysis of corpus data reveals semantic connotations other than those described by researchers who solely rely on introspection and informants. So, for example, the meaning of phrases such as Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng ‘Because when man was created he did not come out of a woman, it is the woman who came out of the man’ (cf. corpus line 9 in Table 12 and corpus line 12 in Table 13), in the Biblical sense, is not catered for.

Table 12: Corpus lines for monneng (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 e eja mabele tšhemong ya mosadi wa bobedi monneng wa gagwe Yoo a rego ge a lla senku a
2 a napa a ineela a tseba gore o fihlile monneng wa banna yo a tlago mo khutšiša maima a
3 gore ka nnete le nyaka thušo nka go iša monneng e mongwe wa gešo yo ke tsebago gore yena
4 bose bja nama Gantši kgomo ya mogoga monneng e ba kgomo ye a bego a e rata kudu gare
5 gagwe a ikgafa go sepela le nna go nkiša monneng yoo wa gabo Ga se ba bantši ba ba ka
6 ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna o tlo tlogela tatagwe
7 botšobana bja lekgarebe Thupa ya tefo monneng e be e le bohloko go bona kgomo e etšwa
8 ka mosadi Gobane boka mosadi ge a tšwile monneng le monna o tšwelela ka mosadi Mme
9 a tšwa mosading ke mosadi e a tšwilego monneng Le gona monna ga a bopelwa mosadi ke

Table 13: Corpus lines for mosading (suffixing the locative -ng to ‘a human being’ in Sepedi)

1 seleka (Setu) Bjale ge a ka re o boela mosading wa gagwe ke reng PEBETSE Se tshwenyege
2 thuše selo ka gobane di swanetše go fihla mosading A tirišano ye botse le go jabetšana
3 a yo apewa a jewe Le rile go fihla mosading la re mosadi a thuše ka go gotša mollo le
4 o tlogetše mphufutšo wa letheka la gagwe mosading yoo e sego wa gagwe etšwe a boditšwe
5 a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA Ke mo hweditše a na le
6 a lahla setala a sekamela kudu ka mosading yo monyenyane mererong gona o tla
7 lona leo Ke maikutlo a bona a kgatelelo mosading Taamane a sega Ke be ke sa tsebe seo
8 ka namane re hwetša se na le kgononelo mosading gore ge a ka nyalwa e sa le lekgarebe go
9 tšhelete le botse bja gagwe mme a kgosela mosading o tee O dula gona Meadowlands Soweto
10 dirwa ke gore ke bogale Ke bogale kudu mosading wa go swana le Maria Nna ke na le
11 ke wa ntira mošemanyana O boletše maaka mosading wa gago gore ke mo hweditše a itia bola
12 Gobane monna ge a bopša ga a tšwa mosading ke mosadi e a tšwilego monneng Le gona
13 lethabo le tlhompho ya maleba go tšwa mosading wa gagwe 0 tla be a intšhitše seriti ka
14 le ba bogweng bja gagwe gore o sa ya mosading kua Ditsobotla a bone polane yeo a ka e
15 ka dieta O ile a ntlogela a ya mosading yo mongwe Ke yena mosadi yoo yo a
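Concordance lines of the kind given in Tables 12 and 13 can be retrieved with a simple keyword-in-context (KWIC) routine. The following is only a minimal sketch in Python, not the software used for PSC: the function name kwic, the width parameter and the two-fragment sample text (pieced together from corpus lines quoted in the tables) are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=40):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Hypothetical helper for illustration only. `width` is the number of
    characters of context kept on each side of the keyword.
    """
    lines = []
    # \b ... \b restricts the match to whole word-forms, so searching for
    # "mosadi" will not return "mosading". IGNORECASE also finds
    # capitalised occurrences such as "Mosading".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines


# Tiny sample built from two corpus lines quoted in Tables 12 and 13.
sample = (
    "ke mosadi gobane ke yena e a ntšhitswego monneng Ka baka leo monna "
    "o tlo tlogela tatagwe. "
    "a nnoši lenyalong la rena IKGETHELE Mosading wa bobedi INAMA "
    "Ke mo hweditše a na le"
)

for line in kwic(sample, "monneng") + kwic(sample, "mosading"):
    print(line)
```

Sorting the resulting lines on the right-hand context would group recurrent patterns such as monneng wa ..., which is the kind of evidence a teacher could take into the classroom.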

Corpus applications in the field of language software: spellcheckers

According to the Longman Dictionary of Contemporary English, a spellchecker is ‘a computer PROGRAM that checks what you have written and makes your spelling correct’ (Summers 1995³). Today, such language software is abundantly available for Indo-European languages. Yet, corpus-based frequency studies may enable African languages to be provided with such tools as well.

Basically, there are two main approaches to spellcheckers. On the one hand, one can program software with a proper description of a language, including detailed morpho-phonological and syntactic rules, together with a stored list of word-roots; on the other hand, one can simply compare the spelling of typed words with a stored list of word-forms. The latter indeed forms the core of the Concise Oxford Dictionary’s definition of a spellchecker: ‘a computer program which checks the spelling of words in files of text, usually by comparison with a stored list of words’ (1996).

While such a ‘stored list of words’ is often assembled in a random manner, we argue that much better results are obtained when the compilation of such a list is based on high frequencies of occurrence. Formulated differently, a first-generation spellchecker for African languages can simply compare typed words with a stored list of the top few thousand word-forms. Actually, this approach is already a reality for isiXhosa, isiZulu, Sepedi and Setswana, as first-generation spellcheckers compiled by DJ Prinsloo are commercially available in WordPerfect 9 within the WordPerfect Office 2000 suite. Due to the conjunctive orthography of isiXhosa and isiZulu, the software is obviously less effective for these languages than for the disjunctively written Sepedi and Setswana.
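The second approach, comparing typed words with a stored list of top-frequency word-forms, can be sketched in a few lines of Python. This is a hedged illustration under our own assumptions and not the WordPerfect 9 implementation: the names tokenize, build_word_list and unrecognised, the tokenisation rule and the toy corpus are all hypothetical.

```python
import re
from collections import Counter

# Assumed tokenisation: a word-form is a run of letters, optionally with
# internal hyphens. Digits and amounts such as R15 are not treated as words.
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Split text into lower-cased word-forms."""
    return [t.lower() for t in TOKEN.findall(text)]


def build_word_list(corpus_text, top_n):
    """Return the set of the `top_n` most frequent word-forms in a corpus."""
    counts = Counter(tokenize(corpus_text))
    return {form for form, _ in counts.most_common(top_n)}


def unrecognised(text, word_list):
    """Return the typed word-forms that do not occur in the stored list."""
    return [t for t in tokenize(text) if t not in word_list]


# Toy corpus, invented for illustration; a real list would be derived
# from a corpus of many thousands of running words.
toy_corpus = "ka go ka go ka moka ya gago ya"
stored = build_word_list(toy_corpus, top_n=3)   # {'ka', 'go', 'ya'}

print(unrecognised("Ka moka ya gago", stored))  # ['moka', 'gago']
```

Because every distinct word-form must be listed separately, such a list covers a disjunctively written language with far fewer entries than a conjunctively written one, where a single orthographic word combines several morphemes.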

To illustrate this latter point, tests were conducted on two randomly selected paragraphs. In (8) the isiZulu paragraph is shown, where the word-forms in bold are not recognised by the WordPerfect 9 spellchecker software.

(8) Spellchecking a randomly selected paragraph from Bona Zulu (June 2000: 114)

Izingane ezizichamelayo zivame ukuhlala ngokuhlukumezeka kanti akufanele ziphathwe ngaleyondlela Uma ushaya ingane ngoba izichamelile usuke uyihlukumeza ngoba lokho ayikwenzi ngamabomu njengoba iningi labazali licabanga kanjalo Uma nawe mzali usubuyisa ingqondo usho ukuthi ikhona ingane engajatshuliswa wukuvuka embhedeni obandayo omanzi njalo ekuseni

The stored isiZulu list consists of the 33 526 most frequently used word-forms. As 12 out of 41 word-forms were not recognised in (8), this implies a success rate of ‘only’ 71%.

When we test the WordPerfect 9 spellchecker software on a randomly selected Sepedi paragraph, however, the results are as shown in (9).

(9) Spellchecking a randomly selected paragraph from the telephone directory Pretoria White Pages (November 1999–2000: 24)

Dikarata tša mogala di a hwetšagala ka go fapafapana goba R15 R20 (R2 ke mahala) R50 R100 goba R200 Gomme di ka šomišwa go megala ya Telkom ka moka (ye metala) Ge tšhelete ka moka e fedile karateng o ka tsentšha karata ye nngwe ntle le go šitiša poledišano ya gago mogaleng

Even though the stored Sepedi list is smaller than the isiZulu one, as it only consists of the 27 020 most frequently used word-forms, with 2 unrecognised words out of 46, the success rate is as high as 96%.
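The two success rates follow directly from the counts reported for (8) and (9): the recognised word-forms as a percentage of all word-forms in the paragraph. A short Python check of that arithmetic is given below; the function name success_rate is our own, purely illustrative choice.

```python
def success_rate(total_forms, unrecognised_forms):
    """Percentage of word-forms in a text that the spellchecker recognises."""
    return 100 * (total_forms - unrecognised_forms) / total_forms


# Counts as reported in the text for paragraphs (8) and (9).
isizulu = success_rate(total_forms=41, unrecognised_forms=12)  # 29 of 41
sepedi = success_rate(total_forms=46, unrecognised_forms=2)    # 44 of 46

print(f"isiZulu: {isizulu:.0f}%")  # isiZulu: 71%
print(f"Sepedi: {sepedi:.0f}%")    # Sepedi: 96%
```

With only one short paragraph per language, these figures are indicative rather than statistically robust; one additional unrecognised word shifts either rate by more than two percentage points.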

The four available first-generation spellcheckers were tested by Corel’s Beta Partners and the current success rates were approved. Yet, it is our intention to substantially enlarge the sizes of all our corpora for South African languages, so as to feed the spellcheckers with, say, the top 100 000 word-forms. The actual success rates for the conjunctively written languages (isiNdebele, isiXhosa, isiZulu and siSwati) remain to be seen, while it is expected that the performance for the disjunctively written languages (Sepedi, Sesotho, Setswana, Tshivenda and Xitsonga) will be more than acceptable with such a corpus-based approach.


Conclusion

We have shown clearly that applications of electronic corpora in various linguistic fields have at present become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme except and h which belong to a same phoneme’ (1979: 47). Here, however, Muyunga is mixing different dialects. While [ɸ] is used by, for instance, the Bakwà Diishò, whose dialect gave rise to what is presently known as ‘standard Cilubà’ (De Clercq & Willems 1960³: 7), the glottal variant [h] is used by, for instance, some Bakwà Kalonji (Stappers 1949: xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.


References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu, Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds) Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages, North Sotho English Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed.) Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.


Kabuta NS 1998b Loanwords in CilubagraveLexikos 8 37ndash64

Kennedy G 1998 An Introduction to CorpusLinguistics London Longman

Kruyt JG 1995 Technologies in ComputerizedLexicography Lexikos 5 117ndash137

Laver J 1994 Principles of PhoneticsCambridge Cambridge University Press

Louwrens LJ 1991 Aspects of NorthernSotho Grammar Pretoria Via AfrikaLimited

Maddieson I 1984 Patterns of SoundsCambridge Cambridge University PressSee alsolthttpwwwlinguisticsrdgacukstaffRonBrasingtonUPSIDinterfaceInterfacehtmlgt

Morrison WM Anderson VA McElroy WF amp

McKee GT 1939 Dictionary of the TshilubaLanguage (Sometimes known as theBuluba-Lulua or Luba-Lulua) Luebo JLeighton Wilson Press

Muyunga YK 1979 Lingala and CilubagraveSpeech Audiometry Kinshasa PressesUniversitaires du Zaiumlre

Pretoria White Pages North Sotho EnglishAfrikaans Information PagesJohannesburg November 1999ndash2000

Prinsloo DJ 1985 Semantiese analise vandie vraagpartikels na en afa in Noord-Sotho South African Journal of AfricanLanguages 5(3) 91ndash95

Renouf A 1987 Moving On In Sinclair JM(ed) Looking Up An account of the COBUILDProject in lexical computing and the devel-opment of the Collins COBUILD EnglishLanguage Dictionary London Collins ELTpp 167ndash178

Stappers L 1949 Tonologische bijdrage tot destudie van het werkwoord in het TshilubaBrussels Koninklijk Belgisch KoloniaalInstituut

Summers D (director) 19953 LongmanDictionary of Contemporary English ThirdEdition Harlow Longman Dictionaries

Swadesh M 1952 Lexicostatistic Dating ofPrehistoric Ethnic Contacts Proceedingsof the American Philosophical Society96 452ndash463

Swadesh M 1953 Archeological andLinguistic Chronology of Indo-EuropeanGroups American Anthropologist 55349ndash352

Swadesh M 1955 Towards Greater Accuracyin Lexicostatistic Dating InternationalJournal of American Linguistics 21121ndash137

References


Conclusion

We have shown that applications of electronic corpora in various linguistic fields have now become a reality for the African languages. As such, African-language scholars can take their rightful place in the new millennium and mirror the great contemporary endeavours in corpus linguistics achieved by scholars of, say, Indo-European languages.

In this article, together with a previous one (De Schryver & Prinsloo 2000), the compilation, querying and possible applications of African-language corpora have been reviewed. In a way, these two articles should be considered as foundational to a discipline of corpus linguistics for the African languages, a discipline which will be explored more extensively in future publications.

From the different corpus-project applications that have been used as illustrations of the theoretical premises in the present article, we can draw the following conclusions:

• In the field of fundamental linguistic research, we have seen that, in order to pursue truly modern phonetics, one can simply turn to top-frequency counts derived from a corpus of the language under study; hence a ‘corpus-based phonetics from below’-approach. Such an approach makes it possible to make a maximum number of distributional claims, based on a minimum number of words, about the most frequent section of a language’s lexicon.

• Also in the field of fundamental linguistic research, the discussion of question particles brought to light that, when a corpus-based approach is contrasted with the so-called ‘traditional approaches’ of introspection and informant elicitation, corpora reveal both correct and incorrect traditional findings.

• When it comes to corpus applications in the field of language teaching and learning, we have stressed the power of corpus-based pronunciation guides and corpus-based textbooks, syllabi, workbooks, manuals, etc. In addition, we have illustrated how the teacher can retrieve a wealth of morpho-syntactic and contrasting structures from the corpus, structures he/she can then put to good use in the classroom situation.

• Finally, we have pointed out that at least one set of corpus-based language tools is already commercially available. With the knowledge we have acquired in compiling the software for first-generation spellcheckers for four African languages, we are now ready to undertake the compilation of spellcheckers for all African languages spoken in South Africa.
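By way of illustration only, the top-frequency counts mentioned in the first conclusion, and a wordlist lookup of the kind a simple spellchecker might perform, can be sketched as follows. This is a minimal sketch under stated assumptions, not the software described in this article: the function names, the tokenisation rule and the placeholder tokens are all hypothetical, and a real corpus would require language-specific tokenisation.

```python
import re
from collections import Counter

# Hypothetical tokeniser: runs of letters, optionally joined by an
# apostrophe or hyphen. This is an assumption, not the authors' method.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Split raw text into lower-cased word forms."""
    return TOKEN_PATTERN.findall(text.lower())


def top_frequency_words(text, n):
    """Return the n most frequent word forms with their counts."""
    return Counter(tokenize(text)).most_common(n)


def unknown_words(text, wordlist):
    """Return the word forms in text that are absent from wordlist."""
    return [token for token in tokenize(text) if token not in wordlist]


if __name__ == "__main__":
    # Placeholder tokens standing in for a real corpus.
    corpus = "alpha beta alpha gamma alpha beta"
    print(top_frequency_words(corpus, 2))

    # A wordlist built from the most frequent forms, then used for lookup.
    wordlist = {word for word, _ in top_frequency_words(corpus, 2)}
    print(unknown_words("alpha delta beta", wordlist))
```

Building the lookup list from the top-frequency forms is one possible design choice: it keeps the list small while covering the most frequent section of the lexicon, at the cost of flagging rarer but valid words.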

Notes

1 This article is based on a paper read by the authors at the First International Conference on Linguistics in Southern Africa, held at the University of Cape Town, 12–14 January 2000. G-M de Schryver is currently Research Assistant of the Fund for Scientific Research – Flanders (Belgium).

2 A different approach to the research presented in this section can be found in De Schryver (1999).

3 Laver’s phonetic taxonomy (1994) is used as a theoretical framework throughout this section.

4 Strangely enough, Muyunga seems to feel the need to combine the different phone inventories into one new inventory. In this respect he distinguishes the voiceless bilabial fricative and the voiceless glottal fricative, claiming that ‘Each simple consonant represents a phoneme, except [ɸ] and h which belong to a same phoneme’ (1979: 47). Here, however, Muyunga is mixing different dialects. While [ɸ] is used by, for instance, the Bakwà Diishò (their dialect giving rise to what is presently known as ‘standard Cilubà’; De Clercq & Willems 1960: 37), the glottal variant [h] is used by, for instance, some Bakwà Kalonji (Stappers 1949: xi). The glottal variant, not being the standard, is seldom found in the literature. A rare example is the dictionary by Morrison, Anderson, McElroy & McKee (1939).

5 High tones being more frequent than low ones, Kabuta restricts the tonal diacritics to low tone, falling tone and rising tone. The first example should have been [ kuna ].

6 Considering tone (and quantity) as an integral part of vocalic-resonant identity does not seem far-fetched as long as ‘words in isolation’ are concerned. The implications of such an approach for ‘words in context’, however, definitely need further research.

References

(URLs last accessed on 16 April 2001)

Bastin Y, Coupez A & De Halleux B. 1983. Classification lexicostatistique des langues bantoues (214 relevés). Bulletin des Séances de l’Académie Royale des Sciences d’Outre-Mer 27(2): 173–199.

Bona Zulu. Imagazini Yesizwe. Durban, June 2000.

Burssens A. 1939. Tonologische schets van het Tshiluba (Kasayi, Belgisch Kongo). Antwerp: De Sikkel.

Calzolari N. 1996. Lexicon and Corpus: a Multi-faceted Interaction. In: Gellerstam M et al. (eds). Euralex ’96 Proceedings I. Gothenburg: Gothenburg University, pp. 3–16.

Concise Oxford Dictionary, Ninth Edition, On CD-ROM. 1996. Oxford: Oxford University Press.

De Clercq A & Willems E. 1960³. Dictionnaire Tshilubà-Français. Léopoldville: Imprimerie de la Société Missionnaire de St Paul.

De Schryver G-M. 1999. Cilubà Phonetics: Proposals for a ‘corpus-based phonetics from below’-approach. Ghent: Recall.

De Schryver G-M & Prinsloo DJ. 2000. The compilation of electronic corpora, with special reference to the African languages. Southern African Linguistics and Applied Language Studies 18: 89–106.

De Schryver G-M & Prinsloo DJ. forthcoming. Electronic corpora as a basis for the compilation of African-language dictionaries, Part 1: The macrostructure. South African Journal of African Languages 21.

Gabriël [Vermeersch]. s.d.⁴ [(1921³)]. Etude des langues congolaises bantoues avec applications au tshiluba. Turnhout: Imprimerie de l’École Professionnelle St Victor.

Hurskainen A. 1998. Maximizing the (re)usability of language data. Available at <http://www.hd.uib.no/AcoHum/abs/hursk.htm>

Kabuta NS. 1998a. Inleiding tot de structuur van het Cilubà. Ghent: Recall.

Kabuta NS. 1998b. Loanwords in Cilubà. Lexikos 8: 37–64.

Kennedy G. 1998. An Introduction to Corpus Linguistics. London: Longman.

Kruyt JG. 1995. Technologies in Computerized Lexicography. Lexikos 5: 117–137.

Laver J. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.

Louwrens LJ. 1991. Aspects of Northern Sotho Grammar. Pretoria: Via Afrika Limited.

Maddieson I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press. See also <http://www.linguistics.rdg.ac.uk/staff/Ron.Brasington/UPSID/interface/Interface.html>

Morrison WM, Anderson VA, McElroy WF & McKee GT. 1939. Dictionary of the Tshiluba Language (Sometimes known as the Buluba-Lulua, or Luba-Lulua). Luebo: J Leighton Wilson Press.

Muyunga YK. 1979. Lingala and Cilubà Speech Audiometry. Kinshasa: Presses Universitaires du Zaïre.

Pretoria White Pages. North Sotho, English, Afrikaans Information Pages. Johannesburg, November 1999–2000.

Prinsloo DJ. 1985. Semantiese analise van die vraagpartikels na en afa in Noord-Sotho. South African Journal of African Languages 5(3): 91–95.

Renouf A. 1987. Moving On. In: Sinclair JM (ed). Looking Up: An account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary. London: Collins ELT, pp. 167–178.

Stappers L. 1949. Tonologische bijdrage tot de studie van het werkwoord in het Tshiluba. Brussels: Koninklijk Belgisch Koloniaal Instituut.

Summers D (director). 1995³. Longman Dictionary of Contemporary English, Third Edition. Harlow: Longman Dictionaries.

Swadesh M. 1952. Lexicostatistic Dating of Prehistoric Ethnic Contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh M. 1953. Archeological and Linguistic Chronology of Indo-European Groups. American Anthropologist 55: 349–352.

Swadesh M. 1955. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics 21: 121–137.
