PowerConc: An R-gram Based Corpus Analysis Tool

Jiajin Xu & Yunlong Jia
Beijing Foreign Studies University

PowerConc

• National Research Centre for Foreign Language Education, Beijing Foreign Studies University
• A general-purpose tool for corpus analysis
• Developed in Delphi
• Can deal with any ANSI-encoded texts
– E.g. on a Simplified Chinese OS it works well with Simplified/Trad. Chinese texts, (un)tokenised or raw/POS-tagged, as well as raw/POS-tagged English texts

PowerConc

• Size: 1.5 MB; the compressed package is less than 1 MB
• Installation: none required
• OS: currently works only on Windows

Design principles for PowerConc

Ideally

• Most powerful: can do what a concordancer can do, and also what it cannot
• Involves the least effort in learning to use it
• Doing MORE with less
• Reductionism in software design

Fewer buttons and/or tabs

[Diagram labels: Frequency count; Search; List]

[Diagram labels: Freq. Count; Concordance; N-gram list; Collocation & Colligation; Key n-gram list]

More possibilities in tool development

• Corpus-informed/related ‘grammars’
– Pattern grammar (local grammar)
– Collostruction
– Lexical grammar (natural grammar, real grammar)
– Lexical priming (textual colligation)
– Longman grammar: Biber et al. grammar register variation
• Tool development lags behind

From phraseology to R-gram

• Many of the ‘grammars’ can be seen as some sort of phraseology
• We coined a technical term, ‘R-gram’.
– An operational parallel to phraseology
– The unit of language can be words, lemmata, phrases, POS, POS sequences, and combinations of all these.
– Can be linguistic structures with uncertain words or categories (e.g. be passive/get passive).

• a * of: collocational framework
• It be ADJ that: evaluative construction
• Noun noun compounds
• Bi-nominal constructions
• Passive constructions: be/get ADV. V-EN
• All these could be matched with Regular Expressions.
• But Regex is too difficult for lay users.
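
As a rough illustration of matching such patterns with regular expressions, the sketch below finds the ‘a * of’ collocational framework in raw text with Python's `re` module. The sample sentence and the function name `find_a_of` are assumptions made for illustration; they are not part of PowerConc.

```python
import re

# "a * of": the word "a", any single word, then "of".
A_OF = re.compile(r"\ba\s+(\w+)\s+of\b", re.IGNORECASE)

def find_a_of(text):
    """Return the word that fills the slot in each 'a * of' hit."""
    return [m.group(1) for m in A_OF.finditer(text)]

sample = "She bought a lot of books and a couple of pens."
print(find_a_of(sample))
```

Even this simple framework needs word boundaries and whitespace handling, which hints at why raw regex is hard going for lay users.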

Easy search with enhanced hits

• Smart Input
• Three meta-characters in Smart Input syntax, the simplest grammar ever.
• @be returns all inflectional forms of ‘be’
• #n returns all nouns
• * refers to any single word
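
One way to picture the three meta-characters is as shorthand that expands into a regular expression. The sketch below is only an assumption about how such a translation could work, not PowerConc's actual implementation. It assumes text in `word_TAG` format with CLAWS-like tags, and the names `LEMMA_FORMS`, `POS_PREFIX` and `smart_to_regex` are hypothetical.

```python
import re

# Hypothetical lemma table; a real tool would need a full lemma list.
LEMMA_FORMS = {"be": ["be", "am", "is", "are", "was", "were", "been", "being"]}

# Assumed mapping from Smart Input POS names to tag prefixes (CLAWS-like).
POS_PREFIX = {"n": "N", "adj": "J", "v": "V", "adv": "R"}

def smart_to_regex(query):
    """Translate a Smart Input style query into a regex over word_TAG text."""
    parts = []
    for token in query.split():
        name = token[1:]
        if token == "*":                    # any single word
            parts.append(r"\S+")
        elif token.startswith("@"):         # all forms of a lemma
            forms = "|".join(re.escape(f) for f in LEMMA_FORMS[name])
            parts.append(rf"(?:{forms})_\S+")
        elif token.startswith("#"):         # any word with this POS
            prefix = POS_PREFIX[name]
            parts.append(rf"\S+_{prefix}\S*")
        else:                               # literal word form
            parts.append(rf"{re.escape(token)}_\S+")
    # Anchor both ends on token boundaries.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")

tagged = "It_PPH1 is_VBZ clear_JJ that_CST a_AT1 lot_NN1 of_IO work_NN1 remains_VVZ"
print(smart_to_regex("It @be #adj that").search(tagged).group(0))
print(smart_to_regex("a * of").search(tagged).group(0))
```

The user types four short tokens; the generated expression is what they would otherwise have to write by hand.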

• a * of => a * of
• It be ADJ that => It @be #adj that
• Noun noun compound => #n #n
• Bi-nominal => #n and #n
• Passive => \S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N
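
The passive pattern can be tried directly in Python. The sketch below assumes the corpus is in `word_TAG` format with CLAWS-style tags, which is what the tag letters in the pattern suggest; the tagged sentence is invented for illustration.

```python
import re

# Passive pattern from the slide, unchanged: a form of "be", optional
# intervening words (adverbs, "not", pronouns, etc.), then a past participle.
PASSIVE = re.compile(r"\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N")

# Hypothetical tagged sentence, assuming CLAWS-style tags.
tagged = "The_AT corpus_NN1 was_VBDZ not_XX carefully_RR tagged_VVN ._."

hits = [m.group(0) for m in PASSIVE.finditer(tagged)]
print(hits)
```

Note that `finditer` with `group(0)` is used rather than `findall`, because the pattern contains a capturing group and `findall` would return only that group.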

Limitation

• Speed
• A concordancer that does not use indexing
• Can't process texts larger than a few million words anyway.
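
To show what indexing would change, the sketch below builds a simple inverted index so that a query looks up word positions instead of rescanning every token. It is a generic illustration under assumed names (`build_index`, `concordance`), not a description of how PowerConc works or would be extended.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each lowercased word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index

def concordance(tokens, index, word, span=2):
    """Return KWIC lines for `word` by looking positions up in the index."""
    lines = []
    for pos in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, pos - span):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + span])
        lines.append(f"{left} [{tokens[pos]}] {right}".strip())
    return lines

tokens = "the cat sat on the mat near the door".split()
index = build_index(tokens)
print(concordance(tokens, index, "the"))
```

The index costs one pass over the corpus and extra memory up front; an unindexed tool avoids that cost but pays for a full scan on every search.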

Download PowerConc

• www.fleric.org.cn/powerconc/
• http://www.bfsu-corpus.org/channels/tools

Thank you!