PowerConc: An R-gram Based Corpus Analysis Tool
Jiajin Xu & Yunlong Jia
Beijing Foreign Studies University
PowerConc
• National Research Centre for Foreign Language Education, Beijing Foreign Studies University
• A general-purpose tool for corpus analysis
• Developed in Delphi
• Can deal with any ANSI-encoded texts
– E.g. on a Simplified Chinese OS, it works well with Simplified/Traditional Chinese texts, (un)tokenised or raw/POS-tagged, as well as raw/POS-tagged English texts
PowerConc
• Size: 1.5 MB; the compressed package is less than 1 MB
• Installation: does not require any installation
• OS: currently works only on Windows
Design principles for PowerConc
Ideally
• Most powerful: can do everything a concordancer can do, and things it cannot
• Involves the least effort in learning to use it
• Doing MORE with less
• Reductionism in software design
Fewer buttons and/or tabs
[Figure: interface labels "Frequency count", "Search", "List"]
[Diagram: Freq. Count; Concordance; N-gram list; Collocation & Colligation; Key n-gram list]
More possibilities in tool development
• Corpus-informed/related ‘grammars’
– Pattern grammar (local grammar)
– Collostruction
– Lexical grammar (natural grammar, real grammar)
– Lexical priming (textual colligation)
– Longman grammar (Biber et al.): register variation
• Tool development lags behind
From phraseology to R-gram
• Many of these ‘grammars’ can be seen as some sort of phraseology
• We coined a technical term ‘R-gram’.
– An operational parallel to phraseology
– The unit of language can be words, lemmata, phrases, POS, POS sequences, and combinations of all these.
– Can be linguistic structures with uncertain words or categories (e.g. be passive/get passive).
• a * of: collocational framework
• It be ADJ that: evaluative construction
• Noun noun compounds
• Bi-nominal constructions
• Passive constructions: be/get ADV. V-EN
• All these could be matched with Regular Expressions.
• But Regex is too difficult for lay users.
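To make the regex point concrete, here is a minimal Python sketch of the first pattern, a * of, written as a regular expression over raw, untagged text. The sample text and the function name are illustrative assumptions and are not part of PowerConc.

```python
import re
from collections import Counter

# Illustrative sample text (made up, not taken from any corpus in the talk).
TEXT = ("He drank a cup of tea and ate a piece of cake, "
        "then a lot of people left. A number of them returned.")

# 'a * of': the word "a", exactly one intervening word, then "of".
FRAMEWORK = re.compile(r"\ba\s+(\w+)\s+of\b", re.IGNORECASE)

def framework_fillers(text):
    """Count the words that fill the slot in the 'a * of' framework."""
    return Counter(m.group(1).lower() for m in FRAMEWORK.finditer(text))

fillers = framework_fillers(TEXT)
print(fillers.most_common())
```

Even this simplest case needs word boundaries, whitespace classes and a capture group, which illustrates why raw regex is hard going for lay users.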
Easy search with enhanced hits
• Smart Input
• Three meta-characters in Smart Input syntax, the simplest grammar ever.
• @be returns all inflectional forms of ‘be’
• #n returns all nouns
• * refers to any single word
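One way to picture how the three meta-characters could work is as a translation into a regex over POS-tagged text. The sketch below is only a guess at the idea, not PowerConc's actual implementation (which is written in Delphi and not shown in the slides). It assumes tokens in word_TAG form separated by single spaces, a tiny hand-written lemma table, and a tag-prefix table in which #n means tags starting with N and #adj means tags starting with J. All names here are hypothetical.

```python
import re

# Assumption: a tiny hand-written lemma table, for illustration only.
LEMMA_FORMS = {"be": ["be", "am", "is", "are", "was", "were", "been", "being"]}
# Assumption: '#n' selects tags starting with N, '#adj' tags starting with J.
TAG_PREFIX = {"n": "N", "adj": "J"}

def smart_to_regex(query):
    """Translate a Smart Input style query into a regex over word_TAG text."""
    parts = []
    for item in query.split():
        if item == "*":
            parts.append(r"\S+")                                   # any single token
        elif item.startswith("@"):
            forms = "|".join(re.escape(f) for f in LEMMA_FORMS[item[1:]])
            parts.append("(?:" + forms + r")_\S+")                 # any form of the lemma
        elif item.startswith("#"):
            parts.append(r"\S+_" + TAG_PREFIX[item[1:]] + r"\S*")  # any word with that POS
        else:
            parts.append(re.escape(item) + r"_\S+")                # literal word, any tag
    # Anchor on token boundaries so a match never starts or ends mid-token.
    return re.compile(r"(?<!\S)" + r"\s".join(parts) + r"(?!\S)", re.IGNORECASE)

def search(query, tagged_text):
    """Return all matches of a Smart Input style query in tagged text."""
    return [m.group(0) for m in smart_to_regex(query).finditer(tagged_text)]

# Made-up tagged sample (tags are CLAWS7-like, chosen for illustration).
TAGGED = ("It_PPH1 is_VBZ clear_JJ that_CST a_AT1 lot_NN1 of_IO "
          "men_NN2 and_CC women_NN2 agree_VV0")

for q in ["a * of", "It @be #adj that", "#n and #n"]:
    print(q, "=>", search(q, TAGGED))
```

The user types three short symbols; the regex machinery stays hidden behind the translation step.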
• a * of => a * of
• It be ADJ that => It @be #adj that
• Noun noun compound => #n #n
• Bi-nominal => #n and #n
• Passive => \S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N
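The passive pattern is the one case left as a raw regex. The Python sketch below runs it over a few made-up sentences to show what it picks up. The slides do not name the tagset, so the CLAWS7-style tags and the word_TAG format used here are assumptions.

```python
import re

# The passive pattern from the slide, joined back into a single line.
PASSIVE = re.compile(r"\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N")

# Assumption: CLAWS7-style tags in word_TAG format; the sentences are made up.
SENTENCES = [
    "The_AT book_NN1 was_VBDZ quickly_RR written_VVN ._.",
    "The_AT plan_NN1 is_VBZ not_XX yet_RR approved_VVN ._.",
    "She_PPHS1 is_VBZ happy_JJ ._.",
]

def find_passives(sentence):
    """Return every stretch of text matched by the passive pattern."""
    return [m.group(0) for m in PASSIVE.finditer(sentence)]

for s in SENTENCES:
    print(find_passives(s))
```

Read left to right, the pattern asks for a token tagged VB plus at least one further character, then any number of intervening tokens whose tags start with R, X, P, J, D or N, then a token whose tag starts with V and ends in N.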
Limitation
• Speed
• It is a concordancer that applies no indexing
• It can't process texts larger than a few million words anyway
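For readers unfamiliar with the term, the sketch below contrasts a search without indexing, where every query walks the whole text, with a simple positional index that is built once and then looked up. It is a generic Python illustration with hypothetical names; it says nothing about PowerConc's internals beyond the slide's statement that no indexing is applied.

```python
from collections import defaultdict

# Made-up toy "corpus", already split into tokens.
TOKENS = "the cat sat on the mat and the dog sat too".split()

def scan(tokens, word):
    """No index: every query reads the whole token list."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def build_index(tokens):
    """Built once: map each word to the positions where it occurs."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok].append(i)
    return index

INDEX = build_index(TOKENS)

print(scan(TOKENS, "sat"))   # positions found by a full pass over the text
print(INDEX["sat"])          # same positions, found by a single lookup
```

The scan costs one pass over the corpus per query, which is why speed degrades as texts grow. An index trades that for up-front build time and extra memory.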
Download PowerConc
• www.fleric.org.cn/powerconc/
• http://www.bfsu-corpus.org/channels/tools
Thank you!