![Page 1: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/1.jpg)
What is a national corpus
![Page 2: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/2.jpg)
The primary objective of a national corpus is to provide linguists with a tool for investigating a language across diverse types of texts by means of complex lexical and grammatical queries.
The corpus makes it possible to investigate various linguistic phenomena by observing the range of contexts in which they occur.
![Page 3: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/3.jpg)
Examples of searchable corpora online
British National Corpus
Russian National Corpus
Eastern Armenian National Corpus
Czech National Corpus
![Page 4: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/4.jpg)
To show just one example: Eastern Armenian National Corpus
• about 90 million tokens
• powerful search engine for making complex lexical and morphological queries
• a diachronic corpus covering SEA (Standard Eastern Armenian) texts from the mid-19th century to the present
• both written discourse and oral discourse
• open access
![Page 5: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/5.jpg)
A national corpus is a large-scale, linguistically diversified and balanced collection of texts provided with a flexible search engine.
![Page 6: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/6.jpg)
How large?
RNC: 150 million
BNC: 100 million
EANC: 90 million
Essentially, it depends on the type of research envisaged.
![Page 7: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/7.jpg)
How diversified?
As diversified as practicable
EANC – extension of the press subcorpus to cover early Armenian press, soon to cover internet forums
RNC – effort to cover snail mail and electronic communication
![Page 8: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/8.jpg)
EANC: subcorpus form
![Page 9: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/9.jpg)
How balanced?
Balance is a vague notion…
At the very least, not disproportionate: less poetry than prose, etc. Even an unbalanced corpus can be balanced by creating predefined subcorpora.
![Page 10: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/10.jpg)
As an example: EANC
| Written discourse | # tokens | % EANC | # of docs |
|---|---|---|---|
| Fiction | | | |
| prose: novel | 23 487 427 | 32,0% | 287 |
| prose: story | 5 203 507 | 7,1% | 104 |
| prose: play | 1 407 344 | 1,9% | 46 |
| prose subtotal | 30 098 278 | 41,0% | 437 |
| poetry | 2 392 710 | 3,3% | 106 |
| Press | 22 471 921 | 30,6% | 3 895 |
| Nonfiction | | | |
| science | 13 354 755 | 18,2% | 109 |
| essays, memoirs, official, religious | 3 894 015 | 5,3% | 320 |
| Written discourse total | 72 211 679 | 98,5% | 4 867 |
![Page 11: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/11.jpg)
Multicomponent corpora
• Oral subcorpus (RNC, BNC, EANC)
• Dialectal subcorpus (RNC)
• Poetic subcorpus (RNC)
• Educational subcorpus (RNC)
• …
![Page 12: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/12.jpg)
Library or corpus?
• an electronic library is intended for readers
• a corpus is intended for researchers
The difference lies in the target audience and the intended usage.
Implied differences:
a corpus must be able to respond to queries
a library has major problems related to copyright
![Page 13: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/13.jpg)
Technical requirement: reasonable response time
Functional requirement: complex queries
• you cannot parse texts on the fly
texts need to contain markup
• in large corpora, you cannot simply search the markup
you have to index files, create data files and use special search algorithms
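The indexing requirement can be illustrated with a minimal sketch of an inverted index. This is not the search engine of any of the corpora named here; the function names `build_index` and `lookup` and the toy sentences are assumptions made for illustration only.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased token to the (sentence_no, position) pairs where it occurs."""
    index = defaultdict(list)
    for sent_no, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.split()):
            index[token.lower()].append((sent_no, pos))
    return index

def lookup(index, token):
    # One dictionary access instead of a scan over every text.
    return index.get(token.lower(), [])

# Toy data (hypothetical); a real corpus would keep its index in data files on disk.
sentences = [
    "the corpus answers queries",
    "the library serves readers",
    "queries run against the index",
]
index = build_index(sentences)
print(lookup(index, "queries"))
```

The index is built once, ahead of time, so the cost of a query no longer grows with a full pass over the texts.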
![Page 14: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/14.jpg)
Parsing
The classification of inflectional types needs to be as exhaustive and formal as a logical calculus.
The parser creates a list of endings and a list of stems; when parsing a wordform, it tries to match the ending of the word with an ending in the list, then tries to match the rest with a stem, and checks whether this ending is allowed to be added to this stem.
• a wordlist
• an inflection type assigned to each of its items
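The stem-and-ending procedure described above can be sketched roughly as follows. The English toy data, the inflection type labels (`N`, `V`) and the grammatical tags are all hypothetical; a real wordlist and ending table are far larger and language-specific.

```python
# Hypothetical toy wordlist: stem -> list of (lexeme, inflection type).
STEMS = {
    "walk": [("walk", "V")],
    "book": [("book", "N"), ("book", "V")],
}
# Hypothetical ending table: ending -> list of (inflection type it attaches to, tag).
ENDINGS = {
    "": [("N", "sg"), ("V", "inf")],
    "s": [("N", "pl"), ("V", "prs.3sg")],
    "ed": [("V", "pst")],
}

def parse(wordform):
    """Return every (lexeme, inflection type, tag) analysis of the wordform."""
    analyses = []
    for cut in range(len(wordform) + 1):
        stem, ending = wordform[:cut], wordform[cut:]
        if stem not in STEMS or ending not in ENDINGS:
            continue
        for lexeme, infl_type in STEMS[stem]:
            for allowed_type, tag in ENDINGS[ending]:
                # The ending must be allowed for this stem's inflection type.
                if infl_type == allowed_type:
                    analyses.append((lexeme, infl_type, tag))
    return analyses

print(parse("walked"))
print(parse("books"))
print(parse("blog"))
```

In this sketch `parse("books")` yields two analyses (a plural noun and a verb form), while `parse("blog")` yields none because the stem is missing from the toy wordlist.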
![Page 15: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/15.jpg)
Parsing
• recent loanwords
• neologisms
• elements of code-switching
• abbreviations
• proper names
• technical terms
• distorted spellings
• cases of inflectional variance not included in the wordlist
• scanning errors
• typos and misspellings in the original texts
Some tokens are not recognized at all; these tokens cannot be found by means of lexical or grammatical queries.
![Page 16: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/16.jpg)
Parsing
Some tokens receive several analyses.
The actual applicability of these analyses depends on the context and cannot be evaluated by the parser.
![Page 17: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/17.jpg)
| # of analyses | Comment | Fiction | Science | Press | Other Written | Oral Discourse | EANC Total |
|---|---|---|---|---|---|---|---|
| 1 | unambiguous | 73,9% | 65,9% | 70,4% | 68,0% | 63,0% | 70,9% |
| 2 | ambiguous (homonymous) | 15,4% | 9,8% | 12,4% | 12,3% | 14,1% | 13,2% |
| 3 | ambiguous (homonymous) | 2,7% | 2,0% | 1,9% | 3,8% | 2,4% | 2,3% |
| 4 - 7 | ambiguous (homonymous) | 1,4% | 1,8% | 1,8% | 1,6% | 1,5% | 1,6% |
| Subtotal | ambiguous | 19,5% | 13,7% | 16,0% | 17,7% | 18,0% | 17,1% |
| 1? | hypothetical (not in dictionary) | 0,0% | 1,3% | 0,6% | 0,7% | 0,2% | 0,5% |
| 0 | not recognized | 6,2% | 12,8% | 9,9% | 8,0% | 13,9% | 8,9% |
| Special tokens | Cyrillic, Latin, digits | 0,3% | 6,3% | 3,1% | 5,6% | 4,9% | 2,6% |
| Total | | 100% | 100% | 100% | 100% | 100% | 100% |
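A breakdown of this kind can be computed from the number of analyses the parser returns for each token. The sketch below assumes a plain list of such counts; the function name `ambiguity_profile` and the sample values are hypothetical and are not EANC data.

```python
from collections import Counter

def ambiguity_profile(analysis_counts):
    """Return the share of tokens (in percent) for each number of analyses."""
    total = len(analysis_counts)
    counts = Counter(analysis_counts)
    return {n: round(100 * c / total, 1) for n, c in sorted(counts.items())}

# Hypothetical input: number of analyses returned for each of ten tokens
# (0 = not recognized, 1 = unambiguous, 2 or more = ambiguous).
sample = [1, 1, 2, 1, 0, 1, 3, 1, 2, 1]
profile = ambiguity_profile(sample)
print(profile)
```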
![Page 18: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/18.jpg)
Search Functionality
Once again: the corpus makes it possible to investigate various linguistic phenomena by observing the range of contexts in which they occur.
• token queries
• context queries
• subcorpus queries
![Page 19: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/19.jpg)
Search Functionality
Simple token queries:
• lexeme search
• wordform search
• gram search
Combined token queries:
• lexeme + gram search
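The simple and combined token queries listed above could look like this over annotated tokens. The token records, the tag names and the `search` function are assumptions for illustration; no actual corpus interface is implied.

```python
# Hypothetical annotated tokens: wordform, lexeme and a set of grammatical tags.
TOKENS = [
    {"wordform": "books", "lexeme": "book", "gram": {"N", "pl"}},
    {"wordform": "booked", "lexeme": "book", "gram": {"V", "pst"}},
    {"wordform": "walks", "lexeme": "walk", "gram": {"V", "prs", "3sg"}},
    {"wordform": "Books", "lexeme": "book", "gram": {"N", "pl"}},
]

def search(tokens, wordform=None, lexeme=None, gram=None):
    """Return indices of tokens that satisfy every criterion that was given."""
    hits = []
    for i, tok in enumerate(tokens):
        if wordform is not None and tok["wordform"] != wordform:
            continue
        if lexeme is not None and tok["lexeme"] != lexeme:
            continue
        # All requested grammatical tags must be present on the token.
        if gram is not None and not set(gram) <= tok["gram"]:
            continue
        hits.append(i)
    return hits

print(search(TOKENS, lexeme="book"))                    # lexeme search
print(search(TOKENS, wordform="books"))                 # wordform search
print(search(TOKENS, gram={"V"}))                       # gram search
print(search(TOKENS, lexeme="book", gram={"N", "pl"}))  # lexeme + gram search
```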
![Page 20: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/20.jpg)
Search Functionality
Additional and advanced options for token queries:
• case-sensitivity
• punctuation marks
• position in the sentence
• wildcard queries
• logical functions
• negated features
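Some of these options can be sketched as a single matching function: a wildcard pattern, optional case-sensitivity, a logical OR over tags and negated tags. The `matches` function and its parameter names are hypothetical; wildcard matching uses `fnmatchcase` from the Python standard library.

```python
from fnmatch import fnmatchcase

def matches(token, pattern, case_sensitive=False, any_of=(), none_of=()):
    """token is a (wordform, set of grammatical tags) pair."""
    wordform, tags = token
    if not case_sensitive:
        wordform, pattern = wordform.lower(), pattern.lower()
    if not fnmatchcase(wordform, pattern):    # wildcard query, e.g. "walk*"
        return False
    if any_of and not tags & set(any_of):     # logical OR over tags
        return False
    if tags & set(none_of):                   # negated features
        return False
    return True

print(matches(("Walked", {"V", "pst"}), "walk*"))
print(matches(("Walked", {"V", "pst"}), "walk*", case_sensitive=True))
print(matches(("walked", {"V", "pst"}), "walk*", none_of=("pst",)))
```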
![Page 21: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/21.jpg)
Search Functionality
Context queries: a combination of several token queries
• search for tokens at a specified distance
• search for tokens within one sentence
• search for tokens in adjacent sentences
• increasing the number of tokens ad infinitum
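A context query for two tokens at a specified distance within one sentence might be sketched as below. The function `find_pairs` is hypothetical and matches plain wordforms only; a real engine would combine full token queries.

```python
def find_pairs(sentence, first, second, max_distance):
    """Return (i, j) index pairs where `first` is followed by `second`
    at most `max_distance` tokens later in the same sentence."""
    tokens = sentence.lower().split()
    pairs = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] == second:
                pairs.append((i, j))
    return pairs

sentence = "the corpus is a large corpus of texts"
print(find_pairs(sentence, "corpus", "texts", 2))
print(find_pairs(sentence, "corpus", "texts", 6))
```

Adding a third or fourth token amounts to chaining further conditions of the same kind.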
![Page 22: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/22.jpg)
Search Functionality
Subcorpus selection: searching in a specified type of texts only
• search within a specific period of time
• search in texts of specified authors
• search in specified genres/types of texts
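Subcorpus selection amounts to filtering documents by their metadata before the search runs. In the sketch below the document records, field names and the `select_subcorpus` function are invented for illustration.

```python
# Hypothetical document metadata.
DOCS = [
    {"title": "Novel A", "author": "Author X", "year": 1890, "genre": "fiction"},
    {"title": "Article B", "author": "Author Y", "year": 2005, "genre": "press"},
    {"title": "Story C", "author": "Author X", "year": 1960, "genre": "fiction"},
]

def select_subcorpus(docs, year_from=None, year_to=None, authors=None, genres=None):
    """Return titles of documents that satisfy every criterion that was given."""
    selected = []
    for doc in docs:
        if year_from is not None and doc["year"] < year_from:
            continue
        if year_to is not None and doc["year"] > year_to:
            continue
        if authors is not None and doc["author"] not in authors:
            continue
        if genres is not None and doc["genre"] not in genres:
            continue
        selected.append(doc["title"])
    return selected

print(select_subcorpus(DOCS, year_from=1950))
print(select_subcorpus(DOCS, authors={"Author X"}, genres={"fiction"}))
```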
![Page 23: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/23.jpg)
Search Functionality
Working with the results
• expanding the context
• pop-up grammar
• sort by…
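Expanding the context and sorting can be sketched with a keyword-in-context (KWIC) view. The `kwic` function and the sample sentence are hypothetical; the `width` parameter stands in for the "expand context" control.

```python
def kwic(tokens, hit, width=2):
    """Return (left context, keyword, right context) for the token at index `hit`."""
    left = tokens[max(0, hit - width):hit]
    right = tokens[hit + 1:hit + 1 + width]
    return " ".join(left), tokens[hit], " ".join(right)

tokens = "a national corpus is a balanced collection of texts".split()

# Default context, then an expanded one for the same hit.
print(kwic(tokens, 2))
print(kwic(tokens, 2, width=4))

# Sort all hits for "a" by their right-hand context.
hits = [i for i, tok in enumerate(tokens) if tok == "a"]
lines = sorted((kwic(tokens, i) for i in hits), key=lambda line: line[2])
print(lines)
```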
![Page 24: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/24.jpg)
Extras
• Translations (EANC)
• Disambiguation (RNC)
• Electronic library (EANC)
• Syntactic markup
• Statistics (RNC?)
![Page 25: What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.](https://reader033.fdocuments.in/reader033/viewer/2022051401/56649d0c5503460f949e09c0/html5/thumbnails/25.jpg)
Possible applications
• Linguistics (corpus-based grammar projects under way)
• Education (www.studiorum.ruscorpora.ru, to appear)
• Normative linguistics
• Literature and culture studies
• etc.