Compiling and Analyzing Your Own Learner Corpus

Xiaofei Lu
CALPER 2012 Summer Workshop

July 16, 2012

Workshop outline

Opening discussion and corpora overview
Graphic Online Language Diagnostic (GOLD) overview
Sample GOLD (and related) projects
GOLD (or related tool) project lab
GOLD (or related tool) project discussions
Concluding discussion

Opening discussion

Brief introduction of your professional/language background and teaching/research interests
Prior experience with corpus linguistics
Primary challenges you are dealing with
Primary purposes and goals for taking this workshop and for learning about corpus linguistics in general
Any other relevant information

Corpora overview

What is a corpus
Types of corpora
Corpus design and compilation
Corpus annotation
Corpus querying and analysis
Learner corpora and L2 development
Resources

What is a corpus?

Leech (1992):
  an unexciting phenomenon, a helluva lot of text, stored on a computer
Sinclair (1991):
  a collection of naturally-occurring language text, chosen to characterize a state or a variety of language
Sinclair (2004):
  a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

Types of corpora

General-purpose vs. specialized corpora
  British National Corpus & Russian National Corpus
  Michigan Corpus of Academic Spoken English
Native vs. learner corpora
  International Corpus of Learner English
  Spanish Learner Language Oral Corpora
Monolingual vs. parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer

Types of corpora (cont.)

Corpora representing one or diverse varieties
  International Corpus of English
Synchronic vs. diachronic corpora
  The Corpus of Historical American English
Spoken vs. written corpora
  Michigan Corpus of Upper-Level Student Papers

Corpus design

Purpose and type of corpus
  Spoken/written; cross-sectional/longitudinal
External criteria for content selection
  Communicative function of a text
  Mode, medium, interaction, domain, topic
Representativeness, balance, size, sampling
Design of the BNC

Corpus design (cont.)

Encoding meaningful metadata information
  Learner: L1, gender, program level, discipline …
  Sample: date, mode, task, genre, rating …
  Facilitates contrastive and longitudinal studies
MICASE speaker and transcript attributes
Corpus markup: The ICE example
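
One way to see why metadata facilitates contrastive and longitudinal studies is to store each sample together with its learner and sample attributes and then select on them. The sketch below is only an illustration: the `Sample` class, its field names, and the toy texts are hypothetical and not taken from any particular corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    # Learner attributes (hypothetical field names)
    learner_id: str
    l1: str
    program_level: int
    # Sample attributes
    collected: date
    genre: str
    text: str


def select(samples, **conditions):
    """Return the samples whose metadata match every keyword condition."""
    return [s for s in samples
            if all(getattr(s, field) == value for field, value in conditions.items())]


def timeline(samples):
    """Group samples by learner, each group ordered by collection date."""
    grouped = defaultdict(list)
    for s in sorted(samples, key=lambda s: s.collected):
        grouped[s.learner_id].append(s)
    return dict(grouped)


# Invented toy data for illustration only.
samples = [
    Sample("S01", "Chinese", 2, date(2012, 5, 1), "narrative", "Yesterday I went to the shops."),
    Sample("S01", "Chinese", 1, date(2012, 2, 1), "narrative", "Yesterday I go to the shops."),
    Sample("S02", "Spanish", 1, date(2012, 2, 1), "argumentative", "I am agree with this idea."),
]

contrastive = select(samples, program_level=1)   # different learners, same level
longitudinal = timeline(samples)["S01"]          # same learner over time
```

Without the metadata fields, neither selection would be possible from the raw texts alone.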

Corpus annotation

Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Standards and encoding

Why annotate

Raw text vs. annotated text: How do you…
  Count the number of words in a Chinese text?
  Calculate the lexical density of an English text?
  Count the frequency of can as a modal verb?
  Know how many T-units in a text are complex?
  Extract all imperative sentences from a text?
  Know whether a syntactic structure is used in a text?
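
As an illustration of the "can as a modal verb" question: once a text carries POS tags, the count becomes a simple lookup. The sketch below assumes the text has already been tagged in word_TAG format with Penn Treebank tags (MD = modal, NN = noun, VB = base-form verb). The sample sentence and its tags are invented for the example.

```python
from collections import Counter

# Hypothetical, hand-tagged sample in word_TAG format.
tagged = ("I_PRP can_MD open_VB the_DT can_NN ._. "
          "Can_MD you_PRP can_VB tomatoes_NNS ?_.")


def tag_counts(tagged_text, word):
    """Count how often `word` occurs under each POS tag (case-insensitive)."""
    counts = Counter()
    for item in tagged_text.split():
        # Split on the last underscore so tokens containing "_" still work.
        token, _, tag = item.rpartition("_")
        if token.lower() == word:
            counts[tag] += 1
    return counts


counts = tag_counts(tagged, "can")
print(counts["MD"])  # occurrences of "can" tagged as a modal
```

On raw text the same question cannot be answered by string matching alone, since all four occurrences of "can" look identical.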

Levels of corpus annotation

Sentence and word segmentation
Part-of-speech (POS) tagging and lemmatization
Syntactic parsing
Semantic, pragmatic, and discourse annotation
Learner corpora: error annotation
Project-specific annotation

Sentence and word segmentation

Why is this non-trivial?
  I went to the shops in Jones St. Saturday afternoon with Mr. Smith.
  I can’t remember whether it’s a second- or third-grade book.
  克林顿在讲话中指出
    Clinton pointed out in his speech (that…)
  克林顿 在 讲话 中 指出
    Clinton at speech middle point-out
  克林顿 在 讲话 中指 出
    Clinton at speech middle-finger out
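
Both problems can be reproduced with a few lines of code. The sketch below is illustrative only. It first splits the English example at every sentence-final punctuation mark, then repairs the result with a deliberately tiny, hypothetical abbreviation list. It then segments the Chinese example with greedy forward maximum matching over a toy lexicon, which is a common baseline and not necessarily what any tool mentioned in this workshop does.

```python
import re

text = ("I went to the shops in Jones St. Saturday afternoon with Mr. Smith. "
        "I can't remember whether it's a second- or third-grade book.")


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


# Hypothetical, deliberately tiny list; real systems need far larger ones.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}


def abbreviation_aware_split(text):
    """Re-join naive fragments that end in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


def forward_max_match(text, lexicon):
    """Greedy segmentation: always take the longest lexicon word at each position."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon containing the words from both readings on the slide.
LEXICON = {"克林顿", "在", "讲话", "中", "指出", "中指", "出"}

print(len(naive_split(text)), "naive sentences")
print(abbreviation_aware_split(text))
print(forward_max_match("克林顿在讲话中指出", LEXICON))
```

With this lexicon the greedy matcher picks 中指 ("middle finger") and strands 出, which is exactly the wrong reading above. The abbreviation fix has its own limits: it would wrongly join two sentences whenever "St." really does end one.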

POS tagging

The what and why
What are the difficulties?
  Ambiguity: 48% of tokens in the Brown Corpus
  Unknown words: neologisms
Tagsets: overspecification vs. underspecification
  Penn Treebank Tagset vs. CLAWS7 Tagset

Lemmatization

Counting linguistic items
  Types – number of different words
  Tokens – number of running words
What constitutes a different word type?
  go, went, gone, goes, going?
  differ, difference, different, differently?
  can as a noun, verb, and modal verb?
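
The effect of that decision on a type count can be sketched as follows. The sentence and the small lemma table are invented for illustration. A real lemmatizer would supply the mapping, and would also need POS information to separate the different uses of can.

```python
# Hypothetical toy text, already tokenized by whitespace.
tokens = "she goes home , he went home , they have gone home".split()

# Hypothetical lemma table covering only the inflected forms of "go".
LEMMAS = {"goes": "go", "went": "go", "gone": "go", "going": "go"}

words = [t for t in tokens if t.isalpha()]          # drop punctuation tokens

surface_types = set(words)                          # each distinct form is a type
lemma_types = {LEMMAS.get(w, w) for w in words}     # inflected forms collapse to a lemma

print(len(words), "tokens")
print(len(surface_types), "types when every form counts separately")
print(len(lemma_types), "types when forms of 'go' are merged")
```

Whether differ, difference, and different should also be merged is a separate choice (word families rather than lemmas) that the table above does not make.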

Demos and tools: Part 1

Xerox morphological analyzer (demo only)
ICTCLAS for Chinese segmentation and POS tagging
Querying POS-tagged corpora and Stanford POS tagger for English
TreeTagger for multiple languages

Chunking and parsing

Partial/full structural analysis of each sentence
  My dog likes eating sausage.
  (ROOT
    (S
      (NP (PRP$ My) (NN dog))
      (VP (VBZ likes)
        (S
          (VP (VBG eating)
            (NP (NN sausage)))))
      (. .)))
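
A bracketed parse like this one is easy to load into a program. The sketch below is a minimal reader written for this example, not the parser's own API. It turns the string into nested lists and then collects the words under every constituent with a given label.

```python
import re

PARSE = ("(ROOT (S (NP (PRP$ My) (NN dog)) "
         "(VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .)))")


def read_tree(bracketed):
    """Turn a bracketed parse into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return node

    return parse()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def phrases(node, label):
    """Return the word string of every constituent carrying `label`."""
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(phrases(child, label))
    return found


tree = read_tree(PARSE)
print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

Counting or extracting constituents this way is the basic operation behind retrieving grammatical patterns from parsed corpora.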

Chunking and parsing (cont’d)

What is it useful for?
  Retrieving examples of grammatical patterns
  Grammar checking, syntactic complexity analysis
  NLP applications that require syntactic analysis
Difficulties
  Ungrammatical sentences
  Ambiguities, e.g., PP attachment
  Errors from preprocessing steps

Semantic and discourse analysis

Semantic and discourse features
Word sense disambiguation
Propositional idea density
Coherence and cohesion

Annotation standards and encoding

Useful standards
  Separable, linguistically consensual
  Documentation, compatibility with existing standards
Encoding
  Simple encoding: present_JJ
  XML-style: <w type="JJ">present</w>
  Format varies, depending on level of annotation
Manual, computer-aided, and automatic annotation
  Efficiency, scale, reliability
  UAM CorpusTool
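
The two encodings carry the same information, so one can be converted into the other mechanically. The sketch below assumes whitespace-separated word_TAG input and reuses the `w` element and `type` attribute from the example above. The sample phrase is invented.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(tagged_text):
    """Convert 'word_TAG word_TAG ...' into a sequence of <w type="TAG">word</w> elements."""
    elements = []
    for item in tagged_text.split():
        # Split on the last underscore: the tag never contains one, the word might.
        word, _, tag = item.rpartition("_")
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)


converted = to_xml("the_DT present_JJ situation_NN")
print(converted)
```

`escape` and `quoteattr` keep the output well-formed when a token contains characters such as `&` or `<`.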

Demos and tools: Part 2

Stanford parser for Arabic, Chinese and English
Word sense disambiguation demo
Computerized Propositional Idea Density Rater
Coh-Metrix for text coherence analysis
CHILDES and CLAN
Computerized Profiling
WMatrix

Corpus querying and analysis

Manual analysis?
Corpus-specific online interfaces
  Raw: MICASE and MICUSP
  POS-tagged: Corpora @ BYU
  Grammatically and semantically tagged: RNC
General-purpose online interfaces: GOLD
Windows-based querying/concordancing tools
  WordSmith Tools & AntConc

Corpus querying and analysis (cont.)

Natural language processing tools
  Good for processing annotated corpora
  Extracting occurrences of grammatical patterns
  Examples: Stanford parser and Tregex

Resources

Books and journals
  Hunston (2002): Corpora in Applied Linguistics
  McEnery (2006): Corpus-Based Language Studies
  International Journal of Corpus Linguistics
  Corpus Linguistics and Linguistic Theory
  Corpora
Websites and mailing lists
  Bookmarks for corpus-based linguists
  Linguistic Data Consortium
  The corpora list; corpus in delicious
  Stanford Natural Language Processing Group

Discussion

What kind of corpus do you intend to compile and/or use? For what purpose?
What are the design issues?
How do you intend to format, organize and store your files?
Do you intend to annotate your corpus in some way? How?
How do you intend to search/query your corpus?

Learner corpora and L2 development

Samples from the same students at different times
  Did (targeted) language development take place?
  Was a particular pedagogical intervention effective?
Samples from different students
  In what areas do students show different levels of development?
  What factors affect students’ language development?

Graphic Online Language Diagnostic

A free online tool for teachers to assess their students’ language development
  Developed at CALPER, Penn State, funded by DOE
  Project co-directors: Xiaofei Lu and Michael McCarthy
Teachers can use GOLD to
  Compile, upload, and manage their own corpora
  Share corpora with each other
  Search and analyze corpora
Demonstration

Corpus compilation

A user can compile a corpus by
  Directly compiling and uploading an XML file
  Using the easy-to-use guided XML creation interface
An uploaded corpus can be easily managed
  Documents can be added or deleted
  The whole corpus can be deleted
  Content and metadata of individual documents can be easily accessed
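
For those compiling the XML file directly, the general idea can be sketched as below. GOLD's actual XML schema is not given in these slides, so the element names (`corpus`, `document`) and the attribute names are assumptions made for illustration. Check the tool's documentation for the required format. The learner texts are invented.

```python
import xml.etree.ElementTree as ET


def build_corpus(name, documents):
    """Build a <corpus> element with one <document> child per text (hypothetical schema)."""
    corpus = ET.Element("corpus", {"name": name})
    for metadata, text in documents:
        document = ET.SubElement(corpus, "document", metadata)  # metadata as attributes
        document.text = text                                    # content as element text
    return corpus


# Invented toy documents: (metadata, text) pairs.
documents = [
    ({"learner": "S01", "l1": "Chinese", "date": "2012-02-01", "genre": "narrative"},
     "Yesterday I go to the shops."),
    ({"learner": "S01", "l1": "Chinese", "date": "2012-05-01", "genre": "narrative"},
     "Yesterday I went to the shops."),
]

corpus = build_corpus("my_learner_corpus", documents)
xml_text = ET.tostring(corpus, encoding="unicode")

# Reading the file back: metadata and content stay attached to each document.
for document in ET.fromstring(xml_text).iter("document"):
    print(document.get("date"), document.text)
```

Keeping the metadata on each document, and not in a separate spreadsheet, is what later allows searches to be restricted by metadata conditions.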

Corpus sharing

GOLD facilitates easy data sharing
A corpus may be set to be
  Private, shared, or public
The corpus owner may give other users the right to
  View, add, edit, or delete corpora
Demonstration

Basic corpus information

Word count
  Alphabetic or numeric order
  Can be downloaded as a text file
Corpus and document statistics
  Mean sentence length
  Mean word length
  Type-token ratio
Demonstration
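
These measures are simple to compute once words and sentences have been identified. The sketch below shows one possible set of definitions for English text. The tokenization and sentence-splitting rules are assumptions, and GOLD's own definitions may differ. The sample text is invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic words (with an optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def statistics(text):
    words = tokenize(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "tokens": len(words),
        "types": len(set(words)),
        "type_token_ratio": len(set(words)) / len(words),
        "mean_sentence_length": len(words) / len(sentences),      # words per sentence
        "mean_word_length": sum(map(len, words)) / len(words),    # letters per word
    }


sample = "The dog likes sausage. The dog likes the cat."
stats = statistics(sample)

word_list = Counter(tokenize(sample))
numeric_order = word_list.most_common()          # most frequent first
alphabetic_order = sorted(word_list.items())     # a to z

print(stats)
print(numeric_order[:3])
```

Note that the type-token ratio depends on text length, so it is safest to compare it across samples of similar size.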

Corpus search

Select one or more corpora to search
Specify key words or phrases
  May use the wildcard character, e.g. book*
Specify contexts
  Size of context window
  Context words and their positions
Specify metadata conditions

Corpus search results

Display of search results
  Sortable KWIC display of search results
  Sortable graphic display of search results
Demonstration
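
The logic behind a wildcard search with a sortable KWIC (key word in context) display can be sketched as follows. This is a generic illustration, not GOLD's implementation. It uses shell-style matching from Python's `fnmatch` module (which also gives `?` and `[]` special meaning), and the sample sentence is invented.

```python
from fnmatch import fnmatchcase


def kwic(tokens, pattern, window=2):
    """Return (left context, keyword, right context) for every token matching `pattern`."""
    lines = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token.lower(), pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


tokens = "She bought two books at the bookstore and booked a table".split()
hits = kwic(tokens, "book*", window=2)

# Sort by the context to the right of the keyword, as a concordancer might.
by_right_context = sorted(hits, key=lambda line: line[2].lower())

for left, keyword, right in by_right_context:
    print(f"{left:>15}  [{keyword}]  {right}")
```

The `window` parameter plays the role of the context window size. Sorting on the left or right context groups recurring patterns together.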

Lexical bundle/collocation search

Procedure
  Select one or more corpora to search
  Specify search word
  Specify contexts
  Specify metadata conditions
Search results
  Sortable list of n-grams found in selected corpora
Demonstration
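
A list of n-grams containing a search word can be produced with a sliding window over the tokens. The sketch below is a generic illustration with an invented sample and is not GOLD's own code. Real bundle studies usually also apply frequency and dispersion thresholds.

```python
from collections import Counter


def bundles(tokens, word, n=3):
    """Count every n-gram (sequence of n adjacent tokens) that contains `word`."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts


tokens = "on the other hand i think that on the other hand we know".split()
trigrams = bundles(tokens, "other", n=3)

# Most frequent first, like a list sorted by frequency.
for gram, frequency in trigrams.most_common():
    print(frequency, " ".join(gram))
```

Changing `n` to 4 would surface "on the other hand" itself as the recurring bundle.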

Summary of features

Difference from other online tools
  Can create, share, and search multiple corpora
  Can easily search subsets of data
  Can work with any language
Summary of corpus analysis functions
  Word list
  Corpus and document statistics: mean sentence length, mean word length, type-token ratio
  Corpus search and collocation search

Sample questions to ask

With data from an individual student, one can either describe or track development in
  Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
  Lexical and syntactic complexity
  Appropriate usage of words and phrases in context
  Patterns of usage of lexical bundles

Sample questions to ask (cont.)

With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of
  Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
  Lexical and syntactic complexity
  Appropriate usage of words and phrases in context
  Patterns of usage of lexical bundles
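
Overuse and underuse are usually judged by comparing normalized frequencies, since samples differ in length. The sketch below compares a word's rate per 1,000 tokens in two invented toy samples. The data, the function names, and the choice of 1,000 as the normalization base are assumptions for illustration. A real comparison would need much larger samples and a significance test.

```python
def per_thousand(tokens, word):
    """Frequency of `word` per 1,000 tokens."""
    return 1000 * tokens.count(word) / len(tokens)


def usage_ratio(group_a, group_b, word):
    """Ratio above 1: more frequent in group A; below 1: less frequent in group A."""
    return per_thousand(group_a, word) / per_thousand(group_b, word)


# Invented toy samples standing in for two groups of writers.
group_a = "it is very good and very nice and very big".split()
group_b = "it is quite good and very nice and rather big".split()

rate_a = per_thousand(group_a, "very")
rate_b = per_thousand(group_b, "very")
ratio = usage_ratio(group_a, group_b, "very")

print(rate_a, rate_b, ratio)
```

`usage_ratio` fails with a division by zero when the word never occurs in the second group, so such cases need separate handling.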

Future enhancements

Corpora for benchmarking
Multilingual natural language processing
Suggestions on desirable functions welcome