Compiling and Analyzing Your Own Learner Corpus

Xiaofei Lu, CALPER 2012 Summer Workshop, July 16, 2012


Page 1: Compiling and Analyzing  Your Own Learner Corpus

Compiling and Analyzing Your Own Learner Corpus

Xiaofei Lu
CALPER 2012 Summer Workshop
July 16, 2012

Page 2: Compiling and Analyzing  Your Own Learner Corpus

Workshop outline
  Opening discussion and corpora overview
  Graphic Online Language Diagnostic (GOLD) overview
  Sample GOLD (and related) projects
  GOLD (or related tool) project lab
  GOLD (or related tool) project discussions
  Concluding discussion

Page 3: Compiling and Analyzing  Your Own Learner Corpus

Opening discussion
  Brief introduction of your professional/language background and teaching/research interests
  Prior experience with corpus linguistics
  Primary challenges you are dealing with
  Primary purposes and goals for taking this workshop and for learning about corpus linguistics in general
  Any other relevant information

Page 4: Compiling and Analyzing  Your Own Learner Corpus

Corpora overview
  What is a corpus
  Types of corpora
  Corpus design and compilation
  Corpus annotation
  Corpus querying and analysis
  Learner corpora and L2 development
  Resources

Page 5: Compiling and Analyzing  Your Own Learner Corpus

What is a corpus?
  Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
  Sinclair (1991): a collection of naturally-occurring language text, chosen to characterize a state or a variety of language
  Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

Page 6: Compiling and Analyzing  Your Own Learner Corpus

Types of corpora
  General-purpose vs. specialized corpora
    British National Corpus & Russian National Corpus
    Michigan Corpus of Academic Spoken English
  Native vs. learner corpora
    International Corpus of Learner English
    Spanish Learner Language Oral Corpora
  Monolingual vs. parallel & comparable corpora
    The JRC-Acquis Multilingual Parallel Corpus
    The English-Chinese Parallel Concordancer

Page 7: Compiling and Analyzing  Your Own Learner Corpus

Types of corpora (cont.)
  Corpora representing one or diverse varieties
    International Corpus of English
  Synchronic vs. diachronic corpora
    The Corpus of Historical American English
  Spoken vs. written corpora
    Michigan Corpus of Upper-Level Student Papers

Page 8: Compiling and Analyzing  Your Own Learner Corpus

Corpus design
  Purpose and type of corpus
    Spoken/written; cross-sectional/longitudinal
  External criteria for content selection
    Communicative function of a text
    Mode, medium, interaction, domain, topic
  Representativeness, balance, size, sampling
  Design of the BNC

Page 9: Compiling and Analyzing  Your Own Learner Corpus

Corpus design (cont.)
  Encoding meaningful metadata information
    Learner: L1, gender, program level, discipline …
    Sample: date, mode, task, genre, rating …
    Facilitates contrastive and longitudinal studies
  MICASE speaker and transcript attributes
  Corpus markup: The ICE example
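
One way to picture how metadata supports contrastive and longitudinal studies is to attach it to every sample and then group or sort on it. The sketch below is only an illustration: the field names (`l1`, `level`, `date`, `task`) and the three records are hypothetical and are not taken from MICASE, ICE, or any real corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One learner text plus hypothetical learner and sample metadata."""
    learner_id: str
    l1: str      # learner's first language
    level: str   # program level
    date: str    # ISO date of the sample, e.g. "2012-02-01"
    task: str
    text: str


def group_by(samples, field):
    """Group samples by one metadata field (contrastive view)."""
    groups = defaultdict(list)
    for s in samples:
        groups[getattr(s, field)].append(s)
    return dict(groups)


def timeline(samples, learner_id):
    """Return one learner's samples in date order (longitudinal view)."""
    own = [s for s in samples if s.learner_id == learner_id]
    return sorted(own, key=lambda s: s.date)


# Invented records, for illustration only.
corpus = [
    Sample("s01", "Chinese", "intermediate", "2012-05-10", "essay", "Second essay text"),
    Sample("s01", "Chinese", "intermediate", "2012-02-01", "essay", "First essay text"),
    Sample("s02", "Spanish", "advanced", "2012-02-01", "essay", "Another essay text"),
]

by_l1 = group_by(corpus, "l1")
print({l1: len(items) for l1, items in by_l1.items()})
print([s.date for s in timeline(corpus, "s01")])
```

Storing dates in ISO format is a design choice made here so that plain string sorting gives chronological order.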

Page 10: Compiling and Analyzing  Your Own Learner Corpus

Corpus annotation
  Why annotate
  Levels of corpus annotation
  Difficulties for corpus annotation
  Standards and encoding

Page 11: Compiling and Analyzing  Your Own Learner Corpus

Why annotate
  Raw text vs. annotated text: How do you…
    Count the number of words in a Chinese text?
    Calculate the lexical density of an English text?
    Count the frequency of can as a modal verb?
    Know how many T-units in a text are complex?
    Extract all imperative sentences from a text?
    Know whether a syntactic structure is used in a text?
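
Two of these questions become straightforward once the text carries POS tags. The sketch below assumes input in the simple `word_TAG` format with Penn Treebank tags; the example sentence is hand-tagged for illustration, and treating nouns, verbs, adjectives, and adverbs as the content-word classes is one common operationalization of lexical density, not the only one.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def count_word_with_tag(pairs, word, tag):
    """Count occurrences of a word form carrying a specific POS tag."""
    return sum(1 for w, t in pairs if w.lower() == word and t == tag)


def lexical_density(pairs):
    """Share of content words among all word tokens (punctuation excluded)."""
    content_prefixes = ("NN", "VB", "JJ", "RB")  # assumed content-word classes
    words = [(w, t) for w, t in pairs if t[0].isalpha()]
    content = [w for w, t in words if t.startswith(content_prefixes)]
    return len(content) / len(words) if words else 0.0


# Hand-tagged example sentence (Penn Treebank tags), for illustration only.
tagged = "I_PRP can_MD open_VB the_DT can_NN quickly_RB ._."
pairs = parse_tagged(tagged)

print(count_word_with_tag(pairs, "can", "MD"))  # can as a modal verb
print(count_word_with_tag(pairs, "can", "NN"))  # can as a noun
print(lexical_density(pairs))
```

Without the tags, the two uses of can in this sentence could not be told apart by string matching alone. Note that this sketch also counts auxiliary verbs tagged VB* as content words, which a more careful analysis might exclude.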

Page 12: Compiling and Analyzing  Your Own Learner Corpus

Levels of corpus annotation
  Sentence and word segmentation
  Part-of-speech (POS) tagging and lemmatization
  Syntactic parsing
  Semantic, pragmatic, and discourse annotation
  Learner corpora: error annotation
  Project-specific annotation

Page 13: Compiling and Analyzing  Your Own Learner Corpus

Sentence and word segmentation
  Why is this non-trivial?
    I went to the shops in Jones St. Saturday afternoon with Mr. Smith.
    I can’t remember whether it’s a second- or third-grade book.
    克林顿在讲话中指出
      Clinton pointed out in his speech (that…)
    克林顿 在 讲话 中 指出
      Clinton at speech middle point-out
    克林顿 在 讲话 中指 出
      Clinton at speech middle-finger out
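
The first English example can be used to see why splitting on periods is not enough. The sketch below contrasts a naive splitter with one that consults a small abbreviation list; the list is a tiny hypothetical one, and real segmenters rely on much larger lists or on trained models.

```python
import re

ABBREVIATIONS = {"St.", "Mr.", "Mrs.", "Dr."}  # tiny hypothetical list


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def abbreviation_aware_split(text):
    """Like naive_split, but do not break after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("I went to the shops in Jones St. Saturday afternoon with Mr. Smith. "
        "I can't remember whether it's a second- or third-grade book.")

print(len(naive_split(text)))               # over-splits at "St." and "Mr."
print(len(abbreviation_aware_split(text)))
```

The abbreviation list fixes this example but introduces the opposite error when an abbreviation really does end a sentence, which is part of what makes the task non-trivial.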

Page 14: Compiling and Analyzing  Your Own Learner Corpus

POS tagging
  The what and why
  What are the difficulties?
    Ambiguity: 48% of tokens in the Brown Corpus
    Unknown words: neologisms
  Tagsets: overspecification vs. underspecification
    Penn Treebank Tagset vs. CLAWS7 Tagset

Page 15: Compiling and Analyzing  Your Own Learner Corpus

Lemmatization
  Counting linguistic items
    Types: number of different words
    Tokens: number of words
  What constitutes a different word type?
    go, went, gone, goes, going?
    differ, difference, different, differently?
    can as a noun, verb, and modal verb?
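
How the type count depends on the answer to that question can be shown with a small count. The sketch below counts types and tokens with and without lemmatization; the lemma table is a hand-made, hypothetical one covering only the forms of go, and the sentence is invented.

```python
from collections import Counter

# Hypothetical hand-made lemma table; a real lemmatizer covers far more forms.
LEMMAS = {"went": "go", "gone": "go", "goes": "go", "going": "go"}


def count_types_and_tokens(tokens, lemmatize=False):
    """Return (number of types, number of tokens) for a list of word forms."""
    forms = [t.lower() for t in tokens]
    if lemmatize:
        forms = [LEMMAS.get(f, f) for f in forms]
    return len(Counter(forms)), len(forms)


tokens = "She goes where he went and they go where we have gone".split()

print(count_types_and_tokens(tokens))                  # inflected forms kept apart
print(count_types_and_tokens(tokens, lemmatize=True))  # forms of "go" merged
```

The token count is the same in both cases; only the number of types changes, which is why any type-based measure should state whether it was computed over word forms or lemmas.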

Page 16: Compiling and Analyzing  Your Own Learner Corpus

Demos and tools: Part 1
  Xerox morphological analyzer (demo only)
  ICTCLAS for Chinese segmentation and POS tagging
  Querying POS-tagged corpora and Stanford POS tagger for English
  Tree Tagger for multiple languages

Page 17: Compiling and Analyzing  Your Own Learner Corpus

Chunking and parsing
  Partial/full structural analysis of each sentence
  My dog likes eating sausage.
  (ROOT
    (S
      (NP (PRP$ My) (NN dog))
      (VP (VBZ likes)
        (S
          (VP (VBG eating)
            (NP (NN sausage)))))
      (. .)))
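
A bracketed parse like this one is easy to read into a program. The sketch below is a minimal reader written for this illustration (it is not the Stanford parser's own API): it turns the bracketed string into nested lists, then lists the tagged words and counts nodes with a given label.

```python
import re


def read_tree(bracketed):
    """Parse a Penn Treebank-style bracketed string into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                           # consume "("
        node = [tokens[pos]]               # node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())  # child subtree
            else:
                node.append(tokens[pos])   # a word
                pos += 1
        pos += 1                           # consume ")"
        return node

    return parse_node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the leaves, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]
    pairs = []
    for child in tree[1:]:
        pairs.extend(tagged_words(child))
    return pairs


def count_label(tree, label):
    """Count nodes carrying the given label, e.g. "VP"."""
    if isinstance(tree, str):
        return 0
    return int(tree[0] == label) + sum(count_label(c, label) for c in tree[1:])


sentence_parse = ("(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) "
                  "(S (VP (VBG eating) (NP (NN sausage))))) (. .)))")
tree = read_tree(sentence_parse)

print(tagged_words(tree))
print(count_label(tree, "VP"))
```

Counting labels this way is a first step toward the kind of pattern retrieval that dedicated tools perform with much richer query languages.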

Page 18: Compiling and Analyzing  Your Own Learner Corpus

Chunking and parsing (cont’d)
  What is it useful for?
    Retrieving examples of grammatical patterns
    Grammar checking, syntactic complexity analysis
    NLP applications that require syntactic analysis
  Difficulties
    Ungrammatical sentences
    Ambiguities, e.g., PP attachment
    Errors from preprocessing steps

Page 19: Compiling and Analyzing  Your Own Learner Corpus

Semantic and discourse analysis
  Semantic and discourse features
  Word sense disambiguation
  Propositional idea density
  Coherence and cohesion

Page 20: Compiling and Analyzing  Your Own Learner Corpus

Annotation standards and encoding
  Useful standards
    Separable, linguistically consensual
    Documentation, compatibility with existing standards
  Encoding
    Simple encoding: present_JJ
    XML-style: <w type="JJ">present</w>
    Format varies, depending on level of annotation
  Manual, computer-aided, and automatic annotation
    Efficiency, scale, reliability
    UAM CorpusTool
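
The two encodings above carry the same information, so one can be derived from the other. The sketch below converts `word_TAG` tokens into `<w type="TAG">word</w>` elements; the function name is made up for this illustration, and the example phrase is hand-tagged.

```python
from xml.sax.saxutils import escape, quoteattr


def simple_to_xml(tagged_text):
    """Convert 'word_TAG' tokens to '<w type="TAG">word</w>' elements."""
    elements = []
    for token in tagged_text.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, _, tag = token.rpartition("_")
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)


print(simple_to_xml("the_DT present_JJ situation_NN"))
```

Escaping matters here because characters such as & or < are legal in running text but not in raw XML content.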

Page 21: Compiling and Analyzing  Your Own Learner Corpus

Demos and tools: Part 2
  Stanford parser for Arabic, Chinese and English
  Word sense disambiguation demo
  Computerized Propositional Idea Density Rater
  Coh-Metrix for text coherence analysis
  CHILDES and CLAN
  Computerized Profiling
  WMatrix

Page 22: Compiling and Analyzing  Your Own Learner Corpus

Corpus querying and analysis
  Manual analysis?
  Corpus-specific online interfaces
    Raw: MICASE and MICUSP
    POS-tagged: Corpora @ BYU
    Grammatically and semantically tagged: RNC
  General-purpose online interfaces: GOLD
  Windows-based querying/concordancing tools
    WordSmith Tools & AntConc

Page 23: Compiling and Analyzing  Your Own Learner Corpus

Corpus querying and analysis
  Natural language processing tools
    Good for processing annotated corpora
    Extracting occurrences of grammatical patterns
    Examples: Stanford parser and Tregex

Page 24: Compiling and Analyzing  Your Own Learner Corpus

Resources
  Books and journals
    Hunston (2002): Corpora in Applied Linguistics
    McEnery (2006): Corpus-Based Language Studies
    International Journal of Corpus Linguistics
    Corpus Linguistics and Linguistic Theory
    Corpora
  Websites and mailing lists
    Bookmarks for corpus-based linguists
    Linguistic data consortium
    The corpora list; corpus in delicious
    Stanford Natural Language Processing Group

Page 25: Compiling and Analyzing  Your Own Learner Corpus

Discussion
  What kind of corpus do you intend to compile and/or use? For what purpose?
  What are the design issues?
  How do you intend to format, organize and store your files?
  Do you intend to annotate your corpus in some way? How?
  How do you intend to search/query your corpus?

Page 26: Compiling and Analyzing  Your Own Learner Corpus

Learner corpora and L2 development
  Samples from the same students at different times
    Did (targeted) language development take place?
    Was a particular pedagogical intervention effective?
  Samples from different students
    In what areas do students show different levels of development?
    What factors affect students’ language development?

Page 27: Compiling and Analyzing  Your Own Learner Corpus

Graphic Online Language Diagnostic
  A free online tool for teachers to assess their students’ language development
    Developed at CALPER, Penn State, funded by DOE
    Project co-directors: Xiaofei Lu and Michael McCarthy
  Teachers can use GOLD to
    Compile, upload, and manage their own corpora
    Share corpora with each other
    Search and analyze corpora
  Demonstration

Page 28: Compiling and Analyzing  Your Own Learner Corpus

Corpus compilation
  A user can compile a corpus by
    Directly compiling and uploading an XML file
    Using the easy-to-use guided XML creation interface
  An uploaded corpus can be easily managed
    Documents can be added or deleted
    The whole corpus can be deleted
    Content and metadata of individual documents can be easily accessed

Page 29: Compiling and Analyzing  Your Own Learner Corpus

Corpus sharing
  GOLD facilitates easy data sharing
  A corpus may be set to be
    Private, shared, or public
  The corpus owner may give other users the right to
    View, add, edit, or delete corpora
  Demonstration

Page 30: Compiling and Analyzing  Your Own Learner Corpus

Basic corpus information
  Word count
    Alphabetic or numeric order
    Can be downloaded as a text file
  Corpus and document statistics
    Mean sentence length
    Mean word length
    Type-token ratio
  Demonstration
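
The three statistics listed above can be approximated in a few lines. The slides do not say how GOLD computes them, so the sketch below rests on crude assumptions of its own: sentences end at a period, exclamation mark, or question mark, and words are runs of letters and apostrophes.

```python
import re


def corpus_statistics(text):
    """Compute mean sentence length, mean word length, and type-token ratio."""
    # Crude assumptions: sentences end at . ! or ?; words are letters/apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return None
    return {
        "mean_sentence_length": len(words) / len(sentences),          # words per sentence
        "mean_word_length": sum(len(w) for w in words) / len(words),  # characters per word
        "type_token_ratio": len(set(words)) / len(words),
    }


sample = "The cat sat. The cat ran away!"
print(corpus_statistics(sample))
```

Type-token ratio falls as texts get longer, so it is safest to compare it across samples of similar length.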

Page 31: Compiling and Analyzing  Your Own Learner Corpus

Corpus search
  Select one or more corpora to search
  Specify key words or phrases
    May use the wildcard character, e.g. book*
  Specify contexts
    Size of context window
    Context words and their positions
  Specify metadata conditions
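
The search steps above can be sketched as a small keyword-in-context search. This is an illustration of the general technique, not GOLD's implementation: the document structure, the `l1` metadata field, and the function names are all hypothetical.

```python
import re


def wildcard_to_regex(query):
    """Turn a query such as 'book*' into a whole-word regular expression."""
    parts = [re.escape(p) for p in query.split("*")]
    return re.compile(r"^" + r"\w*".join(parts) + r"$", re.IGNORECASE)


def search(documents, query, window=3, **conditions):
    """Return (left context, hit, right context) for hits in matching documents."""
    pattern = wildcard_to_regex(query)
    hits = []
    for doc in documents:
        # Skip documents whose metadata do not satisfy every condition.
        if any(doc["meta"].get(k) != v for k, v in conditions.items()):
            continue
        tokens = re.findall(r"\w+", doc["text"])
        for i, token in enumerate(tokens):
            if pattern.match(token):
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits


# Invented documents and metadata, for illustration only.
documents = [
    {"meta": {"l1": "Chinese"}, "text": "I borrowed two books from the library."},
    {"meta": {"l1": "Spanish"}, "text": "She booked a table and read a book."},
]

for left, hit, right in search(documents, "book*", window=3):
    print(f"{left:>25} [{hit}] {right}")
```

Passing a metadata condition, for example `search(documents, "book*", l1="Chinese")`, restricts the search to the matching subset of documents.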

Page 32: Compiling and Analyzing  Your Own Learner Corpus

Corpus search results
  Display of search results
    Sortable KWIC display of search results
    Sortable graphic display of search results
  Demonstration

Page 33: Compiling and Analyzing  Your Own Learner Corpus

Lexical bundle/collocation search
  Procedure
    Select one or more corpora to search
    Specify search word
    Specify contexts
    Specify metadata conditions
  Search results
    Sortable list of n-grams found in selected corpora
  Demonstration
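
A list of n-grams around a search word can be produced with a sliding window. The sketch below is a generic illustration rather than GOLD's own procedure; the function name and the sample text are invented.

```python
from collections import Counter


def ngrams_with_word(tokens, search_word, n=3):
    """Count n-grams of length n that contain the search word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if search_word in gram:
            counts[gram] += 1
    return counts


text = ("on the other hand the results differ on the other hand "
        "the method is simple")
tokens = text.split()

# Sorted by frequency, most frequent n-grams first.
for gram, freq in ngrams_with_word(tokens, "hand", n=3).most_common():
    print(freq, " ".join(gram))
```

Recurrent n-grams such as the other hand are candidates for lexical bundles; studies usually add a minimum frequency and a minimum number of texts before calling a sequence a bundle.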

Page 34: Compiling and Analyzing  Your Own Learner Corpus

Summary of features
  Difference from other online tools
    Can create, share, and search multiple corpora
    Can easily search subsets of data
    Can work with any language
  Summary of corpus analysis functions
    Word list
    Corpus and document statistics: mean sentence length, mean word length, type-token ratio
    Corpus search and collocation search

Page 35: Compiling and Analyzing  Your Own Learner Corpus

Sample questions to ask
  With data from an individual student, one can either describe or track development in
    Patterns of usage of words and phrases: frequency, underuse, overuse, etc.
    Lexical and syntactic complexity
    Appropriate usage of words and phrases in context
    Patterns of usage of lexical bundles

Page 36: Compiling and Analyzing  Your Own Learner Corpus

Sample questions to ask (cont.)
  With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of
    Patterns of usage of words and phrases: frequency, underuse, overuse, etc.
    Lexical and syntactic complexity
    Appropriate usage of words and phrases in context
    Patterns of usage of lexical bundles
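
Comparing frequency across groups requires normalizing for text length first. The sketch below computes a per-1,000-token rate for a target word in two groups and the ratio between them; the token lists are invented stand-ins for student texts, and the function names are hypothetical.

```python
def per_thousand(tokens, target):
    """Frequency of target per 1,000 tokens."""
    if not tokens:
        return 0.0
    return 1000 * sum(1 for t in tokens if t == target) / len(tokens)


def compare_groups(group_a, group_b, target):
    """Return normalized frequencies for two groups and their ratio (a / b)."""
    freq_a = per_thousand(group_a, target)
    freq_b = per_thousand(group_b, target)
    ratio = freq_a / freq_b if freq_b else None  # undefined if b never uses it
    return freq_a, freq_b, ratio


# Invented token lists standing in for two groups of student texts.
group_a = "however the data show however that the effect is small".split()
group_b = ("the data show that the effect is small however "
           "the sample is limited in size").split()

print(compare_groups(group_a, group_b, "however"))
```

A ratio well above 1 suggests that group A uses the word more than group B, but with samples this small it proves nothing; real comparisons need far more data and usually a significance test such as log-likelihood.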

Page 37: Compiling and Analyzing  Your Own Learner Corpus

Future enhancements
  Corpora for benchmarking
  Multilingual natural language processing
  Suggestions on desirable functions welcome