Developing annotation solutions for online data-driven learning

EUROCALL 2007 - University of Ulster, 5 - 8 September

Pascual Pérez-Paredes and Jose María Alcaraz

SACODEYL

Universidad de Murcia, Spain


System Aided Compilation and Open Distribution of European Youth Language

225836-CP-1-2005-1-ES-MINERVA-M


Developing annotation solutions for online data-driven learning

1. Annotation in CL

2. Annotating corpora for the FL classroom

3. Challenges of pedagogical annotation

4. Developing annotation solutions

5. SACODEYL annotator

Domain analysis

Requirements and software specification


1. Annotation in Corpus Linguistics


• Add-on

• Needs of the research community

• Annotation = analysis

• Annotation = processing

Annotation in Corpus Linguistics


Why annotate?

Annotation provides corpus users with both refined information retrieval capabilities and the means for subsequent treatment of the data


Annotation

• Can be automatic, semi-automatic or manual

• Can be performed by one or different annotators or software operators

• Reflects the nature of the ultimate aim of the meta-information being added to the corpus


• Non-polysemic ambiguity: Poesio and Artstein (2005)

• Interest in L2 speakers’ errors: Abe and Tono (2005)


A strong research paradigm rooted in grammatical tagging, including morphological and syntactic information (Garside, Leech and McEnery 1997).


2 Annotating corpora for the FL classroom

2.1 Corpora in the FL classroom


Interest in corpora and FLT:

• Volumes: Sinclair 2004; Braun, Kohn and Mukherjee 2006; Hidalgo, Quereda and Santana 2007

• SIG EUROCALL

• 1st International Conference on Corpus-Based Approaches to ELT, November 2007


Normalisation is still an issue:

• Mauranen (2004: 99) points out that for a teaching method to become an important innovation, it has to “make its way to the normal classroom where teachers and students can use it as part of their everyday routine, with not too much extra hassle”.

• Chambers 2007: major obstacles

• Braun 2007: secondary education


2 Annotating corpora for the FL classroom

2.2 Annotating with a view on learning


• Braun (2007): pedagogically motivated corpora

(a) provide a more systematic range of material than individual texts or scattered collections of activities and, if well-designed, (b) offer a wider range of idiolects than the average material.


• Braun (2006) states that thematic annotation, including topic keys and section titles, is particularly useful in the implementation of pedagogically motivated corpora.


<event start="0m0" end="1m24" video="horse_caravanning_ie" duration="1m24" wordcount="223">
<topic>
<topic_title>What we do</topic_title>
<topic_key>02 What we do</topic_key>
<content_key/>
</topic>
<speaker name="Dieter">In the 60s, in the late 60s, I had worked in Germany for a while and I decided that I wanted to have my children reared in Ireland. So we came back from Germany, working for the Irish Tourist Board and started this enterprise <break/>. It's lovely now with the sunshine, we don't always have it like this, but very often. We started with 12 and then 20 caravans, and now we have about 35. And it's been a basis of what which we can live as a family, raise our children in a nice environment. We work very hard for three months and then have a very relaxed time of it, nine months. And in that time then I took on as a hobby computers, and Mary took on tour-guiding. So we have various different aspects to what we do. The horse caravans is a very intensive work just for those three months, but it's very enjoyable because we mix in the family a quiet nine months where we are very much en famille with the children, you can concentrate on them much more than if we were nine-to-five workers. And then the intensity of the three months means that we can also have our children employed, and learning how to work, learning how to deal with people. So, good mixture, isn't it.<cut/></speaker>
</event>


• The annotators have a pedagogical use of the text in mind when approaching the annotation stage.

• The tags <topic_title>, <topic_key> and <content_key> highlight the relevance of the communicative purpose of texts, that is, the topics and the contents that characterize them.
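
How these tags support retrieval can be illustrated with a short Python sketch. It assumes the section markup of the sample above (`<event>`, `<topic>`, `<speaker>`) and uses only the standard library. The function name `describe_event` and the shortened sample string are illustrative assumptions, not part of the SACODEYL tools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated section shown above
# (one sentence of the speaker text is kept for brevity).
SAMPLE = """<event start="0m0" end="1m24" video="horse_caravanning_ie" duration="1m24" wordcount="223">
<topic><topic_title>What we do</topic_title><topic_key>02 What we do</topic_key><content_key/></topic>
<speaker name="Dieter">We started with 12 and then 20 caravans, and now we have about 35.<cut/></speaker>
</event>"""


def describe_event(xml_text):
    """Return the pedagogical metadata and the spoken text of one <event>."""
    event = ET.fromstring(xml_text)
    speaker = event.find("speaker")
    return {
        "topic_title": event.findtext("topic/topic_title"),
        "topic_key": event.findtext("topic/topic_key"),
        # An empty element such as <content_key/> yields an empty string.
        "content_key": event.findtext("topic/content_key"),
        "speaker": speaker.get("name"),
        "duration": event.get("duration"),
        # itertext() joins the text around inline markers such as <break/> and <cut/>.
        "text": "".join(speaker.itertext()).strip(),
    }


info = describe_event(SAMPLE)
print(info["topic_key"], "|", info["speaker"], "|", info["text"])
```

A teacher-facing search tool could filter on `topic_key` or `topic_title` in this way before any word-level query is run.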


[Diagram labels: Corpus; Language data; Annotation; Language; Metadata; Pedagogy]


3 Annotation challenges


Remember the “Why annotate?” slide

Annotation provides corpus users with both refined information retrieval capabilities and the means for subsequent treatment of the data

PEDAGOGY


Linguistic analysis of interest in FLT

Tsui (2004)

Corpus-based studies focus on 4 areas of description:

1. Lexical collocation

2. Syntactic patterning

3. Genre analysis

4. Discourse structure and cohesion

Word-based and relying on the co-occurrence of grammatical word-class tags


Researcher/Linguist

End user

Linguistic analysis of interest in FLT ------> Linguistics comes first ------> DDL materials

Concordances and corpus


Pedagogical analysis (and annotation) of language corpora ------> Pedagogy comes first ------> Pedagogy-driven DDL

Material developer/Teacher/Learner

End user


• Problem-oriented tagging

• Corpus applications in FLT still need to gain a status of their own

CHALLENGES


CHALLENGES

TECHNOLOGY

DESIGN

EPISTEMOLOGY


Leech (1993) maxims:

– it should be possible to remove the annotation from the text;

– if desired, the annotation could be extracted from the text;

– the annotation should be based on guidelines available to everyone;

– it should be made clear how and by whom the annotation was carried out;

– it should be based on widely agreed and theory-neutral principles.

DESIGN
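
The first two maxims (the annotation can be removed from the text, and it can be extracted on its own) can be sketched in Python as follows. This is a minimal illustration that assumes the section markup of the sample shown earlier in the slides. The function names and the choice of which tags count as "annotation values" are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Excerpt in the markup of the sample section (speaker text shortened).
ANNOTATED = """<event start="0m0" end="1m24" video="horse_caravanning_ie">
<topic><topic_title>What we do</topic_title><topic_key>02 What we do</topic_key><content_key/></topic>
<speaker name="Dieter">and started this enterprise <break/>. It's lovely now with the sunshine, we don't always have it like this, but very often.</speaker>
</event>"""

# Tags whose text content is itself annotation (assumption based on the sample).
ANNOTATION_VALUE_TAGS = {"topic_title", "topic_key", "content_key"}


def raw_text(xml_text):
    """Maxim 1: drop all markup and keep only what the speakers said."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(s.itertext()).strip() for s in root.iter("speaker"))


def extract_annotation(xml_text):
    """Maxim 2: list tags, attributes and annotation values without the transcript."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter():
        value = (elem.text or "").strip() if elem.tag in ANNOTATION_VALUE_TAGS else ""
        records.append((elem.tag, dict(elem.attrib), value))
    return records


print(raw_text(ANNOTATED))
for record in extract_annotation(ANNOTATED):
    print(record)
```

Keeping both operations cheap is what makes stand-alone reuse of either the text or the annotation layer realistic.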


• Presuppositions and foundations: antecedent implications in the literature

• Annotation oriented towards pedagogical uses

EPISTEMOLOGY


• Mukherjee (2006): corpora in language pedagogy for (a) dictionaries and material, (b) database and (c) representative samples of learner language.

EPISTEMOLOGY


• Meunier (2002), methodological influence: use of classroom concordancing and an inductive approach to learning, leading to the “rehabilitation” of grammar (p. 135)

EPISTEMOLOGY


• Bernardini (2000): inductive and deductive learning, a probabilistic notion of language, and a learning pedagogy that resolves the attention to form/meaning dichotomy

EPISTEMOLOGY


• Bernardini (2000):

learners as either researchers or travellers

EPISTEMOLOGY


• Bernardini (2004): potential of corpora as a linguistic aid: favour descriptive insights and discovery learning

EPISTEMOLOGY


• Pérez-Paredes (2003, 2004): integrative paradigm of CL in FLT

EPISTEMOLOGY


TECHNOLOGY

• User-friendly: non-computational linguists

• Multilingual support

• Standard-compliant: reusability and valorisation


4. Developing Annotation Solutions


Developing Annotation Solutions

From Challenges To Requirements

From a software engineering perspective, development can be considered as the following process:

Input -> Software Engineering -> Output

From Requirements To Solutions


Input Requirements

• Input = User Requirements

• Changing Approach = Changing Requirements

• Identifying New Requirements

– Five Perspectives: Context, Data, Actors, Epistemology, Empirical

[Diagram labels: Input Details; Analysis Process]


Actors & Context. Linguistic Engineering vs Pedagogical Engineering

Researching

• Powerful Tool

• Research Oriented

• Extensible & Modular

• Specific Domain

• Efficient

• Complexity

• Ad-Hoc Solutions

• Mandatory

Teaching

• Pedagogic Tool

• Learning Oriented

• Friendly

• General Domain

• Practical

• Simplicity

• Organizational

• Optional


Data. Grammatical vs Pedagogical

Linguistic Engineering

• Large amount of data (representative Corpora)

• Grammatical Annotation

• Oriented to retrieve statistical Information

Learning

• Reduced set of data

• Pedagogy Annotation

• Oriented to retrieve learning information (Hierarchical Structures & Selective Information)
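
The contrast between the two kinds of retrieval can be sketched in Python. The example below is only an illustration: the texts are taken from the annotated sample shown earlier in the slides, while the topic key "03 Family life" and both function names are hypothetical and introduced here for the sketch.

```python
import re
from collections import Counter

# Illustrative mini-corpus. The texts come from the annotated sample;
# the topic key "03 Family life" is a hypothetical label added for this sketch.
SECTIONS = [
    {"topic_key": "02 What we do",
     "text": "We started with 12 and then 20 caravans, and now we have about 35."},
    {"topic_key": "02 What we do",
     "text": "So we have various different aspects to what we do."},
    {"topic_key": "03 Family life",
     "text": "We work very hard for three months and then have a very relaxed time of it, nine months."},
]


def word_frequencies(sections):
    """Statistical retrieval: count word forms across the whole data set."""
    counts = Counter()
    for section in sections:
        counts.update(re.findall(r"[a-z']+", section["text"].lower()))
    return counts


def sections_for_topic(sections, topic_key):
    """Selective retrieval: return only the texts annotated with one topic key."""
    return [s["text"] for s in sections if s["topic_key"] == topic_key]


print(word_frequencies(SECTIONS).most_common(3))
for text in sections_for_topic(SECTIONS, "02 What we do"):
    print(text)
```

The first function needs a large corpus to say anything meaningful, whereas the second is useful even on a handful of sections, which is the point of the reduced, pedagogically annotated data set.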



Epistemological & Empirical

• Multi-Disciplinary support

• Multi-Lingual support

• Multi-Corpus Management

• Multi-Purpose Support

• Based on Standards


Choosing Software Life Cycle

Spiral Approach

Why?

[Diagram: spiral life cycle through Analysis, Design, Implementing and Testing; axes: Maturity, time]


5 SACODEYL Annotator


Output. SACODEYL Annotator

SACODEYL Annotator characteristics:

• Pedagogical Motivation

• Teaching Oriented

• Friendly Interface

• Multi-Language (UTF)

• Standardization (TEI)

• Multi-Purpose
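
What multi-language (UTF) support combined with TEI-style markup could look like is sketched below in Python. This is not the SACODEYL Annotator's actual output format, which the slides do not show. The element names `<div>` and `<u>` and the `who` attribute are borrowed from TEI's conventions for transcribed speech, and the topic key, speaker id and Spanish sentence are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_section(topic_key, speaker_id, text):
    """Build one annotated section using TEI-style element names (assumed layout)."""
    div = ET.Element("div", {"type": "section", "n": topic_key})
    utterance = ET.SubElement(div, "u", {"who": speaker_id})
    utterance.text = text
    return div


def to_utf8(element):
    """Serialise to UTF-8 bytes with an XML declaration."""
    return ET.tostring(element, encoding="utf-8", xml_declaration=True)


# Hypothetical Spanish example; the key, speaker id and sentence are invented.
data = to_utf8(build_section("01 Ocio", "#maria", "¿Qué haces los fines de semana?"))
restored = ET.fromstring(data)
print(restored.find("u").text)
```

Serialising to UTF-8 and parsing the bytes back lets the same pipeline handle accented and non-Latin text across the project's languages without per-language code paths.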



Developing annotation solutions for online data-driven learning

Contact information

Pascual Pérez-Paredes

Jose María Alcaraz

Universidad de Murcia, Spain