Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität...

42
Challenges and Solutions in Automatically Annotating CMC Data Torsten Zesch Senior Researcher UKP Lab, Technische Universität Darmstadt Associated Researcher German Institute for International Educational Research, Frankfurt International workshop: Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives Feb 1415, 2013 TU Dortmund

Transcript of Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität...

Page 1: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

Challenges and Solutions in Automatically Annotating CMC Data

Torsten Zesch Senior Researcher

UKP Lab, Technische Universität Darmstadt

Associated Researcher

German Institute for International Educational Research, Frankfurt

International workshop:

Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives

Feb 14–15, 2013

TU Dortmund

Page 2: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

2

Challenging Properties of CMC Data

Very heterogeneous (Twitter, Wikipedia, Chat, Discussion forum, …)

No tool fits all

Even access might be challenging

One’s junk is the other’s research interest :)

Many common assumptions do not hold

Non-standard language

“lmao s/o to the cool ass asian officer 4 #1 not runnin my license and #2 not takin dru boo

to jail . Thank u God . #amen”

Standard tools might not work & need to deal with spelling errors, abbreviations, and

variants

Often (relatively) small samples

Hard to build automatic models

Page 3: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

3

Working with CMC Data

Mix of manual and automatic tasks

Data

Automatic Analysis

Manual Analysis

Page 4: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

4

Wrong tool for the job?

Challenge

Page 5: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

5

Availability of Tools & Models

Challenge

Page 6: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

6

Comprehensive NLP Framework

Proposed Solution

Page 7: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

7

Apache UIMA Unstructured Information Management Architecture

Component-based architecture

Designed to scale

Flexible type system

High-density in-memory data representation

2004 – IBM alphaWorks project

still used e.g. in IBM LanguageWare

2006 – Apache Incubator project

2009 – OASIS Standard

2010 – Full Apache project

2010 – Used in IBM’s Watson

Jeopardy Challenge

http://uima.apache.org

Page 8: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

8

Pipeline Architecture

Parser

Writer

Semantic analysis

Syntactic analysis

Morphological analysis

Linguistic preprocessing

Reader

Named entity

tagger

PoS tagger

Stemmer

Lemmatizer

Compound

splitter

Tokenizer

Sentence

splitter

Stopword

tagger

a

bc

Datastore / Results

a

bc

a

bc

a

bc

a

bc

Page 9: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

9

Information Flow S

ou

rce D

ocu

men

t C

oll

ecti

on

An

no

tate

d D

ocu

men

t C

oll

ecti

on

Tokeniz

er

Sente

nce S

plit

ter

PO

S T

agger

Documents

Tokens Tokens Tokens

Sentences Sentences

POS

Documents Documents Documents

Module1 Module2 Module3

Page 10: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

10

So far, so good … but the framework is empty

Page 12: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

12

DKPro

Core

DKPro

Semantics

DKPro

Information

Retrieval

Tokenizer

Part

of Speech

Tagger

Lemmatizer

Named

Entity

Recognition

Parser I/O

Keyphrase

Extraction

Text

Segmenting

Link

Discovery

Text

Summary

Apache

Lucene

Indexing

API Query API Terrier 2

Latent

Semantic

Indexing

Terrier 3

Darmstadt Knowledge Processing

Software Repository

Relatedness Spelling

Page 13: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

13

Tokenizer

Part

of Speech

Tagger

Lemmatizer

Named

Entity

Recognition

Parser I/O

Keyphrase

Extraction

Text

Segmenting

Link

Discovery

Text

Summary

Apache

Lucene

Indexing

API Query API Terrier 2

Latent

Semantic

Indexing

Terrier 3

Darmstadt Knowledge Processing

Software Repository

Relatedness Spelling

DKPro

Core

DKPro

Semantics

DKPro

Information

Retrieval

Page 14: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

14

Multilinguality

Automatically select the correct language-

specific components

Database Writer

Named Entity Recognition

POS Tagging

Sentence splitting

Tokenization

Text Document Reader

de en

TreeTagger German

TreeTagger English

de

en

Page 15: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

15

Other Dimensions

Flexibility

Change I/O or replace tagger without breaking the pipeline

Scalability

Process large amounts of data on Hadoop cluster

Efficiency

Integrated access to large resources, e.g. web-scale language models

Robustness & Accuracy

Not influenced by the choice of the framework

But: UIMA encourages to improve existing tools instead of developing new ones

Page 16: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

16

Pro

Large collection of interoperable components

Contra

Programming knowledge needed (only very little)

Not (mainly) targeted towards CMC

Proposed Solution

Comprehensive NLP Framework

Page 17: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

17

Cool Tool – Wrong Environment

Challenge

Page 18: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

18

Domain Adaptation / Retraining

Challenge

Page 19: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

19

Examples

Different smiley styles :) :-) (^_^) ^o #smiley

Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley

JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN

Page 20: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

20

Annotate & Retrain

Proposed Solution

Page 21: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

21

Learning Curves

[Steedman, Mark, et al. "Bootstrapping statistical parsers from small datasets." EACL 2003]

Collins-CFG parser LTAG Parser

Page 22: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

22

Case Study Retraining Dependency Parsers for German

All the work done by Irina Alles

Joint supervision with Chris Biemann

4 Parsers

MaltParser

MateTools Parser

Stanford Parser

MDParser (DKFI inhouse)

Page 23: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

23

Case Study Retraining Dependency Parsers for German

All the work done by Irina Alles

Joint supervision with Chris Biemann

4 Parsers

MaltParser

MateTools Parser

Stanford Parser

MDParser (DKFI inhouse)

Page 24: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

24

Case Study Retraining Dependency Parsers for German

All the work done by Irina Alles

Joint supervision with Chris Biemann

4 Parsers

MaltParser

MateTools Parser

Stanford Parser

MDParser (DKFI inhouse)

2 annotated corpora

TIGER Treebank (50,000 sentences of German newspaper text)

TueBa-D/Z treebank (65.524 sentences of German newspaper text)

MaltEval using CoNLL evaluation format (LA – labelled accuracy)

Page 25: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

25

Training Time

Page 26: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

26

Testing Time

Page 27: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

27

Parsing Accuracy

Page 28: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

28

Parsing Accuracy – Ensemble Parser

Page 29: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

29

Annotate & Retrain

Proposed Solution

Pro

Often remarkably few training instances are necessary

Contra

Re-trainability of tools needs to be improved

Some tools behave weird when trained on data that is too different

Page 30: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

30

I am in a special situation …

Challenge

Page 31: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

31

Proposed Solution

Build special tools

Page 32: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

32

Retraining

Tool stays exactly the same – new behavior due to new training data

Customization

Customize tool when retraining is not enough

Build special tools

Proposed Solution

Enable researchers to customize tools

Page 33: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

33

Use Case – Twitter Tagger

Mixture of customized tool and retraining

Customized tokenizer

Retrained tagger

Page 34: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

34

Use Case – Twitter Tagger

Different smiley styles :) :-) (^_^) ^o #smiley

Different smiley styles :) :-) (^ ^) ^o #smiley

A A N E E E EMO EMO HASH

ADJ ADJ NN EMO EMO EMO E E #

Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley

JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN

ADJ NN NN O O O O O O O O O O O NN O NN

Page 35: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

35

Different smiley styles :) :-) (^ ^) ^o #smiley

A A N E E E EMO EMO HASH

ADJ ADJ NN EMO EMO EMO E E #

Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley

JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN

ADJ NN NN O O O O O O O O O O O NN O NN

Use Case – Twitter Tagger

Different smiley styles :) :-) (^_^) ^o #smiley

Different Tokenization

Page 36: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

36

Use Case – Twitter Tagger

Different smiley styles :) :-) (^_^) ^o #smiley

Different smiley styles :) :-) (^ ^) ^o #smiley

A A N E E E EMO EMO HASH

ADJ ADJ NN EMO EMO EMO E E #

Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley

JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN

ADJ NN NN O O O O O O O O O O O NN O NN

Specialized Tagset.

Page 37: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

37

Use Case – Twitter Tagger Retrained Model

Spending the day withhh mommma !

Spending the day withhh mommma !

V D N P N ,

Spending the day withhh mommma !

VVG DT NN NN NNS SENT

Twitter Tokenizer +

Twitter Tagger

Standard Tokenizer +

TreeTagger

Page 38: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

38

Spending the day withhh mommma !

Spending the day withhh mommma !

V D N P N ,

Spending the day withhh mommma !

VVG DT NN NN NNS SENT

Twitter Tokenizer +

Twitter Tagger

Standard Tokenizer +

TreeTagger

Different Tagsets

Use Case – Twitter Tagger Retrained Model

Page 39: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

39

Spending the day withhh mommma !

Spending the day withhh mommma !

V D N P N ,

Spending the day withhh mommma !

VVG DT NN NN NNS SENT

Twitter Tokenizer +

Twitter Tagger

Standard Tokenizer +

TreeTagger

Better model Better Tagging

Use Case – Twitter Tagger Retrained Model

Page 40: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

40

Spending the day withhh mommma !

Spending the day withhh mommma !

V D N P N ,

V ART NN PP NN PUNC

Spending the day withhh mommma !

VVG DT NN NN NNS SENT

V ART NN NN NN PUNC

Twitter Tokenizer +

Twitter Tagger

Standard Tokenizer +

TreeTagger

Mapping to generalized POS models

Use Case – Twitter Tagger Retrained Model

Page 41: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

41

Pro

Very flexible

Contra

Still too complicated for non programmers

Build special tools

Proposed Solution

Enable researchers to customize tools

Page 42: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,

42

Availability of Tools & Models

DKPro framework

Domain Adaptation / Retraining

a little annotation can get you a long way

Customization

active field of research in our Lab

interesting research project opportunities

Summary