Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität...
Transcript of Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität...
![Page 1: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/1.jpg)
Challenges and Solutions in Automatically Annotating CMC Data
Torsten Zesch Senior Researcher
UKP Lab, Technische Universität Darmstadt
Associated Researcher
German Institute for International Educational Research, Frankfurt
International workshop:
Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives
Feb 14–15, 2013
TU Dortmund
![Page 2: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/2.jpg)
2
Challenging Properties of CMC Data
Very heterogeneous (Twitter, Wikipedia, Chat, Discussion forum, …)
No tool fits all
Even access might be challenging
One’s junk is the other’s research interest :)
Many common assumptions do not hold
Non-standard language
“lmao s/o to the cool ass asian officer 4 #1 not runnin my license and #2 not takin dru boo
to jail . Thank u God . #amen”
Standard tools might not work & need to deal with spelling errors, abbreviations, and
variants
Often (relatively) small samples
Hard to build automatic models
![Page 3: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/3.jpg)
3
Working with CMC Data
Mix of manual and automatic tasks
Data
Automatic Analysis
Manual Analysis
![Page 4: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/4.jpg)
4
Wrong tool for the job?
Challenge
![Page 5: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/5.jpg)
5
Availability of Tools & Models
Challenge
![Page 6: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/6.jpg)
6
Comprehensive NLP Framework
Proposed Solution
![Page 7: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/7.jpg)
7
Apache UIMA Unstructured Information Management Architecture
Component-based architecture
Designed to scale
Flexible type system
High-density in-memory data representation
2004 – IBM alphaWorks project
still used e.g. in IBM LanguageWare
2006 – Apache Incubator project
2009 – OASIS Standard
2010 – Full Apache project
2010 – Used in IBM’s Watson
Jeopardy Challenge
http://uima.apache.org
![Page 8: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/8.jpg)
8
Pipeline Architecture
Parser
Writer
Semantic analysis
Syntactic analysis
Morphological analysis
Linguistic preprocessing
Reader
Named entity
tagger
PoS tagger
Stemmer
Lemmatizer
Compound
splitter
Tokenizer
Sentence
splitter
Stopword
tagger
a
bc
Datastore / Results
a
bc
a
bc
a
bc
a
bc
![Page 9: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/9.jpg)
9
Information Flow S
ou
rce D
ocu
men
t C
oll
ecti
on
An
no
tate
d D
ocu
men
t C
oll
ecti
on
Tokeniz
er
Sente
nce S
plit
ter
PO
S T
agger
Documents
Tokens Tokens Tokens
Sentences Sentences
POS
Documents Documents Documents
Module1 Module2 Module3
![Page 10: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/10.jpg)
10
So far, so good … but the framework is empty
![Page 11: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/11.jpg)
11
DKPro Core
Linguistic preprocessing
Tokenization, tagging, parsing, co-reference, n-gram models
Support for various file formats
XML, PDF, WSDL, Wikipedia, TEI, IMS-CWB, ANNIS, …
Already available open source
http://code.google.com/p/dkpro-core-asl/
DKPro Similarity (http://code.google.com/p/dkpro-similarity-asl/)
DKPro Spelling (http://code.google.com/p/dkpro-spelling-asl/)
Coming soon
Information Retrieval, Keyphrase Extraction, Summarization, Lexical Cohesion
DKPro Framework
![Page 12: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/12.jpg)
12
DKPro
Core
DKPro
Semantics
DKPro
Information
Retrieval
Tokenizer
Part
of Speech
Tagger
Lemmatizer
Named
Entity
Recognition
Parser I/O
Keyphrase
Extraction
Text
Segmenting
Link
Discovery
Text
Summary
Apache
Lucene
Indexing
API Query API Terrier 2
Latent
Semantic
Indexing
Terrier 3
Darmstadt Knowledge Processing
Software Repository
Relatedness Spelling
![Page 13: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/13.jpg)
13
Tokenizer
Part
of Speech
Tagger
Lemmatizer
Named
Entity
Recognition
Parser I/O
Keyphrase
Extraction
Text
Segmenting
Link
Discovery
Text
Summary
Apache
Lucene
Indexing
API Query API Terrier 2
Latent
Semantic
Indexing
Terrier 3
Darmstadt Knowledge Processing
Software Repository
Relatedness Spelling
DKPro
Core
DKPro
Semantics
DKPro
Information
Retrieval
![Page 14: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/14.jpg)
14
Multilinguality
Automatically select the correct language-
specific components
Database Writer
Named Entity Recognition
POS Tagging
Sentence splitting
Tokenization
Text Document Reader
de en
TreeTagger German
TreeTagger English
de
en
![Page 15: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/15.jpg)
15
Other Dimensions
Flexibility
Change I/O or replace tagger without breaking the pipeline
Scalability
Process large amounts of data on Hadoop cluster
Efficiency
Integrated access to large resources, e.g. web-scale language models
Robustness & Accuracy
Not influenced by the choice of the framework
But: UIMA encourages to improve existing tools instead of developing new ones
![Page 16: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/16.jpg)
16
Pro
Large collection of interoperable components
Contra
Programming knowledge needed (only very little)
Not (mainly) targeted towards CMC
Proposed Solution
Comprehensive NLP Framework
![Page 17: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/17.jpg)
17
Cool Tool – Wrong Environment
Challenge
![Page 18: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/18.jpg)
18
Domain Adaptation / Retraining
Challenge
![Page 19: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/19.jpg)
19
Examples
Different smiley styles :) :-) (^_^) ^o #smiley
Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley
JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN
![Page 20: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/20.jpg)
20
Annotate & Retrain
Proposed Solution
![Page 21: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/21.jpg)
21
Learning Curves
[Steedman, Mark, et al. "Bootstrapping statistical parsers from small datasets." EACL 2003]
Collins-CFG parser LTAG Parser
![Page 22: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/22.jpg)
22
Case Study Retraining Dependency Parsers for German
All the work done by Irina Alles
Joint supervision with Chris Biemann
4 Parsers
MaltParser
MateTools Parser
Stanford Parser
MDParser (DKFI inhouse)
![Page 23: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/23.jpg)
23
Case Study Retraining Dependency Parsers for German
All the work done by Irina Alles
Joint supervision with Chris Biemann
4 Parsers
MaltParser
MateTools Parser
Stanford Parser
MDParser (DKFI inhouse)
![Page 24: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/24.jpg)
24
Case Study Retraining Dependency Parsers for German
All the work done by Irina Alles
Joint supervision with Chris Biemann
4 Parsers
MaltParser
MateTools Parser
Stanford Parser
MDParser (DKFI inhouse)
2 annotated corpora
TIGER Treebank (50,000 sentences of German newspaper text)
TueBa-D/Z treebank (65.524 sentences of German newspaper text)
MaltEval using CoNLL evaluation format (LA – labelled accuracy)
![Page 25: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/25.jpg)
25
Training Time
![Page 26: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/26.jpg)
26
Testing Time
![Page 27: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/27.jpg)
27
Parsing Accuracy
![Page 28: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/28.jpg)
28
Parsing Accuracy – Ensemble Parser
![Page 29: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/29.jpg)
29
Annotate & Retrain
Proposed Solution
Pro
Often remarkably few training instances are necessary
Contra
Re-trainability of tools needs to be improved
Some tools behave weird when trained on data that is too different
![Page 30: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/30.jpg)
30
I am in a special situation …
Challenge
![Page 31: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/31.jpg)
31
Proposed Solution
Build special tools
![Page 32: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/32.jpg)
32
Retraining
Tool stays exactly the same – new behavior due to new training data
Customization
Customize tool when retraining is not enough
Build special tools
Proposed Solution
Enable researchers to customize tools
![Page 33: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/33.jpg)
33
Use Case – Twitter Tagger
Mixture of customized tool and retraining
Customized tokenizer
Retrained tagger
![Page 34: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/34.jpg)
34
Use Case – Twitter Tagger
Different smiley styles :) :-) (^_^) ^o #smiley
Different smiley styles :) :-) (^ ^) ^o #smiley
A A N E E E EMO EMO HASH
ADJ ADJ NN EMO EMO EMO E E #
Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley
JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN
ADJ NN NN O O O O O O O O O O O NN O NN
![Page 35: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/35.jpg)
35
Different smiley styles :) :-) (^ ^) ^o #smiley
A A N E E E EMO EMO HASH
ADJ ADJ NN EMO EMO EMO E E #
Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley
JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN
ADJ NN NN O O O O O O O O O O O NN O NN
Use Case – Twitter Tagger
Different smiley styles :) :-) (^_^) ^o #smiley
Different Tokenization
![Page 36: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/36.jpg)
36
Use Case – Twitter Tagger
Different smiley styles :) :-) (^_^) ^o #smiley
Different smiley styles :) :-) (^ ^) ^o #smiley
A A N E E E EMO EMO HASH
ADJ ADJ NN EMO EMO EMO E E #
Different smiley styles : ) : - ) ( ^ _ ^ ) ^ o # smiley
JJ NN NNS : ) : : ) ( SYM SYM SYM ) SYM NN # NN
ADJ NN NN O O O O O O O O O O O NN O NN
Specialized Tagset.
![Page 37: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/37.jpg)
37
Use Case – Twitter Tagger Retrained Model
Spending the day withhh mommma !
Spending the day withhh mommma !
V D N P N ,
Spending the day withhh mommma !
VVG DT NN NN NNS SENT
Twitter Tokenizer +
Twitter Tagger
Standard Tokenizer +
TreeTagger
![Page 38: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/38.jpg)
38
Spending the day withhh mommma !
Spending the day withhh mommma !
V D N P N ,
Spending the day withhh mommma !
VVG DT NN NN NNS SENT
Twitter Tokenizer +
Twitter Tagger
Standard Tokenizer +
TreeTagger
Different Tagsets
Use Case – Twitter Tagger Retrained Model
![Page 39: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/39.jpg)
39
Spending the day withhh mommma !
Spending the day withhh mommma !
V D N P N ,
Spending the day withhh mommma !
VVG DT NN NN NNS SENT
Twitter Tokenizer +
Twitter Tagger
Standard Tokenizer +
TreeTagger
Better model Better Tagging
Use Case – Twitter Tagger Retrained Model
![Page 40: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/40.jpg)
40
Spending the day withhh mommma !
Spending the day withhh mommma !
V D N P N ,
V ART NN PP NN PUNC
Spending the day withhh mommma !
VVG DT NN NN NNS SENT
V ART NN NN NN PUNC
Twitter Tokenizer +
Twitter Tagger
Standard Tokenizer +
TreeTagger
Mapping to generalized POS models
Use Case – Twitter Tagger Retrained Model
![Page 41: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/41.jpg)
41
Pro
Very flexible
Contra
Still too complicated for non programmers
Build special tools
Proposed Solution
Enable researchers to customize tools
![Page 42: Challenges and Solutions in Automatically Annotating CMC Data · UKP Lab, Technische Universität Darmstadt ... Standard tools might not work & need to deal with spelling errors,](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fc27693e5ab9b3756181a9d/html5/thumbnails/42.jpg)
42
Availability of Tools & Models
DKPro framework
Domain Adaptation / Retraining
a little annotation can get you a long way
Customization
active field of research in our Lab
interesting research project opportunities
Summary