Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Leon Derczynski, Alan Ritter, Sam Clark, Kalina Bontcheva
Streaming social media is powerful
● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy
● Sample of all human discourse - unprecedented
● Not only where people are & when, but also what they are doing
● Interesting stuff - just ask the NSA!
Tweets are dirty
● You all know what Twitter is, so let's just look at some difficult tweets
● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx
● Fragments: Bonfire tonite. All are welcome, joe included
● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :)
● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
Tough tweets: Do we even care?
● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When Less is More - http://ow.ly/2FwyX
● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying ..
● The tweets we find most difficult are those that seem to say the least
● So im in tha chi whts popping tonight?
● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
We do care
● However, there is utility in trivia:
– Sadilek: predict whether you will get flu, using spatial co-location and friend network
– Sugumaran (U. Northern Iowa): crow corpse reports precede West Nile Virus
– Emerging events: tendency to describe briefly
"There's a dead crow in my garden"
@mari: i think im sick ugh..
Problem representation
● Split tweets into finite tokens (PTB + URLs, smileys)
● Put tokens into categories, depending on linguistic function
● Discriminative – cases one by one
– e.g. unigram tagger
● Sequence labelling
– order matters!
– consider neighbouring labels
● Goal: label the whole sequence correctly
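A minimal sketch of the contrast, with toy data and tags that are illustrative rather than the paper's: a unigram tagger labels each case one by one, so an ambiguous word always receives its majority tag regardless of context.

    from collections import Counter, defaultdict

    # Toy unigram tagger: each token gets its most frequent training
    # tag; context is ignored entirely (illustrative data only).
    train = [
        [("good", "JJ"), ("books", "NNS")],
        [("old", "JJ"), ("books", "NNS")],
        [("she", "PRP"), ("books", "VBZ"), ("flights", "NNS")],
    ]

    counts = defaultdict(Counter)
    for sentence in train:
        for word, tag in sentence:
            counts[word][tag] += 1

    def unigram_tag(tokens, default="NN"):
        return [(t, counts[t].most_common(1)[0][0] if t in counts else default)
                for t in tokens]

    # "books" comes out NNS even when used as a verb; a sequence
    # labeller that scores neighbouring labels can recover VBZ.
    print(unigram_tag(["she", "books", "flights"]))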
Word order still matters... just
● Hard for tweets: exclamations and fragments
● Whole sequences are a bit rare
● @FeeninforPretty making something to eat, aint ate all day
● Peace green tea time!! Happyzone!!!! :)))))
● Sentence structure cues (e.g. caps) are often:
– absent
– over-used
How do current tools do?
● Badly!
– Out of the box
– Even when trained on Twitter, IRC and WSJ data
Where do they break?
● Continued work extending Stanford Tagger
● Terrible at doing whole sentences
– Best was 10% accuracy
– SotA on newswire about 55-60%
● Problems on unknown words – a good target set for improving performance
– 1 in 5 words completely unseen
– 27% token accuracy on this group
What errors occur on unknowns?
● Gold standard errors (dank_UH je_UH → _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)
Do we have enough data?
● No, it's even worse than normal
– Ritter: 15K tokens, PTB, one annotator
– Foster: 14K tokens, PTB, low-noise
– CMU: 39K tokens, custom, narrow tagset
Tweet PoS-tagging issues
● From analysis, three big issues identified:
1. Many unseen words / orthographies
2. Uncertain sentence structure
3. Not enough annotated data
● Continued with Ritter dataset
Unseen words in tweets
● Two classes:
● Standard token, non-standard orthography:
– freinds
– KHAAAANNNNNNN!
● Non-standard token, standard orthography:
– omg + bieber = omb
– Huntington
Unseen words in tweets
● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto
– vids → videos
– cussin → cursing
– hella → very
● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
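A minimal sketch of the lookup, assuming a simple token-for-token substitution; the three entries are the slide's examples, not the full 361-entry resource.

    # Gazetteer-based normalisation: replace known non-standard forms
    # with their standard spelling, leave everything else untouched.
    GAZETTEER = {
        "vids": "videos",
        "cussin": "cursing",
        "hella": "very",
    }

    def normalise(tokens):
        return [GAZETTEER.get(tok.lower(), tok) for tok in tokens]

    print(normalise(["hella", "tired", "of", "these", "vids"]))
    # -> ['very', 'tired', 'of', 'these', 'videos']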
Unseen words in tweets
● The rest can be handled reasonably with word shape and contextual features
● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare
● Features include:– word prefix and suffix shapes
– distribution of shape in corpus
– shapes of neighbouring words
● Corpus is small, so adjust the rare-word threshold
● +5.35% absolute token accuracy, +18.5% sentence accuracy
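A sketch in the spirit of those features; the actual ExtractorFramesRare feature set differs, and the names below are made up for illustration.

    import re

    # Coarse word shape: X = uppercase, x = lowercase, d = digit.
    def word_shape(token):
        shape = re.sub(r"[A-Z]", "X", token)
        shape = re.sub(r"[a-z]", "x", shape)
        return re.sub(r"[0-9]", "d", shape)

    # Affix and shape features for the token at position i, plus the
    # shapes of its neighbours (sentence boundaries get dummy shapes).
    def rare_word_features(tokens, i):
        tok = tokens[i]
        return {
            "prefix3": tok[:3],
            "suffix3": tok[-3:],
            "shape": word_shape(tok),
            "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
            "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
        }

    print(rare_word_features(["KHAAAANNNNNNN!", "2night"], 1))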
Tweet “sentence” “structure”
● They are structured (sometimes)
● We still do better if we look at global features
– Unigram tagger accuracy: 66%
● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%
Tweet “sentence” “structure”
● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens
● Knowing P(cᵢ) constrains both P(cᵢ₋₁ | cᵢ) and P(cᵢ₊₁ | cᵢ)
Tweet “sentence” “structure”
● This allows us to prune the transition graph of labels in the sequence
● Because the graph is read in both directions, fixing any label point impacts the whole tweet
● Setting label priors reduces token error by 5.03%
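A sketch of the pruning idea, assuming Ritter-style tags for the pre-taggable tokens (URL, USR, HT); the patterns and the abridged tagset are illustrative.

    import re

    # Constrained-form tokens get exactly one allowed label, pruning
    # the lattice; all other tokens keep the full tagset.
    FIXED = [
        (re.compile(r"^https?://"), "URL"),   # links
        (re.compile(r"^@\w+$"), "USR"),       # user mentions
        (re.compile(r"^#\w+$"), "HT"),        # hashtags
    ]
    FULL_TAGSET = {"NN", "NNP", "VB", "UH", "URL", "USR", "HT"}  # abridged

    def allowed_tags(token):
        for pattern, tag in FIXED:
            if pattern.match(token):
                return {tag}       # lattice pruned to a single state
        return FULL_TAGSET

    # A decoder reading the graph in both directions then only scores
    # transitions into and out of these singleton states, which
    # constrains the neighbouring labels too.
    print([allowed_tags(t) for t in ["@Huddy85", "lol", "http://ow.ly/2FwyX"]])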
Not enough data
● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case
● Problem: initial accuracy is too low • ︵ •_UH
● Solution: consensus with > 1 tagger ◕ ◡ ◕_UH
● Problem: only one tagger uses PTB tags ⋋〴_⋌〵_UH
● Solution: vote-constrained bootstrapping ⊙ ʘ_UH
Vote-constrained bootstrapping
● Not many taggers available for building semi-supervised data
● We chose Ritter's plus the CMU tagger
● Where classes don't map 1:1:
● Create equivalence classes between tags
– CMU tag R (adverb) → PTB {WRB, RB, RBR, RBS}
– CMU tag ! (interjection) → PTB {UH}
● Coarser tag constrains the set of fine-grained tags
Vote-constrained bootstrapping
● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet
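A self-contained sketch of the agreement check; the equivalence classes are abridged, and the CMU tag glosses (^ = proper noun, N = common noun) are assumptions about the Gimpel/CMU tagset rather than the paper's full mapping.

    # Coarse CMU tag -> set of compatible fine-grained PTB tags.
    CMU_TO_PTB = {
        "^": {"NNP", "NNPS"},
        "N": {"NN", "NNS"},
        "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
        "R": {"WRB", "RB", "RBR", "RBS"},
        "!": {"UH"},
    }

    def keep_tweet(cmu_tagged, ptb_tagged):
        # Admit a tweet to the bootstrapped data only if the two
        # taggers' labels are compatible on every token.
        return all(ptb in CMU_TO_PTB.get(cmu, set())
                   for (_, cmu), (_, ptb) in zip(cmu_tagged, ptb_tagged))

    print(keep_tweet([("Lebron", "^")], [("Lebron", "NNP")]))  # True: keep
    print(keep_tweet([("books", "N")], [("books", "VBZ")]))    # False: reject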
● Evaluated quality on the development set
– Agreed on 17.8% of tweets
– Of those, 97.4% of tokens correctly PTB labelled
– 71.3% whole tweets correctly labelled
Vote-constrained bootstrapping
● Results:
– Used Trendminer language ID + data
– Collected 1.5M agreed-upon tokens
● Adding this bootstrapped data reduced error by:
– Token-level: 13.7%
– Sentence-level: 4.5%
www.trendminer-project.eu
Final results
● Unknown-word accuracy: from 27.8% to 74.5%
                         Token   Sentence
Baseline: Ritter T-Pos   84.55      9.32
GATE: eval set           88.69     20.34
– error reduction        26.80     12.15
GATE: dev set            90.54     28.81
– error reduction        38.77     21.49
Where do we go next?
● Local tag sequence bounds?
● Better handling of hashtags
– I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy
– I'm so #bored today
● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated
● Parsing
Downloadable & Friendly
● As command-line tool; as GATE PR; as Stanford Tagger model
● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available
● Updates since submission:
– Better handling of contractions
– Less sensitive to tokenisation scheme
● Please play!
Thank you for your time!
There is hope:
Jersey Shore is overrated. studying and history homework then a fat night of sleep!
Do you have any questions?
Owoputi et al.
● NAACL'13 paper: 90.5% token accuracy w/ PTB tags
● An advancement of the Gimpel tagger, used for our bootstrapping
● Late discovery: can be adapted to the PTB tagset with good results
● We use techniques disjoint from Owoputi's; combining them could give an even better result!
● Our model is readily re-usable and integrated into existing NLP toolsets
Capitalisation
● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml
● Lowercase model with lowercased data allows us to ignore capitalisation noise
● Tried multiple approaches to classifying noisy vs. well-formed capitalisation
● The gain from ignoring case in noisy tweets is offset by the loss from mis-classified well-cased data