Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Leon Derczynski, Alan Ritter, Sam Clark, Kalina Bontcheva
Streaming social media is powerful
● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy
● Sample of all human discourse - unprecedented
● Not only where people are & when, but also what they are doing
● Interesting stuff - just ask the NSA!
Tweets are dirty
● You all know what Twitter is, so let's just look at some difficult tweets
● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx
● Fragments: Bonfire tonite. All are welcome, joe included
● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :)
● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
Tough tweets: Do we even care?
● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When Less is More - http://ow.ly/2FwyX
● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying ..
● The tweets we find most difficult are those that seem to say the least
● So im in tha chi whts popping tonight?
● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
We do care
● However, there is utility in trivia:
– Sadilek: predict whether you will get flu, using spatial co-location and friend network
– Sugumaran (U. Northern Iowa): crow corpse reports precede West Nile Virus
– Emerging events: tendency to describe briefly
"There's a dead crow in my garden"
@mari: i think im sick ugh..
Problem representation
● Split tweets into finite tokens (PTB + URLs, smileys)
● Put tokens into categories, depending on linguistic function
● Discriminative – cases one by one
– e.g. unigram tagger
● Sequence labelling
– order matters!
– consider neighbouring labels
● Goal: label the whole sequence correctly
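A minimal sketch of the contrast, with toy data and tags that are illustrative rather than the paper's: a unigram tagger labels each case one by one, so an ambiguous word always receives its majority tag regardless of context.

    from collections import Counter, defaultdict

    # Toy unigram tagger: each token gets its most frequent training
    # tag; context is ignored entirely (illustrative data only).
    train = [
        [("good", "JJ"), ("books", "NNS")],
        [("old", "JJ"), ("books", "NNS")],
        [("she", "PRP"), ("books", "VBZ"), ("flights", "NNS")],
    ]

    counts = defaultdict(Counter)
    for sentence in train:
        for word, tag in sentence:
            counts[word][tag] += 1

    def unigram_tag(tokens, default="NN"):
        return [(t, counts[t].most_common(1)[0][0] if t in counts else default)
                for t in tokens]

    # "books" comes out NNS even when used as a verb; a sequence
    # labeller that scores neighbouring labels can recover VBZ.
    print(unigram_tag(["she", "books", "flights"]))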
Word order still matters... just
● Hard for tweets: exclamations and fragments
● Whole sequences are a bit rare
● @FeeninforPretty making something to eat, aint ate all day
● Peace green tea time!! Happyzone!!!! :)))))
● Sentence structure cues (e.g. caps) are often:
– absent
– over-used
How do current tools do?
● Badly!
– Out of the box
– Even when trained on Twitter, IRC and WSJ data
Where do they break?
● Continued work extending Stanford Tagger
● Terrible at doing whole sentences
– Best was 10% accuracy
– SotA on newswire about 55-60%
● Problems on unknown words – a good target set for improving performance
– 1 in 5 words completely unseen
– 27% token accuracy on this group
What errors occur on unknowns?
● Gold standard errors (dank_UH je_UH → _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)
Do we have enough data?
● No, it's even worse than normal
– Ritter: 15K tokens, PTB, one annotator
– Foster: 14K tokens, PTB, low-noise
– CMU: 39K tokens, custom, narrow tagset
Tweet PoS-tagging issues
● From analysis, three big issues identified:
1. Many unseen words / orthographies
2. Uncertain sentence structure
3. Not enough annotated data
● Continued with Ritter dataset
Unseen words in tweets
● Two classes:
● Standard token, non-standard orthography:
– freinds
– KHAAAANNNNNNN!
● Non-standard token, standard orthography:
– omg + bieber = omb
– Huntington
Unseen words in tweets
● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto
– vids → videos
– cussin → cursing
– hella → very
● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
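A minimal sketch of the lookup, assuming a simple token-for-token substitution; the three entries are the slide's examples, not the full 361-entry resource.

    # Gazetteer-based normalisation: replace known non-standard forms
    # with their standard spelling, leave everything else untouched.
    GAZETTEER = {
        "vids": "videos",
        "cussin": "cursing",
        "hella": "very",
    }

    def normalise(tokens):
        return [GAZETTEER.get(tok.lower(), tok) for tok in tokens]

    print(normalise(["hella", "tired", "of", "these", "vids"]))
    # -> ['very', 'tired', 'of', 'these', 'videos']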
Unseen words in tweets
● The rest can be handled reasonably with word shape and contextual features
● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare
● Features include:– word prefix and suffix shapes
– distribution of shape in corpus
– shapes of neighbouring words
● Corpus is small, so adjust the rare-word threshold
● +5.35% absolute token accuracy, +18.5% sentence accuracy
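A sketch in the spirit of those features; the actual ExtractorFramesRare feature set differs, and the names below are made up for illustration.

    import re

    # Coarse word shape: X = uppercase, x = lowercase, d = digit.
    def word_shape(token):
        shape = re.sub(r"[A-Z]", "X", token)
        shape = re.sub(r"[a-z]", "x", shape)
        return re.sub(r"[0-9]", "d", shape)

    # Affix and shape features for the token at position i, plus the
    # shapes of its neighbours (sentence boundaries get dummy shapes).
    def rare_word_features(tokens, i):
        tok = tokens[i]
        return {
            "prefix3": tok[:3],
            "suffix3": tok[-3:],
            "shape": word_shape(tok),
            "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
            "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
        }

    print(rare_word_features(["KHAAAANNNNNNN!", "2night"], 1))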
Tweet “sentence” “structure”
● They are structured (sometimes)
● We still do better if we look at global features
– Unigram tagger accuracy: 66%
● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%
Tweet “sentence” “structure”
● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens
● Knowing P(cᵢ) constrains both P(cᵢ₋₁ | cᵢ) and P(cᵢ₊₁ | cᵢ)
Tweet “sentence” “structure”
● This allows us to prune the transition graph of labels in the sequence
● Because the graph is read in both directions, fixing any label point impacts the whole tweet
● Setting label priors reduces token error by 5.03%
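A sketch of the pruning idea, assuming Ritter-style tags for the pre-taggable tokens (URL, USR, HT); the patterns and the abridged tagset are illustrative.

    import re

    # Constrained-form tokens get exactly one allowed label, pruning
    # the lattice; all other tokens keep the full tagset.
    FIXED = [
        (re.compile(r"^https?://"), "URL"),   # links
        (re.compile(r"^@\w+$"), "USR"),       # user mentions
        (re.compile(r"^#\w+$"), "HT"),        # hashtags
    ]
    FULL_TAGSET = {"NN", "NNP", "VB", "UH", "URL", "USR", "HT"}  # abridged

    def allowed_tags(token):
        for pattern, tag in FIXED:
            if pattern.match(token):
                return {tag}       # lattice pruned to a single state
        return FULL_TAGSET

    # A decoder reading the graph in both directions then only scores
    # transitions into and out of these singleton states, which
    # constrains the neighbouring labels too.
    print([allowed_tags(t) for t in ["@Huddy85", "lol", "http://ow.ly/2FwyX"]])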
Not enough data
● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case
● Problem: initial accuracy is too low • ︵ •_UH
● Solution: consensus with > 1 tagger ◕ ◡ ◕_UH
● Problem: only one tagger uses PTB tags ⋋〴_⋌〵_UH
● Solution: vote-constrained bootstrapping ⊙ ʘ_UH
Vote-constrained bootstrapping
● Not many taggers available for building semi-supervised data
● We chose Ritter's plus the CMU tagger
● Where classes don't map 1:1:
● Create equivalence classes between tags
– CMU tag R (adverb) → PTB {WRB, RB, RBR, RBS}
– CMU tag ! (interjection) → PTB {UH}
● Coarser tag constrains the set of fine-grained tags
Vote-constrained bootstrapping
● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet
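A self-contained sketch of the agreement check; the equivalence classes are abridged, and the CMU tag glosses (^ = proper noun, N = common noun) are assumptions about the Gimpel/CMU tagset rather than the paper's full mapping.

    # Coarse CMU tag -> set of compatible fine-grained PTB tags.
    CMU_TO_PTB = {
        "^": {"NNP", "NNPS"},
        "N": {"NN", "NNS"},
        "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
        "R": {"WRB", "RB", "RBR", "RBS"},
        "!": {"UH"},
    }

    def keep_tweet(cmu_tagged, ptb_tagged):
        # Admit a tweet to the bootstrapped data only if the two
        # taggers' labels are compatible on every token.
        return all(ptb in CMU_TO_PTB.get(cmu, set())
                   for (_, cmu), (_, ptb) in zip(cmu_tagged, ptb_tagged))

    print(keep_tweet([("Lebron", "^")], [("Lebron", "NNP")]))  # True: keep
    print(keep_tweet([("books", "N")], [("books", "VBZ")]))    # False: reject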
● Evaluated quality on the development set
– Agreed on 17.8% of tweets
– Of those, 97.4% of tokens correctly PTB labelled
– 71.3% whole tweets correctly labelled
Vote-constrained bootstrapping
● Results:
– Used Trendminer language ID + data
– Collected 1.5M agreed-upon tokens
● Adding this bootstrapped data reduced error by:
– Token-level: 13.7%
– Sentence-level: 4.5%
www.trendminer-project.eu
Final results
● Unknown-word accuracy: from 27.8% to 74.5%
                         Token   Sentence
Baseline: Ritter T-Pos   84.55      9.32
GATE: eval set           88.69     20.34
– error reduction        26.80     12.15
GATE: dev set            90.54     28.81
– error reduction        38.77     21.49
Where do we go next?
● Local tag sequence bounds?
● Better handling of hashtags
– I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy
– I'm so #bored today
● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated
● Parsing
Downloadable & Friendly
● As command-line tool; as GATE PR; as Stanford Tagger model
● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available
● Updates since submission:
– Better handling of contractions
– Less sensitive to tokenisation scheme
● Please play!
Thank you for your time!
There is hope:
Jersey Shore is overrated. studying and history homework then a fat night of sleep!
Do you have any questions?
Owoputi et al.
● NAACL'13 paper: 90.5% token accuracy w/ PTB tags
● An advancement of the Gimpel tagger, used for our bootstrapping
● Late discovery: can be adapted to the PTB tagset with good results
● We use techniques disjoint from Owoputi's; combining them could give an even better result!
● Our model is readily re-usable and integrated into existing NLP toolsets
Capitalisation
● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml
● Lowercase model with lowercased data allows us to ignore capitalisation noise
● Tried multiple approaches to classifying noisy vs. well-formed capitalisation
● The gain from ignoring case in noisy tweets is offset by the loss from mis-classified well-cased data