Wei Xu ◦ socialmedia-class.org

Social Media & Text Analysis lecture 5 - Paraphrase Identification and Logistic Regression

CSE 5539-0010, Ohio State University. Instructor: Wei Xu

Website: socialmedia-class.org

Wei Xu ◦ socialmedia-class.org

In-class Presentation

• pick your topic and sign up

• a 10-minute presentation (20 points)

- a Social Media Platform

- or an NLP Researcher

Wei Xu ◦ socialmedia-class.org

socialmedia-class.org

Reading #6 & Quiz #3

Wei Xu ◦ socialmedia-class.org

Mini Research Proposal

• propose/explore NLP problems in the GitHub dataset

https://github.com

Wei Xu ◦ socialmedia-class.org

Mini Research Proposal

• propose/explore NLP problems in the GitHub dataset

• GitHub:

- a social network for programmers (sharing, collaboration, bug tracking, etc.)

- hosting Git repositories (a version control system that tracks changes to files, usually source code)

- containing potentially interesting text fields in natural language (comments, issues, etc.)

(Recap)

What is a paraphrase?

"sentences or phrases that convey approximately the same meaning using different words" — (Bhagat & Hovy, 2012)

word: wealthy ↔ rich

phrase: the king's speech ↔ His Majesty's address

sentence: "… the forced resignation of the CEO of Boeing, Harry Stonecipher, for …" ↔ "… after Boeing Co. Chief Executive Harry Stonecipher was ousted from …"

The Ideal

(Recap)

Paraphrase Research

[Timeline figure: paraphrase data sources over time, with the researchers who worked on them: WordNet (80s), Web ('01), Novels ('01), News ('04), Bi-Text ('05, '13), Video ('11), Style ('12*), Twitter ('13*, '14*), Simple ('15*, '16*); researchers include Xu, Ritter, Callison-Burch, Dolan, Ji, Grishman, Cherry, Napoles; * = instructor's research]

Wei Xu. "Data-driven Approaches for Paraphrasing Across Language Variations." PhD Thesis (2014)

DIRT (Discovery of Inference Rules from Text)

Dekang Lin and Patrick Pantel. "DIRT - Discovery of Inference Rules from Text." In KDD (2001)

Lin and Pantel (2001) operationalize the Distributional Hypothesis, using dependency relationships to define similar environments.

Duty and responsibility share a similar set of dependency contexts in large volumes of text:

- modified by adjectives: additional, administrative, assigned, assumed, collective, congressional, constitutional, ...

- objects of verbs: assert, assign, assume, attend to, avoid, become, breach, ...
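Since the slide only sketches the intuition, here is a minimal, hypothetical Python illustration of the Distributional Hypothesis over dependency contexts: two words are scored by how many contexts they share (Jaccard overlap). The toy contexts and the scoring are simplifications; DIRT itself uses mutual-information-weighted dependency paths mined from a large parsed corpus.

```python
# Minimal sketch (not Lin & Pantel's actual algorithm): score two words by the
# overlap of the dependency contexts they occur in.

def context_similarity(contexts_a, contexts_b):
    """Jaccard overlap between two sets of dependency contexts."""
    a, b = set(contexts_a), set(contexts_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy contexts, loosely following the slide's "duty"/"responsibility" example.
duty = {("modified_by", "additional"), ("modified_by", "assigned"),
        ("object_of", "assume"), ("object_of", "assert")}
responsibility = {("modified_by", "additional"), ("modified_by", "collective"),
                  ("object_of", "assume"), ("object_of", "avoid")}

print(context_similarity(duty, responsibility))  # larger overlap -> more similar meaning
```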

Bilingual Pivoting

[Word-alignment figure: the German phrase "festgenommen" is aligned to "thrown into jail" in one sentence pair ("... fünf Landwirte festgenommen , weil ..." / "... 5 farmers were thrown into jail in Ireland ...") and to "imprisoned" in another ("... oder wurden festgenommen , gefoltert ..." / "... or have been imprisoned , tortured ..."); pivoting through the shared foreign phrase yields the English paraphrase pair "thrown into jail" ↔ "imprisoned"]

Source: Chris Callison-Burch

word alignment
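To make the pivoting step concrete, here is a hedged Python sketch of the paraphrase probability used in bilingual pivoting, p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1) (Bannard & Callison-Burch, 2005). The toy translation probabilities below are invented purely for illustration.

```python
# Sketch of bilingual pivoting: paraphrase probability by marginalizing over
# shared foreign translations, p(e2 | e1) = sum_f p(e2 | f) * p(f | e1).
# The probability tables are made up for illustration.

# p(f | e1): English phrase -> {foreign phrase: probability}
f_given_e = {
    "thrown into jail": {"festgenommen": 0.8, "ins gefaengnis geworfen": 0.2},
}

# p(e2 | f): foreign phrase -> {English phrase: probability}
e_given_f = {
    "festgenommen": {"imprisoned": 0.4, "arrested": 0.3, "thrown into jail": 0.3},
    "ins gefaengnis geworfen": {"thrown into jail": 0.7, "imprisoned": 0.3},
}

def paraphrase_prob(e1, e2):
    """p(e2 | e1), pivoting through every foreign phrase e1 translates to."""
    return sum(p_f * e_given_f.get(f, {}).get(e2, 0.0)
               for f, p_f in f_given_e.get(e1, {}).items())

print(paraphrase_prob("thrown into jail", "imprisoned"))  # 0.8*0.4 + 0.2*0.3 = 0.38
```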

Quiz #2

Key Limitations of PPDB?

Quiz #2

bug

[Figure: paraphrases of "bug" cluster into distinct word senses: insect, beetle, pest, mosquito, fly; squealer, snitch, rat, mole; microphone, tracker, mic, wire, earpiece, cookie; glitch, error, malfunction, fault, failure; bother, annoy, pester; microbe, virus, bacterium, germ, parasite]

Source: Chris Callison-Burch

word sense

Another Key Limitation

[Same timeline figure of paraphrase corpora as above: WordNet (80s), Web ('01), Novels ('01), News ('04), Bi-Text ('05, '13), Video ('11), Style ('12*), Twitter ('13*, '14*), Simple ('15*, '16*); * = instructor's research]

Wei Xu. "Data-driven Approaches for Paraphrasing Across Language Variations." PhD Thesis (2014)

only paraphrases, no non-paraphrases

Paraphrase Identification

obtain sentential paraphrases automatically

"Mancini has been sacked by Manchester City" ↔ "Mancini gets the boot from Man City" → Yes ✓

"WORLD OF JENKS IS ON AT 11" ↔ "World of Jenks is my favorite show on tv" → No ✗

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. "Extracting Lexically Divergent Paraphrases from Twitter." In TACL (2014)

(meaningful) non-paraphrases are needed to train classifiers!

Also Non-Paraphrases

[Same timeline figure of paraphrase corpora as above; * = instructor's research]

Wei Xu. "Data-driven Approaches for Paraphrasing Across Language Variations." PhD Thesis (2014)

(meaningful) non-paraphrases are needed to train classifiers!

News Paraphrase Corpus

Microsoft Research Paraphrase Corpus

(Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)

also contains some non-paraphrases

Twitter Paraphrase Corpus

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

also contains a lot of non-paraphrases

Wei Xu ◦ socialmedia-class.org

Paraphrase Identification:
A Binary Classification Problem

• Input:

- a sentence pair x

- a fixed set of binary classes Y = {0, 1}

• Output:

- a predicted class y ∈ Y
  (y = 0: negative, non-paraphrase; y = 1: positive, paraphrase)

Wei Xu ◦ socialmedia-class.org

Classification Method:
Supervised Machine Learning

• Input:

- a sentence pair x (represented by features)

- a fixed set of binary classes Y = {0, 1}

- a training set of m hand-labeled sentence pairs (x(1), y(1)), …, (x(m), y(m))

• Output:

- a learned classifier 𝜸: x → y ∈ Y (y = 0 or y = 1)
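To make the setup concrete, here is a minimal sketch of what such a training set might look like in code. The two sentence pairs are the examples shown elsewhere in this lecture; the tuple layout is an assumption for illustration, not a prescribed format.

```python
# A hand-labeled training set of m sentence pairs: each example is
# ((sentence_1, sentence_2), y) with y = 1 for paraphrase, 0 for non-paraphrase.
training_set = [
    (("Mancini has been sacked by Manchester City",
      "Mancini gets the boot from Man City"), 1),
    (("WORLD OF JENKS IS ON AT 11",
      "World of Jenks is my favorite show on tv"), 0),
]

# A learned classifier is then simply a function gamma: x -> {0, 1}.
```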

Wei Xu ◦ socialmedia-class.org

(Recap Week #3) Classification Method: Supervised Machine Learning

• Naïve Bayes
• Logistic Regression
• Support Vector Machines (SVM)
• …

Wei Xu ◦ socialmedia-class.org

(Recap Week#3)

Naïve Bayes

• Cons: features $t_i$ are assumed independent given the class y

$$P(t_1, t_2, \ldots, t_n \mid y) = P(t_1 \mid y) \cdot P(t_2 \mid y) \cdots P(t_n \mid y)$$

• This will cause problems:

- correlated features → double-counted evidence

- while parameters are estimated independently

- hurts the classifier's accuracy

Wei Xu ◦ socialmedia-class.org

Classification Method: Supervised Machine Learning

• Naïve Bayes
• Logistic Regression
• Support Vector Machines (SVM)
• …

Logistic Regression

• One of the most useful supervised machine learning algorithms for classification!

• Generally high performance on a wide range of problems.

• Much more robust than Naïve Bayes (better performance across various datasets).

Let’s start with something simpler!

Before Logistic Regression

Wei Xu ◦ socialmedia-class.org

Paraphrase Identification:

Simplified Features

• We use only one feature:

- the number of words the two sentences share in common
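A minimal sketch of this single feature; lowercasing and whitespace tokenization are assumptions here, since the slide does not specify a tokenizer.

```python
def words_in_common(sent1, sent2):
    """Count the distinct words the two sentences share."""
    return len(set(sent1.lower().split()) & set(sent2.lower().split()))

print(words_in_common("Mancini has been sacked by Manchester City",
                      "Mancini gets the boot from Man City"))  # -> 2 ("mancini", "city")
```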

Wei Xu ◦ socialmedia-class.org

A closely related problem to Paraphrase Identification:

Semantic Textual Similarity

• How similar (close in meaning) are two sentences?

5: completely equivalent in meaning

4: mostly equivalent, but some unimportant details differ

3: roughly equivalent, some important information differs/missing

2: not equivalent, but share some details

1: not equivalent, but are on the same topic

0: completely dissimilar

Wei Xu ◦ socialmedia-class.org

A Simpler Model:
Linear Regression

[Scatter plot: Sentence Similarity (rated by human), 0 to 5, vs. #words in common (feature), 0 to 20]

Wei Xu ◦ socialmedia-class.org

A Simpler Model:
Linear Regression

• also supervised learning (learn from annotated data)

• but for Regression: predict real-valued output (Classification: predict discrete-valued output)

[Scatter plot: Sentence Similarity (0 to 5) vs. #words in common (0 to 20); applying a threshold to the predicted similarity turns regression into classification: scores above the threshold are labeled paraphrase, scores below non-paraphrase]
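A minimal sketch of this thresholding idea. The parameter values and the threshold below are arbitrary illustrations, not values from the lecture.

```python
def predict_similarity(n_words_in_common, theta0=0.0, theta1=0.25):
    """Linear regression hypothesis h_theta(x) = theta0 + theta1 * x.
    The theta values here are made-up illustrations."""
    return theta0 + theta1 * n_words_in_common

def classify(n_words_in_common, threshold=2.5):
    """1 = paraphrase, 0 = non-paraphrase, by thresholding the predicted similarity."""
    return 1 if predict_similarity(n_words_in_common) >= threshold else 0

print(classify(18), classify(4))  # -> 1 0
```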

Wei Xu ◦ socialmedia-class.org

Training Set

- m hand-labeled sentence pairs (x(1), y(1)), …, (x(m), y(m))
- x's: "input" variable / features
- y's: "output"/"target" variable

#words in common (x)    Sentence Similarity (y)
         1                       0
         4                       1
        13                       4
        18                       5
         …                       …

Wei Xu ◦ socialmedia-class.org Source: NLTK Book

(Recap Week #3)

Supervised Machine Learning

[Flow diagram: training set → learning algorithm → hypothesis h; input x (#words in common) → h → estimated output y (Sentence Similarity)]

Wei Xu ◦ socialmedia-class.org

Linear Regression:
Model Representation

[Diagram: training set → learning algorithm → hypothesis h; x → h → estimated y]

• How to represent h?

$$h_\theta(x) = \theta_0 + \theta_1 x \qquad \text{(Linear Regression w/ one variable)}$$

Wei Xu ◦ socialmedia-class.org

Linear Regression w/ one variable:
Model Representation

- m hand-labeled sentence pairs (x(1), y(1)), …, (x(m), y(m))
- θ's: parameters

#words in common (x)    Sentence Similarity (y)
         1                       0
         4                       1
        13                       4
        18                       5
         …                       …

$$h_\theta(x) = \theta_0 + \theta_1 x$$

How to choose θ?

Source: many of the following slides are adapted from Andrew Ng

Wei Xu ◦ socialmedia-class.org

Linear Regression w/ one variable:
Model Representation

[Figure: example hypotheses $h_\theta(x)$ for different choices of $(\theta_0, \theta_1)$, plotted on small 0 to 3 axes]

Wei Xu ◦ socialmedia-class.org

Linear Regression w/ one variable:
Cost Function

• Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to y for the training examples (x, y)

[Scatter plot: Sentence Similarity (rated by human), 0 to 5, vs. #words in common (feature), 0 to 20]

squared error function:

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Goal:

$$\underset{\theta_0,\,\theta_1}{\text{minimize}}\; J(\theta_0,\theta_1)$$

Wei Xu ◦ socialmedia-class.org
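As a concrete check of the definitions above, here is a short sketch that evaluates the squared-error cost on the (#words in common, similarity) training table shown earlier in the lecture; the parameter values passed in at the end are arbitrary illustrations.

```python
# Squared-error cost J(theta0, theta1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2,
# computed on the (#words in common, similarity) training table from the slides.

xs = [1, 4, 13, 18]   # #words in common
ys = [0, 1, 4, 5]     # human-rated sentence similarity

def h(x, theta0, theta1):
    """Hypothesis: a line with intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

def J(theta0, theta1):
    m = len(xs)
    return sum((h(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(0.0, 0.3))  # cost for one arbitrary parameter setting
```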

Linear Regression

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

• Goal: $\underset{\theta_0,\,\theta_1}{\text{minimize}}\; J(\theta_0,\theta_1)$

Wei Xu ◦ socialmedia-class.org

Linear Regression: Simplified

Original:

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

• Goal: $\underset{\theta_0,\,\theta_1}{\text{minimize}}\; J(\theta_0,\theta_1)$

Simplified (set $\theta_0 = 0$):

• Hypothesis: $h_\theta(x) = \theta_1 x$

• Parameters: $\theta_1$

• Cost Function: $J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

• Goal: $\underset{\theta_1}{\text{minimize}}\; J(\theta_1)$

Wei Xu ◦ socialmedia-class.org

[Left plot: $h_\theta(x)$ vs. x for fixed $\theta_1$, with training points (1,1), (2,2), (3,3); right plot: $J(\theta_1)$ as a function of the parameter $\theta_1$]

For $\theta_1 = 1$:

$$J(1) = \frac{1}{2 \times 3}\left[(1-1)^2 + (2-2)^2 + (3-3)^2\right] = 0$$

Wei Xu ◦ socialmedia-class.org

[Left plot: $h_\theta(x)$ vs. x for fixed $\theta_1 = 0.5$; right plot: $J(\theta_1)$ as a function of the parameter $\theta_1$]

For $\theta_1 = 0.5$:

$$J(0.5) = \frac{1}{2 \times 3}\left[(0.5-1)^2 + (1-2)^2 + (1.5-3)^2\right] \approx 0.58$$

Wei Xu ◦ socialmedia-class.org
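The two worked values above can be verified in a few lines, using the toy points (1,1), (2,2), (3,3) implied by the calculation.

```python
# Verify the worked examples for the simplified model h_theta(x) = theta1 * x
# on the toy points (1,1), (2,2), (3,3).
data = [(1, 1), (2, 2), (3, 3)]

def J(theta1):
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

print(J(1.0))  # 0.0
print(J(0.5))  # 0.5833... (approximately 0.58)
```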

[Left plot: $h_\theta(x)$ vs. x for fixed $\theta_1$; right plot: $J(\theta_1)$ traced out over a range of $\theta_1$ values]

$$\underset{\theta_1}{\text{minimize}}\; J(\theta_1)$$

Wei Xu ◦ socialmedia-class.org

$h_\theta(x)$ (for fixed $\theta_0, \theta_1$, this is a function of x); $J(\theta_0,\theta_1)$ (a function of the parameters $\theta_0, \theta_1$)

[3D surface plot of $J(\theta_0,\theta_1)$ over $(\theta_0, \theta_1)$]

Wei Xu ◦ socialmedia-class.org


$h_\theta(x)$ (for fixed $\theta_0, \theta_1$, this is a function of x); $J(\theta_0,\theta_1)$ (a function of the parameters $\theta_0, \theta_1$)

[Figure: left, the line $h_\theta(x)$ over the training data; right, a contour plot of $J(\theta_0,\theta_1)$]

Wei Xu ◦ socialmedia-class.org

Parameter Learning

• Have some function $J(\theta_0,\theta_1)$

• Want $\underset{\theta_0,\,\theta_1}{\min}\; J(\theta_0,\theta_1)$

• Outline:

- Start with some $\theta_0, \theta_1$

- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum

Wei Xu ◦ socialmedia-class.org

Gradient Descent

[Plot: $J(\theta_1)$ as a function of $\theta_1$]

$$\underset{\theta_1}{\text{minimize}}\; J(\theta_1)$$

$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$$

where $\alpha$ is the learning rate

Wei Xu ◦ socialmedia-class.org

Gradient Descent

$$\underset{\theta_0,\,\theta_1}{\text{minimize}}\; J(\theta_0,\theta_1)$$

[3D surface plot of $J(\theta_0,\theta_1)$ over $(\theta_0, \theta_1)$]

Wei Xu ◦ socialmedia-class.org

Gradient Descent

repeat until convergence {

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \qquad \text{(simultaneous update for } j = 0 \text{ and } j = 1\text{)}$$

}

where $\alpha$ is the learning rate

Wei Xu ◦ socialmedia-class.org

Linear Regression w/ one variable:
Gradient Descent

$$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \;? \qquad \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \;?$$

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Wei Xu ◦ socialmedia-class.org

Linear Regression w/ one variable:
Gradient Descent

repeat until convergence {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}$$

}   (simultaneous update of $\theta_0, \theta_1$)

Wei Xu ◦ socialmedia-class.org
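Putting the update rules together, here is a hedged sketch of batch gradient descent for the one-variable model, run on the (#words in common, similarity) training table from the slides. The learning rate and iteration count are arbitrary illustrative choices, not values from the lecture.

```python
# Batch gradient descent for h_theta(x) = theta0 + theta1 * x, minimizing
# J(theta0, theta1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2.
# Each step uses ALL training examples (a "batch" update).

xs = [1, 4, 13, 18]   # #words in common
ys = [0, 1, 4, 5]     # human-rated sentence similarity
m = len(xs)

theta0, theta1 = 0.0, 0.0
alpha = 0.005          # learning rate (arbitrary illustrative value)

for _ in range(10000):
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    # simultaneous update: compute both gradients before changing the parameters
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # learned intercept and slope
```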

[3D surface plot of $J(\theta_0,\theta_1)$]

Linear Regression: the cost function is convex (bowl-shaped), so it has a single global minimum

Wei Xu ◦ socialmedia-class.org

$h_\theta(x)$ (for fixed $\theta_0, \theta_1$, this is a function of x); $J(\theta_0,\theta_1)$ (a function of the parameters $\theta_0, \theta_1$)

[Sequence of figures: the fitted line $h_\theta(x)$ shown alongside the contour plot of $J(\theta_0,\theta_1)$ at successive gradient descent steps]

Wei Xu ◦ socialmedia-class.org

Batch Update

• Each step of gradient descent uses all the training examples

Wei Xu ◦ socialmedia-class.org

(Recap)

Linear Regression

• also supervised learning (learn from annotated data)

• but for Regression: predict real-valued output (Classification: predict discrete-valued output)

[Scatter plot: Sentence Similarity (0 to 5) vs. #words in common (0 to 20); thresholding the predicted similarity turns regression into classification: paraphrase above the threshold, non-paraphrase below]

Wei Xu ◦ socialmedia-class.org

(Recap)

Linear Regression

[3D surface plot of $J(\theta_0,\theta_1)$]

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

• Parameters: $\theta_0, \theta_1$

• Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

• Goal: $\underset{\theta_0,\,\theta_1}{\text{minimize}}\; J(\theta_0,\theta_1)$

Wei Xu ◦ socialmedia-class.org

(Recap)

Gradient Descent

repeat until convergence {

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1) \qquad \text{(simultaneous update for } j = 0 \text{ and } j = 1\text{)}$$

}

where $\alpha$ is the learning rate

Wei Xu ◦ socialmedia-class.org

socialmedia-class.org

Next Class: Logistic Regression (cont'd)