
CS229 Final Project Report

Multiclass Classification of Tweets and Twitter Users Based on Kindness Analysis

Wanzi Zhou ([email protected]), Chaosheng Han ([email protected]), Xinyuan Huang ([email protected])

I. Introduction

Nowadays, social networks such as Twitter and Facebook are indispensable in people's daily lives, and it is therefore important to keep the social community healthy. Establishing a kindness assessment mechanism is very helpful for maintaining a healthy environment; it could be used in applications such as a rewarding system or parental control modes for children using social networks.

Our goal is to set up a kindness rating system for tweets and Twitter users. To accomplish this, we decompose the task into two stages: first, for a stream of tweets, we run unsupervised learning algorithms to classify them into three clusters: positive, negative, and neutral. Second, we choose a group of Twitter users and apply our trained model to assess their kindness.

II. Related work

In 2015, Cheng et al. [1] from Stanford and Cornell developed a logistic regression model that uses labeled posts to predict antisocial behavior in online discussion communities. Their study focuses on detecting whether a user is a troll, which is a binary classification problem. Earlier, in 2011, Sood et al. [2] from Pomona College and Yahoo developed a model for automatic identification of personal insults on social news sites, which is likewise a supervised, binary classification task; they had their data labeled via Amazon Mechanical Turk. Meanwhile, sentiment analysis on Twitter data has been a popular topic in machine learning. Bifet and Frank [3] trained a supervised multinomial naive Bayes classifier to predict the sentiment and opinion of tweets.

Pak and Paroubek [4] improved this model by cleaning the input data more thoroughly. Agarwal et al. [5] from Columbia University further explored tweets with a 3-way classification: positive, negative, and neutral. All of the studies mentioned above use supervised learning; however, it is infeasible to label enough training data in a short time. Thus, departing from prior work, we propose to give each tweet or Twitter user a kindness rating, leading to an unsupervised multiclass classification or regression problem.

III. Dataset and Features

Twitter has always been a great resource for Natural Language Processing researchers. It provides a sufficiently large amount of data with outstanding qualities: real-life conversations, bounded length (at most 140 characters), rich variety, and a real-time data stream. Using the Twitter API, we captured a random sample of tweets over a continuous 24-hour period on a regular day and kept only the English tweets. After these steps, we obtained 58295 tweets as the dataset for this project.
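For illustration, here is a minimal sketch of such a collection step using the tweepy 3.x streaming API (since superseded by newer Twitter API versions); the credential strings and the output file name are placeholders, not values from the report:

```python
import tweepy

# Placeholder credentials -- substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class EnglishTweetCollector(tweepy.StreamListener):
    """Append the text of each English tweet in the random sample to a file."""
    def on_status(self, status):
        if status.lang == "en":  # keep only English tweets
            with open("tweets.txt", "a", encoding="utf-8") as f:
                f.write(status.text.replace("\n", " ") + "\n")

stream = tweepy.Stream(auth=auth, listener=EnglishTweetCollector())
stream.sample()  # random sample of the public tweet stream
```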

We first used lexicon features. We collected two lexicons: 723 negative words [6], such as "bastard", and 236 positive words [7], such as "amazing". We clean the data by lowercasing all letters and removing punctuation. For every tweet in the dataset, we compare its words against both dictionaries and obtain a 959 × 1 feature vector, where each entry counts how many times the corresponding dictionary word appears in the tweet. We then use these features in the learning stage.
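A minimal sketch of this counting step, assuming hypothetical lexicon files with one word per line (the tokenization is our assumption, not the authors' exact pipeline):

```python
import re
from collections import Counter

def load_lexicon(path):
    """Load a lexicon file with one lowercase word per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

# 723 negative words [6] followed by 236 positive words [7] -> 959 entries.
vocab = load_lexicon("negative_words.txt") + load_lexicon("positive_words.txt")

def tweet_features(tweet):
    """959-dim count vector: occurrences of each dictionary word in the tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())  # lowercase, drop punctuation
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]
```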

For a second try, since our data does not have labels, we want the features to be more reasonable and objective so that the subsequent unsupervised learning can achieve a better result. We therefore also consider the semantics of and relations between words in order to assign different weights to the words in the dictionary. To achieve this, we used a word2vec approach with an online dataset from the Data Compression Programs site by Matt Mahoney [8]. We pre-processed the data file to obtain a corpus of 17005207 word tokens of all kinds (including positive words such as "optimistic" and negative words such as "bastard" from our positive and negative dictionaries), with 50000 unique words in total. We then built a skip-gram model and trained it with SGD for 40000 steps to obtain a 50000 × 128 word-embedding matrix, where each row is the embedding vector of one of the 50000 words. We extracted the vectors for the words in our positive and negative dictionaries from this matrix. Then, for each word, we compute its cosine similarity with every other word and take its average similarity to the positive and to the negative dictionary words as measurements of how "positive" and "negative" it is. Based on this, we assign different weights when building the 959 × 1 feature vector of every tweet for the learning stage.
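A self-contained sketch of this weighting idea; the toy lexicon and the random stand-in for the trained embedding matrix exist only to make the snippet runnable, and the final weighting formula is one plausible choice, since the report does not spell it out:

```python
import numpy as np

# Stand-ins for the trained 50000 x 128 skip-gram embeddings (random here).
rng = np.random.default_rng(0)
dictionary = ["amazing", "optimistic", "bastard", "awful"]  # toy lexicon
positive = {"amazing", "optimistic"}
negative = set(dictionary) - positive
emb = {w: rng.standard_normal(128) for w in dictionary}

def avg_cosine(word, group):
    """Mean cosine similarity between `word` and the words in `group`."""
    v = emb[word] / np.linalg.norm(emb[word])
    sims = [v @ (emb[w] / np.linalg.norm(emb[w])) for w in group if w != word]
    return float(np.mean(sims))

# Weight each dictionary word by how much closer it lies to the positive
# words than to the negative ones.
weights = {w: avg_cosine(w, positive) - avg_cosine(w, negative)
           for w in dictionary}
```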

IV. Unsupervised Learning Model

For this project we use three methods to implement the unsupervised clustering: K-means, principal component analysis (PCA) combined with K-means, and a Gaussian mixture model fitted with the EM algorithm. We then compare the results of these methods.

i. K-means

Since the data we have obtained are unlabeled, we perform unsupervised learning by classifying the tweets into three clusters: positive, negative, and neutral. We first try the very straightforward method of K-means clustering.

The input is the feature vector obtained through feature extraction, which encodes the use of positive and negative words. We run K-means on all 58295 tweets.

We initialize the cluster centroids using the prior knowledge that the K cluster centroids should be well separated from each other, and we add a random component to the centroid generation to avoid poor local minima. Then we repeat the following K-means updates until convergence. For every i, i = 1, ..., 58295, set

$$c^{(i)} = \arg\min_j \left\| x^{(i)} - \mu_j \right\|^2.$$

For every j, j = 1, 2, ..., K, set

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}.$$
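A compact numpy sketch of these two updates; the random data and the plain random initialization are stand-ins (the report uses a separation-based initialization):

```python
import numpy as np

def kmeans(X, K, tol=1e-8, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # random initial centroids
    while True:
        # Assignment step: nearest centroid for every point.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Update step: mean of the points assigned to each centroid.
        new_mu = np.array([X[c == j].mean(axis=0) if (c == j).any() else mu[j]
                           for j in range(K)])
        if np.abs(new_mu - mu).max() < tol:
            return c, new_mu
        mu = new_mu

X = np.random.default_rng(1).random((500, 959))  # stand-in for tweet features
labels, centroids = kmeans(X, K=3)
```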

To choose the optimal cluster number K, we visualize the clustering results in a two-dimensional space whose two dimensions are the normalized sums of the positive and negative feature counts, respectively. We use this 2D projection as a criterion for determining the optimal K, based on the reasoning that if the samples are well clustered in a low-dimensional space, they must be clustered at least as well in a higher-dimensional space.
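The projection itself is just two scalar summaries per tweet; a sketch, assuming the first 723 feature entries are the negative-word counts as in Section III (the exact normalization is our assumption):

```python
import numpy as np

def project_2d(X, n_negative=723):
    """Per-tweet sums of negative and positive word counts, scaled to [0, 1]."""
    neg = X[:, :n_negative].sum(axis=1)
    pos = X[:, n_negative:].sum(axis=1)
    return neg / max(neg.max(), 1), pos / max(pos.max(), 1)
```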

According to our clustering results with K = 2, 3, 4, 5, shown in Section V, we find the optimal cluster number to be K = 3. We then inspect the values of the three cluster centroids: one of them is extremely close to the zero vector, while the other two have distinctly recognizable positive and negative components. This shows that the three clusters correspond to the three categories discussed in the previous section: positive, negative, and neutral.

ii. PCA

After trying straightforward K-means, we suspect it might be helpful to reduce the computation time by applying PCA (principal component analysis) before the K-means algorithm.

We first shrink the 959 × 1 (723 negative + 236 positive) feature vector to 605 × 1 by eliminating the words that never appear in the dataset.


Then we normalize the feature data to zero mean and unit variance in each component.

Afterwards, we calculate the empirical covariance matrix Σ of the feature data. Then we project our data onto a k-dimensional subspace (k < m); here we choose k = 100. Specifically, we choose u_1, ..., u_k to be the top k eigenvectors of Σ and represent each feature vector in the basis of the u_i's.
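A numpy sketch of this projection on random stand-in data (eigh is used because Σ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 605))               # stand-in for the 605-dim features

# Normalize to zero mean and unit variance per component.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Empirical covariance and its top-k eigenvectors.
Sigma = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
U = eigvecs[:, -100:]                     # top k = 100 principal directions

Z = X @ U                                 # 100-dim inputs for K-means
```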

iii. Gaussian Mixture Model

To reflect the correlations between the individual components of the feature vector, we also use the Expectation-Maximization (EM) algorithm to learn a Gaussian mixture model.

Since we have already demonstrated that K = 3 is the optimal cluster number, for the Gaussian mixture model we use three Gaussians: cluster 1 for neutral, cluster 2 for positive, and cluster 3 for negative tweets. Our goal is to maximize the log likelihood

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p\big(x^{(i)} \mid z^{(i)}; \mu, \Sigma\big)\, p\big(z^{(i)}; \phi\big)$$

where x^{(i)} is the feature vector of tweet i and z^{(i)} is its corresponding latent variable in the GMM, taking values in {1, 2, 3} (here k = 3). The parameters φ, µ, Σ of the GMM are estimated with the EM algorithm by repeating the following two steps until convergence. E-step: "guess" the values of the z^{(i)}'s by setting

$$w_j^{(i)} = p\big(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma\big)$$

M-step: update the parameters φ_j, µ_j, Σ_j for every j.
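For reference, the same fit can be sketched with scikit-learn's GaussianMixture, which implements exactly this EM loop; the low-dimensional random data is a stand-in for the tweet features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((1000, 50))             # stand-in for the tweet feature vectors

# Three full-covariance Gaussians, fitted by EM.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)            # hard cluster label per tweet
posteriors = gmm.predict_proba(X)      # E-step responsibilities w_j
```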

V. Experimental Results & Discussion

Word2vec: The following figure shows the 2D distance map of all the words in the word2vec model.

[Figure: Dictionary word2vec distance map. x-axis: average word2vec distance towards negative; y-axis: average word2vec distance towards positive; negative and positive dictionary words plotted as separate series.]

Comparing the distributions of positive and negative words in this 2D distance map, positive words tend to appear more toward the upper left of the map than negative words, which gives us a quantitative description of how "positive" or "negative" a word is.

K-means: The following figures show the 2D projections of the K-means clustering results for cluster numbers K = 2, 3, 4, 5.

[Figures: K-means clustering results for K = 2, 3, 4, 5 (convergence threshold 1e-8), projected onto 2D. x-axis: sum of negative words appearance; y-axis: sum of positive words appearance; each cluster plotted as a separate series.]

As mentioned in Section IV, we use the 2D visualization to judge the effectiveness of clustering for each cluster number K.

• For K = 2, the clustering is biased toward either the negative or the positive direction, which is clearly not a good result.

• For K = 3, the three clusters are symmetric and well separated from each other.

• For K = 4, 5, ..., we begin to see finer structure inside the clusters, while the clustering at the far ends of the two directions follows the same pattern as K = 3.

Therefore, we take K = 3 as our optimal cluster number. Looking more closely at the three centroid values, we find that the green cluster represents tweets containing more positive words and the blue cluster represents tweets containing more negative words. The red cluster contains tweets that are mostly neutral, i.e., that contain few positive or negative words. This result shows that K-means with K = 3 does a rather good job of discerning negative, neutral, and positive tweets.

With cluster number K = 3 and reduced dimension k = 100, the figure below shows the result of applying K-means after PCA.

[Figure: K-means with PCA (k = 100), projected onto 2D. x-axis: sum of negative words appearance; y-axis: sum of positive words appearance; neutral, negative, and positive clusters plotted.]

Surprisingly, K-means with PCA does not give a satisfactory result, as it fails to distinguish among negative, neutral, and positive tweets. We think this is because shrinking the dimension of the feature vector loses some of the non-trivial information in the original tweets. Although some words in the dictionary can be strongly correlated, such as "cock", "c-o-c-k", and "cocks", the amount of redundancy would only allow us to shrink the dimension to about half or slightly less, since most of the words are distinct from one another.

Gaussian Mixture Model: The following figure shows the 3-way classification with the Gaussian mixture model. The classification plot is similar to that of K-means with K = 3, but the neutral region is smaller, while the positive and negative regions expand considerably.

Comparison: The following plots are the learning curves of K-means and GMM.

[Figures: learning curves showing relative error (on the order of 1e-3) versus iteration count for K-means (left) and the Gaussian mixture model (right).]

Both algorithms converge quickly, and GMM converges even faster than K-means. We then list the number of tweets in each category under K-means and GMM.

Table 1: Multiclass classification counts on 58295 tweets

          K-means    GMM
Positive      110   1461
Negative      137   1129
Neutral     58048  55705

Comparing the results from K-means and the Gaussian mixture model, we find that most tweets online are neutral. K-means classifies about 0.4% of tweets as either positive or negative, while the Gaussian mixture model classifies about 4.4% (2590 of 58295) that way, which shows that the Gaussian mixture model is better at recognizing positive or negative tweets. This is because, with its hard assignments, K-means only realizes spherical clusterings, while GMM is probabilistic: it incorporates the covariance structure of the data and adapts itself to elliptical clusters.

Application: We apply our trained GMM to the 200 most recent tweets of three US politicians: Barack Obama, Donald Trump, and Hillary Clinton. The results show that all three follow the same pattern: while most of their tweets are neutral, their proportion of positive tweets is significantly higher than that of the general public. This is to be expected, since intuitively politicians tend to convey more positive ideas and information to the public.

Table 2: Model test on three US politicians

          Barack Obama   Donald Trump   Hillary Clinton
Positive            35             50                47
Negative            10             10                 8
Neutral            155            140               145
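A sketch of this per-user scoring step, reusing the fitted gmm and the tweet_features helper from the earlier snippets; the component-to-label mapping below is hypothetical and would in practice be read off the fitted component means:

```python
from collections import Counter

# Hypothetical mapping from GMM component index to sentiment label.
LABELS = {0: "neutral", 1: "positive", 2: "negative"}

def kindness_profile(user_tweets, gmm):
    """Tally positive/negative/neutral labels over one user's recent tweets."""
    X = [tweet_features(t) for t in user_tweets]   # count vectors per tweet
    counts = Counter(LABELS[j] for j in gmm.predict(X))
    return {label: counts.get(label, 0) for label in LABELS.values()}
```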

VI. Conclusion & Future Work

So far, we understand the basic structure of the model, and we have implemented the classification of tweets into positive, neutral, and negative categories, along with a comparison of different unsupervised learning methods. We found that classifying tweets into three clusters (positive, neutral, and negative) is currently the most reasonable choice. Most tweets are neutral, and a small portion are either positive or negative. K-means with PCA does not do as well as K-means alone; we think this is because PCA removes non-trivial information from the feature vectors. Compared to K-means, the Gaussian mixture model performs better at classifying tweets into the three clusters because it considers the correlations between the different feature components. We tested our model on three US politicians, and the results align with intuition.

Our next step is to take data from a set of individual Twitter users and, based on their tweet histories, build a model that gives each of them a kindness score, thus establishing the kindness assessment system. We also want to dig deeper into the gap between positive and negative words at a psychological level.


References

[1] Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. "Antisocial behavior in online discussion communities." arXiv preprint arXiv:1504.00680 (2015).

[2] Sood, Sara Owsley, Elizabeth F. Churchill, and Judd Antin. "Automatic identification of personal insults on social news sites." Journal of the American Society for Information Science and Technology 63.2 (2012): 270-285.

[3] Bifet, Albert, and Eibe Frank. "Sentiment knowledge discovery in Twitter streaming data." International Conference on Discovery Science. Springer Berlin Heidelberg, 2010.

[4] Pak, Alexander, and Patrick Paroubek. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." LREC. Vol. 10. 2010.

[5] Agarwal, Apoorv, et al. "Sentiment analysis of Twitter data." Proceedings of the Workshop on Languages in Social Media. Association for Computational Linguistics, 2011.

[6] http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/

[7] http://www.the-benefits-of-positive-thinking.com/list-of-positive-words.html

[8] http://www.mattmahoney.net/dc/
