Graph-based Analysis and Opinion Mining in Social Network

Project Report: Graph-based Analysis and Opinion Mining in Social Network

Khan Mostafa

Stony Brook University

Student ID# 109365509

[email protected]

ABSTRACT

This is the final report for a Networks & Data Mining Techniques project on mining a social network to estimate public opinion about entities and their associated keywords. The project mines Twitter for recent feeds and analyzes each tweet to estimate its sentiment score, the entities it discusses, and the keywords that describe them. This data is then used to elicit the overall sentiment associated with each entity. The extracted entities and keywords are also used to form an entity-keyword bigraph, which is in turn used to detect entity communities and the keywords found within those communities. The presented implementation works in linear time.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications –

Data Mining.

General Terms

Algorithms, Documentation, Experimentation.

Keywords

Opinion mining, sentiment, graph clustering, graph community

detection.

1. INTRODUCTION

This project focuses on mining opinion from a social network. It takes Twitter as a model platform because it has a publicly available stream of posts from people of diverse demographics. The goal is to report public opinion in two forms: (a) overall opinion about some entity and (b) opinion-based clusters of entities and keywords.

Public opinion can be mined from posts about an entity of interest. First, ample posts are fetched from the public stream. Then, each post is individually scored to find its embedded subjectivity. Not all posts are subjective: some assert information while others express feelings. Hence, posts can be broadly classified as objective, positive or negative. However, subjective bias is not discrete; rather, each post embodies mixed polarity. Moreover, attempts to annotate posts manually have shown that different people associate different sentiment with the same post. Therefore, this project focuses on calculating sentiment scores for posts. After each post is individually scored, overall opinion is represented using a few aggregate parameters, including overall score, diversity, and the percentage of each type of polar post. A set of keywords (kw) is also identified to report how the entity (E) is positively and negatively described.

In this project, sentiment analysis follows an approach similar to [1], combining two naïve Bayes classifiers to calculate a polarity score: a PoS-tag-based classifier and an n-gram-based classifier. Keywords and entities are primarily detected using parts of speech. Then, in the combined analysis, keywords that occur infrequently for an entity are discarded, as such words are not sufficiently associated with the entity. Likewise, candidate keywords that occur in the descriptions of too many entities are unlikely to be keywords; they are more likely stop-words or generic words.

After tweets are individually analyzed, further overall analysis can be done. Tweets are collected from the recent public feed stream using the Twitter API, and the analysis reports a polarity score, a set of keywords and a set of entities for each tweet. From the analyzed tweets, an entity-keyword bigraph (E×kw) is computed. In the E×kw bigraph, an edge exists between E and kw if both occur in the same tweet. These edges also have an associated polarity score. The E×kw bigraph can be used to generate an E×E graph. In E×E, there exists an edge between two entities if they share a keyword with similar sentiment bias. This E×E graph is then clustered using a local clustering algorithm in linear time.

This project is implemented mainly using the .NET Framework (C#) and partially using PHP on an Apache server to access the Twitter API [2]. PoS tagging is done using a third-party TreeTagger developed recently for tweets [3].

The main contributions of this project are:

• Implemented a sentiment analysis tool that can elicit scores for individual tweets
• Implemented a way to report an aggregate sentiment score and associated keywords for a queried entity
• Devised and implemented a simple approach to identify entities and keywords in tweets
• Implemented a fast local graph clustering algorithm using split vectors instead of full-blown matrices
• Used the fast local graph clustering to detect and report entity groups along with keywords and grouped polarity scores

The following sections of this report give an overview of prior work, a description of the methodology, and the results and analysis of the mined data.

2. BACKGROUND

Mining a social network to elicit public opinion requires sentiment analysis, keyword and entity tagging, and graph clustering. Sentiment analysis has been studied extensively in several fields and is still an open problem. There has also been ample investigation into detecting communities, partitioning, and finding clusters in graphs. This section briefly discusses a few prior works.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior

specific permission and/or a fee.

CSE590 Network and Data Mining Techniques, Fall, 2013, Stony Brook University, NY, USA.

Copyright 2013

2.1 Sentiment Analysis

Sentiment analysis has been studied thoroughly for a decade or more. One of the earliest works, by Pang et al. [4], amongst others, investigated sentiment classification. This investigation opened a wide arena of research and has led to many results from a multitude of researchers in different fields. Statistics, computational linguistics and machine learning have all been applied to the challenge of sentiment analysis.

There are several lexicon-based techniques for opinion mining, viz. [5] and versions of SentiWordNet [6], [7]. A detailed survey of many lexicon-based approaches is given by [8].

Although earlier studies [9] suggested using only adjectives as a subjectivity measure, later investigations revealed that sentiment appraisal is much more diverse. Whitelaw et al. [10] suggested using appraisal taxonomies for sentiment classification. Similar observations were made by [11] and [12], stating that “Adjectives, Verbs and Adverbs are better than Adjectives Alone”.

Machine learning approaches have widely used Support Vector Machines, e.g. [1], [13], and Naïve Bayes classifiers, e.g. [4], [14]. Latent Dirichlet Allocation (LDA) has also been utilized, e.g. [15], [16]. A lexicon-based holistic approach [17] has also been described to address context dependency.

Opinion mining and sentiment analysis on Twitter have been investigated using various approaches, viz. [14], [18], [1], [16], [19].

Most approaches for opinion mining assign a strict subjectivity class (positive, negative, neutral) to individual texts at different granularities (i.e. sentence, post, paragraph and document). However, a score assignment serves better to convey the intensity of opinion. There is a paucity of studies that try to aggregate sentiment to identify public opinion. Perception of opinion varies for each individual, and a better insight into public opinion can be found by eliciting a few attributes from social media. Overall sentiment score, percentage of positive and negative opinions, and key descriptions are useful attributes that can be elicited. This project focuses on mining tweets about some entity for these attributes associated with that entity.

2.2 Keyword and Entity detection

There are diverse approaches to keyword detection. For example, there are machine-learning-based approaches using SVM [20], and approaches that incorporate linguistic knowledge such as n-grams and PoS for supervised keyword extraction [21]. Thesaurus-based approaches [22] use semantic knowledge for automatic keyword extraction.

Most keyword identification approaches use some kind of machine learning technique along with some other knowledge. However, for this project’s purpose, a simple method is required to identify keywords. This project employs hints from PoS tagging and then lets the data itself build a keyword lexicon while simultaneously detecting keywords.

Footnote 1: Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. [Wikipedia, Accessed Dec 03, 2013]

2.3 Graph clustering

Graphs have long been studied extensively from mathematical and theoretical viewpoints, and in the last few decades they have been studied even more extensively from data-analytic perspectives. Many real-world and physical phenomena can be naturally modeled as graphs. These graphs can then be efficiently investigated to find latent characteristics of the modeled data.

One major operation on graphs in data mining is dividing them into smaller parts. Partitioning can be of different types. One approach is to partition the whole graph into disjoint subgraphs of similar size [23].

For analyzing graphs, a more natural division is often desired. Vertices in graphs tend to share edges with vertices that are themselves connected to common neighbors, and thus form communities. However, communities differ in size, and they are not disconnected from each other. Rather, there are fewer links between nodes of different communities than between nodes of the same community. Newman and others have conducted several studies [24] [25] [26] on detecting communities in graphs. They exploited the modularity (see footnote 1) of a graph to do so. Most of their early works were restricted in scalability, but later spectral optimization of modularity yielded [27] an algorithm that works in near-linear time.

Modularity-based approaches cluster a graph into disjoint communities. In practice, however, communities often overlap. Andersen et al. [28] suggested “Local Graph Partitioning using PageRank Vectors” and other derived algorithms. The core idea behind these approaches is to use the conductance (see footnote 2) of a graph to cluster it locally. These approaches run in near-linear time and can detect communities that overlap.

This project uses an approach based on that of Andersen et al., as it serves several of the project's goals: it can detect communities that overlap, runs in near-linear time, and can be implemented without creating the full adjacency matrix.

3. PROJECT DESCRIPTION

3.1 Problem Statement

People express their opinions about entities (viz. locations, persons, products etc.) in social networks. In brief, the goal is to:

• extract the overall public opinion of some entity
• elicit opinion-based entity groups in the recent stream

The scope of the project is to mine a popular microblogging platform: Twitter.

Footnote 2: Conductance is a measure of a subgraph denoting how much it is connected to the rest of the graph. It is the ratio of out-links from the subgraph to the volume (total edge count from nodes in it).

3.1.1 Extract overall public opinion of some entity

The goal is to extract opinion about a given entity, E. This is done in terms of ample recent tweets about E. The solution shall be able to yield the following about a given entity, E:

• Overall sentiment: The overall sentiment (viz. positive, negative, mixed) about E. A sentiment score in the range [-1, 1] is given, along with the percentage of positive, negative and neutral tweets (some threshold can be applied to distinguish between these three classes) and the count of analyzed tweets. A measure (e.g. variance) of how diverse the opinion is can also be included.
• Key description: The system yields a set of keywords (kw) that are used to describe E.

An overall sentiment about an entity is useful to a multitude of clients for various applications. Sets of key descriptive words along with sentiment provide better insight into public feelings.
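The aggregate attributes listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's C# implementation: the function name `summarize_opinion`, the use of population variance as the diversity measure, and the default threshold of 0.5 (borrowed from the evaluation in section 4.3.1) are assumptions.

```python
from statistics import mean, pvariance

def summarize_opinion(scores, threshold=0.5):
    """Aggregate per-tweet polarity scores in [-1, 1] into an overall report."""
    if not scores:
        raise ValueError("need at least one scored tweet")
    n = len(scores)
    # A tweet counts as polar only when its score passes the assumed threshold.
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    return {
        "score": mean(scores),             # overall sentiment score
        "variance": pvariance(scores),     # assumed diversity measure
        "post_count": n,
        "percent_positive": 100.0 * positive / n,
        "percent_negative": 100.0 * negative / n,
        "percent_neutral": 100.0 * (n - positive - negative) / n,
    }

report = summarize_opinion([0.8, 0.6, -0.7, 0.1])
```

The returned dictionary mirrors the attributes of the `<opinion>` records shown later in Figure 9, with the neutral share and variance added.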

3.1.2 Opinion based entity groups in recent stream

The goal is to detect how entities are grouped together in terms of sentiment and descriptive keywords. This is done based on a stream of recent tweets. Each tweet is individually analyzed, as in 3.1.1. Analysis of each tweet yields:

• Text of the tweet, T
• Entities discussed in it, E
• Keywords in it, kw
• Polarity score, P

These tuples (T,E,kw,P) are then used to build the E×kw bigraph such that:

• There exists an edge between E_i and kw_j if one or more tweets contain both E_i and kw_j.
• The edge has a weight indicating the co-occurrence count of E_i and kw_j, i.e.
  weight_ij = Count({T_k | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw})
• The edge has a pScore that is the average of pScore (= P) over all such occurrences, i.e.
  pScore_ij = Sum({T_k.pScore | E_i ∈ T_k.E ∧ kw_j ∈ T_k.kw}) / weight_ij

After this, a filter is run on the graph to eliminate links between an entity and a keyword where the keyword is not sufficiently descriptive of the entity. This is done by calculating freq such that,

freq_ij = weight_ij / Occurrence(E_i)

If freq_ij is smaller than a certain threshold, ε_freq, then that keyword is filtered out for the entity E_i.
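As a worked illustration of the three quantities defined above (edge weight, edge pScore and freq), the following Python sketch computes them from a handful of made-up tuples. The helper names and the sample tweets are hypothetical, and Occurrence(E) is assumed here to be the number of tweets that mention E.

```python
from collections import defaultdict

def edge_stats(tuples):
    """tuples: iterable of (entities, keywords, p_score), one per tweet."""
    weight = defaultdict(int)      # (entity, keyword) -> co-occurrence count
    p_sum = defaultdict(float)     # (entity, keyword) -> sum of tweet scores
    occurrence = defaultdict(int)  # entity -> number of tweets mentioning it
    for entities, keywords, p in tuples:
        for e in set(entities):
            occurrence[e] += 1
            for kw in set(keywords):
                weight[(e, kw)] += 1
                p_sum[(e, kw)] += p
    p_score = {edge: p_sum[edge] / weight[edge] for edge in weight}
    freq = {(e, kw): w / occurrence[e] for (e, kw), w in weight.items()}
    return dict(weight), p_score, freq

def filter_edges(freq, eps_freq):
    """Keep only edges whose freq reaches the threshold."""
    return {edge for edge, f in freq.items() if f >= eps_freq}

# Made-up sample: "beautiful" describes Paris in all four tweets, "crowded" in one.
tweets = [
    (["Paris"], ["beautiful", "crowded"], 0.5),
    (["Paris"], ["beautiful"], 1.0),
    (["Paris"], ["beautiful"], 0.0),
    (["Paris"], ["beautiful"], 0.5),
]
weight, p_score, freq = edge_stats(tweets)
kept = filter_edges(freq, eps_freq=0.5)
```

With these inputs the edge to "crowded" has freq 0.25 and is dropped, while the edge to "beautiful" has freq 1.0 and is kept.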

This E×kw bigraph is then used to build the E×E graph, such that there exists an edge between E_i and E_j if:

• Occurrence(E_i) > ε_eo ∧ Occurrence(E_j) > ε_eo
• {kw_x ∈ kw(E_i) | Occurrence(kw_x) < ε_kwo} ∩ {kw_x ∈ kw(E_j) | Occurrence(kw_x) < ε_kwo} is not empty
• The polarity bias of both is similar

That is, there is an edge between two entities if they share one or more keywords through links with similar polarity bias. The entities must occur more often than a threshold, ε_eo. The keywords must not occur more than some threshold, ε_kwo, times. The threshold on keywords is motivated by the following intuition:

• If a candidate word occurs in the descriptions of most entities, then it is not a keyword but a generic term.

Then, a community detection algorithm is run on this E×E graph to find groups of entities that are bound together by many polarity-aligned keyword links. Once such a group of entities is generated, there is a group of keywords that occur on edges within that community of nodes. A representative averaged pScore can also be calculated for such a group.

To summarize, given a stream of tweets, the system shall be able to:

• Generate (T,E,kw,P) tuples
• Generate the E×kw bigraph
• Generate the E×E graph
• Return groups of entities that have similar opinion

3.2 Data collection

3.2.1 Corpus and entity from Twitter

This project requires collecting two types of data. First, a corpus of subjective and objective tweets is collected; this data is used to train the classifier (scorer). After training, the trained model (not the training data set) can be stored in a file so that the scorer can later be loaded from that file.

Second, at query time, posts are fetched from Twitter.

The following Twitter APIs are used:

search/tweets
This API is called with ‘q’ = emoticons to gather training data (positive and negative posts). At query time, the same API is used with ‘q’ = query term to fetch related recent posts.

statuses/user_timeline
This API is used to fetch objective training data by querying 'screen_name' = popular_stream. I used Lifehacker, Gizmodo, New York Times, and The Atlantic as sources.

The Twitter API does not allow fetching more than 100 posts at once. Hence, I had to exploit max_id to iteratively repeat the same call for different portions of the result. I collected ten thousand of each type of data for training. At query time, 200~2000 posts are fetched.
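The max_id iteration described above can be sketched as follows. This is only an illustration of the paging pattern, not the project's PHP code: `fetch_page` is a hypothetical callable standing in for one search/tweets request, and `fake_fetch` is an in-memory stand-in so the sketch runs without network access or credentials.

```python
def fetch_all(fetch_page, query, wanted, page_size=100):
    """Collect up to `wanted` posts by walking backwards with max_id.

    fetch_page(query, count, max_id) is a hypothetical callable for one
    request; it returns posts (dicts with an integer "id"), newest first,
    limited to ids <= max_id when max_id is given.
    """
    posts, max_id = [], None
    while len(posts) < wanted:
        page = fetch_page(query, min(page_size, wanted - len(posts)), max_id)
        if not page:
            break  # nothing older is available
        posts.extend(page)
        # Ask for strictly older posts next time so the last one is not refetched.
        max_id = min(p["id"] for p in page) - 1
    return posts

def fake_fetch(query, count, max_id):
    """In-memory stand-in for the API holding ids 1..250 (demo only)."""
    newest = 250 if max_id is None else min(max_id, 250)
    return [{"id": i} for i in range(newest, max(newest - count, 0), -1)]

collected = fetch_all(fake_fetch, "some entity", wanted=230)
```

Subtracting one from the smallest id seen matters because max_id is inclusive; without it each page would repeat the oldest post of the previous page.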

3.2.2 Mining recent twitter stream

To generate an E×E graph large enough to detect groupings of entities, a large portion of the Twitter public stream has to be collected. To do this, the Twitter API is again used and polled continuously over a large number of rate-limit windows. Note that, in v1.1, the Twitter API allows only 180 search queries per window per user and 450 queries per window per app. Each query returns a maximum of 100 tweets, so a single user can fetch at most 180 × 100 = 18,000 tweets per window. Currently, windows are 15 minutes each. Hence, max_id is utilized to continuously fetch tweets using a q=”.” query. An alternative to the search/tweets API would be a streaming API.

After tweets are fetched, very short tweets are discarded: I filtered out tweets with fewer than 50 characters, because shorter tweets are difficult to understand. Retweets (RT) are also discarded to avoid counting the same tweet many times. Furthermore, another filtering stage is applied to remove any remaining duplicate tweets.
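The three filtering stages above could look roughly like this in Python. The function name, the "RT " prefix test for retweets, and the case- and whitespace-insensitive duplicate check are assumptions made for the sketch; the report does not say how the original implementation detects retweets or duplicates.

```python
def clean_tweets(texts, min_length=50):
    """Drop short tweets, retweets and duplicates, keeping first-seen order."""
    seen, kept = set(), []
    for text in texts:
        stripped = text.strip()
        if len(stripped) < min_length:
            continue                      # too short to interpret reliably
        if stripped.startswith("RT "):
            continue                      # retweet marker (assumed convention)
        key = " ".join(stripped.lower().split())
        if key in seen:
            continue                      # duplicate after normalizing case and spacing
        seen.add(key)
        kept.append(stripped)
    return kept

original = "The new phone camera is surprisingly good in low light today"
sample = [original, "RT " + original, original.upper(), "too short"]
cleaned = clean_tweets(sample)
```

Only the first copy of the sample tweet survives: the retweet, the upper-case duplicate and the short tweet are all discarded.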

3.2.3 PoS Tagging

After tweets are collected, they are passed to a TreeTagger for PoS tagging. I used the recently developed GATE Twitter part-of-speech tagger [3], which is based on the Stanford TreeTagger, which in turn is based on the well-known TreeTagger [29] by Schmid. The PoS tags yielded are based on the Penn Treebank tagset [30].

3.3 Implementation

3.3.1 Twitter corpus to train sentiment classifier

Each post is individually scored by two scorers. Following Pak and Paroubek (2010) [1], two classifiers are built. To train them, tweets are queried as follows: (1) positive tweets are fetched with a search for a positive emoticon, (2) negative tweets are fetched with a search for a negative emoticon, and (3) objective tweets are fetched from news media accounts. One classifier exploits the distribution of parts of speech (PoS) between objective and polar statements. PoS distribution also differs between positive and negative statements. See Figure 1 and Figure 2. The other classifier exploits the distribution of n-grams (n=2). N-grams correlate strongly with bias or with objectivity: humans usually use common phrases to express a type of feeling, whereas some other phrases are of an assertive nature. This feature of natural language is captured using n-grams. See Table 2 for top 20 polar n-grams of 94k n-grams.

The reference work used the classification results of the two classifiers to decide the final classification. This project enhances that approach by implementing the classifiers as scorers that evaluate a PoS score and an n-gram score for each statement. Both scores then contribute to the final score of the statement (tweet).
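To make the idea of a classifier used as a scorer concrete, here is a small Python sketch of a bigram scorer. It is not the project's scorer: the report does not give the scoring formula, so the add-one smoothed naïve Bayes log-odds, the tanh squashing into (-1, 1), the equal class priors, and the equal-weight `combine` function are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

class BigramPolarityScorer:
    """Naive Bayes style log-odds over bigrams, squashed into (-1, 1)."""

    def __init__(self, positive_texts, negative_texts):
        self.pos = Counter(g for t in positive_texts for g in bigrams(t))
        self.neg = Counter(g for t in negative_texts for g in bigrams(t))
        self.vocab = len(set(self.pos) | set(self.neg))
        self.pos_total = sum(self.pos.values())
        self.neg_total = sum(self.neg.values())

    def score(self, text):
        log_odds = 0.0
        for g in bigrams(text):
            # Add-one smoothing keeps unseen bigrams from producing log(0).
            p_pos = (self.pos[g] + 1) / (self.pos_total + self.vocab + 1)
            p_neg = (self.neg[g] + 1) / (self.neg_total + self.vocab + 1)
            log_odds += math.log(p_pos / p_neg)
        return math.tanh(log_odds)

def combine(pos_tag_score, ngram_score, ngram_weight=0.5):
    """Blend the two scorer outputs; the equal weighting is an assumption."""
    return (1 - ngram_weight) * pos_tag_score + ngram_weight * ngram_score

scorer = BigramPolarityScorer(
    positive_texts=["so happy today", "happy birthday friend"],
    negative_texts=["so sad today", "miss my friend"],
)
```

A PoS-based scorer could follow the same pattern with tag counts in place of bigram counts, after which `combine` yields the final tweet score.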

3.3.2 From collected tweets to graphs

As outlined in 3.1.2, (T,E,kw,P) tuples, the E×kw bigraph and the E×E graph are generated from a given stream of tweets.

3.3.2.1 Analyzing tweets

First, each tweet is scored using the sentiment classifier described in 3.3.1.

PoS tags are exploited to make a preliminary identification of entities and keywords.

Entity: Our goal is to analyze entities (location, place, person, product etc.). In English, they are generally represented by proper nouns. Also, in Twitter, users can be regarded as entities. Hence, words with the PoS tags for proper nouns and user mentions (NNP, NNPS, USR) are regarded as entities.

Keyword: In English, adjectives, adverbs and verbs are used to describe an entity. This property is exploited by identifying words tagged with these parts of speech (JJ, RB, VB etc.) as keywords. The algorithm also accepts a parameter that additionally includes common nouns (not NNP) as keywords.
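The tag-based rule above can be sketched as a small Python function over already-tagged tokens. The names are hypothetical, and treating every tag that starts with JJ, RB or VB as a keyword tag is an assumed reading of "JJ, RB, VB etc."; the tagger itself is out of scope here.

```python
ENTITY_TAGS = {"NNP", "NNPS", "USR"}
KEYWORD_PREFIXES = ("JJ", "RB", "VB")   # adjectives, adverbs, verbs and their variants
COMMON_NOUN_TAGS = {"NN", "NNS"}

def extract(tagged_tokens, nouns_as_keywords=False):
    """tagged_tokens: list of (word, tag) pairs for one tweet."""
    entities, keywords = [], []
    for word, tag in tagged_tokens:
        if tag in ENTITY_TAGS:
            entities.append(word)
        elif tag.startswith(KEYWORD_PREFIXES):
            keywords.append(word.lower())
        elif nouns_as_keywords and tag in COMMON_NOUN_TAGS:
            keywords.append(word.lower())
    return entities, keywords

tagged = [("@alice", "USR"), ("really", "RB"), ("loves", "VBZ"),
          ("the", "DT"), ("new", "JJ"), ("camera", "NN"),
          ("in", "IN"), ("Paris", "NNP")]
entities, keywords = extract(tagged)
```

Passing `nouns_as_keywords=True` additionally admits "camera", which corresponds to the optional common-noun parameter mentioned above.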

3.3.2.2 Entity-keyword bigraph

The (T,E,kw,P) tuples from the analyzed tweets are iterated over to build an E×kw bigraph as described in 3.1.2. A general intuition, also confirmed by several studies, is that such graphs are generally sparse. Thus, instead of building a full matrix, two dictionaries (maps) are stored to represent the E×kw bigraph:

• A dictionary of entities, with pointers to keywords, as well as the weight and pScore associated with that node
• For ease of iteration, another dictionary of keywords, which stores pointers back from keywords to entities

This representation keeps the storage for the entire bigraph small, yet describes the entire bigraph with its edges and nodes. It reduces the storage from (E*kw) to edgeCount. Note that,

2*(E+kw) < edgeCount << (E*kw)

The running time for building the bigraph is proportional to the number of edges, i.e. O(edges).
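The two-dictionary representation described above might be sketched like this in Python. The class and method names are hypothetical, and storing a running count and score sum on each entity-to-keyword link is one possible layout, not necessarily the one used in the C# implementation.

```python
from collections import defaultdict

class Bigraph:
    """Sparse E×kw bigraph kept as two dictionaries instead of a full matrix."""

    def __init__(self):
        # entity -> keyword -> [weight, sum of polarity scores]
        self.by_entity = defaultdict(dict)
        # keyword -> set of entities, for walking back from keywords
        self.by_keyword = defaultdict(set)

    def add_tweet(self, entities, keywords, p_score):
        for e in set(entities):
            for kw in set(keywords):
                edge = self.by_entity[e].setdefault(kw, [0, 0.0])
                edge[0] += 1
                edge[1] += p_score
                self.by_keyword[kw].add(e)

    def weight(self, e, kw):
        return self.by_entity[e][kw][0]

    def p_score(self, e, kw):
        count, total = self.by_entity[e][kw]
        return total / count

    def edge_count(self):
        return sum(len(kws) for kws in self.by_entity.values())

g = Bigraph()
g.add_tweet(["Paris", "London"], ["rainy"], -0.5)
g.add_tweet(["Paris"], ["rainy", "lovely"], 0.5)
```

Each tweet touches only the edges it mentions, so building the structure costs time proportional to the number of entity-keyword co-occurrences rather than to E*kw.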

3.3.2.3 Entity-Entity graph

From the E×kw bigraph generated above, an E×E graph is generated by iterating over each entity. For each entity E_i, its set of keywords kw(E_i) is processed. Each keyword points to another set of entities, E(kw(E_i)). These entities are added to the neighbors of E_i. In this step too, a dictionary is used to represent the graph: one dictionary of entities, where each entry also points to its immediate neighbors. This requires storage of 2*edge. The runtime to build this graph is proportional to the number of edges. However, entities are filtered a priori to remove nodes with very few neighbors from the computation (thus building a set of significant entities). Filtering generic terms out of the keyword list (thus using only legitimate keywords) further reduces the search space.
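A possible shape for this step is sketched below in Python. Several details are assumptions rather than facts from the report: keyword occurrence is taken to be the number of significant entities a keyword links to, "similar polarity bias" is taken to mean that both link scores have the same sign, and the function and variable names are invented for the example.

```python
from collections import defaultdict

def build_entity_graph(by_entity, entity_occurrence, eps_eo, eps_kwo):
    """by_entity: entity -> keyword -> (weight, p_score) from the E×kw bigraph."""
    significant = {e for e in by_entity if entity_occurrence.get(e, 0) > eps_eo}
    by_keyword = defaultdict(list)             # keyword -> [(entity, p_score)]
    for e in significant:
        for kw, (_, p) in by_entity[e].items():
            by_keyword[kw].append((e, p))
    neighbors = defaultdict(set)
    for kw, links in by_keyword.items():
        if len(links) >= eps_kwo:
            continue                           # generic term: linked to too many entities
        for i, (a, pa) in enumerate(links):
            for b, pb in links[i + 1:]:
                if pa * pb > 0:                # assumed "similar bias": same sign
                    neighbors[a].add(b)
                    neighbors[b].add(a)
    return dict(neighbors)

# Made-up bigraph: "the" is generic, "lovely" links Paris and Rome positively,
# "dull" links Rome and Monday with opposite signs, "Rare" occurs too seldom.
by_entity = {
    "Paris": {"lovely": (3, 0.6), "the": (5, 0.1)},
    "Rome": {"lovely": (2, 0.4), "the": (4, 0.2), "dull": (1, 0.2)},
    "Monday": {"the": (6, -0.1), "dull": (3, -0.5)},
    "Rare": {"lovely": (1, 0.9)},
}
occurrence = {"Paris": 10, "Rome": 8, "Monday": 9, "Rare": 1}
graph = build_entity_graph(by_entity, occurrence, eps_eo=2, eps_kwo=3)
```

Because generic keywords are skipped before entity pairs are formed, the pairwise loop only ever runs over keywords with fewer than ε_kwo linked entities, which is what keeps the search space small.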

3.3.3 Keywords from data

Keywords are filtered in several steps to let the data define legitimate keywords. In the first step, PoS tagging defines a preliminary set. After all tweets are analyzed, a filter removes low-frequency terms from the keyword list of each entity. After the E×kw bigraph is built, another filter rules out generic terms. Generic terms are those candidate keywords that are found with too many entities; a threshold parameter is supplied to the algorithm for this. Finally, after the communities are generated, a consolidation step filters out irregular keywords to yield the final set of keywords.

3.3.4 Community detection: group of entities

After the E×E graph is generated, consisting of legitimate keywords and significant entities, a community detection algorithm can be used to detect communities in it. This project implements a fast algorithm derived from Andersen et al. [28].

Table 1. Community Detection Algorithm

1. Significant_entities := entities in (E×E)
2. seed := supplied_seed
3. if (seed is null or does not exist) then seed := first(Significant_entities)
4. aCommunity := new Community()
5. entity := seed
6. eval := evaluate(entity, aCommunity)
7. if (eval.member) then
       aCommunity.Add(entity)
       remove(entity, Significant_entities)
   remove(entity, aCommunity.Nbor)
8. if (aCommunity.Nbor = empty) goto 11
9. entity := first(aCommunity.Nbor)
10. goto 6
11. add(aCommunity, Communities)
12. if (Significant_entities not empty) then seed := first(Significant_entities); goto 4
13. return

The algorithm described above uses objects of class Community. Its Add() member function adds the entity and updates the community's volume (= edges inside) and outward links. The evaluate() function checks membership by calculating the conductance the community would have if this node were added and comparing it with the original conductance. Conductance is defined as,

Cond = (links outward from community)/(edges inside).
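The growth procedure of Table 1 and the conductance test can be sketched in Python as below. This is a simplified reading rather than the project's implementation: conductance is recomputed from scratch on every test (the original keeps running totals in the Community object to stay fast), a node is admitted when adding it does not increase conductance (an assumed membership rule), and `min()` stands in for `first()` to make the run deterministic.

```python
def conductance(members, graph):
    """Outward links divided by edges inside, as defined in the text above."""
    inside = sum(1 for a in members for b in graph[a] if b in members) // 2
    outward = sum(1 for a in members for b in graph[a] if b not in members)
    if inside == 0:
        return float("inf") if outward else 0.0
    return outward / inside

def detect_communities(graph, seed=None):
    """graph: node -> set of neighbours (undirected). Returns a list of sets."""
    remaining = set(graph)
    communities = []
    while remaining:
        if seed not in remaining:
            seed = min(remaining)             # deterministic stand-in for first()
        community = {seed}
        remaining.discard(seed)
        frontier = sorted(n for n in graph[seed] if n in remaining)
        while frontier:
            node = frontier.pop(0)
            if node not in remaining:
                continue
            # Admit the node only if conductance does not get worse (assumed rule).
            if conductance(community | {node}, graph) <= conductance(community, graph):
                community.add(node)
                remaining.discard(node)
                frontier.extend(sorted(n for n in graph[node]
                                       if n in remaining and n not in frontier))
        communities.append(community)
    return communities

# Two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = detect_communities(graph)
```

On this toy graph the walk from "a" absorbs "b" and "c", rejects "d" because it would raise conductance from 1/3 to 1/2, and then starts a second community from "d".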

This generates a set of communities. After each community is generated, a consolidation step further filters keywords. This is done as,

size := size of community := number of entities in it
Threshold := ln(size)
If (Occurrence(kw) < Threshold) then Remove(kw)

After this step, a set of descriptive keywords is associated with the group of entities.
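The consolidation rule is simple enough to show directly. In this Python sketch the function name and sample counts are made up, and Occurrence(kw) is assumed to be the number of times the keyword appears on edges inside the community.

```python
import math

def consolidate_keywords(keyword_counts, community_size):
    """Drop keywords that occur fewer than ln(size) times inside the community."""
    threshold = math.log(community_size)
    return {kw: n for kw, n in keyword_counts.items() if n >= threshold}

# ln(10) is about 2.3, so a keyword seen once is removed.
kept = consolidate_keywords({"cries": 4, "movie": 1, "premiere": 3}, community_size=10)
```

Because the threshold grows only logarithmically, small communities keep almost all of their keywords, while large ones shed keywords that appear just once or twice.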

3.3.5 Storing result

The final set of communities is returned by the implementation as an XML document. The (T,E,kw,P) tuples are also returned as XML. The intermediate graphs, the E×kw bigraph and the E×E graph, are exported as CSV (comma separated value) files.

4. RESULTS AND FINDINGS

4.1 Findings

The findings reported here are based on 160,711 tweets collected in late November 2013.

4.1.1 PoS Distributions and n-grams

Later in this section are figures of PoS distributions over subjective-objective statements and positive-negative statements. A positive bias value in Figure 1 indicates that the presence of that PoS makes the statement more likely to be positive; the same holds for negative values. The subjectivity score in Figure 2 is read similarly. Table 2 shows the top few n-grams. Note that the PoS distributions and top n-grams differ slightly from the referenced work [1]. Moreover, if training data is collected at a different time, some slight change will occur.

Table 2. Top n-gram with occurrence in each class of data

n-gram                  Positive  Negative  Objective
'enjoying break'               1       328          1
'happy birthday'              22       207          1
'so happy'                   106        53          1
'follow back'                 10       132          1
'miss my'                     93        10          1
'no one notices'              97         4          1
'notices my'                  97         1          1
'good day'                     5        82          1
'follow please'               47        38          1
'my phone'                    64        18          1
'presenting emotional'        60        20          1
'please follow'               11        66          1
'follow love'                 17        60          1
'am sorry'                    71         4          1
'so sad'                      71         3          1
'miss u'                      65         7          1
'new followers'               53        17          1

Figure 1. Distribution of PoS in positive and negative

statements

Figure 2. Distribution of PoS between subjective and objective

tweets

4.1.2 Power law in Entity and Keywords

Figure 3 and Figure 4 show how entity and keyword occurrences follow a power law.

Figure 3. ln(Occurrence) of entities shows a power law

Figure 4. ln(Occurrence) of keywords shows a power law

4.1.3 Distribution of Polarity Score in Entities

Figure 5 shows how polarity scores are distributed amongst entities. The polarity score has a skewed distribution. Figure 6 shows the distribution of polarity score over the natural logarithm (ln) of the occurrence count of the entity.

[Figure 1 data labels (axis: BIAS): POS 0.600, WP$ 0.500, PDT 0.333, RBS 0.280, URL 0.229, WP 0.217, JJS 0.187, SYM 0.176, USR 0.155, FW 0.127, NNP 0.110, CD 0.068, DT 0.032, VB 0.000, UH -0.004, NN -0.007, JJR -0.010, IN -0.012, NNS -0.015, JJ -0.019, RBR -0.024, WDT -0.031, VBG -0.034, NNPS -0.050, VBZ -0.055, EX -0.064, MD -0.099, CC -0.102, PRP$ -0.114, PRP -0.135, VBP -0.144, TO -0.149, RP -0.175, RB -0.182, VBD -0.227, VBN -0.245, WRB -0.282]

[Figure 2 data labels (axis: SUBJECTIVITY): WRB 0.164, VBN 0.140, VBD 0.128, RB 0.100, RP 0.096, TO 0.081, VBP 0.078, PRP 0.072, PRP$ 0.061, CC 0.054, MD 0.052, EX 0.033, VBZ 0.028, NNPS 0.025, VBG 0.017, WDT 0.016, RBR 0.012, JJ 0.010, NNS 0.008, IN 0.006, JJR 0.005, NN 0.003, UH 0.002, VB 0.000, LS 0.000, DT -0.016, CD -0.033, NNP -0.052, FW -0.060, USR -0.072, SYM -0.081, JJS -0.085, WP -0.098, URL -0.103, RBS -0.123, PDT -0.143, WP$ -0.200, POS -0.231]

[Figures 3 and 4 axis ticks: first plot, vertical axis 0 to 9 and horizontal axis 0 to 14000; second plot, vertical axis 0 to 10 and horizontal axis 0 to 10000]

Figure 5. Distribution of Polarity Score over entire entity space

Figure 6. Polarity Score over ln(Occurrence) of entities

4.1.4 Graph BFS & communities in adjacency matrix

Starting from an arbitrary node, the E×E graph is traversed breadth first (BFS) to generate an arbitrary traversal order of the nodes. This BFS assigns an index to each entity, and the adjacency matrix under that indexing is visualized in Figure 7. Notice that this is a near-diagonal matrix, although the diagonal itself is white, as there are no self-edges. Notice the blocks; these blocks are representative of communities. There are tiny and large communities. There are 157 communities, with a maximum size of 136.
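The reindexing behind this kind of plot can be sketched as follows in Python. The function names and the toy graph are illustrative only; the sketch builds a dense matrix for display purposes, which is practical only for the reduced set of significant entities.

```python
from collections import deque

def bfs_order(graph, start):
    """Visit every node breadth first, restarting for disconnected parts."""
    order, seen = [], set()
    for root in [start] + sorted(graph):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in sorted(graph[node]):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return order

def adjacency_matrix(graph, order):
    """Dense 0/1 matrix with rows and columns arranged in BFS order."""
    index = {node: i for i, node in enumerate(order)}
    size = len(order)
    matrix = [[0] * size for _ in range(size)]
    for node, nbrs in graph.items():
        for nbr in nbrs:
            matrix[index[node]][index[nbr]] = 1
    return matrix

# Two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
order = bfs_order(graph, "a")
matrix = adjacency_matrix(graph, order)
```

Because BFS visits neighbours together, nodes of one community receive nearby indices, so their edges cluster into blocks along the diagonal, as seen in Figure 7.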

Figure 7. Adjacency matrix of significant entities

4.1.5 Observation of Groups

Feed tweet sets of different sizes were examined. The number of significant entities and the number of legitimate keywords increase with the size of the tweet set. They all yield communities of different sizes. When these communities and their keywords were examined manually, they matched intuition. An interesting community, in which the keyword cries is associated with two stars, is shown in Figure 8.

<Community id="146" size="2" conductance="0.5" pScore="0.63566754320156">
  <trapped-keywords count="1">
    Cries:4,
  </trapped-keywords>
  <e>Kristen Stewart</e>
  <e>Robert Pattinson</e>
</Community>

Figure 8. XML representation of a community

4.2 Results

Figure 9 shows some sample runs where the system is queried for the overall sentiment analysis of an entity.

<opinion entity='mermaid'>
  <score>0.21</score>
  <analysis post-count='1086' percent-positive='52.03' percent-negative='24.59'/>
</opinion>

<opinion entity='bankrupt'>
  <score>-0.18</score>
  <analysis post-count='2073' percent-positive='30.29' percent-negative='47.03'/>
</opinion>

<opinion entity='drunk man'>
  <score>-0.50</score>
  <analysis post-count='1084' percent-positive='11.99' percent-negative='65.59'/>
</opinion>

<opinion entity='November'>
  <score>0.20</score>
  <analysis post-count='2062' percent-positive='53.25' percent-negative='25.12'/>
</opinion>

Figure 9. Result runs for query over entity

A few parameters were varied on the sample to see how they affect the result. The keyword threshold (ε_kwo), the minimum nodes (ε_eo), and the use of common nouns as keywords were varied; the results are shown in Table 3. Using common nouns as keywords yields a few groups of very large size. Thus, it is recommended to exclude common nouns from keywords.

Table 3. Effect of parameters change

Kw threshold               350     350     450
Minimum nodes                2       2       2
Common Noun as keyword   false    true   false
Potential kw             15108   31593   15108
Legitimate kw            14967   31368   14997
Entities                 97147   97147   97147
E occurring > 2           7580    7580    7580
Significant E.            1190    2012    1378
Groups                     170      92     157
Largest size                70    1256     136

The polarity scores of each entity and keyword are stored and can be accessed directly in the E×kw bigraph.

Building a polarity-invariant E×kw bigraph was also tested. For settings similar to those of the last column in Table 3, the polarity-invariant version generated 174 groups, with the largest group of size 598, for 1854 significant entities. The generated groups are also significantly different.

[Figures 5 and 6 axis ticks: polarity score axis from -1.5 to 1.5 in both plots; ln(Occurrence) axis from 0 to 8]

The generated files containing the result sets are kept online at

http://meaningofdata.com/mining

4.3 Performance

4.3.1 Sentiment Scoring

There is no available way to evaluate the correctness of the overall sentiment analysis. Therefore, the performance of individual scoring is tested against publicly available Mechanical Turk annotated Twitter data [18]. This data set includes 3771 annotated tweets. Each tweet was annotated by three humans; they annotated 21600 tweets, and all three agreed on only 3771 of them. As the test set has strict classes, test tweets are scored and then classified for testing purposes with a threshold of 0.5 (i.e. tweets scoring above +0.5 are regarded as positive, those scoring below -0.5 as negative, and the rest as neutral). This yields only a 61% match with the sentiment in the test data. However, most disagreements are seen in entries annotated as non-biased. For biased posts, the mismatch is around 26%.
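The thresholding used for this evaluation can be written down in a few lines of Python. The function names are hypothetical and the five scores and labels are made-up toy data, not entries from the annotated data set [18].

```python
def classify(score, threshold=0.5):
    """Map a polarity score to a strict class using a symmetric threshold."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def agreement(scores, labels, threshold=0.5):
    """Fraction of tweets whose thresholded score equals the human label."""
    hits = sum(classify(s, threshold) == l for s, l in zip(scores, labels))
    return hits / len(labels)

# Toy data: the fourth and fifth tweets disagree with their labels.
scores = [0.9, 0.2, -0.8, 0.6, -0.1]
labels = ["positive", "neutral", "negative", "neutral", "negative"]
rate = agreement(scores, labels)
```

Restricting `scores` and `labels` to the entries annotated as positive or negative would give the match rate for biased posts separately.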

4.3.2 Opinion based entity groups in recent stream

Collecting data from Twitter takes a long while due to the per-window query restriction. The third-party GATE TreeTagger I used performs slowly, and this hinders overall performance. However, the part implemented for this project runs fast, in linear fashion. Table 4 lists how the time varies with sample size.

Table 4. Performance of graph analysis for different data size

                        Sample 1   Large Sample   Very large Sample
Tweets                    160711         485447              847276
Time to analyze each      48.91s        148.53s             262.01s
Build Bigraph              9.29s         34.24s              66.45s
Generate EE graph          1.54s          3.49s               4.99s
Time to Find Groups       0.126s         0.310s              0.358s
Groups count                 157            334                 457
Largest Group size           136            183                 162
Significant Entities        1378           2627                3560
Legitimate Keywords        14997          25818               35005
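One way to check the linear-time claim against Table 4 is to compute the analysis throughput for each sample; if the time grows linearly with input size, tweets per second should stay roughly constant. The Python sketch below uses only the tweet counts and analysis times copied from Table 4; the function name is illustrative.

```python
# (tweets, seconds spent analyzing them), copied from Table 4
samples = {
    "Sample 1": (160711, 48.91),
    "Large Sample": (485447, 148.53),
    "Very large Sample": (847276, 262.01),
}

def throughput(samples):
    """Tweets analyzed per second for each sample."""
    return {name: tweets / seconds for name, (tweets, seconds) in samples.items()}

rates = throughput(samples)
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} tweets/s")
```

All three rates fall between roughly 3,200 and 3,300 tweets per second even though the input grows more than fivefold, which is consistent with linear scaling of the analysis stage.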

5. Conclusion

This project has devised and studied an approach to mining a social network to elicit public opinion about entities. Public opinion is represented as an analysis of individual entities and a graph analysis of entities based on polarity-aligned keyword relationships.

Sentiment analysis itself is still an open problem and needs further investigation. This project uses an approach to analyzing the sentiment of tweets that is built using Twitter as the learning corpus. This approach yields a polarity score rather than a discrete polarity marker. To elicit the overall opinion about an entity, an aggregate polarity score and representative keywords are detected.

For grouping entities, an entity graph is built from the entity-keyword bigraph, involving polarity scores. A local community detection mechanism is then used to cluster them.

The problem of detecting keywords is solved as an embedded approach. During the steps for building entity groups from collected tweets, keywords are filtered from a raw candidate set down to a final set. This approach can be useful for building keyword lexicons dynamically.

A report of sample runs of the implementation is also included in this document. Several key observations are noted in section 4.

6. References

[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment

Analysis and Opinion Mining," in Language Resources and

Evaluation, 2010.

[2] Twitter, "REST API v1.1 Resources," [Online]. Available:

https://dev.twitter.com/docs/api/1.1.

[3] "GATE Twitter part-of-speech tagger," [Online]. Available:

https://gate.ac.uk/wiki/twitter-postagger.html.

[4] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up?

Sentiment Classification using Machine Learning

Techniques," in Proceedings of the ACL-02 conference on

Empirical methods in natural language processing,

Philadelphia, PA, USA, 2002.

[5] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing

contextual polarity in phrase-level sentiment analysis," in

HLT '05 Proceedings of the conference on Human Language

Technology and Empirical Methods in Natural Language

Processing, Stroudsburg, PA, USA, 2005.

[6] A. Esuli and F. Sebastiani, "Sentiwordnet: A publicly

available lexical resource for opinion mining," in

Proceedings of LREC, 2006.

[7] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet

3.0: An Enhanced Lexical Resource for Sentiment Analysis

and Opinion Mining," in LREC, 2010.

[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede,

"Lexicon-based methods for sentiment analysis,"

Computational linguistics, vol. 37, pp. 267-307, 2011.

[9] V. Hatzivassiloglou and J. M. Wiebe, "Effects of adjective

orientation and gradability on sentence subjectivity," in

Proceedings of the 18th conference on Computational

linguistics-Volume 1, 2000.

[10] C. Whitelaw, N. Garg and S. Argamon, "Using appraisal

groups for sentiment analysis," in Proceedings of the 14th

ACM international conference on Information and

knowledge management, 2005.

[11] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato and

V. Subrahmanian, "Sentiment Analysis: Adjectives and

Adverbs are better than Adjectives Alone," in International

Conference on Weblogs and Social Media, Boulder, CO,

USA, 2007.

[12] V. S. Subrahmanian and D. Reforgiato, "AVA: Adjective-

verb-adverb combinations for sentiment analysis,"

Intelligent Systems, vol. 23, no. 4, pp. 43-50, 2008.

[13] T. Mullen and N. Collier, "Sentiment Analysis using Support

Vector Machines with Diverse Information Sources," in

EMNLP, 2004.

[14] A. Bifet and E. Frank, "Sentiment knowledge discovery in

twitter streaming data," in Discovery Science, Berlin

Heidelberg, Springer, 2010, pp. 1-15.

[15] C. Lin and Y. He, "Joint sentiment/topic model for sentiment

analysis," in Proceedings of the 18th ACM conference on

Information and knowledge management, 2009.

[16] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen and

X. He, "Interpreting the Public Sentiment Variations on

Twitter," IEEE Transactions on Knowledge and Data

Engineering, vol. 6, no. 1, pp. 1-14, 2012.

[17] X. Ding, B. Liu and P. S. Yu, "A holistic lexicon-based

approach to opinion mining," in WSDM '08 Proceedings of

the 2008 International Conference on Web Search and Data

Mining, New York, NY, USA, 2008.

[18] S. Narr, "Annotated Twitter Sentiment Dataset," [Online].

Available: http://data.dai-labor.de/corpus/sentiment/.

[Accessed 7 10 2013].

[19] "Sentiment140," [Online]. Available:

http://www.sentiment140.com.

[20] K. Zhang, H. Xu, J. Tang and J. Li, "Keyword Extraction

Using Support Vector Machine," in Advances in Web-Age

Information Management, Springer, 2006, pp. 85-96.

[21] A. Hulth, "Improved automatic keyword extraction given

more linguistic knowledge," in EMNLP '03 Proceedings of

the 2003 conference on Empirical methods in natural

language processing, Stroudsburg, PA, USA, 2003.

[22] O. Medelyan and I. H. Witten, "Thesaurus based automatic

keyphrase indexing," in Proceedings of the 6th ACM/IEEE-

CS joint conference on Digital libraries, 2006.

[23] G. Karypis and V. Kumar, "Multilevel k-way Partitioning

Scheme for Irregular Graphs," J. Parallel Distrib. Comput,

vol. 48, no. 1, pp. 96-129, 1998.

[24] M. Girvan and M. E. J. Newman, "Community structure in

social and biological networks," in Proc. Natl. Acad. Sci.

USA, 1999.

[25] M. E. J. Newman, "Fast algorithm for detecting community

structure in networks," in Phys. Rev. E 69, 066133, 2004.

[26] A. Clauset, M. E. J. Newman and C. Moore, "Finding

community structure in very large networks," in Phys. Rev.

E 70, 066111, 2004.

[27] M. E. J. Newman, "Modularity and community structure in

networks," in Proc. Natl. Acad. Sci. USA 103, 8577–8582,

2006.

[28] R. Andersen, F. Chung and K. Lang, "Local graph

partitioning using pagerank vectors," in Foundations of

Computer Science, FOCS'06. 47th Annual IEEE Symposium

on, 2006.

[29] H. Schmid, "TreeTagger," TC project at the Institute for

Computational Linguistics of the University of Stuttgart,

1994.

[30] B. Santorini, Part-of-speech tagging guidelines for the Penn

Treebank Project, 3rd revision ed., 1990.

[31] A. Go, R. Bhayani and L. Huang, "Twitter sentiment

classification using distant supervision," Stanford, 2009.

[32] L. Derczynski, A. Ritter, S. Clark and K. Bontcheva, "Twitter

Part-of-Speech Tagging for All: Overcoming Sparse and

Noisy Data," in Proceedings of the International Conference

on Recent Advances in Natural Language Processing, 2013.
