Contextual Analysis of User Interests in Social Media Sites

– An Exploration with Micro-blogs

Nilanjan Banerjee, Dipanjan Chakraborty, Koustuv Dasgupta, Anupam Joshi, Sameer Madan, Sumit Mittal, Seema Nagar, Angshu Rai

[CIKM ’09]

Advisor: Dr. Koh Jia-Ling
Reporter: Che-Wei, Liang

Date: 2009/10/26

1

Outline

• Introduction
• Data Set
• Mining Real-Time User Interests
• Discovering Associations in User Interests
• Pattern Discovery in Interest Clusters
• Conclusion and Future Work

2

Introduction

3

Introduction

4

Introduction

• Challenges
  – Tweets tend to be stream-of-consciousness fragments
  – Tweets lack structure and description
• In this paper
  – Report the results of analysing tweets using unstructured text mining techniques

5

Data Set

• Collect data from Twitter
  – Select the most active users across 10 cities
  – Collect tweets over four weeks
    • from March 2009 to April 2009
• Tweet: <user name, tweet, time of publishing the tweet>
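
The tweet triple above maps naturally onto a small record type. The sketch below is only illustrative: the class name `Tweet`, its field names, and the sample values are assumptions and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tweet:
    """One collected record: <user name, tweet, time of publishing the tweet>."""
    user_name: str
    text: str
    published_at: datetime


# Hypothetical sample record, made up for illustration only.
sample = Tweet(
    user_name="example_user",
    text="I want to watch a movie tonight",
    published_at=datetime(2009, 3, 20, 18, 30),
)
```

Keeping the publishing time as a `datetime` makes later time-of-day grouping straightforward.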

6

Mining Real-Time User Interests

• Tweets usually have the following properties
  – ephemeral:
    • the interest in an activity changes over time
  – descriptive:
    • the interest can be described using one or more indicative keywords or terms
  – localized:
    • the interest (or activity) is usually associated with (contextual) location information

7

Mining Real-Time User Interests

• Identify tweets expressing interests by content-indicative and usage-indicative keywords
  – Content-indicative keywords (category words)
    • Express the broad class (category) of user interests, e.g. movie, sports, etc.
  – Usage-indicative keywords
    • Characterize the activity associated with a particular interest
    • Can be either temporal or action keywords
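
One possible reading of the two keyword types is a simple filter that flags a tweet when it contains at least one category word and at least one temporal or action word. The sketch below assumes exactly that rule; the keyword lists are tiny hypothetical samples and the function names are illustrative, not taken from the paper.

```python
import re

# Tiny hypothetical keyword lists; the real lists would be much larger.
CATEGORY_WORDS = {"movie", "music", "food", "sports", "dance"}
TEMPORAL_WORDS = {"tonight", "today", "tomorrow", "now"}
ACTION_WORDS = {"watch", "go", "eat", "play", "listen"}


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def expresses_interest(text):
    """Return True if the tweet has a category word and a usage-indicative word."""
    tokens = set(tokenize(text))
    has_category = bool(tokens & CATEGORY_WORDS)
    has_usage = bool(tokens & (TEMPORAL_WORDS | ACTION_WORDS))
    return has_category and has_usage
```

For example, `expresses_interest("I want to watch a movie tonight")` is true, while a tweet that mentions only "movie" with no temporal or action word is rejected. No stemming is applied here, so "going" does not match "go".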

8

Mining Real-Time User Interests

• First, explore which kinds of keywords Twitter users use most
• Exclude pronouns, prepositions, helping verbs, question words, and other non-indicative words
• Stem the words using the Porter stemming algorithm
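
The three steps above can be sketched as a small keyword counter. This is a minimal sketch under stated assumptions: the stop list is a short hypothetical sample, and stemming is left as a pluggable `stem` function that defaults to the identity, so the block itself does not perform Porter stemming.

```python
import re
from collections import Counter

# Small hypothetical stop list covering pronouns, prepositions, helping verbs
# and question words; a real list would be far longer.
STOP_WORDS = {
    "i", "you", "he", "she", "we", "they", "it",
    "to", "in", "on", "at", "of", "for", "with",
    "am", "is", "are", "was", "be", "will", "do",
    "what", "when", "where", "why", "how",
    "a", "an", "the",
}


def keyword_counts(tweets, stem=lambda word: word):
    """Count keywords across tweets after stop-word removal and stemming.

    `stem` defaults to the identity function. To follow the slides, pass a
    Porter stemmer instead, e.g. nltk.stem.PorterStemmer().stem if NLTK is
    installed (an assumption about the environment).
    """
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                counts[stem(token)] += 1
    return counts


tweets = [
    "I want to watch a movie tonight",
    "Watching a movie with friends",
    "What is for dinner tonight",
]
counts = keyword_counts(tweets)
```

Without a stemmer, "watch" and "watching" are counted separately, which is the kind of split that stemming is meant to remove.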

9

Mining Real-Time User Interests

• Content-indicative keywords
  – Form an initial list of category keywords
    • Consult WordNet and IMDb
  – Enrich the seed list of keywords by
    • Manually inspecting thousands of tweets and including “interest-indicative words”
  – Finally, identify five seed categories from the list of category keywords
    • movie, music, food, sports, dance

10

Mining Real-Time User Interests

11

Mining Real-Time User Interests

12

Mining Real-Time User Interests

• Use a term frequency-based measure
  – Estimate the occurrences of temporal and action words

13

Mining Real-Time User Interests

• Context-based discovery of keywords
  – Consider non-stemmed words to enrich the knowledge base of keywords
    • Stemmed data incurs a loss of tense information
  – Discover similar words by
    • Finding matches that are contextually similar to the seed dictionary words

14

Mining Real-Time User Interests

• POS-based discovery of action verbs
  – Use a POS analyser to extract action verbs
  – Identify the relevant action verbs that show a high correlation with identified category words
  – Add them to the existing set of usage-indicative keywords
• Notation for the correlation between a category word “cw” and an action word “aw”
  – D represents the total number of tweets
  – A = { tweets containing the keyword “cw” }
  – B = { tweets containing the keyword “aw” }

15
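
The sets D, A and B defined for the POS-based discovery step support several simple association scores. The transcript does not preserve the correlation formula used in the paper, so the scores below (support, Jaccard overlap, and confidence) are stand-ins chosen for illustration, and the function names are hypothetical.

```python
def tweets_containing(tweets, word):
    """Return the set of indices of tweets that contain `word` as a token."""
    return {i for i, text in enumerate(tweets) if word in text.lower().split()}


def cooccurrence_stats(tweets, cw, aw):
    """Compute simple association scores between a category and an action word."""
    d = len(tweets)                      # D: total number of tweets
    a = tweets_containing(tweets, cw)    # A: tweets containing cw
    b = tweets_containing(tweets, aw)    # B: tweets containing aw
    both = len(a & b)
    union = len(a | b)
    return {
        "support": both / d if d else 0.0,          # |A ∩ B| / D
        "jaccard": both / union if union else 0.0,  # |A ∩ B| / |A ∪ B|
        "confidence": both / len(a) if a else 0.0,  # |A ∩ B| / |A|
    }


tweets = [
    "watch a movie tonight",
    "watch the game",
    "movie night with friends",
    "going to watch that new movie",
]
stats = cooccurrence_stats(tweets, cw="movie", aw="watch")
```

An action verb would be kept as a usage-indicative keyword when its score against some category word exceeds a chosen threshold; the threshold itself is not given in the slides.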

Discovering Associations in User Interests

• Goal:
  – Explore different latent semantic associations between content-indicative category words and usage-indicative action/temporal words
• N-Gram Analysis
• Contextual Analysis using k-means clustering
• Temporal Analysis

16

N-Gram Analysis

• If a user has an interest or intention, he/she is expected to use indicative action and/or temporal words to express it
  – E.g. “I want to watch a movie tonight”
• Employ bigram-based analysis of category words
  – Co-occurring words can be at a variable distance (a tolerance limit of 5 words)
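
The variable-distance bigram idea can be sketched as a windowed co-occurrence count around a category word. This sketch assumes the 5-word tolerance applies on both sides of the category word and uses plain whitespace tokenization without stop-word removal; the function name is hypothetical.

```python
from collections import Counter


def windowed_cooccurrences(tweets, category_word, tolerance=5):
    """Count words appearing within `tolerance` tokens of `category_word`."""
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token != category_word:
                continue
            start = max(0, i - tolerance)
            # Up to `tolerance` tokens before and after the category word.
            window = tokens[start:i] + tokens[i + 1:i + 1 + tolerance]
            counts.update(w for w in window if w != category_word)
    return counts


tweets = [
    "i want to watch a movie tonight",
    "movie tonight with friends",
    "tonight we cook then much later after dinner maybe a movie",
]
pairs = windowed_cooccurrences(tweets, "movie")
```

In the third tweet "tonight" is ten tokens away from "movie", so it falls outside the window and is not counted as a pair.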

17

N-Gram Analysis

18

N-Gram Analysis

• People tend to tweet about activities that are planned for different times of the day
  – E.g. “party tonight”

19

N-Gram Analysis

20

Contextual Analysis using k-means clustering

• To discover any new groups of tweets and perform a contextual analysis
  – Clustering is a well-accepted technique for grouping similar documents
  – Use k-means clustering
  – Analyze clusters to discover latent associations of cluster tags with other words in the cluster
    • Tag each cluster with its highest-occurring words
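
The clustering and tagging steps above can be sketched with a small pure-Python k-means over bag-of-words vectors. Several choices here are assumptions made for illustration: raw word counts as features, squared Euclidean distance, and explicitly supplied initial centroids so that the run is deterministic. Practical implementations usually pick initial centroids randomly or with a seeding scheme.

```python
from collections import Counter


def vectorize(tweets):
    """Turn each tweet into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({w for t in tweets for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tweets:
        vec = [0.0] * len(vocab)
        for w in t.lower().split():
            vec[index[w]] += 1.0
        vectors.append(vec)
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, initial_centroids, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def tag_clusters(tweets, labels, top_n=2):
    """Tag each cluster with its `top_n` most frequent words."""
    tags = {}
    for c in set(labels):
        words = Counter(
            w for t, lab in zip(tweets, labels) if lab == c
            for w in t.lower().split()
        )
        tags[c] = [w for w, _ in words.most_common(top_n)]
    return tags


tweets = [
    "movie tonight", "watch movie tonight", "movie tonight friends",
    "pizza dinner", "eat pizza dinner", "pizza dinner home",
]
vectors = vectorize(tweets)
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[3]])
tags = tag_clusters(tweets, labels)
```

On this toy input the two clusters end up tagged with "movie"/"tonight" and "pizza"/"dinner"; the remaining words in each cluster are the candidates for latent associations with the tag.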

21

Contextual Analysis using k-means clustering – Cluster Analysis

22

Contextual Analysis using k-means clustering – Sub-Cluster Analysis

• Analyzed the content of clusters having content-indicative tags, temporal words, and action words
  – Ran k-means and gathered the predominant sub-clusters

23

Temporal Analysis

• Real-time interests have a significant temporal component; capturing it can give insight into how words associate with the temporal aspect of interests
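
One simple way to expose the temporal component is to group tweets by publishing time and count words per group. The sketch below assumes four coarse time-of-day buckets with hypothetical hour boundaries; the slides do not state which granularity the paper used.

```python
from collections import Counter, defaultdict
from datetime import datetime


def part_of_day(ts):
    """Map a timestamp to a coarse, hypothetical time-of-day bucket."""
    if 5 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 17:
        return "afternoon"
    if 17 <= ts.hour < 22:
        return "evening"
    return "night"


def words_by_part_of_day(tweets):
    """Count word occurrences per time-of-day bucket.

    `tweets` is an iterable of (text, published_at) pairs.
    """
    buckets = defaultdict(Counter)
    for text, published_at in tweets:
        buckets[part_of_day(published_at)].update(text.lower().split())
    return buckets


# Made-up sample tweets for illustration only.
tweets = [
    ("coffee and breakfast", datetime(2009, 3, 20, 8, 15)),
    ("party tonight", datetime(2009, 3, 20, 18, 40)),
    ("movie tonight", datetime(2009, 3, 21, 19, 5)),
]
buckets = words_by_part_of_day(tweets)
```

Comparing the per-bucket counts of a category word against its overall count shows when during the day that interest tends to be expressed.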

24

Pattern Discovery in Interest Clusters

• A microscopic analysis of selected content-indicative clusters
• Built a benchmark set
  – 5000 tweets, mixed from the party, food, sports, and movie clusters
  – Manually tagged those that indicate a real-time interest (i.e. positive tweets)

25

Patterns in Real-time Interest Tweets

• Patterns can be of several types:
  1. Word occurrence-based
    • e.g. “gym” occurs with “go” in positive tweets
  2. Grammar-based
    • e.g. “party” is preceded by a verb of the form “going for” in positive tweets
  3. Precedence-based
    • e.g. “tonight” succeeds “movie”
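
The three pattern types can be approximated with simple token checks. This is a sketch under stated assumptions: the grammar-based check is only a phrase match that allows a small gap before the target word, whereas a faithful version would rely on a POS tagger. All function names and the `max_gap` parameter are hypothetical.

```python
def tokens_of(text):
    """Lowercase and whitespace-tokenize a tweet."""
    return text.lower().split()


def co_occurs(text, word_a, word_b):
    """Word occurrence-based: both words appear somewhere in the tweet."""
    tokens = tokens_of(text)
    return word_a in tokens and word_b in tokens


def preceded_by_phrase(text, word, phrase, max_gap=1):
    """Grammar-like check: `phrase` ends at most `max_gap` tokens before `word`."""
    tokens, needle = tokens_of(text), tokens_of(phrase)
    n = len(needle)
    for i, token in enumerate(tokens):
        if token != word:
            continue
        for gap in range(max_gap + 1):
            end = i - gap
            if end - n >= 0 and tokens[end - n:end] == needle:
                return True
    return False


def succeeds(text, later, earlier):
    """Precedence-based: `later` occurs somewhere after `earlier`."""
    tokens = tokens_of(text)
    if earlier not in tokens:
        return False
    return later in tokens[tokens.index(earlier) + 1:]
```

With these helpers, "going for a party tonight" matches the grammar-like check for "party" and "going for", and "movie tonight" satisfies the precedence check while "tonight movie" does not.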

26

Patterns in Real-time Interest Tweets

27

• Sports Category: an intention to play a sport or go and watch a game
• Food Category: a real-time intention of having food or going to a restaurant

Patterns in Real-time Interest Tweets

28

• Party Category: depicts the user’s intention to get involved in a party
• Movie Category: expresses an intention to watch a movie in the near future

Differentiating Intentions from Tweets – Word Affinity Measure

• Affinity of a word “w” to a set of tweets “T”
  – Defined as the probability of “w” occurring in “T”
  – Used to compute the associations of frequently used words in tweets
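
The affinity definition above can be estimated from data in more than one way. The sketch below assumes the simplest estimator, the fraction of tweets in T that contain w; the function name and the sample tweets are hypothetical.

```python
def affinity(word, tweets):
    """Estimate P(word occurs in a tweet of T) as a document-frequency ratio."""
    if not tweets:
        return 0.0
    hits = sum(1 for text in tweets if word in text.lower().split())
    return hits / len(tweets)


# Made-up positive (real-time interest) and negative tweets for illustration.
positive = ["going to the gym now", "movie tonight", "gym tonight"]
negative = ["that movie was great", "i love this gym", "old movie review"]

tonight_pos = affinity("tonight", positive)
tonight_neg = affinity("tonight", negative)
```

A word whose affinity to the positive set is much higher than to the negative set, as "tonight" is here, is a candidate signal for separating intentions from other tweets.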

29

Real-time Interest Classification – Initial Evaluation

• An evaluation of how some traditional text classification algorithms perform in classifying tweets
• Several further mechanisms still need to be exploited
  – Word-usage-based heuristics, rule-based filtering

30

Conclusion And Future Work

• Investigated and evaluated microblogs by
  – Using contextual information of their users to capture real-time user interests
    • Revealed enough keywords that express interests
    • Used statistical techniques to discover associations
    • Clustering reveals words indicative of user interests
    • Discovered patterns from clusters
• There exists ample scope for research
  – Identifying user context
    • Emotions, presence, location

31