CSCE 561 Social Media Projects
![Page 1: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/1.jpg)
Laboratory for InterNet Computing
CSCE 561 Social Media Projects
Ryan Benton
October 8, 2012
![Page 2: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/2.jpg)
Social Media
140 million daily tweets
30 billion pieces of content shared each month
Sources: Facebook; Twitter; CTIA
153 billion US SMS messages in 2009
![Page 3: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/3.jpg)
Social Media Processing
[Figure: pipeline diagram with the stages Social Media Sensors (Twitter, Facebook, Search Queries), Low-Level Topic/Event Detection, High-Level Event Tracking/Correlation, Visualization, and Decision Making. A timeline t1, t2, t3, ..., tn, tn+1 carries the labels Baseline, Burst, Corpus1, and Corpus2.]
![Page 4: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/4.jpg)
• Tweets
  – User
    • Sender information
      – Name
      – Display name
      – Location
      – Follower and friend counts
    • If it is directed to other users
    • If retweet, who from
  – Tweet
    • The message
    • Hashtags
    • Date and Time
    • Media Information
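The fields listed above could be held in a small record type. The following is a minimal sketch; every field name is hypothetical and chosen for illustration, not taken from the Twitter API or from the course's collection code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Sender:
    # Hypothetical field names mirroring the "Sender information" bullets.
    name: str
    display_name: str
    location: Optional[str] = None
    follower_count: int = 0
    friend_count: int = 0


@dataclass
class Tweet:
    # Hypothetical field names mirroring the "User" and "Tweet" bullets.
    sender: Sender
    message: str
    created_at: datetime
    hashtags: List[str] = field(default_factory=list)
    directed_to: List[str] = field(default_factory=list)  # users the tweet is directed to
    retweet_of: Optional[str] = None                       # who it was retweeted from, if any
    media_urls: List[str] = field(default_factory=list)


example = Tweet(
    sender=Sender(name="jdoe", display_name="J. Doe", location="Lafayette, LA"),
    message="A tree just flew into my house during #hurricane Isaac",
    created_at=datetime(2012, 8, 29, 10, 0),
    hashtags=["hurricane"],
)
```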
![Page 5: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/5.jpg)
What are Hashtags?
• The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages.
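As a rough illustration, hashtags can be pulled out of a tweet's text with a regular expression. This sketch assumes a hashtag is `#` followed by word characters and lowercases the result; the function name `extract_hashtags` is made up for this example.

```python
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


tags = extract_hashtags("#UTAustin evacuated due to #Bombthreat")
```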
![Page 6: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/6.jpg)
Representation
• Social media can be converted into graphs
  – Homogeneous
    • One node type
    • One link type
  – Heterogeneous
    • One or more node types
    • One or more link types
    • Requirement
      – Either the links or the nodes (or both) must have more than one type.
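The distinction above can be checked mechanically once nodes and links carry a type. A minimal sketch, assuming hypothetical `Node` and `Link` records with a `kind` field:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # e.g. "user", "hashtag", "location" (illustrative labels)


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: str  # e.g. "follows", "uses" (illustrative labels)


def is_heterogeneous(nodes, links):
    """True when the nodes or the links (or both) have more than one type."""
    node_kinds = {node.kind for node in nodes}
    link_kinds = {link.kind for link in links}
    return len(node_kinds) > 1 or len(link_kinds) > 1


users = [Node("alice", "user"), Node("bob", "user")]
follows = [Link("alice", "bob", "follows")]
mixed_nodes = users + [Node("hurricane", "hashtag")]
mixed_links = follows + [Link("alice", "hurricane", "uses")]
```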
![Page 7: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/7.jpg)
Nodes
• Nodes represent an object
  – Examples
    • Users
    • Concepts
    • Hashtags
    • Locations
  – May have multiple attributes describing the object
![Page 8: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/8.jpg)
Links
• Relationships between nodes
• May have more than one attribute
![Page 9: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/9.jpg)
Visualize
![Page 10: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/10.jpg)
Visualize
![Page 11: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/11.jpg)
Problems
• Identifying relationships between hashtags in Twitter Data
• Identify (Generate) Important Keywords from Tweets
![Page 12: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/12.jpg)
Identifying relationships between hashtags in Twitter Data
![Page 13: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/13.jpg)
The idea
• Suppose we have a collection of normal associations of hashtags, i.e., hashtags that are usually used together.
• Will we be able to identify a developing situation by analyzing a “strange” association?
![Page 14: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/14.jpg)
Research Problem
• The main goal of the project is to find common associations of entities or groups of “real world” concepts, using a graph structure of hashtags.
  1. Cluster the hashtags to form groups of entities and find inter-cluster associations.
  2. Given a collection of hashtags with frequency and user information, can we identify a change in the underlying structure from time t1 to time t2?
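One possible starting point for the first task is a hashtag co-occurrence graph. The sketch below links two hashtags whenever they appear in the same tweet and, as a deliberately simple stand-in for clustering, treats each connected component as one group. The choice of connected components and the names `cooccurrence_edges` and `connected_groups` are assumptions for illustration, not the project's prescribed method.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(tweet_hashtags):
    """Count how often each unordered pair of hashtags shares a tweet."""
    edges = Counter()
    for tags in tweet_hashtags:
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


def connected_groups(edges, min_weight=1):
    """Group hashtags by connected component, ignoring edges below min_weight."""
    neighbors = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups


sample = [["isaac", "hurricane"], ["hurricane", "nola"], ["lsu", "football"]]
edges = cooccurrence_edges(sample)
groups = connected_groups(edges)
```

Building one such graph for the tweets at time t1 and another for time t2 gives two edge sets that can be compared for the second task.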
![Page 15: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/15.jpg)
Project 1: Cluster Hashtags into Entities
• Can we use an underlying graph structure to identify normal associations?
• If so, can it be used to identify an association that is not normal?
• e.g., #UTAustin evacuated due to #Bombthreat
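A minimal sketch of the idea, assuming "normal" means a hashtag pair has been seen together at least `min_baseline` times in a baseline collection. The threshold, the baseline counts, and the function name `unusual_pairs` are all illustrative assumptions.

```python
from itertools import combinations


def unusual_pairs(baseline_edges, tweet_tags, min_baseline=1):
    """Return hashtag pairs in one tweet that are rare or unseen in the baseline."""
    pairs = combinations(sorted(set(tweet_tags)), 2)
    return [pair for pair in pairs if baseline_edges.get(pair, 0) < min_baseline]


# Hypothetical baseline counts of hashtags that usually appear together.
baseline = {("football", "utaustin"): 25, ("longhorns", "utaustin"): 40}

flagged = unusual_pairs(baseline, ["utaustin", "bombthreat"])
normal = unusual_pairs(baseline, ["utaustin", "football"])
```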
![Page 16: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/16.jpg)
Project 2: Analyze the transition between events
• Suppose we have a collection of hashtags from an emergency event, e.g., a hurricane or forest fire.
• Suppose we also have a collection of hashtags from before the event happened.
• Can we identify the transition in hashtags, such as changes in frequency or associations?
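One simple way to look at the frequency side of this question is to compare each hashtag's share of all hashtag uses before and during the event. The sketch below does only that; the counts are invented for illustration and `frequency_shift` is a hypothetical helper name.

```python
def frequency_shift(before, during):
    """Rank hashtags by the change in their relative frequency between two periods."""
    total_before = sum(before.values()) or 1
    total_during = sum(during.values()) or 1
    shift = {}
    for tag in set(before) | set(during):
        share_before = before.get(tag, 0) / total_before
        share_during = during.get(tag, 0) / total_during
        shift[tag] = share_during - share_before
    # Largest absolute change first.
    return sorted(shift.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented counts, for illustration only.
before_counts = {"football": 8, "weather": 2}
during_counts = {"football": 2, "weather": 3, "hurricane": 5}
ranked = frequency_shift(before_counts, during_counts)
```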
![Page 17: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/17.jpg)
Identify (Generate) Important Keywords
![Page 18: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/18.jpg)
Why?
• Hashtags are not sufficient
• Example
  – A tree just flew into my house during #hurricane Isaac
![Page 19: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/19.jpg)
Employ Keyword Selection Methods to Find “Good” Keywords
• Multiple methods
  – You can choose/research one of your choice.
• Discuss two
  – “CMore Approach”
  – “Shixian Chu Approach”
![Page 20: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/20.jpg)
CMore
• “NSF CMORE” Filter Approach
  – Generated as part of NSF
• Concept Candidate List
  – First, a candidate list is generated that contains all phrases of one, two, three, and four words.
    • Phrases are not allowed to span from one sentence to another.
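The candidate list described above can be sketched as plain n-gram generation within each sentence. The sentence splitter and tokenizer here are simple regular expressions chosen for illustration; the actual CMore tokenization is not described in these slides.

```python
import re


def candidate_concepts(text, max_len=4):
    """List every phrase of 1..max_len consecutive words, never crossing a sentence."""
    candidates = []
    # Naive sentence split on ., ! and ? (an assumption for this sketch).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[\w#']+", sentence.lower())
        for length in range(1, max_len + 1):
            for start in range(len(words) - length + 1):
                candidates.append(" ".join(words[start:start + length]))
    return candidates


concepts = candidate_concepts("A tree fell. Isaac hit.")
```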
![Page 21: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/21.jpg)
CMore, cont.
• Filter Steps
  – The probabilistic filter uses various concept frequencies to determine whether or not a concept is of interest.
    • The filters that it uses are iterative in nature.
    • Concepts of length one are filtered first, then concepts of length two, and so on.
    • Several functions that measure the frequency of a concept relative to its prefix and suffix are defined.
    • Utilizes thresholds: filtering rules are formed by applying certain minimum thresholds to the values of these functions. Once concepts of all lengths are processed using these rules, the remaining concepts are the relevant ones according to the probabilistic filter.
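The slides do not give the actual frequency functions or thresholds, so the following is only an illustration of the general shape of such a filter. It assumes two made-up functions, count(concept) / count(prefix) and count(concept) / count(suffix), where the prefix drops the last word and the suffix drops the first word, and made-up thresholds `min_count` and `min_ratio`.

```python
def probabilistic_filter(counts, max_len=4, min_count=2, min_ratio=0.5):
    """Keep concepts whose frequency is high enough relative to prefix and suffix.

    The ratio functions and thresholds are assumptions for this sketch,
    not the functions defined by the CMore filter.
    """
    kept = set()
    # Shorter concepts are processed first, as on the slide.
    for length in range(1, max_len + 1):
        for concept, count in counts.items():
            words = concept.split()
            if len(words) != length:
                continue
            if length == 1:
                if count >= min_count:
                    kept.add(concept)
                continue
            prefix_count = counts.get(" ".join(words[:-1]), 0)
            suffix_count = counts.get(" ".join(words[1:]), 0)
            if prefix_count == 0 or suffix_count == 0:
                continue
            if count / prefix_count >= min_ratio and count / suffix_count >= min_ratio:
                kept.add(concept)
    return kept


# Invented counts, for illustration only.
counts = {"jaguar": 4, "car": 4, "sale": 2, "new": 1,
          "jaguar car": 4, "car sale": 2, "new jaguar": 1}
kept = probabilistic_filter(counts)
```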
![Page 22: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/22.jpg)
CMore, cont.
• Filter Steps
  – Stop words filter
    • If a phrase contains a word in the stop word list, then that concept is removed.
  – Entity type concepts filter
    • Concepts that do not parse to a noun phrase are discarded.
  – Commonality filter
    • Applied only to candidate concepts of length one and two words.
    • Compares the frequency with which a concept appears in a document to the frequency with which that concept appears in the Reuters [5] corpus.
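Two of the filters above are easy to sketch; the entity type filter needs a parser and is left out. The slides do not say how the commonality comparison is turned into a decision, so this sketch assumes a concept is kept when its relative frequency in the document is at least `min_ratio` times its relative frequency in the background corpus. That rule, the threshold, and the frequency values are all assumptions.

```python
def stop_word_filter(candidates, stop_words):
    """Drop any candidate phrase that contains a stop word."""
    return [c for c in candidates if not any(w in stop_words for w in c.split())]


def commonality_filter(doc_freq, background_freq, min_ratio=2.0):
    """Keep short concepts that are more frequent in the document than in the background.

    Only concepts of one or two words are tested; longer ones pass through.
    """
    kept = {}
    for concept, freq in doc_freq.items():
        if len(concept.split()) > 2:
            kept[concept] = freq
            continue
        background = background_freq.get(concept, 0.0)
        if background == 0.0 or freq / background >= min_ratio:
            kept[concept] = freq
    return kept


filtered = stop_word_filter(["the car", "jaguar car"], {"the", "a", "of"})
# Invented relative frequencies, for illustration only.
doc = {"jaguar car": 0.04, "last week": 0.05, "jaguar car sale": 0.01}
background = {"last week": 0.05, "jaguar car": 0.001}
common = commonality_filter(doc, background)
```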
![Page 23: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/23.jpg)
Shixian Chu’s Approach
![Page 24: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/24.jpg)
Parent-Network
[Figure: parent-network with a root node; edges are labeled L or R. Each node is a phrase annotated with its occurrence pairs:]
• One word
  – used (0,0)
  – new (3,0)
  – Jaguar (0,1) (1,0) (2,0) (3,1)
  – car (0,2) (1,1) (2,1) (3,2)
  – sale (0,3) (1,2)
  – model (2,2)
• Two words
  – Used Jaguar (0,0)
  – New Jaguar (3,0)
  – Jaguar car (0,1) (1,0) (2,0) (3,1)
  – Car sale (0,2) (1,1)
  – Car model (2,1)
• Three words
  – Used Jaguar car (0,0)
  – New Jaguar car (3,0)
  – Jaguar car sale (0,1) (1,1)
  – Jaguar car model (2,0)
• Four words
  – Used Jaguar car sale (0,0)
![Page 25: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/25.jpg)
Simplified Parent-Network
[Figure: simplified parent-network under the root node, with the nodes:]
• Jaguar car (0,1) (1,0) (2,0)
• Jaguar car sale (0,1) (1,1)
• Jaguar car model (2,0)
• Used Jaguar car sale (0,0)
![Page 26: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/26.jpg)
Parent-Network-based Key Phrase Extraction
• Step 1: Document pruning.
  – Sentence boundaries are marked and non-word tokens are stripped.
• Step 2: Document stemming.
• Step 3: Creating Parent-Network.
• Step 4: Computing logical frequency.
  – Logical frequency = physical frequency minus the logical frequencies of all its ancestors that have been accepted as key phrases.
  – If a phrase has no parents, its logical frequency = its physical frequency.
  – A phrase is a key phrase if its logical frequency >= the frequency threshold of its level.
  – The order of computation is from higher level to lower level (parent to child).
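Step 4 can be sketched as follows. This is a simplified illustration, not Shixian Chu's implementation: it skips pruning and stemming, treats every longer phrase that contains a phrase as one of its ancestors, uses one threshold for all levels, and the example sentences are assumed from the phrases shown in the Parent-Network figure.

```python
from collections import Counter


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run of words inside `longer`."""
    span = len(shorter)
    return any(longer[k:k + span] == shorter for k in range(len(longer) - span + 1))


def key_phrases(sentences, max_len=4, threshold=2):
    """Return (accepted key phrases, logical frequencies) using arithmetic logical frequency."""
    # Physical frequency: how often each phrase occurs in the text.
    physical = Counter()
    for sentence in sentences:
        words = tuple(sentence.lower().split())
        for length in range(1, max_len + 1):
            for start in range(len(words) - length + 1):
                physical[words[start:start + length]] += 1

    logical, accepted = {}, []
    # Longer phrases (parents) are processed before shorter ones (children).
    for phrase in sorted(physical, key=len, reverse=True):
        ancestors = [a for a in accepted if len(a) > len(phrase) and contains(a, phrase)]
        logical[phrase] = physical[phrase] - sum(logical[a] for a in ancestors)
        if logical[phrase] >= threshold:
            accepted.append(phrase)
    return accepted, logical


sentences = ["Used Jaguar car sale", "Jaguar car sale",
             "Jaguar car model", "New Jaguar car"]
accepted, logical = key_phrases(sentences)
```

With these inputs, "jaguar car sale" is accepted first, and its logical frequency is then subtracted from the shorter phrases it contains, which is what keeps single words such as "jaguar" from also being reported.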
![Page 27: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/27.jpg)
Phrase Extraction -- catch.
• Designed to work on documents and/or collections of documents
  – Tweets are very small
![Page 28: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/28.jpg)
Logical Frequency
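• Arithmetic Logical Frequency

$$lf(i) = pf(i) - \sum_{j=1}^{n} lf(j)$$

where $pf(i)$ is the physical frequency of phrase $i$ and $j$ ranges over its $n$ ancestors that have been accepted as key phrases.

• Entropy-based Logical Frequency

[Equation not recoverable from the extracted text. Its visible terms are $lf(i)$, $pf(i)$, $s(i)$, and a sum over $j = 1, \dots, n$ involving $lf(j)/s(i)$ and $\log_2$.]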
![Page 29: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/29.jpg)
Solution
• Create “tweet” collections
  – Randomly select X hashtags
  – For each hashtag, group tweets by time
    • Hour, day or week
  – Each hashtag/time group is now a collection
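The grouping above might look like the sketch below. It assumes each tweet is a `(timestamp, text, hashtags)` tuple and uses ISO weeks for the weekly grouping; the tuple layout, the seed parameter, and the function names are illustrative choices.

```python
import random
from collections import defaultdict
from datetime import datetime


def time_bucket(timestamp, period):
    """Label a timestamp by hour, day, or ISO week."""
    if period == "hour":
        return timestamp.strftime("%Y-%m-%d %H:00")
    if period == "day":
        return timestamp.strftime("%Y-%m-%d")
    if period == "week":
        year, week, _ = timestamp.isocalendar()
        return f"{year}-W{week:02d}"
    raise ValueError(f"unknown period: {period}")


def build_collections(tweets, num_hashtags, period="day", seed=0):
    """Randomly pick hashtags, then group tweet texts by (hashtag, time bucket)."""
    all_tags = sorted({tag for _, _, tags in tweets for tag in tags})
    rng = random.Random(seed)
    chosen = set(rng.sample(all_tags, min(num_hashtags, len(all_tags))))
    groups = defaultdict(list)
    for timestamp, text, tags in tweets:
        for tag in set(tags) & chosen:
            groups[(tag, time_bucket(timestamp, period))].append(text)
    return dict(groups)


tweets = [
    (datetime(2012, 8, 29, 10), "A tree just flew into my house during #hurricane Isaac", ["hurricane"]),
    (datetime(2012, 8, 29, 18), "Power is out #hurricane #nola", ["hurricane", "nola"]),
    (datetime(2012, 8, 30, 9), "Cleanup begins #nola", ["nola"]),
]
groups = build_collections(tweets, num_hashtags=2, period="day")
```

Each value in `groups` can then be handed to a keyword selection method as if it were one document collection.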
![Page 30: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/30.jpg)
Evaluation
• Test the impact of changing
  – Number of hashtags
  – Time period used to group
  – Threshold values
• What is the impact on the number of keywords?
• How much overlap?
• Do the results look reasonable?
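Overlap between two runs can be quantified in several ways; one common choice, assumed here, is the Jaccard index of the two keyword sets (shared keywords divided by all distinct keywords).

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard index of two keyword sets: |A & B| / |A | B|."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Invented keyword lists from two hypothetical runs with different thresholds.
run_day = ["jaguar car", "car sale", "hurricane isaac"]
run_week = ["jaguar car", "hurricane isaac", "power outage"]
overlap = keyword_overlap(run_day, run_week)
```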
![Page 31: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/31.jpg)
Resources
• Twitter Collection Code
  – Need to check availability
  – If not, fairly straightforward to implement.
• Database Schema
  – MySQL
![Page 32: CSCE 561 Social Media Projects](https://reader036.fdocuments.in/reader036/viewer/2022062519/56814ef8550346895dbc88cb/html5/thumbnails/32.jpg)
Thank you
Questions?