Twitter Sub-event Detection Project Presentation
Project: Sub-event detection on Social Media
Codebase: https://github.com/pallavshah/TwitterSubeventDetector
Pallav Shah, Akshay Joshi, Rajat Bhardwaj, Ravneet Singh Kathuria
The Project
• Make a timeline/summary of an event from a corpus of tweets commenting on it.
• The corpus consists of tweets from a specific domain talking about a single major event.
• The objective of the project is to extract sub-events within the event.
• The summary is a short description of each sub-event.
Our Approach
We followed a two-step approach:
• Sub-event Detection: The first step is to identify if and when a sub-event has occurred and, if it has, which tweets comprise the sub-event.
• Tweet Selection: The second step is to choose a representative tweet that describes the sub-event appropriately.
Combining these two steps yields a set of tweets that serves as a summary of the event.
Part 1: Detecting the sub-event
Sub-event detection is done by computing a distance measure between different tweets of the same event.
• Dictionary of words: The parsed data is used to create a dictionary which stores each relevant word and its count in the corpus.
• Vector for each tweet: The generated dictionary and a second pass over the parsed data are used to get a single sparse vector corresponding to each tweet. This vector contains the id and count of each word present in the tweet.
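The two passes described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the whitespace tokenizer, the `min_count` cutoff for "relevant" words, and all function names are assumptions made for the example. Gensim's `corpora.Dictionary` and its `doc2bow` method offer a comparable word-id mapping and (id, count) representation.

```python
from collections import Counter

def tokenize(tweet):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return tweet.lower().split()

def build_dictionary(tweets, min_count=1):
    """First pass: count every word and assign an integer id to each kept word."""
    counts = Counter(word for tweet in tweets for word in tokenize(tweet))
    # Assumption: a word is "relevant" if it occurs at least min_count times.
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    word_to_id = {w: i for i, w in enumerate(kept)}
    return word_to_id, counts

def tweet_to_sparse_vector(tweet, word_to_id):
    """Second pass: sorted (word_id, count) pairs for dictionary words in the tweet."""
    counts = Counter(w for w in tokenize(tweet) if w in word_to_id)
    return sorted((word_to_id[w], c) for w, c in counts.items())

tweets = ["Obama wins Ohio", "Romney concedes", "Obama wins wins"]
word_to_id, corpus_counts = build_dictionary(tweets)
vectors = [tweet_to_sparse_vector(t, word_to_id) for t in tweets]
```

Only words that occur in a tweet appear in its vector, which keeps the representation small even when the dictionary is large.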
Part 1: Detecting the sub-event (continued)
• The sub-event detector module:
The module uses the LSHash library for Python to find the similarity distance between tweets. Each tweet is analyzed and compared with the existing groups of similar tweets.
If the tweet matches any group closely enough to pass a high similarity threshold, the tweet is assumed to belong to that group and is added to it.
Otherwise, a new group is created with that tweet as the representative tweet of the group. In the end, all the tweets are thus partitioned into groups (or clusters) representing different sub-events.
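The grouping logic can be illustrated with a short single-pass sketch. Note the substitution: instead of LSHash, this example compares each tweet against every group representative using exact cosine similarity, which is simpler to follow but slower on large corpora. The threshold value, the dict-based vectors, and the function names are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {word_id: count}."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster_tweets(vectors, threshold=0.5):
    """Each tweet joins the most similar group above the threshold, or starts a new one."""
    groups = []  # each group: {"rep": index of representative tweet, "members": [...]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, vectors[group["rep"]])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            # No group is similar enough: this tweet becomes a new representative.
            groups.append({"rep": i, "members": [i]})
        else:
            best["members"].append(i)
    return groups

# Hypothetical toy vectors: tweets 0 and 1 share words, as do tweets 2 and 3.
vectors = [{0: 1, 1: 1}, {0: 1, 1: 1, 2: 1}, {3: 1, 4: 1}, {3: 2, 4: 1}]
groups = cluster_tweets(vectors, threshold=0.5)
```

Because the first tweet of each group stays its representative, the result depends on the order in which tweets arrive, which suits a chronologically ordered tweet stream.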
Part 2: Summarization of Sub-event
• Term Frequency Inverse Document Frequency: A statistical weighting technique that assigns each term within a document a weight that reflects the term’s saliency within the document. The TF-IDF value is composed of two primary parts.
The term frequency component (TF) assigns more weight to words that occur frequently within a document because important words are often repeated.
The inverse document frequency component (IDF) compensates for the fact that some words, such as common stop words, are frequent in almost every document; it gives less weight to words that appear in many documents.
• Normalization of tweets: The tweets are normalized to prevent bias towards longer tweets.
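To make the weighting and the normalization concrete, here is a hedged sketch of picking a representative tweet from one group. Several details are assumptions, since the slides do not specify them: each tweet is treated as a document for IDF, term frequency is counted over the whole group, IDF is smoothed as log(1 + N/df), and a tweet's score is the sum of its word weights divided by its word count.

```python
import math
from collections import Counter

def tfidf_weights(group):
    """TF-IDF weight per word for a group of tokenized tweets (assumed formulation)."""
    n_tweets = len(group)
    total_words = sum(len(tokens) for tokens in group)
    tf = Counter(w for tokens in group for w in tokens)        # counts over the group
    df = Counter(w for tokens in group for w in set(tokens))   # tweets containing word
    return {w: (tf[w] / total_words) * math.log(1 + n_tweets / df[w]) for w in tf}

def select_representative(group, normalize=True):
    """Index of the tweet with the highest (optionally length-normalized) score."""
    weights = tfidf_weights(group)

    def score(tokens):
        total = sum(weights[w] for w in tokens)
        # Dividing by length prevents long tweets from winning on word count alone.
        return total / max(len(tokens), 1) if normalize else total

    return max(range(len(group)), key=lambda i: score(group[i]))

# Hypothetical group of tokenized tweets about one sub-event.
group = [
    ["obama", "wins", "ohio"],
    ["obama", "wins", "ohio", "lol", "omg", "wow", "haha"],
    ["ohio", "polls", "closed"],
]
best = select_representative(group)
longest_bias = select_representative(group, normalize=False)
```

In this toy group the unnormalized score favors the long, noisy second tweet, while the normalized score selects the concise first tweet.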
System Block Diagram
Technologies Used
We have used the following Python libraries:
• LSHash: https://pypi.python.org/pypi/lshash/0.0.3dev
• Gensim: http://radimrehurek.com/gensim/
Dataset
We used the SNOW dataset containing tweets about the 2012 US General Elections.
Experiments and Results
• Tested on the 2012 US General Elections tweets data set from SNOW 2014.
• Results showed around 60% accuracy when compared against manual evaluation of the tweets data.