Transcript of Internship
Integrated analysis of News, views & reviews
Presented By :
Naman Gupta
IIT Bombay
M.Tech - CSE
Guided By :
Dr. Lipika Dey
Principal Scientist, TCS Innovation Labs - Delhi
Problem Statement
• Integrating open-source data such as news articles with social-media content from Twitter and dedicated discussion forums such as customer complaint/review websites
• Retrieval of relevant information
• Linking related information
• Visualization
• Domain : Automobile (Car)
Objective
• Support integrated analysis of structured and unstructured data.
• Twitter gives people's reactions to news items.
• Review websites give early signals about problems faced by customers.
• To be used in the future for predictive analysis.
Summary of the Work
• Joint Analysis of News & Tweets
• Linking & Retrieval
• Analysis of Customer Comments (Edmunds.com)
• Visualization
Module 1 :
• Analyzing Tweets with respect to News Article to capture user reaction to an event reported in the news
• Grouping of Tweets
• Ranking of Tweets
• Tag Cloud
• Tweet Distribution
• Tweet Space.
Grouping of Duplicate Tweets
• Initial Scheme : Retweets were grouped.
• Used the BLEU (Bilingual Evaluation Understudy) score to group tweets that are syntactically near-identical.
• BLEU Score : Originally a measure of machine-translation quality based on n-gram overlap with a reference; used here as a similarity measure between two tweets.
• Algorithm (To Follow)
Algorithm
• Input : N tweets, Output : Tweet groups.
• Clean each tweet by removing special characters, URLs and #tags.
• For every tweet t_i :
  • If no group is present :
    • Make a new group with tweet t_i in it.
  • Else, for each tweet t_j in every existing group :
    • If t_i is a substring of t_j or t_j is a substring of t_i :
      • Add t_i to the group of t_j.
    • Else :
      • Score = BLEUScore(t_i, t_j)
      • If Score >= 0.7 :
        • Add t_i to the group of t_j.
  • If t_i was not added to any group :
    • Make a new group with tweet t_i in it.
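The grouping steps can be sketched in Python as below. This is a minimal illustration, not the code used in the project: `clean_tweet`, `bleu`, `is_duplicate` and `group_tweets` are hypothetical names, the BLEU variant is a simplified sentence-level version (clipped unigram and bigram precision with a brevity penalty), and comparing a tweet against every member of a group rather than a single representative is an assumption. Only the 0.7 threshold and the substring test come from the slides.

```python
import math
import re
from collections import Counter


def clean_tweet(text):
    """Remove URLs, #tags and special characters, then lowercase."""
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"#\w+", " ", text)             # hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # special characters
    return " ".join(text.lower().split())


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified BLEU (an assumption): geometric mean of clipped n-gram
    precisions for n = 1..max_n, multiplied by the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_sum += math.log(overlap / total) / max_n
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_sum)


def is_duplicate(t_i, t_j, threshold):
    """Substring test first, BLEU score as the fallback."""
    return t_i in t_j or t_j in t_i or bleu(t_i, t_j) >= threshold


def group_tweets(tweets, threshold=0.7):
    """Return a list of groups, each a list of cleaned tweets."""
    groups = []
    for raw in tweets:
        t_i = clean_tweet(raw)
        if not t_i:
            continue  # nothing left after cleaning
        for group in groups:
            if any(is_duplicate(t_i, t_j, threshold) for t_j in group):
                group.append(t_i)
                break
        else:
            groups.append([t_i])  # no existing group matched
    return groups
```

For example, `group_tweets(["RT Toyota recalls 1 million cars over faulty steering http://t.co/abc", "Toyota recalls 1 million cars over faulty steering #recall"])` puts both tweets in one group, because the second cleaned tweet is a substring of the first.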
Ranking of Tweets
• Initial Scheme :
  • Tweets were ordered by the number of tweets in a group.
  • A higher number of retweets does not guarantee the most relevant tweet for a news item.
• Modified Scheme :
  • Used the news text to rank tweets.
  • News text repeats keywords related to the main event (recall, faulty, steering, etc.) multiple times.
• Algorithm :
  • N = the most frequent words in the news text (after removing stop words).
  • For every tweet t_i :
    • Num_Key = number of words from N present in t_i
  • Rank tweets by Num_Key, highest first.
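A rough Python sketch of this keyword-based ranking follows. The stop-word list, the cutoff `k` for "top frequent words", and the function names are all assumptions made for illustration; the slides do not specify them.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and",
              "is", "are", "was", "this", "that", "with", "from"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(news_text, k=10):
    """N: the k most frequent non-stop-words in the news text."""
    counts = Counter(w for w in tokenize(news_text) if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(k)}


def rank_tweets(news_text, tweets, k=10):
    """Order tweets by Num_Key, the number of keywords each one contains."""
    keywords = top_keywords(news_text, k)
    scored = [(len(keywords & set(tokenize(t))), t) for t in tweets]
    # sorted() is stable, so tweets with equal Num_Key keep their input order.
    return [t for _, t in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Counting distinct keywords (a set intersection) rather than total keyword occurrences is one reading of "number of words from N present"; counting occurrences would be an equally plausible variant.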
Visualization : Tweets & News
• Objective :
  • To show the main problem / event reported by the news.
• Number of Tweets : more than 8 lakh (800,000).
• Method :
  • Tweets were cleaned by removing special characters, #tags and URLs.
  • Tweets and news descriptions were fed into OPTRA.
  • Processed to extract Noun Phrases (NP).
  • For every news item, the most frequent NPs were displayed as a Tag Cloud.
  • Used the D3 Tag Cloud API.
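As an illustration of the last two steps, the sketch below turns a list of already extracted noun phrases into text/size records that a tag-cloud layout could consume. OPTRA's output format is not described in the slides, so the plain list of phrases, the function name `tag_cloud_entries`, the linear font scaling, and the `text`/`size` record shape are all assumptions.

```python
import json
from collections import Counter


def tag_cloud_entries(noun_phrases, top_k=20, min_size=12, max_size=48):
    """Map the top_k most frequent phrases to font sizes, scaled linearly."""
    counts = Counter(p.strip().lower() for p in noun_phrases if p.strip())
    top = counts.most_common(top_k)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts are equal
    return [
        {"text": phrase,
         "size": round(min_size + (freq - lowest) * (max_size - min_size) / span)}
        for phrase, freq in top
    ]


if __name__ == "__main__":
    phrases = ["steering wheel"] * 3 + ["recall"] * 2 + ["airbag"]
    # JSON that a front-end tag cloud could load
    print(json.dumps(tag_cloud_entries(phrases), indent=2))
```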
Modules 3 & 4
• Extracting user reviews/complaints
• Extraction and Processing of Data.
• Crawler for Edmunds.com
• Text Processing done in OPTRA
• Content visualization using output of OPTRA
• Report generation for relevant content retrieved
Extracting Data from Edmunds.com
• Reviews for 10 car models were extracted from Edmunds.
• Crawler built using the Jsoup API.
• Information Extracted :
• Review date,
• Review,
• Suggested Improvement,
• Favorite features,
• Review Rating,
• Up Rating for a review and
• Down rating for a review.
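One way to hold the extracted fields is a small record type, sketched here in Python. The field names and types are hypothetical; the slides list only the kinds of information captured, and the original crawler was written against Jsoup rather than in Python.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Review:
    """One customer review as extracted from a review page (illustrative schema)."""
    review_date: date
    review: str
    suggested_improvement: str
    favorite_features: str
    rating: float          # review rating given by the customer
    up_votes: int          # "up rating" for the review
    down_votes: int        # "down rating" for the review


# Example record with made-up values, purely to show the shape.
example = Review(date(2014, 1, 15), "Steering feels loose at low speed.",
                 "Tighter steering", "Heated seats", 3.5, 4, 1)
```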
Content retrieval – linking problems across sources
• Objective :
  • Capture common problems and features discussed for a chosen entity.
  • Retrieve customer reviews that deal with issues reported in a news article.
  • Challenge : the language used in the two sources is not identical.
  • An approximate matching technique based on proximity was used.
• Method :
  • Noun Phrases (NP) and Enhanced Phrases (EP) from OPTRA are used.
  • Phrases are obtained together with their frequencies.
• Algorithm :
  • Fetch the Enhanced Phrases.
  • Clean each phrase (remove numbers and stop words).
  • If the phrase still has length >= 2 after cleaning :
    • Preserve the phrase.
  • Fetch the NPs and clean them using the same method.
  • If an NP is also present as an EP :
    • Boost the frequency of the Enhanced Phrase.
  • Output the phrases with the highest frequency.
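The phrase-selection steps might look like the following in Python. Several details are assumptions because the slides leave them open: phrase length is measured in words, "boosting" is implemented as adding the NP frequency to the EP frequency, the stop-word list is a small placeholder, and the inputs are plain phrase-to-frequency dictionaries rather than actual OPTRA output.

```python
import re
from collections import Counter

# Placeholder stop-word list (an assumption).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "is", "my"}


def clean_phrase(phrase):
    """Drop numbers and stop words; return the remaining words."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


def top_phrases(enhanced, noun, k=10):
    """enhanced, noun: dicts mapping phrase -> frequency.
    Returns the k highest-scoring cleaned enhanced phrases."""
    scores = Counter()
    for phrase, freq in enhanced.items():
        words = clean_phrase(phrase)
        if len(words) >= 2:              # keep phrases of at least two words
            scores[" ".join(words)] += freq
    for phrase, freq in noun.items():
        key = " ".join(clean_phrase(phrase))
        if key in scores:                # the NP is also present as an EP
            scores[key] += freq          # boost (assumed: add the NP frequency)
    return scores.most_common(k)
```

With `enhanced = {"the steering wheel": 5, "2 airbags": 4, "power window": 3}` and `noun = {"Steering wheel": 2, "engine": 7}`, the call `top_phrases(enhanced, noun)` returns `[("steering wheel", 7), ("power window", 3)]`: "2 airbags" shrinks to a single word and is dropped, and "engine" never appears as an enhanced phrase.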
Screenshots
Evaluation – matching technique
Phrase                    # Sentences Retrieved    # Relevant    Accuracy (%)
Steer Wheel               37                       35            94.59
Power Steer               27                       25            92.59
Heat seat                 25                       21            84
Manual Transmission       12                       8             66.66
Automatic Transmission    13                       12            92.30
Air Sensor                3                        1             33.33
Low Speed                 16                       14            87.5
Evaluation – contd.
Phrase                    # Sentences Retrieved    # Relevant    Accuracy (%)
Steer bolt                1                        1             100
Lose bolt                 0                        0             0
Back seat                 10                       9             90
Power window              4                        4             100
Car problem               8                        4             50
Road noise                8                        7             87.5
Trunk Lid                 6                        6             100
Engine Fire               5                        3             60
Evaluation
Total Sentences Retrieved    Total Relevant Sentences    Accuracy (%)
175                          150                         85.71
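The reported figures are consistent with accuracy computed as relevant sentences divided by retrieved sentences (what is usually called precision). The short check below recomputes the totals from the per-phrase counts on the evaluation slides; the function name and the choice to report 0 when nothing was retrieved are assumptions.

```python
def accuracy(retrieved, relevant):
    """Percentage of retrieved sentences judged relevant (0 if none retrieved)."""
    return 100.0 * relevant / retrieved if retrieved else 0.0


# (retrieved, relevant) per phrase, copied from the evaluation slides
COUNTS = {
    "Steer Wheel": (37, 35), "Power Steer": (27, 25), "Heat seat": (25, 21),
    "Manual Transmission": (12, 8), "Automatic Transmission": (13, 12),
    "Air Sensor": (3, 1), "Low Speed": (16, 14), "Steer bolt": (1, 1),
    "Lose bolt": (0, 0), "Back seat": (10, 9), "Power window": (4, 4),
    "Car problem": (8, 4), "Road noise": (8, 7), "Trunk Lid": (6, 6),
    "Engine Fire": (5, 3),
}

total_retrieved = sum(r for r, _ in COUNTS.values())
total_relevant = sum(rel for _, rel in COUNTS.values())
overall = round(accuracy(total_retrieved, total_relevant), 2)
print(total_retrieved, total_relevant, overall)  # 175 150 85.71
```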
Future Work
• Adding more sources to work together within the same framework.
• Adding automated analysis for detecting early signals and predicting effects.
Thanks