Internship

Integrated analysis of News, views & reviews Presented By : Naman Gupta IIT Bombay M.Tech - CSE Guided By : Dr. Lipika Dey Principal Scientist, TCS Innovation Labs - Delhi

Transcript of Internship

Page 1: Internship

Integrated analysis of News, views & reviews

Presented By : Naman Gupta, IIT Bombay, M.Tech - CSE

Guided By : Dr. Lipika Dey, Principal Scientist, TCS Innovation Labs - Delhi

Page 2: Internship

Problem Statement

• Integrating open-source data such as news articles with social-media content from Twitter and dedicated discussion forums such as customer complaint/review websites

• Retrieval of relevant information

• Linking related information

• Visualization

• Domain : Automobile (Car)

Page 3: Internship

Objective

• Supporting integrated analysis of structured and unstructured data.

• Twitter shows people's reactions to a news item.

• Review websites give early signals about problems faced by customers.

• The results are intended for predictive analysis in the future.

Page 4: Internship

Summary of the Work

• Joint Analysis of News & Tweets

• Linking & Retrieval

• Analysis of Customer Comments (Edmunds.com)

• Visualization

Page 5: Internship

Module 1

• Analyzing tweets with respect to a news article to capture user reactions to an event reported in the news

• Grouping of Tweets

• Ranking of Tweets

• Tag Cloud

• Tweet Distribution

• Tweet Space.

Page 6: Internship

Grouping of Duplicate Tweets

• Initial scheme: only retweets were grouped.

• The BLEU (Bilingual Evaluation Understudy) score was then used to group tweets that are syntactically the same.

• BLEU score: a measure of translation quality, based on n-gram overlap between a candidate text and a reference text.

• Algorithm (to follow)

Page 7: Internship

Algorithm

• Input: N tweets. Output: tweet groups.

• Clean each tweet by removing special characters, URLs, and #tags.

• For every tweet t_i:

  • If no group exists yet:

    • Make a new group with tweet t_i in it.

  • Else:

    • For each tweet t_j in the existing groups:

      • If t_i is a substring of t_j, or t_j is a substring of t_i:

        • Add t_i to the group of t_j.

      • Else:

        • Score = BLEUScore(t_i, t_j)

        • If Score >= 0.7:

          • Add t_i to the group of t_j.

    • If t_i was not added to any existing group:

      • Make a new group with tweet t_i in it.
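The grouping steps above can be sketched in Python as follows. This is a minimal illustration, not the code used in the project: `clean_tweet`, `simple_bleu`, and `group_tweets` are hypothetical names, the BLEU variant is a simplified one (clipped unigram and bigram precision with a brevity penalty), and comparing t_i against every member of each existing group is an assumption about how "t_j" is chosen.

```python
import math
import re
from collections import Counter


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, #tags, and special characters."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"#\w+", " ", text)            # hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # special characters
    return " ".join(text.lower().split())


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU (assumption): clipped n-gram precision for n <= max_n,
    combined by geometric mean and scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_sum += math.log(overlap / total) / max_n
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_sum)


def group_tweets(tweets, threshold=0.7):
    """Greedy single-pass grouping; each group is a list of cleaned tweets."""
    groups = []
    for raw in tweets:
        t_i = clean_tweet(raw)
        if not t_i:
            continue  # nothing left after cleaning
        for group in groups:
            if any(t_i in t_j or t_j in t_i or simple_bleu(t_i, t_j) >= threshold
                   for t_j in group):
                group.append(t_i)
                break
        else:
            # t_i matched no existing group, so it starts a new one
            groups.append([t_i])
    return groups
```

The 0.7 threshold is the one stated on the slide. Because the pass is greedy, the resulting groups can depend on the order in which tweets arrive.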

Page 8: Internship

Ranking of Tweets

• Initial scheme:

  • Tweet groups were ordered by the number of tweets in each group.

  • A higher number of retweets does not guarantee that a tweet is the most relevant one for a news item.

• Modified scheme:

  • The news text is used to rank tweets.

  • News text repeats keywords related to the main event (e.g. recall, faulty, steering) multiple times.

• Algorithm:

  • N = the most frequent words in the news text (after removing stop words).

  • For every tweet t_i:

    • Num_Key = number of words from N present in t_i

  • Rank tweets by Num_Key, highest first.
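A minimal Python sketch of this keyword-based ranking is shown below. The function names, the tiny stop-word list, and the default of 10 keywords are assumptions made for illustration; the slides do not say how many top words were used or which stop-word list was applied.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); a real list would be larger.
STOP_WORDS = {"a", "an", "the", "is", "was", "in", "on", "of", "to",
              "and", "over", "my", "i"}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(news_text, n=10):
    """Return the n most frequent non-stop-words of the news text."""
    counts = Counter(w for w in tokenize(news_text) if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(n)}


def rank_tweets(tweets, news_text, n=10):
    """Order tweets by Num_Key: the number of news keywords each contains."""
    keywords = top_keywords(news_text, n)

    def num_key(tweet):
        return len(keywords & set(tokenize(tweet)))

    return sorted(tweets, key=num_key, reverse=True)
```

Tweets with equal Num_Key keep their input order, since Python's sort is stable.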

Page 9: Internship

Visualization : Tweets & News

• Objective:

  • To show the main problem / event reported by the news.

• Number of tweets: more than 8 lakh (800,000).

• Method:

  • Tweets were cleaned by removing special characters, #tags, and URLs.

  • Tweets and news descriptions were fed into OPTRA.

  • The text was processed to extract noun phrases (NP).

  • For every news item, the most frequent NPs were displayed as a tag cloud.

  • The tag cloud was rendered with the D3 Tag Cloud API.
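The step from noun phrases to a tag cloud can be sketched as below. The sketch assumes the noun phrases for one news item have already been extracted (OPTRA itself is not reproduced here), and it emits text/size records of the kind a tag-cloud layout typically consumes. The function name, the size range, and the linear scaling are illustrative assumptions.

```python
from collections import Counter


def tag_cloud_entries(noun_phrases, top_k=20, min_size=12, max_size=48):
    """Map the top_k most frequent phrases to font sizes by linear scaling."""
    counts = Counter(p.strip().lower() for p in noun_phrases if p.strip())
    top = counts.most_common(top_k)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts match
    return [
        {"text": phrase,
         "size": round(min_size + (count - lowest) * (max_size - min_size) / span)}
        for phrase, count in top
    ]
```

The most frequent phrase gets the largest font, so the main problem reported in the news stands out visually.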

Page 10: Internship

Modules 3 & 4

• Extracting user reviews/complaints

• Extraction and Processing of Data.

• Crawler for Edmunds.com

• Text Processing done in OPTRA

• Content visualization using output of OPTRA

• Report generation for relevant content retrieved

Page 11: Internship

Extracting Data from Edmunds.com

• Reviews for 10 car models were extracted from Edmunds.

• The crawler was built with the Jsoup API.

• Information Extracted :

• Review date,

• Review,

• Suggested Improvement,

• Favorite features,

• Review Rating,

• Up Rating for a review and

• Down rating for a review.

Page 12: Internship

Content retrieval – linking problems across sources

• Objective:

  • Capture the common problems and features discussed for a chosen entity.

  • Retrieve customer reviews that deal with issues reported in a news article.

• Challenge: the language used in the two sources is not identical.

  • An approximate matching technique based on proximity was used.

• Method:

  • Noun Phrases (NP) and Enhanced Phrases (EP) from OPTRA are used.

  • Phrases are obtained together with their frequencies.

• Algorithm:

  • Fetch the Enhanced Phrases.

  • Clean each phrase (remove numbers and stop words).

  • If the phrase has length >= 2 after cleaning:

    • Preserve the phrase.

  • Fetch the NPs and clean them using the same method.

  • If an NP is also present as an EP:

    • Boost the frequency of the Enhanced Phrase.

  • Output the phrases with the highest frequency.
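The phrase-selection steps above can be sketched as follows. Several details are assumptions, because the slides do not specify them: the length test is taken to count words, the boost is taken to add the NP frequency to the EP frequency, and the stop-word list and function names are illustrative. The phrase/frequency dictionaries stand in for OPTRA output.

```python
from collections import Counter

# Small illustrative stop-word list (assumption).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "and", "to", "my"}


def clean_phrase(phrase):
    """Drop purely numeric tokens and stop words from a phrase."""
    words = [w for w in phrase.lower().split()
             if not w.isdigit() and w not in STOP_WORDS]
    return " ".join(words)


def top_phrases(enhanced, noun, top_k=5):
    """enhanced / noun: dicts mapping a raw phrase to its frequency."""
    scores = Counter()
    for phrase, freq in enhanced.items():
        cleaned = clean_phrase(phrase)
        if len(cleaned.split()) >= 2:   # assumption: length counted in words
            scores[cleaned] += freq
    for phrase, freq in noun.items():
        cleaned = clean_phrase(phrase)
        if cleaned in scores:           # the NP also occurs as an EP
            scores[cleaned] += freq     # assumption: boost by the NP frequency
    return scores.most_common(top_k)
```

Only Enhanced Phrases are ever output; a noun phrase that matches no Enhanced Phrase contributes nothing.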

Page 13: Internship

Screenshots

Page 14: Internship
Page 15: Internship
Page 16: Internship
Page 17: Internship

Evaluation – matching technique

Phrase                   # Sentences Retrieved   # Relevant   Accuracy (%)
Steer Wheel              37                      35           94.59
Power Steer              27                      25           92.59
Heat seat                25                      21           84
Manual Transmission      12                      8            66.66
Automatic Transmission   13                      12           92.30
Air Sensor               3                       1            33.33
Low Speed                16                      14           87.5

Page 18: Internship

Evaluation – contd.

Phrase         # Sentences Retrieved   # Relevant   Accuracy (%)
Steer bolt     1                       1            100
Lose bolt      0                       0            0
Back seat      10                      9            90
Power window   4                       4            100
Car problem    8                       4            50
Road noise     8                       7            87.5
Trunk Lid      6                       6            100
Engine Fire    5                       3            60

Page 19: Internship

Evaluation

Total Sentences Retrieved   Total Relevant Sentences   Accuracy (%)
175                         150                        85.71
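The accuracy values in the evaluation are the percentage of retrieved sentences that were judged relevant. The short sketch below recomputes the overall figure from the per-phrase counts reported on the evaluation slides. The variable and function names are illustrative, and returning 0 when nothing is retrieved is an assumption that mirrors the "Lose bolt" row.

```python
# Per-phrase counts from the evaluation slides: (retrieved, relevant).
RESULTS = {
    "Steer Wheel": (37, 35), "Power Steer": (27, 25), "Heat seat": (25, 21),
    "Manual Transmission": (12, 8), "Automatic Transmission": (13, 12),
    "Air Sensor": (3, 1), "Low Speed": (16, 14), "Steer bolt": (1, 1),
    "Lose bolt": (0, 0), "Back seat": (10, 9), "Power window": (4, 4),
    "Car problem": (8, 4), "Road noise": (8, 7), "Trunk Lid": (6, 6),
    "Engine Fire": (5, 3),
}


def accuracy(retrieved, relevant):
    """Percentage of retrieved sentences that are relevant (0 if none retrieved)."""
    return 100.0 * relevant / retrieved if retrieved else 0.0


total_retrieved = sum(retrieved for retrieved, _ in RESULTS.values())
total_relevant = sum(relevant for _, relevant in RESULTS.values())
overall = round(accuracy(total_retrieved, total_relevant), 2)
print(total_retrieved, total_relevant, overall)  # 175 150 85.71
```

The overall figure is computed over all retrieved sentences, so phrases with many retrieved sentences weigh more than phrases with few.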

Page 20: Internship

Future Work

• Adding more sources to work together within the same framework.

• Adding automated analysis for detecting early signals and predicting effects.

Page 21: Internship

Thanks