Text analytics of flipkart reviews of sony xperia z report

33
TEXT ANALYTICS OF FLIPKART REVIEWS FOR SONY XPERIA Z Submitted by Atanu Maity Contents: Problem Definition Methodology Analysis Recommendation Problem Definition: Detail analysis of Flipkart Reviews of Sony Xperia Z. Methodology Data Collection: Flipkart Reviews of Various Users of the product. Each review is independent of each other.

Transcript of Text analytics of flipkart reviews of sony xperia z report

TEXT ANALYTICS OF FLIPKART REVIEWS FOR

SONY XPERIA Z

Submitted by Atanu Maity

Contents:

Problem Definition

Methodology

Analysis

Recommendation

Problem Definition:

Detail analysis of Flipkart Reviews of Sony Xperia Z.

Methodology

Data Collection:

Flipkart Reviews of Various Users of the product.

Each review is independent of each other.

Sampling Design:

Total 110 reviews have been collected for the product from Flipkart to do the analysis.

Among 458 reviews, first 110 reviews have been taken for our purpose.

TextMining Methods used for Analysis:

Worcloud

Latent Semantic Analysis (LSA)

Support Vector Machine

Sentiment Analysis

Tool Used:

R Studio

Analysis

Review Extraction from Flipkart:

Packages Needed: RCurl, XML, rvest, tm, wordcloud

First we will build a anchorlist and doclist which will contain the page links from where we will

extract the reviews and contents of those pages respectively.

Code:

Output:

anchorlist

[[1]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=10" [[2]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=20" [[3]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=30" [[4]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=40" [[5]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=50" [[6]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=60" [[7]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=0" [[8]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=70" [[9]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=80" [[10]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=90" [[11]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=100"

[[12]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=110" [[13]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=120" [[14]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=130"

Doclist:

The extracted doclist can be viewed from the following link

https://drive.google.com/file/d/0B_Y-QpyzAwhSUkZ1Wmg0bWNWRWM/view?usp=sharing

Now from this doclist we have extracted the ‘reviews’ on which the analysis will be done.

Review Extraction:

Code:

Output:

Total 110 reviews have been extracted from 11 docs. The review list can be found from the

following link

https://drive.google.com/file/d/0B_Y-QpyzAwhSU0Y3UHFHeFdmYVk/view?usp=sharing

Filtering of the review contents:

We have removed Numbers, Punctuations, Stopwords from the reviews and made all the

letters of each word to lowercase.

Code:

Making of WordCloud:

At the very first step we have made a TermDocument Matrix with words in rows and

documents in columns from the reviews. And using that we finally have made the WordCloud.

Code:

Output:

Inference from the WordCloud output:

1) Words like Phone, Sony, Xperia, Water, Good, Camera, Battery, Service, Display, Screen,

Quality etc are important, as they are looking big and bold compare to other words.

2) Quality of the phone might have been very satisfactory and also overall performance

might be very good as words like screen, battery, camera, display had been repeatedly

used by the consumers

3) There might have been some issue with Water for this Sony model.

4) Also there might have been some comparison of Sony with Samsung and HTC handsets.

Rating Extraction and Analysis of Ratings

We have extracted rating for Sony Experia Z for each of the first 110 reviews and found that

each reviewer has given rating and then done the analysis accordingly.

Rating Extraction:

Code:

Output:

missingRating>

So from this output we can see that there is no missing rating for each individual page i.e. each

reviewer has given the rating.

Now we want those ratings. Those 110 ratings are like following

Ratings>

[1] "5 stars" "1 star" "5 stars" "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" [10] "5 stars" "3 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "1 star" [19] "4 stars" "5 stars" "4 stars" "1 star" "1 star" "5 stars" "1 star" "2 stars" "5 stars" [28] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" [37] "5 stars" "5 stars" "1 star" "1 star" "3 stars" "5 stars" "5 stars" "5 stars" "1 star" [46] "2 stars" "1 star" "5 stars" "4 stars" "4 stars" "2 stars" "4 stars" "5 stars" "3 stars" [55] "1 star" "5 stars" "4 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "5 stars" [64] "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" "5 stars" "5 stars" "5 stars" [73] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "4 stars" "5 stars" "5 stars" "1 star"

[82] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" [91] "1 star" "5 stars" "5 stars" "5 stars" "4 stars" "4 stars" "1 star" "1 star" "1 star" [100] "1 star" "4 stars" "5 stars" "5 stars" "1 star" "1 star" "5 stars" "1 star" "1 star" [109] "5 stars" "5 stars"

The extracted ratings are in character format, but for our calculation purpose we want them as

numerical. So we have done like this

Code:

Output:

finalRating

[1] 5 1 5 5 4 5 1 1 5 5 3 5 5 5 5 5 1 1 4 5 4 1 1 5 1 2 5 5 5 5 5 5 5 5 5 5 5 5 1 1 3 5 5 5 1 2 1 [48] 5 4 4 2 4 5 3 1 5 4 5 5 5 5 1 5 5 4 5 1 1 5 5 5 5 5 5 5 5 5 4 5 5 1 5 5 5 5 5 5 5 5 5 1 5 5 5 [95] 4 4 1 1 1 1 4 5 5 1 1 5 1 1 5 5

Now we have defined the satisfaction level of the users for the product in this way, if the user

has given rating above 3, then he/she is satisfied with it or if rating is below 3 or equal to 3 then

he/she is not.

Code:

Output:

[1] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [7] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" "satisfied" [13] "satisfied" "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" [19] "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" "satisfied" [25] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [31] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [37] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" [43] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" [49] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" [55] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [61] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [67] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [73] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [79] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" [85] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [91] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [97] "dissatisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" "satisfied" [103] "satisfied" "dissatisfied" "dissatisfied" "satisfied" "dissatisfied" "dissatisfied" [109] "satisfied" "satisfied"

Now we have made a data frame combining the document-term matrix and the satisfaction

level, so for each document (i.e for each review) we have a satisfaction level in this data frame

and then we have done the classification modeling by applying SVM (Support Vector Machine)

technique so that by looking at some words we can measure either the user is satisfied or not

with the product.

Code:

Now we want to know the variables (i.e the words) which are most and least important in

predicting the satisfaction level.

Code:

We got the importance matrix of 2923x2 order whose one column is words and other column is

importance of a particular word in a particular row, the words are arranged in increasing order

with respect to their importance.

The 5 least important words:

Code:

Output:

Warranty, said, centre, cost, customers, working are the least 5 important words and their

importance in prediction of satisfaction level are like above table.

The 5 most important words:

Code:

Output:

Battery, quality, great, best, camera, awesome are the 5 most important words and their

importance in prediction of satisfaction level are like above table.

Sentiment Analysis

Now we want to know what is the overall sentiment of the reviewers/ users has been worked

for this particular product. Is it positive or negative? What is the average score of the

sentiment.

For this purpose we have measured either a review is polar or neutral by computing polarity

score of each review and then have calculated the average polarity score for all the 110

reviews. If it is positive then there is a positive sentiment for the product or if it is negative then

there is a negative sentiment for the product among the users.

Code:

Output:

f

Here we can clearly see that in this 11 documents, there are 12830 words and 1807 sentences.

The positive words like happy, glad, well, like, willing are there in the documents. The average

polarity calculated for all the documents is 0.111 and it is also positive. So from here we can

infer that overall positive sentiment has been worked in the user of the product.

Polarity plot:

Output:

From this polarity plot we can clearly see that the average polarity, which is represented here

as cross sign, is above 0 and also most of the dense part is lying above 0. So from the plot also

we can infer that overall it’s a positive sentiment on an average.

Latent Semantic Analysis

We have done Latent Semantic Analysis to know various semantic relationships among various

pieces of texts or words. We have tried to find different topic spaces which are defined by the

words/ terms and their contextual synonyms used in the reviews for Sony Xperia Z using LSA

technique.

First we have created a Document-term Matrix from the reviews we have.

Code:

Output(Sample):

View(m1)

So here we can see first 3 columns are like -, “red” and ‘. These symbols would have no

significance in our analysis. So we have removed these first three columns.

Code:

Output(sample):

View(m)

Similarly we have calculated weighted TF-IDF score for each word in each document and

represented them in matrix form called m_tfidf which is of 110x2923 order.

Code:

Output(Sample):

Now we have done SVD( Singular Value Decomposition) of that ‘m’ and ‘m_tfidf’ matrices

through LSA technique and here we chose dimension share = 0.6

Code:

Note: Here ‘t()’ operator is transpose of a matrix. So actually we are doing analysis on Term-

Document Matrix.

SVD has divided both the matrices into 3 component matrices. One is Term-Dimension matrix,

one is Diagonal matrix containing the Eigen Values and last one is Document-Dimension Matrix.

For lsa_m, those are lsa_m$tk (2923x34), lsa_m$sk (34x1) and lsa_m$dk (110x34) respectively

For lsa_mtfidf, those are lsa_mtfidf$tk (2923x45), lsa_mtfidf$sk (45x1) and lsa_mtfidf$dk

(110x45) respectively

Now checking for lsa_m$sk, we have

[1] 89.49235 43.53660 38.06285 33.57073 31.87552 28.48354 25.34172 23.19148 21.81091 21.29905 20.44049 [12] 20.27423 19.45968 19.06065 18.46173 17.47600 17.17047 16.70915 15.97285 15.91400 15.25898 14.62148 [23] 14.45512 14.29820 14.23971 13.38578 13.16881 13.02947 12.81555 12.56187 12.38814 12.23705 11.99354 [34] 11.91981

So first 4 Eigen values are explaining the maximum variances, and afterwards no such

significant changes.

Thus for our purpose we have selected first 4 significant dimensions excluding the others and

will go forward with these 4 dimensions.

Now we will do clustering the similar terms into similar clusters. For that we have two step

process:

1) First a k-mean clustering

2) Second a Hierarchical clustering to find optimal number of clusters

Step1:

Code:

Output(Sample):

Step2:

Here we have used ‘ward.D’ method for Hierarchical Clustering and then scaling the plot by

different colors for different heights we have tried to find optimal number of clusters.

Code:

Output:

From the Dendogram and color scaling we can say that optimal number of clusters will be

either 3 or 4 or 7.

Now we will check cluster sizes for each to decide which one will be the optimum and for this

purpose we will again use k-means clustering technique.

Code:

Output:

Looking at these figures, we have decided k=4 will give the optimal solution for our purpose i.e.,

we have 4 clusters as optimal solution.

We will look for the clusters membership now. For this we have done like following:

Code:

Output:

K4_1 i.e., cluster 1 memberships are like that

Looking at this we can say, that cluster 1 is reflecting for the good things in the sony xperia

phone like display, camera, battery etc or this topic space is saying for good things in the phone.

K4_2 i.e., cluster 2 memberships are like that

Looking at this, we can infer that this topic space is discussing about the bad things in the

phone. Words like ‘bad’, ‘issues’, ‘complaining’, ‘useless’, ‘ill’ are provoking negative impact in

this topic space and with this link, may be processor, charging system of this phone are not so

good or not upto the expectation level of the consumers.

K4_3 i.e., cluster 3 membership is

Only one word is here and also of frequency 1. We cant say anything from this. This could be a

noise.

K4_4 i.e., cluster 4 memberships are like that

Looking at the top frequency words in this cluster we can say this cluster/ topic space is also

discussing about some service issues of the phone. There might be a lot of problem with the

service of this phone.

Next we have created ‘lsa_tk4’ matrix whose columns are words and those 4 dimensions which

we have chosen from ‘lsa_m$sk’. ‘lsa_tk4’ is a 2923x5 order matrix.

Code:

Output(Sample):

The topic spaces plots with respect to different two dimensions are like following:

Code:

Plot:

Code:

Plot:

Code:

Plot:

Code:

Plot:

Code:

Plot:

Code:

Plot:

Recommendations

1) The company should concentrate on their customer services to improve the customer

relationship and to hold their loyal customers.

2) Sony is known for sound and image quality and for a long decade, they are satisfying

their customers with these qualities, so in future they should not compromise with the

quality of these features to compete other brands in the market in its category.

THANK YOU