Text analytics of flipkart reviews of sony xperia z report
-
Upload
ytiam-unata -
Category
Data & Analytics
-
view
269 -
download
1
Transcript of Text analytics of flipkart reviews of sony xperia z report
TEXT ANALYTICS OF FLIPKART REVIEWS FOR
SONY XPERIA Z
Submitted by Atanu Maity
Contents:
Problem Definition
Methodology
Analysis
Recommendation
Problem Definition:
Detail analysis of Flipkart Reviews of Sony Xperia Z.
Methodology
Data Collection:
Flipkart Reviews of Various Users of the product.
Each review is independent of each other.
Sampling Design:
Total 110 reviews have been collected for the product from Flipkart to do the analysis.
Among 458 reviews, first 110 reviews have been taken for our purpose.
TextMining Methods used for Analysis:
Worcloud
Latent Semantic Analysis (LSA)
Support Vector Machine
Sentiment Analysis
Tool Used:
R Studio
Analysis
Review Extraction from Flipkart:
Packages Needed: RCurl, XML, rvest, tm, wordcloud
First we will build a anchorlist and doclist which will contain the page links from where we will
extract the reviews and contents of those pages respectively.
Output:
anchorlist
[[1]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=10" [[2]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=20" [[3]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=30" [[4]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=40" [[5]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=50" [[6]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=60" [[7]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=0" [[8]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=70" [[9]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=80" [[10]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=90" [[11]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=100"
[[12]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=110" [[13]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=120" [[14]] [1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all&type=all&sort=most_helpful&start=130"
Doclist:
The extracted doclist can be viewed from the following link
https://drive.google.com/file/d/0B_Y-QpyzAwhSUkZ1Wmg0bWNWRWM/view?usp=sharing
Now from this doclist we have extracted the ‘reviews’ on which the analysis will be done.
Review Extraction:
Code:
Output:
Total 110 reviews have been extracted from 11 docs. The review list can be found from the
following link
https://drive.google.com/file/d/0B_Y-QpyzAwhSU0Y3UHFHeFdmYVk/view?usp=sharing
Filtering of the review contents:
We have removed Numbers, Punctuations, Stopwords from the reviews and made all the
letters of each word to lowercase.
Code:
Making of WordCloud:
At the very first step we have made a TermDocument Matrix with words in rows and
documents in columns from the reviews. And using that we finally have made the WordCloud.
Code:
Output:
Inference from the WordCloud output:
1) Words like Phone, Sony, Xperia, Water, Good, Camera, Battery, Service, Display, Screen,
Quality etc are important, as they are looking big and bold compare to other words.
2) Quality of the phone might have been very satisfactory and also overall performance
might be very good as words like screen, battery, camera, display had been repeatedly
used by the consumers
3) There might have been some issue with Water for this Sony model.
4) Also there might have been some comparison of Sony with Samsung and HTC handsets.
Rating Extraction and Analysis of Ratings
We have extracted rating for Sony Experia Z for each of the first 110 reviews and found that
each reviewer has given rating and then done the analysis accordingly.
Rating Extraction:
Code:
Output:
missingRating>
So from this output we can see that there is no missing rating for each individual page i.e. each
reviewer has given the rating.
Now we want those ratings. Those 110 ratings are like following
Ratings>
[1] "5 stars" "1 star" "5 stars" "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" [10] "5 stars" "3 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "1 star" [19] "4 stars" "5 stars" "4 stars" "1 star" "1 star" "5 stars" "1 star" "2 stars" "5 stars" [28] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" [37] "5 stars" "5 stars" "1 star" "1 star" "3 stars" "5 stars" "5 stars" "5 stars" "1 star" [46] "2 stars" "1 star" "5 stars" "4 stars" "4 stars" "2 stars" "4 stars" "5 stars" "3 stars" [55] "1 star" "5 stars" "4 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "5 stars" [64] "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" "5 stars" "5 stars" "5 stars" [73] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "4 stars" "5 stars" "5 stars" "1 star"
[82] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" [91] "1 star" "5 stars" "5 stars" "5 stars" "4 stars" "4 stars" "1 star" "1 star" "1 star" [100] "1 star" "4 stars" "5 stars" "5 stars" "1 star" "1 star" "5 stars" "1 star" "1 star" [109] "5 stars" "5 stars"
The extracted ratings are in character format, but for our calculation purpose we want them as
numerical. So we have done like this
Code:
Output:
finalRating
[1] 5 1 5 5 4 5 1 1 5 5 3 5 5 5 5 5 1 1 4 5 4 1 1 5 1 2 5 5 5 5 5 5 5 5 5 5 5 5 1 1 3 5 5 5 1 2 1 [48] 5 4 4 2 4 5 3 1 5 4 5 5 5 5 1 5 5 4 5 1 1 5 5 5 5 5 5 5 5 5 4 5 5 1 5 5 5 5 5 5 5 5 5 1 5 5 5 [95] 4 4 1 1 1 1 4 5 5 1 1 5 1 1 5 5
Now we have defined the satisfaction level of the users for the product in this way, if the user
has given rating above 3, then he/she is satisfied with it or if rating is below 3 or equal to 3 then
he/she is not.
Code:
Output:
[1] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [7] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" "satisfied" [13] "satisfied" "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" [19] "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" "satisfied" [25] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [31] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [37] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" [43] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" [49] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" [55] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [61] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [67] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" [73] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [79] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" [85] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [91] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" [97] "dissatisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" "satisfied" [103] "satisfied" "dissatisfied" "dissatisfied" "satisfied" "dissatisfied" "dissatisfied" [109] "satisfied" "satisfied"
Now we have made a data frame combining the document-term matrix and the satisfaction
level, so for each document (i.e for each review) we have a satisfaction level in this data frame
and then we have done the classification modeling by applying SVM (Support Vector Machine)
technique so that by looking at some words we can measure either the user is satisfied or not
with the product.
Code:
Now we want to know the variables (i.e the words) which are most and least important in
predicting the satisfaction level.
Code:
We got the importance matrix of 2923x2 order whose one column is words and other column is
importance of a particular word in a particular row, the words are arranged in increasing order
with respect to their importance.
The 5 least important words:
Code:
Output:
Warranty, said, centre, cost, customers, working are the least 5 important words and their
importance in prediction of satisfaction level are like above table.
The 5 most important words:
Code:
Output:
Battery, quality, great, best, camera, awesome are the 5 most important words and their
importance in prediction of satisfaction level are like above table.
Sentiment Analysis
Now we want to know what is the overall sentiment of the reviewers/ users has been worked
for this particular product. Is it positive or negative? What is the average score of the
sentiment.
For this purpose we have measured either a review is polar or neutral by computing polarity
score of each review and then have calculated the average polarity score for all the 110
reviews. If it is positive then there is a positive sentiment for the product or if it is negative then
there is a negative sentiment for the product among the users.
Code:
Output:
f
Here we can clearly see that in this 11 documents, there are 12830 words and 1807 sentences.
The positive words like happy, glad, well, like, willing are there in the documents. The average
polarity calculated for all the documents is 0.111 and it is also positive. So from here we can
infer that overall positive sentiment has been worked in the user of the product.
Polarity plot:
Output:
From this polarity plot we can clearly see that the average polarity, which is represented here
as cross sign, is above 0 and also most of the dense part is lying above 0. So from the plot also
we can infer that overall it’s a positive sentiment on an average.
Latent Semantic Analysis
We have done Latent Semantic Analysis to know various semantic relationships among various
pieces of texts or words. We have tried to find different topic spaces which are defined by the
words/ terms and their contextual synonyms used in the reviews for Sony Xperia Z using LSA
technique.
First we have created a Document-term Matrix from the reviews we have.
Code:
Output(Sample):
View(m1)
So here we can see first 3 columns are like -, “red” and ‘. These symbols would have no
significance in our analysis. So we have removed these first three columns.
Code:
Output(sample):
View(m)
Similarly we have calculated weighted TF-IDF score for each word in each document and
represented them in matrix form called m_tfidf which is of 110x2923 order.
Code:
Output(Sample):
Now we have done SVD( Singular Value Decomposition) of that ‘m’ and ‘m_tfidf’ matrices
through LSA technique and here we chose dimension share = 0.6
Code:
Note: Here ‘t()’ operator is transpose of a matrix. So actually we are doing analysis on Term-
Document Matrix.
SVD has divided both the matrices into 3 component matrices. One is Term-Dimension matrix,
one is Diagonal matrix containing the Eigen Values and last one is Document-Dimension Matrix.
For lsa_m, those are lsa_m$tk (2923x34), lsa_m$sk (34x1) and lsa_m$dk (110x34) respectively
For lsa_mtfidf, those are lsa_mtfidf$tk (2923x45), lsa_mtfidf$sk (45x1) and lsa_mtfidf$dk
(110x45) respectively
Now checking for lsa_m$sk, we have
[1] 89.49235 43.53660 38.06285 33.57073 31.87552 28.48354 25.34172 23.19148 21.81091 21.29905 20.44049 [12] 20.27423 19.45968 19.06065 18.46173 17.47600 17.17047 16.70915 15.97285 15.91400 15.25898 14.62148 [23] 14.45512 14.29820 14.23971 13.38578 13.16881 13.02947 12.81555 12.56187 12.38814 12.23705 11.99354 [34] 11.91981
So first 4 Eigen values are explaining the maximum variances, and afterwards no such
significant changes.
Thus for our purpose we have selected first 4 significant dimensions excluding the others and
will go forward with these 4 dimensions.
Now we will do clustering the similar terms into similar clusters. For that we have two step
process:
1) First a k-mean clustering
2) Second a Hierarchical clustering to find optimal number of clusters
Step1:
Code:
Output(Sample):
Step2:
Here we have used ‘ward.D’ method for Hierarchical Clustering and then scaling the plot by
different colors for different heights we have tried to find optimal number of clusters.
Code:
Output:
From the Dendogram and color scaling we can say that optimal number of clusters will be
either 3 or 4 or 7.
Now we will check cluster sizes for each to decide which one will be the optimum and for this
purpose we will again use k-means clustering technique.
Code:
Output:
Looking at these figures, we have decided k=4 will give the optimal solution for our purpose i.e.,
we have 4 clusters as optimal solution.
We will look for the clusters membership now. For this we have done like following:
Code:
Output:
K4_1 i.e., cluster 1 memberships are like that
Looking at this we can say, that cluster 1 is reflecting for the good things in the sony xperia
phone like display, camera, battery etc or this topic space is saying for good things in the phone.
K4_2 i.e., cluster 2 memberships are like that
Looking at this, we can infer that this topic space is discussing about the bad things in the
phone. Words like ‘bad’, ‘issues’, ‘complaining’, ‘useless’, ‘ill’ are provoking negative impact in
this topic space and with this link, may be processor, charging system of this phone are not so
good or not upto the expectation level of the consumers.
K4_3 i.e., cluster 3 membership is
Only one word is here and also of frequency 1. We cant say anything from this. This could be a
noise.
K4_4 i.e., cluster 4 memberships are like that
Looking at the top frequency words in this cluster we can say this cluster/ topic space is also
discussing about some service issues of the phone. There might be a lot of problem with the
service of this phone.
Next we have created ‘lsa_tk4’ matrix whose columns are words and those 4 dimensions which
we have chosen from ‘lsa_m$sk’. ‘lsa_tk4’ is a 2923x5 order matrix.
Code:
Recommendations
1) The company should concentrate on their customer services to improve the customer
relationship and to hold their loyal customers.
2) Sony is known for sound and image quality and for a long decade, they are satisfying
their customers with these qualities, so in future they should not compromise with the
quality of these features to compete other brands in the market in its category.
THANK YOU