Sentimental Analysis of Flipkart reviews using Naïve Bayes and Decision Tree algorithm

Gurneet Kaur, Abhinash Singla

International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) Volume 5 Issue 1, January 2016

ISSN: 2278 – 1323 All Rights Reserved © 2016 IJARCET

Manuscript received Jan, 2016.

Gurneet Kaur, Research Scholar, Department of Computer Science and Engineering, Bhai Gurdas Institute of Engg. & Tech, Sangrur, India, 9779322988.

Abhinash Singla, Assistant Professor, Bhai Gurdas Institute of Engg. & Tech, Sangrur, India.

Abstract— E-commerce has developed rapidly in recent years. Buying products online has become increasingly popular owing to its variety of options, low cost (high discounts) and quick supply systems, so many people intend to shop online. Meanwhile, the quality and delivery of merchandise are uneven, and fake branded products are sometimes delivered. Users' comments therefore become vital data for evaluating a product's quality. However, for many products, the number of reviews is too large to be processed manually, and their quality varies widely. Star ratings are given to the whole product, and shoppers and product makers have no means of analyzing the feedback on individual features. We use users' review comments about products and about retailers from Flipkart as the data set and classify review text by subjectivity/objectivity and by the negative/positive attitude of the buyer. Such reviews are helpful to both shoppers and product makers. This paper presents an empirical study of the efficacy of classifying product reviews by semantic meaning. We analyze the fundamentals of opinion mining and the pros and cons of past opinion mining systems, and provide some direction for future work. The authors propose an approach that includes spelling correction of the review text, followed by classification of comments with a hybrid algorithm combining Decision Trees and Naive Bayes.

Index Terms— Decision tree, Facts, Naïve Bayes, opinion mining, sentiment analysis, and user reviews.

I. INTRODUCTION

Smart phones, laptops and the internet have made online shopping very easy. According to a report by IAMAI (Internet and Mobile Association of India), India's internet user base reached 354 million, registering 17% growth in the first six months of 2015. The base had grown to 302 million by the end of 2014 after clocking its fastest rise of 32% in a year, as per IAMAI, which includes members such as Google, Microsoft, Facebook, eBay, IBM, Flipkart, Ola and LinkedIn [11]. While it took more than a decade for the user base to increase from 10 million to 100 million, and three years to cross the 200 million mark, it took only a year for the user base to swell from 200 million to 300 million [11].

As e-commerce develops rapidly, online shopping has become more and more popular because of its variety of products, cheap prices (high discounts) and fast logistics. More people intend to shop online these days. Meanwhile, the quality and delivery of products are uneven, so users' comments become important information for judging a product's quality and delivery time. At the same time, product manufacturers can learn users' current main viewpoints in order to improve their products [1].

Given the massive amount of data on these websites, analyzing and summarizing the information manually is impossible. How to extract useful information and automatically build an objective product quality assessment system from massive textual information is therefore an emerging problem in the related research field. Opinion mining is a new technology based on text mining and natural language processing, and it provides an approach to this problem. Generating summaries of products has consequently attracted many researchers in recent years [1].

Document-level sentiment analysis focuses on the emotional orientation of each review as a whole; it recognizes the opinion that the author expresses in the content. Sentence-level opinion mining treats the statements about the product's features as the objects of analysis, from which the author's opinion inclinations can be found. Sentence-level sentiment analysis is therefore the main task in opinion mining. This approach can find the specific details of the comments and has a high degree of confidence, but the operation is very complex. For example, for a type of laptop we can divide the features into performance, price, appearance, endurance time, brand and so on. We consider each feature or attribute that each author expresses in each comment separately, and then perform a comprehensive evaluation in order to avoid overgeneralization [9].

Feature-specific opinion mining attracts much attention. An object is an entity. It can be a product, person, event, organization, topic or something else. It is associated with a hierarchy or taxonomy of components or a set of attributes. Each component can in turn have its own set of subcomponents or attributes. A feature is defined to cover both components and attributes, and it is the subject of a review.

People generally obey grammatical rules when writing articles. In informal circumstances, however, they often neglect the rules, and there are many spelling mistakes. This phenomenon is especially prominent in comments made after online shopping [1]. Such sentences have some features that differ from formal ones.

1) Products have a set of definite attributes and related opinion phrases. Thus we can use a small fixed set of keywords to recognize frequent feature and opinion words.

2) The opinionated sentences contain opinion operators which can be used to find the positions of opinion expressions.

3) Many comment sentences are written in a free style, and sometimes there are no opinion words in the comment sentences.

If a feature appears in the form of a noun or a noun phrase, it is defined as an explicit feature, and a sentence that contains an explicit feature is recognized as an explicit sentence. According to the form of expression, we can divide customer reviews into explicit sentence reviews and implicit sentence reviews, the latter being those without an explicit opinion feature. Sentences with no explicit opinion feature are very common in comments. In our database, which we extracted from the e-commerce website, sentences without an explicit opinion target make up approximately 30%. For example:

It is very cheap.

We can deduce from the word “cheap” that the user is probably referring to the product's price. The word “price” is not directly mentioned; it is implied by the word “cheap”, which we call a feature indicator.
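The idea of a feature indicator can be illustrated with a minimal Python sketch. The table `FEATURE_INDICATORS` and the function `implied_features` are hypothetical names introduced here for illustration; only the pair “cheap” and “price” comes from the text, and the other entries are assumptions.

```python
import re

# Hypothetical indicator table: opinion word -> implied product feature.
# Only "cheap" -> "price" comes from the text; the other entries are assumptions.
FEATURE_INDICATORS = {
    "cheap": "price",
    "expensive": "price",
    "heavy": "weight",
    "slow": "performance",
}

def implied_features(sentence):
    """Return the features implied by indicator words in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sorted({FEATURE_INDICATORS[w] for w in words if w in FEATURE_INDICATORS})

print(implied_features("It is very cheap."))  # ['price']
```

A lookup table of this kind only covers indicators that were listed in advance, so it is best read as a baseline for implicit feature detection rather than a complete method.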

II. WEB MINING

Data mining extracts previously unknown and potentially useful information and knowledge from large amounts of incomplete, vague and random data from practical applications. Web mining is the application of data mining technology to extract interesting and potentially useful patterns and hidden information from web documents and web activities [1]. Web mining is broadly categorized into Web content mining (WCM), Web structure mining (WSM), and Web usage mining (WUM) [1]. Web content mining is concerned with uncovering useful information from web contents, including text, images, audio, video, etc. Research in web content mining encompasses resource discovery from the web, document categorization and clustering, and information extraction from web pages. Web structure mining studies the web's hyperlink structure. It usually involves analysis of the in-links and out-links of a web page, and it has been used for search engine result ranking [1]. Web usage mining focuses on analyzing search logs or other activity logs to find interesting patterns.

A. Web mining process

The process of Web mining is divided into four stages: source data collection, data preprocessing, pattern discovery and pattern analysis. The process is illustrated in Fig 1. In mining of Web data, Web log files on the Web server are the main source of data [2]. Web log files contain the history of visitors' browsing behavior and include the server log, agent log and client log. The collected data are often redundant, ambiguous and incomplete. To mine knowledge more effectively, preprocessing of the collected data is essential, since it provides accurate, concise data for data mining. Data preprocessing includes data cleaning, user identification, user session identification, access path supplement and transaction identification.

[Figure: block diagram with the labels Web log file, Web Log Filebase, Data Preparation, File after Pre-Processor, Data Warehouse, Data Mining, Pattern Type, Pattern Analysis, Knowledge]

Fig 1: Process of web data mining
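Three of the preprocessing steps named above (data cleaning, user identification and user session identification) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in [2]: log records are assumed to be `(ip, timestamp_in_seconds, url)` tuples, one user is assumed per IP address, and the 30-minute session timeout and the list of static file suffixes are assumed values.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; assumed threshold, not from the paper
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif")  # assumed noise

def preprocess_log(log_records):
    """Clean records, group them by user (IP), and split them into sessions."""
    # Data cleaning: drop requests for static resources.
    cleaned = [r for r in log_records if not r[2].lower().endswith(STATIC_SUFFIXES)]
    # User identification: here simply one user per IP address.
    by_user = defaultdict(list)
    for ip, ts, url in sorted(cleaned, key=lambda r: (r[0], r[1])):
        by_user[ip].append((ts, url))
    # Session identification: a gap longer than the timeout starts a new session.
    sessions = {}
    for ip, visits in by_user.items():
        user_sessions = [[visits[0][1]]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > SESSION_TIMEOUT:
                user_sessions.append([])
            user_sessions[-1].append(url)
        sessions[ip] = user_sessions
    return sessions

records = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 5, "/style.css"),
    ("10.0.0.1", 60, "/product/1"),
    ("10.0.0.1", 4000, "/home"),
    ("10.0.0.2", 10, "/home"),
]
print(preprocess_log(records))
```

Access path supplement and transaction identification are left out of the sketch because they depend on the site's link structure, which the text does not describe.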

III. SENTIMENT ANALYSIS

Sentiment analysis of natural language texts is a large and growing field. Sentiment analysis, or opinion mining, is the computational treatment of opinions and subjectivity in text. It is an information extraction task that aims to identify the writer's feelings, expressed as positive or negative comments, by analyzing his or her documents. Term presence is more important to sentiment analysis than term frequency, which was traditionally used in information retrieval. It has also been reported that unigrams outperform bigrams for classifying movie reviews by sentiment polarity. Hatzivassiloglou and McKeown hypothesize that adjectives joined by “and” have the same polarity, while those joined by “but” have opposite polarity. Sentiment classification is a recent subdiscipline of text classification concerned with the opinion expressed in reviews. Opinion mining aims to determine whether a term that carries opinionated content has a positive or a negative connotation. Sentiment classification can be divided into several specific subtasks: determining subjectivity, determining orientation and determining the strength of orientation. SENTIWORDNET [4] is a lexical resource in which each WordNet synset s is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive and negative the terms contained in the synset are.

Sentiment classification can be regarded as a binary classification task. Structured reviews are used for training and testing, and appropriate features and scoring methods from information retrieval are identified for analyzing negative and positive annotations. The classifier is then used to identify and classify review sentences from the web. Various supervised, data-driven techniques such as Naïve Bayes, Maximum Entropy and SVM are used for sentiment analysis. Maximum Entropy and Support Vector Machines have been applied to sentiment analysis with different feature sets, such as unigrams only, bigrams only, a combination of both, features incorporating parts of speech and position information, and adjectives only.
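The difference between term presence and term frequency can be made concrete with a short Python sketch. The function name `unigram_features` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

def unigram_features(text, presence=True):
    """Map each unigram to 1 (presence) or to its count (frequency)."""
    counts = Counter(text.lower().split())
    if presence:
        return {word: 1 for word in counts}
    return dict(counts)

review = "good phone good camera bad battery"
print(unigram_features(review, presence=True))
print(unigram_features(review, presence=False))
```

With presence features the repeated word "good" contributes the same as any other word, whereas with frequency features it counts twice.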

IV. BACKGROUND STUDY

Hui Song et al., in their paper “Semantic Analysis and Implicit Target Extraction of Comments from E-commerce Websites”, propose a new approach to extract explicit and implicit opinion targets and discuss the effectiveness of the approach. On e-commerce websites, customers usually post comments after buying products. In sentiment analysis, a finer-grained opinion mining approach focuses not only on the product as a whole but also on the product range and product line. Traditional approaches focus on explicit or detailed features rather than implicit ones [1].

Yadav, M. P., et al., in their paper titled “Mining the customer behavior using web usage mining in e-commerce”, explain customer behavior for e-commerce companies using K-means. With the drastic growth of the WWW, users can easily find, extract, filter and evaluate whatever they want. With advances in technology, servers are now able to collect and store large amounts of data, which can help companies learn about customers' perceptions. The paper therefore seeks to determine the relationship between web mining data and e-commerce. Consumers mostly prefer to choose among millions of products in an online store to satisfy their demands rather than to choose from a superstore. This shows that consumers have taken an interest in e-commerce sites for engaging in international trade [2].

Prashast Kumar Singh et al., in their research paper [3] titled “An approach towards feature specific opinion mining and sentimental analysis across e-commerce websites”, focus on collecting information about what users think about a product and performing analysis on that basis. Geographical data can be collected on the same basis, and reviews can be fetched from various sources. The approach handles internet slang and phrases, which has helped to gather millions of reviews from social networking sites. Finally, the end user (business/manufacturer) is provided with summarized data about the expressed sentiments in the form of intuitive and easy-to-understand graphs, charts and other visualizations.

Ahmad Tasnim Siddiqui et al., in their paper titled “Web Mining Techniques in E-Commerce Applications”, explain that the web is today the best medium of communication in modern business. Online purchasing has increased compared with window shopping because it offers millions of product ranges. Companies are able to attract many customers because e-commerce is not just buying and selling over the internet; it also serves as a way to gain an advantage over the giants of the market. For this purpose, data mining, sometimes called knowledge discovery, is used. Because vast information is available on the internet, it helps to improve e-commerce applications. They then explain their proposed architecture, which mainly contains four components: business data, data obtained from consumers' interaction, a data warehouse and data analysis. After the data analysis module finishes its task, it produces a report that can be used by consumers as well as by the owners of e-commerce applications [4].

Songbo Tan et al., in their paper titled “Adapting Naive Bayes to Domain Adaptation for Sentiment Analysis”, address what the sentiment analysis community calls the domain-transfer problem. They attempt to attack this problem by making maximum use of both the old-domain data and the unlabeled new-domain data. To gain knowledge from the new-domain data, they propose Adapted Naïve Bayes (ANB), a weighted transfer version of the Naive Bayes classifier. The experimental results indicate that the proposed approach can improve the performance of the base classifier dramatically, and even provides much better performance than the transfer-learning baseline, i.e., the Naïve Bayes Transfer Classifier (NTBC). They propose a novel approach for domain adaptation in the context of sentiment analysis. First, in order to make maximum use of the old-domain data, they propose an effective method, Frequently Co-occurring Entropy (FCE). They also conducted extensive experiments on six domain adaptation tasks. They believe that their work provides an effective machine learning and data mining algorithm, especially when a ranking is more desirable. A potential problem with IGCNB is that IGCNB has relatively [12].

V. NAIVE BAYES CLASSIFICATION

Naive Bayes is an approach to text classification that assigns the class $c^* = \arg\max_c P(c \mid d)$ to a given document $d$. A naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is high [9]. Its underlying probability model can be described as an "independent feature model". The Naive Bayes (NB) classifier uses Bayes' rule, Eq. (1),

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)} \tag{1}$$

where $P(d)$ plays no role in selecting $c^*$. To estimate the term $P(d \mid c)$, Naive Bayes decomposes it by assuming the $f_i$'s are conditionally independent given $d$'s class, as in Eq. (2),

$$P_{NB}(c \mid d) = \frac{P(c)\,\prod_{i=1}^{m} P(f_i \mid c)}{P(d)} \tag{2}$$

where $m$ is the number of features and $f_i$ is the $i$-th feature. Training consists of relative-frequency estimation of $P(c)$ and $P(f_i \mid c)$. Despite its simplicity and the fact that its conditional independence assumption clearly does not hold in real-world situations, Naive Bayes-based text categorization still tends to perform surprisingly well; indeed, Naive Bayes is optimal for certain problem classes with highly dependent features.
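The classifier described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `train_nb` and `classify_nb` and the toy training data are hypothetical, add-one smoothing is an assumption added to avoid zero probabilities, and scores are computed in log space for numerical stability.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Estimate P(c) and P(f_i | c) by relative frequency with add-one smoothing."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_counts.values())
    priors = {c: n / total_docs for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + len(vocab)
        likelihoods[c] = {w: (word_counts[c][w] + 1) / denom for w in vocab}
    return priors, likelihoods

def classify_nb(words, priors, likelihoods):
    """Return c* = argmax_c [log P(c) + sum_i log P(f_i | c)]; P(d) is ignored."""
    def score(c):
        # Words never seen in training are skipped (an assumption of this sketch).
        return math.log(priors[c]) + sum(
            math.log(likelihoods[c][w]) for w in words if w in likelihoods[c]
        )
    return max(priors, key=score)

# Toy training data, invented for illustration.
train = [
    (["good", "camera"], "pos"),
    (["great", "battery"], "pos"),
    (["bad", "battery"], "neg"),
    (["poor", "camera", "bad"], "neg"),
]
priors, likelihoods = train_nb(train)
print(classify_nb(["good", "battery"], priors, likelihoods))  # pos
print(classify_nb(["bad", "camera"], priors, likelihoods))    # neg
```

Dropping $P(d)$ is safe because it is the same for every class and therefore does not change which class attains the maximum.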

VI. PARAMETERS FOR EVALUATION

In the context of classification, True Positives (TP), True Negatives (TN), False Negatives (FN) and False Positives (FP) are used to compare the class labels assigned to documents by a classifier with the classes the items actually belong to. A true positive is an item correctly classified as positive, and a true negative is an item correctly classified as negative. A false positive is a negative item wrongly classified as positive, and a false negative is a positive item wrongly classified as negative. Other evaluation measures such as precision, recall, F-measure, specificity and accuracy can easily be calculated from these four quantities.

Table 1. Contingency table

                                 Correct labels
                                 Positive          Negative
Classified labels   Positive     True Positive     False Positive
                    Negative     False Negative    True Negative

A. Accuracy

A common measure of classification performance is accuracy, or its complement, the error rate. Accuracy is the proportion of correctly classified examples out of the total number of examples, while the error rate is the proportion of incorrectly classified examples. However, one should be careful about relying on accuracy alone when the data are skewed. When one class occurs significantly more often than the other, a classifier may achieve higher accuracy by simply labelling all examples as the dominant class than it would by trying to assign some examples to the other class.

B. Precision and recall

Precision and recall are two widely used metrics for evaluating performance in text mining and in other text analysis fields such as information retrieval. They can be seen as extended versions of accuracy, and by using a combination of these measures the problem that skewed data poses for classifiers is mitigated. Precision is a measure of exactness, whereas recall is a measure of completeness. Precision is the number of examples correctly labeled as positive divided by the total number of examples classified as positive, while recall is the number of examples correctly labeled as positive divided by the total number of examples that truly are positive. This is shown in the following formulas.

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

C. F measure

F-measure is the harmonic mean of precision and recall. It gives a score that balances precision and recall and combines them into a single number for easier use. This matters because it might be better to optimize the system to favor either precision or recall if one of them has a more positive influence on the final result than the other.

The F1 measure is used as the evaluation metric for aspect identification and aspect sentiment classification. It is a combination of precision and recall, as

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

To evaluate the performance, the decision tree algorithm is used.
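The measures in this section can be computed directly from the four contingency counts. The following Python sketch uses a hypothetical helper `evaluation_metrics` and invented counts to show why accuracy alone is misleading on skewed data.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the contingency counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no example is labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Skewed data: 95 negative and 5 positive examples, everything labeled negative.
print(evaluation_metrics(tp=0, tn=95, fp=0, fn=5))
# A classifier that finds 4 of the 5 positives at the cost of 5 false alarms.
print(evaluation_metrics(tp=4, tn=90, fp=5, fn=1))
```

In this invented example the first classifier reaches 95% accuracy while its recall and F1 are zero; the second has slightly lower accuracy (94%) but a recall of 0.8.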

VII. EVALUATION SETUP

A. Text Preprocessing

Text preprocessing techniques are divided into two subcategories.

1. Tokenization: Textual review data comprise blocks of characters called tokens. The review comments are separated into tokens, which are used for further processing.

2. Removal of Stop Words: A stop-list is the name commonly given to a set or list of stop words. It is typically language specific, although it may contain words. Some of the more frequently used stop words for English include "a", "of", "the", "I", "it", "you" and "and". These are generally regarded as 'functional words' which do not carry meaning. When assessing the contents of natural language, the meaning can be conveyed more clearly by ignoring the functional words.

Hence it is practical to remove words that appear too often and provide no information for the task. If stop word removal is applied, the stop words in the given text file are not loaded. If it is not applied, the stop word removal algorithm is disabled when the dataset is loaded.
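The two preprocessing steps can be sketched as follows. The stop-list contains only the words quoted above, and the regular-expression tokenizer and the function names `tokenize` and `preprocess_comment` are assumptions made for illustration.

```python
import re

# Stop words quoted in the text; a real stop-list would be longer.
STOP_WORDS = {"a", "of", "the", "i", "it", "you", "and"}

def tokenize(comment):
    """Split a review comment into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", comment.lower())

def preprocess_comment(comment, remove_stop_words=True):
    """Tokenize and, if enabled, drop stop words."""
    tokens = tokenize(comment)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess_comment("I love the camera and the battery of it!"))
# ['love', 'camera', 'battery']
```

The `remove_stop_words` flag mirrors the option described above of enabling or disabling stop word removal when the dataset is loaded.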

Fig. 2. Steps used in sentiment classification.

B. Text Transformation

The score of each sentence in the source document is calculated as the sum of the weights of the terms in that sentence [6].
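As a minimal sketch of this scoring rule, the snippet below sums per-term weights for a tokenized sentence. The weights are hypothetical values chosen for illustration; the text does not say how the term weights are obtained.

```python
def sentence_score(tokens, term_weights):
    """Sum the weights of the terms in a sentence; unknown terms weigh 0."""
    return sum(term_weights.get(t, 0.0) for t in tokens)

# Hypothetical term weights chosen for illustration only.
weights = {"battery": 1.5, "camera": 1.0, "love": 2.0}
print(sentence_score(["love", "camera", "battery"], weights))  # 4.5
```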

C. Feature Selection

Many statistical feature selection methods for document-level classification can also be used for sentiment analysis [6]. The simplest statistical approach to feature selection is to use the most frequently occurring words in the corpus as polarity indicators. The majority of approaches to sentiment analysis involve a two-step process:

• Identify the parts of the document that contribute to positive or negative sentiment.

• Combine these parts of the document in ways that increase the odds of the document falling into one of these two polar categories.
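The simplest approach mentioned above, picking the most frequent words in the corpus as candidate polarity indicators, can be sketched as follows. The function name `most_frequent_words` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_words(tokenized_docs, k):
    """Return the k most frequent words in the corpus as candidate features."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return [word for word, _ in counts.most_common(k)]

corpus = [
    ["good", "camera", "good", "battery"],
    ["bad", "battery"],
    ["good", "screen"],
]
print(most_frequent_words(corpus, 2))  # ['good', 'battery']
```

In practice stop words would be removed first, since otherwise they would dominate the frequency ranking.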

VIII. ALGORITHMS USED

Algorithm 1: Algorithm for extracting Sentiment of Review Comment

Require: Product Review Document
Ensure: Sentiment of User comment.
1. Fetch the comment.
2. Convert the unstructured comment data to a structured document.
3. Tokenize the sentences into keywords.
4. Eliminate stop words and tag the tokens using a POS tagger.
5. If a term is not in the dictionary, check for the correct word.
6. Apply Naïve Bayes classifier.
7. Calculate Precision, Recall and F measure.
8. Apply decision tree algorithm.
9. Compute sentiments using Algorithm 2.
10. Return sentiment and sentiment score of review.
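Step 5 of Algorithm 1 (checking for the correct word when a term is not in the dictionary) could be approximated as in the sketch below. It uses `difflib.get_close_matches` from the Python standard library as a stand-in; the paper does not specify its correction method, and the small `DICTIONARY` list and the similarity cutoff of 0.8 are assumptions.

```python
import difflib

# Tiny stand-in dictionary for illustration; not the dictionary used in the paper.
DICTIONARY = ["battery", "camera", "excellent", "phone", "screen"]

def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the word if known, else the closest dictionary word, else the word."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print([correct_word(w) for w in ["excelent", "camra", "phone"]])
```

Words with no sufficiently similar dictionary entry are returned unchanged, so slang that is far from any dictionary word would need separate handling.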

Algorithm 2: Algorithm to calculate the review orientation

Procedure Review Sense( )
begin
    for each review sentence si
    begin
        sense = 0;
        for each review word rw in si
            sense += Word Sense (rw, si);
        /* Positive = 1, Negative = -1 */
        if (sense > 0) si's sense = Positive;
        else if (sense < 0) si's sense = Negative;
    end
    endfor;
end

Procedure Word Sense (word, sentence)
begin
    sense = orientation of word in bag of keywords;
    if (a NEGATIVE_WORD appears closely around word in sentence)
        sense = opposite(sense);
end
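Algorithm 2 can be turned into a small runnable Python sketch. Several details are assumptions not stated in the pseudocode: the bag of keywords and the list of negative words are hypothetical, "closely around" is interpreted as within two tokens, a sentence whose score is zero is labeled Neutral, and `word_sense` takes the token index instead of the word so that the position in the sentence is unambiguous.

```python
# Hypothetical bag of keywords: +1 for positive, -1 for negative orientation.
KEYWORDS = {"good": 1, "great": 1, "fast": 1, "bad": -1, "poor": -1, "slow": -1}
NEGATIVE_WORDS = {"not", "no", "never"}
WINDOW = 2  # assumed meaning of "closely around": within two tokens

def word_sense(index, tokens):
    """Orientation of tokens[index], flipped if a negative word is nearby."""
    sense = KEYWORDS.get(tokens[index], 0)
    nearby = tokens[max(0, index - WINDOW):index + WINDOW + 1]
    if sense and any(t in NEGATIVE_WORDS for t in nearby):
        sense = -sense
    return sense

def review_sense(sentences):
    """Label each tokenized sentence as Positive, Negative or Neutral."""
    labels = []
    for tokens in sentences:
        sense = sum(word_sense(i, tokens) for i in range(len(tokens)))
        if sense > 0:
            labels.append("Positive")
        elif sense < 0:
            labels.append("Negative")
        else:
            labels.append("Neutral")
    return labels

print(review_sense([
    ["the", "camera", "is", "good"],
    ["battery", "is", "not", "good"],
    ["delivery", "was", "slow"],
]))
# ['Positive', 'Negative', 'Negative']
```

The second sentence shows the negation rule at work: "good" would count as positive, but the nearby "not" flips its orientation.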

IX. RESULTS AND DISCUSSION

The MOTO X Play phone is the most searched item on Flipkart. The top 10 reviews of the MOTO X Play mobile phone contain more than 4300 words. Fig 3 shows the classification of the fetched comments using the Naïve Bayes classifier, with the star rating and the total numbers of negative and positive words. There are millions of products and millions of user reviews about them.

Fig. 3 Top 10 reviews of Mobile phone MOTO X Play

There are more than 100 spelling mistakes and slang words in the top 10 reviews, which affect the performance of any sentiment analysis algorithm. In order to evaluate the effectiveness of the proposed feature extraction approach, we manually read every review and chose the major quality features mentioned in the reviews as the ground truth. We use recall@K as the measure of the accuracy of the feature extraction result. That is, given a threshold K, the top K features are extracted and compared to the ground truth feature set. Recall is calculated as the ratio of the number of correct features in the extraction result (NE) to the size of the ground truth feature set.

Precision, recall and F-measure are calculated for the Moto phones shown in Fig. 4. The overall polarity of the products is calculated, and the results are shown in Fig 5. Based on the sentiment analysis of the product reviews, the Moto X Play (16 GB) comes out ahead of the Moto E and Moto G.

Fig. 4 Products for analysis

Fig. 5 Graphical representation of Precision recall and F-measure.
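The recall@K measure described in this section can be written as a short Python function. The feature lists below are hypothetical and serve only to illustrate the calculation; they are not results from the study.

```python
def recall_at_k(ranked_features, ground_truth, k):
    """Fraction of ground-truth features found among the top-k extracted ones."""
    top_k = set(ranked_features[:k])
    return len(top_k & set(ground_truth)) / len(ground_truth)

# Hypothetical extraction result and ground truth for illustration.
extracted = ["camera", "battery", "delivery", "screen", "price"]
truth = {"camera", "battery", "price", "display"}
print(recall_at_k(extracted, truth, k=3))  # 0.5
print(recall_at_k(extracted, truth, k=5))  # 0.75
```

Increasing K can only keep recall the same or raise it, which is why the threshold should be reported together with the score.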

X. CONCLUSION

Instead of some thousand products in a superstore, consumers may choose among millions of products in an online store to satisfy their personal demands. Targeted customer marketing can be effective when an e-commerce company is able to collect rich information about buyers' behavior on its site. In this paper, we use the Naïve Bayes algorithm and a semantic decision tree to classify the polarity of comments given on e-commerce websites.

First, we use a web crawler to fetch the comments on a particular web page. Spelling correction is performed using the WordNet dictionary so that the comments are as sensible as possible for determining the polarity of words. Then stop words are removed. After the positive and negative words are classified using the Naïve Bayes algorithm, the overall polarity is calculated using the decision tree.

In future we will extend our study to websites developed with frameworks, where tags are hidden in the browser, and we will add provision for words from other languages in the dataset for more accurate results.



REFERENCES

[1] Song, H., Chu, J., Hu, Y., & Liu, X. (2013, December). Semantic Analysis and Implicit Target Extraction of Comments from E-Commerce Websites. In Software Engineering (WCSE), 2013 Fourth World Congress on (pp. 331-335). IEEE.

[2] Yadav, M. P., Feeroz, M., & Yadav, V. K. (2012, July). Mining the customer behavior using web usage mining in e-commerce. In Computing Communication & Networking Technologies (ICCCNT), 2012 Third International Conference on (pp. 1-5). IEEE.

[3] Kumar Singh, P., Sachdeva, A., Mahajan, D., Pande, N., & Sharma, A. (2014, September). An approach towards feature specific opinion mining and sentimental analysis across e-commerce websites. In Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference- (pp. 329-335). IEEE.

[4] Yu, C., & Ying, X. (2009, December). Application of Data Mining Technology in E-Commerce. In Computer Science-Technology and Applications, 2009. IFCSTA'09. International Forum on (Vol. 1, pp. 291-293). IEEE.

[5] Sellam, T. (2010). Embedding Naive Bayes classification in a Functional and Object Oriented DBMS.

[6] Mouthami, K., Devi, K. N., & Bhaskaran, V. M. (2013, February). Sentiment analysis and classification based on textual reviews. In Information Communication and Embedded Systems (ICICES), 2013 International Conference on (pp. 271-276). IEEE.

[7] Zha, Z. J., Yu, J., Tang, J., Wang, M., & Chua, T. S. (2014). Product aspect ranking and its applications. Knowledge and Data Engineering, IEEE Transactions on, 26(5), 1211-1224.

[8] Sudhakaran, P., Hariharan, S., & Lu, J. (2013). Research Directions, Challenges and Issues in Opinion Mining. International Journal of Advanced Science and Technology, 60, 1-8.

[9] Pang, B., Lee, L., & Vaithyanathan, S. (2002, July). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 79-86). Association for Computational Linguistics.

[10] Anitha, N., Anitha, B., & Pradeepa, S. (2013). Sentiment Classification Approaches–A Review. International Journal of Innovations in Engineering and Technology (IJIET), 3(1), 22-31.

[11] http://articles.economictimes.indiatimes.com/2015-09-03/news/66178659_1_user-base-iamai-internet-and-mobile-association.

[12] Tan, S., Cheng, X., Wang, Y., & Xu, H. (2009). Adapting naive bayes to domain adaptation for sentiment analysis. In Advances in Information Retrieval (pp. 337-349). Springer Berlin Heidelberg.

Gurneet Kaur received her B.Tech. degree from PTU. She is a Lecturer at Baba Hira Singh Bhattal Institute of Engineering and Technology, Lehragaga, Sangrur, Punjab, and a Research Scholar at Bhai Gurdas Institute of Engineering and Technology, Sangrur, Punjab. Her research interests include natural language processing, sentiment analysis and machine learning.

Abhinash Singla is an Assistant Professor at Bhai Gurdas Institute of Engg. & Tech, Sangrur, India. His research interests include natural language processing, sentiment analysis and digital image processing.
