DATA ANALYTICS ON AMAZON PRODUCT ...eecs.csuohio.edu/~sschung/cis612/Analysingamazonreview...DATA...

DATA ANALYTICS ON AMAZON PRODUCT REVIEW USING NOSQL HIVE AND MACHINE LEARNING ON SPARKS ON HADOOP FILE SYSTEM. PRESENTED BY: DEVANG PATEL (2671221) SONAL DESHMUKH (2622863)



INTRODUCTION:

• Online shopping grows more significant every day because a purchase takes just one click.
• Amazon is one of the best-known e-commerce websites worldwide. It was initially known for its huge collection of books and later expanded to other items.
• E-commerce is about making money, so customer satisfaction and opinion are an important part of e-commerce websites. This gave rise to “User Reviews”.
• User reviews are suggestions from customers that help other customers decide about a product.

HOW TO GET THE AMAZON REVIEW DATASET:

• We emailed the dataset providers to request access to the Amazon review dataset, and they sent a link from which we could download it.
• The data was in JSON file format.

SOFTWARE AND TOOLS:

REVIEW SAMPLE AND DATASET DESCRIPTION:

• Rating (1-5 stars)
• Review text
• Summary
• Number of people who found the review helpful
• Product ID
• Reviewer ID

CONTINUED…

The JSON dataset we received contains the following fields:

• reviewerID: ID of the reviewer.
• asin: ID of the product.
• reviewerName: Name of the reviewer.
• helpful: Helpfulness rating of the review.
• reviewText: Text of the review.
• overall: Rating of the product.
• summary: Summary of the review.
• unixReviewTime: Time of the review (Unix time).
• reviewTime: Time of the review.
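
The field list above can be exercised with a short Python sketch that parses one review line and derives the review year from unixReviewTime. The field names come from the slide; every value in the sample record is invented for illustration, and the helper name parse_review and the derived reviewYear key are hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample record: field names follow the slide,
# but all values are made up for illustration.
SAMPLE_LINE = (
    '{"reviewerID": "A0000000000000", "asin": "B000000000", '
    '"reviewerName": "Example Reviewer", "helpful": [3, 4], '
    '"reviewText": "Works well for the price.", "overall": 5.0, '
    '"summary": "Good value", "unixReviewTime": 1393545600, '
    '"reviewTime": "02 28, 2014"}'
)


def parse_review(line):
    """Parse one JSON review line and add a derived review year (UTC)."""
    record = json.loads(line)
    ts = datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc)
    record["reviewYear"] = ts.year  # hypothetical derived key
    return record


review = parse_review(SAMPLE_LINE)
print(review["asin"], review["overall"], review["reviewYear"])
```

Deriving the year in UTC keeps the result independent of the machine's local time zone.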

IMPLEMENTATION:

• The data was in JSON format and we are using Hive. One option was to convert the JSON to a CSV file, but we chose to use a JSON SerDe instead.
• JAR used: hive-serdes-1.0-SNAPSHOT.jar
• We downloaded the JSONSerDe JAR file and copied it into the hive/lib folder.

CONTINUED…

• We uploaded the Music_Instruments.JSON file to HDFS.
• After uploading it to HDFS, the next step was to load it into a Hive table.

CONTINUED…

• Table: MI_table
• Row format: JSONSerDe.class
• Location: the HDFS path where the JSON file is stored.

PROBLEMS FACED DURING TABLE CREATION IN HIVE

• Query: SELECT * FROM MI_table;
• It fetched all 10,261 rows, but the result contained NULL values.
• The NULL values appeared because the JSON key names contained capital letters.
• Solution: SERDEPROPERTIES ("case.insensitive" = "false")
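
How a key-name mismatch can turn into NULLs can be illustrated in plain Python. This is only a sketch of the general mechanism, assuming a reader that looks up lowercase column names (Hive stores column names in lowercase) against the JSON keys. It is not the SerDe's code and does not model the "case.insensitive" property itself. The function read_row, its normalize_case flag, and the sample values are hypothetical.

```python
import json

# Hive stores column names in lowercase; the JSON keys are camelCase.
HIVE_COLUMNS = ["reviewerid", "asin", "overall"]


def read_row(json_line, columns, normalize_case):
    """Map JSON keys onto table columns, yielding None (NULL) on a miss."""
    record = json.loads(json_line)
    if normalize_case:
        # Lowercase the JSON keys so they line up with the column names.
        record = {key.lower(): value for key, value in record.items()}
    return [record.get(column) for column in columns]


# Invented sample values, for illustration only.
line = '{"reviewerID": "A0000000000000", "asin": "B000000000", "overall": 5.0}'
strict = read_row(line, HIVE_COLUMNS, normalize_case=False)
relaxed = read_row(line, HIVE_COLUMNS, normalize_case=True)
print(strict)   # the reviewerid column comes back as None
print(relaxed)  # every column is populated
```

Only the key with a capital letter (reviewerID) is lost in the strict lookup, which matches the symptom of rows being fetched with some columns NULL.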

PROBLEM WITH DATA FORMAT

• The metadata file of the Amazon reviews has its key-value pairs in single quotes.
• We tried all the available JSON SerDe types, but none worked.
• Python's json.dumps function converted the data into the correct JSON format.
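
A minimal sketch of the conversion, assuming each metadata line is a Python-style dict literal. The slide names json.dumps but not the parsing step, so the use of ast.literal_eval here is an assumption, and the helper name to_strict_json and the sample line are hypothetical.

```python
import ast
import json


def to_strict_json(line):
    """Convert one single-quoted, dict-literal record into strict JSON."""
    record = ast.literal_eval(line)  # assumption: parses the single-quoted literal
    return json.dumps(record)        # re-serializes with double quotes


# Hypothetical metadata line; the values are invented for illustration.
raw = "{'asin': 'B000000000', 'title': 'Example Guitar Strings', 'price': 5.99}"
fixed = to_strict_json(raw)
print(fixed)
```

ast.literal_eval only evaluates literals, which makes it a safer choice than eval for files downloaded from elsewhere.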


SELECT * DESERIALIZES

CAN WE USE THIS TABLE AT PRODUCTION LEVEL?

MOST REVIEWED PRODUCT:

AVERAGE RATING OF MUSICAL INSTRUMENTS:


AVERAGE RATINGS ON PRODUCTS

AVERAGE RATING OF 5 DIFFERENT AMAZON PRODUCT CATEGORIES:

• Automotive (4.18)
• Cellphones (4.12)
• Lawn_Garden (4.18)
• Musical Instruments (4.48)
• Pet_Supplies (4.22)

WERE THE REVIEWS POSITIVE OR NEGATIVE?

[Items 1 to 6: slide images not preserved.]


IN WHICH YEAR WERE PRODUCTS REVIEWED MOST?


COST OF MOST REVIEWED (TOP 5) PRODUCTS

HDFS IN THE BROWSER: http://localhost:50070

TRUST AND HELPFULNESS IN AMAZON PRODUCT REVIEWS

• The ‘helpful’ column contains values that look like this: ‘[56, 63]’.
• The first value is the number of helpful votes; the second is the total number of votes.
• From these we derive a helpfulness percentage and a binary column that states whether the review is helpful.
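
The two derived columns can be sketched as follows. The slide does not state the cutoff used for the binary column, so the 50% threshold is an assumption, as is treating reviews with no votes as not helpful. The function name helpfulness is hypothetical.

```python
import json


def helpfulness(helpful_field, threshold=50.0):
    """Return (percentage, is_helpful) for a value such as '[56, 63]'."""
    helpful_votes, total_votes = json.loads(helpful_field)
    if total_votes == 0:
        return 0.0, 0  # assumption: no votes counts as not helpful
    percentage = 100.0 * helpful_votes / total_votes
    # threshold is an assumed cutoff, not a value taken from the slides
    return percentage, int(percentage >= threshold)


pct, flag = helpfulness("[56, 63]")
print(round(pct, 2), flag)
```

For the slide's example value, 56 helpful votes out of 63 gives roughly 88.89%, which the assumed threshold marks as helpful.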

OUTPUT FILE AND TABLEAU DATA SOURCE


Helpful Ratings and Distribution

PIPELINE MODEL OF SPARK’S MLLIB:

• Cylinders indicate DataFrames.
• The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame.
• The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame.
• Logistic regression is the machine learning algorithm trained on those feature vectors.
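
What the first two stages do to a single document can be illustrated without Spark. The sketch below is plain Python, not Spark's implementation: Spark uses its own hash function, and zlib.crc32 and the vector size of 16 are stand-ins chosen so the example is deterministic and small. The function names tokenize and hashing_tf and the sample sentence are hypothetical.

```python
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Map words to a fixed-length vector of term counts via hashing."""
    vector = [0] * num_features
    for word, count in Counter(words).items():
        # crc32 is a stand-in hash; each word lands in one bucket.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += count
    return vector


words = tokenize("Great strings great sound")
features = hashing_tf(words)
print(words)
print(features)
```

The resulting vector has a fixed length regardless of vocabulary size, which is what lets a classifier such as logistic regression consume it. Different words may collide in the same bucket, and a larger vector size makes that less likely.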

PROBLEMS FACED IN RUNNING THE SPARK MACHINE LEARNING PROGRAM

• Creating a DataFrame for the Amazon data.
• We downloaded sklearn, but PySpark has its own classification library, MLlib.
• Execution threw an error saying that NumPy 1.4 or higher is required. We solved it by correcting a bug in the __init__.py file of MLlib.

SPARK EXECUTION: bin/pyspark SHELL COMMAND

Start Spark on top of Hadoop.


TRAINING SET AND TEST SET RESULTS