Clickstream Data

Commercial Analytics of Clickstream Data using Hadoop Submitted to: School of Mathematics and Computer Application Department, Thapar university, Patiala. Submitted by: Kartik Gupta 201100048 M.C.A Thapar University June 2014


How to manage DATA with HADOOP

Transcript of Clickstream Data

Page 1: Clickstream Data

Commercial Analytics of Clickstream Data using Hadoop

Submitted to: School of Mathematics and Computer Application Department, Thapar University, Patiala.

Submitted by: Kartik Gupta, 201100048, M.C.A, Thapar University

June 2014

Page 2: Clickstream Data

Outline

Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope

Page 3: Clickstream Data

Overview

This project produces an analytic report on the behavior and location of website visitors using Hadoop.

MapReduce is implemented to refine and sort the raw data.

Searching is done by country, IP address, postal code, and category.

Hadoop is a tool that converts unstructured, structured, and semi-structured data into key-value pairs, which are represented in binary format.

The MapReduce framework is used for parallel implementation.

Page 4: Clickstream Data

Big Data

Big Data is a term used to describe collections of data that may be unstructured and that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.

The 3 V's of Big Data: volume, velocity, and variety

Page 5: Clickstream Data

Hadoop

Open source project started by Doug Cutting
A platform to manage Big Data
Helps in distributed computing
Runs on commodity hardware

Data storage (HDFS)
    Runs on commodity hardware (usually Linux)
    Horizontally scalable

Processing (MapReduce)
    Parallelized (scalable) processing
    Fault tolerant

Page 6: Clickstream Data

CORE PARTS OF HADOOP

Page 7: Clickstream Data

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.

Some specific features ensure that Hadoop clusters are highly functional:
    Rack awareness
    Minimal data motion
    Utilities
    Rollback
    Highly operable

Page 8: Clickstream Data

How HDFS works

Page 9: Clickstream Data

MapReduce

MapReduce is a programming model and an associated implementation for processing large data sets.

MapReduce usually splits the input data-set into independent chunks which are processed in a completely parallel manner.

This allows programmers without any experience of parallel and distributed systems to easily use the resources of a large distributed system.

The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
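The map, group, and reduce stages described above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the `(ip, url)` record shape are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (url, 1) pair for one (ip, url) clickstream record."""
    _ip, url = record
    yield url, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # Each record is mapped independently of the others, which is what
    # lets a real cluster process separate input chunks in parallel.
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [("10.0.0.1", "/shoes"), ("10.0.0.2", "/home"), ("10.0.0.3", "/shoes")]
    print(run_job(sample))
```

Because the map function looks at one record at a time and the reduce function looks at one key at a time, neither needs to know how the work is distributed.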

Page 10: Clickstream Data


Execution flow in MapReduce

1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.

Page 11: Clickstream Data


Execution flow in MapReduce

2. This sends a message to the JobTracker, which produces a unique ID for the job.

Page 12: Clickstream Data


Execution flow in MapReduce

3. The JobClient copies job resources, such as the jar file, to the distributed filesystem.

Page 13: Clickstream Data


Execution flow in MapReduce

4. Once the resources are in the distributed filesystem, the JobClient can tell the JobTracker to start the job.

Page 14: Clickstream Data


Execution flow in MapReduce

5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed filesystem.

Page 15: Clickstream Data


Execution flow in MapReduce

6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or reduce task in response to a TaskTracker's heartbeat.

Page 16: Clickstream Data


Execution flow in MapReduce

7. The TaskTrackers need to obtain the code to execute, so they get it from the shared filesystem.

Page 17: Clickstream Data


Execution flow in MapReduce

8. The TaskTracker now runs the task it was assigned.
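The hand-off between JobTracker and TaskTrackers in steps 2, 5, 6, and 8 can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not Hadoop code: the `JobTracker` class here, the `run_cluster` helper, and the `job_0001` ID format are all hypothetical, and only map tasks are modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobTracker:
    """Toy stand-in for the JobTracker: hands out pending tasks on request."""
    next_id: int = 1
    pending: deque = field(default_factory=deque)

    def new_job_id(self):
        # Step 2: produce a unique ID for the job.
        job_id = f"job_{self.next_id:04d}"
        self.next_id += 1
        return job_id

    def start_job(self, job_id, input_splits):
        # Step 5: initialize the job by creating one map task per input split.
        for i, split in enumerate(input_splits):
            self.pending.append((job_id, f"map_{i}", split))

    def heartbeat(self, tracker_name):
        # Step 6: answer a heartbeat with a task if one is waiting, else None.
        return self.pending.popleft() if self.pending else None


def run_cluster(splits, tracker_names):
    """Simulate TaskTrackers taking turns sending heartbeats until no work is left."""
    if not tracker_names:
        raise ValueError("at least one TaskTracker is required")
    jt = JobTracker()
    job_id = jt.new_job_id()
    jt.start_job(job_id, splits)
    log = []
    while jt.pending:
        for name in tracker_names:
            task = jt.heartbeat(name)
            if task is not None:
                log.append((name, task[1]))  # Step 8: the tracker runs the task.
    return job_id, log
```

Note that the JobTracker never pushes work in this model; a TaskTracker only receives a task as the reply to its own heartbeat, which is the behavior step 6 describes.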

Page 18: Clickstream Data

OTHER TECHNOLOGICAL TERMS

Clickstream Data

Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.

Potential Uses of Clickstream Data

What is the most efficient path for a site visitor to research a product, and then buy it?
What products do visitors tend to buy together, and what are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the user experience on my website?

We will focus on the “path optimization” use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
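To make the bounce-rate goal concrete, here is a minimal sketch of how it might be measured from page-view records. It assumes that each visitor ID corresponds to a single session, which is a simplification, and the `bounce_rate` function and the `(visitor_id, url)` input shape are hypothetical rather than taken from the project.

```python
from collections import Counter


def bounce_rate(page_views):
    """Return the fraction of visitors who viewed exactly one page.

    page_views: iterable of (visitor_id, url) pairs. Treating each visitor
    as one session is a simplifying assumption of this sketch.
    """
    views_per_visitor = Counter(visitor for visitor, _url in page_views)
    if not views_per_visitor:
        return 0.0
    bounced = sum(1 for n in views_per_visitor.values() if n == 1)
    return bounced / len(views_per_visitor)
```

A lower value after a site change would suggest that more visitors continue past their landing page.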

Page 19: Clickstream Data

STEP I

Upload the Acme website log dataset, which contains about 4 million rows representing five days of clickstream data.

Page 20: Clickstream Data

STEP II

Represent the dataset in unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, URL.
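As an illustration of turning one raw log line into these five fields, the sketch below assumes a tab-separated layout with the fields in the order listed and the geocoded location held in a single column. The actual Acme log layout is not given in the slides, so the separator, the field order, and the `Click` and `parse_line` names are assumptions.

```python
from typing import NamedTuple, Optional


class Click(NamedTuple):
    timestamp: str
    swid: str  # registered user ID
    ip: str
    geo: str   # geocoded IP address, kept as one raw string here
    url: str


def parse_line(line: str, sep: str = "\t") -> Optional[Click]:
    """Split one raw log line into five named fields, or return None if malformed."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        return None
    return Click(*parts)
```

Returning `None` for malformed lines lets a later refinement step count or discard them instead of failing on the first bad row.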

Page 21: Clickstream Data

STEP III

Represent the users data from the unstructured loaded dataset.

Page 22: Clickstream Data

STEP IV

Represent the products category-wise from the dataset.

Page 23: Clickstream Data

STEP V

Show the refined dataset of the Acme log files.

Page 24: Clickstream Data

STEP VI

Combine all the tables, i.e. Acme log, products, and users.
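Combining the three tables amounts to a join on the URL (for products) and on the SWID (for users). The following is a small in-memory sketch of that idea; the key names (`swid`, `url`, `category`) and the `join_tables` helper are assumptions for illustration, and the project's own join was presumably expressed with Hadoop tooling rather than plain Python.

```python
def join_tables(log_rows, products, users):
    """Left-join each log row with product and user attributes.

    log_rows: list of dicts with "swid" and "url" keys.
    products: dict mapping url -> category.
    users:    dict mapping swid -> dict of user attributes (e.g. age, gender).
    """
    joined = []
    for row in log_rows:
        combined = dict(row)
        # category is None when the URL is not a known product page
        combined["category"] = products.get(row["url"])
        # unregistered visitors simply gain no extra fields
        combined.update(users.get(row["swid"], {}))
        joined.append(combined)
    return joined
```

A left join keeps every log row, so visits by unregistered users or to non-product pages are not silently dropped from later counts.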

Page 25: Clickstream Data

Results and Analysis

Configuration of Hadoop

Page 26: Clickstream Data

Results and Analysis

Count the number of visitors from any country
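One way such a count could be computed is sketched below. It treats each distinct IP address as one visitor, which is an assumption, and the row keys (`country`, `ip`) and the `visitors_per_country` name are hypothetical.

```python
from collections import defaultdict


def visitors_per_country(rows):
    """Count distinct IP addresses seen for each country."""
    ips = defaultdict(set)
    for row in rows:
        ips[row["country"]].add(row["ip"])
    return {country: len(seen) for country, seen in ips.items()}
```

Using a set per country means repeat visits from the same IP address are counted once.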

Page 27: Clickstream Data

Results and Analysis

Retrieving the IP address and displaying the state of visitors

Page 28: Clickstream Data

Results and Analysis

Showing the number of IP addresses accessing this category at a time

Page 29: Clickstream Data

Results and Analysis

Initial stage of mapping and reduction

Page 30: Clickstream Data

Results and Analysis

Category accessed, by total number of IP addresses

Page 31: Clickstream Data

Results and Analysis

Showing the shoes category by state, accessed by total number of IP addresses
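A query of this shape filters the joined rows to one category and then groups by state. The sketch below is one possible formulation in plain Python; the row keys (`category`, `state`, `ip`) and the `category_ips_by_state` helper are assumptions and do not reproduce the project's actual query.

```python
from collections import defaultdict


def category_ips_by_state(rows, category="shoes"):
    """For one category, count distinct IP addresses per state, largest first."""
    ips = defaultdict(set)
    for row in rows:
        if row["category"] == category:
            ips[row["state"]].add(row["ip"])
    counts = [(state, len(seen)) for state, seen in ips.items()]
    # Sort by descending count, breaking ties alphabetically by state.
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```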

Page 32: Clickstream Data

Results and Analysis

Showing details of visitors' IP addresses, gender-wise

Page 33: Clickstream Data

Results and Analysis

Number of females who accessed this page

Page 34: Clickstream Data

Results and Analysis

Total number of IP addresses that accessed a particular webpage

Page 35: Clickstream Data

Results and Analysis

Calculate the sum of ages of all the visitors

Page 36: Clickstream Data

Conclusion

The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.

It is therefore useful to analyze the behavior and location of visitors.

It is inefficient to process large data using traditional sequential methods.

Therefore, MapReduce is used for processing large datasets.

Page 37: Clickstream Data

Future Scope

Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.

Location search is used by various industries, such as telecom and e-commerce, and in event detection.

The nearest-location method can be fused with any other method to support better decision making.

The tradeoff would then be made between distance and the other factors that are fused in.

Page 38: Clickstream Data

Thank you !!!