Clickstream Data

Commercial Analytics of Clickstream Data using Hadoop

Submitted to: School of Mathematics and Computer Application Department, Thapar University, Patiala.

Submitted by: Kartik Gupta (201100048), M.C.A., Thapar University

June 2014

Outline

Overview

Big Data

Hadoop

Major Steps

Results and Analysis

Conclusion and Future Scope

Overview

This project produces an analytic report on the behavior and location of website visitors using Hadoop.

MapReduce is implemented to refine and sort the raw data.

Searching is done by country, IP address, postal code, and category.

Hadoop is a tool that converts unstructured, structured, and semi-structured data into key-value pairs, each represented in a binary format.

The MapReduce framework is used for parallel implementation.

Big Data

Big Data is a term used to describe collections of data, possibly unstructured, that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.

The 3 V's of Big Data: volume, velocity, and variety

Hadoop

Open source project started by Doug Cutting

A platform to manage Big Data

Helps in distributed computing

Runs on commodity hardware

Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable

Processing (MapReduce): parallelized (scalable) processing; fault tolerant

CORE PARTS OF HADOOP

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.

Some specific features ensure that Hadoop clusters are highly functional:

Rack awareness

Minimal data motion

Utilities

Rollback

Highly operable

How HDFS works

MapReduce

MapReduce is a programming model and an associated implementation for processing large data sets.

MapReduce usually splits the input data-set into independent chunks which are processed in a completely parallel manner.

This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
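
The split, map, and reduce idea described above can be sketched in a few lines of plain Python. This is only an illustration that runs in a single process, not Hadoop code: the function names, the chunk size, and the sample category names other than "shoes" are assumptions, and a thread pool stands in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map phase: emit a (category, 1) pair for every record in one chunk."""
    return [(category, 1) for category in chunk]


def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records, chunk_size=2):
    # Split the input into independent chunks.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    # Each chunk is mapped independently; threads stand in for cluster nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


# Hypothetical sample input: one category name per page hit.
hits = ["shoes", "clothing", "shoes", "handbags", "shoes"]
category_counts = run_job(hits)
```

Because each chunk is mapped without reference to any other chunk, the map calls can run in any order or on any machine, which is what lets the framework reschedule a failed task without affecting the rest of the job.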

Execution flow in MapReduce

1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.

2. This sends a message to the JobTracker, which produces a unique ID for the job.

3. The JobClient copies job resources, such as the JAR file, to the distributed file system.

4. Once the resources are in the distributed file system, the JobClient can tell the JobTracker to start the job.

5. The JobTracker does its own initialization for the job. It retrieves the job's input splits from the distributed file system.

6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or a reduce task in response to a TaskTracker's heartbeat.

7. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.

8. The TaskTracker now runs the task assigned to it.
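
The eight steps can be traced with a toy simulation. The sketch below is not the Hadoop API: the class `ToyJobTracker`, the function `run_toy_job`, the job ID format, and the file name `job.jar` are all hypothetical, and a plain dictionary stands in for the distributed file system.

```python
from collections import deque
from itertools import count


class ToyJobTracker:
    """Hypothetical stand-in for the JobTracker; not the Hadoop API."""

    def __init__(self):
        self._ids = count(1)
        self.pending = deque()

    def new_job_id(self):
        # Step 2: produce a unique ID for the job.
        return f"job_{next(self._ids):04d}"

    def start_job(self, job_id, shared_fs):
        # Step 5: initialize the job by reading its input splits
        # from the simulated distributed file system.
        for split in shared_fs[job_id]["splits"]:
            self.pending.append((job_id, split))

    def heartbeat(self):
        # Step 6: answer a heartbeat with a task, if any work is left.
        return self.pending.popleft() if self.pending else None


def run_toy_job(splits, jar_name="job.jar"):
    shared_fs = {}                                            # simulated distributed file system
    tracker = ToyJobTracker()
    job_id = tracker.new_job_id()                             # steps 1-2
    shared_fs[job_id] = {"jar": jar_name, "splits": splits}   # step 3
    tracker.start_job(job_id, shared_fs)                      # steps 4-5
    log = []
    while (task := tracker.heartbeat()) is not None:          # step 6
        task_job_id, split = task
        jar = shared_fs[task_job_id]["jar"]                   # step 7
        log.append(f"{task_job_id}: ran {jar} on {split}")    # step 8
    return job_id, log


job_id, log = run_toy_job(["split-0", "split-1"])
```

The point of the sketch is the pull model: the tracker never pushes work, it only hands out a task when a worker reports in with a heartbeat.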

OTHER TECHNOLOGICAL TERMS

Clickstream Data

Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.

Potential Uses of Clickstream Data

What is the most efficient path for a site visitor to research a product, and then buy it?

What products do visitors tend to buy together, and what are they most likely to buy in the future?

Where should I spend resources on fixing or enhancing the user experience on my website?

We will focus on the “path optimization” use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
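
As a point of reference for the bounce-rate question, a bounce is commonly defined as a session in which only one page is viewed, so the bounce rate is the number of single-page sessions divided by the total number of sessions. A minimal sketch under that assumption follows; the session IDs and URLs are made up, and the slides do not say how sessions are identified in the Acme logs.

```python
from collections import Counter


def bounce_rate(page_views):
    """Fraction of sessions that viewed exactly one page.

    page_views is an iterable of (session_id, url) pairs.
    """
    views_per_session = Counter(session_id for session_id, _ in page_views)
    if not views_per_session:
        return 0.0
    bounces = sum(1 for n in views_per_session.values() if n == 1)
    return bounces / len(views_per_session)


# Hypothetical sample: sessions s1 and s3 leave after a single page.
sample_views = [
    ("s1", "/home"),
    ("s2", "/home"), ("s2", "/shoes"), ("s2", "/checkout"),
    ("s3", "/shoes"),
    ("s4", "/home"), ("s4", "/handbags"),
]
rate = bounce_rate(sample_views)
```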

STEP I

Upload the Acme website log dataset, which contains about 4 million rows representing five days of clickstream data.

STEP II

Represent the dataset in its unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, and URL.
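
To make the five fields concrete, the sketch below splits one raw line into a named record. The slides do not show the actual file layout, so the pipe delimiter, the field order, and the sample line are assumptions chosen only for illustration.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    swid: str   # registered user SWID
    ip: str
    geo: str    # geocoded IP address
    url: str


def parse_line(line, sep="|"):
    """Split one raw log line into the five fields named in Step II.

    Returns None for lines that do not have exactly five fields.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        return None
    return LogRecord(*(p.strip() for p in parts))


# Made-up sample line; the real Acme log layout may differ.
sample = "2012-03-15 01:34:46|{SWID-0001}|203.0.113.7|usa,wa,seattle|http://www.example.com/shoes"
record = parse_line(sample)
```

Returning `None` for malformed lines lets a later refinement step drop bad rows instead of failing on the first one.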

STEP III

Represent the users' data from the unstructured loaded dataset.

STEP IV

Represent the products category-wise from the dataset.

STEP V

Show the refined dataset of the Acme log files.

STEP VI

Combine all the tables, i.e. Acme log, products, and users.
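
Combining the three tables amounts to a join. The sketch below shows one way this could look in plain Python, assuming the log joins to products on the URL and to users on the SWID; those join keys, the column names, and the sample rows are assumptions, since the slides do not give the table schemas.

```python
def combine(log_rows, products, users):
    """Left-join log rows with product and user lookups.

    products maps url -> category; users maps swid -> user attributes.
    Rows with no match keep None in the joined columns.
    """
    combined = []
    for row in log_rows:
        user = users.get(row["swid"], {})
        combined.append({
            **row,
            "category": products.get(row["url"]),
            "gender": user.get("gender"),
            "age": user.get("age"),
        })
    return combined


# Hypothetical sample tables.
log_rows = [
    {"swid": "u1", "ip": "203.0.113.7", "url": "/shoes/1"},
    {"swid": "u9", "ip": "192.0.2.9", "url": "/unknown"},
]
products = {"/shoes/1": "shoes"}
users = {"u1": {"gender": "F", "age": 31}}

combined = combine(log_rows, products, users)
```

A left join is used here so that log rows from unregistered visitors or unlisted pages are kept rather than silently dropped.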

Results and Analysis

Configuration of Hadoop


Count of the number of visitors from any country


Retrieving the IP addresses and displaying the state of the visitors


Showing the number of IP addresses that accessed this category at a time


Initial stage of mapping and reduction


Categories accessed, by total number of IP addresses


Showing the shoes category by state, with the total number of IP addresses that accessed it


Showing details of the IP addresses of visitors, gender-wise

Number of females who accessed this page

Total number of IP addresses that accessed a particular webpage

Calculate the sum of ages of all the visitors
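
The kinds of aggregates listed in this section (visitors per country, IP addresses per category, a count by gender, a sum of ages) can be sketched over a small in-memory table. This is an illustration only: the rows, the column names, and the choice to treat one IP address as one visitor are assumptions, and the original results were produced on Hadoop rather than with this code.

```python
from collections import defaultdict


def distinct_ips_by(rows, key):
    """Count distinct IP addresses for each value of `key`."""
    ips = defaultdict(set)
    for row in rows:
        ips[row[key]].add(row["ip"])
    return {value: len(addresses) for value, addresses in ips.items()}


# Hypothetical combined rows; the first visitor appears twice.
rows = [
    {"ip": "203.0.113.7", "country": "usa", "category": "shoes", "gender": "F", "age": 31},
    {"ip": "203.0.113.7", "country": "usa", "category": "shoes", "gender": "F", "age": 31},
    {"ip": "198.51.100.2", "country": "usa", "category": "handbags", "gender": "M", "age": 45},
    {"ip": "192.0.2.9", "country": "ind", "category": "shoes", "gender": "F", "age": 27},
]

visitors_by_country = distinct_ips_by(rows, "country")
ips_by_category = distinct_ips_by(rows, "category")

# Keep one row per IP so repeat hits are not double counted.
visitors = {row["ip"]: row for row in rows}
female_shoe_visitors = sum(
    1 for v in visitors.values() if v["gender"] == "F" and v["category"] == "shoes"
)
total_age = sum(v["age"] for v in visitors.values())
```

Counting distinct IP addresses rather than raw rows matters here, because a single visitor usually generates many log lines.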

Conclusion

The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.

It is therefore worthwhile to analyze the behavior and location of visitors.

It is inefficient to process large datasets using traditional sequential methods.

Therefore, MapReduce is used for processing large datasets.

Future Scope

Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.

Location search is used by various industries, such as telecom and e-commerce, and in event detection.

The nearest-location method can be fused with any other method to support better decision making.

A tradeoff would then be made between distance and the other factor being fused.

Thank you !!!
