Clickstream Data
Commercial Analytics of Clickstream Data using Hadoop
Submitted to: School of Mathematics and Computer Application Department, Thapar University, Patiala.
Submitted by: Kartik Gupta, 201100048, M.C.A, Thapar University
June 2014
Outline
Overview
Big Data
Hadoop
Major Steps
Results and Analysis
Conclusion and Future Scope
Overview
This project produces an analytic report on the behavior and location of website visitors using Hadoop.
MapReduce is implemented to refine and sort the raw data.
Searching is done by country, IP address, postal code, and category.
Hadoop is a tool that converts unstructured, structured, and semi-structured data into key/value pairs, which are then reduced to a single value per key, represented in binary format.
The MapReduce framework is used for parallel implementation.
Big Data
Big Data is a term used to describe large collections of data, possibly unstructured, that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
3 v’s of Big data
Hadoop
Open source project started by Doug Cutting
A platform to manage Big Data
Helps in distributed computing
Runs on commodity hardware
Data storage (HDFS)
  Runs on commodity hardware (usually Linux)
  Horizontally scalable
Processing (MapReduce)
  Parallelized (scalable) processing
  Fault tolerant
CORE PARTS OF HADOOP
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
Some specific features ensure that Hadoop clusters are highly functional:
Rack awareness
Minimal data motion
Utilities
Rollback
Highly operable
How HDFS works
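As a rough illustration of the storage idea, the sketch below splits a file into fixed-size blocks and places three replicas of each block on nodes spread over two racks. The block size, the rack and node names, and the simple placement rule are assumptions made for this example; the real HDFS placement policy is more involved.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size units."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, racks, replication=3):
    """Toy rack-aware rule: one replica on one rack, the rest on the next rack.

    Assumes at least two racks, each with at least (replication - 1) nodes.
    """
    rack_names = sorted(racks)
    placement = {}
    for b in range(num_blocks):
        first_nodes = racks[rack_names[b % len(rack_names)]]
        other_nodes = racks[rack_names[(b + 1) % len(rack_names)]]
        chosen = [first_nodes[b % len(first_nodes)]]
        # Remaining replicas go to distinct nodes on the other rack.
        for k in range(replication - 1):
            chosen.append(other_nodes[(b + k) % len(other_nodes)])
        placement[b] = chosen
    return placement


# Hypothetical cluster: two racks with two nodes each.
racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
blocks = split_into_blocks(file_size=300, block_size=128)  # sizes in MB
placement = place_replicas(len(blocks), racks)
for block_id, nodes in placement.items():
    print(block_id, blocks[block_id], nodes)
```

Because every block lives on two racks, losing a single node or a whole rack still leaves at least one copy readable.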
MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
MapReduce usually splits the input data-set into independent chunks which are processed in a completely parallel manner.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
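The programming model can be sketched in plain Python without Hadoop. This minimal example counts visits per country: each chunk is mapped independently (so the map calls could run in parallel), the emitted pairs are grouped by key, and each group is reduced to one value. The record layout and the country codes are invented for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (country, 1) pair for every record in one input chunk."""
    return [(record["country"], 1) for record in chunk]


def shuffle(mapped_outputs):
    """Group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


# Two independent chunks of made-up input records.
chunks = [
    [{"country": "IN"}, {"country": "US"}],
    [{"country": "IN"}, {"country": "UK"}, {"country": "IN"}],
]
mapped = [map_phase(chunk) for chunk in chunks]  # no chunk depends on another
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In Hadoop the framework performs the grouping step and distributes the map and reduce calls across the cluster; the programmer supplies only the two functions.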
Execution flow in MapReduce
1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
2. This sends a message to the JobTracker, which produces a unique ID for the job.
3. The JobClient copies the job resources, such as the JAR file, to the distributed filesystem.
4. Once the resources are in the distributed filesystem, the JobClient can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed filesystem.
6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or reduce task in response to each heartbeat.
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared filesystem.
8. The TaskTracker then runs the task.
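The eight steps above can be walked through with a toy simulation. This is not Hadoop's API: `ToyJobTracker`, `run_job`, the job-ID format, and the file and tracker names are all hypothetical, and only map tasks are modelled. It shows how tasks are handed out in response to heartbeats rather than pushed to workers.

```python
import itertools


class ToyJobTracker:
    """Toy stand-in for the JobTracker: issues job IDs and hands out tasks."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = []    # tasks waiting for a TaskTracker
        self.shared_fs = {}  # stands in for the distributed filesystem

    def new_job_id(self):
        return f"job_{next(self._ids):04d}"

    def submit(self, job_id):
        # Initialization: read the input splits stored with the job
        # resources and create one map task per split.
        for split in self.shared_fs[job_id]["splits"]:
            self.pending.append((job_id, "map", split))

    def heartbeat(self, tracker_name):
        # A heartbeat is answered with a task if any work is pending.
        return self.pending.pop(0) if self.pending else None


def run_job(tracker, jar, splits, task_trackers):
    job_id = tracker.new_job_id()                               # step 2
    tracker.shared_fs[job_id] = {"jar": jar, "splits": splits}  # step 3
    tracker.submit(job_id)                                      # steps 4-5
    log = []
    for name in itertools.cycle(task_trackers):
        task = tracker.heartbeat(name)                          # step 6
        if task is None:
            break
        code = tracker.shared_fs[task[0]]["jar"]                # step 7
        log.append((name, code, task[2]))                       # step 8
    return job_id, log


job_id, log = run_job(
    ToyJobTracker(),
    jar="clickstream.jar",
    splits=["split-0", "split-1", "split-2"],
    task_trackers=["tt-A", "tt-B"],
)
print(job_id)
for entry in log:
    print(entry)
```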
OTHER TECHNOLOGICAL TERMS
Clickstream Data
Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.
Potential Uses of Clickstream Data
What is the most efficient path for a site visitor to research a product, and then buy it?
What products do visitors tend to buy together, and what are they most likely to buy in the future?
Where should I spend resources on fixing or enhancing the user experience on my website?
Basically we will focus on the “path optimization” use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
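To make the two target metrics concrete, here is a minimal sketch that computes them from page views. It assumes a session counts as a bounce when it has exactly one page view and as a conversion when it reaches a checkout page; the session IDs, URLs, and the `/checkout` path are hypothetical.

```python
from collections import defaultdict


def session_metrics(page_views, conversion_url="/checkout"):
    """Return (bounce_rate, conversion_rate) for (session_id, url) pairs."""
    pages_by_session = defaultdict(list)
    for session_id, url in page_views:
        pages_by_session[session_id].append(url)

    total = len(pages_by_session)
    if total == 0:
        return 0.0, 0.0

    # A bounce is a session that viewed only one page.
    bounces = sum(1 for pages in pages_by_session.values() if len(pages) == 1)
    # A conversion is a session that reached the conversion URL.
    conversions = sum(
        1 for pages in pages_by_session.values() if conversion_url in pages
    )
    return bounces / total, conversions / total


# Made-up page views for four sessions.
views = [
    ("s1", "/home"),
    ("s2", "/home"), ("s2", "/shoes"), ("s2", "/checkout"),
    ("s3", "/shoes"),
    ("s4", "/home"), ("s4", "/shoes"),
]
bounce_rate, conversion_rate = session_metrics(views)
print(bounce_rate, conversion_rate)
```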
STEP I
Upload the Acme website log dataset, which contains about 4 million rows of data representing five days of clickstream data.
STEP II
Represent the dataset in unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, URL.
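A minimal sketch of turning one raw log line into the five fields listed in this step. The tab delimiter, the field order, and the sample line are assumptions; the actual Acme log layout is not given here.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    swid: str  # registered user SWID
    ip: str
    geo: str   # geocoded IP address
    url: str


def parse_line(line, sep="\t"):
    """Split one raw log line into a LogRecord, or return None if malformed."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        return None  # skip malformed rows instead of failing the whole run
    return LogRecord(*parts)


# Invented sample line in the assumed layout.
sample = "2012-03-15 01:34:46\t{SWID-0001}\t10.0.0.1\tusa,ca\thttp://www.acme.com/shoes\n"
record = parse_line(sample)
print(record)
```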
STEP III
Represent the users' data from the unstructured loaded dataset.
STEP IV
Represent the products category-wise from the dataset.
STEP V
Show the refined dataset of the Acme log files.
STEP VI
Combine all the tables, i.e. Acme log, products, and users.
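Combining the three tables amounts to a join on shared keys. The sketch below assumes the log joins to products on the URL and to users on the SWID, and keeps log rows even when no match exists (a left join). The column names and sample rows are hypothetical.

```python
def join_tables(log_rows, products, users):
    """Left-join log rows with products (by url) and users (by swid)."""
    joined = []
    for row in log_rows:
        product = products.get(row["url"], {})
        user = users.get(row["swid"], {})
        joined.append({
            **row,
            # Missing matches leave the added columns as None.
            "category": product.get("category"),
            "gender": user.get("gender"),
            "age": user.get("age"),
        })
    return joined


# Made-up rows for each of the three tables.
log_rows = [
    {"swid": "u1", "ip": "10.0.0.1", "url": "/shoes"},
    {"swid": "u9", "ip": "10.0.0.9", "url": "/unknown"},
]
products = {"/shoes": {"category": "shoes"}}
users = {"u1": {"gender": "F", "age": 30}}

for row in join_tables(log_rows, products, users):
    print(row)
```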
Results and Analysis
Configuration of Hadoop
Count of the number of visitors from any country
Retrieving the IP addresses and displaying the state of the visitors
Number of IP addresses accessing a category at a time
Initial stage of mapping and reduction
Categories accessed, by total number of IP addresses
Shoes category access by state, by total number of IP addresses
Details of IP addresses accessed by visitors, by gender
Number of females who accessed a page
Total number of IP addresses that accessed a particular webpage
Sum of the ages of all the visitors
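Several of these results are simple aggregations over the combined table. The sketch below shows three of them in plain Python on invented records; the column names are assumptions, and the project itself ran such queries on Hadoop rather than in memory.

```python
from collections import defaultdict


def distinct_ips_per_category(records):
    """Number of distinct IP addresses that accessed each category."""
    ips = defaultdict(set)
    for r in records:
        ips[r["category"]].add(r["ip"])
    return {category: len(addresses) for category, addresses in ips.items()}


def female_visitors(records, url):
    """Number of distinct female users who accessed the given page."""
    return len({r["swid"] for r in records
                if r["url"] == url and r["gender"] == "F"})


def sum_of_ages(records):
    """Sum of ages, counting each user once and skipping unknown ages."""
    ages = {r["swid"]: r["age"] for r in records if r["age"] is not None}
    return sum(ages.values())


# Made-up joined records; u1 visits the same page twice.
records = [
    {"swid": "u1", "ip": "10.0.0.1", "url": "/shoes", "category": "shoes", "gender": "F", "age": 30},
    {"swid": "u1", "ip": "10.0.0.1", "url": "/shoes", "category": "shoes", "gender": "F", "age": 30},
    {"swid": "u2", "ip": "10.0.0.2", "url": "/shoes", "category": "shoes", "gender": "M", "age": 41},
    {"swid": "u3", "ip": "10.0.0.3", "url": "/bags", "category": "bags", "gender": "F", "age": None},
]
print(distinct_ips_per_category(records))
print(female_visitors(records, "/shoes"))
print(sum_of_ages(records))
```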
Conclusion
The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.
It is therefore important to analyze the behavior and location of visitors.
Processing large data with traditional sequential methods is inefficient.
MapReduce is therefore used for processing large datasets.
Future Scope
Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.
Location search is used by various industries, such as telecom and e-commerce, and in event detection.
The nearest-location method can be fused with any other method to support better decision making.
A tradeoff would then be made between distance and the other fused factor.
Thank you !!!