Post on 29-Jun-2015
Clickstream Data Warehouse – turning clicks into customers
Albert Hui
About Me
• Associate Director with EPAM Canada
• Over 12 years with Business Intelligence/Data Warehousing
• Over 7 years with Java and web technologies
• BIDW Architect, Big Data Evangelist
• Conference Speaker at IOUG, TOUG Collaborate 2011, 2012 and 2013
• Technical editor on Oracle 12c Book.
4/19/2013 22
• Master in Engineering in the area of Artificial Intelligence – Fuzzy logic
• MBA, University of Toronto
• Toronto based
• Twitter: @dataeconomist
• Father of twin boys
Agenda
Objective of this Session
What is Clickstream data?
How to collect Clickstream data?
Use Cases
Challenges – what are we trying to solve?
Solutions
Live Demo
How to Start?
Concluding Thoughts
Q&A
Some Leaders Who Chose EPAM.
Objective of this session
• Introduction of Clickstream Data
• Solutions and available technologies
• Get started individually and as an organization - a sample demo
• Start thinking how to fully utilize Clickstream Data
Movie – A Beautiful Mind
Sales – how to sell a lobster
www.bishopbigideas.com
Let’s have a quick quiz
• In the US, a 45-year-old male with 3 children, an income of around 150-180K and a postgraduate education wants to buy a car. Which brand?
Quick Quiz
• In the US, a 45-year-old male with 3 children, a 180K income and a graduate school education wants to buy a car. And he lives in Texas. Then which brand?
• In the US, a 45-year-old male with 3 children and a graduate school education wants to buy a car. He lives in Texas, he is a single parent, his income is <Unknown>, but he is looking to travel to Florida ONLY. Then which brand?
• But would these preferences change over time (evolving behaviours)? How do we keep up?
@ Miami parking lot
What is Clickstream Data?
What is Clickstream?
• A Clickstream is the recording of the parts of the screen
a computer user clicks on while web browsing or using
another software application. As the user clicks
anywhere in the webpage or application, the action is
logged on a client or inside the web server, as well as
possibly the web browser, router, proxy server or ad
server. Clickstream analysis is useful for web activity
analysis, software testing, market research, and for
analyzing employee productivity.
Source: Wikipedia
• Clickstream is not just weblogs.
• It can be essentially any interaction you have with an electronic device:
– TV PVRs.
– Smartphones.
– Game consoles.
– Sensors: security systems, highways.
– E-payment cards, loyalty cards.
– Geolocation.
– Maybe more:
• Alarm clocks.
• Printers.
• Parking, etc.
• Clickstream data is not new.
– Clickstream Data Warehousing, by Mark Sweiger, was published in January 2002.
• There are essentially two types of Clickstream data:
– An individual site’s Clickstream (click path)
– Internet Clickstream data
• Server weblogs account for 75% of daily data generation, according to Gartner.
• Facebook alone captures 1.5PB of weblog data daily.
• Amazon captures 200TB of weblog data daily.
Sample of Clickstream Data
• Web logs
204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
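The sample web log entries follow the common "combined" log layout (host, identity, user, timestamp, request, status, bytes, referrer, user agent). A minimal sketch of parsing one such line in Python, assuming that layout and an English month-name locale; the field names and the `parse_log_line` helper are illustrative choices, not part of the original deck:

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that have a natural type.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = (
    '204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 '
    '"http://metacrawler.com/crawler?general=dimensional+modeling" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Each parsed record then carries the visitor address, the requested page, the referrer and the timestamp, which are the raw ingredients for click-path analysis.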
Clickstream – Click-path Analytics
• A click path is the sequence of links a site visitor follows.
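A click path can be rebuilt from individual click events by grouping them per visitor and ordering them by time. A minimal sketch, assuming each event is a (visitor, timestamp, page) tuple with ISO-formatted timestamps; the tuple layout and the `click_paths` name are assumptions made for illustration:

```python
from collections import defaultdict

def click_paths(events):
    """Group (visitor, timestamp, page) events into one ordered page list per visitor."""
    paths = defaultdict(list)
    # Sort by visitor, then by timestamp, so pages are appended in visit order.
    for visitor, _timestamp, page in sorted(events, key=lambda e: (e[0], e[1])):
        paths[visitor].append(page)
    return dict(paths)

# Events deliberately out of order, as they might arrive from several log files.
events = [
    ("204.243.130.5", "2001-02-26T15:35:26", "/articles.html"),
    ("204.243.130.5", "2001-02-26T15:34:52", "/"),
    ("204.243.130.5", "2001-02-26T15:34:53", "/logo1.gif"),
]
paths = click_paths(events)
print(paths["204.243.130.5"])
```

In practice a visitor would usually be identified by a cookie or session id rather than an IP address, and long gaps between clicks would split one visitor's events into separate sessions.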
Let’s take another quick quiz
Quiz 2: Which one is the more frustrated customer?
Customer A
Customer B
What if I tell you the customer is a deal finder?
How is Clickstream Data collected?
Clickstream – how to collect
• Web Logs
– No JavaScript tracking code is needed. The data is collected by the web server independently of a visitor’s browser. It captures all the requests made to your web server, including pages, images and PDFs.
• Page Tagging
– Google Analytics is implemented with "page tags". A page tag, in this case called the Google Analytics Tracking Code (GATC), is a snippet of JavaScript code that the website owner adds to every page of the website. The GATC runs in the client browser when the client browses the page (if JavaScript is enabled in the browser), collects visitor data and sends it to a Google data collection server as part of a request for a web beacon.
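With page tagging, the visitor data travels as query parameters on the web beacon request. A minimal sketch of how a collection server might decode such a request; the host name `collector.example.com` and the parameter names `page`, `ref` and `res` are hypothetical and are not the actual Google Analytics format:

```python
from urllib.parse import urlparse, parse_qs

def parse_beacon(url):
    """Decode the query string of a beacon request into a flat dict of visitor data."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in query.items()}

# Hypothetical beacon request as a page tag might issue it.
beacon_url = (
    "http://collector.example.com/beacon.gif"
    "?page=%2Farticles.html&ref=http%3A%2F%2Fexample.com%2F&res=1024x768"
)
hit = parse_beacon(beacon_url)
print(hit)
```

Unlike web logs, this data only exists for browsers that actually executed the tag, which is why the two collection methods can report different counts.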
What about some Use Cases for Clickstream?
Clickstream – Use Cases
• Internet traffic analytics is another type of Clickstream data, e.g.
– Google Analytics
– Yandex
– Kontagent
Clickstream – Use Case – Google Analytics
• Google Analytics measures how your site is performing:
– Competitor analytics
– Social and mobile analytics
– Advertising analytics
Clickstream – Use Case – Yandex
• Yandex is another big one, based in Russia
Clickstream – Use Cases – make money
Advertising on the Internet
1. Banner Ads
2. Paid Search
3. Email Campaign
Personalized Advertising
Minority Report-style shopping? The billboard that profiles you and then flashes up ads tailored to your tastes
Clickstream – Use Cases – medical field
Medical Science – electronic clicks
Clickstream – Use Cases – games
• Kontagent is the user analytics platform for developers, marketers, product managers, and strategic partners across the social and mobile web. The platform, kSuite, provides social data pattern visualization and analysis that delivers actionable insights via an on-demand service.
• San Francisco/Toronto based.
• It focuses on the gaming industry and records every click of the gamers.
• It tries to make gaming sites stickier.
• Raised over US$50M in the last 3 years.
Quiz #3
Clickstream – Quiz #3
1. What is the main focus of these analytics?
2. What are they missing?
YOU
SIMILARITY BETWEEN all of YOU
Collective Intelligence
Crowd Sourcing
What are we trying to solve?
Clickstream - Challenges
• Yes, you are right! We have too much data.
Clickstream - Challenges
• User demographic data is hard to get, due to local privacy laws.
• Users’ sense of privacy.
• User preferences change constantly; there are no one-size-fits-all rules.
Clickstream – What are we trying to solve?
Rules inside the data
Clickstream - Challenges
Gender | Age | Marital status | Occupation | No. of Kids | Income | Region | Race | Own a house | Car brand | Like sport | Like politics | Like business | ... | Click path | Buy
M | 25-35 | M | Engineer | 3 | 80-90K | Toronto | Caucasian | Y | BMW | - | N | Y | ... | ABACBCDE... | Y
M | 25-35 | S | Chemist | 1 | 50-60K | NY | Asian | Y | N/A | N | - | N | ... | AABEBFGHIGSJBA.. | Y
F | 35-45 | D | Chemist | 0 | 50-60K | Toronto | Caucasian | N | TOYOTA | N | N | - | ... | ABAEBFGHIGFSBA.... | N
F | 50-60K | M | Doctor | 6 | - | Minsk | Caucasian | Y | BMW | N | Y | Y | ... | ABAEBFGHIGFSBA..... | Y
F | 35-45 | D | Researcher | 0 | 50-60K | Toronto | Caucasian | Y | N/A | N | Y | N | ... | ABAEBFGHIGFSBA.. | N
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Clickstream – What are we trying to solve?
Prediction
Solutions here.
Clickstream – Solutions – Clickstream Data Warehouse
Problems: Too much data; Rules inside the data; Prediction
Solutions: Architecture and Schema; Data Vectorization
Clickstream – Solutions – handling too much data
Apache Hadoop
• Top-level Apache project
• Open source
• Software framework, written in Java
• Inspired by Google’s white papers on:
– Map/Reduce (MR)
– Google File System (GFS)
– Big Table
• Originally developed to support Apache Nutch
• Designed for
– Large-scale data processing
– Batch processing
– Sophisticated analysis
– Structured and unstructured data
Clickstream – Solutions – Data Vectorization
Clustering: Understanding data as vectors
• The vector denoted by the point (5, 3), i.e. X = 5, Y = 3, is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
Mahout Vector Implementations
1. DenseVector
2. RandomAccessSparseVector
3. SequentialAccessSparseVector
The sparse implementations store only non-zero values in memory
Vectors must implement the Java interfaces
java.io.Serializable
org.apache.mahout.math.VectorWritable
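The difference between the array form and the hash-map form of a vector can be sketched in a few lines. This is plain Python for illustration only, not the Mahout API; `to_sparse`, `to_dense` and the sample `answers` vector are hypothetical:

```python
def to_sparse(dense_vector):
    """Keep only the non-zero entries, keyed by their index."""
    return {index: value for index, value in enumerate(dense_vector) if value != 0}

def to_dense(sparse_vector, size):
    """Expand an index-to-value mapping back into a full list, filling gaps with 0."""
    return [sparse_vector.get(index, 0) for index in range(size)]

# The point (5, 3) in both representations.
dense_point = [5, 3]
sparse_point = to_sparse(dense_point)

# A mostly-empty vector is where the sparse form saves memory.
answers = [0, 0, 137, 141, 157, 0, 0, 0, 0, 210]
sparse_answers = to_sparse(answers)
print(sparse_point, sparse_answers)
```

A dense layout suits vectors where most dimensions carry a value; a sparse layout suits clickstream-style data where each user touches only a few of many possible dimensions.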
Clickstream – Solutions – Data as n-dimensional vectors
Clustering: Understanding data as vectors
• Imagine one dimension for each feature of user, product, geography, time, etc.
• Each dimension is also called a feature or label
• Support Vector Machine (SVM)
Example dimensions: age, income, occupation
Clickstream – Solutions – Predictive Algorithms
Four major steps
1. Collect and model the data
2. Select/build a model
3. Train/test the model
4. Then predict what will happen
Clickstream – Solutions – Predictive Algorithms
• Apache Mahout: an Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
• http://mahout.apache.org
• Why Mahout?
– Many open source ML libraries either:
• Lack community
• Lack documentation and examples
• Lack scalability
• Lack the Apache License
• Or are research-oriented
• “Mahout” is a Hindi word for elephant driver
Clickstream – Solutions – Algorithms
Algorithms and Applications
• Recommenders, Clustering, Classification, Freq. Pattern Mining
• Math: Vectors/Matrices/SVD
• Utilities: Lucene/Solr
• Statistics, Probability
• Apache Hadoop
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Clickstream – Solutions – Algorithms – Mahout
Command line launcher
bin/mahout list (this shows the list of algorithms)
Valid program names are:
1. canopy: Canopy clustering
2. cleansvd: Cleanup and verification of SVD output
3. clusterdump: Dump cluster output to text
4. dirichlet: Dirichlet Clustering
5. fkmeans: Fuzzy K-means clustering
6. fpg: Frequent Pattern Growth
7. itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering
8. kmeans: K-means clustering
9. lda: Latent Dirichlet Allocation
10. ldatopics: LDA Print Topics
11. lucene.vector: Generate Vectors from a Lucene index
12. matrixmult: Take the product of two matrices
13. meanshift: Mean Shift clustering
14. recommenditembased: Compute recommendations using item-based collaborative filtering
...
Clickstream – Solutions – Algorithms – build a model
• Learn a model from a manually labelled training dataset
• Predict the class of an unseen object based on its features
• E.g. use features of the user profile, product and click path to predict users’ preferences.
Clickstream – Solutions – Clickstream Data Warehouse
Traditional Clickstream Data Warehouse Schema
Common Dimensions:
1. Customer
2. Product
3. Time
4. Geography
5. Page
6. Content (meta-data)
7. User
Facts:
1. Sales
2. User Activities
Design:
Schema design depends on the data we have and the measures we have
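The dimensions and facts listed above map naturally onto a star schema. A minimal sketch using an in-memory SQLite database with three of the dimensions (User, Page, Time) and the User Activities fact; all table names, column names and sample rows are hypothetical and chosen only to illustrate the shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per user, page and day.
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE dim_page (page_key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT);

-- Fact table: one row per user/page/day with the measured click count.
CREATE TABLE fact_user_activity (
    user_key INTEGER REFERENCES dim_user(user_key),
    page_key INTEGER REFERENCES dim_page(page_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    clicks   INTEGER
);

INSERT INTO dim_user VALUES (1, 'visitor-a'), (2, 'visitor-b');
INSERT INTO dim_page VALUES (1, '/'), (2, '/articles.html');
INSERT INTO dim_time VALUES (1, '2001-02-26');
INSERT INTO fact_user_activity VALUES (1, 1, 1, 3), (1, 2, 1, 1), (2, 2, 1, 4);
""")

# Typical warehouse query: total clicks per page, joining the fact to a dimension.
rows = conn.execute("""
    SELECT p.url, SUM(f.clicks)
    FROM fact_user_activity AS f
    JOIN dim_page AS p ON p.page_key = f.page_key
    GROUP BY p.url
    ORDER BY p.url
""").fetchall()
print(rows)
```

Adding Customer, Product, Geography or Content follows the same pattern: one more dimension table and one more key column on the fact table.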
Source: Clickstream Data warehouse By Mark Sweiger
Source: Clickstream Data warehouse by Albert H
Clickstream – Solutions – technology stack
• Clickstream logs
• Apache Hadoop, Apache Hive, HBase, ZooKeeper
• Model: statistical algorithms, Mahout
• Data movement: ETL (INFA, Talend)
• Data warehouse: RDBMS (Oracle, MySQL)
• Reporting: BI tool, reports
• Application: web app
• Hosting models
What about a Case study - demo?
Clickstream – Case Demo
• An Asia-based hotspot Wi-Fi provider with wireless routers throughout China/Hong Kong.
• Revenue model: advertising
– Advertisers place ads when users browse the Net.
• Data
– Survey data: users are required to fill in a survey before logging in.
– Click logs, including ad click-throughs
• Data size:
– 12GB+ compressed a day.
– 150M+ clicks and 2.4M click-throughs a day.
• Problem definition: the click-through rate is too low
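The click-through rate implied by these figures can be worked out directly. A small sketch, treating the quoted "150M+" and "2.4M" as exact daily counts, which is an approximation:

```python
# Daily volumes quoted for the case study (treated as exact for illustration).
clicks_per_day = 150_000_000
click_throughs_per_day = 2_400_000

# Click-through rate = ad click-throughs divided by total clicks.
click_through_rate = click_throughs_per_day / clicks_per_day
print(f"{click_through_rate:.1%}")
```

That is roughly 16 ad click-throughs per 1,000 clicks, which is the baseline the predictive model is meant to improve.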
Hadoop – running Cloudera CDH4
• Meet the Clickstream logs
– Fields: MAC address, ad site clicked, router location, when the click is recorded
• Meet the survey questions
– Some sample survey questions
• Meet the answers and survey results
– Options for survey answers
• Vectorize the data for users who click weibo.com
– Columns shown: MAC address, data vectors
Training data set
– Resultant vector
– “Cosine value distance”
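Cosine distance compares the direction of two vectors and ignores their length, which makes it a common choice for comparing user vectors. A minimal pure-Python sketch of the general formula; it is not the Mahout implementation used in the demo, and the convention of returning 0 similarity for an all-zero vector is an assumption:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumed convention: an all-zero vector is similar to nothing.
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 for same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)

same_direction = cosine_distance([5, 3], [10, 6])
orthogonal = cosine_distance([1, 0], [0, 1])
print(same_direction, orthogonal)
```

A user whose vector lies close (small distance) to the resultant vector of known clickers is then a candidate for being shown the same ad.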
Test data set
– Area under the curve (AUC)
– Confusion matrix: a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.
Test Results (> 0.5 is good; 2 out of 6 are predicted right)
macaddr | q16 | q17 | q18 | q19 | q20 | q21 | q22 | q23 | q24 | q25 | AUC Value | Actually clicked
00:22:5f:34:54:3e | 116 | 166 | 135 | 146 | 157 | 169 | 172 | 177 | 183 | 193 | 0.76 | Y
00:1f:5b:b3:26:6d | 117 | 125 | 136 | 144 | 162 | 0 | 0 | 0 | 0 | 197 | 0.65 | N
00:1a:73:e8:56:c6 | 117 | 122 | 137 | 152 | 159 | 169 | 172 | 177 | 190 | 195 | 0.65 | N
00:18:de:1f:fe:c0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 193 | 0.61 | Y
00:1e:65:51:34:80 | 0 | 0 | 137 | 141 | 157 | 0 | 0 | 0 | 0 | 210 | 0.59 | N
00:17:c4:a9:16:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | N
00:1f:3b:06:87:3d | 118 | 131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 201 | 0.41 | Y
00:21:19:a4:8d:ea | 0 | 0 | 134 | 151 | 157 | 170 | 172 | 177 | 184 | 211 | 0.32 | N
00:1e:65:7d:2d:d2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.29 | N
00:16:44:c7:80:35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.24 | Y
00:16:44:d4:11:9a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.22 | Y
00:13:02:a4:33:9c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | Y
00:21:19:9a:64:ad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.18 | N
00:1f:df:75:0a:8e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | N
00:25:d3:50:37:92 | 118 | 127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | Y
00:21:00:d6:98:2c | 118 | 123 | 0 | 0 | 0 | 169 | 172 | 176 | 187 | 192 | 0.11 | Y
00:17:c4:9b:2c:e2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | N
00:0d:f0:6d:fc:47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | Y
00:1e:65:3f:e1:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 177 | 188 | 0 | 0.1 | N
00:21:00:e3:a5:f1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.08 | N
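The "2 out of 6" reading of the test results can be reproduced by counting the four confusion-matrix cells. A minimal sketch, assuming the "AUC Value" column is used as a per-user score and that a score above 0.5 counts as a predicted click; the `confusion_matrix` helper is illustrative:

```python
# (score, actually clicked) pairs copied from the 20 distinct rows of the test results.
results = [
    (0.76, "Y"), (0.65, "N"), (0.65, "N"), (0.61, "Y"), (0.59, "N"),
    (0.53, "N"), (0.41, "Y"), (0.32, "N"), (0.29, "N"), (0.24, "Y"),
    (0.22, "Y"), (0.20, "Y"), (0.18, "N"), (0.16, "N"), (0.13, "Y"),
    (0.11, "Y"), (0.11, "N"), (0.11, "Y"), (0.10, "N"), (0.08, "N"),
]

def confusion_matrix(rows, threshold=0.5):
    """Count true/false positives and negatives for scores above the threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, actual in rows:
        predicted = score > threshold
        clicked = actual == "Y"
        if predicted and clicked:
            counts["tp"] += 1
        elif predicted and not clicked:
            counts["fp"] += 1
        elif clicked:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

matrix = confusion_matrix(results)
precision = matrix["tp"] / (matrix["tp"] + matrix["fp"])
print(matrix, round(precision, 2))
```

Under these assumptions, six users score above 0.5 and two of them actually clicked, which matches the note on the slide.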
• Meet the ETL process with Talend BD V 5.2
• Meet some sample reports
Objective of this session
• Introduction of Clickstream Data
• Solutions and available technologies
• Get started individually and as an organization - a sample demo
• Start thinking how to fully utilize Clickstream Data
Thank you!
Albert Hui, MBA, MASc., P.Eng, CSM
EPAM Canada, Associate Director
Email: albert_hui@epam.com
Follow me on Twitter: @dataeconomist
Please help fill out an evaluation form
www.ioug.org/eval
Session # 353