Webinar - The Science of Segmentation: What Questions You Should be Asking Your Data?

© 2015 Pivotal Software, Inc. All rights reserved.

The Science of Segmentation: What Questions Should You Be Asking Your Data?

April 14, 2015

Jarrod Vawdrey, Data Scientist @ Pivotal

Grace Gee, Data Scientist @ Pivotal


Agenda

• Typical State Of Companies New To Big Data Analytics

– Benefits of Big Data technologies

• When to Use Segmentation

– Common business problems

– Types of available data

• Use Cases & Approaches To Segmentation

– Common approaches

– Best practices


Typical State of Companies New to Big Data Analytics


Typical State of Companies New to Analytics

• Companies in the process of transforming into a data-driven organization often ask similar questions about where to start:

– How do I make data available for my analysts?

– What tools are needed to efficiently process and build models on my big data sets?

– What data should I be collecting and archiving?

– Where and how can I start to use all my data to quickly gain actionable insights and begin integrating data science into our organization’s practices?

– How do I leverage data to generate value for stakeholders?

– How do I enable analysts and data scientists to be more effective?


Common Business Challenges

Data Availability

• Disparate data sources

• No integration of data across lines of businesses

• Insufficient data

• Unknown single source of truth

Slow Time-to-Insight

• Outdated analytics architectures focused on operational processes often hamper the experimental nature of big data analytics

• Lack of knowledge about analytics software for in-place processing of and computation on Big Data

• Company organizational structure inhibits fast acquisition of data and communication of insights


Big Data Technologies for Data-Driven Organizations

• Data Lake: efficient, massively scalable Big Data storage platform

– Store all data: we don’t want to inhibit the ability to answer future questions

– Save all types of data (structured, unstructured, and semi-structured): we may not immediately know the “optimal” form in which to store data for analysis

– Work with multiple types of data from one location

– Centralized location of data accessible to all organizations

• Agile Analytics Platform: purpose-built architecture for getting results and gaining insights quickly through parallel, in-place data analytics

– No required sampling due to limited memory

– No data movement

– Scalable analytics


Big Data Technologies for Data-Driven Organizations

[Figure: Platform-Driven Data Science. Diagram labels: Data Sources (Proprietary Structured Data; Proprietary Unstructured Data; Partner Data; Self Reporting: Google, Weblogs, Twitter; External Sources: Census, Nielsen, Weather, etc.; Sensors); platform components (HAWQ, Greenplum DB, Pivotal HD (HDFS), GemFire XD, MADlib, PL/R, PL/Python, etc.); Enterprise Apps; Reporting; Prioritized Operational Processes; Inventory Optimization; Demand Forecasting; Fraud Detection]


Segmentation: An Important Step for Understanding Data

• What is segmentation?

– Automatic grouping of entities based on a common set of features

– Identification of patterns amongst similar entities

• What is segmentation good for?

– Identifying select features that greatly differentiate groups of entities

• E.g. Identifying behaviors of high-profit suppliers and low-profit suppliers

– Identifying similar characteristics amongst different groups

• E.g. Identifying similar market segments to target

– Predicting characteristics and behaviors of new or unknown entities

• E.g. Inferring missing labels, predicting market response to new products


Segmentation & Big Data Technologies

How cutting-edge Big Data technology enables faster insights

• Segmentation problems often deal with:

– Multiple data sources from multiple lines of business and external sources

– BIG DATA, particularly from sensors or transactions/point of sales

– High-dimensional feature sets

• Big Data technologies make segmentation problems feasible and shorten time-to-insight through:

– Ability to leverage and integrate all relevant data sources, no matter how large (Data Lake)

– Using ALL data to train segmentation models rather than relying on samples or a subset of data that fits into memory (MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.)

– Quickly building segmentation models and scoring new entities through parallelized, in-place computation (MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.)


When To Use Segmentation


Common Business Problems Where Segmentation Can Help

• Customer Micro-targeting

– Identifying market segments and their purchasing behaviors

• Operations & Logistics

– Identifying behaviors of underperforming or outperforming stores, suppliers, delivery services, etc.

• Fraud

– Identifying normal and anomalous user behaviors within networks

• Domain Resolution

– Inferring labels or groups of similar web domains


Data Used In Segmentation of Customers

Power in leveraging both internal and external datasets

• Demographic profiles

• Sensor data

• Product metadata

• Shipment data

• Store metadata

• Transactions and invoices

• Delivery information

• Marketing plans

• External data: Census, Nielsen, social networks, etc.


Gaining Additional Value From External Data

Often companies do not or cannot collect sufficient data about their customers to construct a complete profile. Augmenting internal data with external sources allows companies to:

• Develop a 360 degree customer view

• Gain insights into how consumers are interacting with competitors

• Improve accuracy of predictive models

• Increase the value of internal data

[Figure: example internal and external data sources]

• Internal: Point of sales, Transaction data, Web/Apps logs, Investments, Market basket, Loans, CRM

• External: Traffic, Weather, IXI wealth complete, Haver time series, Dept. of Labor

Note: This list only represents a subset of data sources that should be considered.


Example: Using Census Data to Build Family Profiles

Consumer Packaged Goods (CPG) companies are often interested in building market profiles for micro-targeting to improve marketing strategies and supply chain planning.

Hypothesis:

• Not only are CPG companies interested in the individual consumer, but in the family profile as well

– E.g. Consumption of child products is affected by family size

Approach:

• Census Public Use Microdata Sample (PUMS) files include person records and housing records which can be combined in segmentation models to build rich family profiles.

[Figure: Households with Children*, plotted as fraction of households]

*Children as defined by a certain age group
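The approach above, combining person records with housing records, can be sketched in a few lines. This is only an illustration on made-up records, not the presenters' pipeline. It assumes the two record types share a household key named SERIALNO, that person records carry an age field AGEP, and that housing records carry a household size NP; check these names against the PUMS data dictionary for the release you use. The child age cutoff of 17 is also an assumption, since the slide only says children are defined by a certain age group.

```python
from collections import defaultdict

# Toy records, invented for illustration. Field names are assumptions
# to verify against the PUMS data dictionary.
housing = [
    {"SERIALNO": "H1", "NP": 4},
    {"SERIALNO": "H2", "NP": 2},
    {"SERIALNO": "H3", "NP": 1},
]
persons = [
    {"SERIALNO": "H1", "AGEP": 41},
    {"SERIALNO": "H1", "AGEP": 39},
    {"SERIALNO": "H1", "AGEP": 9},
    {"SERIALNO": "H1", "AGEP": 4},
    {"SERIALNO": "H2", "AGEP": 67},
    {"SERIALNO": "H2", "AGEP": 65},
    {"SERIALNO": "H3", "AGEP": 29},
]

CHILD_MAX_AGE = 17  # hypothetical cutoff for "child"


def build_family_profiles(housing, persons, child_max_age=CHILD_MAX_AGE):
    """Join person records to housing records and derive household features."""
    ages_by_household = defaultdict(list)
    for person in persons:
        ages_by_household[person["SERIALNO"]].append(person["AGEP"])

    profiles = {}
    for unit in housing:
        ages = ages_by_household.get(unit["SERIALNO"], [])
        n_children = sum(age <= child_max_age for age in ages)
        profiles[unit["SERIALNO"]] = {
            "household_size": unit["NP"],
            "n_children": n_children,
            "has_children": n_children > 0,
        }
    return profiles


profiles = build_family_profiles(housing, persons)
# Fraction of households with at least one child, as in the figure above.
share_with_children = sum(p["has_children"] for p in profiles.values()) / len(profiles)
print(profiles)
print(share_with_children)
```

The resulting per-household features (size, number of children) can then be used as inputs to a segmentation model alongside other demographic fields.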


Use Cases & Approaches to Segmentation


Common Approaches for Implementing Segmentation

Data Step

• Identify join relationships across all data sources

• Aggregate data to common granularity

Feature Step

• Identify and create features that can characterize the entities you want to segment, e.g. age, gender, types of last transactions, average time between visits, average spend, sensitivity to price change, etc.

Model Step

• Candidate algorithms: clustering strategies like k-means & hierarchical clustering, regression or hierarchical modeling and grouping by similar coefficients, ensemble methods, etc.

Analysis of Results

• Look at average features across clusters

• Look at average cluster features vs. population average (e.g. to find anomalous behavior)

• Identify common features amongst segments (e.g. opportunities for cross-sell/up-sell)


Example: Profiling Market and Consumer Segments

• Objective: Identify characteristics of consumers that prefer certain brands or products

• Common business challenges:

– No integration of data amongst different lines of businesses

– Internal data is not sufficient for building profiles

– No information about which consumers are more/less profitable

• Data sources:

– Point of sales, demographic data, loyalty data, product and store metadata, external data


Step 1: Consolidate Data Sources

• Identify relationships and joins amongst all data sources

• Clean data by removing outliers and imputing missing values if appropriate

– For example, use the median or weighted average value for a state to impute a missing value for a county

• Aggregate or select data to a common granularity that makes sense

– For example, demographic profiles can be built at the zip code or county level, and store profiles can be built at the individual store, tier, or region level

• Do gap analysis to determine the scope of data sufficient for analysis

– For example, a certain subset of customers may be missing data for a large time period and should be scoped out

[Figure: number of stores reporting over time]

Using an MPP database like Greenplum, we can join tables with billions of rows in a little over a minute.
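Two of the cleaning steps above, imputing a missing county value from its state and scoping out entities with large reporting gaps, can be sketched as follows. This is a small in-memory illustration with made-up rows. The field names, the 0.8 coverage threshold, and the helper names `impute_with_group_median` and `entities_in_scope` are all hypothetical. At the data volumes discussed here the same logic would more likely be expressed as SQL in the MPP database.

```python
from collections import defaultdict
from statistics import median


def impute_with_group_median(rows, group_key, value_key):
    """Fill missing (None) values with the median of the row's group."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    medians = {group: median(values) for group, values in observed.items()}
    return [
        # A group with no observed values at all stays None.
        {**row, value_key: medians.get(row[group_key])}
        if row[value_key] is None
        else dict(row)
        for row in rows
    ]


def entities_in_scope(reports, all_periods, min_coverage=0.8):
    """Keep entities that reported in at least min_coverage of the periods.

    reports is an iterable of (entity_id, period) pairs that have data.
    """
    periods = set(all_periods)
    seen = defaultdict(set)
    for entity, period in reports:
        if period in periods:
            seen[entity].add(period)
    return sorted(e for e, p in seen.items() if len(p) / len(periods) >= min_coverage)


# Made-up county rows: one county per state is missing its value.
counties = [
    {"state": "S1", "county": "c1", "median_income": 62000},
    {"state": "S1", "county": "c2", "median_income": 70000},
    {"state": "S1", "county": "c3", "median_income": None},
    {"state": "S2", "county": "c4", "median_income": 68000},
    {"state": "S2", "county": "c5", "median_income": None},
    {"state": "S2", "county": "c6", "median_income": 60000},
]
imputed = impute_with_group_median(counties, "state", "median_income")

# Made-up reporting log: store_2 only reported in two of five periods.
periods = ["2015-01", "2015-02", "2015-03", "2015-04", "2015-05"]
reports = [("store_1", p) for p in periods] + [
    ("store_2", "2015-01"),
    ("store_2", "2015-05"),
]
in_scope = entities_in_scope(reports, periods)
print(imputed[2], imputed[4], in_scope)
```

The median is used here rather than a weighted average only to keep the sketch short; the slide names both as options.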


Step 2: Feature Engineering & Selection

• Transactions & Point of Sales: total sales, change in sales, price, discount, market basket

• Store/Location: geolocation, weather

• Product: department/type, color, size, brand, package

• Demographic: age, gender, income, employment, education, family size, marital status, citizenship, language

• Loyalty: status, length of membership, activity

It’s common for data scientists to generate hundreds of thousands of features.


Step 2: Feature Engineering & Selection

To reduce feature dimensionality and account for unwanted bias due to the inclusion of highly correlated features, we can filter features using approaches such as:

• Principal Component Analysis

– Reduce the dimensionality of the feature space to a select number of principal components

• Iterative pairwise correlation comparison

– Calculate NxN pairwise correlations, where N is the number of features

– Remove the feature that appears in the greatest number of correlated pairs (correlation coefficient greater than some threshold)

– Iterate until no correlated pairs exist

Example: Subset of feature correlation matrix. The large number of features requires an automated approach to feature selection.
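The iterative pairwise correlation comparison described above can be sketched as a short loop. This is a minimal pure-Python illustration, assuming Pearson correlation, an absolute-value threshold of 0.85, and no constant-valued features; the feature names and values are invented. With hundreds of thousands of features the correlation matrix would be computed in-database or with a numerical library rather than like this.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.85):
    """Iteratively remove the feature involved in the most correlated pairs.

    features maps a feature name to its list of values.
    Returns (kept_names, dropped_names).
    """
    kept = dict(features)
    dropped = []
    while len(kept) > 1:
        names = list(kept)
        pair_counts = {name: 0 for name in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if abs(pearson(kept[a], kept[b])) > threshold:
                    pair_counts[a] += 1
                    pair_counts[b] += 1
        # Ties are broken by insertion order of the feature dict.
        worst = max(pair_counts, key=pair_counts.get)
        if pair_counts[worst] == 0:
            break  # no correlated pairs remain
        dropped.append(worst)
        del kept[worst]
    return list(kept), dropped


# Invented toy features: total_sales is the sum of the two channel features,
# so it correlates strongly with both of them.
features = {
    "in_store_sales": [1, 3, 5, 7],
    "online_sales": [3, 1, 7, 5],
    "total_sales": [4, 4, 12, 12],
    "discount_rate": [0.2, 0.1, 0.1, 0.2],
}
kept, dropped = drop_correlated(features)
print(kept, dropped)
```

In this toy case `total_sales` sits in two correlated pairs while each channel feature sits in one, so removing it alone clears every pair above the threshold.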


Step 3: Build Models

• Example: K-means Clustering

1. Create single feature vector for each entity, e.g. consumer

2. Use k-means clustering to identify k consumer segments

i. Try multiple training trials for multiple values of k

ii. Use any one of a variety of techniques for selecting optimal k, e.g. silhouette coefficient

3. Look at average features across segments to identify segment characteristics

4. Look at purchasing behaviors of each segment to identify segment preferences
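Steps 1 and 2 above can be sketched with a small, dependency-free k-means and a silhouette score. This is an illustrative implementation only, not the MADlib or in-database routine the presenters would have used. It assumes features are already scaled, all points are distinct, and each run yields at least two non-empty clusters. To keep the result reproducible, the "multiple trials" here differ by a deterministic farthest-first seeding start point rather than by random initialization, which is a substitution made for this sketch.

```python
import math


def farthest_first_init(points, k, start):
    """Deterministic seeding: begin at points[start], then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[start]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, start=0, max_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, inertia)."""
    centers = farthest_first_init(points, k, start)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for j in set(labels):
            others = [q for m, q in enumerate(points) if labels[m] == j and m != i]
            if others:
                mean_dist[j] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)
            continue
        a = mean_dist[own]
        b = min(d for j, d in mean_dist.items() if j != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def select_k(points, k_values, n_trials=3):
    """For each k keep the lowest-inertia trial, then pick k by silhouette."""
    results = {}
    for k in k_values:
        labels, _ = min((kmeans(points, k, start=t) for t in range(n_trials)), key=lambda r: r[1])
        results[k] = (silhouette(points, labels), labels)
    best_k = max(results, key=lambda k: results[k][0])
    return best_k, results


# Invented, already-scaled feature vectors for 12 consumers in 3 tight groups.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (10.0, 1.0), (10.2, 1.1), (9.9, 0.8), (10.1, 1.2),
    (1.0, 10.0), (0.8, 10.1), (1.2, 9.9), (1.1, 10.2),
]
best_k, results = select_k(points, k_values=[2, 3, 4])
print(best_k, results[best_k][1])
```

The silhouette coefficient is one of several criteria for choosing k; it is used here because it is the one named on the slide.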


Step 4: Extract Business Value from Results

• Segmentation models used to identify and profile consumer groups

– Calculate descriptive statistics for each segment and compare to uncover previously hidden opportunities

• Cross-sell/up-sell opportunities

• Potential data issues or supply chain execution opportunities regarding unequal proportion of product shipment or inventory to regional preference

• Rich set of reusable data assets made available for ongoing analysis & reporting

[Figure: heat map of Feature 1, Feature 2, Feature 3, ... (rows) by Cluster 1 to Cluster 4 (columns), shaded from low value to high value, compared across clusters]
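The comparison behind a heat map like the one above can be sketched as a table of per-segment feature means next to the population mean. This is a minimal illustration on invented rows and labels. The helper name `segment_profile` and the ratio-to-population measure are assumptions for the sketch, and it presumes the population mean of each feature is non-zero.

```python
from collections import defaultdict
from statistics import mean


def segment_profile(rows, labels, feature_names):
    """Average each feature per segment and relate it to the population mean."""
    population = {f: mean(r[f] for r in rows) for f in feature_names}
    members_by_segment = defaultdict(list)
    for row, label in zip(rows, labels):
        members_by_segment[label].append(row)

    profile = {}
    for label, members in sorted(members_by_segment.items()):
        profile[label] = {}
        for f in feature_names:
            segment_mean = mean(m[f] for m in members)
            profile[label][f] = {
                "segment_mean": segment_mean,
                # Ratios well above or below 1 flag features that set the segment apart.
                "ratio_to_population": segment_mean / population[f],
            }
    return population, profile


# Invented consumers with segment labels from a previously fitted model.
rows = [
    {"avg_spend": 20, "visits": 2},
    {"avg_spend": 30, "visits": 4},
    {"avg_spend": 80, "visits": 8},
    {"avg_spend": 70, "visits": 10},
]
labels = [0, 0, 1, 1]
population, profile = segment_profile(rows, labels, ["avg_spend", "visits"])
for label, stats in profile.items():
    print(label, stats)
```

Reading the ratios across segments gives the same low-to-high contrast that the heat map encodes with color.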


What Questions Should You Be Asking Your Data?

• Are you collecting the right data & storing it in the right fashion?

• Do you have the right technology to support your data and data science endeavors?

• Where are the gaps in your data? How can external sources fill those gaps?

• How can your data sources be joined or aggregated together to build rich feature sets?

• How can you extract business value from your data?

Segmentation will help you answer all of these questions!


Thank You.
