MBA7025_08.ppt/Mar 24, 2015/Page 1 Georgia State University - Confidential
MBA 7025
Statistical Business Analysis
Data Warehousing & Data Mining
Mar 24, 2015
Agenda
Data Mining
Data Warehouse & Relational Database
Designing & Building the Data Warehouse
Appendix: SQL
The Data Warehouse
The data warehouse:
• is physically separated from all other operational systems
• holds aggregated and transactional data for management use, kept separate from the data used for online transaction processing (OLTP)
Data Flow
[Diagram: data flow among Legacy Systems, Operational Data Store, Data Warehouse, Data Mart, and Personal Data Warehouse, with Metadata]
Metadata
What is Metadata?
• Data about Data
• Without metadata, the data is meaningless
• Provides consistency of the truth
Components of Metadata
• Transformation Mapping
• Extraction and Relationship History
• Algorithms for Summarization (and calculations)
• Data Ownership
• Patterns of Warehouse Access
• Business Friendly naming conventions
• Status Information
Data Warehouse Vendors
• Business Objects
• Cognos
• Hyperion
• IBM
• Microsoft
• NCR / Teradata
• Oracle
• SAS
Relational Database
A relational database is a collection of data items organized as a set of formally-described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. The relational database was invented by E. F. Codd at IBM in 1970.
The standard user and application program interface to a relational database is the structured query language (SQL). SQL statements are used both for interactive queries for information from a relational database and for gathering data for reports.
A relational database is a set of tables containing data fitted into predefined categories. Each table (which is sometimes called a relation) contains one or more data categories in columns. Each row contains a unique instance of data for the categories defined by the columns. For example, a typical business order entry database would include a table that described a customer with columns for name, address, phone number, and so forth. Another table would describe an order: product, customer, date, sales price, and so forth. A user of the database could obtain a view of the database that fitted the user's needs. For example, a branch office manager might like a view or report on all customers that had bought products after a certain date. A financial services manager in the same company could, from the same tables, obtain a report on accounts that needed to be paid.
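The order entry example above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names, sample rows, and the cutoff date are all hypothetical and are not taken from the slides.

```python
import sqlite3

# In-memory database; every name and value below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE orders (order_no INTEGER PRIMARY KEY, cust_no INTEGER,
                     product TEXT, order_date TEXT, sales_price REAL);
INSERT INTO customer VALUES (1, 'Ann Lee', '555-0100'), (2, 'Bob Ray', '555-0101');
INSERT INTO orders VALUES
    (10, 1, 'Widget', '2015-01-15', 20.0),
    (11, 2, 'Gadget', '2015-03-02', 35.0);
-- A view for the branch manager: customers who bought after a given date
CREATE VIEW recent_buyers AS
    SELECT DISTINCT c.name
    FROM customer c
    JOIN orders o ON o.cust_no = c.cust_no
    WHERE o.order_date > '2015-02-01';
""")

# The view is queried like a table; the base tables are not reorganized.
recent = [row[0] for row in conn.execute("SELECT name FROM recent_buyers")]
print(recent)
```

A second view over the same two tables could serve the financial services manager without changing how the data is stored.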
Relational Database
When creating a relational database, you can define the domain of possible values in a data column and further constraints that may apply to that data value. For example, a domain of possible customers could allow up to ten possible customer names but be constrained in one table to allowing only three of these customer names to be specifiable.
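One way to picture a column constraint is a CHECK clause that restricts a column to a subset of allowed values. The sketch below uses Python's sqlite3 module; the table name and the three customer names are hypothetical, and other database products may express domains differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: only three customer names are allowed in this column.
conn.execute("""
CREATE TABLE key_account (
    customer_name TEXT NOT NULL
        CHECK (customer_name IN ('Acme', 'Globex', 'Initech'))
)""")

conn.execute("INSERT INTO key_account VALUES ('Acme')")  # allowed value

try:
    conn.execute("INSERT INTO key_account VALUES ('Umbrella')")  # not allowed
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM key_account").fetchone()[0]
print(rejected, row_count)
```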
The definition of a relational database results in a table of metadata or formal descriptions of the tables, columns, domains, and constraints. Meta is a prefix that in most information technology usages means "an underlying definition or description." Thus, metadata is a definition or description of data and metalanguage is a definition or description of language.
A database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
SQL (Structured Query Language) is a standard interactive and programming language for getting information from and updating a database. Although SQL is both an ANSI and an ISO standard, many database products support SQL with proprietary extensions to the standard language. Queries take the form of a command language that lets you select, insert, update, find out the location of data, and so forth.
Business Intelligence Environment
[Diagram: Internal Source Systems and External Data Sources feed an Extract, Transformation and Load step, which loads the Data Warehouse and Data Mart]
Relational Database
• IBM DB2, DB2/400
• Microsoft SQL/Server
• Teradata
• Oracle
• Sybase
• Informix / Red Brick
• Microsoft Access
• MySQL
SQL
SQL – Structured Query Language
1. DDL – Data Definition Language
• Create
• Drop
• Alter
2. DML – Data Manipulation Language
• Insert
• Update
• Delete
• Select
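The DDL and DML statement groups listed above can be tried together in a short session. This is a sketch using Python's sqlite3 module; the product table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: statements that define or change structure
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE product ADD COLUMN price REAL")

# DML: statements that read or change the rows
cur.execute("INSERT INTO product (id, name, price) VALUES (1, 'Widget', 9.5)")
cur.execute("UPDATE product SET price = 10.0 WHERE id = 1")
rows = cur.execute("SELECT id, name, price FROM product").fetchall()
cur.execute("DELETE FROM product WHERE id = 1")
remaining = cur.execute("SELECT COUNT(*) FROM product").fetchone()[0]

# DDL again: remove the structure itself
cur.execute("DROP TABLE product")
print(rows, remaining)
```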
Relational Database
[Diagram: a Software Application sends an SQL Request to the RDBMS, which returns a Result Set]
Agenda
Data Warehouse & Relational Database
Data Mining
Designing & Building the Data Warehouse
Appendix: SQL
Why Business Intelligence
1. Improve consistency and accuracy of reporting
2. Reduce stress on operational systems for reporting and analysis
3. Faster access to information
4. BI tools provide increased analytical capabilities
5. Empowering the Business User
6. Companies are realizing that data is a company’s most underutilized asset
ERM vs. DM
ERM - Entity Relationship Model
• Remove redundancy
• Efficiency of transactions
DM - Dimensional Model
• Intuitive View of the Data
• Efficiency of access and analysis
Dimensional Model
Star Schema
Fact Table: Foreign_Key_1, Foreign_Key_2, Foreign_Key_3, Foreign_Key_4, Metric_1, Metric_2, . . .
Dimension Table (shown four times, once per foreign key): Primary_Key, Descriptive_Attribute_1, Descriptive_Attribute_2, Descriptive_Attribute_3, Descriptive_Attribute_4, Descriptive_Attribute_5, Descriptive_Attribute_6, Descriptive_Attribute_7, . . .
Retail Sales Dimensional Model (Partial)
Sales Fact Table: Time_Key (FK), Product_Key (FK), Store_Key (FK), Customer_Key (FK), Units, Revenue, Cost, . . .
Product Dimension Table: Product_Key (PK), SKU_Number, Description, Brand, Product_Category, Size, . . . Etc.
Customer Dimension Table: Customer_Key (PK), Customer_Name, Purchase_Profile, Credit_Profile, Demographic_Category, Address, . . . Etc.
Time Dimension Table: Time_Key (PK), Date, Day_of_Week, Week_Number, Month, . . . Etc.
Store Dimension Table: Store_Key (PK), Store_ID, Store_Name, Address, District, Floor_Plan, . . . Etc.
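A query against a star schema typically joins the fact table to the dimensions it needs, filters on dimension attributes, and aggregates the metrics. The sketch below uses Python's sqlite3 module with a cut-down version of the retail model; only two dimensions are included and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, hypothetical version of the retail sales star schema
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, district TEXT);
CREATE TABLE sales_fact (product_key INTEGER, store_key INTEGER,
                         units INTEGER, revenue REAL);
INSERT INTO product_dim VALUES (1, 'BrandA'), (2, 'BrandB');
INSERT INTO store_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO sales_fact VALUES (1, 1, 2, 20.0), (1, 2, 1, 10.0), (2, 1, 5, 75.0);
""")

# Filter on a dimension attribute (district), aggregate fact metrics by brand.
query = """
SELECT p.brand, SUM(f.units), SUM(f.revenue)
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN store_dim s ON s.store_key = f.store_key
WHERE s.district = 'North'
GROUP BY p.brand
ORDER BY p.brand
"""
totals = conn.execute(query).fetchall()
print(totals)
```

The descriptive attributes live only in the dimension tables, so the fact table stays narrow even when it holds many rows.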
Fact Table
1. Contains Foreign Keys that relate to Dimension Tables
2. Has a many-to-one relationship to Dimension Tables
3. Contains Metrics to be aggregated
4. Typically contains no data elements other than foreign keys and metrics
5. Level of Granularity defines depth and flexibility of analysis
Sales Fact Table: Time_Key (FK), Product_Key (FK), Store_Key (FK), Customer_Key (FK), Units, Revenue, Cost, . . .
Dimension Table
1. Contains a Primary Key that relates to the Fact Table(s)
2. Has a one-to-many relationship to the Fact Table(s)
3. Contains Descriptive data used to filter and aggregate metrics from the Fact Table(s)
4. Can sometimes contain pre-aggregated data
Product Dimension Table: Product_Key (PK), SKU_Number, Description, Brand, Product_Category, Size, . . . Etc.
Agenda
Data Warehouse & Relational Database
Data Mining
• What is Data Mining?
• Market Basket Analysis
• Marketing Analytics – Direct Marketing Campaign
• Cluster Analysis
Designing & Building the Data Warehouse
Appendix: SQL
What is Data Mining?
• A set of activities used to find new, hidden, or unexpected patterns in data
• Verification versus Discovery
• Accuracy in predicting consumer behavior
OLAP – Online Analytical Processing
• MOLAP – Multidimensional OLAP
• ROLAP – Relational OLAP
[Diagram labels: Data Warehouse / Data Mart; RDBMS]
Limitations of Data Mining
• All relevant data items / attributes may not be collected by the operational systems
• Data noise or missing values (data quality)
• Large database requirements and multi-dimensionality
Techniques and Technologies
• Techniques Used to Mine the Data
  • Classification
  • Association
  • Sequence
  • Cluster
• Data Mining Technologies
  • Statistical Analysis
  • Neural Networks, Genetic Algorithms and Fuzzy Logic
  • Decision Trees
General Data Mining Methods
• Classification: Predicting which customers will purchase, based on demographics, psychographics, firmographics, service history, transactions, credit history, etc. Statistical algorithms and decision trees are used for these problems with much success.
• Association: Market Basket Analysis, e.g., which customers who purchase an additional telephone line are also likely to purchase dial-up internet service? Pattern matching works well: association rules, fuzzy logic, neural networks.
• Sequencing: Which types of activities precede each other; e.g., do customer hospitality and gaming activities show patterns or sequences? We use a combination of statistical modeling and simulations to identify these trigger points for action, and to estimate the marginal value of each.
• Clustering: Useful for determining similar groups based on how closely they resemble each other. A multitude of clustering techniques exist, with the primary difference being in how they define what is “close”. Clustering can be very useful for marketing messaging and advertising, strategy development and implementation, and channel development.
Analytics Process
[Diagram: analytics process in five phases: Discovery, Data Preparation, Knowledge Development, Leveraging Analytics, Post Analysis]
Diagram labels:
• Identifying Opportunities
• Scoping Effort
• Objective Setting
• Developing Hypotheses
• Data Warehouse
• External Data Append
• Data Extraction
• Data Validation
• Statistical Modeling
• Segmentation
• Offer Optimization
• Customer Behavior Scoring
• Hypothesis Testing
• Direct Mail
• Telemarketing
• Loyalty Campaign
• Results Decomposition
• Feedback for Refining Analytics
• Feedback
Market Basket Analysis
• Market Basket Analysis
  • Most common and useful in Marketing
  • What products customers purchase together
  • Example: Diapers and Beer sell well on Thursday nights
• Benefits
  • Better target marketing
  • Product positioning with stores (virtual stores)
  • Inventory management
• Limitations
  • Large volume of real transactions needed
  • Difficult to correlate frequently purchased items with infrequently purchased items
  • Results of previous transactions could have been affected by other marketing promotions
Market Basket Analysis
Association Rules for Market Basket Analysis
• All associations are unidirectional and take on the following form:
  Left-hand side rule IMPLIES Right-hand side rule
• Left and Right hand side can both contain multiple items (Multi-dimensional Market Analysis)
• Examples:
  • Steak IMPLIES Red Wine
  • Hunting Magazines IMPLIES Smokeless Tobacco
Market Basket Analysis
3 Measures of Market Basket Analysis
• Support – the percentage of baskets in the analysis where the rule is true
  • Of 100 baskets, 11 contained both steaks and red wine.
  • 11% support
• Confidence – the percentage of baskets with the left-hand side items that also contain the right-hand side items
  • Of the 17 baskets that contained steak, 11 contained red wine.
  • 65% confidence
• Lift – compares the rule's confidence with the likelihood of finding the right-hand item in any random basket
  • Also referred to as Improvement
  • Lift of less than 1 means the rule is less predictive than random choice
  • If Confidence is 35%, but the right-hand side item is in 40% of the baskets, the rule offers no Improvement over random selection.
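The three measures reduce to simple ratios. The sketch below recomputes the slide's own numbers in Python; the function names are illustrative, and lift is taken here as confidence divided by the share of baskets containing the right-hand item, which is the usual definition.

```python
def support(baskets_with_both: int, total_baskets: int) -> float:
    """Share of all baskets in which the whole rule holds."""
    return baskets_with_both / total_baskets


def confidence(baskets_with_both: int, baskets_with_lhs: int) -> float:
    """Share of left-hand-side baskets that also contain the right-hand side."""
    return baskets_with_both / baskets_with_lhs


def lift(rule_confidence: float, rhs_share: float) -> float:
    """Confidence relative to the baseline share of the right-hand item."""
    return rule_confidence / rhs_share


# Steak IMPLIES Red Wine: 100 baskets, 17 with steak, 11 with both.
steak_wine_support = support(11, 100)
steak_wine_confidence = confidence(11, 17)

# Lift example from the slide: 35% confidence against a 40% baseline.
example_lift = lift(0.35, 0.40)

print(f"support={steak_wine_support:.0%}, "
      f"confidence={steak_wine_confidence:.0%}, lift={example_lift:.3f}")
```

A lift below 1, as in the last example, means the rule picks out the right-hand item less often than a random basket would.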
Market Basket Analysis
Market Basket Analysis results can be:
• Trivial
  • Hot Dogs IMPLIES Hot Dog Buns
  • TV IMPLIES TV Warranty
• Inexplicable
Virtual Items – Associating non-items or other attributes into the correlation study, e.g. “New Customer”
Marketing Analytics Landscape
Strategy & Tactics: Guiding the business & helping to make numbers
Business Planning, Forecasting, Corp Strategy, Financial Metrics, Profitability Analysis
Customer Knowledge – Who are my customers?
Segmentation & Profiles, External Data, Mkt Share/Wallet Share, Channel Preference Modeling
Acquisition: Where can I find new customers?
Growth: Where can I find more revenue & profit from my current customers?
Retention: Which of my customers are at risk and how can I keep them?
Reacquisition: Which customers do I want to win back?
• Customer Acquisition
• Prospect profiling
• Event driven marketing
• Propensity to buy & response modeling
• Marketing Optimization
• Market Basket Analysis
• Online and Retail Channels
• Customer and product churn modeling
• Retentive stickiness of key products
• Prediction of key events (e.g., residential movers)
• Customer reacquisition
• Customer profitability analysis
Direct Marketing Campaign Platform
[Diagram: customer lifecycle flow with the stages ACQUIRE, RETAIN, REACTIVATE and “FIRE”, ranging from highest value customers to lowest value customers, with PURCHASE / NO PURCHASE / PURCHASED branches]
Diagram labels:
• Channels: Store, Different Channels
• Activation Promotion
• E-mail Address
• Vehicles: Statements, Newsletters, Inserts, Direct mail, Personalized kits, Telephone
• If: Vc [comparison operator missing] Cost to reactivate
• If: Vc < Cost to reactivate
• Ugly Postcard???
• Test Area
• POS, Partners, Advertising
• Vehicles: Direct Mail, Statements
• Triggered Promotions
• Downgrade trigger * (for example): Days since last purchase = X
  • X = 30 days for PTNM
  • X = 60 days for GOLD
  • X = 120 days for CLUB
• * < 1 purchase in last 12 mo
• If: Time since inactive = X, and Point balance > X
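The tiered downgrade trigger can be read as a lookup of a days-since-last-purchase threshold per membership tier. The Python sketch below uses the three thresholds shown on the slide; the function name, the dates, and the choice to fire the trigger once the gap reaches the threshold (rather than on the exact day) are assumptions.

```python
from datetime import date

# Thresholds from the slide: days since last purchase, per tier.
DOWNGRADE_DAYS = {"PTNM": 30, "GOLD": 60, "CLUB": 120}


def downgrade_triggered(tier: str, last_purchase: date, today: date) -> bool:
    """Assumption: the trigger fires once the gap reaches the tier's threshold."""
    days_since = (today - last_purchase).days
    return days_since >= DOWNGRADE_DAYS[tier]


# Hypothetical customers who last purchased on the same day.
today = date(2015, 3, 24)
last_purchase = date(2015, 1, 1)
gold_fires = downgrade_triggered("GOLD", last_purchase, today)
club_fires = downgrade_triggered("CLUB", last_purchase, today)
print(gold_fires, club_fires)
```

The same gap in purchases triggers a promotion for a higher tier sooner than for a lower one, which matches the slide's shorter windows for higher-value customers.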
Cluster Analysis
• Definition: The identification and grouping of consumers that share similar characteristics
• Yields: better understanding of prospects/customers
• Translates into: improved business results through revised strategies
• Process:
  • Data Selection
  • Missing Values
  • Standardization
  • Removal of Outliers
  • Cluster Analysis Considerations
Cluster Analysis
• Only want a small subset of variables for clustering
• Weed out undesirable variables
• Can use PROC FACTOR, PROC CORR
• Can use expert system
• Consideration for observations, weighting
• Probably done with factor analysis
• If not, then two options
• Set Missing to Mean of data
• Set Missing to Value of Equivalent Performance
• No right or wrong answer
• Might do both - depending on variables
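The data preparation steps for clustering (missing values, standardization, removal of outliers) can be sketched with the Python standard library. The slides mention SAS procedures, so this is a stand-in rather than the course's method; the income figures, the use of z-scores, and the cutoff of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Convert values to z-scores: mean 0, standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def drop_outliers(z_scores, limit=3.0):
    """Keep z-scores within +/- limit; the cutoff is an assumed convention."""
    return [z for z in z_scores if abs(z) <= limit]


# Hypothetical household incomes (,000) with one missing value.
incomes = [48, 45, None, 73, 71]
filled = impute_mean(incomes)
z = standardize(filled)
kept = drop_outliers(z)
print(filled, [round(v, 2) for v in z], len(kept))
```

Standardizing matters because most clustering methods measure distance, and a variable on a large scale would otherwise dominate the grouping.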
Clustering
[Diagram: the Prospect Base divided into clusters]
• Midscale / Leisure Traveler
• Upscale / Leisure Traveler
• Country Club / Resort Set
• Midscale / Business Traveler
• Upscale / Business Traveler – Prosperous Traveler
• Upscale / Business Traveler – Loan Dependent
• Other
Cluster Analysis
Attribute Name | Cluster A | Cluster B | Cluster C | Cluster D | Cluster E | (ALL)
Age of Head of Household | 38 | 62 | 48 | 44 | 52 | 43
Length of Residence in high income group zip codes | 7 | 12 | 9 | 6 | 7 | 7
Household Income (,000) | 48 | 45 | 102 | 73 | 71 | 72
Weekday Check in | 13 | 1 | 3 | 6 | 2 | 3
Weekend Check in | 69 | 6 | 29 | 51 | 7 | 30
No. Stays (resort) between Jan 1, 2001 and Jun 30, 2002 | 0 | 5 | 6 | 5 | 3 | 2
No. Stays (mid properties) between Jan 1, 2001 and Jun 30, 2002 | 11 | 55 | 21 | 15 | 32 | 16
No. Stays (upscale properties) between Jan 1, 2001 and Jun 30, 2002 | 24 | 2 | 10 | 15 | 8 | 7
Cluster Analysis
Cluster | Population % | Resp. Index | Avg. Profit
A | 6 | 250 | (75)
B | 16 | 30 | 5
C | 5 | 110 | 48
D | 8 | 175 | 86
E | 7 | 80 | (5)
. . . | . . . | . . . | . . .
All | 100 | 100 | 35
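One way to read a table like this is to look for clusters that respond better than average and are also profitable. The Python sketch below uses the five listed rows and rests on two assumptions: values in parentheses are negative (the accounting convention), and a response index of 100 marks the overall average, as the "All" row suggests. The selection rule itself is illustrative.

```python
# Rows A to E from the table; parenthesized profits are entered as negatives
# (an assumption about the slide's notation).
clusters = {
    "A": {"population_pct": 6, "resp_index": 250, "avg_profit": -75},
    "B": {"population_pct": 16, "resp_index": 30, "avg_profit": 5},
    "C": {"population_pct": 5, "resp_index": 110, "avg_profit": 48},
    "D": {"population_pct": 8, "resp_index": 175, "avg_profit": 86},
    "E": {"population_pct": 7, "resp_index": 80, "avg_profit": -5},
}

# Illustrative rule: above-average response and positive average profit.
targets = sorted(
    name
    for name, c in clusters.items()
    if c["resp_index"] > 100 and c["avg_profit"] > 0
)
print(targets)
```

Under these assumptions cluster A shows why response alone is not enough: it responds most strongly but loses money on average.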
Cluster Analysis
[Diagram: offer selection grid relating clusters (Cluster 1, . . .) to offers (DM/Offer 1, DM/Offer 2, . . ., DM/Offer N), with axes Return (Low to High) and Risk (Low to High) and a region labeled No-Mail]
• Calculate Scores (ROI, Response, Utilization)
• Overlay Profitability Estimate
• Evaluate Risk-Return Tradeoff (by Offer and by Cluster)
• Make Final Selections
Agenda
Data Warehouse & Relational Database
Data Mining
Designing & Building the Data Warehouse
Appendix: SQL
SQL Select Statement
SELECT column1, column2, . . .
FROM table1, table2, . . .
WHERE criteria1 AND/OR criteria2 . . .
ORDER BY column1, column2, . . .
SQL Select Statement
SELECT column1, column2, . . .
FROM table1, table2, . . .
WHERE criteria1 AND/OR criteria2 . . .
GROUP BY column1, column2, . . .
HAVING criteria1 AND/OR criteria2 . . .
ORDER BY column1, column2, . . .
Aggregation: GROUP BY and HAVING come after WHERE and before ORDER BY
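A small runnable query shows how the clauses interact: WHERE filters rows before grouping, HAVING filters groups after aggregation, and ORDER BY sorts the final result. The sketch uses Python's sqlite3 module; the sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store TEXT, product TEXT, revenue REAL);
INSERT INTO sales VALUES
    ('North', 'Widget', 100.0), ('North', 'Gadget', 50.0),
    ('South', 'Widget', 30.0),  ('East',  'Widget', 500.0);
""")

query = """
SELECT store, SUM(revenue) AS total
FROM sales
WHERE product = 'Widget'       -- row filter, applied before grouping
GROUP BY store
HAVING SUM(revenue) > 50       -- group filter, applied after aggregation
ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
print(result)
```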
SQL – Example 1
SQL
CREATE TABLE ADDR_BOOK (
NAME char(30),
COMPANY char(20),
E_MAIL char(25) )
Output
Name | Company | Email
John Smith | Microsoft | [email protected]
Jeff Jones | Delta | [email protected]
SQL – Example 2
2a)
SQL
SELECT
NAME,
COMPANY,
E_MAIL
FROM
ADDR_BOOK
WHERE COMPANY = 'Microsoft'
Output
Name | Company | Email
John Smith | Microsoft | [email protected]
2b)
Table - Product
ID | Name | Category
I | Internet | A
B | Browsers | A
A | Application | Null
G | Graphics | Null
SQL
SELECT
ID,
NAME
FROM
PRODUCT
WHERE CATEGORY IS NULL
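A point worth checking when filtering on missing categories: in SQL, a comparison written with = against NULL is never true, so rows with a missing value are found with IS NULL instead. The sketch below reproduces the Product table in Python's sqlite3 module and runs both forms; the lowercase table and column names are just a convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id TEXT, name TEXT, category TEXT);
INSERT INTO product VALUES
    ('I', 'Internet', 'A'), ('B', 'Browsers', 'A'),
    ('A', 'Application', NULL), ('G', 'Graphics', NULL);
""")

# "= NULL" evaluates to unknown for every row, so nothing is returned.
with_equals = conn.execute(
    "SELECT id FROM product WHERE category = NULL").fetchall()

# "IS NULL" is the test that matches missing values.
with_is_null = conn.execute(
    "SELECT id FROM product WHERE category IS NULL ORDER BY id").fetchall()

print(with_equals, with_is_null)
```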
SQL – Example 3
SQL
SELECT
ADDR_BOOK.NAME,
COMPANY.EMAIL
FROM
ADDR_BOOK,
COMPANY
WHERE ADDR_BOOK.EMPLOYEE_ID = COMPANY.EMPLOYEE_ID
Output
Name Email
John Smith [email protected]
Jeff Jones [email protected]
SQL – Example 4
SQL
CREATE TABLE CUSTOMER (
CUST_NO INTEGER,
FIRST_NAME CHAR(30),
LAST_NAME CHAR(30),
ADDRESS CHAR(50),
CITY CHAR(30),
STATE CHAR(2),
ZIP_CODE CHAR(9),
COUNTRY CHAR(20) )
-- Note: ORDER is a reserved word in SQL; most systems require quoting it or renaming the table (for example, ORDERS)
CREATE TABLE ORDER (
ORDER_NO INTEGER,
DATE_ENTERED DATE,
CUST_NO INTEGER )
SQL
SELECT
ORDER.ORDER_NO, CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME, CUSTOMER.ADDRESS, CUSTOMER.CITY, CUSTOMER.ZIP_CODE, CUSTOMER.COUNTRY
FROM
ORDER, CUSTOMER
WHERE ORDER.CUST_NO = CUSTOMER.CUST_NO
AND
ORDER.DATE_ENTERED = '1998-11-20'
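The same kind of two-table lookup can be run end to end as a check. This sketch uses Python's sqlite3 module with a reduced set of columns; the table is named orders because ORDER is a reserved word, the rows are made up, and the date is passed as a bound parameter in ISO format (YYYY-MM-DD).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER, first_name TEXT, last_name TEXT, city TEXT);
CREATE TABLE orders (order_no INTEGER, date_entered TEXT, cust_no INTEGER);
INSERT INTO customer VALUES (1, 'John', 'Smith', 'Atlanta'), (2, 'Jeff', 'Jones', 'Macon');
INSERT INTO orders VALUES (100, '1998-11-20', 1), (101, '1998-11-21', 2);
""")

# Explicit JOIN syntax; equivalent to listing both tables in FROM and
# matching the keys in WHERE.
query = """
SELECT o.order_no, c.first_name, c.last_name, c.city
FROM orders o
JOIN customer c ON o.cust_no = c.cust_no
WHERE o.date_entered = ?
"""
matches = conn.execute(query, ("1998-11-20",)).fetchall()
print(matches)
```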
SQL – Example 5
SQL
CREATE TABLE ADDR_BOOK (
NAME char(30),
COMPANY char(20),
E_MAIL char(25) )
Output
Name | Company | Email
John Smith | Microsoft | [email protected]
Jeff Jones | Delta | [email protected]
SQL – Example 6 – Referential Integrity
SQL
CREATE TABLE CUSTOMER (
CUST_NO INTEGER PRIMARY KEY,
FIRST_NAME CHAR(30),
LAST_NAME CHAR(30),
ADDRESS CHAR(50),
CITY CHAR(30),
ZIP_CODE CHAR(9),
COUNTRY CHAR(20) )
CREATE TABLE ORDER (
ORDER_NO INTEGER PRIMARY KEY,
DATE_ENTERED DATE,
CUST_NO INTEGER REFERENCES CUSTOMER (CUST_NO) )
SQL
CREATE TABLE ORDER_ITEMS (
ORDER_NO INTEGER NOT NULL,
ITEM_NO INTEGER NOT NULL,
PRODUCT CHAR(30),
QUANTITY INTEGER,
UNIT_PRICE MONEY )
ALTER TABLE ORDER_ITEMS
ADD CONSTRAINT PK_ORDER_ITEMS PRIMARY KEY (ORDER_NO, ITEM_NO)
ALTER TABLE ORDER_ITEMS
ADD CONSTRAINT FK_ORDER_ITEMS_1 FOREIGN KEY (ORDER_NO)
REFERENCES ORDER (ORDER_NO)
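What a foreign key buys in practice is that the database refuses rows pointing at a parent that does not exist. The sketch below demonstrates this with Python's sqlite3 module; the table is named orders to avoid the reserved word, the columns are reduced, and note that SQLite enforces foreign keys only after the PRAGMA shown is switched on, which is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE orders (
    order_no INTEGER PRIMARY KEY,
    cust_no  INTEGER REFERENCES customer (cust_no));
INSERT INTO customer VALUES (1, 'Smith');
INSERT INTO orders VALUES (100, 1);   -- valid: customer 1 exists
""")

try:
    # Customer 999 does not exist, so this order would be an orphan.
    conn.execute("INSERT INTO orders VALUES (101, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)
```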
SQL – Example 7 – Index
When you have a primary key, you already have an implicitly (or explicitly) defined unique index on the primary key columns. It's generally a good idea to define non-unique indexes on the foreign keys.
SQL
CREATE UNIQUE INDEX PK_CUSTOMER ON CUSTOMER (CUST_NO)
CREATE UNIQUE INDEX PK_ORDER ON ORDER (ORDER_NO)
CREATE INDEX FK_ORDER_1 ON ORDER (CUST_NO)
CREATE UNIQUE INDEX PK_ORDER_ITEMS ON ORDER_ITEMS (ORDER_NO, ITEM_NO)
CREATE INDEX FK_ORDER_ITEMS_1 ON ORDER_ITEMS (ORDER_NO)
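The difference between the two kinds of index can be observed directly: a unique index rejects duplicate key values, while a plain index on a foreign key column allows repeats and exists only to speed up lookups and joins. This sketch uses Python's sqlite3 module with reduced, hypothetical tables (orders instead of ORDER); PRAGMA index_list is SQLite's own way of listing indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER, last_name TEXT);
CREATE TABLE orders (order_no INTEGER, cust_no INTEGER);
CREATE UNIQUE INDEX pk_customer ON customer (cust_no);
CREATE UNIQUE INDEX pk_order ON orders (order_no);
CREATE INDEX fk_order_1 ON orders (cust_no);
-- Two orders for the same customer: fine, fk_order_1 is not unique
INSERT INTO orders VALUES (1, 10), (2, 10);
""")

try:
    # order_no 1 already exists, so the unique index rejects this row.
    conn.execute("INSERT INTO orders VALUES (1, 11)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Column 1 of each PRAGMA index_list row is the index name.
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(orders)"))
print(duplicate_rejected, index_names)
```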