Pentaho Dev Day 30th October 2013 - Meetupfiles.meetup.com/1804355/Pentaho_Zaponet_DevDay... · 1...
© 2013, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555 1
Pentaho Dev Day
30th October 2013
Sébastien Cognet
Sales Engineer EMEA
@opentoile
Agenda
PENTAHO Introduction
Analytic suite and more…
USER CONSOLE Architecture
Report, Analyze & Dashboard
Datasource management
Mobility
Demo #1
DEV TOOLS Metadata
Metadata Editor
Schema Workbench
Report Designer
CTools
INTEGRATION Graphic Design
Agile development
Demo #2
OEM Embedded analytics
Many architectures
BIG DATA Visual MapReduce
Big Data Layer
Blended Data
Demo #3
DATAMINING Weka
PDI
PENTAHO CORP
PRESENTATION
Pentaho Mission Big Data Analytics without Boundaries
Modern, unified data integration and business
analytics platform
• Native integration into big data ecosystem
• Embeddable, cloud-ready analytics
Critical mass achieved
• Over 1,000 commercial customers
• Over 10,000 production deployments
Fast and Broad Innovation
• Open source development model
• Extensible by customer, partner & community
Analytic suite and more …
Enterprise Analytic Platform Ease of Use / Administration / Audit
Data Integration
Traditional
Big Data Layer (framework)
Blended Data (as a service)
OEM
One value
CTOOLS framework
Multi-Tenant
Embedded
VISUALIZATION
Report
Analyze
Dashboard
Data Wizard
Services
Training
Workshop
Checkpoint
Consulting
Subscription
Productivity
Guarantee
ASSISTANCE
Concierge
Network
JIRA
Help-Desk
Infocenter
Community / Forums
Pentaho Business Analytics Platform Our approach addresses the challenges
Operational Data
Big Data
Data Stream
Public/Private Clouds
Multi-Tenant Ready Open API’s 100% Java
Access Integrate Cleanse Enrich
Score
Forecast
Connect Visualize
Report Dashboard
Analyze/Explore
DBA ETL/BI Developer
Business Users Executives
Analysts Data Scientists
Embed
Use Case Segments
Big Data
Business Analytics
Data Ingestion
Manipulation
Integration
Enterprise &
Ad Hoc Reporting
Data Discovery
Visualization
Predictive Analytics
Complete Big Data Analytics &
Visual Data Management
Relational Hadoop NoSQL Analytic
Databases
Pentaho Big Data Analytics
Pentaho Product Components
Pentaho Data Integration
• Visual development for big data
• Broad connectivity
• Data quality & enrichment
• Integrated scheduling
• Security integration
ANALYZER
• Visual data exploration
• Ad hoc analysis
• Interactive charts & visualizations
DASHBOARD DESIGNER
• Self-service dashboard builder
• Content linking & drill through
• Highly customized mash-ups
INTERACTIVE REPORT
• Both ad hoc & distributed reporting
• Drag & drop interactive reporting
• Pixel-perfect enterprise reports
Pentaho Data Mining / Predictive Analytics
• Model construction & evaluation
• Learning schemes
• Integration with 3rd-party models using PMML
Pentaho for Big Data MapReduce & Instaview
• Visual Interface for Developing MR
• Self-service big data discovery
• Big data access to Data Analysts
Pentaho Business Analytics
Modern Platform Built for the Future of
Analytics
100% Java, cross-platform Modular, lightweight, pluggable High-performance, scalable
Service-oriented architecture for easy integration Standards-based, highly extensible, easy to embed
Reporting, dashboards, analysis, data mining, predictive Power tools for business users, analysts, & data scientists
Structured, unstructured & NoSQL data Native support for emerging Big Data sources
End-to-end platform – unified data integration and business analytics Agile approach for fast prototyping & iterations Low cost subscription
Modern Architecture
Embedded Analytics
Broadest Spectrum of Insight
Diverse Data
Integrated, Low Cost Platform
Pentaho Core Competencies
Information Consumers / Business Users: Powerful Reporting and Visualization
Knowledge Workers / Business Users: Self-Service Analysis & Queries
Power Users, Developers & DBAs: Data Integration and Big Data
Advanced Analytical Professionals: Data Mining (Predictive Analytics)
Pentaho Core Competencies
Any Report Executive dashboards
Management / Operational Reports
Financial Reports
Any Format HTML – for the web
PDF – for printing
Excel / CSV – for finance or sharing
Anytime / Anywhere On-demand, scheduled
Event driven – manage by exception
Access via web portal, email or mobile
Information Consumers
Powerful Reporting and Visualization Business Users
Pentaho Core Competencies
Easy and Intuitive Content Creation Drag and drop
Metadata intelligence
Context-sensitive, right-click interface
Web-based Interactivity Drill down, drill thru, slice & dice, pivoting, lasso filtering
User-defined calculations
Rich Visualizations Scatter plots
Geo-mapping
Heat grids
Knowledge Workers/
Business Users
Self-Service Analysis & Queries
Pentaho Core Competencies
Deep Integration with multiple sources and analytics output
Inputs: RDBMS, files, web services, NoSQL, Analytical DBs, Hadoop
Output: ETL is tightly coupled with analytics
Scalability Scale up – multi-threaded
Scale out - clustering
Workflow Scheduling
Monitoring
Alerting
Power Users, Developers &
DBAs
Data Integration and Big Data
Pentaho Core Competencies
Full data mining lifecycle support Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and result of learning
Visualization, classification and clustering capabilities
Explorer - data exploration/visualization, model construction and export, preliminary evaluation
118 classification/regression algorithms
11 clustering algorithms
Integrated with PDI ETL Execute Weka and R predictive models inside of a PDI transformation
Append probabilities dynamically to each row in the data flow
Retrain Weka models using the KnowledgeFlow plugin for PDI
Advanced Analytical
Professionals Data Mining (Predictive Analytics)
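The idea of appending a probability to each row as it flows through a transformation can be sketched in a few lines of Python. This is only an illustration, not the Weka scoring step itself: the coefficients, the field names (avg_call_minutes, dropped_calls) and the logistic model are all assumptions made up for the example.

```python
import math

# Hypothetical logistic-model coefficients. In practice these would come from
# a model trained elsewhere (for example in Weka); the values are invented.
COEFFICIENTS = {"intercept": -2.0, "avg_call_minutes": 0.15, "dropped_calls": 0.4}


def predicted_probability(row):
    """Return the logistic score for one row of the data flow."""
    z = COEFFICIENTS["intercept"]
    z += COEFFICIENTS["avg_call_minutes"] * row["avg_call_minutes"]
    z += COEFFICIENTS["dropped_calls"] * row["dropped_calls"]
    return 1.0 / (1.0 + math.exp(-z))


def score_rows(rows):
    """Yield each incoming row with a 'probability' field appended."""
    for row in rows:
        scored = dict(row)  # copy, so the incoming row is left untouched
        scored["probability"] = predicted_probability(row)
        yield scored
```

Because score_rows is a generator, rows are scored one at a time as they stream through, which mirrors the row-by-row behaviour of a transformation step.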
Refs
Click Stream Analytics From buying patterns to revenue
360° View of Customer
• Monetize buying patterns hidden in billions of
data points
• Quickly analyze multi-channel click stream data
Pentaho Benefits
• Reduced ETL time to analyze blended data
from Hadoop, Hbase & data warehouse
• Use of big data analytics to grow revenue from
targeted campaigns
Device Data Analytics Big Data for NetApp
Business Challenge
• Affordably scale machine data from storage
devices for customer support app
• Predict device failure
• Enhance product performance
Pentaho Benefits
• Easy to use ETL & analysis for Hadoop, Hbase,
& Oracle data sources
• 15x cost improvement
• Stronger performance against customer SLA’s
Data Warehouse Optimization Cost effective, fast processing
Business Challenge
• Gain competitive advantage through intraday
balance reporting for commercial customers
• Use Hadoop and relational data stores to
process huge volumes
15x faster to develop
10x faster to execute
No coding
Integrate with existing
Easy to find resources
Pentaho Benefits
• Graphical orchestration for Hadoop, Hbase &
DB2 data integration workloads
• 15x faster to develop, 10x faster to execute
Telecom Use Cases Big Data Analytics and the Customer Experience
Expand Net Promoter Score beyond traditional
survey methods
• Add social media data
• Focus on highly valued customers
• Refine predictive models
Better track and manage overall IT
infrastructure for telecom services
• Capacity planning and forecasting
Use emerging sensor/device data to enable
services for a customer’s connected lifestyle
• Connected car, digital life, and mobile wallet
On-Line Ad Performance
• YP.com
USER CONSOLE
PRESENTATION
Platform
CENTRAL ADMINISTRATION, AUDITING & MONITORING
DELIVER When & Where Users Need It
STREAMLINE Information Delivery
VISUALIZE & Report Information In Any Style
ACCESS All Data Sources
ISV & Packaged Applications
SaaS / Cloud Applications
INTEGRATION
Web
Mobile
STANDALONE
‣ Advanced &
Predictive Analytics
DATA MINING
‣ Proactive
‣ Operational
‣ Enterprise
REPORTING
‣ Ad hoc Exploration
‣ Multi-Dimensional
ANALYSIS
‣ Interactive Metrics
‣ Rich Visualizations
DASHBOARDS
ERP / CRM /
Enterprise Apps (e.g. SAP, Oracle)
Hadoop, NoSQL Data & Analytical
Unstructured &
semi-structured (XML, Excel, Files, etc.)
Traditional Relational Data
Cloud (e.g. Salesforce,
Amazon, Dell)
‣ Direct Access
‣Data Integration
‣ Hadoop Clustering
‣ Graphical ETL Designer
‣ Enterprise Scalability
INTEGRATE, CLEANSE, & ENRICH DATA
‣ In Memory Caching
‣ High Performance
‣ Relational OLAP Cubes
METADATA LAYER
Generic architecture
Unstructured data
EDW
Structured data
Technology BIG DATA
and/or
Staging Area
Pentaho Data Integration
Collect
Pentaho Data Integration
Cleansing
Transformation
Change Data Capture
Data Warehouse Management
PDI PDI Metadata
Dashboard
Report
Analyzer
Cache
Complete management
Ad-hoc Data
Data Mart(s) / Warehouses
SMS alerts, email & attachments
PDI Metadata
Dashboard
Report
Analyzer
Technology BIG DATA
and/or
Staging Area
PDI
Collect, Transform, Load and Alert
Structured data
Unstructured data
Pentaho Data Integration
Pentaho Data Integration
J2EE Container (Tomcat/JBoss)
Data Integration
Architecture: Components
Server
Workstation
Thin client
Business Analytics
• Analytics • Reporting • Dashboards
Data Discovery and Advanced Analytics
Reporting, Ad Hoc Query, Dashboards, Mobile
Publish
HTTP/HTTPS
• ETL • Data profiling, cleansing, quality • Job Design/Orchestration
• Scheduling • Administration • Content Storage
Report Designer PDI Designer Metadata Design
Architecture: Technology
J2EE Container (Tomcat/JBoss)
Data Integration
Server
Workstation
Thin client
Business Analytics
• Analytics • Reporting • Dashboards
Data Discovery and Advanced Analytics
Reporting, Ad Hoc Query, Dashboards, Mobile
• ETL • Data profiling, cleansing, quality • Job Design/Orchestration
• Scheduling • Administration • Content Storage
Report Designer PDI Designer Metadata Design
• JavaScript • Dojo • GWT
• Swing • SWT
• Java • J2EE Web
Application
Technologies
Extensible Architecture
BUSINESS ANALYTICS PLUGIN TYPES:
❯ Platform plugins - security integration, new components
❯ User Console – new editors, analytic displays
❯ Analyzer visualizations – integrate 3rd party visualizations
❯ Dashboard Framework – filter control types, visualizations
❯ Dashboard Designer – additional widget types
DATA INTEGRATION
❯ Transformation Steps – connectors, transformation elements
❯ Job Entries – process/orchestration elements
❯ Perspectives – integrated design or analytic environments
Architectural Components
User Console
DEMO
2013 Performance Goals
1. Increase subscription revenue by analyzing call data to upsell PAYG customers to subscriptions
2. Improve store profitability by holding store managers accountable by bursting store income statements
3. Reduce stock outs with real-time inventory reports delivered on an iPad.
4. Maximize profits by profiling users with high average call duration
5. Maximize revenue by analyzing e-commerce clickstream data in MongoDB to profile purchasing users
6. Improve supply chain by giving phone manufacturers and resellers web-based reports
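Goal 4 above, profiling users with a high average call duration, comes down to a grouped average over call detail records. The following is a minimal sketch of that calculation; the record layout (customer_id, duration_seconds) and the threshold are assumptions for illustration, not the demo's actual schema.

```python
from collections import defaultdict


def average_call_duration(call_records):
    """Return {customer_id: average duration in seconds} for the given records."""
    totals = defaultdict(lambda: [0.0, 0])  # customer_id -> [sum, count]
    for record in call_records:
        entry = totals[record["customer_id"]]
        entry[0] += record["duration_seconds"]
        entry[1] += 1
    return {cid: total / count for cid, (total, count) in totals.items()}


def high_duration_customers(call_records, threshold_seconds):
    """Return the sorted IDs of customers whose average call exceeds the threshold."""
    averages = average_call_duration(call_records)
    return sorted(cid for cid, avg in averages.items() if avg > threshold_seconds)
```

In the demo the same aggregation would more likely be expressed in Analyzer or in a PDI transformation; the code only shows the logic behind the profile.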
3 Calling Plans
• Nationwide
• PAYG
• Prepaid 50
2 Business
Units
• B2B
• B2C
7 Retail Stores
7 Product
Lines
3 Websites
Clear Wireless – Wireless Carrier
10 Resellers
9 Phone Manufacturers
Red River Mobile
Apple
• San Francisco • Boston • NYC • Paris • Tokyo • Sydney • London
• Smartphones • Home Phones • Wifi Devices • Modems • Notebooks • Tablets • Accessories
• Ecommerce Site • Reseller Portal • Manufacturer Portal
EXTERNAL INTERNAL
IFrame Integration
Custom Widget Embedding
Pentaho for iPad & Android Instant Visualization & Analysis for Mobile Users
INSTANT AND INTERACTIVE VISUALIZATION
❯ Attractive dashboards, analysis, operational &
enterprise reports
❯ Touch filtering, drill-thru to details
POWER TO CREATE NEW ANALYSIS ON
THE GO
❯ Unique to Pentaho
❯ Highly interactive vs. a read-only access to static
content
EASY TO DEPLOY, EASY TO EMBED
❯ IT-free, create once, access anywhere
❯ Web-based, easily embeddable into mobile apps
DEV TOOLS
Translation
YOU CAN LOCALIZE THE PENTAHO USER CONSOLE AND DEVELOPMENT TOOLS IN ANY LANGUAGE USING ISO FORMAT
CONCEPT:
❯ Application:
❯ Specific message bundles within the Pentaho Web application
❯ Message bundles are dynamically adjusted according to browser locale
❯ Metadata:
❯ Reporting & Analysis metadata development tools contain specific localization features
❯ Data:
❯ As stored in your database. A multi-tenant ID can be used to switch between different content.
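Selecting a message bundle from the browser locale usually follows a fallback chain: full locale, then language only, then a default. The sketch below illustrates that chain under stated assumptions: the bundle contents, the keys and the default locale are invented for the example and are not Pentaho's actual bundle files.

```python
# Hypothetical message bundles keyed by locale code. A real deployment would
# load these from properties files rather than hard-code them.
BUNDLES = {
    "en": {"menu.open": "Open", "menu.save": "Save"},
    "fr": {"menu.open": "Ouvrir", "menu.save": "Enregistrer"},
    "fr_CA": {"menu.save": "Sauvegarder"},
}
DEFAULT_LOCALE = "en"


def candidate_locales(browser_locale):
    """Return the lookup order: full locale, language only, then the default."""
    normalized = browser_locale.replace("-", "_")
    candidates = [normalized]
    language = normalized.split("_")[0]
    if language != normalized:
        candidates.append(language)
    if DEFAULT_LOCALE not in candidates:
        candidates.append(DEFAULT_LOCALE)
    return candidates


def translate(key, browser_locale):
    """Return the first message found along the fallback chain, else the key."""
    for locale in candidate_locales(browser_locale):
        message = BUNDLES.get(locale, {}).get(key)
        if message is not None:
            return message
    return key  # nothing found: show the raw key so the gap is visible
```

With these bundles, a browser announcing fr-CA gets the regional wording where one exists and the generic French text everywhere else.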
Metadata Translation
METADATA
EDITOR
❯ All data can
be translated
SCHEMA
WORKBENCH
❯ All levels can
be translated
Metadata
Data Mart(s) / Warehouse
Metadata
Dashboard
Report
Analyzer
PDI Datasource
Operations Mart
Schema Workbench
Pentaho Server
Mondrian Schema
Metadata Schema
MDX
SQL
Metadata Editor
Analyzer
Interactive Reporting
Report Designer
Architecture Metadata
2 TYPES OF METADATA
❯ Metadata Report
❯ Metadata Olap
END USERS CAN CREATE THEIR OWN METADATA
❯ From a file
❯ With a SQL statement
❯ From database
Architecture Metadata
Data Source Wizard
Pentaho Server
Mondrian Schema
Metadata Schema
MDX
SQL
Analyzer
Interactive Reporting
Report Designer
Schema Workbench & Aggregation Designer
CALCULATION, VIRTUAL CUBES AND AGGREGATION TOOL
Metadata Editor
CREATE METADATA FOR YOUR END-USERS
Modeling
Properties
PENTAHO DATA INTEGRATION
PDI Components
Extract
Transform
Load
PDI Components
SPOON ❯ Graphical environment for modeling
❯ Transformations are metadata models describing the flow of data
❯ Jobs are workflow-like models for coordinating resources, execution and dependencies of ETL activities
PAN ❯ Command line tool for executing
transformations modeled in Spoon
KITCHEN ❯ Command line tool for executing
jobs modeled in Spoon
… AND OF COURSE KETTLE ❯ The Engine itself
KDE ETTL Environment
Spoon Interface – Designing a Transformation
Job Example
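Pan and Kitchen are typically called from a scheduler or wrapper script. The sketch below only builds the argument lists for such calls and does not run anything. It uses the -file, -level and -param options of the pan.sh and kitchen.sh launchers; the file paths and the parameter name are hypothetical.

```python
def _build_command(launcher, file_path, params, log_level):
    """Assemble an argument list for a PDI command-line launcher."""
    command = [launcher, f"-file={file_path}", f"-level={log_level}"]
    # Named parameters are passed as -param:NAME=value, sorted for stable output.
    for name, value in sorted((params or {}).items()):
        command.append(f"-param:{name}={value}")
    return command


def build_pan_command(transformation_file, params=None, log_level="Basic"):
    """Arguments to execute a transformation (.ktr) with Pan."""
    return _build_command("pan.sh", transformation_file, params, log_level)


def build_kitchen_command(job_file, params=None, log_level="Basic"):
    """Arguments to execute a job (.kjb) with Kitchen."""
    return _build_command("kitchen.sh", job_file, params, log_level)


# Hypothetical usage. The list could be handed to subprocess.run(...) on a
# machine where PDI is installed and the launchers are on the PATH.
example = build_pan_command("/etl/load_sales.ktr", {"LOAD_DATE": "2013-10-30"})
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths or parameter values contain spaces.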
PDI Components
ENTERPRISE EDITION DATA INTEGRATION SERVER
❯ Execution and remote monitoring
❯ Integrated scheduling
❯ Enterprise Security options
❯ Enhanced content management including revision history and locking
❯ Remote distributed cluster based processing
Any Format of Data
Overall Management
Not just processing … A key element once in a production environment
Access Controls
PREVENT UNAUTHORIZED USERS FROM VIEWING DATA
TRANSFORMATION RULES AND POSSIBLY CONNECTION
CREDENTIALS (E.G. DATABASE LOGINS / PASSWORDS)
❯ Integrate with existing security (e.g. LDAP / Active Directory)
PROVIDE ACCESS TO TRANSFORMATIONS AND JOBS ON A
“NEED TO KNOW” BASIS
PROTECT DATABASE LOGINS AND PASSWORDS
Version Control
AUTOMATICALLY SAVE MULTIPLE REVISIONS OF A
TRANSFORMATION OR JOB
ELIMINATE THE RISK OF “FAT FINGERS” … ACCIDENTAL
DELETION OR CHANGES
EXPERIMENT WITH DIFFERENT ETL DESIGNS WHILE PRESERVING
THE ORIGINAL
RESTORE TRANSFORMATIONS AND JOBS FROM AN EARLIER
VERSION
Production Control
IDENTIFY THE CORRECT VERSIONS OF TRANSFORMATIONS AND
JOBS TO RUN
ALLOW “EXECUTE ONLY” BY IT OPERATIONS PERSONNEL
LOCK TRANSFORMATIONS WITH COMMENTS
SCHEDULE JOBS TO RUN ON A CENTRAL SERVER AT
PREDETERMINED TIMES
Pentaho Data Integration
STEP BASED PROCESSING ENGINE WITH INSTANT VISUALISATION
OF RESULTS
Pentaho Data Integration
• AGILE BI METHODOLOGY
• Load
• Modeling
• Visualize
Traditional DB
DATA INTEGRATION ANALYSIS
etc etc etc
Broadest Support for Big Data Platforms
Hadoop NoSQL Analytic Databases
BIG DATA
Big Data?
http://www.youtube.com/watch?v=QV3t-3QIf1E
CENTRAL ADMINISTRATION, AUDITING & MONITORING
DELIVER When & Where Users Need It
STREAMLINE Information Delivery
VISUALIZE & Report Information In Any Style
ACCESS All Data Sources
ISV & Packaged Applications
SaaS / Cloud Applications
INTEGRATION
Web
Mobile
STANDALONE
‣ Advanced &
Predictive Analytics
DATA MINING
‣ Proactive
‣ Operational
‣ Enterprise
REPORTING
‣ Ad hoc Exploration
‣ Multi-Dimensional
ANALYSIS
‣ Interactive Metrics
‣ Rich Visualizations
DASHBOARDS
ERP / CRM /
Enterprise Apps (e.g. SAP, Oracle)
Hadoop, NoSQL Data & Analytical
Unstructured &
semi-structured (XML, Excel, Files, etc.)
Traditional Relational Data
Cloud (e.g. Salesforce,
Amazon, Dell)
‣ Direct Access
‣Data Integration
‣ Hadoop Clustering
‣ Graphical ETL Designer
‣ Enterprise Scalability
INTEGRATE, CLEANSE, & ENRICH DATA
‣ In Memory Caching
‣ High Performance
‣ Relational OLAP Cubes
METADATA LAYER
Big Data with Pentaho
BIG DATA Discovery
‣ Instaview
Classic usage
Web Application
User Behavior JavaScript, Java, PHP,
Embedded Specialist Tool
CRM Style Data
ORCHESTRATE
ERP DW
Processing
CRM
Pig, Oozie, Flume, Hive, Hbase, Sqoop
Raw Data
Parsed Data
Analytic Datasets
Transform & visualize
Master Data
Analysis & Reporting
A
N
A
L
Y
Z
E
Unstructured Data
Structured Data
INGEST
Ingestion
VISUAL MAP REDUCE
Data Integration Analytics
Pentaho Big Data Strategy
• VISUAL MAP REDUCE • Graphic development
• Technical architecture close to Hadoop
• BIG DATA LAYER • Framework including all Big Data distributions
• Technical partnerships (Cloudera, Hortonworks, MongoDB, …)
• BLENDED DATA • JDBC driver to use our ETL as a data source
• Data as a service
Pentaho Visual Development Eliminates need for complex coding
Would you rather do this?
Integrate, Manipulate, Ingest
… or this?
Schedule
Model
Pentaho Visual MapReduce Drag & Drop then run in the cluster
Parallel execution as MapReduce
in the Hadoop cluster
Up to 15x faster than hand-
written code
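For readers new to the model, the three phases that a MapReduce job runs on the cluster can be shown in plain Python. This is an in-memory illustration only, not how Pentaho MapReduce or Hadoop is implemented; the log format ("user page") and the function names are assumptions for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def page_view_mapper(line):
    """Hypothetical clickstream line 'user page': emit (page, 1)."""
    _user, page = line.split()
    yield page, 1


def sum_reducer(_key, values):
    """Total the counts emitted for one page."""
    return sum(values)


def count_page_views(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = map_phase(lines, page_view_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; here everything happens in one process to keep the flow visible.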
VISUAL MAP REDUCE
ETL DEVELOPERS ONLY
The main part of your transformation doesn’t change… only the first and last steps are new
Big Data Orchestration
• Scheduling, Event management • Use your existing scripts (e.g. Pig scripts) • All databases and file systems – Hadoop, NoSQL, RDBMS, …
Adaptive Big Data Layer Leadership in Big Data Integration and Analytics
• Insulates from changing versions, vendors, data stores
• Gives customers broad flexibility of choice, rapid time to value, reduced risk
• Provides native integration into the big data ecosystem
• Broadest, deepest Big Data Support
Transparent Access to & Integration of Big Data
BIG DATA LAYER
ALL MAIN HADOOP DISTRIBUTIONS
NOSQL CONNECTORS
ACCESS TO AMAZON REDSHIFT & SPLUNK.
How does Pentaho help you?
• WE REDUCE COMPLEXITY WHICH SIMPLIFIES MIGRATION TO NEW VERSIONS OF
HADOOP
• BECAUSE WE DON’T GENERATE CODE, WE REDUCE THE RISK OF OBSOLESCENCE AS
HADOOP EVOLVES
Data Ingestion
Manipulation
Integration
Enterprise &
Ad Hoc Reporting
Data Discovery
Visualization
Predictive Analytics
Complete Big Data Analytics &
Visual Data Management
Relational Hadoop NoSQL Analytic
Databases
Pentaho Big Data Analytics
Orchestration Engine
Applications
Databases
Files
Job (.KJB)
Plug-In Job
Entries
Monitoring Logs
Flume
SQL
Files FTP
Sqoop
Folder
Oozie
Sub-Job
Analytic DB
NoSQL
Hadoop Cluster
PDI Engine
Transformation Engine
Data Node
Task Tracker
Transformation Engine
Data Node
Task Tracker
Transformation Engine
Data Node
Task Tracker
Transformation Engine
JobTracker
Orchestration
Distributed Cache
Transformation (.KTR)
Transformation Engine
BLENDED DATA
Data sources SQL
Datawarehouse
Location
Network
Web
Social Media
WebServices NoSql
Hadoop
• Takes time • Requires IT • Target database is updated as
transformations are run
How do we integrate data today?
Location
Network
Web
Social Media
We put it in: Hadoop NoSQL Analytic DB’s
Where do we store big data?
Location
Network
Web
Social Media
How can we bring it together?
ETL
Location
Network
Web
Social Media
PDI
What if a user could bring together both types
of data on demand?
❯ Just in time blending of data from multiple sources for a complete
picture
❯ Connect, combine and transform data from multiple sources
❯ Query data directly from any transformation
❯ Access architected blends with the full spectrum of Pentaho Analytics
❯ Manage governance and security of data for on-going accuracy
Accurate, Blended Big Data Analytics
EDW
Existing ETL Tool
or PDI
Customer
Billing
Provisioning
NoSQL Network
Location
PDI
PDI Analytics
Just in time blending
Evolving Big Data Architectures
Just-in-Time Integration
PDI
PDI
Analytic
DB
Location
Web
Social Media
Network
Existing
Process
or PDI Hadoop
Cluster
NoSQL
Existing
ETL Tool
or PDI
EDW Data
Marts
Analytics
Existing
ETL Tool
or PDI
Customer
Provisioning
Billing
Other BI Tools
Improve operational effectiveness
• Machines/sensors: predict failures, network attacks
• Financial risk management: reduce fraud, increase security
Reduce data warehouse cost
• Integrate new data sources without increased database cost
• Provide online access to ‘dark data’
Drive incremental revenue
• Predict customer behavior across all channels
• Understand and monetize customer behavior
• Begin to monetize data as a service
Customer Value from Big Data
MONETIZING BIG DATA-DRIVEN USE CASES DRIVING NEED TO BLEND DATA
Analytics
Analyze quality of service: • Network outages
• Dropped calls
• Poor quality
• Calls to support center
For profiles of customers: • Up for renewal
• Profitable
• Multiple agreements/services
• In competitive area
Determine best action to take: • Billing Credit
• Customer Coupon
• No Action
EDW
Existing
ETL
Tool
or PDI
Customer
Billing
Provisioning
Customer Financial Data:
• Billing
• Payment
• Usage
NoSQL Network
Location
PDI
Customer Experience Data:
• Outages
• Drops
• Service Quality
PDI
Blend revenue-related and
quality-of-service data
together to find customers at
risk
Why Blending at the Source Matters Customer Experience Analytics for Loyalty and Revenue
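The blend described on this slide, joining financial data from the warehouse with customer-experience events from a NoSQL store, can be sketched as an in-memory join. The record layouts, field names and thresholds below are assumptions for illustration; in the Pentaho setup the blend would be a PDI transformation rather than hand-written code.

```python
def blend_customer_view(billing_rows, experience_events):
    """Join billing rows with per-customer counts of drops and outages."""
    events_by_customer = {}
    for event in experience_events:
        events_by_customer.setdefault(event["customer_id"], []).append(event)

    blended = []
    for row in billing_rows:
        events = events_by_customer.get(row["customer_id"], [])
        blended.append({
            "customer_id": row["customer_id"],
            "monthly_revenue": row["monthly_revenue"],
            "dropped_calls": sum(1 for e in events if e["type"] == "drop"),
            "outages": sum(1 for e in events if e["type"] == "outage"),
        })
    return blended


def customers_at_risk(blended, min_revenue, max_incidents):
    """Return profitable customers whose incident count exceeds the limit."""
    return [
        row["customer_id"]
        for row in blended
        if row["monthly_revenue"] >= min_revenue
        and row["dropped_calls"] + row["outages"] > max_incidents
    ]
```

Neither source alone answers the question: billing data identifies who is profitable, the event data identifies who is having a poor experience, and only the blended rows show where the two overlap.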
Data Warehouse Optimization
Data Sources Big Data Architecture
Data Warehouse (Master & Transactional Data)
ERP
CRM
CDR
Analytic Data Mart(s)
Analytic Data Mart(s)
Analytic Data Mart(s)
Logs Logs
Other Data
Raw Data
Parsed Data
Analytic Datasets
Master Data
Tape
Archive
Our Big Data solutions
Big Data Mgmt.
NoSQL Databases
• Data Integration • Job Orchestration • Workflow
Pentaho Business Analytics
• Scheduling • High Performance • Visual IDE
Data Integration
Analytic Databases Hadoop Java MapReduce, Pig Pentaho MapReduce
Big Analytics
3RD Party Tools
• R • 3rd Party BI Tools • Applications
Business Analytics Embedded Analytics
MongoDB & Pentaho
OEM & ISV
OEM pattern
Pentaho BI Server
Your Application
Pentaho
Your functions
Your application
Pentaho components
OEM
Bundled
Value: Fastest Way to Get Analytics that Have Your Look & Feel
What it Takes:
• Pentaho is a separate app, branded with Partner’s logo, look & feel
• Optional: Partner app may include links to Pentaho reports, analysis and dashboards (popping new window)
• Optional: Single sign-on creates a seamless experience
Skill Level: Limited HTML skills
Mashup
Value: An Integrated Experience for Your End User
What it Takes:
• Pentaho & Partner app have the same UI
• Pentaho User Console, or individual reports, analysis or dashboards are included in partner app
• Single sign-on creates a seamless experience
Skill Level: HTML skills
Extended
Value: Customizing Pentaho for Your Experience
What it Takes:
• Pentaho’s core functionality is extended through plug-ins. Examples: connecting to custom data sources, adding new visualizations, customizing security, replacing Pentaho rules engine
Skill Level: HTML skills, Java skills
Embedded
Value: Ultimate Integration and Customization
What it Takes:
• Integrate with Partner’s App Server
• Directly embedding Pentaho into your app
• Calling Pentaho Java APIs from your App
Skill Level: HTML skills, Java skills, knowledge of Pentaho architecture
COMPLETE ISOLATION OF ALL CONTENT INCLUDING:
ARTIFACTS (REPORTS, DASHBOARDS, ETC.)
DATA SOURCES
SCHEDULER
PLUGINS?
“VIRTUAL SERVER PER ORGANIZATION”
Pentaho BI Server
Use Case – Share Nothing
Artifacts
Data Sources
Schedules
Configuration
Artifacts
Data Sources
Schedules
Configuration
Organizations
Company B Company A
•ONE OR MORE COMMON DATABASES
•DATA ‘STRIPED’ WITH TENANT ID
Pentaho BI Server
Use Case – Shared Data, tagged with Tenant ID
Artifacts
Data Sources
Schedules
Configuration
Artifacts
Data Sources
Schedules
Configuration
Organizations
Company B Company A
Shared Database
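Striping shared data with a tenant ID means every row carries the tenant it belongs to and every query filters on it. A minimal sketch with Python's built-in sqlite3 module follows; the table, columns and tenant names are invented for the example and say nothing about how the BI Server injects the tenant filter.

```python
import sqlite3


def create_shared_database():
    """Create one in-memory database holding rows for several tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (tenant_id TEXT, store TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("company_a", "Paris", 120.0),
            ("company_a", "London", 80.0),
            ("company_b", "Tokyo", 200.0),
        ],
    )
    return conn


def sales_for_tenant(conn, tenant_id):
    """Return only the rows striped with the given tenant ID."""
    cursor = conn.execute(
        "SELECT store, amount FROM sales WHERE tenant_id = ? ORDER BY store",
        (tenant_id,),  # bound parameter, never string concatenation
    )
    return cursor.fetchall()
```

The isolation here is only as strong as the discipline of always applying the filter, which is why it is normally enforced centrally (for example in the metadata layer) rather than left to each report.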
•COMMON DATA SOURCE DEFINITION
•ORGANIZATION DATA IN ISOLATED DATABASES WITH THE SAME DATA
MODEL
•CONNECTIONS ‘ROUTED’ BY TENANT ID
Pentaho BI Server
Use Case – Shared Data Source (data isolation)
Artifacts
Data Sources
Schedules
Configuration
Artifacts
Data Sources
Schedules
Configuration
Organizations
Company B Company A
Data Sources
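In this use case each organization has its own database with the same data model, and one logical data source is routed to the right database by tenant ID. The sketch below shows the routing idea with sqlite3; the class name, schema and tenant names are hypothetical and do not reflect the BI Server's own connection handling.

```python
import sqlite3


class TenantConnectionRouter:
    """Map each tenant ID to its own database sharing one schema."""

    def __init__(self, schema_sql):
        self._schema_sql = schema_sql
        self._connections = {}

    def register(self, tenant_id, database_path=":memory:"):
        """Open the tenant's database and create the common data model in it."""
        conn = sqlite3.connect(database_path)
        conn.executescript(self._schema_sql)
        self._connections[tenant_id] = conn
        return conn

    def connection_for(self, tenant_id):
        """Return the connection routed to this tenant, or fail loudly."""
        try:
            return self._connections[tenant_id]
        except KeyError:
            raise LookupError(f"no database registered for tenant {tenant_id!r}") from None
```

Because the schema is identical everywhere, the same report definition can run unchanged for every tenant; only the connection it receives differs.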
•COMMON PENTAHO ARTIFACTS – REPORTS, DASHBOARDS, ANALYSIS
•EACH TENANT CAN VIEW SHARED AND TENANT SPECIFIC CONTENT
Pentaho BI Server
Use Case – Shared Content
Artifacts
Data Sources
Schedules
Configuration
Artifacts
Data Sources
Schedules
Configuration
Organizations
Company B Company A
Artifacts
DATAMINING
Class Photo
…And that, in simple terms, is how data mining works.
Data Mining (Predictive Analytics)
Advanced Analytical Professionals
Pentaho Core Competencies
Full data mining lifecycle support
• Preparation of input data
• Statistical evaluation of data mining models
• Visualization of inputs and results of model learning
Visualization, classification and clustering capabilities
• 118 classification/regression algorithms
• 11 clustering algorithms
Integrated with PDI ETL
• Execute Weka and R predictive models inside of a PDI transformation
• Append probabilities dynamically to each row in the data flow
• Retrain Weka models using the KnowledgeFlow plugin for PDI
Clear Wireless – Wireless Carrier
3 Calling Plans
• Nationwide • PAYG • Prepaid 50
2 Business Units
• B2B • B2C
7 Retail Stores
• San Francisco • Boston • NYC • Paris • Tokyo • Sydney • London
7 Product Lines
• Smartphones • Home Phones • Wifi Devices • Modems • Notebooks • Tablets • Accessories
3 Websites
• Ecommerce Site • Reseller Portal • Manufacturer Portal
Databases
• Call Detail Records • Retail Sales • Website Clickstream
2013 Performance Goals
Increase subscription revenue
Improve store profitability
Eliminate inventory stock outs
Leverage big data to maximize profits
Profile and target profitable customers
Improve supply chain visibility for partners
2013 Performance Goals
Goals Objectives Enabler
• Increase subscription revenue: Analyze call data to upsell PAYG customers to subscriptions
• Improve store profitability: Hold store managers accountable by pushing store income statements to email
• Eliminate inventory stock outs: Empower store employees with iPads and real-time inventory reports
• Profile and target profitable customers: Profile mobile plan customers with high average call duration
• Leverage big data to maximize profits: Analyze e-commerce clickstream data in MongoDB to profile purchasing users; use predictive technologies to improve marketing effectiveness
• Improve supply chain visibility for partners: Give phone manufacturers and resellers web access to secure sales reports
Data Mining Lifecycle
Data Mining Life Cycle: a visual guide to CRISP-DM methodology
SOURCE CRISP-DM 1.0 http://www.crisp-dm.org/download.htm
DESIGN Nicole Leaper http://www.nicoleleaper.com
Diagram columns: Phases, Generic Tasks (with their outputs), Specialized Tasks (Process Instances). The specialized task shown for every generic task is "Log and Report Process".
Business Understanding (identify project objectives)
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Assess Situation: Inventory of Resources, Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding (collect and review data)
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation (select and cleanse data)
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion/Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling (manipulate data and draw conclusions)
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter
Evaluation (evaluate model and conclusions)
• Evaluate Results: Align Assessment of Data Mining Results with Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment (apply conclusions to business)
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
Business Understanding
WEBSITE CLICKSTREAM DATA
INCREASE REVENUE THROUGH TARGETED MARKETING
LIMITED DIRECT MARKETING BUDGET
PREDICT A WEB USER’S PROPENSITY TO PURCHASE BASED ON THEIR ONLINE CLICKSTREAM BEHAVIOR
SEND EXPENSIVE PROMOTIONAL OFFERS TO WEB USERS MOST LIKELY TO MAKE A PURCHASE
ASSUMPTIONS: $5/MAILING FOR $500 PURCHASE
❯ Expected Benefit of true positive prediction: $495 ($500 – mailing cost)
❯ Expected Benefit of false negative: $0 (no gain & no loss)
❯ Expected Cost of a false positive: $5 (cost of mailing)
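The payoff assumptions above translate directly into a small calculation. The following is a minimal sketch: the $5 and $500 figures come from the slide, while the function name `campaign_benefit` and the example counts are made up for illustration.

```python
MAILING_COST = 5.0      # from the slide: $5 per mailing
PURCHASE_VALUE = 500.0  # from the slide: $500 purchase

def campaign_benefit(true_positives, false_positives):
    """Net benefit of mailing everyone the model flags as a likely buyer.

    Each true positive nets the purchase minus the mailing cost; each false
    positive loses the mailing cost; users who are not mailed contribute $0.
    """
    gain = true_positives * (PURCHASE_VALUE - MAILING_COST)
    loss = false_positives * MAILING_COST
    return gain - loss

# Hypothetical counts, for illustration only.
print(campaign_benefit(true_positives=10, false_positives=100))  # 4450.0
```

Because a single true positive is worth 99 wasted mailings, even a model with many false positives can pay off, provided it finds enough real buyers.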
Data Understanding
Use ecommerce website clickstream log data stored in a MongoDB database
Key Source Fields: [id_user], [date], [event_name]
Events
Data Understanding/Data Preparation
WEBSITE CLICKSTREAM DATA
USE WEBSITE CLICKSTREAM LOG DATA STORED IN MONGODB
NEED TO AGGREGATE, TRANSFORM, AND ENRICH THAT DATA
NEED TO PIVOT THE DATA INTO TABULAR FORMAT FOR PREDICTIVE MODELS
CREATE A FILE IN ARFF FORMAT FOR PENTAHO DATA MINING
USE PENTAHO DATA MINING TO DERIVE “PROPENSITY SCORE” FOR EACH USER
Data Understanding/Data Preparation
Key Source Fields: [id_user], [date], [event_name]
• PARSE
• CLEAN AND FORMAT
• GROUP AND AGGREGATE
• ENRICH WITH OTHER DATA
Data Understanding/Data Preparation
USE PDI ROW DENORMALIZER STEP TO PREPARE DATA FOR PREDICTIVE MODEL
• Pivots events from records into columns (tabular format)
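The effect of the pivot can be imitated in a few lines of plain Python. This is only a sketch of the outcome, not of the PDI step itself: the function `denormalize`, the sample records, and the event names are assumptions, and each event column is reduced to a 0/1 flag.

```python
from collections import defaultdict

def denormalize(events, event_names):
    """Pivot (id_user, event_name) records into one row per user.

    Each output row has one 0/1 column per event name, marking whether that
    user produced the event at least once.
    """
    seen = defaultdict(set)
    for record in events:
        seen[record["id_user"]].add(record["event_name"])
    rows = []
    for user in sorted(seen):
        row = {"id_user": user}
        for name in event_names:
            row[name] = 1 if name in seen[user] else 0
        rows.append(row)
    return rows

# Hypothetical clickstream records using the slide's key source fields.
events = [
    {"id_user": "u1", "date": "2013-10-01", "event_name": "visited_site"},
    {"id_user": "u1", "date": "2013-10-01", "event_name": "added_item_to_cart"},
    {"id_user": "u2", "date": "2013-10-02", "event_name": "visited_site"},
]
columns = ["visited_site", "watched_video", "added_item_to_cart"]
for row in denormalize(events, columns):
    print(row)
```

Three event records collapse into two user rows, each with the same fixed set of columns, which is the tabular shape a learning algorithm expects.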
Data Output: • Parsed • Cleaned; Formatted • Grouped; Aggregated • Enriched
Data Understanding/Data Preparation
Why do we have to pivot the records into tabular format?
The learning algorithms require each row of data to be an independent example of what is to be learned. Pivoting lets us provide an example that summarizes all of a user's events over the hour in a single record. The presence or absence of each event type is then used as a predictor for "add to cart".
Data Preparation – Final Step
• Create a file in Weka’s ARFF format
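An ARFF file is plain text: an `@relation` line, one `@attribute` line per column, then `@data` followed by comma-separated rows. The sketch below renders such a file for 0/1 event flags; the helper `to_arff`, the relation name, and the sample rows are illustrative assumptions, and in the demo this file would be produced from PDI rather than by hand.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text; every attribute is nominal with values {0,1}."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        lines.append(f"@attribute {name} {{0,1}}")
    lines.append("")
    lines.append("@data")
    for row in rows:
        lines.append(",".join(str(row[name]) for name in attributes))
    return "\n".join(lines) + "\n"

attributes = ["visited_site", "watched_video", "added_item_to_cart"]
rows = [
    {"visited_site": 1, "watched_video": 0, "added_item_to_cart": 1},
    {"visited_site": 1, "watched_video": 1, "added_item_to_cart": 0},
]
arff_text = to_arff("clickstream", attributes, rows)
print(arff_text)
```

Declaring the flags as nominal `{0,1}` rather than numeric lets classifiers treat `added_item_to_cart` as a class label.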
Modeling
BUSINESS GOAL: INCREASE REVENUE THROUGH TARGETED MARKETING
❯ Focus on the users most likely to purchase and persuade them to spend more (or to commit to a purchase if they are borderline) with seductive marketing, special offers, etc.
DATA MINING
❯ Build a model that will predict “Added_item_to_cart” with higher accuracy than random selection or any hand-crafted business rules
❯ Based on data characteristics – few attributes; small number of instances – try some “likely suspects” first
❯ Naïve Bayes (linear)
❯ Logistic regression (linear)
❯ Decision tree (non-linear)
❯ K nearest neighbors (non-linear)
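To make one of the "likely suspects" concrete, here is a minimal k-nearest-neighbors sketch over 0/1 event flags, using Hamming distance and a majority vote. It is a toy written from scratch, not Weka's implementation, and the training rows are invented for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows.

    `train` is a list of (features, label) pairs; ties in distance keep the
    original training order because sorted() is stable.
    """
    nearest = sorted(train, key=lambda pair: hamming(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: (visited_site, watched_video, signup_newsletter) -> added item to cart?
train = [
    ((1, 1, 0), 1),
    ((1, 1, 1), 1),
    ((1, 0, 0), 0),
    ((0, 0, 0), 0),
    ((0, 0, 1), 0),
]
print(knn_predict(train, (1, 1, 0)))  # 1
print(knn_predict(train, (0, 0, 0)))  # 0
```

With few attributes and a small number of instances, a brute-force scan like this is cheap, which is one reason k-NN is a reasonable early candidate.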
Modeling
• Take a quick look at summary info in the Explorer
Knowledge Flow
Evaluation
Logistic Regression: Results
MODEL IS A LINEAR FUNCTION THAT PREDICTS THE LIKELIHOOD (PROBABILITY) OF A PERSON “ADDING ITEM TO CART”
RELATIVE MAGNITUDES OF THE COEFFICIENTS GIVE AN INDICATION OF IMPORTANCE
Function for label “0” – i.e. won't add to cart
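In logistic regression the linear score is passed through the logistic (sigmoid) function to obtain a probability. The sketch below shows that step; the coefficient values and feature names are hypothetical and are not the ones from the model output shown on the slide.

```python
import math

def probability(coefficients, intercept, features):
    """Logistic model: squash a linear score into a probability in (0, 1)."""
    score = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, for illustration only.
coefficients = {"watched_video": 1.2, "referrer_google": -0.4}
p = probability(coefficients, intercept=0.0, features={"watched_video": 0, "referrer_google": 0})
print(p)  # 0.5, because the linear score is 0
```

A coefficient with a larger absolute value moves the score, and therefore the probability, further for the same change in its feature, which is why relative magnitudes hint at importance.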
Logistic Regression: Results
Negative Impact on Purchase
• Subscribed to Email
• Commented on Blog
• Visited Site
• Signup Newsletter
• Watched Video
• Tweeted Item
• Tweeted Blog Posts
• Signup Free Offer
Positive Impact on Purchase
• Gender – Male
• Language – French
• Language – Spanish
• Primary Use – Personal
• Referring URL – Ebay
• Referring URL – Google
• Referring URL – Live.com
Cost/Benefit Analysis
LOGITBOOST LOGISTIC REGRESSION
LEFT CURVE: CUMULATIVE GAINS
RIGHT CURVE: BENEFIT CURVE
❯ y axis: benefit
❯ x axis: sample size
EXPECTED BENEFIT $ 21,475
❯ Also shows expected benefit if we just chose a random subset of this size from the total population.
❯ $ -3,036
Cost/Benefit Analysis
ALGORITHM: LOGITBOOST LOGISTIC REGRESSION
EXPECTED BENEFIT IF WE USE MACHINE-RECOMMENDED MAILER RECIPIENTS
❯ $ 21,475
EXPECTED BENEFIT IF WE JUST CHOSE A RANDOM SUBSET OF THIS SIZE FROM THE TOTAL POPULATION
❯ $ (-3,036)
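A benefit curve of this kind can be sketched by ranking users by model score and accumulating the payoff as the mailing list grows. The code below assumes the $5 mailing cost and $500 purchase value stated earlier in the deck; the scores, outcomes, and the function `cumulative_benefit` are invented for illustration and do not reproduce the figures on the slide.

```python
def cumulative_benefit(scored_users, purchase_value=500.0, mailing_cost=5.0):
    """Running net benefit when mailing users in descending score order.

    `scored_users` is a list of (score, purchased) pairs, where `purchased`
    is 1 if the user went on to buy and 0 otherwise.
    """
    ranked = sorted(scored_users, key=lambda pair: pair[0], reverse=True)
    total, curve = 0.0, []
    for _, purchased in ranked:
        total += (purchase_value - mailing_cost) if purchased else -mailing_cost
        curve.append(total)
    return curve

# Hypothetical scores and outcomes, for illustration only.
users = [(0.9, 1), (0.2, 0), (0.7, 0), (0.8, 1), (0.1, 0)]
curve = cumulative_benefit(users)
best_size = curve.index(max(curve)) + 1
print(curve)      # [495.0, 990.0, 985.0, 980.0, 975.0]
print(best_size)  # 2
```

The x axis of the curve is the sample size (how many top-ranked users are mailed) and the y axis is the accumulated benefit; the peak marks the mailing size beyond which extra mailings cost more than they return.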
Modeling
Marketing
HOW MANY MARKETING PEOPLE DOES
IT TAKE TO SCREW IN A LIGHTBULB?
None….they’ve automated it.
YOUR PROJECT
Project method
Prepare Explore Design Develop Deploy
Installation and configuration
Agile BI training
Studies & infrastructure
Architecture
Project kick-off
Development review
Create content
Explore the data
Identify requirements
Iterative review
Publish
Develop content
Extract and load the data
Refine the data model
Test, run acceptance, and deploy
Gather business requirements
Extend
Production rollout
Go/No-Go meeting
Project definition
BI suite and advanced training
Project schedule
Project plan
Data model
Requirements specification
Production rollout procedure
Specification
Test plans
User training and documentation
Iterative review
How are you going to manage your project?
Your effort
Pentaho assistance
Ask yourself:
• What resources do we have?
• What competences do we have?
• What is our project timeline?
• What’s our project complexity?
• Do we have the infrastructure?
Thanks
blog.pentaho.com
@Pentaho
Facebook.com/Pentaho
Pentaho Business Analytics
JOIN THE CONVERSATION. YOU CAN FIND US ON: