Scg data management for advance analytics 20171126
Agenda
• Introduction
• Business Value of Big Data Analytics
• Customer Use Cases
• Solution Architecture
Big Data Is Only Getting Bigger
Particularly Relevant in the Manufacturing Space
[Figure: Data growth, 1980 to today. Structured data – 10%; complex data – 90%. Growth drivers: end-user applications, the internet, mobile devices, sophisticated machines. Data sources: devices & sensors, plant & operations, supply chain & inventory, marketing & CRM, public & trade.]
● What makes “big data” big?
⎼ Volume?
⎼ Variety?
⎼ Velocity?
• Data becomes big when we take one or more large data sets and start to analyze relationships between observations
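One way to see why relationships, rather than raw row counts, make data big: the number of pairwise comparisons grows quadratically with the number of observations. The sketch below is only an illustration of that arithmetic; the function name `pair_count` and the sample values are assumptions, not part of the deck.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n observations: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Sanity check of the formula against an explicit enumeration of pairs.
observations = ["a", "b", "c", "d", "e"]
explicit_pairs = list(combinations(observations, 2))

for n in (1_000, 1_000_000):
    print(f"{n:>9,} observations -> {pair_count(n):,} pairwise relationships")
```

A data set that grows 1,000 times larger therefore produces roughly 1,000,000 times more pairs to examine.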
Big Data and Data Products
The data-driven enterprise
IoT explosion of new data
• 30B connected devices
• 440x more data
Enterprises re-architect to modernize IT infrastructure
• open source
• cloud
• machine learning
Modern platform for Big Data Analytics
Data Science & Engineering
Analytic Database
Operational Database
Drive Customer Insights
Improve Product & Services Efficiency
Lower Business Risks
Cloudera Enterprise: Fast, Easy, Secure
Business Value
Technology
Use Cases
ONE platform. MANY applications.
Industry Use Cases – Driving Business Value
Real customer use cases
Financial Services
• Customer 360
• Fraud / Cyber
• Compliance
• Risk
• Operational Data Store
• Market Data
• Algo Trading
• Active Archive
Telco, Media
• Customer 360
• Churn prediction
• Network Optimization
• Data Monetization
• EDW Augmentation
• Media Streaming
• Active Archive
Manufacturing, Life Science
• Connected: Car, Plane, Equipment
• Agile Supply Chain
• Predictive Maintenance
• IoT Data enabled “Smart Services”
• Clinical trials
• Diagnostics
Retail, CPG, Transportation
• Ship to Store
• Agile Supply Chain
• Next Best Offer
• Connected Store
• Completed baskets
• IoT – Stores
• Active archiving
• Smart Vessel
• Customer Loyalty
Government, Health Care
• Border Control
• Risk / Intelligence
• 360 Tax payer
• Tax Optimization
• Cyber Threat
• Fraud prevention
• Intelligence
• Patient care
• Citizen 360
IoT Data Characteristics
The foundation of Hadoop’s potential
IoT data comes from a variety of different sources
• Massive volumes of intermittent data streams
• Generated from a variety of data sources
• Predominantly time-series
• Can come in streams (real-time) or batches
• Diverse data structures and schemas
• Some of it may be perishable
Combining sensor data with contextual data is the key to value creation from IoT
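To illustrate why context matters, the sketch below joins raw sensor readings with a small lookup of device context. The same temperature value is acceptable for one device and over the limit for another. All device names, fields and limits are hypothetical, and plain Python stands in for whatever engine a real deployment would use.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    device_id: str
    timestamp: int        # seconds since epoch (hypothetical unit)
    temperature_c: float


# Contextual data keyed by device: production line and safe operating limit.
# All values are made up for illustration.
DEVICE_CONTEXT = {
    "pump-1": {"line": "A", "max_temp_c": 80.0},
    "pump-2": {"line": "B", "max_temp_c": 60.0},
}


def enrich(readings, context):
    """Attach context to each reading and flag readings above the device limit."""
    enriched = []
    for r in readings:
        ctx = context.get(r.device_id)
        if ctx is None:
            continue  # no context available: the reading cannot be interpreted
        enriched.append({
            "device_id": r.device_id,
            "timestamp": r.timestamp,
            "temperature_c": r.temperature_c,
            "line": ctx["line"],
            "over_limit": r.temperature_c > ctx["max_temp_c"],
        })
    return enriched


readings = [
    Reading("pump-1", 1000, 70.0),
    Reading("pump-2", 1000, 70.0),  # same value, different meaning
    Reading("pump-9", 1000, 70.0),  # unknown device, dropped
]
result = enrich(readings, DEVICE_CONTEXT)
for row in result:
    print(row)
```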
Where Is the Manufacturing Data?
Mapping and Consolidation Are the Tip of the Iceberg for Big Data
Devices & Sensors
• Device Readings
• Device Performance
• Device Diagnostics
• Battery / Power Consumption
• Software Logs
• Environmental Interactions
• R&D
• Quality / Testing
Plant & Operations
• MES
• Sensors
• Video / Surveillance
• Line Productivity
• Machines
• Staffing / Scheduling
Supply Chain & Inventory
• ERP
• Supplier / Manufacturer
• Orders / Receivables
• Commodity Supplies / Prices
Marketing & CRM
• Transactions
• Accounts
• Warranties / Aftermarket
• Customer Service Logs
• Campaigns / Promotions
• Website / SEO
• Affiliates / Merchants
• Surveys
• Competitive Intelligence
Public & Trade
• Market Intelligence
• Policy / Regulation
• Demographic / Census
• Psychographic
• Inflation / Macroeconomic
• Gas Prices
• Labor Statistics
• Social / Search
• Public Health Data
• Clinical Studies
• Store Schematics
• Journals / Editorial
• Seismic / Speculation
Where Is the Retail Data?
Mapping and Consolidation Are the Tip of the Iceberg for Big Data
Customer Transactions
• POS / TLOG
• E-commerce / Mobile Sales
• In-Store Ordering
• Memberships / Loyalty Programs
• Warranties
Shopper Behavior
• Video / Surveillance
• Sensors
• Internet of Things
Out-of-Store Behavior
• Social / Sentiment
• Clickstreams
• Consumer / Consumption
Merchandising & Operations
• Schematic / Displays
• Store Layout / Characteristics
• Orders / Receipts
• Staffing / Scheduling
• Retailtainment
• Supplier / Manufacturer
Marketing & CRM
• Promotions / Trade
• Campaigns / SEO / Affiliate
• Direct / Indirect
• Customer Support Logs
• Surveys
• Competitive Intelligence
Public & Trade
• Demographic / Census
• Psychographic
• Gas Prices
• Labor Statistics
• Weather Data
• Public Health Data
• Industry Research
Merchandising
Use Case: First-in-Basket Analysis
Use exploratory analytics (e.g., clickstream) to identify promoted items that drive greater numbers of transactions and larger total transaction size.
Problem: Many Rigid Systems
Complex grid architecture is expensive, inflexible, error-prone, and hard to test. It does not scale to accommodate analysis of millions of SKU combinations.
Solution: Regression Analysis
Impala offsets the latency and constraints of EDWs to expand data available for merchandise regressions, also driving down the cost of ad hoc modeling.
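A first-in-basket analysis can be sketched in a few lines: group transactions by the first item added and compare transaction counts and average totals. This is a minimal illustration on made-up data in plain Python, not the Impala-based workflow the slide refers to; the record layout and the name `first_in_basket_stats` are assumptions.

```python
from collections import defaultdict

# Each transaction lists items in the order they were added to the basket,
# plus the transaction total. Hypothetical sample data.
transactions = [
    (["milk", "bread", "eggs"], 12.0),
    (["milk", "cereal"], 8.0),
    (["soda"], 2.0),
    (["soda", "chips"], 5.0),
    (["milk"], 3.0),
]


def first_in_basket_stats(txns):
    """Return {first_item: (transaction_count, average_total)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for items, total in txns:
        if not items:
            continue  # skip empty baskets
        first = items[0]
        counts[first] += 1
        totals[first] += total
    return {item: (counts[item], totals[item] / counts[item]) for item in counts}


stats = first_in_basket_stats(transactions)
# Rank first items by how many transactions they open.
for item, (n, avg) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    print(f"{item}: {n} transactions, average total {avg:.2f}")
```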
Buying
Use Case: Consumer-Driven Assortment
Overcome SKU rationalization across categories by isolating the products or mix of products that are most indicative of larger baskets or key customer groups.
Problem: Moving Data to Compute
Expanding sources to include broad market data from research, Google, Facebook, Twitter, etc. overwhelms systems built on traditional data warehouses.
Solution: Automate in Real Time
Centralize data from silos: transactions, clickstreams, service logs, social, etc. Find data using Search and build models with Pig, Mahout, Spark, analytics tools.
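One simple way to look for products "indicative of larger baskets" is to compare the average size of baskets containing a product with the average size of all baskets. The sketch below does this on hypothetical data in plain Python; it is an assumption about how such a score could be computed, not the Pig, Mahout or Spark models mentioned on the slide.

```python
from collections import defaultdict

# Hypothetical baskets, each a set of product names.
baskets = [
    {"diapers", "wipes", "formula", "milk"},
    {"milk", "bread"},
    {"diapers", "wipes", "beer"},
    {"bread"},
    {"milk"},
    {"diapers", "formula", "milk", "bread", "eggs"},
]


def basket_size_lift(all_baskets):
    """Return {product: avg size of baskets containing it / avg size of all baskets}."""
    overall = sum(len(b) for b in all_baskets) / len(all_baskets)
    sizes = defaultdict(list)
    for b in all_baskets:
        for product in b:
            sizes[product].append(len(b))
    return {p: (sum(s) / len(s)) / overall for p, s in sizes.items()}


lift = basket_size_lift(baskets)
ranked = sorted(lift.items(), key=lambda kv: kv[1], reverse=True)
for product, score in ranked:
    print(f"{product}: {score:.2f}")
```

A score above 1 means the product tends to appear in larger-than-average baskets. Products seen in only one or two baskets produce noisy scores, so a real analysis would likely apply a minimum-support filter first.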
Increase Customer Satisfaction
Challenge:
• Disparate view of customers
• Unable to analyze unstructured data cost-effectively and consistently
• Manual and random analysis of web chats and customer sentiments
Solution:
• Analyzing over 250K web chats / month
• Tapping into 100% of unstructured data versus 1% previously
• Discovering valuable patterns in web chats that were previously undetectable
• Reduce customer complaints by 25%
• Roadmap: expand usage to include additional omni-banking channels for 360 view
RETAIL BANK » CUSTOMER 360
DRIVE CUSTOMER INSIGHTS
Increase Customer Retention, Loyalty and Acquisition Rates
Challenge:
• Fragmented systems and disparate view of customers
Solution:
• Serve trend information back to customers via Santander’s “Spendlytics” application
• Capture, transform and enrich data in near-real time
RETAIL BANK » CUSTOMER 360 » CRM, MARKETING PROGRAMS » FRAUD » RISK » COMPLIANCE – BCBS239
DRIVE CUSTOMER INSIGHTS
DRIVE CUSTOMER INSIGHTS
GLOBAL INFORMATION SERVICES » CUSTOMER 360
Gains insights from customer spend data and behavior-based lifestyle segmentation
Challenge:
• High storage costs plus acquiring mass archival data was cost-prohibitive
• Unable to obtain a 360 view of consumer spend data, preferences and behavior or tap into new data sources
Solution:
• Using Apache Spark to analyze consumers’ preferences and interests based on their spending behavior patterns. Enhancing spend data with new data sources
• Processing 500% more matches per day
• 50X performance gains
• Deployment in < 6 months
“Nobody is doing what we’re doing with Hadoop today, especially at this order of magnitude. The Experian Marketing Suite’s Identity Manager is the first real-time linkage engine that accepts data, links information together across an entire marketing ecosystem, and puts it into a usable format for a solid customer experience.”
Emad Georgy, SVP Global Software Development, Experian Marketing Services
New Products, Gain Insights, & Reduced Costs
• Analyzing data from its 20+ automotive ecosystem brands (e.g. Autotrader, Kelley Blue Book)
• Combining data for new products and offerings that aren't otherwise possible
• Fine grained real-time view of activity, responses, inventory & pricing
• Reduced TCO by 50% by consolidating over 1PB of data, adding 200M rows daily
• “Impala provides analysts with near-Netezza speeds but on the Hadoop cluster”
DRIVE CUSTOMER INSIGHTS
Measure user interaction across the ecosystem, help direct R&D and development spend
• Real-time streaming and batch data from product logs, web analytics, channel data and ERP
• Virtuous cycle: Identify features that facilitate sharing of content that drive new customers
• Analyze utilization of new community attributes that drive adoption
MANUFACTURING » CUSTOMER 360 » DATA DRIVEN PRODUCTS » DATA DRIVEN SERVICES
DATA-DRIVEN PRODUCTS
Predictive Maintenance on Thousands of Industrial Machines in Real Time
Challenge:
• Collect and analyze data from thousands of diverse manufacturing systems in real-time
Solution:
• iTrak application using Cloudera in the Cloud to monitor the performance of individual manufacturing systems in real-time
• Predictive Maintenance - Proactively identifying & fixing issues before they break
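The slide does not say how iTrak identifies issues before a failure. As one common and simple approach, the sketch below flags a sensor reading when it deviates strongly from the trailing window of recent readings. The function name, window size, threshold and vibration values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    history = deque(maxlen=window)
    flagged = []
    for i, v in enumerate(values):
        if len(history) == window:  # only score once the window is full
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                flagged.append(i)
        history.append(v)
    return flagged


# Hypothetical vibration readings from one machine; index 7 is a spike.
vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 4.0, 1.0, 1.1]
flagged = detect_anomalies(vibration)
print(flagged)
```

Note that the spike itself enters the window afterwards and inflates the standard deviation for the next few readings; a production detector would usually handle that, for example by excluding flagged points from the baseline.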
MANUFACTURING » INDUSTRIAL IoT » PREDICTIVE MAINTENANCE » IMPROVED EFFICIENCIES
Industrial IoT – Predictive Maintenance
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS
LOWER BUSINESS RISKS
MAJOR RETAIL BANK » CYBER SECURITY
Top Retail Bank Uses EDH to Detect and Prevent Malware Attacks
Challenge:
• One malware source on SharePoint took 9 months to find – re-infection kept occurring
• Unable to determine source of malware
Solution:
• Uses Cloudera Enterprise to ingest internal network comms, proxy logs, etc. Uses Apache Spark (Machine Learning techniques) to create network graph
• Reduces the spread of malware within bank. Finds malware entry source
• Mobilized quickly to respond to the “Shellshock” bug
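To show how a network graph can help locate a malware entry point, the sketch below builds a directed graph from connection logs and returns infected hosts that received no connection from any other infected host. This is a plain-Python illustration on made-up host names; the bank's actual Spark-based method is not described in the deck, so the heuristic here is an assumption.

```python
from collections import defaultdict

# (source_host, destination_host) pairs extracted from network logs.
# Hypothetical data; host names are made up.
connections = [
    ("sharepoint", "pc-1"),
    ("sharepoint", "pc-2"),
    ("pc-1", "pc-3"),
    ("pc-2", "pc-3"),
    ("mail", "pc-4"),
]
infected = {"sharepoint", "pc-1", "pc-2", "pc-3"}


def candidate_sources(edges, infected_hosts):
    """Infected hosts with no inbound connection from another infected host."""
    inbound = defaultdict(set)
    for src, dst in edges:
        if src in infected_hosts and dst in infected_hosts and src != dst:
            inbound[dst].add(src)
    return sorted(h for h in infected_hosts if not inbound.get(h))


sources = candidate_sources(connections, infected)
print(sources)
```

Hosts returned by `candidate_sources` are only candidates: the heuristic assumes the logs capture every relevant connection and that infection spreads along logged connections.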
GLOBAL PAYMENT PROCESSOR » REAL-TIME FRAUD DETECTION & PREVENTION » CUSTOMER 360° » ETL OFFLOAD / STORAGE OPTIMIZATION
FRAUD
LOWER BUSINESS RISKS
Challenge:
• Spending $1 billion on EDW environment annually
• Data scientists and statisticians were unable to access more than a year’s worth of data
• Unable to perform faster queries or mine data for fraud and risk factors
Solution:
• Performs real-time fraud detection using Apache Spark and Impala
• Creates and back-tests new fraud models over historic data
• Identifies largest case of fraud in company’s history
• Ingesting 4TB of data per day
• Using Cloudera Enterprise for ETL offload, resulting in a 10-15% workload reduction and EDW optimization with $30M in annual savings
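Back-testing a fraud model means replaying it over labeled historic transactions and measuring how well it would have performed. The sketch below scores a candidate rule by precision and recall. The transaction fields, the rule and the sample data are hypothetical; the processor's real models and features are not given in the deck.

```python
def backtest(rule, history):
    """Apply `rule` to labeled historic transactions; return (precision, recall)."""
    tp = fp = fn = 0
    for txn in history:
        predicted = rule(txn)
        actual = txn["is_fraud"]
        if predicted and actual:
            tp += 1      # fraud correctly flagged
        elif predicted and not actual:
            fp += 1      # legitimate transaction wrongly flagged
        elif actual:
            fn += 1      # fraud that the rule missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled history.
history = [
    {"amount": 5000, "country_mismatch": True, "is_fraud": True},
    {"amount": 20, "country_mismatch": False, "is_fraud": False},
    {"amount": 3000, "country_mismatch": False, "is_fraud": False},
    {"amount": 4000, "country_mismatch": True, "is_fraud": True},
    {"amount": 15, "country_mismatch": True, "is_fraud": True},
]


def candidate_rule(txn):
    """Flag large transactions where card and merchant countries differ."""
    return txn["amount"] > 1000 and txn["country_mismatch"]


precision, recall = backtest(candidate_rule, history)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The longer the history available for replay, the more rare fraud patterns a back-test can expose, which is why access to more than a year of data matters in the slide above.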
FRAUD
GLOBAL PAYMENT PROCESSOR » DATA SECURITY » FRAUD DETECTION & PREVENTION » CUSTOMER 360 » IT COST REDUCTION
Cloudera Enterprise: First PCI Certified Hadoop Platform
• Performs real-time fraud detection and prevention with Apache Spark and Impala
• Secures 10 PB of data in a PCI-compliant manner every day
• Optimizes EDW and ETL Offload with savings in millions
• MasterCard Advisors partners with Cloudera
LOWER BUSINESS RISKS
REGULATORY AUTHORITY » TRADE SURVEILLANCE
Builds Holistic Picture of US Market By Looking at 30BN Events/Day
Challenge:
• Overseeing transactions from more than 4,100 firms incl. exchanges, broker-dealers & trade reporting facilities
• Difficult and costly to aggregate and analyze increasing volume of data from numerous sources incl. orders, quotes and trades
Solution:
• Built market event graph database using EDH
• Provides interactive access to graph data for investigations
• Using EDH on-premise and in the cloud
• Monitoring and analyzing transactions to detect fraud, insider trading, short sale, best execution
• Savings of $10-20M annually
Legacy Approach vs. Hadoop Machine Learning
The Legacy Approach
• Batch File Ingestion
• Discover Threats Too Late
• Rules Based
• Don’t Discover Zero-Day Attack Methods
• False Positive Overload
• Data Silos
• No Crime 360 Signals
• Flat World Forensics
• Discover Incidents, Not Crime Rings
• High-Cost Proprietary Architecture
• Limited Data Due to Cost Constraints
The Hadoop Machine Learning Approach
• Real-Time Packet Ingestion
• Discover in Seconds vs. Hours or Days
• Real-Time Anomaly Detection
• Discover 250% to 350% More Fraud
• 20 to 30 Times Fewer False Positives
• Enterprise Data Hub
• Discover Crime 360 Signals
• Graph-Based Visual Analytics
• Discover Crime Rings
• Native Hadoop Architecture
• Unlimited Data Storage & Analytics
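The contrast between a static rule and streaming anomaly detection can be sketched as follows. A fixed limit misses an event that is unusual for this stream but below the limit, while a detector that keeps running statistics (Welford's online algorithm) flags it as it arrives. The thresholds, names and sample amounts are assumptions; the sketch does not attempt to reproduce the fraud or false-positive figures quoted on the slide.

```python
import math


class RunningStats:
    """Welford's online algorithm for mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def rule_based(amount, limit=1000.0):
    """Legacy-style static rule: flag anything above a fixed limit."""
    return amount > limit


def stream_anomalies(amounts, threshold=3.0, warmup=5):
    """Flag events far from the running mean, one event at a time."""
    stats = RunningStats()
    flagged = []
    for i, x in enumerate(amounts):
        sd = stats.stddev()
        if stats.n >= warmup and sd > 0 and abs(x - stats.mean) / sd > threshold:
            flagged.append(i)
        else:
            stats.update(x)  # only learn from events that look normal
    return flagged


# Hypothetical transaction amounts for one account; index 6 is unusual.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 400.0, 23.0]
rule_hits = [i for i, a in enumerate(amounts) if rule_based(a)]
anomaly_hits = stream_anomalies(amounts)
print("rule:", rule_hits, "anomaly:", anomaly_hits)
```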
Challenges Architectures
Traditional Architecture
[Figure: traditional architecture diagram. Columns: Data Sources, Operational Data Stores, Enterprise Data Warehouse, Applications. Data sources: Portfolio (Contracts, Portfolio); Unstructured; Financial (Ledger, P&L); Risks (Market, Counterparty, Ratings); Payments (Collections, Charges). Operational data stores: Storage #1, Storage #2, HPC Grid. Flow labels: Ingest, Process, Load, ETL, ELT, Serve, Archive. Applications: BI System, Modeling, Reporting.]
© Cloudera, Inc. All rights reserved.
New Architecture with Big Data
Modern Architecture
[Figure: modern architecture diagram. Columns: Risk Data Sources, Cloudera Enterprise Data Hub (EDH), Applications. Data sources: Portfolio (Contracts, Portfolio); Unstructured; Financial (Ledger, P&L); Risks (Market, Counterparty, Ratings); Payments (Collections, Charges). EDH labels: Ingest, Storage, Compute, Transform, Active Structured Data, Archive, Serve. Enterprise Data Warehouse (EDW) labels: Extract, Load. Applications: BI System, Modeling, Reporting.]
Bringing the goals to life
Process and Tools
• Data exploration
• Data preparation
• Data modelling
• Data visualization
• Machine learning
http://www.jobs.ac.uk/enhanced/industry/lifesciences-london/
A. Technology Savvy
● Data management
● Analytics & virtualization
B. Service Oriented
● Architectural design
● System development
● Quality management
● System management
C. Our Experiences
● Massive and real time data processing
● Advanced data analytics
● Natural language processing (Thai and English)
D. Applications
● Data lake and virtual platform
● Voice of customer management
● Machine learning for personalization and recommendation
E. Team Proficiency
● Data architect
● Data engineer
● Report designer
● Data scientist
F. Our Alliances
● Big data experience center
● Consulting firm & Experts
● MS Partner Development Unit
G-ABLE data and analytics unit
• No Hub / No Data Lake
• No C360
• Tape – expensive to read data
• Expensive ETL tooling
• Expensive EDW per TB
• No enterprise search
• Long time to get value from data
• Slow to get access to the data
• IT led
• Problem with scale
• Data to analysis – days
• Not Petabyte scale
• Logs, clickstream data archived
Before Hadoop
• Enterprise Data Hub – Governed with Security and Search
• C360 – all data: web logs, clickstream, active archive
• Cheap: 1/30 the cost of an EDW. Easy to scale. PB+
• Seconds from log data and clickstream to analytics
With Hadoop
[Figure: data flow with Hadoop. Structured data (DB2, Oracle, MySQL, Cloud) via Sqoop; semi-structured data (web logs, clickstream data) via Flafka; Impala and Hive, with Hive / ODBC access; Analytics – BI and Predictive on all data.]