Demystify Big Data, Data Science & Signal Extraction Deep Dive
Uploaded by: hyderabad-scalability-meetup
Category: Technology
Transcript of Demystify Big Data, Data Science & Signal Extraction Deep Dive
What we will cover in the 60 mins
1. Technology Basics
2. Big Data Overview & Snapshot
3. Big Data Architecture: Deep Dive
4. Hadoop Overview
5. Clear Understanding of Data Science
6. Big Data Career Opportunities
7. Q & A
Apart from that we will also cover …
• An overview of the shift to Data Science Platforms
• The 3 critical components of a Data Science platform
• Industries that are most likely to get disrupted and shift to Data Science
• Characteristics of firms that get left behind the Data Science wave
• Factors that push an industry towards Data Science
• A brief overview of aspects of platform architecture beyond technology
Who am I?
• Mahesh Kumar CV is a Big Data entrepreneur
• Mahesh has about 14 years of experience in architecting and developing distributed and real-time data-driven systems
• Specialties: translating big data into action, Big Data trainings, product engineering services, and building Big Data CoEs & Big Data incubators
• Has written more than 60 blogs on Big Data & SAP Analytics
• Previously worked with IBM, Mindtree, CSC & Rolta
• Has conducted a couple of boot camps & workshops at different companies
Data vs. Information
• Data refers to a collection of numbers and characters, and is a relative term
• Data is raw: facts, figures, etc.
• Information is processed data
Structured Data vs. Unstructured Data
So where is this data getting generated ?
Social Networking and Media:
700 million Facebook users, 250 million Twitter users
175+ million public blogs
Each Facebook update, tweet, blog post and comment creates multiple new data points: structured, semi-structured and unstructured
Mobile Devices:
5 billion mobile phones in use worldwide
Each call, text and instant message is logged as data
Mobile devices, particularly smartphones and tablets, also make it easier to use social media
Internet Transactions:
Billions of online purchases, stock trades and other transactions happen every day, including countless automated transactions
Each creates a number of data points collected by retailers, banks, credit cards, credit agencies and others
Networked Devices and Sensors:
Electronic devices of all sorts (including servers and other IT hardware, smart energy meters and temperature sensors) all create semi-structured log data that records every action
Build Vs Buy
[Figure: all data types, by origin, growing from "Big Data today" to "Big Data tomorrow" on a 1X / 10X / 100X scale]
• Human driven: web logs, documents, social
• Machine driven: satellite images, bioinformatics, M2M log files, sensors, video, audio
• Business driven: OLTP
Defining Big Data
"Any amount of data that's too BIG to be handled by one computer." (John Rauser)
Why Big Data
• 12 TB of tweets in a day
• 80% of the world's data is unstructured
• 30 billion pieces of content shared on Facebook every month
• Data volume expected to reach 35 ZB by 2020
• 5 million trade events per second
• 2,267 million Internet users
• 4.7 billion searches on Google per day
• 5 billion people tweet, text, call and browse on mobile phones daily
• Walmart handles 1 million transactions per hour
• 255 million websites
Enterprise Data Landscapes
[Diagram: enterprise data landscape. Layer labels: Operational, Warehouses, Marts, Dimensional, Semantic, Information. Components shown: Oracle, DB2, SQL and other sources; BW, Teradata, Netezza; data marts; OLAP; IQ; Universe; queries, ad-hoc, dashboards, reports, applications; ETL between layers; unstructured data marked with "?".]
Big Data Reference Architecture
[Diagram: generic reference architecture, left to right]
• Structured data sources: legacy applications, ERP and CRM applications, external feeds
• Unstructured/semi-structured data sources: web logs, application / network logs, social, chat transcripts, emails; instrumentation data / sensors, RFID, telematics, time and location data
• Data integration (batch / near real-time): data extraction, data cleaning and transformation, change data capture for structured data, change data capture, real-time streaming / integration
• Data repositories: ODS, data warehouse, DW appliances, data marts, MOLAP cube, in-memory databases, columnar databases, unstructured / semi-structured data, MDM
• Analytics: scorecards and metrics, events and alerts, data mining and exploration, predictive analytics, text analytics
• End user analytics: reports / dashboards, visual exploration, mobile BI
Big Data Reference Architecture: SAP
[Diagram: the same reference architecture layers (data sources, data integration, data repositories, analytics, end user analytics) annotated with SAP products: SAP BO Data Services, SAP HANA, SAP BW, HANA / BW / Sybase, Sybase IQ / HANA, RDS / Rapid Marts, SAP MDM, Hadoop platform, BO CMS, SAP HANA dashboards, BO WebI / Crystal Reports, BO dashboard, BO Mobile, SAP Lumira, SAP Predictive Analysis, and 3rd-party tools.]
Big Data Reference Architecture: IBM
[Diagram: the same reference architecture layers annotated with IBM products: InfoSphere Information Server, InfoSphere CDC, InfoSphere Streams, InfoSphere MDM, BigInsights, BigInsights / Streams, BigInsights / NoSQL, BigInsights / HBase, PureData (Netezza, InfoSphere Warehouse, ISAS), Cognos Business Intelligence Enterprise, Cognos TM1, Cognos Mobile, InfoSphere Data Explorer, SPSS, SPSS Premium, Content Analytics, and an analytics sandbox.]
Big Data Reference Architecture: Oracle
[Diagram: the same reference architecture layers annotated with Oracle products: Data Integrator, Oracle Golden Gate, Data Integrator / Golden Gate, Hadoop / Golden Gate, Silver Creek, Oracle / Exadata, Exadata EHCC / HBase, Big Data Appliance, Oracle MDM, Essbase / Hyperion, Exalytics, Exalytics + Oracle R Ent., Endeca, BI Publisher, OBI Foundation Suite, OBI Mobile, OBI Scorecard, Real-time Decisions, and an analytics sandbox.]
Big Data Reference Architecture: Informatica + EMC + SAS
[Diagram: the same reference architecture layers annotated with products: Informatica PowerCenter & Data Quality, Informatica PowerCenter Real-time edition, Informatica hParser / Hadoop Pwx, Informatica MDM, EMC GreenPlum, EMC GreenPlum Database, EMC GreenPlum UAP, EMC Greenplum HD, HBase, SAS BI, SAS Visual Analytics, SAS Visual BI, SAS OLAP Server, SAS Ent. Miner, SAS Strategy Mgt, SAS Text Miner, JMP Pro, and an analytics sandbox.]
Big Data Reference Architecture: Open Source Technologies
[Diagram: the same reference architecture layers annotated with open source technologies: Apache MapReduce, Pig, Talend Data Integration & Data Quality, Apache Flume, Apache Hadoop, Apache HDFS + R, HBase, NoSQL, MySQL, Apache Hive, Apache Derby, Talend MDM, Pentaho Business Analytics / BI, Pentaho Mobile BI, R, Apache Mahout, and an analytics sandbox. Boxes marked "Commercial Product": SAS OLAP Server, SAS Text Miner.]
What is Hadoop
• A framework for large-scale data processing
• Inspired by Google's architecture
• A top-level Apache project; Hadoop is open source
• Written in Java, plus a few shell scripts
• An open-source software framework that supports data-intensive distributed applications
• Abstracts and facilitates the storage and processing of large and rapidly growing data sets
• Handles structured and unstructured data
• Offers simple programming models
2 key components of Core Hadoop
Hadoop is used everywhere
• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search
• AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines (Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB hard disk), giving a total of 37 TB HDFS capacity
• Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage
• FOX Interactive Media: 3 x 20-machine cluster (8 cores/machine, 2 TB/machine storage) and a 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining and machine learning
• NetSeer: up to 1,000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving and log analysis
• Powerset / Microsoft: natural language search; up to 400 instances on Amazon EC2; data storage in Amazon S3
HDFS : High level architecture
• HDFS follows a master-slave architecture
• 2 major daemons in HDFS: Name Node and Data Node
• Master: Name Node
  • Responsible for namespace and metadata
  • Namespace: file hierarchy
  • Metadata: ownership, permissions, block locations, etc.
• Slave: Data Node
  • Responsible for storing actual data blocks
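To make the master/slave split concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Hadoop API: the class names, the 4-byte block size, the round-robin placement and the file path are all assumptions chosen for illustration.

```python
# Toy in-memory model of the HDFS master/slave split (illustration only, not the Hadoop API).
BLOCK_SIZE = 4  # bytes per block in this toy; real HDFS blocks are far larger


class DataNode:
    """Slave: stores the actual data blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: keeps the namespace and metadata, and does not keep file contents."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}  # path -> {"owner": ..., "blocks": [(block_id, datanode_name)]}

    def write(self, path, data, owner):
        # Simplification: the master routes the bytes here. In real HDFS the
        # client sends blocks to Data Nodes directly, and blocks are replicated.
        placements = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{path}#{index}"
            node = self.datanodes[index % len(self.datanodes)]  # round-robin placement
            node.store(block_id, data[offset:offset + BLOCK_SIZE])
            placements.append((block_id, node.name))
        self.metadata[path] = {"owner": owner, "blocks": placements}

    def read(self, path):
        by_name = {node.name: node for node in self.datanodes}
        blocks = self.metadata[path]["blocks"]
        return b"".join(by_name[name].read(block_id) for block_id, name in blocks)


nodes = [DataNode("dn1"), DataNode("dn2")]
namenode = NameNode(nodes)
namenode.write("/logs/a.txt", b"hello hadoop", owner="mahesh")
restored = namenode.read("/logs/a.txt")
```

The point of the sketch is the division of responsibility: `namenode.metadata` only records who owns the file and where each block lives, while the bytes themselves sit in the `blocks` dictionaries of the two Data Nodes.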
MapReduce : High Level Architecture
• MapReduce has a master-slave architecture too
• 2 daemon processes
• Master: Job Tracker
  • Responsible for dividing, scheduling and monitoring work
• Slave: Task Tracker
  • Responsible for actual processing
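The division of labour between the two daemons can be sketched with the classic word count. This is a single-process Python illustration, not Hadoop code: `run_job` plays the Job Tracker role and `map_task` / `reduce_task` stand in for the work a Task Tracker would run. All function names and the sample input are assumptions.

```python
from collections import defaultdict


def map_task(split):
    """Task Tracker work: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def reduce_task(word, counts):
    """Task Tracker work: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines, num_splits=2):
    """Job Tracker role: divide the input, hand out tasks, collect the results."""
    splits = [lines[i::num_splits] for i in range(num_splits)]
    grouped = defaultdict(list)
    for split in splits:                   # map phase (sequential in this sketch)
        for word, one in map_task(split):
            grouped[word].append(one)      # shuffle: group emitted values by key
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


word_counts = run_job(["big data is big", "data science uses data"])
```

In a real cluster the map and reduce tasks run in parallel on different machines, and the master also monitors and reschedules failed tasks; the sketch leaves that out.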
High Level View
Apache Hadoop Ecosystem
Disruptions
What's common to the following game-changing solutions?
1. Japanese dating app
2. Heart implants
3. MOOC
4. Sensor-equipped cows in the Netherlands
5. Google's autonomous car
At the core there is a deeply embedded DATA PRODUCT!
Created by DATA SCIENCE!
Conquer the world! Become a Data Scientist
The world around us is changing… Our lives are intimately surrounded by data products (an intimate fabric of our lives)
• How our health gets cared for
• How we learn
• How we fall in love
• How we do farming
• How we drive
How did the following players disrupt the marketplace?
• Amazon defeated Borders (books)
• Netflix defeated Blockbuster (video)
• iTunes defeated Tower Records (music)
• Google defeated Yahoo (search) with the PageRank algorithm
If Data Science is not integral, you are no longer in the game
Demystifying Data Science (in simple, plain, everyday English)
In a Nutshell
• Data Science is the extraction of knowledge from data
• Data Science is the art of turning data into actions
• The ability to take data: to understand it, to process it, to extract value from it, to visualize it, to communicate it
• Data Science seeks to:
  • Extract meaning from data
  • Create "Data Products"
  • Use all available data to tell a valuable story to non-practitioners
The future belongs to the companies and people that turn data into products
Data Science is everywhere
Data Science vs. BI
• Known unknowns (BI)
• Unknown unknowns (Data Science): lots of $-impacting patterns, unnoticed, waiting to be discovered!
“As is” state in most organizations
Data (Sales, Finance) → Reports (BO, Cognos, MSAS)
“As is” state with leading game changers
Data repository → Analytics cell + modeling processes (segment, score, text mine) → Insights
Move from reports to insightful actions that impact
What are the 4 core differences between Data Science & Dashboards?
Dashboards: Data repository → Dashboards
Data Science: Data repository (purchase habits) →
1. ML process (collaborative filtering)
2. Signal (similar people discovery)
3. Actions (recommend a product)
4. Outcomes (improve cross sell)
ML + Signals + Actions = Game Changing Outcomes
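The purchase-habits example can be sketched as a tiny user-based collaborative filter in Python. This is a minimal illustration under stated assumptions: the customer names and baskets are made up, and Jaccard overlap is just one simple choice of similarity measure, not necessarily the one the speaker had in mind.

```python
# Hypothetical data repository: customer -> set of purchased products.
purchases = {
    "asha": {"phone", "case", "charger"},
    "ravi": {"phone", "case", "headphones"},
    "meena": {"laptop", "mouse"},
}


def jaccard(a, b):
    """Similarity of two baskets: shared items divided by all items."""
    return len(a & b) / len(a | b)


def recommend(customer, purchases, top_n=1):
    own = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        similarity = jaccard(own, items)      # signal: similar people discovery
        for item in items - own:              # candidates the customer does not own yet
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]  # action: recommend


suggestion = recommend("asha", purchases)
```

Here the repository is `purchases`, the ML process is the similarity computation, the signal is that "ravi" buys like "asha", and the action is the returned product. The outcome (improved cross sell) is what the business would measure afterwards.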
What exactly is a model?
• Mathematically defining a real-world phenomenon
• Representative of the real world
• For example, a cross-sell model
What are 3 common things between predictive models and caricatures?
• It's an approximation, not a perfection
• It's better than not having anything
• It gets the job done
REAL WORLD
ANALYTICAL MODEL
What's the Goal of Data Science?
Use data to discover signals (patterns) that cause changes that impact $.
Data Science Reference Architecture – Key components
[Diagram labels: Hadoop, Hive, HANA, Infobright; clustering, text mining; mobile, digital; data ingestion pipeline]
Machine Learning Reference Architecture
1. STORE (Hadoop, Hive, HANA, Cloudera, Splunk, Hortonworks)
2. SENSE (signal extraction: text mining, scoring models)
3. RESPOND (front-line actions through website, call centre)
Snapshot of Machine Learning Techniques
1. Segmentation
• Customer behavior segmentation
• Defect segmentation
• Employee segmentation model
• Supplier segmentation model
• "Chunking" groups
• Discovered by algorithm
2. Text mining
• Convert messy unstructured text into actionable signals
• Keyword frequencies
• Sentiment ratios
• Blogs
• Call center transcripts
• Emails
• Multi-channel sentiment analysis
3. Forecasting
• Predict CLTV
• Predict sales at a neighborhood outlet
• Predict salary based on experience, qualification, rating, market demand
• Identify drivers of behavior
• Weights processing
4. Visual Analytics
• Beyond line, bar, pie charts
• Geospatial modeling to see geo correlation
• Spread analysis
• Outlier detection
5. Scoring models
• Churn propensity
• Cross sell
• Attrition modeling in HR
• Risk scoring models in banking
• Logistic regression
• Neural networks
• Decision trees
• Support vector machines
6. Optimisation
• Constraint modeling
• Maximize an outcome
• Maximize sales without cannibalizing sister brands
It's all about DETECTING PATTERNS!
1. Segmentation
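As an illustration of groups "discovered by algorithm", here is a minimal k-means sketch in plain Python. The customer data (visits per month, average basket value) is invented, and seeding the centroids with the first k points is a simplification for the example rather than a recommended practice.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive seeding: first k points (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (9, 95), (2, 12), (8, 90), (1, 15), (10, 100)]
centroids, clusters = kmeans(customers, k=2)
```

Nobody tells the algorithm which customers are "occasional" and which are "frequent, high value"; the two segments fall out of the distances alone.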
2. Unstructured Text Mining
Real-world unstructured text mining in health care
• Inputs: doctors' transcripts, lab diagnostics, nurses' observations
• Step-1 : SPLIT. Split sentences into words/tokens
• Step-2 : FILTER. Filter "noise" words, e.g. I, the, is, was
• Step-3 : STEMMING. 'Pulmonary' = 'pulmonar'; 'Insomnia' = 'Sleep' = 'Sleeplessness'
• Step-4 : THEME EXTRACTION. Keyword extraction & theme generation
• Step-5 : THEME / KEYWORD ANALYSIS
• Outputs: Cardiac watch list, Oncology watch list, Pulmonary watch list, Diabetic watch list, Schizophrenia watch list
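The five steps can be sketched as a small Python pipeline. This is a toy version under stated assumptions: the stop-word list, the stem lookup table, the theme names (including "Sleep disorder") and the sample transcript are all hypothetical, and a dictionary lookup stands in for a real stemmer.

```python
import re
from collections import Counter

# Hypothetical lookup tables, made up for this sketch.
STOP_WORDS = {"i", "the", "is", "was", "a", "of", "and", "has", "with"}
STEMS = {"pulmonary": "pulmonar", "sleeplessness": "sleep", "insomnia": "sleep"}
THEMES = {
    "pulmonar": "Pulmonary watch list",
    "cardiac": "Cardiac watch list",
    "sleep": "Sleep disorder",
}


def split(text):
    """Step 1, SPLIT: break sentences into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_noise(tokens):
    """Step 2, FILTER: drop noise words."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Step 3, STEMMING: map variants to one root (table lookup, not a real stemmer)."""
    return [STEMS.get(t, t) for t in tokens]


def extract_themes(tokens):
    """Step 4, THEME EXTRACTION: count the keywords that map to a known theme."""
    return Counter(THEMES[t] for t in tokens if t in THEMES)


transcript = "The patient has insomnia and pulmonary congestion. Sleeplessness was reported."
themes = extract_themes(stem(filter_noise(split(transcript))))  # Step 5: analyse the counts
```

Step 5 would then compare the theme counts across patients or over time, for example to decide who belongs on which watch list.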
3. Scoring Models
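A scoring model such as churn propensity can be sketched with logistic regression trained by plain gradient descent. Everything specific here is an assumption for illustration: the two features (complaints last quarter, years as a customer), the six training rows, the learning rate and the epoch count.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, weights, bias):
    """Return a propensity between 0 and 1 for one customer."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.05, epochs=2000):
    """Fit weights with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = score(x, weights, bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical history: (complaints last quarter, years as customer) -> churned?
rows = [(5, 1), (4, 2), (0, 6), (1, 5), (6, 1), (0, 8)]
labels = [1, 1, 0, 0, 1, 0]
weights, bias = train_logistic(rows, labels)

high_risk = score((6, 1), weights, bias)  # many complaints, new customer
low_risk = score((0, 7), weights, bias)   # no complaints, long tenure
```

The output is a score rather than a yes/no answer, which is what lets a business rank customers and act on the riskiest ones first.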
4. Forecasting !
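A minimal forecasting sketch, assuming the simplest possible case: one driver (years of experience) and an ordinary least-squares line. The salary figures are invented and perfectly linear so the arithmetic is easy to follow; real forecasts would use more drivers and noisier data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


experience = [1, 2, 3, 4, 5]   # years (made-up data)
salary = [30, 35, 40, 45, 50]  # in thousands (made-up data)
slope, intercept = fit_line(experience, salary)
predicted = intercept + slope * 6  # forecast for 6 years of experience
```

The fitted slope is also the "weight" of the driver: it says how much the predicted salary moves for each extra year of experience.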
5. Recommenders
Industries disrupted by Data Science
• Telecom: infrastructure optimisation, network security
• Banking: customer sentiment, multi-channel analysis
• Digital channel: consumer engagement, recommendation engines
• Automotive: autonomous cars, OnStar
• Health care: wearables
• Oil & gas: operations optimisation
• Retail: digitisation
What factors are driving companies towards data science ?
• Competitive advantage in the market place ( get ahead fast using unique insights )
• Existential threat ( others are moving ahead fast and I need to catch up )
• Revenue enhancement ( Cross sell models, recommenders )
• Cost optimisation ( Operational efficiency )
Technology behind Data Science
• Algorithms
• Machine learning
• Predictive analytics
• R
Why is Big Data HOT ?
1. Big Data jobs are exploding!
2. Data Science jobs are exploding!
3. Data Science jobs are exploding in India too!
Transform yourself to 21st Century Skills
The 6 Most Desired Skills in 2015
To summarize 3 key takeaways …
FAQ
FAQ-1: “I am confused between Hadoop and Data Science … What's the difference between Hadoop and Data Science?”
• Hadoop = Data Infrastructure layer
• Data Science = Sensing patterns from data to impact business outcome
FAQ-2: “I have worked on SAP, Oracle, etc. How do I transition to becoming a Data Scientist?”
• Execute your first Data Science pilot
  • Step-1: Learn R
  • Step-2: Zero in on a business problem to solve
  • Step-3: Set up the R connector for your technology … get access to data from your technology
  • Step-4: Apply an analytical construct (VEDA ML)
  • Step-5: Discover the pattern which impacts the outcome
  • Step-6: Present final results to the executive business team
• Explore setting up a Data Science project within your existing organisation
• Attend meetups to explore the outside world
FAQ-3: “Should I know probability and advanced statistics ?”
• Not really
• We are focussed on APPLICATION and not the THEORY underpinning it
• We will teach you:
  • The business problem to solve
  • How to execute the command on a platform
  • What to look for in the output
• What happens within the black box can be explored later
FAQ-4: “This is a big shift for me … In your experience how long does it take to make the transition from IT to Data Science ?”
• We have seen people make the transition in anywhere from 4 weeks to about 6 months
• It depends upon the time + passion + drive you have
FAQ-5: “How are we going to prepare you for the data science job market ?”
1. Mock preparatory sessions
2. Worksheets + Modelling Checklists + Data Science Playbooks
3. Live projects on clustering , scoring which can be put in resume
4. Our strategic tie-ups with Organisations looking for data science skills
5. Top 30 Practitioner generated Data Science questions
FAQ-6: “I am not an IT professional but a domain person. How can I get started ?”
1. Option-1 : Focus on Industry use cases
2. Option-2 : Take basic introduction to data sciences
Your Turn : Happy to Answer your Questions
Big Data Resources
• datasciencecentral.com
• bigdatauniversity.com
• coursera.org
• Big Data Architecture
• Spotting Signals in Big Data
• Signal Extraction Methodology
• Advanced Visualization in Big Data
• Exploratory Data Analysis (EDA) : Quick Deep Dive
• Best practices in designing dashboards and scorecards
• Exploring Big Data Using Bivariate Analysis
• Where to start looking in Big Data using Univariate Analysis
• Big Data Platform & Applications
• Statistics Role in Data Science
• Applied Mathematics Role in Data Science
• Data-Scientist-playbook
• 5-disruption-data-products By Data Science
All the best! Happy Hadooping & dating with Data Science
Conquer the world! Become a Data Scientist