Analytics in the cloud
Analytics in the Cloud
Natalino Busa - Head of Data Science
2 Natalino Busa - @natbusa
Distributed Computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
Head of Applied Data Science at Teradata
On most networks: @natbusa
Analytics in the cloud: stacking layers
Bare Metal: Physical Machines
IaaS: Virtual Resources
CaaS: Containers
dPaaS: Datastores, Data Engines
iPaaS: Tools Integration, Flows & Processes
DAaaS: Data Analytics as a Service
Watson Services, Azure ML, Google Cloud ML, BigML
Analytics in the cloud: today’s talk
Bare Metal: Physical Machines
IaaS: Virtual Resources
CaaS: Containers
dPaaS: Datastores, Data Engines
iPaaS: Tools Integration, Flows & Processes
DAaaS: Data Analytics as a Service
“we live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
Containers as a Service
For example: Amazon ECS
https://aws.amazon.com/ecs/
CaaS: 6 offerings
https://www.linux.com/news/5-container-service-tools-you-should-know-about
Project Magnum, Amazon ECS, Docker DataCenter, Google Container Engine
PaaS: Big Data SQL Queries
- Batch Oriented, Large Aggregations
- Interactive Queries, Data Exploration
- Interactive Queries, Machine Learning, Streaming: Micro-batching
- Interactive Queries, Machine Learning, Streaming: Event-driven
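The two streaming styles named on this slide, micro-batching and event-driven, can be contrasted with a toy sketch in plain Python. It is only an illustration of the idea and not how any particular engine is implemented; the function names, the `batch_size` parameter, and the sample events are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def micro_batch_totals(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Micro-batching: buffer events and update the running total once per batch."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            yield total
            batch = []
    if batch:  # flush the last, possibly partial, batch
        total += sum(batch)
        yield total


def event_driven_totals(events: Iterable[int]) -> Iterator[int]:
    """Event-driven: update the running total and emit it on every single event."""
    total = 0
    for value in events:
        total += value
        yield total


events = [1, 2, 3, 4, 5]  # hypothetical event payloads
batch_results = list(micro_batch_totals(events, batch_size=2))
event_results = list(event_driven_totals(events))
```

Both functions end at the same total. The difference is how often a result becomes visible: once per batch in the first case, once per event in the second.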
PaaS: Advanced Analytics
Graph analytics:
- Cluster items
- Extract similarities
- Detect patterns
PaaS: Advanced Analytics
Text analytics:
- Sentiment Analysis
- Language Detection
- Summarization
- Entity extraction
PaaS: Advanced Analytics
Machine Learning:
- Classification
- Regression
- Clustering
- Forecasting
- Anomaly detection
PaaS: Advanced Analytics
AI and Deep Learning:
- Unstructured Data
- Object Detection
- Natural Language Processing
- Video Summarization
- Speech Recognition
PaaS: Advanced Analytics
SQL + Graph + Text + Machine Learning + Voice/Image/Video
dPaaS: Machine (deep) Learning
… these are just a few examples ...
Analytics Everywhere
Public Cloud, Managed Cloud, Private Cloud, Private Infra
iPaaS: Components for Analytics in the Cloud
- Object Stores
- SQL: Relational, Transactional DB
- SQL: Big Data, Data Warehousing
- NoSQL
- Machine Learning
- Streaming Computing
iPaaS, dPaaS:
Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis; S3 (AWS), Storage (GCP), ...
SQL: Relational, Transactional DB: MySQL, PostgreSQL, MariaDB; Oracle (AWS MP)
SQL: Big Data, Data Warehousing: Hive, Presto, Spark SQL, Impala; Redshift (AWS), BigQuery (GCP), Big SQL (IBM); Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP)
NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase; BigTable (GCP), DynamoDB
Machine Learning: Spark ML, H2O, Flink, Aerosolve, Theano, Tensorflow, XGBoost; Azure ML, AWS ML, Google ML, IBM Watson
Streaming Computing: Heron (Storm), NiFi, Spark Streaming, Flink, Kafka Streams, Logstash, StreamSQL; Google DataFlow (GCP)
iPaaS: Selecting your Analytical Stack
Flexible. Powerful.
- Combinations for this example: 8 * 3 * 4 * 8 * 7 * 7 = 37632
Right tool for the right job
- Fit for purpose
- Multi-Genre Analytics
Hard to maintain and upgrade:
- Extended skills and know-how
- Component upgrades must be compatible
Hard to configure:
- No matter whether cloud, bare metal, or VMs
- Complex stacks with many tools and services
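The combination count on this slide can be reproduced with a few lines of Python. The sketch assumes that each factor is the number of self-hosted tools listed for one category on the component slide (managed cloud services left out); that mapping is inferred from the counts matching, not stated in the talk. The dictionary name `options` is made up for the example.

```python
from math import prod

# Self-hosted tools per category, as listed on the component slide.
# Assumption: these are the options behind the factors 8 * 3 * 4 * 8 * 7 * 7.
options = {
    "Object Stores": ["HDFS", "GlusterFS", "CephFS", "NFS", "Swift", "Nova",
                      "Cassandra", "Redis"],
    "SQL: Relational, Transactional DB": ["MySQL", "PostgreSQL", "MariaDB"],
    "SQL: Big Data, Data Warehousing": ["Hive", "Presto", "Spark SQL", "Impala"],
    "NoSQL": ["Cassandra", "Redis", "HBase", "Accumulo", "Neo4J",
              "ElasticSearch", "MongoDB", "Couchbase"],
    "Machine Learning": ["Spark ML", "H2O", "Flink", "Aerosolve", "Theano",
                         "Tensorflow", "XGBoost"],
    "Streaming Computing": ["Heron (Storm)", "NiFi", "Spark Streaming", "Flink",
                            "Kafka Streams", "Logstash", "StreamSQL"],
}

# One tool is picked per category, so the number of stacks is the product
# of the number of choices in each category.
factors = [len(tools) for tools in options.values()]
total_stacks = prod(factors)
```

Adding a single extra option to any category multiplies the total again, which is why the number of possible stacks grows so quickly.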
iPaaS: Deploy & Manage your own Analytics
How to simplify? Select a bundle!
iPaaS: bundled recipes & stacks
Select a recipe:
- Hortonworks Data Platform
- Cloudera Data Platform
- Reactive Platform
- SMACK Stack
- Pancake Stack
- ELK Stack
- Select your own
iPaaS: my favorite analytical stacks
Stack            | Object Stores | NoSQL         | SQL: Big Data, Data Warehousing | Machine Learning       | Streaming Computing
All Hadoop (5)   | HDFS          | HBase         | Hive                            | Spark                  | Storm
SMACK stack (2)  | Cassandra     | Cassandra     | Spark                           | Spark                  | Spark
Elastic (5)      | HDFS          | ElasticSearch | Hive                            | H2O                    | Kafka
Data Science (8) | HDFS          | ElasticSearch | Hive, Presto                    | Spark, H2O, Tensorflow | Flink
Real Time (2)    | Cassandra     | Cassandra     | Flink                           | Flink                  | Flink
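The number in parentheses after each stack name appears to be the count of distinct components in that stack. A minimal sketch that checks this reading against the rows of the slide; the data layout and the helper name `distinct_components` are assumptions for illustration only.

```python
# Each stack lists its tools per layer, in the slide's column order:
# object store, NoSQL, big-data SQL, machine learning, streaming.
stacks = {
    "All Hadoop": [["HDFS"], ["HBase"], ["Hive"], ["Spark"], ["Storm"]],
    "SMACK stack": [["Cassandra"], ["Cassandra"], ["Spark"], ["Spark"], ["Spark"]],
    "Elastic": [["HDFS"], ["ElasticSearch"], ["Hive"], ["H2O"], ["Kafka"]],
    "Data Science": [["HDFS"], ["ElasticSearch"], ["Hive", "Presto"],
                     ["Spark", "H2O", "Tensorflow"], ["Flink"]],
    "Real Time": [["Cassandra"], ["Cassandra"], ["Flink"], ["Flink"], ["Flink"]],
}


def distinct_components(layers):
    """Return the set of tools used anywhere in a stack; reused tools count once."""
    return {tool for layer in layers for tool in layer}


component_counts = {
    name: len(distinct_components(layers)) for name, layers in stacks.items()
}
```

Under this reading, a stack that reuses one tool across several layers has fewer components to install, upgrade, and keep compatible.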
dPaaS: Managed Analytics
This is hard! Can we access it as a service?
dPaaS: Managed Hadoop & Spark
HDInsight: Hadoop, Spark, and R as services
Managed Spark Clusters, BigInsights (Hadoop)
DataFlow and DataProc: Flink, Spark and Hadoop Clusters as a Service
EMR: Hadoop components a la carte
PaaS: Analytical clusters
Ephemeral:
- Create then Dispose
- Clusters are Short-Lived
- Data Exploration
- Isolated, Personal
- Simple Access Management
- Interactive Analytics
vs
Permanent:
- Clusters are Long-Lived
- Scheduled Operations
- Production ETL
- Co-Ordinated
- Complex Access Management
- Batch Analytics
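The "create then dispose" lifecycle of an ephemeral cluster maps naturally onto a Python context manager. The sketch below uses a hypothetical in-memory `Cluster` class as a stand-in; it is not the API of any cloud provider, and the names `ephemeral_cluster`, `nodes`, and `run` are assumptions for illustration.

```python
from contextlib import contextmanager


class Cluster:
    """Hypothetical stand-in for a managed cluster; not a real cloud SDK."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def run(self, job):
        if not self.running:
            raise RuntimeError("cluster is not running")
        return job()


@contextmanager
def ephemeral_cluster(name, nodes):
    """Create a cluster for one piece of work and always dispose of it afterwards."""
    cluster = Cluster(name, nodes)
    cluster.start()
    try:
        yield cluster
    finally:
        cluster.stop()  # disposal happens even if the job raises


with ephemeral_cluster("exploration", nodes=3) as cluster:
    was_running = cluster.running
    result = cluster.run(lambda: sum(range(10)))
```

A permanent cluster would instead be started once and shared by scheduled jobs, which is where the coordination and access-management items on the slide come in.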
DAaaS: Google ML and AI as a service
Cloud Computing for Deep Neural Networks > Train, Score, Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
Summary
• Analytics in the Cloud:
The dawn of a new computing era
• iPaaS, dPaaS:
Complexity vs flexibility: it’s a tradeoff
• Computing clusters:
Ephemeral and Persistent
Head of Applied Data Science at Teradata
Distributed Computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
LinkedIn and Twitter: natbusa