AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Pathak, Amazon EMR
March 30, 2016
Building Big Data Solutions with Amazon EMR & Amazon Redshift
Agenda
• AWS Big Data Platform Overview
• Amazon EMR & Amazon Redshift
• Building a Big Data Application
• Customer Use Cases
AWS Big Data Platform
Collect: Import/Export, Kinesis, Direct Connect
Store: S3, Glacier, DynamoDB
Analyze: EMR, EC2, Redshift, Machine Learning, QuickSight (New!)
Amazon EMR – Managed Hadoop Clusters in the Cloud
Scalable Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, HBase, etc.
Easy to use; fully managed
On demand, reserved, spot pricing
HDFS, Amazon EBS, and S3 filesystems
End to end security
Amazon EMR
Easy to deploy
AWS Management Console
Command line
or use the EMR API with your favorite SDK
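As a rough illustration of the "EMR API with your favorite SDK" option, the sketch below builds a request for the EMR RunJobFlow call and, separately, submits it through boto3. The cluster name, release label, instance type, and region are assumptions chosen for the example, not values from the slides; the default role names assume the standard EMR default roles already exist in the account.

```python
def build_cluster_request(name, instance_type="m3.xlarge", core_count=2):
    """Build keyword arguments for the EMR RunJobFlow API call.

    All values here are illustrative assumptions, not taken from the webinar.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.4.0",  # assumed release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running after any steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request with boto3 (needs AWS credentials; creates billable resources)."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_cluster_request("demo-cluster"))` would start a real cluster, so the builder is kept separate to allow inspecting the request first.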
Choose your instance types
• CPU (C3/C4 family): machine learning
• Memory (R3 family): Spark and interactive
• Disk/IO (D2/I2 family): large HDFS
• General (M3/M4 family): batch processing
Customize your storage type and size using Amazon EBS
Try different configurations to find your optimal architecture
The Hadoop ecosystem can run in Amazon EMR
Integrated with the AWS Platform
• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC data source with Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Amazon Redshift COPY from HDFS
• Amazon S3: EMR File System (EMRFS)
Amazon S3 as your persistent data store
Amazon S3
Designed for 99.999999999% durability
Separate compute and storage
Resize and shut down Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
HDFS is still available via local instance storage or Amazon EBS
Amazon Redshift
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; start at $0.25/hour
End to end security; built in global DR
Amazon Redshift
Amazon Redshift dramatically reduces I/O
Column storage
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• With row storage you do unnecessary I/O
• To get total amount, you have to read everything
Amazon Redshift dramatically reduces I/O
Column storage
• With column storage, you only read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
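The difference between the two layouts can be sketched with the four-row table from the slide. This is a toy in-memory model, not how Amazon Redshift stores blocks on disk; the function names are hypothetical and "values read" simply counts fields touched.

```python
COLUMN_NAMES = ("ID", "Age", "State", "Amount")
ROWS = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]


def total_from_rows(rows, column):
    """Row storage: every field of every row is read to reach one column."""
    idx = COLUMN_NAMES.index(column)
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        total += row[idx]
    return total, values_read


def to_columns(rows):
    """Pivot the rows into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}


def total_from_columns(columns, column):
    """Column storage: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_reads = total_from_rows(ROWS, "Amount")
col_total, col_reads = total_from_columns(to_columns(ROWS), "Amount")
print(row_total, row_reads)  # 1250 16
print(col_total, col_reads)  # 1250 4
```

Both layouts give the same total; the columnar one touches 4 values instead of 16, and the gap widens as the table gets more columns.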
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
• Columnar compression saves space & reduces I/O
• Amazon Redshift analyzes and compresses your data
analyze compression listing;
Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
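To give a feel for why an encoding such as `delta` suits a column like `listid`, here is a minimal sketch of delta encoding: store the first value, then only the difference from the previous value. It illustrates the general idea only and is not Amazon Redshift's on-disk format; the function names are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then each value's difference from its predecessor."""
    if not values:
        return []
    encoded = [values[0]]
    for prev, cur in zip(values, values[1:]):
        encoded.append(cur - prev)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    decoded = []
    running = 0
    for delta in encoded:
        running += delta
        decoded.append(running)
    return decoded


ids = [1000, 1001, 1003, 1006]
print(delta_encode(ids))  # [1000, 1, 2, 3]
```

When consecutive values are close together, the differences fit in far fewer bytes than the full values, which is where the space and I/O savings come from.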
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
• Keep track of the minimum and maximum values for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
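The block-skipping idea can be sketched in a few lines. This is a simplified in-memory model with a hypothetical block size and function names, not Amazon Redshift's implementation: each block carries its minimum and maximum, and a lookup only reads blocks whose range could contain the target.

```python
def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record (min, max, block) for each."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_equal(zone_map, target):
    """Return matching values and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for low, high, block in zone_map:
        if target < low or target > high:
            continue  # the zone map proves this block cannot contain the target
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read


zone_map = build_zone_map(list(range(1, 13)), block_size=4)
print(scan_equal(zone_map, 6))  # ([6], 1)
```

With sorted data, the query reads one block out of three. On unsorted data the ranges overlap more and fewer blocks can be skipped.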
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
• Use direct-attached storage to maximize throughput
• Hardware optimized for high performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you
Amazon Redshift Has Security Built In
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
[Architecture diagram labels: SQL Clients/BI Tools; JDBC/ODBC; Customer VPC; Leader Node; Internal VPC; 10 GigE (HPC); Compute Node (128GB RAM, 16TB disk, 16 cores); Ingestion, Backup, Restore; Amazon S3 / DynamoDB / EMR]
Building a Big Data Application
Building a Big Data Application
Getting Started
[Diagram labels: web clients; mobile clients; DBMS; corporate data center]
Building a Big Data Application
Adding a data warehouse
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; Amazon Redshift; Amazon QuickSight]
Building a Big Data Application
Bringing in Log Data
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; Raw data; Staging Data; Amazon Redshift; Amazon QuickSight]
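One common way to move staged files from Amazon S3 into Amazon Redshift is the COPY command. The slide does not spell out the load step, so the sketch below is an assumption about how it could look: the helper name, table name, bucket path, and role ARN are all hypothetical placeholders, and the file format (pipe-delimited, gzip-compressed) is chosen only for illustration.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, delimiter="|"):
    """Build a Redshift COPY statement for gzip-compressed, delimited files in S3.

    The inputs are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"DELIMITER '{delimiter}' GZIP;"
    )


# Hypothetical names, for illustration only.
print(build_copy_statement(
    "weblogs",
    "s3://example-bucket/staging/weblogs/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
))
```

The resulting statement would be run through any SQL client connected to the cluster over JDBC/ODBC.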
Building a Big Data Application
Extending your DW to S3
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; Raw data; Staging Data; ORC/Parquet (query optimized); Amazon Redshift; Amazon QuickSight]
Building a Big Data Application
Adding a real-time layer
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; Kinesis Streams; Raw data; Staging Data; ORC/Parquet (query optimized); Amazon Redshift; Amazon QuickSight]
Building a Big Data Application
Adding predictive analytics
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; Kinesis Streams; Raw data; Staging Data; ORC/Parquet (query optimized); Amazon Redshift; Amazon QuickSight]
Building a Big Data Application
Adding encryption at rest with AWS KMS
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; AWS KMS; Kinesis Streams; Raw data; Staging Data; ORC/Parquet (query optimized); Amazon Redshift; Amazon QuickSight]
Building a Big Data Application
Protecting Data in Transit & Adding Network Isolation
[Diagram labels: web clients; mobile clients; DBMS; corporate data center; AWS cloud; VPC subnet; SSL/TLS; AWS KMS; Kinesis Streams; Raw data; Staging Data; ORC/Parquet (query optimized); Amazon Redshift; Amazon QuickSight]
Security
• Encryption at rest with choice of key management
  • Service managed, AWS KMS, CloudHSM, on-premises HSM
• Encryption in transit
  • Require SSL; all internal communication over SSL/TLS
• Network isolation using Amazon VPC
• Fine grained permissions and auditing using AWS IAM and AWS CloudTrail
Compliance
ISO 9001
SOC 3
SOC 2
ISO 27001
ISO 27017
PCI DSS Level 1
ISO 27018
SOC 1 / ISAE 3402
GxP
HIPAA
ITAR
FERPA
FISMA, RMF, and DIACAP
FedRAMP
Section 508 / VPAT
DoD SRG Levels 2 & 4
FIPS 140-2
CJIS
Cloud Security Alliance
MPAA
NIST
MLPS Level 3
G-Cloud
IT-Grundschutz
MTCS Tier 3
IRAP
Cyber Essentials Plus
Disaster Recovery
• Amazon EMR & Amazon Redshift clusters are resilient; failed nodes and hardware are replaced automatically
• Data on S3 available in all Availability Zones in a Region
• S3 data can be synced across regions
• Amazon Redshift clusters are continuously backed up to S3 and snapshots can be synced to a second region
Customer Use Cases
NTT DOCOMO
[Diagram labels: Data Source; ET; Direct Connect; Client; Forwarder; Loader; State Management; Sandbox; Redshift; S3]
Petabytes of data generated on premises and brought to Redshift in the cloud for analysis
High-speed connectivity over a redundant pair of Direct Connect leased lines
Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
Nasdaq – Legacy Warehouse
Expensive ($1.16M annually)
Limited capacity (1 year of data online)
4-8 billion rows inserted per trading day, storing:
• Orders
• Trades
• Quotes
• Market Data
• Security Master
• Membership
DW can be used to analyze market share, client activity, surveillance, power our billing, and more…
Nasdaq Architecture
[Diagram labels: On premise; AWS Regional (Multi-AZ) Scope; AWS (US-East, primary AZ/VPC); RMS Input Sources (multiple systems); Data Ingest Process; MySQL; S3; Redshift Load files/Manifests; Redshift Snapshots/Backups; SNS; Data Loaded Topic; Redshift Database Cluster; HSM Key Appliance Cluster]
SmartNews
Useful Resources
• AWS Big Data Blog
• Re:Invent 2015 Big Data Sessions
• AWS Marketplace for Big Data Solutions
• Amazon Big Data Partners
Summary
• AWS enables you to build sophisticated big data applications
  • Retrospective, real-time, predictive
• You can build incrementally, adding use cases and increasing scale as you go
• AWS provides a broad range of security and auditing features to enable you to meet your security requirements
• AWS makes it easy to build hybrid applications that span across your datacenters and the AWS Cloud
Thank you!
@rahulpathak