Big Data in the Cloud with Informatica Cloud and Amazon Redshift

description

Data warehousing costs have been continually rising with the explosion of Big Data. To help you explore the most cost-effective data warehousing techniques, learn from the cloud experts from Amazon and Informatica. Learn more: http://www.informaticacloud.com/amazon-redshift

Amazon Redshift is a petabyte-scale cloud-based data warehouse that allows you to provision multiple database nodes on demand and offload raw data from on-premise databases for more cost-effective data warehousing. Getting this data into Redshift is easy with Informatica Cloud. In this interactive webinar, you’ll learn:

• How Amazon Redshift is changing the economics of data warehousing

• Why Big Data integration and management is a strategic imperative within enterprises

• How cloud integration makes cloud data warehousing even more cost effective

At Informatica, our goal is to unlock your information potential. Join us with featured guest speakers from Amazon for this interactive webinar.

Transcript of Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Page 1: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Cloud and Amazon Redshift

Rahul Pathak, Amazon Redshift Product Management

Nicolas Brisoux, Informatica Cloud Platform Adoption

Darren Cunningham, Informatica Cloud Marketing

@infacloud #redshift

Page 2: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Today’s Agenda

• Informatica and Amazon Strategic Partnership

• Amazon Redshift Overview

• Informatica Cloud Redshift Connector

• Demonstration

• Discussion

• Next Steps

Page 3: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Informatica: The Information Management Leader

B2B Data Exchange

Informatica supports the requirements of cross-organizational data exchange, so users apply familiar & trusted data integration tools and techniques to the growing practice of B2B data integration.

Cloud Data Integration

Enterprise Data Integration

Complex Event Processing

Informatica received high praise for its services from customers. For deployments involving systems monitoring use cases, Informatica offers a five-day stand-up of RulePoint.

Ultra Messaging

In spite of the new entrants, Informatica remains the market leader in this highly demanding part of the messaging market.

Data Quality

Master Data Management

Application ILM

Page 4: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Informatica Cloud: our fastest growing product line

Today’s Focus: Cloud Data Integration

Page 5: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Informatica Cloud and Amazon Redshift: Enabling cost-effective data warehousing

• Redshift Connector pre-release announced in February

• General availability this month (August)

InformaticaCloud.com/Amazon-Redshift

Page 6: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Rahul Pathak | [email protected] | @rahulpathak

Senior Product Manager

Amazon Redshift

Page 7: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

AWS Database Services

Amazon RDS

Fully managed SQL database service for OLTP workloads

Amazon DynamoDB

Fully managed NoSQL service for massively scalable, high throughput, low latency workloads

Amazon Redshift

Fully managed fast and powerful, petabyte-scale data warehouse service

Amazon ElastiCache

Fully managed Memcached-compliant in memory caching service

Page 8: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

We set out to build…

A fast and powerful, petabyte-scale data warehouse that is:

A Lot Faster

A Lot Cheaper

A Lot Simpler

Amazon Redshift

Page 9: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Data warehousing done the AWS way

• Pay as you go, no up front costs

• Fast, cheap, easy to use

• SQL

• Easy to provision

Page 10: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Common Customer Use Cases

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business; provision in minutes

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW & SW costs by an order of magnitude

Traditional Enterprise DW | Companies with Big Data | SaaS Companies

Page 11: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Progress Since Launch on Feb 14, 2013

• Fastest growing service in AWS history

• Well over 1,000 customers; adding over 100 per week

• Obtained SOC1 & SOC2 certification with more in progress

• Deployed in US East (N. Virginia), US West (Oregon), EU (Ireland) and Asia Pacific (Tokyo)

• Additional global regions coming soon

Page 12: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift Customers

• 5x – 20x reduction in query times; 4x cost reduction over HIVE

• 20x – 40x reduction in query times

• Nokia: 50% reduction in costs, 2x improvement in query times

Page 13: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift Customer: bit.ly

“When we want to answer a question with Redshift, we just write a SQL query and get an answer within a few minutes – if not seconds.”

- Sean O’Connor, Engineer at bit.ly

Bit.ly provides social link sharing analytics, managing over 300 million shortens and 5 billion clicks each month

Page 14: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift Customer: HasOffers

“Amazon Redshift introduces a major opportunity to improve the performance of our real-time reporting, allowing us to run queries up to 50 times faster than our current OLAP solution.”

- Niek Sanders, VP of Engineering, HasOffers

HasOffers records and reports billions of desktop and mobile interactions for performance marketers

Page 15: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift Customer: Infor

“This is the formula for fast and broad adoption, where customers can get consistent, accurate, and useful data fast - in weeks not months or years.”

- Ali Shadman, SVP, Business Cloud & Upgrades, Infor

Infor is the world’s third largest ERP vendor, serving over 70,000 customers in 194 countries

Page 16: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

• With row storage you do unnecessary I/O

• To get total amount, you have to read everything

Page 17: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• With column storage, you only read the data you need

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
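The difference between the two layouts can be sketched with the four sample rows from the slide. This is a toy model in Python, not Redshift's storage engine; the function names and the "values read" counter are illustrative assumptions.

```python
# Toy model of row vs. column layout using the sample table from the slide.
# Illustrative only; this is not how Redshift stores data internally.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]


def total_amount_row_store(table_rows):
    """Sum Amount from a row layout; every field of every row is touched."""
    values_read = 0
    total = 0
    for row in table_rows:
        values_read += len(row)  # the whole row is read to reach Amount
        total += row[3]
    return total, values_read


# The same data laid out one column at a time.
columns = {
    "ID": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "State": [r[2] for r in rows],
    "Amount": [r[3] for r in rows],
}


def total_amount_column_store(table_columns):
    """Sum Amount from a column layout; only the Amount column is touched."""
    amounts = table_columns["Amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = total_amount_row_store(rows)
col_total, col_reads = total_amount_column_store(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the column layout touches only the four Amount values instead of all sixteen fields.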

Page 18: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Columnar compression saves space & reduces I/O

• Amazon Redshift analyzes and compresses your data

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

Page 19: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O
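The zone map idea can be sketched in a few lines of Python. The block contents, function names, and equality-only lookup are assumptions made for illustration; the slides do not describe Redshift's implementation.

```python
# Sketch of a zone map: keep (min, max) per block and skip blocks whose
# range cannot contain the value being searched for.


def build_zone_map(blocks):
    """Record the minimum and maximum value of each block."""
    return [(min(block), max(block)) for block in blocks]


def scan_equal(blocks, zone_map, target):
    """Return matching values and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # the block cannot contain the target, so skip the read
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read


blocks = [[1, 5, 9], [10, 12, 19], [20, 25, 30]]
zone_map = build_zone_map(blocks)
print(scan_equal(blocks, zone_map, 12))
```

In this example only one of the three blocks is read. Skipping works best when the data is stored in an order that keeps each block's range narrow.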

Page 20: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you

Page 21: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift architecture

• Leader Node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute Nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3

– Parallel load from Amazon DynamoDB

• Single node version available

[Diagram labels: SQL Clients/BI Tools; JDBC/ODBC; Leader Node; three Compute Nodes; node specs shown: 128 GB RAM, 16 TB disk, 16 cores; 10 GigE (HPC); Ingestion, Backup, Restore; Amazon S3]

Page 22: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift runs on optimized hardware

HS1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed customer storage

• Optimized for I/O intensive workloads

• High disk density

• Runs in HPC - fast network

• HS1.8XL available on Amazon EC2

Page 23: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift lets you start small and grow big

Extra Large Node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE

Cluster 2-100 Nodes (32 TB – 1.6 PB)

[Figure: a single XL node, a 32-node XL cluster, and a 100-node 8XL cluster]

Note: Nodes not to scale

Page 24: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go

                   | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB
-------------------+---------------------------------------+-------------------------------+-------------------------------
On-Demand          | $0.850                                | $0.425                        | $3,723
1 Year Reservation | $0.500                                | $0.250                        | $2,190
3 Year Reservation | $0.228                                | $0.114                        | $999
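The effective per-TB figures follow from the node price and the 2 TB of storage on an HS1.XL node. The Python sketch below reproduces the arithmetic, assuming a 365-day year (8,760 hours); the function name is illustrative.

```python
# Reproduce the effective per-TB prices from the HS1.XL hourly node price.
# Assumptions: 2 TB per HS1.XL node (from the slides), 365-day year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours
NODE_STORAGE_TB = 2


def effective_prices(node_price_per_hour):
    """Return (hourly price per TB, annual price per TB) for one node."""
    hourly_per_tb = node_price_per_hour / NODE_STORAGE_TB
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


for label, price in [
    ("On-Demand", 0.850),
    ("1 Year Reservation", 0.500),
    ("3 Year Reservation", 0.228),
]:
    hourly, annual = effective_prices(price)
    print(f"{label}: ${hourly:.3f}/TB/hour, ${annual:,.0f}/TB/year")
```

For example, the on-demand price of $0.850 per hour over 2 TB gives $0.425 per TB per hour, and $0.425 x 8,760 hours gives about $3,723 per TB per year.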

Page 25: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Slides not intended for redistribution.

Page 26: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated

– All blocks on disks and in Amazon S3 encrypted

• No direct access to compute nodes

• Amazon VPC support

[Diagram labels: SQL Clients/BI Tools; JDBC/ODBC; Customer VPC; Internal Security Group; Leader Node; three Compute Nodes; node specs shown: 128 GB RAM, 16 TB disk, 16 cores; 10 GigE (HPC); Ingestion, Backup, Restore; Amazon S3 / Amazon DynamoDB]

Page 27: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

Page 28: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift works with your existing analysis tools

More coming soon…

JDBC/ODBC

Amazon Redshift

Connect using drivers from PostgreSQL.org

Page 29: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Amazon Redshift integrates with multiple data sources

Amazon Elastic MapReduce

Amazon DynamoDB

Amazon Elastic Compute Cloud (EC2)

AWS Storage Gateway Service

Amazon Simple Storage Service (S3)

Corporate Data Center

Amazon Relational Database Service (RDS)

Amazon Redshift

Page 30: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Today’s Agenda

• Informatica and Amazon Strategic Partnership

• Amazon Redshift Overview

• Informatica Cloud Redshift Connector

• Demonstration

• Discussion

• Next Steps

Page 31: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Informatica Cloud Architecture Overview

[Diagram labels: Your Company; Secure Agent; Marketplace; Amazon Redshift; numbered callouts 1-4]

Page 32: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Map Once. Deploy Anywhere.

ON PREMISE | HADOOP | 3rd PARTY APPLICATIONS | CLOUD

Page 33: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Cloud Amazon Redshift Connector Demo

Nicolas Brisoux, Cloud Platform Adoption

Page 34: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Best practices to remember…

• The Amazon S3 bucket that holds the data files must be created in the same region as your cluster

• Files are deleted from the Amazon S3 bucket when the upload is complete

• Choose a batch size where the number of batches matches the number of slices in your cluster

• Each XL node has 2 slices, each 8XL node has 16

• If you have a 2 node XL cluster and 40,000 rows of data, choose a batch size of 10,000

• The Informatica Cloud Redshift connector can maximize Amazon’s parallel processing capabilities this way
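The batch size rule above amounts to dividing the row count by the number of slices in the cluster. The Python sketch below works through it using the slice counts from the slide; the function name is hypothetical, and rounding up for row counts that do not divide evenly is an assumption.

```python
import math

# Slices per node, as stated on the slide.
SLICES_PER_NODE = {"XL": 2, "8XL": 16}


def suggested_batch_size(node_type, node_count, row_count):
    """Pick a batch size so the number of batches matches the slice count."""
    slices = SLICES_PER_NODE[node_type] * node_count
    # Round up so that no more than `slices` batches are produced.
    return math.ceil(row_count / slices)


# The slide's example: a 2 node XL cluster has 4 slices.
print(suggested_batch_size("XL", 2, 40_000))
```

With 2 XL nodes there are 4 slices, so 40,000 rows split into 4 batches of 10,000 rows each, one batch per slice.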

Page 35: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Informatica Cloud Amazon Redshift demonstration

Diagram labels: Firewall, Informatica Cloud Secure Agent, Metadata Mappings

1 Authenticate and retrieve Data Synchronization Task

2 Retrieve Account Data

3 Perform lookup on SLA level

4 Put Account Data & SLA Level into Flat File

5 Transfer compressed Flat File

6 Initiate load from Amazon S3

7 Load data into Amazon Redshift
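Steps 6 and 7 correspond to a load from Amazon S3 into Redshift. The slides do not show what the connector issues internally, so the Python sketch below only illustrates the kind of Redshift COPY statement that loads gzip-compressed, comma-delimited files from S3. The function name, table, bucket, prefix, and credential values are all hypothetical placeholders.

```python
# Illustrative only: build a Redshift COPY statement for gzip-compressed,
# comma-delimited files staged in Amazon S3. All names are placeholders,
# and this is not necessarily what the Informatica connector runs.


def build_copy_statement(table, bucket, key_prefix, access_key_id, secret_access_key):
    """Return a COPY statement that loads files under an S3 prefix."""
    credentials = (
        f"aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_access_key}"
    )
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key_prefix}' "
        f"CREDENTIALS '{credentials}' "
        "DELIMITER ',' GZIP;"
    )


statement = build_copy_statement(
    table="account_sla",                 # hypothetical target table
    bucket="example-bucket",             # hypothetical staging bucket
    key_prefix="staging/account_sla",    # hypothetical file prefix
    access_key_id="<access-key-id>",
    secret_access_key="<secret-access-key>",
)
print(statement)
```

COPY loads every file that matches the prefix, which is why splitting the data into one file per slice lets the compute nodes load in parallel. The string formatting here is for illustration; real code should not interpolate untrusted input into SQL.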

Page 36: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

PowerCenter Mappings and Informatica Cloud

• If you want to reuse your existing PowerCenter mappings with Informatica Cloud and Redshift you have 2 options:

1 Use the PowerCenter Repository Manager to export your existing workflows and import them into Informatica Cloud using the PowerCenter Tasks feature

Or…

2 Keep your existing mappings in PowerCenter and stage the data

• Create a DSS task in Informatica Cloud to move the data to Redshift from the staging area

• This task can be managed from PowerCenter

Page 37: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Why Informatica Cloud Integration for Redshift?

1 Map Once, Deploy Anywhere

2 Rapid Connectivity & Deployment

3 Advanced Integration Delivered Easily

4 Excellence in batch and real-time integration

InformaticaCloud.com

Page 38: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Next Steps

• Get started with Amazon Redshift

• Get started with Informatica Cloud

• InformaticaCloud.com

• Learn more about our Redshift Connector

• InformaticaCloud.com/Amazon-Redshift

Page 39: Big Data in the Cloud with Informatica Cloud and Amazon Redshift

Discussion

Rahul Pathak, Amazon Redshift Product Management

Nicolas Brisoux, Informatica Cloud Platform Adoption

Darren Cunningham, Informatica Cloud Marketing

@infacloud #redshift

InformaticaCloud.com

Page 40: Big Data in the Cloud with Informatica Cloud and Amazon Redshift