
Data Lake Foundation on the AWS Cloud

with Apache Zeppelin, Amazon RDS, and other AWS Services

Quick Start Reference Deployment

August 2017

Last update: March 2018 (revisions)

Cloudwick Technologies

AWS Quick Start Reference Team

Contents

Overview
  Costs and Licenses
Architecture
  AWS Components
  Data Visualization Components
  Design
Prerequisites
  Specialized Knowledge
Deployment Options
Deployment Steps
  Step 1. Prepare Your AWS Account
  Step 2. Launch the Quick Start
  Step 3. Test the Deployment
  Step 4. Explore the Data Lake Portal
Deleting the Stack
Troubleshooting
Additional Resources
Send Us Feedback
Document Revisions

This Quick Start deployment guide was created by Amazon Web Services (AWS) in

partnership with Cloudwick Technologies Inc., an AWS Advanced Consulting Partner

specializing in big data.

Quick Starts are automated reference deployments that use AWS CloudFormation

templates to launch, configure, and run the AWS compute, network, storage, and other

services required to deploy a specific workload on AWS.

Overview

This Quick Start reference deployment guide provides step-by-step instructions for

deploying a data lake foundation on the Amazon Web Services (AWS) Cloud.

A data lake is a repository that holds a large amount of raw data in its native (structured or

unstructured) format until the data is needed. Storing data in its native format enables you

to accommodate any future schema requirements or design changes.

This Quick Start deploys a data lake foundation that integrates various AWS Cloud

components to help you migrate your structured and unstructured data from your on-

premises environment to the AWS Cloud, and store, monitor, and analyze the data. The

deployment uses Amazon Simple Storage Service (Amazon S3) as a core service to store the

data. It also includes other AWS services such as Amazon Relational Database Service

(Amazon RDS), AWS Data Pipeline, Amazon Redshift, AWS CloudTrail, and Amazon

Elasticsearch Service (Amazon ES). The Quick Start deploys Apache Zeppelin and Kibana

for analyzing and visualizing the data stored in Amazon S3.

The Quick Start also deploys a data lake portal, where you can upload files to, and

download files from, the data lake repository in Amazon S3, monitor real-time streaming

data using Amazon Kinesis Firehose, analyze and explore the data you’ve uploaded in


Kibana, and check your cloud resources. You can follow the instructions in this guide to

upload your data into an Amazon RDS table and try out some of this functionality.

This Quick Start supports multiple user scenarios, including:

• Ingestion, storage, and analytics of original data sets, whether they are structured or unstructured

• Integration and analysis of data originating from disparate sources

• Reduction in analytics costs as the data captured grows exponentially

• Ability to leverage multiple analytic engines and processing frameworks by using the same data stored in Amazon S3

Costs and Licenses

You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters

that you can customize. Some of these settings, such as instance type, will affect the cost of

deployment. For cost estimates, see the pricing pages for each AWS service you will be

using. Prices are subject to change.

This Quick Start also deploys the Kibana and Apache Zeppelin open-source software, which

are both free of charge.

Architecture

AWS Components

The core AWS components used by this Quick Start include the following AWS services.

Infrastructure:

• Amazon EC2 – The Amazon Elastic Compute Cloud (Amazon EC2) service enables you to launch virtual machine instances with a variety of operating systems. You can choose from existing Amazon Machine Images (AMIs) or import your own virtual machine images.

• AWS Lambda – Lambda lets you run code without provisioning or managing servers. Your Lambda code can be triggered in response to events.

• Amazon VPC – The Amazon Virtual Private Cloud (Amazon VPC) service lets you provision a private, isolated section of the AWS Cloud where you can launch AWS services and other resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, subnet creation, and configuration of route tables and network gateways.

• IAM – AWS Identity and Access Management (IAM) enables you to securely control access to AWS services and resources for your users. With IAM, you can manage users, security credentials such as access keys, and permissions that control which AWS resources users can access, from a central location.

• AWS CloudTrail – CloudTrail enables governance, compliance, operational auditing, and risk auditing of your AWS account. With CloudTrail, you can log, continuously monitor, and retain events related to API calls across your AWS infrastructure.

Storage:

• Amazon S3 – Amazon Simple Storage Service (Amazon S3) provides a secure and scalable repository for your data, and is closely integrated with other AWS services for post-processing and analytics. This Quick Start uses Amazon S3 to store data in its original format.

Database:

• Amazon RDS – Amazon RDS helps set up, operate, and scale MySQL deployments in the cloud. This Quick Start deploys Amazon RDS to demonstrate how AWS Data Pipeline can be used to migrate data from your relational database to AWS Cloud services such as Amazon S3 and Amazon Redshift.

• Amazon Redshift – Amazon Redshift helps you analyze all your data using standard SQL and your existing business intelligence (BI) tools. This Quick Start uses Amazon Redshift as the data warehouse for the data that’s migrated from an on-premises relational database.

Analytics:

• Amazon ES – Amazon Elasticsearch Service (Amazon ES) helps deploy, operate, and scale Elasticsearch for log analytics, full text search, and application and metadata monitoring.

• Amazon Kinesis Firehose – Kinesis Firehose is part of the Kinesis streaming data platform. It delivers real-time streaming data to Amazon ES, and this Quick Start displays the streaming data captured by Kinesis in Kibana.
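If you want to experiment with a delivery stream directly, records reach Kinesis Firehose through the PutRecord API. The following is a minimal Python (boto3) sketch; the stream name, region, and record fields are illustrative placeholders, not resources created by this Quick Start:

```python
# Minimal sketch: send one JSON record to a Kinesis Firehose delivery stream.
# The stream name below is a placeholder; use the stream created in your account.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-west-2")

record = {"event": "file_uploaded", "key": "data/sample.csv", "size_bytes": 1024}

firehose.put_record(
    DeliveryStreamName="datalake-metadata-stream",  # hypothetical name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```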

Data Visualization Components

• Kibana plugin for Amazon ES – Kibana is a web interface for Elasticsearch and provides visualization capabilities for content indexed on an Elasticsearch cluster.

• Apache Zeppelin – Zeppelin is an open-source tool for data ingestion, analysis, and visualization based on the Apache Spark processing engine.

Design

Deploying this Quick Start for a new virtual private cloud (VPC) with default parameters

builds the following data lake environment in the AWS Cloud.

Figure 1: Quick Start data lake foundation architecture on AWS

The Quick Start sets up the following:

• A virtual private cloud (VPC) that spans two Availability Zones and includes two public and two private subnets.*

• An Internet gateway to allow access to the Internet.*

• In the public subnets, managed NAT gateways to allow outbound Internet access for resources in the private subnets.

• In the public subnets, optional Linux bastion hosts in an Auto Scaling group to allow inbound Secure Shell (SSH) access to EC2 instances in public and private subnets. The bastion host instances are omitted by default.

• In a private subnet, a web server instance (Amazon Machine Image, or AMI) in an Auto Scaling group to host the data lake portal. This web server also installs Zeppelin to run analytics on the data loaded into Amazon S3.

• IAM roles to provide permissions to access AWS resources; for example, to access data in Amazon S3, to enable Amazon Redshift to copy data from Amazon S3 into its tables, and to associate the generated IAM role with the Amazon Redshift cluster.

• In the private subnets, Amazon RDS to enable migrating data from a relational database to Amazon Redshift using AWS Data Pipeline.

• Integration with Amazon S3, AWS Lambda, Amazon ES with Kibana, Amazon Kinesis Firehose, and CloudTrail for data storage and analysis.

• Your choice to create a new VPC or deploy the data lake components into your existing VPC on AWS. The template that deploys the Quick Start into an existing VPC skips the components marked by asterisks above.

Here’s how these components work together, with Amazon S3 at the center of the architecture:

• AWS Data Pipeline migrates your RDBMS data from Amazon RDS to Amazon Redshift. After you deploy the Quick Start, you can follow the instructions in this guide to upload your data into an Amazon RDS table to explore this functionality.

• Zeppelin analyzes and visualizes the data being migrated.

• Amazon S3 stores the structured or unstructured data files and associated log files for the data lake.

• Lambda functions capture the metadata associated with the uploaded files and push the metadata to Amazon ES (see the sketch after this list).

• Kinesis Firehose captures streams of metadata associated with the files being uploaded to Amazon S3.

• Kibana fetches and displays the statistics from Amazon ES, and also displays graphics based on the API calls made to the data lake.
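To illustrate the Lambda step, the sketch below shows the general shape of an S3-triggered handler that extracts object metadata from the event and indexes it into an Elasticsearch endpoint. It is not the function shipped with the Quick Start: the domain endpoint and index name are placeholders, and a production function would sign its requests (for example, with SigV4) according to your Amazon ES access policy.

```python
# Sketch of an S3-triggered Lambda handler that pushes object metadata to
# Amazon ES. Endpoint and index are placeholders; request signing is omitted.
import json
import urllib.request

ES_ENDPOINT = "https://search-example-domain.us-west-2.es.amazonaws.com"  # placeholder
INDEX_URL = ES_ENDPOINT + "/datalake-metadata/_doc"  # hypothetical index

def handler(event, context):
    for record in event["Records"]:  # standard S3 event notification structure
        s3 = record["s3"]
        doc = {
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "size": s3["object"].get("size"),
            "event_time": record["eventTime"],
        }
        req = urllib.request.Request(
            INDEX_URL,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)  # assumes the ES access policy permits this caller
```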


Prerequisites

Specialized Knowledge

Before you deploy this Quick Start, we recommend that you become familiar with the AWS

services listed in the previous section by following the provided links. (If you are new to

AWS, see Getting Started with AWS.)

Deployment Options

This Quick Start provides two deployment options:

• Deploy the Quick Start into a new VPC (end-to-end deployment). This option builds a new AWS environment consisting of the VPC, subnets, NAT gateways, bastion hosts, security groups, and other infrastructure components, and then deploys the data lake services and components into this new VPC.

• Deploy the Quick Start into an existing VPC. This option deploys the data lake services and components in your existing AWS infrastructure.

The Quick Start provides separate templates for these options. It also lets you configure

CIDR blocks, instance types, and data lake settings, as discussed later in this guide.

Deployment Steps

Step 1. Prepare Your AWS Account

1. If you don’t already have an AWS account, create one at https://aws.amazon.com by

following the on-screen instructions.

2. Use the region selector in the navigation bar to choose the AWS Region where you want

to deploy the data lake components on AWS.

Important This Quick Start uses Amazon Kinesis Firehose, which is supported

only in the regions listed on the AWS Regions and Endpoints webpage.

3. Create a key pair in your preferred region. (A scripted alternative is sketched after this list.)

4. If necessary, request a service limit increase for the Amazon EC2 M1 instance type. You

might need to do this if you already have an existing deployment that uses this instance

type, and you think you might exceed the default limit with this reference deployment.
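If you prefer to script step 3, you can create the key pair with the AWS SDK. This boto3 sketch creates a key pair and saves the private key locally; the key name and region are examples, not values required by the Quick Start:

```python
# Sketch: create an EC2 key pair in your chosen region and save the private key.
# Key name and region are examples; pick values that match your deployment.
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.create_key_pair(KeyName="datalake-quickstart")
pem_path = "datalake-quickstart.pem"

with open(pem_path, "w") as f:
    f.write(resp["KeyMaterial"])  # private key is returned only at creation time

os.chmod(pem_path, 0o400)  # restrict permissions so SSH will accept the key
```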


Step 2. Launch the Quick Start

Note You are responsible for the cost of the AWS services used while running this

Quick Start reference deployment. There is no additional cost for using this Quick

Start. For full details, see the pricing pages for each AWS service you will be using in

this Quick Start. Prices are subject to change.

1. Choose one of the following options to launch the AWS CloudFormation template into

your AWS account. For help choosing an option, see deployment options earlier in this

guide.

Option 1: Deploy the Quick Start into a new VPC on AWS

Option 2: Deploy the Quick Start into an existing VPC on AWS

Important If you’re deploying the Quick Start into an existing VPC, make sure

that your VPC has two private subnets in different Availability Zones for the database

instances. These subnets require NAT gateways or NAT instances in their route

tables, to allow the instances to download packages and software without exposing

them to the Internet. You’ll also need the domain name option configured in the

DHCP options as explained in the Amazon VPC documentation. You’ll be prompted

for your VPC settings when you launch the Quick Start.

Each deployment takes about 20 minutes to complete.

2. Check the region that’s displayed in the upper-right corner of the navigation bar, and

change it if necessary. This is where the network infrastructure for the data lake will be

built. The template is launched in the US West (Oregon) Region by default.

Important This Quick Start uses Amazon Kinesis Firehose, which is supported

only in the regions listed on the AWS Regions and Endpoints webpage.

3. On the Select Template page, keep the default setting for the template URL, and then

choose Next.

4. On the Specify Details page, change the stack name if needed. Review the parameters

for the template. Provide values for the parameters that require input. For all other


parameters, review the default settings and customize them as necessary. When you

finish reviewing and customizing the parameters, choose Next.

In the following tables, parameters are listed by category and described separately for

the two deployment options:

– Parameters for deploying the Quick Start into a new VPC

– Parameters for deploying the Quick Start into an existing VPC

Option 1: Parameters for deploying the Quick Start into a new VPC

View template

Network Configuration:

• Availability Zones (AvailabilityZones) – Default: Requires input
  The list of Availability Zones to use for resource distribution in the VPC. This field displays the available zones within your selected region. You can choose 2, 3, or 4 Availability Zones from this list. The logical order of your selections is preserved in your deployment. After you make your selections, make sure that the value of the Number of Availability Zones parameter matches the number of selections.

• Number of Availability Zones (NoOfAzs) – Default: 2
  The number of Availability Zones to use in the VPC. This count must match the number of selections in the Availability Zones parameter; otherwise, your deployment will fail with an AWS CloudFormation template validation error. (Note that some regions provide only 2 or 3 Availability Zones.)

• VPC CIDR (VPCCIDR) – Default: 10.0.0.0/16
  CIDR block for the VPC.

• Private Subnet 1 CIDR (PrivateSubnet1CIDR) – Default: 10.0.0.0/19
  CIDR block for the private subnet located in Availability Zone 1.

• Private Subnet 2 CIDR (PrivateSubnet2CIDR) – Default: 10.0.32.0/19
  CIDR block for the private subnet located in Availability Zone 2.

• Public Subnet 1 CIDR (PublicSubnet1CIDR) – Default: 10.0.128.0/20
  CIDR block for the public (DMZ) subnet located in Availability Zone 1.

• Public Subnet 2 CIDR (PublicSubnet2CIDR) – Default: 10.0.144.0/20
  CIDR block for the public (DMZ) subnet located in Availability Zone 2.

• Permitted IP range (AccessCIDR) – Default: Requires input
  The CIDR IP range that is permitted to access the data lake web server instances. We recommend that you set this value to a trusted IP range. For example, you might want to grant only your corporate network access to the software.

• Add Bastion Host (AddBastion) – Default: No
  Set this parameter to Yes if you want to include Linux bastion host instances in an Auto Scaling group in the VPC.


Amazon RDS Configuration:

• RDS Instance Type (RDSInstanceType) – Default: db.t2.small
  EC2 instance type for the RDS DB instances.

• RDS Allocated Storage (RDSAllocatedStorage) – Default: 5
  Size of the RDS database, in the range 5-1024 GiB.

• RDS Database Name (RDSDatabaseName) – Default: awsdatalakeqs
  The name of the RDS database. This is a 4-20 character string consisting of letters and numbers. The database name must start with a letter and contain no special characters.

• RDS User Name (RDSUserName) – Default: admin
  The user name associated with the administrator account for the RDS database instance. This is a 4-20 character string consisting of letters and numbers. The user name must start with a letter and contain no special characters.

• RDS Password (RDSPassword) – Default: Requires input
  The password associated with the administrator account for the RDS database instance. This string must be a minimum of 8 characters, consisting of letters, numbers, and symbols. It must contain at least one uppercase letter, one lowercase letter, and one number. You can use any printable ASCII characters except for single quotation marks ('), double quotation marks ("), backslashes (\), forward slashes (/), at signs (@), or spaces.

Elasticsearch Configuration:

• Elasticsearch Instance Type (ElasticsearchInstanceType) – Default: t2.medium.elasticsearch
  EC2 instance type for the Elasticsearch instances.

• Elasticsearch Instance Count (ElasticsearchInstanceCount) – Default: 1
  The number of Elasticsearch instances to provision. For guidance, see the Amazon ES documentation.

• Elasticsearch Instance Volume Size (ElasticsearchVolumeSize) – Default: 20
  Volume size of the Elasticsearch instances, in GiB.

• Elasticsearch Instance Volume Type (ElasticsearchVolumeType) – Default: gp2
  Volume type of the Elasticsearch instances: gp2 (General Purpose SSD), standard (Magnetic), or io1 (Provisioned IOPS SSD).


Amazon Redshift Configuration:

• Redshift Cluster Type (RedshiftClusterType) – Default: single-node
  Cluster type for the Amazon Redshift instances. Options are single-node and multi-node.

• Redshift Node Type (RedshiftNodeType) – Default: dc1.large
  Instance type for the nodes in the Amazon Redshift cluster.

• Number of Amazon Redshift Nodes (NumberOfNodes) – Default: 1
  The number of nodes in the Amazon Redshift cluster. If the Redshift Cluster Type parameter is set to single-node, this parameter value should be 1.

Amazon EC2 Configuration:

• Keypair Name (KeyPairName) – Default: Requires input
  Public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair you created in your preferred region in step 1.

• NAT Instance Type (NATInstanceType) – Default: t2.micro
  EC2 instance type for NAT instances. This parameter is used only if your selected AWS Region doesn’t support NAT gateways.

• Data Lake Portal Instance Type (PortalInstanceType) – Default: m1.medium
  EC2 instance type for the data lake web portal.

Data Lake Administrator Configuration:

• Administrator Name (AdministratorName) – Default: AdminName
  User name for data lake portal access.

• Administrator Email (AdministratorEmail) – Default: Requires input
  Email address to which information for accessing the data lake portal will be sent after deployment is complete. (See step 3 for details.)

AWS Quick Start Configuration:

• Quick Start S3 Bucket Name (QSS3BucketName) – Default: aws-quickstart
  The S3 bucket you have created for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. The bucket name can include numbers, lowercase letters, uppercase letters, and hyphens, but should not start or end with a hyphen.

• Quick Start S3 Key Prefix (QSS3KeyPrefix) – Default: quickstart-datalake-cloudwick/
  The S3 key name prefix used to simulate a folder for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. This prefix can include numbers, lowercase letters, uppercase letters, hyphens, and forward slashes.

Option 2: Parameters for deploying the Quick Start into an existing VPC

View template

Network Configuration:

• VPC ID (VPCID) – Default: Requires input
  ID of your existing VPC (e.g., vpc-0343606e).

• VPC CIDR (VPCCIDR) – Default: Requires input
  CIDR block for the VPC.

• Private Subnet 1 ID (PrivateSubnet1ID) – Default: Requires input
  ID of the private subnet in Availability Zone 1 in your existing VPC (e.g., subnet-a0246dcd).

• Private Subnet 2 ID (PrivateSubnet2ID) – Default: Requires input
  ID of the private subnet in Availability Zone 2 in your existing VPC (e.g., subnet-b1f432cd).

• Public Subnet 1 ID (PublicSubnet1ID) – Default: Requires input
  ID of the public subnet in Availability Zone 1 in your existing VPC (e.g., subnet-9bc642ac).

• Public Subnet 2 ID (PublicSubnet2ID) – Default: Requires input
  ID of the public subnet in Availability Zone 2 in your existing VPC (e.g., subnet-e3246d8e).

Amazon RDS Configuration:

• RDS Instance Type (RDSInstanceType) – Default: db.t2.small
  EC2 instance type for the RDS DB instances.

• RDS Allocated Storage (RDSAllocatedStorage) – Default: 5
  Size of the RDS database, in the range 5-1024 GiB.

• RDS Database Name (RDSDatabaseName) – Default: awsdatalakeqs
  The name of the RDS database. This is a 4-20 character string consisting of letters and numbers. The database name must start with a letter and contain no special characters.

• RDS User Name (RDSUserName) – Default: admin
  The user name associated with the administrator account for the RDS database instance. This is a 4-20 character string consisting of letters and numbers. The user name must start with a letter and contain no special characters.

• RDS Password (RDSPassword) – Default: Requires input
  The password associated with the administrator account for the RDS database instance. This string must be a minimum of 8 characters, consisting of letters, numbers, and symbols. It must contain at least one uppercase letter, one lowercase letter, and one number. You can use any printable ASCII characters except for single quotation marks ('), double quotation marks ("), backslashes (\), forward slashes (/), at signs (@), or spaces.

Elasticsearch Configuration:

• Elasticsearch Instance Type (ElasticsearchInstanceType) – Default: t2.medium.elasticsearch
  EC2 instance type for the Elasticsearch instances.

• Elasticsearch Instance Count (ElasticsearchInstanceCount) – Default: 1
  The number of Elasticsearch instances to provision. For guidance, see the Amazon ES documentation.

• Elasticsearch Instance Volume Size (ElasticsearchVolumeSize) – Default: 20
  Volume size of the Elasticsearch instances, in GiB.

• Elasticsearch Instance Volume Type (ElasticsearchVolumeType) – Default: gp2
  Volume type of the Elasticsearch instances: gp2 (General Purpose SSD), standard (Magnetic), or io1 (Provisioned IOPS SSD).

Amazon Redshift Configuration:

• Redshift Cluster Type (RedshiftClusterType) – Default: single-node
  Cluster type for the Amazon Redshift instances. Options are single-node and multi-node.

• Redshift Node Type (RedshiftNodeType) – Default: dc1.large
  Instance type for the nodes in the Amazon Redshift cluster.

• Number of Amazon Redshift Nodes (NumberOfNodes) – Default: 1
  The number of nodes in the Amazon Redshift cluster. If the Redshift Cluster Type parameter is set to single-node, this parameter value should be 1.

Amazon EC2 Configuration:

• Keypair Name (KeyPairName) – Default: Requires input
  Public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair you created in your preferred region in step 1.

• Data Lake Portal Instance Type (PortalInstanceType) – Default: m1.medium
  EC2 instance type for the data lake web portal.

• NAT Instance Type (NATInstanceType) – Default: t2.micro
  EC2 instance type for NAT instances. This parameter is used only if your selected AWS Region doesn’t support NAT gateways.

Data Lake Administrator Configuration:

• Administrator Name (AdministratorName) – Default: AdminName
  User name for data lake portal access.

• Administrator Email (AdministratorEmail) – Default: Requires input
  Email address to which information for accessing the data lake portal will be sent after deployment is complete. (See step 3 for details.)

AWS Quick Start Configuration:

• Quick Start S3 Bucket Name (QSS3BucketName) – Default: aws-quickstart
  The S3 bucket you have created for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. The bucket name can include numbers, lowercase letters, uppercase letters, and hyphens, but should not start or end with a hyphen.

• Quick Start S3 Key Prefix (QSS3KeyPrefix) – Default: quickstart-datalake-cloudwick/
  The S3 key name prefix used to simulate a folder for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. This prefix can include numbers, lowercase letters, uppercase letters, hyphens, and forward slashes.


5. On the Options page, you can specify tags (key-value pairs) for resources in your stack

and set advanced options. When you’re done, choose Next.

6. On the Review page, review and confirm the template settings. Under Capabilities,

select the check box to acknowledge that the template will create IAM resources.

7. Choose Create to deploy the stack.

8. Monitor the status of the stack. When the status is CREATE_COMPLETE, the data

lake cluster is ready.

9. Check the Events tab to monitor the status of the resources in the stack.
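If you script the launch instead of using the console, the acknowledgment from step 6 is expressed with the CAPABILITY_IAM flag. Here is a hedged boto3 sketch; the template URL is a placeholder, and only a few of the required parameters are shown:

```python
# Sketch: launch the Quick Start stack with boto3 instead of the console.
# The template URL and parameter values are placeholders; supply the real
# template location and every parameter marked "Requires input".
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

cfn.create_stack(
    StackName="datalake-foundation",
    TemplateURL="https://example-bucket.s3.amazonaws.com/templates/datalake.template",  # placeholder
    Parameters=[
        {"ParameterKey": "KeyPairName", "ParameterValue": "datalake-quickstart"},
        {"ParameterKey": "RDSPassword", "ParameterValue": "ChangeMe123!"},
        {"ParameterKey": "AdministratorEmail", "ParameterValue": "you@example.com"},
        # ...remaining parameters from the tables above...
    ],
    Capabilities=["CAPABILITY_IAM"],  # same acknowledgment as the console check box
)

# Block until the stack reaches CREATE_COMPLETE (or raise on failure).
cfn.get_waiter("stack_create_complete").wait(StackName="datalake-foundation")
```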

Step 3. Test the Deployment

1. When the Quick Start deployment has completed successfully, you’ll receive an email

with a URL, login ID, and password. Check your inbox for this information 15-20

minutes after deployment is complete.

2. Open the URL in your browser window and log in with the credentials you received to

access the data lake portal, as illustrated in Figure 2.

Figure 2: Login screen for portal

Step 4. Explore the Data Lake Portal

When you log in, you’ll see the data lake portal shown in Figure 3.


Figure 3: Data lake portal

From this portal page, you can manage data, check resources, and visualize data using the

Data Management, Resources, and Visualize options in the upper-right corner.

• Choose Data Management to manage data in Amazon S3 or Kinesis Firehose.

– Use the Amazon S3 option to upload files to, download files from, and delete files in

the data lake repository.

Figure 4: Data management in Amazon S3
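The portal’s upload, download, and delete actions correspond to standard Amazon S3 operations. A boto3 sketch of the same three actions follows; the bucket name is a placeholder for the bucket the stack generates (shown on the Resources page):

```python
# Sketch: upload, download, and delete a file in the data lake bucket.
# The bucket name is a placeholder; find the real one on the Resources page.
import boto3

s3 = boto3.client("s3")
bucket = "datalake-submissions-example"  # placeholder

s3.upload_file("local/sales.csv", bucket, "raw/sales.csv")   # upload
s3.download_file(bucket, "raw/sales.csv", "local/copy.csv")  # download
s3.delete_object(Bucket=bucket, Key="raw/sales.csv")         # delete
```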

– Use the Explore Catalogue option to monitor the metadata of the files in

Amazon S3.


Figure 5: Data management in Kinesis Firehose

• Choose Resources in the upper-right corner to check all the AWS resources used in the data lake and their endpoints.

a. In the RDS Details section, choose the link next to Instance Identifier to go to

the Amazon RDS page.


Figure 6: Reviewing AWS resources used in the data lake

To test migrating data from Amazon RDS to Amazon Redshift with AWS Data Pipeline, you’ll need to add some tables with data to Amazon RDS.


b. Choose SQL command in the left pane to create tables and insert data.

Figure 7: Adding data tables to Amazon RDS

Alternatively, you can use the Import option (next to the SQL command button)

to import your .sql files and execute them.

Figure 8: Importing .sql files
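If you would rather load the sample tables from a script than through the portal, any MySQL client pointed at the RDS endpoint will do, run from a host with network access to the private subnets (for example, through the bastion host). A sketch using the PyMySQL package (an assumption; it is not installed by the Quick Start), with placeholder endpoint and credentials:

```python
# Sketch: create a table and insert a row in the Quick Start's RDS MySQL
# database. Requires the PyMySQL package; endpoint/credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="datalake-rds.example.us-west-2.rds.amazonaws.com",  # RDS endpoint (placeholder)
    user="admin",              # RDSUserName parameter
    password="ChangeMe123!",   # RDSPassword parameter
    database="awsdatalakeqs",  # RDSDatabaseName parameter
)

with conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "  id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(100))"
    )
    cur.execute(
        "INSERT INTO customers (id, name, city) VALUES (%s, %s, %s)",
        (1, "Jane Doe", "Seattle"),
    )

conn.commit()
conn.close()
```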

c. Choose Resources again in the upper right and scroll down the page to the

Datapipeline Details section. Choose Run a datapipeline.

d. Fill out the form to migrate your data from Amazon RDS to Amazon Redshift.


Figure 9: Using AWS Data Pipeline to migrate data to Amazon Redshift
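The portal creates and runs the pipeline for you. For completeness, the equivalent AWS SDK calls for listing and activating an existing pipeline are sketched below; the pipeline ID and region are placeholders:

```python
# Sketch: list AWS Data Pipeline pipelines and activate one by ID.
# The pipeline itself is created by the data lake portal; the ID is a placeholder.
import boto3

dp = boto3.client("datapipeline", region_name="us-west-2")

for p in dp.list_pipelines()["pipelineIdList"]:
    print(p["id"], p["name"])

dp.activate_pipeline(pipelineId="df-0123456789EXAMPLE")  # placeholder ID
```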

e. When the data has been migrated, you can view it in Amazon Redshift by using the

Amazon Redshift endpoint link on the Resources screen.


Figure 10: Amazon Redshift endpoint in Resources
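You can also query the migrated tables directly: Amazon Redshift speaks the PostgreSQL wire protocol, so any PostgreSQL driver works. A sketch with psycopg2 (an assumption; install it separately), using placeholder connection details from the Resources page:

```python
# Sketch: query migrated data in Amazon Redshift with a PostgreSQL driver.
# Requires psycopg2; host/credentials are placeholders from the Resources page.
import psycopg2

conn = psycopg2.connect(
    host="datalake-redshift.example.us-west-2.redshift.amazonaws.com",  # placeholder
    port=5439,  # default Amazon Redshift port
    dbname="dev",
    user="admin",
    password="ChangeMe123!",
)

with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM customers;")  # table migrated by the pipeline
    print(cur.fetchone()[0])

conn.close()
```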

• Choose Visualize in the upper-right corner to visualize your data using Zeppelin or Kibana.

– Use Zeppelin to run Spark code on the data in Amazon S3. You can also fetch data

from Amazon Redshift by using the Interpreter option in Zeppelin.

Figure 11: Using Zeppelin from the data lake portal
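As an example of the kind of paragraph you might run, the following PySpark sketch reads a CSV file from the data lake bucket and aggregates it. It assumes a Zeppelin %pyspark paragraph on Spark 2.x (where the spark session is predefined) and that the instance role grants S3 access; the path and column name are placeholders:

```python
# Sketch for a Zeppelin %pyspark paragraph (Spark 2.x, `spark` predefined).
# Reads a CSV from the data lake bucket and shows a simple aggregation.
# The s3a:// path is a placeholder for a file you've uploaded.
df = spark.read.csv(
    "s3a://datalake-submissions-example/raw/sales.csv",
    header=True,
    inferSchema=True,
)

df.printSchema()
df.groupBy("city").count().show()  # assumes the CSV has a `city` column
```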


– Use Kibana to visualize real-time streaming data with histograms, line graphs, pie

charts, and heat maps on the type of API calls being made on the data lake.

Figure 12: Data streaming in Kibana

Deleting the Stack

When you have finished using the resources created by this Quick Start, you can delete the

stack. Deleting a stack, either by using the command line interface (CLI) or through the

AWS CloudFormation console, will remove all the resources created by the template for the

stack.

Note The data pipeline is created by the data lake portal and is invoked on demand. Pipelines are launched when needed and terminated when they finish.
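Scripted, the deletion is a single call plus a waiter. A boto3 sketch with a placeholder stack name:

```python
# Sketch: delete the Quick Start stack and wait for deletion to finish.
# The stack name is a placeholder; use the name you chose at launch.
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")
cfn.delete_stack(StackName="datalake-foundation")
cfn.get_waiter("stack_delete_complete").wait(StackName="datalake-foundation")
```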

Troubleshooting

Q. I encountered a CREATE_FAILED error when I launched the Quick Start.

A. If AWS CloudFormation fails to create the stack, we recommend that you relaunch the

template with Rollback on failure set to No. (This setting is under Advanced in the

AWS CloudFormation console, Options page.) With this setting, the stack’s state will be

retained and the instance will be left running, so you can troubleshoot the issue. (You’ll want to look at the log files on the instance; for the Linux instances in this deployment, these are /var/log/cloud-init.log and /var/log/cfn-init.log.)


Important When you set Rollback on failure to No, you’ll continue to

incur AWS charges for this stack. Please make sure to delete the stack when

you’ve finished troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS

website.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation

templates.

A. We recommend that you launch the Quick Start templates from the location we’ve

provided or from another S3 bucket. If you deploy the templates from a local copy on your

computer or from a non-S3 location, you might encounter template size limitations when

you create the stack. For more information about AWS CloudFormation limits, see the AWS

documentation.

Additional Resources

AWS services

• Amazon EC2: https://aws.amazon.com/documentation/ec2/

• AWS CloudFormation: https://aws.amazon.com/documentation/cloudformation/

• Amazon VPC: https://aws.amazon.com/documentation/vpc/

For a complete set of links to AWS services used in this Quick Start, see the AWS Components section.

Data lake visualization tools

• Kibana plug-in: https://aws.amazon.com/elasticsearch-service/kibana/

• Apache Zeppelin: http://zeppelin.apache.org/

Quick Start reference deployments

• AWS Quick Start home page: https://aws.amazon.com/quickstart/


Send Us Feedback

You can visit our GitHub repository to download the templates and scripts for this Quick Start, to post your comments, and to share your customizations with others.

Document Revisions

• March 2018 – Changed default setting for bastion host parameter; updated default Quick Start S3 key prefix. (Step 2, parameter tables)

• August 2017 – Initial publication.

© 2018, Amazon Web Services, Inc. or its affiliates, and Cloudwick Technologies, Inc. All

rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings

and practices as of the date of issue of this document, which are subject to change without notice. Customers

are responsible for making their own independent assessment of the information in this document and any

use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether

express or implied. This document does not create any warranties, representations, contractual

commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities

and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of,

nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, Version 2.0 (the "License"). You

may not use this file except in compliance with the License. A copy of the License is located at

http://aws.amazon.com/apache2.0/ or in the "license" file accompanying this file. This code is distributed on

an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.