Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Challenges of running Hadoop on AWS
June 12, 2014 @ AdvancedAWS Meetup - Citizen Space
Andrei Savu - @andreisavu
Software Engineer, Cloud Automation Team

description

We now have all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, which puts more pressure on the tools we use to configure and monitor our applications. This challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look into some of the challenges of creating and managing Hadoop clusters in AWS. We will discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention and noisy neighbors), and touch on the future and how we should go about disconnecting workload dispatch from cluster lifecycle.

Transcript of Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Page 1: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Challenges of running Hadoop on AWS

June 12, 2014 @ AdvancedAWS Meetup - Citizen Space

Andrei Savu - @andreisavu

Software Engineer, Cloud Automation Team

Page 2: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Overview

● Introduction
● Context
● Challenges
● Questions

Page 3: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Andrei Savu

Software Engineer

Cloud Automation Team @ Cloudera

Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)

Page 4: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Cloud Automation Team @ Cloudera

Focused on:

● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure

● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies etc.)

We are hiring!

Page 5: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Context

● Hadoop
● Types of Deployments
● Cluster Topology
● AWS

Page 6: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Context: Hadoop

Hadoop is a broad, coherent stack of products for data storage and processing.

“Hadoop” is more than HDFS & MapReduce. It spans multiple storage systems, different query engines, and both batch and real-time processing.

Usually runs on bare metal, but is now moving towards cloud infrastructure.

Page 7: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Types of Deployments

Long running:

- store data for analytics jobs with MapReduce, Impala, Spark

- online data serving with HBase

On-demand:

- analytical workloads, fetch data on-demand

- triggered by workflows (1:1)

- disconnected lifecycle (Netflix Genie)

Page 8: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Cluster Topology #1

Simple:

● EC2 classic (being phased-out)
● VPC: single subnet, security group with an optional VPN

Complex:

● VPC: multiple subnets & security groups
● DirectConnect
● highly available with disaster recovery
● multiple users & security

Page 10: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Amazon Web Services

Paradigm shift in how we work with infrastructure.

Key concept: software defined - controlled by APIs

Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs etc.)

Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.

Page 11: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Challenges

● Instance Provisioning & Health
● Ensuring Idempotency
● Networking & Performance
● AMIs & Bootstrap Speed
● Data durability
● S3 integration

Page 12: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

What makes it more difficult?

… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages

● statefulness (think databases)
● each cluster has multiple processes playing different roles
● topology & configuration changes require orchestration
● knowledge of service inter-dependencies is required

Page 13: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Instance Provisioning & Health

Questions:

● How do you define your cluster size to deal with a lack of capacity?
● How do you define health? Is that stable during setup?
● Is health a binary property? Or a threshold that needs to be continuously evaluated?

Potential answers:

● match AWS semantics: define size as a range
● make simplifying assumptions (e.g. healthy during setup)
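
A minimal sketch of "size as a range", assuming the cluster definition carries a minimum and maximum node count in the same spirit as the MinCount / MaxCount parameters of the EC2 RunInstances call. The names SizeRange, evaluate_launch and is_healthy are hypothetical and not from the talk; the point is only that partial capacity becomes an expected outcome and that health can be evaluated as a threshold rather than a binary flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeRange:
    """Acceptable cluster size (hypothetical type), expressed as a range."""
    min_count: int
    max_count: int

    def __post_init__(self):
        if not 0 < self.min_count <= self.max_count:
            raise ValueError("expected 0 < min_count <= max_count")


def evaluate_launch(size: SizeRange, launched: int) -> str:
    """Decide what to do with however many instances were actually delivered."""
    if launched < size.min_count:
        return "abort"             # below the minimum: release and fail
    if launched < size.max_count:
        return "proceed-degraded"  # usable, but smaller than hoped for
    return "proceed"


def is_healthy(healthy_nodes: int, size: SizeRange) -> bool:
    """Health as a threshold: enough nodes must be up, not all of them."""
    return healthy_nodes >= size.min_count
```

With a range of 8 to 10 nodes, a launch that yields 9 instances proceeds in degraded mode instead of failing, and the same threshold can be re-evaluated for as long as the cluster is running.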

Page 14: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Ensuring Idempotency

Questions:

● How do you safely retry expensive calls?
● How do you build reliable workflows?

Potential answers:

● follow the AWS User Guide: make calls idempotent via a client token
● Discuss: Convergence vs. Single step retries
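
A rough illustration of the client-token idea, assuming an API that deduplicates requests carrying the same token (the EC2 RunInstances call accepts a ClientToken for this purpose). FakeProvider and launch_with_retry are hypothetical stand-ins so the sketch runs without AWS credentials; they are not a real SDK.

```python
import uuid


class FakeProvider:
    """Hypothetical in-memory API that deduplicates launches by client token."""

    def __init__(self, fail_first_n=0):
        self._by_token = {}
        self._failures_left = fail_first_n
        self.launch_count = 0  # number of launches that really happened

    def run_instances(self, count, client_token):
        if client_token in self._by_token:
            # Same token seen before: return the earlier result, launch nothing.
            return self._by_token[client_token]
        self.launch_count += 1
        ids = [f"i-{self.launch_count:04d}-{n}" for n in range(count)]
        self._by_token[client_token] = ids
        if self._failures_left > 0:
            # Simulate a lost response: the launch happened, the caller never saw it.
            self._failures_left -= 1
            raise TimeoutError("response lost")
        return ids


def launch_with_retry(provider, count, attempts=3):
    """Retry an expensive call safely by reusing one token per logical request."""
    token = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            return provider.run_instances(count, client_token=token)
        except TimeoutError as err:
            last_error = err
    raise last_error
```

Because the token is generated once per logical request and reused on every attempt, a retry after a lost response returns the instances that were already launched instead of paying for a second set.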

Page 15: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Networking & Performance

Questions:

● What’s the ideal setup that’s both usable and secure?
● How do you get consistent intra-cluster performance?

Potential answers:

● VPC with VPN or DirectConnect. Placement groups help.
● Security model: initially it was just perimeter security, now it can do a lot more (disk encryption, SSL, Kerberos)

Page 16: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Images & Bootstrap Speed

Questions:

● Do you allow custom AMIs or force your own choices?
● If using custom AMIs how can you reduce bootstrap time?

Potential answers:

● Custom AMIs are common - integrated with existing infra
● Fast bootstrap by baking on top with custom bits

Page 17: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Data durability

Questions:

● How do you place replicas? Datacenter topology?
● How are instances distributed in different failure domains?

Potential answers:

● ignore or go with large instances that map 1:1 to hosts
● would be nice to have: a way to influence host to instance allocation or to get datacenter topology data
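
For context on how topology data would be consumed if it were available: Hadoop can call an external script (configured through net.topology.script.file.name) that receives node addresses as arguments and prints one rack path per address. The sketch below is an assumption-laden illustration, not something from the talk. It treats each VPC subnet as a "rack", which only captures availability-zone level failure domains, and the subnet table and rack names are made up.

```python
import ipaddress
import sys

# Hypothetical subnet-to-rack table; real values depend on your VPC layout.
SUBNET_TO_RACK = {
    ipaddress.ip_network("10.0.1.0/24"): "/us-east-1a",
    ipaddress.ip_network("10.0.2.0/24"): "/us-east-1b",
}
DEFAULT_RACK = "/default-rack"


def rack_for(address: str) -> str:
    """Map a node address to a rack path, falling back to a default rack."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return DEFAULT_RACK  # hostnames or malformed input
    for network, rack in SUBNET_TO_RACK.items():
        if ip in network:
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop passes one or more addresses and expects one rack per address.
    print(" ".join(rack_for(arg) for arg in sys.argv[1:]))
```

This does not solve the problem raised on the slide: two instances in the same subnet may still share a physical host, which is exactly the information that is not exposed.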

Page 18: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

S3 integration

Questions:

● How do you reconcile differences in semantics with HDFS? (strong vs. eventual consistency)

● How do you get most out of it in terms of performance?

Potential answers:

● we’ve done a fair amount of work improving S3 support in open source (features, stability improvements, security etc.)

● performance is network bound

Page 19: Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Thanks! Questions?

Andrei Savu - [email protected]

Twitter: @andreisavu

Join us to take Hadoop to the clouds!

https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
