Challenges for running Hadoop on AWS - AdvancedAWS Meetup

Challenges of running Hadoop on AWS

June 12, 2014 @ AdvancedAWS Meetup - Citizen Space

Andrei Savu - @andreisavu

Software Engineer, Cloud Automation Team

Overview

● Introduction

● Context

● Challenges

● Questions

Andrei Savu

Software Engineer

Cloud Automation Team @ Cloudera

Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)

http://www.linkedin.com/in/sandrei

Cloud Automation Team @ Cloudera

Focused on:

● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure

● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies etc.)

We are hiring!

https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW

Context

● Hadoop

● Types of Deployments

● Cluster Topology

● AWS

Context: Hadoop

Hadoop is a broad, coherent stack of products for data storage and processing.

“Hadoop” is more than HDFS & MapReduce: it spans multiple storage systems, different query engines, and both batch and real-time processing.

Usually runs on bare metal, but is now moving towards cloud infrastructure.

Types of Deployments

Long running:

- store data for analytics jobs with MapReduce, Impala, Spark

- online data serving with HBase

On-demand:

- analytical workloads, fetch data on-demand

- triggered by workflows (1:1)

- disconnected lifecycle (Netflix Genie)

Cluster Topology #1

Simple:

● EC2 classic (being phased out)

● VPC: single subnet, security group with an optional VPN

Complex:

● VPC: multiple subnets & security groups

● DirectConnect

● highly available with disaster recovery

● multiple users & security

Cluster Topology #2

● Cloudera Reference Architecture for AWS Deployments: http://www.cloudera.com/content/cloudera/en/resources/library/whitepaper/cloudera-enterprise-reference-architecture-for-aws-deployments.html

● Best Practices for Deploying Cloudera Enterprise on AWS: http://blog.cloudera.com/blog/2014/02/best-practices-for-deploying-cloudera-enterprise-on-amazon-web-services/

Amazon Web Services

Paradigm shift in how we work with infrastructure.

Key concept: software defined - controlled by APIs

Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs etc.)

Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.

Challenges

● Instance Provisioning & Health

● Ensuring Idempotency

● Networking & Performance

● AMIs & Bootstrap Speed

● Data durability

● S3 integration

What makes it more difficult?

… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages

● statefulness (think databases)

● each cluster has multiple processes playing different roles

● topology & configuration changes require orchestration

● knowledge of service inter-dependencies is required

Instance Provisioning & Health

Questions:

● How do you define your cluster size to deal with a lack of capacity?

● How do you define health? Is that stable during setup?

● Is health a binary property? Or a threshold that needs to be continuously evaluated?

Potential answers:

● match AWS semantics: define size as a range

● make simplifying assumptions (e.g. healthy during setup)
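A minimal sketch of "size as a range", assuming the range mirrors the minimum and maximum instance counts that an EC2 launch request accepts. The `SizeRange` type, the `evaluate` function and the three status labels are hypothetical names chosen for illustration; they are not from the talk or from any Cloudera tool. Health is treated as a threshold that can be re-evaluated at any time rather than as a one-off binary check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeRange:
    """Acceptable cluster size, expressed as a range instead of a single number."""
    min_count: int
    max_count: int

    def __post_init__(self):
        if not 0 < self.min_count <= self.max_count:
            raise ValueError("expected 0 < min_count <= max_count")


def evaluate(size: SizeRange, healthy: int) -> str:
    """Classify a cluster by how many healthy instances it currently has."""
    if healthy < size.min_count:
        return "failed"      # below the minimum: not enough capacity to proceed
    if healthy < size.max_count:
        return "degraded"    # usable, but could be topped up later
    return "ok"              # the full requested size is available
```

Calling `evaluate` periodically with the current healthy count turns the range into a continuously evaluated threshold: a cluster of 4 healthy instances against `SizeRange(3, 5)` is still usable, only degraded.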

Ensuring Idempotency

Questions:

● How do you safely retry expensive calls?

● How do you build reliable workflows?

Potential answers:

● use a client token to make instance launches idempotent (see the AWS User Guide link below)

● Discuss: convergence vs. single-step retries

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Run_Instance_Idempotency.html
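A rough sketch of the client-token idea, under stated assumptions: `FakeProvisioner` is a hypothetical in-memory stand-in for a launch API that honours client tokens, and `client_token` and `launch_with_retry` are illustrative helper names, not part of any AWS SDK. The point is only that a retry which reuses the same token gets the original result back instead of launching more instances.

```python
import hashlib


def client_token(workflow_id: str, step: str) -> str:
    """Derive a stable token so every retry of the same step sends the same value."""
    digest = hashlib.sha256(f"{workflow_id}/{step}".encode("utf-8")).hexdigest()
    return digest[:64]  # EC2 client tokens are limited to 64 ASCII characters


class FakeProvisioner:
    """In-memory stand-in for a launch API that honours client tokens (hypothetical)."""

    def __init__(self):
        self._by_token = {}
        self.launches = 0

    def run_instances(self, count: int, token: str) -> list:
        if token in self._by_token:
            return self._by_token[token]  # replay: return the original result
        self.launches += 1
        ids = [f"i-{self.launches:04d}{n:04d}" for n in range(count)]
        self._by_token[token] = ids
        return ids


def launch_with_retry(provisioner, count: int, token: str, attempts: int = 3) -> list:
    """Retry a launch; reusing the token keeps retries from creating extra instances."""
    last_error = None
    for _ in range(attempts):
        try:
            return provisioner.run_instances(count, token)
        except ConnectionError as err:  # e.g. a timeout where the outcome is unknown
            last_error = err
    raise last_error
```

Deriving the token from the workflow and step identifiers, rather than generating a random one per call, is a design choice: it lets a restarted workflow retry the same step safely even if it lost its in-memory state.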

Networking & Performance

Questions:

● What’s the ideal setup that’s both usable and secure?

● How do you get consistent intra-cluster performance?

Potential answers:

● VPC with VPN or DirectConnect. Placement groups help.

● Security model: initially it was just perimeter security, now it can do a lot more (disk encryption, SSL, Kerberos)

Images & Bootstrap Speed

Questions:

● Do you allow custom AMIs or force your own choices?

● If using custom AMIs how can you reduce bootstrap time?

Potential answers:

● Custom AMIs are common - integrated with existing infra

● Fast bootstrap: bake the required bits into a new image on top of the custom AMI

Data durability

Questions:

● How do you place replicas? Datacenter topology?

● How are instances distributed in different failure domains?

Potential answers:

● ignore or go with large instances that map 1:1 to hosts

● would be nice to have: a way to influence host to instance allocation or to get datacenter topology data
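To make the replica placement question concrete, here is a small sketch of what topology data would allow if it were available. It assumes a hypothetical mapping from instance ID to a failure-domain label; as the slide notes, this host-level data is not exposed, so the mapping and the `place_replicas` name are illustrative only.

```python
from collections import defaultdict
from itertools import zip_longest


def place_replicas(domain_of: dict, replicas: int) -> list:
    """Pick `replicas` instances, spreading them over as many failure domains as possible."""
    if replicas > len(domain_of):
        raise ValueError("not enough instances for the requested replica count")
    by_domain = defaultdict(list)
    for instance, domain in sorted(domain_of.items()):
        by_domain[domain].append(instance)
    # Round-robin across domains: take one instance from each domain per pass.
    ordered = [
        instance
        for group in zip_longest(*(by_domain[d] for d in sorted(by_domain)))
        for instance in group
        if instance is not None
    ]
    return ordered[:replicas]
```

Without such a mapping, two replicas can land on instances that share a physical host, which is why large instances that map 1:1 to hosts are a practical workaround.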

S3 integration

Questions:

● How do you reconcile differences in semantics with HDFS? (strong vs. eventual consistency)

● How do you get most out of it in terms of performance?

Potential answers:

● we’ve done a fair amount of work improving S3 support in open source (features, stability improvements, security etc.)

● performance is network bound
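A back-of-the-envelope sketch of what "network bound" implies, assuming every node reads from S3 in parallel at its full network bandwidth and nothing else limits throughput. The function name and the figures in the usage note are hypothetical, chosen only to show the arithmetic.

```python
def transfer_seconds(dataset_bytes: float, nodes: int, node_gbit_per_s: float) -> float:
    """Lower bound on the time to pull a dataset when the network is the only limit."""
    if nodes <= 0 or node_gbit_per_s <= 0:
        raise ValueError("nodes and bandwidth must be positive")
    # Aggregate bandwidth in bytes per second (1 Gbit/s = 1e9 bits/s, 8 bits per byte).
    bytes_per_second = nodes * node_gbit_per_s * 1e9 / 8
    return dataset_bytes / bytes_per_second
```

For example, `transfer_seconds(1e12, 10, 1.0)` models 1 TB fetched by 10 nodes at 1 Gbit/s each: the aggregate rate is 1.25 GB/s, so the fetch takes at least 800 seconds. Under this model, adding nodes or choosing instances with more bandwidth is the main lever.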

Thanks! Questions?

Andrei Savu - [email protected]

Twitter: @andreisavu

Join us to take Hadoop to the clouds!

https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW
