Docker based Hadoop provisioning - Hadoop Summit 2014

Janos Matyas / CTO / SequenceIQ Inc.

description

Docker based Hadoop provisioning in the cloud and on-premise/physical hardware

Transcript of Docker based Hadoop provisioning - Hadoop Summit 2014

Page 1: Docker based Hadoop provisioning - Hadoop Summit 2014

Janos Matyas / CTO / SequenceIQ Inc.

Page 2: Docker based Hadoop provisioning - Hadoop Summit 2014

GOAL / MOTIVATION

TECHNOLOGY STACK

PROBLEM RESOLUTION / HOW IT WORKS

RESULTS / ACHIEVEMENTS

OVERVIEW

Page 3: Docker based Hadoop provisioning - Hadoop Summit 2014

GOAL / MOTIVATION

Ease Hadoop provisioning – everywhere

Automate and unify the process

Arbitrary cluster size

Same process through a cluster lifecycle (Dev, QA, UAT, Prod)

(Auto) scaling Hadoop

QoS

Page 4: Docker based Hadoop provisioning - Hadoop Summit 2014

OUR APPROACH

Use Docker

Build cloud-specific ‘Dockerized’ images

Provision the cluster

Use Ambari

Page 5: Docker based Hadoop provisioning - Hadoop Summit 2014

DOCKER

Lightweight, portable

Build once, run anywhere

VM – without the overhead of a VM

Isolated containers

Automated and scripted

Page 6: Docker based Hadoop provisioning - Hadoop Summit 2014

DOCKER – CONTAINERS vs. VMs

Containers are isolated, but share OS and, where appropriate, bins/libraries

Page 7: Docker based Hadoop provisioning - Hadoop Summit 2014

APACHE AMBARI – ARCHITECTURE

Easy Hadoop cluster provisioning

Management and monitoring

Key features – blueprints

REST API

Page 8: Docker based Hadoop provisioning - Hadoop Summit 2014

APACHE AMBARI – CREATE CLUSTER

Define a blueprint (POST /api/v1/blueprints)

Create cluster (POST /api/v1/clusters/mycluster)
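
The two calls above can be sketched in Python as follows. This is a minimal illustration, not SequenceIQ's code: the server address, blueprint name, host group name and the default stack version are placeholders, and it assumes the blueprint name is appended to the blueprint path and that Ambari expects an X-Requested-By header. Authentication is left out, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

AMBARI = "http://ambari.example.com:8080"  # hypothetical server address


def _post(path, body):
    """Build (but do not send) a POST request carrying a JSON body."""
    return Request(
        AMBARI + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Requested-By": "ambari"},
        method="POST",
    )


def blueprint_request(name, host_groups, stack_name="HDP", stack_version="2.1"):
    """host_groups maps a group name to (cardinality, [component names])."""
    body = {
        "Blueprints": {"stack_name": stack_name, "stack_version": stack_version},
        "host_groups": [
            {
                "name": group,
                "cardinality": str(count),
                "components": [{"name": component} for component in components],
            }
            for group, (count, components) in host_groups.items()
        ],
    }
    return _post(f"/api/v1/blueprints/{name}", body)


def cluster_request(cluster, blueprint, hosts_by_group):
    """Map concrete hosts (by FQDN) onto the host groups of a blueprint."""
    body = {
        "blueprint": blueprint,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": fqdn} for fqdn in hosts]}
            for group, hosts in hosts_by_group.items()
        ],
    }
    return _post(f"/api/v1/clusters/{cluster}", body)


if __name__ == "__main__":
    bp = blueprint_request("single-node", {"host_group_1": (1, ["NAMENODE", "DATANODE"])})
    cl = cluster_request("mycluster", "single-node", {"host_group_1": ["node1.example.com"]})
    print(bp.get_method(), bp.full_url)
    print(cl.get_method(), cl.full_url)
```

The blueprint describes the cluster layout once; the cluster request only maps hosts onto it, so the same blueprint can be reused for clusters of different sizes.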

Page 9: Docker based Hadoop provisioning - Hadoop Summit 2014

HADOOP PROVISIONING ISSUES

Each cloud provider has a proprietary API

Create images for each provider

Network configuration

Service discovery

Resize, failover, member join support

Page 10: Docker based Hadoop provisioning - Hadoop Summit 2014

OUR APPROACH – DETAILS

Build your Docker image

Install or pre-install Hadoop services with Ambari

Install Serf and dnsmasq

Build your cloud image

Use Ansible to create an image

Provision the cluster

Page 11: Docker based Hadoop provisioning - Hadoop Summit 2014

BUILD DOCKER IMAGES

Create the Dockerfile

Have Docker.io build the image

Optionally pre-install services

Use Ambari

Push image to Docker.io

Licensing questions

Page 12: Docker based Hadoop provisioning - Hadoop Summit 2014

BUILD CLOUD IMAGES

Use a Docker ready base image

Use Ansible to provision the image template

Pull the Docker images

Apply custom infrastructure

Use cloud provider specific playbooks

AWS EC2

Azure

Page 13: Docker based Hadoop provisioning - Hadoop Summit 2014

ANSIBLE

Configuration as data

Simplest way to automate IT

Secure and agentless

Goal oriented

One playbook – multiple modules

We use it to “burn” cloud images/templates

Page 14: Docker based Hadoop provisioning - Hadoop Summit 2014

PROVISIONING – ISSUES

FQDN

/etc/hosts is read-only in Docker

Everybody needs to know everybody

DNS

Single point of failure

Dynamic cluster – nodes joining, leaving, failing

Routing

Cloud – ability to route between containers on different hosts

Collision free private IP range for Docker bridge

We need predefined host names/IP addresses

Start a DNS server

Use it as a reference: docker run --dns <IP_OF_DNS>

Nodes need to know each other
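
A container start with an explicit host name and DNS server might be assembled like this. It is only a sketch: the image name, domain and addresses are made-up placeholders, and pointing --dns at 127.0.0.1 assumes a DNS server such as dnsmasq runs inside the container itself.

```python
def docker_run_args(image, hostname, dns_ip, domain="example.internal"):
    """Return the argument list for starting one detached container.

    -h sets the container host name, --dns sets the resolver it uses.
    The domain and image name are illustrative, not from the talk.
    """
    fqdn = f"{hostname}.{domain}"
    return ["docker", "run", "-d", "-h", fqdn, "--dns", dns_ip, image]


if __name__ == "__main__":
    args = docker_run_args("example/ambari", "amb0", "127.0.0.1")
    print(" ".join(args))
```

Passing the host name at start time avoids having to edit /etc/hosts inside the container.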

Page 15: Docker based Hadoop provisioning - Hadoop Summit 2014

PROVISIONING – SOLUTION

FQDN

Use the -h and --dns Docker params

DNS

dnsmasq runs in each Docker container

Serf member-xxx events trigger dnsmasq reconfiguration

Routing

Docker bridge configuration – follows a convention
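
The slides do not spell out the bridge convention. One possible convention, shown purely as an assumption, is to carve a private address pool into one non-overlapping subnet per host and give each Docker bridge the first address of its subnet. The pool and prefix length below are illustrative values.

```python
import ipaddress


def plan_bridge_subnets(pool, host_count, prefix=24):
    """Assign each host its own subnet from the pool, in order.

    Returns one entry per host with the subnet and the bridge address
    (first usable address of the subnet, in CIDR notation).
    """
    subnets = ipaddress.ip_network(pool).subnets(new_prefix=prefix)
    plan = []
    for _ in range(host_count):
        subnet = next(subnets, None)
        if subnet is None:
            raise ValueError(f"{pool} cannot hold {host_count} /{prefix} subnets")
        bridge_ip = next(subnet.hosts())
        plan.append({"subnet": str(subnet), "bridge": f"{bridge_ip}/{prefix}"})
    return plan


if __name__ == "__main__":
    for entry in plan_bridge_subnets("172.17.0.0/16", 3):
        print(entry)
```

Because every host derives its range from its index in the same pool, container addresses never collide across hosts and routes between hosts can be computed rather than discovered.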

Page 16: Docker based Hadoop provisioning - Hadoop Summit 2014

SERF

Gossip based membership

Service discovery

Decentralized

Lightweight, fault tolerant

Highly available

DevOps friendly

Keep an eye on Consul, Open vSwitch, pipework

Page 17: Docker based Hadoop provisioning - Hadoop Summit 2014

SERF – DECENTRALIZED SERVICE DISCOVERY

Gossip instead of heartbeat

LAN, WAN profiles

Provides membership information

Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user

Query
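
A membership event handler that keeps a DNS host list up to date could look roughly like the sketch below. It assumes Serf passes the event name in the SERF_EVENT environment variable and writes one member per line (name, address, role, tags, separated by whitespace) to the handler's standard input. Dropping failed members from DNS is a design choice made for this example, and writing the rendered file and reloading dnsmasq are left out.

```python
def apply_serf_event(hosts, event, payload):
    """Return a new name -> address map after applying one Serf event."""
    updated = dict(hosts)
    for line in payload.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, address = fields[0], fields[1]
        if event in ("member-join", "member-update"):
            updated[name] = address
        elif event in ("member-leave", "member-failed", "member-reap"):
            updated.pop(name, None)
    return updated


def render_hosts(hosts):
    """Render the map in /etc/hosts format, sorted by host name."""
    return "".join(f"{address} {name}\n" for name, address in sorted(hosts.items()))


if __name__ == "__main__":
    import os
    import sys

    # Starts from an empty map for brevity; a real handler would load
    # the current host list first.
    current = apply_serf_event({}, os.environ.get("SERF_EVENT", ""), sys.stdin.read())
    sys.stdout.write(render_hosts(current))
```

Keeping the update logic a pure function makes it easy to test without a running Serf agent.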

Page 18: Docker based Hadoop provisioning - Hadoop Summit 2014

SERF – GOSSIPING

Page 19: Docker based Hadoop provisioning - Hadoop Summit 2014

SERF – MEMBERSHIP, EVENT HANDLERS

Page 20: Docker based Hadoop provisioning - Hadoop Summit 2014

DNSMASQ

Network infrastructure for small networks

Lightweight DNS, DHCP server

Comes with most Linux distributions

Page 21: Docker based Hadoop provisioning - Hadoop Summit 2014

AWS EC2 – HADOOP CLUSTER

Use EC2 REST API to provision instances (from Dockerized image)

Start Docker containers

One Ambari server

N-1 Ambari agents connecting to server

Connect ambari-shell to

Define blueprint

Provision the cluster
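
The one-server, N-1 agents layout can be expressed as a small planning step. The sketch below assumes the first provisioned instance hosts the Ambari server; the slides do not say how the server node is chosen, and the role names and addresses are illustrative.

```python
def plan_containers(instance_ips):
    """Give the first instance the server role and all others the agent role."""
    if not instance_ips:
        raise ValueError("at least one instance is required")
    server_ip, *agent_ips = instance_ips
    plan = [{"host": server_ip, "role": "ambari-server"}]
    # Every agent records which server it should connect to.
    plan += [
        {"host": ip, "role": "ambari-agent", "server": server_ip}
        for ip in agent_ips
    ]
    return plan


if __name__ == "__main__":
    for entry in plan_containers(["10.0.0.10", "10.0.0.11", "10.0.0.12"]):
        print(entry)
```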

Page 22: Docker based Hadoop provisioning - Hadoop Summit 2014

AWS EC2 – NETWORK SECURITY

Create a VPC

Configure subnets

Routing tables

Security gateway

Set ACL

Configure VPN

Page 23: Docker based Hadoop provisioning - Hadoop Summit 2014

AWS EC2 - CLOUDFORMATION

Setting up a VPC manually is too complicated

Use CloudFormation

Manage the stack together

Template-based

Environments under version control

Customizable at runtime

No extra charge

    "VpcId" : {
      "Type" : "String",
      "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
    },
    "SubnetId" : {
      "Type" : "String",
      "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
    },
    "SecondaryIPAddressCount" : {
      "Type" : "Number",
      "Default" : "1",
      "MinValue" : "1",
      "MaxValue" : "5",
      "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
      "ConstraintDescription": "must be a number from 1 to 5."
    },
    "SSHLocation" : {
      "Description" : "The IP address range that can be used to SSH to the EC2 instances",
      "Type": "String",
      "MinLength": "9",
      "MaxLength": "18",
      "Default": "0.0.0.0/0",
      "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
      "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
    }
  },
  "Mappings" : {
    "RegionMap" : {
      "us-east-1" : { "AMI" : "ami-7f418316" },

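The parameter constraints in the template fragment on this slide can be checked locally before a stack is created. The sketch below mirrors the SSHLocation and SecondaryIPAddressCount rules in Python; it assumes the allowed pattern has to match the whole value and is not a substitute for CloudFormation's own validation.

```python
import re

# Same pattern as the template's AllowedPattern, after JSON unescaping.
SSH_LOCATION_PATTERN = r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"


def validate_parameters(params):
    """Return a list of constraint violations (empty when all checks pass)."""
    errors = []

    ssh = params.get("SSHLocation", "0.0.0.0/0")
    length_ok = 9 <= len(ssh) <= 18
    if not length_ok or re.fullmatch(SSH_LOCATION_PATTERN, ssh) is None:
        errors.append("SSHLocation must be a valid IP CIDR range of the form x.x.x.x/x.")

    try:
        count = int(params.get("SecondaryIPAddressCount", "1"))
    except ValueError:
        count = None
    if count is None or not 1 <= count <= 5:
        errors.append("SecondaryIPAddressCount must be a number from 1 to 5.")

    return errors


if __name__ == "__main__":
    print(validate_parameters({"SSHLocation": "10.0.0.0/16", "SecondaryIPAddressCount": "2"}))
    print(validate_parameters({"SSHLocation": "anywhere", "SecondaryIPAddressCount": "6"}))
```
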
Page 24: Docker based Hadoop provisioning - Hadoop Summit 2014

CLOUDBREAK

Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.

Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts the provisioning and eases the management and monitoring of on-demand clusters.

Provisioning Hadoop has never been easier

Page 25: Docker based Hadoop provisioning - Hadoop Summit 2014

CLOUDBREAK

Benefits

Elastic

Scalable

Blueprints

Flexible

Main REST resources

/template – specifies a cluster infrastructure

/stack – creates a cloud infrastructure built from a template

/blueprint – describes a Hadoop cluster

/cluster – creates a Hadoop cluster

Page 26: Docker based Hadoop provisioning - Hadoop Summit 2014

RESULTS AND ACHIEVEMENTS

Hadoop as a Service API

Available for EC2 and Azure cloud

OpenStack and bare metal are coming soon

Open source under Apache 2 licence

Same goals as Apache Ambari Launchpad project

What's next?

Page 27: Docker based Hadoop provisioning - Hadoop Summit 2014

HADOOP SERVICES - AS A SERVICE

Leverage YARN

Slider (Hoya) providers

HBase, Accumulo

SequenceIQ providers - Flume, Tomcat

YARN-1964

QoS for YARN – heuristic scheduler

Platform as a Service API

Page 28: Docker based Hadoop provisioning - Hadoop Summit 2014

BANZAI PIPELINE

Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.

Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.

Banzai Pipeline is a big data API for the REST

Page 29: Docker based Hadoop provisioning - Hadoop Summit 2014

THANK YOU

Get the code: https://github.com/sequenceiq

Read about: http://blog.sequenceiq.com

Facebook: http://facebook.com/sequenceiq

Twitter: http://twitter.com/sequenceiq

LinkedIn: http://linkedin.com/sequenceiq

Contact: [email protected]

FEEL FREE TO CONTRIBUTE