Savanna: Hadoop on OpenStack

Savanna - Hadoop on OpenStack. Mirantis, 2013. Ilya Elterman, Dmitry Mescheryakov.


More details about the project and its current state can be found at http://savanna.readthedocs.org

Transcript of Savanna: Hadoop on OpenStack

Page 1: Savanna: Hadoop on OpenStack

Savanna - Hadoop on OpenStack

Mirantis, 2013
Ilya Elterman, Dmitry Mescheryakov

Page 2: Savanna: Hadoop on OpenStack

Agenda

● Savanna Overview
● Roadmap
● Phase 1 Live Demo
● Phase 2 Features and Architecture

Page 3: Savanna: Hadoop on OpenStack

Savanna - Elastic Hadoop on OpenStack

The goal is to create a native OpenStack component that provisions and operates Hadoop clusters on top of OpenStack. Key characteristics:

● Open source
● Native to OpenStack
● Support for different Hadoop distributions
● Covers both the bare cluster provisioning use case and "analytics as a service"

Page 4: Savanna: Hadoop on OpenStack

Savanna Architecture Principles

● Designed as an OpenStack component
● Managed through a REST API, with a UI available as part of Horizon
● Pluggable system of Hadoop installation engines
● Integration with vendor-specific Hadoop management tools
● Predefined templates of Hadoop configurations, with the ability to modify parameters
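The last principle, predefined templates with modifiable parameters, can be illustrated with a minimal sketch. The template layout and the `apply_overrides` helper are assumptions made for illustration and are not Savanna's actual data model; only the two Hadoop property names are real Hadoop 1.x settings.

```python
import copy

# Hypothetical predefined template; the structure is illustrative only.
BASE_TEMPLATE = {
    "name": "small-cluster",
    "hadoop_params": {
        "dfs.replication": 3,
        "mapred.child.java.opts": "-Xmx512m",
    },
}


def apply_overrides(template, overrides):
    """Return a copy of `template` with selected Hadoop parameters replaced.

    Only parameters already present in the template may be overridden,
    so a typo in a parameter name is reported instead of silently added.
    """
    result = copy.deepcopy(template)
    unknown = set(overrides) - set(result["hadoop_params"])
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    result["hadoop_params"].update(overrides)
    return result


# The predefined template stays untouched; the caller gets a modified copy.
custom = apply_overrides(BASE_TEMPLATE, {"dfs.replication": 2})
print(custom["hadoop_params"]["dfs.replication"])
```

Copying the template before applying overrides keeps the predefined configuration reusable across clusters.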

Page 5: Savanna: Hadoop on OpenStack

Use Cases

● Administrators: centralized cluster management and monitoring
● Dev and QA teams: fast cluster provisioning
● Data scientists/analysts: an API to run analytic jobs, with infrastructure provisioning happening under the hood
● Making resources dedicated to the IaaS cloud available for Hadoop workloads

Page 6: Savanna: Hadoop on OpenStack

Administrators Use Case

● Central point of control over infrastructure
● Enables self-service capabilities, including the choice of Hadoop distribution to be used
● Integration with vendor tooling
  ○ Ambari for Apache/Hortonworks
  ○ Cloudera Management Console
● Utilization of free IaaS capacity for Hadoop tasks

Page 7: Savanna: Hadoop on OpenStack

Dev and QA Use Cases

● Fast on-demand provisioning of environments
● Increased agility and speed of innovation
● Controlled access to production data

Page 8: Savanna: Hadoop on OpenStack

Analytics Use Cases

● Simplified task execution: the complexity of provisioning and managing a cluster is hidden under the hood
  ○ Access to higher-level interfaces (e.g. Pig, Hive)
● Bursty workloads: ad-hoc queries that require significant resources for only a short period
● Utilization of free IaaS capacity for Hadoop tasks

Page 9: Savanna: Hadoop on OpenStack

Agenda

● Savanna Overview
● Roadmap
● Phase 1 Live Demo
● Phase 2 Features and Architecture

Page 10: Savanna: Hadoop on OpenStack

Roadmap for Hadoop in Cloud

● Phase 1: Basic cluster provisioning
● Phase 2: Cluster operation support and integration with tooling
● Phase 3: "Analytics as a service": job execution framework, support for different scripting languages

Page 11: Savanna: Hadoop on OpenStack

Phase 1 - Basic Cluster Operation

● Cluster provisioning
● Deployment Engine implementation for pre-installed images
● Templates for Hadoop cluster configuration
● REST API for cluster startup and operations
● UI integrated into Horizon

Page 12: Savanna: Hadoop on OpenStack

Phase 1 - Current Status

● All code and documentation open sourced
● Phase 1 completed, v0.1 released on 04/10
● Launchpad home page
  ○ https://launchpad.net/savanna
● Code on StackForge
  ○ Integrated with OpenStack CI/CD
  ○ https://github.com/stackforge/savanna
● New contributors: Red Hat and Hortonworks

Page 13: Savanna: Hadoop on OpenStack

Phase 2 - Advanced Configuration

● Hadoop cluster configuration support:
  ○ Solutions for the HDFS data reliability issue
  ○ Configurable DataNode (DN) storage location
  ○ Configurable topology of DataNodes (DN), NameNode (NN), TaskTrackers (TT), and JobTracker (JT)
  ○ Add/remove nodes
  ○ More Hadoop parameters
● Integration with vendor deployment/management tooling
● Basic monitoring support

Page 14: Savanna: Hadoop on OpenStack

Phase 3 - Analytics as a Service

● API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR)

● User-friendly UI for ad-hoc analytics queries based on Hive or Pig

Page 15: Savanna: Hadoop on OpenStack

Further Roadmap

● Autoscaling
● HBase support
● HA for NameNode
● HDFS and Swift integration
  ○ Caching of Swift data on HDFS
● Mahout as a service
● Integration with logging and error handling

Page 16: Savanna: Hadoop on OpenStack

How to Contribute

● Download and install Savanna
● Provide feedback and report bugs
● Share more ideas via IRC sessions or the mailing list

More details: https://wiki.openstack.org/wiki/Savanna/HowToParticipate

Page 17: Savanna: Hadoop on OpenStack

Agenda

● Savanna Overview
● Roadmap
● Phase 1 Live Demo
● Phase 2 Features and Architecture

Page 19: Savanna: Hadoop on OpenStack

Architecture Overview

[Architecture diagram. Components shown: Savanna Python Client, Horizon (with Savanna Pages), REST API, Auth, Cluster Configuration Manager, DAL, Provisioning Plugin, VM Manager, Image Registry, the OpenStack services Keystone, Nova, Glance, and Swift, and four Hadoop VMs.]

Page 20: Savanna: Hadoop on OpenStack

Extensible Provisioning

Plugin:

● get extra configs
● validate input
● launch/terminate cluster
● add/remove nodes

Savanna:

● VM manager
  ○ launch/terminate VMs
  ○ get VM status
  ○ ssh/scp to VM
● Image registry
  ○ register image in Savanna
  ○ add/remove tags
  ○ get image by tag
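The split of responsibilities on this slide can be sketched as two small Python interfaces. Every class and method name below is hypothetical, derived only from the operation labels on the slide; this is not Savanna's actual plugin SPI.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin contract mirroring the operations on the slide."""

    @abstractmethod
    def get_extra_configs(self):
        """Return plugin-specific configuration parameters."""

    @abstractmethod
    def validate_input(self, cluster):
        """Raise an error if the cluster description is not acceptable."""

    @abstractmethod
    def launch_cluster(self, cluster):
        """Provision and start the cluster."""

    @abstractmethod
    def terminate_cluster(self, cluster):
        """Stop the cluster and release its VMs."""

    @abstractmethod
    def add_nodes(self, cluster, count):
        """Grow the cluster by `count` nodes."""

    @abstractmethod
    def remove_nodes(self, cluster, count):
        """Shrink the cluster by `count` nodes."""


class ImageRegistry:
    """Hypothetical in-memory image registry: image id -> set of tags."""

    def __init__(self):
        self._tags = {}

    def register_image(self, image_id):
        self._tags.setdefault(image_id, set())

    def add_tag(self, image_id, tag):
        self._tags[image_id].add(tag)

    def remove_tag(self, image_id, tag):
        self._tags[image_id].discard(tag)

    def get_image_by_tag(self, tag):
        # Return the first registered image (in id order) carrying the tag.
        for image_id in sorted(self._tags):
            if tag in self._tags[image_id]:
                return image_id
        return None


registry = ImageRegistry()
registry.register_image("image-1")
registry.add_tag("image-1", "hadoop")
print(registry.get_image_by_tag("hadoop"))
```

Keeping the plugin abstract while the registry and VM manager live on the Savanna side is what lets different Hadoop distributions supply their own provisioning logic.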

Page 21: Savanna: Hadoop on OpenStack

Provisioning Interaction

[Sequence diagram with three participants: User, Savanna, and Plugin. Messages shown: get extra parameters; get extra parameters for the plugin; validate cluster parameters; launch cluster; add/remove nodes.]

Page 22: Savanna: Hadoop on OpenStack

Provisioning: Launching a Cluster

[Diagram. Components: Plugin, Image Registry, VM Manager, and four Hadoop VMs. Labeled steps: get image by tag; launch VMs; install and configure Hadoop; pass commands via ssh, scp.]
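The launch steps named on this slide can be walked through with in-memory stand-ins. The classes, the tag value, and the `configure-hadoop` command are placeholders invented for this sketch; no real Nova, Glance, or SSH call is made, and the ordering is an assumed reading of the diagram.

```python
class FakeImageRegistry:
    """Stand-in for the Image Registry: maps a tag to an image id."""

    def __init__(self, images_by_tag):
        self._images = dict(images_by_tag)

    def get_image_by_tag(self, tag):
        return self._images[tag]


class FakeVMManager:
    """Stand-in for the VM Manager: records what would be executed."""

    def __init__(self):
        self.commands = []
        self._next_id = 0

    def launch_vms(self, image, count):
        vms = []
        for _ in range(count):
            self._next_id += 1
            vms.append("vm-%d" % self._next_id)
        return vms

    def run_ssh(self, vm, command):
        # A real implementation would open an SSH session here.
        self.commands.append((vm, command))


def launch_cluster(registry, vm_manager, tag, node_count):
    """Assumed order of steps: find image, launch VMs, configure Hadoop."""
    image = registry.get_image_by_tag(tag)
    vms = vm_manager.launch_vms(image, node_count)
    for vm in vms:
        vm_manager.run_ssh(vm, "configure-hadoop")  # placeholder command
    return vms


manager = FakeVMManager()
cluster_vms = launch_cluster(
    FakeImageRegistry({"hadoop": "image-1"}), manager, "hadoop", 3
)
print(cluster_vms)
```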

Page 23: Savanna: Hadoop on OpenStack

Q&A

Page 24: Savanna: Hadoop on OpenStack

HDFS Reliability: the issue

[Diagram: two Compute hosts, each running three DataNode (DN) VMs, with the placement of a Data Block highlighted.]

Page 27: Savanna: Hadoop on OpenStack

HDFS Reliability: single DN per host

[Diagram: three Compute hosts running DN VMs and one TT | DN VM, with the VMs grouped into Cluster A and Cluster B.]

Page 28: Savanna: Hadoop on OpenStack

HDFS Reliability: HADOOP-8468, hypervisor-awareness for HDFS scheduler

[Diagram: three Compute hosts, each running two DataNode (DN) VMs, with an HDFS Data Block placed across them.]
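The idea behind hypervisor awareness can be shown with a toy placement function: when several DataNode VMs share a compute host, replicas of a block should land on VMs that sit on different hosts, so that losing one host does not lose every copy. This is a simplified illustration with made-up names, not the algorithm implemented under HADOOP-8468.

```python
def place_replicas(datanode_hosts, replication):
    """Pick `replication` DataNodes, no two on the same compute host.

    `datanode_hosts` maps a DataNode VM name to the compute host it runs on.
    Raises ValueError when there are fewer distinct hosts than replicas.
    """
    chosen = []
    used_hosts = set()
    for dn in sorted(datanode_hosts):
        host = datanode_hosts[dn]
        if host in used_hosts:
            continue  # another replica already lives on this host
        chosen.append(dn)
        used_hosts.add(host)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct compute hosts for %d replicas" % replication)


# Three compute hosts with two DataNode VMs each, as in the diagram.
layout = {
    "dn1": "compute-1", "dn2": "compute-1",
    "dn3": "compute-2", "dn4": "compute-2",
    "dn5": "compute-3", "dn6": "compute-3",
}
replicas = place_replicas(layout, 3)
print(replicas)
```

A scheduler that ignores the VM-to-host mapping could instead choose dn1 and dn2, leaving both replicas on compute-1.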

Page 29: Savanna: Hadoop on OpenStack

HDFS Reliability: HADOOP-8545 enables Swift for Hadoop

[Diagram. Labels: Swift, HDFS, Hadoop Job #1, Hadoop Job #2, ..., Hadoop Job #N, initial input, final output.]

Page 30: Savanna: Hadoop on OpenStack

HDFS Placement Options

● Ephemeral drive: /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral
● Block storage volume: Cinder Volume -> /mnt/volume
● Bare drive support: /dev/sdb -> /mnt/sdb
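As a rough sketch of how a chosen placement option could end up in the DataNode configuration, the helper below builds a value for `dfs.data.dir`, the Hadoop 1.x property that takes a comma-separated list of storage directories. The option keys, the `hdfs/data` subdirectory, and the function name are assumptions; only the mount points come from the slide.

```python
# Mount points as listed on the slide; the dictionary keys are made up here.
MOUNT_POINTS = {
    "ephemeral": "/mnt/ephemeral",
    "volume": "/mnt/volume",
    "bare": "/mnt/sdb",
}


def datanode_dirs(options, subdir="hdfs/data"):
    """Build a comma-separated dfs.data.dir value for the chosen options."""
    dirs = []
    for option in options:
        if option not in MOUNT_POINTS:
            raise ValueError("unknown placement option: %r" % option)
        dirs.append("%s/%s" % (MOUNT_POINTS[option], subdir))
    return ",".join(dirs)


data_dirs = datanode_dirs(["ephemeral", "volume"])
print(data_dirs)
```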

Page 31: Savanna: Hadoop on OpenStack

Configurable topology of DN, NN, TT, JT

● Master node(s): JT | NN, JT, NN
● Worker nodes: TT, TT | DN, DN

[The figure also shows the numbers 10, 6, and 8 next to the node types.]
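A configurable topology implies some validation of the requested role layout. The sketch below assumes a Hadoop 1.x cluster without NameNode HA (exactly one NN and one JT, at least one DN and one TT); the node-group representation and the rules are illustrative assumptions, not Savanna's actual checks.

```python
from collections import Counter


def validate_topology(node_groups):
    """Return a list of problems found in a requested topology.

    `node_groups` is a list of (roles, count) pairs, for example
    (("JT", "NN"), 1) for one master node running both services.
    """
    totals = Counter()
    for roles, count in node_groups:
        for role in roles:
            totals[role] += count

    errors = []
    if totals["NN"] != 1:
        errors.append("expected exactly one NN, got %d" % totals["NN"])
    if totals["JT"] != 1:
        errors.append("expected exactly one JT, got %d" % totals["JT"])
    if totals["DN"] < 1:
        errors.append("at least one DN is required")
    if totals["TT"] < 1:
        errors.append("at least one TT is required")
    return errors


# Combined master plus combined workers.
combined = validate_topology([(("JT", "NN"), 1), (("TT", "DN"), 10)])
# Separate masters, separate compute-only and storage-only workers.
split = validate_topology([(("JT",), 1), (("NN",), 1), (("TT",), 6), (("DN",), 8)])
# Missing JobTracker.
broken = validate_topology([(("NN",), 1), (("TT", "DN"), 4)])
print(combined, split, broken)
```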