Savanna - Elastic Hadoop on OpenStack
Savanna - Hadoop on OpenStack
Sergey Lukjanov, Savanna Technical Lead
Mirantis, 2013
Agenda
● Savanna Overview
● Savanna Use Cases
● Roadmap & Current Status
● Architecture & Features Overview
● Hadoop vs. Virtualization
Savanna - Elastic Hadoop on OpenStack
● Open source, native OpenStack component
● Supports different Hadoop distributions
● Covers both use cases: bare cluster provisioning and "analytics as a service"
● Managed through a REST API
● Web UI as part of the OpenStack Dashboard
● Flexible templates for Hadoop configurations
Savanna - Elastic Hadoop on OpenStack
● Project home - https://launchpad.net/savanna
  ○ bug tracking
  ○ blueprints
  ○ answers
● Code review (gerrit) - https://review.openstack.org
● Sources - https://github.com/stackforge/savanna
● Mailing list - [email protected]
● CI - https://jenkins.openstack.org and http://jenkins.savanna.mirantis.com
Savanna - Participants
● Contributors:
  ○ large core team from Mirantis
  ○ teams from Red Hat and Hortonworks
  ○ several minor contributors
● Intel joined recently
● Several upcoming customers
Savanna Use Cases
● Administrators - centralized cluster management and monitoring
● Dev and QA teams - fast cluster provisioning
● Data scientists/analysts - an API to run analytic jobs, with infrastructure provisioning happening under the hood
● Making resources dedicated to the IaaS cloud available for Hadoop workloads
Administrators Use Case
● Central point of control over infrastructure
● Enables self-service capabilities, including the choice of Hadoop distribution
● Integration with vendor tooling:
  ○ Ambari for Apache/Hortonworks
  ○ Cloudera Management Console
  ○ Intel Hadoop
● Utilization of free IaaS capacity for Hadoop tasks
Dev and QA Use Cases
● Fast on-demand provisioning of environments
● Increased agility and speed of innovation
● Controlled access to data from production
Analytics Use Cases
● Simplified task execution - the complexity of provisioning and managing a cluster is hidden under the hood
  ○ Access to higher-level interfaces (e.g. Pig, Hive)
● Bursty workloads: ad-hoc queries that need significant resources for only a short period
● Utilization of free IaaS capacity for Hadoop tasks
Roadmap for Hadoop in Cloud
Phase 1: Basic cluster provisioning of Apache Hadoop
Phase 2: Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.)
Phase 3: "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OpenStack
Phase 1 - Basic Cluster Operation
● Cluster provisioning
● Deployment Engine implementation for pre-installed images
● Templates for Hadoop cluster configuration
● REST API for cluster startup and operations
● Web UI integrated into OpenStack Dashboard
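The idea of a configuration template can be illustrated with a minimal sketch: a base template holds default Hadoop parameters and a cluster request overrides a few of them. The layout of `BASE_TEMPLATE` and the helper `apply_overrides` are hypothetical and are not Savanna's actual template format; only the Hadoop 1.x property names are real.

```python
import copy

# Hypothetical template layout (not Savanna's real schema): one dict per service.
BASE_TEMPLATE = {
    "hdfs": {"dfs.replication": 3},
    "mapreduce": {"mapred.tasktracker.map.tasks.maximum": 2},
}


def apply_overrides(template, overrides):
    """Return a new config: the template with per-cluster overrides applied."""
    merged = copy.deepcopy(template)  # never mutate the shared template
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


# A cluster that lowers the replication factor but keeps all other defaults.
cluster_config = apply_overrides(BASE_TEMPLATE, {"hdfs": {"dfs.replication": 2}})
```

Copying the template before merging keeps it reusable for the next cluster request.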
Phase 2 - Advanced Configuration
● Hadoop cluster configuration support:
  ○ Solutions for the HDFS data reliability issue
  ○ Configurable DN storage location
  ○ Configurable topology of DN, NN, TT, JT
  ○ Add/remove nodes
  ○ More Hadoop parameters
● Integration with vendor deployment/management tooling
● Basic monitoring support
Phase 3 - Analytics as a Service
● API to execute MapReduce jobs without exposing details of the underlying infrastructure (similar to AWS EMR)
● User-friendly UI for ad-hoc analytics queries based on Hive or Pig
Roadmap for Hadoop in Cloud
Phase 1 [Released - April 10]: Basic cluster provisioning of Apache Hadoop
Phase 2 [In progress - July 15]: Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.)
Phase 3 [Planned - October 15]: "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OpenStack
Further Roadmap
● Autoscaling
● HA for NameNode
● Deeper HDFS and Swift integration
  ○ Caching of Swift data on HDFS
● Integration with logging and error handling
● HBase support
Architecture Overview
[Diagram. Components shown:
 clients - Savanna Python Client, Horizon (Savanna Pages);
 Savanna - REST API, Auth, Cluster Configuration Manager, DAL, Provisioning Plugin, Instance Interop Helper, Image Registry;
 OpenStack services - Keystone, Nova, Glance, Swift;
 provisioned resources - Hadoop VMs.]
Hadoop vs. Virtualization
● HDFS Reliability
● Data Persistence
● I/O Performance
● etc.
HDFS Reliability: the issue
[Diagram: two Compute hosts, each running three DataNode (DN) VMs, with a data block placed on the DNs.]
HDFS Reliability: single DN per host
[Diagram: Compute hosts carrying the DN and TT | DN VMs of two clusters, Cluster A and Cluster B.]
HDFS Reliability: HADOOP-8468, hypervisor awareness for the HDFS scheduler
[Diagram: three Compute hosts, each running two DN VMs, with an HDFS data block placed across them.]
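The reliability problem and the effect of hypervisor awareness can be sketched in a few lines. When several DN VMs share one compute host, a host-unaware choice may put every replica of a block on the same physical machine. The function below is only an illustration of host-aware placement, not the HADOOP-8468 algorithm; the DN and host names are made up.

```python
def place_replicas(datanodes, replication=3):
    """Pick DataNodes so that no two replicas share a physical host.

    `datanodes` maps a DataNode name to the compute host running its VM.
    Returns fewer than `replication` names if there are not enough hosts.
    """
    chosen, used_hosts = [], set()
    for dn, host in datanodes.items():
        if len(chosen) == replication:
            break
        if host not in used_hosts:
            chosen.append(dn)
            used_hosts.add(host)
    return chosen


def hosts_used(datanodes, replicas):
    """Return the set of physical hosts that hold the given replicas."""
    return {datanodes[dn] for dn in replicas}


# Hypothetical layout: six DN VMs on two compute hosts.
datanodes = {
    "dn1": "compute-1", "dn2": "compute-1", "dn3": "compute-1",
    "dn4": "compute-2", "dn5": "compute-2", "dn6": "compute-2",
}

naive = ["dn1", "dn2", "dn3"]          # host-unaware: all replicas on compute-1
aware = place_replicas(datanodes, 2)   # host-aware: one replica per host
```

In the naive case a single failure of `compute-1` loses every replica; the host-aware choice survives it.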
HDFS Reliability: HADOOP-8545, enables Swift for Hadoop
[Diagram: Swift supplies the initial input to Hadoop Job #1 and receives the final output of Hadoop Job #N; Hadoop Jobs #1 through #N exchange data through HDFS.]
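The data flow on this slide can be sketched as a small planner: only the first job reads from Swift and only the last job writes to Swift, while intermediate results stay in HDFS. The function name, the URIs, and the `step-N` directory layout are illustrative assumptions, not part of Savanna or HADOOP-8545.

```python
def plan_job_io(n_jobs, swift_input, swift_output, hdfs_tmp="hdfs:///tmp/pipeline"):
    """Return one (input, output) location pair per job in a chain of n_jobs."""
    if n_jobs < 1:
        raise ValueError("need at least one job")
    plan = []
    for i in range(1, n_jobs + 1):
        # First job reads the initial input from Swift; later jobs read HDFS.
        src = swift_input if i == 1 else f"{hdfs_tmp}/step-{i - 1}"
        # Last job writes the final output to Swift; earlier jobs write HDFS.
        dst = swift_output if i == n_jobs else f"{hdfs_tmp}/step-{i}"
        plan.append((src, dst))
    return plan


# Hypothetical container names, for illustration only.
plan = plan_job_io(3, "swift://in.example/logs", "swift://out.example/report")
```

With this split, losing the ephemeral HDFS data costs only intermediate results, since the durable input and output live in Swift.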
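Configurable topology of DN, NN, TT, JT
● Master node(s)
● Worker nodes
[Diagram: master nodes shown either as one combined JT | NN node or as separate JT and NN nodes; worker node groups shown as TT, TT | DN, and DN, with the counts 10, 6, and 8.]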
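A configurable topology needs a sanity check before provisioning. The sketch below assumes a Hadoop 1.x style cluster with exactly one NameNode and one JobTracker and at least one DataNode and TaskTracker; the `(processes, count)` input shape and the function name are hypothetical, not Savanna's validation code.

```python
def validate_topology(node_groups):
    """Check a list of (processes, count) node groups; return error strings."""
    totals = {"NN": 0, "JT": 0, "DN": 0, "TT": 0}
    for processes, count in node_groups:
        unknown = set(processes) - set(totals)
        if unknown:
            raise ValueError(f"unknown processes: {sorted(unknown)}")
        if count < 1:
            raise ValueError("node group count must be at least 1")
        for process in processes:
            totals[process] += count

    errors = []
    if totals["NN"] != 1:
        errors.append("exactly one NameNode (NN) is required")
    if totals["JT"] != 1:
        errors.append("exactly one JobTracker (JT) is required")
    if totals["DN"] < 1:
        errors.append("at least one DataNode (DN) is required")
    if totals["TT"] < 1:
        errors.append("at least one TaskTracker (TT) is required")
    return errors


# Combined master node with co-located workers.
combined = [({"JT", "NN"}, 1), ({"TT", "DN"}, 10)]
# Separate masters with compute-only and storage-only worker groups.
separate = [({"JT"}, 1), ({"NN"}, 1), ({"TT"}, 6), ({"DN"}, 8)]
# Invalid: two NameNodes, no JobTracker, no TaskTracker.
broken = [({"NN"}, 2), ({"DN"}, 3)]
```

Returning a list of errors rather than stopping at the first one lets a UI show every problem at once.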
HDFS Placement Options
● Ephemeral drive: /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral
● Block storage volume: Cinder Volume -> /mnt/volume
● Bare hard drive support: /dev/sdb -> /mnt/sdb
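As a rough sketch, the chosen placement options can be turned into the comma-separated directory list that Hadoop 1.x expects in `dfs.data.dir`. The mount points come from the slide; the option keys, the `hdfs/data` subdirectory, and the helper name are assumptions made for illustration.

```python
# Mount points from the slide, keyed by hypothetical option names.
MOUNT_POINTS = {
    "ephemeral": "/mnt/ephemeral",  # Nova ephemeral drive
    "volume": "/mnt/volume",        # Cinder block storage volume
    "bare": "/mnt/sdb",             # bare hard drive (/dev/sdb)
}


def datanode_dirs(options, subdir="hdfs/data"):
    """Build a dfs.data.dir value from the selected placement options."""
    dirs = []
    for option in options:
        if option not in MOUNT_POINTS:
            raise ValueError(f"unknown placement option: {option}")
        dirs.append(f"{MOUNT_POINTS[option]}/{subdir}")
    return ",".join(dirs)


dfs_data_dir = datanode_dirs(["ephemeral", "volume"])
```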
Q&A
We are hiring!
Phase 1 deployment mechanism
[Diagram: Savanna provisions VMs with pre-installed Hadoop and then configures the Hadoop cluster; four Hadoop VMs are shown.]
Tool usage scenarios
● Scenario I: the tool manages a Hadoop cluster running on existing Hadoop VMs
● Scenario II: the tool both provisions and manages the Hadoop cluster on plain VMs
Extensible Provisioning
Plugin
● get extra configs
● validate input
● launch/terminate cluster
● add/remove nodes
Instance Interop
● launch/terminate VMs
● get VM status
● ssh/scp to VM
Image registry
● register image in Savanna
● add/remove tags
● get image by tag
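The image registry operations listed above (register, add/remove tags, get by tag) can be sketched with an in-memory class. This is a toy model for illustration: the class, its method names, and the image ids are hypothetical and do not reproduce Savanna's code.

```python
class ImageRegistry:
    """Toy in-memory registry mapping image ids to sets of tags."""

    def __init__(self):
        self._tags = {}  # image id -> set of tags

    def register(self, image_id):
        """Make an image known to the registry (no tags yet)."""
        self._tags.setdefault(image_id, set())

    def add_tags(self, image_id, *tags):
        self._tags[image_id].update(tags)

    def remove_tags(self, image_id, *tags):
        self._tags[image_id].difference_update(tags)

    def get_by_tags(self, *tags):
        """Return ids of images that carry all of the requested tags."""
        wanted = set(tags)
        return sorted(i for i, t in self._tags.items() if wanted <= t)


registry = ImageRegistry()
registry.register("img-1")
registry.register("img-2")
registry.add_tags("img-1", "vanilla", "hadoop-1")
registry.add_tags("img-2", "vanilla")
matches = registry.get_by_tags("vanilla", "hadoop-1")
```

Tags let a plugin ask for "an image suitable for this distribution" without hard-coding image ids.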
Provisioning Interaction
[Sequence diagram between User, Savanna, and Plugin. Interactions shown: get extra parameters for the plugin; validate cluster parameters; launch cluster; add/remove nodes.]
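The interaction can be sketched as a plugin interface plus the calls Savanna makes on it. The class and method names and the call order (extra parameters, validation, launch, then scaling) are assumptions read off the slide, not Savanna's real plugin API.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin contract mirroring the operations on the slide."""

    @abstractmethod
    def get_extra_parameters(self): ...

    @abstractmethod
    def validate(self, cluster): ...

    @abstractmethod
    def launch_cluster(self, cluster): ...

    @abstractmethod
    def add_nodes(self, cluster, count): ...


class RecordingPlugin(ProvisioningPlugin):
    """Fake plugin that records calls instead of touching any VMs."""

    def __init__(self):
        self.calls = []

    def get_extra_parameters(self):
        self.calls.append("get_extra_parameters")
        return ["workers"]

    def validate(self, cluster):
        self.calls.append("validate")
        if cluster.get("workers", 0) < 1:
            raise ValueError("cluster needs at least one worker")

    def launch_cluster(self, cluster):
        self.calls.append("launch_cluster")

    def add_nodes(self, cluster, count):
        self.calls.append("add_nodes")
        cluster["workers"] += count


def launch(plugin, cluster):
    """Savanna-side flow: ask for extra parameters, validate, then launch."""
    extra = plugin.get_extra_parameters()
    plugin.validate(cluster)
    plugin.launch_cluster(cluster)
    return extra


plugin = RecordingPlugin()
cluster = {"name": "demo", "workers": 3}
launch(plugin, cluster)
plugin.add_nodes(cluster, 2)
```

Keeping validation inside the plugin lets each Hadoop distribution enforce its own rules while Savanna drives the same sequence for all of them.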
Provisioning: Launching a Cluster
[Diagram: the plugin gets an image by tag from the Image Registry, launches VMs through the Instance Interop Helper, and installs and configures Hadoop on the Hadoop VMs; commands are passed to the VMs via ssh and scp.]