Hadoop and OpenStack - Hadoop Summit San Jose 2014

Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks

description

Merging the insightful power of Hadoop with the management capabilities of OpenStack via Sahara

Transcript of Hadoop and OpenStack - Hadoop Summit San Jose 2014

Page 1: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Hadoop and OpenStack
Matthew Farrellee, @spinningmatt, Red Hat
Sumit Mohanty, @smohanty, Hortonworks

Page 2: Hadoop and OpenStack - Hadoop Summit San Jose 2014

What is OpenStack?

Page 3: Hadoop and OpenStack - Hadoop Summit San Jose 2014

OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.

Page 4: Hadoop and OpenStack - Hadoop Summit San Jose 2014

An ecosystem of projects

● Compute - Nova
● Networking - Neutron
● Object Storage - Swift
● Block Storage - Cinder
● Identity - Keystone
● Image Service - Glance
● Dashboard - Horizon
● Telemetry - Ceilometer
● Orchestration - Heat
● Data Processing - Sahara

Page 5: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara is combining use cases

Page 6: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Trends

[Chart: Google Trends search interest for Hadoop, EC2, and OpenStack]

www.google.com/trends/explore#q=hadoop,ec2,openstack

EC2 beta Aug 25 2006 (http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html)

Page 7: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Data analysis is hard

Page 8: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Data analysis is hard...

● Come up w/ a relevant question
  ○ The question you answer won’t be the question you set out to ask
  ○ Mine: Can I predict doctor specialty from what procedures they perform?
● Find the data
  ○ Tons, little consistency, unknown origin, hoarded
  ○ Data w/o a dictionary is worse than code w/o comments. Run away!

Page 9: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Data analysis is hard...

● Data usability
  ○ Acceptable license? (Even for Gov’t sets)
    ■ Mine: Metadata copyrighted by AMA!
  ○ Private is often highly protected, no/narrow DMZ
● Explore and clean
  ○ Two of the oldest people in the medical profession working with Medicare
  ○ Stephen Glasser graduated in 1773
  ○ Cheryl Palma graduated in 1776
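The graduation years above are the kind of outlier an exploratory pass can catch. A minimal sketch, assuming records are (name, graduation year) pairs and that any year before 1930 is implausible for someone still practicing. The record layout, the cutoff, and the third row are illustrative assumptions, not from the talk.

```python
# Hypothetical record layout: (provider name, graduation year).
records = [
    ("Stephen Glasser", 1773),
    ("Cheryl Palma", 1776),
    ("Example Provider", 1998),  # made-up row, included only for contrast
]

# Assumed cutoff: graduation years before this are flagged for review.
EARLIEST_PLAUSIBLE_YEAR = 1930


def flag_implausible(rows, earliest=EARLIEST_PLAUSIBLE_YEAR):
    """Return the rows whose graduation year is earlier than the cutoff."""
    return [(name, year) for name, year in rows if year < earliest]


suspect = flag_implausible(records)
for name, year in suspect:
    print(f"check record: {name} graduated in {year}")
```

Flagging rather than deleting keeps the decision with the analyst, since a bad year may point to a wider problem in the source data.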

Page 10: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Data analysis is hard...

● You got some answer to a question you approximately asked
● You must refine the question and process
● Repeat

This is hard enough without having to manage tools and infrastructure!

Page 11: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara’s goal

Make managing Hadoop+ infrastructure and tools so simple that they get out of your way

Page 12: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara provides

● Apache Hadoop cluster and workload management
  ○ Cluster - construct and manage the lifecycle of a Hadoop cluster
  ○ Workload - workflow for big data processing with Hadoop (AWS EMR-like)
● Through a Python library, REST API, Web UI, command line interface

Page 13: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara’s architecture

[Architecture diagram. Labeled components:]

● Clients: Horizon (Sahara Pages), Sahara Python Client
● Sahara Service: REST API, Auth, Cluster Configuration Manager, Data Access Layer, Vendors Plugins, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources
● OpenStack services: Keystone, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB
● Hadoop VMs

Page 14: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara’s features

● Plugin mechanism - distro choice
● Cluster scaling - elasticity
● Swift integration - data storage
● Cinder integration - persistent HDFS
● Network management with Nova and Neutron
● Anti-affinity, separate services on physical hardware
● Data locality with Swift
● Repeatable cluster creation w/ template mechanism
● http://docs.openstack.org/developer/sahara/userdoc/features.html
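Anti-affinity in the list above means keeping instances of the same service on different physical hosts. A minimal sketch of that constraint, assuming a one-instance-per-host placement. The function `place_with_anti_affinity` and the instance and host names are hypothetical, and this is not how Sahara or Nova actually schedules instances.

```python
def place_with_anti_affinity(instances, hosts):
    """Assign each instance to a distinct host.

    Raises ValueError when there are more instances than hosts, because
    the anti-affinity constraint cannot be satisfied in that case.
    """
    if len(instances) > len(hosts):
        raise ValueError("not enough hosts to keep instances apart")
    # zip pairs the i-th instance with the i-th host, so no host repeats.
    return dict(zip(instances, hosts))


# Hypothetical DataNode instances and compute hosts.
datanodes = ["datanode-1", "datanode-2", "datanode-3"]
compute_hosts = ["host-a", "host-b", "host-c", "host-d"]

placement = place_with_anti_affinity(datanodes, compute_hosts)
print(placement)
```

Failing loudly when hosts run out is one design choice; a scheduler could instead relax the constraint and co-locate the overflow.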

Page 15: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Storage considerations

● Swift
  ○ Input/output through Swift HCFS plugin
  ○ Intermediate data stored in HDFS on cluster
  ○ Locality when co-locating swift & nova-compute
● HDFS
  ○ Local (long lived cluster) and remote (copy in)
● HDFS backed by ephemeral disk or Cinder
  ○ Ephemeral - /var/lib/nova/instances on compute host
  ○ Cinder - persistent block devices attached to instances

Page 16: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Sahara’s plugin architecture

● This is important!
● It’s where Hadoop distribution vendors integrate their management software
● It’s how users pick different software versions
● Currently: Vanilla (reference impl. w/ Apache versions), HDP (via Ambari), IDH (via Intel Manager), and Spark (w/ minimal CDH)
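The plugin idea above can be sketched as a small registry: each plugin declares the versions it supports and the management software it drives, and the user picks one by name. The class names, method names, and the Vanilla version labels are hypothetical. This is an illustration of the pattern, not Sahara's actual plugin interface.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin interface (not Sahara's actual base class)."""

    name = ""

    @abstractmethod
    def supported_versions(self):
        """Return the distribution versions this plugin can deploy."""

    @abstractmethod
    def manager(self):
        """Return the management software this plugin drives."""


class VanillaPlugin(ProvisioningPlugin):
    name = "vanilla"

    def supported_versions(self):
        return ["1.x", "2.x"]  # placeholder version labels

    def manager(self):
        return "none (reference implementation)"


class HDPPlugin(ProvisioningPlugin):
    name = "hdp"

    def supported_versions(self):
        return ["1.3", "2.0"]  # HDP stack versions named elsewhere in the deck

    def manager(self):
        return "Ambari"


# Registry keyed by plugin name; users choose a distribution by name.
PLUGINS = {p.name: p for p in (VanillaPlugin(), HDPPlugin())}


def describe_cluster(plugin_name, version, node_count):
    """Validate the user's choice and describe the cluster to provision."""
    plugin = PLUGINS[plugin_name]
    if version not in plugin.supported_versions():
        raise ValueError(f"{plugin_name} does not support version {version}")
    return {
        "plugin": plugin.name,
        "version": version,
        "nodes": node_count,
        "managed_by": plugin.manager(),
    }


print(describe_cluster("hdp", "2.0", 4))
```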

Page 17: Hadoop and OpenStack - Hadoop Summit San Jose 2014

HDP Plugin Overview

● Full support for all Sahara Functionality
  ● Nova and Neutron network
  ● Cluster Scaling
  ● Scale Up
  ● Swift Integration
  ● Cinder Support
  ● Data Locality
  ● EDP
● Apache Ambari REST APIs used for cluster provisioning
● Monitoring/Management of clusters via Ambari
● Full support for multiple HDP stacks
● HDP pre-installed or generic VM images

Page 18: Hadoop and OpenStack - Hadoop Summit San Jose 2014

HDP Plugin Stack Support

HDP 1.3 (Available)
● NameNode
● Secondary NameNode
● DataNode
● HDFS
● ZooKeeper
● Ambari Server/Agent
● HCatalog
● Sqoop
● Job Tracker
● Task Tracker
● MapReduce
● Hive
● MySQL
● Pig
● WebHCat Server
● Oozie
● Ganglia
● Nagios
● HBase

HDP 2.0 (Available)
● History Server
● MapReduce 2 / YARN
● Resource Manager
● YARN Client

HDP 2.1 (Coming Soon!)
● Storm
● Falcon

HDP 2.1 + (Roadmap)
● SOLR
● Cascading

Page 19: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Ambari Blueprints

● Two primary goals of Ambari Blueprints
  ○ Ability to export a complete description of a running cluster
  ○ Provide API based cluster installations based on a self-contained cluster description
● Blueprints contain cluster topology and configuration information
● Enables Interesting use cases between physical and virtual, including OpenStack/Sahara

Page 20: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Blueprint API

1. BLUEPRINT: POST /blueprints/my-blueprint
2. CLUSTER INSTANCE: POST /clusters/MyCluster
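The two calls above can be sketched with the Python standard library. This only builds the requests and does not send them. The base URL (host, port, and the `/api/v1` prefix), the `X-Requested-By` header value, and the abbreviated request bodies are assumptions for illustration, and authentication is omitted.

```python
import json
import urllib.request

# Assumed Ambari endpoint; host, port, and API prefix are placeholders.
AMBARI_API = "http://ambari.example.com:8080/api/v1"


def build_post(path, body):
    """Build (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        url=AMBARI_API + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": "sketch",  # assumed header value
        },
        method="POST",
    )


# Abbreviated, illustrative bodies; a real blueprint lists host groups.
blueprint_body = {
    "host_groups": [],
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.0"},
}
cluster_body = {"blueprint": "my-blueprint", "host_groups": []}

# Step 1 registers the blueprint; step 2 creates a cluster from it.
step1 = build_post("/blueprints/my-blueprint", blueprint_body)
step2 = build_post("/clusters/MyCluster", cluster_body)

for step in (step1, step2):
    print(step.get_method(), step.full_url)
```

Sending would be a matter of passing each request to `urllib.request.urlopen` with credentials for a real server. The order matters, since the cluster instance refers to the blueprint by name.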

Page 21: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Example: Single-Node Definitions

BLUEPRINT

{
  "configurations" : [
    { "hdfs-site" : { "dfs.namenode.name.dir" : "/hadoop/nn" } }
  ],
  "host_groups" : [
    {
      "name" : "uber-host",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "SECONDARY_NAMENODE" },
        { "name" : "DATANODE" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "RESOURCEMANAGER" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "HISTORYSERVER" },
        { "name" : "MAPREDUCE2_CLIENT" }
      ],
      "cardinality" : "1"
    }
  ],
  "Blueprints" : {
    "blueprint_name" : "single-node-hdfs-yarn",
    "stack_name" : "HDP",
    "stack_version" : "2.0"
  }
}

CLUSTER INSTANCE

{
  "blueprint" : "single-node-hdfs-yarn",
  "host_groups" : [
    {
      "name" : "uber-host",
      "hosts" : [ { "fqdn" : "c6401.ambari.apache.org" } ]
    }
  ]
}

Description
• Single-node cluster
• Use HDP 2.0 Stack
• HDFS + YARN + MR2
• Everything on c6401
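The two definitions above are linked by the blueprint name and the host group name. A minimal sketch of a local sanity check on that link, with the component list abbreviated. The function `check_cluster` is hypothetical and is not Ambari's own validation.

```python
import json

# Blueprint from the slide, with the component list shortened.
blueprint = json.loads("""
{
  "host_groups": [
    {"name": "uber-host",
     "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
     "cardinality": "1"}
  ],
  "Blueprints": {"blueprint_name": "single-node-hdfs-yarn",
                 "stack_name": "HDP", "stack_version": "2.0"}
}
""")

cluster = json.loads("""
{
  "blueprint": "single-node-hdfs-yarn",
  "host_groups": [
    {"name": "uber-host", "hosts": [{"fqdn": "c6401.ambari.apache.org"}]}
  ]
}
""")


def check_cluster(blueprint, cluster):
    """Return a list of mismatches between a cluster instance and its blueprint."""
    problems = []
    if cluster["blueprint"] != blueprint["Blueprints"]["blueprint_name"]:
        problems.append("blueprint name mismatch")
    expected = {group["name"]: group for group in blueprint["host_groups"]}
    for group in cluster["host_groups"]:
        spec = expected.get(group["name"])
        if spec is None:
            problems.append(f"unknown host group: {group['name']}")
            continue
        wanted = spec["cardinality"]
        # Only compare when the cardinality is a plain number such as "1".
        if wanted.isdigit() and len(group["hosts"]) != int(wanted):
            problems.append(f"{group['name']}: expected {wanted} host(s)")
    return problems


print(check_cluster(blueprint, cluster))
```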

Page 22: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Demo - youtu.be/vmry_kXqn4c

● http://jayunit100.github.io/bigpetstore/slides
● Bigpetstore
  ○ A full stack Hadoop application
  ○ Uses the main players in the Hadoop ecosystem
  ○ To demonstrate a single domain
  ○ Just accepted into the Bigtop project!
● Come by the Red Hat booth - G18

Page 23: Hadoop and OpenStack - Hadoop Summit San Jose 2014

Q&A

● Status - Integrated for Juno (Oct 2014)
● Distro - RDO (Fedora/RHEL/CentOS), RHEL OSP 5, ...
● Home - https://launchpad.net/sahara
● Docs - http://docs.openstack.org/developer/sahara
● Code - https://github.com/openstack/ *sahara*
● Email - openstack-dev w/ [sahara]
● IRC - #openstack-sahara on freenode