Spark China Summit 2015 Guancheng Chen

16
SuperVessel: Enabling Spark as a Service with OpenStack and Docker Guan Cheng (G.C.) Chen IBM Research - China weibo.com/parallellabs

Transcript of Spark China Summit 2015 Guancheng Chen

Page 1: Spark China Summit 2015 Guancheng Chen

SuperVessel: Enabling Spark as a Service with OpenStack and Docker

Guan Cheng (G.C.) Chen IBM Research - China weibo.com/parallellabs

Page 2: Spark China Summit 2015 Guancheng Chen

SuperVessel Cloud

4/18/15 IBM  Research  -­‐  China 2 � 

•  Public  cloud  built  on  the  POWER7/POWER8  servers  with  OpenStack  •  It  provides  free  access  for  students,  researchers,  developers  across  the  world,  and  

helps  grow  OpenPOWER  ecosystem  (used  in  30+  universiNes  now)  •  It  provides  advanced  technology  services  such  as  Spark  as  a  Service,  Docker  Services,  

CogniNve  CompuNng  Service,  IoT  Service,  Accelerator  as  a  Service  (FPGA  and  GPU)  

Page 3: Spark China Summit 2015 Guancheng Chen

Spark as a Service

4/18/15 IBM  Research  -­‐  China 3 � 

Step  1:  Login Step  2:  Create Step  3:  Ready!

3 steps to launch a Spark cluster, easy!

Try  it  at:  www.ptopenlab.com

Page 4: Spark China Summit 2015 Guancheng Chen

Why OpenStack?

4/18/15 IBM  Research  -­‐  China 4 � 

•  Most  popular  IaaS  soXware  •  Supports  Docker*  •  Heat  can  orchestra  docker  containers  easily  –  good  for  provision  a  Spark  

cluster  

Picture  source:  h-p://www.qyjohn.net/?p=3801

Page 5: Spark China Summit 2015 Guancheng Chen

Why Docker? •  Less  resource  consumpNon  than  

KVM  –  We  can  provision  more  Spark  

clusters!  •  Boot  faster  than  KVM  

–  Users  like  fast  provision!  •  Incrementally  build,  revert  and  

reuse  your  container  –  We  love  Git  and  AUFS!  

•  However,  Docker  is  not  officially  supported  in  OpenStack  yet  –  Nova  Docker  is  an  external  

component  of  OpenStack  •  Port  Docker  to  POWER  

Architecture  (ppc64  and  ppc64le)  –  Ubuntu  15.04  includes  Docker  for  

POWER8  ppc64le

4/18/15 IBM  Research  -­‐  China 5 � 

Picture  source:  h-p://www.zdnet.com/ar@cle/what-­‐is-­‐docker-­‐and-­‐why-­‐is-­‐it-­‐so-­‐darn-­‐popular/

Page 6: Spark China Summit 2015 Guancheng Chen

Why Spark?

•  Fast  •  Unified  •  Ecosystem    PorNng  to  POWER  

–  Bugfix  submieed  to  the  community  

–  Spark  1.3  works  smoothly!  J  

4/18/15 IBM  Research  -­‐  China 6 � 

Page 7: Spark China Summit 2015 Guancheng Chen

Why not Sahara? Sahara is a component for Hadoop/Spark as a Service in OpenStack

•  We  started  from  OpenStack  Icehouse  …  •  DockerizaNon  

– Beeer  service  deployment  and  isolaNon  for  Big  Data  dashboard  server  

•  CustomizaNon  – WaiNng  for  Sahara’s  improvements  is  somehow  *SLOW*  

•  Docker,  user  analyNcs,  Spark  1.4,  Spark  IDE,  scheduling,  data  visualizaNon  etc.  

4/18/15 IBM  Research  -­‐  China 7 � 

Page 8: Spark China Summit 2015 Guancheng Chen

4/18/15 IBM  Research  -­‐  China 8 � 

Architecture Design

Big Data Dashboard

Keystone

Glance

Neutron

Heat Nova Nova

Docker

Cinder/Manila

Spark Cluster

Docker Image

Spark Master

Spark Worker

Spark Worker

Container 1

Container 2

Container 3

NameNode Spark Driver

DataNode

DataNode

Billing&Auth

1

2

2 3

Page 9: Spark China Summit 2015 Guancheng Chen

Dockerize everything!

•  We  use  containers  to    –  run  applica<ons  –  run  other  containers  –  run  OpenStack  python  daemons  –  run  OpenStack  services  inside  OpenStack  (with  a  special  trunk  link  in  neutron/OVS)  

–  run  OpenStack  inside  OpenStack  •  Good  for  mulN-­‐site  expansion  

4/18/15 IBM  Research  -­‐  China 9 � 

Page 10: Spark China Summit 2015 Guancheng Chen

Heat template design Heat is a component for orchestration in OpenStack

•  Parameters  – Cinder/Malina/Neutron  uuid  – Size  of  Cinder/Malina  resources  

•  Resources  – Master/slave  node  – Neutron    – Cinder/Manila  

•  Need  to  modify  nova-­‐docker  to  mount  the  cinder/manila  resources  when  booNng  the  docker  container  

4/18/15 IBM  Research  -­‐  China 10 � 

Page 11: Spark China Summit 2015 Guancheng Chen

Spark Docker Image

•  Built  from  Ubuntu  14.04.1  •  All  Spark  nodes  use  the  same  image,  with  different  iniNalizaNon  scripts  by  using  cloudinit  

•  IniNalizaNon  scripts  will  – Sync  /etc/hosts  across  all  nodes  – Set  HDFS  and  Spark  configuraNons  accordingly  – Format  HDFS  and  launch  HDFS  – Launch  Spark

4/18/15 IBM  Research  -­‐  China 11 � 

Page 12: Spark China Summit 2015 Guancheng Chen

Big Data Dashboard Development

•  2  Developers  (frond  end  +  backend)  – Online  in  2  months  – Separates  Heat  related  stuff  and  Dashboard  – Reskul  API  (for  billing  and  authenNcaNon  etc)  – Dockerize  the  big  data  dashboard  server  – Separates  development  and  producNon  environment

4/18/15 IBM  Research  -­‐  China 12 � 

Page 13: Spark China Summit 2015 Guancheng Chen

Where should I put the data? Shared File System for Cloud and Spark as a Service

4/18/15 IBM  Research  -­‐  China 13 � 

Docker

(Symphony)

Horizon

OpenStack controller HEAT

Neutron Glance Manila

Nova

Cloud � Infrastructure � Service �  Big � Data � Service � 

•  Select  Big  data  compuNng  framework  (Mapreduce,  SPARK  

•  Select  cluster  size  •  Select  data  folder  size

HEAT  template  for  big  data  cluster

Docker

(Symphony)

Docker

(Symphony)

Docker

(Symphony)

Docker

(Symphony)

Docker

(SPARK)

POWER7/POWER8

KVM/Docker

(Web app)

Folder  A

User  B User  A

Folder  B

User  A

•  HEAT  will  orchestrate  docker  instances,  subnet  and  data  folder  based  on  user’s  request  •  Manila  provides  the  NFS  service  using  GPFS  as  backend,  and  the  folder  will  be  mounted  via  nova-­‐docker  (with  -­‐v  support)  •  Folder  created  by  Manila  could  be  accessed  by  the  KVM/docker  instances  created  for  big  data  and  other  purpose  

GPFS  FPO

POWER7/POWER8

Servers

GPFS  FPO Servers

GPFS  FPO

KeyStone

Cinder

Page 14: Spark China Summit 2015 Guancheng Chen

SuperVessel Services Roadmap

4/18/15 IBM  Research  -­‐  China 14 � 

SuperVessel  Cloud  Infrastructure  

SuperVessel  Cloud    Service

SuperVessel    Big  Data  and  HPC  

Service

Super    Class  Service

OpenPOWER  Enablement  Service

Super  Project  Team  Service

Super  Marketplace

1.  VM  and  container  service  

2.  Storage  service  3.  Network  service  4.   Accelerator  as  

service  5.  Image  service  

1.   Big  Data:  MapReduce  (Symphony),  SPARK  

2.  Performance  tuning  service

1.  X-­‐to-­‐P  migraNon:  AutoPort  tool  

2.  OpenPOWER  new  system  test  service

1.  On-­‐line  video  courses  

2.  Teacher  course  management  

3.  User  contribuNon  management  

1.  Project  management  service  

2.  DevOps  automaNon

Storage IBM  POWER  servers OpenPOWER  server FPGA/GPU

Docker

(Online) (Online) (Preparing) (Online)

Page 15: Spark China Summit 2015 Guancheng Chen

Summary

•  Spark  +  OpenStack  +  Docker  works  very  well  on  OpenPOWER  servers  

•  Dockerized  services  made  DevOps  easier  •  Docker  issues  

–  Zombie  process  –  Can’t  dynamically  aeach  volume  –  Commit,  aeach  operaNons  is  not  supported  on  Nova-­‐docker  

•  Monitoring  everything  (API,  resources  etc)  –  key  for  operaNng  a  cloud  

•  TODO:  Spark  1.4,  Spark  IDE,  SWIFT,  Data  VisualizaNon  

4/18/15 IBM  Research  -­‐  China 15 � 

Page 16: Spark China Summit 2015 Guancheng Chen

4/18/15 IBM  Research  -­‐  China 16 � 

Join Us!

www.ptopenlab.com

QQ  group:  SuperVessel

SuperVessel  WeChat  group

@冠诚 � [email protected] �