Using the Open Science Data Cloud for Data Science Research

Using the Open Science Data Cloud for Data Science Research, Robert Grossman, University of Chicago and Open Cloud Consortium, June 17, 2013

description

The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.

Transcript of Using the Open Science Data Cloud for Data Science Research

Page 1: Using the Open Science Data Cloud for Data Science Research

Using the Open Science Data Cloud for Data Science Research

Robert Grossman
University of Chicago
Open Cloud Consortium
June 17, 2013

Page 2: Using the Open Science Data Cloud for Data Science Research

Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms

Discoveries

Page 3: Using the Open Science Data Cloud for Data Science Research

Part 1: What Instrument Do We Use to Make Big Data Discoveries?

How do we build a “datascope”?

Page 4: Using the Open Science Data Cloud for Data Science Research

What is big data?

TB? PB? EB?

W? KW? MW?

Page 5: Using the Open Science Data Cloud for Data Science Research

An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
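The definition above can be illustrated with a minimal sketch. It assumes an idealized model in which work divides evenly across racks; the function name and the throughput figure of 10 TB per rack-hour are hypothetical, chosen only for illustration.

```python
def run_time_hours(data_tb: float, racks: int, tb_per_rack_hour: float = 10.0) -> float:
    """Idealized run time when the data is spread evenly over the racks.

    tb_per_rack_hour is a made-up throughput used only for this sketch.
    """
    return data_tb / (racks * tb_per_rack_hour)


# Grow the data and the racks together: each rack brings 100 TB of data
# plus the processors needed to scan it.
rack_counts = (1, 2, 4, 8)
times = [run_time_hours(100.0 * racks, racks) for racks in rack_counts]

for racks, hours in zip(rack_counts, times):
    print(f"{racks} rack(s), {100 * racks} TB: {hours:.1f} hours")
```

In this idealized model the run time stays flat as racks are added, which is the property the slide calls big-data scalable. Real systems usually fall short of this because of coordination and network costs.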

Page 6: Using the Open Science Data Cloud for Data Science Research

Commercial Cloud Service Provider (CSP)
15 MW Data Center

100,000 servers
1 PB DRAM
100's of PB of disk

Automatic provisioning and infrastructure management

Monitoring, network security and forensics

Accounting and billing

Customer Facing Portal

Data center network

~1 Tbps egress bandwidth

25 operators for 15 MW Commercial Cloud

Page 7: Using the Open Science Data Cloud for Data Science Research

OSDC’s vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.

Page 8: Using the Open Science Data Cloud for Data Science Research

Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms

Discoveries

Page 9: Using the Open Science Data Cloud for Data Science Research

Some Examples of Big Data Science

Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | One
Astronomy - LSST | 10 years | 12 PB/year** | One
Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
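The unit conversions behind the footnotes can be checked with a short calculation. This is a rough sketch that assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and, for LSST, observing on all 365 nights of the year, which is an upper-bound assumption that is not stated in the sources.

```python
GB_PER_PB = 1_000_000   # decimal units, an assumption of this sketch
TB_PER_PB = 1_000
NIGHTS_PER_YEAR = 365   # upper bound: assumes observing every night

# LHC: "more than 15 million Gigabytes of data each year"
lhc_pb_per_year = 15_000_000 / GB_PER_PB

# LSST: "over 15 terabytes of raw data each night (30 terabytes processed)"
lsst_raw_pb_per_year = 15 * NIGHTS_PER_YEAR / TB_PER_PB
lsst_processed_pb_per_year = 30 * NIGHTS_PER_YEAR / TB_PER_PB

print(f"LHC:  {lhc_pb_per_year:.1f} PB/year")
print(f"LSST: {lsst_raw_pb_per_year:.2f} PB/year raw, "
      f"{lsst_processed_pb_per_year:.2f} PB/year processed")
```

Under these assumptions the LHC figure comes out at 15 PB/year, and the LSST nightly rates bracket the order of magnitude shown in the table.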

Page 10: Using the Open Science Data Cloud for Data Science Research

One large instrument

Many smaller instruments

Page 11: Using the Open Science Data Cloud for Data Science Research

Part 2. What is a Cloud and Why Do We Care?

Page 12: Using the Open Science Data Cloud for Data Science Research

There Are Two Essential Characteristics of a Cloud

1. Self service
2. Scale

• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.

• Clouds can be designed to be secure and compliant.

Page 13: Using the Open Science Data Cloud for Data Science Research

Self Service

Page 14: Using the Open Science Data Cloud for Data Science Research

Scale

Page 15: Using the Open Science Data Cloud for Data Science Research

Types of Clouds

• Public Clouds
  – Amazon

• Private Clouds
  – Run internally by universities or companies

• Community Clouds
  – Run by organizations (either formally or informally), such as the Open Cloud Consortium

Page 16: Using the Open Science Data Cloud for Data Science Research

Amazon Web Services (AWS)?

• Scale
• Simplicity of a credit card
• Wide variety of offerings

vs.

Community clouds, science clouds, etc.

• Lower cost (at medium scale)
• Data too important for commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security

OCC supports AWS interop and bursting when permissible.

Page 17: Using the Open Science Data Cloud for Data Science Research

Science Clouds

 | NFP Science Clouds | Commercial Clouds
POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds.
Data & Storage | Data intensive computing & HP storage | Internet style scale out and object-based storage
Flows | Large & small data flows | Lots of small web flows
Streams | Streaming processing required | NA
Accounting | Essential | Essential
Lock in | Moving environment between CSPs essential | Lock in is good
Interop | Critical, but difficult | Customers will drive to some degree

Page 18: Using the Open Science Data Cloud for Data Science Research

Essential Services for a Science CSP

• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services

Page 19: Using the Open Science Data Cloud for Data Science Research

Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Page 20: Using the Open Science Data Cloud for Data Science Research

Cloud Services Operations Centers (CSOC)

• The OSDC operates a Cloud Services Operations Center (CSOC).

• It is a CSOC focused on supporting Science Clouds for researchers.

• Compare to a Network Operations Center (NOC).

• Both are an important part of cyber infrastructure for big data science.

Page 21: Using the Open Science Data Cloud for Data Science Research

Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Cloud Service Operations Center (CSOC)

Page 22: Using the Open Science Data Cloud for Data Science Research

Part 3: Data Science

Page 23: Using the Open Science Data Cloud for Data Science Research

Foundations of data science

General and discipline specific software applications and tools

Models and algorithms

Establish best practices, strategies for data science in general and discipline specific data science in particular

Analytic infrastructure

Data

Page 24: Using the Open Science Data Cloud for Data Science Research

What are the foundations for data science?

Page 25: Using the Open Science Data Cloud for Data Science Research

Theory to Big Data Spectrum

No data: Mathematical theorems
Small data: Traditional statistical modeling
Medium data: (Semi-)Automating statistical modeling
Big data: Simple counts and statistics over big data

GB | TB | PB

OSDC Datascope: 0.5-2.0 MW

Page 26: Using the Open Science Data Cloud for Data Science Research

Part 4: The Open Science Data Cloud

www.opensciencedatacloud.org

Page 27: Using the Open Science Data Cloud for Data Science Research

Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms

Discoveries

Page 28: Using the Open Science Data Cloud for Data Science Research

2013 Open Science Data Cloud (IaaS)

5 PB 2013 (OpenStack & GlusterFS)

Infrastructure automation & management (Yates)

Compliance & security (OpenFISMA)

Accounting & billing (Salesforce.com)

Customer Facing Portal (Tukey)

Data center network

~10-100 Gbps bandwidth

5 engineers to operate 0.5 MW Science Cloud

Science Cloud SW & Services

• Virtual Machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware v0.3)
• Yates (infrastructure automation and management v0.1)
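As a rough aside, the staffing figures quoted on the two data center slides (25 operators for a 15 MW commercial cloud, 5 engineers for the 0.5 MW science cloud) can be put on a common per-megawatt basis. The helper name below is hypothetical, and the sketch only restates the arithmetic; it does not account for differences in what the two teams actually do.

```python
def staff_per_mw(staff: int, megawatts: float) -> float:
    """Number of operations staff per megawatt of data center power."""
    return staff / megawatts


commercial = staff_per_mw(25, 15.0)   # 25 operators for 15 MW
science = staff_per_mw(5, 0.5)        # 5 engineers for 0.5 MW

print(f"Commercial cloud: {commercial:.2f} staff per MW")
print(f"Science cloud:    {science:.2f} staff per MW")
```

The smaller facility needs more staff per megawatt, which is one way to read the emphasis these slides place on automation tools such as Yates.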

Page 29: Using the Open Science Data Cloud for Data Science Research

Tukey

• Tukey (based in part on Horizon).
• We have factored out digital ID service, file sharing, and transport from Bionimbus and Matsu.

Page 30: Using the Open Science Data Cloud for Data Science Research

Yates

• Automates installation of the OSDC software stack on a rack of computers.
• Based upon Chef
• Version 0.1

Page 31: Using the Open Science Data Cloud for Data Science Research

UDR

• UDT is a high performance network transport protocol
• UDR = rsync + UDT
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC
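Since UDR wraps rsync, a synchronization job is usually expressed as an ordinary rsync command prefixed with `udr`. The sketch below only builds and prints such a command; it does not run it. The `udr rsync ...` form is an assumption based on the public UDR README, and the host name and paths are hypothetical.

```python
import shlex
from typing import List


def build_udr_command(source: str, destination: str) -> List[str]:
    """Build an rsync-over-UDT command line (assumed form: udr rsync <rsync args>)."""
    # -a preserves file attributes, -v is verbose, --stats prints a transfer summary;
    # these are standard rsync options.
    return ["udr", "rsync", "-av", "--stats", source, destination]


# Hypothetical source directory and remote host, for illustration only.
cmd = build_udr_command("/data/project/", "remote.example.org:/data/project/")
print(shlex.join(cmd))
```

To actually run the transfer, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where UDR is installed on both ends.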

Page 32: Using the Open Science Data Cloud for Data Science Research

Open Science Data Cloud Services

• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.

Page 33: Using the Open Science Data Cloud for Data Science Research

www.opencloudconsortium.org

• U.S. based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.

 

Page 34: Using the Open Science Data Cloud for Data Science Research

OCC Members & Partners

• Companies: Cisco, Yahoo!, Intel, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA
• International Partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …
• Partners: National Lambda Rail

Page 35: Using the Open Science Data Cloud for Data Science Research

Third party open source software
+ Open source software developed by the OCC (Tukey, Yates) and open standards
+ Data center
+ Data with permissions
+ Authorization of users' access to data
+ Policies, procedures, controls, etc.
+ Governance, legal agreements
+ Sustainability model

Page 36: Using the Open Science Data Cloud for Data Science Research

Part 5: OSDC Data

Page 37: Using the Open Science Data Cloud for Data Science Research

Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms

Discoveries

Page 38: Using the Open Science Data Cloud for Data Science Research
Page 39: Using the Open Science Data Cloud for Data Science Research

OSDC Public Data Sets

• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities

Page 40: Using the Open Science Data Cloud for Data Science Research

Part 6: OSDC Working Groups

Just look around you

Page 41: Using the Open Science Data Cloud for Data Science Research

Matsu Working Group: Clouds to Support Earth Science

matsu.opensciencedatacloud.org

Page 42: Using the Open Science Data Cloud for Data Science Research

Matsu Architecture

Hadoop HDFS

Matsu Web Map Tile Service (WMTS)

Matsu MR-based Tiling Service

NoSQL Database

Images at different zoom layers suitable for OGC Web Mapping Server

Level 0, Level 1 and Level 2 images

MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels

Analytic Services
• NoSQL-based Analytic Services
• Streaming Analytic Services
• MR-based Analytic Services

Storage for WMS tiles and derived data products

Presentation Services

Web Coverage Processing Service (WCPS)

Workflow Services

Page 43: Using the Open Science Data Cloud for Data Science Research

Hadoop-Based Re-Analysis

Zoom Level 1: 4 images
Zoom Level 2: 16 images
Zoom Level 3: 64 images
Zoom Level 4: 256 images
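The image counts above follow a simple pattern: each zoom level splits every tile into four, so level n has 4^n tiles. A minimal sketch, assuming a square 2^n by 2^n tile grid per level (the function names and the (level, x, y) key layout are hypothetical, not Matsu's actual scheme):

```python
from typing import List, Tuple


def tiles_at_zoom(level: int) -> int:
    """Number of tiles at a zoom level when each level quadruples the tile count."""
    return 4 ** level


def tile_keys(level: int) -> List[Tuple[int, int, int]]:
    """Enumerate (level, x, y) keys for a 2^level by 2^level tile grid."""
    side = 2 ** level
    return [(level, x, y) for y in range(side) for x in range(side)]


counts = [tiles_at_zoom(level) for level in range(1, 5)]
print(counts)
```

Keys of this kind are a natural way to partition tiling work across MapReduce tasks, since each tile can be produced independently of the others.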

Page 44: Using the Open Science Data Cloud for Data Science Research

Bionimbus Working Group

bionimbus.opensciencedatacloud.org (biological data)

Page 45: Using the Open Science Data Cloud for Data Science Research

Bionimbus Protected Data Cloud

Page 46: Using the Open Science Data Cloud for Data Science Research

Analyzing Data From The Cancer Genome Atlas (TCGA)

Current Practice

1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.

With Protected Data Cloud (PDC)

1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.

Page 47: Using the Open Science Data Cloud for Data Science Research

One Million Genomes

• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
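The estimates above can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 PB = 1,000 TB, 1 EB = 1,000 PB); the 10x compression ratio is not stated on the slide but is implied by going from 1000 PB to 100 PB.

```python
GENOMES = 1_000_000
TB_PER_PATIENT = 1          # tumor plus normal tissue, per the slide
COST_PER_GENOME_USD = 1_000
COMPRESSION_RATIO = 10      # assumption implied by 1000 PB -> 100 PB

TB_PER_PB = 1_000
PB_PER_EB = 1_000

raw_pb = GENOMES * TB_PER_PATIENT / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD

print(f"Raw data:        {raw_pb:.0f} PB ({raw_eb:.0f} EB)")
print(f"Compressed data: {compressed_pb:.0f} PB")
print(f"Sequencing cost: ${sequencing_cost_usd / 1e9:.0f}B")
```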

Page 48: Using the Open Science Data Cloud for Data Science Research

Big data driven discovery on 1,000,000 genomes and 1 EB of data.

Genomic-driven diagnosis

Improved understanding of genomic science

Genomic-driven drug development

Precision diagnosis and treatment. Preventive health care.

Page 49: Using the Open Science Data Cloud for Data Science Research

Biomedical Commons Cloud (BCC) Working Group

Cloud for Public Data

Cloud for Controlled Genomic Data

Cloud for EMR, PHI, data

Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)

Medical Research Center A

Medical Research Center B

Medical Research Center C

Hospital D

Page 50: Using the Open Science Data Cloud for Data Science Research

Resource | Who uses | Who operates
Open Science Data Cloud (OSDC) | Pan science data for researchers | Open Cloud Consortium (OCC) supported by University OCC members
Biomedical Commons Clouds (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group supported by OCC University members
Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago supported by the OCC

Page 51: Using the Open Science Data Cloud for Data Science Research

OpenFlow-Enabled Hadoop WG

• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers and can significantly slow down a MapReduce computation.
• Stragglers are common (dirty secret about Hadoop)
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
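Why a single straggler matters can be seen with a toy model. The sketch assumes that all tasks in a phase run in parallel, one per slot, so the phase finishes only when its slowest task does; the task durations are made up for illustration and are not measurements from the working group.

```python
from typing import Sequence


def phase_time(task_seconds: Sequence[float]) -> float:
    """A phase with fully parallel tasks finishes when its slowest task finishes."""
    return max(task_seconds)


# 100 tasks that each take 60 s, versus the same phase with one 300 s straggler.
uniform_tasks = [60.0] * 100
with_straggler = [60.0] * 99 + [300.0]

print(f"No stragglers: {phase_time(uniform_tasks):.0f} s")
print(f"One straggler: {phase_time(with_straggler):.0f} s")
```

In this model one slow task out of a hundred makes the whole phase five times longer, which is why giving stragglers extra bandwidth can pay off if the slow task is network-bound.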

Page 52: Using the Open Science Data Cloud for Data Science Research

OSDC PIRE Project

We select OSDC PIRE Fellows (US citizens or permanent residents):

• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.

Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)

Page 53: Using the Open Science Data Cloud for Data Science Research

Part 7: Key Questions for This Workshop

Page 54: Using the Open Science Data Cloud for Data Science Research

• Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?

• Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?

• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?

• Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?

• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?

• Question 6. What are 3-5 grand challenge questions that leverage the OSDC?

Page 55: Using the Open Science Data Cloud for Data Science Research

Questions

Page 56: Using the Open Science Data Cloud for Data Science Research

Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called the Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.