DataLab-SAD-1.00
System Architecture Document for the NOAO Data Lab Project
Revised: March 4, 2015


     

     


   

Revision History

Date          Author          Changes / Comments                        Version
Sep 10, 2014  M. Fitzpatrick  First Draft                               0.1
Dec 01, 2014  M. Fitzpatrick  Restructured draft                        0.2
Dec 04, 2014  M. Fitzpatrick  Another restructure                       0.3
Dec 22, 2014  M. Fitzpatrick  More text                                 0.4
Jan 12, 2015  M. Fitzpatrick  Fleshing out contents                     0.5
Jan 15, 2015  M. Fitzpatrick  Incorporated comments, arch description   0.6
Jan 21, 2015  M. Fitzpatrick  First complete draft                      0.7
Jan 26, 2015  M. Fitzpatrick  Typos, 4-level TOC                        0.71
Jan 27, 2015  M. Fitzpatrick  Included Ridgway comments, new logo       0.72
Jan 27, 2015  M. Fitzpatrick  Included Sec. 7 tracking from Mighell     0.80
Mar 04, 2015  K. Mighell      Final edits                               1.00

 


Table of Contents

1 Document Overview
  1.1 Purpose
  1.2 Document Scope
  1.3 Referenced Documents
  1.4 Key Concepts
    1.4.1 Large Catalogs
    1.4.2 Data Publication and Data Services
    1.4.3 Virtual Storage
    1.4.4 Compute Services
    1.4.5 Task Containers
    1.4.6 FUSE Filesystems
    1.4.7 Visualization
    1.4.8 Distributable Data Lab Components
  1.5 Abbreviations and Acronyms
  1.6 System Context for the NOAO Data Lab

2 Software Architecture
  2.1 Infrastructure Architecture
    2.1.1 Presentation Layer
      2.1.1.1 Astronomer’s Desktop Tools
      2.1.1.2 Data Lab Operations Tools
    2.1.2 Public Services Layer
    2.1.3 Private Services Layer
    2.1.4 Data Access Services Layer
      2.1.4.1 TAP
      2.1.4.2 SIA/SCS/SSA
      2.1.4.3 VOSpace
      2.1.4.4 SQL Service
    2.1.5 Resource Layer
      2.1.5.1 External Resources
  2.2 Component Descriptions
    2.2.1 Authentication (Services Layer)
    2.2.2 Query Manager (Services Layer)
    2.2.3 Job Manager (Services Layer)
    2.2.4 Virtual Storage Manager (Services Layer)
    2.2.5 Resource Resolver Interface (Services Layer)
    2.2.6 Public Repository (Services Layer)
    2.2.7 Private Repository (Services Layer)
    2.2.8 Operations Monitor (Services Layer)
    2.2.9 Data Access Services (Data Access Layer)
    2.2.10 SQL Service (Data Access Layer)

3 Software Deployment
  3.1 Client Software
  3.2 Content Servers
    3.2.1 Large Catalogs
    3.2.2 NSA Proxy/SIA Service
    3.2.3 Survey/PI Data Access Services
  3.3 Storage Servers
  3.4 Compute Servers
  3.5 MyDB Server
  3.6 Data Lab Services Server

4 Distributable Data Lab Components
  4.1 Software Packaging and Distribution
  4.2 Virtual Storage
  4.3 Data Publication
  4.4 Processing Tools and Services
  4.5 An Example

5 System Interfaces
  5.1 Security
  5.2 Command-line Tools
  5.3 Web Portals
  5.4 Legacy Applications
  5.5 Data Query
  5.6 Processing Task Control
    5.6.1 Task Containers
    5.6.2 Job Control
  5.7 Virtual Storage

6 Implementation Tools and Standards
  6.1 Implementation Languages
    6.1.1 Language Versions
  6.2 Development Platforms
  6.3 Software Development Standards
    6.3.1 Software Licensing
    6.3.2 Public Repository
    6.3.3 Private Repository
    6.3.4 Testing Framework
    6.3.5 Bug and Issue Tracking
  6.4 Web Interfaces
  6.5 Database Technologies
  6.6 Machine Virtualization

7 Requirements Tracking
  7.1 Core Data Lab Capabilities
  7.2 User-Provided Science Capabilities

Appendix I: Vocabulary / Acronyms Used
Appendix II: List of Figures

 


1 Document Overview

1.1 Purpose

This System Architecture Document for the NOAO Data Lab is intended to:

1. Provide a high-level conceptual design of a Data Lab system that satisfies all Operational and Science requirements.
2. Describe and define the components of the Data Lab, their implementation and interfaces.
3. Describe the interaction between components to show how the functional requirements of the system are satisfied.

1.2 Document Scope

The scope of this document is the entire Data Lab Project. This document will evolve over time as requirements and designs are finalized.

1.3 Referenced Documents

This document may reference additional documentation identified below.

[1] Science Use Cases (SUC)
[2] Science Requirements Document (SRD)
[3] Operational Concepts Document (OCD)
[4] Operational Requirements Document (ORD)
[5] Project Execution Plan (PEP)

   

1.4 Key Concepts

Throughout this document we may use several phrases or terms that refer to specific Data Lab components or activities. These are briefly explained here for context; a more detailed explanation is provided in the documents referenced in Section 1.3 above and in the descriptions given below.

1.4.1 Large Catalogs

The term Large Catalogs is used for a specific dataset requiring dedicated hardware to manage distributed query processing. Examples include the Dark Energy Survey (DES) Catalog, but the term generally refers to any database larger than can typically fit on a modern desktop machine.

1.4.2 Data Publication and Data Services

The term Data Publication may be used to refer to datasets hosted in the Data Lab and served publicly through standard Virtual Observatory (VO) interfaces. These typically represent high-level data products (images, catalogs, spectra, time series, etc.) created by a Survey Team or individual PI.

The term Data Service may be used to refer to any web service providing an interface to query and access a data collection. This may include Large Catalogs or private databases that use custom interfaces.


1.4.3 Virtual Storage

The term Virtual Storage is used to refer to the Data Lab services managing distributed storage of data through web interfaces. It is similar to Cloud Storage¹ but in the Data Lab is more closely associated with a service running at a particular location.

1.4.4 Compute Services

The term Compute Services refers to data processing elements of the Data Lab. They may be implemented as web services available through a RESTful interface (i.e., an HTTP service) or as a specific computational task executed as part of a larger workflow. Within the Data Lab these services are used to perform general transformation of data files (e.g., an image cutout service) or some specific analysis (e.g., to detect variability in a time series).
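As an illustration of what calling a RESTful Compute Service might look like from client code, the sketch below builds the request URL for an image cutout. It is a minimal sketch only: the endpoint URL, the function name `cutout_url`, and the parameter names (`image`, `ra`, `dec`, `size`) are assumptions made for this example and are not defined anywhere in this document.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; this document does not define a URL for any Compute Service.
CUTOUT_ENDPOINT = "https://datalab.example.org/compute/cutout"


def cutout_url(image_id: str, ra_deg: float, dec_deg: float, size_deg: float) -> str:
    """Build the GET URL for a cutout centred on (ra_deg, dec_deg), in degrees."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("ra_deg must be in [0, 360)")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("dec_deg must be in [-90, 90]")
    if size_deg <= 0.0:
        raise ValueError("size_deg must be positive")
    # Parameter names are illustrative, not part of any Data Lab interface.
    params = {"image": image_id, "ra": ra_deg, "dec": dec_deg, "size": size_deg}
    return CUTOUT_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(cutout_url("img001", 150.25, -2.5, 0.1))
```

Validating the coordinates on the client side keeps obviously malformed requests from reaching the service; a real service would still need to validate its own inputs.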

1.4.5 Task Containers

A task container, specifically a Linux Container, is a virtualization method for running applications in isolated Linux systems (i.e., the containers) on a single host machine. Unlike Virtual Machines (VMs) that virtualize an entire machine, containers are generally much lighter-weight and share elements of the host machine (e.g., binaries and libraries), allowing them to be started almost instantly. Specific task dependencies (e.g., language versions) can be bundled with the container, allowing for a more heterogeneous computing environment. Within the Data Lab, containers are used to package Compute Services and for distributed software.
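To make the idea of bundling a task with its dependencies more concrete, the sketch below describes a task container as a small data structure and renders the command line that would start it. It assumes Docker as the container runtime, which this document does not specify; the class name `ContainerTask`, the image name, and the paths are hypothetical. The code only builds the argument list and does not start a container.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContainerTask:
    """Description of one containerized task (illustrative, not a Data Lab API)."""

    image: str                 # hypothetical image bundling the task's dependencies
    command: List[str]         # program and arguments to run inside the container
    volumes: Dict[str, str] = field(default_factory=dict)  # host path -> container path

    def docker_argv(self) -> List[str]:
        """Return the `docker run` argument list for this task."""
        argv = ["docker", "run", "--rm"]  # --rm removes the container when it exits
        for host_path, container_path in sorted(self.volumes.items()):
            argv += ["-v", f"{host_path}:{container_path}"]
        return argv + [self.image] + self.command


if __name__ == "__main__":
    task = ContainerTask(
        image="example/photometry:1.0",
        command=["python", "phot.py", "/data/image.fits"],
        volumes={"/home/user/data": "/data"},
    )
    print(" ".join(task.docker_argv()))
```

Because the language version and libraries live in the image, two tasks with conflicting dependencies can run side by side on the same host.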

1.4.6 FUSE Filesystems

A FUSE (Filesystem in Userspace) filesystem is an operating system mechanism that allows a non-privileged user to mount a data source as a standard Unix filesystem. Within the Data Lab, this is used to mount a user’s virtual storage to provide transparent access to their data without requiring applications to use the web-service protocol that implements the storage.
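The practical consequence of a FUSE mount is that ordinary file operations work unchanged. The sketch below lists FITS files under a mount point using only standard filesystem calls, with no knowledge of any storage protocol. The function name `list_fits_files` is hypothetical, and a temporary directory stands in for a mounted virtual storage area.

```python
import os
import tempfile


def list_fits_files(mount_point: str) -> list:
    """Return sorted paths (relative to mount_point) of all .fits files found."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            if name.lower().endswith(".fits"):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, mount_point))
    return sorted(found)


if __name__ == "__main__":
    # A temporary directory stands in for the user's mounted virtual storage.
    with tempfile.TemporaryDirectory() as mount_point:
        os.makedirs(os.path.join(mount_point, "survey"))
        for rel in ("a.fits", os.path.join("survey", "b.fits"), "notes.txt"):
            with open(os.path.join(mount_point, rel), "w") as handle:
                handle.write("placeholder")
        print(list_fits_files(mount_point))
```

The same function would work against a local directory or a mounted storage area, which is the transparency the mount is meant to provide.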

1.4.7 Visualization

Visualization is used to refer to the plotting or image display capabilities in the Data Lab and on the astronomer’s desktop. Examples may include purpose-built web tools or the use of more general plotting or display tools that interact with Data Lab components. Within the Data Lab architecture, visualization is an application or a Compute Service and not specifically an architectural component.

1.4.8 Distributable Data Lab Components

Distributable components are those elements of the Data Lab that can be downloaded and installed on a user’s machine. These can be services such as Virtual Storage that operate on local data or tasks developed in the Data Lab that execute as command-line tools in the user’s environment. Distributable components are described in Section 4 below.

1.5 Abbreviations and Acronyms

A complete list of acronyms and abbreviations used in this document is given in Appendix I.

¹ http://en.wikipedia.org/wiki/Cloud_storage


1.6 System Context for the NOAO Data Lab

As a Project, the Data Lab is developed within the Science Data Management (SDM) group at NOAO (soon to be the NOAO System Science and Data Center, NSSDC). The NOAO Science Archive (NSA) will continue to ingest and archive raw image data from KPNO and CTIO as well as the pipeline-reduced data (i.e., from the NOAO High-Performance Pipeline System, NHPPS) for the Mosaic, NEWFIRM, and DECam instruments. Data Lab will not replace the NSA; rather, it will work alongside NSA or act as a client when pixel data are required.

Figure 1.3: Context Diagram for the NOAO Data Lab.

Within NOAO, the Survey Programs and individual Principal Investigator (PI) programs will continue to have their raw data archived by the NSA, but may also choose to import that data into the Data Lab for further analysis, as a means to share intermediate results, or to offer collaborative access to the data using the Virtual Storage services provided. Additionally, users may wish to publish their final data products (e.g., catalogs, image stacks, etc.) using the Data Publication services.

Within the wider astronomical community, Data Lab will be a consumer of data and services from other data centers, using both standard Virtual Observatory (VO) and proprietary protocols. Additionally, community PI users may request a Data Lab account in support of their science program, importing data either for analysis or for use in conjunction with services provided by the Data Lab (e.g., catalog cross-matching, target lists for pixel data, etc.). Lastly, the distributable parts of the Data Lab running on a user’s local hardware can interact with similar components within the main Data Lab itself, for example, to sync data to virtual storage or to do local software development before uploading a workflow for a long-running job.


2 Software Architecture

2.1 Infrastructure Architecture

[Figure 2.1 (diagram not reproduced). Labels recoverable from the diagram, by layer:
  Presentation Layer: Astronomer’s Desktop (Web Page, Cmdline Tools, User Code, Legacy Apps); Data Lab Ops (User Mgmt, Monitoring)
  Services Layer: Public Services (Authentication, Job Manager, Query Manager, Storage Mgr, Resource Resolver); Private Services (Public Repo, Private Repo, Ops Monitor)
  Data Access Layer: Data Access Services (SIA, SSA, SCS, TAP, VOSpace, SQL Service); UWS labels appear alongside TAP, VOSpace, and SQL Service
  Resources Layer: Storage Resource (UserSpace, VirtualSpace); Compute Resource (Compute Jobs); External Resources (NSA, VO Svcs, VO Data); Databases (MyDB, Large Cats, Ops DBs, Data Pub), with a UWS label alongside MyDB and Large Cats]

Figure 2.1: Data Lab software architecture diagram. Elements of the diagram are described in more detail below.

2.1.1 Presentation Layer

The components of the Data Lab’s Presentation Layer are shown in blue in the top level of Figure 2.1. This layer consists of:

• Web-page interfaces to specific services, including:
    o A Data Lab login portal allowing users to authenticate themselves to the system and access resources assigned to them. This includes a control page to manage the user’s information (password, contact addresses, etc.) once logged in.
    o Resource-specific web pages, including:
        - Web-based virtual storage browsers.
        - Dataset-specific query interfaces, e.g., a custom interface for the DES catalog and query pages and descriptions of published datasets.
        - Data publication tools.
    o Compute-process status and monitoring pages.
• Command-line applications, including:
    o Desktop tools run on the Astronomer’s Desktop that access Data Lab services remotely.
    o Science workflows created within the Data Lab. These include tools and scripted applications executed within the user’s Data Lab login shell.
• Legacy software that may use Data Lab services either through existing standard interfaces (e.g., HTTP requests) or inclusion of Data Lab client code. These may be individual tasks or development environments/languages.
• User-developed code as described in Sec. 2.1.1.1 below.

2.1.1.1 Astronomer’s Desktop Tools

Astronomer’s Desktop Tools are defined to be the web interfaces, software distributions or command-line tools described in Sec. 2.1.1 that are executed by users of the Data Lab. These tools may also be run within the Data Lab system (e.g., from the login-shell portal) and include any application or interface that uses Data Lab services but was developed by an individual astronomer or science collaboration.

 

2.1.1.2 Data Lab Operations Tools

Operations Tools are defined to be the web interface and command-line tools used by Data Lab Operators to manage and monitor the system. These tools will generally not be available to normal users and include:

• Utilities to manage user accounts
• System backup and restore commands
• Tools to monitor and control Data Lab components
• Utilities used in Data Publishing
• Tools for system logging and reporting

Tools with administrative functions that may be useful to users when managing Data Lab components running on their local machine are considered User Tools and may have different capabilities.

2.1.2 Public Services Layer

The components of the Data Lab’s Public Services Layer are shown in cyan in the 2nd level of Figure 2.1. This layer consists of:

• Authentication services. This is the primary interface for clients to identify themselves to the Data Lab. See Sec. 2.2.1.
• Authorization services. This is the primary interface for clients to obtain permission to access resources in the Data Lab. These services will be provided by the Authentication Service. See Sec. 2.2.1.
• The Job Manager. This is the primary interface for clients to submit processing jobs to the Data Lab. See Sec. 2.2.3.
• The Query Manager. This is the primary interface for clients to submit queries to the Data Lab data resources. See Sec. 2.2.2.
• The Resource Resolver. This is the primary interface for clients to resolve URIs to service endpoints in the Data Lab. It may be replaced by a VO Publishing Registry in the future and serves as a Registry proxy in the interim. See Sec. 2.2.5.

Role: This layer exposes a client-facing API used by the Presentation Layer to access Data Lab components. It is built on internal services to provide functionality that may be accessed via alternate APIs from lower-level services or interfaces; those internal interfaces are described below. Other services (e.g., data access or virtual storage) may also be public in terms of being available to clients, but these are exposed as standard interfaces outside the context of the Data Lab (e.g., a VO Simple Image Access service).
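The Resource Resolver’s role of mapping a URI to a service endpoint can be pictured with a toy in-memory lookup. This is a sketch under stated assumptions only: the class name `ResourceResolver`, its methods, and the example URI scheme and endpoint are all hypothetical, and nothing here reflects the interface described in Sec. 2.2.5.

```python
from typing import Dict


class ResourceResolver:
    """Toy in-memory resolver mapping resource URIs to service endpoint URLs."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, str] = {}

    def register(self, uri: str, endpoint: str) -> None:
        """Associate a resource URI with the endpoint that serves it."""
        self._endpoints[uri] = endpoint

    def resolve(self, uri: str) -> str:
        """Return the endpoint for uri, or raise LookupError if it is unknown."""
        try:
            return self._endpoints[uri]
        except KeyError:
            raise LookupError(f"no endpoint registered for {uri!r}") from None


if __name__ == "__main__":
    resolver = ResourceResolver()
    # Both the URI and the endpoint below are invented for illustration.
    resolver.register("datalab://catalogs/des", "https://datalab.example.org/tap/des")
    print(resolver.resolve("datalab://catalogs/des"))
```

Keeping clients dependent on URIs rather than endpoint URLs is what would allow the resolver to be swapped for a registry later without changing client code.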

2.1.3 Private  Services  Layer    The  components  of   the  Data  Lab’s  Private  Services  Layer  are  shown  in  yellow  in  the  2nd   level  of  Figure  2.1.    This  layer  consists  of:    


• Private   repository   services.     This   is   the   internal   repository   used   for   Data   Lab   Operations   or   for  development   in-­‐progress   by   both  Users   and  Developers   and   is   distinct   from   the   public   repository  used  for  released  software  distributions.  See  Sec.  2.2.7.  

• Operations   monitoring   and   logging   services.     These   are   used   primarily   by   Data   Lab   Operators   to  monitor   system   health   (using   internal   interfaces),   to   log   system   activity   and   to   generate   usage  reports.  See  Sec.  2.2.8.  

 Role:   These  are   the  primary   interfaces   for  Data  Lab  Developers  and  Operators   to  work  within   the   system.    Users   may   be   allowed   use   of   or   access   to   some   private   repositories   but   are   not   guaranteed   access   to   all  administrative  services.    

2.1.4 Data Access Services Layer

The components of the Data Lab’s Data Access Services Layer are shown in magenta in the 3rd level of Figure 2.1. This layer consists of:

• Simple VO Data Access Layer interfaces. These services are distinguished by the property that they permit a query of a service based on a minimal set of parameters, e.g., a radius/box around some celestial position, a bandpass, or data type. These services are standard VO protocols implemented as a minimal interface to all data services.

• Advanced VO Data Access Layer (DAL) interfaces. These services provide for more complex queries of data, e.g., an SQL-like query of a database schema, or a custom interface to a specific Data Lab resource. These services are appropriate only for catalog datasets; however, there is no guarantee that all catalogs will implement this interface. Advanced services may also include protocols that permit processing of data before returning the result of a query.

• Virtual   Storage   interfaces.     These   services   provide   a   high-­‐level   interface   to   the   virtual   storage  suitable  for  clients  in  the  Presentation  Layer  (e.g.,  a  web-­‐based  storage  browser).    They  hide  many  of  the  details  of  the  underlying  protocol  and  present  an  abstract  interface  to  the  virtual  storage  system  (thus  allowing  use  of  both  local  and  remote  resources  transparently).  

• Custom   SQL   database   access   interfaces.     In   certain   instances   it   is   preferable   to   bypass   other   data  access  interfaces  in  order  to  talk  to  the  database  directly  (e.g.,  from  legacy  clients  or  low-­‐level  utility  code).    These  interfaces  will  allow  authorized  clients  to  access  the  database  on  a  read-­‐only  basis.  

 

2.1.4.1 TAP

TAP (Table Access Protocol)² is a web-service protocol from the Virtual Observatory that provides access to collections of tabular data. Large or complex catalogs are typically stored in a relational database; TAP services allow clients to query any of the columns in any of the database tables, perform joins with user-supplied tables, and submit queries using a SQL variant (ADQL, the Astronomical Data Query Language³) with extension functions specific to astronomy. TAP services require no special authentication/authorization.

Primary functions of TAP services are:

• To respond to data queries of complex tabular data collections,

• To respond to metadata queries to allow clients to determine the names of tables and columns to be used in queries,

2 Dowler, P., Rixon, G., Tody, D., “Table Access Protocol, Version 1.0”, http://ivoa.net/documents/TAP/, IVOA Recommendation, 27 March 2010.

3 Ortiz, I., et al., “IVOA Astronomical Data Query Language, Version 2.0”, http://ivoa.net/documents/ADQL/, IVOA Recommendation, 30 October 2008.


• To  respond  to  standard  interface  queries  used  to  supply  metadata  about  service  availability  (e.g.,  for  operational  monitoring  services),  

• To provide synchronous and asynchronous execution of queries.

Role: Within the Data Lab architecture TAP services provide a VO standard interface to expose data to legacy client applications through VO-compliant protocols.
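As an illustration of the data-query function described above, the following minimal sketch builds a TAP 1.0 synchronous request carrying an ADQL cone constraint. The service base URL, the table name and the column names are hypothetical and are not defined by this document; only the request parameters (REQUEST, LANG, FORMAT, QUERY) come from the TAP 1.0 standard.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical service base URL; this document does not name an endpoint.
TAP_BASE = "https://datalab.example.org/tap"


def build_sync_query_url(base_url: str, adql: str, fmt: str = "votable") -> str:
    """Return a TAP 1.0 synchronous-query URL for the given ADQL string."""
    params = {
        "REQUEST": "doQuery",  # standard TAP 1.0 request type
        "LANG": "ADQL",        # query language
        "FORMAT": fmt,         # requested output serialization
        "QUERY": adql,
    }
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


# Hypothetical table and column names, for illustration only.
ADQL = (
    "SELECT TOP 10 ra_deg, dec_deg, mag_g "
    "FROM survey.objects "
    "WHERE CONTAINS(POINT('ICRS', ra_deg, dec_deg), "
    "CIRCLE('ICRS', 150.0, 2.2, 0.05)) = 1"
)

url = build_sync_query_url(TAP_BASE, ADQL)
```

An asynchronous request would be sent to the service's async endpoint instead and then polled as a UWS job; that path is not shown here.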

2.1.4.2 SIA/SCS/SSA

The Virtual Observatory simple protocols support parameterized queries of data collections of a specific type, e.g., images (SIA, the Simple Image Access protocol), catalogs (SCS, the Simple Cone Search protocol), or spectra (SSA, the Simple Spectral Access protocol), amongst others. These interfaces are ideal for web-form clients or single-object queries in a synchronous execution environment. The Simple Cone Search (SCS) used for catalog queries returns results directly; the image and spectral forms permit a query and then return a result table with enough information to allow a client to decide which data to actually download in a second step. A celestial position is usually the key search parameter; however, results may additionally be constrained by other metadata such as the bandpass, time of observation, resolution element, etc.

These services require no special authentication/authorization. These services may optionally be layered upon an underlying TAP service.

Role: Within the Data Lab architecture the Simple services provide a VO standard interface to expose data to legacy client applications through VO-compliant protocols.
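The positional query used by the Simple Cone Search can be sketched as a single HTTP GET whose RA, DEC and SR parameters are given in decimal degrees, as the SCS standard specifies. The endpoint below is hypothetical and the range checks are an illustrative choice, not a Data Lab requirement.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a Simple Cone Search request URL (all angles in decimal degrees)."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must lie in [0, 360] degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must lie in [-90, 90] degrees")
    if not 0.0 <= radius_deg <= 180.0:
        raise ValueError("SR must lie in [0, 180] degrees")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to an existing query string if the base URL already has one.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


# Hypothetical endpoint, for illustration only.
url = cone_search_url("https://datalab.example.org/scs", 150.0, 2.2, 0.05)
```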

2.1.4.3 VOSpace

The VOSpace protocol will be used to implement the Data Lab virtual storage system. Clients will be able to access their storage using VOSpace protocols by communicating directly with the service. Transfers into/out of the space may be synchronous or asynchronous as allowed by the protocol; however, asynchronous jobs will be managed by the VOSpace service itself and not the Data Lab Job Manager.

This service uses the Authorization service to verify that the requesting client has permission to use the resource. The Storage Manager and Job Manager both use this service.

Role: Within the Data Lab architecture the VOSpace service provides a standard interface to the user’s virtual storage space for legacy VO applications. Exposing the service implementation at this level additionally allows it to be packaged and exported for use outside the Data Lab.

2.1.4.4 SQL Service

The SQL service provides an abstract database interface that allows clients low-level access to query a database or process the results. Clients are presented with a uniform interface regardless of the backend database used; however, the abstraction supports only the common intersection of capabilities available in the databases used within the Data Lab. Direct access to the database with this service is useful in the following scenarios:

• A client application wishes to step through a query result row-by-row,

• The entire result set must be copied to another database in the most efficient way possible,

• The results are to be serialized into a format other than what the VO services provide,

• A query is more complex than can be handled by a VO data service.

 This  service  uses  the  Authorization  service  to  verify  the  requesting  client  has  permission  to  use  the  resource.      


 Role:    Within  the  Data  Lab  architecture  the  SQL  service  provides  direct  (but  authorized)  access  to  databases  for  use  by  client  code  that  needs  a  low-­‐level  database  interface.    In  some  cases,  this  interface  will  be  used  to  optimize  some  higher-­‐level  Data  Lab  functionality  (e.g.,  saving  query  results  to  a  user’s  personal  database).    

2.1.5 Resource Layer

The components of the Data Lab’s Resource Layer are shown in grey in the lower part of Figure 2.1. This layer consists of:

• Databases used for published data collections and operational purposes. Public databases are those used to support Data Services; Private databases are used for internal Data Lab operations (e.g., logging systems, job control, etc.).

• Storage   resources.     The   development   hardware   system   will   have   available   up   to   400TB   of   disk  storage  for  use  with  virtual  storage  of  user  files  and  database  storage  of  published  data.    We  expect  the  hardware  allocation  to  be  re-­‐provisioned  prior  to  full  production.  

• Compute  resources.    The  development  hardware  system  will  have  up  to  XYZ  cores  available  for  use  to   support   database,   compute,   visualization   and   storage   operations.   We   expect   the   hardware  allocation  to  be  re-­‐provisioned  prior  to  full  production.  

• External  data  and  compute  resources.    See  Sec  2.1.5.1.    

2.1.5.1 External  Resources    

External  Resources  are  defined  to  be  resources  that  may  be  accessed  in  a  workflow  or  from  a  Data  Lab  component,  but  are  not  maintained  or  managed  by  the  Data  Lab  directly.  

2.1.5.1.1 NOAO Science Archive

Data Lab will use the NOAO Science Archive (NSA) as the primary source of raw and pipeline-reduced DECam, MOSAIC and NEWFIRM image data. The NSA provides a Simple Image Access interface as well as other custom interfaces that may be used to query for images or to access a specific image. Clients of the NSA include core Data Lab components and tools used in user-defined analysis.

Role: Within the Data Lab architecture the NSA provides image data that may be used in science workflows that require access to source pixels. Full-sized images may be sent to Compute Services for additional processing (e.g., cutouts) before results are returned to the user.

2.1.5.1.2 External Data Services (VO Data)

External data services refer to all non-NOAO data sources that may be used within the Data Lab. These may be either VO data services or data available through custom interfaces requiring specialized clients (e.g., the Sesame name resolver). Clients for these services include core Data Lab components (e.g., compute or visualization services) and tools used in user-defined analysis.

Role: Within the Data Lab architecture external data services are used to supply additional data needed by science workflows or core Data Lab components.

2.1.5.1.3 External  Compute  Services  (VO  Services)    


External compute services refer to all non-NOAO compute services that may be used within the Data Lab. These may be either standard VO services (e.g., DataLink⁴) or services available through custom interfaces requiring specialized clients that perform a specific function (e.g., catalog cross-match). Clients for these services include core Data Lab components (e.g., compute or visualization services) and tools used in user-defined analysis.

Role: Within the Data Lab architecture external compute services are used to perform some action on data used in science workflows or by core Data Lab components.

2.2 Component Descriptions

This section provides a more detailed description of individual components listed in Section 2.1 above.

2.2.1 Authentication (Services Layer)

The Authentication Service implements the following functionality:

• Maintains a database of users and account information (login name, password, contact info, etc.).

• Provides an interface allowing users to retrieve or modify their account information.

• Provides an interface that allows an application to present a login/password and receive an authorization credential.

• Provides an interface allowing users to create named Groups of users.

• Provides an interface allowing the owner of a Group to add or delete users from the Group.

• Provides an interface allowing the owner of a Group to set or get access permissions to a specific resource on behalf of all members of a Group.

• Provides an interface allowing the owner of a Group to transfer ownership to a member of the Group.

• Provides an interface allowing clients to get a list of all members of a Group, or of all Groups to which the user belongs.

• Verifies that a user (identified by a credential) is authorized to access a specific resource.

• Verifies that a user (identified by a credential) is a member of a specified Group and thus has all the privileges of that Group.

• Provides an interface that allows privileged users (i.e. Data Lab Operators, identified with a secure credential) full access to all user records.

A detailed description of the Authentication Service design and requirements is to be provided in a future document.
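The Group-related functions listed above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class and method names are hypothetical, users are identified by plain strings rather than by credentials, and nothing here reflects the actual design, which is deferred to a future document.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    owner: str
    members: set = field(default_factory=set)
    permissions: dict = field(default_factory=dict)  # resource -> access mode


class GroupRegistry:
    """Hypothetical in-memory model of owner-managed Groups."""

    def __init__(self):
        self._groups = {}

    def create(self, owner, name):
        if name in self._groups:
            raise ValueError(f"group {name!r} already exists")
        self._groups[name] = Group(name, owner, {owner})
        return self._groups[name]

    def _owned(self, actor, name):
        # Only the current owner may manage a Group.
        group = self._groups[name]
        if group.owner != actor:
            raise PermissionError(f"{actor!r} does not own {name!r}")
        return group

    def add_member(self, actor, name, user):
        self._owned(actor, name).members.add(user)

    def remove_member(self, actor, name, user):
        group = self._owned(actor, name)
        if user == group.owner:
            raise ValueError("transfer ownership before removing the owner")
        group.members.discard(user)

    def set_permission(self, actor, name, resource, mode):
        self._owned(actor, name).permissions[resource] = mode

    def transfer(self, actor, name, new_owner):
        group = self._owned(actor, name)
        if new_owner not in group.members:
            raise ValueError("new owner must already be a member")
        group.owner = new_owner

    def groups_of(self, user):
        return sorted(g.name for g in self._groups.values() if user in g.members)

    def is_authorized(self, user, resource, mode):
        # A user inherits every permission granted to a Group they belong to.
        return any(
            user in g.members and g.permissions.get(resource) == mode
            for g in self._groups.values()
        )
```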

 

2.2.2 Query Manager (Services Layer)

The Query Manager implements the following functionality:

• Presents a uniform HTTP interface to functionality available in the Large Catalog data services:
  o Table Access Protocol (TAP)
  o SQL Service

• Composes the query into a form suitable for the requested service.

• Validates that the user is authorized to access the requested service.

                                                                                                                                       4  DataLink   is   a   VO   specification   for   connecting  metadata   discovered   about   a   dataset   to   the   data,  metadata  products,  or  web-­‐services  that  can  act  upon  the  data.    Examples  include  finding  links  to  preview  or  progenitor  datasets,  or  services  that  can  extract  cutouts,  re-­‐orient  /  re-­‐scale  images,  etc.    


• Submits the query to the requested service for synchronous or asynchronous execution.

• Maintains the state of the submitted asynchronous jobs, allowing clients to poll for status or results from completed queries.

• Serializes the query results into a user-specified format (e.g., HTML, CSV/TSV/ASV, FITS).

• Returns results to the calling application or orchestrates the process to save results to Virtual Storage or MyDB personal database.

The  Query  Manager  does  not  provide  an   interface   to   the  Simple  Cone  Search  (SCS)  services  because  of  their   mandatory   synchronous   execution.     The   Query   Manager   may   call   the   Job   Manager   to   process  asynchronous  query  jobs.    

A   detailed   description   of   the   Query   Manager   design   and   requirements   is   to   be   provided   in   a   future  document.    
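The serialization step listed among the Query Manager functions can be sketched for the two simplest delimited formats. The function name and signature are hypothetical, and only CSV and TSV are covered; the other formats named above (HTML, ASV, FITS) would need their own writers.

```python
import csv
import io

# Delimiters for the delimited-text formats handled by this sketch.
_DELIMITERS = {"csv": ",", "tsv": "\t"}


def serialize(columns, rows, fmt="csv"):
    """Serialize a query result (column names plus row tuples) to text."""
    try:
        delimiter = _DELIMITERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)   # header line
    writer.writerows(rows)     # one line per result row
    return buffer.getvalue()
```

Using the standard `csv` writer rather than joining strings by hand means that values containing the delimiter are quoted automatically.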

2.2.3 Job Manager (Services Layer)

The Job Manager implements the following functionality:

• Validates that the user is authorized to submit a job for execution.

• Provides an interface to queue jobs for execution.

• Provides an interface to determine the status of all queued and running jobs.

• Provides an interface to set or change the properties of a queued job (e.g., execution time).

• Provides an interface to change the state of a queued job (e.g., to begin execution immediately).

• Provides an interface to remove jobs from a queue, or to kill a running job.

• Provides an interface to execute jobs on remote servers in either a synchronous or asynchronous manner.

• Creates the compute job on the remote server.

• Sets the parameters for the remote compute job.

• Collects results from the remote compute job and presents them to the calling client.

• Starts and/or stops the compute job on the remote server.

• Maintains a history of submitted jobs and their status for the service-monitoring task.

The Job Manager provides a simplified job-control interface for clients and implements the Universal Worker Service (UWS)⁵ design pattern internally to manage individual jobs.

A   detailed   description   of   the   Job  Manager   design   and   requirements   is   to   be   provided   in   a   future  document.    
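The job life cycle implied by the UWS pattern can be sketched as a small state machine. The phase names below are a subset of the execution phases defined by the UWS standard; the `Job` class, its methods and the transition table are hypothetical simplifications, not the Job Manager design.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"
    ABORTED = "ABORTED"


# Simplified transition table; terminal phases allow no further change.
_ALLOWED = {
    Phase.PENDING: {Phase.QUEUED, Phase.ABORTED},
    Phase.QUEUED: {Phase.EXECUTING, Phase.ABORTED},
    Phase.EXECUTING: {Phase.COMPLETED, Phase.ERROR, Phase.ABORTED},
    Phase.COMPLETED: set(),
    Phase.ERROR: set(),
    Phase.ABORTED: set(),
}


class Job:
    def __init__(self, job_id, parameters=None):
        self.job_id = job_id
        self.parameters = dict(parameters or {})
        self.phase = Phase.PENDING
        self.history = [Phase.PENDING]  # kept for status reporting

    def set_parameter(self, key, value):
        # In this sketch, parameters may change only before the job is queued.
        if self.phase is not Phase.PENDING:
            raise RuntimeError("parameters are frozen once the job is queued")
        self.parameters[key] = value

    def advance(self, new_phase):
        if new_phase not in _ALLOWED[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.value} -> {new_phase.value}"
            )
        self.phase = new_phase
        self.history.append(new_phase)
```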

2.2.4 Virtual Storage Manager (Services Layer)

The Virtual Storage Manager implements the following functionality:

• Validates that the user is authorized to access the requested virtual storage space.

• Provides an interface to browse the contents of the space.

• Provides an interface to move data into the storage space from a user’s desktop.

• Provides an interface to move data into the storage space from an external URL.

• Provides an interface to directly access data stored in the space, e.g., for file download or transfer.

• Provides an interface to move data between separate instances of the space.

• Provides an interface to set or display VOSpace properties, capabilities and views.

                                                                                                                                         5  http://www.ivoa.net/documents/UWS  


A   detailed   description   of   the   Virtual   Storage  Manager   design   and   requirements   is   to   be   provided   in   a  future  document.    

2.2.5 Resource Resolver Interface (Services Layer)

The Resource Resolver implements the following functionality:

• Provides an interface to allow clients to retrieve information about a Data Lab service given a resource URI. Clients may request a single value or the entire record.

• Provides  an   interface   to  allow  services   (or  Operators)   to   register  new  resource   records  describing  the  service.  

• Provides   an   interface   to   allow   services   (or  Operators)   to   remove   their   resource   records  once   they  become  invalid  (i.e.  the  service  moves  or  shuts  down).  

• Provides an interface to allow clients to list all available services.

• Provides an interface to allow clients to search for services by keyword or service type.

Public data and compute services will be registered with the VO Registry to allow external clients to discover and access the service directly. The Resource Resolver is used primarily for services that are internal to the Data Lab (e.g., available compute servers or virtual storage instances) or which may be transient (e.g., VO data access services created as part of a VOSpace capability). This allows client software to access a service such as a VOSpace using a location-independent URI that the resolver will translate into a service URL endpoint for use by an application. External services running as distributed Data Lab components will register themselves when started to make the service known to other Data Lab tasks.

A  detailed  description  of  the  Resource  Resolver  design  and  requirements  is  to  be  provided  in  a  future  document.    
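The register, remove, resolve, list and search operations described in this section can be sketched with an in-memory table. The class name, the record fields and the `datalab://` URI scheme used in the usage note are hypothetical; the document does not define a URI format or a record schema.

```python
class ResourceResolver:
    """In-memory sketch mapping location-independent URIs to service records."""

    def __init__(self):
        self._records = {}

    def register(self, uri, endpoint, service_type, keywords=()):
        # Called by a service (or an Operator) when the service starts.
        self._records[uri] = {
            "uri": uri,
            "endpoint": endpoint,
            "type": service_type,
            "keywords": tuple(k.lower() for k in keywords),
        }

    def unregister(self, uri):
        # Called when the service moves or shuts down.
        self._records.pop(uri, None)

    def resolve(self, uri, key=None):
        # Return the whole record, or a single value when a key is given.
        record = self._records[uri]  # raises KeyError for unknown URIs
        return dict(record) if key is None else record[key]

    def list_services(self):
        return sorted(self._records)

    def search(self, keyword=None, service_type=None):
        hits = []
        for uri, record in sorted(self._records.items()):
            if service_type is not None and record["type"] != service_type:
                continue
            if keyword is not None and keyword.lower() not in record["keywords"]:
                continue
            hits.append(uri)
        return hits
```

A client would then call, for example, `resolver.resolve("datalab://vospace/main", "endpoint")` to obtain the current service URL without hard-coding a host name.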

2.2.6 Public Repository (Services Layer)

Data Lab will use GitHub (http://www.github.com) as the Public software and document repository. This repository will be used for all released software and documentation.

2.2.7 Private Repository (Services Layer)

Data Lab will use an internal instance of the GitLab (http://about.gitlab.com) Git repository management software as the Private repository. This repository will be used for software or documentation in development, as well as for operational data such as configuration files, deployment notes, etc.

2.2.8 Operations Monitor (Services Layer)

The Operations Monitor implements the following functionality:

• Provides an interface to add or remove services from monitoring.

• Provides an interface to display summary information from the Job Manager.

• Provides an interface to display the current status and availability of services being monitored.

• Regularly accesses each of the VO services under its control to determine if the service is responding correctly.

• Regularly accesses supporting software services (e.g., databases, web servers, etc.) to determine if the service is responding correctly.

• Sends alerts (in a TBD manner) to Data Lab Operators summarizing the list of any non-working services.


The VO data access services will provide VO Support Interface (VOSI) methods that the Operations Monitor will use to determine service availability; these services may be accessed directly by client applications as well.

A detailed description of the Operations Monitor design and requirements is to be provided in a future document.
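A VOSI availability response is a short XML document whose `available` element carries a boolean. The sketch below shows how a monitor might interpret such responses once they have been fetched; the sample document, the service names and the helper functions are hypothetical, and fetching over HTTP and alert delivery (still TBD above) are left out.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of a VOSI availability response, for illustration.
SAMPLE_UP = """
<vosi:availability xmlns:vosi="http://www.ivoa.net/xml/VOSIAvailability/v1.0">
  <vosi:available>true</vosi:available>
  <vosi:note>Service is accepting queries</vosi:note>
</vosi:availability>
"""
SAMPLE_DOWN = SAMPLE_UP.replace(">true<", ">false<")


def _local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def is_available(xml_text):
    """Return True if the availability document reports the service as up."""
    root = ET.fromstring(xml_text.strip())
    for child in root:
        if _local(child.tag) == "available":
            return (child.text or "").strip().lower() in ("true", "1")
    return False  # no <available> element: treat as not available


def failing_services(documents):
    """Given {service name: XML text or None}, list the non-working services."""
    down = []
    for name, text in sorted(documents.items()):
        try:
            ok = text is not None and is_available(text)
        except ET.ParseError:
            ok = False  # an unparsable response counts as a failure
        if not ok:
            down.append(name)
    return down
```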

2.2.9 Data Access Services (Data Access Layer)

The following Data Access services are provided within the architecture:

SCS (Simple Cone Search) – Provides a synchronous VO-standard positional-query interface to catalogs. Details are provided at http://www.ivoa.net/Documents/latest/ConeSearch.html

SIA (Simple Image Access) – Provides a synchronous VO-standard query interface to image data collections. Details are provided at http://www.ivoa.net/Documents/SIA

SSA (Simple Spectral Access) – Provides a synchronous VO-standard query interface to spectral data collections. Details are provided at http://www.ivoa.net/Documents/SSA

TAP (Table Access Protocol) – Provides a service protocol for general table data access using either synchronous or asynchronous access methods under a Universal Worker Service (UWS) interface. Details are provided at http://www.ivoa.net/Documents/TAP

VOSpace – Virtual storage services can optionally provide functionality to make data searchable using one or more of the above service types in addition to the direct access capabilities of the service itself.

Client applications may access these services directly; they may also be accessed by the Job Manager as part of a Compute Service.

Additional  information  about  individual  Data  Access  Services  is  given  in  Sec  2.1.4.    

2.2.10 SQL Service (Data Access Layer)

The SQL Service implements the following functionality:

• Validates that the user is authorized to access the requested resource (when required).

• Provides an abstract read-only database API allowing client applications to:
  o Submit SQL queries directly to the database,
  o Step through a query result row-by-row,
  o Copy a query result set to another database efficiently.

• Provides synchronous or asynchronous (UWS) job control for queries.

• Provides an interface to allow the upload of user-specified temporary tables to be used in a query (e.g., to perform a join operation using user data).

The  SQL  Service  provides  an  alternate  interface  to  databases  for  clients  that  do  not  wish  to,  or  cannot,  use  a  TAP  service.    The  SQL  Service  does  not  provide   support   for  ADQL   functions  used   in  queries  or  metadata  discovery   using   the   TAP_SCHEMA  mechanism   found   in   a   TAP   service.     The   VO   Support   Interfaces   (VOSI)  endpoints   will   be   implemented   as   a   means   to   inform   clients   about   service   availability   and   basic   column  information.  
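The read-only, row-by-row and bulk-copy behaviours described for the SQL Service can be illustrated with Python's built-in `sqlite3` module standing in for the backend database. SQLite is only a stand-in: the document does not name the databases involved, and the function names, table names and sample data below are hypothetical.

```python
import sqlite3


def make_read_only(conn):
    """Reject data changes on this connection (SQLite's query_only pragma)."""
    conn.execute("PRAGMA query_only = ON")
    return conn


def iter_rows(conn, sql, params=()):
    """Step through a query result one row at a time."""
    cursor = conn.execute(sql, params)
    try:
        for row in cursor:
            yield row
    finally:
        cursor.close()


def copy_result(src, sql, dest, dest_table, batch_size=1000):
    """Copy a query result set into a new table in another database."""
    cursor = src.execute(sql)
    columns = [d[0] for d in cursor.description]
    col_sql = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    dest.execute(f'CREATE TABLE "{dest_table}" ({col_sql})')
    total = 0
    while True:
        batch = cursor.fetchmany(batch_size)  # bounded memory per batch
        if not batch:
            break
        dest.executemany(f'INSERT INTO "{dest_table}" VALUES ({marks})', batch)
        total += len(batch)
    dest.commit()
    return total


# Sample source database, populated before it is switched to read-only.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE objects (id INTEGER, mag REAL)")
src.executemany("INSERT INTO objects VALUES (?, ?)",
                [(1, 18.2), (2, 21.7), (3, 19.9)])
src.commit()
make_read_only(src)
```

Copying in batches with `executemany` avoids holding the full result set in memory, which is the point of the "copied to another database efficiently" function.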


A detailed description of the SQL Service design and requirements is to be provided in a future document.

 


3 Software Deployment

The infrastructure for the Data Lab will be deployed to a mix of dedicated and shared hardware resources in the Tucson NOAO headquarters. The deployment described here reflects a conceptual layout more than a specific hardware configuration because of the fluid nature of the shared hardware and expected future upgrades to the system.

     

Figure  2.4:    The  Data  Lab  deployment  diagram.  Arrows  indicate  the    flow  of  requests  and  data  in  the  system.  

 

3.1 Client Software

Client software is any application that makes use of Data Lab services. Examples include:

• The  Astronomer’s  Desktop  –  This  may  include  web  pages  running   in  a  browser,  command-­‐line  tools  installed  as  part  of  a  Data  Lab  software  distribution,  or  legacy  code  used  for  analysis.  

• A   user-­‐supplied   analysis   script   –   This   includes   Data   Lab   command-­‐line   tools   running   in   the   user’s  login   shell   called   from   a   scripting   environment   (e.g.,   C-­‐shell/Bourne,   Python,   IRAF,   IDL,   etc.),   or  applications  developed  by  the  astronomer  using  programmatic  interfaces.  

• A  visualization  requesting  data  for  display  –  Core  Data  Lab  plotting  or  image  display  tools  may  act  as  a  client  for  data  services  based  on  user-­‐supplied  parameters  for  the  query.  

• A  higher-­‐level  Data  Lab  system  component  –  The  Query  and  Job  Manager  interfaces  may  act  as  a  client  for  lower-­‐level  VO  protocol  services.  

   

3.2 Content  Servers    


Content   servers   are   the   machines   that   host   the   data   access   services,   i.e.   the   VO   protocol   services   for   the  supported   data   types   as  well   as   the   custom  Data   Lab   interfaces   to   access   large   catalogs.     These  machines  provide  read-­‐only  access  to  data  available  from  the  Data  Lab.    

3.2.1 Large Catalogs

Large catalogs will be served using a distributed database and will involve multiple machines to host the dataset. The master node is responsible for the primary interface to the data; there may be multiple worker nodes in the background to process individual queries on the partitioned data.

The  master  node  is  responsible  for:    

• Managing the query as either a synchronous or asynchronous job,

• Distributing the query amongst the worker nodes,

• Collating results of queries from worker nodes,

• Responding to job-control requests from the Query Manager.

Data Lab will use the QServ database system⁶ from LSST as the distributed database system for Large Catalogs. Because this system is still in development, we expect to re-deploy these services multiple times as the system evolves. Additional information on the QServ system is available in Sec. 6.5.
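The master node's distribute-and-collate role can be sketched generically as a scatter-gather over partitions. This is not QServ's interface: the functions below are hypothetical, in-process threads stand in for worker nodes, and the "query" is reduced to a row predicate.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_worker(partition, predicate):
    """Stand-in for a worker node: filter the rows of one data partition."""
    return [row for row in partition if predicate(row)]


def scatter_gather(partitions, predicate, sort_key=None):
    """Send the same query to every partition, then collate the results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves partition order in the returned results.
        partial = list(pool.map(lambda p: run_on_worker(p, predicate), partitions))
    merged = [row for part in partial for row in part]
    if sort_key is not None:
        merged.sort(key=sort_key)  # final ordering is applied by the master
    return merged
```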

3.2.2 NSA Proxy/SIA Service

The NOAO Science Archive (NSA) currently provides a VO Simple Image Access (SIA) version 1 (v1) service, but as of this writing the prospect for an SIA version 2 (v2) compatible service is uncertain. While the current service provides a basic positional query capability, the SIA v2 service allows additional constraints as part of the standard query (e.g., bandpass or temporal constraints) that may prove necessary for some science cases. Both SIA v1 and v2 allow service-specific parameters to take advantage of native archive functionality.

Data  Lab  will  present  an  SIA  v2  interface  to  the  NSA  (as  it  will  to  all  image  services)  in  one  of  two  ways:    

• Using the native SIA v2 service (if it is available),

• Using a proxy service to the NSA that takes advantage of an existing generic SQL-query interface to the NSA to provide an SIA v2 façade interface.

In  either  case,  the  query  service  will  execute  on  a  Data  Lab  content  server  that  is  connected  over  a  socket  to  the  NSA  service  and  not  directly  on  NSA  hardware.    
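As a rough illustration of the proxy option, the sketch below translates a small subset of SIA v2 query parameters (POS, BAND and TIME) into a parameterized SQL query. The table name "images" and the columns ra, dec, wave_min, wave_max and mjd_obs are hypothetical; the actual NSA schema and SQL interface are not described in this document, and only the CIRCLE form of POS is handled.

```python
import math

def sia2_to_sql(params):
    """Translate a subset of SIA v2 parameters into SQL plus bind values."""
    clauses, args = [], []
    if "POS" in params:
        shape, ra, dec, radius = params["POS"].split()
        if shape.upper() != "CIRCLE":
            raise ValueError("only CIRCLE is handled in this sketch")
        ra, dec, radius = float(ra), float(dec), float(radius)
        # Bounding-box pre-filter; the RA half-width is widened by 1/cos(dec).
        half_ra = radius / max(math.cos(math.radians(dec)), 1e-6)
        clauses.append("ra BETWEEN ? AND ? AND dec BETWEEN ? AND ?")
        args += [ra - half_ra, ra + half_ra, dec - radius, dec + radius]
    if "BAND" in params:
        lo, hi = map(float, params["BAND"].split())
        # Keep images whose wavelength coverage overlaps the requested interval.
        clauses.append("wave_max >= ? AND wave_min <= ?")
        args += [lo, hi]
    if "TIME" in params:
        lo, hi = map(float, params["TIME"].split())
        clauses.append("mjd_obs BETWEEN ? AND ?")
        args += [lo, hi]
    where = " AND ".join(f"({c})" for c in clauses) or "1=1"
    return f"SELECT * FROM images WHERE {where}", args
```

The bounding box is only a coarse first cut; a production proxy would presumably apply an exact spherical test and map the result rows into the SIA v2 response format.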

[6] Wang, D.L.; Monkewitz, S.M.; Kian-Tat Lim; Becla, J., "Qserv: A distributed shared-nothing database for the LSST catalog," 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11, 12-18 Nov. 2011

3.2.3 Survey/PI Data Access Services

These services represent the NOAO Survey/PI data publication component of the Data Lab. They are individual data collections hosted by the Data Lab that present standard VO interfaces to their data as independent services (i.e., there is no global entry into all the collections, just the individual services). These datasets are generally much smaller than Large Catalogs but may represent complex data collections, i.e., multiple services for the catalog, image and/or spectral data holdings in the collection. Each service in this case is independent of any others and so may be deployed on different machines if needed.

Because clients access these services independently, their physical location is irrelevant. Accessing the data as a coherent collection is the job of a knowledgeable client application, or of an advanced service such as DataLink [7] that understands the links between the services (e.g., that a particular catalog object may have an associated spectrum or image cutout).

 

3.3 Storage Servers

The storage server manages the Data Lab Virtual Storage system, providing the central VOSpace interface in the system. As of this writing and for development purposes, a disk system of ~400 TB is available on a shared-use basis; we expect this to be augmented to provide dedicated storage prior to public release of the Data Lab. The disk array is managed using the GPFS [8] file system already in operation at NOAO.

3.4 Compute Servers

The compute server is a multi-CPU, multi-core system [9] to be used for parallel execution of processing tasks in workflows (e.g., image cutouts or reprojections). Processes execute as containerized applications under a specific user ID as described in Sec. 1.4.5; as such, each process will have access to a user's virtual storage space (mounted in the container as a user filesystem) to provide file access for the task without requiring it to use specific Data Lab interfaces.

Multiple machines may be deployed to allow for greater capacity. Each machine will have a modest amount of local disk that may be used for intermediate processing. The Job Manager is responsible for starting the process on the machine and shutting it down once processing is complete.
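To make the execution model concrete, the following sketch assembles the kind of "docker run" command a Job Manager might issue for a containerized task. The image name, the mount points /vospace and /scratch, and the function name are assumptions; only the standard Docker options --rm, --user and --volume are used.

```python
def docker_run_args(image, uid, vospace_dir, task_args, scratch_dir=None):
    """Build the argument list for running one containerized task."""
    cmd = [
        "docker", "run",
        "--rm",                                   # remove the container when the task exits
        "--user", str(uid),                       # run under the requesting user's ID
        "--volume", f"{vospace_dir}:/vospace",    # user's virtual storage (hypothetical mount point)
    ]
    if scratch_dir:
        # Local disk on the compute server for intermediate products.
        cmd += ["--volume", f"{scratch_dir}:/scratch"]
    return cmd + [image] + list(task_args)
```

The resulting list could be passed to subprocess.run() by whatever process launches tasks on the compute server; how the user's virtual storage is made visible on the host is outside the scope of this sketch.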

3.5 MyDB Server

The personal database server is configured as a large database machine and is responsible for handling the MyDB personal databases assigned to each user. Access to the MyDB tables requires authentication in the Data Lab. New tables in the database may be created from:

• A query result on either the Content Server or Large Catalogs,
• A saved result from a client query of external data resources (e.g., a VO service result),
• Data saved to virtual storage that uses a VOSpace capability to create a database table
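Each of the cases above amounts to turning a set of result rows into a new table. A minimal sketch of that step is shown here using SQLite as a stand-in; the MyDB database engine is not specified in this section, and the table and column names are illustrative only.

```python
import sqlite3

def save_result_as_table(conn, table, columns, rows):
    """Create a new table from query-result rows (names must be plain identifiers)."""
    if not table.isidentifier() or not all(c.isidentifier() for c in columns):
        raise ValueError("table and column names must be simple identifiers")
    col_list = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({col_list})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    conn.commit()

# Example: store a small (made-up) query result in an in-memory database.
mydb = sqlite3.connect(":memory:")
save_result_as_table(
    mydb, "bright_stars", ["obj_id", "ra_deg", "dec_deg", "mag"],
    [(1, 10.1, -5.2, 17.9), (2, 10.3, -5.0, 18.4)],
)
```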

 

[7] Dowler, P., et al, "DataLink, Version 1.0", http://ivoa.net/documents/DataLink/, IVOA Recommendation 05 May 2014
[8] http://en.wikipedia.org/wiki/IBM_General_Parallel_File_System
[9] For development we are using a 16-CPU quad-core server with 16GB of RAM and 16TB local disk.

3.6 Data Lab Services Server

This server will host the bulk of the public Data Lab services, including:

• The Job Manager responsible for managing tasks running on the Compute Server,
• The Query Manager responsible for asynchronous queries of the Content Server or Large Catalogs,
• Any needed proxy services for the NOAO Science Archive,
• The Storage Manager responsible for access to virtual storage,
• The Resource Resolver responsible for resolving local resource URIs into service endpoints,
• The Authentication service responsible for providing secure access to Data Lab services.
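As an illustration of the Resource Resolver role listed above, the sketch below maps a local resource URI onto a service endpoint using a lookup table. The URI schemes, the authority name "datalab" and the endpoint URLs are all hypothetical; the actual URI conventions are not defined in this section.

```python
from urllib.parse import urlsplit

# Hypothetical registry: (scheme, authority) -> base URL of the service endpoint.
REGISTRY = {
    ("vos", "datalab"): "https://datalab.example.org/vospace",
    ("mydb", "datalab"): "https://datalab.example.org/mydb",
}

def resolve(uri):
    """Resolve a local resource URI into the URL of the service that hosts it."""
    parts = urlsplit(uri)
    base = REGISTRY.get((parts.scheme, parts.netloc))
    if base is None:
        raise LookupError(f"no service registered for {uri!r}")
    return base + parts.path
```

With this table, resolve("vos://datalab/user1/results.fits") would yield "https://datalab.example.org/vospace/user1/results.fits".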

 


4 Distributable Data Lab Components

The public services provided by the Data Lab, such as virtual storage or data publication, generally require system privileges and additional software in order to be deployed (e.g., an application server such as Tomcat and/or database backing), in addition to hardware adequate to support multiple users. However, configuration of some services can be simplified or packaged so that the service can be distributed for use on a single-user machine with minimal installation requirements. This has a number of advantages:

• Users can create a local Data Lab environment for use in software development prior to executing a workflow on a full dataset within the Data Lab.

• Data Lab functionality can be exported to a user's machine instead of requiring user data to be imported into the Data Lab for use, allowing components to run closer to the data on which they operate.

• Components can work together intelligently, e.g., virtual storage can be synchronized between multiple sites or data services on the user's machine can be used transparently within a workflow executing in the Data Lab.

• User hardware can be leveraged to increase the effective computing capacity available.

The deployment of services within the Data Lab involves multiple machines already; the goal of distributable components is simply to extend this concept to services running on machines outside the primary NOAO data center.

4.1 Software Packaging and Distribution

Data Lab software will be distributed in three ways:

1. Public Repository: Users will be able to access sources for all software components from the project's public GitHub repository, allowing them to retrieve only the code of interest. Minimal installation documentation will be available with each component; however, the user will be required to configure all the software manually to deploy a working system. This method of distribution will be most useful to developers wishing to modify the code to add new functionality.

2. Containerized Applications: Individual components will be packaged as containerized Docker applications, requiring only minimal configuration and the Docker framework to be available to execute. Although each container could theoretically be run individually, multiple containers will be packaged into a download file to provide users with a coherent system of Data Lab capabilities.

3. Virtual Machine (VM) Images: Machine virtualization will be used during development not only to create testing and development platforms, but also to maximize utilization of the available hardware during operations. Various VM images will be created and configured with appropriate Data Lab services to provide standard platforms for various purposes, e.g., as a "content server", a "compute server", or a "large catalog worker node". Additionally, VMs may be configured with pre-installed analysis environments that will serve as the base operating system for user login shells. All of the machine images will be available for users to download for use locally.

Links to the GitHub repository and the other download files will be available from the project web site. In the following sections we discuss distribution using the containerized application model only.


4.2 Virtual Storage

The virtual storage system in the Data Lab will be implemented as a VOSpace [10] service, where the protocol defines a web-service interface to manage distributed storage. The service is typically deployed as a web application and the contents of the space are managed in a database; the physical storage of files may use any number of backend systems, including a standard local file system. Data Lab currently provides both a Java and a Python reference implementation of the VOSpace protocol; from the client perspective they are identical in terms of core functionality.

The Python implementation is ideally suited for distribution since the language easily supports an embedded web server and database (e.g., SQLite [11]), making the entire service self-contained. Further, the container mechanism allows the VOSpace and its supporting software to be packaged in a way that isolates it from the underlying system. This reduces the installation process for the user to enabling Docker on the machine (trivial for both Linux and Mac systems), optionally modifying a local configuration file, and then simply executing the container to run the service.

                     

[10] http://www.ivoa.net/documents/VOSpace/
[11] http://www.sqlite.org/

Figure 4.2: Architecture of the Virtual Storage service Docker container. [Diagram labels: Virtual Storage Service Container; Image/Table Support Apps; Data Lab Interfaces; Python; VOSpace; Database; Base Docker OS; Local Disk Container.]

VOSpace capabilities and views either are supported by external applications or are integrated into the implementation of the VOSpace itself. These external applications are themselves packaged as containerized tasks that are available as part of the software distribution discussed in Sec. 4.1. The service will need persistent storage to maintain the database contents; this can be achieved by using a specialized data storage container that can be shared by all distributed components, or by mounting a directory from the user's local machine. Bundling the support tasks as part of the service container may also be considered as an alternative distribution mode (see Figure 4.2).

The functionality required in the support tools includes:

• FITS header metadata scraping tools (i.e., tools used to collect FITS header information or other metadata from keywords or the file contents) to enable creation of a searchable database for an SIA/SSA service on image or spectral data stored in containers. These tools are the same as those used when creating a public data service in the Data Lab and can be containerized for distribution.

• Image conversion tools to support alternate formats of image data (e.g., to create previews from FITS image files).

• Table conversion tools to support alternate views of table data or for use in loading a database table. Supported formats will minimally include:

  o VOTable (XML)
  o FITS BINTABLE
  o SExtractor output files
  o CSV, TSV and ASCII table files

• General task execution code to allow arbitrary processing of container contents.

• Any other code needed to support a specific capability or view on a container.
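The header-scraping item in the list above can be sketched with the standard library alone. The function below reads the primary header of a FITS file (80-character cards in 2880-byte blocks, terminated by an END card) into a dictionary of raw string values. It is a simplified illustration rather than the Data Lab tool: quoted strings containing escaped quotes, CONTINUE cards and extension headers are not handled, and a production tool would more likely rely on an established FITS library.

```python
import io

CARD, BLOCK = 80, 2880   # FITS card length and header block size, in bytes

def read_primary_header(stream):
    """Collect keyword/value pairs from the primary header of a FITS stream."""
    header = {}
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            key = card[:8].strip()
            if key == "END":
                return header
            if card[8:10] != "= ":
                continue                      # COMMENT, HISTORY or blank card
            value = card[10:]
            if value.lstrip().startswith("'"):
                value = value.split("'")[1].rstrip()   # text inside the quotes
            else:
                value = value.split("/")[0].strip()    # drop the trailing comment
            header[key] = value

# Example: a tiny synthetic header padded to one full block.
cards = [
    "SIMPLE  =                    T / standard FITS",
    "NAXIS   =                    0",
    "OBJECT  = 'M31     '           / target name",
    "END",
]
sample = "".join(c.ljust(CARD) for c in cards).ljust(BLOCK).encode("ascii")
meta = read_primary_header(io.BytesIO(sample))
```

The resulting dictionary is the kind of record that could be loaded into a searchable database for an SIA/SSA service.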

A configuration file will be used to manage the service options (e.g., directory to be managed, service URL including port number) or to enable specific capabilities/views.
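The format of this configuration file is not specified here. Purely as an illustration, the sketch below assumes an INI-style file with hypothetical section and option names and shows how the managed directory, service URL, port and enabled capabilities might be read.

```python
import configparser
from urllib.parse import urlsplit

# Hypothetical configuration; every section and option name is an assumption.
SAMPLE_CONFIG = """
[service]
root_dir = /data/vospace
url = http://localhost:8080/vospace

[capabilities]
sia = yes
table_loader = no
"""

def load_config(text):
    """Parse the service options and the list of enabled capabilities."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    service = parser["service"]
    caps = parser["capabilities"]
    return {
        "root_dir": service["root_dir"],
        "url": service["url"],
        "port": urlsplit(service["url"]).port,
        "capabilities": [name for name in caps if caps.getboolean(name)],
    }

config = load_config(SAMPLE_CONFIG)
```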

4.3 Data Publication

Creating "simple" VO data services (i.e., SIA/SSA/SCS) will be done using the DALServer [12] framework from the VAO, as this provides a configuration-only option for creating a service from an existing database. DALServer runs as a web application deployed to an application server such as Tomcat and can provide service endpoints for multiple datasets. The framework and its supporting code can be containerized for distribution, again with the user only needing to provide configuration information for the service instead of deploying all of the underlying code.

As with the VOSpace service discussed above, persistent storage required to operate the service can make use of a specialized storage container or a local disk mount. Pre-existing databases will be accessed directly from the DALServer using connection information provided by the configuration file. New searchable databases can be created using the metadata collection tools described in Sec. 4.4; a database within the publication container (backed by the persistent storage) will be available to users when creating data services if one is not otherwise available.

DALServer can additionally build a web-page interface to its services that allows browser-based query and access to the data. Because these services may not be registered with the VO for public access, legacy applications, desktop tools and programmatic VO interfaces can query and access the service by calling the service endpoint directly. At this time, advanced publication services (i.e., VO TAP interfaces) are not being considered for distribution due to their complexity.
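Calling a service endpoint directly can be as simple as constructing a query URL. The sketch below does this for a Simple Cone Search request, using the RA, DEC and SR parameters (in degrees) defined by that protocol; the endpoint URL in the usage note is a made-up local address, not a published DALServer path.

```python
from urllib.parse import urlencode

def cone_search_url(endpoint, ra, dec, radius):
    """Build a Simple Cone Search request URL (all values in decimal degrees)."""
    query = urlencode({"RA": ra, "DEC": dec, "SR": radius})
    separator = "&" if "?" in endpoint else "?"
    return f"{endpoint}{separator}{query}"
```

For example, cone_search_url("http://localhost:8080/dalserver/scs", 10.5, -3.25, 0.1) gives "http://localhost:8080/dalserver/scs?RA=10.5&DEC=-3.25&SR=0.1", which a script or desktop tool could then fetch to obtain the VOTable response.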

4.4 Processing Tools and Services

A number of Compute Services used within the Data Lab will also be useful to users on their local machines, e.g., image cutout or catalog crossmatch tools. Additionally, command-line tools used within the Data Lab to collect metadata or convert file formats, and other utilities, may be needed by the user. Within the main Data Lab (i.e., the system running in the NOAO computer center) these tools are all containerized so that they can (optionally) be run asynchronously and in parallel under the control of the Job Manager using the UWS design pattern. However, in the distributable Data Lab these tools are likely to be called either directly by the user or from a scripted application, and will be packaged as command-line tasks that always execute synchronously. Applications requiring a container (to bundle dependencies or isolate them from the underlying system) will be wrapped with a shell command to provide the task interface.
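The difference between the two execution modes can be illustrated with a small job object that follows a subset of the UWS execution phases (PENDING, QUEUED, EXECUTING, COMPLETED, ERROR, ABORTED). This is a sketch of the pattern only; the class and function names are assumptions and the transition table is simplified relative to the UWS recommendation.

```python
# Simplified subset of the UWS phase transitions.
ALLOWED = {
    "PENDING": {"QUEUED", "ABORTED"},
    "QUEUED": {"EXECUTING", "ABORTED"},
    "EXECUTING": {"COMPLETED", "ERROR", "ABORTED"},
}

class Job:
    """A task plus its current execution phase."""
    def __init__(self, task):
        self.task, self.phase, self.result = task, "PENDING", None

    def advance(self, phase):
        if phase not in ALLOWED.get(self.phase, set()):
            raise ValueError(f"illegal transition {self.phase} -> {phase}")
        self.phase = phase

def run_synchronously(job):
    """Run the task at once, as a distributable command-line task would."""
    job.advance("QUEUED")
    job.advance("EXECUTING")
    try:
        job.result = job.task()
    except Exception as exc:        # record the failure instead of raising
        job.result = exc
        job.advance("ERROR")
    else:
        job.advance("COMPLETED")
    return job
```

In the asynchronous case the same phases would be advanced by the Job Manager as the container is queued, started and finished, with clients polling the phase rather than waiting for the call to return.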

[12] http://vaosa-vm1.aoc.nrao.edu/vo/dalserver/

4.5 An Example

As an example of how distributable Data Lab components might be used, consider the situation shown in Figure 4.5 below:

• A PI (User 1) has access to all Data Lab services running in the main NOAO computer center (as shown in the top box) in addition to services installed on his/her local machine.

• A student (User 2) has installed only the virtual storage service and will be analyzing data using legacy tools.

• Queries from the PI's desktop to the Large Catalog service (red arrows) can store results in the PI's virtual storage space; these results are then copied automatically back for the PI and the student to use at a later time. Alternatively, the PI may choose to store the results in the Data Lab MyDB database; other desktop tools may in turn query these results later.

• Tasks the PI may have created in the Data Lab (blue arrows) can query data services running on the PI's desktop (e.g., from a local analysis) and have the results stored to the student's storage for further analysis.

 

 

Figure 4.5: Example uses of downloadable Data Lab components. [Diagram labels: NOAO Data Lab (DL Task, Virtual Storage Svcs, Large Catalog Svcs, Data Publication Svcs, PI/Survey, NSA, MyDB); User 1 Desktop (Virtual Storage Svc, DL Task, MyDB); User 2 Laptop (Virtual Storage Svc, Legacy Tools); Data Publication Svc.]


5 System Interfaces

All components described in Sec. 2.1 and 2.2 present specific interfaces to either user-facing client software or other Data Lab components. In cases where a detailed design document exists (or will exist) for a specific component, the interface will be detailed in that document and referenced in the section describing the component here. Standard Web or VO interfaces (e.g., HTTP, VOSpace, SIA, etc.) will reference the appropriate specification when needed. Here we describe the interfaces built within the system.

5.1 Security

Resources in the Data Lab will require differing levels of security:

None at All    For completely public services such as the Large Catalog or published datasets.

Proprietary    For services such as the NSA where the user may be required to identify themselves before gaining access to proprietary data. (Note that the NSA also presents a public data service requiring no special authorization.)

Restricted     For resources allocated to registered Data Lab users only, e.g., a personal database or virtual storage space.

External resources may additionally have their own authentication requirements, e.g., the X.509 certificate required by some Grid computing networks or other VO services. Security in the Data Lab is then a matter of providing an authentication method to protect allocated resources and secure users' data in the Data Lab itself, and of managing the credentials needed to access external services that may be called from within the Data Lab by applications or services.

Registered Data Lab users may log in using their Data Lab Identity username/password combination to establish a session. A user logged in under their NOAO Identity (as is used for access to the NSA) or under the VO Single Sign-On toolkit will similarly be a recognized user so long as those identities match a registered Data Lab user account. For simplicity, the Data Lab identity will be used to authenticate the user to all services operated by the Data Lab; a resource can then use the Authorization Service to determine whether the user has permission to use the resource.

A user's account record will contain multiple pieces of information used to determine which resources may be accessed, and how they are accessed. For example, the user may import multiple identity tokens to be used with their account and then associate specific services with a particular token. The Authorization Service will not only answer whether a user can access a particular resource, but can also indicate that a particular identity token (e.g., a cookie or an X.509 certificate) should be passed when accessing the resource.

The high-level Authorization Service interface is described in Sec. 2.2.1. A detailed description of the Authentication Service design and requirements is to be provided in a future document.
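The two-part answer described above (whether access is permitted, and which identity token to present) can be sketched as follows. The account fields and the shape of the response are assumptions made for illustration; the actual interface is the one described in Sec. 2.2.1 and the future design document.

```python
from dataclasses import dataclass, field

@dataclass
class Account:
    """Hypothetical account record for a registered Data Lab user."""
    username: str
    permitted: set = field(default_factory=set)      # resources the user may access
    tokens: dict = field(default_factory=dict)       # token name -> imported credential
    token_for: dict = field(default_factory=dict)    # resource -> token name to present

def authorize(account, resource):
    """Answer whether access is allowed and which identity token (if any) to pass."""
    if resource not in account.permitted:
        return {"allowed": False, "token": None}
    name = account.token_for.get(resource)
    if name is not None and name not in account.tokens:
        raise KeyError(f"token {name!r} is not registered for {account.username}")
    return {"allowed": True, "token": name}
```

A caller would first check the "allowed" flag and then, when a token name is returned, look up the corresponding credential and attach it to the outgoing request.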

5.2 Command-line Tools

Command-line tools provide a high-level client interface to Data Lab functionality that can be used easily by users and called from many analysis environments. These tools can also serve as a testing interface during development and for monitoring the health of the system once in operations.

Data Lab will develop a suite of command-line tools to interface to its components that can be used by both anonymous users (for access to public services) and authorized users (for access to restricted Data Lab resources). For example, a "login" command tool would authorize a user to access Data Lab virtual storage services or proprietary data in subsequent command calls, whereas an unauthorized (i.e., anonymous) "query" command tool might return the result of the query directly to the user, covering only the public data available from that service.

The description of the planned command-line tools is contained in the PEP and the detailed design documents it references.
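Since the planned tools are specified elsewhere, the following is only a sketch of how the "login" and "query" commands mentioned above might be organized. The program name "datalab", the --save option and the access rule are hypothetical.

```python
import argparse

def build_parser():
    """Command-line parser with hypothetical "login" and "query" commands."""
    parser = argparse.ArgumentParser(prog="datalab")
    commands = parser.add_subparsers(dest="command", required=True)

    login = commands.add_parser("login", help="establish an authorized session")
    login.add_argument("username")

    query = commands.add_parser("query", help="run a query against a data service")
    query.add_argument("sql")
    query.add_argument("--save", default=None,
                       help="virtual storage location for the result (requires login)")
    return parser

def check_request(args, session_user=None):
    """Return the access level for a request, refusing restricted actions when anonymous."""
    if args.command == "query" and args.save and session_user is None:
        raise PermissionError("saving to virtual storage requires a prior login")
    return "authorized" if session_user else "public"
```

An anonymous "query" call would therefore be served from public data only, while the same call made after a successful "login" could also reach restricted resources.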

5.3 Web Portals

The term Web Portal as used here refers to any web-page interface used to access a component of the Data Lab system. These include specific web pages for:

• Authorization: This web page is responsible for allowing a user to "Log In" to the system and obtain the appropriate credential for accessing his/her Data Lab resources. This page is visible to all public visitors.

• User Management: This web page allows the user to set/change personal information (e.g., contact email, reset their password) related to his/her account and to set permissions on user-defined groups created to share resources. This page is only visible once the user has identified himself/herself to the system.

• Data query and access services: These pages allow a user to query a data service and view the results. Specific pages with added functionality may be created for high-value datasets (e.g., Large Catalogs). In other cases a standardized interface will be created automatically from a template for user datasets published through the Data Lab. These pages are visible to all public visitors. However, additional features may be revealed (e.g., a search for proprietary data or the ability to save to virtual storage) to users with proper credentials.

• Virtual storage services: This web page allows a user to browse his/her virtual storage holdings, navigating between containers or viewing individual files. Users can also use this page to designate items as public, shared or private to restrict access. Additionally, users can enable/disable capabilities and views associated with containers. This page is only visible once the user has identified himself/herself to the system.

• Job submission, control and monitoring: This page allows users to submit new jobs for execution (query or processing), check the status of previously submitted jobs, or cancel running jobs. This page is only visible once the user has identified himself/herself to the system.

• Admin Portals: Operations staff will have access to several administrative web pages not available to public visitors or registered users. These include special-purpose pages needed to:

  o Manage user accounts, e.g., to allow bulk registration of users, delete users or to edit account information.
  o Monitor or manage compute jobs. This provides the same functionality as the user-level job page, however all jobs for all users on all servers are visible.

  In some cases, administrative functions will be available on other portal pages when logged in through an administrative account.

   


5.4 Legacy Applications

Legacy applications or analysis environments [13] can interact with the Data Lab in one or more of the following ways:

• Existing support for Data Lab interfaces: Data Lab exposes a number of standard VO protocols to its services. Legacy systems that already provide support for these protocols may use the Data Lab services directly. Additionally, tools that can access data given a URI will be able to access a limited number of Data Lab services, either direct access to files or the result of a service call.

• Transparent access to Data Lab services: In some circumstances, Data Lab capabilities will require that no special interface be used. For example,

  o FUSE-mounted filesystems will provide access to virtual storage. Users will authenticate themselves when mounting the storage locally; however, legacy tools will be unaware that the data are remote.

  o Legacy apps may modify a local filesystem under control of a local VOSpace service. The service will track changes to the files to keep the Data Lab service interface current, and the contents of the controlled space may be synchronized with other Data Lab services transparently.

  o Storage (local or remote) under VOSpace control may provide capabilities that allow a legacy app to access data in alternate formats transparently.

• Updated code using Data Lab programmatic interfaces: In later releases of Data Lab, programmatic interfaces will be available to allow apps to work directly with Data Lab components. Legacy tools may optionally be updated to use these interfaces once they are available, allowing a tighter integration between the legacy application and the Data Lab.
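The change tracking mentioned above for a locally controlled VOSpace directory could be approached in several ways. One simple possibility, shown here only as a sketch, is to compare periodic snapshots of file modification times and sizes; the function names are illustrative and this is not a description of the actual VOSpace implementation.

```python
import os

def snapshot(root):
    """Record (mtime, size) for every file under root, keyed by relative path."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            state[os.path.relpath(path, root)] = (info.st_mtime_ns, info.st_size)
    return state

def changes(old, new):
    """Compare two snapshots and list added, removed and modified paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, modified
```

The service could use the resulting lists to update its node database and to decide which files need to be synchronized with other Data Lab services. An operating-system file notification mechanism would be an alternative to polling.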

5.5 Data Query

The high-level Query Manager interface is described in Sec. 2.2.2. The public VO data services will all expose the interface appropriate for the service type as specified by the corresponding IVOA standard. These include:

• Simple Image Access (SIA, for images),
• Simple Spectral Access (SSA, for spectra),
• Simple Cone Search (SCS, for catalogs),
• Table Access Protocol (TAP, for tabular collections)

These services may optionally be provided by VOSpace containers and will use the same VO interface standards. Additionally, these services will also implement the VO Support Interface (VOSI) recommendation; these service endpoints are used by the operations monitoring system to check on service availability. Both VOSpace and TAP services implement the Universal Worker Service (UWS) recommendation as part of their public interface.
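As an illustration of the availability check performed by the monitoring system, the sketch below parses a VOSI availability document and reports whether the service declares itself available. The sample document and its namespace URI are written from our reading of the VOSI recommendation and should be checked against the standard; the function matches elements by local name so that it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

def is_available(xml_text):
    """Return True if a VOSI availability document reports the service as up."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare on the local name so the namespace prefix/URI does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "available":
            return (elem.text or "").strip().lower() in ("true", "1")
    raise ValueError("no <available> element found")

# Sample response (assumed structure).
SAMPLE = """<vosi:availability xmlns:vosi="http://www.ivoa.net/xml/VOSIAvailability/v1.0">
  <vosi:available>true</vosi:available>
  <vosi:note>service is accepting queries</vosi:note>
</vosi:availability>"""

status = is_available(SAMPLE)
```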

[13] Defined to be tools commonly used in the community prior to public release of the Data Lab.


 

5.6 Processing Task Control

The high-level Job Manager interface is described in Sec. 2.2.3.

Much of the analysis and functional work of the Data Lab will be done by tasks, i.e., some application or web service that manipulates data or performs some specific analysis. In order to provide the functionality needed at all levels of the Data Lab some new development will be undertaken, but in other cases existing applications will be used (or will require only a small wrapper interface). This implies that a broad mix of runtime environments will be required to support the heterogeneous collection of tasks to be used.

5.6.1 Task  Containers       Traditionally,  virtual  machines  could  be  used  to  configure  multiple  environments,  however  in  many  cases  we  don’t  need   to  virtualize  an  entire  machine   just   to  support  a   single  application.    Linux  containers14  provide  a  method  to  run  applications  in  an  isolated  environment  much  more  efficiently  (i.e.  many  more  tasks  can  be  supported  on  the  same  amount  of  physical  hardware,  and  startup  times  for  the  tasks  are  sub-­‐second).  Data   Lab  will   use   the  Docker15   container   system   to   build   self-­‐contained   application   containers   (see   Figure  5.6a)  that  will  be  execute  on  the  compute  servers  under  the  control  of  the  Job  Manager.    

 Figure  5.6(a):    Components  of  a  Linux  task  container.  

Containers are composed of a base operating system (OS) image that can share binaries and libraries with the host machine, meaning they are usually much smaller than the entire OS to be used. We can then add special Data Lab support code (e.g., libraries or utility tools) that may be needed by the task; this can optionally be stripped down further to minimize the container size. Together, the base OS and the Data Lab code can form the basis for other containers to standardize the environments, or to create custom environments that support different language or OS versions that may be needed by an application. Additionally, we can mount a user’s virtual storage space using the FUSE mechanism, as well as a specialized storage container used as a disk cache that can be shared between instances of a task container (providing faster I/O than virtual storage or network access, since the disk cache will be on the host machine).
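As an illustration of how such a container might be launched, the following minimal Python sketch assembles a `docker run` command line that bind-mounts a directory where the user's virtual storage is assumed to be FUSE-mounted, plus a shared cache volume. The function name, the image name, and the in-container paths `/vospace` and `/cache` are hypothetical and are not part of the Data Lab design; only the standard `docker run` options `--rm` and `-v` are relied upon.

```python
def build_docker_run(image, task_args, vospace_mount=None, cache_volume=None):
    """Assemble a `docker run` argument list for one task container.

    All names and paths here are illustrative assumptions, not Data Lab APIs.
    """
    # --rm removes the container once the task process exits.
    cmd = ["docker", "run", "--rm"]
    if vospace_mount:
        # Host directory where the user's virtual storage is assumed to be
        # FUSE-mounted already; exposed inside the container at /vospace.
        cmd += ["-v", "{}:/vospace".format(vospace_mount)]
    if cache_volume:
        # Named volume on the host acting as the shared disk cache.
        cmd += ["-v", "{}:/cache".format(cache_volume)]
    cmd.append(image)
    cmd += list(task_args)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_docker_run(
        "datalab/example-task",          # hypothetical image name
        ["--ra", "10.0"],                # hypothetical task arguments
        vospace_mount="/mnt/vos/alice",  # hypothetical FUSE mount point
        cache_volume="dl-cache",         # hypothetical volume name
    )))
```

Returning the argument list rather than executing it keeps the sketch independent of how the Job Manager actually reaches the compute server.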

 

5.6.2 Job Control

The application itself is installed in the container as if it were installed on a real machine; this may include the configuration of web servers or other services used by the application. In cases where the container provides a web service it can be deployed directly by the Job Manager (or be a persistent service running on the machine) since containers have individual IP addresses and port mapping allows multiple containers to co-exist without conflict. If the container is used for an application, then an additional tasking interface is built into the container configuration to control execution of the task (see Figure 5.6b).

14 https://linuxcontainers.org/
15 http://www.docker.com

[Figure 5.6(a) diagram labels: Task Container; Tasking Interface; <<Task>>; Data Lab Support Code; Base OS Image; Disk Cache Mount; Virtual Storage; FUSE; Params; Results]

The Job Manager spawns containers on the compute server when a new job is to be created. This can be done using a simple ssh interface to initiate the job on the remote machine and then interact with the remote process. In a synchronous job (left side, Fig 5.6b) the tasking interface executes the task and redirects the task’s standard I/O streams (i.e., stdin/stdout/stderr) to sockets used to communicate with the Job Manager [16]. Once the task is complete, the interface cleans up the process and the container exits. In this case the Job Manager must supply all information needed to start the task when it is executed, e.g., through command-line arguments.
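The synchronous pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Data Lab tasking interface: the function name `run_sync_task` is hypothetical, a local Unix socket pair stands in for the connection to the Job Manager, and the task's stderr is merged into stdout for brevity.

```python
import socket
import subprocess
import sys


def run_sync_task(argv, stdin_data=b""):
    """Run a task to completion with its stdio attached to a socket.

    Returns (exit_code, output_bytes). Unix-only sketch.
    """
    # One end plays the Job Manager side, the other becomes the task's stdio.
    manager_end, task_end = socket.socketpair()
    proc = subprocess.Popen(
        argv,
        stdin=task_end.fileno(),
        stdout=task_end.fileno(),
        stderr=subprocess.STDOUT,
    )
    task_end.close()  # the child process keeps its own copies

    # All input is supplied up front; then signal end-of-input to the task.
    manager_end.sendall(stdin_data)
    manager_end.shutdown(socket.SHUT_WR)

    # Collect output until the task exits and its end of the socket closes.
    chunks = []
    while True:
        data = manager_end.recv(4096)
        if not data:
            break
        chunks.append(data)
    manager_end.close()

    # Reap the process (the "clean up" step) and report its exit status.
    return proc.wait(), b"".join(chunks)


if __name__ == "__main__":
    code, output = run_sync_task([sys.executable, "-c", "print('hello from task')"])
    print(code, output)
```

Because the caller blocks until the socket reaches end-of-file, everything the task needs must be passed in `argv` or `stdin_data` when it starts, which mirrors the constraint described above.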

In an asynchronous job (right side, Fig 5.6b), the tasking interface first creates a Universal Worker Service (UWS) client as the control process that responds to requests from the Job Manager for the lifespan of the job. The UWS design pattern provides a set of HTTP service endpoints that allow the Job Manager to set task parameters, start/stop task execution, poll for completion status, and collect results. Upon receipt of the start request, the UWS client forks the application and sets up the stdio sockets as in a synchronous job; however, the output streams are saved to a result object that isn’t returned until the task exits and the Job Manager requests it. During execution the UWS client can respond to status requests so the Job Manager knows when the task has completed, or it can abort the task once some execution time limit has been exceeded. In this mode, the Job Manager is responsible for notifying the calling client that the task has completed, for returning results, and for issuing the task cleanup request once the task is no longer needed.
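The asynchronous life cycle can be modeled as a small state machine. The sketch below is an in-process stand-in for the control process, with no HTTP layer: the class name `AsyncTask` and its methods are hypothetical, while the phase names (PENDING, EXECUTING, COMPLETED, ERROR, ABORTED) follow the UWS recommendation. Output is spooled to a temporary file that plays the role of the result object.

```python
import subprocess
import sys
import tempfile
import time


class AsyncTask:
    """In-process stand-in for the control process of one asynchronous job."""

    def __init__(self, argv, time_limit=None):
        self.argv = list(argv)
        self.time_limit = time_limit  # seconds, or None for no limit
        self.phase = "PENDING"
        self.result = None            # filled in once the task has exited
        self._proc = None
        self._out = None
        self._started = None

    def start(self):
        # Spool output to a file so a chatty task never blocks on a full pipe.
        self._out = tempfile.TemporaryFile()
        self._proc = subprocess.Popen(
            self.argv, stdout=self._out, stderr=subprocess.STDOUT)
        self._started = time.monotonic()
        self.phase = "EXECUTING"

    def poll(self):
        """Answer a status request, enforcing the execution time limit."""
        if self.phase != "EXECUTING":
            return self.phase
        if self._proc.poll() is not None:
            self._out.seek(0)
            self.result = self._out.read()
            self._out.close()
            self.phase = "COMPLETED" if self._proc.returncode == 0 else "ERROR"
        elif (self.time_limit is not None
              and time.monotonic() - self._started > self.time_limit):
            self.abort()
        return self.phase

    def abort(self):
        if self.phase == "EXECUTING":
            self._proc.kill()
            self._proc.wait()
            self._out.close()
            self.phase = "ABORTED"


if __name__ == "__main__":
    job = AsyncTask([sys.executable, "-c", "print('done')"])
    job.start()
    while job.poll() == "EXECUTING":
        time.sleep(0.05)
    print(job.phase, job.result)
```

In a real deployment each method would sit behind an HTTP endpoint; that layer is omitted here to keep the sketch self-contained.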

 

   Figure  5.6(b):  Breakdown  of  task  execution  for  synchronous  (left)  and  asynchronous  (right)  jobs.  

Task containers isolate applications both from the underlying system and from other containers that may be running the same task, greatly simplifying their deployment to compute servers and their use in massively parallel workflows. The Job Manager is able to distribute execution to provide load-balancing capabilities and, since it is itself a web service, it could likewise be packaged as a container and made part of the Data Lab software distribution. Similarly, the containers could be deployed under other execution frameworks (e.g., Condor).
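The text does not specify a load-balancing policy. Purely as an assumption for illustration, the sketch below uses a least-loaded rule: each new task goes to the compute server currently running the fewest containers. The function names and server names are hypothetical.

```python
def pick_server(active_jobs):
    """Return the server with the fewest running containers.

    Ties are broken alphabetically so the choice is deterministic.
    """
    return min(sorted(active_jobs), key=lambda name: active_jobs[name])


def dispatch(task_ids, active_jobs):
    """Assign each task to a server, updating a local copy of the load."""
    load = dict(active_jobs)  # do not mutate the caller's counts
    placement = {}
    for task in task_ids:
        server = pick_server(load)
        placement[task] = server
        load[server] += 1
    return placement


if __name__ == "__main__":
    # Hypothetical servers with their current container counts.
    print(dispatch(["t1", "t2", "t3"], {"compute-a": 0, "compute-b": 1}))
```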

                                                                                                                                       16  A  similar  mechanism  is  used  in  the  IRAF  Networking  protocol  to  provide  access  to  remote  data  and  tasks.  

[Figure 5.6(b) diagram labels. Sync Job (left): Job Manager; ssh; Tasking Interface; fork(); <<Task>>; stdio streams. ASync Job (right): Job Manager; ssh; Tasking Interface; UWS Client; fork(); <<Task>>; stdio streams]


 

5.7 Virtual Storage

The high-level Storage Manager interface is described in Sec 2.2.4.

The virtual storage system will be implemented using the VOSpace standard for distributed storage. This interface will be exposed to clients to allow direct access to the storage space and provides the low-level interface used by the Storage Manager. Client applications and users must identify themselves to the Authorization Service before gaining access to this resource.

Users can additionally access the storage if it is mounted using a FUSE (Filesystem in User Space [17]) client. In this case, the client would access the space using the standard VOSpace protocols; however, to the user it would appear as a normal Unix filesystem. The FUSE client in this case is responsible for authenticating itself to the service using the Authorization Service.

Data stored in virtual storage that may be exposed using a data access service (e.g., via a capability on a storage container) will be interfaced as described in Sec 4.2.

6 Implementation Tools and Standards

This section describes the planned implementation tools and technologies to be used in the Data Lab. Additional tools may be used as necessary and will be documented in a detailed design for the component in question.

6.1 Implementation Languages

Data Lab will not mandate that a particular development language be used for all components, given the reliance on adapting existing code bases [DL-ORD-51010].

• Modification   of   existing   software   will   be   done   using   the   original   implementation   language.     As  needed,  code  may  be  updated  to  use  a  more  modern  version  of  the  language.  

• New   development  will   be   done   using   the  most   appropriate   language   for   the   tool   or   service   being  implemented.      

• Client-side interfaces will generally be implemented using C/C++ as the core language, with multi-language interface bindings generated by SWIG where appropriate, in order to be as open as possible to user-provided application development.

• A limited suite of custom client-side interfaces will be implemented for Python application development (currently the most popular scripting language among astronomers). Updates to an existing similar Python interface will be preferred to new development.

• All   dependencies   for   software   tools   (e.g.,   specific   versions   of   libraries   or   3rd   party   code)  must   be  justified.  

6.1.1 Language Versions

All core component services must be compatible with the following language versions:

• Java-­‐based  applications/services  must  be  compatible  with  Java  7  

                                                                                                                                       17  http://fuse.sourceforge.net/  


• Python-based applications/services must be compatible with Python 2.7

Exceptions will be made in cases where a physical/virtual machine is dedicated to hosting a single service requiring an alternate version of the language.

6.2 Development Platforms

Data Lab hardware will use a standardized operating system across all machines hosting core services. At present this is Linux CentOS 6.5; the operating system used in the final deployment of the Data Lab may change subject to a requirement that a common OS is used across all NOAO SDM (Science Data Management) systems. Two exceptions will be made:

1. A  collaboration/user  requesting  use  of  a  private  Virtual  Machine  is  free  to  request  an  alternate  base  operating  system  (however  support  for  these  systems  will  be  limited).  

2. Users   creating   containerized   applications   are   free   to   use   a   base   image   from   another   operating  system.  

 Client  software  is  expected  to  run  on  modern  versions  of  Linux  and  Mac  OSX.    

6.3 Software Development Standards

Data Lab will not require use of a specific integrated development environment (IDE) for implementation of services. Release documentation must specify the complete process for building, configuring, and deploying a tool or service from source code.

6.3.1 Software Licensing

All Data Lab software will be Open Source and available under a {TBD} license. Individual applications or components may have different licenses, with the goal that all software will be released with the most lenient license possible. Software imported and extended for use in the Data Lab will be made available under the original software license terms. Data Lab will not use proprietary software that cannot be redistributed.

6.3.2 Public Repository

All software released through the Data Lab will be available on the GitHub (github.com) public repository. Deployment of an application/service within the Data Lab will use code available from this repository.

6.3.3 Private Repository

Data Lab shall maintain a self-hosted GitLab (gitlab.com) repository for code not yet released but still under version control. This repository solution is compatible with GitHub, making it possible to migrate to the public repository when software is released publicly.

6.3.4 Testing  Framework    

See  description  in  the  PEP  document.  

6.3.5 Bug and Issue Tracking

Data Lab will use the JIRA issue and project tracking system already in use by SDM. Additional issue tracking will be done using the public GitHub repository mechanism.


6.4 Web Interfaces

Data Lab will use Apache HTTP server and Apache Tomcat for public web interfaces. The GitLab repository requires an alternate HTTP server (Nginx) and will be isolated from machines hosting public services to avoid potential conflicts with Apache servers.

6.5 Database Technologies

Data Lab will support a number of different databases within the system:

• QServ will be used to host extremely large catalogs (e.g., DES) on dedicated machines.
• MySQL and PostgreSQL will be available on machines hosting public data services. Either or both of these databases may be used depending on the optimizations required during data publication.
• A user’s MyDB database will use MySQL to maintain maximum compatibility with results obtained from the QServ-based datasets.
• SQLite may be used internally by some applications or services.

 Use  of  other  databases  will  not  be  supported  without  sufficient  justification.    

6.6 Machine Virtualization

Data Lab currently uses Oracle’s VirtualBox product as its machine virtualization tool to create and maintain Virtual Machines (VMs) within the Data Lab. Virtual machines are used to create machines with a single purpose (e.g., to host internal administrative services) in order to maximize hardware utilization.

Process virtualization will use the Docker (docker.io) container system to create distributed applications and/or to create a compute service that can run in isolation from other processes on the machine. Because of their lightweight nature and portability, containers are ideal for building specialized services within the Data Lab that can be deployed as needed.


7 Requirements Tracking

This section traces the elements of the architecture presented here (right-hand column) back to the Science Use Cases presented in the SUC. Unless otherwise specified, numbers in the right-hand column refer to sections in this document.

7.1 Core Data Lab Capabilities

Access to SQL catalogs

DL-­‐SRD-­‐21000    DL  must  provide  access  to  SQL  catalogs  with  command  line  tools  for  experienced  users.    

2.1.4.4  2.1.4.1  5.2  DL-­‐OCD-­‐2500  

DL-­‐SRD-­‐21002    DL  must  provide  access  to  SQL  catalogs  with  Web-­‐based  tools  for  intermediate  and  novice  users.    

2.1.4.4  2.1.4.1  5.3  DL-­‐OCD-­‐2500  

DL-­‐SRD-­‐21004    DL  must  provide   the  capability   to  create   table   joins  of  DL-­‐based  SQL  catalogs.  

2.1.4.4  2.1.4.1  

DL-­‐SRD-­‐21006  DL  must  provide  asynchronous  state-­‐full  access   to  DL-­‐based  SQL  catalogs.  

2.1.4.4  2.1.4.1  DL-­‐OCD-­‐3130  DL-­‐OCD-­‐3135  

DL-SRD-21008  DL must provide synchronous access to DL-based SQL catalogs.

2.1.4.4  2.1.4.1  DL-OCD-3125

   User  database  storage  (local  &  remote)    

DL-­‐SRD-­‐21050     DL   must   provide   for   the   storage   of   databases   on   the   user’s  desktop  computer  (local  storage).    

2.2.4  2.2.2  DL-­‐OCD-­‐3111  

DL-SRD-21055  DL must provide for the storage of databases at the DL (remote storage).

2.2.4  2.2.2  DL-­‐OCD-­‐3110  

Light curve data generation from catalogs

DL-SRD-21100  The DL must provide the means to generate light curve data from multi-epoch flux/magnitude measurements in SQL catalogs served by the DL.

1.4.4  DL-OCD-2540

Virtual Storage Service

DL-SRD-21150  DL must provide means for users to store results of SQL database queries near the computational resources serving the major SQL catalogs at the DL.

2.2.4  DL-OCD-3200

DL-SRD-21151  DL must provide means for users to create their own data objects (files) in the DL distributed storage network.

2.2.4

DL-SRD-21152  DL must provide means for users to delete their own data objects in the DL distributed storage network.

DL-SRD-21153  DL must provide means for users to upload data objects from local (desktop) to remote storage.

2.2.4  DL-OCD-4125


DL-­‐SRD-­‐21154    DL  must  provide  means  for  users  to  download  data  objects  from  remote  to  local  storage.  

2.2.4  DL-­‐OCD-­‐2508  

DL-­‐SRD-­‐21155    DL  must  provide  means  for  users  to  manipulate  metadata  of  their  own  data  objects.  

2.2.4    

DL-­‐SRD-­‐21156    DL  must  provide  means  for  users  to  set  access  privileges  of  their  own  data  objects.  

2.2.4  DL-­‐OCD-­‐3205  

DL-­‐SRD-­‐21157    DL  must   provide  means   for   users   to   access   the   content   of   data  objects  within  the  DL  distributed  storage  network.    

2.2.4  DL-­‐OCD-­‐2573  

IVOA registry searches

DL-SRD-21200  DL must provide means for users to search (cone-searches) for observations (images/fluxes) obtained at different wavelengths. IVOA registry searches for Spectral Energy Distributions (galaxies and stars).

2.1.4.2  2.2.5

DL-SRD-21205  DL must provide means for users to search for Spectral Energy Distributions (galaxies and stars).

Access to external image surveys

DL-SRD-21250  DL must provide means for users to access significant ground-based optical/near-infrared image surveys (e.g., DSS, 2MASS, ESO Vista Hemisphere Survey).

2.1.5.1  DL-­‐OCD-­‐2520  DL-­‐OCD-­‐2522  

Galactic Extinction/Reddening Service

DL-SRD-21300  DL must provide means for users to get extinction and reddening values due to Galactic dust as a function of position on the sky.

2.1.5.1

Magellanic Clouds Extinction Service

DL-SRD-21350  DL must provide means for users to get extinction due to dust in the Magellanic Clouds as a function of position on the sky.

2.1.5.1

Color-Magnitude & Hess Diagram plotting tool

DL-SRD-21400  DL must provide Color-Magnitude and Hess Diagram (with the option of contour overlays) plotting tools to enable the graphical analysis of data samples of possibly millions of stars.

1.4.7  DL-OCD-2554

Variable resolution display tool for remote users

DL-SRD-21450  DL must provide an interactive plotting/visualization/analysis tool with variable resolution for remote users in order to improve the user interaction experience.

1.4.7

Phase-folded light curves

DL-SRD-21500  DL must provide the means to produce phase-folded light curves for a given period value from light curve data.

1.4.7  DL-OCD-2552

Create animations/movies of variable objects

DL-SRD-21550  DL must provide the means to create animations/movies of observations of variable objects (e.g., RR Lyrae stars, supernovae, etc.).

1.4.7  DL-OCD-2553

   Image  Cutout  Service    

DL-­‐SRD-­‐21600    DL  must  provide  a  general  asynchronous  state-­‐full  Image  Cutout  Service  that  will  serve  subimages  (image  cutouts)  of  images  based  at  the  DL.    

1.4.4  DL-­‐OCD-­‐2524  DL-­‐OCD-­‐3415  

DL-SRD-21601  The Image Cutout Service must be able to be run in a synchronous mode.

1.4.4  DL-OCD-3410

DL-SRD-21602  The Image Cutout Service must be capable of delivering small images (“postage stamps” with ~100 pixels) with position orientation metadata from DL-based images.

1.4.4  DL-­‐OCD-­‐2521  

DL-SRD-21603  The Image Cutout Service must be capable of delivering large images (millions of pixels covering possibly more than one deg²) with position orientation metadata from DL-based images.

1.4.4  DL-­‐OCD-­‐2521  

DL-SRD-21604  The Image Cutout Service must be able to serve 100,000 image cutouts as part of large asynchronous batch jobs.

1.4.4

DL-SRD-21605  The Image Cutout Service must serve images in a format suitable for the creation of animations/movies of variable objects (see DL-SRD-21550).

Task automation tools

DL-SRD-21650  DL must provide task automation tools to enable computation tasks (workloads) to be spread over many cores and/or machines.

1.4.4

Positional Cross-Match Service

DL-SRD-21700  DL must provide an asynchronous state-full Positional Cross-Match Service (PCMS) that will enable a DL user to cross-match objects with positions in a custom database with SQL catalogs served by the DL (e.g., the DES catalog).

1.4.4  DL-OCD-2504

DL-SRD-21702  The PCMS must be able to be run in a synchronous mode.

1.4.4  DL-OCD-3410

DL-SRD-21704  The PCMS must be able to process a million object positions as part of large asynchronous batch jobs.

1.4.4

DL-SRD-21706  The PCMS must provide access to external robust IVOA-standards-compliant positional cross-match services (e.g., CDS’s VizieR).

Periodogram Service

DL-SRD-21750  The DL should provide an asynchronous state-full Periodogram Service that will return periodograms of time series data.

1.4.4

DL-­‐SRD-­‐21752     The   Periodogram   Service   should   be   able   to   be   run   in   a  synchronous  mode.  

1.4.4  DL-­‐OCD-­‐3410  

DL-­‐SRD-­‐21754    The   Periodogram   Service   should   be   able   to   analyze   5,000   light  curves  as  part  of  large  asynchronous  batch  jobs.  

1.4.4  

DL-­‐SRD-­‐21756    The  Periodogram  Service   should  be  able   to  process   light   curves  that  were  generated  by  DL-­‐SRD-­‐21100.  

1.4.4  DL-­‐OCD-­‐2504  

Stellar photometry codes

DL-SRD-21800  DL must provide users with at least one executable binary of a standard stellar photometry code (e.g., SExtractor, Dophot, DAOPHOT, etc.) for inclusion in user-designed photometric pipelines.

1.4.4  DL-OCD-2530

Statistical time series analysis tools

DL-SRD-21850  DL must provide light curve (time series) statistical analysis tools that determine if the flux of an object varies in time for a given level of statistical significance.

1.4.4  DL-OCD-2542

DL-SRD-21855  The statistical time series analysis tools should be able to determine the statistical nature of a variable object: periodic, aperiodic (semiregular), random (stochastic), or transient.

Compute Service

DL-SRD-21900  The DL should provide an asynchronous state-full Compute Service that would do computationally intensive calculations in a few hours or days instead of weeks or months.

1.4.4  DL-OCD-2572

 

7.2 User-Provided Science Capabilities

User tools to determine crowding factor

DL-SRD-22000  DL should provide the means to enable user-defined database tools to determine crowding factor (nearest neighbor distances, N-point correlations, etc.).

1.4.4  DL-OCD-2570

Database of theoretical isochrones

DL-SRD-22050  DL should provide at least one set of theoretical stellar isochrones transformed to the DES (SDSS) filter set.

1.4.4

Registration of large/complex images

DL-SRD-22100  DL should provide resources to enable the spatial registration/cross-matching of large/complex images observed with different filters, exposure times, rotation angles, etc.

1.4.4

Capture interactive results

DL-SRD-22150  DL should provide resources to enable the capture of interactive results for reproducibility and sharing within collaborations.

1.4.4

Poisson-based Matched-Filter Service

DL-SRD-22200  DL should provide resources to enable the identification of unique stellar populations in complex stellar fields contaminated by multiple external stellar populations in the Milky Way or other Local Group.

1.4.4

Estimate reddening of RR Lyraes from light curves

DL-SRD-22250  DL should provide resources to enable analysis tools to estimate reddening of individual RR Lyrae stars based on time series observations.

1.4.4

User-defined analysis tools

DL-SRD-22300  DL should provide resources to enable user-defined analysis tools (code, scripts, templates, etc.).

1.4.4  DL-OCD-2543  DL-OCD-2570

High-order Polynomial Background Fitting

DL-SRD-22350  DL should provide resources to enable high-order polynomial background fitting in complex star fields.

1.4.4

Digital image filters for feature/object detection

DL-SRD-22400  DL should provide digital filters for feature recognition/object detection in images.

1.4.4

Variable Object Classification Service

DL-SRD-22450  DL should provide resources to determine what type of variable an object is based on its light curve.

1.4.4

Galaxy Morphology Analysis Service

DL-SRD-22500  DL should provide resources to enable galaxy morphology analysis codes like Galfit, Galphot, etc. to analyze large (many pixel) galaxy images.

1.4.4  DL-OCD-2571  DL-OCD-2572

DL-SRD-22505  DL should provide resources to enable morphological analysis of galaxy blob images (with a small number of pixels).

1.4.4

Single-Galaxy Photometric Redshift

DL-SRD-22550  DL should provide resources to enable the determination of photometric redshifts of a galaxy from multiband observations using Spectral Energy Distribution (SED) template libraries.

1.4.4

Interactive User-Defined Plotting/Visualization Tools

DL-SRD-22600  DL should provide resources to enable the graphical user interface tools developed by users to enhance the visualization or understanding of complex images or databases.

1.4.4

Astrometry for large images

DL-SRD-22650  DL should provide resources to enable the development of tools for the computation of astrometric solutions of large astronomical images that may be distorted due to imager optics.

1.4.4

Extended object/non-point-source detection

DL-SRD-22700  DL should provide resources to enable the development of image analysis tools for the detection of astrophysical objects that are not point sources.

1.4.4

 

 


Appendix I: Vocabulary / Acronyms Used

AAS  (American Astronomical Society)
ADASS  (Astronomical Data Analysis Software and Systems) conference
ADQL  (Astronomical Data Query Language) An SQL-like language which includes astronomical facilities to query a database.
AGN  (Active Galactic Nucleus)
API  (Application Programming Interface) The documentation of the interface to a software library or tool.
ASCII  (American Standard Code for Information Interchange) A character-encoding scheme based on the English alphabet where 128 specific characters are encoded into 7-bit binary integers.
ASV  (ASCII Space Values)
AURA  (Association of Universities for Research in Astronomy)
CADC  (Canadian Astronomy Data Centre)
CDS  Centre de Données astronomiques de Strasbourg
CMD  (Color Magnitude Diagram)
CSV  (Comma Separated Values)
CTIO  (Cerro Tololo Inter-American Observatory)
DAL  (Data Access Layer) The VO protocols that define how VO applications access data resources.
Datalink  VO protocol for associating complex astronomical data
DECaLS  DECam Legacy Survey
DECam  (Dark Energy Camera) A 520 megapixel digital camera on the Blanco 4-m telescope at CTIO.
DES  (Dark Energy Survey) A survey to probe the origin of the accelerating Universe and help uncover the nature of dark energy by measuring the 14 billion-year history of cosmic expansion with high precision over five years beginning in summer 2013.
DESI  (Dark Energy Spectroscopic Instrument) An instrument to measure the effect of dark energy on the expansion of the universe by obtaining optical spectra for tens of millions of galaxies and quasars (beginning 2018).
DESDM  (Dark Energy Survey Data Management) Project that developed and operates the DESDM system at NCSA.
DL  (Data Lab)
DAOPHOT  Package for crowded field stellar photometry.
Docker  An open platform for developers and system administrators to build, ship, and run distributed applications.
DoPHOT  CCD PSF fitting photometry program.
DS9  SAOimage DS9, an astronomical imaging and data visualization application.
DSS  (Digitized Sky Survey)
ESO  (European Southern Observatory)
FITS  (Flexible Image Transport System) An open standard defining a digital file format for storage, transmission, and processing of astronomical (and other scientific) data.
FTP  (File Transfer Protocol) A standard network protocol used to transfer computer files from one host to another host over a TCP-based network.


FUSE        (FileSystem in User Space) An operating system mechanism that lets non-privileged users create their own file systems.
GAVO        (German Astrophysical Virtual Observatory)
GMS         (Group Management Services)
GPFS        (General Parallel File System) A high-performance clustered file system developed by IBM.
Hess diagram  Plots the relative density of the occurrence of stars at different color-magnitude positions of the Hertzsprung-Russell diagram for a given galaxy.
HSB         (High Surface Brightness)
HST         (Hubble Space Telescope)
HTTP        (HyperText Transfer Protocol) An application protocol for distributed, collaborative, hypermedia information systems.
IDL         (Interactive Data Language) A programming language used for data visualization and analysis.
IPAC        (Infrared Processing and Analysis Center)
IRAF        (Image Reduction and Analysis Facility) NOAO image reduction/analysis and visualization software system.
IVOA        (International Virtual Observatory Alliance) The international VO community responsible for developing VO standards.
JIRA        A commercial tool for software teams to plan, build, and track projects.
JPEG        (Joint Photographic Experts Group) Lossy compression for digital images.
LDAP        (Lightweight Directory Access Protocol) An industry standard application protocol for accessing and maintaining distributed directory information services over an Internet Protocol (IP) network.
LSB         (Low Surface Brightness)
LMC         (Large Magellanic Cloud)
LSST        (Large Synoptic Survey Telescope)
MAST        (Mikulski Archive for Space Telescopes)
MPC         (Minor Planet Center)
MCs         (Magellanic Clouds)
MySQL       Popular open source database.
MyDB        A read-write database available to users for saving results from queries of read-only databases. This is similar to the SDSS MyDB concept. Simple database wrapper for MySQL.
NASA        (National Aeronautics and Space Administration)
NCSA        (National Center for Supercomputing Applications)
NHPPS       (NOAO High-Performance Pipeline System) An event-driven, multi-process executor system developed to manage pipeline applications in a coarse-grained, distributed processing environment.
NOAO        (National Optical Astronomy Observatory)
NSA         (NOAO Science Archive)
NSSDC       (NOAO System Science and Data Center)
OCD         (Operational Concept Document)
ORD         (Operational Requirements Document)
OS          (Operating System)
PCMS        (Positional Cross-Match Service)
PNG         (Portable Network Graphics) Raster graphics file format that supports lossless data compression.
PSF         (Point Spread Function)
QServ       The LSST database management system.


R           A programming language and software environment for statistical computing and graphics.
RDBMS       (Relational DataBase Management System) A DBMS that represents data using a relational database.
Relational database  A database that stores data in a structure consisting of one or more tables (aka relations) of rows and columns, which may be interconnected.
ReST        (Representational State Transfer) An approach to web services that uses the standard HTTP GET and POST methods.
SAD         (System Architecture Design) document
SAMP        (Simple Applications Messaging Protocol) A VO protocol for desktop messaging.
SCS         (Simple Cone Search)
SDM         (Science Data Management) group
SDSS        (Sloan Digital Sky Survey)
SED         (Spectral Energy Distribution) Plot of brightness or flux density versus frequency or wavelength.
SExtractor  A program that builds a catalogue of objects from an astronomical image.
SIA/SIAP    (Simple Image Access Protocol) A VO protocol that supports queries for images available in a given data collection near a given position on the sky.
SMASH       (Survey of the MAgellanic Stellar History) PI: Nidever
SMC         (Small Magellanic Cloud)
SN          (Supernova)
SQL         (Structured Query Language) The standard language used to communicate with RDBMSs.
SQLite      A software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
SRD         (Science Requirements Document)
SSH         (Secure Shell)
SSA         (Simple Spectral Access) A VO protocol for spectral query/retrieval.
SSO         (Single Sign-On)
SUC         (Science Use Case) document
SVC         An abbreviation for a Web service.
SWIG        (Simplified Wrapper and Interface Generator) An open source software tool used to connect C or C++ programs or libraries with scripting languages.
TAP         (Table Access Protocol) A VO protocol for querying remote databases.
TB          (Terabyte) 10^12 bytes or 1,000,000,000,000 bytes (base 10)
TiB         (Tebibyte) 2^40 bytes or 1,099,511,627,776 bytes (base 2)
TCP         (Transmission Control Protocol) One of the core protocols of the Internet protocol suite, commonly referred to as TCP/IP.
TSV         (Tab-Separated Values) A simple file format often used to move tabular data between computer programs that support the format, e.g., transferring information from a database program to a spreadsheet.

URI         (Uniform Resource Identifier) An address standard for a resource available on the Internet.
URL         (Uniform Resource Locator) The global address of documents and other resources on the World Wide Web. The address contains two parts: a specification of the protocol to be used in accessing the resource, and its network location.
UWS         (Universal Worker Service) A pattern that defines how to manage asynchronous execution of jobs on a service.
VAO         (Virtual Astronomical Observatory) The US VO project.
VM          (Virtual Machine)
VO          (Virtual Observatory)
VOSI        (VO Support Interfaces) The minimum interface that a SOAP or REST-based web service requires for compatibility with the IVOA.
VOSpace     The IVOA interface to distributed storage that specifies how VO agents and applications can use network attached data stores to persist and exchange data in a standard way.
XML         (eXtensible Markup Language)
2MASS-PSC   (2 Micron All Sky Survey – Point Source Catalog)


Appendix II: List of Figures

Page 7    Figure 1.3: Context Diagram for the NOAO Data Lab.
Page 8    Figure 2.1: Data Lab software architecture diagram.
Page 17   Figure 2.4: The Data Lab deployment diagram.
Page 21   Figure 4.2: Architecture of the Virtual Storage service Docker container.
Page 23   Figure 4.5: Example uses of downloadable Data Lab component
Page 27   Figure 5.6(a): Components of a Linux task container.
Page 28   Figure 5.6(b): Breakdown of task execution for synchronous and asynchronous jobs.