DataHub - University of Washingtonnwds.cs.washington.edu/files/nwds/pdf/anantb-datahub-talk.pdf ·...

DataHub: A hosted platform for organizing, managing, sharing, collaborating, and processing data. Anant Bhardwaj, MIT CSAIL


Page 1:

DataHub: A hosted platform for organizing, managing, sharing, collaborating, and processing data

Anant Bhardwaj, MIT CSAIL

Page 2:

Anant Bhardwaj, Sam Madden, David Karger

Elizabeth Bruce, Eugene Wu

Aaron Elmore, Aditya Parameswaran, Amol Deshpande

Page 3:

A bit of history: once upon a time

[Diagram: Application 1 and Application 2, each with User 1, User 2 and User 3; DB Servers (RDBMS)]

Page 4:

A bit of history: once upon a time

[Diagram: Application 1 (SQL); Application 2 (SQL); DB Servers (RDBMS); ETL Tools; Network; Data Warehouse (RDBMS); Analytics + BI (SQL)]

Page 5:

Fast Forward (2015)

[Diagram: Application 1 (SQL); Application 2 (JSON/CSV); Data Layer; Data Warehouse; Analytics + BI; Map/Reduce; HDFS; HiveQL; SparkQL; Hive; Spark; Pig; …; SQL; HBase; Cassandra; Data Lake (Data Swamp)]

Page 6:

CIDR 2015

“The value of data is directly proportional to the degree to which it is accessible. As this data grows in size, in type, in dimensionality and in complexity, accessibility becomes of paramount importance. Accessibility means a very low barrier of entry: allowing product designers, innovators and non-technical users to explore and navigate the store without requiring them to be familiar with Big Data tools or in-depth understanding of data schemas and information models. Search and navigation support for identifying the data necessary to solve a particular analytical problem is lacking.”

Page 7:

Accessibility means a very low barrier of entry

allowing product designers, innovators and non-technical users to explore and navigate the store

without requiring them to be familiar with Big Data tools, data schemas and information models

Page 8:

Data Lake or Data Swamp

[Diagram: Application 1 (SQL); Application 2 (JSON/CSV); Data Layer; Data Warehouse; Analytics + BI; Map/Reduce; HDFS; HiveQL; SparkQL; Hive; Spark; Pig; …; SQL; HBase; Cassandra]

Page 9:

Lessons

•  End users don’t care about data models
    •  They care about tasks
    •  They will choose tools that can get their tasks done

•  Supporting end-user tasks is very important
    •  People love to use MS Excel and Google Sheets to whatever extent they can

Page 10:

Can we design a data management system with end users in mind?

Page 11:

Who are the end users?

•  Programmers
    •  They choose the data-management system that best fits their app (a JavaScript developer is likely to choose a JSON-based data model)

•  Data Scientists
    •  They care about extracting information from the data (data analysis)

•  Business Owners
    •  They care about numbers/charts/graphs (data-driven decision making)

•  Data Administrators
    •  They care about privacy, access control, compliance

Page 12:

A sample user class: Journalists

•  Typical problems
    •  How to get the data files from data.gov into Excel
        •  Ingest problem
    •  How to query (especially by linking data across files)
        •  Query problem (SQL is not the answer)
    •  How to quickly make sense of it
        •  A visualization that summarizes the data
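
To make the "linking data across files" problem concrete, here is a minimal Python sketch of what a journalist would otherwise have to do by hand: join two CSV files on a shared column and summarize the result. The file contents, column names, and the helpers read_rows and link are invented for illustration; they are not part of DataHub or data.gov.

```python
import csv
import io

# Hypothetical stand-ins for two downloaded CSV files (toy values).
SCHOOLS_CSV = """district_id,school,enrollment
D1,Lincoln High,1200
D2,Roosevelt High,950
D1,Adams Middle,600
"""

BUDGETS_CSV = """district_id,district,budget_usd
D1,North,5000000
D2,South,3200000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def link(left_rows, right_rows, key):
    """Inner-join two row lists on a shared column (a simple hash join)."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            # Merge the matching rows; left-hand values win on name clashes.
            joined.append({**match, **row})
    return joined


linked = link(read_rows(SCHOOLS_CSV), read_rows(BUDGETS_CSV), "district_id")

# Total enrollment per district name, which needs columns from both files.
enrollment_by_district = {}
for row in linked:
    name = row["district"]
    enrollment_by_district[name] = (
        enrollment_by_district.get(name, 0) + int(row["enrollment"])
    )
```

Even this small join takes some programming, which is the barrier the slide points at.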

Page 13:
Page 14:
Page 15:
Page 16:

Introducing DataHub

a unified data management and collaboration platform for making data processing easy

Page 17:

DataHub Platform

•  A flexible data store (files, relational databases, extensible to other backends) with sharing/collaboration capabilities, managed on behalf of different users/groups

•  An app ecosystem that hosts apps for various data-processing activities
    •  Apps for ingestion, curation, integration, analytics, visualization, and machine learning
    •  A new application can be written and published to the DataHub App Center using our Thrift SDK (20+ supported languages)
    •  DataHub users can use any of the apps from the App Center to process their data as it fits their needs
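
As a rough illustration of a data store "managed on behalf of different users/groups" with sharing, the Python sketch below models a repository with an owner and a set of collaborators. The Repository class and its methods are hypothetical and do not reflect DataHub's actual API or storage design.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical model of a shared data repository (not DataHub's API)."""

    owner: str
    name: str
    collaborators: set = field(default_factory=set)
    tables: dict = field(default_factory=dict)

    def share_with(self, user):
        """Grant another user access to this repository."""
        self.collaborators.add(user)

    def can_access(self, user):
        """Only the owner and explicit collaborators may access the data."""
        return user == self.owner or user in self.collaborators

    def read(self, user, table):
        """Return a table's rows if the user is allowed to see them."""
        if not self.can_access(user):
            raise PermissionError(f"{user} cannot access {self.owner}/{self.name}")
        return self.tables[table]


# Usage with toy data: alice owns the repository and shares it with bob.
repo = Repository(owner="alice", name="demo")
repo.tables["readings"] = [("sensor-1", 10), ("sensor-2", 12)]
repo.share_with("bob")
rows_for_bob = repo.read("bob", "readings")
```

The sketch only shows that access is decided per repository and per user; a real system would also need groups, versioning, and persistent storage.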

 

Page 18:

DataHub Platform

[Diagram: DataHub Core (Data; Users/Groups; Repository/Catalogs; Sharing/Collaboration; Versioning) and App Ecosystem (Ingest/ETL; Data Science; Visualization; Clustering; Sentiment Analysis; …)]

Page 19:

Demo Time

Page 20:

Accessibility means a very low barrier of entry

allowing product designers, innovators and non-technical users to explore and navigate the store

without requiring them to be familiar with Big Data tools, data schemas and information models

Page 21:

The real challenge today is not the data itself but

an ecosystem for solving data-processing tasks which end users can use

Page 22:

DataHub: an ecosystem for end users

•  Powerful, easy-to-use collaboration
    •  Collaboration on data has never been so easy before

•  Non-technical users can do stuff
    •  It gives non-technical users the ability to find apps that best fit their data-processing needs
    •  They can find apps for clustering, sentiment analysis, prediction, data cleaning, or anything else from the App Center

•  Data scientists can use tools/algorithms of their choice
    •  Data scientists can use R, Matlab, Julia, Java, or anything else they are most comfortable with
    •  Multiple data scientists can collaborate seamlessly because an intermediate result computed in one language can be used in another language (DataHub enables true collaboration)

•  Programmers can use DataHub in almost any language (20+ Thrift-supported languages)
    •  DataHub can be used as a real backend in any web application, iPhone app, Android app, scripting language (Python, PHP, Ruby, etc.), or systems language (Java, C++, Go, etc.)

Page 23:

Architectural Challenges

•  A new architecture for storing and managing large numbers of diverse datasets
    •  Managed on behalf of different users/groups with sharing/collaboration capabilities

•  An infrastructure for hosting a large number of data-processing apps (the app ecosystem)

•  Automating large parts of the data science pipeline by letting users connect DataHub apps
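
One way to picture "letting users connect DataHub apps" is as function composition, where each app consumes the previous app's output. The Python sketch below is an assumption-laden toy: ingest, clean, summarize, and connect are hypothetical names, and real DataHub apps would be separate services rather than in-process functions.

```python
from functools import reduce


def ingest(raw_lines):
    """Hypothetical ingest app: split 'name, value' lines into records."""
    return [tuple(part.strip() for part in line.split(",", 1)) for line in raw_lines]


def clean(records):
    """Hypothetical cleaning app: keep only records with a numeric value."""
    cleaned = []
    for record in records:
        if len(record) != 2:
            continue  # malformed line without a comma
        name, value = record
        try:
            cleaned.append((name, float(value)))
        except ValueError:
            continue  # non-numeric value such as "n/a"
    return cleaned


def summarize(records):
    """Hypothetical analytics app: count the records and average the values."""
    values = [value for _, value in records]
    mean = sum(values) / len(values) if values else None
    return {"count": len(values), "mean": mean}


def connect(*apps):
    """Chain apps so that each one receives the previous one's output."""
    def pipeline(data):
        return reduce(lambda acc, app: app(acc), apps, data)
    return pipeline


# Usage: a three-stage pipeline over toy input.
pipeline = connect(ingest, clean, summarize)
summary = pipeline(["a, 1", "b, n/a", "c, 3"])
```

The design point is that each stage depends only on the shape of its input, so a user could swap one app for another without touching the rest of the pipeline.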

Page 24:

Data Science + Design Challenges

•  Enabling easy ingestion, management, cleaning, analysis, and visualization of large volumes of potentially unstructured and noisy data.

•  Designing the user experience (the human-computer interaction aspect) to enable non-experts, such as social scientists and journalists, to effectively use the system and complete their tasks.

•  Providing powerful capabilities for expert users to perform more complex reasoning tasks and analytics than what is possible today.