Keynote on 2015 Yale Day of Data

Big Data & Analytics: Five Trends and Five Research Challenges. Robert Grossman, University of Chicago & Open Data Group. September 18, 2015

Transcript of Keynote on 2015 Yale Day of Data

Page 1: Keynote on 2015 Yale Day of Data

Big Data & Analytics: Five Trends and Five Research Challenges

Robert Grossman, University of Chicago & Open Data Group

September 18, 2015

Page 2: Keynote on 2015 Yale Day of Data

Part  1  What  is  Big  Data?  

Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.

Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.

Page 3: Keynote on 2015 Yale Day of Data

• Volume
• Velocity
• Variety
• Veracity
• Value

• Megabytes
• Gigabytes
• Terabytes
• Petabytes
• Exabytes
• Zettabytes

Page 4: Keynote on 2015 Yale Day of Data

The Name Changes
1830  statistics
1980  computationally intensive statistics
1993  data mining & knowledge discovery in databases
1997  business analytics
2004  predictive analytics
2011  big data, data science & data analytics

Source:  Google  Trends,  www.google.com/trends  

Page 5: Keynote on 2015 Yale Day of Data

What is Big Data? (Operations POV)

A marketing term introduced by O’Reilly:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.

Page 6: Keynote on 2015 Yale Day of Data

What  is  Big  Data?  (POV:  New  Types  of  Data  that  IT  Cannot  Manage)  

 

Period   New types of data                          Term used
1990s    Clicks on the Internet, POS transactions   Data mining
2000s    Unstructured data, graph data              Predictive analytics
2010s    Mobile data, IoT data                      Big data

Page 7: Keynote on 2015 Yale Day of Data

What  Is  Small  Data?  

• 100 million movie ratings
• 480 thousand customers
• 17,000 movies
• From 1998 to 2005
• Less than 2 GB of data.
• Fits into memory, but very sophisticated models were required to win.
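The "fits into memory" point can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical compact encoding per rating (4-byte customer id, 2-byte movie id, 1-byte rating, 2-byte date); these field widths are illustrative assumptions and are not stated in the talk.

```python
# Rough in-memory footprint of the ratings data described on the slide.
# ASSUMPTION: the per-field byte widths below are hypothetical; they are one
# plausible compact encoding, not the layout used by any real system.
CUSTOMER_ID_BYTES = 4   # 480 thousand customers need more than 2 bytes
MOVIE_ID_BYTES = 2      # 17,000 movies fit in 16 bits
RATING_BYTES = 1        # a small integer score
DATE_BYTES = 2          # days since a start date, 1998 to 2005

BYTES_PER_RATING = CUSTOMER_ID_BYTES + MOVIE_ID_BYTES + RATING_BYTES + DATE_BYTES


def footprint_bytes(n_ratings: int, bytes_per_rating: int = BYTES_PER_RATING) -> int:
    """Return the total size in bytes of n_ratings fixed-width records."""
    return n_ratings * bytes_per_rating


total = footprint_bytes(100_000_000)
print(f"{total / 1e9:.1f} GB")
```

Under these assumptions the whole data set is well under the 2 GB quoted on the slide, which is why the difficulty lay in the models rather than in the data handling.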

Page 8: Keynote on 2015 Yale Day of Data

What  are  the  origins  of  big  data?  

Page 9: Keynote on 2015 Yale Day of Data

Basic  Choice  with  Hardware:  Scale  Up  or  Out  

Scale up: more memory, more processors, more disk ($K); specialized hardware (e.g. connects) ($100K); specialized devices ($M)

Scale out: one machine; cluster (racks) ($100K); cyber pod ($M); distributed cyber pods ($10M+)

Page 10: Keynote on 2015 Yale Day of Data

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/

Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).

Page 11: Keynote on 2015 Yale Day of Data

The  Google  Data  Stack  

• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)

Page 12: Keynote on 2015 Yale Day of Data

Source: Terence Kawaja, http://www.slideshare.net/tkawaja

Page 13: Keynote on 2015 Yale Day of Data

• The leaders in big data analytics measure data in megawatts.
  – As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
  – Facebook’s new Prineville data center is 30 MW.

 

What  is  Big  Data?  (My  computer  is  a  data  center  POV)  

Page 14: Keynote on 2015 Yale Day of Data

Part 2 What is Analytics?

Source:  Aaron  Parecki,  Everywhere  I’ve  Been,  aaronparecki.com.  

Page 15: Keynote on 2015 Yale Day of Data

What is Analytics?

Short Definition
• Using data to make decisions.

Longer Definition
• Using data to take actions and make decisions using models that are statistically valid and empirically derived.

Definition of Statistics from ASA web page:
• Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …

Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.

Page 16: Keynote on 2015 Yale Day of Data

[Figure: timeline of the field]
1984  Computationally Intensive Statistics
1993  Data Mining & KDD
2004  Predictive Analytics
2011  Big Data & Data Science

Data sources shown: Direct marketing, POS, Internet, Devices/IoT
Algorithms shown: ID3 & C4.5, PageRank, Spanner TX algorithm

Page 17: Keynote on 2015 Yale Day of Data
Page 18: Keynote on 2015 Yale Day of Data

1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).

2. Compute where to put additional armor to maximize the chance that planes return.
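One common reading of this puzzle, usually attributed to Abraham Wald's wartime work, is that the holes b_ij are only counted on planes that returned, so sections with few observed holes may be the vulnerable ones. The sketch below illustrates that reasoning under loudly simplified assumptions: each plane takes one hit, hits are spread evenly across sections, and the survival probabilities are hypothetical numbers chosen for illustration, not data from the talk.

```python
# Survivorship-bias sketch for the armor problem.
# ASSUMPTIONS (hypothetical, for illustration only):
#   - each plane takes exactly one hit,
#   - hit_share[j] is the chance the hit lands in section j,
#   - survival_given_hit[j] is the chance the plane returns after that hit.
SECTIONS = ["tail", "wing", "fuselage", "other"]


def observed_hole_share(hit_share, survival_given_hit):
    """Share of holes per section among planes that returned."""
    weights = [q * s for q, s in zip(hit_share, survival_given_hit)]
    total = sum(weights)
    return [w / total for w in weights]


hit_share = [0.25, 0.25, 0.25, 0.25]       # hits spread evenly
survival_given_hit = [0.9, 0.8, 0.4, 0.9]  # made-up survival rates

observed = observed_hole_share(hit_share, survival_given_hit)
least_observed = SECTIONS[observed.index(min(observed))]

for name, share in zip(SECTIONS, observed):
    print(f"{name:9s} {share:.2f}")
print("fewest holes on returning planes:", least_observed)
```

With evenly spread hits, the section showing the fewest holes on returning planes is the one with the lowest survival rate, which suggests armoring where the returning planes are undamaged rather than where they are damaged.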

Page 19: Keynote on 2015 Yale Day of Data

Part  3.  Data  Science  

Page 20: Keynote on 2015 Yale Day of Data

A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion.

Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732

Some fields have one billion-dollar (or more) instrument that generates big data.

Page 21: Keynote on 2015 Yale Day of Data

A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.

Some  fields  have  hundreds  or  thousands  of  million  dollar  instruments  that  in  aggregate  produce  big  data.  

Page 22: Keynote on 2015 Yale Day of Data

Some  fields  have  millions  of  hundred  dollar  sensors  that  in  aggregate  produce  big  data.  

Page 23: Keynote on 2015 Yale Day of Data

Math & Statistics

Computer  Science  

Disciplinary  Science  

Data  Science  

Page 24: Keynote on 2015 Yale Day of Data

Understanding Salmon (A Cautionary Tale)

Source: Salmo salar, (Atlantic Salmon), wikipedia.org

Page 25: Keynote on 2015 Yale Day of Data

Methods  

Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.

Page 26: Keynote on 2015 Yale Day of Data

Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.

Page 27: Keynote on 2015 Yale Day of Data

The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
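The salmon result can be put in numbers. The sketch below is a minimal illustration, assuming (as a simplification not stated on the slides) that each of the 8064 voxels is tested independently at an uncorrected per-test level of 0.001; real fMRI voxels are spatially correlated, so these figures are only indicative.

```python
# How many "significant" voxels does chance alone produce?
# ASSUMPTIONS: independent tests and an uncorrected per-test alpha of 0.001.
N_VOXELS = 8064          # search volume from the slide
PER_TEST_ALPHA = 0.001   # assumed uncorrected threshold
FAMILY_ALPHA = 0.05      # assumed target family-wise error rate


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that pass by chance when every null is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Family-wise error rate for independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, family_alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate below family_alpha."""
    return family_alpha / n_tests


expected = expected_false_positives(N_VOXELS, PER_TEST_ALPHA)
fwer = prob_at_least_one_false_positive(N_VOXELS, PER_TEST_ALPHA)
threshold = bonferroni_threshold(N_VOXELS, FAMILY_ALPHA)

print(f"expected false positives: {expected:.1f}")
print(f"P(at least one false positive): {fwer:.4f}")
print(f"Bonferroni per-test threshold: {threshold:.2e}")
```

Under these assumptions roughly eight voxels would pass by chance, so a handful of "active" voxels in a dead fish is unsurprising. Bonferroni is used here only because it is the simplest correction; less conservative procedures exist.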

Page 28: Keynote on 2015 Yale Day of Data

Part  4.  What  Instrument  Do  we  Use  to    Make  Discoveries  in  Data  Science?  

How  do  we  build  a  “datascope?”  

Page 29: Keynote on 2015 Yale Day of Data

experimental  science  

simulation science

1609 30x

1670 250x

1976 10x-100x

data  science  

Page 30: Keynote on 2015 Yale Day of Data

experimental  science  

simulation science

data  science  

1609  30x  

1670  250x  

1976 10x-100x

2004 10x-100x

“Cyberpod”  

Page 31: Keynote on 2015 Yale Day of Data

Could we continuously re-analyze the world’s cancer data?

Page 32: Keynote on 2015 Yale Day of Data

Complex statistical models over small data that are highly manual and updated infrequently.

Simpler statistical models over large data that are highly automated and updated frequently.

memory   databases  

GB   TB   PB  

W   KW   MW  

datapods  

cyber  pods  

Page 33: Keynote on 2015 Yale Day of Data

Part  5  Five  Trends  

Source:  Google  Trends,  for  term  “data  commons”,  www.google.com/trends.  

Page 34: Keynote on 2015 Yale Day of Data

Trend  1  Data  Commons  

Source:  NEXRAD,  NOAA,  www.noaa.org  

Page 35: Keynote on 2015 Yale Day of Data

The Standard Model of Biomedical Computing No Longer Works

Public  data  repositories  

Private  local  storage  &  compute  

Network  download  

Local  data  ($1K)  

Community software

Software, sweat and tears ($100K)

Page 36: Keynote on 2015 Yale Day of Data

Data  Commons  

Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.

Source:  Interior  of  one  of  Google’s  data  centers,  www.google.com/about/datacenters/  

Page 37: Keynote on 2015 Yale Day of Data

Open Science Data Cloud (Open Cloud Consortium, 2012)

NCI Data Commons (UChicago, Nov 2015)

Bionimbus Protected Data Cloud (UChicago, 2013)

NOAA Data Commons (Open Cloud Consortium, Oct 2015)

Page 38: Keynote on 2015 Yale Day of Data
Page 39: Keynote on 2015 Yale Day of Data
Page 40: Keynote on 2015 Yale Day of Data

Purple  balls  are  lung  adenocarcinoma.    Grey  are  lung  squamous  cell  carcinoma.    Green  are  misdiagnosed.    

Page 41: Keynote on 2015 Yale Day of Data

Hospitals,  medical  research  centers  and  doctors  

Data  commons  containing    genomic  and  clinical  data.  

Patients

Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.

Page 42: Keynote on 2015 Yale Day of Data

Trend 2 Analytics of Things, People and Places

Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/

Page 43: Keynote on 2015 Yale Day of Data

People and things generating streaming data that are relevant for research.

Page 44: Keynote on 2015 Yale Day of Data

Places that generate data

Source: Jane Macfarlane, Here, a Division of Nokia.

Page 45: Keynote on 2015 Yale Day of Data

Trend 3 Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis

Source: M. Bostock, http://bl.ocks.org/mbostock/4063318

Page 46: Keynote on 2015 Yale Day of Data

Portable Format for Analytics (PFA)

Predictive Model Markup Language (PMML)

Grammar  of  Graphics  

d3.js  

Page 47: Keynote on 2015 Yale Day of Data

Trend 4 More Policies That Make Data Available and Analytics Repeatable

Page 48: Keynote on 2015 Yale Day of Data

Executive Order 13642 (May 9, 2013): Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)

OMB  Guidance  President’s  Ex  Order  

Page 49: Keynote on 2015 Yale Day of Data

Trend 5 Translational Data Science

How do we translate data driven discoveries into actions that impact society?

Page 50: Keynote on 2015 Yale Day of Data

[Figure: Translational Informatics]

Fields: Imaging Informatics, Clinical Informatics, Bioinformatics, Public Health Informatics

Stages: Basic Research, Applied Research, Practice (dx, treatment and prevention)

Scales: Molecular & cellular processes, Tissues & organs, Individuals (patients), Groups & populations, Quality & outcomes

Page 51: Keynote on 2015 Yale Day of Data

New algorithms, new statistical models (data science)

Applications to genomics, analysis of EMR, etc.

Software stacks for data intensive computing (data engineering)

Data driven discoveries

Data driven diagnosis

Data driven therapeutics

Develop software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)

Translational Data Science

Page 52: Keynote on 2015 Yale Day of Data

Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.

Page 53: Keynote on 2015 Yale Day of Data

Part 6 Five Challenges

Page 54: Keynote on 2015 Yale Day of Data

Challenge  1.  Is  More  Different?      

Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.

Do  New  Phenomena  Emerge  at  Scale  in  Data?  

Page 55: Keynote on 2015 Yale Day of Data

Challenge  2.  One  Million  Genomes  

• Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
• Think of this as one hundred studies with 10,000 patients each over three years.
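The figures in this list follow from simple arithmetic, which the sketch below reproduces using decimal units (1 TB = 10^12 bytes). The 10x compression ratio is an assumption inferred from the "about 100 PB" estimate; the talk does not state a ratio.

```python
# Back-of-the-envelope check of the one-million-genomes numbers.
TB = 10**12
PB = 10**15
EB = 10**18

GENOMES = 1_000_000
BYTES_PER_PATIENT = 1 * TB          # tumor plus normal tissue, per the slide
COST_PER_GENOME_USD = 1_000
PATIENTS_PER_STUDY = 10_000

# ASSUMPTION: a 10x compression ratio, inferred from the slide's "about 100 PB".
ASSUMED_COMPRESSION_RATIO = 10

raw_bytes = GENOMES * BYTES_PER_PATIENT
compressed_bytes = raw_bytes // ASSUMED_COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD
studies_needed = GENOMES // PATIENTS_PER_STUDY

print(f"raw:        {raw_bytes // PB} PB ({raw_bytes // EB} EB)")
print(f"compressed: {compressed_bytes // PB} PB")
print(f"sequencing: ${sequencing_cost_usd:,}")
print(f"studies of {PATIENTS_PER_STUDY:,} patients: {studies_needed}")
```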

Page 56: Keynote on 2015 Yale Day of Data

Challenge  3.    Datapods  

• Databases have fundamentally changed the way we manage and analyze scientific data.
• NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.
• If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.
• Call this a “datapod.”
• It could support open source data commons and allow them to peer.

Page 57: Keynote on 2015 Yale Day of Data

Challenge 4. A Billion Predictive Models

• Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models.
• Applications
  – George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).
  – 1 million cancer genomes x 1,000 models / genome.
  – Urban science: instrumenting cities.
  – Consumer marketing: large advertisers will see 1-3 billion different consumers.
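To make "heterogeneous segmented models" concrete, the sketch below fits one small model per segment and picks the model type from the amount of data in that segment. It is a toy illustration under stated assumptions: the record layout `(segment, x, y)`, the function names, and the rule "use a line when a segment has at least three points, otherwise a constant" are all hypothetical and not taken from the talk.

```python
# Toy sketch of automatically generated, heterogeneous, segmented models.
# ASSUMPTIONS: records are (segment_key, x, y) tuples and the model-selection
# rule is a simple size cutoff; both are hypothetical choices for illustration.
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, float, float]
Model = Callable[[float], float]


def fit_constant(xs: List[float], ys: List[float]) -> Model:
    """Predict the segment mean regardless of x."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: List[float], ys: List[float]) -> Model:
    """Ordinary least-squares line; falls back to a constant if x never varies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return fit_constant(xs, ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_segmented(records: Iterable[Record], min_points_for_line: int = 3) -> Dict[str, Model]:
    """Group records by segment and fit one model per segment."""
    groups: Dict[str, Tuple[List[float], List[float]]] = defaultdict(lambda: ([], []))
    for key, x, y in records:
        groups[key][0].append(x)
        groups[key][1].append(y)
    models: Dict[str, Model] = {}
    for key, (xs, ys) in groups.items():
        fit = fit_line if len(xs) >= min_points_for_line else fit_constant
        models[key] = fit(xs, ys)
    return models


records = [("a", 0, 1), ("a", 1, 3), ("a", 2, 5), ("b", 0, 10), ("b", 5, 14)]
models = fit_segmented(records)
print(models["a"](3))    # segment "a" gets a fitted line
print(models["b"](100))  # segment "b" is too small, so it gets a constant
```

The same pattern, one independent fit per segment, is what makes the count multiply: 1 million genomes with 1,000 models each is already a billion models, so the fitting and selection steps have to run without manual intervention.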

Page 58: Keynote on 2015 Yale Day of Data

Challenge  5.    HDSI  

• Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.
• Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting the analysis of data science at the scale of datapods with billions of models and trillions of hypotheses.
• How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?

Page 59: Keynote on 2015 Yale Day of Data

Questions?

rgrossman.com  @bobgrossman  

Page 60: Keynote on 2015 Yale Day of Data

For More Information

cdis.uchicago.edu  

www.opendatagroup.com  

rgrossman.com