It’s About Data: 50,000 ft Overview of Data To Insight Center. Beth Plale, Professor and Director, Data To Insight Center


Page 1: It’s%AboutData:%% 50,000 0%Overview%of%DataTo% …salsahpc.indiana.edu/summerworkshop2013/slides/beth.pdf · 2013-08-23 · rs1 rs2 rs3 rs4 rs5 rs6 rs9 ru2 ru3 ru4 ru5 ru7 gs1 gs2

It’s About Data: 50,000 ft Overview of Data To Insight Center

Beth Plale
Professor and Director, Data To Insight Center


Dataset: D2I-AMSR-E-Provenance Dataset

Owner and Creator: Data to Insight Center
Size: 15MB

The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.

Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D


Dataset: 10GB Noisy Provenance Dataset

Owner and Creator: Data to Insight Center
Size: 10GB

Provenance of scientific data is a key piece of the metadata record for data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly; however, the protocol between application and provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, and simply inaccurate. The Gigabyte Synthetic Provenance Database is a “noisy” data collection generated using the Workflow Emulator Tool (WORKEM) with a number of scientific workflow examples that include modeled failures.

http://d2i.indiana.edu/provenance_gigabyte_database


Dataset: D2I-Vortex2 Dataset

Owner and Creator: Data to Insight Center
Size: 100GB

D2I generated a couple hundred short-term (12-15 hour) regional (state in size) weather forecasts in late spring 2010. These forecasts were generated to support the Vortex2 campaign. Funded by the National Science Foundation, Vortex2 was an effort to move sensitive mobile data gathering instruments to the location of severe weather activity during late spring tornado season. The Vortex2 campaign covered locations from Texas to Wisconsin over the course of six weeks. The dataset was carefully curated in 2011 with metadata added for each forecast. It has since become a significant dataset, used in subsequent projects to develop other research products such as an ontology and federated searching of research data metadata.

Plale, Beth, Brewster, Keith, Mattocks, Craig, Bhangale, Ashish, Withana, Eran C., Herath, Chathura, Terkhorn, Felix, Chandrasekar, Kavitha. Weather Forecast Data from the D2I-Vortex2 project. May 1 to June 15, 2010. Bloomington, Indiana: Data to Insight Center. https://scholarworks.iu.edu/dspace/handle/2022/14983


Visualization of Network Data Provenance

With Global Research Network Operations Center (GNOC)


Multi-layer provenance capture

•  Provenance: lineage of a data product or collection of data resulting from computational execution of some sort

•  Applications often run in multiple phases, on multiple machines, and over multilayer software stacks. Provenance can be captured at multiple layers of the stack.

What if provenance is captured just at layers below the application?



Case I: GENI WiMAX DDoS

WiMAX network; DoS attack exploiting WiMAX system parameters. Experiment uses 100 subscribers with varied configurations of 6 parameters. Current version runs on NS2.

With Clemson University


Provenance of WiMAX DDoS Experiment

•  Provenance capture with NetKarma. NetKarma

   –  captures provenance of packet movement, and

   –  infers critical provenance about packets that were dropped, and by doing so is able to convey information about DDoS attacks through visualization

•  Improvement over earlier hand-worked ANOVA analysis.


Provenance captures causality: dropped packets increase as frame duration increases from 0.01s to 0.02s

Run id   Frame duration   number of attackers   attack backoff start   attack request retry   bw backoff start   bw request retry
1        0.004            20/80                 1                      2                      1                  2
244      0.01             20/80                 1                      2                      1                  2
487      0.02             20/80                 1                      2                      1                  2


Provenance capture AMSR-E data processing pipeline

Aug  2013   10  

Advanced Microwave Scanning Radiometer (AMSR-E): one of six sensors aboard the Aqua satellite. A passive microwave radiometer. It observes atmospheric, land, oceanic, and cryospheric parameters, including precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.


NASA AMSR-E imagery ingest processing schedule at Univ of Alabama Huntsville



Provenance History Layout Algorithm

•  Provenance for 1 month of processing in the NASA satellite ingest processing pipeline.

•  Can help trace an error back to its cause.

•  Shows relationship between daily products (each clover flower in the clover leaf chain) and final monthly products at the left end.

Provenance of a seaIce daily workflow
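Tracing an error back to its cause amounts to walking the provenance graph upstream from a suspect product. The following is a minimal sketch in Python, assuming a hypothetical edge list of (derived, source) pairs. The node names are invented for illustration and are not taken from the AMSR-E provenance dataset.

```python
from collections import defaultdict, deque

def upstream(edges, start):
    """Return every node that `start` was derived from, directly or indirectly.

    `edges` is an iterable of (derived, source) pairs, a hypothetical
    simplification of provenance relations such as "used" and
    "was generated by".
    """
    parents = defaultdict(set)
    for derived, source in edges:
        parents[derived].add(source)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Invented example: two daily products feed one monthly product.
edges = [
    ("monthly_product", "monthly_process"),
    ("monthly_process", "daily_product_1"),
    ("monthly_process", "daily_product_2"),
    ("daily_product_1", "daily_process_1"),
    ("daily_process_1", "raw_swath_1"),
    ("daily_product_2", "daily_process_2"),
    ("daily_process_2", "raw_swath_2"),
]

lineage = upstream(edges, "monthly_product")
print(sorted(lineage))
```

Starting from a faulty monthly product, the returned set lists every daily product, process, and input that could have contributed to the fault.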


Provenance graph compare: failed runs

The provenance graph on the left is the complete provenance of a successful execution. Comparing it with the provenance graph on the right shows that the right one is a failure, because the final data product (green) in the left graph cannot be matched.


Graph compare: dropped provenance

The left graph is the provenance of a successful execution. The right graph is also from a successful execution, but the comparison shows that notifications were dropped during provenance capture: the nodes can be matched, but some edges in the left graph cannot.
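The two comparisons above can be sketched as a set difference between a reference graph and an observed graph. This is only an illustration under stated assumptions: nodes are matched by identical label, the caller supplies the names of the final data products, and the graphs below are invented. It is not the comparison algorithm used to produce the slides.

```python
def compare_provenance(reference, observed, final_products):
    """Classify `observed` against a `reference` graph of a successful run.

    Each graph is a dict with "nodes" and "edges" sets. Matching is by
    identical node label, an assumption made for this sketch.
    """
    missing_nodes = reference["nodes"] - observed["nodes"]
    missing_edges = reference["edges"] - observed["edges"]

    if missing_nodes & set(final_products):
        # The final data product never appeared: treat the run as failed.
        return "failed run", missing_nodes, missing_edges
    if missing_nodes or missing_edges:
        # The final product exists but part of the record is absent.
        return "dropped provenance", missing_nodes, missing_edges
    return "match", missing_nodes, missing_edges

# Invented three-node example.
reference = {
    "nodes": {"input", "process", "product"},
    "edges": {("process", "input"), ("product", "process")},
}
failed = {"nodes": {"input", "process"}, "edges": {("process", "input")}}
dropped = {
    "nodes": {"input", "process", "product"},
    "edges": {("product", "process")},
}

print(compare_provenance(reference, failed, ["product"])[0])
print(compare_provenance(reference, dropped, ["product"])[0])
```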


Hathi Trust Research Center

Text mining at scale

#HTRC #HathiTrust


→  HathiTrust is a large corpus providing opportunity for new forms of computational investigation.

→  The bigger the data, the less able we are to move it to a researcher’s desktop machine.

→  Future research on large collections will require that computation moves to the data, not vice versa.


HTRC Partners

IU SoIC and Libraries (Beth Plale and Robert McDonald); UIUC GLIS and Libraries (J. Stephen Downie and Beth Sandore); Brandeis (John Unsworth); University of Michigan (HathiTrust)

http://www.hathitrust.org/htrc



HTRC Non-Consumptive Research Paradigm

•  No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions, can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection.

•  Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy, which is not a user. Users are human beings.



Themes for Authors

•  Two topics with identical centralities but separate themes


Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.

Underwood et al. Research
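A yearly ratio between two wordlists can be computed by counting, per publication year, how often words from each list occur. The sketch below is a rough illustration only: the wordlists, the sample texts, and the whitespace tokenization are placeholders, not those used in the Underwood et al. research.

```python
from collections import defaultdict

def yearly_ratio(volumes, list_a, list_b):
    """Return {year: count(list_a words) / count(list_b words)}.

    `volumes` is an iterable of (year, text) pairs. Years in which no
    word from `list_b` occurs are skipped to avoid dividing by zero.
    """
    a_words, b_words = set(list_a), set(list_b)
    counts = defaultdict(lambda: [0, 0])
    for year, text in volumes:
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token in a_words:
                counts[year][0] += 1
            elif token in b_words:
                counts[year][1] += 1
    return {year: a / b for year, (a, b) in sorted(counts.items()) if b > 0}

# Placeholder data, invented for this example.
volumes = [
    (1750, "The house stood by the river."),
    (1750, "A domicile near the stream"),
    (1850, "The house, the house, and a domicile."),
]
ratios = yearly_ratio(volumes, ["house", "river"], ["domicile", "stream"])
print(ratios)
```

Plotting one such series per genre would give curves comparable in form to the figure described above.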


Corpus Usage Patterns

[Figure: three ways of accessing a volume: access by chapter; access by page; access by special contents (table of contents, index, glossary)]


•  Philosophy: computation moves to data

•  REST based Web services architecture and protocols

•  Registry of services and algorithms

•  Solr full text index

•  noSQL store as volume store

•  openID authentication

•  Portal front-end, programmatic access

•  SEASR text mining algos

8/13/13   22  


Studies in Social-Ecological Systems Data Management

with David Leake and Xiaozhong Liu


What is SES Framework?

Source: Ostrom, E. 2009. A General Framework for Analyzing the Sustainability of Social-Ecological Systems. Science 325: 419-422

e.g., Forests, fisheries, grazing, irrigation systems for agriculture


SES Coding categories


Int’l Forestry Resources and Institutions (IFRI) database: collection

•  Data collected in 23 countries by 13 Collaborating Research Centers.

•  Data collected using research instrument with 10 forms packages, totaling 180 pages, with some packages used up to 30 times or more per visit


IFRI database: data

•  Over 18 years of longitudinal data on forest resources, use, and governance

•  Consists of 346 separate site visits

•  Relational database captures relations between data collection packages

•  Responses to each question are a column in the database; 922 questions in the IFRI database


IFRI map to SES for data discovery

IFRI data densities seen through SES ordered heatmap

[Figure: heatmap of Data Density (Z-Score, color scale -2 to 8) for Site Visits 161 - 180. Axes: study sites and SES Category. Category labels: RS1-RS6, RS9, RU2-RU5, RU7, GS1-GS8, U1-U9, I1-I6, O1-O3, S1, S3, grouped as Resource System, Resource Units, Governance System, Users, Interactions, Outcomes, and Social, Economic, Political Settings.]
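A Z-score rescales each density value by how far it lies from the mean, in units of standard deviation: z = (x - mean) / std. The sketch below assumes, hypothetically, that a density is the fraction of answered questions per site visit and SES category, and that standardization is done over all cells together. The slide does not state either choice, and the visit names and values are invented.

```python
from statistics import mean, pstdev

def density_z_scores(density):
    """Standardize a {site_visit: {category: value}} table of densities.

    Z-scores are computed over all cells together: z = (x - mean) / std.
    Returns a table of the same shape; all zeros if the values do not vary.
    """
    values = [v for row in density.values() for v in row.values()]
    mu = mean(values)
    sigma = pstdev(values)
    return {
        visit: {cat: ((v - mu) / sigma if sigma else 0.0) for cat, v in row.items()}
        for visit, row in density.items()
    }

# Invented densities for two hypothetical site visits and two SES categories.
density = {
    "visit_a": {"RS1": 0.2, "GS1": 0.4},
    "visit_b": {"RS1": 0.6, "GS1": 0.8},
}
z = density_z_scores(density)
print(z)
```

Cells with large positive scores mark site visits that are unusually rich in data for a category, which is what makes the heatmap useful for data discovery.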

 


Automated Classification of Survey Questions to SES

•  Categorization of each question in instrument to SES Framework in automated way

•  Carry out word frequency calculation on instrument questions and SES categories. Used this to cluster questions.

•  Apply machine learning to avoid need for learning data set

   –  Examined decision tree classifiers, Naïve Bayes, support vector machine (SVM). Decision tree classifiers performed best*.

*  Jensen, S., et al. IEEE Int’l Conference on e-Science, Oct 2012
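To make the word-frequency idea concrete, here is a minimal sketch of one of the classifier families named above, Naïve Bayes, written with the Python standard library only. It is not the method reported in the cited paper (where decision tree classifiers performed best), and it assumes a small set of labeled example questions is available. The questions and their category labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?()").lower() for t in text.split())
    return [t for t in tokens if t]

def train(labeled_questions):
    """Count word frequencies per category from (question, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for question, category in labeled_questions:
        word_counts[category].update(tokenize(question))
        doc_counts[category] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(question, model):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(doc_counts):
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(question):
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Invented training questions with hypothetical SES category labels.
training = [
    ("What is the size of the forest?", "Resource System"),
    ("How large is the forest area?", "Resource System"),
    ("How many households use the forest?", "Users"),
    ("How many users harvest firewood?", "Users"),
]
model = train(training)
print(classify("What is the forest area?", model))
print(classify("How many users collect firewood?", model))
```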


With IU Libraries, UMichigan (lead), and UIUC


SEAD

SEAD Virtual Archive (SVA) -- manage sustainability science

window to multiple IRs -- OAIS model

[Diagram labels: ingest, publish, associate, discover; IU Scholarworks IR, UIUC IDEALS IR, UMich Deep Blue IR]


Hathi Trust Research Center    http://www.hathitrust.org/htrc

SEAD DataNet    http://www.sead-data.net

socioeco informatics    http://d2i.indiana.edu/socio-eco-informatics

data provenance    http://d2i.indiana.edu/provenance