Reproducibility in human cognitive neuroimaging: a community-driven data sharing framework for...

Reproducibility in human cognitive neuroimaging: a community-driven data sharing framework for provenance information integration and interoperability. Nolan Nichols. Dissertation Defense, Biomedical and Health Informatics, University of Washington, Seattle, WA, USA. December 8, 2014.


Reproducibility in human cognitive neuroimaging: a community-driven data sharing framework for provenance information integration and interoperability

Nolan Nichols

Dissertation Defense Biomedical and Health Informatics

University of Washington Seattle, WA, USA December 8, 2014


Outline

• Introduction
• Background
• Research approach
• Conclusions and future directions

Outline

• Introduction
  – Motivation for Research
  – Research Goal
• Background
• Research approach
• Conclusions and future directions

Introduction: Motivation for Research

• Human Cognitive Neuroimaging
  – Investigates brain structure and function in normal and neuropsychiatric conditions to improve human health
  – Facilitates clinical decision making using imaging and cognitive phenotypes

Introduction: Motivation for Research

• Biomedical Informatics (BMI)
  – The interdisciplinary field that studies and pursues the effective use of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health
• Neuroinformatics
  – Applies BMI principles to develop techniques and tools for acquiring, sharing, storing, publishing, analyzing, modeling, visualizing and simulating data across all levels of neuroscience

Introduction: Motivation for Research

• Neuroinformatics Perspective
  – Research is a process with distinct stages
  – Provenance links together each stage

Poline et al. (2012), Frontiers in Neuroinformatics

Introduction: Motivation for Research

• Problem: research is not reproducible
  – Ioannidis JPA: Why Most Published Research Findings Are False. PLoS Med 2005
  – Donoho D: An invitation to reproducible computational research. Biostatistics 2010
  – Yong E: Replication studies: Bad copy. Nature 2012
  – Editorial: Reducing our irreproducibility. Nature 2012
  – Begley CG: Six red flags for suspect work. Nature 2013
  – Collins FS, Tabak LA: Policy: NIH plans to enhance reproducibility. Nature 2014
• Reproducibility issues exist along a spectrum
  – Statistical issues
  – Computational issues

Introduction: Motivation for Research

[Figure: Reproducibility Spectrum, with an axis labeled "Confidence in Findings"]

• Repeatable: Can the same researchers in the same lab obtain consistent results using the same methodology and data?
• Replicable: Can different researchers from a different lab obtain consistent results using the same methodology?
• Reproducible: Can different researchers from a different lab obtain consistent results using a different methodology and data?

Introduction: Motivation for Research

• Statistical issues
  – Reporting bias in brain volume (Ioannidis, 2011) and fMRI activation foci (David, 2013)
  – Lack of statistical power in neuroscience (Button, 2013)
  – Data collection and analysis methods are highly flexible across fMRI studies (Carp, 2012)
• Computational issues
  – Lack of shared data, code, and analysis environments

Introduction: Motivation for Research

• Reusable Research
  – Can different researchers from a different lab apply a methodology to process shared data from different researchers in a different lab?

Adapted from Peng (2011), Science.

Introduction: Motivation for Research

Barriers to reusable research
• Data management systems are not interoperable
• Data acquisition and analysis methods lack provenance
• Terminologies are not harmonized (e.g., brain atlases, schemas)

Poline et al. (2012), Frontiers in Neuroinformatics

Introduction: Research Goals

• To enhance the reusability of neuroimaging data and workflow code
• To advance an informatics data exchange standard that incorporates provenance as a core concept
• To engage the neuroinformatics community as a partner in the design process

Outline

• Introduction
• Background
  – Data exchange
  – Provenance
  – Linked Open Data
• Research approach
• Conclusions and future directions

Background: Data Exchange

http://xkcd.com/927/

• My goal is to extend existing standards to facilitate data reusability and interoperability

Background: Data Exchange

XCEDE XML Schema
• Experiment Hierarchy is composed of six levels of information relevant to neuroimaging data exchange
  – Project
  – Subject
  – Visit
  – Study
  – Episode
  – Acquisition

XML-based Clinical Experiment Data Exchange Schema, Gadde et al. 2012
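To make the experiment hierarchy concrete, here is a minimal Python sketch that nests one record at each level listed on the slide and prints the path from project down to acquisition. The identifiers and the `build_record` and `path_to` helpers are hypothetical, and the nested dictionaries are an illustration only; they are not the XCEDE XML schema itself.

```python
# Levels of the experiment hierarchy as listed on the slide, outermost first.
LEVELS = ["project", "subject", "visit", "study", "episode", "acquisition"]


def build_record(ids):
    """Nest one record per level; `ids` maps level name -> identifier (hypothetical values)."""
    record = None
    # Build from the innermost level outward so each level contains the next one.
    for level in reversed(LEVELS):
        node = {"level": level, "id": ids[level]}
        if record is not None:
            node["child"] = record
        record = node
    return record


def path_to(record):
    """Return the identifiers from the outermost level down to the innermost."""
    path = []
    while record is not None:
        path.append(f"{record['level']}={record['id']}")
        record = record.get("child")
    return "/".join(path)


example_ids = {
    "project": "demo_project",
    "subject": "sub01",
    "visit": "baseline",
    "study": "mri_session",
    "episode": "rest_run1",
    "acquisition": "bold_001",
}
record = build_record(example_ids)
print(path_to(record))
```

A strict nesting like this makes every acquisition reachable by exactly one path, which is convenient for exchange but leaves no place for links that cut across the tree.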

Background: Data Exchange

• Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness.
  – Entity (e.g., files, data, publications)
    • a physical, digital, conceptual, or other kind of thing with some fixed aspects
  – Activity (e.g., workflow, editing a manuscript)
    • something that occurs over a period of time and acts upon or with entities
  – Agent (e.g., person, software, organization)
    • something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.

W3C PROV Specification Suite

Background: Provenance

• An image registration process
  – wasAssociatedWith a registration algorithm
  – used a native-space anatomical MRI
• A spatially-normalized anatomical MRI
  – wasGeneratedBy an image registration process
  – wasDerivedFrom a native-space anatomical MRI
  – wasAttributedTo a registration algorithm
• PROV is an extensible language to describe:
  – Responsibility
  – Data Flow
  – Process Flow
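The registration example can be written down as a small set of statements and queried. The sketch below uses plain Python tuples rather than a PROV library; the short names (`registration`, `native_mri`, and so on) and the helper functions are hypothetical labels chosen for illustration, while the relation names are the PROV relations shown on the slide.

```python
# Each PROV statement from the slide as a (subject, relation, object) tuple.
# The short names are hypothetical labels, not identifiers from a real dataset.
statements = [
    ("registration", "wasAssociatedWith", "registration_algorithm"),
    ("registration", "used", "native_mri"),
    ("normalized_mri", "wasGeneratedBy", "registration"),
    ("normalized_mri", "wasDerivedFrom", "native_mri"),
    ("normalized_mri", "wasAttributedTo", "registration_algorithm"),
]


def objects(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in statements if s == subject and r == relation]


def inputs_of(entity):
    """Find the entities used by the activity that generated `entity`."""
    used = []
    for activity in objects(entity, "wasGeneratedBy"):
        used.extend(objects(activity, "used"))
    return used


print(objects("normalized_mri", "wasAttributedTo"))  # who is responsible
print(inputs_of("normalized_mri"))                   # which data went in
print(objects("normalized_mri", "wasGeneratedBy"))   # which process produced it
```

Following `wasGeneratedBy` and then `used` arrives at the same native-space MRI that `wasDerivedFrom` names directly, which is the kind of consistency check a provenance graph makes possible.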

Background: Linked Open Data

Semantic Web and Resource Description Framework

• A language to make statements about unique locations (URLs) on the Web
• For example, at the URL of an anatomical MRI
  – ‘is a’ http://neurolex.org/wiki/Nlx_156814
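As a rough illustration, the ‘is a’ statement can be serialized as a single line of N-Triples, one of the standard RDF text formats, with `rdf:type` playing the role of ‘is a’. In the sketch below, the subject URL under `example.org` and the `n_triple` helper are assumptions made for the example; only the NeuroLex URL comes from the slide.

```python
# Standard RDF predicate that expresses "is a".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def n_triple(subject, predicate, obj):
    """Format one statement whose three parts are all URLs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical URL for a shared anatomical MRI; only the NeuroLex URL is from the slide.
mri_url = "http://example.org/data/anatomical_mri_001"
line = n_triple(mri_url, RDF_TYPE, "http://neurolex.org/wiki/Nlx_156814")
print(line)
```

Because subject, predicate, and object are all URLs, anyone on the Web can add further statements about the same MRI without coordinating on a shared schema.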

Background: Linked Open Data


Outline

• Introduction
• Background
• Research approach
  – Specific Aims
  – Study Design
  – Phase 1
  – Phase 2
• Conclusions and future directions

Research Approach: Specific Aims

•  Aim 1: Research and design a framework to represent, access, and query neuroimaging data provenance

•  Aim 2: Develop an information system of Web services to compute and discover data provenance from brain imaging workflow


Research Approach: Study Design

• Phase 1 – Scalable Neuroimaging Initiative (SNI)
  – West Coast collaboration funded by the National Academies Keck Futures Initiative (NAKFI) on Imaging Science
  – I led 15 meetings, 1 face-to-face workshop, and presented preliminary results at 3 conferences
• Phase 2 – Neuroimaging Data Sharing (NIDASH)
  – Task force funded and organized by the International Neuroinformatics Coordinating Facility (INCF)
  – I gathered feedback and redesigned the initial SNI framework over 14 face-to-face workshops, 2 hackathons, and weekly meetings over two years

Research Approach: Study Design


Research Approach: Study Design

• Phase 1 – SNI
  – Aim 1 – Data Exchange: Evaluate metadata standards for data exchange (XCEDE)
  – Aim 2 – Information System: Demonstrated a system for computational access to data (NiQuery)
• Phase 2 – NIDASH
  – Aim 1 – Data Exchange: Extend PROV using concepts from XCEDE (Neuroimaging Data Model)
  – Aim 2 – Information System: Redesign NiQuery using a semantic Web service oriented architecture

Outline

• Introduction
• Background
• Research approach
  – General Approach
  – Phase 1 – SNI
  – Phase 2 – NIDASH
• Conclusions and future directions

Research Approach: Phase 1 – SNI

• Scalable Neuroimaging Initiative’s Mission:
  – To specify and demonstrate an application programming interface (API) that can support agile exploration of distributed neuroimaging data sources while allowing for heterogeneous and evolving data management systems, ontologies, image data formats, image processing tools, and standard anatomical spaces.
• Aim 1 – Data Exchange:
  – Applied XCEDE as a data exchange standard for two neuroimaging databases
• Aim 2 – Information System:
  – Implemented a system architecture for remote access to content within neuroimaging data

Research Approach: Phase 1 – SNI

Aim 1
• Queries shipped out to multiple sources
• Links are passed to visualization app

Aim 2
• Extract time series from data remotely
• Browser and plotting all in real-time
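The Aim 1 pattern of shipping one query to several sources and collecting links can be sketched as a simple fan-out. This is not the NiQuery implementation: the source names, records, URLs, and the `query_source` and `federated_query` functions below are all hypothetical, and in-memory lists stand in for remote databases.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote data sources; a real system would call each
# source's web API instead of filtering an in-memory list.
SOURCES = {
    "source_a": [
        {"id": "a1", "modality": "fMRI", "url": "http://example.org/a/a1"},
        {"id": "a2", "modality": "DTI", "url": "http://example.org/a/a2"},
    ],
    "source_b": [
        {"id": "b1", "modality": "fMRI", "url": "http://example.org/b/b1"},
    ],
}


def query_source(name, modality):
    """Run the same query against one source and return links to matching data."""
    return [rec["url"] for rec in SOURCES[name] if rec["modality"] == modality]


def federated_query(modality):
    """Ship the query to every source in parallel and merge the returned links."""
    names = sorted(SOURCES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map preserves the order of `names`, so results are deterministic.
        results = list(pool.map(lambda n: query_source(n, modality), names))
    return [url for urls in results for url in urls]


links = federated_query("fMRI")
print(links)  # these links would then be handed to a visualization app
```

Returning links rather than the image data itself keeps the merged result small; the visualization app fetches only what it needs to display.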

Research Approach: Phase 1 – SNI

[Figure: NiQuery architecture (www.niquery.org). Labels: Web-based Applications (App); Query Processing (NIQ, Query Integrator); Database Registry; Common Data Exchange Layer with a Common API for each source: Allen Institute (ABA), UW (XNAT), Stanford (NIMS), …]

NiQuery presented at Neuroinformatics 2012, Munich. Brinkley (2012), Query Integrator. JBI.

• System too slow for real-time access (~30 secs.)
• XCEDE too strict for changing datatype requirements
• Framework doesn’t incorporate formal provenance

Research Approach: Phase 1 – SNI

Lessons learned
• Harmonizing the XCEDE and PROV Schemas
  – XCEDE has a strict hierarchical structure
  – PROV is designed as a graph and compatible with semantic Web technologies
  – A harmonized XCEDE and PROV model could represent the stages of electronic data capture, not just the experiment hierarchy
• Solution 1: Extend PROV to represent XCEDE
• Solution 2: Redesign NiQuery using semantic Web design concepts
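One way to picture the move from a strict hierarchy to a graph is to flatten a nested record into edges and then add a link the tree alone cannot express. The sketch below is an illustration under stated assumptions: the record is abbreviated and hypothetical, `partOf` is an invented relation name rather than a PROV or XCEDE term, and `hierarchy_to_edges` is not part of any existing tool.

```python
def hierarchy_to_edges(node, parent=None, edges=None):
    """Walk a nested record and emit (child, 'partOf', parent) edges."""
    if edges is None:
        edges = []
    if parent is not None:
        edges.append((node["id"], "partOf", parent["id"]))
    for child in node.get("children", []):
        hierarchy_to_edges(child, node, edges)
    return edges


# Hypothetical, abbreviated nested record following the experiment hierarchy.
project = {
    "id": "project1",
    "children": [{
        "id": "subject1",
        "children": [{"id": "visit1", "children": [{"id": "acquisition1"}]}],
    }],
}

edges = hierarchy_to_edges(project)
# A graph can also hold links that cut across the tree, such as derived data.
edges.append(("normalized_mri", "wasDerivedFrom", "acquisition1"))
for edge in edges:
    print(edge)
```

Once the hierarchy is a list of edges, provenance relations sit alongside the containment relations in the same structure, which is the motivation for extending PROV rather than extending the XML hierarchy.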

Outline

• Introduction
• Background
• Research approach
  – General Approach
  – Phase 1 – SNI
  – Phase 2 – NIDASH
• Conclusions and future directions

Research Approach: Phase 2 – NIDASH

• Neuroimaging Data Sharing Task Force Mission:
  – Aiming at reproducibility for the sake of reproducibility and enhanced research.
• Aim 1 – Data Exchange:
  – Applied XCEDE as a data exchange standard for two neuroimaging databases
• Aim 2 – Information System:
  – Implemented a system architecture for remote access to content within neuroimaging data

Research Approach: Phase 2 – NIDASH

Neuroimaging Data Model (NIDM)

Research Approach: Phase 2 – NIDASH

• Extensions to PROV using elements from the XCEDE experiment hierarchy, workflow tools, and derived data to create Domain Object Models
• Enables a model bridging information from experiment, workflow provenance, and derived data

Keator et al. 2013

Research Approach: Phase 2 – NIDASH


NIDM Collaboration

• Meetings on Monday and Wednesday to discuss previous week’s issues
• Satellite meetings at HBM, SfN, Imaging Genetics, and Neuroinformatics for 1-2 days each
• General Workflow to Contribute
  – Contributors create a “fork” from GitHub (an online version control system with
  – Changes to the vocabulary and examples are logged as “commits” in the contributor’s “fork”
  – Contributor submits a “pull request” to have changes reviewed
  – Discussion takes place online until consensus is reached

Aim 2: Design and Methods

Web services for brain imaging: Demo Query App

NIDM Results

• A full description is outside the scope of this talk… but

NIDM Results

• A harmonized model for reporting task-based fMRI across SPM, FSL and (soon) AFNI

http://nidm.nidash.org/specs/nidm-results.html

NIDM Results

• All terms are modeled with an identifier, a definition, domain/range, and examples
• Model fitting:

NIDM Results

Outline

• Introduction
• Background
• Research approach
• Conclusions and future directions
  – Contributions
  – Implications
  – Future Directions

Conclusions and future directions

• Collaborative Framework Outcomes
  – GitHub is an effective tool for standards development
    • Closed 89 issues
    • 1,087 commits
    • 9 contributors
    • 1 publication, specification suite
• Software engineering outcomes
  – Implemented in Nipype for workflow management
  – Being used to model task fMRI
    • Implemented for SPM 12 and FSL
  – Being incorporated into NeuroVault for automated population of a database to share SPMs

Acknowledgments

Committee Members
James Brinkley (Chair), Susan Coldwell (GSR), Thomas Grabowski, Nicholas Anderson

Neuroinformatics Community
Satra Ghosh, Rich Stoner, JB Poline, David Keator, Karl Helmer, Camille Maumet, Tom Nichols, Dan Marcus, Christian Haselgrove, Jessica Turner, David Kennedy, Jack van Horn… and many others!

Scalable Neuroimaging Initiative
UW: Todd Detwiler, Randy Frank
Stanford: Brian Wandell, Bob Dougherty, Gunnar Schaeffer

Integrated Brain Imaging Center
Katie Askren, Peter Boord, Elliot Collins, Tina Guan, Clark Johnson, Tara Madhyastha, Sonya Mehta, Todd Richards, Rosalia Tungaraza, Kurt Weaver, Karl Woelfer, Liza Young… and everyone else!