Data-knowledge transition zones within the biomedical research ecosystem
Maryann E. Martone, Ph.D., University of California, San Diego
• NIF is an initiative of the NIH Blueprint consortium of institutes
  - What types of resources (data, tools, materials, services) are available to the neuroscience community?
  - How many are there?
  - What domains do they cover? What domains do they not cover?
  - Where are they?
    • Web sites
    • Databases
    • Literature
    • Supplementary material
    • PDF files
    • Desk drawers
  - Who uses them?
  - Who creates them?
  - How can we find them?
  - How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging and tracking the neuroscience resource landscape since < 2008
BD2K: Big Data to Knowledge
• BD2K: a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement
• BD2K aims to develop the new approaches, standards, methods, tools, software and competencies that will enhance the use of biomedical Big Data by:
  - Facilitating broad use of biomedical digital assets by making them discoverable, accessible and citable
  - Conducting research and developing the methods, software and tools needed to analyze biomedical Big Data
  - Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
  - Supporting a data ecosystem that accelerates discovery as part of a digital enterprise
http://bd2k.nih.gov
How many resources are there?
How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
• Requires no special skills
• Manual and semi-automated updates
NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard
Low barrier to entry, incremental refinement (Marenco et al. 2010, 2014)
Registry vs. Federation: Metadata about a resource vs. metadata/data in a database
What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
[Chart: registry resources by type over the years: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource]
Anita Bandrowski and Burak Ozyurt
Population Coverage and Linkage of Resource Registry
• Automated text mining is used to look for “web page last updated” or copyright dates
  - Identified for 570 resources
  - 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  - 38 not updated within the past 2 years (~20%)
  - 8 migrated to new addresses or institutions
  - 7 are no longer in service (~3%)
  - 3 were deemed no longer appropriate
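The date check described above can be sketched roughly as follows. This is a minimal illustration, not NIF's actual pipeline: the marker phrases, the 40-character search window, and the function names `latest_year` and `is_stale` are all assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical markers; the talk does not list the actual rules NIF used.
MARKER = re.compile(r"last\s+(?:updated|modified)|copyright|©", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def latest_year(page_text, window=40):
    """Return the most recent year found shortly after an update or copyright marker."""
    years = []
    for marker in MARKER.finditer(page_text):
        snippet = page_text[marker.end():marker.end() + window]
        years += [int(y) for y in YEAR.findall(snippet)]
    return max(years) if years else None

def is_stale(page_text, today, max_age_years=2):
    """True if the newest detected year is more than max_age_years old; None if no date found."""
    year = latest_year(page_text)
    if year is None:
        return None
    return today.year - year > max_age_years

# Share of dated resources not updated within 2 years, using the counts on the slide.
stale_share = round(100 * 373 / 570)  # 65
```

Pages with no detectable date return `None` rather than being counted as stale, which mirrors the slide's distinction between the 570 resources where a date was identified and the rest.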
What happens to these resources?
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
• New tags come into existence, e.g., Connectome, Tractography, Epigenetics
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
250 sources, > 800 M records
What do you mean by data? Databases come in many shapes and sizes
• Primary data
  - Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  - Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  - Claims and assertions about the meaning of data
  - E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  - Metadata
  - Pointers to data sets or materials stored elsewhere
• Data aggregators
  - Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  - Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Diagram: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search and discovery, but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
  - It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  - Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  - Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Diagram: facet/filter levels (Source, Category, Index) for the query “Addiction”: Registry, Data, Literature; Gene; Gemma (Gene, Organism, Expression level); Geo; Integrated Expression]
More effective to start with a general query and use the navigation to refine search
Concept Mapper: Alignment and weighting
Find:gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Find:gene Anatomy:cerebellum
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Chart: data categories on a log scale from 1 to 10,000,000,000: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram: SciCrunch shared resources (shared curation, shared expertise) linked to community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]
Making use of community
[Diagram: facet/filter levels (Source, Category, Index) with a Community level added: Community resources, SciCrunch data (all), Literature; Gene; Gemma (Gene, Organism, Expression level); Geo; Integrated Expression]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA: GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources by brain region (Forebrain, Midbrain, Hindbrain); legend bins: 0, 1-10, 11-100, > 101]
Revealing biases in the dataspace
S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Chart: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
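The binning used in that chart can be sketched as below. This is only an illustration of the bin edges listed on the slide; the gene names and counts are made-up placeholders, not results from antibodyregistry.org.

```python
from collections import Counter

def bin_label(count):
    """Map a per-gene antibody count to the bins shown on the slide."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1001+"

def histogram(counts_by_gene):
    """Count how many genes fall into each bin."""
    return Counter(bin_label(n) for n in counts_by_gene.values())

# Hypothetical input: gene symbol -> number of registry search results.
example_counts = {"GENE_A": 0, "GENE_B": 5, "GENE_C": 2500}
example_histogram = histogram(example_counts)
```

The "0" bin is the one that matters for gap analysis: it lists the genes for which a comprehensive index returns no antibody at all.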
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al. 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
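A rough sketch of how the exact and synonym overlap counts above could be computed. The database names, the terms, the synonym table, and the case-insensitive matching rule are all assumptions for illustration; the talk does not describe the actual matching procedure.

```python
from collections import defaultdict

def normalize(term):
    """Lower-case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())

def shared_terms(databases, synonyms=None):
    """Return {term: set of database names} for terms used in more than one database.

    databases: dict mapping a database name to an iterable of brain region terms.
    synonyms: optional dict mapping a normalized term to its preferred normalized label.
    """
    synonyms = synonyms or {}
    seen = defaultdict(set)
    for name, terms in databases.items():
        for term in terms:
            key = normalize(term)
            key = synonyms.get(key, key)
            seen[key].add(name)
    return {term: names for term, names in seen.items() if len(names) > 1}

# Hypothetical toy data, not taken from the seven NIF Connectivity sources.
toy = {
    "db_a": ["Hippocampus", "Caudate nucleus"],
    "db_b": ["hippocampus", "Caudate"],
    "db_c": ["Ammon's horn"],
}
exact = shared_terms(toy)
with_synonyms = shared_terms(toy, {"caudate": "caudate nucleus"})
```

Even in this toy case the synonym table doubles the overlap, which is the pattern the slide reports (42 exact matches vs. 99 synonym matches).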
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
• Location of cell soma
• Location of dendrites
• Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  - Neuron Registry
    • Neurolex.org
    • Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in NeuroLex Wiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location and axon location; axis from 0 to 300]
Social networks and community sites let us learn from the collective behavior of contributors, and they show the limits of our knowledge and of our knowledge representations
[Diagram: Domain Knowledge (ontologies, atlases/maps, annotation); claims and assertions; registries; derived data; models and simulations; analyses; Data (databases, data sets); literature; Search and Discovery]
Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE: Biomedical and Health Care Data Discovery and Indexing Engine Center
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: Community engagement piece
• What should a data discovery index do?
  - Task Forces
  - Pilot projects
• How should it be built?
http://biocaddie.org
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNET, 3DVC, FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  - Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Diagram: data set GSE13732 as it is analyzed, curated and mirrored by different resources]
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
But… even our standards need standards
• GSE13732
• E-GEOD-13732
• GEO:GSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
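The three spellings above can be reduced to one canonical form with a small normalizer. This is a sketch under assumptions: it covers only the three forms shown on the slide, the choice of `GSE<number>` as the canonical form is arbitrary, and `canonical_geo_series` is a hypothetical helper name, not part of any NIF tool.

```python
import re

# Patterns for the three spellings shown on the slide (assumed, not exhaustive).
_SERIES_PATTERNS = [
    re.compile(r"^(?:GEO:)?GSE(\d+)$", re.IGNORECASE),  # GSE13732, GEO:GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),       # E-GEOD-13732
]

def canonical_geo_series(identifier):
    """Return the identifier as 'GSE<number>', or None if it is not a recognized spelling."""
    text = identifier.strip()
    for pattern in _SERIES_PATTERNS:
        match = pattern.match(text)
        if match:
            return f"GSE{int(match.group(1))}"
    return None

variants = ["GSE13732", "E-GEOD-13732", "GEO:GSE13732"]
canonical = {canonical_geo_series(v) for v in variants}
```

Once every source reports the same string, records about the same data set can be joined across the federation without fuzzy matching.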
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
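Because the two databases report change against opposite baselines, one side has to be flipped before claims are compared. The sketch below shows that step on made-up data; the gene names, the "up"/"down" encoding, and the function names are assumptions, and this is not the procedure NIF actually ran.

```python
def flip(direction):
    """Invert a direction of change; anything other than 'up' or 'down' is returned unchanged."""
    return {"up": "down", "down": "up"}.get(direction, direction)

def compare_claims(claims_a, claims_b, b_baseline_inverted=True):
    """Tally how claims in A compare with B after aligning B's baseline to A's.

    claims_a, claims_b: dict mapping a gene to 'up' or 'down'.
    """
    tally = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, direction_a in claims_a.items():
        direction_b = claims_b.get(gene)
        if direction_b is None:
            tally["missing"] += 1
            continue
        if b_baseline_inverted:
            direction_b = flip(direction_b)
        key = "consistent" if direction_a == direction_b else "opposite"
        tally[key] += 1
    return tally

# Hypothetical toy claims, not the Gemma or DRG data.
source_a = {"GENE_1": "up", "GENE_2": "down", "GENE_3": "up"}
source_b = {"GENE_1": "down", "GENE_2": "down"}
result = compare_claims(source_a, source_b)

# Share of statements consistent across sources, from the counts on the slide.
consistent_share = round(100 * 617 / 1370)  # 45 (percent)
```

Without the flip, GENE_1 would be counted as a disagreement even though both sources describe the same change.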
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study, such that they are:
  - Machine processible (i.e., unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
  - Software/databases
  - Antibodies
  - Genetically modified organisms
Launched February 2014; > 30 journals participating
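To illustrate what "machine processible" buys, here is a minimal sketch that pulls RRID-style identifiers out of a methods paragraph. The regular expression is an assumption about how such identifiers are commonly written (a prefix, an underscore, then an accession), not an official grammar, and the identifiers in the sample text are placeholders.

```python
import re

# Assumed shape: "RRID:" followed by PREFIX_ACCESSION; not an official specification.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9_:-]+)")

def extract_rrids(text):
    """Return unique RRID-style identifiers in order of first appearance."""
    found = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found

# Placeholder identifiers for illustration only.
sample = "Antibody (RRID:AB_0000001) and software (RRID: SCR_0000002); again RRID:AB_0000001."
rrids = extract_rrids(sample)
```

A pass like this over published papers is one way to answer "what studies have used this resource?", and it only works when identifiers survive copyediting unchanged.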
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph
This is your brain on computers
bull NIF is an initiative of the NIH Blueprint consortium of institutesndash What types of resources (data tools materials services) are available to the
neuroscience community
ndash How many are there
ndash What domains do they cover What domains do they not cover
ndash Where are theybull Web sites
bull Databases
bull Literature
bull Supplementary material
ndash Who uses them
ndash Who creates them
ndash How can we find them
ndash How can we make them better in the future
httpneuinfoorg
bull PDF files
bull Desk drawers
NIF has been surveying
cataloging and tracking the
neuroscience resource
landscape since lt 2008
BD2K Big Data to Knowledgebull BD2K - a trans-NIH initiative established to enable biomedical research as a
digital research enterprise to facilitate discovery and support new knowledge and to maximize community engagement
bull BD2K aims to develop the new approaches standards methods tools software and competencies that will enhance the use of biomedical Big Data by
ndash Facilitating broad use of biomedical digital assets by making them discoverable accessible and citable
ndash Conducting research and developing the methods software and tools needed to analyze biomedical Big Data
ndash Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
ndash Supporting a data ecosystem that accelerates discovery as part of a digital enterprise
httpbd2knihgov
How many resources are there
How do resources get added to the NIFbullNIF curatorsbullNomination by the communitybullSemi-automated text mining pipelines
NIF RegistryRequires no special skillsManual and semi-automated updates
bullNIF Data FederationbullDISCO interopbullRequires some programming skillbullOpen Source Brain lt 2 hrbullAutomated update via NIF DISCO dashboard
Low barrier to entry incremental refinementMarenco et al 2010 2014
Registry vs Federation Metadata about resource vsmetadatadata in database
What resources are available for GRM1
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY
Database
Software Application
Data Analysis Service
Topical Portal
Core Facility
Ontology
Software Resource
Years
Anita Bandrowski and Burak Ozyurt
Population Coverage and Linkage of Resource Registry
bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates
ndash Identified for 570 resources
ndash 373 were not updated within the last 2
years (65)
bull Manual review of ~200 resources
ndash 38 not updated within the past 2 years
(~20)
ndash 8 migrated to new addresses or institutions
ndash 7 are no longer in service (~3)
ndash 3 were deemed no longer appropriate
What happens to these resources
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content
bullChange namebullChange scope
bullgt 7000 updates to the registry last year
Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sourcesgt 800 M records
What do you mean by dataDatabases come in many shapes and sizes
bull Primary data
ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)
bull Secondary datandash Data features extracted through
data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)
bull Tertiary datandash Claims and assertions about the
meaning of data
bull Eg gene upregulationdownregulation brain activation as a function of task
bull Registriesndash Metadata
ndash Pointers to data sets or materials stored elsewhere
bull Data aggregatorsndash Aggregate data of the same
type from multiple sources eg Cell Image Library SUMSdb Brede
bull Single sourcendash Data acquired within a single
context eg Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF A search engine for data
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
ldquoData trailsrdquo Linking data and analysis tools
Query across Registry and Federation
bull Registry and Federation were treated separately even though Federation comprises views of Registry entries
bull Experimenting with new combined index
SciCrunch A ldquosocial networkrdquo for resources
bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery
and general browsingbull Can perform analytics across
the spectrum of biomedical resources
bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according
to their needsbull Use their own branding
bull How do we create a system that satisfies community needs without creating another silo
Put dkNET here
httpdknetorg
Autogenerated snippets
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are
there
NIH funding announcement BRAIN Initiative Transformative
Approaches for Cell-Type Classification in the Brain
ldquoThe mammalian brain contains a vast number of cells These cells are
generally grouped within broad classes (eg neurons or glia) but it is
currently unknown exactly how many classes existrdquo
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones Neurons and their properties
Analysis of Red Links in the Neuron Registry
bull INCF Project
ndash Neuron Registrybull Neurolexorg
bull Semantic MediaWiki
ndash gt 30 experts worldwide
ndash Fill out neuron pages in NeurolexWiki
Soma location
Dendrite location
Axon location
0
50
100
150
200
250
300
NumberTotal
redlinkseasy fixes
hard fixes
Soma location
Dendrite location
Axon location
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Data flows throughout the ecosystem; value is added
[Figure: data set GSE13732 as it is analyzed, curated, and mirrored by different resources]
Stable identifiers and annotations allow us to “track” data as it moves
But... even our standards need standards
• GSE13732
• E-GEOD-13732
• GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
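The three spellings listed above refer to one GEO series. A minimal sketch of the normalization step follows, assuming (the slides do not say this) that the canonical form is the bare `GSE` accession and that an ArrayExpress-style `E-GEOD-n` accession mirrors GEO series `GSEn`. The function name `normalize_geo_series` and the optional separator after `GEO` are illustrative assumptions, not NIF's actual implementation.

```python
import re
from typing import Optional

# Spellings seen for one GEO series: the bare accession, the ArrayExpress-style
# mirror, and a source-prefixed form with an optional separator (assumed).
_GEO_SERIES_PATTERNS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),
    re.compile(r"^GEO[:/_ ]?GSE(\d+)$", re.IGNORECASE),
]


def normalize_geo_series(raw: str) -> Optional[str]:
    """Return the canonical 'GSE<number>' form, or None if unrecognized."""
    token = raw.strip()
    for pattern in _GEO_SERIES_PATTERNS:
        match = pattern.match(token)
        if match:
            return "GSE" + match.group(1)
    return None


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {normalize_geo_series(v) for v in variants}
```

In this sketch all three variants collapse to a single key, which is what lets records from different federated sources be joined on the same data set.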
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
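Because the two databases report change against different baselines, directions have to be put on a common footing before claims can be compared. The sketch below illustrates that idea only; the function names, the `"saline"` reference label, and the placeholder gene symbols are assumptions and do not come from Gemma, DRG, or NIF.

```python
from collections import Counter

FLIP = {"up": "down", "down": "up"}


def to_reference(direction, baseline, reference="saline"):
    """Express a direction of change relative to the reference baseline."""
    return direction if baseline == reference else FLIP[direction]


def compare_claims(claims_a, claims_b):
    """Tally agreement between two {gene: (direction, baseline)} mappings."""
    tally = Counter()
    for gene, (direction, baseline) in claims_a.items():
        if gene not in claims_b:
            tally["unmatched"] += 1
            continue
        other_direction, other_baseline = claims_b[gene]
        same = to_reference(direction, baseline) == to_reference(
            other_direction, other_baseline
        )
        tally["consistent" if same else "opposite"] += 1
    return tally


# Placeholder data: gene symbols and values are invented for illustration.
gemma_like = {
    "GENE_A": ("down", "chronic morphine"),
    "GENE_B": ("up", "chronic morphine"),
    "GENE_C": ("up", "chronic morphine"),
}
drg_like = {"GENE_A": ("up", "saline"), "GENE_B": ("up", "saline")}
result = compare_claims(gemma_like, drg_like)
```

Without the baseline flip, `GENE_A` would be scored as a disagreement even though both sources describe the same change.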
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processable (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014; > 30 journals participating
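To make "machine processable" concrete, here is a rough sketch of pulling resource identifiers out of a methods paragraph. It assumes identifiers are written as `RRID:` followed by a registry prefix, an underscore, and an accession (for example an `AB_` prefix for antibodies or `SCR_` for software and databases); the slides do not give the syntax, and the identifiers and tool name in the sample text are invented placeholders.

```python
import re

# Assumed syntax: "RRID:" + registry prefix + "_" + accession.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9:_-]*[A-Za-z0-9])")


def extract_rrids(text):
    """Return the distinct RRIDs in text, in order of first appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Invented sample sentence; the accessions are placeholders, not real records.
methods = (
    "Sections were stained with anti-GFAP (RRID:AB_0000001) and "
    "analyzed with ExampleTool (RRID:SCR_0000002; RRID:AB_0000001)."
)
found = extract_rrids(methods)
```

A fixed, uniform prefix is what makes this kind of extraction workable across journals; free-text vendor names and catalog numbers cannot be matched this way.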
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
How many resources are there?
How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
• Requires no special skills
• Manual and semi-automated updates
NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard
Low barrier to entry; incremental refinement (Marenco et al. 2010, 2014)
Registry vs. Federation: Metadata about a resource vs. metadata/data in a database
What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY
[Figure: Registry resources by type over the years: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource. Credit: Anita Bandrowski and Burak Ozyurt]
Population Coverage and Linkage of Resource Registry
• Automated text mining is used to look for “web page last updated” or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
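The date-mining step above can be sketched roughly as follows. This is an illustration under simple assumptions (a keyword followed within 40 characters by a four-digit year), not the NIF text mining pipeline; the function names and the 40-character window are hypothetical choices.

```python
import re

_KEYWORD = re.compile(r"last\s+(?:updated|modified)|copyright|©", re.IGNORECASE)
_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_stated_year(page_text):
    """Return the most recent year found shortly after an update/copyright cue."""
    years = []
    for match in _KEYWORD.finditer(page_text):
        window = page_text[match.end():match.end() + 40]
        years.extend(int(y) for y in _YEAR.findall(window))
    return max(years) if years else None


def is_stale(page_text, current_year, max_age_years=2):
    """True if the stated year is more than max_age_years old; None if unknown."""
    year = latest_stated_year(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years


stale_page = is_stale("Copyright 2009-2011 Example Lab", current_year=2014)
fresh_page = is_stale("Page last updated: March 3, 2013", current_year=2014)
unknown_page = is_stale("Welcome to our database", current_year=2014)
```

The percentages in the list follow from the counts: 373 of 570 is about 65%, 38 of roughly 200 is about 19%, and 7 of roughly 200 is about 3.5%.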
What happens to these resources?
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
250 sources; > 800 M records
What do you mean by data? Databases come in many shapes and sizes
• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
  – E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Figure: search interface for the query “Addiction” showing Facet/Filter, Source, Category, and Index panels; listed entries include Registry, Data, Gene, Gemma (Gene, Organism, Expression level), Geo, Integrated Expression, and Literature]
More effective to start with a general query and use the navigation to refine search
Concept Mapper: Alignment and weighting
Find: gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Find: gene Anatomy:cerebellum
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Figure: number of records per data category on a logarithmic axis (1 to 10,000,000,000): Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Figure: SciCrunch shared resources surrounded by community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]
Shared curation, shared expertise
Making use of community
[Figure: community portal search showing Facet/Filter, Source, Category, and Index panels; listed entries include Community resources, SciCrunch data (all), Gene, Gemma (Gene, Organism, Expression level), Geo, Integrated Expression, and Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain), binned as 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi:10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al. 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
  • Number of exact terms used in > 1 database: 42
  • Number of synonym matches: 99
  • Number of 1st order partonomy matches: 385
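The "exact terms used in > 1 database" count can be illustrated with a small sketch. The source names and term lists below are toy data invented for the example, and the only normalization assumed is trimming whitespace and lowercasing; the slides do not describe how NIF matched terms.

```python
from collections import Counter


def shared_terms(terms_by_source):
    """Return the terms that appear, after normalization, in more than one source."""
    counts = Counter()
    for terms in terms_by_source.values():
        # A set per source so a term repeated within one source counts once.
        counts.update({term.strip().lower() for term in terms})
    return {term for term, n in counts.items() if n > 1}


# Toy data for illustration only.
toy = {
    "source_a": ["Hippocampus", "CA1", "Amygdala"],
    "source_b": ["hippocampus", "Cornu ammonis 1", "Thalamus"],
    "source_c": ["Hippocampal formation", "thalamus "],
}
overlap = shared_terms(toy)
```

Exact matching misses pairs such as "CA1" and "Cornu ammonis 1", or "Hippocampus" and "Hippocampal formation", which is why the synonym and partonomy counts above depend on an ontology rather than on string comparison.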
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi:10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
[Figure: neuron properties: location of cell soma, location of dendrites, location of local axon arbor]
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in NeuroLex Wiki
[Figure: bar chart (0 to 300) of red links for soma location, dendrite location, and axon location: total red links, easy fixes, hard fixes]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Figure: layers linking knowledge and data: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also an accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
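The baseline problem described above can be made concrete with a small sketch. It assumes each source reports a signed log fold change per gene and that the two sources use opposite baselines, so one sign has to be flipped before the directions are compared. The gene names and values are invented; this is not the analysis that produced the counts on the slide.

```python
def agreement(gemma: dict, drg: dict) -> dict:
    """Classify shared genes after putting both results on the same baseline.

    Assumption: values are signed log fold changes and the two sources use
    opposite baselines, so the Gemma sign is flipped before comparison.
    """
    result = {}
    for gene in gemma.keys() & drg.keys():
        flipped = -gemma[gene]
        if flipped == 0 or drg[gene] == 0:
            result[gene] = "no change reported"
        elif (flipped > 0) == (drg[gene] > 0):
            result[gene] = "consistent"
        else:
            result[gene] = "opposite"
    return result


# Made-up values for illustration only.
gemma_lfc = {"GeneA": -1.2, "GeneB": 0.8, "GeneC": 0.5}
drg_lfc = {"GeneA": 1.0, "GeneB": 0.6}
calls = agreement(gemma_lfc, drg_lfc)
```

Genes present in only one source (here `GeneC`) are left out, which mirrors the slide's category of statements that could not be judged.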
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
– Machine processible (i.e. unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
– Software/databases
– Antibodies
– Genetically modified organisms
Launched February 2014; > 30 journals participating
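"Machine processible" can be illustrated with a short extraction sketch. The citation pattern below (the prefix `RRID:` followed by an identifier of letters, digits, underscores or hyphens) is an assumption about how authors write these identifiers, and the identifiers in the sample sentence are invented.

```python
import re

# Assumed citation pattern: "RRID:" followed by an identifier made of letters,
# digits, underscores or hyphens. Real author-supplied text may vary.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+[_-][A-Za-z0-9_-]+)")


def extract_rrids(text: str) -> list:
    """Return the unique identifiers cited in text, in order of appearance."""
    seen = []
    for ident in RRID_PATTERN.findall(text):
        if ident not in seen:
            seen.append(ident)
    return seen


# The identifiers in this sentence are invented for illustration.
methods = ("Sections were stained with anti-X (RRID:AB_0000001) and analyzed "
           "with ToolY (RRID: SCR_0000002); anti-X (RRID:AB_0000001) was reused.")
found = extract_rrids(methods)
```

A pattern this simple only works if the identifier format is uniform across journals, which is the point of the initiative.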
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• NIF Registry
– Requires no special skills
– Manual and semi-automated updates
• NIF Data Federation
– DISCO interop
– Requires some programming skill
– Open Source Brain: < 2 hr
– Automated update via NIF DISCO dashboard
Low barrier to entry; incremental refinement (Marenco et al. 2010, 2014)
Registry vs Federation: Metadata about resource vs metadata/data in database
What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
[Chart: Registry resources by type (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource) plotted against years. Anita Bandrowski and Burak Ozyurt]
Population Coverage and Linkage of Resource Registry
• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
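A minimal sketch of the kind of check described above: look for a "last updated" or copyright year in page text and flag pages older than two years. The regular expressions and function names are assumptions for illustration, and real pages phrase these dates in many more ways than this covers.

```python
import re

YEAR = r"((?:19|20)\d{2})"
# Assumed phrasings; real pages vary widely.
LAST_UPDATED = re.compile(r"last\s+(?:updated|modified).{0,40}?" + YEAR,
                          re.IGNORECASE)
COPYRIGHT = re.compile(r"(?:copyright|©)\D{0,20}?" + YEAR
                       + r"(?:\s*-\s*" + YEAR + r")?", re.IGNORECASE)


def latest_year(page_text: str):
    """Return the most recent year found by either pattern, or None."""
    years = [
        int(group)
        for pattern in (LAST_UPDATED, COPYRIGHT)
        for m in pattern.finditer(page_text)
        for group in m.groups()
        if group
    ]
    return max(years) if years else None


def is_stale(page_text: str, current_year: int, max_age_years: int = 2):
    """True/False when a date is found, None when no date can be identified."""
    year = latest_year(page_text)
    return None if year is None else (current_year - year) > max_age_years


# The percentage on the slide follows from the two counts it reports.
share_not_updated = round(100 * 373 / 570)
```

Returning `None` when no date is found keeps "unknown" separate from "stale", which matters because dates were identified for only part of the registry.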
What happens to these resources?
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g. mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
250 sources, > 800 M records
What do you mean by data? Databases come in many shapes and sizes
• Primary data
– Data available for reanalysis, e.g. microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g. brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
– E.g. gene upregulation/downregulation, brain activation as a function of task
• Registries
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g. Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g. Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g. Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
– Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Figure: results for the query “Addiction” refined by facet/filter (Source, Category, Index), split into Registry and Data; sources shown include Gene, Gemma (Gene, Organism, Expression level), GEO, Integrated Expression and Literature]
More effective to start with a general query and use the navigation to refine search
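Progressive refinement can be sketched as counting facet values over the current hits and then filtering on one of them. The records, field names and values below are hypothetical; they only loosely mirror the Source, Category and Index facets named on the slide.

```python
from collections import Counter

# Hypothetical search hits; field names and values are illustrative only.
HITS = [
    {"source": "Gemma", "category": "Microarray", "index": "Data"},
    {"source": "GEO", "category": "Microarray", "index": "Data"},
    {"source": "Literature", "category": "Literature", "index": "Data"},
    {"source": "Some lab database", "category": "Database", "index": "Registry"},
]


def facet_counts(hits, facet):
    """Count how many hits fall under each value of one facet."""
    return Counter(hit[facet] for hit in hits)


def refine(hits, **filters):
    """Keep only the hits whose fields equal every requested filter value."""
    return [h for h in hits if all(h.get(k) == v for k, v in filters.items())]


# Start general, then narrow one facet at a time.
index_counts = facet_counts(HITS, "index")
data_hits = refine(HITS, index="Data")
microarray_hits = refine(data_hits, category="Microarray")
```

Each step shows the counts for the remaining hits, so the user sees what is available before choosing the next filter.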
Concept Mapper: Alignment and weighting
Findgene cerebellum = find all sources with column mapped to gene that also contain keyword cerebellum
Findgene Anatomycerebellum
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
– Very powerful for discovery and general browsing
– Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
– Specialized for their domain
– Restrict the particular sources
– Organize the data according to their needs
– Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Chart: data categories on a log scale from 1 to 10,000,000,000: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram: SciCrunch shared resources linked to communities: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]
Shared curation, shared expertise
Making use of community
[Figure: facet/filter search (Source, Category, Index) with community resources shown alongside all SciCrunch data; sources shown include Gene, Gemma (Gene, Organism, Expression level), GEO, Integrated Expression and Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain), binned as 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
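The binning behind such a gap analysis can be sketched as follows. The counts are invented, and the top bin is written here as ">100" so that the bins do not leave a gap at exactly 101; the slide itself labels it ">101".

```python
def bin_label(n_sources: int) -> str:
    """Assign a count of data sources to one of the legend bins."""
    if n_sources < 0:
        raise ValueError("count must be non-negative")
    if n_sources == 0:
        return "0"
    if n_sources <= 10:
        return "1-10"
    if n_sources <= 100:
        return "11-100"
    return ">100"


# Invented counts of data sources mentioning each region.
counts = {"Forebrain": 240, "Midbrain": 35, "Hindbrain": 8, "Hypothetical region": 0}
bins = {region: bin_label(n) for region, n in counts.items()}

# Regions known to the ontology but absent from every source are the gaps.
gaps = [region for region, n in counts.items() if n == 0]
```

The regions come from the knowledge side (the ontology) and the counts from the data side, which is what lets the zero bin appear at all.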
S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas, used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein coding genes (Entrez Gene) vs number of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
“The Data Homunculus”
Data-Knowledge Mismatch
Dutkowski et al. 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal-lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
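The two simplest match types counted above can be sketched with plain set operations. The sketch assumes "exact" means identical after lower-casing and whitespace cleanup, and that synonyms come from a lookup table mapping each term to a preferred label; the database names, vocabularies and synonym table are toy examples.

```python
from collections import defaultdict


def normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def shared_terms(databases: dict) -> set:
    """Terms (after normalization) that are used by more than one database."""
    users = defaultdict(set)
    for name, terms in databases.items():
        for term in terms:
            users[normalize(term)].add(name)
    return {term for term, names in users.items() if len(names) > 1}


def synonym_matches(databases: dict, synonyms: dict) -> set:
    """Same as shared_terms, after mapping each term to its preferred label."""
    mapped = {
        name: {synonyms.get(normalize(t), normalize(t)) for t in terms}
        for name, terms in databases.items()
    }
    return shared_terms(mapped)


# Toy vocabularies; database names and the synonym table are hypothetical.
dbs = {
    "db_a": {"Hippocampus", "Caudate nucleus"},
    "db_b": {"hippocampus", "Ammon's horn", "Caudate"},
}
syn = {"caudate": "caudate nucleus"}
exact = shared_terms(dbs)
with_synonyms = synonym_matches(dbs, syn)
```

Partonomy matches would need the ontology's part-of relations as well, which this sketch leaves out.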
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g. neurons or glia) but it is currently unknown exactly how many classes exist.”
[Figure labels: Location of cell soma; Location of dendrites; Location of local axon arbor]
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
– Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location and axon location; axis from 0 to 300]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Diagram: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims and assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature, Search and Discovery]
Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNET, 3DVC, FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
What resources are available for GRM1
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY
Database
Software Application
Data Analysis Service
Topical Portal
Core Facility
Ontology
Software Resource
Years
Anita Bandrowski and Burak Ozyurt
Population Coverage and Linkage of Resource Registry
• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
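The automated check described above can be sketched roughly as follows. This is not the NIF pipeline: the page snippets, the two regular expressions, and the 730-day cutoff are all assumptions made for illustration.

```python
import re
from datetime import date

# Assumed phrasings; real pages express these dates in many other ways.
LAST_UPDATED = re.compile(r"last updated:?\s*(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE)
COPYRIGHT = re.compile(r"copyright\s*(?:\(c\)\s*)?(\d{4})", re.IGNORECASE)

def latest_date(text):
    """Return the most specific date found on a page, or None."""
    m = LAST_UPDATED.search(text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        return date(year, month, day)
    m = COPYRIGHT.search(text)
    if m:
        # Generous reading: treat a copyright year as its last day.
        return date(int(m.group(1)), 12, 31)
    return None

def count_stale(pages, today, max_age_days=730):
    """Return (pages with a date, pages older than the cutoff)."""
    dated = [d for d in (latest_date(t) for t in pages.values()) if d is not None]
    stale = [d for d in dated if (today - d).days > max_age_days]
    return len(dated), len(stale)

# Hypothetical page snippets.
PAGES = {
    "resource_a": "Web page last updated: 2013-11-02",
    "resource_b": "Copyright 2009 Example Lab",
    "resource_c": "No date information here",
}
print(count_stale(PAGES, today=date(2014, 6, 1)))

# The slide's own figures: 373 of 570 dated resources.
print(round(100 * 373 / 570), "%")
```

A page with no detectable date is left out of the denominator, which matches the slide's framing of 570 resources for which a date was identified.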
What happens to these resources
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sources; > 800 M records
What do you mean by data? Databases come in many shapes and sizes
• Primary data
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
NIFSTD
[Diagram: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
– Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Screenshot: results for the query “Addiction” with Facet/Filter, Source, Category, and Index navigation; Registry and Data results; Gene sources include Gemma (Gene, Organism, Expression level) and GEO Integrated Expression; Literature]
More effective to start with a general query and use the navigation to refine search
Concept Mapper: Alignment and weighting
Find gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Find gene Anatomy cerebellum
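The two query forms can be illustrated with a toy model in which each source declares which concept each of its columns is mapped to. Everything here (source names, columns, rows, and the function itself) is hypothetical and is not the Concept Mapper's actual interface.

```python
# Hypothetical sources: each column is mapped to a concept.
SOURCES = {
    "expression_db": {
        "column_concepts": {"symbol": "gene", "region": "anatomy"},
        "rows": [{"symbol": "Grm1", "region": "cerebellum"}],
    },
    "image_db": {
        "column_concepts": {"structure": "anatomy"},
        "rows": [{"structure": "cerebellum"}],
    },
    "gene_db": {
        "column_concepts": {"name": "gene"},
        "rows": [{"name": "Grm1"}],
    },
    "notes_db": {
        "column_concepts": {"gene": "gene", "comment": "free_text"},
        "rows": [{"gene": "Calb1", "comment": "enriched in cerebellum"}],
    },
}

def find_sources(concept, keyword, sources, keyword_concept=None):
    """Sources with a column mapped to `concept` that also contain `keyword`.

    If `keyword_concept` is given, the keyword must appear in a column
    mapped to that concept rather than anywhere in the record.
    """
    hits = []
    for name, src in sources.items():
        concepts = src["column_concepts"]
        if concept not in concepts.values():
            continue
        if keyword_concept is None:
            cols = list(concepts)
        else:
            cols = [c for c, k in concepts.items() if k == keyword_concept]
        if any(
            keyword.lower() in str(row.get(col, "")).lower()
            for row in src["rows"]
            for col in cols
        ):
            hits.append(name)
    return hits

print(find_sources("gene", "cerebellum", SOURCES))
print(find_sources("gene", "cerebellum", SOURCES, keyword_concept="anatomy"))
```

Restricting the keyword to a column mapped to anatomy drops the source that only mentions cerebellum in a free-text comment, which is the kind of narrowing the second query form suggests.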
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
• Very powerful for discovery and general browsing
• Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
• Specialized for their domain
• Restrict the particular sources
• Organize the data according to their needs
• Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART
[Chart: data categories on a log-scale axis (1 to 10,000,000,000): Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram: SciCrunch Shared Resources (shared curation, shared expertise) surrounded by community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]
Making use of community
[Screenshot: community search view with Facet/Filter, Source, Category, and Index navigation; Community resources and SciCrunch data (all); Gene sources include Gemma (Gene, Organism, Expression level) and GEO Integrated Expression; Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources by brain region (Forebrain, Midbrain, Hindbrain); legend bins: 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
SW Oh et al., Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Chart: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org; Trish Whetzel and Anita Bandrowski
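The binning behind a chart like this can be sketched in a few lines. The bin edges are the ones shown on the slide; the per-gene counts below are invented for illustration and do not come from the Antibody Registry.

```python
from collections import Counter

def bin_label(n):
    """Map a search-result count to one of the slide's bins."""
    if n == 0:
        return "0"
    if n <= 10:
        return "1-10"
    if n <= 100:
        return "11-100"
    if n <= 1000:
        return "101-1000"
    return "1001+"

# Hypothetical antibody counts per gene.
counts_per_gene = {"GENE_A": 0, "GENE_B": 3, "GENE_C": 250, "GENE_D": 1001, "GENE_E": 0}

histogram = Counter(bin_label(n) for n in counts_per_gene.values())
for label in ["0", "1-10", "11-100", "101-1000", "1001+"]:
    print(label, histogram[label])
```

The "0" bin is the interesting one for gap analysis: it lists the genes for which the index returns nothing at all.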
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal-lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
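Counting exact term matches across databases amounts to a set comparison. The sketch below uses invented term lists and treats terms as equal after trimming and lowercasing, which is an assumption; the analysis on the slide may have used a stricter or looser definition of "exact".

```python
from collections import Counter

def shared_terms(term_sets):
    """Terms that appear in more than one database (case-insensitive)."""
    counts = Counter()
    for terms in term_sets.values():
        # A set per database so a term counts once per database.
        counts.update({t.strip().lower() for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)

def unique_terms(term_sets):
    """All distinct terms across databases (case-insensitive)."""
    return {t.strip().lower() for terms in term_sets.values() for t in terms}

# Hypothetical vocabularies from three databases.
TERMS = {
    "db_a": ["Hippocampus", "CA1", "Striatum"],
    "db_b": ["hippocampus", "Caudate putamen"],
    "db_c": ["Field CA1", "striatum"],
}

print(len(unique_terms(TERMS)), shared_terms(TERMS))
```

Here "CA1" and "Field CA1" do not match exactly, and "Striatum" and "Caudate putamen" would need a synonym table to be linked, which mirrors the gap between the exact, synonym, and partonomy counts above.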
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in NeurolexWiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location, and axon location; y-axis: Number, 0 to 300]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Diagram labels: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims/assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases/Data sets, Literature, Search and Discovery]
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Diagram: data set GSE13732 with the labels Analyzed, Curated, Analyzed, Mirrored]
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
But…even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
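Reducing the three spellings above to one form is a small normalization step. The sketch below is an assumption about how that might look: the choice of "GSE" plus the number as the standard form, and the function name, are illustrative rather than NIF's actual convention.

```python
import re

# Accepts GSE13732, GEOGSE13732, GEO:GSE13732, and the E-GEOD-13732 form.
_GEO_SERIES = re.compile(r"^(?:GEO[:\s]?)?GSE(\d+)$|^E-GEOD-(\d+)$", re.IGNORECASE)

def normalize_geo_series(identifier):
    """Return an assumed standard form 'GSE<number>', or None if unrecognized."""
    m = _GEO_SERIES.match(identifier.strip())
    if not m:
        return None
    number = m.group(1) or m.group(2)
    return f"GSE{int(number)}"

for raw in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
    print(raw, "->", normalize_geo_series(raw))
```

Returning None for anything unrecognized, rather than guessing, leaves the inconsistent cases visible for text mining or manual curation.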
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
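Because the two databases report change against different reference conditions, statements have to be put on a common footing before they are compared. The sketch below assumes a simple two-condition comparison in which switching the reference flips the sign of the change; the values and the function names are hypothetical.

```python
def to_common_reference(log2_change, reference, common="saline"):
    """Express a change relative to `common`, flipping the sign if needed.

    Assumes exactly two conditions, so changing the reference inverts the sign.
    """
    return log2_change if reference == common else -log2_change

def agree(stmt_a, stmt_b, common="saline"):
    """True if two (log2_change, reference) statements point the same way."""
    a = to_common_reference(*stmt_a, common=common)
    b = to_common_reference(*stmt_b, common=common)
    return (a > 0) == (b > 0)

# Hypothetical statements about the same gene.
gemma_stmt = (-1.2, "chronic morphine")  # reported against chronic morphine
drg_stmt = (1.1, "saline")               # reported against saline

naive = (gemma_stmt[0] > 0) == (drg_stmt[0] > 0)
print("naive comparison agrees:", naive)
print("aligned comparison agrees:", agree(gemma_stmt, drg_stmt))
```

Without the alignment step, every consistent pair would look like a contradiction, so the reference condition needs to travel with the data as metadata.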
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY
Database
Software Application
Data Analysis Service
Topical Portal
Core Facility
Ontology
Software Resource
Years
Anita Bandrowski and Burak Ozyurt
Population Coverage and Linkage of Resource Registry
bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates
ndash Identified for 570 resources
ndash 373 were not updated within the last 2
years (65)
bull Manual review of ~200 resources
ndash 38 not updated within the past 2 years
(~20)
ndash 8 migrated to new addresses or institutions
ndash 7 are no longer in service (~3)
ndash 3 were deemed no longer appropriate
What happens to these resources
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content
bullChange namebullChange scope
bullgt 7000 updates to the registry last year
Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sourcesgt 800 M records
What do you mean by dataDatabases come in many shapes and sizes
bull Primary data
ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)
bull Secondary datandash Data features extracted through
data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)
bull Tertiary datandash Claims and assertions about the
meaning of data
bull Eg gene upregulationdownregulation brain activation as a function of task
bull Registriesndash Metadata
ndash Pointers to data sets or materials stored elsewhere
bull Data aggregatorsndash Aggregate data of the same
type from multiple sources eg Cell Image Library SUMSdb Brede
bull Single sourcendash Data acquired within a single
context eg Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF A search engine for data
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
ldquoData trailsrdquo Linking data and analysis tools
Query across Registry and Federation
bull Registry and Federation were treated separately even though Federation comprises views of Registry entries
bull Experimenting with new combined index
SciCrunch A ldquosocial networkrdquo for resources
bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery
and general browsingbull Can perform analytics across
the spectrum of biomedical resources
bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according
to their needsbull Use their own branding
bull How do we create a system that satisfies community needs without creating another silo
Put dkNET here
httpdknetorg
Autogenerated snippets
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are
there
NIH funding announcement BRAIN Initiative Transformative
Approaches for Cell-Type Classification in the Brain
ldquoThe mammalian brain contains a vast number of cells These cells are
generally grouped within broad classes (eg neurons or glia) but it is
currently unknown exactly how many classes existrdquo
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones Neurons and their properties
Analysis of Red Links in the Neuron Registry
bull INCF Project
ndash Neuron Registrybull Neurolexorg
bull Semantic MediaWiki
ndash gt 30 experts worldwide
ndash Fill out neuron pages in NeurolexWiki
Soma location
Dendrite location
Axon location
0
50
100
150
200
250
300
NumberTotal
redlinkseasy fixes
hard fixes
Soma location
Dendrite location
Axon location
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data different analysis
bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID
bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine
bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis
bullResults for 1 gene were opposite in DRG and Gemma
bull45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
Database
Software Application
Data Analysis Service
Topical Portal
Core Facility
Ontology
Software Resource
Years
Anita Bandrowski and Burak Ozyurt
Population Coverage and Linkage of Resource Registry
bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates
ndash Identified for 570 resources
ndash 373 were not updated within the last 2
years (65)
bull Manual review of ~200 resources
ndash 38 not updated within the past 2 years
(~20)
ndash 8 migrated to new addresses or institutions
ndash 7 are no longer in service (~3)
ndash 3 were deemed no longer appropriate
What happens to these resources
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content
bullChange namebullChange scope
bullgt 7000 updates to the registry last year
Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sourcesgt 800 M records
What do you mean by dataDatabases come in many shapes and sizes
bull Primary data
ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)
bull Secondary datandash Data features extracted through
data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)
bull Tertiary datandash Claims and assertions about the
meaning of data
bull Eg gene upregulationdownregulation brain activation as a function of task
bull Registriesndash Metadata
ndash Pointers to data sets or materials stored elsewhere
bull Data aggregatorsndash Aggregate data of the same
type from multiple sources eg Cell Image Library SUMSdb Brede
bull Single sourcendash Data acquired within a single
context eg Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF A search engine for data
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters: Progressive refinement of search
[Figure: query "Addiction" refined by facet/filter. Facets: Source, Category, Index. Values shown: Registry, Data, Literature; Gene; Gemma (Gene, Organism, Expression level); GEO; Integrated Expression]
More effective to start with a general query and use the navigation to refine search
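The progressive refinement described on this slide can be sketched as follows. This is a minimal illustration, not NIF's implementation: the `Record` fields mirror the facet names on the slide (Source, Category, Index), while the sample records, the "PubMed" source, and the function names are assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str    # e.g. "Gemma" (facet: Source)
    category: str  # e.g. "Gene" (facet: Category)
    index: str     # e.g. "Data", "Registry", "Literature" (facet: Index)
    text: str      # searchable content (hypothetical)


def search(records, keyword):
    """General query: keep every record whose text mentions the keyword."""
    kw = keyword.lower()
    return [r for r in records if kw in r.text.lower()]


def facet_counts(records, facet):
    """Count hits per facet value so the user can see where to drill down."""
    return Counter(getattr(r, facet) for r in records)


def refine(records, **filters):
    """Apply one or more facet filters to an existing result set."""
    return [r for r in records
            if all(getattr(r, name) == value for name, value in filters.items())]


# Hypothetical toy records, for illustration only.
RECORDS = [
    Record("Gemma", "Gene", "Data", "addiction-related expression change"),
    Record("GEO", "Microarray", "Data", "morphine addiction microarray series"),
    Record("Registry", "Resource", "Registry", "addiction research database"),
    Record("PubMed", "Literature", "Literature", "review of addiction circuitry"),
    Record("Gemma", "Gene", "Data", "cerebellum expression profile"),
]

hits = search(RECORDS, "addiction")            # start general
by_index = facet_counts(hits, "index")         # inspect the facets
gene_hits = refine(hits, index="Data", category="Gene")  # then narrow
```

Starting from the general query and narrowing with `refine` keeps the facet counts visible at each step, which is the navigation pattern the slide recommends.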
Concept Mapper: Alignment and weighting
Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Findgene Anatomycerebellum
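One way to read the Concept Mapper query above is sketched below, under stated assumptions: each source declares which concept every column is mapped to, and a query names a concept plus a keyword. The source names, column names, rows, and the `find` function are hypothetical, and the weighting step mentioned in the slide title is not modeled.

```python
# Hypothetical sources: each maps its columns to concepts and holds a few rows.
SOURCES = {
    "expression_db": {
        "column_concepts": {"symbol": "gene", "region": "anatomy"},
        "rows": [
            {"symbol": "Calb1", "region": "cerebellum"},
            {"symbol": "Th", "region": "midbrain"},
        ],
    },
    "image_db": {
        "column_concepts": {"structure": "anatomy"},
        "rows": [{"structure": "cerebellum"}],
    },
}


def find(concept, keyword, keyword_concept=None):
    """Return sources that have a column mapped to `concept` and contain `keyword`.

    If `keyword_concept` is given, the keyword must occur in a column mapped
    to that concept; otherwise any column may contain it.
    """
    kw = keyword.lower()
    matches = []
    for name, source in SOURCES.items():
        mapping = source["column_concepts"]
        if concept not in mapping.values():
            continue  # no column aligned to the requested concept
        columns = [col for col, mapped in mapping.items()
                   if keyword_concept in (None, mapped)]
        if any(kw in str(row.get(col, "")).lower()
               for row in source["rows"] for col in columns):
            matches.append(name)
    return matches


gene_sources = find("gene", "cerebellum")
gene_sources_anatomy = find("gene", "cerebellum", keyword_concept="anatomy")
anatomy_sources = find("anatomy", "cerebellum")
```

Restricting the keyword to a concept-mapped column is one plausible reading of the second query form on the slide; it trades recall for fewer incidental text matches.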
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
- Very powerful for discovery and general browsing
- Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
- Specialized for their domain
- Restrict the particular sources
- Organize the data according to their needs
- Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Chart: log-scale axis from 1 to 10,000,000,000, by data category: Annotation, Atlas, Brain activation foci, Clinical trials, Dataset, Drugs, Gene, Images, Microarray, Molecule, Multimedia, Pathways, Phenotype, Protocols, Software]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Figure: SciCrunch shared resources (shared curation, shared expertise) and community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]
Making use of community
[Figure: search refined by facet/filter with an added Community facet. Facets: Source, Category, Index, Community. Values shown: Community resources, SciCrunch data (all), Literature; Gene; Gemma (Gene, Organism, Expression level); GEO; Integrated Expression]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: data sources by brain region (Forebrain, Midbrain, Hindbrain); legend bins: 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
S. W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Chart: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
antibodyregistry.org; Trish Whetzel and Anita Bandrowski
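The binning behind a chart like this can be sketched as below. The bin edges are the ones listed on the slide; the gene names and result counts are invented placeholders, not Antibody Registry data, and the function names are assumptions.

```python
from collections import Counter

# Inclusive bin edges taken from the slide; anything above 1000 is "1001+".
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"), (101, 1000, "101-1000")]


def bin_label(result_count):
    """Map a per-gene search-result count to its bin label."""
    if result_count < 0:
        raise ValueError("result counts cannot be negative")
    for low, high, label in BINS:
        if low <= result_count <= high:
            return label
    return "1001+"


def histogram(results_per_gene):
    """Count how many genes fall into each bin."""
    return Counter(bin_label(n) for n in results_per_gene.values())


# Placeholder counts, for illustration only.
example = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 10, "GENE_D": 250, "GENE_E": 5000}
example_histogram = histogram(example)
```

The "0" bin is the interesting one for gap analysis: it lists the genes for which a comprehensive index returns no antibody at all.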
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
- Brain Architecture Management System (rodent)
- Temporal-lobe.com (rodent)
- Connectome Wiki (human)
- Brain Maps (various)
- CoCoMac (primate cortex)
- UCLA Multimodal database (human fMRI)
- Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
- Number of exact terms used in > 1 database: 42
- Number of synonym matches: 99
- Number of 1st order partonomy matches: 385
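The three match types counted above (exact, synonym, first-order partonomy) can be illustrated with a small sketch. The synonym and part-of tables below are toy assumptions standing in for an ontology such as NIFSTD, and the function names are hypothetical; this is not the procedure used to produce the slide's numbers.

```python
def normalize(term):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(term.lower().split())


# Toy lookup tables (assumptions for illustration, not an actual ontology).
SYNONYMS = {"caudate nucleus": {"caudate"}}
PART_OF = {"caudate nucleus": "striatum", "putamen": "striatum"}


def match_type(term_a, term_b):
    """Classify how two brain-region terms relate, or return None."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    if b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set()):
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if PART_OF.get(a) == b or PART_OF.get(b) == a:
        return "partonomy"
    return None


def tally(terms_db1, terms_db2):
    """Count match types over all cross-database term pairs."""
    counts = {"exact": 0, "synonym": 0, "partonomy": 0}
    for a in terms_db1:
        for b in terms_db2:
            kind = match_type(a, b)
            if kind:
                counts[kind] += 1
    return counts


example_counts = tally(["Caudate Nucleus", "Putamen"],
                       ["caudate nucleus", "caudate", "Striatum", "cerebellum"])
```

Checking exact matches first, then synonyms, then partonomy keeps each pair in a single category, which matters when the three counts are reported side by side.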
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
- Neuron Registry
• Neurolex.org
• Semantic MediaWiki
- > 30 experts worldwide
- Fill out neuron pages in Neurolex Wiki
[Chart: red links in the Neuron Registry by property (soma location, dendrite location, axon location); y-axis: number, 0 to 300; series: total red links, easy fixes, hard fixes]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Figure: Search and Discovery, spanning: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims and assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature]
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
- Dr. Lucila Ohno-Machado, PI
- FORCE11: Community engagement piece
• What should a data discovery index do?
- Task Forces
- Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNET, 3DVC, FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
- Comprehensive resource registry
- UPCs for research resources
• Information framework
- Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
- Share expertise
- Metrics of trust
- Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
- Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
- Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
[Figure: data set GSE13732 moving between resources; step labels: Analyzed, Curated, Mirrored]
But… even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
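A standard identifier format of this kind can be approximated with a small normalizer, sketched here for the three spellings shown on the slide. The choice of `GSE<number>` as the canonical form, the accepted prefix separators, and the function name are assumptions for illustration; the slide does not say which form NIF adopted.

```python
import re

# Spellings seen on the slide: "GSE13732", "E-GEOD-13732", "GEOGSE13732".
# The optional ":" or "/" after "GEO" is an assumption about other variants.
_PATTERNS = [
    re.compile(r"^(?:GEO\s*[:/]?\s*)?GSE(\d+)$", re.IGNORECASE),
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),
]


def canonical_geo_series(identifier):
    """Return the identifier as 'GSE<number>', or None if it is not recognized."""
    text = identifier.strip()
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return f"GSE{int(match.group(1))}"
    return None


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {canonical_geo_series(v) for v in variants}
```

Once every federated source reports the same canonical string, records that describe the same data set can be joined on it, which is what makes the "tracking" on the previous slides possible.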
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
- 1370 statements from Gemma regarding gene expression as a function of chronic morphine
- 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
- Results for 1 gene were opposite in DRG and Gemma
- 45 did not have enough information provided in the paper to make a judgment
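Because Gemma and DRG report change against opposite baselines, a comparison has to align the directions before counting agreement. The sketch below shows that step only; the gene names and directions are placeholders, the function names are hypothetical, and it does not reproduce the analysis behind the numbers above.

```python
from collections import Counter


def flip(direction):
    """Reverse a direction of change ('up' <-> 'down')."""
    return {"up": "down", "down": "up"}[direction]


def compare(gemma, drg, gemma_baseline_reversed=True):
    """Label each Gemma statement as consistent, opposite, or missing in DRG.

    `gemma` and `drg` map a gene to 'up' or 'down'. When the two sources use
    opposite baselines, Gemma's direction is flipped before comparing.
    """
    outcome = {}
    for gene, gemma_direction in gemma.items():
        if gene not in drg:
            outcome[gene] = "missing"
            continue
        aligned = flip(gemma_direction) if gemma_baseline_reversed else gemma_direction
        outcome[gene] = "consistent" if aligned == drg[gene] else "opposite"
    return outcome


# Placeholder statements, for illustration only.
gemma_statements = {"GENE_A": "down", "GENE_B": "up", "GENE_C": "up"}
drg_statements = {"GENE_A": "up", "GENE_B": "up"}

outcome = compare(gemma_statements, drg_statements)
summary = Counter(outcome.values())
```

Skipping the flip would turn every truly consistent pair into an apparent contradiction, which is why the baseline has to be recorded alongside the result.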
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
- Machine processible (i.e., unique identifier that resolves to a single resource)
- Outside of the paywall
- Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
- Software/databases
- Antibodies
- Genetically modified organisms
Launched February 2014: > 30 journals participating
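"Machine processible" means a script can pull the identifiers out of a methods section and check that each resolves to a single record. The sketch below assumes identifiers are written as `RRID:` followed by a registry prefix, an underscore, and an accession; that pattern, the sample identifiers, and the toy registry are assumptions for illustration, not values taken from the pilot.

```python
import re

# Assumed citation syntax: "RRID:<PREFIX>_<accession>".
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9_:-]+)")


def extract_rrids(text):
    """Return every identifier cited in the text, in order of appearance."""
    return RRID_PATTERN.findall(text)


def resolve_rrids(text, registry):
    """Map each cited identifier to its registry record, or None if unresolved."""
    return {rrid: registry.get(rrid) for rrid in extract_rrids(text)}


# Made-up identifiers and registry content.
methods_text = (
    "Sections were stained with anti-X antibody (RRID:AB_0000001) "
    "and analyzed with ExampleTool (RRID:SCR_0000002)."
)
toy_registry = {"AB_0000001": "hypothetical antibody record"}

resolved = resolve_rrids(methods_text, toy_registry)
unresolved = [rrid for rrid, record in resolved.items() if record is None]
```

A check like this is one way identifiers that were mistyped or altered during copyediting could be caught before publication.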
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph
This is your brain on computers
• Automated text mining is used to look for “web page last updated” or copyright dates
- Identified for 570 resources
- 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
- 38 not updated within the past 2 years (~20%)
- 8 migrated to new addresses or institutions
- 7 are no longer in service (~3%)
- 3 were deemed no longer appropriate
What happens to these resources?
The Registry provides a persistent identifier and metadata record for what once existed but no longer does
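The automated check described above can be sketched as a pair of regular expressions over page text. The patterns, the function names, and the two-year default are assumptions chosen to mirror the slide; NIF's actual text-mining pipeline is not described in the source.

```python
import re
from datetime import date

# Look for a 4-digit year shortly after "last updated/modified" or a copyright mark.
_UPDATED = re.compile(
    r"last\s+(?:updated|modified)[^\n]{0,40}?((?:19|20)\d{2})", re.IGNORECASE)
_COPYRIGHT = re.compile(
    r"(?:copyright|©|\(c\))[^\n]{0,20}?((?:19|20)\d{2})", re.IGNORECASE)


def latest_year(page_text):
    """Return the most recent year found by either pattern, or None."""
    years = [int(y) for pattern in (_UPDATED, _COPYRIGHT)
             for y in pattern.findall(page_text)]
    return max(years) if years else None


def is_stale(page_text, today, max_age_years=2):
    """True if the newest date is older than the cutoff; None if no date found."""
    year = latest_year(page_text)
    if year is None:
        return None  # needs manual review
    return today.year - year > max_age_years


check_date = date(2014, 6, 1)
old_page = "Web page last updated: March 2011. Copyright 2009 Example Lab"
fresh_page = "© 2013 Example Lab"
```

Only the first year after each marker is captured, so a range such as "2009-2013" is read as 2009; a production pipeline would need to handle that, which is one reason the slide pairs the automated pass with manual review.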
Keeping content up to date
Connectome, Tractography, Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
250 sources, > 800 M records
Keeping content up to date
Connectome
Tractography
Epigenetics
bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content
bullChange namebullChange scope
bullgt 7000 updates to the registry last year
Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sourcesgt 800 M records
What do you mean by dataDatabases come in many shapes and sizes
bull Primary data
ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)
bull Secondary datandash Data features extracted through
data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)
bull Tertiary datandash Claims and assertions about the
meaning of data
bull Eg gene upregulationdownregulation brain activation as a function of task
bull Registriesndash Metadata
ndash Pointers to data sets or materials stored elsewhere
bull Data aggregatorsndash Aggregate data of the same
type from multiple sources eg Cell Image Library SUMSdb Brede
bull Single sourcendash Data acquired within a single
context eg Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF A search engine for data
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
ldquoData trailsrdquo Linking data and analysis tools
Query across Registry and Federation
bull Registry and Federation were treated separately even though Federation comprises views of Registry entries
bull Experimenting with new combined index
SciCrunch A ldquosocial networkrdquo for resources
bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery
and general browsingbull Can perform analytics across
the spectrum of biomedical resources
bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according
to their needsbull Use their own branding
bull How do we create a system that satisfies community needs without creating another silo
Put dkNET here
httpdknetorg
Autogenerated snippets
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from the literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal-lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
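The three match levels counted on this slide (exact term, synonym, first-order partonomy) can be illustrated with a minimal sketch. The synonym and part-of tables below are invented for illustration and do not come from the NIF Connectivity databases; this shows one possible way to classify a pair of terms, not necessarily how the counts above were produced.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial spelling variants compare equal."""
    return " ".join(term.lower().split())

# Hypothetical lookup tables (illustrative only).
SYNONYMS = {"ammon's horn": "hippocampus"}   # synonym -> preferred label
PART_OF = {"ca1": "hippocampus"}             # part -> direct parent

def match_level(term_a, term_b):
    """Classify how two brain-region terms relate: exact, synonym, partonomy, or none."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    # Map each term to its preferred label before comparing.
    if SYNONYMS.get(a, a) == SYNONYMS.get(b, b):
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if PART_OF.get(a) == b or PART_OF.get(b) == a:
        return "partonomy"
    return "none"
```

Checking the levels in this order means the strictest match wins, so a pair is counted only once.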
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in the NeuroLex Wiki
[Figure: bar chart of red links in the Neuron Registry by property (soma location, dendrite location, axon location); series: total red links, easy fixes, hard fixes; y-axis: number, 0 to 300]
Social networks and community sites let us learn from the collective behavior of contributors and show the limits of our knowledge and of our knowledge representations
[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases; Data sets; Literature; Search and Discovery]
We cannot shoehorn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer
And my colleagues in Monarch, dkNET, 3DVC, and FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also an accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Data flows throughout the ecosystem; value is added
[Diagram: data set GSE13732 as it is analyzed, curated, and mirrored across resources]
Stable identifiers and annotations allow us to “track” data as it moves
But… even our standards need standards
GSE13732
E-GEOD-13732
GEO:GSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
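The identifier variants listed on this slide all point at the same data set, so a federation needs to map them onto one canonical form. The sketch below shows one way this might be done. The patterns are inferred only from the variants shown here, the function name `normalize_geo_id` is hypothetical, and choosing the bare `GSE` accession as the canonical form is an assumption made for illustration.

```python
import re

# Patterns inferred from the variants shown on the slide (assumption:
# each variant wraps the same numeric series accession).
_PATTERNS = [
    re.compile(r"^GSE(\d+)$"),        # GSE13732
    re.compile(r"^E-GEOD-(\d+)$"),    # E-GEOD-13732
    re.compile(r"^GEO:?GSE(\d+)$"),   # GEO:GSE13732 or GEOGSE13732
]

def normalize_geo_id(raw):
    """Return the canonical 'GSE<number>' form, or None if the input is not recognized."""
    token = raw.strip().upper()
    for pattern in _PATTERNS:
        match = pattern.match(token)
        if match:
            return "GSE" + match.group(1)
    return None

variants = ["GSE13732", "E-GEOD-13732", "GEO:GSE13732", "GEOGSE13732"]
canonical = {normalize_geo_id(v) for v in variants}
```

Returning `None` for unrecognized input, instead of guessing, leaves room for the text-mining step mentioned on the slide to handle the remaining inconsistencies.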
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to a baseline of chronic morphine; DRG presented them with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
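Because the two resources report change against opposite baselines, directions have to be aligned before claims can be compared. The sketch below illustrates that step under simple assumptions: each resource is reduced to a mapping from gene to "up" or "down", the gene names are invented, and `compare_claims` is a hypothetical helper, not part of Gemma, DRG, or NIF.

```python
def flip(direction):
    """Reverse a direction of change; anything other than up/down is returned unchanged."""
    return {"up": "down", "down": "up"}.get(direction, direction)

def compare_claims(claims_a, claims_b, b_baseline_reversed=True):
    """Compare per-gene directions from two resources.

    claims_a and claims_b map gene -> 'up' or 'down'. When
    b_baseline_reversed is True, directions in claims_b are flipped
    first so both resources use the same reference condition.
    """
    result = {"consistent": [], "opposite": [], "missing": []}
    for gene, dir_a in sorted(claims_a.items()):
        dir_b = claims_b.get(gene)
        if dir_b is None:
            # Not enough information in the second resource to judge.
            result["missing"].append(gene)
            continue
        if b_baseline_reversed:
            dir_b = flip(dir_b)
        key = "consistent" if dir_a == dir_b else "opposite"
        result[key].append(gene)
    return result

# Invented example data, for illustration only.
resource_a = {"geneA": "up", "geneB": "down", "geneC": "up"}
resource_b = {"geneA": "down", "geneB": "down"}
summary = compare_claims(resource_a, resource_b)
```

Skipping the flip would make agreeing claims look contradictory, which is the pitfall this slide points out.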
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
– Software/databases
– Antibodies
– Genetically modified organisms
Launched February 2014; > 30 journals participating
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
250 sources; > 800 M records
What do you mean by data? Databases come in many shapes and sizes
• Primary data
– Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
  • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Diagram: NIFSTD modules: Organism; NS Function; Molecule; Investigation; Subcellular Structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure]
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
– Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Screenshot: search results for the query “Addiction”. Labels: Facet; Filter; Source; Category; Index; Registry; Data; Gene; Gemma; Gene, Organism, Expression level; GEO; Integrated Expression; Literature]
More effective to start with a general query and use the navigation to refine search
Concept Mapper: Alignment and weighting
Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Findgene Anatomycerebellum
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Figure: bar chart by data category on a log scale (axis ticks: 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000); categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram: SciCrunch shared resources and community portals. Labels: SciCrunch; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; One Mind for Research; Consortia-Pedia; Faster Cures; Model Organism Databases; Community Outreach; Shared curation, shared expertise; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF EarthCube]
Making use of community
[Screenshot: community search facets. Labels: Facet; Filter; Source; Category; Index]
This is your brain on computers
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views
250 sourcesgt 800 M records
What do you mean by dataDatabases come in many shapes and sizes
bull Primary data
ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)
bull Secondary datandash Data features extracted through
data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)
bull Tertiary datandash Claims and assertions about the
meaning of data
bull Eg gene upregulationdownregulation brain activation as a function of task
bull Registriesndash Metadata
ndash Pointers to data sets or materials stored elsewhere
bull Data aggregatorsndash Aggregate data of the same
type from multiple sources eg Cell Image Library SUMSdb Brede
bull Single sourcendash Data acquired within a single
context eg Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF A search engine for data
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
ldquoData trailsrdquo Linking data and analysis tools
Query across Registry and Federation
bull Registry and Federation were treated separately even though Federation comprises views of Registry entries
bull Experimenting with new combined index
SciCrunch A ldquosocial networkrdquo for resources
bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery
and general browsingbull Can perform analytics across
the spectrum of biomedical resources
bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according
to their needsbull Use their own branding
bull How do we create a system that satisfies community needs without creating another silo
Put dkNET here
httpdknetorg
Autogenerated snippets
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are
there
NIH funding announcement BRAIN Initiative Transformative
Approaches for Cell-Type Classification in the Brain
ldquoThe mammalian brain contains a vast number of cells These cells are
generally grouped within broad classes (eg neurons or glia) but it is
currently unknown exactly how many classes existrdquo
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones Neurons and their properties
Analysis of Red Links in the Neuron Registry
bull INCF Project
ndash Neuron Registrybull Neurolexorg
bull Semantic MediaWiki
ndash gt 30 experts worldwide
ndash Fill out neuron pages in NeurolexWiki
Soma location
Dendrite location
Axon location
0
50
100
150
200
250
300
NumberTotal
redlinkseasy fixes
hard fixes
Soma location
Dendrite location
Axon location
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
[Figure: data set GSE13732 shown in several resources, labeled Curated, Analyzed, and Mirrored]
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
But… even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
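A minimal sketch of what a "standard identifier format" step might look like for the three spellings on this slide. The regular expression, the choice of `GSE<number>` as the canonical form, and the function name are assumptions for illustration; the slide does not describe how NIF actually normalizes identifiers.

```python
import re
from typing import Optional

# Assumed variants: an optional "GEO" source prefix, then either the GEO
# series prefix "GSE" or the mirrored "E-GEOD-" prefix, then the number.
_SERIES = re.compile(r"^(?:GEO[:_]?)?(?:GSE|E-GEOD-)(\d+)$", re.IGNORECASE)


def normalize_series_id(raw: str) -> Optional[str]:
    """Return the canonical 'GSE<number>' form, or None if unrecognized."""
    found = _SERIES.match(raw.strip())
    return f"GSE{found.group(1)}" if found else None


if __name__ == "__main__":
    for variant in ("GSE13732", "E-GEOD-13732", "GEOGSE13732"):
        print(variant, "->", normalize_series_id(variant))
```

Returning `None` for anything unrecognized keeps unknown spellings visible, so they can be handed to a text mining or curation step instead of being silently merged.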
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
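The baseline issue above can be sketched in a few lines: a claim of "down relative to chronic morphine" and a claim of "up relative to saline" agree once both are expressed against the same baseline. The `Claim` type, the gene name, and the helper functions are hypothetical; only the counts 617 and 1370 come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    gene: str
    direction: int   # +1 = upregulated, -1 = downregulated
    condition: str   # condition being described
    baseline: str    # condition it is compared against


def relative_to(claim: Claim, baseline: str) -> Claim:
    """Re-express a claim against `baseline`, flipping the sign if needed."""
    if claim.baseline == baseline:
        return claim
    if claim.condition == baseline:
        # Swapping condition and baseline reverses the direction of change.
        return Claim(claim.gene, -claim.direction, claim.baseline, claim.condition)
    raise ValueError("claim does not involve the requested baseline")


def consistent(a: Claim, b: Claim, baseline: str) -> bool:
    """True if both claims agree once expressed against the same baseline."""
    return relative_to(a, baseline) == relative_to(b, baseline)


# Hypothetical pair of statements about the same (made-up) gene.
gemma_style = Claim("GeneX", -1, "saline", "chronic morphine")
drg_style = Claim("GeneX", +1, "chronic morphine", "saline")

# Share of Gemma statements consistent with DRG, from the slide's counts.
consistent_fraction = 617 / 1370

if __name__ == "__main__":
    print(consistent(gemma_style, drg_style, baseline="saline"))
    print(f"{consistent_fraction:.1%} consistent")
```

With these counts the consistent share is about 45%, which matches the statement that over half of the claims were not confirmed.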
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processable (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014; > 30 journals participating
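To show why machine-processable identifiers matter, here is a small sketch that pulls resource identifiers out of a methods sentence. It assumes identifiers are written with an `RRID:` prefix followed by a registry prefix and an accession (for example `AB_` or `SCR_`); that syntax, the sample sentence, and the made-up accession numbers are assumptions of this example, not details given on the slide.

```python
import re
from typing import List

# Assumed citation syntax: "RRID:" then a registry prefix, "_", and an accession.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Z]+_[A-Za-z0-9]+)")


def extract_rrids(text: str) -> List[str]:
    """Return the unique identifiers cited in `text`, sorted alphabetically."""
    return sorted({found.group(1) for found in RRID_PATTERN.finditer(text)})


if __name__ == "__main__":
    # Invented sentence with invented accession numbers.
    methods = (
        "Sections were stained with an anti-CART antibody (RRID:AB_000001) "
        "and images were analyzed with custom software (RRID:SCR_000002)."
    )
    print(extract_rrids(methods))
```

A free-text description such as "a rabbit polyclonal antibody from vendor X" cannot be matched this way, which is the gap the initiative is meant to close.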
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
What do you mean by data? Databases come in many shapes and sizes
• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
  – E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Figure: facet and filter navigation (Source, Category, Index) for the query “Addiction”, with results from Registry, Data (Gene, Gemma, Organism, Expression level, Geo, Integrated Expression), and Literature]
More effective to start with a general query and use the navigation to refine search
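Progressive refinement can be sketched as two operations on a result set: counting the values of a facet, then filtering on a chosen value. The records and field names below are invented for illustration and do not reflect NIF's actual index schema.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical hits for one broad query; field names are illustrative only.
RESULTS: List[Record] = [
    {"title": "r1", "index": "Data", "category": "Gene", "source": "Gemma"},
    {"title": "r2", "index": "Data", "category": "Gene", "source": "Geo"},
    {"title": "r3", "index": "Literature", "category": "Article", "source": "Literature"},
    {"title": "r4", "index": "Registry", "category": "Resource", "source": "Registry"},
]


def facet_counts(records: List[Record], facet: str) -> Counter:
    """How many records fall under each value of one facet."""
    return Counter(record[facet] for record in records)


def refine(records: List[Record], **filters: str) -> List[Record]:
    """Keep only the records that match every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]


if __name__ == "__main__":
    print(facet_counts(RESULTS, "index"))   # overview of the broad query
    narrowed = refine(RESULTS, index="Data")
    print(facet_counts(narrowed, "source"))  # next level of refinement
```

Starting broad and narrowing by facet shows the user what exists before committing to a source, which is the point made on the slide.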
Concept Mapper: Alignment and weighting
Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Findgene Anatomycerebellum
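One way to read the query above: each source declares which concept each of its columns maps to, and the search keeps sources that have a column mapped to the requested concept and also contain the keyword. The source names, column names, and the `find` function are hypothetical; this is a sketch of the idea, not the Concept Mapper's implementation.

```python
from typing import Dict, List

# Hypothetical sources: column-to-concept mappings plus indexed keywords.
SOURCES: Dict[str, dict] = {
    "source_a": {
        "columns": {"gene_symbol": "gene", "region": "anatomy"},
        "keywords": {"cerebellum", "cortex"},
    },
    "source_b": {
        "columns": {"structure": "anatomy"},
        "keywords": {"cerebellum"},
    },
    "source_c": {
        "columns": {"symbol": "gene"},
        "keywords": {"striatum"},
    },
}


def find(concept: str, keyword: str, sources: Dict[str, dict] = SOURCES) -> List[str]:
    """Sources with a column mapped to `concept` that also contain `keyword`."""
    return sorted(
        name
        for name, source in sources.items()
        if concept in source["columns"].values() and keyword in source["keywords"]
    )


if __name__ == "__main__":
    print(find("gene", "cerebellum"))
```

Here only `source_a` qualifies: `source_b` mentions cerebellum but has no gene column, and `source_c` has a gene column but no cerebellum data.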
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART
[Figure: bar chart by data category on a logarithmic axis from 1 to 10,000,000,000. Categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It's like a Mendeley for resources
[Figure: SciCrunch shared resources linked to community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]
Shared curation, shared expertise
Making use of community
[Figure: facet and filter navigation (Source, Category, Index) with Community, Community resources, and SciCrunch data (all): Gene, Gemma, Organism, Expression level, Geo, Integrated Expression, Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain), binned as 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
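The gap analyses above group entities by how many records mention them. A minimal sketch of that binning, using the bin edges from the antibody slide, follows; the input counts and names are invented and the function names are assumptions.

```python
from typing import Dict

BIN_LABELS = ["0", "1-10", "11-100", "101-1000", "1001+"]


def bin_label(count: int) -> str:
    """Map a result count onto the bins used on the slide."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1001+"


def histogram(hits_per_entity: Dict[str, int]) -> Dict[str, int]:
    """Count how many entities (e.g., genes) fall into each bin."""
    totals = {label: 0 for label in BIN_LABELS}
    for count in hits_per_entity.values():
        totals[bin_label(count)] += 1
    return totals


if __name__ == "__main__":
    # Invented search-result counts for four placeholder genes.
    print(histogram({"GeneA": 0, "GeneB": 5, "GeneC": 7, "GeneD": 2000}))
```

The "0" bin is the interesting one for gap analysis: it lists the entities for which the index returns nothing at all.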
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st-order partonomy matches: 385
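The exact-match and synonym-match counts above can be thought of as two passes over the term lists. The sketch below uses invented database names, terms, and a tiny synonym table; it does not attempt the partonomy matching, which would need an anatomy ontology.

```python
from collections import defaultdict
from typing import Dict, List, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def shared_terms(db_terms: Dict[str, List[str]], synonyms: Dict[str, str]) -> Set[str]:
    """Labels used by more than one database after mapping synonyms.

    `synonyms` maps a normalized term to its preferred label; pass an empty
    dict to count exact (normalized) matches only.
    """
    seen = defaultdict(set)
    for db, terms in db_terms.items():
        for term in terms:
            key = normalize(term)
            seen[synonyms.get(key, key)].add(db)
    return {label for label, dbs in seen.items() if len(dbs) > 1}


# Hypothetical term lists from three databases.
DATABASES = {
    "db1": ["Hippocampus", "Caudate nucleus"],
    "db2": ["hippocampus", "Caudate"],
    "db3": ["Striatum"],
}
SYNONYMS = {"caudate": "caudate nucleus"}  # assumed synonym table

if __name__ == "__main__":
    exact = shared_terms(DATABASES, {})
    with_synonyms = shared_terms(DATABASES, SYNONYMS)
    print("exact:", exact)
    print("added by synonyms:", with_synonyms - exact)
```

Even in this toy case, most terms stay unmatched until a synonym table or ontology is brought in, which mirrors the small exact-match count on the slide.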
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement, BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
[Figure: location of cell soma, location of dendrites, location of local axon arbor]
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki
[Figure: bar chart of the number of red links (total, easy fixes, hard fixes) for soma location, dendrite location, and axon location; axis from 0 to 300]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Figure: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims and assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature, Search and Discovery]
Cannot shoehorn everything into a single representation or system; instead, figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built? http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data different analysis
bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID
bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine
bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis
bullResults for 1 gene were opposite in DRG and Gemma
bull45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
NIF Information Framework Query and alignment
bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology
bull Available as services through NIF and BioPortal
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellularstructure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
NIF uses ontologies to enhance search and discovery but is not constrained by them
Find clinical trials that have data available
Current challenge With so much available how do I find what I need
bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends
bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and
tools the answers may differ
bull Many databases have tool bases and workflows that they supportndash Much value has been added to
individual data sets
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
ldquoData trailsrdquo Linking data and analysis tools
Query across Registry and Federation
bull Registry and Federation were treated separately even though Federation comprises views of Registry entries
bull Experimenting with new combined index
SciCrunch A ldquosocial networkrdquo for resources
bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery
and general browsingbull Can perform analytics across
the spectrum of biomedical resources
bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according
to their needsbull Use their own branding
bull How do we create a system that satisfies community needs without creating another silo
Put dkNET here
httpdknetorg
Autogenerated snippets
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]
antibodyregistry.org: Trish Whetzel and Anita Bandrowski
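The binning behind a gap-analysis figure like this one can be sketched in a few lines. This is only an illustration: the bin edges are taken from the legend above, while the gene names and hit counts are invented placeholders, not numbers from the Antibody Registry.

```python
from collections import Counter


def bin_label(count: int) -> str:
    """Return the legend bin (0, 1-10, 11-100, 101-1000, 1001+) for a hit count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1001+"


# Hypothetical search-result counts per gene; placeholders, not registry data.
hits_per_gene = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 150, "GENE_D": 4200, "GENE_E": 0}

# Number of genes that fall into each bin.
histogram = Counter(bin_label(n) for n in hits_per_gene.values())
print(dict(histogram))
```

The genes that land in the "0" bin are the gaps: entities that exist in the reference list but have no matching records in the index.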
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
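A count like "exact terms used in > 1 database" can be sketched as follows. This is not the NIF pipeline; the database names, term lists, and synonym table are small hypothetical stand-ins, and the normalization (lowercasing, whitespace collapsing) is an assumption about what "exact" means.

```python
from collections import Counter


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def shared_exact_terms(databases: dict) -> set:
    """Terms that occur, after normalization, in more than one database."""
    counts = Counter()
    for terms in databases.values():
        counts.update({normalize(t) for t in terms})
    return {term for term, n in counts.items() if n > 1}


def shared_after_synonyms(databases: dict, synonyms: dict) -> set:
    """Same count, but each term is first mapped to a preferred label."""
    mapped = {
        name: {synonyms.get(normalize(t), normalize(t)) for t in terms}
        for name, terms in databases.items()
    }
    return shared_exact_terms(mapped)


# Hypothetical term lists; real connectivity databases hold hundreds of terms each.
databases = {
    "db_rodent": {"Hippocampus", "CA1", "Ammon's horn"},
    "db_human": {"hippocampus", "Cornu ammonis"},
    "db_primate": {"area 46", "CA1"},
}
synonyms = {"ammon's horn": "cornu ammonis"}  # synonym -> preferred label

exact = shared_exact_terms(databases)
with_synonyms = shared_after_synonyms(databases, synonyms)
print(sorted(exact))
print(sorted(with_synonyms - exact))  # matches found only through synonyms
```

Partonomy matches would need a part-of hierarchy on top of this, which is where an ontology comes in.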
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki
[Figure: bar chart of red links in the Neuron Registry by property (soma location, dendrite location, axon location); axis "Number" from 0 to 300; series: total red links, easy fixes, hard fixes]
Social networks and community sites let us learn from the collective behavior of contributors; they show the limits in our knowledge and in our knowledge representations
[Figure: the data-knowledge ecosystem. Labels: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims/assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature, Search and Discovery]
We cannot shoehorn everything into a single representation or system; instead we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNET, 3DVC, FORCE11
BD2K-K2BD Data Discovery Index
• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
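To make "sets of triples" concrete, here is a minimal sketch of how wiki pages and their properties might be held as (subject, property, object) triples, and how a red link, an object that is referenced but has no page of its own, falls out of that structure. The property names and page titles are illustrative assumptions, not the actual Neurolex schema.

```python
# Each triple is (subject page, property, object). Property names are hypothetical.
triples = [
    ("Cerebellum Purkinje cell", "soma_location", "Purkinje cell layer of cerebellar cortex"),
    ("Cerebellum Purkinje cell", "neurotransmitter", "GABA"),
    ("Hippocampus CA1 pyramidal cell", "soma_location", "CA1 stratum pyramidale"),
]

# Pages that currently exist in this toy wiki.
defined_pages = {
    "Cerebellum Purkinje cell",
    "Hippocampus CA1 pyramidal cell",
    "Purkinje cell layer of cerebellar cortex",
    "GABA",
}


def objects_of(subject: str, prop: str) -> list:
    """All objects asserted for a given subject and property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def red_links(all_triples, pages) -> list:
    """Objects that are referenced in a triple but have no page of their own."""
    return sorted({o for _, _, o in all_triples if o not in pages})


print(objects_of("Cerebellum Purkinje cell", "soma_location"))
print(red_links(triples, defined_pages))
```

A red link is either an easy fix (the page exists under another name) or a hard one (the concept is missing or disputed), which is what makes counting them a useful gauge of the state of knowledge.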
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Figure: data set GSE13732 as it moves between resources, where it is analyzed, curated, and mirrored]
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem: value is added
But… even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
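One way to bring the three spellings above into a single form is a small normalizer. The sketch below is an assumption about how such a rule might look, not the NIF implementation: it treats a bare GSE accession, an E-GEOD- accession, and a GEO-prefixed accession (with or without a colon) as the same GEO series, and returns nothing for strings it does not recognize.

```python
import re
from typing import Optional

# Optional "GEO" or "GEO:" prefix, then either GSE or E-GEOD-, then the number.
_GEO_SERIES = re.compile(r"^(?:GEO:?\s*)?(?:GSE|E-GEOD-)(\d+)$", re.IGNORECASE)


def normalize_geo_series(identifier: str) -> Optional[str]:
    """Map known spellings of a GEO series accession to the form GSE<number>."""
    match = _GEO_SERIES.match(identifier.strip())
    return f"GSE{match.group(1)}" if match else None


for raw in ["GSE13732", "E-GEOD-13732", "GEOGSE13732", "GEO:GSE13732", "unknown-42"]:
    print(raw, "->", normalize_geo_series(raw))
```

Rule-based normalization only covers the variants someone has already seen, which is why the slide pairs it with text mining for the remaining inconsistencies.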
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
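The baseline issue described above can be illustrated with a short sketch: before two sources are compared, every claim is re-expressed against the same baseline, which flips the direction of change for claims reported the other way round. The gene names and directions are invented, and the sketch assumes the gene identifiers have already been reconciled, which in practice (Gene ID and symbol on one side, gene name and probe ID on the other) is a separate step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    gene: str
    direction: int  # +1 = up, -1 = down
    baseline: str   # condition the change is measured against


def to_baseline(claim: Claim, target: str, other: str) -> Claim:
    """Re-express a claim against `target`; flip the sign if it was reported against `other`."""
    if claim.baseline == target:
        return claim
    if claim.baseline == other:
        return Claim(claim.gene, -claim.direction, target)
    raise ValueError(f"unknown baseline: {claim.baseline!r}")


def compare(claims_a, claims_b, target="saline", other="chronic morphine") -> dict:
    """Label each gene in claims_a as consistent, opposite, or not found in claims_b."""
    b_by_gene = {c.gene: to_baseline(c, target, other) for c in claims_b}
    result = {}
    for claim in claims_a:
        a = to_baseline(claim, target, other)
        b = b_by_gene.get(a.gene)
        if b is None:
            result[a.gene] = "not found"
        elif a.direction == b.direction:
            result[a.gene] = "consistent"
        else:
            result[a.gene] = "opposite"
    return result


# Hypothetical claims; gene names and directions are placeholders.
source_a = [
    Claim("GeneA", -1, "chronic morphine"),
    Claim("GeneB", +1, "chronic morphine"),
    Claim("GeneC", +1, "chronic morphine"),
]
source_b = [Claim("GeneA", +1, "saline"), Claim("GeneB", +1, "saline")]

print(compare(source_a, source_b))
```

Without the baseline step, GeneA would look contradictory and GeneB would look confirmed, which is exactly the wrong way round.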
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014: > 30 journals participating
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
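Because the identifiers are meant to be machine processible, a check for RRIDs lost in copyediting can be sketched as a comparison between the submitted and the published text. The pattern below assumes an identifier is written as "RRID:" followed by an accession made of letters, digits, underscores, colons, or hyphens; that is an assumption for illustration, and the accessions in the sample sentences are made up.

```python
import re

# Assumed shape: "RRID:" then an accession that starts with a letter and ends
# with a letter, digit, or underscore (so trailing punctuation is excluded).
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z][\w:-]*\w)")


def extract_rrids(text: str) -> set:
    """Return the set of accessions that follow an RRID: prefix in the text."""
    return set(RRID_PATTERN.findall(text))


# Hypothetical manuscript sentences with made-up accessions.
submitted = (
    "Sections were stained with an anti-X antibody (RRID:AB_0000001) "
    "and analyzed in ToolY (RRID:SCR_0000002)."
)
published = (
    "Sections were stained with an anti-X antibody (RRID:AB_0000001) "
    "and analyzed in ToolY."
)

lost_in_copyediting = extract_rrids(submitted) - extract_rrids(published)
print(sorted(lost_in_copyediting))
```

Identifiers that are present but wrong need a second step, resolving each accession against the registry, which this sketch does not attempt.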
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
Find clinical trials that have data available
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Figure: search results for the query “Addiction” across Registry and Data, with facets and filters (Source, Category, Index). Labels include Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, Literature.]
More effective to start with a general query and use the navigation to refine search
Concept Mapper: Alignment and weighting
Findgene cerebellum = find all sources with column mapped to gene that also contain keyword cerebellum
Findgene Anatomycerebellum
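The idea behind this kind of query, selecting sources by what their columns mean and then by keyword, can be sketched as below. The exact query syntax is not recoverable from the slide, so the function signature is an invention for illustration; the source names, column mappings, and rows are hypothetical as well.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    column_concepts: dict  # column name -> concept the column is mapped to
    rows: list = field(default_factory=list)  # each row: column name -> value


def has_concept(source: Source, concept: str) -> bool:
    """True if any column of the source is mapped to the given concept."""
    return concept in source.column_concepts.values()


def contains_keyword(source: Source, keyword: str, concept=None) -> bool:
    """Search all columns, or only the columns mapped to `concept` when it is given."""
    needle = keyword.lower()
    for row in source.rows:
        for column, value in row.items():
            if concept is not None and source.column_concepts.get(column) != concept:
                continue
            if needle in value.lower():
                return True
    return False


def find(sources, concept: str, keyword: str, keyword_concept=None) -> list:
    """Names of sources with a column mapped to `concept` that also match the keyword."""
    return [
        s.name
        for s in sources
        if has_concept(s, concept) and contains_keyword(s, keyword, keyword_concept)
    ]


# Hypothetical federated sources.
sources = [
    Source("source_a", {"symbol": "gene", "region": "anatomy"},
           [{"symbol": "GeneA", "region": "Cerebellum"}]),
    Source("source_b", {"symbol": "gene", "region": "anatomy"},
           [{"symbol": "GeneB", "region": "Striatum"}]),
    Source("source_c", {"title": "text"},
           [{"title": "Cerebellum development"}]),
    Source("source_d", {"symbol": "gene", "note": "text"},
           [{"symbol": "GeneC", "note": "expressed outside cerebellum"}]),
]

print(find(sources, "gene", "cerebellum"))
print(find(sources, "gene", "cerebellum", keyword_concept="anatomy"))
```

Restricting the keyword to columns mapped to anatomy drops the source that only mentions the cerebellum in free text, which is the kind of refinement that column-to-concept alignment makes possible.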
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical trials, Brain activation foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are
there
NIH funding announcement BRAIN Initiative Transformative
Approaches for Cell-Type Classification in the Brain
ldquoThe mammalian brain contains a vast number of cells These cells are
generally grouped within broad classes (eg neurons or glia) but it is
currently unknown exactly how many classes existrdquo
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones Neurons and their properties
Analysis of Red Links in the Neuron Registry
bull INCF Project
ndash Neuron Registrybull Neurolexorg
bull Semantic MediaWiki
ndash gt 30 experts worldwide
ndash Fill out neuron pages in NeurolexWiki
Soma location
Dendrite location
Axon location
0
50
100
150
200
250
300
NumberTotal
redlinkseasy fixes
hard fixes
Soma location
Dendrite location
Axon location
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data different analysis
bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID
bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine
bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis
bullResults for 1 gene were opposite in DRG and Gemma
bull45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
Facets and filters Progressive refinement of search
FacetFilter
Source
Category
Index
Query Addiction
Registry Data
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
More effective to start with a general query and use the navigation to refine search
Concept Mapper Alignment and weighting
Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  - Very powerful for discovery and general browsing
  - Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  - Specialized for their domain
  - Restrict the particular sources
  - Organize the data according to their needs
  - Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Figure: chart with a logarithmic axis from 1 to 10,000,000,000 and one entry per data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram: SciCrunch Shared Resources and connected communities: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube. Label: Shared curation, shared expertise]
Making use of community
[Figure: search interface with Facet/Filter, Source, Category, and Index panels plus a Community panel; visible labels include Community resources, SciCrunch data (all), Gene, Gemma, Organism, Expression level, Geo, Integrated Expression, and Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: brain regions (Forebrain, Midbrain, Hindbrain) plotted against Data Sources; legend bins: 0, 1-10, 11-100, > 101]
Revealing biases in the dataspace
S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org
Trish Whetzel and Anita Bandrowski
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  - Brain Architecture Management System (rodent)
  - Temporal lobe.com (rodent)
  - Connectome Wiki (human)
  - Brain Maps (various)
  - CoCoMac (primate cortex)
  - UCLA Multimodal database (Human fMRI)
  - Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
  - Number of exact terms used in > 1 database: 42
  - Number of synonym matches: 99
  - Number of 1st order partonomy matches: 385
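The three match levels counted above (exact term, synonym, first-order partonomy) can be sketched as a tiered comparison. The tiny synonym and part-of tables here are hypothetical stand-ins; a real comparison would draw them from an ontology, and this is not the procedure NIF actually used.

```python
# Hypothetical mini-vocabulary; real synonym and part-of tables would come from an ontology.
SYNONYMS = {"ammon's horn": "hippocampus", "cornu ammonis": "hippocampus"}
PART_OF = {"ca1": "hippocampus", "dentate gyrus": "hippocampal formation"}


def canonical(term):
    """Lower-case a term and replace it with its preferred label if it is a known synonym."""
    t = term.strip().lower()
    return SYNONYMS.get(t, t)


def match_level(term_a, term_b):
    """Classify how two brain-region terms relate: exact, synonym, partonomy, or none."""
    a, b = term_a.strip().lower(), term_b.strip().lower()
    if a == b:
        return "exact"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "synonym"
    # First-order partonomy: one term is a direct part of the other.
    if PART_OF.get(ca) == cb or PART_OF.get(cb) == ca:
        return "partonomy"
    return "none"
```

For example, `match_level("CA1", "Ammon's horn")` returns `"partonomy"` with these tables, because the synonym is resolved before the part-of lookup. Terms that fall through every tier are the ones that make cross-database queries fail.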
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
[Figure panels: Location of Cell Soma; Location of dendrites; Location of local axon arbor]
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  - Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in Neurolex Wiki
[Figure: bar chart for Soma location, Dendrite location, and Axon location; y-axis: Number (0 to 300); series: Total redlinks, easy fixes, hard fixes]
Social networks and community sites let us learn things from the collective behavior of contributors; they show the limits in our knowledge and our knowledge representations
[Diagram: Domain Knowledge (Ontologies, Atlases/Maps); Annotation (Claims, assertions, Registries); Derived data (Models and simulations, Analyses); Data (Databases, Data sets, Literature); Search and Discovery]
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: Community engagement piece
• What should a data discovery index do?
  - Task Forces
  - Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  - Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org
Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Diagram: dataset GSE13732 with the labels Analyzed, Curated, Analyzed, Mirrored]
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
[Diagram: the same GSE13732 flow with the labels Analyzed, Curated, Analyzed, Mirrored]
But…even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
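Mapping the three spellings above onto one form can be sketched with a small normalizer. The canonical form chosen here (`GSE` plus digits), the optional separator after `GEO`, and the function name are assumptions for illustration. The source does not say which format NIF adopted.

```python
import re

# Variants seen on the slide: "GSE13732", "E-GEOD-13732", "GEOGSE13732".
# A separator after "GEO" may have been lost in extraction, so it is treated as optional.
_PATTERNS = [
    re.compile(r"^(?:GEO[:_\s]?)?GSE(\d+)$", re.IGNORECASE),  # GEO series, optional "GEO" prefix
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),             # ArrayExpress-style accession for a GEO series
]


def normalize_geo_series(identifier):
    """Map known spellings of a GEO series accession to one form, or return None."""
    cleaned = identifier.strip()
    for pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return f"GSE{int(match.group(1))}"
    return None


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
normalized = {normalize_geo_series(v) for v in variants}
```

Returning `None` for unrecognized strings, rather than guessing, leaves the inconsistent cases visible for text mining or manual curation.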
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
  - 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  - 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  - Results for 1 gene were opposite in DRG and Gemma
  - 45 did not have enough information provided in the paper to make a judgment
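Because the two databases report change against opposite baselines, statements have to be re-expressed against a common reference before they can be compared. A minimal sketch follows, with a hypothetical gene name and made-up directions; only the counts 617 and 1370 come from the slide.

```python
FLIP = {"up": "down", "down": "up"}


def align(statement, numerator, denominator):
    """Re-express a statement's direction as `numerator` relative to `denominator`."""
    pair = (statement["condition"], statement["reference"])
    if pair == (numerator, denominator):
        return statement["direction"]
    if pair == (denominator, numerator):
        return FLIP[statement["direction"]]  # same comparison, opposite baseline
    raise ValueError("statement compares different conditions")


# Hypothetical statements about one gene; the gene name and directions are made up.
gemma = {"gene": "GeneX", "condition": "saline",
         "reference": "chronic morphine", "direction": "down"}
drg = {"gene": "GeneX", "condition": "chronic morphine",
       "reference": "saline", "direction": "up"}

consistent = (align(gemma, "chronic morphine", "saline")
              == align(drg, "chronic morphine", "saline"))

# Share of Gemma statements consistent with DRG, using the counts on the slide.
consistent_fraction = 617 / 1370
```

The two hypothetical statements look contradictory as stored ("down" vs. "up") but agree once aligned. The slide's counts give a consistent fraction of about 0.45, which is why over half of the claims count as not confirmed.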
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
Query across Registry and Federation
• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Chart: log-scale axis (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch Federation become immediately available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for resources
[Diagram labels: SciCrunch; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; One Mind for Research; Consortia-Pedia; Faster Cures; Model Organism Databases; Community Outreach; Shared curation, shared expertise; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF Earthcube]
Making use of community
[Diagram labels: Facet; Filter; Source; Category; Index; Community; Community resources; SciCrunch data (all); Gene; Gemma; Gene; Organism; Expression level; Geo; Integrated Expression; Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: data sources by brain region (Forebrain, Midbrain, Hindbrain); legend bins for number of data sources: 0, 1-10, 11-100, > 101]
Revealing biases in the dataspace
SW Oh et al. Nature 000, 1-8 (2014). doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein coding genes (Entrez Gene) vs number of search results from antibodyregistry.org; legend bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
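The legend bins on this slide suggest a simple coverage tally: count search results per gene, then place each gene in a bin. A minimal sketch of that tally follows. The bin edges come from the slide, while the function names and the per-gene counts are invented for illustration:

```python
from collections import Counter

# Bin edges as shown in the slide legend; anything above 1000 falls in "1001+".
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"), (101, 1000, "101-1000")]


def bin_label(count):
    """Map a number of search results to its legend bin."""
    if count < 0:
        raise ValueError("count must be non-negative")
    for low, high, label in BINS:
        if low <= count <= high:
            return label
    return "1001+"


def coverage_histogram(results_per_gene):
    """Count how many genes fall into each bin."""
    return Counter(bin_label(n) for n in results_per_gene.values())


# Made-up counts for hypothetical genes.
results_per_gene = {"geneA": 0, "geneB": 7, "geneC": 100, "geneD": 101, "geneE": 5000}
histogram = coverage_histogram(results_per_gene)
```

The "0" bin is the interesting one for gap analysis: it lists the genes for which the index finds no antibody at all.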
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total 1800 unique brain terms (excluding Avian)
  • Number of exact terms used in > 1 database: 42
  • Number of synonym matches: 99
  • Number of 1st order partonomy matches: 385
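The exact-match and synonym-match counts above can be thought of as two passes over the same term lists. The sketch below shows one way to compute them, assuming a synonym table that maps each variant to a preferred label. The database names, terms, and synonym mapping are toy values, and partonomy matching is left out:

```python
from collections import defaultdict


def count_matches(terms_by_db, synonyms):
    """Return (terms shared exactly, preferred labels shared only via synonyms)."""
    exact = defaultdict(set)    # surface form -> databases using it
    merged = defaultdict(set)   # preferred label -> databases using any of its forms
    forms = defaultdict(set)    # preferred label -> surface forms seen
    for db, terms in terms_by_db.items():
        for term in terms:
            form = term.strip().lower()
            preferred = synonyms.get(form, form)
            exact[form].add(db)
            merged[preferred].add(db)
            forms[preferred].add(form)
    exact_shared = {f for f, dbs in exact.items() if len(dbs) > 1}
    synonym_shared = {
        p for p, dbs in merged.items()
        if len(dbs) > 1 and not (forms[p] & exact_shared)
    }
    return exact_shared, synonym_shared


# Toy inputs: hypothetical databases and an assumed synonym mapping.
terms_by_db = {
    "db_a": ["Hippocampus", "Striatum"],
    "db_b": ["hippocampus", "Caudoputamen"],
    "db_c": ["Amygdala"],
}
synonyms = {"caudoputamen": "striatum"}
exact_shared, synonym_shared = count_matches(terms_by_db, synonyms)
```

In this toy case one term is shared exactly and one more is recovered only through the synonym table, mirroring the gap between the 42 and 99 reported on the slide.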
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in NeurolexWiki
[Chart: Number (axis 0 to 300) of total redlinks, easy fixes, and hard fixes for soma location, dendrite location, and axon location]
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]
Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two way validation of knowledge to data
Registry vs Federation: Metadata about resource vs metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Light weight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
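As a rough picture of what "sets of triples" means in practice, the sketch below stores a few subject-property-value statements and answers a pattern query over them. The neuron names, regions, and the `query` helper are hypothetical and do not reflect how Neurolex or Semantic MediaWiki store data internally:

```python
def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Hypothetical statements; names are placeholders, not Neurolex entries.
triples = {
    ("Neuron X", "soma location", "Region A"),
    ("Neuron X", "axon location", "Region B"),
    ("Neuron Y", "soma location", "Region A"),
}

# Which neurons have their soma in Region A?
in_region_a = query(triples, predicate="soma location", obj="Region A")
```

Each contributor edit adds or changes one statement, which is why a small, consistent set of properties is enough to support cross-page queries.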
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 world wide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
[Diagram: data set GSE13732 with flow labels Analyzed, Curated, Analyzed, Mirrored]
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data different analysis
bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID
bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine
bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis
bullResults for 1 gene were opposite in DRG and Gemma
bull45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
Where can I find validated antibodies against CART
1 100 10000 1000000 10000000010000000000
SOFTWARE
PROTOCOLS
PHENOTYPE
PATHWAYS
MULTIMEDIA
MOLECULE
MICROARRAY
IMAGES
GENE
DRUGS
DATASET
CLINICAL TRIALS
BRAIN ACTIVATION FOCI
ATLAS
ANNOTATION
All databases in the SciCrunchFederation become immediately available through More Resources
Breaking down silos Community enrichment
Itrsquos like a Mendeley for resources
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
[Screenshot labels: Facet; Filter; Source; Category; Index; Community; Community resources; SciCrunch data (all); Gene; Gemma: Gene, Organism, Expression level; Geo: Integrated Expression; Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain); legend bins: 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
"Human neuroimaging typically is performed on a whole brain basis. However for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature."
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein coding genes (Entrez Gene) vs # of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
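The binning behind a chart like this is simple to reproduce. A minimal sketch in Python, assuming a mapping from gene symbol to the number of antibody search results is already available; the function names and the gene names in the usage note are hypothetical, and no registry API is called.

```python
from collections import Counter

# Bin labels taken from the chart legend.
BIN_LABELS = ["0", "1-10", "11-100", "101-1000", "1001+"]


def bin_label(n_results):
    """Map a non-negative result count onto one of the legend bins."""
    if n_results < 0:
        raise ValueError("result counts cannot be negative")
    if n_results == 0:
        return "0"
    if n_results <= 10:
        return "1-10"
    if n_results <= 100:
        return "11-100"
    if n_results <= 1000:
        return "101-1000"
    return "1001+"


def coverage_histogram(results_per_gene):
    """Count how many genes fall into each bin; empty bins are reported as 0."""
    counts = Counter(bin_label(n) for n in results_per_gene.values())
    return {label: counts.get(label, 0) for label in BIN_LABELS}
```

For example, `coverage_histogram({"GENE_A": 0, "GENE_B": 7})` places one gene in the "0" bin and one in "1-10". The "0" bin is the interesting one for gap analysis, since it lists genes for which the index finds no antibody at all.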
"The Data Homunculus"
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
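The first two counts above (exact matches and synonym matches) can be sketched as set operations over per-database term lists. This is a minimal illustration in Python, assuming case and whitespace normalization for "exact" matches and a hand-made synonym table. The database names, terms, and synonym table are hypothetical, and this is not the procedure NIF used.

```python
from collections import defaultdict


def normalize(term):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def exact_shared_terms(databases):
    """Return normalized terms that appear in more than one database."""
    seen_in = defaultdict(set)
    for db_name, terms in databases.items():
        for term in terms:
            seen_in[normalize(term)].add(db_name)
    return {term for term, dbs in seen_in.items() if len(dbs) > 1}


def synonym_shared_terms(databases, synonyms):
    """Return concepts shared across databases only after mapping synonyms to a canonical name."""
    canonical = {normalize(k): normalize(v) for k, v in synonyms.items()}
    dbs_by_concept = defaultdict(set)
    for db_name, terms in databases.items():
        for term in terms:
            key = normalize(term)
            dbs_by_concept[canonical.get(key, key)].add(db_name)
    shared = {concept for concept, dbs in dbs_by_concept.items() if len(dbs) > 1}
    return shared - exact_shared_terms(databases)  # drop concepts already matched exactly
```

Partonomy matches would need a part-of hierarchy on top of this and are left out of the sketch.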
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia) but it is currently unknown exactly how many classes exist."
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
- Neuron Registry
• Neurolex.org
• Semantic MediaWiki
- > 30 experts worldwide
- Fill out neuron pages in NeurolexWiki
[Figure: bar chart of red links for soma location, dendrite location and axon location; y-axis: number (0 to 300); series: total red links, easy fixes, hard fixes]
Social networks and community sites let us learn things from the collective behavior of contributors: they show limits in our knowledge and our knowledge representations
[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]
We cannot shoe-horn everything into a single representation or system; instead we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
- Dr. Lucila Ohno-Machado, PI
- FORCE11: Community engagement piece
• What should a data discovery index do?
- Task Forces
- Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNET, 3DVC, FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
- Comprehensive resource registry
- UPCs for research resources
• Information framework
- Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
- Share expertise
- Metrics of trust
- Shared curation and upkeep
• Two way validation of knowledge to data
Registry vs Federation: Metadata about resource vs metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
- Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
- Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Diagram labels: Analyzed; Curated; GSE13732; Analyzed; Mirrored]
Stable identifiers and annotations allow us to "track" data as it moves
Data flows throughout the ecosystem: value is added
[Diagram labels: Analyzed; Curated; GSE13732; Analyzed; Mirrored]
But... even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
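Mapping the variant spellings onto one canonical form could look like the sketch below. It is written in Python under two assumptions: that the three spellings listed above denote the same GEO series, and that `GSE<number>` is chosen as the canonical form. The function name and pattern list are illustrative and do not describe NIF's actual pipeline.

```python
import re

# Assumed variant spellings of one GEO series accession, based on the examples above:
#   GSE13732, E-GEOD-13732, and GEO-prefixed forms such as GEOGSE13732 or GEO:GSE13732.
_VARIANT_PATTERNS = [
    re.compile(r"^(?:GEO:?)?GSE(\d+)$", re.IGNORECASE),
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),
]


def normalize_geo_series(raw):
    """Return the canonical 'GSE<number>' form, or None if the string is not recognized."""
    text = raw.strip()
    for pattern in _VARIANT_PATTERNS:
        match = pattern.match(text)
        if match:
            return "GSE" + match.group(1)
    return None
```

Once every federated source is passed through a function like this, records that describe the same data set can be joined on a single key, which is what makes tracking the data across resources possible.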
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
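Before two such result sets can be compared, they have to be expressed against the same reference condition. A minimal sketch in Python, assuming each database reduces to a mapping from gene to "up" or "down", and assuming the two analyses contrast the same pair of conditions with the reference swapped, so that flipping one side's direction is valid. The gene names in the usage note are placeholders, not results from either database.

```python
OPPOSITE = {"up": "down", "down": "up"}


def compare_directions(gemma, drg):
    """Compare direction-of-change calls for genes present in both result sets.

    gemma: gene -> direction relative to chronic morphine as baseline (assumed)
    drg:   gene -> direction relative to saline as baseline (assumed)
    """
    verdicts = {}
    for gene in sorted(gemma.keys() & drg.keys()):
        harmonized = OPPOSITE[gemma[gene]]  # re-express the Gemma call against saline
        verdicts[gene] = "consistent" if harmonized == drg[gene] else "opposite"
    return verdicts
```

With `gemma = {"GENE_A": "up"}` and `drg = {"GENE_A": "down"}`, the call is reported as consistent, because the apparent disagreement comes only from the choice of baseline.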
Chronic vs acute morphine in striatum
• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
bullgt200 articles have appeared to date
bullgt30 journals
bullData set being made available to community
bullgt 650 RRIDrsquos
bull~10 disappeared after copyediting
bull5 were in error
Database available at httpswwwforce11orgnode5635
C
Neurolex gt 1 million triples
Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph
This is your brain on computers
SciCrunch
SharedResources
Undiagnosed Disease Program
Phenotype RCN
One Mind for Research
Consortia-PediaFaster Cures
Model Organism Databases
Community Outreach
Shared curation shared expertiseResource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
FacetFilter
Source
Category
Index
Community Community
Community resources
SciCrunchdata (all)
Gene
Gemma
Gene OrganismExpression
level
GeoIntegrated Expression
Literature
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem Where are the data
Data Sources
Bringing knowledge to data Gap analysis
Forebrain
Midbrain
Hindbrain
0
1-10
11-100
gt101
Data Sources
Revealing biases in the dataspace
SW Oh et al Nature 000 1-8 (2014) doi101038nature13186
Adult mouse brain connectivity matrix revenge of the midbrain
The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed
bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
[Figure labels: location of cell soma; location of dendrites; location of local axon arbor]
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in NeuroLex Wiki
[Figure: bar chart of the number of red links (y-axis 0 to 300) for soma location, dendrite location, and axon location; series: total red links, easy fixes, hard fixes]
Social networks and community sites let us learn from the collective behavior of contributors; they show the limits in our knowledge and in our knowledge representations.
[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, and FORCE11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs. Federation: Metadata about resource vs. metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Stable identifiers and annotations allow us to “track” data as it moves
Data flows throughout the ecosystem; value is added
[Figure: data set GSE13732 shown as curated, analyzed, and mirrored in different resources]
But… even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
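The three spellings of the same accession above can be reduced to one canonical form with a small normalizer. This is a sketch under stated assumptions, not NIF's actual pipeline: it assumes the `GSE<number>` form is the canonical one, that an `E-GEOD-<number>` accession refers to the GEO series with the same number, and that a `GEO` prefix may or may not be followed by a separator. The function name `canonical_gse` is hypothetical.

```python
import re

# Assumed variants: "GSE13732", "E-GEOD-13732", and "GEO" + optional
# separator + "GSE13732". Anything else is left unrecognized.
_ACCESSION = re.compile(
    r"^(?:E-GEOD-|(?:GEO[:_\s]?)?GSE)(\d+)$",
    re.IGNORECASE,
)


def canonical_gse(raw):
    """Return the accession as 'GSE<number>', or None if unrecognized."""
    match = _ACCESSION.match(raw.strip())
    if match is None:
        return None
    return f"GSE{match.group(1)}"


# All three spellings from the slide collapse to a single key.
variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {canonical_gse(v) for v in variants}
```

Returning `None` for unrecognized strings, rather than guessing, leaves those records for manual review or text mining.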
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
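Because the two databases report change against different baselines, a comparison has to put both on the same convention before tallying agreement. The sketch below illustrates that step only; the gene names and calls are made up, the category names are hypothetical, and it assumes each statement reduces to a simple "up" or "down" call per gene.

```python
from collections import Counter


def flip(direction):
    """Invert an up/down call; leave any other value unchanged."""
    return {"up": "down", "down": "up"}.get(direction, direction)


def compare_calls(gemma_calls, drg_calls):
    """Tally agreement after aligning Gemma calls to the DRG baseline.

    Gemma calls are assumed to be relative to chronic morphine and DRG
    calls relative to saline, so Gemma directions are flipped first.
    """
    tally = Counter()
    for gene, gemma_direction in gemma_calls.items():
        drg_direction = drg_calls.get(gene)
        if drg_direction is None:
            tally["no_call"] += 1        # nothing to compare against
        elif flip(gemma_direction) == drg_direction:
            tally["consistent"] += 1
        else:
            tally["opposite"] += 1
    return tally


# Made-up example data, for illustration only.
gemma = {"geneA": "up", "geneB": "down", "geneC": "up"}
drg = {"geneA": "down", "geneB": "down"}
result = compare_calls(gemma, drg)
```

Without the flip, every truly consistent pair would be counted as a disagreement, which is the pitfall the slide describes.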
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study, such that they are:
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
– Software/databases
– Antibodies
– Genetically modified organisms
Launched February 2014; > 30 journals participating
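"Machine processible" means a program can pull the identifiers out of a methods section without human help. A minimal sketch follows, assuming identifiers are written with an `RRID:` prefix followed by letters, digits, underscores, or hyphens; the example sentence and the two identifiers in it are invented for illustration and do not refer to real resources.

```python
import re

# Assumed syntax: "RRID:" then an identifier made of letters, digits,
# underscores, colons, or hyphens. Real-world usage may vary.
_RRID = re.compile(r"\bRRID:\s?([A-Za-z0-9_:\-]+)")


def extract_rrids(text):
    """Return the identifiers found in text, in order, without repeats."""
    return list(dict.fromkeys(_RRID.findall(text)))


# Invented methods sentence with made-up identifiers.
methods = (
    "Sections were stained with an anti-GFAP antibody (RRID:AB_0000001) "
    "and images were analyzed with ImageTool (RRID:SCR_0000002). "
    "The same antibody (RRID:AB_0000001) was used for blots."
)
found = extract_rrids(methods)
```

A check like this could also be run before and after copyediting to catch identifiers that were dropped or altered along the way.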
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
Making use of community
[Screenshot labels: Facet/Filter; Source; Category; Index; Community resources; SciCrunch data (all); Gene; Gemma; Gene, Organism, Expression level; GEO Integrated Expression; Literature]
Brings expertise of community to understanding how to work with data
KNOWLEDGE TO DATA GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Bringing knowledge to data: Gap analysis
[Figure: brain regions (forebrain, midbrain, hindbrain) against data sources; legend bins: 0, 1-10, 11-100, >101]
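The legend bins in the gap-analysis figure can be reproduced by bucketing a count for each pairing of brain region and data source. This sketch assumes the counts are numbers of matching records and that the top bin covers everything above 100 (the legend reads ">101"); the source names and counts are hypothetical.

```python
def legend_bin(count):
    """Map a record count to one of the figure's legend bins."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"  # assumption: the legend's top bin means "above 100"


def gap_table(counts, regions, sources):
    """Build one row of legend bins per region, one column per source.

    Pairs that are absent from counts are treated as zero, which is
    what makes the gaps visible.
    """
    return {
        region: [legend_bin(counts.get((region, source), 0)) for source in sources]
        for region in regions
    }


# Hypothetical counts, for illustration only.
counts = {("forebrain", "source_a"): 250, ("midbrain", "source_a"): 3}
table = gap_table(
    counts,
    regions=["forebrain", "midbrain", "hindbrain"],
    sources=["source_a", "source_b"],
)
```

Rows or columns that come back as all "0" are the gaps: regions with no data, or sources that cover only part of the brain.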
Revealing biases in the dataspace
SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail
bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory
Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo
Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013
Importance of comprehensive indices For how many proteins are there antibodies
0
1-10
11-100
101-1000
1001+
Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg
AntibodyregistryorgTrish Whetzel and Anita Bandrowski
ldquoThe Data Homunculusrdquo
Data-Knowledge Mismatch
Dutowski et al 2013 Nature Biotechnology
The scourge of neuroanatomical nomenclature
bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)
bullTotal 1800 unique brain terms (excluding Avian)
bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-
How many neuron types are
there
NIH funding announcement BRAIN Initiative Transformative
Approaches for Cell-Type Classification in the Brain
ldquoThe mammalian brain contains a vast number of cells These cells are
generally grouped within broad classes (eg neurons or glia) but it is
currently unknown exactly how many classes existrdquo
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones Neurons and their properties
Analysis of Red Links in the Neuron Registry
bull INCF Project
ndash Neuron Registrybull Neurolexorg
bull Semantic MediaWiki
ndash gt 30 experts worldwide
ndash Fill out neuron pages in NeurolexWiki
Soma location
Dendrite location
Axon location
0
50
100
150
200
250
300
NumberTotal
redlinkseasy fixes
hard fixes
Soma location
Dendrite location
Axon location
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
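The baseline problem can be sketched in a few lines: when two sources report the same contrast from opposite baselines, one sign must be flipped before directions are compared. The gene names, values, and category labels below are hypothetical; this is not the analysis that produced the counts on the slide.

```python
from collections import Counter


def compare_direction(gemma_changes, drg_changes):
    """Tally agreement between two sources that use opposite baselines.

    gemma_changes: gene -> signed change, chronic morphine as baseline (assumed)
    drg_changes:   gene -> signed change, saline as baseline (assumed),
                   or None when the source gives too little information
    """
    tally = Counter()
    for gene, gemma_value in gemma_changes.items():
        drg_value = drg_changes.get(gene)
        if drg_value is None:
            tally["insufficient information"] += 1
            continue
        aligned = -gemma_value  # flip the sign so both use saline as baseline
        product = aligned * drg_value
        if product > 0:
            tally["consistent"] += 1
        elif product < 0:
            tally["opposite"] += 1
        else:
            tally["no change reported"] += 1
    return tally


# Toy input, made up for illustration.
gemma = {"GeneA": -1.2, "GeneB": 0.8, "GeneC": 0.5}
drg = {"GeneA": 0.9, "GeneB": 0.4, "GeneC": None}
result = compare_direction(gemma, drg)
print(dict(result))
```

For scale, the slide's own counts give 617 of 1370 statements, or roughly 45%, as consistent.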
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
– Software/databases
– Antibodies
– Genetically modified organisms
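"Machine processible" can be made concrete with a small sketch that pulls identifiers out of a methods paragraph. The regular expression assumes the `RRID:<prefix>_<local id>` shape; the sample sentence and the identifier numbers in it are placeholders invented for illustration, and real RRIDs may use additional forms that this pattern does not cover.

```python
import re

# Assumed shape: "RRID:" followed by a letter prefix, an underscore, and a local id.
RRID_PATTERN = re.compile(r"\bRRID:\s?([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text):
    """Return the distinct RRIDs found in text, in order of first appearance."""
    found = []
    for hit in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + hit.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


# Placeholder identifiers; they are not claimed to resolve to real resources.
methods = (
    "Sections were stained with an anti-NeuN antibody (RRID:AB_0000001) and "
    "images were processed with analysis software (RRID: SCR_0000002; RRID:AB_0000001)."
)
print(extract_rrids(methods))
```

A check like this, run before and after copyediting, is one way to notice identifiers that disappear during production.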
Launched February 2014; > 30 journals participating
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
Revealing biases in the dataspace
[Figure: number of data sources per brain region (forebrain, midbrain, hindbrain); legend bins: 0, 1-10, 11-100, >101]
SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
"Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature."
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]
Antibodyregistry.org
Trish Whetzel and Anita Bandrowski
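The binning behind a chart like this can be sketched as follows. The bin edges come from the legend above; the gene names and hit counts are hypothetical, and no call to antibodyregistry.org is made.

```python
from collections import Counter

# (low, high, label) bins taken from the chart legend; anything larger is "1001+".
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"), (101, 1000, "101-1000")]


def bin_label(count):
    """Map a non-negative search-result count to its legend bin."""
    if count < 0:
        raise ValueError("count must be non-negative")
    for low, high, label in BINS:
        if low <= count <= high:
            return label
    return "1001+"


def histogram(hits_per_gene):
    """Count how many genes fall into each bin."""
    return Counter(bin_label(n) for n in hits_per_gene.values())


# Made-up counts for illustration only.
hits = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 10, "GENE_D": 250, "GENE_E": 5000}
summary = histogram(hits)
print(dict(summary))
```

The "0" bin is the interesting one for a comprehensive index: it lists the genes for which no antibody record was found at all.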
"The Data Homunculus"
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
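The three kinds of match counted above (exact, synonym, first-order partonomy) can be sketched as a simple cascade. The vocabulary, synonym table, and part-of table below are a hypothetical toy example, and the order of the checks is an assumption rather than the method actually used for the NIF Connectivity comparison.

```python
def classify_match(term, other_terms, synonyms, part_of):
    """Classify how a term from one database relates to another database's terms.

    synonyms: lowercase term -> list of alternative names (toy table)
    part_of:  lowercase term -> its immediate parent structure (toy table)
    Returns "exact", "synonym", "partonomy", or None.
    """
    key = term.lower()
    others = {t.lower() for t in other_terms}
    if key in others:
        return "exact"
    if any(alt.lower() in others for alt in synonyms.get(key, ())):
        return "synonym"
    parent = part_of.get(key)
    if parent is not None and parent.lower() in others:
        return "partonomy"  # first-order: only the immediate parent is checked
    return None


# Toy data for illustration only.
other_db = ["Hippocampus", "Caudoputamen"]
synonym_table = {"caudate putamen": ["caudoputamen"]}
part_of_table = {"ca1": "hippocampus"}

for name in ["Hippocampus", "Caudate putamen", "CA1", "Tectum"]:
    print(name, "->", classify_match(name, other_db, synonym_table, part_of_table))
```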
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist."
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in NeuroLex Wiki
[Figure: bar chart of red links in the Neuron Registry by property (soma location, dendrite location, axon location); y-axis: number, 0 to 300; series: total red links, easy fixes, hard fixes]
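In a wiki, a red link points to a page that does not exist yet, so counting red links per property shows where contributors used terms the vocabulary lacks. A minimal sketch, with hypothetical page names and a hypothetical data layout rather than an actual NeuroLex export:

```python
from collections import Counter


def count_red_links(neuron_pages, existing_pages):
    """Count, per property, the link targets that have no page of their own."""
    red = Counter()
    for properties in neuron_pages.values():
        for prop, targets in properties.items():
            red[prop] += sum(1 for target in targets if target not in existing_pages)
    return red


# Toy data for illustration only.
existing = {"CA1 pyramidal layer", "Purkinje cell layer"}
pages = {
    "Neuron A": {
        "Soma location": ["CA1 pyramidal layer"],
        "Axon location": ["Subiculum", "Some unnamed zone"],
    },
    "Neuron B": {
        "Soma location": ["Purkinje cell layer"],
        "Dendrite location": ["Molecular layer"],
    },
}
red_links = count_red_links(pages, existing)
print(dict(red_links))
```

Sorting the missing targets into easy and hard fixes, as the chart does, would still need a curator's judgment; the sketch only produces the totals.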
Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
Atlases/Maps
Annotation
Claims, assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases, data sets
Literature
Search and Discovery
We cannot shoehorn everything into a single representation or system; instead, figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, FORCE11
BD2K-K2BD Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  - Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
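As a rough illustration of what "lightweight semantics: sets of triples" can mean in practice, the sketch below stores a few subject/predicate/object statements and looks one up. The property names ("is a", "soma located in", "neurotransmitter released") and the helper `objects_of` are assumptions made for this example; they are not taken from the Neurolex schema.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One subject/predicate/object statement."""
    subject: str
    predicate: str
    obj: str


# Hypothetical statements about one neuron type (property names are illustrative).
triples = [
    Triple("Cerebellum Purkinje cell", "is a", "Neuron"),
    Triple("Cerebellum Purkinje cell", "soma located in", "Purkinje cell layer"),
    Triple("Cerebellum Purkinje cell", "neurotransmitter released", "GABA"),
]


def objects_of(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


if __name__ == "__main__":
    print(objects_of(triples, "Cerebellum Purkinje cell", "neurotransmitter released"))
```

Because every statement has the same three-part shape, contributors can add terms and links without agreeing on a full schema first, which is the trade-off the slide points to.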
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g. NeuroElectro
Stable identifiers and annotations allow us to "track" data as it moves
Data flows throughout the ecosystem; value is added
[Figure labels: GSE13732; Analyzed; Curated; Mirrored]
But... even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
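A minimal sketch of what a "standard identifier format" step might look like for the three spellings above. It assumes that all three refer to the same data set and that the canonical form is the bare `GSE` accession; the patterns and the function name `normalize_series_id` are illustrative, not NIF's actual pipeline.

```python
import re
from typing import Optional

# Assumed spellings of the same series accession, based only on the three
# variants listed on the slide.
_PATTERNS = [
    re.compile(r"^(?:GEO[:_ ]?)?GSE(\d+)$", re.IGNORECASE),  # GSE13732, GEOGSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),            # E-GEOD-13732
]


def normalize_series_id(raw: str) -> Optional[str]:
    """Map a known spelling to the canonical 'GSE<digits>' form, else None."""
    token = raw.strip()
    for pattern in _PATTERNS:
        match = pattern.match(token)
        if match:
            return "GSE" + match.group(1)
    return None


if __name__ == "__main__":
    for variant in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
        print(variant, "->", normalize_series_id(variant))
```

Returning `None` for unrecognized strings, rather than guessing, leaves the inconsistent cases visible so they can be handled by curation or text mining.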
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
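The comparison above depends on putting both sources on the same baseline before counting agreement. The sketch below shows that step under stated assumptions: each source is reduced to a gene-to-direction mapping ("up" or "down"), one source is flipped because its baseline is reversed, and missing values are counted as insufficient information. The gene names and the function `compare` are hypothetical; this is not the procedure used in the analysis.

```python
from collections import Counter
from typing import Dict, Optional


def flip(direction: str) -> str:
    """Reverse a direction of change to account for an opposite baseline."""
    return {"up": "down", "down": "up"}[direction]


def compare(gemma: Dict[str, Optional[str]], drg: Dict[str, Optional[str]]) -> Counter:
    """Tally agreement after flipping Gemma's direction onto DRG's baseline."""
    tally = Counter()
    for gene, gemma_dir in gemma.items():
        drg_dir = drg.get(gene)
        if gemma_dir is None or drg_dir is None:
            tally["insufficient"] += 1
        elif flip(gemma_dir) == drg_dir:
            tally["consistent"] += 1
        else:
            tally["opposite"] += 1
    return tally


if __name__ == "__main__":
    # Hypothetical toy data, not taken from either database.
    gemma = {"geneA": "up", "geneB": "down", "geneC": "up", "geneD": "down"}
    drg = {"geneA": "down", "geneB": "down", "geneC": None}
    print(compare(gemma, drg))
```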
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are
  - Machine processible (i.e. unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  - Software/databases
  - Antibodies
  - Genetically modified organisms
Launched February 2014; > 30 journals participating
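To illustrate why a machine-processible, uniform identifier matters, here is a small sketch that pulls identifiers out of methods text. It assumes citations of the form `RRID:<prefix>_<accession>`; the regular expression, the function `extract_rrids`, and the identifiers in the sample sentence are made up for this example and do not resolve to real resources.

```python
import re
from typing import List

# Assumed citation shape: "RRID:" followed by a prefix, an underscore and an accession.
_RRID = re.compile(r"\bRRID:\s*([A-Za-z]+_\w+)")


def extract_rrids(text: str) -> List[str]:
    """Return the distinct identifiers cited in the text, in order of first appearance."""
    found: List[str] = []
    for match in _RRID.finditer(text):
        identifier = match.group(1)
        if identifier not in found:
            found.append(identifier)
    return found


if __name__ == "__main__":
    sample = (
        "Sections were stained with a primary antibody (RRID:AB_0000001) and "
        "analyzed with a software tool (RRID: SCR_0000002); see also RRID:AB_0000001."
    )
    print(extract_rrids(sample))
```

A check like this can only work when the identifier survives copyediting unchanged, which is one reason to ask for a single format across journals.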
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
"The Data Homunculus"
Data-Knowledge Mismatch
Dutowski et al., 2013, Nature Biotechnology
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  - Brain Architecture Management System (rodent)
  - Temporal lobe.com (rodent)
  - Connectome Wiki (human)
  - Brain Maps (various)
  - CoCoMac (primate cortex)
  - UCLA Multimodal database (human fMRI)
  - Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
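The counts above come from comparing the vocabularies of several databases. A minimal sketch of the exact-match and synonym-match steps follows, assuming each database is reduced to a list of term strings and that a synonym table maps variant names to one preferred name. The database names, the toy terms, and the functions `normalize` and `terms_in_multiple` are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


def normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def terms_in_multiple(
    vocabularies: Dict[str, Iterable[str]],
    canonical: Optional[Dict[str, str]] = None,
) -> Set[str]:
    """Return the terms that occur in more than one database.

    If `canonical` is given, each normalized term is first mapped to its
    preferred name, so synonyms count as the same term.
    """
    canonical = canonical or {}
    seen = defaultdict(set)
    for database, terms in vocabularies.items():
        for term in terms:
            key = normalize(term)
            seen[canonical.get(key, key)].add(database)
    return {term for term, databases in seen.items() if len(databases) > 1}


if __name__ == "__main__":
    # Toy vocabularies; not drawn from the seven databases listed above.
    vocab = {
        "db_a": ["Hippocampus", "CA1"],
        "db_b": ["hippocampus", "Ammon's horn"],
        "db_c": ["Cornu ammonis", "field CA1"],
    }
    synonyms = {"ammon's horn": "cornu ammonis"}
    exact = terms_in_multiple(vocab)
    with_synonyms = terms_in_multiple(vocab, synonyms)
    print("exact:", exact)
    print("synonym-only:", with_synonyms - exact)
```

In the toy data, "CA1" and "field CA1" still fail to match, which is the kind of gap that a partonomy or curated mapping would have to close.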
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g. neurons or glia) but it is currently unknown exactly how many classes exist."
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
  - Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in Neurolex Wiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location and axon location; axis from 0 to 300]
Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
AtlasesMaps
Annotation
Claims assertions
Registries
Derived data
Models and simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update
SciCrunch Creating a Data and Resource Discovery Environment
BD2K Creating a Data Discovery Index
bull BioCADDIE
ndash Dr Lucila Ohno-Machado PI
ndash FORCE11 Community engagement piece
bull What should a data discovery index do
ndash Task Forces
ndash Pilot projects
bull How should it be built httpbiocaddieorg
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe UCSD Co Investigator Interim PI
Amarnath Gupta UCSD Co Investigator
Anita Bandrowski NIF Project Leader
Gordon Shepherd Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen Washington University
Erin Reid
Paul Sternberg Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark Harvard University
Paolo Ciccarese
Karen Skinner NIH Program Officer (retired)
Jonathan Pollock NIH Program Officer
And my colleagues in Monarch dkNet 3DVC Force 11
BD2K-K2BD Data Discovery Index
bull Accounting of what is availablendash Comprehensive resource registry
ndash UPCrsquos for research resources
bull Information frameworkndash Major concepts contained in data but also accounting of what happens to
data as it flows through the ecosystem (provenance)
bull Community-based portals into shared data resourcesndash Share expertise
ndash Metrics of trust
ndash Shared curation and upkeep
bull Two way validation of knowledge to data
Registry vs Federation Metadata aboutresource vs metadatadata in database
With the thousands of databases and other information sources available simple descriptive metadata will not suffice
What have we learned Grabbing the long tail of small data
bull NIF is in a unique position to ask questions against the data resource landscape
bull The data space is not uniform
bull Data ldquoflowsrdquo from one resource to the next
ndash Data is reinterpreted reanalyzed or added to
bull Currently very difficult to track data as it moves across the landscape
ndash Makes it difficult to learn from combined efforts
Working with and extending ontologies Neurolexorg
httpneurolexorg Larson et al Frontiers in Neuroinformatics in press
bullSemantic MediWiki
bullProvide a simple interface for defining the concepts required
bullLight weight semantics-sets of triples
bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
bullCommunity based
bullAnyone can contribute their terms concepts things
bullAnyone can edit
bullAnyone can link
bullAccessible searched by Google
bullGrowing into a significant knowledge base for neuroscience
Demo D03
Neuron Lexicon Gauging the state of knowledge in neuroscience
bull Led by Dr Gordon Shepherd
bull gt 30 world wide experts
bull Simple set of properties
bull Consistent naming scheme
bull Integrated with Structural Lexicon
bull Used for annotation in other resources eg NeuroElectro
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves
Data flows throughout the ecosystemvalue is added
Analyzed
Curated
GSE13732
Analyzed
Mirrored
Buthellipeven our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources text mining to deal with inconsistencies
Same data different analysis
bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID
bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine
bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis
bullResults for 1 gene were opposite in DRG and Gemma
bull45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use
These resources themselves need to be citable
Resource Identification Initiative Linking resources to literature
bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique
identifier that resolves to a single resource)
ndash Outside of the paywallndash Uniform across journals and
publishers
bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms
Launched February 2014 gt 30 journals participating
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”
Transition Zones: Neurons and their properties
[Figure: panels labeled Location of cell soma, Location of dendrites, Location of local axon arbor]
Analysis of Red Links in the Neuron Registry
• INCF Project
  - Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in NeurolexWiki
[Figure: bar chart, Number (0 to 300) for Soma location, Dendrite location, and Axon location; series: Total, red links, easy fixes, hard fixes]
Social networks and community sites let us learn things from the collective behavior of contributors: they show limits in our knowledge and our knowledge representations
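In a wiki, a red link is a link to a page that does not yet exist, so counting red links per property gives a rough measure of where contributors used terms the knowledge base lacks. The sketch below assumes a simple in-memory representation. The page structure, the function `count_red_links`, and the sample titles are all hypothetical and are not taken from the Neuron Registry itself.

```python
from collections import defaultdict


def count_red_links(neuron_pages, existing_titles):
    """Count, per property, how many filled-in values point to a missing page."""
    counts = defaultdict(lambda: {"total": 0, "red_links": 0})
    for page in neuron_pages:
        for prop, values in page["properties"].items():
            for value in values:
                counts[prop]["total"] += 1
                if value not in existing_titles:
                    counts[prop]["red_links"] += 1
    return dict(counts)


# Placeholder pages and titles for illustration only.
pages = [
    {
        "title": "Neuron A",
        "properties": {
            "soma location": ["Region 1"],
            "axon location": ["Region 2", "Region 3"],
        },
    }
]
known_titles = {"Region 1", "Region 2"}
summary = count_red_links(pages, known_titles)
```

Splitting the red links further into easy and hard fixes, as the chart does, would need a curator's judgment and is not attempted here.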
[Figure labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]
We cannot shoehorn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will be continually updated.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: Community engagement piece
• What should a data discovery index do?
  - Task Forces
  - Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
BD2K-K2BD Data Discovery Index
• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data
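The provenance idea above, an accounting of what happens to a data set as it moves between resources, can be sketched as a small event log keyed on a stable identifier. The class names, fields, and resource names below are hypothetical; only the action labels (analyzed, curated, mirrored) and the data set identifier come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ProvenanceEvent:
    dataset_id: str  # stable identifier, e.g. "GSE13732"
    resource: str    # resource where the event happened (placeholder names below)
    action: str      # "analyzed", "curated" or "mirrored"


@dataclass
class ProvenanceLog:
    events: List[ProvenanceEvent] = field(default_factory=list)

    def record(self, dataset_id: str, resource: str, action: str) -> None:
        """Append one event to the log."""
        self.events.append(ProvenanceEvent(dataset_id, resource, action))

    def history(self, dataset_id: str) -> List[Tuple[str, str]]:
        """Return (resource, action) pairs for one data set, in recorded order."""
        return [(e.resource, e.action) for e in self.events if e.dataset_id == dataset_id]


log = ProvenanceLog()
log.record("GSE13732", "resource_a", "mirrored")
log.record("GSE13732", "resource_b", "curated")
log.record("GSE13732", "resource_b", "analyzed")
log.record("OTHER_DATASET", "resource_a", "mirrored")
trail = log.history("GSE13732")
```

The log only works if every resource refers to the data set by the same identifier, which is the point the slides make about stable identifiers.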
Registry vs Federation: Metadata about resource vs metadata/data in database
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
What have we learned? Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  - Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
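“Sets of triples” means each fact is stored as a subject, a predicate, and an object. The sketch below shows the idea with a plain list and a wildcard match. The `Triple` type, the `match` helper, and the sample facts are hypothetical and do not reflect how Semantic MediaWiki stores or queries data internally.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples that agree with every argument that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Placeholder facts for illustration only.
facts = [
    Triple("Neuron A", "has soma location", "Region 1"),
    Triple("Neuron A", "has axon location", "Region 2"),
    Triple("Neuron B", "has soma location", "Region 1"),
]
in_region_1 = match(facts, predicate="has soma location", obj="Region 1")
```

Leaving any argument as `None` turns it into a wildcard, so the same function answers both "what do we know about Neuron A?" and "which neurons have a soma in Region 1?".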
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Data flows throughout the ecosystem; value is added
[Figure: dataset GSE13732 as it is analyzed, curated, and mirrored across resources]
Stable identifiers and annotations allow us to “track” data as it moves
But… even our standards need standards
GSE13732
E-GEOD-13732
GEOGSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
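A standard identifier format can be sketched as a small normalizer that maps each spelling shown above onto one canonical form. This is an illustration only: the function name `normalize_geo_series`, the choice of `GSE<number>` as the canonical form, and the pattern list are assumptions, not NIF's actual pipeline.

```python
import re

# Patterns for the three spellings shown on the slide. The third also
# accepts an optional colon ("GEO:GSE13732"), which is an assumption.
_GEO_SERIES_PATTERNS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),        # GEO's own accession
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),    # mirrored-copy style
    re.compile(r"^GEO:?GSE(\d+)$", re.IGNORECASE),   # prefixed style
]


def normalize_geo_series(identifier):
    """Return the canonical 'GSE<number>' form, or None if unrecognized."""
    cleaned = identifier.strip()
    for pattern in _GEO_SERIES_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return "GSE" + match.group(1)
    return None


if __name__ == "__main__":
    for raw in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
        print(raw, "->", normalize_geo_series(raw))
```

Returning `None` for anything unrecognized, rather than guessing, leaves the inconsistent cases visible so they can be routed to text mining or manual curation.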
Same data, different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG presented them with respect to saline, so the direction of change is opposite in the 2 databases
Chronic vs. acute morphine in striatum
• Analysis
  – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  – Results for 1 gene were opposite in DRG and Gemma
  – 45 did not have enough information provided in the paper to make a judgment
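The comparison above depends on first putting both sources on the same baseline. The sketch below shows that step on toy data: it flips one source's direction calls before comparing, then computes the consistent share from the counts on the slide. The gene labels, the `up`/`down` encoding, and the function names are hypothetical; this is not the analysis code that was used.

```python
def flip(direction):
    """Invert a direction-of-change call; other values pass through."""
    return {"up": "down", "down": "up"}.get(direction, direction)


def compare_calls(source_a, source_b, a_baseline_reversed=True):
    """Count consistent, opposite, and unmatched calls between two sources.

    If source_a reports change relative to the opposite baseline, its
    calls are flipped first so both sources describe the same contrast.
    """
    summary = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, a_dir in source_a.items():
        b_dir = source_b.get(gene)
        if b_dir is None:
            summary["missing"] += 1
            continue
        aligned = flip(a_dir) if a_baseline_reversed else a_dir
        key = "consistent" if aligned == b_dir else "opposite"
        summary[key] += 1
    return summary


# Toy placeholder data, not real expression results.
gemma_like = {"geneA": "up", "geneB": "down", "geneC": "up"}
drg_like = {"geneA": "down", "geneB": "down"}

if __name__ == "__main__":
    print(compare_calls(gemma_like, drg_like))
    # Share of the 1370 statements that were consistent (617 on the slide).
    print(round(617 / 1370 * 100, 1), "% consistent")
```

Skipping the alignment step would make agreeing results look like disagreements, which is why the baseline has to be recorded alongside each statement.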
NIF is working to make it easier to find where data has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study, such that they are
  – Machine processable (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014; > 30 journals participating
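"Machine processable" means a program can pull the identifiers out of a methods section without human help. The sketch below assumes a simplified `RRID:<prefix>_<accession>` shape; the regular expression is an approximation that will miss other identifier styles, and the identifiers in the sample text are made-up placeholders, not real registry entries.

```python
import re

# Simplified pattern: "RRID:", optional space, letters, underscore, then
# letters or digits. This is an assumption, not the official syntax.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text):
    """Return the identifiers found in text, in order, without duplicates."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = match.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


if __name__ == "__main__":
    methods = (
        "Sections were stained with an antibody (RRID:AB_0000001) and "
        "images were processed in software (RRID:SCR_0000002)."
    )
    print(extract_rrids(methods))
```

Because the identifier sits in the article text itself, this kind of extraction works outside the paywall wherever the methods text is accessible.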
What studies have used
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error
Database available at https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph
This is your brain on computers