RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
Transcript of RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
The Neuroscience Information Framework
Maryann E. Martone, Ph.D., University of California, San Diego
We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration
Whole brain data (20 µm microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases
A data integration problem
• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
How many resources are there?
•NIF Registry: A catalog of neuroscience-relevant resources
• > 10,000 currently listed
• > 2500 databases
• And we are finding more every day
June 10, 2013
But we have Google!
• Current web is designed to share documents
– Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
Which databases do you use?
• Mouse Genome Database
• Allen Brain Atlas
• ClinicalTrials.gov
• PubMed
• dbGaP
• GEO
• NIH RePORTER
• OMIM
• Bionumbers
– a database of numerical values extracted from literature
• Epigenomics
– human epigenomic data to catalyze basic biology and disease-oriented research
• Antibody Registry
– 2M antibodies
• BioGrid
– an interaction repository of protein and genetic interactions
Most resources are largely unknown and underutilized
NIF: A New Type of Entity for New Modes of Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• NIF Registry
– Requires no special skills
– Site map available for local hosting
• NIF Data Federation
– DISCO interop
– Requires some programming skill
– Open Source Brain < 2 hr
Two tiered system: low barrier to entry
NIF searches across 3 main indices: Registry, Federation and Literature
Data Federation: 200 databases / 400M records
Registry: 6300 resources (2500 databases)
Literature: 22 million articles
What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
NIF makes it easier to browse different databases
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
Making it easier to access and understand distributed databases
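The synonym-based query expansion shown on this slide can be sketched roughly as follows. This is only an illustration, not NIF's actual implementation: the SYNONYMS table, the quote helper and the expand_query function are hypothetical names, and the real vocabulary would come from the NIFSTD ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table (assumption); NIF draws synonyms from its ontologies.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: dict = SYNONYMS) -> str:
    """Return a Boolean OR query covering the term and its known synonyms."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


if __name__ == "__main__":
    print(expand_query("Hippocampus"))
```

A term with no entry in the table passes through unchanged, so the expansion never narrows what the user typed.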
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
[Diagram: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF capitalizes on the growing set of community ontologies available in biomedical science
NIF Concept Mapper: Reducing false positives
Is there a framework for neuroscience?
• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• When NIF combines multiple sources, a set of common fields emerges
– Basic information models/semantic models exist for certain types of entities
Biomedical science does have a conceptual framework
[Diagram labels: Purkinje Cell, Axon Terminal, Axon, Dendritic Tree, Dendritic Spine, Dendrite, Cell body, Cerebellar cortex]
Bringing knowledge to data: Ontologies as framework
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
CNeurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
• Incorporate basic neuroscience knowledge into search
– Google: searches for string “GABAergic neuron”
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
NIF Concept-Based Search
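One way to picture the difference between string search and concept-based search is a walk over an is-a hierarchy. The sketch below is an assumption-laden illustration, not NIF's code: SUBCLASSES is a tiny hypothetical hierarchy (the grouping label "Cerebellar GABAergic interneuron" is invented for the example) and expand_concept is a hypothetical helper name.

```python
from collections import deque

# Hypothetical is-a hierarchy (assumption): parent term -> direct subtypes.
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Cerebellar GABAergic interneuron"],
    "Cerebellar GABAergic interneuron": ["Basket cell", "Stellate cell"],
}


def expand_concept(term: str, subclasses: dict = SUBCLASSES) -> list:
    """Return the term plus all of its descendants, in breadth-first order."""
    seen = set()
    ordered = []
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against terms reachable by more than one path
        seen.add(current)
        ordered.append(current)
        queue.extend(subclasses.get(current, []))
    return ordered


if __name__ == "__main__":
    # A plain string search would use only the first entry of this list.
    for name in expand_concept("GABAergic neuron"):
        print(name)
```

Each returned label can then be submitted as a search term, so records that mention only "Purkinje cell" are still found by a query for GABAergic neurons.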
Neuroscience Information Framework – http://neuinfo.org
Ontologies as a data integration framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
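The exact-match and synonym-match counts above come from comparing term lists across databases. A minimal sketch of that comparison, under stated assumptions: the term sets in DATABASE_TERMS and the SYNONYMS mapping are made up for illustration, shared_terms is a hypothetical helper, and partonomy (part-of) matching is not covered.

```python
from collections import Counter

# Hypothetical term lists (assumption); the real databases hold ~1800 unique terms.
DATABASE_TERMS = {
    "BAMS": {"hippocampus", "dentate gyrus"},
    "CoCoMac": {"area 17", "amygdala"},
    "Connectome Wiki": {"ammon's horn", "amygdala"},
}

# Hypothetical synonym -> preferred label mapping (assumption).
SYNONYMS = {"ammon's horn": "hippocampus"}


def shared_terms(db_terms: dict, synonyms: dict = None) -> set:
    """Return the terms that occur in more than one database.

    If a synonym mapping is given, each term is first replaced by its
    preferred label, so synonym matches are counted as well.
    """
    synonyms = synonyms or {}
    counts = Counter()
    for terms in db_terms.values():
        # Normalize inside each database first so a term counts once per source.
        counts.update({synonyms.get(t, t) for t in terms})
    return {term for term, n in counts.items() if n > 1}


if __name__ == "__main__":
    print(sorted(shared_terms(DATABASE_TERMS)))
    print(sorted(shared_terms(DATABASE_TERMS, SYNONYMS)))
```

In this toy data the synonym mapping adds "hippocampus" to the shared set, which mirrors how synonym matching raises the overlap beyond exact string matches.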
[Figure legend: 0; 1-10; 11-100; >101]
Open World-Closed World: Mapping the knowledge - data space
Data Sources
NIF lets us ask: where isn’t there data? What isn’t studied? Why?
[Figure axis labels: Forebrain, Midbrain, Hindbrain; legend: 0; 1-10; 11-100; >101]
The data space is not uniform
Data Sources
“The Data Homunculus”
Funding drives representation in the data space
What can we learn from the NIF Registry?
NIF supports a semantic model for describing research resources
Resource Curation
• NIF Registry is hosted on Semantic MediaWiki platform Neurolex
– Community can add, review, edit without special privileges
– Searchable by Google
– Integrated with NIF ontologies
– Graph structure
http://neurolex.org
Can we mine relationships between resources?
http://neuinfo.org
NIF semantic graph of research resources
Text mining gives a picture of the most used resources
PDB
http://force11.org/Resource_identification_initiative
• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
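The automated check for stale resources could look something like the sketch below. It is an assumption, not the NIF pipeline: the keyword and year patterns, the 40-character window, and the helper names last_update_year, is_stale and stale_percentage are all hypothetical, and the comparison works at year granularity only.

```python
import re

# Assumed cues that a date follows; real pages vary widely.
KEYWORD = re.compile(r"last updated|copyright|©", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def last_update_year(page_text: str):
    """Return the latest year found shortly after an update/copyright cue, or None."""
    years = []
    for match in KEYWORD.finditer(page_text):
        # Look only at the text right after the cue (window size is an assumption).
        window = page_text[match.end():match.end() + 40]
        years.extend(int(y) for y in YEAR.findall(window))
    return max(years) if years else None


def is_stale(page_text: str, reference_year: int, max_age_years: int = 2):
    """True if the page looks older than max_age_years; None if no date was found."""
    year = last_update_year(page_text)
    if year is None:
        return None
    return reference_year - year > max_age_years


def stale_percentage(stale: int, total: int) -> int:
    """Share of stale resources as a rounded percentage."""
    return round(100 * stale / total)


if __name__ == "__main__":
    print(is_stale("Page last updated: March 2010", reference_year=2013))
    print(stale_percentage(373, 570))  # the slide's 373 of 570 resources
```

Returning None for pages without a recognizable date keeps "unknown" separate from "up to date", which is why a manual review of a sample is still useful alongside the automated pass.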
Tracking digital resources since 2008
NIF helps stabilize the dynamic resource landscape
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
What can we learn from the NIF Data Federation?
NIF supports a semantic model for describing research resources
[Chart: Data Federation Growth, Jun-08 to May-13; axes: Number of Federated Records (Millions) and Number of Federated Databases]
NIF searches the largest collation of neuroscience-relevant data on the web
DISCO
What do you mean by data?
Databases come in many shapes and sizes
• Primary data
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
What have we learned: Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts
NIF is trying to make it easier to work with diverse data
Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
– Effective search across semantically diverse sources
• NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics
– Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services
NIF provides a strategy and set of tools applicable to all biomedical science
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11