Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in
NIAID Bioinformatics Resource Centers
Richard H. Scheuermann, Ph.D.
Department of Pathology
Division of Biomedical Informatics
U.T. Southwestern Medical Center
Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org www.fludb.org
High Throughput Sequencing
• Enabling technology
– Epidemiology of outbreaks
– Pathogen evolution
– Host range restriction
– Genetic determinants of virulence and pathogenicity
• Metadata requirements
– Temporal-spatial information about isolates
– Selective pressures
– Host species of specimen source
– Disease severity and clinical manifestations
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development
GSC-BRC Metadata Standards Working Group
• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
• Develop metadata standards for pathogen isolate sequencing projects
• Bottom up approach
• Assemble into a semantic framework
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), and expected syntax
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields
• Compare, harmonize, map to other relevant initiatives, including OBI, MIGS, MIMARKS, BioProjects, BioSamples (ongoing)
• Assemble all metadata fields into a semantic network
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
• Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
• Beta test version 1.0 standard with new white paper projects, collecting feedback
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
– definitions
– synonyms
– allowed value sets, preferably using controlled vocabularies
– expected syntax
– examples
– data categories
– data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
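The per-field attributes listed above (definition, synonyms, allowed value set, expected syntax, example, data category, data provider) map naturally onto a small record type. The following is a minimal sketch, not the working group's actual schema: the class name `FieldSpec`, the two example fields, their vocabularies, and the date pattern are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """One metadata field with the attributes named on the slide."""
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None          # expected syntax, as a regex
    example: str = ""
    data_category: str = ""
    data_provider: str = ""

    def validate(self, value: str) -> list:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} not in allowed value set")
        if self.syntax is not None and not re.fullmatch(self.syntax, value):
            problems.append(f"{self.name}: {value!r} does not match expected syntax")
        return problems


# Hypothetical field definitions, for illustration only.
COLLECTION_DATE = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["date collected", "sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
    data_category="specimen isolation",
    data_provider="specimen collector",
)
HOST_SPECIES = FieldSpec(
    name="host_species",
    definition="Species of the organism the specimen was taken from",
    synonyms=["host", "specimen source species"],
    allowed_values={"Homo sapiens", "Gallus gallus", "Sus scrofa"},
    example="Homo sapiens",
    data_category="specimen source",
    data_provider="specimen collector",
)


def validate_row(row: dict, specs: list) -> list:
    """Check one submission-spreadsheet row against every field spec."""
    problems = []
    for spec in specs:
        if spec.name not in row:
            problems.append(f"{spec.name}: missing")
        else:
            problems.extend(spec.validate(row[spec.name]))
    return problems
```

A row from a submission spreadsheet could then be checked with `validate_row(row, [COLLECTION_DATE, HOST_SPECIES])`, which returns an empty list when every field is present and well formed.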
Specimen Isolation (semantic network diagram)
Entities: organism, organism part, environmental material, environment, equipment, person, name, affiliation, specimen source role, specimen capture role, specimen collector role, specimen X, ID, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location, hypothesis, IRB/IACUC approval, organism pathogenic disposition, comments
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_disposition, has_affiliation, has_authorization, denotes, located_in, instance_of, is_about
Numbered labels (v2 to v19, b18 to b30) appear on the nodes of the original figure.
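The relations in the Specimen Isolation network can be written down as subject-relation-object triples. The sketch below uses relation and type names from the diagram, but the exact wiring of the original figure is not recoverable from this transcript, so the particular links are a plausible reading rather than the author's model. Instance names such as `isolation_001` and `specimen_001` are hypothetical.

```python
# Each statement is a (subject, relation, object) triple.
# Relation names come from the diagram; instance names are hypothetical.
TRIPLES = [
    ("isolation_001", "instance_of", "specimen isolation procedure type"),
    ("isolation_001", "has_input", "organism_001"),
    ("isolation_001", "has_output", "specimen_001"),
    ("isolation_001", "has_specification", "isolation_protocol_A"),
    ("isolation_001", "located_in", "region_001"),
    ("organism_001", "plays", "specimen source role"),
    ("person_001", "plays", "specimen collector role"),
    ("specimen_001", "instance_of", "specimen type"),
    ("specimen_ID_S1", "denotes", "specimen_001"),
]


def objects(subject, relation, triples=TRIPLES):
    """All objects linked from `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]


def subjects(relation, obj, triples=TRIPLES):
    """All subjects linked to `obj` by `relation`."""
    return [s for s, r, o in triples if r == relation and o == obj]
```

For example, `objects("isolation_001", "has_output")` looks up what the isolation procedure produced, and `subjects("denotes", "specimen_001")` finds the identifier that denotes that specimen.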
Metadata Processes (process overview diagram)
Panels: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing
Entities: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, temporal-spatial region, type, ID, qualities
Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of
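Because each process in the Metadata Processes overview is linked to the next by has_output and has_input, a sequence data record can in principle be traced back to its specimen source. The sketch below assumes a simple linear chain built from the process names on the slide; the real diagram has more inputs per step (reagents, technician, equipment), and the function name `trace_back` is hypothetical.

```python
# A simplified, linear reading of the process chain on the slide.
PROCESSES = [
    {"name": "specimen isolation process",
     "has_input": ["specimen source"], "has_output": ["specimen"]},
    {"name": "sample processing",
     "has_input": ["specimen"], "has_output": ["enriched NA sample"]},
    {"name": "sequencing assay",
     "has_input": ["enriched NA sample"], "has_output": ["primary data"]},
    {"name": "data transformation",
     "has_input": ["primary data"], "has_output": ["sequence data"]},
    {"name": "data archiving process",
     "has_input": ["sequence data"], "has_output": ["sequence data record"]},
]


def trace_back(item, processes=PROCESSES):
    """Follow has_output/has_input links backwards from `item`.

    Returns (process names, latest first; the original starting material).
    Only the first input of each process is followed, which is a
    simplification made for this sketch.
    """
    producers = {out: p for p in processes for out in p["has_output"]}
    chain = []
    current = item
    while current in producers:
        process = producers[current]
        chain.append(process["name"])
        current = process["has_input"][0]
    return chain, current
```

Calling `trace_back("sequence data record")` walks from the archived record through archiving, data transformation, sequencing and sample processing back to the specimen isolation process and its specimen source.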
Generic Assay (semantic network diagram)
Entities: assay X, assay type, assay protocol, run ID, objectives, analyte X, input sample material X, sample material X, sample ID, sample type, target role, material X, lot #, reagent type, reagent role, person X, name, species, technician role, equipment X, serial #, equipment type, signal detection role, quality x, primary data, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location
Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of, is_about
Generic Material Transformation (semantic network diagram)
Entities: material transformation X, material transformation type, material transformation protocol, run ID, objectives, sample material X, sample ID, sample type, target role, material X, lot #, reagent type, reagent role, person X, name, species, technician role, equipment X, serial #, equipment type, signal detection role, output material X, material type, quality x, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location
Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of
Generic Data Transformation (semantic network diagram)
Entities: data transformation X, data transformation type, run ID, input data, output data, material X, algorithm, software, person X, name, data analyst role, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location
Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, located_in, instance_of
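The Generic Assay, Generic Material Transformation and Generic Data Transformation diagrams share one shape: a process with a type, a run ID, a specification, inputs that play roles, outputs, and a temporal-spatial region. A minimal sketch of that shared shape follows. The class names `Participant` and `GenericProcess` and all example values are hypothetical; only the role and relation names are taken from the diagrams.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    identifier: str  # what denotes it: a name, sample ID, lot # or serial #
    role: str        # the role it plays, e.g. "technician role"
    kind: str        # the type it is an instance_of


@dataclass
class GenericProcess:
    run_id: str                 # denotes the process
    process_type: str           # instance_of
    specification: str          # has_specification (protocol or algorithm)
    inputs: list = field(default_factory=list)   # has_input
    outputs: list = field(default_factory=list)  # has_output
    temporal_spatial_region: str = ""            # located_in

    def input_roles(self):
        """Distinct roles played by the inputs, sorted by name."""
        return sorted({p.role for p in self.inputs})


# The same structure reused for two kinds of process (example values only).
assay = GenericProcess(
    run_id="run-001",
    process_type="sequencing assay",
    specification="assay protocol A",
    inputs=[
        Participant("sample-17", "target role", "enriched NA sample"),
        Participant("lot-4432", "reagent role", "sequencing reagent"),
        Participant("J. Doe", "technician role", "Homo sapiens"),
        Participant("serial-0099", "signal detection role", "sequencer"),
    ],
    outputs=["primary data"],
)
data_transformation = GenericProcess(
    run_id="run-002",
    process_type="assembly",
    specification="assembly algorithm B",
    inputs=[Participant("J. Doe", "data analyst role", "Homo sapiens")],
    outputs=["sequence data"],
)
```

Reusing one structure for every process type is the kind of commonality the semantic representation is meant to expose: project-specific processes add participants and roles without changing the shape.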
Generic Material (IC) (semantic network diagram)
Entities: material X, ID, material type, quality x, material Y, material Z, quality y, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location
Relations: has_part, has_quality, denotes, instance_of, located_in
Conclusions
• Metadata standards for microorganism sequencing projects
• Bottom up approach focuses standard on important features
• Two flavors of “minimum information” – MIBBI vs. dbMIBBI
– Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• Utility of semantic representation
– Identified gaps in data field list (e.g. temporal components)
– Includes logical structure for other, project-specific, data fields - extensible
– Identified gaps in ontology data standards (use case-driven standard development)
– Identified commonalities in data structures (reusable)
– Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
– Sequencing => “omics”
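As one illustration of what a semantic query with inference could look like, the sketch below treats located_in as transitive, so that an isolation recorded at city level is also found by a country-level query. Treating located_in as transitive is an assumption of this sketch, and the place names, instance names and function names are hypothetical.

```python
# located_in links, child -> parent. All values are illustrative only.
LOCATED_IN = {
    "isolation_001": "Dallas",
    "isolation_002": "Hanoi",
    "Dallas": "Texas",
    "Texas": "United States",
    "Hanoi": "Vietnam",
}


def is_located_in(entity, place, located_in=LOCATED_IN):
    """True if `place` is reachable from `entity` via located_in links."""
    seen = set()
    current = entity
    while current in located_in and current not in seen:
        seen.add(current)  # guard against accidental cycles in the data
        current = located_in[current]
        if current == place:
            return True
    return False


def isolations_in(place, located_in=LOCATED_IN):
    """All isolation events inferred to lie within `place`."""
    return sorted(
        e for e in located_in
        if e.startswith("isolation_") and is_located_in(e, place, located_in)
    )
```

Here `isolations_in("United States")` finds `isolation_001` even though the only recorded location for it is the city.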