VectorBase PopBio Introduction
NIH/NIAID VectorBase site visit, March 2015
What is PopBio?
Flexible database for sample and assay metadata for field- or lab-derived population biology data.
● collection event & location (GeoData)
● basic sample information
● assays
  o species identification
  o phenotypes (host species [e.g. from blood meal], insecticide resistance, ...)
  o genotypes
  o manipulations (sampleA+sampleB->sampleC)
What is it for?
Allows integration of individual studies (e.g. insecticide resistance studies conducted in individual countries).
Enables meta-analysis of community data.
Data sources
Legacy:
● IRbase
● UC Davis/UCLA (but updates planned)
Recent:
● Bulk imports (e.g. Malaria Atlas Project surveillance data)
● Publications (typically with extra data direct from authors)
● MalariaGen & 16 Anopheles
● Other unpublished/in progress
Future data sources
● ICEMRs
● National/international IR surveillance
● MalariaGen
● Partners (Vestergaard, Oxford University MAP)
Smaller published and unpublished datasets
Data model
GMOD Chado schema
Heavy reliance on CVs/ontologies → flexibility → computability
Vastly oversimplified explanation of schema: projects have samples have assays have results
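The "projects have samples have assays have results" chain can be pictured as nested containers. The sketch below is only an illustration of that hierarchy in Python: the class names, fields, and the placeholder IDs are assumptions made for this example and do not reflect the actual Chado tables or the Perl API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    term: str        # ontology term name, e.g. a species
    accession: str   # ontology accession for that term


@dataclass
class Assay:
    id: str
    type: str
    results: List[Result] = field(default_factory=list)


@dataclass
class Sample:
    id: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Project:
    id: str
    samples: List[Sample] = field(default_factory=list)

    def count_results(self) -> int:
        """Walk project -> samples -> assays -> results and count the leaves."""
        return sum(len(a.results) for s in self.samples for a in s.assays)


# Placeholder IDs (hypothetical); the ontology term is borrowed from the
# example JSON elsewhere in these slides.
project = Project(
    id="project-1",
    samples=[
        Sample(
            id="sample-1",
            assays=[
                Assay(
                    id="assay-1",
                    type="species identification assay",
                    results=[Result("Anopheles arabiensis", "VBsp:0002224")],
                )
            ],
        )
    ],
)
print(project.count_results())
```

Because every result is an ontology term rather than free text, the same traversal works for any assay type, which is where the flexibility and computability come from.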
Ontologies
VectorBase ontologies: insecticide resistance, malaria, dengue & anatomy
Third party ontologies: sample properties, genomic variation types, placenames, phenotypic qualities
Curation and data import
ISA-Tab spreadsheet format
Investigation - Study - Assay
Widely used for 'omics metadata
Ontology-based annotation is well supported
Ontology term suggestion tools available in Google Spreadsheets
Challenges
● consistent representation of data and choice of ontology terms by curator(s) through time
● too complex for casual submitters
ISA-Tab's Study and its associated list of samples maps to PopBio's project and samples, while Assay maps to… assay!
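As a rough illustration of this mapping, the sketch below reads two tiny tab-delimited tables and groups assays under their samples. The column headers (Source Name, Sample Name, Characteristics[...], Assay Name, Protocol REF) follow ISA-Tab conventions, but the rows, the study ID, and the output structure are made up for this example; the real loader is the Perl API, not this code.

```python
import csv
import io

# Hypothetical, heavily trimmed ISA-Tab tables (tab-delimited text).
STUDY_SAMPLES = (
    "Source Name\tSample Name\tCharacteristics[sex]\n"
    "collection-1\tsample-A\tfemale\n"
    "collection-1\tsample-B\tmale\n"
)
SPECIES_ASSAYS = (
    "Sample Name\tAssay Name\tProtocol REF\n"
    "sample-A\tsample-A.species\tSPECIES_PCR\n"
    "sample-B\tsample-B.species\tSPECIES_PCR\n"
)


def read_table(text):
    """Parse tab-delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_project(study_id, sample_rows, assay_rows):
    """Study -> project, study sample rows -> samples, assay rows -> assays."""
    samples = {
        row["Sample Name"]: {"sex": row["Characteristics[sex]"], "assays": []}
        for row in sample_rows
    }
    for row in assay_rows:
        # Each assay row points back at its sample by name.
        samples[row["Sample Name"]]["assays"].append(row["Assay Name"])
    return {"project": study_id, "samples": samples}


project = build_project(
    "study-1", read_table(STUDY_SAMPLES), read_table(SPECIES_ASSAYS)
)
print(project["samples"]["sample-A"])
```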
High level "object relational mapper" Perl API handles storage into and retrieval from Chado database for consistency and maintainability.
Example: a sample may have several species identification assays. Our API provides a method for the sample object which returns the best single species term to summarise those results.
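The slides do not say how the "best" term is chosen, so the following is only one plausible rule, written in Python rather than the API's Perl: agreeing assays give an unambiguous answer, a clear majority wins, and a tie is reported as ambiguous. The function name and the qualification labels other than "unambiguous" are assumptions for illustration.

```python
from collections import Counter


def best_species(assay_results):
    """Summarise per-assay species names as (species, qualification).

    Hypothetical rule: this is a sketch, not the API's actual logic.
    """
    if not assay_results:
        return None, "unknown"
    counts = Counter(assay_results).most_common()
    top_name, top_count = counts[0]
    if len(counts) == 1:
        return top_name, "unambiguous"   # every assay agrees
    if counts[1][1] == top_count:
        return None, "ambiguous"         # tie between the top candidates
    return top_name, "majority"          # assays disagree, one term dominates


print(best_species(["Anopheles arabiensis", "Anopheles arabiensis"]))
```

Putting the rule behind one method means every consumer (web pages, search index, exports) reports the same species for a sample.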
Updating existing data
1. Edit ISA-Tab, delete project and reload project from new ISA-Tab (stable IDs for project, samples and assays are retained)
2. Edit ISA-Tab but apply simple SQL updates or an API script to modify the database (as delete+reload can be slow)
No database → ISA-Tab route at present.
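A minimal sketch of the second route, using Python's built-in sqlite3 and a toy table as a stand-in. The table and column names are invented for this example and are far simpler than the Chado schema; the point is only the pattern of a parameterised, transactional update that fails loudly if it does not touch exactly one row. Since there is no database → ISA-Tab route, the same correction still has to be made by hand in the ISA-Tab file.

```python
import sqlite3

# Toy stand-in for the database (hypothetical table, not Chado).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_prop (sample_id TEXT, prop TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO sample_prop VALUES (?, ?, ?)",
    [("S1", "sex", "femal"), ("S2", "sex", "male")],  # "femal" is the typo to fix
)
conn.commit()


def fix_value(conn, sample_id, prop, new_value):
    """Update one property value; roll back unless exactly one row changed."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE sample_prop SET value = ? WHERE sample_id = ? AND prop = ?",
            (new_value, sample_id, prop),
        )
        if cur.rowcount != 1:
            raise ValueError(f"expected 1 row, updated {cur.rowcount}")
    return cur.rowcount


fix_value(conn, "S1", "sex", "female")
print(conn.execute("SELECT value FROM sample_prop WHERE sample_id = 'S1'").fetchone())
```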
Scalability (storage + maintenance)
Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are IR)
API overhead ⇒ some tasks take overnight:
● loading for 1000+ sample datasets
● search index generation
No issues yet with maintenance (e.g. backup and transfer of databases).
Scalability (web-based retrieval)
"Dumb" API-based retrieval for "smart" web client (see next slide) is too slow on its own.
Currently using pre-filled RAM-based cache to speed up API requests for web-users. Not necessarily scalable. Still not very fast!
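The idea of a pre-filled cache can be sketched as follows. This is a generic Python illustration, not the VectorBase implementation: the fetch function, class name, and sample IDs are hypothetical stand-ins for the slow API-based retrieval.

```python
calls = []  # records every slow fetch so the effect of the cache is visible


def slow_api_fetch(sample_id):
    """Stand-in for an expensive API-based retrieval (hypothetical)."""
    calls.append(sample_id)
    return {"id": sample_id, "type": "sample"}


class PrefilledCache:
    """Keep responses in RAM; fill them up front rather than on first request."""

    def __init__(self, fetch, keys):
        self._fetch = fetch
        # Pay the slow cost once, before any web user asks.
        self._store = {key: fetch(key) for key in keys}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            # Anything not pre-filled still goes through the slow path.
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]


cache = PrefilledCache(slow_api_fetch, ["S1", "S2"])
cache.get("S1")          # served from RAM, no new fetch
cache.get("S3")          # not pre-filled, so one slow fetch
print(len(calls), cache.misses)
```

The sketch also shows why this approach is not necessarily scalable: memory use and warm-up time both grow with the number of pre-filled records.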
See future plans...
{"sample_manipulations":[], "name":"G05-2019", "species_identification_assays":[{"result_summary":"<span class=\"species_name\">Anopheles arabiensis</span> (PCR-based species identification)", "name":"G05-2019.species", "description":null, "props":[{"cvterms":[{"name":"species assay result", "accession":"VBcv:0000961"}, {"name":"Anopheles arabiensis", "accession":"VBsp:0002224"}]}], "protocols":[{"props":[], "name":"VBA0046035:PROTO2", "type":{"name":"PCR-based species identification", "accession":"MIRO:30000040"}, "description":"Mosquito DNA was extracted from the carcass and identified to species and molecular form using rDNA-based PCR assays.", "uri":""}], "performers":[], "id":"VBA0046035", "type":"species identification assay"}], "species":{"name":"Anopheles arabiensis", "accession":"VBsp:0002224"}, "description":null, "genotype_assays":[{"result_summary":"inversion: 2La/a; inversion: 2Rjb/b (cytological chromosome examination)", "genome_browser_path":null, "name":"G05-2019.karyotyping", "description":null, "genotypes":[{"uniquename":"VBA0046036:2La/a", "props":[{"value":"2La/a", "cvterms":[{"name":"inversion", "accession":"SO:1000036"}]}, {"value":"2L", "cvterms":[{"name":"chromosome_arm", "accession":"SO:0000105"}]}], "name":"2La/a", "type":{"name":"paracentric_inversion", "accession":"SO:1000047"}, "description":"inversion: 2La/a"}, {"uniquename":"VBA0046036:2Rjb/b", "props":[{"value":"2Rjb/b", "cvterms":[{"name":"inversion", "accession":"SO:1000036"}]}, {"value":"2R", "cvterms":[{"name":"chromosome_arm", "accession":"SO:0000105"}]}], "name":"2Rjb/b", "type":{"name":"paracentric_inversion", "accession":"SO:1000047"}, "description":"inversion: 2Rjb/b"}], "vcf_file":null, "props":[], "protocols":[{"props":[{"value":"microscope manufacturer: Olympus", "cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}]}, {"cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}, {"name":"Giemsa staining", "accession":"IDOMAL:0000552"}]}], "name":"VBA0046036:PROTO3", "type":{"name":"cytological chromosome examination", "accession":"MIRO:30000037"}, "description":"Ovaries were prepared for karyotype analysis according to standard procedures. The banding pattern was observed under a phase-contrast microscope (400×) and interpreted with reference to the chromosomal map and nomenclature of Coluzzi and colleagues. ", "uri":""}], "performers":[], "type":"genotype assay", "id":"VBA0046036"}], "props":[{"cvterms":[{"name":"sex", "accession":"EFO:0000695"}, {"name":"female", "accession":"PATO:0000383"}]}, {"cvterms":[{"name":"developmental stage", "accession":"EFO:0000399"}, {"name":"adult", "accession":"IDOMAL:0000655"}]}], "field_collections":[{"result_summary":"Burkina Faso (pyrethrum spray catch)", "name":"G05-2019.collect", "description":null, "geolocation":{"longitude":"-0.05727", "props":[{"cvterms":[{"name":"collection site", "accession":"VBcv:0000831"}, {"name":"Burkina Faso", "accession":"GAZ:00000905"}]}, {"value":"Bonsse", "cvterms":[{"name":"location", "accession":"VBcv:0000698"}]}, {"value":"Burkina Faso", "cvterms":[{"name":"country", "accession":"VBcv:0000701"}]}], "latitude":"12.1693", "geodetic_datum":"WGS 84", "name":"Burkina Faso", "altitude":null}, "props":[{"value":"2005-08-02", "cvterms":[{"name":"date", "accession":"VBcv:0000705"}]}], "protocols":[{"props":[], "name":"VBA0046034:PROTO1", "type":{"name":"pyrethrum spray catch", "accession":"MIRO:30000023"}, "description":"Freshly-fed female An. gambiae s.l. were collected in the morning while resting inside human dwellings by manual aspiration with the aid of electrical aspirators. Mosquitoes were kept in small cages wrapped in wet towels and stored inside cool boxes. Additionally, indoor insecticide space-sprays were carried out in the early afternoon.", "uri":"\n"}], "performers":[], "type":"field collection", "id":"VBA0046034"}], "species_qualifications":[{"name":"unambiguous", "accession":"VBcv:autocreated:unambiguous"}], "type":{"name":"individual", "accession":"EFO:0000542"}, "id":"VBS0015615", "phenotype_assays":[]}
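To show how a "smart" client might consume a response like this, the sketch below parses a trimmed excerpt of the sample JSON and pulls out the species, coordinates, and collection date. The excerpt keeps only a few fields from the example sample (all values are copied from it); the `summarise` function and its output keys are assumptions for illustration, written in Python.

```python
import json

# Trimmed excerpt of the sample response; only a handful of fields are kept.
EXCERPT = """
{"id": "VBS0015615",
 "name": "G05-2019",
 "species": {"name": "Anopheles arabiensis", "accession": "VBsp:0002224"},
 "field_collections": [
   {"id": "VBA0046034",
    "geolocation": {"latitude": "12.1693", "longitude": "-0.05727"},
    "props": [{"value": "2005-08-02",
               "cvterms": [{"name": "date", "accession": "VBcv:0000705"}]}]}
 ]}
"""


def summarise(sample):
    """Flatten one sample record into a small dict for display or mapping."""
    collection = sample["field_collections"][0]
    geo = collection["geolocation"]
    # Properties are typed by ontology terms, so look the date up by cvterm.
    date = next(
        (p.get("value") for p in collection["props"]
         if any(t["name"] == "date" for t in p["cvterms"])),
        None,
    )
    return {
        "id": sample["id"],
        "species": sample["species"]["name"],
        "lat": float(geo["latitude"]),    # coordinates arrive as strings
        "lon": float(geo["longitude"]),
        "date": date,
    }


summary = summarise(json.loads(EXCERPT))
print(summary)
```

Note that the full record is much larger than the handful of values a map or table needs, which is part of why shipping whole records to the client is slow.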
Web interface
PopBio browser: https://www.vectorbase.org/popbio/
A good example project page: https://www.vectorbase.org/popbio/project/?id=VBP0000010
New entry page currently in development: http://funcgen.vectorbase.org/popbio-map-preview/vb_geohashes_mean.html
Web interface
Plan to develop or modify something similar to MalariaGen's Panoptes with richer/more flexible metadata capabilities.
Plans
Map interface: delivery for June (VB-2015-06) release and present/demo at Kolymbari, ICEMR meetings
Spreadsheet submission wizard development scheduled for Fall 2015.
Year 2: Sample x genotype browser development, including Ensembl (e!) REST and variation Solr work.
Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) & update graphics.