Posted on 12-Feb-2017
Chunlei Wu, Ph.D.
cwu@scripps.edu
@chunleiwu
Associate Professor of Molecular Medicine
Dept. of Molecular Experimental Medicine
The Scripps Research Institute
La Jolla, CA, USA
07/2016
High-performance web services for gene and variant annotations
MyVariant.info
MyGene.info
Biological knowledge is a complex network
No one-size-fits-all database can capture the entire knowledge space
Typical database representations
• Relational database: tables
• Document database: JSON objects, e.g. { _id: 1017, name: CDK2, taxid: 9606 }
• RDF triplestore: triples
• Key-value store: key-value pairs
BioThings APIs are built on document databases
Why we picked document databases:
• Object representation
• Rich data structures that handle heterogeneous data very well
• Atomic operations, built for big-data scale
Gene and Variant annotations represented in JSON documents
{
  "_id": "chr1:g.196659237C>T",
  "cosmic": {
    "chrom": "1",
    "hg19": { "start": 196659237, "end": 196659237 },
    "ref": "C",
    "alt": "T",
    "tumor_site": "breast",
    "mut_freq": 0.49,
    "mut_nt": "C>T",
    "cosmic_id": "COSM424915"
  }
}
{
  "_id": "1017",
  "Symbol": "CDK2",
  "Ensembl": "ENSG00000123374",
  "RefSeq": [ "NM_001798", "NM_052827" ],
  "Reporter": {
    "U95A": [ "1792_g_at", "1833_at" ],
    "U133A": [ "211804_s_at", "2045252_at", "211803_at" ]
  }
}
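As a rough illustration of why the object representation is convenient, the sketch below loads the variant document shown above with Python's standard json module and reads nested fields by path. The helper name get_path is hypothetical and exists only for this example; the outer closing brace is included so that the text parses.

```python
import json

# Variant document from the slide (outer closing brace included so it parses).
VARIANT_JSON = """
{
  "_id": "chr1:g.196659237C>T",
  "cosmic": {
    "chrom": "1",
    "hg19": {"start": 196659237, "end": 196659237},
    "ref": "C",
    "alt": "T",
    "tumor_site": "breast",
    "mut_freq": 0.49,
    "mut_nt": "C>T",
    "cosmic_id": "COSM424915"
  }
}
"""


def get_path(doc, dotted_path, default=None):
    """Hypothetical helper: follow a dotted path such as 'cosmic.hg19.start'."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


variant = json.loads(VARIANT_JSON)

if __name__ == "__main__":
    print(get_path(variant, "cosmic.hg19.start"))
    print(get_path(variant, "cosmic.tumor_site"))
    # A source that is absent from this document simply yields the default.
    print(get_path(variant, "clinvar.rcv", default="n/a"))
```

Because each annotation source lives under its own top-level key, a document can carry any subset of sources without a schema change.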
Keep data always up-to-date
Each data source is updated individually. Colors indicate their different update schedules.
Schematic view of MyVariant.info architecture
High-performance web service APIs
MyGene.info + MyVariant.info
Gene (G): MyGene.info
• /v2/gene/<geneid>
• /v2/query?q=<query>
• /v3/gene/<geneid>
• /v3/query?q=<query>
Variant (V): MyVariant.info
• /v1/variant/<hgvsid>
• /v1/query?q=<query>
single query on GET, batch query on POST
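The GET/POST split can be sketched with Python's standard library alone. The example below only builds the request objects and does not send them. The v3 base URL comes from the slide; the comma-joined q value and the scopes parameter in the POST body are assumptions about the batch format, so check the service documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://mygene.info/v3"  # version prefix taken from the slide


def single_query(term):
    """Build a GET request for one query term against /query."""
    return Request(f"{BASE_URL}/query?{urlencode({'q': term})}", method="GET")


def batch_query(terms, scopes="symbol"):
    """Build a POST request carrying many terms in one form-encoded body.

    Assumption: terms are comma-joined in 'q' and 'scopes' names the field
    to match against.
    """
    body = urlencode({"q": ",".join(terms), "scopes": scopes}).encode("utf-8")
    return Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


get_req = single_query("cdk2")
post_req = batch_query(["CDK2", "CDK7"])

if __name__ == "__main__":
    print(get_req.get_method(), get_req.full_url)
    print(post_req.get_method(), post_req.full_url, post_req.data)
    # Sending either request needs network access, e.g.
    # urllib.request.urlopen(get_req).
```

Putting the batch in the POST body keeps long ID lists out of the URL, where length limits would otherwise apply.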
We focus on building APIs. Try to …
Make it really easy to use
• Just two endpoints
• No registration/sign-in
• No API key
Developer-friendly
• Python/R clients (also a JS client for MyVariant)
• Search “mygene” and “myvariant” in PyPI and Bioconductor
Supported!
• JSONP
• CORS
• https
• msgpack
• http compression
• http caching
• JSON-LD
Aggregate everything about genes and variants
MyGene.info
• Supports >15M genes for ~17K species
• ~200 annotation fields
MyVariant.info
• Supports >334M variants
• ~500 annotation fields from 14 sources: ClinVar, dbNSFP, dbSNP, …
Keep up-to-date
MyGene.info: updated weekly
MyVariant.info: updated ~monthly
High-performance and scalable
• >95% of queries get a response in < 30 ms
• A “stress test” suggests support for >5,000 concurrent users at ~10,000 requests per minute
High availability
99.999% over last year
MyVariant.info
MyGene.info
99.87% over last 6 months
Availability tracked by
Who is using
Live applications:
• MinePath.org
• Gene Wiki
• JBrowse
Many users call them from their daily analysis pipelines or simply cache annotations locally
MyGene.info recent usage stats
Month     Requests      Unique IPs
Jan-16     3,885,192    2,498
Feb-16     5,313,950    2,786
Mar-16     3,362,354    3,121
Apr-16    10,918,104    3,065
May-16    10,776,858    3,803
Jun-16     6,396,148    3,940
Request sources: 39% direct calls, 38% mygene.py, 14% mygene.R, 9% BioGPS
Over 40M requests in six months
MyVariant.info recent usage stats
Month     Requests     Unique IPs
Jan-16       83,519    1,330
Feb-16    3,054,191    1,192
Mar-16      272,424    1,771
Apr-16      701,526    1,500
May-16       89,642    1,891
Jun-16      213,767    1,924
Request sources: 21% direct calls, 23% myvariant.py, 50% myvariant.R, 6% myvariant.js
~4.5M requests in six months
Generalized BioThings SDK
MyGene.info and MyVariant.info generalized into the BioThings SDK:
• JSON data aggregation mechanism
• High-performance query engine
• Well-designed REST API pattern
• JSON-LD enabled Linked Data
• Data-updating scheduler
• Python/R clients
• …
BioThings SDK
A tutorial here (more docs are coming): http://biothingsapi.readthedocs.io/en/latest/
BioThings SDK
• g.biothings.io: gene (alias to MyGene.info)
• v.biothings.io: variant (alias to MyVariant.info)
• s.biothings.io: species/taxonomy
BioThings API for species/taxonomy
{
  "_id": "9606",
  "_version": 1,
  "authority": [ "homo sapiens linnaeus, 1758" ],
  "children": [ 63221, 741158 ],
  "common_name": "man",
  "genbank_common_name": "human",
  "has_gene": true,
  "lineage": [ 9606, 9605, 207598, …, 131567, 1 ],
  "parent_taxid": 9605,
  "rank": "species",
  "scientific_name": "homo sapiens",
  "taxid": 9606,
  "uniprot_name": "homo sapiens"
}
http://s.biothings.io/v1/species/9606?include_children=true
BioThings API for species/taxonomy
{
  "hits": [
    {
      "_id": "1239",
      "_score": 10.971453,
      "common_name": […],
      "genbank_common_name": "gram-positive bacteria",
      "has_gene": false,
      "lineage": [ 1239, 1783272, 2, 131567, 1 ],
      "parent_taxid": 1783272,
      "rank": "phylum",
      "scientific_name": "firmicutes",
      "taxid": 1239,
      "uniprot_name": "firmicutes"
    }
  ],
  "max_score": 10.971453,
  "took": 12,
  "total": 1
}
http://s.biothings.io/v1/query?q=rank:phylum AND common_name:gram-positive
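A minimal sketch of consuming such a query response in Python follows. It parses the response shown above (with the common_name field left out, since its value is elided on the slide) and uses the lineage list to test ancestry. The helper is_descendant_of is hypothetical, and it assumes the first lineage entry is the taxon itself, as in the data shown.

```python
import json

# Query response from the slide; "common_name" is omitted because its value
# is elided in the original.
RESPONSE_JSON = """
{
  "hits": [
    {
      "_id": "1239",
      "_score": 10.971453,
      "genbank_common_name": "gram-positive bacteria",
      "has_gene": false,
      "lineage": [1239, 1783272, 2, 131567, 1],
      "parent_taxid": 1783272,
      "rank": "phylum",
      "scientific_name": "firmicutes",
      "taxid": 1239,
      "uniprot_name": "firmicutes"
    }
  ],
  "max_score": 10.971453,
  "took": 12,
  "total": 1
}
"""


def is_descendant_of(hit, ancestor_taxid):
    """True if ancestor_taxid appears in the lineage above the hit itself.

    Assumption: lineage[0] is the taxon's own taxid, as in the sample data.
    """
    return ancestor_taxid in hit["lineage"][1:]


response = json.loads(RESPONSE_JSON)
first_hit = response["hits"][0]

if __name__ == "__main__":
    print(response["total"], first_hit["scientific_name"], first_hit["rank"])
    print(is_descendant_of(first_hit, 2))     # taxid 2 is Bacteria in NCBI Taxonomy
    print(is_descendant_of(first_hit, 1239))  # a taxon is not its own descendant here
```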
Species API used in MyGene.info
You can now query for genes beyond species:
Q: Give me all lytic enzymes for any firmicutes
Without include_tax_tree: 0 hits
http://mygene.info/v3/query?q=lytic enzyme&species=1239
With include_tax_tree=true: 5 hits
http://mygene.info/v3/query?q=lytic enzyme&species=1239&include_tax_tree=true
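The two query URLs above contain a space in the search term, so a client has to percent-encode them before sending. The sketch below builds both variants with Python's standard library; the function name build_query_url is hypothetical, while the parameter names are the ones shown on the slide.

```python
from urllib.parse import quote, urlencode

MYGENE_QUERY_URL = "http://mygene.info/v3/query"


def build_query_url(term, species, include_tax_tree=False):
    """Return a percent-encoded query URL; spaces in the term become %20."""
    params = {"q": term, "species": species}
    if include_tax_tree:
        # Ask the service to search the whole subtree under the given taxid.
        params["include_tax_tree"] = "true"
    return f"{MYGENE_QUERY_URL}?{urlencode(params, quote_via=quote)}"


species_only = build_query_url("lytic enzyme", 1239)
with_subtree = build_query_url("lytic enzyme", 1239, include_tax_tree=True)

if __name__ == "__main__":
    print(species_only)
    print(with_subtree)
```

The only difference between the two requests is the include_tax_tree flag, which is why the phylum-level taxid 1239 (has_gene is false) returns hits only in the second form.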
Very minimal code for building a species API
Have the flexibility to customize your query
BioThings SDK
• g.biothings.io: gene (alias to MyGene.info)
• v.biothings.io: variant (alias to MyVariant.info)
• s.biothings.io: species/taxonomy
• c.biothings.io: drugs/compounds
• d.biothings.io: disease
• ∙ ∙ ∙
BioThings APIs
• A collection of data APIs: data as a service
• A framework for building new APIs: software as a service
Got a new type of “BioThings”?
We can help you build or even host your BioThings API
BioThings TEAM
Funding and Support: U01HG008473, U54GM114833
TSRI: Chunlei Wu, Andrew Su, Jiwen Xin, Cyrus Afrasiabi, Sebastien Lelong, Ginger Tsueng, Julee Adesara, Mike Mayers
U. Washington: Sean Mooney, Moritz Juchler, Nikhil Gopal
Source code
• MyGene.info: https://github.com/sulab/mygene.info
• MyVariant.info: https://github.com/sulab/myvariant.info
• BioThings API for species/taxonomy: https://github.com/sulab/biothings.species
• BioThings SDK: https://github.com/sulab/biothings.api
DEMO time!
by Jiwen (Kevin) Xin
Filtering steps to prioritize candidate genes
Number of genes remaining after each successive step: 2441, 2308, 1917, 18, 9, 5
Initial number of genes mutated in all four patients:
nVars <- countGenes(vars)
Filtering for sequencing coverage and strand bias:
filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))
Filtering for nonsynonymous and splice site variants:
filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE")))
Filtering for rare variants based on allele frequencies from ExAC:
filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))
Filtering for rare variants based on allele frequencies from 1000 Genomes Project:
filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af, function(j) j < 0.01 )))
Filtering by GO biological process annotation using MyGene.info:
goBP <- data.frame(queryMany(top.genes$Var1, scopes="symbol", species="human", fields=c("go.BP", "name", "MIM", "uniprot")))
# The Bioconductor package GO.db is used to find all genes with a GO biological process annotation that
# is a descendant of GO:0008152 - the GO id for metabolic process.
miller.bp <- lapply(goBP$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in% unlist(GOBPANCESTOR[[j]])))
candidate.genes <- top.genes$Var1[sapply(bp.ancestor, function(i) TRUE %in% i)]
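For readers who work in Python, the first four filters of the cascade can be sketched over plain dictionaries. This is an illustration under stated assumptions, not the notebook's code: the field names and thresholds follow the R snippets, the variants sit in one flat list instead of one data frame per patient, and the toy records are made up for the example. A record with a missing allele frequency is dropped, mirroring how subset() treats NA.

```python
# Consequence classes kept by the second filter (from the R code).
KEPT_CONSEQUENCES = {
    "NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST",
    "CANONICAL_SPLICE", "SPLICE_SITE",
}


def passes_quality(v):
    """Sequencing coverage and strand bias: DP > 8, FS < 30, QD > 2."""
    return v["DP"] > 8 and v["FS"] < 30 and v["QD"] > 2


def is_damaging(v):
    """Nonsynonymous or splice-site consequence according to CADD."""
    return v.get("cadd.consequence") in KEPT_CONSEQUENCES


def is_rare(v, field, cutoff=0.01):
    """Allele frequency below cutoff; a missing value fails the filter."""
    af = v.get(field)
    return af is not None and af < cutoff


def prioritize(variants):
    """Apply the four filters in order and return the surviving variants."""
    steps = [
        passes_quality,
        is_damaging,
        lambda v: is_rare(v, "exac.af"),
        lambda v: is_rare(v, "dbnsfp.1000gp1.af"),
    ]
    for step in steps:
        variants = [v for v in variants if step(v)]
    return variants


# Made-up toy records for illustration only (not real patient data).
toy_variants = [
    {"id": "toy1", "DP": 20, "FS": 1.2, "QD": 14.0,
     "cadd.consequence": "NON_SYNONYMOUS",
     "exac.af": 0.0002, "dbnsfp.1000gp1.af": 0.001},
    {"id": "toy2", "DP": 5, "FS": 1.0, "QD": 12.0,      # low coverage
     "cadd.consequence": "STOP_GAINED",
     "exac.af": 0.0001, "dbnsfp.1000gp1.af": 0.0005},
    {"id": "toy3", "DP": 30, "FS": 2.0, "QD": 10.0,     # synonymous
     "cadd.consequence": "SYNONYMOUS",
     "exac.af": 0.0001, "dbnsfp.1000gp1.af": 0.0005},
    {"id": "toy4", "DP": 25, "FS": 0.5, "QD": 9.0,      # common in ExAC
     "cadd.consequence": "STOP_GAINED",
     "exac.af": 0.2, "dbnsfp.1000gp1.af": 0.001},
    {"id": "toy5", "DP": 40, "FS": 3.0, "QD": 11.0,     # no 1000 Genomes value
     "cadd.consequence": "SPLICE_SITE",
     "exac.af": 0.001},
]

survivors = prioritize(toy_variants)

if __name__ == "__main__":
    print([v["id"] for v in survivors])
```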
Demos in Jupyter notebooks
• Using myvariant and mygene in R for variant prioritization
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb
• Access ClinVar data from myvariant in Python
http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_clinvar_demo.ipynb
• ID mapping using mygene module in Python
http://nbviewer.jupyter.org/gist/newgene/6771106