High-performance web services for gene and variant annotations

Chunlei Wu, Ph.D.

cwu@scripps.edu

@chunleiwu

Associate Professor of Molecular Medicine

Dept. of Molecular Experimental Medicine

The Scripps Research Institute

La Jolla, CA, USA

07/2016

High-performance web services for gene and variant annotations

MyVariant.info

MyGene.info

Biological knowledge is a complex network

No one-size-fits-all database can capture the entire knowledge space

Typical database representations

Example record: { _id: 1017, name: CDK2, taxid: 9606 }

• Relational database: tables
• Document database: JSON objects
• RDF triplestore: triples
• Key-value store: key-value pairs
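A minimal sketch of how the example record might look in each of the four representations. The `gene:` subject prefix and the choice of `_id` as the key are assumptions made for illustration, not something the slides specify.

```python
import json

# The gene record from the slide, as a plain Python dict.
record = {"_id": 1017, "name": "CDK2", "taxid": 9606}

# Relational: a fixed set of columns and one row per record.
columns = ("_id", "name", "taxid")
row = tuple(record[c] for c in columns)

# Document: the whole record is stored as one JSON object.
document = json.dumps(record, sort_keys=True)

# RDF triplestore: one (subject, predicate, object) triple per field.
# The "gene:" prefix is made up for this illustration.
subject = f"gene:{record['_id']}"
triples = [(subject, key, value) for key, value in record.items() if key != "_id"]

# Key-value store: an opaque value looked up by a single key.
kv_store = {str(record["_id"]): document}

print(row)
print(document)
print(triples)
print(kv_store)
```

The document form keeps the record in one piece, which is the property the following slides rely on.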

BioThings APIs are built on document databases

Why we picked document databases:

• Object representation

• Rich data structures, handles heterogeneous data very well

• Atomic operations, built for big-data scale

Gene and Variant annotations represented in JSON documents

{ "_id": "chr1:g.196659237C>T", "cosmic": { "chrom": "1", "hg19": { "start": 196659237, "end": 196659237 }, "ref": "C", "alt": "T", "tumor_site": "breast", "mut_freq": 0.49, "mut_nt": "C>T", "cosmic_id": "COSM424915" } }

{ "_id": "1017", "Symbol": "CDK2", "Ensembl": "ENSG00000123374", "RefSeq": [ "NM_001798", "NM_052827" ], "Reporter": { "U95A": [ "1792_g_at", "1833_at" ], "U133A": [ "211804_s_at", "2045252_at", "211803_at" ] } }
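A short sketch of reading nested fields out of such a document, using a trimmed copy of the variant example above. The helper `get_path` is hypothetical; it only illustrates the dotted field notation (for example `cosmic.hg19.start`) that nested JSON annotations lend themselves to.

```python
import json

# A trimmed copy of the variant document shown on the slide.
variant_json = """
{"_id": "chr1:g.196659237C>T",
 "cosmic": {"chrom": "1",
            "hg19": {"start": 196659237, "end": 196659237},
            "ref": "C", "alt": "T",
            "tumor_site": "breast",
            "mut_freq": 0.49,
            "cosmic_id": "COSM424915"}}
"""


def get_path(doc, dotted_path, default=None):
    """Walk a nested dict with a dotted path such as 'cosmic.hg19.start'."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


variant = json.loads(variant_json)
print(get_path(variant, "cosmic.hg19.start"))
print(get_path(variant, "cosmic.tumor_site"))
# A source that has no record for this variant is simply absent.
print(get_path(variant, "clinvar.rcv", default="not annotated"))
```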

Keep data always up-to-date

Each data source is updated individually. Colors indicate their different updating schedules.

Schematic view of MyVariant.info architecture

High-performance web service APIs


MyGene.info + MyVariant.info

Gene (MyGene.info):
/v2/gene/<geneid>
/v2/query?q=<query>
/v3/gene/<geneid>
/v3/query?q=<query>

Variant (MyVariant.info):
/v1/variant/<hgvsid>
/v1/query?q=<query>

Single query on GET, batch query on POST
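The two-endpoint pattern can be sketched as a pair of URL builders. This is an illustration only: the `http://myvariant.info/v1` host is assumed from the path shown above, the helper names are hypothetical, and the `ids` form field for batch POST requests is an assumption that should be checked against the API documentation. Nothing here touches the network.

```python
from urllib.parse import quote, urlencode

MYGENE = "http://mygene.info/v3"        # host as used later in this deck
MYVARIANT = "http://myvariant.info/v1"  # assumed host for the /v1 paths


def annotation_url(base, entity, entity_id):
    """URL of the annotation endpoint, e.g. /v3/gene/<geneid>."""
    return f"{base}/{entity}/{quote(str(entity_id), safe=':')}"


def query_url(base, q, **params):
    """URL of the query endpoint, e.g. /v3/query?q=<query>."""
    return f"{base}/query?{urlencode({'q': q, **params})}"


def batch_body(ids):
    """Form-encoded body for a batch POST (field name 'ids' is an assumption)."""
    return urlencode({"ids": ",".join(str(i) for i in ids)})


print(annotation_url(MYGENE, "gene", 1017))
print(annotation_url(MYVARIANT, "variant", "chr1:g.196659237C>T"))
print(query_url(MYGENE, "CDK2", species="human"))
print(batch_body([1017, 1018]))
```

A single lookup is then a GET on one of these URLs, while a batch lookup sends the encoded body in a POST to the same endpoint.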

We focus on building APIs. Try to …

Make it really easy to use

Just two endpoints

No registration/sign-in

No API key

Developer-friendly

Python/R clients (also a JS client for MyVariant.info)

Search “mygene” and “myvariant” in PyPI and Bioconductor

Supported: JSONP, CORS, HTTPS, msgpack, HTTP compression, HTTP caching, JSON-LD

Aggregate Everything about gene and variant

MyGene.info: supports >15M genes for ~17K species, ~200 annotation fields

MyVariant.info: supports >334M variants, ~500 annotation fields, from 14 sources: ClinVar, dbNSFP, dbSNP, …

Keep up-to-date

MyGene.info (updated weekly): supports >15M genes for ~17K species, ~200 annotation fields

MyVariant.info (updated ~monthly): supports >334M variants, ~500 annotation fields, from 14 sources: ClinVar, dbNSFP, dbSNP, …

High-performance and scalable

>95% of queries respond in < 30 ms

A “stress test” suggests support for >5,000 concurrent users and ~10,000 requests per minute

High availability

99.999% over last year

MyVariant.info

MyGene.info

99.87% over last 6 months

Availability tracked by
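To put the two availability figures in perspective, the downtime they imply can be worked out directly. This is a back-of-the-envelope sketch; it assumes a 365-day year and treats six months as 182.5 days.

```python
def downtime_minutes(availability_percent, period_days):
    """Minutes of downtime implied by an availability percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


year = downtime_minutes(99.999, 365)         # "99.999% over last year"
half_year = downtime_minutes(99.87, 182.5)   # "99.87% over last 6 months"

print(f"99.999% over a year: about {year:.1f} minutes of downtime")
print(f"99.87% over six months: about {half_year / 60:.1f} hours of downtime")
```

Under these assumptions the first figure corresponds to roughly five minutes of downtime in a year, and the second to a little under six hours in six months.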

Who is using

Live applications:

MinePath.org

Gene Wiki

JBrowse

Many users use them in their daily analysis pipelines, or simply cache annotations locally

MyGene.info recent usage stats

Month    Requests     Unique IPs
Jan-16    3,885,192   2,498
Feb-16    5,313,950   2,786
Mar-16    3,362,354   3,121
Apr-16   10,918,104   3,065
May-16   10,776,858   3,803
Jun-16    6,396,148   3,940

Request sources: 39% direct calls, 38% mygene.py, 14% mygene.R, 9% BioGPS

Over 40M requests in six months

MyVariant.info recent usage stats

Month    Requests     Unique IPs
Jan-16       83,519   1,330
Feb-16    3,054,191   1,192
Mar-16      272,424   1,771
Apr-16      701,526   1,500
May-16       89,642   1,891
Jun-16      213,767   1,924

Request sources: 21% direct calls, 23% myvariant.py, 50% myvariant.R, 6% myvariant.js

~4.5M requests in six months
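As a quick check on the six-month totals, the monthly request counts from the two usage tables can be summed. The numbers are copied from the slides; the variable and function names are only for this sketch.

```python
# Monthly request counts, January to June 2016, as listed on the slides.
mygene_requests = {
    "Jan-16": 3_885_192, "Feb-16": 5_313_950, "Mar-16": 3_362_354,
    "Apr-16": 10_918_104, "May-16": 10_776_858, "Jun-16": 6_396_148,
}
myvariant_requests = {
    "Jan-16": 83_519, "Feb-16": 3_054_191, "Mar-16": 272_424,
    "Apr-16": 701_526, "May-16": 89_642, "Jun-16": 213_767,
}


def total_requests(monthly):
    """Sum the request counts over all listed months."""
    return sum(monthly.values())


print(f"MyGene.info:    {total_requests(mygene_requests):,} requests")
print(f"MyVariant.info: {total_requests(myvariant_requests):,} requests")
```

The MyGene.info sum comes to a little over 40.6M requests, and the MyVariant.info sum to about 4.4M.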

Generalized BioThings SDK

BioThings SDK (from MyVariant.info and MyGene.info):

• JSON data aggregation mechanism

• High-performance query engine

• Well-designed REST API pattern

• JSON-LD enabled Linked Data

• Data-updating scheduler

• Python/R clients

• …

BioThings SDK

A tutorial here (more docs are coming): http://biothingsapi.readthedocs.io/en/latest/

BioThings SDK

• v.biothings.io: variant (alias to MyVariant.info)

• g.biothings.io (alias to MyGene.info)

• s.biothings.io: species/taxonomy

BioThings API for species/taxonomy

{
  "_id": "9606",
  "_version": 1,
  "authority": ["homo sapiens linnaeus, 1758"],
  "children": [63221, 741158],
  "common_name": "man",
  "genbank_common_name": "human",
  "has_gene": true,
  "lineage": [9606, 9605, 207598, …, 131567, 1],
  "parent_taxid": 9605,
  "rank": "species",
  "scientific_name": "homo sapiens",
  "taxid": 9606,
  "uniprot_name": "homo sapiens"
}

http://s.biothings.io/v1/species/9606?include_children=true

BioThings API for species/taxonomy

{ "hits": [ { "_id": "1239", "_score": 10.971453, "common_name": […], "genbank_common_name": "gram-positive bacteria", "has_gene": false, "lineage": [1239, 1783272, 2, 131567, 1], "parent_taxid": 1783272, "rank": "phylum", "scientific_name": "firmicutes", "taxid": 1239, "uniprot_name": "firmicutes" } ], "max_score": 10.971453, "took": 12, "total": 1}

http://s.biothings.io/v1/query?q=rank:phylum AND common_name:gram-positive
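The query above contains spaces and colons, so it needs URL encoding before it is sent. A minimal sketch of building both species URLs with the standard library follows; the helper names are hypothetical and nothing here makes a network call.

```python
from urllib.parse import urlencode

BASE = "http://s.biothings.io/v1"  # host as shown on the slides


def species_url(taxid, include_children=False):
    """URL for a single taxonomy record, e.g. /v1/species/9606."""
    url = f"{BASE}/species/{int(taxid)}"
    if include_children:
        url += "?" + urlencode({"include_children": "true"})
    return url


def species_query_url(**field_terms):
    """URL for a fielded query; terms are joined with AND and URL-encoded."""
    q = " AND ".join(f"{field}:{term}" for field, term in field_terms.items())
    return f"{BASE}/query?" + urlencode({"q": q})


print(species_url(9606, include_children=True))
print(species_query_url(rank="phylum", common_name="gram-positive"))
```

`urlencode` writes spaces as `+` and colons as `%3A`, which is one common way to encode a query string.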

Species API used in MyGene.info

You can now query for genes beyond the species level:

Q: Give me all lytic enzymes for any firmicutes

http://mygene.info/v3/query?q=lytic enzyme&species=1239

0 hits

http://mygene.info/v3/query?q=lytic enzyme&species=1239&include_tax_tree=true

5 hits
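One way to picture what `include_tax_tree=true` does is to expand the requested taxid into itself plus all of its descendants before filtering genes. The toy sketch below illustrates that idea only; it is not the service's implementation. Apart from taxid 1239, every taxid and gene record in it is invented.

```python
# Toy taxonomy: parent taxid -> child taxids. Only 1239 (firmicutes) comes
# from the slides; the 9000xx ids and the gene records are made up.
CHILDREN = {1239: [900001, 900002], 900001: [900003]}

GENES = [
    {"symbol": "lytA", "name": "lytic enzyme A", "taxid": 900002},
    {"symbol": "lytB", "name": "lytic enzyme B", "taxid": 900003},
    {"symbol": "rpoB", "name": "RNA polymerase subunit", "taxid": 900002},
]


def expand_tax_tree(taxid):
    """Return the taxid plus all of its descendants (iterative depth-first walk)."""
    seen, stack = set(), [taxid]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CHILDREN.get(current, []))
    return seen


def query(term, species, include_tax_tree=False):
    """Filter genes by name and species, optionally over the whole subtree."""
    taxids = expand_tax_tree(species) if include_tax_tree else {species}
    return [g for g in GENES if term in g["name"] and g["taxid"] in taxids]


print(len(query("lytic enzyme", 1239)))
print(len(query("lytic enzyme", 1239, include_tax_tree=True)))
```

In this toy data no gene is attached to the phylum-level taxid itself, so the plain query finds nothing while the expanded query finds the genes of the descendant taxa.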

Very minimal code for building a species API

Have the flexibility to customize your query

BioThings SDK

• v.biothings.io: variant (alias to MyVariant.info)

• g.biothings.io (alias to MyGene.info)

• s.biothings.io: species/taxonomy

• c.biothings.io: drugs/compounds

• d.biothings.io: disease

• ∙ ∙ ∙

BioThings APIs

A collection of data APIs: data as a service

A framework for building new APIs: software as a service

Got a new type of “BioThings”?

We can help you build or even host your BioThings API

BioThings TEAM

Chunlei Wu, Andrew Su, Jiwen Xin, Cyrus Afrasiabi, Sebastien Lelong, Ginger Tsueng, Julee Adesara, Mike Mayers

U. Washington: Sean Mooney, Moritz Juchler, Nikhil Gopal

Funding and Support: U01HG008473, U54GM114833

Source code

• MyGene.info: https://github.com/sulab/mygene.info

• MyVariant.info: https://github.com/sulab/myvariant.info

• BioThings API for species/taxonomy: https://github.com/sulab/biothings.species

• BioThings SDK: https://github.com/sulab/biothings.api

DEMO time!

by Jiwen (Kevin) Xin

Filtering steps to prioritize candidate genes

Initial number of genes mutated in all four patients:

nVars <- countGenes(vars)

Filtering for sequencing coverage and strand bias:

filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))

Filtering for nonsynonymous and splice site variants:

filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE")))

Filtering for rare variants based on allele frequencies from ExAC:

filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))

Filtering for rare variants based on allele frequencies from 1000 Genomes Project:

filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af, function(j) j < 0.01 )))

Filtering by GO biological process annotation using MyGene.info:

goBP <- data.frame(queryMany(top.genes$Var1, scopes="symbol", species="human", fields=c("go.BP", "name", "MIM", "uniprot")))

# The Bioconductor package GO.db is used to find all genes with a GO biological process annotation
# that is a descendant of GO:0008152, the GO id for metabolic process.
miller.bp <- lapply(goBP$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in% unlist(GOBPANCESTOR[[j]])))
candidate.genes <- top.genes$Var1[sapply(bp.ancestor, function(i) TRUE %in% i)]

[Chart: number of genes remaining after each filtering step]
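The first four filtering steps can also be sketched in Python on plain dictionaries, counting the distinct genes that remain after each step. The thresholds and field names mirror the R code in this demo; the four sample variants, the `gene` key, and the helper `apply_filters` are invented for illustration, and missing values are not handled.

```python
FUNCTIONAL = {"NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST",
              "CANONICAL_SPLICE", "SPLICE_SITE"}

# Each filter is a (label, predicate) pair, applied in order.
FILTERS = [
    ("coverage and strand bias",
     lambda v: v["DP"] > 8 and v["FS"] < 30 and v["QD"] > 2),
    ("nonsynonymous or splice site",
     lambda v: v["cadd.consequence"] in FUNCTIONAL),
    ("rare in ExAC", lambda v: v["exac.af"] < 0.01),
    ("rare in 1000 Genomes", lambda v: v["dbnsfp.1000gp1.af"] < 0.01),
]

# Invented sample data, one variant per gene.
VARIANTS = [
    {"gene": "A", "DP": 20, "FS": 5, "QD": 10, "cadd.consequence": "NON_SYNONYMOUS",
     "exac.af": 0.001, "dbnsfp.1000gp1.af": 0.002},
    {"gene": "B", "DP": 5, "FS": 5, "QD": 10, "cadd.consequence": "STOP_GAINED",
     "exac.af": 0.001, "dbnsfp.1000gp1.af": 0.002},
    {"gene": "C", "DP": 30, "FS": 2, "QD": 12, "cadd.consequence": "SYNONYMOUS",
     "exac.af": 0.001, "dbnsfp.1000gp1.af": 0.002},
    {"gene": "D", "DP": 15, "FS": 10, "QD": 8, "cadd.consequence": "STOP_GAINED",
     "exac.af": 0.2, "dbnsfp.1000gp1.af": 0.15},
]


def apply_filters(variants, filters):
    """Apply filters in order; return survivors and gene counts per step."""
    counts = [("initial", len({v["gene"] for v in variants}))]
    remaining = list(variants)
    for label, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len({v["gene"] for v in remaining})))
    return remaining, counts


survivors, gene_counts = apply_filters(VARIANTS, FILTERS)
for label, n in gene_counts:
    print(f"{label}: {n} genes")
```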

Demos in Jupyter notebooks

• Using myvariant and mygene in R for variant prioritization: http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb

• Access ClinVar data from myvariant in Python: http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_clinvar_demo.ipynb

• ID mapping using mygene module in Python: http://nbviewer.jupyter.org/gist/newgene/6771106
