Systems Immunology -- 2014

Yannick Pouliot, PhD Biocomputational scientist Khatri Laboratory 04/09/2014 Databases, Web Services and Tools For Systems Immunology Databases for Systems Immunology


Databases useful for Systems Immunology


Page 1: Systems Immunology -- 2014

Yannick Pouliot, PhD

Biocomputational scientist

Khatri Laboratory

04/09/2014

Databases, Web Services and Tools For Systems Immunology

Databases for Systems Immunology

Page 2: Systems Immunology -- 2014

GOALS

Convey understanding of:

1. a set of databases highly relevant to

Systems Immunology

2. the issues and pitfalls associated with

each DB

Page 3: Systems Immunology -- 2014

Systems Immunology, and particularly

the application of meta-analysis, can

reveal testable hypotheses.

But for that to happen, you need lots

of (diverse) …

DATA

Page 4: Systems Immunology -- 2014

Historically, data were typically

available in flat file formats only

Relational databases now used

increasingly

Page 5: Systems Immunology -- 2014

Huge Numbers of Databases

• See the yearly database issue of Nucleic Acids Research to see just how many there are…

• Many need to be licensed ($)

▫ Ingenuity Pathway Analysis (IPA)

Excellent but pricey

▫ MetaCore: competitor to IPA, available from Lane Library

• Many more freely available

▫ E.g., DAVID: similar to IPA and MetaCore

▫ Typically dirtier than commercial products, but sometimes much more comprehensive

Page 6: Systems Immunology -- 2014

But first: “Free” does not necessarily

mean “easy to use”

(yet another application of the “there

ain’t no free lunch” principle)

Page 7: Systems Immunology -- 2014

Typical Problems with Third Party Data 1: Data Cleanup

• Third party data almost always requires preprocessing

▫ reshaping input data for reading by R or database upload

▫ substituting offending strings

single quotes

ending spaces

converting spaces to nulls

▫ normalizing equivalent strings (“Saline” = “saline”)

▫ semantic normalization

encoding source terms against controlled nomenclature

computing against same concept

enables cross-database queries

▫ reconciling descriptions of data to that in source papers

is this thing here what they are talking about in the paper?

missing data

extraneous data

▫ poorly described protocols

which *!&!! antibody did the authors actually use?

software version, parameters used
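The string-level cleanup steps listed above can be sketched in a few lines. This is a minimal illustration in Python, not the workflow used for these slides (which relied on R and MySQL); the function names `clean_field` and `clean_row` are hypothetical, and lowercasing every value is an assumption that only suits fields where case carries no meaning.

```python
import re

def clean_field(value):
    """Normalize one raw text field from a third-party file."""
    if value is None:
        return None
    text = value.replace("'", "")      # drop single quotes that break SQL loaders
    text = text.strip()                # remove leading and trailing spaces
    text = re.sub(r"\s+", " ", text)   # collapse internal runs of whitespace
    if text == "":
        return None                    # blank fields become nulls
    return text.lower()                # "Saline" and "saline" become one value

def clean_row(row):
    """Apply clean_field to every field of a row."""
    return [clean_field(value) for value in row]

raw = ["Saline ", "saline", "   ", "5' UTR"]
print(clean_row(raw))
```

Semantic normalization (mapping terms to a controlled nomenclature) is a separate step that needs a vocabulary such as UMLS and is not covered by this sketch.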

Page 8: Systems Immunology -- 2014

Typical Problems with Third Party Data 2: Must be Downloadable

To be useful in Systems Immunology, a database needs to offer one of the following:

1. downloadable (FTP/SFTP) in text or other form

2. accessible programmatically over the Internet (e.g., Web service)

Otherwise, you must write a scraping program (assuming this is acceptable use)

Manual Web interfaces don’t cut it…

Familiarity with databasing and programming skills essential

Page 9: Systems Immunology -- 2014

Relational Databases -- Take Your Pick

Page 10: Systems Immunology -- 2014

A Small Sample of DBs Useful in Systems Immunology

• NCBI:

▫ GEO: Gene expression

▫ PubChem: Drug and compound activity data

• DrugBank: Comprehensive info on drugs and their targets

• BioGPS, Expression Atlas: Compendia of gene expression across tissues

• Connectivity Map (CMAP): systematic survey of effects of compounds on

cells

• Comparative Toxicogenomics Database (CTD): effects of compounds on

genes; correlation of compounds with diseases

• Unified Medical Language System (UMLS): concept identification, DB

cross-querying

• ImmPort: Only multi-assay type immunological DB

• Stanford Data Miner (SDM): Human Immune Monitoring Core’s database

• The Cancer Genome Atlas (TCGA): Incredibly wide and deep repository of

human cancer data

Page 11: Systems Immunology -- 2014

Gene Expression

… including

- microarray

- qPCR

- RNA-Seq

Page 12: Systems Immunology -- 2014

GEO

• Vast repository of everything expression

▫ microarray gene expression as well as e.g., RNA-Seq

▫ lots of disease and drug treatment data in

humans

• Semi-structured data

▫ limits searchability of GEO search engine

▫ minimal standards applied by GEO

manual curation required

Page 13: Systems Immunology -- 2014

GEO: Example

Goal: Identifying transcripts unique to individual leukocyte cell

types

Process:

1. Curate GEO gene expression datasets for immune cells; store in

MySQL

2. Classify cell types according to Cell Ontology

3. Compute z-score of expression for all genes in each cell type

Tools: RMySQL + shiny + ggplot2
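Step 3 can be sketched as follows. This is a minimal Python illustration under stated assumptions: each gene already has one summary expression value per cell type, and the z-score is taken per gene across cell types. The slides do not say exactly how the z-score was defined, and the gene names and values below are invented toy data.

```python
from statistics import mean, pstdev

def zscores_by_cell_type(expression):
    """expression: {gene: {cell_type: value}} -> same shape, z-scored per gene."""
    result = {}
    for gene, by_cell in expression.items():
        values = list(by_cell.values())
        mu = mean(values)
        sigma = pstdev(values)  # population SD across cell types
        result[gene] = {
            cell: (value - mu) / sigma if sigma > 0 else 0.0
            for cell, value in by_cell.items()
        }
    return result

# Toy data: GENE_A is high only in B cells, GENE_B is flat everywhere.
toy = {
    "GENE_A": {"B cell": 10.0, "T cell": 2.0, "monocyte": 3.0},
    "GENE_B": {"B cell": 5.0, "T cell": 5.0, "monocyte": 5.0},
}
z = zscores_by_cell_type(toy)
top = max(z["GENE_A"], key=z["GENE_A"].get)
print(top, round(z["GENE_A"][top], 2))
```

A gene with a high z-score in one cell type and low scores elsewhere is a candidate cell-type-specific transcript; a flat gene scores zero everywhere.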

Page 14: Systems Immunology -- 2014

Drugs, compounds, bioactivity

Page 15: Systems Immunology -- 2014

PubChem

• Three components:

▫ Compounds

▫ Substances

▫ BioAssay: this is where the action is

• PubChem BioAssay is a repository of bioactivity for compounds

▫ Very wide range of assays:

high throughput screening

in vivo assays

cell-free assays

• Complex data model (XML)

▫ can be converted to relational, though…

Page 16: Systems Immunology -- 2014

PubChem BioAssay: Example

Approach: Create a model that correlates bioactivity profiles in screening assays with pattern of drug adversity

Enables prediction of adversity based on how a compound behaves in selected screens

Page 17: Systems Immunology -- 2014

DrugBank

• Comprehensive collection of detailed drug data

▫ chemical

▫ pharmacological

▫ pharmaceutical

▫ target

• Contents

▫ 7,680 drug entries

1,552 FDA-approved small molecule drugs

55 FDA-approved biotech (antibodies/protein/peptide) drugs

6,000 experimental drugs

• But …

Page 18: Systems Immunology -- 2014

Even When Data Are Available For Download, Converting Into Desired

Format Can Be Challenging…

Converting to relational or TSV formats doable but not trivial

Operating directly on XML not recommended…
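To give a feel for the XML-to-TSV conversion, here is a minimal Python sketch that flattens nested drug/target records into one row per drug-target pair. The XML layout and element names (`drugs`, `drug`, `name`, `targets`, `target`, `gene`) are hypothetical and far simpler than the real DrugBank schema; the drug and gene names are placeholders.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; the real DrugBank schema is much richer.
SAMPLE_XML = """
<drugs>
  <drug>
    <name>DrugOne</name>
    <targets>
      <target><gene>GENE1</gene></target>
      <target><gene>GENE2</gene></target>
    </targets>
  </drug>
  <drug>
    <name>DrugTwo</name>
    <targets/>
  </drug>
</drugs>
"""

def drug_target_rows(xml_text):
    """Return one (drug_name, gene_symbol) tuple per drug-target pair."""
    root = ET.fromstring(xml_text)
    rows = []
    for drug in root.findall("drug"):
        name = drug.findtext("name")
        genes = [t.findtext("gene") for t in drug.findall("targets/target")]
        if not genes:
            rows.append((name, ""))  # keep drugs that have no listed target
        for gene in genes:
            rows.append((name, gene))
    return rows

def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["drug_name", "gene_symbol"])
    writer.writerows(rows)
    return buffer.getvalue()

print(to_tsv(drug_target_rows(SAMPLE_XML)))
```

The resulting TSV can be bulk-loaded into a relational table, which is where the non-trivial part usually begins: deciding which nested elements become their own tables.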

Page 19: Systems Immunology -- 2014

DrugBank: Example

SELECT distinct

c.`NAME` as drug_name,

case

when (d.GENE_NAME is not null and e.symbol is null) then d.GENE_NAME

when (d.GENE_NAME = '' and e.symbol is not null) then e.symbol

else e.symbol

end as Symbol,

e.GeneID,

c.`DRUGBANK_ID` as drugbank_id,

c.`rxcui`,

c.CAS_NUMBER as cas_number,

d.`NAME` as gene_name

FROM

`target` a

join (`targets` b, drug c, partner d)

on (

a.`TARGETS_FKEY` = b.`TARGETS_PKEY`

and b.`DRUG_FKEY` = c.`DRUG_PKEY`

and a.`PARTNER` = d.`PARTNER_PKEY`

)

left join

annot_gene.`gene_info_hs` e

on

d.`NAME` = e.`name`

order by

drug_name,

Symbol;

• Retrieve known targets of drugs

• Find as many gene symbols as

possible for targets

Page 20: Systems Immunology -- 2014

Connectivity Map

• Contents: collection of microarray gene expression

datasets from panels of cell types treated with

multiple compounds at multiple doses

• Used to find drugs where the expression profiles

match that of a user’s query gene signature

▫ The system computes a similarity metric to quantify

the connection between that gene signature and

reference profiles

• Cells are all tumor cells from NCI-60 set
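The idea of scoring a query signature against reference profiles can be illustrated with a deliberately simplified rank-based score. This Python sketch is not the metric the Connectivity Map itself computes; the function name `connectivity_score`, the scoring rule, and the toy gene names are all assumptions made for illustration.

```python
def connectivity_score(ranked_genes, up_genes, down_genes):
    """Simplified similarity between a query signature and one reference profile.

    ranked_genes: genes ordered from most up-regulated to most down-regulated
    after treatment. Returns a value in [-1, 1]: positive when the profile
    mimics the signature, negative when it reverses it.
    """
    n = len(ranked_genes)
    if n < 2:
        return 0.0
    # Relative position: 0.0 = top of the list (most up), 1.0 = bottom (most down).
    position = {gene: i / (n - 1) for i, gene in enumerate(ranked_genes)}

    def mean_position(genes):
        found = [position[g] for g in genes if g in position]
        return sum(found) / len(found) if found else 0.5

    return mean_position(down_genes) - mean_position(up_genes)

# Toy reference profiles for two hypothetical compounds.
profiles = {
    "drug_mimic": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "drug_reverse": ["g6", "g5", "g4", "g3", "g2", "g1"],
}
up, down = {"g1", "g2"}, {"g5", "g6"}
scores = {name: connectivity_score(ranks, up, down) for name, ranks in profiles.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(score, 2))
```

Compounds with strongly negative scores are the ones whose expression effect runs opposite to the query signature.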

Page 21: Systems Immunology -- 2014

Connectivity Map: Example 1

select

`a`.`instance_id` AS `instance_id`,

`a`.`probe_name` AS `probe_name`,

`b`.`direction` AS `direction`,

`b`.`msigdb_id` AS `msigdb_id`,

`a`.`rank` AS `rank`,

`c`.`cmap_name` AS `cmap_name`,

`c`.`cell` AS `cell_type`,

`c`.`catalog_name`,

`c`.`catalog_number`,

`c`.`cas_number`,

`c`.`rxcui`,

`c`.`batch_id`,

`c`.`perturbation_scan_id`

from

((`v_instance2probe1` `a`

join `gene_sets` `b`)

join `instances` `c`)

where

((`a`.`probe_id` = `b`.`probe_id`)

and (`a`.`instance_id` = `c`.`instance_id`))

1. Assemble gene expression data and metadata

stored in multiple files into a cohesive DB

schema

2. Retrieve results into an integrated view

Page 22: Systems Immunology -- 2014

Connectivity Map: Example 2

Goal: Find drugs that shift gene expression in the reverse direction to what is observed in Inflammatory Bowel Disease (IBD) vs. normal tissues; such drugs should decrease symptoms

Method:

1. Characterize the effect of drugs on human gene transcript levels

2. Characterize the difference in human gene transcript levels between disease and normal tissue pairs

3. Find drugs that induce the reciprocal signature observed in disease

[Diagram: GEO and Connectivity Map (CM) data, linked using rxcui]

Page 23: Systems Immunology -- 2014

Data Sources

Disease data: GEO

• Assemble MySQL database of 176 gene expression microarray datasets from GEO

▫ diseased vs. normal tissue pairs

▫ 100 specific diseases manually reviewed and encoded using UMLS identifiers

▫ drug names encoded against UMLS RXCUI

Drug data : Connectivity Map

Gene expression microarray profiles of effects of 164 drugs in:

▫ breast cancer: MCF7 epithelial cell line

▫ prostate cancer: PC3 epithelial cell line

▫ leukemia: HL60

▫ melanoma: SKMEL5

▫ drug names encoded against UMLS RXCUI

Page 24: Systems Immunology -- 2014

Comparative Toxicogenomics Database

• Based on curation of literature of interactions

between

▫ compounds and diseases

▫ compounds and genes

▫ genes and diseases

Page 25: Systems Immunology -- 2014

CTD: Example

Goal: Retrieve genes whose

expression is influenced by

testosterone-related compounds

Page 26: Systems Immunology -- 2014

Data integration

Page 27: Systems Immunology -- 2014

Unified Medical Language System (UMLS)… and why you need it

• Provided by the National Library of Medicine

• Inter-relates many controlled nomenclatures

• Assigns single concept identifiers

• enables collapsing of variant expressions into

one concept

• Particularly useful when dealing with drug or

compound names (RXCUI)

• Use it from NCI Metathesaurus

or create a MySQL DB
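Collapsing variant names into one concept boils down to a lookup from normalized strings to a shared identifier. The Python sketch below uses a tiny hand-made synonym table; the identifiers `CONCEPT_1` and `CONCEPT_2` are placeholders and not real RXCUIs, and a real implementation would load the mapping from UMLS/RxNorm tables instead.

```python
from collections import defaultdict

# Placeholder concept identifiers (hypothetical, NOT real RXCUIs).
SYNONYMS = [
    ("CONCEPT_1", "acetaminophen"),
    ("CONCEPT_1", "paracetamol"),
    ("CONCEPT_1", "APAP"),
    ("CONCEPT_2", "ibuprofen"),
]

def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())

NAME_TO_CONCEPT = {normalize(name): concept for concept, name in SYNONYMS}

def to_concept(name):
    """Return the concept identifier for a drug name, or None if unknown."""
    return NAME_TO_CONCEPT.get(normalize(name))

def group_by_concept(records):
    """Group (source_db, drug_name) records from several databases by concept."""
    grouped = defaultdict(list)
    for source_db, drug_name in records:
        grouped[to_concept(drug_name)].append((source_db, drug_name))
    return dict(grouped)

records = [
    ("db_a", "Paracetamol"),
    ("db_b", "acetaminophen "),
    ("db_c", "Ibuprofen"),
    ("db_c", "mystery compound"),
]
grouped = group_by_concept(records)
for concept, members in grouped.items():
    print(concept, members)
```

Records that land under `None` are the ones that still need manual curation.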

Page 28: Systems Immunology -- 2014

UMLS: Example 1

• Developed by the National Library of Medicine: data files and software that bring together multiple biomedical vocabularies and ontologies to enable semantic interoperability

▫ repository of terms, definitions and concepts in biomedicine, complete with cross-referencing and ontological relationships

• Essential but complex and large

• Requires free license

▫ or use it from NCI Metathesaurus

Page 29: Systems Immunology -- 2014

UMLS: Example 2

“I don’t like these drug names…”

SELECT distinct

a.drug_name,

c.STR as shorter_drug_name,

length(c.STR) as str_length

FROM

pharm_drugbank.`m_drug2gene` a,

kb_umls.`RXNCONSO` b,

kb_umls.`RXNCONSO` c

where

a.drug_name = b.`STR`

and b.RXCUI = c.RXCUI

and not(b.STR = c.STR)

and

length(c.`STR`)<length(b.`STR`)

and not(a.Symbol is null)

order by

a.drug_name,

length(c.STR) asc

First 10 rows…
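The self-join used in this query can be tried on toy data without a MySQL server. The Python sketch below runs an equivalent query in an in-memory SQLite database; the table and column names follow the slide, but the rows, the `X1`/`X2` identifiers, and the gene symbol are invented placeholders, and SQLite stands in for MySQL as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE m_drug2gene (drug_name TEXT, Symbol TEXT);
CREATE TABLE RXNCONSO (RXCUI TEXT, STR TEXT);

-- Toy rows; the RXCUI values are placeholders, not real identifiers.
INSERT INTO m_drug2gene VALUES
  ('acetylsalicylic acid', 'GENE1'),
  ('orphan drug name', NULL);
INSERT INTO RXNCONSO VALUES
  ('X1', 'acetylsalicylic acid'),
  ('X1', 'aspirin'),
  ('X1', '2-acetoxybenzoic acid'),
  ('X2', 'orphan drug name'),
  ('X2', 'orphan');
""")

QUERY = """
SELECT DISTINCT
  a.drug_name,
  c.STR AS shorter_drug_name,
  length(c.STR) AS str_length
FROM m_drug2gene a
JOIN RXNCONSO b ON a.drug_name = b.STR   -- find the concept for the drug name
JOIN RXNCONSO c ON b.RXCUI = c.RXCUI     -- all other names of that concept
WHERE b.STR <> c.STR
  AND length(c.STR) < length(b.STR)      -- keep only shorter synonyms
  AND a.Symbol IS NOT NULL               -- skip drugs without a gene symbol
ORDER BY a.drug_name, str_length ASC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The first join resolves each drug name to its concept; the second join fans out to every other string sharing that concept, and the length filter keeps only the shorter ones.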

Page 30: Systems Immunology -- 2014

Immunological data

- ImmPort

- Stanford Data Miner

Page 31: Systems Immunology -- 2014

ImmPort: The King of Immunology

Databases

Page 32: Systems Immunology -- 2014

• Very rich metadata

• Stores data for many different types of assays (unusual)

• Uniquely curated and parsed

• Excellent database schema

• well documented on ImmPort site

• sample

Page 33: Systems Immunology -- 2014
Page 34: Systems Immunology -- 2014

ImmPort: Example 2

SELECT distinct

a.`study_accession`,

i.`name` as fcs_file,

j.`panel`,

j.`number_of_markers`

FROM

kb_immport.`study` a,

kb_immport.`arm_or_cohort` b,

kb_immport.arm_2_subject c,

kb_immport.`subject` d,

kb_immport.`biosample` e,

kb_immport.`biosample_2_expsample` f,

kb_immport.`expsample` g,

kb_immport.`expsample_2_file_info` h,

kb_immport.`file_info` i,

kb_immport.fcs_annotation j

where

a.`study_accession` = b.`study_accession`

and b.`study_accession` = e.`study_accession`

and c.`subject_accession` = d.`subject_accession`

and d.`subject_accession` = e.`subject_accession`

and a.`workspace_id` = b.`workspace_id`

and b.`workspace_id` = e.`workspace_id`

and e.`workspace_id` = g.`workspace_id`

and g.`workspace_id` = i.`workspace_id`

and i.`workspace_id` = j.`workspace_id`

and e.`biosample_accession` = f.`biosample_accession`

and f.`experiment_accession` = g.`experiment_accession`

and g.`experiment_accession` = h.`experiment_accession`

and h.`experiment_accession` = j.`experiment_accession`

and f.`expsample_accession` = g.`expsample_accession`

and g.`expsample_accession` = h.`expsample_accession`

and h.`expsample_accession` = j.`expsample_accession`

and h.`file_info_id` = i.`file_info_id`

and i.file_info_id = j.`file_info_id`

and a.`official_title` regexp 'influenz'

and i.`name` regexp '\\.fcs'

and d.species = 'Homo sapiens'

order by

a.`study_accession`,

j.`panel`

Goal: Retrieve all flow cytometry

files (FCS) and marker panels

associated with studies involving

influenza

Page 35: Systems Immunology -- 2014

ImmPort: Example 3

Goal: Retrieve HAI results for

influenza vaccinees, measured at

day 0 and 28 post-vaccination

Page 36: Systems Immunology -- 2014

Putting it all together:

Meta-analysis of human influenza vaccination data in ImmPort to evaluate changes in immunological marker frequencies from flow cytometry data using automatic gating

Maecker, H., McCoy, J.P. & Nussenblatt, R. Standardizing immunophenotyping for the human immunology project. Nature reviews Immunology 12, 191-200 (2012).

10 studies

370 subjects

~17K FCS files

Question: What changes in marker frequencies are observed during influenza immunization?

Page 37: Systems Immunology -- 2014

Stanford Data Miner: The Prince of

Immunology Databases

Page 38: Systems Immunology -- 2014

SDM: Example

Retrieve cell type frequencies from CyTOF data following influenza immunization

Page 39: Systems Immunology -- 2014

Integrated Disease Repositories

Page 40: Systems Immunology -- 2014

The Cancer Genome Atlas (TCGA)

• Lots of cancers

• Clinical data

▫ Full pathology

▫ Imaging, radiology, immunohistochemistry

• Genomics: lots!

▫ both tumor and control tissues

▫ genotyping

▫ exome sequencing

▫ miRNA sequencing

▫ RNA-Seq

Page 41: Systems Immunology -- 2014

In Conclusion…

• Huge number of public resources

▫ ultimately integratable

• Scientific power frequently lies in integrating data from multiple databases

• Data clean-up typically needed

▫ mapping to ontologies or controlled nomenclatures essential

• Domain-specific curation frequently required to structure otherwise semi-structured data

▫ e.g., GEO

• All doable given today’s plethora of free/cheap tools and compute power

Page 42: Systems Immunology -- 2014

Questions?

Page 43: Systems Immunology -- 2014
Page 44: Systems Immunology -- 2014

Coming To Terms With MySQL

• Widest usage in bioinformatics

• Free (community edition)

• Runs on everything (Linux, Win, Mac)

• Easiest relational DB (short of MS Access)

• Resources

▫ Moes (2005): Beginning MySQL; Wiley

▫ DuBois (2007): MySQL Cookbook; O’Reilly

▫ Dyer (2008): MySQL in a Nutshell; O’Reilly

Page 45: Systems Immunology -- 2014

Key R Packages

▫ RMySQL: accessing relational databases, e.g., MySQL

▫ ggplot2: hyper-powerful plotting

▫ RColorBrewer: assign color palettes to plot objects automatically, e.g., in ggplot2 plots

▫ plyr and dplyr: easy manipulation of data frames

▫ sqldf: query data frames using SQL

another easy way to manipulate data frames

▫ shiny: Web-based user interface

if you want interactive R analysis

Page 46: Systems Immunology -- 2014

Finding Drug Candidates Using Rank-Ordered,

Drug-Disease Anti-Correlation Scores

1. Compute an anti-correlation score for each drug-disease pair

2. Compute P-values of anti-correlation scores (significance testing) using distance between observed score vs. scores of 100 randomly-generated comparisons

3. Retain correlations that have FDR values below 0.05
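The three steps can be sketched end to end in Python. Several choices here are assumptions, since the slide does not specify them: the anti-correlation score is taken as the negated Pearson correlation over shared genes, the P-value is an empirical one from 100 shuffled drug signatures, and the FDR is computed with the Benjamini-Hochberg procedure. All signatures and names are toy data.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def anti_correlation(drug, disease):
    """Step 1: high when the drug signature opposes the disease signature."""
    return -pearson(drug, disease)

def empirical_p(drug, disease, n_random=100, rng=None):
    """Step 2: fraction of shuffled comparisons scoring at least as high."""
    rng = rng or random.Random(0)
    observed = anti_correlation(drug, disease)
    shuffled = list(drug)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if anti_correlation(shuffled, disease) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)

def benjamini_hochberg(p_values):
    """Step 3: convert P-values to FDR-adjusted q-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest P-value
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Toy signatures over the same 12 genes (log fold changes).
disease = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0, -2.5, -3.0]
drugs = {
    "drug_reverser": [-v for v in disease],  # exactly opposes the disease
    "drug_mimic": list(disease),             # exactly mimics the disease
}
names = list(drugs)
p_values = [empirical_p(drugs[n], disease, rng=random.Random(42)) for n in names]
q_values = benjamini_hochberg(p_values)
retained = [n for n, q in zip(names, q_values) if q < 0.05]
print(retained)
```

With only 100 random comparisons the smallest achievable P-value is 1/101, so the number of drug-disease pairs tested limits how many can pass the FDR cutoff.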