An introduction to quantitative biology and R

Post on 08-Jan-2016



David Quigley

dquigley@cc.ucsf.edu
Helen Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo

Molecular biology 20 years ago

Suzuki Med Mol Morph 2010; Oh PNAS 1996; Mao Genes Dev 2004

qualitative methods
small-scale quantitative tests


Molecular biology now

Nik-Zainal Cell 2012; Fullwood Nature 2009; CGAN Nature 2012

qualitative methods
small-scale quantitative tests
large-scale quantitative analysis
  microarrays, *seq methods, cell phenotype screens...


Hypothesis-generating:
  Normalize breast tumor expression data (three cohorts)
  Call genotypes from SNP arrays
  Calculate association between genotype and expression genome-wide (eQTL)
  Identify an interesting candidate
  Validate eQTL in independent cohort

Hypothesis-driven:
  Methylation analysis at MRPS30 promoter
  In vitro (two cell lines):
    ChIP-PCR +/- estrogen
    qPCR
    sequencing

Many studies use both approaches
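
The eQTL step listed above (association between genotype and expression) can be sketched for a single SNP and a single gene. This is a minimal illustration on simulated data, not the analysis from the study: the variable names, the sample size, and the additive 0/1/2 genotype coding are all assumptions.

```r
# Simulated example: one SNP, one gene, 100 samples (all values are made up)
set.seed(5)
n.samples = 100
genotype = sample(0:2, n.samples, replace = TRUE)        # copies of the minor allele
gene.expression = 7 + 0.8 * genotype + rnorm(n.samples)  # expression depends on genotype

# Fit a linear model of expression on genotype and pull out the genotype p-value
fit = lm(gene.expression ~ genotype)
eqtl.p.value = summary(fit)$coefficients["genotype", "Pr(>|t|)"]
print(eqtl.p.value)
```

A genome-wide eQTL scan repeats this test for many SNP and gene pairs, which is why multiple-testing correction matters.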


Common quantitative techniques

Gene Expression
  transcription
  splicing
  methylation
  protein binding (ChIP-seq)

Genomics and Genetics
  de novo assembly
  DNA copy number (CNV and tumors)
  germline variant analysis
  tumor variant analysis
    SNPs, indels, translocation
  analysis of tumor clonality

Challenges

requires statistical sophistication
  in study design
  in interpretation

many data points
  1,000 to 1,000,000 measurements per sample
  many false positives which look like great stories

software becomes part of the experiment
divide between engineering, biology
  culture & thinking
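
To see why many measurements produce many false positives, here is a small simulation, offered as a sketch rather than a recipe. Every "gene" is pure noise, so any test that passes p < 0.05 is a false positive. The number of tests, the group sizes, and the choice of Benjamini-Hochberg correction are assumptions made for illustration.

```r
# 5,000 t-tests where the two groups never truly differ
set.seed(1)
n.tests = 5000
p.values = replicate(n.tests, t.test(rnorm(10), rnorm(10))$p.value)

# Roughly 5% of pure-noise tests pass an uncorrected threshold of 0.05
n.nominal = sum(p.values < 0.05)

# Correcting for multiple testing (Benjamini-Hochberg) removes most or all of them
p.adjusted = p.adjust(p.values, method = "BH")
n.after.correction = sum(p.adjusted < 0.05)

print(c(uncorrected = n.nominal, corrected = n.after.correction))
```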


Schillebeeckx Nature Biotech. 2013

Wet lab and quantitative skills: better job prospects


What approaches are used to analyze quantitative data?

Choosing a tool

Cost
Learning curve
Ease of use
Flexibility (closed to open-ended)
Software ecosystem (none to extensive)


Traditional programming languages
Python, C++, Java, others

can solve any computable problem
creates the fastest tools
free
requires programming expertise
  complex to write and test
  high effort


Specialized single-purpose programs

command line tools
  academic research
  type commands at a prompt or run scripts
  PLINK, bowtie, GATK, bedtools

GUI (point and click)
  commercial software for a vendor's platform
  slick, opaque, hard/impossible to automate


Commercial statistics programs

STATA, SPSS, GraphPad, others

1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report

May have a built-in language
Very mature tools for traditional biostatistics
Not free

Web-based tools

Galaxy
string together pre-defined analysis steps
very easy to use
reproducible


Using R is like writing and using software

Traditionally, biologists did not do this.

R was written by statisticians to be a free replica of another language called “S”.

R: a “software environment”


Flexible, open-ended, open-source

Large library of packages
  package: easy-to-use published methods
  like a Qiagen kit

Free!

Why is R popular?


You use R by typing at the prompt

There is no pull-down menu of statistical commands


What’s good about this approach?

chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers


What’s hard about this approach?

hard to get started
cryptic commands
built-in help is amusingly unhelpful


packages: collections of R functions
collection of R code that solves a specific task
  limma: microarray normalization and analysis
  samr: differential expression
  impute: dealing with missing data

downloaded for free from a central repository


Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
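
As a hedged sketch of how packages are obtained in practice: the snippet below only checks whether two of the packages named earlier are already installed, and shows the install commands as comments so that nothing is downloaded by accident. Installing through the `BiocManager` package is the current Bioconductor convention and may differ from what was used when these slides were written.

```r
# Check which packages are already installed, without loading or installing anything
wanted = c("limma", "impute")
installed = sapply(wanted, requireNamespace, quietly = TRUE)
print(installed)

# To install missing Bioconductor packages, one would run (not executed here):
#   install.packages("BiocManager")
#   BiocManager::install(c("limma", "impute"))

# Once installed, a package is loaded into the session with library(), e.g. library(limma)
```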


Learning R data types by comparing them to Excel spreadsheets


Comparing Excel and R

Excel
  Easy tasks are easy
  Non-trivial tasks impossible or expensive
  No paper trail
  Mangles gene names
  Plots look terrible

R
  Easy jobs are hard at first
  Non-trivial things are possible
  Easy to make a paper trail
  Biostatistics researchers publish tools in R
  Can create publication-ready plots


Organizing data in Excel

Each subject has a row.
Each column has a feature of your subjects.


R calls the data points variables

variables
numbers and characters (letters, words)

numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”
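
Typed at the R prompt, the values above might look like the following. The variable names `age`, `name`, and `coloring` are chosen here for illustration and do not come from the slides.

```r
# Store a number and two character strings in named variables
age = 2.6
name = "Flopsy"
coloring = "white, brown paws"

# class() reports what kind of value a variable holds
class(age)       # "numeric"
class(name)      # "character"
```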


R calls the columns vectors

vectors
ordered collections of a variable

name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
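
In R syntax, the two vectors above can be built with `c()`, which combines values into a vector. A minimal sketch:

```r
# One vector per spreadsheet column
name = c("Flopsy", "Mopsy", "Cottontail", "Peter")
age = c(2.5, 2.6, 2.5, 4)

length(name)   # number of elements: 4
age[4]         # fourth element: 4
mean(age)      # functions operate on the whole vector at once
```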


R calls the data set a data frame

data frame
a list of vectors (columns) that have names
elements can be read and written by row & column
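
A minimal sketch of building a data frame from named vectors, reusing the rabbit example. The data frame name `rabbits` is an assumption made for illustration.

```r
# Each named argument becomes a column; each position becomes a row
rabbits = data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE   # keep names as plain character strings
)

dim(rabbits)     # rows and columns
names(rabbits)   # column names
rabbits$age      # pull out one column as a vector
```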


I can slice and dice the data frame
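
The slide's own examples were not preserved in this transcript, so the following is a sketch of common ways to slice a data frame with square brackets, written as `[row, column]`. The `rabbits` data frame is the illustrative one built from the vectors shown earlier.

```r
rabbits = data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

rabbits[2, ]          # the second row, all columns
rabbits[, "age"]      # the age column, all rows
rabbits[4, "age"]     # one element, by row and column

# Rows can be selected with a logical test on a column
older = rabbits[rabbits$age > 2.5, ]

# Elements can be written as well as read
rabbits[4, "age"] = 4.5
```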


Tell R to do things using functions
function_name( details about how to do it )

generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
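
The call described above can be written as follows. The variable name `my.sequence` is an assumption; `seq` and its `from`, `to`, and `by` parameters are part of base R.

```r
# Generate 1.0, 1.5, 2.0, ... up to 5.0
my.sequence = seq(from = 1, to = 5, by = 0.5)
print(my.sequence)

# Parameters can also be matched by position instead of by name
same.sequence = seq(1, 5, 0.5)
```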


Tell R to do things using functions
function_name( details about how to do it )

report the mean of my.data. Result of one function is fed into another one.
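
The original slide's code is not in this transcript. As a sketch, assuming `my.data` holds the sequence from the previous example, feeding the result of one function into another looks like this:

```r
# Store the output of seq() and pass it to mean()
my.data = seq(from = 1, to = 5, by = 0.5)
mean(my.data)                              # 3

# The same thing in one step: the result of seq() is fed directly into mean()
mean(seq(from = 1, to = 5, by = 0.5))      # 3
```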


Tell R to do things using functions
function_name( details about how to do it )

define a new function that adds 2 to whatever it’s passed

compare to original value of my.data
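
A minimal sketch of defining such a function. The name `add.two` and the contents of `my.data` are assumptions, since the slide's own code was not preserved.

```r
# Define a function that adds 2 to whatever it is passed
add.two = function(x) {
  x + 2
}

my.data = c(1, 2, 3)
increased = add.two(my.data)   # 3 4 5

# The original variable is unchanged: functions return new values
my.data                        # 1 2 3
```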


Walk through a straightforward analysis

Primary data from METABRIC study

gene expression
TP53 sequence

1,400 samples from 5 hospitals

Is there an association between breast cancer subtype and TP53 mutation?


Tasks

Normalize data
  batch effects
  unwanted inter-sample variation

Identify outliers

associations between p53 and subtype


Quantile Normalization (limma)

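Force every array to have the same distribution of expression intensities

> library(limma)

> raw = read.table('raw_extract.txt', ...)

> # limma's quantile normalization function is normalizeQuantiles(); it expects a numeric matrix
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )

> normalized = log2( raw.normalized )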
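
To show what quantile normalization does to the numbers, here is a base-R sketch on a tiny made-up matrix. It is a simplified illustration of the idea, not limma's implementation. The function name `quantile.normalize` is hypothetical, and ties are broken by order of appearance rather than averaged.

```r
# Rows are probes, columns are arrays (values are made up)
raw = matrix(c(5, 2, 3, 4,
               4, 1, 4, 2,
               3, 4, 6, 8), nrow = 4)

quantile.normalize = function(x) {
  # Rank each value within its own array (column)
  ranks = apply(x, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank
  reference = rowMeans(apply(x, 2, sort))
  # Replace every value with the reference value for its rank
  normalized = apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) = dimnames(x)
  normalized
}

raw.normalized = quantile.normalize(raw)
print(raw.normalized)
```

After normalization every column contains exactly the same set of values, in an order determined by that array's original ranking.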


Identify batch effects in microarrays

Principal Components Analysis
Identify strongest variation in a matrix

[Figure: scatter plot of samples; axes: gene 1 and gene 2]


Identify batch effects in microarrays

Principal Components Analysis
Identify axes of maximal variation in a matrix

[Figure: scatter plot of samples; axes: gene 1 and gene 2; labeled with the first principal component]


Identify batch effects in microarrays

Principal Components Analysis
Identify strongest variation in a matrix

[Figure: two scatter plots with axes gene 1 and gene 2; samples labeled group A and group B]


PCA identifies a batch effect

[Figure: samples plotted by first principal component (x axis) and second principal component (y axis); hospital 3 shown in yellow]

> my.pca = prcomp( t( expression.data ) )

> plot( my.pca$x[,1], my.pca$x[,2], ... )
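
A self-contained sketch of the same idea on simulated data, since the METABRIC matrix is not available here. The matrix size, the three simulated hospitals, and the size of the artificial shift are all assumptions made for illustration.

```r
# Simulate 200 genes x 30 samples of noise, 10 samples per hospital
set.seed(1)
n.genes = 200
n.samples = 30
expression.data = matrix(rnorm(n.genes * n.samples), nrow = n.genes)
hospital = rep(c(1, 2, 3), each = 10)

# Add an artificial batch effect: shift half the genes in hospital 3
from.hospital.3 = hospital == 3
expression.data[1:100, from.hospital.3] = expression.data[1:100, from.hospital.3] + 2

# prcomp() expects samples in rows, so transpose the genes x samples matrix
my.pca = prcomp(t(expression.data))
pc1 = my.pca$x[, 1]
pc2 = my.pca$x[, 2]

# Color each sample by hospital; the shifted batch separates along PC1
plot(pc1, pc2, col = c("blue", "red", "gold")[hospital], pch = 19,
     xlab = "first principal component", ylab = "second principal component")
```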


batch correction reduces bias (ComBat)

[Figure: samples plotted by first principal component (x axis) and second principal component (y axis) after batch correction]

ComBat package reduces user-defined batch effects
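
The sketch below is not ComBat. It swaps in a much simpler technique, per-gene mean-centering within each batch, only to illustrate what "reducing a user-defined batch effect" means. The function name `center.by.batch` and the simulated data are assumptions. Like any batch correction, this also removes real biology if biology is confounded with batch.

```r
# Simulate 50 genes x 12 samples in three batches; batch C is shifted upward
set.seed(2)
expression.data = matrix(rnorm(50 * 12), nrow = 50)
batch = rep(c("A", "B", "C"), each = 4)
expression.data[, batch == "C"] = expression.data[, batch == "C"] + 3

center.by.batch = function(x, batch) {
  corrected = x
  for (b in unique(batch)) {
    in.batch = batch == b
    # Subtract each gene's mean within this batch
    batch.means = rowMeans(x[, in.batch, drop = FALSE])
    corrected[, in.batch] = x[, in.batch, drop = FALSE] - batch.means
  }
  # Add back each gene's overall mean so values stay on the original scale
  corrected + rowMeans(x)
}

corrected = center.by.batch(expression.data, batch)
```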


Molecular subtypes of breast carcinoma, defined by gene expression

Luminal A, N=507
Luminal B, N=379
Her2, N=161
Basal, N=234

ER status

> sa = read.table('patients.txt', ...)

> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )

(convert counts to percentages)

> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
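
A self-contained sketch of the counting and plotting steps, using a tiny made-up patient table in place of `patients.txt`. The eight rows and the use of `prop.table` for the percentage conversion are assumptions. The column names follow the slide.

```r
# Made-up sample annotation: one row per tumor
sa = data.frame(
  ER.status = c("negative", "positive", "positive", "positive",
                "negative", "negative", "positive", "negative"),
  PAM50Subtype = c("Basal", "LumA", "LumA", "LumB",
                   "Basal", "Her2", "Her2", "Basal"),
  stringsAsFactors = FALSE
)

# Cross-tabulate ER status (rows) against subtype (columns)
tumor.counts = table(sa$ER.status, sa$PAM50Subtype)

# Convert counts to percentages within each subtype (margin = 2 means by column)
tumor.percent = 100 * prop.table(tumor.counts, margin = 2)

# One stacked bar per subtype
barplot(tumor.percent, col = c("red", "green"),
        legend.text = rownames(tumor.percent), ylab = "percent of tumors")
```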


Find interactions: TP53 and subtype

Fit a linear model:
> fitted.model = lm( dependent ~ independent )

Perform Analysis of Variance:
> anova( fitted.model )

general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )

18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
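
A runnable sketch of the interaction test for a single simulated gene. The sample size, the effect planted in TP53-WT Basal tumors, and all values are made up for illustration. In the formula, `PAM * TP53` expands to both main effects plus their interaction.

```r
# Simulate 80 tumors: 4 subtypes x 2 TP53 states, 10 tumors per combination
set.seed(3)
PAM = factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 20))
TP53 = factor(rep(c("mutant", "WT"), times = 40))
gene.expression = rnorm(80, mean = 8)

# Plant an interaction: higher expression only in TP53-WT Basal tumors
is.wt.basal = PAM == "Basal" & TP53 == "WT"
gene.expression[is.wt.basal] = gene.expression[is.wt.basal] + 3

fitted.model = lm(gene.expression ~ PAM * TP53)
anova.table = anova(fitted.model)
print(anova.table)

# The row named "PAM:TP53" holds the test for the interaction
interaction.p = anova.table["PAM:TP53", "Pr(>F)"]
```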


Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing result.

> n_genes = 18000
> result = rep( 0, n_genes )

> for( counter in 1:n_genes ){ result[counter] = anova(...) }

sort results
identify significant interaction

repeat 18,000 times
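
The loop can be sketched end to end on a small simulated matrix (50 genes instead of 18,000). Storing the interaction p-value from each ANOVA, and adjusting with Benjamini-Hochberg afterwards, are assumptions about what the elided `anova(...)` call kept. Gene 7 is given a planted interaction so there is something to find.

```r
set.seed(4)
n_genes = 50
PAM = factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 20))
TP53 = factor(rep(c("mutant", "WT"), times = 40))

# Rows are genes, columns are the 80 tumors
expression.data = matrix(rnorm(n_genes * 80, mean = 8), nrow = n_genes)
is.wt.basal = PAM == "Basal" & TP53 == "WT"
expression.data[7, is.wt.basal] = expression.data[7, is.wt.basal] + 4

# Loop through each gene and store the interaction p-value
result = rep(NA_real_, n_genes)
for (counter in 1:n_genes) {
  fitted.model = lm(expression.data[counter, ] ~ PAM * TP53)
  result[counter] = anova(fitted.model)["PAM:TP53", "Pr(>F)"]
}

# Sort genes by p-value and correct for the number of tests performed
top.genes = order(result)
adjusted = p.adjust(result, method = "BH")
print(head(data.frame(gene = top.genes, p = result[top.genes],
                      adjusted.p = adjusted[top.genes])))
```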


Immune infiltration in TP53-WT Basal

[Figure: two panels of log2 expression (y axis) by infiltration: absent, mild, severe (x axis); one panel is labeled CD3E]


Does p53 have a role in immune surveillance?

Next steps: getting help and learning more


online forums: expert help for free

all of bioinformatics

Nextgen sequencing

statistics


Library classes and information

Formal courses (BMI, Biostatistics)

Cores (Computational Biology, Genomics)

QGDG monthly methods discussion group

UCSF resources


Online classes and blogs

Free courses on data analysis
  http://jhudatascience.org
  simplystatistics.org
  Coursera etc...

Good tutorials on sequence analysis
  http://evomics.org/learning


Reproducible research? You mean, there’s another kind?

detailed protocols (not printed in the methods)

extensive optimization

reagents that might be unique or hard to get

techniques that require years of experience

Replicate a wet lab experiment


published algorithms (if novel)

published source code
  sometimes “available from the authors”

well-specified input and deterministic output

no reagents
  Okay, maybe a supercomputer or cloud

How hard could it be?

Replicate a dry lab experiment


Bookkeeping errors
  Transposed column headers
  Out-of-date/changed annotations
  Off-by-one
  Misunderstood sample labels

Batch effects

Cryptic cohort stratification

Inappropriate analytical methods

Many chances to make honest errors
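
Some of the bookkeeping errors listed above can be caught with explicit checks in the analysis script. The sketch below is one possible guard, assuming expression data with sample IDs as column names and a separate annotation table. All names and values are made up.

```r
# Expression matrix: genes in rows, samples in columns
set.seed(7)
expression.data = matrix(rnorm(12), nrow = 3,
                         dimnames = list(c("geneA", "geneB", "geneC"),
                                         c("s1", "s2", "s3", "s4")))

# Annotation table that arrives in a different sample order
sample.info = data.frame(sample.id = c("s3", "s1", "s4", "s2"),
                         subtype = c("Basal", "LumA", "Her2", "LumB"),
                         stringsAsFactors = FALSE)

# Stop immediately if the two files do not describe the same samples
stopifnot(setequal(colnames(expression.data), sample.info$sample.id))

# Reorder the annotation to match the expression columns, then verify
sample.info = sample.info[match(colnames(expression.data), sample.info$sample.id), ]
stopifnot(identical(colnames(expression.data), sample.info$sample.id))
```

Matching on IDs, rather than trusting that two files share the same row order, avoids silently pairing each sample with another sample's label.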


eQTL differences between ethnic cohorts

Claim: Many genetic loci associated with gene expression differ between Asian and western people


poor study design: batch processing effects

2003-2004: processed European samples
2005-2006: processed Asian samples

Processing year perfectly confounded with ethnicity.
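
A small simulation shows why perfect confounding cannot be untangled after the fact. The sample counts and expression values are made up. Only the design, in which every European sample comes from one processing period and every Asian sample from the other, mirrors the slide.

```r
set.seed(6)
ethnicity = factor(rep(c("European", "Asian"), each = 6))
year = factor(rep(c("2003-2004", "2005-2006"), each = 6))

# The cross-tabulation has empty cells: no group was processed in both periods
design = table(ethnicity, year)
print(design)

# Simulate expression that differs ONLY because of processing year
gene.expression = rnorm(12) + ifelse(year == "2005-2006", 1, 0)

# The model cannot estimate both effects: one coefficient comes back NA
fit = lm(gene.expression ~ ethnicity + year)
print(coef(fit))
```

Checking the design table before processing any samples is the cheap fix. Randomizing or balancing groups across batches makes the two effects separable.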
