An introduction to quantitative biology and R
![Page 1: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/1.jpg)
An introduction to quantitative biology and R
David Quigley
[email protected]
Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo
![Page 2: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/2.jpg)
Molecular biology 20 years ago
Suzuki, Med Mol Morph 2010; Oh, PNAS 1996; Mao, Genes Dev 2004
qualitative methods
small-scale quantitative tests
David Quigley [email protected]
![Page 3: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/3.jpg)
Molecular biology now
Nik-Zainal, Cell 2012; Fullwood, Nature 2009; CGAN, Nature 2012
qualitative methods
small-scale quantitative tests
large-scale quantitative analysis: microarrays, *seq methods, cell phenotype screens...
![Page 4: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/4.jpg)
Hypothesis-generating:
Normalize breast tumor expression data (three cohorts)
Call genotypes from SNP arrays
Calculate association between genotype and expression genome-wide (eQTL)
Identify an interesting candidate
Validate eQTL in independent cohort
Hypothesis-driven:
Methylation analysis at MRPS30 promoter
In vitro (two cell lines): ChIP-PCR +/- estrogen, qPCR, sequencing
Many studies use both approaches
![Page 5: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/5.jpg)
Common quantitative techniques
Gene expression: transcription; splicing; methylation; protein binding (ChIP-seq)
Genomics and genetics: de novo assembly; DNA copy number (CNV and tumors); germline variant analysis; tumor variant analysis (SNPs, indels, translocations); analysis of tumor clonality
![Page 6: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/6.jpg)
Challenges
requires statistical sophistication: in study design, in interpretation
many data points: 1,000 to 1,000,000 measurements per sample; many false positives that look like great stories
software becomes part of the experiment
divide between engineering and biology: culture & thinking
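The point about false positives can be made concrete with a small simulation. The sketch below is illustrative only: the data are pure noise, the group labels and sample sizes are made up, and the t-test and Benjamini-Hochberg adjustment are one reasonable choice among several.

```r
# Simulate 1,000 measurements on 20 samples where nothing truly differs
set.seed(2)
n_measurements = 1000                     # low end of the range on the slide
group = rep(c("A", "B"), each = 10)       # two hypothetical groups of 10
noise = matrix(rnorm(n_measurements * 20), nrow = n_measurements)

# One t-test per measurement (row)
p.values = apply(noise, 1, function(x) {
    t.test(x[group == "A"], x[group == "B"])$p.value
})

# About 5% of tests pass P < 0.05 by chance alone
n.raw = sum(p.values < 0.05)
n.raw

# Adjusting for multiple testing removes most or all of them
n.adjusted = sum(p.adjust(p.values, method = "BH") < 0.05)
n.adjusted
```

With 1,000,000 measurements instead of 1,000, the same 5% rate would produce tens of thousands of chance "hits", which is why unadjusted P values are hard to trust at this scale.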
![Page 7: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/7.jpg)
Schillebeeckx Nature Biotech. 2013
Wet lab and quantitative skills: better job prospects
![Page 8: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/8.jpg)
What approaches are used to analyze quantitative data?
![Page 9: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/9.jpg)
Choosing a tool
Cost
Learning curve
Ease of use
Flexibility (closed to open-ended)
Software ecosystem (none to extensive)
![Page 10: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/10.jpg)
Traditional programming languages
Python, C++, Java, others
can solve any computable problem
creates the fastest tools
free
requires programming expertise
complex to write and test
high effort
![Page 11: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/11.jpg)
Specialized single-purpose programs
command line tools: academic research; type commands at a prompt or run scripts; PLINK, bowtie, GATK, bedtools
GUI (point and click): commercial software for a vendor’s platform; slick, opaque, hard/impossible to automate
![Page 12: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/12.jpg)
Commercial statistics programs
STATA, SPSS, GraphPad, others
1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report
May have a built-in language
Very mature tools for traditional biostatistics
Not free
![Page 13: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/13.jpg)
Web-based tools
Galaxy: string together pre-defined analysis steps
very easy to use
reproducible
![Page 14: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/14.jpg)
R: a “software environment”
Using R is like writing and using software.
Traditionally, biologists did not do this.
R was written by statisticians to be a free replica of another language called “S”.
![Page 15: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/15.jpg)
Why is R popular?
Flexible, open-ended, open-source
Large library of packages
package: easy-to-use published methods, like a Qiagen kit
Free!
![Page 16: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/16.jpg)
You use R by typing at the prompt
There is no pull-down menu of statistical commands
![Page 17: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/17.jpg)
What’s good about this approach?
chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers
![Page 18: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/18.jpg)
What’s hard about this approach?
hard to get started
cryptic commands
built-in help is amusingly unhelpful
![Page 19: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/19.jpg)
packages: collections of R functions
a package is a collection of R code that solves a specific task
limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data
downloaded for free from a central repository
![Page 20: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/20.jpg)
Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
![Page 21: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/21.jpg)
Learning R data types by comparing them to Excel spreadsheets
![Page 22: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/22.jpg)
![Page 23: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/23.jpg)
Comparing Excel and R
Excel:
Easy tasks are easy
Non-trivial tasks impossible or expensive
No paper trail
Mangles gene names
Plots look terrible
R:
Easy jobs are hard at first
Non-trivial things are possible
Easy to make a paper trail
Biostatistics researchers publish tools in R
Can create publication-ready plots
![Page 24: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/24.jpg)
Organizing data in Excel
Each subject has a row.
Each column has a feature of your subjects.
![Page 25: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/25.jpg)
R calls the data points variables
variables: numbers and characters (letters, words)
numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”
![Page 26: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/26.jpg)
R calls the columns vectors
vectors: ordered collections of values of the same type
name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
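A minimal sketch of variables and vectors as they would be typed at the R prompt, using the rabbit names and ages from the slide. The variable names (`age`, `coat`, `rabbit.name`, `rabbit.age`) are made up for illustration.

```r
# A single value is stored in a variable by assignment
age = 2.6                      # a number
coat = "white, brown paws"     # a character string

# c() combines values of the same type into a vector
rabbit.name = c("Flopsy", "Mopsy", "Cottontail", "Peter")
rabbit.age = c(2.5, 2.6, 2.5, 4)

# Elements are read by position, counting from 1
rabbit.name[2]        # the second name
length(rabbit.age)    # number of elements in the vector
mean(rabbit.age)      # many functions operate on a whole vector at once
```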
![Page 27: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/27.jpg)
R calls the data set a data frame
data frame: a list of vectors (columns) that have names
elements can be read and written by row & column
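As a sketch, the rabbit columns can be combined into a data frame and then read or written by row and column. The data frame name `rabbits` is an assumption made for this example.

```r
# Build a data frame from named vectors (one vector per column)
rabbits = data.frame(
    name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
    age = c(2.5, 2.6, 2.5, 4),
    stringsAsFactors = FALSE
)

rabbits$age             # read a whole column by name
rabbits[4, "name"]      # read one element by row and column
rabbits[4, "age"] = 5   # write one element by row and column
dim(rabbits)            # number of rows, then number of columns
```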
![Page 29: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/29.jpg)
Tell R to do things using functions
function_name( details about how to do it )
generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
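The call described on this slide would look roughly like this; storing the result in `my.data` is an assumption based on the name used on the following slides.

```r
# Generate the sequence 1, 1.5, 2, ..., 5 by naming each parameter
my.data = seq(from = 1, to = 5, by = 0.5)
my.data
length(my.data)
```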
![Page 30: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/30.jpg)
Tell R to do things using functions
function_name( details about how to do it )
report the mean of my.data
Result of one function is fed into another one.
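A short sketch of both ideas: calling a function on a stored value, and nesting calls so that one function's result becomes another's input. The contents of `my.data` are assumed to be the sequence from 1 to 5 by 0.5.

```r
# my.data is defined here so the example runs on its own
my.data = seq(from = 1, to = 5, by = 0.5)

# Call a function on a stored value
mean(my.data)

# Nest calls: the result of seq() is fed directly into mean()
mean(seq(from = 1, to = 5, by = 0.5))

# Nesting can go further: round the mean of the square roots
round(mean(sqrt(my.data)), digits = 2)
```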
![Page 31: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/31.jpg)
Tell R to do things using functions
function_name( details about how to do it )
define a new function that adds 2 to whatever it’s passed
compare to original value of my.data
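One way to write such a function is sketched below. The names `add.two` and `shifted` are hypothetical, chosen only for this example.

```r
# Define a function that adds 2 to whatever it is passed
add.two = function(x) {
    x + 2
}

my.data = seq(from = 1, to = 5, by = 0.5)
shifted = add.two(my.data)

shifted    # every element is 2 larger
my.data    # the original value is unchanged
```

The function returns a new vector rather than modifying its input, which is why `my.data` still holds the original sequence.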
![Page 32: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/32.jpg)
Walk through a straightforward analysis
![Page 33: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/33.jpg)
Primary data from METABRIC study
gene expression
TP53 sequence
1,400 samples from 5 hospitals
Is there an association between breast cancer subtype and TP53 mutation?
![Page 34: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/34.jpg)
Tasks
Normalize data: batch effects, unwanted inter-sample variation
Identify outliers
Find associations between p53 and subtype
![Page 35: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/35.jpg)
Quantile Normalization (limma)
Force every array to have the same distribution of expression intensities
> library(limma)
> raw = read.table('raw_extract.txt', ...)
> raw.normalized = normalizeBetweenArrays( as.matrix( raw ), method='quantile' )
> normalized = log2( raw.normalized )
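To show what "the same distribution" means, here is a base-R sketch of quantile normalization on simulated data. It is not the limma implementation: the function name `quantile.normalize` and the three simulated arrays are assumptions, and ties are broken by order of appearance for simplicity.

```r
# Sketch of quantile normalization; rows are probes, columns are arrays
quantile.normalize = function(m) {
    ranks = apply(m, 2, rank, ties.method = "first")   # rank within each array
    sorted = apply(m, 2, sort)                         # sort each array
    reference = rowMeans(sorted)                       # mean of each quantile
    normalized = apply(ranks, 2, function(r) reference[r])
    dimnames(normalized) = dimnames(m)
    normalized
}

# Three simulated arrays with deliberately different intensity distributions
set.seed(1)
raw = cbind(array1 = rexp(100, rate = 1),
            array2 = rexp(100, rate = 0.5),
            array3 = rexp(100, rate = 2))
normalized = quantile.normalize(raw)

# Before: quantiles differ between arrays. After: they are identical.
apply(raw, 2, quantile)
apply(normalized, 2, quantile)
```

Each value is replaced by the average of the values holding the same rank across all arrays, so the ordering within an array is preserved while the distributions are forced to match.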
![Page 36: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/36.jpg)
Identify batch effects in microarrays
Principal Components Analysis
Identify strongest variation in a matrix
[Figure: scatter plot of gene 1 against gene 2]
![Page 37: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/37.jpg)
Identify batch effects in microarrays
Principal Components Analysis
Identify axes of maximal variation in a matrix
[Figure: scatter plot of gene 1 against gene 2, with the first principal component drawn through the points]
![Page 38: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/38.jpg)
Identify batch effects in microarrays
Principal Components Analysis
Identify strongest variation in a matrix
[Figure: two scatter plots of gene 1 against gene 2; samples are labelled group A and group B]
![Page 39: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/39.jpg)
PCA identifies a batch effect
[Figure: samples plotted on the first principal component (horizontal axis) against the second principal component (vertical axis); hospital 3 is shown in yellow]
> my.pca = prcomp( t( expression.data ) )
> plot( my.pca, ... )
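The following sketch reproduces the idea on simulated data, since the study data are not available here. The matrix dimensions, the hospital labels, and the size of the planted shift are all made up; only `prcomp`, `t`, and the names `my.pca` and `expression.data` come from the slide.

```r
# Simulated data: 200 genes (rows) x 30 samples (columns), three hospitals
set.seed(42)
n.genes = 200
hospital = rep(c("hospital1", "hospital2", "hospital3"), each = 10)
expression.data = matrix(rnorm(n.genes * 30), nrow = n.genes)

# Plant a batch effect: hospital 3 is shifted upward in half of the genes
shifted.genes = 1:100
from.h3 = hospital == "hospital3"
expression.data[shifted.genes, from.h3] = expression.data[shifted.genes, from.h3] + 3

# prcomp() expects samples in rows, so transpose the genes x samples matrix
my.pca = prcomp(t(expression.data))

# Proportion of variance explained by the first two components
summary(my.pca)$importance["Proportion of Variance", 1:2]

# Mean position of each hospital along the first principal component
pc1.means = tapply(my.pca$x[, 1], hospital, mean)
pc1.means
```

The per-sample coordinates are stored in `my.pca$x`, one column per component, so a scatter plot of the first column against the second, colored by hospital, would show the shifted batch as a separate cluster.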
![Page 40: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/40.jpg)
Batch correction reduces bias (ComBat)
[Figure: samples plotted on the first principal component (horizontal axis) against the second principal component (vertical axis) after batch correction]
ComBat package reduces user-defined batch effects
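ComBat itself is not reproduced here. As a deliberately simplified stand-in, the sketch below centers each gene within each user-defined batch, which removes only a constant per-batch shift; ComBat fits an empirical Bayes model and does more than this. The function name `center.by.batch` and the simulated data are assumptions.

```r
# Simplified stand-in for batch correction (not ComBat):
# center every gene within each batch, then restore the gene's overall mean
center.by.batch = function(m, batch) {
    corrected = m
    for (b in unique(batch)) {
        in.batch = batch == b
        batch.means = rowMeans(m[, in.batch, drop = FALSE])
        corrected[, in.batch] = m[, in.batch, drop = FALSE] - batch.means
    }
    corrected + rowMeans(m)
}

# Simulated data: 20 genes x 10 samples, batch B shifted upward by 2
set.seed(7)
batch = rep(c("A", "B"), each = 5)
expr = matrix(rnorm(20 * 10), nrow = 20)
expr[, batch == "B"] = expr[, batch == "B"] + 2

corrected = center.by.batch(expr, batch)

# Largest remaining difference in per-gene means between the two batches
max.before = max(abs(rowMeans(expr[, batch == "A"]) - rowMeans(expr[, batch == "B"])))
max.after = max(abs(rowMeans(corrected[, batch == "A"]) - rowMeans(corrected[, batch == "B"])))
c(before = max.before, after = max.after)
```

Any correction of this kind also removes real biological differences that happen to line up with the batch labels, so it is only safe when the groups of interest are spread across batches.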
![Page 41: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/41.jpg)
Molecular subtypes of breast carcinoma, defined by gene expression
Luminal A: N=507
Luminal B: N=379
Her2: N=161
Basal: N=234
ER status
> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
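The "convert counts to percentages" step is not shown on the slide. One way to do it, sketched on simulated annotation, is `prop.table`. The column names follow the slide; the 60 simulated patients and the name `tumor.percent` are made up for illustration.

```r
# Simulated patient annotation; values are random, not study data
set.seed(3)
sa = data.frame(
    ER.status = sample(c("negative", "positive"), 60, replace = TRUE),
    PAM50Subtype = sample(c("LumA", "LumB", "Her2", "Basal"), 60, replace = TRUE)
)

# Cross-tabulate ER status (rows) against subtype (columns)
tumor.counts = table(sa$ER.status, sa$PAM50Subtype)
tumor.counts

# Convert counts to percentages within each subtype (margin = 2 means columns)
tumor.percent = 100 * prop.table(tumor.counts, margin = 2)
round(tumor.percent, 1)
```

Passing `tumor.percent` to `barplot` with `col = c("red", "green")` would then draw one stacked bar per subtype.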
![Page 42: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/42.jpg)
Find interactions: TP53 and subtype
Fit a linear model:
> fitted.model = lm( dependent ~ independent )
Perform Analysis of Variance:
> anova( fitted.model )
general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )
18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
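A sketch of this analysis for a single simulated gene follows. The sample size, the effect size, and the planted effect in TP53-WT Basal tumors are assumptions chosen so that the interaction term is visible; the variable names match the slide.

```r
# Simulated data for one gene; all values are made up for illustration
set.seed(11)
n = 200
PAM = factor(sample(c("LumA", "LumB", "Her2", "Basal"), n, replace = TRUE))
TP53 = factor(sample(c("WT", "mutant"), n, replace = TRUE))

# Expression is raised only in TP53-WT Basal tumors: an interaction effect
gene.expression = rnorm(n, mean = 8) + 2 * (PAM == "Basal" & TP53 == "WT")

# PAM * TP53 expands to PAM + TP53 + PAM:TP53
fitted.model = lm(gene.expression ~ PAM * TP53)
result = anova(fitted.model)
result

# P value for the interaction term alone
interaction.p = result["PAM:TP53", "Pr(>F)"]
interaction.p
```

The `PAM:TP53` row of the ANOVA table asks whether the effect of TP53 status on expression differs between subtypes, which is the question posed on this slide.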
![Page 43: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/43.jpg)
Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing result.
> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){ result[counter] = anova(...) }
repeat 18,000 times
sort results
identify significant interaction
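A runnable sketch of the loop on simulated data is shown below. It uses 50 genes rather than 18,000 to keep it fast, stores the interaction P value for each gene, and plants one true interaction in `gene1`. The choice to store the P value, the Benjamini-Hochberg adjustment, and all simulated values are assumptions, since the slide elides the body of the `anova` call.

```r
set.seed(5)
n_genes = 50      # small for illustration; the slide uses 18,000
n_samples = 120
PAM = factor(sample(c("LumA", "LumB", "Her2", "Basal"), n_samples, replace = TRUE))
TP53 = factor(sample(c("WT", "mutant"), n_samples, replace = TRUE))
expression.data = matrix(rnorm(n_genes * n_samples, mean = 8), nrow = n_genes,
                         dimnames = list(paste0("gene", 1:n_genes), NULL))

# Plant an interaction in the first gene only
expression.data[1, ] = expression.data[1, ] + 3 * (PAM == "Basal" & TP53 == "WT")

# Loop through each gene and store the interaction P value
result = rep(0, n_genes)
names(result) = rownames(expression.data)
for (counter in 1:n_genes) {
    fit = lm(expression.data[counter, ] ~ PAM * TP53)
    result[counter] = anova(fit)["PAM:TP53", "Pr(>F)"]
}

# Sort so the strongest interactions come first
head(sort(result))

# Adjust for the number of tests before calling anything significant
adjusted = p.adjust(result, method = "BH")
names(adjusted)[adjusted < 0.05]
```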
![Page 44: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/44.jpg)
Immune infiltration in TP53-WT Basal
[Figure: two panels of log2 expression plotted against infiltration grade (absent, mild, severe); one panel is labelled CD3E]
Does p53 have a role in immune surveillance?
![Page 45: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/45.jpg)
Next steps: getting help and learning more
![Page 47: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/47.jpg)
![Page 48: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/48.jpg)
online forums: expert help for free
all of bioinformatics
Nextgen sequencing
statistics
![Page 49: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/49.jpg)
UCSF resources
Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group
![Page 50: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/50.jpg)
Online classes and blogs
Free courses on data analysis
http://jhudatascience.org
simplystatistics.org
Coursera etc...
Good tutorials on sequence analysis
http://evomics.org/learning
![Page 51: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/51.jpg)
Reproducible research? You mean, there’s another kind?
![Page 52: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/52.jpg)
Replicate a wet lab experiment
detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience
![Page 53: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/53.jpg)
Replicate a dry lab experiment
published algorithms (if novel)
published source code (sometimes “available from the authors”)
well-specified input and deterministic output
no reagents (okay, maybe a supercomputer or cloud)
How hard could it be?
![Page 54: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/54.jpg)
Many chances to make honest errors
Bookkeeping errors: transposed column headers, out-of-date/changed annotations, off-by-one, misunderstood sample labels
Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
![Page 55: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/55.jpg)
eQTL differences between ethnic cohorts
Claim: Many genetic loci associated with gene expression differ between Asian and Western people
![Page 56: An introduction to quantitative biology and R](https://reader037.fdocuments.in/reader037/viewer/2022110102/56813f55550346895daa1996/html5/thumbnails/56.jpg)
Poor study design: batch processing effects
2003-2004: processed European samples
2005-2006: processed Asian samples
Processing year perfectly confounded with ethnicity.
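Perfect confounding of this kind can be spotted before any analysis by cross-tabulating the variable of interest against the processing batch. The sketch below uses a hypothetical sample sheet with four samples per group; only the two groups and the two processing periods come from the slide.

```r
# Hypothetical sample sheet mirroring the design on the slide
samples = data.frame(
    ethnicity = rep(c("European", "Asian"), each = 4),
    processed = rep(c("2003-2004", "2005-2006"), each = 4)
)

# Cross-tabulate the variable of interest against the processing batch
design = table(samples$ethnicity, samples$processed)
design

# Perfect confounding: each group appears in exactly one batch
confounded = all(rowSums(design > 0) == 1)
confounded
```

When every row of the table has a single non-zero cell, any difference between the groups could equally be a difference between processing periods, and no statistical correction can separate the two.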