An introduction to quantitative biology and R

An introduction to quantitative biology

David Quigley

dquigley@cc.ucsf.edu
Helen Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo

Molecular biology 20 years ago

Suzuki, Med Mol Morph 2010; Oh, PNAS 1996; Mao, Genes Dev 2004

qualitative methods
small-scale quantitative tests


Nik-Zainal, Cell 2012; Fullwood, Nature 2009; CGAN, Nature 2012

Molecular biology now
qualitative methods
small-scale quantitative tests
large-scale quantitative analysis

microarrays, *seq methods, cell phenotype screens...

Hypothesis-generating:
Normalize breast tumor expression data (three cohorts)
Call genotypes from SNP arrays
Calculate association between genotype and expression genome-wide (eQTL)
Identify an interesting candidate
Validate eQTL in independent cohort

Hypothesis-driven:
Methylation analysis at MRPS30 promoter
In vitro (two cell lines):
ChIP-PCR +/- estrogen
qPCR
sequencing

Many studies use both approaches

Common quantitative techniques

Gene Expression
transcription
splicing
methylation
protein binding (ChIP-seq)

Genomics and Genetics
de novo assembly
DNA copy number (CNV and tumors)
germline variant analysis
tumor variant analysis
SNPs, indels, translocation
analysis of tumor clonality

Challenges

requires statistical sophistication
in study design
in interpretation

many data points
1,000 to 1,000,000 measurements per sample
many false positives which look like great stories

software becomes part of the experiment
divide between engineering and biology in culture and thinking
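The false-positive problem can be made concrete with a small simulation in R, the language introduced later in this talk. This is only a sketch on made-up data: the gene count and group sizes are assumptions, and every "gene" is pure noise, so any test that passes is a false positive.

```r
# Simulated data only: 10,000 "genes" measured in two groups of 10 samples,
# with no real difference between the groups.
set.seed(1)
n_genes <- 10000
measurements <- matrix(rnorm(n_genes * 20), nrow = n_genes)

# One t-test per gene (row): columns 1-10 are group A, 11-20 are group B.
p_values <- apply(measurements, 1,
                  function(x) t.test(x[1:10], x[11:20])$p.value)

# About 5% of tests fall below 0.05 by chance alone.
n_naive <- sum(p_values < 0.05)

# Benjamini-Hochberg adjustment controls the false discovery rate.
n_adjusted <- sum(p.adjust(p_values, method = "BH") < 0.05)

print(c(naive = n_naive, adjusted = n_adjusted))
```

With no correction, several hundred noise genes look significant. After adjustment, few or none usually remain.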

Schillebeeckx Nature Biotech. 2013

Wet lab and quantitative skills: better job prospects

What approaches are used to analyze quantitative data?

Choosing a tool

Cost
Learning curve
Ease of use
Flexibility (closed to open-ended)
Software ecosystem (none to extensive)

Traditional programming languages
Python, C++, Java, others

can solve any computable problem
creates the fastest tools
free
requires programming expertise
complex to write and test
high effort

Specialized single-purpose programs

command line tools
academic research
type commands at a prompt or run scripts
PLINK, bowtie, GATK, bedtools

GUI (point and click)
commercial software for a vendor’s platform
slick, opaque, hard/impossible to automate

Commercial statistics programs

STATA, SPSS, GraphPad, others

1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report

May have a built-in language
Very mature tools for traditional biostatistics
Not free

Web-based tools

Galaxy
string together pre-defined analysis steps
very easy to use
reproducible

Using R is like writing and using software

Traditionally, biologists did not do this.

R was written by statisticians to be a free replica of another language called “S”.

R: a “software environment”

Flexible, open-ended, open-source

Large library of packages
package: easy-to-use published methods
like a Qiagen kit

Why is R popular?

You use R by typing at the prompt

There is no pull-down menu of statistical commands

What’s good about this approach?

chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers

What’s hard about this approach?

hard to get started
cryptic commands
built-in help is amusingly unhelpful

packages: collections of R functions
a package is a collection of R code that solves a specific task

limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data

downloaded for free from a central repository

Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more

Learning R data types by comparing them to Excel spreadsheets

Comparing Excel and R

Excel
Easy tasks are easy
Non-trivial tasks impossible or expensive
No paper trail
Mangles gene names
Plots look terrible

R
Easy jobs are hard at first
Non-trivial things are possible
Easy to make a paper trail
Biostatistics researchers publish tools in R
Can create publication-ready plots

Organizing data in Excel

Each subject has a row.
Each column has a feature of your subjects.

R calls the data points variables

variables
numbers and characters (letters, words)
numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”

R calls the columns vectors

vectors
ordered collections of a variable
name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
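As a rough illustration, the variables and vectors from the slide might be typed at the R prompt as follows. The values come from the slide; the names `color`, `second_name` and `older_names` are made up for this sketch.

```r
# Variables hold a single number or a character string.
# The slides use = for assignment; <- is the more common spelling
# and behaves the same here.
one_age <- 2.6
color <- "white, brown paws"

# c() combines values into a vector; every element has the same type.
name <- c("Flopsy", "Mopsy", "Cottontail", "Peter")
age <- c(2.5, 2.6, 2.5, 4)

# Elements are read by position (R counts from 1) or by a logical test.
second_name <- name[2]
older_names <- name[age > 2.5]

print(second_name)
print(older_names)
print(mean(age))
```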

R calls the data set a data frame

data frame
a list of vectors (columns) that have names
elements can be read and written by row & column

I can slice and dice the data frame
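A minimal sketch of slicing and dicing, built from the rabbit names and ages on the earlier slide. The data frame name `rabbits` and the added `oldest` column are assumptions made for illustration.

```r
# Build a data frame from named vectors (columns).
rabbits <- data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

# Read a whole column (a vector) by name.
ages <- rabbits$age

# Read one element by row and column.
first_name <- rabbits[1, "name"]

# Slice rows with a logical test: rabbits older than 2.5.
older <- rabbits[rabbits$age > 2.5, ]

# Write a new column computed from an existing one.
rabbits$oldest <- rabbits$age == max(rabbits$age)

print(older)
print(rabbits)
```

Leaving the column position empty in `rabbits[rows, ]` keeps every column, which is why the comma matters.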

Tell R to do things using functions
function_name( details about how to do it )

generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by

report the mean of my.data. Result of one function is fed into another one.

define a new function that adds 2 to whatever it’s passed

compare to original value of my.data
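The four steps described above could look like this at the prompt. It is a sketch rather than the exact code from the slides: `my.data` and `seq` appear in the slide text, while `add.two`, `shifted` and `mean.of.data` are hypothetical names.

```r
# Generate a sequence from 1 to 5 counting by 0.5.
# The parameters of seq() are named from, to, and by.
my.data <- seq(from = 1, to = 5, by = 0.5)

# Report the mean of my.data.
mean.of.data <- mean(my.data)

# The result of one function can be fed straight into another one.
mean.of.sequence <- mean(seq(from = 1, to = 5, by = 0.5))

# Define a new function that adds 2 to whatever it is passed.
add.two <- function(x) {
  x + 2
}
shifted <- add.two(my.data)

# Compare to the original value of my.data, which is unchanged.
print(my.data)
print(shifted)
print(mean.of.data)
```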

Walk through a straightforward analysis

Primary data from METABRIC study

gene expression
TP53 sequence

1,400 samples from 5 hospitals

Is there an association between breast cancer subtype and TP53 mutation?

Normalize data
batch effects
unwanted inter-sample variation

Identify outliers

associations between p53 and subtype

Quantile Normalization (limma)

Force every array to have the same distribution of expression intensities

> library(limma)

> raw = read.table('raw_extract.txt', ...)

> raw.normalized = normalizeQuantiles( as.matrix( raw ) )

> normalized = log2( raw.normalized )
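To show what "the same distribution" means, here is a base-R sketch of quantile normalization on simulated intensities. It is not the limma implementation, which treats ties and missing values more carefully. The function name `quantile_normalize` and the simulated matrix are assumptions made for illustration.

```r
# Simulated intensities: 1,000 probes (rows) on 4 arrays (columns),
# with a different overall scale on each array.
set.seed(1)
raw <- sapply(c(1, 1.5, 2, 3), function(scale) rexp(1000) * scale * 100)

quantile_normalize <- function(x) {
  # Rank the values within each array (column).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace every value with the reference value for its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

normalized <- log2(quantile_normalize(raw))

# After normalization every array has identical quantiles.
print(apply(normalized, 2, quantile))
```

The order of probes within each array is preserved. Only the values attached to each rank change.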

Identify batch effects in microarrays

Principal Components Analysis
Identify axes of maximal variation in a matrix

[Figure: scatter plot with axis gene 2, the first principal component, and samples from group A and group B]

PCA identifies a batch effect

[Figure: first principal component; hospital 3 (yellow)]

> my.pca = prcomp( t( expression.data ) )

> plot( my.pca, ... )
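A self-contained sketch of the same idea on simulated data, with an artificial batch effect planted in one hospital. The matrix size, the `hospital` labels and the size of the shift are all assumptions. Only `prcomp`, `my.pca` and `expression.data` come from the slide.

```r
# Simulated data: 200 genes (rows) by 30 samples (columns) from 3 hospitals.
set.seed(1)
hospital <- rep(c("h1", "h2", "h3"), each = 10)
expression.data <- matrix(rnorm(200 * 30), nrow = 200)

# Plant a batch effect: hospital 3 reads higher on half of the genes.
expression.data[1:100, hospital == "h3"] <-
  expression.data[1:100, hospital == "h3"] + 2

# prcomp() expects samples in rows, so transpose the genes-by-samples matrix.
my.pca <- prcomp(t(expression.data))

# Score of each sample on the first principal component.
pc1 <- my.pca$x[, 1]

# Fraction of total variance captured by each component.
variance.explained <- my.pca$sdev^2 / sum(my.pca$sdev^2)

print(tapply(pc1, hospital, mean))
print(round(variance.explained[1:3], 2))
```

Plotting `my.pca$x[, 1]` against `my.pca$x[, 2]`, colored by hospital, gives the kind of scatter plot shown on the slide. Calling `plot()` directly on the `prcomp` result draws a bar chart of component variances instead.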

batch correction reduces bias (ComBat)

[Figure: first principal component]

ComBat (in the sva package) reduces user-defined batch effects

Molecular subtypes of breast carcinoma, defined by gene expression

Luminal A: N=507
Luminal B: N=379
Her2: N=161
Basal: N=234

ER status

> sa = read.table('patients.txt', ...)

> tumor.counts = table( sa$ER.status, sa$PAM50Subtype)

(convert counts to percentages)

> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
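The counting and percentage steps can be tried on a small made-up sample sheet. The column names `ER.status` and `PAM50Subtype` are taken from the slide. The eight rows of data and the name `tumor.percent` are invented for this sketch and stand in for patients.txt.

```r
# Hypothetical sample annotation standing in for patients.txt.
sa <- data.frame(
  ER.status = c("pos", "pos", "pos", "neg", "pos", "neg", "neg", "pos"),
  PAM50Subtype = c("LumA", "LumA", "LumB", "Basal",
                   "LumB", "Basal", "Her2", "Her2"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are ER status, columns are subtypes.
tumor.counts <- table(sa$ER.status, sa$PAM50Subtype)

# Convert counts to percentages within each subtype (column).
tumor.percent <- 100 * prop.table(tumor.counts, margin = 2)

print(tumor.counts)
print(round(tumor.percent, 1))
```

Passing the percentage table to `barplot()` with two colors would then draw one stacked bar per subtype.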

Find interactions: TP53 and subtype

Fit a linear model:
> fitted.model = lm( dependent ~ independent )

Perform Analysis of Variance:
> anova( fitted.model )

general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )

18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}

Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing result.

> n_genes = 18000
> result = rep( 0, n_genes )

> for( counter in 1:n_genes ){ result[counter] = anova(...) }

sort results
identify significant interaction

repeat 18,000 times
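A runnable sketch of the loop on simulated data, scaled down to 50 genes. The expression matrix, the sample counts and the planted interaction are assumptions. Keeping only the interaction p-value from each ANOVA table is one reasonable reading of "storing result", not necessarily what the original analysis stored.

```r
# Simulated data: 50 genes and 160 tumors (the real analysis used 18,000 genes).
set.seed(1)
n_genes <- 50
n_samples <- 160
PAM <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 40))
TP53 <- factor(rep(c("mutant", "WT"), times = 80))
expression <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)

# Plant an interaction in gene 1: higher only in TP53-WT Basal tumors.
planted <- PAM == "Basal" & TP53 == "WT"
expression[1, planted] <- expression[1, planted] + 4

# Store one interaction p-value per gene.
result <- rep(NA_real_, n_genes)
for (counter in 1:n_genes) {
  fitted.model <- lm(expression[counter, ] ~ PAM * TP53)
  # anova() returns a table; keep only the p-value in the PAM:TP53 row.
  result[counter] <- anova(fitted.model)["PAM:TP53", "Pr(>F)"]
}

# Sort results and correct for the number of genes tested.
adjusted <- p.adjust(result, method = "BH")
print(head(order(result)))
print(sum(adjusted < 0.05))
```

Because `anova()` returns a whole table, the loop has to pick out a single number before storing it in `result`.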

Immune infiltration in TP53-WT Basal

[Figure: infiltration scored as absent, mild, or severe]

Does p53 have a role in immune surveillance?

Next steps: getting help and learning more

online forums: expert help for free

all of bioinformatics

next-gen sequencing

statistics

UCSF resources

Library classes and information

Formal courses (BMI, Biostatistics)

Cores (Computational Biology, Genomics)

QGDG monthly methods discussion group

Online classes and blogs

Free courses on data analysis
http://jhudatascience.org
simplystatistics.org
Coursera etc...

Good tutorials on sequence analysis
http://evomics.org/learning

Reproducible research? You mean, there’s another kind?

Replicate a wet lab experiment

detailed protocols (not printed in the methods)

extensive optimization

reagents that might be unique or hard to get

techniques that require years of experience

Replicate a dry lab experiment

published algorithms (if novel)

published source code
sometimes “available from the authors”

well-specified input and deterministic output

no reagents
Okay, maybe a supercomputer or cloud

How hard could it be?

Many chances to make honest errors

Bookkeeping errors
Transposed column headers
Out-of-date/changed annotations
Off-by-one
Misunderstood sample labels

Batch effects

Cryptic cohort stratification

Inappropriate analytical methods

eQTL differences between ethnic cohorts

Claim:
Many genetic loci associated with gene expression differ between Asian and Western people

poor study design
batch processing effects

2003-2004: processed European samples

2005-2006: processed Asian samples

Processing year perfectly confounded with ethnicity.
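A quick cross-tabulation can catch this kind of design problem before any analysis is run. The sample sheet below is invented to mirror the years and cohorts described above. The names `samples`, `design` and `perfectly_confounded` are assumptions.

```r
# Hypothetical sample sheet mirroring the design described above.
samples <- data.frame(
  ethnicity = rep(c("European", "Asian"), each = 4),
  year = c(2003, 2003, 2004, 2004, 2005, 2005, 2006, 2006)
)

# Cross-tabulate the biological variable against the processing batch.
design <- table(samples$ethnicity, samples$year)
print(design)

# The design is perfectly confounded when every batch holds one group only.
groups_per_batch <- colSums(design > 0)
perfectly_confounded <- all(groups_per_batch == 1)
print(perfectly_confounded)
```

When every column of the table has a single non-zero cell, no statistical method can separate the effect of processing year from the effect of ethnicity.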

