Software tools for the analysis of medically important sequence variations Gabor T. Marth, D.Sc. Boston College Department of Biology marth@bc.edu http://bioinformatics.bc.edu/marthlab Pfizer visit, March 7, 2006


Page 1: Software tools for the analysis of medically important sequence variations Gabor T. Marth, D.Sc. Boston College Department of Biology marth@bc.edu .

Software tools for the analysis of medically important sequence variations

Gabor T. Marth, D.Sc.

Boston College

Department of Biology

marth@bc.edu

http://bioinformatics.bc.edu/marthlab

Pfizer visit, March 7, 2006

Page 2

Our lab focuses on three main projects…

1. software tools for clinical case-control association studies

2. software for SNP discovery in clonal and re-sequencing data

3. connecting HapMap and pharmaco-genetic data

Page 3

1. We are developing computer software to aid tagSNP selection and association testing

[Figure: software overview. Components: study specification and user input; user control interface (GUI); reference samples; representative computational samples; computational sample database; tag evaluation, marker selection, association testing. Views: input data views, LD views, gene annotations, tags, association statistics. Inset plot: 5-site computationally generated LD (r2) against LA LD (r2), both axes 0 to 1, series by marker separation: 1-4, 5-9, 10-17, 18-26.]

(discussed in more detail)

Page 4

2. We build computer tools for SNP discovery

• inherited (germ line) polymorphisms are important because they can predispose to disease

• we look for SNPs and short INDELs

\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]

Marth et al. Nature Genetics 1999

• we have a 5-year NIH R01 grant to redevelop PolyBayes©, our SNP discovery package, originally developed while the PI was at the Washington University Medical School
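The SNP probability on this slide can be illustrated with a small brute-force calculation. The sketch below is not PolyBayes itself: it assumes that each read supplies a per-base posterior P(S | R), that single-base priors are flat (0.25), and that the joint prior puts a hypothetical mass `prior_polymorphic` evenly on all variable base combinations. The function name and all parameter values are illustrative only, and the enumeration over 4^N combinations is only practical for a handful of reads.

```python
from itertools import product

BASES = "ACGT"

def snp_probability(read_posteriors, prior_polymorphic=0.003, base_prior=0.25):
    """Posterior probability that an alignment column is polymorphic.

    read_posteriors: one dict per read, mapping each base to P(S | R).
    prior_polymorphic and the flat priors are illustrative assumptions.
    """
    n_reads = len(read_posteriors)
    n_monomorphic = len(BASES)
    n_variable = len(BASES) ** n_reads - n_monomorphic

    total = 0.0      # denominator: sum over every base combination
    variable = 0.0   # numerator: sum over combinations with >1 distinct base
    for combo in product(BASES, repeat=n_reads):
        is_variable = len(set(combo)) > 1
        # Assumed joint prior: polymorphic mass spread evenly over variable
        # combinations, the remainder spread evenly over monomorphic ones.
        if is_variable:
            joint_prior = prior_polymorphic / n_variable
        else:
            joint_prior = (1.0 - prior_polymorphic) / n_monomorphic
        weight = joint_prior
        for posterior, base in zip(read_posteriors, combo):
            weight *= posterior[base] / base_prior
        total += weight
        if is_variable:
            variable += weight
    return variable / total
```

With four confident reads that all show the same base the result stays close to zero, while two confident A reads plus two confident G reads push it close to one.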

Page 5

Apply our tools for genome-scale SNP mining

Sachidanandam et al. Nature 2001

[Figure labels: EST, WGS, BAC, genome reference; ~ 10 million]

Page 6

Extend our methods for SNP detection in medical re-sequencing data from traditional Sanger sequencers…

Homozygous T

Homozygous C

Heterozygous C/T

Page 7

… and in 454 pyrosequence data

454 sequence from the NCBI Trace Archive

• accurate base calling for de novo sequencing

• detection of heterozygotes in medical re-sequencing data

Figure from Nordfors et al. Human Mutation 19:395-401 (2002)

(discussed in more detail)

Page 8

Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms)

© Brian Stavely, Memorial University of Newfoundland

• the detection of somatic mutations, and their distinction from inherited polymorphism, will be important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer

(discussed in more detail)

Page 9

Process DNA methylation data obtained with sequencing

DNA methylation is important, e.g. because hypo- and hypermethylation are consistently present in various cancers

Issa. Nature Reviews Cancer, 4, 2004: 988-993

we are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bisulfite conversion of unmethylated cytosines

Lewin et al. Bioinformatics 20:3005-3012, 2004
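As a worked illustration of why incomplete bisulfite conversion matters, the sketch below corrects an observed methylation fraction under a deliberately simple model. It is an assumption for illustration, not the method being developed here: every unmethylated cytosine is converted independently with probability `conversion_rate`, so the observed fraction of C reads is m + (1 - m)(1 - conversion_rate). The function name is hypothetical.

```python
def corrected_methylation(observed_c_fraction, conversion_rate):
    """Estimate the true methylated fraction m at one cytosine.

    Assumed model: observed = m + (1 - m) * (1 - conversion_rate),
    which rearranges to m = (observed - unconverted) / conversion_rate.
    """
    if not 0.0 < conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be in (0, 1]")
    unconverted = 1.0 - conversion_rate
    m = (observed_c_fraction - unconverted) / conversion_rate
    # Sampling noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, m))
```

For example, with a 95% conversion rate, a site where 5% of reads still show C is consistent with no methylation at all.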

Page 10

… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development

chromatin structure

gene expression profiles

copy number changes

methylation profiles

chromosome rearrangements

repeat expansions

somatic mutations

Page 11

3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes

• predicting metabolic phenotypes (ADR) based on haplotype markers

• evolutionary origin of drug metabolizing enzyme polymorphisms

Page 12

Computer software to aid case-control association studies: tagSNP selection and association testing (details)

[Figure: 5-site computationally generated LD (r2) against LA LD (r2), both axes 0 to 1, series by marker separation: 1-4, 5-9, 10-17, 18-26]

Dr. Eric Tsung

Page 13

Clinical case-control association studies – concepts

• association studies are designed to find disease-causing genetic variants

• genotyping cases and controls at various polymorphisms

• searching for “significant” marker allele frequency differences between cases and controls

[Figure: allele frequency in clinical cases, AF(cases), plotted against allele frequency in clinical controls, AF(controls)]
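A common textbook way to test a single marker for an allele frequency difference is a chi-square test on the 2x2 table of allele counts. The sketch below shows that standard test, not the approach developed in this talk; the function name and argument names are hypothetical, and the counts are numbers of chromosomes rather than individuals.

```python
import math

def allele_chi_square(case_alt, case_total, control_alt, control_total):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Returns (chi2, p_value). Counts are chromosomes carrying the tested
    allele and the total chromosomes sampled in each group.
    """
    table = [[case_alt, case_total - case_alt],
             [control_alt, control_total - control_alt]]
    n = case_total + control_total
    row_totals = [case_total, control_total]
    col_totals = [case_alt + control_alt, n - case_alt - control_alt]
    if 0 in col_totals or 0 in row_totals:
        return 0.0, 1.0  # marker is monomorphic or a group is empty
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A test like this assumes the nominal chi-square null distribution; the later slides argue that the natural range of allele frequency differences is better estimated from representative samples.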

Page 14

Association study designs

• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”)

• direct or indirect:

[Figure panels: direct (causative variant); indirect (causative variant, and a marker that is co-inherited with the causative variant)]

• single-SNP marker or multi-SNP haplotype marker

• single-stage or multi-stage

Page 15

Marker (tag) selection for association studies

for economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process by which a subset of all available SNPs is chosen

1. hypothesis-driven (i.e. based on gene function)

2. LD-driven: based entirely on reducing the redundancy created by linkage disequilibrium (LD) between SNPs; tags represent the other SNPs they are correlated with

[Figure label: causative variant]
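LD-driven selection can be sketched as a greedy procedure: compute pairwise r2 from phased haplotypes, then repeatedly pick the SNP that covers the most still-untagged SNPs above a threshold. This is a generic greedy scheme written for illustration, under the assumption of biallelic SNPs coded 0/1; it is not necessarily the algorithm used in the software described here, and the threshold of 0.8 is only a conventional choice.

```python
def r_squared(a, b):
    """r2 between two biallelic SNPs given as equal-length 0/1 allele lists."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    d = pab - pa * pb
    return d * d / denom

def greedy_tags(haplotypes, threshold=0.8):
    """Map each chosen tag SNP index to the sorted SNP indices it represents."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    untagged = set(range(n_snps))
    tags = {}
    while untagged:
        best, best_set = None, set()
        for j in sorted(untagged):
            covered = {k for k in untagged
                       if k == j or r_squared(columns[j], columns[k]) >= threshold}
            if len(covered) > len(best_set):
                best, best_set = j, covered
        tags[best] = sorted(best_set)
        untagged -= best_set
    return tags
```

In a toy sample where SNP 0 and SNP 1 are perfectly correlated and SNP 2 is independent of both, the procedure keeps SNP 0 (representing SNPs 0 and 1) and SNP 2.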

Page 16

The International HapMap project

http://www.hapmap.org

The international HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure

Page 17

LD varies across samples

African reference (YRI)

there are large differences in LD between different human populations…

European reference (CEU)

… and even between samples from the same population.

Other European samples

Page 18

Sample-to-sample LD differences make tagSNP selection problematic

groups of SNPs that are in LD in the HapMap reference samples may not be in a future set of clinical samples…

… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples…

… possibly resulting in missed disease associations.

Page 19

Natural marker allele frequency differences confound association testing

reference samples: ~ 120 chromosomes

cases: 500-2,000 chromosomes

controls: 500-2,000 chromosomes

• the HapMap reference samples are much smaller than clinical sample sizes

• it is difficult to accurately assess both the marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation in marker allele frequency differences between cases and controls

[Figure axes: AF(cases) vs. AF(controls)]

• it is therefore difficult to assess the statistical significance of candidate associations
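The effect of sample size on an allele frequency estimate can be made concrete with the binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes independently sampled chromosomes, which is a simplification; the function name is hypothetical.

```python
import math

def allele_frequency_se(p, n_chromosomes):
    """Binomial standard error of an allele frequency estimate.

    Assumes n_chromosomes independently sampled chromosomes.
    """
    return math.sqrt(p * (1.0 - p) / n_chromosomes)

# Compare a reference panel of ~120 chromosomes with clinical sample sizes.
for n in (120, 500, 2000):
    print(n, round(allele_frequency_se(0.2, n), 4))
```

Under this simple model, an allele at 20% frequency is estimated several times less precisely from about 120 reference chromosomes than from 2,000 clinical chromosomes.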

Page 20

We are developing technology for assessing sample-to-sample variance in silico

we estimate LD differences between HapMap and future clinical samples…

… by generating “computational” samples representing future clinical samples…

… and use computational “proxy” samples for tabulating LD and allele frequency differences.

[Figure labels: reference; cases; controls; “cases”; “controls”; tag evaluation; tag selection; association testing]

Page 21

Two methods of computational sample generation

[Figure labels: HapMap; “HapMap”; “cases”; “controls”]

Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow.

Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
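The mosaic idea behind Method 2 can be sketched in a few lines: a new haplotype copies alleles from one existing haplotype and occasionally switches to another, which mimics recombination. This is a strongly simplified illustration, not the Li & Stephens model itself: here `switch_prob` and `mutation_prob` are constant, hypothetical parameters, whereas in the full model they depend on recombination rates, mutation rates and sample size.

```python
import random

def mosaic_haplotype(haplotypes, switch_prob=0.05, mutation_prob=0.0, rng=None):
    """Build one new 0/1 haplotype as a mosaic of existing haplotypes.

    switch_prob: per-site chance of jumping to another source haplotype.
    mutation_prob: per-site chance of flipping the copied allele.
    Both are illustrative constants, not fitted population parameters.
    """
    rng = rng or random.Random()
    n_sites = len(haplotypes[0])
    source = rng.randrange(len(haplotypes))
    new_haplotype = []
    for site in range(n_sites):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(haplotypes))  # simulated recombination
        allele = haplotypes[source][site]
        if rng.random() < mutation_prob:
            allele = 1 - allele                      # simulated copying error
        new_haplotype.append(allele)
    return new_haplotype
```

Calling the function repeatedly, for example `[mosaic_haplotype(reference, rng=random.Random(i)) for i in range(1000)]`, would give a computational sample of 1,000 chromosomes from a small reference panel.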

Page 22

Computational samples

HapMap (CEU)

Computational (PAC)

Computational (Coalescent)

Extra genotypes (Estonia)

Page 23

MARKER EVALUATION with computational samples

test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group
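One simple way to summarize such a test is the fraction of tagged SNPs that remain above the LD threshold with their tag in the new samples. The sketch below assumes the pairwise r2 values in the computational samples have already been tabulated; the function name, the data layout and the SNP identifiers in the usage note are hypothetical.

```python
def fraction_still_tagged(tag_groups, r2_new, threshold=0.8):
    """Fraction of (tag, SNP) pairs whose r2 in the new samples stays >= threshold.

    tag_groups: tag SNP -> list of SNPs in its original LD group.
    r2_new: (tag, snp) -> r2 measured in the new (computational) samples.
    """
    pairs = [(tag, snp)
             for tag, members in tag_groups.items()
             for snp in members
             if snp != tag]
    if not pairs:
        return 1.0  # every tag represents only itself
    kept = sum(1 for pair in pairs if r2_new[pair] >= threshold)
    return kept / len(pairs)
```

For instance, if a tag `snpA` originally represented `snpB` and `snpC`, and only `snpB` still has r2 of at least 0.8 with it in the new samples, the function returns 0.5.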

Page 24

MARKER SELECTION with computational samples

selecting tags in multiple consecutive sets of computational samples and choosing the best-performing tags for the association study

Page 25

ASSOCIATION TESTING with computational samples

[Figure: three computational sample pairs, each consisting of “cases” and “controls”]

tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences, which is used to decide whether a candidate association is statistically significant

[Figure axes: AF(cases) vs. AF(controls)]
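Once ΔAF has been tabulated over many computational sample pairs, a candidate association can be scored against that empirical distribution. The sketch below shows a generic two-sided empirical p-value with the usual add-one correction; it illustrates the idea and is not necessarily the statistic used in the software. The function name is hypothetical.

```python
def empirical_p_value(observed_diff, null_diffs):
    """Two-sided empirical p-value for an allele frequency difference.

    null_diffs: ΔAF values tabulated in computational "cases" vs. "controls"
    pairs that carry no true association.
    """
    extreme = sum(1 for d in null_diffs if abs(d) >= abs(observed_diff))
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (len(null_diffs) + 1)
```

The resolution of such a p-value is limited by the number of computational pairs: with 999 pairs the smallest attainable value is 0.001.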

Page 26

Do computational samples represent future clinical genotypes realistically?

[Figure: scatter plot; both axes range from 0 to 1]

we quantify the quality of representation by the correlation of LD between corresponding pairs of markers in the two sets of samples (i.e. we ask: if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
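The comparison described here amounts to correlating two lists of pairwise LD values, one per sample set, over the same marker pairs. A minimal sketch, assuming the Pearson correlation coefficient is the measure of interest (the slide does not say which correlation is used) and that neither list is constant:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of r2 values.

    xs[i] and ys[i] must refer to the same marker pair in the two sample sets.
    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

A value close to 1 means that marker pairs in strong LD in one sample set are also in strong LD in the other.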

Page 27

LD difference -- comparison to extra experimental genotypes

[Figure values: 0.949 +/- 0.013; 0.978 +/- 0.010; 0.963 +/- 0.014]

• we have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA)

Page 28

AF difference -- comparisons to extra experimental genotypes

[Figure: AF Diff, Comp Samples (y axis) against AF Diff, Estonian Data (x axis); both axes range from 0 to 0.06]

• according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability

Page 29

A new marker selection and association testing software tool

• data visualization

• representative computational sample generation

• advanced tag selection functionality

• gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)

• advanced association testing functionality

• multi-level user customization, including user conveniences, e.g. tag prioritization based on SNP assay score

[Figure: software screenshot. Labels: reference samples; representative computational samples; gene annotations; tags; LD views; association statistics. Inset plot: 5-site computationally generated LD (r2) against LA LD (r2), both axes 0 to 1, series by marker separation: 1-4, 5-9, 10-17, 18-26.]

Page 30

User community

• companies designing new generations of whole-genome or specialized SNP arrays

• researchers comparing alternative platforms (e.g. Affymetrix 500K and Illumina 300K) to find the one most suitable for their study

• clinical researchers designing candidate gene studies

• researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples)

• the association testing features should be useful for analysts regardless of study design

Page 31

Base calling and SNP detection in sequence traces including 454 data

Aaron Quinlan

Page 32

Base calling and SNP detection in sequence traces including 454 “pyrogram” data

• PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects

• medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces

[Figure: two DNA strands with 5’ and 3’ ends labeled, showing base pairs]

Page 33

Heterozygote detection in sequence traces

Ind. 1

Ind. 2

Ind. 3

Ind. 4

Page 34

Individual traces

• we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions

Page 35

Aggregating information from multiple traces

forward/reverse sequences from same individual

P(GT | Read) = .98

P(GT | Read) = .87

resultant genotype call: P(GT) = .993
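Per-read genotype probabilities can be combined with Bayes' rule if the reads are treated as conditionally independent given the genotype. The sketch below makes that assumption, and also assumes that every per-read posterior was computed with the same, strictly positive genotype prior. It is an illustration only and is not expected to reproduce the exact value on the slide, which depends on details of the actual model. All names are hypothetical.

```python
def combine_read_posteriors(read_posteriors, prior):
    """Combine per-read genotype posteriors into one genotype posterior.

    read_posteriors: one dict per read, genotype -> P(genotype | read).
    prior: genotype -> prior probability used for each per-read posterior
           (every value must be > 0).
    """
    scores = {}
    for genotype, p in prior.items():
        score = p
        for posterior in read_posteriors:
            # posterior / prior is proportional to the read likelihood.
            score *= posterior[genotype] / p
        scores[genotype] = score
    total = sum(scores.values())
    return {genotype: s / total for genotype, s in scores.items()}
```

Two reads that each favor the same heterozygous call yield a combined probability higher than either read alone, which is the behavior shown on this slide.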

Page 36

Discovery vs. genotyping

discovery: “uninformed prior”, Prior(CT) = .001

• don’t know if site is polymorphic

• have to test each site

genotyping: “informed prior”, Prior(CT) = 0.34

1. site is known to be polymorphic

2. allele frequency estimate
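The practical effect of the two priors can be seen by applying Bayes' rule to the same trace evidence. In the sketch below the evidence is summarized as a likelihood ratio in favor of the heterozygote; that summary, the function name and the value 50 used in the usage note are illustrative assumptions, while the two priors are the ones shown on this slide.

```python
def posterior_heterozygote(prior_het, likelihood_ratio):
    """Posterior probability of a heterozygote from prior odds and evidence.

    likelihood_ratio: P(trace | heterozygote) / P(trace | homozygote).
    """
    prior_odds = prior_het / (1.0 - prior_het)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)
```

With a likelihood ratio of 50, `posterior_heterozygote(0.001, 50)` stays below 0.05 while `posterior_heterozygote(0.34, 50)` exceeds 0.95, so the same trace leads to different calls in discovery and in genotyping.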

Page 37

Our heterozygote detection works better than other methods

Performance measured on ~1000 alignments covering a 500 kb region of chromosome 4

Method        Fraction of data analyzed   False discovery rate   Fraction of heterozygotes found   Fraction of homozygotes found
PolyBayes+    85.1                        0.0375                 86.60%                            97.8%
PolyPhred 5   86.17                       0.0389                 83.16%                            82.63%

Page 38

Base calling for “pyrograms”

From NCBI Trace Archive

• we have access to standardized data formats

• readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle

[Figure: pyrogram with called sequence TCAGGGGGGGGGGGACGACAAGGCGTGGGGA; values shown: 26 55 24 15 10 7 5 4 2 1 0 0]

• the identity of consecutive bases is very reliable, but the length of mono-nucleotide runs (base number) is difficult to quantify (great for re-sequencing, but problematic for de novo sequencing)
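The run-length problem can be illustrated with a naive flowgram base caller that rounds each flow signal to the nearest whole number of incorporated bases. This is a toy sketch, not the base caller being developed here: the cyclic flow order `TACG`, the signal scale (1.0 per incorporated base) and the function name are assumptions made for illustration.

```python
def call_flowgram(signals, flow_order="TACG"):
    """Call bases from flow signals by rounding each signal to a run length.

    signals: one intensity per nucleotide flow, scaled so that 1.0 is
    roughly one incorporated base (an illustrative assumption).
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(max(0.0, signal) + 0.5)  # round half up
        bases.append(nucleotide * run_length)
    return "".join(bases)
```

Which nucleotide was incorporated is fixed by the flow, so base identity is robust, but a signal of 10.4 versus 10.6 changes the called run from 10 to 11 bases, which is why long mono-nucleotide runs are hard to size.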

Page 39

SNP genotyping with pyrosequencers

Nordfors et al. Human Mutation 19:395-401 (2002)

we are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces

Page 40

Somatic mutation detection

Michael Stromberg

Page 41

Somatic mutations

© Brian Stavely, Memorial University of Newfoundland

the detection of somatic mutations, and their distinction from inherited polymorphism, is important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer

1. detect the mutations

2. classify whether somatic or inherited

Page 42

Detecting somatic mutations with comparative data

• based on comparison of cancer and normal tissue from the same individual

• cancer tissue is often highly heterogeneous, and the somatic mutant allele may be present at low allele frequency

Page 43

Detecting somatic mutations with subtraction

• if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence

• subtract apparent mutations that are present in sequence variation databases

• search for evidence that these mutations are genetic
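The subtraction step in the second bullet is essentially a set difference between candidate variants and known polymorphisms. A minimal sketch, assuming each variant is represented as a hypothetical `(chromosome, position, reference_allele, alternate_allele)` tuple; real variant databases need more careful matching (coordinate systems, strand, allele normalization).

```python
def subtract_known_variants(candidates, known_variants):
    """Keep candidate variants that are absent from a known-variant collection.

    Each variant is a (chromosome, position, ref_allele, alt_allele) tuple;
    this representation is an assumption made for illustration.
    """
    known = set(known_variants)
    return [variant for variant in candidates if variant not in known]
```

Variants that survive the subtraction are only candidates: a rare inherited polymorphism missing from the database would pass this filter too, which is why the third bullet calls for additional evidence.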

Page 44

Detecting somatic mutations with subtraction

• we have applied our methods for somatic mutation detection in murine mitochondrial sequences

heteroplasmy homoplasmy

• we will be applying our methods to human nuclear DNA from our collaborators

Page 45

Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems

• the HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles

• important drug metabolizing enzymes are relatively few in number, well studied, and at known genome locations; many associated phenotypes are well described

• many functional alleles are known, and of high frequency (common)

• multi-SNP alleles are highly predictive of metabolic phenotype

• clinical phenotype (adverse drug reaction) less predictable

• ideal candidate for applying haplotype resources

Page 46

Multi-marker haplotypes as accurate markers for ADRs?

functional allele (known metabolic polymorphism)

genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes

molecular phenotype (drug concentration measured in blood plasma)

clinical endpoint (adverse drug reaction)

computational prediction based on haplotype structure

Page 47

Resources

• specifics of enzyme-drug interactions

• LD and haplotype structure in the HapMap reference samples, based on high-density SNP map

• functional alleles

• existing DME P genotyping chips

Page 48

Evolutionary questions

• mutation age?

• mutations single-origin or recurrent?

• geographic origin of mutations?

• analysis based on complete local variation structure and haplotype background of functional mutations

• specifics of the selection process that led to specific functional alleles?

Page 49

Proposed steps of analysis

• haplotypes vs. metabolic phenotype?

• complete polymorphic structure?

• ethnicity?

• additional functional SNPs?

• haplotypes vs. functional alleles?

[Figure labels: haplotype; haplotype block?; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR)]

• haplotypes vs. ADR phenotype?