Human genetic variation and its contribution to complex traits

Deplancke Lab

Monica Albarca

Jean-Daniel Feuz

Carine Gubelmann

Korneel Hens

Alina Isakova

Irina Krier

Andreas Massouras

Sunil Raghav

Jovan Simicevic

Sebastian Waszak

Wiebke Westhall

You?

deplanckelab.epfl.ch


Laboratory of Systems Biology and Genetics

26 June 2000

Bart Deplancke ([email protected])

The human genome: first announcement

• In June 2000: first announcement of a working draft (haplotype!), with the Nature and Science papers following in February 2001
• In June 2001: finished chromosome 20, with others following until the finishing of chromosome 1 in May 2006

International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Venter et al. (2001) Science 291:1304-1351.

Gregory et al. (2006), Nature, 441, 315-321

James Kent (UCSC) Eugene Myers (Celera)

Why are we so phenotypically different?

Classes of human genetic variation

Common versus rare: refers to the frequency of the minor allele in the human population:

• Common variants = minor allele frequency (MAF) > 1% in the population. Also described as polymorphisms.
• Rare variants = MAF < 1%

Neutrality:

• The vast majority of genetic variants are likely neutral = no contribution to phenotypic variation.
• Some may reach significant frequencies, but only by chance.

Two different nucleotide composition classes:

• Single nucleotide variants
• Structural variants

Single nucleotide variants

ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…

ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…


ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…


T/G T/G A/C


Simple 5’ to 3’ read-out

How are SNPs detected?

Unique oligonucleotide primers to generate minimally overlapping long-range PCR products of 10-kb average length

High-density oligonucleotide arrays

Flanking issues

Chee et al., Science, 1996

How are SNPs detected? Other strategies

Reduced representation shotgun sequencing followed by genomic alignment

From Rothberg et al. Nature Biotech, 2001

Clustered alignment

Gene-centric studies

Reference sequence

The SNP database - dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/

Three “out of Africa” genomes:

• 1.2 million (67%) (all three), 1.7 million (52%) (any two), 1.0 million (30%) unique
• Overall, 5.2 million SNPs in the three genomes, the majority being present in dbSNP
• Data indicate that most SNVs are common rather than rare

Single nucleotide variants

• Estimated that the human genome contains > 11 million SNPs (~7 million with MAF > 5%, the rest between 1-5%).
• Unknown how many rare or even novel (“de novo”) SNVs exist
• SNP alleles in the same genomic interval are often correlated with one another: “linkage disequilibrium (LD)” = nonrandom association of alleles, which varies in a complex and unpredictable manner across the genome and between different populations.
• International HapMap Project: can we divide the genome into groups of highly correlated SNPs that are generally inherited together = “LD bins”?

Number of tag SNPs required to capture common Phase II SNPs


Pairwise linkage disequilibrium (LD): r² (if r² = 1, the SNPs are statistically indistinguishable)
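To make the r² measure concrete, here is a minimal sketch of how pairwise LD can be computed from phased haplotypes. It assumes biallelic SNPs coded as 0/1 per haplotype; the function name `ld_r2` and the toy haplotypes are illustrative only and do not come from the HapMap analysis.

```python
def ld_r2(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per
    phased haplotype (assumed input format for this sketch).
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                       # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                       # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denom


# Two SNPs that always travel together: statistically indistinguishable.
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
# Two SNPs whose alleles are independent of each other.
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

With r² = 1, genotyping one SNP of the pair tells you the other, which is what makes tagging possible.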

Based on genotyping over 3.1 million SNPs in 270 individuals from 4 geographically diverse populations (Frazer et al., Nature, 2007)

Recap

By genotyping the DNA sample of an individual with a “tagging” SNP from each LD bin, knowledge regarding 80% of SNPs with a MAF > 5% across the genome is gained. (Frazer et al., Nature Rev. Genet., 2010)
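The idea of picking one tagging SNP per LD bin can be sketched with a simple greedy procedure. This is an illustration under stated assumptions, not the method used by HapMap: the r² matrix, the SNP names (`rs1` to `rs4`), the 0.8 threshold as a default, and the function `greedy_tag_snps` are all hypothetical.

```python
def greedy_tag_snps(names, r2, threshold=0.8):
    """Greedily choose tag SNPs so that every SNP is captured by a tag.

    names: list of SNP identifiers.
    r2: symmetric matrix (list of lists) of pairwise r^2 values.
    Returns a dict mapping each tag SNP to the sorted SNPs in its bin.
    """
    untagged = set(range(len(names)))
    bins = {}
    while untagged:
        best, best_cover = None, set()
        # Pick the untagged SNP that captures the most untagged SNPs
        # (ties go to the lowest index).
        for i in sorted(untagged):
            cover = {j for j in untagged if j == i or r2[i][j] >= threshold}
            if len(cover) > len(best_cover):
                best, best_cover = i, cover
        bins[names[best]] = sorted(names[j] for j in best_cover)
        untagged -= best_cover
    return bins


# Hypothetical example: rs1, rs2 and rs3 are highly correlated, rs4 is not.
names = ["rs1", "rs2", "rs3", "rs4"]
r2 = [
    [1.00, 0.95, 0.90, 0.10],
    [0.95, 1.00, 0.85, 0.10],
    [0.90, 0.85, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
print(greedy_tag_snps(names, r2))
```

Here two genotyped SNPs (one per bin) stand in for all four, which is the saving the recap describes.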

Scan Entire Genome - 500,000 SNPs

Querying human genetic variation

http://www.biotech.umb.edu/Affymetrix.htm

Population Stratification: subdivision of a population into different ethnic groups with potentially different marker allele frequencies and thus different disease prevalence

Principal Component Analysis reveals SNP-vectors explaining the largest variation in the data

From Sven Bergmann, UNIL

Ethnic groups cluster according to geographic distances
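How PCA separates groups with different allele frequencies can be sketched on a toy genotype matrix. This is a minimal pure-Python illustration using power iteration; the genotypes are made up, the function name `first_principal_component` is hypothetical, and real analyses use dedicated tools on hundreds of thousands of SNPs.

```python
def first_principal_component(genotypes, n_iter=200):
    """Return a unit-length PC1 coordinate per individual (up to scale).

    genotypes: list of rows (individuals), each a list of minor-allele
    counts (0, 1 or 2) per SNP.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each SNP column on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n similarity matrix between individuals (X times X transposed).
    k = [[sum(a * b for a, b in zip(x[i], x[h])) for h in range(n)] for i in range(n)]
    # Power iteration converges to the leading eigenvector of k.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        w = [sum(k[i][h] * v[h] for h in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            raise ValueError("no variation along the starting direction")
        v = [c / norm for c in w]
    return v


# Toy data: the first three individuals come from one hypothetical
# population, the last three from another with different allele frequencies.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
pc1 = first_principal_component(genotypes)
print([round(c, 2) for c in pc1])
```

The two groups land on opposite sides of zero along PC1, which is the clustering effect the slides show for real cohorts.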

[Figure: two PCA scatter plots; x-axis PC1, y-axis PC2]

Population Stratification


PCA of POPRES cohort

Population Stratification


A classic that opened the door to structural variant research:

Structural variants

(Frazer et al., Nature Rev. Genetic., 2010)

Sebat et al. Large-Scale Copy Number Polymorphism in the Human Genome. Science, 2004.

Used ROMA technique to detect copy number variants

1) Genome digestion
2) Adapters are ligated to the sticky ends, followed by PCR amplification
3) After PCR, representations of the entire genome (restriction fragments) are amplified to pronounce relative increases, decreases or preserve equal copy number in the two genomes.
4) Representations of the two different genomes are labeled with different fluorophores and co-hybridized to a microarray with probes specific to restriction site locations across the entire human genome.

Representational Oligonucleotide Microarray Analysis (ROMA)


On average, individuals (20 tested) differed by 11 CNPs (average length = 465 kb) affecting 70 genes.

Our ability to detect SVs is still very poor (see later)

Structural variants (SVs)

(Frazer et al., Nature Rev. Genetic., 2010)

Structural variants (SVs)

Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)

• 1 million fosmid clones/individual
• Both ends of each clone insert were sequenced, giving a pair of high-quality end sequences (~450 bp/sequence), termed an end-sequence pair (ESP)

Only SVs over 8 kb can be detected

Structural variants (SVs): fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)

~2,000 SVs that were experimentally verified

Novel sequence (either in gaps (black) or not (orange))




• 50% of SVs were seen in > 1 individual
• Nearly half lay outside regions of the genome previously described as structurally variant
• 525 new insertion sequences
• SVs = 20% of all genetic variants, but they cover > 70% of nucleotide variation
• SVs span between 9 and 25 Mb (~0.5-1% of the genome)
• The majority of SVs are yet to be discovered



Regions of increased SNV density

Structural variants and linkage disequilibrium (McCarroll et al., Nature Genet., 2008)

• Most common, diallelic CNPs (with MAF greater than 5%) were perfectly captured (r² = 1.0) by at least one SNP tag from HapMap Phase II
• Mean r² as a function of distance from a polymorphism = indistinguishable for SNPs and diallelic CNPs → common, diallelic CNPs are ancestral mutations

Common SVs are in LD with tagging SNPs

Contribution of variants to phenotypes?

Common versus rare

“Common disease – common variant hypothesis” versus “common complex traits are the summation of low-frequency, high-penetrance variants”

OR = odds ratio or PAR = population attributable risk = measure of the multifactorial inherited component of a disease
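As a rough illustration of the two measures named above, the sketch below computes an odds ratio from a 2x2 case/control table and a population attributable risk using Levin's formula. The counts and frequencies are invented, the function names are hypothetical, and the choice of Levin's formula for PAR is an assumption of this sketch (the slides do not say which definition is meant).

```python
def odds_ratio(case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers):
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    return (case_carriers * ctrl_noncarriers) / (case_noncarriers * ctrl_carriers)


def population_attributable_risk(carrier_freq, relative_risk):
    """Levin's formula: fraction of cases attributable to the risk factor.

    carrier_freq: frequency of the risk factor in the population.
    relative_risk: risk in carriers divided by risk in non-carriers.
    """
    excess = carrier_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Invented numbers: 30/100 cases and 20/100 controls carry the risk allele.
print(round(odds_ratio(30, 70, 20, 80), 3))               # 1.714
# A common variant (30% carriers) with a modest relative risk of 1.2.
print(round(population_attributable_risk(0.3, 1.2), 4))   # 0.0566
```

The second number illustrates why a common variant with a small effect can still matter at the population level.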

Whole Genome Association studies

How significant is this?

P-value

Note: “Genome-wide” is a misnomer

• 20% of common SNPs not or only partially tagged
• Rare variants not tagged at all
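To give the "how significant is this?" question a concrete form, here is a minimal sketch of a single-SNP allelic test and a multiple-testing threshold. It assumes a 2x2 chi-square test on allele counts (1 degree of freedom) and a Bonferroni correction for 500,000 tested SNPs; both choices, the counts, and the function names are assumptions of this sketch rather than the procedure of any particular study.

```python
import math


def allelic_chi2_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2 statistic, p-value). Assumes all margins are non-zero.
    """
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 500_000)   # 1e-07
chi2, p = allelic_chi2_test(700, 1300, 400, 1600)  # invented allele counts
print(threshold, chi2, p, p < threshold)
```

Because so many SNPs are tested at once, a nominal P-value of 0.05 means little; only associations far below the corrected threshold are worth replicating.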

Whole genome association studies


Scan Entire Genome - 500,000 SNPs

Identify local regions of interest, examine genes, SNP density, regulatory regions, etc.

Replicate the finding

[Figure: association plots; y-axis -log10(p)]


Concept

McCarthy et al., Nature Rev. Genet., 2008

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).


Visualization

Scan Entire Genome - 500,000 SNPs

Identify local regions of interest, examine genes, SNP density, regulatory regions, etc.

Replicate the finding

[Figure: association plots; y-axis -log10(p)]


Concept

From Sven Bergmann (UNIL)

Whole genome association studies An avalanche of GWA studies

• Since 2006, > 220 studies have been reported to date
• For over 80 phenotypes, 300 loci have been implicated
• Most implicated loci were identified for the first time (no prior knowledge)

Whole genome association studies Type 2 diabetes: an example

• 18 genomic intervals, with 4 containing previously implicated genes
• Major message: the molecular diversity of T2D genes was not anticipated, thus:

(Patients with the same disease) ≠ (Patients with the same underlying biological disorder)

Frazer et al., Nat. Rev. Genet., 2010

Whole genome association studies Overlap of genetic risk factor loci for common diseases

• 15 loci are associated with two or more diseases (8 are shown)
• Not necessarily the same impact (PTPN22: + for Crohn’s, - for other autoimmune diseases)
• Different diseases may have similar molecular underpinnings
• Expected: autoimmune diseases (same clinical features)
• Unexpected: e.g. GCKR in both TGC levels and autoimmune disease

Frazer et al., Nat. Rev. Genet., 2010

Whole genome association studies From association to molecular mechanism

• Very difficult:
• What are the precise variants associated with a trait?
• If located in exons: easy, but outside, then what?
• Most are located outside exons! (e.g. 9p21 <-> myocardial infarction is located 150 kb from the nearest gene!)
• May have a regulatory function, i.e. control gene expression


• Humans are heterozygous at more functional cis-regulatory sites than at amino acid positions, with 10,700 functional biallelic cis-regulatory polymorphisms in a typical human (Rockman and Wray. Mol. Biol. Evol., 2002: 19, 1991).
• 34% of promoter polymorphisms (170 tested) significantly modulated reporter gene expression (>1.5-fold) (Hoogendoorn et al., Hum. Mol. Genet., 2003: 12, 2249).
• Case study with the CC chemokine receptor 5 (CCR5), a major chemokine coreceptor of HIV-1 necessary for viral entry into cells
• G to A SNP of CCR5 at –2459 nt
• CCR5 density: low (homozygous GG), intermediate (GA), and highest (homozygous –2459AA) (Salkowitz et al., Clin. Immunol., 2003: 108, 234).

Whole genome association studies Mapping eQTLs

• Transcript abundance = a quantitative trait that can be mapped with considerable power = eQTLs

Environment Genetics

Heritability (H2) = genetic variance over total trait variance with 0 = no genetic effects and 1 = all variance is under genetic control
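Written as a formula, the definition above reads as follows. This is a sketch assuming the total trait (phenotypic) variance splits cleanly into a genetic and an environmental part, with no gene-environment covariance or interaction; the symbols are introduced here for illustration.

```latex
H^2 \;=\; \frac{\sigma_G^2}{\sigma_P^2}
    \;=\; \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2},
\qquad 0 \le H^2 \le 1
```

Here \sigma_G^2 is the genetic variance, \sigma_E^2 the environmental variance, and \sigma_P^2 the total trait variance, so H^2 = 0 means no genetic effects and H^2 = 1 means all variance is under genetic control.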

Classic paper: Schadt et al., Nature, 2003: Genetics of gene expression surveyed in maize, mouse and man

• Liver tissues from 111 F2 mice (constructed from C57BL/6J and DBA/2J)
• Microarray analysis of 23,574 genes: 7,861 significantly differentially expressed (either in the parental strains or in at least 10% of the F2 mice)
• eQTL identification (logarithm of the odds (LOD) score > 4.3 (P-value < 0.00005)) for 2,123 genes
• These eQTLs explained 25% of the transcription variation of the corresponding genes
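The pairing of LOD > 4.3 with P < 0.00005 can be checked with a short calculation. The sketch assumes the LOD score is a base-10 log likelihood ratio and that the corresponding test statistic is chi-square distributed with 2 degrees of freedom (a common choice for an F2 intercross, but an assumption here, since the slide does not state it). The function names are illustrative.

```python
import math


def lod_to_chi2(lod):
    """Convert a LOD score (log10 likelihood ratio) to a likelihood-ratio chi-square."""
    return 2.0 * math.log(10.0) * lod


def chi2_upper_tail_2df(chi2):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)


p_value = chi2_upper_tail_2df(lod_to_chi2(4.3))
print(p_value)  # about 5.0e-05
```

With 2 degrees of freedom the conversion collapses to P = 10^(-LOD), so LOD 4.3 corresponds to roughly 5 x 10^-5, in line with the threshold quoted above.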


Schadt et al., Nature, 2003

% eQTL across 920 evenly spaced bins, each 2 cM wide

• Several hotspots (> 1% of detected eQTLs are located within a 4 cM interval)
• 40% of genes with ≥ 1 eQTL (LOD > 3.0) had more than one eQTL, and close to 4% of such genes had more than three eQTLs

Gene expression = complex trait



Known polymorphisms between the two parental strains

• Overlap between polymorphism and eQTL = cis-acting transcriptional regulation

For example:

• The C5 gene: a 2 bp deletion in the coding region in DBA mice results in rapid transcript decay compared with B6. A LOD of 27.4 centred over the C5 gene on chromosome 2 is readily detected (black curve).
• The Alad gene: present in 2 copies in DBA



Combining clinical, gene expression and genetic factors

• Classical QTLs for FPM: 4 significant loci
• Further analyses with subgroups: additional loci identified
• Some QTLs only affect a subset of the F2 population, demonstrating the complexity underlying traits such as obesity


Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression

• 206 families of British descent using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)

~15,000 transcripts with H2 > 0.3

Gene Ontology descriptors for: • Response to unfolded protein (HSFs, chaperones) • Immune responses and apoptosis • Regulation of progression through the cell cycle, • RNA processing and DNA repair.



• Trans effects are weaker than those in cis

• Nevertheless, significant trans associations were detected:

e.g. 1) ~700 transcripts with the peak of association on the same chromosome but > 100 kb from the nearest transcribed gene; 2) for 10,382 transcripts, the peak of association was on a different chromosome


Libioulle et al., PLOS Genet., 2007

Using eQTLs to better understand GWAS results

GWAS for Crohn’s disease

• Disease-associated polymorphisms may be regulating PTGER4 expression in cis, but from > 250 kb away → more research needed, but likely a regulatory polymorphism

1.25 Mb gene desert

• One of the neighboring genes, PTGER4, may be involved
• Trace eQTLs in LCL data


Stranger et al., Science, 2007: Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes

We looked at SNPs, but what about structural variants?

• LCLs of 210 unrelated HapMap individuals from four populations
• Copy number variants were identified via CGH against a common reference individual

[Figure: two panels, SNP and CNV; axis label: From probe associated with linked gene]

• SNPs and CNVs captured 83.6% and 17.7% of the total detected genetic variation in gene expression, respectively
• SNPs lie close to their respective genes, less so for CNVs
• Little overlap between SNP and CNV associations (only 20%)
• Not “mere” gene dosage effects

Whole genome association studies How universal are GWAS findings?

• Allele frequencies are different in different populations
• LD patterns across loci that co-segregate with a causally associated variant may be different from population to population
• Control for population differences is essential in large studies

Frazer et al., Nat. Rev. Genet., 2010

Associated with myocardial infarction

LD less strong in African population → bottleneck principle

Red = high pairwise SNP correlation

SNPs that efficiently (r² > 0.8) tag one another are connected

Whole genome association studies Impact so far

• No complex traits for which > 10% of the genetic variance is explained, e.g. T2D: 18 genetic variants together explain < 4% of the total trait liability
• Sample size may compensate (increased statistical power). But… studies for lipid phenotypes involving > 40,000 people still explain < 10%, and some diseases have only a low number of affected individuals
• Does the answer lie in structural variants? Most are still unmapped. But… they are likely in LD with common SNPs
• Does the answer lie in rare variants? Possibly…
• Rare variants are not in LD with tagging SNPs and thus so far undetected (Amish study)
• Can have very high penetrance
• However, how to detect on a population-wide basis?

The power of whole-genome sequencing

• Sequenced the genomes of 2 parents and their 2 children, both children affected by Miller syndrome
• Identified 3.7 million SNPs that varied within the family
• Resequenced 34,000 candidate mutations → 28 de novo mutations
• Narrowing down via “rare” assumption and knowledge of recessive inheritance
• Found one gene, dihydroorotate dehydrogenase (DHODH), known to be involved

Miller syndrome: autosomal recessive genetic trait (Roach et al., Science, 2010)


Toward the elucidation of each person’s genetic make-up Entering the age of personalized medicine

Necessary for:
1) DNA-based risk assessment for common complex disease
2) Drug discovery (new implicated genes can be identified)

But also to:
3) Identify molecular signatures for disease diagnosis and prognosis

And for:
4) A DNA-guided therapy and dose selection

A person’s genetic make-up significantly affects the efficacy of a drug

• Polymorphisms in the VKORC1 and CYP2C9 genes dictate the effective dose levels of the anticoagulant Warfarin
• Polymorphisms in the UGT1A1 gene correlate with increased toxicity of the anti-colon cancer drug Irinotecan
• Polymorphisms in the MTHFR gene are associated with increased toxicity of Methotrexate used to treat Crohn’s disease
• Polymorphisms in the CYP2D6 gene dictate the probability of relapse in women with breast cancer treated with Tamoxifen

The revolution of high-throughput sequencing: Illumina Entering the age of personalized medicine

Solid phase amplification: 1) initial priming and extending of the single-stranded, single-molecule template, and 2) bridge amplification of the immobilized template with immediately adjacent primers to form clusters.

Metzker et al., Nat. Rev. Genet., 2010


From sequence to genome: mapping reads Entering the age of personalized medicine

Trapnell and Salzberg, Nat. Biotech., 2009

Four sequences of equal length = seeds

• If 1 SNP, the other 3 seeds are intact
• If 2 SNPs, at least 2 seeds are still intact
• Thus, max 2 SNPs/read
• Limitation: indexing takes up huge memory
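The seed argument above is a pigeonhole principle, and a small sketch makes it tangible. The code splits a read into four equal seeds and reports which seeds still match the reference exactly; the 16-base sequences and the function names are made up for illustration, and the sketch ignores indels and the index lookup itself.

```python
def split_into_seeds(sequence, n_seeds=4):
    """Split a sequence into n_seeds contiguous pieces of equal length."""
    if len(sequence) % n_seeds != 0:
        raise ValueError("sequence length must be divisible by the number of seeds")
    size = len(sequence) // n_seeds
    return [sequence[i * size:(i + 1) * size] for i in range(n_seeds)]


def intact_seeds(read, reference_segment, n_seeds=4):
    """Return the indices of seeds in the read that match the reference exactly."""
    if len(read) != len(reference_segment):
        raise ValueError("read and reference segment must have the same length")
    pairs = zip(split_into_seeds(read, n_seeds),
                split_into_seeds(reference_segment, n_seeds))
    return [i for i, (r, g) in enumerate(pairs) if r == g]


reference = "ACGTACGTACGTACGT"
one_snp = "ACGTACGAACGTACGT"    # mismatch in the second seed
two_snps = "AGGTACGTAGGTACGT"   # mismatches in the first and third seeds

print(intact_seeds(one_snp, reference))   # [0, 2, 3]
print(intact_seeds(two_snps, reference))  # [1, 3]
```

Since two mismatches can spoil at most two of the four seeds, some pair of seeds always matches exactly, so an index keyed on seed pairs can still locate the read.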

• Using BW, the index for the entire human genome fits into < 2 GB of memory
• Is 30 times faster than the seed-index approach
• Also is limited to 2 SNPs within one read

Burrows-Wheeler transform

Entering the age of personalized medicine

Wikipedia

Easier to compress strings with runs of repeated characters
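For readers who want to see the transform itself, here is a minimal sketch of the Burrows-Wheeler transform and its inverse. It uses the naive sort-all-rotations construction, which is fine for a toy string but is not how read aligners build their indexes; the `$` end marker (assumed to sort before every other character) and the function names are choices made for this sketch.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    if end_marker in text:
        raise ValueError("text must not contain the end marker")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end_marker="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The output groups identical characters into runs ("nn", "aa"), which is the property that makes the transformed string easier to compress, and the transform is fully reversible.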

A first human genome project using HTS


Bentley et al., Nature, 2008

• Solexa Technology
• First: X-chromosome
• 204 million reads
• Sampling of sequence fragments is close to random (GC content has a slight effect)

A first human genome project using HTS


Bentley et al., Nature, 2008

• 135 Gb of sequence (~4 billion paired 35-base reads) (8 weeks)
• The approximate consumables cost = $250,000
• 97% of the reads were aligned using MAQ
• 99.9% of the human reference covered with ≥ 1 read at 40.6X average depth

99% agreement with HapMap results!

More human genome projects


Snyder et al., G&D, 2010

Tackling the SV problem using HTS


• Really difficult and progress is limited.
• Existing methods are based on two approaches:
• Paired-end mapping (PEM)
• Depth-of-coverage (DOC) approach

• The ends of each fragment are tagged by a biotinylated (B) nucleotide
• Circularization forms a junction between the two ends
• Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered
• Standard sequencing procedure thereafter

Tackling the SV problem using HTS: paired-end mapping


Medvedev et al., Nature Meth., 2009

Tackling the SV problem using HTS: DOC


Snyder et al., G&D, 2010 Campbell et al., Nature Genet., 2008


Snyder et al., G&D, 2010

Tackling the SV problem using HTS: state-of-the-art
