Human genetic variation and its contribution to complex traits
Deplancke Lab
Monica Albarca
Jean-Daniel Feuz
Carine Gubelmann
Korneel Hens
Alina Isakova
Irina Krier
Andreas Massouras
Sunil Raghav
Jovan Simicevic
Sebastian Waszak
Wiebke Westhall
You?
deplanckelab.epfl.ch
Human genetic variation and its contribution to complex traits
Laboratory of Systems Biology and Genetics
26 June 2000
Bart Deplancke
The human genome
First announcement
• June 2000: first announcement of a working draft (haplotype!), with the Nature and Science papers in February 2001
• June 2001: finished chromosome 20, with others following until the finishing of chromosome 1 in May 2006
International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Venter et al. (2001) Science 291:1304-1351.
Gregory et al. (2006), Nature, 441, 315-321
James Kent (UCSC) Eugene Myers (Celera)
Why are we so phenotypically different?
Classes of human genetic variation
Common versus rare: refers to the frequency of the minor allele in the human population:
• Common variants = minor allele frequency (MAF) > 1% in the population. Also described as polymorphisms.
• Rare variants = MAF < 1%
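The common/rare split above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function names are hypothetical, the site is assumed to be biallelic, and a MAF of exactly 1% (which the slide leaves undefined) is arbitrarily treated as rare.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the less common allele at a biallelic site."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    # For a biallelic site, the second most common allele is the minor allele.
    minor_count = counts.most_common()[1][1]
    return minor_count / len(alleles)

def classify_variant(maf, threshold=0.01):
    """Label a variant using the 1% MAF cut-off from the slide."""
    # Assumption: exactly 1% is called "rare"; the slide does not say.
    return "common" if maf > threshold else "rare"

# 100 sampled chromosomes, 3 of which carry the T allele.
sample = ["G"] * 97 + ["T"] * 3
maf = minor_allele_frequency(sample)
label = classify_variant(maf)
```

Here `maf` is 0.03, so the variant would be called common.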
Neutrality:
• The vast majority of genetic variants are likely neutral = no contribution to phenotypic variation.
• Some may reach significant frequencies, but this is due to chance.
Two different nucleotide composition classes:
• Single nucleotide variants
• Structural variants
Single nucleotide variants
ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
ATTGCAATCCGTGG...ATCGAGCCA…TACGATTGCACGCCG…
ATTGCAAGCCGTGG...ATCTAGCCA…TACGATTGCAAGCCG…
T/G T/G A/C
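The read-out above amounts to scanning aligned sequences column by column and reporting positions where more than one base occurs. A minimal sketch, using the first segment of the five sequences shown on the slide; the function name is hypothetical and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter

def variable_sites(aligned_sequences):
    """Return {0-based position: allele counts} for columns with >1 allele."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for position in range(length):
        counts = Counter(seq[position] for seq in aligned_sequences)
        if len(counts) > 1:
            sites[position] = counts
    return sites

# First segment of the five sequences on the slide.
haplotypes = [
    "ATTGCAATCCGTGG",
    "ATTGCAAGCCGTGG",
    "ATTGCAAGCCGTGG",
    "ATTGCAATCCGTGG",
    "ATTGCAAGCCGTGG",
]
snvs = variable_sites(haplotypes)
```

For this segment the only variable column is the eighth base (index 7), which carries T in two sequences and G in three, matching the first T/G on the slide.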
Simple 5’ to 3’ read-out
How are SNPs detected?
Unique oligonucleotide primers to generate minimally overlapping long-range PCR products of 10-kb average length
High-density oligonucleotide arrays
Flanking issues
Chee et al., Science, 1996
How are SNPs detected? Other strategies
Reduced representation shotgun sequencing followed by genomic alignment
From Rothberg et al. Nature Biotech, 2001
Clustered alignment
Gene-centric studies
Reference sequence
The SNP database - dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/
Three “out of Africa” genomes:
• 1.2 million (67%) (all three), 1.7 million (52%) (any two), 1.0 million (30%) unique
• Overall, 5.2 million SNPs in the three genomes, the majority being present in dbSNP
• Data indicate that most SNVs are common rather than rare
Single nucleotide variants
• It is estimated that the human genome contains > 11 million SNPs (~7 million with MAF > 5%, the rest between 1-5%).
• Unknown how many rare or even novel (“de novo”) SNVs
• SNP alleles in the same genomic interval are often correlated with one another: “linkage disequilibrium (LD)” = nonrandom association of alleles; it varies in a complex and unpredictable manner across the genome and between different populations.
• International HapMap Project: can we divide the genome into groups of highly correlated SNPs that are generally inherited together = “LD bins”?
Number of tag SNPs required to capture common Phase II SNPs
Pairwise linkage disequilibrium (LD) is measured as r2 (if r2 = 1, the SNPs are statistically indistinguishable)
Based on genotyping over 3.1 million SNPs in 270 individuals from 4 geographically diverse populations (Frazer et al., Nature, 2007)
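As an illustration of the pairwise r2 statistic, the sketch below computes it from phased haplotypes using the usual definition r2 = D^2 / (pA(1-pA) pB(1-pB)), with D = pAB - pA pB. The function name and the input layout (one pair of alleles per haplotype) are assumptions made for this example.

```python
def r_squared(haplotypes, allele_a, allele_b):
    """Pairwise LD (r^2) between two SNPs from phased haplotypes.

    haplotypes: list of (allele at SNP 1, allele at SNP 2) tuples.
    allele_a / allele_b: the allele tracked at SNP 1 / SNP 2.
    """
    n = len(haplotypes)
    p_a = sum(1 for a, _ in haplotypes if a == allele_a) / n
    p_b = sum(1 for _, b in haplotypes if b == allele_b) / n
    p_ab = sum(1 for a, b in haplotypes if a == allele_a and b == allele_b) / n
    d = p_ab - p_a * p_b  # deviation from independence
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator

# The first two SNPs of the five example sequences: T always travels with G.
linked = [("T", "G"), ("G", "T"), ("G", "T"), ("T", "G"), ("G", "T")]
# Four haplotypes in which the two SNPs are independent.
unlinked = [("T", "G"), ("T", "T"), ("G", "G"), ("G", "T")]

r2_linked = r_squared(linked, "T", "G")
r2_unlinked = r_squared(unlinked, "T", "G")
```

The linked pair gives an r2 of 1 (up to floating-point rounding), so either SNP can stand in for the other; the unlinked pair gives 0.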
Recap
By genotyping the DNA sample of an individual with a “tagging” SNP from each LD bin, knowledge regarding 80% of SNPs with a MAF > 5% across the genome is gained. (Frazer et al., Nature Rev. Genetic., 2010)
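One way to picture tag SNP selection is a greedy pass over a pairwise r2 matrix: repeatedly pick the SNP that covers the most still-untagged SNPs at r2 ≥ 0.8. This is a hedged sketch of a common heuristic, not necessarily the procedure used by HapMap; the function name and the toy matrix are hypothetical.

```python
def pick_tag_snps(r2, threshold=0.8):
    """Greedy tag SNP selection from a symmetric pairwise r^2 matrix."""
    untagged = set(range(len(r2)))
    tags = []

    def covered_by(i):
        # SNPs still untagged that SNP i would capture (itself included).
        return {j for j in untagged if j == i or r2[i][j] >= threshold}

    while untagged:
        # Ties are broken in favour of the lowest SNP index.
        best = max(sorted(untagged), key=lambda i: len(covered_by(i)))
        tags.append(best)
        untagged -= covered_by(best)
    return tags

# Toy matrix: SNPs 0-2 form one LD bin, SNPs 3-4 another.
r2_matrix = [
    [1.00, 0.90, 0.85, 0.10, 0.10],
    [0.90, 1.00, 0.95, 0.10, 0.20],
    [0.85, 0.95, 1.00, 0.00, 0.10],
    [0.10, 0.10, 0.00, 1.00, 0.90],
    [0.10, 0.20, 0.10, 0.90, 1.00],
]
tag_snps = pick_tag_snps(r2_matrix)
```

With this matrix two tags (SNPs 0 and 3) are enough to capture all five SNPs, which is the saving that makes a 500,000-SNP array informative about millions of variants.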
Scan Entire Genome - 500,000 SNPs
Querying human genetic variation
Population Stratification
Subdivision of a population into different ethnic groups with potentially different marker allele frequencies and thus different disease prevalence
Principal Component Analysis (PCA) reveals the SNP vectors explaining the largest variation in the data
From Sven Bergmann, UNIL
Ethnic groups cluster according to geographic distances
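The idea behind the PCA plots can be sketched without any library: centre the genotype matrix per SNP and find the direction of largest variance by power iteration. This is a toy illustration under stated assumptions (made-up genotypes coded as 0/1/2 minor-allele counts, a hypothetical function name, only the first component); real analyses use dedicated linear algebra or genetics software.

```python
import math

def first_pc_scores(genotypes, iterations=100):
    """Project individuals onto the first principal component.

    genotypes: one row per individual, one minor-allele count (0/1/2) per SNP.
    Assumes at least one SNP varies between individuals.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Centre each SNP column on its mean genotype.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    def project(vector):
        return [sum(x * w for x, w in zip(row, vector)) for row in centred]

    # Power iteration on X^T X; the start vector is arbitrary but non-zero.
    loadings = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = project(loadings)
        loadings = [sum(centred[i][j] * scores[i] for i in range(n))
                    for j in range(m)]
        norm = math.sqrt(sum(w * w for w in loadings))
        loadings = [w / norm for w in loadings]
    return project(loadings)

# Made-up data: three individuals from each of two populations that differ
# in allele frequency at all four SNPs.
population_a = [[2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1]]
population_b = [[0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1]]
pc1 = first_pc_scores(population_a + population_b)
```

The two populations end up on opposite sides of zero along PC1. The sign of a principal component is arbitrary, so only the separation is meaningful.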
[Figure: two PCA scatter plots, axes PC1 and PC2]
Population Stratification
From Sven Bergmann, UNIL
PCA of POPRES cohort
Population Stratification
From Sven Bergmann, UNIL
A classic that opened the door to structural variant research:
Structural variants
(Frazer et al., Nature Rev. Genetic., 2010)
Sebat et al. Large-Scale Copy Number Polymorphism in the Human Genome. Science, 2004.
Used ROMA technique to detect copy number variants
1) Genome digestion
2) Adapters to sticky ends and PCR amplification
3) After PCR, representations of the entire genome (restriction fragments) are amplified to pronounce relative increases, decreases or preserve equal copy number in the two genomes.
4) Representations of the two different genomes are labeled with different fluorophores and co-hybridized to a microarray with probes specific to restriction site locations across the entire human genome.
Representational Oligonucleotide Microarray Analysis (ROMA)
On average, individuals (20 tested) differed by 11 CNPs (average length = 465 kb) affecting 70 genes.
Our ability to detect SVs is still very poor (see later)
Structural variants (SVs)
(Frazer et al., Nature Rev. Genetic., 2010)
Structural variants (SVs)
Fosmid-based library sequencing of 8 humans (4 Yoruba and 4 non-African) (Kidd et al., Nature, 2008)
• 1 million fosmid clones/individual
• Both ends of each clone insert were sequenced, giving a pair of high-quality end sequences (termed an end-sequence pair (ESP); ~450 bp/sequence)
Only SVs over 8 kb can be detected
~2,000 SVs that were experimentally verified
Novel sequence (either in gaps (black) or not (orange))
• 50% of SVs were seen in more than one individual
• Nearly half lay outside regions of the genome previously described as structurally variant
• 525 new insertion sequences
• SVs make up 20% of all genetic variants but cover >70% of nucleotide variation
• SVs account for between 9 and 25 Mb (~0.5-1% of the genome)
• The majority of SVs are yet to be discovered
Regions of increased SNV density
Structural variants and linkage disequilibrium
McCarroll et al., Nature Genet., 2008
• Most common, diallelic CNPs (with MAF greater than 5%) were perfectly captured (r2 = 1.0) by at least one SNP tag from HapMap Phase II
• Mean r2 as a function of distance from a polymorphism is indistinguishable for SNPs and diallelic CNPs, suggesting that common, diallelic CNPs are ancestral mutations
Common SVs are in LD with tagging SNPs
Contribution of variants to phenotypes?
Common versus rare
“Common disease – common variant hypothesis”
versus
Common complex traits are the summation of low-frequency, high-penetrance variants
OR = odds ratio; PAR = population attributable risk = measure of the multifactorial inherited component of a disease
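For orientation, the two effect measures named above can be computed as follows. This is a sketch under stated assumptions: the counts are made up, the function names are hypothetical, and the PAR uses Levin's formula, which is one standard definition and may not be the exact one intended on the slide.

```python
def odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio from a 2x2 table of allele (or exposure) counts."""
    return (case_risk * control_other) / (case_other * control_risk)

def population_attributable_risk(risk_frequency, relative_risk):
    """Levin's formula: fraction of cases attributable to the risk factor."""
    excess = risk_frequency * (relative_risk - 1)
    return excess / (1 + excess)

# Made-up counts: risk allele seen 60 times in cases and 40 times in controls,
# out of 100 chromosomes each.
example_or = odds_ratio(60, 40, 40, 60)
# Made-up values: risk allele frequency 30%, relative risk 1.5.
example_par = population_attributable_risk(0.30, 1.5)
```

These numbers give an odds ratio of 2.25 and a PAR of about 0.13. A common variant with a modest odds ratio can still have a sizeable PAR because so many people carry it.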
Whole Genome Association studies
How significant is this?
P-value
Note: “Genome-wide” is a misnomer
• 20% of common SNPs not or only partially tagged
• Rare variants not tagged at all
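To make "how significant is this?" concrete, the sketch below runs a simple allelic chi-square test (1 degree of freedom) on a made-up 2x2 table and compares the P-value with a Bonferroni threshold for 500,000 tests. The function names, the counts and the choice of Bonferroni correction are assumptions for illustration; real studies also correct for stratification and use dedicated software.

```python
import math

def allelic_chi_square(case_risk, case_other, control_risk, control_other):
    """Pearson chi-square (1 df) and P-value for a 2x2 allele count table."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def bonferroni_threshold(alpha=0.05, n_tests=500_000):
    """Per-SNP significance threshold when n_tests SNPs are tested."""
    return alpha / n_tests

chi2, p_value = allelic_chi_square(60, 40, 40, 60)
threshold = bonferroni_threshold()
minus_log10_p = -math.log10(p_value)  # the y-axis of an association plot
genome_wide_significant = p_value < threshold
```

This toy SNP has a chi-square of 8 and a P-value near 0.005, which looks convincing for a single test but is far from the roughly 1e-7 required after correcting for 500,000 SNPs.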
Whole genome association studies
• Scan entire genome - 500,000 SNPs
• Identify local regions of interest, examine genes, SNP density, regulatory regions, etc.
• Replicate the finding
[Figure: association plots with -log10(p) on the y-axis]
From Sven Bergmann, UNIL
Concept
McCarthy et al., Nature Rev. Genet., 2008
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
Whole Genome Association studies
Visualization
[Figure: genome scan of 500,000 SNPs with -log10(p) on the y-axis; regions of interest are identified and examined, then the finding is replicated]
Whole genome association studies
Concept
From Sven Bergmann (UNIL)
Whole genome association studies
An avalanche of GWA studies
• Since 2006, >220 studies have been reported to date
• For over 80 phenotypes, 300 loci have been implicated
• Most implicated loci were identified for the first time (no prior knowledge)
Whole genome association studies
Type 2 diabetes: an example
• 18 genomic intervals, with 4 containing previously implicated genes
• Major message: the molecular diversity of T2D genes was not anticipated, thus:
(Patients with the same disease) ≠ (patients with the same underlying biological disorder)
Frazer et al., Nat. Rev. Genet., 2010
Whole genome association studies
Overlap of genetic risk factor loci for common diseases
• 15 loci are associated with two or more diseases (8 are shown)
• Not necessarily the same impact (PTPN22: + for Crohn’s, - for other autoimmune (ai) diseases)
• Different diseases may have similar molecular underpinnings
  • Expected: ai diseases (same clinical features)
  • Unexpected: e.g. GCKR in both triglyceride (TGC) levels and ai disease
Frazer et al., Nat. Rev. Genet., 2010
Whole genome association studies
From association to molecular mechanism
• Very difficult:
  • What are the precise variants associated with a trait?
  • If located in exons: easy, but outside, then what?
  • Most are located outside exons! (e.g. 9p21 <-> myocardial infarction is located 150 kb from the nearest gene!)
• May have a regulatory function, i.e. control gene expression
• Humans are heterozygous at more functional cis-regulatory sites than at amino acid positions, with 10,700 functional biallelic cis-regulatory polymorphisms in a typical human (Rockman and Wray, Mol. Biol. Evol., 2002: 19, 1991).
• 34% of promoter polymorphisms (170 tested) significantly modulated reporter gene expression (>1.5-fold) (Hoogendoorn et al., Hum. Mol. Genet., 2003: 12, 2249).
• Case study with the CC chemokine receptor 5 (CCR5), a major chemokine coreceptor of HIV-1 necessary for viral entry into cells
  • G to A SNP of CCR5 at –2459 nt
  • CCR5 density: low (homozygous GG), intermediate (GA), and highest (homozygous –2459AA) (Salkowitz et al., Clin. Immunol., 2003: 108, 234).
Whole genome association studies Mapping eQTLs
• Transcript abundance = a quantitative trait that can be mapped with considerable power = eQTLs
Environment Genetics
Heritability (H2) = genetic variance over total trait variance with 0 = no genetic effects and 1 = all variance is under genetic control
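The definition above is a simple ratio, sketched here under the simplifying assumption that total trait variance is just genetic plus environmental variance (no interaction or covariance terms). The function name and the numbers are hypothetical.

```python
def broad_sense_heritability(genetic_variance, environmental_variance):
    """H2 = genetic variance / total trait variance."""
    # Assumption: total variance = genetic + environmental variance only.
    total_variance = genetic_variance + environmental_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return genetic_variance / total_variance

h2_partial = broad_sense_heritability(3.0, 7.0)
h2_none = broad_sense_heritability(0.0, 5.0)   # no genetic effects
h2_full = broad_sense_heritability(5.0, 0.0)   # all variance is genetic
```

A transcript with `h2_partial` of 0.3 would sit right at the H2 > 0.3 cut-off used later in the eQTL slides.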
Classic paper: Schadt et al., Nature, 2003: Genetics of gene expression surveyed in maize, mouse and man
• Liver tissues from 111 F2 mice (constructed from C57BL/6J and DBA/2J)
• Microarray analysis of 23,574 genes: 7,861 significantly differentially expressed (either in the parental strains or in at least 10% of the F2 mice)
• eQTL identification (log of the odds ratio (LOD) > 4.3 (P-value < 0.00005)) for 2,123 genes
• These eQTLs explained 25% of the transcription variation of the corresponding genes
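The link between the LOD threshold and the quoted P-value can be reproduced with a short calculation. The source does not say how the P-value was derived; the sketch assumes the likelihood-ratio statistic 2·ln(10)·LOD follows a chi-square distribution with 2 degrees of freedom (additive plus dominance effect in an F2 cross), in which case the pointwise P-value is simply 10^(-LOD). Function names are hypothetical.

```python
import math

def lod_to_chi_square(lod):
    """Convert a LOD score (log10 likelihood ratio) to a likelihood-ratio statistic."""
    return 2 * math.log(10) * lod

def pointwise_p_value(lod, degrees_of_freedom=2):
    """Pointwise P-value of a LOD score for 1 or 2 degrees of freedom."""
    chi2 = lod_to_chi_square(lod)
    if degrees_of_freedom == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if degrees_of_freedom == 2:
        return math.exp(-chi2 / 2)  # equals 10 ** -lod
    raise ValueError("only 1 or 2 degrees of freedom are handled in this sketch")

p_at_threshold = pointwise_p_value(4.3)
```

Under that assumption a LOD of 4.3 corresponds to a P-value of about 0.00005, matching the threshold on the slide.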
Whole genome association studies Mapping eQTLs
Schadt et al., Nature, 2003
% eQTL across 920 evenly spaced bins, each 2 cM wide
• Several hotspots (>1% of detected eQTLs are located within a 4 cM interval)
• 40% of genes with ≥ 1 eQTL (LOD > 3.0) had more than one eQTL, and close to 4% of such genes had more than three eQTL
Gene expression = complex trait
Whole genome association studies Mapping eQTLs
Schadt et al., Nature, 2003
Known polymorphisms between the two parental strains
• Overlap between polymorphism and eQTL = cis-acting transcriptional regulation
For example:
• The C5 gene: a 2 bp deletion in the coding region in DBA mice results in rapid transcript decay compared with B6. A LOD of 27.4 centred over the C5 gene on chromosome 2 is readily detected (black curve).
• The Alad gene is present in 2 copies in DBA
Whole genome association studies Mapping eQTLs
Schadt et al., Nature, 2003
Combining clinical, gene expression and genetic factors
• Classical QTLs for FPM: 4 significant loci
• Further analyses with subgroups: additional loci identified
• Some QTLs only affect a subset of the F2 population, demonstrating the complexity underlying traits such as obesity
Whole genome association studies Mapping eQTLs
Dixon et al., Nature Genet., 2007: A genome-wide association study of global gene expression
• 206 families of British descent using immortalized lymphoblastoid cell lines (LCLs) from 400 children (Affy microarrays; 54,675 transcripts ~ 20,599 genes)
~15,000 transcripts with H2 > 0.3
Gene Ontology descriptors for:
• Response to unfolded protein (HSFs, chaperones)
• Immune responses and apoptosis
• Regulation of progression through the cell cycle
• RNA processing and DNA repair
• Trans effects are weaker than those in cis
• Nevertheless, significant trans associations were detected, e.g.:
  1) ~700 transcripts with the peak of association on the same chromosome but >100 kb from the nearest transcribed gene
  2) 10,382 transcripts for which the peak of association was on a different chromosome
Whole genome association studies Mapping eQTLs
Libioulle et al., PLOS Genet., 2007
Using eQTLs to better understand GWAS results
GWAS for Crohn’s disease
• Association signal in a 1.25 Mb gene desert
• One of the neighboring genes, PTGER4, may be involved
• Trace eQTLs in LCL data
• Disease-associated polymorphisms may be regulating PTGER4 expression in cis, but from >250 kb away: more research needed, but likely a regulatory polymorphism
Whole genome association studies Mapping eQTLs
Stranger et al., Science, 2007: Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes
We looked at SNPs, but what about structural variants?
• LCLs of 210 unrelated HapMap individuals from four populations
• Copy number variants were identified via CGH against a common reference individual
SNP CNV
From probe associated with linked gene From probe associated with linked gene
• SNPs and CNVs captured 83.6% and 17.7% of the total detected genetic variation in gene expression, respectively
• Associated SNPs lie close to their respective genes, less so for CNVs
• Little overlap between SNP and CNV associations (only 20%)
• Not “mere” gene dosage effects
Whole genome association studies How universal are GWAS findings?
• Allele frequencies are different in different populations
• LD patterns across loci that co-segregate with a causally associated variant may be different from population to population
• Control for population differences is essential in large studies
Frazer et al., Nat. Rev. Genet., 2010
Associated with myocardial infarction
LD less strong in African population (bottleneck principle)
Red = high pairwise SNP correlation
SNPs that efficiently (r2 > 0.8) tag one another are connected
Whole genome association studies
Impact so far
• No complex traits for which > 10% of the genetic variance is explained, e.g. T2D: 18 genetic variants together explain < 4% of the total trait liability
• Sample size may compensate (increased statistical power). But… studies for lipid phenotypes involving >40,000 people still explain <10% … and some diseases have only a low number of affected individuals
• Does the answer lie in structural variants? Most are still unmapped. But… they are likely in LD with common SNPs
• Does the answer lie in rare variants? Possibly…
  • Rare variants are not in LD with tagging SNPs and thus so far undetected (Amish study)
  • Can have very high penetrance
  • However, how to detect on a population-wide basis?
The power of whole-genome sequencing
Miller syndrome: autosomal recessive genetic trait (Roach et al., Science, 2010)
• Sequenced the genomes of 2 parents and their 2 children, both affected by Miller syndrome
• Identified 3.7 million SNPs that varied within the family
• Resequenced 34,000 candidate mutations, yielding 28 de novo mutations
• Narrowed down via the “rare” assumption and knowledge of recessive inheritance
• Found one gene, dihydroorotate dehydrogenase (DHODH), known to be involved
Whole genome association studies
Toward the elucidation of each person’s genetic make-up Entering the age of personalized medicine
Necessary for:
1) DNA-based risk assessment for common complex disease
2) Drug discovery (new implicated genes can be identified)
But also to:
3) Identify molecular signatures for disease diagnosis and prognosis
And for:
4) A DNA-guided therapy and dose selection
A person’s genetic make-up significantly affects the efficacy of a drug:
• Polymorphisms in the VKORC1 and CYP2C9 genes dictate the effective dose levels of the anti-coagulant Warfarin
• Polymorphisms in the UGT1A1 gene correlate with increased toxicity of the anti-colon cancer drug Irinotecan
• Polymorphisms in the MTHFR gene are associated with increased toxicity of Methotrexate used to treat Crohn’s disease
• Polymorphisms in the CYP2D6 gene dictate the probability of relapse in women with breast cancer treated with Tamoxifen
The revolution of high-throughput sequencing: Illumina Entering the age of personalized medicine
Solid phase amplification: 1) initial priming and extending of the single-stranded, single-molecule template, and 2) bridge amplification of the immobilized template with immediately adjacent primers to form clusters.
Metzker et al., Nat. Rev. Genet., 2010
From sequence to genome: mapping reads Entering the age of personalized medicine
Trapnell and Salzberg, Nat. Biotech., 2009
Each read is split into four sequences of equal length = seeds
• If 1 SNP, the other 3 seeds are intact
• If 2 SNPs, at least 2 seeds are intact
• Thus, max 2 SNPs/read
Limitation: indexing takes up huge memory
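The seed argument is a pigeonhole principle: k mismatches can damage at most k of the four seeds, so with up to two mismatches at least two seeds still match exactly and can be looked up in an index. A minimal sketch with hypothetical function names and made-up sequences; it assumes the read length is a multiple of the number of seeds and ignores indels.

```python
def split_into_seeds(sequence, n_seeds=4):
    """Cut a sequence into n_seeds consecutive pieces of equal length."""
    size = len(sequence) // n_seeds
    return [sequence[i * size:(i + 1) * size] for i in range(n_seeds)]

def intact_seeds(read, reference_window, n_seeds=4):
    """Indices of seeds that match the reference window exactly."""
    read_seeds = split_into_seeds(read, n_seeds)
    reference_seeds = split_into_seeds(reference_window, n_seeds)
    return [i for i, (r, g) in enumerate(zip(read_seeds, reference_seeds))
            if r == g]

reference = "ACGTACGTTTGACCAA"
# Same 16 bases with two mismatches (positions 2 and 10, 1-based).
read = "AGGTACGTTAGACCAA"
matching = intact_seeds(read, reference)
```

Here the mismatches fall in the first and third seeds, so the second and fourth seeds (indices 1 and 3) remain intact and are enough to anchor the read.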
• Using BW, the index for the entire human genome fits into < 2 GB of memory
• Is 30 times faster than seed indexing
• Also is limited to 2 SNPs within one read
Burrows-Wheeler transform
Entering the age of personalized medicine
Wikipedia
Easier to compress strings with runs of repeated characters
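For readers who have not met the transform, here is a naive sketch of the Burrows-Wheeler transform and its inverse, built from sorted rotations. It is for illustration only: the function names are hypothetical, "$" is assumed not to occur in the text, and real aligners build the transform from suffix arrays and search it with an FM-index rather than sorting rotations.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(char + row for char, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform of "banana" is "annb$aa": identical characters end up next to each other, which is why the output compresses well, and the transform is fully reversible.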
A first human genome project using HTS
Entering the age of personalized medicine
Bentley et al., Nature, 2008
• Solexa Technology
• First: X-chromosome
• 204 million reads
• Sampling of sequence fragments is close to random (GC content has a slight effect)
A first human genome project using HTS
Entering the age of personalized medicine
Bentley et al., Nature, 2008
• 135 Gb of sequence (~4 billion paired 35-base reads) (8 weeks)
• The approximate consumables cost = $250,000
• 97% of the reads were aligned using MAQ
• 99.9% of the human reference covered with ≥ 1 reads at 40.6X
99% agreement with HapMap results!
More human genome projects
Entering the age of personalized medicine
Snyder et al., G&D, 2010
Tackling the SV problem using HTS
Entering the age of personalized medicine
• Really difficult and progress is limited.
• Existing methods are based on two approaches:
  • Paired-end mapping (PEM)
  • Depth-of-coverage (DOC) approach
Paired-end library preparation:
• The ends of each fragment are tagged by a biotinylated (B) nucleotide
• Circularization forms a junction between the two ends
• Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered
• Standard sequencing procedure thereafter
Tackling the SV problem using HTS: paired-end mapping
Entering the age of personalized medicine
Medvedev et al., Nature Meth., 2009
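The logic of paired-end mapping can be sketched as a comparison between where the two ends of a fragment map on the reference and what the library's insert size predicts. This is a simplified illustration with hypothetical names and thresholds; real callers cluster many discordant pairs before reporting a variant.

```python
def classify_read_pair(mapped_span, orientation_ok, expected_insert, tolerance):
    """Interpret one read pair relative to the expected insert size.

    mapped_span: distance between the two ends on the reference (bp).
    orientation_ok: True if the ends map in the expected relative orientation.
    """
    if not orientation_ok:
        return "possible inversion"
    if mapped_span > expected_insert + tolerance:
        # Ends map too far apart: the sample lacks sequence present in the reference.
        return "possible deletion"
    if mapped_span < expected_insert - tolerance:
        # Ends map too close: the sample carries sequence absent from the reference.
        return "possible insertion"
    return "concordant"

# Hypothetical library with 3 kb inserts and a tolerance of 300 bp.
calls = [
    classify_read_pair(3100, True, 3000, 300),
    classify_read_pair(8000, True, 3000, 300),
    classify_read_pair(1500, True, 3000, 300),
    classify_read_pair(3000, False, 3000, 300),
]
```

The four pairs are labelled concordant, possible deletion, possible insertion and possible inversion, in that order.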
Tackling the SV problem using HTS: DOC
Entering the age of personalized medicine
Snyder et al., G&D, 2010 Campbell et al., Nature Genet., 2008
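The depth-of-coverage idea can be sketched by comparing read counts per genomic window between a sample and a reference: a clearly higher count suggests a copy number gain, a clearly lower count a loss. The window counts, the pseudocount and the ±0.5 log2 thresholds below are hypothetical, and the sketch assumes both genomes were sequenced to the same overall depth (no normalization step).

```python
import math

def copy_number_calls(sample_depth, reference_depth, log2_threshold=0.5):
    """Call gain/loss/neutral per window from the log2 depth ratio."""
    calls = []
    for sample, reference in zip(sample_depth, reference_depth):
        # A pseudocount of 0.5 avoids division by zero in empty windows.
        ratio = math.log2((sample + 0.5) / (reference + 0.5))
        if ratio > log2_threshold:
            calls.append("gain")
        elif ratio < -log2_threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Made-up read counts in six consecutive windows.
sample_windows = [100, 98, 150, 152, 49, 51]
reference_windows = [100, 100, 100, 100, 100, 100]
window_calls = copy_number_calls(sample_windows, reference_windows)
```

Windows three and four (about 1.5 times the reference depth) are called gains, and windows five and six (about half the depth) are called losses.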
Entering the age of personalized medicine
Snyder et al., G&D, 2010
Tackling the SV problem using HTS: state-of-the-art