The Ashkenazi Genome Project
Shai Carmi
Pe’er lab, Columbia University
Joint Group Meeting
November 2012
Recent History of Ashkenazi Jews
• Mediterranean origin (?)
• Ca. 1000: Small communities in N. France, Rhineland
• Migration east
• Expansion
• ~10M today, mostly in US and Israel
• Relative isolation
Ashkenazi Jewish Genetics
Behar et al., Nature 2010. Bray et al., PNAS 2010. Guha et al., Genome Biology 2012.
300 Jewish individuals; SNP arrays
• Recently, AJ shown to be a genetically distinct group
• Close to Middle-Eastern & South-European populations
Price et al., PLoS Genetics 2008. Olshen et al., BMC Genetics 2008. Need et al., Genome Biology 2009. Kopelman et al., BMC Genetics 2009.
[Figure (Atzmon et al., AJHG 2010): sample groups labeled AJ, Jewish non-AJ, Middle-Eastern, Europeans]
Recent Demography & IBD
In small populations, common ancestors are likely recent.
[Figure: genealogy tracing individuals A and B back through generations 1, 2, 3; a segment shared by A and B is highlighted]
• For a common ancestor g generations back, the chance of IBD at a given locus is small and decays exponentially with g, but a shared segment is long: about 1/(2g) Morgans (M) on average.
• Many long haplotypes identical-by-descent
• IBD is highly informative on recent history!
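A minimal sketch of the length argument, under a simplified model that is an assumption here rather than something stated on the slide: a segment inherited from an ancestor g generations back has passed through 2g meioses, so its length is treated as exponential with rate 2g per Morgan, ignoring chromosome ends and length bias. The function names are illustrative.

```python
import math


def expected_ibd_length_cm(g: int) -> float:
    """Mean segment length in centimorgans for a common ancestor g generations back."""
    if g < 1:
        raise ValueError("g must be at least 1")
    meioses = 2 * g              # g meioses on each of the two lineages
    return 100.0 / meioses       # mean of 1/(2g) Morgans, converted to cM


def prob_segment_longer_than(g: int, min_cm: float) -> float:
    """P(length > min_cm) under the exponential model with rate 2g per Morgan."""
    if g < 1 or min_cm < 0:
        raise ValueError("g must be >= 1 and min_cm must be >= 0")
    return math.exp(-2 * g * min_cm / 100.0)


if __name__ == "__main__":
    for g in (5, 10, 30):
        print(g, expected_ibd_length_cm(g), prob_segment_longer_than(g, 3.0))
```

Long segments therefore point to recent ancestors, which is why IBD length carries information about recent history.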
Formal Inference Using IBD
• Assume a population of historical size N(t).
• Total sharing in segments of a given length range can be expressed in terms of N(t) (Palamara et al., AJHG 2012).
[Figure: genealogy of A and B with a shared segment]
IBD sharing abundant in AJ
Atzmon et al., AJHG 2012. Gusev et al., MBE 2011.
• Detect IBD in sample → infer history N(t).
AJ Genetic History
[Figure (Palamara et al., AJHG 2012): effective size N versus time t, from the past to the present. Effective-size labels: 2,300; 45,000; 270; 4,300,000. Time label: 800 years ago. Expansion rate ≈34% per generation.]
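A rough consistency check on the quoted expansion rate. It rests on two assumptions not stated in the transcript: that the figure labels describe exponential growth from an effective size of 270 about 800 years ago to 4,300,000 at present, and that a generation lasts 25 years.

```python
def per_generation_growth(n_start: float, n_end: float, generations: float) -> float:
    """Constant per-generation growth rate that takes n_start to n_end."""
    if n_start <= 0 or n_end <= 0 or generations <= 0:
        raise ValueError("all arguments must be positive")
    return (n_end / n_start) ** (1.0 / generations) - 1.0


YEARS_AGO = 800            # time label on the slide
GENERATION_YEARS = 25      # assumption: not given in the transcript
N_BOTTLENECK = 270         # assumed reading of the figure labels
N_PRESENT = 4_300_000      # assumed reading of the figure labels

generations = YEARS_AGO / GENERATION_YEARS
rate = per_generation_growth(N_BOTTLENECK, N_PRESENT, generations)

if __name__ == "__main__":
    print(f"{generations:.0f} generations, growth of about {rate:.0%} per generation")
```

Under these assumptions the rate comes out at roughly 35% per generation, in line with the ≈34% on the slide; a slightly different generation time shifts it by a point or two.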
High potential for genetic studies!
[Figure: power of imputation by IBD. x-axis: # of sequenced individuals (0 to 500); y-axis: % additional information potential (0% to 100%); series: WTCCC, AJ_SCZ, AJUK]
The Ashkenazi Genome Consortium
10 labs from NY area and Israel.
Goals:
• Sequence to high coverage hundreds of healthy AJ
  o Use as a reference panel for
    – Association studies
    – Imputation
    – Clinical interpretation
  o Understand AJ population history
  o Understand AJ functional genetic variation (negative/positive selection)
The Ashkenazi Genome Consortium
Phase I:
• 144 AJ personal genomes
• ~60yo, healthy controls
• Unrelated, PCA-validated AJ
• Selected to maximize sharing with rest of cohort
• Technology: Complete Genomics
• Sequenced so far: ~100 genomes
• Data presented: 58 genomes
Phase II:
• Hundreds of genomes (2013?)
• More collaborators
Quality Measures
Property                        Genome (exome)
Coverage                        ~55x
Fraction called                 96.5±0.003% (98%)
Fraction with coverage > 20x    92.4±0.018% (94.9%)
Concordance with SNP array      99.87±0.1%
Ti/Tv ratio                     2.14±0.003 (3.05)
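For reference, a minimal sketch of how a Ti/Tv ratio can be computed from a list of biallelic SNPs. The input format (pairs of reference and alternate bases) and the function names are assumptions for illustration; the project's own pipeline is not described in the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps) -> float:
    """snps: iterable of (ref, alt) single-base pairs; returns transitions / transversions."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a valid SNP: {ref}>{alt}")
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]  # toy data
    print(ti_tv_ratio(example))
```

Ti/Tv is a common sanity check on SNP call sets: random errors pull the ratio toward 0.5, since there are twice as many possible transversions as transitions.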
Variant statistics
Statistic                        Per genome (exome)
Total SNPs                       3.4M (22k)
Novel SNPs                       3.7% (4%)
Het/hom ratio                    1.64 (1.67)
Insertions count                 223k (246)
Deletions count                  237k (218)
Multi-nucleotide variant count   83k (374)
Synonymous SNPs                  10525
Non-synonymous SNPs              9695
Nonsense SNPs                    71
Other disrupting                 241
CNV count                        336
SV count                         1486
MEI count                        3475
Comparison to Europeans
[Figure: TAGC vs. Flemish genomes. Panels: all SNPs (M); het/hom ratio; insertions, deletions, and MNPs (k)]
Similar results in 13 CG European public genomes.
[Figure: novel SNPs (%), TAGC vs. EU; extrapolated to 100% genome]
Het/Hom Ratio
[Figure: het/hom ratio (axis ticks 1.4, 1.6)]
• Difference is significant in comparison to both Flemish and HapMap EU.
• Was observed in SNP arrays (Need et al., Genome Biology 2009).
• Did I not just say that AJ have more IBD?
[Figure: AJ vs. EU along a time axis t (years ago to present); annotations: "IBD observed", "Het/Hom Ratio"]
Data Flow Pipeline
[Flowchart; box labels: Backup 3x; CGA tools; VCF; testVariants; Fix Plink/Seq; QC; Plink; Compress, index; Phase; Distribute]
Quality Control
False positive rate assessment by runs of homozygosity (ROH):
• Assume hets in high-confidence ROH are FP.
[Figure: paternal and maternal haplotypes, with hets marked]
• High-confidence ROHs only (>7.5 Mb, no gaps).
• 7 segments in 7 individuals (total 72 Mb).
• Count het SNPs in original files.
• Genome-wide extrapolation: ~20,000 per genome.
• ~3-5% FP rate for indels.
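The genome-wide extrapolation can be sketched as a simple per-base scaling. The slide gives the total ROH length (72 Mb) and the result (~20,000 per genome) but not the raw het count or the genome length used, so the count of 465 and the 3.1 Gb genome size below are hypothetical values chosen only to illustrate the arithmetic.

```python
def extrapolate_false_positives(het_calls_in_roh: int,
                                roh_total_bp: int,
                                genome_bp: int) -> float:
    """Scale het calls found inside ROH (assumed false) to a whole genome."""
    if roh_total_bp <= 0 or genome_bp <= 0 or het_calls_in_roh < 0:
        raise ValueError("lengths must be positive and counts non-negative")
    fp_per_bp = het_calls_in_roh / roh_total_bp   # pooled error rate per base
    return fp_per_bp * genome_bp


ROH_TOTAL_BP = 72_000_000        # from the slide: 7 segments, 72 Mb in total
GENOME_BP = 3_100_000_000        # assumption: approximate genome length
HYPOTHETICAL_HET_CALLS = 465     # hypothetical: not reported on the slide

estimate = extrapolate_false_positives(HYPOTHETICAL_HET_CALLS, ROH_TOTAL_BP, GENOME_BP)

if __name__ == "__main__":
    print(f"about {estimate:,.0f} false-positive het calls per genome")
```

Pooling the seven segments before scaling treats the error rate as uniform along the genome, which is itself an approximation.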
Quality Control
Remove:
• Indels and MNPs
• Low-quality SNPs
• Multi-allelic SNPs
• Half-calls
• SNPs with high no-call rate
• SNPs not in HWE
• Monomorphic reference SNPs
• Inbred individual
FP after QC: ~5,000 per genome.
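To illustrate the HWE filter, here is a minimal sketch of a one-degree-of-freedom chi-square test on genotype counts. The slides do not say which test or threshold was used, so this is a stand-in technique; with a few dozen genomes an exact test would usually be preferred.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi2, p_value) for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 0.0, 1.0                  # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))    # genotype counts in exact HWE proportions
    print(hwe_chi_square(50, 0, 50))     # no heterozygotes at all
```

A SNP would be dropped when its p-value falls below a chosen cutoff; the cutoff is a project decision and is not given in the slides.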
Applicability to Clinical Genomics
• Variants of unknown significance
– Technical false positives
– True variants without health impact
[Figure: novel variants per sample, in two panels (Total; Non-synonymous), each with bars for All, After QC, and Not in panel (not in TAGC)]
Phasing
• Sequencing is in mate-pairs
• Haplotype information available for ~30-35% of hets.
• BEAGLE error rate: 3-4%.
• Seqphase: new phasing tool
– Based on SHAPEIT
– Incorporates reads
– 18 hours on chromosome 1.
[Figure: histogram of distance between phased hets (x-axis ticks 100, 300, 500); y-axis: frequency]
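The slides do not say how the phasing error rate was measured. One common metric is the switch error rate: the fraction of adjacent heterozygous sites whose relative phase is wrong. The sketch below assumes that metric and a hypothetical input format (one haplotype per individual, coded 0/1 at consecutive het sites).

```python
def switch_error_rate(true_hap, inferred_hap) -> float:
    """Fraction of adjacent het-site pairs whose relative phase differs from the truth.

    Each argument lists the alleles (0 or 1) carried by one haplotype at the
    individual's heterozygous sites, in genomic order.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same sites")
    if len(true_hap) < 2:
        raise ValueError("need at least two heterozygous sites")
    # At a het site the inferred haplotype either matches the true one or its complement.
    agree = [t == i for t, i in zip(true_hap, inferred_hap)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


if __name__ == "__main__":
    truth = [0, 1, 0, 1, 1, 0]       # toy data
    inferred = [0, 1, 1, 0, 0, 0]    # toy data with two switches
    print(switch_error_rate(truth, inferred))
```

Counting changes in agreement rather than raw mismatches means a haplotype that is flipped as a whole scores zero errors, which is the intended behavior since haplotype labels are arbitrary.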
Variant Discovery
• Number of non-reference variants.
• Extrapolation using Gravel et al., PNAS 2011.
Variant Discovery
• Number of segregating sites Sn(t), heterozygosity H(t).
• Zivkovic and Stephan, Theor. Pop. Biol. 2011.
• N(t): # diploids at time t; N=N(t=0); ρ(t)=N(t)/N; n: # diploid samples
• t: time in units of 2N generations; θ=4Nμ; μ: mutation rate per generation
• Use double expansion model of Palamara et al., AJHG 2012.
• Define t=0 at the start of the first expansion.
• Match H(t).
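As a point of reference for Sn(t) and H(t), the constant-size case (ρ(t) = 1) has simple closed forms under the standard neutral infinite-sites model: expected heterozygosity equals θ, and the expected number of segregating sites among n diploids (2n chromosomes) is θ times a harmonic sum. The sketch below covers only this baseline; it is not the time-dependent result of Zivkovic and Stephan, and the function names are illustrative.

```python
def harmonic_number(m: int) -> float:
    """Sum of 1/i for i = 1..m (0.0 when m is 0)."""
    return sum(1.0 / i for i in range(1, m + 1))


def expected_segregating_sites(theta: float, n_diploid: int) -> float:
    """Constant-size expectation: theta * sum_{i=1}^{2n-1} 1/i."""
    if n_diploid < 1 or theta < 0:
        raise ValueError("need n_diploid >= 1 and theta >= 0")
    chromosomes = 2 * n_diploid
    return theta * harmonic_number(chromosomes - 1)


def expected_heterozygosity(theta: float) -> float:
    """Constant-size expectation of pairwise differences, in the same units as theta."""
    return theta


if __name__ == "__main__":
    for n in (1, 10, 58, 500):
        print(n, round(expected_segregating_sites(1.0, n), 3))
```

In this baseline the number of segregating sites grows only logarithmically with sample size; a recent expansion adds many rare variants, so discovery keeps rising faster as more genomes are sequenced.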
Variant Discovery
[Figure; annotation: "?"]
Allele Frequency Spectrum
[Figure: allele frequency spectrum for all variants and for population-specific variants, shown as counts and as fractions]
Demographic Inference
• Folded allele frequency spectrum + coalescent simulations.
• Double expansion model + ancient AJ foundation bottleneck.
• Find maximum likelihood solution (Gutenkunst et al., PLoS Genet. 2009)
– Average over simulations to obtain expected spectrum.
– Assume mutation frequency is drawn according to expected spectrum.
– Multinomial probability approximated as Poisson.
[Figure: allele frequency spectrum; y-axis: % sites, log scale (0.1 to 100)]
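A minimal sketch of the two ingredients named above: folding a spectrum and scoring it with a Poisson likelihood. The expected spectrum would come from averaging coalescent simulations, which is not shown; the function names and input layout are assumptions for illustration, not the project's code.

```python
import math


def fold_spectrum(unfolded):
    """unfolded[k] = number of sites with k copies of the allele, k = 0..m.

    Returns folded[j] = number of sites with minor allele count j, j = 0..m//2.
    """
    m = len(unfolded) - 1
    folded = [0] * (m // 2 + 1)
    for k, count in enumerate(unfolded):
        folded[min(k, m - k)] += count
    return folded


def poisson_log_likelihood(observed, expected) -> float:
    """Sum over bins of log Poisson(observed | mean = expected)."""
    if len(observed) != len(expected):
        raise ValueError("spectra must have the same number of bins")
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            raise ValueError("expected counts must be positive")
        total += o * math.log(e) - e - math.lgamma(o + 1)
    return total


if __name__ == "__main__":
    data = fold_spectrum([0, 5, 3, 2, 0])         # toy spectrum for 4 chromosomes
    print(data)
    print(poisson_log_likelihood(data[1:], [7.0, 3.0]))
```

Maximum likelihood inference then amounts to searching over demographic parameters for the expected spectrum that maximizes this score.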
Demographic Inference
• Similar to Palamara et al., with somewhat larger population sizes.
• To do: Gene flow from EU; better inference tools.
[Figure: inferred effective size N versus time t, from the past to the present. Effective-size labels: 3,000; 90,000; 500; 7,500,000. Time labels (years ago): 5000; 875.]
Ongoing Analysis
• Exome analysis
– Genes w/ AJ-specific high mutation load
• Mobile elements insertion
– Common insertions frequencies correlated with 1KG
• AJ disease genes (Ostrer & Skorecki, Human Genetics 2012)
– Some carriers detected
– 276 non-synonymous mutations, >65 known
– 60 loss-of-function
[Figure: average missense allele load, TAGC vs. CEU]
[Figure: MEI frequency, 1000 Genomes vs. TAGC; R² ≈ 0.75]
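The R² reported for the MEI frequency comparison is presumably the squared Pearson correlation of a linear fit, as spreadsheet trendlines report it; that reading is an assumption. A minimal sketch on toy data:

```python
def r_squared(x, y) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Toy frequencies, not project data.
    freq_panel_a = [0.0, 0.25, 0.5, 0.75]
    freq_panel_b = [0.1, 0.2, 0.6, 0.7]
    print(r_squared(freq_panel_a, freq_panel_b))
```

Here each point would be one insertion, with its frequency in one panel on each axis.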
Summary
• AJ bottleneck and expansion reveal potential for genetics studies.
• High quality genomes sequenced by TAGC indicate utility in clinical setting.
• Complete variant discovery improves demographic inference; subtle differences from Europeans.
• Future directions:
– Imputation power using TAGC vs. 1000Genomes
– Local ancestry inference
– Effect of natural selection
Thank you!
TAGC consortium members:
Columbia University Computer Science: Itsik Pe’er, Pier Francesco Palamara
Undergrads: Fillan Grady, Ethan Kochav, James Xue
IT: Shlomo Hershkop
Long-Island Jewish Medical Center: Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center: Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine: Gil Atzmon, Harry Ostrer
Mount Sinai School of Medicine: Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center: Ken Offit, Vijai Joseph
Yale School of Medicine: Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem: Ariel Darvasi
VIB, Gent, Belgium: Herwig Van Marck, Stephane Plaisance
Complete Genomics: Jason Laramie
Funding: Human Frontier Science Program.