The Ashkenazi Genome Project

Shai Carmi
Pe’er lab, Columbia University

Joint Group Meeting, November 2012

Recent History of Ashkenazi Jews

• Mediterranean origin (?)
• Ca. 1000: Small communities in N. France, Rhineland
• Migration east
• Expansion
• ~10M today, mostly in US and Israel
• Relative isolation

Ashkenazi Jewish Genetics

Behar et al., Nature 2010. Bray et al., PNAS 2010. Guha et al., Genome Biology 2012.

300 Jewish individuals; SNP arrays

• Recently, AJ shown to be a genetically distinct group
• Close to Middle-Eastern & South-European populations

Price et al., PLoS Genetics 2008. Olshen et al., BMC Genetics 2008. Need et al., Genome Biology 2009. Kopelman et al., BMC Genetics 2009.

[Figure: population groups labeled AJ, Jewish non-AJ, Middle-Eastern, Europeans (Atzmon et al., AJHG 2010).]

Recent Demography & IBD
In small populations, common ancestors are likely recent.

[Figure: genealogy of individuals A and B across generations 1 to 3.]

[Figure: chromosomes of A and B with a shared segment.]

• For a g-generation ancestor, the chance of IBD at a given site decays exponentially with g, but the expected segment length is ~1/(2g) (M).
• IBD is highly informative on recent history!

Many long haplotypes identical-by-descent
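The length side of this trade-off can be sketched numerically. The snippet below assumes a standard approximation that is not spelled out on the slide: A and B are separated by 2g meioses through a common ancestor g generations back, and crossovers occur at a rate of one per Morgan per meiosis, so an inherited segment has mean length 1/(2g) Morgans. The function name is hypothetical.

```python
def expected_segment_length_cm(g: int) -> float:
    """Mean length (in cM) of a segment co-inherited by two individuals
    from a common ancestor g generations back.

    Assumption: 2*g meioses separate the two individuals, and crossovers
    fall at a rate of 1 per Morgan per meiosis, so the segment length is
    exponential with mean 1/(2*g) Morgans.
    """
    if g < 1:
        raise ValueError("g must be at least 1")
    meioses = 2 * g
    return 100.0 / meioses  # 1 Morgan = 100 cM


# Segments from recent ancestors are long; older ones shrink quickly.
for g in (1, 5, 10, 30):
    print(g, round(expected_segment_length_cm(g), 2))
```

Under this approximation, an ancestor about 30 generations back still leaves segments of roughly 1.7 cM on average, which is why long shared haplotypes point to recent common ancestry.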

Formal Inference Using IBD
• Assume a population of historical size N(t).
• Total shared segments of length ℓ: expression given in Palamara et al., AJHG 2012.

[Figure: individuals A and B with a shared segment.]

Palamara et al., AJHG 2012

IBD sharing abundant in AJ

Atzmon et al., AJHG 2012. Gusev et al., MBE 2011.

• Detect IBD in sample, then infer history N(t).
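To illustrate how IBD sharing depends on population size, here is a minimal sketch for the simplest case only: a constant-size population. It is not the general N(t) method of the cited paper. The assumptions are: two lineages coalesce with probability 1/N per generation (so N is the inverse coalescence rate, a scaling choice made here), and a segment spanning a given site whose common ancestor lived t generations ago has an Erlang-2 length with rate 2t per Morgan. Integrating over t and over lengths above a threshold u gives a closed form. The function name is hypothetical.

```python
def fraction_shared(n_eff: float, min_len_morgans: float) -> float:
    """Expected fraction of the genome that a pair shares in IBD segments
    longer than min_len_morgans, for a constant-size population.

    Derivation sketch (assumptions stated in the text above):
      p(t)      = exp(-t / N) / N                 coalescence time density
      p(l | t)  = (2t)^2 * l * exp(-2 * t * l)     Erlang-2 segment length
      integral over t >= 0 and l >= u gives
      f(u) = (1 + 4*N*u) / (1 + 2*N*u)^2
    """
    if n_eff <= 0 or min_len_morgans < 0:
        raise ValueError("n_eff must be positive and the threshold non-negative")
    x = n_eff * min_len_morgans
    return (1.0 + 4.0 * x) / (1.0 + 2.0 * x) ** 2


# Smaller populations share a larger fraction of the genome in long segments.
for n_eff in (500, 5_000, 50_000):
    print(n_eff, round(fraction_shared(n_eff, 0.03), 4))
```

The qualitative point matches the slide: the smaller the historical population, the more of the genome is covered by long shared segments, so observed sharing constrains N.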

AJ Genetic History

[Figure: effective size N vs. time t, from years ago to present. Labeled sizes: 2,300, then 45,000, then 270 (about 800 years ago), then 4,300,000 (present).]

Expansion rate ≈34% per generation

Palamara et al., AJHG 2012
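As a rough consistency check on these figures, the sketch below asks how many generations it takes to grow from 270 to 4,300,000 at 34% per generation. Two things are assumptions here: the reading of the slide labels as a bottleneck of 270 about 800 years ago followed by growth to 4,300,000, and the use of discrete compounding (size multiplied by 1.34 each generation).

```python
import math


def generations_to_grow(n_start: float, n_end: float, rate_per_gen: float) -> float:
    """Generations needed to grow from n_start to n_end when the size is
    multiplied by (1 + rate_per_gen) each generation (discrete compounding,
    an assumption of this sketch)."""
    return math.log(n_end / n_start) / math.log(1.0 + rate_per_gen)


gens = generations_to_grow(270, 4_300_000, 0.34)
years_per_gen = 800 / gens  # implied generation time if the growth spans 800 years

print(f"generations: {gens:.1f}")
print(f"implied years per generation: {years_per_gen:.1f}")
```

Under these assumptions the growth phase spans roughly 33 generations, which over 800 years corresponds to a generation time in the mid-twenties.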

High potential for genetic studies!

[Figure: power of imputation by IBD. x-axis: # of sequenced individuals (0 to 500); y-axis: % additional information potential (0% to 100%); legend: WTCCC AJ_SCZ AJUK.]

The Ashkenazi Genome Consortium

10 labs from NY area and Israel.
Goals:
• Sequence to high coverage hundreds of healthy AJ
  o Use as a reference panel for
    o Association studies
    o Imputation
    o Clinical interpretation
  o Understand AJ population history
  o Understand AJ functional genetic variation (negative/positive selection)

The Ashkenazi Genome Consortium
Phase I:
• 144 AJ personal genomes
• ~60yo, healthy controls
• Unrelated, PCA-validated AJ
• Selected to maximize sharing with rest of cohort
• Technology: Complete Genomics
• Sequenced so far: ~100 genomes
• Data presented: 58 genomes

Phase II:
• Hundreds of genomes (2013?)
• More collaborators

Quality Measures
Property                        Genome (exome)
Coverage                        ~55x
Fraction called                 96.5±0.003% (98%)
Fraction with coverage > 20x    92.4±0.018% (94.9%)
Concordance with SNP array      99.87±0.1%
Ti/Tv ratio                     2.14±0.003 (3.05)
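For readers unfamiliar with the Ti/Tv measure in the table, the following sketch shows how the ratio is computed from a list of SNPs. Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) changes; all other substitutions are transversions. The function names and the input format (a list of ref/alt base pairs) are illustrative assumptions, not the consortium's tooling.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True if ref -> alt stays within purines or within pyrimidines."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps) -> float:
    """Ratio of transition count to transversion count over (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(example_snps))  # 3 transitions, 1 transversion
```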

Variant statistics
Statistic                         Per genome (exome)
Total SNPs                        3.4M (22k)
Novel SNPs                        3.7% (4%)
Het/hom ratio                     1.64 (1.67)
Insertions count                  223k (246)
Deletions count                   237k (218)
Multi-nucleotide variant count    83k (374)
Synonymous SNPs                   10525
Non-synonymous SNPs               9695
Nonsense SNPs                     71
Other disrupting                  241
CNV count                         336
SV count                          1486
MEI count                         3475

Comparison to Europeans

[Figure: bar charts comparing TAGC and Flemish genomes. Panels: All SNPs (M; axis 3.0M to 3.6M), Het/hom (axis 1.4 to 1.6), Insertions, Deletions, MNPs (k; axis 0 to 200k), Novel SNPs (%) (axis 0 to 4; TAGC vs. EU).]

Similar results in 13 CG European public genomes.

Extrapolated to 100% genome

Het/Hom Ratio

[Figure: het/hom ratio bar chart (axis 1.4 to 1.6).]

• Het/hom ratio is higher in AJ; the difference is significant in comparison to both Flemish and HapMap EU.

• Was observed in SNP arrays (Need et al., Genome Biology 2009).

• Did I not just say that AJ have more IBD?

[Figure: timeline t (years ago to present) for AJ and EU, annotated "IBD observed" and "Het/Hom Ratio".]
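For reference, the het/hom ratio discussed on this slide can be computed from called genotypes as in the sketch below. It assumes biallelic sites coded as allele pairs (0 = reference, 1 = alternate) and counts only homozygous non-reference calls as "hom", which is a common convention but an assumption here; the names are hypothetical.

```python
def het_hom_ratio(genotypes) -> float:
    """Ratio of heterozygous calls to homozygous non-reference calls.

    genotypes: iterable of (allele1, allele2) with 0 = ref, 1 = alt.
    Homozygous-reference calls (0, 0) are ignored.
    """
    het = hom_alt = 0
    for a1, a2 in genotypes:
        if a1 != a2:
            het += 1
        elif a1 == 1:
            hom_alt += 1
    if hom_alt == 0:
        raise ValueError("no homozygous non-reference calls; ratio undefined")
    return het / hom_alt


example_genotypes = [(0, 1), (1, 1), (0, 1), (0, 0), (1, 0)]
print(het_hom_ratio(example_genotypes))  # 3 hets, 1 hom-alt
```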

Data Flow Pipeline
[Flowchart; boxes as labeled: Backup 3x; CGA tools; testVariants; VCF; Fix; Plink/Seq; QC; Plink; Compress, index; Phase; Distribute.]

Quality Control
False positive rate assessment by runs of homozygosity (ROH):
• Assume hets in high-confidence ROH are FP.
[Figure: paternal and maternal haplotypes with hets marked.]
• High-confidence ROHs only (>7.5 Mb, no gaps).
• 7 segments in 7 individuals (total 72 Mb).
• Count het SNPs in original files.
• Genome-wide extrapolation: ~20,000 per genome.
• ~3-5% FP rate for indels.
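The genome-wide extrapolation is a simple proportional scaling, sketched below. The slide does not report the raw het count inside the runs, so the sketch only shows the scaling itself and backs out the count implied by the slide's own numbers. The genome length of 3.1 Gb is an assumption of this sketch, as are the function names.

```python
GENOME_BP = 3.1e9  # assumed genome length; not stated on the slide


def extrapolate_false_positives(hets_in_roh: float, roh_bp: float,
                                genome_bp: float = GENOME_BP) -> float:
    """Scale a het count observed inside ROH to the whole genome,
    assuming false positives are spread uniformly."""
    return hets_in_roh * genome_bp / roh_bp


def implied_hets_in_roh(fp_per_genome: float, roh_bp: float,
                        genome_bp: float = GENOME_BP) -> float:
    """Inverse of the scaling: het count within ROH implied by a
    genome-wide false-positive estimate."""
    return fp_per_genome * roh_bp / genome_bp


# ~20,000 per genome over 72 Mb of ROH implies roughly this many hets in ROH:
print(round(implied_hets_in_roh(20_000, 72e6)))
```

With these inputs the implied count is a few hundred het calls across the 72 Mb of runs.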

Quality Control
Remove:
• Indels and MNPs
• Low-quality SNPs
• Multi-allelic SNPs
• Half-calls
• SNPs with high no-call rate
• SNPs not in HWE
• Monomorphic reference SNPs
• Inbred individual

FP after QC: ~5,000 per genome.
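One of the filters above, removing SNPs not in Hardy-Weinberg equilibrium (HWE), can be illustrated with a basic chi-square test on genotype counts. This is a generic textbook sketch; the slides do not say which HWE test or threshold the consortium used, so the 3.84 cutoff (chi-square with 1 degree of freedom at the 5% level) and the function names are assumptions.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations from the sample allele frequency."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites give 0.0).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def fails_hwe(n_aa: int, n_ab: int, n_bb: int, threshold: float = 3.84) -> bool:
    """Flag a SNP when the statistic exceeds the assumed threshold."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > threshold


print(hwe_chi_square(25, 50, 25))  # counts exactly at HWE proportions
print(fails_hwe(50, 0, 50))        # no heterozygotes at all
```

For small samples an exact test is usually preferred over the chi-square approximation.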

Applicability to Clinical Genomics

• Variants of unknown significance
  – Technical false positives
  – True variants without health impact

[Figure: novel variants per sample, two panels: Total (axis 0 to 140,000) and Non-synonymous (axis 0 to 600); bars: All, After QC, Not in panel (not in TAGC).]

Phasing
• Sequencing is in mate-pairs
• Haplotype information available for ~30-35% of hets.
• BEAGLE error rate: 3-4%.
• Seqphase: new phasing tool
  – Based on SHAPEIT
  – Incorporates reads
  – 18 hours on chromosome 1.

[Figure: histogram of distance between phased hets (x-axis ticks 100, 300, 500; y-axis: frequency).]

Variant Discovery
• Number of non-reference variants.

• Extrapolation using Gravel et al., PNAS 2011.

Variant Discovery
• Number of segregating sites S_n(t), heterozygosity H(t).
• Zivkovic and Stephan, Theor. Pop. Biol. 2011.
• N(t): # diploids at time t; N = N(t=0); ρ(t) = N(t)/N; n: # diploid samples
• t: # generations/2N; θ = 4Nμ; μ: mutation rate per generation
• Use double expansion model of Palamara et al., AJHG 2012.
• Define t=0 at the start of the first expansion.
• Match H(t).
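As a baseline for the quantities defined above, the sketch below computes the expected number of segregating sites for a constant-size population (ρ(t) = 1), using the classical Watterson result E[S] = θ · Σ 1/i for i from 1 to 2n − 1, where 2n is the number of sampled chromosomes. This is not the time-dependent expression of the cited paper, only the constant-size special case, and the function name is hypothetical.

```python
def expected_segregating_sites(theta: float, n_diploid: int) -> float:
    """Expected number of segregating sites in a sample of n_diploid
    individuals (2 * n_diploid chromosomes) under a constant-size,
    neutral, infinite-sites model: theta * sum_{i=1}^{2n-1} 1/i.

    theta = 4 * N * mu, with mu the per-generation mutation rate of the
    region being considered.
    """
    if n_diploid < 1:
        raise ValueError("need at least one diploid sample")
    n_chrom = 2 * n_diploid
    harmonic = sum(1.0 / i for i in range(1, n_chrom))
    return theta * harmonic


# Discovery grows only logarithmically with sample size at constant N.
for n in (1, 10, 58, 144):
    print(n, round(expected_segregating_sites(1.0, n), 3))
```

At constant size, doubling the sample adds only about θ · ln 2 new sites; an expanding population yields more rare variants than this baseline.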

Variant Discovery

?

Allele Frequency Spectrum

[Figure: allele frequency spectra; rows: All, Pop.-specific; columns: Counts, Fractions.]

Demographic Inference
• Folded allele frequency spectrum + coalescent simulations.
• Double expansion model + ancient AJ foundation bottleneck.
• Find maximum likelihood solution (Gutenkunst et al., PLoS Genet. 2009)
  – Average over simulations to obtain expected spectrum.
  – Assume mutation frequency is drawn according to expected spectrum.
  – Multinomial probability approximated as Poisson.

[Figure: % sites (log scale, 0.1 to 100) by allele frequency class.]
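The two ingredients named in this list, a folded allele frequency spectrum and a Poisson likelihood against an expected spectrum, can be sketched as follows. This is a minimal illustration under stated assumptions, not the inference code used for the talk: the input format (per-site alternate allele counts) and function names are hypothetical, and the expected spectrum would in practice come from averaged coalescent simulations.

```python
import math


def folded_sfs(alt_counts, n_chrom: int):
    """Folded spectrum: entry k counts sites whose minor allele appears
    k times among n_chrom chromosomes. Monomorphic sites are skipped."""
    sfs = [0] * (n_chrom // 2 + 1)
    for c in alt_counts:
        if 0 < c < n_chrom:
            sfs[min(c, n_chrom - c)] += 1
    return sfs


def poisson_log_likelihood(observed, expected) -> float:
    """Log-likelihood of observed counts, treating each frequency class
    as an independent Poisson draw with mean given by the expected
    spectrum. Class 0 is ignored; expected values must be positive."""
    ll = 0.0
    for k, lam in zip(observed[1:], expected[1:]):
        ll += k * math.log(lam) - lam - math.lgamma(k + 1)
    return ll


obs = folded_sfs([1, 5, 3, 3, 6, 0], n_chrom=6)
print(obs)
print(poisson_log_likelihood(obs, [0.0, 2.0, 1.0, 2.0]))
```

Maximizing this log-likelihood over model parameters, with the expected spectrum recomputed for each candidate model, is the general shape of the procedure described in the bullets.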

Demographic Inference

• Similar to Palamara et al., with somewhat larger population sizes.

• To do: Gene flow from EU; better inference tools.

[Figure: effective size N vs. time t, from years ago to present. Labeled sizes: 3,000; 90,000; 500; 7,500,000. Time marks: 5000 and 875 years ago.]

Ongoing Analysis
• Exome analysis
  – Genes w/ AJ-specific high mutation load
• Mobile element insertions
  – Frequencies of common insertions correlated with 1KG
• AJ disease genes (Ostrer & Skorecki, Human Genetics 2012)
  – Some carriers detected
  – 276 non-synonymous mutations, >65 known
  – 60 loss-of-function

[Figure: average missense allele load, TAGC vs. CEU.]
[Figure: MEI frequency, 1000 Genomes vs. TAGC; R² ≈ 0.746.]

Summary
• AJ bottleneck and expansion reveal potential for genetic studies.
• High-quality genomes sequenced by TAGC indicate utility in clinical setting.
• Complete variant discovery improves demographic inference; subtle differences from Europeans.
• Future directions:
  – Imputation power using TAGC vs. 1000 Genomes
  – Local ancestry inference
  – Effect of natural selection

Thank you!
TAGC consortium members:
Columbia University Computer Science: Itsik Pe’er, Pier Francesco Palamara
  Undergrads: Fillan Grady, Ethan Kochav, James Xue
  IT: Shlomo Hershkop
Long-Island Jewish Medical Center: Todd Lencz, Semanti Mukherjee, Saurav Guha
Columbia University Medical Center: Lorraine Clark, Xinmin Liu
Albert Einstein College of Medicine: Gil Atzmon, Harry Ostrer
Mount Sinai School of Medicine: Inga Peter, Laurie Ozelius
Memorial Sloan Kettering Cancer Center: Ken Offit, Vijai Joseph
Yale School of Medicine: Judy Cho, Ken Hui, Monica Bowen
The Hebrew University of Jerusalem: Ariel Darvasi
VIB, Gent, Belgium: Herwig Van Marck, Stephane Plaisance
Complete Genomics: Jason Laramie

Funding: Human Frontier Science Program.
