Casey M. Bergman
Faculty of Life Sciences, University of Manchester
bioinf.manchester.ac.uk/bergman
Population genomics of Drosophila transposable elements.
Daborn et al. (2002) Science 297:2253-2256
TE insertion is the causal mutation for insecticide resistance in D. melanogaster
Parallel TE insertions increase Cyp6g1 expression in D. melanogaster and in D. simulans
[Figure: putative XRE binding sites (labelled Barbie box, Oct-1, Ahr-, CHOP) in (a) the 5’ flank of Cyp6g1 and (b) the Doc genomic sequence. Panel (a) marks the CG8447 stop codon, the CG8447 transcription stop and the Cyp6g1 transcription start on the flanking sequence.]
Accord insertion in D. melanogaster
4.8 kb Doc insertion in D. simulans
Schlenke & Begun (2004) Proc. Natl. Acad. Sci. 101:1626-31
Signatures of parallel selective sweeps in D. melanogaster and in D. simulans
Schlenke & Begun (2004) Proc. Natl. Acad. Sci. 101:1626-31
3 major types of transposable element (TE)
• DNA transposons (cut+paste): Terminal Inverted Repeat (TIR)
• RNA retrotransposons (copy+paste): LINE-like (non-LTR), ending in (A)n
• RNA retrotransposons (copy+paste): Long Terminal Repeat (LTR)
Why study TEs in genome sequences?
• Genome assembly & alignment
• Mechanisms of transposition
• Genome and chromatin structure
• Population and comparative genomics
High resolution annotation of TEs in D. melanogaster
[Figure: annotation pipeline diagram with labels hmm, all-by-all, RECON, BLASTER, RepeatMasker, TBLASTX and RMBLR, comparing Release 3 and Release 4.]
Quesneville, Bergman et al. (2005) PLoS Comp. Biol. 1:e22
Genomic TE distribution in D. melanogaster
[Figure: # TEs per 50 kb (0 to 50) along chromosome X, with markers at ~ centromere and ~ high-low rec. Annotated TE content: ~3% and ~20%, genome-wide average ~5.5%.]
Bergman, Quesneville et al. (2006) Genome Biology 7:R112
What can we learn about TE evolution from a high quality reference genome?
A brief introduction to transposable element (TE) evolution: the current paradigm
• TEs are mobile DNA sequences, intra-genomic parasites
• Transposition rates >> excision rates
• Equilibrium maintained by transposition-selection balance
• Mode of natural selection is debated
- deleterious effects of transposition
- deleterious effects of TE insertion
- deleterious effects of TE-mediated ectopic recombination
✴ TE insertions observed at low frequency in nature
Estimating the age of ‘pseudogene-like’ retrotransposons
Petrov & Hartl (1998) Mol. Biol. Evol. 15:293-302
Alignment of paralogous TEs
Petrov & Hartl (1998) Mol. Biol. Evol. 15:293-302
Estimating the age of ‘pseudogene-like’ retrotransposons
D. mel - D. sim speciation
Bergman & Bensasson (2007) PNAS 104:11340-5
Retrotransposon demographics in D. melanogaster
[Figure: divergence (sub/site, 0.00 to 0.12) with the corresponding age (Mya, 0 to 10.81) for each TE family. Family labels in plotted order (name_number): invader2_6, micropia_4, Tabor_3, 17.6_11, Stalker_4, rover_3, flea_16, copia_28, mdg3_10, roo_86, Transpac_4, opus_16, blood_22, 412_24, Burdock_13, diver_9, Tirant_20, jockey2_7, Helena_7, Cr1a_36, baggins_6, G4_10, Doc3_7, G5_8, BS_15, Juan_9, Doc_53.]
LTR mobilization coincides with out-of-Africa migration at the end of the Pleistocene (~16 kya)
Lachaise et al. (1988) Evolutionary Biology 22:159-225
Bartolome et al. (2009) Genome Biology 10:R22
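The two axes of the retrotransposon demographics figure (divergence in sub/site and age in Mya) imply a simple molecular clock: 0.12 sub/site lines up with 10.81 Mya. A minimal sketch of that conversion follows. The rate constant is read off the axis ticks rather than taken from the cited papers, and the function names are hypothetical.

```python
# Clock rate implied by the figure axes: 0.12 sub/site corresponds to 10.81 Mya.
# This constant is an assumption inferred from the tick labels, not a published value.
RATE_PER_SITE_PER_MYR = 0.12 / 10.81


def divergence_to_age_mya(divergence, rate=RATE_PER_SITE_PER_MYR):
    """Convert divergence (substitutions per site) to age in millions of years."""
    if divergence < 0:
        raise ValueError("divergence must be non-negative")
    return divergence / rate


def age_from_substitutions(s, l, rate=RATE_PER_SITE_PER_MYR):
    """Age in Mya for a fragment of length l carrying s substitutions."""
    if l <= 0:
        raise ValueError("fragment length must be positive")
    return divergence_to_age_mya(s / l, rate)


if __name__ == "__main__":
    for d in (0.02, 0.06, 0.12):
        print(f"{d:.2f} sub/site -> {divergence_to_age_mya(d):.2f} Mya")
```

Under this scale an insertion dated to ~16 kya would carry roughly 0.00018 substitutions per site, which is why very young insertions are often identical to their source copy.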
Horizontal transfer of D. melanogaster TE families
[Figure: silent site divergence.]
Current paradigm interprets low TE frequency as evidence for purifying selection
Aquadro et al. (1986) Genetics 114:1165-1190
Current work (w/ Justin Blumenstiel)
• Develop a non-equilibrium model of neutral TE evolution that relaxes the assumption of a constant TE insertion rate.
• Obtain allele frequency data for a large sample of TEs in ancestral and derived populations of D. melanogaster.
• Test whether observed TE allele frequencies are consistent with ages of TE insertion estimated from genomic data to infer forces controlling TE evolution.
An age-of-allele model for TE insertions
• Question: what is the probability that an allele of age t is present in i copies in a sample of n chromosomes?
• Calculate probability of i descendants from a single ancestor given j ancestors (Feller 1957)
• Calculate probability of j ancestors at time t under standard neutral model (Tavare 1984)
• Calculate probability of insertion at time t given s substitutions in a fragment of length l under Poisson process (Bayes 1763)
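The first two steps above can be sketched as follows. This is one possible reading of the model, not the authors' implementation: it takes the age t as known, in coalescent units (2Ne generations for a diploid autosomal locus), uses the standard closed form for the number of ancestral lineages under the neutral coalescent, and assumes the insertion sits on one particular lineage among the j ancestors. It ignores ascertainment and the uncertainty in t given s and l. Function names are hypothetical.

```python
from math import comb, exp, factorial


def rising(a, k):
    """Rising factorial a(a+1)...(a+k-1); equals 1 when k == 0."""
    out = 1
    for m in range(k):
        out *= a + m
    return out


def falling(a, k):
    """Falling factorial a(a-1)...(a-k+1); equals 1 when k == 0."""
    out = 1
    for m in range(k):
        out *= a - m
    return out


def p_ancestors(j, n, t):
    """P(j ancestral lineages at time t | sample of n), standard neutral model."""
    total = 0.0
    for k in range(j, n + 1):
        coef = (-1) ** (k - j) * (2 * k - 1) * rising(j, k - 1) * falling(n, k)
        coef /= factorial(j) * factorial(k - j) * rising(n, k)
        total += exp(-k * (k - 1) * t / 2) * coef
    return total


def p_descendants(i, n, j):
    """P(one given ancestor has i descendants | j ancestors, n sampled chromosomes)."""
    if j == 1:
        return 1.0 if i == n else 0.0
    if i < 1 or i > n - j + 1:
        return 0.0
    return comb(n - i - 1, j - 2) / comb(n - 1, j - 1)


def allele_count_distribution(n, t):
    """P(allele of age t is present in i copies), for i = 1..n, summing over j."""
    return [
        sum(p_ancestors(j, n, t) * p_descendants(i, n, j) for j in range(1, n + 1))
        for i in range(1, n + 1)
    ]


if __name__ == "__main__":
    # n = 13 mirrors 12 strains plus the reference allele; t = 0.3 is arbitrary.
    for i, p in enumerate(allele_count_distribution(13, 0.3), start=1):
        print(i, round(p, 4))
```

To use ages estimated from sequence data, this distribution would still need to be averaged over the posterior of t given s substitutions in l sites, with t rescaled by an assumed effective population size.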
Allele frequency data for TE insertions
• 190 loci (90 LTR and 100 non-LTR)
• 2 PCRs per locus per strain (TE+flank / L+R flanking regions)
• 12 strains from 2 populations - Zimbabwe (from Stephan Lab) & North Carolina (from Mackay Lab)
• Insertion in genomic sequence is included as 13th allele to account for ascertainment bias
• Individual strain allele frequency data consistent with pooled strain allele frequency data from Gonzalez et al. (2008)
Fit of expected allele frequency under neutral model to observed frequency in North Carolina
[Figure: axis labels "rank difference" (0 to 150) and "observed-expected" (-5 to 10); regions annotated "lower frequency than expected" and "higher frequency than expected".]
Expected allele frequency fits observed allele frequency over a wide range of ages
[Figure: axis labels "log(subs/site)" (-8 to -3) and "observed-expected" (-5 to 10).]
Preliminary observations
• Majority of TE insertions in North Carolina are at or close to expected frequency given age since insertion under neutrality
• Some loci deviate strongly from predicted frequency and may reflect loci under positive or negative selection
• Null model accurately predicts observed allele frequency over wide range of insertion ages
• Null model parameterized with current estimate of African population size leads to poor predictions but yields better fit with ancestral population size
Ongoing and Future Work
• Analysis of the fit of model to data according to various genomic features (TE class, TE family, X vs. autosome, recombination).
• Use model to generate maximum likelihood estimate of Ne under assumption that insertion alleles are neutral.
• Resolution of best summary statistic(s) to assess global fit of the model to the data.
• Inclusion of variable population size.
• Properly model ascertainment bias.
What can we learn about TE evolution from next generation sequencing (NGS)?
Hundreds of D. melanogaster genomes are currently being sequenced
Population genomics of TEs using NGS
[Figure: schematic with labels Strain X, 454 Reads and TEs.]
Unbiased estimates of TE content using NGS
[Figure: bar chart of % TE (0 to 25), with legend non-LTR, LTR and TIR, for: genome, normal rec., low rec., RAL-301, RAL-303, RAL-306, RAL-358, RAL-375, RAL-732.]
Population genomics of TEs using NGS
Hybrid TE-unique reads: “Unique Flank Tags”
[Figure: schematic with labels Strain X, 454 Reads, TEs and Reference; insertions marked KNOWN ✓ or NEW !.]
Population genomics of TEs using NGS: known insertions
Known INE-1 insertion present in NC and AF strains
[Figure: genome browser view of chr3L (5000000 to 20000000) showing Release 5 TEs in the reference sequence and "overlap UFT" tracks for strains NC301, NC303, NC306, NC358, NC375, NC732, AF28-5, AF56-4 and AF63-5.]
Inferences about reference TEs present in natural strains based on ~1X 454 shotgun data
Sackton, Kulathinal, Bergman, et al (2009) Genome Biology and Evolution 1:439-455
• ~22% of annotated TEs found in >=1 wild strain
• ~72% found in nature in low recomb. regions
• DNA transposon insertions (~30%) found more often than non-LTR (~15%) or LTR (10%) retrotransposons
• ~12% TE sequence in each genome
• ~97% of known TE families are found in all strains
Population genomics of TEs using NGS: novel insertions
Novel jockey insertion present in >1 strain
[Figure: genome browser view of chr3L:15088220-15088290 with tracks for TEs in the reference sequence, DGRP TEs (user supplied track), Gap Locations, Gene Disruption Project P-element and Minos Insertion Locations, FlyBase Protein-Coding Genes (Aats-gly) and FlyBase Noncoding Genes. The DGRP track shows P-element-375F, supported by ten 454 reads from strain 375.]
Insertion site, target site duplications (TSDs) and strand orientation can be annotated using NGS
[Figure: element diagram with ORF0, ORF1, ORF2 and ORF3 and terminal sequences CAT... ...ATG.]
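One way to annotate a target site duplication from junction reads is sketched below, assuming the flanking portions of the reads have already been mapped to the reference. Because the target site is duplicated on both sides of the insertion, the flank of a flank-to-TE read and the flank of a TE-to-flank read overlap on the reference, and the overlap is the TSD. The function name, its arguments and the toy sequence are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def infer_tsd(left_flank_end, right_flank_start, reference):
    """Return (start, end, sequence) of the target site duplication, or None.

    left_flank_end: last reference base covered by the flank of reads that
        run from unique flank into the TE.
    right_flank_start: first reference base covered by the flank of reads
        that run from the TE into unique flank.
    reference: reference sequence of the chromosome or region (string).
    """
    if right_flank_start > left_flank_end:
        # The flanks do not overlap, so no duplicated target site is supported.
        return None
    start, end = right_flank_start, left_flank_end
    return start, end, reference[start - 1:end]


if __name__ == "__main__":
    toy_reference = "ACGTACGTTTGACCAGGCATGCA"  # made-up sequence for illustration
    tsd = infer_tsd(left_flank_end=14, right_flank_start=7, reference=toy_reference)
    if tsd is not None:
        start, end, seq = tsd
        print(f"TSD at {start}-{end} ({end - start + 1} bp): {seq}")
```

An 8 bp overlap, as in this toy example, is the TSD length expected for P-element insertions. Strand orientation would additionally need to record which end of the TE each junction read carries.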
Using genomic data to infer TE target site preferences: the P-element as a case study
Linheiro & Bergman (2008) Nucl. Acids Res. 36:6199-6208
Using NGS population genomic data to infer TE target site preferences
[Figure: three sequence logos of target site preference, showing information content in bits at positions -25 to 25 (5′ to 3′); y-axis 0.0 to 0.5 bits for the first two logos and 0.0 to 1.0 bits for the third. Panel labels: Artificial P-element insertions, Natural P-element insertions, Natural Hobo insertions. Sample sizes as listed: n=10221, n=702, n=892.]
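The "bits" axis of the target site logos is per-position information content. A minimal sketch of that calculation is shown here, assuming equal-length, uppercase DNA sequences already aligned on the insertion site, a uniform background and no small-sample correction. The function name and the toy input are hypothetical.

```python
from collections import Counter
from math import log2


def information_content(sites, alphabet="ACGT"):
    """Per-position information content in bits for aligned target sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    max_bits = log2(len(alphabet))  # 2 bits for DNA
    bits = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        total = sum(counts[b] for b in alphabet)  # ignores N and other symbols
        if total == 0:
            bits.append(0.0)
            continue
        entropy = -sum(
            (counts[b] / total) * log2(counts[b] / total)
            for b in alphabet
            if counts[b]
        )
        bits.append(max_bits - entropy)
    return bits


if __name__ == "__main__":
    toy_sites = ["AC", "AG", "AT", "AA"]  # made-up sites for illustration
    print(information_content(toy_sites))
```

A fully conserved position scores 2 bits and a position with equal base frequencies scores 0, so peaks of 0.5 to 1.0 bits indicate a weak but detectable preference.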
Summary
• LTR insertions are not at equilibrium in D. melanogaster
• Population genomics using NGS will help resolve forces controlling TE evolution
• Drosophila is an excellent system for studying the impact of TE insertion on genome structure and evolution
• Population genomics using NGS will provide rich material for understanding mechanisms of transposition
• Many retrotransposon alleles are at frequencies expected under neutrality
Top tip #1: UCSC Source Tree
http://bergman-lab.blogspot.com/2009/03/compiling-ucsc-source-tree-utilities-on.html
~600 command line utilities for “sorting, splitting, or merging fasta sequences; record parsing and data conversion using GenBank, fasta, nib, and blast data formats; sequence alignment; motif searching; hidden Markov model development; and much more”
Top tip #2: VITAL-IT
“Vital-IT is pleased to invite proposals for cost-free use of its facilities from individuals, institutions and companies from Switzerland or any of the EU Member and Associated States.”
Douda Bensasson, Andy Clark
Fiona He, Justin Blumenstiel
Raquel Linheiro, Max Haussler
Michael Ashburner, Hadi Quesneville