Rainer Lehtonen PhD, Genomics and genetics project leader Metapopulation Research Group Department...
Glanville fritillary butterfly genomics and genetics
Rainer Lehtonen, PhD, Genomics and genetics project leader
Metapopulation Research Group
Department of Biological and Environmental Sciences, University of Helsinki
2
Glanville fritillary butterfly – genomics and genetics
- Background
- Genome project
- Genome assembly >> Panu Somervuo
- Some NGS applications
- Conclusions
3
Glanville fritillary as a model
Glanville fritillary is an internationally recognized metapopulation model system in ecological and evolutionary studies
Studied since 1991 in the Åland Islands in Finland
Data available from different populations:
- Fragmented landscape vs. continuous
- Isolated vs. metapopulation
- Large vs. small
- Same vs. different population history
Field studies, indoor & outdoor cage + laboratory experiments, controlled crosses, molecular studies
4
Collaborative genome project
SEQUENCE DATA PRODUCTION
DNA (+RNA) SAMPLES
QC + ASSEMBLY
ASSEMBLY VALIDATION (ref g)
ANNOTATION + PUBLICATION
GENOME ANALYSIS
VARIATION IN THE GENOME
GENETIC TOOLS
INSTITUTE OF BIOTECH, KAROLINSKA INSTITUTE
INSTITUTE OF BIOTECHNOLOGY
INSTITUTE OF BIOTECH, DEP COMPUTER SCI
INSTITUTE OF BIOTECH, DEP COMPUTER SCI
EBI, ENSEMBL GENOMES
EBI, OTHER GENOME PROJECTS
INSTITUTE OF BIOTECH, DEP COMPUTER SCI
FIMM, BIOMEDICUM HKI, INSTITUTE OF BIOTECH, ILLUMINA INC.
ESTs REF GENOME
GENOME ANNOTATION
DATA FROM OTHER SOURCES
NEXT-GEN SEQUENCING: 454, SOLiD3, SOLEXA
REF DNA + RNA SAMPLES
GENOME ASSEMBLY
NEXT-GEN RE-SEQUENCING: SOLiD4/SOLEXA
CROSSES/POP POOLS/INDS
MAPPING TO REF GENOME VARIATION
GENETIC MAP (MARKER LOCATIONS)
GENETIC VARIATION, GENE EXPRESSION
PLATFORM FOR LARGE SCALE TARGETED GENOTYPING
GENOTYPING OF LARGE POPULATION SAMPLES (>50K)
Reference genome + variation
EST ASSEMBLY
Heliconius Genome Meeting, 25.-26.3.2010
6
Variation & other next-gen data
Sample | Aim | Platform | Read type | Read length | Runs to be done
RNA, pool used in RNAseq | Gene start sites, gene 5’ variation | SOLiD4 | Pair-end | 50+25 | 1/4
Amp DNA, 4 crosses | Construction of genetic map | SOLiD4 | Single read, RAD tag library | 50+25 | 3
Amp DNA, pool ~30 ind | SNPs & other genetic variation | SOLiD4 | Pair-end | 50+25 | 1
RNA, pooled pop samples from 5+1 pop | Variation in 5+2 pop, SNPs in ESTs, expression | SOLiD4 | Pair-end | 50+25 | 1(-2)
DNA from selected individuals | Pgi & flanking genes + Sdhd, Hsp70 | Sure-Select + 454, Sanger seq | Single read | 400 | 1/4
7
Deep re-sequencing
RAD-tag (Restriction-site Associated DNA), also known as “deep sequencing of a reduced representation library”
Example: Construction of a high-density genetic map:
* 4 controlled Spain-Finland crosses
* Parents and 50 individuals from each family to be sequenced
A genetic or linkage map defines the order of, and distance between, markers based on recombination frequency in meiosis (1 cM = 1% recombination rate)
SureSelect (Agilent) Target Enrichment + deep sequencing with 454
Example: Population comparison of the Pgi + flanking genes (+ some others) in a sample of 24 individuals or pools
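The 1 cM = 1% recombination rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's mapping pipeline: the function names and the offspring counts are hypothetical, and the Haldane mapping function is an addition that the slides do not mention (it corrects for unseen double crossovers at larger distances).

```python
import math

def recombination_fraction(n_recombinant, n_total):
    """Fraction of scored offspring that are recombinant between two markers."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_recombinant / n_total

def map_distance_cm(r):
    """Rule of thumb from the slide: 1 cM = 1% recombination (fine for small r)."""
    return 100.0 * r

def haldane_cm(r):
    """Haldane mapping function (an addition here, not from the slides)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

# Hypothetical family: 5 recombinants among the 50 sequenced offspring.
r = recombination_fraction(5, 50)
print(f"r = {r:.2f}")
print(f"simple distance:  {map_distance_cm(r):.1f} cM")
print(f"Haldane distance: {haldane_cm(r):.1f} cM")
```

With only 50 offspring per family, the smallest non-zero recombination fraction that can be observed is 1/50, so a single family resolves marker distances in steps of about 2 cM.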
8
Genetic map with RAD-tag NGS
150-200 bp pair-end library
50 bp seq, 25 bp seq
SNP1, SNP2
Baird NA et al. PLoS ONE 2008
Now: 500M reads, 50 bp each
9
RAD-tagging in Glanville fritillary
Average fragment size (kbp):
Enzyme | 454 Glanville gContigs | Heliconius
NcoI | 13.3 | 14
XhoI | 11.5 | 4
EcoRI | 4.5 | 2
Mappable reads
• Restriction site > 250 bp from the end of a gContig
• Targets = 2x sites
• 454-Newbler assembly: 320 Mbp (out of ~550 Mbp genome) in 220K contigs (>500 bp)
• Expected number of SNPs 1/300 bp, read length 50+25 bp
Enzyme | Site | #sites | #mappable | #exp | #SNPs
NcoI* | ccatgg | 24,064 | 38,880 | 48,128 | 12,032
XhoI | ctcgag | 27,788 | 45,925 | 55,576 | 13,894
EcoRI | gaattc | 70,474 | 117,293 | 140,948 | 35,237
BspHI* | tcatga | 66,967 | 110,731 | 133,934 | 33,483
NdeI | catatg | 73,629 | 121,628 | 147,258 | 36,814
* The most probable combination > ~45,000 SNPs
• Reads have to be unique
• 10-20x coverage/individual (>~5000x on average)
• Heavy data filtering needed > probably only 30-50% of data is usable
In silico restriction analysis by Panu Somervuo, MRG
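The in silico restriction analysis can be approximated with a short Python sketch. It is not Panu Somervuo's actual script. It assumes that each side of a restriction site counts as one target (targets = 2x sites), that a flank is "mappable" when more than 250 bp of contig lies on that side, and that expected SNPs = targets x 75 bp / 300 bp, which reproduces the #SNPs column (for example 48,128 x 75 / 300 = 12,032 for NcoI). The function names and the toy contig are hypothetical.

```python
import re

READ_BP = 50 + 25        # bases sequenced per target (50 + 25 bp, from the slides)
SNP_SPACING_BP = 300     # one expected SNP per 300 bp (from the slide)
MIN_END_DISTANCE = 250   # a site must lie > 250 bp from the contig end

def find_sites(contig, site):
    """0-based start positions of `site` in `contig`, overlapping hits included."""
    pattern = re.compile("(?=" + re.escape(site.upper()) + ")")
    return [m.start() for m in pattern.finditer(contig.upper())]

def expected_snps(targets):
    """Expected SNP count when READ_BP bases are read at each target."""
    return targets * READ_BP / SNP_SPACING_BP

def digest_summary(contigs, site):
    """Count sites, mappable flanks and expected SNPs over a list of contigs."""
    n_sites = 0
    n_mappable = 0
    for seq in contigs:
        for pos in find_sites(seq, site):
            n_sites += 1
            upstream = pos
            downstream = len(seq) - (pos + len(site))
            # Assumption: each flank with enough contig sequence is one mappable target.
            n_mappable += int(upstream > MIN_END_DISTANCE)
            n_mappable += int(downstream > MIN_END_DISTANCE)
    targets = 2 * n_sites
    return {
        "sites": n_sites,
        "mappable": n_mappable,
        "targets": targets,
        "expected_snps": expected_snps(targets),
    }

# Toy contig with two NcoI sites; the second sits only 100 bp from the 3' end.
toy_contigs = ["A" * 300 + "CCATGG" + "A" * 300 + "CCATGG" + "A" * 100]
summary = digest_summary(toy_contigs, "ccatgg")
print(summary)
```

All five recognition sites in the table are palindromic, so scanning one strand is enough. A non-palindromic site would also need a search for its reverse complement.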
10
Targeted enrichment + resequencing
Max 55K 120-mer oligos
Glanville fritillary butterfly SureSelect target enrichment (10x tiling):
• To identify “lethal” haplotypes associated with a known homozygous genotype
• To define the structure and variation of the hypervariable Pgi gene
• To design tag-SNPs for large scale genotyping
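As a rough illustration of what 10x tiling implies for bait design, the sketch below places 120-mer baits along a target. It assumes that "10x tiling" means every interior target base is covered by ten overlapping baits, that is a 12 bp step between bait starts. The slide does not define the term, so treat this as one plausible reading. The function names are hypothetical and this is not Agilent's design tool.

```python
BAIT_LEN = 120      # oligo length from the slide
MAX_BAITS = 55_000  # library capacity from the slide
TILING = 10         # assumed meaning: 10 overlapping baits per base

def bait_starts(target_len, bait_len=BAIT_LEN, tiling=TILING):
    """Start positions of baits tiled across a target of `target_len` bp."""
    step = bait_len // tiling
    if target_len < bait_len:
        return []
    return list(range(0, target_len - bait_len + 1, step))

def baits_covering(position, starts, bait_len=BAIT_LEN):
    """Number of baits that overlap a given 0-based target position."""
    return sum(1 for s in starts if s <= position < s + bait_len)

def max_target_bp(max_baits=MAX_BAITS, bait_len=BAIT_LEN, tiling=TILING):
    """Longest contiguous target that fits in the library under this tiling."""
    step = bait_len // tiling
    return (max_baits - 1) * step + bait_len

starts = bait_starts(240)
print(len(starts), "baits for a 240 bp target")
print(baits_covering(120, starts), "baits cover position 120")
print(max_target_bp(), "bp of contiguous target fit in 55K baits")
```

Under this reading the library holds roughly 660 kbp of contiguous target. Splitting the target into many separate regions lowers that figure slightly, because each region's edges are covered by fewer baits.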
11
Uneven coverage
[Figure: bar chart “Cinxia Sure Select” of bases (kbp, total 128 555 kbp) and reads (total 337 635) per sample; axis ticks 5 000 to 35 000. Figure by Pia Laine, Institute of Biotechnology, University of Helsinki]
Sample labels recovered from the axis (12 of the 24 samples):
TCMID_3 - Tas_pooli_Cinxia Sure Select_F3
TCMID_51 - Tas_pooli_Cinxia Sure Select_A1
TCMID_53 - Tas_pooli_Cinxia Sure Select_C1
TCMID_55 - Tas_pooli_Cinxia Sure Select_E1
TCMID_57 - Tas_pooli_Cinxia Sure Select_G1
TCMID_59 - Tas_pooli_Cinxia Sure Select_A2
TCMID_61 - Tas_pooli_Cinxia Sure Select_B3
TCMID_63 - Tas_pooli_Cinxia Sure Select_1
TCMID_65 - Tas_pooli_Cinxia Sure Select_3
TCMID_67 - Tas_pooli_Cinxia Sure Select_5
TCMID_69 - Tas_pooli_Cinxia Sure Select_E3
TCMID71 -
Data labels per sample, in chart order:
Reads | Bases (kbp)
30 753 | 11 546
31 488 | 12 197
7 998 | 3 128
11 072 | 4 343
20 699 | 8 204
13 346 | 5 236
10 540 | 4 144
4 568 | 1 791
9 164 | 3 587
7 520 | 2 829
9 863 | 3 581
1 131 | 444
11 687 | 4 494
9 362 | 3 613
12 959 | 4 983
13 717 | 5 361
16 644 | 6 499
9 214 | 3 621
9 780 | 3 708
20 851 | 7 718
17 110 | 6 324
22 316 | 7 960
21 122 | 7 774
14 731 | 5 468
Hypothesis-driven sampling: compare samples (24) from different populations with different tag-SNP genotype frequencies
> Hardy-Weinberg equilibrium
> Hardy-Weinberg disequilibrium
¼ 454 Titanium run: 444-12197 kb/sample = 15-406x coverage
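The coverage range quoted above can be checked with a small Python sketch. The per-sample values are the kbp data labels from the chart. The 30 kbp target size is an assumption inferred from the slide's own numbers (444 kb giving about 15x and 12197 kb giving about 406x); the slide does not state it.

```python
# Bases per sample in kbp, read from the chart's data labels (24 samples).
KBP_PER_SAMPLE = [
    11546, 12197, 3128, 4343, 8204, 5236, 4144, 1791,
    3587, 2829, 3581, 444, 4494, 3613, 4983, 5361,
    6499, 3621, 3708, 7718, 6324, 7960, 7774, 5468,
]

TARGET_KBP = 30.0  # assumed enrichment target size, inferred rather than stated

def fold_coverage(kbp, target_kbp=TARGET_KBP):
    """Raw fold coverage if all sequenced bases fell on the target."""
    return kbp / target_kbp

coverages = [fold_coverage(k) for k in KBP_PER_SAMPLE]
lowest, highest = min(coverages), max(coverages)

print(f"samples: {len(KBP_PER_SAMPLE)}, total: {sum(KBP_PER_SAMPLE)} kbp")
print(f"coverage range: {lowest:.0f}x to {highest:.0f}x")
print(f"highest / lowest sample: {highest / lowest:.1f}-fold")
```

These figures count every sequenced base as on-target, so they are upper bounds on the coverage actually achieved over the target.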
12
How well does SureSelect work?
Data from Agilent
Our very preliminary result: ~40% of the data comes from the target
13
Comparison of haplotypes
Sampsa Hautaniemi, Marko Laakso, Sirkku Karinen, Rainer [email protected]
14
Message
- Whole genome sequencing is doable for a “non-genome” oriented research group
- Most of the work is in data filtering and analysis
- Tools for data management and analysis are under strong development
- Downstream efforts need to be compatible with available genome data