Linking Candidate Genes to Drought for SNP...
Outline
1) SNPs and Association Genetics
2) Haploryza - ind/jap SNPs
3) Genome-wide SNP discovery
  • OryzaSNP project
4) Summary
5) The Future
SNiPs & Associations
What is a SNiP?
Is it the sound of [image]?
Or the Scottish National Party (~Pàrtaidh Nàiseanta na h-Alba~)?
No…
A SNP is a Single Nucleotide Polymorphism
• a difference between the nucleotide at a corresponding locus in one genome as compared to another.
• variation can be
  o either A to G or C to T transitions
  o or A/G to C/T transversions.
• the bulk of genetic variation is due to SNPs
[Figure: substitution diagram among the bases A, T, G, C]
For random mutations, Transition/Transversion = 0.5
Yet, for mammals the observed ratio is 1.4 - 1.7, indicating that any one transition is 2.8 - 3.4 times as likely as any one transversion.
Therefore, SNPs are usually bi-allelic.
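The 0.5 expectation follows from counting: each base has one possible transition and two possible transversions. The sketch below enumerates all 12 substitutions to confirm this and rescales the observed mammalian ratios; the function names are illustrative and not from the talk.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(substitutions):
    """Transition/transversion ratio for an iterable of (ref, alt) pairs."""
    subs = list(substitutions)
    ti = sum(is_transition(ref, alt) for ref, alt in subs)
    tv = len(subs) - ti
    return ti / tv


# All 12 possible single-base substitutions, treated as equally likely.
random_expectation = ti_tv_ratio(permutations("ACGT", 2))

# Observed ratios quoted on the slide, relative to the random expectation.
fold_low = 1.4 / random_expectation
fold_high = 1.7 / random_expectation
print(random_expectation, fold_low, fold_high)
```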
How do we use SNPs for mapping?
In Synthetic Populations
• Map as any other marker
In Natural Populations
• Use recombinations over historical time
• Measure the association of alleles with one another and with phenotypes.
• Use estimates of equilibrium/disequilibrium for the gametic phase.
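As a minimal sketch of estimating gametic-phase disequilibrium between two bi-allelic SNPs, the code below computes the coefficient D and r² from phased haplotypes. Alleles are coded 0/1 and both example datasets are hypothetical; the talk does not say which estimator was used.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two bi-allelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B), alleles coded 0/1.
    Both loci must be polymorphic, otherwise r2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical data: alleles always travel together (complete LD) ...
complete_ld = [(1, 1)] * 10 + [(0, 0)] * 10
# ... versus all four haplotypes at equal frequency (linkage equilibrium).
equilibrium = [(1, 1)] * 5 + [(1, 0)] * 5 + [(0, 1)] * 5 + [(0, 0)] * 5

print(ld_stats(complete_ld))
print(ld_stats(equilibrium))
```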
Why use SNPs for mapping and association?
• Bi-allelic nature results in easier genotyping as opposed to SSRs
• Allelic variation might be indicative of function
• Association of alleles (2-states) with one another and with phenotypes is easily tested.
• Easy to estimate linkage equilibrium/disequilibrium (gametic phase)
Rafalski 2002
[Figure: p = 0.001 (Fisher's Exact Test) vs. p = 0.716]
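Fisher's Exact Test, cited in the figure above, can be applied to a 2x2 table of allele by phenotype class. The sketch below is a plain-Python two-sided version built on the hypergeometric distribution. The counts are hypothetical and are not the data behind the Rafalski 2002 p-values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = allele (A, G), columns = phenotype (tolerant, susceptible).
p_associated = fisher_exact_two_sided(8, 2, 1, 9)
p_unassociated = fisher_exact_two_sided(5, 5, 5, 5)
print(p_associated, p_unassociated)
```

With only two allelic states per SNP, every marker reduces to a table of this shape, which is why the test is easy to run across many loci.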
Tag SNPs for mapping
Goldstein & Cavalleri, Nature 27 Oct 2005
Power of Association Tests
• Pre-requisites
  o High density genotypes with coverage at level of LD in target population
  o Detailed, replicated phenotype data
  o Integrated database and computation (!!)
• Use the pool of recombinants created over historical time
  o Length (haplotype blocks) → a minimum length over many generations of recombination
• Identify regions or genes by correlation of their genotypic pattern with a component of a trait.
• Advantages over classical mapping with SSRs
  o On finding an association, you have either found the gene or are at least very close to it.
  o Either alone or when combined with specialized bridging populations, the power of detection is increased over QTL mapping.
HaplOryza Project: HD SNP Genotyping
G. Droc, C. Billot, B. Courtois, A.A. Farouk, N. Ahmadi, G. Clément, J. Taillebois, B. Barry, G. Second, D. Brunel, A. Bérard, M. Lathrop, M. Foglio, KL McNally
HaplOryza Platform
• 1536 SNPs chosen from Nipponbare (japonica) and 93-11 (indica) genome comparisons for
  o LD scan of the entire genome with 1 SNP/320 kb (76%)
  o A closer LD scan on specific regions with 1 SNP/50 kb (24%)
    • test for introgression of different material:
    • two low density areas (chr 7) and two normal density areas (chr 12)
  o all types of mutation (syn vs non-synonymous)
  o Best Illumina scores (> 0.9)
• 900 accessions
  o A good representation of the overall genetic diversity (GCP Composite)
    • "MiniGB" accessions (241), complement from remaining 2757 accessions (231), and from previous study (46)
  o Accessions presenting putative patterns of introgressions between indica and japonica groups
    • Highland rice from Madagascar (132), Guinean rice (39), Chinese rice (12), Surinam rice (23), Breeding crosses: European (12), Brazilian (11)
  o 20 Reference accessions chosen for the OryzaSNP project
  o 133 Wild accessions
• Illumina GoldenGate assay
Rice SNP pipeline
• Only 1 polymorphism over a 50 bp window
• Only 1 hit to japonica or indica pseudomolecules (excluding redundant hits)
• 100% identical sequence of 20 bp on either side of the polymorphism
• Only 1 SNP by location (excluding redundant SNPs between the 3 datasets)
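The four pipeline criteria can be read as successive filters on candidate SNPs. The sketch below is one possible reading, not the project's actual code: the `Candidate` fields are hypothetical, and "only 1 polymorphism over a 50 bp window" is assumed to mean that no other candidate lies within 50 bp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chrom: str
    position: int          # coordinate of the polymorphism
    genome_hits: int       # hits to the japonica or indica pseudomolecules
    flank_identity: float  # identity of the 20 bp flanks, 0.0 to 1.0


def filter_candidates(candidates, window=50):
    """Apply the four slide criteria; `window` is in bp (assumed interpretation)."""
    # Only 1 SNP by location: drop redundant entries from merged datasets.
    unique = {(c.chrom, c.position): c for c in candidates}
    ordered = sorted(unique.values(), key=lambda c: (c.chrom, c.position))

    kept = []
    for i, cand in enumerate(ordered):
        # Only 1 polymorphism per window: check the nearest sorted neighbours.
        neighbours = ordered[max(0, i - 1):i] + ordered[i + 1:i + 2]
        crowded = any(
            n.chrom == cand.chrom and abs(n.position - cand.position) < window
            for n in neighbours
        )
        # Only 1 genome hit, and 100% identical flanking sequence.
        if crowded or cand.genome_hits != 1 or cand.flank_identity < 1.0:
            continue
        kept.append(cand)
    return kept


# Hypothetical candidates, one per failure mode plus two that pass.
candidates = [
    Candidate("chr1", 1000, 1, 1.0),
    Candidate("chr1", 1000, 1, 1.0),   # redundant location
    Candidate("chr1", 5000, 2, 1.0),   # two genome hits
    Candidate("chr1", 9000, 1, 0.95),  # flanks not identical
    Candidate("chr1", 12000, 1, 1.0),  # second polymorphism 30 bp away
    Candidate("chr1", 12030, 1, 1.0),
    Candidate("chr2", 1000, 1, 1.0),
]
passed = [(c.chrom, c.position) for c in filter_candidates(candidates)]
print(passed)
```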
SNP distribution in the genome
[Figure panels: Intergenic & genic; Introns & exons]
Results on a subset of 548 samples
• 44 monomorphic SNPs (2.8 %) over 548 samples
• 31 SNPs (2.1 %) with over 50 % missing data
• 13 samples (2.1 %) with over 50 % missing loci
• 535 samples x 1461 SNPs with 7.4 % missing data
• Diversity analysis - 2.1 % heterozygous
[Figure: factorial analysis plot; Axis 1: 51.9 %, Axis 2: 5.7 %; groups labeled Indica, Jap Trop, Jap Temp, Surinam, Madagascar]
Diversity – Subsample O.sativa
• Random choice of 100 SNP (whole genome scanning)
• 283 accessions representing sativa diversity
• All sativa groups are well defined
• Factorial analysis, based on simple matching index
• Axis 1: 54 %, Axis 2: 3.5 %
[Figure: factorial analysis plot; groups labeled Indica, Japonica, Aus / Boro, Basmati]
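The simple matching index behind the factorial analysis counts the loci at which two accessions carry the same call. A minimal sketch of the corresponding dissimilarity follows, assuming missing calls are coded as `None` and skipped; the genotype vectors are hypothetical.

```python
def simple_matching_dissimilarity(g1, g2):
    """1 - (matching loci / compared loci), ignoring loci missing in either accession."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no loci with calls in both accessions")
    matches = sum(a == b for a, b in pairs)
    return 1 - matches / len(pairs)


# Hypothetical calls at five SNPs for two accessions (None = missing data).
accession_1 = ["A", "G", "C", None, "T"]
accession_2 = ["A", "T", "C", "G", "T"]
distance = simple_matching_dissimilarity(accession_1, accession_2)
print(distance)
```

A matrix of such pairwise values over all accessions is the kind of input a factorial (principal coordinates) analysis works from.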
Haplotyping – Subsample O.sativa
Genome-wide SNP Variation in Landraces and Modern Varieties of Rice
K.L. McNally, K.L. Childs, R. Bohnert, R.M. Davidson, K. Zhao, V.J. Ulat, G. Zeller, R. Clark, D. Hoen, T. Bureau, R. Stokowski, D.G. Ballinger, K.A. Frazer, D.R. Cox, B. Padhukasahasram, C.D. Bustamante, D. Weigel, R. Bruskiewich, G. Rätsch, C.R. Buell, D.J. Mackill, H. Leung, and J.E. Leach
Platform
• High density oligomer arrays based on Affymetrix technology
• 25-mers tiled with 1-base offset for ALL non-repetitive regions greater than 60 bp in length
• Complete tiling of both strands with full degeneracy
  o 8 oligos per base
• 5 in. x 5 in. arrays each interrogating 25 Mb using over 160 million separate features
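"8 oligos per base" follows from full degeneracy on both strands: four possible central bases times two strands. The sketch below builds such a probe set for one position. It is illustrative only (the function names are hypothetical and the actual array design rules are not given in the talk), and it uses the 25-mer shown on the Perlegen slides as a stand-in reference.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def probes_for_position(reference, pos, length=25):
    """Return the 8 probes interrogating 0-based position `pos`.

    Each probe is a `length`-mer centred on `pos`; the centre takes each of
    A, C, G, T, once on the forward strand and once on the reverse strand.
    """
    half = length // 2
    if pos < half or pos + half >= len(reference):
        raise ValueError("position too close to the end of the sequence")
    window = reference[pos - half:pos + half + 1]
    probes = []
    for strand in (window, reverse_complement(window)):
        for base in "ACGT":
            probes.append(strand[:half] + base + strand[half + 1:])
    return probes


reference = "ACAGTCATGCCGTATCGGTACGTTC"   # 25-mer from the slides, used as a toy reference
probes = probes_for_position(reference, 12)  # 13th base, 0-based index 12
for probe in probes:
    print(probe)
```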
Perlegen Technology
Whole-wafer high-density microarrays
• 0.7 mm thick, 5 inches x 5 inches
• Up to 160 million "features"
• Perlegen has exclusive access to Affymetrix whole wafer arrays, which can contain up to 160 million distinct features
[Figure: one of the 160 million features. About a million copies of a 25-letter long piece of DNA (ACAGTCATGCCGTATCGGTACGTTC) are built on each feature, here checking for letter "A" in the middle (13th) position. A labeled DNA fragment from the individual's sample matches.]
Features are "tiled"…
CGGATCGCAATCAGGTTACGATACA
 GGATCGCAATCAGGTTACGATACAT
  GATCGCAATCAGGTTACGATACATC
Probing for Match and Mismatch
…AGCATGT…
…AGCCTGT…
…AGCGTGT…
…AGCTTGT…
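Tiling with a 1-base offset means every 25-mer window of the target gets its own feature. A minimal sketch follows, using the 27-base stretch that the three probes on this slide jointly cover; the `tile` helper is illustrative.

```python
def tile(sequence, length=25, offset=1):
    """Return every `length`-mer of `sequence`, stepping by `offset` bases."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, offset)]


# The three tiled 25-mers shown on the slide overlap to give this 27-base target.
target = "CGGATCGCAATCAGGTTACGATACATC"
tiles = tile(target)
for shift, probe in enumerate(tiles):
    print(" " * shift + probe)
```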
Detection of DNA Variation Using Microarrays
[Figure: labeled DNA hybridized to chip; DNA sequence of interest shown 5' to 3' for DNA 1 and DNA 2, with the variant bases highlighted]
Human Whole-Genome High-Density Oligonucleotide Arrays
• 223 arrays containing more than 10 billion unique probes
• Human genome: 3 billion bp
• 1.6 million SNPs representing common human variation
Hinds et al 2005
- Implementation -
OryzaSNP Project Design
• Development phase
  o 20 germplasm lines
  o 379 kb unique sequence in 684 kb region on Chr3
  o Perlegen optimized hybridization with 8 kb LR-PCR amplicons
  o Pilot completed Oct 2006
• Discovery phase
  o 20 germplasm lines and remaining 100 Mb sequence
  o Performed SSD on lines, undergoing seed increase for phenotyping, crossing to produce F1s for RIL development
  o High quality genomic DNAs produced by CsCl density gradient ultracentrifugation with 1 mg shipped to Perlegen for target production
  o Perlegen model-based predictions obtained April 2007
- Reference Genome -
International Rice Genome Sequencing Project (IRGSP) 2005 Nature 436:793-800
• HQ BAC-by-BAC Nipponbare (< 1 error in 10K bases)
• 100 Mb of Nipponbare with 57% coverage of annotated gene models
- OryzaSNP Germplasm -
IRIS GID (Code)[a]  Variety                         Source            Origin       Group[b]
2254728 (2)         Nipponbare                      IRTP[c] 23787     Japan        Temp[d] japonica
2021623 (3)         Tainung 67                      Academia Sinica   Taiwan       Temp japonica
2254732 (7)         Li-Jiang-Xin-Tuan-Hei-Gu (LTH)  IRGC[e] 59323     China        Temp japonica
2254738 (4)         M 202                           IRGC 77142        U.S.A.       Temp japonica
2254730 (5)         Azucena                         IRTP 4209         Philippines  Trop[f] japonica
2254722 (6)         Moroberekan                     IRGC 12048        Guinea       Trop japonica
2254737 (9)         Cypress                         IRRI PBGB[g]      U.S.A.       Trop japonica
2254723 (8)         Dom-sufid                       IRGC 12880        Iran         Aromatic[h]
2254720 (11)        N 22                            IRGC 4819         India        Aus
2254725 (14)        Dular                           IRGC 32561        India        Aus
2254721 (13)        FR13 A                          IRGC 6144         India        Aus
2254726 (12)        Rayada                          IRGC 77210        Bangladesh   Deep-water 4
2254724 (10)        Aswina                          IRGC 26289        Bangladesh   Deep-water 3
2254729 (21)        IR64 (IR64-21)                  IRRI PBGB         Philippines  Indica
2254731 (18)        Shan-Huang Zhan-2 (SHZ2)        IRTP 16338        China        Indica
2254727 (19)        Pokkali                         IRGC 108921       India        Indica
2254736 (20)        Swarna                          IRRI PBGB         India        Indica
2254719 (17)        Sadu-cho                        IRGC 2243         Korea        Indica
2030504 (15)        Minghui 63                      Huazhong Ag. Un.  China        Indica
2030525 (16)        Zhenshan 97B                    Huazhong Ag. Un.  China        Indica
[a] The germplasm identifier (GID) in the International Rice Information System (IRIS, iris.irri.org) and numerical code assigned for analyses; [b] Determined by isozyme, SSR, SNP analyses, or a combination thereof; [c] International Network for the Genetic Evaluation of Rice accession; [d] Temperate type; [e] International Rice Genebank Collection accession; [f] Tropical type; [g] Plant Breeding, Genetics and Biotechnology (Division); [h] basmati (aromatic) plant type.
Diverse rice varieties re-sequenced by HDOA.
Plant Types in OryzaSNP set
[Figure: photographs labeled IR64, IAC 165, M202, Moroberekan, Dom Sufid, Cypress, Pokkali, Aswina, Swarna, Inia Tacuari, Co 39, Patbyeo, Gerdeh, Dular, Sadu-cho]
Panicle/Seed Types in OryzaSNP set
- Data Production & SNP Predictions -
Design & Produce Tiling Arrays + Optimize Conditions for Rice
20 * 13,582 LR-PCR rxns
Shear, Label, Hyb, Scan
Predict SNPs by Model-Based Algorithms (hybridization footprint intensities)
R. Stokowski/D. Ballinger
Predict SNPs by Machine Learning Algorithms (train SVMs with hybridization intensities, qualities, and known SNPs; assess quality, recall, and FDR)
R. Bohnert/G. Raetsch
Summary statistics for MB and ML predictions, their union (MB∪ML) and intersection (MB∩ML)
Dataset   All SNPs  Non-rep SNPs  Genotypes  Freq ≥0.15 (%)  Biallelic (%)  Transi/Transv
MB        259721    242,196       1242410    67.2            97.4           1.900
ML        326471    316,373       1349341    53.8            97.7           1.665
MB ∪ ML   422244    397,348       1824074    56.9            97.1           1.654
MB ∩ ML   162478    159,879       761606     64.9            99.7           2.072
Complementarity of SNPs Identified by MB & ML Algorithms
[Venn diagram labels]
Set        Non-rep SNPs  Share  [recall:FDR]
ML         316,373       39%    [7.9:14.0]
Intersect  159,879       38%    [9.7:3.2]
Union      397,348       100%   [27.8:12.3]
MB         242,196       23%    [10.2:24.7]
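The [recall:FDR] pairs compare each prediction set with a set of known SNPs. As a sketch of how the two measures are defined (recall = true positives / known SNPs, FDR = false positives / predictions), the code below scores two hypothetical callers together with their union and intersection. The site identifiers are invented and do not reproduce the figures on the slide.

```python
def recall_and_fdr(predicted, truth):
    """Return (recall, FDR) for a set of predicted sites against known true sites."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    recall = true_positives / len(truth)
    fdr = (len(predicted) - true_positives) / len(predicted)
    return recall, fdr


# Hypothetical site identifiers.
known_snps = set(range(1, 11))      # 10 validated SNP sites
mb_calls = {1, 2, 3, 11}            # model-based caller (toy)
ml_calls = {3, 4, 5, 12}            # machine-learning caller (toy)

for name, calls in [("MB", mb_calls), ("ML", ml_calls),
                    ("union", mb_calls | ml_calls),
                    ("intersection", mb_calls & ml_calls)]:
    print(name, recall_and_fdr(calls, known_snps))
```

In this toy case the union finds more of the known sites while the intersection contains no false calls, which is the trade-off the complementarity slide is about.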
Summary Statistics by Chromosome (MB∩ML)
1.597 SNPs/kb; 20.7% singletons
Columns: Chr.; tiled bases; SNPs at all sites; SNPs at repetitive sites; SNPs at non-repetitive sites (No., SNPs/kb, genotypes, sites with allele freq = 0.05, = 0.1, ≥ 0.15); SNP types at bi-allelic sites (bi-allelic SNPs, transitions, transversions).
Chr.   Tiled bases  All sites  Rep. sites  Non-rep. No.  SNPs/kb  Genotypes  AF=0.05  AF=0.1  AF≥0.15  Bi-allelic  Transitions  Transversions
1      13383899     28984      345         28639         2.140    149653     5321     3638    19680    28538       19041        9448
2      11033444     16773      186         16587         1.503    79223      3577     2397    10613    16539       10991        5506
3      11970871     18764      180         18584         1.552    94490      3203     2476    12905    18531       12449        6012
4      9113512      12824      278         12546         1.377    50449      3251     2158    7137     12498       8334         4128
5      7803388      15242      146         15096         1.935    80477      2210     1747    11139    15039       10158        4828
6      8003969      17312      286         17026         2.127    82856      3193     2565    11268    16975       11518        5423
7      7650967      6277       88          6189          0.809    27094      1359     937     3893     6175        4103         2050
8      6927300      8110       127         7983          1.152    34985      1897     1323    4763     7954        5362         2543
9      5843665      8075       145         7930          1.357    34924      1639     1290    5001     7908        5361         2524
10     5353904      7543       153         7390          1.380    35522      1456     1033    4901     7360        4956         2356
11     6773210      12091      358         11733         1.732    48967      3254     1862    6617     11674       7975         3674
12     6246677      10483      307         10176         1.629    42966      2760     1575    5841     10143       6887         3221
Total  100104806    162478     2599        159879        -        761606     33120    23001   103758   159334      107135       51713
Ratio  -            -          -           0.984         1.597    -          0.207    0.144   0.649    0.997       2.072        -
Ratio row: non-repetitive/all sites, SNPs/kb, fractions of non-repetitive sites, and transitions/transversions.
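The Ratio row can be recomputed from the Total row. The short check below uses only totals that appear in the table; the dictionary keys are my own labels for the columns.

```python
# Totals copied from the MB∩ML per-chromosome table.
totals = {
    "tiled_bases": 100_104_806,
    "snps_all": 162_478,
    "snps_repetitive": 2_599,
    "snps_nonrep": 159_879,
    "singletons": 33_120,        # sites with allele frequency = 0.05 (1 of 20 lines)
    "biallelic": 159_334,
    "transitions": 107_135,
    "transversions": 51_713,
}

ratios = {
    "nonrep_fraction": totals["snps_nonrep"] / totals["snps_all"],
    "snps_per_kb": totals["snps_nonrep"] / (totals["tiled_bases"] / 1000),
    "singleton_fraction": totals["singletons"] / totals["snps_nonrep"],
    "biallelic_fraction": totals["biallelic"] / totals["snps_nonrep"],
    "ti_tv": totals["transitions"] / totals["transversions"],
}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

The rounded values agree with the slide: 0.984, 1.597 SNPs/kb, 0.207 (the 20.7% singletons), 0.997, and a transition/transversion ratio of 2.072.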
- Data Analyses -
LD decay: ~200-500 kb (average across whole genome)
B. Padhukasahasram
[Figure (Haploview): groupwise r2 plots for Chr3 Development Region; Groups 1,2,3 (2099 sites); Group 3 (1238 sites); Group 2 (852 sites); Group 1 (133 sites); Groups 2,3 (1724 sites); Groups 1,3 (1355 sites); Groups 1,2 (967 sites)]
[Figure (Haploview): pairwise r2 for Chr3 Development Region plus ~1200 up/downstream SNPs; apparent significant LD over >300 kb]
Pairwise SNPs at MB ∪ ML (non-rep. features)
        Nippo  TNG67    LTH   M202   Azuc  Morob   Cypr   Doms    N22  Dular  FR13A  Rayad   ZS97   Aswi  Pokka  Swarn   IR64   SHZ2   MH63  SaduC
Nippo       -   9588  18678  21940  37806  44225  49085  47661 104131 121413 122940 123817  83133 128143 107213 107967 110278  91646 114764 117687
TNG67    9588      -  18298  21835  33620  40749  43432  47828  94638 111520 112597 113682  74246 116552  93951 100083  98693  87884 103190 108370
LTH     18678  18298      -  24971  30411  35462  41208  42618  84781 106355 109038 108883  78476 113906  97386 100235  97862  88525 104429 108022
M202    21940  21835  24971      -  33987  39711  43745  49197  93637 110746 112996 112568  77546 114562  98078  94442  99350  86796 102993 104388
Azuc    37806  33620  30411  33987      -  25415  28150  43440  80845  96213 100746  98302  73699 100695  86716  88397  80685  78739  92300  93238
Morob   44225  40749  35462  39711  25415      -  34066  47269  89522 104786 110306 102727  86009 113933 100161 101490  98804  90485 105040 106649
Cypr    49085  43432  41208  43745  28150  34066      -  47783  92673 108786 112420 111147  86198 111998  99884 100299  98129  89782 103662 106584
Doms    47661  47828  42618  49197  43440  47269  47783      -  78909  87774  93054  95349  78650  96438  89076  91418  91865  82245  98073  98338
N22    104131  94638  84781  93637  80845  89522  92673  78909      -  45416  57225  63151  73666  80428  74680  81565  81971  72488  83198  82532
Dular  121413 111520 106355 110746  96213 104786 108786  87774  45416      -  44756  49102  75010  63542  69319  74624  74823  65866  74168  74145
FR13A  122940 112597 109038 112996 100746 110306 112420  93054  57225  44756      -  40897  69518  67055  68085  71690  75220  67271  74544  74908
Rayad  123817 113682 108883 112568  98302 102727 111147  95349  63151  49102  40897      -  81842  72146  75656  77667  80452  73259  81016  80877
ZS97    83133  74246  78476  77546  73699  86009  86198  78650  73666  75010  69518  81842      -  65671  50486  50283  49877  39894  50342  46509
Aswi   128143 116552 113906 114562 100695 113933 111998  96438  80428  63542  67055  72146  65671      -  56771  53187  56282  51359  54886  54375
Pokka  107213  93951  97386  98078  86716 100161  99884  89076  74680  69319  68085  75656  50486  56771      -  45235  46549  41688  49183  45187
Swarn  107967 100083 100235  94442  88397 101490 100299  91418  81565  74624  71690  77667  50283  53187  45235      -  43740  38378  44363  44416
IR64   110278  98693  97862  99350  80685  98804  98129  91865  81971  74823  75220  80452  49877  56282  46549  43740      -  30471  39936  45986
SHZ2    91646  87884  88525  86796  78739  90485  89782  82245  72488  65866  67271  73259  39894  51359  41688  38378  30471      -  37918  39169
MH63   114764 103190 104429 102993  92300 105040 103662  98073  83198  74168  74544  81016  50342  54886  49183  44363  39936  37918      -  40173
SaduC  117687 108370 108022 104388  93238 106649 106584  98338  82532  74145  74908  80877  46509  54375  45187  44416  45986  39169  40173      -
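A matrix like this can be produced by counting, for every pair of varieties, the sites at which both have a call and the calls differ. The sketch below assumes that definition (the talk does not spell it out) and runs on hypothetical calls for three of the varieties.

```python
from itertools import combinations


def pairwise_differences(calls):
    """Count differing sites for each pair of varieties.

    calls: dict mapping variety name -> list of allele calls (None = no call).
    Sites lacking a call in either variety are skipped.
    """
    counts = {}
    for a, b in combinations(calls, 2):
        counts[(a, b)] = sum(
            1 for x, y in zip(calls[a], calls[b])
            if x is not None and y is not None and x != y
        )
    return counts


# Hypothetical allele calls at six sites.
toy_calls = {
    "Nippo": ["A", "C", "G", "T", "A", "C"],
    "IR64":  ["A", "T", "G", "C", None, "G"],
    "N22":   ["G", "T", "G", "C", "A", "C"],
}
differences = pairwise_differences(toy_calls)
print(differences)
```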
Indica/Aus Patterns of Shared SNPs (MB ∪ ML, 100kb window) [figure; saltol QTL marked] (Keyan Zhao)
Japonica/Indica Patterns of Shared SNPs (MB∩ML, 100kb window) [figures; sd1 and saltol marked] (Keyan Zhao)
Fst using biallelic SNPs (MB ∪ ML, 10kb window) [figure; Indica/Japonica Hybrid Sterility marked] (Keyan Zhao)
Haplotype sharing in 10kb sliding window (MB ∩ ML) (Keyan Zhao)
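For the windowed Fst scan, one common estimator compares expected heterozygosity within groups (H_S) with that of the pooled groups (H_T), summed over the bi-allelic sites in a window. The sketch below uses that form with two equally weighted groups. It is an assumption for illustration, since the talk does not state which Fst estimator was applied, and the allele frequencies are hypothetical.

```python
def fst_window(freqs_pop1, freqs_pop2):
    """Fst = (H_T - H_S) / H_T for one window of bi-allelic sites.

    freqs_pop1, freqs_pop2: per-site frequencies of the same allele in each group.
    Heterozygosities are summed over sites before taking the ratio.
    """
    h_s_sum = 0.0
    h_t_sum = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-group
        p_bar = (p1 + p2) / 2
        h_t = 2 * p_bar * (1 - p_bar)                       # pooled
        h_s_sum += h_s
        h_t_sum += h_t
    if h_t_sum == 0:
        return 0.0  # window with no variation
    return (h_t_sum - h_s_sum) / h_t_sum


# Hypothetical windows: fixed differences versus identical frequencies.
fixed = fst_window([1.0, 1.0, 0.0], [0.0, 0.0, 1.0])
shared = fst_window([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(fixed, shared)
```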
SNP Annotations for TIGR/MSU r5 (MB ∩ ML)
K. Childs, R. Davidson
Category    MB       ML       MB∪ML union  MB∩ML intersect
Genic       135,119  171,750  215,032      91,150
Coding      57,935   74,007   91,992       39,652
Intronic    56,693   71,779   90,301       37,883
5' UTR      7,374    10,116   12,660       4,794
3' UTR      14,195   17,179   21,751       9,551
Intergenic  105,306  142,764  179,646      67,778
Total SNPs  240,425  314,514  394,678      158,928
OryzaSNPdb r2
Summary
RIL Population Development
May need Bridging types for association across groups
(inbred NAM-type design with MAGIC also underway)
114 F1s produced or underway. Developing RILs from selected crosses
Crossing matrix (rows: female parent; columns: male parent, in the order Npb, TN67, M202, LTH, Cypr, Moro, Azuc, Dom, N22, Raya, FR13, Dula, Aswi, ZS97, Pokk, Swar, MH63, Sadu, IR64, SHZ2; entries are listed as extracted, so column positions in sparse rows are not preserved)
Nipponbare: f1 f1 f1 f2 f1 f1 f1 f1 X X f1 f1 f2 F1 f1 f1 f1 f1 F2
Tainung 67: X F2 F2 f2 F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 F2 F2
M202:
LTH:
Cypress: F1 F1 F1 F1
Moroberekan: F2 F2
Azucena: P P F2 P P P xx P P P P
Dom Sufid:
N 22: P P P P P P P P P
Rayada:
FR13 A:
Dular: F2
Aswina:
ZS97B: F1 F1
Pokkali: F1 F1 F1 F1
Swarna: X F2 F2 F2 X F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 X
MH63:
Sadu-cho:
IR64: f2 F2 F2 F2 f2 F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 F2
SHZ-2:
Phenotyping for agronomic and other characters.
WS2007
- Next -
OryzaSNP2 – IRFGC Consortium for Genome Wide Scans
• Develop high-density genotyping Affymetrix arrays with 600K SNPs
• Use tag SNPs from OryzaSNP MBML overlapping and filtered frequent, union calls
• Plus NGS discovered SNPs
• Design arrays with 6M features (Affymetrix design a la Arabidopsis):
  o USD 1 million for 2000 slides ($500 per slide plus processing for 2000 lines)
[Figure: variety groups labeled Indica (groups 1,2,3), Indica (group 4), Aus, Aromatic, Tropical (groups 1,2), Temperate, Japonica, Admixed]
• Genotype 2000+ diverse varieties under consortium
  o IRRI, Cornell, CIRAD, Academia-Sinica, Bayer, NYU, ICAR & Biotec (India), Korea, MPI-Tübingen, JIRCAS, Thailand, EMBRAPA (Brazil), Duke Uni, Wageningen Uni, Syngenta
• Phenotype targeted subsets for traits
• Test associations as routine in human genetics
• Discover and deliver new alleles
Variety group specific SNPs by NGS cover regions absent from OryzaSNP
• 6-8 indica
  – IR64-21
  – Sadu-cho
  – Fede Arroz
  – Pokkali or S. Asian
  – Madagascar
  – Chinese
  – European
• 4 aus
  – Dular
  – N22
  – 2 more
• Aswina?
• 4 aromatic
  – Dom sufid
  – Basmati 370
• 6 tropical japonica
  – Azucena
  – Moroberekan
  – Inia Tacuari
  – Gerdeh
  – Chinese
  – European
• 4 rufipogon
  – CSSL parents
• Rayada?
Choose lines purified and ready to go
Germplasm for Deep Genotyping: 2500-3000 lines
• GCP genotyping set (2339)
• GCP drought (1000)
• GCP Aus (300)
• Haploryza (500)
• Orytage Indica, Trop & Temp Japonica (600)
• NSF-TV (500)
• Hybrid parents (300)
• Madagascar (50)
• Rufipogon/nivara (100)
• MAGIC parents (20)
• USDA core (1500)
• Academia Sinica (20)
Identify unique set, choose purified or purify
Diversity (coverage), utility, trait donors, nominations
Phenotyping Consortium
• Drought - reproductive
• Salinity - seedling/reproductive
• Heat - hot/humid vs. hot/dry
• Yield - favorable
• …
Further ahead … but nearly there?
• If a $1000 genome (human) is possible
• 3 × 10^9 bases (human)
• 3.9 × 10^8 bases (rice)
• i.e. ~7.5 fold less
• then, a $133 rice genome might be possible!
• Current technologies (e.g. SBS by 454) are still almost as costly as HDOA.
• Prospect exists to sequence the entire IRGC
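The cost estimate is a simple scaling of the $1000 human genome by genome size, assuming cost is proportional to the number of bases. The arithmetic below uses the slide's genome sizes; the exact size ratio comes out slightly above the rounded ~7.5, which moves the estimate from about $133 to about $130.

```python
human_bases = 3e9       # from the slide
rice_bases = 3.9e8      # from the slide
human_cost_usd = 1000

fold_exact = human_bases / rice_bases            # about 7.69
cost_exact = human_cost_usd / fold_exact         # about 130
cost_slide = human_cost_usd / 7.5                # slide's rounded fold: about 133

print(f"fold difference: {fold_exact:.2f}")
print(f"rice genome at exact ratio: ${cost_exact:.0f}")
print(f"rice genome at ~7.5 fold:   ${cost_slide:.0f}")
```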
Acknowledgements
[Logos: OryzaSNP, Haploryza, GCP]
Renee Stokowski, Julie Montgomery, Laura Stuve, Heng Tao, Cindy Chen, Julee Kwon, Geoff Nielsen, Robert Gupta, Matt Morenzoni, David Cox, Eric Peacock, Diana Starr, Dennis Ballinger, Kelly Frazer, Keiko Greenberg
Jan Leach, R. Davidson
C. Robin Buell, Kevin Childs
Hei Leung, Dave Mackill, K. McNally, R. Bruskiewich, V.J. Ulat, V. Pamplona, M. Macatangay, S.H. Zhang, Bin Liu, M. Rodriguez, R. Reano, J. Detras, E. Nelzo, R. Mauleon
Detlef Weigel, Richard Clark, Georg Zeller, Regina Bohnert, Gunnar Rätsch, Gabriele Schweikert
T. Bureau, D. Hoen
C. Bustamante, Keyan Zhao, B. Padhukasahasram
CIRAD: G. Droc, C. Billot, B. Courtois, A.A. Farouk, N. Ahmadi, G. Clément, J. Taillebois
IRD: G. Second
IRAG: B. Barry
INRA-CNG: D. Brunel, A. Bérard
CNG: M. Lathrop, M. Foglio