Linking Candidate Genes to Drought for SNP...

Outline
1) SNPs and Association Genetics
2) Haploryza - ind/jap SNPs
3) Genome-wide SNP discovery
• OryzaSNP project
4) Summary
5) The Future

SNiPs & Associations

What is a SNiP?

is it the sound of ?

or the Scottish National Party (Pàrtaidh Nàiseanta na h-Alba)?

No…

• a difference between the nucleotide at a corresponding locus in one genome as compared to another.

• variation can be
o either A to G or C to T transitions
o or A/G to C/T transversions.

• the bulk of genetic variation is due to SNPs

a SNP is a Single Nucleotide Polymorphism

[Figure: the four nucleotides A, T, G, C]

For random mutations, Transition/Transversion = 0.5

Yet, for mammals the observed ratio is 1.4 - 1.7, indicating that each transition is 2.8 - 3.4 times as likely as each of the two possible transversions.

Therefore, SNPs are usually bi-allelic.
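The 0.5 expectation and the 2.8 - 3.4 fold figure can be checked with a short sketch. It assumes every one of the 12 possible base substitutions is equally likely under random mutation; the function names are illustrative only.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(substitutions):
    """Ratio of transitions to transversions among (ref, alt) pairs."""
    ts = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    tv = len(substitutions) - ts
    return ts / tv

# All 12 ordered substitutions, taken as equally likely (assumption):
# 4 are transitions and 8 are transversions.
all_changes = list(permutations("ACGT", 2))
random_ratio = ts_tv_ratio(all_changes)

# Observed mammalian ratios relative to the random expectation.
fold_low = 1.4 / random_ratio
fold_high = 1.7 / random_ratio
```

Dividing the observed ratio by the random expectation gives the per-path excess of transitions quoted on the slide.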

How do we use SNPs for mapping?

In Synthetic Populations
• Map as any other marker

In Natural Populations
• Use recombinations over historical time
• Measure the association of alleles with one another and with phenotypes.
• Use estimates of equilibrium/disequilibrium for the gametic phase.

Why use SNPs for mapping and association?

• Bi-allelic nature results in easier genotyping as opposed to SSRs

• Allelic variation might be indicative of function

• Association of alleles (2-states) with one another and with phenotypes is easily tested.

• Easy to estimate linkage equilibrium/disequilibrium (gametic phase)
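As a minimal sketch of why two-state alleles make linkage disequilibrium easy to estimate, the function below computes D and r² for two bi-allelic loci. It assumes phased haplotypes with alleles coded 0/1 and both loci polymorphic; the toy data are made up for illustration.

```python
def ld_stats(haplotypes):
    """D and r^2 for two bi-allelic loci from phased haplotypes.

    haplotypes: list of (allele_at_locus1, allele_at_locus2), alleles coded 0/1.
    Both loci must be polymorphic, otherwise r^2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # gametic disequilibrium
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Hypothetical data: alleles always travel together vs. fully shuffled.
complete_ld = [(1, 1)] * 10 + [(0, 0)] * 10
equilibrium = [(1, 1)] * 5 + [(1, 0)] * 5 + [(0, 1)] * 5 + [(0, 0)] * 5

d_full, r2_full = ld_stats(complete_ld)
d_none, r2_none = ld_stats(equilibrium)
```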

Rafalski 2002

p = 0.001 (Fisher’s Exact Test) p = 0.716

Rafalski 2002
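For reference, a two-sided Fisher's exact test on a 2x2 allele-by-phenotype table can be sketched in a few lines. This is a generic implementation using only the standard library, not the analysis behind the p-values in the figure; the example counts are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = allele A / allele G, columns = trait present / absent.
p_associated = fisher_exact_two_sided(8, 2, 1, 5)
p_unassociated = fisher_exact_two_sided(5, 5, 5, 5)
```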

Tag SNPs for mapping

Goldstein & Cavalleri, Nature 27 Oct 2005

Power of Association Tests
• Pre-requisites
o High density genotypes with coverage at level of LD in target population
o Detailed, replicated phenotype data
o Integrated database and Computation (!!)
• Use the pool of recombinants created over historical time
o Length (Haplotype blocks) → a minimum length over many generations of recombination
• Identify regions or genes by correlation of their genotypic pattern with a component of a trait.
• Advantages over classical mapping with SSRs
o Finding an association means you've gotten the gene or are at least very close to it.
o Either alone or when combined with specialized bridging populations, the power of detection is increased over QTL mapping.

HaplOryza Project
HD SNP Genotyping

G. Droc, C. Billot, B. Courtois, A.A. Farouk, N. Ahmadi, G. Clément, J. Taillebois, B. Barry, G. Second, D. Brunel, A. Bérard, M. Lathrop, M. Foglio, KL McNally

HaplOryza Platform
• 1536 SNPs chosen from Nipponbare (japonica) and 93-11 (indica) genome comparisons for
o LD scan of the entire genome with 1 SNP/320 kb (76 %)
o A closer LD scan on specific regions with 1 SNP/50 kb (24 %)
• test for introgression of different material:
• two low density areas (chr 7) and two normal density areas (chr 12)
o all types of mutation (syn vs non-synonymous)
o Best Illumina scores (> 0.9)
• 900 accessions
o A good representation of the overall genetic diversity (GCP Composite)
• "MiniGB" accessions (241), Complement from remaining 2757 accessions (231), and from previous study (46)
o Accessions presenting putative patterns of introgressions between indica and japonica groups
• Highland rice from Madagascar (132), Guinean rice (39), Chinese rice (12), Surinam rice (23), Breeding crosses: European (12), Brazilian (11)
o 20 Reference accessions chosen for the OryzaSNP project
o 133 Wild accessions
• Illumina GoldenGate assay

Rice SNP pipeline

• Only 1 polymorphism over a 50 bp window

• Only 1 hit to japonica or indica pseudomolecules (Excluding redundant hits)

• 100% identical sequence of 20 bp on either side of the polymorphism

• Only 1 SNP by location (Excluding redundant SNPs between the 3 datasets)
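Two of these filters can be sketched as follows. The sketch reads "only 1 polymorphism over a 50 bp window" as "no other polymorphism closer than 50 bp", which is an assumption about the pipeline, and the function name is hypothetical. The uniqueness-of-hit and flanking-sequence checks need alignments and are not shown.

```python
def isolated_snps(positions, window=50):
    """Keep positions with no other polymorphism within `window` bp.

    Duplicate positions (the same SNP reported by several datasets) are
    collapsed first, mirroring the "only 1 SNP by location" rule.
    """
    unique = sorted(set(positions))
    kept = []
    for i, pos in enumerate(unique):
        left_ok = i == 0 or pos - unique[i - 1] >= window
        right_ok = i == len(unique) - 1 or unique[i + 1] - pos >= window
        if left_ok and right_ok:
            kept.append(pos)
    return kept

# Hypothetical positions on one chromosome (bp); 100 is reported twice.
candidates = [100, 100, 120, 300, 500, 540, 900]
selected = isolated_snps(candidates)
```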

SNP distribution in the genome
[Figure panels: Intergenic & genic; Introns & exons]

Results on a subset of 548 samples
• 44 monomorphic SNPs (2.8 %) over 548 samples
• 31 SNPs (2.1 %) with over 50 % missing data
• 13 samples (2.1 %) with over 50 % missing loci
• 535 samples x 1461 SNPs with 7.4 % missing data
• Diversity analysis - 2.1 % heterozygous

[Figure: diversity analysis plot. Axis 1: 51.9 %, Axis 2: 5.7 %. Group labels: Indica, Jap Trop, Jap Temp, Surinam, Madagascar]

Diversity – Subsample O. sativa
• Random choice of 100 SNP (whole genome scanning)
• 283 accessions representing sativa diversity
• All sativa groups are well defined
• Factorial analysis, based on simple matching index
• Axis 1: 54 %, Axis 2: 3.5 %

[Figure: factorial analysis plot. Group labels: Indica, Japonica, Aus / Boro, Basmati]
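A simple matching dissimilarity between two accessions can be sketched as below. It assumes one allele call per locus (inbred, homozygous lines) and skips loci missing in either accession; how the actual analysis treated heterozygotes and missing data is not stated on the slide.

```python
def simple_matching_dissimilarity(calls_a, calls_b, missing=None):
    """Fraction of jointly scored loci where two accessions carry different alleles."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a != missing and b != missing]
    if not pairs:
        raise ValueError("no loci scored in both accessions")
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical SNP calls at five loci; None marks a missing genotype.
accession_1 = ["A", "G", "C", None, "T"]
accession_2 = ["A", "A", "C", "T", "C"]
dissimilarity = simple_matching_dissimilarity(accession_1, accession_2)
```

A matrix of such pairwise values is the usual input to a factorial (principal coordinates) analysis.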

Haplotyping – Subsample O.sativa

Genome-wide SNP Variation in Landraces and Modern Varieties of Rice

K.L. McNally, K.L. Childs, R. Bohnert, R.M. Davidson, K. Zhao, V.J. Ulat, G. Zeller, R. Clark, D. Hoen, T. Bureau, R. Stokowski, D.G. Ballinger, K.A. Frazer, D.R. Cox, B. Padhukasahasram, C.D. Bustamante, D. Weigel, R. Bruskiewich, G. Rätsch, C.R. Buell, D.J. Mackill, H. Leung, and J.E. Leach

Platform
• High density oligomer arrays based on Affymetrix technology
• 25-mers tiled with 1-base offset for ALL non-repetitive regions greater than 60 bp in length
• Complete tiling of both strands with full degeneracy
o 8 oligos per base
• 5 in. X 5 in. arrays each interrogating 25 Mb using over 160 million separate features

Perlegen Technology
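The tiling scheme can be illustrated with a small sketch. It reads "8 oligos per base" as 4 possible middle bases on each of 2 strands, which is an interpretation of the slide rather than Perlegen's published design; function names are illustrative, and the 27-base reference is assembled from the tiled example sequences shown in this deck.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def probe_set(reference, center, length=25):
    """The 8 oligos interrogating reference[center]: 4 middle bases x 2 strands."""
    half = length // 2
    if center < half or center + half >= len(reference):
        raise ValueError("position too close to the sequence end")
    window = reference[center - half:center + half + 1]
    # Substitute each base at the middle (13th) position of the 25-mer.
    forward = [window[:half] + base + window[half + 1:] for base in "ACGT"]
    reverse = [reverse_complement(p) for p in forward]
    return forward + reverse

def tile(reference, length=25):
    """Probe sets for every position with a full window (1-base offset)."""
    half = length // 2
    return {c: probe_set(reference, c, length)
            for c in range(half, len(reference) - half)}

reference = "CGGATCGCAATCAGGTTACGATACATC"
probe_sets = tile(reference)
```

Only one of the four forward oligos in each set matches the reference perfectly; a sample carrying a SNP at that position hybridizes best to a different one.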

http://www.oryzasnp.org/

Whole-wafer high-density microarrays

[Figure: wafer, 0.7 mm thick, 5 inches x 5 inches]

Up to 160 million "features"

Perlegen has exclusive access to Affymetrix whole wafer arrays, which can contain up to 160 million distinct features

ACAGTCATGCCGTATCGGTACGTTC

Checking for letter “A” in middle (13th) position.

About a million copies of a 25-letter long piece of DNA are built on each feature

One of the 160 million features





Matches!

Labeled DNA fragment from individual’s sample


Features are "tiled"…

CGGATCGCAATCAGGTTACGATACA
 GGATCGCAATCAGGTTACGATACAT
  GATCGCAATCAGGTTACGATACATC

…AGCATGT…

…AGCCTGT…

…AGCGTGT…

…AGCTTGT…

Probing for Match and Mismatch

Detection of DNA Variation Using Microarrays

[Figure: labeled DNA hybridized to chip. The DNA sequence of interest (5' to 3') is interrogated with T, G, C and A probes for two samples, DNA 1 and DNA 2, which differ at a single base (G in one, A in the other).]

Human Whole-Genome High-Density Oligonucleotide Arrays

223 arrays containing more than 10 billion unique probes

Human genome: 3 billion bp

1.6 million SNPs representing common human variation

Hinds et al 2005

- Implementation -

OryzaSNP Project Design
• Development phase
o 20 germplasm lines
o 379 kb unique sequence in 684 kb region on Chr3
o Perlegen optimized hybridization with 8 kb LR-PCR amplicons
o Pilot completed Oct 2006
• Discovery phase
o 20 germplasm lines and remaining 100 Mb sequence
o Performed SSD on lines, undergoing seed increase for phenotyping, crossing to produce F1s for RIL development
o High quality genomic DNAs produced by CsCl density gradient ultracentrifugation with 1 mg shipped to Perlegen for target production
o Perlegen Model-based predictions obtained April 2007

- Reference Genome -

International Rice Genome Sequencing Project (IRGSP) 2005 Nature 436:793-800

HQ BAC-by-BAC Nipponbare (< 1 error in 10K bases)

100 Mb of Nipponbare with 57% coverage of annotated gene models

- OryzaSNP Germplasm -

IRIS (Code)[a]  Variety                          Source            Origin       Group[b]
2254728 (2)     Nipponbare                       IRTP[c] 23787     Japan        Temp[d] japonica
2021623 (3)     Tainung 67                       Academia Sinica   Taiwan       Temp japonica
2254732 (7)     Li-Jiang-Xin-Tuan-Hei-Gu (LTH)   IRGC[e] 59323     China        Temp japonica
2254738 (4)     M 202                            IRGC 77142        U.S.A.       Temp japonica
2254730 (5)     Azucena                          IRTP 4209         Philippines  Trop[f] japonica
2254722 (6)     Moroberekan                      IRGC 12048        Guinea       Trop japonica
2254737 (9)     Cypress                          IRRI PBGB[g]      U.S.A.       Trop japonica
2254723 (8)     Dom-sufid                        IRGC 12880        Iran         Aromatic[h]
2254720 (11)    N 22                             IRGC 4819         India        Aus
2254725 (14)    Dular                            IRGC 32561        India        Aus
2254721 (13)    FR13 A                           IRGC 6144         India        Aus
2254726 (12)    Rayada                           IRGC 77210        Bangladesh   Deep-water 4
2254724 (10)    Aswina                           IRGC 26289        Bangladesh   Deep-water 3
2254729 (21)    IR64 (IR64-21)                   IRRI PBGB         Philippines  Indica
2254731 (18)    Shan-Huang Zhan-2 (SHZ2)         IRTP 16338        China        Indica
2254727 (19)    Pokkali                          IRGC 108921       India        Indica
2254736 (20)    Swarna                           IRRI PBGB         India        Indica
2254719 (17)    Sadu-cho                         IRGC 2243         Korea        Indica
2030504 (15)    Minghui 63                       Huazhong Ag. Un.  China        Indica
2030525 (16)    Zhenshan 97B                     Huazhong Ag. Un.  China        Indica

[a] The germplasm identifier (GID) in the International Rice Information System (IRIS, iris.irri.org) and numerical code assigned for analyses
[b] Determined by isozyme, SSR, SNP analyses, or a combination thereof
[c] International Network for the Genetic Evaluation of Rice accession
[d] Temperate type
[e] International Rice Genebank Collection accession
[f] Tropical type
[g] Plant Breeding, Genetics and Biotechnology (Division)
[h] basmati (aromatic) plant type

Diverse rice varieties re-sequenced by HDOA.

Plant Types in OryzaSNP set

[Photo labels: IR64, IAC 165, M202, Moroberekan, Dom Sufid, Cypress, Pokkali, Aswina, Swarna, Inia Tacuari, Co 39, Patbyeo, Gerdeh, Dular, Sadu-cho]

Panicle/Seed Types in OryzaSNP set

- Data Production & SNP Predictions -

Design & Produce Tiling Arrays

20 * 13,582 LR-PCR rxns

Predict SNPs by Model-Based Algorithms (hybridization footprint intensities)

Optimize Conditions for Rice

+

Shear, Label, Hyb, Scan

R. Bohnert/G. Raetsch

Predict SNPs by Machine Learning Algorithms

(Train SVMs with hybridization intensities, qualities, and known SNPs. Assess quality, recall, and FDR.)

R. Stokowski/D. Ballinger

Summary statistics for MB and ML predictions, their union (MB ∪ ML) and intersection (MB ∩ ML)

Dataset    All SNPs   Non-rep SNPs   Genotypes   Freq ≥0.15 (%)   Biallelic (%)   Transi/Transv
MB          259,721        242,196   1,242,410             67.2            97.4           1.900
ML          326,471        316,373   1,349,341             53.8            97.7           1.665
MB ∪ ML     422,244        397,348   1,824,074             56.9            97.1           1.654
MB ∩ ML     162,478        159,879     761,606             64.9            99.7           2.072

Complementarity of SNPs Identified by MB & ML Algorithms

[Venn diagram labels]
Set         SNPs      %      [recall:FDR]
ML          316,373   39%    [7.9:14.0]
Intersect   159,879   38%    [9.7:3.2]
Union       397,348   100%   [27.8:12.3]
MB          242,196   23%    [10.2:24.7]
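How recall and FDR respond to taking the union or intersection of two call sets can be sketched with toy data. The function below is a generic definition (recall = verified true calls / known SNPs, FDR = verified false calls / verified calls) and all positions are invented; the exact validation procedure used for the figures above is not described on the slide.

```python
def recall_and_fdr(predicted, known_true, assessed):
    """Recall and false discovery rate of SNP calls against a validation set.

    predicted:  set of positions called as SNPs
    known_true: set of positions known to be real SNPs
    assessed:   set of positions whose true status is known (SNP or not)
    """
    checked = predicted & assessed            # only calls that can be verified
    true_pos = checked & known_true
    recall = len(true_pos) / len(known_true)
    fdr = (len(checked) - len(true_pos)) / len(checked)
    return recall, fdr

# Hypothetical validation region: positions 1-20 assessed, 1-10 are real SNPs.
assessed = set(range(1, 21))
known = set(range(1, 11))
mb_calls = {1, 2, 3, 4, 11, 12}
ml_calls = {3, 4, 5, 6, 7, 13}

mb_stats = recall_and_fdr(mb_calls, known, assessed)
ml_stats = recall_and_fdr(ml_calls, known, assessed)
union_stats = recall_and_fdr(mb_calls | ml_calls, known, assessed)
intersect_stats = recall_and_fdr(mb_calls & ml_calls, known, assessed)
```

In this toy case the union has the highest recall and the intersection the lowest FDR, the same qualitative trade-off as in the figures above.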

Summary Statistics by Chromosome (MB∩ML)

1.597 SNPs/kb

20.7% singletons

All sites, Repetitive: SNPs at all sites and at repetitive sites. Non-rep. through AF ≥ 0.15: SNPs at non-repetitive sites (number, SNPs/kb, genotypes, sites with allele frequency = 0.05, = 0.1, ≥ 0.15). Last three columns: SNP types at bi-allelic sites.

Chr.   Tiled bases  All sites  Repetitive  Non-rep.  SNPs/kb  Genotypes  AF = 0.05  AF = 0.1  AF ≥ 0.15  Bi-allelic  Transitions  Transversions
1         13383899      28984         345     28639    2.140     149653       5321      3638      19680       28538        19041           9448
2         11033444      16773         186     16587    1.503      79223       3577      2397      10613       16539        10991           5506
3         11970871      18764         180     18584    1.552      94490       3203      2476      12905       18531        12449           6012
4          9113512      12824         278     12546    1.377      50449       3251      2158       7137       12498         8334           4128
5          7803388      15242         146     15096    1.935      80477       2210      1747      11139       15039        10158           4828
6          8003969      17312         286     17026    2.127      82856       3193      2565      11268       16975        11518           5423
7          7650967       6277          88      6189    0.809      27094       1359       937       3893        6175         4103           2050
8          6927300       8110         127      7983    1.152      34985       1897      1323       4763        7954         5362           2543
9          5843665       8075         145      7930    1.357      34924       1639      1290       5001        7908         5361           2524
10         5353904       7543         153      7390    1.380      35522       1456      1033       4901        7360         4956           2356
11         6773210      12091         358     11733    1.732      48967       3254      1862       6617       11674         7975           3674
12         6246677      10483         307     10176    1.629      42966       2760      1575       5841       10143         6887           3221
Total    100104806     162478        2599    159879               761606      33120     23001     103758      159334       107135          51713
Ratio                                         0.984    1.597                  0.207     0.144      0.649       0.997        2.072 (transitions/transversions)
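The Ratio row follows from the Total row. A small check, assuming (as the arithmetic suggests) that SNPs/kb and the frequency fractions are taken relative to SNPs at non-repetitive sites:

```python
# Values copied from the Total row of the table.
totals = {
    "tiled_bases": 100104806,
    "snps_all": 162478,
    "snps_nonrep": 159879,
    "af_0_05": 33120,      # singletons among 20 lines
    "af_0_10": 23001,
    "af_ge_0_15": 103758,
    "biallelic": 159334,
    "transitions": 107135,
    "transversions": 51713,
}

nonrep = totals["snps_nonrep"]
ratios = {
    "nonrep_fraction": round(nonrep / totals["snps_all"], 3),
    "snps_per_kb": round(nonrep / totals["tiled_bases"] * 1000, 3),
    "singletons": round(totals["af_0_05"] / nonrep, 3),
    "af_0_10": round(totals["af_0_10"] / nonrep, 3),
    "af_ge_0_15": round(totals["af_ge_0_15"] / nonrep, 3),
    "biallelic": round(totals["biallelic"] / nonrep, 3),
    "ts_tv": round(totals["transitions"] / totals["transversions"], 3),
}
```

This reproduces the 1.597 SNPs/kb and 20.7 % singleton figures highlighted on the slide.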

- Data Analyses -

LD decay: ~200-500 kb (average across whole genome)

B. Padhukasahasram

Groupwise r2 plots for Chr3 Development Region (Haploview)
• Groups 1,2,3 (2099 sites)
• Group 3 (1238 sites)
• Group 2 (852 sites)
• Group 1 (133 sites)
• Groups 2,3 (1724 sites)

Pairwise r2 for Chr3 Development Region plus ~1200 up/downstream SNPs (Haploview)

Apparent significant LD over >300 kb

Pairwise SNPs at MB ∪ ML (non-rep. features)

        Nippo  TNG67    LTH   M202   Azuc  Morob   Cypr   Doms    N22  Dular  FR13A  Rayad   ZS97   Aswi  Pokka  Swarn   IR64   SHZ2   MH63  SaduC
Nippo       -   9588  18678  21940  37806  44225  49085  47661 104131 121413 122940 123817  83133 128143 107213 107967 110278  91646 114764 117687
TNG67    9588      -  18298  21835  33620  40749  43432  47828  94638 111520 112597 113682  74246 116552  93951 100083  98693  87884 103190 108370
LTH     18678  18298      -  24971  30411  35462  41208  42618  84781 106355 109038 108883  78476 113906  97386 100235  97862  88525 104429 108022
M202    21940  21835  24971      -  33987  39711  43745  49197  93637 110746 112996 112568  77546 114562  98078  94442  99350  86796 102993 104388
Azuc    37806  33620  30411  33987      -  25415  28150  43440  80845  96213 100746  98302  73699 100695  86716  88397  80685  78739  92300  93238
Morob   44225  40749  35462  39711  25415      -  34066  47269  89522 104786 110306 102727  86009 113933 100161 101490  98804  90485 105040 106649
Cypr    49085  43432  41208  43745  28150  34066      -  47783  92673 108786 112420 111147  86198 111998  99884 100299  98129  89782 103662 106584
Doms    47661  47828  42618  49197  43440  47269  47783      -  78909  87774  93054  95349  78650  96438  89076  91418  91865  82245  98073  98338
N22    104131  94638  84781  93637  80845  89522  92673  78909      -  45416  57225  63151  73666  80428  74680  81565  81971  72488  83198  82532
Dular  121413 111520 106355 110746  96213 104786 108786  87774  45416      -  44756  49102  75010  63542  69319  74624  74823  65866  74168  74145
FR13A  122940 112597 109038 112996 100746 110306 112420  93054  57225  44756      -  40897  69518  67055  68085  71690  75220  67271  74544  74908
Rayad  123817 113682 108883 112568  98302 102727 111147  95349  63151  49102  40897      -  81842  72146  75656  77667  80452  73259  81016  80877
ZS97    83133  74246  78476  77546  73699  86009  86198  78650  73666  75010  69518  81842      -  65671  50486  50283  49877  39894  50342  46509
Aswi   128143 116552 113906 114562 100695 113933 111998  96438  80428  63542  67055  72146  65671      -  56771  53187  56282  51359  54886  54375
Pokka  107213  93951  97386  98078  86716 100161  99884  89076  74680  69319  68085  75656  50486  56771      -  45235  46549  41688  49183  45187
Swarn  107967 100083 100235  94442  88397 101490 100299  91418  81565  74624  71690  77667  50283  53187  45235      -  43740  38378  44363  44416
IR64   110278  98693  97862  99350  80685  98804  98129  91865  81971  74823  75220  80452  49877  56282  46549  43740      -  30471  39936  45986
SHZ2    91646  87884  88525  86796  78739  90485  89782  82245  72488  65866  67271  73259  39894  51359  41688  38378  30471      -  37918  39169
MH63   114764 103190 104429 102993  92300 105040 103662  98073  83198  74168  74544  81016  50342  54886  49183  44363  39936  37918      -  40173
SaduC  117687 108370 108022 104388  93238 106649 106584  98338  82532  74145  74908  80877  46509  54375  45187  44416  45986  39169  40173      -

Indica/Aus Patterns of Shared SNPs (MB ∪ ML, 100 kb window)

saltol QTL

Keyan Zhao

Japonica/Indica Patterns of Shared SNPs (MB∩ML, 100kb window)

Keyan Zhao

sd1

saltol

Keyan Zhao

Keyan Zhao

Fst using biallelic SNPs (MB ∪ ML, 10 kb window)

Keyan Zhao

Indica/Japonica Hybrid Sterility

Keyan Zhao

Haplotype sharing in 10kb sliding window (MB ∩ ML)

K. Childs, R. Davidson

              MB        ML        MBML union   MBML intersect
Genic         135,119   171,750   215,032      91,150
Coding         57,935    74,007    91,992      39,652
Intronic       56,693    71,779    90,301      37,883
5' UTR          7,374    10,116    12,660       4,794
3' UTR         14,195    17,179    21,751       9,551
Intergenic    105,306   142,764   179,646      67,778
Total SNPs    240,425   314,514   394,678     158,928

SNP Annotations for TIGR/MSU r5 (MB ∩ ML)

OryzaSNPdb r2

Summary

RIL Population Development
May need Bridging types for association across groups (inbred NAM-type design with MAGIC also underway)

114 F1s produced or underway. Developing RILs from selected crosses

male female . Npb TN67 M202 LTH Cypr Moro Azuc Dom N22 Raya FR13 Dula Aswi ZS97 Pokk Swar MH63 Sadu IR64 SHZ2
Nipponbare f1 f1 f1 f2 f1 f1 f1 f1 X X f1 f1 f2 F1 f1 f1 f1 f1 F2
Tainung 67 X F2 F2 f2 F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 F2 F2
M202
LTH
Cypress F1 F1 F1 F1
Moroberekan F2 F2
Azucena P P F2 P P P xx P P P P
Dom Sufid
N 22 P P P P P P P P P
Rayada
FR13 A
Dular F2
Aswina
ZS97B F1 F1
Pokkali F1 F1 F1 F1
Swarna X F2 F2 F2 X F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 X
MH63
Sadu-cho
IR64 f2 F2 F2 F2 f2 F2 F2 F2 F2 X F2 F2 F2 F2 f2 F2 F2 F2 F2
SHZ-2

Phenotyping for agronomic and other characters.

WS2007

- Next -

OryzaSNP2 – IRFGC Consortium for Genome Wide Scans

• Develop high-density genotyping Affymetrix arrays with 600K SNPs
• Use tag SNPs from OryzaSNP MBML overlapping and filtered frequent, union calls
• Plus NGS discovered SNPs
• Design arrays with 6M features (Affymetrix design a la Arabidopsis):
o USD 1 million for 2000 slides ($500 per slide plus processing for 2000 lines)

[Figure labels: Indica (groups 1,2,3), Aus, Aromatic, Tropical (groups 1,2), Temperate, Japonica, Admixed, Admixed, Indica (group 4)]

• Genotype 2000+ diverse varieties under consortium
• IRRI, Cornell, CIRAD, Academia-Sinica, Bayer, NYU, ICAR & Biotec (India), Korea, MPI-Tubingen, JIRCAS, Thailand, EMBRAPA (Brazil), Duke Uni, Wageningen Uni, Syngenta
• Phenotype targeted subsets for traits
• Test associations as routine in human genetics
• Discover and deliver new alleles

Variety group specific SNPs by NGS cover regions absent from OryzaSNP

• 6-8 indica
– IR64-21
– Sadu-cho
– Fede Arroz
– Pokkali or S. Asian
– Madagascar
– Chinese
– European
• 4 aus
– Dular
– N22
– 2 more
• Aswina?
• 4 aromatic
– Dom sufid
– Basmati 370
• 6 tropical japonica
– Azucena
– Moroberekan
– Inia Tacuari
– Gerdeh
– Chinese
– European
• 4 rufipogon
– CSSL parents
• Rayada?

Choose lines purified and ready to go

Germplasm for Deep Genotyping
2500-3000 lines

• GCP genotyping set (2339)
• GCP drought (1000)
• GCP Aus (300)
• Haploryza (500)
• Orytage Indica, Trop & Temp Japonica (600)
• NSF-TV (500)
• Hybrid parents (300)
• Madagascar (50)
• Rufipogon/nivara (100)
• MAGIC parents (20)
• USDA care (1500)
• Academia Sinica (20)

Identify unique set, choose purified or purify

Diversity (coverage), utility, trait donors, nominations

Phenotyping Consortium

• Drought - reproductive
• Salinity - seedling/reproductive
• Heat - hot/humid vs. hot/dry
• Yield - favorable
• …

• If a $1000 genome (human) is possible
• 3 × 10^9 bases (human)
• 3.9 × 10^8 bases (rice)
• i.e. ~7.5 fold less
• then, a $133 rice genome might be possible!
• Current technologies (e.g. SBS by 454) are still almost as costly as HDOA.
• Prospect exists to sequence the entire IRGC
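The scaling argument can be written out explicitly. The sketch assumes cost scales linearly with genome size, which is the slide's premise and not an established price model:

```python
human_bases = 3e9        # approximate human genome size
rice_bases = 3.9e8       # approximate rice genome size
human_cost_usd = 1000    # the hoped-for "$1000 genome"

fold_smaller = human_bases / rice_bases          # about 7.7
rice_cost_usd = human_cost_usd / fold_smaller    # about 130
```

The exact quotient is closer to 7.7 than 7.5, so the ~7.5 fold and $133 quoted above are rounded figures; either way the estimate lands near $130.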

Further ahead …

… but nearly there?

Acknowledgements

OryzaSNP
Renee Stokowski, Julie Montgomery, Laura Stuve, Heng Tao, Cindy Chen, Julee Kwon, Geoff Nielsen, Robert Gupta, Matt Morenzoni, David Cox, Eric Peacock, Diana Starr, Dennis Ballinger, Kelly Frazer, Keiko Greenberg
Jan Leach, R. Davidson
C. Robin Buell, Kevin Childs
Hei Leung, Dave Mackill, K. McNally, R. Bruskiewich, V.J. Ulat, V. Pamplona, M. Macatangay, S.H. Zhang, Bin Liu, M. Rodriguez, R. Reano, J. Detras, E. Nelzo, R. Mauleon
Detlef Weigel, Richard Clark, Georg Zeller, Regina Bohnert, Gunnar Rätsch, Gabriele Schweikert
T. Bureau, D. Hoen
C. Bustamante, Keyan Zhao, B. Padhukasahasram

Haploryza
CIRAD: G. Droc, C. Billot, B. Courtois, A.A. Farouk, N. Ahmadi, G. Clément, J. Taillebois
IRD: G. Second
IRAG: B. Barry
INRA-CNG: D. Brunel, A. Bérard
CNG: M. Lathrop, M. Foglio

GCP
