SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. ·...

John McEwan AgResearch PAG Jan 2010 SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser IIx



John McEwan, AgResearch

PAG Jan 2010

SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser IIx


Summary

• 4.1M SNPs

• 8 lanes

• ~1c/SNP

• 9X with 7 animals

• 100bp PER

• Sufficient for SNP chip


Deer SNPs… lessons learned

• Illumina GA IIx 100bp PER ~500bp insert 3Gbp x 7 animals

• Select animals span genetic diversity

• 1 flow cell 8 lanes

– WGS … more even coverage

– 100bp reads > match to related genome

– 8X coverage …. >98% depth of 4 or greater

– Low coverage SNPs vital to track read source

– Better info on flanking sequence

– PER = better assembly (by simulation)

– Forms basis for draft sequence of a genome

– Sheep ~$2M in 2007 for 3X; ~$50K in 2009 for 9X

– Started Sept 2009, sequenced late Oct with Illumina


[Workflow diagram]

• Animals: Wob 1, War 1, Red 1, Eas 1 M, Elk 1, Elk 2, Hun 1

• Lanes per animal: 1x each, except one animal at 2x (8 lanes in total)

• Sequencing → Repeat mask → Blast UMD3 → Assemble with Velvet → Meld against bovine scaffold → Detect SNPs


Sequence

• 8 lanes

• 100bp PER

• 284.3M reads

• 28.4Gbp

• High % full length

• Not trimmed
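The yield figures on this slide can be cross-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the ~3 Gbp genome size is an assumption taken from the "3Gbp" figure on the lessons-learned slide, and the variable names are illustrative.

```python
# Back-of-envelope check of the sequencing yield quoted on this slide.
reads_millions = 284.3    # reads, from the slide
read_length_bp = 100      # 100bp paired-end reads, not trimmed
genome_size_gbp = 3.0     # ASSUMPTION: ~3 Gbp genome
n_animals = 7

total_gbp = reads_millions * 1e6 * read_length_bp / 1e9
pooled_coverage = total_gbp / genome_size_gbp
per_animal_coverage = pooled_coverage / n_animals

print(f"total bases: {total_gbp:.1f} Gbp")
print(f"pooled coverage: {pooled_coverage:.1f}X")
print(f"mean coverage per animal: {per_animal_coverage:.1f}X")
```

Under that assumption the pooled depth comes out between 9X and 10X, and a little over 1X per animal.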


Masking

• Used Repeatmasker

• Used Ruminantia db

• Supplemented with:

– >10 identical reads assembly

– Multiple blast hit assembly

– Sped up sequence matching

– Greatly reduced output size

• Optimal sensitivity settings for masking and for mapping need to be different!
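One way to find the over-represented reads used to supplement the repeat database (">10 identical reads") is a simple count of exact duplicates. The sketch below is illustrative only: the function names are hypothetical, and the slides do not say how the reads were actually counted or how they were then assembled.

```python
from collections import Counter

def overrepresented_reads(reads, min_copies=11):
    """Return the set of read sequences seen at least `min_copies` times
    (min_copies=11 corresponds to ">10 identical reads")."""
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= min_copies}

def soft_mask(read, repeats):
    """Lower-case a read that is in the repeat set (soft masking); hypothetical helper."""
    return read.lower() if read in repeats else read

reads = ["ACGTACGT"] * 11 + ["TTTTCCCC"] * 10 + ["GGGGAAAA"]
repeats = overrepresented_reads(reads)
print(repeats)
print(soft_mask("ACGTACGT", repeats), soft_mask("GGGGAAAA", repeats))
```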


Mapping: deer reads to UMD3

• Used Megablast

• Options

-D 3 -t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -s 56 -N 2 -F "m D" -U T

• Optimised for speed with maximal specificity & % unique hits

• ~10% more hits if Blastn hits (W 9, sensitive blast) are also added

• Used unique hits and where ehit1/ehit2 =1e-20

[Chart: values 48, 44, 8 and 52; labels not recovered]
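The hit-selection rule on this slide can be sketched as a filter over parsed hits. The reading used here is an assumption: a read is kept if it has a single hit, or if its best E-value divided by its second-best E-value is at or below 1e-20. The function name and the (read, subject, E-value) tuple layout are illustrative and not the author's code.

```python
def select_unique_hits(hits, max_ratio=1e-20):
    """hits: iterable of (read_id, subject_id, evalue).
    Returns {read_id: subject_id} for reads with an unambiguous best hit."""
    by_read = {}
    for read_id, subject_id, evalue in hits:
        by_read.setdefault(read_id, []).append((evalue, subject_id))

    kept = {}
    for read_id, read_hits in by_read.items():
        read_hits.sort()                      # smallest E-value first
        best_e, best_subject = read_hits[0]
        if len(read_hits) == 1:               # a genuinely unique hit
            kept[read_id] = best_subject
            continue
        second_e = read_hits[1][0]
        # ASSUMPTION: keep when the best hit is at least 1e20 times better;
        # a second E-value of 0 means the runner-up is just as good.
        if second_e > 0 and best_e / second_e <= max_ratio:
            kept[read_id] = best_subject
    return kept

hits = [
    ("r1", "chr1", 1e-50),
    ("r2", "chr1", 1e-50), ("r2", "chr2", 1e-45),
    ("r3", "chr3", 1e-50), ("r3", "chr4", 1e-20),
]
print(select_unique_hits(hits))
```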


Mapping Specificity

• High specificity

• P~0.0009-0.004

• Some animal diffs?


Est distance between mate pair ends

• ~200bp insert sizes

• 0.02-0.03% of mate pairs had the wrong orientation when both ends matched the same chromosome → Blast criteria are very specific
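The orientation check can be sketched as follows. It assumes standard paired-end geometry (the two ends face each other, so the leftmost end maps to the + strand and the rightmost to the - strand) and uses leftmost mapped coordinates; the function name and return labels are illustrative.

```python
def classify_pair(chrom1, start1, strand1, chrom2, start2, strand2):
    """Classify a mapped read pair.
    Returns (label, distance), with distance None when the ends are on different chromosomes."""
    if chrom1 != chrom2:
        return "different_chromosome", None
    # Order the two ends by their position on the reference.
    (left_start, left_strand), (right_start, right_strand) = sorted(
        [(start1, strand1), (start2, strand2)]
    )
    distance = right_start - left_start
    # ASSUMPTION: a correctly oriented pair is + on the left and - on the right.
    if left_strand == "+" and right_strand == "-":
        return "ok", distance
    return "wrong_orientation", distance

print(classify_pair("chr1", 100, "+", "chr1", 250, "-"))
print(classify_pair("chr1", 100, "-", "chr1", 250, "+"))
```

Counting the "wrong_orientation" pairs among same-chromosome matches gives a figure comparable to the 0.02-0.03% quoted on the slide.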

Page 12: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Velvet assembly criteria selection

• Varied kmer length

• N50 length

• % assembly coverage

• Non chimeric %

• cf CAP3

• Chose default kmer=31

Page 13: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Velvet assembly

• 1Mbp regions assembled

• Divide and conquer approach

• Many small contigs

• 58.6% length UMD3

• UMD3 59.5% unique!

• N50 & coverage affected by insert length

• Better for SNP oligo design

Results

Stage      Measure            Value
Start      N sequences (M)    284.3
Blast      N sequences (M)    147.3
Assembly   Contigs (M)        3.2
Assembly   Bases (Gbp)        1.562
Assembly   N50 (bp)           813
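N50 is used here both to choose the kmer length and to report the assembly. For reference, a minimal sketch of the usual definition (the contig length at which the longest contigs together hold at least half of the assembled bases); the function name is illustrative and this is not Velvet's own code.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```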

Page 14: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Meld Process

Ovine contigs

Align (BLAST)

reference

contigs

MELD: CTAGTGCATGCTGCactTGCTataTGTGCtagNNgcATATTGCTGNNTGCTAT

Bovine reference scaffold

Figure 2. Creation of MELDed ovine sequence using bovine as a guide genome


Deer contigs

Page 15: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Meld and overall assembly stats

• Reduce contigs 34%

• Reduce length 8%

• Increase N50 26%

• 53.8% coverage of UMD3

• Optimised for SNP discovery
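Applying these percentages to the Velvet assembly figures (3.2M contigs, 1.562 Gbp, N50 813bp) gives approximate post-meld values. The derived numbers below are not stated in the slides; they are a sketch to show that the quoted figures hang together, including the SNP spacing reported later.

```python
# Velvet assembly figures from the Results table
contigs_m, bases_gbp, n50_bp = 3.2, 1.562, 813

# Percent changes quoted on this slide
melded_contigs_m = contigs_m * (1 - 0.34)   # "Reduce contigs 34%"
melded_bases_gbp = bases_gbp * (1 - 0.08)   # "Reduce length 8%"
melded_n50_bp = n50_bp * (1 + 0.26)         # "Increase N50 26%"

# Final SNP count from the SNP results slide
snps_m = 4.1
bp_per_snp = melded_bases_gbp * 1e9 / (snps_m * 1e6)

print(f"contigs: {melded_contigs_m:.2f}M")
print(f"length: {melded_bases_gbp:.3f} Gbp")
print(f"N50: {melded_n50_bp:.0f} bp")
print(f"one SNP per ~{bp_per_snp:.0f} bp")
```

The last figure lands close to the "~1/349bp" density reported with the 4.1M SNPs.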

Page 16: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

% Assembly Refseq Coverage

• Masked Bov refseqs

• Mapped deer assembly

• ~13% not mapped

• 80% refseqs >40% unique coverage

• Seq matched 66%

• Conservative

Page 17: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP detection

• SNP Detection Criteria

– Stacking: collapsed where reads same start base

– Depth: >3 (98% of sequence) and <17 reads deep

– Minor allele: at least 2 reads present

– SNP Class:

A: 2 or more animals present for both alleles

B: 2 or more animals present for at least 1 allele

C: alleles present in one animal

– SNP quality:

• discarded if 10bp flanking sequence has variants

– Previous expts get ~93% conversion rate on SNP chip
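The criteria above can be sketched as a per-site caller plus a proximity filter. This is an illustrative reading and not the pipeline used for the talk: collapsing stacks per animal, treating ">3 and <17" as depths 4 to 16 after collapsing, and dropping both members of a close pair of variants are all assumptions, as are the function names.

```python
from collections import Counter, defaultdict

def call_snp(observations, min_depth=4, max_depth=16, min_minor_reads=2):
    """observations: list of (animal, read_start, base) for reads covering one site.
    Returns (major, minor, snp_class) or None if the site fails a criterion."""
    # Stacking: keep one read per (animal, read_start).
    collapsed = {}
    for animal, read_start, base in observations:
        collapsed.setdefault((animal, read_start), base)

    # Depth: >3 and <17 reads deep.
    depth = len(collapsed)
    if not (min_depth <= depth <= max_depth):
        return None

    allele_reads = Counter(collapsed.values())
    if len(allele_reads) != 2:
        return None                       # monomorphic, or more than two alleles
    (major, _), (minor, minor_n) = allele_reads.most_common(2)
    if minor_n < min_minor_reads:         # minor allele needs at least 2 reads
        return None

    # SNP class from the number of animals carrying each allele.
    animals = defaultdict(set)
    for (animal, _), base in collapsed.items():
        animals[base].add(animal)
    n_major, n_minor = len(animals[major]), len(animals[minor])
    if n_major >= 2 and n_minor >= 2:
        snp_class = "A"
    elif n_major >= 2 or n_minor >= 2:
        snp_class = "B"
    else:
        snp_class = "C"
    return major, minor, snp_class

def drop_crowded(positions, flank=10):
    """Drop candidates with another variant within `flank` bp (positions on one contig)."""
    positions = sorted(positions)
    keep = []
    for i, pos in enumerate(positions):
        near_prev = i > 0 and pos - positions[i - 1] <= flank
        near_next = i + 1 < len(positions) and positions[i + 1] - pos <= flank
        if not (near_prev or near_next):
            keep.append(pos)
    return keep

site = [("Elk1", 10, "A"), ("Elk1", 10, "A"), ("Elk2", 12, "A"),
        ("Red1", 15, "G"), ("Hun1", 20, "G"), ("War1", 22, "A")]
print(call_snp(site))
print(drop_crowded([100, 105, 200, 300, 311]))
```

Keeping the animal label on every read is what makes the A/B/C classes possible, which is the point of "vital to track read source" on the lessons-learned slide.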

Page 18: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Read Depth distribution at SNP calls

• ~Poisson

• Little genome bias?
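If depth is roughly Poisson, the fraction of sites reaching a given depth follows directly from the mean coverage. A minimal sketch under that idealised model (no mapping bias, no masking losses); the function name is illustrative.

```python
import math

def poisson_fraction_at_least(mean_depth, min_depth):
    """P(X >= min_depth) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below

for coverage in (8, 9):
    print(coverage, round(poisson_fraction_at_least(coverage, 4), 3))
```

Under this model a mean depth of 9 puts roughly 98% of sites at depth 4 or more, in line with the 98% quoted in the detection criteria.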

Page 19: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Illumina Deer SNP Results

• 38% removed proximity filter

• 5% removed depth filter

• leaves 4.1M SNPs ~1/349bp

• ~90% pass design (0.8 threshold)

• ~1.98M Class A Infinium 2 SNPs

• A = both alleles seen in 2 deer

• SNP chip real estate

– Infinium 2 SNP: 1 probe 50bp, no G/C, A/T

– Infinium 1 SNP: 2 probes 50bp
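The "real estate" point is that A/T and G/C SNPs need the two-probe Infinium 1 design, while all other SNPs fit the one-probe Infinium 2 design. A small sketch of that rule (the function name is illustrative):

```python
def infinium_design(allele1, allele2):
    """Return 'Infinium 1' (2 probes) for A/T and G/C SNPs, otherwise 'Infinium 2' (1 probe)."""
    pair = frozenset((allele1.upper(), allele2.upper()))
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    if pair in (frozenset("AT"), frozenset("CG")):
        return "Infinium 1"
    return "Infinium 2"

for snp in ("A/C", "A/G", "A/T", "G/C"):
    print(snp, infinium_design(*snp.split("/")))
```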

Page 20: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Estimated Minor allele frequency

• Bias to high MAF

• SNP chip results will be similar

• Average MAF =0.3

Page 21: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP density across genome

[Plot: SNP number/Mbp (log scale, 1 to 10000) against position on Chromosome 1 (0 to 160 Mbp); series: A/C, A/G, A/T, G/C, Total]

Page 22: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

SNP specificity

• Large % fixed differences

• Impt when selecting SNPs

• Reflects est genetic divergence

SNP            freq
Elk only       0.04
Europe only    0.50
both           0.15
fixed dif      0.30
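The four categories in the table can be assigned from the alleles seen in each group. The sketch below assumes one plausible reading that the slides do not spell out: "Elk only" and "Europe only" mean the SNP segregates within that group alone, and "fixed dif" means each group carries a single, different allele. The function name is illustrative.

```python
def population_category(elk_alleles, europe_alleles):
    """Each argument is the set of alleles observed in that group at one SNP."""
    elk_poly = len(elk_alleles) > 1
    europe_poly = len(europe_alleles) > 1
    if elk_poly and europe_poly:
        return "both"
    if elk_poly:
        return "Elk only"
    if europe_poly:
        return "Europe only"
    if elk_alleles and europe_alleles and elk_alleles != europe_alleles:
        return "fixed dif"
    return "not a SNP"

print(population_category({"A"}, {"G"}))
print(population_category({"A"}, {"A", "G"}))
```

Fixed differences are variable across the whole sample but uninformative within either group, which is why they matter when selecting SNPs.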

Page 23: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Summary

• Sequenced 7 animals to ~1X coverage each

– selected to span genetic diversity

– ≥4X depth over 98% of the genome

– 100bp PER

• Used a mixture of assisted and de novo assembly

– Optimised to provide high quality sequence for SNP discovery

– Ordered and orientated contigs via related genome

• SNP calling routine

– corrects for “stacking” artifacts and repetitive regions

– traces animal origin of reads for high quality calls

• Results

– 4.1M SNPs, 2.4M class A

– Suitable to create a high density Illumina SNP array

– Cost ~1 cent/SNP identified

Page 24: SNP Discovery in Deer (Cervus elaphus) Using The Illumina Genome Analyser … · 2010. 2. 4. · SNP detection • SNP Detection Criteria –Stacking: collapsed where reads same start

Acknowledgements

• Cindy Lawley, Illumina

• Kimberly Gietzen

• Nan Leng

• Rudi Brauning, AgResearch

• Paul Fisher, AgResearch

• Jason Archer

• Matt Bixley

• Jamie Ward

• Geoff Nicoll, Landcorp
