SnpFilt: A pipeline for reference-free ... - Fudan...
SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes
A/Prof Ruiting Lan
University of New South Wales, Australia
Why interested in SNPs in bacteria?
• Genome sequencing for public health microbiology
– Outbreak investigations
– Disease transmission
Salmonella outbreak 1
• Outbreak 1 occurred in a residential college
• 16 cases of gastroenteritis among students and staff over two days
• MLVA profile 3-11-7-12-523
• Chocolate mousse as possible common food source
• 13 human isolates and 6 mousse isolates were sequenced
Outbreak 1
Genes (column headers, in slide order): stfC, STM0270, STM0328.s, allP, fepA, mrdB, mrdA, ybeV, gltL, ybiS, rpsA, rpoS, nlpD, barA, STM3073, arcB, mreB, yhhK, mtlR, rpoZ, rbsR, ilvD, rplL, yjdE, hfq, mpl, arcA
AA changes (in slide order): N -> D, Y -> C, K -> N, N -> D, L -> R, A -> V, H -> L, D -> G, E -> V, S -> I, H -> D, H -> R, A -> V, S -> R, R -> H, Q -> STOP, S -> A, V -> A, Q -> STOP, K -> T

Lab No.    Source  Epidemiological link  SNP profile ("." = same as consensus)
Consensus                                A A G A A T C G C C T A A C C A C A C G C A G C T C T A T C T
1687       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . G . . . . . .
1688       Human   Yes                   . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . .
1689       Human   Yes                   . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . .
1690       Human   Yes                   G . . G G A . . T T . . . . . G . . . . . . A . . T . C . T C
1691       Human   Yes                   . . . . . . . A . . C . . . . . . . . . . . . . . . . . . . .
1692       Human   Yes                   . . . . . . . . . . . . . T . . . . . . T . . . . . A . C . .
1693       Human   Yes                   . G . . . . . . . . . . T . . . . . . . . . . . . . . . . . .
1694       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . .
1695       Human   Yes                   . . T . . . T . . . . G . . . . T . T . . . . . . . . . . . .
1696       Human   Yes                   . . . . . . . . . . . . . . . . . C . . . C . . . . . . . . .
1697       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1698       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1699       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1700       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1701       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1702       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1703       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . .
1704       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1705       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: relationships among the Outbreak 1 isolates (1687 to 1705), labelled with SNP differences. Legend: Human - Epidemiologically confirmed; Food/contaminated source]
Octavia et al. JCM 2015 53:1063
Salmonella outbreak 2
• Outbreak 2 was linked to ready-to-eat food from the same bakery in metropolitan Sydney
• 27 cases
• MLVA type 3-9-8-12-523
• 11 isolates sequenced
– 9 isolates from patients with salmonellosis at the time of the outbreak and residing nearby
  • 4 confirmed as outbreak cases based on descriptive case series
  • 4 unrelated following PHU investigation
  • 1 unknown link – patient did not attend PHU interview
– 1 isolate from a boot swab and 1 from dirty egg shell rinse from the bakery
Outbreak 2
Gene names (in slide order): stiC, cydC, cysK, yhgH, dlhH
AA changes (in slide order): T -> I, H -> Y

Lab No.    Source                 Date of collection  Epidemiological link  SNP profile ("." = same as consensus)
Consensus                                                                   A A G A A A C
1837       Human                  27-Apr-12           No                    . . . . . G .
1838       Human                  26-Apr-12           No                    . . . . . . .
1839       Human                  26-Apr-12           No                    . . . . . . .
1840       Human                  24-Apr-12           Yes                   . . . . . . .
1841       Human                  24-Apr-12           No                    . . . . . . A
1842       Human                  24-Apr-12           Yes                   . . . . . . .
1843       Human                  23-Apr-12           Yes                   . . . . . . .
1844       Human                  23-Apr-12           Yes                   . . . . . . .
1845       Human                  10-Apr-12           Unknown               G G A C G . .
1846       Boot Swab Row 4        03-May-12           Yes                   . . . . . . .
1847       Dirty Egg Shell Rinse  03-May-12           Yes                   . . . . . . .
[Figure: relationships among the Outbreak 2 isolates (1837 to 1847), labelled with SNP differences. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source]
Octavia et al. JCM 2015 53:1063
Is it a SNP?
• Mapping based: filter reads by QUALITY -> BWA mapping -> SNPs -> filter SNPs
• Assembly based: filter reads by QUALITY -> de novo assembly (Velvet) -> progressiveMauve alignment -> SNPs
• Common SNPs
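A minimal sketch of the "common SNPs" step, assuming it means keeping the calls that both approaches report. The `(position, base)` representation and the example calls are hypothetical; the slide does not say how SNPs from the two branches are matched.

```python
def common_snps(mapping_calls, assembly_calls):
    """Split two SNP call sets into shared calls and calls unique to each approach."""
    mapping_calls = set(mapping_calls)
    assembly_calls = set(assembly_calls)
    shared = mapping_calls & assembly_calls
    mapping_only = mapping_calls - assembly_calls
    assembly_only = assembly_calls - mapping_calls
    return shared, mapping_only, assembly_only


# Hypothetical calls, each written as (position, alternative base).
mapping = {(1204, "G"), (58731, "T"), (90210, "A")}
assembly = {(1204, "G"), (58731, "T"), (433002, "C")}

shared, mapping_only, assembly_only = common_snps(mapping, assembly)
print(sorted(shared))
print(sorted(mapping_only), sorted(assembly_only))
```

Calls found by only one approach are the natural candidates for manual checking.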
Why reference free?
• The SNPs discovered depend on the reference used
• SNPs are culled in regions of high SNP density (Zhou et al. PLoS Genetics 2013)
SnpFilt workflow
NGS raw reads -> Assembly (SPAdes) -> Map reads to contigs (BWA) -> Apply filters -> SNPs
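The first stages of this workflow could be driven roughly as below. This is only a sketch: the file names, the helper `assembly_and_mapping_commands`, and the use of `samtools` for sorting and indexing are assumptions, and SnpFilt's actual invocation is not shown on the slides. The commands are built but not executed, and the filtering step is left out because it is specific to SnpFilt.

```python
from pathlib import Path


def assembly_and_mapping_commands(reads_1, reads_2, workdir):
    """Build (argv, stdout_file) pairs for assembly, then mapping reads back to the contigs."""
    work = Path(workdir)
    assembly_dir = work / "assembly"
    contigs = assembly_dir / "contigs.fasta"   # SPAdes writes contigs.fasta into its output directory
    sam = work / "reads_vs_contigs.sam"        # hypothetical file names
    bam = work / "reads_vs_contigs.sorted.bam"
    return [
        (["spades.py", "-1", reads_1, "-2", reads_2, "-o", str(assembly_dir)], None),
        (["bwa", "index", str(contigs)], None),
        # bwa mem writes SAM to stdout, so it is redirected to a file here.
        (["bwa", "mem", str(contigs), reads_1, reads_2], str(sam)),
        (["samtools", "sort", "-o", str(bam), str(sam)], None),
        (["samtools", "index", str(bam)], None),
    ]


commands = assembly_and_mapping_commands("isolate_R1.fastq.gz", "isolate_R2.fastq.gz", "run1")
for argv, stdout_file in commands:
    redirect = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(argv) + redirect)
```

Each pair could be passed to `subprocess.run(argv, stdout=..., check=True)` once the tools are installed.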
SNP calling performance metrics

                                Actual sequence
                                SNP                    Not SNP
SNP call algorithm   SNPs       True positives (TP)    False positives (FP)
                     Not SNP    False negatives (FN)   True negatives (TN)

Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
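The two formulas translate directly into code. The counts in the example are hypothetical and serve only to show the arithmetic.

```python
def precision(tp, fp):
    """Fraction of called SNPs that are real: TP / (TP + FP)."""
    called = tp + fp
    return tp / called if called else float("nan")


def sensitivity(tp, fn):
    """Fraction of real SNPs that were called: TP / (TP + FN)."""
    real = tp + fn
    return tp / real if real else float("nan")


# Hypothetical counts: 45 true positives, 5 false positives, 15 false negatives.
print(precision(45, 5))     # 45 / 50
print(sensitivity(45, 15))  # 45 / 60
```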
Choice of assemblers
• ABySS
• CABOG
• MIRA
• MaSuRCA
• SGA
• SOAPdenovo
• SPAdes
• Velvet
[Figure: (A) sensitivity and (B) precision of SNP calls, on a scale of 0.0 to 1.0, for ABySS, CABOG, MIRA, MaSuRCA, SGA, SOAPdenovo, SPAdes and Velvet. Assemblies from the GAGE-B study: M. abscessus (HiSeq), M. abscessus (MiSeq), R. sphaeroides (HiSeq), R. sphaeroides (MiSeq)]
SNP filters
• F1) Regions of excessive coverage
– The running mean of the read coverage across a window of 100 bases is greater than the median + 2 × the median deviation across the whole assembly
• F2) Low mapping quality
– Mapping quality < 58, for any site within a neighbourhood of 400 bases
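A sketch of how filter F1 might be computed from a per-base coverage vector. It assumes that "median deviation" means the median absolute deviation and that the 100-base window is centred on each site; neither detail is stated on the slide, and the function name is hypothetical.

```python
from statistics import median


def excessive_coverage(coverage, window=100, n_dev=2):
    """Flag sites whose running mean coverage exceeds median + n_dev * median absolute deviation."""
    n = len(coverage)
    med = median(coverage)
    mad = median(abs(c - med) for c in coverage)  # assumption: "median deviation" = MAD
    threshold = med + n_dev * mad

    # Prefix sums make every window mean an O(1) lookup.
    prefix = [0]
    for c in coverage:
        prefix.append(prefix[-1] + c)

    half = window // 2
    flags = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)  # window is truncated at the contig ends
        flags.append((prefix[hi] - prefix[lo]) / (hi - lo) > threshold)
    return flags


# Toy contig: 30x coverage with a 100-base region at 300x (e.g. a collapsed repeat).
coverage = [30] * 200 + [300] * 100 + [30] * 200
flags = excessive_coverage(coverage)
print(sum(flags), "sites flagged")
```

Because the running mean is used, sites next to the high-coverage region are flagged as well as the region itself.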
SNP filters
• F3) Low coverage
– < 20 reads, or 0 supporting reads in either the forward or reverse direction
• F4) Low forward coverage
– < 10 reads in the forward direction, for any site within a neighbourhood of 20 bases
• F5) High heterogeneity
– Supporting reads make up < 70% of the reads, for any site within a neighbourhood of 20 bases
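Filters F3 to F5 can be sketched per site as follows. The `Site` record, the function name, and the reading of "a neighbourhood of 20 bases" as 10 bases on either side are assumptions made for illustration; the thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Site:
    fwd: int          # reads covering the site on the forward strand
    rev: int          # reads covering the site on the reverse strand
    fwd_support: int  # forward reads supporting the called base
    rev_support: int  # reverse reads supporting the called base


def failed_filters(sites, i, radius=10):
    """Return the subset of {'F3', 'F4', 'F5'} that site i fails."""
    site = sites[i]
    failed = set()

    # F3: fewer than 20 reads, or no supporting read on one of the strands.
    if site.fwd + site.rev < 20 or site.fwd_support == 0 or site.rev_support == 0:
        failed.add("F3")

    # Assumption: the neighbourhood extends `radius` bases to each side of the site.
    neighbourhood = sites[max(0, i - radius): i + radius + 1]

    # F4: any neighbouring site with fewer than 10 forward reads.
    if any(s.fwd < 10 for s in neighbourhood):
        failed.add("F4")

    # F5: any neighbouring site where supporting reads are below 70% of all reads.
    if any(100 * (s.fwd_support + s.rev_support) < 70 * (s.fwd + s.rev) for s in neighbourhood):
        failed.add("F5")

    return failed


sites = [Site(30, 30, 30, 30) for _ in range(50)]
sites[25] = Site(30, 30, 15, 15)   # only 50% of reads support the called base
print(failed_filters(sites, 25), failed_filters(sites, 30), failed_filters(sites, 40))
```

A heterogeneous site therefore masks its neighbours too, which is the point of the neighbourhood rule.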
SNP filters
• F6) Low base quality
– At least 50 bases within a window of 2000 bases have base quality < q.thres, where q.thres is the mean - 3 standard deviations of quality scores across the whole assembly
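A sketch of filter F6, assuming one quality score per assembly position and a window centred on each site; the slide does not say how per-position quality is summarised or how the window is placed, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def low_quality_sites(quals, window=2000, min_low=50, n_sd=3):
    """Flag sites whose surrounding window holds at least min_low bases below q_thres."""
    # q_thres follows the slide: mean minus 3 standard deviations over the whole assembly.
    q_thres = mean(quals) - n_sd * pstdev(quals)

    # Prefix counts of low-quality bases give each window count in O(1).
    prefix = [0]
    for q in quals:
        prefix.append(prefix[-1] + (1 if q < q_thres else 0))

    n = len(quals)
    half = window // 2
    flags = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        flags.append(prefix[hi] - prefix[lo] >= min_low)
    return flags, q_thres


# Toy assembly: quality 38 everywhere except a run of 60 bases at quality 2.
quals = [38] * 5000 + [2] * 60 + [38] * 4940
flags, q_thres = low_quality_sites(quals)
print(round(q_thres, 2), sum(flags))
```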
Effect of filters: GAGE-B assemblies

                             M. abscessus   M. abscessus   R. sphaeroides   R. sphaeroides
                             (HiSeq)        (MiSeq)        (HiSeq)          (MiSeq)
Filter                       TN     FN      TN     FN      TN     FN        TN     FN
F6: low quality              24     9       8      0       107    1         0      0
F5: high heterogeneity       0      0       61     0       42     0         297    0
F4: low forward coverage     0      0       6      0       2      2         11     0
F3: low coverage             0      0       58     2       0      3         0      2
F2: low mapping quality      0      0       10     10      27     0         3      3
F1: excessive coverage       0      7       0      7       22     0         0      0
Effect of filters: known genomes

                             E. coli K12          M. tuberculosis F11   S. pneumoniae TIGR4
Filter                       Sites      Errors    Sites      Errors     Sites      Errors
F6: low quality              50026      29        244705     151        0          0
F5: high heterogeneity       3652       40        1706       38         91023      121
F4: low forward coverage     8621       0         1365       0          750679     1
F3: low coverage             33832      0         7565       0          33057      4
F2: low mapping quality      15062      4         36744      14         104219     12
F1: excessive coverage       469390     6         375689     0          10357      0
F0: reliable sites           4017250    0         3713565    0          1937574    0
Total assembly size          4694957              4386568               2963539
Genome size                  4641652              4424435               2163340
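As a rough reading of the known-genome table, the share of each assembly that survives as reliable sites (F0) can be computed directly from the two relevant rows. The numbers are copied from the table; the dictionary layout is only for illustration.

```python
# (reliable sites F0, total assembly size), taken from the table above.
known_genomes = {
    "E. coli K12": (4017250, 4694957),
    "M. tuberculosis F11": (3713565, 4386568),
    "S. pneumoniae TIGR4": (1937574, 2963539),
}

reliable_fraction = {
    name: reliable / total for name, (reliable, total) in known_genomes.items()
}

for name, fraction in reliable_fraction.items():
    print(f"{name}: {fraction:.1%} of the assembly is kept as reliable sites")
```

The F0 row reports zero errors for all three genomes, so this fraction is the part of each assembly where calls were error-free in this test.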
Coverage required for full SNP calls
[Figure: true positives (TPs, 0 to 40) against read depth (20 to 100) for MiSeq and NextSeq data]
Conclusions
• Reference-free assembly-based discovery of SNPs
• Unreliable regions are removed based on the quality and coverage of re-aligned reads
• At least 40-fold coverage is required for reliable and complete SNP calls
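To put the 40-fold figure in practical terms, the sketch below estimates how many read pairs that depth implies. It uses the E. coli K12 genome size from the known-genomes table and 2 x 250 bp reads as an example, and it assumes every base of every read contributes to coverage, which real runs will not quite achieve.

```python
import math


def mean_coverage(n_read_pairs, read_length, genome_size):
    """Average depth if every base of every paired read contributes to coverage."""
    return 2 * n_read_pairs * read_length / genome_size


def pairs_needed(target_coverage, read_length, genome_size):
    """Smallest number of read pairs that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / (2 * read_length))


genome_size = 4_641_652          # E. coli K12, from the known-genomes table
pairs = pairs_needed(40, 250, genome_size)
print(pairs, round(mean_coverage(pairs, 250, genome_size), 3))
```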
Acknowledgments
• Dr Carmen Chan
• Dr Sophie Octavia
• A/Prof Vitali Sintchenko
• Dr Qinning Wang
• Funding support from National Health and Medical Research Council of Australia
Is it a SNP?
• MiSeq (2 x 250 bp) sequencing
• Mapping reads to the reference genome (LT2): Burrows-Wheeler Aligner (BWA)
• Identification of SNPs: SAMtools
• De novo assembly: Velvet, SPAdes
• Alignment of contigs and scaffolds: progressiveMauve
• Manual verification of SNPs & nature of SNPs: custom scripts

Reads correction by QUAKE
• When coverage is low, correction is worthwhile
Three more outbreaks
[Figure: relationships among the isolates of Outbreak 3, Outbreak 4 and Outbreak 5 (isolates 1808 to 1836 and 1853), labelled with SNP differences. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source]
Octavia et al. JCM 2015 53:1063
Is it part of the outbreak? Cut-off based on SNP differences
Octavia et al. JCM 2015 53:1063
[Figure: relationships among the isolates of Outbreak 1, Outbreak 2, Outbreak 3, Outbreak 4 and Outbreak 5, labelled with SNP differences. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source]
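A cut-off on SNP differences can be sketched as below, using four of the Outbreak 2 profiles from the table earlier in the talk. The cut-off value of 4 is purely hypothetical, since the slides do not state one, and differences are counted against the consensus rather than between every pair of isolates.

```python
def snp_differences(profile, consensus):
    """Count positions where a profile differs from the consensus ('.' means identical)."""
    return sum(1 for base, ref in zip(profile, consensus) if base != "." and base != ref)


CONSENSUS = "AAGAAAC"            # Outbreak 2 consensus from the SNP table
profiles = {
    "1837": ".....G.",
    "1841": "......A",
    "1845": "GGACG..",
    "1846": ".......",
}

CUTOFF = 4                        # hypothetical value, chosen only for illustration

differences = {isolate: snp_differences(p, CONSENSUS) for isolate, p in profiles.items()}
within_cutoff = {isolate: d <= CUTOFF for isolate, d in differences.items()}

for isolate in profiles:
    print(isolate, differences[isolate], "within cut-off" if within_cutoff[isolate] else "outside cut-off")
```

With these inputs only isolate 1845, the patient with the unknown epidemiological link, falls outside the illustrative cut-off.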
SNP calling performance metrics
• True positives (true SNPs)
• True negatives (true not SNPs)
• False positives (called SNPs but not true SNPs)
• False negatives (SNPs but not called SNPs)
SNP detection
• Quality control is very important
– Filter reads
– Correct reads
– Filter SNPs
– Manual checking of SNPs
• Filter reads by QUALITY / correct reads
• BWA mapping / SNP site extraction (4352)
• Filter >= 20 reads coverage (3322)
• Divide SNPs into three categories
– Sites that contain >= 70% SNP supporting reads (945)
– Sites that contain 30% to < 70% SNP supporting reads (443)
– Sites that contain < 30% SNP supporting reads (1934)
• Discard
• 1.1% genuine SNPs thrown away
• 1.8% false positives
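The coverage filter and the three support categories can be sketched as a small classifier. The function and label names are hypothetical; the thresholds (20 reads, 70%, 30%) and the site counts come from the slide.

```python
def support_category(supporting_reads, total_reads, min_depth=20):
    """Place a candidate SNP site in one of the three support categories from the slide."""
    if total_reads < min_depth:
        return "below coverage filter"
    # Integer comparison avoids floating-point rounding exactly at the 70% and 30% boundaries.
    if 100 * supporting_reads >= 70 * total_reads:
        return ">=70% supporting reads"
    if 100 * supporting_reads >= 30 * total_reads:
        return "30% to <70% supporting reads"
    return "<30% supporting reads"


# Hypothetical sites given as (supporting reads, total reads).
for supporting, total in [(38, 40), (20, 40), (5, 40), (12, 15)]:
    print(supporting, total, support_category(supporting, total))

# The three category counts on the slide add up to the sites passing the coverage filter.
category_total = 945 + 443 + 1934
print(category_total)
```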