Workshop on Whole Genome Sequencing and Analysis, 19-21 Mar. 2018
Introduction to single-isolates, single CGE services
Learning objective:
After this lecture and exercise, you should be able to…
…describe how the CGE methods for identifying species, Multilocus Sequence Type, plasmids, and antimicrobial resistance genes work
…account for the difference between assembly+BLAST-based prediction methods and mapping-based methods
…use the above-mentioned methods as stand-alone services and interpret the results
Tools for species identification
Name of Service | Description | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | Under development | Published in PMID: 24574292 + Oksana Lukjancenko's PhD thesis
Reads2Type | Species identification on client computer | No longer supported | Published Feb 2014, PMID: 24574292
PMID: 24574292
Training data
◇ 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)
Evaluation data
◇ NCBI draft genomes
• 695 isolates from species that overlap with training set (151 species)
◇ SRA draft genomes
• 10,407 sets of short reads from Illumina (168 species)
• 10,407 draft genomes from Illumina data (168 species)
16S rRNA
• 16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for 40 years (Fox et al, Int. J. Syst. Bacteriol., 1977)
• Tremendous amounts of 16S rRNA sequence data are available in public databases
Concerns:
• Low resolution
• Some genomes contain several copies of the 16S rRNA gene with inter-gene variation
• The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome
Reference database
• 16S rRNA genes are extracted from genomes in training data using RNAmmer (Lagesen, NAR, 2007).
Method
• Input genomes are BLASTed against 16S rRNA genes in reference database.
• Best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
CGE implementation of 16S species identification
SpeciesFinder
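The best-hit selection described above can be sketched as follows. The slides do not say how the five criteria are combined, so the lexicographic ranking used here is an assumption, and the `Hit` class and all example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    species: str
    coverage: float   # fraction of the reference 16S rRNA gene covered
    identity: float   # % identity of the alignment
    bitscore: float
    mismatches: int
    gaps: int


def best_hit(hits):
    # Assumed ranking: prefer higher coverage, then % identity, then bitscore;
    # fewer mismatches and fewer gaps break any remaining ties.
    return max(
        hits,
        key=lambda h: (h.coverage, h.identity, h.bitscore, -h.mismatches, -h.gaps),
    )


# Hypothetical BLAST hits for one input genome
hits = [
    Hit("E. coli", 1.00, 99.5, 2800.0, 7, 0),
    Hit("S. enterica", 1.00, 97.1, 2650.0, 40, 2),
    Hit("K. pneumoniae", 0.95, 99.8, 2700.0, 3, 0),
]
best = best_hit(hits)
print(best.species)
```

Here the K. pneumoniae hit has the highest % identity but loses on coverage, which illustrates why a single criterion is not enough.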
• Genomes in training data are chopped into 16mers:
A T G A C G T A T G A C T G A T G G C G T A G T A G T C C
• Downsampling
• Only 16mers with specific prefix (ATGAC) are kept
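A minimal sketch of the chopping and downsampling steps, applied to the example sequence from the slide. The function names are illustrative and not part of KmerFinder.

```python
def kmers(seq, k=16):
    # All overlapping k-mers, one starting at every position
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def downsample(kmer_list, prefix="ATGAC"):
    # Keep only the k-mers that start with the chosen prefix
    return {km for km in kmer_list if km.startswith(prefix)}


genome = "ATGACGTATGACTGATGGCGTAGTAGTCC"
all_16mers = kmers(genome)
kept = downsample(all_16mers)
print(len(all_16mers), sorted(kept))
```

The 29-base example yields 14 overlapping 16mers, of which two start with ATGAC and survive the downsampling.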
KmerFinder: Using (almost) all information in the WGS data
Reference db: bacteria of known species (template)
Bact1 -> E. coli
Bact2 -> S. enterica
Bact3 -> K. pneumoniae
Bact4 -> S. aureus
Query: bacteria of unknown species (?????)
Prediction: The query bacterium is S. aureus
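The idea of predicting the species from shared kmers can be sketched like this. It is a toy illustration, not the KmerFinder implementation: the sequences are invented, k is set to 4 so the example stays readable (the service uses 16mers), and no prefix downsampling is applied.

```python
def kmer_set(seq, k):
    # Unique overlapping k-mers in a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def predict_species(query_kmers, reference_db):
    # Score each template by the number of unique query k-mers it shares
    scores = {sp: len(query_kmers & ref) for sp, ref in reference_db.items()}
    return max(scores, key=scores.get), scores


K = 4  # toy value for illustration only
reference_db = {
    "E. coli": kmer_set("ACGTACGTTTGA", K),
    "S. aureus": kmer_set("GGCATTCAGGCA", K),
}
query = kmer_set("CATTCAGG", K)
prediction, scores = predict_species(query, reference_db)
print(prediction, scores)
```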
Three other methods were evaluated
TaxonomyFinder: Performs its predictions based on the presence of protein profiles that are specific to particular taxonomic groups.
Reads2Type: Performs its predictions based on species-specific 50mers in the 16S rRNA or gyrB gene (for Enterobacteriaceae).
rMLST: Performs its predictions based on up to 53 ribosomal genes. Implemented in collaboration with Keith Jolley from Oxford (MLST).
Results
(16S rRNA)
Summary of taxonomy benchmark study
• KmerFinder had the highest accuracy and was the fastest method.
• SpeciesFinder (16S rRNA-based) had the lowest accuracy.
• Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.
“Standard” when aiming at determining the species of one isolate
“Winner takes it all” if you have a mixed sample or suspect you have a mixed sample
KmerFinder statistics
Query coverage: $\frac{S}{q_u}$
S: Score (total number of unique kmers in query sequence that match kmers in reference (template) sequence)
$q_u$: Total number of unique kmers in query sequence
Template coverage: $\frac{S}{l_u}$
S: Score (total number of unique kmers in query sequence that match kmers in template sequence)
$l_u$: Total number of unique kmers in reference (template) sequence
[Figure: overlap between kmers in query ($q_u$) and kmers in reference (template) genome ($l_u$); the shared kmers give the score S]
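As a small worked example of the two coverage statistics, the sketch below computes S, query coverage and template coverage from two sets of kmers. The kmer labels are placeholders standing in for real 16mers.

```python
def coverage_stats(query_kmers, template_kmers):
    q = set(query_kmers)      # unique kmers in query, size q_u
    t = set(template_kmers)   # unique kmers in template, size l_u
    score = len(q & t)        # S: unique query kmers found in the template
    return score, score / len(q), score / len(t)


# Placeholder kmer labels instead of real 16mers
query = ["k1", "k2", "k3", "k4", "k2"]
template = ["k2", "k3", "k4", "k5", "k6", "k7", "k8", "k9"]
score, query_cov, template_cov = coverage_stats(query, template)
print(score, query_cov, template_cov)
```

With 4 unique query kmers, 8 unique template kmers and 3 shared, query coverage is 3/4 and template coverage is 3/8.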
More KmerFinder statistics
Depth (Depth of Coverage). Only relevant when uploading raw reads.
Average number of times each position is covered by a kmer.
Depth: $\frac{N \cdot L}{G}$
N = total no. of kmers that match the template (not the same as score)
L = 16 (length of kmer)
G = Total no. of unique kmers in template
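The depth formula can be checked with a few lines of code. This sketch follows the definitions on the slide literally; the kmer labels are placeholders, and the point is that N counts every matching kmer from the reads, including repeats, whereas the score counts unique kmers only.

```python
def depth(read_kmers, template_kmers, k=16):
    template = set(template_kmers)                      # G = len(template)
    n = sum(1 for km in read_kmers if km in template)   # N, repeats included
    return n * k / len(template)


# Placeholder labels; repeated entries mimic kmers seen in several reads
read_kmers = ["k1", "k1", "k2", "k3", "k3", "k3", "x"]
template_kmers = ["k1", "k2", "k3", "k4"]

d = depth(read_kmers, template_kmers)
unique_score = len(set(read_kmers) & set(template_kmers))
print(d, unique_score)
```

Here N = 6 while the score is only 3, so the depth is 6 · 16 / 4 = 24.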
KmerFinder output standard scoring method
Query (input): Raw reads from urine sample are split into 16mers
Only unique 16mers are kept
Template/reference database
E. coli
P. mirabilis
S. aureus
In the “total” values the kmers are allowed to match more than one template
“Winner takes it all”
[Screenshot: KmerFinder output; highlighted values 4493 and 3320, Depth column]
Tools for further typing
Name of Service | Description | Publication
MLST | Multilocus sequence typing | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids (replicons) in Enterobacteriaceae (and Gram-positives) | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | Published Apr 2014, PMID: 24777092
Multilocus Sequence Typing (MLST)
• First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)
• The nucleotide sequences of internal regions of approx. 7 housekeeping genes are determined by PCR followed by Sanger sequencing
• Different alleles are each assigned an arbitrary number
• The unique combination of alleles is the sequence type (ST)
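The step from allele numbers to a sequence type is a table lookup, sketched below. The locus names, allele numbers and ST numbers are invented for illustration and do not come from any real scheme at pubmlst.org.

```python
# Hypothetical 7-locus scheme; names and numbers are invented
LOCI = ("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG")

PROFILES = {
    (1, 3, 2, 1, 4, 1, 2): 10,
    (1, 3, 2, 1, 4, 1, 5): 11,
}


def sequence_type(alleles):
    # alleles: dict mapping locus name -> allele number found in the isolate
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)  # None means the combination is not in the table


isolate = dict(zip(LOCI, (1, 3, 2, 1, 4, 1, 5)))
novel = dict(zip(LOCI, (9, 3, 2, 1, 4, 1, 5)))
print(sequence_type(isolate), sequence_type(novel))
```

A single differing allele is enough to give a different ST, or no ST at all if the combination has not been registered.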
Using WGS data for MLST
Download of the MLST data from pubmlst.org
The BLAST-based MLST software identifies the MLST alleles within the genome, which are afterwards translated to the corresponding ST
Supported input types: Assembled genome, 454 – single end reads, 454 – paired end reads, Illumina – single end reads, Illumina – paired end reads, Ion Torrent, SOLiD – single end reads, SOLiD – mate pair reads
Available MLST schemes: Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales,
Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes,
Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis
[Screenshots: MLST extended output for an allele with mismatches and for a truncated gene]
PlasmidFinder - identification of plasmid replicons
The BLAST-based PlasmidFinder software identifies replicons in the input data
Ongoing update of the PlasmidFinder database
Enterobacteriaceae
Gram positives
Select the threshold for minimum % Identity and length coverage
Select the database
% Identity: The percentage of nucleotides in the matching locus of the input genome that are identical to the nucleotides of the plasmid replicon in the PlasmidFinder database
Minimum length: Minimum percent of the length of the plasmid replicon that has a matching region in the input genome
[Figures: input genome aligned against plasmid replicon, one per definition]
Remember - The PlasmidFinder database contains replicons, not entire plasmids.
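The two thresholds can be illustrated with a short calculation on an ungapped alignment. This is a sketch under simplifying assumptions: real BLAST alignments can contain gaps, and the sequences, replicon length and threshold values below are invented for the example.

```python
def percent_identity(genome_locus, replicon_region):
    # Both strings are the aligned, ungapped region and must be equally long
    if len(genome_locus) != len(replicon_region):
        raise ValueError("aligned regions must have the same length")
    matches = sum(a == b for a, b in zip(genome_locus, replicon_region))
    return 100.0 * matches / len(replicon_region)


def length_coverage(aligned_length, replicon_length):
    # Percent of the full replicon that has a matching region in the genome
    return 100.0 * aligned_length / replicon_length


genome_locus = "ACGTACGTAC"
replicon_region = "ACGTACGTAA"   # aligned part of a hypothetical replicon
replicon_length = 20             # full length of that replicon

identity = percent_identity(genome_locus, replicon_region)
coverage = length_coverage(len(replicon_region), replicon_length)

# Example thresholds chosen for illustration only
keep = identity >= 80.0 and coverage >= 60.0
print(identity, coverage, keep)
```

The hit passes the identity threshold (90%) but covers only half of the replicon, so it is discarded.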
pMLST: plasmid MLST for IncF, IncN, IncHI1, IncHI2, and IncI1 plasmids
ResFinder - identification of acquired resistance genes
The BLAST-based ResFinder software identifies the acquired resistance genes within the genome, along with accession numbers and theoretical resistance phenotype
Tetracycline
Beta-lactam
Colistin
Ongoing update of the ResFinder database
ResFinder output
◇ 200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
◇ ResFinder, 98 %ID, 60% length coverage
◇ Phenotypic tests, 3,051 in total
• 482 Resistant
• 2,569 Susceptible
=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli
• Allows for species-specific identification of point mutations in chromosomal genes causing antimicrobial resistance
• Uses BLAST for identifying relevant genes in input genomes, which are then screened for mutations known to cause resistance
Campylobacter, E. coli, Salmonella, N. gonorrhoeae, M. tuberculosis
Can also report mutations not specifically known to cause resistance (unknown mutations)
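The screening step can be sketched as a position-by-position comparison between a reference gene and the gene found in the input genome, split into known and unknown mutations. The sequences and the table of known mutations are invented, the comparison is at nucleotide level, and the two sequences are assumed to be already aligned without gaps.

```python
def screen_mutations(reference, found, known_mutations):
    # Returns (known, unknown) lists of (position, ref_base, found_base),
    # with 1-based positions
    known, unknown = [], []
    for pos, (ref_base, found_base) in enumerate(zip(reference, found), start=1):
        if ref_base != found_base:
            change = (pos, ref_base, found_base)
            if change in known_mutations:
                known.append(change)
            else:
                unknown.append(change)
    return known, unknown


# Invented example data
reference_gene = "ATGGCTTCA"
gene_in_isolate = "ATGGTTTCG"
known_resistance_mutations = {(5, "C", "T")}

known, unknown = screen_mutations(
    reference_gene, gene_in_isolate, known_resistance_mutations
)
print(known, unknown)
```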
Overview of chromosomal point mutations included for E. coli
Assembly+BLAST-based methods
Raw reads -> Assembly -> Draft genome
Database w. genes of interest
• The old, trusty method
• Slow
• Genes might be missed if at contig ends (assembly dependent)
Mapping-based methods
Raw reads
Mapping (BWA/Kmers)
Database w. genes of interest
• Initially used in SRST2 (Inouye et al., 2014, Genome Med: 6:90)
• Fast
• More sensitive -> higher performance
• Too sensitive -> false positives caused by noise (contamination)
KmerResistance - Identification of acquired resistance genes
https://cge.cbs.dtu.dk/services/KmerResistance/
• Examines the number of co-occurring kmers between input data and genes in ResFinder database
• Uses the “winner takes it all” strategy
I) Kmers are only assigned to the gene with the highest number of kmer matches
II) Kmers matching this best hit are removed
III) Steps I and II are repeated until no kmers remain
• To avoid false positives, a threshold for min. depth and breadth of coverage is introduced
• The threshold varies according to the depth and breadth of the entire genome as predicted by KmerFinder
Clausen et al. (2016). J Antimicrob Chemother. 71(9):2484-8
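Steps I to III of the “winner takes it all” strategy can be sketched as a simple loop. This is an illustration of the idea only, not the KmerResistance code: the gene names and kmer labels are placeholders, the loop also stops when no gene matches any remaining kmer, and the depth and breadth thresholds are left out.

```python
def winner_takes_it_all(query_kmers, gene_db):
    remaining = set(query_kmers)
    results = []
    while remaining:
        # Step I: find the gene with the most matches among remaining kmers
        scores = {gene: len(remaining & kms) for gene, kms in gene_db.items()}
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # leftover kmers match nothing in the database
        results.append((best, scores[best]))
        # Step II: remove the kmers matching this best hit
        remaining -= gene_db[best]
        # Step III: repeat with what is left
    return results


# Placeholder gene names and kmer labels
gene_db = {
    "geneA": {"k1", "k2", "k3", "k4"},
    "geneB": {"k3", "k4", "k5"},
    "geneC": {"k9"},
}
query = {"k1", "k2", "k3", "k4", "k5"}
hits = winner_takes_it_all(query, gene_db)
print(hits)
```

Because geneA claims the shared kmers k3 and k4, geneB is left with a single kmer, which is the kind of weak hit a depth and breadth threshold would filter out.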
Output from KmerResistance
Output from KmerResistance, continued
Handling sequence data? Watch out!
FASTA file in Word
This should be fine…
Oh no! This won't work…
Use “pure” text editors
Example: • Sublime Text
Save files in “txt” format.
What your data actually looks like!
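One way to catch a file that went through a word processor is to check that it is plain ASCII text in FASTA layout before uploading it. The function below is a hypothetical helper, not part of any CGE service, and it assumes nucleotide sequences using only A, C, G, T and N.

```python
def looks_like_plain_fasta(data: bytes) -> bool:
    # Word-processor files and "smart" quotes usually fail one of these checks
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    allowed = set("ACGTNacgtn")  # assumption: nucleotide data only
    return all(ln.startswith(">") or set(ln) <= allowed for ln in lines)


good = b">seq1\nACGTAC\nGGTT\n"
docx_like = b"PK\x03\x04 binary word processor content"
smart_quote = b">seq1\nACGT\xe2\x80\x9d\n"
print(
    looks_like_plain_fasta(good),
    looks_like_plain_fasta(docx_like),
    looks_like_plain_fasta(smart_quote),
)
```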
A word on browsers
Browsers likely to work with no problems:
Chrome, Firefox, (Safari)
Browsers we don’t like: Explorer, Edge
And now… www.goseqit.com/exercise1
Exercise data
Also available via Dropbox: https://www.dropbox.com/sh/09r0kab7hzeb9mv/AAAHWHvUuad3pG2gPq9llc7Za?dl=0