DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability

Damon P. Little

Cullman Program for Molecular Systematics Studies
The New York Botanical Garden, Bronx, New York

test data sets (Little and Stevenson 2007)

gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)

1,037 sequences

413 species

71 genera

gymnosperm plastid encoded maturase K (matK)

522 sequences

334 species

75 genera

…alignment

locus     sequences         median unaligned length (IQR)   aligned length
nrITS 2   all               137 (108–250) bp                8,733 bp
nrITS 2   one per species   196 (115–260) bp                6,778 bp
matK      all               1,561 (1,412–1,661) bp          3,975 bp
matK      one per species   1,601 (1,530–1,661) bp          3,906 bp

pairwise divergence

locus     sequences         median    interquartile range   zero comparisons
nrITS 2   all               30.99%    26.53–34.48%          0.09%
nrITS 2   one per species   29.39%    25.75–33.30%          0.21%
matK      all               20.39%    5.95–23.30%           0.54%
matK      one per species   21.38%    8.13–23.89%           0.42%
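
Summary statistics of this kind can be computed in outline as follows. This is a minimal sketch that assumes divergence is the uncorrected proportion of differing positions over aligned columns where neither sequence has a gap; the slides do not say which distance was used. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import median, quantiles

def p_distance(a, b):
    """Proportion of differing positions, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None  # no overlapping positions to compare
    return sum(x != y for x, y in pairs) / len(pairs)

def summarize(aligned):
    """Median, interquartile range, and share of zero-distance comparisons."""
    dists = [d for a, b in combinations(aligned, 2)
             if (d := p_distance(a, b)) is not None]
    q1, _, q3 = quantiles(dists, n=4)
    zero_share = sum(d == 0 for d in dists) / len(dists)
    return median(dists), (q1, q3), zero_share

# Toy aligned sequences (hypothetical data, not from the test sets).
toy = ["ACGT-ACGTA", "ACGTTACGTA", "ACCTTACGAA", "TCCTTACGAA"]
med, iqr, zero = summarize(toy)
```

A zero comparison here is any pair of sequences with no observed differences, which is the case an identification method cannot resolve from sequence data alone.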

measuring precision and accuracy
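
One way to make the two measures concrete is sketched below. It assumes, and this is an assumption rather than something stated in the slides, that an identification is precise when the method returns exactly one species and accurate to species when the returned set contains the true species. The function name and the example outcomes are hypothetical.

```python
def precision_and_accuracy(results):
    """results: list of (true_species, set of species returned by the method)."""
    n = len(results)
    # Assumed definition: precise means a single, unambiguous answer.
    precise = sum(len(returned) == 1 for _, returned in results)
    # Assumed definition: accurate means the true species is among the answers.
    accurate = sum(truth in returned for truth, returned in results)
    return precise / n, accurate / n

# Hypothetical query outcomes.
outcomes = [
    ("Pinus strobus", {"Pinus strobus"}),                     # precise and accurate
    ("Pinus strobus", {"Pinus strobus", "Pinus monticola"}),  # accurate, not precise
    ("Abies alba", {"Abies cephalonica"}),                    # precise, not accurate
    ("Zamia pumila", set()),                                  # no identification
]
precision, accuracy = precision_and_accuracy(outcomes)
```

Under these assumed definitions the two measures are independent: a method can return a single answer that is wrong, or a correct answer buried in an ambiguous set.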

precision

method                    nrITS 2       matK
parsimony ratchet         58% (13%)     71% (41%)
SPR search                60% (11%)     70% (41%)
neighbor joining          65% (8%)      44% (23%)
BLAST                     94% (81%)     99% (67%)
BLAT                      94% (82%)     99% (69%)
megaBLAST                 94% (80%)     99% (61%)
BLAST/parsimony ratchet   86% (74%)     77% (55%)
BLAST/SPR                 87% (73%)     76% (53%)
BLAST/neighbor joining    93% (71%)     95% (56%)
DNA–BAR                   98% (89%)     100% (79%)
DOME ID                   80% (80%)     60% (60%)
ATIM                      100% (83%)    100% (67%)

accuracy to species

method                    nrITS 2       matK
parsimony ratchet         67% (46%)     77% (60%)
SPR search                69% (47%)     78% (58%)
neighbor joining          68% (42%)     75% (52%)
BLAST                     67% (63%)     84% (68%)
BLAT                      66% (62%)     82% (67%)
megaBLAST                 72% (68%)     84% (64%)
BLAST/parsimony ratchet   78% (67%)     80% (60%)
BLAST/SPR                 79% (67%)     78% (61%)
BLAST/neighbor joining    80% (64%)     86% (56%)
DNA–BAR                   65% (62%)     73% (62%)
DOME ID                   67% (66%)     50% (50%)
ATIM                      83% (71%)     87% (53%)

lessons learned

“global” alignments do not work

“fuzzy” matches are not precise

autoapomorphies (unique characters) work... but are not always present

precision

method      nrITS 2        matK
DOME ID     80% (80%)      60% (60%)
DOME ID*    100% (100%)    100% (100%)

accuracy to species

method      nrITS 2        matK
DOME ID     67% (66%)      50% (50%)
DOME ID*    76% (75%)      90% (90%)
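
The idea of an autoapomorphy can be illustrated with a short sketch: a character state that is fixed within one species and absent from every other species. The code below is a hypothetical illustration on aligned toy sequences, not the procedure used by DOME ID or any other method named above; the function name and data are assumptions.

```python
from collections import defaultdict

def autoapomorphies(aligned):
    """aligned: dict mapping species -> list of equal-length aligned sequences.
    Returns species -> list of (column, state) fixed in that species and absent elsewhere."""
    length = len(next(iter(aligned.values()))[0])
    found = defaultdict(list)
    for col in range(length):
        # States observed at this column, per species.
        states = {sp: {s[col] for s in seqs} for sp, seqs in aligned.items()}
        for sp, own in states.items():
            if len(own) != 1:
                continue  # variable within the species, so not diagnostic
            state = next(iter(own))
            if state == "-":
                continue  # ignore gaps
            if all(state not in other for o, other in states.items() if o != sp):
                found[sp].append((col, state))
    return dict(found)

# Toy data (hypothetical): only species A has a unique, fixed state.
toy = {
    "species A": ["ACGTA", "ACGTA"],
    "species B": ["ACCTA", "ACCTA"],
    "species C": ["ACCTA", "ACCTT"],
}
result = autoapomorphies(toy)
```

In the toy data, species B and C share every fixed state, so neither has an autoapomorphy and neither could be identified by unique characters alone.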

some sequences are simply unidentifiable

...remaining (insoluble) problems

identical sequences for multiple terminals

shared alleles between terminals

use allele frequency as a predictor?

desirable methodologies and properties of Sequence IDentification Engines (SIDEs)

avoid global alignment by comparing short segments: pseudo–alignment

use exact matches

use autoapomorphies where possible

...but allow the use of other characters too

context/text DNA recoding

characters are defined by flanking context

=> pretext and postext

permit “alignment–free” comparisons

size of, and separation between, pretext and postext must be arbitrarily delimited

possible states (text) are limited by the length of the text, that is, by the proximity of the contexts

terminals can be individual sequences or composites representing taxa
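
The recoding described above can be sketched as follows. This is a minimal illustration, assuming fixed context and text lengths (the `context_len` and `text_len` parameters are hypothetical, as are the function names and toy sequences): each character is a (pretext, postext) pair, and its state is the text found between them.

```python
from collections import defaultdict

def recode(seq, context_len=4, text_len=2):
    """Yield ((pretext, postext), text) for every window in the sequence."""
    window = 2 * context_len + text_len
    for i in range(len(seq) - window + 1):
        pretext = seq[i:i + context_len]
        text = seq[i + context_len:i + context_len + text_len]
        postext = seq[i + context_len + text_len:i + window]
        yield (pretext, postext), text

def build_database(references, **sizes):
    """Map each character (pretext, postext) to its texts and the terminals showing them."""
    db = defaultdict(lambda: defaultdict(set))
    for terminal, seq in references.items():
        for character, text in recode(seq, **sizes):
            db[character][text].add(terminal)
    return db

# Toy references (hypothetical); terminals could equally be composites of several sequences.
refs = {"taxon A": "ACGTACGGTTCA", "taxon B": "ACGTTTGGTTCA"}
db = build_database(refs, context_len=4, text_len=2)
```

Because characters are keyed by their flanking context rather than by column position, the two references are compared without ever being aligned to each other.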

querying text/context database

find pretext/text/postext in the query sequence and match to references

score terminals based on the number of matches

final score can be raw or based on a weighting function
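
A query against such a database might look like the sketch below. It assumes exact matching of pretext, text, and postext, fixed window sizes, and a small hand-written index; the `INDEX` contents, the `raw_scores` name, and the query sequence are all hypothetical.

```python
from collections import Counter

# Hypothetical reference index: (pretext, postext) -> text -> terminals showing that text.
INDEX = {
    ("ACGT", "GGTT"): {"AC": {"taxon A"}, "TT": {"taxon B"}},
    ("GTAC", "TTCA"): {"GG": {"taxon A"}},
    ("GTTT", "TTCA"): {"GG": {"taxon B"}},
}

def raw_scores(query, index, context_len=4, text_len=2):
    """Count, per terminal, the characters whose context and text both match the query exactly."""
    scores = Counter()
    window = 2 * context_len + text_len
    for i in range(len(query) - window + 1):
        pretext = query[i:i + context_len]
        text = query[i + context_len:i + context_len + text_len]
        postext = query[i + context_len + text_len:i + window]
        # Unknown contexts or unseen texts simply contribute nothing.
        for terminal in index.get((pretext, postext), {}).get(text, ()):
            scores[terminal] += 1
    return scores

scores = raw_scores("TTACGTACGGTTCAGG", INDEX)
best = scores.most_common(1)
```

The raw score is the equal-weights case; a weighting function would replace the `+= 1` with a per-character weight.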

possible weighting functions

equal weights (raw score)

number of distinct texts

=> up weights more variable characters

1/(number of distinct texts)

=> down weights more variable characters

(number of texts)/(number of scores)
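
The first three weighting functions can be sketched as below. The scheme names, function names, and example counts are hypothetical. The fourth function, (number of texts)/(number of scores), is left out because the slide does not define its terms precisely enough to implement without guessing.

```python
def weight(n_distinct_texts, scheme):
    """Weight for one character, given how many distinct texts the references show for it."""
    if scheme == "equal":
        return 1.0
    if scheme == "distinct":   # up-weights more variable characters
        return float(n_distinct_texts)
    if scheme == "inverse":    # down-weights more variable characters
        return 1.0 / n_distinct_texts
    raise ValueError(f"unknown scheme: {scheme}")

def weighted_score(matched, distinct_counts, scheme="equal"):
    """Sum the weights of the characters on which a terminal matched the query."""
    return sum(weight(distinct_counts[c], scheme) for c in matched)

# Hypothetical example: a terminal matched the query on two characters,
# one with 2 distinct texts among the references and one with 4.
distinct_counts = {"char1": 2, "char2": 4}
matched = ["char1", "char2"]
totals = {s: weighted_score(matched, distinct_counts, s)
          for s in ("equal", "distinct", "inverse")}
```

With these toy counts the same two matches score 2.0, 6.0, or 0.75 depending on the scheme, which shows how strongly the choice of weighting can reorder terminals.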

precision

method                    nrITS 2       matK
parsimony ratchet         58% (13%)     71% (41%)
SPR search                60% (11%)     70% (41%)
neighbor joining          65% (8%)      44% (23%)
BLAST                     94% (81%)     99% (67%)
BLAT                      94% (82%)     99% (69%)
megaBLAST                 94% (80%)     99% (61%)
BLAST/parsimony ratchet   86% (74%)     77% (55%)
BLAST/SPR                 87% (73%)     76% (53%)
BLAST/neighbor joining    93% (71%)     95% (56%)
DNA–BAR                   98% (89%)     100% (79%)
DOME ID                   80% (80%)     60% (60%)
ATIM                      100% (83%)    100% (67%)
BRONX 0                   91% (90%)     88% (84%)
BRONX 1                   96% (86%)     98% (79%)

accuracy to species

method                    nrITS 2       matK
parsimony ratchet         67% (46%)     77% (60%)
SPR search                69% (47%)     78% (58%)
neighbor joining          68% (42%)     75% (52%)
BLAST                     67% (63%)     84% (68%)
BLAT                      66% (62%)     82% (67%)
megaBLAST                 72% (68%)     84% (64%)
BLAST/parsimony ratchet   78% (67%)     80% (60%)
BLAST/SPR                 79% (67%)     78% (61%)
BLAST/neighbor joining    80% (64%)     86% (56%)
DNA–BAR                   65% (62%)     73% (62%)
DOME ID                   67% (66%)     50% (50%)
ATIM                      83% (71%)     87% (53%)
BRONX 0                   59% (58%)     76% (71%)
BRONX 1                   72% (67%)     92% (75%)

BRONX conclusions

BRONX is more precise than existing algorithms

BRONX is sometimes more accurate than existing algorithms

BRONX is an incremental improvement

future directions

improve the scoring function in BRONX

dynamically size context/text

benchmark additional datasets for all methods

incorporate context/text recoding into a scalable version of the ATIM algorithm

acknowledgments

Kenneth Cameron

Santiago Madriñán

Christian Schulz

Dennis Stevenson