Ilene Mizrachi - Opening Plenary

National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA
BARCODE SEQUENCE DATAFLOW INTO GENBANK
Ilene Mizrachi
November 30, 2011
Fourth International Barcode of Life Conference

description

Barcode Sequence Dataflow into GenBank

Transcript of Ilene Mizrachi - Opening Plenary

Page 1: Ilene Mizrachi - Opening Plenary

National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA

BARCODE SEQUENCE DATAFLOW INTO GENBANK

Ilene Mizrachi

November 30, 2011

Fourth International Barcode of Life Conference

Page 2: Ilene Mizrachi - Opening Plenary


Barcode Project - 2003 and beyond

• Barcode of Life project was initiated in 2003
• INSDC would be the repository for raw and assembled sequence data
• INSDC adopts new source fields to accommodate Barcode metadata requirements
• Barcode of Life Database (BOLD) established as a community workbench and sequencing center

Page 3: Ilene Mizrachi - Opening Plenary


What is a Barcode?

• A global reference library of DNA barcode sequences that is integrated with other systems of biodiversity information (e.g., databases of specimens, species, biogeographic information).
• Mechanism to link DNA sequences to vouchered specimens and valid species names.
• A reserved BARCODE keyword was adopted for data that met strict barcode standards.

Page 4: Ilene Mizrachi - Opening Plenary


Barcode Standard

• Formally described species or a provisional label for an unpublished species
• Voucher specimen identifier, preferably in a biorepository using a structured field
• Country-Code using the controlled vocabulary used by GenBank
• Sequence from a gene region specified by the CBOL
    • COI for animals
    • matK and rbcL for plants
    • ITS for fungi
• Contain at least 75% contiguous, high quality bases from within the approved region
• Electropherogram trace files for bidirectional sequencing runs
• Sequences of all forward and reverse primers
• Strongly recommended data elements
    • GPS coordinates
    • Name of the identifier
    • Name of the collector
    • Date of collection
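The required elements of the standard can be sketched as a simple completeness check on a record. This is only an illustration, not GenBank's compliance tool: the field names (`species`, `voucher`, `country`, `group`, `marker`, `primers`, `has_traces`), the `missing_elements` helper, and the example values are all hypothetical, and the 75% sequence-quality criterion is left out.

```python
# Hypothetical sketch of a barcode-standard completeness check.
# Field names are assumptions for illustration, not GenBank's schema.

# Approved gene regions per organism group, as listed in the standard.
APPROVED_MARKERS = {
    "animal": {"COI"},
    "plant": {"matK", "rbcL"},
    "fungus": {"ITS"},
}

REQUIRED_FIELDS = ("species", "voucher", "country", "marker", "primers")


def missing_elements(record):
    """Return a list of problems that keep a record from being compliant."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    marker = record.get("marker")
    approved = APPROVED_MARKERS.get(record.get("group"), set())
    if marker and marker not in approved:
        problems.append(f"marker {marker} not approved for {record.get('group')}")
    if not record.get("has_traces"):
        problems.append("missing trace files")
    return problems


# Made-up example record; the voucher and primer strings are placeholders.
record = {
    "species": "Danaus plexippus",
    "voucher": "EXAMPLE:Ent:12345",
    "country": "Canada",
    "group": "animal",
    "marker": "COI",
    "primers": ["ACGTACGTAC", "TGCATGCATG"],
    "has_traces": True,
}
print(missing_elements(record))
```

Returning a list of problems rather than a single pass/fail flag lets a submitter see every missing element at once.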

Page 5: Ilene Mizrachi - Opening Plenary

Compliant Barcode Record

Page 6: Ilene Mizrachi - Opening Plenary


Barcode records in GenBank

Page 7: Ilene Mizrachi - Opening Plenary

Life of an iBOL Record

Page 8: Ilene Mizrachi - Opening Plenary


Submissions from BOLD

Page 9: Ilene Mizrachi - Opening Plenary


Data Sharing Works

Page 10: Ilene Mizrachi - Opening Plenary

http://www.ncbi.nlm.nih.gov/WebSub/?tool=barcode

Page 11: Ilene Mizrachi - Opening Plenary


QA checks at GenBank

To ensure that the sequence data is of high quality, the following checks are run:

• Barcode data element compliance
• Consistency checks such as:
    • reported latitude-longitude falls within cited country
    • collection date has already occurred
• Sequence quality checks
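The two consistency checks named above might look roughly like the following. This is a sketch under stated assumptions, not GenBank's implementation: the `COUNTRY_BOXES` table, its rough bounding-box values, and both function names are hypothetical, and a real check would use proper country boundaries rather than rectangles.

```python
from datetime import date

# Hypothetical bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Rough illustrative values only, not authoritative boundaries.
COUNTRY_BOXES = {
    "Iceland": (63.0, 67.0, -25.0, -13.0),
}


def lat_lon_in_country(lat, lon, country, boxes=COUNTRY_BOXES):
    """Return True/False for a known country, None if it cannot be checked."""
    box = boxes.get(country)
    if box is None:
        return None
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def collection_date_has_occurred(collected, today=None):
    """A collection date in the future is flagged as inconsistent."""
    today = today or date.today()
    return collected <= today


print(lat_lon_in_country(64.1, -21.9, "Iceland"))
print(collection_date_has_occurred(date(2011, 11, 30), today=date(2011, 12, 1)))
```

Returning `None` for an unknown country keeps "could not be checked" distinct from "checked and failed".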

Page 12: Ilene Mizrachi - Opening Plenary


Compliance tool

Page 13: Ilene Mizrachi - Opening Plenary


Checking Sequence Quality

• Trim primer sequences
• Check congruence between fwd and reverse reads
• Align sequences to check for gaps
• Translate sequences to check for internal stops
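Three of these steps can be sketched in a few lines. This is a simplified illustration, not the pipeline GenBank runs: the function names are hypothetical, primers are matched exactly, the reads are assumed to be unambiguous uppercase A/C/G/T of equal length, and the alignment step is omitted. The stop-codon set assumes the invertebrate mitochondrial genetic code (translation table 5); other groups need a different set.

```python
# Minimal sketch of primer trimming, read congruence, and internal-stop checks.
# Assumes unambiguous uppercase A/C/G/T; all names here are illustrative.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Stop codons in the invertebrate mitochondrial code (translation table 5).
STOP_CODONS = {"TAA", "TAG"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def trim_primers(seq, fwd_primer, rev_primer):
    """Strip the forward primer from the start and the reverse primer's
    reverse complement from the end, when present as exact matches."""
    if seq.startswith(fwd_primer):
        seq = seq[len(fwd_primer):]
    rev_rc = reverse_complement(rev_primer)
    if rev_rc and seq.endswith(rev_rc):
        seq = seq[:-len(rev_rc)]
    return seq


def mismatches(fwd_read, rev_read):
    """Count positions where the forward read and the reverse-complemented
    reverse read disagree (equal-length, fully overlapping reads assumed)."""
    rev_rc = reverse_complement(rev_read)
    return sum(a != b for a, b in zip(fwd_read, rev_rc))


def internal_stops(seq, frame=0, stops=STOP_CODONS):
    """Return 0-based positions of stop codons, ignoring the final codon."""
    codons = [(i, seq[i:i + 3]) for i in range(frame, len(seq) - 2, 3)]
    return [i for i, codon in codons[:-1] if codon in stops]
```

A record with a non-empty `internal_stops` result or a high `mismatches` count would be a candidate for manual review rather than automatic rejection, since the wrong reading frame or genetic code produces the same symptoms.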

Page 14: Ilene Mizrachi - Opening Plenary


Updates Are Critical

• Primary data repository – sequence records owned by submitter
• Submitter is responsible for providing additional data and metadata as it becomes available:
    • Publication
    • Sequence
    • Taxonomy
    • Voucher
• Third party updates are welcome!

Page 15: Ilene Mizrachi - Opening Plenary


Challenges

• If Reference Barcodes are to be used for species identification, phylogenetics, ecological forensics, conservation, and macro-analysis of biodiversity patterns, then the minimal requirement should be (a) high quality sequence, (b) link to specimen, and (c) taxonomic identification
• Need to support rapid data release, including preliminary taxonomic classifications, similar to the “Fort Lauderdale Principles” of the genomics community
• Data are updated asynchronously at BOLD and in GenBank. Need to continue work on the update channel
• Need to work with communities to devise strict QA tests for plant and fungal Barcodes

Page 16: Ilene Mizrachi - Opening Plenary


Acknowledgements

Taxonomy Group
• Scott Federhen
• Conrad Schoch
• Lu Sun
• Carol Hotton
• Detlef Leipe

GenBank Group
• Susan Schafer
• Michael Fetchko

Software Support
• Colleen Bollin
• Kamen Todorov
• Vasuki Gobu