Robinson bosc2010 bio_hdf

Post on 18-Nov-2014


www.hdfgroup.org

The HDF Group

July 9, 2010

BioHDF: Open Binary File Formats for Next-Generation Sequencing Data

Dana Robinson

The HDF Group

derobins@hdfgroup.org

Copyright © 2010 The HDF Group. All Rights Reserved

Current Status and Future Directions

NGS Data Challenges

Very large quantities of data (100s of GB)

"Drinking from the firehose"

Analysis methods vary greatly, so a flexible yet unified data store would be useful.

What is Needed

A Data Model: A data model that accurately describes the data and can be expanded to contain new types of data

A Data Store: A file format or data store that is efficient in access time and storage size and that scales well

A Toolkit: A flexible software toolkit that can be used to create tools and pipelines based on the data model and file format


What is BioHDF?

An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group.

BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema)

BioHDF is a library and C API which can be used to write applications (coming soon)

BioHDF is a set of command line tools for storing, retrieving and manipulating data in BioHDF files


HDF = Hierarchical Data Format

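[Figure: An example of how data is stored in HDF5. The file somefile.h5 is drawn as a tree containing Reads, Alignments and References, with callouts identifying groups, datasets and attributes (for example, is_sorted).]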
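The group, dataset and attribute vocabulary from the figure can be illustrated with a toy in-memory model. This is only a sketch in plain Python, not the HDF5 library or the BioHDF layout: the class names, the `require_group` and `lookup` helpers, the `sample1` dataset and its records are all hypothetical, and treating Reads, Alignments and References as groups directly under the root is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy stand-in for a dataset: values plus key/value attributes."""
    data: List[str]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Toy stand-in for a group: a container of groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def require_group(self, name: str) -> "Group":
        # Return the named child group, creating it if it is missing.
        child = self.children.setdefault(name, Group())
        if not isinstance(child, Group):
            raise TypeError(f"{name!r} is a dataset, not a group")
        return child

    def lookup(self, path: str) -> Union["Group", Dataset]:
        # Resolve a slash-separated path such as "/Alignments/sample1".
        node: Union[Group, Dataset] = self
        for part in path.strip("/").split("/"):
            if not part:
                continue
            if not isinstance(node, Group):
                raise KeyError(path)
            node = node.children[part]
        return node


root = Group()  # plays the role of "/" in somefile.h5
root.require_group("Reads")
alignments = root.require_group("Alignments")
root.require_group("References")

# Hypothetical dataset with an attribute describing its contents.
alignments.children["sample1"] = Dataset(
    data=["read1\tchr1\t100", "read2\tchr1\t250"],
    attrs={"is_sorted": True},
)

hit = root.lookup("/Alignments/sample1")
print(hit.attrs["is_sorted"])
```

The point of the sketch is that metadata such as is_sorted travels with the object it describes, which is what makes a file of this kind self-describing.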


Benefits of BioHDF

• Portability and data sharing: Platform independent, endian independent, self-describing, common data models.

• High performance: Fast random access and efficient, scalable, petabyte-level compressed storage.

• Widespread adoption: MATLAB, IDL, NASA Earth Observing System, Pacific Biosciences, SOLiD, hundreds of products.

• 20-year history: Robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.
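Fast random access and compressed storage can coexist when data is compressed in independent chunks, so that a reader only decompresses the chunk it needs. HDF5 supports chunked datasets with compression filters. The sketch below is a toy illustration of that idea using Python's zlib, not HDF5 itself; the chunk size, function names and record format are hypothetical.

```python
import zlib

CHUNK_SIZE = 4  # records per chunk; tiny so the example is easy to follow


def write_chunks(records):
    """Compress fixed-size groups of records independently of each other."""
    chunks = []
    for start in range(0, len(records), CHUNK_SIZE):
        payload = "\n".join(records[start:start + CHUNK_SIZE]).encode("ascii")
        chunks.append(zlib.compress(payload))
    return chunks


def read_record(chunks, index):
    """Fetch one record by decompressing only the chunk that holds it."""
    chunk_no, offset = divmod(index, CHUNK_SIZE)
    lines = zlib.decompress(chunks[chunk_no]).decode("ascii").split("\n")
    return lines[offset]


# Hypothetical records standing in for reads; none contains a newline.
reads = [f"ACGT{i:04d}" for i in range(10)]
stored = write_chunks(reads)

print(len(stored))             # number of compressed chunks
print(read_record(stored, 9))  # touches only the last chunk
```

The chunk size is the design choice here: larger chunks usually compress better, while smaller chunks mean less data to decompress per lookup.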


HDF in Bioinformatics

• Baylor Imaging Group

• Life Technologies

• Pacific Biosciences

• Oxford Nanopore

• GenomeData (UW)

• Geospiza

• Others


Data Stored

The prototype BioHDF stores:

Reads

Alignments

Annotations

Clusters of Aligned Reads

Reference Sequences

Indexes (NCList or simple)
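To make the role of an index concrete, here is a minimal sketch of what a "simple" positional index over alignments might look like: alignments sorted by start coordinate and searched with bisect. This is an assumption for illustration only, not BioHDF's actual index format and not an NCList. The `Alignment` and `SimpleIndex` names and the half-open coordinate convention are hypothetical.

```python
import bisect
from typing import List, NamedTuple


class Alignment(NamedTuple):
    read_id: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


class SimpleIndex:
    """Alignments sorted by start position, searched with bisect."""

    def __init__(self, alignments: List[Alignment]) -> None:
        self._sorted = sorted(alignments, key=lambda a: a.start)
        self._starts = [a.start for a in self._sorted]

    def overlapping(self, start: int, end: int) -> List[Alignment]:
        # Alignments that start at or after `end` cannot overlap the query,
        # so binary search finds where to stop scanning.
        stop = bisect.bisect_left(self._starts, end)
        # Of the remaining candidates, keep those that end after `start`.
        return [a for a in self._sorted[:stop] if a.end > start]


index = SimpleIndex([
    Alignment("r1", 100, 150),
    Alignment("r2", 120, 170),
    Alignment("r3", 400, 450),
])

hits = index.overlapping(140, 160)
print([a.read_id for a in hits])
```

The final filter still scans every alignment that starts before the query end, which is the cost that a nested-interval structure such as NCList is designed to reduce.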


Data Stored

Additional user-specific data can be stored without breaking the library or tools.

[Figure: a file holding BioHDF Data alongside User-Specific Data]

Similar to how adding additional tables to a database schema does not invalidate existing queries.
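The database analogy can be checked directly with Python's built-in sqlite3 module. In this sketch the `reads` and `lab_notes` tables and their contents are hypothetical and unrelated to the BioHDF schema. It only shows that a query written before a new table was added returns the same answer afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The "standard" part of the schema that existing tools know about.
conn.execute("CREATE TABLE reads (id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [("r1", "ACGT"), ("r2", "GGCA")],
)


def count_reads(connection):
    """An 'existing tool': it only ever touches the reads table."""
    return connection.execute("SELECT COUNT(*) FROM reads").fetchone()[0]


before = count_reads(conn)

# A user adds their own table next to the standard ones.
conn.execute("CREATE TABLE lab_notes (read_id TEXT, note TEXT)")
conn.execute("INSERT INTO lab_notes VALUES ('r1', 'rerun on new flow cell')")

after = count_reads(conn)
print(before == after)  # the existing query is unaffected
```

By the same reasoning, a tool that opens only the objects it knows about in a file need not be affected by extra objects stored beside them.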


Project Stages

A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage.

A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools.

A higher-level C API that abstracts out and hides the underlying storage technology.


[Figure: software layer diagram. Labels: BioHDF Applications and Wrappers (e.g. Perl, Python); HDF5 API and Applications; HDF5 API; Physical Storage; BioHDF API; High-Level API.]


A Higher-Level API

[Figure: API layering diagram. Labels: tool; wrapper; high-level C API; low-level C APIs; BioHDF API; samtools BAM API.]

A high-level API will encapsulate and hide the underlying storage technology.
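One way to picture this encapsulation is an abstract interface with interchangeable backends. The sketch below is written in Python for brevity, although the planned API is in C, and every name in it (`AlignmentStore`, `FlatListStore`, `GroupedStore`, `count_alignments`) is hypothetical and not part of BioHDF or samtools. The two backends merely stand in for different low-level storage layers.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def add(self, read_id: str, reference: str, position: int) -> None:
        """Record that a read aligns to a reference at a position."""

    @abstractmethod
    def on_reference(self, reference: str) -> Iterator[Tuple[str, int]]:
        """Yield (read_id, position) for each alignment on a reference."""


class FlatListStore(AlignmentStore):
    """Stand-in for one low-level backend: a flat list of records."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, int]] = []

    def add(self, read_id: str, reference: str, position: int) -> None:
        self._rows.append((read_id, reference, position))

    def on_reference(self, reference: str) -> Iterator[Tuple[str, int]]:
        for read_id, ref, position in self._rows:
            if ref == reference:
                yield read_id, position


class GroupedStore(AlignmentStore):
    """Stand-in for another backend: records grouped by reference."""

    def __init__(self) -> None:
        self._by_ref: Dict[str, List[Tuple[str, int]]] = {}

    def add(self, read_id: str, reference: str, position: int) -> None:
        self._by_ref.setdefault(reference, []).append((read_id, position))

    def on_reference(self, reference: str) -> Iterator[Tuple[str, int]]:
        yield from self._by_ref.get(reference, [])


def count_alignments(store: AlignmentStore, reference: str) -> int:
    """A 'tool' written once against the high-level interface."""
    return sum(1 for _ in store.on_reference(reference))


results = []
for store in (FlatListStore(), GroupedStore()):
    store.add("r1", "chr1", 100)
    store.add("r2", "chr2", 50)
    store.add("r3", "chr1", 300)
    results.append(count_alignments(store, "chr1"))

print(results)  # same answer from both backends
```

Because `count_alignments` never sees how records are laid out, swapping the storage layer does not require changing the tool.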


Acknowledgements

BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.

Geospiza: Todd Smith, Mark Welsh

The HDF Group: Mike Folk


Thank you for your time!

If you are interested in using or contributing to BioHDF, please contact us!

Dana Robinson (derobins@hdfgroup.org)

http://www.biohdf.org

BOSC BoF: Friday 5:10-6:00

ISMB Poster J18: Monday, July 12: 12:40-2:30

ISMB BoF: Tuesday, July 13 1-2 pm, room 306