BioTorrents: A File Sharing Service for Scientific Data

I present an overview of BioTorrents.net. This was presented at the Open Science Summit 2010 conference in Berkeley, CA.

Morgan Langille, PhD

Open Science Summit 2010

Berkeley, California

July 29th, 2010

Acknowledgements

iSEEM project

Dr. Jonathan Eisen, UC Davis

Questions/Comments: Twitter @BetaScience

Motivation

Data in science is growing rapidly

Transfer times increasing

Reliability of data transfer

Sharing scientific data openly

Personal Challenges

1. Improve download speed and reliability from large data providers

2. Encourage sharing of all data associated with a study

3. Allow easier sharing of unpublished data

Traditional file transfer methods

Single source server

Bandwidth limitations

No data redundancy

No data verification

Peer-to-peer file transfer: BitTorrent

Data is shared between all computers

Bandwidth grows as the number of users increases

Data redundancy

Data is verified (SHA-1 cryptographic hash)

25-50% of all Internet traffic is BitTorrent
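The data verification point above can be illustrated with a minimal sketch. It assumes the usual BitTorrent approach of splitting a file into fixed-size pieces and storing one SHA-1 digest per piece; the piece length and the sample data here are made up for illustration and are far smaller than anything a real torrent would use.

```python
import hashlib

# Illustrative piece size only; real torrents use much larger pieces.
PIECE_LENGTH = 4

def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Split data into fixed-size pieces and return the SHA-1 digest of each piece."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]

def verify_piece(piece, expected_digest):
    """Return True if a received piece matches the digest published for it."""
    return hashlib.sha1(piece).digest() == expected_digest

# Hypothetical dataset content and the digests a publisher would list for it.
original = b"ACGTACGTTTGA"
expected = piece_hashes(original)

print(verify_piece(b"ACGT", expected[0]))  # intact piece
print(verify_piece(b"ACGA", expected[0]))  # corrupted piece
```

A client that receives a piece failing this check can discard it and request it again from another peer, which is what makes transfers from untrusted peers safe.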

BitTorrent: How it works

1. User installs BitTorrent client software

2. User downloads a small “.torrent” descriptor file

3. Client software connects to “Tracker” to obtain a list of other “peers” with the same data

4. Client begins downloading/uploading

[Figure: “.torrent” files and the “Tracker” server]
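Step 3 above (the client asking the tracker for peers) can be sketched as the construction of an HTTP announce request. This is a rough illustration, assuming the standard HTTP tracker query parameters; the tracker URL, peer ID, and hash input are hypothetical, and the sketch only builds the URL without contacting any server.

```python
import hashlib
from urllib.parse import urlencode

def build_announce_url(announce, info_hash, peer_id, port, left):
    """Build the GET URL a client would send to a tracker to request a peer list."""
    params = {
        "info_hash": info_hash,  # 20-byte SHA-1 identifying the dataset
        "peer_id": peer_id,      # 20-byte identifier chosen by the client
        "port": port,            # port on which this client accepts connections
        "uploaded": 0,
        "downloaded": 0,
        "left": left,            # bytes still to download
        "event": "started",
    }
    # urlencode percent-encodes the raw bytes of info_hash and peer_id.
    return announce + "?" + urlencode(params)

# Hypothetical values for illustration only.
announce = "http://tracker.example.org/announce"
info_hash = hashlib.sha1(b"example info dictionary").digest()
peer_id = b"-XX0001-" + b"0" * 12

url = build_announce_url(announce, info_hash, peer_id, port=6881, left=1000)
print(url)
```

The tracker's reply would contain the addresses of other peers, after which the client starts exchanging pieces with them directly.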

Other BitTorrent Advantages

Every dataset is given a unique ID (SHA-1 hash)

Distributed Hash Table (DHT) & Peer Exchange (PEX): tracker-less peer identification

Local Peer Discovery (LPD): finds peers on the local area network (LAN), allowing much faster data transfer

Web Seeds: FTP or HTTP resources can be added to the torrent
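To make the unique ID and web seed points more concrete, here is a minimal sketch. It assumes the standard scheme in which the ID is the SHA-1 of the bencoded "info" dictionary of a torrent, and in which web seeds are listed under a "url-list" key outside that dictionary. The file name, tracker URL, and seed URL are hypothetical, and the encoder below is a simplified illustration rather than a full implementation.

```python
import hashlib

def bencode(value):
    """Encode ints, byte strings, text strings, lists, and dicts in bencode form."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode " + type(value).__name__)

# Hypothetical single-file dataset made of three 4-byte pieces.
info = {
    "name": "example_dataset.fasta",
    "length": 12,
    "piece length": 4,
    "pieces": b"".join(hashlib.sha1(p).digest()
                       for p in (b"ACGT", b"ACGT", b"TTGA")),
}
torrent = {"announce": "http://tracker.example.org/announce", "info": info}
info_hash = hashlib.sha1(bencode(info)).hexdigest()

# Adding a web seed changes the .torrent file but not the dataset's ID.
torrent_with_seed = dict(torrent)
torrent_with_seed["url-list"] = ["http://data.example.org/example_dataset.fasta"]
same_id = hashlib.sha1(bencode(torrent_with_seed["info"])).hexdigest() == info_hash

print(info_hash, same_id)
```

Because the ID depends only on the dataset description, two people who package the same files with the same piece size end up with the same ID, regardless of which trackers or web seeds they list.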

BitTorrent Trackers

Many trackers already exist

Almost all have legal issues with copyright infringement

None are tailored to hosting scientific datasets

BioTorrents is a file sharing website for scientists

BioTorrents provides a central listing of datasets

Anyone can upload their own data

All data must be “open”; no illegal file sharing

Data is not hosted on BioTorrents

Langille & Eisen, 2010, PLoS ONE 5: e10071.

BioTorrents: Advanced Features

Browse and search by

○ Keyword (dataset title and description)

○ Category (Genomics, Proteomics, Chemistry, etc.)

○ License (Public Domain, Creative Commons, GPL, etc.)

○ Username (mlangill, jeisen, NCBI, etc.)

RSS feeds and automatic downloading

Torrents linked into “Versions”

Upload script for bulk torrent creation

BioTorrents progress

1000 registered users

43 datasets (107 GB)

766 downloads

1386 GB data transferred

Real Example

Download GenBank (~230 GB) from NCBI

NCBI to UC Davis

Download speed and time:

○ Max 30 MB/s: 2 hours

○ FTP to other server, ~10 MB/s: 6 hours

○ FTP to NCBI, ~0.5 MB/s: 5 days
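The times quoted for this example follow directly from the dataset size and the download speeds. The short sketch below reproduces the arithmetic, assuming 1 GB = 1000 MB and a constant speed for the whole transfer (both simplifications).

```python
# ~230 GB GenBank download, taking 1 GB = 1000 MB for simplicity.
DATASET_MB = 230 * 1000

def transfer_hours(size_mb, speed_mb_per_s):
    """Hours needed to move size_mb megabytes at a constant speed in MB/s."""
    return size_mb / speed_mb_per_s / 3600

for label, speed in [("30 MB/s", 30), ("10 MB/s", 10), ("0.5 MB/s", 0.5)]:
    hours = transfer_hours(DATASET_MB, speed)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

This gives roughly 2.1 hours, 6.4 hours, and 5.3 days respectively, in line with the 2 hours, 6 hours, and 5 days reported on the slide.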

Who will use BioTorrents?

1. Existing large data providers

○ More reliable and faster downloads for users

○ Less bandwidth requirements for provider

2. Scientists sharing published data

○ All data is bundled together and given a unique ID

○ Easier than setting up a Web/FTP server

3. Scientists sharing unpublished data

○ Data that might not be suitable for existing databases

○ Results that may not be sufficient for publication

Issues

BitTorrent works best for large, popular datasets

Long-term seeding

○ At least 1 seeder has to exist

Many institutions block/limit BitTorrent activity

Future

Metalink

○ XML link protocol

○ Combines multiple sources (FTP, HTTP, BitTorrent, etc.)

Volunteer Storage

○ Parallel to volunteer computing
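As a rough illustration of how Metalink combines several sources for one file, the sketch below builds a small Metalink document. It assumes the Metalink 4 XML format (namespace `urn:ietf:params:xml:ns:metalink`); the slides do not say which Metalink version was intended, and the file name, size, and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from the Metalink 4 format.
METALINK_NS = "urn:ietf:params:xml:ns:metalink"

def build_metalink(name, size, urls, torrent_url):
    """Return a Metalink XML string listing direct URLs and a torrent for one file."""
    root = ET.Element("metalink", xmlns=METALINK_NS)
    file_el = ET.SubElement(root, "file", name=name)
    ET.SubElement(file_el, "size").text = str(size)
    for url in urls:
        # Each <url> is an alternative FTP or HTTP source for the same file.
        ET.SubElement(file_el, "url").text = url
    # <metaurl> points at a metadata file, here a .torrent, describing another source.
    ET.SubElement(file_el, "metaurl", mediatype="torrent").text = torrent_url
    return ET.tostring(root, encoding="unicode")

# Hypothetical dataset and mirrors for illustration only.
doc = build_metalink(
    name="example_dataset.fasta",
    size=12,
    urls=[
        "ftp://ftp.example.org/example_dataset.fasta",
        "http://data.example.org/example_dataset.fasta",
    ],
    torrent_url="http://tracker.example.org/example_dataset.torrent",
)
print(doc)
```

A Metalink-aware download client can then pick whichever listed source is fastest or still available, falling back to the others if one fails.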

Final Message

Data transfer should be fast and easy

The scientific community should embrace existing technologies such as BitTorrent

BioTorrents uses the strengths of BitTorrent and provides features unique to scientific data