Post on 25-Jan-2015
There is no magic
There is only awesome
Deepak Singh
Platforms for data science
bioinformatics
image: Ethan Hein
3
collection
curation
analysis
what’s the big deal?
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
constant change
we want to make our data more effective
versioning
provenance
filter
aggregate
extend
mashup
human interfaces
image: Leo Reynolds
hard problem
really hard problem
so how do we get there?
information platforms
dataspaces
Further reading: Jeff Hammerbacher, "Information Platforms and the Rise of the Data Scientist", in Beautiful Data
the unreasonable effectiveness of data
Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
accept all data formats
evolve APIs
beyond databases and the data warehouse
data as a programmable resource
data is a royal garden
compute is a fungible commodity
optimizing the most valuable resource
compute, storage, workflows, memory, transmission, algorithms, cost, …
people
Credit: Pieter Musterd under a CC-BY-NC-ND license
Image: Chris Dagdigian
my bias
cloud services
distributed systems
scale
global
consumption models
on-demand
what is the value of your data?
Credit: Angel Pizarro, U. Penn
mapreduce for genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
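For readers new to the pattern, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, applied to counting k-mers in sequencing reads. It is a toy illustration only and is not how the tools linked above work internally; the function names and the example reads are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


# Hypothetical reads, made up for this example.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)
print(counts)
```

Each read is mapped independently of the others, which is why the map step spreads naturally across many machines; only the shuffle needs to bring matching keys together.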
Bioproximity
http://aws.amazon.com/solutions/case-studies/bioproximity/
30,472 cores
$1279/hr
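As a rough back-of-the-envelope check, the two figures above can be turned into a price per core-hour. This sketch assumes the $1279/hr figure is the price of running all 30,472 cores for one hour, which the slides imply but do not state; the variable names are illustrative.

```python
# Figures taken from the slides.
cores = 30_472
cluster_cost_per_hour = 1279.0  # USD; assumed to cover the whole cluster

# Derived value: cost of one core for one hour.
cost_per_core_hour = cluster_cost_per_hour / cores

print(f"about ${cost_per_core_hour:.3f} per core-hour")
```

Under that assumption the cluster works out to roughly four cents per core per hour.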
in summary
large scale data requires a rethink
data architecture
compute architecture
distributed, programmable infrastructure
cloud services
remove constraints
can we build data science platforms?
there is no magic
there is only awesome
deesingh@amazon.com
Twitter: @mndoci
http://slideshare.net/mndoci
http://mndoci.com
Inspiration and ideas from Matt Wood & Larry Lessig
Credit: Oberazzi under a CC-BY-NC-SA license