2014 moore-ddd

Infrastructure for Data Intensive Biology: “Better Science through Superior Software”, C. Titus Brown

Transcript of 2014 moore-ddd

Page 1: 2014 moore-ddd

Infrastructure for Data Intensive Biology

“Better Science through Superior Software”

C. Titus Brown

Page 2: 2014 moore-ddd

Current research:

Compressive algorithms for sequence analysis

Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?

Page 3: 2014 moore-ddd

Three super-awesome technologies…

1. Low-memory k-mer counting (Zhang et al., PLoS One, 2014)

2. Compressible assembly graphs (Pell et al., PNAS, 2012)

3. Streaming lossy compression of sequence data (Brown et al., arXiv, 2012)
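To make items 1 and 3 concrete, here is a minimal pure-Python sketch. It is not khmer's code or API. It assumes a Count-Min Sketch as the low-memory counter and a median-k-mer-count rule for discarding redundant reads in one streaming pass; the class name, function names, and the default `width`, `depth`, `k`, and `cutoff` values are all illustrative.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One hash per table, derived by salting BLAKE2b with the table index.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for i, j in self._slots(kmer):
            self.tables[i][j] += 1

    def count(self, kmer):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[i][j] for i, j in self._slots(kmer))


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_stream(reads, k=5, cutoff=3, sketch=None):
    """Keep a read only if its median k-mer count so far is below cutoff.

    Lossy by design: reads from already well-covered regions are dropped.
    """
    if sketch is None:
        sketch = CountMinSketch()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(sketch.count(w) for w in words) < cutoff:
            kept.append(read)
            for w in words:
                sketch.add(w)
    return kept
```

Memory use is set by `width * depth`, independent of how many distinct k-mers the stream contains; the price is that counts can be too high when k-mers collide in every table.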

Page 4: 2014 moore-ddd

…implemented in one super-awesome software package…

github.com/ged-lab/khmer/

BSD licensed

Openly developed using good practice.

> 10 external contributors.

Thousands of downloads/month.

50 citations in 3 years.

We think > 1000 people are using it; have heard from dozens.

Page 5: 2014 moore-ddd

…enabling super-awesome biology.

1. Assembling soil metagenomes (Howe et al., PNAS, 2014)

2. Understanding bone-eating worm symbionts (Goffredi et al., ISME, 2014)

3. An ultra-deep look at the lamprey transcriptome (in preparation)

4. Understanding derived anural development in Molgulid ascidians (in preparation)

Page 6: 2014 moore-ddd

Early on, lack of replicability in pubs slowed us down =>

Strategy: “level up” the field

High quality & novel science,

done openly,

written up in reproducible and remixable papers,

using IPython Notebook,

and posted to preprint servers.

Expression based clustering of 85 lamprey tissue samples (de novo assembly of 3 billion reads) ~ 1 month

Camille Scott

Page 7: 2014 moore-ddd

Open protocols for the cloud: ~$100/analysis

khmer-protocols.readthedocs.org/

Transcriptome and metagenome assembly protocols

Page 8: 2014 moore-ddd

The data challenge in biology

In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?)

We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.

Moreover, most data is unavailable until after publication…

…which, in practice, means it will be lost.

Page 9: 2014 moore-ddd

Proposal: distributed graph database server

Page 13: 2014 moore-ddd

Graph queries across public & walled-garden data sets:

See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.

Page 14: 2014 moore-ddd

The larger vision

Enable and incentivize sharing by providing immediate utility; frictionless sharing.

Permissionless innovation for e.g. new data mining approaches.

Plan for poverty with federated infrastructure built on open & cloud.

Solve people’s current problems, while remaining agile for the future.

Page 15: 2014 moore-ddd

Who needs this?

Everyone.

Environmental microbiology, evo devo, agriculture, VetMed...

Page 16: 2014 moore-ddd

How would I start?

1-2 pilot projects w/domain postdocs: drive computational infrastructure with biology problems.

Support postdocs with software engineer (infrastructure) and graduate student CS (research).

Cross-train postdocs in data-intensive research methods and software engineering.

Note: finding existing data is not a problem :)

“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.

Via Elizabeth Kujawinski

Page 17: 2014 moore-ddd

Education and training

Biology is underprepared for data-intensive investigation.

We must teach and train the next generations.

~5-10 workshops / year, novice -> masterclass; open materials.

Deeply self-interested:

What problems does everyone have, now? (Assembly)

What problems do leading-edge researchers have? (Data integration)

Page 18: 2014 moore-ddd

Pre-answered Questions

Q: What will be open?

A: Everything; I succeed & fail publicly.

Q: How will you measure success?

A: By other people using & extending our “products” without talking to us.

Blog: ivory.idyll.org/blog/ - search for “moore”, “satire”

@ctitusbrown

Page 19: 2014 moore-ddd

Graph queries across public & walled-garden data sets:

“What data sets contain <this gene>?”

“Which reads match to <this gene>, but not in <conserved domain>?”

“Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
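The three example queries can be sketched against a toy in-memory index. This is only an illustration, not the proposed server: it assumes each data set is summarized by its k-mer counts plus free-form metadata, that "contains" means a high fraction of the gene's k-mers are present, and that "relative abundance" means the share of a data set's k-mer occurrences that belong to the gene. `DataSet`, the function names, the `nitrogen` metadata key, and the `min_containment` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def kmer_set(seq, k=4):
    """Distinct overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class DataSet:
    """Toy stand-in for one public or walled-garden data set."""

    def __init__(self, name, reads, metadata, k=4):
        self.name = name
        self.reads = list(reads)
        self.metadata = metadata
        self.k = k
        self.counts = Counter(
            read[i:i + k] for read in self.reads for i in range(len(read) - k + 1)
        )
        self.total = sum(self.counts.values())

    def containment(self, gene):
        # Fraction of the gene's distinct k-mers seen in this data set.
        words = kmer_set(gene, self.k)
        if not words:
            return 0.0
        return sum(1 for w in words if w in self.counts) / len(words)

    def relative_abundance(self, gene):
        # Share of all k-mer occurrences here that come from the gene.
        if self.total == 0:
            return 0.0
        return sum(self.counts[w] for w in kmer_set(gene, self.k)) / self.total


def datasets_containing(datasets, gene, min_containment=0.9):
    """'What data sets contain <this gene>?'"""
    return [d.name for d in datasets if d.containment(gene) >= min_containment]


def reads_matching_outside(dataset, gene, domain):
    """'Which reads match to <this gene>, but not in <conserved domain>?'"""
    gene_words = kmer_set(gene, dataset.k)
    domain_words = kmer_set(domain, dataset.k)
    hits = []
    for read in dataset.reads:
        words = kmer_set(read, dataset.k)
        if words & gene_words and not words & domain_words:
            hits.append(read)
    return hits


def abundance_by(datasets, gene, key):
    """'Give me relative abundance of <gene X> ..., grouped by <key>.'"""
    groups = defaultdict(list)
    for d in datasets:
        groups[d.metadata.get(key)].append(d.relative_abundance(gene))
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}
```

A call such as `abundance_by(datasets, gene, "nitrogen")` returns one mean abundance per metadata value. In a federated setting each site would presumably answer for its own data sets and return only these summaries, which is what would let walled-garden data take part without being copied.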