Building collaborative workflows for scientific data
Transcript of Building collaborative workflows for scientific data
Building collaborative workflows for scientific data
bmpvieira.com/orcambridge14

Bruno Vieira @bmpvieira
PhD Student in Bioinformatics and Population Genomics
Supervisor: Yannick Wurm @yannick__

© 2014 Bruno Vieira CC-BY 4.0
Sequencing cost drops
Sequencing data rises
Goodbye Excel/Windows
Hello command line
Hello super computers
Programming
Programming
Programming
Reproducibility crisis
Reproducibility layers
Code, Data, Workflow, Environment
Reproducibility layers
Code, Data, Workflow, Environment
Data
Dat
Open source tool for sharing and collaborating on data.
Started August '13; grant funded and 100% open source.
dat-data.com
#dat (public) on freenode
gitter.im/datproject/discussions
Dat Community Call #1
Dat - "git for data"
npm install -g dat
dat init
collect-data | dat import
dat listen
Dat
dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform
http://eukaryota.dathub.org
Dat
Planned:
dat checkout revision
dat diff
dat branch
multi-master replication
sync to databases
registry
Data stored locally in LevelDB, but can use other backends such as Postgres, Redis, etc.
Files stored in blob-stores: s3, local-fs, bittorrent, ftp, etc.
Dat features
auto schema generation
free REST API
all APIs are streaming
Dat workshop
maxogden.github.io/get-dat
Dat quick deploy
github.com/bmpvieira/heroku-dat-template
Reproducibility layers
Code, Data, Workflow, Environment
Workflow
Bionode
Open source project for modular and universal bioinformatics.
Started January '14.
bionode.io
Some problems I faced
during my research:
- Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
- For web projects, needed to implement the same functionality on browser and server
- Difficulty writing scalable, reproducible and complex bioinformatic pipelines
Bionode also collaborates with BioJS
Bionode
npm install -g bionode
bionode ncbi download gff bacteria
bionode ncbi download sra arthropoda | bionode sra fastq-dump

npm install -g bionode-ncbi
bionode-ncbi search assembly formicidae | dat import --json
Bionode - list of modules
Name          Type         Status
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production
ensembl       Data access  production
blast-parser  Parser       production
Bionode - list of modules
Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production
Bionode - list of modules
Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development
Bionode - list of modules
Name      Type         Status
ebi       Data access  request
semantic  Data access  request
vcf       Parser       request
gff       Parser       request
bowtie    Wrappers     request
sge       Wrappers     request
blast     Wrappers     request
Bionode - list of modules
Name     Type
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers
Bionode - Why wrappers?
- Same interface between modules (Streams and NDJSON)
- Easy installation with NPM
- Semantic versioning
- Add tests
- Abstract complexity / more user friendly
Bionode - Why Node.js?
Same code client/server side
Need to reimplement the same code on browser and server.
Solution: JavaScript everywhere
Afra -> bionode-seq
GeneValidator -> seq, fasta
SequenceServer
BioJS collaborating for code reuse
Biodalliance converting to bionode
Bionode - Why Node.js?
Reusable, small and tested modules
Benefit from other JS projects:
Dat, BioJS, NoFlo
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Python example: URL for the Achromyrmex assembly?
Solution:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG
import xml.etree.ElementTree as ET
from Bio import Entrez

Entrez.email = "[email protected]"
esearch_handle = Entrez.esearch(db="assembly", term="Achromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    # wrap the XML fragment in a root element so it parses as one document
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text
bionode-ncbi
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Example: URL for the Achromyrmex assembly?
JavaScript
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function(urls) {
  console.log(urls[0].genomic.fna)
})
bio.ncbi.urls('assembly', 'Acromyrmex')
  .on('data', printGenomeURL)

function printGenomeURL(urls) {
  console.log(urls[0].genomic.fna)
}
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Example: URL for the Achromyrmex assembly?
JavaScript
BASH
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)
bionode-ncbi urls assembly Acromyrmex | tool-stream extractProperty genomic.fna
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

ncbi.search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)
fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)
fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
  .pipe(ncbi.search('pubmed'))
  .pipe(fork2)
  .pipe(dat.papers)
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam
Difficulty writing scalable, reproducible andcomplex bioinformatic pipelines.
bionode-example-dat-gasket
get-dat workshop
get-dat bionode gasket example
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

{
  "import-data": [
    "bionode-ncbi search genome eukaryota",
    "dat import --json --primary=uid"
  ],
  "search-ncbi": [
    "dat cat",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],
  "index-and-align": [
    "cat metadata.json",
    "bionode-sra fastq-dump -",
    "tool-stream extractProperty destFile",
    "bionode-bwa mem **/*fna.gz"
  ],
  "convert-to-bam": [
    "bionode-sam 35526/SRR070675.sam"
  ]
}
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

datscript:

pipeline main
  run pipeline import

pipeline import
  run foobar | run dat import --json
bmpvieira example
ekg example
Reproducibility layers
Code, Data, Workflow, Environment
Environment
Docker for reproducible science

docker run bmpvieira/thesis
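The `docker run` line above works because the image bundles the code together with its whole environment. A hypothetical Dockerfile for such an analysis image might look like the sketch below; every base image, package, path, and script name here is an assumption for illustration, not the contents of the actual bmpvieira/thesis image.

```dockerfile
# Hypothetical reproducible-analysis environment
# (illustrative sketch; not the actual bmpvieira/thesis image).
FROM node:0.10

# Pin the exact tools the pipeline needs.
RUN npm install -g dat bionode-ncbi

# Copy the analysis code into the image.
COPY . /analysis
WORKDIR /analysis

# Running the container reruns the whole analysis (hypothetical script).
CMD ["bash", "run-analysis.sh"]
```

Anyone with Docker can then rerun the analysis with a single `docker run`, without installing any of the tools on the host.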
Bionode - Modular and universal bioinformatics
Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.

Dat - Build data pipelines
Provides a streaming interface between every file format and data storage backend. "git for data"
Bionode.io
#bionode
gitter.im/bionode/bionode
Dat-data.com
#dat
gitter.im/datproject/discussions
Acknowledgements
@yannick__ @maxogden @mafintosh @erikgarrison @QM_SBCS @opendata
Bionode contributors
Thanks!
"Science should work as anOpen Source project"
dat-data.com
bionode.io