Post on 30-Jul-2015
Building collaborative workflows for scientific data
bmpvieira.com/orcambridge14

PhD Student, Bioinformatics and Population Genomics
Supervisor: Yannick Wurm
Bruno Vieira @bmpvieira
@yannick__
© 2014 Bruno Vieira CC-BY 4.0
Goodbye Excel/Windows
Hello command line
Reproducibility layers
Code
Data
Workflow
Environment
Dat
Open source tool for sharing and collaborating on data.
Started August '13; grant funded and 100% open source.

dat-data.com
#dat (public on freenode)
gitter.im/datproject/discussions
Dat Community Call #1
Dat - "git for data"
npm install -g dat
dat init
collect-data | dat import
dat listen
Dat
dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform
http://eukaryota.dathub.org
Dat
Planned:
dat checkout revision
dat diff
dat branch
multi-master replication
sync to databases
registry
Data stored locally in LevelDB, but can use other backends such as Postgres, Redis, etc.
Files stored in blob stores: S3, local-fs, BitTorrent, FTP, etc.
Dat features
auto schema generation
free REST API
all APIs are streaming
Dat workshop
maxogden.github.io/get-dat
Dat quick deploy
github.com/bmpvieira/heroku-dat-template
Reproducibility layers
Code
Data
Workflow
Environment
Workflow
Bionode
Open source project for modular and universal bioinformatics.
Started January '14.
bionode.io
Some problems I faced during my research:
Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
For web projects, needed to implement the same functionality on browser and server
Difficulty writing scalable, reproducible and complex bioinformatic pipelines
Bionode also collaborates with BioJS
Bionode
npm install -g bionode
bionode ncbi download gff bacteria
bionode ncbi download sra arthropoda | bionode sra fastq-dump

npm install -g bionode-ncbi
bionode-ncbi search assembly formicidae | dat import --json
Bionode - list of modules

Name          Type         Status
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production
ensembl       Data access  production
blast-parser  Parser       production
Bionode - list of modules

Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production
Bionode - list of modules

Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development
Bionode - list of modules

Name      Type         Status
ebi       Data access  request
semantic  Data access  request
vcf       Parser       request
gff       Parser       request
bowtie    Wrappers     request
sge       Wrappers     request
blast     Wrappers     request
Bionode - list of modules

Name     Type
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers
Bionode - Why wrappers?
Same interface between modules (Streams and NDJSON)
Easy installation with NPM
Semantic versioning
Add tests
Abstract complexity / more user friendly
Bionode - Why Node.js?
Same code client/server side
Need to reimplement the same code on browser and server.
Solution: JavaScript everywhere

Afra -> bionode-seq
GeneValidator -> seq, fasta
SequenceServer
BioJS: collaborating for code reuse
Biodalliance: converting to bionode
Bionode - Why Node.js?
Reusable, small and tested modules
Benefit from other JS projects: Dat, BioJS, NoFlo
Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
Python example: URL for the Acromyrmex assembly?
Solution:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG
import xml.etree.ElementTree as ET
from Bio import Entrez
Entrez.email = "mail@bmpvieira.com"
esearch_handle = Entrez.esearch(db="assembly", term="Acromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text
bionode-ncbi
Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
Example: URL for the Acromyrmex assembly?
JavaScript
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function (urls) {
  console.log(urls[0].genomic.fna)
})
bio.ncbi.urls('assembly', 'Acromyrmex')
  .on('data', printGenomeURL)

function printGenomeURL (urls) {
  console.log(urls[0].genomic.fna)
}
Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
Example: URL for the Acromyrmex assembly?
JavaScript
BASH
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)
bionode-ncbi urls assembly Acromyrmex | tool-stream extractProperty genomic.fna
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()

ncbi.search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)

fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)

fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
  .pipe(ncbi.search('pubmed'))
  .pipe(fork2)
  .pipe(dat.papers)
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome "Guillardia theta" |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam
Difficulty writing scalable, reproducible andcomplex bioinformatic pipelines.
bionode-example-dat-gasket
get-dat workshop
get-dat bionode gasket example
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

{
  "import-data": [
    "bionode-ncbi search genome eukaryota",
    "dat import --json --primary=uid"
  ],
  "search-ncbi": [
    "dat cat",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],
  "index-and-align": [
    "cat metadata.json",
    "bionode-sra fastq-dump -",
    "tool-stream extractProperty destFile",
    "bionode-bwa mem **/*fna.gz"
  ],
  "convert-to-bam": [
    "bionode-sam 35526/SRR070675.sam"
  ]
}
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

datscript:
pipeline main
  run pipeline import

pipeline import
  run foobar | run dat import --json
bmpvieira example
ekg example
Reproducibility layers
Code
Data
Workflow
Environment
Environment
Docker for reproducible science
docker run bmpvieira/thesis
Bionode - Modular and universal bioinformatics
Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.

Dat - Build data pipelines
Provides a streaming interface between every file format and data storage backend. "git for data"
bionode.io
#bionode
gitter.im/bionode/bionode

dat-data.com
#dat
gitter.im/datproject/discussions
Acknowledgements
@yannick__
@maxogden
@mafintosh
@erikgarrison
@QM_SBCS
@opendata
Bionode contributors
Thanks!
"Science should work as anOpen Source project"
dat-data.com
bionode.io