Building collaborative workflows for scientific data
Transcript of Building collaborative workflows for scientific data
Building collaborative workflows for scientific data
bmpvieira.com/orcambridge14

Bruno Vieira @bmpvieira
PhD Student in Bioinformatics and Population Genomics
Supervisor: Yannick Wurm @yannick__

© 2014 Bruno Vieira CC-BY 4.0
Sequencing cost drops
Sequencing data rises
Goodbye Excel/Windows
Hello command line
Hello super computers
Programming
Programming
Programming
Reproducibility crisis
Reproducibility layers
Code, Data, Workflow, Environment
Reproducibility layers
Code, Data, Workflow, Environment
Data
Dat
Open source tool for sharing and collaborating on data.
Started August '13; grant funded and 100% open source.
dat-data.com
#dat (public) on freenode
gitter.im/datproject/discussions
Dat Community Call #1
Dat - "git for data"
npm install -g dat
dat init
collect-data | dat import
dat listen
Dat
dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform
http://eukaryota.dathub.org
Dat
Planned:
dat checkout revision
dat diff
dat branch
multi-master replication
sync to databases
registry
Data stored locally in LevelDB, but can use other backends such as Postgres, Redis, etc.
Files stored in blob-stores: s3, local-fs, bittorrent, ftp, etc.
Dat features
auto schema generation
free REST API
all APIs are streaming
Dat workshop
maxogden.github.io/get-dat
Dat quick deploy
github.com/bmpvieira/heroku-dat-template
Reproducibility layers
Code, Data, Workflow, Environment
Workflow
Bionode
Open source project for modular and universal bioinformatics.
Started January '14.
bionode.io
Some problems I faced
during my research:
- Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
- For web projects, needed to implement the same functionality on browser and server
- Difficulty writing scalable, reproducible and complex bioinformatic pipelines
Bionode also collaborates with BioJS
Bionode
npm install -g bionode
bionode ncbi download gff bacteria
bionode ncbi download sra arthropoda | bionode sra fastq-dump

npm install -g bionode-ncbi
bionode-ncbi search assembly formicidae | dat import --json
Bionode - list of modules
Name          Type         Status
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production
ensembl       Data access  production
blast-parser  Parser       production
Bionode - list of modules
Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production
Bionode - list of modules
Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development
Bionode - list of modules
Name      Type         Status
ebi       Data access  request
semantic  Data access  request
vcf       Parser       request
gff       Parser       request
bowtie    Wrappers     request
sge       Wrappers     request
blast     Wrappers     request
Bionode - list of modules
Name     Type
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers
Bionode - Why wrappers?
- Same interface between modules (Streams and NDJSON)
- Easy installation with NPM
- Semantic versioning
- Add tests
- Abstract complexity / more user friendly
Bionode - Why Node.js?
Same code client/server side
Need to reimplement the same code on browser and server.
Solution: JavaScript everywhere
Afra -> bionode-seq
GeneValidator -> seq, fasta
SequenceServer
BioJS collaborating for code reuse
Biodalliance converting to bionode
Bionode - Why Node.js?
Reusable, small and tested modules
Benefit from other JS projects:
Dat, BioJS, NoFlo
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Python example: URL for the Achromyrmex assembly?
Solution:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG
import xml.etree.ElementTree as ET
from Bio import Entrez

Entrez.email = "[email protected]"
esearch_handle = Entrez.esearch(db="assembly", term="Achromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    # wrap the XML fragment in a root element so it parses as one document
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text
bionode-ncbi
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Example: URL for the Achromyrmex assembly?
JavaScript
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function(urls) {
  console.log(urls[0].genomic.fna)
})
bio.ncbi.urls('assembly', 'Acromyrmex')
  .on('data', printGenomeURL)

function printGenomeURL(urls) {
  console.log(urls[0].genomic.fna)
}
Difficulty getting relevant description and datasets from NCBI API using bio* libs.
Example: URL for the Achromyrmex assembly?
JavaScript
BASH
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)
bionode-ncbi urls assembly Acromyrmex | tool-stream extractProperty genomic.fna
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

ncbi.search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)
fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)
fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
  .pipe(ncbi.search('pubmed'))
  .pipe(fork2)
  .pipe(dat.papers)
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam
Difficulty writing scalable, reproducible andcomplex bioinformatic pipelines.
bionode-example-dat-gasket
get-dat workshop
get-dat bionode gasket example
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

{
  "import-data": [
    "bionode-ncbi search genome eukaryota",
    "dat import --json --primary=uid"
  ],
  "search-ncbi": [
    "dat cat",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],
  "index-and-align": [
    "cat metadata.json",
    "bionode-sra fastq-dump -",
    "tool-stream extractProperty destFile",
    "bionode-bwa mem **/*fna.gz"
  ],
  "convert-to-bam": [
    "bionode-sam 35526/SRR070675.sam"
  ]
}
Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

datscript:

pipeline main
  run pipeline import

pipeline import
  run foobar | run dat import --json
bmpvieira example
ekg example
Reproducibility layers
Code, Data, Workflow, Environment
Environment
Docker for reproducible science

docker run bmpvieira/thesis
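The `docker run` line above works because the image bundles the code together with its whole environment. A hypothetical Dockerfile for such an analysis image might look like the sketch below; every base image, package, path, and script name here is an assumption for illustration, not the contents of the actual bmpvieira/thesis image.

```dockerfile
# Hypothetical reproducible-analysis environment
# (illustrative sketch; not the actual bmpvieira/thesis image).
FROM node:0.10

# Pin the exact tools the pipeline needs.
RUN npm install -g dat bionode-ncbi

# Copy the analysis code into the image.
COPY . /analysis
WORKDIR /analysis

# Running the container reruns the whole analysis (hypothetical script).
CMD ["bash", "run-analysis.sh"]
```

Anyone with Docker can then rerun the analysis with a single `docker run`, without installing any of the tools on the host.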
Bionode - Modular and universal bioinformatics
Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.

Dat - Build data pipelines
Provides a streaming interface between every file format and data storage backend. "git for data"
Bionode.io
#bionode
gitter.im/bionode/bionode
Dat-data.com
#dat
gitter.im/datproject/discussions
Acknowledgements
@yannick__ @maxogden @mafintosh @erikgarrison @QM_SBCS @opendata
Bionode contributors
Thanks!
"Science should work as anOpen Source project"
dat-data.com
bionode.io