Peter Li: GigaDB and Galaxy - revolutionizing data dissemination, organization and analysis

Post on 28-Jan-2015

Peter Li's talk on GigaDB and Galaxy at BGI's 3rd Bioinformatics Software and Data Release Conference at #ICG7 in Hong Kong, 28th November 2012

GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis

Peter Li, GigaScience

peter@gigasciencejournal.com

www.gigasciencejournal.com

Journal and database for large-scale data

Editor-in-Chief: Laurie Goodman
Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Tam Sneddon
Data Platform: Peter Li

in conjunction with

Why another *omics journal?

Already many journals publishing research involving large data sets

Results reproducibility

Unrepeatability of scientific results

Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.

Out of 18 microarray papers, results from 10 could not be reproduced

How are we supporting data reproducibility?

[Diagram: data sets and analyses, each linked to the GigaScience paper]

Community tools for data reproduction and reuse

DOI

Paper DOI

Data set DOI

Linking of papers and data by citation of DOIs
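The linking idea above can be illustrated with a minimal sketch, assuming that both the paper and its data sets carry DOIs and that each DOI is resolved through the public doi.org resolver. The function names and the example DOIs are hypothetical and are not taken from the talk.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI (optionally prefixed with 'doi:') into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the directory indicator "10." and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


def link_paper_to_data(paper_doi: str, dataset_dois: list) -> dict:
    """Record which data set DOIs a paper cites (hypothetical helper)."""
    return {
        "paper": doi_to_url(paper_doi),
        "datasets": [doi_to_url(d) for d in dataset_dois],
    }


# Hypothetical DOIs, used only for illustration.
links = link_paper_to_data("10.1234/example-paper", ["doi:10.1234/example-data"])
print(links)
```

Because the link is a citation of an identifier rather than a file path, the data set can be cited on its own, independently of the paper.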

http://gigadb.org

GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”…

Aspera data transfer

Faster download speeds

BGI Datasets Get DOI®s

Plants
- Chinese cabbage
- Cucumber
- Foxtail millet
- Pigeonpea
- Potato
- Sorghum

Microbe
- E. coli O104:H4 TY-2482
- T2D gut metagenome

Cell lines
- Chinese Hamster Ovary
- Mouse methylomes

Human
- Asian individual (YH): DNA methylome, genome assembly, transcriptome
- Cancer (14TB)
- Single cell bladder cancer
- HBV infected exomes
- Ancient DNA: Saqqaq Eskimo, Aboriginal Australian

Vertebrates
- Darwin’s Finch
- Giant panda
- Macaque: Chinese rhesus, Crab-eating
- Mini-Pig
- Naked mole rat
- Parrot, Puerto Rican
- Penguin: Emperor penguin, Adelie penguin
- Pigeon, domestic
- Polar bear
- Sheep
- Tibetan antelope

Invertebrate
- Ant: Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant
- Roundworm
- Schistosoma
- Silkworm
- Parasitic nematode
- Pacific oyster

Legend: Released pre-publication / Paper published in GigaScience

39 data sets

Currently: 39 public datasets*
10 citations in references*

Humans
- Ancient DNA: Aboriginal Australian, Saqqaq Eskimo
- Asian individual (YH)

What about the analyses?

[Diagram: data sets and analyses, each linked to the GigaScience paper]

How will we make analyses available for downloading and execution?

Bioinformatics data analyses as workflows

Example workflow: Investigate the evolutionary relationships between proteins

[Workflow diagram labels: Protein sequences, Query, Multiple sequence alignment]
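To make the idea of an analysis as a workflow concrete, here is a minimal sketch that treats a workflow as an ordered chain of named steps and records every intermediate result. It is not how Galaxy represents workflows. The two step functions are toy stand-ins (a motif filter and gap padding), not real sequence search or alignment tools, and all names and sequences are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[List[str]], List[str]]]


def run_workflow(steps: List[Step], data: List[str]):
    """Apply each step in order and keep a history of intermediate results."""
    history = [("input", data)]
    for name, func in steps:
        data = func(data)
        history.append((name, data))
    return data, history


def query(seqs: List[str]) -> List[str]:
    # Toy stand-in for a database query: keep sequences containing the motif "MK".
    return [s for s in seqs if "MK" in s]


def pad_align(seqs: List[str]) -> List[str]:
    # Toy stand-in for multiple sequence alignment: right-pad with gap characters.
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, "-") for s in seqs]


steps = [("query", query), ("multiple_sequence_alignment", pad_align)]
result, history = run_workflow(steps, ["MKTAYIAK", "GGSMKV", "PLLQ"])

for name, value in history:
    print(name, value)
```

Keeping the step names and intermediate outputs together is what lets someone else rerun or inspect the analysis step by step.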

Implement GigaScience workflows in a community-accepted format

http://galaxyproject.org

Over 20,000 main Galaxy server users

Over 500 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

[Screenshot labels: tool list, tool parameterisation, results panel]

Pilot project - Integrate BGI SOAP package into Galaxy

Enable SOAP tools to be used from within Galaxy workflows

Data analysis pipelines

SOAP1 SOAP2 SOAPdenovo1 SOAPdenovo2 SOAPsnp SOAPsplice

Integrate BGI SOAP package into Galaxy

Python wrapper (one for each of the six SOAP tools)
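A Python wrapper of this kind is typically a thin script that turns parameters chosen in a web form into a command line and runs the underlying program. The sketch below shows that general pattern only; it is not the GigaScience wrapper code. The executable name `example-tool` and the `--input`/`--output` flags are hypothetical placeholders and do not correspond to real SOAP options.

```python
import subprocess
from typing import List, Optional


def build_command(executable: str, input_path: str, output_path: str,
                  extra_args: Optional[List[str]] = None) -> List[str]:
    """Assemble the argument list for a command-line tool.

    The --input/--output flags are hypothetical; a real wrapper would use
    whatever options the wrapped tool actually defines.
    """
    cmd = [executable, "--input", input_path, "--output", output_path]
    if extra_args:
        cmd.extend(extra_args)
    return cmd


def run_tool(cmd: List[str]) -> str:
    """Run the tool and return its standard output, raising on failure."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        # Surface the tool's own error message so the failure is visible to the user.
        raise RuntimeError(completed.stderr.strip() or "tool failed")
    return completed.stdout


# Build (but do not run) a command for a hypothetical tool.
cmd = build_command("example-tool", "reads.fq", "contigs.fa", ["--threads", "4"])
print(" ".join(cmd))
```

Passing the command as a list rather than a single shell string avoids quoting problems when file names contain spaces.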

GitHub open code repository

https://github.com/gigascience

[Screenshot labels: tool list, tool parameterisation, results panel]

SOAPdenovo2 Galaxy workflow

http://www.myexperiment.org

Why publish in GigaScience?

Benefit
• Data hosted in GigaDB
• Allocation of DOIs to data
• Metadata in ISA-Tab format
• Galaxy tool integration
• Use of tools in Galaxy workflows

Added value
• No need to use own servers
• Citable data
• Aids reuse of data
• Supports reuse of tools
• Improves documentation
• Shows how tool can be used with other bioinf. software

Thanks to:

• Tin-Lap Lee and Huayan Gao - CUHK
• Tam, Jesse, Scott, Nicole & Laurie - GigaScience

peter@gigasciencejournal.com