Page 1: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?

by Data Fellas, Data Enthusiasts, v 4.0 (July 13th, ’15)

Scalable and Interoperable data services Applied to Genomics

Page 2

Young Belgian Startup

The Data Fellas Startup

Data Science

Xavier Tordoir @xtordoir

Andy Petrella @noootsab

Data Processing

Scalable Machine Learning

Micro Services oriented

Page 3

Data Fellas Ecosystem

We’ve worked with

Page 5

First: Data Science

Analysis

Spark Notebook

Page 6

First: Data Science

Analysis

Production

Project Generator

Mesos / C* / DCOS

Page 7

First: Data Science

Analysis

Production

Distribution

Micro Service / Binary format

Marathon

Page 8

First: Data Science

Analysis

Production

Distribution

Rendering

Schema for output

GG / D3 …

Page 9

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Service Metadata

SOLR, …

Page 10

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Spark Notebook using Services too

Page 11

First: Data Science

Analysis

Production

Distribution

Rendering

Discovery

Share Analyses

Share Results

Share Datasets

Page 12

First: Data Science

Project Code Name:

Shar3

Page 13

Next: Applied TO Genomics

Genomics data is pretty big

● 100,000’s genomes in 2015
● 1,000,000’s …
● 100,000,000’s …
● …

Page 14

Next: Applied TO Genomics

Genomics data is pretty big and of high dimensionality

One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30-60x coverage for quality
○ 10’s to 100’s of millions of variants (bases that vary from one individual to the next)
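As a rough back-of-envelope check of the figures above, the sketch below multiplies the genome length by the coverage range to estimate how many bases are read per genome. The one-byte-per-base storage figure is an assumption made purely for illustration (it ignores quality scores, read metadata and compression); it is not a number from the slides.

```python
# Back-of-envelope estimate of raw sequencing volume for one genome.
GENOME_BASES = 3_000_000_000   # ~3 billion bases per genome (from the slide)
COVERAGE_RANGE = (30, 60)      # 30-60x coverage (from the slide)
BYTES_PER_BASE = 1             # ASSUMPTION: one byte per base, uncompressed


def sequenced_bases(coverage: int, genome_bases: int = GENOME_BASES) -> int:
    """Total bases read when every position is covered `coverage` times."""
    return genome_bases * coverage


def raw_size_gb(coverage: int, bytes_per_base: int = BYTES_PER_BASE) -> float:
    """Approximate uncompressed size in GB (10**9 bytes) under the assumption above."""
    return sequenced_bases(coverage) * bytes_per_base / 1e9


for cov in COVERAGE_RANGE:
    print(f"{cov}x coverage: {sequenced_bases(cov):,} bases, "
          f"~{raw_size_gb(cov):.0f} GB of bases")
```

Even under this simplified assumption, a single genome lands in the order of a hundred gigabytes of raw bases, which is why the population-scale counts on the previous slide call for a distributed stack.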

Page 15

Next: Applied TO Genomics

e.g. 1000genomes project:

● 200TB compressed data
● organised in files/directories
● data formatted following specs in a … PDF

Data and services schemas are required

Page 16

What do we do with genomics data?

Lots of Querying and Learning:

E.g.

● Population structure is a fundamental basis
● Querying relationships between genomes and other biological features

Hey… no one has all data!

Metadata

Page 17

What do we do with genomics data?

Lots of Querying and Learning:

E.G.

● We do some specific Modelling on some data…

Hey… no two serve the same computations!

Service Discovery
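To make the idea of discovery through service metadata concrete, here is a minimal sketch of an in-memory catalog in which each service advertises the datasets and computations it offers, and a client asks who can serve a given computation. All names here (`ServiceDescriptor`, `Catalog`, the example endpoints, datasets and computation labels) are hypothetical; the slides only mention service metadata and an index such as SOLR, and do not describe the actual Shar3 schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceDescriptor:
    """Hypothetical metadata record that a service publishes about itself."""
    name: str
    endpoint: str
    datasets: frozenset = field(default_factory=frozenset)
    computations: frozenset = field(default_factory=frozenset)


class Catalog:
    """Toy in-memory stand-in for a searchable metadata index."""

    def __init__(self):
        self._services = []

    def register(self, service: ServiceDescriptor) -> None:
        self._services.append(service)

    def find(self, computation=None, dataset=None):
        """Return the services matching every criterion that was given."""
        return [
            s for s in self._services
            if (computation is None or computation in s.computations)
            and (dataset is None or dataset in s.datasets)
        ]


catalog = Catalog()
catalog.register(ServiceDescriptor(
    name="lab-a",
    endpoint="http://lab-a.example/api",  # placeholder URL
    datasets=frozenset({"cohort-1"}),
    computations=frozenset({"population-structure"}),
))
catalog.register(ServiceDescriptor(
    name="lab-b",
    endpoint="http://lab-b.example/api",  # placeholder URL
    datasets=frozenset({"cohort-2"}),
    computations=frozenset({"variant-query"}),
))

matches = catalog.find(computation="population-structure")
print([s.name for s in matches])
```

Because no two providers serve the same computations or hold the same data, the lookup is keyed on metadata instead of on a fixed list of endpoints; a real deployment would replace the Python list with a proper search index.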

Page 18

Interoperability

So, no one has all data … BUT all should be able to talk…

Interoperability (GA4GH)

Page 19

Interoperable…

Analysis

Production

Distribution

Rendering

Discovery

Share Analyses

Share Results

Share Datasets

Page 20

Interoperable & scalable…

GA4GH + Shar3 = Med@Scale

+ ADAM & Spark
+ In-memory optimization (Tachyon)
+ Deployment (e.g. DCOS)

Page 21

Wrap-UP

Follow us @DataFellas and get notified about our

+ sharing platform at scale: Shar3

+ Google Genomics At Home (^.^): Med@Scale

+ future plans: modules for Trading, Geospatial, other medical data, …

Page 22

References

ADAM: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
@Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Training: http://spark4devs.data-fellas.guru/

Page 23

Q/A

THANKS!