Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @...

Hadoop and Graph Databases (Neo4j): Winning Combination for Bioinformatics Jonathan Freeman @freethejazz {GraphConnect NYC} {Open Software Integrators} { www.osintegrators.com} {@osintegrators}

description

This talk will describe a prototype application designed to demonstrate the ability to utilize both Hadoop and Neo4j for Big Data analysis.

Transcript of Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @...

Page 1: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Hadoop and Graph Databases (Neo4j): Winning Combination for Bioinformatics

Jonathan Freeman @freethejazz

{GraphConnect NYC}

{Open Software Integrators} { www.osintegrators.com} {@osintegrators}

Page 2: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Open Software Integrators

● Founded January 2008 by Andrew C. Oliver
○ Durham, NC
● Revenue and staff have at least doubled every year since 2009.
● New office (2012) in Chicago, IL
○ We're hiring associate to senior level as well as UI Developers (jQuery, JavaScript, HTML, CSS)
○ Up to 50% travel (probably less), salary + bonus, 401k, health, etc.
○ Preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
○ Nice to have: Hadoop, Neo4j, MongoDB, Ruby a/o at least one Cloud platform

Hadoop + Neo4j = Bioanalytics Win

Jonathan Freeman @freethejazz

Page 3: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Questions to answer

● uhh, bioinformatics?
● What is Hadoop? Why is it a good fit?
● And Neo4j? Why the combination?
● I want this now! How do I do it?!?!

Page 4: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Bioinformatics

Page 5: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

dynamic information processing system

Page 6: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Life

http://www.labtimes.org/labtimes/issues/lt2011/lt07/lt_2011_07_26_29.pdf

Page 7: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

● Storing/Retrieving Biological Data
● Organizing Biological Data
● Analyzing Biological Data

Page 8: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Biological Data

● amino acid sequences
● nucleotide sequences
● protein structures

Page 9: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

● Genetic sequence analysis
● Tracing biological evolution
● Analysis of gene expression
● Studying mutations in cancer
● Predicting protein structure and function
● Molecular Interaction

Page 10: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

● Genetic sequence analysis
● Tracing biological evolution
● Analysis of gene expression
● Studying mutations in cancer
● Predicting protein structure and function
● Molecular Interaction

Page 11: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Full Human Genome Sequencing Then

13 Years $2,700,000,000

Page 12: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Full Human Genome Sequencing Now

1 Day $5,000

Page 13: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

http://www.genome.gov/images/content/cost_per_genome_apr.jpg

Page 14: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

So what are we waiting for?

Page 15: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 16: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

well, the thing about that…

Page 17: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 18: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 19: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

...ATTCCAGGAGTATTGACACCAT...

Page 20: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 21: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 22: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 23: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

AGGATTACCAGGA CAAAGGATT TTACCAGGATACCAG TGACAA AAGGATTAC GATACCAGTA CAAGGATTGTGACAA

Page 24: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Hadoop

Page 25: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Infrastructure for distributed computing

HDFS

A distributed file system.

MapReduce

An implementation of a programming model for processing very large data sets.
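
A minimal sketch of the MapReduce programming model, written as plain single-process Python rather than against Hadoop's own API. It counts k-mers (length-k substrings) across a few short reads. The function names, the toy reads, and the choice of k-mer counting as the task are all assumptions made for illustration; the talk does not say which jobs the prototype ran.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every length-k window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def kmer_counts(reads, k=3):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


# Hypothetical toy reads, not taken from the talk.
reads = ["AGGATT", "GGATTA"]
counts = kmer_counts(reads)
print(counts)
```

On a real cluster the map calls would run in parallel on the nodes that hold each block of input in HDFS, and the grouping step would move data across the network; here everything runs in one process so the three phases are easy to follow.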

Page 26: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 27: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 28: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 29: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Infrastructure for distributed computing

HDFS

A distributed file system.

MapReduce

An implementation of a programming model for processing very large data sets.

Page 30: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

AGGATTACCAGGA CAAAGGATT TTACCAGGATACCAG TGACAA AAGGATTAC GATACCAGTA CAAGGATTGTGACAA

Page 31: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

...ATTCCAGGAGTATTGACACCAT...
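
A toy sketch of the underlying task, assuming the point of these slides is read mapping: locating where each short read occurs in a longer reference sequence. This is naive exact substring search, not the algorithm real aligners use, and the reads below are hypothetical (chosen so that some of them match). Only the reference fragment comes from the slide.

```python
def find_all(reference, read):
    """Return every 0-based offset at which `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


def map_reads(reference, reads):
    """Map each read to the list of offsets where it matches the reference."""
    return {read: find_all(reference, read) for read in reads}


reference = "ATTCCAGGAGTATTGACACCAT"   # fragment shown on the slide
reads = ["CCAGG", "ATT", "GGGG"]       # hypothetical reads for illustration
alignments = map_reads(reference, reads)
print(alignments)
```

Each read is handled independently of the others, which is what makes this kind of work a natural fit for splitting across many machines.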

Page 32: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

1000 CPU hours

Page 33: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

3 hours

$85

OSS

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

Page 34: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

And Neo4j?

Page 35: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Page 36: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

MATCH (snp)<-[:INFLUENCED_BY]-(conditions)
WHERE snp.id = "rs1234"
RETURN conditions;

Page 37: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

MATCH (p)-[:GENOME_CONTAINS]->(snp),
      (snp)<-[:INFLUENCED_BY]-(conditions)
WHERE p.name = "Jonathan Freeman"
RETURN conditions;

Page 38: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

MATCH (p)-[:GENOME_CONTAINS]->(snp),
      (snp)<-[:INFLUENCED_BY]-(conditions)
WHERE conditions.name = "Parkinsons"
RETURN p;
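
The traversals in the three queries above can be mirrored with a small in-memory model, sketched here in plain Python with no Neo4j driver involved. The dictionaries stand in for the GENOME_CONTAINS and INFLUENCED_BY relationships. Every person, SNP id, and condition association below is invented sample data and implies nothing about real genetics, and the function names are hypothetical.

```python
# Hypothetical sample data: person -> SNP ids found in their genome.
genome_contains = {
    "Jonathan Freeman": {"rs1234", "rs42"},
    "Example Person": {"rs42"},
}

# Hypothetical sample data: condition -> SNP ids it is influenced by.
influenced_by = {
    "Parkinsons": {"rs1234"},
    "Condition B": {"rs42", "rs7"},
}


def conditions_for_snp(snp_id):
    """First query: conditions influenced by one SNP."""
    return sorted(c for c, snps in influenced_by.items() if snp_id in snps)


def conditions_for_person(name):
    """Second query: conditions influenced by any SNP in a person's genome."""
    person_snps = genome_contains.get(name, set())
    return sorted(c for c, snps in influenced_by.items() if snps & person_snps)


def people_for_condition(condition):
    """Third query: people whose genome contains a SNP that influences a condition."""
    condition_snps = influenced_by.get(condition, set())
    return sorted(p for p, snps in genome_contains.items() if snps & condition_snps)


print(conditions_for_snp("rs1234"))
print(conditions_for_person("Jonathan Freeman"))
print(people_for_condition("Parkinsons"))
```

The last two functions are the same two-hop traversal started from opposite ends. In a graph database that reversal costs nothing to express, as the near-identical second and third queries show.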

Page 39: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

How can I haz?!?!?!1

Page 40: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Step 1: Get local copies

● Hadoop: http://hadoop.apache.org/releases.html#Download
● Neo4j: http://www.neo4j.org/download
● Batch Importer: https://github.com/jexp/batch-import

Page 41: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Step 2: Familiarize yourself with the languages

● MapReduce: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
● Pig: http://pig.apache.org/docs/r0.12.0/start.html
● Hive: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Page 42: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Step 3: Find a dataset

● Typical starter data: http://www.gutenberg.org/
● Amazon's public data sets: http://aws.amazon.com/publicdatasets/

Page 43: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Step 4: Start Playing!!!

Page 44: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Step 5: Take Hadoop to the cloud

● http://aws.amazon.com/elasticmapreduce/

Page 45: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Doing this in production?

http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/

http://blog.xebia.com/2013/01/17/combining-neo4j-and-hadoop-part-ii/

Page 46: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Thank You

@freethejazz

Page 47: Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

Image Attribution:

Sand Timer: http://bit.ly/HyCAgy

Money: http://bit.ly/1e4lhS6

Scraggly DNA drawings: Jonathan Freeman :)