Graph DB + Bioinformatics: Bio4j, recent applications and future directions

Graph DB + Bioinformatics: Bio4j, recent applications and future

directions

www.ohnosequences.com www.bio4j.com

http://www.bio4j.com/

But who‟s this guy talking here?

I am Currently working as a Bioinformatics consultant/developer/researcher at Oh no sequences! and I have been here at the Ohio State University working as a Visiting Scholar during these last two months.


Oh no what !?

We are the R&D group at Era7 Bioinformatics.we like bioinformatics, cloud computing, NGS, category theory, bacterial genomics…well, lots of things.

What about Era7 Bioinformatics?

Era7 Bioinformatics is a Bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation.Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis.


http://www.ohnosequences.com/

http://www.era7bioinformatics.com/en/index.html

We‟re a small but quite peculiar company! (in the good sense of course )Currently we have offices in:


Boston MA (USA)

Madrid (Spain)

Granada (Spain)

Yeah, I know what you‟re thinking, they are not precisely ugly cities…



Our team is multidisciplinary: bioinformaticians, mathematicians, lab researchers, immunologists, biologists specialized in biochemistry and IT professionals.

A team formed by people with different backgrounds is able to analyze the same problem from different point of views.

We are based in Research

In a fast changing area, our activity is based in being able to offer cutting edge solutions. This is only possible maintaining a continuous research and innovation activity.In addition, since many of our customers are researchers, being part of that community allow us to be really customer oriented.



Everything we do is 100% Open source !

Yes, we hate patents. And no, we‟re not crazy (or maybe just a bit…)

Ok that‟s really nice, but how can that actually work??

• Free marketing and dissemination• We can use other bioinformatics open source tools/DBs/etc…• Faster adaptation to a fast changing field (bioinformatics, genomics)• You may not earn a lot of money but you earn money enough doing many

creative things



Money? Where from ??

• Providing services• Adapting services to different infrastructures and frameworks…

OK, but you could probably get much more money with a different business model…

Yeah, but this is our philosophy!



We are also based on Cloud Computing

Cloud Computing has revolutionized the world of computing because in this paradigm you get the infrastructure as a service (IaaS). We are expert in the use of the leaders of this world: Amazon Web Services (AWS).

So, what do we get?

a) No investment in infrastructure. Pay per use.

b) Scalability: For example we can launch just one virtual server for two hours or more than one hundred during ten hours depending on the amount of data that should be analyzed in different projects.


What‟s Bio4j?

Bio4j is a bioinformatics graph based DB including most data available in :

Uniprot KB(SwissProt + Trembl)

Gene Ontology (GO)

UniRef (50,90,100)

NCBI Taxonomy

RefSeq

Enzyme DB



http://www.uniprot.org/

http://www.geneontology.org/

http://www.ebi.ac.uk/uniref/

http://www.ncbi.nlm.nih.gov/Taxonomy/

http://www.ncbi.nlm.nih.gov/RefSeq/

http://enzyme.expasy.org/

It provides a completely new and powerful framework

for protein related information querying and

management.

Since it relies on a high-performance graph engine, data

is stored in a way that semantically represents its own

structure


What‟s Bio4j?


Bio4j uses Neo4j technology, a "high-performance graph

engine with all the features of a mature and robust

database".

Thanks to both being based on Neo4j DB and the API

provided, Bio4j is also very scalable, allowing anyone

to easily incorporate his own data making the best

out of it.


What‟s Bio4j?


http://www.neo4j.org/

Everything in Bio4j is open source !

released under AGPLv3


What‟s Bio4j?


http://www.gnu.org/licenses/agpl.html

Bioinformatics DBs and Graphs

Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Highly interconnected overlapping knowledge spread throughout different DBs


http://www.uniprot.org/

http://www.geneontology.org/

http://www.ncbi.nlm.nih.gov/RefSeq/

http://www.ebi.ac.uk/intact/

http://www.ebi.ac.uk/embl/


However all this data is in most cases modeled in relational databases.Sometimes even just as plain CSV files

As the amount and diversity of data grows, domain models become crazily complicated!



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


http://www.geneontology.org/images/diag-godb-er.jpg

With a relational paradigm, the double implication

Entity Table

does not go both ways.

You get „auxiliary‟ tables that have no relationship with the smallpiece of reality you are modeling.

You need ‘artificial’ IDs only for connecting entities, (and these are mixed with IDs that somehow live in reality)

Entity-relationship models are cool but in the end you always have to deal with ‘raw’ tables plus SQL.

Integrating/incorporating new knowledge into already existing databases is hard and sometimes even not possible without changing the domain model



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


Life in general and biology in particular are probably not 100% like a graph…

but one thing‟s sure, they are not a set of tables!



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

http://blog.ohnosequences.com/2011/03/playing-with-gephi-bio4j-and-go/

http://blog.ohnosequences.com/2011/10/playing-with-microsatellites-simple-sequence-repeats-java-and-neo4j/

http://blog.bio4j.com/2011/04/go-annotation-graph-visualizations-with-bio4j-go-tools-gephi-toolkit-sigma-project/



NoSQL (not only SQL)

“NoSQL is a broad class of database management systems

that differ from the classic model of the relational database

management system (RDBMS) in some significant ways.

These data stores may not require fixed table schemas,

usually avoid join operations and typically scale

horizontally.”

NoSQ… what !??

Let‟s see what Wikipedia says…


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud



NoSQL data modelsBioinformatics DBs and Graphs

Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud



Neo4j is a high-performance, NOSQL graph database with all the features of a mature and robust database.

The programmer works with an object-oriented, flexible network structure rather than with strict and static tables

All the benefits of a fully transactional, enterprise-strengthdatabase.

For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


http://www.neo4j.com/


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


Ok, but why starting all this? Were you so bored…?!

It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations.

At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:

I got crazy „deciphering‟ how to extract Uniprot protein annotations from GO official tables schema

Uniprot and GO official protein annotations were not always consistent

Populating my own DB took really long due to all the joins and subqueries needed in order to get and store the protein annotations.

Soon enough we also had the need of having massive access to basic protein information.



These processes had to be automated for our (specifically

designed for NGS data) bacterial genome annotation system

BG7

Uniprot web services available were too limited:

- Slow

- Number of queries limitation

- Too little information available

So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl)

and started to have some fun with it !


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


http://bg7.ohnosequences.com/

1• Selection of the specific reference protein set

2• Prediction of possible genes by BLAST similarity

3• Gene definition: merging compatible similarity regions, detecting start and stop

4• Solving overlapped predicted genes

5• RNA prediction by BLAST similarity

6• Final annotation and complete deliverables. Quality control.

www.era7bioinformatics.com

BG7 algorithm


We got used to having massive direct access to all this proteinrelated information…

So why not adding other resources we needed quite often in most projects and which now were becoming a sort of bottleneck compared to all those already included in Bio4j ?

Then came:

- Isoform sequences

- Protein interactions and features

- Uniref 50, 90, and 100

- RefSeq

- NCBI Taxonomy


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

- Enzyme Expasy DB



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


Let‟s dig a bit about Bio4j structure:

Data sources and their relationships:



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


Bio4j domain model


https://github.com/bio4j/Bio4j/raw/develop/Bio4jDomainModelWithCardinality.jpg


The Graph DB model: representation

Core abstractions:

Properties on both

Relationships between nodes

Nodes


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud



Let‟s dig a bit about Bio4j structure:

How are things modeled?

Couldn‟t be simpler!

Entities

Nodes

Associations / Relationships

Edges


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud



Some examples of nodes would be:

Protein

GO term

Genome Element

and relationships:

Protein

GO term

PROTEIN_GO_ANNOTATION


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud



We have developed a tool aimed to be used both as a reference manual and initial contact for Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

• Navigate through all nodes and relationships

• Access the javadocs of any node or relationship

• Graphically explore the neighborhood of a node/relationship

• Look up for the indexes that may serve as an entry point for a node

• Check incoming/outgoing relationships of a specific node

• Check start/end nodes of a specific relationship


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html


Entry points and indexing

There are two kinds of entry points for the graph:

Auxiliary relationships going from the reference node, e.g.

Node indexing

- MAIN_DATASET: leads to both main datasets: Swiss-Prot and Trembl

There are two types of node indexes:

- CELLULAR_COMPONENT: leads to the root of GO cellular component sub-ontology

- Exact: Only exact values are considered hits

- Fulltext: Regular expressions can be used


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud


https://github.com/pablopareja/Bio4j/wiki/Auxiliary-relationships

https://github.com/pablopareja/Bio4j/wiki/Node indexing


Querying Bio4j with Cypher

START k=node:keyword_id_index(keyword_id_index = "KW-0181")

return k.name, k.id


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Getting a keyword by its ID

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")

MATCH d <-[r:PROTEIN_DATASET]- p,

circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -

[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -

[:PROTEIN_PROTEIN_INTERACTION]-> (p)

return p.accession, p2.accession, p3.accession

Finding circuits/simple cycles of length 3 where at least one protein is from Swiss-Prot dataset:

Check this blog post for more info and our Bio4j Cypher cheetsheet


http://docs.neo4j.org/chunked/1.5/cypher-query-lang.html

http://blog.bio4j.com/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks/

https://github.com/bio4j/Bio4j/wiki/Bio4j-cypher-cheat-sheet


gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name ==> Aspartate aminotransferase, mitochondrial


Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Get protein by its accession number and return its full name

Get proteins (accessions) associated to an interpro motif (limited to 4 results)

gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]==> E2GK26 ==> G3PMS4 ==> G3Q865==> G3PIL8

Check our Bio4j Gremlin cheetsheet

A graph traversal language


https://github.com/tinkerpop/gremlin/wiki

https://github.com/bio4j/Bio4j/wiki/Bio4j-gremlin-cheat-sheet



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Visualizations (1) REST Server Data Browser

Navigate through Bio4j data in real time !


http://blog.bio4j.com/wp-content/uploads/2012/01/Bio4jDataBrowser.png



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Visualizations (2) Bio4j + Gephi

Get really cool graph visualizations using Bio4j and Gephi visualization and exploration platform


http://blog.ohnosequences.com/2011/12/gene-ontology-go-graph-visualizations-for-ehec-automatic-annotation-of-bgi-v4-assembly/

http://gephi.org/



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Visualizations (3) Bio4j GO Tools


http://gotools.bio4j.com:8080/Bio4jTestServer/Bio4jGoToolsWeb.html



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Why would I use Bio4j ?

Massive access to protein/genome/taxonomy… related information

Networks analysis

Integration of your own DBs/resources around common information

Development of services tailored to your needs built around Bio4j

Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)

Visualizations




Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Bio4j + Cloud (1)

Interoperability and data distribution

We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving us the following benefits:

Releases are available as public EBS Snapshots, giving AWS users the opportunity of creating and attaching to their instances Bio4j DB 100% ready volumes in just a few seconds.

CloudFormation templates:

- Basic Bio4j DB Instance

- Bio4j REST Server Instance


http://www.aws.com/

http://aws.amazon.com/ebs/

http://aws.amazon.com/cloudformation/

http://blog.bio4j.com/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/

https://github.com/bio4j/Bio4j/wiki/Getting-started



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Bio4j + Cloud (2)

Backup and Storage using S3 (Simple Storage Service)

We use S3 both for backup (indirectly through the EBS snapshots) and storage (directly storing RefSeq sequences as independent S3 files)

What kind of benefits do we get from this?

• Easy to use

• Flexible

• Cost-Effective

• Reliable

• Scalable and high-performance

• Secure


http://aws.amazon.com/s3/



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Bio4j + Cloud (3)

Web servers and service providers in the cloud

Deploying your own web server in AWS using Bio4j as back-end is really simple.

A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.


https://github.com/pablopareja/Bio4jTestServer/wiki



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Community

Bio4j has a fast growing internet presence:

- Twitter: check @bio4j for updates

- Blog: go to http://blog.bio4j.com

- Mail-list: ask any question you may have in our list.

- LinkedIn: check the Bio4j group

- Github issues: don‟t be shy! open a new issue if you thinksomething‟s going wrong.


http://www.twitter.com/bio4j

http://blog.bio4j.com/



Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

And the best part of all this is:

You have the latest version of Bio4j already imported and fully working in EgStation! ;)



Bio4j + MG7 for the integration and analysis of Chip-seq data



Some numbers:

• 157 639 502 nodes

• 742 615 705 relationships

• 632 832 045 properties

• 148 relationship types

• 44 node types

Bio4j + MG7 + 24 Chip-Seq samples

And it works just fine!



MG7 domain model


https://github.com/pablopareja/MG7/raw/master/MG7Neo4jModelWithCardinality.jpg


What’s MG7?

MG7 is a new system for massive analysis of sequences from metagenomics samples specially designed for next generation sequencing technologies.

MG7 uses cloud computing to solve the problem of massive data analysis providing scalable, real time, on demand computing for metagenomics data analysis.

MG7 is able to obtain annotation and functional profiles for shot gun genomic sequences and taxonomic assignation for any type of read.

The inference of function and the assignation of taxonomical origin for each sequence are based on massive BLAST similarity analysis.



What’s MG7?

MG7 provides the possibility of choosing different parameters to fix thethresholds for filtering the BLAST hits:

i. E-valueii. Identity and query coverage

It allows exporting the results of the analysis to different data formats like:• XML• CSV • Gexf (Graph exchange XML format)

As well as provides to the user with Heat maps and graph visualizations whilst including an user-friendly interface that allows to access to the alignment responsible for each functional or taxonomical read assignation and that displays the frequencies in the taxonomical tree --> MG7Viewer



Heat-map Viz



Graph Viz



MG7 Viewer



Bio4j + GRG

A completely new approach for modeling genomic information and

gene regulatory networks



Bio4j + GRG

Integrating genomic information from organisms such as:

• Zea mays subsp. Mays

• Oryza sativa Japonica Group

• Sorghum bicolor

• Brachypodium distachyon

• Arabidopsis thaliana Columbia

• Arabidopsis lyrata lyrata MN47



Bio4j + GRG domain model


https://github.com/CAPS-ComputerLab/GRG/raw/master/domainModelWithCardinality.jpg


Bio4j + GRG

Get all the advantages of Bio4j and Graph DB while modeling genomic data for grasses, (although it could be also applied to other species/families).

Possibility of integrating data from other projects here at CAPS/EGLab in a common framework.

Data-mining of data that currently is not accessible or simply is not structured enough/in a good way to explore it. Both for external genomic data included in sites like phytozome or coming directly from the experiments/analysis performed in the lab.

Common framework for accessing all this information together with other “Universal” resources such as Uniprot, RefSeq or Gene Ontology.



Bio4j + GRG

Chance for the Lab to enter the Cloud and Graph DB world, being pioneer in providing access to this sort of data to a whole set of possible different users.

Not worrying anymore about possible problems with back-ups, mantaininginfrastructure or things like that…And what‟s more important:

Scalability Being able to adapt to the specific needs of new projects as they go along.



And the best part… Acknowledgments!

Bio4j + MG7 + Chip-Seq results

Bio4j + GRG



The other guys from the basement…

(Andrew)

(Brett)

(Matias)



And of course the rest of the Lab !



That’s it !

Thanks for your time ;)


Graph DB + Bioinformatics: Bio4j, recent applications and future directions

Education

Transcript of Graph DB + Bioinformatics: Bio4j, recent applications and future directions