Graph databases: TinkerPop and Titan DB

GRAPH DATABASES: THE SOLUTION FOR STORING SEMI-STRUCTURED BIG DATA

Mohamed Taher Alrefaie

DATA IS GETTING BIGGER
“Every two days, we create as much information as we did up to 2003.” Eric Schmidt, former Google CEO, 2010.

DATA IS MORE CONNECTED
A look at the following shows it:
- Facebook Graph
- LinkedIn Graph
- Linked Data
- Blogs/Tagging

DATA IS LESS STRUCTURED
Modelling the FB Graph? Persons, friendships, photos, locations, apps, pages, ads, interests, age range, etc.

NOSQL DATABASES
Four types of databases that alleviate the performance issues of relational databases

KEY VALUE STORES
Data Model:
- Global key-value mapping
- Big scalable HashMap
- Highly fault tolerant (typically)
Examples: Redis, Riak, Voldemort, Dynamo

KEY VALUE STORES: PROS AND CONS
Pros:
- Simple data model
- Scalable
Cons:
- Create your own “foreign keys”
- Poor for complex data
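The “create your own foreign keys” drawback can be sketched in a few lines of Java. This is only an illustration under stated assumptions: a plain HashMap stands in for the store, and the key layout (user:<id>:name, user:<id>:friends) and the friendNames helper are hypothetical, not taken from any of the products named above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyValueLinks {
    /**
     * Resolves a user's friends by hand: the store only maps keys to opaque
     * values, so the application must encode and follow the links itself.
     * The key layout used here is a hypothetical convention.
     */
    static List<String> friendNames(Map<String, String> store, String userId) {
        List<String> names = new ArrayList<>();
        String friendIds = store.get("user:" + userId + ":friends");
        if (friendIds == null || friendIds.isEmpty()) {
            return names; // no link recorded for this user
        }
        for (String id : friendIds.split(",")) {
            // One extra lookup per link: the "foreign key" is just a string.
            String name = store.get("user:" + id + ":name");
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("user:1:name", "marko");
        store.put("user:2:name", "vadas");
        store.put("user:4:name", "josh");
        store.put("user:1:friends", "2,4"); // hand-made "foreign keys"
        System.out.println(friendNames(store, "1")); // prints [vadas, josh]
    }
}
```

Nothing in the store checks that the ids in the friends value exist, which is one reason this model becomes awkward for complex, interconnected data.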

COLUMN FAMILY
Main idea is based on BigTable: Google’s distributed storage model for structured data
Data Model:
- A big table, with column families
- MapReduce for querying/processing
Examples: HBase, Hypertable, Cassandra

COLUMN FAMILY: PROS AND CONS
Pros:
- Supports semi-structured data
- Naturally indexed (columns)
- Scalable
Cons:
- Poor for interconnected data

DOCUMENT DATABASES
Data Model:
- A collection of documents
- A document is a key-value collection
- Index-centric, uses map-reduce extensively
Examples: CouchDB, MongoDB

DOCUMENT DATABASES: PROS AND CONS
Pros:
- Simple, powerful data model
- Scalable
Cons:
- Poor for interconnected data
- Query model limited to keys and indexes
- Map-reduce for larger queries

GRAPH DATABASES
Data Model: Nodes and relationships
Examples: Titan, Neo4j, OrientDB, etc.

GRAPH DATABASES: PROS AND CONS
Pros:
- Powerful data model, as general as RDBMS
- Connected data locally indexed
- Easy to query
Cons:
- Sharding (partitioning a connected graph across machines is difficult)
- Requires different data modelling

LIVING IN A NOSQL WORLD
[Figure with axes Complexity and Size. Labels: key-value store, BigTable clones, document databases, graph databases, relational databases (RDBMS), "90% of use cases", and the value 9,223,372,036,854,775,807.]

WHAT IS A GRAPH?
An abstract representation of a set of objects where some pairs are connected by links.
- Object (vertex, node)
- Link (edge, arc, relationship)

WHAT IS A GRAPH DATABASE?
- A database with an explicit graph structure
- Each node knows its adjacent nodes through edges
- As the number of nodes increases, the cost of a local step (or hop) remains the same
- Plus an index for lookups
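A minimal in-memory sketch of this idea, in plain Java, may help. It assumes nothing about how any real graph database stores data: the AdjacencySketch class and its methods are hypothetical names used only for illustration. Each node holds direct references to its neighbours, so a hop costs the same however many nodes exist, and a separate index is used only to find the starting node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencySketch {
    /** A vertex that holds direct references to its outgoing neighbours. */
    static final class Node {
        final String name;
        final List<Node> out = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Index used only to look up a starting vertex by name. */
    private final Map<String, Node> index = new HashMap<>();

    /** Returns the node with this name, creating it if needed. */
    Node addNode(String name) {
        return index.computeIfAbsent(name, Node::new);
    }

    /** Adds a directed edge; both endpoints are created on demand. */
    void addEdge(String from, String to) {
        addNode(from).out.add(addNode(to));
    }

    /** One index lookup, then one local hop along the node's own edges. */
    List<String> neighbours(String name) {
        List<String> result = new ArrayList<>();
        Node start = index.get(name);
        if (start == null) {
            return result;
        }
        for (Node n : start.out) {
            result.add(n.name);
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencySketch g = new AdjacencySketch();
        g.addEdge("marko", "vadas");
        g.addEdge("marko", "josh");
        g.addEdge("josh", "ripple");
        System.out.println(g.neighbours("marko")); // prints [vadas, josh]
    }
}
```

The hop inside neighbours touches only the starting node's own edge list; the size of the rest of the graph does not enter into it.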

APACHE TINKERPOP: A UNIFIED API
Dealing with such complex databases requires a well-implemented API from the vendor. But using a vendor-specific API makes migrating to another database very difficult. The solution is provided by Apache TinkerPop.

WHAT IS APACHE TINKERPOP?
● A graph processing system
● Currently under Apache incubation (2015)
● Has the TinkerPop3 Structure API
    ● Graph, Element, Property
● Has the TinkerPop3 Process API
    ● TraversalSource, GraphComputer
● Gremlin query language
    ● A scripting language for graph traversal and mutation
● REST API

WHY APACHE TINKERPOP?
- TinkerPop is a generic API for graph databases
- Think ODBC, JDBC or Hibernate for relational databases
- Integrates with:
    - Titan DB
    - Neo4j
    - OrientDB
    - And many more
- Uses the Gremlin graph scripting language
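The JDBC analogy can be sketched as follows. This is not TinkerPop's real interface: GraphStore, InMemoryStore and friendsOf are hypothetical names invented for this illustration. The point is only that application logic written against a shared interface does not change when the implementation behind it is swapped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenericApiSketch {
    /** Hypothetical vendor-neutral contract; NOT TinkerPop's actual API. */
    interface GraphStore {
        void addEdge(String from, String label, String to);

        List<String> out(String from, String label);
    }

    /** One hypothetical "vendor": keeps labelled adjacency lists in memory. */
    static final class InMemoryStore implements GraphStore {
        private final Map<String, List<String>> edges = new HashMap<>();

        @Override
        public void addEdge(String from, String label, String to) {
            edges.computeIfAbsent(from + "|" + label, k -> new ArrayList<>()).add(to);
        }

        @Override
        public List<String> out(String from, String label) {
            List<String> found = edges.getOrDefault(from + "|" + label, Collections.emptyList());
            return new ArrayList<>(found); // defensive copy
        }
    }

    /** Application logic: depends on the interface, not on a vendor class. */
    static List<String> friendsOf(GraphStore store, String person) {
        return store.out(person, "knows");
    }

    public static void main(String[] args) {
        GraphStore store = new InMemoryStore();
        store.addEdge("marko", "knows", "vadas");
        store.addEdge("marko", "knows", "josh");
        store.addEdge("marko", "created", "lop");
        System.out.println(friendsOf(store, "marko")); // prints [vadas, josh]
    }
}
```

A second class implementing GraphStore on top of another storage engine could be passed to friendsOf without touching that method.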

TITAN DATABASE
- Titan is a scalable graph database using TinkerPop APIs, optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
- Supports Apache Spark and Hadoop (implicitly) for map-reduce operations.
- Integrates with: Elasticsearch, Solr, Lucene
- Uses as backend storage: Apache Cassandra, Apache HBase, Oracle BerkeleyDB

PUTTING IT ALL TOGETHER
[Architecture diagram with the labels: Apache TinkerPop API, Gremlin server, graph traversal, Gremlin client, monitoring, Titan DB, storage specific (Cassandra, HBase, BerkeleyDB).]

TITAN: EXAMPLE
Download the Titan server and console here: https://github.com/thinkaurelius/titan/wiki/Downloads

$ cd titan-1.0.0-hadoop1
$ bin/gremlin.sh
gremlin> graph = TitanFactory.open('conf/titan-berkeleyje-es.properties')
gremlin> GraphOfTheGodsFactory.load(graph)
gremlin> g = graph.traversal()


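TINKERPOP: EXAMPLE
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

// (1) Create a new in-memory TinkerGraph.
Graph graph = TinkerGraph.open();
// (2) Create vertices with an explicit id and properties.
// T.id is the id key in TinkerPop 3.0.0 and later (the slide used Element.ID).
Vertex marko = graph.addVertex(T.id, 1, "name", "marko", "age", 29);
Vertex vadas = graph.addVertex(T.id, 2, "name", "vadas", "age", 27);
Vertex lop = graph.addVertex(T.id, 3, "name", "lop", "lang", "java");
Vertex josh = graph.addVertex(T.id, 4, "name", "josh", "age", 32);
Vertex ripple = graph.addVertex(T.id, 5, "name", "ripple", "lang", "java");
Vertex peter = graph.addVertex(T.id, 6, "name", "peter", "age", 35);
// (3) Create edges with a label, an explicit id and a weight property.
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f);
marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f);
marko.addEdge("created", lop, T.id, 9, "weight", 0.4f);
josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f);
josh.addEdge("created", lop, T.id, 11, "weight", 0.4f);
peter.addEdge("created", lop, T.id, 12, "weight", 0.2f);
// Traversal source for querying the graph.
GraphTraversalSource g = graph.traversal();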

TINKERPOP: EXAMPLE (CONT.)
gremlin> g.V().has('name','marko').out('knows').values('name')
==>vadas
==>josh

SUMMARY
- Graph databases are the solution for highly scalable, semi-structured, connected data.
- Apache TinkerPop is a generic API for graph databases that avoids vendor-specific business logic code.
- Titan DB is a scalable, distributed graph database built on top of several other databases. It uses Cassandra, HBase or BerkeleyDB as its backend storage, so it can scale as far as the chosen backend allows.



MOHAMED TAHER ALREFAIE 07/12/2015
