Graph databases: Tinkerpop and Titan DB
GRAPH DATABASES: THE SOLUTION FOR STORING SEMI-STRUCTURED BIG DATA
Mohamed Taher Alrefaie
DATA IS GETTING BIGGER
“Every two days, we create as much information as we did up to 2003.” Eric Schmidt, former Google CEO, 2010.
DATA IS MORE CONNECTED
A look at the following shows it:
- Facebook Graph
- LinkedIn Graph
- Linked Data
- Blogs/Tagging
DATA IS LESS STRUCTURED
Modelling the FB Graph? Persons, friendships, photos, locations, apps, pages, ads, interests, age range, etc.
NOSQL DATABASES
Four types of databases that alleviate the performance issues of relational databases.
KEY VALUE STORES
Data Model:
- Global key-value mapping
- Big scalable HashMap
- Highly fault tolerant (typically)
Examples: Redis, Riak, Voldemort, Dynamo
KEY VALUE STORES: PROS AND CONS
Pros:
- Simple data model
- Scalable
Cons:
- Create your own “foreign keys”
- Poor for complex data
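The “create your own foreign keys” drawback can be illustrated with a minimal sketch, assuming the store behaves like one big in-memory map. The class name, key layout and helper method are hypothetical and do not belong to Redis, Riak, Voldemort or Dynamo.

```java
import java.util.HashMap;
import java.util.Map;

class KeyValueForeignKeySketch {
    // A key-value store modelled as one big map (hypothetical key layout).
    static final Map<String, String> store = new HashMap<>();

    // Resolve a "foreign key" by hand: read the key stored under the user,
    // then issue a second lookup for the record it points to.
    static String employerNameOf(String userKey) {
        String employerKey = store.get(userKey + ":employer"); // first lookup
        if (employerKey == null) {
            return null;
        }
        return store.get(employerKey + ":name"); // second lookup
    }

    public static void main(String[] args) {
        store.put("user:1:name", "marko");
        store.put("user:1:employer", "company:7");
        store.put("company:7:name", "Acme");
        System.out.println(employerNameOf("user:1"));
    }
}
```

The store itself knows nothing about the link between the two records; the application has to follow it and keep it consistent.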
COLUMN FAMILY
Main idea is based on BigTable: Google’s distributed storage model for structured data.
Data Model:
- A big table, with column families
- Map Reduce for querying/processing
Examples: HBase, HyperTable, Cassandra
COLUMN FAMILY: PROS AND CONS
Pros:
- Supports semi-structured data
- Naturally indexed (columns)
- Scalable
Cons:
- Poor for interconnected data
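The “big table with column families” model can be pictured as a nested sorted map. The sketch below is a simplified in-memory illustration with hypothetical names; it is not how HBase, HyperTable or Cassandra are implemented.

```java
import java.util.Map;
import java.util.TreeMap;

class ColumnFamilySketch {
    // row key -> column family -> column name -> value, each level sorted by key.
    private final TreeMap<String, TreeMap<String, TreeMap<String, String>>> rows = new TreeMap<>();

    void put(String row, String family, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(column, value);
    }

    // Returns the columns of one family for one row (empty if absent).
    Map<String, String> family(String row, String family) {
        TreeMap<String, TreeMap<String, String>> families = rows.get(row);
        if (families == null || !families.containsKey(family)) {
            return new TreeMap<>();
        }
        return families.get(family);
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        // Rows in the same family may carry different columns (semi-structured data).
        table.put("user1", "info", "name", "marko");
        table.put("user1", "info", "age", "29");
        table.put("user2", "info", "lang", "java");
        System.out.println(table.family("user1", "info")); // {age=29, name=marko}
        System.out.println(table.family("user2", "info")); // {lang=java}
    }
}
```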
DOCUMENT DATABASES
Data Model:
- A collection of documents
- A document is a key-value collection
- Index-centric, uses map-reduce extensively
Examples: CouchDB, MongoDB
DOCUMENT DATABASES: PROS AND CONS
Pros:
- Simple, powerful data model
- Scalable
Cons:
- Poor for interconnected data
- Query model limited to keys and indexes
- Map reduce for larger queries
GRAPH DATABASES
Data Model: Nodes and Relationships
Examples: Titan, Neo4j, OrientDB, etc.
GRAPH DATABASES: PROS AND CONS
Pros:
- Powerful data model, as general as RDBMS
- Connected data locally indexed
- Easy to query
Cons:
- Sharding
- Requires different data modelling
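LIVING IN A NOSQL WORLD
[Figure: database families positioned by Complexity and Size. Labels: Key-Value Store, BigTable Clones, Document Databases, Graph Databases, Relational Databases (RDBMS), 90% of Use Cases. The figure also shows the value 9,223,372,036,854,775,807.]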
WHAT IS A GRAPH?
An abstract representation of a set of objects where some pairs are connected by links.
Object (Vertex, Node)
Link (Edge, Arc, Relationship)
WHAT IS A GRAPH DATABASE?
- A database with an explicit graph structure
- Each node knows its adjacent nodes through edges
- As the number of nodes increases, the cost of a local step (or hop) remains the same
- Plus an index for lookups
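The points above can be sketched in a few lines, assuming a purely in-memory model: each node holds direct references to its neighbours, so a hop only touches that node's own edge list, and a separate index is used just to find the starting node. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdjacencyGraphSketch {
    static final class Node {
        final String name;
        final List<Node> out = new ArrayList<>(); // direct references to adjacent nodes
        Node(String name) { this.name = name; }
    }

    // The index is only used to look up a starting node by name.
    private final Map<String, Node> index = new HashMap<>();

    Node addNode(String name) {
        return index.computeIfAbsent(name, Node::new);
    }

    void addEdge(String from, String to) {
        addNode(from).out.add(addNode(to));
    }

    // One hop: walks the node's own edge list and never scans the whole graph,
    // so its cost depends on the node's degree, not on the total node count.
    List<String> neighbours(String name) {
        List<String> result = new ArrayList<>();
        Node start = index.get(name);
        if (start != null) {
            for (Node n : start.out) {
                result.add(n.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencyGraphSketch g = new AdjacencyGraphSketch();
        g.addEdge("marko", "vadas");
        g.addEdge("marko", "josh");
        g.addEdge("josh", "ripple");
        System.out.println(g.neighbours("marko")); // [vadas, josh]
    }
}
```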
APACHE TINKERPOP: A UNIFIED API
Dealing with such complex databases requires a well-implemented API from the vendor. But a vendor-specific API makes migrating to another database hard, if not impossible. The solution is provided by Apache Tinkerpop.
WHAT IS APACHE TINKERPOP?
● A graph processing system
● Currently under Apache incubation (2015)
● Has the Tinkerpop3 Structure API
  ● Graph, Element, Property
● Has the Tinkerpop3 Process API
  ● TraversalSource, GraphComputer
● Gremlin query language
  ● A scripting language for graph traversal and mutation
● REST API
WHY APACHE TINKERPOP?
Tinkerpop is a generic API for graph databases. Think ODBC, JDBC or Hibernate for relational databases.
Integrates with:
- Titan DB
- Neo4j
- OrientDB
- And many more
Uses the Gremlin graph scripting language.
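The JDBC analogy can be illustrated with a small sketch in which application logic depends only on an interface, so the backing implementation can be swapped. The SimpleGraph interface and both implementations are hypothetical stand-ins written for this illustration; they are not the real Tinkerpop interfaces or any vendor's driver.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VendorNeutralApiSketch {
    // Hypothetical vendor-neutral API (a stand-in, not the Tinkerpop Graph interface).
    interface SimpleGraph {
        void addEdge(String from, String to);
        List<String> out(String from);
    }

    // Hypothetical "vendor A": adjacency lists kept in a map.
    static final class MapBackedGraph implements SimpleGraph {
        private final Map<String, List<String>> adjacency = new HashMap<>();
        public void addEdge(String from, String to) {
            adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        }
        public List<String> out(String from) {
            List<String> result = new ArrayList<>();
            List<String> targets = adjacency.get(from);
            if (targets != null) {
                result.addAll(targets);
            }
            return result;
        }
    }

    // Hypothetical "vendor B": a flat list of edges that is scanned on each query.
    static final class EdgeListGraph implements SimpleGraph {
        private final List<String[]> edges = new ArrayList<>();
        public void addEdge(String from, String to) {
            edges.add(new String[] {from, to});
        }
        public List<String> out(String from) {
            List<String> result = new ArrayList<>();
            for (String[] e : edges) {
                if (e[0].equals(from)) {
                    result.add(e[1]);
                }
            }
            return result;
        }
    }

    // Business logic written once against the interface only.
    static List<String> twoHops(SimpleGraph graph, String start) {
        List<String> result = new ArrayList<>();
        for (String middle : graph.out(start)) {
            result.addAll(graph.out(middle));
        }
        return result;
    }

    public static void main(String[] args) {
        SimpleGraph[] backends = { new MapBackedGraph(), new EdgeListGraph() };
        for (SimpleGraph graph : backends) {
            graph.addEdge("marko", "josh");
            graph.addEdge("josh", "ripple");
            System.out.println(twoHops(graph, "marko")); // [ripple] for both backends
        }
    }
}
```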
TITAN DATABASE
Titan is a scalable graph database using Tinkerpop APIs, optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
Supports Apache Spark and Hadoop (implicitly) for map-reduce operations.
Integrates with: Elasticsearch, Solr, Lucene
Uses as backend storage:
- Apache Cassandra
- Apache HBase
- Oracle BerkeleyDB
PUTTING IT ALL TOGETHER
[Figure: architecture diagram. Labels: Apache Tinkerpop API; Gremlin server; Graph traversal; Gremlin client; Monitoring; Titan DB; Storage specific (Cassandra, HBase, BerkeleyDB).]
TITAN: EXAMPLE
Download the Titan server and console here: https://github.com/thinkaurelius/titan/wiki/Downloads
$ cd titan-1.0.0-hadoop1
$ bin/gremlin.sh
gremlin> graph = TitanFactory.open("conf/titan-berkeleyje-es.properties")
gremlin> GraphOfTheGodsFactory.load(graph)
gremlin> g = graph.traversal()
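TINKERPOP: EXAMPLE
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

// Create a new in-memory TinkerGraph.
Graph graph = TinkerGraph.open();
// Create vertices with an explicit id and key/value properties.
// T.id is the id token in Tinkerpop 3 releases; the slide's Element.ID comes from an early milestone build.
Vertex marko = graph.addVertex(T.id, 1, "name", "marko", "age", 29);
Vertex vadas = graph.addVertex(T.id, 2, "name", "vadas", "age", 27);
Vertex lop = graph.addVertex(T.id, 3, "name", "lop", "lang", "java");
Vertex josh = graph.addVertex(T.id, 4, "name", "josh", "age", 32);
Vertex ripple = graph.addVertex(T.id, 5, "name", "ripple", "lang", "java");
Vertex peter = graph.addVertex(T.id, 6, "name", "peter", "age", 35);
// Create edges with a label, an explicit id and a weight property.
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f);
marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f);
marko.addEdge("created", lop, T.id, 9, "weight", 0.4f);
josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f);
josh.addEdge("created", lop, T.id, 11, "weight", 0.4f);
peter.addEdge("created", lop, T.id, 12, "weight", 0.2f);
// Traversal source for Gremlin queries over this graph.
GraphTraversalSource g = graph.traversal();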
TINKERPOP: EXAMPLE (CONT.)
gremlin> g.V().has('name','marko').out('knows').values('name')
==>vadas
==>josh
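To make the three steps of this query concrete, here is a minimal plain-Java sketch of what has, out and values each do to the stream of vertices. It is an illustration under simplified assumptions (lists instead of lazy traversers, a hypothetical Vertex class), not Gremlin's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraversalStepsSketch {
    // Hypothetical vertex: properties plus outgoing edges grouped by label.
    static final class Vertex {
        final Map<String, Object> props = new HashMap<>();
        final Map<String, List<Vertex>> outEdges = new HashMap<>();
        Vertex(String name) { props.put("name", name); }
        void addEdge(String label, Vertex to) {
            outEdges.computeIfAbsent(label, k -> new ArrayList<>()).add(to);
        }
    }

    // has(key, value): keep only the vertices whose property matches.
    static List<Vertex> has(List<Vertex> in, String key, Object value) {
        List<Vertex> result = new ArrayList<>();
        for (Vertex v : in) {
            if (value.equals(v.props.get(key))) {
                result.add(v);
            }
        }
        return result;
    }

    // out(label): move from each vertex to the targets of its outgoing edges with that label.
    static List<Vertex> out(List<Vertex> in, String label) {
        List<Vertex> result = new ArrayList<>();
        for (Vertex v : in) {
            List<Vertex> targets = v.outEdges.get(label);
            if (targets != null) {
                result.addAll(targets);
            }
        }
        return result;
    }

    // values(key): read one property from each remaining vertex.
    static List<Object> values(List<Vertex> in, String key) {
        List<Object> result = new ArrayList<>();
        for (Vertex v : in) {
            result.add(v.props.get(key));
        }
        return result;
    }

    static List<Object> namesKnownByMarko() {
        Vertex marko = new Vertex("marko");
        Vertex vadas = new Vertex("vadas");
        Vertex josh = new Vertex("josh");
        Vertex lop = new Vertex("lop");
        marko.addEdge("knows", vadas);
        marko.addEdge("knows", josh);
        marko.addEdge("created", lop);
        List<Vertex> all = Arrays.asList(marko, vadas, josh, lop); // plays the role of g.V()
        return values(out(has(all, "name", "marko"), "knows"), "name");
    }

    public static void main(String[] args) {
        System.out.println(namesKnownByMarko()); // [vadas, josh]
    }
}
```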
SUMMARY
Graph databases are the solution for highly scalable, semi-structured, connected data.
Apache Tinkerpop is a generic API for graph databases that keeps vendor-specific code out of the business logic.
Titan DB is a scalable distributed graph database on top of several other databases. It uses Cassandra, HBase or BerkeleyDB as its backend storage, so it can scale as far as the chosen backend allows.
MOHAMED TAHER ALREFAIE 07/12/2015