08 Graph Databases Survey

7

Click here to load reader

description

DBMS

Transcript of 08 Graph Databases Survey

Page 1: 08 Graph Databases Survey

The Current State of Graph Databases

Mike BuerliDepartment of Computer Science

Cal Poly San Luis [email protected]

December 2012

Abstract

Graph Database Models is increasingly a topic of in-terest the database realm. The representation of datain the form of a graph lends itself well to structureddata with a dynamic schema. This paper goes overcurrent applications and implementations of graphdatabases, giving an overview of the different typesavailable and their application. Due wide spread ofgraph algorithms and models, no standard system orquery language has been defined for graph databases.Research and industry adoption will determine thefuture direction of graph databases.

1 Introduction

In the past few years there has been a re-emergenceof interest in storing and managing graph data. Inacademia and research, we see many new attemptsat providing a database model for large graph data,particularly social graphs and the Web graph. While,more and more commercial applications are lookingtowards graph databases for their dynamic schemaand ease of use in storing more complex data. Thispaper will go through many of the current databasemodels giving a comparison of the different designimplementations and tradeoffs.

Historically the birth of graph theory is attributedto the Swiss mathematician Leonhard Euler, whofirst solved the Seven Bridges of Konigsberg prob-lem in 1736. This problem introduced the conceptof representing data in the form of a graph (a set of

vertices or nodes with edges connecting them) anddetermining the traversal of the graph that results inevery edge being crossed only once. Graph theorytranslates to today’s work in computational biologyand social graphs with shortest path queries, clus-tering, community detection, and other graph algo-rithms. The optimization of these queries separatesgraph databases from the rest.

The research of graph databases was popular in theearly 1990s with database models like LDM, GOOD,O2, and GraphDB. However, this interest died offwith the insurgence of XML and the Internet. Notuntil recently have graph databases again become atopic of interest. This re-emergence is due in partto the large amounts of graph data introduced bythe Web. Just this year, the first graph conference -Graph Connect 2012[7] - was held focusing only ongraph databases and the adoption of such models.

The recent trend, following the NOSQL movement,has moved away from relational databases to onesbetter suited for a given application. While muchof this movement is focused around the horizontalscalability of data with column-store and key-value-stores, the graph data model provides a greater levelof data complexity in comparison. Figure 1 showsa categorization of NOSQL data models, comparingdata complexity versus data size. Graph data modelsprovide a higher level of data complexity in return forbeing able to handle less data.

The remainder of this paper is structured as fol-lows. Section 2 will go over related work and pastsurveys of graph database models. Section 3 will

1

Page 2: 08 Graph Databases Survey

cover applications and types of graph data. Section4 goes through a number of graph database models,grouped by type. Last Section 5 gives some compar-isons of usign graph versus relational databases.

Figure 1: NOSQL Data Models

2 Related Work

Past comparisons and research of graph databases,do a great job of analyzing the advantages and dis-advanges. In particular Renzo Angles[19] presents awell rounded survey of graph data models and theirfeatures. The paper has gives multiple comparisonsof graph data models with respect to data storing,data structure, query languages, and integrity con-straints. For a survey of earlier work (pre-NOSQL)in graph databases Angles and Gutierrez[20] presenta survey of graph database models prior to 2002,particularly geographical, spatial and semistructureddatabase models. It is important to notice the shiftof focus between the two papers, with respect todata structure. Older data models focused heavilyon semistructured and XML data in a traditionaldatabase. Current-day data models, in contrast, fo-cus more on providing an object-oriented, or relation-ship oriented, structure where the individual nodes

or relations are first class citizens. There is also atrend of abstraction by database models only provid-ing API’s for operation and manipulation.

As many of the graph databases remain unchangedfrom these surveys, this paper will instead focus moreon the application of each database model and cate-gorize them into different types.

3 Graph Applications

Some argue that most data is inherently a graph, andthat all data can be stored as a graph. Using graphsto store data not only allows for a dynamic schema,but also provides representations of data not previ-ously possible. The ability overlay different graphs(Ex. social, temporal, and spacial) on data extendsthe functionality of querying data. In Managing andMining Graph Data[18] we are introduced to a vari-ety of applications for graph data, focusing on threemajor groups: chemical and biological data, socialnetworks, and the Web. This paper will focus onthese three, with the addition of enterprise data.

3.1 Chemical and Biological

Chemical data is modeled as a graph by assigningatoms as nodes and bonds the edges between them.Biological data is represented the same way, only withamino acids as the nodes and links between them asthe edges. This graph data is important for such op-erations as drug discovery and analysis. The data hasmany repeating node labels, so graph operations arefocused at pattern recognition. Pattern recognition isdone by finding frequent subgraphs of a given graph.Other operations include rank-retrieval and scaffold-hopping which are used to determine chemical sim-ilarity. Modeled with a traditional database model,these operations would take a great deal longer, dueto the recursive nature of traversing a graph.

3.2 Social Networks

Social Networks is a very popular topic not just in so-ciety, but in graph research. Social networks, not onlyintroduce a profound amount of data, but present

2

Page 3: 08 Graph Databases Survey

large graph data problems for the research commu-nity. These graphs, not only store nodes of peoplebut also link nodes of multimedia, relationships, andmessaging. For large social graphs we are most in-terested shortest path queries and clustering. Thesegraph algorithms provide analysis of relations of twonodes and determination of communities or socialnetworks. Currently social networking sites like Face-book do not use a graph databases. Instead they usekey-value stores or column stores Cassandra (a col-umn store similar to BigTable). This is due to thesheer amount of throughput that must be handled.However, smaller scale systems or systems focusedon offline querying like Pregel[25] provide optimizedor distributed methods of analysis that support thistype of data.

3.3 The Web

The Web, in its entirety, is essentially a graph of dataand information linked together. Cudre-Mauroux etal.[23] define the Web in terms of the Linked Datamovement which supports the rapid dissemination oflarge-scale structured data through three principles:i) Unique Resource Identifiers (URIs) establish thecreation of distinct data anywhere on the web. ii)Structured data, usually in the form of Resource De-scription Framework (RDF) triplets, provides a stan-dard structure for data to be linked by. iii) Linksto similar online resources connect the data to formcommunities or clusters of data. This massive graphof data presents applications in web search and datacollection. PageRank, possibly on of the most wellknown secrets of Google. Is a graph algorithm thatanalyzes Web data (pages) to determine a rank forpages by looking at how many pages are linked toit. Other important graph algorithms include web-document clustering and keyword search. Both ofthese algorithms aid in the searching and narrow-ing of data sets. Applications that deal with webdata, if on a smaller scale can efficiently provide on-line querying of this graph-like data. For applicationsthat focus on larger-scale graph data, offline queryingprovides analytics and aggregates of graph data.

3.4 Enterprise Data

Graph databases are not limited to academia or large-data graphs. Enterprise data provides perhaps thelargest uptake of applications for graph databasemodels. Modeling of data as a graph is not limitedto scientific or web data; rather you can model mostanything as a graph. The advantage of using graphsis the ability to represent more complex data mod-els and support a dynamic schema. In particular,graph databases have been successful for companiesthat store hierarchies of product and financial andindustry data[7]. The affordance of modeling datawith relationships, allows for efficient restructuringas well as multiperspective querying. Graph algo-rithms are utilized the most, with applications suchas these where data analytics are a large part of busi-ness. Another possible application worth mentioningis the use of graph databases for bug localization[18].Overall there is a wide range of areas where graphdata models are applicable.

4 Graph Database Models

There is a wide range of graph database models thathave been introduced throughout the past few years.From implementations on top existing non-realtionaldatabase models to graph database models build fromthe ground up, there is no standard graph databasemodel on which graph algorithms are developed[30].Rather, each graph database is optimized for a spe-cific set of task or queries. The problem resides inthe multiple divisions of graph databases. Graphdatabases can focus on graph algorithms like shortestpath queries and subgraph matching which requirethe whole graph to reside in memory and make dis-tributed systems very difficult. On the other side ofthe spectrum, a graph database can focus on handlinglarge graphs by scaling horizontally. This howevermakes many graph algorithms extremely inefficientor even impossible. Graphs can also focus on eitheronline querying where low latency is required, or of-fline querying where larger data is handled. Graphdatabase models are also divided by language, sinceno standard language has been introduced for graph

3

Page 4: 08 Graph Databases Survey

querying. Most graphs implement their own API foroperation and manipulation, in which only certainlanguages are provided the API in addition to theHTTP REST protocol. Some databases, more specif-ically known as RDF databases, support SPARQLquerying which queries triple patterns against thelarge graph of triplets stored in the database.

The following sections organize the different graphdatabase models into categories corresponding to thedata model type.

4.1 Graph Databases

The main-stream graph databases provide an ob-ject model for nodes and relationships. These graphdatabases focus on either RDF triplets, linked data,or relationships for storage. These databases oftenuse direct memory links to adjacent nodes rather thanrequiring joins or keys lookups.

AllegroGraph(2005)[1] is a high-performance, soft-ware oriented database model that came as a precur-sor to the current generation of graph databases. Itis implemented as an RDF databases, and serves as areference implementation for the SPARQL query lan-guage. Implementations of geo-temporal reasoningand social network analysis extend the functionalityof the database as well as a prolog extension. Allegro-Graph also partially enforces ACID while remainingscalable.

DEX(2007)[3, 26] is a very efficient, bitmaps-basedgraph database model written in C++. The focus ofDEX is performance in the management of very largegraphs, and even allows for the integration of variousdata sources. In addition to the large data capacity,DEX has a good integrity model for management ofpersistent and temporary graphs. Operation or corefunctionality like link analysis, social network analy-sis, pattern recognition and keyword search is donethrough their Java API. These core functionalitieslend themselves well to applications like IMDb, onwhich experiments were done[26].

Neo4j(2007)[11] is a disk-based transactional graphdatabase advertised as “The worlds leading graphdatabase”. It works on a network oriented modelwith relations as first class objects. The API is inJava, and supports Java object storage. The system

is very efficient in graph traversals, however currentlyrequires the full dataset on each node (work is beingdone on transparent partitioning). Neo4j also haspartial ACID support and lends itself well to trans-actional enterprise solutions.

HyperGraphDB(2010)[9, 24] is an open-sourcedatabase focused on supporting generalized hyper-graphs. Hypergraphs differ from normal graphs intheir ability for edges to point to other edges. Thisrepresentation is useful in the modeling of graph datafor artificial intelligence, bio-informatics, and otherknowledge representations. Hypergraph supports on-line querying with a Java API.

Sones(2010)[15] is an object-oriented databasewritten in C#. The graph database model providesits own query language based on SQL and supportsa higeher level of abstraction for graph queries. Themodel is based on weighted graphs and also has sup-port for hypergraphs. Sones runs on a distributed filesystem to support scalability.

4.2 Distributed Graph Databases

Distributed Graph databases focus on distributinglarge graphs across a framework. Partitioning graphdata is a non-trivial problem, optimal division ofgraphs requires finding subgraphs of a graph. Formost data, the number of links or relationships istoo large to efficiently compute an optimal partition,therefore most databases use random partitioning.

Horton(2010)[8, 28] is a transactional graph pro-cessing framework created by Microsoft. Hortonmakes use of the Orleans cloud framework in order toquery large distributed graphs. Instead of adoptinga map/reduce architecture, Horton works with a dis-tributed graph, passing a state machine across nodes.This allows for better ad-hoc querying in comparisonto map/reduce systems.

InfiniteGraph(2010)[10] is a distributed-orientedsystem that supports large-scale graphs and efficientgraph analysis. Rather than in-memory graphs, thissystem supports efficient traversal of graphs acrossdistributed data stores. This works by creating afederation of compute nodes operated through theirjava API.

4

Page 5: 08 Graph Databases Survey

4.3 Key-Value Graph Databases

Key-value graph databases simplify the object-related model of graph databases to allow for greaterhorizontal scalability. These models build off, or ontop of, existing key-value stores allowing for greaterscalability and partitioning of graph nodes.

VertexDB(2009)[17] is a key-value disk store thatmakes use of TokyoCabinet. The graph database fo-cuses on a vertex graph with added support for au-tomatic garbage collection.

CloudGraph(2010)[2] is an in-development, fullytransactional graph database written in C#. It takesadvantage of key/value pairs to store data both in-memory and on-disk. CloudGraph has also createdits own graph query language (GQL).

Redis Graph(2010)[14] is an implementation of agraph database in python using redis. Redis is amodern key-value store; the python implementationis minimalistic, creating an API in only forty lines ofcode.

Trinity(2011)[16, 29] is a RAM-based key valuestore under development by Microsoft Research. Ituses message passing over a distributed system,achieving low latency queries on large distributedgraphs. The benefit of in-memory key value storagecan be seen with increased performance.

4.4 Document Graph Databases

Like key-value stores, document based graphdatabases introduce a higher level of data complexityfor a given node.

Orientdb(2009)[12] is a high-performancedocument-graph database. They make use of anovel distributed hash table algorithm in order toget greater parallelism.

Another example of a document-store in graphdatabases is an implementation on CouchDB[27].This implementation makes use of the documentstore, in order to serve low latency queries for largegraph databases. Document stores, much like key-value stores provide quick data retrieval for struc-tured data.

4.5 SQL Graph Databases

Filament(2010)[4] is a graph persistence library builton top of PostgreSQL. It allows SQL queryingthrough JDBC with navigational queries for query-ing the graph data.

G-Store(2010)[5] is a prototype query language andstorage manager for large graphs. It is also build ontop of PostgreSQL.

These implementations of graph databases are of-ten referred to as Graph stores, for the implementa-tion only concerns storage and retrieval of a graphdata from the database, not how the data is stored.

4.6 Map/Reduce Graph Databases

To handle very large graphs, one can implementMap/Reduce functionality, in order to achieve themaximum amount of parallelism. Partitioning nodesof a graph across many machines will result in onlya sizable amount of computation to be one on eachmachine.

Pregel(2009)[25] is a vertex-based infrastructurefor graphs built on top of Hadoop. Hadoop, aMap/Reduce framework provides batch jobs for pro-cessing the distributed verticies with message pass-ing. This approach only affords doing offline queriesof the graph data.

Phoebus(2010)[13] is another implementation ofPregel, again building on top of Hadoop in order tobenefit from the Map/Reduce framework.

Giraph(2011)[6] also builds off of Pregel with theaddition of fault tolerance. If the applicaiton coor-dinator has a fault, one of the available nodes willautomatically become the new coordinator.

5 Comparisons

Multiple studies have been done comparing the per-formance of graph databases and relational ones.Graph databases like Neo4j[11] optimize for adja-cency queries and graph traversal. While some op-erations may not be as fast as the indexing providedin a SQL database, the overall performance when do-ing graph-like queries will be much improved. Thingsto look for in graph-like queries are, lots of many to

5

Page 6: 08 Graph Databases Survey

many relationships, having tree like characteristics,or requiring frequent schema changes. In one com-parison Neo4j and MySQL[22], the authors foundthat graph databases did perform better than therelational model on the objective queries. Howeverthey noted that Neo4j is not yet mature, and becausethere is no standard query language available it onlyadded to this. Another paper looked at how graphdatabases like Neo4j performed on spacial data[21].The paper found that relational databases still per-formed better, in all spatial queries but the ones thatinvolved hierarchical traversal. The fact that a rela-tional database can quickly index a coordinate loca-tion gives an advantage to relational databases.

In contrast test run with a directed acyclical graphon Neo4j and MySQL[31], showed a clear advantageof graph databases for structural queries. When com-parisons focus on structured data with graphs thatare fairly dense, relational indexing performance withjoins can no longer keep up with the linked data rep-resentation in graph databases.

6 Conclusions

This paper gave an overall summary of the currentstate of graph databases. Much of the current re-search in the field is application driven. However,in turn, the various applications have made a wideassortment of graph databases. In order to possi-bly enumerate all the different categories of graphdatabases, this paper went over many of the cur-rent graph database models being used today. Graphdatabase models are divided by a number algorithmsand paradigms which databases wish to optimize.There still does not exist a standard query languagefor graph databases, leading many implementationsto be API only. The future of graph databases residesin the prevalence of one database over another, mostlikely determined by the enterprise industry and theiradoption. Overall graph databases provide a muchneeded structure for storing data and incorporatinga dynamic schema, however the research topic itselfneeds more structure before it can fully be adoptedby industry.

References

[1] Allegrograph. http://www.franz.com/agraph/allegrograph/.

[2] Cloudgraph. http://www.cloudgraph.com/.

[3] Dex. http://www.sparsity-technologies.

com/dex.

[4] Filament. http://filament.sourceforge.

net/.

[5] G-store. http://g-store.sourceforge.net/.

[6] Giraph. https://github.com/apache/giraph.

[7] Graph connect. http://www.graphconnect.

com/.

[8] Horton. http://research.microsoft.com/

en-us/projects/ldg/.

[9] Hypergraphdb. http://www.hypergraphdb.

org/.

[10] Infinitegraph. http://infinitegraph.com/.

[11] Neo4j. http://www.neo4j.org/.

[12] Orientdb. http://www.orientdb.org/.

[13] Phoebus. https://github.com/xslogic/

phoebus.

[14] redis graph. https://github.com/tblobaum/

redis-graph.

[15] Sones. http://www.dekorte.com/projects/

opensource/vertexdb/.

[16] Trinity. http://research.microsoft.com/

en-us/projects/trinity/.

[17] Vertexdb. http://www.dekorte.com/

projects/opensource/vertexdb/.

[18] C.C. Aggarwal and H. Wang. Managing andmining graph data, volume 40. Springer, 2010.

6

Page 7: 08 Graph Databases Survey

[19] Renzo Angles. A comparison of current graphdatabase models. In Proceedings of the 2012IEEE 28th International Conference on DataEngineering Workshops, ICDEW ’12, pages171–177, Washington, DC, USA, 2012. IEEEComputer Society.

[20] Renzo Angles and Claudio Gutierrez. Survey ofgraph database models. ACM Comput. Surv.,40(1):1:1–1:39, February 2008.

[21] BLP Baas. Nosql spatial–neo4j versus postgis.2012.

[22] S. Batra and C. Tyagi. Comparative analysisof relational and graph databases. InternationalJournal of Soft Computing, 2.

[23] P. Cudre-Mauroux and S. Elnikety. Graph datamanagement systems for new application do-mains. In International Conference on VeryLarge Data Bases (VLDB), 2011.

[24] B. Iordanov. Hypergraphdb: a generalized graphdatabase. Web-Age Information Management,pages 25–36, 2010.

[25] Grzegorz Malewicz, Matthew H. Austern,Aart J.C Bik, James C. Dehnert, Ilan Horn,Naty Leiser, and Grzegorz Czajkowski. Pregel:a system for large-scale graph processing. InProceedings of the 2010 ACM SIGMOD Interna-tional Conference on Management of data, SIG-MOD ’10, pages 135–146, New York, NY, USA,2010. ACM.

[26] Norbert Martınez-Bazan, Victor Muntes-Mulero, Sergio Gomez-Villamor, Jordi Nin,Mario-A. Sanchez-Martınez, and Josep-L.Larriba-Pey. Dex: high-performance explo-ration on large graphs for information retrieval.In Proceedings of the sixteenth ACM conferenceon Conference on information and knowledgemanagement, CIKM ’07, pages 573–582, NewYork, NY, USA, 2007. ACM.

[27] Jayanta Mondal and Amol Deshpande. Manag-ing large dynamic graphs efficiently. In Proceed-ings of the 2012 ACM SIGMOD International

Conference on Management of Data, SIGMOD’12, pages 145–156, New York, NY, USA, 2012.ACM.

[28] M. Sarwat, S. Elnikety, Y. He, and G. Kliot.Horton: Online query execution engine forlarge distributed graphs. In Data Engineering(ICDE), 2012 IEEE 28th International Confer-ence on, pages 1289–1292. IEEE, 2012.

[29] B. Shao, H. Wang, and Y. Li. The trinitygraph engine. Technical report, Technical Re-port 161291, Microsoft Research, 2012.

[30] Bin Shao, Haixun Wang, and Yanghua Xiao.Managing and mining large graphs: systemsand implementations. In Proceedings of the2012 ACM SIGMOD International Conferenceon Management of Data, SIGMOD ’12, pages589–592, New York, NY, USA, 2012. ACM.

[31] Chad Vicknair, Michael Macias, Zhendong Zhao,Xiaofei Nan, Yixin Chen, and Dawn Wilkins.A comparison of a graph database and a re-lational database: a data provenance perspec-tive. In Proceedings of the 48th Annual South-east Regional Conference, ACM SE ’10, pages42:1–42:6, New York, NY, USA, 2010. ACM.

7