Graph Database with Cassandra

Proprietary and Confidential / © The Nerdery, LLC 2

Graph Databases
Brandon Veber
Chad Dvoracek

Agenda

• Introduction to graph databases
  – What they are
  – Why to use them

• Titan technology stack
  – NoSQL distributed scalable data storage
  – Spark in-memory distributed computing

• Graph queries and analytics

Introduction to Graph

What is a Graph Database?

Graph databases use graph structures such as nodes and edges to store data and relationships.

Entities are modelled as nodes and the relationships between them are modelled as edges.

Blue, J. Driving Insights with Network Graphs. Retrieved from www.mapr.com/blog/driving-insights-network-graphs
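As a rough illustration of the node-and-edge model, here is a minimal plain-Python sketch. The `Node` and `Edge` classes, the `neighbors` helper and the sample data are all hypothetical and are not part of any graph database API; the labels simply echo the shopping example used later in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity, with a label and free-form properties."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled relationship between two nodes."""
    label: str
    source: str  # node_id the edge leaves
    target: str  # node_id the edge enters
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one customer, one transaction, one item.
nodes = {
    "c1": Node("c1", "customer", {"name": "Customer 1"}),
    "t1": Node("t1", "transaction", {"tx_id": 1}),
    "i1": Node("i1", "item", {"name": "Pop"}),
}
edges = [
    Edge("purchases", "c1", "t1"),
    Edge("includes", "t1", "i1", {"quantity": 2}),
]


def neighbors(node_id, edge_label):
    """Return the nodes reached by following outgoing edges with this label."""
    return [nodes[e.target] for e in edges
            if e.source == node_id and e.label == edge_label]


bought = [n.properties["name"] for n in neighbors("t1", "includes")]
print(bought)
```

Relationships are stored as data in their own right, so a query is a matter of following edges rather than matching keys across tables.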

How is it different from RDBMS?

● Relational databases prioritize the table

● Relationships are expressed only indirectly, through foreign-key (FK) constraints

● Querying through complex relationships requires several costly joins

Graph DB vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

How is it different from RDBMS?

● Nodes contain entities and their corresponding properties

● Relationships are given top priority

● Pointers instead of index look-ups

Graph Database. Accessed from https://en.wikipedia.org/wiki/Graph_database
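The contrast between join-style look-ups and pointer-style traversal can be sketched in plain Python. This is an illustration only: the names `friend_rows`, `adjacency`, `friends_via_join` and `friends_of_friends` are made up for the example, and real engines use far more sophisticated storage.

```python
# Relational style: relationships live in a separate table of key pairs.
friend_rows = [("alice", "bob"), ("bob", "carol"), ("alice", "dave")]


def friends_via_join(person):
    # Each hop filters the whole relationship table
    # (a real RDBMS would consult an index on it instead).
    return sorted(b for a, b in friend_rows if a == person)


# Graph style: each node directly holds references to its neighbors.
adjacency = {
    "alice": ["bob", "dave"],
    "bob": ["carol"],
    "carol": [],
    "dave": [],
}


def friends_of_friends(person):
    # Each hop follows the references stored on the node itself.
    return sorted({fof for f in adjacency[person] for fof in adjacency[f]})


print(friends_via_join("alice"))
print(friends_of_friends("alice"))
```

With direct references, the cost of a hop depends on the number of neighbors of the current node rather than on the total size of the relationship table.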

How is it different from RDBMS?

● Inherently NoSQL

● Scalable

● High availability

● Data model is intuitive and agile.

Graph vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

When to use Graph DB

When not to use Graph DB

● Data warehousing

● Schema-oriented design

● Aggregates on sets

● Robust transactional processing

When to use Graph DB

● Graph databases work well with highly interconnected data with complex relationships

● Some use cases include:
  ○ Social networks
  ○ Route planning
  ○ Master data management
  ○ Recommendation engines

AWS Master Data Management Model. Accessed from http://neo4j.com/graphgist/8526106

Successful Use Cases

Successful Use Cases - HealthUnlocked

Goal: Redesign system to manage performance issues associated with increasing data volume

Methods:

● Graph database to store relationships between symptoms, conditions and treatments

● Language processing to build multilingual ontology into the database

Result:

● Improved query performance

● Easier data model for pattern matching

● Two months to launch

Open Source Graph Framework

● The Apache TinkerPop project provides an open source, vendor-agnostic framework for graph construction, query and analysis

● Changing between graph engines and back-end storage technologies is possible without significant refactoring

● Supports graph databases (OLTP) and graph analytics (OLAP)

Titan

Technology Stack - Storage

● Supports several distributed NoSQL databases

● Support for ACID transactions or eventual consistency, depending on the storage backend

● Linearly scalable

Technology Stack - Analytics & ETL

Titan offers support for several analytics and batch loading technologies.

Technology Stack - Search + Framework

Titan supports the following search technologies:

• Elasticsearch
• Lucene
• Solr

Titan also integrates natively with Apache TinkerPop.

Apache Cassandra

● Wide-column (partitioned key-value) store

● Exceptional fault tolerance

● Scalable

● Denormalized tables

Apache Spark

● Resilient distributed datasets

● In-memory cluster computing

● Scalable

● Up to 100x faster than Hadoop MapReduce for in-memory workloads

● Native Cassandra connector

DataStax Graph

● Designed for cloud applications

● Multi-model capable

● Enterprise support

● Scalable

Queries and Analytics

Example Model

Example Model

● Edges, like nodes, can carry properties

● The 'includes' edge carries a quantity property

Simple Traversal

Traversal Example

Question: What items were purchased in 'Transaction 1'?

g.V().hasLabel('transaction').has('tx_id',1).out('includes').values('name')

Output

● Pop

● Gum

● Bread
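To make the steps of this traversal concrete, here is a plain-Python sketch that mirrors it on a tiny in-memory graph. It is not Gremlin and not a TinkerPop API: the `vertices` and `edges` structures, the `out_step` helper and the sample data are assumptions chosen so that the result matches the output above.

```python
# Hypothetical sample graph: vertex id -> properties, edges as (source, label, target).
vertices = {
    "t1": {"label": "transaction", "tx_id": 1},
    "t2": {"label": "transaction", "tx_id": 2},
    "i1": {"label": "item", "name": "Pop"},
    "i2": {"label": "item", "name": "Gum"},
    "i3": {"label": "item", "name": "Bread"},
}
edges = [
    ("t1", "includes", "i1"),
    ("t1", "includes", "i2"),
    ("t1", "includes", "i3"),
    ("t2", "includes", "i1"),
]


def out_step(vertex_ids, label):
    """Follow outgoing edges with the given label, like out('includes')."""
    return [dst for vid in vertex_ids
            for src, lbl, dst in edges if src == vid and lbl == label]


# hasLabel('transaction').has('tx_id', 1)
start = [vid for vid, v in vertices.items()
         if v["label"] == "transaction" and v.get("tx_id") == 1]

# out('includes').values('name')
names = [vertices[vid]["name"] for vid in out_step(start, "includes")]
print(names)
```

Each Gremlin step narrows or moves the current set of elements: filter to the starting vertex, hop along the 'includes' edges, then read a property from whatever was reached.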

Traversal Example

Question: What customers have shopped at 'Store 1'?

g.V().hasLabel('store').has('store_id',1).out('processes').in('purchases').values('name')

Output

● Customer 1
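This traversal walks one edge forwards and one edge backwards. The plain-Python sketch below mirrors that on hypothetical data; `out_step`, `in_step` and the sample graph are made up for illustration. The final de-duplication is an addition of this sketch (comparable to Gremlin's `dedup()` step), needed here only because the sample customer has two transactions at the store.

```python
# Hypothetical sample graph: vertex id -> properties, edges as (source, label, target).
vertices = {
    "s1": {"label": "store", "store_id": 1},
    "s2": {"label": "store", "store_id": 2},
    "t1": {"label": "transaction", "tx_id": 1},
    "t2": {"label": "transaction", "tx_id": 2},
    "t3": {"label": "transaction", "tx_id": 3},
    "c1": {"label": "customer", "name": "Customer 1"},
    "c2": {"label": "customer", "name": "Customer 2"},
}
edges = [
    ("s1", "processes", "t1"),
    ("s1", "processes", "t2"),
    ("s2", "processes", "t3"),
    ("c1", "purchases", "t1"),
    ("c1", "purchases", "t2"),
    ("c2", "purchases", "t3"),
]


def out_step(ids, label):
    """Follow edges in their own direction, like out(label)."""
    return [dst for vid in ids
            for src, lbl, dst in edges if src == vid and lbl == label]


def in_step(ids, label):
    """Follow edges against their direction, like in(label)."""
    return [src for vid in ids
            for src, lbl, dst in edges if dst == vid and lbl == label]


stores = [vid for vid, v in vertices.items()
          if v["label"] == "store" and v.get("store_id") == 1]
transactions = out_step(stores, "processes")    # store -> transactions
customers = in_step(transactions, "purchases")  # transactions <- customers

# Assumption of this sketch: drop repeats while keeping first-seen order.
names = list(dict.fromkeys(vertices[c]["name"] for c in customers))
print(names)
```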

Branching Traversal

Question: Across all transactions in which 'Pop' was purchased, what was the average quantity?

g.V().has('name','Pop').inE('includes').values('quantity').mean()

Output

● 1.5

Branching Traversal

Question: For each item, what is the average quantity per transaction in which it was purchased?

g.V().hasLabel('item').local(inE('includes').values('quantity').mean())

Output

● 1.5

● 2.5

● 1
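The two branching traversals differ only in scope: one mean over the edges into a single item, versus one mean per item. A plain-Python sketch of that difference follows. The `items` and `includes` data are hypothetical, with quantities chosen to reproduce the outputs above, and nothing here is a Gremlin or Titan API.

```python
from statistics import mean

# Hypothetical data: item id -> name, and 'includes' edges with a quantity property.
items = {"i1": "Pop", "i2": "Gum", "i3": "Bread"}
includes = [  # (transaction, item, quantity)
    ("t1", "i1", 1),
    ("t1", "i2", 2),
    ("t1", "i3", 1),
    ("t2", "i1", 2),
    ("t2", "i2", 3),
]

# One global mean over the edges that point at 'Pop'.
pop_mean = mean(q for _, item, q in includes if items[item] == "Pop")

# One mean per item: the aggregation is evaluated separately for each item,
# which is the role local(...) plays in the Gremlin query.
per_item = {
    name: mean(q for _, item, q in includes if item == item_id)
    for item_id, name in items.items()
}

print(pop_mean)
print(per_item)
```

Without the per-item scoping, the second query would collapse all quantities into a single average instead of returning one value per item.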

More Traversal Strategies

● Recursive

● Path

● Projecting

● Declarative

Graph Analytics - Network Properties

● Node count - Total number of nodes

● Edge count - Total number of edges

● Diameter - Maximum length of a shortest path between any two nodes

● Min, max and mean degree - Degree is the number of connections for each node

● Degree distribution - Histogram (shown on next page)
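These properties can be computed directly on a small graph. The sketch below uses a hypothetical five-node undirected graph stored as an adjacency dictionary, and assumes the graph is connected so that the diameter is well defined. It is meant to illustrate the definitions, not how a distributed engine would compute them at scale.

```python
from collections import deque

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

node_count = len(graph)
# Every undirected edge appears in two neighbor sets, so halve the total.
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2

degrees = {node: len(nbrs) for node, nbrs in graph.items()}
min_degree = min(degrees.values())
max_degree = max(degrees.values())
mean_degree = sum(degrees.values()) / node_count


def shortest_path_lengths(start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


# Diameter: the longest of all shortest paths (assumes a connected graph).
diameter = max(max(shortest_path_lengths(n).values()) for n in graph)

print(node_count, edge_count, min_degree, max_degree, mean_degree, diameter)
```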

Graph Analytics - Degree Distribution

The degree of a node represents how many connections it has. A degree distribution is the probability distribution of those degrees in the network.

Most real-world graphs exhibit a degree distribution similar to the GitHub distribution shown on the right.

Big Graph Data on Hortonworks Data Platform. Accessed from http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/
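A degree distribution is simply a normalized histogram of node degrees. The following sketch computes one for a hypothetical five-node undirected graph; the graph and variable names are made up for illustration and have nothing to do with the GitHub data referenced above.

```python
from collections import Counter

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

# Count how many nodes have each degree.
degree_counts = Counter(len(nbrs) for nbrs in graph.values())

# Normalize the counts so they sum to 1: P(degree = k).
total = len(graph)
degree_distribution = {k: count / total
                       for k, count in sorted(degree_counts.items())}

print(degree_distribution)
```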

Graph Analytics - Network Properties

● Clustering coefficients - Measure how strongly a node's neighbors tend to be connected to each other

● Centrality - Identify the most important nodes (e.g. PageRank)

● Community detection - Identify groups of nodes that are more densely connected among themselves than with the other nodes in the graph
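As one concrete example, the local clustering coefficient of a node is the fraction of possible links between its neighbors that actually exist. The sketch below computes it on a hypothetical undirected graph; the convention of returning 0.0 for nodes with fewer than two neighbors is an assumption of this sketch.

```python
from itertools import combinations

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}


def local_clustering(node):
    """Existing links among the node's neighbors / possible links among them."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbor pairs to compare (convention assumed here)
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)


for node in sorted(graph):
    print(node, local_clustering(node))
```

Here node `a` scores 1.0 because its two neighbors are linked to each other, while node `d` scores 0.0 because its neighbors `c` and `e` are not.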

Questions?

Contact

The [email protected]
(877) 664.6373