(DAT203) Building Graph Databases on AWS
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Todd Hildebrant and Matthew Sowders
AWS
October 2015
DAT203
Graph Databases on AWS
What to Expect from the Session
• Who are we?
• General overview of graph database technology
• AWS architecture examples
• Amazon Fulfillment technology’s “Inventory Notification Graph”
• Amazon DynamoDB Storage Backend for Titan
Graph databases on AWS
What is a graph? What is a graph database?
• A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties. A tree is a special case of a graph.
• A graph database uses a property graph as the data
model and includes a query language.
• Other possible data models are hyper-graphs, triple-stores, RDF.
Graph data modeling
• NoSQL data models – Document, Key-Value, Columnar,
Graph, Mixed
• CAP and ACID
• Start with the use case, then develop the data model:
• As a Student, I want to know other Students in my Class who know about a Subject
• Student KNOWS Subject, Student BELONGS_TO Class
(Subject) <-[KNOWS]- (Student) -[BELONGS_TO]-> (Class)
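The use case above can be sketched as a tiny in-memory property graph. This is only an illustration of the data model, not how Titan or Neo4j store data; the class `PropertyGraph`, the function `classmates_who_know`, and the sample student, class, and subject names are all hypothetical.

```python
class PropertyGraph:
    """Minimal in-memory graph: labeled vertices and directed, labeled edges."""

    def __init__(self):
        self.labels = {}      # vertex id -> label, e.g. "Student"
        self.out_edges = {}   # (vertex id, edge label) -> set of target ids
        self.in_edges = {}    # (vertex id, edge label) -> set of source ids

    def add_vertex(self, vid, label):
        self.labels[vid] = label

    def add_edge(self, src, edge_label, dst):
        self.out_edges.setdefault((src, edge_label), set()).add(dst)
        self.in_edges.setdefault((dst, edge_label), set()).add(src)


def classmates_who_know(g, student, subject):
    """Students who share a Class with `student` and KNOW `subject`."""
    result = set()
    # Hop 1: from the starting Student to each of its Classes.
    for cls in g.out_edges.get((student, "BELONGS_TO"), set()):
        # Hop 2: back from the Class to every Student that belongs to it.
        for other in g.in_edges.get((cls, "BELONGS_TO"), set()):
            knows = g.out_edges.get((other, "KNOWS"), set())
            if other != student and subject in knows:
                result.add(other)
    return sorted(result)


g = PropertyGraph()
for name in ("alice", "bob", "carol"):
    g.add_vertex(name, "Student")
g.add_vertex("graphs", "Subject")
g.add_vertex("cs101", "Class")
g.add_vertex("cs102", "Class")
g.add_edge("alice", "BELONGS_TO", "cs101")
g.add_edge("bob", "BELONGS_TO", "cs101")
g.add_edge("carol", "BELONGS_TO", "cs102")
g.add_edge("bob", "KNOWS", "graphs")
g.add_edge("carol", "KNOWS", "graphs")

print(classmates_who_know(g, "alice", "graphs"))
```

The query starts at one vertex and follows edges hop by hop, which is the access pattern the use case calls for; no table-wide JOIN is involved.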
Graph vs. relational database
Graph
• Need to traverse a graph without JOINs
• Queries have a starting location (MATCH ON x)
• Normalized attribute to enable filtering
• Dynamic schema
Relational
• Columnar analytics
• Tables denormalized for performance
• Cluster and fault management
• Recursive query support in the query optimizer
Titan: distributed graph database
• Distributed graph
• Storage layer has plug-in architecture
• Native TinkerPop implementation
• Full text search with Lucene, SOLR, Elasticsearch
• HA using multi-master replication (Cassandra cluster)
• Scalability using DynamoDB
Neo4j
• Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable using JVM
• HA when distributed, uses Paxos for master election
• Attempts to load DB into RAM, larger is better. Efficient spilling to disk.
• Primary query language is Cypher, supports Gremlin
AWS deployment for Neo4j
Availability Zone #1
Write ELB
Availability Zone #1
Read ELB
ELB health checks (HTTP GET):
/db/manage/server/ha/master
/db/manage/server/ha/slave
/db/manage/server/ha/active
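A minimal sketch of how the two load balancers could use these endpoints to route traffic. The pairing of each ELB with a path, and the assumption that an endpoint answers HTTP 200 when the instance holds that role and a non-200 status otherwise, are illustrative assumptions rather than something stated on the slide; `HEALTH_CHECK_PATH` and `in_service` are hypothetical names.

```python
# Assumed pairing: the write ELB probes the master endpoint,
# the read ELB probes the slave endpoint.
HEALTH_CHECK_PATH = {
    "write": "/db/manage/server/ha/master",
    "read": "/db/manage/server/ha/slave",
}


def in_service(elb, status_by_path):
    """True when the instance passes the health check for the given ELB.

    `status_by_path` maps a health check path to the HTTP status code
    the instance returned for a GET on that path.
    """
    return status_by_path.get(HEALTH_CHECK_PATH[elb]) == 200


# Status codes a master and a replica might return (assumed values).
master = {"/db/manage/server/ha/master": 200, "/db/manage/server/ha/slave": 404}
replica = {"/db/manage/server/ha/master": 404, "/db/manage/server/ha/slave": 200}

print(in_service("write", master), in_service("write", replica))
print(in_service("read", master), in_service("read", replica))
```

Under these assumptions only the current master receives writes, and a newly elected master starts passing the write health check without any change to the ELB configuration.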
Analytics on graphs
• OLAP not OLTP
• Leverages the Hadoop / MapReduce framework
• GraphX is graph analytics on Spark, in memory; functional, “declarative” programming model
• Giraph is graph processing on MapReduce / HDFS; procedural, vertex-centric programming model
• Aggregation type queries over the entire graph
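To make "vertex-centric" concrete, here is a small single-process sketch in the style of that programming model: in each superstep a vertex reads its incoming messages, updates its own value, and sends messages to its neighbors. It computes connected components by propagating the smallest vertex id. It does not use the Giraph or GraphX APIs; `connected_components` is a hypothetical name.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component.

    `edges` is an iterable of (a, b) pairs, treated as undirected.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Later supersteps: a vertex only acts if a message lowers its label.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        inbox = outbox  # messages are delivered at the next superstep
    return label


print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

The loop stops when no vertex sends a message, which is the usual halting condition in this model. The whole graph is touched, which is why this style of workload is OLAP rather than OLTP.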
TinkerPop
• Apache Incubator graph framework supporting both OLAP and OLTP.
• Gremlin, a query language for graph traversals. Supports analysis, modification, and queries.
• Gremlin Structure API, a generic connector framework or API. Interface to a backend graph engine.
Graph DB use cases
• Social
• Recommendation
• Classic network problems
• Deep hierarchies
• Sensor analysis with geo-spatial constraints
• Fraud detection
• Identity and Access Management
Recommendation engine example
Components: Neo4j cluster, EMR
Writes: buy, like item
Reads:
• “People who bought this item also bought”
• Custom: “Something you recently looked at has changed”
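The “People who bought this item also bought” read can be sketched as a two-hop traversal: from the item to its buyers, then from those buyers to their other items, counting how often each item appears. This is an in-memory illustration only, not the Neo4j or EMR implementation; `also_bought` and the sample data are hypothetical.

```python
from collections import Counter


def also_bought(purchases, item, top_n=3):
    """Rank other items by how many buyers of `item` also bought them.

    `purchases` is an iterable of (customer, item) pairs, i.e. BOUGHT edges.
    """
    buyers_of = {}   # item -> customers who bought it
    items_of = {}    # customer -> items they bought
    for customer, bought in purchases:
        buyers_of.setdefault(bought, set()).add(customer)
        items_of.setdefault(customer, set()).add(bought)

    counts = Counter()
    for customer in buyers_of.get(item, set()):   # hop 1: item -> buyers
        for other in items_of[customer]:          # hop 2: buyer -> items
            if other != item:
                counts[other] += 1

    # Most co-purchased first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


purchases = [
    ("ann", "book"), ("ann", "lamp"),
    ("bob", "book"), ("bob", "lamp"), ("bob", "pen"),
    ("cat", "pen"),
]
print(also_bought(purchases, "book"))
```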
Inbound fulfillment
Inbound fulfillment data problems
Approaches
Manual Research
• All tools emit events
• Humans trace the events
• Difficult to follow as search space increases
• Developed queries, but took too long to run
Unique Identifiers
• Every item gets a unique identifier
• Easy to get all related events
• Expensive
• Impractical for some items
Inventory notification graph: data model
Why not use a relational or NoSQL database?
• Relational Database
• Knew data volume would be huge and keep growing
• Did not want to vertically scale
• JOINs on table will be expensive
• Use case required high availability
• NoSQL Store
• Would be the same solution without all the functionality built
into the TinkerPop Graph Framework
Why a graph?
• No way to index just the events we need
• Need to perform search from receive to stow and vice versa, which requires many hops to find the data
• Need to process messages out of order
• Graphs provide a simple mental model
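A minimal sketch of two of the points above: events can be added in any order, and the same structure answers both "what did this receive lead to?" and "where did this stow come from?" by walking edges in either direction. The `EventGraph` class and the identifiers `receive-1`, `pallet-7`, `case-3`, and `stow-9` are hypothetical, not the actual Inventory Notification Graph schema.

```python
from collections import deque


class EventGraph:
    def __init__(self):
        self.forward = {}    # source -> set of targets
        self.backward = {}   # target -> set of sources

    def add_event(self, src, dst):
        """Record that `src` led to `dst`; arrival order does not matter."""
        self.forward.setdefault(src, set()).add(dst)
        self.backward.setdefault(dst, set()).add(src)

    def reachable(self, start, direction="forward"):
        """Breadth-first search over as many hops as needed."""
        adjacency = self.forward if direction == "forward" else self.backward
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, set()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen


events = EventGraph()
# Messages arrive out of order: the last hop is recorded first.
events.add_event("case-3", "stow-9")
events.add_event("receive-1", "pallet-7")
events.add_event("pallet-7", "case-3")

print(sorted(events.reachable("receive-1")))
print(sorted(events.reachable("stow-9", direction="backward")))
```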
Why Titan?
TinkerPop stack
• Rexster (graph server)
• Blueprints (generic graph API)
• Furnace (graph algorithms)
• Frames (object-graph mapper)
• Gremlin (traversal language)
• Pipes (dataflows)
Titan
Storage backends: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB
Cassandra
• Highly available
• Existing Titan implementation
• EC2Snitch
• Replication
• RandomPartitioner
Cassandra: Titan lessons learned
• No one on our team had experience managing or
configuring a Cassandra cluster
• Needed to manage a cluster
• Team manually replaces hosts as EC2 swaps them out
• Does not handle time series data well
• We ran two producers against two keyspaces so we
could efficiently drop old data
DynamoDB: Titan
• Massively scalable
• No more tuning and host management
• Team was already familiar with DynamoDB
• Risky because there was no existing Titan
implementation
Inventory notification graph – architecture
DynamoDB: single-item data model
Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute
Vertex id 1 | Property – Name: Justin | Edge (out) – Friend: Anna | Edge (out) – Friend: Kris | Edge (out) – Likes: Movies | Hidden Property – Exists
Vertex id 2 | Property – Name: Anna | Edge (out) – Friend: Justin | Edge (out) – Likes: Books | Hidden Property – Exists
Vertex id 3 | Property – Name: Kris | Edge (out) – Friend: Justin | Edge (out) – Likes: Movies | Hidden Property – Exists
Vertex id 4 | Property – Name: Movies | Edge (out) – Friend: Justin | Edge (out) – Likes: Kris | Hidden Property – Exists
Vertex id 5 | Property – Name: Books | Edge (out) – Friend: Anna | Hidden Property – Exists
DynamoDB: multiple-item data model
Hash Key (hk) | Range Key (rk) | Value (v)
Vertex id 1 | Range key |
Vertex id 1 | Property id | Property – Name Justin
Vertex id 1 | Edge id | Edge (out) – Friend Anna
Vertex id 1 | Edge id | Edge (out) – Friend Kris
Vertex id 2 | Range key |
Vertex id 2 | Property id | Property – Name Anna
Vertex id 2 | Edge id | Edge (out) – Friend Justin
Vertex id 2 | Edge id | Edge (out) – Friend Brooks
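The difference between the two data models can be sketched with plain dictionaries standing in for DynamoDB items. The attribute names `hk`, `rk`, and `v` come from the tables above; the column keys such as `property:name` and `edge:friend:2` are made up for readability and are not the encoding the storage backend actually uses.

```python
def single_item(vertex_id, columns):
    """Single-item model: one item per vertex.

    Every property and edge becomes an attribute on the same item.
    """
    item = {"hk": vertex_id}
    item.update(columns)
    return [item]


def multiple_item(vertex_id, columns):
    """Multiple-item model: one item per property or edge.

    Items share the hash key and are told apart by the range key.
    """
    return [
        {"hk": vertex_id, "rk": column, "v": value}
        for column, value in sorted(columns.items())
    ]


justin = {
    "property:name": "Justin",
    "edge:friend:2": "Anna",
    "edge:friend:3": "Kris",
}

print(single_item(1, justin))
print(multiple_item(1, justin))
```

With the single-item model a whole vertex is read in one request, but everything about the vertex has to fit inside one DynamoDB item, whose size is capped at 400 KB. The multiple-item model spreads a vertex over many items, so a vertex with a very large number of edges is not bounded by that cap.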
DynamoDB: how does it scale?
• Close to 100 billion vertices
• Terabytes of data
• Without corresponding increase in latency
DynamoDB: Titan lessons learned
• Use Titan explicit partitioning on large graph
• Partition across multiple graphs for time series data
• Able to achieve stable performance at scale
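One way to read "partition across multiple graphs for time series data" is to route each event to a graph named after its time bucket, then drop whole graphs once they age out instead of deleting old vertices one at a time. The sketch below assumes monthly buckets and a two-month retention window; both choices, and the names `graph_name` and `expired`, are hypothetical.

```python
from datetime import date


def graph_name(day, prefix="events"):
    """Name of the graph that holds data for the month containing `day`."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def expired(graph_names, today, keep_months=2):
    """Graphs that are at least `keep_months` months old and can be dropped whole."""
    current = today.year * 12 + (today.month - 1)
    old = []
    for name in graph_names:
        # Names look like "<prefix>_<year>_<month>".
        _, year, month = name.rsplit("_", 2)
        age = current - (int(year) * 12 + int(month) - 1)
        if age >= keep_months:
            old.append(name)
    return sorted(old)


today = date(2015, 10, 8)
existing = ["events_2015_07", "events_2015_08", "events_2015_09", "events_2015_10"]
print(graph_name(today))
print(expired(existing, today))
```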
How to get started
• GitHub Repository
• DynamoDB Local
• CloudFormation Template
Resources
• Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem
• Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson
• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler
• Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels
• Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr
• Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis
• Amazon DynamoDB Storage Backend for Titan FAQ
• Amazon DynamoDB Storage Backend for Titan Documentation
Thank you!
Remember to complete
your evaluations!