COSC 6339 Big Data Analytics NoSQL Database Systems(IV) Graph DataBase systems:...
COSC 6339
Big Data Analytics
NoSQL Database Systems(IV) –
Graph DataBase systems: Neo4j
Edgar Gabriel
Fall 2018
Graph database models
Image source: https://www.quackit.com/neo4j/tutorial/
Graph database models
• Data models for schema and instances are graph
models or generalizations of them.
• Data manipulation is expressed as graph-based
operations.
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Graph Databases Advantages
• Natural modeling of highly connected data
• Special graph storage structure
• Efficient schemaless graph algorithms
• Support for query languages.
• Operators to query the graph structure.
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Graph Databases
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Neo4j
• Neo4j is a graph database, adopting a labeled property
graph model.
• Terminology:
– vertices are called nodes
– edges are called relationships.
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
Neo4j
• Nodes
– Nodes are typically used to represent entities (or complex
value types).
– Nodes can have properties, which are key/value pairs.
Values can be primitives or collections of primitives.
– Nodes can have zero or more relationships connecting
them to other nodes.
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
Neo4j
• Relationships
– Relationships represent the connections between nodes and
give the nodes context.
– Relationships must have a start and end node, thus
relationships must have a direction. Direction can be
ignored at query time, so the fact that direction is there
does not mean it must be used.
– Relationships must have a relationship type.
– Relationships can have properties (key/value pairs; values
can be primitives or collections of primitives).
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
Neo4j
• Properties
– Nodes and relationships can have properties (key/value
pairs; values can be primitives or collections of
primitives).
– Properties can quantify relationships.
• Labels
– Nodes can have zero or more labels.
– Labels can represent roles, categories or types.
– Labels are used to define indexes and constraints.
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
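The building blocks above (nodes, relationships, properties, labels) can be sketched as a small in-memory model. This is only an illustration in plain Java, not the Neo4j API: the class PropertyGraphSketch, its nested types, the neighbors helper and the "year" property are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of a labeled property graph (not the Neo4j API).
class PropertyGraphSketch {

    static class Node {
        final Set<String> labels = new HashSet<>();                 // zero or more labels
        final Map<String, Object> properties = new HashMap<>();     // key/value pairs
        final List<Relationship> relationships = new ArrayList<>(); // zero or more
    }

    static class Relationship {
        final Node start;  // a relationship always has a start node ...
        final Node end;    // ... and an end node, which gives it a direction
        final String type; // a relationship always has a type
        final Map<String, Object> properties = new HashMap<>();

        Relationship(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
            start.relationships.add(this);
            if (end != start) {
                end.relationships.add(this); // register self-loops only once
            }
        }
    }

    // Nodes connected to n by relationships of the given type, ignoring direction.
    static List<Node> neighbors(Node n, String type) {
        List<Node> result = new ArrayList<>();
        for (Relationship r : n.relationships) {
            if (r.type.equals(type)) {
                result.add(r.start == n ? r.end : r.start);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node author = new Node();
        author.labels.add("Author");
        author.properties.put("firstname", "Peter");

        Node paper = new Node();
        paper.labels.add("Paper");

        Relationship wrote = new Relationship(author, paper, "WROTE");
        wrote.properties.put("year", 2018); // illustrative property that quantifies the relationship

        // The relationship points from author to paper, but can be followed from either end.
        System.out.println(neighbors(paper, "WROTE").get(0).properties.get("firstname"));
    }
}
```

The relationship stores its direction, while neighbors ignores it, which mirrors the point that direction is always present but does not have to be used at query time.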
Neo4J API(I)
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(DB_PATH);
Node author1 = db.createNode();
Node author2 = db.createNode();
Node author3 = db.createNode();
Node paper1 = db.createNode();
Node paper2 = db.createNode();
Node paper3 = db.createNode();
Node paper4 = db.createNode();
Node paper5 = db.createNode();
Node conference1 = db.createNode();
Node conference2 = db.createNode();
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Neo4J API(II)
Adding attributes
author1.setProperty("firstname", "Peter");
author1.setProperty("lastname", "Smith");
author2.setProperty("firstname", "Juan");
author2.setProperty("lastname", "Perez");
author3.setProperty("firstname", "John");
author3.setProperty("lastname", "Smith");
conference1.setProperty("name", "ESWC");
conference2.setProperty("name", "ISWC");
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Neo4J API(III)
Adding relationships
RelationshipType relWrote = DynamicRelationshipType.withName("WROTE");
Relationship rel1 = author1.createRelationshipTo(paper1, relWrote);
Relationship rel2 = author1.createRelationshipTo(paper5, relWrote);
...
RelationshipType relCites = DynamicRelationshipType.withName("CITES");
Relationship rel8 = paper1.createRelationshipTo(paper2, relCites);
Relationship rel9 = paper3.createRelationshipTo(paper1, relCites);
...
RelationshipType rConf = DynamicRelationshipType.withName("Conference");
Relationship rel12 = paper1.createRelationshipTo(conference1, rConf);
Relationship rel13 = paper2.createRelationshipTo(conference2, rConf);
Relationship rel14 = paper3.createRelationshipTo(conference1, rConf);
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Cypher
• Neo4J query language
• Mix of pattern-oriented and declarative query language
• Cypher patterns:
– Patterns are the fundamental way to describe a traversal in Cypher
– Nodes are drawn as circles, using parentheses, e.g. (ident)
– Relationships are drawn as arrows, e.g. (ident)-->(ident2)
– Relationship identifiers are specified within square
brackets, with an optional type after a colon, e.g.
(u)-[r:HAS_ACCESS]->(a)
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
Cypher Query Language
• CREATE:
– Create new nodes and relationships based on a pattern
• MATCH:
– Find the parts of the graph that match a pattern, beginning
at the starting point(s)
• MERGE:
– Match a pattern, or create it if it doesn’t exist
• RETURN:
– What is projected out from the evaluation of the query.
• WHERE:
– Filtering criteria.
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
CREATE (a:Author {Firstname:"Peter", Lastname:"Smith"});
CREATE (a:Author {Firstname:"John", Lastname:"Smith"});
CREATE (a:Author {Firstname:"Juan", Lastname:"Perez"});
CREATE (:Paper {Title:"Paper1"});
CREATE (:Paper {Title:"Paper2"});
CREATE (:Paper {Title:"Paper3"});
CREATE (:Conference {Title:"ISWC"});
CREATE (:Conference {Title:"ESWC"});
In (a:Author {Firstname:"Peter", Lastname:"Smith"}): a is a variable to reference the node later, Author is the label, and the entries in braces are the properties.
CYPHER Examples (I)
• Query: Find the author “Peter Smith”
MATCH (a:Author)
WHERE a.Firstname = 'Peter' AND a.Lastname = 'Smith'
RETURN a
or equivalent (but shorter)
MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})
RETURN a
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
CYPHER setting relationships
MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'}),
      (p:Paper {Title:'Paper1'})
CREATE (a)-[r:WROTE]->(p)
RETURN r;
MATCH (p1:Paper {Title:'Paper1'}), (p2:Paper {Title:'Paper2'})
CREATE (p1)-[r:CITES]->(p2)
RETURN r;
• Query: Papers cited by a paper written by “Peter Smith”
MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})
      -[:WROTE]->()-[:CITES]->(p:Paper)
RETURN p
• Query: Number of papers cited by a paper written by “Peter Smith”
MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})
      -[:WROTE]->()-[:CITES]->(p:Paper)
RETURN COUNT(p)
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
• Query: Papers cited by a paper written by Peter Smith
that have at least 20 cites.
MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})
      -[:WROTE]->()-[:CITES]->(p:Paper)
MATCH (p)<-[:CITES]-(citing:Paper)
WITH p, COUNT(DISTINCT citing) AS cites
WHERE cites > 19
RETURN p
Slide source:
http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
Importing Data into Neo4j
Neo4j supports importing data from CSV files:
LOAD CSV FROM 'file:///tmp/users.csv' as line
MERGE (:User {id:toInt(line[0]), name:line[1]});
LOAD CSV FROM 'file:///tmp/groups.csv' as line
MERGE (:Group {id:toInt(line[0]), name:line[1]});
LOAD CSV FROM 'file:///tmp/user_groups.csv' as line
MATCH (u:User {id:toInt(line[0])}), (g:Group
{id:toInt(line[1])})
MERGE (u)-[:IS_MEMBER]->(g);
Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
Neo4j Clustering
• Core servers:
– Replicate all transactions using the Raft protocol.
– A majority of Core Servers in a cluster (N/2+1) has to
accept a transaction before it is safe to acknowledge the
commit to the end user application.
Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
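The majority rule for Core Servers can be worked through with a few lines of arithmetic. The sketch below assumes that N/2+1 is meant with integer division, i.e. floor(N/2)+1; the class QuorumSketch and its method names are hypothetical and not part of Neo4j.

```java
// Hypothetical helper that illustrates the majority rule; not part of Neo4j.
class QuorumSketch {

    // Smallest number of Core Servers that forms a majority: floor(N/2) + 1.
    static int majority(int coreServers) {
        if (coreServers < 1) {
            throw new IllegalArgumentException("need at least one Core Server");
        }
        return coreServers / 2 + 1; // integer division rounds down
    }

    // Number of Core Servers that may fail while a majority can still be formed.
    static int toleratedFailures(int coreServers) {
        return coreServers - majority(coreServers);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println("N=" + n
                    + " majority=" + majority(n)
                    + " tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Under this rule a cluster of 3 Core Servers needs 2 acceptances and tolerates 1 failure, while going from 3 to 4 servers raises the majority to 3 without tolerating any additional failure.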
Neo4j Clustering
• Read replicas:
– scale out graph workloads (Cypher queries, etc.).
– Act like caches for the data that the Core Servers store
– fully-fledged Neo4j databases capable of fulfilling
arbitrary (read-only) graph queries
– asynchronously replicated from Core Servers via
transaction log shipping. A Read Replica will periodically
poll a Core Server for any new transactions that it has
processed since the last poll, and the Core Server will
ship those transactions to the Read Replica.
– Provide causal consistency
Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
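The polling behaviour of a Read Replica can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not Neo4j's implementation: transactions are plain strings, the transaction id is simply the position in the log, and the names LogShippingSketch, CoreServer, ReadReplica, transactionsSince and poll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of asynchronous transaction log shipping; not Neo4j code.
class LogShippingSketch {

    static class CoreServer {
        private final List<String> log = new ArrayList<>(); // committed transactions, in order

        void commit(String tx) {
            log.add(tx);
        }

        // Ship every transaction committed after the given (1-based) transaction id.
        List<String> transactionsSince(int lastTxId) {
            return new ArrayList<>(log.subList(lastTxId, log.size()));
        }
    }

    static class ReadReplica {
        final List<String> applied = new ArrayList<>(); // transactions applied so far

        // One polling round: ask the core for everything newer than what was applied.
        int poll(CoreServer core) {
            List<String> shipped = core.transactionsSince(applied.size());
            applied.addAll(shipped);
            return shipped.size();
        }
    }

    public static void main(String[] args) {
        CoreServer core = new CoreServer();
        ReadReplica replica = new ReadReplica();

        core.commit("tx1");
        core.commit("tx2");
        replica.poll(core);   // replica catches up with tx1 and tx2

        core.commit("tx3");   // replica is now stale until its next poll
        System.out.println("before poll: " + replica.applied);
        replica.poll(core);
        System.out.println("after poll:  " + replica.applied);
    }
}
```

Between two polls the replica can lag behind the Core Server, which is why a read that follows a write needs the additional bookmark mechanism to be guaranteed to see that write.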
Neo4j Clustering
Causal consistency:
– An application that writes to the graph (on a Core Server) is
guaranteed to see that write in a subsequent read request (from
a Read Replica)
– Implemented using a bookmark: on executing a
transaction, the client can ask for a bookmark which it
then presents as a parameter to subsequent transactions
– the cluster can ensure that only servers which have
processed the client’s bookmarked transaction will run its
next transaction. This provides a causal chain which
ensures correct read-after-write semantics from the
client’s point of view
Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
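The bookmark idea can be sketched as a simple filter over servers. The example assumes that a bookmark can be represented as the id of the client's last committed transaction and that each server tracks the newest transaction it has processed; BookmarkSketch, Server, lastAppliedTx and eligible are hypothetical names, and the real driver and cluster routing are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of bookmark-based routing; an illustration, not the Neo4j driver API.
class BookmarkSketch {

    static class Server {
        final String name;
        long lastAppliedTx; // id of the newest transaction this server has processed

        Server(String name, long lastAppliedTx) {
            this.name = name;
            this.lastAppliedTx = lastAppliedTx;
        }
    }

    // Names of the servers that have processed the bookmarked transaction
    // and may therefore run the client's next transaction.
    static List<String> eligible(List<Server> servers, long bookmark) {
        List<String> names = new ArrayList<>();
        for (Server s : servers) {
            if (s.lastAppliedTx >= bookmark) {
                names.add(s.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Server core = new Server("core-1", 42);
        Server upToDate = new Server("replica-1", 42);
        Server lagging = new Server("replica-2", 40);

        long bookmark = core.lastAppliedTx; // handed to the client after its write
        List<Server> cluster = Arrays.asList(core, upToDate, lagging);

        // replica-2 has not yet applied transaction 42, so it is skipped.
        System.out.println(eligible(cluster, bookmark));
    }
}
```

Once the lagging replica has caught up to the bookmarked transaction it becomes eligible again, so the guarantee costs availability of stale servers only for clients that present a newer bookmark.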
Further features
• Neo4j supports various Graph algorithms:
– Centrality Algorithms:
• E.g., PageRank
– Community Detection Algorithms:
• E.g. Connected Components
– Path Finding Algorithms:
• E.g. Minimum Weight Spanning Tree, Shortest Path
– Similarity Algorithms
• E.g. Euclidean Distance, Cosine Similarity
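To make the similarity measures concrete, the two examples named above can be written out in plain Java. This is a minimal sketch of the textbook formulas and not Neo4j's implementation; the class SimilaritySketch and its method names are hypothetical, and property values are assumed to be given as double arrays of equal length.

```java
// Textbook formulas for two similarity measures; not Neo4j's implementation.
class SimilaritySketch {

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    // Euclidean distance: square root of the summed squared differences.
    static double euclideanDistance(double[] a, double[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosineSimilarity(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {2.0, 4.0};
        System.out.println("euclidean: " + euclideanDistance(x, y));
        System.out.println("cosine:    " + cosineSimilarity(x, y));
    }
}
```

Euclidean distance grows with the magnitude of the difference, whereas cosine similarity only compares direction, so two vectors that are multiples of each other are far apart by the first measure but maximally similar by the second.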