COSC 6339 Big Data Analytics NoSQL Database Systems(IV) Graph DataBase systems:...


Page 1: COSC 6339 Big Data Analytics NoSQL Database Systems(IV) Graph DataBase systems: Neo4jgabriel/courses/cosc6339_f18/BDA_21... · 2018-11-08 · •Natural modeling of highly connected


COSC 6339

Big Data Analytics

NoSQL Database Systems(IV) –

Graph DataBase systems: Neo4j

Edgar Gabriel

Fall 2018

Graph database models

Image source: https://www.quackit.com/neo4j/tutorial/


Graph database models

• Data models for schema and instances are graph models or generalizations of them.

• Data manipulation is expressed as graph-based operations.

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf

Graph Databases Advantages

• Natural modeling of highly connected data

• Special graph storage structure

• Efficient schemaless graph algorithms

• Support for query languages.

• Operators to query the graph structure.

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


Graph Databases

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf

Neo4j

• Neo4j is a graph database, adopting a labeled property graph model.

• Terminology:

– vertices are called nodes

– edges are called relationships.

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher


Neo4j

• Nodes

– Nodes are typically used to represent entities (or complex value types).

– Nodes can have properties, which are key/value pairs. Values can be primitives or collections of primitives.

– Nodes can have zero or more relationships connecting them to other nodes.

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher

Neo4j

• Relationships

– Relationships connect nodes and give them context.

– A relationship must have a start node and an end node, so it always has a direction. The direction can be ignored at query time, so its presence does not mean it must be used.

– A relationship must have a relationship type.

– Relationships can have properties (key/value pairs; values can be primitives or collections of primitives).

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher


Neo4j

• Properties

– Nodes and relationships can have properties (key/value pairs; values can be primitives or collections of primitives).

– Properties can quantify relationships (for example, a weight or a date).

• Labels

– Nodes can have zero or more labels.

– Labels can represent roles, categories or types.

– Labels are used to define indexes and constraints.

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
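
The node, relationship, property, and label concepts above can be sketched with a tiny in-memory model in plain Java. This is only an illustration of the data model under simplifying assumptions, not Neo4j's API or storage format; the names PropertyGraphSketch, PNode, PRel, createNode, relate, and neighbours are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch of a labeled property graph (not the Neo4j API).
final class PropertyGraphSketch {

    // A node has zero or more labels and key/value properties.
    static final class PNode {
        final Set<String> labels = new HashSet<>();
        final Map<String, Object> properties = new HashMap<>();
    }

    // A relationship always has a type, a start node and an end node,
    // and may carry its own properties.
    static final class PRel {
        final String type;
        final PNode start;
        final PNode end;
        final Map<String, Object> properties = new HashMap<>();

        PRel(String type, PNode start, PNode end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    final List<PNode> nodes = new ArrayList<>();
    final List<PRel> relationships = new ArrayList<>();

    // Creates a node carrying the given labels (possibly none).
    PNode createNode(String... labels) {
        PNode n = new PNode();
        for (String label : labels) {
            n.labels.add(label);
        }
        nodes.add(n);
        return n;
    }

    // Creates a directed, typed relationship from start to end.
    PRel relate(PNode start, String type, PNode end) {
        PRel r = new PRel(type, start, end);
        relationships.add(r);
        return r;
    }

    // Direction is stored but can be ignored at query time: this returns
    // neighbours over relationships of the given type in either direction.
    List<PNode> neighbours(PNode n, String type) {
        List<PNode> result = new ArrayList<>();
        for (PRel r : relationships) {
            if (!r.type.equals(type)) {
                continue;
            }
            if (r.start == n) {
                result.add(r.end);
            } else if (r.end == n) {
                result.add(r.start);
            }
        }
        return result;
    }
}
```

For instance, after relating an Author node to a Paper node with type WROTE, calling neighbours on the Paper node finds the author even though the relationship points the other way.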

Neo4J API(I)

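import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Open (or create) an embedded database; DB_PATH is defined elsewhere.
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(DB_PATH);

// Create one node per author, paper and conference.
// Note: since Neo4j 2.0 these calls must run inside a transaction.
Node author1 = db.createNode();
Node author2 = db.createNode();
Node author3 = db.createNode();
Node paper1 = db.createNode();
Node paper2 = db.createNode();
Node paper3 = db.createNode();
Node paper4 = db.createNode();
Node paper5 = db.createNode();
Node conference1 = db.createNode();
Node conference2 = db.createNode();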

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


Neo4J API(II): Adding attributes

// Attributes are stored as key/value properties on the nodes.
author1.setProperty("firstname", "Peter");
author1.setProperty("lastname", "Smith");
author2.setProperty("firstname", "Juan");
author2.setProperty("lastname", "Perez");
author3.setProperty("firstname", "John");
author3.setProperty("lastname", "Smith");
conference1.setProperty("name", "ESWC");
conference2.setProperty("name", "ISWC");

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf

Neo4J API(III): Adding relationships

// Each relationship has a type, a start node and an end node.
RelationshipType relWrote = DynamicRelationshipType.withName("WROTE");
Relationship rel1 = author1.createRelationshipTo(paper1, relWrote);
Relationship rel2 = author1.createRelationshipTo(paper5, relWrote);
...
RelationshipType relCites = DynamicRelationshipType.withName("CITES");
Relationship rel8 = paper1.createRelationshipTo(paper2, relCites);
Relationship rel9 = paper3.createRelationshipTo(paper1, relCites);
...
RelationshipType rConf = DynamicRelationshipType.withName("Conference");
Relationship rel12 = paper1.createRelationshipTo(conference1, rConf);
Relationship rel13 = paper2.createRelationshipTo(conference2, rConf);
Relationship rel14 = paper3.createRelationshipTo(conference1, rConf);

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


Cypher

• Neo4J query language

• Mix of pattern-oriented and declarative query language

• Cypher patterns:

– Patterns are the fundamental way to describe traversals in Cypher

– Nodes are written in parentheses (resembling circles), e.g. (ident)

– Relationships are written as arrows, e.g. (ident)-->(ident2)

– Relationship identifiers are specified within square brackets, with an optional type after a colon, e.g. (u)-[r:HAS_ACCESS]->(a)

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher

Cypher Query Language

• CREATE:

– Create new nodes and relationships based on a pattern

• MATCH:

– Pattern matching against the graph, anchored at the starting point(s)

• MERGE:

– Match a pattern, or create it if it doesn't exist

• RETURN:

– What is projected out from the evaluation of the query.

• WHERE:

– Filtering criteria.

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


CREATE (a1:Author {Firstname:"Peter", Lastname:"Smith"})
CREATE (a2:Author {Firstname:"John", Lastname:"Smith"})
CREATE (a3:Author {Firstname:"Juan", Lastname:"Perez"})
CREATE (:Paper {Title:"Paper1"})
CREATE (:Paper {Title:"Paper2"})
CREATE (:Paper {Title:"Paper3"})
CREATE (:Conference {Title:"ISWC"})
CREATE (:Conference {Title:"ESWC"})

In (a1:Author {Firstname:"Peter", Lastname:"Smith"}), a1 is a variable to reference the node later, Author is the label, and the braces hold the properties.

CYPHER Examples (I)

• Query: The author named “Peter Smith”

MATCH (a:Author)
WHERE a.Firstname = 'Peter' AND a.Lastname = 'Smith'
RETURN a

or, equivalently (but shorter):

MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})
RETURN a

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


CYPHER setting relationships

MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'}),
      (p:Paper {Title:'Paper1'})
CREATE (a)-[r:WROTE]->(p)
RETURN r

MATCH (p1:Paper {Title:'Paper1'}), (p2:Paper {Title:'Paper2'})
CREATE (p1)-[r:CITES]->(p2)
RETURN r

• Query: Papers cited by a paper written by “Peter Smith”

MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})-[:WROTE]->()-[:CITES]->(p:Paper)
RETURN p

• Query: Number of papers cited by a paper written by “Peter Smith”

MATCH (a:Author {Firstname:'Peter', Lastname:'Smith'})-[:WROTE]->()-[:CITES]->(p:Paper)
RETURN COUNT(DISTINCT p)

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf


• Query: Papers cited by a paper written by Peter Smith that have at least 20 cites.

MATCH (:Author {Firstname:'Peter', Lastname:'Smith'})-[:WROTE]->()-[:CITES]->(p:Paper)
WITH DISTINCT p
MATCH (p)<-[:CITES]-(citing)
WITH p, COUNT(citing) AS cites
WHERE cites > 19
RETURN p

Slide source:

http://mayor2.dia.fi.upm.es/oeg-upm/files/eswc2014/Tutorials/SDMinGraphDataBases/SlidesTutorialGraphDatabases2014.pdf
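
The logic behind an "at least k cites" query can be sketched in plain Java over two maps: first collect the papers cited by any of the author's papers, then count incoming citations for each candidate and filter. This is a hand-written illustration, not how Neo4j evaluates Cypher; the class name CitationFilterSketch, the method citedAtLeast, and the map-based representation are assumptions made for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: the "cited at least k times" query over plain Java maps.
final class CitationFilterSketch {

    // wrote: author -> papers written by that author
    // cites: paper  -> papers it cites (assumed to contain no duplicates per paper)
    static Set<String> citedAtLeast(Map<String, List<String>> wrote,
                                    Map<String, List<String>> cites,
                                    String author, int minCites) {
        // Step 1: papers cited by any paper of the author (the pattern match).
        Set<String> candidates = new TreeSet<>();
        for (String paper : wrote.getOrDefault(author, Collections.emptyList())) {
            candidates.addAll(cites.getOrDefault(paper, Collections.emptyList()));
        }
        // Step 2: count incoming citations per candidate and filter on the count.
        Set<String> result = new TreeSet<>();
        for (String candidate : candidates) {
            int incoming = 0;
            for (List<String> cited : cites.values()) {
                if (cited.contains(candidate)) {
                    incoming++;
                }
            }
            if (incoming >= minCites) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

The key point is that the candidate paper has to be carried along with its count; filtering on the count alone would lose track of which paper the count belongs to.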

Importing Data into Neo4j

Neo4j supports importing data from CSV files:

LOAD CSV FROM 'file:///tmp/users.csv' AS line
MERGE (:User {id:toInteger(line[0]), name:line[1]});

LOAD CSV FROM 'file:///tmp/groups.csv' AS line
MERGE (:Group {id:toInteger(line[0]), name:line[1]});

LOAD CSV FROM 'file:///tmp/user_groups.csv' AS line
MATCH (u:User {id:toInteger(line[0])}), (g:Group {id:toInteger(line[1])})
MERGE (u)-[:IS_MEMBER]->(g);

Slide based on tutorial https://www.airpair.com/neo4j/posts/getting-started-with-neo4j-and-cypher
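
Why MERGE rather than CREATE for an import: MERGE matches the whole pattern or creates it, so loading the same row twice does not duplicate a node. The following plain-Java sketch mimics that behaviour for rows of the assumed form "id,name". It is not Neo4j code; the class MergeImportSketch, the method mergeUser, and the sample rows are hypothetical.

```java
import java.util.AbstractMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of MERGE-style ("match or create") loading of CSV rows.
final class MergeImportSketch {

    // Stands in for the set of :User nodes; each entry is an (id, name) pair.
    final Set<Map.Entry<Integer, String>> users = new LinkedHashSet<>();

    // Processes one CSV line of the assumed form "id,name".
    // Returns true if a new user was created, false if an identical one existed.
    boolean mergeUser(String csvLine) {
        String[] fields = csvLine.split(",", -1);
        int id = Integer.parseInt(fields[0].trim()); // plays the role of toInteger(line[0])
        String name = fields[1].trim();              // plays the role of line[1]
        // Like MERGE, the match is on the whole pattern (id and name together);
        // Set.add only inserts when no equal entry is present.
        return users.add(new AbstractMap.SimpleImmutableEntry<>(id, name));
    }
}
```

Calling mergeUser("1,alice") twice leaves a single entry in users, which is the idempotent behaviour that makes re-running an import safe.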


Neo4j Clustering

• Core servers:

– Replicate all transactions using the Raft protocol.

– A majority of the Core Servers in a cluster (N/2+1, using integer division) has to accept a transaction before it is safe to acknowledge the commit to the end-user application.

Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
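
The majority rule is simple integer arithmetic, sketched below in plain Java. The helper names QuorumSketch, majority, and toleratedFailures are hypothetical and only illustrate the N/2+1 formula; they are not part of any Neo4j API.

```java
// Hypothetical helper illustrating majority quorums among Core Servers.
final class QuorumSketch {

    // Smallest number of Core Servers that forms a majority of n (integer division).
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Number of Core Server failures a cluster of n can tolerate
    // while a majority is still available.
    static int toleratedFailures(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.println(n + " cores: majority " + majority(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Note that a cluster of 4 Core Servers tolerates only one failure, the same as a cluster of 3, which is why odd cluster sizes are the usual choice.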

Neo4j Clustering

• Read replicas:

– Scale out graph workloads (Cypher queries, etc.).

– Act like caches for the data that the Core Servers store.

– Are fully-fledged Neo4j databases capable of fulfilling arbitrary (read-only) graph queries.

– Are asynchronously replicated from Core Servers via transaction log shipping. A Read Replica periodically polls a Core Server for any new transactions that it has processed since the last poll, and the Core Server ships those transactions to the Read Replica.

– Provide causal consistency.

Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/


Neo4j Clustering

Causal consistency:

– An application that writes to the graph (on a Core Server) is guaranteed to see its write in a subsequent read request (from a Read Replica).

– Implemented using a bookmark: on executing a transaction, the client can ask for a bookmark, which it then presents as a parameter to subsequent transactions.

– The cluster can ensure that only servers which have processed the client's bookmarked transaction will run its next transaction. This provides a causal chain which ensures correct read-after-write semantics from the client's point of view.

Source: https://neo4j.com/docs/operations-manual/current/clustering/introduction/
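
The bookmark mechanism can be pictured as a filter over servers: a read that carries a bookmark may only run on a server that has already applied the bookmarked transaction. The simulation below is a minimal sketch in plain Java under the assumption that transactions are numbered by a monotonically increasing id; it is not Neo4j's driver API, and the names BookmarkSketch, Server, and eligibleServers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of bookmark-based causal consistency (not a Neo4j API).
final class BookmarkSketch {

    // Each server is represented only by the id of the last transaction it has applied.
    static final class Server {
        final String name;
        long lastAppliedTx;

        Server(String name, long lastAppliedTx) {
            this.name = name;
            this.lastAppliedTx = lastAppliedTx;
        }
    }

    // Returns the names of the servers that have already applied the
    // bookmarked transaction and may therefore serve the client's next read.
    static List<String> eligibleServers(List<Server> servers, long bookmark) {
        List<String> eligible = new ArrayList<>();
        for (Server s : servers) {
            if (s.lastAppliedTx >= bookmark) {
                eligible.add(s.name);
            }
        }
        return eligible;
    }
}
```

A Read Replica that lags behind the bookmark is simply skipped until it has caught up through log shipping, which is what gives the client read-after-write semantics.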

Further features

• Neo4j supports various Graph algorithms:

– Centrality Algorithms:

• E.g., PageRank

– Community Detection Algorithms:

• E.g. Connected Components

– Path Finding Algorithms:

• E.g. Minimum Weight Spanning Tree, Shortest Path

– Similarity Algorithms

• E.g. Euclidean Distance, Cosine Similarity
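
To give a feel for one of these algorithm families, here is a minimal plain-Java sketch of Connected Components on an undirected graph, using breadth-first search. It is a textbook version written for illustration, not the implementation that ships with Neo4j; the class ConnectedComponentsSketch and the edge-array input format are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: counting connected components with breadth-first search.
final class ConnectedComponentsSketch {

    // Nodes are numbered 0..n-1; each edge is a pair {u, v} treated as undirected.
    static int countComponents(int n, int[][] edges) {
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adjacency.get(e[0]).add(e[1]);
            adjacency.get(e[1]).add(e[0]);
        }

        boolean[] visited = new boolean[n];
        int components = 0;
        for (int start = 0; start < n; start++) {
            if (visited[start]) {
                continue;
            }
            // Every unvisited node starts a new component; BFS marks all its members.
            components++;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            visited[start] = true;
            while (!queue.isEmpty()) {
                int current = queue.poll();
                for (int next : adjacency.get(current)) {
                    if (!visited[next]) {
                        visited[next] = true;
                        queue.add(next);
                    }
                }
            }
        }
        return components;
    }
}
```

For example, five nodes with edges 0-1, 1-2 and 3-4 form two components: {0, 1, 2} and {3, 4}.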