A quick review of Python and Graph Databases

27
A quick review of Python and Graph Databases NIC CROUCH @FPHHOTCHIPS

Transcript of A quick review of Python and Graph Databases

A quick review of Python and Graph DatabasesNIC CROUCH

@FPHHOTCHIPS

Who am I?◦ Consultant at Deloitte Melbourne

in Enterprise Information Management

◦ Recent graduate of Flinders University in Adelaide

◦ Casual/Enthusiast reviewer of Graph Databases

What is a graph?“A set of objects connected by links” – Wikipedia

Objects: Vertices, nodes, points

Links: Edges, arcs, lines, relationships

Prior Work on Graphs in PythonGraph Database Patterns in Python – Elizabeth Ramirez, PyCon US 2015

Practical Graph/Network Analysis Made Simple – Eric Ma, PyCon US 2015

Graphs, Networks and Python: The Power of Interconnection – Lachlan Blackhall, PyCon AU 2014

An introduction to Python and graph databases with Neo4j - Holger Spill, PyCon NZ 2014

Mogwai: Graph Databases in your App – Cody Lee, PyTexas 2014

Today: Pythonic GraphsAn exploration of graph storage in Python:

◦ API must be Pythonic◦ execute(“<Not Python>”) doesn’t count.

◦ As little configuration as possible

Caveats:

◦ No configuration means no tuning

◦ Can’t compare distributed performance on a single node

◦ Limited to rough comparisons of performance – not a lab environment!

The Simple 1) Set up a dictionary of nodes

2) Each node keeps a list of relationships (or two, if you want a directed graph)

3) Set up add and get convenience methods

Pros:• Sometimes the simplest ways are the best• Very quick

Cons:• Not consistent• Probably going to need to be

maintained• Not persistent

The (slightly less) Simple1) Set up a Shelf of nodes

2) Each node keeps a list of relationships (or two, if you want a directed graph)

3) Set up add and get convenience methods

Pros:• Still reasonably quick

Cons:• Not consistent• Probably going to need to be

maintained

Off-topic: NetworkXAll the advantages of using a dictionary with none of the custom code.

◦ Comes with graph generators

◦ BSD Licenced

◦ Loads of standard analysis algorithms

◦ 90% test coverage

◦ … no persistence (except Pickle).

The Incumbent: Neo4jReleased in 2007

Written in Java

GPLv3/AGPLv3 or a commercial license

Runs as a server that exposes a REST Interface

Natively uses Cypher – an in-house developed graph query language

Best established, most popular graph-database

Easy to install – unzip and run a script

High Availability, but a little difficult to scale

Neo4j from PythonPy2Neo:

◦ Built by Nigel Small from Neo4j

◦ Actively maintained

Neo4j-rest-client

◦ Javier de la Rosa from University of Western Ontario

◦ Maintained through 9 months ago

neo4jdb-python

◦ Jacob Hansson of Neo4j

◦ Maintained through 8 months ago

◦ Mostly just wrappers around Cypher

Bulbflow:◦ Built by James Thornton of Pipem/Espeed

◦ Maintained to 8 months ago

◦ Connects to multiple backends

Py2Neo: SyntaxSet up a connection:

◦ graph=Graph("http://neo4j:password@localhost:7474/db/data/")

Create a node:◦ graph.create(Node("node_label", name=node_name))

◦ Node labels are like classes

Find a node:◦ graph.find_one("node_label", property_key="name",property_value=node_name)

Create a relationship:◦ graph.create(Relationship(node1, relationship, node2))

Find a relationship:

o graph.match_one(node1, relationship, node2, bidirectional=False)

Py2Neo: Good and BadThe good:

Simple API

Well documented

Easy to connect and get started.

Cool (if preliminary) spatial support

Not so much:◦ Skinny API

◦ No transaction support for Pythonic calls

◦ Performance struggles on large inputs

◦ No ORM (kinda)

neo4j-rest-client SyntaxSet up a connection:

◦ graph=GraphDatabase("http://localhost:7474/db/data/", username="username", password="password")

Create a node:◦ node=graph.nodes.create(name=node_name)

◦ Node labels are like classes

Find a node:◦ graph.nodes.filter(Q("name", iexact=node)).elements[0]

Create a relationship:◦ relationship=node1.is_related_to(node2)

neo4j-rest-client: Good and BadTransaction support with a context manager*

Strong filtering syntax

Very strong labelling syntax – searchable tags for nodes

Lazy evaluation of queries

Still REST based – still difficult to make it perform

*Seemingly. Somewhat difficult to make it work.

Py2Neo vs Neo4j-Rest-Client: Performance100 nodes with 20% connection:

Loading:Py2Neo: ~8 secondsNeo4j-rest-client: ~5 secondsPostgres: 4s

Retrieving:Py2Neo: ~6 secondsNeo4j-rest-client: ~5 secondsPostgres: 4s

1000 nodes with 20% connection:

Loading:Py2Neo: ~7 minutesNeo4j-rest-client: ~50 minutesPostgres: 6 minutes

Retrieving:Py2Neo: ~7 minutesNeo4j-rest-client: ~50 minutesPostgres: 6 minutes

Machine:AWS Memory Optimised xLarge node (30GB RAM) on Ubuntu Server using iPython2 3.0.0

Important noteCompletely unoptimised! No indexes, no attempt to chunk, only a couple OS optimisations.

OrientDB

PyOrient:◦ Official OrientDB Driver for Python

◦ Binary Driver

◦ Not Pythonic

Released in 2011

More NoSQL than Neo and Titan (Documents as well as graphs)

Scalable across multiple servers

Supports SQL

TitanFirst released in 2012

Written in Java

Licenced under Apache Licence

Many storage backends, including Cassandra, HBase and BerkeleyDB

Hadoop integration

Large amount of search back-ends

Built for scalability

Commercially supported by DataStax (formerly Aurelius)

Titan and PythonMogwai:

◦ Written by Cody Lee of wellaware

◦ Binary Driver for RexPro Server

◦ Very pythonic!

Bulbflow:◦ Built by James Thornton of Pipem/Espeed

◦ REST-based interface

◦ Maintained to 8 months ago

◦ Connects to multiple backends

RexPro and theTinkerpop StackApache Incubator Open Source Graph Framework

◦ Built around Gremlin

◦ Written in Java

◦ Extensively documented

Mogwai Performance100 nodes with 20% connection:

Loading:14 seconds

Retrieving:18 seconds

1000 nodes with 20% connection:

Loading:~9 minutes

Retrieving:~25 minutes

So, what should I use?*

Neo4j:◦ Good, relatively quick

bindings

◦ Well supported

◦ Could be expensive

◦ May not scale

*The full title of this slide is “What should I research further to ensure it meets my specific needs and then consider using?” In any case, the answer is still “It depends”

It depends.

Titan:◦ Good bindings

◦ Support in doubt

◦ Should be cheaper

◦ Proven scalability

Orient:◦ Poor bindings

◦ Well supported

◦ Open pricing structure

◦ Should scale well

What about Python Graph Databases?Not just Python bindings –pure(ish) Python.

GrapheekDB: https://bitbucket.org/nidusfr/grapheekdb◦ Uses local memory, Kyoto Cabinet or Symas LMDB as backend

◦ Under active development

◦ Exposes client/server interface

◦ Code is Beta quality at best

◦ Documentation is very spotty

Ajgu: https://bitbucket.org/amirouche/ajgu-graphdb/

◦ Uses Berkeley Database backend

◦ Under active development

◦ “This program is alpha becarful”

◦ Python 3 only

AjguSet up a connection:

◦ graph = GraphDatabase(Storage('./BSDDB/graph'))

Create a node:◦ transaction = self.graph.transaction(sync=True) ◦ node = transaction.vertex.create(node)

Find a node:◦ transaction.vertex.label(start)

Create a relationship:◦ relationship=transaction.edge.create(node1,node2)

Take-awaysGraphs match plenty of data sets

The big three Graph Databases are Neo4j, Titan and Orient

All three have upsides and downsides – depending on the usecase.

If you want to have a bit more fun, try Ajgu or Grapheek!

Thanks! Questions?

[email protected]

@fphhotchips

Py2Neo: Performance andTransactional SupportLarge imports should be done in one transaction to decrease overhead:

Graph.create(long_list_of_nodes_and_relationships)

This kills the client (essentially hangs in string processing).

So:for chunk in izip_longest(*[iter(iterator)]*size, fillvalue=''):

try:chunk = chunk[0:chunk.index('')]

except ValueError:pass

try:self.graph.create(*chunk)

except Exception as ex:pass #chunk dividing goes here

We lose ACID at this point.

What if this fails? Have to chunk it up again to find what failed.