
Neo4j (with a side of Clojure)

Steven Tobin Senior Data Scientist @ Web Summit


Bog-standard caveats

• Views are mine, not representative of Web Summit in any way, shape or form (except: graph DBs are cool)

• Views are further cobbled together and may not be 100%, y'know, correct (cf. anything to do with RDBMS)

• Don't design or build production systems based solely on this talk, unless you are also me

Outline

• Graphs - mathematical basis and areas where they are powerful tools

• Neo4J graph database

• Neo4J's query language Cypher

• Interop with Neo4J - REST API

• Extending Neo4J - JVM/Java API

• Our current Clojure use

• Fin

Background

• Data scientist at Web Summit / Ci Labs

• Most recently working on our graph databases for in-conference recommendations and chat

• Before: applied mathematics postdoc at University of Melbourne

• Research focus: algorithms for information transfer / flow over graphs, reconciling the mathematical graph with the real world system it represented

Graphs 101

• Basically, a mathematical model of relationships between objects: edges joining vertices

• Useful for modelling networks of any type: infrastructure (e.g., road and rail networks), communication (e.g., the web), social; and the propagation of information through those networks

• Graphs can be directed or undirected.

• We'll only be looking at the simplest graphs: nodes joined by edges. More exotic graphs exist, e.g., multigraphs (several parallel edges between the same pair of nodes) and nested graphs (each node is a graph itself)

G = (V,E)
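A minimal sketch of G = (V, E) in Python, for anyone who prefers code to notation. The vertex names and the `adjacency` helper are made up for illustration and are not from the talk.

```python
# Toy graph G = (V, E); vertex names are invented for illustration.
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}  # each edge is (source, target)


def adjacency(vertices, edges, directed=True):
    """Map each vertex to the set of vertices one edge away from it."""
    adj = {v: set() for v in vertices}
    for src, dst in edges:
        adj[src].add(dst)
        if not directed:
            adj[dst].add(src)  # undirected: the edge counts in both directions
    return adj


directed_adj = adjacency(V, E)
undirected_adj = adjacency(V, E, directed=False)
```

The same edge set gives different neighbourhoods depending on whether direction is respected, which is the directed / undirected distinction above.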

Graphs 102

Graph analysis

Disrupting USSR supply chain with minimal effort

Graph databases

• Lots of data consists of relations (edges) between things (nodes) - good fit for graph representation

• Explicit: Steven is friends with Mike

• Implicit: Steven works for Web Summit, Steven uses Neo4J

• Nested: Steven is friends with Mike who works for Web Summit which organises CollisionConf 2016
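The relations in these bullets can be sketched as (node, RELATIONSHIP, node) triples. The relationship type names and the `follow` helper below are assumptions chosen for illustration, not anything Neo4J defines.

```python
# The example relations as (subject, RELATIONSHIP, object) triples.
# Relationship type names are invented for this sketch.
triples = [
    ("Steven", "FRIENDS_WITH", "Mike"),
    ("Steven", "WORKS_FOR", "Web Summit"),
    ("Steven", "USES", "Neo4J"),
    ("Mike", "WORKS_FOR", "Web Summit"),
    ("Web Summit", "ORGANISES", "CollisionConf 2016"),
]


def follow(start, *rel_types):
    """Follow a chain of relationship types from start; return the end nodes."""
    frontier = {start}
    for rel in rel_types:
        frontier = {o for (s, r, o) in triples if s in frontier and r == rel}
    return frontier


# The "nested" case: friend -> employer -> event.
nested = follow("Steven", "FRIENDS_WITH", "WORKS_FOR", "ORGANISES")
```

Each hop is just another edge lookup, so the nested case is no harder to express than the explicit one.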

Graph vs traditional databases

• Traditional databases are a good fit for highly structured data, and tend to be much more performant for tabular lookups

• select user_name from users where id = 1 is faster in Postgres than in Neo4J

• But: traversing relationships normally uses table JOIN operations - expensive computationally

• Graphs allow many-to-many relationships trivially: "every JOIN is precomputed". Structure is just nodes and edges

• match (a:user {id: 1})<-[:FOLLOWS]-(b:user) return b.name is faster in Neo4J than in Postgres (note the direction of the arrow)
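A rough sketch of the difference in shape between the two approaches, using SQLite from the Python standard library as a stand-in for Postgres and a plain dict as a stand-in for a graph store. The table layout and data are invented, and this shows only how the lookups are structured; it is not a benchmark.

```python
import sqlite3

# Relational version: follow relations live in a join table and are found with JOINs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER);
    INSERT INTO users VALUES (1, 'steve'), (2, 'mike'), (3, 'anna');
    INSERT INTO follows VALUES (2, 1), (3, 1), (1, 3);
""")
sql_followers = sorted(
    row[0]
    for row in conn.execute(
        """
        SELECT b.user_name
        FROM users AS a
        JOIN follows AS f ON f.followed_id = a.id
        JOIN users AS b ON b.id = f.follower_id
        WHERE a.id = ?
        """,
        (1,),
    )
)

# Graph-style version: each node already holds references to its neighbours,
# so the "join" is a direct lookup followed by a walk over the stored edges.
names = {1: "steve", 2: "mike", 3: "anna"}
followed_by = {1: [2, 3], 2: [], 3: [1]}  # node id -> ids of its followers
graph_followers = sorted(names[b] for b in followed_by[1])
```

Both versions return the same followers; the graph version never has to search a table to find them, which is the sense in which "every JOIN is precomputed".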

Neo4J

• Most popular graph database

• Launched 1.0 in 2010, 2.0 (i.e., production ready) Dec 2013. Young but actively developed

• Written in Java and Scala, originally developed as an include-able library

• Everything is either: node (Person), edge (FRIENDS), property (name, age)

• Performant as long as you can fit the graph in RAM: scaling can be an issue

Cypher

Neo4J's query language, designed to represent graphs

Kind of ASCII art

(Node)-[:RELATIONSHIP]-(Node)

(a {property:fee})-[:DIRECTED {other:bar}]->(b:NodeLabel)

Can be used over:

• command line interface (neo4j-shell)

• web browser interface (in-built)

• REST API (later in talk)

Cypher: simple example

MATCH (p:Person {name:"Steve"})<-[r:FOLLOWS]-(n)

RETURN p.id, p.name, n.name

Reading the query:

• p, r, n: variables/identifiers

• :Person: node type (e.g., for indexing)

• {name:"Steve"}: attribute

• <-[r:FOLLOWS]-: relationship (type and direction)

Match everyone who follows anyone called Steve; get Steve's ID and name and the followers' names
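To make the pattern concrete, here is a toy Python emulation of what that MATCH does over a tiny in-memory graph. The data and the `match_followers` function are invented for illustration; this is not how Neo4J evaluates Cypher internally.

```python
# Toy in-memory graph; all data is invented for illustration.
nodes = {
    1: {"labels": {"Person"}, "id": 1, "name": "Steve"},
    2: {"labels": {"Person"}, "id": 2, "name": "Mike"},
    3: {"labels": {"Person"}, "id": 3, "name": "Anna"},
    4: {"labels": {"Person"}, "id": 4, "name": "Steve"},
}
# (start, TYPE, end) means start -[:TYPE]-> end
relationships = [
    (2, "FOLLOWS", 1),
    (3, "FOLLOWS", 1),
    (1, "FOLLOWS", 3),
    (3, "FOLLOWS", 4),
]


def match_followers(name):
    """Emulate: MATCH (p:Person {name: name})<-[r:FOLLOWS]-(n) RETURN p.id, p.name, n.name"""
    rows = []
    for start, rel_type, end in relationships:
        p, n = nodes[end], nodes[start]  # the arrow points at p, so p is the end node
        if rel_type == "FOLLOWS" and "Person" in p["labels"] and p["name"] == name:
            rows.append((p["id"], p["name"], n["name"]))
    return sorted(rows)


rows = match_followers("Steve")
```

Note that there are two Steves in the toy data and both are matched, and that the edge where Steve follows Anna is ignored because the arrow points the other way.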

Cypher: more complicated example

Neo4J + Cypher

Neo4J REST API

• POST Cypher (wrapped in JSON) to http://hostname:7474/db/data/transaction/commit

• DELETE cancels transaction in progress

• Clunky and awkward to use manually: better to use libraries

• Caveat: data science means Python, so I'm more familiar with py2neo than with the Clojure libraries
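A sketch of what "Cypher wrapped in JSON" looks like when built by hand with the Python standard library. The `build_cypher_request` helper is hypothetical, "hostname" is the placeholder from the slide, and the `{name}` parameter syntax assumes a Neo4J 2.x server; the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_cypher_request(statement, parameters=None, host="hostname", port=7474):
    """Build (but do not send) a POST for the transactional commit endpoint."""
    url = f"http://{host}:{port}/db/data/transaction/commit"
    # The endpoint expects a list of statements, each with optional parameters.
    payload = {"statements": [{"statement": statement, "parameters": parameters or {}}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_cypher_request(
    # {name} is the Neo4J 2.x parameter placeholder (an assumption about server version).
    "MATCH (p:Person {name: {name}})<-[:FOLLOWS]-(n) RETURN p.id, p.name, n.name",
    {"name": "Steve"},
)
# With a running server the request could be sent with:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

Even this small example shows why a library is nicer: the JSON envelope, headers and error handling are all boilerplate you would otherwise repeat for every query.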

Neo4J REST API

Neo4J and Clojure part I: neocons

• More Clojuric approach to graphs

• Useful if you don't like Cypher or want more programmatic control rather than tinkering with strings

neocons continued

• Supports Cypher if you need it

• Primarily how we use py2neo: if you have optimised Cypher you won't gain much (might even lose) performance

• Interesting question: how to increase performance?

Extending Neo4J

• Large Cypher queries can be slow

• If you are repeatedly executing the same task it can be worth it to dive deeper into the JVM, potential for huge optimisation

• Remember: Neo4J was designed originally as a library, so extension is fairly straightforward

Extending Neo4J

• Possible to create server extensions: compiled JARs which define new HTTP endpoints

• No interpretation overhead: directly acting on and manipulating graph objects

• The ugly: primarily Java, limited docs, handwriting graph traversal and result generation code

• but 100x performance increase for our primary queries

Actual fake example time

Cypher query to get topics an Attendee is interested in

returns attendees who are interested in any interest

• Easy to understand and extend

• Execute over REST: no worries

• Sub-sub-sub part of much more complex queries we needed to run

Getting interests, the JVM way

Neo4J and Clojure part II: extending the server

• Graph traversal and server extension is a good fit for functional / recursive programming

• Neo4J can technically be extended by any JVM language - several internal components are Scala. Why not Clojure?

• Docs are thin on the ground, I didn't have enough experience or time

• Would love to try to build this out though: writing close to neocons but compiling into the server process would be a huge win IMO
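To illustrate why traversal suits a recursive, functional style, here is a small depth-limited traversal in Python using immutable sets. It is a generic sketch with invented data and a hypothetical `reachable` function; it does not use the Neo4J traversal API.

```python
def reachable(adj, node, depth, seen=frozenset()):
    """Return every node reachable from node in at most depth hops."""
    if depth == 0:
        return frozenset()
    seen = seen | {node}  # nodes already on the current path; never mutated in place
    found = frozenset()
    for neighbour in adj.get(node, ()):
        if neighbour not in seen:
            # The neighbour itself, plus whatever it can reach with one hop fewer.
            found |= {neighbour} | reachable(adj, neighbour, depth - 1, seen)
    return found


# Toy directed graph: a -> b -> c -> {a, d}
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
two_hops = reachable(adj, "a", 2)
three_hops = reachable(adj, "a", 3)
```

The function carries no mutable state between calls, so each branch of the traversal can be reasoned about (or run) independently.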

Neo4J and Clojure, the third

• Lots of Java so far, but no actual Clojure

• We needed to put something between Neo4J and the outside world - unilateral decision time!

• Screw it, time to deploy Clojure to production

Neo4J and Clojure part III: filling the gaps

• Caching and pseudo-backups: prevent downtime in case of Neo4J issues. Replace a complicated point of failure with a simple and redundant one

• Allowed backwards compatibility with older (low traffic) APIs, avoiding more Java - transforming data is an ideal use case

• Easily restricted access to the more open Neo4J port in favour of a single-use restricted port

• Scale to thousands of reqs/sec on small instances. Scaling Neo4J itself is more difficult
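The caching idea can be sketched in a few lines. This is Python rather than Clojure, a dict stands in for Redis, and the `CachedBackend` class and its parameters are hypothetical; it only illustrates the "serve the last good answer when the database is unhappy" behaviour described above, not the production service.

```python
import time


class CachedBackend:
    """Serve from a backend, falling back to the last good answer if it fails."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable(key) -> value; stands in for a Neo4J call
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}         # key -> (stored_at, value); stands in for Redis

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # fresh enough: do not touch the backend
        try:
            value = self._fetch(key)
        except Exception:
            if hit is not None:
                return hit[1]    # backend down: serve the stale copy
            raise                # nothing cached, so surface the failure
        self._cache[key] = (now, value)
        return value
```

The layer in front is simple enough to run redundantly, which is the point of replacing the complicated point of failure.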

Neo4J and Clojure part III: filling the gaps

• HTTP: http-kit (async capable, not fully utilised), routing with Compojure

• Redis: Carmine is awesome

• Logging with timbre

• JSON manipulation: cheshire

• Simple deployment: uberjar + upstart + AWS AMI, trivial to launch boxes into our load balancer as needed

Closing

• Graph DBs are awesome at the things they are awesome at. But make sure you're doing one of those things

• Pairing with a traditional DB can give best-of-both-worlds performance and capabilities

• Graphs/networks are an elegant and powerful mathematical tool that you should keep in mind

Check out

• networkx: pure python network library. Useful for data explorations & modelling and algorithmic work

• Gephi: graph visualisation and manipulation. Clunky but highly functional

• MATLAB/Mathematica/Octave: to look at matrix & lin alg variants of graph algorithms

Questions?

Adjacency matrix

Neighbours between i, j: A_(i, j) > 0

Next nearest neighbours between i, j: (A^2)_(i, j) > 0

PageRank is (approximately) the principal eigenvector of a weighted matrix representation of the web
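A pure-Python sketch of both ideas on a toy four-node path graph: squaring the adjacency matrix to find next nearest neighbours, and a basic power iteration for PageRank. The graph, the helper names and the damping factor of 0.85 (the conventional choice) are assumptions for illustration; a real implementation would use a linear algebra library.

```python
# Undirected toy graph on 4 nodes: edges 0-1, 1-2, 2-3 (a path).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


A2 = matmul(A, A)  # A2[i][j] counts walks of length 2 from i to j


def pagerank(adj_matrix, damping=0.85, iterations=100):
    """Power iteration on the damped random-walk matrix built from adj_matrix."""
    n = len(adj_matrix)
    out_degree = [sum(row) for row in adj_matrix]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if adj_matrix[i][j]:
                        new_rank[j] += damping * rank[i] * adj_matrix[i][j] / out_degree[i]
        rank = new_rank
    return rank


ranks = pagerank(A)
```

In this graph nodes 0 and 2 are next nearest neighbours (A2[0][2] is non-zero) while 0 and 3 are not, and the two middle nodes end up with a higher rank than the two end nodes.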