Using MongoDB as a high performance graph database

Transcript of Using MongoDB as a high performance graph database

Page 1: Using MongoDB as a high performance graph database

Page 2: Using MongoDB as a high performance graph database

Using MongoDB as a high performance graph database

MongoDB UK, 20th June 2012

Chris Clarke, CTO, Talis Education Limited

Who is Talis?

Using Mongo for about 8 months (since 2.0); 5 months in production

Page 3: Using MongoDB as a high performance graph database

What this talk is not about

A blueprint for what you should do
A pitch to encourage you to take our approach
Providing or proving performance benchmarks
Evangelism for the semantic web or linked data
Encouraging you to contribute/download/use an open source project
Optimised for your use case

Although we can talk to you about any of the above (see me after)

Page 4: Using MongoDB as a high performance graph database

So, what is this talk about?

Our journey of using MongoDB as a high performance graph database
Specifically the software wrapper we implemented on top of Mongo to give us a leg up in terms of scalability and performance
To give you some ideas for how to work with graph data models if you'd like to use document databases

Page 5: Using MongoDB as a high performance graph database

GRAPHS 101

Apologies
Nodes and edges, or resources and properties
Really easy to represent facts

Page 6: Using MongoDB as a high performance graph database

John knows Jane

[Diagram: John - knows - Jane]

Ball and stick diagrams
This is an undirected graph. It implies that John knows Jane and Jane knows John. The property has no directional significance.

Page 7: Using MongoDB as a high performance graph database

[Diagram: John - knows - Jane]

John knows Jane
Jane knows John

This is an undirected graph. It implies that John knows Jane and Jane knows John. The property has no directional significance.

Page 8: Using MongoDB as a high performance graph database

[Diagram: John - knows -> Jane]

John knows Jane
Jane ? John

This is a directed graph. The relationship is one way. To add Jane knows John we need a second property.

We will only use directed graphs from here on, as they are more specific

Page 9: Using MongoDB as a high performance graph database

[Diagram: John - knows -> Jane and Jane - knows -> John]

John knows Jane
Jane knows John

Page 10: Using MongoDB as a high performance graph database

Triples + RDF 101

Page 11: Using MongoDB as a high performance graph database

Subject Property Object

John knows Jane

This is a triple

Property = predicate

Page 12: Using MongoDB as a high performance graph database

Subject Property Object

John knows Jane

Jane knows John

This is a second triple
The same resource can be a subject or an object

Page 13: Using MongoDB as a high performance graph database

Subject Property Object

http://example.com/John   http://xmlns.com/foaf/0.1/knows   http://example.com/Jane

RDF
Resources and properties as URIs
URIs can be dereferenced
Can share common property descriptions (RDF Schemas)
Here using FOAF - billions if not trillions of triples defined using FOAF

Page 14: Using MongoDB as a high performance graph database

Subject Property Object

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

http://example.com/John   foaf:knows   http://example.com/Jane
http://example.com/John   foaf:name    "John"

Namespaces for readability

In RDF subjects are always URIs
But objects can be literals, i.e. plain text
Many RDF/graph databases allow you to further type literals as dates, numbers, etc.

Page 15: Using MongoDB as a high performance graph database

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Subject Property Object

http://example.com/John   foaf:name    "John"
http://example.com/John   rdf:type     foaf:Person
http://example.com/Jane   foaf:name    "Jane"
http://example.com/Jane   rdf:type     foaf:Person
http://example.com/John   foaf:knows   http://example.com/Jane
http://example.com/Jane   foaf:knows   http://example.com/John

Here we type John and Jane as foaf:Person using rdf:type

Note both John and Jane appear as subjects and objects

This RDF graph represents six facts

Page 16: Using MongoDB as a high performance graph database

[Diagram: example:John and example:Jane are each rdf:type foaf:Person, with foaf:name "John" and "Jane", and are linked to each other by foaf:knows in both directions]

Here it is in ball and stick

Page 17: Using MongoDB as a high performance graph database

FFS! I can do that in two minutes in BSON

Page 18: Using MongoDB as a high performance graph database

> db.people.find()
{
  _id: ObjectId('123'),
  name: 'John',
  knows: [ObjectId('456')]
},
{
  _id: ObjectId('456'),
  name: 'Jane',
  knows: [ObjectId('123')]
}

Yes, you can!
Data only makes sense inside your db, though
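To make the trade-off concrete, here is a minimal mongo shell sketch (against the hypothetical people collection above) of how a graph hop works in this model: each edge you follow becomes another query, i.e. an application-side join.

// Follow the 'knows' edge from John: one query per hop
var john = db.people.findOne({name: 'John'});
// Fetch everyone John knows in a single round trip via $in on the stored ObjectIds
var friends = db.people.find({_id: {$in: john.knows}}).toArray();
friends.forEach(function(f) { print(john.name + ' knows ' + f.name); });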

Page 19: Using MongoDB as a high performance graph database

http://sheikspear.blogspot.co.uk/2011/07/simples.html

Talk over, right?
We can all go home

Page 20: Using MongoDB as a high performance graph database

Some useful stuff, using RDF

Let's look at some reasons why we think RDF is good

Page 21: Using MongoDB as a high performance graph database

This is the linked open data cloud

Linked data is a way of publishing RDF on the open web

Search for Tim Berners-Lee's TED talk on linked data to hear why he cares about this

Each blob on this diagram represents an open, interlinked dataset. The lines between them represent the interlinking between data sets

Billions of public “facts” and growing exponentially from sites such as BBC, governments, Last.fm, Wikipedia

Page 22: Using MongoDB as a high performance graph database

Merging data from different sources is really easy

Because the format is subject, predicate, object, the shape of RDF is always the same. Because schemas are public and widely shared, the same properties are used all over the place.
Really easy to use this data in your own app and remix it

Page 23: Using MongoDB as a high performance graph database

[Diagram: Dataset A and Dataset B each describe example:John; one states example:John rdf:type foaf:Person, the other example:John foaf:name "John"]

Page 24: Using MongoDB as a high performance graph database

[Diagram: Dataset A+B, merged: example:John rdf:type foaf:Person; example:John foaf:name "John"]

Really easy to merge graphs
"Designed in" to the data format
Lots of existing tooling to do this

Page 25: Using MongoDB as a high performance graph database

RDF query language: SPARQL

Page 26: Using MongoDB as a high performance graph database

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
ORDER BY ?name
LIMIT 50

SPARQL is mega flexible. Lots of functions for grouping, walking graphs, pattern matching, inference, UNIONS, Geo extensions etc. etc. - all that shit. Most if not all of those datasets will have a SPARQL endpoint you can query

Page 27: Using MongoDB as a high performance graph database

SELECT      Tabular
DESCRIBE    Graph
ASK         Boolean
CONSTRUCT   Graph

4 main query types

Page 28: Using MongoDB as a high performance graph database

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
ORDER BY ?name
LIMIT 50

FFS! That looks like SQL!

Yes it does. The WHERE clause is basically doing a shit load of joins. I’ll come back to that.

Page 29: Using MongoDB as a high performance graph database

Offline conversion process

Application DB (SQL or other)  ->  Triple store + SPARQL

Most datasets on the LOD diagram don't exist natively as linked data and RDF. They are post-produced.
Data is not held natively, so a conversion script is needed, which must be maintained and updated every time the app schema changes
Data is not up to date (1 hour, 1 day, 1 month behind?)

Page 30: Using MongoDB as a high performance graph database

Our innovation: Native Linked Data

Applications

We started working on these applications back in 2008

They are natively linked data, so they solve the conversion + currency issue

There is no other “format” or schema the data is stored in, it’s native RDF

When you have no schema, and you can integrate data from elsewhere on the web, it’s addictive

Page 31: Using MongoDB as a high performance graph database

Our problem: FFS! For applications, we need humongous scale and performance

Those applications are becoming rather popular with our users...

sub 50ms query time

Modern web apps need speed and data scale

Outgrown triple store and SPARQL

SPARQL is very flexible and expressive. It's also expensive. SPARQL is great for data sets where the questions you can ask are limitless, but our applications need a data layer where speed is measured in single-digit ms.

Complex caching (w/ Memcache) to achieve performance and scalability
90:10 read:write

Page 32: Using MongoDB as a high performance graph database

Tripod

It's a pod for our triples
A triple store designed for applications and scalability
Based on Mongo

Page 33: Using MongoDB as a high performance graph database

Functional requirements:
• Order of magnitude increase in perf/scale
• Graph-orientated interface

Non-functional requirements:
• Strong community

Existing code is very graph orientated

Page 34: Using MongoDB as a high performance graph database

Core data format
Tripod API
Dealing with complex queries
Tripod Tables
Free text search

Walk through Tripod looking at 5 areas

Page 35: Using MongoDB as a high performance graph database

{
  'http://example.com/John' : {
    'http://purl.org/dc/elements/1.1/name' : [
      { value: 'John', type: 'literal' }
    ],
    'http://purl.org/dc/elements/1.1/knows' : [
      { value: 'http://example.com/Jane', type: 'uri' }
    ]
  },
  'http://example.com/Jane' : {
    'http://purl.org/dc/elements/1.1/name' : [
      { value: 'Jane', type: 'literal' }
    ],
    'http://purl.org/dc/elements/1.1/knows' : [
      { value: 'http://example.com/John', type: 'uri' },
      { value: 'http://example.com/James', type: 'uri' }
    ]
  }
}

RDF/JSON - a serialisation of RDF in JSON

Neither disk-space efficient nor readable

Fully-formed property URIs are not compatible with Mongo key names (dot notation)

Even single values are wrapped inside an array (problems for compound indexing)
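For illustration only, a mongo shell sketch of the compound-indexing problem hinted at above (the collection and field names are made up): when every value lives in an array, a compound index over two such fields runs into MongoDB's restriction on indexing parallel arrays.

// Hypothetical collection storing RDF/JSON-style documents where every value is an array
db.rdfjson.insert({
  _id: 'http://example.com/John',
  name:  [{value: 'John', type: 'literal'}],
  knows: [{value: 'http://example.com/Jane', type: 'uri'}]
});
// Building a compound index across two array-valued fields fails with
// "cannot index parallel arrays"
db.rdfjson.ensureIndex({'name.value': 1, 'knows.value': 1});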

Page 36: Using MongoDB as a high performance graph database

> db.CBD_people.find()
{
  _id: 'http://example.com/John',
  'foaf:name': {l: 'John'},
  'foaf:knows': {u: 'http://example.com/Jane'}
},
{
  _id: 'http://example.com/Jane',
  'foaf:name': {l: 'Jane'},
  'foaf:knows': [
    {u: 'http://example.com/John'},
    {u: 'http://example.com/James'}
  ]
}

Same semantics

2 documents here

Concise Bounded Descriptions - all data known about a subject, one relationship deep

One document per subject per collection, keyed (and thus enforced) by Subject URI

Property names are namespaced

CBD collections are deemed as read/write in Tripod
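As a rough sketch (the index here is our assumption, not necessarily what Tripod creates), queries against the CBD collection above stay simple because a multikey index covers both the single-object and array forms of a property:

// Describe John: a single _id lookup, since the Subject URI is the key
db.CBD_people.findOne({_id: 'http://example.com/John'});
// Reverse lookup: who knows John? Works whether 'foaf:knows' is one object or an array
db.CBD_people.ensureIndex({'foaf:knows.u': 1});
db.CBD_people.find({'foaf:knows.u': 'http://example.com/John'});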

Page 37: Using MongoDB as a high performance graph database

class MongoGraph extends SimpleGraph {
  function add_tripod_array($tarray)
  function to_tripod_array($docId)
}

All of our app code already uses SimpleGraph from a library called Moriarty (Google Code)

A simple extension which can ingest/output the data format on the previous slide

Page 38: Using MongoDB as a high performance graph database

Core data format
Tripod API
Dealing with complex queries
Tripod Tables
Free text search

Walk through Tripod looking at 5 areas

Page 39: Using MongoDB as a high performance graph database

interface ITripod
{
  public function select($query, $fields, $sortBy=null, $limit=null);
  public function describeResource($resource);
  public function describeResources(Array $resources);
  public function saveChanges($oldGraph, $newGraph);
  public function search($query);
}

Almost the same as our existing data access API onto generic triple store

All of these methods return graphs, all are mega-simple queries on the CBD collections

None of these methods support joins (WHERE clause in SPARQL)

Page 40: Using MongoDB as a high performance graph database

public function describeResource($resource)
{
  $query = array("_id" => $resource);
  $bson = $this->getCollection()->findOne($query);
  $graph = new MongoGraph();
  $graph->add_tripod_data($bson);
  return $graph;
}

These methods are mega simple to implement as they translate to really simple Mongo queries on the CBD collections, returning single objects
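Purely as an illustration (not Tripod's actual internals), the other read methods reduce to equally simple shell-level queries: describeResources() is just an $in over _ids, and select() a find() with a field projection.

// describeResources(): one $in query over the subject URIs
db.CBD_people.find({_id: {$in: ['http://example.com/John', 'http://example.com/Jane']}});
// select(): a find() with a projection, no joins anywhere
db.CBD_people.find({'foaf:knows.u': 'http://example.com/John'}, {'foaf:name': 1});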

Page 41: Using MongoDB as a high performance graph database

interface ITripod
{
  public function select($query, $fields, $sortBy=null, $limit=null);
  public function describeResource($resource);
  public function describeResources(Array $resources);
  public function saveChanges($oldGraph, $newGraph);
  public function search($query);

  public function getViewForResource($resource, $viewType);
  public function getViewForResources(Array $resources, $viewType);
  public function getViews(Array $filter, $viewType);
}

Some extra methods to deal with complex queries involving joins

Page 42: Using MongoDB as a high performance graph database

Core data format
Tripod API
Dealing with complex queries
Tripod Tables
Free text search

2 things we realised when looking at our applications

Page 43: Using MongoDB as a high performance graph database

DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList
         ?author ?usedBy ?creator ?libraryNote ?publisher
WHERE {
  OPTIONAL {
    <http://example.com/foo> resource:contains ?sectionOrItem .
    OPTIONAL {
      ?sectionOrItem resource:resource ?resource .
      OPTIONAL { ?resource dcterms:isPartOf ?document . }
      OPTIONAL {
        ?resource bibo:authorList ?authorList .
        OPTIONAL { ?authorList ?p ?author . }
      }
      OPTIONAL { ?resource dcterms:publisher ?publisher . }
    }
    OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
  } .
  OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
  OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}

Typical SPARQL query in our app

9 “joins” in this query

Page 44: Using MongoDB as a high performance graph database

DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList
         ?author ?usedBy ?creator ?libraryNote ?publisher
WHERE {
  OPTIONAL {
    <http://example.com/foo> resource:contains ?sectionOrItem .
    OPTIONAL {
      ?sectionOrItem resource:resource ?resource .
      OPTIONAL { ?resource dcterms:isPartOf ?document . }
      OPTIONAL {
        ?resource bibo:authorList ?authorList .
        OPTIONAL { ?authorList ?p ?author . }
      }
      OPTIONAL { ?resource dcterms:publisher ?publisher . }
    }
    OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
  } .
  OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
  OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}

The only thing that changes at run time in this query is this URI

The flexibility of SPARQL is great for the developer but terrible here for system performance

Query engine needs to join 9 times! Flexibility costs us every time we run this query!

This is why we hid it behind a cache

Page 45: Using MongoDB as a high performance graph database

join count
follow sequences (n times)
join across databases

All the above with a condition

include certain properties
include all properties

2nd thing

We only make use of minimal SPARQL

And some of these aren’t even well supported in SPARQL (sequences + join across databases)

Page 46: Using MongoDB as a high performance graph database

Materialised views, generated infrequently, read often

Remember 90:10 read:update

View specifications based on a subset of SPARQL

Views are for DESCRIBE like queries where all the data is brought back in one hit (not tabular data)

Page 47: Using MongoDB as a high performance graph database

{ _id: "v_resource_brief", from: "CBD_harvest", type: "http:\/\/talisaspire.com\/schema#Resource", include: ["rdf:type", "dct:subject", "dct:isVersionOf", "searchterms:usedAt", "dc:identifier"], joins: { "acorn:preferredMetadata": [], "acorn:listReferences": { include: ["acorn:list"] }, "acorn:bookmarkReferences": { include: ["acorn:bookmark"] }, "dcterms:isPartOf": [], "acorn:partReferences": { include: ["dct:hasPart"], joins: { "dct:hasPart": { joins: { "acorn:preferredMetadata": [] } } } } }}

A view specification - itself a document that can be stored in Mongo

8 keywords:

type, from, include, joins, ttl, followSequence, maxJoins, counts

Page 48: Using MongoDB as a high performance graph database

Generated by incremental MapReduce when:
1) Data is changed
2) TTL expires

Tripod can take these specifications and manage views in a special collection within the DB.

They expire and are regenerated automatically (and incrementally)

Incremental map reduce inside the DB

Fast, interleaves with reads
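Purely illustrative (the map/reduce functions below stand in for the real view-spec-driven ones), this is the shape of an incremental map-reduce in the mongo shell: regenerate only the subjects touched by a change and fold the output back into the views collection.

// Incremental map-reduce from a CBD collection into the views collection
var mapView = function() {
  // emit a (viewId, partial graph) pair per subject document
  emit({'rdf:resource': this._id, type: 'v_resource_brief'}, {graphs: [this]});
};
var reduceView = function(key, values) {
  var out = {graphs: []};
  values.forEach(function(v) { out.graphs = out.graphs.concat(v.graphs); });
  return out;
};
db.CBD_harvest.mapReduce(mapView, reduceView, {
  query: {_id: 'http://talisaspire.com/examples/1'},  // only the changed subject
  out: {reduce: 'views'}                               // merge into existing views
});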

Page 49: Using MongoDB as a high performance graph database

> db.views.findOne()
{
  "_id" : {
    "rdf:resource" : "http://talisaspire.com/examples/1",
    "type" : "v_resource_full"
  },
  "value" : {
    "graphs" : [
      {
        "_id" : "http://talisaspire.com/examples/1",
        "rdf:type" : {
          "type" : "uri",
          "value" : "http://talisaspire.com/schema#Resource"
        }
      }
    ],
    "impactIndex" : [
      { "rdf:resource" : "http://talisaspire.com/examples/1" }
    ]
  }
}

This is what a view looks like

The _id is a composite key of the view type and root resource
Graphs is a collection of CBDs

MongoGraph we displayed earlier can take this and represent it as a unified graph to the application
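For example (a sketch only, with the view type name taken from the document above), getViewForResource() becomes a single findOne on the composite key, with no joins at read time:

db.views.findOne({
  _id: {'rdf:resource': 'http://talisaspire.com/examples/1', type: 'v_resource_full'}
});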

Impact index - A watch list of resources. When resources are saved the impact index is queried to find views that need invalidating

TTL is an alternative. If set in the view spec, a timestamp is stored in the view to determine when it can be invalidated
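A hedged sketch of the invalidation path (field paths assume the view document shown above): on save, the impact index is queried for views that watch the changed subject, and those views are dropped so they get regenerated.

// After saving this resource, remove any views whose impact index watches it
var changed = 'http://talisaspire.com/examples/1';
db.views.remove({'value.impactIndex.rdf:resource': changed});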

Page 50: Using MongoDB as a high performance graph database

Match views to data update rate

Page 51: Using MongoDB as a high performance graph database

Core data format
Tripod API
Dealing with complex queries
Tripod Tables
Free text search

Tripod Tables are for larger datasets which cannot be brought back in one hit

They can be paged or have individual columns indexed for fast sort capability

Page 52: Using MongoDB as a high performance graph database

SELECT ?listName ?listUri
WHERE {
  ?resource bibo:isbn10 "$isbn"
  UNION
  { ?resource bibo:isbn10 "$isbnLowerCase" . }
  ?item resource:resource ?resource .
  UNION
  {
    ?resourcePartOf bibo:isbn10 "$isbn" .
    UNION
    { ?resourcePartOf bibo:isbn10 "$isbnLowerCase" . }
    ?resourcePartOf dct:hasPart ?resource .
    ?item resource:resource ?resource .
  }
  ?listUri resource:contains ?item .
  ?listUri sioc:name ?listName .
  ?listUri rdf:type resource:List
}
LIMIT 10
OFFSET 40

This is a SELECT query that brings back a two-column result

OFFSET

LIMIT

Page 53: Using MongoDB as a high performance graph database

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="label"/>
    <variable name="type"/>
  </head>
  <results>
    <result>
      <binding name="label">
        <literal>Tropical grassland</literal>
      </binding>
      <binding name="type">
        <uri>http://purl.org/ontology/wo/TerrestrialHabitat</uri>
      </binding>
    </result>
    <result>
      <binding name="label">
        <literal>Grassy field</literal>
      </binding>
      <binding name="type">
        <uri>http://purl.org/ontology/wo/TerrestrialHabitat</uri>
      </binding>
    </result>
  </results>
</sparql>

SPARQL SELECT results - tabular format - here in XML

Page 54: Using MongoDB as a high performance graph database

> db.t_resource.findOne()
{
  "_id" : "http://talisaspire.com/resources/3SplCtWGPqEyXcDiyhHQpA-2",
  "value" : {
    "type" : [
      "http://purl.org/ontology/bibo/Book",
      "http://talisaspire.com/schema#Resource"
    ],
    "isbn" : "9780393929690",
    "isbn13" : [
      "9780393929691",
      "9780393929691-2",
      "9780393929691-3"
    ],
    "impactIndex" : [
      "http://talisaspire.com/works/4d101f63c10a6"
    ]
  }
}

This time our map reduce doesn’t create one doc as with materialised views

We get one doc per row
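To illustrate the paging and per-column sorting mentioned earlier (the index and filter values are assumptions based on the document above, not Tripod's actual configuration): table rows are plain documents, so paging and sorting are ordinary Mongo operations.

// An index on the sorted column keeps paging fast
db.t_resource.ensureIndex({'value.isbn': 1});
db.t_resource.find({'value.type': 'http://purl.org/ontology/bibo/Book'})
             .sort({'value.isbn': 1})
             .skip(40)
             .limit(10);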

Page 55: Using MongoDB as a high performance graph database

Core data format
Tripod API
Dealing with complex queries
Tripod Tables
Free text search

Our triple store included free text search

We wanted to stream updates into ElasticSearch or some other search solution

When documents are saved, the same specification language is used to build search document format docs and submit them to an endpoint

We like ElasticSearch but you could use Amazon CloudSearch

Page 56: Using MongoDB as a high performance graph database

Limitations

Map Reduce is used as a non-blocking db.eval(), and also to work around the synchronous PHP programming model

PHP only for now - our web apps were PHP

To get a SPARQL endpoint we are exporting data out to Fuseki - this solves the mapping but not the currency (for SPARQL)

Page 57: Using MongoDB as a high performance graph database

Future

Node JS port
Use as a server, not a library
Eliminate dependency on map reduce
Specification version control
Tap into the oplog for a streaming approach into Fuseki and other locations
Named graph support
Further optimisation of the data model
Maybe open source

Page 58: Using MongoDB as a high performance graph database

That’s it

Page 59: Using MongoDB as a high performance graph database

Questions?

Find us on:

Web: talisaspire.com
Twitter: @talisaspire
YouTube: youtube.com/user/TalisAspire
Facebook: facebook.com/talisaspire
Support: support.talisaspire.com

Questions?

Page 60: Using MongoDB as a high performance graph database

Find us on:

Web: talisaspire.com
Twitter: @talisaspire
YouTube: youtube.com/user/TalisAspire
Facebook: facebook.com/talisaspire
Support: support.talisaspire.com
