Wanderu - Lessons from Building a Travel Site with Neo4j
Wanderu: Lessons Learned
Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j
Eddy Wong, CTO, Wanderu.com
@eddywongch
About Wanderu.com
Search Engine for (Intercity) Buses and Trains
Demo
From pt A to pt B
A: Boston B: DC
NYC
Nomenclature: Stations, Trips
Amtrak, $101, 09/26/2013
Bolt, $25, 09/26/2013
Mega, $24, 09/26/2013
From pt A to pt B
B: Brooklyn, NY
A: Cambridge, MA
31st St & 9th Ave, NYC
South Station, Boston
28th St & 7th Ave, NYC
34th St & 8th Ave, NYC
Our Story
• Tech Started about 1+ yr ago
• Beta in Mar, Launch in Aug
• Knew nothing about Neo4j when we started (Jun 2012)
• Did not like the relational model: wanted schema-less and no self-joins
• Wanted a graph model
Relational vs. Graph
Lessons
Learned / Unlearned
Idea
• Architectural
• Modeling
• Geo
Architectural Lessons
Art: MC Escher
Our Story
• Started with MongoDB as a general store: easy to manipulate and organize data
• Wanted a db that could preserve the Graph Model
• Debated: Document vs. Graph
• Could not find one single db that could do both: general store + graph
Workflow
Store
Scraping JSON
Bus Websites Non-uniform Data
Uniform Data
Server
noSQL
• You need to make a choice of one noSQL database
• You need ONE (centralized) database
• The word “database” is a loaded term
• Lots of (very different) noSQL db options
Our Situation
• Data is written only in one direction
• Users search for paths, then segments
• Searches are done by date
• Needed online capability
• Trip info (price/avail) could change on some
Our Solution
• Use Both: MongoDB + Neo4j
• “Docugraph” = Document + Graph
• Syncing two kinds of databases
• Eventual consistency
Pipeline
Scraping JSON
Bus Websites Non-uniform Data
Uniform Data
MongoDB -> MongoConn -> Neo4j
Nodes & Edges
Replica Mechanism
MongoConnector
• MongoDB Labs project, open source, unsupported
• Uses Replica Mechanism: Oplog
• Eventually Consistent (not real time)
• Written in Python
• Main methods: Upserts and Deletes, passes doc
• Implement DocMgr->Neo4jDocMgr->py2neo
• Other impls: MongoDocMgr, SolrDocMgr, ESDocMgr
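The doc manager idea above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the real MongoConnector or py2neo API: the class name GraphDocManager and the in-memory nodes/edges containers that stand in for Neo4j are assumptions, and the trip field names follow the Trips example used in this talk.

```python
class GraphDocManager:
    """Hypothetical doc manager: receives documents replayed from the oplog
    and mirrors them into a graph (plain Python containers stand in for Neo4j)."""

    def __init__(self):
        self.nodes = set()   # station ids
        self.edges = {}      # document _id -> (depart_id, arrive_id)

    def upsert(self, doc):
        """Insert or update the edge that corresponds to one trip document."""
        depart, arrive = doc["depart_id"], doc["arrive_id"]
        self.nodes.update((depart, arrive))          # stations appear on the fly
        self.edges[doc["_id"]] = (depart, arrive)    # same _id overwrites

    def remove(self, doc):
        """Delete the edge for a trip document; stations are left in place."""
        self.edges.pop(doc["_id"], None)


manager = GraphDocManager()
manager.upsert({"_id": 1, "depart_id": "BOS", "arrive_id": "NYC", "price": 24.0})
manager.upsert({"_id": 2, "depart_id": "NYC", "arrive_id": "DC", "price": 20.0})
manager.upsert({"_id": 1, "depart_id": "BOS", "arrive_id": "NYC", "price": 25.0})  # update
manager.remove({"_id": 2})
print(sorted(manager.nodes), manager.edges)
```

Because the connector replays changes after they hit MongoDB, the graph lags slightly behind the document store, which is the eventual consistency mentioned above.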
Populating Neo4j (2)
• Created our own way of creating Edges
• Auto Node creation when Edge is created: Could add Stations (nodes) on the fly
• py2neo requires 2 “node ref”s to create an edge, i.e., it might need two round trips to Neo4j
Edge Creator P-code
hashtable allStations = load_stations
w_create_edge (station_id a, station_id b, otherdata)
  look_up a in allStations
  If found -> ref_a = allStations.get(a)
  If not found ->
    ref_a = py2neo.create_node(a)
    Add a to allStations
  ...
  py2neo.create_edge(ref_a, ref_b, ...)
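A runnable version of the P-code above might look like the sketch below. It does not call py2neo: FakeGraph is a hypothetical stand-in that only counts round trips, and the names EdgeCreator and _node_ref are assumptions made for illustration. The point is that the station cache removes the extra node lookups once a station has been seen.

```python
class FakeGraph:
    """Hypothetical stand-in for a graph client; every call is one round trip."""

    def __init__(self):
        self.round_trips = 0
        self.nodes = []
        self.edges = []

    def create_node(self, station_id):
        self.round_trips += 1
        self.nodes.append(station_id)
        return len(self.nodes) - 1          # the "node ref"

    def create_edge(self, ref_a, ref_b, **props):
        self.round_trips += 1
        self.edges.append((ref_a, ref_b, props))


class EdgeCreator:
    def __init__(self, graph, known_stations=None):
        self.graph = graph
        self.all_stations = dict(known_stations or {})   # station id -> node ref

    def _node_ref(self, station_id):
        ref = self.all_stations.get(station_id)
        if ref is None:                                   # first time we see it
            ref = self.graph.create_node(station_id)
            self.all_stations[station_id] = ref
        return ref

    def create_edge(self, a, b, **otherdata):
        ref_a = self._node_ref(a)
        ref_b = self._node_ref(b)
        self.graph.create_edge(ref_a, ref_b, **otherdata)


graph = FakeGraph()
creator = EdgeCreator(graph)
creator.create_edge("BOS", "NYC", carrier="Bolt")      # 2 node creations + 1 edge
creator.create_edge("BOS", "NYC", carrier="Megabus")   # both refs cached: 1 edge
creator.create_edge("NYC", "DC", carrier="Bolt")       # 1 node creation + 1 edge
print(graph.round_trips)
```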
Pipeline
Scraping JSON
Bus Websites Non-uniform Data
MongoDB
Neo4j
MongoConn
Nodes & Edges
Replica Mechanism
REST Server
BOS, NYC
BOS, PHL
NYC, DC
NYC, PHL
Modeling Lessons
Art: MC Escher
Our Story
• We tried to “dump” all data into Neo4j
• Stations -> Nodes, Trips -> Edges
• Problem: Edges had dates -> too many Edges -> “Super Node”
• Query perf was terrible (1+ mins) and got worse as the number of edges increased
Our Story (2)
• Went from Cypher to Gremlin, thinking that would improve performance
• Needed range queries on Edges
Our Solution
• Don’t store everything in Neo4j, only metadata
• Use Neo4j as an index
• Don’t store entities in Nodes, only keys
• Don’t store heavy properties in Edges
Neo4j Model
source: Tobias Lindaaker, Wes Freeman
Neo4j Runtime Model
• Relationships are in a linked list
• Properties are in a linked list
• Therefore: There is NO random access for Relationships or Properties
• A range query over relationships requires a full scan
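To make the full-scan point concrete, here is a toy model of a relationship chain. It is a simplification for illustration only, not Neo4j's actual record layout: the Rel class, the depart_day property and the sample dates are assumptions. With a singly linked list, a date-range filter has to visit every relationship of the node, no matter how few match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rel:
    depart_day: int                 # e.g. 20130926
    next: Optional["Rel"] = None    # next relationship of the same node


def build_chain(days):
    """Link the relationships of one node together, in insertion order."""
    head = None
    for day in reversed(days):
        head = Rel(day, head)
    return head


def range_scan(head, lo, hi):
    """Return matching days and how many relationships had to be visited."""
    matches, visited = [], 0
    rel = head
    while rel is not None:          # no random access: walk the whole chain
        visited += 1
        if lo <= rel.depart_day <= hi:
            matches.append(rel.depart_day)
        rel = rel.next
    return matches, visited


chain = build_chain([20130924, 20130926, 20130925, 20130930, 20130926])
matches, visited = range_scan(chain, 20130926, 20130926)
print(matches, visited)
```

Two relationships match, but all five are visited. On a node with very many dated edges, that walk is the cost of every query.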
Our Solution (2)
• Needed ability to do range queries on Edges
• Serve paths from Neo4j, segments from MongoDB
• The one thing we tried to avoid we ended up doing: Joins
• Came up with “Docugraph” approach
Docugraph
• MongoDB Collections for Nodes and Edges
• Neo4j: Only keys for nodes
• Neo4j: Only Properties relevant for queries
Nodes & Edges
• Collection for Stations (nodes)
{id: "BOS", name: "Boston South Station", address: "Summer St", ...}
• Collection for Trips (edges)
{depart_id: "BOS", arrive_id: "NYC", carrier: "Megabus", price: 24.0, ...}
Modeling
• Storing info in two or more dbs
• Doing a “join” across multiple dbs
Joins across DBs
MongoDB: Stations Neo4j: Nodes
BOS BOS
NYC NYC
DC DC
... ...
MongoDB: Trips Neo4j: Edges
BOS-NYC BOS-NYC
BOS-DC BOS-DC
NYC-DC NYC-DC
... ...
• Forget seq id generated by dbs
• Use a human-created long string for id
• Convert pair into id: depart-arrive
• For example: BOS-NYC
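The shared key is what makes the cross-database "join" possible. A minimal sketch, using plain Python lists and dicts in place of the MongoDB collection and the Neo4j edges; the helper name route_id and the sample trips (carriers and prices) are illustrative assumptions:

```python
def route_id(depart_id, arrive_id):
    """Human-readable key shared by both stores, e.g. 'BOS-NYC'."""
    return f"{depart_id}-{arrive_id}"


# Document side: full trip records (illustrative values).
trips = [
    {"depart_id": "BOS", "arrive_id": "NYC", "carrier": "Megabus", "price": 24.0},
    {"depart_id": "BOS", "arrive_id": "NYC", "carrier": "Bolt", "price": 25.0},
    {"depart_id": "NYC", "arrive_id": "DC", "carrier": "Bolt", "price": 20.0},
]

# Graph side: only keys, no heavy properties.
graph_edges = ["BOS-NYC", "NYC-DC", "BOS-DC"]

# Group the documents by the same key the graph uses.
trips_by_route = {}
for trip in trips:
    key = route_id(trip["depart_id"], trip["arrive_id"])
    trips_by_route.setdefault(key, []).append(trip)

# The "join": for each edge key from the graph, fetch the matching documents.
joined = {key: trips_by_route.get(key, []) for key in graph_edges}
print({key: len(found) for key, found in joined.items()})
```

Because the key is built from station ids rather than a database-generated sequence, either store can be rebuilt without breaking the link between them.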
Indexing Technique
• Index Trips by {origin-dest, datetime}
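An index keyed by (origin-dest, datetime) turns "all trips on this route on this date" into a range lookup instead of a scan. In MongoDB this would typically be a compound index on the two fields. The sketch below only imitates the idea in memory with a sorted list and bisect; the TripIndex class and the sample departures are assumptions.

```python
import bisect
from datetime import datetime


class TripIndex:
    """Toy compound index: entries kept sorted by (route_id, departs_at)."""

    def __init__(self):
        self._keys = []    # sorted (route_id, departs_at) tuples
        self._trips = []   # trip payloads, in the same order as _keys

    def add(self, route_id, departs_at, trip):
        key = (route_id, departs_at)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._trips.insert(pos, trip)

    def between(self, route_id, start, end):
        """Trips on one route departing within [start, end]."""
        lo = bisect.bisect_left(self._keys, (route_id, start))
        hi = bisect.bisect_right(self._keys, (route_id, end))
        return self._trips[lo:hi]


index = TripIndex()
index.add("BOS-NYC", datetime(2013, 9, 26, 8, 0), "Bolt 08:00")
index.add("BOS-NYC", datetime(2013, 9, 27, 9, 0), "Megabus 09:00 (next day)")
index.add("BOS-NYC", datetime(2013, 9, 26, 12, 30), "Megabus 12:30")
index.add("NYC-DC", datetime(2013, 9, 26, 10, 0), "Bolt 10:00")

same_day = index.between(
    "BOS-NYC", datetime(2013, 9, 26, 0, 0), datetime(2013, 9, 26, 23, 59)
)
print(same_day)
```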
Querying
• REST API in node.js
• Assemble results from two sources
• Paths from Neo4j
• Segments from MongoDB
• Sort by price, duration
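The query flow above can be sketched end to end. The production API was written in node.js; Python is used here only to keep the examples in one language. GRAPH stands in for Neo4j (keys only) and SEGMENTS for MongoDB; the function names, prices and durations are illustrative assumptions.

```python
from itertools import product

# Graph side: station keys and connections only.
GRAPH = {"BOS": ["NYC", "PHL"], "NYC": ["DC", "PHL"]}

# Document side: bookable segments per route key (illustrative values).
SEGMENTS = {
    "BOS-NYC": [
        {"carrier": "Megabus", "price": 24.0, "minutes": 260},
        {"carrier": "Bolt", "price": 25.0, "minutes": 255},
    ],
    "NYC-DC": [{"carrier": "Bolt", "price": 20.0, "minutes": 270}],
}


def find_paths(graph, origin, dest, max_hops=3, _path=None):
    """Depth-first search for simple paths of at most max_hops edges."""
    path = (_path or []) + [origin]
    if origin == dest:
        return [path]
    if len(path) > max_hops:
        return []
    found = []
    for nxt in graph.get(origin, []):
        if nxt not in path:
            found.extend(find_paths(graph, nxt, dest, max_hops, path))
    return found


def search(origin, dest):
    """Paths from the graph, segments from the document store, then sort."""
    results = []
    for path in find_paths(GRAPH, origin, dest):
        legs = [f"{a}-{b}" for a, b in zip(path, path[1:])]
        options = [SEGMENTS.get(leg, []) for leg in legs]
        for combo in product(*options):     # one segment choice per leg
            results.append({
                "route": legs,
                "carriers": [seg["carrier"] for seg in combo],
                "price": sum(seg["price"] for seg in combo),
                "minutes": sum(seg["minutes"] for seg in combo),
            })
    return sorted(results, key=lambda r: (r["price"], r["minutes"]))


results = search("BOS", "DC")
for itinerary in results:
    print(itinerary)
```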
Geo Lessons
Art: MC Escher
Our Story
• Wanted to mix public transport data with intercity data
• Did not want to host all public transport data
• Created a hybrid solution
Our Solution
• Hybrid:
• Google Autocomplete
• Google Maps
• In house station geo lookup
Geo
• Neo4j geo functionality was not available out of the box
• Requires a jar install
• Requires running a Java program to index
• Needed better documentation
• Ended up using MongoDB geo instead
• Wish: geo functionality out of the box
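A station geo lookup boils down to "which station is closest to this point". MongoDB's geospatial indexes answer that server-side; the pure-Python sketch below only illustrates the lookup itself. The coordinates are approximate, and the names STATIONS, haversine_km and nearest_station are assumptions for illustration.

```python
import math

# Approximate (latitude, longitude) pairs, for illustration only.
STATIONS = {
    "BOS": (42.352, -71.055),   # around Boston South Station
    "NYC": (40.750, -73.993),   # around midtown Manhattan
    "DC": (38.897, -77.006),    # around Washington Union Station
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_station(point):
    """Id of the station closest to the given (lat, lon) point."""
    return min(STATIONS, key=lambda sid: haversine_km(point, STATIONS[sid]))


cambridge = (42.373, -71.119)   # approximate, Cambridge, MA
brooklyn = (40.678, -73.944)    # approximate, Brooklyn, NY
print(nearest_station(cambridge), nearest_station(brooklyn))
```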
Conclusions
• Even with a join across dbs -> solution better than relational
• 10s of paths x 100s of segments vs. 500k x 500k
• Glad to have picked Neo4j: doing content gen and more geo features now
• Graph model will be useful for future analytics->Big Data
Useful Links
• Neo4j Internals
slideshare.net/thobe/an-overview-of-neo4j-internals
• Aseem’s Lessons Learned with Neo4j
http://aseemk.com/talks/neo4j-lessons-learned#/14
• Wes Freeman, Neo4j Internals
http://wes.skeweredrook.com/graphdb-meetup-may-2013.pdf
• MongoConnector
blog.mongodb.org/post/29127828146/introducing-mongo-connector