Wanderu - Lessons from Building a Travel Site with Neo4j

Wanderu: Lessons Learned

Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j

Eddy Wong, CTO, Wanderu.com

@eddywongch

About Wanderu.com

Search Engine for (Intercity) Buses and Trains

From pt A to pt B

A: Boston B: DC

NYC

Nomenclature: Stations, Trips

Amtrak, $101, 09/26/2013

Bolt, $25, 09/26/2013 Mega, $24, 09/26/2013

From pt A to pt B

B: Brooklyn, NY

A: Cambridge, MA

31st & 9th Ave, NYC

South Station, Boston

28th & 7th Ave, NYC

34th & 8th Ave, NYC

Our Story

• Tech Started about 1+ yr ago

• Beta in Mar, Launch in Aug

• Knew nothing about Neo4j when we started (Jun 2012)

• Did not like the relational model: wanted schema-less and no self-joins

• Wanted a graph model

Relational vs. Graph

Lessons

Learned

Unlearned

Idea

• Architectural

• Modeling

• Geo

Architectural Lessons

Art: MC Escher

Our Story

• Started with MongoDB as a general store: easy to manipulate and organize data

• Wanted a db that could preserve the Graph Model

• Debated: Document vs. Graph

• Could not find one single db that could do both: general store + graph

Workflow

Store

Scraping JSON

Bus Websites Non-uniform Data

Uniform Data

Server

noSQL

• You need to make a choice of one noSQL database

• You need ONE (centralized) database

• The word “database” is a loaded term

• Lots of (very different) noSQL db options

Our Situation

• Data is written only in one direction

• Users search for paths, then segments

• Searches are done by date

• Needed online capability

• Trip info (price/avail) could change on some

Our Solution

• Use Both: MongoDB + Neo4j

• “Docugraph” = Document + Graph

• Syncing two kinds of databases

• Eventual consistency

Pipeline

Scraping JSON


Uniform Data

MongoDB

Neo4j

MongoConn

Nodes & Edges

Replica Mechanism

MongoConnector

• MongoDB Lab project, open source, unsupported

• Uses Replica Mechanism: Oplog

• Eventually Consistent (not real time)

• Written in Python

• Main methods: Upserts and Deletes, passes doc

• Implement DocMgr->Neo4jDocMgr->py2neo

• Other impls: MongoDocMgr, SolrDocMgr, ESDocMgr
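A minimal sketch of what a Neo4j doc manager could look like, assuming the connector hands each changed document to an upsert or delete method as the bullets above describe. The class names, the method signatures, and the in-memory InMemoryGraph stand-in are assumptions for illustration; this is not the MongoConnector or py2neo API.

```python
class InMemoryGraph:
    """Hypothetical stand-in for a Neo4j client; not the py2neo API."""

    def __init__(self):
        self.nodes = {}  # station id -> properties
        self.edges = {}  # "depart-arrive" key -> properties


class Neo4jDocManager:
    """Mirrors documents replayed from the MongoDB oplog into a graph."""

    def __init__(self, graph):
        self.graph = graph

    @staticmethod
    def _is_trip(doc):
        # Assumption: a trip document carries both endpoints,
        # anything else is treated as a station.
        return "depart_id" in doc and "arrive_id" in doc

    def upsert(self, doc):
        if self._is_trip(doc):
            key = f"{doc['depart_id']}-{doc['arrive_id']}"
            self.graph.edges[key] = {"carrier": doc.get("carrier")}
        else:
            self.graph.nodes[doc["id"]] = {}

    def remove(self, doc):
        if self._is_trip(doc):
            key = f"{doc['depart_id']}-{doc['arrive_id']}"
            self.graph.edges.pop(key, None)
        else:
            self.graph.nodes.pop(doc["id"], None)
```

The sketch drops a route edge as soon as any trip on that route is deleted; a real manager would probably need to check whether other trips still use the route.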

Populating Neo4j (2)

• Created our own way of creating Edges

• Auto Node creation when Edge is created: Could add Stations (nodes) on the fly

• py2neo requires 2 “node ref”s to create an edge, ie. might need two round trips to Neo4j

Edge Creator P-code

hashtable allStations = load_stations

w_create_edge(station_id a, station_id b, otherdata)
    look_up a in allStations
    if found -> ref_a = allStations.get(a)
    if not found ->
        ref_a = py2neo.create_node(a)
        add a to allStations
    ...
    py2neo.create_edge(ref_a, ref_b, ...)
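The p-code above can be sketched in Python roughly as follows. FakeGraphClient is a hypothetical stand-in that only counts round trips; its create_node and create_edge methods follow the names in the p-code and are not the real py2neo API. The sketch assumes station b is looked up the same way as station a.

```python
class FakeGraphClient:
    """Hypothetical stand-in for the graph client; counts round trips."""

    def __init__(self):
        self.round_trips = 0
        self.nodes = []
        self.edges = []

    def create_node(self, station_id):
        self.round_trips += 1
        self.nodes.append(station_id)
        return len(self.nodes) - 1  # position doubles as the node ref

    def create_edge(self, ref_a, ref_b, **props):
        self.round_trips += 1
        self.edges.append((ref_a, ref_b, props))


class EdgeCreator:
    """Creates edges, adding missing station nodes on the fly."""

    def __init__(self, client, known_stations=None):
        self.client = client
        # Cache of station id -> node ref, so known stations cost no lookup.
        self.all_stations = dict(known_stations or {})

    def _ref(self, station_id):
        ref = self.all_stations.get(station_id)
        if ref is None:
            ref = self.client.create_node(station_id)
            self.all_stations[station_id] = ref
        return ref

    def create_edge(self, a, b, **other):
        ref_a = self._ref(a)
        ref_b = self._ref(b)
        self.client.create_edge(ref_a, ref_b, **other)
```

With the cache warm, each new edge costs a single call to the client instead of two node lookups plus the edge creation.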

Pipeline

Scraping JSON


MongoDB

Neo4j

MongoConn

Nodes & Edges

Replica Mechanism

REST Server

BOS, NYC

BOS, PHL

NYC, DC

NYC, PHL

Modeling Lessons

Art: MC Escher

Our Story

• We tried to “dump” all data into Neo4j

• Stations -> Nodes, Trips -> Edges

• Problem: Edges had dates -> too many Edges -> “Super Node”

• Query perf was terrible (1+ mins) and worse as # edges increased

Our Story (2)

• Went from Cypher to Gremlin, thinking that it would improve performance

• Needed range queries on Edges

Our Solution

• Don’t store everything in Neo4j, only metadata

• Use Neo4j as an index

• Don’t store entities in Nodes, only keys

• Don’t store heavy properties in Edges

Neo4j Model

source: Tobias Lindaaker, Wes Freeman

Neo4j Runtime Model

• Relationships are in a linked list

• Properties are in a linked list

• Therefore: There is NO random access for Relationships or Properties

• A range query on relationships requires a full scan
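A toy illustration of why a linked list forces a full scan, assuming each relationship record only points to the next one. The Rel class and its depart_day field are hypothetical and do not reflect Neo4j's actual record layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rel:
    """One relationship record in a singly linked chain (toy model)."""
    depart_day: int
    next: Optional["Rel"] = None


def build_chain(days):
    """Link the given days into a chain, preserving their order."""
    head = None
    for day in reversed(list(days)):
        head = Rel(day, head)
    return head


def range_scan(head, lo, hi):
    """Return (matching days, number of records visited)."""
    matches, visited = [], 0
    rel = head
    while rel is not None:
        visited += 1  # every record is touched, matching or not
        if lo <= rel.depart_day <= hi:
            matches.append(rel.depart_day)
        rel = rel.next
    return matches, visited
```

Asking for three days out of a chain of 1,000 relationships still visits all 1,000 records, which is the "Super Node" cost described earlier.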

Our Solution (2)

• Needed ability to do range queries on Edges

• Serve paths from Neo4j, segments from MongoDB

• The one thing we tried to avoid we ended up doing: Joins

• Came up with “Docugraph” approach

Docugraph

• MongoDB Collections for Nodes and Edges

• Neo4j: Only keys for nodes

• Neo4j: Only Properties relevant for queries

Nodes & Edges

• Collection for Stations (nodes)

{id: “BOS”, name: “Boston South Station”, address: “Summer St”, ...}

• Collection for Trips (edges)

{depart_id: “BOS”, arrive_id: “NYC”, carrier: “Megabus”, price: 24.0, ...}
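A minimal sketch of the split, assuming the full documents stay in MongoDB and only a slim projection goes to Neo4j. The function names are hypothetical, and treating carrier as the only query-relevant property is an assumption made for illustration.

```python
def route_key(depart_id, arrive_id):
    """Build the shared 'depart-arrive' key, e.g. 'BOS-NYC'."""
    return f"{depart_id}-{arrive_id}"


def to_graph_node(station_doc):
    """Keep only the key; the entity itself stays in the document store."""
    return {"id": station_doc["id"]}


def to_graph_edge(trip_doc, query_fields=("carrier",)):
    """Keep the route key plus the properties needed by graph queries."""
    edge = {"key": route_key(trip_doc["depart_id"], trip_doc["arrive_id"])}
    for field in query_fields:
        if field in trip_doc:
            edge[field] = trip_doc[field]
    return edge
```

Heavy or frequently changing fields such as price never reach the graph, so a price update touches only the document store.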

Modeling

• Storing info in two or more dbs

• Doing a “join” across multiple dbs

Joins across DBs

MongoDB: Stations    Neo4j: Nodes
BOS                  BOS
NYC                  NYC
DC                   DC
...                  ...

MongoDB: Trips       Neo4j: Edges
BOS-NYC              BOS-NYC
BOS-DC               BOS-DC
NYC-DC               NYC-DC
...                  ...

• Forget seq id generated by dbs

• Use a human-created long string for id

• Convert pair into id: depart-arrive

• For example: BOS-NYC

Indexing Technique

• Index Trips by {origin-dest, datetime}
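An in-memory sketch of a compound {origin-dest, datetime} index, assuming trips are kept sorted by that pair so a date-range lookup becomes two binary searches. In MongoDB the same idea would be a compound index on the two fields. TripIndex and the depart_at field name are hypothetical.

```python
import bisect
from datetime import datetime


class TripIndex:
    """Trips kept sorted by (route key, departure time)."""

    def __init__(self):
        self._keys = []   # sorted (route, depart_at) pairs
        self._trips = []  # trip documents, parallel to _keys

    def add(self, trip):
        route = f"{trip['depart_id']}-{trip['arrive_id']}"
        key = (route, trip["depart_at"])
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._trips.insert(pos, trip)

    def search(self, route, start, end):
        """Trips on `route` departing in the half-open range [start, end)."""
        lo = bisect.bisect_left(self._keys, (route, start))
        hi = bisect.bisect_left(self._keys, (route, end))
        return self._trips[lo:hi]
```

A search for BOS-NYC on a single day then reads only the matching slice rather than every trip on the route.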

Querying

• REST API in node.js

• Assemble results from two sources

• Paths from Neo4j

• Segments from MongoDB

• Sort by price, duration
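A rough sketch of the assembly step, written in Python for consistency with the other examples even though the slide names node.js. It assumes the graph returns station paths and the document store returns trips keyed by "depart-arrive". The adjacency dict and trips_by_route dict stand in for Neo4j and MongoDB, the duration_min field is hypothetical, and connection times between legs are ignored.

```python
def find_paths(adjacency, origin, dest, max_legs=2):
    """Depth-limited search for station paths; stands in for the graph query."""
    paths = []

    def walk(path):
        here = path[-1]
        if here == dest:
            paths.append(path)
            return
        if len(path) - 1 == max_legs:
            return
        for nxt in adjacency.get(here, ()):
            if nxt not in path:  # never revisit a station
                walk(path + [nxt])

    walk([origin])
    return paths


def search(adjacency, trips_by_route, origin, dest):
    """Join paths with their cheapest segments, sorted by price then duration."""
    results = []
    for path in find_paths(adjacency, origin, dest):
        legs = []
        for a, b in zip(path, path[1:]):
            options = trips_by_route.get(f"{a}-{b}", [])
            if not options:
                break  # a leg with no trips makes the whole path unusable
            legs.append(min(options, key=lambda t: (t["price"], t["duration_min"])))
        else:
            results.append({
                "path": path,
                "price": sum(t["price"] for t in legs),
                "duration_min": sum(t["duration_min"] for t in legs),
            })
    return sorted(results, key=lambda r: (r["price"], r["duration_min"]))
```

The join stays small because it runs over a handful of paths and only the segments on those paths.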

Geo Lessons

Art: MC Escher

Our Story

• Wanted to mix public transport data with intercity data

• Did not want to host all public transport data

• Created a hybrid solution

Our Solution

• Hybrid:

• Google Autocomplete

• Google Maps

• In house station geo lookup
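One way an in-house station lookup could work is a brute-force nearest-station scan using great-circle distance. This is a swapped-in technique for illustration, not the geo index the team used; the function names and the lat/lon fields are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(stations, lat, lon):
    """Return the station document closest to the given point."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
```

A linear scan is fine for a few thousand stations; beyond that a proper geo index is the better fit.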

Geo

• Neo4j geo func was not out of the box

• Requires jar install

• Run a Java program to index

• Needed better doc

• Ended up using MongoDB geo instead

• Request: make geo func available out of the box

Conclusions

• Even with a join across dbs -> solution better than relational

• 10s paths x 100s segments vs. 500k x 500k

• Glad to have picked Neo4j: doing content gen and more geo features now

• Graph model will be useful for future analytics->Big Data

Useful Links

• Neo4j Internals

slideshare.net/thobe/an-overview-of-neo4j-internals

• Aseem’s Lessons Learned with Neo4j

http://aseemk.com/talks/neo4j-lessons-learned#/14

• Wes Freeman, Neo4j Internals

http://wes.skeweredrook.com/graphdb-meetup-may-2013.pdf

• MongoConnector

blog.mongodb.org/post/29127828146/introducing-mongo-connector

