Wanderu - Lessons from Building a Travel Site with Neo4j

Wanderu: Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j. Eddy Wong, CTO, Wanderu.com (@eddywongch)


Wanderu is a consumer-focused search engine for buses and trains. In this webinar, we will recount the architectural, modeling and other technical "lessons learned" and "lessons unlearned" in implementing our geospatial and search features using Neo4j in the context of a NoSQL polyglot solution.

Speaker: Eddy Wong, CTO, Wanderu

A technologist, innovator and entrepreneur who has architected products and web sites for companies like Hasbro, Maark, Allurent, Macromedia, Allaire, Open Sesame, Philips and AT&T. He was the Chief Architect at Open Sesame, where he built one of the first attribute-based personalization engines. Eddy has over 15 years of experience as a software architect and is a Boston tech-community leader in the areas of NoSQL, Big Data and Personalization. He is also the organizer of the Boston GraphDB Meetup.

Transcript of Wanderu - Lessons from Building a Travel Site with Neo4j

Page 1: Wanderu - Lessons from Building a Travel Site with Neo4j

Wanderu: Lessons Learned

Lessons Learned and Unlearned from Building a Travel Site with Graphs and Neo4j

Eddy Wong, CTO, Wanderu.com

@eddywongch

Page 2: Wanderu - Lessons from Building a Travel Site with Neo4j

About Wanderu.com

Search Engine for (Intercity) Buses and Trains

Page 3: Wanderu - Lessons from Building a Travel Site with Neo4j

Demo

Page 4: Wanderu - Lessons from Building a Travel Site with Neo4j

From pt A to pt B

A: Boston B: DC

NYC

Nomenclature: Stations, Trips

Amtrak, $101, 09/26/2013

Bolt, $25, 09/26/2013

Mega, $24, 09/26/2013

Page 5: Wanderu - Lessons from Building a Travel Site with Neo4j

From pt A to pt B

B: Brooklyn, NY

A: Cambridge, MA

31st & 9th Ave, NYC

South Station, Boston

28th St & 7th Ave, NYC

34th St & 8th Ave, NYC

Page 6: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Story

• Tech Started about 1+ yr ago

• Beta in Mar, Launch in Aug

• Knew nothing about Neo4j when we started (Jun 2012)

• Did not like the relational model: wanted schema-less and no self-joins

• Wanted a graph model

Page 7: Wanderu - Lessons from Building a Travel Site with Neo4j

Relational vs. Graph

Page 8: Wanderu - Lessons from Building a Travel Site with Neo4j

Lessons

Learned

Unlearned

Idea

• Architectural

• Modeling

• Geo

Page 9: Wanderu - Lessons from Building a Travel Site with Neo4j

Architectural Lessons

Art: MC Escher

Page 10: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Story

• Started with MongoDB as a general store: easy to manipulate and organize data

• Wanted a db that could preserve the Graph Model

• Debated: Document vs. Graph

• Could not find one single db that could do both: general store + graph

Page 11: Wanderu - Lessons from Building a Travel Site with Neo4j

Workflow

[Diagram: bus websites (non-uniform data) -> scraping -> JSON (uniform data) -> store -> server]

Page 12: Wanderu - Lessons from Building a Travel Site with Neo4j

noSQL

• You need to make a choice of one noSQL database

• You need ONE (centralized) database

• The word “database” is a loaded term

• Lots of (very different) noSQL db options

Page 13: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Situation

• Data is written only in one direction

• Users search for paths, then segments

• Searches are done by date

• Needed online capability

• Trip info (price/avail) could change on some

Page 14: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Solution

• Use Both: MongoDB + Neo4j

• “Docugraph” = Document + Graph

• Syncing two kinds of databases

• Eventual consistency

Page 15: Wanderu - Lessons from Building a Travel Site with Neo4j

Pipeline

[Diagram: bus websites (non-uniform data) -> scraping -> JSON (uniform data) -> MongoDB -> MongoConn (replica mechanism) -> Neo4j (nodes & edges)]

Page 16: Wanderu - Lessons from Building a Travel Site with Neo4j

MongoConnector

• MongoDB Labs project, open source, unsupported

• Uses Replica Mechanism: Oplog

• Eventually Consistent (not real time)

• Written in Python

• Main methods: Upserts and Deletes, passes doc

• Implement DocMgr->Neo4jDocMgr->py2neo

• Other impls: MongoDocMgr, SolrDocMgr, ESDocMgr
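The upsert/delete shape described on this slide can be sketched as follows. This is a minimal, hypothetical illustration: `InMemoryGraph` and `GraphDocManager` are made-up names standing in for Neo4j and for a Neo4j doc manager, and the method names only mirror the "upserts and deletes" idea rather than reproducing MongoConnector's or py2neo's actual interfaces. The document fields are borrowed from the trip example elsewhere in these slides.

```python
class InMemoryGraph:
    """Hypothetical stand-in for Neo4j: stores node keys and keyed edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}  # edge id -> (depart_id, arrive_id, properties)


class GraphDocManager:
    """Hypothetical doc manager: receives trip documents replayed from the oplog."""

    def __init__(self, graph):
        self.graph = graph

    def upsert(self, doc):
        # Insert or overwrite the edge for this trip document.
        a, b = doc["depart_id"], doc["arrive_id"]
        self.graph.nodes.update((a, b))  # stations are created on demand
        # Keep only light, query-relevant properties on the edge.
        self.graph.edges[doc["_id"]] = (a, b, {"carrier": doc.get("carrier")})

    def remove(self, doc):
        # Drop the edge; station nodes are left in place.
        self.graph.edges.pop(doc["_id"], None)


graph = InMemoryGraph()
manager = GraphDocManager(graph)
trip = {"_id": "t1", "depart_id": "BOS", "arrive_id": "NYC",
        "carrier": "Megabus", "price": 24.0}
manager.upsert(trip)
manager.upsert(trip)  # replaying the same entry leaves a single edge
```

Because an upsert overwrites by id, replaying an oplog entry more than once leaves the graph unchanged, which suits an eventually consistent sync.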

Page 17: Wanderu - Lessons from Building a Travel Site with Neo4j

Populating Neo4j (2)

• Created our own way of creating Edges

• Auto Node creation when Edge is created: Could add Stations (nodes) on the fly

• py2neo requires 2 “node ref”s to create an edge, i.e., might need two round trips to Neo4j

Page 18: Wanderu - Lessons from Building a Travel Site with Neo4j

Edge Creator (pseudocode)

hashtable allStations = load_stations()

w_create_edge(station_id a, station_id b, otherdata):
    look up a in allStations
    if found:
        ref_a = allStations.get(a)
    if not found:
        ref_a = py2neo.create_node(a)
        add a -> ref_a to allStations
    ... (repeat for b to get ref_b)
    py2neo.create_edge(ref_a, ref_b, ...)
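A runnable sketch of the same idea, under stated assumptions: `FakeGraphClient` is a hypothetical stand-in that merely counts simulated round trips, and its `create_node`/`create_edge` methods follow the slide's pseudocode names, not py2neo's real API. `EdgeCreator` is likewise an illustrative name.

```python
class FakeGraphClient:
    """Hypothetical stand-in for a graph client; counts simulated round trips."""

    def __init__(self):
        self.round_trips = 0
        self.edges = []

    def create_node(self, station_id):
        self.round_trips += 1
        return ("node-ref", station_id)

    def create_edge(self, ref_a, ref_b, **props):
        self.round_trips += 1
        self.edges.append((ref_a, ref_b, props))


class EdgeCreator:
    def __init__(self, client, known_stations=None):
        self.client = client
        # Cache of station_id -> node reference, preloaded once at startup.
        self.all_stations = dict(known_stations or {})

    def _ref_for(self, station_id):
        # Reuse a cached reference, or create the station node on the fly.
        ref = self.all_stations.get(station_id)
        if ref is None:
            ref = self.client.create_node(station_id)
            self.all_stations[station_id] = ref
        return ref

    def create_edge(self, a, b, **props):
        self.client.create_edge(self._ref_for(a), self._ref_for(b), **props)


client = FakeGraphClient()
creator = EdgeCreator(client)
creator.create_edge("BOS", "NYC", carrier="Megabus")  # 2 node creations + 1 edge
creator.create_edge("BOS", "DC", carrier="Amtrak")    # BOS is cached: 1 node + 1 edge
```

The cache means an extra round trip is paid only the first time a station is seen; later edges touching that station reuse the stored reference.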

Page 19: Wanderu - Lessons from Building a Travel Site with Neo4j

Pipeline

[Diagram: bus websites (non-uniform data) -> scraping -> JSON -> MongoDB -> MongoConn (replica mechanism) -> Neo4j (nodes & edges) -> REST server]

Example edges: (BOS, NYC), (BOS, PHL), (NYC, DC), (NYC, PHL)

Page 20: Wanderu - Lessons from Building a Travel Site with Neo4j

Modeling Lessons

Art: MC Escher

Page 21: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Story

• We tried to “dump” all data into Neo4j

• Stations -> Nodes, Trips -> Edges

• Problem: Edges had dates -> too many Edges -> “Super Node”

• Query perf was terrible (1+ mins) and got worse as the number of edges increased

Page 22: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Story (2)

• Went from Cypher to Gremlin, thinking that would improve performance

• Needed range queries on Edges

Page 23: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Solution

• Don’t store everything in Neo4j, only metadata

• Use Neo4j as an index

• Don’t store entities in Nodes, only keys

• Don’t store heavy properties in Edges

Page 24: Wanderu - Lessons from Building a Travel Site with Neo4j

Neo4j Model

source: Tobias Lindaaker, Wes Freeman

Page 25: Wanderu - Lessons from Building a Travel Site with Neo4j

Neo4j Runtime Model

• Relationships are in a linked list

• Properties are in a linked list

• Therefore: There is NO random access for Relationships or Properties

• A range query of relationships requires a full scan
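To make the cost concrete, here is a toy model of the linked-list layout. It is a simplified illustration, not Neo4j's actual storage code: `RelRecord`, `build_chain` and `range_scan` are hypothetical names, and the integer `depart_day` stands in for a date property on an edge.

```python
class RelRecord:
    """Hypothetical relationship record with a pointer to the next one."""

    def __init__(self, depart_day, next_rel=None):
        self.depart_day = depart_day
        self.next_rel = next_rel


def build_chain(days):
    """Link one record per day into a singly linked chain; return its head."""
    head = None
    for day in reversed(days):
        head = RelRecord(day, head)
    return head


def range_scan(head, lo, hi):
    """Return (matches, records_visited) for lo <= depart_day <= hi."""
    matches, visited = [], 0
    rel = head
    while rel is not None:
        visited += 1
        if lo <= rel.depart_day <= hi:
            matches.append(rel.depart_day)
        rel = rel.next_rel
    return matches, visited


# One relationship per day of a year hanging off a single station node.
chain = build_chain(list(range(1, 366)))
matches, visited = range_scan(chain, 100, 102)
```

Only three records match, yet every record in the chain is visited, because there is no way to jump to the start of the range.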

Page 26: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Solution (2)

• Needed ability to do range queries on Edges

• Serve paths from Neo4j, segments from MongoDB

• The one thing we tried to avoid we ended up doing: Joins

• Came up with “Docugraph” approach

Page 27: Wanderu - Lessons from Building a Travel Site with Neo4j

Docugraph

• MongoDB Collections for Nodes and Edges

• Neo4j: Only keys for nodes

• Neo4j: Only Properties relevant for queries

Page 28: Wanderu - Lessons from Building a Travel Site with Neo4j

Nodes & Edges

• Collection for Stations (nodes)

{id: "BOS", name: "Boston South Station", address: "Summer St", ...}

• Collection for Trips (edges)

{depart_id: "BOS", arrive_id: "NYC", carrier: "Megabus", price: 24.0, ...}

Page 29: Wanderu - Lessons from Building a Travel Site with Neo4j

Modeling

• Storing info in two or more dbs

• Doing a “join” across multiple dbs

Page 30: Wanderu - Lessons from Building a Travel Site with Neo4j

Joins across DBs

MongoDB: Stations    Neo4j: Nodes
BOS                  BOS
NYC                  NYC
DC                   DC
...                  ...

MongoDB: Trips       Neo4j: Edges
BOS-NYC              BOS-NYC
BOS-DC               BOS-DC
NYC-DC               NYC-DC
...                  ...

• Forget seq id generated by dbs

• Use a human-created long string for id

• Convert pair into id: depart-arrive

• For example: BOS-NYC
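A minimal sketch of the shared-key convention, assuming plain Python containers as stand-ins for both databases. `trip_key`, `graph_edges`, `trips_by_key` and `join_edge` are hypothetical names, and the sample trips are made-up illustration data.

```python
def trip_key(depart_id, arrive_id):
    """Build the shared edge id from a (depart, arrive) pair, e.g. 'BOS-NYC'."""
    return f"{depart_id}-{arrive_id}"


# Stand-in for the Neo4j side: only keys, no heavy properties.
graph_edges = {"BOS-NYC", "BOS-DC", "NYC-DC"}

# Stand-in for the MongoDB trips collection, grouped by the same key.
trips_by_key = {
    "BOS-NYC": [{"carrier": "Megabus", "price": 24.0}],
    "NYC-DC": [{"carrier": "Bolt", "price": 25.0}],
}


def join_edge(depart_id, arrive_id):
    """Look the same key up on both sides; an edge may have no trips yet."""
    key = trip_key(depart_id, arrive_id)
    if key not in graph_edges:
        return None
    return {"edge": key, "trips": trips_by_key.get(key, [])}
```

Since the key is derived from the data rather than generated by either database, both sides can compute it independently and the "join" reduces to a lookup.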

Page 31: Wanderu - Lessons from Building a Travel Site with Neo4j

Indexing Technique

• Index Trips by {origin-dest, datetime}
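The effect of a {origin-dest, datetime} index can be sketched with a sorted list and binary search. This is an in-memory illustration only, not how the trips collection was actually indexed; in MongoDB the same role would typically be played by a compound index. The `index` list, the `departures` function and the sample times are all assumptions.

```python
from bisect import bisect_left
from datetime import datetime

# Each entry is (route_key, departure); sorting tuples orders by route, then time.
index = sorted([
    ("BOS-NYC", datetime(2013, 9, 26, 8, 0)),
    ("BOS-NYC", datetime(2013, 9, 26, 17, 30)),
    ("BOS-NYC", datetime(2013, 9, 27, 8, 0)),
    ("NYC-DC", datetime(2013, 9, 26, 9, 15)),
])


def departures(route_key, start, end):
    """Return departures for route_key with start <= time < end."""
    lo = bisect_left(index, (route_key, start))
    hi = bisect_left(index, (route_key, end))
    return [when for _, when in index[lo:hi]]


same_day = departures("BOS-NYC", datetime(2013, 9, 26), datetime(2013, 9, 27))
```

Putting the route key first keeps all departures for one route contiguous and ordered by time, so a date-range search touches only the matching slice.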

Page 32: Wanderu - Lessons from Building a Travel Site with Neo4j

Querying

• REST API in node.js

• Assemble results from two sources

• Paths from Neo4j

• Segments from MongoDB

• Sort by price, duration
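The assembly step might look roughly like this. The sketch is in Python for consistency with the other examples, although the slide says the REST API itself is in node.js. `ADJACENT` stands in for the graph and `SEGMENTS` for the document store; the prices and durations are made-up sample data, and sorting by price with duration as the tie-breaker is an assumption.

```python
# Stand-in for the graph: adjacency of station keys only.
ADJACENT = {"BOS": ["NYC", "DC"], "NYC": ["DC"], "DC": []}

# Stand-in for the document store: one segment per edge key (made-up data).
SEGMENTS = {
    "BOS-NYC": {"carrier": "Megabus", "price": 24.0, "hours": 4.5},
    "NYC-DC": {"carrier": "Bolt", "price": 25.0, "hours": 4.5},
    "BOS-DC": {"carrier": "Amtrak", "price": 101.0, "hours": 8.0},
}


def find_paths(origin, dest, max_hops=3):
    """Enumerate simple paths from origin to dest by depth-first search."""
    paths, stack = [], [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == dest:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue
        for nxt in ADJACENT.get(path[-1], []):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def itineraries(origin, dest):
    """Join each path with its segments, then sort by price and duration."""
    results = []
    for path in find_paths(origin, dest):
        legs = [SEGMENTS[f"{a}-{b}"] for a, b in zip(path, path[1:])]
        results.append({
            "path": path,
            "price": sum(leg["price"] for leg in legs),
            "hours": sum(leg["hours"] for leg in legs),
        })
    return sorted(results, key=lambda r: (r["price"], r["hours"]))


options = itineraries("BOS", "DC")
```

The graph side only has to answer "which station sequences connect A to B"; pricing and timing details are attached afterwards from the document side.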

Page 33: Wanderu - Lessons from Building a Travel Site with Neo4j

Geo Lessons

Art: MC Escher

Page 34: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Story

• Wanted to mix public transport data with intercity data

• Did not want to host all public transport data

• Created a hybrid solution

Page 35: Wanderu - Lessons from Building a Travel Site with Neo4j

Our Solution

• Hybrid:

• Google Autocomplete

• Google Maps

• In house station geo lookup

Page 36: Wanderu - Lessons from Building a Travel Site with Neo4j

Geo

• Neo4j geo func was not out of the box

• Requires jar install

• Run a Java program to index

• Needed better doc

• Ended up using MongoDB geo instead

• Make geo func out of the box
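An in-house station lookup can be approximated with a brute-force nearest-neighbour search. This sketch is not MongoDB's geo implementation, which would use a geospatial index; it only illustrates the question being asked. `STATIONS`, `haversine_km` and `nearest_station` are hypothetical names, and the coordinates are approximate, illustrative values.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate, illustrative coordinates (latitude, longitude) keyed by station id.
STATIONS = {
    "BOS": (42.352, -71.055),
    "NYC": (40.750, -73.993),
    "DC": (38.897, -77.006),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_station(point):
    """Return the id of the station closest to point."""
    return min(STATIONS, key=lambda sid: haversine_km(point, STATIONS[sid]))


cambridge = (42.373, -71.110)  # approximate location of Cambridge, MA
closest = nearest_station(cambridge)
```

A linear scan like this is fine for a handful of stations; with many stations a geospatial index avoids comparing against every one.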

Page 37: Wanderu - Lessons from Building a Travel Site with Neo4j

Conclusions

• Even with a join across dbs -> solution better than relational

• 10s paths x 100s segments vs. 500k x 500k

• Glad to have picked Neo4j: doing content gen and more geo features now

• Graph model will be useful for future analytics->Big Data

Page 38: Wanderu - Lessons from Building a Travel Site with Neo4j

Useful Links

• Neo4j Internals

slideshare.net/thobe/an-overview-of-neo4j-internals

• Aseem’s Lessons Learned with Neo4j

http://aseemk.com/talks/neo4j-lessons-learned#/14

• Wes Freeman, Neo4j Internals

http://wes.skeweredrook.com/graphdb-meetup-may-2013.pdf

• MongoConnector

blog.mongodb.org/post/29127828146/introducing-mongo-connector