Airbnb Search Architecture
Maxim Charkov, Engineering Manager, @mcharkov
Airbnb
Total Guests: 20,000,000+
Countries: 190
Cities: 34,000+
Castles: 600+
Listings Worldwide: 800,000+
Search
www.airbnb.com
Booking Model
Search -> Contact -> Accept -> Book
Search Backend
Technical Stack ____________________________
DropWizard as a service framework (incl. Jetty, Jersey, Jackson)
Guice dependency injection framework, Guava libraries, etc.
ZooKeeper (via Smartstack) for service discovery.
Lucene for index storage and simple retrieval.
Real-time indexing, ranking, and advanced filtering built in-house.
Search Backend
Within a single JVM:
~150 search threads
4 indexing threads
Data maintained by indexers:
Inverted Lucene index for retrieval
Forward index for ranking signals
Relevance models
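The split between an inverted index (for retrieval) and a forward index (for ranking signals) can be illustrated in plain Java. This is a minimal sketch, not Airbnb's code and not Lucene's API: the class `TinyIndex`, the term strings, and the single `quality` signal are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy illustration only: the inverted index finds matches, the forward index supplies signals. */
class TinyIndex {
    // Inverted index: term (e.g. "room_type=entire_home") -> ids of matching listings.
    private final Map<String, TreeSet<Integer>> inverted = new HashMap<>();
    // Forward index: listing id -> ranking signal (one hypothetical quality score).
    private final Map<Integer, Double> forward = new HashMap<>();

    void add(int listingId, double quality, String... terms) {
        forward.put(listingId, quality);
        for (String term : terms) {
            inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(listingId);
        }
    }

    /** Retrieves matches from the inverted index, then orders them by the forward-index signal. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>(inverted.getOrDefault(term, new TreeSet<>()));
        hits.sort(Comparator.comparingDouble((Integer id) -> forward.get(id)).reversed());
        return hits;
    }
}
```

Keeping signals outside the inverted index means they can be replaced without touching the retrieval structures.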
Indexing
What’s in the Lucene index? ____________________________
Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy)
Categorical and numerical properties like room type and maximum occupancy
Calendar information
Full text (descriptions, reviews, etc.)
~40 fields per listing from a variety of data sources, all updated in real time
Challenges ____________________________
Bootstrap (creating the index from scratch)
Ensuring consistency of the index with ground truth data in real time
[Diagram: databases (master, calendar, fraud) -> SpinalTap -> Medusa (with persistent storage) -> Search1, Search2, … SearchN]
SpinalTap ____________________________
Responsible for detecting updates happening to the ground truth data (no need to maintain search index invalidation logic in application code)
Tails binary update logs from MySQL servers (5.6+)
Converts them into actionable data objects, called “Mutations”
Broadcasts using a distributed queue, like Kafka or RabbitMQ
SpinalTap Pipes ____________________________
Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka)
Configured via YAML files

# sources for mysql binary logs
sources:
  - name: airslave
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap
  - name: calendar_db
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap

destinations:
  - name: kafka
    clazzName: com.airbnb.spinaltap.destination.kafka.KafkaDestination

pipes:
  - name: search
    sources: ["airslave", "calendar_db"]
    tables: ["production:listings,calendar_db:schedule2s"]
    destination: kafka
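How a pipe might decide which table changes to forward can be sketched as follows. This is an illustration under assumptions, not SpinalTap's actual code: the class `PipeTableFilter` is hypothetical, and it assumes the `tables` entry is a comma-separated list of `db:table` pairs, as the config suggests.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a pipe's table filter; not SpinalTap's real implementation. */
class PipeTableFilter {
    private final Set<String> allowed = new HashSet<>();

    /** Accepts a comma-separated "db:table" list, the shape used in the pipe config. */
    PipeTableFilter(String tables) {
        for (String entry : tables.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                allowed.add(trimmed);
            }
        }
    }

    /** True if a change to db.table should be forwarded to the pipe's destination. */
    boolean accepts(String db, String table) {
        return allowed.contains(db + ":" + table);
    }
}
```

With the `search` pipe's value, `new PipeTableFilter("production:listings,calendar_db:schedule2s")` accepts changes to `production:listings` and drops changes to any table not listed.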
SpinalTap Mutations ____________________________
Each binlog entry is parsed and converted into one of three event types: “Insert”, “Delete” or “Update”
“Insert” and “Delete” carry the entire row to be inserted or deleted
“Update” mutations contain both the old and the current row
Additional information: unique id, sequence number, column and table metadata

{
  "seq" : 3,
  "binlogpos" : "mysql-bin.000002:5217:5273",
  "id" : -1857589909002862756,
  "type" : 2,
  "table" : {
    "id" : 70,
    "name" : "users",
    "db" : "my_db",
    "columns" : [
      { "name" : "name", "type" : 15, "ispk" : false },
      { "name" : "age", "type" : 2, "ispk" : false }
    ]
  },
  "rows" : [
    {
      "1" : { "name" : "eric", "age" : 31 },
      "2" : { "name" : "eric", "age" : 28 }
    }
  ]
}
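The three mutation types can be modeled roughly as below. This is a sketch under assumptions: the class `Mutation`, its field names, and the `changedColumns` helper are hypothetical and do not reflect SpinalTap's real classes. It shows why carrying both the old and the current row is useful: a consumer can see which columns changed.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of a mutation; names are assumptions, not SpinalTap's API. */
class Mutation {
    enum Type { INSERT, DELETE, UPDATE }

    final Type type;
    final Map<String, Object> oldRow; // null for INSERT
    final Map<String, Object> newRow; // null for DELETE

    private Mutation(Type type, Map<String, Object> oldRow, Map<String, Object> newRow) {
        this.type = type;
        this.oldRow = oldRow;
        this.newRow = newRow;
    }

    static Mutation insert(Map<String, Object> row) { return new Mutation(Type.INSERT, null, row); }
    static Mutation delete(Map<String, Object> row) { return new Mutation(Type.DELETE, row, null); }
    static Mutation update(Map<String, Object> oldRow, Map<String, Object> newRow) {
        return new Mutation(Type.UPDATE, oldRow, newRow);
    }

    /** Columns whose values differ between the old and the current row; empty unless UPDATE. */
    Set<String> changedColumns() {
        Set<String> changed = new TreeSet<>();
        if (type != Type.UPDATE) {
            return changed;
        }
        for (Map.Entry<String, Object> e : newRow.entrySet()) {
            if (!Objects.equals(e.getValue(), oldRow.get(e.getKey()))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```

For two rows that differ only in `age`, `changedColumns()` returns just that column, so a consumer can skip updates that touch no indexed field.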
Medusa ____________________________
Documents in index contain data from ~15 different source tables
Lucene needs a copy of all fields (not just fields that changed) to update the index
We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL
Reads from SpinalTap or directly from MySQL
Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents
The intermediate Thrift objects are persisted in Redis
As changes are detected, updated objects are pushed to the Search instances to update Lucene indexes
Can bootstrap the entire index in 3 minutes via multithreaded streaming
Leader election via ZooKeeper
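The join-and-push idea can be sketched as follows. This is an illustration under assumptions, not Medusa's code: the class `DocumentAssembler` is hypothetical, a plain in-memory map stands in for the persistent store, and a generic map stands in for the Thrift object. Because the full document is kept, a change to one table still produces a complete document for the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Hypothetical sketch; an in-memory map stands in for the persistent store. */
class DocumentAssembler {
    // listing id -> all fields merged from every source table
    private final Map<Integer, Map<String, Object>> store = new HashMap<>();
    // Callback standing in for "push the updated object to the Search instances".
    private final BiConsumer<Integer, Map<String, Object>> pushToSearch;

    DocumentAssembler(BiConsumer<Integer, Map<String, Object>> pushToSearch) {
        this.pushToSearch = pushToSearch;
    }

    /** Merges a partial change from one table, then pushes the complete document downstream. */
    void onChange(int listingId, String table, Map<String, Object> changedFields) {
        Map<String, Object> doc = store.computeIfAbsent(listingId, id -> new TreeMap<>());
        for (Map.Entry<String, Object> e : changedFields.entrySet()) {
            doc.put(table + "." + e.getKey(), e.getValue());
        }
        // Push a copy containing all fields, not just the ones that changed.
        pushToSearch.accept(listingId, new TreeMap<>(doc));
    }
}
```

Bootstrapping from the stored documents also avoids re-reading every source table from MySQL.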
[Diagram: Medusa (with persistent storage) -> Search1, Search2, … SearchN]
Ranking
Ranking Problem ____________________________
Not a text search problem
Users are almost never searching for a specific item; rather, they’re looking to “Discover”
The most common component of a query is location
Highly personalized – the user is a part of the query
Optimizing for conversion (Search -> Inquiry -> Booking)
Evolution through continuous experimentation
Ranking Components ____________________________
Relevance
Quality
Bookability
Personalization
Desirability of location
New host promotion
etc.
Ranking
Several hundred signals determining search ranking:
Properties of the listing (reviews, location, etc.)
Behavioral signals (mined from request logs)
Image quality and clickability (computer vision)
Host behavior (response time/rate, cancellations, etc.)
Host preferences model
[Diagram: signal sources are DB snapshots and logs]
Loading Signals ____________________________
Storing signals in a separate data structure
Pros:
Good fit for this type of update pattern: not real-time, but almost everything changes on each load
No need for costly Lucene index rebuild
Greatly simplifies design
Cons:
Unable to use Lucene retrieval on such data

public void attemptLoadData() {
    DateTime remoteTs = dataLoader.getModTime(pathToSignals);

    // Reload only when the remote file is newer than the loaded copy.
    if (currentTs == null || remoteTs.isAfter(currentTs)) {
        Map<K, D> newSignals = loadData();
        // The first successful load is accepted as-is; later loads must pass the health check.
        if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
            synchronized (this) {
                signalsMap = newSignals;
                currentTs = remoteTs;
                this.notifyAll();
            }
        } else {
            LOG.severe("Failed to load the avro file: " + pathToSignals);
        }
    }
}

…

ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader =
    loaders.get(LoaderCollection.Loader.QualitySignals);
final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);
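The swap-on-load pattern can be sketched as a self-contained class. The slides do not say what `isHealthy` checks, so the rule here is an assumption: a new load is rejected if it shrinks below a fraction of the current map. The class `SignalStore` and the `minSizeRatio` parameter are hypothetical, and a single loader thread is assumed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: replace the signal map only if the new load passes a sanity check. */
class SignalStore<K, D> {
    // volatile so search threads always see the most recently published map
    private volatile Map<K, D> signals = Collections.emptyMap();
    private final double minSizeRatio; // assumed threshold, e.g. 0.5

    SignalStore(double minSizeRatio) {
        this.minSizeRatio = minSizeRatio;
    }

    /** Assumed health rule: the candidate must hold at least minSizeRatio of the current entry count. */
    boolean isHealthy(Map<K, D> candidate) {
        return candidate.size() >= signals.size() * minSizeRatio;
    }

    /** Returns true if the candidate replaced the current map. */
    boolean tryReplace(Map<K, D> candidate) {
        if (candidate == null || !isHealthy(candidate)) {
            return false;
        }
        signals = Collections.unmodifiableMap(new HashMap<>(candidate));
        return true;
    }

    D get(K key) {
        return signals.get(key);
    }
}
```

Replacing the whole map in one reference assignment suits data where almost everything changes on each load.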
Life of a Query
[Diagram: Query Understanding -> Retrieval -> External Calls -> Populator -> Scorer -> Third Pass Ranking -> Result Generation, with AirEvents Logging. Labels: Geocoding; Configuring retrieval options; Choosing ranking models; Quality; Bookability; Relevance; Filtering and Reranking; Pricing Service; Social Connections; 2000 results; 25 results]
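The narrowing from 2000 results to 25 suggests a funnel: a cheaper score trims the candidate set, and a costlier score picks the final page. The sketch below assumes that reading. The class `RankingFunnel` and both scoring functions are hypothetical, and the real pipeline has more stages than shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical two-stage funnel; not the actual query pipeline. */
class RankingFunnel {
    /** Returns the k highest-scoring items, best first. */
    static <T> List<T> topK(List<T> items, ToDoubleFunction<T> score, int k) {
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble(score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Trims with the cheap score first, then ranks the survivors with the costly score. */
    static <T> List<T> run(List<T> retrieved, ToDoubleFunction<T> cheap,
                           ToDoubleFunction<T> costly, int firstCut, int pageSize) {
        return topK(topK(retrieved, cheap, firstCut), costly, pageSize);
    }
}
```

A call such as `RankingFunnel.run(candidates, cheapScore, costlyScore, 2000, 25)` would mirror the counts on the slide; the costly score then runs on at most 2000 items.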
Second Pass Ranking ____________________________
Traditional ranking works like this:
[equation image not preserved] then sort by r
In contrast, second pass operates on the entire list at once:
[equation image not preserved]
Makes it possible to implement features like result diversity, etc.
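The difference between scoring items independently and operating on the whole list can be sketched with a greedy diversity rerank. This is one illustrative technique, not the method used in the talk: the class `SecondPass`, the `neighborhood` attribute, and the linear penalty are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical list-level rerank; the penalty scheme is an assumption for illustration. */
class SecondPass {
    static class Listing {
        final int id;
        final String neighborhood;
        final double score; // per-item score from an earlier pass

        Listing(int id, String neighborhood, double score) {
            this.id = id;
            this.neighborhood = neighborhood;
            this.score = score;
        }
    }

    /**
     * Greedy rerank: at each step pick the listing with the best adjusted score, where the
     * adjustment subtracts penalty times the number of earlier picks from the same neighborhood.
     * With penalty 0 this reduces to sorting by score.
     */
    static List<Listing> diversify(List<Listing> scored, double penalty) {
        List<Listing> remaining = new ArrayList<>(scored);
        List<Listing> result = new ArrayList<>();
        Map<String, Integer> picked = new HashMap<>();
        while (!remaining.isEmpty()) {
            Listing best = null;
            double bestAdjusted = 0.0;
            for (Listing l : remaining) {
                double adjusted = l.score - penalty * picked.getOrDefault(l.neighborhood, 0);
                if (best == null || adjusted > bestAdjusted) {
                    best = l;
                    bestAdjusted = adjusted;
                }
            }
            remaining.remove(best);
            result.add(best);
            picked.merge(best.neighborhood, 1, Integer::sum);
        }
        return result;
    }
}
```

Each pick depends on what was already chosen, which a per-item score followed by a sort cannot express.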
Outside of the scope of this talk ____________________________
Ranking models
Machine Learning infrastructure
Tools (loadtest, deploy, etc.)
Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.