Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Airbnb Search Architecture

Maxim Charkov, Engineering Manager, @mcharkov

Airbnb

Total Guests: 20,000,000+

Countries: 190

Cities: 34,000+

Castles: 600+

Listings Worldwide: 800,000+

Search

www.airbnb.com

Booking Model

Search, Book, Contact, Accept

Search Backend

Technical Stack ____________________________

DropWizard as a service framework (incl. Jetty, Jersey, Jackson)

Guice dependency injection framework, Guava libraries, etc.

ZooKeeper (via Smartstack) for service discovery.

Lucene for index storage and simple retrieval.

In-house built real time indexing, ranking, advanced filtering.

Search Backend

~150 search threads

4 indexing threads

Data maintained by indexers:

Inverted Lucene index for retrieval

Forward index for ranking signals

Relevance models

[Diagram: a single JVM hosting the search threads, the indexing threads and the data listed above]
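The split between an inverted index for retrieval and a forward index for ranking signals can be illustrated with a minimal in-memory sketch. This is not how Lucene or the Airbnb search service implements it; the class `MiniListingIndex`, the term format and the single `qualityScore` signal are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: the inverted index answers "which listings match this
// term?", the forward index answers "what are the signals of this listing?".
class MiniListingIndex {
    // Inverted index: term (e.g. "city:paris") -> ids of matching listings.
    private final Map<String, Set<Integer>> inverted = new HashMap<>();
    // Forward index: listing id -> ranking signal (a single score here).
    private final Map<Integer, Double> qualityScore = new HashMap<>();

    void add(int listingId, double score, String... terms) {
        for (String term : terms) {
            inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(listingId);
        }
        qualityScore.put(listingId, score);
    }

    // Retrieval uses the inverted index; ranking reads the forward index.
    List<Integer> search(String term) {
        Set<Integer> matches = inverted.getOrDefault(term, Collections.<Integer>emptySet());
        List<Integer> hits = new ArrayList<>(matches);
        hits.sort((a, b) -> Double.compare(qualityScore.get(b), qualityScore.get(a)));
        return hits;
    }

    public static void main(String[] args) {
        MiniListingIndex index = new MiniListingIndex();
        index.add(1, 0.4, "room_type:private_room", "city:paris");
        index.add(2, 0.9, "room_type:entire_home", "city:paris");
        index.add(3, 0.7, "room_type:entire_home", "city:rome");
        System.out.println(index.search("city:paris"));
    }
}
```

Keeping the two structures separate means the signals can be replaced without touching the retrieval structure, at the cost of not being able to filter on them during retrieval.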

Indexing

What’s in the Lucene index? ____________________________

Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy)

Categorical and numerical properties like room type and maximum occupancy

Calendar information

Full text (descriptions, reviews, etc.)

~40 fields per listing from a variety of data sources, all updated in real time
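RecursivePrefixTreeStrategy indexes locations as cells of a spatial prefix tree (a geohash or quad tree), so that nearby points share a common cell prefix. The idea can be sketched without Lucene as follows; `QuadCellSketch` and its methods are hypothetical and do not reproduce Lucene's actual cell encoding.

```java
// Hypothetical sketch of a quad-tree cell encoding: each level splits the
// current bounding box into four quadrants and appends the quadrant id.
class QuadCellSketch {
    static String encode(double lat, double lon, int levels) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder cell = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            double midLat = (minLat + maxLat) / 2;
            double midLon = (minLon + maxLon) / 2;
            int quadrant = 0;
            if (lat >= midLat) { quadrant += 2; minLat = midLat; } else { maxLat = midLat; }
            if (lon >= midLon) { quadrant += 1; minLon = midLon; } else { maxLon = midLon; }
            cell.append(quadrant); // one digit in 0..3 per level
        }
        return cell.toString();
    }

    // Number of leading levels on which two cell paths agree.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String parisA = encode(48.8566, 2.3522, 16);
        String parisB = encode(48.8606, 2.3376, 16);
        String sanFrancisco = encode(37.7749, -122.4194, 16);
        System.out.println(parisA + " / " + parisB + " share " + sharedPrefix(parisA, parisB));
        System.out.println(parisA + " / " + sanFrancisco + " share " + sharedPrefix(parisA, sanFrancisco));
    }
}
```

With such an encoding, a "listings near this point" query becomes a prefix lookup in the inverted index, which is why a prefix-tree strategy fits a term-based index.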

Challenges ____________________________

Bootstrap (creating the index from scratch)

Ensuring consistency of the index with ground truth data in real time

Indexing

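[Diagram: MySQL databases (master, calendar, fraud) feed SpinalTap; SpinalTap feeds Medusa, which is backed by persistent storage and pushes to the search instances Search1, Search2, … SearchN]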

SpinalTap ____________________________

Responsible for detecting updates happening to the ground truth data (no need to maintain search index invalidation logic in application code)

Tails binary update logs from MySQL servers (5.6+)

Converts them into actionable data objects, called “Mutations”

Broadcasts using a distributed queue, like Kafka or RabbitMQ

# sources for mysql binary logs
sources:
  - name: airslave
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap
  - name: calendar_db
    host: localhost
    port: 11
    user: spinaltap
    password: spinaltap

destinations:
  - name: kafka
    clazzName: com.airbnb.spinaltap.destination.kafka.KafkaDestination

pipes:
  - name: search
    sources: ["airslave", "calendar_db"]
    tables: ["production:listings,calendar_db:schedule2s"]
    destination: kafka

SpinalTap Pipes ____________________________

Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka)

Configured via YAML files

{
  "seq": 3,
  "binlogpos": "mysql-bin.000002:5217:5273",
  "id": -1857589909002862756,
  "type": 2,
  "table": {
    "id": 70,
    "name": "users",
    "db": "my_db",
    "columns": [
      { "name": "name", "type": 15, "ispk": false },
      { "name": "age", "type": 2, "ispk": false }
    ]
  },
  "rows": [
    {
      "1": { "name": "eric", "age": 31 },
      "2": { "name": "eric", "age": 28 }
    }
  ]
}

SpinalTap Mutations ____________________________

Each binlog entry is parsed and converted into one of three event types: “Insert”, “Delete” or “Update”

“Insert” and “Delete” carry the entire row to be inserted or deleted

“Update” mutations contain both the old and the current row

Additional information: unique id, sequence number, column and table metadata
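The three mutation types described above can be modeled with a small value class. This is a sketch under assumptions: `RowMutation`, its fields and the `changedColumns` helper are hypothetical and are not SpinalTap's actual classes or wire format.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a row-level mutation with its old and new row.
final class RowMutation {
    enum Type { INSERT, UPDATE, DELETE }

    final Type type;
    final long sequence;               // position in the mutation stream
    final String table;
    final Map<String, Object> oldRow;  // empty for INSERT
    final Map<String, Object> newRow;  // empty for DELETE

    private RowMutation(Type type, long sequence, String table,
                        Map<String, Object> oldRow, Map<String, Object> newRow) {
        this.type = type;
        this.sequence = sequence;
        this.table = table;
        this.oldRow = Collections.unmodifiableMap(new LinkedHashMap<>(oldRow));
        this.newRow = Collections.unmodifiableMap(new LinkedHashMap<>(newRow));
    }

    static RowMutation insert(long seq, String table, Map<String, Object> row) {
        return new RowMutation(Type.INSERT, seq, table, Collections.<String, Object>emptyMap(), row);
    }

    static RowMutation delete(long seq, String table, Map<String, Object> row) {
        return new RowMutation(Type.DELETE, seq, table, row, Collections.<String, Object>emptyMap());
    }

    static RowMutation update(long seq, String table,
                              Map<String, Object> oldRow, Map<String, Object> newRow) {
        return new RowMutation(Type.UPDATE, seq, table, oldRow, newRow);
    }

    // Columns whose value differs between the old and the new row.
    Set<String> changedColumns() {
        Set<String> names = new TreeSet<>(oldRow.keySet());
        names.addAll(newRow.keySet());
        names.removeIf(c -> Objects.equals(oldRow.get(c), newRow.get(c)));
        return names;
    }

    // Small helper to build a row from alternating column names and values.
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            m.put((String) keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        RowMutation m = update(7L, "listings",
                row("id", 42, "max_occupancy", 2),
                row("id", 42, "max_occupancy", 4));
        System.out.println(m.type + " changed " + m.changedColumns());
    }
}
```

Carrying both rows in an update lets a consumer decide whether the change touches any indexed field before doing further work.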

Medusa ____________________________

Documents in index contain data from ~15 different source tables

Lucene needs a copy of all fields (not just fields that changed) to update the index

We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL

Reads from SpinalTap or directly from MySQL

Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents

The intermediate Thrift objects are persisted in Redis

As changes are detected, updated objects are pushed to the Search instances to update Lucene indexes

Can bootstrap the entire index in 3 minutes via multithreaded streaming

Leader election via ZooKeeper
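The reason for persisting intermediate documents can be sketched as follows: a change usually touches a few fields, but the index needs the full document, so the changed fields are merged into a stored copy and the complete document is pushed out. In this sketch a `HashMap` stands in for Redis, plain maps stand in for the Thrift objects, and `Consumer` callbacks stand in for the Search instances; `DocumentAssembler` and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of merging partial changes into full documents.
class DocumentAssembler {
    // Stand-in for the persistent store of intermediate documents.
    private final Map<Integer, Map<String, Object>> store = new HashMap<>();
    // Stand-ins for the search instances that receive full documents.
    private final List<Consumer<Map<String, Object>>> searchInstances = new ArrayList<>();

    void subscribe(Consumer<Map<String, Object>> instance) {
        searchInstances.add(instance);
    }

    // Merge the changed fields into the stored document, then push a copy of
    // the complete document to every subscriber.
    void onChange(int listingId, Map<String, Object> changedFields) {
        Map<String, Object> doc = store.computeIfAbsent(listingId, id -> new HashMap<>());
        doc.put("listing_id", listingId);
        doc.putAll(changedFields);
        for (Consumer<Map<String, Object>> instance : searchInstances) {
            instance.accept(new HashMap<>(doc)); // copy, so receivers cannot alter the store
        }
    }

    public static void main(String[] args) {
        DocumentAssembler assembler = new DocumentAssembler();
        assembler.subscribe(doc -> System.out.println("push: " + doc));

        Map<String, Object> first = new HashMap<>();
        first.put("room_type", "entire_home");
        assembler.onChange(42, first);

        Map<String, Object> second = new HashMap<>();
        second.put("max_occupancy", 4);
        assembler.onChange(42, second); // the pushed document still carries room_type
    }
}
```

Because the store always holds the latest full document, rebuilding an index from scratch only needs to stream the stored documents rather than re-join the source tables.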

[Diagram: Medusa, backed by persistent storage, pushes updates to the search instances Search1, Search2, … SearchN]

Ranking

Ranking Problem ____________________________

Not a text search problem

Users are almost never searching for a specific item; rather, they’re looking to “Discover”

The most common component of a query is location

Highly personalized – the user is a part of the query

Optimizing for conversion (Search -> Inquiry -> Booking)

Evolution through continuous experimentation

Ranking Components ____________________________

Relevance

Quality

Bookability

Personalization

Desirability of location

New host promotion

etc.

Ranking

Several hundred signals determining search ranking:

Properties of the listing (reviews, location, etc.)

Behavioral signals (mined from request logs)

Image quality and clickability (computer vision)

Host behavior (response time/rate, cancellations, etc.)

Host preferences model

[Diagram: signals are computed from DB snapshots and logs]

public void attemptLoadData() {
    DateTime remoteTs = dataLoader.getModTime(pathToSignals);

    // Reload only if nothing is loaded yet or the remote file is newer.
    if (currentTs == null || remoteTs.isAfter(currentTs)) {
        Map<K, D> newSignals = loadData();
        // The first load is accepted as is; later loads must pass the health check.
        if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
            synchronized (this) {
                signalsMap = newSignals;
                currentTs = remoteTs;
                this.notifyAll();
            }
        } else {
            LOG.severe("Failed to load the avro file: " + pathToSignals);
        }
    }
}

…

ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader =
    loaders.get(LoaderCollection.Loader.QualitySignals);
final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);

Loading Signals ____________________________

Storing signals in a separate data structure

Pros:

Good fit for this type of update pattern: not real-time, but almost everything changes on each load

No need for costly Lucene index rebuild

Greatly simplifies design

Cons:

Unable to use Lucene retrieval on such data
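The load-and-swap pattern behind this design can be sketched in a self-contained form. The sketch below assumes a hypothetical `Source` interface in place of the Avro file loader, and a hypothetical health check that rejects a load which lost more than half of the entries; neither is taken from the actual service.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: signals live in a map that is replaced wholesale
// whenever a newer, healthy snapshot becomes available.
class SignalsHolder<K, D> {
    interface Source<K, D> {
        long modTime();    // modification time of the signals snapshot
        Map<K, D> load();  // returns null if loading fails
    }

    private final Source<K, D> source;
    private volatile Map<K, D> signals;  // readers always see a complete map
    private long currentTs = -1;

    SignalsHolder(Source<K, D> source) {
        this.source = source;
    }

    // Returns true if a new snapshot was installed.
    synchronized boolean attemptLoad() {
        long remoteTs = source.modTime();
        if (signals != null && remoteTs <= currentTs) {
            return false; // nothing newer to load
        }
        Map<K, D> fresh = source.load();
        if (fresh == null || (signals != null && !isHealthy(fresh))) {
            return false; // keep serving the previous snapshot
        }
        signals = Collections.unmodifiableMap(new HashMap<>(fresh));
        currentTs = remoteTs;
        return true;
    }

    // Assumed health check: the new snapshot keeps at least half of the entries.
    private boolean isHealthy(Map<K, D> fresh) {
        return fresh.size() * 2 >= signals.size();
    }

    D get(K key) {
        Map<K, D> snapshot = signals;
        return snapshot == null ? null : snapshot.get(key);
    }

    public static void main(String[] args) {
        final Map<Integer, Double> data = new HashMap<>();
        data.put(1, 0.5);
        SignalsHolder<Integer, Double> holder = new SignalsHolder<>(new Source<Integer, Double>() {
            public long modTime() { return 1L; }
            public Map<Integer, Double> load() { return data; }
        });
        System.out.println("loaded: " + holder.attemptLoad() + ", signal: " + holder.get(1));
    }
}
```

Replacing the whole map in one reference assignment suits the update pattern described above, where almost every value changes on each load.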

Life of a Query

[Diagram: stages of a query. Stage labels, in extraction order: Query Understanding; Retrieval; External Calls; Populator; Scorer; Third Pass Ranking; Result Generation; AirEvents Logging. Annotations: Geocoding; Configuring retrieval options; Choosing ranking models; Quality; Bookability; Relevance; Filtering and Reranking; Pricing Service; Social Connections. Result counts shown between stages: 2000 results, 25 results.]

Ranking

Second Pass Ranking ____________________________

Traditional ranking scores each result independently, then sorts by the score r.

In contrast, second pass operates on the entire list at once, which makes it possible to implement features like result diversity, etc.
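The difference between per-result scoring and a pass over the whole list can be sketched with a toy diversity rule. Everything here is an assumption made for illustration: the `Result` fields, the greedy selection and the multiplicative `penalty` for repeated neighborhoods are hypothetical and are not the ranking logic used in production. Scores are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SecondPassSketch {
    static final class Result {
        final int id;
        final String neighborhood;
        final double score;

        Result(int id, String neighborhood, double score) {
            this.id = id;
            this.neighborhood = neighborhood;
            this.score = score;
        }
    }

    // Per-result ranking: every result already has its own score; just sort.
    static List<Result> sortByScore(List<Result> results) {
        List<Result> sorted = new ArrayList<>(results);
        sorted.sort((a, b) -> Double.compare(b.score, a.score));
        return sorted;
    }

    // List-level pass: greedily pick the best remaining result, discounting a
    // score by `penalty` for each earlier pick from the same neighborhood.
    static List<Result> diversify(List<Result> results, double penalty) {
        List<Result> remaining = sortByScore(results);
        List<Result> out = new ArrayList<>();
        Map<String, Integer> picked = new HashMap<>();
        while (!remaining.isEmpty()) {
            Result best = null;
            double bestAdjusted = Double.NEGATIVE_INFINITY;
            for (Result r : remaining) {
                int repeats = picked.getOrDefault(r.neighborhood, 0);
                double adjusted = r.score * Math.pow(penalty, repeats);
                if (adjusted > bestAdjusted) {
                    best = r;
                    bestAdjusted = adjusted;
                }
            }
            remaining.remove(best);
            out.add(best);
            picked.merge(best.neighborhood, 1, Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
                new Result(1, "Marais", 0.90),
                new Result(2, "Marais", 0.85),
                new Result(3, "Montmartre", 0.80),
                new Result(4, "Marais", 0.50));
        for (Result r : diversify(results, 0.5)) {
            System.out.println(r.id + " " + r.neighborhood);
        }
    }
}
```

In this toy input, plain sorting yields the ids 1, 2, 3, 4, while the list-level pass yields 1, 3, 2, 4: the position of a result depends on what was already placed above it, which a per-result score cannot express.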

Outside of the scope of this talk ____________________________

Ranking models

Machine Learning infrastructure

Tools (loadtest, deploy, etc.)

Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.