Engineering Challenges in Vertical Search Engines

Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com

+Introduction

  Vertical Search – search focused on vertical data

  Vertical Data – data inherently described by its structure:

  Items/Properties for sale (Automotive, Real Estate..)

  Geographical Data (Neighborhoods, Locations..)

  Services (Hotels, Transportation..)

  Businesses (Restaurants, Nightlife..)

  Events (Concerts, Plays..)

  Auction items (Collectibles, Art..)

  Metadata (News, Social Data, Reviews..)

  …

+Introduction

  Vertical Search != Full Text Search

  Full Text Search queries:

  “Cheap tickets for Broadway shows this week”

  “Trendy Restaurants in San Francisco near SoMa”

  “3-day trips from NYC to anywhere under $1000”

  Vertical Search queries:

  “price-sorted results below two standard deviations from tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”

  “distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”

  “total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price below max value of $1000 for all valid locations”
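  As a rough illustration of the first structured query above, here is a minimal Python sketch. The `Listing` type, the `cheap_listings` function and its field names are hypothetical, and the reading of "below two standard deviations" as `price < mean - 2 * sigma` over the matching subset is an assumption, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Listing:
    # Hypothetical structured record; field names are illustrative only.
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Return price-sorted listings priced more than k std devs below the mean."""
    # Structured filter: exact match on category/location, range match on date.
    subset = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.event_date <= end
    ]
    if len(subset) < 2:
        return []  # no meaningful spread to compare against
    prices = [item.price for item in subset]
    # Assumption: "below two standard deviations" means price < mean - k * sigma.
    cutoff = mean(prices) - k * pstdev(prices)
    return sorted((i for i in subset if i.price < cutoff), key=lambda i: i.price)
```

  The point of the sketch is that every clause of the query maps to a field predicate, a statistic over the matching subset, or a sort key; none of it is keyword matching.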

+Introduction

  Vertical Search = search on structured data

  Vertical Search at Web-Scale:

  Web-Scale datasets

  Web-Scale query volumes

  Interactive operation

  Low latency requirements

  Utility maximization across all involved parties

  => loads of fun ! : )


  Vast.com : Vertical Search & Analytics Platform

  Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..


  Daily processing of up to 1 TB of unstructured and semi-structured Web data

  Managing an operational dataset of ~150M records across multiple verticals

  Handling > 1000 queries/sec peak search query loads

  We’re hiring ! : )

+Challenges in Vertical Search Engines

  Web Data Retrieval

  Unstructured Data

  Data Processing Infrastructures

  Vertical Search

  Data Analytics

  Computational Advertising

+Web Data Retrieval

  Crawler Architecture:

  Queue Management

  Crawl Ordering Policies

  Duplicate URL Detection

  Content Hash Management

  Politeness Management

  Coverage Measurement

  Freshness Optimization

  Incremental Crawling
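  Two of the crawler items above, duplicate URL detection and content hash management, can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions: the names `canonical_url`, `content_fingerprint` and `SeenFilter` are hypothetical, the normalization rules are one reasonable choice among many, and a real crawler would likely need a persistent or probabilistic store rather than Python sets.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is None or DEFAULT_PORTS.get(scheme) == parts.port:
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters; drop the fragment, which never reaches the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))


def content_fingerprint(text):
    """Hash page text after collapsing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers canonical URLs and content hashes seen so far."""

    def __init__(self):
        self._urls = set()
        self._hashes = set()

    def should_fetch(self, url):
        key = canonical_url(url)
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def is_new_content(self, text):
        digest = content_fingerprint(text)
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True
```

  An exact hash only catches identical content after normalization; near-duplicate pages would need a similarity-preserving fingerprint instead.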

+Web Data Retrieval

  “Deep Web” crawling:

  Locating Deep Web Content Sources

  Selecting Relevant Sources

  Estimating Database Size

  Understanding Content / Form Detection

  Automatic Dispatch of HTML Forms

  Predicting content in free text forms

  Crawling non-HTML Content

  Estimating Query Result Sparsity

  URL Generation problem

  Query Covering Problem

+Web Data Retrieval

  Focused (Topical) Crawling:

  Content Classification

  Link Content Prediction

  Topic Relevance Estimation

  Modeling Temporal Characteristics:

  Site-Level Evolution

  Page-Level Evolution

  Adversarial Crawling:

  Web Spam Detection

  Cloaked Content Detection

+Unstructured Data

  Unstructured Data – information that does not have a pre-defined data model

  Handling Unstructured Data:

  Data Cleaning

  Tagging with Metadata

  Vertical Classification

  Schema Matching

  Information Extraction

  Post: Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

  Fields: make, model, year, trim, price ???

+Unstructured Data

  Information extraction from unstructured, ungrammatical data:

  Reference Sets - relational data sets that consist of a collection of known entities with associated common attributes

  Reference Set Selection

  Reference Set Generation

  Record Linkage : Finding the “best matching” member of the reference set for a given post

  Challenge : Automatic Generation of Reference Sets
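  A minimal Python sketch of reference-set-based extraction for a car listing post follows. Everything here is illustrative: the tiny `REFERENCE_SET`, the token-overlap score and the regular expressions for year and price are assumptions chosen for brevity, not the method used in production.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fusion", "trim": "SE"},
    {"make": "Honda", "model": "Civic", "trim": "LX"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set):
    """Record linkage: pick the entry whose attribute tokens best overlap the post."""
    post_tokens = tokens(post)

    def score(entry):
        entry_tokens = set().union(*(tokens(v) for v in entry.values()))
        return len(entry_tokens & post_tokens) / len(entry_tokens)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post, reference_set):
    """Combine the linked reference entry with pattern-based year and price."""
    record = dict(link_record(post, reference_set) or {})
    year = re.search(r"\b(?:19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group())
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

  The reference set supplies the vocabulary (make, model, trim) that an ungrammatical post cannot be parsed for, while attributes with a regular surface form (year, price) are picked up by patterns.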

+Data Processing Infrastructures

  Infrastructures for continuous processing of unbounded streams of unstructured data

  Information Extraction as part of processing (non-trivial computation per processed entry)

  Inherently distributed infrastructures - in order to support performance and scalability

  Time-to-site constraints. Ability to process out-of-band data.

  Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)

  Support for data archival and off-line analysis


  Distributed Computing Platforms:

  Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

  Stream-oriented (Flume, S4, Stream SQL…)

  Distributed Data Stores (Dynamo/Cassandra/Riak…)

  The curse of CAP Theorem: it is impossible for a distributed system to simultaneously provide all three of the following guarantees:

  Consistency

  Availability

  Partition tolerance

+Vertical Search

  Large-Scale structured data search

  Providing both analytic functionality and the canonical set of Information Retrieval functionality

  Entries are represented in Vector Space Model

  Each result is represented as a data point – a tuple consisting of the appropriate number of fields : (make, model, year, trim …)

+Vertical Search

  Search in Vector Space Model:

  Resulting subset generation

  Sorting as linearization using selected metric

  Dynamic subset criteria calculation

  Search Result Clustering

  “Similar” result search

  …

  … with up to ~100 ms response time

  … at 10M+ records in index

  … handling 100+ queries/sec/host
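  The operations listed above (subset generation, sorting by a selected metric, "similar" result search) can be sketched over records treated as points in a vector space. This is a naive, single-process Python sketch for illustration only: the function names `search` and `similar`, the dict-based records and the range-scaled Euclidean distance are assumptions, and linear scans like these would not meet the stated latency and scale targets without an index.

```python
import math


def search(records, predicate, key, reverse=False):
    """Subset generation (predicate) followed by linearization (sort by key)."""
    return sorted((r for r in records if predicate(r)), key=key, reverse=reverse)


def similar(records, target, fields, k=3):
    """Return the k records closest to target over the given numeric fields."""
    # Scale each field by its range so no single field dominates the distance.
    spans = {
        f: (max(r[f] for r in records) - min(r[f] for r in records)) or 1.0
        for f in fields
    }

    def distance(record):
        return math.sqrt(
            sum(((record[f] - target[f]) / spans[f]) ** 2 for f in fields)
        )

    candidates = (r for r in records if r is not target)
    return sorted(candidates, key=distance)[:k]
```

  Sorting is "linearization" in the sense that a multi-dimensional point set is projected onto one ordering; swapping the `key` or distance function changes the ordering without touching the subset logic.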

+Vertical Search

  Faceted Search:

  fac-et (fas’it) :

  1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.

  2. One of numerous aspects, as of a subject.

  Vocabulary problem for faceted data: “the keywords that are assigned by indexers are often at odds with those tried by searchers.”

  Facet Design / selection

  Selection of information-distinguishing facet values

  User-specific faceted search

  Dynamic correlated facet generation

  Distributing facet computation
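  A minimal Python sketch of facet computation over a result set follows. The functions `facet_counts`, `facet_entropy` and `rank_facets` are hypothetical names, and ranking facets by Shannon entropy is only one possible heuristic for picking "information-distinguishing" facets, assumed here for illustration.

```python
import math
from collections import Counter


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    return {
        field: Counter(r[field] for r in results if field in r)
        for field in facet_fields
    }


def facet_entropy(counts):
    """Shannon entropy (bits) of a facet's value distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)


def rank_facets(results, facet_fields):
    """Order facets so the most evenly splitting (highest entropy) comes first."""
    counts = facet_counts(results, facet_fields)
    return sorted(facet_fields, key=lambda f: facet_entropy(counts[f]), reverse=True)
```

  A facet on which every result has the same value has zero entropy and gives the user nothing to narrow by, which is why it sorts last. Because the counts depend on the current result set, they have to be recomputed per query, which is what makes distributing facet computation a problem in its own right.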

+Data Analytics

  Clickstream Data Analysis

  Learning from implicit user feedback

  Anonymous user clustering

  Learning to rank

  Inventory/Market Trends

  Rare Event detection

  Price Prediction

  Spam Content detection

+Data Analytics

  Challenges:

  “Good Deal” detection

  Recommendation Systems for Vertical Data with no explicit user feedback

  Accuracy of Automatic Valuation Models

  Data-driven feature design

  Click Prediction

  User Behavior Modeling

+Computational Advertising

  The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.

  [Figure: results page layout with ads placed around the search results]


  Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”

  [Figure: results page where any individual search result may be an ad]


  Central challenge: find the “best match” between a given user in a given context and a suitable advertisement

  “best match” – maximizing the value for :

  Users

  Advertisers

  Publishers

  Each of the parties has a different set of utilities:

  Users want relevance

  Advertisers want ROI and volume

  Publishers want revenue per impression/search


  CTR (Click-Through Rate) Estimation:

  Reactive (statistically significant historical CTR)

  Predictive (CTR estimated from features of ads)

  Hybrid (historical + predictive)

  Personalization of CTR Computation ?

  Dynamic CTR Estimation (online algorithms)

P(click) = ?
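  One simple way to combine the reactive and predictive estimates above is to treat the feature-based prediction as a prior that historical data gradually overrides. The Python sketch below assumes that approach; `hybrid_ctr`, `predicted_ctr` and `prior_strength` are hypothetical names, and the pseudo-count smoothing is a common technique rather than the estimator described in the slides.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical CTR with a feature-based prediction.

    predicted_ctr acts as a prior worth `prior_strength` pseudo-impressions:
    with no history the result equals the prediction, and with a lot of
    history it approaches clicks / impressions.
    """
    if impressions < 0 or clicks < 0 or clicks > impressions:
        raise ValueError("need 0 <= clicks <= impressions")
    if not 0.0 <= predicted_ctr <= 1.0:
        raise ValueError("predicted_ctr must be a probability")
    pseudo_clicks = predicted_ctr * prior_strength
    return (clicks + pseudo_clicks) / (impressions + prior_strength)
```

  For example, a new ad with a predicted CTR of 0.02 and no impressions is scored at 0.02, while the same ad after 50 clicks in 1000 impressions is scored at 52 / 1100, much closer to its observed 0.05.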


  Analytical Apparatus:

  Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)

  Game Theory (Nash Equilibria, dominant strategy)

  Auction Theory (Vickrey, GSP, VCG…)

  Graph Theory (random walks on graphs, graph matching, etc.)

  Information Retrieval Techniques (similarity metrics, etc.)

  …
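  To make the auction theory item concrete, here is a minimal Python sketch of a generalized second-price (GSP) auction in which ads are ranked by bid times estimated CTR. The function name `gsp_auction`, the tuple layout and the `reserve` parameter are assumptions for illustration; real ad auctions add quality scores, budgets and other constraints not shown here.

```python
def gsp_auction(bids, slots, reserve=0.0):
    """Allocate ad slots with a CTR-weighted generalized second-price rule.

    bids: list of (ad_id, bid_per_click, estimated_ctr)
    Returns a list of (ad_id, price_per_click) in slot order.
    """
    # Ads with no chance of a click cannot be ranked or priced meaningfully.
    eligible = [b for b in bids if b[2] > 0]
    # Rank by expected value per impression: bid * CTR.
    ranked = sorted(eligible, key=lambda b: b[1] * b[2], reverse=True)
    results = []
    for pos, (ad_id, _bid, ctr) in enumerate(ranked[:slots]):
        if pos + 1 < len(ranked):
            _, next_bid, next_ctr = ranked[pos + 1]
            # Pay the smallest per-click price that still outranks the next ad.
            price = max(reserve, next_bid * next_ctr / ctr)
        else:
            price = reserve
        results.append((ad_id, price))
    return results
```

  Each winner pays just enough to hold its position rather than its own bid, so the price per click never exceeds the bid as long as the reserve is below it.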

+Conclusion

  Vertical Search & Analytics at Web Scale == fun !!!

  A source of a large number of relevant research & engineering problems !

  Opportunity to apply a wide spectrum of techniques across all areas of Computer Science and Engineering !

Jump on the bandwagon ! : )
