Engineering challenges in vertical search engines
+
Engineering Challenges in Vertical Search Engines
Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
+Introduction
Vertical Search: search focused on vertical data
Vertical Data – data inherently described by its structure:
Items/Properties for sale (Automotive, Real Estate..)
Geographical Data (Neighborhoods, Locations..)
Services (Hotels, Transportation..)
Businesses (Restaurants, Nightlife..)
Events (Concerts, Plays..)
Auction items (Collectibles, Art..)
Metadata (News, Social Data, Reviews..)
…
+Introduction
Vertical Search != Full Text Search
Full Text Search queries:
“Cheap tickets for Broadway shows this week”
“Trendy Restaurants in San Francisco near SoMa”
“3-day trips from NYC to anywhere under $1000”
Vertical Search queries:
“price-sorted results below two standard deviations from tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”
“distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”
“total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price below max value of $1000 for all valid locations”
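The first vertical query above can be sketched as a filter on structured fields followed by a sort. This is only an illustration: the `Listing` fields, the function name, and the reading of "below two standard deviations" as "price below mean minus k standard deviations of the matching set" are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Listing:
    # Hypothetical schema for a ticket listing.
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Filter on structured fields, keep prices below mean - k*stddev, sort by price."""
    pool = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.event_date <= end
    ]
    if len(pool) < 2:
        return []  # not enough data to estimate a spread
    prices = [item.price for item in pool]
    cutoff = mean(prices) - k * pstdev(prices)
    return sorted((item for item in pool if item.price < cutoff),
                  key=lambda item: item.price)
```

Unlike a full text query, every condition here refers to a typed field, so the threshold and the sort order are computed from the data rather than matched against words.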
+Introduction
Vertical Search = search on structured data
Vertical Search at Web-Scale:
Web-Scale datasets
Web-Scale query volumes
Interactive operation
Low latency requirements
Utility maximization across all involved parties
=> loads of fun ! : )
Vast.com : Vertical Search & Analytics Platform
Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..
Daily processing of up to 1 TB of unstructured and semi-structured Web data
Managing an operational dataset of ~150M records across multiple verticals
Handling peak search loads of > 1000 queries/sec
We’re hiring ! : )
+Challenges in Vertical Search Engines
Web Data Retrieval
Unstructured Data
Data Processing Infrastructures
Vertical Search
Data Analytics
Computational Advertising
+Web Data Retrieval
Crawler Architecture:
Queue Management
Crawl Ordering Policies
Duplicate URL Detection
Content Hash Management
Politeness Management
Coverage Measurement
Freshness Optimization
Incremental Crawling
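Several of the crawler concerns above (queue management, duplicate URL detection, content hash management, politeness) can be shown together in a small in-memory sketch. The `Frontier` class, its method names, and the fixed per-host delay are illustrative assumptions; a production crawler would persist this state and distribute it across machines.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


class Frontier:
    """Toy crawl frontier: URL dedup, per-host politeness, content-hash dedup."""

    def __init__(self, delay=1.0):
        self.delay = delay          # minimum seconds between hits to one host
        self.queue = deque()
        self.seen_urls = set()
        self.seen_hashes = set()
        self.next_allowed = {}      # host -> earliest allowed fetch time

    @staticmethod
    def normalize(url):
        # Lowercase scheme and host, drop the fragment, default the path to "/".
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def add(self, url):
        key = self.normalize(url)
        if key in self.seen_urls:
            return False            # duplicate URL
        self.seen_urls.add(key)
        self.queue.append(key)
        return True

    def next_url(self, now):
        """Return a queued URL whose host may be fetched at `now`, else None."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlsplit(url).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return url
            self.queue.append(url)  # host is throttled; rotate to the back
        return None

    def is_new_content(self, body):
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False            # same bytes already seen under another URL
        self.seen_hashes.add(digest)
        return True
```

An exact hash only catches byte-identical pages; near-duplicate detection would need a different technique.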
+Web Data Retrieval
“Deep Web” crawling:
Locating Deep Web Content Sources
Selecting Relevant Sources
Estimating Database Size
Understanding Content / Form Detection
Automatic Dispatch of HTML Forms
Predicting content in free text forms
Crawling non-HTML Content
Estimating Query Result Sparsity
URL Generation problem
Query Covering Problem
+Web Data Retrieval
Focused (Topical) Crawling:
Content Classification
Link Content Prediction
Topic Relevance Estimation
Modeling Temporal Characteristics:
Site-Level Evolution
Page-Level Evolution
Adversarial Crawling:
Web Spam Detection
Cloaked Content Detection
+Unstructured Data
Unstructured Data – information that does not have a pre-defined data model
Handling Unstructured Data:
Data Cleaning
Tagging with Metadata
Vertical Classification
Schema Matching
Information Extraction
Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!
→ make, model, year, trim, price ???
+Unstructured Data
Information extraction from unstructured, ungrammatical data:
Reference Sets - relational data sets that consist of a collection of known entities with associated common attributes
Reference Set Selection
Reference Set Generation
Record Linkage : finding the “best matching” member of the reference set for a given post
Challenge : Automatic Generation of Reference Sets
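A minimal sketch of record linkage against a reference set, using the "Ford Focus" post from the earlier slide. The tiny hand-written `REFERENCE_SET`, the token-overlap score, and the regular expressions for year and price are all assumptions made for illustration; they are not the method used at Vast.com.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Focus", "trim": "Sedan"},
    {"make": "Honda", "model": "Civic", "trim": "Coupe"},
]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post):
    """Return the reference entry with the most attribute values found in the post."""
    post_tokens = tokens(post)
    best, best_score = None, 0
    for entry in REFERENCE_SET:
        score = sum(1 for value in entry.values() if tokens(value) <= post_tokens)
        if score > best_score:
            best, best_score = entry, score
    return best


def extract(post):
    """Combine the linked entry with simple pattern-based year and price extraction."""
    record = dict(link_record(post) or {})
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group(0))
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

Calling `extract("Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!")` fills in make, model, trim, year and price. The sketch only works because the reference set already exists, which is why generating such sets automatically is the hard part.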
+Data Processing Infrastructures
Infrastructures for continuous processing of unbounded streams of unstructured data
Information Extraction as part of processing (non-trivial computation per processed entry)
Inherently distributed infrastructures - in order to support performance and scalability
Time-to-site constraints. Ability to process out-of-band data.
Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)
Support for data archival and off-line analysis
+Data Processing Infrastructures
Distributed Computing Platforms:
Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)
Stream-oriented (Flume, S4, Stream SQL…)
Distributed Data Stores (Dynamo/Cassandra/Riak…)
The curse of CAP Theorem: It is impossible for a distributed system to simultaneously provide all three of the following guarantees:
Consistency
Availability
Partition tolerance
+Vertical Search
Large-Scale structured data search
Providing both analytic and canonical Information Retrieval functionality
Entries are represented in Vector Space Model
Each result is represented as a data point – a tuple consisting of an appropriate number of fields:
(make, model, year, trim …)
+Vertical Search
Search in Vector Space Model:
Resulting subset generation
Sorting as linearization using selected metric
Dynamic subset criteria calculation
Search Result Clustering
“Similar” result search
…
… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
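"Similar" result search in a vector space can be sketched as a nearest-neighbor lookup over normalized numeric fields. The field names and the brute-force scan below are assumptions chosen for clarity; scanning every record would not meet the latency and index-size targets listed above, so a real engine would need an index structure.

```python
import math


def normalize(points, fields):
    """Scale each numeric field to [0, 1] so no single field dominates the distance."""
    lo = {f: min(p[f] for p in points) for f in fields}
    hi = {f: max(p[f] for p in points) for f in fields}
    return [
        tuple((p[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
              for f in fields)
        for p in points
    ]


def similar(points, index, fields, k=2):
    """Return the k records closest (Euclidean distance) to points[index]."""
    vectors = normalize(points, fields)
    target = vectors[index]
    ranked = sorted(
        (i for i in range(len(points)) if i != index),
        key=lambda i: math.dist(vectors[i], target),
    )
    return [points[i] for i in ranked[:k]]
```

Sorting by distance to a reference point is one instance of "sorting as linearization using a selected metric": the multi-dimensional records are projected onto a single ordering.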
+Vertical Search
Faceted Search
fac-et (fas’it) :
1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.
2. One of numerous aspects, as of a subject.
Vocabulary problem for faceted data: “the keywords that are assigned by indexers are often at odds with those tried by searchers.”
Facet Design / selection
Selection of information-distinguishing facet values
User-specific faceted search
Dynamic correlated facet generation
Distributing facet computation
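The core computation behind faceted search is counting facet values over the records that match the current filters. A minimal single-machine sketch, where the function name and the dictionary-based record format are assumptions:

```python
from collections import Counter


def facet_counts(records, facet_fields, filters=None):
    """Apply exact-match filters, then count values of each facet field."""
    filters = filters or {}
    matching = [
        r for r in records
        if all(r.get(field) == value for field, value in filters.items())
    ]
    counts = {
        field: Counter(r[field] for r in matching if field in r)
        for field in facet_fields
    }
    return matching, counts
```

Because the counts depend on the selected filters, they have to be recomputed per query, which is what makes dynamic facet generation and distributed facet computation hard at scale.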
+Data Analytics
Clickstream Data Analysis
Learning from implicit user feedback
Anonymous user clustering
Learning to rank
Inventory/Market Trends
Rare Event detection
Price Prediction
Spam Content detection
+Data Analytics
Challenges:
“Good Deal” detection
Recommendation Systems for Vertical Data with no explicit user feedback
Accuracy of Automatic Valuation Models
Data-driven feature design
Click Prediction
User Behavior Modeling
+Computational Advertising
The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.
[Figure: results page layout with regions labeled “ads”, “ads” and “search results”]
+Computational Advertising
Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”
[Figure: results page in which individual search results are labeled “ad ?”]
+Computational Advertising
Central challenge: Find the “best match” between a given user in a given context and a suitable advertisement
“best match” – maximizing the value for :
Users
Advertisers
Publishers
Each of the parties has a different set of utilities:
Users want relevance
Advertisers want ROI and volume
Publishers want revenue per impression/search
+Computational Advertising
CTR (Click-Through Rate) Estimation:
Reactive (statistically significant historical CTR)
Predictive (CTR estimated from features of ads)
Hybrid (historical + predictive)
Personalization of CTR Computation ?
Dynamic CTR Estimation (online algorithms)
P(click) = ?
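One common way to combine the reactive and predictive estimates is additive smoothing: treat the feature-based prediction as a prior worth a fixed number of pseudo-impressions and let observed clicks override it as data accumulates. The function name and the `prior_strength` value are assumptions for illustration, not the scheme described in the talk.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical CTR with a feature-based prediction.

    Equivalent to the posterior mean under a Beta prior whose mean is
    `predicted_ctr` and whose weight is `prior_strength` pseudo-impressions.
    """
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)
```

With no history the estimate equals the prediction; with many impressions it approaches the historical rate `clicks / impressions`.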
+Computational Advertising
Analytical Apparatus:
Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)
Game Theory (Nash Equilibria, dominant strategy)
Auction Theory (Vickrey, GSP, VCG…)
Graph Theory (random walks on graphs, graph matching, etc.)
Information Retrieval Techniques (similarity metrics, etc.)
…
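As a small example from the auction theory item, a generalized second-price (GSP) auction can be sketched as follows: bidders are ranked by bid, and each winner pays the bid of the advertiser ranked directly below. The function signature, the bid-only ranking, and the `reserve` parameter are simplifying assumptions; deployed systems typically also weight bids by an estimated click probability.

```python
def gsp_auction(bids, slots, reserve=0.0):
    """Allocate ad slots by generalized second-price rules.

    bids: mapping of advertiser -> bid per click.
    Returns a list of (advertiser, price_per_click), best slot first.
    """
    ranked = sorted(
        ((adv, bid) for adv, bid in bids.items() if bid >= reserve),
        key=lambda pair: pair[1],
        reverse=True,
    )
    results = []
    for i, (adv, _bid) in enumerate(ranked[:slots]):
        # Pay the next-highest bid, or the reserve if nobody is ranked below.
        price = ranked[i + 1][1] if i + 1 < len(ranked) else reserve
        results.append((adv, price))
    return results
```

With a single slot this reduces to a Vickrey (second-price) auction.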
+Conclusion
Vertical Search & Analytics at Web Scale == fun !!!
Source of a large number of relevant research & engineering problems !
Opportunity to apply a wide spectrum of techniques across all areas of Computer Science and Engineering !
Jump on the bandwagon ! : )