WS-CIM mapping using WSDM Igor Sedukhin Heather Kreger Andreas Maier
Response prediction for display advertising - WSDM 2014
-
Upload
olivier-chapelle -
Category
Technology
-
view
6.307 -
download
1
description
Transcript of Response prediction for display advertising - WSDM 2014
![Page 1: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/1.jpg)
RESPONSE PREDICTION FORDISPLAY ADVERTISING
WSDM ’14
Olivier Chapelle
![Page 2: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/2.jpg)
2
OUTLINE
1. Criteo & Display advertising
2. Modeling
3. Large scale learning
4. Explore / exploit
![Page 3: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/3.jpg)
3
DISPLAY ADVERTISING
Display Ad
![Page 4: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/4.jpg)
4
DISPLAY ADVERTISING
• Rapidly growing multi-billion dollar business (30% of internet advertising revenue in 2013).
• Marketplace between:– Publishers: sell display opportunities– Advertisers: pay for showing their ad
• Real Time Bidding:– Auction amongst advertisers is held at the moment when a user generates
a display opportunity by visiting a publisher’s web page.
![Page 5: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/5.jpg)
5
BRANDING VS PERFORMANCE
PRICING TYPE
CAMPAIGN TYPE
• CPM (Cost Per Mille): advertiser pays per thousand impressions
• CPC (Cost Per Click): advertiser pays only when the user clicks
• CPA (Cost Per Action): advertiser pays only when the user performs a predefined action such as a purchase.
• Branding CPM
• Performance based advertising (retargeting) CPC, CPA
CONVERSIONS
eCPM = CPC * predicted clickthrough rate
eCPM = CPA * predicted conversion rate
![Page 6: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/6.jpg)
6
CRITEO
6
Bill on
CPC
Advertisers
Pay on
CPM
Publishers
Criteo’s success depends highly on how precisely we can predict Click Through Rates (CTR) and Conversion Rates (CR)
![Page 7: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/7.jpg)
7
RECOMMENDATION
Collaborative filteringPropose related items to a user based on historical interactions from other users
Over 50% of Criteo driven sales come from recommended products the user had never viewed on advertiser websites
![Page 8: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/8.jpg)
8
CRITEO NUMBERS
8
750EMPLOYEES
K250AD CALLSPER SECOND
107500 CORES CLUSTER
PBB
AN
NE
RS
PE
R D
AY 2.1
BILLION
BILLION18
RT
BID
SP
ER
DA
Y
42COUNTRIES
$600MILLION2013 REVENUE
![Page 9: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/9.jpg)
9
OUTLINE
1. Criteo & Display advertising
2. Modeling
3. Large scale learning
4. Explore / exploit
![Page 10: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/10.jpg)
10
• Three sources of features: user, ad, page• In this talk: categorical features on ad and page.
FEATURES
Publisher network
Publisher
Site
Url
Advertiser network
Ad
Campaign
Advertiser
Publisher hierarchy Advertiser hierarchy
![Page 11: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/11.jpg)
11
HASHING TRICK
• Standard representation of categorical features: “one-hot” encoding
For instance, site feature
• Dimensionality equal to the number of different values– can be very large
• Hashing to reduce dimensionality (made popular by John Langford in VW)
• Dimensionality now independent of number of values
cnn.com news.yahoo.com
0 0 01 0 0 0
![Page 12: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/12.jpg)
12
HASHING VS FEATURE SELECTION
• “Small” problem with 35M different values.• Methods that require a dictionary have a larger model.
![Page 13: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/13.jpg)
13
QUADRATIC FEATURES
• Outer product between two features.• Example: between site and advertiser,
Feature is 1 site=finance.yahoo.com & advertiser=bank of america
Similar to a polynomial kernel of degree 2 Large number of values hashing trick
Publisher network
Publisher
Site
Url
Advertiser network
Ad
Campaign
Advertiser
![Page 14: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/14.jpg)
14
ADVANTAGES OF HASHING
• Practical– Straightforward implement; no need to maintain dictionaries
• Statistical– Regularization (infrequent values are washed away by frequent ones)
• Most powerful when combined with quadratic features
Quote of John Langford about hashing
At first it’s scary, then you love it
![Page 15: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/15.jpg)
15
LEARNING
• Regularized logistic regression– Vowpal Wabbit open source package
• Regularization with hierarchical features backoff smoothing
• Negative data subsampled for computational reason
Well estimated Small if rare value
![Page 16: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/16.jpg)
16
EVALUATION
• Comparison with (Agarwal et al. ’10)– Probabilistic model for the same display advertising prediction problem – Leverages the hierarchical structures on the ad and publisher sides– Sparse prior for smoothing
• Model trained on three weeks of data, tested on the 3 following days
auROC auPRC Log likelihood
+ 3.1% + 10.0% + 7.1%
D. Agarwal et al., Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD, 2010
![Page 17: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/17.jpg)
17
BAYESIAN LOGISTIC REGRESSION
• Regularized logistic regression = MAP solution (Gaussian prior, logistic likelihood)
• Posterior is not Gaussian• Diagonal Laplace approximation:
with:
and:
![Page 18: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/18.jpg)
18
MODEL UPDATE
• Needed because ads / campaigns keep changing.• The posterior distribution of a previously trained model can be used as
the prior for training a new model with a new batch of data
• Influence of the update frequency (auPRC):
1 day 6 hours 2 hours
+3.7% +5.1% +5.8%
Day 1 Day 2 Day 3 Day 4 Day 5
M0 M1 M2
![Page 19: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/19.jpg)
19
OUTLINE
1. Criteo & Display advertising
2. Modeling
3. Large scale learning
4. Explore / exploit
![Page 20: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/20.jpg)
20
PARALLEL LEARNING
• Large training set– 2B training samples; 16M parameters– 400GB (compressed)
• Proposed method: less than one hour with 500 machines
• Optimize:
• SGD is fast on a single machine, but difficult to parallelize.• Batch (quasi-Newton) methods are straightforward to parallelize
– L-BFGS with distributed gradient computation.
= + … +S S1 Sm
![Page 21: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/21.jpg)
21
ALLREDUCE
• Aggregate and broadcast across nodes
• Very few modification to existing code: just insert several AllReduce op.• Compatible with Hadoop / MapReduce
– Build a spanning tree on the gateway– Single MapReduce job– Leverage speculative execution to alleviate the slow node issue
![Page 22: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/22.jpg)
22
ONLINE INITIALIZATION
• Hybrid approach:– One pass of online learning on each node– Average the weights from each node to get a warm start for batch
optimization
• Best of both (online / batch) worlds.
Splice site prediction (Sonnenburg et al. ‘10) Display advertising
S. Sonnenburg and V. Franc, COFFIN: A Computational Framework for Linear SVMs, ICML 2010
![Page 23: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/23.jpg)
23
OUTLINE
1. Criteo & Display advertising
2. Modeling
3. Large scale learning
4. Explore / exploit
![Page 24: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/24.jpg)
24
THOMPSON SAMPLING
• Heuristic to address the Explore / Exploit problem, dating back to Thompson (1933)
• Simple to implement• Good performance in practice (Graepel et al. ‘10, Chapelle and Li ‘11)• Rarely used, maybe because of lack of theoretical guarantee.
T. Graepel et al., Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine, ICML 2010
O. Chapelle and L. Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011
Draw model parameter qt according to P( | q D)
Select best action according to qt
Observe reward and update model
![Page 25: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/25.jpg)
25
E/E SIMULATIONS
MAB with K arms. Best arm has mean reward = 0.5, others have 0.5 − ε.
![Page 26: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/26.jpg)
26
EVALUATION
• Semi-simulated environment: real input features, but labels generated.• Set of eligible ads varies from 1 to 5,910. Total ads = 66,373• Comparison of E/E algorithms:
– 4 days of data– Cold start
• Algorithms:– UCB: mean + a std. dev. – e-greedy– Thompson sampling
![Page 27: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/27.jpg)
27
RESULTS
• CTR regret (in percentage):
• Regret over time:
Thompson UCB e-greedy Exploit-only Random
3.72 4.14 4.98 5.00 31.95
![Page 28: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/28.jpg)
28
OPEN QUESTIONS
• Hashing– Theoretical performance guarantees
• Low rank matrix factorization– Better predict on unseen pairs (publisher, advertiser)
• Sample selection bias– System is trained only on selected ads, but all ads are scored.– Possible solution: inverse propensity scoring– But we still need to bias the training data toward good ads.
• Explore / exploit– Evaluation framework– Regret analysis of Thompson’s sampling– E/E with a budget; with multiple slots; with a delayed feedback
![Page 29: Response prediction for display advertising - WSDM 2014](https://reader036.fdocuments.in/reader036/viewer/2022081413/54796734b4af9f2c1e8b4755/html5/thumbnails/29.jpg)
29
CONCLUSION
• Simple yet efficient techniques for click prediction
• Main difficulty in applied machine learning: avoid the bias (because of academic papers) toward complex systems– It’s easy to get lured into building a complex system– It’s difficult to keep it simple
See paper for more details
Simple and scalable response prediction for display advertising
O. Chapelle, E. Manavoglu, R. Rosales, 2014