Reading Group: Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features
Presenter: Xiang Zhang
02/05/2023
Main Content
1. Background
2. Abstract
3. Why choose it
4. What's the main idea
5. How it works
6. Implementation
7. Experimentation
8. Conclusions
9. Experience
1. Background
Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features
Authors: Microsoft Research
Keywords: neural networks, deep learning, CNN, etc.
Conference: KDD 2016
KDD (rank A*): Conference on Knowledge Discovery and Data Mining
SIGKDD: Special Interest Group on Knowledge Discovery and Data Mining
KDD 2017: Aug. 13-17, Halifax, Canada; submission deadline December 9, 2016
2. Abstract
Combinatorial features: important and useful, but manual crafting is time-consuming, requires experience, and gives no accuracy guarantee, especially given the large scale, variety, and volume of features
Contribution: a deep neural network that combines features automatically
Tools: Computational Network Toolkit (CNTK), on a multi-GPU platform
3. Why choose it
Automatically combines features (reducing dimensionality)
Web-scale (massive data)
Handles features of different types and dimensions
Better performance
4. What's the main idea
What is feature extraction?
Individual features
An individual measurable property of a phenomenon being observed (a representation of the data)
Combinatorial features
Defined in the joint space of individual features; they give the model shorter training time, a simpler structure, and better generalization
Manually: requires time and experience
Automatically
5. How it works
Model Architecture
Embedding layer j: X_j^O = max(0, W_j · X_j^I + b_j)
W_j is an m_j × n_j weight matrix, X_j^I is the n_j × 1 input, and b_j and X_j^O are m_j × 1 vectors; with m_j < n_j, the layer reduces the feature dimension from n_j to m_j.
5.1 Embedding layers
X_j^O = max(0, W_j · X_j^I + b_j)
Activation function: defines the output of a node given its input
Common choices: ReLU, logistic, tanh, sigmoid
Rectified linear unit (ReLU): keeps all elements non-negative
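The embedding layer above can be sketched in a few lines of NumPy. The feature size (10,000), embedding size (256), and random weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def embedding_layer(x, W, b):
    """Per-feature embedding: X_O = max(0, W @ X_I + b).

    W has shape (m_j, n_j) with m_j < n_j, so the sparse input x
    (length n_j) is projected down to a dense vector of length m_j,
    and the ReLU keeps every output element non-negative.
    """
    return np.maximum(0.0, W @ x + b)

# Hypothetical sizes: a 10,000-dim one-hot feature embedded into 256 dims.
rng = np.random.default_rng(0)
n_j, m_j = 10_000, 256
W = rng.normal(scale=0.01, size=(m_j, n_j))
b = np.zeros(m_j)
x = np.zeros(n_j)
x[42] = 1.0                    # one-hot input
emb = embedding_layer(x, W, b)
print(emb.shape)               # (256,)
```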
Stacking: X^O = [X_0^O, X_1^O, …, X_K^O]
5.2 Stacking layers
Stacking rules (threshold: 256):
• If n_j > 256: embed first (m_j set to 256), then stack
• If n_j ≤ 256: stack directly, without embedding
Feature counts: K input features in total; n are embedded and then stacked, K − n are stacked without embedding, and all K end up stacked
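The stacking rule above can be sketched as follows; the per-feature sizes and the random embedding weights are made-up illustrations, and only the 256 threshold comes from the slides:

```python
import numpy as np

THRESHOLD = 256  # features larger than this are embedded before stacking

def stack_features(features, embeddings):
    """Concatenate per-feature vectors into one stacked vector X^O."""
    stacked = []
    for j, x in enumerate(features):
        if len(x) > THRESHOLD:        # high-dim: embed, then stack
            W, b = embeddings[j]
            x = np.maximum(0.0, W @ x + b)
        stacked.append(x)             # low-dim: stack directly
    return np.concatenate(stacked)

rng = np.random.default_rng(0)
feats = [rng.random(10_000), rng.random(50)]  # one large, one small feature
embs = {0: (rng.normal(scale=0.01, size=(256, 10_000)), np.zeros(256))}
x_stacked = stack_features(feats, embs)
print(x_stacked.shape)                         # (306,) = 256 + 50
```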
X^{OR} = F(X^{IR}, {W_0, W_1}, {B_0, B_1}) + X^{IR}
5.3 Residual layers
• Inputs and outputs have the same size
• This is among the first uses of the Residual Unit beyond image recognition
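A minimal sketch of the residual unit, assuming F is two fully connected ReLU layers whose output matches the input size so the shortcut can be added back; the sizes and weights are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_unit(x, W0, b0, W1, b1):
    """Residual unit sketch: output = ReLU(F(x) + x).

    Because F maps back to the input size, the input x can be
    added to F(x) directly (the shortcut connection).
    """
    h = relu(W0 @ x + b0)   # first inner layer
    f = W1 @ h + b1         # second inner layer, back to input size
    return relu(f + x)      # add the shortcut, then apply ReLU

rng = np.random.default_rng(0)
d, hidden = 306, 128
W0, b0 = rng.normal(scale=0.05, size=(hidden, d)), np.zeros(hidden)
W1, b1 = rng.normal(scale=0.05, size=(d, hidden)), np.zeros(d)
x = rng.random(d)
y = residual_unit(x, W0, b0, W1, b1)
print(y.shape)              # (306,) — same size as the input
```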
5.4 Scoring layers
Sigmoid function applied to the final residual output X^{IR}: σ(x) = 1 / (1 + e^{−x})
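The scoring layer's sigmoid can be written directly; this is a generic sketch of the function itself, not code from the paper:

```python
import math

def sigmoid(z):
    """Scoring-layer squashing: map a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5
```

Larger scores map monotonically toward 1, so the output can be read as a click probability.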
logloss = −(1/N) Σ_{i=1}^{N} ( y_i · log(p_i) + (1 − y_i) · log(1 − p_i) )
5.5 Objective function
Objective function: the loss function, or its negative (when maximizing)
N: number of samples; y_i: label of sample i; p_i: model prediction for sample i
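The log-loss objective above can be checked with a tiny sketch; the two-sample input is a made-up example:

```python
import math

def log_loss(y, p):
    """Log loss: -(1/N) * sum(y_i*log(p_i) + (1-y_i)*log(1-p_i))."""
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n

# Confident, correct predictions give a small loss.
print(round(log_loss([1, 0], [0.9, 0.1]), 4))   # 0.1054
```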
5.6 Early Crossing vs. Late Crossing
Deep Crossing: crosses features early
DSSM (Deep Semantic Similarity Model): crosses features late
6. Implementation
Software: Computational Network Toolkit (CNTK)
Built on the same theoretical foundation as TensorFlow
Hardware: multi-GPU platform
Training time drops from 24 days (1 GPU) to 20 hours (32 GPUs)
7. Experimentation
7.1 Dataset
7.2 Performance on a Pair of Text Inputs
DSSM: late crossing
DP: early crossing
Deep Crossing (DC) outperforms DSSM
Production model: a model used in sponsored search, taken as the baseline
Performance: DSSM < DC < Production
DC's main advantage: it can handle many individual features
7.3 Beyond Text Input
Using all features together works best
The counting feature alone is weak
The counting feature is useful when combined with other features
Performance varies considerably with the number of features, and log loss fluctuates widely across feature combinations, so feature selection is meaningful.
7.4 Comparison with Production Models
The production model is trained on 2.2 billion samples
DC performs better with a much smaller dataset
DC is easier to build and maintain
8. Conclusions
Deep Crossing works well at combining features automatically at large scale
It requires less time and less hand-crafting experience
9. Experience
• Deep learning models (LSTM, CNN, etc.) can also extract features automatically; their efficiency could be compared with this model
• Raw data could be used for training instead of hand-picked individual features
• The approach could be applied in other domains, such as mobile sensing and recommender systems
Thanks!
Questions?