Reading Group, Xiang Zhang, 24/11/2016



Reading Group
Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features

Presenter: Xiang Zhang


Main Content

1. Background
2. Abstract
3. Why choose it
4. What's the main idea
5. How it works
6. Implementation
7. Experimentation
8. Conclusions
9. Experience


1. Background


Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features

Authors from Microsoft Research

Keywords: Neural Networks, DL, CNN, etc.

Conference: KDD 2016


1. Background


KDD (rank A*): Conference on Knowledge Discovery and Data Mining

SIGKDD: Special Interest Group on Knowledge Discovery in Data

KDD 2017: Aug. 13-17, Halifax, Canada; submission deadline (DDL): December 9, 2016


2. Abstract

Combining features is important and useful, but manual crafting is time-consuming, requires experience, and offers no accuracy guarantee, especially with features of large scale, variety, and volume

Contribution: a deep neural network that automatically combines features

Tools: Computational Network Toolkit (CNTK), multi-GPU platform


3. Why choose it

Automatically combines features (reduces dimensionality)

Web-scale (massive data)

Features of different types and dimensions

Better performance


4.what’s the main ideaWhat is feature extraction?

Individual features

An individual measurable property of a phenomenon being observed. (Representation of the data)

Combinatorial features

Defined in the joint space of individual features; they give the model shorter training time, a simpler structure, and better generalization. A hypothetical example is sketched below.
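As a minimal hypothetical sketch (the feature names and values are invented for illustration, not taken from the paper), a combinatorial feature crosses two individual features into their joint space:

```python
# Hypothetical illustration: crossing two individual features
# (a query term and an ad keyword) into one combinatorial feature.
def cross_feature(query_term: str, ad_keyword: str) -> str:
    """Combine two individual features into a combinatorial feature."""
    return f"{query_term}&{ad_keyword}"

# Individual features from a sponsored-search-style sample.
combo = cross_feature("flights", "cheap airfare")
print(combo)  # flights&cheap airfare
```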


Manually:
• Time
• Experience

Automatically

4.what’s the main idea


5. How it works

Model Architecture

[Figure: Deep Crossing model architecture]

$X_j^O = \max(0, W_j X_j^I + b_j)$

where $W_j$ is an $m_j \times n_j$ matrix, $X_j^I$ is the $n_j \times 1$ input vector, and $b_j$ and $X_j^O$ are $m_j \times 1$ vectors.

Dimensionality is reduced when $m_j < n_j$.

5. How it works
5.1 Embedding layers

$X_j^O = \max(0, W_j X_j^I + b_j)$

5.1 Embedding layers

Rectified linear unit (ReLU):
• Output elements are non-negative

Activation function: defines the output of a node given its input

• ReLU
• Logistic
• Tanh
• Sigmoid
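A minimal sketch of the embedding layer in NumPy (our assumption; the paper's implementation uses CNTK), following $X_j^O = \max(0, W_j X_j^I + b_j)$:

```python
import numpy as np

def embedding_layer(x_in, W, b):
    """Embedding for one feature: X_j^O = max(0, W_j X_j^I + b_j).

    W has shape (m_j, n_j); choosing m_j < n_j reduces the feature's
    dimensionality, and the ReLU keeps output elements non-negative.
    """
    return np.maximum(0.0, W @ x_in + b)

# Illustrative sizes (assumed): embed a 10,000-dim feature down to 256 dims.
rng = np.random.default_rng(0)
n_j, m_j = 10_000, 256
W = rng.normal(scale=0.01, size=(m_j, n_j))
b = np.zeros(m_j)
x = rng.random(n_j)
print(embedding_layer(x, W, b).shape)  # (256,)
```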


5. How it works

$X^O = [X_0^O, X_1^O, \ldots, X_K^O]$

5.2 Stacking layers


Stacking rules (threshold: 256):
• If $n_j > 256$: set $m_j = 256$, embed, then stack
• If $n_j \le 256$: stack directly, without embedding

Feature counts:
• Inputs: K
• Embedded then stacked: n
• Stacked without embedding: K - n
• Stacked in total: K
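A minimal NumPy sketch of these stacking rules (NumPy and the sizes are our assumptions, not the paper's tooling):

```python
import numpy as np

EMBED_THRESHOLD = 256  # n_j above this is embedded down to m_j = 256

def stack_features(features, weights, biases):
    """Build the stacked vector X^O = [X_0^O, X_1^O, ..., X_K^O].

    features: list of 1-D arrays X_j^I.
    weights/biases: dicts keyed by feature index j, holding W_j and b_j
    for the features that are wide enough to need embedding.
    """
    parts = []
    for j, x in enumerate(features):
        if x.shape[0] > EMBED_THRESHOLD:
            # n_j > 256: embed to 256 dims first (ReLU projection)
            x = np.maximum(0.0, weights[j] @ x + biases[j])
        parts.append(x)  # n_j <= 256: stacked without embedding
    return np.concatenate(parts)

# One wide feature (gets embedded) and one narrow feature (stacked raw).
rng = np.random.default_rng(0)
feats = [rng.random(10_000), rng.random(8)]
W = {0: rng.normal(scale=0.01, size=(256, 10_000))}
b = {0: np.zeros(256)}
print(stack_features(feats, W, b).shape)  # (264,)
```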

5. How it works

$X^{OR} = \mathcal{F}(X^{IR}, W_0, W_1, B_0, B_1) + X^{IR}$

5.3 Residual layers

• Inputs and outputs have the same size
• This is the first use of the Residual Unit beyond image recognition
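A minimal NumPy sketch of one residual unit following the formula above (the ReLU placement and the hidden size are our assumptions, in the common ResNet-style pattern):

```python
import numpy as np

def residual_unit(x_in, W0, W1, B0, B1):
    """X^OR = F(X^IR, W0, W1, B0, B1) + X^IR.

    Two linear layers form F; adding X^IR back means the input and
    output must have the same size.
    """
    hidden = np.maximum(0.0, W0 @ x_in + B0)  # first layer + ReLU
    f_out = W1 @ hidden + B1                  # second layer computes F
    return np.maximum(0.0, f_out + x_in)      # add input back, final ReLU

# Illustrative sizes (assumed): 264-dim input, 512-dim hidden layer.
rng = np.random.default_rng(1)
dim, hidden_dim = 264, 512
x = rng.random(dim)
W0 = rng.normal(scale=0.01, size=(hidden_dim, dim))
W1 = rng.normal(scale=0.01, size=(dim, hidden_dim))
out = residual_unit(x, W0, W1, np.zeros(hidden_dim), np.zeros(dim))
print(out.shape)  # (264,), same size as the input
```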


5. How it works


5.4 Scoring layers

Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$


5. How it works


5.5 Objective function

Objective function: the loss function (or its negative)

$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)$

$N$: number of samples; $y_i$: sample label; $p_i$: model output (prediction)
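A minimal NumPy sketch of the sigmoid scoring and the log loss above (the sample values are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    """Scoring-layer activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """logloss = -(1/N) * sum_i (y_i*log(p_i) + (1 - y_i)*log(1 - p_i))."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Four invented samples: labels y_i, and raw scores turned into p_i.
y = np.array([1.0, 0.0, 1.0, 0.0])
p = sigmoid(np.array([2.0, -1.5, 0.3, -0.2]))
print(round(float(log_loss(y, p)), 4))
```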

5.6 Early Crossing vs. Late Crossing

Deep Crossing vs. DSSM (Deep Semantic Similarity Model)
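A schematic NumPy sketch of the contrast (the layer shapes and the cosine combiner on the DSSM side are illustrative assumptions, not the exact models):

```python
import numpy as np

def dense(x, W, b):
    """One ReLU-activated fully connected layer."""
    return np.maximum(0.0, W @ x + b)

def early_crossing(q_emb, k_emb, W, b):
    """Deep Crossing style: stack the inputs first, so the deep
    layers see query and keyword features together from the start."""
    x = np.concatenate([q_emb, k_emb])  # features cross immediately
    return dense(x, W, b)               # joint deep layers would follow

def late_crossing(q_emb, k_emb, Wq, bq, Wk, bk):
    """DSSM style: each input runs through its own branch; the two
    sides only interact at the end (here, a cosine similarity)."""
    q = dense(q_emb, Wq, bq)
    k = dense(k_emb, Wk, bk)
    return q @ k / (np.linalg.norm(q) * np.linalg.norm(k) + 1e-12)

# Illustrative usage with invented 64-dim embeddings.
rng = np.random.default_rng(2)
q, k = rng.random(64), rng.random(64)
W = rng.normal(scale=0.1, size=(32, 128))
Wq = rng.normal(scale=0.1, size=(32, 64))
Wk = rng.normal(scale=0.1, size=(32, 64))
print(early_crossing(q, k, W, np.zeros(32)).shape)               # (32,)
print(late_crossing(q, k, Wq, np.zeros(32), Wk, np.zeros(32)))   # scalar
```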

5. How it works


6. Implementation


Software: Computational Network Toolkit (CNTK)

Shares the same theoretical foundation as TensorFlow

Hardware: multi-GPU platform

Training time: from 24 days (1 GPU) down to 20 hours (32 GPUs)


7. Experimentation

7.1 Dataset

7. Experimentation

7.2 Performance on a Pair of Text Inputs

DSSM: late crossing

DC (Deep Crossing): early crossing

Result: DC > DSSM


7. Experimentation

7.2 Performance on a Pair of Text Inputs

Production Model: a model used in sponsored search, taken as the baseline

Result: DSSM < DC < Production

DC's main advantage: handling many individual features


7. Experimentation

7.3 Beyond Text Input

Using all features works best


7. Experimentation

7.3 Beyond Text Input

Using only the counting feature is weak


7. Experimentation

7.3 Beyond Text Input

The counting feature is useful when combined with other features


7. Experimentation

7.3 Beyond Text Input

Performance varies considerably as the number of features changes, and log loss fluctuates widely across different feature combinations, so feature selection is meaningful.


7. Experimentation

7.4 Comparison with Production Models

2.2 billion samples

DC performs better with a much smaller dataset

DC is easier to build and maintain


8. Conclusions

Deep Crossing works well for automatic feature combination at large scale

It needs less time and less hand-crafted experience


9. Experience

• Deep learning models (LSTM, CNN, etc.) can extract features automatically; we could compare their efficiency against this model
• We could train on the raw data instead of the individual features
• It could be applied in other domains, such as mobile sensing and recommender systems


Thanks!


Questions?