Transcript of "Efficient top rank optimization with gradient boosting for supervised anomaly detection" (64 pages)

Page 1: Title slide

Efficient top rank optimization with gradient boosting for supervised anomaly detection

By

Jordan Fréry, Amaury Habrard, Marc Sebban, Olivier Caelen and Liyun He-Guelton

Page 2: Anomaly detection

Anomaly detection

Find the data that do not conform to the normal behavior.

Applications:
• Health care
• Intrusion detection in cyber security
• Fraud detection
  – Insurance
  – Credit card transactions


Page 7: Supervised learning

Supervised learning

• Goal: find a function f : X → Y

• We assume that there is an unknown joint distribution P over X × Y

• We have a training set of M examples {(x_i, y_i)}_{i=1}^{M} ∈ (X × Y)^M, drawn i.i.d. according to P

• Classification: discrete y

• Regression: continuous y


[Figure: classification maps x to a discrete label in {−1, +1}; regression maps x to a continuous value in (−∞, +∞)]
Page 11: Loss function and risk minimization

Loss function and risk minimization

Page 12: Loss function and risk minimization

Loss function and risk minimization

• Loss function L(f(x), y): measures the agreement between the prediction f(x) and the desired output y.

• Common loss functions are accuracy based.

• Anomaly detection can be cast as a binary classification problem with N ≫ P (far more negatives than positives).

• In anomaly detection, we would rather look at:
  – Precision
  – Recall
  – F_β score

[Figure: examples of different loss functions]
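The three criteria above follow directly from confusion counts. A minimal sketch (the function name and the 1 = anomaly label encoding are illustrative; it assumes at least one predicted positive and one actual positive):

```python
import numpy as np

def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    # Confusion counts for a binary problem (1 = anomaly, 0 = normal).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F_beta weights recall beta times as much as precision.
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta
```

With β = 1 this reduces to the usual F1 score, the harmonic mean of precision and recall.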


Page 16: Pitfall of the classification approach

Pitfall of the classification approach


Page 18: Pitfall of the classification approach

Pitfall of the classification approach

• Experts often need to assess the potential anomalies.

→ Use another approach than classification: learning to rank.

Page 19: Learning to rank

Learning to rank


Page 22: Learning to rank

Learning to rank

• How do we assess the quality of a ranked list?

Predicted most relevant links for a query (relevance rank predicted):
[Figure: ranked list of links] + millions more links

Predicted most probable fraudulent transactions over a day (fraudulent rank predicted):

Card ID   Date        Amount   Shop
120983    21/09/2017  23€      20193
328903    21/09/2017  3€       29103
328032    21/09/2017  14.2€    9023
390293    21/09/2017  11€      124
182393    21/09/2017  110€     202
432445    21/09/2017  43€      20193
645367    21/09/2017  4€       1089
887644    21/09/2017  34.3€    230
257546    21/09/2017  3.2€     399
655655    21/09/2017  40€      394
356578    21/09/2017  49.99€   12
884733    21/09/2017  73.99€   9000
…         …           …        …

+ millions more transactions

Page 23: Evaluation metrics for ranking

Evaluation metrics for ranking


Page 29: Evaluation metrics for ranking

Evaluation metrics for ranking

• Area under the ROC curve:

  AUC_ROC = (1 / (P·N)) Σ_{i=1}^{P} Σ_{j=1}^{N} I(f(x_i) > f(x_j))

• Average precision (area under the precision–recall curve, AUCPR):

  AP = (1/P) Σ_{i=1}^{P} precision@k_i, where k_i is the rank of the i-th positive

(Burges 2010)

AP is better suited to our problem. However, it is not differentiable.
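Both metrics can be sketched in a few lines of numpy (function names are illustrative; AP is computed here as the mean of precision@k over the ranks of the positives, matching the formula above):

```python
import numpy as np

def auc_roc(scores_pos, scores_neg):
    # Fraction of correctly ordered (positive, negative) pairs.
    return np.mean([fp > fn for fp in scores_pos for fn in scores_neg])

def average_precision(y_true, scores):
    # Sort labels by decreasing score, then average precision@k
    # over the positions where a positive appears.
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    precision_at_k = np.cumsum(y) / (np.arange(len(y)) + 1)
    return precision_at_k[y == 1].mean()
```

Note the contrast: AUC_ROC averages over all P·N pairs, while AP concentrates its mass on the top of the list, which is why it suits anomaly detection better.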

Page 30: Smooth approximation of AP

Smooth approximation of AP

Page 31: Smooth approximation of AP

Smooth approximation of AP

• Average precision (P positives among M examples):

  AP = (1/P) Σ_{i=1}^{P} [ Σ_{j=1}^{P} I(f(x_i) ≤ f(x_j)) ] / [ Σ_{h=1}^{M} I(f(x_i) ≤ f(x_h)) ]

• Sigmoid approximation of the indicator:

  I(f(x_i) < f(x_j)) ≈ 1 / (1 + e^{−α(f(x_j) − f(x_i))}) = σ(f(x_j) − f(x_i)),

  with α a smoothing parameter. This yields the surrogate loss (summing the numerator over the N negatives):

  1 − AP_sig = (1/P) Σ_{i=1}^{P} [ Σ_{j=1}^{N} σ(f(x_j) − f(x_i)) ] / [ Σ_{h=1}^{M} σ(f(x_h) − f(x_i)) ]

Complexity: O(P × N)
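A direct implementation of the smoothed surrogate makes the O(P × N) cost visible: each of the P positives is compared against every other example. A sketch (names and the toy scores are illustrative):

```python
import numpy as np

def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * z))

def one_minus_ap_sig(scores_pos, scores_neg, alpha=1.0):
    # Smoothed 1 - AP: for each positive x_i, the ratio of the smoothed
    # count of negatives ranked above it to the smoothed count of all
    # examples ranked above it. O(P * N) pairwise terms overall.
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    all_scores = np.concatenate([pos, neg])
    total = 0.0
    for fi in pos:                                   # loop over the P positives
        num = sigmoid(neg - fi, alpha).sum()         # negatives above x_i (smoothed)
        den = sigmoid(all_scores - fi, alpha).sum()  # everything above x_i (smoothed)
        total += num / den
    return total / len(pos)
```

When the positives score far above the negatives the loss approaches 0; when the order is reversed it approaches 1, as expected of a smoothed 1 − AP.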


Page 35: Top rank optimization with gradient boosting for …...Efficient top rank optimization with gradient boosting for supervised anomaly detection By Jordan Fréry, Amaury Habrard, Marc

Smooth approximation of AP

• Average precision (P positives among M examples):

  AP = (1/P) Σ_{i=1}^{P} [ Σ_{j=1}^{P} I(f(x_i) ≤ f(x_j)) ] / [ Σ_{h=1}^{M} I(f(x_i) ≤ f(x_h)) ]

• Exponential upper bound on the indicator:

  I(f(x_i) < f(x_j)) ≤ e^{f(x_j) − f(x_i)}

  Plugging it in, the e^{−f(x_i)} factors cancel between numerator and denominator, so the average over positives collapses to a single ratio:

  1 − AP_exp = [ Σ_{n=1}^{N} e^{f(x_n)} ] / [ Σ_{h=1}^{M} e^{f(x_h)} ]

Pros: O(P + N)
Cons: gradient explosion
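Because the exponential terms factorize, the surrogate collapses to two independent sums, which is where the O(P + N) complexity comes from. A sketch (names illustrative; note that large scores overflow np.exp, the numerical face of the gradient-explosion drawback):

```python
import numpy as np

def one_minus_ap_exp(scores_pos, scores_neg):
    # The exponential bound factorizes: only two independent sums are
    # needed, one over the N negatives and one over all M examples.
    e_neg = np.exp(np.asarray(scores_neg, dtype=float)).sum()
    e_all = e_neg + np.exp(np.asarray(scores_pos, dtype=float)).sum()
    return e_neg / e_all
```

The value is the share of exponential mass carried by the negatives: near 0 when positives dominate the top of the ranking, near 1 when they do not.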


Page 42: Stochastic gradient boosting

Stochastic gradient boosting

Page 43: Stochastic gradient boosting

Stochastic gradient boosting

Why?

• Optimizing in function space instead of parameter space
• Adaptive algorithm
• SGB prevents the gradients from exploding

Weak learners h are combined linearly:

  f_t(x) = f_{t−1}(x) + α_t h_t(x)


Page 45: Stochastic gradient boosting

Stochastic gradient boosting

At iteration t, we compute the residuals of f_{t−1} for all examples {x_i}_{i=1}^{M}:

  g_t(x_i) = ∂L(y_i, f_{t−1}(x_i)) / ∂f_{t−1}(x_i)

We find a weak learner h_t and its corresponding weight α_t such that:

  h_t = argmin_h Σ_{i=1}^{M} −g_t(x_i) h(x_i)

  α_t = argmin_α Σ_{i=1}^{M} L(y_i, f_{t−1}(x_i) + α h_t(x_i))
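The generic loop above can be sketched with regression stumps fitted to the negative gradient of the squared loss (an illustrative assumption, not the AP surrogate of the previous slides; the stump fitter, constants, and fixed step size stand in for the argmin over α):

```python
import numpy as np

def fit_stump(x, r):
    # Weak learner: one-split regression stump, chosen to minimize
    # squared error against the target residuals r.
    best_err, best = np.inf, None
    for thr in np.unique(x):
        left, right = r[x <= thr], r[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= thr, left.mean(), right.mean())
        err = ((r - pred) ** 2).sum()
        if err < best_err:
            best_err, best = err, (thr, left.mean(), right.mean())
    thr, lv, rv = best
    return lambda z: np.where(z <= thr, lv, rv)

def gradient_boost(x, y, n_rounds=20, lr=0.5):
    # f_0 is the best constant; each round fits h_t to the negative
    # gradient and applies f_t = f_{t-1} + lr * h_t.
    f = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        g = f - y                # gradient of (1/2)(f - y)^2 w.r.t. f
        h = fit_stump(x, -g)     # weak learner fitted to -g
        f = f + lr * h(x)
    return f                     # in-sample scores
```

Swapping the squared loss for 1 − AP_sig or 1 − AP_exp only changes how g is computed; the additive loop stays the same.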


Page 47: Experiments

Experiments

• Datasets [table not transcribed]

• Evaluation metrics:
  – AUCROC
  – Average precision
  – Pos@Top: the number of positives ranked before the first negative appears
  – P@k, with k equal to the number of positive examples

• Models for comparison:
  – GB-logistic (Friedman 2001)
  – RankBoost (Freund et al. 2003)
  – LambdaMART-AP (Burges 2010)
  – SGBAP (ours)
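Pos@Top and P@k are less standard than AUCROC and average precision; a sketch of how they can be computed from labels and scores (function names are illustrative):

```python
import numpy as np

def pos_at_top(y_true, scores):
    # Number of positives ranked before the first negative appears.
    order = np.argsort(-np.asarray(scores))
    ranked = np.asarray(y_true)[order]
    negs = np.where(ranked == 0)[0]
    return int(negs[0]) if len(negs) else int(ranked.sum())

def p_at_k(y_true, scores, k):
    # Precision among the k highest-scored examples; the slides use
    # k = number of positive examples.
    order = np.argsort(-np.asarray(scores))
    return np.asarray(y_true)[order][:k].mean()
```

Both metrics look only at the top of the ranking, which matches the fraud-investigation setting where experts can inspect just a handful of alerts per day.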


Page 56: Results

Results


Page 58: Conclusion

Conclusion

• We proposed a learning-to-rank approach for anomaly detection problems

• One of our approximations has linear complexity and can scale to large datasets

• On such imbalanced data, experiments show that our method performs better in the top rank


Page 61: Perspectives

Perspectives

• Automatic decision threshold based on expert criteria

• Adaptation to online learning

• Opening the proprietary dataset through an international data science competition


Page 64: Thank you

Thank you for your attention