Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Alex Pinto, Chief Data Scientist | MLSec Project | @alexcpsec | @MLSecProject

description

We could all have predicted this with our magical Big Data analytics platforms, but it seems that Machine Learning is the new hotness in Information Security. A great number of startups with ‘cy’ and ‘threat’ in their names claim that their product will defend or detect more effectively than their neighbour's product "because math". And it should be easy to fool people without a PhD or two into believing that math just works. Indeed, math is powerful, and large-scale machine learning is an important cornerstone of many of the systems that we use today. However, not all algorithms and techniques are born equal. Machine Learning is a most powerful toolbox, but not every tool can be applied to every problem, and that’s where the pitfalls lie. This presentation will describe the different techniques available for data analysis and machine learning for information security, and discuss their strengths and caveats. The Ghost of Marketing Past will also show how similar the unfulfilled promises of deterministic and exploratory analysis were, and how to avoid making the same mistakes again. Finally, the presentation will describe the techniques and feature sets that the presenter developed over the past year as part of his ongoing research project on the subject. In particular, it will present some interesting results obtained since the last presentation at DEF CON 21, and some ideas that could improve the application of machine learning in information security, especially as a helper for security analysts in incident detection and response.

Transcript of Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Page 1: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Alex Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject

Page 2: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

whoami

Alex Pinto
• Chief Data Scientist at MLSec Project
• Machine Learning Researcher and Trainer
• Network security and incident response aficionado
• Tortured by SIEMs as a child
• Hacker Spirit Animal™: CAFFEINATED CAPYBARA!

(https://secure.flickr.com/photos/kobashi_san/)

Page 3: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Agenda

• Security Singularity
• Some History
• TLA
• ML Marketing Patterns
• Anomaly Detection
• Classification
• Buyer’s Guide
• MLSec Project

Page 4: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Security Singularity Approaches

Page 5: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

(Side Note)

First hit on Google Images for “Network Security Solved” is a picture of Jack Daniel!

Page 6: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Security Singularity Approaches

• “Machine learning / math / algorithms… these terms are used interchangeably quite frequently.”
• “Is behavioral baselining and anomaly detection part of this?”
• “What about Big Data Security Analytics?”

(http://bigdatapix.tumblr.com/)

Page 7: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Are we even trying?

• “Hyper-dimensional security analytics”
• “3rd generation Artificial Intelligence”
• “Secure because Math”
• Lack of ability to differentiate hurts buyers, investors.
• Are we even funding the right things?

Page 8: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Is this a communication issue?

Page 9: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Guess the Year!

• “(…) behavior analysis system that enhances your network intelligence and security by auditing network flow data from existing infrastructure devices”
• “Mathematical models (…) that determine baseline behavior across users and machines, detecting (...) anomalous and risky activities (...)”
• “(…) maintains historical profiles of usage per user and raises an alarm when observed activity departs from established patterns of usage for an individual.”

Page 10: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

A little history

• Dorothy E. Denning (professor at the Department of Defense Analysis at the Naval Postgraduate School)
  • 1986 (SRI) - First research that led to IDS
  • Intrusion Detection Expert System (IDES)
  • Already had statistical anomaly detection built-in
• 1993: Her colleagues release the Next Generation (!) IDES

Page 11: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Three Letter Acronyms - KDD

• After the release of Bro (1998) and Snort (1999), DARPA thought we were covered for this signature thing
• DARPA released datasets for user anomaly detection in 1998 and 1999
• And then came the KDD-99 dataset, with over 6200 citations on Google Scholar

Page 12: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)
Page 13: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Three Letter Acronyms

Page 14: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Three Letter Acronyms - KDD

Page 15: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Trolling, maybe?

Page 16: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Not here to bash academia

Page 17: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

A Probable Outcome

GRAD SCHOOL

FRESHMAN

ZOMG RESULTS !!

11!1!

ZOMG! RESULTS???

MATH, STAHP!

MATH IS HARD, LET’S GO SHOPPING

Page 18: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

ML Marketing Patterns

• The “Has-beens”
  • Name is a bit harsh, but hey, you hardly use ML anymore, let us try it
• The “Machine Learning ¯\_(ツ)_/¯”
  • Hey, that sounds cool, let’s put that in our brochure
• The “Sweet Spot”
  • People that actually are trying to do something
  • Anomaly Detection vs. Classification

Page 19: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Anomaly Detection

Page 20: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Anomaly Detection

• Works wonders for well-defined “industrial-like” processes.
• Looking at single, consistently measured variables
• Historical usage in financial fraud prevention.
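To make the "single, consistently measured variable" case concrete, here is a minimal sketch of a z-score check. The function name `zscore_alerts`, the threshold of 3 standard deviations, and the sample numbers are all illustrative assumptions; the slides do not prescribe a specific method.

```python
import statistics


def zscore_alerts(history, new_values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]


# Hypothetical daily counts for one stable, well-understood process.
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
alerts = zscore_alerts(history, [101, 250, 96])
print(alerts)
```

This only works because the variable is one-dimensional and its "normal" range is stable, which is exactly the property most security telemetry lacks.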

Page 21: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Anomaly Detection

Page 22: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Anomaly Detection

• What fits this mold?
  • Network/NetFlow behavior analysis
  • User behavior analysis
• What are the challenges?
  • Curse of Dimensionality
  • Lack of ground truth and normality poisoning
  • Hanlon’s Razor

Page 23: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Curse of Dimensionality

• We need “distances” to measure the features/variables
• Usually Manhattan or Euclidean
• For high-dimensional data, the distribution of distances between all pairwise points in the space becomes concentrated around an average distance.

Page 24: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Curse of Dimensionality

• The volume of the high-dimensional sphere becomes negligible in relation to the volume of the high-dimensional cube.
• The practical result is that everything just seems too far away, and at similar distances.

(http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A175670)
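The concentration of distances can be observed with a small experiment. The sketch below draws uniform random points in a unit cube and compares the spread of pairwise Euclidean distances relative to their mean; the function name `distance_spread`, the point count, and the choice of 2 versus 500 dimensions are assumptions made for illustration only.

```python
import itertools
import math
import random
import statistics


def distance_spread(dimensions, points=60, seed=0):
    """Return std/mean of pairwise Euclidean distances for uniform random points."""
    rng = random.Random(seed)
    cloud = [[rng.random() for _ in range(dimensions)] for _ in range(points)]
    distances = [math.dist(a, b) for a, b in itertools.combinations(cloud, 2)]
    return statistics.pstdev(distances) / statistics.mean(distances)


low = distance_spread(2)
high = distance_spread(500)
print(f"relative spread in 2 dimensions:   {low:.3f}")
print(f"relative spread in 500 dimensions: {high:.3f}")
```

The relative spread shrinks sharply as the dimension grows, so "near" and "far" neighbours become hard to tell apart, which undermines any detector that relies on distance to decide what is anomalous.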

Page 25: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

A Practical example

• NetFlow data, company with n internal nodes.
  • 2(n^2 - n) communication directions
  • 2*2*2*65535(n^2 - n) measures of network activity
  • 1000 nodes -> Half a trillion possible dimensions
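The arithmetic behind the "half a trillion" figure can be checked directly from the two formulas on the slide. The function name `possible_dimensions` is hypothetical, and the code does not assign any meaning to the individual factors beyond what the slide states.

```python
def possible_dimensions(n):
    """Evaluate the slide's formulas for a network with n internal nodes."""
    directions = 2 * (n**2 - n)
    measures = 2 * 2 * 2 * 65535 * (n**2 - n)
    return directions, measures


directions, measures = possible_dimensions(1000)
print(f"{directions:,} communication directions")  # 1,998,000
print(f"{measures:,} measures of network activity")  # 523,755,720,000
```

For n = 1000 the second formula gives roughly 5.2 x 10^11, which matches the slide's "half a trillion".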

Page 26: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Breaking the Curse

• Different / creative distance metrics
• Organizing the space into sub-manifolds where Euclidean distances make more sense.
• Aggressive feature removal
• A few interesting results available

Page 27: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Breaking the Curse

Page 28: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Normality-poisoning attacks

• Ground Truth (labels) >> Features >> Algorithms
• There is no (or next to no) Ground Truth in AD
  • What is “normal” in your environment?
  • Problem asymmetry
  • Solutions are biased to the prevalent class
• Very hard to fine-tune, becomes prone to a lot of false negatives or false positives
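The effect of problem asymmetry on false positives can be illustrated with a base-rate calculation. The function name `alert_precision` and all three rates below are hypothetical numbers chosen for the example, not measurements from the talk.

```python
def alert_precision(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of raised alerts that are real, given class prevalence and detector rates."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Assumed scenario: 1 in 1000 events is malicious, detector catches 99% of them
# and wrongly flags 1% of benign events.
precision = alert_precision(prevalence=0.001,
                            true_positive_rate=0.99,
                            false_positive_rate=0.01)
print(f"share of alerts that are real: {precision:.1%}")
```

Under these assumed rates only about 9% of alerts are real, even though the detector looks accurate on paper. Tightening the threshold to cut false positives pushes the misses up instead, which is the fine-tuning problem the slide describes.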

Page 29: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Normality-poisoning attacks

Page 30: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Hanlon’s Razor

Never attribute to malice that which is adequately explained by stupidity.

Page 31: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

AD: Hanlon’s Razor

Evil Hacker vs. Hipster Developer (a.k.a. Matt Johansen)

Page 32: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

What about User Behavior?

• Surprise, it kinda works! (as supervised, that is)
  • As specific implementations for specific solutions
  • Good stuff from Square, AirBnB
  • Well-defined scope and labeling.
• Can it be general enough?
  • File exfiltration example (roles/info classification are mandatory?)
  • Can I “average out” user behaviors in different applications?

Page 33: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Classification!

VS

Page 34: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Lots of Malware Activity

• Lots of available academic research around this
  • Classification and clustering of malware samples
• More success in classifying artifacts you already know to be malware than in actually detecting it. (Lineage)
• State of the art? My guess is AV companies!
  • All of them have an absurd amount of samples
  • Have been researching and consolidating data on them for decades.

Page 35: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Lots of Malware Activity

• Can we do better than “AV Heuristics”?
  • Lots and lots of available data that has been made public
  • Some of the papers also suffer from potentially bad ground truth.

VS

Page 36: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Lots of Malware Activity

VS

Page 37: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Everyone makes mistakes!

Page 38: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

How is it going then, Alex?

• Private Beta of our Threat Intelligence-based models:
  • Some use TI indicator feeds as blocklists
  • More mature companies use the feeds to learn about the threats (Trained professionals only)
• Our models extrapolate the knowledge of existing threat intelligence feeds as experienced analysts would.
  • Supervised model with the same data the analyst has
  • Seeded labeling from TI feeds

Page 39: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Yeah, but why should I care?

• Very effective first triage for SOCs and Incident Responders
  • Send us: log data from firewalls, DNS, web proxies
  • Receive: Report with a short list of potentially compromised machines
• Would you rather download all the feeds and integrate them yourself?
  • MLSecProject/Combine
  • MLSecProject/TIQ-test

Page 40: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

What about the Ground Truth (labels)?

• Huge amounts of TI feeds available now (open/commercial)
• Non-malicious samples still challenging, but we have expanded to a lot of collection techniques from different sources.
  • Very high-ranked Alexa / Quantcast / OpenDNS Random domains as seeds for search of trust
  • Helped by the customer logs as well in a semi-supervised fashion
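One way to picture seeded labeling is as a lookup that marks observed indicators using a threat feed and a list of trusted seeds, leaving everything else unlabeled. This is only a sketch of the general idea: the function `seed_labels`, the label strings, and the sample domains are hypothetical and do not describe how the MLSec Project models are actually built.

```python
def seed_labels(observed, threat_feed, trusted_seeds):
    """Assign seed labels to observed indicators; unknown ones stay unlabeled (None)."""
    labels = {}
    for indicator in observed:
        if indicator in threat_feed:
            # Assumption: a threat feed hit wins over a trusted-list hit.
            labels[indicator] = "malicious"
        elif indicator in trusted_seeds:
            labels[indicator] = "benign"
        else:
            labels[indicator] = None
    return labels


# Hypothetical data: reserved .example domains stand in for real indicators.
threat_feed = {"bad.example", "worse.example"}
trusted_seeds = {"popular.example"}
observed = ["bad.example", "popular.example", "unknown.example"]

labels = seed_labels(observed, threat_feed, trusted_seeds)
print(labels)
```

The unlabeled remainder is where a semi-supervised approach would come in, since only the seeded entries carry ground truth.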

Page 41: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

But what about data tampering?

• Vast majority of features are derived from structural/intrinsic data:
  • GeoIP, ASN information, BGP Prefixes
  • pDNS information for the IP addresses, hostnames
  • WHOIS information
• Attacker can’t change those things without cost.
  • Log data from the customer can be tampered with, of course. But this does not make the model worse off than a human specialist.

Page 42: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

And what about false positives?

• False positives / false negatives are an intrinsic part of ML.
• “False positives are very good, and would have fooled our human analysts at first.”
• Their feedback helps us improve the models for everyone.
• Remember it is about initial triage. A Tier-2/Tier-3 analyst must investigate and provide feedback to the model.

Page 43: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Buyer’s Guide

• 1) What are you trying to achieve by adding Machine Learning to the solution?
• 2) What are the sources of Ground Truth for your models?
• 3) How can you protect the features / ground truth from adversaries?
• 4) How do the solution and the processes around it handle false positives?

Page 44: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Buyer’s Guide

#NotAllAlgorithms

Page 45: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

MLSec Project

• Don’t take my word for it! Try it out!!
  • Help us test and improve the models!
  • Looking for participants and data sharing agreements
• Limited capacity at the moment, so be patient. :)
• Visit https://www.mlsecproject.org , message @MLSecProject or just e-mail me.

Page 46: Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)

Thanks!

• Q&A?
• Don’t forget the feedback!

Alex Pinto
@alexcpsec
@MLSecProject

“We are drowning in information and starved for knowledge” - John Naisbitt