A Sober Look at Machine Learning
![Page 1: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/1.jpg)
2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
A SOBER LOOK AT MACHINE LEARNING
DR. SVEN KRASSER, CHIEF SCIENTIST, @SVENKRASSER
![Page 2: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/2.jpg)
Distinguishing Science…
Source: CERN, http://home.cern/sites/home.web.cern.ch/files/image/experiment/2013/01/cms_0.jpeg
![Page 3: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/3.jpg)
…from Fiction
Source: “Chain Reaction,” 20th Century Fox
![Page 4: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/4.jpg)
MACHINE LEARNING 101
![Page 5: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/5.jpg)
EXAMPLES OF MACHINE LEARNING
SPAM FILTERING
MOVIE RECOMMENDATIONS
SIRI (iPHONE)
![Page 6: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/6.jpg)
TODAY’S FOCUS: SUPERVISED LEARNING
![Page 7: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/7.jpg)
TODAY’S FOCUS: GEOMETRIC MODELS
![Page 8: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/8.jpg)
EVERYTHING YOU WILL SEE TODAY IS REAL WORLD DATA
![Page 9: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/9.jpg)
Some Data to Get Started: 1988 ANTHROPOMETRIC SURVEY OF ARMY PERSONNEL
Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro
![Page 10: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/10.jpg)
Data
• Over 4000 soldiers surveyed
• Over 100 measurements
• Reported by gender
Selection Bias
• Test subjects are in better shape than the rest of us...
![Page 11: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/11.jpg)
FIRST LOOK
Plot axes: Height [mm] (x), Density (y)
• Difference in distribution
• Significant overlap
![Page 12: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/12.jpg)
SECOND DIMENSION
Plot axes: Height [mm] (x), Weight [10⁻¹ kg] (y)
• Correlation
• Overlap
![Page 13: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/13.jpg)
FEATURE SELECTION
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
• Correlation
• Gender-specific slope
• Reduced overlap
• Selection of features matters
• How to make a prediction?
![Page 14: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/14.jpg)
K-NEAREST NEIGHBOR
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
Plot labels: m, f
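A k-nearest-neighbor classifier predicts the class of a new point by majority vote among the k closest training points. The sketch below is a minimal illustration, not the code behind the slide: the data points are made up (they are not survey values), `knn_predict` is a hypothetical helper name, and k=3 is an arbitrary choice.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs.
    """
    # Sort training points by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Majority vote over the labels of the k closest points.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up (circumference [mm], weight [0.1 kg]) points, not survey data.
train = [
    ((950, 700), "f"), ((980, 720), "f"), ((1000, 760), "f"),
    ((960, 800), "m"), ((990, 850), "m"), ((1010, 900), "m"),
]
prediction = knn_predict(train, (985, 840), k=3)
print(prediction)
```

In practice the features would usually be scaled first so that no single axis dominates the distance.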
![Page 15: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/15.jpg)
SUPPORT VECTOR MACHINE
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
![Page 16: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/16.jpg)
SUPPORT VECTOR MACHINE
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
• Overfitting
• Classifier does not generalize
• Let’s take a closer look…
![Page 17: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/17.jpg)
CROSSVALIDATION
TRAIN TRAIN TRAIN TEST
TRAIN TRAIN TEST TRAIN
TRAIN TEST TRAIN TRAIN
TEST TRAIN TRAIN TRAIN
• Divide data into k folds
• Train on k-1 folds, test on the remaining one
• Repeat k times for all folds
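The three steps above can be sketched as an index generator. This is a minimal illustration under simple assumptions: samples are split in their original order (real use would normally shuffle first), and `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n_samples=8, k=4))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every sample lands in the test fold exactly once, so the averaged test score uses all of the data without ever testing on a point the model was trained on.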
![Page 18: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/18.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
• Classifier generalizes
• Note some misclassifications
• Let’s assume we want to detect males (blue)
  § I.e. “blue” is our positive class
![Page 19: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/19.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
![Page 20: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/20.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
![Page 21: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/21.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
![Page 22: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/22.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
![Page 23: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/23.jpg)
LET’S CLASSIFY
Plot axes: “Buttock Circumference” [mm] (x), Weight [10⁻¹ kg] (y)
• Get more “blue” right (true positives)
• Get more “red” wrong (false positives)
![Page 24: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/24.jpg)
RECEIVER OPERATING CHARACTERISTICS CURVE
Plot axes: False Positive Rate (x), True Positive Rate (y)
Detect more by accepting more false positives
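An ROC curve can be traced by sweeping the decision threshold over the classifier's scores and recording the false and true positive rates at each step. The sketch below assumes made-up scores and labels, and `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    `labels` holds 1 for the positive class and 0 for the negative class;
    a sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up decision values and labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): each step can only add detections, and each step can only add false positives.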
![Page 25: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/25.jpg)
THREE DIMENSIONS
![Page 26: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/26.jpg)
MORE DIMENSIONS
Plot axes: Decision Value (x), Density (y)
• Linear model in ~160 dimensions
• Linearly separable
![Page 27: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/27.jpg)
Source: http://playground.tensorflow.org/
![Page 28: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/28.jpg)
TREES AND TREE ENSEMBLES
![Page 29: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/29.jpg)
SPARSE FEATURES
area codes:
400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
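Each row above has a single 1 in the column of its area code, which is a one-hot encoding. The sketch below rebuilds those rows and shows an equivalent sparse form that stores only the non-zero column. The helper names are hypothetical, and the list of observed area codes is read off the rows on the slide.

```python
AREA_CODES = list(range(400, 415))  # the 15 columns shown on the slide

def one_hot(area_code):
    """Dense encoding: one column per area code, a single 1 in the matching column."""
    return [1 if code == area_code else 0 for code in AREA_CODES]

def sparse(area_code):
    """Sparse encoding: store only the index of the non-zero column."""
    return {AREA_CODES.index(area_code): 1}

observed = [403, 404, 401, 411, 407, 404]  # read off the rows above
dense_rows = [one_hot(code) for code in observed]
sparse_rows = [sparse(code) for code in observed]

for dense, compact in zip(dense_rows, sparse_rows):
    print(dense, compact)
```

With only 15 columns the saving is small, but a categorical feature with thousands of values adds thousands of dimensions that are almost entirely zero.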
![Page 30: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/30.jpg)
N-GRAMS
43 72 6F 77 64 53 74 72 69 6B 65
43726F 776453 747269
726F77 645374 72696B
6F7764 537472 696B65
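The hex bytes on this slide spell the ASCII string "CrowdStrike", and the nine values below them are its overlapping 3-byte windows (3-grams). A minimal sketch of that extraction, with `byte_ngrams` as a hypothetical helper name:

```python
def byte_ngrams(data, n=3):
    """Return all overlapping n-byte windows of `data` as uppercase hex strings."""
    return [data[i:i + n].hex().upper() for i in range(len(data) - n + 1)]

grams = byte_ngrams(b"CrowdStrike", n=3)
print(grams)
```

Each distinct n-gram can then become its own feature dimension, which is one way byte content turns into a very high-dimensional, sparse feature space.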
![Page 31: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/31.jpg)
MISSION ACCOMPLISHED: WE JUST ADD MORE DIMENSIONS… RIGHT?
![Page 32: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/32.jpg)
CURSE OF DIMENSIONALITY
REDUCED predictive performance
INCREASED training time
SLOWER classification
LARGER memory footprint
![Page 33: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/33.jpg)
Source: https://commons.wikimedia.org/w/index.php?curid=2257082
![Page 34: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/34.jpg)
Source: https://commons.wikimedia.org/w/index.php?curid=2257082
![Page 35: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/35.jpg)
![Page 36: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/36.jpg)
DIMENSIONALITY AND SPARSENESS
Plot axes: Height (mm) (x), Weight [10⁻¹ kg] (y)
![Page 37: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/37.jpg)
DIMENSIONALITY AND SPARSENESS
Plot axes: Height (mm) (x), Weight [10⁻¹ kg] (y)
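One way to see why added dimensions make data sparse is to cut every axis into a fixed number of bins and count how many samples fall into each grid cell on average. The sketch below is an illustration only: the 4000 samples echo the "over 4000 soldiers" figure from earlier in the deck, while the 10 bins per axis and the function name are assumptions.

```python
def samples_per_cell(n_samples, bins_per_dimension, dimensions):
    """Average number of samples per grid cell when each axis is cut into bins."""
    cells = bins_per_dimension ** dimensions
    return n_samples / cells

# 4000 samples matches the survey size; 10 bins per axis is an assumption.
for d in (1, 2, 3, 6):
    print(d, samples_per_cell(4000, 10, d))
```

The average drops from 400 samples per cell in one dimension to 0.004 in six, so most cells are empty long before the dimension count reaches the hundreds.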
![Page 38: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/38.jpg)
MANAGING DIMENSIONALITY
• FEATURE ELIMINATION
– Feature ranking
– Stop words
• FEATURE REDUCTION
– Principal Component Analysis
– Autoencoders
– Points on lower-dimensional manifold
– Stemming
• ENSEMBLE METHODS
– Classifier of classifiers, e.g. stacking
– Bagging and subspace sampling, e.g. Random Forests
• And much, much more…
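As one illustration of feature elimination by ranking, the sketch below scores each feature by how far apart the two class means are relative to the feature's overall spread, then orders the features by that score. The scoring rule, the data, and the name `rank_features` are assumptions made for this example; the talk does not say which ranking method is used.

```python
from statistics import mean, pstdev

def rank_features(rows, labels):
    """Rank feature indices by a simple class-separation score, best first.

    Score = |difference of class means| / overall standard deviation.
    This scoring rule is an illustrative choice, not the one from the talk.
    """
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        pos = [v for v, y in zip(column, labels) if y == 1]
        neg = [v for v, y in zip(column, labels) if y == 0]
        spread = pstdev(column)
        # A constant feature carries no information; give it a score of zero.
        score = abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0
        scores.append((score, j))
    return [j for score, j in sorted(scores, reverse=True)]

# Made-up data: feature 0 separates the classes, feature 1 is mostly noise,
# feature 2 is constant.
rows = [[1.0, 5.0, 3.0], [1.2, 2.0, 3.0], [3.0, 4.0, 3.0], [3.1, 2.0, 3.0]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels)
print(ranking)
```

Keeping only the top-ranked features, for example `ranking[:2]`, removes dimensions that contribute little to separating the classes.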
![Page 39: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/39.jpg)
SECURITY APPLICATIONS
![Page 40: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/40.jpg)
FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway
– Can support file similarity analysis
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
  – Packed files
  – .NET
  – Installers
  – EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
![Page 41: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/41.jpg)
EXAMPLE FEATURES
32/64 BIT EXECUTABLE
GUI SUBSYSTEM
COMMAND LINE SUBSYSTEM
FILE SIZE
TIMESTAMP
DEBUG INFORMATION PRESENT
PACKER TYPE
FILE ENTROPY
NUMBER OF SECTIONS
NUMBER WRITABLE
NUMBER READABLE
NUMBER EXECUTABLE
DISTRIBUTION OF SECTION ENTROPY
IMPORTED DLL NAMES
IMPORTED FUNCTION NAMES
COMPILER ARTIFACTS
LINKER ARTIFACTS
RESOURCE DATA
EMBEDDED PROTOCOL STRINGS
EMBEDDED IPS/DOMAINS
EMBEDDED PATHS
EMBEDDED PRODUCT META DATA
DIGITAL SIGNATURE
ICON CONTENT
…
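File entropy is one of the listed features. The slide does not say how it is computed; a common definition is the Shannon entropy of the byte-value distribution, which the sketch below assumes. The function name `byte_entropy` is hypothetical.

```python
from collections import Counter
from math import log2

def byte_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * log2(total / c) for c in counts.values())

low = byte_entropy(b"\x00" * 1024)          # a single repeated byte
high = byte_entropy(bytes(range(256)) * 4)  # every byte value equally often
print(low, high)
```

Under this definition, compressed or encrypted content tends toward the upper end of the range, which is why entropy is often looked at alongside packer-related features.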
![Page 42: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/42.jpg)
COMBINING FEATURES
• Projection to show clusters
• For illustration only; this is not the space in which we classify
![Page 43: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/43.jpg)
EXECUTION ANALYSIS
AKA Dynamic Analysis
• THE GOOD
– Captures actual behavior of file
– Obfuscating behavior is hard
– Effective against exploitation
– Effective against malware-less attacks
– Not dependent on awareness of specific file types
• THE BAD
– File needs to be executed
– Takes additional time to observe execution
– Execution depends on environment (e.g. sandbox vs real world)
![Page 44: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/44.jpg)
EXAMPLE: GLOBAL BEHAVIOR
§ Behavior across many executions of a file
§ Conducted on event data centrally located in the cloud
Krasser, S., Meyer, B., & Crenshaw, P. (2015). Valkyrie: Behavioral Malware Detection using Global Kernel-level Telemetry Data. In Proceedings of the 2015 IEEE International Workshop on Machine Learning for Signal Processing.
![Page 45: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/45.jpg)
ML VS OTHER TECHNIQUES
§ ML output is probabilistic
§ Use other techniques where appropriate
§ Most ML-based engines use standard hashes or fuzzy hashes on top of a model
§ Example: credentials theft IoA
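The point about hashes on top of a model can be sketched as a lookup that runs before the probabilistic score is consulted. Everything in this example is hypothetical: the allowlist and blocklist contents, the placeholder `model_score`, and the 0.8 threshold are assumptions for illustration, not a description of any product.

```python
import hashlib

# Hypothetical exact-match lists keyed by SHA-256 digest.
KNOWN_BAD = {hashlib.sha256(b"known malicious sample").hexdigest()}
KNOWN_GOOD = {hashlib.sha256(b"known clean sample").hexdigest()}

def model_score(data):
    """Placeholder for a trained model; returns a made-up probability of maliciousness."""
    return 0.5

def verdict(data, threshold=0.8):
    """Combine exact hash lookups with a probabilistic model score."""
    digest = hashlib.sha256(data).hexdigest()
    # Exact hash matches give a definite answer and take precedence over the model.
    if digest in KNOWN_BAD:
        return "malicious"
    if digest in KNOWN_GOOD:
        return "clean"
    # Otherwise fall back to the probabilistic model output.
    return "malicious" if model_score(data) >= threshold else "clean"

print(verdict(b"known malicious sample"), verdict(b"never seen before"))
```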
![Page 46: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/46.jpg)
EVALUATING ML SOLUTIONS
![Page 47: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/47.jpg)
PRELIMINARIES
§ ML is not a feature, it is an implementation detail
§ Every solution must make trade-offs between conflicting objectives
  § FP vs TP
  § Speed vs accuracy
  § Memory footprint vs accuracy
  § Expressiveness vs explainability
§ Benchmarks under different assumptions are very hard to compare, even internally
§ Marchitecture
§ Looking at the right data: 60% of intrusions do not involve malware
![Page 48: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/48.jpg)
How much data is there to train on?
SCOPE: SCALE
§ Volume of data generated by sources used
§ Aperture: footprint of deployment
§ Data collection
§ Point of analysis (endpoint, on-prem, cloud)
![Page 49: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/49.jpg)
How many data sources are used?
SCOPE: BREADTH
§ Varied sources and techniques
  § Static analysis
  § Behavioral analysis
  § Proliferation
  § Indicators from other techniques
§ Access to historical data
  § Baseline
  § Process lineage
§ “Number of characteristics” is not a useful metric
![Page 50: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/50.jpg)
DETECTION RATE
§ Detection rate w/o false positive rate is meaningless
§ Considering the base rate is important
  § System
    § 100k clean files, 1 malware file
    § 99% TPR at 0.1% FPR → 100 FPs, 1 TP
  § Downloads
    § 1k clean files, 1 malware file
    § 99% TPR at 0.1% FPR → 1 FP, 1 TP
§ Sourcing of test files skews results
§ Number of samples used to measure (often too small)
Plot axes: False Positive Rate (x), True Positive Rate (y)
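The base-rate arithmetic from this slide can be reproduced directly. The sketch below uses the slide's own numbers (99% TPR, 0.1% FPR, one malware file among 100k or 1k clean files); the function name and the added precision figure are illustrative additions.

```python
def expected_alerts(clean_files, malware_files, tpr, fpr):
    """Expected false positives, true positives, and precision for a file population."""
    false_positives = clean_files * fpr
    true_positives = malware_files * tpr
    precision = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, precision

system = expected_alerts(clean_files=100_000, malware_files=1, tpr=0.99, fpr=0.001)
downloads = expected_alerts(clean_files=1_000, malware_files=1, tpr=0.99, fpr=0.001)

for name, (fp, tp, precision) in (("system", system), ("downloads", downloads)):
    print(f"{name}: FPs={fp:.0f} TPs={tp:.2f} precision={precision:.3f}")
```

With the same classifier, roughly 1 alert in 100 is a real detection in the system-wide scan, against roughly 1 in 2 for downloads. The difference comes entirely from how rare malware is in each population.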
![Page 51: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/51.jpg)
APTS & 99% OF MALWARE DETECTED…
![Page 52: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/52.jpg)
APTS (CONT.)
§ Combine techniques to offset tradeoffs
  § Static and behavioral
  § ML and non-ML
  § Lean local techniques and heavy-weight cloud techniques
§ Avoid silent failure: what happens when the adversary made it onto the system?
§ Avoid brittle techniques: does the solution depend on the attacker not having access to detection results?
![Page 53: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/53.jpg)
KEY POINTS
• Machine Learning is an important part of the security tool chest
• Hidden untapped structure in your data
• Various trade-offs, most importantly between true and false positives
• Dimensionality is good…until it’s not
• Not all dimensions are created equal
• Comprehensive coverage by combining techniques
![Page 54: A Sober Look at Machine Learning](https://reader035.fdocuments.in/reader035/viewer/2022062223/58f1b3bf1a28abd54c8b45f5/html5/thumbnails/54.jpg)